leaving the clean world of "Just Numbers" (NumPy) and entering the messy world of "Mixed Data" (Pandas).

The Problem: A NumPy array holds a single data type, and for math that means all numbers. But real movie data looks like this:

Title: "The Matrix" (String)

Rating: 8.7 (Float)

Views: 10,000 (Int)

Is_Blockbuster: True (Boolean)

You cannot store "The Matrix" inside a NumPy array of numbers. Assigning it raises an error. The Solution: The Pandas DataFrame.

Think of a DataFrame as "Programmable Excel." It has rows and columns, but you can filter, transform, and run math on it with code.
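To make the contrast concrete, here is a minimal sketch of how NumPy and Pandas each handle mixed data. The variable names (`mixed`, `ratings`, `row`) are made up for this illustration.

```python
import numpy as np
import pandas as pd

# A NumPy array has ONE dtype. Mixing text and numbers at creation time does
# not raise an error: NumPy converts every value to a string instead.
mixed = np.array(["The Matrix", 8.7, 10000])
print(mixed.dtype)  # a Unicode string dtype, so no math is possible
print(mixed)

# Writing text into an existing numeric array does raise an error.
ratings = np.array([8.7, 9.5])
try:
    ratings[0] = "The Matrix"
except ValueError as err:
    print(f"ValueError: {err}")

# A DataFrame keeps a separate dtype for each column.
row = pd.DataFrame({
    "Title": ["The Matrix"],
    "Rating": [8.7],
    "Views": [10000],
    "Is_Blockbuster": [True],
})
print(row.dtypes)
```

The numbers stay numbers and the text stays text, which is what lets you do math on one column while keeping labels in another.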

🐼 Pandas Drill 1: The "CSV Simulator"
We aren't going to download a file yet. We are going to create a raw dataset from scratch so you understand the structure.

Your Task:

Import pandas (import pandas as pd).

Create a Dictionary where keys are column names ("Movie", "Rating") and values are lists of data.

Convert that dictionary into a pd.DataFrame.

Print the dataframe.

In [3]:
import pandas as pd

# 1. Create the Raw Data (A Dictionary)
data = {
    "Movie_Title": ["The Matrix", "Avengers", "Titanic", "Sharknado"],
    "User_Rating": [10.0, 9.5, 8.0, 1.5],
    "Genre": ["Sci-Fi", "Action", "Drama", "Horror"]
}

# 2. Convert to DataFrame (The "Excel" Object)
df = pd.DataFrame(data)  # <--- Put the dictionary variable here

# 3. View the Data
print("--- My First DataFrame ---")
print(df)

# 4. The "Senior Engineer" Check
# Check the data types of columns
print("\n--- Column Types ---")
print(df.dtypes)

--- My First DataFrame ---
  Movie_Title  User_Rating   Genre
0  The Matrix         10.0  Sci-Fi
1    Avengers          9.5  Action
2     Titanic          8.0   Drama
3   Sharknado          1.5  Horror

--- Column Types ---
Movie_Title     object
User_Rating    float64
Genre           object
dtype: object
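For comparison, the same table can come from CSV text. This is a small sketch that fakes a file with an in-memory string, so nothing has to be downloaded. The names `csv_text` and `df_from_csv` are made up for this example.

```python
import io
import pandas as pd

# The same four movies, written the way they would appear in a .csv file.
csv_text = """Movie_Title,User_Rating,Genre
The Matrix,10.0,Sci-Fi
Avengers,9.5,Action
Titanic,8.0,Drama
Sharknado,1.5,Horror
"""

# io.StringIO wraps the string so that pd.read_csv can read it like a file.
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_csv)
print(df_from_csv.dtypes)
```

The first CSV line becomes the column names, just like the dictionary keys did, and Pandas infers a type for each column.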


In [4]:
# 1. Get just the names
names = df["Movie_Title"]
print("--- Just Names ---")
print(names)

# 2. Get the Blockbusters (Rating > 8)
# Remember: df[ condition ]
high_rated = df[ df["User_Rating"] > 8 ]
print("\n--- High Rated Movies ---")
print(high_rated)

# 3. Challenge: Get the entry for "Sharknado"
sharknado_row = df[ df["Movie_Title"] == "Sharknado" ]
print("\n--- The Sharknado Row ---")
print(sharknado_row)

--- Just Names ---
0    The Matrix
1      Avengers
2       Titanic
3     Sharknado
Name: Movie_Title, dtype: object

--- High Rated Movies ---
  Movie_Title  User_Rating   Genre
0  The Matrix         10.0  Sci-Fi
1    Avengers          9.5  Action

--- The Sharknado Row ---
  Movie_Title  User_Rating   Genre
3   Sharknado          1.5  Horror
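The df[ condition ] pattern works because the condition is itself a column of True/False values (a "mask"). The sketch below prints the mask on its own and then combines two conditions. The names `mask` and `high_rated_action` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["The Matrix", "Avengers", "Titanic", "Sharknado"],
    "User_Rating": [10.0, 9.5, 8.0, 1.5],
    "Genre": ["Sci-Fi", "Action", "Drama", "Horror"],
})

# 1. The condition alone gives one True/False per row
mask = df["User_Rating"] > 8
print(mask)

# 2. Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
high_rated_action = df[(df["User_Rating"] > 8) & (df["Genre"] == "Action")]
print(high_rated_action)
```

Note that Titanic (8.0) is not "greater than 8", so its mask value is False. Use >= if you want to include it.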


The Problem: Your Neural Network (PyTorch) is a math machine. It eats Numbers. Your DataFrame currently contains Text ("Sci-Fi", "Action"). If you try torch.tensor("Sci-Fi"), it raises an error.

The Solution: One-Hot Encoding

We convert the text into a "Truth Matrix" of 1s and 0s, with one column per genre. With our four genres in alphabetical order (Action, Drama, Horror, Sci-Fi):

"Sci-Fi" $\rightarrow$ [0, 0, 0, 1]

"Action" $\rightarrow$ [1, 0, 0, 0]

Your Task:

We will use a magic Pandas function called pd.get_dummies(). It automatically scans your text column and turns it into a matrix of True/False values.

In [5]:
# 1. Look at the original text
print("Original Genres:")
print(df["Genre"])

# 2. THE MAGIC TRANSLATION
# pd.get_dummies turns unique text values into new columns
genre_matrix = pd.get_dummies(df["Genre"])

print("\n--- The One-Hot Encoded Matrix ---")
print(genre_matrix)

# 3. (Optional) Convert True/False to 1/0 (Int)
# pandas 2.0 and later return True/False here. The model needs 1/0.
genre_matrix_int = genre_matrix.astype(int)
print("\n--- The Final AI-Ready Matrix ---")
print(genre_matrix_int)

Original Genres:
0    Sci-Fi
1    Action
2     Drama
3    Horror
Name: Genre, dtype: object

--- The One-Hot Encoded Matrix ---
   Action  Drama  Horror  Sci-Fi
0   False  False   False    True
1    True  False   False   False
2   False   True   False   False
3   False  False    True   False

--- The Final AI-Ready Matrix ---
   Action  Drama  Horror  Sci-Fi
0       0      0       0       1
1       1      0       0       0
2       0      1       0       0
3       0      0       1       0
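As an alternative sketch, pd.get_dummies can also take the whole DataFrame. Its `columns` argument picks which columns to encode and its `dtype` argument sets the output type, so the separate .astype(int) step is not needed. The name `encoded` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Movie_Title": ["The Matrix", "Avengers", "Titanic", "Sharknado"],
    "User_Rating": [10.0, 9.5, 8.0, 1.5],
    "Genre": ["Sci-Fi", "Action", "Drama", "Horror"],
})

# Encode only the Genre column and ask for integers right away.
# The new columns get the prefix "Genre_" so you can see where they came from.
encoded = pd.get_dummies(df, columns=["Genre"], dtype=int)
print(encoded)

# Sanity check: every row has exactly one 1 across the genre columns
genre_columns = [c for c in encoded.columns if c.startswith("Genre_")]
print(encoded[genre_columns].sum(axis=1))
```

The trade-off is that the other columns (including the text titles) come along in the result, so you still have to drop them before training.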


Now you have two separate pieces of data:

df[['User_Rating']] (The numerical rating).

genre_matrix_int (The genre features).

To train an AI, we need One Big Matrix ($X$). We need to glue these two together side-by-side.

Your Task:

Use pd.concat to join them.

In [6]:
# 1. Select the numerical columns from the original df
# We don't want the text names, just the numbers!
rating_data = df[['User_Rating']] 

# 2. GLUE THEM TOGETHER
# axis=1 means "glue side-by-side" (add columns)
# axis=0 would mean "glue top-to-bottom" (add rows)
final_features = pd.concat([rating_data, genre_matrix_int], axis=1)

print("--- The Final Training Data (X) ---")
print(final_features)

# 3. CONVERT TO NUMPY (Ready for PyTorch)
# This .values attribute strips away the column names and leaves just the matrix
X_matrix = final_features.values
print("\n--- The NumPy Matrix ---")
print(X_matrix)
print(f"Shape: {X_matrix.shape}")

--- The Final Training Data (X) ---
   User_Rating  Action  Drama  Horror  Sci-Fi
0         10.0       0      0       0       1
1          9.5       1      0       0       0
2          8.0       0      1       0       0
3          1.5       0      0       1       0

--- The NumPy Matrix ---
[[10.   0.   0.   0.   1. ]
 [ 9.5  1.   0.   0.   0. ]
 [ 8.   0.   1.   0.   0. ]
 [ 1.5  0.   0.   1.   0. ]]
Shape: (4, 5)
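Notice that the 0s and 1s printed as 0. and 1. in the NumPy matrix. A NumPy array has one dtype, so the integer genre columns were converted to floats to match the rating column. The sketch below shows this, using .to_numpy(), which the pandas documentation recommends over .values. The name `X_float32` is made up for this example.

```python
import numpy as np
import pandas as pd

# Same table as the final training data above, typed in by hand
final_features = pd.DataFrame({
    "User_Rating": [10.0, 9.5, 8.0, 1.5],
    "Action": [0, 1, 0, 0],
    "Drama": [0, 0, 1, 0],
    "Horror": [0, 0, 0, 1],
    "Sci-Fi": [1, 0, 0, 0],
})

# In the DataFrame, each column still has its own type
print(final_features.dtypes)

# In the matrix, everything shares one type (floats win over ints)
X_matrix = final_features.to_numpy()
print(X_matrix.dtype)

# You can also request a specific type up front
X_float32 = final_features.to_numpy(dtype=np.float32)
print(X_float32.dtype)
```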


Input ($X$): The movie data you just prepared (Rating + Genres).

Target ($Y$): Did you like the movie? (1 = Yes, 0 = No).

The Matrix (Sci-Fi, High Rating) $\to$ 1 (Like)

Avengers (Action, High Rating) $\to$ 1 (Like)

Titanic (Drama, High Rating) $\to$ 0 (Dislike - too sad)

Sharknado (Horror, Low Rating) $\to$ 0 (Dislike - trash)

Copy this into a new cell. This is your first "Professional" ML Script.

In [10]:
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

# ==========================================
# STEP 1: DATA ENGINEERING (Pandas)
# ==========================================
print("--- 1. Preparing Data ---")
# Raw Data
data = {
    "Movie_Title": ["The Matrix", "Avengers", "Titanic", "Sharknado"],
    "User_Rating": [10.0, 9.5, 8.0, 1.5],
    "Genre": ["Sci-Fi", "Action", "Drama", "Horror"],
    "Target_Label": [1.0, 1.0, 0.0, 0.0] # 1=Like, 0=Dislike
}
df = pd.DataFrame(data)

# One-Hot Encoding (Text -> Numbers)
genre_matrix = pd.get_dummies(df["Genre"]).astype(int)

# Combine Features: [User_Rating] + [Genre Columns]
# Note: We DROP the Title (AI can't read titles yet) and Target (that's Y, not X)
X_features = pd.concat([df[['User_Rating']], genre_matrix], axis=1).values
y_labels = df[['Target_Label']].values

print(f"Features (X):\n{X_features}")
print(f"Targets (Y):\n{y_labels}")

# ==========================================
# STEP 2: CONVERT TO PYTORCH TENSORS
# ==========================================
# PyTorch layers use 32-bit floats by default, so the data must match
X_tensor = torch.tensor(X_features, dtype=torch.float32)
y_tensor = torch.tensor(y_labels, dtype=torch.float32)

# ==========================================
# STEP 3: MODEL ARCHITECTURE
# ==========================================
# We have 5 Input Features: (1 Rating + 4 Genres)
# We want 1 Output: a "like" score for each movie
# Note: nn.Linear + MSELoss gives an unbounded score, not a true probability
model = nn.Linear(in_features=5, out_features=1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# ==========================================
# STEP 4: TRAINING LOOP (The "Learning")
# ==========================================
print("\n--- 2. Training Model ---")
for epoch in range(500):
    # A. Forward Pass
    prediction = model(X_tensor)
    
    # B. Calculate Loss
    loss = loss_fn(prediction, y_tensor)
    
    # C. Backward Pass (Auto-Gradient)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if epoch % 100 == 0:
        print(f"Epoch {epoch} | Loss: {loss.item():.4f}")

# ==========================================
# STEP 5: PREDICTION (Testing)
# ==========================================
print("\n--- 3. Final Results ---")
final_pred = model(X_tensor)
print("Predicted Scores (Closer to 1 is better):")
# .detach() removes the gradient tracking so we can print clean numbers
print(final_pred.detach().numpy())

--- 1. Preparing Data ---
Features (X):
[[10.   0.   0.   0.   1. ]
 [ 9.5  1.   0.   0.   0. ]
 [ 8.   0.   1.   0.   0. ]
 [ 1.5  0.   0.   1.   0. ]]
Targets (Y):
[[1.]
 [1.]
 [0.]
 [0.]]

--- 2. Training Model ---
Epoch 0 | Loss: 3.7107
Epoch 100 | Loss: 0.0459
Epoch 200 | Loss: 0.0168
Epoch 300 | Loss: 0.0062
Epoch 400 | Loss: 0.0023

--- 3. Final Results ---
Predicted Scores (Closer to 1 is better):
[[ 0.9660142 ]
 [ 0.999071  ]
 [ 0.04545099]
 [-0.00999052]]
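Notice that the last score is slightly negative, so these outputs are not probabilities. One common way to get values between 0 and 1 is to swap MSELoss for nn.BCEWithLogitsLoss and pass the model output through torch.sigmoid. This is a different technique from the script above, shown as a rough sketch under the same data. The seed and the 0.5 cutoff are arbitrary choices for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)  # arbitrary seed so repeated runs start the same way

# Same features and targets as above: [Rating, Action, Drama, Horror, Sci-Fi]
X = torch.tensor([
    [10.0, 0, 0, 0, 1],
    [9.5, 1, 0, 0, 0],
    [8.0, 0, 1, 0, 0],
    [1.5, 0, 0, 1, 0],
], dtype=torch.float32)
y = torch.tensor([[1.0], [1.0], [0.0], [0.0]])

model = nn.Linear(in_features=5, out_features=1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid inside the loss

for epoch in range(500):
    logits = model(X)          # raw scores, any real number
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# torch.no_grad() turns off gradient tracking while we only want predictions
with torch.no_grad():
    probs = torch.sigmoid(model(X))        # squashed into the 0-to-1 range
    predicted_like = (probs >= 0.5).int()  # 1 = Like, 0 = Dislike

print(probs)
print(predicted_like)
```

With only four movies, either version just memorizes the table. The point here is the shape of the output, not the accuracy.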
