# TASK-2: Hit Prediction (Classification)

**Objective:** Predict whether a movie will be a Hit (0/1).

**Dataset:** movies.csv (15 movies)

**Approach:**
1. Feature Selection (include cluster labels from Part A)
2. Stratified Train-Test Split
3. Model 1: Logistic Regression
4. Model 2: Random Forest
5. Cross-Validation for reliable evaluation
6. Model Comparison and Selection
7. Predict hit status for a new movie

## Step 1: Import Libraries and Load Data

In [40]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.cluster import KMeans
import warnings
warnings.filterwarnings('ignore')

In [41]:
# Load Dataset
df = pd.read_csv("../Dataset/original/movies.csv")
print("Dataset Shape:", df.shape)
print("\n--- Dataset Preview ---")
print(df.head())
print("\n--- Target Distribution ---")
print(df['hit'].value_counts())

Dataset Shape: (15, 6)

--- Dataset Preview ---
   movie_id  avg_watch_time  completion_rate  ratings_count  avg_rating  hit
0         1              45             0.60           1200         3.8    0
1         2             110             0.90           8500         4.6    1
2         3              60             0.65           2000         4.0    0
3         4             130             0.95          12000         4.8    1
4         5              40             0.55            900         3.6    0

--- Target Distribution ---
hit
1    8
0    7
Name: count, dtype: int64
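As a reference point for the accuracies reported later, a majority-class baseline can be computed from the counts above. This is a minimal sketch: the `class_counts` dictionary is typed in by hand from the printed distribution rather than read from `df`.

```python
# Class counts copied by hand from the target distribution printed above.
class_counts = {1: 8, 0: 7}

total = sum(class_counts.values())

# The majority-class baseline always predicts the most frequent label.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should beat this baseline on held-out data.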


In [42]:
# Create engagement and popularity composite features.
# NOTE: scaler_norm is fit on the full dataset (before the split below), so the
# min/max values behind these composites also reflect the test rows.
feature_cols = ['avg_watch_time', 'completion_rate', 'ratings_count', 'avg_rating']

scaler_norm = MinMaxScaler()
df_norm = pd.DataFrame(scaler_norm.fit_transform(df[feature_cols]), columns=feature_cols)

# Create composite features (weighted)
df['engagement'] = (df_norm['avg_watch_time'] + df_norm['completion_rate']) / 2
df['popularity'] = 0.7 * df_norm['ratings_count'] + 0.3 * df_norm['avg_rating']

# Use ALL 6 features: 4 original + engagement + popularity
all_features = feature_cols + ['engagement', 'popularity']
X = df[all_features].copy()
y = df['hit']

# Stratified split - keeps both classes represented in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("✅ Created Engagement & Popularity features (weighted)")
print("✅ Stratified Train-Test Split Done (before clustering and standard scaling)")
print(f"Training on 6 features: {all_features}")
print("Training Set Size:", len(X_train))
print("Testing Set Size:", len(X_test))
print("\nTraining Features:")
print(X_train.head())

✅ Created Engagement & Popularity features (weighted)
✅ Stratified Train-Test Split Done (before clustering and standard scaling)
Training on 6 features: ['avg_watch_time', 'completion_rate', 'ratings_count', 'avg_rating', 'engagement', 'popularity']
Training Set Size: 12
Testing Set Size: 3

Training Features:
    avg_watch_time  completion_rate  ratings_count  avg_rating  engagement  \
0               45             0.60           1200         3.8    0.080867   
10             115             0.91           9000         4.6    0.759514   
12             125             0.93          11000         4.8    0.828224   
2               60             0.65           2000         4.0    0.207188   
5              120             0.92          10000         4.7    0.793869   

    popularity  
0     0.055138  
10    0.545865  
12    0.670593  
2     0.130744  
5     0.608229  
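The composite features rely on a `MinMaxScaler` fit on all 15 rows. One way to keep held-out rows out of that step is to learn the min/max from the training rows only and reuse them for the test rows. The sketch below shows the idea in plain Python. `fit_min_max` and `transform` are hypothetical helpers, and the data is just the five preview rows, with the last one arbitrarily treated as held-out.

```python
# (avg_watch_time, completion_rate) pairs taken from the five-row preview only.
train_rows = [(45, 0.60), (110, 0.90), (60, 0.65), (130, 0.95)]
test_rows = [(40, 0.55)]  # arbitrarily treated as held-out for illustration


def fit_min_max(rows):
    """Learn per-column (min, max) from the given rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def transform(rows, bounds):
    """Apply previously learned bounds; no statistics are recomputed here."""
    return [
        tuple((value - lo) / (hi - lo) for value, (lo, hi) in zip(row, bounds))
        for row in rows
    ]


bounds = fit_min_max(train_rows)          # fit on training rows only
train_norm = transform(train_rows, bounds)
test_norm = transform(test_rows, bounds)  # reuse the training bounds

# Same engagement formula as in the cell above: mean of the two scaled columns.
train_engagement = [(w + c) / 2 for w, c in train_norm]
test_engagement = [(w + c) / 2 for w, c in test_norm]

print("Train engagement:", [round(e, 3) for e in train_engagement])
print("Test engagement:", [round(e, 3) for e in test_engagement])
```

With training-only bounds, a held-out row can land outside the [0, 1] range, as it does here. That is expected behaviour rather than a bug.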


In [43]:
# Fit the clustering scaler on TRAINING data only (using all 6 features)
scaler_cluster = StandardScaler()
X_train_cluster_scaled = scaler_cluster.fit_transform(X_train)

# Fit KMeans on TRAINING data only
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
train_clusters = kmeans.fit_predict(X_train_cluster_scaled)

# Add cluster labels to training set
X_train = X_train.copy()
X_train['cluster'] = train_clusters

# Transform test data using FITTED scaler and kmeans (no fit!)
X_test_cluster_scaled = scaler_cluster.transform(X_test)
test_clusters = kmeans.predict(X_test_cluster_scaled)

# Add cluster labels to test set
X_test = X_test.copy()
X_test['cluster'] = test_clusters

print("✅ Clustering Done on 6 Features (Fit on Train, Transform on Test)")
print("\nTraining Cluster Distribution:")
print(pd.Series(train_clusters).value_counts())
print("\nTest Cluster Distribution:")
print(pd.Series(test_clusters).value_counts())
print(f"\nFeatures (7 total): {list(X_train.columns)}")

✅ Clustering Done on 6 Features (Fit on Train, Transform on Test)

Training Cluster Distribution:
1    6
0    6
Name: count, dtype: int64

Test Cluster Distribution:
0    2
1    1
Name: count, dtype: int64

Features (7 total): ['avg_watch_time', 'completion_rate', 'ratings_count', 'avg_rating', 'engagement', 'popularity', 'cluster']
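Calling `kmeans.predict` on the test rows assigns each row to the nearest centroid learned from the training rows. The sketch below mimics that assignment step in plain Python. The centroids and points are made-up values in a two-dimensional standardized space, and `assign_cluster` is a hypothetical helper, not part of scikit-learn.

```python
import math

# Hypothetical centroids in a standardized feature space (two features for brevity).
centroids = [(-1.0, -1.0), (1.0, 1.0)]


def assign_cluster(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Made-up points standing in for scaled test rows.
new_points = [(-0.8, -1.2), (0.9, 0.7), (0.1, 0.2)]
labels = [assign_cluster(p, centroids) for p in new_points]
print(labels)
```

Cluster ids are arbitrary: cluster 0 in one run may correspond to cluster 1 in another if the initialization changes, so the `cluster` column only has a stable meaning for a fixed fitted `kmeans` object.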


In [44]:
# Scale features - Fit on TRAINING only, Transform both
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Scaling Done Properly (Fit on Train, Transform on Test)")
print("\nTraining Scaled Shape:", X_train_scaled.shape)
print("Testing Scaled Shape:", X_test_scaled.shape)

✅ Scaling Done Properly (Fit on Train, Transform on Test)

Training Scaled Shape: (12, 7)
Testing Scaled Shape: (3, 7)
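The "fit on train, transform on test" rule comes down to computing the mean and standard deviation from the training rows and reusing them everywhere else. A minimal sketch of that arithmetic for a single column follows. The training values are the `avg_watch_time` entries of the five training rows printed earlier, and the held-out value is hypothetical.

```python
import statistics

# avg_watch_time for the five training rows shown in the earlier output.
train_watch_time = [45, 115, 125, 60, 120]
held_out_watch_time = [100]  # hypothetical held-out value

# Statistics are learned from the training values only.
mean = statistics.fmean(train_watch_time)
std = statistics.pstdev(train_watch_time)  # population std (ddof=0), as StandardScaler uses

train_scaled = [(x - mean) / std for x in train_watch_time]
held_out_scaled = [(x - mean) / std for x in held_out_watch_time]

print(f"Training mean: {mean:.1f}")
print("Scaled training values:", [round(v, 3) for v in train_scaled])
print("Scaled held-out value:", [round(v, 3) for v in held_out_scaled])
```

The scaled training column has mean 0 and unit variance by construction. The held-out value is simply expressed in the same units.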


## Model 1 - Logistic Regression

In [45]:
# Train Logistic Regression on properly scaled training data
model_lr = LogisticRegression(random_state=42)
model_lr.fit(X_train_scaled, y_train)

# Predictions on properly scaled test data
y_pred_lr = model_lr.predict(X_test_scaled)

# Evaluation
print("=== Logistic Regression Results ===")
print("Test Accuracy:", accuracy_score(y_test, y_pred_lr))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

=== Logistic Regression Results ===
Test Accuracy: 1.0

Confusion Matrix:
[[1 0]
 [0 2]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         2

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
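The figures in the classification report can be re-derived by hand from the confusion matrix. The sketch below does this for class 1, assuming scikit-learn's layout where rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted).
confusion = [[1, 0],
             [0, 2]]

tn, fp = confusion[0]
fn, tp = confusion[1]

precision = tp / (tp + fp)  # of the predicted hits, how many were hits
recall = tp / (tp + fn)     # of the actual hits, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```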



## Model 2 - Random Forest

In [46]:
# Train Random Forest on properly scaled training data
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train_scaled, y_train)

# Predictions on properly scaled test data
y_pred_rf = model_rf.predict(X_test_scaled)

# Evaluation
print("=== Random Forest Results ===")
print("Test Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

=== Random Forest Results ===
Test Accuracy: 1.0

Confusion Matrix:
[[1 0]
 [0 2]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
           1       1.00      1.00      1.00         2

    accuracy                           1.00         3
   macro avg       1.00      1.00      1.00         3
weighted avg       1.00      1.00      1.00         3
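Both models score 1.0, but the test set holds only three movies, so that number carries wide uncertainty. A rough way to quantify it is an exact binomial (Clopper-Pearson) interval. When all n predictions are correct, its two-sided lower bound reduces to (alpha/2)^(1/n). The helper name below is hypothetical.

```python
def lower_bound_all_correct(n, alpha=0.05):
    """Clopper-Pearson two-sided lower bound on accuracy when n out of n are correct."""
    return (alpha / 2) ** (1 / n)


# 3 is the test-set size here; the larger sizes are shown only for comparison.
for n in (3, 12, 100):
    print(f"n={n:>3}: true accuracy is plausibly as low as {lower_bound_all_correct(n):.3f}")
```

With three test samples the 95% lower bound is roughly 0.29, so a perfect test score here is compatible with a model that is far from perfect.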



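## Cross-Validation with Pipeline

**Key:** Use a Pipeline so that standard scaling is fit inside each CV fold rather than once beforehand. The `engagement`, `popularity`, and `cluster` columns were computed before CV, so only the scaling step is fold-local here.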

In [47]:
# Cross-Validation with Pipeline - scaling happens INSIDE each fold
# This is the proper way to do CV without data leakage
from sklearn.pipeline import Pipeline

print("=== Cross-Validation Results (5-Fold with Pipeline) ===\n")
print("NOTE: Using Pipeline ensures scaling is done inside each CV fold (no leakage)")

# Logistic Regression Pipeline
pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# Random Forest Pipeline
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# CV on TRAINING data only (not full dataset)
cv_scores_lr = cross_val_score(pipeline_lr, X_train, y_train, cv=5)
print("Logistic Regression:")
print(f"  CV Scores: {cv_scores_lr}")
print(f"  Mean CV Accuracy: {cv_scores_lr.mean():.4f} (+/- {cv_scores_lr.std():.4f})")

cv_scores_rf = cross_val_score(pipeline_rf, X_train, y_train, cv=5)
print("\nRandom Forest:")
print(f"  CV Scores: {cv_scores_rf}")
print(f"  Mean CV Accuracy: {cv_scores_rf.mean():.4f} (+/- {cv_scores_rf.std():.4f})")

=== Cross-Validation Results (5-Fold with Pipeline) ===

NOTE: Using Pipeline ensures scaling is done inside each CV fold (no leakage)
Logistic Regression:
  CV Scores: [1. 1. 1. 1. 1.]
  Mean CV Accuracy: 1.0000 (+/- 0.0000)



Random Forest:
  CV Scores: [1. 1. 1. 1. 1.]
  Mean CV Accuracy: 1.0000 (+/- 0.0000)
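In the cell above only the `StandardScaler` is refit per fold. The `cluster` column was produced by a KMeans model fit on the whole training set. One possible way to make the clustering fold-local as well is to wrap it in a transformer and place it inside the pipeline. The sketch below is an illustration under stated assumptions: `ClusterLabelAppender` is a hypothetical helper, and the data is synthetic, not the movies dataset.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ClusterLabelAppender(BaseEstimator, TransformerMixin):
    """Hypothetical helper: appends a KMeans cluster label as an extra column."""

    def __init__(self, n_clusters=2, random_state=42):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def fit(self, X, y=None):
        # KMeans is fit only on the rows passed in, i.e. the training part of a fold.
        self.kmeans_ = KMeans(
            n_clusters=self.n_clusters, random_state=self.random_state, n_init=10
        ).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X).reshape(-1, 1)
        return np.hstack([np.asarray(X), labels])


# Synthetic stand-in data (NOT the movies dataset): two well-separated groups.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(5, 1, (10, 4))])
y_demo = np.array([0] * 10 + [1] * 10)

pipeline = Pipeline([
    ("scale_for_clustering", StandardScaler()),
    ("add_cluster", ClusterLabelAppender()),
    ("scale_all", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)
print("Fold accuracies:", scores)
```

Every step, including the clustering, is now cloned and refit on the training portion of each fold, so no fold's validation rows influence the features used to score them.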


## Hyperparameter Tuning (Logistic Regression)

In [48]:
# GridSearch with Pipeline - proper way to tune hyperparameters
# Note: When using Pipeline, prefix param name with step name

param_grid = {'classifier__C': [0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(pipeline_lr, param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Fit on TRAINING data only

print("Best Parameters:", grid_search.best_params_)
print("Best CV Score:", grid_search.best_score_)

Best Parameters: {'classifier__C': 0.1}
Best CV Score: 1.0
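A best score of 1.0 suggests that several values of `C` may be tied, in which case `best_params_` reflects a tie-break rather than a clear winner. Looking at `cv_results_` makes this visible. The sketch below runs the same grid on synthetic stand-in data (not the movies dataset) and lists every candidate that shares rank 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the movies dataset): two well-separated groups.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(5, 1, (10, 4))])
y_demo = np.array([0] * 10 + [1] * 10)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])
param_grid = {"classifier__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=5).fit(X_demo, y_demo)

results = search.cv_results_
for params, mean_score, rank in zip(
    results["params"], results["mean_test_score"], results["rank_test_score"]
):
    print(f"{params}: mean accuracy={mean_score:.3f}, rank={rank}")

# Candidates sharing rank 1 are indistinguishable by CV accuracy alone.
tied_for_best = [
    params
    for params, rank in zip(results["params"], results["rank_test_score"])
    if rank == 1
]
print("Tied for best:", tied_for_best)
```

When several candidates tie, a secondary criterion such as preferring stronger regularization (smaller `C`) is a reasonable way to choose among them.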


## Model Comparison and Final Selection

In [49]:
# Compare Models
print("=== MODEL COMPARISON ===\n")

results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'Test Accuracy': [accuracy_score(y_test, y_pred_lr), accuracy_score(y_test, y_pred_rf)],
    'CV Mean Accuracy': [cv_scores_lr.mean(), cv_scores_rf.mean()],
    'CV Std': [cv_scores_lr.std(), cv_scores_rf.std()]
})
print(results)

print("\n=== FINAL MODEL SELECTION ===")
print("""
Based on the comparison:
1. Logistic Regression is preferred for this small dataset because:
   - Simpler model with less risk of overfitting
   - More interpretable coefficients
   - Works well with scaled features
   - More stable with limited training data
   
2. Random Forest may overfit on 15 samples with 100 trees.

FINAL MODEL: Logistic Regression
""")

=== MODEL COMPARISON ===

                 Model  Test Accuracy  CV Mean Accuracy  CV Std
0  Logistic Regression            1.0               1.0     0.0
1        Random Forest            1.0               1.0     0.0

=== FINAL MODEL SELECTION ===

Based on the comparison:
1. Logistic Regression is preferred for this small dataset because:
   - Simpler model with less risk of overfitting
   - More interpretable coefficients
   - Works well with scaled features
   - More stable with limited training data

2. Random Forest may overfit on 15 samples with 100 trees.

FINAL MODEL: Logistic Regression
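Since both models tie on every metric, the choice rests on simplicity and interpretability rather than on the scores. To illustrate the interpretability point, the sketch below reads the logistic regression coefficients out of a fitted pipeline. It uses synthetic stand-in data with the four original feature names attached for readability, so the coefficient values say nothing about the real movies.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["avg_watch_time", "completion_rate", "ratings_count", "avg_rating"]

# Synthetic stand-in data (NOT the movies dataset).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(5, 1, (10, 4))])
y_demo = np.array([0] * 10 + [1] * 10)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(random_state=42)),
])
pipeline.fit(X_demo, y_demo)

# Because inputs are standardized, each coefficient is the change in log-odds
# per one standard deviation of that feature.
coefs = pipeline.named_steps["classifier"].coef_[0]
ranked = sorted(zip(feature_names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>16}: {coef:+.3f}")
```

In the notebook itself the same lookup would apply to `grid_search.best_estimator_`, whose step names are `scaler` and `classifier`.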



## Predict Hit Status for New Movie

In [50]:
# Use the best pipeline from GridSearch (refit on the full training set, since refit=True by default)
best_model = grid_search.best_estimator_

# Evaluate on held-out test set
y_pred_final = best_model.predict(X_test)
print("=== Final Model Evaluation on Test Set ===")
print("Test Accuracy:", accuracy_score(y_test, y_pred_final))

# New movie prediction (6 features + cluster)
new_movie_raw = pd.DataFrame({
    'avg_watch_time': [100],
    'completion_rate': [0.88],
    'ratings_count': [7000],
    'avg_rating': [4.5]
})

# Convert to engagement & popularity using the fitted scaler (weighted)
new_movie_norm = scaler_norm.transform(new_movie_raw)
new_engagement = (new_movie_norm[0][0] + new_movie_norm[0][1]) / 2
new_popularity = 0.7 * new_movie_norm[0][2] + 0.3 * new_movie_norm[0][3]

# Create full feature set (6 features)
new_movie = pd.DataFrame({
    'avg_watch_time': [100],
    'completion_rate': [0.88],
    'ratings_count': [7000],
    'avg_rating': [4.5],
    'engagement': [new_engagement],
    'popularity': [new_popularity]
})

# Get cluster for new movie using fitted scaler and kmeans
new_movie_cluster_scaled = scaler_cluster.transform(new_movie)
new_movie_cluster = kmeans.predict(new_movie_cluster_scaled)
new_movie['cluster'] = new_movie_cluster

# Predict using the pipeline (it handles scaling internally)
prediction = best_model.predict(new_movie)
probability = best_model.predict_proba(new_movie)

print("\n=== NEW MOVIE PREDICTION (6 Features + Cluster) ===")
print("Raw: watch_time=100, completion=0.88, ratings_count=7000, avg_rating=4.5")
print(f"Engagement: {new_engagement:.3f}, Popularity (weighted): {new_popularity:.3f}")
print(f"Cluster: {new_movie_cluster[0]}")
print(f"\nPredicted Hit Status: {'HIT' if prediction[0] == 1 else 'NOT HIT'}")
print(f"Prediction Probability: {probability[0][1]:.2%} chance of being a hit")

=== Final Model Evaluation on Test Set ===
Test Accuracy: 1.0

=== NEW MOVIE PREDICTION (6 Features + Cluster) ===
Raw: watch_time=100, completion=0.88, ratings_count=7000, avg_rating=4.5
Engagement: 0.656, Popularity (weighted): 0.443
Cluster: 0

Predicted Hit Status: HIT
Prediction Probability: 71.70% chance of being a hit
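For a binary logistic regression, the reported probability is the sigmoid of a linear score (the log-odds). A small sketch of that relationship follows, using the probability printed above as the only input. The helper names are illustrative.

```python
import math


def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return math.log(p / (1.0 - p))


p_hit = 0.7170  # probability of a hit printed above

log_odds = logit(p_hit)
odds = p_hit / (1.0 - p_hit)

print(f"log-odds: {log_odds:.3f}")
print(f"odds: about {odds:.2f} to 1 in favour of a hit")
print(f"round trip: {sigmoid(log_odds):.4f}")
```

A log-odds of roughly 0.93 is a moderate lean toward the hit class rather than a confident call, which fits a new movie whose features sit between the two groups seen in training.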
