# Regression Pipeline for Movies Dataset

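This notebook prepares the movies dataset for **linear or logistic regression**: it selects features, imputes missing values, one-hot encodes categorical columns, and standardizes numerical columns.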

**Note:** This pipeline uses the output from `analysis_pipeline.ipynb` (`movies_cleaned_analysis.csv`).

## Simplified Pipeline Steps:
1. Load Analysis-Ready Data
2. Select Features (Simple Numerical + Categorical)
3. Handle Missing Values
4. Encode Categorical Features (One-Hot Encoding) and Scale Numerical Features
5. Save Preprocessed Dataset


In [2]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


## STEP 1: Load Analysis-Ready Data

Load the preprocessed dataset from the analysis pipeline.


In [3]:
# Load analysis-ready data
# Make sure to run analysis_pipeline.ipynb first to generate this file
df = pd.read_csv('movies_cleaned_analysis.csv')
print(f"Initial shape: {df.shape}")
print(f"\nFirst few rows:")
df.head()


Initial shape: (62216, 35)

First few rows:


Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,original_language,...,primary_company,company_count,release_quarter,release_decade,is_summer_blockbuster,is_holiday_season,budget_revenue_ratio,vote_reliability,runtime_bins,financial_success
0,27205,Inception,8.364,34495,Released,2010-07-15,825532800.0,148,160000000.0,en,...,Legendary Pictures,3,Q3,2010,True,False,5.15958,87.392079,Long,True
1,157336,Interstellar,8.417,32571,Released,2014-11-05,701729200.0,169,165000000.0,en,...,Legendary Pictures,3,Q4,2010,False,True,4.252904,87.4628,Long,True
2,155,The Dark Knight,8.512,30619,Released,2008-07-16,1004558000.0,152,185000000.0,en,...,DC Comics,5,Q3,2000,True,False,5.430046,87.923927,Long,True
3,19995,Avatar,7.573,29815,Released,2009-12-15,2923706000.0,162,237000000.0,en,...,Dune Entertainment,4,Q4,2000,False,True,12.336312,78.023108,Long,True
4,24428,The Avengers,7.71,29166,Released,2012-04-25,1518816000.0,143,220000000.0,en,...,Marvel Studios,1,Q2,2010,False,False,6.903707,79.264916,Long,True
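
The `Unnamed: 0` header in the preview may simply be the row index as rendered in this export. If it is instead a real column (which happens when a DataFrame is written with `to_csv` including its index and then read back without `index_col`), it can be absorbed at load time. A small sketch on an in-memory CSV; the toy data is made up for illustration:

```python
import io
import pandas as pd

# Toy frame standing in for the analysis output (illustrative values only)
original = pd.DataFrame({"title": ["A", "B"], "runtime": [100, 120]})

# Writing with the default index=True stores the index as an unnamed first column
buffer = io.StringIO()
original.to_csv(buffer)

# Reading without index_col surfaces that column as 'Unnamed: 0'
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# Reading with index_col=0 restores it as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))
print(list(restored.columns))
```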


## STEP 2: Select Features for Regression

Select simple numerical and categorical features.
For linear/logistic regression, we'll use:
- Key numerical features (budget, runtime, vote_count, etc.)
- Primary categorical features (original_language, primary_genre, etc.)


In [4]:
# Drop columns that are not used as predictors
columns_to_drop = ['id', 'title', 'status']
df = df.drop(columns=columns_to_drop, errors='ignore')

# Select numerical features (simple, key predictors)
numerical_features = [
    'budget', 'runtime', 'vote_count', 'vote_average', 
    'popularity', 'release_year'
]

# Select categorical features (use primary categories - simpler)
categorical_features = [
    'original_language', 'primary_genre', 'primary_country', 'release_month'
]

# Filter to only include columns that exist
numerical_features = [f for f in numerical_features if f in df.columns]
categorical_features = [f for f in categorical_features if f in df.columns]

print(f"Numerical features ({len(numerical_features)}): {numerical_features}")
print(f"\nCategorical features ({len(categorical_features)}): {categorical_features}")
print(f"\nData shape: {df.shape}")


Numerical features (6): ['budget', 'runtime', 'vote_count', 'vote_average', 'popularity', 'release_year']

Categorical features (4): ['original_language', 'primary_genre', 'primary_country', 'release_month']

Data shape: (62216, 32)
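
The existence filter in this step silently drops any requested feature that is absent from the data, so a misspelled column name would go unnoticed. A minimal sketch of a variant that also reports what was dropped; `split_present` is a hypothetical helper, not part of the notebook:

```python
def split_present(requested, available):
    """Split requested names into those present in `available` and those absent."""
    available = set(available)
    present = [name for name in requested if name in available]
    absent = [name for name in requested if name not in available]
    return present, absent

# Toy column list; 'release_yr' is a deliberate typo for illustration
columns = ["budget", "runtime", "release_year"]
present, absent = split_present(["budget", "runtime", "release_yr"], columns)

print(f"Using: {present}")
if absent:
    print(f"Not found in data: {absent}")
```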


## STEP 3: Handle Missing Values

Check how many values are missing in the selected features. The imputation itself is applied inside the Step 4 pipelines:
- Numerical: median imputation
- Categorical: fill with 'Unknown'


In [5]:
# Check missing values
print("Missing values in selected features:")
print(f"\nNumerical features:")
for col in numerical_features:
    missing = df[col].isnull().sum()
    if missing > 0:
        print(f"  {col}: {missing} ({missing/len(df)*100:.1f}%)")

print(f"\nCategorical features:")
for col in categorical_features:
    missing = df[col].isnull().sum()
    if missing > 0:
        print(f"  {col}: {missing} ({missing/len(df)*100:.1f}%)")


Missing values in selected features:

Numerical features:
  budget: 48721 (78.3%)

Categorical features:
  primary_genre: 403 (0.6%)


## STEP 4: Encode and Scale Features

Apply preprocessing:
- Numerical: median imputation + standardization
- Categorical: fill 'Unknown' + one-hot encoding


In [6]:
# Create preprocessing pipelines
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features)
    ],
    verbose_feature_names_out=False
)

# Fit and transform
print("Fitting preprocessor...")
processed_array = preprocessor.fit_transform(df)

# Get column names
numeric_names = numerical_features
categorical_names = preprocessor.named_transformers_['cat']['onehot'].get_feature_names_out(categorical_features)
all_columns = list(numeric_names) + list(categorical_names)

# Create final dataframe
df_regression = pd.DataFrame(processed_array, columns=all_columns)

print(f"Final shape: {df_regression.shape}")
print("Preprocessing complete!")


Fitting preprocessor...
Final shape: (62216, 296)
Preprocessing complete!


## STEP 5: Save Preprocessed Dataset

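Save the regression-ready dataset. The file contains only the transformed feature columns; columns that were not selected in Step 2 (for example `revenue`) are not included.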


In [None]:
# Save preprocessed data
output_file = 'movies_regression_ready.csv'
df_regression.to_csv(output_file, index=False)
print(f"Preprocessed dataset saved as '{output_file}'")
print(f"\nShape: {df_regression.shape}")
print(f"\nFirst few rows:")
df_regression.head()
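
The preprocessor in Step 4 is fitted on all rows. For modelling, a common pattern is to split the data first and fit the scaler on the training rows only, so that test-set statistics do not leak into the transform. A minimal sketch on synthetic data, assuming `revenue` as the target purely for illustration (the notebook does not name one):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movies data (values are made up)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "runtime": rng.normal(100, 15, 200),
    "vote_count": rng.integers(10, 5000, 200).astype(float),
})
# Hypothetical target: a noisy linear function of the two predictors
toy["revenue"] = 2.0 * toy["runtime"] + 0.5 * toy["vote_count"] + rng.normal(0, 1, 200)

X = toy[["runtime", "vote_count"]]
y = toy["revenue"]

# Split before fitting anything, so the scaler only sees training rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)
print(f"Held-out R^2: {r2:.3f}")
```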