<a href="https://colab.research.google.com/github/robitussin/CCDATSCL_EXERCISES/blob/main/Exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise 2

<img src="https://vsqfvsosprmjdktwilrj.supabase.co/storage/v1/object/public/images/insights/1753644539114-netflix.jpeg"/>


In this activity, you will explore two fundamental preprocessing techniques used in data science and machine learning: feature scaling and discretization (binning).

These techniques are essential when working with datasets that contain numerical values on very different scales, or continuous variables that may be more useful when grouped into categories.


We will use a subset of the Netflix Movies and TV Shows dataset, which contains metadata such as release year, duration, ratings, and other attributes of titles currently or previously available on Netflix. Although the dataset was not originally designed for numerical modeling, it contains several features suitable for preprocessing practice, such as:
- Release Year
- Duration (in minutes)
- Number of Cast Members
- Number of Listed Genres
- Title Word Count

In this worksheet, you will:
- Load and inspect the dataset
- Select numerical features for scaling
- Apply different scaling techniques
  - Min–Max Scaling
  - Standardization
  - Robust Scaling
- Perform discretization (binning)
  - Equal-width binning
  - Equal-frequency binning
- Evaluate how scaling affects machine learning performance, using a simple KNN classifier
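
As a preview of the three scaling techniques listed above, here is a minimal sketch on a small hypothetical array (not the Netflix data) that contains one deliberate outlier. The array values and variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical durations in minutes; the last value is a deliberate outlier.
x = np.array([80.0, 90.0, 100.0, 110.0, 1440.0]).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)
zscore = StandardScaler().fit_transform(x)  # (x - mean) / std
robust = RobustScaler().fit_transform(x)    # (x - median) / IQR

for name, scaled in [("min-max", minmax), ("z-score", zscore), ("robust", robust)]:
    print(name, np.round(scaled.ravel(), 3))
```

With min-max scaling the outlier squeezes the other values close to 0, while robust scaling centers on the median and divides by the interquartile range, so the typical values keep a usable spread.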

In [420]:
import pandas as pd
import os
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub


## 1. Setup and Data Loading



Load the Netflix dataset into a DataFrame named df.

In [421]:

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


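if os.path.isdir(path):
    print(True)

contents = os.listdir(path)

# Pick the first CSV file instead of assuming it is the first directory entry
csv_files = [name for name in contents if name.endswith(".csv")]
mydataset = os.path.join(path, csv_files[0])

df = pd.read_csv(mydataset)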

Using Colab cache for faster access to the 'netflix-shows' dataset.
Path to dataset files: /kaggle/input/netflix-shows
True


## 2. Data Understanding

Store the dataset’s column names in a variable called cols.

In [422]:
cols = df.columns
print(cols)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')


In [423]:
shape_info = df.shape
print(shape_info)

(8807, 12)


## 3. Data Cleaning
Count missing values per column and save to missing_counts.

In [424]:
missing_counts = df.isnull().sum()
print(missing_counts)

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64


Drop rows where duration is missing. Save to df_clean.

In [425]:
df_clean = df.dropna(subset=['duration'])
print(df_clean)

     show_id     type                  title         director  \
0         s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1         s2  TV Show          Blood & Water              NaN   
2         s3  TV Show              Ganglands  Julien Leclercq   
3         s4  TV Show  Jailbirds New Orleans              NaN   
4         s5  TV Show           Kota Factory              NaN   
...      ...      ...                    ...              ...   
8802   s8803    Movie                 Zodiac    David Fincher   
8803   s8804  TV Show            Zombie Dumb              NaN   
8804   s8805    Movie             Zombieland  Ruben Fleischer   
8805   s8806    Movie                   Zoom     Peter Hewitt   
8806   s8807    Movie                 Zubaan      Mozez Singh   

                                                   cast        country  \
0                                                   NaN  United States   
1     Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa 
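
If new columns will later be added to the cleaned frame, one common precaution is to take an explicit copy right after `dropna`. The sketch below uses a small hypothetical DataFrame (`df_demo`) rather than the Netflix data; the column `has_duration` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Netflix data.
df_demo = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": ["90 min", np.nan, "2 Seasons"],
})

# .copy() makes the cleaned frame independent of df_demo,
# so adding a column to it cannot be mistaken for editing a slice.
df_demo_clean = df_demo.dropna(subset=["duration"]).copy()
df_demo_clean["has_duration"] = True

print(df_demo_clean)
print("has_duration" in df_demo.columns)  # the original frame is untouched
```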

## 4. Selecting Relevant Numeric Features

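Many Netflix datasets include fields that look numeric, such as:
- release_year
- duration
- rating

In this dataset only `release_year` is stored as a number. `duration` is stored as text (for example `90 min` or `2 Seasons`), so it must be converted before it can be scaled.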


Create a DataFrame `df_num` containing only numeric columns.

In [426]:
import numpy as np
df_num = df_clean.select_dtypes(include=np.number)
df_num.head()

Unnamed: 0,release_year
0,2020
1,2021
2,2021
3,2021
4,2021
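
Since only `release_year` is numeric out of the box, the other features named in the introduction (number of cast members, number of listed genres, title word count) would have to be derived from text columns. The following is a rough sketch on hypothetical rows; the helper `count_items` and the new column names are assumptions, not part of the worksheet.

```python
import pandas as pd

# Hypothetical rows shaped like the Netflix columns.
demo = pd.DataFrame({
    "title": ["A Quiet Film", "Some Long Running Show"],
    "cast": [None, "Actor One, Actor Two, Actor Three"],
    "listed_in": ["Documentaries", "TV Dramas, TV Mysteries"],
})

def count_items(value):
    """Count comma-separated entries; missing values count as 0."""
    if pd.isna(value):
        return 0
    return len([item for item in value.split(",") if item.strip()])

demo["cast_count"] = demo["cast"].apply(count_items)
demo["genre_count"] = demo["listed_in"].apply(count_items)
demo["title_word_count"] = demo["title"].str.split().str.len()

print(demo[["cast_count", "genre_count", "title_word_count"]])
```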


## 5. Feature Scaling

Focus on a single column, `duration`. It is stored as text, so it must be converted to minutes before it can be scaled.


Extract the column duration into a Series named `dur`.

In [427]:
dur = df_clean['duration']
print(dur)

0          90 min
1       2 Seasons
2        1 Season
3        1 Season
4       2 Seasons
          ...    
8802      158 min
8803    2 Seasons
8804       88 min
8805       88 min
8806      111 min
Name: duration, Length: 8804, dtype: object
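
Because `dur` has dtype `object`, it needs parsing before any scaling. One possible vectorized approach is sketched below with `Series.str.extract`. The sample values and the constant `MINUTES_PER_SEASON` are assumptions; the 12 episodes of 60 minutes per season mirrors the convention this worksheet adopts in its own conversion function and is not a property of the data.

```python
import pandas as pd

# Hypothetical sample in the same format as the duration column.
dur_demo = pd.Series(["90 min", "2 Seasons", "1 Season", "158 min"])

# Split each entry into a number and a unit ("min" or "Season").
parts = dur_demo.str.extract(r"(?P<value>\d+)\s+(?P<unit>min|Season)")
value = parts["value"].astype(float)

# Assumption: one season = 12 episodes * 60 minutes.
MINUTES_PER_SEASON = 12 * 60

# Keep minute values as they are; convert season counts to minutes.
minutes = value.where(parts["unit"] == "min", value * MINUTES_PER_SEASON)
print(minutes.tolist())
```

Entries that match neither pattern come out as `NaN`, which can then be dropped as in the cells that follow.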


Apply Min–Max Scaling to `dur`. Store the result as `dur_minmax`.

In [428]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Convert "X Seasons" or "1 Season" into minutes
def convert_to_minutes(x):
    if "Season" in x:
        num_seasons = int(x.split()[0])
        total_minutes = num_seasons * 12 * 60   # 12 episodes * 60 min each
        return total_minutes
    elif "min" in x:
        return int(x.split()[0])
    else:
        return np.nan

# Apply conversion
dur_converted = df_clean['duration'].apply(convert_to_minutes)

# Drop missing values (if any)
dur_minutes = dur_converted.dropna()

# Apply Min–Max Scaling
scaler = MinMaxScaler()
dur_minmax = scaler.fit_transform(dur_minutes.values.reshape(-1, 1))

# OPTIONAL: add back to dataframe safely
df_clean.loc[dur_minutes.index, "dur_minmax"] = dur_minmax

# Show result
df_clean[["duration", "dur_minmax"]].head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean.loc[dur_minutes.index, "dur_minmax"] = dur_minmax


Unnamed: 0,duration,dur_minmax
0,90 min,0.00711
1,2 Seasons,0.117431
2,1 Season,0.058593
3,1 Season,0.058593
4,2 Seasons,0.117431


Apply Z-score Standardization to `dur`. Store in `dur_zscore`.

In [429]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# --- FIX: ensure df_clean is an independent copy ---
df_clean = df_clean.copy()

# Convert Season/Seasons and min formats into minutes
def convert_to_minutes(x):
    if "Season" in x:
        num_seasons = int(x.split()[0])
        return num_seasons * 12 * 60   # 12 episodes * 60 min
    elif "min" in x:
        return int(x.split()[0])
    else:
        return np.nan

# Apply the conversion
dur_converted = df_clean['duration'].apply(convert_to_minutes)

# Keep valid minute values
dur_minutes = dur_converted.dropna()

# Apply Z-score Standardization
scaler = StandardScaler()
dur_zscore = scaler.fit_transform(dur_minutes.values.reshape(-1, 1))

# Safe assignment (no warning)
df_clean.loc[dur_minutes.index, "dur_zscore"] = dur_zscore

# Show sample results
df_clean[["duration", "dur_zscore"]].head()


Unnamed: 0,duration,dur_zscore
0,90 min,-0.44158
1,2 Seasons,1.18915
2,1 Season,0.319427
3,1 Season,0.319427
4,2 Seasons,1.18915


## 6. Discretization (Binning)
Apply equal-width binning to `dur` into 5 bins. Store as `dur_width_bins`.


- Use `pandas.cut()` to divide `duration_minutes` into 5 equal-width bins.
- Add the resulting bins as a new column named `duration_equal_width_bin`.

In [430]:
import pandas as pd
import numpy as np

# Convert Season/Seasons and min formats into minutes
def convert_to_minutes(x):
    if isinstance(x, str):
        if "Season" in x:       # Season or Seasons
            num_seasons = int(x.split()[0])
            return num_seasons * 12 * 60   # 12 episodes * 60 min
        elif "min" in x:
            return int(x.split()[0])
    return np.nan

# Convert duration column to minutes
dur_minutes = df_clean["duration"].apply(convert_to_minutes).dropna()

# Apply equal-width binning into 5 bins
dur_width_bins = pd.cut(dur_minutes, bins=5)

# Build clean dataset for rows with valid minute values
df_model = df_clean.loc[dur_minutes.index].copy()
df_model["duration_minutes"] = dur_minutes

# Add the bins as the new column
df_model["duration_equal_width_bin"] = dur_width_bins

# Show results
df_model[["duration", "duration_equal_width_bin"]].head()


Unnamed: 0,duration,duration_equal_width_bin
0,90 min,"(-9.237, 2450.4]"
1,2 Seasons,"(-9.237, 2450.4]"
2,1 Season,"(-9.237, 2450.4]"
3,1 Season,"(-9.237, 2450.4]"
4,2 Seasons,"(-9.237, 2450.4]"


Describe the characteristics of each bin

- What are the bin edges produced by equal-width binning?
- How many movies fall into each bin?

In [431]:
# 1. Show the bin edges
bin_edges = dur_width_bins.cat.categories
print("Bin edges:")
print(bin_edges)

# 2. Count how many movies fall into each bin
bin_counts = dur_width_bins.value_counts().sort_index()
print("\nMovies per bin:")
print(bin_counts)


Bin edges:
IntervalIndex([ (-9.237, 2450.4],  (2450.4, 4897.8],  (4897.8, 7345.2],
                (7345.2, 9792.6], (9792.6, 12240.0]],
              dtype='interval[float64, right]')

Movies per bin:
duration
(-9.237, 2450.4]     8545
(2450.4, 4897.8]      193
(4897.8, 7345.2]       56
(7345.2, 9792.6]        7
(9792.6, 12240.0]       3
Name: count, dtype: int64
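
The edges above can be reproduced by hand. Equal-width binning splits the range from the minimum to the maximum into equal steps, and when `pandas.cut()` is given an integer number of bins it lowers the first edge by 0.1% of the range so that the minimum falls inside the first interval. The sketch assumes a minimum of 3 and a maximum of 12240 minutes, as the printed intervals suggest.

```python
# Assumed minimum, maximum, and bin count, matching the output above.
LO, HI, N_BINS = 3, 12240, 5

width = (HI - LO) / N_BINS
edges = [LO + i * width for i in range(N_BINS + 1)]

# pandas.cut lowers the first edge by 0.1% of the range to include the minimum.
edges[0] = LO - 0.001 * (HI - LO)

print([round(e, 3) for e in edges])
```

Because every bin is about 2447 minutes wide and almost all titles are far shorter than that, nearly everything lands in the first bin.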


Apply equal-frequency binning to `dur` into 5 bins. Store as `dur_quantile_bins`.

- Use `pandas.qcut()` to divide `duration_minutes` into 5 equal-frequency bins.
- Add the result as a new column named `duration_equal_freq_bin`.

In [432]:
# Equal-frequency binning into 5 bins
dur_quantile_bins = pd.qcut(dur_minutes, q=5)

# Add to model dataset
df_model["duration_equal_freq_bin"] = dur_quantile_bins

df_model[["duration", "duration_equal_freq_bin"]].head()


Unnamed: 0,duration,duration_equal_freq_bin
0,90 min,"(89.0, 102.0]"
1,2 Seasons,"(720.0, 12240.0]"
2,1 Season,"(127.0, 720.0]"
3,1 Season,"(127.0, 720.0]"
4,2 Seasons,"(720.0, 12240.0]"


Describe the characteristics of each bin

- What are the bin ranges produced by equal-frequency binning?
- How many movies fall into each bin? Are they nearly equal?

In [433]:
# 1. Show the bin ranges (interval edges)
bin_ranges = dur_quantile_bins.cat.categories
print("Equal-frequency bin ranges:")
print(bin_ranges)

# 2. Count how many movies fall into each bin
bin_counts = dur_quantile_bins.value_counts().sort_index()
print("\nMovies per bin:")
print(bin_counts)


Equal-frequency bin ranges:
IntervalIndex([(2.999, 89.0], (89.0, 102.0], (102.0, 127.0], (127.0, 720.0],
               (720.0, 12240.0]],
              dtype='interval[float64, right]')

Movies per bin:
duration
(2.999, 89.0]       1838
(89.0, 102.0]       1714
(102.0, 127.0]      1757
(127.0, 720.0]      2612
(720.0, 12240.0]     883
Name: count, dtype: int64
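
The counts above are not exactly equal. A likely reason is ties: every one-season show converts to the same 720 minutes, and tied values cannot be split across a quantile boundary. The sketch below illustrates the effect on a small hypothetical Series with a heavily repeated value.

```python
import pandas as pd

# Hypothetical values: 5 is repeated many times, like 720 in the duration data.
values = pd.Series([1, 2, 3, 4, 5, 5, 5, 5, 5, 10])

# Two "equal-frequency" bins; the median edge is 5, so all the 5s stay together.
halves = pd.qcut(values, q=2)
counts = halves.value_counts().sort_index()
print(counts)
```

All tied values end up on the same side of the edge, so one bin receives far more than half of the observations.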


## 7. KNN Before & After Scaling


Create a feature matrix X using any two numeric columns and a target y (e.g., classification by genre or type). Create a train/test split.

In [434]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Feature matrix (2 numeric columns)
X = df_model[["release_year", "duration_minutes"]]

# Target
y = df_model["type"]

# Encode target
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.25, random_state=42
)

X_train.shape, X_test.shape


((6603, 2), (2201, 2))
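
When the target classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. This is an optional variation, not what the cell above does; the data below is synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 samples of class 0 and 20 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Both splits keep the 80/20 class ratio.
print(y_tr.mean(), y_te.mean())
```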

Train a KNN classifier without scaling. Store accuracy in acc_raw.

In [435]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# KNN without scaling
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)

y_pred_raw = knn_raw.predict(X_test)
acc_raw = accuracy_score(y_test, y_pred_raw)

acc_raw


1.0
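
To see why unscaled features matter for KNN, compare Euclidean distances between a few hypothetical titles described as `(release_year, duration_minutes)`. The points below are invented for illustration.

```python
import math

# Hypothetical titles as (release_year, duration_minutes).
movie = (2020, 90)
show = (2020, 720)        # same year, one season under the 12 * 60 convention
older_movie = (1990, 90)  # same duration, 30 years earlier

print(math.dist(movie, show))         # 630.0
print(math.dist(movie, older_movie))  # 30.0
```

On raw values, a one-season difference in duration outweighs a 30-year difference in release year, so duration dominates the neighbor search.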

Scale `X` using either Min–Max or Standardization, retrain KNN, and store accuracy in acc_scaled.

In [436]:
from sklearn.preprocessing import MinMaxScaler

# Scale features using Min–Max scaling
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# KNN on scaled data
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(X_train_scaled, y_train)

y_pred_scaled = knn_scaled.predict(X_test_scaled)
acc_scaled = accuracy_score(y_test, y_pred_scaled)

acc_scaled


0.9981826442526125
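
The fit-on-train, transform-on-test pattern used above can also be expressed with a scikit-learn pipeline, which applies the scaler and the classifier together. This is a sketch on synthetic data, not a rerun of the Netflix experiment; the generated features and all `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 200

# Synthetic features: class 0 gets short durations, class 1 multiples of 720.
labels = rng.integers(0, 2, size=n)
years = rng.integers(1990, 2022, size=n)
short = rng.integers(60, 180, size=n)
long_ = 720 * rng.integers(1, 5, size=n)
minutes = np.where(labels == 0, short, long_)
X_demo = np.column_stack([years, minutes])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.25, random_state=42
)

# The scaler is fitted on the training split only, inside the pipeline.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc_demo = model.score(X_te, y_te)
print(acc_demo)
```

A pipeline keeps the scaling statistics tied to the training data, which matters most when the model is later used inside cross-validation.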

Did scaling improve accuracy? Explain why.

In [437]:
print(
    "Scaling usually helps KNN because KNN relies on distances, so features with larger numeric\n"
    "ranges, such as duration in minutes, can overpower smaller ones such as release year. In this case\n"
    "scaling did not improve accuracy (1.0 before, about 0.998 after) because the classes were already\n"
    "perfectly separated by duration alone. Movies are consistently short, while TV shows convert into\n"
    "very large minute values, so KNN classifies them almost perfectly even without scaling, and scaling\n"
    "only produces a small numerical change without affecting the overall outcome."
)


Scaling usually helps KNN because KNN relies on distances, so features with larger numeric
ranges, such as duration in minutes, can overpower smaller ones such as release year. In this case
scaling did not improve accuracy (1.0 before, about 0.998 after) because the classes were already
perfectly separated by duration alone. Movies are consistently short, while TV shows convert into
very large minute values, so KNN classifies them almost perfectly even without scaling, and scaling
only produces a small numerical change without affecting the overall outcome.
