# Exercise 2

<img src="https://vsqfvsosprmjdktwilrj.supabase.co/storage/v1/object/public/images/insights/1753644539114-netflix.jpeg"/>


In this activity, you will explore two fundamental preprocessing techniques used in data science and machine learning: feature scaling and discretization (binning).

These techniques are essential when working with datasets that contain numerical values on very different scales, or continuous variables that may be more useful when grouped into categories.


We will use a subset of the Netflix Movies and TV Shows dataset, which contains metadata such as release year, duration, ratings, and other attributes of titles currently or previously available on Netflix. Although the dataset was not originally designed for numerical modeling, it contains several features suitable for preprocessing practice, such as:
- Release Year
- Duration (in minutes)
- Genre

In this worksheet, you will:
- Load and inspect the dataset
- Select numerical features for scaling
- Apply different scaling techniques
  - Min–Max Scaling
  - Standardization
  - Robust Scaling
- Perform discretization (binning)
  - Equal-width binning
  - Equal-frequency binning
- Evaluate how scaling affects machine learning performance, using a simple KNN classifier
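
As a quick orientation before working with the real data, the three scaling techniques listed above can be sketched with plain NumPy. The values are made up for illustration (with one large outlier), and the robust variant assumes the common definition of subtracting the median and dividing by the interquartile range, which is also the default behaviour of scikit-learn's `RobustScaler`.

```python
import numpy as np

# Made-up durations in minutes; 8500 plays the role of an outlier.
x = np.array([60.0, 90.0, 100.0, 120.0, 8500.0])

# Min-Max scaling: map the smallest value to 0 and the largest to 1.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit (population) standard deviation.
x_zscore = (x - x.mean()) / x.std()

# Robust scaling: center on the median, scale by the interquartile range.
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - median) / (q3 - q1)

for name, values in [("min-max", x_minmax), ("z-score", x_zscore), ("robust", x_robust)]:
    print(f"{name:8s}", np.round(values, 3))
```

With the outlier present, Min–Max squeezes the four typical values into a narrow band near 0, whereas the robust version keeps them spread around 0.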

In [218]:
import pandas as pd
import os
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub


## 1. Setup and Data Loading



Load the Netflix dataset into a DataFrame named df.

In [219]:

# Download latest version
path = kagglehub.dataset_download("shivamb/netflix-shows")

print("Path to dataset files:", path)


if os.path.isdir(path):
  print(True)

contents = os.listdir(path)
contents

# Build the file path portably; assumes the first entry in the folder is the CSV file
mydataset = os.path.join(path, contents[0])
mydataset


df = pd.read_csv(mydataset)

Using Colab cache for faster access to the 'netflix-shows' dataset.
Path to dataset files: /kaggle/input/netflix-shows
True


## 2. Data Understanding

Store the dataset’s column names in a variable called cols.

In [220]:
cols = df.columns

Store the shape of the dataset as a tuple (rows, columns) in shape_info.

In [221]:
shape_info = df.shape

## 3. Data Cleaning
Count missing values per column and save to missing_counts.

In [222]:
missing_counts = df.isnull().sum()

Drop rows where duration is missing. Save to df_clean.

In [223]:
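# .copy() makes df_clean an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
df_clean = df.dropna(subset=['duration']).copy()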

## 4. Selecting Relevant Numeric Features

Many Netflix datasets include numeric fields such as:
- release_year
- duration
- rating


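Create a DataFrame `df_num` containing only the columns that pandas already stores with a numeric dtype. Note that `duration` is stored as text (for example `90 min` or `2 Seasons`), so it is not selected here and is parsed in the next section.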

In [224]:
df_num = df_clean.select_dtypes(include='number')
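
To see which columns `select_dtypes(include='number')` keeps, here is a small sketch on a hand-made frame that mimics the layout of the Netflix data. The rows are invented, and storing `rating` as text is an assumption of this toy frame; only the column names come from the worksheet.

```python
import pandas as pd

# Invented rows; 'duration' and 'rating' are stored as text in this toy frame.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [2020, 2021, 2019],
    "duration": ["90 min", "2 Seasons", "104 min"],
    "rating": ["PG-13", "TV-MA", "R"],
})

# Only columns with a numeric dtype survive the selection.
toy_num = toy.select_dtypes(include="number")
print(list(toy_num.columns))
```

Text columns are dropped even when they contain digits, which is why `duration` has to be converted explicitly before scaling.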

## 5. Feature Scaling

Focus on a single numeric column (e.g., duration).


Extract the column duration into a Series named `dur`.

In [225]:
# Let's assume, for the purpose of alignment, that one season is approximately equivalent to 500 minutes (e.g., 10 episodes * 50 minutes/episode).

df_clean['seasons_in_minutes'] = pd.NA

seasons_mask = df_clean['duration'].str.contains('season', case=False, na=False)

df_clean.loc[seasons_mask, 'seasons_in_minutes'] = df_clean.loc[seasons_mask, 'duration'].str.extract(r'(\d+)')[0].astype(int) * 500

print("Rows with 'seasons_in_minutes' populated:")
print(df_clean[seasons_mask][['duration', 'seasons_in_minutes']].head())

Rows with 'seasons_in_minutes' populated:
    duration seasons_in_minutes
1  2 Seasons               1000
2   1 Season                500
3   1 Season                500
4  2 Seasons               1000
5   1 Season                500




In [226]:
df_clean['minutes_duration_int'] = pd.NA

minutes_mask = df_clean['duration'].str.contains('min', case=False, na=False)

df_clean.loc[minutes_mask, 'minutes_duration_int'] = df_clean.loc[minutes_mask, 'duration'].str.extract(r'(\d+)')[0].astype(int)

print("Rows with 'minutes_duration_int' populated:")
print(df_clean[minutes_mask][['duration', 'minutes_duration_int']].head())

Rows with 'minutes_duration_int' populated:
   duration minutes_duration_int
0    90 min                   90
6    91 min                   91
7   125 min                  125
9   104 min                  104
12  127 min                  127




In [227]:
dur = df_clean['minutes_duration_int'].fillna(df_clean['seasons_in_minutes'])

print("First 5 values of 'dur' Series:")
print(dur.head())
print("Type of 'dur':", type(dur))
print("Number of non-null values in 'dur':", dur.count())

First 5 values of 'dur' Series:
0      90
1    1000
2     500
3     500
4    1000
Name: minutes_duration_int, dtype: int64
Type of 'dur': <class 'pandas.core.series.Series'>
Number of non-null values in 'dur': 8804




In [228]:
dur = df_clean['minutes_duration_int'].fillna(df_clean['seasons_in_minutes']).infer_objects(copy=False)

print("First 5 values of 'dur' Series:")
print(dur.head())
print("Type of 'dur':", type(dur))
print("Number of non-null values in 'dur':", dur.count())

First 5 values of 'dur' Series:
0      90
1    1000
2     500
3     500
4    1000
Name: minutes_duration_int, dtype: int64
Type of 'dur': <class 'pandas.core.series.Series'>
Number of non-null values in 'dur': 8804
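
The conversion steps above (minutes for movies, seasons times 500 for TV shows) can also be sketched as a single vectorized pass. The helper name `duration_to_minutes` and the constant `MINUTES_PER_SEASON` are hypothetical, introduced only for this illustration; the 500-minute figure is the same rough assumption used in the worksheet.

```python
import pandas as pd

MINUTES_PER_SEASON = 500  # same rough assumption as in the worksheet


def duration_to_minutes(duration: pd.Series) -> pd.Series:
    """Convert strings like '90 min' or '2 Seasons' to minutes."""
    # Leading integer of each string ('90 min' -> 90, '2 Seasons' -> 2).
    number = pd.to_numeric(duration.str.extract(r"(\d+)")[0])
    # True where the string describes seasons rather than minutes.
    is_seasons = duration.str.contains("season", case=False, na=False)
    # Keep minutes as they are; replace season counts by the per-season estimate.
    return number.where(~is_seasons, number * MINUTES_PER_SEASON)


sample = pd.Series(["90 min", "2 Seasons", "1 Season", "125 min"])
print(duration_to_minutes(sample).tolist())
```

Working on a plain Series and returning a new one avoids assigning into `df_clean` step by step, so no intermediate object-dtype columns are created.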




Apply Min–Max Scaling to `dur`. Store the result as `dur_minmax`.

In [229]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
dur_minmax = pd.Series(scaler.fit_transform(dur.values.reshape(-1, 1)).flatten(), name='duration_minmax', index=dur.index)

print("First 5 values of dur_minmax:")
print(dur_minmax.head())
print("Min value:", dur_minmax.min())
print("Max value:", dur_minmax.max())

First 5 values of dur_minmax:
0    0.010239
1    0.117336
2    0.058491
3    0.058491
4    0.117336
Name: duration_minmax, dtype: float64
Min value: 0.0
Max value: 1.0
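
The scaled values above follow directly from the Min–Max formula `(x - min) / (max - min)`. The sketch below reproduces them by hand, assuming the minimum and maximum of `dur` are 3 and 8500 minutes. These extremes are not printed in this cell; they are an assumption consistent with the bin ranges reported later in this worksheet.

```python
import numpy as np

# Assumed extremes of `dur` in minutes (not printed in the cell above).
DUR_MIN, DUR_MAX = 3, 8500


def min_max(x, lo=DUR_MIN, hi=DUR_MAX):
    """Rescale x linearly so that lo maps to 0 and hi maps to 1."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)


# First five durations shown earlier in the worksheet.
scaled = min_max([90, 1000, 500, 500, 1000])
print(np.round(scaled, 6))
```

Because one title reaches 8500 minutes, a typical 90-minute movie ends up at roughly 0.01, so most of the [0, 1] range is left almost empty.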


Apply Z-score Standardization to `dur`. Store in `dur_zscore`.

In [230]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
dur_zscore = pd.Series(scaler.fit_transform(dur.values.reshape(-1, 1)).flatten(), name='duration_zscore', index=dur.index)

print("First 5 values of dur_zscore:")
print(dur_zscore.head())
print("Mean value (should be close to 0):", dur_zscore.mean())
print("Standard deviation (should be close to 1):", dur_zscore.std())

First 5 values of dur_zscore:
0   -0.437240
1    1.170126
2    0.286958
3    0.286958
4    1.170126
Name: duration_zscore, dtype: float64
Mean value (should be close to 0): -3.22827231149523e-18
Standard deviation (should be close to 1): 1.0000567972056422
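
The reported standard deviation is slightly above 1 because of a difference in conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `Series.std()` defaults to the sample standard deviation (`ddof=1`). A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

# Invented durations; any non-constant values show the same effect.
s = pd.Series([90.0, 1000.0, 500.0, 500.0, 1000.0, 125.0])

# Standardize with the population standard deviation, as StandardScaler does.
z = (s - s.mean()) / s.std(ddof=0)

n = len(s)
print("population std of z:", z.std(ddof=0))
print("sample std of z:    ", z.std())  # pandas default is ddof=1
print("sqrt(n / (n - 1)):  ", np.sqrt(n / (n - 1)))
```

With n = 8804 rows, sqrt(8804 / 8803) is about 1.0000568, which matches the value printed above.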


## 6. Discretization (Binning)
Apply equal-width binning to `dur` with 5 bins. Store the result as `dur_width_bins`.


- Use `pandas.cut()` to divide `dur` into 5 equal-width bins.
- Add the resulting bins as a new column named `duration_equal_width_bin`.

In [231]:
dur_width_bins = pd.cut(dur, bins=5, labels=False, include_lowest=True)

df_clean.loc[:, 'duration_equal_width_bin'] = dur_width_bins

print("First 5 values of dur_width_bins:")
print(dur_width_bins.head())
print("Bin counts:")
print(dur_width_bins.value_counts().sort_index())

First 5 values of dur_width_bins:
0    0
1    0
2    0
3    0
4    0
Name: minutes_duration_int, dtype: int64
Bin counts:
minutes_duration_int
0    8545
1     193
2      56
3       7
4       3
Name: count, dtype: int64




Describe the characteristics of each bin:

- What are the bin edges produced by equal-width binning?
- How many titles fall into each bin?

In [232]:
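# Re-run the cut without integer labels so that each value carries its interval
bins_info = pd.cut(dur, bins=5, include_lowest=True)
unique_bins = bins_info.cat.categories  # IntervalIndex holding the 5 bins

print("Bin edges (Intervals) and counts:")
print(bins_info.value_counts().sort_index())

print("\nExplicit bin edges:")
for i, interval in enumerate(unique_bins):
    print(f"Bin {i}: {interval.left:.2f} to {interval.right:.2f}")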

Bin edges (Intervals) and counts:
minutes_duration_int
(-5.498, 1702.4]    8545
(1702.4, 3401.8]     193
(3401.8, 5101.2]      56
(5101.2, 6800.6]       7
(6800.6, 8500.0]       3
Name: count, dtype: int64

Explicit bin edges:
Bin 0: -5.50 to 1702.40
Bin 1: 1702.40 to 3401.80
Bin 2: 3401.80 to 5101.20
Bin 3: 5101.20 to 6800.60
Bin 4: 6800.60 to 8500.00
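
The edges above can be reproduced by hand: equal-width binning splits the range from the minimum to the maximum into equally wide intervals. The sketch assumes a minimum of 3 and a maximum of 8500 minutes, read off the reported intervals. For right-closed bins, `pandas.cut` additionally lowers the first edge by 0.1% of the range so that the minimum falls inside the first bin, which is why the first interval starts slightly below the minimum.

```python
import numpy as np

# Assumed extremes of `dur`, read from the intervals printed above.
lo, hi, n_bins = 3, 8500, 5

# n_bins equal-width bins need n_bins + 1 evenly spaced edges.
edges = np.linspace(lo, hi, n_bins + 1)
width = (hi - lo) / n_bins

# pandas.cut lowers the first edge by 0.1% of the range (right-closed bins).
adjusted_first_edge = lo - 0.001 * (hi - lo)

print("width:", width)
print("edges:", edges)
print("adjusted first edge:", adjusted_first_edge)
```

Every bin is about 1699 minutes wide, so nearly all titles fall into the first one; equal-width bins are sensitive to a few extreme values.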


Apply equal-frequency binning to `dur` with 5 bins. Store the result as `dur_quantile_bins`.

- Use `pandas.qcut()` to divide `dur` into 5 equal-frequency bins.
- Add the result as a new column named `duration_equal_freq_bin`.

In [233]:
# Named dur_quantile_bins so that the lines below can reference it
dur_quantile_bins = pd.qcut(dur, q=5, labels=False, duplicates='drop')

df_clean = df_clean.copy()
df_clean.loc[:, 'duration_equal_freq_bin'] = dur_quantile_bins

print("First 5 values of dur_quantile_bins:")
print(dur_quantile_bins.head())
print("Bin counts:")
print(dur_quantile_bins.value_counts().sort_index())

First 5 values of dur_quantile_bins:
0    1
1    4
2    3
3    3
4    4
Name: minutes_duration_int, dtype: int64
Bin counts:
minutes_duration_int
0    1838
1    1714
2    1757
3    2612
4     883
Name: count, dtype: int64


Describe the characteristics of each bin:

- What are the bin ranges produced by equal-frequency binning?
- How many titles fall into each bin? Are the counts nearly equal?

In [234]:
bins_info_qcut = pd.qcut(dur, q=5, duplicates='drop')

print("Bin ranges (Intervals) and counts:")
print(bins_info_qcut.value_counts().sort_index())

print("\nAre the bin counts nearly equal? A perfect equal-frequency binning would have each bin containing approximately 8804 / 5 = 1760.8 entries.")
print("The counts are: ", bins_info_qcut.value_counts().sort_index().values)


Bin ranges (Intervals) and counts:
minutes_duration_int
(2.999, 89.0]      1838
(89.0, 102.0]      1714
(102.0, 127.0]     1757
(127.0, 500.0]     2612
(500.0, 8500.0]     883
Name: count, dtype: int64

Are the bin counts nearly equal? A perfect equal-frequency binning would have each bin containing approximately 8804 / 5 = 1760.8 entries.
The counts are:  [1838 1714 1757 2612  883]
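
The counts are not equal because many titles share exactly the same duration (for example, every one-season show is mapped to 500 minutes), and tied values always land in the same bin. A small sketch with invented values shows the effect; here the ties even make several quantile edges coincide, which `duplicates='drop'` resolves by merging bins.

```python
import pandas as pd

# Invented durations with a heavy tie at 500 minutes.
values = pd.Series([80, 90, 100, 500, 500, 500, 500, 500, 500, 1000])

# Ask for 5 equal-frequency bins; coinciding edges are dropped.
binned = pd.qcut(values, q=5, duplicates="drop")
counts = binned.value_counts().sort_index()

print("Number of bins actually produced:", len(binned.cat.categories))
print(counts)
```

Equal-frequency binning can only be as balanced as the data allows: a value that occurs many times cannot be split across bins.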


## 7. KNN Before & After Scaling


Create a feature matrix X using any two numeric columns and a target y (e.g., classification by genre or type). Create a train/test split.

In [235]:
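from sklearn.model_selection import train_test_split

X = pd.DataFrame({
    'release_year': df_clean['release_year'],
    'duration': dur
})
y = df_clean['type']

# Hold out a test set. The split settings (20% test, fixed seed, stratified)
# are an assumption; the original split is not shown in the worksheet.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("First 5 rows of X:")
print(X.head())
print("\nFirst 5 values of y:")
print(y.head())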

First 5 rows of X:
   release_year  duration
0          2020        90
1          2021      1000
2          2021       500
3          2021       500
4          2021      1000

First 5 values of y:
0      Movie
1    TV Show
2    TV Show
3    TV Show
4    TV Show
Name: type, dtype: object


Train a KNN classifier without scaling. Store accuracy in acc_raw.

In [236]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred_raw = knn.predict(X_test)

acc_raw = accuracy_score(y_test, y_pred_raw)

print(f"Accuracy of KNN on raw data: {acc_raw:.4f}")

Accuracy of KNN on raw data: 1.0000


Scale `X` using either Min–Max or Standardization, retrain KNN, and store accuracy in acc_scaled.

In [237]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

knn_scaled = KNeighborsClassifier(n_neighbors=5)

knn_scaled.fit(X_train_scaled_df, y_train)

y_pred_scaled = knn_scaled.predict(X_test_scaled_df)

acc_scaled = accuracy_score(y_test, y_pred_scaled)

print(f"Accuracy of KNN on scaled data: {acc_scaled:.4f}")

Accuracy of KNN on scaled data: 0.9989
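
The scale-then-fit steps above can also be wrapped in a scikit-learn pipeline, which ensures that the scaler is fitted on the training split only. The sketch below uses synthetic data (invented years and durations, not the Netflix file), and the split settings and variable names are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200  # synthetic titles per class

# Synthetic movies: roughly 100 minutes. Synthetic shows: 1-5 seasons * 500.
movies = pd.DataFrame({
    "release_year": rng.integers(1990, 2022, n),
    "duration": rng.normal(100, 20, n).round(),
    "type": "Movie",
})
shows = pd.DataFrame({
    "release_year": rng.integers(1990, 2022, n),
    "duration": rng.integers(1, 6, n) * 500,
    "type": "TV Show",
})
data = pd.concat([movies, shows], ignore_index=True)

X_demo = data[["release_year", "duration"]]
y_demo = data["type"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses it for X_te.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
acc_demo = model.score(X_te, y_te)
print(f"Pipeline accuracy on synthetic data: {acc_demo:.4f}")
```

Keeping the scaler inside the pipeline also makes cross-validation safer, because each fold refits the scaler on its own training portion.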


Did scaling improve accuracy? Explain why.

No. The unscaled model reached an accuracy of 1.0000, while the version scaled with StandardScaler dropped slightly to 0.9989. This is unusual, because scaling typically improves KNN performance when features have very different ranges: KNN is a distance-based algorithm, so features with larger numerical scales dominate the distance calculation.

In this dataset, duration has a much larger range (reaching thousands of minutes for TV shows under the 500-minutes-per-season assumption) than release_year, which spans only about a century. The extremely high unscaled accuracy suggests that duration alone is enough to separate movies from TV shows, because the two categories differ sharply in typical duration. As a result, KNN relied almost entirely on the unscaled duration feature and produced a perfect score.

After scaling, duration and release_year have comparable influence on the distance computation. However, release_year does not meaningfully distinguish the two classes, so giving it comparable weight introduced weakly relevant information. This slightly reduced the model's accuracy compared with letting duration dominate the decision.
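
To make the dominance argument concrete, the sketch below takes the first two rows of `X` shown earlier, (2020, 90) and (2021, 1000), and measures how much of the squared Euclidean distance comes from each feature. Only NumPy is used; nothing here depends on scikit-learn's KNN implementation.

```python
import numpy as np

# First two rows of X as printed earlier: (release_year, duration).
a = np.array([2020.0, 90.0])
b = np.array([2021.0, 1000.0])

# Per-feature contribution to the squared Euclidean distance.
sq_diff = (a - b) ** 2
share = sq_diff / sq_diff.sum()

print("squared differences:", sq_diff)
print("share from release_year:", share[0])
print("share from duration:    ", share[1])
```

On the raw features, almost the entire distance between these two titles comes from duration, so release_year barely affects which neighbours are chosen.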
