
# Lab: Data Transformation – Scaling, Encoding, Binning & Handling Skew

In this lab you will practice:

1. Scaling numeric data
2. Encoding categorical data with OneHotEncoder
3. Binning / discretising numeric data
4. Handling skew and long tails

Dataset: `data_transformation.csv` from the Lecture Examples

## 0. Setup and data loading

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import your dataset: load the Week 9 CSV (adjust the path if your file is elsewhere)
df = pd.read_csv('/content/data_transformation.csv')

In [3]:
# Inspect before you transform
print(df.shape)             # rows, columns
print(df.info())            # types & non-null counts; info() prints itself and returns None, hence the trailing "None"
print(df.describe(include='all'))  # quick summary
print(df.isnull().sum())    # missing values check


(100, 9)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   UserID          100 non-null    int64  
 1   Age             100 non-null    int64  
 2   Income          100 non-null    float64
 3   SignupDate      100 non-null    object 
 4   City            100 non-null    object 
 5   Device          100 non-null    object 
 6   PurchaseAmount  100 non-null    float64
 7   ReviewText      100 non-null    object 
 8   Balance         100 non-null    float64
dtypes: float64(3), int64(2), object(4)
memory usage: 7.2+ KB
None
            UserID         Age        Income  SignupDate     City  Device  \
count   100.000000  100.000000    100.000000         100      100     100   
unique         NaN         NaN           NaN         100        8       3   
top            NaN         NaN           NaN  2021-01-01  Chicago  Mobile   
freq           NaN

## 1. Scaling numeric data

In [5]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Choose the numeric features we want to scale (as in the lab: Age, Income, PurchaseAmount, Balance)
num_cols = ['Age', 'Income', 'PurchaseAmount', 'Balance']

# Create plots to view the frequency distributions of 'Age', 'Income', 'PurchaseAmount', 'Balance'
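
One way to approach the plotting task above is a row of histograms, one per column. This is a minimal sketch rather than the lab's reference solution: the helper name `plot_histograms` and the synthetic data frame are assumptions made so that the block runs on its own. In the notebook you would pass the real `df` and `num_cols` instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_histograms(frame, cols, bins=20):
    """Draw one histogram per column (hypothetical helper, not part of the lab)."""
    fig, axes = plt.subplots(1, len(cols), figsize=(4 * len(cols), 3), squeeze=False)
    for ax, col in zip(axes[0], cols):
        ax.hist(frame[col].dropna(), bins=bins, edgecolor="black")
        ax.set_title(col)
        ax.set_xlabel(col)
        ax.set_ylabel("Frequency")
    fig.tight_layout()
    return fig


# Synthetic stand-in for the lab data (assumption); use the real df in the notebook
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=100),
    "Income": rng.normal(55_000, 15_000, size=100),
    "PurchaseAmount": rng.normal(150, 60, size=100),
    "Balance": rng.exponential(80, size=100),
})

fig = plot_histograms(demo, ["Age", "Income", "PurchaseAmount", "Balance"])
plt.show()
```

Looking at the shapes before scaling helps when choosing a scaler: a long right tail, for example, squeezes most min-max-scaled values towards 0.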

In [6]:
# Create Standard scaling columns for 'Age', 'Income', 'PurchaseAmount', 'Balance'
# Standardisation (mean = 0, std = 1); good when features have different scales and we want z-scores
std = StandardScaler()
df[[c + '_std' for c in num_cols]] = std.fit_transform(df[num_cols])

In [7]:
# Create Min-max scaling columns for 'Age', 'Income', 'PurchaseAmount', 'Balance'
# Min-Max scaling (map to 0..1 range); simple normalisation often used for models needing bounded inputs
mm = MinMaxScaler()
df[[c + '_mm' for c in num_cols]] = mm.fit_transform(df[num_cols])

df.head()


Unnamed: 0,UserID,Age,Income,SignupDate,City,Device,PurchaseAmount,ReviewText,Balance,Age_std,Income_std,PurchaseAmount_std,Balance_std,Age_mm,Income_mm,PurchaseAmount_mm,Balance_mm
0,1,62,77728.88,2021-01-01,new york,Desktop,175.7,Average.,37.54,1.321548,1.340407,0.372345,-0.486784,0.862745,0.771738,0.529003,0.107136
1,2,65,33930.72,2021-01-11,NYC,Mobile,140.09,Highly recommend.,240.81,1.515608,-1.164528,-0.142748,2.289606,0.921569,0.250232,0.416437,0.694129
2,3,18,74184.21,2021-01-21,new york,Mobile,151.77,Love it!,105.34,-1.524664,1.137678,0.026202,0.439271,0.0,0.729531,0.453359,0.302925
3,4,21,42401.13,2021-01-31,Chicago,Desktop,124.3,Highly recommend.,73.04,-1.330604,-0.680082,-0.371147,-0.001903,0.058824,0.35109,0.366524,0.209651
4,5,21,57593.42,2021-02-10,LA,Tablet,205.41,Average.,13.57,-1.330604,0.188806,0.802095,-0.814182,0.058824,0.531984,0.622918,0.037916


## 2. Encoding categorical data

In [8]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Inspect the values for 'City' and clean to fix any inconsistent data
# Verify by plotting the frequency distribution

# Clean City names: strip spaces, Title Case, unify common short forms (NYC, LA, SF)
df['City'] = df['City'].str.strip().str.title()
df['City'] = df['City'].replace({'Nyc': 'New York', 'La': 'Los Angeles', 'Sf': 'San Francisco'})
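
To verify the cleaning, compare the category counts before and after. The sketch below uses a handful of made-up raw values (an assumption, since the real `City` column is not shown in full) to illustrate the kinds of inconsistency the two cleaning lines handle.

```python
import pandas as pd

# Hypothetical raw values illustrating the kinds of inconsistency being cleaned
city = pd.Series([" new york", "NYC", "Chicago ", "LA", "los angeles", "SF", "San Francisco"])

print(city.value_counts())          # before: several spellings per city

cleaned = city.str.strip().str.title()
cleaned = cleaned.replace({"Nyc": "New York", "La": "Los Angeles", "Sf": "San Francisco"})

counts = cleaned.value_counts()
print(counts)                       # after: one label per city
# In a notebook, counts.plot(kind="bar") draws the frequency distribution
```

The replacement keys are `'Nyc'`, `'La'` and `'Sf'` rather than the upper-case forms because `.str.title()` has already been applied by that point.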


In [9]:
# One-hot encode 'City' and 'Device' using scikit-learn (recommended in lecture over get_dummies)
# handle_unknown='ignore' allows unseen categories in validation/test; combined with drop,
# an unseen category is encoded as all zeros, the same as the dropped first category
# drop='first' reduces perfect collinearity (dummy variable trap) for linear models
# sparse_output=False returns a dense array (parameter name used in scikit-learn >= 1.2)
enc = OneHotEncoder(handle_unknown='ignore', drop='first', sparse_output=False)
X_enc = enc.fit_transform(df[['City', 'Device']])
enc_cols = enc.get_feature_names_out(['City', 'Device'])

# Join encoded columns back to df (pandas and numpy are already imported in the setup cell)
df_ohe = pd.DataFrame(X_enc, columns=enc_cols, index=df.index)
df = pd.concat([df.drop(columns=['City','Device']), df_ohe], axis=1)

df.head()


Unnamed: 0,UserID,Age,Income,SignupDate,PurchaseAmount,ReviewText,Balance,Age_std,Income_std,PurchaseAmount_std,Balance_std,Age_mm,Income_mm,PurchaseAmount_mm,Balance_mm,City_Los Angeles,City_New York,City_San Francisco,Device_Mobile,Device_Tablet
0,1,62,77728.88,2021-01-01,175.7,Average.,37.54,1.321548,1.340407,0.372345,-0.486784,0.862745,0.771738,0.529003,0.107136,0.0,1.0,0.0,0.0,0.0
1,2,65,33930.72,2021-01-11,140.09,Highly recommend.,240.81,1.515608,-1.164528,-0.142748,2.289606,0.921569,0.250232,0.416437,0.694129,0.0,1.0,0.0,1.0,0.0
2,3,18,74184.21,2021-01-21,151.77,Love it!,105.34,-1.524664,1.137678,0.026202,0.439271,0.0,0.729531,0.453359,0.302925,0.0,1.0,0.0,1.0,0.0
3,4,21,42401.13,2021-01-31,124.3,Highly recommend.,73.04,-1.330604,-0.680082,-0.371147,-0.001903,0.058824,0.35109,0.366524,0.209651,0.0,0.0,0.0,0.0,0.0
4,5,21,57593.42,2021-02-10,205.41,Average.,13.57,-1.330604,0.188806,0.802095,-0.814182,0.058824,0.531984,0.622918,0.037916,1.0,0.0,0.0,0.0,1.0


## 3. Binning / Discretising numeric data

In [12]:
# Create 4 Equal-width bins for 'Age' in a feature called 'AgeGroup'


# Binning converts continuous values to labelled buckets.
# We demonstrate both fixed bins (equal width) and quantile bins (Q1..Q4).

# Age: 4 equal-width bins between min..max
df['AgeGroup'] = pd.cut(df['Age'], bins=4, labels=['Bin1','Bin2','Bin3','Bin4'])

# Income: 4 quantile-based bins (each ~25% of data)
df['IncomeQuartile'] = pd.qcut(df['Income'], q=4, labels=['Q1','Q2','Q3','Q4'])

# Quick check
df[['Age','AgeGroup','Income','IncomeQuartile']].head()


Unnamed: 0,Age,AgeGroup,Income,IncomeQuartile
0,62,Bin4,77728.88,Q4
1,65,Bin4,33930.72,Q1
2,18,Bin1,74184.21,Q4
3,21,Bin1,42401.13,Q1
4,21,Bin1,57593.42,Q3
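
The difference between the two binning methods is easiest to see in the bin edges and the counts per bin. The sketch below uses eight illustrative ages (an assumption, not the full lab column); `retbins=True` asks `pd.cut` to return the edges it computed.

```python
import pandas as pd

ages = pd.Series([18, 21, 24, 30, 42, 55, 62, 65])  # illustrative values, not the lab column

# Equal-width bins: edges are evenly spaced between min and max
age_group, edges = pd.cut(ages, bins=4, labels=["Bin1", "Bin2", "Bin3", "Bin4"], retbins=True)
print(edges)
print(age_group.value_counts(sort=False))   # counts per bin can be uneven

# Quantile bins: edges are chosen so each bin holds a similar number of rows
quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(quartile.value_counts(sort=False))
```

Equal-width bins keep the boundaries easy to interpret, while quantile bins avoid nearly empty buckets when the data are skewed.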


In [None]:
# Create 4 Quantile-based bins for 'Income' in a feature called 'IncomeQuartile' (already done with pd.qcut in the previous cell)

## 4. Handling skew and long tails

In [13]:
# Plot and inspect the skew for 'Balance'

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Measure skew first
print("Skew(Balance) before:", df['Balance'].skew())


Skew(Balance) before: 1.5227119564622034
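
For reference, the number pandas reports is the adjusted Fisher-Pearson sample skewness: the third central moment divided by the second central moment to the power 1.5, multiplied by a small-sample correction. The helper `sample_skew` below is a hypothetical name for a hand-written version, checked against `.skew()` on the five `Balance` values visible earlier in the notebook.

```python
import numpy as np
import pandas as pd


def sample_skew(values):
    """Adjusted Fisher-Pearson skewness, the estimator pandas' .skew() reports."""
    x = np.asarray(values, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second central moment
    m3 = np.mean(dev ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5      # uncorrected skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1   # small-sample correction


# The five Balance values from the df.head() output earlier in the notebook
balance = pd.Series([37.54, 240.81, 105.34, 73.04, 13.57])
print(sample_skew(balance), balance.skew())
```

A positive value means the right tail is longer, a negative value means the left tail is longer, and 0 indicates symmetry.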


In [14]:
# Create Log and sqrt transforms for 'Balance' and plot

# Log transform: compress the right tail; add a small constant to handle zeros safely.
eps = 1e-6
df['Balance_log'] = np.log(df['Balance'] + eps)

# Square root transform: milder than log; useful for moderate skew.
df['Balance_sqrt'] = np.sqrt(df['Balance'].clip(lower=0))
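
If a column does contain exact zeros, the choice of constant matters: `log(0 + 1e-6)` is about -13.8, far from the other values, whereas `np.log1p(x)` (that is, `log(1 + x)`) maps zero to zero. The sketch below uses illustrative values with a zero added on purpose; it is not a claim about the lab's `Balance` column.

```python
import numpy as np

balance = np.array([0.0, 0.5, 13.57, 105.34, 240.81])  # illustrative; includes a zero on purpose

eps = 1e-6
log_eps = np.log(balance + eps)   # the zero maps to log(1e-6), about -13.8
log_1p = np.log1p(balance)        # the zero maps to exactly 0

print(log_eps)
print(log_1p)
```

With `log1p` the transformed zero stays next to the other small values instead of becoming an outlier on the left.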


In [15]:
# Create Yeo-Johnson transform for 'Balance' and plot

# Yeo-Johnson: handles zeros and negatives; estimates the power parameter that makes the data most Gaussian-like.
pt = PowerTransformer(method='yeo-johnson')
# fit_transform returns a 2-D array of shape (n, 1); flatten it before assigning to a single column
df['Balance_yj'] = pt.fit_transform(df[['Balance']]).ravel()

print("Skew(log):", df['Balance_log'].skew())
print("Skew(sqrt):", df['Balance_sqrt'].skew())
print("Skew(yj):", df['Balance_yj'].skew())


Skew(log): -0.7759543386036094
Skew(sqrt): 0.5566932630806655
Skew(yj): -0.05981298218649103


Log transform (Balance_log) → skew −0.776

The log compressed the right tail so strongly that it over-corrected, turning the right skew into a moderate left skew.


Square root transform (Balance_sqrt) → skew +0.557

Sqrt is a milder compression than log. It reduced the right tail but still leaves a moderate right skew.


Yeo‑Johnson (Balance_yj) → skew −0.060

Very close to 0. This is the most symmetric of the three, which is generally what we want when preparing data for many ML algorithms.



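Bottom line: For this dataset, Yeo-Johnson brought the skew closest to 0 (−0.060, compared with 1.523 before transforming), so it is the best of the three options here. On other data, compare the skew after each transform before choosing one.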