Describe at least two common data transformation techniques
1. Normalization (Scaling values between 0 and 1)
Normalization is used to rescale numeric data so that values lie in a specific range, often 0 to 1.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Sample data
data = pd.DataFrame({'Profit': [200, 500, 1000, 1500, 3000]})
# Initialize scaler
scaler = MinMaxScaler()
# Apply normalization
data['Profit_Normalized'] = scaler.fit_transform(data[['Profit']])
print(data)

   Profit  Profit_Normalized
0     200           0.000000
1     500           0.107143
2    1000           0.285714
3    1500           0.464286
4    3000           1.000000


Use: Helps when combining features with different scales for modeling.
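
As a cross-check, the scaler's output can be reproduced by hand with the min-max formula (x - min) / (max - min). The following is a minimal sketch that reuses the Profit values from the cell above; the column name Profit_Manual is introduced here only for illustration.

```python
import pandas as pd

# Same sample values as in the normalization example
data = pd.DataFrame({'Profit': [200, 500, 1000, 1500, 3000]})

# Min-max formula: (x - min) / (max - min)
col_min = data['Profit'].min()
col_max = data['Profit'].max()
data['Profit_Manual'] = (data['Profit'] - col_min) / (col_max - col_min)

print(data)
```

The smallest value maps to 0 and the largest to 1, so a single extreme value (such as 3000 here) compresses the remaining values toward the lower end of the range.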

2. Log Transformation

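Log transformation is used to reduce right skew in data and to compress large value ranges. The natural log is only defined for positive values, so the column must not contain zeros or negative numbers.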

In [2]:
import numpy as np
# Sample data
data = pd.DataFrame({'Sales': [10, 50, 200, 1000, 5000]})
# Apply log transformation
data['Sales_Log'] = np.log(data['Sales'])
print(data)

   Sales  Sales_Log
0     10   2.302585
1     50   3.912023
2    200   5.298317
3   1000   6.907755
4   5000   8.517193


Use: Brings highly skewed data closer to a normal distribution, which can improve analysis or modeling.
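
When a column may contain zeros, np.log returns -inf for them. One common workaround is np.log1p, which computes log(1 + x). The sketch below uses a hypothetical Sales sample with a zero added (not part of the original data) and compares skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that includes a zero, where np.log would return -inf
sales = pd.DataFrame({'Sales': [0, 10, 50, 200, 1000, 5000]})

# log1p computes log(1 + x), so zero maps to zero
sales['Sales_Log1p'] = np.log1p(sales['Sales'])

# expm1 is the inverse of log1p and recovers the original values
sales['Sales_Restored'] = np.expm1(sales['Sales_Log1p'])

# Skewness before and after the transformation
print(sales['Sales'].skew(), sales['Sales_Log1p'].skew())
```

Keeping the inverse (np.expm1) at hand is useful when model predictions made on the log scale need to be reported in the original units.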

In [2]:
import pandas as pd
titanic_data = {
    "PassengerId": [1,2,3,4,5,6,7,8,9,10],
    "Name": ["John Smith", "Jane Doe", "William Brown", "Emily Davis", "Michael Johnson",
             "Linda Wilson", "Robert Taylor", "Patricia Martinez", "David Anderson", "Barbara Thomas"],
    "Age": [22, 38, 26, 35, 35, 27, 54, 2, 27, 14],
    "Sex": ["male", "female", "male", "female", "male", "female", "male", "female", "male", "female"],
    "Pclass": [3,1,3,1,3,3,1,3,3,2],
    "Survived": [0,1,1,1,0,0,0,0,0,1],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
    "Embarked": ["S","C","S","S","S","Q","S","S","S","C"]
}
df = pd.DataFrame(titanic_data)

df.head()


   PassengerId             Name  Age     Sex  Pclass  Survived     Fare  Embarked
0            1       John Smith   22    male       3         0     7.25         S
1            2         Jane Doe   38  female       1         1  71.2833         C
2            3    William Brown   26    male       3         1    7.925         S
3            4      Emily Davis   35  female       1         1     53.1         S
4            5  Michael Johnson   35    male       3         0     8.05         S


1. Label Encoding (Mapping Categories to Numbers)
Explanation:
Label encoding converts categorical values into numeric codes.
This is useful for machine learning models that require numeric input.
Example: Converting the Sex column from 'male'/'female' to 0/1.

In [8]:
import pandas as pd
# Display unique values before encoding
print(df['Sex'].unique())

['male' 'female']


In [9]:
# Label Encoding using map
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
# Check results
df['Sex'].head()

0    0
1    1
2    0
3    1
4    0
Name: Sex, dtype: int64
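
One caveat with map: any value missing from the dictionary becomes NaN, so rerunning the cell above on an already-encoded column would turn every entry into NaN. The sketch below shows one way to detect unmapped values before overwriting a column; the sample Series and its 'unknown' entry are hypothetical and not part of the dataset above.

```python
import pandas as pd

# Hypothetical column with inconsistent casing and an unexpected value
sex = pd.Series(['male', 'Female', 'male', 'unknown'], name='Sex')

mapping = {'male': 0, 'female': 1}

# Normalize case first, then map; values missing from the mapping become NaN
encoded = sex.str.lower().map(mapping)

# Report the original values that the mapping did not cover
unmapped = sex[encoded.isna()].unique()
print(unmapped)
```

Writing the result to a new variable and checking for NaN first keeps the original column intact until the mapping is known to be complete.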

2. One-Hot Encoding (Creating Dummy Variables)
Explanation:
One-hot encoding converts each category into a separate binary column (0 or 1). This is useful when a categorical variable has more than two categories with no inherent order, like Embarked (C, Q, S).
Example: Encoding the Embarked column.

In [4]:
df_encoded = pd.get_dummies(df, columns=['Embarked'], drop_first=False, dtype=int)
print(df_encoded.head())

   PassengerId             Name  Age     Sex  Pclass  Survived     Fare  \
0            1       John Smith   22    male       3         0   7.2500   
1            2         Jane Doe   38  female       1         1  71.2833   
2            3    William Brown   26    male       3         1   7.9250   
3            4      Emily Davis   35  female       1         1  53.1000   
4            5  Michael Johnson   35    male       3         0   8.0500   

   Embarked_C  Embarked_Q  Embarked_S  
0           0           0           1  
1           1           0           0  
2           0           0           1  
3           0           0           1  
4           0           0           1  
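
Because every row has exactly one of the three dummy columns set to 1, any one column can be inferred from the other two. For models that are sensitive to this redundancy (linear regression is the usual example), a common variation is drop_first=True. The sketch below applies it to the Embarked values from the sample data; the names ports and reduced are introduced here only for illustration.

```python
import pandas as pd

# Embarked values from the ten-row sample dataset
ports = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'C']})

# drop_first=True removes the first category (C), leaving Q and S
reduced = pd.get_dummies(ports, columns=['Embarked'], drop_first=True, dtype=int)

print(reduced.head())
```

A row with 0 in both remaining columns then represents the dropped category, C.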
