**Data Transformation - Variable Scaling**

> Variable scaling refers to adjusting variables to a common scale, which is critical for many data analysis and machine learning tasks. Scaling methods depend on the type of data and the problem being addressed. For numerical variables, common techniques include normalization (rescaling values to a range, such as \[0, 1]) and standardization (transforming values to have a mean of 0 and a standard deviation of 1). Categorical variables may need to be transformed into dummy variables to make them usable in models.
> 
> **Ease of Analysis:** When variables share a scale or unit, comparisons between them become more meaningful. In regression, for instance, coefficients can be compared directly only when the features are on comparable scales; otherwise a coefficient's size mostly reflects the units of its feature. (Correlation coefficients, by contrast, are unaffected by linear rescaling.)
>
> **Model Performance:** Many machine learning algorithms, such as those relying on distance metrics (e.g., k-nearest neighbors, SVM, or clustering algorithms), or gradient-based optimization methods (e.g., neural networks), perform better when input features are scaled similarly. Without scaling, variables with larger magnitudes may dominate model learning, leading to suboptimal results.
>
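> **Handling Outliers:** The choice of scaling method matters most when the dataset contains outliers. Min-Max scaling depends directly on the minimum and maximum, so a single extreme value compresses all other values into a narrow band. Standardization is less sensitive, but the mean and standard deviation are still pulled by extreme values. Robust scaling, which centers on the median and divides by the interquartile range, is usually the better choice for datasets with significant outliers.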
> 
> **Not Always Necessary:** Not all algorithms require scaling. Tree-based algorithms such as decision trees, random forests, and gradient-boosted trees do not depend on variable scaling, since they split the data on thresholds of individual feature values rather than on distances between points.

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing as pre

<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

In [2]:
data = np.random.randint(40, 250, size=30)
data = pd.DataFrame(data)

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Min-Max Scaler </p>

In [3]:
arr = pre.MinMaxScaler().fit_transform(data)
res = pd.DataFrame(arr)
display(pd.concat([res.max(), res.min(), res.sample()]))

Unnamed: 0,0
0,1.0
0,0.0
21,0.873786
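
The Min-Max transform can also be written out by hand as `(x - min) / (max - min)`. The following is a minimal sketch, assuming a seeded random column (the seed and the names `x`, `manual`, and `scaled` are illustrative, not part of the notebook), that checks the manual formula against `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative data: a fixed seed is an assumption made for reproducibility.
rng = np.random.default_rng(0)
x = rng.integers(40, 250, size=(30, 1)).astype(float)

# Min-Max scaling by hand: shift by the minimum, divide by the range.
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same transform through scikit-learn.
scaled = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True
```

Because the formula uses only the minimum and the maximum, the smallest value always maps to 0 and the largest to 1, as the output above shows.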


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Robust Scaler </p>

In [4]:
arr = pre.RobustScaler().fit_transform(data)
res = pd.DataFrame(arr)
display(pd.concat([res.max(), res.min(), res.sample()]))

Unnamed: 0,0
0,1.072539
0,-1.062176
27,-1.062176
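
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (IQR). The sketch below, on a small hypothetical column with one deliberate outlier, reproduces that by hand and contrasts it with Min-Max scaling; the data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical column: five ordinary values and one outlier.
x = np.array([[50.0], [60.0], [70.0], [80.0], [90.0], [1000.0]])

# Robust scaling by hand: center on the median, divide by the IQR.
q1, median, q3 = np.percentile(x, [25, 50, 75], axis=0)
manual = (x - median) / (q3 - q1)

robust = RobustScaler().fit_transform(x)
minmax = MinMaxScaler().fit_transform(x)

print(np.allclose(manual, robust))  # True
print(robust[:5].ravel())           # ordinary values stay spread out around 0
print(minmax[:5].ravel())           # ordinary values are squeezed close to 0
```

The median and IQR barely move when the outlier is added, so the ordinary values keep a usable spread, whereas Min-Max scaling spends almost the whole [0, 1] range on the gap to the outlier.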


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Standardization </p>

In [5]:
arr = pre.StandardScaler().fit_transform(data)
res = pd.DataFrame(arr)
display(pd.concat([res.max(), res.min(), res.sample()]))
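# pandas .std() defaults to the sample estimate (ddof=1), while StandardScaler
# divides by the population std (ddof=0), so the value shown is slightly above 1
display(pd.concat([res.mean(), res.std()]).to_frame())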

Unnamed: 0,0
0,1.667499
0,-1.752727
0,0.936965


Unnamed: 0,0
0,6.383782e-17
0,1.017095
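
The standard deviation reported above is about 1.017 rather than exactly 1. A likely explanation is the differing `ddof` conventions: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `DataFrame.std()` defaults to the sample estimate (`ddof=1`). The sketch below assumes a seeded 30-row column (seed and names are illustrative) and checks both conventions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative 30-row column; the seed is an assumption for reproducibility.
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.integers(40, 250, size=30).astype(float))

z = pd.DataFrame(StandardScaler().fit_transform(x))

pop_std = z.std(ddof=0)[0]     # population std: what StandardScaler normalizes to 1
sample_std = z.std(ddof=1)[0]  # pandas default: larger by a factor sqrt(n / (n - 1))

print(np.isclose(pop_std, 1.0))                    # True
print(np.isclose(sample_std, np.sqrt(30 / 29)))    # True
```

For n = 30 the factor sqrt(30/29) is roughly 1.0171, which matches the value displayed in the cell output.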


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

In [6]:
data = {'Sex':[], 'Color':[], 'Education':[]}
for _ in range(30):
    data['Sex'].append(np.random.choice(['Female', 'Male']))
    data['Color'].append(np.random.choice(['Black', 'White', 'Blue', 'Green', 'Purple']))
    data['Education'].append(np.random.choice(['Primary', 'High', 'Bachelor\'s', 'Master', 'Doctorate']))
    
data = pd.DataFrame(data)
df = data.copy()
display(data.sample(3))

Unnamed: 0,Sex,Color,Education
3,Male,Black,Doctorate
10,Male,Blue,High
8,Female,Blue,Doctorate


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Labeling - Encoding </p>

In [7]:
# LabelEncoder assigns integer codes in alphabetical order of the category names
df['Sex_Label'] = pre.LabelEncoder().fit_transform(df['Sex'])
df['Color_Label'] = pre.LabelEncoder().fit_transform(df['Color'])
df['Education_Label'] = pre.LabelEncoder().fit_transform(df['Education'])
display(df.sample(3))

Unnamed: 0,Sex,Color,Education,Sex_Label,Color_Label,Education_Label
3,Male,Black,Doctorate,1,0,1
17,Female,White,Master,0,4,3
2,Male,Black,Primary,1,0,4
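
The codes in the output follow from how `LabelEncoder` works: it sorts the distinct values and numbers them in that order. A minimal sketch, using the education levels from the cell above (the names `levels`, `encoder`, and `mapping` are illustrative), makes the mapping explicit through the fitted `classes_` attribute.

```python
from sklearn.preprocessing import LabelEncoder

levels = ['Primary', 'High', "Bachelor's", 'Master', 'Doctorate']

# Fit on the category names; classes_ holds them in sorted order.
encoder = LabelEncoder().fit(levels)

# Position in classes_ is the integer code assigned to each label.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {"Bachelor's": 0, 'Doctorate': 1, 'High': 2, 'Master': 3, 'Primary': 4}
```

The numbering is purely alphabetical, so it carries no information about the real order of the education levels, and for a nominal variable such as `Color` the integers imply an order that does not exist.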


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; One-Hot Encoding </p>

In [8]:
one_hot = pd.get_dummies(data, columns=['Sex'], dtype='int') # the original 'Sex' column is replaced by its dummy columns
display(one_hot.head(3))

Unnamed: 0,Color,Education,Sex_Female,Sex_Male
0,Black,Primary,0,1
1,White,High,0,1
2,Black,Primary,0,1


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Dummy Variable Trap - multicollinearity </p>

In [9]:
one_hot = pd.get_dummies(data, columns=['Sex'], dtype='int', drop_first=True)
display(one_hot.head(3))

Unnamed: 0,Color,Education,Sex_Male
0,Black,Primary,1
1,White,High,1
2,Black,Primary,1
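
The reason for dropping one dummy column can be checked numerically. When a model includes an intercept, the full set of dummies always sums to the intercept column, so the design matrix loses rank. The sketch below uses a small hypothetical `Sex` column and NumPy's `matrix_rank`; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row column, for illustration only.
sex = pd.Series(['Male', 'Female', 'Male', 'Female', 'Female'], name='Sex')

full = pd.get_dummies(sex, dtype='int')                      # Female, Male
reduced = pd.get_dummies(sex, dtype='int', drop_first=True)  # Male only

intercept = np.ones((len(sex), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Female + Male equals the intercept column, so one column is redundant.
print(X_full.shape[1], np.linalg.matrix_rank(X_full))        # 3 2
print(X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))  # 2 2
```

With `drop_first=True` the matrix has full column rank, and the dropped category becomes the baseline against which the remaining dummy is interpreted.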


<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Ordinal Variable Transformation (Labeling) </p>

In [10]:
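# Alphabetical codes from LabelEncoder, kept for comparison: they ignore the real order
data['Education_Label'] = pre.LabelEncoder().fit_transform(data['Education'])

# Declare the true order of the levels, then number them in that order
ordered = pd.Categorical(data['Education'], ordered=True,
                        categories=['Primary', 'High', 'Bachelor\'s', 'Master', 'Doctorate'])
data['Education_Ordered'], order = pd.factorize(ordered, sort=True)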

display(data.sample(5))

Unnamed: 0,Sex,Color,Education,Education_Label,Education_Ordered
21,Female,Black,Doctorate,1,4
24,Male,Black,High,2,1
25,Male,Blue,Doctorate,1,4
9,Male,Black,Master,3,3
3,Male,Black,Doctorate,1,4
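
One caveat with `pd.factorize` is that it numbers only the values that actually occur, so the codes can shift if a level happens to be missing from the sample. A possible alternative, sketched here on a small hypothetical series, is to read the codes straight from the ordered `Categorical`, which always numbers levels by their position in the declared category list.

```python
import pandas as pd

levels = ['Primary', 'High', "Bachelor's", 'Master', 'Doctorate']

# Hypothetical sample in which 'Primary' and "Bachelor's" never occur.
education = pd.Series(['Master', 'High', 'Doctorate', 'High'])

cat = pd.Categorical(education, categories=levels, ordered=True)

# .codes uses the position in `levels`, regardless of which levels are present.
codes_fixed = cat.codes.tolist()
print(codes_fixed)  # [3, 1, 4, 1]

# factorize numbers only the three observed levels, so codes run from 0 to 2.
codes_factorized = pd.factorize(cat, sort=True)[0].tolist()
print(max(codes_factorized))  # 2
```

With 30 random rows and five levels a missing level is unlikely, but using `.codes` keeps the encoding identical across different samples of the same variable.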


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>

<p style="background-image: linear-gradient(#0aa98f, #ffffff 10%); font-weight:bold;"> 
    &nbsp; Polynomial Features </p>

In [11]:
data = pd.read_csv('data/cylinder.csv')
data.head(3)

Unnamed: 0,Radius,Height,Volume
0,7.33,13.23,2253.51
1,3.65,15.73,673.19
2,7.85,14.04,2738.02


In [12]:
y = data['Volume']
X = data.drop(columns='Volume')

# degree=2 and interaction_only=False are the defaults, spelled out here for clarity
poly = pre.PolynomialFeatures(degree=2, interaction_only=False)
pX = poly.fit_transform(X)

display(X.head(3), pd.DataFrame(pX).head(3))

Unnamed: 0,Radius,Height
0,7.33,13.23
1,3.65,15.73
2,7.85,14.04


Unnamed: 0,0,1,2,3,4,5
0,1.0,7.33,13.23,53.7289,96.9759,175.0329
1,1.0,3.65,15.73,13.3225,57.4145,247.4329
2,1.0,7.85,14.04,61.6225,110.214,197.1216


<p style="background-image: linear-gradient(to right, #0aa98f, #68dab2)"> &nbsp; </p>