# Feature Scaling

1. Standardization
2. Min-Max Scaler
3. Unit Vector

# Standardization (Usually for Machine Learning use cases)
Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1. Putting features on a common scale makes them easier to compare and keeps features with large raw values from dominating algorithms that are sensitive to magnitude, such as distance-based or gradient-based methods.
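The arithmetic behind standardization is `z = (x - mean) / std`, applied to each feature separately. The following is a minimal pure-Python sketch of that formula, assuming the population standard deviation (dividing by `n`), which is what scikit-learn's `StandardScaler` uses. The helper name `standardize` is illustrative only, and the sample values are the first five `total_bill` entries of the tips dataset.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumption: population std (divides by n)
    return [(x - mean) / std for x in values]

# First five total_bill values from the tips dataset
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]
z_scores = standardize(total_bill)

print([round(z, 3) for z in z_scores])
print(fmean(z_scores), pstdev(z_scores))  # mean ~0 and std ~1, up to floating-point error
```

Because the mean and standard deviation here come from only five rows, these z-scores will not match values computed over the full dataset.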

In [1]:
import seaborn as sns

In [2]:
df = sns.load_dataset('tips')

In [3]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [4]:
from sklearn.preprocessing import StandardScaler

In [8]:
scaler= StandardScaler()

In [9]:
# fit only computes the mean and standard deviation used for later scaling; it does not change the data.
# Use scaler.transform to apply the scaling.
scaler.fit(df[['total_bill','tip']])
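The split between `fit` and `transform` matters because statistics learned once can be reused on data that was not seen during fitting. The sketch below imitates that split for a single feature in pure Python; `TinyStandardScaler` is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and the numbers are made up.

```python
from statistics import fmean, pstdev

class TinyStandardScaler:
    """Hypothetical single-feature scaler that mimics the fit/transform split."""

    def fit(self, values):
        # fit only learns the statistics; the data itself is left untouched
        self.mean_ = fmean(values)
        self.scale_ = pstdev(values)
        return self

    def transform(self, values):
        # transform reuses the stored statistics on any data, seen or unseen
        return [(x - self.mean_) / self.scale_ for x in values]

train = [10.0, 20.0, 30.0]  # made-up training values
new = [40.0]                # made-up unseen value

tiny = TinyStandardScaler().fit(train)
train_scaled = tiny.transform(train)
new_scaled = tiny.transform(new)  # scaled with the training mean and std

print(train_scaled, new_scaled)
```

Scaling new data with the statistics learned from the training data keeps both on the same scale.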

In [12]:
import pandas as pd
pd.DataFrame(scaler.transform(df[['total_bill','tip']]),columns=['total_bill','tip'])

Unnamed: 0,total_bill,tip
0,-0.314711,-1.439947
1,-1.063235,-0.969205
2,0.137780,0.363356
3,0.438315,0.225754
4,0.540745,0.443020
...,...,...
239,1.040511,2.115963
240,0.832275,-0.722971
241,0.324630,-0.722971
242,-0.221287,-0.904026


# Normalization: Min-Max Scaler (Usually for Deep Learning use cases)
Min-max scaling linearly rescales each feature to a fixed range, [0, 1] by default. The smallest value of a feature maps to 0, the largest maps to 1, and everything in between is placed proportionally: x' = (x - min) / (max - min).
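The min-max mapping can be sketched in a few lines of pure Python. The function name `min_max_scale` and its handling of a constant feature are illustrative assumptions, not scikit-learn's code; the sample values are the first five `distance` entries of the taxis dataset.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so the minimum hits the lower bound and the maximum the upper bound."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: an illustrative choice, map everything to the lower bound
        return [low for _ in values]
    return [low + (x - v_min) / span * (high - low) for x in values]

# First five distance values from the taxis dataset
distance = [1.6, 0.79, 1.37, 7.7, 2.16]
scaled = min_max_scale(distance)

print([round(s, 4) for s in scaled])
```

The order of the values is preserved; only their range changes. Because the minimum and maximum here come from five rows only, the results will differ from scaling the full column.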

In [3]:
import seaborn as sns

In [6]:
df = sns.load_dataset('taxis')

In [7]:
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [8]:
from sklearn.preprocessing import MinMaxScaler

In [9]:
min_max = MinMaxScaler()

In [12]:
min_max.fit(df[['distance','fare','tip']])

In [16]:
import pandas as pd
pd.DataFrame(min_max.transform(df[['distance','fare','tip']]),columns=['distance','fare','tip'])

Unnamed: 0,distance,fare,tip
0,0.043597,0.040268,0.064759
1,0.021526,0.026846,0.000000
2,0.037330,0.043624,0.071084
3,0.209809,0.174497,0.185241
4,0.058856,0.053691,0.033133
...,...,...,...
6428,0.020436,0.023490,0.031928
6429,0.510627,0.382550,0.000000
6430,0.112807,0.100671,0.000000
6431,0.030518,0.033557,0.000000


In [15]:
# fit_transform fits the scaler and transforms the data in one step (returns a NumPy array)
min_max.fit_transform(df[['distance','fare','tip']])

array([[0.04359673, 0.04026846, 0.06475904],
       [0.02152589, 0.02684564, 0.        ],
       [0.0373297 , 0.04362416, 0.07108434],
       ...,
       [0.11280654, 0.10067114, 0.        ],
       [0.03051771, 0.03355705, 0.        ],
       [0.10490463, 0.09395973, 0.10120482]])

# Unit Vectors Feature Scaling

### When do we use Standardization & Unit Vectors

Standardization:

Let's say you have two features: size and number of rooms. Size is measured in square feet and ranges from 500 to 5000, while the number of rooms ranges from 1 to 5. These features have different scales, and their magnitudes can influence the learning process. Standardization helps bring these features to a similar scale.
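To make the effect of mismatched scales concrete, the sketch below measures how much of the squared Euclidean distance between two houses comes from the number of rooms, before and after standardization. The three houses and the helper names are hypothetical, chosen only to match the ranges described above.

```python
from math import dist
from statistics import fmean, pstdev

# Hypothetical houses: (size in square feet, number of rooms)
houses = [(500, 1), (5000, 5), (520, 5)]

def standardize_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    stats = [(fmean(col), pstdev(col)) for col in columns]
    return [tuple((x - m) / s for x, (m, s) in zip(row, stats)) for row in rows]

def rooms_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the rooms feature."""
    return (a[1] - b[1]) ** 2 / dist(a, b) ** 2

scaled = standardize_columns(houses)
raw_share = rooms_share(houses[0], houses[1])
scaled_share = rooms_share(scaled[0], scaled[1])

print(f"rooms share of squared distance: raw={raw_share:.7f}, standardized={scaled_share:.3f}")
```

On the raw values, the room count contributes almost nothing to the distance because the size difference is so much larger; after standardization both features contribute on a comparable footing.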


Unit Vectors:

Unit-vector scaling divides each sample (row) by its length so that every row has a norm of 1. It is used when direction or similarity matters more than magnitude. In our house example, consider a different use case: suppose you want to measure the similarity between houses based on their features to provide recommendations to potential buyers.


In summary, standardization is useful in tasks such as classification, where it puts features on a consistent scale so that no feature dominates simply because of its magnitude. Unit vectors, on the other hand, are used when direction or similarity matters, as in comparison and recommendation tasks.
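One way to see why unit vectors suit similarity tasks is cosine similarity, which is the dot product of two unit vectors. The sketch below is pure Python; the function names and the house features (bedrooms, bathrooms) are hypothetical and chosen so that both features are on a comparable scale.

```python
from math import sqrt

def to_unit_vector(row):
    """Scale a row so its Euclidean (L2) length is 1; an all-zero row is returned unchanged."""
    norm = sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

def cosine_similarity(a, b):
    """Dot product of the two rows after each is converted to a unit vector."""
    return sum(x * y for x, y in zip(to_unit_vector(a), to_unit_vector(b)))

# Hypothetical houses: (bedrooms, bathrooms)
small = [2, 1]
large = [4, 2]      # same proportions as `small`, twice the magnitude
different = [1, 3]  # different proportions

same_shape = cosine_similarity(small, large)
other_shape = cosine_similarity(small, different)
print(same_shape, other_shape)
```

The houses with the same proportions come out as essentially identical in direction even though one is twice as large, while the house with different proportions scores lower.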

In [1]:
import seaborn as sns

In [2]:
df = sns.load_dataset('iris')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
from sklearn.preprocessing import normalize   ### normalize is a function that scales each row to unit length (L2 norm by default)

In [6]:
df.columns

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')

In [7]:
normalize(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

array([[0.80377277, 0.55160877, 0.22064351, 0.0315205 ],
       [0.82813287, 0.50702013, 0.23660939, 0.03380134],
       [0.80533308, 0.54831188, 0.2227517 , 0.03426949],
       [0.80003025, 0.53915082, 0.26087943, 0.03478392],
       [0.790965  , 0.5694948 , 0.2214702 , 0.0316386 ],
       [0.78417499, 0.5663486 , 0.2468699 , 0.05808704],
       [0.78010936, 0.57660257, 0.23742459, 0.0508767 ],
       [0.80218492, 0.54548574, 0.24065548, 0.0320874 ],
       [0.80642366, 0.5315065 , 0.25658935, 0.03665562],
       [0.81803119, 0.51752994, 0.25041771, 0.01669451],
       [0.80373519, 0.55070744, 0.22325977, 0.02976797],
       [0.786991  , 0.55745196, 0.26233033, 0.03279129],
       [0.82307218, 0.51442011, 0.24006272, 0.01714734],
       [0.8025126 , 0.55989251, 0.20529392, 0.01866308],
       [0.81120865, 0.55945424, 0.16783627, 0.02797271],
       [0.77381111, 0.59732787, 0.2036345 , 0.05430253],
       [0.79428944, 0.57365349, 0.19121783, 0.05883625],
       [0.80327412, 0.55126656,

In [10]:
# convert the normalized array into a DataFrame
import pandas as pd
df_new=pd.DataFrame(normalize(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]),
                       columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

In [11]:
df_new.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,0.803773,0.551609,0.220644,0.031521
1,0.828133,0.50702,0.236609,0.033801
2,0.805333,0.548312,0.222752,0.034269
3,0.80003,0.539151,0.260879,0.034784
4,0.790965,0.569495,0.22147,0.031639
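The first row of this output can be checked by hand. With its default settings, `normalize` divides each row by its L2 (Euclidean) norm; the pure-Python sketch below repeats that arithmetic for the first iris row (the variable names are illustrative).

```python
from math import sqrt

# First iris row: sepal_length, sepal_width, petal_length, petal_width
row = [5.1, 3.5, 1.4, 0.2]

l2_norm = sqrt(sum(x * x for x in row))  # length of the row vector
unit_row = [x / l2_norm for x in row]    # divide every feature by that length

print([round(x, 6) for x in unit_row])
print(sum(x * x for x in unit_row))      # squared length, 1 up to floating-point error
```

Each row is scaled independently of the others, so unlike `StandardScaler` or `MinMaxScaler` there are no per-column statistics to learn.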


# PCA (Principal Component Analysis)

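PCA statistically combines features into a smaller set of new features (principal components) that retain as much of the original variation as possible, which limits information loss. The resulting feature can then be used with the target/label to build a model.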
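For just two features, the idea can be sketched without any library: center the data, find the direction of greatest variance, and project onto it. The example below is a rough pure-Python illustration, not how scikit-learn's `PCA` is implemented. It assumes population covariance (dividing by `n`) and uses a closed-form angle that is only valid for two features; the function name is hypothetical and the sample values are the first five iris rows.

```python
from math import atan2, cos, sin
from statistics import fmean

def first_principal_component(xs, ys):
    """Project two features onto their direction of greatest variance."""
    mx, my = fmean(xs), fmean(ys)
    cx = [x - mx for x in xs]  # center each feature
    cy = [y - my for y in ys]
    n = len(xs)
    var_x = sum(v * v for v in cx) / n  # entries of the 2x2 covariance matrix
    var_y = sum(v * v for v in cy) / n
    cov_xy = sum(a * b for a, b in zip(cx, cy)) / n
    # Closed-form angle of the leading eigenvector of a symmetric 2x2 matrix
    theta = 0.5 * atan2(2 * cov_xy, var_x - var_y)
    direction = (cos(theta), sin(theta))
    scores = [a * direction[0] + b * direction[1] for a, b in zip(cx, cy)]
    return direction, scores

# First five iris rows
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6]

direction, pc1 = first_principal_component(sepal_length, sepal_width)
print(direction)
print([round(s, 3) for s in pc1])
```

The single combined feature `pc1` has at least as much variance as either original feature, which is the sense in which PCA keeps information while reducing the number of features.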