Almost all machine learning algorithms work with vectors of numbers, so raw data must first be converted into a usable numeric format.

#### Scales of Measurement

##### Nominal Data

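Nominal data takes arbitrary labels as values, typically nonnumerical. The values can be tested for equality, but they can be neither measured nor ordered. Values may look numeric (for example, ZIP codes) but act only as labels, so no arithmetic operations can be applied. The only measure of central tendency that can be studied is the mode.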

##### Ordinal Data

Ordinal data is a type of data whose values follow an order. The values don't support basic arithmetic operations, but they can be compared using comparison operators. The median can be considered a valid measure of central tendency. For example, T-shirt sizes (S, M, L, XL, etc.).
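
As a minimal sketch of what "comparable but not arithmetic" means, the snippet below ranks T-shirt sizes by their position on a scale and picks a median label. The names `SIZE_ORDER`, `is_smaller` and `ordinal_median`, as well as the sample orders, are illustrative assumptions, not part of any library.

```python
from statistics import median_low

# Assumed size scale, listed from smallest to largest.
SIZE_ORDER = ["S", "M", "L", "XL"]
RANK = {size: position for position, size in enumerate(SIZE_ORDER)}


def is_smaller(a, b):
    """Compare two sizes by their position on the scale."""
    return RANK[a] < RANK[b]


def ordinal_median(sizes):
    """Return the lower median label; labels are ranked, never averaged."""
    ranks = [RANK[s] for s in sizes]
    return SIZE_ORDER[median_low(ranks)]


# Hypothetical sample of ordered sizes.
orders = ["M", "S", "XL", "L", "M"]
median_size = ordinal_median(orders)
print(is_smaller("S", "XL"), median_size)
```

`median_low` is used so that the result is always one of the existing labels, since averaging two labels such as S and XL has no meaning on an ordinal scale.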

##### Interval data

Interval data has values that can be compared, and the intervals between them are equally spaced. It supports addition and subtraction. For example, degrees Fahrenheit: the difference between 30°F and 31°F is the same as between 90°F and 91°F. However, we can't say that 60°F is twice as hot as 30°F, because 0°F is not a true zero.
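
The Fahrenheit example can be checked with a little arithmetic. This sketch converts both temperatures to Kelvin, a scale with a true zero, and compares the ratios. The helper `fahrenheit_to_kelvin` is defined here for illustration only.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to Kelvin (a scale with a true zero)."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15


# Differences are meaningful on an interval scale.
diff_low = 31 - 30
diff_high = 91 - 90

# Ratios are not: dividing the raw Fahrenheit numbers suggests "twice as hot",
# but on the Kelvin scale the ratio is only slightly above one.
naive_ratio = 60 / 30
kelvin_ratio = fahrenheit_to_kelvin(60) / fahrenheit_to_kelvin(30)

print(diff_low == diff_high)
print(naive_ratio)
print(round(kelvin_ratio, 3))
```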

##### Ratio Data

Ratio data has a natural zero point and supports all the properties of interval data, along with multiplication and division, so ratios between values are meaningful. Values support all numeric operations. Measures of central tendency can be studied for this kind of data, as well as measures of spread such as variance.

#### Transforming Nominal Attributes

Consider the Gender attribute, which may take three values: Male, Female and Other. This is a nominal attribute. It can be expressed as a vector with one position per possible value.

Say, there’s a student with the following values:
Edward Remirez, Male, 28 years, Bachelors Degree

We can replace the Gender column with three binary values, one per category (here in the order Female, Male, Other):
Edward Remirez, 0, 1, 0, 28 years, Bachelors Degree

This is called one-hot encoding. scikit-learn provides a simple interface for this transformation in sklearn.preprocessing.
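
To make the mechanics concrete, here is a from-scratch sketch of one-hot encoding in plain Python. The function `one_hot` and the category order in `GENDER_CATEGORIES` are assumptions chosen for illustration; this is not how scikit-learn implements it internally.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Assumed category order (alphabetical).
GENDER_CATEGORIES = ["Female", "Male", "Other"]

record = ["Edward Remirez", "Male", 28, "Bachelors"]
name, gender, age, degree = record

# Replace the single Gender value with its three-element one-hot vector.
encoded = [name, *one_hot(gender, GENDER_CATEGORIES), age, degree]
print(encoded)
```

Each category gets its own position, so no artificial ordering is introduced between Male, Female and Other.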

In [22]:
# Transforming nominal attributes with one-hot encoding

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [["Edward Remirez", "Male", 28, "Bachelors"],
     ["Arnav Sharma", "Female", 23, "Masters"]],
    columns=["Name", "Gender", "Age", "Degree"],
)

# Learn the categories present in the Gender column
encoder_for_gender = OneHotEncoder().fit(df[["Gender"]])
encoder_for_gender.categories_

# transform() returns a sparse matrix; toarray() converts it to a dense array
gender_values = encoder_for_gender.transform(df[["Gender"]])

# Column order follows categories_: Female first, then Male
df[["Gender_F", "Gender_M"]] = gender_values.toarray()

df

|   | Name | Gender | Age | Degree | Gender_F | Gender_M |
|---|------|--------|-----|--------|----------|----------|
| 0 | Edward Remirez | Male | 28 | Bachelors | 0.0 | 1.0 |
| 1 | Arnav Sharma | Female | 23 | Masters | 1.0 | 0.0 |


#### Transforming Ordinal Attributes

Ordinal attributes can be transformed more simply, in a way that preserves the ordering information and yields more meaningful models. Consider again:
    Edward Remirez, Male, 28 years, Bachelors Degree

Education levels follow an order, so we can assign a numeric value to each level:
HS: 0, Bachelors: 1, Masters: 2, Doctorate: 3.
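
The mapping above can be sketched by hand with a dictionary. The names `EDUCATION_LEVELS`, `LEVEL_TO_CODE` and `encode_degree` are illustrative assumptions.

```python
# Levels listed from lowest to highest; the list position is the code.
EDUCATION_LEVELS = ["HS", "Bachelors", "Masters", "Doctorate"]
LEVEL_TO_CODE = {level: code for code, level in enumerate(EDUCATION_LEVELS)}


def encode_degree(degree):
    """Map an education level to its ordinal code."""
    return LEVEL_TO_CODE[degree]


degrees = ["Bachelors", "Masters"]
codes = [encode_degree(d) for d in degrees]
print(codes)
```

Because the codes follow the order of the levels, comparisons such as "Masters is higher than Bachelors" survive the encoding.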

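In [34]:
from sklearn.preprocessing import OrdinalEncoder

# With no explicit categories, the encoder sorts the observed values
# alphabetically, which does not reflect the real order of education levels.
encoder_for_education = OrdinalEncoder()
encoder_for_education.fit_transform(df[['Degree']])
encoder_for_education.categories_

[array(['Bachelors', 'Masters'], dtype=object)]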

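In [41]:
# List the categories from lowest to highest so that the codes preserve the order
encoder_for_education = OrdinalEncoder(categories=[['High School', 'Bachelors', 'Masters', 'Doctoral']])
# Call fit_transform(); assigning the bare `fit` attribute would store the method object itself
df[['Degree_encoded']] = encoder_for_education.fit_transform(df[['Degree']])
df

|   | Name | Gender | Age | Degree | Gender_F | Gender_M | Degree_encoded |
|---|------|--------|-----|--------|----------|----------|----------------|
| 0 | Edward Remirez | Male | 28 | Bachelors | 0.0 | 1.0 | 1.0 |
| 1 | Arnav Sharma | Female | 23 | Masters | 1.0 | 0.0 | 2.0 |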


### Normalisation

#### Min-Max Scaling

Min-max scaling transforms each feature by compressing it onto a scale where the feature's minimum value maps to zero and its maximum maps to one.
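
As a rough sketch of the arithmetic, each value `v` becomes `(v - min) / (max - min)`. The function `min_max_scale` and the sample ages are assumptions for illustration. Mapping a constant feature to zeros is a choice made in this sketch to avoid division by zero.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ages, a slightly larger sample than the two-row example.
ages = [23, 28, 35, 41, 52]
scaled = min_max_scale(ages)
print([round(s, 3) for s in scaled])
```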



In [39]:
# Min-max scaling of the Age column. With only two rows the result is
# trivially 1 and 0, so the scaled values are displayed without overwriting df.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df[['Age']])
scaler.transform(df[['Age']])

array([[1.],
       [0.]])

#### Standard Scaling

Standard scaling standardizes the feature values by removing the mean and scaling to unit variance. 
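
A minimal sketch of the computation: subtract the mean, then divide by the standard deviation. The function `standard_scale` is an illustrative assumption. It uses the population standard deviation (`pstdev`), which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature has zero variance; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# The two ages from the example data.
ages = [28, 23]
z_scores = standard_scale(ages)
print(z_scores)
```

With only two values, the results always land exactly one standard deviation above and below the mean.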

In [43]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df[['Age']])
df[['Age']] = scaler.transform(df[['Age']])

# Show only the columns relevant to this step
df[['Name', 'Age']]

|   | Name | Age |
|---|------|-----|
| 0 | Edward Remirez | 1.0 |
| 1 | Arnav Sharma | -1.0 |


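In [47]:
# The fitted parameters are available as attributes of the scaler.
# The values below assume the scaler was fitted on the original ages (28 and 23).

scaler.mean_, scaler.scale_

(array([25.5]), array([2.5]))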