# **Data Preprocessing: Scaling, Encoding, and Dimensionality Reduction**

Goal= """
In this notebook I get raw data ready for machine learning.
We'll walk through some of the most common preprocessing steps you’ll need before training a model: scaling features, turning text into numbers, filling in missing values, dropping low-variance features, and reducing the complexity of your data (PCA and t-SNE).
Each section is done both manually and with Scikit-Learn, so I can really understand what’s happening under the hood while also learning how to use the library.
"""


## Scaling Data with StandardScaler

### Manual Standardization

In [1]:
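import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
X = X - np.average(X)        # center: subtract the mean
print(np.average(X))         # the mean is now 0
X = X / np.sqrt(np.var(X))   # scale: divide by the population standard deviation
print(X)
print(np.var(X))             # the variance is now 1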


0.0
[-1.5666989  -1.21854359 -0.87038828 -0.52223297 -0.17407766  0.17407766
  0.52223297  0.87038828  1.21854359  1.5666989 ]
1.0


### Scikit-Learn

In [2]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# scikit-learn expects 2D input with shape (n_samples, n_features)
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
print(X.shape)
scaler = StandardScaler()
# fit_transform learns the mean and standard deviation, then returns a new
# scaled array, which is why the result is assigned to a separate variable
scaledX = scaler.fit_transform(X)
print(scaledX)

(10, 1)
[[-1.5666989 ]
 [-1.21854359]
 [-0.87038828]
 [-0.52223297]
 [-0.17407766]
 [ 0.17407766]
 [ 0.52223297]
 [ 0.87038828]
 [ 1.21854359]
 [ 1.5666989 ]]
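
As a quick sanity check, the manual result and the `StandardScaler` result can be compared directly. This is a minimal sketch; the variable names `manual`, `scaled`, `matches`, and `restored` are my own and not part of the cells above. It relies on `StandardScaler` using the population standard deviation (the same as `np.std` with its default `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(1, 11, dtype=float).reshape(-1, 1)

# Manual z-score: subtract the column mean, divide by the population std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaler = StandardScaler()
scaled = scaler.fit_transform(X)

# Both approaches should agree up to floating-point error
matches = np.allclose(manual, scaled)
print(matches)

# The fitted scaler remembers the mean and scale, so the step can be undone
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, X))
```

Keeping the fitted `scaler` object around is what lets the same mean and scale be reused on new data later.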


## Encoding Categorical Data

### Manual Encoding

In [3]:
data = np.array([['b'], ['c'], ['d'], ['a'], ['b']])
# Map each category to an integer code
codes = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
for i in range(len(data)):
    # data has a string dtype, so the assigned codes are stored as strings
    data[i][0] = codes[str(data[i][0])]
print(data)

[['2']
 ['3']
 ['4']
 ['1']
 ['2']]
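
The same mapping can be built without listing the categories by hand. The sketch below is one possible approach, not the only one: it uses `np.unique` with `return_inverse=True` on a 1D copy of the data, and the names `categories`, `codes`, and `codes_from_one` are illustrative.

```python
import numpy as np

# 1D version of the data used above
data = np.array(['b', 'c', 'd', 'a', 'b'])

# categories: sorted unique values; codes: index of each value in categories
categories, codes = np.unique(data, return_inverse=True)

# Shift by one so that 'a' -> 1, 'b' -> 2, ... like the manual version
codes_from_one = codes + 1

print(categories)
print(codes_from_one)
```

Unlike the loop above, this keeps the codes as integers instead of writing them back into a string array.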


### Scikit-Learn

#### Ordinal Encoding

In [4]:
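import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Mixing numbers with 'a' makes NumPy store every value as a string
data = np.array([[1], [2], [3], ['a'], [5], [6], [7], [8], [9], [10]])
encoding = OrdinalEncoder()
# Categories are sorted as strings, so '10' comes right after '1' and 'a' comes last
encoded_data = encoding.fit_transform(data)
print(encoded_data)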

[[0.]
 [2.]
 [3.]
 [9.]
 [4.]
 [5.]
 [6.]
 [7.]
 [8.]
 [1.]]
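
Because the default order is just the sorted order of the values, it can help to pass the order explicitly when the categories have a natural ranking. A small sketch, using a made-up `sizes` column that is not part of the notebook's data:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example data with a natural order: low < medium < high
sizes = np.array([['medium'], ['low'], ['high'], ['low']])

# categories takes one list per feature, in the order the codes should follow
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())
```

Without the `categories` argument, the alphabetical order would put `'high'` first, which does not reflect the ranking.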


#### One-Hot Encoding

In [5]:
from sklearn.preprocessing import OneHotEncoder
data=np.array([[1],[2],[3],['a'],[5],[6],[7],[8],[9],[10]])

encoder = OneHotEncoder()  # returns a sparse matrix by default
Xencoded = encoder.fit_transform(data).toarray()  # input must be 2D; toarray() converts the sparse result to a dense array
print(Xencoded)

[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
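
One thing worth knowing is what happens when new data contains a category the encoder never saw. The sketch below assumes made-up color data (`train`, `new`) and uses `handle_unknown='ignore'`, which encodes unseen categories as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with three categories
train = np.array([['red'], ['green'], ['blue']])

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# 'purple' was not seen during fit
new = np.array([['green'], ['purple']])
encoded = encoder.transform(new).toarray()

print(encoder.categories_[0])  # column order: sorted categories
print(encoded)
```

The columns follow the sorted categories (`blue`, `green`, `red`), so `'green'` lights up the middle column and `'purple'` gets all zeros.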


## Handling Missing Data


In [6]:
from sklearn.impute import SimpleImputer
data=np.array([[1,2],[2,3],[3,np.nan],[4,5],[6,7]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
khiar=imp.fit_transform(data)
# Available strategies:
# * mean
# * median
# * most_frequent
# * constant (with fill_value), e.g. an out-of-range value so missing entries stand out as outliers
print(khiar)

[[1.   2.  ]
 [2.   3.  ]
 [3.   4.25]
 [4.   5.  ]
 [6.   7.  ]]
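
For comparison, mean imputation can also be done by hand with NumPy. This is a minimal sketch on the same data; `col_means` and `filled` are names I introduce here.

```python
import numpy as np

data = np.array([[1, 2], [2, 3], [3, np.nan], [4, 5], [6, 7]], dtype=float)

# Column means computed while ignoring NaN entries
col_means = np.nanmean(data, axis=0)

# Replace each NaN with the mean of its column (broadcast across rows)
filled = np.where(np.isnan(data), col_means, data)

print(col_means)
print(filled)
```

The missing entry in the second column becomes 4.25, the mean of 2, 3, 5, and 7, which matches the `SimpleImputer` output above.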


## Feature Selection with Variance Threshold


In [7]:
from sklearn.feature_selection import VarianceThreshold
data=np.array([[1,2],[2,3],[3,4],[4,60],[6,7]])
VT = VarianceThreshold(threshold=10)  # keep only features whose variance is above 10
khiar = VT.fit_transform(data)

print("The reduced X has a shape of {}".format(khiar.shape))
print(khiar)

The reduced X has a shape of (5, 1)
[[ 2]
 [ 3]
 [ 4]
 [60]
 [ 7]]
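
To see why the first column was dropped, the per-column variances can be computed by hand. A minimal sketch on the same data, with illustrative names `variances`, `keep`, and `reduced`:

```python
import numpy as np

data = np.array([[1, 2], [2, 3], [3, 4], [4, 60], [6, 7]])

# Population variance of each column
variances = data.var(axis=0)

# Boolean mask of the columns whose variance exceeds the threshold
keep = variances > 10
reduced = data[:, keep]

print(variances)
print(reduced.shape)
```

The first column has a variance of about 2.96, well below the threshold of 10, while the second column is dominated by the value 60 and stays.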


## Dimensionality Reduction using PCA (Linear)

In [8]:
from sklearn.decomposition import PCA
data=np.array([[1,2],[2,3],[3,4],[4,5],[6,7]])
pca = PCA(n_components=1)  # Choose number of components
khiar = pca.fit_transform(data)
print(khiar)

[[-3.11126984]
 [-1.69705627]
 [-0.28284271]
 [ 1.13137085]
 [ 3.95979797]]
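
A manual version of the same projection can be sketched with an SVD of the centered data. This is one common way to compute PCA, not necessarily the exact routine scikit-learn runs internally, and the sign of the result may be flipped relative to the output above because a principal direction is only defined up to sign.

```python
import numpy as np

data = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]], dtype=float)

# PCA works on centered data
centered = data - data.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project every sample onto the first principal direction
projected = centered @ Vt[0]

# Share of the total variance captured by each component
explained_ratio = S**2 / np.sum(S**2)

print(projected)
print(explained_ratio)
```

Since every point in this toy dataset lies on the line y = x + 1, the first component captures essentially all of the variance.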


## Dimensionality Reduction using t-SNE (Non-Linear)


In [9]:
from sklearn.manifold import TSNE
data=np.array([[1,2],[2,3],[3,4],[4,5],[6,7]])

tsne = TSNE(n_components=1, perplexity=2)  # n_components: dimension of the embedding; perplexity must be smaller than the number of samples
X_tsne = tsne.fit_transform(data)
print(X_tsne)

[[ 780.7292 ]
 [-607.5442 ]
 [-305.00314]
 [ -33.44689]
 [ 278.30508]]
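
t-SNE involves randomness, so the numbers above will generally differ from run to run. A small sketch of the usual way to make a run repeatable, by passing `random_state` (the value 0 is an arbitrary choice of mine):

```python
import numpy as np
from sklearn.manifold import TSNE

data = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [6, 7]], dtype=float)

# Fixing random_state seeds the random parts of the algorithm
tsne = TSNE(n_components=1, perplexity=2, random_state=0)
embedding = tsne.fit_transform(data)

print(embedding.shape)
```

Only the relative positions of points in a t-SNE embedding are meaningful; the absolute values and the scale of the axis are not.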
