# 0. Review
## 0.A Scikit-Learn

Scikit-Learn is a machine learning package for Python. It gives users access to machine learning algorithms through **object-oriented programming**.

## 0.B Data Set

I will be using a dataset of antibiotic resistance in bacterial strains.

- Each bacterial sample is labeled with its resistance to the antibiotic azithromycin.
- Additionally, each sample is labeled with whether its genome contains certain strands of DNA.

We would like to learn antibiotic resistance from the bacterial genome.

- Our predictors are whether strands of DNA are present.
- Our response is the resistance class.

First, we have to clean our data up. **This section will focus on data preprocessing.**


## 0.C Data Preprocessing

We did a bit of data preprocessing: 

- encoded the resistance feature as 0 for "resistant" and 1 for "susceptible"
- encoded each DNA strand feature as 0 if the genome does not contain the strand and 1 if it does
- did a 70:30 training-test split
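
The steps above could look roughly like the sketch below. The real preprocessing code is not shown in this notebook, so the toy data and every name here (`toy_labels`, `toy_features`, `strand_a`, `strand_b`) are hypothetical stand-ins, and `random_state=0` is only an assumption to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real labels and DNA features.
toy_labels = pd.Series(
    ["resistant", "susceptible", "susceptible", "resistant", "susceptible",
     "resistant", "susceptible", "resistant", "resistant", "susceptible"],
    name="resistance",
)
toy_features = pd.DataFrame({
    "strand_a": [True, False, False, True, False, True, False, True, True, False],
    "strand_b": [False, False, True, True, False, True, True, False, True, False],
})

# Encode the response: 0 for "resistant", 1 for "susceptible".
y = toy_labels.map({"resistant": 0, "susceptible": 1})

# Encode each predictor: 0 if the strand is absent, 1 if it is present.
X = toy_features.astype(int)

# 70:30 training-test split (random_state is fixed only for reproducibility).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))
```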

## 0.D Load Data
Now, we load our dataset. Run the code below to load

- the dataset ```antibiotic_resistance_all_labels```, containing the antibiotic resistance phenotype of each bacterial sample
- the dataset ```DNA_slices_all_df```, containing the genome of each bacterial sample

In [1]:
import pandas as pd
antibiotic_resistance_all_labels = pd.read_csv('datasets/antibiotic_resistance_encoded_labels',index_col=0)
DNA_slices_all_df = pd.read_csv('datasets/DNA_slices_encoded_csv',index_col=0)

**In this section, we will learn about two preprocessing techniques, normalization and standardization. They matter for the many unsupervised ML algorithms that are sensitive to the scale of the variables.**

# 6. Normalization

Normalization rescales quantitative variables to lie between $0$ and $1$. This puts columns measured in different units on a common scale, so that they can be compared.

For each variable, normalizing subtracts the minimum value and divides by the difference between the maximum and minimum.
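
As a formula, this can be written as follows, where $x$ is an entry of a column and $x_{\min}$ and $x_{\max}$ are that column's minimum and maximum (these symbols are notation introduced here for illustration):

$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

For example, a value of $150$ in a column with minimum $120$ and maximum $200$ becomes $(150-120)/(200-120) = 0.375$.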


## 6.A Example

Let's digress a little from the k-mer dataset and consider a dataset of weight and height,

| . | Weight/pounds | Height/cm |
| --- | --- | --- |
| **Observation 1** | 120 | 177 |
| **Observation 2** | 200 | 100 |
| **Observation 3** | 150 | 155 |
| **Observation 4** | 172 | 125 |

Let's normalize the weight column. First, we subtract the minimum, $120$, and then divide by the difference between the maximum and minimum, $200-120$.


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | (120-120)/(200 - 120) | 177 |
| **Observation 2** | (200-120)/(200 - 120) | 100 |
| **Observation 3** | (150-120)/(200 - 120) | 155 |
| **Observation 4** | (172-120)/(200 - 120) | 125 |

Let's normalize the height column. Again, we first subtract the minimum, $100$, and then divide by the difference between the maximum and minimum, $177-100$.


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | 0 | (177-100)/(177-100)|
| **Observation 2** | 1 | (100-100)/(177-100) |
| **Observation 3** | 0.375 | (155-100)/(177-100) |
| **Observation 4** | 0.65 | (125-100)/(177-100) |

Our final normalized dataset is then, 


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | 0 | 1 |
| **Observation 2** | 1 | 0 |
| **Observation 3** | 0.375 | 0.71 |
| **Observation 4** | 0.65 | 0.32 |
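
The hand calculation above can be reproduced with a few lines of NumPy. This is only a sketch of the arithmetic, not the scikit-learn approach; the names `measurements`, `col_min`, `col_max` and `normalized` are introduced here for illustration.

```python
import numpy as np

# Columns: weight (pounds), height (cm); rows: observations 1 to 4.
measurements = np.array([
    [120.0, 177.0],
    [200.0, 100.0],
    [150.0, 155.0],
    [172.0, 125.0],
])

# Minimum and maximum of each column (axis=0 runs down the rows).
col_min = measurements.min(axis=0)
col_max = measurements.max(axis=0)

# Subtract the column minimum, then divide by the column range.
normalized = (measurements - col_min) / (col_max - col_min)
print(normalized.round(2))
```

Broadcasting applies each column's minimum and range to every row, so both columns are normalized in one expression.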

## 6.B Normalization with scikit-learn

### 6.B.1 Normalizing body measurements dataframe

Now, we will normalize the same dataset of measurements with scikit-learn and compare the result with our hand calculations.

Below, I create a pandas dataframe of the observations.

In [2]:
data = {'Weight': [120.0, 200.0, 150.0, 172.0],
        'Height': [177.0, 100.0, 155.0, 125.0]}

bd_measurements_df = pd.DataFrame(data, 
             index=["Observation 1","Observation 2", "Observation 3", "Observation 4"])

In [3]:
bd_measurements_df

| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | 120.0 | 177.0 |
| **Observation 2** | 200.0 | 100.0 |
| **Observation 3** | 150.0 | 155.0 |
| **Observation 4** | 172.0 | 125.0 |


#### I. Initialize MinMaxScaler

Note that the class that normalizes the data is ```MinMaxScaler```, not ```Normalizer```. (```Normalizer``` does something different: it rescales each row to unit norm.)

In [4]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

#### II. Train MinMaxScaler 


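Using the ```fit``` method, we teach our ```MinMaxScaler()``` instance how to scale our data. We do this on the weight column. Because scikit-learn expects a two-dimensional array with one row per sample and one column per feature, we first convert the column to a NumPy array and reshape it into a single column.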

In [5]:
weight_column = bd_measurements_df['Weight']
weight_column

Observation 1    120.0
Observation 2    200.0
Observation 3    150.0
Observation 4    172.0
Name: Weight, dtype: float64

In [6]:
weight_column = bd_measurements_df['Weight'].to_numpy()
weight_column

array([120., 200., 150., 172.])

In [7]:
weight_column = bd_measurements_df['Weight'].to_numpy().reshape(-1, 1)
weight_column

array([[120.],
       [200.],
       [150.],
       [172.]])
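
To see what the reshape in the previous cell changes, here is a small sketch on a stand-in dataframe (`toy_df` and the other names are introduced here for illustration). It compares the array shapes before and after ```reshape(-1, 1)``` and shows that selecting the column with double brackets is another way to keep the data two-dimensional.

```python
import pandas as pd

toy_df = pd.DataFrame({"Weight": [120.0, 200.0, 150.0, 172.0]})

one_dim = toy_df["Weight"].to_numpy()                 # shape (4,)
two_dim = toy_df["Weight"].to_numpy().reshape(-1, 1)  # shape (4, 1)

# Selecting with a list of column names returns a one-column dataframe,
# which is already two-dimensional.
two_dim_df = toy_df[["Weight"]]

print(one_dim.shape, two_dim.shape, two_dim_df.shape)
```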

In [8]:
scaler.fit(weight_column)

MinMaxScaler(copy=True, feature_range=(0, 1))

#### III. Transform Data

In [9]:
scaler.transform(weight_column)

array([[0.   ],
       [1.   ],
       [0.375],
       [0.65 ]])

Note that the ```MinMaxScaler()``` object learns the parameters it needs to normalize the data: the minimum, maximum, and range of the data.

In [10]:
#data minimum
scaler.data_min_

array([120.])

In [11]:
#data maximum
scaler.data_max_

array([200.])

In [12]:
#data range
scaler.data_range_

array([80.])
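
As a quick check, the sketch below recomputes the transformation from these learned attributes and applies the fitted scaler to new values. The names `demo_scaler`, `manual` and `new_weights` are introduced here for illustration, and the new weights are made-up numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

weights = np.array([[120.0], [200.0], [150.0], [172.0]])
demo_scaler = MinMaxScaler().fit(weights)

# Rebuild the transformation from the learned minimum and range.
manual = (weights - demo_scaler.data_min_) / demo_scaler.data_range_
print(np.allclose(manual, demo_scaler.transform(weights)))

# A fitted scaler reuses the learned minimum and range on new data.
new_weights = np.array([[160.0], [220.0]])
print(demo_scaler.transform(new_weights))
```

Because the scaler keeps the minimum and range it learned during ```fit```, a new value above the original maximum (here 220) is mapped above $1$, to $(220-120)/80 = 1.25$.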

### 6.B.2 Exercise: Normalizing the height column

Following the steps above, normalize ```height_column``` and compare the output with the results calculated by hand.

In [13]:
height_column = bd_measurements_df['Height'].to_numpy().reshape(-1, 1)

In [14]:
# enter solution here
scaler.fit(height_column)
scaler.transform(height_column)

array([[1.        ],
       [0.        ],
       [0.71428571],
       [0.32467532]])

### 6.B.3 ```fit_transform```

There is a convenience method, ```fit_transform```, which fits the ```MinMaxScaler``` and transforms the data in one step.

In [15]:
scaler.fit_transform(height_column)

array([[1.        ],
       [0.        ],
       [0.71428571],
       [0.32467532]])
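
The scaler works column by column, so it is also possible to normalize the whole dataframe in one call instead of one column at a time. A minimal sketch, assuming the same four observations (`demo_df` and `normalized_df` are names introduced here):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

demo_df = pd.DataFrame(
    {"Weight": [120.0, 200.0, 150.0, 172.0],
     "Height": [177.0, 100.0, 155.0, 125.0]},
    index=["Observation 1", "Observation 2", "Observation 3", "Observation 4"],
)

# fit_transform returns a NumPy array, so we wrap it back into a dataframe
# to keep the row and column labels.
normalized_df = pd.DataFrame(
    MinMaxScaler().fit_transform(demo_df),
    columns=demo_df.columns,
    index=demo_df.index,
)
print(normalized_df.round(2))
```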

### 6.B.4 Drawbacks of Normalization

Normalization has two main issues: 
- the values it returns aren't easily interpretable
- it's sensitive to outliers.
    
Consider the data set,

| . | Weight/pounds | Height/cm |
| --- | --- | --- |
| **Observation 1** | 150 | 200 |
| **Observation 2** | 1000 | 10240 |
| **Observation 3** | 20000 | 10000 |
| **Observation 4** | 1021 | 10020 |

Normalization returns the data set, 

| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | 0.00 | 0 |
| **Observation 2** | 0.04 | 1.00 |
| **Observation 3** | 1.00 | 0.98 |
| **Observation 4** | 0.04 | 0.98 |

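With outliers, the remaining data can be squashed toward either 0 or 1. Here the weight outlier, 20000, pushes the other weights down to at most 0.04, while the height outlier, 200, pushes the other heights up to at least 0.98.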
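
The squashing effect can be checked directly on the weight column of this data set. The sketch below is illustrative only; `weights_with_outlier`, `scaled` and `non_outliers` are names introduced here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

weights_with_outlier = np.array([[150.0], [1000.0], [20000.0], [1021.0]])
scaled = MinMaxScaler().fit_transform(weights_with_outlier)

# Drop the outlier (row index 2) and look at where the other values ended up.
non_outliers = np.delete(scaled, 2, axis=0)
print(scaled.round(2).ravel())
print(non_outliers.max())
```

All values other than the outlier land in the bottom 5% of the $[0, 1]$ range, so the differences between them become hard to see.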

# 6. Standardization

Standardization is an alternative rescaling that addresses some of these drawbacks of normalization.

Standardization rescales each variable so that its values are measured as the number of standard deviations away from the mean.

From each column entry, standardization subtracts the column mean and then divides by the column's standard deviation.
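
As a formula, this can be written as follows, where $x$ is an entry of a column and $\mu$ and $\sigma$ are that column's mean and standard deviation (these symbols are notation introduced here for illustration):

$$z = \frac{x - \mu}{\sigma}$$

A standardized value of $z = 2$, for example, means the original entry lies two standard deviations above its column mean.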

## 6.A Example

Let's consider the dataset of height and weight,

| . | Weight/pounds | Height/cm |
| --- | --- | --- |
| **Observation 1** | 150 | 200 |
| **Observation 2** | 1000 | 10240 |
| **Observation 3** | 20000 | 10000 |
| **Observation 4** | 1021 | 10020 |

Let's standardize the weight column. First, we subtract the mean, $5542.75$, and then divide by the standard deviation, approximately $8354$.


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | (150-5542.75)/8354 | 200 |
| **Observation 2** | (1000-5542.75)/8354 | 10240 |
| **Observation 3** | (20000-5542.75)/8354 | 10000 |
| **Observation 4** | (1021-5542.75)/8354 | 10020 |

Let's standardize the height column. First, we subtract the mean, $7615$, and then divide by the standard deviation, approximately $4282$.


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | -0.65 | (200-7615)/4282 |
| **Observation 2** | -0.54 | (10240-7615)/4282 |
| **Observation 3** | 1.73 | (10000-7615)/4282 |
| **Observation 4** | -0.54 | (10020-7615)/4282 |

Our final standardized dataset is then, 


| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | -0.65 | -1.73 |
| **Observation 2** | -0.54 | 0.61 |
| **Observation 3** | 1.73 | 0.56 |
| **Observation 4** | -0.54 | 0.56 |
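
The hand calculation above can be reproduced with NumPy. This is a sketch of the arithmetic only; the names `measurements`, `col_mean`, `col_std` and `standardized` are introduced here. Note that it uses the population standard deviation (dividing by the number of observations), which is NumPy's default and matches the values $8354$ and $4282$ used above.

```python
import numpy as np

# Columns: weight (pounds), height (cm); rows: observations 1 to 4.
measurements = np.array([
    [150.0, 200.0],
    [1000.0, 10240.0],
    [20000.0, 10000.0],
    [1021.0, 10020.0],
])

col_mean = measurements.mean(axis=0)
col_std = measurements.std(axis=0)  # ddof=0: population standard deviation

# Subtract the column mean, then divide by the column standard deviation.
standardized = (measurements - col_mean) / col_std
print(standardized.round(2))
```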

## 6.B Standardization with scikit-learn

### 6.B.1 Standardizing body measurements dataframe

Now, we will standardize the same dataset of measurements with scikit-learn and compare the result with our hand calculations.

Below, I create a pandas dataframe of the observations.

In [16]:
data = {'Weight': [150.0, 1000.0, 20000.0, 1021.0],
        'Height': [200.0, 10240.0, 10000.0,10020.0]}

bd_measurements_df = pd.DataFrame(data, 
             index=["Observation 1","Observation 2", "Observation 3", "Observation 4"])

In [17]:
bd_measurements_df

| . | Weight | Height |
| --- | --- | --- |
| **Observation 1** | 150.0 | 200.0 |
| **Observation 2** | 1000.0 | 10240.0 |
| **Observation 3** | 20000.0 | 10000.0 |
| **Observation 4** | 1021.0 | 10020.0 |


#### I. Initialize ```StandardScaler()```

In [18]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

#### II. Train StandardScaler 


Using the ```fit``` method, we teach our ```StandardScaler()``` instance how to scale our data. We do this on the weight column.

In [19]:
weight_column = bd_measurements_df['Weight'].to_numpy().reshape(-1, 1)
scaler.fit(weight_column)

StandardScaler(copy=True, with_mean=True, with_std=True)

#### III. Transform Data

In [20]:
scaler.transform(weight_column)

array([[-0.6455067 ],
       [-0.54376256],
       [ 1.73051814],
       [-0.54124888]])

Note that the ```StandardScaler()``` object learns the parameters it needs to standardize the data: the mean and the scale (standard deviation) of the data.

In [21]:
#data mean
scaler.mean_

array([5542.75])

In [22]:
#data scale
scaler.scale_

array([8354.28977756])
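
One detail worth knowing: the scale learned by ```StandardScaler``` is the population standard deviation (dividing by $n$), while the pandas ```std``` method defaults to the sample standard deviation (dividing by $n-1$). The sketch below compares the two on the weight column; `weights`, `demo_scaler`, `population_std` and `sample_std` are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

weights = pd.Series([150.0, 1000.0, 20000.0, 1021.0], name="Weight")
demo_scaler = StandardScaler().fit(weights.to_numpy().reshape(-1, 1))

population_std = weights.std(ddof=0)  # divides by n
sample_std = weights.std()            # pandas default ddof=1, divides by n - 1

print(demo_scaler.scale_[0], population_std, sample_std)
```

So if a hand calculation with pandas does not match the scaler's output, the ```ddof``` argument is a likely cause.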

### 6.B.2 Exercise: Standardizing the height column

Following the steps above, standardize ```height_column``` and compare the output with the results calculated by hand.

In [23]:
height_column = bd_measurements_df['Height'].to_numpy().reshape(-1, 1)

In [24]:
# enter solution here
scaler.fit(height_column)
scaler.transform(height_column)

array([[-1.73163198],
       [ 0.61301874],
       [ 0.55697131],
       [ 0.56164193]])

### 6.B.3 ```fit_transform```

There is a convenience method, ```fit_transform```, which fits the ```StandardScaler``` and transforms the data in one step.

In [25]:
scaler.fit_transform(height_column)

array([[-1.73163198],
       [ 0.61301874],
       [ 0.55697131],
       [ 0.56164193]])

### 6.B.4 Exercise: Standardizing ```DNA_slices_all_df```

Now let's move back to the k-mer dataset!


Following the steps above, standardize ```DNA_slices_all_df```. Store the standardized array as ```standardized_DNA_data```.

In [26]:
# enter solution here
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_DNA_data = scaler.fit_transform(DNA_slices_all_df)

We convert the output back to a pandas dataframe. 

In [27]:
standardized_DNA_data_df = pd.DataFrame(standardized_DNA_data,
                                        columns=DNA_slices_all_df.columns,
                                        index=DNA_slices_all_df.index)
standardized_DNA_data_df

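
A simple way to sanity-check the result is to confirm that every standardized column has mean 0 and standard deviation 1. The sketch below does this on a small made-up 0/1 matrix rather than on the real data; `toy_dna_df`, the `strand_*` columns and the `sample_*` labels are hypothetical stand-ins for ```DNA_slices_all_df```.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical 0/1 presence matrix standing in for DNA_slices_all_df.
toy_dna_df = pd.DataFrame(
    {"strand_a": [0, 1, 1, 0, 1],
     "strand_b": [1, 1, 0, 0, 0],
     "strand_c": [0, 0, 0, 1, 0]},
    index=["sample_1", "sample_2", "sample_3", "sample_4", "sample_5"],
)

toy_scaler = StandardScaler()
toy_standardized_df = pd.DataFrame(
    toy_scaler.fit_transform(toy_dna_df),
    columns=toy_dna_df.columns,
    index=toy_dna_df.index,
)

# Each column should now have mean 0 and population standard deviation 1.
print(toy_standardized_df.mean().round(6))
print(toy_standardized_df.std(ddof=0).round(6))

# inverse_transform undoes the scaling and recovers the original 0/1 values.
recovered = toy_scaler.inverse_transform(toy_standardized_df.to_numpy())
print(np.allclose(recovered, toy_dna_df.to_numpy()))
```

One caveat: a column in which every sample has the same value has zero variance, so it cannot be rescaled to standard deviation 1 and the check above would not hold for it.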