# Data Normalisation and Transformation

When one or more features affect results disproportionately because of their scale, normalisation or scaling gives a level playing field. In this activity, we will apply different methods for data normalisation and transformation. We first read the dataset used in the first part of the analysis.
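
To see why scale matters, consider a distance-based comparison. The sketch below uses two made-up samples (not taken from the wine data) with one small-range and one large-range feature; the names `a`, `b` and `contribution` are purely illustrative. It shows how the Euclidean distance ends up being driven almost entirely by the feature with the larger numeric range.

```python
import numpy as np

# Two hypothetical samples: [small-range feature, large-range feature]
a = np.array([13.0, 1000.0])
b = np.array([14.0, 1100.0])

diff = a - b
distance = np.sqrt(np.sum(diff ** 2))

# Share of the squared distance contributed by each feature
contribution = diff ** 2 / np.sum(diff ** 2)

print(distance)
print(contribution)
```

After scaling both features to a comparable range, neither would dominate the distance in this way.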


In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv( 'wine_data.csv', header = None, usecols = [0,1,2])
df.columns=['Class label', 'Alcohol', 'Malic acid']
df.head()

In [None]:
df.describe()

As we can see in the tables above, the features Alcohol (percent/volume) and Malic acid (g/l) are measured on different scales, so scaling is necessary before comparing or combining them.

In [None]:
df.Alcohol.mean() / df["Malic acid"].mean() # difference is factor of ~5x

## 1. Z-Score Normalisation (standardisation): 

We use the scikit-learn library to standardise the data (mean=0, SD=1). The class you are going to use is the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">StandardScaler</a> class. More reading material can be found <a href="http://scikit-learn.org/stable/modules/preprocessing.html">here</a>.
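
As a point of reference, standardisation replaces each value x with z = (x - mean) / SD, computed per feature. The following is a minimal sketch on a toy matrix (illustrative values, not the wine data) that computes the z-scores by hand and compares them with `StandardScaler`; the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: rows are samples, columns are features (illustrative values)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Manual z-score: subtract the column mean, divide by the population SD (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
```

Note that `StandardScaler` divides by the population standard deviation, which is NumPy's default for `std()`.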

The task here is to standardise the values of Alcohol and Malic acid, and append the standardised variables to the DataFrame "df" as follows:

<img src="fig_1.png" alt="Smiley face" height="400" width="500">

In [None]:
from sklearn import preprocessing

In [None]:
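# One possible solution: fit the scaler on the two features, then transform them
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])  # NumPy array, one column per feature
df_std[0:5]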

In [None]:
# put it alongside data... to view
df['Ascaled'] = df_std[:,0] # so 'Ascaled' is Alcohol scaled
df['MAscaled'] = df_std[:,1] # and 'MAscaled' is Malic acid scaled
df.head()

Now let's check whether the standardised features have a mean of 0 and an SD of 1.

In [None]:
df.describe() # check that μ = 0 and σ = 1... approx

Or you can print out values:

In [None]:
print('Mean after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:,0].mean(), df_std[:,1].mean()))
print('\nStandard deviation after standardisation:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_std[:,0].std(), df_std[:,1].std()))
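
The "approx" in the `describe()` check above has a simple cause: pandas reports the sample standard deviation (`ddof=1`), whereas `StandardScaler` and NumPy's `std()` use the population standard deviation (`ddof=0`). A small sketch with illustrative values (the variable names are assumptions for this example):

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative values

# Standardise with the population SD (ddof=0), as StandardScaler does
z = (values - values.mean()) / values.std()

sd_population = z.std()          # NumPy default: ddof=0
sd_sample = pd.Series(z).std()   # pandas default: ddof=1, also used by describe()

print(sd_population, sd_sample)
```

The two results differ by a factor of sqrt(n / (n - 1)), which shrinks towards 1 as the number of rows grows.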

#### Compare the variables before and after normalization with plots
To investigate how normalization actually affects the data, we can visualize it by plotting the variable values.

First, plot the original data, i.e., the data before normalization.

In [None]:
%matplotlib inline

In [None]:
df["Alcohol"].plot(), df["Malic acid"].plot()

Now, we plot the standardized data and observe the range and the centre of the distribution for the standardised features.

In [None]:
# or split them from the others
df["MAscaled"].plot(), df["Ascaled"].plot()

You can see from the graphs above that the original and standardized data have the same shape; the standardized values are only shifted and rescaled.
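
The shape is preserved because standardisation is a linear map: every value is shifted by the same mean and divided by the same SD. A minimal sketch on illustrative values (not the wine data) checks that the ordering is unchanged, that the correlation with the original is 1, and that the original values can be recovered:

```python
import numpy as np

x = np.array([12.5, 13.2, 14.1, 11.9, 13.7])  # illustrative values
z = (x - x.mean()) / x.std()

# Undo the transformation: multiply by the SD, then add the mean back
x_back = z * x.std() + x.mean()

same_order = np.array_equal(np.argsort(x), np.argsort(z))
correlation = np.corrcoef(x, z)[0, 1]

print(same_order, correlation)
```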

In [None]:
df["Ascaled"].plot(), df["Alcohol"].plot()

In [None]:
df["MAscaled"].plot(), df["Malic acid"].plot()


## 2. MinMax Normalisation:

In this section, we discuss a different type of normalization that rescales the range of the data. We process the same data we used in the previous section. We can implement this either with scikit-learn or manually.

### 2.1 Using scikit-learn:
Please refer to section 4.3.1.1 "<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range">Scaling features to a range</a>" for a more detailed discussion. Similar to what you have done with the StandardScaler, here you are going to use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">MinMaxScaler</a>.

In [None]:
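# One possible solution: fit the scaler on the two features, then transform them
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])  # NumPy array, one column per feature
df_minmax[0:5]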

### 2.2 Manually:
Of course, you can implement the Min-Max normalization according to the formulas discussed in the lecture.
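
Assuming the lecture formula is the usual x' = (x - min) / (max - min), applied per feature, the sketch below applies it to a whole toy matrix at once (illustrative values, hypothetical variable names) and compares the result with `MinMaxScaler` using its default range of 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (illustrative values, not the wine data)
X = np.array([[11.0, 0.7],
              [12.5, 1.7],
              [14.8, 5.8]])

# Manual min-max scaling, column by column via broadcasting
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_manual = (X - col_min) / (col_max - col_min)

# The same transformation via scikit-learn (default feature_range is (0, 1))
X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
```

The steps that follow do the same thing for a single value at a time.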

Firstly, find the min and max of "df.Alcohol".

In [None]:
minA = df.Alcohol.min()
maxA = df.Alcohol.max()
minA, maxA

Manually apply the min-max normalization to the first value of "df.Alcohol", 

In [None]:
a = df.Alcohol[0] # the first value, for practice
# Write your code here
mma = (a - minA) / (maxA - minA)
mma

and then compare the manually computed value with the one given by the MinMaxScaler above.

In [None]:
df_minmax[0, 0]  # first row, first column (Alcohol)

The two values should be the same.
Now, let's look at the normalization of the max value in "df.Alcohol".

In [None]:
a = df[df.Alcohol == df.Alcohol.max()].Alcohol
mma = (a - minA) / (maxA - minA)
mma

The normalized value of the max must be exactly 1.0; think about the reason! Then, how about the min value of "df.Alcohol"?

In [None]:
print('Min-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:,0].min(), df_minmax[:,1].min()))
print('\nMax-value after min-max scaling:\nAlcohol = {:.2f}, Malic acid = {:.2f}'
      .format(df_minmax[:,0].max(), df_minmax[:,1].max()))

### 2.3 Plot the original, standardised and normalised data values. 

In [None]:
# and plot
%matplotlib inline

from matplotlib import pyplot as plt

def plot():
    f = plt.figure(figsize=(8,6))

    plt.scatter(df['Alcohol'], df['Malic acid'],
            color='green', label='input scale', alpha=0.5)

    # Raw string so that matplotlib's mathtext renders mu and sigma in the legend
    plt.scatter(df_std[:,0], df_std[:,1], color='red',
             label=r'Standardized [$\mu=0$, $\sigma=1$]', alpha=0.3)
    
    plt.scatter(df_minmax[:,0], df_minmax[:,1],
            color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)

    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.tight_layout()
    #f.savefig("z_min_max.pdf", bbox_inches='tight')

plot()
plt.show()

#### The plot above includes the wine datapoints on all three different scales: 
* the input scale  where the alcohol content was measured in volume-percent (green),
* the standardized features (red), and 
* the normalized features (blue). 

#### In the following plot, we zoom in on the three different axis scales while displaying the class labels.

In [None]:
fig, ax = plt.subplots(3, figsize=(6,14))

for a,d,l in zip(range(len(ax)),
               (df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
               ('Input scale',
                'Standardized [u=0 s=1]',
                'min-max scaled [min=0, max=1]')
                ):
    for i,c in zip(range(1,4), ('red', 'blue', 'green')):
        ax[a].scatter(d[df['Class label'].values == i, 0],
                  d[df['Class label'].values == i, 1],
                  alpha=0.5,
                  color=c,
                  label='Class %s' %i
                  )
    ax[a].set_title(l)
    ax[a].set_xlabel('Alcohol')
    ax[a].set_ylabel('Malic Acid')
    ax[a].legend(loc='upper left')
    ax[a].grid()

plt.tight_layout()

plt.show()

## 3. Data Transformation:

Another way to reshape data is to apply a data transformation. We will look at an example of data with a right (positive) skew, where we need to compress large values. We first read the data used for this activity.

In [None]:
import pandas as pd
data = pd.read_csv("bmr.csv")

In [None]:
data.head()

In [None]:
plt.scatter(data["Mass(g)"], data["BMR(W)"]) # before: mass on x, metabolic rate on y, as in the "after" plots

### So, which transformation type will suit this data?

In Tukey's ladder of powers, we discussed different kinds of transformation. Here you are going to compare the following three kinds of transformation:
* Root transformation
* Square power transformation
* Log transformation

The implementation of the root transformation is given as follows. You need to finish the other two kinds of transformation.
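
To get a feel for how each rung of the ladder changes a right-skewed variable, here is a small sketch on synthetic log-normal data (an assumption made for illustration, not the BMR data). It uses pandas' `skew()` as a rough measure of asymmetry, where values near 0 indicate a roughly symmetric distribution.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (log-normal), used only for illustration
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

skewness = {
    'raw': x.skew(),
    'square root': np.sqrt(x).skew(),  # moves down the ladder
    'square': (x ** 2).skew(),         # moves up the ladder
    'log': np.log(x).skew(),           # moves further down the ladder
}

for name, value in skewness.items():
    print(f'{name}: {value:.2f}')
```

On data like this, moving up the ladder stretches the large values even further, while moving down compresses them.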

### 3.1 Root transformation:

In [None]:
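import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Square root of the metabolic rate; i tracks the row label (assumes the default 0..n-1 index)
    data.loc[i, 'lmr'] = math.sqrt(data.loc[i, 'BMR(W)'])
    i += 1
data.head()

In [None]:
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Square root of the body mass
    data.loc[i, 'lbm'] = math.sqrt(data.loc[i, 'Mass(g)'])
    i += 1
data.head()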

In [None]:
plt.scatter(data.lbm, data.lmr) # and after
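
The loops in the cells above process one row at a time. As an alternative, NumPy functions can be applied to a whole column in one step. The sketch below uses a toy DataFrame that reuses the column names from this activity with made-up values; `toy` is a hypothetical name introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the activity data; the values are made up
toy = pd.DataFrame({'Mass(g)': [1600.0, 32000.0, 407000.0],
                    'BMR(W)': [4.9, 50.0, 225.0]})

# Vectorised square root of each column: no explicit loop or counter needed
toy['lbm'] = np.sqrt(toy['Mass(g)'])
toy['lmr'] = np.sqrt(toy['BMR(W)'])

print(toy)
```

The resulting columns are numeric (`float64`) rather than generic Python objects, which tends to be more convenient for plotting and further calculations.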

#### Does it give a better spread of the data? Let's try something else.

### 3.2 Square power transformation:

In [None]:
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    i += 1

    
data.head()

In [None]:
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    
    i += 1

    
data.head()

In [None]:
plt.scatter(data.lbm, data.lmr) # and after

#### Can you justify the output of this figure?

### 3.3 Log transformation:

In [None]:
import math
data['lmr'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    
    i += 1

    
data.head()

In [None]:
data['lbm'] = None
i = 0
for row in data.iterrows():
    # Write your code below
    
    i += 1

    
data.head()

In [None]:
plt.scatter(data.lbm, data.lmr) # and after

The best transformation for this data appears to be the log transformation. As the data is positively skewed, we need to compress large values. That means moving down the ladder of powers, which also spreads out data that is clustered at lower values. Therefore, the logarithm is the appropriate transformation in this case.

Some materials used in this tutorial are based on http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

## 4. Homework:
Consider the following dataset:

In [None]:
body_mass = [32000, 37800, 347000, 4200, 196500, 100000, 4290, 
32000, 65000, 69125, 9600, 133300, 150000, 407000, 115000, 67000, 
325000, 21500, 58588, 65320, 85000, 135000, 20500, 1613, 1618]

metabolic_rate = [49.984, 51.981, 306.770, 10.075, 230.073, 
148.949, 11.966, 46.414, 123.287, 106.663, 20.619, 180.150, 
200.830, 224.779, 148.940, 112.430, 286.847, 46.347, 142.863, 
106.670, 119.660, 104.150, 33.165, 4.900, 4.865]

#### What would be an appropriate transformation to apply to this data? Post your findings in the forum.