## Data Normalization

### Preliminary analysis - Code

In [None]:
non_categorical_attributes = [column for column in df.columns if column not in categorical_attributes and column != 'Target']

In [None]:
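def simple_stats(df, col_names):
  """Return mean, min and max for each column in col_names, sorted by mean (descending)."""
  # Plural names avoid shadowing the built-ins min, max and dict
  names = []
  means = []
  mins = []
  maxs = []

  for col in col_names:
    names.append(col)
    means.append(df[col].mean())
    mins.append(df[col].min())
    maxs.append(df[col].max())

  summary = pd.DataFrame({
      'Feature': names,
      'Mean': means,
      'Min': mins,
      'Max': maxs,
  })

  return summary.sort_values(by=['Mean'], ascending=False)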

In [None]:
simple_stats(df,non_categorical_attributes)

Index,Feature,Mean,Min,Max
0,Previous qualification (grade),132.920606,95.0,190.0
1,Admission grade,127.293939,95.0,190.0
2,Age at enrollment,23.461157,17.0,70.0
15,Unemployment rate,11.630358,7.6,16.2
7,Curricular units 1st sem (grade),10.53486,0.0,18.875
13,Curricular units 2nd sem (grade),10.036155,0.0,18.571429
5,Curricular units 1st sem (evaluations),8.071074,0.0,45.0
11,Curricular units 2nd sem (evaluations),7.763085,0.0,33.0
4,Curricular units 1st sem (enrolled),6.337466,0.0,26.0
10,Curricular units 2nd sem (enrolled),6.296419,0.0,23.0


In [None]:
df.corr()['Target'].sort_values()

Age at enrollment                                -0.267229
Curricular units 2nd sem (without evaluations)   -0.102687
Curricular units 1st sem (without evaluations)   -0.074642
Inflation rate                                   -0.030326
Unemployment rate                                 0.004198
Curricular units 1st sem (credited)               0.046900
GDP                                               0.050260
Curricular units 2nd sem (credited)               0.052402
Curricular units 1st sem (evaluations)            0.059786
Previous qualification (grade)                    0.109464
Curricular units 2nd sem (evaluations)            0.119239
Admission grade                                   0.128058
Curricular units 1st sem (enrolled)               0.161074
Curricular units 2nd sem (enrolled)               0.182897
Curricular units 1st sem (grade)                  0.519927
Curricular units 1st sem (approved)               0.554881
Curricular units 2nd sem (grade)                  0.6053

### Comment

---

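Executing the code below shows the summary statistics of the numeric features:
```
simple_stats(df,non_categorical_attributes)
```
The features live on very different scales (the means in the outputs of this section range from about 0.1 to about 133), so the numeric values require scaling.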

---

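Executing the code below shows the Pearson correlation of each feature with the target:
```
df.corr()['Target'].sort_values()
```
Most of the Pearson correlations are positive; however, only a few features show a notable linear relationship with the target (the semester grade and approved-unit features, at roughly 0.5 to 0.6).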

Therefore, the normalization technique must also work well for variables whose relationship with the target variable is non-linear.
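
Pearson correlation only captures linear association, so a low value does not by itself show whether a feature is unrelated to the target or related to it in a non-linear way. The sketch below shows one possible check, comparing Pearson with Spearman rank correlation, which responds to any monotonic relationship. It uses synthetic data rather than the notebook's `df`; the `demo` frame and its columns `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic example (not the notebook's data): y grows exponentially with x,
# so the relationship is monotonic but far from linear.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
demo = pd.DataFrame({"x": x, "y": np.exp(2 * x)})

# Pearson measures linear association only.
pearson = demo.corr(method="pearson").loc["x", "y"]
# Spearman works on ranks, so it stays high for any monotonic relationship.
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, the same comparison could be run with `df.corr(method='spearman')['Target']`; a feature whose Spearman value is clearly higher than its Pearson value is a candidate for a non-linear but monotonic relationship.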

---
The Box-Cox transformation is a good candidate for transforming the numeric features in this dataset.

The reasoning behind choosing the Box-Cox transformation is that it reduces skewness, can make the data more normally distributed, can improve model performance, and, importantly, works well with variables that have a non-linear relationship with the target variable.
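
For reference, the Box-Cox transform of a strictly positive value $x$ is $(x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $\ln x$ for $\lambda = 0$. The sketch below illustrates the skewness-reduction claim on synthetic, right-skewed data; it does not use the notebook's `df`, and the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed sample (an assumption for illustration).
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# With no lmbda argument, stats.boxcox fits lambda by maximum likelihood
# and returns both the transformed data and the fitted lambda.
transformed, fitted_lambda = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)

print(f"fitted lambda:   {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}")
print(f"skewness after:  {skew_after:.3f}")
```

For log-normal input the fitted lambda should land near 0, which corresponds to a log transform, and the skewness should shrink toward 0.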

---

### Box-Cox transformation - Code

In [None]:
df_box_cox = df.copy()

In [None]:
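from scipy import stats

def box_cox_transformation(x):
  # Box-Cox requires strictly positive input, so shift the column first
  x = x + abs(x.min()) + 1
  # Estimate the lambda that best normalizes the shifted data
  lambda_param = stats.boxcox_normmax(x)
  data_boxcox = stats.boxcox(x, lmbda=lambda_param)

  return data_boxcox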
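
The shift inside `box_cox_transformation` matters because Box-Cox is only defined for strictly positive input. The sketch below uses toy values taken from the `describe()` output in this section to compare the notebook's shift with a hypothetical alternative, `shift_to_one`, which moves the column minimum to exactly 1. Both helper names are assumptions made for this illustration and are not part of the notebook.

```python
import pandas as pd

# Toy columns built from the quantiles reported by describe() in this section.
grades = pd.Series([95.0, 125.0, 133.1, 140.0, 190.0])   # minimum is already positive
enrolled = pd.Series([0.0, 5.0, 6.0, 7.0, 26.0])          # minimum is zero

def shift_as_in_notebook(x):
    # Same shift as box_cox_transformation: add |min| + 1
    return x + abs(x.min()) + 1

def shift_to_one(x):
    # Hypothetical alternative: move the minimum to exactly 1
    return x - x.min() + 1

for name, col in [("grades", grades), ("enrolled", enrolled)]:
    print(name, shift_as_in_notebook(col).min(), shift_to_one(col).min())
```

The two shifts agree whenever the column minimum is zero or negative. For a column whose minimum is already positive, the notebook's version moves the data further from zero, which changes the input that the lambda is fitted on.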


In [None]:
df_box_cox.describe().transpose()

Feature,count,mean,std,min,25%,50%,75%,max
Previous qualification (grade),3630.0,132.920606,13.238373,95.0,125.0,133.1,140.0,190.0
Admission grade,3630.0,127.293939,14.611295,95.0,118.0,126.5,135.1,190.0
Age at enrollment,3630.0,23.461157,7.827994,17.0,19.0,20.0,25.0,70.0
Curricular units 1st sem (credited),3630.0,0.75427,2.477277,0.0,0.0,0.0,0.0,20.0
Curricular units 1st sem (enrolled),3630.0,6.337466,2.570773,0.0,5.0,6.0,7.0,26.0
Curricular units 1st sem (evaluations),3630.0,8.071074,4.286632,0.0,6.0,8.0,10.0,45.0
Curricular units 1st sem (approved),3630.0,4.79146,3.237845,0.0,3.0,5.0,6.0,26.0
Curricular units 1st sem (grade),3630.0,10.53486,5.057694,0.0,11.0,12.341429,13.5,18.875
Curricular units 1st sem (without evaluations),3630.0,0.128926,0.679111,0.0,0.0,0.0,0.0,12.0
Curricular units 2nd sem (credited),3630.0,0.581818,2.022688,0.0,0.0,0.0,0.0,19.0


In [None]:
df_box_cox[non_categorical_attributes] = df_box_cox[non_categorical_attributes].apply(box_cox_transformation)



In [None]:
df_box_cox.describe().transpose()

Feature,count,mean,std,min,25%,50%,75%,max
Previous qualification (grade),3630.0,3.527568,0.02261759,3.456182,3.51448,3.52864,3.540247,3.613993
Admission grade,3630.0,0.74563,4.587651e-05,0.74551,0.745602,0.745631,0.745657,0.745783
Age at enrollment,3630.0,0.052786,6.939850000000001e-18,0.052786,0.052786,0.052786,0.052786,0.052786
Curricular units 1st sem (credited),3630.0,0.133987,0.3545596,0.0,0.0,0.0,0.0,1.384014
Curricular units 1st sem (enrolled),3630.0,4.413167,1.487907,0.0,3.688126,4.289148,4.867161,13.75298
Curricular units 1st sem (evaluations),3630.0,5.743552,2.678112,0.0,4.578287,5.850329,7.061682,24.255546
Curricular units 1st sem (approved),3630.0,3.634509,2.262557,0.0,2.509428,3.928117,4.597643,15.678837
Curricular units 1st sem (grade),3630.0,7417.332206,4773.834,0.0,4869.71879,7418.373672,10326.755325,36132.465766
Curricular units 1st sem (without evaluations),3630.0,0.034049,0.1350471,0.0,0.0,0.0,0.0,0.775188
Curricular units 2nd sem (credited),3630.0,0.110482,0.3039932,0.0,0.0,0.0,0.0,1.239678
