### Utkarsha Vidhale

## Data Pre-processing

### Table of Contents


[01. Introductions](#a1) 
 
[02. Deal with missing values](#a2) 
 
[03. Normalization](#a3) 
 
[04. Data transformation](#a4) 
 
[05. Feature selection](#a5)
 
[06. Feature reduction](#a6)
 
[07. Data Splits: Examples](#a7)

<a id="a1"></a>
### 01. Introductions


Data preprocessing may include the following operations:
- file load
- deal with missing values
- slicing data
- data normalization
- data smoothing
- data transformation, numerical to categorical
- data transformation, categorical to numerical
- feature selection
- feature reduction
- some special preprocessing, such as the operations in text mining, e.g., stopword removal, tokenization, TF-IDF weighting


#### The following operations will use Data_Students.csv as the data set

<a id="a2"></a>
### 02. Deal with missing values 

Import Python Libraries

In [None]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
from IPython.display import display, HTML

In [None]:
import warnings
warnings.filterwarnings('ignore')

Reading the csv file

In [None]:
df=pd.read_csv('data_students.csv')
cols=df.columns

Get dimensions of data

In [None]:
print(df.shape)

Print the data types

In [None]:
print(df.dtypes)

Print each column's name and data type, along with a boolean flag that shows whether the column contains missing values

In [None]:
print('ColumnName, DataType, MissingValues')
for i in cols:
    print(i, ',', df[i].dtype,',',df[i].isnull().any())


Identified columns with missing values:
 - `Age` 
 
 - `Hours on Assignments` 
 
 - `Hours on Games` 
 
 - `Exam` 
 
 - `Grade`








Print and display dataframe as tables in HTML

In [None]:
display(HTML(df.head(10).to_html()))  

By using `GradeLetter` as label, visualize the data

In [None]:
sns.set()
sns.pairplot(df, hue='GradeLetter', height=2);

Calculate mean value by ignoring missing values

In [None]:
mean_age=df['Age'].mean(skipna=True)
mean_hr_assignment=df['Hours on Assignments'].mean(skipna=True)
mean_hr_game=df['Hours on Games'].mean(skipna=True)
mean_exam=df['Exam'].mean(skipna=True)
mean_grade=df['Grade'].mean(skipna=True)

Replace missing values in numerical variables by using mean value

In [None]:
# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update df in recent pandas versions
df["Age"] = df["Age"].fillna(mean_age)
df["Hours on Assignments"] = df["Hours on Assignments"].fillna(mean_hr_assignment)
df["Hours on Games"] = df["Hours on Games"].fillna(mean_hr_game)
df["Exam"] = df["Exam"].fillna(mean_exam)
df["Grade"] = df["Grade"].fillna(mean_grade)

Checking again whether there are missing values

In [None]:
df.isnull().sum()
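
The mean imputation above can also be written for all numeric columns at once. The following is a minimal sketch on a toy dataframe; the values are made up for illustration and are not taken from the student data.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical values, not from data_students.csv)
toy = pd.DataFrame({"Age": [20.0, np.nan, 24.0],
                    "Exam": [70.0, 80.0, np.nan]})

# DataFrame.mean() skips NaN by default; passing the resulting Series to
# fillna fills each column with its own mean
toy_filled = toy.fillna(toy.mean())

print(toy_filled)
print(toy_filled.isnull().sum())
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is best suited to columns with only a few missing values.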

<a id="a3"></a>
### 03. Normalization
 
 
- Finding numeric columns

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# get column names
cols_numeric = df.select_dtypes(include=numerics).columns.tolist()
# get column indices
cols_numeric_index=[df.columns.get_loc(col) for col in cols_numeric]
print('Numerical column names:\n',cols_numeric)

In [None]:
print('Numerical column indices:\n',cols_numeric_index)

In [None]:
# creating a copy first
df_norm=df.copy(deep=True)

#### Normalization method 1

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[cols_numeric]=scaler.fit_transform(df[cols_numeric])
display(HTML(df.head(10).to_html()))

#### Normalization method 2

In [None]:
for col in cols_numeric:
    df_norm[col]=(df[col]-df[col].min())/(df[col].max()-df[col].min())
    
  
# drop column ID since an identifier carries no predictive information
df_norm=df_norm.drop(columns='ID')


df_norm.head(10)
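
Both methods apply the same min-max formula, x' = (x - min) / (max - min), which maps every numeric column to the range [0, 1]. The sketch below checks on a toy dataframe (hypothetical values) that `MinMaxScaler` and the manual formula agree.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy data (hypothetical values)
toy = pd.DataFrame({"Exam": [50.0, 75.0, 100.0],
                    "Age": [18.0, 20.0, 26.0]})

# method 1: scikit-learn scaler, returns a NumPy array
scaled_sklearn = MinMaxScaler().fit_transform(toy)

# method 2: the same formula written with pandas, column by column
scaled_manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled_manual)
print(np.allclose(scaled_sklearn, scaled_manual.to_numpy()))
```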

<a id="a4"></a>
### 04. Data transformation

In [None]:
df_transform=df_norm.copy(deep=True)   
# print out and display dataframe as tables in HTML
display(HTML(df_transform.head(5).to_html()))

- Convert numerical to categorical data, e.g., `Age`

In [None]:
df_transform['Age'] = pd.cut(df_transform['Age'],8)
display(HTML(df_transform.head(5).to_html()))
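
`pd.cut` with an integer argument splits the value range into that many equal-width bins. A small sketch with made-up ages shows the resulting intervals, and how `labels=False` returns the bin index instead of the interval.

```python
import pandas as pd

# toy ages (hypothetical values)
ages = pd.Series([18, 21, 25, 30, 42], name="Age")

# three equal-width bins over the range 18..42 (width 8)
binned = pd.cut(ages, 3)

# labels=False returns the integer index of the bin for each value
codes = pd.cut(ages, 3, labels=False)

print(binned.value_counts(sort=False))
print(codes.tolist())
```

Integer bin codes are convenient when a later step, such as a regression or PCA, needs numeric input rather than interval labels.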

- Convert categorical to numerical data (one binary column per category), one variable at a time
 
 - `Degree`

In [None]:
print(df_transform['Degree'].dtype)

In [None]:
df_dummies_degree=pd.get_dummies(df_transform['Degree'])
print(df_dummies_degree.head(5))


In [None]:
print(df_dummies_degree.dtypes)

In [None]:
# cast directly to int; going through str fails when get_dummies returns booleans
df_dummies_degree=df_dummies_degree.astype(int)
df_dummies_degree.dtypes

 
 - `Nationality`

In [None]:
print(df_transform['Nationality'].dtype)

In [None]:
df_dummies_nation=pd.get_dummies(df_transform['Nationality'])
print(df_dummies_nation.head(5))

 
  -  `Gender`

In [None]:
print(df_transform['Gender'].dtype)

In [None]:
df_dummies_gender=pd.get_dummies(df_transform['Gender'])
print(df_dummies_gender.head(5))

 
 - `GradeLetter`

In [None]:
print(df_transform['GradeLetter'].dtype)

In [None]:
df_dummies_gletter=pd.get_dummies(df_transform['GradeLetter'])
print(df_dummies_gletter.head(5))
print(df_dummies_gletter.dtypes)

Adding binary variables to dataframe

In [None]:
df_transform=df_transform.join(df_dummies_degree)
df_transform=df_transform.join(df_dummies_nation)
df_transform=df_transform.join(df_dummies_gender)
#df_transform=df_transform.join(df_dummies_gletter)
# remove the original categorical variable
df_transform=df_transform.drop(columns='Degree')
df_transform=df_transform.drop(columns='Nationality')
df_transform=df_transform.drop(columns='Gender')
#df_transform=df_transform.drop('GradeLetter',1)



display(HTML(df_transform.head(5).to_html()))
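
The encode, join, and drop steps above can be collapsed into a single `pd.get_dummies` call on the whole dataframe. This is a sketch on a toy dataframe; the category values `BS` and `MS` are assumptions for illustration, not values confirmed from the data set.

```python
import pandas as pd

# toy data (hypothetical category values)
toy = pd.DataFrame({"Degree": ["BS", "MS", "BS"],
                    "Exam": [0.5, 0.9, 0.7]})

# columns= selects what to encode; the original column is removed and the
# new binary columns are prefixed with its name; dtype=int yields 0/1
encoded = pd.get_dummies(toy, columns=["Degree"], dtype=int)

print(encoded)
```

The column prefix also avoids name clashes when two categorical variables share a category label.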

<a id="a5"></a>
### 05. Feature selection

In [None]:
import matplotlib.pyplot as plt

# print out and display dataframe as tables in HTML
display(HTML(df_transform.head(10).to_html()))

# set features and labels
x = df_transform.drop(columns='GradeLetter')
y = df_transform['GradeLetter']

#### Feature selection by using Filter model 
 
Pearson correlation is used as the selection criterion:

- Pearson correlation can only be computed between numerical variables.
- In this data, `GradeLetter` is highly correlated with the numerical variable `Grade`, so `Grade` stands in as the target.


In [None]:
# calculate correlation and show in heatmap
plt.figure(figsize=(12,10))
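# numeric_only=True skips non-numeric columns such as GradeLetter and the binned Age
cor = df_transform.corr(numeric_only=True)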
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
#Correlation with output variable
cor_target = abs(cor["Grade"])
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.5]
print('\nSelected features by Filter model:\n',relevant_features)
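
The filter step can be sketched on synthetic data where the answer is known in advance. The column names `Hours` and `Noise` and the generated values are assumptions for illustration only; the 0.5 threshold is the one used above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# synthetic data: Grade depends on Hours, while Noise is unrelated
hours = rng.uniform(0, 10, n)
noise_col = rng.normal(size=n)
grade = 5 * hours + rng.normal(scale=2.0, size=n)
toy = pd.DataFrame({"Hours": hours, "Noise": noise_col, "Grade": grade})

# absolute Pearson correlation of every feature with the target
cor_target = toy.corr()["Grade"].abs().drop("Grade")

# keep the features whose correlation exceeds the threshold
selected = cor_target[cor_target > 0.5].index.tolist()

print(cor_target)
print(selected)
```

Dropping the target from `cor_target` keeps it from selecting itself, since its correlation with itself is always 1.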

#### Feature selection by using Wrapper model 
 
A machine learning task is involved in the Wrapper model:

- We use the performance of the machine learning task to select influential features.
- In this example, we use backward elimination in a linear regression that predicts `Grade`.

     
##### Backward Elimination





In [None]:
import statsmodels.api as sm
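# keep numeric and dummy (bool) columns only; this skips the nominal GradeLetter
# and the binned Age, which cannot be cast to float
cols = df_transform.select_dtypes(include=['number', 'bool']).columns.tolist()
cols.remove('Grade') # Grade is the regression target, not a predictor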
print('\n x variables: ',cols)

In [None]:
df_transform.dtypes

In [None]:
y=list(df_transform['Grade']) # using Grade as y variable in linear regression
pmax = 1
while (len(cols)>0):
    p= []
    X_1 = df_transform[cols]
    #print('',X_1)
    #print('',X_1.dtypes)
    X_1 = sm.add_constant(X_1)
    # y is a plain list, so convert it with NumPy rather than calling a pandas method
    model = sm.OLS(np.asarray(y, dtype=float), X_1.astype(float)).fit()
    p = pd.Series(model.pvalues.values[1:],index = cols)      
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    if(pmax>0.05):
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print('\nSelected features by Wrapper model (regression):\n',selected_features_BE)

In [None]:
# Feature selection by using Wrapper model ################################################
# This example shows that we can use impurity criterion in decision trees to select important features

from sklearn.ensemble import ExtraTreesClassifier

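x = df_transform.drop(columns='GradeLetter')
# Age was binned with pd.cut in section 04; use its integer bin codes so every feature is numeric
x['Age'] = x['Age'].cat.codes
y = df_transform['GradeLetter']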
display(HTML(x.head(10).to_html()))

model = ExtraTreesClassifier()
model.fit(x, y)

values=model.feature_importances_.tolist()
keys=x.columns.tolist()
d = dict(zip(keys, values))
# sort pairs by values descending
s = [(k, d[k]) for k in sorted(d, key=d.get, reverse=True)]


print('\nSelected features by Wrapper model (classification):\n')
for k, v in s:
    print(k,'\t',v)
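
Impurity-based importances from tree ensembles are normalized so that they sum to 1. This style of selection is often classed as an embedded method, since the ranking comes from inside the model. The sketch below uses synthetic data with one informative column to show the expected ranking; the column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n = 300

# synthetic data: the label depends only on the column "signal"
X = pd.DataFrame({"signal": rng.normal(size=n),
                  "noise": rng.normal(size=n)})
y = (X["signal"] > 0).astype(int)

# a fixed random_state makes the ranking reproducible
model = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# pair each importance with its column name and sort descending
importances = pd.Series(model.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```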

<a id="a6"></a>
### 06. Feature reduction

In [None]:

# Example of PCA

from sklearn.decomposition import PCA

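x = df_transform.drop(columns='GradeLetter')
# Age was binned with pd.cut in section 04; use its integer bin codes so PCA gets numeric input
x['Age'] = x['Age'].cat.codes
y = df_transform['GradeLetter']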
display(HTML(x.head(10).to_html()))

# feature extraction
pca = PCA(n_components=5)
fit = pca.fit(x)

# summarize components
print('Explained variance: ', fit.explained_variance_ratio_)
print('\nPCAs:\n', fit.components_)

# project the data onto the principal components and output new features
# for example, we keep the top-3 principal components

PCAs = pca.fit_transform(x)
PCAs_selected = PCAs[:,:3]
df_PCAs = pd.DataFrame(data=PCAs_selected, columns=['PC1','PC2','PC3'])
df_PCAs['GradeLetter']=y

display(HTML(df_PCAs.head(10).to_html()))

# write new data to external files
df_PCAs.to_csv('Data_Students_PCA.csv', sep=',')
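
The explained variance ratio shows how much of the total variance each principal component captures, which helps when choosing how many components to keep. Below is a small sketch on synthetic data in which two of the three columns are nearly collinear, so the first component should dominate.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# synthetic data: column 2 is almost a multiple of column 1, column 3 is independent
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(100, 1)),
               rng.normal(size=(100, 1))])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # coordinates of each row on the two components

print(pca.explained_variance_ratio_)
print(scores.shape)
```

PCA is sensitive to scale, so features are usually normalized or standardized first, as done in section 03.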

<a id="a7"></a>
### 07. Data Splits: Examples

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns


# hold-out split and evaluations
# x_train, x_test, y_train, y_test = train_test_split(df, y_encoded, test_size=0.2)

# N-fold cross validation
# acc=cross_val_score(clf, x, y, cv=5, scoring='accuracy').mean()
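
The two commented templates above can be made runnable as follows. This is a sketch under assumptions: the data come from `make_classification` rather than the student data set, and `DecisionTreeClassifier` stands in for the unspecified `clf`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic classification data (stand-in for the student data)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# hold-out split: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
clf.fit(x_train, y_train)
holdout_acc = clf.score(x_test, y_test)

# 5-fold cross validation: mean accuracy over the five folds
cv_acc = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

print(holdout_acc, cv_acc)
```

A single hold-out score depends on which rows land in the test set; the cross-validated mean is usually the more stable estimate on small data sets.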