# Data preparation with sklearn

## Exercise 1

Let's assume we have a small database of flu patients. We want to build a model that predicts whether a patient needs hospitalization, given the patient's characteristics. Before that, however, we need to pre-process the data.

We have the following attributes:

* Age: stored in years
* Gender: Male / Female 
* Health status: Excelent / Good / Poor
* Disease symptoms: on a scale from 0 to 10, where 0 is asymptomatic
* Hospitalization need: 0 if the patient does not need hospitalization; 1 if they do
    
The data is stored in lists. We provide here an example with 5 patients:

In [14]:
age = [70, 60, 35, 38, 86]  # no missing values here (None would denote one in Python)
gender = ["Male", "Female", "Male", "Male", "Female"]  
health_status = ["Excelent", "Poor", "Poor", "Excelent", "Good"]  
disease_symptoms = [0, 7, 5, 8, 9]  
hospitalization = [0, 1, 1, 0, 0] 

#### Exercise 1.1

Create a dataframe called ```df``` that contains the provided data.

In [15]:
import pandas as pd

df = pd.DataFrame(data={'age':age,
                        'gender':gender,
                        'health_status':health_status,
                        'disease_symptoms':disease_symptoms,
                        'hospitalization':hospitalization
                        })

df

| | age | gender | health_status | disease_symptoms | hospitalization |
|---|---|---|---|---|---|
| 0 | 70 | Male | Excelent | 0 | 0 |
| 1 | 60 | Female | Poor | 7 | 1 |
| 2 | 35 | Male | Poor | 5 | 1 |
| 3 | 38 | Male | Excelent | 8 | 0 |
| 4 | 86 | Female | Good | 9 | 0 |


#### Exercise 1.2

Extract the features matrix and target array from the original DataFrame and store them in two new variables ```X``` and ```y```. Use column ```hospitalization``` as the dependent variable.

In [16]:
X = df.drop(columns='hospitalization')
y = df['hospitalization']

#### Exercise 1.3

Use the sklearn library to create a one-hot encoder for the ```gender``` attribute that adds two new columns, ```female``` and ```male```, to your dataframe. Remember to remove the original attribute.

In [17]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead
val = encoder.fit_transform(X[['gender']])

val

array([[0., 1.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [1., 0.]])

In [18]:
X['female'] = val[:,0]
X['male'] = val[:,1]

X.drop(columns='gender',inplace=True)
#X = X.drop(columns='gender',inplace=False)
#X = X.drop(columns='gender')

X

| | age | health_status | disease_symptoms | female | male |
|---|---|---|---|---|---|
| 0 | 70 | Excelent | 0 | 0.0 | 1.0 |
| 1 | 60 | Poor | 7 | 1.0 | 0.0 |
| 2 | 35 | Poor | 5 | 0.0 | 1.0 |
| 3 | 38 | Excelent | 8 | 0.0 | 1.0 |
| 4 | 86 | Good | 9 | 1.0 | 0.0 |
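The solution above assigns `val[:,0]` to ```female``` and `val[:,1]` to ```male``` by hand, which relies on the encoder sorting the categories alphabetically. A minimal sketch of an alternative is to read the column order from the fitted encoder's `categories_` attribute. The names `X_demo`, `encoded_df` and `column_names` are illustrative and not part of the exercise, and `.toarray()` is used here only to get a dense array regardless of the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data mirroring the age and gender columns of Exercise 1
X_demo = pd.DataFrame({"age": [70, 60, 35, 38, 86],
                       "gender": ["Male", "Female", "Male", "Male", "Female"]})

encoder = OneHotEncoder()
# The default output is a sparse matrix; .toarray() makes it a dense NumPy array
encoded = encoder.fit_transform(X_demo[["gender"]]).toarray()

# categories_[0] lists the categories in the same order as the output columns
column_names = [str(c).lower() for c in encoder.categories_[0]]

# Build the new columns from the encoder's own ordering, then drop the original
encoded_df = pd.DataFrame(encoded, columns=column_names, index=X_demo.index)
X_demo = pd.concat([X_demo.drop(columns="gender"), encoded_df], axis=1)
print(X_demo)
```

Taking the names from the encoder avoids silently swapping the two columns if the category labels ever change.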


#### Exercise 1.4

Use the sklearn library to create an integer encoder for the ```health_status``` attribute. Store the new variable in a column called ```health_status_ENC``` and remove the original column.

In [13]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['Poor','Good','Excelent']])
X['health_status_ENC'] = encoder.fit_transform(X[['health_status']])

# Run this cell only once: after the drop below, 'health_status' no longer exists,
# so re-running the cell raises a KeyError on the fit_transform line above.
X.drop(columns='health_status', inplace=True)

X
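To see which integer the ordinal encoder assigns to each health status, one option is to inspect the fitted `categories_` attribute and to round-trip the codes with `inverse_transform`. This is a small sketch on a standalone copy of the column; the names `status`, `codes`, `mapping` and `decoded` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Standalone copy of the health_status column from Exercise 1
status = pd.DataFrame(
    {"health_status": ["Excelent", "Poor", "Poor", "Excelent", "Good"]})

# The order of the list defines the integer codes: Poor=0, Good=1, Excelent=2
encoder = OrdinalEncoder(categories=[["Poor", "Good", "Excelent"]])
codes = encoder.fit_transform(status[["health_status"]])

# Map each category to the integer the encoder assigned to it
mapping = {str(cat): i for i, cat in enumerate(encoder.categories_[0])}
print(mapping)  # {'Poor': 0, 'Good': 1, 'Excelent': 2}

# inverse_transform recovers the original labels from the codes
decoded = encoder.inverse_transform(codes)
print(decoded.ravel())
```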

#### Exercise 1.5

Write the code to split the datasets ```X``` and ```y``` into a training set and a test set using the sklearn library. Use the common names ```X_train, X_test, y_train, y_test``` to refer to the different sets.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)

X_train, X_test, y_train, y_test

(   age  disease_symptoms  female  male  health_status_ENC
 2   35                 5     0.0   1.0                0.0
 4   86                 9     1.0   0.0                1.0
 0   70                 0     0.0   1.0                2.0,
    age  disease_symptoms  female  male  health_status_ENC
 1   60                 7     1.0   0.0                0.0
 3   38                 8     0.0   1.0                2.0,
 2    1
 4    0
 0    0
 Name: hospitalization, dtype: int64,
 1    1
 3    0
 Name: hospitalization, dtype: int64)

Repeat the exercise above, but this time set the relative sizes of the training and test sets to 0.6 and 0.4, respectively.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.4)
X_train, X_test, y_train, y_test

(   age  disease_symptoms  female  male  health_status_ENC
 4   86                 9     1.0   0.0                1.0
 3   38                 8     0.0   1.0                2.0
 2   35                 5     0.0   1.0                0.0,
    age  disease_symptoms  female  male  health_status_ENC
 1   60                 7     1.0   0.0                0.0
 0   70                 0     0.0   1.0                2.0,
 4    0
 3    0
 2    1
 Name: hospitalization, dtype: int64,
 1    1
 0    0
 Name: hospitalization, dtype: int64)
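Because `train_test_split` shuffles the rows, the two runs above return different rows each time they are executed. A minimal sketch of a reproducible split follows; `random_state=42` is an arbitrary choice, not something the exercise prescribes, and `X_demo` / `y_demo` are illustrative stand-ins for ```X``` and ```y```.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for X and y with the five patients of Exercise 1
X_demo = pd.DataFrame({"age": [70, 60, 35, 38, 86],
                       "disease_symptoms": [0, 7, 5, 8, 9]})
y_demo = pd.Series([0, 1, 1, 0, 0], name="hospitalization")

# Fixing random_state makes the shuffle, and therefore the split, repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=42)

# With 5 rows and test_size=0.4 we get 3 training rows and 2 test rows
print(len(X_train), len(X_test))  # 3 2

# Features and targets stay aligned: they share the same row index
print(X_train.index.equals(y_train.index))  # True
```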

#### Exercise 1.6

Write the code to correctly normalize datasets ```X_train``` and ```X_test``` using the MinMaxScaler from the sklearn library. When you are done, write the code to check that both sets are indeed properly normalized.

In [9]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train=scaler.fit_transform(X_train)  # learn the min/max from the training set only
X_test=scaler.transform(X_test)        # reuse the training min/max on the test set

X_train

array([[1.        , 1.        , 1.        , 0.        , 0.5       ],
       [0.05882353, 0.75      , 0.        , 1.        , 1.        ],
       [0.        , 0.        , 0.        , 1.        , 0.        ]])
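The exercise also asks for a check that both sets are properly normalized. One possible check, sketched here on small hypothetical matrices (`X_train_demo` and `X_test_demo` are illustrative, not the notebook's variables), is that every training column spans exactly 0 to 1 after scaling. Test values are scaled with the training minimum and maximum, so they can legitimately fall slightly outside that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test matrices with two features (age, disease_symptoms)
X_train_demo = np.array([[35.0, 5.0], [86.0, 9.0], [70.0, 0.0]])
X_test_demo = np.array([[60.0, 7.0], [38.0, 8.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # fit on training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse training min/max

# Each training column should have minimum 0 and maximum 1
print(np.allclose(X_train_scaled.min(axis=0), 0.0))  # True
print(np.allclose(X_train_scaled.max(axis=0), 1.0))  # True

# Test columns are only expected to lie close to [0, 1], not to span it
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```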

## Exercise 2

Follow this [link](https://urledu-my.sharepoint.com/:u:/g/personal/jordi_nin_esade_edu/EecC8IKo5dtBhmZTGKObe94BwcefmbK8__aJVkoOjcO7JQ?e=vFwvtd) to download a dataset containing information about different family structures. The data are provided in CSV format. You might need to unzip the file.

The dataset contains the following attributes:

* gender
* num_children
* income
* social_class

#### Exercise 2.1

Create a dataframe called ```df``` that contains the provided data.

In [12]:
import pandas as pd

df = pd.read_csv('/data/client-files/jessica/prod/social_class.csv')

df.head()

| | gender | num_children | income | social_class |
|---|---|---|---|---|
| 0 | Female | 4 | 2500.0 | Middle-Lower |
| 1 | Hombre | 1 | 3002.3 | Middle-Middle |
| 2 | Hombre | 1 | 4274.1 | Middle-Upper |
| 3 | Hombre | 0 | 1200.0 | Middle-Middle |
| 4 | Hombre | 1 | 2774.6 | Middle-Lower |


#### Exercise 2.2

Extract the features matrix and target array from the original DataFrame and store them in two new variables ```X``` and ```y```. Use column ```income``` as the dependent variable.

In [35]:
y = df['income']
X = df.drop(columns='income')

X

| | gender | num_children | social_class |
|---|---|---|---|
| 0 | Female | 4 | Middle-Lower |
| 1 | Hombre | 1 | Middle-Middle |
| 2 | Hombre | 1 | Middle-Upper |
| 3 | Hombre | 0 | Middle-Middle |
| 4 | Hombre | 1 | Middle-Lower |
| ... | ... | ... | ... |
| 995 | Hombre | 1 | Middle-Upper |
| 996 | Female | 4 | Middle-Upper |
| 997 | Hombre | 0 | Middle-Middle |
| 998 | Hombre | 1 | Middle-Middle |


#### Exercise 2.3

Use the sklearn library to create a one-hot encoder for the ```gender``` attribute. Store the resulting information in two new columns called ```male``` and ```female``` and remove the original attribute from the dataframe.

In [37]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead
val = encoder.fit_transform(X[['gender']])

X['female'] = val[:,0]
X['male'] = val[:,1]

X.drop(columns='gender',inplace=True)
#X = X.drop(columns='gender',inplace=False)
#X = X.drop(columns='gender')

X

| | num_children | social_class | female | male |
|---|---|---|---|---|
| 0 | 4 | Middle-Lower | 1.0 | 0.0 |
| 1 | 1 | Middle-Middle | 0.0 | 1.0 |
| 2 | 1 | Middle-Upper | 0.0 | 1.0 |
| 3 | 0 | Middle-Middle | 0.0 | 1.0 |
| 4 | 1 | Middle-Lower | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 995 | 1 | Middle-Upper | 0.0 | 1.0 |
| 996 | 4 | Middle-Upper | 1.0 | 0.0 |
| 997 | 0 | Middle-Middle | 0.0 | 1.0 |
| 998 | 1 | Middle-Middle | 0.0 | 1.0 |


#### Exercise 2.4

In [39]:
#X['social_class'].unique()

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['Lower','Middle-Lower','Middle-Middle', 'Middle-Upper', 'Upper']])
X['social_class'] = encoder.fit_transform(X[['social_class']])

X

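| | num_children | social_class | female | male |
|---|---|---|---|---|
| 0 | 4 | 1.0 | 1.0 | 0.0 |
| 1 | 1 | 2.0 | 0.0 | 1.0 |
| 2 | 1 | 3.0 | 0.0 | 1.0 |
| 3 | 0 | 2.0 | 0.0 | 1.0 |
| 4 | 1 | 1.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... |
| 995 | 1 | 3.0 | 0.0 | 1.0 |
| 996 | 4 | 3.0 | 1.0 | 0.0 |
| 997 | 0 | 2.0 | 0.0 | 1.0 |
| 998 | 1 | 2.0 | 0.0 | 1.0 |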
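When the category order is passed explicitly, `OrdinalEncoder` raises an error by default if the data contain a value that is missing from the list. A quick way to guard against that, sketched below on a tiny hypothetical sample (`classes`, `order` and `unexpected` are illustrative names), is to compare the unique values in the column against the declared order before fitting.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Declared order of the social classes, from lowest to highest
order = ["Lower", "Middle-Lower", "Middle-Middle", "Middle-Upper", "Upper"]

# Tiny hypothetical sample of the social_class column
classes = pd.DataFrame({"social_class": [
    "Middle-Lower", "Middle-Middle", "Middle-Upper", "Upper", "Lower"]})

# Values present in the data but absent from the declared order
unexpected = set(classes["social_class"].unique()) - set(order)
print(unexpected)  # set()

# Safe to fit: every observed value has a position in the order
encoder = OrdinalEncoder(categories=[order])
codes = encoder.fit_transform(classes[["social_class"]]).ravel()
print(codes)  # [1. 2. 3. 4. 0.]
```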


#### Exercise 2.5

Write the code to split the datasets ```X``` and ```y``` into a training set and a test set using the sklearn library. Use the common names ```X_train, X_test, y_train, y_test```.

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train, X_test, y_train, y_test

(     num_children  social_class  female  male
 288             2           4.0     0.0   1.0
 311             2           2.0     0.0   1.0
 617             4           1.0     0.0   1.0
 262             4           0.0     0.0   1.0
 13              2           4.0     0.0   1.0
 ..            ...           ...     ...   ...
 155             4           1.0     1.0   0.0
 582             2           0.0     0.0   1.0
 709             0           1.0     1.0   0.0
 200             1           2.0     1.0   0.0
 702             1           3.0     0.0   1.0
 
 [750 rows x 4 columns],      num_children  social_class  female  male
 963             0           0.0     0.0   1.0
 711             2           4.0     0.0   1.0
 125             2           0.0     0.0   1.0
 856             2           3.0     1.0   0.0
 526             5           1.0     1.0   0.0
 ..            ...           ...     ...   ...
 768             0           4.0     1.0   0.0
 604             3           1.0  

#### Exercise 2.6

Write the code to correctly normalize datasets ```X_train``` and ```X_test``` using the StandardScaler from the sklearn library.

In [41]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train=scaler.fit_transform(X_train)  # learn the mean and std from the training set only
X_test=scaler.transform(X_test)        # reuse the training mean and std on the test set

X_train

array([[ 0.23338003,  1.83428387, -0.65881288,  0.65881288],
       [ 0.23338003,  0.15042475, -0.65881288,  0.65881288],
       [ 1.46602101, -0.69150481, -0.65881288,  0.65881288],
       ...,
       [-0.99926096, -0.69150481,  1.51788167, -1.51788167],
       [-0.38294047,  0.15042475,  1.51788167, -1.51788167],
       [-0.38294047,  0.99235431, -0.65881288,  0.65881288]])
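A standardized training set should have mean 0 and standard deviation 1 in every column. The sketch below checks this on synthetic data; the arrays `X_train_demo` and `X_test_demo` are randomly generated stand-ins, not the exercise dataset. The test set is transformed with the training statistics, so its mean and standard deviation are only approximately 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=2.0, scale=1.5, size=(750, 2))
X_test_demo = rng.normal(loc=2.0, scale=1.5, size=(250, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # fit on training data only
X_test_scaled = scaler.transform(X_test_demo)        # reuse training mean/std

# Training columns: mean 0 and standard deviation 1 (up to rounding error)
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_scaled.std(axis=0), 1.0))   # True

# Test columns: close to 0 and 1, but not exactly
print(X_test_scaled.mean(axis=0), X_test_scaled.std(axis=0))
```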

## Exercise 3

There is a lot of controversy about the relationship between income and ideology. Let's explore this relationship a little. Before that, however, we need to pre-process the data. We have the following attributes:

* Gender
* Political_ideology
* Income
* Job_satisfaction

Follow this [link](https://urledu-my.sharepoint.com/:u:/g/personal/jordi_nin_esade_edu/EQBnu4Tq0E9IqpfN-AYbsTABUdpW6bSa9ax8bziVSa-cRg?e=soeYUG) to download the dataset in CSV format. You might need to unzip it.

#### Exercise 3.1

Create a dataframe called ```data``` that stores the provided data.

In [76]:
import pandas as pd

data = pd.read_csv('/content/ideology_income.csv')
data.head()

| | gender | political_ideology | income | job_satisfaction |
|---|---|---|---|---|
| 0 | Female | right_wing | 2500.0 | 2 |
| 1 | Male | left_wing | 3002.3 | 2 |
| 2 | Male | center_wing | 4274.1 | 7 |
| 3 | Male | center_wing | 1200.0 | 5 |
| 4 | Male | center_wing | 2774.6 | 5 |


#### Exercise 3.2

Extract the features matrix and target array from the original DataFrame and store them in two new variables ```X``` and ```y```. Use column ```job_satisfaction``` as the dependent variable.

In [77]:
y = data['job_satisfaction']
X = data.drop(columns='job_satisfaction')

X

| | gender | political_ideology | income |
|---|---|---|---|
| 0 | Female | right_wing | 2500.0 |
| 1 | Male | left_wing | 3002.3 |
| 2 | Male | center_wing | 4274.1 |
| 3 | Male | center_wing | 1200.0 |
| 4 | Male | center_wing | 2774.6 |
| ... | ... | ... | ... |
| 995 | Male | center_wing | 3467.9 |
| 996 | Female | right_wing | 3445.7 |
| 997 | Male | center_wing | 2728.5 |
| 998 | Male | left_wing | 3398.7 |


#### Exercise 3.3

Use the sklearn library to create a one-hot encoder for the ```political_ideology``` attribute. Observe the encoder output, and store the information in three new columns called ```left_wing```, ```center_wing``` and ```right_wing```.

In [78]:
#data['political_ideology'].unique()

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead
val = encoder.fit_transform(X[['political_ideology']])

#val

X['center_wing'] = val[:,0]
X['left_wing'] = val[:,1]
X['right_wing'] = val[:,2]

X.head()

| | gender | political_ideology | income | center_wing | left_wing | right_wing |
|---|---|---|---|---|---|---|
| 0 | Female | right_wing | 2500.0 | 0.0 | 0.0 | 1.0 |
| 1 | Male | left_wing | 3002.3 | 0.0 | 1.0 | 0.0 |
| 2 | Male | center_wing | 4274.1 | 1.0 | 0.0 | 0.0 |
| 3 | Male | center_wing | 1200.0 | 1.0 | 0.0 | 0.0 |
| 4 | Male | center_wing | 2774.6 | 1.0 | 0.0 | 0.0 |


In [79]:
row = 0
print('Information of row on the original dataset')
print(X.iloc[row,:3])
print('Encoded information of the variable')
val[row,:]

Information of row on the original dataset
gender                    Female
political_ideology    right_wing
income                      2500
Name: 0, dtype: object
Encoded information of the variable


array([0., 0., 1.])

In [80]:
row = 1
print('Information of row on the original dataset')
print(X.iloc[row,:3])
print('Encoded information of the variable')
val[row,:]

Information of row on the original dataset
gender                     Male
political_ideology    left_wing
income                   3002.3
Name: 1, dtype: object
Encoded information of the variable


array([0., 1., 0.])

In [81]:
row = 2
print('Information of row on the original dataset')
print(X.iloc[row,:3])
print('Encoded information of the variable')
val[row,:]

Information of row on the original dataset
gender                       Male
political_ideology    center_wing
income                     4274.1
Name: 2, dtype: object
Encoded information of the variable


array([1., 0., 0.])
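The three row-by-row inspections above can also be done for all rows at once. The sketch below, run on the first five ideology values shown earlier, checks that every encoded row contains exactly one 1 and that the position of that 1 maps back to the original label. The names `ideology`, `columns` and `recovered` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# First five values of the political_ideology column
ideology = pd.DataFrame({"political_ideology": [
    "right_wing", "left_wing", "center_wing", "center_wing", "center_wing"]})

encoder = OneHotEncoder()
# .toarray() turns the default sparse output into a dense NumPy array
val = encoder.fit_transform(ideology[["political_ideology"]]).toarray()

# Output columns follow the (alphabetically sorted) order of categories_
columns = [str(c) for c in encoder.categories_[0]]
print(columns)  # ['center_wing', 'left_wing', 'right_wing']

# Every row must contain exactly one 1
print(bool((val.sum(axis=1) == 1).all()))  # True

# The position of the 1 in each row points back to the original label
recovered = [columns[i] for i in val.argmax(axis=1)]
print(recovered == ideology["political_ideology"].tolist())  # True
```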

#### Exercise 3.4

Write the code to split the datasets ```X``` and ```y``` into a training set and a test set using the sklearn library. Use the common names ```X_train, X_test, y_train, y_test```.

In [52]:
X.drop(columns=['gender','political_ideology'],inplace=True)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y)
X_train, X_test, y_train, y_test

(     income  center_wing  left_wing  right_wing
 693  2229.6          1.0        0.0         0.0
 724  5964.6          1.0        0.0         0.0
 683  2679.0          0.0        0.0         1.0
 116  2181.6          1.0        0.0         0.0
 997  2728.5          1.0        0.0         0.0
 ..      ...          ...        ...         ...
 810  1053.2          0.0        1.0         0.0
 262  1296.2          1.0        0.0         0.0
 944  2716.3          1.0        0.0         0.0
 384  3789.0          0.0        1.0         0.0
 721  2663.3          0.0        1.0         0.0
 
 [750 rows x 4 columns],      income  center_wing  left_wing  right_wing
 678  2796.4          1.0        0.0         0.0
 598  4245.3          1.0        0.0         0.0
 362  1345.6          0.0        1.0         0.0
 700  3255.5          0.0        1.0         0.0
 988  1200.2          1.0        0.0         0.0
 ..      ...          ...        ...         ...
 742  3875.2          0.0        1.0       

#### Exercise 3.5

Write the code to correctly normalize datasets ```X_train``` and ```X_test``` using the MinMaxScaler from the sklearn library.

In [54]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_train=scaler.fit_transform(X_train)  # learn the min/max from the training set only
X_test=scaler.transform(X_test)        # reuse the training min/max on the test set

#X_test

## Exercise 4

Milk consumption is an important factor in avoiding aging-related problems. Let's analyze whether milk-consumption habits differ across the population. We have the following attributes:

* gender
* height
* weight
* milk

Follow this [link](https://urledu-my.sharepoint.com/:u:/g/personal/jordi_nin_esade_edu/EfzGDqV3KUVKl9D460iyzg0BERHvg-PMZ-yH-90I9rknKQ?e=oNP89D) to download the dataset in CSV format. You might need to unzip it.

#### Exercise 4.1

Create a dataframe called ```data``` that stores the provided data.

In [82]:
import pandas as pd
data = pd.read_csv('/content/milk_consumption.csv')
data.head()

| | gender | height | weight | milk |
|---|---|---|---|---|
| 0 | Female | 181 | 71 | 26.38 |
| 1 | Male | 176 | 68 | 12.53 |
| 2 | Male | 175 | 86 | 13.31 |
| 3 | Male | 170 | 72 | 6.22 |
| 4 | Male | 178 | 82 | 15.02 |


#### Exercise 4.2

Extract the features matrix and target array from the original DataFrame and store them in two new variables ```X``` and ```y```. Use column ```milk``` as the dependent variable.

In [83]:
X = data.drop(columns='milk')
y = data['milk']
X

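| | gender | height | weight |
|---|---|---|---|
| 0 | Female | 181 | 71 |
| 1 | Male | 176 | 68 |
| 2 | Male | 175 | 86 |
| 3 | Male | 170 | 72 |
| 4 | Male | 178 | 82 |
| ... | ... | ... | ... |
| 995 | Male | 173 | 65 |
| 996 | Female | 167 | 57 |
| 997 | Male | 172 | 73 |
| 998 | Male | 168 | 71 |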


#### Exercise 4.3

Use the sklearn library to create a one-hot encoder for the ```gender``` attribute. Observe the encoder output, and store the information in two new columns called ```female``` and ```male```.

In [84]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead
val = encoder.fit_transform(X[['gender']])

val

X['female'] = val[:,0]
X['male'] = val[:,1]

In [85]:
val

array([[1., 0.],
       [0., 1.],
       [0., 1.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]])

In [86]:
row = 0
print('Information of row on the original dataset')
print(X.iloc[row,:3])
print('Encoded information of the variable')
val[row,:]

Information of row on the original dataset
gender    Female
height       181
weight        71
Name: 0, dtype: object
Encoded information of the variable


array([1., 0.])

In [87]:
row = 1
print('Information of row on the original dataset')
print(X.iloc[row,:3])
print('Encoded information of the variable')
val[row,:]

Information of row on the original dataset
gender    Male
height     176
weight      68
Name: 1, dtype: object
Encoded information of the variable


array([0., 1.])

#### Exercise 4.4

Write the code to split the datasets ```X``` and ```y``` into a training set and a test set using the sklearn library. Use the common names ```X_train, X_test, y_train, y_test```.

In [91]:
X.drop(columns='gender', inplace=True)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y)

#### Exercise 4.5

Write the code to correctly normalize datasets ```X_train``` and ```X_test``` using the StandardScaler from the sklearn library.

In [92]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train=scaler.fit_transform(X_train)  # learn the mean and std from the training set only
X_test=scaler.transform(X_test)        # reuse the training mean and std on the test set

In [94]:
X_train

array([[ 0.88883751,  1.35075398, -0.66714819,  0.66714819],
       [ 0.20209771, -0.3930809 , -0.66714819,  0.66714819],
       [-1.66191031, -0.49565942,  1.49891736, -1.49891736],
       ...,
       [-0.68085346,  1.4533325 , -0.66714819,  0.66714819],
       [ 0.69262614, -1.21370907, -0.66714819,  0.66714819],
       [-0.3865364 ,  1.35075398, -0.66714819,  0.66714819]])