# APLICACIONES EN CIENCIAS DE COMPUTACION

## Laboratorio 4: Preprocesamiento y clasificación de un dataset

La tarea de este laboratorio consiste en preprocesar un dataset y entrenar modelos de aprendizaje de máquina sobre dicho dataset

In [70]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [71]:
data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_train.shape, data_test.shape

((891, 12), (418, 11))

In [72]:
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [73]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [74]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


#### 1. El campo a predecir es *`Survived`*. ¿Deberían considerarse todos los campos del dataset para el entrenamiento del modelo? Justifique (1.5pts)


#### 2. Al parecer hay valores nulos en el dataset, tanto en *`data_train`* como en *`data_test`*. ¿De qué manera podría completar o eliminar los valores nulos? Justifique (1.5pts)

#### 3a. Según lo justificado en las preguntas anteriores, realice las transformaciones necesarias a *`data_train`*. Puede ayudarse de los siguientes métodos que pueden ser aplicados a un objeto de tipo Series (a una columna): (3pts)
```
fillna()
mode() // cuidado pues retorna una lista
mean()
drop()
```
Tip: recuerde que los métodos de pandas no modifican el objeto al cual se aplican sino devuelven un objeto nuevo

In [75]:
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df    

def fill_embarked(df):
    df.Embarked = df.Embarked.fillna('S')
    return df
    
def drop_features(df):
    return df.drop(['Ticket', 'Name','Cabin'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = fill_embarked(df)
    df = drop_features(df)
    return df

data_train = transform_features(data_train)
data_train

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Lname,NamePrefix
0,1,0,3,male,Student,1,0,1_quartile,S,"Braund,",Mr.
1,2,1,1,female,Adult,1,0,4_quartile,C,"Cumings,",Mrs.
2,3,1,3,female,Young Adult,0,0,1_quartile,S,"Heikkinen,",Miss.
3,4,1,1,female,Young Adult,1,0,4_quartile,S,"Futrelle,",Mrs.
4,5,0,3,male,Young Adult,0,0,2_quartile,S,"Allen,",Mr.
5,6,0,3,male,Unknown,0,0,2_quartile,Q,"Moran,",Mr.
6,7,0,1,male,Adult,0,0,4_quartile,S,"McCarthy,",Mr.
7,8,0,3,male,Baby,3,1,3_quartile,S,"Palsson,",Master.
8,9,1,3,female,Young Adult,0,2,2_quartile,S,"Johnson,",Mrs.
9,10,1,2,female,Teenager,1,0,3_quartile,C,"Nasser,",Mrs.


#### 3b. De la misma manera, realice las transformaciones necesarias a *`data_test`* (2pts)

In [76]:
def simplify_ages(df):
    df.Age = df.Age.fillna(-0.5)
    bins = (-1, 0, 5, 12, 18, 25, 35, 60, 120)
    group_names = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    categories = pd.cut(df.Age, bins, labels=group_names)
    df.Age = categories
    return df

def simplify_fares(df):
    df.Fare = df.Fare.fillna(-0.5)
    bins = (-1, 0, 8, 15, 31, 1000)
    group_names = ['Unknown', '1_quartile', '2_quartile', '3_quartile', '4_quartile']
    categories = pd.cut(df.Fare, bins, labels=group_names)
    df.Fare = categories
    return df

def format_name(df):
    df['Lname'] = df.Name.apply(lambda x: x.split(' ')[0])
    df['NamePrefix'] = df.Name.apply(lambda x: x.split(' ')[1])
    return df    

def fill_embarked(df):
    df.Embarked = df.Embarked.fillna('S')
    return df
    
def drop_features(df):
    return df.drop(['Ticket', 'Name','Cabin'], axis=1)

def transform_features(df):
    df = simplify_ages(df)
    df = simplify_fares(df)
    df = format_name(df)
    df = fill_embarked(df)
    df = drop_features(df)
    return df

data_test = transform_features(data_test)
data_test

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Lname,NamePrefix
0,892,3,male,Young Adult,0,0,1_quartile,Q,"Kelly,",Mr.
1,893,3,female,Adult,1,0,1_quartile,S,"Wilkes,",Mrs.
2,894,2,male,Senior,0,0,2_quartile,Q,"Myles,",Mr.
3,895,3,male,Young Adult,0,0,2_quartile,S,"Wirz,",Mr.
4,896,3,female,Student,1,1,2_quartile,S,"Hirvonen,",Mrs.
5,897,3,male,Teenager,0,0,2_quartile,S,"Svensson,",Mr.
6,898,3,female,Young Adult,0,0,1_quartile,Q,"Connolly,",Miss.
7,899,2,male,Young Adult,1,1,3_quartile,S,"Caldwell,",Mr.
8,900,3,female,Teenager,0,0,1_quartile,C,"Abrahim,",Mrs.
9,901,3,male,Student,2,0,3_quartile,S,"Davies,",Mr.


In [77]:
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Lname,NamePrefix
0,1,0,3,male,Student,1,0,1_quartile,S,"Braund,",Mr.
1,2,1,1,female,Adult,1,0,4_quartile,C,"Cumings,",Mrs.
2,3,1,3,female,Young Adult,0,0,1_quartile,S,"Heikkinen,",Miss.
3,4,1,1,female,Young Adult,1,0,4_quartile,S,"Futrelle,",Mrs.
4,5,0,3,male,Young Adult,0,0,2_quartile,S,"Allen,",Mr.


In [78]:
data_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Lname,NamePrefix
0,892,3,male,Young Adult,0,0,1_quartile,Q,"Kelly,",Mr.
1,893,3,female,Adult,1,0,1_quartile,S,"Wilkes,",Mrs.
2,894,2,male,Senior,0,0,2_quartile,Q,"Myles,",Mr.
3,895,3,male,Young Adult,0,0,2_quartile,S,"Wirz,",Mr.
4,896,3,female,Student,1,1,2_quartile,S,"Hirvonen,",Mrs.


In [79]:
list(data_train['Sex'].unique()), list(data_train['Embarked'].unique())

(['male', 'female'], ['S', 'C', 'Q'])

#### 4. Algunos campos aún no son numéricos. Realice la conversión correspondiente. Puede ayudarse del siguiente método (aplicable a un objeto de tipo DataFrame o Series): (2pts)
```
replace()

```
Tip: la función puede recibir un diccionario como parámetro <br>
Nota: por ahora solo se están mapeando strings a números, pero si conoce cómo usar one-hot encoding, puede aplicarlo

In [80]:
data_train['Sex']=data_train['Sex'].replace(['male'],0)
data_train['Sex']=data_train['Sex'].replace(['female'],1)

data_test['Sex']=data_test['Sex'].replace(['male'],0)
data_test['Sex']=data_test['Sex'].replace(['female'],1)

data_test['Age']=data_test['Age'].replace(['Unknown'],0)
data_test['Age']=data_test['Age'].replace(['Baby'],1)
data_test['Age']=data_test['Age'].replace(['Child'],2)
data_test['Age']=data_test['Age'].replace(['Teenager'],3)
data_test['Age']=data_test['Age'].replace(['Student'],4)
data_test['Age']=data_test['Age'].replace(['Young Adult'],5)
data_test['Age']=data_test['Age'].replace(['Adult'],6)
data_test['Age']=data_test['Age'].replace(['Senior'],7)

data_train['Age']=data_train['Age'].replace(['Unknown'],0)
data_train['Age']=data_train['Age'].replace(['Baby'],1)
data_train['Age']=data_train['Age'].replace(['Child'],2)
data_train['Age']=data_train['Age'].replace(['Teenager'],3)
data_train['Age']=data_train['Age'].replace(['Student'],4)
data_train['Age']=data_train['Age'].replace(['Young Adult'],5)
data_train['Age']=data_train['Age'].replace(['Adult'],6)
data_train['Age']=data_train['Age'].replace(['Senior'],7)

data_train['Fare']=data_train['Fare'].replace(['Unknown'],0)
data_train['Fare']=data_train['Fare'].replace(['1_quartile'],1)
data_train['Fare']=data_train['Fare'].replace(['2_quartile'],2)
data_train['Fare']=data_train['Fare'].replace(['3_quartile'],3)
data_train['Fare']=data_train['Fare'].replace(['4_quartile'],4)

data_test['Fare']=data_test['Fare'].replace(['Unknown'],0)
data_test['Fare']=data_test['Fare'].replace(['1_quartile'],1)
data_test['Fare']=data_test['Fare'].replace(['2_quartile'],2)
data_test['Fare']=data_test['Fare'].replace(['3_quartile'],3)
data_test['Fare']=data_test['Fare'].replace(['4_quartile'],4)

data_train['Embarked'] = data_train['Embarked'].replace(['S'],ord('S'))
data_train['Embarked'] = data_train['Embarked'].replace(['Q'],ord('Q'))
data_train['Embarked'] = data_train['Embarked'].replace(['C'],ord('C'))

data_test['Embarked'] = data_test['Embarked'].replace(['S'],ord('S'))
data_test['Embarked'] = data_test['Embarked'].replace(['Q'],ord('Q'))
data_test['Embarked'] = data_test['Embarked'].replace(['C'],ord('C'))

data_train['Lname'] = data_train['Lname'].apply(len)
data_test['Lname'] = data_test['Lname'].apply(len)

data_train['NamePrefix'] = data_train['NamePrefix'].apply(len)
data_test['NamePrefix'] = data_test['NamePrefix'].apply(len)


varX=data_test['Embarked'].mean()
data_test['Embarked'] = data_test['Embarked'].fillna(varX)

#### 5. Es hora de usar Machine Learning. Importe de sklearn y entrene por lo menos 3 modelos de clasificación con X_train y y_train y reporte el accuracy sobre el dataset de validación (X_val, y_val). Puede ayudarse de los siguientes métodos aplicables a un clasificador: (5pts)
```
fit()
score()
```
Tip: puede revisar http://scikit-learn.org/stable/supervised_learning.html para una lista de clasificadores disponibles. Use los que le sean más familiares

In [81]:
from sklearn.model_selection import train_test_split
X_data, y_data = data_train.drop('Survived', axis=1), data_train['Survived']
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=23)

In [82]:
# A continuación importe 3 modelos de clasificación, y entrénelos con la data

In [83]:
from sklearn import preprocessing
def encode_features(df_train, df_test):
    features = ['Fare', 'Embarked', 'Age', 'Sex', 'Lname', 'NamePrefix']
    df_combined = pd.concat([df_train[features], df_test[features]])
    
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test
    
data_train, data_test = encode_features(data_train, data_test)
data_train

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Lname,NamePrefix
0,1,0,3,0,4,1,0,1,2,5,1
1,2,1,1,1,6,1,0,4,0,6,2
2,3,1,3,1,5,0,0,1,2,8,3
3,4,1,1,1,5,1,0,4,2,7,2
4,5,0,3,0,5,0,0,2,2,4,1
5,6,0,3,0,0,0,0,2,1,4,1
6,7,0,1,0,6,0,0,4,2,7,1
7,8,0,3,0,1,3,1,3,2,6,5
8,9,1,3,1,5,0,2,2,2,6,2
9,10,1,2,1,3,1,0,3,0,5,2


#### 6. Ahora probaremos con un clasificador sin usar librerías. Implemente el clasificador KNN

#### 6a. Implemente la función de distancia euclidiana (2pts)

In [85]:
from math import sqrt
def dist_eucli(x_elem, y_elem):
    #########################################
    ## Aqui coloque su código
    suma=0
    for i in range(len(X_train.columns)-1):
        aux= abs(x_elem[i] - y_elem[i])
        suma = suma + aux**2
    return sqrt(suma)

#### 6b. Implemente el clasificador KNN. La función KNN realiza entrenamiento y predicción (2pts)

In [86]:
from random import randint
def my_KNN(K, X_train, y_train, X_val):
    #########################################
    ## Aqui coloque su código
    y_predict=[]
    for cod,r_i in X_val.iterrows():
        lst = []
        neighb=[]
        clasesvecinos=[]
        cantoSob=0
        cantDeath=0
        for codigoJ,r_J in X_train.iterrows():
            d = dist_eucli(r_i,r_J)
            lst.append((d,codigoJ,r_J))
            lst.sort(key=lambda u: u[0])
        neighb=lst[0:(K+1)]
        for w in neighb:
            codigoW = w[1]
            claseW = y_train[codigoW]
            if claseW==1:
                cantoSob=cantoSob+1
            if claseW==0:
                cantDeath=cantDeath+1
        if cantoSob > cantDeath:
            y_predict.append(1)
        if cantoSob < cantDeath:
            y_predict.append(0)
        if cantoSob == cantDeath:
            y_predict.append((randint(0,1)))
    return y_predict

In [87]:
y_prediction = my_KNN(3, X_train, y_train, X_val)
print (y_prediction)

[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0]


#### 6c. Reporte el accuracy sobre el conjunto de validación (X_val, y_val) (1pto)

In [88]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_val,y_prediction)*100)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_val)
print(accuracy_score(y_val,y_pred)*100)

55.3072625698
58.1005586592
