# Classification in Python

The goal of this exercise is to train two classification models, a decision tree and Naive Bayes, on the Titanic dataset. In this example we use only the training data file to build and validate the models.

The first step is to import the basic libraries.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import sklearn
from sklearn import tree

Next we load the data.

In [15]:
trainx=pd.read_csv("./data/train.csv")

In [16]:
print(trainx.head())


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In [17]:
trainx.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


We create a categorical variable from age. Passengers whose age is missing keep NaN in the new column.

In [18]:
train = trainx.copy()
train["Child"] = float('NaN')

# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# Use .loc with a boolean mask instead of chained indexing, which triggers
# SettingWithCopyWarning and may silently fail to modify the DataFrame.
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
print(train["Child"].head())

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: Child, dtype: float64
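To make the behaviour of mask-based assignment easier to see, here is a minimal sketch on a toy DataFrame. The `toy` frame and its ages are invented for illustration and are not part of the Titanic data. It shows that rows with a missing age match neither condition and therefore stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age and two values at the boundary.
toy = pd.DataFrame({"Age": [5.0, 40.0, np.nan, 17.0, 18.0]})

# Start from NaN so that rows with an unknown age remain undefined.
toy["Child"] = np.nan

# .loc[mask, column] writes into the original DataFrame in a single step.
toy.loc[toy["Age"] < 18, "Child"] = 1
toy.loc[toy["Age"] >= 18, "Child"] = 0

print(toy)
```

Comparisons against NaN evaluate to False, which is why the third row is left untouched by both assignments.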


In [26]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Child |
|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 714.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 0.158263 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 0.365244 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 0.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 1.0 |


We convert the Sex variable from string to numeric (nominal) codes, impute missing values, and encode Embarked as integers.

In [6]:
# Encode Sex as integers (male -> 0, female -> 1).
# map replaces the chained assignments that triggered SettingWithCopyWarning.
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute missing values: "S" for Embarked, a sentinel code (3) for Sex,
# and the column median for the numeric variables.
train["Embarked"] = train["Embarked"].fillna("S")
train["Fare"] = train["Fare"].fillna(train["Fare"].median())
train["Sex"] = train["Sex"].fillna(3)
train["SibSp"] = train["SibSp"].fillna(train["SibSp"].median())
train["Parch"] = train["Parch"].fillna(train["Parch"].median())
train["Pclass"] = train["Pclass"].fillna(train["Pclass"].median())
train["Age"] = train["Age"].fillna(train["Age"].median())

# Convert the Embarked classes to integer form
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})
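Integer codes such as 0, 1, 2 imply an ordering that a nominal variable like Embarked does not have. As a possible alternative to the encoding used in this notebook (it is not what the notebook does), the sketch below one-hot encodes a toy column with `pd.get_dummies`. The `toy` frame is invented for illustration.

```python
import pandas as pd

# Hypothetical toy column with the three embarkation ports.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per category; no artificial order between ports.
onehot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(onehot)
```

Decision trees usually cope well with integer codes, so the difference tends to matter more for models that treat features as magnitudes.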

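We count the missing values that remain. Only Cabin and Child still contain NaN (Child was derived from Age before Age was imputed), and neither column is used as a feature later on.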

In [7]:
print(train.isnull().sum().sum())


864
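A single grand total does not say which columns are affected. The following sketch, on an invented toy frame rather than the Titanic data, breaks the count down per column with `isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# Missing values per column, then only the columns that have any.
per_column = toy.isnull().sum()
print(per_column[per_column > 0])

# The grand total is the sum of the per-column counts.
print(per_column.sum())
```

Applied to `train`, the same call would show where the remaining missing values are concentrated.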


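## Splitting the data into training and test sets
70% for training and 30% for testing. No `random_state` is passed to `train_test_split`, so the split, and with it the scores reported afterwards, changes from run to run.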

In [8]:
from sklearn.model_selection import train_test_split


# Build the feature matrix and the target vector
features = train[["Pclass","Age","Sex","Fare", "SibSp", "Parch", "Embarked"]].values
target = train["Survived"].values


# Split the data into train and test
trainX, testX, trainY, testY = train_test_split(features, target, test_size=0.3)
print(trainX.shape, trainY.shape)
print(testX.shape, testY.shape)


(623, 7) (623,)
(268, 7) (268,)
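If a repeatable split is wanted, `train_test_split` accepts a `random_state` seed, and `stratify` keeps the class proportions equal in both parts. The sketch below uses synthetic data; the seed value 42 and the 60/40 class balance are arbitrary assumptions, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 3 features, 40% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 60 + [1] * 40)

# Fixed seed -> same split on every run; stratify -> same class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With a fixed seed, differences between two models reflect the models rather than the luck of the split.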


## Decision trees

We train a decision tree model.

In [13]:
# Control overfitting by limiting the tree depth and the minimum samples needed to split a node
max_depth = 10
min_samples_split = 5
model1 = tree.DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split, random_state=1)
model1 = model1.fit(trainX, trainY)

#Print the score on the train data
print(model1.score(trainX, trainY))
#Print the score on the test data
print(model1.score(testX, testY))

0.9165329052969502
0.8097014925373134
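The gap between the training score and the test score suggests some overfitting. One way to explore this is to compare several depths side by side. The sketch below does so on synthetic data from `make_classification`; the dataset, the depth grid, and the seeds are assumptions chosen for illustration, so the numbers will not match the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features (7 columns, like the notebook).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (2, 4, 6, 10):
    clf = DecisionTreeClassifier(max_depth=depth, min_samples_split=5, random_state=1)
    clf.fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each depth.
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(depth, scores[depth])
```

A training score that keeps rising while the test score stalls or drops is the usual sign that the tree is too deep.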


## Naive Bayes model
We train a Gaussian Naive Bayes model.

In [14]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

model2 = gnb.fit(trainX, trainY)

#Print the score on the train data
print(model2.score(trainX, trainY))
#Print the score on the test data
print(model2.score(testX, testY))

0.7913322632423756
0.7947761194029851
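`GaussianNB` models each feature within each class as a normal distribution and combines them with the class priors. A minimal sketch on an invented one-feature dataset shows the fitted per-class means (`theta_`), the priors (`class_prior_`), and the resulting class probabilities.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: class 0 clusters near 1.0, class 1 near 3.0.
X = np.array([[1.0], [1.2], [0.8], [3.0], [3.2], [2.8]])
y = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X, y)

print(gnb.theta_)        # per-class mean of each feature
print(gnb.class_prior_)  # estimated class priors

# Posterior probability of each class for two new points.
new_points = np.array([[1.1], [2.9]])
proba = gnb.predict_proba(new_points)
print(proba)
```

Because features are treated as independent given the class, the model has few parameters, which is consistent with its training and test scores being close to each other.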


## Confusion matrices

In [11]:
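# For the decision tree
from sklearn.metrics import confusion_matrix

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(testY, model1.predict(testX))


array([[144,  19],
       [ 32,  73]])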

In [12]:
# For Naive Bayes
from sklearn.metrics import confusion_matrix

# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
confusion_matrix(testY, model2.predict(testX))


array([[137,  26],
       [ 29,  76]])
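A confusion matrix can be turned into accuracy, precision, and recall. The sketch below works on invented labels (not the notebook's predictions) and assumes scikit-learn's convention that `confusion_matrix(y_true, y_pred)` puts true labels on the rows.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight samples.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print(accuracy, precision, recall)
```

The hand-computed values agree with `precision_score` and `recall_score`, which can be used directly once the convention is clear.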