**Titanic Survival**

The AED_Titanic.csv dataset contains information about the passengers of the Titanic. It records the following variables:


*   Survived: survival (0 = no; 1 = yes)
*   Pclass: passenger class (1 = 1st class; 2 = 2nd class; 3 = 3rd class)
*   Sex: sex (female, male)
*   Age: age
*   SibSp: number of siblings or spouses aboard
*   Parch: number of parents or children aboard
*   Fare: passenger fare (in pounds sterling)
*   Embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

The goal is to find the set of variables that best predicts the survival of the Titanic's passengers.

In [None]:
!ls

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt #for plotting the data 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn import metrics
from subprocess import check_output

In [None]:
## Reading the data
from google.colab import files
files.upload()

In [None]:
data = pd.read_csv('AED_Titanic.csv') 
data.head()

Data visualization

Sex versus Survival

In [None]:
total = data['Sex'].value_counts()
survived_sex = data[data['Survived']==1]['Sex'].value_counts()
died_sex = data[data['Survived']==0]['Sex'].value_counts()
df = pd.DataFrame([total,survived_sex,died_sex])
df.index = ['Total','Survived','Died']
print(df)
df.plot(kind='bar')
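
Raw counts are hard to compare when the groups differ in size, so it can help to look at proportions as well. The following is a minimal sketch using `pd.crosstab` with row normalisation. The `toy` DataFrame is made up for illustration and only mimics the `Sex` and `Survived` columns of the real file.

```python
import pandas as pd

# Hypothetical toy data standing in for AED_Titanic.csv (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Row-normalised contingency table: each row sums to 1,
# so the entries are the share of each sex that died or survived.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
rates.columns = ["Died", "Survived"]  # crosstab sorts the codes, so 0 comes before 1
print(rates)
```

On the real data, the same call with `data['Sex']` and `data['Survived']` should give the survival rate per sex directly.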

Age vs Survival

In [None]:
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'],data[data['Survived']==0]['Age']], color = ['g','r'],
         bins = 10,label = ['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()

Class versus Survival

In [None]:
survived_1 = data[data['Pclass']==1]['Survived'].value_counts()
survived_2 = data[data['Pclass']==2]['Survived'].value_counts()
survived_3 = data[data['Pclass']==3]['Survived'].value_counts()
df = pd.DataFrame([survived_1,survived_2,survived_3])
df['total']=df[0]+df[1]
df.index = ['1st class','2nd class','3rd class']
# value_counts() is keyed by the Survived code (0 = died, 1 = survived).
# rename() returns a new DataFrame, so the result has to be reassigned.
df = df.rename(columns={0:'Died',1:'Survived'})
print(df)
df.plot(kind='bar')

Fare versus Survival

In [None]:
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Fare'],data[data['Survived']==0]['Fare']],bins=10,label=['Survived','Died'])
plt.xlabel('Fare')
plt.ylabel('No. of People')
plt.legend()

Embarkment versus Survival

In [None]:
survived_embarkment  = data[data['Survived']==1]['Embarked'].value_counts()
died_embarkment = data[data['Survived']==0]['Embarked'].value_counts()  # 0 = died
df = pd.DataFrame([survived_embarkment,died_embarkment])
df.index=['survived','died']
df.plot(kind='bar',stacked=True)

Encoding categorical variables as dummy variables

In [None]:
data.dtypes
data.info()
print(data['Embarked'])

In [None]:
## Convert the categorical variables to dummies (with drop_first=True, 'female' is the reference level)
data['Sex'] = pd.get_dummies(data['Sex'], drop_first=True)
data['Sex'] 

In [None]:
Embark2 = pd.get_dummies(data['Embarked'], drop_first=True)
Embark2
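
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up `Embarked` column. The `ports` series is hypothetical, and `dtype=int` is used only so the output prints as 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical sample of embarkation ports (not taken from the real file).
ports = pd.Series(["S", "C", "Q", "S", "C"], name="Embarked")

# Categories are sorted alphabetically (C, Q, S) and the first one is dropped,
# so C becomes the reference level encoded as Q = 0, S = 0.
dummies = pd.get_dummies(ports, drop_first=True, dtype=int)
print(dummies)
```

With three ports, two dummy columns are enough: a row with zeros in both columns corresponds to Cherbourg.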


In [None]:
df = pd.concat([data['Pclass'],data['Sex'],data['Age'],data['SibSp'], data['Parch'], data['Fare'], Embark2], axis=1)
df

In [None]:
X = df
y = data['Survived']
Xtrain = df.head(800)
Xtest = df.tail(89)
ytrain = y.head(800)
ytest = y.tail(89)
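
The `head`/`tail` split above depends on the row order of the file. As a possible alternative, the sketch below uses `train_test_split` (already imported at the top of the notebook) with a shuffled, stratified split. The data here is synthetic, and `X_toy`, `y_toy`, and the 10% test size are illustrative assumptions, not values from the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictor matrix and the Survived column.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "Age": rng.integers(1, 80, size=100),
    "Fare": rng.uniform(5, 100, size=100),
})
y_toy = pd.Series(np.repeat([0, 1], [60, 40]), name="Survived")

# Shuffle the rows and keep the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` keeps the split reproducible between runs.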

In [None]:
from sklearn.linear_model import LogisticRegression
# penalty=None fits an unregularized model (scikit-learn >= 1.2; older versions used penalty='none')
logreg = LogisticRegression(penalty=None, max_iter=10000).fit(Xtrain,ytrain)
logreg
# score() returns the mean accuracy on the given data
print("Training set accuracy: {:.3f}".format(logreg.score(Xtrain,ytrain)))
print("Test set accuracy: {:.3f}".format(logreg.score(Xtest,ytest)))
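
Accuracy alone does not show which kind of error the classifier makes. The notebook imports `sklearn.metrics` but does not use it, so here is a short sketch of a confusion matrix on made-up labels. `y_true` and `y_pred` are hypothetical arrays, not output of the model above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predictions (0 = died, 1 = survived).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)
acc = metrics.accuracy_score(y_true, y_pred)
rec = metrics.recall_score(y_true, y_pred)  # share of actual survivors that were found
print(cm)
print(acc, rec)
```

On the real model, `logreg.predict(Xtest)` would take the place of `y_pred` and `ytest` the place of `y_true`.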

In [None]:
df = pd.concat([data['Pclass'],data['Sex'],data['Age'],data['SibSp'], data['Parch'], data['Fare']], axis=1)
X = df
y = data['Survived']
Xtrain = df.head(800)
Xtest = df.tail(89)
ytrain = y.head(800)
ytest = y.tail(89)

In [None]:
from sklearn.linear_model import LogisticRegression
# Same settings as the previous fit so the two models are comparable
# (penalty=None requires scikit-learn >= 1.2; older versions used penalty='none')
logreg = LogisticRegression(penalty=None, random_state=0, max_iter=10000).fit(Xtrain,ytrain)
logreg
print("Training set accuracy: {:.3f}".format(logreg.score(Xtrain,ytrain)))
print("Test set accuracy: {:.3f}".format(logreg.score(Xtest,ytest)))

In [None]:
import statsmodels.api as sm
# A column of 1s has to be added to the predictor matrix for the model's intercept
Xtrain = sm.add_constant(Xtrain, prepend=True)
modelo2 = sm.Logit(ytrain, Xtrain)
modelo2 = modelo2.fit()
print(modelo2.summary())
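
As a reminder of what the summary reports, the logistic model in standard notation is written below. The symbols $p_i$, $\beta_j$ and $x_{ij}$ are introduced here for illustration and do not appear in the notebook. The constant column added by `sm.add_constant` is the factor 1 that multiplies $\beta_0$.

$$
\log\frac{p_i}{1-p_i} = \beta_0 \cdot 1 + \beta_1 x_{i1} + \dots + \beta_k x_{ik},
\qquad p_i = \Pr(\text{Survived}_i = 1 \mid x_{i1}, \dots, x_{ik})
$$

Each coefficient is therefore a change in log-odds: increasing $x_{ij}$ by one unit, with the other predictors held fixed, multiplies the odds of survival by $e^{\beta_j}$.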

In [None]:
df = pd.concat([data['Pclass'],data['Sex'],data['Age'],data['SibSp']], axis=1)
X = df
y = data['Survived']
Xtrain = df.head(800)
Xtest = df.tail(89)
ytrain = y.head(800)
ytest = y.tail(89)

In [None]:
import statsmodels.api as sm
# A column of 1s has to be added to the predictor matrix for the model's intercept
Xtrain = sm.add_constant(Xtrain, prepend=True)
modelo2 = sm.Logit(ytrain, Xtrain)
modelo2 = modelo2.fit()
print(modelo2.summary())