# [Titanic Data Set](https://www.kaggle.com/c/titanic/data)

<img src="../images/titanic.jpeg">

### Data Set Information:

The titanic data frame describes the survival status of individual passengers on the Titanic.
The titanic data frame does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.

### Sources:
Hind, Philip. Encyclopedia Titanica. Online-only resource. Retrieved 01Feb2012 from
http://www.encyclopedia-titanica.org/

### Attribute Information:

- survival: Survival
- PassengerId: Unique ID of a passenger
- pclass: Ticket class
- sex: Sex
- Age: Age in years
- sibsp: Number of siblings / spouses aboard the Titanic
- parch: Number of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of embarkation

## Exploratory data analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Getting the Data
df = pd.read_csv("../datasets/titanic/train.csv")
df.head()

In [None]:
df.describe()

In [None]:
df.info()

- The training set has 891 examples and 11 features plus the target variable (`Survived`).
- Of the 12 columns, 2 are float, 5 are int and 5 are object (string).

### Class imbalance

In [None]:
sns.countplot(x='Survived', data=df)
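
The count plot shows the imbalance visually; the same information can be read as proportions with `value_counts(normalize=True)`. A minimal sketch on a small made-up frame (the `toy` labels below are illustrative, not the actual Titanic values):

```python
import pandas as pd

# Hypothetical toy labels standing in for the `Survived` column.
toy = pd.DataFrame({"Survived": [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]})

# Share of each class; a ratio far from 50/50 signals imbalance.
class_share = toy["Survived"].value_counts(normalize=True)
print(class_share)
```

On the real data, the same call on `df["Survived"]` would give the share of survivors and non-survivors.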

### Data visualization

In [None]:
sns.barplot(x='Pclass', y='Survived', data=df, ci=None)

First-class passengers had the highest survival rate.

In [None]:
sns.barplot(x = 'Sex', y='Survived', data=df, ci=None)

Women had a higher survival rate than men.

In [None]:
sns.barplot(x="SibSp", y="Survived", data=df, ci=None)

Passengers with one or two siblings or spouses aboard had a higher survival rate.

In [None]:
sns.barplot(x="Parch", y="Survived", data=df, ci=None)

Passengers with 1-3 parents or children aboard had a higher survival rate.

In [None]:
age = sns.FacetGrid(df, hue="Survived",aspect=2)
age.map(sns.kdeplot, 'Age', fill=True)  # `fill` replaces the deprecated `shade` (seaborn >= 0.11)
age.set(xlim=(0, df['Age'].max()))
age.add_legend()

Younger passengers had a higher survival rate.

In [None]:
fare = sns.FacetGrid(df, hue="Survived",aspect=2)
fare.map(sns.kdeplot, 'Fare', fill=True)  # `fill` replaces the deprecated `shade` (seaborn >= 0.11)
fare.set(xlim=(0, 200))
fare.add_legend()

Passengers who paid higher fares had a higher survival rate.

## Preprocessing

### Missing values

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)

- The `Embarked` feature has only 2 null values, so they are easy to fill in.
- The `Age` feature is trickier, since it has 177 null values.
- `Cabin` needs further investigation, but we may want to drop it from the dataset, since 77% of its values are missing.

**Embarked**

Since it has only 2 null values, we fill them with the most common value.

In [None]:
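from sklearn.impute import SimpleImputer

# replace missing values with the most frequent category
imp = SimpleImputer(strategy='most_frequent')

# fit on the Embarked column only; the imputer returns a 2D array, so flatten it
df['Embarked'] = imp.fit_transform(df[['Embarked']]).ravel()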

In [None]:
df["Embarked"].isnull().sum()
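
A pandas-only route to the same most-frequent-value fill is `mode()` followed by `fillna()`. This is a sketch of an alternative, not the approach used in this notebook, and the `toy` column below is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for `Embarked`.
toy = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", "Q", np.nan]})

# mode() ignores NaN and returns a Series (ties are possible), so take the first entry.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)
print(toy["Embarked"].tolist())
```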

**Age**

In this case we will create an array of random numbers drawn from a range based on the mean age and its standard deviation.

In [None]:
df.Age.hist()

In [None]:
mean = df["Age"].mean()
std = df["Age"].std()
is_null = df["Age"].isnull().sum()

# draw one random integer per missing age from [mean - std, mean + std)
rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)

# fill the NaN entries in the Age column with the generated values
age_slice = df["Age"].copy()
age_slice[np.isnan(age_slice)] = rand_age

df["Age"] = age_slice
df["Age"] = df["Age"].astype(int)

df["Age"].isnull().sum()
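
Because the fill values are random, each run of the notebook produces a slightly different `Age` column. A minimal sketch of a repeatable variant, assuming a fixed seed is acceptable; the seed value and the `ages` series are illustrative assumptions, not part of the original analysis:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed: an assumption, chosen for repeatability

# Hypothetical toy ages with gaps, standing in for df["Age"].
ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, np.nan])

mean, std = ages.mean(), ages.std()
missing = ages.isnull()

# Draw one integer per missing entry from [mean - std, mean + std).
low, high = int(mean - std), int(mean + std)
filled = ages.copy()
filled[missing] = rng.integers(low, high, size=int(missing.sum()))
filled = filled.astype(int)
print(filled.tolist())
```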

**Cabin**

In [None]:
df.Cabin.unique()

The `Cabin` values start with a letter that, on investigation, denotes the deck on which the passenger was housed. Since this may be informative, we can keep only the letter and fill the missing values with a made-up letter (`U`) to get rid of the nulls.

<img src="../images/titanic_cutaway_diagram.png">

In [None]:
df['Cabin'] = df['Cabin'].fillna("U")
df['Deck'] = df['Cabin'].map(lambda x: x[0])

# sns.catplot("Survived", col="Deck", col_wrap=3,
#             data=titanic[titanic.Deck != 'U'], kind="count")
sns.barplot(x="Deck", y="Survived", data=df, ci=None, order=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'U'])

Now we can drop the `Cabin` feature, since it is redundant with `Deck`.

In [None]:
# we can now drop the cabin feature
df = df.drop(['Cabin'], axis=1)

In [None]:
df["Deck"].isnull().sum()

## Outlier detection

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
sns.boxplot(x='Age', data=df)

In [None]:
sns.boxplot(x='Fare', data=df)

As we saw in the theory:
> An outlier is a value of a variable that lies very far from the other observations of the same variable

Typical causes:
- Errors in the measuring instruments
- Random spikes in a variable
- The distribution has a very heavy tail (heavy-tailed distribution)
    - **Be careful about assuming that the distribution is normal**
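
One way to flag candidate outliers without assuming normality is the 1.5 × IQR rule, the same convention box plots commonly use for their whiskers. The sketch below applies it to made-up fare values; the rule and the `fares` series are illustrative choices, not something prescribed by this notebook:

```python
import pandas as pd

# Hypothetical fares with one extreme value.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 35.5, 52.0, 512.33])

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are candidates for inspection, not automatic deletions.
outliers = fares[(fares < lower) | (fares > upper)]
print(outliers.tolist())
```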

In [None]:
df.Age.hist()

In [None]:
df.Fare.hist()

## Different orders of magnitude

In [None]:
df.head()

The two numerical variables in the dataset are `Age` and `Fare`. They are on different orders of magnitude, so we are going to normalize them.

In [None]:
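from sklearn.preprocessing import Normalizer

# Note: Normalizer rescales each row (sample) to unit L1 norm, so each
# passenger's NAge and NFare are that row's shares of Age + Fare
scaler = Normalizer(norm='l1')
ageAndFare = df[["Age", "Fare"]]

ageAndFare = scaler.fit_transform(ageAndFare)
# keep the original index so the new columns align with df
ageAndFare = pd.DataFrame(ageAndFare, columns=["age", "fare"], index=df.index)
df['NAge'] = ageAndFare['age']
df['NFare'] = ageAndFare['fare']

df.head()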
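
Note that `Normalizer` rescales each row, not each column. If the goal is to put each feature on a common scale, a column-wise scaler such as `MinMaxScaler` is one option. The sketch below contrasts the two on made-up (Age, Fare) rows; it illustrates the difference and does not replace the cell above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

# Hypothetical (Age, Fare) rows.
X = np.array([[20.0, 10.0], [40.0, 80.0], [60.0, 30.0]])

# Normalizer works row by row: each sample's absolute values sum to 1.
row_scaled = Normalizer(norm="l1").fit_transform(X)

# MinMaxScaler works column by column: each feature spans [0, 1].
col_scaled = MinMaxScaler().fit_transform(X)

print(row_scaled.sum(axis=1))
print(col_scaled.min(axis=0), col_scaled.max(axis=0))
```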

## Categorical data

In [None]:
df.head()

The categorical features still stored as text are `Sex`, `Deck` and `Embarked`.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

df['Deck'] = le.fit_transform(df['Deck'])

df.head()

In [None]:
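le_sex = LabelEncoder()

df['Sex'] = le_sex.fit_transform(df['Sex'])

# Embarked is also a string column, so encode it as well
le_embarked = LabelEncoder()
df['Embarked'] = le_embarked.fit_transform(df['Embarked'])

df.head()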
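
Integer codes imply an order between categories (for example, deck A < deck C), which some models may interpret as meaningful. One-hot encoding is a common alternative for nominal features. A minimal sketch with `pd.get_dummies` on a made-up `Deck` column, shown as an option and not as the encoding used in this notebook:

```python
import pandas as pd

# Hypothetical toy column standing in for `Deck`.
toy = pd.DataFrame({"Deck": ["A", "C", "U", "C"]})

# One indicator column per category; exactly one is set per row.
one_hot = pd.get_dummies(toy, columns=["Deck"], prefix="Deck")
print(one_hot.columns.tolist())
```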

## Feature selection

In [None]:
df.head()

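First we drop the identifier columns (`PassengerId`, `Name`, `Ticket`), since they add nothing to the model, along with the original `Age` and `Fare`, which are replaced by their normalized versions `NAge` and `NFare`.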

In [None]:
df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Age'], axis=1, inplace=True)
df.head()

In [None]:
sns.heatmap(df.corr(numeric_only=True), annot=True, cbar=True)
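
The heatmap can be complemented by ranking the features by the absolute value of their correlation with the target. A minimal sketch on a made-up frame (the `toy` values are illustrative; on the real data the same expression would be applied to `df`):

```python
import pandas as pd

# Hypothetical toy frame with an encoded target and two features.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1, 2],
})

# Correlation of every feature with the target, strongest first.
target_corr = toy.corr()["Survived"].drop("Survived")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Correlation only captures linear relationships, so a low value does not by itself justify dropping a feature.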

# Training the models

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop("Survived", axis=1)
y = df["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
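
Since the classes are imbalanced, one option is to pass `stratify=y` so that both splits keep the same class ratio. This is a sketch on synthetic labels, offered as a possible refinement and not as the split used above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 60 negatives and 40 positives.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

# stratify=y_demo preserves the 60/40 ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```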

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

logreg = LogisticRegression(solver='liblinear')
scores = cross_val_score(logreg, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

## Naïve Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix

gaussian = GaussianNB() 
gaussian.fit(X_train, y_train)  

y_pred = gaussian.predict(X_test)  

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

gaussian = GaussianNB() 
scores = cross_val_score(gaussian, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

decision_tree = DecisionTreeClassifier() 
decision_tree.fit(X_train, y_train) 

y_pred = decision_tree.predict(X_test)  

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

decision_tree = DecisionTreeClassifier() 
scores = cross_val_score(decision_tree, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, y_train)

y_pred = random_forest.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(rf, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

In [None]:
importances = pd.DataFrame({'feature':X_train.columns,'importance':np.round(random_forest.feature_importances_,3)})
importances = importances.sort_values('importance',ascending=False).set_index('feature')
importances.head(15)

In [None]:
importances.plot.bar()

## Support Vector Machine

In [None]:
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix

linear_svc = LinearSVC(max_iter=1000000)
linear_svc.fit(X_train, y_train)

y_pred = linear_svc.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

linear_svc = LinearSVC(max_iter=1000000)  # same iteration budget as the model fitted above
scores = cross_val_score(linear_svc, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

## K Nearest Neighbor

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

knn = KNeighborsClassifier(n_neighbors = 3) 
knn.fit(X_train, y_train)  

y_pred = knn.predict(X_test)  

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors = 3) 
scores = cross_val_score(knn, X_train, y_train, cv=10, scoring = "accuracy")

print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard Deviation:", scores.std())

In [None]:
# experimenting with different values of k
k_range = list(range(1,26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))
    
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for Values of k of k-Nearest-Neighbors')
plt.show()
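
The loop above picks k by looking at test-set accuracy, which tends to make the final test score optimistic. A hedged alternative is to choose k with cross-validation on the training data only. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the names `X_demo`, `y_demo` and `best_k` are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training split.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

k_values = range(1, 26)
cv_means = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means.append(fold_scores.mean())

# k with the highest mean cross-validated accuracy (first one on ties).
best_k = k_values[cv_means.index(max(cv_means))]
print(best_k)
```

The test set is then used once, at the end, to evaluate the model trained with `best_k`.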