**EXPLORATORY DATA ANALYSIS (EDA)**


**1) Import Data**

Getting the data in the first place is often a big challenge for data scientists. Luckily, Kaggle provides a clean data set for the competition, so we just need to import the CSV files using pandas.

From Kaggle we get two data sets: "train" and "test". The test set is only used at the end of the study to validate our predictive model, but we import it now so that it's done.

In [None]:
import pandas as pd 
import numpy as np

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

Usually the first thing we want to do is take a quick glance at our data. For this, a pandas DataFrame has the attribute shape and two basic methods: head() and describe().

In [None]:
train.shape

In [None]:
train.head()

In [None]:
train.describe(include='all')  # pro tip: include='all' also shows the non-numerical columns

Alright, so far we know that our data set has 891 rows and 12 columns:

* PassengerId
* Survived
* Pclass: the passenger class. It has three possible values: 1, 2, 3
* Name
* Sex
* Age
* SibSp: number of siblings and spouses traveling with the passenger
* Parch: number of parents and children traveling with the passenger
* Ticket: the ticket number
* Fare: the ticket fare
* Cabin: the cabin number
* Embarked: the port of embarkation. It has three possible values: S, C, Q

We can see that Cabin has only 204 values out of 891. That's not enough to be useful, so we are going to drop this feature. We can also see from train.head() that the Ticket and PassengerId features seem useless, so we're going to delete them as well.

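We also have missing values for Age. We could try to guess the missing ages based on other features, but for the EDA we're just going to drop the rows with a NaN value using dropna(). We also encode Sex as a number (male = 1, female = 0) so that it shows up in the correlations later on.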

In [None]:
train = train.drop(['Ticket','Cabin','PassengerId'], axis=1)
train = train.dropna()
train['Sex'] = train['Sex'].map({'male':1,'female':0})
train.head()
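
Before dropping anything, it can help to count the missing values per column. Here is a minimal sketch on a made-up three-row frame (the values are illustrative and are not taken from the Kaggle file):

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", "E46"],
    "Fare": [7.25, 71.28, 8.05],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# dropna() keeps only the rows that contain no NaN at all
complete_rows = toy.dropna()
print(complete_rows.shape)
```

Running the same isna().sum() on the real train dataframe is a quick way to decide which columns to drop and which to fill.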

Now we're going to have a look at our data with graphs, using matplotlib and seaborn:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')

Let's start with our Survived feature distribution : 

In [None]:
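g = sns.countplot(x="Survived", data=train)  # recent seaborn versions require the x= keyword
plt.title("Survived distribution on the dataset")
plt.show()  # sns.plt is no longer available, call matplotlib directly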

We can see that in our dataset more people died than survived on the Titanic. 

Next we are going to create a correlation heatmap. Correlation is a statistical technique that is used to measure and describe the strength and direction of the relationship between two variables. 

In [None]:
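# numeric_only=True skips the text columns (Name, Embarked), which have no correlation
g = sns.heatmap(train.corr(numeric_only=True), annot=True, fmt=".2f")
plt.title("Heatmap of the correlations")
plt.show()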
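
To make the numbers in the heatmap more concrete, here is a small sketch of how a Pearson correlation coefficient can be computed by hand. The pclass and fare arrays are invented for illustration, and the pearson helper is a hypothetical name, not part of pandas or seaborn:

```python
import numpy as np

# Hypothetical values: a higher class number (3 = cheapest) goes with a lower fare
pclass = np.array([1, 1, 2, 3, 3], dtype=float)
fare = np.array([80.0, 60.0, 20.0, 8.0, 7.0])

def pearson(x, y):
    """Covariance of x and y divided by the product of their spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

r = pearson(pclass, fare)
print(round(r, 3))  # negative: fare goes down when the class number goes up
```

The result always lies between -1 and 1: the sign gives the direction of the relationship and the absolute value gives its strength.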

Here we can see positive correlations between:
* **Survived and Fare**: People who paid more for their ticket tend to have a higher survival rate.
* **Parch and SibSp**: SibSp is the number of siblings and spouses and Parch is the number of parents and children, so it makes sense that the two are correlated.
* **Parch and Fare**: People who travel with a big family pay more for their ticket, because usually if you can afford to travel with a big family you can afford an expensive ticket.

And negative correlations between:
* **Survived and Sex**: Men (encoded as 1) have a much lower survival rate than women (encoded as 0), hence "Women and children first!"
* **Survived and Pclass**: Our classes go from 1 to 3, with 3 being the cheapest and 1 the most expensive. Here a negative correlation means that people in third class tend to have a lower survival rate.
* **Pclass and Fare**: The class is determined by the price of the ticket.
* **Pclass and Age**: Older people tend to be in the more expensive classes.



Let's have a closer look at these features to confirm our assumptions:



In [None]:
# factorplot was renamed catplot; errorbar=None replaces ci=None (use ci=None on seaborn < 0.12)
g = sns.catplot(x="Pclass", y="Survived", hue="Sex", kind="bar", data=train, errorbar=None)
plt.title("Survival rate by class and sex")
plt.show()

In [None]:
# Select Fare before averaging so that text columns such as Name are never averaged
train.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(12, 6))
plt.title("Average fare for each class")
plt.show()

In [None]:
fig = sns.FacetGrid(train, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)  # fill= replaces the deprecated shade= (seaborn >= 0.11)
plt.title("Age distribution for each class")
fig.add_legend()
plt.show()

Alright, we took a glance at the data and spotted some interesting correlations. However, we couldn't manage to analyze more complicated features like the names because these required further processing. This is why in the next part we'll focus on the ways to transform these features to fit our machine learning algorithms.

**FEATURE ENGINEERING**

Feature engineering is the process of reshaping, transforming or creating new features based on the existing ones. We're going to do it on both data sets, so to save time we'll combine the train and test sets:

In [None]:
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

data = pd.concat([train, test], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
data.shape

In [None]:
data.describe(include = "all")

In [None]:
data.info()

For the same reasons as mentioned at the beginning, we're going to drop our useless or unusable features. We also drop Survived here, since it is the target we want to predict and not a feature:

In [None]:
data = data.drop(['Survived','Ticket','Cabin','PassengerId'], axis=1)

Processing Name: if we look closely at the Name feature, we can see that each name has a title in it. For example, "Braund, Mr. Owen Harris" has the title "Mr". We can group these titles into categories and create a new feature.

So first, we split each name at the first "," and the following "." to extract the title. There are a lot of different titles, so we then group them into 6 categories:
* Officer
* Royalty
* Mr
* Mrs
* Miss
* Master

In [None]:
def get_title(name):
    if '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    else:
        return 'Unknown'
    
Title_Map = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Sir": "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Dona": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty",
}
    
data["Title"] = data["Name"].apply(get_title).map(Title_Map)

We now have a Title for each of our passengers.
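
To see what the splitting does, here is a short standalone sketch. The extract_title helper is a hypothetical copy of the same logic, applied to three names written in the same "Surname, Title. First names" format as the data set:

```python
def extract_title(name):
    # Text between the first "," and the following "." is taken as the title
    if '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    return 'Unknown'

examples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(name) for name in examples]
print(titles)  # ['Mr', 'Miss', 'the Countess']
```

Note that any extracted title that is not a key of the mapping dictionary becomes NaN after .map(), so it is worth checking the new column for missing values.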

Processing Age: We're missing a lot of values for our Age feature, and we can't just drop the rows with a missing value because it would hurt our model's accuracy too much. We could simply use the mean or the median of all the ages, but we can do better here.
We can group our passengers by sex, class and our newly created Title feature, compute the median age of each group, and then fill each missing age with the median of the passenger's group.


In [None]:
data["Age"] = data.groupby(['Sex','Pclass','Title'])['Age'].transform(lambda x: x.fillna(x.median()))
data.info()

This single line of code does the whole job. If you have trouble understanding it, you can check the well-made documentation here: https://pandas.pydata.org/pandas-docs/stable/groupby.html
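
To show what a grouped median fill does step by step, here is a sketch on a tiny made-up frame. It uses only two grouping columns and an equivalent two-step spelling (transform("median") followed by fillna), which is an assumption made for readability, not the notebook's exact line:

```python
import numpy as np
import pandas as pd

# Hypothetical passengers: two groups, each with one missing age
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median age of each (Sex, Pclass) group, repeated on every row of that group
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced; known ages are left untouched
toy["AgeFilled"] = toy["Age"].fillna(group_median)
print(toy)
```

The missing male age becomes 25 (median of 20 and 30) and the missing female age becomes 45 (median of 40 and 50). A group whose ages are all missing would stay NaN, so it is worth checking the result with info().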

Processing Title:

Models usually need numerical variables, which is why we have to transform our categorical variables.
One way to do it is dummy encoding. For example, here we have one categorical feature "Title" with 6 categories stored as strings. The function pd.get_dummies will create a column for each category and fill it with 0 and 1, 1 meaning that the passenger is in this category.

In [None]:
titles_dummies = pd.get_dummies(data['Title'],prefix='Title')
data = pd.concat([data,titles_dummies],axis=1)

data.drop("Name", axis=1, inplace = True)
data.head()
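
As a small illustration of dummy encoding, here is what pd.get_dummies returns for a made-up series of four titles:

```python
import pandas as pd

# Hypothetical titles for four passengers
titles = pd.Series(["Mr", "Miss", "Mrs", "Mr"], name="Title")

# One column per distinct title, named with the given prefix
dummies = pd.get_dummies(titles, prefix="Title")
print(dummies)
```

Each row has exactly one active column. Depending on the pandas version, the values are shown as 0/1 or as False/True; passing dtype=int to get_dummies forces integers.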

Processing Fare : 
We have a few missing values here, so we're just going to fill them with the mean.

In [None]:
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())  # assign back; inplace on a selected column is unreliable

Processing Embarked :
We also have a few missing values here, so we're going to replace them with the most frequent value, S, and then create the dummy columns.

In [None]:
data['Embarked'] = data['Embarked'].fillna('S')  # assign back; inplace on a selected column is unreliable
Embarked_dummies = pd.get_dummies(data['Embarked'],prefix='Embarked')
data = pd.concat([data,Embarked_dummies],axis=1)

Processing Parch and SibSp:
We're going to use Parch and SibSp to create a new feature called "FamilySize", which is simply the sum of these two features plus one for the passenger.
Then we will break it into 3 categories: singleton (passenger alone), small family (2 to 4 people) and large family (5 people or more).

In [None]:
data['FamilySize'] = data['Parch'] + data['SibSp'] + 1

data['Singleton'] = data['FamilySize'].map(lambda s: 1 if s == 1 else 0)
data['SmallFamily'] = data['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
data['LargeFamily'] = data['FamilySize'].map(lambda s: 1 if 5<=s else 0)
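
The same three buckets can also be written with pd.cut, which makes the bin boundaries explicit. This is only an alternative sketch on made-up family sizes, not the approach used in the notebook:

```python
import pandas as pd

# Hypothetical family sizes
family_size = pd.Series([1, 2, 4, 5, 11], name="FamilySize")

# Bins are right-inclusive: (0, 1], (1, 4], (4, inf]
category = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Singleton", "SmallFamily", "LargeFamily"],
)

# One indicator column per bucket, in the order of the labels
family_dummies = pd.get_dummies(category)
print(family_dummies)
```

With this version, changing a threshold only means editing the bins list.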

Processing Sex :
Here we're just going to transform our strings into a numerical variable.

In [None]:
data['Sex'] = data['Sex'].map({'male':1,'female':0})

Processing Pclass : 
Just dummy encoding our feature.

In [None]:
pclass_dummies = pd.get_dummies(data['Pclass'], prefix="Pclass") 
data = pd.concat([data,pclass_dummies],axis=1)


Alright, we're almost there. Now we're going to drop the original features that have been replaced by dummy columns:

In [None]:
data.drop(['Pclass','Embarked','Title'],axis=1,inplace=True)
data.shape

**MODELING**

This is the exciting part where we are going to make predictions using our data. The first thing we want to do is split our data back into a train set and a test set. This is a crucial step because we need the test set to get a score from Kaggle at the end.




In [None]:
train0 = pd.read_csv("../input/train.csv")
targets = train0.Survived
train = data.head(891)  # "data" was built by putting the test rows after the 891 train rows,
                        # so we can split it back by selecting the first 891 rows
test = data.iloc[891:]


Alright, so here we have a classification problem: we're trying to predict a categorical response, "Survived", which is encoded as a 0 or a 1. There are many classification models we can use with the scikit-learn package, but first we need a way to evaluate our model's performance.
Since we can't evaluate our model on the same data it was trained on, we have to split our train dataframe again. It might be confusing, so to clarify: the first split (train and test) is only there for the Kaggle challenge score, and the second one lets us evaluate the model for ourselves.

We're going to use cross-validation from scikit-learn:

In [None]:
from sklearn.model_selection import cross_val_score
def compute_score(classifier, X, y, scoring='accuracy'):
    xval = cross_val_score(classifier, X, y, cv = 5, scoring=scoring)
    return np.mean(xval)
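
To show what cross-validation does under the hood, here is a sketch that runs the five folds by hand and compares the result with cross_val_score. It uses synthetic data from make_classification instead of the Titanic features, and an explicit StratifiedKFold splitter so that both versions use the same folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train features and targets
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# Each fold is held out once while the model is trained on the other four
cv = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, valid_idx in cv.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[valid_idx], y[valid_idx]))

# The library function performs the same loop internally
library_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(np.mean(manual_scores), library_scores.mean())
```

Every row is used for validation exactly once, which is why the averaged score is a fairer estimate than a single train/validation split.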

Now that we have our tool to evaluate our model, let's start with one of the simplest classifiers: KNN (k-nearest neighbors). You can learn more about KNN here: http://scikit-learn.org/stable/modules/neighbors.html

First we import the model from sklearn.neighbors, then create an instance of our estimator (= model). This is where we enter the parameters we want; here we're going to select K=1.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn

Alright, now we fit the model on our data and use our compute_score function to get an accuracy score:

In [None]:
knn.fit(train,targets)
knn_score = compute_score(knn,train,targets)
knn_score

This is our result for K=1.
Now let's try for different values of K

In [None]:
k_range = list(range(1, 31))
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = compute_score(knn, train,targets)
    k_scores.append(scores.mean())
print(k_scores)

We plot our results :

In [None]:
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')

In [None]:
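# KNeighborsClassifier has no best_score_ attribute, so read the best K from our own scores
best_k = k_range[int(np.argmax(k_scores))]
best_k, max(k_scores)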

We can see that the best accuracy with this model is obtained with K=21.

There's a tool we can use to test every combination of parameter values and find the best one: GridSearchCV. We feed it a dictionary of the parameter values we want it to test and then fit it on our data.

In [None]:
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search no longer exists
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid.fit(train, targets)

# cv_results_ replaces the removed grid_scores_ attribute
grid_mean_scores = grid.cv_results_['mean_test_score']
plt.plot(k_range, grid_mean_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validated Accuracy')



We notice that this plot is the same as the previous one.
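
Besides the mean scores, a fitted GridSearchCV also exposes the best parameter combination directly. Here is a minimal sketch on synthetic data (make_classification) with a deliberately short list of K values, chosen only to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the train features and targets
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 15]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# One mean cross-validated score per tested value of n_neighbors
mean_scores = search.cv_results_["mean_test_score"]
print(search.best_params_, search.best_score_)
```

best_score_ is simply the highest entry of mean_test_score, and best_params_ is the combination that produced it.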

Alright, now let's try a more complex model: the random forest classifier. A random forest combines the predictions of many decision trees. You can learn more about it here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

Let's try it with the basic parameters : 

In [None]:
from sklearn.ensemble import RandomForestClassifier

rdc = RandomForestClassifier()
rdc.fit(train,targets)
rdc_score = compute_score(rdc,train,targets,scoring='accuracy')
rdc_score

Now we use GridSearchCV to find the best combination of parameters : 

In [None]:
run_grid = False  # set to True to rerun the (slow) grid search
if run_grid:
    # 'auto' (equivalent to 'sqrt' for classifiers) is no longer accepted by recent scikit-learn
    param_grid = {'max_depth': [4, 6, 8],
                  'n_estimators': [50, 10],
                  'max_features': ['sqrt', 'log2'],
                  'min_samples_split': [2, 3, 10],
                  'min_samples_leaf': [1, 3, 10],
                  'bootstrap': [True, False]}
    grid = GridSearchCV(rdc, param_grid, cv=10, scoring='accuracy')
    grid.fit(train, targets)
    grid_mean_scores = grid.cv_results_['mean_test_score']  # replaces the removed grid_scores_
    model = grid
    parameters = grid.best_params_
    print(grid.best_score_)
    print(grid.best_params_)
else:
    # Best parameters found by a previous run of the grid search
    parameters = {'bootstrap': False, 'max_depth': 6, 'max_features': 'sqrt', 'min_samples_leaf': 3,
                  'min_samples_split': 3, 'n_estimators': 10}
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)

In [None]:
rfc_score = compute_score(model, train, targets, scoring='accuracy')
rfc_score

82.8%, that's better! Whether this number is good or bad depends on the problem, so we can't judge it in isolation. The goal of a data scientist is now to improve this result by:
- creating new features
- trying different models such as gradient boosted trees or XGBoost