# Kaggle Titanic Competition
![image](https://user-images.githubusercontent.com/45148200/50421918-9dd08e80-0844-11e9-8ea8-1a62067d3c35.png)


In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
import re
import operator
from sklearn.feature_selection import SelectKBest, f_classif
import warnings
warnings.filterwarnings('ignore')
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeClassifier
# Going to use these 5 base models for the stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC



# Extracting the data and setting the target

In [2]:
train = pd.read_csv("train.csv", dtype={"Age": np.float64} )
test = pd.read_csv("test.csv", dtype={"Age": np.float64})

target = train["Survived"].values
full = pd.concat([train, test])

## Feature Engineering
Why is feature engineering important?
![image](https://user-images.githubusercontent.com/45148200/50422104-4c75ce80-0847-11e9-9665-63d2e37660aa.png)


We should therefore spend most of our time building the best possible features, so that the models can get as much information as possible out of the data.

### Name

- Surname
- NameLength

In [3]:
full['surname'] = full["Name"].apply(lambda x: x.split(',')[0].lower())
full["NameLength"] = full["Name"].apply(lambda x: len(x))


###  Title
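The title says something about the chances of survival: a lady is more likely to be rescued and thus to survive, and the same goes for a Master (the title given to young boys).

- Extract the title from the name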


In [4]:

full["Title"] = full["Name"].apply(lambda x: re.search(r' ([A-Za-z]+)\.', x).group(1))
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 2, "Mme": 3,"Don": 9,"Dona": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
full["TitleCat"] = full.loc[:,'Title'].map(title_mapping)
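
As a quick illustration of the extraction step, the sketch below applies the same pattern to a few sample names written in the `Surname, Title. First names` format. The helper name `extract_title` is only introduced for this example.

```python
import re

def extract_title(name):
    """Return the first word that is preceded by a space and followed by a period."""
    match = re.search(r' ([A-Za-z]+)\.', name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
titles = [extract_title(n) for n in names]
print(titles)
```

Returning `None` when nothing matches is a choice made for this sketch; the cell above would instead raise an error on a name without a title.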

### Family
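A passenger travelling alone does not have to look after anyone else, which may increase their chances of survival.

- We split the family size into 4 categories. Because `pd.cut` is given a number of bins, it creates equal-width intervals over the observed range, so category 0 holds the smallest families, including passengers travelling alone.
    - 0 = Alone or very small family
    - 1 = Small family
    - 2 = Medium family
    - 3 = Large family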

In [5]:

full["FamilySize"] = full["SibSp"] + full["Parch"] + 1
full["FamilySize"] = pd.cut(full["FamilySize"], 4, labels=[0,1,2,3])
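
To see what `pd.cut` does when it receives a number of bins rather than explicit edges, here is a small sketch on made-up family sizes (illustrative values, not taken from the dataset). The range 1 to 11 is split into four intervals of equal width, so sizes 1, 2 and 3 all fall into category 0.

```python
import pandas as pd

# Made-up family sizes covering the range 1..11
sizes = pd.Series([1, 2, 3, 4, 6, 7, 8, 11])

# bins=4 creates four equal-width intervals between the minimum and the maximum:
# (0.99, 3.5], (3.5, 6.0], (6.0, 8.5], (8.5, 11.0]
categories = pd.cut(sizes, 4, labels=[0, 1, 2, 3])
codes = categories.astype(int).tolist()
print(codes)
```

If a category reserved for solo travellers is wanted, explicit bin edges would be needed instead of a bin count.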


### Embarked
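Here, instead of using `get_dummies`, we encode the three ports (C, Q, S) as integer codes with `pd.Categorical`. The categories are sorted, so C, Q and S are mapped to 0, 1 and 2, and missing values are mapped to -1. This is another way to encode a column.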

In [6]:
full["Embarked"] = pd.Categorical(full.Embarked).codes
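
A minimal sketch of how `pd.Categorical` assigns its codes, using a made-up list of ports with one missing value:

```python
import pandas as pd

# Made-up sample with one missing port
ports = pd.Series(["S", "C", "Q", None, "S"])

# Categories are sorted (C, Q, S), so the codes are C=0, Q=1, S=2 and -1 for missing
codes = pd.Categorical(ports).codes.tolist()
print(codes)
```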


### Fare
This is another way of doing what we did for the family size, but more manual: we choose the thresholds ourselves. Missing fares are first filled with 8.05.


In [7]:
full["Fare"] = full["Fare"].fillna(8.05)
full.loc[ full['Fare'] <= 7.91, 'Fare']      = 0
full.loc[(full['Fare'] > 7.91) & (full['Fare'] <= 14.454), 'Fare'] = 1
full.loc[(full['Fare'] > 14.454) & (full['Fare'] <= 31), 'Fare']   = 2
full.loc[ full['Fare'] > 31, 'Fare']= 3
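
The same thresholds can also be expressed as a lookup. The sketch below is one possible plain-Python formulation, using the standard `bisect` module; the helper `fare_category` and the sample fares are made up for the illustration.

```python
from bisect import bisect_left

FARE_EDGES = [7.91, 14.454, 31]  # thresholds used in the cell above

def fare_category(fare):
    """Count the edges strictly below the fare, which gives a category from 0 to 3."""
    return bisect_left(FARE_EDGES, fare)

samples = [0.0, 7.91, 8.05, 14.454, 26.0, 31.0, 512.33]
fare_codes = [fare_category(f) for f in samples]
print(fare_codes)
```

A fare equal to a threshold stays in the lower category, matching the `<=` comparisons in the cell above.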

### Sex 
We map this column in the usual way:

- 0 = Female
- 1 = Male

In [8]:
full['Sex'] = full['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

### CabinCat 
This may not be obvious. We build a new column from `Cabin`: missing values are first replaced by '0', then only the first character of each cabin value is kept, and these characters are finally encoded as integer codes. A passenger without a cabin therefore gets the code that corresponds to '0'.

In [9]:
full['CabinCat'] = pd.Categorical(full.Cabin.fillna('0').apply(lambda x: x[0])).codes
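
The three steps can be mimicked in plain Python to make the result easier to follow. This is only a sketch of the idea on made-up cabin values, not what pandas does internally.

```python
# Made-up cabin values, None standing for a missing cabin
cabins = ["C85", None, "E46", "B57 B59 B63 B66", None]

# Replace missing cabins by '0' and keep only the first character
letters = [(cabin or "0")[0] for cabin in cabins]

# The sorted unique values play the role of the categories; '0' sorts before the letters
categories = sorted(set(letters))
cabin_codes = [categories.index(letter) for letter in letters]
print(letters, cabin_codes)
```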



###  The Cabin Type 

In [10]:

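def get_type_cabine(cabine):
    # Look for the first run of digits in the cabin string
    cabine_search = re.search(r'\d+', cabine)
    if cabine_search:
        num = int(cabine_search.group(0))
        # '2' for an even cabin number, '1' for an odd one
        return '2' if num % 2 == 0 else '1'
    # '0' when the cabin is missing or contains no number
    return '0'

full["Cabin"] = full["Cabin"].fillna(" ")
full["CabinType"] = full["Cabin"].apply(get_type_cabine)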
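
A few sample inputs show what this feature returns. The sketch below uses a stand-alone helper, `cabin_parity`, which is a hypothetical name that mirrors the function above; the cabin strings are made up.

```python
import re

def cabin_parity(cabin):
    """'1' for an odd cabin number, '2' for an even one, '0' if there is no number."""
    match = re.search(r'\d+', cabin)
    if match is None:
        return '0'
    return '2' if int(match.group(0)) % 2 == 0 else '1'

examples = ["C85", "B58 B60", "D", " "]
parities = [cabin_parity(c) for c in examples]
print(parities)
```

Only the first number found is used, so a value listing several cabins is classified by its first cabin.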


### Age
Here we deal with age. Common knowledge about disasters like this one suggests that children, elderly people and women are the first to be rescued. We therefore divide this column as follows:

- Child if age < 14
- Old female or old male if age > 60
- Adult female or adult male if 14 <= age <= 60

We then use `get_dummies`, which gives one column per category. Ages that are still missing at this point fail both comparisons, so those passengers fall into the adult categories.

In [11]:
child_age = 14
old_age=60
def get_person(passenger):
    age, sex = passenger
    if (age < child_age):
        return 'child'
    elif (sex == 0  ):
        if (age>old_age):
            return 'female_adult_old'
        else:
            return 'female_adult'
    elif (sex == 1  ):
        if (age>old_age):
            return 'male_adult_old'
        else:
            return 'male_adult'
full = pd.concat([full, pd.DataFrame(full[['Age', 'Sex']].apply(get_person, axis=1), columns=['person'])],axis=1)
full = pd.concat([full,pd.get_dummies(full['person'])],axis=1)
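
The following sketch runs the same logic on four made-up passengers, one of them with a missing age, and shows the columns produced by `get_dummies`. The helper `person_category` is a hypothetical condensed version of the function above.

```python
import pandas as pd

CHILD_AGE, OLD_AGE = 14, 60

def person_category(age, sex):
    """sex is encoded as 0 = female, 1 = male."""
    if age < CHILD_AGE:
        return "child"
    prefix = "female" if sex == 0 else "male"
    return prefix + ("_adult_old" if age > OLD_AGE else "_adult")

passengers = pd.DataFrame({"Age": [8.0, 30.0, 65.0, float("nan")], "Sex": [1, 0, 1, 0]})
passengers["person"] = [person_category(a, s)
                        for a, s in zip(passengers["Age"], passengers["Sex"])]

# One indicator column per category that actually occurs in the data
dummies = pd.get_dummies(passengers["person"])
print(passengers["person"].tolist())
print(list(dummies.columns))
```

The passenger with the missing age ends up in `female_adult`, because any comparison with NaN is false.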

### Ticket
This one is, I think, a really elaborate piece of feature engineering. I didn't come up with it myself; I saw it in a kernel on Kaggle.

The idea is to give weight to the ticket number, since it may indicate whether the passenger was close to a rescue area.
For each ticket we therefore determine who it belongs to:

- A male or a female
- A person travelling alone
- A survivor or a victim

In [12]:
# Number of passengers per ticket, stored under an explicit column name
table_ticket = full["Ticket"].value_counts().rename("Ticket_Members").to_frame()

# Flag tickets held by at least one adult woman with relatives aboard who perished
table_ticket['Ticket_perishing_women'] = full.Ticket[(full.female_adult == 1.0)
                                    & (full.Survived == 0.0)
                                    & ((full.Parch > 0) | (full.SibSp > 0))].value_counts()
table_ticket['Ticket_perishing_women'] = table_ticket['Ticket_perishing_women'].fillna(0)
table_ticket.loc[table_ticket['Ticket_perishing_women'] > 0, 'Ticket_perishing_women'] = 1.0

# Flag tickets held by at least one adult man with relatives aboard who survived
table_ticket['Ticket_surviving_men'] = full.Ticket[(full.male_adult == 1.0)
                                    & (full.Survived == 1.0)
                                    & ((full.Parch > 0) | (full.SibSp > 0))].value_counts()
table_ticket['Ticket_surviving_men'] = table_ticket['Ticket_surviving_men'].fillna(0)
table_ticket.loc[table_ticket['Ticket_surviving_men'] > 0, 'Ticket_surviving_men'] = 1.0

table_ticket["Ticket_Id"] = pd.Categorical(table_ticket.index).codes

In [13]:
# Tickets shared by fewer than 3 passengers are compressed into one code
table_ticket.loc[table_ticket["Ticket_Members"] < 3, "Ticket_Id"] = -1
table_ticket["Ticket_Members"] = pd.cut(table_ticket["Ticket_Members"], bins=[0,1,4,20], labels=[0,1,2])

full = pd.merge(full, table_ticket, left_on="Ticket",right_index=True,how='left', sort=False)
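
To make the ticket flags easier to follow, here is a sketch of the same idea on toy data. It uses `isin` instead of a lookup table and a merge, which is a different formulation from the cells above; every value in the toy frame is made up.

```python
import pandas as pd

# Toy data: tickets "A" and "B" are shared, ticket "C" belongs to one person
df = pd.DataFrame({
    "Ticket":       ["A", "A", "B", "B", "C"],
    "female_adult": [1,   0,   1,   0,   1],
    "Survived":     [0.0, 1.0, 1.0, 0.0, 0.0],
    "SibSp":        [1,   1,   1,   1,   0],
    "Parch":        [0,   0,   0,   0,   0],
})

# Adult women who perished while travelling with relatives
mask = ((df["female_adult"] == 1) & (df["Survived"] == 0.0)
        & ((df["Parch"] > 0) | (df["SibSp"] > 0)))
flagged_tickets = set(df.loc[mask, "Ticket"])

# Every passenger on a flagged ticket receives the flag, whatever their own outcome
df["Ticket_perishing_women"] = df["Ticket"].isin(flagged_tickets).astype(float)
print(df[["Ticket", "Ticket_perishing_women"]])
```

Since the flag depends on `Survived`, it can only come from rows where the outcome is known, that is, from the training rows.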

In [14]:
# Number of passengers per surname, stored under an explicit column name
table_surname = full["surname"].value_counts().rename("Surname_Members").to_frame()

table_surname['Surname_perishing_women'] = full.surname[(full.female_adult == 1.0)
                                    & (full.Survived == 0.0)
                                    & ((full.Parch > 0) | (full.SibSp > 0))].value_counts()
table_surname['Surname_perishing_women'] = table_surname['Surname_perishing_women'].fillna(0)
table_surname.loc[table_surname['Surname_perishing_women'] > 0, 'Surname_perishing_women'] = 1.0

table_surname['Surname_surviving_men'] = full.surname[(full.male_adult == 1.0)
                                    & (full.Survived == 1.0)
                                    & ((full.Parch > 0) | (full.SibSp > 0))].value_counts()
table_surname['Surname_surviving_men'] = table_surname['Surname_surviving_men'].fillna(0)
table_surname.loc[table_surname['Surname_surviving_men'] > 0, 'Surname_surviving_men'] = 1.0

table_surname["Surname_Id"] = pd.Categorical(table_surname.index).codes
# Surnames shared by fewer than 3 passengers are compressed into one code
table_surname.loc[table_surname["Surname_Members"] < 3, "Surname_Id"] = -1

table_surname["Surname_Members"] = pd.cut(table_surname["Surname_Members"], bins=[0,1,4,20], labels=[0,1,2])

full = pd.merge(full, table_surname, left_on="surname",right_index=True,how='left', sort=False)

## The training & testing set
Here we impute the missing ages: a regressor is trained on the passengers whose age is known and then used to predict the age of the others.

In [15]:

classers = ['Fare','Parch','Pclass','SibSp','TitleCat', 
'CabinCat','Sex', 'Embarked', 'FamilySize', 'NameLength','Ticket_Members','Ticket_Id']
etr = RandomForestRegressor(n_estimators=200,max_depth=15)
X_train = full[classers][full['Age'].notnull()]
Y_train = full['Age'][full['Age'].notnull()]
X_test = full[classers][full['Age'].isnull()]
etr.fit(X_train,np.ravel(Y_train))
age_preds = etr.predict(X_test)
full.loc[full['Age'].isnull(), 'Age'] = age_preds
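
The imputation pattern can be tried on a tiny example. This is a minimal sketch on made-up data with only two predictors, assuming scikit-learn is available; the column values, `n_estimators=20` and `random_state=0` are choices made for the illustration, not the settings used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: Age is missing for two passengers
df = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Fare":   [3, 0, 1, 0, 3, 2],
    "Age":    [40.0, 22.0, np.nan, 19.0, 51.0, np.nan],
})
predictors = ["Pclass", "Fare"]
known = df["Age"].notnull()

# Train on the rows where the age is known
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(df.loc[known, predictors], df.loc[known, "Age"])

# Fill only the missing entries; known ages are left untouched
df.loc[~known, "Age"] = model.predict(df.loc[~known, predictors])
print(df)
```

Assigning through `.loc` writes into the DataFrame itself, which avoids the ambiguity of chained indexing.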

## Features 
Finally, we gather all the relevant features for training and evaluate the importance of each one.

In [16]:
# Note: 'Pclass' is listed twice, so it is also reported twice in the rankings below
features = ['Sex','Age','female_adult','male_adult','female_adult_old','male_adult_old', 'child','TitleCat', 'Pclass',
'Pclass','Ticket_Id','NameLength','CabinType','CabinCat', 'SibSp', 'Parch',
'Fare','Embarked','Surname_Members','Ticket_Members','FamilySize',
'Ticket_perishing_women','Ticket_surviving_men',
'Surname_perishing_women','Surname_surviving_men']

train = full[0:891].copy()
test = full[891:].copy()

selector = SelectKBest(f_classif, k=len(features))
selector.fit(train[features], target)
scores = -np.log10(selector.pvalues_)
indices = np.argsort(scores)[::-1]
print("Features importance :")
for f in range(len(scores)):
    print("%0.2f %s" % (scores[indices[f]],features[indices[f]]))


Features importance :
68.85 Sex
66.05 male_adult
59.96 female_adult
26.22 TitleCat
24.60 Pclass
24.60 Pclass
23.69 NameLength
18.73 Fare
17.75 CabinCat
17.41 Ticket_surviving_men
16.28 CabinType
13.54 Ticket_perishing_women
13.16 Surname_surviving_men
10.36 Surname_perishing_women
6.94 Embarked
5.27 Ticket_Members
3.77 child
1.94 male_adult_old
1.83 Parch
1.55 female_adult_old
1.41 Age
1.19 FamilySize
1.07 Ticket_Id
0.73 Surname_Members
0.53 SibSp
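
Each score above is the negative base-10 logarithm of the p-value that `f_classif` computes for a feature, so a score of about 68.85 corresponds to a p-value of roughly 10^-68.85. The sketch below shows the transformation and the ranking on hypothetical p-values; the feature names and numbers are invented for the example.

```python
import math

# Hypothetical p-values for three made-up features
p_values = {"feature_a": 1e-20, "feature_b": 0.03, "feature_c": 0.5}

# The smaller the p-value, the larger the score
scores = {name: -math.log10(p) for name, p in p_values.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print("%0.2f %s" % (scores[name], name))
```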


## The training & prediction
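
The cell below estimates accuracy with 4-fold cross-validation on the 891 training rows. As a rough sketch of how an unshuffled K-fold split partitions the row indices, consider the following. The helper `contiguous_folds` is hypothetical and only mimics the idea; it is not scikit-learn's implementation.

```python
def contiguous_folds(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

fold_sizes = [len(test) for _, test in contiguous_folds(891, 4)]
print(fold_sizes)
```

Each row is used for testing exactly once, and the model is refitted on the remaining rows for every fold.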

In [17]:

from sklearn.model_selection import KFold, cross_val_score

rfc = RandomForestClassifier(n_estimators=5000, min_samples_split=4, class_weight={0:0.5,1:0.255})
# Gradient boosting from scikit-learn, named gbc so that it does not shadow the xgboost module imported as xgb
gbc = GradientBoostingClassifier(n_estimators=5000,learning_rate=0.1,max_depth=10)

# 4 consecutive folds without shuffling (random_state has no effect without shuffling)
kf = KFold(n_splits=4)

# Note: this cross-validates the gradient boosting model, even though the printed label says RFC
scores = cross_val_score(gbc, train[features], target, cv=kf)
print("Accuracy: %0.3f (+/- %0.2f) [%s]" % (scores.mean()*100, scores.std()*100, 'RFC Cross Validation'))
rfc.fit(train[features], target)
# This score is measured on the training data itself, so it overestimates the accuracy on unseen data
score = rfc.score(train[features], target)
print("Accuracy: %0.3f            [%s]" % (score*100, 'RFC full test'))
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(len(features)):
    print("%d. feature %d (%f) %s" % (f + 1, indices[f]+1, importances[indices[f]]*100, features[indices[f]]))


rfc.fit(train[features], target)
predictions = rfc.predict(test[features])

PassengerId =np.array(test["PassengerId"]).astype(int)
my_prediction = pd.DataFrame(predictions, PassengerId, columns = ["Survived"])

my_prediction.to_csv("submission3_1.csv", index_label = ["PassengerId"])

print("The end ...")

Accuracy: 85.970 (+/- 1.37) [RFC Cross Validation]
Accuracy: 95.960            [RFC full test]
1. feature 2 (10.304259) Age
2. feature 12 (9.658326) NameLength
3. feature 8 (9.546538) TitleCat
4. feature 1 (8.889116) Sex
5. feature 22 (7.680901) Ticket_perishing_women
6. feature 24 (6.822472) Surname_perishing_women
7. feature 4 (6.572493) male_adult
8. feature 3 (6.191833) female_adult
9. feature 23 (4.390043) Ticket_surviving_men
10. feature 10 (3.505881) Pclass
11. feature 9 (3.498047) Pclass
12. feature 20 (3.315192) Ticket_Members
13. feature 17 (2.959209) Fare
14. feature 14 (2.712229) CabinCat
15. feature 25 (2.496848) Surname_surviving_men
16. feature 19 (2.137385) Surname_Members
17. feature 18 (1.901887) Embarked
18. feature 13 (1.901883) CabinType
19. feature 15 (1.450315) SibSp
20. feature 11 (1.330757) Ticket_Id
21. feature 7 (0.816043) child
22. feature 16 (0.750141) Parch
23. feature 6 (0.587522) male_adult_old
24. feature 21 (0.489366) FamilySize
25. feature 5 (0.091315