### Data exploration, feature engineering и prediction с логистична регресия и Random forest classifier.

#### Разгледан dataset от Kaggle за [Titanic](https://www.kaggle.com/c/titanic)

Нека първо добавим нужните библиотеки и други неща.

In [1]:
import sys

import sklearn
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import (LogisticRegression, Ridge, Lasso)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (cross_val_score, GridSearchCV)

%matplotlib inline

warnings.filterwarnings('ignore')

pd.options.display.max_rows = 20

Нека да заредим и видим тестовите данни.

In [2]:
data = pd.read_csv('../data/titanic/train.csv', index_col='PassengerId')
data

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


Да си направим копие на данните, с което ще работим

In [3]:
data_trans = data.copy()

Да си направим функция, която добавят "Титла" извлечена от името на всеки човек.

In [4]:
def add_title(df):
    df['Title'] = df['Name'].str.extract('([A-Za-z]+)\.', expand=False)

In [5]:
add_title(data_trans)

In [6]:
data_trans['Title'].value_counts()

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Col           2
Major         2
Sir           1
Ms            1
Lady          1
Mme           1
Jonkheer      1
Don           1
Capt          1
Countess      1
Name: Title, dtype: int64

Нужна ни е и функция, която поправя някои грешки и групира рядко срещаните титли е обща група

In [7]:
major_titles = ['Mr', 'Mrs', 'Miss', 'Master']

def edit_titles(df):
    df.loc[df.Title == 'Mlle', 'Title'] = 'Miss'
    df.loc[df.Title == 'Mme', 'Title']  = 'Mrs'
    df.loc[df.Title == 'Ms', 'Title']   = 'Miss'
    df.loc[~df.Title.isin(major_titles), 'Title'] = 'Unknown'
    
major_titles.append('Unknown')

In [8]:
edit_titles(data_trans)

In [9]:
data_trans['Title'].value_counts()

Mr         517
Miss       185
Mrs        126
Master      40
Unknown     23
Name: Title, dtype: int64

Изглежда доста по-добре.

Нека попълним липсващите възрасти със средното аритметично за всяка група.

In [10]:
def fill_age(df):
    for title in major_titles:
        avg_age = df[df.Title == title]['Age'].mean()
        df.loc[(df.Title == title) & (df.Age.isnull()), 'Age'] = avg_age
        

In [11]:
fill_age(data_trans)

Нека попълним липсващите стойности в колоната "Embarked" с "Q". Имаме само 2 такива стойности, така че няма голямо значение какво ще изберем.

In [12]:
data_trans['Embarked'].fillna('Q', inplace=True);

Нека проверим дали сме попълнили всички липсващи данни, освен "Cabin".

In [13]:
data_trans.isnull().sum().sort_values()

Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Ticket        0
Fare          0
Embarked      0
Title         0
Cabin       687
dtype: int64

Нека видим средните цени на билетите според класата на пътуване.

In [14]:
print(data_trans[data_trans.Pclass == 1]['Fare'].mean())
print(data_trans[data_trans.Pclass == 2]['Fare'].mean())
print(data_trans[data_trans.Pclass == 3]['Fare'].mean())

84.1546875
20.6621831522
13.6755501018


Нека видим кой е пътувал най-скъпо в 3-та класа.

In [15]:
data_trans[data_trans.Pclass == 3]['Fare'].sort_values(ascending=False).head(12)

PassengerId
793    69.5500
202    69.5500
847    69.5500
864    69.5500
181    69.5500
160    69.5500
325    69.5500
693    56.4958
75     56.4958
839    56.4958
170    56.4958
827    56.4958
Name: Fare, dtype: float64

Доста високи цени за 3-та класа. Също така се повтарят.

In [16]:
data_trans[data_trans.Fare == 69.5500]

Unnamed: 0_level_0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
160,0,3,"Sage, Master. Thomas Henry",male,4.574167,8,2,CA. 2343,69.55,,S,Master
181,0,3,"Sage, Miss. Constance Gladys",female,21.845638,8,2,CA. 2343,69.55,,S,Miss
202,0,3,"Sage, Mr. Frederick",male,32.36809,8,2,CA. 2343,69.55,,S,Mr
325,0,3,"Sage, Mr. George John Jr",male,32.36809,8,2,CA. 2343,69.55,,S,Mr
793,0,3,"Sage, Miss. Stella Anna",female,21.845638,8,2,CA. 2343,69.55,,S,Miss
847,0,3,"Sage, Mr. Douglas Bullen",male,32.36809,8,2,CA. 2343,69.55,,S,Mr
864,0,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,21.845638,8,2,CA. 2343,69.55,,S,Miss


Изглежда, че цели семейства са се водили на един билет. Може би трябва да разделим цената на бройката?

Нека видим колко уникални билети имаме.

In [17]:
len(data_trans['Ticket'].unique())

681

Нека разделим цената на билета на броя пътували с него.

In [18]:
def adjust_tickets(df):
    tickets = df['Ticket'].unique()
    for ticket in tickets:
        ticket_rows = df[df['Ticket'] == ticket]
        same_tickes_num = len(ticket_rows)
        if (same_tickes_num > 1):
            fare = ticket_rows['Fare'].mean()
            new_fare = fare / same_tickes_num
            df.loc[df['Ticket'] == ticket, 'Fare'] = new_fare

In [19]:
adjust_tickets(data_trans)

In [20]:
print(data_trans[data_trans.Pclass == 1]['Fare'].mean())
print(data_trans[data_trans.Pclass == 2]['Fare'].mean())
print(data_trans[data_trans.Pclass == 3]['Fare'].mean())

43.6503472222
13.3225994565
8.08585692464


Сега цените изглеждат повече като за 3-та класа.

In [21]:
data_trans[data_trans.Pclass == 3]['Fare'].sort_values(ascending=False).head(12)

PassengerId
509    22.5250
185    22.0250
49     21.6792
491    19.9667
452    19.9667
19     18.0000
560    17.4000
744    16.1000
161    16.1000
348    16.1000
606    15.5500
719    15.5000
Name: Fare, dtype: float64

Създаваме "FamilySize" колона на базата на "SibSp" и "Parch"

In [22]:
data_trans['FamilySize'] = data_trans['SibSp'] + data_trans['Parch'] + 1

Нека one-hot-нем титлите.

In [23]:
def one_hot_title(df):
    for title in major_titles:
        df['Is' + title] = (df.Title == title).astype(float)

In [24]:
one_hot_title(data_trans)

Нека one-hot-нем място на заминаване.

In [25]:
def one_hot_embark(df):
    for city in ['S', 'C', 'Q']:
        df['Embarked' + city] = (df.Embarked == city).astype(float)

In [26]:
one_hot_embark(data_trans)

Същото правим и за класата.

In [27]:
def one_hot_class(df):
    for cls in [1, 2, 3]:
        df['Class' + str(cls)] = (df.Pclass == cls).astype(float)

In [28]:
one_hot_class(data_trans)

Ред е на пола и размера на семейството.

In [29]:
def one_hot_sex(df):
    for sex in ['male', 'female']:
        df['Is' + sex.title()] = (df.Sex == sex).astype(float)

In [30]:
one_hot_sex(data_trans)

In [31]:
def one_hot_family(df):
    df['Alone'] = (df.FamilySize == 1).astype(float)
    df['SmallFamily'] = ((df.FamilySize >= 2) & (df.FamilySize < 5)).astype(float)
    df['BigFamily'] = (df.FamilySize >= 5).astype(float)

In [32]:
one_hot_family(data_trans)

Нека видим как би било добре да си разделим възрастите по категории.

In [33]:
t = data_trans.copy()
t['Age_Size']=pd.qcut(t['Age'],5)
t.groupby(['Age_Size'])['Survived'].mean().to_frame().style.background_gradient(cmap='summer_r')

Unnamed: 0_level_0,Survived
Age_Size,Unnamed: 1_level_1
"(0.419, 20.0]",0.459016
"(20.0, 26.0]",0.397727
"(26.0, 32.368]",0.272358
"(32.368, 38.0]",0.509259
"(38.0, 80.0]",0.370787


In [34]:
del t

In [35]:
def one_hot_age(df):
    df['Child'] = (df.Age <= 20).astype(float)
    df['YoundAdult'] = ((df.Age > 20) & (df.FamilySize <= 26)).astype(float)
    df['Adult1'] = ((df.FamilySize > 26) & (df.FamilySize <= 32.3)).astype(float)
    df['Adult2'] = ((df.FamilySize > 32.3) & (df.FamilySize <= 38)).astype(float)
    df['Adult3'] = (df.Age > 38).astype(float)

In [36]:
one_hot_age(data_trans)

Няколко неща, които дадоха по-лош резултат.

In [37]:
def one_hot_fare(df):
    df['FareCat1'] = (df.Fare <= 7.73).astype(float)
    df['FareCat2'] = ((df.Fare > 7.73) & (df.Fare <= 8.05)).astype(float)
    df['FareCat3'] = ((df.Fare > 8.05) & (df.Fare <= 11.72)).astype(float)
    df['FareCat4'] = ((df.Fare > 11.73) & (df.Fare <= 26.55)).astype(float)
    df['FareCat5'] = (df.Fare > 26.55).astype(float)

In [38]:
# one_hot_fare(data_trans)

In [39]:
def name_len(df):
    df['NameLen'] = df['Name'].str.len()

In [40]:
# name_len(data_trans)

In [41]:
def name_binning(df):
    df['ShortName'] = (df.NameLen <= 25).astype(float)
    df['LongName'] = (df.NameLen > 25).astype(float)

In [42]:
# name_binning(data_trans)

Нека видим с какви колони останахме и да махнем всички излишни.

In [43]:
data_trans.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'Title', 'FamilySize', 'IsMr', 'IsMrs',
       'IsMiss', 'IsMaster', 'IsUnknown', 'EmbarkedS', 'EmbarkedC',
       'EmbarkedQ', 'Class1', 'Class2', 'Class3', 'IsMale', 'IsFemale',
       'Alone', 'SmallFamily', 'BigFamily', 'Child', 'YoundAdult', 'Adult1',
       'Adult2', 'Adult3'],
      dtype='object')

In [44]:
data_trans = data_trans.drop(['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'Title', 'Age'],1)

In [45]:
def data_and_target(df):
    X = df
    X = X.drop('Survived',1)
    y = df['Survived']
    
    print('X shape: {}, y shape {}'.format(X.shape, y.shape))
    
    return (X, y)

Нека да тренираме.

In [46]:
(X, y) = data_and_target(data_trans)

X shape: (891, 23), y shape (891,)


In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12345, stratify=y)

y_train.mean(), y_test.mean()

(0.38323353293413176, 0.38565022421524664)

In [48]:
model = LogisticRegression().fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score: ", model.score(X_test, y_test))

train score: 0.815868263473
test score:  0.878923766816


Нека пробваме гора, като я оптимизираме малко с GridSearch.

In [49]:
forest = RandomForestClassifier(random_state = 0).fit(X_train, y_train)
print("train score:", forest.score(X_train, y_train))
print("test score: ", forest.score(X_test, y_test))

train score: 0.929640718563
test score:  0.829596412556


In [50]:
search = GridSearchCV(forest, {'n_estimators': [10, 30, 50, 70, 100, 200],
                              'max_depth': [2, 4, 6, 8, 10, 12, 15],
                              #'random_state': [0,1,123, 1234, 12345]
                              })
search.fit(X, y)

pd.DataFrame(search.cv_results_)[['rank_test_score', 'mean_test_score', 'params']].sort_values(by='rank_test_score').head(10)

Unnamed: 0,rank_test_score,mean_test_score,params
9,1,0.833895,"{'n_estimators': 70, 'max_depth': 4}"
8,1,0.833895,"{'n_estimators': 50, 'max_depth': 4}"
7,3,0.832772,"{'n_estimators': 30, 'max_depth': 4}"
10,3,0.832772,"{'n_estimators': 100, 'max_depth': 4}"
11,5,0.830527,"{'n_estimators': 200, 'max_depth': 4}"
6,6,0.823793,"{'n_estimators': 10, 'max_depth': 4}"
16,7,0.820426,"{'n_estimators': 100, 'max_depth': 6}"
12,8,0.819304,"{'n_estimators': 10, 'max_depth': 6}"
17,9,0.815937,"{'n_estimators': 200, 'max_depth': 6}"
15,10,0.81257,"{'n_estimators': 70, 'max_depth': 6}"


In [51]:
forest = RandomForestClassifier(max_depth=4, n_estimators=50, random_state = 0).fit(X_train, y_train)
print("train score:", forest.score(X_train, y_train))
print("test score: ", forest.score(X_test, y_test))

train score: 0.818862275449
test score:  0.887892376682


Време е да приложим всичко това върху тестовото множество.

In [52]:
test = pd.read_csv('../data/titanic/test.csv', index_col=['PassengerId'])

add_title(test)
edit_titles(test)
fill_age(test)
test['Embarked'].fillna('Q', inplace=True)
test['FamilySize'] = test['SibSp'] + test['Parch'] + 1
test['Fare'] = test['Fare'].fillna(test['Fare'].mean())
adjust_tickets(test)
one_hot_title(test)
one_hot_embark(test)
one_hot_class(test)
one_hot_sex(test)
one_hot_family(test)
one_hot_age(test)
# one_hot_fare(test)
# name_len(test)

test = test.drop(['Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked', 'Title', 'Age'],1)

Да запишем резултата и да submit-нем в Kaggle.

In [53]:
predictions = forest.predict(test)
frame = pd.DataFrame({
    'PassengerId': pd.read_csv('../data/titanic/test.csv').PassengerId,
    'Survived': predictions
})
frame = frame.set_index('PassengerId')
frame.to_csv('../data/titanic/prediction.csv')
frame.head()

Unnamed: 0_level_0,Survived
PassengerId,Unnamed: 1_level_1
892,0
893,1
894,0
895,0
896,1


Максимален резултат 0.80382 (Същият резултат имат всички в интервала около 700-1200 места)

![Класация](img/Titanic_leaderboard.png)