# Project 3 Modeling



### Classification problem using the Titanic dataset

I am going to measure whether passenger class, being a single male/female, or travelling as part of a parent/child combination is the better indicator of survival. I will judge the quality of my model by its accuracy, together with precision and recall.
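
Accuracy, precision and recall are three different measures, so it helps to see them side by side. The following is a minimal sketch using `sklearn.metrics`; the labels in `y_true` and `y_pred` are made up for illustration and are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels, made up for illustration (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Accuracy: share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# Precision: share of predicted survivors who actually survived.
precision = precision_score(y_true, y_pred)
# Recall: share of actual survivors the model found.
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
```

On these toy labels the three numbers differ, which is why a single accuracy figure may hide how the model treats each class.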

##### Feature descriptions:

<p>pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)</p>
<p>survival        Survival
                (0 = No; 1 = Yes)</p>
<p>name            Name</p>
<p>sex             Sex</p>
<p>age             Age</p>
<p>sibsp           Number of Siblings/Spouses Aboard</p>
<p>parch           Number of Parents/Children Aboard</p>
<p>ticket          Ticket Number</p>
<p>fare            Passenger Fare</p>
<p>cabin           Cabin</p>
<p>embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)</p>
<p>home.dest       Home/Destination</p>

In [90]:
import pandas as pd

In [91]:
tanic = pd.read_csv('titanic.csv')

#### Data cleaning and feature manipulation



In [92]:
tanic.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [93]:
tanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [94]:
pd.crosstab(tanic['survived'], tanic['age'])

survived \ age,0.42,0.67,0.75,0.83,0.92,1.0,2.0,3.0,4.0,5.0,...,62.0,63.0,64.0,65.0,66.0,70.0,70.5,71.0,74.0,80.0
0,0,0,0,0,0,2,7,1,3,0,...,2,0,2,3,1,2,1,2,1,0
1,1,1,2,2,1,5,3,5,7,4,...,2,2,0,0,0,0,0,0,0,1


In [95]:
tanic.isnull().sum()

survived      0
pclass        0
name          0
sex           0
age         177
sibsp         0
parch         0
ticket        0
fare          0
cabin       687
embarked      2
dtype: int64

#### Creating train/test split

In [96]:
from sklearn.model_selection import train_test_split

In [97]:
y = tanic['survived']


In [99]:
# Drop the target column so it is not passed to the model as a feature.
X_train, X_test, y_train, y_test = train_test_split(tanic.drop(columns='survived'), y)
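
When splitting, the feature matrix should not contain the target column, and any engineered columns need to exist before the split is made. The following is a minimal sketch on a hypothetical toy frame (`df` and its values are made up). `stratify` and `random_state` are optional arguments added here as an assumption, to keep class proportions and make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic frame; the values are made up.
df = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "pclass":   [3, 1, 3, 1, 3, 2, 2, 3],
    "parch":    [0, 0, 0, 0, 0, 1, 2, 0],
})

features = df.drop(columns="survived")  # everything except the target
target = df["survived"]

# Hold out 25% of the rows; stratify keeps the survived/not-survived ratio.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, stratify=target, random_state=0
)

print(X_train.shape, X_test.shape)
```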

#### Data cleaning 

I decided not to use age as a feature because of its missing values: 177 of the 891 observations have no age, and I think replacing that many observations with the median age would throw off the results.
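
To make the concern about median imputation concrete, here is a small sketch on a hypothetical `age` series (the values are made up, not taken from `titanic.csv`). It measures the share of missing values and compares the spread before and after filling the gaps with the median.

```python
import pandas as pd

# Hypothetical ages with gaps; the values are made up for illustration.
age = pd.Series([22.0, 38.0, None, 35.0, None, 54.0, 2.0, None])

missing_share = age.isnull().mean()   # fraction of values that are missing
imputed = age.fillna(age.median())    # median imputation

std_before = age.std()      # missing values are skipped
std_after = imputed.std()   # filled values all sit at the median

print(missing_share, std_before, std_after)
```

Every filled value lands exactly on the median, so the more values are imputed, the more the distribution is pulled toward its centre.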

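Creating dummy variables for single male and single female. A single male is intended to mean sex = male and sibsp = 0 (no siblings or spouse aboard). Note that `get_dummies` on `sex` alone, as used below, does not check sibsp, so the resulting columns are effectively sex indicators.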


In [104]:
single_dummy = pd.get_dummies(tanic.sex, prefix='Single')

In [105]:
single_dummy.head()

Unnamed: 0,Single_female,Single_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
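
The dummies above are derived from `sex` alone, so row 0 (male, sibsp = 1) is flagged `Single_male = 1`. A sketch of the stated definition (sex = male and sibsp = 0) could combine both conditions in a boolean mask. The toy frame `df` below mirrors the `sex` and `sibsp` values of the five rows shown by `tanic.head()`. Extending the same rule to `Single_female` is an assumption.

```python
import pandas as pd

# sex and sibsp copied from the five rows shown by tanic.head().
df = pd.DataFrame({
    "sex":   ["male", "female", "female", "female", "male"],
    "sibsp": [1, 1, 0, 1, 0],
})

# Flag a passenger as single only when no sibling or spouse is aboard.
no_sibsp = df["sibsp"] == 0
df["Single_male"] = ((df["sex"] == "male") & no_sibsp).astype(int)
df["Single_female"] = ((df["sex"] == "female") & no_sibsp).astype(int)

print(df)
```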


Adding Single_male and Single_female to the tanic dataframe

In [106]:
tanic['Single_male'] = single_dummy['Single_male']

In [107]:
tanic['Single_female'] = single_dummy['Single_female']

In [108]:
tanic.head()

Unnamed: 0,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,Single_male,Single_female
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,1,0
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0,1
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0,1
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0,1
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,1,0


In [109]:
tanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,Single_male,Single_female
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,0.647587,0.352413
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,0.47799,0.47799
min,0.0,1.0,0.42,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104,0.0,0.0
50%,0.0,3.0,28.0,0.0,0.0,14.4542,1.0,0.0
75%,1.0,3.0,38.0,1.0,0.0,31.0,1.0,1.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,1.0,1.0


#### Running logistic regression


In [110]:
from sklearn.linear_model import LogisticRegression

In [111]:
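# Select the feature columns as a DataFrame so that x has shape (n_samples, n_features).
# A plain list of four Series is read by scikit-learn as 4 samples, which raises
# "ValueError: Found input variables with inconsistent numbers of samples: [4, 891]".
x = tanic[['pclass', 'parch', 'Single_male', 'Single_female']]
lr = LogisticRegression()
lr.fit(x, y)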
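
scikit-learn estimators expect a feature matrix of shape (n_samples, n_features). The sketch below, on a hypothetical toy frame with made-up values, contrasts a list of columns with a DataFrame selection, then fits a logistic regression on a training split and scores it on the held-out rows. The column choice, `test_size`, `stratify`, `random_state` and `zero_division=0` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the values are made up for illustration.
df = pd.DataFrame({
    "survived":    [0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    "pclass":      [3, 1, 3, 1, 3, 3, 2, 3, 1, 2, 3, 2],
    "parch":       [0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0],
    "Single_male": [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0],
})
feature_cols = ["pclass", "parch", "Single_male"]

# A list of columns becomes (n_features, n_samples): the wrong orientation.
as_list = np.asarray([df[c] for c in feature_cols])
# Selecting columns from the frame gives (n_samples, n_features).
X = df[feature_cols]
y = df["survived"]
print(as_list.shape, X.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score on the held-out rows only.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(accuracy, precision, recall)
```

With so few rows the scores carry no meaning. The sketch only shows the shapes and the order of the steps.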