# Model: Logistic Regression with Age

https://www.kaggle.com/c/titanic/overview

The earlier model, model__logreg, did not include age as a feature. This model does. Missing (NaN) ages are replaced with the mean of the known passenger ages.

**Initialization**

In [1]:
%run init.ipynb

In [2]:
from data.data import ExtractData
from models import predict_model as pm
from zeetle.data import eda

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 

RANDOM_STATE = 42

## Extract Clean Data

**Separate data into X (features) and y (label)**

In [24]:
data = ExtractData('../data/raw/train.csv', drop_columns=['cabin', 'name', 'ticket'])
Xy = data.Xy

Xy.age = Xy.age.fillna(value=Xy.age.mean())

**Verify that age has no NaN**

In [25]:
Xy[Xy.age.isna()]

| passengerid | survived | pclass | sex | age | sibsp | parch | fare | embarked |
|---|---|---|---|---|---|---|---|---|
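
The imputation and the check above can be reproduced on a small frame. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the Titanic data. It also shows a scalar count of missing values, which may be easier to read than an empty frame.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the Titanic set).
toy = pd.DataFrame({"age": [22.0, np.nan, 30.0, np.nan, 38.0]})

# Series.mean() skips NaN, so this is the mean of the known ages only.
mean_age = toy["age"].mean()
toy["age"] = toy["age"].fillna(mean_age)

# Scalar check: number of rows where age is still missing.
n_missing = int(toy["age"].isna().sum())
print(mean_age, n_missing)
```

A count of zero here plays the same role as the empty result shown above.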


## Encode Categorical Columns

In [11]:
Xy_encoded = pd.get_dummies(Xy, columns=['pclass', 'sex', 'embarked'], drop_first=True)
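
To illustrate what `drop_first=True` does, here is a small sketch on made-up rows (the `toy` frame is hypothetical). Each encoded column is replaced by indicator columns, and the first category in sorted order is dropped so that it becomes the implicit baseline.

```python
import pandas as pd

# Toy rows (hypothetical values) with two categorical columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
    "fare": [7.25, 71.28, 7.75],
})

# 'female' and 'C' sort first, so they are dropped and act as baselines.
encoded = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True)
print(list(encoded.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for linear models such as logistic regression.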

## Train Test Split Data

In [12]:
X = Xy_encoded.drop('survived', axis=1)
y = Xy_encoded['survived']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

In [14]:
print(f'Number of samples in training data = {len(X_train)}')
print(f'Number of samples in test data = {len(X_test)}')

Number of samples in training data = 569
Number of samples in test data = 143
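
The two counts can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test size up to the next whole row and gives the remaining rows to the training set. The total of 712 is simply the sum of the two printed counts.

```python
import math

n_samples = 712   # 569 + 143, the two counts printed above
test_size = 0.2

# Assumed rounding rule: the test set gets ceil(test_size * n_samples) rows.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```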


## Logistic Regression with Age

In [17]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train) 

y_pred = logreg.predict(X_test)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

## Calculate Metrics

In [18]:
Xy_test = pm.concat_to_create_xy_test(X_test, y_test, y_pred)
metrics = pm.calc_metrics(Xy_test)

metrics

{'log_loss': 6.762876477199374, 'accuracy': 0.8041958041958042}
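
A log loss near 6.76 is much higher than a classifier with about 80% accuracy would normally score on predicted probabilities. One possible explanation, offered as an assumption because the internals of `pm.calc_metrics` are not shown, is that the loss was computed from the hard 0/1 predictions in `y_pred`. The sketch below checks that arithmetic. The count of 115 correct predictions is inferred from the reported accuracy and the 143 test rows, and the clipping value `1e-15` is assumed.

```python
import math

n_test = 143
n_correct = 115            # inferred: 115 / 143 = 0.8041958...
n_wrong = n_test - n_correct
eps = 1e-15                # assumed clipping value for probabilities of 0 or 1

accuracy = n_correct / n_test

# With hard labels, each wrong prediction costs about -log(eps)
# and each correct prediction costs about 0.
hard_label_log_loss = n_wrong * -math.log(eps) / n_test

print(round(accuracy, 4), round(hard_label_log_loss, 2))
```

If a probability-based log loss is wanted instead, one option is to pass `logreg.predict_proba(X_test)[:, 1]` to `sklearn.metrics.log_loss` rather than the hard predictions.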

## Drill Down

In [58]:
Xy2 = Xy.join(Xy_test[['survived_pred', 'is_prediction_correct']], how='right')

In [66]:
Xy2.groupby(['sex'])[['survived','survived_pred']].mean()

| sex | survived | survived_pred |
|---|---|---|
| female | 0.764706 | 0.843137 |
| male | 0.26087 | 0.065217 |
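
The same drill-down can be extended to show accuracy per group. This is a sketch on made-up rows: the `toy` frame is hypothetical, and `is_prediction_correct` is assumed to be a boolean column that is true when the prediction matches the label.

```python
import pandas as pd

# Toy rows (hypothetical values) mimicking the joined test frame.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "survived": [1, 0, 1, 0, 0],
    "survived_pred": [1, 1, 0, 0, 0],
})

# Assumed meaning of the column: prediction equals the true label.
toy["is_prediction_correct"] = toy["survived"] == toy["survived_pred"]

# Mean of a 0/1 or boolean column is a rate: actual survival,
# predicted survival, and accuracy for each sex.
rates = toy.groupby("sex")[["survived", "survived_pred", "is_prediction_correct"]].mean()
print(rates)
```

Comparing the `survived` and `survived_pred` rates within a group shows whether the model predicts survival more or less often than it actually occurred for that group.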
