### Purpose of challenge

For each passenger in the test set, use the trained model to predict whether or not they survived the sinking of the Titanic.

Each prediction must be a 0 or 1 value for the Survived variable.

Your score is the percentage of passengers you correctly predict. This is known as accuracy. 
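
As a small illustration of this scoring rule, the sketch below computes accuracy by hand. The arrays y_true and y_pred are made-up values, not taken from the competition data.

```python
import numpy as np

# Hypothetical labels for five passengers (1 = survived, 0 = did not survive).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy is the fraction of predictions that match the true labels.
accuracy = (y_true == y_pred).mean()
print(accuracy)  # 4 of 5 predictions match
```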

### Import Packages

In [1]:
#import packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
train = pd.read_csv('train.csv')

In [3]:
test = pd.read_csv('test.csv')

### Evaluate data

In [4]:
train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [5]:
test.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


In [6]:
train.shape

(891, 12)

In [7]:
test.shape

(418, 11)

In [8]:
# check for null values
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [9]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

#### Remove null values

The model cannot handle missing data, so the null values in both the train and test datasets have to be replaced with something else.

Here, every missing value is replaced with zero.

In [10]:
train.Age = train.Age.fillna(0)
test.Age = test.Age.fillna(0)

train.Cabin = train.Cabin.fillna(0)
test.Cabin = test.Cabin.fillna(0)

train.Embarked = train.Embarked.fillna(0)

test.Fare = test.Fare.fillna(0)
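
To make the effect of the fill visible, here is a minimal sketch on a made-up Age column. It shows the zero fill used in this notebook next to a median fill, which is only an alternative for comparison and is not what this notebook does. The Series toy_age and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing entry; the real values come from train.csv.
toy_age = pd.Series([22.0, np.nan, 26.0], name="Age")

# Approach used in this notebook: replace the missing value with 0.
zero_filled = toy_age.fillna(0)

# Alternative for comparison (not used here): replace it with the column median.
median_filled = toy_age.fillna(toy_age.median())

print(zero_filled.tolist())
print(median_filled.tolist())
```

A zero fill is simple, but it places passengers of unknown age at age 0, whereas the median keeps them near the middle of the observed range.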


Now let's verify that the missing values are gone.

In [11]:
train.isnull().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [12]:
test.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

In [13]:
# Store and drop the ID columns as they don't contribute to the model
train_ID = train.PassengerId
test_ID = test.PassengerId

train = train.drop(['PassengerId'], axis=1)
test = test.drop(['PassengerId'], axis=1)

In [14]:
# Separate the target column (Survived) from the features
y = train['Survived']

train = train.drop(['Survived'], axis=1)
 

### Dummy variables

Dummy variables are numeric stand-ins for categorical data. Each one takes the value 0 or 1 to indicate the absence or presence of a category. The model needs numeric inputs, so categorical columns have to be converted into dummy variables before they can be used as predictors. Now that we know why dummy variables are important, let's create them!
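
As a quick illustration of what this conversion produces, the sketch below applies pd.get_dummies to a tiny made-up frame. The frame toy and its values are hypothetical.

```python
import pandas as pd

# Made-up frame with one categorical column (Sex) and one numeric column (Pclass).
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 3]})

# get_dummies expands each categorical column into one indicator column per category;
# numeric columns such as Pclass are passed through unchanged.
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one of the Sex_female and Sex_male indicators set, which is how the category is expressed numerically.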

In [15]:
features = pd.concat([train, test], sort=False)

In [16]:
features.shape

(1309, 10)

The train and test data are combined because, if they were encoded separately, the number of dummy columns created for the train and test datasets could differ, which would cause problems when the model predicts on the test set.
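
The column mismatch can be reproduced on a small scale. In this sketch, the frames toy_train and toy_test are made up; the test frame is missing two of the categories that appear in the train frame.

```python
import pandas as pd

# Made-up data: the test frame only contains one of the three Embarked categories.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"]})

# Encoding separately gives a different number of columns for each frame.
separate_train = pd.get_dummies(toy_train)
separate_test = pd.get_dummies(toy_test)
print(separate_train.shape[1], separate_test.shape[1])

# Encoding the combined frame and splitting afterwards keeps the columns aligned.
combined = pd.get_dummies(pd.concat([toy_train, toy_test], sort=False)).reset_index(drop=True)
enc_train = combined.iloc[:len(toy_train)]
enc_test = combined.iloc[len(toy_train):]
print(enc_train.shape[1], enc_test.shape[1])
```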

In [17]:
final_features = pd.get_dummies(features).reset_index(drop=True)

In [18]:
# Split data for the model
split_train = final_features.iloc[:len(train)]
split_test = final_features.iloc[len(train):]

### Build the model

In [19]:
X = split_train

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [21]:
model = LogisticRegression()

In [22]:
model.fit(X_train, y_train) 



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [23]:
predictions = model.predict(X_test)

In [24]:
from sklearn.metrics import accuracy_score, f1_score
# Macro-averaged F1 on the hold-out split (the competition itself is scored on accuracy)
f1_score(y_test, predictions, average='macro')

0.8294734075753096

In [25]:
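# Predict on the competition test set; a new name avoids overwriting the test DataFrame
test_predictions = model.predict(split_test)

In [26]:
output = pd.DataFrame(data={"PassengerId": test_ID, "Survived": test_predictions})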

In [27]:
output.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
