### Kaggle Titanic Logistic Regression Practice
##### Kaggle offers practice datasets, including one for predicting whether a passenger survived the Titanic. Our submission correctly predicted survival about 77% of the time.
##### Link to problem on Kaggle: https://www.kaggle.com/c/titanic
##### Leaned on a tutorial from a YouTube video: https://www.youtube.com/watch?v=pUSi5xexT4Q&t=917s

##### Import libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

##### Read in CSV files

In [32]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
gender = pd.read_csv('gender_submission.csv')

##### Data Preview

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


##### Find N/A values. These will need to be imputed.

In [4]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### There are three columns that have null values: Age, Cabin, and Embarked.
##### Let's impute values into each of these columns, starting with Age

In [5]:
def age_impute(data):
    
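    # Impute missing Age data with the column's median.
    # Assigning the result back avoids the chained `inplace=True` call,
    # which newer pandas versions warn about and may silently ignore.
    data['Age'] = data['Age'].fillna(data['Age'].median())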
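
As a quick check on what median imputation does, here is a minimal sketch on a made-up five-row frame. The `toy` frame and its values are assumptions for illustration only, not Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five ages are missing.
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})

# median() skips NaN values, so it is computed from 22, 38 and 26.
median_age = toy['Age'].median()

# Fill the gaps and assign the result back to the column.
toy['Age'] = toy['Age'].fillna(median_age)

print(median_age)
print(toy['Age'].tolist())
```

Because the median ignores the missing entries, the filled values sit in the middle of the observed ages and are not pulled around by a few very old or very young passengers.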

##### Next, we'll impute the Embarked field with a dummy value. Embarked is the port where a passenger boarded. Let's fill in NULLs with a 'U' for Unknown

In [6]:
def embarked_inpute(data):
    
    # Fill missing ports with 'U' (Unknown) and assign the result back
    data['Embarked'] = data['Embarked'].fillna('U')
    

##### 687 of the 891 records for the Cabin column are NULL. I think we're better off dropping that column than trying to impute.
##### We'll also drop some other columns like Ticket and Name
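
One way to make the drop decision explicit is to look at the share of missing values per column instead of the raw counts (687 of 891 is roughly 77% for Cabin). The sketch below uses a made-up frame, and the `drop_threshold` cutoff is a hypothetical choice, not something taken from the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the training frame.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# isna() gives booleans; their mean is the fraction of missing rows.
null_share = toy.isna().mean()

# Assumed cutoff: flag columns that are more than half empty.
drop_threshold = 0.5
to_drop = null_share[null_share > drop_threshold].index.tolist()

print(null_share)
print(to_drop)
```

Columns under the cutoff, such as Age here, remain candidates for imputation.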

In [7]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
columns = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp',
       'Parch', 'Fare', 'Embarked']

In [9]:
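# .copy() gives an independent frame, so the imputation functions below
# modify it without triggering pandas' SettingWithCopyWarning
train = train[columns].copy()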

#### In this version, we're going to create dummy variables for the categorical columns rather than mapping them to numbers

In [10]:
embarked_inpute(train)
age_impute(train)

In [11]:
# This function will convert categorical values to flags!
train = pd.get_dummies(train)
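
To see what `pd.get_dummies` does to the categorical columns, here is a small sketch on a made-up two-row frame. The `dtype=int` argument is an addition that is not in the notebook: recent pandas versions return boolean flags by default, and passing `dtype=int` gives 0/1 integers.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two categorical columns.
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
})

# String columns are expanded into one flag column per category;
# numeric columns such as Pclass pass through unchanged.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Only categories that actually occur in the data get a column, which is why a frame with no 'U' values will not produce `Embarked_U`.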

##### Looking really good! Categorical fields are converted to flag columns, and NULL values are imputed

In [12]:
train.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Embarked_U
0,0,3,22.0,1,0,7.25,0,1,0,0,1,0
1,1,1,38.0,1,0,71.2833,1,0,1,0,0,0
2,1,3,26.0,0,0,7.925,1,0,0,0,1,0
3,1,1,35.0,1,0,53.1,1,0,0,0,1,0
4,0,3,35.0,0,0,8.05,0,1,0,0,1,0


##### Begin training the model!

In [13]:
# Read in libraries

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [14]:
# y is the value we're predicting
# X holds all the remaining fields. This is what we use to predict y

y = train['Survived']
X = train.drop('Survived', axis = 1) 

In [15]:
# Fit the logistic regression on the full training set
clf = LogisticRegression(random_state = 0, max_iter = 1000).fit(X, y)

In [16]:
predictions = clf.predict(X)
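
Predictions made on the same rows the model was fit on tend to look better than they will on unseen passengers. A hold-out split is one way to get a local accuracy estimate before uploading to Kaggle. This is a minimal sketch on synthetic data from `make_classification`, not the Titanic frame. The names `X_demo`, `y_demo` and `demo_clf` are hypothetical, and the 80/20 split is an assumed choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the Survived label.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold back 20% of the rows; stratify keeps the class balance similar
# in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fit on the 80% only, then score on the rows the model has not seen.
demo_clf = LogisticRegression(random_state=0, max_iter=1000).fit(X_fit, y_fit)
val_accuracy = demo_clf.score(X_val, y_val)

print(f"hold-out accuracy: {val_accuracy:.3f}")
```

With the notebook's own data, the same pattern would apply to `X` and `y` in place of the synthetic arrays.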

#### Clean test df:

In [17]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [18]:
# .copy() gives an independent frame for the imputation steps below
test = test[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()

In [19]:
test.isna().sum()

Pclass       0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Embarked     0
dtype: int64

In [20]:
# Impute the single missing Fare value in the test df with the median
test['Fare'] = test['Fare'].fillna(test['Fare'].median())

# Run the same functions we used earlier:
age_impute(test)
embarked_inpute(test)

In [21]:
test = pd.get_dummies(test)

In [22]:
test.isna().sum()

Pclass        0
Age           0
SibSp         0
Parch         0
Fare          0
Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
dtype: int64

In [23]:
# Since there are no NULL Embarked values in the test dataset,
# we'll have to add an Embarked_U column of 0s to match the train df
test['Embarked_U'] = 0
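
Adding the missing column by hand works here because only one column differs and it belongs at the end. A more general alternative, sketched below on made-up data, is to reindex the test frame against the training columns so that both the column set and the column order match. The `train_cols` list and `toy_test` frame are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical column order the model was trained on.
train_cols = ['Pclass', 'Embarked_C', 'Embarked_S', 'Embarked_U']

# Hypothetical test frame: no Embarked_U, and columns in a different order.
toy_test = pd.DataFrame({
    'Pclass': [3, 1],
    'Embarked_S': [1, 0],
    'Embarked_C': [0, 1],
})

# reindex adds any missing columns (filled with 0) and reorders the rest.
aligned = toy_test.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

In the notebook's terms this would be something like `test.reindex(columns=X.columns, fill_value=0)`, assuming `X` still holds the training features.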

##### Test DF is looking good

In [24]:
train.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Embarked_U
0,0,3,22.0,1,0,7.25,0,1,0,0,1,0
1,1,1,38.0,1,0,71.2833,1,0,1,0,0,0
2,1,3,26.0,0,0,7.925,1,0,0,0,1,0
3,1,1,35.0,1,0,53.1,1,0,0,0,1,0
4,0,3,35.0,0,0,8.05,0,1,0,0,1,0


##### Apply model fit to our test df to see whether we think each person will survive:

In [25]:
# Get survival predictions
submission_preds = clf.predict(test)
# Convert to a series
submission_preds = pd.Series(submission_preds)

##### Create a submission CSV with two columns: PassengerId and whether we think that person will survive:

In [26]:
test_2 = pd.read_csv('test.csv')

In [27]:
PassengerId = test_2['PassengerId']

In [28]:
df = pd.concat([PassengerId, submission_preds], axis = 1)

In [29]:
df.rename(columns = {0: 'Survived'}, inplace = True)

In [30]:
df.to_csv('Titanic_Predictions_2.csv', index = False)
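
The concat-and-rename steps above can also be written as a single `DataFrame` constructor call, which names both columns up front. This is a sketch with made-up IDs and predictions. `ids` and `preds` are hypothetical stand-ins for `test_2['PassengerId']` and the output of `clf.predict(test)`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real IDs and model predictions.
ids = pd.Series([892, 893, 894], name='PassengerId')
preds = np.array([0, 1, 0])

# Build the two-column submission frame directly.
submission = pd.DataFrame({'PassengerId': ids, 'Survived': preds})

# Without a path, to_csv returns the CSV text, which is handy for a
# quick look before writing the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a filename to `to_csv` instead writes the same content to disk, as in the cell above.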

##### When uploaded to Kaggle, this submission correctly predicted whether each passenger would survive 76.794% of the time. As of 5/5/2022, this puts me in 9,432nd place out of around 15,000 participants