# Titanic - Predicting Survival

## Introduction
The sinking of the Titanic on April 15, 1912 is one of the most notorious shipwrecks in history. Of the 2224 passengers and crew, 1502 died that night. Our goal in this project is to predict whether a passenger survived, based on a number of factors in the passenger data set.

## Import libraries and data

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # visualizing data
from sklearn.linear_model import LogisticRegression

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


In [2]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


## Naive logistic regression

For my first attempt I will use a fairly simple logistic regression model.

### Variable descriptions and analysis

PassengerId: A unique ID for identifying each passenger.

Survived: Whether the passenger survived the wreck. 0 = no, 1 = yes.

Pclass: The passenger's ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd.

Sex: Sex of the passenger.

Age: Age of the passenger in years.

SibSp: The number of siblings / spouses aboard the Titanic.

Parch: The number of parents / children aboard the Titanic.

Ticket: The ticket number of the passenger.

Fare: The fare paid by the passenger.

Cabin: The cabin number of the passenger.

Embarked: The port where the passenger boarded. C = Cherbourg, Q = Queenstown, S = Southampton.

First, let's look at basic info about the training and test sets.

In [4]:
print(train_data.describe())
train_data.info()  # info() prints its own report and returns None

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data colu

We can see that approximately 38% of the passengers in the training set survived. Another observation is that the minimum fare is zero, which may indicate missing fare data for some passengers, or a separate class of passengers such as crew or people who were given free passage. It may be worth looking at them separately.

There are also null values for Age, Embarked, and Cabin that we will have to deal with. Since it is unlikely that the embarkation location affects survival, we will not include Embarked in this model. The cabin, however, tells us where the passenger would have been located, so we want to fill it in with a reasonable value: we will replace the NaN values with "X" and reduce the rest to their first letter, which presumably groups them by deck. Missing Age values will be replaced with the average age in the data set. After preprocessing the training set, we will look at the test data to see if it has similar properties.

Additionally, we are only going to use the Pclass, Sex, Age, SibSp, Parch, Fare, and Cabin features. These can all easily be converted into numerical values.
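
The missing-value and zero-fare observations above can be checked directly. The following is a minimal sketch on a made-up four-row frame: `sample` and its values are hypothetical stand-ins for `train_data`, and only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for train_data; the values are made up.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 0.0, 7.925, 53.1],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Number of missing values in each column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Rows with a fare of exactly zero, kept aside for separate inspection.
zero_fare = sample[sample["Fare"] == 0]
print(len(zero_fare), "passenger(s) with a zero fare")
```

Running the same two expressions on `train_data` should show which columns need imputation and how many zero-fare passengers there are.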

In [5]:
# Preprocess the training set: drop Embarked, then fill NaN values in Age and Cabin.
# Drop the Embarked column
clean_train_data = train_data.drop("Embarked", axis = "columns", inplace = False)

# Replace NaN Cabins with "X"
# Replace each non-NaN value with its first character, the deck letter
clean_train_data.fillna({"Cabin":"X"}, inplace = True)
clean_train_data["Cabin"] = clean_train_data["Cabin"].str.get(0)

# Replace NaN with average of Age
mean = clean_train_data["Age"].mean()
clean_train_data.fillna({"Age": mean}, inplace = True)

# Convert text/bool to numerical
# First delete PassengerId, Name, and Ticket
clean_train_data.drop("PassengerId", axis = "columns", inplace = True)
clean_train_data.drop("Name", axis = "columns", inplace = True)
clean_train_data.drop("Ticket", axis = "columns", inplace = True)

# Now use get_dummies to get numerical values
clean_train_data = pd.get_dummies(clean_train_data)
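
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (the frame `toy` and its values are hypothetical; only the column names come from the data set). It shows how the Cabin values collapse to a deck letter and how `pd.get_dummies` turns the text columns into one indicator column per category.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame; only the column names match the real data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Cabin": [np.nan, "C85", "E46"],
})

# Missing cabins become "X"; known cabins are reduced to their first letter.
toy["Cabin"] = toy["Cabin"].fillna("X").str.get(0)

# One indicator column per category value, named <column>_<value>.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Because the indicator columns are derived from the values present in each frame, a category that appears in only one of the two data sets produces a column in only one of them.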

In [6]:
print(test_data.describe())
test_data.info()  # info() prints its own report and returns None

       PassengerId      Pclass         Age       SibSp       Parch        Fare
count   418.000000  418.000000  332.000000  418.000000  418.000000  417.000000
mean   1100.500000    2.265550   30.272590    0.447368    0.392344   35.627188
std     120.810458    0.841838   14.181209    0.896760    0.981429   55.907576
min     892.000000    1.000000    0.170000    0.000000    0.000000    0.000000
25%     996.250000    1.000000   21.000000    0.000000    0.000000    7.895800
50%    1100.500000    3.000000   27.000000    0.000000    0.000000   14.454200
75%    1204.750000    3.000000   39.000000    1.000000    0.000000   31.500000
max    1309.000000    3.000000   76.000000    8.000000    9.000000  512.329200
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name       

We can see there is also missing data in the test data set. In this case, Fare, Age, and Cabin are missing values. We will follow a similar procedure, replacing missing Fare values with the average fare. Age and Cabin will be treated the same as above.

In [7]:
# Preprocess the test set: drop Embarked, then fill NaN values in Age, Fare, and Cabin.
# Drop the Embarked column
clean_test_data = test_data.drop("Embarked", axis = "columns", inplace = False)

# Replace NaN Cabins with "X"
# Replace each non-NaN value with its first character, the deck letter
clean_test_data = clean_test_data.fillna({"Cabin":"X"})
clean_test_data["Cabin"] = clean_test_data["Cabin"].str.get(0)

# Replace NaN with average of Age
mean_age = clean_test_data["Age"].mean()
clean_test_data.fillna({"Age": mean_age}, inplace = True)

#Replace NaN with average of Fare
mean_fare = clean_test_data["Fare"].mean()
clean_test_data.fillna({"Fare" : mean_fare}, inplace = True)

# Convert text/bool to numerical
# First delete PassengerId, Name, and Ticket
clean_test_data.drop("PassengerId", axis = "columns", inplace = True)
clean_test_data.drop("Name", axis = "columns", inplace = True)
clean_test_data.drop("Ticket", axis = "columns", inplace = True)

# Now use get_dummies to get numerical values
clean_test_data = pd.get_dummies(clean_test_data)

# The test data has no Cabin_T column, so add it and fill it with False values
clean_test_data["Cabin_T"] = False

# Move Cabin_X back to the end so the column order matches the training data
col = clean_test_data.pop("Cabin_X")
clean_test_data["Cabin_X"] = col
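
The manual column fix above works for this data set, but it depends on knowing in advance which column is missing. As an alternative sketch (not what the notebook does), `DataFrame.reindex` can align the test columns to the training columns in one step. The frames `train_like` and `test_like` below are hypothetical and their values are made up.

```python
import pandas as pd

# Hypothetical encoded frames; the training frame has a Cabin_T column that
# the test frame lacks, mirroring the situation in the notebook.
train_like = pd.DataFrame({
    "Fare": [7.25, 71.28],
    "Cabin_C": [False, True],
    "Cabin_T": [True, False],
    "Cabin_X": [False, False],
})
test_like = pd.DataFrame({
    "Fare": [7.83, 7.0],
    "Cabin_C": [False, False],
    "Cabin_X": [True, True],
})

# Reorder to the training columns; columns absent from the test frame
# are created and filled with False.
aligned = test_like.reindex(columns=train_like.columns, fill_value=False)
print(list(aligned.columns))
```

In the notebook this would mean reindexing `clean_test_data` against the feature columns of the training frame, that is, its columns without Survived.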

Now we are ready to train the Logistic Regression model.

In [8]:
xtrain = clean_train_data.drop("Survived", axis = "columns", inplace = False)
ytrain = clean_train_data["Survived"]
model = LogisticRegression(random_state = 0, max_iter = 1000)
model.fit(xtrain, ytrain)

xtest = clean_test_data

# Create the output DataFrame
submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": model.predict(xtest)})
submission.to_csv("submission.csv", index = False)
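
The notebook writes a submission without estimating how well the model performs. One possible way to get a rough estimate before submitting is k-fold cross-validation on the training data. The sketch below is an assumption about how that could look, not part of the original workflow; it uses synthetic arrays `X` and `y` in place of `xtrain` and `ytrain`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for xtrain / ytrain; this is not the Titanic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Same model settings as in the notebook.
model = LogisticRegression(random_state=0, max_iter=1000)

# Five-fold cross-validated accuracy: one score per held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

With the real data, passing `xtrain` and `ytrain` instead of `X` and `y` would give an accuracy estimate for the features chosen above.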