
# Background
This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works. https://www.kaggle.com/competitions/titanic/overview

# Fastai Course

In lesson 3 of the fast.ai course (https://course.fast.ai/Lessons/lesson3.html), Jeremy walks us through linear regression, matrix multiplication, and a simple neural network, all built in Microsoft Excel. He then challenges us to play with it or even recreate it in Python. This is my attempt to recreate it in Python!

# The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).





# Import Libraries

In [109]:
import pandas as pd
import numpy as np

# Read in Titanic Dataset

In [110]:
df = pd.read_csv("train.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


# Feature Engineering
Features such as Name, PassengerId, Ticket, and Cabin are unlikely to help predict survival, so we drop them. A solid argument can be made that Cabin might prove useful in a later iteration; this is worth keeping in mind when tweaking the model.

In [111]:
# Remove the Name, Ticket, Cabin, and PassengerId features in one call
df = df.drop(columns=["Name", "Cabin", "Ticket", "PassengerId"])

In [112]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


Fare and Age span much wider ranges than the other features (Fare runs from 0 to about 512, Age from 0.42 to 80), so they should be rescaled; otherwise they would dominate the model. We take the log of Fare and divide Age by its maximum to bring both onto a comparable scale.

In [113]:
df["Fare"].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [114]:
df["Age"].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [115]:
# Note: the minimum Fare is 0 and log10(0) is -inf, so this line
# triggers a divide-by-zero RuntimeWarning and leaves -inf in logFare
df["logFare"] = np.log10(df["Fare"])


In [116]:
df["logFare"].describe()

count    891.000000
mean           -inf
std             NaN
min            -inf
25%        0.898198
50%        1.159994
75%        1.491362
max        2.709549
Name: logFare, dtype: float64
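
The -inf mean and NaN std above come from passengers whose fare is 0, because log10(0) is -inf. One common workaround is to add 1 before taking the log. The sketch below shows the idea on a toy Series; the values and the names `fares`, `raw_log`, and `shifted_log` are illustrative assumptions, not part of this notebook, and the shifted values would differ from the logFare numbers shown in the tables here.

```python
import numpy as np
import pandas as pd

# Toy fares (illustrative), including a zero fare like the dataset minimum.
fares = pd.Series([0.0, 7.25, 71.2833, 512.3292])

# Plain log10: the zero fare becomes -inf (warning silenced for the demo).
with np.errstate(divide="ignore"):
    raw_log = np.log10(fares)

# Shifting by 1 maps a fare of 0 to exactly 0 and keeps every value finite.
shifted_log = np.log10(fares + 1)

print(np.isinf(raw_log).sum())          # how many -inf entries the plain log has
print(np.isfinite(shifted_log).all())   # whether the shifted log is all finite
```

This matters downstream because dropna() only removes NaN values; rows holding -inf would stay in the frame.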

In [117]:
df["normAge"] = df["Age"] / df["Age"].max()  # the max age is 80

In [118]:
df["normAge"].describe()

count    714.000000
mean       0.371239
std        0.181581
min        0.005250
25%        0.251563
50%        0.350000
75%        0.475000
max        1.000000
Name: normAge, dtype: float64

Both features now span a much smaller range: normAge lies between 0 and 1, and the finite logFare values top out at about 2.7.

In [119]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,logFare,normAge
0,0,3,male,22.0,1,0,7.25,S,0.860338,0.275
1,1,1,female,38.0,1,0,71.2833,C,1.852988,0.475
2,1,3,female,26.0,0,0,7.925,S,0.898999,0.325
3,1,1,female,35.0,1,0,53.1,S,1.725095,0.4375
4,0,3,male,35.0,0,0,8.05,S,0.905796,0.4375


In [120]:
# The raw columns are no longer needed once the scaled versions exist
df = df.drop(columns=["Age", "Fare"])
df.head()

Unnamed: 0,Survived,Pclass,Sex,SibSp,Parch,Embarked,logFare,normAge
0,0,3,male,1,0,S,0.860338,0.275
1,1,1,female,1,0,C,1.852988,0.475
2,1,3,female,0,0,S,0.898999,0.325
3,1,1,female,1,0,S,1.725095,0.4375
4,0,3,male,0,0,S,0.905796,0.4375


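Create boolean features for ticket class. Two columns are enough for three classes: third class is the baseline, encoded as both columns being 0.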

In [121]:
def pclass1(x):
  # 1 if the passenger travelled first class, otherwise 0
  return int(x == 1)

def pclass2(x):
  # 1 if the passenger travelled second class, otherwise 0
  return int(x == 2)

In [122]:
df["Pclass_1"] = df["Pclass"].apply(pclass1)

In [123]:
df["Pclass_2"] = df["Pclass"].apply(pclass2)

In [124]:
df = df.drop("Pclass", axis=1)

In [125]:
df.head()

Unnamed: 0,Survived,Sex,SibSp,Parch,Embarked,logFare,normAge,Pclass_1,Pclass_2
0,0,male,1,0,S,0.860338,0.275,0,0
1,1,female,1,0,C,1.852988,0.475,1,0
2,1,female,0,0,S,0.898999,0.325,0,0
3,1,female,1,0,S,1.725095,0.4375,1,0
4,0,male,0,0,S,0.905796,0.4375,0,0
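
The same encoding can also be produced with pandas' built-in `get_dummies`, which may be handy when a column has many categories. This is a minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration), not a replacement for the cells above.

```python
import pandas as pd

# Toy ticket classes (illustrative values only).
toy = pd.DataFrame({"Pclass": [3, 1, 2, 1]})

# One indicator column per class, stored as 0/1 integers.
dummies = pd.get_dummies(toy["Pclass"], prefix="Pclass", dtype=int)

# Drop third class so it becomes the all-zeros baseline,
# matching the two hand-written columns.
dummies = dummies.drop(columns="Pclass_3")

toy = pd.concat([toy.drop(columns="Pclass"), dummies], axis=1)
print(toy)
```

Dropping one indicator keeps the columns from being redundant with the constant column added later.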


Encode sex as a boolean: 1 for male, 0 for female.

In [126]:
def sexBool(x):
  # 1 for "male", 0 for anything else
  return int(x == "male")

In [127]:
df["Sex"] = df["Sex"].apply(sexBool)

In [128]:
df.head()

Unnamed: 0,Survived,Sex,SibSp,Parch,Embarked,logFare,normAge,Pclass_1,Pclass_2
0,0,1,1,0,S,0.860338,0.275,0,0
1,1,0,1,0,C,1.852988,0.475,1,0
2,1,0,0,0,S,0.898999,0.325,0,0
3,1,0,1,0,S,1.725095,0.4375,1,0
4,0,1,0,0,S,0.905796,0.4375,0,0


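Drop every row that still contains a NaN. Age is missing for 177 of the 891 passengers (only 714 non-null values), so this removes a sizeable share of the training rows.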

In [129]:
df = df.dropna()

In [130]:
df.head()

Unnamed: 0,Survived,Sex,SibSp,Parch,Embarked,logFare,normAge,Pclass_1,Pclass_2
0,0,1,1,0,S,0.860338,0.275,0,0
1,1,0,1,0,C,1.852988,0.475,1,0
2,1,0,0,0,S,0.898999,0.325,0,0
3,1,0,1,0,S,1.725095,0.4375,1,0
4,0,1,0,0,S,0.905796,0.4375,0,0
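
Before dropping rows, it can help to check how many values are missing in each column and how many rows the drop will cost. A small sketch on a toy frame follows; the frame and the names `missing_per_column` and `rows_dropped` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing age and one missing port (illustrative values).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "normAge": [0.275, np.nan, 0.325, 0.4375],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values column by column.
missing_per_column = toy.isna().sum()

# dropna() removes any row that has at least one missing value.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)

print(missing_per_column)
print(rows_dropped)
```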


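Encode the port of embarkation as booleans as well. Ports C and S each get a column; the remaining port, Q, is the baseline with both columns 0.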

In [131]:
def embC(x):
  # 1 if the passenger embarked at port C, otherwise 0
  return int(x == "C")

def embS(x):
  # 1 if the passenger embarked at port S, otherwise 0
  return int(x == "S")

In [132]:
df["Embark_C"] = df["Embarked"].apply(embC)

In [133]:
df["Embark_S"] = df["Embarked"].apply(embS)

In [134]:
df = df.drop("Embarked", axis=1)

In [135]:
df.head()

Unnamed: 0,Survived,Sex,SibSp,Parch,logFare,normAge,Pclass_1,Pclass_2,Embark_C,Embark_S
0,0,1,1,0,0.860338,0.275,0,0,0,1
1,1,0,1,0,1.852988,0.475,1,0,1,0
2,1,0,0,0,0.898999,0.325,0,0,0,1
3,1,0,1,0,1.725095,0.4375,1,0,0,1
4,0,1,0,0,0.905796,0.4375,0,0,0,1


Finally, we add a column of ones so the model can learn a constant (bias) term: the weight that multiplies this column acts as the intercept during gradient descent.

In [136]:
df.insert(0, 'Ones', 1)

In [137]:
df.head()

Unnamed: 0,Ones,Survived,Sex,SibSp,Parch,logFare,normAge,Pclass_1,Pclass_2,Embark_C,Embark_S
0,1,0,1,1,0,0.860338,0.275,0,0,0,1
1,1,1,0,1,0,1.852988,0.475,1,0,1,0
2,1,1,0,0,0,0.898999,0.325,0,0,0,1
3,1,1,0,1,0,1.725095,0.4375,1,0,0,1
4,1,0,1,0,0,0.905796,0.4375,0,0,0,1
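
To see why the column of ones gives a constant term, here is a minimal sketch assuming the model later multiplies each row of features by a weight vector. The matrix `X` and the weights `w` are hypothetical values chosen for illustration, not parameters from this notebook.

```python
import numpy as np

# Toy design matrix: the first column is the constant 1, then two features.
X = np.array([
    [1.0, 0.5, 2.0],
    [1.0, 1.5, 0.0],
])

# Hypothetical weights: w[0] multiplies the ones column, so it acts as the bias.
w = np.array([0.1, 2.0, -1.0])

# A single matrix-vector product covers both the features and the bias.
preds = X @ w

# The same result written with an explicit bias term.
explicit = X[:, 1:] @ w[1:] + w[0]

print(preds)
print(np.allclose(preds, explicit))
```

Folding the bias into the weight vector means gradient descent can update it exactly like any other weight.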


# Parameters