<a href="https://colab.research.google.com/github/cesar-yoab/TitanicKaggleChallenge/blob/main/titanic_kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install the Kaggle command-line client
!pip install -q kaggle

In [None]:
# Upload the kaggle.json API token
from google.colab import files
files.upload()

In [None]:
# Kaggle set-up
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
# Restrict token permissions (the Kaggle CLI warns if the file is readable by others)
! chmod 600 ~/.kaggle/kaggle.json

# Download and import data into the notebook
!kaggle competitions download -c titanic

# Kaggle Titanic Competition
[Link](https://www.kaggle.com/c/titanic/overview/evaluation) to competition website

**Goal**: Predict if a passenger survived the sinking of the Titanic or not.
For each passenger in the test set, you must predict a 0 or 1 value for the `Survived` variable.

**Metric**: Accuracy
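
Accuracy is the fraction of passengers whose predicted label matches the true label. A minimal sketch in plain Python follows; the `accuracy` helper and the sample labels are illustrative assumptions, not part of the competition tooling.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 3 of the 4 predictions are correct
example_score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(example_score)
```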

## Submission File Format
You should submit a CSV file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or extra rows.

The file should have exactly 2 columns:

- `PassengerId` (sorted in any order)
- `Survived` (contains your binary predictions: 1 for survived, 0 for deceased)

```
PassengerId,Survived
892,0
893,1
894,0
Etc.
```
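
The format rules above can be checked before uploading. The sketch below uses only the standard library; the `check_submission` helper and its `expected_rows` parameter are hypothetical names introduced for illustration, and the sample text is a shortened stand-in for a real 418-row file.

```python
import csv
import io


def check_submission(csv_text, expected_rows=418):
    """Return a list of problems found in a submission; empty means it looks valid."""
    problems = []
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["PassengerId", "Survived"]:
        problems.append("header must be exactly PassengerId,Survived")
    rows = [row for row in reader if row]  # ignore blank lines
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    if any(len(row) != 2 or row[1] not in ("0", "1") for row in rows):
        problems.append("every row needs two fields and a 0/1 Survived value")
    return problems


# Tiny hypothetical file, checked against expected_rows=3 instead of 418
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(sample, expected_rows=3))
```

For a real file, the text could be read with `open('my_submission.csv').read()` and checked with the default `expected_rows=418`.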

Some References:
1. [Titanic Tut](https://www.kaggle.com/blurredmachine/titanic-survival-a-complete-guide-for-beginners)
2. [Data Science Tut](https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Import Titanic data
df = pd.read_csv("train.csv")
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|-|-|-|-|-|-|-|-|-|-|-|-|-|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


# Data dictionary
| Variable | Definition | Key |
|-|-|-|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    int64  
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(6), object(4)
memory usage: 83.7+ KB
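
The non-null counts above imply missing values: 177 in `Age` (891 - 714), 687 in `Cabin` (891 - 204) and 2 in `Embarked` (891 - 889). A minimal sketch of counting them directly, using a made-up toy frame (`toy`) rather than the real `train.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# Count missing entries per column, most affected columns first
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

On the notebook's data, the equivalent call would be `df.isna().sum()`.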


In [None]:
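# Encode Sex numerically (female -> 0, male -> 1). Assign the result back rather than
# using inplace=True on a column selection, which may not modify df in recent pandas.
df['Sex'] = df['Sex'].map({"female": 0, "male": 1})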

In [None]:
# Survived is 0/1, so its mean is the fraction of survivors
women = df.loc[df.Sex == 0, "Survived"]
rate_women = women.mean()

print("Fraction of women who survived:", rate_women)

Fraction of women who survived: 0.7420382165605095


In [None]:
men = df.loc[df.Sex == 1, "Survived"]
rate_men = men.mean()

print("Fraction of men who survived:", rate_men)

Fraction of men who survived: 0.18890814558058924


In [None]:
# Survival rate by passenger class
for i in range(1, 4):
    pc_survival = df.loc[df.Pclass == i, 'Survived'].mean()
    print("Fraction of passenger class {} that survived:".format(i), pc_survival)

Fraction of passenger class 1 that survived: 0.6296296296296297
Fraction of passenger class 2 that survived: 0.47282608695652173
Fraction of passenger class 3 that survived: 0.24236252545824846
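
The per-class loop above can also be written as a single `groupby`. A minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the competition data):

```python
import pandas as pd

# Toy data: two passengers in class 1, two in class 2, four in class 3
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group is the survival rate of that group
rates = toy.groupby("Pclass")["Survived"].mean()
print(rates)
```

On the notebook's data, `df.groupby("Pclass")["Survived"].mean()` should give the same three rates as the loop.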


In [None]:
# Models to try
from sklearn.linear_model import LogisticRegression

In [None]:
# Train data
y = df['Survived']

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(df[features])

model = LogisticRegression(penalty='elasticnet', 
                           random_state=0, l1_ratio=.3,
                           solver='saga').fit(X, y)
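
Since the competition metric is accuracy, it can help to estimate it locally before submitting. The sketch below applies 5-fold cross-validation to the same kind of elastic-net logistic regression. The synthetic `X_demo` and `y_demo` arrays and the `max_iter=5000` setting are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four numeric features (assumption, not real data)
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(200, 4)).astype(float)
# Label depends on the second column plus noise, so there is signal to learn
y_demo = (X_demo[:, 1] + rng.normal(0, 1, 200) > 1.5).astype(int)

clf = LogisticRegression(penalty='elasticnet', l1_ratio=0.3, solver='saga',
                         random_state=0, max_iter=5000)

# One accuracy score per fold; the mean is a rough estimate of held-out accuracy
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

With the notebook's data, `X` and `y` would take the place of `X_demo` and `y_demo`.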

In [None]:
# Get test data
test = pd.read_csv("test.csv")
test['Sex'] = test['Sex'].map({"female": 0, "male": 1})
X_test = pd.get_dummies(test[features])
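
With the four numeric features used here, `pd.get_dummies` leaves the columns unchanged. If a categorical feature such as `Embarked` were added (an assumption, it is not in `features` above), encoding train and test separately could produce different columns. A minimal sketch of aligning them, on made-up toy frames:

```python
import pandas as pd

# Toy frames: the test set happens to contain no passengers from port Q
train_toy = pd.DataFrame({"Pclass": [1, 2, 3], "Embarked": ["C", "Q", "S"]})
test_toy = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "C"]})

X_train_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Give the test matrix exactly the training columns, filling absent ones with 0
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

The model can then be applied to `X_test_aligned`, whose columns match the ones seen during fitting.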

In [None]:
# Make predictions and save to csv file for submission
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False) # Accuracy 0.77272