# DSCI 100 Group 12 Project Proposal

# Imports

In [1]:
### Run this cell before continuing.
import random

import altair as alt
import pandas as pd
import sklearn
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.metrics.pairwise import euclidean_distances

alt.data_transformers.disable_max_rows()
np.random.seed(1)

# Introduction

## Background

The sinking of the Titanic is one of the best-known shipwrecks in history. Widely regarded as "unsinkable", the Titanic sank on April 15, 1912, during her maiden voyage, after striking an iceberg. Of the 2224 passengers and crew on board, 1502 perished because there were not enough lifeboats for everyone.

Although survival involved some element of luck, some groups of people appear to have been more likely to survive than others. Using the passenger information provided, we explore the dataset to answer the question: what sorts of people were more likely to survive?

## Question

Given information about a passenger on the Titanic, can we predict whether they survived the shipwreck?

## Dataset
The dataset we'll be using is from https://www.kaggle.com/competitions/titanic/data. It already splits our data into a training set to train our model with and a test set to evaluate our model on unseen data. 

The dataset contains a number of features, such as each passenger's sex, age, and cabin number. Our target variable is the `Survived` column, which is 0 if the passenger did not survive and 1 if they did.

# Preliminary Exploratory Data Analysis

## Reading the data
> Note: we could not find a direct-download URL for the dataset. We therefore downloaded the CSV files from Kaggle (https://www.kaggle.com/competitions/titanic/data) and moved them into a `data` directory.

In [2]:
training_data = pd.read_csv('data/train.csv')
test_data = pd.read_csv('data/test.csv')
training_data

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


## Cleaning and wrangling data into a tidy format

The data is already in a tidy format:
- Each row is a single observation (a single passenger)
- Each column is a single variable
- Each cell contains a single value

Furthermore, the column names are easy to read and use already (no spaces in them) and missing values are represented with `NaN`.

## Summarizing the Data

Here we use `DataFrame.info()` and `DataFrame.describe()` to discover:

1. The number of observations for each column and whether we're missing any observations
2. What `Dtype` each column/feature has
3. The mean, std, quartiles, min and max of numerical features

In [3]:
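# info() prints its summary itself and returns None, so display() also echoes "None".
display(training_data.info())
# include='all' adds count/unique/top/freq rows for the non-numeric columns.
display(training_data.describe(include='all'))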

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


None

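| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681.0 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Braund, Mr. Owen Harris | male | NaN | NaN | NaN | 347082.0 | NaN | B96 B98 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7.0 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |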


We can see that three features have missing values: `Age` (714 of 891 present), `Cabin` (204 of 891 present) and `Embarked` (889 of 891 present). `Cabin` has by far the most missing values (687 of 891 rows), so it may not be a useful feature for our model.

Furthermore, `Name` takes a different value for every passenger in our training set (891 unique values in 891 rows), so it may not be that useful either. We'll expand further on this in the [Methods](#methods) section.
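The checks above can also be computed directly rather than read off the `info()` output. The following is a minimal sketch, not part of our analysis pipeline: the helper name `column_profile` and the four-row `toy` frame are hypothetical and only illustrate the idea. In the notebook it would be called on `training_data`.

```python
import numpy as np
import pandas as pd


def column_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values and distinct values for each column."""
    profile = pd.DataFrame(
        {
            "n_missing": df.isna().sum(),      # count of NaN cells per column
            "frac_missing": df.isna().mean(),  # share of NaN cells per column
            "n_unique": df.nunique(),          # distinct non-missing values
        }
    )
    # Columns with the most missing values come first.
    return profile.sort_values("n_missing", ascending=False)


# Hypothetical toy frame, used only to demonstrate the helper.
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, np.nan, 26.0, 35.0],
        "Cabin": [np.nan, "C85", np.nan, np.nan],
    }
)
profile = column_profile(toy)
print(profile)
```

A column with a high `frac_missing` (like `Cabin`) or with `n_unique` equal to the number of rows (like `Name`) is a candidate for being dropped or transformed before modelling.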

Next, we find the number of observations in each class:

In [5]:
display(training_data['Survived'].value_counts())
display(training_data['Survived'].value_counts(normalize=True))

0    549
1    342
Name: Survived, dtype: int64

0    0.616162
1    0.383838
Name: Survived, dtype: float64

We can see that 549 passengers in the training set did not survive (roughly 62%) and 342 passengers survived (roughly 38%).
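Because the classes are imbalanced, it helps to know how well a model would do by always predicting the majority class. The sketch below is an illustration only: the helper name `majority_baseline_accuracy` is hypothetical, and the label column is rebuilt from the class counts shown in the output above rather than read from the CSV.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Rebuild the label column from the reported counts: 549 zeros, 342 ones.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")
baseline = majority_baseline_accuracy(survived)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Any classifier we train should beat this baseline on held-out data to be considered useful.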

TODO any more EDA?

## Visualizing the Data

TODO 

# Methods

TODO

# Expected Outcomes and Significance 

Through classification, we want to find out what sorts of people were more likely to survive the Titanic shipwreck. As we train a classifier on the training set, we may find that specific features, such as `Sex` or ticket class (`Pclass`), influence the likelihood of survival. For example, we expect that passengers in higher ticket classes (first class is `Pclass` 1), or those with higher cabin numbers, survived at higher rates than other groups. These findings can help us better understand which groups of people were more likely to survive than others. They could also lead to future questions, such as whether these groups resemble the survivors of other large-scale maritime accidents or natural disasters that caused many deaths. Lastly, these findings, together with further research, could help maximize the number of survivors if a similar event were to happen again.
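As a first, informal check on whether a feature such as sex or ticket class is related to survival, one could compare survival rates across groups. This is a minimal sketch under stated assumptions: the helper name `survival_rate_by` is hypothetical, and `sample` contains only the first five rows of the training-set preview shown earlier, which is far too few to draw conclusions. In the notebook the helper would be applied to `training_data`.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Fraction of passengers who survived within each level of `column`."""
    return df.groupby(column)["Survived"].mean().sort_values(ascending=False)


# First five rows of the preview (Survived, Pclass and Sex columns only).
sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 1, 0],
        "Pclass": [3, 1, 3, 1, 3],
        "Sex": ["male", "female", "female", "female", "male"],
    }
)
rate_by_sex = survival_rate_by(sample, "Sex")
rate_by_class = survival_rate_by(sample, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

Since `Survived` is coded as 0/1, the group mean is the survival rate, so no extra counting step is needed.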