# TitanicKaggle
Titanic Machine Learning Challenge from Kaggle

From: https://www.kaggle.com/c/titanic

## Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Practice Skills
- Binary classification
- Python and R basics

### Overview
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
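As a minimal sketch of what feature engineering can look like here, the snippet below derives two new columns from `SibSp` and `Parch`. The names `FamilySize` and `IsAlone` and the small inline frame are my own illustrative assumptions; they are not part of the competition files.

```python
import pandas as pd

# Small stand-in for train.csv so the example runs on its own.
sample = pd.DataFrame({
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

# Hypothetical engineered features:
# family size counts siblings/spouses, parents/children, and the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1
# Flag passengers travelling without any family aboard.
sample["IsAlone"] = (sample["FamilySize"] == 1).astype(int)

print(sample)
```

Whether such features actually help has to be checked against a held-out split; this only shows the mechanics.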

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
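The rule behind gender_submission.csv can be sketched in a few lines of pandas. The inline rows and the output file name are assumptions made for illustration; with the real data, `test_sample` would be the frame read from test.csv.

```python
import pandas as pd

# Illustrative stand-in for test.csv (only the columns the rule needs).
test_sample = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Predict survival for female passengers only (1 = survived, 0 = did not).
submission = pd.DataFrame({
    "PassengerId": test_sample["PassengerId"],
    "Survived": (test_sample["Sex"] == "female").astype(int),
})

# Assumed file name; index=False keeps the file to the two columns above.
# submission.to_csv("my_submission.csv", index=False)
print(submission)
```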

### Import Modules and Data

In [37]:
# %% Import Modules

import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

In [3]:
# %% Import Data

train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

print(train_data.head())
print(test_data.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
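The `NaN` entries in the `Cabin` column above suggest checking how much data is missing before modelling. A minimal sketch, using a made-up three-row frame in place of `train_data`:

```python
import numpy as np
import pandas as pd

# Made-up rows; with the real data, call the same method on train_data.
sample = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", "C", "S"],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = sample.isna().sum()
print(missing)
```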
  

### Data Exploration

Here, I want to explore some basic characteristics of the dataset: the number of passengers, the number of survivors, and the characteristics of the survivors. Based on the conventions of the time, I expect that a passenger's sex and class will play a large role in whether or not they survived.

In [55]:
# %% Basic Exploration

cols = list(train_data.columns)
print(cols)
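# Candidate feature columns: everything except PassengerId, Survived, and Name
cols = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

# count() skips missing values, so this counts passengers with a known Embarked value
sex = train_data.groupby("Sex").count()
print("\nCount of People\n",sex["Embarked"])

survived = train_data["Survived"].sum()
print("\nNumber of survivors:", survived)

# Select the Survived column before summing so non-numeric columns are not aggregated
survived_sex = train_data.groupby("Sex")["Survived"].sum()
print("\nNumber of survivors by sex:\n", survived_sex)

survived_class = train_data.groupby("Pclass")["Survived"].sum()
print("\nNumber of survivors by class:\n", survived_class)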


['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

Count of People
 Sex
female    312
male      577
Name: Embarked, dtype: int64

Number of survivors: 342

Number of survivors by sex:
 Sex
female    233
male      109
Name: Survived, dtype: int64

Number of survivors by class:
 Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64
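Raw counts are hard to compare because the groups differ in size, so a survival rate per group is often more informative. The sketch below shows the idea on a made-up five-row frame; the same `groupby(...)["Survived"].mean()` call should work on `train_data`, since `Survived` is coded 0/1.

```python
import pandas as pd

# Made-up rows standing in for train_data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
rate_by_class = sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```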


### Multinomial Naive Bayes

In [50]:
def multinb(x, y):
    """
    Fit a Multinomial Naive Bayes classifier on a single feature and score it
    on a held-out split.

    Args:
        x (array-like): independent data (one numeric, non-negative feature)
        y (array-like): target data

    Return:
        score (float): Mean accuracy of the model on the held-out test split

    """
    # Train Test Split
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.33,
                                                        random_state = 0)
    # scikit-learn expects a 2-D feature matrix and a 1-D target vector;
    # a column-vector target triggers a DataConversionWarning
    X_train = np.array(X_train).reshape(-1,1)
    X_test = np.array(X_test).reshape(-1,1)
    y_train = np.array(y_train).ravel()
    y_test = np.array(y_test).ravel()

    # Fit and predict model (named `model` so it does not shadow the function)
    model = MultinomialNB()
    model.fit(X_train, y_train)

    predicted = model.predict(X_test)
    score = model.score(X_test, y_test)

    # Plot (disabled; needs `import matplotlib.pyplot as plt`)
    # x_axis = range(len(X_test))
    #
    # fig,ax = plt.subplots(figsize=(15,10))
    # ax.scatter(x_axis, predicted, alpha = 0.3)
    # ax.scatter(x_axis, y_test, alpha = 0.3)

    return score


In [56]:
for col in cols:
    multinb(train_data[col], train_data["Survived"])

  y = column_or_1d(y, warn=True)


ValueError: could not convert string to float: 'male'
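The `ValueError` comes from passing a string column such as `Sex` straight to `MultinomialNB`, which needs numeric, non-negative features without missing values. One possible fix, sketched here on made-up rows, is to one-hot encode the column with `pd.get_dummies` first. This is an assumption about how to proceed, not something the notebook has done yet.

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

# Made-up rows standing in for train_data.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# One-hot encode: each category becomes its own 0/1 column
# (Sex_female, Sex_male), which MultinomialNB can treat as counts.
X = pd.get_dummies(sample[["Sex"]], dtype=int)
y = sample["Survived"]

model = MultinomialNB()
model.fit(X, y)

# Predicting on the training rows only demonstrates that the fit now runs.
predicted = model.predict(X)
print(predicted)
```

Numeric columns with gaps, such as `Age`, would still need their missing values filled or dropped before fitting.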

In [33]:
train_data["Sex"].reshape(-1,1)

AttributeError: 'Series' object has no attribute 'reshape'
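As the error says, the `Series` here has no `reshape` method. Two common ways to get a single column in 2-D form are sketched below on a made-up Series; either should work in place of the call above.

```python
import pandas as pd

# Made-up values standing in for train_data["Sex"].
sex = pd.Series(["male", "female", "female"], name="Sex")

# Option 1: convert to a NumPy array, which does support reshape.
column = sex.to_numpy().reshape(-1, 1)

# Option 2: keep it in pandas as a one-column DataFrame.
frame = sex.to_frame()

print(column.shape, frame.shape)
```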