# Titanic - Experiment 12

 1. Frame Problem and Objective
 2. Describe and Wrangle Data
 3. Process Data
 4. Explore Data
 5. Modeling and Evaluation

### Import libraries

```python
import os
import sys
import warnings

import scipy
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display

if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())
from Titanic.Code.DataPrep.titanic import Titanic
from Titanic.Code.DataPrep.helpers import score_impute_strategies
```

### Change notebook settings

```python
warnings.filterwarnings('ignore')
np.random.seed(17)
InteractiveShell.ast_node_interactivity = "all"
plt.style.use('classic')
%matplotlib inline
```

## 1. Frame Problem and Objective

#### Problem:
On April 15, 1912, the [RMS Titanic](https://en.wikipedia.org/wiki/RMS_Titanic) sank after colliding with an iceberg, killing more than 1,500 of an *estimated* 2,224 passengers and crew.

***

#### Objective:
Given a set of passenger records from the RMS Titanic, our objective is to build a model that predicts whether a passenger survived the disaster. Because the output can take only one of two discrete values, the problem type is **[binary classification](https://en.wikipedia.org/wiki/Binary_classification)**.

The [scikit-learn](https://scikit-learn.org/stable/) library provides many machine learning models suited to a **binary classification** problem. A handful of these models are:
 * [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
 * [Decision trees](https://scikit-learn.org/stable/modules/tree.html#classification)
 * [Random forests](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)
 * [Support vector machines](https://scikit-learn.org/stable/modules/svm.html#classification)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```
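As a rough illustration of how these four estimators share the same `fit`/`predict` interface, here is a minimal sketch. It uses a synthetic dataset from `make_classification` as a stand-in, not the Titanic records, and the hyperparameters (`max_iter`, `n_estimators`, `random_state`) are assumptions chosen only to make the example reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic two-class data used purely as a placeholder (NOT the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=17)

# Hyperparameters here are illustrative assumptions, not tuned values.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=17),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=17),
    "svm": SVC(),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                        # every estimator exposes fit(X, y)
    predictions[name] = model.predict(X)   # predicted labels are 0 or 1
    # Accuracy on the training data itself, so it is optimistic by construction.
    print(name, round(model.score(X, y), 3))
```

Because the score above is computed on the same rows used for fitting, it says little about generalization; a held-out split or cross-validation would be needed for a fair comparison.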

## 2. Describe and Wrangle Data

The data for this analysis project comes from [Kaggle](https://www.kaggle.com/c/titanic). The following variables are available in the data:
 * PassengerId
    * Appears to be a system-generated identifier column (likely assigned by Kaggle)
 * Survived
    * Whether the passenger survived: 0 (No) or 1 (Yes)
 * Pclass
    * Ticket class of the passenger: first-class (1), second-class (2), or third-class (3)
 * Name
    * Name of the passenger of the form: {last name}, {title} {first name} {middle name}
    * Not all passengers have a middle name
 * Sex
    * The sex of the passenger: 'male' or 'female'
 * Age
    * Age of passenger (in years)
    * Ages under 1 year are expressed as a fractional *float*
    * *Estimated* ages are in the form xx.5
    * A quick look at the data shows that some records are missing this value
 * SibSp
    * An aggregated field representing the combined number of siblings **and/or** spouses of the passenger aboard the RMS Titanic
 * Parch
    * An aggregated field representing the combined number of parents **and/or** children of the passenger aboard the RMS Titanic
 * Ticket
    * The ticket number of the passenger
 * Fare
    * The fare charged to the passenger for the ticket
 * Cabin
    * Cabin number of the passenger
    * This variable also has missing values
 * Embarked
    * The port of embarkation of the passenger: Cherbourg (C), Queenstown (Q), or Southampton (S)
    * This variable has missing values
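
The notes above on missing values and on the structure of `Name` can be checked with a few lines of pandas. The sketch below uses a tiny hand-made frame whose rows are hypothetical and only mimic the column layout; it is not the Kaggle data. The regular expression also assumes that each title ends with a period (e.g. `Mr.`), which the description above does not state explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the column layout described above (not real records).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Name": ["Doe, Mr. John Adam", "Roe, Mrs. Jane",
             "Poe, Miss. Ann Lee", "Loe, Master. Tim"],
    "Age": [22.0, np.nan, 0.75, 30.5],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Count missing values per column and show only the columns that have any.
missing_counts = sample.isna().sum()
print(missing_counts[missing_counts > 0])

# Split Name into last name, title, and given names.
# Assumption: the title is terminated by a period.
parts = sample["Name"].str.extract(
    r"^(?P<LastName>[^,]+),\s*(?P<Title>[^.]+)\.\s*(?P<GivenNames>.*)$"
)
print(parts)
```

On the real dataset the same `isna().sum()` call would confirm which of `Age`, `Cabin`, and `Embarked` need imputation, and the extracted title could serve as a candidate feature.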