## Project 5: Classification

## Instructions

### Description

Practice classification on the Titanic dataset.

### Grading

For grading purposes, we will clear all outputs from all your cells and then run them all from the top.  Please test your notebook in the same fashion before turning it in.

### Submitting Your Solution

To submit your notebook, first clear all the cells (this won't matter too much this time, but for larger data sets in the future, it will make the file smaller).  Then use the File->Download As->Notebook to obtain the notebook file.  Finally, submit the notebook file on Canvas.


In [None]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import Series, DataFrame

# import sklearn.datasets
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import train_test_split
# from pandas import DataFrame
# from scipy.stats import norm
# from matplotlib.colors import ListedColormap


plt.style.use('ggplot')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

### Introduction

On April 15, 1912, the largest passenger liner of its time sank after colliding with an iceberg during her maiden voyage. The sinking killed 1502 of the 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the shipwreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although luck played some part in surviving the sinking, some groups of people were more likely to survive than others.

Intro Videos: 
https://www.youtube.com/watch?v=3lyiZMeTKIo
and
https://www.youtube.com/watch?v=ItjXTieWKyI 

The `titanic_data.csv` file contains data for `887` of the real Titanic passengers. Each row represents one person. The columns describe different attributes about the person including whether they survived (`0=No`), their age, their passenger-class (`1=1st Class, Upper`), gender, and the fare they paid (£s*). For more on the currency: http://www.statisticalconsultants.co.nz/blog/titanic-fare-data.html

We are going to try to see if there are correlations between the feature data provided (find a best subset of features) and passenger survival.

### Problem 1: Load and understand the data (35 points)

#### Your task (some of this is the work you completed for L14 - be sure to copy that work into here as needed)
Conduct some preprocessing steps to explore the following and provide code/answers in the below cells:
1. Load the `titanic_data.csv` file into a pandas dataframe
2. Explore the data provided (e.g., looking at statistics using describe(), value_counts(), histograms, scatter plots of various features, etc.) 
3. What are the names of feature columns that appear to be usable for learning?
4. What is the name of the column that appears to represent our target?
5. Formulate a hypothesis about the relationship between given feature data and the target
6. How did Pclass affect passengers' chances of survival?
7. What is the age distribution of survivors?
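
The exploration steps listed above can be sketched on a tiny made-up frame. The rows below are invented for illustration and are not taken from `titanic_data.csv`; only the column names `Survived`, `Pclass`, and `Age` match those used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data; NOT the real passenger records.
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 1],
    "Pclass":   [1, 3, 2, 3, 3, 1],
    "Age":      [29.0, 40.0, 8.0, 22.0, 35.0, 52.0],
})

# Summary statistics for every numeric column.
summary = toy.describe()

# Number of passengers in each class.
class_counts = toy["Pclass"].value_counts().sort_index()

# Survived is coded 0/1, so its mean within a group is that group's survival rate.
survival_rate = toy.groupby("Pclass")["Survived"].mean() * 100

print(summary)
print(class_counts)
print(survival_rate)
```

Taking the group mean of a 0/1 column is one way to get a percentage per group without counting the survivors and non-survivors separately.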

In [None]:
# !cat titanic_data.csv
# pd.read_table?

In [None]:
# Step 1. Load the `titanic_data.csv` file into a pandas dataframe
titn = pd.read_csv('titanic_data.csv', index_col='Name')
titn

In [None]:
# Step 2. Explore the data provided (e.g., looking at statistics using describe(), value_counts(), histograms, scatter plots of various features, etc.) 


In [None]:
titn.info()

In [None]:
# df_titn_dummies = pd.get_dummies(titn,columns=['Sex'])
# print(df_titn_dummies)

In [None]:
titn.describe()

In [None]:
survived_0 = titn[titn['Survived'] == 0]
survived_1 = titn[titn['Survived'] == 1]

In [None]:
survived_1.columns

In [None]:
# Get column names
columns = survived_1.columns

# Create figure with subplots and change the size to look good
fig, axs = plt.subplots(4, 2, figsize=(10, 15))

# Iterate through each column
for i, f in enumerate(columns):
    row = i // 2 #determining what row the plot goes in
    col = i % 2 #determining the column by calculating the remainder
    
    # Plot histograms for survived_1 and _0, alpha is 50% so that we can see the data overlap
    axs[row, col].hist(survived_1[f], alpha=0.5, label='Lived')
    axs[row, col].hist(survived_0[f], alpha=0.5, label='Died')
    axs[row, col].set_xlabel(f)
    axs[row, col].set_ylabel('Count')
    axs[row, col].set_title(f)
    axs[row, col].legend()

#make it tight!
plt.tight_layout()

# Show the plot
plt.show()


In [None]:
#This was my test code, I then combined it with above
# # Calculate how many survived from each
# age_cnt_died = survived_0['Age'].value_counts().sort_index()
# age_cnt_lived = survived_1['Age'].value_counts().sort_index()

# # Calculate total count for each age group
# tot_cnt = age_cnt_died + age_cnt_lived

# # Calculate the percentage of individuals who survived for each age group
# perct_survived = (age_cnt_lived / tot_cnt) * 100

# # Create a scatter plot
# plt.scatter(perct_survived.index, perct_survived.values)
# plt.xlabel('Age')
# plt.ylabel('Percentage Survived')
# plt.title('Percentage Survived by Age')
# plt.ylim(0,100)
# plt.show()


In [None]:
# Get column names
columns = survived_1.columns

# Create figure with subplots and change the size to look good
fig, axs = plt.subplots(4, 2, figsize=(10, 15))

# Iterate through each column
for i, f in enumerate(columns):
    row = i // 2 #determining what row the plot goes in
    col = i % 2 #determining the column by calculating the remainder
    
    # Count passengers who died / lived for each value of this column
    f_cnt_died = survived_0[f].value_counts().sort_index()
    f_cnt_lived = survived_1[f].value_counts().sort_index()

    # Total count per value; fill_value=0 keeps values seen in only one group
    # (a plain + would turn them into NaN)
    tot_cnt = f_cnt_died.add(f_cnt_lived, fill_value=0)

    # Percentage who survived per value (0% when nobody with that value lived)
    perct_survived = f_cnt_lived.reindex(tot_cnt.index, fill_value=0) / tot_cnt * 100
    
    # Plot bars
    axs[row, col].bar(perct_survived.index, perct_survived.values)
    axs[row, col].set_xlabel(f)
    axs[row, col].set_ylabel('Percentage Survived')
    axs[row, col].set_title(f)
    axs[row, col].set_ylim(0,100)

#make it tight!
plt.tight_layout()

# Show the plot
plt.show()



In [None]:
# Count passengers who died / lived for each (Pclass, Fare) pair
f_cnt_died = survived_0.groupby('Pclass')['Fare'].value_counts().sort_index()
f_cnt_lived = survived_1.groupby('Pclass')['Fare'].value_counts().sort_index()

# Total count per pair; fill_value=0 keeps pairs seen in only one group
# (a plain + would turn them into NaN)
tot_cnt = f_cnt_died.add(f_cnt_lived, fill_value=0)

# Percentage who survived for each (Pclass, Fare) pair
perct_survived = f_cnt_lived.reindex(tot_cnt.index, fill_value=0) / tot_cnt * 100

# Assign colors based on Pclass
colors = ['r', 'g', 'b']

# Go through each Pclass and plot the percentage of fare survived
for i, pclass in enumerate(perct_survived.index.levels[0]):
    plt.scatter(perct_survived[pclass].index, perct_survived[pclass], label=f'Pclass {pclass}', color=colors[i], alpha=0.5)

plt.xlabel('Fare')
plt.ylabel('Percentage Survived')
plt.title('Percentage Survived by Fare and Pclass')
plt.legend()
plt.show()


---

**Edit this cell to provide answers to the following steps:**

---

Step 3. What are the names of feature columns that appear to be usable for learning?

<font color = blue>
    
Pclass, Sex, Age, and Fare all appear useful as features, with Survived as the target. There are some interesting trends in the Siblings/Spouse and Children/Parents data, but I am not certain how useful they will be. Combining that information into a single count of family members on board might be more useful.
    
</font>
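
The idea of combining the two family columns into one count could be sketched as follows. This is only an illustration: the rows are invented, the column names `Siblings/Spouses Aboard` and `Parents/Children Aboard` are assumptions about the CSV header, and `FamilySize` is a made-up feature name.

```python
import pandas as pd

# Hypothetical rows; the two family column names are assumed, not confirmed.
toy = pd.DataFrame({
    "Siblings/Spouses Aboard": [1, 0, 3, 0],
    "Parents/Children Aboard": [0, 0, 2, 1],
    "Survived": [1, 0, 0, 1],
})

# FamilySize (hypothetical name): number of relatives travelling with the passenger.
toy["FamilySize"] = toy["Siblings/Spouses Aboard"] + toy["Parents/Children Aboard"]

# Survival rate in percent for each family size.
rate_by_family = toy.groupby("FamilySize")["Survived"].mean() * 100
print(rate_by_family)
```

If the combined column looked informative on the real data, it could be added to the list of input features alongside `Pclass`, `Age`, and `Fare`.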

Step 4. What is the name of the column that appears to represent our target?

<font color = blue>

Survived
    
</font>

Step 5. Formulate a hypothesis about the relationship between given feature data and the target

<font color = blue>

I believe there will be a strong correlation between gender and survival, and possibly a skew toward survival for younger passengers (those under 20), per the age-old "women and children first." I also expect to see a higher percentage of 1st class passengers surviving.
    
</font>


Step 6. How did Pclass affect passengers' chances of survival?
Show your work with a bar plot, dataframe selection, or visual of your choice.

<font color = blue> 
    
The plot below shows that a higher class meant a higher likelihood of survival. I like percentages for visualization: first class had roughly a 60% chance of surviving, second class 40-50%, and third class close to 20%. These values aren't exact (I could have printed them if needed), but I think the trend is clear.

</font>

In [None]:
# Took my test code and changed it for Pclass; this graph also appears above, but here it is alone
# Count passengers who died / lived in each class
P_cnt_died = survived_0['Pclass'].value_counts().sort_index()
P_cnt_lived = survived_1['Pclass'].value_counts().sort_index()

# Total count for each class
tot_cnt = P_cnt_died + P_cnt_lived

# Percentage of passengers who survived in each class
perct_survived = (P_cnt_lived / tot_cnt) * 100

# Create a bar plot
plt.bar(perct_survived.index, perct_survived.values)
plt.xlabel('Pclass')
plt.ylabel('Percentage Survived')
plt.title('Percentage Survived by Class')
plt.ylim(0,100)
plt.show()


Step 7. What is the age distribution of survivors?
Show your work with a dataframe operation and/or histogram plot.

<font color = blue>
    
See the plot below: most survivors were between the ages of 15 and 40.

</font>

In [None]:
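# Ages of survivors only (survived_1 holds the rows with Survived == 1)
plt.hist(survived_1['Age'], bins=80)
plt.xlabel('Age of survivors')
plt.show()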

### Problem 2: transform the data (10 points)
The `Sex` column is categorical: its values fall into groups rather than being numeric. To work with this data we need numbers, so your task is to transform the `Sex` column into numerical data with pandas' `get_dummies` function and remove the original categorical `Sex` column.
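
As a minimal sketch of what `get_dummies` does, the toy frame below (invented rows, not the real data) is encoded with the `columns` argument, which replaces the named column with one indicator column per category. The `dtype=int` argument is an optional choice made here so the indicators print as 0/1.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# The original Sex column is dropped and replaced by Sex_female and Sex_male.
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Passing `drop_first=True` would keep only `Sex_male`, since the two indicator columns carry the same information.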

In [None]:
df_titn_dum=pd.get_dummies(titn,columns=['Sex'])
df_titn_dum

### Problem 3: Classification (30 points)
Now that the data is transformed, we want to run various classification experiments on it. The first is `K Nearest Neighbors`, which you will conduct by:

1. Define input and target data by creating lists of dataframe columns (e.g., `inputs = ['Pclass', ...]`)
2. Split the data into training and testing sets with `train_test_split()`
3. Create a `KNeighborsClassifier` using `5` neighbors at first (you can experiment with this parameter)
4. Train your model by passing the training dataset to `fit()`
5. Calculate predicted target values (`y_hat`) by passing the testing dataset to `predict()`
6. Print the accuracy of the model with `score()`

**Note:** If you get a Python warning such as "DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, )" when passing the Y, trainY, or testY vector to some of the function calls, look up how to use `trainY.values.ravel()`, `trainY.values.flatten()`, or a similar function.
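
The six steps above, together with the `ravel()` tip from the note, can be sketched end to end. The example uses synthetic data from `make_classification` as a stand-in for the Titanic features; the names `X_demo`, `y_demo`, and `knn` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: 4 numeric features, binary target).
X_arr, y_arr = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])
y_demo = pd.DataFrame({"target": y_arr})  # a column vector, on purpose

# Step 2: hold out 25% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Steps 3-4: build and train the classifier.
# ravel() flattens the (n_samples, 1) column into the 1d shape fit() expects.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr.values.ravel())

# Steps 5-6: predict the test set and report accuracy.
y_hat = knn.predict(X_te)
accuracy = knn.score(X_te, y_te.values.ravel())
print(f"accuracy: {accuracy:.3f}")
```

Selecting the target as a Series (single brackets, as in `df['Survived']`) avoids the warning in the first place, because a Series is already one-dimensional.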

In [None]:
df_titn_dum.columns

In [None]:
inputs = df_titn_dum[['Pclass', 'Age', 'Fare', 'Sex_female', 'Sex_male']]
target = df_titn_dum['Survived']

In [None]:
from sklearn.model_selection import train_test_split

X = inputs
y = target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# train, test = train_test_split(data, test_size = 0.5)


In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 5

In [None]:
# KNeighborsClassifier and k come from the previous cell
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
# y_pred

In [None]:
# conf_matrix = sk.metrics.confusion_matrix(y_test, y_pred)
# print(conf_matrix)
model.score(X_test, y_test)

### Problem 4: Cross validation, classification report (15 points)
- Using the concepts from the 17-model_selection slides and the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from scikit-learn, estimate the f-score ([`f1-score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)); you can use however many folds you wish. To get `cross_val_score` to use `f1-score` rather than the default accuracy measure, you will need to set the `scoring` parameter and use a scorer object created via [`make_scorer`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html#sklearn.metrics.make_scorer).  Since this has a few parts to it, let me just give you that parameter: ```scorerVar = make_scorer(f1_score, pos_label=1)```

- Using the concepts from the end of the 14-classification slides, output a confusion matrix.

- Also, output a classification report [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) from sklearn.metrics showing more of the metrics: precision, recall, f1-score for both of our classes.
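
The three outputs requested above fit together roughly as in this sketch. It runs on synthetic data from `make_classification` rather than the Titanic frame, and the names `X_demo`, `y_demo`, `clf`, and `scorer` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    f1_score,
    make_scorer,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: binary target coded 0/1).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# Cross-validated F1 for the positive class: one score per fold.
scorer = make_scorer(f1_score, pos_label=1)
cv_f1 = cross_val_score(clf, X_demo, y_demo, cv=5, scoring=scorer)
print("F1 per fold:", cv_f1, "mean:", cv_f1.mean())

# The confusion matrix and report need predictions on a held-out split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
y_hat = clf.fit(X_tr, y_tr).predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
print(classification_report(y_te, y_hat))
```

`cross_val_score` refits a clone of the model on each fold, so the cross-validated scores do not depend on the single train/test split used for the confusion matrix.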

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix, f1_score, classification_report, make_scorer
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC


In [None]:
# f1_score(y_test, y_pred, *, labels=None, pos_label=1, 
#                          average='binary', sample_weight=None, zero_division='warn')

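# F1 on the single train/test split; printed because only a cell's last expression is displayed
print(f1_score(y_test, y_pred))

# Scorer that makes cross_val_score report F1 for the positive class (1 = survived)
scorerVar = make_scorer(f1_score, pos_label=1)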

In [None]:
# 5-fold cross-validated F1 for the KNN model, per fold and averaged
scores = cross_val_score(model, X, y, cv=5, scoring=scorerVar)
print(scores)
print(scores.mean())

In [None]:
conf_matrix = sk.metrics.confusion_matrix(y_test, y_pred)
print(conf_matrix)

In [None]:
mets_out = metrics.classification_report(y_test, y_pred)
print(mets_out)

### Problem 5: Support Vector Machines (15 points)
Now, repeat the above experiment using a Support Vector classifier ([`SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)) with default parameters (RBF kernel) in scikit-learn, and output:

- The fit accuracy (using the `score` method of the model)
- The f-score (using the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function)
- The confusion matrix
- The precision, recall, and f-measure for the 1 class (you can just print the results of the [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function from sklearn.metrics)
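
One thing that may be worth trying with an RBF-kernel SVC is feature scaling, because the kernel is based on distances between samples and a feature with a large numeric range can dominate those distances. The sketch below is an optional experiment on synthetic data, not part of the assignment; `plain_svc` and `scaled_svc` are hypothetical names, and no claim is made about which variant scores higher on the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with one feature blown up to a much larger scale.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_demo[:, 0] *= 1000.0

# Default SVC versus the same SVC preceded by standardization.
plain_svc = SVC()
scaled_svc = make_pipeline(StandardScaler(), SVC())

# The built-in "f1" scorer reports F1 for the positive class (label 1).
plain_f1 = cross_val_score(plain_svc, X_demo, y_demo, cv=5, scoring="f1")
scaled_f1 = cross_val_score(scaled_svc, X_demo, y_demo, cv=5, scoring="f1")

print("unscaled mean F1:", plain_f1.mean())
print("scaled mean F1:  ", scaled_f1.mean())
```

Putting the scaler inside a pipeline means it is refit on the training portion of every fold, so no information from the validation fold leaks into the scaling.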

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Create a model object (RBF is SVC's default kernel)
model = SVC(kernel='rbf')

# Train our model
model.fit(X_train, y_train)

# Fit accuracy on the held-out test set
print(model.score(X_test, y_test))

# Predict the test set for the confusion matrix and classification report
y_pred = model.predict(X_test)

# F-score on the test split, then 5-fold cross-validated F-score
print(f1_score(y_test, y_pred))
scorerVar = make_scorer(f1_score, pos_label=1)
scores = cross_val_score(model, X, y, cv=5, scoring=scorerVar)
print(scores, scores.mean())

# Confusion matrix
conf_matrix = sk.metrics.confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Classification report
mets_out = metrics.classification_report(y_test, y_pred)
print(mets_out)


### Problem 6: Logistic Regression (15 points)

Now, repeat the above experiment using the [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model in scikit-learn, and output:

- The fit accuracy (using the `score` method of the model)
- The f-score (using the [`cross_val_score`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function)
- The confusion matrix
- The precision, recall, and f-measure for the 1 class (you can just print the results of the [`classification_report`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) function from sklearn.metrics)
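
Since the same experiment is repeated for each model, the comparison could also be written as a loop. The sketch below uses synthetic data and hypothetical names (`models`, `mean_f1`); `max_iter=1000` is an assumption made here to give logistic regression room to converge, not a setting taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: binary target coded 0/1).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# The three classifiers used in this project, keyed by a display name.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVC": SVC(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated F1 for each model.
mean_f1 = {
    name: cross_val_score(clf, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, clf in models.items()
}

for name, score in mean_f1.items():
    print(f"{name}: {score:.3f}")
```

Evaluating every model on the same folds keeps the comparison like for like.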

In [None]:
from sklearn.linear_model import LogisticRegression

# Create a model object
model = LogisticRegression()

# Train our model
model.fit(X_train, y_train)

# Fit accuracy on the held-out test set
print(model.score(X_test, y_test))

# Predict the test set for the confusion matrix and classification report
y_pred = model.predict(X_test)

# F-score on the test split, then 5-fold cross-validated F-score
print(f1_score(y_test, y_pred))
scorerVar = make_scorer(f1_score, pos_label=1)
scores = cross_val_score(model, X, y, cv=5, scoring=scorerVar)
print(scores, scores.mean())

# Confusion matrix
conf_matrix = sk.metrics.confusion_matrix(y_test, y_pred)
print(conf_matrix)

# Classification report
mets_out = metrics.classification_report(y_test, y_pred)
print(mets_out)


### Problem 7: Comparison and Discussion (5 points)
Edit this cell to provide a brief discussion (3-5 sentences at most):
1. What was the model/algorithm that performed best for you?

<font color = blue>Logistic Regression, going by F-score. Its F-scores for both the surviving and non-surviving classes were the highest of the three models. </font>

2. What features and parameters were used to achieve that performance?

<font color = blue>Features: 'Pclass', 'Age', 'Fare', 'Sex_female', 'Sex_male'

Parameters: a 75/25 train/test split with a fixed random seed (`random_state=42`), and each model's default settings (with `k = 5` for KNN)</font>


3. What insights did you gain from your experimentation about the predictive power of this dataset and did it match your original hypothesis about the relationship between given feature data and the target?

<font color = blue>I am not certain. The F-score is not as high as I would want if I were using a model for myself (I would like something over 0.80). Perhaps the features I left out (family members: Siblings/Spouse, Children/Parents) did have some impact. I could rerun all of this with those features to see whether the results change. However, with F-scores around 0.7, we can say there is some correlation. </font>


### Questionnaire
1) How long did you spend on this assignment? <font color = blue>5 hours</font>
<br><br>
2) What did you like about it? What did you not like about it? <font color = blue>This really challenged me to learn how these models work. It was stressful, because I am still not certain whether the models worked properly, but it feels like they did. Also, I liked the callback to previous assignments to keep that material fresh in my mind.</font>
<br><br>
3) Did you find any errors or is there anything you would like changed? <font color = blue>Not sure.</font>
<br><br>