## Using Data Science to Develop Metrics

Over the past couple of years, many of my projects have involved exploring new datasets and developing metrics. Because the data I used in those projects is confidential, I have adapted a couple of them to use publicly available data (Kaggle's [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/overview) dataset).

The two projects I will be adapting/combining in this presentation:
1. Outlier Detection
2. Feature Importance Analysis

But in order to understand the analysis, we need the proper context around the data.
- RMS Titanic set sail on April 10th, 1912
- Heading to New York City on its maiden voyage
- Struck an iceberg on April 15th, 1912
- 1,500 out of the 2,224 passengers and crew died (~67%)

The data was sourced from Kaggle and is split into a training set and a test set.

In [6]:
import pandas as pd

In [12]:
train_df = pd.read_csv('train.csv')
print(f"Number of passenger records in training set: {len(train_df)}")
# Subtract 2 so the identifier (PassengerId) and the target (Survived) are not counted as features
print(f"Number of features: {len(train_df.columns) - 2}")
train_df.head()

Number of passenger records in training set: 891
Number of features: 10


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


According to the documentation:
- *survived: Target (0 = No, 1 = Yes)*
- pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

### Part 1: Outlier Detection

The team I was working with was responsible for the software and content that a customer service representative might need to handle a member's or provider's inquiry. The team wanted to identify specific days that stood out, based on callers' reactions to their calls (the VOC survey), and then perform a case study to see whether anything could be learned from those days.

Instead of trying to fit a function to this data (which was multi-dimensional and very clearly non-linear) and finding outliers by the distance of each point from that function, I found an algorithm called the [Extended Isolation Forest](https://ieeexplore.ieee.org/document/8888179), an improvement on the [Isolation Forest](https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf), that can find outliers in n-dimensional data in a model-agnostic way. It is essentially an unsupervised machine learning model used specifically to identify anomalies in data. This script is a modified version of what I developed for that project.
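
To make the isolation idea concrete, here is a minimal from-scratch sketch that uses an Extended Isolation Forest style cut (a randomly oriented hyperplane rather than an axis-aligned split). It is not the `eif` package and not the script from the original project; the helper names (`c_factor`, `path_length`, `anomaly_score`), the synthetic data, and the parameter defaults are all assumptions made for illustration. Points that random cuts separate from the rest after only a few splits end up with scores closer to 1.

```python
import math
import random


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise depths (as in the Isolation Forest paper)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # H(n-1) approximated with Euler's constant
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def path_length(point, data, rng, depth, max_depth):
    """Follow `point` down one random tree that is built lazily over `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth + c_factor(len(data))
    dims = range(len(point))
    # Extended-style cut: a random normal vector and a random intercept
    # drawn from the bounding box of the points still in this node.
    normal = [rng.gauss(0.0, 1.0) for _ in dims]
    intercept = [
        rng.uniform(min(row[d] for row in data), max(row[d] for row in data))
        for d in dims
    ]

    def below(row):
        return sum((row[d] - intercept[d]) * normal[d] for d in dims) <= 0

    side = below(point)
    same_side = [row for row in data if below(row) == side]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def anomaly_score(point, data, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1); higher means the point is isolated after fewer cuts.
    Assumes `data` holds at least two points."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    total = 0.0
    for _ in range(n_trees):
        sample = rng.sample(data, size)
        total += path_length(point, sample, rng, 0, max_depth)
    return 2.0 ** (-(total / n_trees) / c_factor(size))


# Synthetic data (an assumption for illustration): one tight 2-D cluster.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]

inlier_score = anomaly_score((0.0, 0.0), cluster)
outlier_score = anomaly_score((8.0, 8.0), cluster)
print(f"inlier score:  {inlier_score:.3f}")
print(f"outlier score: {outlier_score:.3f}")
```

The point far from the cluster should receive the higher score, because random hyperplanes cut it off from the bulk of the data much sooner than they isolate a point in the dense centre.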

My intuition is that something can be extracted from either the ticket number or the name that would give us more information about the passenger, which could help us identify another feature to help the model later on. The plan is to use the algorithm to find the outliers, meaning passengers who survived when the rest of their data suggests they shouldn't have, and see if we can identify any patterns in their ticket numbers or names.
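
As a rough illustration of what such extraction could look like, the sketch below pulls an honorific out of each name and a non-numeric prefix out of each ticket number, using the five rows printed by `train_df.head()`. The helpers `extract_title` and `ticket_prefix` and their parsing rules are assumptions for illustration, not the features this analysis ends up using.

```python
import re

# (Name, Ticket) pairs copied from the first five training rows
rows = [
    ("Braund, Mr. Owen Harris", "A/5 21171"),
    ("Cumings, Mrs. John Bradley (Florence Briggs Th...", "PC 17599"),
    ("Heikkinen, Miss. Laina", "STON/O2. 3101282"),
    ("Futrelle, Mrs. Jacques Heath (Lily May Peel)", "113803"),
    ("Allen, Mr. William Henry", "373450"),
]


def extract_title(name):
    """Return the honorific between the comma and the first period."""
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else ""


def ticket_prefix(ticket):
    """Return everything before the final space, or '' if there is no space."""
    prefix, _, _ = ticket.rpartition(" ")
    return prefix


titles = [extract_title(name) for name, _ in rows]
prefixes = [ticket_prefix(ticket) for _, ticket in rows]
print(titles)    # ['Mr', 'Mrs', 'Miss', 'Mrs', 'Mr']
print(prefixes)  # ['A/5', 'PC', 'STON/O2.', '', '']
```

On the full dataset, the same helpers could be applied column-wise, for example `train_df["Name"].apply(extract_title)`.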

In [13]:
import matplotlib.pyplot as plt
import numpy as np
import eif as iso  # Extended Isolation Forest; third-party package, install first (e.g. pip install eif)
import seaborn as sb
sb.set_style(style="whitegrid")
