Lambda School Data Science, Unit 2: Predictive Modeling

# Applied Modeling, Module 1

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your decisions.

- [ ] Choose your target. Which column in your tabular dataset will you predict?
- [ ] Choose which observations you will use to train, validate, and test your model, and which observations, if any, to exclude.
- [ ] Determine whether your problem is regression or classification.
- [ ] Choose your evaluation metric.
- [ ] Begin with baselines: majority class baseline for classification, or mean baseline for regression, with your metric of choice.
- [ ] Begin to clean and explore your data.
- [ ] Choose which features, if any, to exclude. Would some features "leak" information from the future?
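
As a rough illustration of the majority class baseline named in the task list, the sketch below uses made-up labels (`y_train` and `y_val` are hypothetical, not taken from any project dataset) and accuracy as an assumed metric. It always predicts the most frequent training class and scores that guess on the validation labels.

```python
import pandas as pd

# Hypothetical binary labels, for illustration only
y_train = pd.Series(['no', 'no', 'yes', 'no', 'yes', 'no'])
y_val = pd.Series(['no', 'yes', 'no', 'no'])

# Majority class baseline: always predict the most frequent training label
majority_class = y_train.mode()[0]
y_pred = pd.Series([majority_class] * len(y_val))

# Accuracy of the baseline on the validation labels
accuracy = (y_pred == y_val).mean()
print(majority_class, accuracy)
```

Any model worth keeping should beat this score on the same metric.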

## Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [How Shopify Capital Uses Quantile Regression To Help Merchants Succeed](https://engineering.shopify.com/blogs/engineering/how-shopify-uses-machine-learning-to-help-our-merchants-grow-their-business)
- [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), **by Lambda DS3 student** Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

In [3]:
df = pd.read_csv('CA_Hosp_Mortality.csv', encoding = "ISO-8859-1")

pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [4]:
# Cleaning function

def wrangle(X):
    
    X = X.copy()
    
    # Change column names to more appropriate names without caps
    X = X.rename(columns={'YEAR':'Year', 'COUNTY':'County', 'HOSPITAL':'Hospital',
                        'Procedure/Condition':'Procedure_Condition',
                        'Risk Adjuested Mortality Rate':'RAMR', '# of Deaths':'Number_Deaths',
                        '# of Cases':'Number_Cases', 'Hospital Ratings':'Hospital_Ratings',
                        'LONGITUDE':'Longitude', 'LATITUDE':'Latitude'
                       })  
    
    # Remove rows where the value is for the entire state
    X = X.query("County != 'AAAA'")  # AAAA is the county code for the state

    # Mark procedures that are not in every year, and hidden NaN values ('.'), as missing
    X = X.replace({'AAA Repair': np.nan, 'AAA Repair Unruptured': np.nan, '.': np.nan})

    # Drop the marked rows; .copy() returns an independent frame, so the
    # assignments below do not trigger SettingWithCopyWarning
    X = X.dropna().copy()

    # Change numeric columns from string to float
    numerics = ['RAMR', 'Number_Deaths', 'Number_Cases']
    for column in numerics:
        X[column] = X[column].astype(float)

    # Engineer new column of Number of Deaths to Number of Cases ratio
    X['Deaths_Cases'] = X['Number_Deaths'] / X['Number_Cases']

    return X

newdf = wrangle(df)

In [10]:
newdf.tail()

| | Year | County | Hospital | OSHPDID | Procedure_Condition | RAMR | Number_Deaths | Number_Cases | Hospital_Ratings | Longitude | Latitude | Deaths_Cases |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22264 | 2015 | Yuba | Rideout Memorial Hospital | 106580996.0 | Acute Stroke Subarachnoid | 27.0 | 2.0 | 10.0 | As Expected | -121.594363 | 39.138222 | 0.2 |
| 22265 | 2015 | Yuba | Rideout Memorial Hospital | 106580996.0 | Acute Stroke Ischemic | 5.2 | 10.0 | 150.0 | As Expected | -121.594363 | 39.138222 | 0.067 |
| 22266 | 2015 | Yuba | Rideout Memorial Hospital | 106580996.0 | Acute Stroke Hemorrhagic | 19.9 | 4.0 | 25.0 | As Expected | -121.594363 | 39.138222 | 0.16 |
| 22267 | 2015 | Yuba | Rideout Memorial Hospital | 106580996.0 | Acute Stroke | 9.0 | 16.0 | 185.0 | As Expected | -121.594363 | 39.138222 | 0.086 |
| 22268 | 2015 | Yuba | Rideout Memorial Hospital | 106580996.0 | AMI | 5.9 | 13.0 | 209.0 | As Expected | -121.594363 | 39.138222 | 0.062 |


In [12]:
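target = 'Number_Deaths'

# Deaths_Cases is computed from the target, so keeping it as a feature would leak the answer
features = newdf.columns.drop([target, 'Deaths_Cases']).tolist()

# Time-based split: train on years before 2015, test on 2015
train = newdf.query('Year < 2015')
test = newdf.query('Year >= 2015')

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

# Alternative: hold out an area instead of a year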

This will be a regression problem: predicting the number of deaths per procedure, per hospital, or per area.
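
The assignment also asks for a validation set. One possible way to carve one out of a time-based split is sketched below on a toy frame; the `toy` data and the choice of 2014 as the validation year are assumptions for illustration, not decisions made in this notebook.

```python
import pandas as pd

# Toy stand-in for newdf; the real frame has more columns and rows
toy = pd.DataFrame({
    'Year': [2012, 2012, 2013, 2013, 2014, 2014, 2015, 2015],
    'Number_Cases': [150, 25, 185, 209, 10, 120, 95, 60],
    'Number_Deaths': [10, 4, 16, 13, 2, 9, 7, 5],
})

target = 'Number_Deaths'

# Assumed cutoffs: train before 2014, validate on 2014, test on 2015
train = toy.query('Year < 2014')
val = toy.query('Year == 2014')
test = toy.query('Year >= 2015')

X_train, y_train = train.drop(columns=target), train[target]
X_val, y_val = val.drop(columns=target), val[target]
X_test, y_test = test.drop(columns=target), test[target]

print(len(train), len(val), len(test))
```

Keeping the three sets in chronological order means the model is never evaluated on years earlier than the ones it was trained on.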

In [13]:
# Mean baseline: average number of deaths per row across all hospitals, 2012-2014
# Should try to predict per procedure or per hospital/area
train[target].mean()

6.590228440266696
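
To turn the mean into a scored baseline, it has to be paired with an evaluation metric. The sketch below assumes mean absolute error (MAE), which this notebook has not yet committed to, and uses small made-up series in place of the real `y_train` and `y_test`.

```python
import numpy as np
import pandas as pd

# Toy targets standing in for y_train and y_test
y_train = pd.Series([10.0, 4.0, 16.0, 13.0, 2.0])
y_test = pd.Series([7.0, 5.0, 12.0])

# Mean baseline: predict the training mean for every test row
baseline = y_train.mean()
y_pred = np.full(len(y_test), baseline)

# MAE is an assumed metric, used here only for illustration
mae = np.mean(np.abs(y_test - y_pred))
print(f'Baseline prediction: {baseline:.3f}')
print(f'Baseline MAE: {mae:.3f}')
```

The same pattern applies to the real data by swapping in the notebook's `y_train` and `y_test`.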