### Classification Error Metric Challenges

**Settings:** Where applicable, use `test_size=0.30` and `random_state=4444`. This permits comparison of results across users.

*These reference the Classification Challenges.*
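A minimal sketch of how the shared settings might be applied, assuming scikit-learn's `train_test_split` is the splitting function in use. The arrays `X` and `y` here are hypothetical stand-ins, not the voting data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 16 binary features, binary target.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 16))
y = rng.randint(0, 2, size=100)

# Shared settings: hold out 30% of the rows and fix the seed so that
# every user gets the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4444)

print(X_train.shape, X_test.shape)
```

Because the seed is fixed, rerunning the cell reproduces the same train and test rows.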

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.precision', 3)  # full option name; the bare 'precision' key fails on newer pandas

In [3]:
# Define column names
cols = ['handicapped',
       'water_project',
       'budget',
       'physician_fee',
       'el_salvador',
       'religion_school',
       'anti_satellite',
       'nicaragua',
       'mx_missile',
       'immigration',
       'synfuels',
       'education',
       'superfund',
       'crime',
       'duty_free',
       'south_africa', 
       'party']

In [5]:
# Load CSV voting data
df = pd.read_csv('/home/cneiderer/Metis/Neiderer_Metis/Challenges/challenges_data/house-votes-84.data', names=cols)

In [6]:
# Inspect DF
df.sample(5, random_state=4444)

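| | handicapped | water_project | budget | physician_fee | el_salvador | religion_school | anti_satellite | nicaragua | mx_missile | immigration | synfuels | education | superfund | crime | duty_free | south_africa | party |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 325 | n | y | n | n | y | y | n | n | ? | n | n | y | y | y | n | y | democrat |
| 122 | n | n | n | y | y | y | n | n | n | y | n | y | n | y | n | y | republican |
| 96 | n | n | ? | n | y | y | n | n | n | n | y | y | y | y | n | y | democrat |
| 355 | y | n | y | y | n | n | n | y | y | y | n | n | n | y | y | y | republican |
| 387 | y | y | y | n | y | y | n | y | y | n | y | n | n | y | n | ? | democrat |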


In [7]:
# Convert votes to numeric: 'n' -> 0, 'y' -> 1, '?' (missing) -> NaN
df.iloc[:, :-1] = df.iloc[:, :-1].replace({'n': 0, 'y': 1, '?': np.nan})

In [8]:
# Impute each missing vote ('?', now NaN) with its column mean, i.e. the observed fraction of 'y' votes
df.iloc[:, :-1] = df.iloc[:, :-1].fillna(df.iloc[:, :-1].mean(axis=0))

In [9]:
# Convert 'democrat' and 'republican' to numeric class vals
df['party'] = df['party'].str.replace('.', '', regex=False)  # strip the literal '.' that ends some class values
df['party'] = df['party'].replace({'democrat': 1, 'republican': 0})
# df['party'][df['party'].str.contains('dem')] = 1
# df['party'][df['party'].str.contains('rep')] = 0

In [10]:
# Inspect DF
df.sample(5, random_state=4444)

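| | handicapped | water_project | budget | physician_fee | el_salvador | religion_school | anti_satellite | nicaragua | mx_missile | immigration | synfuels | education | superfund | crime | duty_free | south_africa | party |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 325 | 0 | 1 | 0.0 | 0 | 1 | 1 | 0 | 0 | 0.501 | 0 | 0 | 1 | 1 | 1 | 0 | 1.0 | 1 |
| 122 | 0 | 0 | 0.0 | 1 | 1 | 1 | 0 | 0 | 0.0 | 1 | 0 | 1 | 0 | 1 | 0 | 1.0 | 0 |
| 96 | 0 | 0 | 0.597 | 0 | 1 | 1 | 0 | 0 | 0.0 | 0 | 1 | 1 | 1 | 1 | 0 | 1.0 | 1 |
| 355 | 1 | 0 | 1.0 | 1 | 0 | 0 | 0 | 1 | 1.0 | 1 | 0 | 0 | 0 | 1 | 1 | 1.0 | 0 |
| 387 | 1 | 1 | 1.0 | 0 | 1 | 1 | 0 | 1 | 1.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0.813 | 1 |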


#### Challenge 1

For the House of Representatives data set, calculate the accuracy, precision, recall, and F1 scores of each classifier you built (on the test set).
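As a rough illustration of the four metrics, the sketch below scores a small set of hypothetical labels and predictions with scikit-learn. The lists `y_test` and `y_pred` are made up for the example (1 = democrat, 0 = republican, matching the encoding above). In practice `y_pred` would come from a fitted classifier's `predict` on the held-out rows.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical test-set labels and predictions (not real results).
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# Confusion counts for these lists: TP=4, FN=1, FP=2, TN=3.
scores = {
    'accuracy': accuracy_score(y_test, y_pred),    # (TP + TN) / total
    'precision': precision_score(y_test, y_pred),  # TP / (TP + FP)
    'recall': recall_score(y_test, y_pred),        # TP / (TP + FN)
    'f1': f1_score(y_test, y_pred),                # harmonic mean of the two
}

for name, value in scores.items():
    print(f'{name}: {value:.3f}')
```

By default these functions treat label 1 as the positive class, so precision and recall here describe how well the democrat class is predicted.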

#### Challenge 2

For each, draw the ROC curve and calculate the AUC.
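One way to get the points of a ROC curve and its AUC is scikit-learn's `roc_curve` and `auc`. The sketch below uses four hypothetical labels and scores so the result is easy to verify by hand. For a real classifier, `y_score` would be its predicted probability (or decision score) for the positive class.

```python
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and positive-class scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and returns the false positive
# rate and true positive rate at each step.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the (fpr, tpr) curve via the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(f'AUC = {roc_auc:.2f}')
```

The `fpr` and `tpr` arrays can be passed directly to a plotting call such as matplotlib's `plt.plot(fpr, tpr)` to draw the curve.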

#### Challenge 3

Calculate the same metrics as in Challenge 1, but this time in a cross-validation scheme with the `cross_val_score` function (as in Challenge 9).
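A sketch of the cross-validated version, assuming the `scoring` parameter of `cross_val_score` is used to select each metric. The data set from `make_classification`, the choice of `LogisticRegression`, and `cv=5` are all illustrative assumptions, not part of the challenge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumption: synthetic binary data stands in for the voting records.
X, y = make_classification(n_samples=200, n_features=16, random_state=4444)
model = LogisticRegression(max_iter=1000)

# One cross-validation run per metric; each returns one score per fold.
cv_scores = {}
for metric in ['accuracy', 'precision', 'recall', 'f1']:
    fold_scores = cross_val_score(model, X, y, cv=5, scoring=metric)
    cv_scores[metric] = fold_scores.mean()

for metric, value in cv_scores.items():
    print(f'{metric}: {value:.3f}')
```

Each call refits the model on every fold, so the reported value is the mean over five held-out folds rather than a single test-set score.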

#### Challenge 4

For your movie classifiers, calculate the precision and recall for each class.
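For a multi-class problem, passing `average=None` to `precision_score` and `recall_score` returns one value per class instead of a single aggregate. The class names and predictions below are hypothetical placeholders, since the movie classifiers themselves are not shown here.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical multi-class labels (e.g. movie ratings) and predictions.
labels = ['G', 'PG', 'R']
y_true = ['PG', 'PG', 'R', 'R', 'R', 'G', 'G', 'PG', 'R', 'G']
y_pred = ['PG', 'R', 'R', 'R', 'PG', 'G', 'PG', 'PG', 'R', 'G']

# average=None yields an array with one entry per class, ordered as `labels`.
precision = precision_score(y_true, y_pred, labels=labels, average=None)
recall = recall_score(y_true, y_pred, labels=labels, average=None)

for label, p, r in zip(labels, precision, recall):
    print(f'{label}: precision={p:.3f}, recall={r:.3f}')
```

Passing `labels` explicitly fixes the order of the output arrays, which makes it easier to pair each score with its class.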

#### Challenge 5

Draw the ROC curve (and calculate the AUC) for the logistic regression classifier from Challenge 12.
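With a logistic regression model, the AUC is usually computed from predicted probabilities rather than hard class labels. The sketch below shows that step on synthetic data. The data set is an assumption standing in for the Challenge 12 data, which is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data in place of the Challenge 12 data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=4444)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=4444)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of clf.classes_[1],
# which is the positive class (label 1) here.
proba = clf.predict_proba(X_test)[:, 1]
auc_from_proba = roc_auc_score(y_test, proba)

print(f'AUC from probabilities: {auc_from_proba:.3f}')
```

Feeding `clf.predict(X_test)` to `roc_auc_score` instead would collapse the curve to a single threshold, so the probability column is the more informative input.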