**Topic:  Classification**

**Data:**

Challenges 1-10:  congressional votes [Congressional Voting Records Dataset](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)

Challenge 11:     movie data

Challenge 12:     breast cancer surgery [Haberman Survival Dataset](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)


**Data – Congressional Votes**

Download the congressional votes data from here: [Congressional Voting Records Dataset](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records)

These are votes of U.S. House of Representatives Congressmen on 16 key issues in 1984.

Read the description of the fields and download the data: house-votes-84.data

We will try to see if we can predict the house members' party based on their votes.

We will also use some of the general machine learning tools we learned (a bit more efficiently this time).

In [4]:
import pandas as pd
import numpy as np
import requests

headers = [
    'class',
    'handicapped-infants',
    'water-project-cost-sharing',
    'adoption-of-the-budget-resolution',
    'physician-fee-freeze',
    'el-salvador-aid',
    'religious-groups-in-schools',
    'anti-satellite-test-ban',
    'aid-to-nicaraguan-contras',
    'mx-missile',
    'immigration',
    'synfuels-corporation-cutback',
    'education-spending',
    'superfund-right-to-sue',
    'crime',
    'duty-free-exports',
    'export-administration-act-south-africa',
]

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"
df = pd.read_csv(url, header=None, names=headers)
df

Unnamed: 0,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


**Challenge 1**

Load the data into a pandas dataframe. Replace 'y's with 1s, 'n's with 0s.

Now, almost every representative has a ?. This represents the absence of a vote (they were absent, or there was some similar reason). If we dropped every row that has a ?, we would throw out most of our data. Instead, we will replace each ? with the best guess in the Bayesian sense: in the absence of any other information, we will say that the probability of the representative voting YES is the number of YES votes on that issue divided by the total number of recorded votes on it.

So, convert each ? to this probability (when yes=1 and no=0, this is the mean of the column).

In [5]:
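# convert only the vote columns; 'class' keeps its party labels
votes = headers[1:]
df[votes] = df[votes].replace({'y': 1, 'n': 0, '?': np.nan}).astype(float)
# fill each missing vote with the mean of its column (share of yes votes)
df = df.fillna(df.mean(numeric_only=True))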
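
To see what the mean fill does, here is a minimal sketch on a toy table. The column names `vote_a` and `vote_b` and all values are made up for illustration and are not part of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy vote table (hypothetical values): 1 = yes, 0 = no, NaN = missing vote
toy = pd.DataFrame({
    "class": ["democrat", "republican", "democrat", "republican"],
    "vote_a": [1.0, 0.0, np.nan, 1.0],
    "vote_b": [0.0, np.nan, 0.0, 1.0],
})

# Column means over the recorded votes only (NaN is skipped by default)
col_means = toy.mean(numeric_only=True)

# Replace each missing vote with the share of yes votes in its column
filled = toy.fillna(col_means)
print(filled)
```

In this toy table the missing `vote_a` entry becomes 2/3 and the missing `vote_b` entry becomes 1/3, because those are the shares of yes votes among the recorded values.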


**Challenge 2**

Split the data into a test and training set. Use this function:

```
from sklearn.model_selection import train_test_split
```

In [6]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.33, random_state=42)
X_train, X_test = train.drop(['class'], axis=1), test.drop(['class'], axis=1)
Y_train, Y_test = train['class'], test['class']

**Challenge 3**

Using scikit-learn's KNN algorithm, train a model that predicts the party (republican/democrat):

```
from sklearn.neighbors import KNeighborsClassifier
```

Try it with a lot of different k values (number of neighbors), from 1 to 20, and on the test set calculate the accuracy (number of correct predictions / number of all predictions) for each k.

You can use this to calculate accuracy:

```
from sklearn.metrics import accuracy_score
```

Which k value gives the highest accuracy?

In [1]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

neighbors = KNeighborsClassifier(n_neighbors=5)
neighbors.fit(X_train, Y_train)
print(accuracy_score(Y_test, neighbors.predict(X_test)))
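
The cell above checks a single k. A sweep over k = 1 to 20 could look like the sketch below. It runs on synthetic stand-in data (the arrays `votes` and `labels` are made up, not the real vote matrix), so the printed numbers say nothing about the congressional data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the vote matrix (hypothetical data, not the real votes)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=300)
# Each of 16 "votes" agrees with the label about 80% of the time
agree = rng.random((300, 16)) < 0.8
votes = np.where(agree, labels[:, None], 1 - labels[:, None])

Xtr, Xte, ytr, yte = train_test_split(votes, labels, test_size=0.33, random_state=42)

# Test accuracy for every k from 1 to 20
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    scores[k] = accuracy_score(yte, knn.predict(Xte))

# k with the highest test accuracy (ties go to the smallest k)
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Swapping in `X_train`, `X_test`, `Y_train` and `Y_test` from Challenge 2 would answer the question for the real data.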
 


**Challenge 4**

Make a similar model but with `LogisticRegression` instead, calculate test accuracy.

In [None]:
from sklearn import linear_model
logit = linear_model.LogisticRegression()
logit.fit(X_train, Y_train)
print(accuracy_score(Y_test, logit.predict(X_test)))

**Challenge 5**

Make a bar graph of democrats and republicans. How many of each are there?

Make a very simple predictor that predicts 'democrat' for every incoming example.

Just make a function that takes in an X (an array or matrix with input examples) and returns an array of the same length as X, where each value is 'democrat'. For example, if X is three rows, your function should return ['democrat','democrat','democrat']. Make a y_predicted vector using this and measure its accuracy.

Do the same with predicting 'republican' all the time and measure its accuracy.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# number of members per party
party = df['class'].value_counts()
print(party)
cats = party.index.tolist()
vals = party.values.tolist()
# recent seaborn versions require x and y as keyword arguments
sns.barplot(x=cats, y=vals)


In [None]:
def acc_republican():
    return accuracy_score(Y_test, ['republican' for i in Y_test])

def acc_democrat():
    return accuracy_score(Y_test, ['democrat' for i in Y_test])
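
The challenge text asks for a function that takes X and returns one label per row. A minimal sketch of that shape follows; `predict_constant`, `X_demo` and `y_demo` are hypothetical names and values used only for illustration.

```python
from sklearn.metrics import accuracy_score

def predict_constant(X, label):
    """Return one copy of `label` for every row of X (hypothetical helper)."""
    return [label] * len(X)

# Tiny made-up example: three input rows with known true labels
X_demo = [[1, 0], [0, 1], [1, 1]]
y_demo = ["democrat", "republican", "democrat"]

y_predicted = predict_constant(X_demo, "democrat")
print(y_predicted)
print(accuracy_score(y_demo, y_predicted))
```

With two democrats among the three made-up rows, always guessing 'democrat' is right two times out of three.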

**Challenge 6**

Plot the accuracies as a function of k. Since k only matters for KNN, your logistic regression accuracy, 'democrat' predictor accuracy and 'republican' predictor accuracy will stay the same over all k, so each of these three will be a horizontal line. But the KNN accuracy will change with k.

In [None]:
def accuracy_knn(k):
    neighbors = KNeighborsClassifier(n_neighbors=k)
    neighbors.fit(X_train, Y_train)
    return accuracy_score(Y_test, neighbors.predict(X_test))

def accuracy_logit():
    logit = linear_model.LogisticRegression()
    logit.fit(X_train, Y_train)
    return accuracy_score(Y_test, logit.predict(X_test))

def acc_republican():
    return accuracy_score(Y_test, ['republican' for i in Y_test])

def acc_democrat():
    return accuracy_score(Y_test, ['democrat' for i in Y_test])

maxk = 30
k = list(range(1, maxk))
acc_knn = [accuracy_knn(i) for i in k]
# the three baselines do not depend on k, so compute each once and repeat it
acc_logit = [accuracy_logit()] * len(k)
acc_repub = [acc_republican()] * len(k)
acc_demo = [acc_democrat()] * len(k)
plt.figure(figsize=(10,7)).suptitle("Accuracy Score vs. K in Popular Classifier Algorithms", fontsize='15')
plt.xlabel('k', fontsize='15')
plt.ylabel('Accuracy Score', fontsize='15')
plt.plot(k, acc_knn, label='K-Nearest Neighbors')
plt.plot(k, acc_logit, label='Logistic Regression')
plt.plot(k, acc_repub, label='Heuristic: Always Guess Republican')
plt.plot(k, acc_demo, label='Heuristic: Always Guess Democrat')
plt.legend(loc=(.1,.1))




**Challenge 7**

Plot a learning curve for the logistic regression model. But instead of going through the painstaking steps of doing it yourself, use this function:

```
from sklearn.model_selection import learning_curve
```

This will give you the m values (training-set sizes), the training scores and the test scores. All you need to do is plot them. You don't even need to give it separate training/test sets. It will do cross-validation all by itself. Easy, isn't it? : )
Remember, since it does cross-validation, it doesn't return a single training score or test score per m value. Instead, it returns one for each fold (separate partition) of the cross-validation. A good idea is to take the mean of these values across folds, which gives you a single meaningful number per m. That is, do something like:

```
train_cv_err = np.mean(train_err, axis=1)
test_cv_err = np.mean(ts_err, axis=1)
```

before plotting `m` vs `train_cv_err` and `m` vs `test_cv_err`, where `train_err` and `ts_err` are the two score arrays returned by the learning curve function. For a classifier these are accuracy scores by default, so the corresponding error is 1 minus the score. The `np.mean(..., axis=1)` means take the mean along axis 1 (the columns axis: each row corresponds to one m, each column to one cross-validation fold, and you are averaging the columns within each row).
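
As a small illustration of averaging along axis 1, the sketch below uses a made-up score array (`fold_scores` is a hypothetical name, and the numbers are not real results) with three training-set sizes and four folds.

```python
import numpy as np

# Made-up scores: 3 training-set sizes (rows) x 4 cross-validation folds (columns)
fold_scores = np.array([
    [0.80, 0.90, 0.70, 0.80],
    [0.90, 0.90, 0.80, 1.00],
    [1.00, 0.90, 0.90, 1.00],
])

# Average across folds: one number per training-set size
mean_per_size = np.mean(fold_scores, axis=1)
print(mean_per_size)
```

The result has one entry per row, so it can be plotted directly against the list of training-set sizes.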

Draw the learning curve for KNN with the best k value as well.


In [None]:
# learning curve function that accepts a model and returns the mean training
# and cross-validation scores, one value per training-set size
from sklearn.model_selection import learning_curve

X = df.drop('class', axis = 1)
y = df['class']
def learn_curve(model):
    # X and y are read from the enclosing notebook scope at call time
    sizes, train_err, ts_err = learning_curve(model, X, y, cv=20)
    train_cv_err = np.mean(train_err, axis=1)
    test_cv_err = np.mean(ts_err, axis=1)
    return train_cv_err, test_cv_err


In [None]:
# Logistic Regression
# run the cross-validation once and reuse both results
train_scores, test_scores = learn_curve(logit)
plt.figure(figsize=(10,7)).suptitle("Learning Curve for Logistic Regression", fontsize=15)
plt.plot(test_scores, label='Cross-Validation Accuracy')
plt.plot(train_scores, label='Training Accuracy')
# each point is one training-set size step, not one fold
plt.xlabel('Training-Set Size Step', fontsize=15)
plt.ylabel('Accuracy Score', fontsize=15)
plt.legend(loc=(.1,.1))

In [None]:
# KNN
# run the cross-validation once and reuse both results
train_scores, test_scores = learn_curve(neighbors)
plt.figure(figsize=(10,7)).suptitle("Learning Curve for KNN(5)", fontsize=15)
plt.plot(test_scores, label='Cross-Validation Accuracy')
plt.plot(train_scores, label='Training Accuracy')
# each point is one training-set size step, not one fold
plt.xlabel('Training-Set Size Step', fontsize=15)
plt.ylabel('Accuracy Score', fontsize=15)
plt.legend(loc=(.1,.1))


**Challenge 8**

This is a preview of many other classification algorithms that we will go over. Scikit-learn has the same interface for all of these, so you can use them exactly the same way as you used `LogisticRegression` and `KNeighborsClassifier`. Use each of these to classify your data and print the test accuracy of each:

Gaussian Naive Bayes

```
from sklearn.naive_bayes import GaussianNB
```

SVM (Support Vector Machine) Classifier

```
from sklearn.svm import SVC
```

Decision Tree

```
from sklearn.tree import DecisionTreeClassifier
```

Random Forest

```
from sklearn.ensemble import RandomForestClassifier
```

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
def acc_model(module, X, y):
    model = module()
    model.fit(X, y)
    return accuracy_score(Y_test, model.predict(X_test))

In [None]:
accuracy_GaussianNB = acc_model(GaussianNB, X_train, Y_train)
accuracy_SVC = acc_model(SVC, X_train, Y_train)
accuracy_DTC = acc_model(DecisionTreeClassifier, X_train, Y_train)
accuracy_RFC = acc_model(RandomForestClassifier, X_train, Y_train)

In [None]:
print('GaussianNB: ' + str(accuracy_GaussianNB))
print('SVC: ' + str(accuracy_SVC))
print('Decision Tree Classifier: ' + str(accuracy_DTC))
print('Random Forest Classifier: ' + str(accuracy_RFC))

**Challenge 9**

There is actually a way to do cross validation quickly to get your accuracy results for an algorithm, without separating training and test yourself:

```
from sklearn.model_selection import cross_val_score
```

Just like the `learning_curve` function, this takes a classifier object, `X` and `Y`. It returns accuracy (or whatever score you choose with the _scoring_ keyword argument). Of course, it will return a score for each cross-validation fold, so to get the generalized accuracy, you need to take the mean of what it returns.

Use this function to calculate the cross validation score of each of the classifiers you tried before.
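
A minimal sketch of the call, including the `scoring` keyword, is shown below. The data is a synthetic stand-in (`X_syn` and `y_syn` are made-up names and values), so the scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (hypothetical): 16 binary features,
# with the label simply copied from the first feature
rng = np.random.default_rng(1)
X_syn = rng.integers(0, 2, size=(200, 16))
y_syn = X_syn[:, 0]

# One accuracy value per fold; the mean summarizes them
fold_acc = cross_val_score(LogisticRegression(), X_syn, y_syn, cv=5, scoring="accuracy")
print(fold_acc, fold_acc.mean())
```

`cross_val_score` fits a fresh copy of the classifier on each fold, so the model does not need to be fitted beforehand.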


In [None]:
from sklearn.model_selection import cross_val_score

def cross_val(module):
    # cross_val_score fits the model on every fold itself,
    # so no separate fit call is needed
    model = module()
    return cross_val_score(model, X, y, cv=10).mean()


print("Cross Val Score for KNN: " + str(cross_val(KNeighborsClassifier)))
print("Cross Val Score for Logistic Regression: " + str(cross_val(linear_model.LogisticRegression)))
print("Cross Val Score for SVC: " + str(cross_val(SVC)))
print("Cross Val Score for GaussianNB: " + str(cross_val(GaussianNB)))
print("Cross Val Score for RandomForestClassifier: " + str(cross_val(RandomForestClassifier)))
print("Cross Val Score for DecisionTreeClassifier: " + str(cross_val(DecisionTreeClassifier)))


**Challenge 10**

Instead of 'democrat' or 'republican', can you predict the vote of a representative based on their other votes?

Reload the data from scratch. Convert y-->1, n-->0.

Choose one vote. Build a classifier (logistic regression or KNN), that uses the other votes (do not use the party as a feature) to predict if the vote will be 1 or 0.

Convert each ? to the mode of the column (if a representative has not voted, make their vote 1 if most others voted 1, make it 0 if most others voted 0).

Calculate the cross validation accuracy of your classifier for predicting how each representative will vote on the issue.



In [None]:
df = pd.read_csv(url, header=None, names=headers)
# the party must not be used as a feature, so drop it first
df = df.drop('class', axis=1)
# y --> 1, n --> 0, ? --> missing
df = df.replace({'y': 1, 'n': 0, '?': np.nan}).astype(float)
# fill each missing vote with the most common value in its column
df = df.fillna(df.mode().iloc[0])

X = df.drop('handicapped-infants', axis = 1)
y = df['handicapped-infants'].astype(int)

print("Cross Val Score for KNN: " + str(cross_val(KNeighborsClassifier)))
print("Cross Val Score for Logistic Regression: " + str(cross_val(linear_model.LogisticRegression)))
print("Cross Val Score for SVC: " + str(cross_val(SVC)))
print("Cross Val Score for GaussianNB: " + str(cross_val(GaussianNB)))
print("Cross Val Score for RandomForestClassifier: " + str(cross_val(RandomForestClassifier)))
print("Cross Val Score for DecisionTreeClassifier: " + str(cross_val(DecisionTreeClassifier)))

**Challenge 11**

Back to movie data! Choose one categorical feature to predict. I chose MPAA Rating, but genre, month, etc. are all decent choices. If you don't have any non-numeric features, you can make two bins out of a numeric one (like "Runtime>100 mins" and "Runtime<=100 mins").
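
Making two bins out of a numeric feature can be sketched as follows. The runtimes are made up and the column name `RuntimeBin` is a hypothetical choice, not a column of the movie file.

```python
import numpy as np
import pandas as pd

# Made-up runtimes in minutes (hypothetical values, not from the movie file)
movies = pd.DataFrame({"Runtime": [88, 100, 101, 135, 95]})

# Two bins: "long" if Runtime > 100, otherwise "short"
movies["RuntimeBin"] = np.where(movies["Runtime"] > 100, "long", "short")
print(movies["RuntimeBin"].value_counts())
```

The new column can then be used as the label to predict, in the same way as a rating or genre column.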

Make a bar graph of how many movies fall into each category. For example, with Ratings, show how many G, PG, PG-13, R movies there are, etc. (basically a histogram of your labels).

Predict your outcome variable (labels) using KNN and logistic regression. Calculate their accuracies.

Make a baseline stupid predictor that always predicts the label that is present the most in the data. Calculate its accuracy on a test set.

How much better do KNN and logistic regression do versus the baseline?

What are the coefficients of logistic regression? Which features affect the outcome how?

In [None]:
# import, praise my ingenuity
df = pd.read_csv("~/ds4/challenges/02-pandas/2013_movies.csv")

# train/test split
train, test = train_test_split(df, test_size=0.33, random_state=42)
X_train, X_test = train['Runtime'], test['Runtime']
Y_train, Y_test = train['Rating'], test['Rating']

# adapt inputs so it works
X_train = [[i] for i in X_train]
X_test = [[i] for i in X_test]

In [3]:
# accuracies
from collections import Counter
def acc_model(model, X, y):
    model.fit(X, y)
    return accuracy_score(Y_test, model.predict(X_test))

# baseline: always predict the most common label in the training set
Naive = [Counter(Y_train).most_common(1)[0][0]] * len(Y_test)
print("Accuracy Score for Guessing the Most Common Label: " + str(accuracy_score(Y_test, Naive)))
print("Accuracy Score for KNN(3): " + str(acc_model(KNeighborsClassifier(n_neighbors=3), X_train, Y_train)))
print("Accuracy Score for Logistic Regression: " + str(acc_model(linear_model.LogisticRegression(), X_train, Y_train)))


In [None]:
# coefficients
logit = linear_model.LogisticRegression()
logit.fit(X_train, Y_train)
# one row of coefficients per class (assumes more than two rating classes)
for label, coef in zip(logit.classes_, logit.coef_):
    print(label, coef)



**Challenge 12**

Now you are a classification master. The representative votes dataset only had 0s and 1s. Let's just swiftly tackle the breast cancer surgery data we talked about in class.

Get it from here: [Haberman Survival Dataset](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)

 * What are the mean and standard deviation of the age of all of the patients?

In [None]:
headers = [
    'Age',
    'YearOP',
    'Nodes',
    'Survived'
]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data"
df = pd.read_csv(url, header=None, names=headers)
print("Mean: " + str(df['Age'].mean()), " | Std. Dev.: " + str(df['Age'].std()))

 * What are the mean and standard deviation of the age of those patients that survived 5 or more years after surgery?

In [None]:
# Survived == 1 means the patient survived 5 years or longer
aged = df[df['Survived'] == 1]
print("Mean: " + str(aged['Age'].mean()), " | Std. Dev.: " + str(aged['Age'].std()))

 * What are the mean and standard deviation of the age of those patients who survived fewer than 5 years after surgery?


In [None]:
# Survived == 2 means the patient died within 5 years
aged = df[df['Survived'] == 2]
print("Mean: " + str(aged['Age'].mean()), " | Std. Dev.: " + str(aged['Age'].std()))
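
The two per-group summaries can also be computed in one step with `groupby`. The sketch below uses a made-up table (`toy` and its values are hypothetical) laid out like the `Age` and `Survived` columns.

```python
import pandas as pd

# Made-up rows laid out like the Age and Survived columns (hypothetical values)
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70, 80],
    "Survived": [1, 1, 1, 2, 2, 2],  # 1 = survived 5+ years, 2 = died within 5 years
})

# Mean and sample standard deviation of age within each outcome group
stats = toy.groupby("Survived")["Age"].agg(["mean", "std"])
print(stats)
```

Like `Series.std()`, the `std` aggregation here is the sample standard deviation, so the numbers match the filter-then-summarize approach.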

 * Plot a histogram of the ages side by side with a histogram of the number of axillary nodes.

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,10))
ax1.hist(df['Age'], bins=20)
ax2.hist(df['Nodes'], bins=20)

 * What is the earliest year of surgery in this dataset?
 * What is the most recent year of surgery?

In [None]:
# the year is stored as two digits (year minus 1900)
print(min(df['YearOP']) + 1900)
print(max(df['YearOP']) + 1900)

 * Use logistic regression to predict survival after 5 years. How well does your model do?


In [None]:
train, test = train_test_split(df, test_size=0.33, random_state=42)
X_train, X_test = train.drop('Survived', axis=1), test.drop('Survived', axis=1)
Y_train, Y_test = train['Survived'], test['Survived']

In [None]:
logit = linear_model.LogisticRegression()
logit.fit(X_train, Y_train)
print(accuracy_score(Y_test, logit.predict(X_test)))

 * What are the coefficients of logistic regression? Which features affect the outcome how?


In [None]:
print(logit.coef_)

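Apparently none of these features (number of axillary nodes, age, year of operation) exerts a great influence on whether or not the patient survives. Nodes is the most influential (more is better), year of the surgery is second (earlier is better) and, interestingly, age slightly affects survival: the older, the better. These directions assume that a positive coefficient means better survival. Since `Survived` is coded 1 = survived 5 years or longer and 2 = died within 5 years, and the coefficients of a two-class logistic regression describe the log-odds of the second class (2), a positive coefficient actually means a higher risk of dying within 5 years, so the signs should be checked before trusting the "better" readings.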
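
For a two-class problem, scikit-learn's `coef_` has a single row, and its signs refer to the second entry of `classes_`. The sketch below shows this on synthetic data; the columns `signal` and `noise` and the 1/2 labels are made up to mimic the label coding, and are not the Haberman data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features (hypothetical): only "signal" carries information
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
# Label 2 becomes more likely as "signal" grows
target = np.where(demo["signal"] + 0.3 * rng.normal(size=300) > 0, 2, 1)

model = LogisticRegression().fit(demo, target)

# coef_[0] holds the log-odds weights for model.classes_[1] (here: label 2)
weights = pd.Series(model.coef_[0], index=demo.columns)
print(model.classes_)
print(weights)
```

Pairing the coefficients with the column names this way makes it easier to see which feature each weight belongs to and which class a positive sign favors.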

 * Draw the learning curve for logistic regression in this case.


In [None]:
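# Logistic Regression
# learn_curve reads the globals X and y, so point them at the Haberman data first
X = df.drop('Survived', axis=1)
y = df['Survived']
train_scores, test_scores = learn_curve(logit)
plt.figure(figsize=(10,7)).suptitle("Learning Curve for Logistic Regression", fontsize=15)
plt.plot(test_scores, label='Cross-Validation Accuracy')
plt.plot(train_scores, label='Training Accuracy')
# each point is one training-set size step, not one fold
plt.xlabel('Training-Set Size Step', fontsize=15)
plt.ylabel('Accuracy Score', fontsize=15)
plt.legend(loc=(.1,.1))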