#Who is this guy talking at me?

<img src="./gignosko_contact.png" align="left">

#iPython Notebook
iPython is a great substitute for the standard python interpreter. It adds a lot of features, including tab completion, line numbers, a host of magic functions that do things like time the execution of your code, data visualization support and a lot more. 

iPython also has this handy notebook, which is a great way to:
* Quickly edit and run code
* Take rich notes using markdown so you can pass them on to others
* Insert mathematical symbols using LaTeX
* Visualize data inline

## iPython Notebook quick tour
* You write code or markdown in cells like this. 
* New cells default to code, but you can change it to a markdown cell using the drop down in the tool bar above.
* If you write code, you can run it with Shift + Enter/Return
* Each cell keeps its own history so if you change cell A, then change cell B, you can undo changes in A without upsetting B
* New cells are added below by default, but you can move them up or down with the arrows in the tool bar. 
* You can run a cell or interrupt a long running execution with the tool bar, too.


#Libraries
The libraries we'll be using for this presentation are:
* pandas. This is a data analytics library that gives us an easy, familiar way to wrangle our data into shape along with access to powerful data analysis and graphing capabilities by tying in to the other libraries we'll look at. 
* matplotlib. The standard for plotting in python. This library has a wide variety of built in plot types and makes it relatively easy to get your data into a plot or multiple plots. 
* numpy. A powerful library for scientific computing in python. The core of numpy revolves around its ndarray data structure, which allows for fast calculations for a wide range of mathematical uses.
* sklearn. Scikit Learn is a widely used Machine Learning package built on top of numpy, scipy and matplotlib. 

In [None]:
import json
import pandas as pd

These are the imports. You can put them all at the top like this, or you can scatter them throughout your code in various cells. Either way, once imported, the packages are available to every cell. 

The READ.me will give some background on this data set, where it came from and why. For now, let's start playing with data!

In [None]:
# you will need to change these paths to fit your directory structure
current_leg_df = pd.read_csv('/Users/gignosko/congress/bills/legislators-current.csv')
historic_leg_df = pd.read_csv('/Users/gignosko/congress/bills/legislators-historic.csv')
pd.options.display.max_columns = 50

Pandas is a great tool for manipulating and analyzing your data. The main data type in the pandas library is the Dataframe, a 2D structure of rows and columns, like a spreadsheet but...fun. 

Pandas Dataframes have a variety of semantics for accessing and slicing your data. For the most part, these resemble access semantics for common python data types such as dictionaries, lists and tuples, but they also have a bit of syntactic sugar to make access a little easier and a lot more powerful. 

The above code reads csv files into some Dataframes so we can manipulate them, something we'll do a lot of in a bit. And we've set the maximum column display size to 50 so we can see more than the default of 20 columns. 

Dataframe columns are actually a pandas Series datatype, which is a one dimensional data structure. We'll see them a bit in this, but for the most part we'll stick to the Dataframes.

In [None]:
current_leg_df

We can see that pandas has given us a neat data structure from the csv file. And in the next cells we can see how easy it is to slice a Dataframe and query for specific values. 

In [None]:
current_leg_df[:5]

This gives us a slice of the Dataframe, just like we might make of a list. 

In [None]:
current_leg_df[current_leg_df['last_name'] == 'Chu']

Here we filter the Dataframe using a reference to the Dataframe itself. The inner expression, current_leg_df['last_name'] == 'Chu', compares every value in the last_name column to 'Chu' and gives back a True or False for each row; passing that result back into the square brackets keeps only the rows where it is True. We'll see this method of slicing up the Dataframe a lot. 
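
To make the filtering step concrete, here is a minimal sketch on a made-up three-row Dataframe (the names and parties are invented for illustration and are not taken from the legislators file). The comparison produces a boolean Series, and indexing with it keeps only the matching rows.

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    'last_name': ['Chu', 'Smith', 'Chu'],
    'party': ['Democrat', 'Republican', 'Independent'],
})

# The comparison yields one True/False value per row
mask = toy_df['last_name'] == 'Chu'

# Indexing with the mask keeps only the rows where it is True
chu_df = toy_df[mask]
print(mask.tolist())
print(chu_df)
```

Notice that the rows that survive keep their original index labels.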

In [None]:
current_leg_df['party'].value_counts()

The value_counts() function returns a count of all the unique values in the party column. Pandas is built on top of numpy's ndarrays, which are multidimensional array objects that are very fast at performing array math. Here we've grabbed the 'party' Series and returned the counts of all the values, ordered greatest to least. 

pandas can also read from a variety of file types, including json. But, there are some catches. 

In [None]:
#you will need to change this path to fit your directory structure
loc = r'/Users/gignosko/congress/bills/votes/2013/h10/data.json'
single_vote_df = pd.read_json(loc)
single_vote_df

In [None]:
single_vote_df.shape

Something is definitely off and looking at the shape of the Dataframe we can start to make some guesses about what. Reading a json file into a Dataframe requires some type of orientation so pandas knows what are column headings versus indexes and how the values should be filled in. The default orientation for Dataframes is by column, so when pandas reads the json, the first level objects become the column headings for the Dataframe. The second level objects, if there are any, become the row indexes. Everything else becomes data.

Looking at the json file, there are 15 first level objects and 4 second level ones and the shape of the Dataframe lets us know that. So, pandas has taken the first level of the json and turned it into columns and then turned the second level into rows. This isn't what we want, but it gives us a bit of a better insight into the data we have. 
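
As a small sketch of that default column orientation, the snippet below reads a tiny nested json string (the keys are made up and far simpler than the real vote file). The first level keys end up as columns and the second level keys as the row index.

```python
import io
import pandas as pd

# A tiny stand-in for a nested json file; the keys are invented
raw = '{"Aye": {"count": 3, "required": 2}, "No": {"count": 1, "required": 2}}'

# StringIO lets read_json treat the string like an open file
nested_df = pd.read_json(io.StringIO(raw))
print(nested_df)
print(nested_df.shape)
```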

The problem here is the shape of our json file. Because we have all the No votes clustered into a list and all the Aye votes clustered in a separate list, we'll need to dig in a little deeper into the json file and split it out. To do that we'll need to manually manipulate the file.

In [None]:
with open(loc) as json_file:
    json_data = json.load(json_file)  
    yes_df = pd.DataFrame(json_data['votes']['Aye'])
yes_df

Here, we've read in the json file and passed only the Aye object in to the Dataframe. This gives us the Dataframe shape that we want, with the proper column headings and data. Since there were no nested objects to use as the indices, pandas defaulted to numbering from zero.

In [None]:
with open(loc) as json_file:
    json_data = json.load(json_file)  
    no_df = pd.DataFrame(json_data['votes']['No'])
no_df

We can do that for each of the 4 vote types, but that leaves us with 4 Dataframes, so we'll need a way to put them all into one workable Dataframe. Luckily, pandas has a concat() function that'll do this for us. Since we won't have a way to tell where the Aye Dataframe ends and the No begins once we concat them, we'll add a column that records the vote for each row, which is pretty easy to accomplish: we'll add a new column in the same way that we would add a new entry in a dictionary.
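
Here is a minimal sketch of that idea on two toy Dataframes (the voter names are invented). Assigning a single string to a new column repeats it on every row, so each row still carries its vote after the pieces are joined.

```python
import pandas as pd

# Two toy Dataframes standing in for the Aye and No voters (names invented)
aye_df = pd.DataFrame({'display_name': ['A. Able', 'B. Baker']})
nay_df = pd.DataFrame({'display_name': ['C. Cole']})

# Adding a column looks like adding a key to a dictionary;
# the single string is repeated on every row
aye_df['vote'] = 'Aye'
nay_df['vote'] = 'No'

both_df = pd.concat([aye_df, nay_df])
print(both_df)
```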

In [None]:
no_df['vote'] = 'No'
yes_df['vote'] = 'Aye'
yes_df

In [None]:
yes_no_df = pd.concat([no_df, yes_df])
yes_no_df

To get to the final Dataframe we need, we can keep using the techniques we've built up so far, grabbing the pieces of data we need from the legislators Dataframes, the votes Dataframes or a bills Dataframe that we haven't looked at. Since we're really more interested in the techniques than the actual data, I'll just give you the script of how I built up the final Dataframe using these techniques.

If you've downloaded the whole dataset from GovTrack, then you can run this bit of code if you like, but it's very I/O intensive and takes a while to run, so you can kick it off and watch an episode of Doctor Who. 

Or, you can just skip it and go to the next cell, where we load the pickled Dataframe that should have come down when you cloned the repo. pandas lets you save files in a variety of formats such as json or csv, so you could share it between languages or applications if needed, but since we're sticking with python, I've decided to pickle it. A pickle is a python object serialization format. It's probably the best choice if you don't plan on sharing the file with other languages since it's efficient for python objects. 
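
If you'd like to see the pickle round trip on its own, here is a small sketch that writes a toy Dataframe to a temporary directory and reads it back. The file name and the data are made up for the example.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'bill': ['hr1', 'hr2'], 'votes': [10, 20]})

# Write to a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored Dataframe has the same values, columns and index
print(restored.equals(df))
```

One caution: unpickling can run arbitrary code, so only load pickle files that come from a source you trust.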

In [None]:
import os
home = r'/Users/gignosko/pytn/bills/'  # you will need to change this for your directory path
votes_dir = home + 'votes'
bills_dir = home + 'bills'
votes_list = []
for dirpath, dirnames, filenames in os.walk(votes_dir):
    if not filenames:
        continue
    # read the first file in each vote directory
    with open(os.path.join(dirpath, filenames[0])) as json_file:
        all_data = json.load(json_file)

    # only keep votes on bills, not on amendments
    if ('bill' in all_data) and ('amendment' not in all_data):
        bill_type = all_data['bill']['type']
        bill_number = bill_type + str(all_data['bill']['number'])
        bill_file = bills_dir + '/' + bill_type + '/' + bill_number + '/data.json'
        with open(bill_file) as f:
            bill_json = json.load(f)
        sponsor_id = int(bill_json['sponsor']['thomas_id'])
        # look for the sponsor among the current legislators first, then the historic ones
        sponsor_party = current_leg_df[current_leg_df['thomas_id'] == sponsor_id]['party']
        if sponsor_party.values.size == 0:
            sponsor_party = historic_leg_df[historic_leg_df['thomas_id'] == sponsor_id]['party']
        vote_id = all_data['vote_id']
        for k, v in all_data['votes'].items():  # items() works in both python 2 and 3
            df = pd.DataFrame(v)
            if sponsor_party.values.size > 0:
                # keep only the first character of the party string
                df['sponsor_party'] = sponsor_party.values[0][0]
            else:
                df['sponsor_party'] = 'U'  # sponsor not found
            df['vote'] = k
            df['bill'] = bill_number
            df['vote_id'] = vote_id
            votes_list.append(df)
# concatenate once after the loop instead of on every iteration
votes_df = pd.concat(votes_list) if votes_list else pd.DataFrame()
votes_df.to_pickle('/Users/gignosko/PyTN/dataframe.pkl')  # change this path, too.

In [None]:
pickle_file = ('/Users/gignosko/PyOhio/dataframe.pkl')
total_df = pd.read_pickle(pickle_file)
total_df

You'll notice here that the index restarts because each of the Dataframes that we started with has its own index and pandas doesn't automatically re-index on concatenation, but that's easy to fix if we need to.
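
As a sketch of what is going on with the index, the toy example below (data invented) shows the repeated labels after a concat and two common ways to get a fresh index. Both number from zero, which is just one option among several.

```python
import pandas as pd

first = pd.DataFrame({'vote': ['Aye', 'Aye']})
second = pd.DataFrame({'vote': ['No']})

# Each piece keeps its own labels, so 0 shows up twice
stacked = pd.concat([first, second])

# Two common ways to get a fresh 0..n-1 index
renumbered = pd.concat([first, second], ignore_index=True)
also_renumbered = stacked.reset_index(drop=True)
print(stacked.index.tolist(), renumbered.index.tolist())
```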

In [None]:
total_df.index = range(1, len(total_df) +1)
total_df

Now that we've started munging our data into something a bit more useful, let's start looking at it. 

pandas lets you group your Dataframes by columns, much like you could in a pivot table. We can use groupby() to group our columns and then use count() to get some counts of the values. Let's play around with this a bit and see how we can drill in to get a better idea of the data.

In [None]:
total_df.groupby(['party'])['vote'].describe()

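You can see here that we've grouped by the party column and asked pandas to describe() the vote column, which summarizes it for each party (the count, the number of unique values, the most common value and how often it appears). If all we want is the count, we can call count() instead, as in the next cell; the outcome is then a Series with party as the index and the counts as the values. The groupby() function always returns a GroupBy object, which has a variety of functions you can call to return a Series with the grouping as the index.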

In [None]:
total_df.groupby(['party'])['vote'].count()

In [None]:
total_df.groupby(['bill', 'vote_id','party'])['vote'].count()

In this one, we've dug a little deeper into the groupings so we can see how many votes each party cast for each bill and vote_id. Since this is a Series, everything other than the count is the index, and we can see more about that a little further below. 
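
Here is a small sketch of a multi-key groupby on invented data, with fewer keys than the real cell so the result fits on a few lines. The grouping keys become a two-level index and the counts become the values.

```python
import pandas as pd

toy_votes = pd.DataFrame({
    'vote_id': ['h1', 'h1', 'h1', 'h2', 'h2'],
    'party':   ['D',  'D',  'R',  'D',  'R'],
    'vote':    ['Aye', 'No', 'No', 'Aye', 'Aye'],
})

# One count per (vote_id, party) pair; the pairs become a two-level index
per_party = toy_votes.groupby(['vote_id', 'party'])['vote'].count()
print(per_party)

# A tuple of labels picks out a single count
print(per_party.loc[('h1', 'D')])
```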

In [None]:
vote_counts = total_df.groupby(['bill', 'vote_id', 'sponsor_party','party', 'vote'])['vote'].count()
vote_counts

In [None]:
print(vote_counts.index[1], vote_counts.values[1])

We can see here how to grab the index of a single row of a Series by position, in this case position 1, which is the second row because numbering starts at zero; the same type of syntax would work for a Dataframe. In this case, the index is hierarchical or multilevel. We can also grab the values by the same number. 

As it stands, the votes all exist in individual rows, but we want to pull them around and have them as column headings so we can ultimately see all the Aye votes in one column, all the No votes in another, and so on. This hierarchical index is something we can grab and manipulate to let us reshape the data into something more useful. 

In [None]:
new_df = total_df.loc[:, ['bill', 'vote_id','party', 'vote']]
new_df

First, we create a new Dataframe to work with because we only need a few of the columns from the total_df. To get specific columns, we can use the Dataframe.loc function, which grabs the rows or columns by label name. So, total_df.loc[:,...] is a slice that says grab me all the rows while ..., ['bill', 'vote_id','party', 'vote']] says grab me those particular labels. This makes our new Dataframe.

As we said above, a groupby() function returns a GroupBy object. Transform applies a function to each group and returns a result that has the same index as the Dataframe being grouped. This is different from calling the count() function directly, which collapses each group down to a single row. 

So, we pass the count function by name and add a new column, a Series that has the same index as the Dataframe we grouped. That Series has as its value a count of the votes by vote_id, party and vote. You can see that each row has the same count until one of those three keys changes. 
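
The difference between count() and transform('count') is easier to see side by side. This is a minimal sketch on invented data: count() gives one row per group, while transform repeats the group's count on every original row.

```python
import pandas as pd

toy = pd.DataFrame({
    'party': ['D', 'D', 'R'],
    'vote':  ['Aye', 'No', 'No'],
})

grouped = toy.groupby('party')['vote']

# count() collapses each group to a single row
per_group = grouped.count()

# transform('count') repeats the group's count on every original row
per_row = grouped.transform('count')
print(per_group.tolist(), per_row.tolist())
```

Because the transformed result lines up with the original rows, it can be assigned straight back as a new column.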

In [None]:
# select a single column so transform returns a Series rather than a Dataframe
new_df['counts'] = new_df.groupby(['vote_id','party','vote'])['bill'].transform('count')
new_df

In [None]:
temp_s1 = new_df.groupby(['vote_id','bill', 'party','vote'])['vote'].count()
temp_s1

We have the count values that we want now in that Series, but we really want to get everything flipped around so the vote values in the index get pushed up to be column headings. We can do that with the unstack() function, which takes a Series with a hierarchical index and returns a Dataframe with the innermost index level (here, vote) as column headings. 

In [None]:
temp_df2 = temp_s1.unstack()
temp_df2

To recap, we created a Series from the Dataframe, the same as we did above, so that the vote_id, bill, party and vote all make up a unique index with the counts as the data values. Then, we unstacked the Series, which pulls the innermost level of the index (vote) out and turns each of its values into a column, taking the counts with it and putting them in as data. 

But, we have a lot of NaN values in this Dataframe and that's going to mess us up later so let's change those to 0. pandas has a function fillna() that works on both Dataframes and Series to fill in the NaN values with whatever value you pass as a function parameter. By default the fillna() function returns a copy of the Dataframe as a new object, but we can tell pandas that we want to fill the values on the existing Dataframe by passing the inplace parameter as True.
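
Here is a small sketch of the unstack-then-fill pattern on invented data. One party never cast an Aye vote in this toy example, so that combination shows up as NaN until we fill it.

```python
import pandas as pd

toy = pd.DataFrame({
    'party': ['D', 'D', 'R'],
    'vote':  ['Aye', 'No', 'No'],
})
counts = toy.groupby(['party', 'vote'])['vote'].count()

# unstack() moves the innermost index level (vote) into the columns;
# combinations that never occurred show up as NaN
wide = counts.unstack()

# fillna returns a new Dataframe unless inplace=True is passed
filled = wide.fillna(0)
print(filled)
```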

In [None]:
temp_df2.fillna(0, inplace=True)
temp_df2

Ok, let's start visualizing!

In [None]:
%matplotlib inline
temp_df2[:10].plot(kind='bar', stacked=True)

matplotlib is the most widely used plotting library in python and the standard tool for plotting when working with pandas because it's integrated into the pandas library. Above, you can see that we've called the plot() function on a slice of our Dataframe, passed in the kind of plot we want and since it's a bar, told it to stack the values from each column. You can see that we've let pandas decide where to put the legend and it's obscuring our data. If we import matplotlib and manually construct this plot, then we can gain a bit more control over things. We'll do some of that below. 

Also notice the %matplotlib inline.  This tells iPython notebook to not only return the plot, but to also show it inline. Without that line, the function would return the plot object, but we wouldn't see the pretty pictures. 

As we're looking at this plot, something obvious stands out. We have a column for Yea and a column for Aye; likewise, we have a column for No and a column for Nay. Welcome to the Federal Government! Each House of Congress records its votes differently, so we get two columns for the same answer. Let's fix that. 

In [None]:
temp_df2['Y'] = temp_df2['Aye'].add(temp_df2['Yea'])
temp_df2['N'] = temp_df2['No'].add(temp_df2['Nay'])
temp_df2 = temp_df2.drop(['Aye', 'Yea', 'Nay', 'No', 'Not Voting', 'Present'], axis=1)
temp_df2

First, we created a new column labeled "Y" and took the values of the Aye column and added to them the values of the Yea column. We then did the same for the "N" column to collect all the no votes and then we dropped all the columns we weren't interested in. Dropping requires an axis so pandas knows if you want to drop rows or columns, so we've passed in the axis=1 parameter. 

In [None]:
temp_df2[:10].plot(kind='barh', stacked=True)

This looks better. Pandas will only try to plot columns with numeric values and now we only have two of those. We've also turned the plot into a horizontal bar plot, which makes the y-axis labels easier to read. 

Below, we take a look at how to manually create plots by calling matplotlib directly.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
temp_df2['sum'] = temp_df2.sum(axis=1)
x = temp_df2['sum'].values 
y = np.square(x)
plt.ylabel('Sum of all votes, squared')
plt.xlabel('Sum of all votes')
plt.scatter(x=x, y=y)

Here we've added a sum column by calling the sum() function and telling it to work across the columns. We've also imported matplotlib so we'll have better control over the plots and numpy so we can use some of its math functions. 

We set the values of the sum column as the x axis and set y to the squares of those values. Notice we pulled the x-axis values out of a Series with .values. pandas Series and Dataframes are both built on top of numpy's ndarray data type, so what we've passed as the x-axis is an ndarray. The y-axis is even more impressive: instead of looping over the values ourselves, we hand the whole array to np.square() and numpy does the calculation for every element at once! This way we get a nice curve in our scatterplot. We've also manually added labels. Although matplotlib took care of the tick marks on the x and y axes for us, we could have given ranges for those and adjusted the scale of the plot if we had wanted. 

In [None]:
from scipy import stats
x = temp_df2['Y']
y = temp_df2['N']
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
yp = np.polyval([slope, intercept],x)
plt.plot(x, yp)
plt.scatter(x,y)

Using scipy's stats library we can call the linregress() function, which gives us back a tuple with everything we could want from a linear regression. We can then take the slope and intercept and pass them in to the numpy polyval() function, which evaluates a polynomial at specific values. Here, the polynomial is the well-known y = mx+b where m is the slope and b is the intercept. It gets evaluated for y with x being all the values of the 'Y' column from the DataFrame. We also did a scatter plot of the 'Y' and 'N' values and plotted both the scatter and the linear regression in the same plot. 
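
To show what the slope, the intercept and polyval() are doing, here is a sketch that fits a straight line by hand with the textbook least squares formulas, using only numpy. It is an illustration on made-up points, not a copy of how scipy computes its result.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # these points sit exactly on y = 2x + 1

# Closed-form least squares for a straight line
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# polyval evaluates slope * x + intercept at every x
fitted = np.polyval([slope, intercept], x)
print(slope, intercept, fitted)
```

The coefficient list passed to polyval() goes from the highest power down, which is why the slope comes before the intercept.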

This visualization shows a distinct split in votes, with a group of them in the lower left, and a much larger grouping of them above. This is because we've mixed House and Senate votes.  The House will have up to 435 votes per vote_id, while the Senate could have no more than 100.  This split in the groups also brings the line of the regression down quite a bit. 

Below we see an interesting grouping of plots called a scatter matrix, which does a scatter plot of every column over every other column. Since the plots on the diagonal of this matrix would be a straight line (plotting a column over itself), the scatter_matrix function takes a diagonal parameter that lets us tell it what we want on the diagonal. In this case we're plotting the kernel density estimate, which is roughly equivalent to a histogram, but with a smooth curve. 

In [None]:
from pandas.plotting import scatter_matrix  # older pandas versions: pandas.tools.plotting
scatter_matrix(temp_df2, alpha=0.2, figsize=(10,10), diagonal='kde')

#Machine Learning
In *A Brief History of Time* Stephen Hawking said that he was warned that for each equation he added, the readership would drop by half. I certainly hope that's not the case with this audience, but just in case, we'll just look at one equation.

One application of machine learning is to help us categorize new data and this categorization falls into roughly two types: supervised learning, where we know the classification of the training data before we train, and unsupervised learning, where we let the learning algorithm cluster the data and figure out classifications. We're going to look at a supervised method, which means that we will train the machine using input for which we already know the output. For input we'll use the party of the vote sponsor along with the party of the voter and the output is the vote each voter cast.  We'll be using a K-nearest neighbor classifier for our learning algorithm and we'll go over it in a bit more detail below.  

## K Nearest Neighbors
KNN can be used for either classification or for regression. We'll be using it to classify the votes from our data set, first giving the algorithm a portion of the data to train on and another portion to test and see how well we've trained it. As the name of the algorithm implies, KNN classifies new input based on its "distance" from its k closest neighbors. The distance formula most often used is the Minkowski formula, the one equation we'll look at (I promise). 

Minkowski distance

$$X = (x_1, x_2, ..., x_n) \quad Y = (y_1, y_2, ..., y_n)$$

$$D = {\left({\sum_{i=1}^n {\left|x_i - y_i\right|}^p}\right)}^{1/p}$$

The X and Y here are points like those on a Cartesian plane, but expanded out to as many dimensions as we need, with X holding all the coordinates of the point to test and Y holding all the coordinates of the neighboring point, for which we already know the classification. The formula takes the absolute value of the difference between the first dimension of the two points, raises that to p, then adds in the second dimension and so on through all the dimensions. It then raises that whole sum to 1/p, which is the same thing as taking the p-th root. The default for p is 2, which makes this the equivalent of the Euclidean distance, for all the math geeks out there that aren't afraid of an equation or two. 
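
The formula translates almost word for word into python. This is a minimal sketch; the function name minkowski_distance is made up for the example and is not part of any library we use here.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum |x_i - y_i|^p over every dimension, then take the p-th root
    total = sum(abs(a - b) ** p for a, b in zip(x, y))
    return total ** (1.0 / p)

# p=2 is the Euclidean distance, p=1 the "city block" distance
euclidean = minkowski_distance((0, 0), (3, 4), p=2)
manhattan = minkowski_distance((0, 0), (3, 4), p=1)
print(euclidean, manhattan)
```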

K is the number of neighbors we want to survey to determine the classification for our new point; if K = 1 then the class of the test sample is the same as the class of the nearest neighbor and for K > 1, the class of the test sample is the most common class among those K neighbors.
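
To see the voting step in action, here is a tiny hand-rolled classifier. It is only a sketch of the idea on invented points, not how scikit-learn implements it, and the function name knn_classify is made up for the example.

```python
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    def distance(a, b):
        # Euclidean distance (Minkowski with p=2)
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    # Sort training examples by distance to the new point, keep the k closest
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: distance(pair[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ['No', 'No', 'No', 'Aye', 'Aye']
near_origin = knn_classify(points, labels, (0.5, 0.5), k=3)
near_fives = knn_classify(points, labels, (5, 5.5), k=1)
print(near_origin, near_fives)
```

With an even K the vote can tie, and this sketch simply takes whichever tied class it counted first, so an odd K is often the easier choice when there are only two classes.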

One thing to watch out for with this classifier is that KNN is susceptible to noise because it muddles the distance formula. Noise is any dimension that is irrelevant to the decision and in this case the bill number, the vote number, the voter id, and any names are all noisy because they don't add to the prediction of future votes in our case. So, let's rework the data a bit more to get rid of the noise.

In training a classifier, we need the data that is used as the input and the data that is the actual known classification so we can give those separately to the classifier. We can feed the KNN classifier the data in an "array like" structure, which can be the Dataframe, but the target (or the known outputs, which in this case is the actual vote that was cast) needs to be an actual array, so we'll have to split that off. To get the data into the proper form for the classifier, we're going to go back to our total_df, but we need to drop some un-needed columns, clean out some data and change what's left to numeric data.

Luckily for us, pandas' Dataframes and Series are both built on top of numpy's ndarray so getting the data in the right format is straightforward. An ndarray is different from the standard python array in that the ndarray is an n-dimensional array, rather than a 1 dimensional array and ndarrays give us access to a lot of numpy's mathematical functions. 

For now, what we need to do is pull the values of the target (the actual vote cast) out of the Dataframe and they will automatically be an ndarray; however, the values will be object datatypes, so we'll need to convert them to int because that's what the classifier needs. The input data is typically represented as floats because it usually needs to be normalized to a value between 0 and 1; since we only have two values per input (Democrat or Republican), we can just go ahead and make them 0 and 1 for simplicity.  And, since the data can be in a Dataframe, we'll just change it in line.

Let's start by creating our final Dataframe and dropping the unnecessary columns, which is an easy task in pandas. We can create a list of column names we'd like to drop, then use that list as a parameter to the drop() function; we will also need to tell pandas if we are dropping columns or rows, so we'll set the axis parameter to 1 for columns. For a bit of simplicity, we want to make sure we only have Democrats and Republicans (Sorry, Senator Sanders) and Yes or No votes.  We can tell pandas to re-create the Dataframe with vote values not equal to Present or Not Voting and with party and sponsor_party without I's. 

In [None]:
drop_list = ['bill', 'display_name', 'first_name', 'id','last_name', 'state', 'vote_id']
final_df = total_df.drop(drop_list, axis=1)
#final_df['vote'].value_counts()
final_df

In [None]:
change_mapping = {'R': 0, 'D': 1, 'I': 2, 'Nay': 0, 'No': 0, 'Yea': 1 , 'Aye': 1, 'Yes': 1, 'Not Voting': 2, 'Present':3}
final_df['sponsor_party'] = final_df['sponsor_party'].map(change_mapping)
final_df['party'] = final_df['party'].map(change_mapping)
final_df['vote'] = final_df['vote'].map(change_mapping)
final_df

What we've just done is use pandas' map function to map a dictionary across all the values in the various Series of our Dataframe. Pandas takes the value in a cell, looks it up in the change_mapping dict as a key and then replaces it with the corresponding value. We could also have passed in a function rather than a dict and pandas would use the value as the parameter to the function and the return value as the new cell value. 
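
Here is a short sketch of both flavors of map() on an invented Series. One thing it shows that is easy to miss: a value that is not a key in the dict comes back as NaN rather than raising an error.

```python
import pandas as pd

votes = pd.Series(['Aye', 'Nay', 'Yea', 'Present'])

# Mapping with a dict: values missing from the dict become NaN
yes_no = {'Aye': 1, 'Yea': 1, 'No': 0, 'Nay': 0}
mapped_by_dict = votes.map(yes_no)

# Mapping with a function: it is called once per value
mapped_by_func = votes.map(lambda v: v.lower())
print(mapped_by_dict.tolist(), mapped_by_func.tolist())
```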

In [None]:
target = final_df.take([2], axis=1)
target = target.values
target = target.astype('int')
target = target.ravel()
target


Above, we've created the target by taking a single column out of our Dataframe, grabbing the values and turning them into ints. What we get back at that point is a two-dimensional array, or an array of arrays, where each value is in its own array, all of which are inside an outer array, so we used the ravel() function to flatten it into a one-dimensional array.
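
A quick sketch of what ravel() does to the shape, using a hand-written array in place of the real vote column:

```python
import numpy as np

# A single column pulled out of a Dataframe comes back with shape (n, 1)
column = np.array([[1], [0], [1]])

# ravel() flattens it to shape (n,)
flat = column.ravel()
print(column.shape, flat.shape)
```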

In [None]:
import pickle
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
from sklearn import model_selection as cv  # older scikit-learn releases called this module cross_validation
data = final_df.drop('vote', axis=1)
data_train, data_test, target_train, target_test = cv.train_test_split(data, target, test_size=0.1)
#simple_knn = knn.fit(data_train, target_train)
simple_knn = pickle.load(open('/Users/gignosko/PyOhio/simple_knn.pkl', 'rb'))


In [None]:
import pickle
pickle.dump(simple_knn, open('/Users/gignosko/PyOhio/simple_knn.pkl', 'wb'))

In [None]:
pickle.dump(predicted, open('/Users/gignosko/PyOhio/simple_predicted.pkl', 'wb'))

In [None]:
pickle.dump(simple_score, open('/Users/gignosko/PyOhio/simple_score.pkl', 'wb'))

In [None]:
predicted = pickle.load(open('/Users/gignosko/PyOhio/simple_predicted.pkl', 'rb'))
#predicted = simple_knn.predict(data_test)
print( "Prediction " + str(predicted))
print( "Actual     " + str(target_test))


In [None]:
#simple_score = simple_knn.score(data_test, target_test)
#pickle.dump(simple_score, open('/Users/gignosko/PyOhio/simple_score.pkl', 'wb'))
simple_score = pickle.load(open('/Users/gignosko/PyOhio/simple_score.pkl', 'rb'))
print( "Accuracy   " + str(simple_score*100) + "%")

In [None]:
print(simple_knn.kneighbors(data_test[:1], 5))

In [None]:
print(data_train.iloc[[290]])
print(data_test[:1])

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = predicted
cm = confusion_matrix(target_test, y_pred)
print(cm)
import matplotlib.pylab as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
state_df = pd.read_csv('/Users/gignosko/PyOhio/state_table.csv')
state_df

In [None]:
region_dict = {k: ''.join(v["census_region_name"].tolist()) for k,v in state_df.groupby("abbreviation")}
region_dict
final_df['region'] = total_df['state'].map(region_dict)
final_df


In [None]:
region_mapping = {'South': 1, 'West': 2, 'Midwest': 3, 'Northeast': 4, }
final_df['region'] = final_df['region'].map(region_mapping)
final_df

In [None]:
data = final_df.drop('vote', axis=1)
data_train, data_test, target_train, target_test = cv.train_test_split(data, target, test_size=0.1)
#noisy_knn = knn.fit(data_train, target_train)
#pickle.dump(noisy_knn, open('/Users/gignosko/PyOhio/noisy_knn.pkl', 'wb'))
noisy_knn = pickle.load(open('/Users/gignosko/PyOhio/noisy_knn.pkl', 'rb'))

In [None]:
#noisy_predicted = noisy_knn.predict(data_test)
noisy_predicted = pickle.load(open('/Users/gignosko/PyOhio/noisy_predicted.pkl', 'rb'))
print( "Prediction " + str(noisy_predicted))
print( "Actual     " + str(target_test))

In [None]:
pickle.dump(noisy_predicted, open('/Users/gignosko/PyOhio/noisy_predicted.pkl', 'wb'))

In [None]:
#noisy_score = noisy_knn.score(data_test, target_test)
#pickle.dump(noisy_score, open('/Users/gignosko/PyOhio/noisy_score.pkl', 'wb'))
noisy_score = pickle.load(open('/Users/gignosko/PyOhio/noisy_score.pkl', 'rb'))
print( "Accuracy   " + str(noisy_score*100) + "%")

State stuff below

Pandas Series have a map function that will take a function or a dict and apply it across all the rows in the Series. We can use this to create a mapping from the state abbreviation to the region for that state and add that as a new column in our final_df Dataframe. To get the dict for the mapping takes a little bit of magic, though. The 'key' for the dictionary comprehension is straightforward enough; when we groupby('abbreviation'), the state abbreviation becomes the index and gets returned as the key in the for...in syntax. The 'value' is a little harder. Since everything other than the abbreviation gets returned as the value, we need to dig into that and get just the 'census_region_name'. But, since this returns a Series, in this context we get a lot of unneeded information, such as the Name and the index.

To strip this away we use the Series tolist(). The documentation for tolist() is a bit, um...terse. "Convert Series to a nested list". What it ends up doing here is taking the value and putting it into a list, so the documentation is terse, but spot on. To get it out of a list and into a string we use the pythonic ''.join([list]) form. We can then use the resulting dict of {state: region} and map it across the state column of the Dataframe to add a new column with the region. Whew!

We can see that we have the same problem as earlier, which is both Yea and Aye for yes and Nay or No for no. We fixed it before pretty easily, but now we need to make two changes: combine all the yes/no values and convert them from text to numbers. pandas lets us map a dictionary to a Series, so we can create a simple dictionary that spells out our changes and then map it to the individual Series in the Dataframe. 

The problem above is that we have no way to get back to the full original entry.

Also, 50% isn't great. We're missing a lot of useful data such as the content of the bill, i.e., is it a controversial bill (healthcare, gun control, etc) or a no brainer (naming a highway after a former president). 

The confusion matrix is a way of plotting the actual test outcomes versus the predicted outcomes. If you look at the labels, the True label is plotted on the y-axis and the Predicted label is on the x-axis. Looking along the diagonal we see in the top left the count of where the classifier correctly classified the Predicted 0 to be the same as the True 0 and in the bottom right we see where the classifier correctly classified the Predicted 1 as the True 1. The other blocks show where the classifier misclassified (or confused) one class with the other. 
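
If it helps to see how those four blocks get filled in, here is a hand-rolled sketch that counts them for a few invented predictions. The function name confusion_counts is made up for the example; the notebook itself uses scikit-learn's confusion_matrix.

```python
def confusion_counts(actual, predicted, labels=(0, 1)):
    """Rows are the true labels, columns the predicted labels."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[position[truth]][position[guess]] += 1
    return matrix

actual    = [0, 0, 1, 1, 1]
predicted = [0, 1, 1, 1, 0]
cm = confusion_counts(actual, predicted)
print(cm)

# The diagonal holds the correct predictions
correct = cm[0][0] + cm[1][1]
print(correct / float(len(actual)))
```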

#Below is old!

In [None]:
data = final_df.drop('vote', axis=1)

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

# Perceptron capped at 10 passes over the training data, learning rate 0.002
per = Perceptron(max_iter=10, eta0=0.002)
# Hold out part of the data (25% by default) for testing
data_train, data_test, target_train, target_test = train_test_split(data, target)
per.fit(data_train, target_train)
print("Prediction " + str(per.predict(data_test)))
print("Actual     " + str(target_test))
print("Accuracy   " + str(per.score(data_test, target_test)*100) + "%")

Here we've done several things. After our imports we instantiate a Perceptron with the number of iterations set to 10 and a learning rate (eta0) of 0.002; in theory, a well-chosen learning rate can reduce the number of iterations the Perceptron needs. Often we have a single set of data that we want to split into a training set and a testing set. sklearn's train_test_split function does that for us, returning four arrays: the training and testing data, followed by the training and testing targets. The fit() function does the actual training of the Perceptron. It takes a data parameter to run through the Perceptron algorithm and a target parameter holding the known values for the training set. Taken together, these teach the Perceptron how to handle new data so it can be applied to data with unknown targets. 

Next, we run the predict() function on the test data. This takes the trained Perceptron (which is essentially just the set of weights learned during the training iterations) and uses it to predict the output for each row of the test data. When we print out the Accuracy, we can see that we haven't done very well at all. We'd love to have an accuracy above 80% to feel confident about our Perceptron. 
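To make fit() and predict() a little less of a black box, here is a rough sketch of the classic perceptron learning rule in plain Python. This is not sklearn's actual implementation, and the function names and the toy data are made up for illustration. The learning rate is set to 0.5 rather than 0.002 only to keep the arithmetic easy to follow.

```python
def predict_one(row, weights, bias):
    """Return 1 if the weighted sum of the inputs is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, row)) + bias
    return 1 if total > 0 else 0


def train_perceptron(rows, targets, n_iter=10, eta0=0.5):
    """Learn weights by nudging them whenever a guess is wrong."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(n_iter):
        for row, target in zip(rows, targets):
            # error is 0 for a correct guess, so nothing changes;
            # otherwise the weights move toward the right answer,
            # scaled by the learning rate eta0.
            error = target - predict_one(row, weights, bias)
            weights = [w + eta0 * error * x for w, x in zip(weights, row)]
            bias += eta0 * error
    return weights, bias


# Made-up data: the target is 1 only when both inputs are 1.
rows = [[0, 0], [0, 1], [1, 0], [1, 1]]
targets = [0, 0, 0, 1]

weights, bias = train_perceptron(rows, targets)
predictions = [predict_one(row, weights, bias) for row in rows]

# Accuracy is just the fraction of predictions that match the targets.
accuracy = sum(p == t for p, t in zip(predictions, targets)) / float(len(targets))
print("Prediction " + str(predictions))
print("Accuracy   " + str(accuracy * 100) + "%")
```

This toy data is linearly separable, so the weights settle on a perfect answer well within 10 passes. Real voting data gives no such guarantee, which is part of why the accuracy above is so much lower.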

Most likely, we just haven't set up a very good model. We've cut a lot of corners to get something going and that's probably tripped us up. First, we've put both the House and the Senate voting data together into one DataFrame, and that's going to skew things because the Senate can record only 100 votes per vote id, while the House can record up to 435. Second, we looked at all bills evenly, meaning we didn't take out procedural votes such as setting the agenda or non-divisive votes such as naming a highway after our favorite Doctor (10, right?). We really need a subject matter expert to help comb through the bills and guide us to the more inflammatory votes, but as we progress, maybe we can start to find better ways to classify this data. 

Let's look at one last way of evaluating the Perceptron that you'll likely run into through data analysis and machine learning: the confusion matrix.

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = per.predict(data_test)
cm = confusion_matrix(target_test, y_pred)
print(cm)
import matplotlib.pyplot as plt
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

The confusion matrix is a way of plotting the actual test outcomes versus the predicted outcomes. If you look at the labels, the True label is plotted on the y-axis and the Predicted label is on the x-axis. Looking along the diagonal, the top left shows the count of cases where the Perceptron correctly predicted 0 when the True label was 0, and the bottom right shows the count where it correctly predicted 1 when the True label was 1. The other blocks show where the Perceptron misclassified (or confused) one class with the other. 
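To make the layout concrete, here is a minimal sketch that builds the same kind of 2x2 table by hand in plain Python. The labels are made up for illustration. The rows follow the True label and the columns follow the Predicted label, which is the same layout sklearn's confusion_matrix uses.

```python
# Made-up labels for eight test rows.
true_labels = [0, 0, 0, 1, 1, 1, 1, 0]
pred_labels = [0, 1, 0, 1, 0, 1, 1, 0]

# cm[true][predicted]: rows are the True label, columns the Predicted label.
cm = [[0, 0], [0, 0]]
for actual, predicted in zip(true_labels, pred_labels):
    cm[actual][predicted] += 1

print(cm)
```

This prints `[[3, 1], [1, 3]]`: the diagonal holds the six correct predictions (three 0s and three 1s) and the off-diagonal blocks hold the two cases where one class was confused with the other.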

There's a lot here, and we've really just scratched the surface of data analysis, machine learning and the Python tools we can use for each. In the near future I hope to tackle a type of unsupervised learning: clustering. 

If you have any questions or comments, please feel free to reach out to me through comments, pull requests or on twitter: @\_gignosko\_

Thanks!