# Farhad Kazemi - Predicting Supreme Court Decision Making

### Overview
In this project, we create data from past legal cases that have been resolved by the courts. We use that data to predict the outcomes of cases that might come before the courts in the future. We'd like to approach this problem. 

First, I'm attaching an example of a redacted and reduced dataset containing ~1500 cases and ~15 variables of interest that we have coded by hand. The classification problem we have here is a binary classification problem -- the two possible outcomes are liable or not liable -- and we would like to classify the outcome of future cases.

Second, I'm including links to five recent cases in a different area of law. (There are many more cases in the universe.) Here, we want to classify whether a court will classify a worker as an employee or an independent contractor. The five cases should give you a flavour of the raw, unstructured data that we begin with. We'd like to approach this problem.

Here are the five cases:
http://i.ca/f7

### First

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import datetime as dt
import pandas as pd
from pandas import *
import seaborn as sns
mydata = read_csv('Testdataset.csv')
#mydata.head()
mydata

Unnamed: 0,classification,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14
0,Liable,Individual,No,30.0,48623.00,No,Yes,False,No,False,Less than one year,20,Equity,Purchase on the secondary market,4.0
1,Not liable,Individual,No,10.0,-131052.00,No,Yes,False,Yes,False,Less than one year,12,Equity,Exercise of employee options,9.0
2,Liable,Corporation,Yes,20.0,174646.00,Yes,Yes,True,No,False,One year or more,100,Equity,Acquired directly from the issuer,72.0
3,Liable,Individual,No,80.0,200000.00,No,No,False,Yes,False,One year or more,4,Equity,Acquired directly from the issuer,72.0
4,Liable,Individual,No,10.0,1520.46,Yes,No,False,Yes,False,Less than one year,4,Equity,Acquired directly from the issuer,60.0
5,Liable,Corporation,Yes,15.0,-157189.00,Yes,Yes,True,Yes,False,Less than one year,100,Equity,Acquired directly from the issuer,36.0
6,Liable,Individual,Yes,75.0,670024.00,Yes,Yes,False,No,False,Less than one year,93,Equity,Purchase on the secondary market,60.0
7,Not liable,Individual,No,5.0,-5082.52,Yes,Yes,False,No,False,Less than one year,10,Equity,Purchase on the secondary market,1.5
8,Liable,Individual,No,35.0,-45356.00,No,Yes,False,No,False,Less than one year,12,Equity,Purchase on the secondary market,12.0
9,Liable,Individual,No,25.0,-282313.50,Yes,Yes,False,No,False,Less than one year,50,Equity,Purchase on the secondary market,12.0


In [2]:
# Is there any noise or cleanup"
#Percentage of our missing data.
total = mydata.isnull().sum().sort_values(ascending=False)
percent = (mydata.isnull().sum()/mydata.isnull().count()*100).sort_values(ascending=False)
our_miss_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(our_miss_data)

                Total  Percent
v14                 0      0.0
v13                 0      0.0
v12                 0      0.0
v11                 0      0.0
v10                 0      0.0
v9                  0      0.0
v8                  0      0.0
v7                  0      0.0
v6                  0      0.0
v5                  0      0.0
v4                  0      0.0
v3                  0      0.0
v2                  0      0.0
v1                  0      0.0
classification      0      0.0


In [3]:
#Is mydata formatted correctly
mydata['classification'] = mydata['classification'].astype('category')
mydata['v1'] = mydata['v1'].astype('category')
mydata['v2'] = mydata['v2'].astype('category')
mydata['v5'] = mydata['v5'].astype('category')
mydata['v6'] = mydata['v6'].astype('category')
mydata['v7'] = mydata['v7'].astype('category')
mydata['v8'] = mydata['v8'].astype('category')
mydata['v9'] = mydata['v9'].astype('category')
mydata['v10'] = mydata['v10'].astype('category')
mydata['v12'] = mydata['v12'].astype('category')
mydata['v13'] = mydata['v13'].astype('category')

round(mydata.describe(),2)
#mydata

Unnamed: 0,v3,v4,v11,v14
count,155.0,155.0,155.0,155.0
mean,41.49,44219.93,41.58,21.39
std,31.58,963531.68,80.93,31.41
min,0.0,-9936149.0,1.0,0.0
25%,10.5,-25266.5,3.0,3.0
50%,30.0,25000.0,10.0,10.0
75%,68.0,119549.0,50.0,24.0
max,100.0,4415749.0,600.0,180.0


In [4]:
# Convert categorical variable into dummy/indicator variables
mydata_basemodel = pd.get_dummies(mydata, drop_first = True)
#mydata_basemodel.dtypes
mydata_basemodel

Unnamed: 0,v3,v4,v11,v14,classification_Not liable,v1_Individual,v2_Yes,v5_Yes,v6_Yes,v7_True,v8_Yes,v9_True,v10_One year or more,v12_Equity,v12_Other,v13_Exercise of employee options,v13_Exercise of exchange traded options,v13_Other,v13_Purchase on the secondary market
0,30.0,48623.00,20,4.0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1
1,10.0,-131052.00,12,9.0,1,1,0,0,1,0,1,0,0,1,0,1,0,0,0
2,20.0,174646.00,100,72.0,0,0,1,1,1,1,0,0,1,1,0,0,0,0,0
3,80.0,200000.00,4,72.0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,0
4,10.0,1520.46,4,60.0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,0
5,15.0,-157189.00,100,36.0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0
6,75.0,670024.00,93,60.0,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1
7,5.0,-5082.52,10,1.5,1,1,0,1,1,0,0,0,0,1,0,0,0,0,1
8,35.0,-45356.00,12,12.0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,1
9,25.0,-282313.50,50,12.0,0,1,0,1,1,0,0,0,0,1,0,0,0,0,1


In [5]:
mydata_basemodel.corr().loc[:,'classification_Not liable']

v3                                        -0.071218
v4                                         0.062964
v11                                       -0.172267
v14                                        0.244297
classification_Not liable                  1.000000
v1_Individual                              0.002816
v2_Yes                                    -0.202391
v5_Yes                                    -0.200166
v6_Yes                                    -0.172951
v7_True                                   -0.196310
v8_Yes                                    -0.070242
v9_True                                   -0.224375
v10_One year or more                       0.386844
v12_Equity                                 0.035381
v12_Other                                 -0.054629
v13_Exercise of employee options           0.253280
v13_Exercise of exchange traded options   -0.227139
v13_Other                                  0.065612
v13_Purchase on the secondary market       0.029093
Name: classi

In [6]:
#Train Test Split
#Predictor variable
##X = mydata_basemodel.loc[:,['User Type','Gender','Distance']]
#Target variable
##y = mydata_basemodel.loc[:,['Minutes']]

#Train Test Split
#Predictor variable
X = mydata_basemodel.iloc[:,[0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
#Target variable
y = mydata_basemodel.iloc[:,4]
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#X_train
#X_test
#y_train
#y_test



### Our Predictive Models

### SVM with RBF Kernel

In [7]:
from sklearn import svm
clf = svm.SVC(gamma='auto',probability=True)
clf.fit(X_train, y_train)  
#clf.predict(X_test, y_test)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [8]:
print(clf.score(X_test,y_test))
#from sklearn.metrics import roc_curve
#from sklearn.metrics import roc_auc_score
#from matplotlib import pyplot
# predict probabilities
#probs = clf.predict_proba(X_test)
# keep probabilities for the positive outcome only
#probs = probs[:, 1]
# calculate AUC
#auc = roc_auc_score(y_test, probs)
#print('AUC: %.3f' % auc)
# calculate roc curve
#fpr, tpr, thresholds = roc_curve(y_test, probs)
# plot no skill
#pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
#pyplot.plot(fpr, tpr, marker='.')
# show the plot
#pyplot.show()

0.7741935483870968


### SVM with Linear Kernel

In [9]:
from sklearn import svm
clfl = svm.SVC(kernel="linear", C=0.025)
clfl.fit(X_train, y_train) 

SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [10]:
clfl.score(X_test,y_test)

0.5161290322580645

### MLP

In [20]:
from sklearn.neural_network import MLPClassifier
clf2 = MLPClassifier(solver='lbfgs', alpha=1e-5,hidden_layer_sizes=(36, 8), random_state=1)
clf2.fit(X_train, y_train)                         

MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(36, 8), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)

In [21]:
clf2.score(X_test,y_test)

0.3548387096774194

### Knn

In [13]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train) 

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')

In [14]:
neigh.score(X_test,y_test)

0.7096774193548387

### Decision tree

In [15]:
from sklearn import tree
clf3 = tree.DecisionTreeClassifier()
clf3 = clf.fit(X_train, y_train)

In [16]:
clf3.score(X_test,y_test)

0.7741935483870968