# Supervised Learning with scikit-learn

This project follows a DataCamp course on solving classification problems with supervised learning techniques. It uses scikit-learn to classify the party affiliation of United States Congressmen based on their 1984 voting records.

## EDA
We start with some exploratory data analysis (EDA).

In [33]:
# Load the dataset into a pandas DataFrame. The file has no header row,
# so the column names are supplied explicitly.
import pandas as pd

column_names = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
                'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
                'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df = pd.read_csv('house-votes-84.csv', header=None, names=column_names)


In [24]:
df.head()

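| | party | infants | water | budget | physician | salvador | religious | satellite | aid | missile | immigration | synfuels | education | superfund | crime | duty_free_exports | eaa_rsa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | n | y | n | y | y | y | n | n | n | y | ? | y | y | y | n | y |
| 1 | republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | ? |
| 2 | democrat | ? | y | y | ? | y | y | n | n | n | n | y | n | y | y | n | n |
| 3 | democrat | n | y | y | n | ? | y | n | n | n | n | y | n | y | n | n | y |
| 4 | democrat | y | y | y | n | y | y | n | n | n | n | y | ? | y | y | y | y |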


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434 entries, 0 to 433
Data columns (total 17 columns):
republican    434 non-null object
n             434 non-null object
y             434 non-null object
n.1           434 non-null object
y.1           434 non-null object
y.2           434 non-null object
y.3           434 non-null object
n.2           434 non-null object
n.3           434 non-null object
n.4           434 non-null object
y.4           434 non-null object
?             434 non-null object
y.5           434 non-null object
y.6           434 non-null object
y.7           434 non-null object
n.5           434 non-null object
y.8           434 non-null object
dtypes: object(17)
memory usage: 57.7+ KB
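
Note that `df.info()` reports every column as fully non-null: missing votes in this file are stored as the string `'?'`, so pandas does not treat them as missing. The following is a minimal sketch of how they could be counted. It uses a small hypothetical frame (`toy`) in the same format rather than the real data, and the names `null_counts` and `missing_counts` are only illustrative.

```python
import pandas as pd

# Hypothetical three-row sample in the same format as the voting data.
toy = pd.DataFrame({
    'party':   ['republican', 'democrat', 'democrat'],
    'infants': ['n', '?', 'y'],
    'water':   ['y', 'y', '?'],
    'budget':  ['n', 'y', 'y'],
})

# isnull() finds nothing, because '?' is an ordinary string.
null_counts = toy.isnull().sum()

# Comparing against '?' gives the number of missing votes per column.
missing_counts = (toy == '?').sum()

print(null_counts.sum())
print(missing_counts)
```

Passing `na_values='?'` to `pd.read_csv` would be another way to mark these entries as missing at load time.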


## Preprocessing

In [25]:
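import numpy as np

# Encode the votes numerically: '?' (missing) becomes NaN, 'n' becomes 0, 'y' becomes 1.
# The 'party' column is unaffected because it contains none of these values.
df[df == '?'] = np.nan
df[df == 'n'] = 0
df[df == 'y'] = 1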

In [26]:
df.head()

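| | party | infants | water | budget | physician | salvador | religious | satellite | aid | missile | immigration | synfuels | education | superfund | crime | duty_free_exports | eaa_rsa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | republican | 0.0 | 1 | 0 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 1 | NaN | 1.0 | 1 | 1 | 0 | 1.0 |
| 1 | republican | 0.0 | 1 | 0 | 1.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1 | 1 | 0 | NaN |
| 2 | democrat | NaN | 1 | 1 | NaN | 1.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 1 | 1 | 0 | 0.0 |
| 3 | democrat | 0.0 | 1 | 1 | 0.0 | NaN | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 | 1 | 0 | 0 | 1.0 |
| 4 | democrat | 1.0 | 1 | 1 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | NaN | 1 | 1 | 1 | 1.0 |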
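
As a sketch of one alternative encoding (not the approach used in this project), `Series.map` with a dictionary converts each vote column in a single step: any value absent from the dictionary, such as `'?'`, becomes `NaN`. The frame `toy` and the names `vote_map` and `votes` are hypothetical.

```python
import pandas as pd

# Hypothetical sample in the raw 'y' / 'n' / '?' format.
toy = pd.DataFrame({
    'party':   ['republican', 'democrat', 'democrat'],
    'infants': ['n', '?', 'y'],
    'water':   ['y', 'y', '?'],
})

# Map 'n' -> 0.0 and 'y' -> 1.0; unmapped values ('?') turn into NaN.
vote_map = {'n': 0.0, 'y': 1.0}
vote_columns = [col for col in toy.columns if col != 'party']
votes = toy[vote_columns].apply(lambda column: column.map(vote_map))

print(votes)
print(votes.dtypes)
```

Because every vote column ends up as `float64`, the encoded frame has a uniform numeric dtype.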


In [28]:
# Separate the features (the 16 votes) from the target (party affiliation)
X = df.drop('party', axis=1)
y = df['party']

## Imputing missing data in an ML Pipeline & SVM (Support Vector Machines)
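
Most-frequent imputation fills each missing entry with the mode of its column. The following is a minimal sketch of that behavior on a hypothetical toy array (`votes`), using `SimpleImputer` from `sklearn.impute`, the class that current scikit-learn releases provide for this purpose.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical encoded votes: rows are members, columns are bills.
votes = np.array([
    [1.0, 0.0],
    [1.0, np.nan],
    [np.nan, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
])

# 'most_frequent' replaces each NaN with the mode of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(votes)

print(imputer.statistics_)  # per-column fill values learned during fit
print(filled)
```

Inside a pipeline, the fill values are learned from the training set only and then reused on the test set, which avoids leaking test-set information into the preprocessing step.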

In [32]:
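# Import the imputer, pipeline, splitting, classifier, and metric utilities.
# SimpleImputer replaces the Imputer class, which was removed from
# sklearn.preprocessing in scikit-learn 0.22.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Set up the imputation transformer: fill each missing vote with the
# most frequent value in its column
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')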

# Instantiate the SVC classifier: clf
clf = SVC()

# Set up the pipeline steps: imputation first, then the classifier
steps = [('imputation', imp),
         ('SVM', clf)]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets (70% / 30% split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

             precision    recall  f1-score   support

   democrat       0.99      0.96      0.98        85
 republican       0.94      0.98      0.96        46

avg / total       0.97      0.97      0.97       131

[[82  3]
 [ 1 45]]
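
The figures in the classification report can be recomputed from the confusion matrix, whose rows are the true labels and whose columns are the predicted labels, in the order democrat, republican. The following sketch does this in plain Python; the variable names (`cm`, `report`) are only illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are
# predictions, in the order [democrat, republican].
cm = [[82, 3],
      [1, 45]]
labels = ['democrat', 'republican']

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

report = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted = sum(row[i] for row in cm)  # column sum: all predictions of this label
    actual = sum(cm[i])                    # row sum: the "support" of this label
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (precision, recall, f1, actual)

for label, (precision, recall, f1, support) in report.items():
    print(f"{label:>10}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}")
print(f"accuracy: {accuracy:.3f}")
```

For example, the democrat precision is 82 / (82 + 1) and the democrat recall is 82 / (82 + 3). Overall, 127 of the 131 test samples are classified correctly.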
