# Congressional Voting Classification

# Objective
The main objective is to predict whether a member of Congress is a Democrat or a Republican from their voting record, using decision trees boosted with AdaBoost.

# AdaBoost
AdaBoost is an ensemble learning method (also known as “meta-learning”) that was originally created to improve the performance of binary classifiers. It trains weak classifiers iteratively, giving more weight in each round to the samples the previous round misclassified, and combines the weak classifiers into a strong one.
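
To make the reweighting idea concrete, here is a minimal sketch of a single boosting round on a toy problem. The five labels and the weak classifier's predictions are made up for illustration; the error, vote weight and weight update use the same formulas as the from-scratch implementation in Part 1, plus a renormalisation step added here for readability.

```python
import numpy as np

# Toy setup (hypothetical): 5 samples, the weak classifier gets only the last one wrong.
y_true = np.array([1, 1, 0, 0, 1])
y_weak = np.array([1, 1, 0, 0, 0])

w = np.full(len(y_true), 1 / len(y_true))    # uniform starting weights
miss = (y_weak != y_true).astype(float)      # 1.0 where the weak classifier is wrong

err = np.sum(w * miss) / np.sum(w)           # weighted error rate
alpha = np.log((1 - err) / err)              # vote weight of this weak classifier
w_new = w * np.exp(alpha * miss)             # misclassified samples gain weight
w_new = w_new / w_new.sum()                  # renormalise so the weights sum to 1

print("error:", err)
print("alpha:", alpha)
print("new weights:", w_new)
```

After this round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.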


# Data Set
This data set includes the votes of each member of the U.S. House of Representatives on the 16 key votes identified by the Congressional Quarterly Almanac (CQA). The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
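
The nine-to-three simplification can be sketched as a lookup table. The key strings below are hypothetical spellings of the vote types named above; the CSV used in this notebook already contains only the simplified symbols `y`, `n` and `?`.

```python
# Hypothetical spellings of the nine CQA vote types; the data file itself
# only stores the simplified symbols y (yea), n (nay) and ? (unknown).
VOTE_TO_SYMBOL = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or otherwise make a position known": "?",
}


def simplify(vote: str) -> str:
    """Collapse a detailed vote type into y, n or ?."""
    return VOTE_TO_SYMBOL[vote]


print(simplify("paired for"), simplify("announced against"), simplify("voted present"))
```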


## Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. el-salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)



# Source
The dataset can be obtained from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

# Tasks:
1.	Obtain the dataset
2.	Apply pre-processing operations
3.	Train an AdaBoost model from scratch and test the model
4.	Train an AdaBoost model using sklearn
5.	Compare the performance of AdaBoost, Random Forest and Decision Trees


## Part 1: Adaboost from Scratch

In [2]:
# Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats
from sklearn import preprocessing
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

In [4]:
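# Load the dataset
# Note: the file has no header row, so read_csv takes the first record
# (republican,n,y,...) as the column names.
data = pd.read_csv('house-votes-84.csv', sep=',')
data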

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
430,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
431,republican,n,?,n,y,y,y,n,n,n,n,y,y,y,y,n,y
432,republican,n,n,n,y,y,y,?,?,?,?,n,y,y,y,n,y
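
In the output above the first record has been consumed as the header, which is why the columns are called `republican`, `n`, `y`, `n.1` and so on. One way to avoid losing that record is to pass explicit column names, sketched below. The names are taken from the attribute list earlier in this notebook (`class-name` is a slug chosen here), and two records copied from the output stand in for `house-votes-84.csv` so the snippet runs on its own.

```python
import io

import pandas as pd

# Column names follow the attribute list in this notebook.
COLUMNS = [
    "class-name",
    "handicapped-infants",
    "water-project-cost-sharing",
    "adoption-of-the-budget-resolution",
    "physician-fee-freeze",
    "el-salvador-aid",
    "religious-groups-in-schools",
    "anti-satellite-test-ban",
    "aid-to-nicaraguan-contras",
    "mx-missile",
    "immigration",
    "synfuels-corporation-cutback",
    "education-spending",
    "superfund-right-to-sue",
    "crime",
    "duty-free-exports",
    "export-administration-act-south-africa",
]

# Two records copied from the output above stand in for the real file.
sample = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
)

# header=None stops pandas from using the first record as the header row.
votes = pd.read_csv(sample, header=None, names=COLUMNS)
print(votes.shape)
print(votes["synfuels-corporation-cutback"].tolist())
```

With the real file, the same call would take the file path instead of `sample`. The rest of this notebook keeps the original column names, so later cells that refer to the `republican` column would need the new label name.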


In [5]:
# Treat unknown votes ("?") as nay; replace() returns a new frame, so keep the result
data = data.replace(to_replace="?", value="n")
data

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n
1,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,n,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,n,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
429,republican,n,n,y,y,y,y,n,n,y,y,n,y,y,y,n,y
430,democrat,n,n,y,n,n,n,y,y,y,y,n,n,n,n,n,y
431,republican,n,n,n,y,y,y,n,n,n,n,y,y,y,y,n,y
432,republican,n,n,n,y,y,y,n,n,n,n,n,y,y,y,n,y


In [6]:
# Check null values (unknown votes are stored as the string '?', not as NaN)
data.isnull().sum()

republican    0
n             0
y             0
n.1           0
y.1           0
y.2           0
y.3           0
n.2           0
n.3           0
n.4           0
y.4           0
?             0
y.5           0
y.6           0
y.7           0
n.5           0
y.8           0
dtype: int64
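
`isnull()` reports zero everywhere because pandas does not treat the string `?` as missing. A small sketch of how unknown votes could be counted instead, run on a tiny stand-in frame with hypothetical column names; on the real data this would have to happen before the `?` values are replaced.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical column names) using the same y / n / ? symbols.
df = pd.DataFrame({
    "party": ["republican", "democrat", "democrat"],
    "vote_a": ["n", "?", "n"],
    "vote_b": ["y", "y", "?"],
})

null_counts = df.isnull().sum()      # all zeros: "?" is an ordinary string
unknown_counts = (df == "?").sum()   # number of unknown votes per column

print(null_counts)
print(unknown_counts)
```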

In [7]:
# Preprocessing
# Encode the votes as integers: 'y' becomes 1, everything else
# ('n' and any remaining '?') becomes 0
vote_cols = data.columns[1:]
data[vote_cols] = (data[vote_cols] == 'y').astype(int)
data.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
1,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
2,democrat,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
3,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
4,democrat,0,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1


In [8]:
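# Encode the party label. LabelEncoder sorts the classes alphabetically,
# so democrat becomes 0 and republican becomes 1.
le = LabelEncoder()
cols = ['republican']
data[cols] = data[cols].apply(lambda col: le.fit_transform(col))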

In [9]:
X = data.iloc[:,1:].values
y = data.iloc[:,0].values

In [10]:
# Split the dataset into training and test sets (75% / 25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [11]:
# Implement AdaBoost from scratch
# AdaBoost consists of stumps, which can be built with sklearn's decision trees
# A stump is a decision tree trained with max_depth=1
def sign(x):
    return abs(x)/x if x != 0 else 1

class AdaBoost:

    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.models = [None]*n_estimators

    def fit(self, X, y):

        X = np.float64(X)
        N = len(y)
        w = np.full(N, 1/N)                    # start from uniform sample weights

        for m in range(self.n_estimators):

            # Fit a stump on the weighted samples and keep its predict method
            Gm = DecisionTreeClassifier(max_depth=1)\
                        .fit(X, y, sample_weight=w).predict

            miss = Gm(X) != y                  # True where the stump is wrong
            errM = np.sum(w*miss)/np.sum(w)    # weighted error rate
            AlphaM = np.log((1-errM)/errM)     # vote weight of this stump
            w = w*np.exp(AlphaM*miss)          # up-weight the misclassified samples

            self.models[m] = (AlphaM, Gm)

    def predict(self, X):

        # Caveat: Gm returns the encoded labels 0/1, not -1/+1. With
        # non-negative AlphaM the weighted sum is therefore never negative,
        # so every sample is predicted as class 0.
        y = 0
        for m in range(self.n_estimators):
            AlphaM, Gm = self.models[m]
            y += AlphaM*Gm(X)
        signA = np.vectorize(sign)
        y = np.where(signA(y) == -1, 1, 0)
        return y

In [12]:
# Train the model and test the model
clf = AdaBoost(n_estimators=100)
clf.fit(X_train,y_train)

In [13]:
# Evaluate the accuracy on the test set
y_pred = clf.predict(X_test)

acc = (np.sum(y_pred==y_test)/len(y_pred))*100
print("The accuracy on testing set is: ",acc)

The accuracy on testing set is:  57.798165137614674
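
The 57.8% figure equals 63/109, which appears to be the share of class 0 in the test set (the recall values printed in Part 2 imply 63 democrats and 46 republicans). That is what one would expect if `predict` returns 0 for every sample: the stumps vote with 0/1 labels, so the weighted sum never becomes negative. A minimal sketch of one possible fix follows, assuming the usual formulation where stumps vote with -1/+1. The class name `AdaBoostPM1`, the error clipping and the weight renormalisation are additions made here, and the demo runs on synthetic data rather than the voting records, so no accuracy for the real test set is claimed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class AdaBoostPM1:
    """AdaBoost sketch whose stumps vote with -1/+1 (hypothetical name)."""

    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.models = []

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y_pm = np.where(np.asarray(y) == 1, 1, -1)    # labels 0/1 -> -1/+1
        w = np.full(len(y_pm), 1 / len(y_pm))         # uniform starting weights
        self.models = []
        for _ in range(self.n_estimators):
            stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
            miss = stump.predict(X) != y_pm
            err = np.sum(w * miss) / np.sum(w)
            err = np.clip(err, 1e-10, 1 - 1e-10)      # avoid log(0) and division by zero
            alpha = np.log((1 - err) / err)
            w = w * np.exp(alpha * miss)
            w = w / w.sum()                           # keep the weights bounded
            self.models.append((alpha, stump))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        score = sum(alpha * stump.predict(X) for alpha, stump in self.models)
        return np.where(score >= 0, 1, 0)             # back to the 0/1 encoding


# Synthetic stand-in data (not the voting records): 16 binary features,
# label mostly equal to feature 3, with roughly 10% of labels flipped.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 16))
flip = rng.random(300) < 0.1
y_demo = np.where(flip, 1 - X_demo[:, 3], X_demo[:, 3])

model = AdaBoostPM1(n_estimators=20).fit(X_demo[:200], y_demo[:200])
pred = model.predict(X_demo[200:])
demo_acc = np.mean(pred == y_demo[200:])
print("predicted classes:", np.unique(pred))
print("accuracy on the synthetic hold-out:", demo_acc)
```

On the real data the same class could be used in place of `AdaBoost` with `X_train`, `y_train` and `X_test` from the cells above.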


## Part 2: Adaboost using Sklearn

In [14]:
# Use the preprocessed dataset here
X = data.iloc[:,1:].values
y = data.iloc[:,0].values

X_train, X_test, y_train, y_test = train_test_split(X,y , test_size=0.25, random_state=0)

In [15]:
# Train the AdaBoost model using sklearn's built-in AdaBoostClassifier
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),n_estimators=100)
clf.fit(X_train,y_train)
acc = clf.score(X_test,y_test)

print("The accuracy is: ",acc)

The accuracy is:  0.9357798165137615


In [16]:
# Evaluate on the test set and print per-class precision, recall and f-measure
# (index 0 = democrat, index 1 = republican)
y_pred = clf.predict(X_test)
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred,beta=1.0)
print("Precision: ",precision)
print("Recall: ",recall)
print("Fscore: ",fscore)

Precision:  [0.9375     0.93333333]
Recall:  [0.95238095 0.91304348]
Fscore:  [0.94488189 0.92307692]
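
Each printed array holds one value per class. As a worked check, the numbers above are consistent with the confusion matrix used in the sketch below. The notebook does not print that matrix; it is inferred here from the reported precision and recall, so treat it as a reconstruction.

```python
import numpy as np

# Confusion matrix inferred from the printed metrics (an assumption, not notebook output).
# Rows: true class, columns: predicted class; class 0 = democrat, class 1 = republican.
cm = np.array([[60, 3],
               [4, 42]])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)    # correct predictions / all predictions of that class
recall = tp / cm.sum(axis=1)       # correct predictions / all true members of that class
fscore = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

print("Precision:", precision)
print("Recall:   ", recall)
print("Fscore:   ", fscore)
print("Accuracy: ", accuracy)
```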


In [17]:
# Play with parameters such as
# number of decision trees
# Criterion for splitting
# Max depth
# Minimum samples per split and leaf

acc = clf.score(X_test,y_test)
print(acc)

0.9357798165137615
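
The cell above re-scores the same model without changing any parameter. A hedged sketch of what a small parameter sweep could look like follows, using cross-validation on synthetic stand-in data (not the voting records). The grid values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded votes: 16 binary features,
# label mostly equal to feature 3, with roughly 10% of labels flipped.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 16))
flip = rng.random(200) < 0.1
y_demo = np.where(flip, 1 - X_demo[:, 3], X_demo[:, 3])

results = {}
for n_estimators in (10, 50):                      # number of decision trees
    for max_depth in (1, 2):                       # depth of each tree
        for criterion in ("gini", "entropy"):      # splitting criterion
            base = DecisionTreeClassifier(
                max_depth=max_depth,
                criterion=criterion,
                min_samples_split=2,               # minimum samples per split
                min_samples_leaf=1,                # minimum samples per leaf
            )
            # The base estimator is passed positionally because its keyword
            # name differs between scikit-learn versions.
            clf_demo = AdaBoostClassifier(base, n_estimators=n_estimators, random_state=0)
            scores = cross_val_score(clf_demo, X_demo, y_demo, cv=3)
            results[(n_estimators, max_depth, criterion)] = scores.mean()

best = max(results, key=results.get)
print("best setting:", best)
print("mean CV accuracy:", results[best])
```

Selecting parameters by cross-validation on the training set, rather than by test-set accuracy, keeps the test set untouched for the final comparison.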


## Part 3: Compare the models

In [18]:
# Train AdaBoost, Random Forest and Decision Tree models from sklearn
from sklearn.ensemble import RandomForestClassifier

accu = []
rec = []
prec = []
fsc = []

models = [
    ("Adaboost", AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)),
    ("Random Forest", RandomForestClassifier(max_depth=2, random_state=0)),
    ("Decision Tree", DecisionTreeClassifier(max_depth=2)),
]

# Fit each model, score it on the test set and collect the metrics
for name, clf in models:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = clf.score(X_test, y_test)
    precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, beta=1.0)
    accu.append(acc)
    prec.append(precision)
    rec.append(recall)
    fsc.append(fscore)
    print("The accuracy with", name, "is: ", acc)


The accuracy with Adaboost is:  0.9357798165137615
The accuracy with Random Forest is:  0.9174311926605505
The accuracy with Decision Tree is:  0.9357798165137615


In [20]:
# Compare their per-class precision, recall and f-measure
label = ["AdaBoost", 'Random_Forest', 'Decision Tree']
for i in range(len(label)):
    print("The precision for ", label[i], " :", prec[i])
    print("The recall for ", label[i], " :", rec[i])
    print("The fscore for ", label[i], " :", fsc[i])
    print(".............................................")


The precision for  AdaBoost  : [0.9375     0.93333333]
The recall for  AdaBoost  : [0.95238095 0.91304348]
The fscore for  AdaBoost  : [0.94488189 0.92307692]
.............................................
The precision for  Random_Forest  : [0.95       0.87755102]
The recall for  Random_Forest  : [0.9047619  0.93478261]
The fscore for  Random_Forest  : [0.92682927 0.90526316]
.............................................
The precision for  Decision Tree  : [0.98275862 0.88235294]
The recall for  Decision Tree  : [0.9047619  0.97826087]
The fscore for  Decision Tree  : [0.94214876 0.92783505]
.............................................
