### 1 Data

The data house-votes-84 (see canvas/files/data) contains votes on 16 bills by 435 Representatives in 1984 (see the included readme file). The first variable is the party membership (republican or democrat), and the following 16 features are votes (y, n, or ? if the vote was neither yea nor nay).

1. Load data. Note: the file does not contain the header line.
2. Explore the data: how many yeas, nays, and other votes are there in each column?
3. Compute the percentage of democrats and republicans in your data.
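
As a rough sketch of step 2, the per-column tallies can also be collected into a single table instead of being printed column by column. The frame `toy_df` below is a made-up stand-in for the real data (it is not part of the assignment files), and `vote_counts` is a name chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy stand-in for the real data:
# column 0 is the party, the remaining columns are votes.
toy_df = pd.DataFrame({
    0: ['republican', 'democrat', 'democrat', 'republican'],
    1: ['n', 'y', '?', 'n'],
    2: ['y', 'y', 'n', '?'],
})

# Count each vote value per bill: rows are vote values, columns are bills.
vote_counts = (
    toy_df.drop(columns=0)                 # keep only the vote columns
    .apply(lambda s: s.value_counts())     # one tally per column
    .reindex(['y', 'n', '?'])              # fixed row order
    .fillna(0)                             # a value absent from a column counts as 0
    .astype(int)
)
print(vote_counts)
```

The same chain should work on the full frame once it is loaded, assuming the votes only take the values y, n, and ?.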

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
house_votes_df = pd.read_csv('house-votes-84.csv.bz2', header=None)

In [3]:
house_votes_df.head()

            0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16
0  republican  n  y  n  y  y  y  n  n  n   y   ?   y   y   y   n   y
1  republican  n  y  n  y  y  y  n  n  n   n   n   y   y   y   n   ?
2    democrat  ?  y  y  ?  y  y  n  n  n   n   y   n   y   y   n   n
3    democrat  n  y  y  n  ?  y  n  n  n   n   y   n   y   n   n   y
4    democrat  y  y  y  n  y  y  n  n  n   n   y   ?   y   y   y   y


In [4]:
house_votes_df.shape

(435, 17)

In [5]:
house_votes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
0     435 non-null object
1     435 non-null object
2     435 non-null object
3     435 non-null object
4     435 non-null object
5     435 non-null object
6     435 non-null object
7     435 non-null object
8     435 non-null object
9     435 non-null object
10    435 non-null object
11    435 non-null object
12    435 non-null object
13    435 non-null object
14    435 non-null object
15    435 non-null object
16    435 non-null object
dtypes: object(17)
memory usage: 57.9+ KB


In [6]:
house_votes_df.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16], dtype='int64')

In [7]:

for col in house_votes_df.columns:
    # Series.value_counts() replaces the deprecated top-level pd.value_counts()
    print('Value counts in',col,':\n',house_votes_df[col].value_counts())

Value counts in 0 :
 democrat      267
republican    168
Name: 0, dtype: int64
Value counts in 1 :
 n    236
y    187
?     12
Name: 1, dtype: int64
Value counts in 2 :
 y    195
n    192
?     48
Name: 2, dtype: int64
Value counts in 3 :
 y    253
n    171
?     11
Name: 3, dtype: int64
Value counts in 4 :
 n    247
y    177
?     11
Name: 4, dtype: int64
Value counts in 5 :
 y    212
n    208
?     15
Name: 5, dtype: int64
Value counts in 6 :
 y    272
n    152
?     11
Name: 6, dtype: int64
Value counts in 7 :
 y    239
n    182
?     14
Name: 7, dtype: int64
Value counts in 8 :
 y    242
n    178
?     15
Name: 8, dtype: int64
Value counts in 9 :
 y    207
n    206
?     22
Name: 9, dtype: int64
Value counts in 10 :
 y    216
n    212
?      7
Name: 10, dtype: int64
Value counts in 11 :
 n    264
y    150
?     21
Name: 11, dtype: int64
Value counts in 12 :
 n    233
y    171
?     31
Name: 12, dtype: int64
Value counts in 13 :
 y    209
n    201
?     25
Name: 13, dtype: int64
Val

In [8]:
counts = house_votes_df[0].value_counts()
counts

democrat      267
republican    168
Name: 0, dtype: int64

In [9]:
# Index by party label rather than by position in the sorted counts
perc_democrats = counts['democrat']/counts.sum()*100
print('Percentage of Democrats: ',perc_democrats)
print('Percentage of Republicans: ', (100 - perc_democrats))

Percentage of Democrats:  61.37931034482759
Percentage of Republicans:  38.62068965517241
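
The same percentages can be obtained directly with `value_counts(normalize=True)`. The sketch below rebuilds a party column from the counts reported above (267 democrats, 168 republicans) rather than reading the file; the names `party` and `shares` are only illustrative.

```python
import pandas as pd

# Rebuild a party column from the counts shown above.
party = pd.Series(['democrat'] * 267 + ['republican'] * 168)

# normalize=True returns fractions instead of raw counts.
shares = party.value_counts(normalize=True) * 100
print(shares.round(2))
```

On the real data this would be `house_votes_df[0].value_counts(normalize=True) * 100`.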


### 2 Which variable gives the best branch

In [10]:
handicap_yes_df = house_votes_df[house_votes_df[1] == 'y']
handicap_no_df = house_votes_df[house_votes_df[1] == 'n']

In [11]:
# handicap_yes_df
counts_y = handicap_yes_df[0].value_counts()
# Index by party label: value_counts() sorts by frequency, so position 0 is not always 'democrat'
democrats_y = counts_y['democrat']/counts_y.sum()
republicans_y = 1 - democrats_y
print('Percentage of Democrats: ',democrats_y*100)
print('Percentage of Republicans: ', republicans_y*100)

Percentage of Democrats:  83.42245989304813
Percentage of Republicans:  16.577540106951872


In [12]:
# handicap_no_df
counts_n = handicap_no_df[0].value_counts()
# Index by party label: among the 'n' voters Republicans are the majority,
# so position 0 of the sorted counts is 'republican', not 'democrat'
republicans_n = counts_n['republican']/counts_n.sum()
democrats_n = 1 - republicans_n
print('Percentage of Democrats: ',democrats_n*100)
print('Percentage of Republicans: ', republicans_n*100)

Percentage of Democrats:  43.220338983050844
Percentage of Republicans:  56.779661016949156


In [13]:
entropy_y = -(democrats_y*np.log2(democrats_y) + republicans_y*np.log2(republicans_y))
entropy_y

0.647948835478525

In [14]:
entropy_n = -(democrats_n*np.log2(democrats_n) + republicans_n*np.log2(republicans_n))
entropy_n

0.9866967086735613

In [15]:
final_entropy = ((handicap_yes_df.shape[0])*entropy_y + (handicap_no_df.shape[0])*entropy_n )/(handicap_yes_df.shape[0] + handicap_no_df.shape[0]) 
final_entropy

0.8369429207599164
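
For reference, the quantities computed in cells 13 to 15 can be written out as follows. The symbols are notation introduced here, not taken from the assignment: $p$ is the share of one party in a branch, and $n_y$, $n_n$ are the numbers of representatives who voted y and n.

$$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$$

$$H_{\text{split}} = \frac{n_y\,H(p_y) + n_n\,H(p_n)}{n_y + n_n}$$

With $n_y = 187$ and $n_n = 236$ for column 1, this gives $(187 \times 0.6479 + 236 \times 0.9867)/423 \approx 0.8369$, matching the value above. Because $H(p) = H(1-p)$, it does not matter which party's share is used as $p$.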

In [16]:
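def find_entropy(house_votes_df):
    """Return the weighted entropy of the party label after splitting on each vote column.

    Only 'y' and 'n' votes are used; rows with '?' are ignored for that column.
    Assumes both parties appear in each branch (otherwise log2(0) is hit).
    """
    entropy_list = []
    cols = house_votes_df.columns.drop(0)  # column 0 is the party label
    for col in cols:
        yes_df = house_votes_df[house_votes_df[col] == 'y']
        no_df = house_votes_df[house_votes_df[col] == 'n']

        # Party shares among the 'y' voters (indexed by label, not by position)
        counts_y = yes_df[0].value_counts()
        democrats_y = counts_y['democrat']/counts_y.sum()
        republicans_y = 1 - democrats_y

        # Party shares among the 'n' voters
        counts_n = no_df[0].value_counts()
        democrats_n = counts_n['democrat']/counts_n.sum()
        republicans_n = 1 - democrats_n

        # Binary entropy of each branch, in bits
        entropy_y = -(democrats_y*np.log2(democrats_y) + republicans_y*np.log2(republicans_y))
        entropy_n = -(democrats_n*np.log2(democrats_n) + republicans_n*np.log2(republicans_n))
        # Average of the branch entropies, weighted by branch size
        final_entropy = (yes_df.shape[0]*entropy_y + no_df.shape[0]*entropy_n)/(yes_df.shape[0] + no_df.shape[0])

        entropy_list.append({'column': col, 'entropy': final_entropy})
    return entropy_list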
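
The function above fails if a branch contains only one party, because the missing party has no count and a zero share would reach `np.log2`. A minimal sketch of a more defensive variant follows; `label_entropy`, `split_entropy`, and the toy frame are hypothetical names and data introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

def label_entropy(labels):
    """Entropy in bits of a Series of class labels; an empty Series gives 0.0."""
    probs = labels.value_counts(normalize=True).to_numpy()
    probs = probs[probs > 0]  # drop zero shares so log2 is always defined
    return float((probs * np.log2(1.0 / probs)).sum())

def split_entropy(df, col, target=0):
    """Size-weighted entropy of df[target] after splitting on the y/n votes in df[col]."""
    rows = df[df[col].isin(['y', 'n'])]  # ignore '?' votes for this column
    weighted = 0.0
    for _, group in rows.groupby(rows[col]):
        weighted += len(group) / len(rows) * label_entropy(group[target])
    return weighted

# Toy data: the 'y' branch is pure, the 'n' branch is split evenly.
toy = pd.DataFrame({
    0: ['democrat', 'democrat', 'democrat', 'republican', 'republican'],
    1: ['y', 'y', 'n', 'n', '?'],
})
print(split_entropy(toy, 1))
```

Here the pure branch contributes zero entropy and the evenly split branch contributes one bit at half weight.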


In [17]:
list_entropy = find_entropy(house_votes_df)

In [18]:
pd.DataFrame(list_entropy)

   column   entropy
0       1  0.836943
1       2  0.959725
2       3  0.519205
3       4  0.206111
4       5  0.533355
5       6  0.818473
6       7  0.757289
7       8  0.614420
8       9  0.655859
9      10  0.956768


Feature 4 gives the best branch: it has the lowest weighted entropy after the split, 0.206.
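
The ranking can also be read as information gain, that is, the entropy before the split minus the weighted entropy after it. The sketch below uses only numbers reported above (267 democrats out of 435, and 0.206111 for feature 4). Note the assumption: the parent entropy here is computed on all 435 rows, whereas the split entropy ignores '?' votes, so the gain is approximate.

```python
import numpy as np

# Entropy of the party label before any split (267 democrats out of 435).
p_dem = 267 / 435
parent_entropy = -(p_dem * np.log2(p_dem) + (1 - p_dem) * np.log2(1 - p_dem))

# Weighted entropy after splitting on feature 4, taken from the table above.
split_entropy_feature_4 = 0.206111

# Approximate information gain of the split (see the caveat about '?' rows).
gain = parent_entropy - split_entropy_feature_4
print(round(parent_entropy, 3), round(gain, 3))
```

Since the parent entropy is the same for every feature, ranking by lowest weighted entropy and ranking by highest gain select the same feature.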