
## *Data Science Unit 1 Sprint 2 Assignment 1*

# Apply the t-test to real data

Your assignment is to determine which issues have "statistically significant" differences between political parties in this [1980s congressional voting data](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The data consists of 435 instances (one for each congressperson), a class (democrat or republican), and 16 binary attributes (yes or no for voting for or against certain issues). Be aware - there are missing values!

Your goals:

1. Load and clean the data (or determine the best method to drop observations when running tests)
2. Using hypothesis testing, find an issue that democrats support more than republicans with p < 0.01
3. Using hypothesis testing, find an issue that republicans support more than democrats with p < 0.01
4. Using hypothesis testing, find an issue where the difference between republicans and democrats has p > 0.1 (i.e. there may not be much of a difference)

Note that this data will involve *2 sample* t-tests, because you're comparing averages across two groups (republicans and democrats) rather than a single group against a null hypothesis.
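
To make the distinction concrete, here is a minimal sketch contrasting the two kinds of test. The arrays `group_a` and `group_b` are made up for illustration and are not taken from the congressional data; `ttest_1samp` and `ttest_ind` are SciPy's functions for the 1-sample and 2-sample cases.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_ind

# Made-up votes (1 = yes, 0 = no); NOT taken from the real dataset.
group_a = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
group_b = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])

# 1-sample: is group_a's support rate different from a fixed value (50%)?
one_sample = ttest_1samp(group_a, popmean=0.5)

# 2-sample: are the support rates of group_a and group_b different?
two_sample = ttest_ind(group_a, group_b)

print("1-sample p-value:", one_sample.pvalue)
print("2-sample p-value:", two_sample.pvalue)
```

The 2-sample call needs no hypothesized mean: its null hypothesis is simply that both groups share the same average.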

In [0]:
#imports
import numpy as np
from scipy.stats import ttest_ind
import pandas as pd


In [26]:
#getting the column headers for the dataframe
col_names = ['party','handicapped-infants','water-project',
                          'budget','physician-fee-freeze', 'el-salvador-aid',
                          'religious-groups','anti-satellite-ban',
                          'aid-to-contras','mx-missile','immigration',
                          'synfuels', 'education', 'right-to-sue','crime','duty-free',
                          'south-africa']
col_names

['party',
 'handicapped-infants',
 'water-project',
 'budget',
 'physician-fee-freeze',
 'el-salvador-aid',
 'religious-groups',
 'anti-satellite-ban',
 'aid-to-contras',
 'mx-missile',
 'immigration',
 'synfuels',
 'education',
 'right-to-sue',
 'crime',
 'duty-free',
 'south-africa']

In [27]:
votesData = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data',
                        header=None, names=col_names)
votesData.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


In [28]:
#replacing the question marks (missing votes) in the data with NaN
votesData = votesData.replace('?', np.nan )
votesData.head()

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y


In [29]:
#now changing the y to 1 and the n to 0
votesData = votesData.replace({'y': 1, 'n': 0})
votesData.head( )

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0
1,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,
2,democrat,,1.0,1.0,,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0
3,democrat,0.0,1.0,1.0,0.0,,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,democrat,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,1.0


In [30]:
#splitting the data into one dataframe for the democrats and one for the republicans
demfilt = votesData[votesData['party']=='democrat']
repfilt = votesData[votesData['party'] == 'republican']
print(demfilt.shape)
print(repfilt.shape)

(267, 17)
(168, 17)


In [31]:
#checking whether there are any NaNs in the physician fee freeze column
print(demfilt['physician-fee-freeze'].isnull().sum())
print(demfilt['physician-fee-freeze'].shape)

print(repfilt['physician-fee-freeze'].isnull().sum())
repfilt.shape

8
(267,)
3


(168, 17)

In [32]:
#stripping out the nans in the dem and the rep to make two different columns
dem_physician_no_nan = demfilt['physician-fee-freeze'].dropna()
dem_physician_no_nan.shape

(259,)

In [33]:
#stripping out nans for the republicans
col = repfilt['physician-fee-freeze']
rep_physician_no_nan = col[~np.isnan(col)]
rep_physician_no_nan.shape


(165,)

In [34]:
#doing the t-test for the physician fee freeze
#the null hypothesis is that there is no difference between
#the average support of the democrats and the republicans

#finding the mean of the two samples
print(rep_physician_no_nan.mean())
dem_physician_no_nan.mean()

0.9878787878787879


0.05405405405405406

In [35]:
stat, pval = ttest_ind(rep_physician_no_nan, dem_physician_no_nan)
print(pval)

#we reject the null hypothesis
print("We reject the null hypothesis\n")

1.994262314074572e-177
We reject the null hypothesis



In [36]:
#checking to see who supported the physician fee freeze
print(dem_physician_no_nan.value_counts())
print("the democrats mostly didn't support the fee freeze")
print(rep_physician_no_nan.value_counts())
print("this was mostly supported by the republicans")

0.0    245
1.0     14
Name: physician-fee-freeze, dtype: int64
the democrats mostly didn't support the fee freeze
1.0    163
0.0      2
Name: physician-fee-freeze, dtype: int64
this was mostly supported by the republicans


In [37]:
#choosing another column to look at
votesData.head(1)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0


In [38]:
#choosing to look at the mean of the handicapped infants
print(demfilt['handicapped-infants'].mean())
print(repfilt['handicapped-infants'].mean())
print("it looks like this issue is mostly supported by the democrats")

0.6046511627906976
0.18787878787878787
it looks like this issue is mostly supported by the democrats


In [39]:
#checking to see if there are any null values in the data
print(demfilt['handicapped-infants'].isnull().sum())
repfilt['handicapped-infants'].isnull().sum()

9


3

In [40]:
#letting ttest_ind omit the null values on this one
stat, pval = ttest_ind(demfilt['handicapped-infants'], repfilt['handicapped-infants'], nan_policy="omit")
print(pval)
print("We reject the null hypothesis that the two parties support this issue equally")

1.613440327937243e-18
We reject the null hypothesis that the two parties support this issue equally
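
The two ways of handling missing votes used so far (dropping the NaNs by hand, or passing `nan_policy="omit"`) should give the same p-value. The sketch below checks this on made-up arrays `a` and `b`, which are not from the real dataset. It also shows that `ttest_ind` with its default `nan_policy` returns NaN when missing values are present.

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up votes with missing values; NOT from the real dataset.
a = np.array([1.0, 1.0, np.nan, 0.0, 1.0, 1.0, np.nan, 1.0])
b = np.array([0.0, np.nan, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0])

# Option 1: let scipy skip the NaNs.
_, p_omit = ttest_ind(a, b, nan_policy="omit")

# Option 2: drop the NaNs ourselves before testing.
_, p_manual = ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])

# Default behavior: NaNs propagate into the result.
_, p_default = ttest_ind(a, b)

print(float(p_omit), float(p_manual), float(p_default))
```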


In [41]:
#looking at the data to pick another column
votesData.head(1)

Unnamed: 0,party,handicapped-infants,water-project,budget,physician-fee-freeze,el-salvador-aid,religious-groups,anti-satellite-ban,aid-to-contras,mx-missile,immigration,synfuels,education,right-to-sue,crime,duty-free,south-africa
0,republican,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,1.0


In [42]:
print(repfilt['water-project'].mean())
demfilt['water-project'].mean()

0.5067567567567568


0.502092050209205

In [0]:
#using the water project for a t-test, since the means are so close
#that the p-value is most likely larger than .1
stat, pval = ttest_ind(demfilt['water-project'], repfilt['water-project'], nan_policy='omit')

In [44]:
#printing the p-value of the water project test
print(pval)
print("this one has a p-value that is greater than .1")

0.9291556823993485
this one has a p-value that is greater than .1


In [0]:


#a method that compares the mean support of each issue between
#the democrats and the republicans:
#  how="rep"     -> looks for issues where the republican mean is higher
#  how="dem"     -> looks for issues where the democratic mean is higher
#  anything else -> looks for issues where the two means are exactly equal
#
#NOTE: I tried to make this find the issue where the parties are most similar
#(or most different) but got stuck. As written it returns the LAST column
#that satisfies the comparison, not the one with the largest gap.

def findLikeness(df=votesData, repfilter=repfilt, demfilter=demfilt, how='rep'):
  import operator

  #getting the comparison operator and the answer template:
  #[issue, first party, first party mean, second party, second party mean]
  if how == 'rep':
    compare = operator.gt
    ans = ['', "republican", -1, "democratic", -1]
  elif how == 'dem':
    compare = operator.gt
    ans = ['', "democratic", -1, "republican", -1]
  else:
    compare = operator.eq
    ans = ["", "republican", -1, "democratic", -1]

  #for debugging:
  #import ipdb
  #ipdb.set_trace()

  for col in df.columns.values.tolist():
    #skipping non-numeric columns such as 'party'
    if not pd.api.types.is_numeric_dtype(repfilter[col]):
      continue

    repVal = repfilter[col].mean()
    demVal = demfilter[col].mean()

    #putting the party of interest first
    if how == 'dem':
      theVal, theotherVal = demVal, repVal
    else:
      theVal, theotherVal = repVal, demVal

    #remembering this column if it satisfies the comparison
    if compare(theVal, theotherVal):
      ans[0] = col
      ans[2] = theVal
      ans[4] = theotherVal

  return ans


In [46]:
pip install ipdb

  Building wheel for ipdb (setup.py) ... done
  Created wheel for ipdb: filename=ipdb-0.12.2-cp36-none-any.whl size=9171 sha256=d574aa15c5a178e28c0b20b10799ac6b2c73e678a9a04bdeeaef5592bee4cae2
  Stored in directory: /root/.cache/pip/wheels/7a/00/07/c906eaf1b90367fbb81bd840e56bf8859dbd3efe3838c0b4ba
Successfully built ipdb
Installing collected packages: ipdb
Successfully installed ipdb-0.12.2


In [58]:
m = findLikeness(how="dem")
print(m)

['south-africa', 'democratic', 0.9351351351351351, 'republican', 0.6575342465753424]
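
One possible way to get past the point where `findLikeness` got stuck is to compute the gap for every issue and sort, rather than tracking a single answer. The sketch below is an assumption-laden alternative, not the notebook's method: `rank_issues` and its parameters are hypothetical names, and the `toy` frame is made up purely to show the call.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def rank_issues(df, party_col="party", group_a="democrat", group_b="republican"):
    """One row per issue: mean support per group, the gap, and the t-test p-value."""
    a = df[df[party_col] == group_a]
    b = df[df[party_col] == group_b]
    rows = []
    for col in df.columns.drop(party_col):
        # nan_policy="omit" skips missing votes instead of returning NaN.
        _, pval = ttest_ind(a[col], b[col], nan_policy="omit")
        rows.append({
            "issue": col,
            "mean_a": a[col].mean(),
            "mean_b": b[col].mean(),
            "gap": a[col].mean() - b[col].mean(),
            "pvalue": float(pval),
        })
    # Largest positive gap (group_a supports more) comes first.
    return pd.DataFrame(rows).sort_values("gap", ascending=False).reset_index(drop=True)

# Tiny made-up frame, only to show the call; NOT the real voting data.
toy = pd.DataFrame({
    "party": ["democrat"] * 4 + ["republican"] * 4,
    "issue-x": [1, 1, 1, 0, 0, 0, np.nan, 1],
    "issue-y": [0, 0, 1, 0, 1, 1, 1, 0],
})
ranked = rank_issues(toy)
print(ranked)
```

The first row is the issue that `group_a` favors most, the last row is the issue that `group_b` favors most, and the row with the smallest absolute `gap` is where the two groups are most alike.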


## Stretch Goals:

1. Refactor your code into functions so it's easy to rerun with arbitrary variables
2. Work on performing a t-test without using SciPy in order to get "under the hood" and learn more thoroughly about this topic.

### Start with a 1-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://blog.minitab.com/hs-fs/hubfs/Imported_Blog_Media/701f9c0efa98a38fb397f3c3ec459b66.png?width=247&height=172&name=701f9c0efa98a38fb397f3c3ec459b66.png) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
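
The three steps above can be sketched with only the standard library. All function names here (`one_sample_t`, `t_pdf`, `two_sided_p`) are hypothetical, and the sample is made up. Instead of a table, the p-value is approximated by integrating the t density with Simpson's rule, so it loses precision for extremely small p-values.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return (t statistic, degrees of freedom) for a 1-sample t-test."""
    # Omit missing values (NaN) before computing anything.
    clean = [x for x in sample if not math.isnan(x)]
    n = len(clean)
    # statistics.stdev uses the n - 1 (sample) denominator.
    std_err = statistics.stdev(clean) / math.sqrt(n)
    return (statistics.mean(clean) - popmean) / std_err, n - 1

def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    # lgamma avoids the overflow math.gamma would hit for large dof.
    log_coef = math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
    coef = math.exp(log_coef) / math.sqrt(dof * math.pi)
    return coef * (1 + x * x / dof) ** (-(dof + 1) / 2)

def two_sided_p(t, dof, steps=10_000):
    """Approximate two-sided p-value: 1 minus the area between -|t| and |t|."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = 2 * a / steps  # Simpson's rule needs an even number of steps
    total = t_pdf(-a, dof) + t_pdf(a, dof)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-a + i * h, dof)
    return max(0.0, 1 - total * h / 3)

# Made-up votes (1 = yes, 0 = no); null hypothesis: true support is 50%.
votes = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
t_stat, dof = one_sample_t(votes, popmean=0.5)
print(t_stat, dof, two_sided_p(t_stat, dof))
```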

 ### Then try a 2-sample t-test
 - Establish the conditions for your test 
 - [Calculate the T Statistic](https://lh3.googleusercontent.com/proxy/rJJ5ZOL9ZDvKOOeBihXoZDgfk7uv1YsRzSQ1Tc10RX-r2HrRpRLVqlE9CWX23csYQXcTniFwlBg3H-qR8MKJPBGnjwndqlhDX3JxoDE5Yg) (You'll need to omit NaN values from your sample).
 - Translate that t-statistic into a P-value. You can use a [table](https://www.google.com/search?q=t+statistic+table) or the [University of Iowa Applet](https://homepage.divms.uiowa.edu/~mbognar/applets/t.html)
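
A minimal sketch of the 2-sample statistic, again with only the standard library. It assumes the pooled-variance (Student's) form, which treats both groups as having equal variance and is what `scipy.stats.ttest_ind` computes by default (`equal_var=True`); the linked formula may use a different variant. The name `two_sample_t` and the sample data are made up for illustration.

```python
import math
import statistics

def two_sample_t(sample_a, sample_b):
    """Pooled-variance 2-sample t statistic and its degrees of freedom."""
    # Omit missing values (NaN) from both samples.
    a = [x for x in sample_a if not math.isnan(x)]
    b = [x for x in sample_b if not math.isnan(x)]
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2
    # Weighted average of the two sample variances (n - 1 denominators).
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, dof

# Made-up votes (1 = yes, 0 = no); NOT taken from the real dataset.
group_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
group_b = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
t_stat, dof = two_sample_t(group_a, group_b)
print(t_stat, dof)
```

The resulting statistic and degrees of freedom can then be looked up in a t table or entered into the applet to obtain the p-value.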

 ### Then check your Answers using Scipy!