## 1. Introduction
This Jupyter Notebook uses a paired t-test to determine whether the county-level percentage of votes for the Republican presidential candidate was lower in 2008 than in 2012.

### 1.1 Problem Statement
- **Null Hypothesis:** The percentage of Republican candidate votes in 2008 is the same as the percentage in 2012.
- **Alternative Hypothesis:** The percentage of Republican candidate votes in 2008 is less than the percentage in 2012.
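
The two hypotheses can also be written in terms of per-county differences. The symbols below ($p_{i,\text{year}}$, $d_i$, $\mu_d$) are introduced here for illustration and do not appear elsewhere in the notebook; $p_{i,\text{year}}$ is assumed to be the Republican share of the vote in county $i$ for the given year.

$$
d_i = p_{i,2008} - p_{i,2012}, \qquad H_0:\ \mu_d = 0, \qquad H_A:\ \mu_d < 0
$$

Here $\mu_d$ is the mean of the differences $d_i$ across counties, so a negative mean difference corresponds to a lower Republican share in 2008 than in 2012.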
  
### 1.2 Data Source
The analysis uses the `countypres_2000-2024.csv` dataset from the [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ), which contains county-level returns for presidential elections from 2000 through 2024. The data is sourced from official state election records.

The columns in the dataset are:

| Column | Description |
|--------|-------------|
|`year` | election year |
|`state` | state name |
|`state_po` | U.S. postal code state abbreviation |
|`county_name` | county name | 
|`county_fips` | county FIPS code |
|`office` | President |
|`candidate` | name of the candidate |
|`party` | party of the candidate; takes form of DEMOCRAT, REPUBLICAN, GREEN, LIBERTARIAN, or OTHER |
|`candidatevotes` | votes received by this candidate for this particular party |
|`totalvotes` | total number of votes cast in this county-year |
|`version` | date when dataset was finalized |
|`mode` | mode of ballots cast; default is TOTAL, with different modes specified for 2020 |

## 2. Import Libraries and Set Significance Level
This step imports the necessary Python libraries and sets the significance level to 0.05. The dataset itself is loaded in the next step.
- `pandas` is imported as `pd` for data manipulation and analysis.
- `pingouin` is imported for its t-test function, which reports the p-value.
- `alpha` is set to 0.05. This is the chosen significance level; the p-value is compared against it to decide whether to reject the null hypothesis.

In [5]:
# Import necessary packages
import pandas as pd
import pingouin
alpha = 0.05

### Load Data
Load the `countypres_2000-2024.csv` dataset into a pandas DataFrame.



In [7]:
df = pd.read_csv('../../data/countypres_2000-2024.csv')
#inspect dataset
df.head()

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode
0,2000,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,AL GORE,DEMOCRAT,4942,17208,20250712,TOTAL
1,2000,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,GEORGE W. BUSH,REPUBLICAN,11993,17208,20250712,TOTAL
2,2000,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,OTHER,OTHER,113,17208,20250712,TOTAL
3,2000,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,RALPH NADER,GREEN,160,17208,20250712,TOTAL
4,2000,ALABAMA,AL,BALDWIN,1003,US PRESIDENT,AL GORE,DEMOCRAT,13997,56480,20250712,TOTAL


### Data Cleaning and Preparation
- Filter the DataFrame to include only Republican candidate votes for the years 2008 and 2012.
- Drop columns that are not relevant for the analysis, such as `state`, `county_fips`, `office`, `candidate`, `party`, `version`, and `mode`.
- Create a new column called `votes_percent` by dividing `candidatevotes` by `totalvotes`. Despite the name, the result is a proportion between 0 and 1, not a value on a 0 to 100 scale.

In [9]:
# Data cleaning and preparation
# Only 2008 and 2012 Republican records are of interest, so filter for them and copy to a new DataFrame.
filt = (df['year'].isin([2008, 2012])) & (df['party'] == 'REPUBLICAN')
repub_votes_potus_08_12 = df.loc[filt].copy()
# Drop columns that are not of interest (columns= already selects the axis).
repub_votes_potus_08_12.drop(columns=['state', 'county_fips', 'office', 'candidate', 'party', 'version', 'mode'], inplace=True)
# Create the new column 'votes_percent' (candidate votes as a share of total votes).
repub_votes_potus_08_12['votes_percent'] = repub_votes_potus_08_12['candidatevotes'] / repub_votes_potus_08_12['totalvotes']

### Reshape the DataFrame
- Pivot the DataFrame so that `state_po` and `county_name` form the index, `year` becomes the columns, and `votes_percent` are the values.
- Rename the year columns (2008 and 2012) to `percent_08` and `percent_12` for clarity.

In [11]:
repub_votes_potus_08_12 = repub_votes_potus_08_12.pivot_table(index=['state_po', 'county_name'], columns='year', values='votes_percent')
repub_votes_potus_08_12.rename(columns={2008:'percent_08', 2012:'percent_12'}, inplace=True)
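
To make the reshape concrete, here is a minimal sketch on made-up data. The state codes, county names, and vote counts are hypothetical and are not taken from the dataset. It shows the long-to-wide pivot and what happens when a county appears in only one of the two years.

```python
import pandas as pd

# Hypothetical toy data: three counties, one of which has no 2012 record.
toy = pd.DataFrame({
    "year": [2008, 2012, 2008, 2012, 2008],
    "state_po": ["AA", "AA", "AA", "AA", "BB"],
    "county_name": ["ALPHA", "ALPHA", "BETA", "BETA", "GAMMA"],
    "candidatevotes": [40, 50, 30, 45, 10],
    "totalvotes": [100, 100, 60, 90, 40],
})
toy["votes_percent"] = toy["candidatevotes"] / toy["totalvotes"]

# Same reshape as the notebook cell: one row per county, one column per year.
# pivot_table aggregates with the mean by default if a county-year repeats.
wide = toy.pivot_table(index=["state_po", "county_name"], columns="year", values="votes_percent")
wide = wide.rename(columns={2008: "percent_08", 2012: "percent_12"})

# Counties missing one of the two years end up with NaN in that column.
incomplete = wide[wide.isna().any(axis=1)]
complete = wide.dropna()

print(wide)
print(f"Counties with both years: {len(complete)}, with a missing year: {len(incomplete)}")
```

A paired test can only use counties that have a value for both years. pingouin's documentation describes removing incomplete pairs automatically for paired data, but it may be worth checking how many counties are affected before running the test.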

### Perform Paired t-test
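Execute a paired t-test using `pingouin.ttest` with `percent_08` as `x`, `percent_12` as `y`, `paired=True`, and `alternative='less'`. The `'less'` alternative tests whether the mean of `x` is less than the mean of `y`, which matches the alternative hypothesis in Section 1.1.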



In [13]:
paired_test = pingouin.ttest(x=repub_votes_potus_08_12['percent_08'], y=repub_votes_potus_08_12['percent_12'], paired=True, alternative='less')
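
A paired t-test is equivalent to a one-sample t-test on the per-county differences. The sketch below computes the t statistic by hand with the standard library only. The vote shares are hypothetical values chosen for illustration, and this is not how `pingouin` is implemented internally.

```python
import math
import statistics

# Hypothetical vote shares for five counties (illustrative values only).
percent_08 = [0.50, 0.60, 0.40, 0.70, 0.55]
percent_12 = [0.55, 0.62, 0.47, 0.71, 0.60]

# Per-county differences, 2008 minus 2012, matching x - y in the test call.
diffs = [a - b for a, b in zip(percent_08, percent_12)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)

# t statistic: mean difference divided by its standard error.
t_stat = mean_diff / (sd_diff / math.sqrt(n))
dof = n - 1  # degrees of freedom for a paired test

print(f"mean difference = {mean_diff:.4f}, t = {t_stat:.4f}, dof = {dof}")
```

With `alternative='less'`, a large negative t statistic yields a small p-value, since it indicates that the 2008 shares are systematically below the 2012 shares.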

## 3. Make a Decision and Interpret Results
Retrieve the p-value from the t-test result.

Compare the p-value to the significance level `alpha` (0.05).

If `p_value < alpha`, reject the null hypothesis. This suggests sufficient statistical evidence that the percentage of Republican candidate votes in 2008 was less than in 2012.

If `p_value >= alpha`, fail to reject the null hypothesis. This means there is not enough statistical evidence to conclude that the percentage of Republican candidate votes in 2008 was less than in 2012.

In [15]:
p_value = paired_test['p-val'].values[0]
print(p_value)
print(f"p_value: {p_value:.10f}")


if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Failed to reject null hypothesis")

1.202835209065816e-257
p_value: 0.0000000000
Reject Null Hypothesis


## 4. Result Interpretation
In this execution, the p-value was approximately 1.2e-257, which is effectively zero and prints as 0.0000000000 when rounded to ten decimal places. Since this is far below 0.05, the null hypothesis is rejected. This provides statistically significant evidence to support the claim that the percentage of Republican candidate votes in 2008 was less than the percentage of votes in 2012.
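
The p-value shows up as zeros because fixed-point formatting with ten decimal places cannot represent a number this small. A minimal sketch of the difference, using the value printed in the run above:

```python
# p-value as printed by the notebook run above.
p_value = 1.202835209065816e-257

# Fixed-point formatting rounds the value to ten decimal places.
fixed = f"{p_value:.10f}"

# Scientific notation keeps the magnitude visible.
sci = f"{p_value:.3e}"

print(f"fixed-point: {fixed}")
print(f"scientific:  {sci}")
```

Reporting very small p-values in scientific notation, or as "p < 0.001", avoids implying that the p-value is exactly zero.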