# Data Exploration 04

A political think tank is preparing a public relations campaign on a variety of policy issues.

In order to understand how they should best allocate their time, they've asked you to calculate some probabilities based on prior Congressional voting history.

## Part 1: Import Pandas and load the data

The dataset for this exploration is stored at the following url:

`https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/house-votes-84.csv`


### Initial Data Analysis
Once you've loaded the data, it's a good idea to poke around a little bit to find out what you're dealing with.

Some questions you might ask include:

* What does the data look like?
* What kind of data is in each column? 
* Do any of the columns have missing values? 

In [74]:
# Part 1: Enter your code below to import Pandas according to the 
# conventional method. Then load the dataset into a Pandas dataframe.
import pandas as pd
votes = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/house-votes-84.csv')

# Write any code needed to explore the data by seeing what the first few 
# rows look like. Then display a technical summary of the data to determine
# the data types of each column, and which columns have missing data.
votes.info()
votes.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   class                                   435 non-null    object
 1   handicapped-infants                     435 non-null    object
 2   water-project-cost-sharing              435 non-null    object
 3   adoption-of-the-budget-resolution       435 non-null    object
 4   physician-fee-freeze                    435 non-null    object
 5   el-salvador-aid                         435 non-null    object
 6   religious-groups-in-schools             435 non-null    object
 7   anti-satellite-test-ban                 435 non-null    object
 8   aid-to-nicaraguan-contras               435 non-null    object
 9   mx-missile                              435 non-null    object
 10  immigration                             435 non-null    object
 11  synfue

Unnamed: 0,class,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
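The `info()` summary above reports 435 non-null values in every column, yet the first few rows clearly contain `?` entries. One way to answer the missing-values question is to count the placeholder directly, or to re-read the data with `?` treated as missing. The sketch below is a minimal illustration that uses only the five rows shown above, trimmed to four columns so it runs without network access. The variable names are assumptions for this example.

```python
import io
import pandas as pd

# The five rows displayed by votes.head(), trimmed to four columns
# (an illustrative subset, not the full dataset).
raw = (
    "class,handicapped-infants,physician-fee-freeze,education-spending\n"
    "republican,n,y,y\n"
    "republican,n,y,y\n"
    "democrat,?,?,n\n"
    "democrat,n,n,n\n"
    "democrat,y,n,?\n"
)

sample = pd.read_csv(io.StringIO(raw))

# Count the '?' placeholders per column; Pandas sees them as ordinary strings,
# which is why info() reports every value as non-null.
placeholder_counts = (sample == "?").sum()

# Alternatively, tell read_csv to treat '?' as a missing value up front.
sample_na = pd.read_csv(io.StringIO(raw), na_values="?")
missing_counts = sample_na.isna().sum()

print(placeholder_counts)
print(missing_counts)
```

The same two approaches should carry over to the full `votes` DataFrame, for example `(votes == '?').sum()`.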


## Part 2: Simple Probabilities

An easy way to calculate simple categorical feature probabilities in Pandas is through the [value_counts() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html).

Calculate the following:
   * P(class = 'republican')
   * P(class = 'democrat')
   * P(voted 'y' on education-spending)

In [18]:
# Write the code to calculate the specified probabilities

# We could do the value_counts, then divide by the length of the data set,
# but using normalized value counts gives us the probabilities directly.
votes['class'].value_counts(normalize=True)

democrat      0.613793
republican    0.386207
Name: class, dtype: float64

In [19]:
votes['education-spending'].value_counts(normalize=True)

n    0.535632
y    0.393103
?    0.071264
Name: education-spending, dtype: float64
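The normalized counts above are a Series indexed by vote, so a single probability can be pulled out by label. Note that the denominator includes the `?` rows. If the think tank only cares about representatives who actually cast a vote, those rows can be filtered out first. The sketch below uses a hypothetical ten-row column rather than the real data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy column standing in for votes['education-spending'].
education = pd.Series(
    ["y", "n", "n", "?", "y", "n", "n", "y", "n", "n"],
    name="education-spending",
)

# Probability table, then a single entry selected by its label.
probs = education.value_counts(normalize=True)
p_yes = probs["y"]

# Equivalent shortcut: the mean of a boolean mask is the fraction of True values.
p_yes_alt = (education == "y").mean()

# Same probability restricted to rows with a recorded yea/nay vote.
voted = education[education != "?"]
p_yes_among_voters = (voted == "y").mean()

print(p_yes, p_yes_alt, p_yes_among_voters)
```

On the real data, `votes['education-spending'].value_counts(normalize=True)['y']` should return the 0.393103 shown in the output above.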

## Part 3: Joint Probabilities
An easy way to calculate joint probabilities in Pandas is by combining the [groupby() function](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html#aggregating-statistics-grouped-by-category) with the [value_counts() function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html).

Note that in this pattern value_counts() is called on a single column (a Pandas Series) selected from the grouped data, not on the entire DataFrame. See [this article](https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html#each-column-in-a-dataframe-is-a-series) for details on the difference.

In 1984, Congress [voted on two similar foreign-aid bills](https://www.nytimes.com/1984/05/25/world/military-aid-bill-for-el-salvador-passed-by-house.html): one to provide military aid to El Salvador (`el-salvador-aid`), the other to provide military aid to rebels in Nicaragua (`aid-to-nicaraguan-contras`).

Calculate the following probabilities:

* The probability that a representative voted *for* both aid packages.
* The probability that a representative voted *against* both aid packages.
* The probability that a representative voted to provide aid to El Salvador, but not Nicaragua.
* The probability that a representative voted to provide aid to Nicaragua, but not El Salvador.


In [64]:
# Write the code to calculate the specified probabilities

# If we define E as the event: "voting in favor of aid for El Salvador"
# and N as the event: "voting in favor of aid for Nicaragua"
#
# Grouping by one column and then doing a value count of the other
# will give us the tallies we need to figure out the joint probabilities
#
#   votes.groupby('el-salvador-aid')['aid-to-nicaraguan-contras'].value_counts()
#
# We can't use the same normalize=True trick from earlier, because Pandas will
# calculate the percentages within each group, and we need them across the 
# entire dataset.
#
# So instead, we can either manually divide the tallies by the length:
#
#     We know that n = 435 from our data exploration earlier, so...
#
#       P(E, N)   =  31 / 435 = 0.071
#       P(¬E, ¬N) =   2 / 435 = 0.005
#       P(E, ¬N)  = 172 / 435 = 0.395
#       P(¬E, N)  = 204 / 435 = 0.469
#
# or, we can take advantage of Pandas's ability to perform mathematical operations
# in a vectorized way. See https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/05_add_columns.html#min-tut-05-columns 
# for an intro, or https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#flexible-binary-operations for more details.

votes.groupby('el-salvador-aid')['aid-to-nicaraguan-contras'].value_counts() / len(votes)

# Note that we would have gotten the same results if we had grouped and aggregated
# in the other direction:
#
# votes.groupby('aid-to-nicaraguan-contras')['el-salvador-aid'].value_counts() / len(votes)
#


el-salvador-aid  aid-to-nicaraguan-contras
?                y                            0.016092
                 ?                            0.009195
                 n                            0.009195
n                y                            0.468966
                 ?                            0.004598
                 n                            0.004598
y                n                            0.395402
                 y                            0.071264
                 ?                            0.020690
Name: aid-to-nicaraguan-contras, dtype: float64
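The joint probabilities above can also be laid out as a two-way table. One option, sketched here as an alternative rather than the approach the exploration prescribes, is `pd.crosstab` with `normalize='all'`, which divides every cell by the grand total. The toy data below is hypothetical and only stands in for the two real columns.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of `votes`.
toy = pd.DataFrame({
    "el-salvador-aid":           ["y", "y", "n", "n", "n", "y", "?", "n"],
    "aid-to-nicaraguan-contras": ["n", "y", "y", "y", "n", "n", "y", "y"],
})

# Rows are El Salvador votes, columns are Nicaragua votes; normalize='all'
# divides each cell by the total number of rows, giving joint probabilities.
joint = pd.crosstab(
    toy["el-salvador-aid"],
    toy["aid-to-nicaraguan-contras"],
    normalize="all",
)

p_both = joint.loc["y", "y"]      # voted for both packages
p_neither = joint.loc["n", "n"]   # voted against both packages
p_e_not_n = joint.loc["y", "n"]   # El Salvador yes, Nicaragua no
p_n_not_e = joint.loc["n", "y"]   # Nicaragua yes, El Salvador no

# The groupby approach from the cell above yields the same numbers,
# indexed by (el-salvador-aid, aid-to-nicaraguan-contras) pairs.
by_group = (
    toy.groupby("el-salvador-aid")["aid-to-nicaraguan-contras"].value_counts()
    / len(toy)
)

print(joint)
print(by_group.loc[("y", "n")])
```

Either form makes it possible to look up one probability by its pair of labels instead of reading it off the printed output.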

## Part 4: Conditional Probabilities

In 1984, Congress passed the [Equal Access Act](https://mtsu.edu/first-amendment/article/1077/equal-access-act-of-1984), which forbids public secondary schools from receiving federal funds if they deny students the First Amendment right to conduct meetings because of the “religious, political, philosophical, or other content of the speech at such meetings.”

The results of this vote are recorded in the `religious-groups-in-schools` column of the dataset, coded as one of the following:

* `y` - Voted yea (in favor of passage)
* `n` - Voted nay (against passage)
* `?` - Abstained

Calculate the following conditional probabilities:

* The probability of a Democratic representative voting Yea on the Equal Access Act.
* The probability of a Republican representative voting Yea on the Equal Access Act.

P(Y|D)
P(Y|R)

In [73]:
# Write the code to calculate the specified probabilities

# If we define Y as the event: "Voted Yea for Equal Access Act"
# and D as the event: "The representative is a Democrat"
# and R as the event: "The representative is a Republican"
#

# P(Y | D)
democrats = votes[ votes['class'] == 'democrat' ]
democrats['religious-groups-in-schools'].value_counts()

n    135
y    123
?      9
Name: religious-groups-in-schools, dtype: int64
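The tallies above still need to be divided by the number of Democrats to become a probability: P(Y|D) = 123 / (135 + 123 + 9) = 123 / 267 ≈ 0.461. The sketch below shows one way to get both conditional probabilities in a single step, using the fact noted in Part 3 that `normalize=True` after a `groupby` normalizes within each group. It runs on a hypothetical eight-row toy frame, so its numbers are illustrative and are not the real results for either party.

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of `votes`.
toy = pd.DataFrame({
    "class": ["democrat"] * 5 + ["republican"] * 3,
    "religious-groups-in-schools": ["y", "n", "n", "y", "?", "y", "y", "n"],
})

# Within-group normalization: each party's proportions sum to 1,
# which is exactly P(vote | party).
conditional = (
    toy.groupby("class")["religious-groups-in-schools"]
    .value_counts(normalize=True)
)
p_y_given_d = conditional.loc[("democrat", "y")]
p_y_given_r = conditional.loc[("republican", "y")]

# Cross-check against the definition P(Y | D) = P(Y, D) / P(D).
is_democrat = toy["class"] == "democrat"
voted_yea = toy["religious-groups-in-schools"] == "y"
p_y_and_d = (is_democrat & voted_yea).mean()
p_d = is_democrat.mean()

print(p_y_given_d, p_y_given_r, p_y_and_d / p_d)
```

Applied to the real data, the same `groupby('class')` call should reproduce the 123 / 267 figure for Democrats and give the Republican value alongside it.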

### Revisit Visualizations
Now that you have the new columns in place, revisit the pairwise comparison plots to see if the new columns reveal any interesting relationships. Don't forget to check with and without different `hue` variations.

In [None]:
# Enter the code needed to recheck the pairwise comparison. Try different variations of the hue parameter.

### Simplifying Data
There appear to be many variations of similar titles (such as abbreviations for Miss and Mademoiselle).

Scan through the different titles to see which titles can be consolidated, then use what you know about data manipulation to simplify the distribution.

Once you've finished, check the visualizations again to see if that made any difference.

In [None]:
# Enter the code needed to consolidate some of the different title variations 
# Recheck the pairwise distributions to see if it made a difference.

## Part 5: Conclusions

Based on your analysis, what interesting relationships did you find? Write three interesting facts the museum can use in their exhibit.

## 🌟 Above and Beyond 🌟

The museum curator has room for a couple of nice visualizations for the exhibit. 

1. Use Seaborn's customization features to clean up some of the more interesting visualizations to make them suitable for public display.

2. Use the [GeoPandas library](https://geopandas.org) to create a [Choropleth Map](https://geopandas.org/mapping.html#choropleth-maps) of the likelihood of a Titanic passenger surviving based on their port of embarkation.