# Data Exploration 04

A political think tank is preparing a public relations campaign on a variety of policy issues.

In order to understand how they should best allocate their time, they've asked you to calculate some probabilities based on prior Congressional voting history.

## Part 1: Import Pandas and load the data

The dataset for this exploration is stored at the following url:

`https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/house-votes-84.csv`


### Initial Data Analysis
Once you've loaded the data, it's a good idea to poke around a little bit to find out what you're dealing with.

Some questions you might ask include:

* What does the data look like?
* What kind of data is in each column? 
* Do any of the columns have missing values? 

In [4]:
# Part 1: Enter your code below to import Pandas according to the 
# conventional method. Then load the dataset into a Pandas dataframe.
import pandas as pd
votes = pd.read_csv('https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/house-votes-84.csv')

# Write any code needed to explore the data by seeing what the first few 
# rows look like. Then display a technical summary of the data to determine
# the data types of each column, and which columns have missing data.
votes.info()
votes.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 17 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   class                                   435 non-null    object
 1   handicapped-infants                     435 non-null    object
 2   water-project-cost-sharing              435 non-null    object
 3   adoption-of-the-budget-resolution       435 non-null    object
 4   physician-fee-freeze                    435 non-null    object
 5   el-salvador-aid                         435 non-null    object
 6   religious-groups-in-schools             435 non-null    object
 7   anti-satellite-test-ban                 435 non-null    object
 8   aid-to-nicaraguan-contras               435 non-null    object
 9   mx-missile                              435 non-null    object
 10  immigration                             435 non-null    object
 11  synfue

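|   | class | handicapped-infants | water-project-cost-sharing | adoption-of-the-budget-resolution | physician-fee-freeze | el-salvador-aid | religious-groups-in-schools | anti-satellite-test-ban | aid-to-nicaraguan-contras | mx-missile | immigration | synfuels-corporation-cutback | education-spending | superfund-right-to-sue | crime | duty-free-exports | export-administration-act-south-africa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | republican | n | y | n | y | y | y | n | n | n | y | ? | y | y | y | n | y |
| 1 | republican | n | y | n | y | y | y | n | n | n | n | n | y | y | y | n | ? |
| 2 | democrat | ? | y | y | ? | y | y | n | n | n | n | y | n | y | y | n | n |
| 3 | democrat | n | y | y | n | ? | y | n | n | n | n | y | n | y | n | n | y |
| 4 | democrat | y | y | y | n | y | y | n | n | n | n | y | ? | y | y | y | y |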
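
Note that `info()` reports 435 non-null values in every column even though the first rows contain `?` entries, so this dataset appears to mark unrecorded votes with `?` rather than with a true missing value. The sketch below rebuilds three columns of the five rows shown above as a stand-in for the full dataset (the names `sample`, `placeholder_counts`, and `cleaned` are assumptions for illustration). It counts the placeholders and then converts them to real missing values.

```python
import pandas as pd

# Stand-in for the full dataset: three columns from the five rows shown above.
sample = pd.DataFrame({
    "class": ["republican", "republican", "democrat", "democrat", "democrat"],
    "handicapped-infants": ["n", "n", "?", "n", "y"],
    "education-spending": ["y", "y", "n", "n", "?"],
})

# info() treats "?" as an ordinary string, so count the placeholders directly.
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)

# Replace "?" with a real missing value so isna() and dropna() can see it.
cleaned = sample.replace("?", pd.NA)
print(cleaned.isna().sum())
```

When loading the full file, passing `na_values='?'` to `pd.read_csv` should have a similar effect at read time.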


## Part 2: Simple Probabilities

An easy way to calculate simple categorical feature probabilities in Pandas is through the [value_counts](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.value_counts.html) function.

Calculate the following:
   * P(class = 'republican')
   * P(class = 'democrat')
   * P(voted 'y' on education-spending)

In [18]:
# Part 2: Enter the code needed to calculate the simple probabilities
# listed above.

# Using normalized value counts gives us the probabilities
votes['class'].value_counts(normalize=True)

democrat      0.613793
republican    0.386207
Name: class, dtype: float64

In [19]:
votes['education-spending'].value_counts(normalize=True)

n    0.535632
y    0.393103
?    0.071264
Name: education-spending, dtype: float64
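
Because `?` counts as its own category here, the 0.393103 above is the share of all 435 rows with a `y` vote, including members with no recorded vote. If the share among recorded votes is wanted instead, the `?` rows can be excluded first. A minimal sketch on a made-up ten-vote series (the values and the names `sample_votes`, `p_all`, and `p_recorded` are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical votes for illustration only; "?" marks an unrecorded vote.
sample_votes = pd.Series(["y", "n", "n", "?", "y", "n", "y", "n", "?", "n"])

# Probability over every row, with "?" treated as a category of its own.
p_all = sample_votes.value_counts(normalize=True)

# Probability over recorded votes only: drop the "?" rows before counting.
recorded = sample_votes[sample_votes != "?"]
p_recorded = recorded.value_counts(normalize=True)

print(p_all["y"])       # 3 of 10 rows
print(p_recorded["y"])  # 3 of 8 recorded votes
```

Which denominator is appropriate depends on the question being asked, so it is worth stating the choice alongside any reported probability.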

## Part 3: Joint Probabilities
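
A joint probability is the chance that two conditions hold at once, for example P(class = 'republican' and education-spending = 'y'). One possible approach, sketched here on the five rows displayed in Part 1 rather than the full dataset, is `pd.crosstab` with `normalize='all'`, which divides each cell count by the total number of rows. The names `sample`, `joint`, and `p_rep_and_yes` are assumptions for illustration.

```python
import pandas as pd

# Two columns from the five rows displayed in Part 1 (not the full dataset).
sample = pd.DataFrame({
    "class": ["republican", "republican", "democrat", "democrat", "democrat"],
    "education-spending": ["y", "y", "n", "n", "?"],
})

# Each cell is the fraction of all rows with that (class, vote) combination.
joint = pd.crosstab(
    sample["class"], sample["education-spending"], normalize="all"
)
print(joint)

# Look up one joint probability from the table.
p_rep_and_yes = joint.loc["republican", "y"]

# The same value computed from a boolean mask, as a cross-check.
mask = (sample["class"] == "republican") & (sample["education-spending"] == "y")
p_from_mask = mask.mean()
```

On this five-row sample both calculations count 2 matching rows out of 5; the full dataset will give different values.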


In [None]:
# Write the code to explore how different features affect the survival distribution


### Adding Another Dimension
Now, let's use the `hue` parameter to allow us to add a third dimension to our data.

- Choose pairs of features you think are interesting and chart them against the survival distribution.

In [None]:
# Write the code to visualize passenger survival rates based on two different
# features.

## Part 4: Feature Engineering

The museum curator wonders if the passenger's rank and title might have anything to do with whether or not they survived. Since this information is embedded in their name, we'll use "feature engineering" to create two new columns:

- Title: The passenger's title
- Rank: A boolean (true/false) indicating if a passenger was someone of rank.

For the first new column, you'll need to find a way to [extract the title portion of their name](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html). Be sure to clean up any whitespace or extra punctuation.

For the second new column, you'll need to first look at a summary of your list of titles and decide what exactly constitutes a title of rank. Will you include military and ecclesiastical titles? Once you've made your decision, create the second column.

You may want to review prior Data Explorations for tips on creating new columns and checking for lists of values.
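
As a rough sketch of the two steps described above, the following assumes the names live in a column called `Name` and follow a "Surname, Title. Given names" layout. Neither assumption is confirmed by this document, and the names and the `rank_titles` list below are made up for illustration.

```python
import pandas as pd

# Made-up names; the "Surname, Title. Given names" layout is an assumption.
passengers = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane",
        "Smith, Col. Adam",
        "Jones, Rev. Paul",
        "Brown, Miss. Anne",
    ]
})

# Take the text between the comma and the first period, then trim whitespace.
passengers["Title"] = (
    passengers["Name"]
    .str.split(",").str[1]
    .str.split(".").str[0]
    .str.strip()
)

# One possible definition of rank; which titles belong here is a judgment call.
rank_titles = ["Col", "Rev"]
passengers["Rank"] = passengers["Title"].isin(rank_titles)

print(passengers)
```

Calling `passengers["Title"].value_counts()` on the real data is a quick way to see the full list of titles before deciding what goes into `rank_titles`.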

In [None]:
# Enter the code needed to create the two new columns

### Revisit Visualizations
Now that you have the new columns in place, revisit the pairwise comparison plots to see if the new columns reveal any interesting relationships. Don't forget to check with and without different `hue` variations.

In [None]:
# Enter the code needed to recheck the pairwise comparison. Try different variations of the hue parameter.

### Simplifying Data
There appear to be many variations of similar titles (such as abbreviations for Miss and Mademoiselle).

Scan through the different titles to see which titles can be consolidated, then use what you know about data manipulation to simplify the distribution.

Once you've finished, check the visualizations again to see if that made any difference.
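
One way to consolidate variations is to pass a mapping dictionary to `Series.replace`. The sketch below uses a hypothetical list of titles and a hypothetical mapping (`title_map`); which variants should actually be merged is a decision to make after scanning the real values.

```python
import pandas as pd

# Hypothetical titles for illustration only.
titles = pd.Series(["Mr", "Miss", "Mlle", "Mrs", "Mme", "Ms", "Mr"])

# Assumed mapping from variant spellings to a single canonical title.
title_map = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}

# Values not listed in the mapping are left unchanged.
simplified = titles.replace(title_map)

print(simplified.value_counts())
```

Comparing `titles.nunique()` with `simplified.nunique()` gives a quick check of how much the distribution was simplified.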

In [None]:
# Enter the code needed to consolidate some of the different title variations 
# Recheck the pairwise distributions to see if it made a difference.

## Part 5: Conclusions

Based on your analysis, what interesting relationships did you find? Write three interesting facts the museum can use in their exhibit.

## 🌟 Above and Beyond 🌟

The museum curator has room for a couple of nice visualizations for the exhibit. 

1. Use Seaborn's customization features to clean up some of the more interesting visualizations to make them suitable for public display.

2. Use the [GeoPandas library](https://geopandas.org) to create a [Choropleth Map](https://geopandas.org/mapping.html#choropleth-maps) of the likelihood of a Titanic passenger surviving based on their port of embarkation.