# Data Exploration

## Objectives
1. Visualizing data
2. Computing summary statistics
3. Aggregating data via pivot tables
4. Combining multiple tables on shared attributes

# Load Data Using Pandas

Pandas is a Python library (set of functions somebody else wrote) for doing data analysis. 

Pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

In [None]:
import pandas as pd #import is how we load libraries
pd.__version__

In [None]:
# ../data/abuse.csv is where I store the file - change to your location
# file is stored in df variable
df = pd.read_csv("../data/abuse.csv")

In [None]:
df.head()

In [None]:
# Let's remove any occurrence of "Total" rows to avoid double counting
total_rows = (df['characteristic'].str.match("Total") |
              df['race-ethnicity'].str.match("Total"))
dfc = df[~total_rows]

In [None]:
dfc.head()

# How many people do we have in total for each race?
1. `groupby` - split the rows into one group per race-ethnicity
2. `sum` - reduce each group to a single row of totals
3. More info: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
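
As a minimal sketch of the two steps above, the toy table below is grouped on one column and each group is then summed. The numbers are made up and the column names `group` and `count` are hypothetical; they are not part of the abuse data.

```python
import pandas as pd

# Toy data: made-up numbers, only to illustrate groupby + sum
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "count": [1, 2, 3, 4, 5],
})

# groupby splits the rows by the values in "group";
# sum then reduces each group to a single value
toy_totals = toy.groupby("group")["count"].sum()
print(toy_totals)
```

The result is indexed by the grouping column, which is why `ragg.index` can later be used as the x values of a bar chart.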

In [None]:
# numeric_only=True keeps text columns such as characteristic out of the sum
ragg = dfc.groupby('race-ethnicity').sum(numeric_only=True)
ragg

## Can we visualize that?

## Let's talk pictures in Python
There are many visualization libraries, but we will primarily use [matplotlib](https://matplotlib.org/index.html). We can use matplotlib directly or through Pandas.

![annotated figure with matplotlib terms for each aspect](figs/L06/anatomy_mpl.png)
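
The two routes can also be combined. The sketch below assumes a made-up `counts` Series (hypothetical values): Pandas draws the bars onto an Axes that we created ourselves, and matplotlib methods then set the title and axis labels named in the figure above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, only to have something to draw
counts = pd.Series([3, 7, 5], index=["a", "b", "c"])

# Figure is the whole image; Axes is the plotting area inside it
fig, ax = plt.subplots(figsize=(6, 3))

# Pandas draws onto the Axes we pass in
counts.plot.bar(ax=ax)

# matplotlib handles the customization
ax.set_title("Toy counts")
ax.set_xlabel("category")
ax.set_ylabel("count")
```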


In [None]:
%matplotlib inline 

import matplotlib.pyplot as plt

# create a figure and make it pretty wide: figsize=(width, height)
fig, ax = plt.subplots(figsize=(10,5)) 
# use matplotlib's bar function directly: ax.bar(x values, heights)
# assigning the result to _ suppresses the printed output
_ = ax.bar(ragg.index, ragg['Total-estimate'])

In [None]:
# Alternatively, we can use the Pandas wrapper around matplotlib to draw a bar chart
# Pandas: quicker to write; Matplotlib: more customization
_ = ragg['Total-estimate'].plot.bar()

# Can we see characteristic by race?
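
A pivot turns one column's values into row labels and another column's values into column labels. As a small sketch with made-up data (the names `row_key`, `col_key`, and `value` are hypothetical):

```python
import pandas as pd

# Toy long-format data with made-up values
long_df = pd.DataFrame({
    "row_key": ["x", "x", "y", "y"],
    "col_key": ["a", "b", "a", "b"],
    "value": [10, 20, 30, 40],
})

# Each unique row_key becomes a row and each unique col_key a column;
# pivot raises a ValueError if a (row_key, col_key) pair is repeated
wide = long_df.pivot(index="row_key", columns="col_key", values="value")
print(wide)
```

Passing a list to `values`, as the cell below does, gives the result two levels of column labels, so `table['Male-%']` selects one of the sub-tables.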

In [None]:
table = df.pivot(index='characteristic', columns='race-ethnicity',
                 values=['Male-%', 'Female-%', 'Total-%'])


In [None]:
table

In [None]:
_ = table['Male-%'].plot.bar()

In [None]:
# clean it up a bit using stacked bar plots
_ = table['Male-%'].plot.bar(stacked=True)

In [None]:
# Switch the grouping by flipping the table
table['Male-%'].T

In [None]:
_ = table['Male-%'].T.plot.bar(stacked=True)

# Practice:
1. See how women differ
2. Try a different column set

# Let's do some math!
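
Most Pandas reductions take an `axis` argument: `axis=0` (the default) reduces down each column, while `axis=1` reduces across the columns of each row. A small sketch with made-up numbers and hypothetical column names, where the last row is deliberately off by one:

```python
import pandas as pd

toy = pd.DataFrame({
    "part_a": [1, 2, 3],
    "part_b": [4, 5, 6],
    "total": [5, 7, 10],  # last row is deliberately off by one
})

# axis=1 adds across the columns, giving one sum per row
row_sums = toy[["part_a", "part_b"]].sum(axis=1)

# keep only the rows where the parts do not add up to the total
mismatched = toy[row_sums != toy["total"]]
print(mismatched)
```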

In [None]:
# Let's sum across the columns (axis=1) to get male + female for each row;
# each result should match that row's total estimate
dfc[['Male-estimate', 'Female-estimate']].sum(axis=1)

In [None]:
# let's see the rows where male + female does not equal the total
dfc[dfc[['Male-estimate', 'Female-estimate']].sum(axis=1) != dfc['Total-estimate']]

In [None]:
# mean per numeric column (numeric_only=True skips the text columns):
dfc.mean(numeric_only=True)

In [None]:
# full summary stats
dfc.describe()

In [None]:
# Standard deviation of each numeric column within each race-ethnicity
dfc.groupby(['race-ethnicity']).std(numeric_only=True)

In [None]:
# What if we want the summed estimates for each race-ethnicity?
estimates = dfc.groupby(['race-ethnicity'])[['Male-estimate', 'Female-estimate', 'Total-estimate']].sum()

In [None]:
estimates

In [None]:
# remove total so we're not double counting on the visualization
est_sex = estimates[['Male-estimate', 'Female-estimate']]
_ = est_sex.plot.bar(stacked=True)

In [None]:
# plots need a lot of polishing; this one is just exploratory
_ = est_sex.plot.pie(subplots=True, figsize=(10,5))

In [None]:
_ = est_sex.T.plot.bar(stacked=True)

In [None]:
# est_sex already excludes totals, so this shows the male/female split for one race-ethnicity
_ = est_sex.T['White'].plot.pie()

# Practice
Try plotting a different race-ethnicity

In [None]:
# Let's use a boxplot to visualize the different groupings
# using the seaborn visualization library to provide stats graphs
import seaborn as sns
sns.boxplot(x = 'race-ethnicity', y = 'Male-estimate', data=dfc)

# Practice 
Get the mean for each demographic/characteristic (aggregate over race)

# How do we join two datasets?
![table merge where 1st row is scanned and on match with element in second row, new row is created with elements of both](figs/L06/merge.gif)

Source [Randy Au, Can we stop with the SQL JOINs venn diagrams insanity?](https://towardsdatascience.com/can-we-stop-with-the-sql-joins-venn-diagrams-insanity-16791d9250c3?sk=f8bfa36658362ee6d54951681967a45b)
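
The same idea can be sketched in Pandas with two made-up tables (the names `key1`, `key2`, and `estimate` are hypothetical). By default `pd.merge` keeps only the key combinations found in both tables, and `suffixes` renames the non-key columns that share a name.

```python
import pandas as pd

# Two toy tables that share both key columns and an "estimate" column
left = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "key2": ["x", "y", "x"],
    "estimate": [1, 2, 3],
})
right = pd.DataFrame({
    "key1": ["a", "b", "b"],
    "key2": ["x", "x", "y"],
    "estimate": [10, 30, 40],
})

# Rows are matched on the (key1, key2) pair; only pairs present in both
# tables survive the default inner join
merged = pd.merge(left, right, on=["key1", "key2"],
                  suffixes=("-left", "-right"))
print(merged)
```

Here only the pairs `(a, x)` and `(b, x)` appear in both tables, so the merged table has two rows.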


In [None]:
df2 = pd.read_csv("../data/dependency.csv")

In [None]:
df2.head()

In [None]:
# need both characteristic and race-ethnicity to identify a row uniquely
pd.merge(df, df2, on=['characteristic', 'race-ethnicity'])

In [None]:
# let's use more descriptive suffixes than the default _x, _y
data = pd.merge(df, df2, on=['characteristic', 'race-ethnicity'], suffixes=('-abuse', '-dependency'))

In [None]:
data.head()

In [None]:
# let's get just estimate data
ecol = [est for est in data.columns if 'estimate' in est]
ecol

In [None]:
estdf = data[['characteristic', 'race-ethnicity']+ecol]

In [None]:
estdf.head()

# Practice
1. For each race/characteristic/sex, find whether abuse or dependency is higher, and the difference between the two
2. Visualize the difference