Welcome to DS0. This is the IPython Notebook that accompanies the slides for the DS0 workshop.

In [2]:
from IPython.display import HTML
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# 1. Bad Visualization

We're going to use quite a few examples from Fox News to demonstrate bad visualization. Other major news outlets undoubtedly produce misleading graphs too, but these next few images are the most blatant examples I've discovered or been shown, which makes them useful for instruction. With this new eye for bad visualization, you will be able to spot misleading graphs all across the spectrum!

## NO: changing scale mid-axis

What does this mean?

It means the spacing used to portray a quantitative variable changes partway along the axis, so equal distances on the chart no longer represent equal amounts. It's easier to understand with an example:

![](axis.jpg) 

See how the x-axis goes from last year, to last week, to today? Why is that problematic?

## YES: keeping your scale consistent

In [15]:
gas_dict = {'date':pd.to_datetime(['2011-02-20','2012-02-13','2012-02-20']), 'price':[3.17, 3.51, 3.57]}

gas_df = pd.DataFrame(gas_dict).set_index('date')


plt.figure(figsize=(10,5))

...

plt.show()

<Figure size 720x360 with 0 Axes>
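The plotting code in the cell above is elided, so here is a minimal sketch of one way to draw the same three points on a consistent time axis. It is not necessarily what the workshop cell did. The axis labels, the title, and the price unit are assumptions added for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same three data points as in the cell above.
gas_df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2011-02-20", "2012-02-13", "2012-02-20"]),
        "price": [3.17, 3.51, 3.57],
    }
).set_index("date")

fig, ax = plt.subplots(figsize=(10, 5))

# Plotting against real datetimes spaces the points by elapsed time,
# so the one-year gap is drawn about 51 times wider than the one-week gap.
ax.plot(gas_df.index, gas_df["price"], marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Price (USD per gallon)")  # unit is an assumption
ax.set_title("Gas price on a consistent time axis")

plt.show()
```

Because the index holds datetimes rather than labels such as "last year" and "last week", matplotlib positions each point in proportion to the time that actually passed.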

## NO: pie graphs

![](Fox-News-pie-chart.png)

Why?

Confusing, misleading, and ugly. It's hard to tell the difference between 30% and 40% in a pie graph, even though that's a huge difference. Pie graphs may be okay for simple budget breakdowns and expense reports, but in most data science scenarios, stay away from the pie graph.

## YES: bar charts

Why?

Simple to read, exact, and pleasing to the eye. There is no confusion (as long as you have the right axis limits).
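To see the difference for yourself, the sketch below draws the same shares as a pie graph and as a bar chart. The category labels and the 30/40/30 split are hypothetical values made up for this comparison. They do not come from any of the charts shown in this notebook.

```python
import matplotlib.pyplot as plt

# Hypothetical categories and shares (percent), chosen only for illustration.
labels = ["A", "B", "C"]
shares = [30, 40, 30]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(10, 5))

# Pie: the reader has to compare angles and areas.
ax_pie.pie(shares, labels=labels)
ax_pie.set_title("Pie graph")

# Bar: the reader compares heights against a shared baseline at zero.
ax_bar.bar(labels, shares)
ax_bar.set_ylim(bottom=0)
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Bar chart")

fig.tight_layout()
plt.show()
```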

What if you don't have appropriate limits? 

 uh oh | oh no
- | - 
![alt](obc.jpg) | ![alt](Bush-cuts.png)

In [14]:
obc_dict = {'date':['March 27','March 31 (Goal)'], 'enrollment':[6000000, 7066000]}

obc_df = pd.DataFrame(obc_dict)

bush_dict = {'date':['Now', '01-01-2013'],'rate':[35, 39.6]}

bush_df = pd.DataFrame(bush_dict)

...

plt.show()
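The plotting code in the cell above is elided. As a rough sketch of one reasonable approach, the block below redraws both charts as bar charts whose y-axis starts at zero, so the bar heights stay proportional to the values. The axis labels are assumptions, and the layout may differ from what the workshop cell produced.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same data as in the cell above.
obc_df = pd.DataFrame(
    {"date": ["March 27", "March 31 (Goal)"], "enrollment": [6000000, 7066000]}
)
bush_df = pd.DataFrame({"date": ["Now", "01-01-2013"], "rate": [35, 39.6]})

fig, (ax_obc, ax_bush) = plt.subplots(1, 2, figsize=(10, 5))

# Starting the y-axis at zero keeps bar height proportional to the value.
ax_obc.bar(obc_df["date"], obc_df["enrollment"])
ax_obc.set_ylim(bottom=0)
ax_obc.set_ylabel("Enrollment")  # label is an assumption

ax_bush.bar(bush_df["date"], bush_df["rate"])
ax_bush.set_ylim(bottom=0)
ax_bush.set_ylabel("Rate (%)")  # label is an assumption

fig.tight_layout()
plt.show()
```

With a zero baseline, the second bar in each chart is only about 18% and 13% taller than the first, which matches the actual ratios in the data.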

Which leads to our next point...

## NO: truncating when not necessary 

![](warren.png)

## YES: showing the big picture

In [16]:
warr_dict = {'years':pd.to_datetime(['1970','1980','1990','2000','2010','2017']).year, 'rate':[53,51,49,47,46,45]}
warr_df = pd.DataFrame(warr_dict).set_index('years')

...

Ellipsis
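The plotting code for this cell is elided, so the following is only a sketch of how the same data could be drawn without truncating the y-axis. The years are written as plain integers here, which gives the same values as the `pd.to_datetime(...).year` call above. The axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Same values as in the cell above, with the years written as integers.
warr_df = pd.DataFrame(
    {
        "years": [1970, 1980, 1990, 2000, 2010, 2017],
        "rate": [53, 51, 49, 47, 46, 45],
    }
).set_index("years")

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(warr_df.index, warr_df["rate"], marker="o")

# Anchoring the y-axis at zero shows the decline relative to the full value
# instead of stretching an 8-point drop across the whole plot height.
ax.set_ylim(bottom=0)
ax.set_xlabel("Year")   # label is an assumption
ax.set_ylabel("Rate")   # label is an assumption

plt.show()
```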

# 3. Mean v. Median

The most straightforward example of why you would use the median over the mean comes from income data. When a distribution is skewed, a few extreme values pull the mean toward the tail, while the median stays near the "middle" of the data. We will demonstrate with visuals to make this clearer.
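Before turning to the real data, a tiny numeric sketch may help. The five incomes below are hypothetical values invented for this example and are not taken from the workshop data set.

```python
import numpy as np

# Hypothetical incomes: four typical earners and one very high earner.
toy_incomes = np.array([30_000, 35_000, 40_000, 45_000, 1_000_000])

mean_income = np.mean(toy_incomes)      # pulled upward by the single outlier
median_income = np.median(toy_incomes)  # the middle value of the sorted data

print(f"mean:   {mean_income:,.0f}")    # mean:   230,000
print(f"median: {median_income:,.0f}")  # median: 40,000
```

Four of the five people earn far less than the mean of 230,000, so the median of 40,000 is the better description of a typical income in this toy sample.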

In [6]:
incomes = ...
#incomes.head()

In [5]:
...

np.mean(incomes['Income'].dropna()) - np.median(incomes['Income'].dropna())

6711.829265280925

In [7]:
# more eda

In [9]:
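# Despite the name `pois`, these are 10,000 draws from an exponential
# distribution with scale 4 (mean 4), which is right-skewed.
pois = np.random.exponential(4, 10000)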

...

Ellipsis
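The rest of this cell is elided. As one possible way to visualize the point, the sketch below draws exponential samples with the same scale and size, then marks the sample mean and median on a histogram. It uses `np.random.default_rng` with an arbitrary seed for reproducibility, which is a substitution for the `np.random.exponential` call above. The bin count and styling are also assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
draws = rng.exponential(scale=4, size=10_000)

sample_mean = draws.mean()
sample_median = np.median(draws)

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(draws, bins=50)

# In a right-skewed distribution the long tail pulls the mean above the median.
ax.axvline(sample_mean, color="red", linestyle="--",
           label=f"mean = {sample_mean:.2f}")
ax.axvline(sample_median, color="black", linestyle="-",
           label=f"median = {sample_median:.2f}")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
ax.legend()

plt.show()
```

For an exponential distribution with scale 4, the theoretical mean is 4 and the theoretical median is 4 ln 2, roughly 2.77, so the two lines should land visibly apart.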