![seaborn](https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/07/seaborn.png) 
# The good, the bad, the seaborn


#### What's wrong with this data visualization? (hint: so many things)

<img src="https://pbs.twimg.com/media/DNTFhGaXcAEbrMO.jpg" width=800>

Learning goals:
- Create a list of best practices for data visualization
- Identify the differences between matplotlib and seaborn
- Create a visualization with seaborn, applying best practices

## Goal 1: Create a list of best practices for data visualization

Documenting best practices:

In groups:
- Group 1: [article 1](https://www.jackhagley.com/What-s-the-difference-between-an-Infographic-and-a-Data-Visualisation)
- Group 2: [article 2](https://thoughtbot.com/blog/analyzing-minards-visualization-of-napoleons-1812-march)
- Group 3: [article 3](http://dataremixed.com/2016/04/the-design-of-everyday-visualizations/)
- Group 4: [article 4](https://visme.co/blog/data-storytelling-tips/)
- Group 5: [article 5](files/VisualizationsThatReallyWork.pdf)

To fill in: [Best practices deck](https://docs.google.com/presentation/d/1KTi7FbCpFsnNW4rxFV5GxB2sNnvpcXe75Fr3jZMJHio/edit?usp=sharing) 

## Goal 2:  Identify differences between seaborn & matplotlib


### Two code examples to accomplish the same plot:

**Resources:**
- [python graph gallery on seaborn](https://python-graph-gallery.com/seaborn/)
- [seaborn](https://seaborn.pydata.org/)


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
%matplotlib inline

# Load in data
tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")


In [None]:
# Matplotlib:

# Initialize Figure and Axes object
fig, ax = plt.subplots()

# Create violinplot
ax.violinplot(tips["total_bill"], vert=False)

# Show the plot
plt.show()

In [None]:
# Seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
tips = sns.load_dataset("tips")
# tips = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Create violinplot
# sns.violinplot(x="total_bill", data=tips)
sns.violinplot(x="total_bill", data=tips, palette="Pastel1")

# Show the plot
plt.show()
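
One practical consequence of seaborn being built on matplotlib: axes-level functions such as `violinplot` return a matplotlib `Axes`, so matplotlib's labelling methods still apply afterwards. A minimal sketch, using a small made-up `toy` frame (an assumption, so the cell runs without a download):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up bills standing in for the tips data (illustrative only)
toy = pd.DataFrame({"total_bill": [10.3, 16.9, 21.0, 23.7, 24.6, 25.3, 35.3, 48.2]})

# seaborn draws the violin and hands back the matplotlib Axes it drew on
ax = sns.violinplot(x="total_bill", data=toy)

# From here on it is plain matplotlib
ax.set_title("Distribution of total bill")
ax.set_xlabel("Total bill")
```

In a notebook with `%matplotlib inline` the figure appears when the cell finishes.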

### In depth comparison:

#### Groups 1:3

For each plot:
- How is the code to create it different from the matplotlib code?
- What are the customization options?
- What are the top 3 most important customization options to know (with code)?

Group 1 - [histograms](https://python-graph-gallery.com/histogram/)<br>
Group 2 - [scatter plot](https://python-graph-gallery.com/scatter-plot/)<br>
Group 3 - [boxplot](http://python-graph-gallery.com/boxplot/)<br>

#### Groups 4:5
- What new vocabulary was introduced in these posts?
- What is the benefit of these new options?
- What code/options do you need to know? 

Group 4 - [diverging, sequential, discrete color palettes](https://python-graph-gallery.com/101-make-a-color-palette-with-seaborn/)<br>
Group 5 - [seaborn themes](https://python-graph-gallery.com/104-seaborn-themes/) <br>

_Time to work:_ 15 minutes <br>
_Time to discuss as large group:_ 10 minutes
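
As a starting point for the palette discussion, here is a rough sketch of the three palette families using `sns.color_palette`. The palette names (`"Blues"`, `"RdBu"`, `"Set2"`) and the size of 5 are just illustrative picks, not the ones from the linked post:

```python
import seaborn as sns

# Sequential: one hue from light to dark, for ordered values
sequential = sns.color_palette("Blues", 5)

# Diverging: two hues meeting at a neutral midpoint, for values around a centre
diverging = sns.color_palette("RdBu", 5)

# Qualitative (discrete): distinct hues, for unordered categories
qualitative = sns.color_palette("Set2", 5)

# sns.palplot draws a palette as a strip of swatches
sns.palplot(diverging)
```

Each palette is a list of RGB tuples, so it can be passed to the `palette=` argument of most seaborn plotting functions.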

## Goal 3: Create a visualization with seaborn, applying best practices

[exercise from data world](https://data.world/makeovermonday/2018w37-paying-the-president)


In [None]:
import pandas as pd
df = pd.read_excel('https://query.data.world/s/5qxp2ldwsel3ow2pq5mkvfas2rfaup',converters={'date':pd.to_datetime})
df.head()

## Reflection:

- What worked from this training? 
- What can you apply moving forward?
- What's one concept you would like to practice more?

In [None]:
df.info()

In [None]:
df.type.value_counts()

#### For extra fun:
[visualization challenges](http://www.storytellingwithdata.com/blog/2019/3/1/swdchallenge-visualize-this-data)

[seaborn cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Seaborn_Cheat_Sheet.pdf)

In [None]:
df.purpose_scrubbed.value_counts()

In [None]:
sns.violinplot(x=df.amount)

In [None]:
pos = df[df.amount > 0]  # the log is undefined for zero and negative amounts
sns.violinplot(x=pos.type, y=np.log(pos.amount))

In [None]:
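# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(df.property.value_counts(), kde=True)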

In [None]:
pos = df[df.amount > 0]  # the log is undefined for zero and negative amounts
sns.violinplot(x=np.log(pos.amount), y=pos.purpose_scrubbed)

Let's try to copy the ProPublica chart (at least a little).
![title](img/propublica_trump_emoluments.png)

We need a month-year column, a log-payments column, and a source_category column. (We could compute these on the fly, but storing them as columns makes this easier on ourselves.)

In [None]:
df['month_year'] = df.date.dt.strftime('%Y-%m')

In [None]:
df.head()

In [None]:
amounts = [400, 1000, 10500]
# Order of magnitude of each amount; np.log10 is used because np.log(1000)/np.log(10) lands just below 3
[int(np.floor(np.log10(amount))) for amount in amounts]

In [None]:
df[df.amount < 0]

We probably should have a way of voiding entries that have a corresponding negative entry.
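
One possible way to do that, sketched on made-up data. The helper name `void_reversed`, its `keys` parameter, and the pairing rule are all assumptions: here an entry "corresponds" to a negative one when both share the same `source` and the same absolute amount, and each negative row cancels exactly one positive row.

```python
import pandas as pd

def void_reversed(frame, keys=("source",), amount_col="amount"):
    """Drop each negative row together with one positive row it cancels.

    Two rows cancel when they share `keys` and have opposite amounts.
    """
    work = frame[list(keys)].copy()
    work["_abs"] = frame[amount_col].abs()
    work["_positive"] = frame[amount_col] > 0
    group_cols = list(keys) + ["_abs"]

    # Number positives and negatives separately within each (keys, |amount|) group
    work["_rank"] = work.groupby(group_cols + ["_positive"], dropna=False).cumcount()

    # The n-th positive and the n-th negative of a group form a cancelling pair
    signs_in_pair = (
        work.groupby(group_cols + ["_rank"], dropna=False)["_positive"]
            .transform("nunique")
    )

    # Keep only rows that did not find a partner of the opposite sign
    return frame[(signs_in_pair < 2).to_numpy()]

# Made-up example: the first 100 is reversed by the -100, the second 100 is not
toy = pd.DataFrame({
    "source": ["A", "A", "A", "B", "B"],
    "amount": [100, -100, 100, 50, -20],
})
cleaned = void_reversed(toy)
```

On the real data the matching keys would probably need more columns (for example the property or purpose), which is a judgment call about what counts as the same payment.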

In [None]:
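def size_cat(amount):
    """Bucket a payment by order of magnitude: 0 is under 1,000 and 4 is 1,000,000 or more."""
    # Zero, negative and missing amounts have no usable logarithm
    if not amount > 0:
        return np.nan
    # np.log10 is exact at powers of ten; np.log(amount)/np.log(10) puts 1000 just below 3
    log = np.floor(np.log10(amount))
    if log < 3:
        return 0
    elif log < 4:
        return 1
    elif log < 5:
        return 2
    elif log < 6:
        return 3
    else:
        return 4

df.amount.map(size_cat).value_counts()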



In [None]:
df['amount_size'] = df.amount.map(size_cat)
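
The same buckets can also be built without a Python-level function call per row. A sketch with `pd.cut`, run on a few made-up amounts; the `edges` list is an assumption that mirrors the thresholds in `size_cat`, and the result is a categorical column rather than a numeric one:

```python
import numpy as np
import pandas as pd

# Made-up amounts covering every case, including a refund and a zero
amounts = pd.Series([400, 1000, 10500, -5, 0, 2_000_000])

# Left-closed bins: [0, 1e3) -> 0, [1e3, 1e4) -> 1, ..., [1e6, inf) -> 4
edges = [0, 1e3, 1e4, 1e5, 1e6, np.inf]

# Mask non-positive amounts first so they fall outside every bin and come out missing
bucket = pd.cut(
    amounts.where(amounts > 0),
    bins=edges,
    right=False,
    labels=[0, 1, 2, 3, 4],
)
```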

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df[df.type=='government'].source.value_counts()

In [None]:
def source_category(row):
    # row holds the (type, source) pair for one payment
    source_type, source = row
    if source_type == 'government':
        return 'Taxpayer dollars'
    elif source == 'Donald J. Trump for President, Inc.':
        return source
    else:
        return 'Other campaigns'

df['source_category'] = df[['type', 'source']].apply(source_category, axis=1)

In [None]:
df.head()

In [None]:
pd.DataFrame(df.groupby([df.month_year,df.source_category]).amount_size.value_counts())

In [None]:
df = df[~df.date.isna()]

In [None]:
df[df.source_category == 'Taxpayer dollars'][['month_year','amount_size','amount','date']] \
    .set_index('month_year') \
    .sort_index()



In [None]:
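# Pass an explicit order so the months run chronologically instead of in order of appearance
sns.countplot(x='month_year', hue='amount_size', data=df, order=sorted(df.month_year.dropna().unique()))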

In [None]:
# Count payments per month and size bucket, with one column per bucket
df_plot = df.groupby(['month_year', 'amount_size']).size().reset_index().pivot(columns='amount_size', index='month_year', values=0)
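
A pivoted count table like this is the usual input for a stacked bar chart, which is roughly the shape of the chart we are trying to copy. A minimal sketch on made-up rows (the `toy` frame and the axis labels are assumptions; `unstack` is used here as a shorter route to the same wide table):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up payments: one row per payment, already labelled with month and size bucket
toy = pd.DataFrame({
    "month_year": ["2017-01", "2017-01", "2017-01", "2017-02", "2017-02"],
    "amount_size": [0, 0, 1, 1, 2],
})

# Rows are months, columns are size buckets, cells are payment counts
counts = (
    toy.groupby(["month_year", "amount_size"])
       .size()
       .unstack(fill_value=0)
)

# pandas stacks one bar segment per column on top of each other
ax = counts.plot(kind="bar", stacked=True)
ax.set_xlabel("Month")
ax.set_ylabel("Number of payments")
plt.tight_layout()
```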
