# Data Storytelling
* Data visualization in detail.

## Index.
### 1. Less is More
* [Hide tick marks with Axes.tick_params()](#axes-tick-params)
* [Hide matplotlib.spines.Spine](#matplotlib-spine-spine)
* [Manipulate x-axis / y-axis limit](#manipulate)
    
### 2. Color, Layout, and Annotations
* [Color blind palettes](#color-blind-pallets)
* [Line width](#line-width)
* [Improve layout and ordering](#improve-layout-and-ordering)
* [Legend](#legend)

### 3. Conditional Plots using Seaborn
* [Creating histograms in Seaborn](#creating-histograms-in-seaborn)
* [Generating a kernel density plot](#generating-a-kernel-density-plot)
* [Modifying the appearance of the plots](#modifying-the-appearance)
* [Conditional distributions using a single condition](#conditional-distributions-single)
* [Conditional distributions using two conditions](#conditional-distributions-double)
* [Conditional distributions using three conditions](#conditional-distributions-three)

## 1. Less is More
* How to remove unnecessary details from a graph.

### 1-1. <a name='axes-tick-params'></a>Hide tick marks with Axes.tick_params()
The parameters for enabling or disabling tick marks are conveniently named after the sides. To hide all of them, we pass `False` for each parameter when we call `Axes.tick_params()`. Older tutorials pass the string `"off"`, which matplotlib 2.x accepted; current releases expect a Boolean, so the string form should be avoided.

* bottom: False
* top: False
* left: False
* right: False

```python
import pandas as pd
import matplotlib.pyplot as plt

women_degrees = pd.read_csv('percent-bachelors-degrees-women-usa.csv')

fig, ax = plt.subplots()

ax.plot(women_degrees.Year, women_degrees.Biology, c='b', label="Women")
ax.plot(women_degrees.Year, 100-women_degrees.Biology, c='g', label="Men")

##### hide tick marks #####
ax.tick_params(bottom=False, top=False, left=False, right=False)

plt.title("Percentage of Biology Degrees Awarded By Gender")
plt.legend(loc="upper right")
plt.show()
```

### 1-2. <a name='matplotlib-spine-spine'></a>Hide matplotlib.spines.Spine
In matplotlib, the spines are represented using the [matplotlib.spines.Spine class](http://matplotlib.org/api/spines_api.html). When we create an Axes instance, four Spine objects are created for us. If you run `print(ax.spines)`, older matplotlib versions print a dictionary of the Spine objects (newer versions wrap them in a dictionary-like container that is accessed the same way):
```python
{'right': <matplotlib.spines.Spine object at 0x111089c18>, 'bottom': <matplotlib.spines.Spine object at 0x111060898>, 'top': <matplotlib.spines.Spine object at 0x1110606a0>, 'left': <matplotlib.spines.Spine object at 0x11107cd30>}
```

To hide all of the spines, we need to:

* access each Spine object in the dictionary
* call the Spine.set_visible() method
* pass in the Boolean value False

```python
fig, ax = plt.subplots()
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women')
ax.plot(women_degrees['Year'], 100-women_degrees['Biology'], label='Men')
ax.tick_params(bottom=False, top=False, left=False, right=False)

##### Hide spines #####
for sp in ["left", "right", "top", "bottom"]:
    ax.spines[sp].set_visible(False)

ax.legend(loc='upper right')
ax.set_title('Percentage of Biology Degrees Awarded By Gender')
plt.show()
```

### 1-3. <a name='manipulate'></a>Manipulate x-axis / y-axis limit
A good chart uses a consistent style for the elements that aren't directly conveying the data points. These elements are part of the non-data ink in the chart. By keeping the non-data ink as consistent as possible across multiple plots, differences in those elements stick out easily to the viewer. In the code below, every subplot uses the same x-axis and y-axis limits.

![pic1](https://s3.amazonaws.com/dq-content/four_major_categories_plots.png)
```python
major_cats = ['Biology', 'Computer Science', 'Engineering', 'Math and Statistics']
fig = plt.figure(figsize=(12, 12))

for sp in range(0,4):
    ax = fig.add_subplot(2,2,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c='blue', label='Women')
    ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c='green', label='Men')
    # Hide all four spines of the current subplot.
    for key, spine in ax.spines.items():
        spine.set_visible(False)
        
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(major_cats[sp])
    ax.tick_params(bottom=False, top=False, right=False, left=False)
        
# Calling pyplot.legend() here will add the legend to the last subplot that was created.
plt.legend(loc='upper right')
plt.show()
```
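Because the same spine, tick, and limit settings are repeated for every subplot, one way to keep the non-data ink consistent is to collect them in a single function. The sketch below is only an illustration: `strip_non_data_ink` is a hypothetical helper name, not part of matplotlib, and the `Agg` backend line is there only so the snippet runs without a display (drop it in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt


def strip_non_data_ink(ax, xlim=(1968, 2011), ylim=(0, 100)):
    """Hypothetical helper: apply the same minimal styling to any Axes."""
    # Hide all four spines.
    for spine in ax.spines.values():
        spine.set_visible(False)
    # Hide the tick marks on every side (Booleans, not "off" strings).
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    # Use identical axis limits so the subplots are directly comparable.
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    return ax


fig, axes = plt.subplots(2, 2, figsize=(12, 12))
for ax in axes.flat:
    strip_non_data_ink(ax)
```

Calling the helper inside the plotting loop would replace the repeated `set_visible`, `set_xlim`, `set_ylim`, and `tick_params` lines.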

## 2. Color, Layout, and Annotations

### 2-1. <a name='color-blind-pallets'></a>Color blind palettes

When selecting colors, we need to be mindful of people who have some amount of color blindness. People who have [color blindness](https://en.wikipedia.org/wiki/Color_blindness) have a decreased ability to distinguish between certain kinds of colors. **The most common form of color blindness is red-green color blindness, where the person can't distinguish between red and green shades. Approximately 8% of men and 0.5% of women of Northern European descent suffer from red-green color blindness.**

* If we want to publish the data visualizations we create, **we need to be mindful of color blindness.** Thankfully, there are color palettes we can use that are friendly for people with color blindness. One of them is called **Color Blind 10 and was released by Tableau**, the company that makes the data visualization platform of the same name. In the published list of palettes, **select just the Color Blind 10 option to see the ten colors included in the palette.**

#### Color blind palette (RGB values)
![color-blind-pallets](https://s3.amazonaws.com/dq-content/tableau_rgb_values.png)

The first color in the palette is a color that resembles dark blue and has the following RGB values:

* Red: 0
* Green: 107
* Blue: 164

```python
fig = plt.figure(figsize=(12, 12))

for sp in range(0,4):
    ax = fig.add_subplot(2,2,sp+1)
    
    # The color for each line is assigned here.
    cb_dark_blue = (0/255, 107/255, 164/255)
    cb_orange = (255/255, 128/255, 14/255)
    
    ax.plot(women_degrees['Year'], women_degrees[major_cats[sp]], c=cb_dark_blue, label='Women')
    ax.plot(women_degrees['Year'], 100-women_degrees[major_cats[sp]], c=cb_orange, label='Men')
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(major_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)

plt.legend(loc='upper right')
plt.show()
```
![result1](images/graph1.png)
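Matplotlib expects each RGB channel as a float between 0 and 1, which is why the code above divides the palette's 0-255 values by 255. As a minimal sketch, that conversion can be wrapped in a small function; `rgb255_to_unit` is a hypothetical helper name chosen for this example.

```python
def rgb255_to_unit(red, green, blue):
    """Hypothetical helper: scale 0-255 RGB channels to the 0-1 range."""
    return (red / 255, green / 255, blue / 255)


# The two Color Blind 10 colors used in the plots above.
cb_dark_blue = rgb255_to_unit(0, 107, 164)
cb_orange = rgb255_to_unit(255, 128, 14)
```

The resulting tuples can be passed directly to the `c` parameter of `Axes.plot()`.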

### 2-2. <a name='line-width'></a>Line width
When we call the `Axes.plot()` method, we can use the linewidth parameter to specify the line width. Matplotlib expects a float value for this parameter:

```python
ax.plot(women_degrees['Year'], women_degrees['Biology'], label='Women', c=cb_dark_blue, linewidth=2)
```

### 2-3. <a name='improve-layout-and-ordering'></a>Improve layout and ordering

To make the viewing experience more coherent, we can:

* use a layout of a single row with multiple columns
* order the plots in decreasing order of initial gender gap

Here's what that would look like:
![single-row-multi-columns1](https://s3.amazonaws.com/dq-content/line_charts_dec_initial_gg.png)

The leftmost plot has the largest gender gap in 1968 while the rightmost plot has the smallest gender gap in 1968. If we're instead interested in the recent gender gaps in STEM degrees, we can order the plots from largest to smallest ending gender gaps. Here's what that would look like:
![single-row-multi-columns2](https://s3.amazonaws.com/dq-content/line_charts_dec_ending_gg.png)

Compared with the earlier code, the version below:
* Changes the figure size to a width of 18 inches and a height of 3 inches.
* Changes the range in the for loop to (0,6) instead of (0,4).
* Changes the subplot layout from 2 rows by 2 columns to 1 row by 6 columns.
* Uses stem_cats instead of major_cats when generating and setting the titles for the line charts.

```python
stem_cats = ['Engineering', 'Computer Science', 'Psychology', 'Biology', 'Physical Sciences', 'Math and Statistics']

fig = plt.figure(figsize=(18, 3))

for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)

plt.legend(loc='upper right')
plt.show()
```
![graph2](images/graph2.png)
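The ordering by decreasing gender gap described above does not have to be worked out by hand. The sketch below shows one way to compute it, assuming the gap is the absolute difference between the women's and men's percentages. The field names and percentages are invented placeholders, not values from the dataset.

```python
# Hypothetical starting-year percentages of degrees awarded to women.
# These numbers are placeholders for illustration only.
women_pct_start = {"Field A": 45.0, "Field B": 5.0, "Field C": 30.0}


def gender_gap(women_pct):
    """Absolute gap between the women's share and the men's share."""
    men_pct = 100 - women_pct
    return abs(women_pct - men_pct)


# Largest gap first, so the leftmost plot shows the widest gap.
ordered_fields = sorted(
    women_pct_start,
    key=lambda field: gender_gap(women_pct_start[field]),
    reverse=True,
)
print(ordered_fields)
```

With real data, the same idea applies to the first row of the dataframe (initial gap) or the last row (ending gap).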

### 2-4. <a name='legend'></a>Legend

The purpose of a legend is to ascribe meaning to symbols or colors in a chart. We're using it to inform the viewer of what gender corresponds to each color. **Tufte encourages removing legends entirely if the same information can be conveyed in a cleaner way.** Legends consist of non-data ink and take up precious space that could be used for the visualizations themselves (data-ink).<br>

Instead of trying to move the legend to a better location, we can replace it entirely by annotating the lines directly with the corresponding genders:
![legend](https://s3.amazonaws.com/dq-content/annotated_legend.png)

Notice that even the position of the text annotations has meaning. In both plots, the annotation for Men is positioned above the orange line while the annotation for Women is positioned below the dark blue line. This positioning subtly suggests that men are a majority for the degree categories the line charts are representing (Engineering and Math and Statistics) and women are a minority for those degree categories.<br>

Combined, these two observations suggest that we should stick with annotating just the leftmost and the rightmost line charts, prioritizing the data-ink ratio over the consistency of elements.

### 2-5. Annotating in Matplotlib
To add text annotations to a matplotlib plot, we use the `Axes.text()` method. This method has a few required parameters:

* x: x-axis coordinate (as a float)
* y: y-axis coordinate (as a float)
* s: the text we want in the annotation (as a string value)

The values in the coordinate grid match exactly with the data ranges for the x-axis and the y-axis. If we want to add text at the intersection of 1970 from the x-axis and 0 from the y-axis, we would pass in those values:

```python
ax.text(1970, 0, "starting point")
```

```python
fig = plt.figure(figsize=(18, 3))

for sp in range(0,6):
    ax = fig.add_subplot(1,6,sp+1)
    ax.plot(women_degrees['Year'], women_degrees[stem_cats[sp]], c=cb_dark_blue, label='Women', linewidth=3)
    ax.plot(women_degrees['Year'], 100-women_degrees[stem_cats[sp]], c=cb_orange, label='Men', linewidth=3)
    for key,spine in ax.spines.items():
        spine.set_visible(False)
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0,100)
    ax.set_title(stem_cats[sp])
    ax.tick_params(bottom=False, top=False, left=False, right=False)
    
    if sp == 0:
        ax.text(2005, 87, "Men")
        ax.text(2002, 8, "Women")
    elif sp == 5:
        ax.text(2005, 62, "Men")
        ax.text(2001, 35, "Women")
    
#plt.legend(loc='upper right')
plt.show()
```
![legend](https://s3.amazonaws.com/dq-content/annotated_legend.png)
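As the number of annotated subplots grows, the `if`/`elif` chain above can become hard to read. One possible alternative, sketched here, keeps the annotations in a dictionary keyed by subplot index. The `annotations` name and structure are assumptions made for this example, the coordinates are copied from the code above, and the line plots are left out to keep the sketch short.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt

# Subplot index -> list of (x, y, label) tuples, copied from the branches above.
annotations = {
    0: [(2005, 87, "Men"), (2002, 8, "Women")],
    5: [(2005, 62, "Men"), (2001, 35, "Women")],
}

fig, axes = plt.subplots(1, 6, figsize=(18, 3))
for sp, ax in enumerate(axes):
    ax.set_xlim(1968, 2011)
    ax.set_ylim(0, 100)
    # Subplots without an entry get an empty list and stay unannotated.
    for x, y, label in annotations.get(sp, []):
        ax.text(x, y, label)
```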

## 3. Conditional Plots

So far, we've mostly worked with plots that are quick to analyze and make sense of. Line charts, scatter plots, and bar plots allow us to convey a few nuggets of insights to the reader. We've also explored how we can combine those plots in interesting ways to convey deeper insights and continue to extend the storytelling power of data visualization. In this mission, we'll explore how to quickly create multiple plots that are subsetted using one or more conditions.<br>

We'll be working with the [seaborn](http://seaborn.pydata.org/) visualization library, which is built on top of matplotlib. Seaborn has good support for more complex plots, attractive default styles, and integrates well with the pandas library. Here are some examples of some complex plots that can be created using seaborn:

![example_gallery](https://s3.amazonaws.com/dq-content/seaborn_gallery.png)

### 3-1. <a name='creating-histograms-in-seaborn'></a>Creating histograms in Seaborn

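To get familiar with seaborn, we'll start by creating the familiar histogram. We can generate a histogram of the Fare column using the `seaborn.distplot()` function:

```python
# seaborn is commonly imported as `sns`.
import seaborn as sns
# `titanic` must be a pandas DataFrame with a "Fare" column
# (its loading code is not shown in these notes).
# Note: distplot() is deprecated since seaborn 0.11;
# sns.histplot(titanic["Fare"], kde=True) is the closest replacement.
sns.distplot(titanic["Fare"])
plt.show()
```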
```
![sample_seaborn](https://s3.amazonaws.com/dq-content/seaborn_histogram_with_kde.png)


Under the hood, seaborn creates a histogram using matplotlib, scales the axes values, and styles it. In addition, seaborn uses a technique called **kernel density estimation, or KDE for short**, to create **a smoothed line chart** over the histogram. If you're interested in learning about how KDE works, you can read more on [Wikipedia](https://en.wikipedia.org/wiki/Kernel_density_estimation).

**What you need to know for now is that the resulting line is a smoother version of the histogram, called a kernel density plot.** Kernel density plots are especially helpful when we're comparing distributions, which we'll explore later in this mission. When viewing a histogram, our visual processing systems influence us to smooth out the bars into a continuous line.
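To make the idea of kernel density estimation concrete, here is a minimal hand-written sketch using only the standard library. It places a Gaussian bump on every data point and averages them. This is not how seaborn implements it: the fixed `bandwidth` is an assumption for the example (seaborn chooses one automatically), and the sample values are invented.

```python
import math


def make_gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)

    def density(x):
        total = 0.0
        for xi in samples:
            u = (x - xi) / bandwidth
            # Standard normal kernel evaluated at the scaled distance.
            total += math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
        return total / (n * bandwidth)

    return density


# Hypothetical toy values, not taken from the titanic data.
values = [7.0, 8.0, 8.5, 13.0, 26.0, 30.0]
density = make_gaussian_kde(values, bandwidth=4.0)

# A density integrates to roughly 1; check with a simple Riemann sum.
step = 0.1
grid = [-40 + step * i for i in range(1101)]  # covers -40 to 70
area = sum(density(x) for x in grid) * step
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths more aggressively.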

### 3-2. <a name='generating-a-kernel-density-plot'></a>Generating a kernel density plot

While having both the histogram and the kernel density plot is useful when we want to explore the data, it can be overwhelming for someone who's trying to understand the distribution. **To generate just the kernel density plot, we use the seaborn.kdeplot() function**:

```python
sns.kdeplot(titanic["Age"])
```
![KDP](https://s3.amazonaws.com/dq-content/seaborn_kdeplot.png)

While the distribution of data is displayed **in a smoother fashion**, it's now more **difficult to visually estimate the area under the curve** using just the line chart. When we also had the histogram, the bars provided a way to understand and compare proportions visually.

* To bring back some of the ability to easily compare proportions, we can **shade the area under the line using a single color**. When calling the `seaborn.kdeplot()` function, we can shade the area under the line by setting the `shade` parameter to `True`. In seaborn 0.11 and later this parameter is named `fill`, and `shade` is deprecated.

```python
sns.kdeplot(titanic["Age"], shade=True)
plt.xlabel("Age")
```
![shadetrue](images/graph3.png)

### 3-3. <a name='modifying-the-appearance'></a>Modifying the appearance of the plots

From the plots in the previous step, you'll notice that seaborn:

* Labels the x-axis with the column name (set explicitly here with `plt.xlabel()`)
* Sets the background color to a light gray color
* Hides the x-axis and y-axis ticks
* Displays the coordinate grid

In the last few missions, we explored some general aesthetics guidelines for plots. The default seaborn style sheet gets some things right, like hiding axis ticks, and some things wrong, like displaying the coordinate grid and keeping all of the axis spines. We can use the seaborn.set_style() function to change the default seaborn style sheet. Seaborn comes with a few style sheets:

* darkgrid: Coordinate grid displayed, dark background color
* whitegrid: Coordinate grid displayed, white background color
* dark: Coordinate grid hidden, dark background color
* white: Coordinate grid hidden, white background color
* ticks: Coordinate grid hidden, white background color, ticks visible

Here's a diagram that compares the same plot across all styles:

![diagram_seaborn](https://s3.amazonaws.com/dq-content/seaborn_all_styles.png)

By default, the seaborn style is set to "darkgrid":

```python
sns.set_style("darkgrid")
```
If we change the style sheet using this method, all future plots will match that style in your current session. This means you need to set the style before generating the plot.

To remove the axis spines for the top and right axes, we use the seaborn.despine() function:

```python
sns.despine()
```
By default, only the `top` and `right` axes will be **despined**, or have their spines removed. To despine the other two axes, we need to set the `left` and `bottom` parameters to True.

```python
sns.set_style("white")
sns.kdeplot(titanic["Age"], shade=True)
plt.xlabel("Age")
sns.despine(left=True, bottom=True)
```

![despined](images/graph4.png)

### 3-4. <a name='conditional-distributions-single'></a>Conditional distributions using a single condition

In seaborn, we can create a small multiple by specifying the conditioning criteria and the type of data visualization we want. For example, we can visualize the differences in age distributions between passengers who survived and those who didn't by creating a pair of kernel density plots. One kernel density plot would visualize the distribution of values in the "Age" column where Survived equalled 0 and the other would visualize the distribution of values in the "Age" column where Survived equalled 1.

* Here's what those plots look like:
```python
# Condition on unique values of the "Survived" column.
# `height` replaced the older `size` parameter in seaborn 0.9.
g = sns.FacetGrid(titanic, col="Survived", height=6)
# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", shade=True)
```
![dist_example](https://s3.amazonaws.com/dq-content/seaborn_simple_conditional.png)


Seaborn handled:

* subsetting the data into rows where Survived is 0 and where Survived is 1
* creating both Axes objects, ensuring the same axis scales
* plotting both kernel density plots

Instead of subsetting the data and generating each plot ourselves, seaborn allows us to express the plots we want as parameter values. 
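For comparison, here is a rough sketch of what doing those steps by hand might look like with pandas and matplotlib. It uses histograms instead of kernel density plots, since matplotlib has no built-in KDE plot, and a tiny invented dataframe named `toy` stands in for `titanic`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the titanic dataframe; the values are invented.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 1],
    "Age": [22.0, 35.0, 54.0, 4.0, 27.0, 38.0],
})

# Step 1: subset the rows by each unique value of "Survived".
groups = list(toy.groupby("Survived"))

# Step 2: create one Axes per subset, sharing both axis scales.
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, figsize=(12, 6))

# Step 3: draw one plot per subset.
for ax, (value, subset) in zip(axes, groups):
    ax.hist(subset["Age"])
    ax.set_title(f"Survived = {value}")
```

Every additional condition adds more bookkeeping of this kind, which is the work `FacetGrid` takes over.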

**The `seaborn.FacetGrid` object is used to represent the layout of the plots in the grid and the columns used for subsetting the data.**
* The word "facet" from `FacetGrid` is another word for "subset".
* Setting the `col` parameter to `"Survived"` specifies a separate plot for each unique value in the `Survived` column.
* Setting the `height` parameter to `6` specifies a height of 6 inches for each plot (older seaborn versions called this parameter `size`).

Once we've created the grid, we use the `FacetGrid.map()` method to specify the plot we want for each unique value of `Survived`. Seaborn generated one kernel density plot for the ages of passengers that survived and one kernel density plot for the ages of passengers that didn't survive.

* The function that's passed into `FacetGrid.map()` has to be a valid matplotlib or seaborn function. For example, we can map matplotlib histograms to the grid:

```python
g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(plt.hist, "Age")
```

![graph5](images/graph5.png)

```python
g = sns.FacetGrid(titanic, col="Pclass", height=6)
g.map(sns.kdeplot, "Age", shade=True)
sns.despine(bottom=True, left=True)
plt.show()
```
![graph6](images/graph6.png)

### 3-5. <a name='conditional-distributions-double'></a>Conditional distributions using two conditions
When creating a FacetGrid, we use the row parameter to specify the column in the dataframe we want used to subset across the rows in the grid. The best way to understand this is to see a working example.

```python
g = sns.FacetGrid(titanic, col="Survived", row="Pclass")
g.map(sns.kdeplot, "Age", shade=True)
sns.despine(left=True, bottom=True)
plt.show()
```
![graph7](images/graph7.png)

### 3-6. <a name='conditional-distributions-three'></a>Conditional distributions using three conditions

When subsetting data using two conditions, the rows in the grid represented one condition while the columns represented another. **We can express a third condition by generating multiple plots on the same subplot in the grid and coloring them differently.** Thankfully, we can add a condition just by setting the `hue` parameter to the column name from the dataframe.

```python
g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue='Sex', height=3)
g.map(sns.kdeplot, "Age", shade=True).add_legend()
sns.despine(left=True, bottom=True)
plt.show()
```
![graph9](images/graph9.png)