# Topic 06:  Data Visualizations

- onl01-dtsc-ft-022221
- 03/04/21


## Learning Objectives

- Quick intro to object-oriented programming (OOP)
- Learn about the anatomy of a Matplotlib figure.
- Learn about how other packages use Matplotlib.
- Discuss and demonstrate the 2 different syntaxes/interfaces of Matplotlib.

- Activity: Remaining portion of Topic 05's Data Cleaning Project with Superheroes

## Questions/Comments?

### Un-Answered Topic 05 Questions

- [ ] The data cleaning mini project's data set (superheroes) had a decent amount of negative values for heights and weights that it appears were not taken out in the solution. I replaced these with medians, and got really weird displots for the gender-height-weight charts, although the regular hist() plots were less crazy looking. 
    - Can you go over displots a little more, and why we might use them instead of the other ways to do plots?
    
    
- [ ] When should we use map(), apply(), applymap()? I'm having a hard time telling how they're different.


- [ ] One of the solutions in the more data mapping lab included the following code to print the mean, median, and stddev of the age column. 
    - How was it possible to run functions from a list of strings representing the function names and apply()?
```python
age_na_mean = df['Age'].fillna(value=df['Age'].mean())
print(age_na_mean.apply(['mean', 'median', 'std']))
```
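One way to see why this works: when `apply()` or `agg()` receives a string, pandas looks up the Series method with that name and calls it, and a list of strings repeats that lookup once per name. Below is a minimal sketch using made-up ages (the values are assumptions, not the lab's data). It uses `agg()`, which is the documented home for this list-of-strings behavior.

```python
import pandas as pd

# Made-up ages for illustration only (not the lab's data)
ages = pd.Series([22.0, 38.0, 26.0, 35.0], name="Age")

# Each string is treated as the name of a Series method to call
summary = ages.agg(["mean", "median", "std"])

# The same numbers, calling each method by hand
manual = pd.Series(
    {"mean": ages.mean(), "median": ages.median(), "std": ages.std()}
)

print(summary)
print(manual)
```

Both printouts should show the same three values, indexed by the function names.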

    

- [ ] Can you explain stack() and unstack() a bit more? I ran through the lab examples but didn't really get what was happening.
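As a starting point for the `stack()`/`unstack()` question, here is a small sketch on a made-up two-hero table (the names and numbers are assumptions for illustration). `stack()` moves the column labels into an inner index level, giving one long Series, and `unstack()` moves that inner level back out into columns.

```python
import pandas as pd

# Made-up "wide" table: one row per hero, one column per measurement
wide = pd.DataFrame(
    {"Height": [180, 165], "Weight": [80, 60]},
    index=pd.Index(["Hero A", "Hero B"], name="name"),
)

# stack(): columns become the inner level of a MultiIndex (wide -> long)
long = wide.stack()
print(long)

# unstack(): the inner index level becomes columns again (long -> wide)
back = long.unstack()
print(back)
```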


### New Topic 06 Questions

- Subplots with Enumeration Lab:
    - [ ] I’m confused as to how the enumeration function works with the subplots on the Subplots and Enumeration - Lab. Shouldn’t value 1 be country name and value 2 be year? Why is value 2 population and why is the for loop not iterating through year or the rest of the information?
    - [ ] Also related to this why don’t we specify the ‘Year’ and ‘Value’ as columns of the population variable when plotting as population[‘Year’] and population[‘Value’]?
    https://github.com/learn-co-curriculum/dsc-subplots-and-enumeration-lab/tree/solution
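A hedged sketch of the general pattern behind this question, using a tiny made-up table rather than the lab's data (the column names `Country Name`, `Year`, and `Value` are assumptions here). `enumerate()` yields `(counter, item)` pairs, and iterating over a `groupby` yields `(group_key, sub_DataFrame)` pairs, so the second unpacked value is the whole sub-table for one country, not a single column.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data for illustration only
data = pd.DataFrame({
    "Country Name": ["A", "A", "B", "B"],
    "Year": [2000, 2001, 2000, 2001],
    "Value": [10, 12, 20, 19],
})

fig, axes = plt.subplots(ncols=2, figsize=(8, 3))
titles = []

# i: counter from enumerate; country: group key; population: rows for that country
for i, (country, population) in enumerate(data.groupby("Country Name")):
    ax = axes[i]                                   # pick the i-th subplot
    ax.plot(population["Year"], population["Value"])
    ax.set_title(country)
    titles.append(ax.get_title())

print(titles)
```

The loop runs once per country, which is why it never iterates over individual years: each pass already holds every year for that country inside `population`.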


___

# What does it mean to be 'Object-Oriented'?

> ### "Everything is an object."
- some Python sensei

### `intro_object_oriented_programming.ipynb`

### OOP Vocabulary 

- "Object": an instance of a class (template) that currently exists in memory.
- "Calling" a function:
    - When we use ( ) after a function's name, we are calling it.

- **Function:** Code that manipulates data in a useful way.


- Parameter: a variable named in a function's definition that the function accepts.
- Argument: the actual variable/value passed in for a parameter.
- Positional Argument:
    - The required arguments, which come first.
    - Each one is matched to a parameter by its order.
- Keyword/default Arguments:
    - Arguments for parameters that have a defined default value.
    - Must come after positional arguments.
    
    
- **Class:** Template/blueprint.
- Instance: An object built from the class blueprint.
- Attribute: A variable stored inside an object.
- Method: A function stored inside an object.
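The vocabulary above can be tied together in one short sketch. The `Hero` class and its values are hypothetical, made up only to label each term.

```python
class Hero:
    """Class: the blueprint used to build hero objects."""

    # `name` is a positional parameter; `height` has a default value
    def __init__(self, name, height=180):
        self.name = name        # attribute: a variable stored in the object
        self.height = height    # attribute

    # Method: a function stored inside the object
    def describe(self):
        return f"{self.name} is {self.height} cm tall"


# Instance: an object built from the blueprint.
# "Thor" is a positional argument; height=198 is a keyword argument.
thor = Hero("Thor", height=198)

# Calling the method with ( )
print(thor.describe())

# Leaving out the keyword argument falls back to the default
default_hero = Hero("Unknown")
print(default_hero.height)
```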

# Intro to Matplotlib

- Matplotlib is the backbone of plotting in Python and is used by pandas, seaborn, etc.
    - [Matplotlib Example Gallery](https://matplotlib.org/gallery/index.html#examples-index)
    - [Seaborn Example Gallery](https://seaborn.pydata.org/examples/index.html)
    - [Pandas Visualization docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)



-  *Matplotlib is powerful but can be a bit confusing at times because of its 2 sets of commands:*
    - the matplotlib.pyplot functions (`plt.bar()`,`plt.title()`)
    - the object_oriented methods (`ax.bar()`,`ax.set_title()`)
    
    
    
- The 2 syntaxes can be confusing at first and cause problems & odd results when mixed together.
    - Learn about some of the problems when mixing types.
    - Example: see how plt.title()/plt.xlabel(),etc. can behave strangely in subplots.
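A minimal sketch of the mixing problem described above. `plt.title()` acts on whichever Axes is "current" (here, the last subplot created), while `ax.set_title()` acts on the exact Axes you name.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(ncols=2)

# pyplot function: applies to the current Axes, which is the LAST one created
plt.title("Set with plt.title")

# Object-oriented method: applies to the Axes we explicitly picked
axes[0].set_title("Set with ax.set_title")

left_title = axes[0].get_title()
right_title = axes[1].get_title()
print(left_title, "|", right_title)
```

The `plt.title()` call lands on the right-hand subplot even though it was written first, which is the kind of surprise that sticking to the `ax` methods avoids.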
 

### References

#### Blog Posts

- **Bookmark this article; it's the best explanation of how Matplotlib's 2 interfaces work:**
> ["Artist" in Matplotlib - something I wanted to know before spending tremendous hours on googling how-tos.](https://dev.to/skotaro/artist-in-matplotlib---something-i-wanted-to-know-before-spending-tremendous-hours-on-googling-how-tos--31oo)<br>

- [My Blog Post on Making Customized Figures in seaborn](https://jirvingphd.github.io/harnessing_seaborn_subplots_for_eda)
    - This covers some concepts we didn't have time to cover, like ticklabel formatters.
    
#### **Matplotlib Official Documentation**

- [Markers](https://matplotlib.org/3.1.1/api/markers_api.html)
- [Colors](https://matplotlib.org/3.1.0/gallery/color/named_colors.html)
- [Text](https://matplotlib.org/3.1.0/tutorials/text/text_intro.html)
- [Text Properties](https://matplotlib.org/3.1.1/tutorials/text/text_props.html)

## Matplotlib Anatomy / Structure


- Matplotlib figures are composed of 3 different kinds of objects:
    - `Figure` is the largest bucket and contains everything else. It is like a picture frame without any actual images in it.
    - `Axes` are the actual plot / image inside of the Figure / frame.
        - This is the same `ax` as in `fig, ax = plt.subplots()`, and it is what most Pandas and Seaborn plotting functions return.
        - There is an `Axes` for each subplot in the Figure.
        - `Axes` contain information about the titles, labels, grid, and background. See the figure below for the contents of `Axes`.

<center><img src="https://raw.githubusercontent.com/jirvingphd/fsds_100719_cohort_notes/master/images/matplotlib_anatomy.png" width=400></center>


        
- Inside each `Axes` there is an `Axis` object for each dimension, available as `ax.xaxis` and `ax.yaxis`, that contains the ticks and the tick labels.
    <center><img src="https://raw.githubusercontent.com/jirvingphd/fsds_100719_cohort_notes/master/images/matplotlib_Axes_layout2.png" width=500></center>
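The hierarchy can also be checked directly in code. This is a small sketch that builds one figure and walks from the `Figure` down to an `Axis`; the title and label text are placeholders.

```python
import matplotlib.pyplot as plt
from matplotlib.axis import XAxis, YAxis

fig, ax = plt.subplots()
ax.set_title("Axes title")
ax.set_xlabel("x label")

# The Figure keeps a list of every Axes it contains
print(type(fig).__name__, "contains", len(fig.axes), "Axes")
print(fig.axes[0] is ax)

# Each Axes holds one Axis object per dimension
print(isinstance(ax.xaxis, XAxis), isinstance(ax.yaxis, YAxis))

# The x label set through the Axes is stored on its XAxis
x_label_text = ax.xaxis.label.get_text()
print(x_label_text)
```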
  

### Matplotlib Fig, ax

In [None]:
## SETTING MATPLOTLIB STYLE AND DEFAULT FIGSIZE
import matplotlib.pyplot as plt
import matplotlib as mpl


## Set the default style and figsize
print(plt.style.available)
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [12,8]

In [None]:
## Make a fig and an ax
fig, ax = plt.subplots()

In [None]:
## What does the ax look like?
ax

In [None]:
## what about the fig?
fig

# Project - Data Cleaning

> Official Solution: https://github.com/learn-co-curriculum/dsc-data-cleaning-project/tree/solution

## Introduction
In this lab, we'll make use of everything we've learned about pandas, data cleaning, and exploratory data analysis. In order to complete this lab, you'll have to import, clean, combine, reshape, and visualize data to answer questions provided, as well as your own questions!

## Objectives
You will be able to:
- Use different types of joins to merge DataFrames 
- Identify missing values in a dataframe using built-in methods 
- Evaluate and execute the best strategy for dealing with missing, duplicate, and erroneous values for a given dataset 
- Inspect data for duplicates or extraneous values and remove them 


## The dataset
In this lab, we'll work with the comprehensive [Super Heroes Dataset](https://www.kaggle.com/claudiodavi/superhero-set/data), which can be found on Kaggle!


## Getting Started

In the cell below:

* Import and alias pandas as `pd`
* Import and alias numpy as `np`
* Import and alias seaborn as `sns`
* Import and alias matplotlib.pyplot as `plt`
* Set matplotlib visualizations to display inline in the notebook

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

For this lab, our dataset is split among two different sources -- `'heroes_information.csv'` and `'super_hero_powers.csv'`.

Use pandas to read in each file and store them in DataFrames in the appropriate variables below. Then, display the `.head()` of each to ensure that everything loaded correctly.  

In [None]:
heroes_url ='https://raw.githubusercontent.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/master/Phase_1/topic_05_data_cleaning_in_pandas/labs_from_class/dsc-data-cleaning-project-master/heroes_information.csv'
heroes_df = pd.read_csv(heroes_url)

powers_url = 'https://raw.githubusercontent.com/flatiron-school/Online-DS-FT-022221-Cohort-Notes/master/Phase_1/topic_05_data_cleaning_in_pandas/labs_from_class/dsc-data-cleaning-project-master/super_hero_powers.csv'
powers_df = pd.read_csv(powers_url)
# display(heroes_df.head(), powers_df.head())

It looks as if the heroes information dataset contained an index column.  We did not specify that this dataset contained an index column, because we hadn't seen it yet. Pandas does not know how to tell apart an index column from any other data, so it stored it with the column name `Unnamed: 0`.  

Our DataFrame provided row indices by default, so this column is not needed.  Drop it from the DataFrame in place in the cell below, and then display the head of `heroes_df` to ensure that it worked properly. 

In [None]:
heroes_df = pd.read_csv(heroes_url, index_col=0)
heroes_df

## Familiarize yourself with the dataset

The first step in our Exploratory Data Analysis will be to get familiar with the data.  This step includes:

* Understanding the dimensionality of your dataset
* Investigating what type of data it contains, and the data types used to store it
* Discovering how missing values are encoded, and how many there are
* Getting a feel for what information it does and doesn't contain

In the cell below, get the descriptive statistics of each DataFrame.  

In [None]:
heroes_df.info()

In [None]:
heroes_df.describe().round(2)

## Dealing with missing values

Starting in the cell below, detect and deal with any missing values in either DataFrame. Then, explain your methodology for detecting and dealing with outliers in the markdown section below. Be sure to explain your strategy for dealing with missing values in numeric columns, as well as your strategy for dealing with missing values in non-numeric columns.  

Note that if you need to add more cells to write code in, you can do this by:

**1.** Highlighting a cell and then pressing `ESC` to enter command mode.  
**2.** Press `A` to add a cell above the highlighted cell, or `B` to add a cell below the highlighted cell. 

Describe your strategy below this line:
____________________________________________________________________________________________________________________________




#### Visual Alternatives

In [None]:
# # !pip install missingno
# import missingno
# missingno.matrix(heroes_df)

### `heroes_df`

In [None]:
res = heroes_df.isna().sum()
res[res>0]

In [None]:
heroes_df[heroes_df.isna().any(axis=1)]

In [None]:
## Fill Publisher with placeholder
heroes_df['Publisher'] = heroes_df['Publisher'].fillna('Missing')

In [None]:
## Check Remaining Missing 
heroes_df[heroes_df.isna().any(axis=1)]

In [None]:
## Let's try fillna with the max
heroes_df['Weight'].max()

In [None]:
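## Assign the result back: fillna(inplace=True) on a selected column may not update heroes_df in newer pandas
heroes_df['Weight'] = heroes_df['Weight'].fillna(heroes_df['Weight'].max())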
heroes_df[heroes_df.isna().any(axis=1)]

In [None]:
heroes_df.isna().sum()

### `powers_df`

In [None]:
res = powers_df.isna().sum()
res[ res>0]

## Joining, Grouping, and Aggregating

In the cell below, join the two DataFrames.  Think about which sort of join you should use, as well as which columns you should join on.  Rename columns and manipulate as needed.  

**_HINT:_** Consider the possibility that the columns you choose to join on contain duplicate entries. If that is the case, devise a strategy to deal with the duplicates.

**_HINT:_** If the join throws an error message, consider setting the column you want to join on as the index for each DataFrame.  

In [None]:
heroes_df[heroes_df.duplicated(keep=False)]

In [None]:
heroes_df = heroes_df[~heroes_df.duplicated()]
heroes_df[heroes_df.duplicated()]

In [None]:
heroes_df.drop_duplicates(inplace=True)

In [None]:
display(heroes_df.head(2),powers_df.head(2))

In [None]:
## Join with integer indexes joins incorrectly 
# heroes_df.join(powers_df,)[['name','hero_names']]

In [None]:
df = pd.merge(heroes_df,powers_df,left_on='name',right_on='hero_names',how='inner')
df
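Before settling on an inner join, it can help to see which rows would be dropped. The sketch below uses two tiny made-up frames (not the real hero data) and the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Made-up stand-ins for heroes_df and powers_df
heroes = pd.DataFrame({"name": ["A", "B", "C"], "Height": [180, 170, 160]})
powers = pd.DataFrame({"hero_names": ["A", "B", "D"],
                       "Flight": [True, False, True]})

# An outer join keeps every row from both sides; _merge records the source
merged = pd.merge(heroes, powers, left_on="name", right_on="hero_names",
                  how="outer", indicator=True)

# 'both' rows survive an inner join; the *_only rows would be lost
counts = merged["_merge"].value_counts()
print(counts)

# Duplicate keys multiply rows in a join, so check for them as well
n_duplicate_names = heroes["name"].duplicated().sum()
print(n_duplicate_names)
```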

# Topic 06 Study Group - Superheroes Continued

In the cell below, subset male and female heroes into different dataframes.  Create a scatterplot of the height and weight of each hero, with weight as the y-axis.  Plot both the male and female heroes subset into each dataframe, and make the color for each point in the scatterplot correspond to the gender of the superhero.

In [None]:
# plt.style.use('seaborn-talk')
sns.scatterplot(data=df, x='Height', y='Weight',hue='Gender')

### Addressing -99s

In [None]:
## Visualize all heroes that had a negative height or weight
df[(df['Height']<0) | (df['Weight']<0)]

In [None]:
## Replace Negative Heights with NaN - Use .map


In [None]:
## Replace Negative Weights with NaN - Use .loc indexing



In [None]:
## Check for negative heights/weights again
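The two approaches named in the cells above (`.map` for Height, `.loc` indexing for Weight) could be sketched as follows. The three-row frame is made up for illustration, with -99 standing in for the placeholder values.

```python
import numpy as np
import pandas as pd

# Made-up rows; -99 is the placeholder for "unknown"
demo = pd.DataFrame({"Height": [180.0, -99.0, 165.0],
                     "Weight": [80.0, 60.0, -99.0]})

# .map: run a function on every value in one column
demo["Height"] = demo["Height"].map(lambda h: np.nan if h < 0 else h)

# .loc: select the offending rows with a boolean mask and overwrite them
demo.loc[demo["Weight"] < 0, "Weight"] = np.nan

# Check again: no negative heights or weights should remain
remaining = demo[(demo["Height"] < 0) | (demo["Weight"] < 0)]
print(len(remaining))
print(demo)
```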


### Filling Height and Weight using Gender-means

In [None]:
## Get the mean height and weight by gender using .groupby


In [None]:
## You can use .agg with a groupby and use the string name of a function


In [None]:
## test out slicing out male Weight


In [None]:
## fill Weight by Gender for Males


> - We need to do this for multiple columns for multiple genders. 
- **Great chance for a function!**
    - Take the code we used for fill Weight by Gender for Males above as the starting code for the function.

In [None]:
# ## How many genders does the df have?
groups = df['Gender'].unique()
groups

In [None]:
def fillna_by_groups():
    pass
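One possible shape for this function, offered as a sketch rather than the class solution. The parameters (`column`, `group_col`, `agg`) are assumptions, since the stub above takes none. It loops over each group, computes that group's statistic, and fills only that group's missing values.

```python
import numpy as np
import pandas as pd


def fillna_by_groups(df, column, group_col="Gender", agg="mean"):
    """Fill NaNs in `column` with each group's own statistic (hypothetical signature)."""
    df = df.copy()  # leave the caller's DataFrame untouched
    for group in df[group_col].dropna().unique():
        in_group = df[group_col] == group
        fill_value = df.loc[in_group, column].agg(agg)
        df.loc[in_group, column] = df.loc[in_group, column].fillna(fill_value)
    return df


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", None],
    "Weight": [80.0, np.nan, 60.0, np.nan, np.nan],
})
filled = fillna_by_groups(demo, "Weight")
print(filled)
```

Rows with a missing `Gender` belong to no group, so their `Weight` stays missing and needs a separate decision.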

In [None]:
# make our little null value trick a function
def show_nulls(df):
    res = df.isna().sum()
    print(res[res>0])
    
show_nulls(df)

In [None]:
## Use fillna_by_groups on Height and show_nulls


In [None]:
## Use fillna_by_groups on Weight and show_nulls


## Some Initial Investigation

Next, slice the DataFrame as needed and visualize the distribution of heights and weights by gender.  You should have 4 total plots.  

In the cell below:

* Slice the DataFrame into separate DataFrames by gender
* Complete the `show_distplot()` function.  This helper function should take in a DataFrame, a string containing the gender we want to visualize, and the column name we want to visualize by gender. The function should display a distplot visualization from seaborn of the column/gender combination.  

Hint: Don't forget to check the [seaborn documentation for distplot](https://seaborn.pydata.org/generated/seaborn.distplot.html) if you have questions about how to use it correctly! 

> ### James Edit: Why bother making separate dataframes?!

In [None]:
# male_heroes_df = df.groupby('Gender').get_group('Male')
# female_heroes_df =  df.groupby('Gender').get_group('Female')

def show_distplot(dataframe, gender, column_name):
    pass

In [None]:
# Male Height
show_distplot(df,'Male','Height')

> #### Optional Sub-Task: Demo the same plot being done 4 ways (Poll)
1. plt functions
2. OOP fig,ax
3. Pandas
4. Seaborn

In [None]:
# Male Weight
show_distplot(df,'Male','Weight')

In [None]:
# Female Height
show_distplot(df,'Female','Height')

In [None]:
# Female Weight
show_distplot(df,'Female','Weight')

Discuss your findings from the plots above, with respect to the distribution of height and weight by gender.  Your explanation should include a discussion of any relevant summary statistics, including mean, median, mode, and the overall shape of each distribution.  

Write your answer below this line:
____________________________________________________________________________________________________________________________



> #### It would be so much easier to answer the question if they were on the same figure

In [None]:
def show_displot_by_color(dataframe,column_name, color_col='Gender'):
    pass
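One way this stub could be completed, sketched with plain Matplotlib instead of seaborn so that each step is visible: group by the color column and draw one semi-transparent histogram per group on a shared Axes. The bin count, transparency, and returned `fig, ax` are choices made for this sketch, and the demo data is made up.

```python
import matplotlib.pyplot as plt
import pandas as pd


def show_displot_by_color(dataframe, column_name, color_col="Gender"):
    """Overlay one histogram per group on the same Axes."""
    fig, ax = plt.subplots()
    for group_name, group in dataframe.groupby(color_col):
        ax.hist(group[column_name].dropna(), bins=10, alpha=0.5,
                label=str(group_name))
    ax.set(title=f"{column_name} by {color_col}",
           xlabel=column_name, ylabel="Count")
    ax.legend(title=color_col)
    return fig, ax


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 4,
    "Height": [180, 185, 190, 200, 165, 170, 172, 178],
})
fig, ax = show_displot_by_color(demo, "Height")
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```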

In [None]:
show_displot_by_color(df,'Height')

In [None]:
show_displot_by_color(df,'Weight')

### Sample Question: Most Common Powers

The rest of this notebook will be left to you to investigate the dataset by formulating your own questions, and then seeking answers using pandas and numpy.  Every answer should include some sort of visualization, when appropriate. Before moving on to formulating your own questions, use the dataset to answer the following questions about superhero powers:

* What are the 5 most common powers overall?
* What are the 5 most common powers in the Marvel Universe?
* What are the 5 most common powers in the DC Universe?

Analyze the results you found above to answer the following question:

How do the top 5 powers in the Marvel and DC universes compare?  Are they similar, or are there significant differences? How do they compare to the overall trends in the entire Superheroes dataset?

Write your answer below this line:
____________________________________________________________________________________________________________________________


### What are the 5 most common powers overall?

In [None]:
## get list of powers 
power_cols = None
power_cols

In [None]:
## Save the sum of the power cols


### What are the 5 most common powers in the Marvel Universe?

In [None]:
df["Publisher"].value_counts()

In [None]:
## What are the 5 most common powers in the Marvel Universe?
publisher = 'Marvel Comics'

power_counts = df.loc[ df['Publisher']==publisher, power_cols].sum()#head()
ax = power_counts.sort_values(ascending=True).tail().plot(kind='barh')
ax.set_title(f"Top 5 Super Powers - {publisher}")

In [None]:
## What are the 5 most common powers in the DC Universe?
publisher = 'DC Comics'
power_counts = df.loc[ df['Publisher']==publisher, power_cols].sum()#head()
ax = power_counts.sort_values(ascending=True).tail().plot(kind='barh')
ax.set_title(f"Top 5 Super Powers - {publisher}")

## Level-Ups

#### Your Own Investigation

For the remainder of this lab, you'll be focusing on coming up with and answering your own question, just like we did above.  Your question should not be overly simple, and should require both descriptive statistics and data visualization to answer.  In case you're unsure of what questions to ask, some sample questions have been provided below.


Explain your question below this line:

___
### Which powers have the highest chance of co-occurring in a hero (e.g. super strength and flight), and does this differ by gender?

In [None]:
## Get the correlation matrix for JUST the power cols


In [None]:
### Need to Turn the square matrix above into a normal dataframe
## ref: https://stackoverflow.com/a/51071640


#### Using .apply with axis=1

> `.apply` can be really helpful if you need to apply something to multiple columns at the same time.

In [None]:
##  Make a keep-me column that is False if power 1 and power 2 are the same


In [None]:
## Deal with reverse-order correlations


In [None]:
## Save corr_df where keep-me is true and only save Power Combo and Correlation


In [None]:
## Sort values  


In [None]:
## Plot with pandas (barh)
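The steps outlined in the cells above could be chained together roughly like this. It is a sketch on three made-up power columns. The column names `Power 1`, `Power 2`, `keep_me`, and `Power Combo` are assumptions, and sorting each pair's names is just one way to handle the reverse-order duplicates.

```python
import pandas as pd

# Made-up boolean power columns for illustration only
powers = pd.DataFrame({
    "Flight":         [True, True, False, False, True, False],
    "Super Strength": [True, True, False, False, True, True],
    "Telepathy":      [False, False, True, True, False, False],
})

# Correlation matrix for just the power columns
corr = powers.corr()

# Square matrix -> long DataFrame with one row per (power, power) pair
corr_df = corr.stack().reset_index()
corr_df.columns = ["Power 1", "Power 2", "Correlation"]

# .apply with axis=1 hands each ROW to the function
corr_df["keep_me"] = corr_df.apply(
    lambda row: row["Power 1"] != row["Power 2"], axis=1)

# Sorting the two names makes (A, B) and (B, A) produce the same label
corr_df["Power Combo"] = corr_df.apply(
    lambda row: " & ".join(sorted([row["Power 1"], row["Power 2"]])), axis=1)

result = (corr_df.loc[corr_df["keep_me"], ["Power Combo", "Correlation"]]
          .drop_duplicates(subset="Power Combo")
          .sort_values("Correlation", ascending=False))
print(result)

# Plot with pandas (barh)
ax = result.set_index("Power Combo")["Correlation"].plot(kind="barh")
```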


## Summary

In this lab, we demonstrated our mastery of:
* Using all of our Pandas knowledge to date to clean the dataset and deal with null values
* Using Queries and aggregations to group the data into interesting subsets as needed
* Using descriptive statistics and data visualization to find answers to questions we may have about the data

## Final Activity

> - Make a function to produce a plot and return the fig 
- Loop through each ... Publisher? Gender? and produce the top 10 most highly correlated powers.
- Save the figs to a dictionary.
- In a for loop, loop through the dictionary and either:
    - print out which publisher/gender and then show the plot
    - OR update the title and then store the plot

In [None]:
%%time
## Paste the lines of code we used into this one cell

In [None]:
def plot_top_corr_powers(df,power_cols):
    pass


In [None]:
## Test with DC Comics


In [None]:
## Let's do the top 5 most common Publishers


In [None]:
## Use function in a loop and save each publisher's fig to a dictionary


In [None]:
## Get DC Comics fig


## If there's time - Let's save dc and marvel figures to disk 


In [None]:
## If there's time - Let's save dc and marvel figures to disk
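A possible end-to-end sketch of the Final Activity: a function that returns its `fig`, a loop that stores one figure per publisher in a dictionary, and a second loop that saves each one. The `top_n` and `title` parameters, the demo data, and the temporary output folder are all assumptions made for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd


def plot_top_corr_powers(df, power_cols, top_n=10, title=None):
    """Bar plot of the most correlated power pairs; returns the Figure."""
    corr_df = df[power_cols].corr().stack().reset_index()
    corr_df.columns = ["Power 1", "Power 2", "Correlation"]
    corr_df["Power Combo"] = corr_df["Power 1"] + " & " + corr_df["Power 2"]
    # Power 1 < Power 2 drops self-pairs and keeps each pair only once
    corr_df = corr_df[corr_df["Power 1"] < corr_df["Power 2"]]
    top = (corr_df.dropna(subset=["Correlation"])
           .sort_values("Correlation").tail(top_n))
    fig, ax = plt.subplots()
    ax.barh(top["Power Combo"], top["Correlation"])
    ax.set(title=title or "Most Correlated Powers", xlabel="Correlation")
    return fig


# Made-up rows for illustration only
demo = pd.DataFrame({
    "Publisher": ["Marvel Comics"] * 4 + ["DC Comics"] * 4,
    "Flight": [True, False, True, False, True, True, False, False],
    "Super Strength": [True, False, True, True, True, False, False, True],
    "Agility": [False, True, False, True, False, True, True, False],
})
power_cols = ["Flight", "Super Strength", "Agility"]

# One figure per publisher, stored in a dictionary
figs = {}
for publisher, group in demo.groupby("Publisher"):
    figs[publisher] = plot_top_corr_powers(
        group, power_cols, title=f"Most Correlated Powers - {publisher}")

# Save each stored figure to disk (a temporary folder in this sketch)
out_dir = tempfile.mkdtemp()
saved_paths = []
for publisher, fig in figs.items():
    path = os.path.join(out_dir, f"{publisher.replace(' ', '_')}.png")
    fig.savefig(path)
    saved_paths.append(path)
print(saved_paths)
```

Returning the `fig` is what makes the dictionary useful: a stored figure can be shown, retitled through `fig.axes[0].set_title(...)`, or saved again later.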
