# Lesson 3: Visualizing Data

Goals and Key Ideas:
1. Purposes and types of data visualizations
    + Determining the correct type of data visualization for a given type of variable
2. Good and bad data visualizations
    + Principles of good data visualizations
3. Creating data visualizations using Python and the `seaborn` library
    + Visualizing the distribution of a variable: **Bar plots** and **Histograms**
    + Visualizing the relationship between two variables: **Scatterplots**
  
Here is the seaborn documentation for the functions used in this lesson:

displot: https://seaborn.pydata.org/generated/seaborn.displot.html

histplot: https://seaborn.pydata.org/generated/seaborn.histplot.html

barplot: https://seaborn.pydata.org/generated/seaborn.barplot.html

countplot: https://seaborn.pydata.org/generated/seaborn.countplot.html

scatterplot: https://seaborn.pydata.org/generated/seaborn.scatterplot.html

More information about matplotlib.pyplot (this is a great tool to organize your figures/plots): https://matplotlib.org/stable/tutorials/pyplot.html


## Class Starter

Suppose that we have a data frame called `top50` (top 50 songs on Spotify in 2019).
What does the following Python code do?

					top50[top50['Length'] < 165]

- A. Sorts the rows based on the values in the `Length` column, from small to large				
- B. Adds a new column into the data frame	
- C. Keeps only rows whose `Length` values are less than 165
- D. None of the above

Respond on PollEV: https://pollev.com/fshum

## Recap

**Summary of Lesson 01 and 02:**
- Import necessary libraries
- Obtain and/or construct data
- Modify data
- Sort and/or filter data
- Group data by attribute

## Finding Datasets

An extensive list of places where we can get datasets can be found on Brightspace under the Project tab. Below is an example of one of those places.

- https://www.kaggle.com/datasets

<img src='kaggle.png' width=600>

## 1. Purposes and types of data visualizations

### Why visualize data?

- Convey your findings in an impactful way
- Spark interest in those who may not read through all the details
- Provide accessible information to a wider audience, especially to those who may not be familiar with the topic
- Identify outliers, trends, and generalizations easily and make predictions

### Types of Data Visualizations

1. For visualizing the **distribution** of a variable
    - Bar graphs: if the variable is categorical
    - Histograms: if the variable is numerical (continuous)
2. For visualizing the **relationship** between variables
    - Scatterplots: to visualize the relationship between two numerical variables 
    - Bar plots: to visualize the relationship between a categorical variable and a numerical variable


Other data visualizations:

- Pie charts:
    - Not great with lots of categories; bar plots are better.
    - Only meaningful if the categories sum up to a whole.
- Box plots
- Line graphs
- etc.

## 2. Good and bad data visualizations

### Examples of (good and bad) data visualizations

https://drive.google.com/file/d/1e97xdNftGal-lMH--x7vOWobfRIAMfSy/view?usp=sharing

### Some principles of good data visualizations

1. Use the appropriate type of graph/chart, consistent with the type of variable
2. Clearly label and explain the axes
3. Axes should generally start from 0 and be consistent in scale
4. Avoid pie charts or other shapes; use bar graphs
5. In bar charts and histograms: the area of each bar should be proportional to the quantity represented
6. When visualizing proportions or percentages, clearly state what the population is
7. Overall, any person should be able to observe key takeaways
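
As a minimal sketch of principles 2 and 3, the snippet below labels both axes, adds a title, and pins the y-axis to start at 0. It reuses the pets data from the example table in this lesson; the label text, the title, and the figure size are illustrative choices rather than part of the original material.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pets data from the example table in this lesson
pets = pd.DataFrame({
    "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
    "Weight_lb": [25, 15, 100, 20, 20, 4],
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=pets, x="Species", ax=ax)

# Principle 2: clearly label the axes (and give the plot a title)
ax.set_xlabel("Species")
ax.set_ylabel("Number of pets")
ax.set_title("Pets by species")

# Principle 3: start the y-axis at 0
ax.set_ylim(bottom=0)
```

In a notebook, calling `plt.show()` afterwards displays the figure.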


### Examples

- https://www.cdc.gov/flu/weekly/index.htm

<img src='ILI_WeeklyMap.png' width=600>

- Who has the largest vocabulary in hip-hop?
    - https://pudding.cool/projects/vocabulary

<img src='hiphoplyrics.png' width=600>

- https://furmancenter.org/neighborhoods

<img src='bar_housing.png' width=600>

- Covid-19 cases in the US
    - https://coronavirus.jhu.edu/region/united-states

<img src='cdc_covid.png' width=600>

### Visualizing Distributions: Bar Graphs

**Bar graphs** are used to visualize the **distribution of a categorical variable**.
- One categorical variable
- One bar for each group
    - Bar height = “count” = how many observations belong to the group, or
    - Bar height = “proportion” = the proportion of observations that belong to the group
- The bars have equal width

**Example:**

| Name  | Species | Weight_lb | Age |
| --- | --- | --- | --- |
| Alex | Cat | 25 | 8 |
| Bert | Cat | 15 | 3 |
| Cate | Dog | 100 | 4 |
| Doug | Cat | 20 | 3 |
| Evan | Dog | 20 | 1 |
| Finn | Rabbit | 4 | 2 |

<table>
    <tr>
        <td><p>Let's suppose we want to plot a bar graph of Species using proportions as the bar height.</p> 
            <p>Note that there are 3 Cats, 2 Dogs, and 1 Rabbit out of 6 animals.</p> 
            <p>Then the proportion of Cats is 3/6, the proportion of Dogs is 2/6, and the proportion of Rabbits is 1/6.</p></td>
        <td><img src="bargraph1.png" width="400">
</td>
    </tr>
</table> 
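
The counts and proportions in this example can be checked with pandas. This is a small sketch: the variable names `pets`, `counts`, and `proportions` are chosen here for illustration.

```python
import pandas as pd

# Pets data from the example table
pets = pd.DataFrame({
    "Name": ["Alex", "Bert", "Cate", "Doug", "Evan", "Finn"],
    "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
})

# Bar height as a count: number of observations in each group
counts = pets["Species"].value_counts()

# Bar height as a proportion: count divided by the total number of observations
proportions = pets["Species"].value_counts(normalize=True)

print(counts)
print(proportions)
```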

### Quick Concept Check

Which of the following statements is/are true about **bar graphs**?

- A. If the y-axis is count, then the group with the largest number of observations has the tallest bar.			
- B. If the y-axis is proportion, then the sum of the heights of all the bars is one.
- C. When the y-axis is count, the bar graph might look different than when the y-axis is proportion.
- D. When the y-axis is count, the bar graph might have a different number of bars than when the y-axis is proportion.
- E. Statements A-D are all false.

Respond on PollEV: https://pollev.com/fshum

### Visualizing Distributions: Histograms

**Histograms** are used to visualize the **distribution of a numerical variable**.
- One numerical variable
- Group the numbers (into “bins”/groups)
- One bar for each bin
    - Bar height = how many observations belong to the bin, or
    - Bar height = the **density** of observations that belong to the bin
      - Area of the bar = proportion of observations in that bin
- Usually, the bars/bins have equal width


<table>
    <tr>
        <td><p>Let's suppose we want to plot a histogram of Weight_lb with 3 bins.</p><p> If we take the range of the weights to be [0, 100],</p><p> the cut-offs are [0, 33.33, 66.67, 100].</p> 
            <p>Note that Bin 1 is [0, 33.33] (including both 0 and 33.33),</p><p> Bin 2 is (33.33, 66.67] (including 66.67 only),</p><p> and Bin 3 is (66.67, 100] (including 100 only).</p> 
            <p>Under this convention, the first bin includes both endpoints, and every later bin includes only its right endpoint. Plotting libraries may differ: NumPy and seaborn make every bin include only its left endpoint, except the last bin, which includes both.</p></td>
        <td><img src="histogram_bin.png" width="600"></td>
    </tr>
</table> 
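
The binning described above can be reproduced with `pandas.cut`. This is a sketch under the same assumptions as the text (range [0, 100], three equal-width bins); the labels "Bin 1" to "Bin 3" are names chosen here for illustration.

```python
import pandas as pd

# Weights from the pets table
weights = pd.Series([25, 15, 100, 20, 20, 4], name="Weight_lb")

# Three equal-width bins over [0, 100]
edges = [0, 100 / 3, 200 / 3, 100]
labels = ["Bin 1", "Bin 2", "Bin 3"]

# right=True makes each bin include its right endpoint;
# include_lowest=True makes the first bin include its left endpoint as well
binned = pd.cut(weights, bins=edges, labels=labels, right=True, include_lowest=True)

# Number of observations in each bin (empty bins are reported with a count of 0)
bin_counts = binned.value_counts().sort_index()
print(bin_counts)
```

None of the six weights sits exactly on a cut-off, so the choice of endpoint convention does not change the counts for this data.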

<table>
    <tr>
        <td><p>Note that density = (proportion) / (width of bin).</p> 
            <p>Note that Area = (density) × (width of bin) = proportion.</p> 
</td>
        <td><img src="histogram.png" width="500"></td>
    </tr>
</table> 

**Scale of each bin:**

| Bin | Count | Proportion | Density |
| --- | --- | --- | --- |
| 1 | 5 | 5/6 | $\dfrac{5/6}{100/3}$ |
| 2 | 0 | 0 | $\dfrac{0}{100/3}$ |
| 3 | 1 | 1/6 | $\dfrac{1/6}{100/3}$ |
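
The table can be recomputed with NumPy as a quick check. This sketch assumes the same range [0, 100] and 3 bins as above; the variable names are illustrative.

```python
import numpy as np

# Weights from the pets table
weights = np.array([25, 15, 100, 20, 20, 4])

# Count the observations in 3 equal-width bins over [0, 100]
counts, edges = np.histogram(weights, bins=3, range=(0, 100))
widths = np.diff(edges)

# proportion = count / total number of observations
proportions = counts / counts.sum()

# density = proportion / width of bin
densities = proportions / widths

# area of each bar = density * width of bin, which recovers the proportion
areas = densities * widths

print(counts, proportions, densities, areas.sum())
```

The areas sum to one, while the densities themselves do not.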

### Quick Concept Check

Which of the following statements is/are true about **histograms**?

- A. If the y-axis is count, then the group with the largest number of observations has the tallest bar.			
- B. If the y-axis is density, then the sum of the heights of all the bars is one.
- C. When the y-axis is count, the histogram might look different than when the y-axis is density.
- D. When the y-axis is count, the histogram might have a different number of bars than when the y-axis is density.
- E. Statements A-D are all false.

Respond on PollEV: https://pollev.com/fshum

### Visualizing Relationships: Scatterplots

**Scatterplots** are used to visualize the **relationship between two numerical variables**.
- Two numerical variables
- One point for each observation

<table>
    <tr>
        <td><p>Let's compare Age with Weight. </p>  
</td>
        <td><img src="scatterplot.png" width="500"></td>
    </tr>
</table> 

### Visualizing Relationships: Bar Graphs

We can use bar graphs to visualize the relationship between a **categorical variable** and a **numerical variable**.
- One categorical variable and one numerical variable
- One bar for each group
- The height of the bar is an aggregate value for each group



**Example:**

| Name  | Species | Weight_lb | Age |
| --- | --- | --- | --- |
| Alex | Cat | 25 | 8 |
| Bert | Cat | 15 | 3 |
| Cate | Dog | 100 | 4 |
| Doug | Cat | 20 | 3 |
| Evan | Dog | 20 | 1 |
| Finn | Rabbit | 4 | 2 |

| Species | Average_Weight|
| --- | --- | 
| Cat | 20 | 
| Dog | 60 | 
| Rabbit | 4 |

<table>
    <tr>
        <td><p>Let's compare Species with their respective Average Weight.</p> 
</td>
        <td><img src="bargraph2.png" width="500"></td>
    </tr>
</table> 
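
The aggregate values in the Average_Weight table can be computed with a pandas `groupby`. This is a minimal sketch; `pets` and `avg_weight` are names chosen here for illustration.

```python
import pandas as pd

# Pets data from the example table
pets = pd.DataFrame({
    "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
    "Weight_lb": [25, 15, 100, 20, 20, 4],
})

# One aggregate value (here, the mean weight) per group
avg_weight = pets.groupby("Species")["Weight_lb"].mean()
print(avg_weight)
```

seaborn's `barplot` performs this kind of aggregation itself (its default estimator is the mean), so `sns.barplot(data=pets, x="Species", y="Weight_lb")` should draw bars with these heights.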

## 3. Creating data visualizations using Python and the `seaborn` library

We will use the `seaborn` library for creating plots:

    	import seaborn as sns
     
Important functions from the `seaborn` library that we’ll use today:
- displot(): for visualizing the distribution of a variable (bars, histograms, etc.)
- relplot(): for visualizing the relationships between variables (scatterplots, etc.)
- countplot(): for bar plots
- histplot(): for histograms
- scatterplot(): for scatterplots


In [None]:
# import the libraries (pandas, seaborn, matplotlib.pyplot)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## Example with Pets

In [None]:
df = pd.DataFrame({"Name": ["A", "B", "C", "D", "E", "F"],
                   "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
                   "Weight_lb": [25, 15, 100, 20, 20, 4],
                   "Age": [8, 3, 4, 3, 1, 2]})

## 3.1 Bar graph

Bar graphs are used for categorical variables. In this case, a bar graph can be used for `Species` or `Name`. `Species` is more suitable because every animal here has a distinct name, so each bar for `Name` would have a height of 1.

    sns.displot(data = DATAFRAMENAME , x = 'COLNAME')
    sns.countplot(data = DATAFRAMENAME , x = 'COLNAME')

- Inputs: 
    - a data frame and 
    - a column in that data frame (a categorical variable)
- Output:
    - A bar graph

**Additional options in displot:**

    sns.displot(data = DATAFRAMENAME , x = 'COLNAME', OPTIONS)
      
- OPTIONS
    - Pick the y-axis: 
        - stat = 'count' OR stat = 'probability'
    - Pick colors:
        - color = 'red', etc.
    - etc.



In [None]:
# displot is a general-purpose function for plotting distributions; for a categorical column it draws a bar graph
sns.displot(data = df, x = 'Species')

In [None]:
# The previous plot showed counts; now show proportions (probabilities) instead
sns.displot(data = df, x = 'Species', stat = 'probability')

In [None]:
# countplot is a function specifically for count bar graphs (more specific than displot)
sns.countplot(data = df, x = 'Species')

## 3.2 Histogram

Histograms are used for a continuous variable rather than a discrete variable. Here, a histogram would be appropriate to show the number of animals that fall within each range of ages or weights.

e.g., how many animals are within the ages of 1-3, 3-5, 5-7, etc.

    sns.displot(data = DATAFRAMENAME , x = 'COLNAME')
    sns.histplot(data = DATAFRAMENAME , x = 'COLNAME')
- Inputs: 
    - A data frame and 
    - A column in that data frame (a numerical variable)
- Output: 
    - A histogram


**Additional options:**

    sns.displot(data = DATAFRAMENAME , x = 'COLNAME', OPTIONS)
    sns.histplot(data = DATAFRAMENAME , x = 'COLNAME', OPTIONS)
- OPTIONS
    - Pick how many bins:
      - bins = NUMBER
    - Pick bin width:
      - binwidth = NUMBER
    - Pick the y-axis: 
        - stat = 'count' OR stat = 'probability' OR stat = 'density'
    - Pick colors:
        - color = 'red', etc.
    - etc.


In [None]:
# You can use displot or histplot; histplot is more specific than displot
# Specify number of bins (or number of categories)
sns.displot(data=df, x='Age',bins=2)

In [None]:
# Specify number of bins (or number of categories)
sns.displot(data=df, x='Age',bins=4)

In [None]:
# Specify width of bins (or how wide each bar will be)
sns.displot(data=df, x='Age',binwidth=1.5)

In [None]:
# Specify the bin edges with a list
sns.displot(data=df, x='Age',bins = [1,3,5,7,9])

## 3.3 Scatterplot

A scatterplot visualizes two numerical variables together, with each observation drawn as a coordinate point. Here, age and weight would be an appropriate choice if we want to find out whether age and weight are correlated with each other.

    sns.relplot(data=DATAFRAMENAME , x='COLNAME1', y='COLNAME2')
    sns.scatterplot(data=DATAFRAMENAME , x='COLNAME1', y='COLNAME2')
- Inputs: 
    - A data frame
    - Two column names (both are numerical variables)
- Output: 
    - A scatterplot


In [None]:
sns.scatterplot(data=df , x='Age', y='Weight_lb')

In [None]:
sns.scatterplot(data=df , x='Age', y='Weight_lb', hue='Species')
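
The cells above use `scatterplot`; `relplot` accepts the same arguments but returns a seaborn `FacetGrid` (a whole figure) rather than a single matplotlib axes. A minimal sketch of the `relplot` version follows; the axis label text and the title are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd
import seaborn as sns

# Pets data from the example table
pets = pd.DataFrame({
    "Species": ["Cat", "Cat", "Dog", "Cat", "Dog", "Rabbit"],
    "Weight_lb": [25, 15, 100, 20, 20, 4],
    "Age": [8, 3, 4, 3, 1, 2],
})

# relplot draws a scatterplot by default and returns a FacetGrid
g = sns.relplot(data=pets, x="Age", y="Weight_lb", hue="Species")

# Labels are set through the grid; g.ax is the single axes it contains
g.set_axis_labels("Age", "Weight (lb)")
g.ax.set_title("Pet weight versus age")
```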

## Additional Visualization Examples: Boxplot

<img src='boxplot.png' width=600>

## Additional Visualization Examples: Word Cloud

<table>
    <tr>
        <td><img src="word_cloud_script.png" width="600"></td>
        <td><img src="word_cloud_plt.png" width="300">
</td>
    </tr>
</table> 

## Practice

### Train of Thought

- **Observation**: Identify information that needs to be visualized
- **Decision**: Based on the information, determine an appropriate type of visualization (histogram, bar plot, scatter plot, etc)
- **Action**: Build it. Be sure to display key information (labels, legends, titles, axes names, etc.) in the type of visualization you chose
- **Presentation**: Beautify and make your visualization appealing



### Exercise

Grab the Medals dataset from https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo/data

Here is documentation on how to read in Excel files using pandas: https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

1. Visualize the Team/NOC and the number of medals received
2. Display the distribution of types of medal for the top 10 teams
3. Display the proportion of teams that received a certain number of medals, maybe in multiples of 20
4. Determine if there is a relationship between the total number of medals a team wins and the number of gold medals they win.


In [None]:
# read in data and name it df


In [None]:
# Question 1
# Set up 'area' to plot your figure
plt.figure(figsize=(20,20))
plt.title("Olympic Team and Total Number of Medals Won")

# The plot itself
sns.barplot(data=df, x='Team/NOC', y='Total', hue='Team/NOC', legend=False)

# Beautify
plt.xticks(rotation=90);

# Display the final plot
plt.show()

In [None]:
# Question 2
# Keep the first 10 rows in a new variable so that df still holds every team
# for Questions 3 and 4 (this assumes the rows are already ordered by rank)
top10 = df.head(10)

# Select info needed from the larger df
sub_df = top10[['Team/NOC', 'Gold', 'Silver', 'Bronze']]

# The plot itself
# Note here the plotting function is from Pandas, which creates its own figure,
# so the figure size is passed through figsize instead of plt.figure
# Stacked bar graphs can show subsets of information - that is,
# out of the total medals, how many of them were gold/silver/bronze?
sub_df.plot(kind='bar', x='Team/NOC', stacked=True, color=['gold', 'silver', 'peru'],
            title='Top 10 Teams & Medal Distribution',
            ylabel='Medal Count', figsize=(20, 20));
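
The stacked bars above show medal counts. To compare proportions directly, each row can be divided by its total before plotting. The sketch below uses made-up numbers for three hypothetical teams (Team A, Team B, Team C), not values from the Medals dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Made-up medal counts for three hypothetical teams (illustration only)
medals = pd.DataFrame({
    "Team/NOC": ["Team A", "Team B", "Team C"],
    "Gold": [10, 4, 1],
    "Silver": [5, 4, 3],
    "Bronze": [5, 2, 6],
}).set_index("Team/NOC")

# Divide each row by its row total so that every team's shares sum to 1
shares = medals.div(medals.sum(axis=1), axis=0)

# Every stacked bar now has the same total height (1)
ax = shares.plot(kind="bar", stacked=True, color=["gold", "silver", "peru"],
                 title="Share of Each Medal Type", ylabel="Proportion of medals")
```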

In [None]:
# Question 3
sns.histplot(data=df, x="Total", stat='percent', bins=[0, 20, 40, 60, 80, 100, 120])

In [None]:
# Question 4
sns.scatterplot(data=df, x='Gold', y='Total')

In [None]:
# Question 4 (zoomed in): keep only teams with fewer than 15 gold medals
temp = df.query('Gold < 15')
sns.scatterplot(data=temp, x='Gold', y='Total')

## More Examples

Here are some more example datasets that we can try to visualize

In [None]:
# load some example datasets

#1. top 50 songs on spotify in 2019
top50 = pd.read_csv('../../../shared/datasets/top50.csv')

#2. 
babyweight = pd.read_csv('../../../shared/datasets/babyweight.csv')

#3. NYC Tree Census
nyctrees = pd.read_csv('../../../shared/datasets/NYC_Tree_Census_small.csv')

## More Resources

- https://seaborn.pydata.org/tutorial.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.bar.html
- https://matplotlib.org/stable/tutorials/index
- https://matplotlib.org/stable/gallery/color/named_colors.html
