<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 07: Charts

Associated Textbook Sections: [7.0, 7.1](https://inferentialthinking.com/chapters/07/Visualization.html)

## Overview

* [Why Do We Visualize Data](#Why-Do-We-Visualize-Data)
* [Course Visualizations](#Course-Visualizations)
* [Numerical Data](#Numerical-Data)
* [Categorical Data](#Categorical-Data)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

---

## Why Do We Visualize Data

* A large fraction of our brains is dedicated to visual reasoning.
* In Data Science we use visualization:
    * For others – to communicate our findings
    * For ourselves – to understand our data, see patterns, and discover relationships 

### Demo: Identifying Data Type of Column Values

Load the `actors.csv` data. The `'Total Gross'`, `'Average per Movie'`, and `'Gross'` values are in thousands of dollars.

In [None]:
actors = Table.read_table('./data/actors.csv')
actors

The actor's name is a categorical attribute.

In [None]:
...

The total gross dollar is a numerical attribute.

In [None]:
...

---

## Course Visualizations

* In the course we will mostly use the following visualizations:
    * Histograms
    * Line Graphs
    * Scatter Plots
    * Bar Charts
* You will work with the standard [Matplotlib Python library](https://matplotlib.org/) for data visualization indirectly, through the `datascience` library.

* It may be helpful to overlay graphs to explore relationships.
* How you visualize your data depends on attribute type.
* The data type doesn't determine whether an attribute is numerical or categorical.
    * `'$12.00'` is a `str`, but it likely reflects a numerical attribute.
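
As a minimal sketch of the last point, a currency string can be converted to a number before it is plotted. The helper name `dollars_to_float` is hypothetical (it is not part of the `datascience` library), and it assumes values formatted like `'$12.00'` or `'$1,250.25'`.

```python
def dollars_to_float(text):
    """Convert a currency string such as '$12.00' to a float.

    Hypothetical helper for illustration; assumes a leading '$'
    and optional thousands separators.
    """
    return float(text.replace('$', '').replace(',', ''))

prices = ['$12.00', '$7.50', '$1,250.25']   # stored as str
amounts = [dollars_to_float(p) for p in prices]

# The stored type changes, but the attribute was numerical all along.
print(type(prices[0]).__name__, type(amounts[0]).__name__)
print(amounts)
```

Once the values are numbers, they can be treated like any other numerical attribute.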

---

## Numerical Data

### Visualizing the Distribution of One Numerical Variable

Histograms (`tbl.hist`) are a standard way to visualize the distribution of one numerical variable.

*The next lecture focuses on histograms.*

#### A Histogram

In [None]:
actors.hist('Total Gross', unit="Thousands of Dollars") 

# Some extra graph formatting you are not responsible for
plots.title('Distribution of Total Gross')
plots.show()

### Plotting Two Numerical Variables

Line graphs (`tbl.plot`) and scatter plots (`tbl.scatter`) are standard ways to visualize the relationship between two numerical variables.

#### A Line Graph

In [None]:
top_movies = Table.read_table('./data/top_movies_2023.csv')
movies_per_year = top_movies.group('Year').relabeled('count', 'Number of Movies')
movies_per_year.where('Year', are.above(1999)).plot('Year', 'Number of Movies') 

plots.xticks(np.arange(2000, 2023, 5))
plots.title('Number of Movies vs. Release Year')
plots.show()

#### A Scatter Plot

In [None]:
actors.scatter('Number of Movies', 'Average per Movie')

plots.title('Average Pay per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()

### When to use a line vs scatter plot?

* Use line plots for sequential data if:
    * ... your x-axis has an order
    * ... sequential differences in y values are meaningful
    * ... there's only one y-value for each x-value
* Usually: x-axis is time or distance
* Use scatter plots for non-sequential data, when you’re looking for associations
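
One of the conditions above (only one y-value for each x-value) can be checked mechanically. The sketch below uses a hypothetical helper, `suggest_chart`, and made-up values. It tests only that one condition and cannot judge whether the x-axis order or the sequential differences are meaningful.

```python
def suggest_chart(x_values):
    """Suggest 'line' when each x-value appears exactly once, else 'scatter'.

    Hypothetical helper: it checks only the one-y-per-x condition.
    """
    has_unique_x = len(set(x_values)) == len(x_values)
    return 'line' if has_unique_x else 'scatter'

years = [2000, 2001, 2002, 2003]          # made-up: one count per year
movies_per_actor = [12, 30, 12, 45, 30]   # made-up: actors share counts

print(suggest_chart(years))
print(suggest_chart(movies_per_actor))
```

Treat the result as a starting point. A column of unique x-values with no natural order is still better shown as a scatter plot.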


### Demo: Census

Explore the US Census data from the [Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-agesex.pdf). 

(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)

In [None]:
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full = Table.read_table(url)
full

In the previous lecture, we did the following:
* Select the `SEX`, `AGE`, `CENSUS2010POP`, and `POPESTIMATE2019` columns.
* Relabel the 2010 and 2019 columns.
* Remove the 999 ages and focus just on the combined data where the `SEX` value is 0. Drop the `SEX` column since there is only one value there.

In [None]:
partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2019')
simple = partial.relabeled(2, '2010').relabeled(3, '2019')
no_999 = simple.where('AGE', are.below(999))
everyone = no_999.where('SEX', 0).drop('SEX')
everyone

Visualize the relationship between age and population size in 2010.

In [None]:
...

plots.title('US Population Size') 
plots.show()

Include lines for both 2010 and the estimated 2019 population sizes.

In [None]:
...

plots.title('US Population Size') 
plots.show()

### Demo: Male and Female 2019 Estimates

Create a table with `Age`, `Males`, `Females` columns showing the population estimates in 2019 for males and females by age.

In [None]:
males = ...
females = ...
pop_2019 = Table().with_columns(
    'Age', ...,
    'Males', ...,
    'Females', ...
)
pop_2019

Visualize the distribution of population size for both males and females.

In [None]:
...

plots.title('2019 Population Size Estimates')
plots.show()

Calculate the percent female for each age.

In [None]:
...
pct_female = ...
pct_female

Round the values to 3 decimal places so that they are easier to read.

In [None]:
pct_female = ...
pct_female

Add the percent female to our table.

In [None]:
pop_2019 = ...
pop_2019

Visualize the relationship between age and the percent of the population that is female.

In [None]:
...

plots.title('Female Population Percentage over Age')
plots.show()

Be careful not to be visually misled by the y-axis.

In [None]:
...

plots.ylim(0, 100);
plots.title('Female Population Percentage over Age')
plots.show()
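
To see why the axis limits matter, it can help to compare how much of the y-axis the data occupies under each choice. The sketch below uses made-up percentages (not census values) and a hypothetical helper, `share_of_axis`. It assumes that default limits roughly hug the data, whereas `plots.ylim(0, 100)` shows the full percentage scale.

```python
def share_of_axis(values, y_min, y_max):
    """Return the fraction of the y-axis height spanned by the data."""
    return (max(values) - min(values)) / (y_max - y_min)

pct = [48.8, 49.5, 50.9, 53.0]   # made-up percentages for illustration

# Limits that hug the data: the line fills the whole plot height.
tight_share = share_of_axis(pct, min(pct), max(pct))

# Limits of 0 to 100: the same change uses a small slice of the height.
full_share = share_of_axis(pct, 0, 100)

print(tight_share, round(full_share, 3))
```

The data is identical in both cases. Only the share of the plot height it occupies changes, which is what makes a small difference look dramatic.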

### Demo: Scatter Plots

Visualize the relationship between the number of movies and the average pay per movie for each actor in the dataset.

In [None]:
...

plots.title('Average per Movie (Thousands of Dollars) vs. Number of Movies')
plots.show()

Identify the outlier in the dataset.

In [None]:
...

In [None]:
...

In [None]:
...

For all the visualization methods we use from the `datascience` library, if you put an `i` in front of the name of the visualization, you can access an interactive version of the plot that is based on another visualization library called [Plotly](https://plotly.com/). You will not be tested on your knowledge of these interactive plots, but you might find them helpful for exploring the data.

In [None]:
actors.iscatter('Number of Movies', 
                'Average per Movie', 
                labels='Actor', 
                title='Average per Movie (Thousands of Dollars) vs. Number of Movies')

---

## Categorical Data

* (Horizontal) bar charts (`tbl.barh`) are a standard way to visualize the distribution of a single categorical variable.
* Pie charts are generally discouraged because most people have a difficult time visually interpreting angles compared to lengths of bars. 
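
A bar chart of one categorical variable is a picture of its category counts. As a rough sketch of that idea, the standard library's `Counter` can tally made-up flavor values and print one text bar per category. This is a stand-in for illustration, not how the `datascience` library draws its charts.

```python
from collections import Counter

# Made-up categorical values for illustration
flavors = ['chocolate', 'vanilla', 'chocolate',
           'strawberry', 'chocolate', 'vanilla']

# Tally how many times each category appears
counts = Counter(flavors)

# Print one bar per category, longest first; bar length encodes the count
for flavor, count in counts.most_common():
    print(f'{flavor:<12}{"#" * count} {count}')
```

Comparing bar lengths is the same visual task a horizontal bar chart asks of the reader.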


#### A Bar Chart

In [None]:
cones = Table.read_table('./data/cones.csv')
cones_grouped_by_flavor = cones.group('Flavor')
cones_grouped_by_flavor.barh('Flavor')

plots.title('Distribution of Ice Cream Flavors')
plots.show()

### Demo: Bar Charts

The dataset `top_movies_2023.csv` shows the 1,000 highest-grossing movies worldwide listed on IMDb. Adjusted total gross values were also provided for data before 2021 using the Consumer Price Index (CPI)-based Python library `cpi`.



In [None]:
top_movies

Since _Gone with the Wind_ has been re-released several times, its adjusted total is not the most honest representation of its gross proceeds. For a more comparable analysis, reduce the table to the top 10 movies released in the last decade, based on the `'Gross (Adjusted)'` values.

In [None]:
top_movies_select = ...
top_movies_last_decade = ...
top_movies_last_decade_sorted = ...
top10 = ...
top10

Convert the gross (adjusted) values to billions of dollars for readability.

In [None]:
billions = ...
top10 = ...
top10

Visualize the gross adjusted values for each of the top 10 grossing (adjusted) movies.

In [None]:
...

plots.title("The Top 10 Grossing Movies")
plots.show()

### Visual Perception Accuracy

From [Nathan Yau’s Data Points: Visualization that Means Something](https://flowingdata.com/data-points/), our eyes can extract information at different levels of accuracy depending on the design.

<img src="./img/lec08_visual_perception.png" width=70%>

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>