# Lecture 7: Charts

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Census

In [None]:
full = Table.read_table('nc-est2019-agesex-res.csv')
full.show(3)

Each SEX-AGE combination is represented in a row. 

  - Remember, SEX is coded (0 = all, 1 = male, 2 = female).

  - The AGE values are ages in years, except 100 is interpreted as "100 or older" and 999 is interpreted as "all ages".
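
The two coding rules above can be written as a small lookup. This is only a sketch in plain Python: it does not read the census file, and `SEX_LABELS`, `describe_age`, and `describe_row` are hypothetical helpers introduced for illustration, not part of the `datascience` library.

```python
# Hypothetical helpers that decode the SEX and AGE codes described above.
SEX_LABELS = {0: 'all', 1: 'male', 2: 'female'}

def describe_age(age):
    """Return a readable description of an AGE code."""
    if age == 999:
        return 'all ages'
    if age == 100:
        return '100 or older'
    return str(age) + ' years'

def describe_row(sex, age):
    """Describe one SEX-AGE combination in words."""
    return SEX_LABELS[sex] + ', ' + describe_age(age)

print(describe_row(0, 999))   # all, all ages
print(describe_row(2, 100))   # female, 100 or older
print(describe_row(1, 34))    # male, 34 years
```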

In [None]:
partial = full.select(0, 1, 8, 13)
partial

In [None]:
# We can shorten up annoyingly-long column names
us_pop = partial.relabeled(2, '2014').relabeled(3, '2019')
us_pop.show(4)

### Line Plots

The 999 value in the AGE column indicates "all ages". We won't be using those rows in our line plot examples, so we need to make a new table without the AGE = 999 rows.

In [None]:
# Make a new table, assigned to the variable `no_999`, without those rows
no_999 = ...
no_999
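
To see what dropping the AGE = 999 rows does, here is a plain-Python sketch on made-up data. It is not the answer to the cell above, which should use `Table` methods. The rows, the counts, and the names `toy_rows` and `toy_no_999` are invented for illustration.

```python
# Toy stand-in for the census table: one (SEX, AGE, count) tuple per row.
# The counts are made up; only the SEX and AGE codes matter here.
toy_rows = [
    (sex, age, 1000)
    for sex in (0, 1, 2)          # all, male, female
    for age in (0, 1, 2, 999)     # three real ages plus the "all ages" code
]

# Keep only the rows whose AGE is not the special 999 code.
toy_no_999 = [row for row in toy_rows if row[1] != 999]

rows_dropped = len(toy_rows) - len(toy_no_999)
print(len(toy_rows), len(toy_no_999), rows_dropped)  # 12 9 3
```

In this toy data, one row is dropped for each SEX code.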

**Explain**: Why did the number of rows decrease by 3 when we omitted the AGE = 999 rows?

In [None]:
# Make a new table from no_999, assigned to the variable `overall`,
# which only keeps the SEX = 0 (both sexes combined) rows;
# then eliminate the 'SEX' column since all the values are identical.
overall = ...
overall

In [None]:
# Plot of US Population in 2019 (in millions) versus Age
overall.plot('AGE', '2019')

It's important to **document** the meaning of each plot as it's generated, so that when you come back later, or share the notebook with teammates, there's no question about what the plot shows. There are three standard ways to do this:

  1. Put the title in a **comment** immediately above the line of code which makes the plot. (See previous cell for an example of this.)
  2. **Print** the title before generating the plot. (See next cell for an example of this.)
  3. **Add a title element** to the plot itself. On the code line right after making the plot, use the syntax `plots.title(...);` where a string literal is put in the parentheses and there's a semicolon after the closing parenthesis. (See the cell after the next one!)

In [None]:
print('US Population in 2019 (in millions) versus Age')
overall.plot('AGE', '2019')

In [None]:
overall.plot('AGE', '2019')

# plots.title(...); adds a title to the "current" plot
plots.title('US Population in 2019 (in millions) versus Age');  

**Questions**: 
  - Why is there an uptick in the graph for AGE = 100?
  - Do you see evidence in this plot that the US birthrate decreased from 1999 to 2019, or increased from 1999 to 2019?

### Males vs. Females

Make a table, assigned to the name `males`, for the rows of the `no_999` table corresponding to the male population numbers. Make a second table, assigned to the name `females`, in a similar way.

In [None]:
males = ...
females = ...

Now we'll make a new 3-column table with column labels 'AGE', 'MALE_POP', and 'FEMALE_POP'.

In [None]:
pop_2019 = Table().with_columns(
    'AGE', males.column('AGE'),
    'MALE_POP', males.column('2019'),
    'FEMALE_POP', females.column('2019')
)
pop_2019

This new table makes it really simple to plot the male and female population numbers versus age on the same graph.

In [None]:
pop_2019.plot('AGE')
plots.title('US Population in 2019 (in millions), by Sex');

What do you notice?

Now let's focus on a single quantity as a function of age: what percentage of the 2019 population at each age is female?

In [None]:
# Make an array showing total population at each age in 2019
total = pop_2019.column('MALE_POP') + pop_2019.column('FEMALE_POP')

# Make an array showing percent female at each age
pct_female = pop_2019.column('FEMALE_POP') / total * 100
pct_female

In [None]:
# We don't really need all those decimal places
pct_female = np.round(pct_female, 3)
pct_female
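
The element-wise arithmetic in the two cells above can be checked by hand on a few numbers. The sketch below uses plain Python lists instead of arrays, and the population counts are made up (they are not census values); the `toy_` names are hypothetical.

```python
# Made-up male and female counts at three ages (not real census values).
toy_male_pop = [2000, 1500, 300]
toy_female_pop = [2000, 1700, 700]

toy_pct_female = []
for m, f in zip(toy_male_pop, toy_female_pop):
    toy_total = m + f                       # total population at this age
    pct = round(f / toy_total * 100, 3)     # percent female, 3 decimal places
    toy_pct_female.append(pct)

print(toy_pct_female)  # [50.0, 53.125, 70.0]
```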

In [None]:
# Now we add these percentages to our table pop_2019 in a new column
pop_2019 = pop_2019.with_columns(
    'PCT_FEMALE', pct_female
)
pop_2019

In [None]:
# Percent Female vs. Age for the US Population in 2019
pop_2019.plot('AGE', 'PCT_FEMALE')

Because the y axis is constrained to the range 45 to 80 (roughly), instead of ranging from 0 to 100 (the true set of possible values for a percentage), the steepness of the curve is exaggerated. It might be better to show the entire y range from 0 to 100, like so:

In [None]:
# Percent Female vs. Age for the US Population in 2019
pop_2019.plot('AGE', 'PCT_FEMALE')
plots.ylim(0, 100);

**TAKEAWAY**: When you are looking at a visualization, always inspect the y axis carefully. If it doesn't start at 0, ask yourself if it's giving a misleading picture of reality. (The y axis shouldn't **always** start at 0, but if it doesn't there should be a valid reason for that.)
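
One way to quantify the exaggeration is to ask what fraction of the axis height a curve spans under each choice of limits. The sketch below assumes rough endpoints for the curve (about 49 to 77 percent, illustrative values only); `fraction_of_axis` is a hypothetical helper, not a matplotlib function.

```python
def fraction_of_axis(low, high, axis_min, axis_max):
    """Fraction of the y-axis height covered by values from low to high."""
    return (high - low) / (axis_max - axis_min)

# Assumed, approximate range of the curve (illustrative values only).
curve_low, curve_high = 49, 77

zoomed_axis = fraction_of_axis(curve_low, curve_high, 45, 80)   # truncated axis
full_axis = fraction_of_axis(curve_low, curve_high, 0, 100)     # 0 to 100 axis

print(round(zoomed_axis, 2), round(full_axis, 2))  # 0.8 0.28
```

Under these assumptions the same change fills about 80% of the truncated axis but only about 28% of the full axis, which is why the truncated plot looks so much steeper.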

## Scatter Plots

According to [The-Numbers.com](https://www.the-numbers.com/box-office-star-records/domestic/lifetime-acting/top-grossing-leading-stars), the total gross for movies in which Samuel L. Jackson has appeared is currently \\$5,803,143,777. That's 5803 million dollars. The `actors.csv` table we're about to read in is a bit out of date: it shows Samuel L. Jackson's Total Gross as 4772.8 (million dollars).

In [None]:
# Here we have a slightly out-of-date table with data on 50 actors from high-grossing movies
actors = Table.read_table('actors.csv')
actors

**Question**: For actors in this table, is there an association between Number of Movies and Total Gross? 

To check for a possible association between two numerical variables, a scatter plot is a good choice.

In [None]:
# Total Gross (in millions of dollars) versus Number of Movies
actors.scatter('Number of Movies', 'Total Gross')
plots.ylabel('Total Gross (millions)');  # to include the unit label (millions) on the y axis

  - What do you see in this plot? 
  - Does it show a positive association? If yes, what does that mean?
  - Are there any outliers (unusual points which don't seem to fit the overall trend)?

Of course, an actor who has been in lots of movies should be expected to have a higher total gross. Let's compare 'Average per Movie' with 'Number of Movies':

In [None]:
# Average per Movie (in millions of dollars) versus Number of Movies
actors.scatter('Number of Movies', 'Average per Movie')
plots.ylabel('Average per Movie (millions)');  

  - What do you see in this plot? 
  - Does it show a positive, or a negative, association? What does that tell us?
  - Are there any outliers which buck the overall trend, or outliers which are simply extreme?
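
To make the relationship between the two plots concrete, the sketch below assumes that 'Average per Movie' is simply 'Total Gross' divided by 'Number of Movies' (the column names suggest this, but it is an assumption). The actors and numbers are made up.

```python
# Made-up actors: (name, total gross in millions, number of movies).
toy_actors = [
    ('Actor A', 4000.0, 50),
    ('Actor B', 3000.0, 30),
    ('Actor C', 2800.0, 7),
]

# Average per movie = total gross / number of movies, rounded to 1 decimal.
toy_averages = {
    name: round(total_gross / num_movies, 1)
    for name, total_gross, num_movies in toy_actors
}
print(toy_averages)  # {'Actor A': 80.0, 'Actor B': 100.0, 'Actor C': 400.0}
```

In this toy data, the actor with the fewest movies has the lowest total but by far the highest average, which is the kind of point an outlier question is probing.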

In [None]:
# Use table manipulations to discover the actor whose 'Average per Movie' is over 400 million
actors.sort('Average per Movie', descending=True).column('Actor').item(0)

In [None]:
actors.sort('Average per Movie', descending=True)

In [None]:
# We could also have used `where`
actors.where('Average per Movie', are.above(400))

Any idea who Anthony Daniels is? (back to lecture slides...)

## Bar Charts (for Categorical & Nominal Data)

In [None]:
# Read a table about 200 top-grossing movies of all time (as of 2017)
# The fourth column has gross revenue adjusted for inflation
top_movies = Table.read_table('top_movies_2017.csv')
top_movies

In [None]:
# Is each movie title unique (only occurs on one row of the table)?
# Hint: .group & .sort
...
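
The idea behind the hint is to count how many rows share each title and then look at the largest count. Here is a plain-Python sketch of that idea on made-up titles, using `collections.Counter` in place of `Table` methods; it is an illustration, not the intended notebook answer.

```python
from collections import Counter

# Made-up titles; 'Movie X' deliberately appears twice.
toy_titles = ['Movie X', 'Movie Y', 'Movie X', 'Movie Z']

title_counts = Counter(toy_titles)

# most_common() lists (title, count) pairs from the largest count down,
# playing the role of grouping and then sorting in descending order.
most_common_title, highest_count = title_counts.most_common(1)[0]

print(most_common_title, highest_count)   # Movie X 2
print(highest_count == 1)                 # False: titles are not all unique
```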

**Question**: Which attributes are numerical (aka quantitative)? Which are categorical? Which are nominal?

**Answer**: ...

In [None]:
# Make a new table for just the top 10 movies, based on 'Gross (Adjusted)'
# Notice the table is already sorted according to 'Gross (Adjusted)', largest first
top10_adjusted = top_movies.take(np.arange(10))
top10_adjusted

The 'Gross (Adjusted)' values are hard to read because they are so large. What would they look like if we scaled them down by a factor of 1 million?

In [None]:
top10_adjusted.column('Gross (Adjusted)') / 1000000

In [None]:
# Looks good! Let's also round them to three places after the decimal
millions = np.round(top10_adjusted.column('Gross (Adjusted)') / 1000000, 3)
millions
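
The scaling and rounding steps can be checked on a few numbers by hand. The sketch below uses plain Python and made-up dollar amounts (they are not values from `top_movies_2017.csv`).

```python
# Made-up dollar amounts standing in for 'Gross (Adjusted)' values.
toy_gross = [1234567890, 987654321, 500000000]

# Divide by one million, then round to three places after the decimal.
toy_millions = [round(g / 1000000, 3) for g in toy_gross]
print(toy_millions)  # [1234.568, 987.654, 500.0]
```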

In [None]:
# Now add those numbers to the table in a new last column labeled 'Millions'
top10_adjusted = top10_adjusted.with_column('Millions', millions)
top10_adjusted

In [None]:
# Notice we CAN make a line plot using the last 2 columns, but it's not very informative
top10_adjusted.plot('Year', 'Millions')
print("Adjusted Gross (Millions) vs. Year")

In [None]:
# Try a horizontal bar graph (`barh`) showing adjusted gross revenue for the top 10 movies
# Notice the non-quantitative variable goes FIRST in the list of arguments
top10_adjusted.barh('Title', 'Millions')
plots.title('Adjusted Gross Revenue (Millions), by Title');

Questions about `barh`?

In [None]:
# What happens if we try to do a similar thing with the "Studio" attribute?
top10_adjusted.barh('Studio', 'Millions')
plots.title('Adjusted Gross Revenue (Millions), by Studio');

Is this an informative visualization?

BACK TO LECTURE SLIDES for our next challenge...

In [None]:
# Calculate values for the 'Age' column (difference between 2024 and 'Year')

In [None]:
# Add the new 'Age' column to the top10_adjusted table

In [None]:
# Sort in descending order by 'Age'

In [None]:
# Draw the horizontal barchart, showing 'Title' and 'Age'
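
The arithmetic behind the 'Age' column can be sketched in plain Python before writing the `Table` version. The titles and years below are made up, and `REFERENCE_YEAR` is a hypothetical name for the 2024 used in the comments above.

```python
REFERENCE_YEAR = 2024

# Made-up (title, release year) pairs.
toy_movies = [('Movie X', 1977), ('Movie Y', 1939), ('Movie Z', 1997)]

# Age of each movie = reference year minus release year.
with_age = [(title, REFERENCE_YEAR - year) for title, year in toy_movies]

# Sort from oldest to newest (descending age).
oldest_first = sorted(with_age, key=lambda pair: pair[1], reverse=True)
print(oldest_first)  # [('Movie Y', 85), ('Movie X', 47), ('Movie Z', 27)]
```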