# Lecture 8: More Charts!

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Top Movies Data

Make a horizontal bar chart showing the number of years since release for each of the top 10 movies.

In [None]:
# Read a table about the 200 top-grossing movies of all time (as of 2017)
# The fourth column has gross revenue adjusted for inflation
top_movies = Table.read_table('top_movies_2017.csv')

# We'll focus in on the rows for the top 10 movies
top10 = top_movies.take(np.arange(10))

# Add a column, 'Millions', showing 'Gross (Adjusted)' rescaled to millions of dollars
# The first line below computes the array of values for the new column
millions = np.round(top10.column('Gross (Adjusted)') / 1000000, 3)
top10 = top10.with_column('Millions', millions)
top10

In [None]:
# Draw a horizontal bar chart showing 'Title' and 'Millions'
top10.barh('Title', 'Millions')
plots.title('Adjusted Gross Revenue (Millions), by Title');

### Plot Challenge

Workflow advice for drawing complex charts in JupyterLab: 

  - Make a step-by-step plan for the required table manipulations. 
  - Record your plan in comments in a sequence of new code cells. 
  - Add a final cell with a comment describing the chart to be drawn. 
  - Fill in the code, generating output to check each step as you go.

In [None]:
# Calculate an array of values for the 'Age' (difference between 2025 and 'Year')


In [None]:
# Modify the top10 table by adding an 'Age' column


In [None]:
# Modify the top10 table by sorting the rows in descending order by age


In [None]:
# Draw a horizontal bar chart, showing 'Title' and 'Age'; include a plot title


**Back to lecture slides...**

## Categorical Distributions

In [None]:
# Here is the top_movies table (5 attributes, 200 entities)
# Each attribute is a "variable" in the statistical sense
# 'Title' is nominal, 'Studio' is categorical, and the others are numerical
top_movies

In [None]:
# Let's focus on the 'Studio' variable
studios = top_movies.select('Studio')   # a new, 1-column table
studios

The `group` method will let us calculate the distribution of the `Studio` column with just one line of code.

In [None]:
studios_distribution = studios.group('Studio')
studios_distribution

Notice that `t.group(column_label)` includes a `count` column automatically, to show the frequency for each different value of the variable. We'll learn more about the `group` method next week.
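
To see the idea behind those counts, here is a minimal sketch in plain Python. It assumes a small made-up list of studio names (not the real data) and uses `collections.Counter` instead of the `datascience` library, so it only mirrors what a frequency table contains; it is not how `group` is implemented.

```python
from collections import Counter

# Hypothetical studio values, one per movie (not taken from top_movies_2017.csv)
studio_values = ['Fox', 'MGM', 'Fox', 'Disney', 'MGM', 'Fox']

# Count how many times each distinct value appears
counts = Counter(studio_values)

# List each distinct value once with its frequency, sorted alphabetically for readability
distribution = sorted(counts.items())
for studio, count in distribution:
    print(studio, count)
```

Each distinct value appears exactly once, and the counts add up to the number of movies in the list.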

In [None]:
# Visualize this table with a bar chart
# Because we used `group` first, each studio name appears just once 
studios_distribution.barh('Studio', 'count')

In [None]:
# The chart looks a bit chaotic; to make visual comparisons easier, we should
# SORT the table into descending order by `count` before making the chart
studios_distribution.sort('count', descending=True).barh('Studio', 'count')
plots.title('Distribution of Studios (frequencies) for Top-Grossing Movies');

**Back to lecture slides...**

## Numerical Distributions

In [None]:
# Let's work with the ages of ALL the 200 top movies
ages = 2025 - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies

In [None]:
# To help inform our ranges for "binning" the ages, scope out the minimum and maximum age
min(ages), max(ages)

In [None]:
# Most often we'll use equal-width ranges for our bins, but let's try
# a set of unequal bins first
# Make an array of all the left endpoints, plus an extra number to be the 
# right end of the last bin
my_bins = make_array(0, 5, 10, 20, 30, 70, 110)
my_bins


In [None]:
binned_data = top_movies.bin('Age', bins=my_bins)
binned_data

How do we interpret this new table?
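
One way to build intuition is to bin a handful of values by hand. The sketch below assumes a few made-up ages and made-up bin edges, and it calls NumPy's `np.histogram` directly rather than `Table.bin`. Each bin includes its left end and excludes its right end, except that `np.histogram` also includes the right end of the last bin.

```python
import numpy as np

# Hypothetical ages, chosen only for illustration
example_ages = np.array([1, 4, 5, 9, 10, 19, 20, 45])

# Four bin edges define three bins: [0, 5), [5, 10), and [10, 50]
example_bins = np.array([0, 5, 10, 50])

# counts[i] is the number of ages that fall between edges[i] and edges[i + 1]
counts, edges = np.histogram(example_ages, bins=example_bins)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(left, 'to', right, ':', count)
```

An age of exactly 5 is counted in the bin that starts at 5, not in the bin that ends there.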

In [None]:
# More usually, we would use `np.arange(...)` to make equal-width bins
top_movies.bin('Age', bins=np.arange(0,121,20))

**Question**: Are there 6 bins, or 7, or 8?

**Back to lecture slides...**

## Histograms

In [None]:
# recall the my_bins variable
my_bins

**Question**: If we use these bins for a visualization of the 'Age' distribution, how wide will each bar in the histogram be?

In [None]:
binned_data = top_movies.bin('Age', bins=my_bins)
binned_data

In [None]:
# We know the histogram bar areas should show the percentage for each bin
# Let's calculate the percentages (i.e., bar AREAS) before drawing the histogram
num_movies = top_movies.num_rows   # 200 movies in this table
binned_data.column('Age count') / num_movies * 100

In [None]:
top_movies.hist('Age', bins=my_bins, unit='Year')


**Question**: What is the vertical axis label? How does that make sense, keeping the Area Principle in mind?
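
The Area Principle can be checked with a little arithmetic: if a bar's area is the percent of values in its bin, then its height is that percent divided by the bin width. Here is a minimal sketch, assuming made-up bin edges and counts (not the movie data) and using only NumPy.

```python
import numpy as np

# Hypothetical bin edges and counts, for illustration only
edges = np.array([0, 5, 10, 20])
counts = np.array([10, 20, 20])

widths = np.diff(edges)                 # width of each bin: 5, 5, 10
percents = counts / counts.sum() * 100  # percent of values in each bin
heights = percents / widths             # percent per unit of the horizontal axis

# Multiplying height by width recovers the percent in each bin
areas = heights * widths
print(heights)
print(areas.sum())
```

In this made-up example the last two bins hold the same percent of the values, but the wider bin gets a shorter bar, so the two areas stay equal.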

In [None]:
# Draw another histogram, but now use equal-width bins
top_movies.hist('Age', bins=np.arange(0,111,10), unit='Year')

The bar for the [10, 20) bin appears to have height 1.75 Percent per Year, and width 10 Years. What is its area? What does that tell us?

**Type answer here:**

In [None]:
# We can also use "default" bins, but it's unsatisfying to not know 
# the exact bin ranges
top_movies.hist('Age', unit='Year')

**Back to lecture slides...**