## Lecture 8: Histograms ##

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
from IPython.display import Image

## Categorical Distribution ##

In [None]:
top_movies = Table.read_table('top_movies_2017.csv')
top_movies

In [None]:
top_movies = top_movies.with_column('Millions', np.round(top_movies.column('Gross (Adjusted)')/1000000,3))
top_movies

In [None]:

top_movies.take(np.arange(10)).barh('Title', 'Millions')

In [None]:
studios = top_movies.select('Studio')
studios

# New Function Alert: group #


In [None]:
studio_distribution = studios.group('Studio')

**NOTE: The group function returns a new table with one row per distinct value, plus a column called 'count' that holds how many times each value appears.**

In [None]:
studio_distribution
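
As a rough illustration of what grouping computes, here is a minimal sketch in plain Python. The studio labels below are made up for the example (they are not the contents of `top_movies_2017.csv`), and `Counter` stands in for the table's `group` method.

```python
from collections import Counter

# Hypothetical studio labels, one per movie (not the real top_movies data).
studios_sample = ["Fox", "Disney", "MGM", "Disney", "Fox", "Disney"]

# Counter tallies how many times each distinct value appears,
# which is the same idea as grouping a one-column table.
studio_counts = Counter(studios_sample)

# Print one line per distinct studio, in alphabetical order.
for studio in sorted(studio_counts):
    print(studio, studio_counts[studio])

# The counts add up to the number of movies we started with.
total = sum(studio_counts.values())
print(total)
```

Summing the counts gives back the number of rows in the original list, which is the same check the next question asks for.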

**Q. In total, how many movies did the studios on our list make?**

In [None]:
sum(studio_distribution.column('count'))

## Bar Charts ##

In [None]:
studio_distribution.barh('Studio')

**Can we rearrange in descending order based on count?**

In [None]:
studio_distribution = studios.group('Studio').sort('count', descending=True)
studio_distribution

**Rerun barh command to produce the bar chart**

In [None]:
studio_distribution.barh('Studio')

**Q. How do we relabel the horizontal axis so it says 'Number of Movies Produced'?**  
**A.**  


In [None]:
#Relabel the 'count' column of the table
studio_distribution = studio_distribution.relabel('count','Number of Movies Produced')
studio_distribution

**Plot the bar chart again**

In [None]:
studio_distribution.barh('Studio')

*We could've cascaded the sorting and bar-chart plotting all in one line.*  
*But, for readability, it's good to break down the process into separate steps.*

In [None]:
studios.group('Studio').sort('count', descending=True).barh('Studio')
#Another drawback of this is that we can't assign it to the name studio_distribution
#uncomment the line below to see the type that the command above returns. 
#type(studios.group('Studio').sort('count', descending=True).barh('Studio'))

**SLIDE: Bar Charts and Visualization of Categorical Variables**   



## Numerical Distribution ##

In [None]:
#Create an array, called 'ages,' containing the respective ages of the movies
ages = 2021 - top_movies.column('Year')  

#Add the ages array as a new column to the top_movies table
top_movies = top_movies.with_column('Age', ages)

#display the table
top_movies

## Binning ##

**Before we**
<ul>
    <li><b>Determine bin sizes</b></li>
    <li><b>Visualize the data</b></li>
</ul> 

**Let's get a sense of our data range.**  

    No point creating bins outside of that range. 

In [None]:
min(ages), max(ages)

In [None]:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 100, 105)
my_bins

**Q. Why do we need 105?**

**Now let's create a table containing the binned data.**

In [None]:
binned_data = top_movies.bin('Age', bins = my_bins)
#.bin(,) returns a NEW table, which we're calling binned_data here.
#The original table "top_movies" is unaffected.
binned_data

**NOTE:** Because we chose the bins to cover the range from the minimum to the maximum age, no entry falls outside them. In particular, no age exceeds 105.  

The 105 row marks the upper boundary of the last bin, which begins at (and includes) 100. Its own count is 0 because no bin starts at 105.

**Verify that the total number of movies hasn't changed.**

In [None]:
total_number_of_movies=sum(binned_data.column('Age count'))
total_number_of_movies

**Now let's make equal-sized bins.**

In [None]:
binned_data_uniform_bins=top_movies.bin('Age', bins = np.arange(0, 126, 25))
binned_data_uniform_bins

**Again, verify that we've captured a correct total headcount of the movies.**

In [None]:
sum(binned_data_uniform_bins.column('Age count'))

In [None]:
binned_data_incomplete_uniform_bins=top_movies.bin('Age', bins = np.arange(0, 60, 25))
binned_data_incomplete_uniform_bins

**Now if we run a sum check, we notice that some movies are unaccounted for.**  

**This is because of our choice of an incomplete range.**

In [None]:
sum(binned_data_incomplete_uniform_bins.column('Age count'))

In [None]:
top_movies.where('Age', 51)
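
The effect of an incomplete range can be reproduced on a small scale. The sketch below uses `np.histogram` on made-up ages, assuming the table's `bin` method follows the same counting conventions as NumPy; the array `sample_ages` is hypothetical and not taken from `top_movies`.

```python
import numpy as np

# Hypothetical ages (not the real top_movies data).
sample_ages = np.array([3, 12, 24, 30, 47, 51, 78, 99])

# Edges 0, 25, 50, 75, 100, 125 cover every age in the sample.
full_edges = np.arange(0, 126, 25)
full_counts, _ = np.histogram(sample_ages, bins=full_edges)

# np.arange(0, 60, 25) gives edges 0, 25, 50, so nothing above 50 is covered.
short_edges = np.arange(0, 60, 25)
short_counts, _ = np.histogram(sample_ages, bins=short_edges)

print(full_counts.sum())   # every age falls in some bin
print(short_counts.sum())  # ages above the last edge are silently dropped

# The ages that the short bins miss.
missed = sample_ages[sample_ages > short_edges[-1]]
print(missed)
```

The difference between the two sums is exactly the number of ages above the last edge, which is why a movie with an age of 51 does not show up in the incomplete binning.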

## Histograms ##  

Slides  


In [None]:
my_bins

In [None]:
binned_data

**Our First Histogram**

In [None]:
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')

**Hard to compare the bars!**  

**Problem caused by our selection of nonuniform bin sizes.**

**UNIFORM BINS**

In [None]:
# Let's try equally spaced bins instead.
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')

In [None]:
# Let's try not specifying any bins!
top_movies.hist('Age', unit='Year')

**The problem with the above is that we don't know where Python made the bins start or end.**
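
One way to see which edges an automatic choice would produce is to ask NumPy directly. This is only a sketch: it assumes `hist` falls back to ten equal-width bins between the smallest and largest value, as NumPy and Matplotlib do by default, and it uses made-up ages rather than the `top_movies` table.

```python
import numpy as np

# Hypothetical ages (not the real top_movies data).
sample_ages = np.array([3, 12, 24, 30, 47, 51, 78, 99])

# With an integer bin count, NumPy spaces the edges evenly
# from the minimum value to the maximum value.
default_edges = np.histogram_bin_edges(sample_ages, bins=10)
print(default_edges)
```

The edges start at the minimum and end at the maximum of the data, so they rarely land on round numbers. That is a good reason to pass `bins` explicitly.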

**Add the Percent column to the Table**

In [None]:
# Add a column containing what percent of movies are in each bin
binned_data = binned_data.with_column(
    'Percent', 100*binned_data.column('Age count')/total_number_of_movies)
#Recall that total_number_of_movies in this case is 200

In [None]:
binned_data

## Height ##

### Question: What is the height of the [40, 65) bin? ###  

**NOTE:** The square bracket means the interval *includes* that boundary, and the parenthesis means that the interval *excludes* that boundary. So, a number $n$ belongs to the bin $[40,65)$ if, and only if, $40\leq n < 65$.

**Step 1: Determine the percent of movies in the bin**

In [None]:
# Step 1: Calculate % of movies in the [40, 65) bin
percent = binned_data.where('bin', 40).column('Percent').item(0)

**Step 2: Determine the Bin Width.**

In [None]:
# Step 2: Calculate the width of the 40-65 bin
bin_width = 65 - 40

**Step 3: Calculate the Height of the rectangular bar using the formula**  

$$\textsf{Height}=\frac{\textsf{Percent in Bin}}{\textsf{Bin Width}}$$

**Recall:** The area of the bar equals the percent of entries in the bin. 

In [None]:
# Step 3: Area of rectangle = height * width
#         --> height = percent / bin_width
height = percent / bin_width
height
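
The three steps can be checked by hand on round numbers. In the sketch below the total of 200 movies comes from the notebook, but the count of 30 movies in the [40, 65) bin is a made-up value chosen only to keep the arithmetic simple.

```python
# The count is hypothetical; it is not read from binned_data.
count_in_bin = 30       # entries in the [40, 65) bin (assumed)
total_count = 200       # entries in the whole table

percent = 100 * count_in_bin / total_count   # percent of entries in the bin
bin_width = 65 - 40                          # width of the bin in years
height = percent / bin_width                 # percent per year

# Multiplying back recovers the area, i.e. the percent in the bin.
area = height * bin_width
print(height, area)
```

The height is measured in percent per year, which is why the vertical axis of the histogram is labeled that way when `unit = 'Year'` is given.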

### What are the heights of the rest of the bins?

In [None]:
# Get the bin lefts
bin_lefts = binned_data.take(np.arange(binned_data.num_rows-1))
bin_lefts

In [None]:
# Get the bin widths
bin_widths = np.diff(binned_data.column('bin'))
bin_lefts = bin_lefts.with_column('Width', bin_widths)

In [None]:
# Get the bin heights
bin_heights = bin_lefts.column('Percent') / bin_widths
bin_lefts = bin_lefts.with_column('Height', bin_heights)

In [None]:
bin_lefts
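
The same computation for all bins at once can be sketched with plain NumPy arrays. The edges match `my_bins`, but the counts are invented for illustration, and the variable names are not part of the notebook.

```python
import numpy as np

# Bin edges as in my_bins; the counts are hypothetical.
# The last count is 0 because the final edge only closes the last bin.
edges = np.array([0, 5, 10, 15, 25, 40, 65, 100, 105])
counts = np.array([20, 30, 10, 40, 40, 30, 28, 2, 0])

percents = 100 * counts / counts.sum()

# One width per bin: the difference between consecutive edges.
widths = np.diff(edges)

# Drop the trailing 0 row so percents and widths line up, then divide.
heights = percents[:-1] / widths

print(np.round(heights, 3))

# The bar areas (height * width) should add up to 100 percent.
print((heights * widths).sum())
```

Because `np.diff` returns one fewer value than the number of edges, the last row has to be dropped before dividing. That is the role `take(np.arange(binned_data.num_rows-1))` plays above.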

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')

In [None]:
actresses_income_2016 = Table.read_table('actresses.csv')
actresses_income_2016.show(actresses_income_2016.num_rows)