## Lecture 9 ##  

## Histograms (Continued) ##

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

#The following allows porting images into a Markdown window
from IPython.display import Image

In [None]:
top_movies = Table.read_table('top_movies_2017.csv')
top_movies

In [None]:
top_movies = top_movies.with_column('Millions', np.round(top_movies.column('Gross (Adjusted)')/1000000,3))
top_movies

## Numerical Distribution ##

In [None]:
#Create an array, called 'ages,' containing the respective ages of the movies
ages = 2021 - top_movies.column('Year')  

#Add the ages array as a new column to the top_movies table
top_movies = top_movies.with_column('Age', ages)

#display the table
top_movies

## Binning ##

SLIDE: Binning  

**Before we**
<ul>
    <li><b>Determine bin sizes</b></li>
    <li><b>Visualize the data</b></li>
</ul> 

**Let's get a sense of our data range.**  

    There is no point in creating bins outside that range. 

In [None]:
min(ages), max(ages)
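
The range check can also be written as a small helper. This is only a sketch: `bins_cover` and `sample_ages` are hypothetical names and values made up for illustration, not part of the lecture data.

```python
def bins_cover(values, bin_edges):
    """Return True if every value lies between the first and last bin edge."""
    return bin_edges[0] <= min(values) and max(values) <= bin_edges[-1]

# Hypothetical ages, NOT the real top_movies data
sample_ages = [4, 12, 37, 58, 100]

full_edges = [0, 5, 10, 15, 25, 40, 65, 100, 105]
short_edges = [0, 25, 50]

print(bins_cover(sample_ages, full_edges))   # True: 4 >= 0 and 100 <= 105
print(bins_cover(sample_ages, short_edges))  # False: 58 and 100 lie above 50
```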

In [None]:
my_bins = make_array(0, 5, 10, 15, 25, 40, 65, 100, 105)
my_bins

**Q. Why do we need 105?**

**Now let's create a table containing the binned data.**

## New Function Alert: bin ##

In [None]:
binned_data = top_movies.bin('Age', bins = my_bins)
# .bin(...) returns a NEW table, which we're calling binned_data here.
# The original table "top_movies" is unaffected.
binned_data

**NOTE:** Given how we created the bins using the min and max of the ages, we're guaranteed that there's no entry above 105.  

The 105 entry also shows the strict upper boundary (excluded upper boundary) of the cell that began with, and inclusive of, 100.
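
One way to see why a final edge is needed is to count edges against bins. The sketch below uses NumPy's `np.histogram` as a stand-in (that `bin` behaves the same way is an assumption here) with a handful of hypothetical ages. Nine edges give eight bins, and a value of exactly 100 lands in the bin that starts at 100, which only exists because 105 closes it.

```python
import numpy as np

edges = np.array([0, 5, 10, 15, 25, 40, 65, 100, 105])

# Hypothetical ages chosen to sit on or near bin edges
sample_ages = np.array([0, 4, 5, 100, 104])

counts, _ = np.histogram(sample_ages, bins=edges)

print(len(edges), len(counts))  # 9 edges, 8 bins
print(counts)                   # 5 falls in the bin starting at 5; 100 in the bin starting at 100
```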

**Verify that the total number of movies hasn't changed.**

In [None]:
num_movies = sum(binned_data.column('Age count'))
num_movies

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')

In [None]:
binned_data = binned_data.with_column(
    'Percent', binned_data.column('Age count')/num_movies * 100)
binned_data

**Now let's make equal-sized bins.**

In [None]:
binned_data_uniform_bins=top_movies.bin('Age', bins = np.arange(0, 126, 25))
binned_data_uniform_bins

**Again, verify that we've captured the correct total count of movies.**

In [None]:
sum(binned_data_uniform_bins.column('Age count'))

In [None]:
binned_data_incomplete_uniform_bins=top_movies.bin('Age', bins = np.arange(0, 60, 25))
binned_data_incomplete_uniform_bins

**Now if we run a sum check, we notice that some movies are unaccounted for.**  

**This is because of our choice of an incomplete range.**

In [None]:
sum(binned_data_incomplete_uniform_bins.column('Age count'))

In [None]:
top_movies.where('Age', 51)
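
The effect of an incomplete range can be reproduced on a tiny example. This is a sketch with hypothetical ages, using `np.histogram` in place of `bin`: `np.arange(0, 60, 25)` stops at 50, so anything older than 50 is silently dropped from the counts.

```python
import numpy as np

# Hypothetical ages, NOT the real top_movies data
sample_ages = np.array([3, 20, 30, 51, 80, 100])

complete_edges = np.arange(0, 126, 25)   # 0, 25, 50, 75, 100, 125
incomplete_edges = np.arange(0, 60, 25)  # 0, 25, 50

complete_counts, _ = np.histogram(sample_ages, bins=complete_edges)
incomplete_counts, _ = np.histogram(sample_ages, bins=incomplete_edges)

print(complete_counts.sum())    # 6: every age is counted
print(incomplete_counts.sum())  # 3: the ages 51, 80 and 100 are missing
```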

## Histograms ##  

Slides  


In [None]:
my_bins

In [None]:
binned_data

**Our First Histogram**

In [None]:
# Let's make our first histogram!
top_movies.hist('Age', bins = my_bins, unit = 'Year')

**Hard to compare the bars!**  

**Problem caused by our selection of nonuniform bin sizes.**

**UNIFORM BINS**

In [None]:
# Let's try equally spaced bins instead.
top_movies.hist('Age', bins = np.arange(0, 110, 10), unit = 'Year')

In [None]:
# Let's try not specifying any bins!
top_movies.hist('Age', unit='Year')

**The problem with the above is that we don't know where Python made the bins start or end.**
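
If the bin edges matter, one option is to compute them explicitly. The sketch below assumes the default binning resembles NumPy's `np.histogram`, which by default uses 10 equal-width bins from the smallest to the largest value; whether `hist` does exactly this is an assumption, and the ages are hypothetical.

```python
import numpy as np

# Hypothetical ages, NOT the real top_movies data
sample_ages = np.array([2, 7, 15, 33, 48, 61, 90, 102])

# With no bins argument, np.histogram uses 10 equal-width bins
counts, edges = np.histogram(sample_ages)

print(len(edges))           # 11 edges for 10 bins
print(edges[0], edges[-1])  # the edges start at the min and end at the max
```

Passing the bins explicitly, as in the earlier cells, avoids having to guess.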

**Add the Percent column to the Table**

In [None]:
# Add a column containing what percent of movies are in each bin
total_number_of_movies = sum(binned_data.column('Age count'))
binned_data = binned_data.with_column(
    'Percent', 100*binned_data.column('Age count')/total_number_of_movies)
#Recall that total_number_of_movies in this case is 200

In [None]:
binned_data

## Heights of the Histogram Bars ##

### Question: What is the height of the [40, 65) bin? ###  

**NOTE:** The square bracket means the interval *includes* that boundary, and the parenthesis means that the interval *excludes* that boundary. So, a number $n$ belongs to the bin $[40,65)$ if, and only if, $$40\leq n < 65.$$
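
The same membership rule can be written as a one-line check. The helper name `in_bin` is hypothetical and used only for illustration.

```python
def in_bin(n, left, right):
    """True when n lies in the half-open interval [left, right)."""
    return left <= n < right

print(in_bin(40, 40, 65))  # True: the left boundary is included
print(in_bin(64, 40, 65))  # True
print(in_bin(65, 40, 65))  # False: the right boundary is excluded
```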

**Step 1: Determine the percent of movies in the bin.**

In [None]:
# Step 1: Calculate % of movies in the [40, 65) bin
percent = binned_data.where('bin', 40).column('Percent').item(0)
percent

**Step 2: Determine the Bin Width.**

In [None]:
# Step 2: Calculate the width of the 40-65 bin
bin_width = 65 - 40
bin_width

**Step 3: Calculate the Height of the rectangular bar using the formula**  

$$\textsf{Height}=\frac{\textsf{Percent in Bin}}{\textsf{Bin Width}}$$

**Recall:** The area of each bar represents the percent of entries in its bin. 

In [None]:
# Step 3: Area of rectangle = height * width
#         --> height = percent / bin_width
height = percent / bin_width
height
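
To see why the width matters, here is a sketch with made-up percentages (the helper `bar_height` and the 20% figures are hypothetical). Two bins holding the same percent of the data get very different heights when their widths differ.

```python
def bar_height(percent, left, right):
    """Height of a histogram bar: percent in the bin divided by the bin width."""
    return percent / (right - left)

# Hypothetical: two bins that each hold 20% of the data
narrow = bar_height(20, 0, 5)    # width 5  -> 4.0 percent per year
wide = bar_height(20, 40, 65)    # width 25 -> 0.8 percent per year

print(narrow, wide)
# In both cases height * width recovers the 20 percent
print(narrow * 5, wide * 25)
```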

### Left endpoints of all the bins ###

In [None]:
# Get the bin lefts
bin_lefts = binned_data.take(np.arange(binned_data.num_rows-1))
bin_lefts

### Widths of all the bins ###

In [None]:
# Get the bin widths
bin_widths = np.diff(binned_data.column('bin'))
bin_lefts = bin_lefts.with_column('Width', bin_widths)
bin_lefts

### Heights of all the bins ###

In [None]:
# Get the bin heights
bin_heights = bin_lefts.column('Percent') / bin_widths
bin_lefts = bin_lefts.with_column('Height', bin_heights)
bin_lefts

In [None]:
top_movies.hist('Age', bins = my_bins, unit = 'Year')
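
As a cross-check on the percent-over-width calculation, the sketch below recomputes the heights with plain NumPy on hypothetical ages. `np.histogram(..., density=True)` returns proportion per unit, so multiplying by 100 should match percent divided by width, and the bar areas should add up to 100.

```python
import numpy as np

edges = np.array([0, 5, 10, 15, 25, 40, 65, 100, 105])

# Hypothetical ages, NOT the real top_movies data
sample_ages = np.array([1, 3, 8, 12, 20, 30, 45, 50, 70, 101])

counts, _ = np.histogram(sample_ages, bins=edges)
percents = 100 * counts / counts.sum()
widths = np.diff(edges)           # width of each bin
heights = percents / widths       # percent per year

density, _ = np.histogram(sample_ages, bins=edges, density=True)

print(np.allclose(heights, 100 * density))  # True
print(np.sum(heights * widths))             # total area is 100 (up to rounding)
```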

## Visualization Example: Welcome Survey ##

In [None]:
survey = Table.read_table('welcome_survey_v1.csv')
survey

**Number of Participants**

In [None]:
number_of_participants=survey.num_rows
number_of_participants

### Categorical Data: Bar Charts

In [None]:
handedness = survey.group('Handedness')
handedness

In [None]:
handedness.barh('Handedness')
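
The counts behind a bar chart like this are simply the number of times each category appears. A minimal sketch with `collections.Counter` on made-up responses (not the survey data):

```python
from collections import Counter

# Hypothetical responses, NOT the real survey data
responses = ['Right', 'Left', 'Right', 'Right', 'Both', 'Left']

counts = Counter(responses)

# Sorted here only for readability
for category in sorted(counts):
    print(category, counts[category])
```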

### Numerical Data: Histograms

In [None]:
survey.hist('Extroversion')

In [None]:
survey.hist('Hours of sleep')

In [None]:
min_hours_sleep=min(survey.column('Hours of sleep'))
min_hours_sleep

In [None]:
max_hours_sleep=max(survey.column('Hours of sleep'))
max_hours_sleep

In [None]:
sleep_bins = np.arange(min_hours_sleep,max_hours_sleep+1,0.5)
sleep_bins

In [None]:
survey.hist('Hours of sleep', bins=sleep_bins)

In [None]:
sleep_relative_to_eight_hours=survey.bin('Hours of sleep', bins=make_array(0,8,max_hours_sleep+1))
sleep_relative_to_eight_hours

**Verify that we've captured the correct number of participants**

In [None]:
sum(sleep_relative_to_eight_hours.column('Hours of sleep count'))

**Let's compute the percentage of participants who sleep at least 8 hours.**  

First, what does survey.bin return?

In [None]:
type(survey.bin('Hours of sleep', bins=make_array(0,8,max_hours_sleep)))

**How do we grab the value in the second column (column index 1), second row (row index 1)?**

In [None]:
number_sleep_at_least_eight=survey.bin('Hours of sleep', bins=make_array(0,8,max_hours_sleep)).column(1).item(1)
number_sleep_at_least_eight

**Alternatively,**

In [None]:
number_sleep_at_least_eight=survey.bin('Hours of sleep', bins=make_array(0,8,max_hours_sleep)).column('Hours of sleep count').item(1)
number_sleep_at_least_eight

**Yet another alternative:**

In [None]:
number_sleep_at_least_eight=survey.bin('Hours of sleep', bins=make_array(0,8,max_hours_sleep)).where('bin',8).column(1).item(0)
number_sleep_at_least_eight

### Percentage of Participants Who Sleep At Least Eight Hours ###

In [None]:
percent_sleep_at_least_eight=number_sleep_at_least_eight/number_of_participants * 100

#Round to one decimal digit
np.round(percent_sleep_at_least_eight,1)
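
The same percentage can be cross-checked without binning by comparing each value to 8 directly. This sketch uses hypothetical sleep hours and plain NumPy; it also repeats the two-bin count so the two approaches can be compared.

```python
import numpy as np

# Hypothetical hours of sleep, NOT the real survey data
sample_hours = np.array([6, 7, 7.5, 8, 8, 9, 5, 8.5])

# Direct comparison: True where a participant sleeps at least 8 hours
at_least_eight = sample_hours >= 8
percent_direct = np.count_nonzero(at_least_eight) / len(sample_hours) * 100

# Two bins: [0, 8) and 8 up to one past the max
two_bins = np.array([0, 8, sample_hours.max() + 1])
counts, _ = np.histogram(sample_hours, bins=two_bins)
percent_binned = counts[1] / counts.sum() * 100

print(np.round(percent_direct, 1), np.round(percent_binned, 1))  # 50.0 50.0
```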

In [None]:
survey.where(
    'Pant leg',are.containing('Right')).hist('Hours of sleep', bins=sleep_bins)
plots.title('Right Leg First');

survey.where(
    'Pant leg',are.containing('Left')).hist('Hours of sleep', bins=sleep_bins)
plots.title('Left Leg First');

In [None]:
survey.hist('Hours of sleep', bins=sleep_bins)

SLIDE: Discussion Actresses

In [None]:
actresses_income_2016 = Table.read_table('actresses.csv')
actresses_income_2016.show(actresses_income_2016.num_rows)

## Functions ##

In [None]:
def triple(x):
    return 3 * x

In [None]:
triple(3)

In [None]:
num = 4

In [None]:
triple(num)

In [None]:
triple(num * 5)

### Type Agnostic

In [None]:
triple('ha')

In [None]:
triple(np.arange(4))
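
"Type agnostic" means `triple` runs on anything that supports `3 * x`, but the meaning of the result depends on the type. A short sketch contrasting a Python list with a NumPy array:

```python
import numpy as np

def triple(x):
    return 3 * x

print(triple(3))                 # 9: ordinary multiplication
print(triple('ha'))              # 'hahaha': string repetition
print(triple([1, 2]))            # [1, 2, 1, 2, 1, 2]: list repetition
print(triple(np.array([1, 2])))  # [3 6]: elementwise multiplication
```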

### Discussion Question

In [None]:
def percent_of_total(s):
    return np.round(s / sum(s) * 100, 2)

In [None]:
percent_of_total(make_array(1,2,3,4))

In [None]:
percent_of_total(make_array(1, 213, 38))
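
One thing to notice about `percent_of_total`: because each entry is rounded separately, the rounded percentages need not add up to exactly 100. The sketch below uses `np.array` in place of `make_array` so that it runs with NumPy alone.

```python
import numpy as np

def percent_of_total(s):
    return np.round(s / sum(s) * 100, 2)

even = percent_of_total(np.array([1, 2, 3, 4]))
thirds = percent_of_total(np.array([1, 1, 1]))

print(even)         # 10, 20, 30 and 40 percent
print(thirds)       # each entry rounds to 33.33
print(sum(thirds))  # close to 99.99, not 100
```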

### Multiple Arguments

$ h^2 = x^2 + y^2 \hspace{20 pt} \Longrightarrow \hspace{20 pt} h = \sqrt{ x^2 + y^2 } $

In [None]:
def hypotenuse(x,y):
    hypot_squared = (x ** 2 + y ** 2)
    return hypot_squared ** 0.5

In [None]:
hypotenuse(9, 12)

In [None]:
hypotenuse(2, 2)
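
As a sanity check, the results can be compared with `math.hypot` from the Python standard library, which computes the same quantity.

```python
import math

def hypotenuse(x, y):
    hypot_squared = x ** 2 + y ** 2
    return hypot_squared ** 0.5

print(hypotenuse(9, 12), math.hypot(9, 12))  # both give 15.0
print(math.isclose(hypotenuse(2, 2), math.hypot(2, 2)))  # True
```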

## Apply ##

In [None]:
ages = Table().with_columns(
    'Person', make_array('Jim', 'Pam', 'Michael', 'Creed'),
    'Birth Year', make_array(1985, 1988, 1967, 1904)
)
ages

In [None]:
def cap_at_1980(x):
    return min(x, 1980)

In [None]:
cap_at_1980(1975)

In [None]:
cap_at_1980(1991)

In [None]:
ages.apply(cap_at_1980, 'Birth Year')

In [None]:
def name_and_age(name, year):
    age = 2019 - year
    return name + ' is ' + str(age)

In [None]:
ages.apply(name_and_age, 'Person', 'Birth Year')
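
Conceptually, `apply` calls the function once per row, passing in the values from the named columns, and collects the results. The sketch below imitates that with plain lists and a loop; it is an illustration of the idea, not how `apply` is actually implemented.

```python
import numpy as np

def cap_at_1980(x):
    return min(x, 1980)

def name_and_age(name, year):
    age = 2019 - year
    return name + ' is ' + str(age)

people = ['Jim', 'Pam', 'Michael', 'Creed']
birth_years = [1985, 1988, 1967, 1904]

# One-argument function: called once per value in a single column
capped = np.array([cap_at_1980(y) for y in birth_years])

# Two-argument function: called once per row, with one value from each column
described = np.array([name_and_age(n, y) for n, y in zip(people, birth_years)])

print(capped)
print(described)
```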