In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

## Bar Charts

In [None]:
top = Table.read_table('top_movies.csv')
top.set_format([2, 3], NumberFormatter)

Use the `group` method to count how many times each value of a categorical variable appears in a column

In [None]:
top.group('Studio').sort("count", descending=True)
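
As a rough picture of what `group` computes, the sketch below counts repeated values with Python's `collections.Counter`. The studio names in `studios` are made up for illustration and are not taken from `top_movies.csv`; the real `group` method returns a `Table` rather than a list of pairs.

```python
from collections import Counter

# Hypothetical column of categorical values (not the real data).
studios = ['Fox', 'Disney', 'Fox', 'MGM', 'Disney', 'Fox']

# Count how many times each distinct value appears.
counts = Counter(studios)

# Sort from most to least frequent, like sort("count", descending=True).
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(ranked)
```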

How many of the top movies did each studio produce?

In [None]:
top.group('Studio').sort("count", descending=True).barh('Studio', 'count')

How old is each of these movies?

In [None]:
aged = top.with_column("Age", 2018-top.column('Year'))
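
The subtraction above works because `top.column('Year')` is a NumPy array, so `2018 - array` is applied to every element at once. A minimal sketch with made-up release years:

```python
import numpy as np

# Hypothetical release years (not the real data).
years = np.array([1939, 1977, 1997, 2015])

# Subtracting an array from a number subtracts each element in turn.
ages = 2018 - years
print(ages)
```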

How many movies of each age are there?

In [None]:
aged.group('Age').barh('Age', 'count')

## Histograms

In [None]:
aged.group('Age')

By default, the `bin` method groups numbers into 10 equally spaced bins

In [None]:
aged.bin('Age').show()
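
To see what ten equally spaced bins look like, here is a sketch that uses `np.histogram` on made-up ages. It is meant to approximate the default behavior described above, under the assumption that the bins span the smallest to the largest value; it is not the `bin` method itself.

```python
import numpy as np

# Hypothetical ages (not the real data).
ages = np.array([1, 3, 3, 8, 12, 20, 21, 40, 41, 79, 80])

# bins=10 asks for ten equal-width bins from min(ages) to max(ages).
counts, edges = np.histogram(ages, bins=10)

print(edges)   # 11 edges mark the boundaries of 10 bins
print(counts)  # one count per bin
```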

You can ask the `bin` method to use any bins you like

In [None]:
aged.bin('Age', bins=[0, 20, 40, 60, 80, 100])

How can I make the array `[0, 20, 40, 60, 80, 100]` with less typing?

In [None]:
np.arange(0, 101, 20)
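
`np.arange(start, stop, step)` excludes `stop`, which is why the call above uses 101 rather than 100. A quick check of that behavior:

```python
import numpy as np

# The stop value is excluded, so stopping at 100 drops the last edge.
without_last = np.arange(0, 100, 20)
with_last = np.arange(0, 101, 20)

print(without_last)  # ends at 80
print(with_last)     # ends at 100
```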

In [None]:
aged.bin('Age', bins=np.arange(0, 101, 20)).barh("bin")

In [None]:
aged.hist('Age', normed=False)

In [None]:
aged.hist('Age', bins=np.arange(0, 101, 20), normed=False)

What's going on in this picture?
* a) heights represent counts in each bin
* b) areas represent counts in each bin
* c) both heights and areas represent counts in each bin

## Uneven Bins

There are many new movies and few old ones, so let's use narrower bins for the recent ages and wider bins for the older ages to see both more clearly

In [None]:
aged.hist('Age', bins=[0, 10, 20, 40, 60, 100], normed=False)

What's going on in this picture?
* a) heights represent counts in each bin
* b) areas represent counts in each bin
* c) both heights and areas represent counts in each bin

## Density

In [None]:
aged.hist('Age', bins=np.arange(0, 101, 20), unit='year')

In [None]:
1.8 * 20  # bar height (percent per year) x bin width (20 years) = percent in that bin

What's going on in this picture?
* a) heights represent counts in each bin
* b) areas represent counts in each bin
* c) both heights and areas represent counts in each bin

In [None]:
aged.hist('Age', bins=np.arange(0, 101, 5), unit='year')

In [None]:
aged.hist('Age', bins=[0, 5, 10, 15, 20, 25, 30, 40, 60, 80, 100], unit='year')

What's going on in this picture?
* a) heights represent counts in each bin
* b) areas represent counts in each bin
* c) both heights and areas represent counts in each bin

In [None]:
aged.hist('Age', bins=[0, 5, 10, 15, 20, 25, 30, 40, 60, 80, 100], normed=False) 
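
One way to check how a density-scale histogram behaves with uneven bins is to compute the bar heights by hand. The sketch below uses made-up ages and `np.histogram` with `density=True`; it assumes, as the percent-per-year axis suggests, that each height is the percent of values in the bin divided by the bin width.

```python
import numpy as np

# Hypothetical ages (not the real data).
ages = np.array([1, 2, 2, 4, 7, 9, 12, 18, 25, 33, 47, 62, 79, 95])
bins = [0, 5, 10, 15, 20, 25, 30, 40, 60, 80, 100]

# density=True returns proportion per unit; multiply by 100 for percent per year.
density, edges = np.histogram(ages, bins=bins, density=True)
heights = 100 * density
widths = np.diff(edges)

# Area of each bar is height times width; the areas add up to 100 percent.
areas = heights * widths
print(areas.sum())
```

With counts as heights, a wide bin looks tall simply because it covers more years; on the density scale the width is divided out, so bars of different widths can be compared directly.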

### Discussion question

In [None]:
actress = Table.read_table('actress.csv')

What's the height of each bar in these two histograms?
```
actress.hist(1, bins=[0,15,25,85])
actress.hist(1, bins=[0,15,35,85])
```

In [None]:
9/20 * 100 / 15

In [None]:
8/20 * 100 / 10

In [None]:
3/20 * 100 / 60
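
The three cells above follow the same pattern: the fraction of values in the bin, converted to a percent, divided by the bin width. The sketch below wraps that pattern in a hypothetical helper, `bar_heights`, and applies it to the counts 9, 8, and 3 (out of 20) used in those cells with the bins `[0, 15, 25, 85]`.

```python
def bar_heights(counts, bins):
    """Return percent-per-unit heights for the given bin counts and bin edges."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        width = bins[i + 1] - bins[i]
        heights.append(100 * count / total / width)
    return heights

# Counts taken from the cells above; bar_heights itself is a made-up helper.
heights = bar_heights([9, 8, 3], [0, 15, 25, 85])
print(heights)
```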

In [None]:
actress.hist(1, bins=[0,15,25,85], unit='million $')

In [None]:
actress.hist(1, bins=[0,15,35,85], unit='million $')

## Overlaid Graphs

In [None]:
heights = Table.read_table('galton.csv')
heights = (
    heights.where('gender', 'female')
           .select('father', 'mother', 'childHeight')
           .relabeled(2, 'daughter')
)
heights

In [None]:
heights.hist('daughter', unit='inch')

In [None]:
heights.hist('mother', unit='inch')

In [None]:
heights.hist('daughter', 'mother', unit='inch')
_ = plots.xlabel('Height (inches)')

In [None]:
heights.hist(unit='inch')
_ = plots.xlabel('Height (inches)')

In [None]:
heights.hist(bins=np.arange(55, 81, 1), unit='inch')
_ = plots.xlabel('Height (inches)')

In [None]:
heights.scatter('mother', 'daughter')

In [None]:
heights.scatter('daughter')