# Histograms

In [None]:
from datascience import *
from cs104 import *
import numpy as np

%matplotlib inline

## 0. Error practice

In [None]:
majors = Table().read_table("data/majors.csv")
majors.show(5)

In [None]:
# Select only division 3 majors 
div3 = majors.where("Division", are.equal_to(3)
div3

In [None]:
# Get the number of majors across both time periods
majors.select("2008-2012") + majors.select("2018-2021")

## 1. Overlaid graphs

Sometimes we want to see more than one plot on a single graph.

### Overlaid bar charts

In [None]:
div3 = majors.where("Division", are.equal_to(3)).drop("Division")
div3

In [None]:
# First graph for 2008-2012
div3.barh("Major", "2008-2012")
# Second graph from 2018-2021
div3.barh("Major", "2018-2021")

An overlaid graph puts the two graphs together to make comparison easier.

If you pass `barh` only the column of labels, the package we're using automatically overlays bars for each of the remaining columns.

In [None]:
div3.barh("Major")

### Overlaid line plots

In [None]:
temps_by_month = Table().read_table("data/temps_by_month_upernavik.csv")
temps_by_month.show(5)

As with bar charts, if you supply only one parameter, the `plot` method draws one line for each of the remaining columns.

In [None]:
temps_by_month.plot("Year")

Qualitatively, the plot above packs in too much information, which makes it not very useful for understanding trends.

In [None]:
temps_by_month.select("Year", "Feb", "Aug").plot("Year")

### Overlaid scatter plots 
* We want to plot points (the values of two numerical variables) from different groups on the same graph.
* A new approach: use a categorical variable to break the rows into groups of related points in the plot.
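
The grouping idea can be sketched in plain Python. This is only an illustration of splitting rows by a categorical variable, not how `scatter` is implemented; the rows and the helper name `split_by_group` are made up for the example.

```python
from collections import defaultdict

# Made-up rows for illustration only: (category, x value, y value).
rows = [
    ("species A", 9.1, 8.0),
    ("species B", 13.9, 9.4),
    ("species A", 9.4, 8.3),
    ("species B", 14.2, 9.1),
]

def split_by_group(rows):
    """Map each category label to the list of (x, y) points that carry it."""
    groups = defaultdict(list)
    for label, x, y in rows:
        groups[label].append((x, y))
    return dict(groups)

points_by_species = split_by_group(rows)
for label, points in points_by_species.items():
    # Each group would be drawn in its own color on the shared axes.
    print(label, points)
```

Each list of points would then be drawn in its own color on one set of axes, which is what makes the groups easy to compare.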

In [None]:
finch_1975 = Table().read_table("data/finch_beaks_1975.csv")
finch_1975.show(6)

In [None]:
finch_1975.scatter("Beak length, mm", "Beak depth, mm", group="species")

**Takeaway:** The overlaid scatter plot above helps us very quickly discern differences between groups. In this case, we can tell that the two finch species have evolved (via natural selection) to have different beak characteristics.

## 2. Histograms

A histogram shows us the **distribution of a numerical variable**.

### Midterm scores

In [None]:
scores = Table().read_table("data/scores_by_section.csv")
scores = scores.relabeled("Midterm", "Midterm 1")
scores

Let's subset to just section 4 for now. 

In [None]:
scores_sec4 = scores.where("Section", 4)
scores_sec4

A **histogram** can give us a sense of the data as a whole: Which values are common? Which are uncommon? How much variability is there? What are the extremes?

In [None]:
scores_sec4.hist("Midterm 1")

### Class survey: Distance from home

#### Load Data

In [None]:
survey = Table().read_table("data/prelab01-survey-fall2025.csv")
survey.show(5)

In [None]:
survey.labels

In [None]:
distance_home = survey.column('Distance Home (in miles)')
distance_home

Some basic info about the distances:

In [None]:
len(distance_home)

In [None]:
np.mean(distance_home)

In [None]:
max(distance_home)

Here is a sneak preview of a histogram for those distances:

In [None]:
survey.hist('Distance Home (in miles)', bins=np.arange(0, 12000, 2000))

## 3. Binning

In [None]:
survey.show(3)

We have a method in our package that can make bins automatically: `table.bin`.

In [None]:
our_range = np.arange(0,12000,2000)
our_range

In [None]:
binned_distance_home = survey.bin('Distance Home (in miles)', bins=our_range)
binned_distance_home
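
To make the counting rule concrete, here is a minimal plain-Python sketch of binning. It assumes the convention used by NumPy's `np.histogram`: each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both. Whether `table.bin` follows exactly this rule is an assumption here, and the distances are made up.

```python
def count_in_bins(values, edges):
    """Count values per bin; edges must be sorted in increasing order."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            # Half-open [lo, hi), except the last bin, which also includes hi.
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts

edges = list(range(0, 12000, 2000))  # same edges as np.arange(0, 12000, 2000)
distances = [5, 150, 1999, 2000, 3500, 10000, 12500]  # made-up values
bin_counts = count_in_bins(distances, edges)
print(bin_counts)
```

Note that the value 12500 lands in no bin at all. Under this convention, values outside the edges are silently dropped, so it is worth comparing `max(distance_home)` against the last edge.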

Let's add a column that is the percentage in each bin. 

In [None]:
percent = binned_distance_home.column('Distance Home (in miles) count') / survey.num_rows * 100
percent_table = binned_distance_home.with_columns('Percent', percent)
percent_table

### Histogram of distances from home

In [None]:
survey.hist('Distance Home (in miles)', bins= np.arange(0,12000,2000))

**Think-pair-share:** Calculate the area of each bar in the histogram (estimating the height). Then show that the areas of all the bars sum to 100.

In [None]:
#Possible approximations
widths = make_array(2000, 2000, 2000, 2000, 2000)
heights = make_array(0.038, 0.007, 0.001, 0.002, 0.002)
areas = widths*heights
areas

In [None]:
sum(areas)

Let's check our estimates. 

In [None]:
percent_table

Cool! We're pretty close to the actual areas! Great!

Let's work backwards now and see how our `hist()` method calculated the y-axis. 

1. Let's look at the first bar/bin. 

In [None]:
bin0 = percent_table.take(0)
bin0

Recall that the height of a bar equals `(percent of entries in the bin) / (width of the bin)`.

In [None]:
percent_in_bin0 =  bin0.column('Percent').item(0)
percent_in_bin0

In [None]:
height0 = percent_in_bin0/2000
height0

Fantastic! That's what we see on the y-axis on the histogram. 
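
The same calculation can be repeated for every bin. The sketch below is a plain-Python illustration, not the `hist()` implementation; the counts are made up, chosen so that the resulting heights match the estimates in the think-pair-share above, and `bar_heights` is a hypothetical helper name.

```python
def bar_heights(counts, edges):
    """Return the height of each bar, in percent per unit of the x-axis."""
    total = sum(counts)
    heights = []
    for i, count in enumerate(counts):
        percent = 100 * count / total      # percent of entries in this bin
        width = edges[i + 1] - edges[i]    # width of this bin
        heights.append(percent / width)
    return heights

edges = [0, 2000, 4000, 6000, 8000, 10000]
counts = [38, 7, 1, 2, 2]  # made-up counts for illustration
heights = bar_heights(counts, edges)

# Area of a bar = height * width, which recovers the percent in the bin.
areas = [h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights)]
print(heights)
print(sum(areas))
```

Because each area is a percent, the areas add up to 100 (up to floating-point rounding) whenever every entry falls inside some bin.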

### More histogram practice

In [None]:
survey.show(5)

In [None]:
survey.labels

In [None]:
plot = survey.hist('Height (in inches)', bins=np.arange(58,80,2))
plot.set_ylim(0,0.15)
plot.set_title("Students in CS 104")

### Think-pair-share: Approximating Histogram of heights
1. Approximate the percentage of the class that has height greater than or equal to 70 inches but less than 72 inches. 
2. How many students is this? We know 58 students responded to our survey. 

In [None]:
survey.num_rows

In [None]:
answer_q1 = 7 * 2
answer_q1

In [None]:
# Use the actual number of respondents rather than a hard-coded count
answer_q2 = answer_q1 / 100 * survey.num_rows
answer_q2

We can't have a fraction of a student, so our approximation was probably slightly off. The real count is 8.

In [None]:
survey.where('Height (in inches)', are.between(70, 72)).num_rows
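
The two steps of that estimate (bar area gives a percent, percent gives a count) can be wrapped in one small function. This is a sketch: `estimated_count` is a hypothetical helper, the height of 7 is the by-eye reading used above, and 58 is the respondent count quoted in the question.

```python
def estimated_count(height, bin_width, num_rows):
    """Estimate how many rows fall in a bin from the height of its bar."""
    percent = height * bin_width       # area of the bar = percent in the bin
    return percent / 100 * num_rows    # convert the percent to a count

# Height read off the chart by eye, bin width of 2 inches, 58 respondents.
estimate = estimated_count(7, 2, 58)
print(round(estimate))
```

Rounding the estimate to the nearest whole student is what makes it comparable to the exact count from `where`.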

### Overlaid histograms

#### Scores 
Let's circle back to our midterm data.

In [None]:
scores

In [None]:
scores_sec4 = scores.where("Section", 4)
scores_sec4.hist("Midterm 1")

As with `scatter`, we can create overlaid histograms using the `group=` named argument.

In [None]:
scores.hist("Midterm 1", group="Section", bins=10)

#### Finches
One more overlay, for the two finch species.

In [None]:
finch_1975.show(10)

In [None]:
finch_1975.hist("Beak length, mm", bins=20)

In [None]:
plot = finch_1975.hist("Beak length, mm", group="species")
plot.set_title("Finches, 1975")

Try different bins to see differences in granularity.

In [None]:
def hist_with_bins(bins):
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=bins,  title=str(bins) + " bins")

interact(hist_with_bins, bins=Slider(1,20))
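
To see why the number of bins changes the granularity, it can help to look at the bin edges themselves. The sketch below assumes that an integer `bins` value means equal-width bins spanning the data's minimum to maximum, which is NumPy's convention; whether this package's `hist` chooses edges exactly this way is an assumption, and the beak lengths are made up.

```python
def equal_width_edges(values, num_bins):
    """Return num_bins + 1 evenly spaced edges from min(values) to max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + i * width for i in range(num_bins + 1)]

beak_lengths = [8.0, 9.5, 10.0, 12.5, 14.0]  # made-up values for illustration
for n in (2, 4):
    # More bins means narrower bins over the same range.
    print(n, equal_width_edges(beak_lengths, n))
```

More bins means narrower bins over the same range, so each bar summarizes fewer values and the shape looks more detailed but also noisier.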

A few different histograms side-by-side:

In [None]:
with Figure(2,2, figsize=(4,3)):
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=6,  title="6 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=10, title="10 bins")    
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=15, title="15 bins")
    finch_1975.hist("Beak length, mm", group="species", legend=False, bins=20, title="20 bins")
