# Do Your Statistics Homework With Python

Obviously, doing _all_ of your homework with Python is a bad idea: the code you write at home will not help you on an exam taken on paper. That said, you can benefit from doing _some_ of your homework with Python. Writing code requires you to analyze the problem from a new perspective, which deepens your understanding.

What you **should** do is work a few problems out by hand first, so you can get used to the steps needed to solve the problem. Next, you should check your work (odd problems almost always have solutions in the back of the textbook). Then, you should write code that will give you the same answer as you got when you did the problem by hand. Finally, you should generalize that code into a function that will allow you to solve similar problems.

What you **should not** do is complete your homework with code you've copied and pasted from somewhere else.
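
As a rough illustration of that workflow, here is a minimal sketch using a made-up survey (the numbers and the function name `percent_of_total` are hypothetical, not from the textbook). It checks a by-hand answer against code, then generalizes the code into a function.

```python
import math

# Hypothetical survey: 12 of 40 students say statistics is their favorite class.
# Worked by hand: 12 / 40 = 0.30, so 30%.
by_hand_answer = 30.0

# Reproduce the by-hand answer in code and confirm that the two agree.
code_answer = 100 * 12 / 40
assert math.isclose(by_hand_answer, code_answer)

# Generalize the calculation into a function for similar problems.
def percent_of_total(count, total):
    """Returns count as a percent of total."""
    return 100 * count / total

print(percent_of_total(12, 40))
print(percent_of_total(7, 28))
```

If the assertion fails, either the by-hand work or the code has a mistake, and tracking down which one is a useful exercise in itself.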

Let's get started! I'm going to be following _The Practice of Statistics_ (5e), which is the textbook I used to teach AP Statistics.

# Chapter 1: Exploring Data

## Section 1: Categorical Data

### Analyzing the distribution of a single categorical variable

This is the global distribution of rollercoasters found in the <a href="https://rcdb.com/census.htm">RollerCoaster DataBase</a>:

| Continent     | Number of Rollercoasters |
|---------------|--------------------------|
| Africa        | 90                       |
| Asia          | 2,649                    |
| Australia     | 27                       |
| Europe        | 1,329                    |
| North America | 905                      |
| South America | 175                      |

1. According to this table, how many roller coasters are there in the world in total?

2. (a) Create a relative frequency table of these data. Give your answers in percents, rounded to one decimal place. (b) Do your percents add up to 100? Why or why not?

3. What percent of roller coasters are in either North America or South America?

4. Construct a bar chart and a pie chart using the relative frequencies you found in part 2. Don’t forget to include labels and a descriptive title!

In [None]:
# create a dictionary, where the keys are the names of the continents (categories) and the values
# are the number of rollercoasters in that continent (frequencies)

coasters = {'Africa':90,
            'Asia':2649,
            'Australia':27,
            'Europe':1329,
            'North America':905,
            'South America':175}

In [None]:
# 1. According to this table, how many rollercoasters are there in the world in total?

# We'll solve this by summing the values in the coasters dictionary. Let's wrap that in a function,
# in case we need it later.
def total_values(dictionary):
    """
    Computes the sum of the values in the dictionary.
    
    Parameters:
    dictionary (dict): keys are categories, values are counts.
    
    Returns:
    (int, float) sum of the values in the dictionary
    """
    return sum(dictionary.values())

print('There are', total_values(coasters), 'rollercoasters in the world.')

In [None]:
# 2. Create a relative frequency table of these data. Give your answers in percents.

# To do this, we'll need to complete the following steps:
# 1. Compute the total number of rollercoasters in the world (luckily, this was problem #1).
# 2. Divide the value in each category by the total number of rollercoasters in the world.
# 3. Multiply each value by 100.
# 4. Format the output in a nice table.
def relative_frequencies(dictionary, percents = False, precision = 1):
    """
    Computes the relative frequencies of values in a dictionary. 
    If percents is set to True, values are multiplied by 100.
    
    Parameters:
    dictionary (dict): keys are categories, values are counts
    
    percents (bool): computes relative frequencies as percents if true
    
    precision (int): the number of decimal places for rounding percents
    
    Returns:
    rel_freq (dict): dictionary containing the relative frequencies
    """
    # 1. compute the total number of rollercoasters
    total = total_values(dictionary)
    
    # 2. divide the value in each category by the total number
    # create a new dictionary to hold the relative frequencies.
    rel_freq = {}
    
    # iterate through the key, value pairs in the dictionary
    for key, value in dictionary.items():
        # add the pair key, value/total to the relative frequencies dictionary
        rel_freq[key] = value / total
        
        if percents:
            rel_freq[key] = round(rel_freq[key] * 100, precision)
        
    return rel_freq

def display_dictionary(dictionary, title='Table Title', left_header='Category', right_header='Values'):
    """
    Displays a dictionary in a simple table with a title.
    
    Parameters:
    dictionary (dict): keys are categories, values are frequencies
    
    title (str): title for the table that will be displayed
    
    left_header: title for the left column of the table
    
    right_header: title for the right column of the table
    
    Returns:
    None
    """
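    # width of the left column: the longest key or the header, whichever is longer
    max_len = max(max(len(key) for key in dictionary), len(left_header))
    
    # number of padding spaces needed after each key to align the columns
    spaces = {key: max_len - len(key) for key in dictionary}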
    
    header = left_header + ' '*(2 + max_len - len(left_header)) + '| ' + right_header
    bar = '-' * max(len(title), len(header))
    
    print(title + '\n' + bar + '\n' + header + '\n' + bar)
    for key, value in dictionary.items():
        print(key, ' '*(spaces[key]), '|', value)
                  
display_dictionary(dictionary=relative_frequencies(coasters, percents=True, precision=2), 
                   title='Relative Frequencies of Rollercoasters by Continent',
                   left_header='Continent',
                   right_header='Percent of Rollercoasters')
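
The padding arithmetic in `display_dictionary` can also be delegated to Python's format specifications. The sketch below is one possible alternative, not a replacement for the function above: `format_table` is a hypothetical name, and it returns the table as a string instead of printing it, which makes the result easier to check.

```python
def format_table(dictionary, left_header='Category', right_header='Values'):
    """Returns the dictionary as an aligned two-column table (a string)."""
    # the left column is as wide as its longest entry, header included
    width = max([len(left_header)] + [len(str(key)) for key in dictionary])

    # '<' left-aligns the text and pads it with spaces up to `width`
    lines = [f'{left_header:<{width}} | {right_header}']
    lines.append('-' * len(lines[0]))
    for key, value in dictionary.items():
        lines.append(f'{str(key):<{width}} | {value}')
    return '\n'.join(lines)

table = format_table({'Africa': 90, 'North America': 905},
                     left_header='Continent',
                     right_header='Count')
print(table)
```

Returning a string keeps the formatting logic separate from the printing, so the same function could feed a file or a test as well as the screen.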

In [None]:
# 2 (b) Do your percents add up to 100? Why or why not?

def total_percents(rel_freq):
    """
    Computes the sum of the values in a dictionary that holds relative frequencies.
    """
    return total_values(rel_freq)
    
total_percents(relative_frequencies(coasters, percents=True, precision=2))
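
To see why rounded percents can miss 100, it may help to try a small dataset where the effect is obvious. The sketch below uses made-up counts (three equal categories) and a hypothetical helper, `rounded_percents`, that mirrors what `relative_frequencies` does with `percents=True`.

```python
def rounded_percents(counts, precision=1):
    """Returns each count as a percent of the total, rounded to `precision` places."""
    total = sum(counts.values())
    return {key: round(100 * value / total, precision)
            for key, value in counts.items()}

# Hypothetical data: three categories with equal counts.
thirds = {'A': 1, 'B': 1, 'C': 1}

# Each category is exactly one third (33.333...%), but rounding to one
# decimal place discards the repeating part of every value.
percents = rounded_percents(thirds)
percent_total = sum(percents.values())

print(percents)
print(percent_total)
```

Each rounded value is 33.3, so the total falls short of 100 by about 0.1. Floating-point arithmetic can add a further tiny discrepancy in the last digits, which is why comparing totals with a tolerance (for example, `math.isclose`) is safer than testing for exact equality.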

In [None]:
# 3. What percent of roller coasters are in either North America or South America?
def sum_from_list(dictionary, categories):
    """
    Computes the sum of values in a dictionary across a list of categories.
    """
    return sum(dictionary[cat] for cat in categories)

sum_from_list(relative_frequencies(coasters, percents=True, precision=2), ['North America', 'South America'])

In [None]:
# 4. Construct a bar chart and a pie chart of these data. 
#    Don’t forget to include labels and a descriptive title!

# Here's where we'll need to rely on some libraries. If this is your first experience
# with matplotlib, wrapping your head around this might actually take longer than just 
# doing it by hand.

import matplotlib.pyplot as plt
import seaborn as sns

def display_bar_and_pie(dictionary, title, categorical_name, ylabel='Frequency', 
                        relative_frequency=False, percent=False):
    """
    Constructs a (1,2)-Axes Figure. 
    The left Axes is a bar chart, and the right Axes is a pie chart.
    
    Parameters:
    dictionary (dict): keys are categories, values are counts
    
    title (str): the shared title
    
    categorical_name (str): the name of the categorical variable
    
    ylabel (str): y-axis label for the bar chart; replaced when relative_frequency is true
    
    relative_frequency (bool): converts values to relative frequencies if true
    
    percent (bool): converts relative frequencies to percents if true
    
    Returns:
    None
    """
    if relative_frequency:
        rel_freqs = relative_frequencies(dictionary)
        categories = list(rel_freqs.keys())
        values = list(rel_freqs.values())
        ylabel = 'Relative Frequency'
        if percent:
            ylabel += ' (%)'
            values = [val * 100 for val in values]
            
    else:
        categories = list(dictionary.keys())
        values = list(dictionary.values())
        
    # create two subplots
    fig, (ax_bar, ax_pie) = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    
    # create the bar chart
    ax_bar = sns.barplot(x=categories, y=values, ax=ax_bar)
    # label the x-axis with the name of the categorical variable, not a hard-coded string
    ax_bar.set_xlabel(categorical_name)
    ax_bar.set_xticklabels(categories, rotation=45)
    ax_bar.set_ylabel(ylabel)  
    
    # create the pie chart
    ax_pie.pie(values, labels=categories, autopct='%1.1f%%')
    ax_pie.axis('equal')
    plt.suptitle(title, size=24)
    plt.show()
    
display_bar_and_pie(coasters, 'Rollercoasters of the World', 'Continent', relative_frequency=True, percent=True)

In [None]:
# now we can collect all of our answers in one place.
coasters = {'Africa':90,
            'Asia':2649,
            'Australia':27,
            'Europe':1329,
            'North America':905,
            'South America':175}

print('1. There are', total_values(coasters), 'rollercoasters in the world.\n')
print('\n2. a.')
display_dictionary(dictionary=relative_frequencies(coasters, percents=True, precision=2), 
                   title='Relative Frequencies of Rollercoasters by Continent',
                   left_header='Continent',
                   right_header='Percent of Rollercoasters')
print('\n2. b. The total of the percents is {}%. If this does not equal 100, it is due to rounding errors.\n'.format( 
      total_percents(relative_frequencies(coasters, percents=True, precision=2))))

print('\n3. {:.1f}% of rollercoasters are in North America or South America.\n'.format(
     sum_from_list(relative_frequencies(coasters, 
                                          percents=True, 
                                          precision=2), 
                     ['North America', 'South America'])))

print('\n4. Below are a bar chart and pie chart of the relative frequencies.')
display_bar_and_pie(coasters, 
                    'Rollercoasters of the World', 
                    'Continent', 
                    relative_frequency=True, 
                    percent=True)

Wonderful! Since we focused on creating reusable functions, we should be able to use our code to solve the following problem, as well. 

# Bonus Problem

The following table lists the box office gross for the weekend of July 3-5, 2020, taken from <a href="https://topdrawer.aamt.edu.au/Statistics/Misunderstandings/Misunderstandings-of-averages/Problems-with-categorical-data">BoxOfficeMojo</a>:

| Movie | Gross |
|-|-|
| Relic | \$192,352 |
| The Wretched | \$24,830 |
| Followed | \$23,677 |
| Becky | \$19,780 |
| Miss Juneteenth | \$11,883 |
| Infamous | \$2,642 |
| The Truth | \$2,200 |
| Sex and the Future | \$604 |

We will assume that, due to theater closures as a result of COVID-19, these were the **only** movies that earned any money at the American box office in this time period.

1. According to this table, how much money was earned at the box office over the weekend of July 3-5?

2. Find the percent of box office income earned by each movie.

3. What percent of the total box office income was earned by the five lowest-earning movies?

4. Construct a bar chart and pie chart showing the distribution of box office income for each of the movies.

In [None]:
box_office = {'Relic':192352,
              'The Wretched':24830,
              'Followed':23677,
              'Becky':19780,
              'Miss Juneteenth':11883,
              'Infamous':2642,
              'The Truth':2200,
              'Sex and the Future':604}

print('1. There was a total of ${} earned at the box office on the weekend of July 3-5.'
      .format(total_values(box_office)))

print('\n2. a.')
display_dictionary(dictionary=relative_frequencies(box_office, percents=True, precision=2), 
                   title='Box Office Income',
                   left_header='Movie',
                   right_header='Income (Percent of Total)')
print('\n2. b. The total of the percents is {}%. If this does not equal 100, it is due to rounding errors.\n'.format( 
      total_percents(relative_frequencies(box_office, percents=True, precision=2))))

five_lowest = ['Becky', 'Miss Juneteenth', 'Infamous', 'The Truth', 'Sex and the Future']
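# problem 3 asks for a percent of the total, so sum the relative frequencies, not the raw income
print('\n3. The five lowest-earning movies earned {:.2f}% of the total box office income.'
      .format(sum_from_list(relative_frequencies(box_office, percents=True, precision=2), five_lowest)))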

print('\n4. Below are a bar chart and pie chart of box office income by movie.')
display_bar_and_pie(dictionary=box_office, 
                    title='Box Office Income by Movie', 
                    categorical_name='Movie',
                    ylabel='Income ($)')
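
The `five_lowest` list above was typed by hand, which works for eight movies but invites mistakes on longer tables. One way to find the lowest earners programmatically is sketched below. The helper name `lowest_n` is hypothetical, and this is only one of several reasonable approaches.

```python
def lowest_n(dictionary, n):
    """Returns the n keys with the smallest values, smallest first."""
    # sorting a dictionary iterates over its keys; key=dictionary.get
    # orders those keys by their associated values
    return sorted(dictionary, key=dictionary.get)[:n]

box_office = {'Relic': 192352,
              'The Wretched': 24830,
              'Followed': 23677,
              'Becky': 19780,
              'Miss Juneteenth': 11883,
              'Infamous': 2642,
              'The Truth': 2200,
              'Sex and the Future': 604}

five_lowest = lowest_n(box_office, 5)

# share of the total income earned by those five movies, as a percent
total = sum(box_office.values())
percent = 100 * sum(box_office[movie] for movie in five_lowest) / total

print(five_lowest)
print(round(percent, 2))
```

Note that ties are broken by the order in which the keys appear in the dictionary, because Python's sort is stable.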