# STOR 120 - Homework 2: Arrays and Tables

**Recommended Reading**: 
* [Data Types](https://www.inferentialthinking.com/chapters/04/Data_Types.html) 
* [Sequences](https://www.inferentialthinking.com/chapters/05/Sequences.html)
* [Tables](https://www.inferentialthinking.com/chapters/06/Tables.html)

For every problem that asks for a written explanation, you **must** provide your answer in the designated space. Moreover, throughout this homework and all future ones, please do not re-assign variables within the notebook! For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. 

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

You should start early so that you have time to get help if you're stuck. 

In [46]:
# Don't change this cell; just run it.

import numpy as np
from datascience import *

## 1. Creating Arrays


**Question 1.1.** Make an array called `weird_numbers` containing the following numbers (in the given order):

1. 6
2. the square root of 72
3. -21
4. 3 to the power of 7.8

*Note:* Python lists behave differently from numpy arrays. In STOR 120, we use numpy arrays, so please make an **array**, not a Python list, even if you know how to make a list.

<!--
BEGIN QUESTION
name: q1_1
-->

In [2]:
weird_numbers = make_array(6, (72)**.5, -21, 3**7.8)
weird_numbers

array([    6.        ,     8.48528137,   -21.        ,  5266.78738671])

**Question 1.2.** Make an array called `book_title_words` containing the following three strings: `Eats`, `Shoots`, and `and Leaves`.

<!--
BEGIN QUESTION
name: q1_2
-->

In [3]:
book_title_words = make_array('Eats', 'Shoots', 'and Leaves')
book_title_words

array(['Eats', 'Shoots', 'and Leaves'],
      dtype='<U10')

Strings have a method called `join`.  `join` takes one argument, an array of strings.  It returns a single string.  Specifically, the value of `a_string.join(an_array)` is a single string that's the [concatenation](https://en.wikipedia.org/wiki/Concatenation) ("putting together") of all the strings in `an_array`, **except** `a_string` is inserted in between each string.
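As a small illustration of the description above, the sketch below joins a made-up array of strings with two different separators. The array `fruits` is a hypothetical example, not part of the homework, and it is built with `np.array`, which is assumed here to behave like `make_array` for this purpose.

```python
import numpy as np

# Hypothetical example array, not part of the homework data.
fruits = np.array(['apple', 'banana', 'cherry'])

# The string before .join is placed between consecutive elements only,
# never before the first element or after the last one.
dashed = '-'.join(fruits)
squashed = ''.join(fruits)

print(dashed)    # apple-banana-cherry
print(squashed)  # applebananacherry
```

Joining with the empty string simply glues the elements together.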

**Question 1.3.** Use the array `book_title_words` and the method `join` to make two strings:

1. `Eats, Shoots, and Leaves` (call this one `with_commas`)
2. `Eats Shoots and Leaves` (call this one `without_commas`)

*Hint:* If you're not sure what `join` does, first try just calling, for example, `"foo".join(book_title_words)` .

<!--
BEGIN QUESTION
name: q1_3
-->

In [5]:
with_commas = ", ".join(book_title_words)
without_commas = " ".join(book_title_words)

# These lines are provided just to print out your answers.
print('with_commas:', with_commas)
print('without_commas:', without_commas)

with_commas: Eats, Shoots, and Leaves
without_commas: Eats Shoots and Leaves


## 2. Indexing Arrays


These exercises give you practice accessing individual elements of arrays.  In Python (and in many programming languages), elements are accessed by *index*, and counting starts at zero: the first element is the element at index 0.

*Note:* Please use the `item` method rather than bracket notation (e.g. `arr[0]`) when indexing, even if you know how to use brackets, as bracket notation can yield a different data type than the one we expect.
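The difference in output types can be seen in a short sketch. The array `arr` is a hypothetical example; the exact NumPy scalar type returned by bracket indexing may vary by platform, so the code only inspects the types rather than assuming a specific one.

```python
import numpy as np

arr = np.array([10, 20, 30])  # hypothetical example array

via_item = arr.item(0)    # a plain Python int
via_bracket = arr[0]      # a NumPy integer scalar (for example numpy.int64)

print(type(via_item))
print(type(via_bracket))
print(via_item == via_bracket)  # the values are equal; only the types differ
```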

**Question 2.1.** The cell below creates an array of some numbers.  Set `sixth_element` to the sixth element of `some_numbers`.

<!--
BEGIN QUESTION
name: q2_1
-->

In [6]:
some_numbers = make_array(-1, -3, -6, -10, -15, -19, -24, -30)

sixth_element = some_numbers.item(5)
sixth_element

-19

**Question 2.2.** The next cell creates a table that displays some information about the elements of `some_numbers` and their order.  Run the cell to see the partially-completed table, then fill in the missing information (the cells that say "Ellipsis") by assigning `blank_a`, `blank_b`, `blank_c`, and `blank_d` to the correct elements in the table.

<!--
BEGIN QUESTION
name: q2_2
-->

In [15]:
blank_a = 'fourth'
blank_b = 'sixth'
blank_c = 0
blank_d = 5
elements_of_some_numbers = Table().with_columns(
    "English name for position", make_array("first", "second", "third", blank_a, "fifth", blank_b, "seventh", "eighth"),
    "Index",                     make_array(blank_c, 1, 2, 3, 4, blank_d, 6, 7),
    "Element",                   some_numbers)
elements_of_some_numbers

English name for position,Index,Element
first,0,-1
second,1,-3
third,2,-6
fourth,3,-10
fifth,4,-15
sixth,5,-19
seventh,6,-24
eighth,7,-30


**Question 2.3.** You will sometimes want to find the *last* element of an array.  Suppose an array has 747 elements.  What is the index of its last element?

<!--
BEGIN QUESTION
name: q2_3
-->

In [16]:
# An array with 747 elements has indices 0 through 746.
index_of_last_element = 746

More often, you don't know the number of elements in an array, its *length*.  (For example, it might be a large dataset you found on the Internet.)  The function `len` takes a single argument, an array, and returns the `len`gth of that array (an integer).
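A minimal sketch of how `len` relates to the index of the last element, using a hypothetical five-element array: because indices start at 0, the last index is one less than the length, and a negative index counts from the end.

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e'])  # hypothetical example array

n = len(letters)                  # number of elements: 5
last_index = n - 1                # indices run from 0 to n - 1, so this is 4
last = letters.item(last_index)   # 'e'
also_last = letters.item(-1)      # negative indices count from the end

print(n, last_index, last, also_last)
```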

**Question 2.4.** The cell below loads an array called `primary_total_votes`. This array has the votes for each candidate or ballot choice in every contest held in North Carolina in the May 8, 2018 elections. Calling `.column(...)` on a table returns an array of the column specified, in this case the `sum_total_votes` column of the `primary_results` table. The third to last element in the array is the number of votes for Johnny Riddle in the election for the Yancey County Board of Commissioners. Assign this number of votes to `Riddle_votes`.
<!--
BEGIN QUESTION
name: q2_4
-->

In [17]:
primary_total_votes = Table.read_table("primary_results.csv").column('sum_total_votes')

Riddle_votes = primary_total_votes.item(-3)
Riddle_votes

1005

**Question 2.5.** The 829th and 830th rows of the `primary_results` table show the contest, names, and votes for the two people who were running for the position of `ORANGE COUNTY CLERK OF SUPERIOR COURT` in 2018. Assign `sum_of_OC_votes` to the sum of votes for these two people using the `primary_total_votes` array.

<!--
BEGIN QUESTION
name: q2_5
-->

In [19]:
sum_of_OC_votes = primary_total_votes.item(828) + primary_total_votes.item(829)
sum_of_OC_votes

18256

## 3. Basic Array Arithmetic


**Question 3.1.** Multiply the numbers 12, 1212, 122122, and -120 by 585. Assign each variable below such that `first_product` is assigned to the result of $12 * 585$, `second_product` is assigned to the result of $1212 * 585$, and so on. 

For this question, **don't** use arrays.

<!--
BEGIN QUESTION
name: q3_1
-->

In [20]:
first_product = 585 * 12
second_product = 585 * 1212
third_product = 585 * 122122
fourth_product = 585 * -120
print(first_product, second_product, third_product, fourth_product)

7020 709020 71441370 -70200


**Question 3.2.** Now, do the same calculation, but using an array called `numbers` and only a single multiplication (`*`) operator.  Store the 4 results in an array named `products`.

<!--
BEGIN QUESTION
name: q3_2
-->

In [21]:
numbers = make_array(12, 1212, 122122, -120)
products = numbers * 585
products

array([    7020,   709020, 71441370,   -70200], dtype=int64)

**Question 3.3.** Oops, we made a typo!  Instead of 585, we wanted to multiply each number by 919.  Compute the correct products in the cell below using array arithmetic.  Notice that your job is really easy if you previously defined an array containing the 4 numbers.

<!--
BEGIN QUESTION
name: q3_3
-->

In [22]:
correct_products = numbers *919
correct_products

array([    11028,   1113828, 112230118,   -110280], dtype=int64)

**Question 3.4.** We've loaded an array of temperatures in the next cell. This dataset contains air quality data collected using a PurpleAir Dual Laser Air Quality Sensor located at the Chapel Hill Public Library, as well as temperature in Fahrenheit. Convert the temperatures to [R&eacute;aumur](https://en.wikipedia.org/wiki/R%C3%A9aumur_scale) by first subtracting 32 from them, then multiplying the results by $\frac{4}{9}$. Round the final result *after* converting to R&eacute;aumur to the nearest tenth using the `np.round` function.

<!--
BEGIN QUESTION
name: q3_4
-->

In [23]:
temperatures = Table.read_table("Local_Air_Quality.csv").column("Temp_F")

reaumur_temperatures = np.round((temperatures - 32) * (4/9), 1)
reaumur_temperatures

array([ 24.9,  24.9,  24. , ...,  16.9,  16.9,  18.7])
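One way to sanity-check the conversion is to apply it to temperatures whose R&eacute;aumur values are known: water freezes at 0 and boils at 80 on that scale. The array `fahrenheit` below is a hypothetical set of reference points, not part of the air quality data.

```python
import numpy as np

# Hypothetical reference points in Fahrenheit: freezing, body temperature, boiling.
fahrenheit = np.array([32.0, 98.6, 212.0])

# Same formula as in the question: subtract 32, scale by 4/9, then round.
reaumur = np.round((fahrenheit - 32) * (4 / 9), 1)
print(reaumur)  # approximately 0, 29.6 and 80
```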

## 4. North Carolina Population


The cell below loads a table of estimates of the North Carolina population from 1900 to 2021. The estimates come from the [FRED Economic Data](https://fred.stlouisfed.org/series/NCPOP).

In [24]:
NCpop = Table.read_table("NCpop.csv")
NCpop.show(4)

Year,NCPOP
1900,1897000
1901,1926000
1902,1956000
1903,1986000


The name `population` is assigned to an array of population estimates.

In [25]:
population = NCpop.column(1)
population

array([ 1897000,  1926000,  1956000,  1986000,  2017000,  2051000,
        2077000,  2105000,  2142000,  2174000,  2221000,  2276000,
        2313000,  2362000,  2421000,  2473000,  2513000,  2546000,
        2522000,  2535000,  2588000,  2651000,  2700000,  2761000,
        2830000,  2895000,  2959000,  3027000,  3082000,  3133000,
        3167000,  3184000,  3227000,  3268000,  3304000,  3323000,
        3346000,  3385000,  3440000,  3514000,  3574000,  3589000,
        3569000,  3654000,  3560000,  3533000,  3706000,  3769000,
        3837000,  3911000,  4068000,  4120000,  4109000,  4120000,
        4131000,  4242000,  4309000,  4368000,  4376000,  4458000,
        4573000,  4663000,  4707000,  4742000,  4802000,  4863000,
        4896000,  4952000,  5004000,  5031000,  5084411,  5203531,
        5301150,  5389852,  5470911,  5547188,  5607964,  5685607,
        5759492,  5823491,  5898980,  5956653,  6019101,  6077056,
        6164006,  6253954,  6321578,  6403700,  6480594,  6565

In this question, you will apply some built-in Numpy functions to this array. Numpy is a module that is often used in Data Science!

The difference function `np.diff` subtracts each element in an array from the element after it within the array. As a result, the length of the array `np.diff` returns will always be one less than the length of the input array.

The cumulative sum function `np.cumsum` outputs an array of partial sums. For example, the third element in the output array corresponds to the sum of the first, second, and third elements.
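The two functions can be tried on a small array first. The values below are hypothetical and chosen so that the results are easy to check by hand.

```python
import numpy as np

values = np.array([5, 8, 6, 10])  # hypothetical example data

changes = np.diff(values)     # 8-5, 6-8, 10-6  ->  [ 3, -2,  4]
running = np.cumsum(values)   # 5, 5+8, 5+8+6, 5+8+6+10  ->  [ 5, 13, 19, 29]

print(changes)
print(running)
print(len(values), len(changes))  # np.diff returns one element fewer
```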

**Question 4.1.** Very often in data science, we are interested in understanding how values change with time. Use `np.diff` (and other functions) to calculate the largest absolute change (positive or negative) in population between any two consecutive years.

<!--
BEGIN QUESTION
name: q4_1
-->

In [29]:
largest_change_NC_pop = np.max(abs(np.diff(population)))
largest_change_NC_pop

430825

**Question 4.2.** What do the values in the resulting array represent (choose one)?

In [30]:
np.cumsum(np.diff(population))

array([  29000,   59000,   89000,  120000,  154000,  180000,  208000,
        245000,  277000,  324000,  379000,  416000,  465000,  524000,
        576000,  616000,  649000,  625000,  638000,  691000,  754000,
        803000,  864000,  933000,  998000, 1062000, 1130000, 1185000,
       1236000, 1270000, 1287000, 1330000, 1371000, 1407000, 1426000,
       1449000, 1488000, 1543000, 1617000, 1677000, 1692000, 1672000,
       1757000, 1663000, 1636000, 1809000, 1872000, 1940000, 2014000,
       2171000, 2223000, 2212000, 2223000, 2234000, 2345000, 2412000,
       2471000, 2479000, 2561000, 2676000, 2766000, 2810000, 2845000,
       2905000, 2966000, 2999000, 3055000, 3107000, 3134000, 3187411,
       3306531, 3404150, 3492852, 3573911, 3650188, 3710964, 3788607,
       3862492, 3926491, 4001980, 4059653, 4122101, 4180056, 4267006,
       4356954, 4424578, 4506700, 4583594, 4668459, 4759987, 4851135,
       4934850, 5050412, 5163959, 5288403, 5410658, 5531672, 5648828,
       5753789, 6184

1) The total population change between consecutive years, starting at 1901.

2) The total population change between 1900 and each later year, starting at 1901.

3) The total population change between 1900 and each later year, starting inclusively at 1900.

<!--
BEGIN QUESTION
name: q4_2
-->

In [31]:
# Assign cumulative_sum_answer to 1, 2, or 3
cumulative_sum_answer = 2
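The reasoning behind this choice is that the consecutive differences telescope: adding up the first $k$ year-to-year changes leaves only the value in year $k$ minus the value in the first year. A minimal sketch on a hypothetical array (standing in for `population`) checks this.

```python
import numpy as np

values = np.array([5, 8, 6, 10])  # hypothetical stand-in for `population`

# Running total of the consecutive changes.
cumulative_change = np.cumsum(np.diff(values))

# Change from the first element to each later element.
change_from_first = values[1:] - values.item(0)

print(cumulative_change)
print(change_from_first)
```

Both arrays have one element fewer than `values`, because there is no entry for the first year itself.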

**Question 4.3.** Assign the name `smallest` to the smallest absolute (positive or negative) change in population between any two consecutive years. Assign the name `average` to the average of the absolute changes in population between any two consecutive years.

<!--
BEGIN QUESTION
name: q4_3
-->

In [32]:
smallest = np.min(abs(np.diff(population)))
average = np.mean(abs(np.diff(population)))


smallest, round(average)

(8000, 75162)

**Question 4.4.** Suppose that you had assumed that the average of the *absolute* changes in population was equal to the average of the changes in population between any two consecutive years (i.e. you assumed that the population never decreased). Set `difference_from_expected` to an array with 121 elements, where the elements are the differences (in order by year) between the actual population change during each pair of consecutive years (which could be positive or negative) and the expected change (`average`). 

For example, since the North Carolina populations in 1900 and 1901 are 1897000 and 1926000, the first element of the `difference_from_expected` would be  $(1926000 - 1897000) - average$

<!--
BEGIN QUESTION
name: q4_4
-->

In [35]:
difference_from_expected = np.diff(population) - average
difference_from_expected

array([ -4.61617851e+04,  -4.51617851e+04,  -4.51617851e+04,
        -4.41617851e+04,  -4.11617851e+04,  -4.91617851e+04,
        -4.71617851e+04,  -3.81617851e+04,  -4.31617851e+04,
        -2.81617851e+04,  -2.01617851e+04,  -3.81617851e+04,
        -2.61617851e+04,  -1.61617851e+04,  -2.31617851e+04,
        -3.51617851e+04,  -4.21617851e+04,  -9.91617851e+04,
        -6.21617851e+04,  -2.21617851e+04,  -1.21617851e+04,
        -2.61617851e+04,  -1.41617851e+04,  -6.16178512e+03,
        -1.01617851e+04,  -1.11617851e+04,  -7.16178512e+03,
        -2.01617851e+04,  -2.41617851e+04,  -4.11617851e+04,
        -5.81617851e+04,  -3.21617851e+04,  -3.41617851e+04,
        -3.91617851e+04,  -5.61617851e+04,  -5.21617851e+04,
        -3.61617851e+04,  -2.01617851e+04,  -1.16178512e+03,
        -1.51617851e+04,  -6.01617851e+04,  -9.51617851e+04,
         9.83821488e+03,  -1.69161785e+05,  -1.02161785e+05,
         9.78382149e+04,  -1.21617851e+04,  -7.16178512e+03,
        -1.16178512e+03,

**Question 4.5.** Using the `difference_from_expected` array from the previous problem, at the beginning of what year is the expected population change (assuming a linear increase in population by the average absolute change) the most similar (in magnitude, whether positive or negative) to the actual North Carolina population change over the prior year? 

<!--
BEGIN QUESTION
name: q4_5
-->

In [57]:
most_similar = np.argmin(abs(difference_from_expected)) + 1901
most_similar

1980
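The step from an `np.argmin` index to a calendar year can be checked on a tiny example. The numbers and the name `first_year` below are hypothetical and are not the North Carolina data; the sketch only illustrates why the offset is the first year plus one.

```python
import numpy as np

# Hypothetical yearly values for 2000 through 2004 (not the NC data).
first_year = 2000
values = np.array([100, 150, 170, 200, 205])

changes = np.diff(values)            # [50, 20, 30,  5]
expected = np.mean(abs(changes))     # 26.25
gaps = abs(changes - expected)       # [23.75, 6.25, 3.75, 21.25]

# Element i of `changes` covers the year first_year + i, so that change is
# complete at the beginning of year first_year + i + 1.
closest_year = np.argmin(gaps) + first_year + 1
print(closest_year)  # 2003
```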

## 5. Tables


**Question 5.1.** Suppose that [Brandwein's Bagels](https://www.brandweinsbagels.com/) has 23 plain bagels, 17 everything bagels, 12 wheat everything bagels, 19 sesame bagels, and 7 onion bialy ready to sell. Create a table that contains this information. It should have two columns: `menu item` and `count`.  Assign the new table to the variable `Brandweins`.

<!--
BEGIN QUESTION
name: q5_1
-->

In [60]:
Brandweins = Table().with_column(
    'menu item', make_array('plain bagels','everything bagels', 'wheat everything bagels', 'sesame bagels', 'onion bialy')).with_column(
    'count', make_array(23, 17, 12, 19, 7))
Brandweins

menu item,count
plain bagels,23
everything bagels,17
wheat everything bagels,12
sesame bagels,19
onion bialy,7


**Question 5.2.** The file `Brandweins_sales.csv` contains the number of items sold, over a brief period of time, for each menu item in the `Brandweins` table. `Brandweins_sales.csv` has an extra column called "price per menu item (\$)". Load these data into a table called `Brandweins_sales`.

<!--
BEGIN QUESTION
name: q5_2
-->

In [63]:
Brandweins_sales = Table.read_table('Brandweins_sales.csv')
Brandweins_sales

menu item,count,price per menu item ($)
plain bagel,9,2.0
everything bagel,14,2.0
wheat everything bagel,6,2.0
sesame bagel,13,2.0
onion bialy,3,2.5


**Question 5.3.** How many menu items were sold in the `Brandweins_sales` table? Assign this value to `total_items_sold`.

<!--
BEGIN QUESTION
name: q5_3
-->

In [67]:
total_items_sold = np.sum(Brandweins_sales.column('count'))
total_items_sold

45

**Question 5.4.** What was the total revenue (the total price of all menu items sold) in the `Brandweins_sales` table? Assign this value to `total_revenue`.

<!--
BEGIN QUESTION
name: q5_4
-->

In [69]:
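# Revenue for each menu item is (number sold) * (price per item); sum over all items.
total_revenue = np.sum(Brandweins_sales.column('count') * Brandweins_sales.column('price per menu item ($)'))
total_revenue

91.5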
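Revenue for each menu item is its count multiplied by its price, so the total is the sum of those products rather than the sum of the prices alone. The sketch below reproduces the arithmetic with plain NumPy arrays whose values are copied from the `Brandweins_sales` output shown above; the names `counts` and `prices` are introduced here only for illustration.

```python
import numpy as np

# Values copied from the Brandweins_sales table output.
counts = np.array([9, 14, 6, 13, 3])
prices = np.array([2.0, 2.0, 2.0, 2.0, 2.5])

revenue_per_item = counts * prices   # element-wise: [18., 28., 12., 26., 7.5]
total = np.sum(revenue_per_item)

print(revenue_per_item)
print(total)  # 91.5
```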

**Question 5.5.** Make a new table called `remaining_inventory`.  It should have the same rows and columns as `Brandweins`, except that the number of items sold for each menu item should be subtracted from that menu item's original count, so that the `count` is the number of items remaining.

<!--
BEGIN QUESTION
name: q5_5
-->

In [72]:
# Reuse the `menu item` column from Brandweins and replace only the `count` column.
remaining_inventory = Brandweins.with_column(
    'count', Brandweins.column('count') - Brandweins_sales.column('count'))

remaining_inventory

menu item,count
plain bagels,14
everything bagels,3
wheat everything bagels,6
sesame bagels,6
onion bialy,4
