In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab05.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lab 05: Data Manipulation

## References

* [Sections 8.2 - 8.5 of the Textbook](https://ccsf-math-108.github.io/textbook/chapters/08/2/Classifying_by_One_Variable.html)
* [datascience Documentation](https://datascience.readthedocs.io/)

---

## Lab Assignment Reminders

- 🚨 Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- Your tasks are categorized as auto-graded (📍) and manually graded (📍🔎):
    - **For all auto-graded tasks:**
        - Replace the `...` in the provided code cell with your own code.
        - Run the `grader.check` code cell to execute tests on your code.
        - There are no hidden auto-grader tests in the lab assignments. This means if you pass the tests, you can assume you've completed the task successfully.
    - **For all manually graded tasks:**
        - You may need to provide your own response to the provided prompt. Replace the template text "_Type your answer here, replacing this text._" with your own words.
        - You might need to produce a graphic or another output using code. Replace the `...` in the code cell to generate the image, table, etc.
        - In either case, check your response with a classmate, a tutor, or the instructor before moving on.
- In this assignment and all future ones, please **do not re-assign variables** anywhere in the notebook! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you may fail tests that you thought you were passing previously!_
- You may [submit](#Submit-Your-Assignment-to-Canvas) this assignment as many times as you want before the deadline. Your instructor will score the last version you submit once the deadline has passed.
- **Collaborating on labs is encouraged!** You should rarely remain stuck for more than a few minutes on questions in labs, so ask an instructor or classmate for help. (Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it.) However, please don't just share answers.

---

## Configure the Notebook

Run the following cell to configure this Notebook.

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# Interactive Widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

---

## Joining Tables

There are a variety of reasons why data might be stored across multiple tables, and it might be necessary to combine data from two or more sources into one table. Think of joining tables like mixing ingredients to create a delicious recipe! When we use methods like `join` in the `datascience` Python library, we're combining different sets of data based on something they have in common, just like adding various ingredients to make a yummy dish. This helps us get a bigger picture of the data and discover insights by bringing together information from different sources, making data analysis a bit like cooking up a data stew!

For example, you might have customer data in two different tables:

* A table called `customers` containing contact information.
* A table called `purchase_history` containing information on the customer's latest purchases.

Run the following code cell to create examples of such tables.

In [None]:
customers = Table().with_columns(
    'Name', ['Alice Johnson', 'Mohammed Khan', 'Maria García', 'Ling Chen'],
    'Email', ['alice@example.com', 'mohammed@example.com', 'maria@example.com', 'ling@example.com'],
    'Customer ID', [1, 2, 3, 4]
)

purchase_history = Table().with_columns(
    'Customer Number', [1, 3, 4, 5, 1, 4, 4],
    'Product', ['Widget', 'Gadget', 'Doodad', 'Thingamajig', 'Gadget', 'Widget', 'Doodad'],
    'Ad', ['Yes', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    'Amount ($)', [10, 15, 8, 12, 15, 10, 8]
)

display(customers)
display(purchase_history)

You might want to put all this information in one table to more easily analyze the data or follow up with a customer about one of their purchases. The key to joining data is that there must be a clearly defined relation between the two tables. In this case, the link between the tables is the customer identification number, stored as `'Customer ID'` in `customers` and as `'Customer Number'` in `purchase_history`. For example, customer 1, Alice Johnson (alice@example.com), has purchased a Widget for 10 dollars and a Gadget for 15 dollars. Both purchases were made through an advertisement link.

To use the join method from the `datascience` library, you'll typically have two tables, let's call them `tbl1` and `tbl2`, and you want to combine them based on a common key. Let's say that the key values are located in `'Column 1'` for `tbl1` and `'Column 2'` for `tbl2`.

You could use `tbl1.join('Column 1', tbl2, 'Column 2')` to join these tables together, aligning rows with matching key values in `'Column 1'` and `'Column 2'` into a single table where the focus is placed on the column of `tbl1`.
If you used `tbl2.join('Column 2', tbl1, 'Column 1')`, you would also create a joined table, but the output would likely be different as the table would be created from the perspective of the column in `tbl2`.
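To make the matching rule concrete, here is a minimal plain-Python sketch of the idea behind a join. It is not how the `datascience` library implements `join`, and the `pets` and `visits` data and the `join_rows` helper are hypothetical names invented for this illustration. The sketch pairs each row of the first table with every row of the second table that has the same key value.

```python
# Hypothetical toy data: each "table" is a list of row dictionaries.
pets = [
    {'Owner ID': 1, 'Pet': 'Cat'},
    {'Owner ID': 2, 'Pet': 'Dog'},
    {'Owner ID': 3, 'Pet': 'Fish'},
]
visits = [
    {'Owner Number': 1, 'Cost': 40},
    {'Owner Number': 1, 'Cost': 25},
    {'Owner Number': 3, 'Cost': 10},
    {'Owner Number': 9, 'Cost': 60},
]

def join_rows(left, left_key, right, right_key):
    """Pair every left row with every right row whose key value matches."""
    joined = []
    for left_row in left:
        for right_row in right:
            if left_row[left_key] == right_row[right_key]:
                combined = dict(left_row)  # start from a copy of the left row
                for label, value in right_row.items():
                    if label != right_key:  # the right key label is not repeated
                        combined[label] = value
                joined.append(combined)
    return joined

pet_visits = join_rows(pets, 'Owner ID', visits, 'Owner Number')
for row in pet_visits:
    print(row)
```

In this sketch, owner 1 appears twice because two visits match, while owner 2 (no visits) and owner number 9 (no matching owner) do not appear at all.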

---

### Task 01 📍

Complete the following code to join the data in the `purchase_history` table with the data in the `customers` table. The resulting table should have 6 rows and the 6 columns `'Customer ID', 'Name', 'Email', 'Product', 'Ad', 'Amount ($)'`.

In [None]:
customer_data = customers.join(..., ..., ...)
customer_data

In [None]:
grader.check("task_01")

---

There are a few things to notice about how this `join` method worked. 
* Since you started with `customers`, it kept all the column labels of the `customers` table.
* The `'Customer ID'` column was brought to the front of the table.
* The columns in `purchase_history` that were not used to join the data (`'Product'`, `'Ad'`, and `'Amount ($)'`) were added to the end of the table.
* The label `'Customer Number'` doesn't appear in the resulting table.
* There is no customer contact data for customer number 5, so that customer's purchase is missing from the joined table.
* The customer with ID 2 does not have any purchase history, so that customer is missing from the joined table.

---

## Aggregation

Aggregation refers to the process of summarizing and condensing large datasets into more manageable and meaningful insights. You've seen this in action through using functions like `np.average` and `np.sum`. In the context of the `datascience` library, more advanced aggregation can be achieved using table methods like `group` and `pivot`. 

* The `group` method allows you to group data by specific criteria, such as categories or attributes, and then apply aggregation functions (e.g., `sum`, `mean`, etc.) to calculate summary statistics within each group.
* Pivot tables with the `pivot` method, on the other hand, enable you to restructure data to create a compact summary table, making it easier to analyze relationships between variables.

These methods are essential for extracting valuable information and patterns from complex datasets.

---

### Group

The `group` table method has the general format `tbl.group(column_or_label, collect)`. You've worked with the group method without providing a function name for `collect`. The `collect` argument gives you a place to specify a function that you want to apply to all the values associated with the grouped labels.  

By default, the `group` method counts up the number of rows in the table associated with the grouped values for the specified column. If you wanted to do something else besides counting the number of rows, then you'd provide a function name for `collect`. For example, `tbl.group('Column Label', np.min)` would apply the minimum function to all the grouped data based on the category values in the column `'Column Label'`. 

Run the following code to see how this works when we apply `np.min` to the grouped data from the previous task based on `'Name'` values. 

In [None]:
customer_data.group('Name', np.min)

It is important to know a few things about how using `group` with a collect function works.
* The `group` method tries to apply `np.min` to every column of the table other than `'Name'`. If a column contains numerical values, each row of the result shows the minimum of that column's values within the group. If a column doesn't contain numerical values, the result is an empty value (the empty string `''`, for example).
* The columns of the resulting table will be `'Name'` and a column for each of the other columns in `customer_data` with the function name `min` added to the end of the label.

You can reduce the number of columns using `select` or `drop` before you use the group method to get a more presentable table.
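The role of the `collect` argument can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation; the `rows` data and the `group_rows` helper are hypothetical. Values are first gathered by category, and then the chosen function is applied to each group.

```python
# Hypothetical (category, value) rows.
rows = [('tea', 3), ('coffee', 5), ('tea', 4), ('coffee', 2), ('juice', 6)]

def group_rows(rows, collect):
    """Gather values by category, then reduce each group with `collect`."""
    groups = {}
    for category, value in rows:
        groups.setdefault(category, []).append(value)
    # One entry per unique category, in sorted order.
    return {category: collect(values) for category, values in sorted(groups.items())}

counts = group_rows(rows, len)   # counting rows, like the default behavior
largest = group_rows(rows, max)  # a different function gives a different summary
print(counts)
print(largest)
```

Passing a different function changes the summary but not the grouping, which mirrors swapping the `collect` argument in `tbl.group`.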

---

### Task 02 📍

Using the `customer_data` table, create a table called `customer_averages` that shows the name and average purchase amount for each customer in the `customer_data` table. Your resulting table should have 3 rows and 2 columns. The column labels should be `'Name'` and `'Amount ($) average'`.

In [None]:
customer_averages = ...
customer_averages

In [None]:
grader.check("task_02")

---

Now that you have the purchase amount averages by customer, you might want to take this a step further by doing the same and considering whether or not the purchases were made using advertisements.

You can group by more than one column using the `group` method by providing a list of column labels or indexes for the first argument of `group`. The format is `tbl.group(['Label 1', 'Label 2'], collect)`. This creates a table with a row for each unique combination of values from both columns and applies the `collect` function to the grouped data (for example, `np.average` would calculate the average within each combination).

---

### Task 03 📍

Using the `customer_data` table, create a 4-row and 3-column table called `customer_ad_averages` that contains a row for each combination of customer name and ad (Yes/No) pairing along with the average purchase amount. The labels should be `'Name', 'Ad', and 'Amount ($) average'`.

**Tip**: Don't forget that you will probably want to reduce the `customer_data` table to the relevant columns first.

In [None]:
customer_ad_averages = ...
customer_ad_averages

In [None]:
grader.check("task_03")

You should see that customer Ling Chen spends about 1 dollar more on average for purchases made through advertisements compared to purchases not made through advertisements. This is a small example of the more detailed analysis you can provide when aggregating your data across multiple categories.

---

### Pivot

A pivot table provides the same information you get from grouping by two columns, but the format is slightly different.

Here are a few examples where using pivot over group can be advantageous:

* Pivot tables are excellent for creating summary tables. If your goal is to create a structured summary table with row and column headers that represent specific categories, pivot is often more straightforward to use.
* If your primary goal is to visualize data, pivot tables can be more useful. The resulting grid structure from a pivot table can be directly used in data visualization tools or libraries for creating bar charts or other visual representations of your data.
* Pivot tables in the `datascience` library handle missing values differently compared with `group`. When using `pivot`, a placeholder of some kind is provided for missing combinations of values. On the other hand, `group` will leave out any missing combinations of values from the table.

The first two arguments of `pivot` are important. 
* The first argument specifies the column where the pivot table column values will come from.
* The second argument will specify the column where the pivot table row values in the first column come from.

By default, the values inside the table come from the counts of how often the pair of column and row values occur in the data set. 
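As a rough picture of the default counting behavior, here is a plain-Python sketch. It is not the `datascience` implementation; the `orders` data and the `pivot_counts` helper are hypothetical, and using `0` as the placeholder for a missing combination is an assumption made for this illustration.

```python
# Hypothetical (row value, column value) pairs.
orders = [('Ana', 'Yes'), ('Ana', 'No'), ('Ben', 'Yes'), ('Ana', 'Yes')]

def pivot_counts(pairs):
    """Count how often each (row value, column value) pair occurs."""
    row_labels = sorted({row_value for row_value, _ in pairs})
    column_labels = sorted({column_value for _, column_value in pairs})
    # Start every cell at 0 so missing combinations still get a cell.
    grid = {r: {c: 0 for c in column_labels} for r in row_labels}
    for row_value, column_value in pairs:
        grid[row_value][column_value] += 1
    return grid

grid = pivot_counts(orders)
for name, counts in grid.items():
    print(name, counts)
```

Ben has no `'No'` orders, yet the grid still contains a cell for that combination. A grouping by both values would have no entry for it.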

For example, run the following cell to see the pivot table showing the number of purchases made by each customer through an advertisement or not. Notice that the `'Ad'` values appear as column labels.

In [None]:
customer_data.pivot('Ad', 'Name')

You can go further by specifying a third and fourth argument: `values` and `collect`. `values` identifies the column in the table to which the `collect` function will be applied in order to create the values inside the pivot table.

---

### Task 04 📍

Using the `customer_data` table, create a pivot table `ad_name_pivot` that has a row for each customer name, a column for the `'Yes'` and `'No'` `'Ad'` values, and the average purchase amount for each pairing of customers and purchases made through advertisements or not. You won't need to reduce the `customer_data` table like you needed to do in previous tasks. This is another benefit of the `pivot` method!

**Hint**: Use `'Amount ($)'` for the values and `np.average` for the collect function.

In [None]:
ad_name_pivot = ...
ad_name_pivot

In [None]:
grader.check("task_04")

As you can see there are a lot of ways to organize and summarize data in a table. You'll gain a lot of experience with using the tools in this lab throughout the semester.

---

## Nachos and Conditionals

In Python, the Boolean data type contains only two unique values: `True` and `False`. Expressions containing comparison operators such as `<` (less than), `>` (greater than), and `==` (equal to) evaluate to Boolean values. A list of common comparison operators can be found below!

| Comparison            | Operator | True example | False example |
|-----------------------|----------|--------------|---------------|
| Less than             | <        | 2 < 3        | 2 < 2         |
| Greater than          | >        | 3 > 2        | 3 > 3         |
| Less than or equal    | <=       | 2 <= 2       | 3 <= 2        |
| Greater than or equal | >=       | 3 >= 3       | 2 >= 3        |
| Equal                 | ==       | 3 == 3       | 3 == 2        |
| Not equal             | !=       | 3 != 2       | 2 != 2        |

Run the cell below to see an example of a comparison operator in action.

In [None]:
3 > 1 + 1

We can even assign the result of a comparison operation to a variable.

In [None]:
result = 10 / 2 == 5
result

Arrays are compatible with comparison operators. The output is an array of boolean values.

In [None]:
make_array(1, 5, 7, 8, 3, -1) > 3

One day, when you come home after a long week, you see a hot bowl of nachos waiting on the dining table! Let's say that whenever you take a nacho from the bowl, it will either have only **cheese**, only **salsa**, **both** cheese and salsa, or **neither** cheese nor salsa (a sad tortilla chip indeed). 

Let's try and simulate taking nachos from the bowl at random using the function, `np.random.choice(...)`.

---

### `np.random.choice`

`np.random.choice` picks one item at random from the given array. It is equally likely to pick any of the items. Run the cell below several times, and observe how the results change.

In [None]:
nachos = make_array('cheese', 'salsa', 'both', 'neither')
np.random.choice(nachos)

To repeat this process multiple times, pass in an int `n` as the second argument to return `n` random choices. By default, `np.random.choice` samples **with replacement**, so the same item can be chosen more than once, and it returns an *array* of items.

Run the next cell to see an example of sampling with replacement 10 times from the `nachos` array.

In [None]:
np.random.choice(nachos, 10)

To count the number of times a certain type of nacho is randomly chosen, we can use `np.count_nonzero`.

---

### `np.count_nonzero`

`np.count_nonzero` counts the number of non-zero values that appear in an array. When an array of Boolean values is passed to the function, it counts the number of `True` values. (Remember that in Python, `True` is coded as 1 and `False` is coded as 0.)
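The claim that `True` is coded as 1 can be checked directly in plain Python. The short sketch below uses the built-in `sum` on a list as a stand-in for illustration only; it is not a replacement for `np.count_nonzero`, and the name `flags` is hypothetical.

```python
# A hypothetical list of Boolean values.
flags = [True, False, False, True, True]

# Booleans behave like the integers 1 and 0 in comparisons and arithmetic.
print(True == 1, False == 0)

# Adding them up therefore counts the True values.
number_true = sum(flags)
print(number_true)
```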

Run the next cell to see an example that uses `np.count_nonzero`.

In [None]:
np.count_nonzero(make_array(True, False, False, True, True))

---

### Task 05 📍

Assume we took ten nachos at random, and stored the results in an array called `ten_nachos` as done below. Find the number of nachos with only cheese using code (do not hardcode the answer).  

*Hint:* Our solution involves a comparison operator (e.g. `==`, `<`, ...) and the `np.count_nonzero` function.


In [None]:
ten_nachos = make_array('neither', 'cheese', 'both', 'both', 'cheese', 'salsa', 'both', 'neither', 'cheese', 'both')
number_cheese = ...
number_cheese

In [None]:
grader.check("task_05")

---

### Conditional Statements

A conditional statement is a multi-line statement that allows Python to choose among different alternatives based on the truth value of an expression.

Here is a basic example.

```
def sign(x):
    if x > 0:
        return 'Positive'
    else:
        return 'Negative'
```

If the input `x` is greater than `0`, we return the string `'Positive'`. Otherwise, we return `'Negative'`.

If we want to test multiple conditions at once, we use the following general format.

```
if <if expression>:
    <if body>
elif <elif expression 0>:
    <elif body 0>
elif <elif expression 1>:
    <elif body 1>
...
else:
    <else body>
```

Only the body for the first conditional expression that is true will be evaluated. Each `if` and `elif` expression is evaluated and considered in order, starting at the top. As soon as a true value is found, the corresponding body is executed, and the rest of the conditional statement is skipped. If none of the `if` or `elif` expressions are true, then the `else body` is executed. 
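The "first true expression wins" rule can be seen in a small sketch. The function name `size_label` and its thresholds are hypothetical, chosen only for illustration.

```python
def size_label(n):
    """Return a label for n; only the first branch whose test is true runs."""
    if n > 100:
        return 'large'
    elif n > 10:
        return 'medium'
    else:
        return 'small'

# 500 is greater than both 100 and 10, but only the first matching body runs.
print(size_label(500))
print(size_label(50))
print(size_label(3))
```

Because the expressions are checked from the top down, the order of the branches matters: swapping the first two tests would label 500 as `'medium'`.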

For more examples and explanation, refer to the [section on conditional statements in the course textbook](https://ccsf-math-108.github.io/textbook/chapters/09/1/Conditional_Statements.html).

---

### Task 06 📍

Complete the following conditional statement so that the string `'More please'` is assigned to the variable `say_please` if the number of nachos with cheese in `ten_nachos` is less than `5`.

*Hint*: You should be using `number_cheese` from a previous task.


In [None]:
say_please = '?'

if ...:
    say_please = 'More please' 

say_please

In [None]:
grader.check("task_06")

---

### Task 07 📍

Write a function called `nacho_reaction` that returns a reaction (as a string) based on the type of nacho passed in as an argument. Use the table below to match the nacho type to the appropriate reaction.

<img src="./nacho_reactions.png">

*Hint:* If you're failing the test, double check the spelling of your reactions.


In [None]:
def nacho_reaction(nacho):
    if nacho == "cheese":
        ...
    ... :
        ...
    ... :
        ...
    ... :
        ...


spicy_nacho = nacho_reaction('salsa')
spicy_nacho

In [None]:
grader.check("task_07")

---

### Task 08 📍

Create a table `ten_nachos_reactions` that consists of the nachos in `ten_nachos` as well as the reactions for each of those nachos. The columns should be called `Nachos` and `Reactions`.

*Hint:* Use the `apply` method. 


In [None]:
ten_nachos_tbl = Table().with_column('Nachos', ten_nachos)
...
ten_nachos_reactions

In [None]:
grader.check("task_08")

---

### Task 09 📍

Using code, find the number of 'Wow!' reactions for the nachos in `ten_nachos_reactions`.


In [None]:
number_wow_reactions = ...
number_wow_reactions

In [None]:
grader.check("task_09")

---

## Simulations and For Loops

Using a `for` statement, we can perform a task multiple times. This is known as iteration.

One use of iteration is to loop through a set of values. For instance, we can print out all of the colors of the rainbow.

In [None]:
rainbow = make_array("red", "orange", "yellow", "green", "blue", "indigo", "violet")

for color in rainbow:
    print(color)

We can see that the indented part of the `for` loop, known as the body, is executed once for each item in `rainbow`. The name `color` is assigned to the next value in `rainbow` at the start of each iteration. Note that the name `color` is arbitrary; we could easily have named it something else. The important thing is we stay consistent throughout the `for` loop. 

In [None]:
for another_name in rainbow:
    print(another_name)

In general, however, we would like the variable name to be somewhat informative. 
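A common pattern is to set up a name before the loop and update it inside the body, so that it accumulates a result across iterations. The sketch below is one minimal illustration, using a plain Python list in place of an array; the name `total_letters` is hypothetical.

```python
rainbow = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]

# Start the running total before the loop begins.
total_letters = 0

for color in rainbow:
    # Add the length of the current color name to the running total.
    total_letters = total_letters + len(color)

print(total_letters)
```

The starting value is assigned once, outside the loop. If it were assigned inside the body, it would be reset on every iteration.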

---

### Task 10 📍

In the following cell, we've loaded the text of _Pride and Prejudice_ by Jane Austen, split it into individual words, and stored these words in an array `p_and_p_words`. Using a `for` loop, assign `longer_than_five` to the number of words in the novel that are more than 5 characters long.

*Hint*: You can find the number of letters in a word with the `len` function.

*Note*: You should expect "words" like `"About` to be included in this collection of words that are more than 5 characters long because the quotation mark is not removed by the `split` method.


In [None]:
austen_string = open('Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
p_and_p_words = np.array(austen_string.split())

longer_than_five = ...

# a for loop would be useful here



longer_than_five

In [None]:
grader.check("task_10")

---

### Task 11 📍

Using a simulation with 10,000 trials, assign `num_different` to the number of trials in which two words picked uniformly at random (with replacement) from _Pride and Prejudice_ have different lengths.

*Hint 1*: What function did we use in the Nachos and Conditionals section to sample at random with replacement from an array?

*Hint 2*: Remember that `!=` checks for non-equality between two items.


In [None]:
trials = 10000
num_different = ...

for ... in ...:
    ...

num_different

In [None]:
grader.check("task_11")

---

## Submit Your Assignment to Canvas

Follow these steps to submit your lab assignment:

1. **Check the Assignment Completion Requirements:** This assignment is scored as Complete or Incomplete. Make sure to check with your instructor about their requirements for a Complete score. 
2. **Run the Auto-Grader:** Ensure you have executed the code cell containing the command `grader.check_all()` to run all tests for auto-graded tasks marked with 📍. This command will execute all auto-grader tests sequentially.
3. **Complete Manually Graded Tasks:** Verify that you have responded to all the manually graded tasks marked with 📍🔎.
4. **Save Your Work:** In the notebook's Toolbar, go to `File -> Save Notebook` to save your work and create a checkpoint.
5. **Download the Notebook:** In the notebook's Toolbar, go to `File -> Download HTML` to download the HTML version (`.html`) of this notebook.
6. **Upload to Canvas:** On the Canvas Assignment page, click "Start Assignment" or "New Attempt" to upload the downloaded `.html` file.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> course offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()