## [ESPM-136] Notebook 3: Interactive Data Exploration Assignment

Welcome to the third Jupyter Notebook for ESPM 136! In the previous notebook, you went through a demo of some data analysis that we were able to do with the emissions data that was collected. In this assignment, you'll be **completing some hands-on exercises for performing data analysis based on what we did in Notebook 2!**

In order to complete and submit this assignment, go through the notebook and **answer the questions denoted by the yellow boxes**. At the end of the notebook, there will be a cell that creates and downloads a PDF of your responses, which you can then submit.

## Feeling stuck or want extra help with this notebook? Contact Data Peer Consultants!

If you find yourself having trouble with any content in this notebook, **Data Peer Consultants** are an excellent resource! Click [here](https://dlab.berkeley.edu/training/frontdesk-info) to locate live help.
Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley.

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## Learning Outcomes
Working through this notebook, you will get some experience answering questions about:
1. Basic table functions for **viewing and conditionally selecting data**
2. More advanced table functions for **performing detailed data analysis**
3. **Creating visualizations** to better understand our data

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## Recap of What We've Done So Far

In Assignment #1 you **gathered carbon emissions data and evaluated internal decarbonization strategies for two companies of your choice** and submitted all of this information into a Google Form. In Notebook 2, we introduced many different methods of **using Jupyter Notebooks to analyze the data** that you collected.

This notebook is based heavily on the content you saw in Notebook 2, but this time, you will get an opportunity to **gain some more hands-on experience and do some analysis yourself!**

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## Importing Modules and Taking an Initial Look at the Data

As stated previously, in this notebook we'll be utilizing data exported from the Google Form that you filled out based on company emissions. From the Google Form, we were able to export the results as a CSV (comma-separated value) file. We cleaned up and anonymized the data, and we uploaded it to the same folder that this notebook is located in. Thanks to this pre-processing, we can now load in our data to a table object from the `datascience` module!

As with the first notebook, we need to import the modules necessary for this notebook. You can do this by running the following code cell: click the cell and press `Cmd/Ctrl`+`Enter`, or click the cell and then click the `Run` button at the top!

In [1]:
# Run this cell to import the modules
import numpy as np
import pandas as pd
from datascience import *
import otter
grader = otter.Notebook()
import matplotlib.pyplot as plt
from ipywidgets import *
%matplotlib inline
plt.style.use("fivethirtyeight")
import seaborn as sns
import plotly.express as px

Now that we have all the modules we need, we can load our original CSV file into a table object from the `datascience` module!

The `.show()` command in the cell below allows us to see a specified number of rows of the table. In this case, we want to see the first five rows of the table to see what it looks like.

In [2]:
# Run this cell to create our emissions table object
emissions = Table.read_table('emissions.csv')
emissions.show(5)

|Index|Company Name|Company Sector|Company Sector (Other)|Year of CDP Disclosure|Scope 1 value|Scope 2 value (location-based)|Scope 2 value (market-based)|Scope 3 value|Total Revenue|Currency of Total Revenue|Internal Price on Carbon (Y/N)|Price|
|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|ADIDAS|APPAREL & FOOTWEAR|nan|2022|12908.4|nan|125502.0|7254510.0|21234000000.0|EUR|Yes|85.0|
|1|ADIDAS|APPAREL & FOOTWEAR|nan|2022|12908.4|nan|125502.0|7254510.0|21234000000.0|EUR|Yes|85.0|
|2|ADIDAS|APPAREL & FOOTWEAR|nan|2022|12908.4|nan|125502.0|7254510.0|21234000000.0|EUR|Yes|85.0|
|3|AHOLD DELHAIZE|CONSUMER PACKAGED GOODS|GROCERY / FOOD DISTRIBUTION|2022|1728000.0|1748000.0|1099000.0|65930400.0|75600.0|EUR|Yes|150.0|
|4|AIR FRANCE-KLM|TRANSPORTATION|nan|2022|16336300.0|nan|19104.7|8700670.0|14315000000.0|EUR|Yes|nan|


In the view of our table above, particularly in the `Company Sector (Other)` and `Scope 2 value (location-based)` columns, you may see multiple values called `nan`. In Python, missing or undefined values are represented by `nan` values, which stands for "Not a Number". You can think of these as missing or N/A values!
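
To make the idea of `nan` a bit more concrete, here is a minimal sketch in plain Python. The variable name `missing` is made up for illustration and is not part of the `emissions` table.

```python
import math

# A hypothetical missing value, the same kind of value shown as `nan` in a table
missing = float("nan")

# nan is never equal to anything, including itself, so == cannot be used to find it
print(missing == missing)       # False

# math.isnan is one way to check for a missing value
print(math.isnan(missing))      # True

# Arithmetic with nan gives nan again, so missing values carry through calculations
print(math.isnan(missing + 1))  # True
```

This is why calculations on columns with missing entries can themselves come out as `nan`.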

Another small note on notation that you may notice in the table above and throughout this notebook: in some of the numerical columns, you may see the number followed by a value like `e+06` or `e+10`. This notation is Python's way of displaying very large or very small numbers. `e+` in this case means a positive power of 10 (`e-` would be a negative power, which would be a way of representing very small decimal numbers). So, when you see a number like `e+06`, this means $10^6$, which is a million. In the table above, `7.25451e+06` means $7.25451 * 10^6$, which is actually equal to $7,254,510$. This notation is not the easiest for us to read, but it is a very useful method for Python to store large and small numbers simply without losing any information!
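
If you'd like to convince yourself that the two notations describe the same number, the short sketch below compares them in plain Python. The variable name `scope_3` is only for illustration.

```python
# Scientific notation is valid Python syntax for writing a number
scope_3 = 7.25451e+06

# It is exactly the same value as the fully written-out number
print(scope_3 == 7254510.0)   # True

# Format the number with commas and no decimal places
print(f"{scope_3:,.0f}")      # 7,254,510

# Format it back into scientific notation with 5 decimal places
print(f"{scope_3:.5e}")       # 7.25451e+06

# A negative exponent represents a small decimal number
print(1e-3)                   # 0.001
```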

As we saw in Notebook 1, we can use the `.num_rows` and `.num_columns` attributes to take a look at the dimensions of our table. Because they are attributes rather than methods, we do not put parentheses after them.

In [None]:
emissions.num_rows

In [None]:
emissions.num_columns

We have 13 different columns and 156 rows. Each column corresponds to a feature of each data point, as described in further detail below. Each row corresponds to one of the two companies that each student submitted via the Google Form. The total number of rows ended up being 156 after we removed a few duplicate submissions to the Google Form.

The following code cell is a bit more advanced, but we can use it to show us the number of distinct companies that were reported in our table.

In [None]:
len(emissions.group('Company Name')['count'])

## Data Dictionary

As a reminder, here's the data dictionary with the contents of the `emissions` table that we introduced in the last notebook.

|Column Name| Meaning |Type|
|--|--|--|
|Company Name| The name of the company chosen to report data| category |
|Company Sector| The industry sector the company belongs in| category |
|Company Sector (Other)|An optional field if the company's sector was not one of the given options|category |
|Year of CDP Disclosure|The submission year of the report. The data provided represents the previous year.|number|
|Scope 1 value|The company's direct GHG emissions, given in metric tons of CO2e|number|
|Scope 2 value (location-based)|The company's indirect (location-based) GHG emissions, given in metric tons of CO2e |number |
|Scope 2 value (market-based)|The company's indirect (market-based) GHG emissions, given in metric tons of CO2e|number |
|Scope 3 value| The company's indirect GHG emissions that come from its value chain, given in metric tons of CO2e| number |
|Total Revenue| The company's total revenue for a given year| number |
|Currency of Total Revenue|The currency of the company's total revenue|category|
|Internal Price on Carbon (Y/N)|An indicator of whether the company implements internal carbon pricing|category|
|Price|The company's internal price on carbon|number|

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## 1. Basic Table Operations: `select`, `where`, `sort`

As we saw in **Notebook 2**, there are some interesting table operations that allow us to begin exploring and analyzing our data. We showed you some examples in the previous notebook, and now it's your turn to try out these methods! 

For the exercises below, you will fill in the necessary code to create the desired tables. If you get stuck, feel free to reference the code in Notebook 2.

**Some Reminders:**
* **Text values/column names should be in quotation marks**, while integers (numbers) should not!
* Run your code cells after you answer the exercises.

### 1.1 Selecting Columns

Sometimes, we only want to keep certain information from our table. In the question below, you will be using the `.select()` method to choose which columns of the `emissions` table we want.

For reference, to use the `.select()` method it should look something like: `tbl.select(col1_name, col2_name, ...)` for however many columns we want our table to have.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 1.1:
In the cell below, use `.select()` to create a table with the company's name, their year of disclosure, and their total revenue. <br> <br>
    
*Hint:* If you're running into an error, make sure to put the column name in quotation marks!
</div>

In [None]:
# Replace ... with the necessary column name(s)
emissions.select(...)

<!-- END QUESTION -->

### 1.2 Conditioning on Rows

When we want to focus on rows in which the data meets some condition, we can use the `.where()` method. In the exercise below, you will use the `.where()` method to create a table that displays **only** rows that meet a certain condition.

For reference, to use the `.where()` method it should be of the form: `tbl.where(col_name, predicate_function)`, where the `predicate_function` describes the criterion that the column's values should meet.

Here are some helpful predicate functions you *may* need that describe the criterion the column must meet:

|Predicate|Example|Result|
|-|-|-|
|`are.equal_to`|`are.equal_to(50)`|Find rows with values equal to 50|
|`are.not_equal_to`|`are.not_equal_to(50)`|Find rows with values not equal to 50|
|`are.above`|`are.above(50)`|Find rows with values above (and not equal to) 50|
|`are.above_or_equal_to`|`are.above_or_equal_to(50)`|Find rows with values above 50 or equal to 50|
|`are.below`|`are.below(50)`|Find rows with values below 50|
|`are.between`|`are.between(2, 10)`|Find rows with values above or equal to 2 and below 10|
|`are.between_or_equal_to`|`are.between_or_equal_to(2, 10)`|Find rows with values above or equal to 2 and below or equal to 10|


<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 1.2:
In the cell below, use `.where()` to create a table that only includes companies whose `Total Revenue` ranges from 50,000 to 3,000,000 in their respective currency.
</div>

In [None]:
# Replace ... with the appropriate column name and condition
emissions.where(..., ...) 

<!-- END QUESTION -->

### 1.3 Sorting Values

Sorting values can reveal interesting patterns. In the exercise below, you will use the `.sort()` method to sort a specific column's values in either ascending or descending order. 

For reference, to use the `.sort()` method it should look something like: `tbl.sort(col1_name)` to sort in **ascending order** and `tbl.sort(col1_name, descending = True)` to sort in **descending order**.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 1.3.1:
Using `.sort()`, sort the `emissions` table to be able to find the company with the highest total revenue (i.e. sort total revenue from highest to lowest). <br> <br>
    
Then, in the text cell below, write about any interesting trends you notice for the top few firms. *Hint:* Look at the `Currency of Total Revenue` column!
</div>

In [4]:
# Replace ... with the necessary code
emissions.sort(..., ...)

*Type your answer here. Double-click to edit this cell and replace this text with your answer. Run this cell to proceed when finished.*

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 1.3.2:
Using `.sort()`, sort the `emissions` table to find the company with the lowest `Scope 1 value` (i.e. sort `Scope 1 value` from lowest to highest).
</div>

In [None]:
# Replace ... with the necessary code
emissions.sort(...)

<!-- END QUESTION -->

### 1.4 Putting it all together

<div class="alert alert-block alert-success">
    <p style="font-size:20px">This section is advanced/optional</p>
</div>

You've worked with `.select()`, `.where()`, and `.sort()` individually, but you can also use them together. In the exercise below, you will fill in the necessary code to create the desired table using all three methods.

**This section deals with chaining methods together. This is a more complex topic, so it's okay if you don't fully understand it. The exercise below is optional. Feel free to try it out if you want!**

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 1.4 (Optional):
Use `.select()`, `.where()`, and `.sort()` to create a table that displays the company name, their sector (including the `Company Sector (Other)` column), and their total revenue, **only** for companies whose total revenue is below 100,000,000 in their respective currency. Sort this table by total revenue in ascending order.
</div>

In [None]:
# Replace ... with the necessary column names and conditions
(emissions.select(...)
            .where(..., ...)
            .sort(...))

<!-- END QUESTION -->

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## 2. More Advanced Table Operations: `apply`, `group`, `pivot`

Now that you have done some exercises with basic table operations, let's think back to the advanced table operations we learned about in the previous notebook and try them out on our own!

For the exercises below, you will fill in the necessary code to create the desired tables. If you get stuck, feel free to reference the code in Notebook 2.

**Reminder:** Run your code cells after you fill in the blanks and see if they return the expected result.

### 2.1 Apply

In Notebook 2, we got some experience applying different types of functions to a column in our dataset using the `.apply()` method.

In the following question, we'll use `.apply()` to clean up the presentation of the values in the `Scope 3 value` column. For reference, the `.apply()` method looks something like this: `tbl.apply(function_to_apply, 'column_name')`.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 2.1:
The numbers in the `Scope 3 value` column look messy! We have decided that we only care about each value in millions. Let's use `.apply()` to help us add a new column called "Scope 3 value (Millions)" that contains every `Scope 3 value` expressed as a whole number of millions (rounded down). <br> <br>

Note that we have already coded the `make_million` function in the cell below.
</div>

In [None]:
# The make_million function is written below
def make_million(scope_value):
    return scope_value // (10 ** 6)

In [None]:
# Replace ... with the appropriate function and column name
million = emissions.apply(..., ...)

# Here, we add a new column to our dataset with the value you computed -- you don't need to do anything here.
emissions['Scope 3 value (Millions)'] = million
emissions.show(5)

<!-- END QUESTION -->

### 2.2 Group

Notebook 2 showed us some examples of how useful it can be to group by categorical information such as `Company Sector`. In the following question, you'll get to explore the data a bit on your own by grouping the data by a column of your choice! For reference, the `.group()` method looks something like this: `tbl.group('column_to_group_by', aggregation_function)`.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 2.2
Group the data by whatever column you would like in order to find an intriguing trend. Then, explain what you did and what you found interesting about it. Look at what we did in Notebook 2 for inspiration! <br> <br>

*Hint:* Ask yourself a question you might be interested in looking at, such as: how many companies use EUR? What is the `Scope 1 value` for each company? Be creative and explore anything that piques your interest!
</div>

In [None]:
# Replace ... with your column name and aggregation function
emissions.group(..., ...)

*Type your answer here. Double-click to edit this cell and replace this text with your answer. Run this cell to proceed when finished.*

<!-- END QUESTION -->

### 2.3 Pivot

<div class="alert alert-block alert-success">
    <p style="font-size:20px">This section is advanced/optional</p>
</div>

The following `.pivot()` question is optional, but pivot tables are a useful tool to have! If you have time and want to try it out, give it a shot below. Be sure to refer back to how we used `.pivot()` in Notebook 2. For reference, the `.pivot()` method looks something like this: `tbl.pivot('column_for_new_columns', 'column_for_new_rows')`. The unique values of the first column become the column headers of the new table, and the unique values of the second column become its rows.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 2.3 (Optional):
Notice how some companies have values for the internal price on carbon and some don't? Let's use the `.pivot()` method to see this breakdown for each `Company Name`. <br> <br>

*Hint:* Look back to how the `.pivot()` method changed the presentation of the tables in Notebook 2 if you feel stuck!
</div>

To accomplish this task, we need to do some manipulation of the data using the `pandas` module in the following cell. You don't have to understand how this code works, but essentially, it removes the duplicate rows so that repeated submissions of the same company are only counted once. Otherwise, our pivot table would count the same company multiple times!

In [None]:
# We make a new dataset by removing all the duplicates
emission = pd.read_csv('emissions.csv')
emission = emission.drop("Index", axis = 'columns').drop_duplicates()
emission.index.name = 'Index'
emission.to_csv('emissions_distinct.csv')
emissions_distinct = Table.read_table('emissions_distinct.csv')
emissions_distinct.show(5)

We then use this new `emissions_distinct` table to perform the pivot operation for this question.

In [None]:
# Replace ... with the appropriate column names
emissions_distinct.pivot(..., ...)

<!-- END QUESTION -->

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## 3. Visualizations

In this section, you'll get some practice **creating and analyzing some visualizations of the data that we saw for the first time in Notebook 2!** Be sure to reference that notebook, as well as Notebook 1 (which briefly introduced some of these methods) if you're not sure how to answer a question. 

### 3.1 Basic `datascience` Visualizations

In the previous notebook, we saw that we could create visualizations using multiple modules in Python. The only coding questions we'll be asking you in this section will be based on the simpler visualizations we can create with the `datascience` module.

For the first question, we'll be creating a scatter plot with the `datascience` module. As a reminder, the `.scatter()` method looks something like this: `tbl.scatter('column_on_x_axis', 'column_on_y_axis')`.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 3.1
Create a scatter plot visualizing the relationship between `Scope 3 value` and `Total Revenue`.
</div>

In [None]:
# Replace the ellipsis with the columns for Scope 3 value and Total Revenue
emissions.scatter(..., ...)

<!-- END QUESTION -->

Next, let's practice making a bar chart to visualize the relationship between a categorical column and a numeric one. As a reminder, in the `datascience` module, the `.barh()` method looks as follows: `tbl.barh('categorical_column', 'numeric_column')`. Let's utilize it to create a bar chart of the counts of different currencies used.

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 3.2
Create a bar chart visualizing the relationship between `Currency of Total Revenue` and the number of companies that use it in the dataset. <br> <br>
    
*Hint:* We've already filled out the numeric column `'count'` for you. Replace the ellipses below with the categorical column we want to group the data by! Both sets of ellipses should be replaced with the same column name.
</div>

In [None]:
# Replace the ellipsis with the categorical column we want to visualize
emissions.group(...).barh(..., 'count')

<!-- END QUESTION -->

### 3.2 Analysis with More Advanced Visualizations

In this section, you'll be **answering some short-answer questions based on some visualizations we create for you.** Hopefully this will give you a better insight into the types of trends we look for when doing exploratory data analysis, as well as the different types of questions we have to ask ourselves.

First, we'll take a look at an example of a histogram. As we discussed in Notebook 1, a histogram is a type of graph that displays the distribution of a numeric variable.

In [None]:
df = pd.read_csv('emissions.csv')
sns.histplot(df['Scope 1 value']);

In the above graph, we're looking at the distribution of the `Scope 1 value` across all of the companies in the dataset. In histograms like this one, each vertical bar covers a range of `Scope 1 value`s, and the height of the bar represents the number of companies whose value falls within that range. Taller bars mean more companies.
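
To see where the bar heights come from, here is a minimal sketch that computes histogram counts by hand with `numpy`. The `values` array and the choice of 4 bins are made up for illustration; the plot above picks its own bins automatically.

```python
import numpy as np

# Hypothetical values, not the emissions data
values = np.array([1, 2, 2, 3, 3, 3, 9])

# Split the range from 1 to 9 into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin edges at 1, 3, 5, 7, and 9
print(counts)  # 3 values in [1, 3), 3 in [3, 5), 0 in [5, 7), and 1 in [7, 9]
```

Each count would be the height of one bar. Here most values sit in the two leftmost bins, with a single large value far to the right.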

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 3.3
How would you describe the distribution of the `Scope 1 value` across the companies in the dataset? Are the values concentrated in one area? Is the distribution skewed towards one side, or is it symmetric? Feel free to discuss anything else you find interesting.
</div>

*Type your answer here. Double-click to edit this cell and replace this text with your answer. Run this cell to proceed when finished.*

<!-- END QUESTION -->

The final graph we'll be looking at comes directly from Section 3.3 of Notebook 2, where we utilized a second CSV of data specific to the Food & Agriculture sector. The interactive graph below plots the sum of the Scope 1 and Scope 2 values for multiple companies across the years 2018 to 2021.

In [None]:
food_ag = pd.read_csv('foodag.csv')
companies = food_ag['COMPANY NAME'].unique()
years = ['2018', '2019', '2020', '2021']

@interact(x = widgets.Dropdown(options = list(companies), value = 'Danone'))
def g(x):
    emissions = food_ag[food_ag['COMPANY NAME'] == x]["TOTAL EMISSIONS"] / 1000
    bar_container = plt.bar(years, emissions);
    plt.ylim(0, max(emissions) + 5000)
    plt.xlabel('Year')
    plt.ylabel('Total Scope 1 & Scope 2 Emissions (thousand mt CO2e)', size = 11)
    plt.title(x + ' GHG Emissions (2018-2021)', size = 15)
    plt.bar_label(bar_container, fmt='{:,.0f}')
    plt.grid(False)

<!-- BEGIN QUESTION -->
<div class="alert alert-warning">

### Question 3.4
Pick 1-2 companies using the dropdown menu in the visualization above, and describe any trends over the years that you notice (if you pick 2 companies, you can also compare and contrast the trends). You can also do your own research into information you can find regarding your chosen company's steps to decarbonize in recent years. Do the trends make sense? Is there anything that surprises you? Feel free to discuss whatever interests you the most.
</div>

*Type your answer here. Double-click to edit this cell and replace this text with your answer. Run this cell to proceed when finished.*

<!-- END QUESTION -->

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## How to Submit This Assignment

**Make sure that you've answered all the questions.**

Follow these steps: 
1. Go to `File` in the menu bar, then select `Save and Checkpoint` (or press `Ctrl` + `S` on Windows / `Cmd` + `S` on Mac on the keyboard).
2. Go to `Cell` in the menu bar, then select `Run All`.
3. Click the link produced by the code cell below.
4. Submit the downloaded PDF according to your professor's instructions.

**Note:** If clicking the link below doesn't work for you, don't worry! Simply click `File` in the menu, find `Download As`, and choose `PDF via LaTeX (.pdf)` to save a copy of your PDF onto your computer. Alternatively, you can also right click the link and save the link content as a PDF.

**Check the PDF before submitting and make sure all of your answers and any changes are shown.**

In [None]:
# Run this cell
# This may take a few extra seconds.
from otter.export import export_notebook
from IPython.display import display, HTML
export_notebook("ESPM-136_Notebook3_Assignment.ipynb", filtering=True, pagebreaks=False)
display(HTML("<p style='font-size:20px'> <br>Save this notebook, then click <a href='ESPM-136_Notebook3_Assignment.pdf' download>here</a> to open the PDF.<br></p>"))

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## Conclusion

In this notebook, you've learned quite a bit! Below is a summary of the methods you got to practice implementing:
- Basic Table Operations 
    - Select
    - Sort
    - Where (Conditioning)
- More Advanced Table Operations
    - Apply
    - Group
    - Pivot
- Visualizations
    - Basic data visualizations with `datascience`
    - More advanced visualizations with `Seaborn` and `Matplotlib`

<h3>Congratulations on finishing the notebook!</h3>

<hr style="border: 2px solid #003262">
<hr style="border: 2px solid #C9B676">

## A Final Request: Feedback Form

<div class="alert alert-info">
<b> We encourage you to fill out the following feedback form to share your experience with this Modules notebook. The form will take no longer than 5 minutes. At UC Berkeley Data Science Undergraduate Studies Modules, we appreciate all feedback that helps us improve students' learning and their experience using Jupyter Notebooks for Data Science Education: </b> 
</div>

# [UC Berkeley Data Science Feedback Form](https://forms.gle/hPgYMxFWKXH2sVkd7)