<img style="float: left;" src="earth-lab-logo-rgb.png" width="150" height="150" />

# Earth Analytics Education

## Important - Assignment Guidelines

1. Before you submit your assignment to GitHub, make sure to run the entire notebook with a fresh kernel. To do this, **restart the kernel** and rerun every cell (in the menubar, select Kernel$\rightarrow$Restart & Run All).
2. Always replace the `raise NotImplementedError()` code with your code that addresses the activity challenge. If you don't replace that code, your notebook will not run.

```
# YOUR CODE HERE
raise NotImplementedError()
```

3. Any open-ended questions will have a "YOUR ANSWER HERE" within a markdown cell. Replace that text with your answer, also formatted using Markdown.
4. **DO NOT RENAME THIS NOTEBOOK File!** If the file name changes, the autograder will not grade your assignment properly.

* Only include the package imports, code, and outputs that are required to run your homework assignment.
* Be sure that your code can be run on any operating system. This means that:
   1. the data should be downloaded in the notebook to ensure it's reproducible
   2. all paths should be created dynamically using `os.path.join`
   3. sort lists of dated files even if they are sorted correctly by default on your machine
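
The last two points can be sketched as follows. This is a minimal illustration only: the file names are made up, and no files are read or created.

```python
import os

# Hypothetical file names, deliberately listed out of date order
file_names = ["fires_2018-03.csv", "fires_2017-11.csv", "fires_2018-01.csv"]

# Build each path from its parts so the separator matches the operating system
data_dir = os.path.join("earth-analytics", "data")
file_paths = [os.path.join(data_dir, name) for name in file_names]

# Sort explicitly instead of relying on the order the file system returns
sorted_paths = sorted(file_paths)

for path in sorted_paths:
    print(path)
```

Sorting by name only gives date order when the dates are written year first, as in these made-up names.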

## Follow PEP 8 Syntax Guidelines & Documentation

* Run the `autopep8` tool on all cells prior to submitting (HINT: hit shift + the tool to run it on all cells at once!)
* Use clear and expressive names for variables. 
* Organize your code to support readability.
* Check for code line length
* Use comments and white space sparingly, only where they are needed
* Make sure all python imports are at the top of your notebook and follow PEP 8 order conventions
* Spell check your Notebook before submitting it.

For all of the plots below, be sure to do the following:

* Make sure each plot has a clear TITLE and, where appropriate, label the x and y axes. Be sure to include UNITS in your labels.


### Add Your Name Below 
**Your Name:**

<img style="float: left;" src="colored-bar.png"/>

---

# Assignment 4: Plotting in Python

To complete assignment 4, be sure you have reviewed Chapter 1 from the <a href="https://www.earthdatascience.org/courses/scientists-guide-to-plotting-data-in-python/" target="_blank">Scientist's Guide to Plotting Data in Python</a> online textbook, which introduces plotting in Python using matplotlib.   

**Read the instructions for each question carefully to successfully complete the required tasks.**


## PEP 8 Syntax and Clean Code

Be sure to follow PEP 8 syntax guidelines as you write your code. These guidelines include the following:
* Use clear and expressive variable names
* Organize your code to support readability
* Follow PEP 8 standards for code line length and spacing 
* Use comments sparingly to document important steps in your code
* Finally, use the `autopep8` tool as a check to apply PEP 8 syntax throughout your notebook (note that `autopep8` is not available on Google Colab, so you will need to either check manually or run the tool on your computer)

IMPORTANT: the `autopep8` tool will not fix all PEP 8 syntax issues but it 
will fix many of them. Be sure to double check your code prior to submitting
it!

If you need a reminder about what PEP 8 is, read our <a href="https://www.earthdatascience.org/courses/intro-to-earth-data-science/write-clean-expressive-code/intro-to-clean-code/python-pep-8-style-guide/" target="_blank">online textbook page on PEP 8 </a>.

## Fire Occurrence Data

### Data formats
For this assignment, you will use summarized data on fire occurrence in California from 2000 to 2018 provided by <a href="https://www.fs.usda.gov/rds/archive/Catalog/RDS-2013-0009.5" target="_blank">the United States Forest Service</a>. These data show the total number of annual fires and the mean size.

Go to <a href="https://www.fs.usda.gov/rds/archive/Catalog/RDS-2013-0009.5" target="_blank">the United States Forest Service data catalog page</a>. In the cell below, name each of the formats the data are available in, and whether or not they are open formats, **in a bulleted list**.


### Find the data url

You will use the `.sqlite` data format in this assignment. Right click on the link for the `.sqlite` data and paste it in the following cell **without any formatting**.


### Find the data table name

**Databases** like the `.sqlite` file you will download are structured as a collection of **tables**, where a table is equivalent to a pandas DataFrame. Table names are case-insensitive. To find the correct table name:

1. Click on the metadata site from the data catalog page
2. Navigate to the 'Entity and Attribute Information' section
3. Find the name of the "**Table including wildfire data for the period of 1992-2018 compiled from US federal, state, and local reporting systems.**"
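
To make the idea of a database as a collection of tables concrete, here is a small sketch that lists the tables in a SQLite database using Python's built-in `sqlite3` module. It builds a throwaway in-memory database with made-up table names rather than opening the fire data, so treat it as an illustration and not as a replacement for reading the metadata.

```python
import sqlite3

# Stand-in for a downloaded file: a throwaway in-memory database
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE example_fires (id INTEGER, year INTEGER)")
connection.execute("CREATE TABLE example_units (unit_id INTEGER)")

# SQLite records every table in its built-in sqlite_master catalog
rows = connection.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [row[0] for row in rows]
connection.close()

print(table_names)
```

Passing a file path to `sqlite3.connect` instead of `":memory:"` would let you cross-check the table name you find in the metadata.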
      
**Put the table name in the cell below** 


YOUR ANSWER HERE

### Cite the data
It is important to properly cite your data. Many datasets contain a recommended citation, including this one. Locate the recommended citation on the data catalog page, and copy it into the cell below.


YOUR ANSWER HERE

## Import Python Packages

In the cell below, replace `raise NotImplementedError()` with your code to:

1. import the package and module needed to create plots
2. import the os package: `import os`
3. import pandas: `import pandas as pd`
4. import earthpy: `import earthpy as et`

The test below will check to see that you imported pandas, matplotlib.pyplot, os, and earthpy.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# DO NOT MODIFY THIS CELL
# This cell will ensure that you imported the packages correctly

# Creating total points
import_answer_points = 0
# test that both modules imported - use duck typing
try:
    pd.NA
    print("\u2705 Score! Pandas has been imported as pd!")
    import_answer_points += 1
except NameError:
    print(("\u274C Pandas has not been imported as pd, please make sure "
           "to import it properly."))

try:
    plt.show()
    print("\u2705 Nice! matplotlib.pyplot has been imported as plt!")
    import_answer_points += 1
except NameError:
    print(("\u274C matplotlib.pyplot has not been imported as plt, please "
           "make sure to import it properly."))
    
try:
    os.getcwd()
    print("\u2705 Great work! The os module has imported correctly!")
    import_answer_points += 1
except NameError:
    print("\u274C Oops make sure that the os package is imported.")
    
try:
    data = et.io
    print("\u2705 Great work! The earthpy package has imported correctly!")
    import_answer_points += 1
except NameError:
    print(("\u274C Oops make sure that the earthpy package is imported "
           "using the alias et."))

print("You received {} out of 4 points.".format(import_answer_points))
import_answer_points


## Open the fire data as a pandas DataFrame
You need to load the data into Python before you can use it.

### Download the fire Data

You can download the file needed to complete this task using the earthpy package as follows:

`et.data.get_data(url="url-here")`

By default, earthpy will:

  * create an `~/earth-analytics/data` directory in your home directory that you will use to process your data all semester.
  * Download the data and unzip (if it's compressed) into an `~/earth-analytics/data/earthpy-downloads/` directory.

In the cell below, download the data using the `.sqlite` url you found above. **Replace url-here with the url you pasted above**. Remember that your line length must be less than 80 characters in Python - check out the PEP 8 standard for ideas of how to do this while following the style guide.

Notice that the first time you run the download, it takes longer than subsequent times, even if you close jupyter notebook and reopen it. That's because earthpy is handling something called **caching** for you. If you've already downloaded the data, then earthpy doesn't bother to download it a second time. **Caching** is a key application for **conditionals** in Python.
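
The caching idea can be sketched with a single conditional. This is not earthpy's actual implementation: `fetch_once` and `fake_download` are hypothetical helpers, and the "download" only writes a small placeholder file into a temporary directory.

```python
import os
import tempfile


def fetch_once(path, download):
    """Call download(path) only if path does not already exist."""
    if os.path.exists(path):
        return "cached"
    download(path)
    return "downloaded"


def fake_download(path):
    """Stand-in for a real download: write a small placeholder file."""
    with open(path, "w") as file:
        file.write("placeholder data")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "example.sqlite")
    first_call = fetch_once(target, fake_download)
    second_call = fetch_once(target, fake_download)

print(first_call, second_call)
```

The second call skips the slow step because the file is already on disk, which is the behavior you notice when you rerun the download cell.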

Make sure to follow the PEP 8 standard and add descriptive comments.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
download_sqlite_pts = 0

download_path = os.path.join(
    et.io.HOME, 'earth-analytics', 'data', 'earthpy-downloads',
    'RDS-2013-0009.5_SQLITE', 'Data', 'FPA_FOD_20210617.sqlite')

if os.path.exists(download_path):
    print("\u2705 Nice work! You downloaded the .sqlite file!")
    download_sqlite_pts += 2
else:
    print(("\u274C Check your url - your file didn't download as expected."))
    
download_sqlite_pts

### Set your working directory
1.  Set your working directory to the `earth-analytics/data` directory using the following syntax:

    `os.chdir(os.path.join(et.io.HOME, 'earth-analytics', 'data'))`

    > `os.path.join` is a function that will allow you to create file paths that can run on any machine (mac, windows or Linux). It is a good practice to use this when creating file paths in `Python`. 

2. Use the `ls` command in your terminal to find the path to the file you downloaded.
    
    > BONUS: look up how to use the `find` command to list files in the `earth-analytics/data` directory recursively. 

3. Use `os.path.join` again to define a variable that contains the **relative** path to your downloaded and unzipped file.

    > An **absolute** path starts from the root directory `/`, while a **relative** path starts from your current working directory. Since you ran `os.chdir()`, your current working directory is `~/earth-analytics/data` and your relative paths will start from there.

4. At the end of the cell, call the variable to ensure that the output prints in your notebook. The last line of your cell should look something like this:

    `path_variable_name`
    
The code will look something like the example below (except your **variable names** will be more expressive, and you will include **descriptive comments**!). Don't forget to also replace the path to the `.sqlite` file:

```python
# Comment here
os.chdir(os.path.join(et.io.HOME, 'earth-analytics', 'data'))

# Comment here
pth = os.path.join("earthpy-downloads", "path", "to", "downloaded.sqlite")

pth
```
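
If the difference between relative and absolute paths is still unclear, the following sketch may help. It uses a temporary directory and made-up folder names, so it does not depend on your `earth-analytics` directory, and it restores the original working directory when it finishes.

```python
import os
import tempfile

# A relative path built from hypothetical folder and file names
relative_path = os.path.join("earthpy-downloads", "example", "file.sqlite")
is_relative = not os.path.isabs(relative_path)

original_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    # Changing the working directory changes what the relative path points to
    os.chdir(tmp_dir)
    absolute_path = os.path.abspath(relative_path)
    starts_at_cwd = absolute_path == os.path.join(os.getcwd(), relative_path)
    # Move back before the temporary directory is removed
    os.chdir(original_dir)

print(is_relative, os.path.isabs(absolute_path), starts_at_cwd)
```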

**IMPORTANT: At the end of your cell below be sure to call the name of your path variable
so that the path prints in your notebook. If you don't do this, the test below
will fail.**

Make sure to follow the PEP 8 standard and add descriptive comments.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
student_path = _
path_pts = 0

path_ans = (
    'earthpy-downloads/RDS-2013-0009.5_SQLITE/Data/FPA_FOD_20210617.sqlite')

if student_path == path_ans:
    path_pts += 1
    print("\u2705 Nice work! You located your file correctly!")
else:
    print(("\u274C Check the location of your .sqlite file again."))
    
if os.path.exists(student_path):
    path_pts += 1
    print(("\u2705 Nice work! You set your working directory correctly "
           "and your file path exists!"))
else:
    print(("\u274C Make sure you set your working directory to "
           "~/earth-analytics/data."))
    
path_pts

1. Open the `.sqlite` file using pandas `read_sql_query("SELECT * FROM table_name", "sqlite:///" + pth)` and assign the output data to a variable (be sure the variable name is expressive!). 

2. **Replace table_name with the table name you found in the documentation above**

3. At the end of the cell, call the variable to ensure that the output prints in your notebook. The last line of your cell should look something like this:

    `data_frame_variable_name`
    
The code will look something like the example below (except your **variable names** will be more expressive, and you will include **descriptive comments**!). Don't forget to also replace the path to the `.sqlite` file and the table name:

```python
# Comment here
os.chdir(os.path.join(et.io.HOME, 'earth-analytics', 'data'))

# Comment here
pth = os.path.join("earthpy-downloads", "path", "to", "downloaded.sqlite")

# Comment here
df = pd.read_sql_query(
    "SELECT * FROM table_name", 
    "sqlite:///" + pth)
df
```

**IMPORTANT: At the end of your cell below be sure to call the name of your DataFrame variable
so that the DataFrame prints in your notebook. If you don't do this, the test below
will fail.**

Make sure to follow the PEP 8 standard and add descriptive comments.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# DO NOT MODIFY THIS CELL

student_data = _

# Creating total points
dataframe_answer_points = 0

# Correct answers for headings
column_names_ans = [
    'FOD_ID', 'FPA_ID', 'SOURCE_SYSTEM_TYPE', 'SOURCE_SYSTEM',
    'NWCG_REPORTING_AGENCY', 'NWCG_REPORTING_UNIT_ID',
    'NWCG_REPORTING_UNIT_NAME', 'SOURCE_REPORTING_UNIT',
    'SOURCE_REPORTING_UNIT_NAME', 'LOCAL_FIRE_REPORT_ID',
    'LOCAL_INCIDENT_ID', 'FIRE_CODE', 'FIRE_NAME',
    'ICS_209_PLUS_INCIDENT_JOIN_ID', 'ICS_209_PLUS_COMPLEX_JOIN_ID',
    'MTBS_ID', 'MTBS_FIRE_NAME', 'COMPLEX_NAME', 'FIRE_YEAR',
    'DISCOVERY_DATE', 'DISCOVERY_DOY', 'DISCOVERY_TIME',
    'NWCG_CAUSE_CLASSIFICATION', 'NWCG_GENERAL_CAUSE',
    'NWCG_CAUSE_AGE_CATEGORY', 'CONT_DATE', 'CONT_DOY', 'CONT_TIME',
    'FIRE_SIZE', 'FIRE_SIZE_CLASS', 'LATITUDE', 'LONGITUDE', 'OWNER_DESCR',
    'STATE', 'COUNTY', 'FIPS_CODE', 'FIPS_NAME']

# Test that it's of type dataframe
if type(student_data) == pd.DataFrame:
    print("\u2705 Data was opened with a Pandas DataFrame, good job!")
    dataframe_answer_points += 2
else:
    print(("\u274C Data is not stored in a Pandas DataFrame, make sure you "
           "use the Pandas module to import the data and that you added the "
           "dataframe variable as the last object in the cell above."))

try:
    # Headings and columns should be all correct
    column_names = student_data.columns.tolist()
    if sorted(column_names) == sorted(column_names_ans):
        print("\u2705 Columns have the correct heading names, good job!")
        dataframe_answer_points += 2
    else:
        print("\u274C Columns do not have the correct heading names.")
except AttributeError:
    print(("\u274C Cannot continue testing until data has been opened "
           "in a Pandas Dataframe"))

print("You received {} out of 4 points for creating a dataframe".format(
    dataframe_answer_points))
dataframe_answer_points

You might have noticed that that step took some time. At roughly 2 million rows, you're not working with **big data** (yet!), but it's enough to cause a noticeable pause on most computers. We might call it **medium-sized data**.

Practice efficient memory management so that you can run the whole notebook without having to wait on this cell each time. Go back to the previous cell, and put the `read_sql_query` function call inside a **conditional**, making sure to replace `dataframe_name` with your DataFrame variable name. To test whether the DataFrame variable has already been created, you can use the `globals()` function in Python, which returns a dictionary of all the global variables (for right now, we don't have any other type of variables):

```python
if 'dataframe_name' not in globals():
    dataframe_name = pd.read_sql_query(
        ...
```

Your cell should now run instantaneously once the database has been converted to a DataFrame the first time. You can try both ways by restarting the kernel.

BONUS: Computers have both memory and disk space. Memory is much faster, and you need to have data in memory to perform computation on it. The `.sqlite` file is on disk, where it persists even when the computer is off. In general, you don't want to load data into memory that you aren't going to use. We could make this workflow even more efficient by changing the SQL query to import only the columns we want, which contain an ID, size, and year of each fire. If you have some experience with databases, give it a try!
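
As a rough sketch of what selecting specific columns looks like, the example below builds a tiny in-memory database and reads back only three of its four columns. The table and column names are made up for illustration and are not the names in the fire database.

```python
import sqlite3

import pandas as pd

# Stand-in database with made-up table and column names
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE example_fires "
    "(fire_id INTEGER, year INTEGER, size REAL, notes TEXT)")
connection.executemany(
    "INSERT INTO example_fires VALUES (?, ?, ?, ?)",
    [(1, 2000, 10.5, "a"), (2, 2001, 3.0, "b")])

# Name only the columns you need instead of using SELECT *
subset_df = pd.read_sql_query(
    "SELECT fire_id, year, size FROM example_fires", connection)
connection.close()

print(subset_df.columns.tolist())
```

The unused `notes` column is never loaded into memory, which matters far more when the table has millions of rows.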

### Filter Data

You may have noticed it takes Python some time to work with the current fire DataFrame. One way to help with this is to **aggregate** the data so you only have the values you need. In this case, you want the mean size and total number of fires in California **for each year**.

1. Start by selecting only the California fires. Replace `state_column_name` with the actual column name for the state data, and replace `dataframe_name` with the DataFrame you made above:

    `dataframe_name[dataframe_name['state_column_name'] == 'CA']`

2. Next, **group** the California DataFrame by year, because we want a single value for each year. Replace `year_column_name` with the actual column name for the year data, and replace `dataframe_name` with the DataFrame you want to group. (HINT: you can see all the column names on the metadata page OR by accessing the columns attribute of the DataFrame, e.g. `dataframe_name.columns`.)

    `dataframe_name.groupby('year_column_name')`
    
3. Finally, aggregate the data to count the fires and take the mean of the fire size. This function will count the number of fires each year and compute the mean size. You will need to find the correct column names to replace `id_column_name` and `size_column_name`. Note: Pandas functions are intended to be strung together (also known as **piping**) by placing the following immediately after the closing parenthesis of `groupby`:
    
    `.agg(total_fires=('id_column_name', 'count'), mean_size_acres=('size_column_name', 'mean'))`
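
The filter, group, and aggregate pattern can be seen end to end on a toy DataFrame. Every column name and value below is made up for illustration; the real column names are the ones you find in the metadata.

```python
import pandas as pd

# Made-up records: column names here are placeholders, not the real ones
toy_df = pd.DataFrame({
    "region": ["A", "A", "B", "A", "B"],
    "year": [2000, 2000, 2000, 2001, 2001],
    "event_id": [1, 2, 3, 4, 5],
    "size": [10.0, 30.0, 99.0, 5.0, 7.0]})

# Keep only region A, then group by year and aggregate each group
toy_summary = (
    toy_df[toy_df["region"] == "A"]
    .groupby("year")
    .agg(n_events=("event_id", "count"), mean_size=("size", "mean")))

print(toy_summary)
```

Each keyword passed to `.agg` becomes an output column name, and its value is a pair of the input column and the name of the aggregation to apply.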
    
**IMPORTANT: At the end of your cell below be sure to call the name of your DataFrame variable
so that the DataFrame prints in your notebook. If you don't do this, the test below
will fail.**

Make sure to follow the PEP 8 standard and add descriptive comments.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
student_df = _
df_by_year_pts = 0


sum_ans = [194426.0, 1853.0]
columns_ans = ['total_fires', 'mean_size_acres']

if isinstance(student_df, pd.DataFrame):
    df_by_year_pts += 1
    print("\u2705 Nice work! You returned a DataFrame!")
else:
    print(("\u274C Make sure that you called your DataFrame on the last "
           "line of the cell above."))

if sorted(student_df.columns.tolist()) == sorted(columns_ans):
    df_by_year_pts += 2
    print("\u2705 Nice work! Your DataFrame has the correct column names!")
else:
    print(("\u274C Make sure that you renamed your column names."))
              
student_sums = sorted(round(col_sum, 0) for col_sum in student_df.sum())
if student_sums == sorted(sum_ans):
    df_by_year_pts += 5
    print(("\u2705 Nice work! Your DataFrame has the correct values!"))
else:
    print(("\u274C Check your aggregate function - your DataFrame does "
           "not have correct values."))
    
df_by_year_pts

<img style="float: left;" src="colored-bar.png"/>

## Create a Bar Plot of Total Fires

Create a plot with total fires on the y-axis and year on the x-axis. 

There are several ways to create a bar plot using Python.
Below, create a bar plot using the following approach:

```
name_of_df.plot.bar(y='total_fires')
```
Modify the plot as follows:

* Add a **title, x and y label** using the xlabel, ylabel, and title arguments of the `.bar()` method. 
* Change the **color** of the bars on the plot using `color`
* Change the **edgecolor** of the bars on the plot using `edgecolor`
* Remove the default **legend** since you have named the y-axis already using `legend`
* Change the **figure size** using `figsize` so that the plot is easy to read. Bear in mind both the font size and the aspect ratio of the figure.
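
As a sketch of how these arguments fit together, the example below plots a toy DataFrame with made-up values and a placeholder column name. The colors, labels, and figure size are arbitrary choices, and your own plot should describe the fire data.

```python
import pandas as pd

# Made-up yearly counts; the index supplies the x-axis categories
toy_counts = pd.DataFrame(
    {"n_events": [12, 7, 15]},
    index=pd.Index([2000, 2001, 2002], name="year"))

ax = toy_counts.plot.bar(
    y="n_events",
    title="Example Events per Year (Made-up Data)",
    xlabel="Year",
    ylabel="Number of events",
    color="darkorange",
    edgecolor="black",
    legend=False,
    figsize=(8, 4))
```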


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

<img style="float: left;" src="colored-bar.png"/>

## Create a Column Containing Mean Fire Size in Hectares

Above, you imported some data which contains a column for mean 
fire size in acres. For your analysis, you want the data to be in hectares. Do the following:

1. Write a function called `acre_to_hectare` that converts from acres to hectares.

    HINT: `1 acre = 0.404686 hectares`

2. Apply your function to the `mean_size_acres` column:

    `dataframe_name.existing_column_name.apply(function_name)`

3. You can create a new column in a pandas data frame using the following syntax:

    `dataframe_name["new_col_name"] = dataframe_name.existing_column_name.apply(function_name)`

Create a new column called `mean_size_hectares` in your data frame that contains the mean fire size values converted to hectares.
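
The same pattern is shown below with a different unit conversion, so that the acres-to-hectares function is still yours to write. The function name, column names, and values are made up for illustration.

```python
import pandas as pd


def feet_to_meters(feet):
    """Convert a length in feet to meters (1 foot = 0.3048 meters)."""
    return feet * 0.3048


# Made-up elevations, used only to show the pattern
toy_df = pd.DataFrame({"elevation_ft": [0.0, 100.0, 1000.0]})

# Apply the function to every value and store the result in a new column
toy_df["elevation_m"] = toy_df.elevation_ft.apply(feet_to_meters)

print(toy_df)
```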

****
    
**IMPORTANT: At the end of your cell below be sure to call the name of your dataframe
so that the dataframe prints in your notebook. If you don't do this, the test below
will fail.**

Make sure to follow the PEP 8 standard, add descriptive comments, and document your function.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
student_data_2 = _
sum_ans = 749.9
dataframe_2_answer_points = 0

# Test that the new column is present
if "mean_size_hectares" in student_data_2.columns.tolist():
    print(("\u2705 Great - you have added the mean_size_hectares column."))
    dataframe_2_answer_points += 3
else:
    print(("\u274C Oops - you should have a column called mean_size_hectares "
           "in your data frame. Make sure you added and named the "
           "column correctly."))

# Test that the new column has the correct values
if round(student_data_2.mean_size_hectares.sum(), 1) == sum_ans:
    print(("\u2705 Great - you correctly converted to hectares."))
    dataframe_2_answer_points += 3
else:
    print(("\u274C Oops - your converted values are not correct."))

# Report out
print('You earned {} / 6 points for adding a column.'.format(
    dataframe_2_answer_points))

dataframe_2_answer_points

## Create A Figure with Subplots

In the cell below create a Figure that contains
two subplots:

* Plot 1: Create a plot that shows total number of fires by year
* Plot 2: Create a plot that shows the mean fire size in hectares by year

You will need to create the subplot layout first, using the following syntax:
    `fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))`
    
Assign `ax1` and `ax2` to your plots using the `ax=ax1` argument of the plotting method you chose.

For each plot do the following:

* Modify the default plot colors
* Add a title, and x and y axis labels

For the figure:
* Add an **overall title** for the entire figure

When adding your titles and labels, think about the following pieces of information that could help someone easily interpret the plot:
* geographic coverage or extent of data.
* duration or temporal extent of the data.
* what was actually measured and/or represented by the data.
* units of measurement.
****
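
One possible way to lay out two stacked subplots is sketched below with toy data. The column names, values, colors, and titles are placeholders, and bar plots are only one of the plot types you could choose.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly summary with placeholder column names
toy_summary = pd.DataFrame(
    {"n_events": [12, 7, 15], "mean_size": [3.5, 8.0, 4.2]},
    index=pd.Index([2000, 2001, 2002], name="year"))

# Create the layout first, then hand each axis to a plotting call
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

toy_summary.plot.bar(
    y="n_events", ax=ax1, color="steelblue", legend=False,
    title="Example Events per Year")
ax1.set(xlabel="Year", ylabel="Number of events")

toy_summary.plot.bar(
    y="mean_size", ax=ax2, color="seagreen", legend=False,
    title="Example Mean Size per Year")
ax2.set(xlabel="Year", ylabel="Mean size (made-up units)")

# One title for the whole figure, above both subplots
fig.suptitle("Made-up Example Data, 2000-2002")
```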

Make sure to follow the PEP 8 standard and add descriptive comments.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## Explain Your Plot

In the Markdown cell below, write an assertion-evidence style takeaway (formatted as a heading).

Next, answer the following questions about your plot using a **bullet list**.

1. Do either the yearly total number of fires or average size of fires appear to be increasing over time in California? Explain your answer using the patterns that you see in the plot. Do some research about fire size and occurrence to help answer this question.

2. Include at least one citation supporting your answer.

3. What additional data might help you to better answer the first question about whether number of fires or average fire size appear to be increasing? (It could help to take a look at the information about the original dataset from <a href="https://www.fs.usda.gov/rds/archive/Catalog/RDS-2013-0009.5" target="_blank">the United States Forest Service</a>.)

Remove any existing text in the cell below before adding your answer.


YOUR ANSWER HERE

## Discuss Your Workflow

In the Markdown cell below, answer the following questions using a **numbered list**:

1. Consider the variable name that you used for your pandas dataframe. Explain why it is expressive (or not).
2. In your final plot, why did you choose the type of plot (e.g. bar, scatter, line) that you did?
3. In your final plot, why did you choose the color combinations you did for your plots?

Remove any existing text in the cell below before adding your answer.


YOUR ANSWER HERE