In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("assignment2_234.ipynb")

# Assignment 2: Forest Fires
**PSTAT 234 (Winter 2024)  
Due Date: 02/11**

## Collaboration Policy

While you may talk with others about the homework, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** at the top of your notebook.

**Collaborators**: *list collaborators here*

# Direction and Goal

This assignment will focus on analyzing the data for forest fires in the United States from 1992 through 2020. The data we will be using is from the US Forest Service, which is part of the US Department of Agriculture. 

## Question 1: Data Processing

First we will import the needed modules. We will use `pandas` for our data cleaning and analysis, `seaborn` for our plotting, and `geopandas` for working with geospatial data. `geopandas` may not be installed; if it is not, run the cell below to install it.

In [None]:
!pip install geopandas
!pip install pyarrow

In [None]:
import pandas as pd
import seaborn as sns
import geopandas as geo

## Question 1a: Read the data

In the `data` folder, there is a file called `fires.parquet.gzip`. `parquet` is an efficient file format for tabular data that also supports compression. Since the file we will be working with is very large (over 100 MB), storing it as a csv would take up a lot of space, so it has been stored as a `parquet` binary file compressed with `gzip`. Fortunately, `pandas` has a convenient function called `read_parquet()` that can read `parquet` files. Each row in the data represents one fire that happened in the United States, with information such as the fire name, its year, and the number of acres burned. Read the file using `read_parquet()` into a pandas `DataFrame` called `fires`.

In [None]:


# Fill-in ...
fires = pd.read_parquet(...) # Replace ... with correct code
fires.head()

In [None]:
grader.check("q1a")

A handy way to take a first look at the data is the `info()` method. It tells us how many rows the data has, how much memory it takes up, and the names and data types of the columns. Notice the `memory_usage='deep'` argument: it tells `pandas` to report the **true** memory taken up by the dataframe. By default, `pandas` only counts the space taken up by references to the objects contained in the dataframe, not necessarily the objects themselves.
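
The difference between the default and the deep measurement can be seen on a small example. This is only a sketch on hypothetical toy data (the column and its values are made up and do not come from `fires`); the column is forced to the `object` type so that each cell holds a reference to a Python string.

```python
import pandas as pd

# Hypothetical toy column of Python strings, stored as `object` dtype.
toy = pd.DataFrame(
    {"label": pd.Series(["alpha", "a much longer label", "beta"], dtype=object)}
)

# Default: counts only the references stored in the column.
shallow = toy.memory_usage(index=False).sum()

# deep=True: also counts the string objects those references point to.
deep = toy.memory_usage(index=False, deep=True).sum()

print(shallow, deep)
```

The deep figure is the larger of the two because it includes the strings themselves.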

In [None]:
fires.info(memory_usage='deep')

Another useful method is `describe()`, which gives summary statistics for all the numeric columns.

In [None]:
fires.describe().T.head(4) # .T puts one column of fires per row; head(4) shows the first four

In [None]:
fires.isna().sum()

## Question 1b: Clean the Data

### Question 1b (i): Lowercase Column Names

Convert all the column names to lowercase.
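
Column names live in an `Index`, which exposes vectorized string methods through its `.str` accessor. The sketch below uses a hypothetical toy `DataFrame` and a different string method (`replace`) purely to illustrate the pattern.

```python
import pandas as pd

# Hypothetical toy data; these column names are made up for illustration.
toy = pd.DataFrame({"First Col": [1], "Second Col": [2]})

# `.str` applies a string method to every column name at once.
toy.columns = toy.columns.str.replace(" ", "_")

print(list(toy.columns))
```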

In [None]:

# Fill-in ___
fires.columns = ___.str.____ # Replace ___ with correct code
fires.head()

In [None]:
grader.check("q1bi")

### Question 1b (ii): Rename Columns

Rename the `fire_size` column to `acres_burned` and the `cont_date` column to `contains_date`.
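
As a reminder of how `rename()` behaves, here is a small sketch on a hypothetical toy `DataFrame` (the names `a`, `b`, and `alpha` are made up). The method takes a mapping from old names to new names and returns a new `DataFrame`; columns not in the mapping are left alone.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Map old name -> new name; "b" is not mentioned, so it is unchanged.
renamed = toy.rename(columns={"a": "alpha"})

print(list(renamed.columns))
print(list(toy.columns))  # the original is not modified
```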

In [None]:
# Fill-in ...
fires = fires.rename(...) # Replace ... with correct code
fires.head()

In [None]:
grader.check("q1bii")

### Question 1b (iii): Subset Data

In large datasets, it is always wise to consider whether some rows can be removed. We have over 2 million rows and many of these rows are for small fires that are not going to enhance our analysis. Thus, for this analysis, we will focus only on fires that burned at least 10 acres. 
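
Rows can be filtered either with a boolean mask or with `query()`. The sketch below shows both on hypothetical toy data (the column names and the threshold are made up) and confirms they select the same rows. Note the choice of comparison operator: `>` excludes the boundary value, while `>=` includes it.

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c"], "area": [2.0, 50.0, 7.5]})

# Boolean mask: a True/False Series selects the rows to keep.
by_mask = toy[toy["area"] > 5]

# query(): the same condition written as a string expression.
by_query = toy.query("area > 5")

print(by_mask.equals(by_query))
```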

In [None]:

# Fill-in ...
fires = fires[...] # Replace ... with correct code (you can also use the query method if you wish)
fires.shape # Check how many rows remain

In [None]:
grader.check("q1biii")

### Question 1b (iv): Drop Duplicate Rows

The US Forest Service collects the data on forest fires from multiple agencies, which can lead to duplicate rows (rows that contain exactly the same information).
We can use the `duplicated()` method in `pandas` to check which rows are exactly the same, as shown in the code snippet below.
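
The `keep` argument controls which members of a duplicate group are flagged. A small sketch on hypothetical toy data (made-up column names and values):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "a", "c", "a"], "acres": [1, 2, 1, 3, 1]})

# Default keep='first': the first occurrence is NOT flagged, later copies are.
later_copies = toy.duplicated()

# keep=False: every member of a duplicate group is flagged, including the first.
all_copies = toy.duplicated(keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```

Using `keep=False` is convenient for inspecting duplicates, since it shows each group in full.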

In [None]:
fires[fires.duplicated(keep=False)]

Write code to drop all duplicate rows, keeping only the first occurrence of each set of duplicate rows.

In [None]:
# Fill-in ___
fires = fires.___ # Replace ___ with correct code
fires[fires.duplicated()] # This should return an empty DataFrame

In [None]:
grader.check("q1biv")

### Question 1b (v): Convert columns to appropriate types

It is always beneficial for each column in a dataframe to have the most appropriate type. Currently the `contains_date` and `discovery_date` columns are strings (represented as the `object` type in `pandas`), but
we can convert them to `datetime` columns so that we have access to a myriad of methods appropriate for time series data.

Write code to convert these two columns to `datetime` columns.
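
For reference, a minimal sketch of a string-to-datetime conversion on hypothetical toy data (the column name `when` and its values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"when": ["2020-08-16", "2020-09-04"]})

# Parse the strings and overwrite the column with datetime values.
toy["when"] = pd.to_datetime(toy["when"])

print(toy["when"].dtype)
print(pd.api.types.is_datetime64_any_dtype(toy["when"]))
```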

In [None]:
# Fill-in ...
fires['discovery_date'] = pd.to_datetime(...)
fires['contains_date'] = ...
fires.info() # You should see datetime64[ns] for the two columns now. The ns tells us it is in nanosecond precision.

In [None]:
grader.check("q1bv")

## Question 2: Prepare the Data

To facilitate our analysis, we will add two new columns. The first will be the month of the fire (based on `discovery_date`) and the second will be the number of days the fire burned, which
is the difference between `contains_date` and `discovery_date`. Name the two new columns `fire_month` and `days_burning`, respectively.

Write code to accomplish the above. The following pages may prove helpful:

[Pandas Time Series](https://pandas.pydata.org/docs/user_guide/timeseries.html#time-date-components)

[Pandas Timedelta](https://pandas.pydata.org/docs/user_guide/timedeltas.html#attributes)
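
The two linked pages describe the `.dt` accessor, which works on both datetime and timedelta columns. A small sketch on hypothetical toy data (the column names `start`, `end`, `start_month`, and `elapsed_days` are made up):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "start": pd.to_datetime(["2020-01-30", "2020-07-01"]),
        "end": pd.to_datetime(["2020-02-02", "2020-07-01"]),
    }
)

# On a datetime column, .dt exposes date components such as the month.
toy["start_month"] = toy["start"].dt.month

# Subtracting two datetime columns gives a timedelta column;
# on it, .dt.days returns the whole number of days.
toy["elapsed_days"] = (toy["end"] - toy["start"]).dt.days

print(toy)
```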

In [None]:
# Fill-in ...
fires['fire_month']  = ...
fires['days_burning'] = ...

In [None]:
grader.check("q2")

## Question 3: Analyzing the Data


<!-- BEGIN QUESTION -->

### Question 3a (i): Maximum Fire Size by Year for California

For the state of California(CA), what was the **maximum** fire size for each year? Create a **barplot** with the year on the x-axis and the fire size(`acres_burned`) on the y-axis. Add/Change the titles and labels to what you deem appropriate. 
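
The filter, group, aggregate pattern used here can be tried first on a small example. This sketch uses hypothetical toy data (made-up column names and values) and stops before the plotting step:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "region": ["N", "N", "S", "S", "S"],
        "year": [2019, 2019, 2019, 2020, 2020],
        "acres": [10.0, 40.0, 5.0, 25.0, 15.0],
    }
)

# Keep one region, group by year, and take the maximum within each group.
largest = toy.query("region == 'S'").groupby("year").acres.max()

print(largest)
```

The result is a `Series` indexed by the grouping column, which is the shape that `.plot()` expects for a bar chart.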

In [None]:
# Fill-in ___ and ...
fires.query(...) \
     .groupby(...) \
     .acres_burned.___\
     .plot(kind=...,title=...,ylabel=...,xlabel=...);


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 3a (ii): Fire Count By Month in California

Create a barplot of the number of fires by month in California. The x-axis should be the month and the y-axis the count of fires. Do you notice any patterns? Which months have the most fires? Does this make sense? For these free-response questions, type your answer in the space provided below.

_Type your answer here, replacing this text._

In [None]:

# Fill-in ___ and ...
fires.query(...) \
     .groupby(...) \
     .___\
     .plot(kind=...,title=...,ylabel=...,xlabel=...);

<!-- END QUESTION -->

### Question 3b (i): Ranking states by total acres burned

For each state, calculate the total number of acres burned and rank each state by that amount. The state with the most acres burned overall should have rank 1, and so on.
Name the dataframe `fires_states`. Its index should be the state abbreviation, and its two columns should be the total acres burned for each state and the rank of each state,
named `acres_burned` and `state_rank` respectively. Sort `fires_states` in ascending order by `state_rank`.

HINT: The `rank()` method may come in handy.
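
As a quick illustration of the hint, `rank()` assigns rank 1 to the smallest value by default; passing `ascending=False` reverses that. The toy `Series` below is hypothetical.

```python
import pandas as pd

totals = pd.Series({"x": 30.0, "y": 80.0, "z": 55.0})

# ascending=False gives rank 1 to the largest value.
ranks = totals.rank(ascending=False)

print(ranks)
```

Ranks are returned as floats because ties, by default, receive the average of the ranks they span.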

In [None]:


# Fill-in ___ and ...
fires_states = fires.groupby(...).acres_burned.agg(...) 

# Rank each state in descending order (state with most acres burned should be rank 1)

fires_states['state_rank'] = ...

# Sort in ascending order by state_rank

fires_states = ...



In [None]:
grader.check("3bi")

<!-- BEGIN QUESTION -->

### Question 3b (ii)

Using `fires_states` from the previous question, create a barplot for the top 10 ranking states with the state on the x-axis and the total acres burned on the y-axis. Add
appropriate titles and labels.

In [None]:

# Fill-in ...
fires_states[...].plot(y=...,kind=...,legend=False,title=...,xlabel=...,ylabel=...);

<!-- END QUESTION -->

### Question 3c (i): Preparing a DataFrame to total acres burned by year within each state

We have already seen the sum of acres burned overall for each state but suppose we want to see how this changes over the years? 

Create a new `DataFrame` called `fires_states_year` storing the total acres burned by year within each state. The result should be a `DataFrame` with a MultiIndex of `state` and `fire_year` with a sole column representing the total acres burned by year within each state. 

Join `fires_states_year` to `fires_states` to create a `DataFrame` of three columns and then use `reset_index()` to set the indexes to be columns and filter for only states ranking within the top 4, naming the result `fires_states_top_4`.

HINT: Use `join()` to merge/join two `DataFrame`s together. Join automatically joins by common index values.
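
To see how `join()` matches a MultiIndex against a single index that shares one level name, here is a sketch on hypothetical toy data. All names, values, and the two suffixes are made up for illustration.

```python
import pandas as pd

# Left: one row per (state, year), indexed by a two-level MultiIndex.
by_year = pd.DataFrame(
    {"acres": [10.0, 20.0, 5.0]},
    index=pd.MultiIndex.from_tuples(
        [("CA", 2019), ("CA", 2020), ("TX", 2019)], names=["state", "year"]
    ),
)

# Right: one row per state, indexed by the shared level name "state".
by_state = pd.DataFrame(
    {"acres": [30.0, 5.0], "rank": [1.0, 2.0]},
    index=pd.Index(["CA", "TX"], name="state"),
)

# Rows are matched on the common index level; the suffixes rename the
# overlapping "acres" column on each side so both can be kept.
joined = by_year.join(by_state, lsuffix="_by_year", rsuffix="_total")

print(joined)
```

Each state-level value is repeated on every row of that state, so a later filter on the state-level column keeps or drops all of a state's years together.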

In [None]:


# Fill-in ... and ____
fires_states_year = fires.groupby(...).____


# lsuffix and rsuffix append a string to the column name shared by both dataframes
# (acres_burned) so the two columns can be told apart. Without suffixes, join()
# raises an error when column names overlap.
fires_states_top_4 = fires_states_year.join(...,lsuffix='_by_year',rsuffix='_total').query('...').reset_index()

fires_states_top_4

In [None]:
grader.check("3ci")

<!-- BEGIN QUESTION -->

### Question 3c (ii): Plotting total acres burned by year for the top 4 states

Using `fires_states_top_4` and `seaborn`, create four lineplots corresponding to each of the top 4 states showing the change in acres burned over the years. The result should be a plot with 2 rows and 2 columns, one lineplot for each state. The line color for each state/plot should be different. 

In [None]:

# Fill-in ... and ____
g = sns.relplot(data=...,kind=...,x=...,y=...,hue=...,col=...,col_wrap=...,legend=False);
g.fig.suptitle('Total Acres Burned By Year in the Top 4 States',y=1.025); # Super Title For Entire Plot

<!-- END QUESTION -->

## Question 4: Using GeoPandas to plot fires on maps

Now that we have created some nice plots, it would be helpful to view the locations of the fires on maps of states so we can get a visual
understanding of where they tend to occur and reaffirm our analyses from above. We will be using `GeoPandas`, which you can think of as `pandas` extended to work with geospatial data.
The map data for the United States is stored in the `Maps` subfolder as shape files, a special file
format for storing geographic information. We will first go through a demonstration to see it in action. You will notice there are other files in the folder as well; they are needed for rendering the plots.


First, let's load the map of the USA and display it. We load the shape file `states.shp` located inside the `Maps` subfolder.

In [None]:
usa = geo.read_file('Maps/states.shp')
usa.head()

We see a familiar looking `DataFrame` style output and that is because the core data structure in `geopandas` is the `GeoDataFrame`, an augmented `DataFrame`. The `geometry` column is the bread-and-butter that stores the locations of the points in terms of longitude and latitude. We can infer that the above is storing the points on the boundaries of the states for each state.

To plot the map associated with it, we simply call the `plot()` method. There are options to change the colors of the edges and fill, as you will explore in the next question.

In [None]:
usa.plot()

<!-- BEGIN QUESTION -->

### Question 4a: Get and plot a map of California


Filter `usa` for the state of California and plot the corresponding map. Set the fill color (the part inside the state boundary) to `white` and the edge color to `black`.

In [None]:
# Fill-in ...
...

<!-- END QUESTION -->

### Question 4b: Create a GeoDataFrame of the California Fires in the year 2020

Let's focus solely on 2020 forest fires in California and create a new `GeoDataFrame`. First filter `fires` for 2020 forest fires in California and then use `geo.GeoDataFrame()` to create the `GeoDataFrame`. Save it in a variable called `fires_ca_2020_locations`.

HINT:
[GeoDataFrame](https://geopandas.org/en/stable/gallery/create_geopandas_from_pandas.html)

In [None]:

# Fill-in ...
ca_2020_fires = fires.query('...')
fires_ca_2020_locations = geo.GeoDataFrame(...)
fires_ca_2020_locations.head()

In [None]:
grader.check("4b")

<!-- BEGIN QUESTION -->

### Question 4c: Use Seaborn to Plot Fires on California map

Plot 2020 fires in California that **burned more than 500 acres** onto the California map using `seaborn`. Vary the color of the fires by how many acres they burned. In other words, fires with more acres burned should be
darker. Use the `flare` palette for this.

Also vary the sizes of the fires by how many acres they burned. In other words, fires that burned more acres should appear as bigger points.


Have the California map be filled in white with a black edge color.

HINT:
`sns.scatterplot()`

In [None]:
ca_map = usa.query(...)
ca_map.plot(...)
ax = sns.scatterplot(data=...,x=...,y=...,size=...,hue=...,palette='flare')
ax.set(title='California fires in 2020 over 500 acres',ylabel='',xlabel='') # Adding a title and removing x and y labels.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4d: Plot the fires in the continental United States

Repeat the above procedure for all fires that burned more than 100,000 acres in the continental USA(ignore Hawaii and Alaska). Like before, fires that burned more acres should appear darker
and bigger. Add an appropriate title. Your plot should be a map of the continental USA with points representing locations of fires. 

In [None]:


# Fill-in ...
...

<!-- END QUESTION -->

## Plotting with Folium

[Folium](https://python-visualization.github.io/folium/latest/) builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. This allows you to manipulate your data in Geopandas and visualize it on a Leaflet map via Folium.

_Note: Because of how Folium maps are rendered, the figure will not show up on Gradescope (neither the PDF nor the ipynb file will show Folium plots). Instead, capture a short screen recording demonstrating your Folium visualization and submit it here:_ https://ucsb.instructure.com/courses/17058/assignments/211107

In [None]:
! pip install folium

In [None]:
import folium

In [None]:
CA_map = folium.Map(location=[37.7749, -122.42], #middle point of the map, here is San Francisco
                        zoom_start=7, #zoom scale, you can change to other number to see the difference
                        tiles='openstreetmap')
fires_ca_2020_locations_bigger500 = fires_ca_2020_locations.query('acres_burned   > 500')
                        
for i in fires_ca_2020_locations_bigger500.index:
    lat = fires_ca_2020_locations_bigger500.latitude[i]
    long = fires_ca_2020_locations_bigger500.longitude[i]
    marker = folium.Marker([lat, long],
                           popup="Acres Burned: {}".format(fires_ca_2020_locations_bigger500.acres_burned[i]),
                           icon=folium.Icon(icon="fire", color = 'red')).add_to(CA_map)
    
CA_map

The map above uses folium to plot fire locations in California. The data is similar to Question 4c, but the points are not sized by acres burned. Try clicking the markers to see the acres burned at each location.

<!-- BEGIN QUESTION -->

### Question 4e1

Use `folium` to plot a map of all fires that burned more than 100,000 acres in the continental USA (ignore Hawaii and Alaska). Set `fire_name`, `acres_burned`, `fire_year`, and `days_burning` as the popup message. Markers for fires that burned more acres should appear darker (use the same level breaks as the plot in Question 4d).

In [None]:


# Fill-in ...
US_map = folium.Map(location=..., #middle point of the map
                        zoom_start=..., #zoom scale, you can change to other number to see the difference
                        tiles='openstreetmap')
fires_locations_bigger100000 = fires.query(...)


for i in fires_locations_bigger100000.index:
    if ...:
        type_color = "orange"
    elif ...:
        type_color = 'red'
    elif ...:        
        type_color = "darkred"
    elif ...:
        type_color = "lightgray"
    elif ...:
        type_color = "gray"
    else:
        type_color = "black"
    lat = ...
    long = ...
    popmsg = "Name: "\
    + ...\
    + "<br>"\
    + "Acres Burned: "\
    + ...\
    + "<br>"\
    + "Year: "\
    + ...\
    + "<br>"\
    + "Days Burning: "\
    + ...
    marker = folium.Marker([lat, long],
                           popup=...,
                           icon=folium.Icon(icon="fire", color = "%s" % type_color)).add_to(US_map)
US_map

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4e2
We can also add a heatmap layer to a folium map for `fires_locations_bigger100000`. To do this, we need to create `heat_data`, which is a list of latitude and longitude pairs.

In [None]:
from folium import plugins


#fill in ...
US_heatmap = folium.Map(location=..., tiles="Cartodb dark_matter", zoom_start=...)

heat_data = ...

plugins.HeatMap(heat_data).add_to(US_heatmap)
US_heatmap

<!-- END QUESTION -->

_Intentionally Blank_

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Download the zip file and submit to Gradescope.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)