Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "" # put your full name here
COLLABORATORS = [] # list anyone you collaborated with on this workbook

---

## Lab 6: Multiple Regression

-------------------------------------------

Welcome to your sixth lab of the semester!<br>

This lab continues to build on the spatial analysis and modeling skills we have been developing in previous assignments. Specifically, we will use Geopandas and the `statsmodels` library to try to predict the area burned by large wildfires in the Sierra Nevada region of California. 

Feel free to refer to Lab 3 for the basic Geopandas methods we learned a few weeks ago, and to Lab 5 for linear regression (single variable) basics. 

## Setup & Review

Let's begin by importing the packages we'll need.

In [None]:
# Install the pyogrio I/O engine before importing the packages that use it
!pip install pyogrio

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd

%matplotlib inline

The first dataset we will examine is fire perimeter data from the [Monitoring Trends in Burn Severity (MTBS)](https://www.mtbs.gov/project-overview) database. The data are stored as shapefiles in the `data/mtbs_ca` folder. To reduce the file size, we pre-processed the original nationwide data to only include data for the Sierra Nevada region (as defined by [the Sierra Nevada Conservancy boundary](https://gis.data.ca.gov/datasets/f147fdc76a104484b9fa90baacf9462f_0?geometry=-133.799%2C35.544%2C-106.047%2C41.552)). The raw MTBS data includes information about prescribed fires and wildfires; in this lab, we have filtered out all fire types except for wildfires. 

**Question 1 (1pt):** Import the shapefile as a GeoDataFrame. Print the first few rows. 

In [None]:
# YOUR CODE HERE
sn_wildfires = ...
sn_wildfires

Let's do an abbreviated EDA on `sn_wildfires`, focusing on granularity and scope. The MTBS documentation linked above indicates that the data include all fires that burned at least 1,000 acres in the western U.S. and at least 500 acres in the eastern U.S. (recall that we filtered the full dataset to include only the Sierra Nevada region). All land ownerships are included.

Each record in `sn_wildfires` represents a unique fire incident. Take a look at the shape of the GeoDataFrame. How many fire records do we have?

In [None]:
sn_wildfires.shape

We can see that the data are recorded according to the start date of the fire; we know the year, month, and day of each incident. To determine temporal scope, we can examine the Year column.

In [None]:
sn_wildfires.sort_values(by = 'Year')['Year'].unique() # sort our data frame by year, then find the unique years

Since we are working with spatial data, we also want to identify the coordinate reference system (CRS) in which our data are recorded.

In [None]:
sn_wildfires.crs

Finally, let's check the `geometry` column and determine the types of geometries contained in our geodataframe.

In [None]:
sn_wildfires.geometry.geom_type.unique()

## More handy Geopandas operations

Geopandas provides a veritable treasure trove of [methods and attributes for GeoSeries](https://geopandas.org/docs/reference/geoseries.html#general-methods-and-attributes). As a reminder, in a GeoDataFrame, a GeoSeries is the column that holds the geometries. That column is often, but not always, named "geometry". 

In our `sn_wildfires` dataframe, each geometry represents the perimeter of the area burned by a wildfire incident. We can use Geopandas operations to explore different properties of these geometries. 

For example, we might want to know the **centroid** of each burned area:

In [None]:
sn_wildfires.geometry.centroid.head()

# equivalently, we could have called sn_wildfires['geometry'].centroid

**Question 2 (1 pt):** Your centroid call probably returned the following warning:  
`Geometry is in a geographic CRS. Results from 'centroid' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.`

Why does Python give you this warning? What are the units of area and length returned by the Geopandas built-in methods? *Hint:* it might be helpful to [look up the CRS](https://epsg.org/home.html) for this dataset. 

*YOUR ANSWER HERE*

**Question 3 (1pt):** Transform the `sn_wildfires` data to the CONUS Albers equal area projection (EPSG:5070), which uses the meter as its unit of measure.

In [None]:
# YOUR CODE HERE
sn_wildfires = ...

In this lab, we will try to predict the area burned by a wildfire (using `Acres` as our response variable), using start month and (relative) distance to the nearest highway as independent variables. For the latter, we will need [data on the locations of primary roads (i.e., interstates and highways) in the U.S.](https://catalog.data.gov/dataset/tiger-line-shapefile-2016-nation-u-s-primary-roads-national-shapefile) 

**Question 4 (1pt):** Open the shapefile in `data/tl_2016_us_primaryroads/` as a GeoDataFrame named `sn_roads`. If needed, transform the CRS to match that of `sn_wildfires`.

In [None]:
# YOUR CODE HERE
sn_roads = ...

Take a look at the records in `sn_roads`. In this case, roads are represented as LineStrings. Each record represents a segment of a state or interstate highway that intersects the Sierra Nevada boundary.

In [None]:
print(sn_roads.geometry.geom_type.unique())
print(sn_roads.shape)
sn_roads

**Question 5 (1pt):** Use Geopandas operations to find the length of each road in `sn_roads`.

In [None]:
# YOUR CODE HERE

Geopandas can also calculate the distance between geometries. The code below finds the shortest distance between each road in `sn_roads` and the centroid of the first wildfire listed in `sn_wildfires`. Because both datasets are in EPSG:5070, the distances are in meters.

In [None]:
dsts = sn_roads.distance(sn_wildfires.centroid.loc[0])
dsts
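
To make the behavior of `.distance()` concrete, here is a minimal sketch on toy geometries. The two lines and the point are hypothetical stand-ins (not taken from the lab data), placed in an arbitrary planar coordinate system so the expected distances are easy to check by hand.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

# Two hypothetical "roads" in an arbitrary planar coordinate system
toy_roads = gpd.GeoSeries([
    LineString([(0, 0), (10, 0)]),  # runs along y = 0
    LineString([(0, 5), (10, 5)]),  # runs along y = 5
])

# A hypothetical point standing in for a fire centroid
toy_point = Point(3, 2)

# One value per line: the shortest straight-line distance from the point to that line
toy_dsts = toy_roads.distance(toy_point)
print(toy_dsts.tolist())
```

The point sits 2 units above the first line and 3 units below the second, so the result has one entry per line, in the same order as `toy_roads`.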

**Question 6 (1 pt):** Write a function `min_distance` that takes in a single Point geometry ("point") and a GeoSeries of LineString geometries ("lines") and returns the distance (in kilometers) between that point and the nearest line.

In [None]:
# YOUR CODE HERE
def min_distance(point, lines):
    return ...

In [None]:
# Compare with a tolerance rather than exact floating-point equality
assert np.isclose(min_distance(sn_wildfires.centroid.loc[0], sn_roads), dsts.min() / 1000)

**Question 7 (1 pt):** Using your `min_distance` function, add a new column to `sn_wildfires` in which each element is the distance (in kilometers) between the centroid of the burned area and the nearest major road in `sn_roads`. Name this column `dist_to_rd`.

In [None]:
# YOUR CODE HERE

In [None]:
sn_wildfires.head()

## Multi-Variable Regression

In addition to distance to the nearest highway, we want to use the month in which the fire starts as an independent variable. Before we fit a regression model, let's visualize the data and qualitatively try to identify any patterns or trends that emerge between our independent and dependent variables. 

**Question 8 (1 pt):** Run the code below to generate a pair of scatter plots showing the relationship between `Acres` burned (the target variable, represented on the y-axis), and each of the independent variables (`StartMonth` and `dist_to_rd`, represented on the x-axes). What trends do you notice? 

In [None]:
# This time using a logarithmic scale for area burned
fig, (ax0, ax1) = plt.subplots(ncols=2, sharey=True, figsize=(12,5))

ax0.scatter(sn_wildfires['StartMonth'],sn_wildfires['Acres'])
ax0.set_yscale('log')  # the y-axis is shared, so the log scale applies to both panels
ax0.set_xlabel('Start month')
ax0.set_ylabel('Acres burned per fire')

ax1.scatter(sn_wildfires['dist_to_rd'],sn_wildfires['Acres'])
ax1.set_xlabel('Distance to nearest highway (km)')

plt.suptitle('Acres burned versus fire start month and distance to nearest highway');

*YOUR OBSERVATION HERE*


We are ready at last to create our linear regression model, using **two features** (start month and distance to nearest highway) to predict acres burned. 

This time, instead of `scikit-learn`, we'll use a library called `statsmodels`. One nice feature of `statsmodels` is its clean, informative summary of regression results and statistics.

In [None]:
# Run this cell to import the statsmodels library
import statsmodels.api as sm

Estimating a model with `statsmodels` follows a process similar to model estimation in `scikit-learn`. We first initialize a model, in this case with `sm.OLS()`, which takes the response **y** and the features **X** (in that order, in dataframe form) as arguments. We then `.fit()` the model and can view information about the coefficients and model performance by calling `.summary()` on the fitted results. 

**Question 9 (1pt):** Create a dataframe **X**, which holds our two independent variables, each as a column of observations. In addition, create a dataframe **y** that holds the response variable.

In [None]:
# YOUR CODE HERE
X = ...
y = ...

Unlike `scikit-learn`, `statsmodels` expects a column of 1's in the **X** dataframe in order to fit an intercept. One way to achieve this is to apply the built-in `add_constant` function from `statsmodels` to your dataframe of **X** values.

In [None]:
# run this cell
X_const = sm.add_constant(X)
X_const.head()
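
To see why the column of 1's matters, here is a small sketch that uses plain NumPy least squares instead of `statsmodels`. The data are a made-up exact line with intercept 10 and slope 2. Without the column of 1's, the fit is forced through the origin and the slope is distorted.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 10.0 + 2.0 * x  # exact line: intercept 10, slope 2

# Design matrix without a constant column: the fitted line must pass through the origin
A_no_const = x.reshape(-1, 1)
slope_only, *_ = np.linalg.lstsq(A_no_const, y, rcond=None)

# Design matrix with a leading column of ones: the first coefficient is the intercept
A_const = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A_const, y, rcond=None)

print(slope_only)  # a single slope, inflated to compensate for the missing intercept
print(coef)        # [intercept, slope]
```

With the ones column, the fit recovers the intercept and slope used to build the data. Without it, the only free parameter is the slope, which has to absorb the offset.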

Run the cell below to fit a model to **X** and **y** and view the results.

In [None]:
# Run this cell
sn_wf_model = sm.OLS(y,X_const)
sn_wf_results = sn_wf_model.fit()
sn_wf_results.summary()

**Question 10 (1pt):** What are the values and 95% confidence intervals of the three coefficients? What do the confidence intervals imply about the model we've built?

*YOUR ANSWER HERE* 

## Feature Engineering

Let's try to improve our model by adding more features. Instead of using new sources of data, we will transform the two independent variables we already have and add these transformations as additional features. This process is known as "feature engineering."

**Question 11 (1pt):** To make it easy to test different sets of features, write a function `fit_OLS` that takes in a dataframe containing the independent variables ($X$) and another dataframe containing the response variable ($y$). The function should fit a linear regression model and return the `statsmodels` summary for the model. Feel free to use the code in the previous section as a template. Test your function on the $X$ and $y$ dataframes you created in Question 9.

In [None]:
# Replace ellipses with your code

def fit_OLS(X,y):
    X_const = sm.add_constant(...)
    ols_model = sm.OLS(...)
    results = ...
    return results.summary()

In [None]:
fit_OLS(X,y)

The first new feature we will add to our model is the natural log of the `dist_to_rd` variable. The code block below provides an approach to expanding our $X$ dataframe to include this new feature.

In [None]:
X['log_dist'] = np.log(sn_wildfires['dist_to_rd'])
X.head()

We can estimate the model with the addition of our new log distance feature as follows:

In [None]:
fit_OLS(X,y)
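
As a toy illustration of why a transformed feature can help, the sketch below uses plain NumPy on synthetic data. The response is generated (by assumption) as a linear function of the log of a hypothetical distance variable `d`, and we compare the $R^2$ of a straight-line fit on `d` with one on `np.log(d)`. The helper `r_squared` is written just for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical distances and a response that depends on log(distance)
d = rng.uniform(1, 100, 300)
resp = 5.0 + 3.0 * np.log(d) + rng.normal(0, 0.3, 300)

def r_squared(feature, target):
    """R^2 of a one-feature least-squares fit with an intercept."""
    A = np.column_stack([np.ones_like(feature), feature])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

r2_raw = r_squared(d, resp)          # straight line in d
r2_log = r_squared(np.log(d), resp)  # straight line in log(d)
print(r2_raw, r2_log)
```

Because the synthetic response really is linear in `log(d)`, the log feature explains more of the variance here. On real data there is no such guarantee, so compare the fitted models rather than assuming a transformation helps.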

**Question 12 (2pts):** Try engineering at least one new feature of your own, using different transformations and/or combinations of the features in the $X$ dataframe. Fit your model and view the results. Did your new feature(s) improve the model? 

In [None]:
# YOUR CODE HERE

*YOUR ANSWER HERE*

# Hooray, you're done! 

Click Kernel -> Restart & Run All. Then submit your lab work as an HTML file on bCourses.