# Machine Learning 2

Machine learning is the process of learning from data to make predictions. **Supervised** machine learning models are trained to predict an outcome based on input data (predictors or features). The model is trained to minimise the error in predictions using a training set where both the outcome labels and input data are known. A key part of the machine learning model development workflow is evaluating model performance. This lab will introduce a technique for evaluating model performance in the context of limited training and test data: cross-validation.

Previously, we demonstrated a workflow to develop a machine learning model for a classification task: predicting a field's crop type. In this lab we will develop a machine learning model for a regression task (predicting a continuous number) and evaluate the model using k-fold cross-validation.

The task for this lab is to develop and evaluate a machine learning model that can predict smallholder farm maize crop yields in Uganda using remotely sensed vegetation indices as input features.

In this lab you'll learn how to use Scikit-learn's tools for evaluating models using cross-validation. We'll also introduce approaches for pre-processing categorical data to be used as features in machine learning models, and techniques for interpreting the model and exploring how it is making predictions.

## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab be aware that your sessions are temporary and you'll need to take care to save, back up, and download your work.**

<a href="https://colab.research.google.com/github/geog3300-agri3003/coursebook/blob/main/docs/notebooks/week-5_1.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet.

In [None]:
import os
import subprocess

if "data_lab-5" not in os.listdir(os.getcwd()):
    subprocess.run('wget "https://github.com/geog3300-agri3003/lab-data/raw/main/data_lab-5.zip"', shell=True, capture_output=True, text=True)
    subprocess.run('unzip "data_lab-5.zip"', shell=True, capture_output=True, text=True)
    if "data_lab-5" not in os.listdir(os.getcwd()):
        print("Has a directory called data_lab-5 been downloaded and placed in your working directory? If not, try re-executing this code chunk")
    else:
        print("Data download OK")

### Working in Colab

If you're working in Google Colab, you'll need to install the required packages that don't come with the Colab environment.

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install rioxarray
    !pip install mapclassify
    !pip install rasterio

## Cross-validation

A key part of the machine learning model development workflow is evaluating model performance. This should provide an assessment of how well a model will perform in making predictions on new or unseen data. Machine learning models are data hungry: the more examples the model sees during training, the better it will be able to learn mappings that relate input features to outcome labels. However, we also want to test our model on a dataset that is representative of conditions the model might encounter "in-the-wild"; this means setting aside a chunk of our ground truth dataset that cannot be used for model training. Thus, our model is not trained using all available ground truth data.

One strategy that is deployed to maximise data available for model training and to provide an assessment of model performance is cross-validation. 

Before we explore our ground truth dataset for model development, let's quickly introduce cross-validation. Previously, we evaluated the model's performance by removing a random sample of the data prior to model training to use as a test set. 

#### Recap quiz

<details>
    <summary><b>Why is it important for the test dataset to be randomly sampled from the ground truth data?</b></summary>
We want the test dataset to be representative and unbiased to provide as realistic an assessment of the model's performance on new data as possible.
</details>

<p></p>

<details>
    <summary><b>What is a potential limitation of using a single hold-out randomly sampled test set for evaluating model performance?</b></summary>
With a randomly sampled test set, each time the machine learning model development workflow is repeated, new training and test sets are generated and the model will have different performance scores. With a single test set, the assessment of the model's performance could, by chance, be overly optimistic or pessimistic.
    
Further, by withholding a test set we reduce the amount of data available to train the model. A smaller training dataset can reduce the model's performance. Thus, as we're removing data to form the test set we'd expect the model's error to be larger than if we'd trained the model on the entire dataset. 
</details>

<p></p>

In k-fold cross-validation there is not a single test set. Instead, the ground truth dataset is randomly split into $k$ folds. For example, if $k=5$ the ground truth dataset would be randomly split into 5 groups. Then, in turn, each fold is held out as a test set and the model is trained using data from the remaining four folds. Each fold takes a turn at being the test set. The model performance can be summarised using the average of the performance metrics generated using each fold. This means the model's performance is less susceptible to being influenced by a chance split of the ground truth data into training and test splits. It also means we can use the whole dataset to train the model and evaluate its performance. 
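
The fold rotation described above can be sketched with Scikit-learn's `KFold` splitter. This is a minimal illustration on a made-up array of ten observations, not the lab's data; the names `toy_X`, `kf`, and `test_counts` are hypothetical and only used here.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten made-up observations standing in for a ground truth dataset
toy_X = np.arange(10).reshape(-1, 1)

# split into k=5 folds; shuffle so rows are randomly allocated to folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# count how many times each observation appears in a test fold
test_counts = np.zeros(len(toy_X), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kf.split(toy_X)):
    test_counts[test_idx] += 1
    print(f"fold {fold}: train rows {train_idx}, test rows {test_idx}")

# every observation is held out exactly once across the five folds
print(test_counts)
```

Note that when an integer such as `cv=5` is passed to `cross_val_score()` for a regression model, Scikit-learn uses `KFold` without shuffling by default, so the folds are consecutive blocks of rows. Passing a splitter such as `kf` to the `cv` argument is one way to request shuffled folds.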

### MAPS 2016 Data

The dataset we're using is the data from <a href="https://web.stanford.edu/~mburke/papers/lobell_et_al_AJAE_2019.pdf" target="_blank">Lobell et al. (2019)</a>. Their analysis compared different approaches to estimating smallholder maize crop yields in Uganda: farmer reported yields, subplot crop cut samples, full plot crop cut samples, and satellite-based crop yield estimates.

Boosting agricultural productivity in smallholder landscapes is important for improving a range of livelihood outcomes including food security and poverty alleviation. Accurate data on smallholder farmer crop production is a key ingredient to guiding development initiatives, government policies / programs, agricultural management and input use, and monitoring progress towards several Sustainable Development Goals. 

Traditionally, agricultural productivity in smallholder landscapes has been measured using farmer reported crop yields via surveys after harvests. These estimates are subject to considerable error impacting the quality of the data. 

More accurate measures of crop yield include physically harvesting a sub plot or full plot - crop cutting. However, crop cutting is more costly, time consuming, and requires liaising with farmers to generate large datasets of yield measurements.

<a href="https://web.stanford.edu/~mburke/papers/lobell_et_al_AJAE_2019.pdf" target="_blank">Lobell et al. (2019)</a> explore the potential for using satellite data to measure crop yields in smallholder fields to address i) the issue of error in farmer reported yields, and ii) the cost of crop cutting. 

We're going to use the replication data from their paper and see if we can develop a machine learning model to accurately predict smallholder maize crop yield using satellite data as inputs. As this dataset only has a few hundred data points, we'll use all the data to train the model and test its performance using cross-validation.

### Import modules

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import GroupShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.inspection import permutation_importance
from sklearn import tree

# setup renderer
if 'google.colab' in str(get_ipython()):
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "jupyterlab"

rng = np.random.RandomState(0)

## Load data

In [None]:
df = pd.read_csv(os.path.join(os.getcwd(), "data_lab-5", "lobell_2019_maize.csv"))
print(f"the shape of the DataFrame is {df.shape}")
df.head()

### Explore data

The data that we have read into `df` includes a range of variables related to crop yield outcomes, farm management and farm type, and satellite-derived vegetation indices from the Sentinel-2 sensor.

We'll be using the crop cut maize yield measures from sample sub plots in the fields as our outcome variable here - this is referenced by the column `cc_yield`. The units are Mg/ha.

Let's look at the distribution of yield values.

In [None]:
fig = px.histogram(
    data_frame=df, 
    x="cc_yield",  
    marginal="box"
)
fig.show()

The `DataFrame` stores Sentinel-2 derived vegetation indices in the following columns:

* `gcvi_doy_151` - average field GCVI on day of year 151.
* `gcvi_doy_171` - average field GCVI on day of year 171.
* `ndvi_doy_151` - average field NDVI on day of year 151.
* `ndvi_doy_171` - average field NDVI on day of year 171.
* `mtci_doy_151` - average field MTCI on day of year 151.
* `mtci_doy_171` - average field MTCI on day of year 171.

NDVI is the normalised difference vegetation index. GCVI is the green chlorophyll vegetation index. MTCI is the MERIS terrestrial chlorophyll index. These will be the main features (predictors) in our model.

#### Recap quiz

**Can you generate scatter plots to explore the correlation between vegetation index values and maize crop yield?**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>
    
```python
fig = px.scatter(
    df,
    x = "gcvi_doy_151", ## CHANGE THIS FOR DIFFERENT VEGETATION INDICES
    y = "cc_yield",
    trendline = "ols",
    opacity=0.25,
    labels={"cc_yield": "Maize crop yield (Mg/ha)",
           "gcvi_doy_151": "GCVI (DOY 151)"} ## CHANGE THIS FOR DIFFERENT VEGETATION INDICES
)
fig.show()
```
</details>

## Cross-validation 

We'll start by training a linear regression model that predicts maize crop yield as a function of vegetation indices and evaluate its performance using cross validation. 

Maize crop yield is a continuous variable and we need a metric to evaluate model performance. We'll use the `mean_absolute_error` as the metric which is the mean absolute difference between predicted and observed crop yields. In this case, the mean absolute error will be computed using observations in the held out fold in cross-validation.
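
For $n$ observations with observed yields $y_i$ and predicted yields $\hat{y}_i$, the mean absolute error is $\frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$. The short sketch below checks this against Scikit-learn's `mean_absolute_error()` function. The yield values are made up for illustration and the variable names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up observed and predicted yields (Mg/ha) for four fields
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.0, 5.0])

# mean of the absolute differences between predictions and observations
mae_manual = np.mean(np.abs(y_pred - y_obs))

# the same calculation using Scikit-learn
mae_sklearn = mean_absolute_error(y_obs, y_pred)

print(mae_manual, mae_sklearn)
```

The absolute errors here are 0.5, 0.5, 0, and 1, so both calculations give a mean absolute error of 0.5 Mg/ha.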

We'll need to create a linear regression estimator object that we can train (using its `fit()` method). To evaluate the model using cross validation you call the `cross_val_score()` function on the dataset and the estimator. You pass in the metric you wish to use to evaluate the model to the `scoring` argument.

We'll also need to use the `dropna()` method of pandas `DataFrame`s to remove missing data before training the model. Most Scikit-learn estimators, including `LinearRegression`, cannot be trained on datasets with `NaN` values.

In [None]:
# get X and y data

# drop nas as Linear Regression object cannot be trained on datasets with missing data
df_linear_reg = df.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

In [None]:
# create a LinearRegression estimator object
reg = LinearRegression()

# evaluate using 5-fold cross validation
cv_scores = cross_val_score(reg, X, y, cv=5, scoring="neg_mean_absolute_error")

`cv_scores` should reference an array of values recording the mean absolute error for the predictions of maize crop yield for each fold. Scikit-learn returns negative mean absolute error values (because its convention is that higher metric values are better than lower metric values, which holds for metrics for categorical outcomes such as accuracy). Therefore, we'll want to convert the negative mean absolute error values to positive.

In [None]:
# print cross validation test scores
for i, mae in enumerate(cv_scores):
    print(f"the mae for the {i}th fold is {round(abs(mae), 4)}")

If we want to use more than one metric to evaluate the model, we can pass in a list of metrics to the `scoring` argument. Let's also estimate the mean squared error value as well as the mean absolute error. The mean squared error penalises the model more for predictions with larger error.
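
To see why the mean squared error penalises large errors more heavily, here is a small sketch comparing two sets of made-up predictions that have the same mean absolute error. The values and variable names are hypothetical and are not taken from the lab's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# made-up observed yields (Mg/ha)
y_obs = np.array([2.0, 2.0, 2.0, 2.0])

# two sets of made-up predictions with the same total absolute error
pred_even = np.array([3.0, 3.0, 3.0, 3.0])     # four errors of 1
pred_outlier = np.array([2.0, 2.0, 2.0, 6.0])  # one error of 4

results = {}
for name, pred in [("even", pred_even), ("outlier", pred_outlier)]:
    mae = mean_absolute_error(y_obs, pred)
    mse = mean_squared_error(y_obs, pred)
    results[name] = (mae, mse)
    print(f"{name}: mae={mae}, mse={mse}")
```

Both sets of predictions have a mean absolute error of 1, but the set with a single large error has a mean squared error of 4 compared with 1 for the evenly spread errors.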

To use multiple metrics we need to use the `cross_validate()` function instead.

In [None]:
# evaluate using 5-fold cross validation
cv_scores = cross_validate(reg, X, y, cv=5, scoring=["neg_mean_absolute_error", "neg_mean_squared_error"])
cv_scores

#### Recap quiz

**Can you estimate the average mean absolute error and the average mean squared error across the five test folds using the `cv_scores` dictionary object?**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
print(f"mean mae: {abs(cv_scores['test_neg_mean_absolute_error'].mean())}")
print(f"mean mse: {abs(cv_scores['test_neg_mean_squared_error'].mean())}")
```
</details>

<p></p>

**Can you train and evaluate a random forests model using 5-fold cross-validation to see if it improves the predictive performance?**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>
    
```python
rf = RandomForestRegressor(n_estimators=20, random_state=rng)
rf_cv_scores = cross_validate(rf, X, y, cv=5, scoring=["neg_mean_absolute_error"])
rf_cv_scores
```
</details>

## Categorical features

In our `DataFrame` there are some categorical variables that could help improve our predictions of crop yield. These include the slope, soil type, and soil quality variables. These variables are `str` type. We can only pass numeric data into our models; therefore, we'll need to recode the text data to a numeric representation. 

One approach for recoding categorical data is one hot encoding. Each unique value in a one hot encoded categorical variable is assigned a new column in the `DataFrame`. For rows where this value is present a value of one is assigned, and zero otherwise.
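
As a quick illustration of the idea before applying it to the lab's data, the sketch below one hot encodes a made-up three-row table. The `toy` `DataFrame` and its column names are hypothetical.

```python
import pandas as pd

# made-up table with one categorical column
toy = pd.DataFrame({
    "field_id": [1, 2, 3],
    "slope": ["FLAT", "STEEP", "FLAT"],
})

# each unique value of "slope" becomes its own indicator column
toy_encoded = pd.get_dummies(toy, columns=["slope"])
print(toy_encoded)
```

The `slope` column is replaced by `slope_FLAT` and `slope_STEEP` columns. Depending on your pandas version, the indicator columns may display as `True` / `False` or as `1` / `0`; both are treated as numeric by Scikit-learn models.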

Let's one hot encode the slope variable `slope_sr` to illustrate this concept. 

The pandas `get_dummies()` function can be used to one hot encode a column in a `DataFrame`. The `get_dummies()` function has a `columns` argument that takes a list of column names that will be one hot encoded.

First, let's visualise our `DataFrame` `df` and inspect the values in the `slope_sr` column. You should see the values "FLAT", "MODERATE SLOPE", "SLIGHT SLOPE", "STEEP SLOPE" as `str` data. 

In [None]:
df.head()

Now, let's one hot encode the `slope_sr` variable and see how it is represented as numeric data (scroll to the far right of the displayed `DataFrame`).

In [None]:
df_cat = pd.get_dummies(df, columns=["slope_sr"])

In [None]:
df_cat.head()

In [None]:
df_cat.columns

Now, let's retrain our linear regression model using slope as a feature.

In [None]:
# get X and y data

## NOTE WE ARE USING df_cat here!!
# drop nas as Linear Regression object cannot be trained on datasets with missing data
df_linear_reg = df_cat.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

# create a LinearRegression estimator object
reg = LinearRegression()

# evaluate using 5-fold cross validation
cv_scores = cross_validate(reg, X, y, cv=5, scoring=["neg_mean_absolute_error", "neg_mean_squared_error"])
cv_scores

#### Recap quiz

**Can you also recode the soil type (`soiltype_sr`) and soil quality (`soilqual_sr`) variables from categorical to numeric using one hot encoding? Reference the result with the variable `df_2`.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
df_2 = pd.get_dummies(df, columns=["slope_sr", "soiltype_sr", "soilqual_sr"])
df_2.head()
```
</details>

**Can you use `df_2` with `soiltype_sr`, `slope_sr`, and `soilqual_sr` as training data in a random forests model?**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
# get X and y data

# NOTE WE USE df_2 here!!
# drop nas so the model is trained on rows with complete data
df_linear_reg = df_2.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE", 
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE", 
    "soiltype_sr_CLAY", 
    "soiltype_sr_LOAM",
    "soiltype_sr_OTHER (SPECIFY)", 
    "soiltype_sr_SANDY", 
    "soilqual_sr_FAIR",
    "soilqual_sr_GOOD", 
    "soilqual_sr_POOR"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE", 
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE", 
    "soiltype_sr_CLAY", 
    "soiltype_sr_LOAM",
    "soiltype_sr_OTHER (SPECIFY)", 
    "soiltype_sr_SANDY", 
    "soilqual_sr_FAIR",
    "soilqual_sr_GOOD", 
    "soilqual_sr_POOR"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

rf = RandomForestRegressor(n_estimators=20, random_state=rng)
rf_cv_scores = cross_validate(rf, X, y, cv=5, scoring=["neg_mean_absolute_error"])
rf_cv_scores
```
</details>

## Controlling randomness

Some elements of the machine learning workflow are inherently random. Examples include allocating data points to folds in k-fold cross-validation and bootstrap sampling of data to train decision trees in random forests.

While this randomness is important (e.g. to ensure unbiased estimates of model performance when using cross-validation), it presents a challenge for reproducible results. The randomness of estimators (e.g. an instance of `RandomForestRegressor()`) or a cross-validation splitter is controlled by a `random_state` parameter.

Some general tips on setting the `random_state` parameter:

* For reproducible results, never leave `random_state` set to `None`.
* Create a `RandomState` variable at the start of your program and pass it to all functions that accept a `random_state` argument. Look at the start of this notebook and see if you can spot where we create a `RandomState` variable just after we import the modules.
* If you're generating cross validation splits, use an integer value instead of a `RandomState` instance.

This is quite an advanced topic, but important to ensure your results are reproducible. Generally, following the guidelines above is the best way to go. However, you can read more about this topic <a href="https://scikit-learn.org/stable/common_pitfalls.html#general-recommendations" target="_blank">here</a>.
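
As a minimal sketch of the last tip, the example below passes an integer `random_state` to a `KFold` splitter and runs cross-validation twice. It uses synthetic data rather than the lab's dataset, and the names `X_toy`, `y_toy`, and `cv` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic regression data (not the lab's data)
data_rng = np.random.RandomState(0)
X_toy = data_rng.normal(size=(50, 2))
y_toy = X_toy[:, 0] * 2.0 + data_rng.normal(scale=0.1, size=50)

# an integer random_state gives the same shuffled folds each time split() is called
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores_a = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=cv, scoring="neg_mean_absolute_error"
)
scores_b = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=cv, scoring="neg_mean_absolute_error"
)

# identical folds, so the two runs produce identical scores
print(np.allclose(scores_a, scores_b))
```

Because the same folds are generated on every call, different models evaluated with this splitter are compared on the same training and test splits.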


## Feature importance

Machine learning models are often considered "black boxes". That is, it is not clear how the model is using input features to make predictions and what relationships it has learnt to relate features to outcomes.

One strategy to make machine learning models more interpretable is to compute feature importance; here we use permutation importance. The feature importance is a measure of how much the error in a model's predictions increases when the model loses access to the information in a feature. Features with larger importance scores are therefore more important for making accurate predictions.

The permutation feature importance score is computed as the decrease in a model's performance when a feature is randomly shuffled (permuted). This should ensure there is no relationship between the feature and the outcome. 

You can read more about feature importance in the <a href="" target="_blank">Interpretable Machine Learning</a> book and in the Scikit-learn <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance" target="_blank">docs</a>.
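
To make the shuffling idea concrete, here is a hand-rolled sketch of permutation importance on synthetic data. It is a simplified illustration with a single shuffle per feature, not Scikit-learn's implementation, and all variable names are hypothetical. The outcome is constructed to depend on the first feature only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic data: y depends strongly on column 0 and not at all on column 1
toy_rng = np.random.RandomState(0)
X_toy = toy_rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + toy_rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X_toy, y_toy)
baseline_mae = mean_absolute_error(y_toy, model.predict(X_toy))

increase_in_mae = []
for j in range(X_toy.shape[1]):
    X_shuffled = X_toy.copy()
    # shuffle one column to break its relationship with the outcome
    X_shuffled[:, j] = toy_rng.permutation(X_shuffled[:, j])
    shuffled_mae = mean_absolute_error(y_toy, model.predict(X_shuffled))
    # importance = how much the error grows once the feature is scrambled
    increase_in_mae.append(shuffled_mae - baseline_mae)

print(increase_in_mae)
```

Shuffling the first column should increase the error substantially, while shuffling the second column should leave it almost unchanged.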

First, let's set up and fit a linear regression model that predicts maize crop yield using vegetation indices and field characteristics. 

In [None]:
# NOTE we use df_cat here!!

# drop nas as Linear Regression object cannot be trained on datasets with missing data
df_linear_reg = df_cat.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE", 
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE", 
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

# create a LinearRegression estimator object
reg = LinearRegression()

# fit the model
reg.fit(X, y)

Now we compute permutation importance using 30 shuffles of each feature. The result referenced by `p_imp` is a dictionary-like object. Its `importances` array records the decrease in the model's score (here, the increase in mean absolute error) for each feature and each repeat of the random shuffling. Its `importances_mean` property is the mean increase in error across all repeats when a feature was randomly shuffled.

We use the `permutation_importance()` function and pass in the `LinearRegression()` estimator, the features and labels data (`X` and `y`), the metric to evaluate model performance to the `scoring` argument, and specify the number of repeats with the `n_repeats` argument. 

In [None]:
p_imp = permutation_importance(reg, X, y, scoring="neg_mean_absolute_error", n_repeats=30, random_state=rng)

In [None]:
p_imp["importances_mean"]

Let's use this data to make a permutation importance plot that visualises the increase in error when a feature is randomly shuffled.

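First, let's get a list of column headings for each feature and extract the mean importance values. Scikit-learn computes importance as the drop in the score when a feature is shuffled; because the score here is the negative mean absolute error, the values are already positive when shuffling a feature increases the error, so no sign conversion is needed.

In [None]:
columns = X.columns
p_imp = p_imp["importances_mean"]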

Now, let's combine the feature labels and importance values into a `DataFrame`, sort the `DataFrame` by the importance scores, and generate a bar plot. 

In [None]:
p_imp_df = pd.DataFrame({"feature": columns, "importance": p_imp})
p_imp_df = p_imp_df.sort_values(by=["importance"], ascending=True)
p_imp_df

In [None]:
fig = px.bar(p_imp_df, y="feature", x="importance", height=600)
fig.show()

#### Recap quiz

<details>
    <summary><b>Do the feature importance results make sense? Can you explain them?</b></summary>
The most important features for predictive performance are the vegetation indices. There is an established literature showing that vegetation indices are correlated with, and predictive of, crop yields.
    
However, we should be cautious in interpreting the differences between vegetation indices as it is likely that the vegetation indices are correlated (even if they're designed to capture different information about vegetation growth and condition). When one of the vegetation indices is permuted (shuffled), it is likely the model will still have access to information about this feature through other features in the model which it is correlated with. You can read more about this <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#misleading-values-on-strongly-correlated-features" target="_blank">here</a>.
</details>

<p></p>

<details>
    <summary><b>Here, we computed the feature importance scores using the training data. Can you think of a limit to computing feature importance with the training set compared to using the test set?</b></summary>
Computing feature importance using a held-out test set would indicate which features are important for the model's capacity to generalise well to unseen data. Features that are important for the training set might be causing the model to overfit. You can read more about this <a href="https://scikit-learn.org/stable/modules/permutation_importance.html#permutation-feature-importance" target="_blank">here</a>.
</details>
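
As a sketch of the alternative raised in the last question, the example below fits a model on a training split and computes permutation importance on a held-out test split. It uses synthetic data in which only the first feature is informative; the data and variable names are hypothetical and not part of the lab's dataset.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data (not the lab's data): only the first feature is informative
toy_rng = np.random.RandomState(0)
X_toy = toy_rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + toy_rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# shuffle features in the held-out data only; the model never saw these rows
test_imp = permutation_importance(
    model,
    X_test,
    y_test,
    scoring="neg_mean_absolute_error",
    n_repeats=10,
    random_state=0,
)

print(test_imp["importances_mean"])
```

Features whose importance stays close to zero on the held-out data contribute little to the model's ability to generalise, even if they appear useful on the training set.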

## Final activity

You will notice in `df` that there are some columns related to mixed cropping in some of the maize fields (e.g. `intercrop_legume`, `intercrop_cassava`, `crop_rotation`, `purestand`). One issue that could be affecting model performance is that we're using average vegetation indices across the whole field, but not all of the field is maize cropping. This means that our vegetation index data is not purely capturing a maize crop signal but also the condition of other crops. We might be able to improve the model's performance if we restrict our analysis to pure maize fields or control for the effect of mixed cropping.

**Can you create a training set that uses one or more of the `intercrop_legume`, `intercrop_cassava`, `crop_rotation`, and `purestand` variables to train a model that predicts maize yield and accounts for the mixed cropping practices inherent in smallholder systems in Uganda? Evaluate your model using cross-validation and justify the rationale for your approach.**

<details>
    <summary><b>answer 1</b></summary>

Here, we filter out any data points representing mixed cropping fields using the condition `df["purestand"] == 1`, where a value of 1 in the `purestand` column indicates pure maize cropping.
    
This approach means our vegetation indices should just be capturing information about maize crop condition. 
    
```python
# get X and y data

# drop mixed cropping fields
df_pure = df.loc[df["purestand"] == 1, :]

# one hot encode categorical predictors
df_pure = pd.get_dummies(df_pure, columns=["slope_sr", "soiltype_sr", "soilqual_sr"])

# drop nas as Linear Regression object cannot be trained on datasets with missing data
df_linear_reg = df_pure.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

# create a LinearRegression estimator object
reg = LinearRegression()

# evaluate using 5-fold cross validation
cv_scores = cross_validate(reg, X, y, cv=5, scoring=["neg_mean_absolute_error"])
cv_scores
```
</details>

<details>
    <summary><b>answer 2</b></summary>

Here, we control for the presence of mixed cropping by using intercropping and crop rotation variables as features in the model.  
    
This approach might be suited to generating a maize crop prediction model that's applicable to the Ugandan context where mixed cropping is prevalent and pure maize fields are uncommon.  
    
```python
# get X and y data

# one hot encode categorical predictors
df_2 = pd.get_dummies(df, columns=["slope_sr", "soiltype_sr", "soilqual_sr"])

# drop nas as Linear Regression object cannot be trained on datasets with missing data
df_linear_reg = df_2.loc[: , [
    "cc_yield",
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE",
    "intercrop_legume",
    "intercrop_cassava",
    "crop_rotation"]].dropna()

# get X
X = df_linear_reg.loc[:, [
    "gcvi_doy_151", 
    "gcvi_doy_171", 
    "ndvi_doy_151", 
    "ndvi_doy_171", 
    "mtci_doy_151", 
    "mtci_doy_171",
    "slope_sr_FLAT",
    "slope_sr_MODERATE SLOPE",
    "slope_sr_SLIGHT SLOPE",
    "slope_sr_STEEP SLOPE",
    "intercrop_legume",
    "intercrop_cassava",
    "crop_rotation"]]

# get Y
y = df_linear_reg.loc[:, "cc_yield"]

# create a LinearRegression estimator object
reg = LinearRegression()

# evaluate using 5-fold cross validation
cv_scores = cross_validate(reg, X, y, cv=5, scoring=["neg_mean_absolute_error"])
cv_scores
```
</details>