# Final Project — Random Forest Modeling of Influenza Prevalence
## 1. Introduction and Paper Selection

For this project, I explored Random Forest Regression as my machine-learning method to predict avian influenza prevalence (H5 and H7 strains) across U.S. counties. Avian influenza is a viral disease found in wild birds, and its H5 and H7 subtypes are closely monitored because they vary seasonally and can develop into highly pathogenic forms. 

My modeling approach was inspired by the paper:

> Singleton, R., Poff, A., Maldonado, M., & Harding, K. (2024). Species distribution modeling for disease ecology: A multi-scale case study for schistosomiasis host snails in Brazil. *PLOS Global Public Health*.

Their study used spatial machine-learning models (including Random Forests) to model ecological risk based on spatial and environmental predictors. I adapted the same workflow to a new ecological system: influenza prevalence prediction using latitude, longitude, and month.

---

## 2. Data Collection and Preprocessing
### Raw Data Documentation

The influenza dataset includes the following columns:

- `Latitude`
- `Longitude`
- `Month`
- `Prevalence`
- `CommonN` (number of samples)
- `County`
- `State`
- `GEOID`

Of these columns, latitude, longitude, and month serve as predictors and capture spatial and seasonal influenza patterns. The dataset does not include environmental variables such as temperature or vegetation.

### Data Cleaning

The dataset was generally clean, but I performed basic validation steps:

- Checked for missing values
- Confirmed latitude, longitude, and month were numeric
- Verified month values were between 1 and 12
- Split the dataset into H5 and H7
- Ensured prevalence values were realistic
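The checks above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the rows are placeholder values, the `Subtype` column used to separate H5 from H7 is a hypothetical name, and treating prevalence as a proportion between 0 and 1 is an assumption about what "realistic" means here.

```python
import pandas as pd

# Placeholder rows standing in for the real dataset. "Subtype" is a
# hypothetical column name; the real data may separate H5/H7 differently.
df = pd.DataFrame({
    "Latitude": [44.9, 38.6, 29.8, 47.6],
    "Longitude": [-93.2, -121.5, -95.4, -122.3],
    "Month": [1, 6, 11, 12],
    "Prevalence": [0.02, 0.00, 0.05, 0.01],
    "Subtype": ["H5", "H7", "H5", "H7"],
})

NUMERIC_COLS = ["Latitude", "Longitude", "Month", "Prevalence"]


def validate(frame: pd.DataFrame) -> pd.DataFrame:
    """Run the basic validation checks and return the frame unchanged."""
    # Check for missing values in the modeling columns.
    missing = frame[NUMERIC_COLS].isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")

    # Confirm the modeling columns are numeric.
    for col in NUMERIC_COLS:
        if not pd.api.types.is_numeric_dtype(frame[col]):
            raise TypeError(f"Column {col!r} is not numeric")

    # Verify months fall between 1 and 12 (inclusive).
    if not frame["Month"].between(1, 12).all():
        raise ValueError("Month values outside 1-12")

    # Assumption: prevalence is a proportion, so it should lie in [0, 1].
    if not frame["Prevalence"].between(0, 1).all():
        raise ValueError("Prevalence values outside [0, 1]")

    return frame


clean = validate(df)

# Split into one frame per subtype.
h5 = clean[clean["Subtype"] == "H5"].reset_index(drop=True)
h7 = clean[clean["Subtype"] == "H7"].reset_index(drop=True)
print(len(h5), "H5 rows;", len(h7), "H7 rows")
```

Raising on the first failed check keeps problems visible early; a more forgiving version could collect all issues and report them together.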



---

## 3. Data Visualization
### Spatial Distribution Plots

Before modeling, I created scatterplots showing influenza prevalence across the U.S. for both strains. Each point represents a sampling location, colored by prevalence. These visualizations reveal geographic clustering and early spatial patterns.
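A plot of this kind could be produced along the following lines. The data below is randomly generated placeholder data (not the project's dataset), and the function name `plot_prevalence_map` is made up for this sketch.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder points roughly spanning the contiguous U.S.; not real data.
rng = np.random.default_rng(0)
n = 200
h5 = pd.DataFrame({
    "Longitude": rng.uniform(-124, -67, n),
    "Latitude": rng.uniform(25, 49, n),
    "Prevalence": rng.beta(0.5, 20, n),  # right-skewed values near zero
})


def plot_prevalence_map(frame: pd.DataFrame, title: str):
    """Scatter sampling locations, colored by prevalence."""
    fig, ax = plt.subplots(figsize=(8, 5))
    points = ax.scatter(
        frame["Longitude"],
        frame["Latitude"],
        c=frame["Prevalence"],
        cmap="viridis",
        s=15,
    )
    fig.colorbar(points, ax=ax, label="Prevalence")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    ax.set_title(title)
    return fig, ax


fig, ax = plot_prevalence_map(h5, "H5 prevalence by sampling location")
```

Calling the same function on the H7 frame gives a directly comparable second panel.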

### Prevalence Distribution (Histograms)

To better understand the target variable, I plotted prevalence histograms for both H5 and H7.

These histograms show that prevalence values are highly right-skewed and concentrated near zero, which is common in ecological disease datasets.
Understanding this distribution helps explain the model's performance. It also supports the choice of Random Forests, which make no assumption that the target is normally distributed.
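As a rough sketch, a histogram and a quick skewness check might look like the following. The prevalence values are drawn from a Beta distribution purely as a stand-in for skewed, near-zero data; they are not the project's values.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder target values: right-skewed and concentrated near zero.
rng = np.random.default_rng(1)
prevalence = pd.Series(rng.beta(0.5, 20, size=500), name="Prevalence")

fig, ax = plt.subplots(figsize=(6, 4))
counts, edges, _ = ax.hist(prevalence, bins=20, edgecolor="black")
ax.set_xlabel("Prevalence")
ax.set_ylabel("Number of records")
ax.set_title("Prevalence distribution (placeholder data)")

# A positive sample skewness indicates a long right tail.
skewness = prevalence.skew()
print(f"Sample skewness: {skewness:.2f}")
print(f"Share of records in the lowest bin: {counts[0] / counts.sum():.0%}")
```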

### Setting Up Predictors

- X (features): latitude, longitude, month
- y (target): prevalence
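In pandas terms, this setup amounts to selecting three columns for `X` and one for `y`. The column names match those listed in the data documentation; the rows are placeholders.

```python
import pandas as pd

# Placeholder rows; column names follow the dataset documentation.
h5 = pd.DataFrame({
    "Latitude": [44.9, 38.6, 29.8, 47.6],
    "Longitude": [-93.2, -121.5, -95.4, -122.3],
    "Month": [1, 6, 11, 12],
    "Prevalence": [0.02, 0.00, 0.05, 0.01],
})

feature_cols = ["Latitude", "Longitude", "Month"]
X = h5[feature_cols]   # predictors
y = h5["Prevalence"]   # target
```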


---

## 4. Modeling and Analysis
### Model Selection

I used RandomForestRegressor, which is well suited for:

- Nonlinear ecological relationships
- Spatial datasets
- Noisy prevalence values
- Generating interpretable feature importance

### Train/Test Split

I used an 80/20 train/test split to evaluate the model.
This keeps most of the data available for training while holding out 20% of the records to assess how well the model generalizes to unseen data.
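A minimal sketch of the split and model fit with scikit-learn is shown below. The data is synthetic, with a seasonal signal built in so the model has something to learn. The hyperparameters (`n_estimators=200`, `random_state=42`) are assumptions for illustration, not the settings used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic placeholder data with a built-in seasonal pattern.
rng = np.random.default_rng(42)
n = 500
month = rng.integers(1, 13, n)  # integers 1..12
data = pd.DataFrame({
    "Latitude": rng.uniform(25, 49, n),
    "Longitude": rng.uniform(-124, -67, n),
    "Month": month,
})
seasonal = 0.03 + 0.02 * np.cos(2 * np.pi * (month - 1) / 12)
data["Prevalence"] = np.clip(seasonal + rng.normal(0, 0.005, n), 0, 1)

X = data[["Latitude", "Longitude", "Month"]]
y = data["Prevalence"]

# 80/20 split; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters here are illustrative assumptions.
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
```

Note that a purely random split can place nearby locations in both the training and test sets, which may make test scores look better than they would on a new region.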

### Model Evaluation

After training, I evaluated both H5 and H7 models using R² and MSE.
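For reference, both metrics follow their standard definitions, written here with notation introduced for this note: $y_i$ is the observed prevalence for test record $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean observed prevalence in the test set, and $n$ is the number of test records.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

Because MSE is expressed in squared prevalence units, very small values are expected whenever prevalence itself is close to zero, so R² is the easier of the two to compare across subtypes.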

I also added Predicted vs. Actual scatterplots, which visually show model fit quality:

- H5 points cluster tightly near the diagonal line → strong predictive accuracy
- H7 points show more scatter → weaker predictability
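The evaluation step can be sketched as follows. The arrays `y_test` and `y_pred` are placeholders standing in for the held-out prevalence values and the output of `model.predict(X_test)`, so the printed numbers are not the project's results.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Placeholder arrays; in the real workflow y_pred = model.predict(X_test).
rng = np.random.default_rng(7)
y_test = rng.beta(0.5, 20, 100)
y_pred = y_test + rng.normal(0, 0.003, 100)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2 = {r2:.4f}, MSE = {mse:.7f}")

# Predicted vs. actual scatterplot with a 1:1 reference line.
fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y_test, y_pred, s=12, alpha=0.7)
low = min(y_test.min(), y_pred.min())
high = max(y_test.max(), y_pred.max())
ax.plot([low, high], [low, high], linestyle="--", color="gray", label="1:1 line")
ax.set_xlabel("Actual prevalence")
ax.set_ylabel("Predicted prevalence")
ax.set_title("Predicted vs. actual (placeholder data)")
ax.legend()
```

Points lying on the dashed line are perfect predictions; vertical distance from the line is the prediction error for that record.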



---

## 5. Additional Visualizations for Interpretation
### Seasonal Trend Lines

Since Month was the most important feature in both models, I plotted average prevalence by month for H5 and H7.

These seasonal trend lines clearly show why Month dominates the model:

- Both strains show strong increases and decreases across the year
- H5 has sharper and more consistent seasonal patterns
- H7 varies more and has less predictable seasonality
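Monthly averages like these can be computed with a pandas `groupby`. In the sketch below the records are synthetic, the `Subtype` column is a hypothetical name, and the noisier H7 series is built into the placeholder data only to make the two lines visually distinct.

```python
import numpy as np
import pandas as pd

# Synthetic placeholder records: 10 full years of months per subtype.
rng = np.random.default_rng(3)
months = np.tile(np.arange(1, 13), 20)    # 240 rows in total
subtype = np.repeat(["H5", "H7"], 120)    # hypothetical column values
seasonal = 0.03 + 0.02 * np.cos(2 * np.pi * (months - 1) / 12)
noise_sd = np.where(subtype == "H5", 0.002, 0.008)

records = pd.DataFrame({
    "Month": months,
    "Subtype": subtype,
    "Prevalence": np.clip(seasonal + rng.normal(0, noise_sd), 0, 1),
})

# Average prevalence per month, one column per subtype.
monthly = (
    records.groupby(["Subtype", "Month"])["Prevalence"]
    .mean()
    .unstack("Subtype")
)

# One trend line per subtype (pandas plotting uses matplotlib).
ax = monthly.plot(marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Mean prevalence")
ax.set_title("Average prevalence by month (placeholder data)")
```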


---

## 6. Random Forest Results for H5 and H7 Avian Influenza Models
### Model Performance Summary

The models performed well overall, with H5 clearly stronger than H7:

- H5: R² = 0.8927, MSE = 0.0001096
- H7: R² = 0.6978, MSE = 0.0003340

The Predicted vs. Actual visualizations confirm these metrics:
H5 predictions fall nearly on the 1:1 line, while H7 predictions are more dispersed.

### Feature Importance

Ranking of predictors:

1. Month (strongest)
2. Longitude
3. Latitude

The seasonal trend line plots support this result, showing a clear monthly pattern in influenza prevalence.
The prevalence histograms also show strong right-skew, indicating that most counties have low prevalence with few high outliers.
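A ranking like the one above can be read from a fitted model's `feature_importances_` attribute. The sketch below uses synthetic data in which prevalence depends only on month plus noise, so the resulting ranking illustrates the mechanics rather than reproducing the project's numbers. The hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic placeholder data: prevalence driven by month plus noise.
rng = np.random.default_rng(11)
n = 500
month = rng.integers(1, 13, n)
X = pd.DataFrame({
    "Latitude": rng.uniform(25, 49, n),
    "Longitude": rng.uniform(-124, -67, n),
    "Month": month,
})
y = np.clip(
    0.03 + 0.02 * np.cos(2 * np.pi * (month - 1) / 12) + rng.normal(0, 0.005, n),
    0,
    1,
)

# Illustrative hyperparameters, not the project's settings.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor continuous features with many possible split points, so checking the ranking against `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check.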

### Interpretation

- Seasonality (Month) is the dominant driver for both strains.
- H5 is more predictable, reflecting clearer ecological structure.
- H7 is more variable, suggesting missing environmental predictors.
- Spatial features (Lat/Long) still matter but much less than month.

Together, the model results and visualizations reveal strong seasonal patterns and moderate spatial patterns, with H5 showing clearer structure than H7.

---

## 7. Discussion and Reflection
### Discussion

These results support the idea that influenza prevalence follows predictable seasonal cycles. Bird migration and climate are plausible drivers of those cycles, although neither was measured directly in this dataset.
H5 aligns well with basic spatial-seasonal predictors, while H7 appears to require additional ecological information to fully explain its variation.

### Reflection

#### What went well:

- Strong H5 model performance
- Visualizations clearly reinforced model results
- Random Forests handled the simple predictor set effectively

#### Challenges:

- Lack of environmental predictors
- H7 prevalence was harder to model
- Dataset skew required careful interpretation

#### What I learned:

- Data visualization is essential for confirming model behavior
- Even small predictor sets can reveal meaningful ecological patterns
- Different subtypes can behave differently ecologically