## Fighting Fire With Data - Divine Saungweme

Each year, thousands of fires blaze across the African continent. Some are natural occurrences, part of a ‘fire cycle’ that can actually benefit some dryland ecosystems. Many are started intentionally, used to clear land or to prepare fields for planting. And some are wildfires, which can rage over large areas and cause huge amounts of damage. Whatever the cause, fires pour vast amounts of CO2 into the atmosphere, along with smoke that degrades air quality for those living downwind.

Figuring out the dynamics that influence where and when these fires occur can help us to better understand their effects, and predicting how these dynamics will play out in the future, under different climatic conditions, could prove extremely useful. That is the goal of this project. The dataset contains monthly aggregated data on burned areas across Zimbabwe from 2001 to the end of 2013, along with additional information (such as rainfall, temperature and land cover) that extends into the test period. The project is centred around building a model capable of predicting the burned area in different locations from this information alone. Predictions of this kind remain relevant today.

### Loading the packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from jupyterthemes import jtplot

warnings.filterwarnings('ignore')
jtplot.style(context='notebook', theme='monokai', grid=False)  # grid takes a boolean, not the string 'False'

%matplotlib inline

### Loading the data

In [2]:
# read_csv returns a single DataFrame, so each file is loaded separately
train_set = pd.read_csv('Train.csv')
test_set = pd.read_csv('Test.csv')  # assumption: the test file is named 'Test.csv'

In [None]:
train_set.head()

### Exploratory Data Analysis

First, I need to drop attributes that hold the same value in every row, since they have zero variance and carry no information.

In [None]:
drop_list = train_set.nunique()[train_set.nunique()==1].index

for dropper in train_set, test_set:
    dropper.drop(drop_list, axis=1, inplace=True)
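The constant-column check above can be tried on a small frame first. This is a minimal sketch: the helper name `constant_columns`, the toy data and the column `landcover_x` are made up for illustration and are not part of the dataset.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold exactly one distinct value."""
    counts = df.nunique()
    return counts[counts == 1].index.tolist()


# Toy frame; 'landcover_x' is a hypothetical column that never changes
toy = pd.DataFrame({
    "burn_area": [0.0, 0.2, 0.1],
    "landcover_x": [1, 1, 1],
    "elevation": [900, 1200, 1500],
})

to_drop = constant_columns(toy)
cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```

Note that `nunique()` ignores missing values by default, so a column that is entirely empty has zero distinct values and is not caught by this check.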

Let's see how all these attributes relate to each other using correlation.

In [None]:
plt.figure(figsize=(16, 16))
sns.heatmap(train_set.select_dtypes('number').corr())  # correlate numeric columns only ('ID' is text)

Let's see how the other attributes correlate with the 'burn_area' attribute (which is the target).

In [None]:
train_set.select_dtypes('number').corr()['burn_area'].sort_values().plot(kind='bar', figsize=(18, 6))
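To read the ranking as numbers rather than bars, something like the following could be used. It is only a sketch: `target_correlations` is a hypothetical helper, and the toy columns `dryness` and `rain` are invented so that the result is easy to check by eye.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every numeric column with the target, sorted ascending."""
    numeric = df.select_dtypes("number")  # text columns such as 'ID' are skipped
    return numeric.corr()[target].drop(target).sort_values()


# Toy data: 'dryness' rises with burn_area, 'rain' falls with it
toy = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "burn_area": [0.0, 0.1, 0.2, 0.3],
    "dryness": [1.0, 2.0, 3.0, 4.0],
    "rain": [4.0, 3.0, 2.0, 1.0],
})

ranked = target_correlations(toy, "burn_area")
print(ranked)
```

Dropping the target from the result removes its trivial correlation of 1 with itself, which would otherwise dominate the bar chart.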

The 'ID' attribute can be a useful key. It can provide us with month and year attributes. Let's see what I can salvage from it.

In [None]:
train_set['date'] = pd.to_datetime(train_set['ID'].apply(lambda x: x.split('_')[1]))

In [None]:
train_set['month'] = train_set['date'].dt.month
train_set['year'] = train_set['date'].dt.year
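The same extraction can be checked on a few hand-written IDs. This sketch assumes the IDs look like `<area>_<YYYY-MM-DD>`; the code above only confirms that the part after the underscore is a date, so the exact layout and the `area` column are assumptions.

```python
import pandas as pd

# Hypothetical IDs in the assumed '<area>_<YYYY-MM-DD>' layout
toy = pd.DataFrame({"ID": ["0_2001-01-01", "0_2001-02-01", "1_2013-12-01"]})

# Split once on the first underscore into an area part and a date part
parts = toy["ID"].str.split("_", n=1, expand=True)
toy["area"] = parts[0].astype(int)
toy["date"] = pd.to_datetime(parts[1], format="%Y-%m-%d")

# Calendar features derived from the parsed date
toy["month"] = toy["date"].dt.month
toy["year"] = toy["date"].dt.year
print(toy)
```

Passing an explicit `format` makes the parse fail loudly if an ID does not match the expected layout, instead of guessing.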

In [None]:
train_set['month'].value_counts()

In [None]:
train_set['year'].value_counts()

Now that we have more useful attributes in hand, let's explore them with some data visualization.

In [None]:
# Average only the target so that text and date columns are not passed to mean()
train_set.groupby('month')['burn_area'].mean().reset_index().plot(x='month', y='burn_area', kind='bar')
plt.show()

Given the seasonal conditions in my country, it makes sense that 'burn_area' starts to rise after the middle of the year. The weather turns windy in the 8th month, and the 9th and 10th months are well known for their intense sun. The plot supports the view that windy, sunny conditions favour forest fires.

In [None]:
train_set.groupby('year')['burn_area'].mean().reset_index().plot(x='year', y='burn_area', ylim=(0, 0.03))
plt.show()

Let's compare the 'burn_area' with 'precipitation'

In [None]:
# Compute the per-date means once and reuse them for both lines
by_date = train_set.groupby('date')[['burn_area', 'precipitation']].mean().reset_index()
ax = by_date.plot(y='burn_area', x='date', figsize=(18, 5))
by_date.plot(y='precipitation', x='date', ax=ax)

The plot shows the expected pattern: when precipitation is high, forest fires are unlikely to burn a large area.

In [None]:
# Same comparison for vapour pressure, again computing the per-date means once
by_date = train_set.groupby('date')[['burn_area', 'climate_vap']].mean().reset_index()
ax = by_date.plot(y='burn_area', x='date', figsize=(18, 5))
by_date.plot(y='climate_vap', x='date', ax=ax)

In [None]:
# Draw one sample so that each x value is paired with the y value from the same row
sample = train_set.sample(500)
plt.scatter(sample['climate_pr'], sample['burn_area'], alpha=0.3)
plt.xlabel('Climate Precipitation')
plt.ylabel('Burn Area')
plt.show()

The plot above suggests that large burn areas occur mostly when there is very little rain.

Let's plot the longitude and latitude points and see what we can come up with.

In [None]:
train_set.plot(x='lon', y='lat', kind='scatter')
plt.show()

Well, what do you know? It's the outline of Zimbabwe itself. This shows that the data was collected from locations across the whole country.

### Splitting Data For Training

We don't want to just split randomly - this would give us artificially high scores. Instead, let's use the last 3 years of the dataset for validation to more closely match the test configuration.

In [None]:
train_all = train_set.copy().dropna()
train = train_all.loc[train_all.date < '2011-01-01']
valid = train_all.loc[train_all.date >= '2011-01-01']  # the last 3 years (2011 to 2013)
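The idea of a time-based split can be illustrated on toy data. This is a minimal sketch: `split_by_date` is a hypothetical helper and the monthly toy frame is generated, not taken from the dataset.

```python
import pandas as pd


def split_by_date(df: pd.DataFrame, cutoff: str, date_col: str = "date"):
    """Rows before the cutoff go to training, rows on or after it to validation."""
    cutoff_ts = pd.Timestamp(cutoff)
    train_part = df.loc[df[date_col] < cutoff_ts]
    valid_part = df.loc[df[date_col] >= cutoff_ts]
    return train_part, valid_part


# Toy frame with one row per month from January 2009 to December 2013
toy = pd.DataFrame({"date": pd.date_range("2009-01-01", "2013-12-01", freq="MS")})
toy["burn_area"] = 0.0

toy_train, toy_valid = split_by_date(toy, "2011-01-01")
print(len(toy_train), len(toy_valid))
```

Using a single cutoff for both sides guarantees that no row lands in both sets or in neither, and that every validation date comes after every training date.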

We will take the columns (attributes) that serve as inputs for our prediction.

In [None]:
in_cols = ['climate_aet', 'climate_def',
       'climate_pdsi', 'climate_pet', 'climate_pr', 'climate_ro',
       'climate_soil', 'climate_srad', 'climate_tmmn',
       'climate_tmmx', 'climate_vap', 'climate_vpd', 'climate_vs', 'elevation',
       'landcover_0', 'landcover_1', 'landcover_2', 'landcover_3',
       'landcover_4', 'landcover_5', 'landcover_6', 'landcover_7',
       'landcover_8', 'precipitation']
target_col = 'burn_area'

### Training Model

In [None]:
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error

# Get our X and y training and validation sets ready
X_train, y_train = train[in_cols], train[target_col]
X_valid, y_valid = valid[in_cols], valid[target_col]

# Create and fit the model
model = CatBoostRegressor()
model.fit(X_train, y_train)

# Make predictions
preds = model.predict(X_valid)

# Score
mean_squared_error(y_valid, preds)**0.5 # RMSE - Lower is better

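The RMSE above is the model's error on the validation set. Whether it counts as good depends on how it compares with a simple reference, such as always predicting the mean burn area.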
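One way to put the validation RMSE in context is to compare it with a baseline that always predicts the training mean. This is a sketch under that assumption; the helper names `rmse` and `mean_baseline_rmse` and the toy numbers are illustrative only and are not results from this dataset.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_train, y_valid) -> float:
    """RMSE obtained by predicting the training mean for every validation row."""
    baseline = np.full(len(y_valid), np.mean(y_train))
    return rmse(y_valid, baseline)


# Toy targets, not taken from the dataset
toy_y_train = [0.0, 0.0, 0.1, 0.3]
toy_y_valid = [0.0, 0.2]
print(mean_baseline_rmse(toy_y_train, toy_y_valid))
```

With the notebook's variables, `mean_baseline_rmse(y_train, y_valid)` gives the figure to compare against the model's RMSE: the model is only adding value if its error is clearly lower.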