# Predicting Hazardous Weather Events with Scikit-Learn 

In this notebook, we will use the popular <a href="https://scikit-learn.org/stable/" target="_blank">Scikit-Learn</a> machine learning library to predict tornados. The data is from the <a href="https://www.noaa.gov/" target="_blank">National Oceanic and Atmospheric Administration (NOAA)</a>, a U.S. government agency that monitors the climate and forecasts the weather in the United States and other countries. For this notebook, we will be using data from 1975 for 84 unique weather stations in the United States and Europe. We will be using a <a href="https://en.wikipedia.org/wiki/Logistic_regression" target="_blank">logistic regression</a> model to predict the presence or absence of a tornado at each weather station using 29 predictor variables including temperature, atmospheric pressure, and thunder.

First, we will import the required modules, namely <a href="https://pandas.pydata.org/" target="_blank">Pandas</a> for reading in and processing the data and <a href="https://scikit-learn.org/stable/" target="_blank">Scikit-Learn</a> for fitting the logistic regression model. We will also use <a href="https://numpy.org/" target="_blank">NumPy</a> for some basic vector operations.

In [1]:
import numpy as np
import pandas as pd
import warnings

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from zipfile import ZipFile

In [3]:
warnings.filterwarnings(action = 'ignore')

## Reading in the Data

In [4]:
with ZipFile('data/1975_dir.zip', mode = 'r') as zip_ref:
    zip_ref.extractall('data/1975_dir/')

In [5]:
df = pd.read_csv('data/1975_dir/1975weatherdata.csv')

In [6]:
df

Unnamed: 0.1,Unnamed: 0,station_num,year,month,day,WBAN,temp_ft,dewpt_ft,slp_mb,STP,...,rain,snow,hail,thunder,tornado,max_temp_frnht,min_temp_frnht,precip_in,precip_flag,SNDP
0,0,722210,1975,1,1,13858.0,66.2,62.7,1023.0,0.0,...,0,0,0,0,0,78.1,0.0,0.00,I,
1,1,722210,1975,1,2,13858.0,54.1,37.6,1025.2,0.0,...,0,0,0,0,0,0.0,45.0,0.00,I,
2,2,722210,1975,1,3,13858.0,0.0,48.7,1021.6,0.0,...,0,0,0,0,0,64.0,51.1,0.00,,
3,3,722210,1975,1,4,13858.0,54.6,52.3,1018.0,0.0,...,0,0,0,0,0,63.0,0.0,0.43,G,
4,4,722210,1975,1,5,13858.0,43.2,34.3,1024.0,0.0,...,0,0,0,0,0,0.0,32.0,0.00,G,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349522,347,105260,1975,12,27,,36.6,36.6,1030.2,,...,1,0,0,0,0,37.4,33.8,0.08,F,
1349523,348,105260,1975,12,28,,30.8,30.8,1027.8,,...,0,0,0,0,0,32.0,28.4,0.00,F,
1349524,349,105260,1975,12,29,,28.6,28.6,0.0,,...,0,0,0,0,0,30.2,28.4,0.00,F,
1349525,350,105260,1975,12,30,,0.0,0.0,1025.7,,...,0,0,0,0,0,30.2,28.4,0.00,F,


Upon reading in the data, the first thing we notice is a column labeled `Unnamed: 0`, which appears to duplicate the index. Since this column adds no meaningful information to our dataset, let's go ahead and drop it.

In [7]:
df = df.drop('Unnamed: 0', axis = 1)
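
A column named `Unnamed: 0` usually appears when a CSV was written with its index and then read back without saying so. The sketch below illustrates this on a tiny in-memory CSV (the `csv_text` contents are made up for illustration, not taken from the NOAA file) and shows that passing `index_col=0` to `pd.read_csv()` avoids the extra column in the first place. Whether this applies cleanly to the real file depends on how it was written.

```python
import io

import pandas as pd

# Hypothetical miniature CSV whose first header cell is empty,
# as happens when a DataFrame is saved together with its index.
csv_text = ",station_num,temp_ft\n0,722210,66.2\n1,722210,54.1\n"

# Default read: the unlabeled first column becomes 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the index instead of as data.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```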

In [8]:
df

Unnamed: 0,station_num,year,month,day,WBAN,temp_ft,dewpt_ft,slp_mb,STP,visib_mi,...,rain,snow,hail,thunder,tornado,max_temp_frnht,min_temp_frnht,precip_in,precip_flag,SNDP
0,722210,1975,1,1,13858.0,66.2,62.7,1023.0,0.0,2.5,...,0,0,0,0,0,78.1,0.0,0.00,I,
1,722210,1975,1,2,13858.0,54.1,37.6,1025.2,0.0,12.8,...,0,0,0,0,0,0.0,45.0,0.00,I,
2,722210,1975,1,3,13858.0,0.0,48.7,1021.6,0.0,11.0,...,0,0,0,0,0,64.0,51.1,0.00,,
3,722210,1975,1,4,13858.0,54.6,52.3,1018.0,0.0,12.0,...,0,0,0,0,0,63.0,0.0,0.43,G,
4,722210,1975,1,5,13858.0,43.2,34.3,1024.0,0.0,14.1,...,0,0,0,0,0,0.0,32.0,0.00,G,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349522,105260,1975,12,27,,36.6,36.6,1030.2,,1.2,...,1,0,0,0,0,37.4,33.8,0.08,F,
1349523,105260,1975,12,28,,30.8,30.8,1027.8,,0.0,...,0,0,0,0,0,32.0,28.4,0.00,F,
1349524,105260,1975,12,29,,28.6,28.6,0.0,,1.4,...,0,0,0,0,0,30.2,28.4,0.00,F,
1349525,105260,1975,12,30,,0.0,0.0,1025.7,,2.0,...,0,0,0,0,0,30.2,28.4,0.00,F,


Much better. The next thing we notice is that several rows have a temperature or dewpoint of exactly zero. While zero degrees Fahrenheit is a valid temperature, the frequency of these occurrences, as well as the surrounding values, suggests that zero may be standing in for missing values. Thus, we will drop all rows where the temperature or the dewpoint is zero.

In [9]:
df = df[(df['temp_ft'] != 0) & (df['dewpt_ft'] != 0)]
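
To see what this filter does, here is a minimal sketch on a toy frame (the values are illustrative, loosely modeled on the rows shown above). It first counts how many zeros each column contains, which is a quick way to judge whether zero looks like a placeholder, and then applies the same boolean mask as the cell above.

```python
import pandas as pd

# Toy data; only the column names match the real dataset.
toy = pd.DataFrame({
    "temp_ft": [66.2, 54.1, 0.0, 54.6, 43.2],
    "dewpt_ft": [62.7, 37.6, 48.7, 0.0, 34.3],
})

# Number of exact zeros per column.
zero_counts = (toy[["temp_ft", "dewpt_ft"]] == 0).sum()

# Keep only rows where both readings are non-zero.
kept = toy[(toy["temp_ft"] != 0) & (toy["dewpt_ft"] != 0)]

print(zero_counts)
print(kept)
```

Note that the row index is not renumbered after filtering, which is why gaps appear in the index of the filtered table.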

In [10]:
df

Unnamed: 0,station_num,year,month,day,WBAN,temp_ft,dewpt_ft,slp_mb,STP,visib_mi,...,rain,snow,hail,thunder,tornado,max_temp_frnht,min_temp_frnht,precip_in,precip_flag,SNDP
0,722210,1975,1,1,13858.0,66.2,62.7,1023.0,0.0,2.5,...,0,0,0,0,0,78.1,0.0,0.00,I,
1,722210,1975,1,2,13858.0,54.1,37.6,1025.2,0.0,12.8,...,0,0,0,0,0,0.0,45.0,0.00,I,
3,722210,1975,1,4,13858.0,54.6,52.3,1018.0,0.0,12.0,...,0,0,0,0,0,63.0,0.0,0.43,G,
4,722210,1975,1,5,13858.0,43.2,34.3,1024.0,0.0,14.1,...,0,0,0,0,0,0.0,32.0,0.00,G,
6,722210,1975,1,7,13858.0,50.7,46.8,1021.1,0.0,0.0,...,0,0,0,0,0,0.0,0.0,0.00,I,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349521,105260,1975,12,26,,34.0,34.0,1027.6,,2.0,...,1,0,0,0,0,37.4,30.2,0.08,F,
1349522,105260,1975,12,27,,36.6,36.6,1030.2,,1.2,...,1,0,0,0,0,37.4,33.8,0.08,F,
1349523,105260,1975,12,28,,30.8,30.8,1027.8,,0.0,...,0,0,0,0,0,32.0,28.4,0.00,F,
1349524,105260,1975,12,29,,28.6,28.6,0.0,,1.4,...,0,0,0,0,0,30.2,28.4,0.00,F,


Looks good. However, there is a column called `precip_flag` that holds letter codes such as `F`, `G`, and `I`. Since logistic regression requires numeric predictor variables, we will use the built-in Pandas `get_dummies()` function to encode this column as one binary variable per level.

In [11]:
df = pd.concat([df, pd.get_dummies(df['precip_flag'])], axis = 1).drop('precip_flag', axis = 1)
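
The following sketch shows the same encoding step on a toy frame, assuming only three flag levels and one missing flag. The `dtype=int` argument is an addition for readability (it forces 0/1 output regardless of the Pandas version) and is not part of the cell above.

```python
import pandas as pd

# Toy data with a missing flag in the last row.
toy = pd.DataFrame({
    "precip_in": [0.00, 0.43, 0.08, 0.00],
    "precip_flag": ["I", "G", "F", None],
})

# One 0/1 column per observed level; missing flags get all zeros.
dummies = pd.get_dummies(toy["precip_flag"], dtype=int)

# Append the indicator columns and drop the original text column.
encoded = pd.concat([toy, dummies], axis=1).drop("precip_flag", axis=1)

print(encoded)
```

A row whose flag is missing ends up with zeros in every indicator column rather than with `NaN`, so it survives a later `dropna()`.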

In [12]:
df

Unnamed: 0,station_num,year,month,day,WBAN,temp_ft,dewpt_ft,slp_mb,STP,visib_mi,...,SNDP,A,B,C,D,E,F,G,H,I
0,722210,1975,1,1,13858.0,66.2,62.7,1023.0,0.0,2.5,...,,0,0,0,0,0,0,0,0,1
1,722210,1975,1,2,13858.0,54.1,37.6,1025.2,0.0,12.8,...,,0,0,0,0,0,0,0,0,1
3,722210,1975,1,4,13858.0,54.6,52.3,1018.0,0.0,12.0,...,,0,0,0,0,0,0,1,0,0
4,722210,1975,1,5,13858.0,43.2,34.3,1024.0,0.0,14.1,...,,0,0,0,0,0,0,1,0,0
6,722210,1975,1,7,13858.0,50.7,46.8,1021.1,0.0,0.0,...,,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1349521,105260,1975,12,26,,34.0,34.0,1027.6,,2.0,...,,0,0,0,0,0,1,0,0,0
1349522,105260,1975,12,27,,36.6,36.6,1030.2,,1.2,...,,0,0,0,0,0,1,0,0,0
1349523,105260,1975,12,28,,30.8,30.8,1027.8,,0.0,...,,0,0,0,0,0,1,0,0,0
1349524,105260,1975,12,29,,28.6,28.6,0.0,,1.4,...,,0,0,0,0,0,1,0,0,0


Great. However, the `WBAN`, `STP`, and `SNDP` columns contain many `NaN` values. For now, we will simply drop every row that has a `NaN` in any column.

In [13]:
df = df.dropna()
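
Because `dropna()` removes a row if any single column is missing, it can discard far more data than expected. A minimal sketch on toy data (values are illustrative) of how one might count missing values per column and compare row counts before and after:

```python
import numpy as np
import pandas as pd

# Toy data; only the column names match the real dataset.
toy = pd.DataFrame({
    "WBAN": [13858.0, np.nan, 14777.0],
    "STP": [0.0, np.nan, 0.0],
    "SNDP": [np.nan, np.nan, 0.0],
    "temp_ft": [66.2, 36.6, 34.8],
})

# Missing values per column.
nan_counts = toy.isna().sum()

# Keep only rows with no missing values at all.
complete = toy.dropna()

print(nan_counts)
print(len(toy), "rows before,", len(complete), "rows after")
```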

In [14]:
df

Unnamed: 0,station_num,year,month,day,WBAN,temp_ft,dewpt_ft,slp_mb,STP,visib_mi,...,SNDP,A,B,C,D,E,F,G,H,I
4263,725130,1975,1,1,14777.0,34.8,31.4,1011.6,0.0,0.0,...,0.0,0,0,0,0,0,0,1,0,0
4264,725130,1975,1,2,14777.0,31.4,24.1,1021.2,0.0,0.0,...,0.0,0,0,0,0,0,0,1,0,0
4265,725130,1975,1,3,14777.0,26.0,15.6,1023.1,0.0,0.0,...,0.0,0,0,0,0,0,0,1,0,0
4266,725130,1975,1,4,14777.0,34.8,23.7,1014.2,0.0,15.8,...,0.0,0,0,0,0,0,0,1,0,0
4267,725130,1975,1,5,14777.0,33.7,23.8,1023.1,0.0,14.7,...,0.0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1318270,702310,1975,12,26,26510.0,-16.4,-21.0,0.0,0.0,13.5,...,0.0,0,0,0,0,0,0,0,0,1
1318271,702310,1975,12,27,26510.0,-5.7,-10.6,0.0,0.0,12.6,...,0.0,0,0,0,0,0,0,1,0,0
1318272,702310,1975,12,28,26510.0,0.6,-3.2,1004.2,0.0,0.0,...,0.0,0,0,0,0,0,0,1,0,0
1318273,702310,1975,12,29,26510.0,-20.8,-26.1,1025.7,1011.8,16.2,...,0.0,1,0,0,0,0,0,0,0,0


We are now ready to fit the model.

## Fitting the Model

First, we will split our dataset into training and test sets using the built-in Scikit-Learn `train_test_split()` function. For this project, we will use a train-test ratio of 3:1 (75% training data and 25% test data), which is what `train_test_split()` produces by default when no `test_size` is given.

In [15]:
train_df, test_df = train_test_split(df)

In [16]:
train_x = train_df.drop('tornado', axis = 1)
train_y = train_df['tornado']

In [17]:
test_x = test_df.drop('tornado', axis = 1)
test_y = test_df['tornado']
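
The split above relies on defaults, so it changes on every run. As a sketch of a more explicit variant (not what the cells above do), the example below spells out `test_size`, fixes `random_state` for reproducibility, and uses `stratify` so that a rare positive class appears in both parts. The toy frame and the 8% positive rate are assumptions made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 8 of them labeled as tornado days.
toy = pd.DataFrame({
    "temp_ft": [float(i) for i in range(100)],
    "tornado": [1] * 8 + [0] * 92,
})

train_df, test_df = train_test_split(
    toy,
    test_size=0.25,           # same 3:1 ratio as the default
    random_state=0,           # reproducible shuffle
    stratify=toy["tornado"],  # keep the class ratio in both parts
)

train_y = train_df["tornado"]
test_y = test_df["tornado"]
print(len(train_df), len(test_df), int(train_y.sum()), int(test_y.sum()))
```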

Next, we will initialize a logistic regression model and fit it to the training data.

In [18]:
lr = LogisticRegression()
lr.fit(train_x, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
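
The model above is fit on unscaled features with default settings, and because warnings were silenced earlier, any convergence warning would not be shown. One common alternative, sketched here on synthetic data from `make_classification` rather than on the weather data, is to standardize the features in a pipeline and raise `max_iter`. The value `max_iter=1000` is an arbitrary choice for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the predictors and the binary label.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale each feature to zero mean and unit variance, then fit the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Predicted probability of the positive class for each row.
probabilities = model.predict_proba(X)[:, 1]
print(probabilities[:5])
```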

Finally, we will use the logistic regression model to predict tornados in the test data and compare the predictions to the actual data.

In [19]:
predict = lr.predict(test_x)
sum(abs(predict - np.array(test_y)))

0

Woohoo! Zero mismatches: every prediction on the test set matches the recorded label, which is perfect accuracy.
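
A mismatch count is worth reading alongside the class balance, since tornados are rare and a model that never predicts one can still score very well on accuracy. The sketch below uses made-up labels (98 negatives, 2 positives) and an all-negative prediction to show how accuracy, recall, and a confusion matrix can tell these situations apart. For the real data, checking `test_y.sum()` would show how many tornado days the test set actually contains.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 2 tornado days out of 100.
y_true = np.array([0] * 98 + [1] * 2)

# A trivial "model" that never predicts a tornado.
y_pred = np.zeros_like(y_true)

# Same quantity as sum(abs(predict - test_y)) for 0/1 labels.
mismatches = int(np.sum(y_pred != y_true))

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(mismatches, accuracy, recall)
print(matrix)
```

Here accuracy is high even though no tornado is ever detected, which is why recall on the positive class is the more informative number for rare events.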