# Classification Using Measurement Data

This notebook applies basic classification to the lake ice measurements provided by Environment and Climate Change Canada and the Canadian Ice Service program.

This was done as part of exploratory work into using this dataset with basic machine learning.

In [1]:
# Import the required packages for data analysis and machine learning
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

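We start by reading the lake ice measurements from the spreadsheet file. We extract the year and month from each measurement date, then create a "HAS_ICE" column that flags whether a measurement has ice (an ICE_COVER value above 0.5). This column is removed from the table and kept separately as the label vector y for training and testing.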

In [17]:
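# Load the raw measurements
X = pd.read_excel('lakeice-measurements.xlsx')
# Derive numeric date features from the DATE column
X['YEAR'] = X['DATE'].dt.year
X['MONTH'] = X['DATE'].dt.month
# Label: True when the recorded ice cover is above 0.5
X['HAS_ICE'] = X['ICE_COVER'] > 0.5
# Remove the label from the table and keep it as a NumPy array
y = X.pop("HAS_ICE").values
X.head(10)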

| Unnamed: 0 | ID | DATE | TIME | NAME | LAT | LONG | ICE_COVER | YEAR | MONTH |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1995-11-17 | 2130.0 | Rainy Lake | 48.6 | 93.0 | 0.0 | 1995 | 11 |
| 1 | 2.0 | 1995-11-17 | 2130.0 | Gods Lake | 54.7 | 94.3 | 10.0 | 1995 | 11 |
| 2 | 3.0 | 1995-11-17 | 2130.0 | Lake Nipissing | 46.3 | 79.7 | 0.0 | 1995 | 11 |
| 3 | 4.0 | 1995-11-17 | 2130.0 | Lake Nipigon | 49.8 | 88.5 | 0.0 | 1995 | 11 |
| 4 | 5.0 | 1995-11-17 | 2130.0 | Baker Lake | 64.2 | 95.4 | 10.0 | 1995 | 11 |
| 5 | 6.0 | 1995-11-17 | 2130.0 | Yuthkyed Lake | 62.7 | 97.9 | 10.0 | 1995 | 11 |
| 6 | 7.0 | 1995-11-17 | 2130.0 | Island Lake | 53.8 | 94.5 | 10.0 | 1995 | 11 |
| 7 | 8.0 | 1995-11-17 | 2130.0 | Red Lake | 48.0 | 95.0 | 0.0 | 1995 | 11 |
| 8 | 9.0 | 1995-11-17 | 2130.0 | Lake Simcoe | 44.4 | 79.3 | 0.0 | 1995 | 11 |
| 9 | 10.0 | 1995-11-17 | 2130.0 | Nettilling Lake | 66.5 | 70.5 | 10.0 | 1995 | 11 |
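
The label derivation can be tried in isolation on a few made-up rows. The sketch below is only an illustration: the `toy` frame and its values are assumptions, not rows from the real spreadsheet. It repeats the same three derived columns and also counts the labels, which is worth knowing before reading any accuracy figure.

```python
import pandas as pd

# Toy rows standing in for the real spreadsheet (illustrative values only).
toy = pd.DataFrame({
    "DATE": pd.to_datetime(["1995-11-17", "1996-01-05", "1996-07-12", "1996-12-20"]),
    "ICE_COVER": [0.0, 10.0, 0.0, 4.0],
})

# Same derivation as in the notebook: date parts become numeric features,
# and any cover above 0.5 counts as "has ice".
toy["YEAR"] = toy["DATE"].dt.year
toy["MONTH"] = toy["DATE"].dt.month
toy["HAS_ICE"] = toy["ICE_COVER"] > 0.5

# Class balance of the derived label.
class_counts = toy["HAS_ICE"].value_counts()
print(class_counts)
```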


We then build the feature set from the latitude, longitude, year, and month of each measurement, and randomly split the data into a training set (80%) and a held-out test set (20%, stored in X_val and y_val).

In [18]:
featureSet = ['LAT', 'LONG', 'YEAR', 'MONTH']
X = X[featureSet].copy()

# split the large dataset into train and test
print("Splitting the dataset...")
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2, random_state=2)
print("Done!")

Splitting the dataset...
Done!
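
The split above is purely random. If the two classes are unevenly represented, one possible variation (not what this notebook does) is to pass `stratify` to `train_test_split` so both splits keep the same label ratio. A minimal sketch on synthetic data, where `X_demo` and `y_demo` are made-up stand-ins:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 100 samples, 4 features, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 4))
y_demo = np.array([True] * 30 + [False] * 70)

# stratify=y_demo keeps the True/False ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# Fraction of positive labels in each split.
print(y_tr.mean(), y_te.mean())
```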


## Naive Bayes

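We first train a Naive Bayes classifier (scikit-learn's MultinomialNB) to see how well it can learn from these features. MultinomialNB is designed for count-like features, so applying it to coordinates and dates is only a rough first attempt.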

In [5]:
# Helper to calculate accuracy
def accuracy(actualTags, predictions):
    """Return the fraction of predictions that match the actual labels."""
    totalFound = 0
    for i in range(len(actualTags)):
        if actualTags[i] == predictions[i]:
            totalFound += 1
    return totalFound / len(predictions)
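
The same quantity can be computed without an explicit loop. The following is a sketch of a vectorized equivalent using NumPy; the name `accuracy_vectorized` and the demo labels are hypothetical, and scikit-learn's `accuracy_score` from `sklearn.metrics` reports the same fraction.

```python
import numpy as np

def accuracy_vectorized(actual_tags, predictions):
    """Fraction of positions where the two label sequences agree."""
    actual = np.asarray(actual_tags)
    predicted = np.asarray(predictions)
    if actual.shape != predicted.shape:
        raise ValueError("label arrays must have the same shape")
    # Elementwise comparison gives booleans; their mean is the accuracy.
    return float(np.mean(actual == predicted))

# Made-up labels for illustration: two of the four positions agree.
y_true_demo = [True, False, True, True]
y_pred_demo = [True, True, True, False]
print(accuracy_vectorized(y_true_demo, y_pred_demo))
```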

In [19]:
print("Training the NB classifier...")
clf_nb = MultinomialNB().fit(X_train, y_train)
print("Done!")

Training the NB classifier...
Done!


We first use the trained classifier to predict labels for the data it was trained on, then for the held-out test data. Comparing the two accuracies shows how well the model generalizes.

In [20]:
training_predictions = clf_nb.predict(X_train)
print(training_predictions[0:10])
print(accuracy(y_train, training_predictions))

[False False  True  True  True False  True False False  True]
0.7603099937731164


In [21]:
testing_predictions = clf_nb.predict(X_val)
print(testing_predictions[0:10])
print(accuracy(y_val, testing_predictions))

[ True False  True False  True  True False False False  True]
0.7603228904093146
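
An accuracy figure is easier to interpret next to a trivial baseline, such as always predicting the most common label. The notebook does not report the class balance of the lake ice data, so the sketch below uses made-up labels; `majority_baseline_accuracy` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels, not taken from the lake ice data.
demo_labels = [True, True, True, False]
print(majority_baseline_accuracy(demo_labels))
```

Applied to the notebook's data, this would be called as `majority_baseline_accuracy(list(y_val))`; a classifier is only informative to the extent that it beats this number.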


## Logistic Regression

We repeat the same procedure as for Naive Bayes, this time using a Logistic Regression classifier.

In [23]:
print("Training the LR classifier...")
clf_lr = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=1).fit(X_train, y_train)
print("Done!")

Training the LR classifier...
Done!


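We then evaluate it on the training and test data. The final cell builds a sample case, one row per month of 2013 at a fixed latitude and longitude, and generates predictions for it. Note that this cell calls clf_nb, so the output shown comes from the Naive Bayes classifier rather than the Logistic Regression one.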

In [24]:
training_predictions_lr = clf_lr.predict(X_train)
print(training_predictions_lr[0:10])
print(accuracy(y_train, training_predictions_lr))

[False False  True  True False False  True  True False  True]
0.739963839061316


In [26]:
testing_predictions_lr = clf_lr.predict(X_val)
print(testing_predictions_lr[0:10])
print(accuracy(y_val, testing_predictions_lr))

[ True False  True False  True  True False False False  True]
0.737426479414236


In [30]:
# Sample case: one row per month of 2013 at latitude 45, longitude 75
data = [[45, 75, 2013, month] for month in range(1, 13)]
df_test = pd.DataFrame(data, columns = ['LAT', 'LONG', 'YEAR', 'MONTH'])
# Predict with the Naive Bayes classifier
clf_nb.predict(df_test)

array([ True,  True,  True,  True,  True, False, False, False, False,
       False, False, False])
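
Hard True/False predictions hide how confident the model is. Both MultinomialNB and LogisticRegression expose `predict_proba`, which returns one probability per class. The sketch below is only an illustration under stated assumptions: the training data is synthetic (random coordinates, with ice assumed above latitude 55), so the printed numbers say nothing about the real lakes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data (assumption): ice is present whenever LAT > 55.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "LAT": rng.uniform(42, 70, size=200),
    "LONG": rng.uniform(60, 100, size=200),
    "YEAR": rng.integers(1995, 2013, size=200),
    "MONTH": rng.integers(1, 13, size=200),
})
labels = (train["LAT"] > 55).values

clf = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=1)
clf.fit(train, labels)

# One row per month of 2013 at a fixed location, as in the sample case.
sample = pd.DataFrame({"LAT": 45, "LONG": 75, "YEAR": 2013, "MONTH": range(1, 13)})

# Columns follow clf.classes_ (False, then True), so column 1 is P(HAS_ICE).
proba = clf.predict_proba(sample)
print(pd.Series(proba[:, 1], index=sample["MONTH"], name="P(HAS_ICE)"))
```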