# Assignment 2 - Build and Deploy Weather Model

## Part 1 (30 points)

- successfully run this notebook on your Raspberry Pi
- answer questions given in the notebook

## Part 2 (10 points)
- make a copy of this notebook and call it `Assignment2-Part2`
- open the new notebook
- using June-2022 data, the BALANCED model has these accuracies for training and test data:
```
    Average model accuracy(training data): 0.7559129612109745
    Average model accuracy(test data): 0.7471698113207547
```    
- use additional features in the "Build Model - use BALANCED data" section to improve the model's accuracy
- you will not be required to use this model with live BMP280 data
- what is your best accuracy from the revised model?
- what other changes can you make to improve the model's accuracy?  For example, include more features, improve how Rain/NoRain labels are made, etc.
- the higher your model accuracy, the higher your grade
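
As a rough illustration of the "additional features" idea in Part 2, the sketch below trains the same kind of logistic regression twice, once on two features and once on three. Everything in it is an assumption for illustration: the data is synthetic, `RelHum` is a hypothetical column name, and the rain rule is made up so that the extra feature carries signal. Your real accuracy depends on which columns the cleaned Environment Canada data actually provides.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 'RelHum' is a hypothetical extra feature column
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "Temp": rng.normal(20.0, 5.0, n_rows),
    "StnPress": rng.normal(100.0, 1.0, n_rows),
    "RelHum": rng.uniform(30.0, 100.0, n_rows),
})
# Made-up rule so the extra feature carries signal: "rain" when humidity is high
toy_df["PRG550_target"] = (toy_df["RelHum"] > 75.0).astype(int)


def fit_and_score(feature_names):
    """Train on the given feature columns and return test-set accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        toy_df[feature_names].values,
        toy_df["PRG550_target"].values,
        test_size=0.2,
        random_state=42,
    )
    clf = LogisticRegression(random_state=0, max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


score_two_features = fit_and_score(["Temp", "StnPress"])
score_three_features = fit_and_score(["Temp", "StnPress", "RelHum"])
print(score_two_features, score_three_features)
```

In the notebook itself, the equivalent change would be to extend the `_features` list in the BALANCED section with column names that exist in `data_clean_df`.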
   

## Part 3 (10 points)

- make a copy of this notebook and call it `Assignment2-Part3`
- open the new notebook
- instead of using training data from June-2022, make needed changes to use training data from July-2022
- run the new notebook and make necessary changes to the code (hint: you'll also need to change prg550_assignment2_students.py)
- what caused your code to break?  why is this a challenge for data science?


In [None]:
# increase Jupyter cell width

from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

### set auto reload for user modules

In [None]:
import numpy as np
import pandas as pd

from joblib import dump, load

import seaborn as sns # import plotting library
sns.set_palette('tab10') # see here for reference https://seaborn.pydata.org/tutorial/color_palettes.html

In [None]:
# import functions from prg550_assignment2_students.py for use in this notebook
from prg550_assignment2_students import bmp_initialize, bmp_read_values, data_collection, data_clean_prep, model_load 

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Collect data

In [None]:
##############################################################################
############## Set stationID and climateID for weather station ##############
stationID, climateID = "51459", "6158731" # Toronto Pearson


##############################################################################
########### Set Year, Month, Day to capture 1 month of hourly data ###########
year = 2022
month = 6
day = 1
##############################################################################

download_date_str = f"{year}-{month:02d}-{day:02d}"

print(f"One month of data will be downloaded for {download_date_str}\nStationID={stationID} and ClimateID={climateID}")



In [None]:
data_raw_df = data_collection(stationID, climateID, year, month, day)
data_raw_df.head(3)

## Clean data

In [None]:
data_clean_df = data_clean_prep(data_raw_df)
data_clean_df.head(4)

## Additional feature processing - Convert Rain/NoRain to 1/0 for use by model

In [None]:
from sklearn import preprocessing
from sklearn.utils import resample

fields = ['Temp','StnPress','PRG550_2labels']
target_field = 'PRG550_2labels'
feature_fields = ['Temp','StnPress' ]

df = data_clean_df[fields].copy()

# convert Rain/NoRain to binary 1, 0
lb = preprocessing.LabelBinarizer() 
binarized_target = lb.fit_transform(data_clean_df[target_field])
s = pd.Series(binarized_target.ravel(), index=data_clean_df.index, name='PRG550_target') # create pandas Series named 'PRG550_target', aligned to data_clean_df's index

data_clean_df = data_clean_df.assign(PRG550_target=s) # add binarized labels to data_clean_df

In [None]:
target_to_label = lb.inverse_transform(s) # transform 0's, 1's back to labels
target_to_label[:20] # show first 20 entries

### Verify new mappings: NoRain --> 0 and Rain --> 1

`NoRain` should map to 0

`Rain` should map to 1
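
The mapping can be checked on a small toy list before trusting it on the full dataframe. This is a minimal sketch with made-up labels (`lb_demo` and `labels` are illustrative names, not part of the notebook). `LabelBinarizer` sorts its classes, so `NoRain` comes before `Rain` and receives 0.

```python
from sklearn.preprocessing import LabelBinarizer

# Toy labels standing in for the PRG550_2labels column (illustrative values only)
labels = ["NoRain", "Rain", "NoRain", "NoRain", "Rain"]

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(labels).ravel()  # (5, 1) array flattened to (5,)

# classes_ is sorted, so index 0 is 'NoRain' and index 1 is 'Rain'
print(list(lb_demo.classes_))
print(encoded.tolist())
```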

In [None]:
fields = ['Temp','StnPress','PRG550_2labels']
target_field = 'PRG550_target' # <<<<<<<<<<<<<< PRG550_target is the new 1/0 target column
feature_fields = ['Temp','StnPress' ]

data_clean_df.iloc[8:12]

### Show data characteristics

In [None]:
data_clean_df.describe()

## Show Cleaned Data with Environment Canada Weather Labels

In [None]:
plot_df = data_clean_df
weather_fix_order = ['Mostly Cloudy', 'Cloudy', 'Fog', 'Mainly Clear', 'Clear', 'Drizzle', 'Rain', 'Rain Showers', 'Rain Showers,Fog', 'Rain,Fog', 'Moderate Rain,Fog', 'Moderate Rain Showers', 'Thunderstorms', 'Thunderstorms,Moderate Rain Showers', 'Thunderstorms,Heavy Rain Showers']

g = sns.scatterplot(x='Temp', y='StnPress', hue='Weather_fix', hue_order=weather_fix_order, palette = "coolwarm_r", data=plot_df)
# Put the legend out of the figure
g.legend(loc='center left', bbox_to_anchor=(1, 0.5))

## Show Cleaned Data with PRG550 labels

In [None]:
plot_df = data_clean_df
sns.scatterplot(x='Temp', y='StnPress', hue='PRG550_target', data=plot_df)

## Check if data is imbalanced between Rain and NoRain classes

In [None]:
print(data_clean_df.groupby(target_field)[target_field].count())
_temp = data_clean_df.groupby(target_field)[target_field].count().values

num_majority = _temp[0]
num_minority = _temp[1]

In [None]:
naive_guess_accuracy = num_majority / (num_majority+num_minority)
naive_guess_accuracy

## Large class imbalance! 
There are roughly 10 times as many `NoRain` rows as `Rain` rows.

Guessing `NoRain` all the time will give you 91.8% accuracy without even using a model!
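
The naive baseline is simply the share of rows in the largest class. A minimal sketch with a made-up 9-to-1 target (`toy_target` is an illustrative name, and the counts are not the June-2022 counts):

```python
import pandas as pd

# Toy 1/0 target: nine NoRain (0) rows and one Rain (1) row, illustrative only
toy_target = pd.Series([0] * 9 + [1], name="PRG550_target")

counts = toy_target.value_counts()               # rows per class
baseline_accuracy = counts.max() / counts.sum()  # accuracy of always guessing the majority class
print(baseline_accuracy)  # 0.9
```

Any trained model should be judged against this number rather than against 0%.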

## Correct for class imbalance by creating duplicates of the smaller class

In [None]:
data_clean_minority_df = data_clean_df.loc[data_clean_df['PRG550_2labels'] == 'Rain'] # filter only rows having 'Rain'
data_clean_majority_df = data_clean_df.loc[data_clean_df['PRG550_2labels'] == 'NoRain'] # filter only rows having 'NoRain'

# confirm size of minority and majority data rows same as above
(data_clean_minority_df.shape, data_clean_majority_df.shape)

In [None]:
from sklearn import preprocessing
from sklearn.utils import resample

# Upsample minority class
data_clean_minority_upsampled_df = resample(data_clean_minority_df, 
                                 replace=True,     # sample with replacement
                                 n_samples=num_majority,    # to match number of samples in majority class
                                 random_state=123) # reproducible results
# Combine majority class with upsampled minority class
data_clean_balanced_df = pd.concat([data_clean_majority_df, data_clean_minority_upsampled_df], axis=0)
data_clean_balanced_df.reset_index(inplace=True)

### Confirm classes are now balanced for `data_clean_balanced_df`

In [None]:
# shows equal number of 0's (NoRain) and 1's (Rain)
data_clean_balanced_df.groupby(target_field)[target_field].count() 
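
The upsampling step above can also be tried in isolation. The sketch below uses a made-up six-row dataframe (`toy_df` and its values are assumptions for illustration) and the same `sklearn.utils.resample` call to duplicate minority rows until both classes have the same count.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: four NoRain (0) rows and two Rain (1) rows, illustrative only
toy_df = pd.DataFrame({
    "Temp": [20.1, 21.3, 19.8, 22.0, 18.5, 17.9],
    "PRG550_target": [0, 0, 0, 0, 1, 1],
})
majority = toy_df[toy_df["PRG550_target"] == 0]
minority = toy_df[toy_df["PRG550_target"] == 1]

# Sample minority rows with replacement until they match the majority count
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=123,
)
balanced = pd.concat([majority, minority_upsampled], axis=0).reset_index(drop=True)
print(balanced["PRG550_target"].value_counts().to_dict())
```

Because sampling is done with replacement, the upsampled rows are exact copies of existing `Rain` rows; no new measurements are created.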

## Build Model - using imbalanced data

This section shows what happens when imbalanced data is used to train a model

### Create feature and target datasets - imbalanced data

In [None]:
# create features and target dataframes
_features = ['Temp', 'StnPress'] # temperature (Temp) and air pressure (StnPress) at weather station are features
_target = 'PRG550_target' # Rain/NoRain converted to 1/0

df_features = data_clean_df[_features]  # <<<<<<< using dataframe before Rain/NoRain was balanced 
df_target = data_clean_df[_target]      # <<<<<<< using dataframe before Rain/NoRain was balanced 

### Split into train and test data subsets - imbalanced data

In [None]:
from sklearn.model_selection import train_test_split

percentage_for_testing = 0.2 # 20% data for testing, 80% for training

df_features_train, df_features_test, df_target_train, df_target_test = train_test_split(
    df_features
    , df_target
    , test_size=percentage_for_testing
    , random_state=42)


### Show train and test data - imbalanced data

In [None]:
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,4)) # create a figure with two subplots (1 row, 2 columns)

# combine df_features_train, df_target_train into one dataframe
_plot_train_df = pd.concat([df_features_train, df_target_train], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_train_df.columns = ['Temp', 'StnPress', 'PRG550_target'] # rename columns so proper labels appear in graph

# combine df_features_test, df_target_test into one dataframe
_plot_test_df = pd.concat([df_features_test, df_target_test], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_test_df.columns = ['Temp', 'StnPress', 'PRG550_target'] # rename columns so proper labels appear in graph

g1 = sns.scatterplot(x='Temp', y='StnPress', hue='PRG550_target', data=_plot_train_df, ax=ax[0])
g1.set(title='Env Canada - training subset (imbalanced data)')
g2 = sns.scatterplot(x='Temp', y='StnPress', hue='PRG550_target', data=_plot_test_df, ax=ax[1])
g2.set(title='Env Canada  - test subset (imbalanced data)')

### Instantiate and train logistic regression classifier model - imbalanced data

In [None]:
# instantiate and train logistic regression model
from sklearn.linear_model import LogisticRegression

imbalanced_clf = LogisticRegression(random_state=0)
imbalanced_clf.fit(X=df_features_train.values, # .values to get numpy representation of array
                   y=df_target_train.values # .values to get numpy representation of array
                  )

### Model Accuracy - imbalanced data

Model is not much better than naive guessing!

In [None]:
score_with_train_data = imbalanced_clf.score(df_features_train.values, df_target_train.values)
score_with_test_data = imbalanced_clf.score(df_features_test.values, df_target_test.values)
print("Average model accuracy(training data): {0}\nAverage model accuracy(test data): {1}\nAccuracy for always guessing 'NoRain': {2}".format(
    score_with_train_data, score_with_test_data, naive_guess_accuracy)
     )

### Save trained model to file - imbalanced data

In [None]:
from joblib import dump, load

imbalanced_model_filename = 'imbalanced_data_environment_canada_logistic_regression_classifier.joblib'
dump(imbalanced_clf, imbalanced_model_filename) 

## Build Model - use BALANCED data

This section shows what happens when balanced data is used to train a model

### Create feature and target datasets - use BALANCED data

In [None]:
# create features and target dataframes
_features = ['Temp', 'StnPress'] # temperature (Temp) and air pressure (StnPress) at weather station are features
_target = 'PRG550_target' # Rain/NoRain converted to 1/0

df_features = data_clean_balanced_df[_features]  # <<<<<<< using dataframe AFTER Rain/NoRain was balanced 
df_target = data_clean_balanced_df[_target]      # <<<<<<< using dataframe AFTER Rain/NoRain was balanced 

### Split into train and test data subsets - use BALANCED data

In [None]:
from sklearn.model_selection import train_test_split

percentage_for_testing = 0.2 # 20% data for testing, 80% for training

df_features_train, df_features_test, df_target_train, df_target_test = train_test_split(
    df_features
    , df_target
    , test_size=percentage_for_testing
    , random_state=42)


### Show train and test data - use BALANCED data

In [None]:
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14,4)) # create a figure with two subplots (1 row, 2 columns)

# combine df_features_train, df_target_train into one dataframe
_plot_train_df = pd.concat([df_features_train, df_target_train], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_train_df.columns = ['Temp', 'StnPress', 'PRG550_target'] # rename columns so proper labels appear in graph

# combine df_features_test, df_target_test into one dataframe
_plot_test_df = pd.concat([df_features_test, df_target_test], ignore_index=True, sort=False, axis=1) # axis=1 means put columns side-by-side
_plot_test_df.columns = ['Temp', 'StnPress', 'PRG550_target'] # rename columns so proper labels appear in graph

g1 = sns.scatterplot(x='Temp', y='StnPress', hue='PRG550_target', data=_plot_train_df, ax=ax[0])
g1.set(title='Env Canada - training subset (balanced data)')
g2 = sns.scatterplot(x='Temp', y='StnPress', hue='PRG550_target', data=_plot_test_df, ax=ax[1])
g2.set(title='Env Canada  - test subset (balanced data)')

### Instantiate and train logistic regression classifier model - use BALANCED data

In [None]:
# instantiate and train logistic regression model
from sklearn.linear_model import LogisticRegression

balanced_clf = LogisticRegression(random_state=0)
balanced_clf.fit(X=df_features_train.values, # .values to get numpy representation of array
                   y=df_target_train.values # .values to get numpy representation of array
                  )

### Model Accuracy - use BALANCED data

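Compare against naive guessing: with balanced classes, always guessing `NoRain` gives about 50% accuracy, so the scores below should be clearly better than that.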

In [None]:
score_with_train_data = balanced_clf.score(df_features_train.values, df_target_train.values)
score_with_test_data = balanced_clf.score(df_features_test.values, df_target_test.values)
print("Average model accuracy(training data): {0}\nAverage model accuracy(test data): {1}".format(
    score_with_train_data, score_with_test_data)
     )

### Save trained model to file - use BALANCED data

Save the trained model into the file `balanced_data_environment_canada_logistic_regression_classifier.joblib`

In [None]:
from joblib import dump, load

balanced_model_filename = 'balanced_data_environment_canada_logistic_regression_classifier.joblib'
dump(balanced_clf, balanced_model_filename) 

# Show Decision Boundary of Balanced Model

Questions: 

1. what does the purple area represent?
1. what does the yellow area represent?
1. what prediction will your model give if Temperature=20.0 and Pressure=100.5?
1. what prediction will your model give if Temperature=32.0 and Pressure=97.5?
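
To check an answer to questions 3 and 4 numerically, a fitted classifier can be queried with a single `[Temp, StnPress]` row. The sketch below is an assumption-laden stand-in: `toy_clf` is trained on a handful of synthetic points, not on the Environment Canada data, so its output says nothing about what `balanced_clf` will predict. It only shows the call shape.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic [Temp, StnPress] rows; not Environment Canada data
X_toy = np.array([
    [15.0, 101.5], [18.0, 101.2], [22.0, 101.0], [25.0, 100.9],
    [20.0, 99.8], [24.0, 99.5], [28.0, 99.2], [30.0, 99.0],
])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = NoRain, 1 = Rain

toy_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# predict() expects a 2D array: one row per sample, one column per feature
query = np.array([[20.0, 100.5]])
predicted = toy_clf.predict(query)              # array with one 0/1 value
probabilities = toy_clf.predict_proba(query)    # [[P(NoRain), P(Rain)]]
print(predicted, probabilities)
```

In the notebook, the same `predict` call on `balanced_clf` should agree with the colour of the decision-boundary plot at that point.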

In [None]:
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

_features = ['Temp', 'StnPress'] # temperature (Temp) and air pressure (StnPress) at weather station are features
_target = 'PRG550_2labels' # Rain or NoRain

disp = DecisionBoundaryDisplay.from_estimator(
    estimator=balanced_clf, 
    X=df_features_train[_features].values, 
    response_method="predict",
    alpha=0.4 # use 40% transparency
)
# DecisionBoundaryDisplay based on matplotlib, using matplotlib's version of scatter()
disp.ax_.scatter(x=df_features_train['Temp'], # x-axis for scatter plot
                 y=df_features_train['StnPress'], # y-axis for scatter plot
                 c=df_target_train,  # use target label to colour data points
                 edgecolor="k")
plt.title('Balanced Classifier Decision Boundary')
plt.show()

In [None]:
# 1. what does the purple area represent?

# type your answer here


In [None]:
# 2. what does the yellow area represent?

# type your answer here


In [None]:
# 3. what prediction will your model give if Temperature=20.0 and Pressure=100.5?

# type your answer here

In [None]:
# 4. what prediction will your model give if Temperature=32.0 and Pressure=97.5?

# type your answer here

# Plot experimental results

Load data from `experiment_data_YYYYMMDD_HHMMSS.csv` and:

1. plot Measured_Temperature vs time
1. plot Measured_Pressure vs time
1. plot the model's prediction vs time (i.e. Prediction_Label)

The x- and y-axes of your plots should be similar to these three charts:
<div>
<img src="plot_experimental_results.png" width="500"/>
</div>
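
One possible way to prepare the captured data for plotting is sketched here with an in-memory sample instead of a real file. The `Timestamp` column name, the sample rows, and the `Rain`/`NoRain` values in `Prediction_Label` are assumptions; check the header of your own CSV and adjust the names.

```python
import io
import pandas as pd

# Stand-in for experiment_data_YYYYMMDD_HHMMSS.csv; the rows and the
# "Timestamp" column name are made up for illustration
sample_csv = io.StringIO(
    "Timestamp,Measured_Temperature,Measured_Pressure,Prediction_Label\n"
    "2022-06-01 10:00:00,22.5,100.4,NoRain\n"
    "2022-06-01 10:01:00,22.6,100.3,NoRain\n"
    "2022-06-01 10:02:00,22.4,99.6,Rain\n"
)

# Parse the time column as datetimes and use it as the index (the x-axis)
demo_df = pd.read_csv(sample_csv, parse_dates=["Timestamp"]).set_index("Timestamp")

# Map labels to 1/0 so predictions can be drawn on a numeric y-axis
demo_df["Prediction_Value"] = (demo_df["Prediction_Label"] == "Rain").astype(int)
print(demo_df)
```

With a datetime index in place, `demo_df["Measured_Temperature"].plot()` draws the column against time when matplotlib is installed; the other two charts follow the same pattern.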

In [None]:
# add lines of code to load your captured data into dataframe: realtime_data_df

# pseudo code:
# read csv file into realtime_data_df

In [None]:
# 1. your code here to plot Measured_Temperature vs time 

In [None]:
# 2. your code here to plot Measured_Pressure vs time

In [None]:
# 3. your code here to plot your model's prediction vs time (ie Prediction_Label)