# Intro to Machine Learning with Meteorological Station Data: Part 1

The overarching goal of Parts 1 and 2 of the ML application lab is to become familiar with the limits and applicability of a variety of Machine Learning tools.

Part 1) __Unsupervised Learning *(this notebook)*: Grouping events into different categories__ 
<br> Can we use K-means clustering to detect the seasons in the Christman dataset? <br>

__The goals of Part 1 of this Application Lab are to:__<br>
1) Learn how to utilize an unsupervised learning technique (K-means) to look for patterns in a dataset.<br>
2) Become familiar with the sensitivity of K-means to standardization, to the choice of input data, and to the value of K.<br>
3) Understand at least one application of an unsupervised learning technique: data exploration & pattern recognition.<br>

# K-Means clustering

In the first part of this application lab, we will use [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering) to see if the algorithm can separate some data into different seasons. This may seem trivial, because we clearly already know which observations are in which season. But the goal is for you to understand how the algorithm works and its limitations.<br><br>

![k-means_image](./images/kmeans_image.png)
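
To make the idea concrete before using the SciPy routines, here is a minimal sketch of one common formulation of K-means (Lloyd's algorithm) in plain NumPy. It is not the implementation used later in this notebook; the function name `lloyd_kmeans`, its parameters, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct observations at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    # Final assignment so the labels match the returned centroids.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return centroids, labels

# Hypothetical toy data: two 2-D blobs (not the weather station data).
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                 rng.normal(5.0, 0.5, size=(50, 2))])
toy_centroids, toy_labels = lloyd_kmeans(toy, k=2)
```

The result depends on the random initialization, which is one reason library routines typically repeat the procedure several times and keep the best run.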

# Part 0. Read data into a pandas DataFrame.

In [None]:
import pandas as pd
import numpy as np
import datetime

Read in the data.

In [None]:
df = pd.read_csv("./christman_2016.csv")
# preview data (also through df.head() & df.tail())
df

## Deal with the time dimension

How many days are in this dataset?

In [None]:
df.day.nunique()

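__Optional__: transform the day column into a readable date. This cell only runs successfully once: after the first run, `day` holds dates rather than integers, so rerunning it raises a `TypeError`.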

In [None]:
df['day'] = [datetime.date.fromordinal(day+693594) for day in df['day']]
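
The offset 693594 is the ordinal of 1899-12-30, which suggests (this is an assumption, not something the dataset documents) that `day` holds spreadsheet-style serial day numbers. Under that assumption, a rerun-safe variant could write the result to a new column with `pd.to_datetime`. The toy frame and the column name `date` below are hypothetical.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the 'day' column: serial day numbers counted from 1899-12-30.
toy_df = pd.DataFrame({"day": [42370, 42371, 42735]})

# Writing to a new column leaves 'day' untouched, so the cell can be rerun safely.
toy_df["date"] = pd.to_datetime(toy_df["day"], unit="D", origin="1899-12-30")

# Cross-check against the ordinal-offset approach used in the cell above.
via_ordinal = [datetime.date.fromordinal(int(d) + 693594) for d in toy_df["day"]]
print(toy_df)
```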

---------------------------------------------

# Part 1. Unsupervised learning 
#### *Using K-means to look for patterns in the data*

In [None]:
from scipy.cluster.vq import vq, kmeans, whiten
import matplotlib.pyplot as plt

Select only the noon observations, i.e. rows where `hour` is equal to 0.5 (the hour appears to be stored as a fraction of a day).

In [None]:
noondf = df[df.hour==0.5]
#noondf = df # try selecting all data instead of just noon data

Convert to numpy array for model input, leaving out the time dimensions day & hour, and wind directions.

In [None]:
included_cols = ['temp_F','RH','dewtemp_F','wind_mph','windgust','pres_Hg','SOLIN_Wm2','Prec_inches'] # original input
#included_cols = ['temp_F','RH','dewtemp_F','wind_mph','windgust','pres_Hg','Prec_inches'] # try removing insolation as a feature
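# select the columns in the order given by included_cols, so that column
# positions in `data` match included_cols.index(...) used for plotting below
data = noondf[included_cols].to_numpy()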
np.shape(data)

## Standardize or normalize data

Because K-means groups observations by their distance in feature space, we need to standardize all of our features so that each has the same variance. Otherwise, the clustering would be dominated by the feature with the largest variance.

In [None]:
normal_data = whiten(data) # SciPy's normalization function from scipy.cluster.vq: scales each feature to unit variance
#normal_data = data # can uncomment this to avoid normalization
np.shape(normal_data) # great, we end up with the same shape as our original data.

There are 366 observations over 8 variables, or "features", as they're called in the ML world.
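
To see what the standardization step does, the sketch below applies `whiten` to a small made-up array (the values are hypothetical, loosely imitating a pressure-like and a precipitation-like column). It shows that `whiten` divides each column by its standard deviation without subtracting the mean, and that a full z-score would leave the distances between observations unchanged.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical two-feature example with very different scales.
toy = np.array([[24.8, 0.00],
                [25.1, 0.10],
                [24.9, 0.02],
                [25.3, 0.30]])

whitened = whiten(toy)            # divides each column by its standard deviation
manual = toy / toy.std(axis=0)    # the same operation written out by hand
zscored = (toy - toy.mean(axis=0)) / toy.std(axis=0)  # z-score also removes the mean

print(whitened.std(axis=0))       # every feature now has unit variance

# Differences between observations are identical for whitening and z-scoring,
# because subtracting a constant from a column does not change distances.
diff_whitened = whitened[0] - whitened[1]
diff_zscored = zscored[0] - zscored[1]
```

Since K-means only depends on distances between observations, skipping the mean subtraction should not affect the cluster assignments.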

## Look for patterns with k-means

In [None]:
NO_CLUSTERS = 2 # 2 for separating cold & warm seasons, 4 for separating into all four seasons
centroids, _  = kmeans(normal_data,NO_CLUSTERS,iter=20)
idx, _ = vq(normal_data,centroids) # returns the cluster index (the K-means "season") for each observation
print(idx) # prints the K-means cluster label for each day of the year (January - December 2016)
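
One way to explore the sensitivity to the number of clusters is to look at the second value returned by `kmeans`, the distortion, for several values of K. The sketch below does this on made-up data (three synthetic blobs, not the weather observations); the variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data: three 2-D blobs of 60 points each.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(loc, 0.3, size=(60, 2)) for loc in (0.0, 4.0, 8.0)])
toy_w = whiten(toy)

distortions = {}
for k in (1, 2, 3, 4):
    # kmeans returns the centroids and the distortion: the mean distance
    # between each observation and its nearest centroid.
    _, distortion = kmeans(toy_w, k, iter=20)
    distortions[k] = float(distortion)
    print(k, round(distortions[k], 3))

# Cluster sizes for one chosen K, using the labels returned by vq.
toy_centroids, _ = kmeans(toy_w, 3, iter=20)
toy_labels, _ = vq(toy_w, toy_centroids)
sizes = np.bincount(toy_labels, minlength=len(toy_centroids))
print(sizes)
```

A sharp drop in distortion followed by a plateau is often read as a hint at a reasonable K, although real data rarely gives as clean a picture as synthetic blobs.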

Create a few xy scatter plots: first the original data, then the same points colored by the cluster that K-means assigned them to.

In [None]:
vars2plot = ['temp_F','pres_Hg'] # format (x, y)
# You may also try plotting different variables. Just ensure they are listed 
# in "included_cols" in the cell above.

plt.figure(figsize=(10,6))
data2plot = [data[:,included_cols.index(var)] for var in vars2plot]
plt.plot(data2plot[0],data2plot[1],'.',markersize = 8)
plt.xlabel(vars2plot[0],fontsize=18)
plt.ylabel(vars2plot[1],fontsize=18)

yvals = plt.ylim()
xvals = plt.xlim()

plt.title('Original Data',fontsize=22)
plt.show()

In [None]:
cols = ['red','blue','green','orange']

plt.figure(figsize=(10,6))
plt.title(str(NO_CLUSTERS) + ' Clusters',fontsize=22)
for (ind,val) in enumerate(np.transpose(data2plot)):
    plt.plot(val[0],val[1],".", color=cols[idx[ind]], markersize=8, markerfacecolor = 'none')

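# reuse the axis limits of the 'Original Data' plot so the two figures are directly comparable
plt.xlim(xvals)
plt.ylim(yvals)

plt.xlabel(vars2plot[0],fontsize=18)
plt.ylabel(vars2plot[1],fontsize=18)
plt.show()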
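
The centroids returned by `kmeans` live in the whitened feature space, while the plots above use the original units. A minimal sketch of mapping centroids back to data units, assuming the data were scaled with `whiten` (which divides each column by its standard deviation), might look like this. The toy data and variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data standing in for two features (x, y) in physical units.
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal([30.0, 25.2], [5.0, 0.1], size=(80, 2)),
                 rng.normal([75.0, 24.9], [5.0, 0.1], size=(80, 2))])

toy_w = whiten(toy)
centroids_w, _ = kmeans(toy_w, 2, iter=20)
toy_labels, _ = vq(toy_w, centroids_w)

# whiten divided each column by its standard deviation, so multiplying
# by the same standard deviations maps the centroids back to data units.
centroids_orig = centroids_w * toy.std(axis=0)
print(centroids_orig)
```

In the notebook, the two columns of the rescaled centroids that correspond to `vars2plot` could then be overlaid on the cluster plot, for example as black dots with `plt.plot(..., 'ko')`.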

## Unsupervised Learning Questions:

In [None]:
from solutions import unsupervised

1. __Do a quick search online for the definition of a "centroid" for K-means clustering. What is a centroid?__

In [None]:
unsupervised.answer1()

2. __What would happen if you included all hourly observations instead of selecting only the daily noon data?__

In [None]:
unsupervised.answer2()

3. __What happens when you don't standardize the data beforehand? Why should you standardize the data?__

In [None]:
unsupervised.answer3()

4. __What happens when you change the number of clusters from two to four? Why do you think the algorithm yields different results?__

In [None]:
unsupervised.answer4()

5. __What happens when you remove certain features? Does the model perform better or worse at detecting seasons?__

In [None]:
unsupervised.answer5()