# Clustering parallel coordinates plot

Use cluster analysis to build a big-picture model of the weather at a local station from minute-granularity data.  
This dataset has on the order of millions of records. How do we group them into 12 clusters?

The dataset we will use is in a large CSV file called *minute_weather.csv*.  
The download link is: https://drive.google.com/open?id=0B8iiZ7pSaSFZb3ItQ1l4LWRMTjg 

### Import

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# import utils
from itertools import cycle, islice

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

### Data

In [None]:
!find ../.. | grep -i minute_weather.csv

In [None]:
data = pd.read_csv('../../_data/minute_weather.csv')

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold">Minute Weather Data Description</p>
<br>
The **minute weather dataset** comes from the same source as the daily weather dataset that we used in the decision-tree-based classifier notebook. The main difference between the two is that the minute weather dataset contains raw sensor measurements captured at one-minute intervals, whereas the daily weather dataset contained processed and well-curated data. The data is in the file **minute_weather.csv**, which is a comma-separated file.

As with the daily weather data, this data comes from a weather station located in San Diego, California. The weather station is equipped with sensors that capture weather-related measurements such as air temperature, air pressure, and relative humidity. Data was collected for a period of three years, from September 2011 to September 2014, to ensure that sufficient data for different seasons and weather conditions is captured.

Each row in **minute_weather.csv** contains weather data captured for a one-minute interval. Each row, or sample, consists of the following variables:

* **rowID:** unique number for each row (*Unit: NA*)
* **hpwren_timestamp:** timestamp of the measurement (*Unit: year-month-day hour:minute:second*)
* **air_pressure:** air pressure measured at the timestamp (*Unit: hectopascals*)
* **air_temp:** air temperature measured at the timestamp (*Unit: degrees Fahrenheit*)
* **avg_wind_direction:** wind direction averaged over the minute before the timestamp (*Unit: degrees, with 0 meaning wind coming from the North, and increasing clockwise*)
* **avg_wind_speed:** wind speed averaged over the minute before the timestamp (*Unit: meters per second*)
* **max_wind_direction:** highest wind direction in the minute before the timestamp (*Unit: degrees, with 0 being North and increasing clockwise*)
* **max_wind_speed:** highest wind speed in the minute before the timestamp (*Unit: meters per second*)
* **min_wind_direction:** smallest wind direction in the minute before the timestamp (*Unit: degrees, with 0 being North and increasing clockwise*)
* **min_wind_speed:** smallest wind speed in the minute before the timestamp (*Unit: meters per second*)
* **rain_accumulation:** amount of accumulated rain measured at the timestamp (*Unit: millimeters*)
* **rain_duration:** length of time rain has fallen as measured at the timestamp (*Unit: seconds*)
* **relative_humidity:** relative humidity measured at the timestamp (*Unit: percent*)

In [None]:
data.shape

In [None]:
data.head()

### Data Sampling

In [None]:
sample_ind = np.random.permutation(data.index)[:100000]

# sample_ind holds index labels, so select by label; copy so later edits don't touch `data`
sampled_df = data.loc[sample_ind].copy()
sampled_df.shape
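
The cell above draws a fresh random permutation on every run, so the sampled rows (and everything downstream) change between runs. A minimal sketch of a repeatable alternative, assuming a small synthetic frame (`toy_df`, a hypothetical stand-in for `data`) and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the minute weather frame (not the real data).
toy_df = pd.DataFrame({
    "rowID": np.arange(1_000),
    "air_temp": np.linspace(40.0, 90.0, 1_000),
})

# Draw 100 rows without replacement; a fixed random_state makes the draw repeatable.
sample_a = toy_df.sample(n=100, random_state=42)
sample_b = toy_df.sample(n=100, random_state=42)

print(sample_a.shape)                          # (100, 2)
print(sample_a.index.equals(sample_b.index))   # True
```

On the real data, the same call would be `data.sample(n=100000, random_state=...)`, with the seed chosen freely.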

### Data statistics

In [None]:
sampled_df.describe().T

In [None]:
sampled_df[sampled_df['rain_accumulation'] == 0].shape

In [None]:
sampled_df[sampled_df['rain_duration'] == 0].shape

### Skewed feature distributions

- Find features whose distribution is concentrated on a single value (here, the share of rows equal to zero).

In [None]:
from collections import Counter
{ftr:Counter(sampled_df[ftr])[0]/sampled_df.shape[0] for ftr in sampled_df.columns}
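
The same zero-share check can be written without `Counter`, as a vectorized comparison. This is a sketch on a tiny made-up frame (`toy_df`, hypothetical values), and the 0.5 cut-off for "mostly zero" is an assumption chosen for illustration, not a value from the notebook:

```python
import pandas as pd

# Hypothetical miniature of the sampled frame (values are made up).
toy_df = pd.DataFrame({
    "rain_accumulation": [0.0, 0.0, 0.0, 1.2],
    "rain_duration": [0.0, 0.0, 0.0, 60.0],
    "air_temp": [61.5, 0.0, 70.1, 65.3],
})

# Fraction of rows equal to zero in each column.
zero_share = (toy_df == 0).mean()

# Columns dominated by zeros (the 0.5 threshold is an assumed cut-off).
mostly_zero = zero_share[zero_share > 0.5].index.tolist()

print(zero_share.to_dict())
print(mostly_zero)  # ['rain_accumulation', 'rain_duration']
```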

In [None]:
del sampled_df['rain_accumulation']
del sampled_df['rain_duration']

### NaN's

In [None]:
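# Rows containing at least one NaN (printed, since only the last expression is displayed)
print('Number of rows with NaN: {}'.format(sampled_df.isnull().any(axis=1).sum()))
# NaN count per column
{ftr: sampled_df[ftr].isnull().sum() for ftr in sampled_df.columns}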

In [None]:
rows_before = sampled_df.shape[0]
sampled_df = sampled_df.dropna()
rows_after = sampled_df.shape[0]

#### Sanity check

In [None]:
rows_before - rows_after

In [None]:
sampled_df.columns

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br>

Select Features of Interest for Clustering
<br><br></p>


In [None]:
features = ['air_pressure', 'air_temp', 'avg_wind_direction', 'avg_wind_speed', 'max_wind_direction', 
        'max_wind_speed','relative_humidity']

In [None]:
select_df = sampled_df[features]

In [None]:
select_df.sample(5)

### Feature scaling

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(select_df)
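
Scaling matters here because k-means uses Euclidean distance, so a feature measured in hundreds (air pressure) would otherwise dominate one measured in single digits (wind speed). The sketch below, on a made-up two-column array (`toy_X`, hypothetical values), shows what `StandardScaler` does: each column ends up with mean 0 and standard deviation 1, and `inverse_transform` recovers the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: a pressure-like column and a temperature-like column.
toy_X = np.array([[900.0, 50.0],
                  [910.0, 60.0],
                  [920.0, 70.0]])

toy_scaler = StandardScaler()
toy_scaled = toy_scaler.fit_transform(toy_X)

# Per-column mean and (population) standard deviation after scaling.
col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)

# Map the scaled values back to the original units.
restored = toy_scaler.inverse_transform(toy_scaled)

print(col_means, col_stds)
```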

### k-Means Clustering

In [None]:
kmeans = KMeans(n_clusters=12)
kmeans.fit(X)
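
Because k-means starts from random initial centers, the cluster numbering (and sometimes the clusters themselves) can differ between runs. A minimal sketch on synthetic blobs, assuming arbitrary choices of `random_state=0` and `n_init=10` that are not taken from the notebook, shows how to pin the result down:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs in 2-D, 50 points each (made-up data).
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

# n_init restarts the algorithm several times and keeps the best run;
# random_state makes those restarts repeatable.
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = toy_kmeans.fit_predict(toy_X)

print(toy_kmeans.cluster_centers_.shape)  # (3, 2)
print(toy_kmeans.inertia_)                # within-cluster sum of squared distances
```

The same two arguments could be passed alongside `n_clusters=12` in the cell above.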

### Clusters

In [None]:
# The fitted estimator is named `kmeans` above
centers = kmeans.cluster_centers_
centers

### Dataframe with cluster center coordinates

In [None]:
def df_centers(featuresUsed, centers):
    colNames = featuresUsed + ['prediction']

    # Zip with a column called 'prediction' (index)
    Z = [np.append(A, index) for index, A in enumerate(centers)]

    # Convert to pandas data frame for plotting
    P = pd.DataFrame(Z, columns=colNames)
    P['prediction'] = P['prediction'].astype(int)
    return P

In [None]:
P = df_centers(features, centers)
P
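
The centers in `P` are in standardized units, because k-means was fit on the scaled matrix `X`. To read them as temperatures or humidity percentages, they can be mapped back with the scaler's `inverse_transform`. A sketch on a made-up two-feature frame (all `toy_*` names and values are hypothetical):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

toy_features = ["air_temp", "relative_humidity"]

# Made-up readings forming a cool/humid group and a warm/dry group.
toy_df = pd.DataFrame({
    "air_temp": [50.0, 52.0, 51.0, 80.0, 82.0, 81.0],
    "relative_humidity": [90.0, 88.0, 89.0, 20.0, 22.0, 21.0],
})

toy_scaler = StandardScaler()
toy_X = toy_scaler.fit_transform(toy_df[toy_features])
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_X)

# Convert the standardized cluster centers back to the original units.
centers_original = pd.DataFrame(
    toy_scaler.inverse_transform(toy_kmeans.cluster_centers_),
    columns=toy_features,
)

print(centers_original.sort_values("air_temp").round(1))
```

In the notebook itself, the equivalent call would be `scaler.inverse_transform(centers)`. The parallel plots below still use the standardized values, which is what makes the features comparable on one axis.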

### Parallel plots

In [None]:
def parallel_plot(data):
    my_colors = list(islice(cycle(['b', 'r', 'g', 'y', 'k']), None, len(data)))
    
    plt.figure(figsize=(15, 8)).gca().axes.set_ylim([-3, +3])
    
    parallel_coordinates(data, 'prediction', color=my_colors, marker='o')

In [None]:
parallel_plot(P)

#### Dry Days

In [None]:
parallel_plot(P[P['relative_humidity'] < -0.5])

#### Warm Days

In [None]:
parallel_plot(P[P['air_temp'] > 0.5])

#### Cool Days

In [None]:
parallel_plot(P[(P['relative_humidity'] > 0.5) & (P['air_temp'] < 0.5)])