# DSE 230: Programming Assignment 4.1 - K-Means Cluster Analysis
---
#### Tasks:
- Work with `minute_weather.csv`
    - Use scikit-learn to perform k-means clustering (25%)
    - Explore parallelism with scikit-learn for k-means clustering (10%)
    - Explore parallelism with dask for k-means clustering (65%)
- Submission on Gradescope (2 files)
  - Completed notebook (.ipynb) or PDF with results under **PA4.1 Notebook**
    - Make sure that all expected outputs are present
  - An executable script (.py) exported from this notebook under **PA4.1**

#### Due date: Friday 5/28/2021 at 11:59 PM PST

## Setup

In [None]:
# Import required libraries
# YOUR CODE HERE


## Scikit-Learn (25%)
---

**1.1** (5%) Load Data
- Load `minute_weather.csv` into a pandas DataFrame
- Drop the two columns ["rowID", "hpwren_timestamp"] from the dataframe
- Print out the column names (features) from the output of the previous step

In [None]:
# YOUR CODE HERE

**1.2** (5%) Data preprocessing and normalization using sklearn
- Split the data into train and test sets, with 80% of the original dataset for training and 20% for testing.
    * Pass `random_state=seed` to `train_test_split` for reproducible results
- Print the number of samples in the train and test datasets, and the summary statistics of the training dataset.
- Normalize the features of both the train and test datasets using `StandardScaler` from scikit-learn. Fit the scaler on the **train** data only.
- Print the mean and standard deviation of each feature column for both the train and test datasets.

(Each mean and standard deviation output should be a vector of shape (1, number of features). Label your results clearly.)
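
As a rough illustration of the "fit on train, transform both" pattern described above, here is a minimal sketch. It uses a small synthetic DataFrame rather than `minute_weather.csv`, and the column names `feat_a`, `feat_b`, and `feat_c` are hypothetical stand-ins, so treat it as one possible shape of the workflow rather than the expected solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

seed = 30
rng = np.random.default_rng(seed)

# Synthetic stand-in for the weather features (hypothetical column names).
df = pd.DataFrame(
    rng.normal(loc=[10.0, 50.0, 900.0], scale=[2.0, 15.0, 5.0], size=(1000, 3)),
    columns=["feat_a", "feat_b", "feat_c"],
)

# 80/20 split with a fixed seed for reproducibility.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=seed)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(train_df)
X_train = scaler.transform(train_df)
X_test = scaler.transform(test_df)

# keepdims=True yields vectors of shape (1, number of features).
train_mean = X_train.mean(axis=0, keepdims=True)
train_std = X_train.std(axis=0, keepdims=True)
test_mean = X_test.mean(axis=0, keepdims=True)
test_std = X_test.std(axis=0, keepdims=True)

print("train mean:", train_mean)
print("train std: ", train_std)
print("test mean: ", test_mean)
print("test std:  ", test_std)
```

Because the scaler is fitted on the training data, the training features come out with mean 0 and standard deviation 1, while the test features are only approximately standardized.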

In [None]:
seed=30
# YOUR CODE HERE

### Build Clustering Model
**1.4** (10%) KMeans Clustering model with sklearn
- Use the normalized training dataset to fit a K-means model with 9 clusters
    * Pass `random_state=seed` to `KMeans` for reproducing results
- Print out the cluster centers found by the model
- Report the run time by adding `%%time` as the first line of the cell
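
The fitting step might look roughly like the sketch below. It runs on synthetic data, passes `n_init=10` explicitly (an assumption made here to keep behaviour stable across scikit-learn versions), and uses `time.perf_counter` in place of the `%%time` cell magic so that it also runs outside a notebook.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

seed = 30
rng = np.random.default_rng(seed)

# Synthetic stand-in for the normalized training features.
X_train = rng.normal(size=(2000, 4))

start = time.perf_counter()
# n_init=10 is an explicit choice for this sketch, not part of the assignment text.
kmeans = KMeans(n_clusters=9, n_init=10, random_state=seed).fit(X_train)
elapsed = time.perf_counter() - start

# One row per cluster, one column per feature.
print(kmeans.cluster_centers_)
print(f"fit took {elapsed:.3f} s")
```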

In [None]:
# YOUR CODE HERE

### Evaluate Model

**1.5** (5%) Evaluate KMeans clustering model with sklearn
- Print out the `inertia_` attribute of the model, and explain what this number means for a KMeans model
- Print out the within-cluster sum of squared errors (WSSE) on the train and test data

See the KMeans documentation at https://scikit-learn.org/stable/modules/clustering.html
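
One way to relate these quantities is sketched below on synthetic data. In scikit-learn, `KMeans.score(X)` returns the negative of the k-means objective on `X`, so negating it gives a WSSE value; on the training data this should agree with `inertia_` up to floating-point error. The data and `n_init=10` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

seed = 30
rng = np.random.default_rng(seed)

# Synthetic stand-ins for the normalized train and test features.
X_train = rng.normal(size=(800, 3))
X_test = rng.normal(size=(200, 3))

kmeans = KMeans(n_clusters=9, n_init=10, random_state=seed).fit(X_train)

# score() is the negative sum of squared distances to the nearest center.
wsse_train = -kmeans.score(X_train)
wsse_test = -kmeans.score(X_test)

print("inertia_:  ", kmeans.inertia_)
print("WSSE train:", wsse_train)
print("WSSE test: ", wsse_test)
```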

In [None]:
# YOUR CODE HERE

---
## Parallelism with Scikit-Learn (10%)
**2.1** (10%) Single machine parallelism using **all** the cores
- Fit the model with single-machine parallelism using scikit-learn and joblib (via `n_jobs` parameter)
    * Pass `random_state=seed` to `KMeans` for reproducing results
- Print out the WSSE on train and test
- Use %%time to print out the computational performance

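Note that your model's parameters and seed should remain the same as in the previous questions. The `n_jobs` parameter of `KMeans` was deprecated in scikit-learn 0.23 and removed in 1.0, so this task assumes an older scikit-learn release.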

In [None]:
# YOUR CODE HERE

---
## Parallelism with Dask (65%)
Multi-machine parallelism using Dask's scalable k-means algorithm

### Create and connect to client
**3.1** (5%) Set up the Dask distributed client
- Create a Dask distributed client with 2 workers
- Print out the Dask client information

In [None]:
# YOUR CODE HERE

### Load Data into Dask DataFrame

**3.2** (5%) Load the data into Dask Dataframe
- Load the dataset into a Dask DataFrame
- Use `%%time` to report how long the load operation takes

In [None]:
# YOUR CODE HERE

### Explore Data using Dask

**3.3** (5%) Summary statistics
- Print out the shape of the dataframe
- Print the first 10 rows of the dask dataframe
- Print the summary statistics on all the features of the dask dataframe

In [None]:
# YOUR CODE HERE

### Prepare Data using Dask 

**3.4** (5%) Data Preparation with Dask DataFrame
- Drop the two columns ["rowID", "hpwren_timestamp"] from the dataframe
- Perform an 80/20 train and test split with `random_state=seed` (same as the previous task, but in Dask)
- Print out the number of samples in the train and test datasets

Note that the numbers of samples will differ slightly from the scikit-learn split, because Dask and scikit-learn implement the split differently and because of round-off differences.

In [None]:
# YOUR CODE HERE

**3.5** (10%) Data preprocessing and normalization with Dask
- Perform feature normalization using the Dask library. Use only the **train** data to fit the scaler.
- Print out the summary statistics of the transformed features in the train and test dataframes
- Comment on what you observe in the summary statistics of the transformed features in the train and test dataframes

In [None]:
# YOUR CODE HERE

### Build Dask K-Means Model
**3.6** (15%) KMeans clustering model with dask
- Fit a KMeans model from Dask's clustering library on the transformed Dask dataframe. Set `n_clusters` and `random_state` to the same values as in the previous task.
- Report the run time using `%%time`

Note that Dask's K-Means estimator uses k-means|| as the default initialization algorithm. To compare with scikit-learn's implementation of k-means, use k-means++ instead.

In [None]:
# YOUR CODE HERE

### Evaluate Dask K-Means Model
**3.7** (5%) Analyse hyperparameters
- Print out the `inertia_` of the KMeans model
- Report the run time with `%%time`
- Double-check that the dataframes and hyperparameters are the same for the scikit-learn and Dask K-Means models. Does the `inertia_` you printed differ from your scikit-learn result? Explain your observation.


**3.8** (10%) The Dask K-Means estimator does not have a `score()` method. As an easy fix, we can instantiate a scikit-learn K-Means estimator from the fitted Dask model (i.e., just copy the cluster centers over) and use the scikit-learn K-Means `score()` method.
- Print out the cluster centers found by the Dask KMeans model
- Instantiate a scikit-learn KMeans estimator and set its cluster centers to those of the Dask model
- Print out the WSSE on train and test using the `score()` method. (Note that WSSE is the within-cluster sum of **squared** errors)
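
A minimal sketch of the "copy the centers over" idea follows. The array `centers` is a random stand-in for the Dask model's `cluster_centers_`, and the data is synthetic. In recent scikit-learn versions an estimator that has never been fitted may lack internal state that `score()` relies on, so this sketch first fits on the centers themselves before overwriting `cluster_centers_`; that workaround is an assumption of this example, not part of the assignment text. A plain NumPy computation is included as a cross-check.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(30)

# Synthetic stand-ins for the normalized train and test features.
X_train = rng.normal(size=(800, 3))
X_test = rng.normal(size=(200, 3))

# Stand-in for the cluster centers of a fitted Dask K-Means model.
centers = rng.normal(size=(9, 3))

# Fit on the centers themselves so the estimator's internal state is initialized,
# then overwrite the centers with the exact values to be scored.
scorer = KMeans(n_clusters=centers.shape[0], init=centers, n_init=1, max_iter=1)
scorer.fit(centers)
scorer.cluster_centers_ = centers.copy()

# score() returns the negative within-cluster sum of squared errors.
wsse_train = -scorer.score(X_train)
wsse_test = -scorer.score(X_test)


def wsse_numpy(X, C):
    """Sum over samples of the squared distance to the nearest center."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()


print("WSSE train:", wsse_train, "(numpy check:", wsse_numpy(X_train, centers), ")")
print("WSSE test: ", wsse_test, "(numpy check:", wsse_numpy(X_test, centers), ")")
```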

In [None]:
# YOUR CODE HERE

### Stop the Dask Client

**3.9** (5%) Stop the dask client

In [None]:
# YOUR CODE HERE