### University of Virginia
### DS 5110: Big Data Systems
### K-Means Cluster Analysis of Fidelity Fund Returns 
### Last updated: March 22, 2021

## Instructions

In this assignment, you will conduct a k-means cluster analysis on a set of Fidelity mutual funds.  
Clustering groups similar funds by their performance rather than by their descriptions, which is the more typical approach.  
The outline below will walk you through the required steps.  

This assignment is worth a total of **10 POINTS.**

## Data Details 

The file *fido_returns.csv* is the raw data.

The file *fido_returns_funds_on_rows.csv* is the processed data for k-means. Additional details about this file: 
- Each row represents a mutual fund  
- Each column represents a trading day (these are used as features)  
- Each value represents the daily percentage change in price between the current trading day and previous trading day
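To make the meaning of each value concrete, here is a minimal sketch of how a daily percentage change could be computed from a price series. The prices are hypothetical, and the sketch assumes the change is expressed in percent (multiplied by 100); the actual file may store it as a fraction instead.

```python
def daily_pct_change(prices):
    """Return the percentage change between each price and the one before it."""
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(prices, prices[1:])
    ]


# Hypothetical closing prices for one fund over four trading days
toy_prices = [100.0, 102.0, 99.96, 99.96]
toy_returns = daily_pct_change(toy_prices)

# One value per trading day after the first
print([round(r, 2) for r in toy_returns])
```

A series of n prices yields n - 1 changes, because the first trading day has no previous day to compare against.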

### Preprocessing the Data (using pandas)

In [None]:
import pandas as pd
import numpy as np

In [None]:
# read in the dataset
df0 = pd.read_csv('fido_returns.csv')

In [None]:
# print the head
df0.head(3)

In [None]:
# store the tickers in a list called tickers
tickers = list(df0.columns[1:])
tickers[:5]

In [None]:
# store the dates
dates = df0.Index.values
dates[:5]

In [None]:
# drop the index column
del df0['Index']

In [None]:
# store the dataframe values
vals = df0.values

In [None]:
# transpose the data, putting funds on rows and timepoints on columns
vals_t = vals.transpose()

In [None]:
vals_t

In [None]:
# VALUE: 1 POINT
# print the shape of the transposed array
vals_t.shape

In [None]:
# build the dataframe with the transposed data, calling it dft
dft = pd.DataFrame(data=vals_t, index=df0.columns, columns=dates)

In [None]:
# preview the preprocessed data (to work with it later, save it to a file with dft.to_csv())
dft.head()

### Load Modules and Read Data into Spark DataFrame

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data
df = spark.read.csv("fido_returns_funds_on_rows.csv", header=True, inferSchema=True)
df.select('*').show(2)

**(VALUE: 2 POINTS) Assemble the features into a single column.**  
Show the first five rows of data ONLY for the features column (this should make things easier to read).

**(VALUE: 2 POINTS) Set up the k-means model and train the model**  
Use parameters: 
- 3 clusters
- maximum of 10 iterations 
- seed=314
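To illustrate what these three parameters control, the following is a minimal pure-Python sketch of Lloyd's algorithm on hypothetical 2-D points. It is not Spark's implementation (Spark initializes centers differently by default) and it is not the solution to this step; the function name `kmeans` and the toy data are assumptions made for illustration.

```python
import random


def kmeans(points, k, max_iter=10, seed=314):
    """Toy Lloyd's algorithm on a list of equal-length tuples.

    Returns (centers, labels). Illustrative only, not Spark's KMeans.
    """
    rng = random.Random(seed)          # the seed fixes the initial centers
    centers = rng.sample(points, k)    # start from k of the input points
    labels = None
    for _ in range(max_iter):          # stop after at most max_iter passes
        # Assignment step: give each point the label of its nearest center
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:       # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                # an empty cluster keeps its old center
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels


# Hypothetical data: six points in the plane
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0),
              (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
centers, labels = kmeans(toy_points, k=3)
print(labels)
```

Because the initial centers are random, the result can be a local optimum rather than the most natural grouping. Fixing the seed makes the run repeatable, which is why the assignment specifies one.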

**(VALUE: 2 POINTS) Compute and Print the Silhouette Score**  
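As background on what the score measures: for each point, the silhouette compares the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b), giving (b - a) / max(a, b). The sketch below computes this on hypothetical points in plain Python using squared Euclidean distance, which is the default distance measure of Spark's `ClusteringEvaluator`. The helper names are assumptions, and Spark's value may differ in implementation details.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(values):
    return sum(values) / len(values)


def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, p in enumerate(points):
        # Group the distances from p to every other point by cluster label
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(sq_dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if own is None:
            scores.append(0.0)  # p is alone in its cluster: score taken as 0
            continue
        a = mean(own)                                  # cohesion
        b = min(mean(d) for d in by_cluster.values())  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical data: two tight pairs of points, far apart
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match the pairs
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # labels split each pair
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters; scores near or below 0 indicate that points sit closer to another cluster than to their own.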

**(VALUE: 2 POINTS) Define a function `kmeans_range()` that does the following:**
- takes an integer representing the lower bound for k
- takes an integer representing the upper bound for k
- takes a Spark DataFrame containing the training data
- fits k-means with k ranging from the lower bound to the upper bound, inclusive  
- uses the same values for the other parameters as earlier (maximum of 10 iterations, seed=314)
- computes the silhouette score for each k
- returns a pandas DataFrame with one row per k and two columns: k and the silhouette score

In [None]:
# kmeans_range() definition here


**(VALUE: 1 POINT) Call `kmeans_range` to compute K-means with clusters ranging from 2 to 10 inclusive, printing the resulting dataframe.**

In [None]:
# function call, and printing results


**(VALUE: 1 POINT) Produce a plot with the number of clusters k on the x-axis and the silhouette scores on the y-axis.**

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt