# SD201 - Lab2 - Clustering
Student: José Lucas Barretto

## Question 1

First, we need to import the data from the CSV file. I will load it into a pandas DataFrame.

In [1]:
import pandas as pd

# import csv data to a pandas DataFrame
data = pd.read_csv('data.csv')

# print the first 5 samples from the data
data.head(5)

| Unnamed: 0 | StockName | 1/28/2011 | 4/29/2011 | 5/20/2011 | 4/1/2011 | 5/27/2011 | 6/17/2011 | 4/15/2011 | 2/18/2011 | 3/18/2011 | ... | 1/14/2011 | 4/8/2011 | 4/21/2011 | 3/4/2011 | 3/25/2011 | 2/4/2011 | 1/7/2011 | 2/25/2011 | 5/13/2011 | 1/21/2011 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | American Express | -4.7557 | 4.00509 | 3.58155 | -0.395257 | 0.768624 | 1.12594 | -0.237274 | -1.91728 | 0.706794 | ... | 4.63801 | 1.46898 | 2.74809 | -0.022868 | 1.87709 | -0.70247 | 2.44804 | -3.13752 | -1.13863 | -0.065175 |
| 1 | Boeing | -3.2019 | 5.65488 | -1.44928 | 0.693878 | 0.574788 | 1.50561 | -1.42566 | 0.467675 | -2.90853 | ... | 0.93633 | 0.122649 | 3.74037 | -0.92452 | 4.33917 | 3.06093 | 4.88284 | -0.069109 | -0.353045 | 1.15721 |
| 2 | Chevron | -0.55384 | 1.92791 | 0.529256 | 1.80451 | 2.05676 | -0.869652 | -3.18936 | 3.37173 | 3.67084 | ... | 2.06707 | 1.0505 | 3.03001 | 1.43723 | 2.81148 | 3.47363 | -0.512765 | 2.89227 | -0.83293 | 0.903809 |
| 3 | Cisco Systems | 0.431862 | 3.48494 | -1.72414 | -1.84332 | 0.304692 | -1.12285 | -3.83964 | 0.053079 | -3.76193 | ... | 1.2894 | 3.76249 | 0.35545 | -1.18153 | -0.346021 | 5.35117 | 2.54279 | -0.480513 | -3.70793 | -2.35627 |
| 4 | DuPont | 3.81916 | 1.97522 | 0.0 | 1.99593 | 1.56522 | -0.919448 | -1.00992 | 2.8288 | -0.357277 | ... | 3.10559 | -0.18018 | 3.00295 | -0.645518 | 0.557621 | 4.74576 | -0.579421 | -1.60146 | -3.69494 | -2.38239 |


Now we can apply the k-means algorithm with default parameters to cluster the data (`n_clusters=8` is the scikit-learn default and is only written out for clarity).

In [27]:
from sklearn.cluster import KMeans

# set random seed for reproducibility
seed = 0

# remove stock names from the data
X = data.drop(['StockName'], axis=1)

# train the kmeans algorithm on the data
kmeans_default = KMeans(n_clusters=8, random_state=seed).fit(X)

# calculate kmeans sse
sse_default = kmeans_default.inertia_

# print results
print('K-Means with default parameters:')
print('Sum of Squared Errors: {:.2f}'.format(sse_default))

K-Means with default parameters:
Sum of Squared Errors: 1536.62

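To make explicit what `inertia_` measures, here is a minimal sketch that recomputes the SSE by hand as the sum of squared Euclidean distances between each sample and its assigned centroid. It runs on synthetic data (`X_demo` is an assumption standing in for the stock returns, not the lab's `data.csv`), so the numbers differ from the ones above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the stock data (assumption: 30 samples, 25 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 25))

# n_init is set explicitly so the run does not depend on the library default
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_demo)

# look up, for every sample, the centroid of the cluster it was assigned to
assigned_centers = km.cluster_centers_[km.labels_]

# SSE: sum of squared distances between samples and their centroids
sse_manual = float(np.sum((X_demo - assigned_centers) ** 2))

print('inertia_  : {:.4f}'.format(km.inertia_))
print('manual SSE: {:.4f}'.format(sse_manual))
```

The two printed values should agree up to floating-point rounding.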

## Question 2

In order to improve the clustering performance, I chose to tune the following parameters:

**n_init**: this parameter sets the number of times the k-means algorithm is run with different initial centroids. Since the algorithm keeps the run with the lowest SSE, increasing this parameter tends to improve the clustering, at a higher computational cost. Because the centroid initialization is random, however, changing this parameter does not necessarily change the result: while unlikely, a single run (*n_init = 1*) may happen to start from the best possible set of centroids, and 10 runs (*n_init = 10*) may all start from poor ones.

In [28]:
kmeans_n_init = KMeans(n_clusters=8, n_init=200, random_state=seed).fit(X)

# calculate kmeans sse
sse_n_init = kmeans_n_init.inertia_

# print results
print('K-Means with tuned n_init parameter:')
print('Sum of Squared Errors: {:.2f}'.format(sse_n_init))

K-Means with tuned n_init parameter:
Sum of Squared Errors: 1510.02

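The "best of several restarts" idea behind `n_init` can be sketched by hand: run k-means once per seed and track the lowest SSE seen so far. This is an illustration on synthetic data (`X_demo` and the seeds 0 to 19 are assumptions), not a reproduction of scikit-learn's internal restart logic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the stock data (assumption, not the lab's data.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 25))

# one single-initialization run per seed (hypothetical seeds 0..19)
sse_per_run = [
    KMeans(n_clusters=8, n_init=1, random_state=s).fit(X_demo).inertia_
    for s in range(20)
]

# best SSE seen after the first 1, 2, ..., 20 restarts
best_so_far = np.minimum.accumulate(sse_per_run)

for n in (1, 5, 10, 20):
    print('best of {:2d} runs: {:.2f}'.format(n, best_so_far[n - 1]))
```

By construction the best-so-far SSE can only stay flat or drop as more restarts are added, while the individual runs in `sse_per_run` can be better or worse in any order.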

**tol**: this parameter defines the relative tolerance, with regard to the Frobenius norm of the difference between the cluster centers of two consecutive iterations, used to declare convergence. Decreasing it therefore means the algorithm may need more iterations before convergence is declared, which can improve the clustering. As before, decreasing this parameter only improves the SSE up to a certain point: once the tolerance is small enough, there are no significant gains, so decreasing it further won't necessarily improve the model. In fact, its default value (0.0001) is already very low and can be used out-of-the-box, which is why the SSE below is identical to that of the default run.

In [29]:
kmeans_tol = KMeans(n_clusters=8, tol=0.0001, random_state=seed).fit(X)

# calculate kmeans sse
sse_tol = kmeans_tol.inertia_

# print results
print('K-Means with tuned tol parameter:')
print('Sum of Squared Errors: {:.2f}'.format(sse_tol))

K-Means with tuned tol parameter:
Sum of Squared Errors: 1536.62

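To see the effect of `tol` more directly, the sketch below fits the same model with a loose, the default, and a very tight tolerance, and reports the fitted `n_iter_` and `inertia_` attributes. The data is synthetic (`X_demo` is an assumption), and `n_init=1` with a fixed seed is used so that every run starts from the same centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (assumption, not the lab's data.csv)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

runs = {}
for tol in (1e-1, 1e-4, 1e-8):
    # n_init=1 and a fixed seed give every run the same starting centroids
    km = KMeans(n_clusters=8, n_init=1, tol=tol, random_state=0).fit(X_demo)
    runs[tol] = (km.n_iter_, km.inertia_)
    print('tol={:g}: iterations={}, SSE={:.2f}'.format(tol, km.n_iter_, km.inertia_))
```

A tighter tolerance can only let the algorithm run for as many or more iterations from the same start; whether the extra iterations lower the SSE noticeably depends on the data.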

Now, we're going to tune both parameters and use a tuned model to cluster the data.

In [30]:
kmeans_tuned = KMeans(n_clusters=8, tol=0.0001, n_init=200, random_state=seed).fit(X)

# calculate kmeans sse
sse_tuned = kmeans_tuned.inertia_

# print results
print('K-Means with tuned parameters:')
print('Sum of Squared Errors: {:.2f}'.format(sse_tuned))

K-Means with tuned parameters:
Sum of Squared Errors: 1510.02


## Question 3

The next step is to organize the clustering results into a dictionary and proceed to a qualitative analysis.

In [31]:
from pprint import pprint

# get company names
stocks = data['StockName']

# create empty dict to store results
results = {}

# group company names by cluster label
for idx, group in enumerate(kmeans_tuned.labels_):

    # create the list the first time a label is seen, then append to it;
    # int() turns the NumPy label into a plain Python int for printing
    results.setdefault(int(group), []).append(stocks[idx])

# print formatted dictionary
pprint(results)

{0: ['Intel'],
 1: ['DuPont', 'Caterpillar', 'Alcoa'],
 2: ['American Express',
     'Boeing',
     'Microsoft',
     'Walt Disney',
     'General Electric',
     'United Technologies',
     'JPMorgan Chase',
     '3M'],
 3: ['Cisco Systems'],
 4: ['Kraft',
     'Verizon',
     'IBM',
     'The Home Depot',
     'Procter & Gamble',
     'Wal-Mart',
     'AT&T',
     'Merck',
     'Travelers',
     'McDonalds',
     'Coca-Cola',
     'Johnson & Johnson'],
 5: ['Chevron', 'Pfizer', 'ExxonMobil'],
 6: ['Bank of America'],
 7: ['Hewlett-Packard']}

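As an alternative to the explicit loop, the same grouping could be written with a pandas `groupby`. This is a minimal sketch on hypothetical toy inputs (`names` and `labels` stand in for `data['StockName']` and `kmeans_tuned.labels_`), not the code that produced the output above.

```python
import numpy as np
import pandas as pd

# hypothetical toy inputs standing in for data['StockName'] and kmeans_tuned.labels_
names = pd.Series(['Intel', 'DuPont', 'Alcoa', 'Cisco Systems'])
labels = np.array([0, 1, 1, 3])

# group the names by cluster label and collect each group into a list
grouped = names.groupby(labels).apply(list).to_dict()
print(grouped)
```

The result maps each cluster label to the list of company names assigned to it, in the same shape as `results`.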

Labeling each cluster group:
    
0. **High-Tech - Computer Hardware**. The sole company in this group is Intel, a semiconductor chip manufacturer.


1. **Heavy Industry**. All the companies in this group belong to the heavy-industry sector: DuPont (chemicals), Caterpillar (heavy equipment) and Alcoa (aluminum).


2. **American S&P 100 Companies**. This group includes companies from different sectors: **financial services** (American Express and JPMorgan Chase), **industrials** (Boeing, General Electric, United Technologies, 3M), **tech** (Microsoft) and **media and entertainment** (Walt Disney). All of these corporations are US-based and part of the Standard & Poor's (S&P) 100 stock market index.


3. **High-Tech - Networking Hardware and Software**. The sole company in this group is Cisco, which develops and sells networking hardware and software.


4. **Consumer Goods**. This big cluster is mostly made up of companies in consumer goods and retail (Kraft, Home Depot, Procter & Gamble, Walmart, McDonald's, Coca-Cola, Johnson & Johnson). It also contains telecommunications and tech companies (Verizon, AT&T, IBM), a pharmaceutical company (Merck), and one insurance company (Travelers).


5. **Oil Companies**. Out of the three companies in the cluster, two are oil companies (Chevron and ExxonMobil). The other one is a pharmaceutical company (Pfizer).


6. **Banking**. The cluster contains only one company: Bank of America.


7. **High-Tech - IT, Software and Hardware**. The cluster contains only one company: Hewlett-Packard.