# Preprocessing

In [78]:
import pandas as pd
import numpy as np

print("Preprocessing and cleaning data...\n")
##Read the Excel spreadsheet of data
data = pd.read_excel('cleaned_data.xls', header=None)

##Declare the column names
data.columns = ['date', 's_and_p_comp', 'dividend', 'earnings',
                'CPI', 'fraction_date', 'long_interest_rate', 'real_price',
                'real_dividend', 'real_total_return_price','real_earnings',
                'real_scaled_earnings', 'CAPE', 'TR_CAPE', 'excess_CAPE', 'montly_bond_return',
                'real_bond_return','10_year_stock_return', '10_year_bond_return',
                '10_year_excess_return']

##Replace 'NA' strings with NaN so missing values can be detected
data = data.replace('NA', np.nan)

data.head()

##Drop "10 year" columns so there are no rows with missing data after 2011
print("Dropping unfinished columns...")
data = data.drop(['10_year_stock_return', '10_year_bond_return',
                  '10_year_excess_return'], axis=1)

##Drop rows with missing data
print("Dropping unfinished rows...")
print('\n\nNumber of rows in original data = %d' % (data.shape[0]))
data = data.dropna()
print('Number of rows after discarding missing values = %d\n' % (data.shape[0]))

##Report the number of instances and attributes
print('Number of instances = %d' % (data.shape[0]))
print('Number of attributes = %d\n' % (data.shape[1]))

##Check to make sure there are no missing values in each column
print('Number of missing values:')
for col in data.columns:
    print('\t%s: %d' % (col,data[col].isna().sum()))

print("\n\nPreprocessing done.")

Preprocessing and cleaning data...

Dropping unfinished columns...
Dropping unfinished rows...


Number of rows in original data = 1810
Number of rows after discarding missing values = 1687

Number of instances = 1687
Number of attributes = 17

Number of missing values:
	date: 0
	s_and_p_comp: 0
	dividend: 0
	earnings: 0
	CPI: 0
	fraction_date: 0
	long_interest_rate: 0
	real_price: 0
	real_dividend: 0
	real_total_return_price: 0
	real_earnings: 0
	real_scaled_earnings: 0
	CAPE: 0
	TR_CAPE: 0
	excess_CAPE: 0
	montly_bond_return: 0
	real_bond_return: 0


Preprocessing done.
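
The order of the two drop steps matters: if rows were dropped before the unfinished "10 year" columns, every row with a gap in those columns would be lost as well. The sketch below illustrates this on a toy frame. The column names `value` and `unfinished` and all numbers are invented for illustration and are not taken from the spreadsheet.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'unfinished' stands in for a "10 year"
# column that has no data for recent rows.
toy = pd.DataFrame({
    "value": [1.0, "NA", 3.0],
    "unfinished": [0.5, "NA", "NA"],
})

# Turn the 'NA' placeholder strings into real missing values.
toy = toy.replace("NA", np.nan)

# Dropping rows first also discards rows that are missing only 'unfinished'.
rows_if_dropna_first = toy.dropna().shape[0]

# Dropping the unfinished column first keeps those rows.
rows_if_columns_first = toy.drop(columns=["unfinished"]).dropna().shape[0]

print(rows_if_dropna_first, rows_if_columns_first)
```

Only the row that is missing `value` itself is removed when the column is dropped first, which mirrors the approach taken in the cell above.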


# K-means Clustering

In [79]:
recession_data = data[(data['date'] > '2006') & (data['date'] < '2011')]
recession_data = recession_data[['date', 'real_price', 'real_scaled_earnings']]
recession_data

| | date | real_price | real_scaled_earnings |
|---|---|---|---|
| 1621 | 2006-01 | 1766.770556 | 45625.435221 |
| 1622 | 2006-02 | 1760.345828 | 46210.667771 |
| 1623 | 2006-03 | 1774.089539 | 46631.147797 |
| 1624 | 2006-04 | 1770.58446 | 46692.36296 |
| 1625 | 2006-05 | 1745.388308 | 46916.715417 |
| 1626 | 2006-06 | 1692.201032 | 47282.288405 |
| 1627 | 2006-07 | 1696.730472 | 48078.694002 |
| 1628 | 2006-08 | 1729.561257 | 48920.258166 |
| 1629 | 2006-09 | 1779.392252 | 50103.551688 |
| 1630 | 2006-10 | 1851.056905 | 51080.952758 |


The `real_price` and `real_scaled_earnings` columns are taken from the data for 2006 through 2010. This period is interesting because it includes the Great Recession along with the months leading into and out of it. The data will be grouped into three clusters.

In [80]:
from sklearn import cluster

clustering_data = recession_data.drop('date',axis=1)
k_means = cluster.KMeans(n_clusters=3, max_iter=50, random_state=1)
k_means.fit(clustering_data) 
labels = k_means.labels_
pd.DataFrame(labels, index=recession_data.date, columns=['Cluster ID'])



| date | Cluster ID |
|---|---|
| 2006-01 | 1 |
| 2006-02 | 1 |
| 2006-03 | 1 |
| 2006-04 | 1 |
| 2006-05 | 1 |
| 2006-06 | 1 |
| 2006-07 | 1 |
| 2006-08 | 1 |
| 2006-09 | 1 |
| 2006-10 | 1 |


The Great Recession officially lasted from December 2007 to June 2009. The clusters do not line up exactly with these dates; they lag by several months. This is probably because it took time for the onset and the end of the recession to show up in the selected metrics.
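
One way to check how the clusters line up with the official dates is to collapse the monthly labels into contiguous runs and read off where each run starts and ends. The following is a minimal sketch of that idea. The labels in it are invented for illustration and are not the output of the cell above; in the notebook, the `Cluster ID` column indexed by `date` would take their place.

```python
import pandas as pd

# Hypothetical monthly labels (NOT the notebook's actual clustering result).
labels = pd.Series(
    [1, 1, 2, 2, 0, 0, 0, 2, 1],
    index=["2007-10", "2007-11", "2007-12", "2008-01", "2008-02",
           "2008-03", "2008-04", "2008-05", "2008-06"],
    name="Cluster ID",
)

# A new run starts wherever the label differs from the previous month's label.
run_id = (labels != labels.shift()).cumsum()

# Summarize each run: which cluster, first and last month, and its length.
dates = labels.index.to_series()
runs = pd.DataFrame({
    "cluster": labels.groupby(run_id).first(),
    "start": dates.groupby(run_id).first(),
    "end": dates.groupby(run_id).last(),
    "months": labels.groupby(run_id).size(),
})
print(runs)
```

Comparing the `start` and `end` of each run against December 2007 and June 2009 gives the size of the lag directly, rather than reading it off a long column of labels.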

The k-means clustering algorithm placed the non-recession dates into Cluster 1. The dates in the transition into and out of the recession were placed into Cluster 2. The dates deep in the recession were placed into Cluster 0.
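
Cluster IDs from k-means are arbitrary integers, so an interpretation like the one above is easier to back up by looking at the cluster centroids and sizes. The sketch below shows how that might be done with the fitted model's `cluster_centers_` and `labels_` attributes. The toy values are invented stand-ins for `clustering_data`, and `n_init=10` is an added assumption that does not appear in the original cell.

```python
import pandas as pd
from sklearn import cluster

# Toy stand-in for clustering_data; all numbers are invented for illustration.
toy = pd.DataFrame({
    "real_price": [1700.0, 1750.0, 1720.0,
                   1200.0, 1150.0, 1180.0,
                   1450.0, 1480.0, 1430.0],
    "real_scaled_earnings": [46000.0, 47000.0, 46500.0,
                             20000.0, 19000.0, 21000.0,
                             33000.0, 34000.0, 32000.0],
})

# Same settings as the notebook cell, plus an explicit n_init (assumption).
k_means = cluster.KMeans(n_clusters=3, max_iter=50, random_state=1, n_init=10)
k_means.fit(toy)

# One row per cluster: the mean of each feature over that cluster's points.
centroids = pd.DataFrame(k_means.cluster_centers_, columns=toy.columns)
centroids.index.name = "Cluster ID"
print(centroids)

# Number of points assigned to each cluster.
cluster_sizes = pd.Series(k_means.labels_).value_counts().sort_index()
print(cluster_sizes)
```

A cluster whose centroid has the lowest `real_price` and `real_scaled_earnings` would correspond to the "deep in the recession" group. Because the two features are on very different scales, the distances here are dominated by `real_scaled_earnings`.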