## Goal

#### This thesis examines temporal app-usage data with two goals:
    1) Model the data by clustering groups of activities into states
    2) Predict the next state given the current state

To do this, a baseline model is built first. Improved models are then built that aim to surpass the baseline in modelling and prediction quality.
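The second goal, predicting the next state from the current one, can be illustrated with a first-order transition matrix estimated from a sequence of state labels. This is only a minimal sketch of one possible predictor, not the model used in this thesis. The names `transition_matrix` and `predict_next` and the toy `states` sequence are hypothetical.

```python
import numpy as np

def transition_matrix(states, n_states):
    """Estimate P[i, j] = P(next state = j | current state = i) from a label sequence."""
    counts = np.zeros((n_states, n_states))
    for cur, nxt in zip(states[:-1], states[1:]):
        counts[cur, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions fall back to a uniform distribution.
    return np.divide(
        counts,
        row_sums,
        out=np.full_like(counts, 1.0 / n_states),
        where=row_sums > 0,
    )

def predict_next(probs, current_state):
    """Return the most likely next state given the current state."""
    return int(np.argmax(probs[current_state]))

# Hypothetical sequence of cluster labels ordered in time.
states = [0, 1, 2, 0, 1, 2, 0, 1, 1]
P = transition_matrix(states, n_states=3)
print(predict_next(P, 0))
```

A predictor of this kind only looks one step back, so it could serve as a reference point for the temporal models listed in the plan.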
        
##### Baseline:      
    1) PCA to reduce data dimensionality, followed by k-means clustering to discover clusters in the data
    2) Add idle time to the data and repeat step 1
    3) Perform temporal dimensionality reduction
    4) Perform temporal clustering
    5) Use RNNs
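Step 1 of the list above can be sketched with scikit-learn on synthetic data. The blob data, the number of clusters, and the variable names are assumptions made for illustration. The real pipeline runs through the `Clean_DF` helper in `utils`, which may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the activity vectors: 3 well-separated blobs in 20 dimensions.
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(100, 20)) for c in centers])

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.9, svd_solver="full")
X_reduced = pca.fit_transform(X)

# Cluster in the reduced space; k=3 is known here only because the data is synthetic.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_reduced)

print(X_reduced.shape, np.bincount(labels))
```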
    
    
#### To Do:
    1) Clustering vs K: Done
    2) G-means

### Get cleaned data

In [1]:
from importlib import reload
from utils import *
import pandas as pd

# Fraction of total usage time the retained apps must cover,
# and fraction of variance the retained principal components must explain.
time_percentage = 0.9
explained_variance = 0.9

df = pd.read_csv("rescuetime_data-ac-min.csv")
data_pd = Clean_DF(df)  # helper class from utils
data_pd.clean_data(time_percentage=time_percentage)
data_pd.clean_df = data_pd.clean_df.reset_index()
data_pd.get_pca(explained_variance=explained_variance)
data_pd.get_day_time()

In [2]:
print("Dataset size:", data_pd.clean_df.shape,'\n')
print("Number of apps that consume", time_percentage*100, "% of all users time: ",len(data_pd.popular_apps), '\n')
print("Cleaned dataset columns:",'\n', data_pd.clean_df.columns.values, '\n')
print("Number of components that explain", explained_variance*100,"% of the data: ",data_pd.pca_data.shape[1], '\n')

Dataset size: (10983, 7) 

Number of apps that consume 90.0 % of all users time:  91 

Cleaned dataset columns: 
 ['Date' 'Time Spent (seconds)' 'Activity' 'Category' 'Productivity'
 'Activity Vector' 'Productivity Score'] 

Number of components that explain 90.0 % of the data:  31 
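The app count above comes from `Clean_DF` in `utils`, whose implementation is not shown here. One plausible way to select the apps that account for a given fraction of total time is a cumulative sum over per-app totals. This is offered as an assumption rather than the actual method, and the function `popular_apps` and the toy `log` frame are hypothetical.

```python
import pandas as pd

# Toy usage log; column names follow the cleaned dataset above.
log = pd.DataFrame({
    "Activity": ["editor", "browser", "editor", "browser", "chat", "music"],
    "Time Spent (seconds)": [500, 300, 100, 20, 50, 30],
})

def popular_apps(df, time_percentage):
    """Return the top activities whose combined time first reaches the given fraction."""
    totals = (
        df.groupby("Activity")["Time Spent (seconds)"]
        .sum()
        .sort_values(ascending=False)
    )
    cumulative = totals.cumsum() / totals.sum()
    # Keep every activity up to and including the first one that crosses the threshold.
    n_keep = int((cumulative < time_percentage).sum()) + 1
    return totals.index[:n_keep].tolist()

apps = popular_apps(log, time_percentage=0.9)
print(apps)
```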



### Compute Principal Components and visualize top-3 modes

In [3]:
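# Note: in plotly 4 and later this module lives in the separate chart-studio
# package (import chart_studio.plotly as py).
import plotly.plotly as py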
import plotly.graph_objs as go

data_pd.get_pca(explained_variance=explained_variance)

c = data_pd.clean_df['Productivity Score']
x = data_pd.pca_data[:,0]
y = data_pd.pca_data[:,1]
z = data_pd.pca_data[:,2]
# Hover text: join each row's activity list into a single string.
t = ['-'.join(a) for a in data_pd.clean_df['Activity'].tolist()]

trace1 = go.Scatter3d(x=x,y=y,z=z,text=t,mode='markers',marker=dict(size=12,color=c, colorscale='RdYlGn',opacity=0.8))
data = [trace1]
layout = go.Layout(margin=dict(l=0,r=0,b=0,t=0))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='PCA-3 Visualization')

### Visualize entire PCA dimensional space using t-SNE

In [4]:
from sklearn.manifold import TSNE
# Note: recent scikit-learn releases rename n_iter to max_iter.
tsne30 = TSNE(n_components=3, verbose=0, perplexity=30, n_iter=5000)
tsne_results30 = tsne30.fit_transform(data_pd.pca_data)

In [5]:
c = data_pd.clean_df['Productivity Score']
x = tsne_results30[:,0]
y = tsne_results30[:,1]
z = tsne_results30[:,2]
# Hover text: join each row's activity list into a single string.
t = ['-'.join(a) for a in data_pd.clean_df['Activity'].tolist()]

trace1 = go.Scatter3d(x=x,y=y,z=z,text=t,mode='markers',marker=dict(size=12,color=c, colorscale='RdYlGn',opacity=0.8))
data = [trace1]
layout = go.Layout(margin=dict(l=0,r=0,b=0,t=0))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='t-SNE PCA 90% variance Visualization')

## Clustering data in PCA dimensions

### Clustering data using the elbow method

    1) Run k-means once for each candidate number of clusters k
    2) For each k, record the inertia (within-cluster sum of squared distances)
    3) Plot inertia against k and pick the k at the elbow of the curve
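The elbow is usually picked by eye, but the steps above can also be approximated numerically. The sketch below uses the largest second difference of the inertia curve as the elbow. This heuristic is an assumption for illustration and is not part of the original analysis. The function `elbow_k` and the example values are hypothetical, and the candidate k values are assumed to be evenly spaced.

```python
import numpy as np

def elbow_k(ks, inertia):
    """Pick the k where the inertia curve bends most sharply (largest second difference)."""
    ks = np.asarray(ks)
    inertia = np.asarray(inertia, dtype=float)
    # Second difference at interior point i: inertia[i-1] - 2*inertia[i] + inertia[i+1].
    curvature = np.diff(inertia, n=2)
    # curvature[j] belongs to ks[j + 1], since the end points have no second difference.
    return int(ks[np.argmax(curvature) + 1])

# Made-up inertia values that drop steeply until k=4 and flatten afterwards.
ks = [2, 3, 4, 5, 6, 7]
inertia = [1000.0, 600.0, 250.0, 220.0, 200.0, 190.0]
best_k = elbow_k(ks, inertia)
print(best_k)
```

On noisy inertia curves this heuristic can be unstable, so it is best read alongside the plot rather than in place of it.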


In [7]:
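from sklearn.cluster import KMeans
import numpy as np

# Candidate cluster counts; one inertia value is stored per k.
ks = list(range(2, 26))
inertia = np.zeros(len(ks))
for idx, k in enumerate(ks):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(data_pd.pca_data)
    inertia[idx] = kmeans.inertia_

In [8]:
# Plot inertia against the actual k values rather than the array index.
trace = go.Scatter(x=ks, y=inertia, mode='markers')
data = [trace]
py.iplot(data, filename='K-Means inertia')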

### Plot PCA data with k-Means labels

In [14]:
for i in range(0,10):
    print(list(kmeans.labels_).count(i))

4851
3088
373
988
506
303
280
311
283
0


In [16]:
kmeans = KMeans(n_clusters=9)
kmeans.fit(data_pd.pca_data)
print(set(kmeans.labels_))
c = kmeans.labels_
x = tsne_results30[:,0]
y = tsne_results30[:,1]
z = tsne_results30[:,2]
# Hover text: join each row's activity list, then prefix it with the cluster id.
t = ['-'.join(a) for a in data_pd.clean_df['Activity'].tolist()]
t = [str(label) + '---' + name for label, name in zip(kmeans.labels_, t)]

trace1 = go.Scatter3d(x=x,y=y,z=z,text=t, mode='markers',marker=dict(size=12,color=c, colorscale = 'Viridis', opacity=0.8))
data = [trace1]
layout = go.Layout(margin=dict(l=0,r=0,b=0,t=0))
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='t-SNE PCA k-means elbow')


{0, 1, 2, 3, 4, 5, 6, 7, 8}


## Generate heatmap of daily activities

In [23]:
print(data_pd.clean_df['Day'], data_pd.clean_df['Time'], pd.Series(c))

0        2017-04-06
1        2017-04-06
2        2017-04-06
3        2017-04-06
4        2017-04-06
5        2017-04-06
6        2017-04-06
7        2017-04-06
8        2017-04-06
9        2017-04-06
10       2017-04-06
11       2017-04-06
12       2017-04-06
13       2017-04-06
14       2017-04-06
15       2017-04-06
16       2017-04-06
17       2017-04-06
18       2017-04-06
19       2017-04-06
20       2017-04-06
21       2017-04-06
22       2017-04-06
23       2017-04-06
24       2017-04-06
25       2017-04-06
26       2017-04-06
27       2017-04-06
28       2017-04-06
29       2017-04-06
            ...    
10953    2017-07-01
10954    2017-07-01
10955    2017-07-01
10956    2017-07-01
10957    2017-07-01
10958    2017-07-01
10959    2017-07-01
10960    2017-07-01
10961    2017-07-01
10962    2017-07-01
10963    2017-07-01
10964    2017-07-01
10965    2017-07-01
10966    2017-07-01
10967    2017-07-01
10968    2017-07-01
10969    2017-07-01
10970    2017-07-01
10971    2017-07-01
