## Team Members : Anmol Singh Suag, Sanuj Bhatia

We have chosen to work on the <b>VAST Challenge 2015</b> for the final project. The VAST Challenge 2015 has 2 Mini-Challenges and 1 Grand Challenge. For Homeworks 5-7, we are working on <b>Mini-Challenge 1</b>.

### Questions to be answered

* In homework 6, we discussed Mini-Challenge 1 of the VAST Challenge 2015, its data set and the kinds of analysis expected from it. We selected another data set from Stanford's SNAP that had similar characteristics. Some of the questions that MC 1 asks are:
    * Number of group types that visit the park
    * Size of these groups
    * Most frequently visited places by these groups
    * Observations about the groups from the data
    * Activity patterns of the group
    * Suggesting improvement in the park
    * Anomalies in the park over the 3 days

We have attempted to cluster and visualise our data set to answer similar questions on it, but due to the data not being of an identical nature, the answers take different forms, and this will be discussed later.

In [74]:
#Importing commonly used ML Libraries
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row, widgetbox
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource, LabelSet, CustomJS, Slider, Range1d
from bokeh.models.widgets import Select, Panel, Tabs
import pandas as pd
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit
import math
import copy
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans, DBSCAN
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   #For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
from random import randint
import pickle

In [75]:
output_notebook()

In [3]:
df = pd.read_table('Brightkite_totalCheckins.txt', header=None, names=['id', 'time', 'latitude', 'longitude', 'location_id'], usecols=['id', 'time', 'latitude', 'longitude'], nrows=713652, parse_dates=[1], infer_datetime_format=True)
df.head()

| | id | time | latitude | longitude |
|---|---|---|---|---|
| 0 | 0 | 2010-10-17 01:48:53 | 39.747652 | -104.99251 |
| 1 | 0 | 2010-10-16 06:02:04 | 39.891383 | -105.070814 |
| 2 | 0 | 2010-10-16 03:48:54 | 39.891077 | -105.068532 |
| 3 | 0 | 2010-10-14 18:25:51 | 39.750469 | -104.999073 |
| 4 | 0 | 2010-10-14 00:21:47 | 39.752713 | -104.996337 |


### Data Preparation
1. We round the latitude and longitude to 2 decimal places to absorb slight variations between check-ins at the same place. One step in the second decimal place corresponds to roughly 700 m to 1.1 km on the ground, depending on the exact location: about 1.1 km north-south, and less east-west as the latitude increases. This way, we can recognise unique locations throughout the data.

2. The location ID column was removed because it was a hash value combining the 6-decimal-place latitude and longitude, so it does not help in recognising unique locations once the coordinates are rounded.
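
As a rough check on the scale implied by two-decimal rounding, the sketch below converts a 0.01 degree step to kilometres at a given latitude. It assumes a spherical Earth with about 111.32 km per degree; the function name `cell_size_km` and the constant `KM_PER_DEGREE` are illustrative and not part of the notebook.

```python
import math

KM_PER_DEGREE = 111.32  # assumption: approximate length of one degree on a spherical Earth

def cell_size_km(latitude_deg, step_deg=0.01):
    """Return the (north-south, east-west) extent in km of a step_deg cell at a latitude."""
    north_south = KM_PER_DEGREE * step_deg
    # Meridians converge towards the poles, so the east-west extent shrinks with cos(latitude)
    east_west = KM_PER_DEGREE * step_deg * math.cos(math.radians(latitude_deg))
    return north_south, east_west

# Latitude taken from the first check-ins shown in the table above
ns, ew = cell_size_km(39.75)
print(round(ns, 2), round(ew, 2))
```

Under these assumptions the north-south extent is the same everywhere, while the east-west extent depends on the latitude of the check-in.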


In [76]:
df['latitude'] = round(df['latitude'], 2)
df['longitude'] = round(df['longitude'], 2)
df.head(5)

| | id | time | latitude | longitude |
|---|---|---|---|---|
| 0 | 0 | 2010-10-17 01:48:53 | 39.75 | -104.99 |
| 1 | 0 | 2010-10-16 06:02:04 | 39.89 | -105.07 |
| 2 | 0 | 2010-10-16 03:48:54 | 39.89 | -105.07 |
| 3 | 0 | 2010-10-14 18:25:51 | 39.75 | -105.0 |
| 4 | 0 | 2010-10-14 00:21:47 | 39.75 | -105.0 |


### Clustering to find groups

* We identify unique locations by concatenating the rounded latitude and longitude values. These concatenated location identifiers help us cluster the users who have checked in at similar locations.

* Then we create a 2-D matrix of user-Ids and these location identifiers. The cells in the matrix contain the number of times the user has checked-in at that specific location.

* We aim to cluster the users based on this check-in matrix. Users that have similar check-ins and similar number of check-ins at different locations can be assumed to be a part of a group.

* We have used time stamps to find simultaneous check-ins and to visualise the data after the groups have been created.
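
The user-by-location count matrix described above can be sketched on a few made-up check-ins. This is an illustration using `pd.crosstab`, not the notebook's own loop; the toy values and the `'_'` separator are assumptions. The separator is there because plain string concatenation can make two different coordinate pairs collide (for example `1.5` and `12.3` versus `1.51` and `2.3`).

```python
import pandas as pd

# Toy check-ins (made-up values); the columns mirror the notebook's DataFrame
toy = pd.DataFrame({
    'id':        [0, 0, 1, 1, 1, 2],
    'latitude':  [39.75, 39.75, 39.75, 39.89, 39.89, 51.51],
    'longitude': [-104.99, -104.99, -104.99, -105.07, -105.07, -0.13],
})

# Join the rounded coordinates with an explicit separator so that two different
# (latitude, longitude) pairs can never produce the same identifier
toy['loc_id'] = toy['latitude'].astype(str) + '_' + toy['longitude'].astype(str)

# Rows are users, columns are locations, cells are check-in counts
user_loc = pd.crosstab(toy['id'], toy['loc_id'])
print(user_loc)
```

Users 0 and 1 share a location in this toy data, which is the kind of overlap the clustering step relies on.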

In [77]:
unique_locations = {}
id_visits = {}
for i in range(713652):
    loc_id = str(df['latitude'][i])+str(df['longitude'][i])
    unique_locations[loc_id] = True

* 43028 unique locations, each identified by the concatenation of its latitude and longitude values, were found.
* These are all the locations that any user has visited; realistically, each user visits only a small number of them.
* Our feature matrix is therefore sparse, and KMeans from sklearn accepts sparse matrices as input.
* We now create the feature matrix: one row for each of the 2000 users, with 43028 features (one per location) in each row.
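
To make the sparsity claim concrete, the arithmetic below bounds how many cells of the feature matrix can be non-zero, using only the counts quoted in this notebook. It is a back-of-the-envelope sketch; the variable names are illustrative.

```python
n_users = 2000
n_locations = 43028
n_checkins = 713652          # rows read from the check-in file

cells = n_users * n_locations
dense_bytes = cells * 8      # np.zeros defaults to float64, 8 bytes per cell

# Each check-in increments exactly one cell, so at most n_checkins cells are non-zero
max_density = n_checkins / cells

print(cells)                         # 86056000
print(round(dense_bytes / 1e6))      # 688 (megabytes for the dense matrix)
print(round(100 * max_density, 2))   # 0.83 (upper bound on non-zero cells, in percent)
```

Since repeated check-ins at the same location fall into the same cell, the actual share of non-zero cells is lower than this bound.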

In [78]:
print (len(unique_locations))
unique_locations = list(unique_locations)

43028


In [79]:
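users = np.zeros(shape=(2000,43028))
# Map each location identifier to its column once; list.index() would rescan the list for every row
loc_to_index = {loc: k for k, loc in enumerate(unique_locations)}
for i in range(713652):
    userId = df['id'][i]
    locIndex = loc_to_index[str(df['latitude'][i])+str(df['longitude'][i])]
    users[userId][locIndex]+=1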

### KMeans

* We used the KMeans clustering algorithm to find groups in the data. KMeans is an unsupervised learning algorithm that partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.

* After analysing the nature of the data and the probable number of groups that could be identified, we ran the algorithm on our user-checkin matrix with different numbers of clusters. After many test runs, we settled on a cluster count of 1000 as a reasonable estimate.

* Below, the kmeans calculation has been commented out, since it takes around 1 hour to run. Instead, we have saved the result of the kmeans run (the cluster labels) and pickled it. We have been loading (unpickling) this same result to analyse and visualise data beyond this point.

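* If the grader wishes to run the kmeans code, please uncomment the two lines in the cell below, and comment out the two lines that load the pickled result in the 'Separating Groups' cell. Note that a fresh run could produce visualisations slightly different from ours, because KMeans starts from randomly chosen initial centroids.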
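
The save-once, load-afterwards pattern described above can be sketched as a small helper. This is a minimal illustration, not the notebook's code: `load_or_compute` and `slow_clustering` are hypothetical names, and the stand-in function returns made-up labels instead of running KMeans.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled result at path, computing and saving it first if it is missing."""
    if os.path.exists(path):
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, 'wb') as handle:
        pickle.dump(result, handle)
    return result

calls = []

def slow_clustering():
    # Stand-in for the hour-long KMeans run; returns made-up cluster labels
    calls.append(1)
    return [0, 1, 1, 0]

cache_path = os.path.join(tempfile.mkdtemp(), 'kmeans_result.pickle')
first = load_or_compute(cache_path, slow_clustering)   # computes and saves
second = load_or_compute(cache_path, slow_clustering)  # loads from disk
print(first == second, len(calls))
```

With a helper like this, switching between a fresh run and the cached labels would come down to deleting the pickle file rather than commenting lines in and out.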

In [80]:
# kmeans = KMeans(n_clusters=1000, n_jobs=-3, n_init=10).fit(users) #Uncomment these two lines if running Kmeans
# kmeans_groups = kmeans.labels_

In [81]:
#with open('Data/kmeans_result.pickle', 'wb') as handle:
    #pickle.dump(kmeans.labels_, handle)

In [82]:
# Separating Groups
with open('Data/kmeans_result.pickle', 'rb') as handle: #Comment these two lines if running kmeans fresh.
    kmeans_groups = pickle.load(handle)
groups = {}
for i in range(len(kmeans_groups)):
    if kmeans_groups[i] in groups:
        groups[kmeans_groups[i]].append(i)
    else:
        groups[kmeans_groups[i]] = [i]

#### Looking into large groups
* We look for groups with more than 100 members and cluster each of them once more, in order to even out the cluster sizes and get more meaningful groups.
* We run kmeans again on each such cluster, using the corresponding rows of the feature matrix, and add the newly found groups to the initial groups dictionary.
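
One detail that is easy to get wrong in this step is that the labels returned by the second clustering are indexed by position within the large group, not by user id. The sketch below shows the bookkeeping on made-up data; `split_large_group` is a hypothetical helper and the sub-cluster labels are supplied by hand instead of coming from KMeans.

```python
def split_large_group(groups, group_id, sub_labels):
    """Replace one group by its sub-clusters, keeping the original user ids.

    groups:     dict mapping group id -> list of user ids
    group_id:   id of the large group to split
    sub_labels: sub-cluster label of each member, in the same order as groups[group_id]
    """
    members = groups.pop(group_id)
    offset = max(groups, default=-1) + 1   # new ids start past every remaining key
    for member, label in zip(members, sub_labels):
        groups.setdefault(offset + label, []).append(member)
    return groups

# Made-up example: group 1 is "large" and a second clustering labelled its members 0 or 1
example_groups = {0: [7], 1: [3, 5, 8, 9], 2: [4]}
split_large_group(example_groups, 1, [0, 1, 0, 1])
print(example_groups)
```

The new groups contain the user ids 3, 5, 8 and 9, not the positions 0 to 3 of those users inside the large group.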

In [83]:
#Looking into large clusters
toPop = []
kmeans2_groups = []   # one (members, sub-cluster labels) pair per large group
for gID, members in groups.items():
    if len(members) > 100:
        print('Group ID: ', gID, ' Group Size: ', len(members))
        largeGroup = []
        toPop.append(gID)
        for member in members:
            largeGroup.append(users[member])
        kmeans2 = KMeans(n_clusters=100, n_jobs=-3, n_init=5).fit(largeGroup)
        # Keep the member list with the labels: label j belongs to user members[j]
        kmeans2_groups.append((members, kmeans2.labels_))

for members, labels in kmeans2_groups:
    # Start the new group IDs past every existing key so they cannot collide
    offset = max(groups) + 1
    for i in range(len(labels)):
        # Store the original user id, not the position inside the large group
        groups.setdefault(labels[i]+offset, []).append(members[i])
for x in toPop:
    _=groups.pop(x)

Group ID:  156  Group Size:  887


* We now have groups that should represent people checking in together at different locations.
* We will now create some dictionaries representing different statistics like number of groups and their sizes, number of groups and check in frequencies, including group and friend checkins.
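
The group-versus-friend check-in statistic can be sketched as follows. This assumes, as the cell below does, that two members were "together" when they checked in at the same rounded location on the same date; `count_presence` and the toy check-ins are illustrative. The first counter is the number of distinct (date, location) pairs for the group, and the second is the number of those pairs shared by at least two members.

```python
from collections import defaultdict
from datetime import date

def count_presence(checkins):
    """checkins: iterable of (user_id, date, latitude, longitude) for one group.

    Returns [distinct (date, location) pairs, pairs shared by at least two members].
    """
    present = defaultdict(set)
    for user_id, day, lat, lon in checkins:
        present[(day, lat, lon)].add(user_id)
    shared = sum(1 for users_here in present.values() if len(users_here) > 1)
    return [len(present), shared]

# Made-up check-ins for a two-member group
toy_checkins = [
    (10, date(2010, 10, 16), 39.89, -105.07),
    (11, date(2010, 10, 16), 39.89, -105.07),  # same day and place as user 10
    (10, date(2010, 10, 17), 39.75, -104.99),
    (10, date(2010, 10, 17), 39.75, -104.99),  # repeat visit by the same user
]
print(count_presence(toy_checkins))
```

A repeat visit by the same user on the same day does not count as a joint check-in, because membership is tracked per user.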

In [84]:
numG_vs_gSize = {}
numG_vs_numCheckins = {}
g_vs_checkins = {}
for group,members in groups.items():
    gSize = len(members)
    if gSize in numG_vs_gSize:
        numG_vs_gSize[gSize]+=1
    else:
        numG_vs_gSize[gSize] = 1
        
    numCheckins = 0
    presenceDict = {}
    for member in members:
        memberDF = df[df['id'] == member]
        numCheckins+=len(memberDF)
        if gSize > 1:
            for j in range(len(memberDF)):
                presenceID = str(memberDF.iloc[j,1].date()) + '_' + str(memberDF.iloc[j,2]) + '_' + str(memberDF.iloc[j,3])
                if presenceID in presenceDict:
                    if member in presenceDict[presenceID]:
                        pass
                    else:
                        presenceDict[presenceID].append(member)
                else:
                    presenceDict[presenceID] = [member]
     
    if gSize > 1:
        g_vs_checkins[group] = [0,0]
        for key,val in presenceDict.items():
            g_vs_checkins[group][0]+=1
            if len(val) > 1:
                g_vs_checkins[group][1]+=1
        
    if numCheckins in numG_vs_numCheckins:
        numG_vs_numCheckins[numCheckins]+=1
    else:
        numG_vs_numCheckins[numCheckins] = 1
        


* We also populate checkins for days and months to find check in patterns in groups over time.

In [85]:
numCheckins_vs_day = {0:0,1:0,2:0,3:0,4:0,5:0,6:0}
numCheckins_vs_month={1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0}
for i in range(len(df)):
    numCheckins_vs_day[df.iloc[i,1].weekday()]+=1
    numCheckins_vs_month[df.iloc[i,1].month]+=1
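
The row-by-row tally above can also be written without an explicit loop. The sketch below is an alternative using the pandas `.dt` accessor on a handful of made-up timestamps standing in for `df['time']`; it is not the notebook's code.

```python
import pandas as pd

# Made-up timestamps standing in for df['time']
times = pd.Series(pd.to_datetime([
    '2010-10-17 01:48:53',  # Sunday
    '2010-10-16 06:02:04',  # Saturday
    '2010-10-16 03:48:54',  # Saturday
    '2010-10-14 18:25:51',  # Thursday
]))

# weekday: Monday=0 ... Sunday=6, the same convention as datetime.weekday()
by_day = times.dt.weekday.value_counts().reindex(range(7), fill_value=0)
by_month = times.dt.month.value_counts().reindex(range(1, 13), fill_value=0)

print(by_day.to_dict())
print(by_month.to_dict())
```

The `reindex` calls keep days and months with no check-ins in the result, so the output has the same keys as the dictionaries built by the loop.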


### Figure 1: Visualising Check-in Locations
1. We plot the locations in latitude and longitude of different clusters grouped together by their group size.
2. We hope to gather information about different groups and their movements, frequented locations, etc, by visualising the check in locations.
3. In this visualisation we have provided a hover tool that shows the latitude and longitude of a point, and a select widget that can be used to change the group size.
4. The points in red are all the locations visited by individuals in all groups, and those highlighted in yellow are the ones visited by the currently selected groups.

In [86]:
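#Figure 1: Plotting Locations visited by Groups of a particular size
fig1_dict={}
locations=[]
seen=set()   # set lookup avoids rescanning the list for every row
for i in range(len(df)):
    # Round to whole degrees so the plot shows a coarse world map
    location=(round(df.iloc[i,2]),round(df.iloc[i,3]))
    if(location not in seen):
        seen.add(location)
        locations.append(location)    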
    
lats=[]
longs=[]
for lat,long in locations:
    lats.append(lat)
    longs.append(long)
fig1_dict["lats"]=lats
fig1_dict["longs"]=longs

In [87]:
groupIdToLocations={}
groupIdToMemberCount={}
for group,members in groups.items():
    if(len(members)<=1):
        continue
    groupIdToMemberCount[group]=len(members)
    group_locations=[]
    for member in members:
        memberDF = df[df['id'] == member]
        for j in range(len(memberDF)):
            location=(round(memberDF.iloc[j,2]),round(memberDF.iloc[j,3]))
            if(location not in group_locations):
                group_locations.append(location)
    
    groupIdToLocations[group]=group_locations    
            
        

In [88]:
groupSizeToLocations={}
for groupId in groupIdToLocations:
    locations=[]
    if(groupIdToMemberCount[groupId] in groupSizeToLocations):
        locations=groupSizeToLocations[groupIdToMemberCount[groupId]]
    
    for location in groupIdToLocations[groupId]:
        if(location not in locations):
            locations.append(location)
    groupSizeToLocations[groupIdToMemberCount[groupId]]=locations

for groupSize in groupSizeToLocations:
    lats=[]
    longs=[]
    for location in groupSizeToLocations[groupSize]:
        lats.append(location[0])
        longs.append(location[1])
    fig1_dict["groupSize_"+str(groupSize)+"_lats"]=lats
    fig1_dict["groupSize_"+str(groupSize)+"_longs"]=longs
    
currentGroupSize=2
fig1_dict["curr_lat"]=fig1_dict["groupSize_2_lats"]
fig1_dict["curr_long"]=fig1_dict["groupSize_2_longs"]

fig1_source=ColumnDataSource(data=fig1_dict)

In [89]:
fig1_menu=[]
for groupSize in sorted(groupSizeToLocations):
    fig1_menu.append((str(groupSize),str(groupSize)))

fig1_dd=Select(title="Choose Group Size",value="2", options=fig1_menu,width=150)

fig1=figure(title='Locations visited by groups of a particular size', plot_height = 400,plot_width=600)
fig1.circle(fig1_dict["longs"],fig1_dict["lats"],size=2, color="#E74C3C")
fig1_c=fig1.circle(x="curr_long",y="curr_lat",size=2, color="#FFFF00",source=fig1_source)
fig1.add_tools(HoverTool(tooltips=[
    ("Longitude: ", "$x"),
    ("Latitude:", "$y")
],renderers=[fig1_c]))
update_curve = CustomJS(args=dict(source=fig1_source,fig1_dd=fig1_dd), code="""

    groupSize=fig1_dd.value
    source.data['curr_lat']=source.data['groupSize_'+groupSize+'_lats']
    source.data['curr_long']=source.data['groupSize_'+groupSize+'_longs']
    source.trigger('change');


""")

fig1_dd.js_on_change('value', update_curve)

fig1.background_fill_color = "#3498DB"
fig1.background_fill_alpha = 0.4
fig1.xaxis.axis_label = 'Longitude'
fig1.yaxis.axis_label_text_font='times'
fig1.yaxis.axis_label = 'Latitude'
fig1.axis.major_label_text_color = "black"
fig1.ygrid.grid_line_alpha = 0.8
fig1.ygrid.grid_line_dash = [5, 3]
fig1.xgrid.grid_line_alpha = 0.8
fig1.xgrid.grid_line_dash = [5, 3]


fig1.legend.location = "top_right"
fig1.legend.click_policy="hide"
show(row(fig1,fig1_dd))


### Figure 1: Observations
Some patterns we have observed in the plot above, which approximates the world map, are:
1. Groups of size 2 are probably couples who travel far to the east, to Europe, to islands, and to various other locations around the world.
2. Groups of size 3 are probably families that travel to coastal places and beaches.
3. Groups of size 4 are concentrated in an area in the US.
4. Similarly, groups of other sizes show various patterns; the general trend is that larger groups have more varied location preferences, probably because they span several different friend circles.

### Figure 2: Checkins by day of Week
* We have plotted the number of total checkins by groups and individuals by day of week
* The trend is pretty clear, as we can see that the checkins increase on Friday, Saturday, and Sunday as people go out with friends and get a chance to use the check-in services at different locations.
* The maximum checkins are on Saturdays.
* This plot and the next help us describe the activity patterns of the groups we have identified.

In [90]:
#Figure 2: Plotting Check-in Frequency by Day of week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekCheckins = figure(title='Number of Check-ins by day of week', x_range=days, plot_height = 400)
weekCheckins.line(days, [numCheckins_vs_day[x] for x in range(7)],color="#2ECC71")
fig2_c=weekCheckins.circle(days, [numCheckins_vs_day[x] for x in range(7)],size=12, fill_color="white", line_color="#85C1E9", line_width=3)
fig2_c2=weekCheckins.circle(days, [numCheckins_vs_day[x] for x in range(7)],size=3, color="white")

weekCheckins.add_tools(HoverTool(tooltips=[
    ("Day: ", "$x"),
    ("Check-in Count: ", "$y")
],renderers=[fig2_c2]))


weekCheckins.axis.major_label_text_color = "black"
weekCheckins.ygrid.grid_line_alpha = 0.8
weekCheckins.ygrid.grid_line_dash = [5, 3]
weekCheckins.xgrid.grid_line_alpha = 0.8
weekCheckins.xgrid.grid_line_dash = [5, 3]
weekCheckins.xaxis.axis_label = 'Day of the Week'
weekCheckins.yaxis.axis_label = 'Total Check-ins'
show(weekCheckins)

### Figure 3: Checkins by Month of the year
* We have plotted the number of checkins by all users over the months of the year.
* May, June, and July have the maximum number of checkins, since these are summer months and people travel much more with families and friends.
* December, January, and February are the winter months, and people tend to stay home during these months.

In [91]:
#Figure 3: Plotting Check-in Frequency by Month of year
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul','Aug','Sep','Oct','Nov','Dec']
monthCheckins = figure(title='Number of checkins by Month of Year', x_range=months, plot_height = 400)
monthCheckins.line(months, [numCheckins_vs_month[x] for x in range(1,13)],color="#2ECC71")
fig3_c=monthCheckins.circle(months, [numCheckins_vs_month[x] for x in range(1,13)],size=12, fill_color="white", line_color="#85C1E9", line_width=3)
fig3_c2=monthCheckins.circle(months, [numCheckins_vs_month[x] for x in range(1,13)],size=3, color="white")

monthCheckins.add_tools(HoverTool(tooltips=[
    ("Month: ", "$x"),
    ("Check-in Count: ", "$y")
],renderers=[fig3_c2]))

monthCheckins.axis.major_label_text_color = "black"
monthCheckins.ygrid.grid_line_alpha = 0.8
monthCheckins.ygrid.grid_line_dash = [5, 3]
monthCheckins.xgrid.grid_line_alpha = 0.8
monthCheckins.xgrid.grid_line_dash = [5, 3]
monthCheckins.xaxis.axis_label = 'Month'
monthCheckins.yaxis.axis_label = 'Total Check-ins'
show(monthCheckins)

### Figure 4: Groups and groups sizes
* We have plotted the number of groups by their sizes, to identify the distribution we are dealing with.
* We see that groups of size 1 (individuals) are the most common. The data comes from an app used all over the world, and many users may not be physically close enough to their friends to travel with them to different locations.
* There are 16 groups of size 2. Beyond that, the number of groups generally decreases as the group size increases, which is expected.
* We would get a much smoother distribution if we had included the whole data set in our analysis, but the kmeans step is a limiting factor, as it already takes around an hour to run on a fifth of the data.
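
The size distribution discussed above is a count of groups per group size. A minimal sketch with `collections.Counter` on a made-up grouping (the dictionary below is illustrative, not the notebook's result):

```python
from collections import Counter

# Made-up grouping: group id -> member ids
sample_groups = {0: [1], 1: [2, 3], 2: [4], 3: [5, 6], 4: [7, 8, 9]}

# Number of groups for each group size
size_counts = Counter(len(members) for members in sample_groups.values())
for size in sorted(size_counts):
    print(size, size_counts[size])
```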

In [92]:
#Figure 4: Number of Groups of a particular size
gsize_numG = list(numG_vs_gSize)
gsize_numG.sort()
numCheckins_numG = list(numG_vs_numCheckins)
numCheckins_numG.sort()
hover_numGPlot1 = HoverTool(
        tooltips=[
            ('Group Size', '$x{int}'),
            ('Number of Groups', '$y{int}')
        ],
        names=['circles']
    )
numGPlot1 = figure(plot_height = 450, plot_width = 700,title='Number of groups per group size', x_axis_type='log', y_axis_type='log', tools = [hover_numGPlot1, 'pan', 'wheel_zoom', 'reset','previewsave'])
numGPlot1.circle(gsize_numG, [numG_vs_gSize[x] for x in gsize_numG], name='circles',size=3, fill_color="white")
numGPlot1.circle(gsize_numG, [numG_vs_gSize[x] for x in gsize_numG], size=9, fill_color="white", line_color="#3498DB", line_width=2)
numGPlot1.line(gsize_numG, [numG_vs_gSize[x] for x in gsize_numG],color="#2ECC71")


numGPlot1.axis.major_label_text_color = "black"
numGPlot1.ygrid.grid_line_alpha = 0.8
numGPlot1.ygrid.grid_line_dash = [5, 3]
numGPlot1.xgrid.grid_line_alpha = 0.8
numGPlot1.xgrid.grid_line_dash = [5, 3]
numGPlot1.xaxis.axis_label = 'Group Size'
numGPlot1.yaxis.axis_label = 'Number of Groups'
show(numGPlot1)

### Figure 5: Comparison of Checkins - Individuals and in groups
* This visualisation compares, for all groups, how often members check in at locations overall and how often they do so together with friends.
* The x-axis is the group number (groups are ordered by total check-ins), and the y-axis is the number of check-ins by that group.
* The top line represents total checkins, including individual check ins of group members, while the lower line represents only those checkins where at least 2 members from the group checked in at the same location on the same day.
* The size of the circle for each group represents the group size.
* We notice that for most of the groups, irrespective of the total checkins made by the members, the checkins with friends remain a small percentage and do not seem to increase with either the size of the group or the number of total checkins by the members.
* This tells us that the size of the group is probably not the best estimate of the size of a friend circle: not all members of a group seem to know each other, because if they did, the percentage of checkins made together would be much higher than it is.
* This also says that people travel to places alone much more frequently, and then tell their friends about it. This may mean that they will travel there again, but not always, since the

In [100]:
#Figure 5: Comparing Friend check-ins and total check-ins of groups
list1 = []
sizes = []
for key,value in g_vs_checkins.items():
    list1.append(value)
    sizes.append(len(groups[key]))
sizes2 = sizes
sizes = np.array(sizes) + 2
list2 = [x[1] for x in list1]
list1 = [x[0] for x in list1]
list1, list2, sizes, sizes2 = (list(x) for x in zip(*sorted(zip(list1, list2, sizes, sizes2))))
index = np.argmax(sizes)
sizes[index]  = 36
hoverPlot3 = HoverTool(
        tooltips=[
            ('Group Number', '$x{int}'),
            ('Check-ins', '$y{int}'),
            ('Group Size', '@sizeToShow')
        ], names=['abc']
    )
sourcePlot3 = ColumnDataSource(data=dict(x = [i for i in range(len(list1))], y1 = list1, y2 = list2, sizeToPlot = sizes, sizeToShow = sizes2))
plot3 = figure(title='Comparing Friend check-ins and total check-ins of groups',tools=[hoverPlot3, 'reset', 'wheel_zoom'], y_axis_type='log', plot_width=700)
plot3.line('x','y1', source = sourcePlot3, color = '#2874A6', legend='Total Check-ins')
plot3.line('x','y2', source= sourcePlot3, color='green', legend='Friend Check-ins')
plot3.circle('x','y1', source= sourcePlot3, color="white", line_color='#2874A6', size=3, name='abc')
plot3.circle('x','y1', source= sourcePlot3, fill_color = '#85C1E9', line_color='#2874A6', size='sizeToPlot')

plot3.circle('x','y2', source=sourcePlot3, fill_color = 'green', line_color = 'green', size = 8)
plot3.legend.location='top_left'

plot3.axis.major_label_text_color = "black"
plot3.ygrid.grid_line_alpha = 0.8
plot3.ygrid.grid_line_dash = [5, 3]
plot3.xgrid.grid_line_alpha = 0.8
plot3.xgrid.grid_line_dash = [5, 3]
plot3.xaxis.axis_label = 'Group Number'
plot3.yaxis.axis_label = 'Number of Check-ins'
show(plot3)