<div style="background-color:dodgerblue;height:100px;"><br>
<center><h1><font style="color:white">The Path to Self-Discovery: Find Habits</font></h1></center>
</div>

# Deja Vu

Let's bring our data back!  Before, we loaded our DataFrame from a list of dictionaries.  Now, let's load it from a CSV file!

In [28]:
import pandas as pd
df = pd.read_csv('./MyStructuredSearch.csv')

## Exercise 1

Remember when we cast our Lat, Lon, and Datestamp columns to the right datatypes?  Let's do it again!
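
If you get stuck, here is one possible approach, shown on a few made-up rows rather than the real file. The sample values are hypothetical; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical sample rows standing in for MyStructuredSearch.csv
raw = pd.DataFrame({
    'Datestamp': ['2017-08-31 21:54:24', '2017-08-31 21:54:03', None],
    'Lat': ['38.913383', None, 'n/a'],
    'Lon': ['-77.001811', None, 'n/a'],
})

# errors='coerce' turns unparseable values into NaT / NaN instead of raising
raw['Datestamp'] = pd.to_datetime(raw['Datestamp'], errors='coerce')
raw['Lat'] = pd.to_numeric(raw['Lat'], errors='coerce')
raw['Lon'] = pd.to_numeric(raw['Lon'], errors='coerce')

print(raw.dtypes)
```

The same three conversions should work on `df` once it has been loaded with `pd.read_csv`.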

# The Apply Function

Pandas has a wonderful method called `apply`.  It lets us run a function against every element of a Series, or against every row or column of a DataFrame.

To show us how great it is, let's grab the weekday of each of our dates, so that we can more easily specify and filter by day.

First step: make a function that, given a datetime, will return its weekday. Note that `weekday()` numbers the days from 0 (Monday) to 6 (Sunday).

In [29]:
def get_weekday(val):
    if not pd.isnull(val):
        return val.weekday()
    else:
        return None

Now, we can apply that to our Datestamp column, and make a new column!

In [30]:
df['Weekday'] = df['Datestamp'].apply(get_weekday)

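AttributeError: 'str' object has no attribute 'weekday'

If you see this error, the Datestamp column still holds strings.  Finish Exercise 1 (cast Datestamp to datetimes) and run the cell again.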
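
Once a column really holds datetimes, pandas also offers a vectorized shortcut through the `.dt` accessor. The sketch below compares it with `apply` on a few made-up timestamps (the values are hypothetical):

```python
import pandas as pd

# Hypothetical timestamps; the None becomes NaT after conversion
stamps = pd.to_datetime(pd.Series(['2017-08-31 21:54:24', None, '2017-09-04 08:00:00']))

def get_weekday(val):
    return val.weekday() if not pd.isnull(val) else None

via_apply = stamps.apply(get_weekday)   # calls get_weekday once per element
via_dt = stamps.dt.weekday              # vectorized; missing dates come back as NaN

print(via_dt.tolist())
```

Both give the same weekday for every non-missing date. `apply` remains the more general tool, since it accepts any function.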

Let's better understand what weekdays have the most searches.

In [31]:
df['Weekday'].value_counts()

KeyError: 'Weekday'

And we can plot this to better understand what our data is telling us!

We won't go too detailed into plotting, but we can very easily make some histograms with pandas.

In [32]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline


In [33]:
df['Weekday'].plot(kind='hist')

KeyError: 'Weekday'

It looks like Mondays and Wednesdays, famously the least enjoyable days of the week, are also the days most searches are done.
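
A histogram uses 10 bins by default, so with only seven weekday codes some bins stay empty. As a rough alternative, we can count the searches per weekday and label the days ourselves. The weekday codes below are made up for illustration:

```python
import pandas as pd

# Hypothetical weekday codes (0 = Monday ... 6 = Sunday)
weekday = pd.Series([0, 0, 2, 2, 2, 4, 6])
names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

counts = (
    weekday.value_counts()
    .reindex(range(7), fill_value=0)        # keep days with no searches, in calendar order
    .rename(index=dict(enumerate(names)))   # swap the codes for readable labels
)
print(counts)
# counts.plot(kind='bar') would then draw one bar per weekday
```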

## Exercise 7

What are the top days of the week during which the user searched for directions?
* Make a function that determines whether or not the search is for directions
* Filter for all rows where directions were searched for
* Plot the days of the week and number of times directions were searched each day

## Exercise 8

What are the top hours of the day during which the user is on their phone?

## Exercise 9

Create a column that identifies the website for each Visited activity.

`https://productforums.google.com/forum/` 

would be parsed as 

`productforums.google.com`

Then find the most visited site.
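
As a hint for the parsing step, Python's standard library can already split a URL into its parts. The helper name `get_site` below is made up for this sketch:

```python
from urllib.parse import urlparse

def get_site(url):
    """Return the host part of a URL, or None if the value is not a usable URL."""
    if not isinstance(url, str):
        return None          # e.g. NaN from a missing value
    return urlparse(url).netloc or None

print(get_site('https://productforums.google.com/forum/'))
```

Assuming the URLs live in the `Target` column, something like `df['Target'].apply(get_site)` would build the new column.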

# Location, Location

As our last little neat skill, let's learn about distance!  First, let's calculate the centroid of the points.

In [34]:
avg_lat = df['Lat'].mean(skipna=True)
avg_lon = df['Lon'].mean(skipna=True)
avg_loc = (avg_lat, avg_lon)

Python has a great third-party library called geopy (installable with pip or conda) that we can use to calculate the distance between two points!  For example, let's take the two ends of I-95: Miami, FL and Houlton, ME.

We can use the Vincenty method.  According to the great Wikipedia: "Vincenty's formulae are two related iterative methods used in geodesy to calculate the distance between two points on the surface of a spheroid, developed by Thaddeus Vincenty (1975a)."
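
For intuition, here is a simpler cousin of Vincenty's method: the haversine formula, which treats the Earth as a perfect sphere instead of a spheroid. This is only a sketch for comparison, not what geopy does internally, and the radius constant is an assumption of the spherical model.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of the spherical model

def haversine_km(p1, p2):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# West longitudes are negative
miami = (25.7617, -80.1918)
houlton = (46.135256, -67.781246)
print(haversine_km(miami, houlton))
```

The spherical answer lands within a fraction of a percent of the spheroid one, which is why haversine is a common quick estimate when the last few kilometers do not matter.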

In [35]:
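try:
    from geopy.distance import vincenty
except ImportError:
    # geopy 2.0 removed vincenty; geodesic is its replacement (results may differ in the last digits)
    from geopy.distance import geodesic as vincenty

# (latitude, longitude); longitudes west of Greenwich are negative
miami = (25.7617, -80.1918)
houlton = (46.135256, -67.781246)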

dist = vincenty(miami, houlton)
print(dist.miles)
print(dist.meters)

1563.1414786888304
2515632.360839848


Now let's make a function that calculates a coordinate's distance from the centroid!

In [36]:
def dist_from_centroid(ser):
    if not pd.isnull(ser['Lat']):
        lat = ser['Lat']
        lon = ser['Lon']
        coord = (lat, lon)
        return vincenty(coord, avg_loc).miles
    return None

Now we want to use the apply function, but across multiple columns at once.  Fortunately, we can call apply on a whole DataFrame, not just a column!  We just have to tell pandas which axis to use: by default (axis=0) it passes each column to our function, whereas axis=1 passes each row.

In [37]:
df['Distance From Centroid'] = df.apply(dist_from_centroid, axis=1)
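
To see the difference between the two axes, here is a tiny sketch on a made-up two-row DataFrame (the coordinates are hypothetical):

```python
import pandas as pd

pts = pd.DataFrame({'Lat': [38.9, 34.5], 'Lon': [-77.0, -79.4]})

# axis=0 (the default): the function is called once per column
per_column = pts.apply(lambda col: col.max())

# axis=1: the function is called once per row, and can read several columns
per_row = pts.apply(lambda row: row['Lat'] + row['Lon'], axis=1)

print(per_column)
print(per_row)
```

`per_column` is indexed by column name, while `per_row` has one value per original row, which is the shape we need for a new column.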

## Exercise 10
What was the location furthest from the centroid?  What were the searches made at this location?

The last piece of pandas we are going to explore is the `groupby` method.  It lets us group the data by one or more columns.  Basically, it creates little DataFrames, one per group.  Then we can perform an operation on each of those and combine the results back into a single object, labeled by group!

In [38]:
df.groupby(['Lat', 'Lon']).size().head()

Lat        Lon       
34.543727  -79.249273    1
34.552794  -79.435186    2
34.553374  -79.430387    2
           -79.430364    3
34.553496  -79.430418    2
dtype: int64

To make this more useful, we can sort the data as well (which we can do for Series, DataFrames, etc).

In [39]:
df.groupby(['Lat', 'Lon']).size().sort_values(ascending=False).head()

Lat        Lon       
39.105718  -77.157550    388
38.913474  -77.002429    194
38.913337  -77.001765     25
38.922177  -77.020175     25
38.916431  -76.997355     25
dtype: int64
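
The same split, operate, and combine idea works with other operations too. Here is a small sketch on made-up data (the `City` and `Minutes` columns are hypothetical):

```python
import pandas as pd

visits = pd.DataFrame({
    'City': ['DC', 'DC', 'NC', 'DC', 'NC'],
    'Minutes': [5, 7, 3, 1, 9],
})

sizes = visits.groupby('City').size()               # number of rows in each group
totals = visits.groupby('City')['Minutes'].sum()    # one aggregated value per group

# transform broadcasts each group's result back onto its original rows
visits['City Total'] = visits.groupby('City')['Minutes'].transform('sum')

print(sizes)
print(visits)
```

`size` and `sum` shrink each group down to one row, while `transform` keeps the original shape, which is handy when the group result should sit next to every row.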

# Dig Deep

Using what you have learned, dig into the data and find out some cool stuff!

Share what you learned here!

https://docs.google.com/forms/d/e/1FAIpQLSexLLJ3N64y7GnyXnLcbJxMxPiJOErBjtPorwb4uw8Q7novlw/viewform?usp=sf_link

## Visualizing Data in the Context of Location

We can also learn a lot about someone's habits by looking at where their searches were made.  Let's pop over to a visualization tool to plot our points.

For an easy tool that lets you plot and color on an attribute, you can use https://www.google.com/mymaps.  They already have all this data anyways :)

## Clustering

Now looking at the data, we can clearly see that there are two distinct groups of points: one in the DC area and one in the North Carolina area.  We want to group these together, but how do we do so in an automated way that doesn't require our eyes? 

Clustering!  Clustering groups data points together based on their attributes.  Here we will use the k-means method.  With k-means, we tell the algorithm how many clusters (k) we want, and it searches for k centroids such that each point belongs to its nearest centroid and the total squared distance from the points to their centroids is as small as possible.
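
The assign-and-average loop behind k-means can be sketched in a few lines of NumPy. This is a bare-bones illustration on made-up coordinates, not the scikit-learn implementation; the function names and sample points are hypothetical.

```python
import numpy as np

def nearest_centroid(points, centroids):
    """Return each point's nearest-centroid index and the total squared distance."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), d2.min(axis=1).sum()

def kmeans_sketch(points, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _step in range(n_iter):
        labels, _total = nearest_centroid(points, centroids)
        # Move each centroid to the mean of its points (leave it alone if it has none)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    labels, inertia = nearest_centroid(points, centroids)
    return labels, centroids, inertia

# Hypothetical (Lat, Lon) pairs: three near DC, three in North Carolina
points = np.array([
    [38.91, -77.00], [38.92, -77.02], [39.10, -77.15],
    [34.54, -79.25], [34.55, -79.43], [34.55, -79.44],
])
labels, centroids, inertia = kmeans_sketch(points, k=2)
print(labels, inertia)
```

Like the scikit-learn call used later, this treats latitude and longitude as plain numbers, which is a reasonable shortcut for points that sit in two well-separated regions.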

To do k-means, we will be using the scikit-learn library, which is Python's go-to for machine learning.

In [None]:
from sklearn.cluster import KMeans
import numpy as np

To start, we need to grab just the latitudes and longitudes from our data.  Then, we can simply throw them into the scikit-learn algorithm to calculate our clusters!

In [54]:
X = df[['Lat', 'Lon']].dropna().drop_duplicates()
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
X['Cluster No'] = kmeans.labels_
X['Cluster No'].value_counts()

0    74
2    55
1    10
Name: Cluster No, dtype: int64

Now, let's put our cluster number in as an attribute in our data set and visualize!

In [55]:
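# Merge on the coordinates so every row at a location gets its cluster number
# (index-aligned assignment would label only the first row of each duplicated point).
df = df.merge(X, on=['Lat', 'Lon'], how='left')
df.to_csv('Clustered Points.csv')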
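
One thing to watch when labels are computed on de-duplicated points: pandas aligns on the index when one column is assigned from another object, so rows that share a coordinate with an earlier row can end up without a label. The sketch below shows the difference on made-up data (all values are hypothetical):

```python
import pandas as pd

searches = pd.DataFrame({
    'Lat': [1.0, 1.0, 2.0, None],
    'Lon': [5.0, 5.0, 6.0, None],
    'Target': ['a', 'b', 'c', 'd'],
})

# Unique coordinates, with pretend cluster labels attached
unique_pts = searches[['Lat', 'Lon']].dropna().drop_duplicates().copy()
unique_pts['Cluster No'] = [0, 1]

# Index alignment: only the rows that survived drop_duplicates get a label
by_index = unique_pts['Cluster No'].reindex(searches.index)

# Merging on the coordinates: every row at a labeled location gets a label
by_merge = searches.merge(unique_pts, on=['Lat', 'Lon'], how='left')['Cluster No']

print(by_index.tolist())
print(by_merge.tolist())
```

Rows without coordinates stay unlabeled either way.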

In [56]:
df.head()

| | Activity Type | Datestamp | Lat | Lon | Product | Target | Distance From Centroid | Cluster No |
|---|---|---|---|---|---|---|---|---|
| 0 | Visited | 2017-08-31 21:54:24 | | | Search | https://stackoverflow.com/questions/2299022/se... | | |
| 1 | Visited | 2017-08-31 21:54:06 | | | Search | https://stackoverflow.com/questions/12901979/s... | | |
| 2 | Searched for | 2017-08-31 21:54:03 | 38.913383 | -77.001811 | Search | sql query multiple servers | 7.022226 | 0.0 |
| 3 | Visited | 2017-08-31 21:53:58 | | | Search | https://social.msdn.microsoft.com/forums/sqlse... | | |
| 4 | Searched for | 2017-08-31 21:53:54 | 38.913384 | -77.001803 | Search | sql query multiple databases | 7.022614 | 0.0 |
