# VAST Challenge 2015

We will look at the location dataset from the [VAST Challenge 2015](http://vacommunity.org/2015+VAST+Challenge%3A+MC1).

This initial exploration uses the following tools:

- [IPython notebook](http://ipython.org/)
- [Pandas](http://pandas.pydata.org/)
- [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) 
- [numpy](http://www.numpy.org/)

## 0. Setup environment

In [1]:
import pandas as pd
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
sns.set_style("darkgrid")

## 1. Read the data

In [2]:
df = pd.read_csv("park-movement-Sun.csv")



Let's look at the first five rows

In [3]:
df.head()

Unnamed: 0,Timestamp,id,type,X,Y
0,2014-6-08 08:00:11,1923259,check-in,0,67
1,2014-6-08 08:00:11,39012,check-in,0,67
2,2014-6-08 08:00:11,613364,check-in,0,67
3,2014-6-08 08:00:14,100951,check-in,99,77
4,2014-6-08 08:00:14,1959069,check-in,99,77


What is the size of the table?

In [4]:
df.shape

(10932426, 5)

What are the types of the data?

In [5]:
df.dtypes

Timestamp    object
id           object
type         object
X            object
Y            object
dtype: object

What are the values of *type*?

In [6]:
df["type"].unique()

array(['check-in', 'movement', nan, 'type'], dtype=object)

In [7]:
df.groupby("type")["type"].count()

type
check-in      130658
movement    10801766
type               1
Name: type, dtype: int64
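
A single row whose *type* is literally `'type'`, together with the all-`object` dtypes above, suggests that a stray header line (and at least one blank line) ended up inside the CSV. That is an inference from the outputs, not something the file documentation states. A minimal sketch of one way to clean such rows, using made-up toy values rather than the real file:

```python
import pandas as pd

# Toy frame mimicking the symptoms above: a repeated header row and a blank row.
# These values are hypothetical and only illustrate the cleaning steps.
raw = pd.DataFrame({
    "Timestamp": ["2014-6-08 08:00:11", "Timestamp", None, "2014-6-08 08:00:14"],
    "id": ["1923259", "id", None, "100951"],
    "type": ["check-in", "type", None, "movement"],
    "X": ["0", "X", None, "99"],
    "Y": ["67", "Y", None, "77"],
})

# Keep only rows whose type is one of the two expected values.
clean = raw.loc[raw["type"].isin(["check-in", "movement"])].copy()

# Convert the numeric columns; anything unparseable becomes NaN instead of raising.
for col in ["id", "X", "Y"]:
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

print(clean.dtypes)
```

Applied to `df`, the same two steps would make `X`, `Y` and `id` numeric, which the plots and range checks later on rely on.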

How many different ids are there?

In [8]:
df["id"].unique().shape

(8217,)

In [9]:
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).head()

id,check-in,movement,type
436,12,1385,
878,26,1870,
941,23,1287,
1197,24,1831,
1217,15,1111,


In [10]:
pd.pivot_table(df,columns="type", values="X", index="id", aggfunc=len).mean()

type
check-in      17.079477
movement    1314.883262
type           1.000000
dtype: float64

What is the type of the timestamps?

In [11]:
type(df.Timestamp[0])

str

They are strings; it would be better if they were dates. Let's fix that
with the [to_datetime](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html?highlight=to_datetime#pandas.to_datetime) function

In [12]:
df["time"] = pd.to_datetime(df.Timestamp, format="%Y-%m-%d %H:%M:%S")

ValueError: time data '2014-6-08 08:00:11' does match format specified
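
The error does not say why the format is rejected. Two plausible causes, both assumptions, are the month not being zero-padded (`2014-6-08`) and rows that are not timestamps at all. A minimal sketch that guards against both, shown on hypothetical toy values:

```python
import pandas as pd

# Toy values (hypothetical): two timestamps in the file's style,
# a stray header string and a missing value.
raw = pd.Series(["2014-6-08 08:00:11", "2014-6-08 08:00:14", "Timestamp", None])

# Zero-pad a single-digit month so every value has the same fixed-width layout.
padded = raw.str.replace(r"^(\d{4})-(\d)-", r"\1-0\2-", regex=True)

# errors="coerce" turns anything that still does not match into NaT instead of raising.
parsed = pd.to_datetime(padded, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
```

On the real table the same idea would read `df["time"] = pd.to_datetime(padded, ...)` with `padded` built from `df["Timestamp"]`; rows that end up as `NaT` are worth inspecting before dropping them.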

In [None]:
df.tail()

In [None]:
df.dtypes

Now the *time* column contains datetime objects

First, take a random subsample to speed up exploration

In [None]:
df_small = df.sample(10000)

## 2. Looking at location data

In [None]:
df_small.shape

We will now create a simple scatter plot with all the X and Y values in our subsample

In [None]:
df_small.plot("X","Y","scatter")

It looks very similar to the paths on the map

Now let's look at just the *check-in* samples

In [None]:
df_small.loc[df_small["type"]=="check-in"].plot("X","Y","scatter")

Let's look at the range of the location data

In [None]:
df["X"].min()

In [None]:
df["X"].max()

In [None]:
df["Y"].min()

In [None]:
df["Y"].max()
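
The four calls above can also be collapsed into one small table. A minimal sketch on hypothetical toy coordinates (in the notebook the input would be `df[["X", "Y"]]`, assuming both columns are numeric):

```python
import pandas as pd

# Toy coordinates (hypothetical values).
coords = pd.DataFrame({"X": [0, 99, 50, 73], "Y": [67, 77, 12, 99]})

# One call returns a table with the min and max of each column.
ranges = coords.agg(["min", "max"])
print(ranges)
```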

Now let's create a 2D histogram to see which locations are more popular. We will use the [hist2d](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist2d) function

In [None]:
# `normed` was removed from recent matplotlib releases; `density` is the replacement
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0, 100), (0, 100)), density=True)

We can increase the number of bins

In [None]:
# `normed` was removed from recent matplotlib releases; `density` is the replacement
cnts, xe, ye, img = plt.hist2d(df_small["X"], df_small["Y"], range=((0, 100), (0, 100)), density=True, bins=20)

In [None]:
df_small.plot("X","Y","hexbin")
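
The plots show where the dense areas are, but not their coordinates. A minimal sketch of reading the busiest cell out of the same kind of 2D histogram with `np.histogram2d`, on hypothetical toy points:

```python
import numpy as np

# Toy coordinates (hypothetical): three points cluster near (10, 10), one sits elsewhere.
x = np.array([10, 11, 12, 80])
y = np.array([10, 12, 11, 90])

# Same binning idea as hist2d: 20 bins per axis over the 0-100 range.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20, range=((0, 100), (0, 100)))

# Locate the fullest cell and report its lower-left corner.
ix, iy = np.unravel_index(np.argmax(counts), counts.shape)
busiest_corner = (x_edges[ix], y_edges[iy])
print(busiest_corner, counts[ix, iy])
```

With the notebook's data, `x` and `y` would be `df_small["X"]` and `df_small["Y"]`, again assuming they are numeric.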

## 3. Single guest

Now let's plot the locations for a single random person

In [None]:
guest_id = np.random.choice(df["id"])

In [None]:
guest_df = df.loc[df["id"]==guest_id]

In [None]:
guest_df.shape

In [None]:
guest_df.plot("X","Y","scatter")

Now let's try to use the time information

In [None]:
plt.scatter(guest_df["X"],guest_df["Y"],c=guest_df["time"])

At what time did he arrive?

In [None]:
guest_df["time"].min()

At what time did he leave?

In [None]:
guest_df["time"].max()

So how long did he stay?

In [None]:
guest_df["time"].max() - guest_df["time"].min()

## 4. Single time frame

Where were the guests between 12:00 and 12:05?

In [None]:
# The file covers Sunday 2014-06-08, so the filter must use that date
noon_dates = (df["time"] < '2014-06-08 12:05:00') & (df["time"] >= '2014-06-08 12:00:00')

In [None]:
noon_df = df.loc[noon_dates]

In [None]:
noon_df.shape

In [None]:
plt.scatter(noon_df["X"], noon_df["Y"], alpha=0.01, marker="o", s=30)

Let's add some jitter

In [None]:
plt.scatter(noon_df["X"] + 5 * np.random.random(len(noon_df)),
            noon_df["Y"] + 5 * np.random.random(len(noon_df)),
            alpha=0.01, marker="o", s=30)
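
Note that `np.random.random` draws from [0, 1), so the jitter above only moves points up and to the right. A minimal sketch of a centered variant; the helper name `jitter` and its `width` parameter are hypothetical, introduced only for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

def jitter(values, width=5.0):
    """Add uniform noise in [-width/2, width/2) so points spread without a net shift."""
    values = np.asarray(values, dtype=float)
    return values + rng.uniform(-width / 2, width / 2, size=values.shape)

# Toy coordinates (hypothetical): three guests on the same spot, one elsewhere.
xs = jitter([0, 0, 0, 99])
print(xs)
```

In the scatter call this would become `plt.scatter(jitter(noon_df["X"]), jitter(noon_df["Y"]), ...)`.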

## 5. Time analysis

Now let's try to ask some simple questions about the time data

- At what time do guests arrive?
- At what time do they leave?
- How long do they stay?
- How does park occupancy vary during the day?

To answer the first questions we need to transform the data

In [None]:
grouped_times = df.groupby("id")["time"]

In [None]:
arrivals = grouped_times.min()

In [None]:
departures = grouped_times.max()

In [None]:
duration = departures - arrivals
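
The same three quantities can be kept together in one table per guest, which makes later filtering easier. A minimal sketch on a hypothetical toy log (the ids and times are made up):

```python
import pandas as pd

# Toy movement log (hypothetical ids and times).
log = pd.DataFrame({
    "id": [1, 1, 2, 2, 2],
    "time": pd.to_datetime([
        "2014-06-08 08:00:00", "2014-06-08 12:30:00",
        "2014-06-08 09:15:00", "2014-06-08 10:00:00", "2014-06-08 18:45:00",
    ]),
})

# First and last record per guest, side by side.
visits = log.groupby("id")["time"].agg(["min", "max"])
visits.columns = ["arrival", "departure"]

# Length of stay in hours.
visits["hours"] = (visits["departure"] - visits["arrival"]).dt.total_seconds() / 3600
print(visits)
```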

In [None]:
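# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest equivalent
sns.histplot(arrivals.dt.hour + arrivals.dt.minute / 60, kde=True)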

In [None]:
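# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest equivalent
sns.histplot(departures.dt.hour + departures.dt.minute / 60, kde=True)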

In [None]:
# total_seconds() also stays correct if a duration ever exceeds 24 hours
h_duration = duration.dt.total_seconds() / 3600

In [None]:
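# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest equivalent
sns.histplot(h_duration, kde=True)

Now, for the question of park occupancy, we need to group the dataframe by time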

In [None]:
time_groups = df.groupby(df.time.dt.hour)

In [None]:
# Number of distinct guest ids seen in each hour
occupancy = time_groups["id"].nunique()

In [None]:
occupancy.plot()
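
Hourly bins are coarse. A minimal sketch of the same count at 5-minute resolution, on a hypothetical toy log; it counts guests with at least one record in each bin, which is only a proxy for occupancy if every guest is recorded often enough:

```python
import pandas as pd

# Toy movement log (hypothetical ids and times).
log = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "time": pd.to_datetime([
        "2014-06-08 12:00:10", "2014-06-08 12:03:00",
        "2014-06-08 12:04:59", "2014-06-08 12:05:00", "2014-06-08 12:07:30",
    ]),
})

# Floor each timestamp to the start of its 5-minute bin, then count distinct ids per bin.
bins = log["time"].dt.floor("5min")
occupancy_5min = log.groupby(bins)["id"].nunique()
print(occupancy_5min)
```

On the full table, `log` would be `df` and the result can be plotted with `occupancy_5min.plot()` just like the hourly series.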

## Questions

What places did the people who stayed for less than 4 hours visit?

What is the distribution of total traveled distance of park visitors?

What is the mean speed of the park visitors?

Which visitors walked the most?

At what times are *check-in* samples recorded?