# Introduction

In this notebook, I'll carry out an exploratory and predictive analysis of the walk/run dataset curated by [Viktor Malyi](https://www.kaggle.com/vmalyi). The dataset contains 88588 samples of "accelerometer" and "gyroscope" data recorded using an iPhone 5c at a frequency of roughly 5.4 samples per second. Additionally, there is an "activity" column giving the activity type, which will be our label, a "wrist" column indicating the wrist on which the device was worn, and "date", "time", and "username" columns. More information regarding the dataset can be found [here](https://www.kaggle.com/vmalyi/run-or-walk/home).

**Current version only deals with exploratory data analysis. Next version will have predictive analysis.**

**Table of contents feature doesn't work on Kaggle as of now. Please scroll down to the sections.**

## Table of contents

#### [Step 0: Load the dataset](#load)
#### [Step 1: Exploratory Data Analysis](#eda)
 - [1.1: Preliminary analysis](#panda)
 - [1.2: Datetime feature creation](#dt_crt)
 - [1.3: Datetime feature analysis](#dt_sis)
 - [1.4: Wrist & activity analysis](#nom_sis)
 - [1.5: Sensor data analysis 1](#sensor_sis1)
 - [1.6: Sensor data analysis 2](#sensor_sis2)

In [None]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter
import seaborn as sns

##all variables display for cells
#from IPython.core.interactiveshell import InteractiveShell
#InteractiveShell.ast_node_interactivity = "all"

#rendering plots
%matplotlib inline

#set default seaborn theme for all plots
sns.set()



## Step 0: Loading the dataset

In [None]:
df = pd.read_csv("../input/dataset.csv")
print("Dataset shape -> (rows, columns):", df.shape)

## Step 1: Exploratory Data Analysis 
   
Exploratory data analysis is a key part of data science as it gives us an insight into the dataset without making any prior assumptions. Understanding the data we're dealing with can be invaluable during predictive analysis.

### 1.1: Learning more about the dataset using pandas methods

In [None]:
#print first few rows of the dataset
df.head()

In [None]:
#print data types of the columns
df.dtypes

In [None]:
#check statistics of int64, float64 columns
df.describe()

**Observations** on *int* and *float* columns:
 - "count" prints the number of rows excluding null values. As all of the above features have their count values the same as total rows, there are **no null values**.
 - "wrist" and "activity" are **nominal features**.
  - "wrist" refers to the hand on which the device was worn while recording, it can take only two values i.e., 0 for "left" and 1 for "right".
  - "activity" refers to the physical activity being performed during recording, 0 for "walk" and 1 for "run".
  - For binary variables, "mean" is the share of samples with value 1, so it gives valuable information on class balance. Mean values of "wrist" and "activity" are roughly 0.5, indicating the sample collection is not heavily skewed towards one of the values.
 -  The remaining six features are (x,y,z) acceleration & rotation values measured by the device's accelerometer and gyroscope, and they are **ratio features**. 
   - Percentile & mean values provide a decent understanding of the skewness for ratio features. If the mean is closer to the 25th or 75th percentile than to the 50th percentile, that indicates an underlying skewness in the distribution.
   - A quick glance tells us that "acceleration_y" and "acceleration_z" have skewness in their distribution.
   - In the data visualization section, we'll look at the distributions of these features.
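
The mean-versus-percentile check described above can also be expressed numerically. The following is only a minimal sketch on synthetic data (the column names `symmetric` and `right_skewed` are made up for illustration and are not dataset columns); the same calls could be applied to the sensor columns of `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for two sensor columns (NOT the real dataset values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=0.0, scale=1.0, size=5000),
    "right_skewed": rng.exponential(scale=1.0, size=5000),
})

summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),  # the 50th percentile
    "skew": demo.skew(),      # sample skewness; close to 0 for symmetric data
})
# A mean far from the median is a quick hint of skewness
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary.round(3))
```

A clearly positive or negative `skew` value, together with a visible gap between mean and median, points the same way as the percentile comparison in `df.describe()`.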

In [None]:
#check statistics of object columns
df.describe(include=["O"])

**Observations** on *object* columns:
 - Once again, there are no null values. So the complete dataset is devoid of any null values.
 - "username" refers to the different users whose data was collected. For the current dataset, we have only one name, "viktor", so I don't see much value in keeping it.
 - "date" and "time" specify when a particular sample was recorded and are **interval features**. They will come in handy for sorting data, making visualizations, and learning more about the data collection timeline.

### 1.2: Creating Datetime index

Before we do further data analysis, let's create a **datetime object** from the string-formatted "date" and "time" columns, then index and sort the dataframe using that object. We'll also drop the "username" feature from further analysis. 

In [None]:
#date column reformat
df_date_reformat = df["date"].str.split("-", expand=True)

#time column reformat
df_time_reformat = df["time"].str.split(":", expand=True)

#join formatted date and time dataframes
df_date_time_reformat = pd.concat([df_date_reformat, df_time_reformat], axis=1)
df_date_time_reformat.columns = ["year","month", "day", "hour", "minute", "second", "ns"] #rename columns

#create a datetime object
df_date_time_obj = pd.to_datetime(df_date_time_reformat)

#add datetime object to a new dataframe and set it as index
df_sorted = df.copy()
df_sorted["datetime"] = df_date_time_obj
df_sorted.set_index("datetime", inplace=True)
df_sorted.drop(axis=1, columns=["username"], inplace=True) # drop "username" column
print("*** last row timestamp before sorting ***")
print(df_sorted.index[-1])
#sort df_sorted data by "datetime" index
df_sorted.sort_index(inplace=True)
print("*** last row timestamp after sorting ***")
print(df_sorted.index[-1])

Looks like the original dataset was in fact not sorted by datetime.
Newly created dataframe "df_sorted" is sorted by datetime, and column "username" has been dropped.
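
To see the assembly step in isolation, here is a small self-contained sketch. The two rows are hypothetical values written in the layout the cell above assumes (a "-" separated date and a ":" separated time whose last field is nanoseconds); they are not copied from the dataset.

```python
import pandas as pd

# Hypothetical rows in the assumed "date" / "time" string layout (not real data).
demo = pd.DataFrame({
    "date": ["2017-7-1", "2017-6-30"],
    "time": ["9:5:3:120000000", "13:51:15:847724020"],
})

# Split both string columns into their components and line them up side by side
parts = pd.concat(
    [demo["date"].str.split("-", expand=True),
     demo["time"].str.split(":", expand=True)],
    axis=1,
)
parts.columns = ["year", "month", "day", "hour", "minute", "second", "ns"]
parts = parts.astype("int64")  # make the components explicitly numeric

# pandas assembles one timestamp per row from these unit-named columns
stamps = pd.to_datetime(parts)
demo_sorted = demo.set_index(stamps).sort_index()
print(demo_sorted)
```

Converting the components to integers first is a choice made for this sketch; it keeps the assembly independent of how string inputs are coerced.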

In [None]:
print("*** dataframe with datetime index ***")
df_sorted.head()

### 1.3: Analysis of datetime feature

Let's get some more information on when the samples were recorded. 

Time feature analysis can help us draw important insights into activity patterns among different users like hours/days when they are most active.



In [None]:
#quick insight using pandas methods
print("Start time of data recording ->", df_sorted.index.min())
print("End time of data recording ->", df_sorted.index.max())
#count distinct calendar dates; day-of-month values alone could collide across months
print("Number of days of data collection ->", df_sorted.index.normalize().nunique())
print("Days of data collection ->", df_sorted.date.unique())

Data was collected on **12 days** (printed above) between **June 30, 2017** and **July 17, 2017**.
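
One caveat when counting collection days: counting unique day-of-month values only matches the number of calendar dates when no day number repeats across months. That holds for a June 30 to July 17 range, but not in general. A minimal sketch with hypothetical timestamps (not taken from the dataset) shows the difference:

```python
import pandas as pd

# Hypothetical timestamps spanning two months (not taken from the dataset).
idx = pd.DatetimeIndex([
    "2017-06-15 10:00", "2017-06-15 18:30",
    "2017-07-15 09:00", "2017-07-16 09:00",
])

by_day_of_month = idx.day.nunique()           # distinct day numbers: 15 and 16
by_calendar_date = idx.normalize().nunique()  # distinct dates: Jun 15, Jul 15, Jul 16
print(by_day_of_month, by_calendar_date)
```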

In [None]:
#visualization of user activity pattern
f, ax =  plt.subplots(ncols=1, nrows=2, figsize = (14,10))

#sample count vs hour of day
arr_hr = np.unique(df_sorted.index.hour, return_counts = True)
ax[0].bar(arr_hr[0], arr_hr[1])
ax[0].set_title("Recorded sample count for different hours of the day")
ax[0].set_xlabel("Hour")
ax[0].set_ylabel("Number of samples")
ax[0].xaxis.set_major_formatter(FormatStrFormatter('%d:00'))

#sample count vs day of week
arr_day = np.unique(df_sorted.index.dayofweek, return_counts = True)
ax[1].bar(arr_day[0], arr_day[1])
ax[1].set_title("Recorded sample count for different days of the week")
ax[1].set_xlabel("Day")
ax[1].set_ylabel("Number of samples")
ax[1].set_xticks(range(7)) #pandas dayofweek: Monday=0 ... Sunday=6
ax[1].set_xticklabels(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

f.tight_layout()
plt.show()

**Observations** on data collection timeline:
 - Hours of day: Most of the samples were recorded between 2pm and 8pm, with the highest count coming from 6pm. The dip in sample count at 5pm looks out of place and is worth noting.
 - Days of week: Sunday dominates the sample count, which could be due to it being a non-working day. The rest of the days have similar sample counts, except for Wednesday, which has zero. 
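
Weekday counts can also be tabulated directly, which makes days with zero samples explicit instead of leaving a gap in the bar chart. This is a sketch on hypothetical timestamps (not real data); `week_order` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical timestamps (not real data): a Monday, two Sundays, a Friday.
idx = pd.DatetimeIndex([
    "2017-07-03 15:00", "2017-07-02 18:00",
    "2017-07-09 18:05", "2017-07-07 14:00",
])

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Count by weekday name, then reindex so days without samples appear as 0
counts = (
    pd.Series(idx.day_name())
    .value_counts()
    .reindex(week_order, fill_value=0)
)
print(counts)
```

Working with day names avoids having to remember that pandas numbers weekdays from Monday = 0 to Sunday = 6.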
 

### 1.4: Analysis of activity and wrist features

For further data visualization, we'll create a temporary dataframe by replacing "wrist" column values (0 with "left" and 1 with "right") and "activity" column values (0 with "walk" and 1 with "run") as it helps make the charts more intuitive. 

After that, we'll visualize the sample distribution among different "activity" and "wrist" values. **Count plots** are a good visualization tool for **nominal** features.

In [None]:
#temporary dataframe with "wrist" and "activity" column values replaced
df_sorted_viz =  df_sorted.copy()
#assign the result back; replace(inplace=True) on a selected column may not update the dataframe in newer pandas versions
df_sorted_viz["wrist"] = df_sorted_viz["wrist"].replace({0:"left", 1:"right"})
df_sorted_viz["activity"] = df_sorted_viz["activity"].replace({0:"walk", 1:"run"})
#sanity check to see if values were updated correctly
print("Updated unique values")
for each_col in ["wrist", "activity"]:
    print(each_col,":", df_sorted_viz[each_col].unique())
df_sorted_viz.head(1)

In [None]:
#visualizing counts of "activity" and "wrist" features
plt_ht = 4
plt_asp = 2.5
#first plot
g_act = sns.catplot(x = "activity", kind = "count", height = plt_ht, aspect = plt_asp, data=df_sorted_viz)
g_act.ax.set_title("Recorded sample count for walk and run")
#second plot
g_wrist = sns.catplot(x = "wrist", kind = "count", height =plt_ht, aspect = plt_asp, data=df_sorted_viz)
g_wrist.ax.set_title("Recorded sample count for left & right wrists")
#third plot
g_act_wri = sns.catplot(x = "activity", kind="count", hue = "wrist", height = plt_ht, aspect = plt_asp, data=df_sorted_viz)
g_act_wri.ax.set_title("Recorded sample count for different activities, split by the wrist")
plt.show()

**Observations** on *nominal* features "activity" and "wrist":
 - Sample distribution is roughly even for different "activity", "wrist" values, maybe a bit skewed towards *right* wrist but not by much.
 - The third chart above illustrates that for *walk* we have more samples for *right wrist* and vice-versa for *run*.
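
The split shown in the third chart can be backed by a contingency table. Below is a minimal sketch using `pd.crosstab` on a handful of hypothetical labels (not the real dataset); with the notebook's data, the same two calls could be pointed at `df_sorted_viz["activity"]` and `df_sorted_viz["wrist"]`.

```python
import pandas as pd

# Hypothetical labels in the same 0/1 encoding (not the real dataset).
demo = pd.DataFrame({
    "activity": [0, 0, 0, 1, 1, 1, 1, 0],
    "wrist":    [1, 1, 0, 0, 0, 1, 0, 1],
})
demo["activity"] = demo["activity"].map({0: "walk", 1: "run"})
demo["wrist"] = demo["wrist"].map({0: "left", 1: "right"})

# Raw counts per (activity, wrist) pair
counts = pd.crosstab(demo["activity"], demo["wrist"])
# Row-normalized shares: within each activity, the fraction per wrist
shares = pd.crosstab(demo["activity"], demo["wrist"], normalize="index")
print(counts)
print(shares.round(2))
```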
    

### 1.5: Distribution plots of accelerometer and gyroscope data

For **ratio features**, density plots are a great way to visualize how a variable's data is distributed over its range. They provide valuable information on data skewness and inconsistencies in data collection.

In [None]:
fig_kde, ax_kde = plt.subplots(nrows=3, ncols=2, figsize=(16, 10))
ax_num = 0
for each_col in df.columns.values[5:11]:
    g_kde = sns.kdeplot(df_sorted_viz[each_col], ax=ax_kde[ax_num % 3][ax_num // 3])
    ax_num += 1
fig_kde.suptitle("Data density")
#fig_kde.tight_layout()
plt.show()

**Observations** on data distribution of device data:
 - For x-axis, accelerometer data is roughly symmetric and the double peak pattern is because of two "wrist" values. Same behavior is noticed in gyroscope data.
 - For y-axis, gyroscope data looks approximately normal and centered near 0. Accelerometer data, on the other hand, looks skewed and has the most inconsistent distribution among all the ratio features.
 - For z-axis, gyroscope data looks symmetric. Accelerometer data is slightly skewed but not as much as y-axis data.

### 1.6: Visualizing "activity" split accelerometer and gyroscope data

Visualizing device data split by the label ("activity") can shed light on how separable the data is between different activity values.

In [None]:
fig_str, ax_str = plt.subplots(nrows=2, ncols=3, figsize=(16, 10))
fig_str.suptitle("Device data split by activity")    
ax_num = 0
for each_col in df.columns.values[5:11]:
    g_str = sns.stripplot(x = "activity", y = each_col, hue = "activity", ax=ax_str[ax_num // 3][ax_num % 3], data = df_sorted_viz)
    ax_num += 1
    g_str.set_xlabel("")
#fig_str.tight_layout()
plt.show()

**Observations** from "activity" split accelerometer and gyroscope recordings:
 - "acceleration_x" and "acceleration_z" show clear differentiation between walking and running, with running yielding much larger magnitudes (positive or negative depending on the wrist).
 - "acceleration_y" shows some separation but not as pronounced as the two other dimensions.
 - "gyroscope" data on the other hand look quite similar for walking and running.
 
For predictive analysis, **acceleration_x** could be the most important feature because of its data distribution quality and ability to differentiate "activity". 
Although **acceleration_y, acceleration_z** do show separation, they suffer from some inconsistencies in data distribution which might hamper their prominence.
It'll be interesting to see the effects of **gyroscope** data.
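
As a rough numeric companion to the strip plots, separation between the two activities can be summarized per feature with a standardized mean difference (Cohen's d). This is a technique swapped in here for illustration, not something the analysis above relies on, and the sketch runs on synthetic data: the columns `separable` and `overlapping` and the helper `cohens_d` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (NOT the real data): one feature that separates the two
# classes well and one that barely does.
rng = np.random.default_rng(42)
n = 2000
demo = pd.DataFrame({
    "activity": np.repeat(["walk", "run"], n),
    "separable": np.concatenate([rng.normal(0, 1, n), rng.normal(2, 1, n)]),
    "overlapping": np.concatenate([rng.normal(0, 1, n), rng.normal(0.1, 1, n)]),
})

def cohens_d(frame, feature, label="activity", a="run", b="walk"):
    """Standardized mean difference between two label groups (pooled std)."""
    xa = frame.loc[frame[label] == a, feature]
    xb = frame.loc[frame[label] == b, feature]
    pooled_var = (
        (len(xa) - 1) * xa.var() + (len(xb) - 1) * xb.var()
    ) / (len(xa) + len(xb) - 2)
    return (xa.mean() - xb.mean()) / np.sqrt(pooled_var)

scores = {col: cohens_d(demo, col) for col in ["separable", "overlapping"]}
print(scores)
```

A mean-based score has a blind spot that matters here: if running produces large positive values on one wrist and large negative values on the other, the group means can nearly cancel. Computing the score per wrist, or on absolute values, would be one way to account for that.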

In the predictive analysis section, I'll be dealing with topics such as feature scaling, PCA, and prediction algorithm testing/efficiency.