<h1 align="center" style="background-color:#616161;color:white">Outlier Analysis & Cleanup</h1>

<h3>Summary</h3>

<font color=blue>
Two types of analysis were conducted:
* Daily listening habits: analysis of the number of unique tracks vs. the number of plays per user per day
* Histogram of the time interval between song plays
</font>    

<h3 style="background-color:#616161;color:white">0. Code setup</h3>

In [1]:
import os
import sys
import pandas as pd
import numpy as np
import datetime
import csv
import json
import sqlite3
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt

from pathlib import Path

%matplotlib inline
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
## Parameters you can change

# Abs path to settings file
root = "C:/DS/Github/MusicRecommendation"  # BA, Windows

## Import the codebase module
fPath = root + "/1_codemodule"
if fPath not in sys.path: sys.path.append(fPath)
import codebase as cb

# This is where preliminary analysis output gets stored. A folder with today's date is created.
i = datetime.datetime.now()
outputPath = root + "/4_preliminaryanalysis/outputs/%s_%s_%s/" % (i.day, i.month, i.year) 

## Finish setting up
os.chdir(root)

settingsDict =  cb.loadSettings()

# Load data from database
dbPath = root + settingsDict['dbPath']

In [5]:
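#--------------------- Functions ----------------------------------#
def getRandomUsers(maxUsers):
    # Return up to maxUsers randomly chosen userIDs as a DataFrame.
    # The original cell stops after building the query; running it and returning the result is an assumed completion.
    db = sqlite3.connect(dbPath)
    SQStr = "SELECT userID FROM tblUser ORDER BY RANDOM() LIMIT ?"
    users = pd.read_sql_query(SQStr, db, params=(int(maxUsers),))
    db.close()
    return users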

<h3 style="background-color:#616161;color:white">1. Generate CSV exports for analysis in visualization software</h3>

In [7]:
con = sqlite3.connect(dbPath)
cur = con.cursor()

sqlStr ='SELECT Cast(substr(userID,-5) as integer) as user,date(PlayedTimestamp) as PlayedTimeStamp ,count(*) as NumOfPlays, count(Distinct trackID) as NumOfTracks from tblInputData group by userID, date(PlayedTimestamp) ORDER BY NumOfPlays;'

# Export to CSV
cur.execute(sqlStr)
cb.exportToCSV(cur,outputPath + 'dataset1.csv')

con.close()

<sqlite3.Cursor at 0x1c75a256570>

<h3 style="background-color:#616161;color:white">2. Basic analysis</h3>

In [8]:
con = sqlite3.connect(dbPath)
sqlStr ='SELECT Cast(substr(userID,-5) as integer) as user,date(PlayedTimestamp) as PlayedTimeStamp ,count(*) as NumOfPlays, count(Distinct trackID) as NumOfTracks from tblInputData group by userID, date(PlayedTimestamp) ORDER BY NumOfPlays;'

# Load into Pandas
res = pd.read_sql_query(sqlStr, con)
con.close()

# Change data types
res['user'] = res['user'].astype('str')
res['PlayedTimeStamp'] =  pd.to_datetime(res['PlayedTimeStamp'])
#res.dtypes

In [None]:
res['PlayedTimeStamp'].describe()

<font color=blue>
The date range is from 27th April 2009 to 29th Sept 2013 
</font>

In [None]:
res['user'].describe()

<font color=blue>
There are 992 unique users. 
</font>

In [None]:
res.describe(percentiles = [.5, .95, .99])


* <font color=blue>The average number of daily plays was 48, and the average number of unique daily tracks was 35.</font>
* <font color=blue>The highest number of daily plays was a very large <b>2862</b>.</font>
* <font color=blue>The 99th percentile was only 295, so this may make a good cut-off point.</font>
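
The percentile-based cut-off described above can be sketched as follows. This is a minimal illustration on invented data, not the notebook's actual result: the helper name `play_cutoff` is hypothetical, and the only thing borrowed from the real dataset is the column name `NumOfPlays`.

```python
import pandas as pd

# Toy stand-in for the per-user, per-day table; the values are made up for illustration.
daily = pd.DataFrame({
    "user": ["1", "1", "2", "2", "3"],
    "NumOfPlays": [40, 55, 30, 2862, 48],
})

def play_cutoff(df, q=0.99):
    """Hypothetical helper: the q-th quantile of daily plays, used as a threshold."""
    return df["NumOfPlays"].quantile(q)

cutoff = play_cutoff(daily, q=0.99)

# User-days whose play count exceeds the threshold are candidate outliers.
outlier_days = daily[daily["NumOfPlays"] > cutoff]
print(cutoff)
print(outlier_days)
```

Deriving the threshold from the data rather than hard-coding 295 keeps the cut-off in step with the dataset if it is reloaded or filtered.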

In [None]:
qNumOfPlays = 295

res[(res['NumOfPlays'] > qNumOfPlays)].user.nunique()

# If you wish to drill down further use this:
#res[(res['NumOfPlays'] > qNumOfPlays)].groupby(['user']).count()

* <font color=blue>We would be excluding 253 users if we did this - which is a large portion of our 992 unique users.</font>

In [None]:
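# Restrict to the numeric columns so the per-user mean does not include the timestamp column
res.groupby('user')[['NumOfPlays', 'NumOfTracks']].mean().describe(percentiles=[.5, .95, .99])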

* <font color=blue>When we take the average number of tracks played per day for each user, we get a 99th percentile of 208, with a maximum average of 337.</font>
* <font color=blue>The question at this stage is whether to exclude any days where the number of tracks played by a user exceeded a certain threshold (say 295).</font>
* <font color=blue>Our analysis suggests that a large portion of users (253 out of 992) had at least one such day, so excluding this many users is not an option.</font>
* <font color=blue>If we average across these days, the data looks more normal: the maximum average was 337 tracks, still large but within the bounds of reality.</font>
* <font color=blue>Why some days have excessively high play counts remains a mystery. Further analysis did not show any obvious patterns, other than user 8, who appeared to have a particularly excessive number of plays.</font>

In [None]:
tmp=res[(res['NumOfPlays'] > 800)].groupby(['user']).count()
tmp

In [None]:
#res[(res['user'] =='8') & (res['NumOfPlays'] > 500)]
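# numeric_only=True skips the string 'user' column, which recent pandas versions refuse to average
res[(res['user'] =='8')].mean(numeric_only=True)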

<font color=blue>
This analysis, together with additional checks showing user 8 playing songs in very rapid succession for an entire day, indicates that user 8 may not be a real user and therefore ought to be removed.

It is unclear whether this is also the case for other excessive users; however, it was decided to keep them in the analysis.
</font>
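
One way such a rapid-succession check might look is sketched below. It is not the additional analysis referred to above, whose details are not shown in this notebook. The rows are invented, the column names simply mirror `tblMain`, and the rule (at least 5 measured gaps in a day with a median gap under 60 seconds) is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Toy play log; column names mirror tblMain but the rows are invented.
plays = pd.DataFrame({
    "UserID": ["8"] * 6 + ["17"] * 3,
    "PlayedTimestamp": pd.to_datetime([
        "2010-01-01 00:00:00", "2010-01-01 00:00:20", "2010-01-01 00:00:40",
        "2010-01-01 00:01:00", "2010-01-01 00:01:20", "2010-01-01 00:01:40",
        "2010-01-01 09:00:00", "2010-01-01 09:04:00", "2010-01-01 09:08:30",
    ]),
})

plays = plays.sort_values(["UserID", "PlayedTimestamp"])
plays["day"] = plays["PlayedTimestamp"].dt.date

# Seconds since the same user's previous play on the same day (NaN for the first play of a day).
plays["gap_s"] = (
    plays.groupby(["UserID", "day"])["PlayedTimestamp"].diff().dt.total_seconds()
)

# Per user-day: number of measured gaps and their median length.
per_day = plays.groupby(["UserID", "day"])["gap_s"].agg(["count", "median"])

# Hypothetical rule: many plays in a day, most of them far shorter than a song.
suspicious = per_day[(per_day["count"] >= 5) & (per_day["median"] < 60)]
print(suspicious)
```

The median is used instead of the mean so that a single long pause during the day does not hide an otherwise implausibly fast sequence of plays.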

<h3 style="background-color:#616161;color:white">3. Interval time</h3>

In [9]:
con = sqlite3.connect(dbPath)
# Alias the column explicitly so its name matches the res['UserID'] lookup below
sqlStr ='Select userID as UserID, MinsSincePrevPlay, PlayedTimestamp, historyID from tblMain order by userID, historyID'
res = pd.read_sql_query(sqlStr, con)
con.close()

# Change data types
res['UserID'] = res['UserID'].astype('str')
res['PlayedTimestamp'] =  pd.to_datetime(res['PlayedTimestamp'])

In [None]:
# Chart 1
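a=res[(res['MinsSincePrevPlay'] < 30)]
b=res[(res['MinsSincePrevPlay'] >= 30)]  # complement of a (not plotted in this cell)

# the histogram of the data
# density=True replaces the normed argument, which newer Matplotlib releases no longer accept
n, bins, patches = plt.hist(a['MinsSincePrevPlay'], 30, density=True)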
_ = plt.xlabel('Mins since previous play')
_ = plt.ylabel('Probability')
_ = plt.title('Play interval times less than 30 minutes')
plt.grid(True)

plt.show()

# 15 bin edges (in minutes) define 14 right-closed intervals; each label refers to the upper edge of its interval
bins = [0, 10, 30, 60, 120, 240, 480, 960, 1440, 2880, 4320, 5760, 7200, 8640, 10080]
group_names = ['<10min','30 min', '1hr', '2hrs', '4hrs', '8 hrs','16hrs','1day','2days','3days','4days','5days','6days','7days+']
# Values of exactly 0 or above 10080 minutes fall outside the bins and become NaN
categories = pd.cut(res['MinsSincePrevPlay'], bins, labels=group_names)
res['categories'] = categories
res['dayOfWeek'] = res['PlayedTimestamp'].dt.dayofweek

# Series.value_counts replaces the deprecated top-level pd.value_counts
res['categories'].value_counts(sort=False)

* <font color=blue>As one would expect, the average interval between plays of 4 minutes corresponds to the average length of a song, indicating consecutive listens. It tails off significantly after that.</font>
* <font color=blue>Beyond the hour mark, the 16-hour interval appears to be unusually popular. Let's see if it's associated with any day of the week.</font>

In [None]:
pd.crosstab(res['categories'],res['dayOfWeek'],normalize ='index') # 0 = monday

* There do not appear to be any patterns here.

<h3 style="background-color:#616161;color:white">4. Conclusion</h3>

Section 2:
* <font color=blue>User 8 appears not to be a valid user and will therefore be removed. </font>
* <font color=blue>It is unclear whether this is also the case for other excessive users; however, it was decided to keep them in the analysis. </font>
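
Removing a single user while keeping the other heavy listeners might be done along these lines. This is a sketch on invented rows; the set `excluded_users` is a hypothetical name, and the IDs are assumed to be strings, as they are after the `astype('str')` conversion in Section 2.

```python
import pandas as pd

# Toy per-user, per-day table; the values are invented for illustration.
daily = pd.DataFrame({
    "user": ["8", "8", "12", "40"],
    "NumOfPlays": [2862, 1500, 48, 310],
})

# Assumption: user IDs are strings, matching the astype('str') conversion.
excluded_users = {"8"}

# Keep every row whose user is not on the exclusion list.
cleaned = daily[~daily["user"].isin(excluded_users)].reset_index(drop=True)
print(cleaned)
```

Keeping the exclusions in one named collection makes it straightforward to add further users later if more turn out to be invalid.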

Section 3:
* <font color=blue>The vast majority of users listen to songs consecutively, as one would expect.</font>
* <font color=blue>Beyond that the drop-off rate is steep, and it drops particularly sharply after the 1-day period.</font>
* <font color=blue>From a modeling point of view we will classify the data into 3 buckets:</font> 
    * <font color=blue>'single session' : 10 minutes or under</font>
    * <font color=blue>'same day': 24 hours or under</font>
    * <font color=blue>'long gap': more than 24 hours</font>
* <font color=blue>We will evaluate whether it's better to have one model for all three of these categories or whether two or three models perform better.</font>
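
The three-bucket classification above could be expressed with `pd.cut` roughly as follows. This is a sketch on invented interval values; the names `bucket_edges` and `bucket_names` are hypothetical, and treating a gap of exactly 0 minutes as 'single session' is an assumption.

```python
import numpy as np
import pandas as pd

# Invented interval values in minutes, standing in for MinsSincePrevPlay.
gaps = pd.Series([0.0, 3.5, 10.0, 45.0, 1440.0, 1441.0, 20000.0],
                 name="MinsSincePrevPlay")

# Edges in minutes: up to 10 min, up to 24 h (1440 min), anything longer.
bucket_edges = [0, 10, 1440, np.inf]
bucket_names = ["single session", "same day", "long gap"]

# Intervals are right-closed; include_lowest=True also assigns a gap of exactly 0
# to the first bucket (an assumption about how zero gaps should be treated).
buckets = pd.cut(gaps, bins=bucket_edges, labels=bucket_names, include_lowest=True)
print(pd.concat([gaps, buckets.rename("bucket")], axis=1))
```

Missing intervals, such as a user's first recorded play, stay unassigned (NaN) under this scheme and would need a separate decision.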

<h3 style="background-color:#616161;color:white">END</h3>