## Tobi Bosede
PyCon Workshop  
May 2, 2019

The purpose of this notebook is to describe the algorithmic workflow, not a real-world implementation: it uses sample data and does not use Spark.

In [1]:
import pandas as pd
import numpy as np
from scipy.signal import vectorstrength
import time
from sklearn.linear_model import LogisticRegression

## Looking at gym class attendance data
We can begin forming a hypothesis.  
Members who are committed arrive early and stay for the full class, whereas those who are less committed might sometimes attend only a portion of the class.

In [2]:
df=pd.read_csv("sample_gym_data.csv")

In [3]:
df.head()

|   | Account | Class | Date | Hours_spent |
|---|---|---|---|---|
| 0 | 123 | kickboxing | 1/15/2019 | 1.1 |
| 1 | 345 | kickboxing | 1/15/2019 | 0.95 |
| 2 | 123 | spin | 1/17/2019 | 0.75 |
| 3 | 123 | HIIT | 1/12/2019 | 0.5 |
| 4 | 123 | spin | 1/20/2019 | 0.7 |


### Transforming Data

In [4]:
classes = np.unique(df[['Class']])
classes

array(['HIIT', 'kickboxing', 'spin', 'yoga'], dtype=object)

In [5]:
accounts = np.unique(df[['Account']])
accounts

array([123, 345])

In [6]:
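# Collapse the visit-level rows into one row per (account, class) pair,
# storing that pair's visit dates and hours as arrays.
rows = []
for j in accounts:
    for i in classes:
        df_s = df.loc[(df['Class'] == i) & (df['Account'] == j)]
        if df_s.empty:
            continue  # this account never attended this class
        rows.append(pd.DataFrame({'Account': [j],
                                  'Class': [i],
                                  'Dates': [np.array(df_s.Date)],
                                  'Hours_spent': [np.array(df_s.Hours_spent)]}))
# DataFrame.append was removed in pandas 2.0; pd.concat keeps each row's index of 0
df_i = pd.concat(rows)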
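The nested loop above can also be written as a single group-by. This is a minimal sketch, assuming the five rows shown by `df.head()` stand in for the full file; `df_demo` and `grouped` are illustrative names that do not appear in the notebook, and the grouped values are Python lists rather than NumPy arrays.

```python
import pandas as pd

# The five rows displayed by df.head(), used here as stand-in data
df_demo = pd.DataFrame({
    'Account': [123, 345, 123, 123, 123],
    'Class': ['kickboxing', 'kickboxing', 'spin', 'HIIT', 'spin'],
    'Date': ['1/15/2019', '1/15/2019', '1/17/2019', '1/12/2019', '1/20/2019'],
    'Hours_spent': [1.1, 0.95, 0.75, 0.5, 0.7],
})

# One row per (Account, Class) pair; each cell holds that pair's visits as a list
grouped = (df_demo.groupby(['Account', 'Class'])
                  .agg(Dates=('Date', list), Hours_spent=('Hours_spent', list))
                  .reset_index())
print(grouped)
```

Pairs with no visits never appear in the grouped result, so the explicit empty check is not needed in this form.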

In [7]:
df_i

|   | Account | Class | Dates | Hours_spent |
|---|---|---|---|---|
| 0 | 123 | HIIT | [1/12/2019, 1/25/2019] | [0.5, 0.55] |
| 0 | 123 | kickboxing | [1/15/2019, 1/22/2019, 1/29/2019, 2/5/2019, 2/... | [1.1, 1.0, 1.0, 1.05, 1.05] |
| 0 | 123 | spin | [1/17/2019, 1/20/2019, 1/27/2019, 2/7/2019, 2/... | [0.75, 0.7, 0.75, 0.75, 0.7] |
| 0 | 345 | HIIT | [2/8/2019, 2/18/2019] | [0.44, 0.5] |
| 0 | 345 | kickboxing | [1/15/2019, 1/24/2019, 2/12/2019] | [0.95, 0.9, 0.9] |
| 0 | 345 | yoga | [1/20/2019, 1/27/2019, 2/3/2019, 2/10/2019] | [1.1, 1.15, 1.0, 1.1] |


### Generating labels

From eyeballing the data, for each gym member account:  
- 123: kickboxing appears recurring; HIIT and spin do not.  
- 345: yoga and kickboxing appear recurring; HIIT does not.

We calculate vector strength for weekly and biweekly periods.
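Vector strength measures how tightly a set of event times clusters at one phase of a candidate period: each time is mapped to a point on the unit circle, and the strength is the length of the average of those points. The sketch below computes it by hand on hypothetical visit times in days; `vector_strength` and `weekly_visits` are illustrative names, not the notebook's, which relies on `scipy.signal.vectorstrength` instead.

```python
import cmath
import math

def vector_strength(event_times, period):
    """Length of the mean unit vector of the events' phases within `period`."""
    phasors = [cmath.exp(2j * math.pi * t / period) for t in event_times]
    return abs(sum(phasors) / len(phasors))

# Hypothetical, perfectly weekly attendance (times in days)
weekly_visits = [0, 7, 14, 21]

# Close to 1.0: every visit lands on the same phase of a 7-day cycle
print(vector_strength(weekly_visits, 7))
# Close to 0.0: visits alternate between opposite phases of a 14-day cycle
print(vector_strength(weekly_visits, 14))
```

A value near 1 for either period suggests attendance that repeats on that schedule, which is what the 0.9 threshold used for labeling looks for.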

In [8]:
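WEEK = 604800  # seconds in one week
vs7l=[]
vs14l=[]
for i in np.array(df_i.Dates):
    # convert each date string to seconds since the epoch (local time)
    time_epochs=[time.mktime(time.strptime(x, '%m/%d/%Y')) for x in i]
    # vectorstrength returns (strength, phase); keep the strengths for both periods
    vs7, vs14=vectorstrength(time_epochs, [WEEK, WEEK*2])[0]
    vs7l.append(vs7)
    vs14l.append(vs14)
vs7l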

[0.9009688679025161,
 1.0,
 0.3685715414076013,
 0.22252093395560107,
 0.6757642804158898,
 1.0]

In [9]:
vs14l

[0.9749279121818485,
 0.19999999999999998,
 0.20000000000006302,
 0.6234898018590196,
 0.5276826479539619,
 2.8960927447468397e-13]

In [10]:
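# recurring if either period's strength exceeds 0.9; apply the OR first, then convert to 0/1
label=((np.array(vs7l)>.9) | (np.array(vs14l)>.9)).astype(int)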
label

array([1, 1, 0, 0, 0, 1])

In [11]:
df_i["Recurring"]=label
df_i

|   | Account | Class | Dates | Hours_spent | Recurring |
|---|---|---|---|---|---|
| 0 | 123 | HIIT | [1/12/2019, 1/25/2019] | [0.5, 0.55] | 1 |
| 0 | 123 | kickboxing | [1/15/2019, 1/22/2019, 1/29/2019, 2/5/2019, 2/... | [1.1, 1.0, 1.0, 1.05, 1.05] | 1 |
| 0 | 123 | spin | [1/17/2019, 1/20/2019, 1/27/2019, 2/7/2019, 2/... | [0.75, 0.7, 0.75, 0.75, 0.7] | 0 |
| 0 | 345 | HIIT | [2/8/2019, 2/18/2019] | [0.44, 0.5] | 0 |
| 0 | 345 | kickboxing | [1/15/2019, 1/24/2019, 2/12/2019] | [0.95, 0.9, 0.9] | 0 |
| 0 | 345 | yoga | [1/20/2019, 1/27/2019, 2/3/2019, 2/10/2019] | [1.1, 1.15, 1.0, 1.1] | 1 |


### Creating Features
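We calculate the mean and standard deviation of the time between consecutive visits in days (mean_delta_t, sd_delta_t) and of the hours spent per visit (mean_hours, sd_hours). The weekly vector strength is also kept as a feature.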
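As a worked example of the delta_t features, the sketch below uses `datetime` rather than epoch seconds, which keeps the gaps independent of the machine's time zone. It takes the kickboxing dates of account 345 from the table above and assumes they are already in chronological order; `visit_gaps_in_days` is an illustrative helper, not part of the notebook.

```python
from datetime import datetime
from statistics import mean, pstdev

def visit_gaps_in_days(date_strings):
    """Days between consecutive visits; assumes the dates are sorted."""
    visits = [datetime.strptime(d, '%m/%d/%Y') for d in date_strings]
    return [(later - earlier).days for earlier, later in zip(visits[:-1], visits[1:])]

gaps = visit_gaps_in_days(['1/15/2019', '1/24/2019', '2/12/2019'])
print(gaps)          # [9, 19]
print(mean(gaps))    # 14
print(pstdev(gaps))  # 5.0 (population standard deviation, like np.std's default)
```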

In [12]:
dtmeans=[]
dtSDs=[]
for dates in np.array(df_i.Dates):
    time_epochs=[time.mktime(time.strptime(x, '%m/%d/%Y')) for x in dates]
    # gaps between consecutive visits, converted from seconds to days
    dt=[(later-earlier)/(60*60*24) for earlier, later in zip(time_epochs[:-1], time_epochs[1:])]
    dtmeans.append(np.mean(dt))
    dtSDs.append(np.std(dt))  # population SD (ddof=0)
dtmeans

[13.0, 7.0, 5.75, 10.0, 14.0, 7.0]

In [9]:
dtSDs

[0.0, 0.0, 3.5619517121937516, 0.0, 5.0, 0.0]

In [13]:
df_i["mean_delta_t"]=dtmeans
df_i["sd_delta_t"]=dtSDs
df_i["mean_hours"]=[np.mean(i) for i in np.array(df_i.Hours_spent)]
df_i["sd_hours"]=[np.std(i) for i in np.array(df_i.Hours_spent)]
df_i["vector_strength"]=vs7l
df_i

|   | Account | Class | Dates | Hours_spent | Recurring | mean_delta_t | sd_delta_t | mean_hours | sd_hours | vector_strength |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 123 | HIIT | [1/12/2019, 1/25/2019] | [0.5, 0.55] | 1 | 13.0 | 0.0 | 0.525 | 0.025 | 0.900969 |
| 0 | 123 | kickboxing | [1/15/2019, 1/22/2019, 1/29/2019, 2/5/2019, 2/... | [1.1, 1.0, 1.0, 1.05, 1.05] | 1 | 7.0 | 0.0 | 1.04 | 0.037417 | 1.0 |
| 0 | 123 | spin | [1/17/2019, 1/20/2019, 1/27/2019, 2/7/2019, 2/... | [0.75, 0.7, 0.75, 0.75, 0.7] | 0 | 5.75 | 3.561952 | 0.73 | 0.024495 | 0.368572 |
| 0 | 345 | HIIT | [2/8/2019, 2/18/2019] | [0.44, 0.5] | 0 | 10.0 | 0.0 | 0.47 | 0.03 | 0.222521 |
| 0 | 345 | kickboxing | [1/15/2019, 1/24/2019, 2/12/2019] | [0.95, 0.9, 0.9] | 0 | 14.0 | 5.0 | 0.916667 | 0.02357 | 0.675764 |
| 0 | 345 | yoga | [1/20/2019, 1/27/2019, 2/3/2019, 2/10/2019] | [1.1, 1.15, 1.0, 1.1] | 1 | 7.0 | 0.0 | 1.0875 | 0.054486 | 1.0 |


In [14]:
# target: the Recurring label; features: the five engineered columns
df_iTarg=np.array(df_i[["Recurring"]])
df_iFeat=df_i[["vector_strength","mean_hours", "sd_hours", "mean_delta_t", "sd_delta_t"]]
df_iFeat.head()

|   | vector_strength | mean_hours | sd_hours | mean_delta_t | sd_delta_t |
|---|---|---|---|---|---|
| 0 | 0.900969 | 0.525 | 0.025 | 13.0 | 0.0 |
| 0 | 1.0 | 1.04 | 0.037417 | 7.0 | 0.0 |
| 0 | 0.368572 | 0.73 | 0.024495 | 5.75 | 3.561952 |
| 0 | 0.222521 | 0.47 | 0.03 | 10.0 | 0.0 |
| 0 | 0.675764 | 0.916667 | 0.02357 | 14.0 | 5.0 |


### Training Model

We intentionally skip splitting the data into train and test sets because the data set is so small.  
The training data is reused for testing. (WOULD NOT DO IN PRACTICE)
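With more data, the model would be fit on one portion and scored on a held-out portion. The sketch below shows one way that could look, using synthetic data from scikit-learn's `make_classification` as a stand-in for a larger feature table; the data, the 25% test size, and all `_demo` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows with 5 features, mirroring the 5 feature columns
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 25% of the rows, keeping the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

clf_demo = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)

# Accuracy on rows the model never saw during fitting
test_accuracy = clf_demo.score(X_test, y_test)
print(test_accuracy)
```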

In [15]:
X, y = np.array(df_iFeat), df_iTarg.ravel()  # flatten the (n, 1) target to shape (n,)
clf = LogisticRegression(random_state=0, solver='lbfgs').fit(X, y)

### Prediction with Model
The predictions are not great, since there is so little data.

In [16]:
clf.predict(X)

array([1, 1, 0, 1, 0, 1])

In [17]:
label

array([1, 1, 0, 0, 0, 1])
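Comparing the two arrays above element by element quantifies the mismatch. A small sketch with the values copied from the outputs; the variable names are illustrative:

```python
predicted = [1, 1, 0, 1, 0, 1]  # output of clf.predict(X)
actual = [1, 1, 0, 0, 0, 1]     # vector-strength labels

pairs = list(zip(predicted, actual))
matches = sum(p == a for p, a in pairs)
accuracy = matches / len(actual)
false_positives = sum(p == 1 and a == 0 for p, a in pairs)

print(matches, len(actual))  # 5 6
print(false_positives)       # 1
```

The single disagreement is the fourth row (account 345, HIIT), which the model predicts as recurring although its label is 0. Because the model was scored on its own training data, this 5-of-6 agreement is an optimistic figure.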

In [18]:
clf.predict_proba(X)  # probabilities of class 0 and class 1; each row sums to 1

array([[0.32611905, 0.67388095],
       [0.23869954, 0.76130046],
       [0.86039918, 0.13960082],
       [0.38526909, 0.61473091],
       [0.95293972, 0.04706028],
       [0.23658871, 0.76341129]])
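For a binary logistic regression, `predict` returns the class with the larger probability, which is the same as thresholding the class-1 column at 0.5. The sketch below applies that rule to the second column of the output above; `p_recurring` and `thresholded` are illustrative names.

```python
# Class-1 ("recurring") probabilities copied from the predict_proba output
p_recurring = [0.67388095, 0.76130046, 0.13960082,
               0.61473091, 0.04706028, 0.76341129]

# Predict class 1 whenever its probability exceeds 0.5
thresholded = [int(p > 0.5) for p in p_recurring]
print(thresholded)  # [1, 1, 0, 1, 0, 1]
```

The result matches the `clf.predict(X)` output, and the fourth row's probability of about 0.61 shows the misclassified case is also one of the model's less confident positive predictions.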