## Table of Contents

[*Approach*](#Approach)

[*Libraries Used*](<#Libraries Used>)

[*Input Data*](<#Input Data>)

[*Creating Customer journeys*](<#Creating Customer journeys>)

[*Output*](#Output)


<a name="Approach">
### Approach : With a traditional clustering model, segmenting customer journeys yields clusters but no knowledge of what characterizes each individual cluster. The Affinity Propagation clustering model addresses this by providing a cluster exemplar, an actual member that is representative of its cluster, which makes the clusters easier to interpret.
</a>

<a name="Libraries Used">
### Libraries Used : NumPy, pandas, time, Matplotlib, seaborn, textdistance, re, scikit-learn, distance
</a>

In [1]:
# import packages
import numpy as np
import pandas as pd
import time
import matplotlib
import seaborn as sns
import textdistance
import re
import sklearn.cluster
import distance

<a name="Input Data">
### Input Data : Read the input data about customer events
</a>

In [2]:
event=pd.read_csv('zpj_sent_event.csv')

In [3]:
# Sort the data by account and time (event_id serves as a proxy for event order within a timestamp)
event = event.sort_values(by=['newid','event_timestamp','event_id'])

# Total number of unique users in the data
event.newid.nunique()

47327
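
One caveat about the sort above: `read_csv` is called without date parsing, so `event_timestamp` values such as `1/18/2018` are compared as text. The sketch below uses a hypothetical three-row frame (the `toy` data and the `event_ts` column are assumptions, not part of the dataset) to show how text order and calendar order can differ, and how parsing with `pd.to_datetime` would address it. Whether this affects the real data depends on the date range it covers.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the real data.
toy = pd.DataFrame({
    "newid": [1, 1, 1],
    "event_timestamp": ["10/5/2018", "2/1/2018", "1/18/2018"],
    "event_id": [3, 2, 1],
})

# Sorting the raw strings compares them character by character.
text_order = toy.sort_values(by=["newid", "event_timestamp", "event_id"])

# Parsing month/day/year first gives a true chronological sort.
toy["event_ts"] = pd.to_datetime(toy["event_timestamp"], format="%m/%d/%Y")
date_order = toy.sort_values(by=["newid", "event_ts", "event_id"])

print(text_order["event_timestamp"].tolist())
print(date_order["event_timestamp"].tolist())
```

In the string sort, `10/5/2018` lands before `2/1/2018`; in the parsed sort it lands last.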

In [4]:
event.head()

| Unnamed: 0 | newid | event_timestamp | event_name | platform | event_id | event_session_id | event_code |
|---|---|---|---|---|---|---|---|
| 305631 | 470.736995 | 1/18/2018 | past_care.home | web | 441248107 | 103035b7-ad83-468c-9489-b3cc36a0cbde | . |
| 282321 | 470.736995 | 1/18/2018 | past_care.participants.kimberly | web | 441251503 | 103035b7-ad83-468c-9489-b3cc36a0cbde | . |
| 486561 | 470.736995 | 1/18/2018 | past_care.participants.brian | web | 441251977 | 103035b7-ad83-468c-9489-b3cc36a0cbde | . |
| 103104 | 470.736995 | 2/1/2018 | choosing_wisely.shown | web | 451321701 | 1b96e665-8744-4ad7-a1b2-f1b2c8804857 | w |
| 511326 | 470.736995 | 2/1/2018 | search.filter.page | web | 451324506 | 1b96e665-8744-4ad7-a1b2-f1b2c8804857 | s |


<a name="Creating Customer journeys">
### Creating Customer journeys: The data for each customer is stored as log entries across multiple rows, so we collate those events into a single event sequence per customer
</a>

In [5]:
# Concatenate each customer's event codes, in sorted order, into one string per customer
df = event.groupby('newid')['event_code'].apply(''.join).reset_index()
df.head()

| | newid | event_code |
|---|---|---|
| 0 | 470.736995 | ...wssssps.wsSpppwppcfssssss.sws..psSssssssssp... |
| 1 | 497.857109 | ssS..sSppbb..ssspsSssbllSS. |
| 2 | 538.770109 | .Sbbb.S. |
| 3 | 1311.595552 | p |
| 4 | 1372.080296 | bnnnbbSn..c.bSSSfe |
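
To make the collation step concrete, here is a minimal sketch without pandas. It takes the five `(newid, event_code)` pairs shown in the `event.head()` output and appends each code to its customer's running string. The `rows` list and the `journeys` dictionary are illustrative names only, and the rows are assumed to be in time order already.

```python
from collections import defaultdict

# (newid, event_code) pairs copied from the first five rows of the event log.
rows = [
    (470.736995, "."),
    (470.736995, "."),
    (470.736995, "."),
    (470.736995, "w"),
    (470.736995, "s"),
]

# Append each event code to the sequence of the customer it belongs to.
journeys = defaultdict(str)
for newid, code in rows:
    journeys[newid] += code

print(dict(journeys))
```

The resulting prefix `...ws` matches the start of the collated `event_code` for customer `470.736995`.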


In [6]:
# Remove the uncategorized (dot) events first, then collapse each run of repeated events into a single event
non_dup = []
for code in df['event_code']:
    code = code.replace(".", "")
    collapsed = []
    for item in code:
        # Keep an event only if it differs from the last event kept
        if not collapsed or collapsed[-1] != item:
            collapsed.append(item)

    non_dup.append(''.join(collapsed))
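
The same cleaning step can be written more compactly with `itertools.groupby`, which groups consecutive equal items. This is an alternative sketch rather than the notebook's own code; `collapse_events` is a hypothetical helper name. The inputs are raw `event_code` strings from the table above.

```python
from itertools import groupby


def collapse_events(code: str) -> str:
    """Drop '.' events, then collapse each run of identical events to one."""
    cleaned = code.replace(".", "")
    # groupby yields one (key, run) pair per run of consecutive equal characters.
    return "".join(key for key, _ in groupby(cleaned))


print(collapse_events("ssS..sSppbb..ssspsSssbllSS."))
print(collapse_events("bnnnbbSn..c.bSSSfe"))
```

Because the dots are removed before collapsing, events that were separated only by dots (for example `s.s`) merge into a single event.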

In [7]:
# Keep only customers with 10 to 30 events (inclusive): at least 10 events are needed for clustering, and journeys longer than 30 events are excluded rather than truncated.
df['unq_event'] = non_dup
df1 = df[df.unq_event.str.len().between(10, 30)]
df1.head()

| | newid | event_code | unq_event |
|---|---|---|---|
| 0 | 470.736995 | ...wssssps.wsSpppwppcfssssss.sws..psSssssssssp... | wspswsSpwpcfswspsSspSp |
| 1 | 497.857109 | ssS..sSppbb..ssspsSssbllSS. | sSsSpbspsSsblS |
| 4 | 1372.080296 | bnnnbbSn..c.bSSSfe | bnbSncbSfe |
| 5 | 1635.667868 | b....bcncbbib.nt....b..l.i....nn.bf.p.s.spcc..... | bcncbibntblinbfpspcbtvbt |
| 17 | 3237.100318 | f.spp.sp.....ss..s.....b.ib.nb..p. | fspspsbibnbp |
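
The length filter keeps journeys whose cleaned length is between 10 and 30 inclusive, which is why index rows 2 and 3 disappear from the table above. A small sketch of that rule follows; `is_clusterable` and its parameter names are hypothetical and only mirror the thresholds used in the cell.

```python
def is_clusterable(journey: str, min_events: int = 10, max_events: int = 30) -> bool:
    """Return True when the cleaned journey length is within the inclusive bounds."""
    return min_events <= len(journey) <= max_events


# "p" is too short; the other two journeys appear in the filtered table.
for journey in ["p", "bnbSncbSfe", "fspspsbibnbp"]:
    print(journey, is_clusterable(journey))
```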


In [8]:
# Create a dataframe with just the user ID and the de-duplicated, cleaned events
df2 = df1.filter(items=['newid','unq_event'])
df2.to_csv("castlight_events2.csv")
seq = df2
seq.head()

| | newid | unq_event |
|---|---|---|
| 0 | 470.736995 | wspswsSpwpcfswspsSspSp |
| 1 | 497.857109 | sSsSpbspsSsblS |
| 4 | 1372.080296 | bnbSncbSfe |
| 5 | 1635.667868 | bcncbibntblinbfpspcbtvbt |
| 17 | 3237.100318 | fspspsbibnbp |


In [9]:
# The total number of customers is 10,000; to test our approach we use the first 1,000 customers
seqevent = seq[0:1000]
seqevent.head()

| | newid | unq_event |
|---|---|---|
| 0 | 470.736995 | wspswsSpwpcfswspsSspSp |
| 1 | 497.857109 | sSsSpbspsSsblS |
| 4 | 1372.080296 | bnbSncbSfe |
| 5 | 1635.667868 | bcncbibntblinbfpspcbtvbt |
| 17 | 3237.100318 | fspspsbibnbp |


<a name="Output">
### Output : Creating final clusters for 1,000 customers using the Affinity Propagation clustering model
</a>

In [10]:

start_ts = time.time()

# Calculate the Levenshtein distance between every pair of customer journeys and negate it to form a similarity matrix
words = np.asarray(seqevent['unq_event'])  # NumPy array, so that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

# Apply the Affinity Propagation clustering model with a damping factor of 0.5 so that clusters stabilize over multiple iterations
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]  # Exemplar: representative of the cluster's elements
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)  # Cluster elements are separated by commas
    print(" - *%s:* %s" % (exemplar, cluster_str))


# Note: the timer covers the similarity matrix, the model fit, and the printing above
runtime = time.time() - start_ts
print('The pairwise comparison using Levenshtein distance ran for {}'.format(runtime))

## OUTPUT : The pairwise comparison using Levenshtein distance ran for 216.4708869457245

 - *bcncbibntblinbfpspcbtvbt:* bcncbibntblinbfpspcbtvbt
 - *cbsfinbpfnsbpspbpsbSsSsSsSsSsS:* cbsfinbpfnsbpspbpsbSsSsSsSsSsS
 - *spfcfcfcspsfSbSpnbfpfpbnfsc:* spfcfcfcspsfSbSpnbfpfpbnfsc
 - *bplplspbps:* bSlSlwpswpsps, bSlbsSspbps, bfbpbtnbpsb, bfpspebfps, bplplspbps, bpwpwpsctbtsb, bwpwpbspebfpepfb, epwcflspbsfsp, flwbspbfpfs, fpSlpsesps, fpSpfSpsnbnbs, lbplplpfpfpb, lbplspbspls, lplbSplflfp, lsbplbpblbpbs, plplbSbwps, sblblsfebfs, sebpwlpslpse, spblslscsps, spbslscblc, splsSlwSbpbe, wpwsplplps, wspwplswps
 - *flbesSpsSsfbfbfbfpsnfbnb:* flbesSpsSsfbfbfbfpsnfbnb
 - *tbtctstbstbtbftbtbtbtbntbft:* tbtctstbstbtbftbtbtbtbntbft
 - *SeSsSsbpbcSspspspspspspSblbfcp:* SeSsSsbpbcSspspspspspspSblbfcp
 - *lspsbswbswscbplbsSseblblbl:* lspsbswbswscbplbsSseblblbl
 - *bspspfspsb:* Sebfspwspb, SpSpfnbibf, Sswslpcpcb, bSbspcfcbcsSb, bScsfsplbc, bSpSspSfbsfpfsb, bfcbpbplsb, bfspspspsb, bpspspbspsp, bsbfblsSsbs, bsbspfsfspnp, bspspfspsb, bspstbnctbtn, bspwfspncn, bswpwpfsfswpsp, cspscbifwpwspfebnc, cswspws
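
The similarity matrix used for the clusters above is the negated Levenshtein (edit) distance between every pair of journeys. Since the distance is symmetric, each pair only needs to be compared once, which roughly halves the number of comparisons. The sketch below illustrates this in plain Python; `levenshtein` is a textbook dynamic-programming stand-in (not the `distance` package's implementation), and `similarity_matrix` is a hypothetical helper name.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


def similarity_matrix(words):
    """Negated edit distances, computing each unordered pair only once."""
    n = len(words)
    sim = [[0] * n for _ in range(n)]  # the diagonal stays 0 (identical strings)
    for i in range(n):
        for j in range(i + 1, n):
            score = -levenshtein(words[i], words[j])
            sim[i][j] = score
            sim[j][i] = score  # mirror the value instead of recomputing it
    return sim


# Three journeys taken from the cluster listing above.
journeys = ["bplplspbps", "plplbSbwps", "bspspfspsb"]
for row in similarity_matrix(journeys):
    print(row)
```

A nested list like this can be converted with `np.array(...)` before being passed to a model that expects a precomputed affinity matrix.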