# Build an association mining model on this dataset to identify the common routes that positive patients have travelled:  

Consider each patient’s route as a transaction and build an association mining model on this dataset to identify the common routes that positive patients have travelled. The task is to conduct association analysis on this dataset.

1. [What variables did you include in the analysis? Justify your choice.](#attri)
2. Conduct association mining and answer the following:
    * [What is the ‘min_support’ threshold set? Discuss why it is chosen.](#support)
    * [Report the top 5 frequently occurring rules and interpret them.](#maxlift)
3. [Identify at least 10 common routes that positive patients from ‘Daegu_Buk-gu’ have travelled.](#buk-gu)
4. Can you perform sequence analysis on this dataset? If yes, present your results. If not, rationalise why.
5. How can the outcome of this study be used by the decision-makers?

In [2]:
import pandas as pd
from apyori import apriori

data = pd.read_csv('PatientRoute.csv')
data

Unnamed: 0,patient_id,global_num,date,location,latitude,longitude
0,1000000001,2.0,22/01/2020,Gyeonggi-do_Gimpo-si,37.615246,126.715632
1,1000000001,2.0,24/01/2020,Seoul_Jung-gu,37.567241,127.005659
2,1000000002,5.0,25/01/2020,Seoul_Seongbuk-gu,37.592560,127.017048
3,1000000002,5.0,26/01/2020,Seoul_Seongbuk-gu,37.591810,127.016822
4,1000000002,5.0,26/01/2020,Seoul_Seongdong-gu,37.563992,127.029534
...,...,...,...,...,...,...
6709,6100000090,,24/03/2020,Seoul_Gangseo-gu,37.558654,126.794474
6710,6100000090,,24/03/2020,Busan_Gangseo-gu,35.173220,128.946459
6711,6100000090,,25/03/2020,Gyeongsangnam-do_Yangsan-si,35.336944,129.026389
6712,6100000090,,25/03/2020,Gyeongsangnam-do_Yangsan-si,35.335757,129.025003


<a id="attri"></a>
**Selecting Attribute**

This problem focuses on patients and their travel patterns, so we use only the `patient_id` and `location` variables. The locations visited by one patient form one transaction. We store these two columns in a separate DataFrame before applying association mining.

In [3]:
df=data[["patient_id","location"]]
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6714 entries, 0 to 6713
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   patient_id  6714 non-null   int64 
 1   location    6714 non-null   object
dtypes: int64(1), object(1)
memory usage: 105.0+ KB


The Apriori algorithm takes a list of transactions as input. We create **travel_list**, which holds the list of locations visited by each patient, and feed it to the algorithm.

In [4]:
temp = df.groupby(['patient_id'])['location'].apply(list)
travel_list=list(temp)

sequences = temp.values.tolist()  # ordered visit lists, used later as input to the sequential analysis
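
The grouping above relies on the rows of `PatientRoute.csv` already being in chronological order for each patient. The following is a minimal sketch of how that order could be enforced before building the lists, assuming the `date` column uses the day/month/year format shown in the data preview. The toy frame and the names `toy`, `toy_sequences` and `toy_transactions` are illustrative only.

```python
import pandas as pd

# Toy data in the same shape as PatientRoute.csv (values are made up).
toy = pd.DataFrame({
    "patient_id": [2, 1, 1, 2, 1],
    "date": ["26/01/2020", "24/01/2020", "22/01/2020", "25/01/2020", "24/01/2020"],
    "location": ["B", "C", "A", "A", "C"],
})

# Parse dd/mm/yyyy strings so that sorting is chronological, not alphabetical.
toy["date"] = pd.to_datetime(toy["date"], format="%d/%m/%Y")
toy = toy.sort_values(["patient_id", "date"])

# Ordered visit lists per patient: suitable input for sequence mining.
toy_sequences = toy.groupby("patient_id")["location"].apply(list).tolist()

# De-duplicated visits per patient: Apriori only looks at which locations occur.
toy_transactions = [sorted(set(seq)) for seq in toy_sequences]

print(toy_sequences)
print(toy_transactions)
```

Keeping the two structures separate makes it explicit that association mining ignores both the order and the number of visits, while the sequential analysis depends on the order.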


## Applying apriori algorithm
The Apriori algorithm is available in the `apyori` library, which we use for this problem. Note that the **MLxtend** library also provides an `apriori` implementation.
<a id="support"></a>
For this analysis we set `min_support` to 0.01, meaning an itemset must appear in at least 1% of patient routes, because we want to include the rarer travel patterns as well.

In [5]:
from apyori import apriori
route=list(apriori(travel_list,min_support=0.01))
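
To make the support threshold more concrete, it helps to translate it into a number of patients. The sketch below does this for a few hypothetical dataset sizes; the helper name `min_patient_count` and the sizes are assumptions made for illustration, not values taken from this dataset.

```python
import math

def min_patient_count(n_transactions: int, min_support: float) -> int:
    """Smallest number of transactions an itemset needs to reach min_support."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(n_transactions * min_support, 9))

# Hypothetical dataset sizes, for illustration only.
for n in (500, 1000, 2000):
    print(n, min_patient_count(n, 0.01), min_patient_count(n, 0.002))
```

On the real data the same calculation can be run with `len(travel_list)` as the number of transactions.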

Tidy the rules into a DataFrame for easier inspection

In [6]:
def tidy_results(route):
    rules=[]
    for rule_set in route:
        for rule in rule_set.ordered_statistics:
            rules.append([','.join(rule.items_base),','.join(rule.items_add),
                         rule_set.support, rule.confidence, rule.lift])
    return pd.DataFrame(rules,columns=['From','To', 'Support','Confidence', 'Lift'])
route_df=tidy_results(route)
route_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   From        93 non-null     object 
 1   To          93 non-null     object 
 2   Support     93 non-null     float64
 3   Confidence  93 non-null     float64
 4   Lift        93 non-null     float64
dtypes: float64(3), object(2)
memory usage: 3.8+ KB
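
With the `tidy_results` layout, records that describe a frequent itemset on its own come out with an empty `From` string and a lift of exactly 1. A small sketch of how such rows could be filtered out so that only genuine rules remain; the frame `toy_rules` is a hand-built stand-in that reuses a few values from this notebook's output.

```python
import pandas as pd

# Stand-in for route_df with the same columns as tidy_results produces.
toy_rules = pd.DataFrame(
    [
        ["", "Busan_Busanjin-gu", 0.017341, 0.017341, 1.0],
        ["Busan_Busanjin-gu", "Busan_Yeonje-gu", 0.010735, 0.619048, 10.269406],
        ["Busan_Yeonje-gu", "Busan_Busanjin-gu", 0.010735, 0.178082, 10.269406],
    ],
    columns=["From", "To", "Support", "Confidence", "Lift"],
)

# Keep only rows with a non-empty antecedent.
real_rules = toy_rules[toy_rules["From"] != ""].reset_index(drop=True)
print(real_rules)
```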


<a id="maxlift"></a>
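The top patient routes can be found by sorting the rules by lift. We use the **head()** function to show the ten strongest rules. Because each pair of locations appears in both directions with the same lift, these ten rows cover the top five location pairs.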

In [24]:
route_df.sort_values(by='Lift',ascending=False).head(10)

Unnamed: 0,From,To,Support,Confidence,Lift
55,Busan_Busanjin-gu,Busan_Yeonje-gu,0.010735,0.619048,10.269406
56,Busan_Yeonje-gu,Busan_Busanjin-gu,0.010735,0.178082,10.269406
67,Chungcheongnam-do_Asan-si,Chungcheongnam-do_Cheonan-si,0.017341,0.75,9.560526
68,Chungcheongnam-do_Cheonan-si,Chungcheongnam-do_Asan-si,0.017341,0.221053,9.560526
61,Busan_Haeundae-gu,Busan_Yeonje-gu,0.014038,0.53125,8.812928
62,Busan_Yeonje-gu,Busan_Haeundae-gu,0.014038,0.232877,8.812928
58,Busan_Dongnae-gu,Busan_Yeonje-gu,0.017341,0.5,8.294521
59,Busan_Yeonje-gu,Busan_Dongnae-gu,0.017341,0.287671,8.294521
71,Daegu_Jung-gu,Daegu_Buk-gu,0.01569,0.253333,7.866325
70,Daegu_Buk-gu,Daegu_Jung-gu,0.01569,0.487179,7.866325


We can see that there was a lot of travel between neighbouring districts within Busan, Chungcheongnam-do and Daegu.
* Busan: Many patients have a history of travelling among Busanjin-gu, Yeonje-gu, Haeundae-gu and Dongnae-gu. The two directions of a rule share the same lift but differ in confidence: Busanjin-gu => Yeonje-gu has a confidence of 0.62, whereas Yeonje-gu => Busanjin-gu has 0.18. In other words, about 62% of patients who visited Busanjin-gu also visited Yeonje-gu, but only about 18% of those who visited Yeonje-gu also visited Busanjin-gu. Because Apriori ignores the order of visits, this describes co-occurrence rather than direction of travel.
* Chungcheongnam-do: Patients in this province show a pattern between Asan-si and Cheonan-si. As in the Busan case, the confidence is asymmetric: 0.75 for Asan-si => Cheonan-si against 0.22 for the reverse.
* Daegu: Patients who visited Buk-gu are likely to have visited Jung-gu as well (confidence 0.49), while the reverse rule is weaker (0.25).
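
The reason both directions of a rule share one lift value while their confidences differ can be checked by hand, since lift is the confidence divided by the support of the consequent. A short sketch using figures that appear in this notebook's rule tables; the helper name `lift` is just for illustration.

```python
def lift(confidence: float, consequent_support: float) -> float:
    """lift(X => Y) = confidence(X => Y) / support(Y)."""
    return confidence / consequent_support

# Figures taken from the rule tables in this notebook.
support_yeonje = 0.060281     # support of {Busan_Yeonje-gu}
support_busanjin = 0.017341   # support of {Busan_Busanjin-gu}

lift_forward = lift(0.619048, support_yeonje)      # Busanjin-gu => Yeonje-gu
lift_backward = lift(0.178082, support_busanjin)   # Yeonje-gu => Busanjin-gu

print(round(lift_forward, 2), round(lift_backward, 2))
```

Both directions give roughly the same lift because the rarer location (Busanjin-gu) has the lower confidence as a consequent but also the lower support.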

<a id="buk-gu"></a>
To identify the travel patterns of patients from Daegu_Buk-gu we need to lower the minimum support. Routes involving Buk-gu are rare compared with the most common routes, so for this question we reduce `min_support` to 0.002, which should surface these patterns.


In [20]:
route_df

Unnamed: 0,From,To,Support,Confidence,Lift
0,,Busan_Busanjin-gu,0.017341,0.017341,1.000000
1,,Busan_Dong-gu,0.011561,0.011561,1.000000
2,,Busan_Dongnae-gu,0.034682,0.034682,1.000000
3,,Busan_Haeundae-gu,0.026424,0.026424,1.000000
4,,Busan_Seo-gu,0.014038,0.014038,1.000000
...,...,...,...,...,...
88,Seoul_Dongjak-gu,Seoul_Gwanak-gu,0.011561,0.134615,3.881410
89,Seoul_Gwanak-gu,Seoul_Dongjak-gu,0.011561,0.333333,3.881410
90,,"Seoul_Jungnang-gu,Seoul_Songpa-gu",0.015690,0.015690,1.000000
91,Seoul_Jungnang-gu,Seoul_Songpa-gu,0.015690,0.206522,4.547233


In [8]:
# 3. Lower min_support to 0.002
route_buk_gu=list(apriori(travel_list,min_support=0.002))
route_buk_gu_df=tidy_results(route_buk_gu)


In [29]:
# Select rows whose 'From' column is exactly 'Daegu_Buk-gu'
options = ['Daegu_Buk-gu'] 
buk_gu_df = route_buk_gu_df.loc[route_buk_gu_df['From'].isin(options)]
# Sort the new DataFrame by lift
buk_gu_df.sort_values(by='Lift',ascending=False).head(10)


Unnamed: 0,From,To,Support,Confidence,Lift
1262,Daegu_Buk-gu,"Daegu_Jung-gu,Daegu_Nam-gu,Seoul_Jung-gu,Daegu...",0.002477,0.076923,31.051282
1217,Daegu_Buk-gu,"Daegu_Seo-gu,Daegu_Nam-gu,Seoul_Jung-gu",0.002477,0.076923,31.051282
1202,Daegu_Buk-gu,"Daegu_Jung-gu,Seoul_Jung-gu,Daegu_Seo-gu",0.002477,0.076923,31.051282
1187,Daegu_Buk-gu,"Daegu_Jung-gu,Daegu_Nam-gu,Seoul_Jung-gu",0.002477,0.076923,31.051282
1172,Daegu_Buk-gu,"Daegu_Jung-gu,Daegu_Nam-gu,Daegu_Seo-gu",0.002477,0.076923,31.051282
968,Daegu_Buk-gu,"Daegu_Seo-gu,Seoul_Jung-gu",0.002477,0.076923,31.051282
961,Daegu_Buk-gu,"Daegu_Nam-gu,Seoul_Jung-gu",0.002477,0.076923,31.051282
954,Daegu_Buk-gu,"Daegu_Seo-gu,Daegu_Nam-gu",0.002477,0.076923,23.288462
940,Daegu_Buk-gu,"Daegu_Jung-gu,Daegu_Seo-gu",0.002477,0.076923,18.630769
933,Daegu_Buk-gu,"Daegu_Jung-gu,Daegu_Nam-gu",0.004129,0.128205,17.250712


**Common travel patterns of patients from Buk-gu involve Jung-gu, Seo-gu and Nam-gu within Daegu, along with Seoul_Jung-gu.**
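
The `isin` filter used for this question keeps only rules whose antecedent is exactly `Daegu_Buk-gu`. If rules with a combined antecedent (Buk-gu together with another location) are also of interest, one possible variation is to split the comma-joined `From` string and test membership. The frame below is a made-up stand-in in the `tidy_results` layout, and its values are illustrative rather than real results.

```python
import pandas as pd

# Stand-in rules; the numbers are illustrative only.
toy_rules = pd.DataFrame(
    [
        ["Daegu_Buk-gu", "Daegu_Jung-gu", 0.0157, 0.487, 7.87],
        ["Daegu_Buk-gu,Daegu_Nam-gu", "Daegu_Jung-gu", 0.0041, 0.6, 9.7],
        ["Daegu_Jung-gu", "Daegu_Buk-gu", 0.0157, 0.253, 7.87],
    ],
    columns=["From", "To", "Support", "Confidence", "Lift"],
)

target = "Daegu_Buk-gu"

# Split the comma-joined antecedent back into items and test membership,
# so that "Daegu_Buk-gu,Daegu_Nam-gu" also counts as a Buk-gu rule.
has_target = toy_rules["From"].str.split(",").apply(lambda items: target in items)

buk_gu_rules = toy_rules[has_target].sort_values(by="Lift", ascending=False)
print(buk_gu_rules)
```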


## Performing sequential analysis

The following function uses SPMF to mine sequential rules from the ordered visit lists.

In [16]:
from collections import defaultdict
import subprocess
import re

def get_association_rules(sequences, min_sup, min_conf):
    ''' Uses SPMF (RuleGrowth) to find sequential rules in the supplied sequences '''
    # step 1: create required input for SPMF
    
    # prepare a dict to uniquely assign each item in the transactions to an int ID
    item_dict = defaultdict(int)
    output_dict = defaultdict(str)
    item_id = 1
    
    # write your sequences in SPMF format
    with open('seq_rule_input.txt', 'w+') as f:
        for sequence in sequences:
            z = []
            for itemset in sequence:
                # if there are multiple items in one itemset
                if isinstance(itemset, list):
                    for item in itemset:
                        if item not in item_dict:
                            item_dict[item] = item_id
                            output_dict[str(item_id)] = item  # keep the reverse mapping, as in the single-item branch
                            item_id += 1

                        z.append(item_dict[item])
                else:
                    if itemset not in item_dict:
                        item_dict[itemset] = item_id
                        output_dict[str(item_id)] = itemset
                        item_id += 1
                    z.append(item_dict[itemset])
                    
                # end of itemset
                z.append(-1)
            
            # end of a sequence
            z.append(-2)
            f.write(' '.join([str(x) for x in z]))
            f.write('\n')
    
    # run SPMF with supplied parameters
    supp_param = '{}%'.format(int(min_sup * 100))
    conf_param = '{}%'.format(int(min_conf * 100))
    # the argument list is passed without shell=True so that every argument reaches java on all platforms
    subprocess.call(['java', '-jar', 'spmf.jar', 'run', 'RuleGrowth', 'seq_rule_input.txt', 'seq_rule_output.txt', supp_param, conf_param])
    
    # read back the output rules
    with open('seq_rule_output.txt', 'r') as f:
        outputs = f.read().strip().split('\n')
    output_rules = []
    for rule in outputs:
        left, right, sup, conf = re.search(pattern=r'([0-9\,]+) ==> ([0-9\,]+) #SUP: ([0-9]+) #CONF: ([0-9\.]+)', string=rule).groups()
        sup = int(sup) / len(sequences)
        conf = float(conf)
        output_rules.append([[output_dict[x] for x in left.split(',')], [output_dict[x] for x in right.split(',')], sup, conf])
    
    # return pandas DataFrame
    return pd.DataFrame(output_rules, columns = ['Left_rule', 'Right_rule', 'Support', 'Confidence'])
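
The input format that the function writes for SPMF is easier to see on a tiny example: every item becomes a positive integer, `-1` closes an itemset and `-2` closes a sequence. The sketch below reproduces only that encoding step for sequences of single visits; `encode_spmf` and the toy data are illustrative names, not part of SPMF.

```python
def encode_spmf(sequences):
    """Return (lines, id_to_item) for sequences of single-item visits."""
    item_to_id = {}
    lines = []
    for sequence in sequences:
        tokens = []
        for item in sequence:
            # Assign the next free positive integer to each new item.
            item_id = item_to_id.setdefault(item, len(item_to_id) + 1)
            tokens += [str(item_id), "-1"]  # -1 closes the itemset
        tokens.append("-2")                 # -2 closes the sequence
        lines.append(" ".join(tokens))
    # Reverse mapping, used to translate rule IDs back to location names.
    id_to_item = {str(v): k for k, v in item_to_id.items()}
    return lines, id_to_item

toy_lines, toy_ids = encode_spmf([["A", "B"], ["B", "C", "A"]])
for line in toy_lines:
    print(line)
```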

In [27]:
get_association_rules(sequences, 0.01, 0.01)

Unnamed: 0,Left_rule,Right_rule,Support,Confidence
0,[Gyeonggi-do_Gimpo-si],[Seoul_Jung-gu],3.574732,0.63151
1,[Gyeonggi-do_Gimpo-si],"[Seoul_Jung-gu, Seoul_Seongbuk-gu]",1.639967,0.289716
2,[Gyeonggi-do_Gimpo-si],"[Seoul_Jung-gu, Seoul_Jongno-gu]",0.940545,0.166156
3,[Gyeonggi-do_Gimpo-si],"[Seoul_Jung-gu, Gyeonggi-do_Seongnam-si]",0.735756,0.129978
4,[Gyeonggi-do_Gimpo-si],[Seoul_Seongbuk-gu],2.388109,0.421882
5,"[Gyeonggi-do_Gimpo-si, Seoul_Jung-gu]",[Seoul_Seongbuk-gu],1.639967,0.458766
6,[Gyeonggi-do_Gimpo-si],[Seoul_Seongdong-gu],1.028076,0.181619
7,[Gyeonggi-do_Gimpo-si],[Seoul_Gangnam-gu],0.745665,0.131729
8,[Gyeonggi-do_Gimpo-si],[Seoul_Jongno-gu],1.384806,0.244639
9,"[Gyeonggi-do_Gimpo-si, Seoul_Jung-gu]",[Seoul_Jongno-gu],0.940545,0.263109


In [32]:
route_df.sort_values(by='Confidence',ascending=False).head(20)

Unnamed: 0,From,To,Support,Confidence,Lift
67,Chungcheongnam-do_Asan-si,Chungcheongnam-do_Cheonan-si,0.017341,0.75,9.560526
55,Busan_Busanjin-gu,Busan_Yeonje-gu,0.010735,0.619048,10.269406
61,Busan_Haeundae-gu,Busan_Yeonje-gu,0.014038,0.53125,8.812928
58,Busan_Dongnae-gu,Busan_Yeonje-gu,0.017341,0.5,8.294521
70,Daegu_Buk-gu,Daegu_Jung-gu,0.01569,0.487179,7.866325
83,Seoul_Songpa-gu,Incheon_Jung-gu,0.020644,0.454545,3.669697
77,Seoul_Gangnam-gu,Incheon_Jung-gu,0.030553,0.352381,2.844889
92,Seoul_Songpa-gu,Seoul_Jungnang-gu,0.01569,0.345455,4.547233
89,Seoul_Gwanak-gu,Seoul_Dongjak-gu,0.011561,0.333333,3.88141
59,Busan_Yeonje-gu,Busan_Dongnae-gu,0.017341,0.287671,8.294521


In [33]:
route_df.sort_values(by='Lift',ascending=False).head(20)

Unnamed: 0,From,To,Support,Confidence,Lift
55,Busan_Busanjin-gu,Busan_Yeonje-gu,0.010735,0.619048,10.269406
56,Busan_Yeonje-gu,Busan_Busanjin-gu,0.010735,0.178082,10.269406
67,Chungcheongnam-do_Asan-si,Chungcheongnam-do_Cheonan-si,0.017341,0.75,9.560526
68,Chungcheongnam-do_Cheonan-si,Chungcheongnam-do_Asan-si,0.017341,0.221053,9.560526
61,Busan_Haeundae-gu,Busan_Yeonje-gu,0.014038,0.53125,8.812928
62,Busan_Yeonje-gu,Busan_Haeundae-gu,0.014038,0.232877,8.812928
58,Busan_Dongnae-gu,Busan_Yeonje-gu,0.017341,0.5,8.294521
59,Busan_Yeonje-gu,Busan_Dongnae-gu,0.017341,0.287671,8.294521
71,Daegu_Jung-gu,Daegu_Buk-gu,0.01569,0.253333,7.866325
70,Daegu_Buk-gu,Daegu_Jung-gu,0.01569,0.487179,7.866325


In [34]:
route_df.sort_values(by='Support',ascending=False).head(20)

Unnamed: 0,From,To,Support,Confidence,Lift
28,,Incheon_Jung-gu,0.123865,0.123865,1.0
35,,Seoul_Gangnam-gu,0.086705,0.086705,1.0
32,,Seoul_Dongjak-gu,0.085879,0.085879,1.0
9,,Chungcheongnam-do_Cheonan-si,0.078448,0.078448,1.0
42,,Seoul_Jungnang-gu,0.07597,0.07597,1.0
38,,Seoul_Guro-gu,0.066887,0.066887,1.0
12,,Daegu_Jung-gu,0.061932,0.061932,1.0
6,,Busan_Yeonje-gu,0.060281,0.060281,1.0
41,,Seoul_Jung-gu,0.057803,0.057803,1.0
50,,Seoul_Yangcheon-gu,0.052023,0.052023,1.0
