# Using apriori and association rules from mlxtend to look at associations between domain names

Some useful references for understanding apriori and mlxtend (I never found one great piece, so I pulled from several):
- [Data Science - Apriori Algorithm in Python- Market Basket Analysis](https://intellipaat.com/blog/data-science-apriori-algorithm/)
- [Apriori Algorithm implementation in Python](https://highontechs.com/recommendation-system/apriori-algorithm-implementation-in-python/)
- [Market Basket Analysis (on Kaggle)](https://www.kaggle.com/roshansharma/market-basket-analysis)
- [Market Basket Analysis](https://pbpython.com/market-basket-analysis.html)
- [Association analysis in python](https://medium.com/analytics-vidhya/association-analysis-in-python-2b955d0180c)
- [apriori mlxtend documentation](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)
- [Association Rules Generation from Frequent Itemsets](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/)

In [1]:
# If having import problems, this was the solution to the import problem for mlxtend: https://medium.com/@shivangisareen/for-anyone-using-jupyter-notebook-installing-packages-18a9468d0c1c
# LIFE SAVER!!!!
# import sys
# !{sys.executable} -m pip install mlxtend

In [2]:
import pandas as pd
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from ast import literal_eval
import modules.sifunctions as sif

## Notebook Functions

In [3]:
def support_to_count(supp):
    """
    A quick way to find how a support value from apriori relates to the count of occurrences in the dataframe.
    For example, a returned value of 4 means the support value corresponds to 4 occurrences of an item in the data.
    """
    transactions = data.shape[0]
    return int(supp * transactions)
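
One caveat with `int()` in this conversion: it truncates, so a product that lands just below a whole number because of floating-point error is reported one too low. A minimal sketch of a rounding variant follows. The name `support_to_count_rounded` and the explicit `transactions` argument are assumptions for illustration, not part of the notebook.

```python
def support_to_count_rounded(supp, transactions):
    """Convert a support fraction back to a whole-number occurrence count.

    Hypothetical variant of support_to_count: it takes the row count
    explicitly and rounds to the nearest integer instead of truncating.
    """
    return int(round(supp * transactions))


# 0.29 * 100 evaluates to 28.999999999999996 in binary floating point,
# so truncating and rounding give different answers.
truncated = int(0.29 * 100)
rounded = support_to_count_rounded(0.29, 100)
print(truncated, rounded)
```

For values that sit comfortably between integers, such as `0.001829 * 17498`, both approaches agree; the difference only shows up at the boundary.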

In [4]:
def find_fzset(ls):
    """
    Search the rules dataframe for a specific frozenset
    """
    find = frozenset(ls)
    results = rules[rules["antecedents"] == find]
    return results

In [5]:
def find_ants(dom):
    """
    Search the rules dataframe for rules whose antecedents contain a specific item.
    """
    # Boolean mask: True where the antecedent frozenset contains `dom`.
    mask = rules["antecedents"].apply(lambda items: dom in items)
    # Renumber the matching rows from 0.
    return rules[mask].reset_index(drop=True)

In [6]:
def summary(rules,obj="no"):
    """
    Print an overview of how many URLs the exclusion rules would remove.
    """
    desc = {
        "url_exclude_count": rules[rules["consequents"]==rm]["antecedent occurences"].sum(),
        "exclusion_rules_count": rules.shape[0],
        "exclusion_rules": list(rules["antecedents"])
    }
    print(f'Description: \n excludes {desc["url_exclude_count"]} URLs from {data.shape[0]}.\n and has {desc["exclusion_rules_count"]} exclusion rules.')
    
    if obj == "yes":
        return desc

## Read in Data

In [7]:
data = pd.read_csv('data/domain_dataFrame.csv', index_col=0)

In [8]:
data.head()

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,abcs.mgh.harvard.edu,200.0,ABCs - MICCAI 2020 Challenge,"['abcs', 'mgh', 'harvard', 'edu']",harvard.edu,abcs.mgh,mgh,abcs,,,...,1,0,0,https://abcs.mgh.harvard.edu/,1,KEEP,HTTP Check: requested http://abcs.mgh.harvard....,abcs,4,abcs
1,abel.harvard.edu,200.0,Harvard Mathematics Department : Home page,"['abel', 'harvard', 'edu']",harvard.edu,abel,abel,,,,...,1,0,0,http://abel.harvard.edu/,1,KEEP,VALID: http://abel.harvard.edu: 200,abel,3,abel
2,about.my.harvard.edu,200.0,Service Portal - IT Help,"['about', 'my', 'harvard', 'edu']",harvard.edu,about.my,my,about,,,...,1,0,0,https://harvard.service-now.com/ithelp,0,CHECK,CHECK: requested http://about.my.harvard.edu a...,about,4,about
3,ac-web.dce.harvard.edu,200.0,AC-WEB: Academic Computing,"['ac-web', 'dce', 'harvard', 'edu']",harvard.edu,ac-web.dce,dce,ac-web,,,...,1,0,0,https://ac-web.dce.harvard.edu/,1,KEEP,HTTP Check: requested http://ac-web.dce.harvar...,ac-web,4,ac-
4,academicresourcecenter.harvard.edu,200.0,Academic Resource Center,"['academicresourcecenter', 'harvard', 'edu']",harvard.edu,academicresourcecenter,academicresourcecenter,,,,...,1,0,0,https://academicresourcecenter.harvard.edu/,1,KEEP,HTTP Check: requested http://academicresourcec...,academicresourcecenter,3,acade


In [9]:
data.describe()

Unnamed: 0,http status code,res_status,redirect,redirect_code,public,login,harvard_key,success,domain_count
count,2998.0,17498.0,17498.0,17498.0,17498.0,17498.0,17498.0,17498.0,17498.0
mean,58.353903,8.831752,0.03469,10.452395,0.038804,0.005029,0.003258,-0.925191,4.079838
std,110.072208,48.171832,0.182998,55.139486,0.193134,0.07074,0.056983,0.362174,0.62072
min,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,3.0
25%,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,4.0
50%,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,4.0
75%,-1.0,-1.0,0.0,0.0,0.0,0.0,0.0,-1.0,4.0
max,503.0,530.0,1.0,307.0,1.0,1.0,1.0,1.0,8.0


In [10]:
data.shape

(17498, 26)

In [11]:
# take the domain name columns from the dataframe for associations.
data_test = data[["subdomain1","subdomain2","subdomain3","subdomain4","subdomain5","subdomain6","assess"]].copy()

In [12]:
data_test.head()

Unnamed: 0,subdomain1,subdomain2,subdomain3,subdomain4,subdomain5,subdomain6,assess
0,mgh,abcs,,,,,KEEP
1,abel,,,,,,KEEP
2,my,about,,,,,CHECK
3,dce,ac-web,,,,,KEEP
4,academicresourcecenter,,,,,,KEEP


In [13]:
# One-hot encode the domain names for categorical analysis
domain_sets = pd.get_dummies(data_test)

In [14]:
domain_sets.shape

(17498, 17585)
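
To make the jump in column count concrete: `pd.get_dummies` turns every distinct (column, value) pair into its own indicator column named `column_value`, and by default a missing value sets no indicator. The pure-Python sketch below mimics that naming on three rows from the `data_test` preview. The helper `one_hot` is hypothetical and only illustrates the idea; it is not how pandas implements it, and the column order differs from what pandas would produce.

```python
def one_hot(rows, columns):
    """Return (names, encoded): one boolean indicator per (column, value) pair."""

    def items(row):
        # Skip missing values so they produce no indicator column.
        return {f"{col}_{row[col]}" for col in columns if row.get(col) is not None}

    names = sorted(set().union(*(items(row) for row in rows)))
    encoded = [{name: name in items(row) for name in names} for row in rows]
    return names, encoded


rows = [
    {"subdomain1": "mgh", "subdomain2": "abcs", "assess": "KEEP"},
    {"subdomain1": "abel", "subdomain2": None, "assess": "KEEP"},
    {"subdomain1": "my", "subdomain2": "about", "assess": "CHECK"},
]
names, encoded = one_hot(rows, ["subdomain1", "subdomain2", "assess"])
print(names)
print(encoded[1])
```

Three input columns already become seven indicator columns here, which is why thousands of distinct subdomain values produce such a wide frame.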

## Look at associations

In [15]:
apriori(domain_sets, min_support=0.00015, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.000171,(subdomain1_adsabs)
1,0.001829,(subdomain1_bidmc)
2,0.000457,(subdomain1_bih)
3,0.015716,(subdomain1_bwh)
4,0.001772,(subdomain1_cadm)
...,...,...
529,0.000171,"(subdomain3_campusservices, assess_REMOVE, sub..."
530,0.000343,"(assess_KEEP, subdomain3_com, subdomain2_ezp-p..."
531,0.000171,"(subdomain4_stage, subdomain3_ats, subdomain2_..."
532,0.000171,"(subdomain4_prod, subdomain2_cloud, subdomain1..."


In [16]:
# See how frequently "bidmc" is a domain
support_to_count(.001829)

32

In [17]:
# Look at the occurrences of the subdomain bidmc
sif.find_from_sd("bidmc", data)

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,arftopaz.bidmc.harvard.edu,200.0,IIS7,"['arftopaz', 'bidmc', 'harvard', 'edu']",harvard.edu,arftopaz.bidmc,bidmc,arftopaz,,,...,1,0,0,http://arftopaz.bidmc.harvard.edu/,1,KEEP,VALID: http://arftopaz.bidmc.harvard.edu: 200,arftopaz,4,arfto
1,www.bidmc.harvard.edu,302.0,Beth Israel Deaconess Medical Center | BIDMC o...,"['www', 'bidmc', 'harvard', 'edu']",harvard.edu,www.bidmc,bidmc,www,,,...,1,0,0,https://www.bidmc.org/,0,CHECK,CHECK: requested http://www.bidmc.harvard.edu ...,www,4,www
2,127num2.bidmc.harvard.edu,-1.0,,"['127num2', 'bidmc', 'harvard', 'edu']",harvard.edu,127num2.bidmc,bidmc,127num2,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://127num2.bidmc.harvard.edu : HTTP...,127num2,4,127nu
3,adams2.bidmc.harvard.edu,-1.0,,"['adams2', 'bidmc', 'harvard', 'edu']",harvard.edu,adams2.bidmc,bidmc,adams2,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://adams2.bidmc.harvard.edu : HTTPC...,adams2,4,adams
4,careweb.bidmc.harvard.edu,-1.0,,"['careweb', 'bidmc', 'harvard', 'edu']",harvard.edu,careweb.bidmc,bidmc,careweb,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://careweb.bidmc.harvard.edu : HTTP...,careweb,4,carew
5,cirrus.bidmc.harvard.edu,-1.0,,"['cirrus', 'bidmc', 'harvard', 'edu']",harvard.edu,cirrus.bidmc,bidmc,cirrus,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cirrus.bidmc.harvard.edu : HTTPC...,cirrus,4,cirru
6,comnprt1.bidmc.harvard.edu,-1.0,,"['comnprt1', 'bidmc', 'harvard', 'edu']",harvard.edu,comnprt1.bidmc,bidmc,comnprt1,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://comnprt1.bidmc.harvard.edu : HTT...,comnprt1,4,comnp
7,demo-nu1c4rw5py.bidmc.harvard.edu,-1.0,,"['demo-nu1c4rw5py', 'bidmc', 'harvard', 'edu']",harvard.edu,demo-nu1c4rw5py.bidmc,bidmc,demo-nu1c4rw5py,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://demo-nu1c4rw5py.bidmc.harvard.ed...,demo-nu1c4rw5py,4,demo-
8,enterprise.bidmc.harvard.edu,,,"['enterprise', 'bidmc', 'harvard', 'edu']",harvard.edu,enterprise.bidmc,bidmc,enterprise,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://enterprise.bidmc.harvard.edu : H...,enterprise,4,enter
9,gmeimet.bidmc.harvard.edu,,,"['gmeimet', 'bidmc', 'harvard', 'edu']",harvard.edu,gmeimet.bidmc,bidmc,gmeimet,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://gmeimet.bidmc.harvard.edu : HTTP...,gmeimet,4,gmeim


In [18]:
support_to_count(.0007)

12

In [19]:
# df = domain_sets
# Look at domains with a higher minimum support value.
frequent_domains = apriori(domain_sets, min_support=0.0007, use_colnames=True)

# Add a column to the returned dataframe to count the length of the itemset.
frequent_domains['length'] = frequent_domains['itemsets'].apply(len)

frequent_domains

Unnamed: 0,support,itemsets,length
0,0.001829,(subdomain1_bidmc),1
1,0.015716,(subdomain1_bwh),1
2,0.001772,(subdomain1_cadm),1
3,0.052977,(subdomain1_cfa),1
4,0.003829,(subdomain1_chem),1
...,...,...,...
153,0.002572,"(subdomain2_client, subdomain1_law, assess_REM...",3
154,0.008344,"(assess_REMOVE, subdomain2_nmr, subdomain1_mgh)",3
155,0.002743,"(subdomain2_client, subdomain1_student, assess...",3
156,0.001029,"(assess_KEEP, subdomain3_com, subdomain2_ezp-p...",3


In [20]:
# Look at the associations
frequent_domains[frequent_domains['length'] > 2][0:30]

Unnamed: 0,support,itemsets,length
142,0.001429,"(assess_REMOVE, subdomain2_ad, subdomain1_fas)",3
143,0.015716,"(subdomain2_client, assess_REMOVE, subdomain1_...",3
144,0.001886,"(assess_REMOVE, subdomain1_fas, subdomain2_net...",3
145,0.002115,"(assess_REMOVE, subdomain2_rc, subdomain1_fas)",3
146,0.000857,"(assess_REMOVE, subdomain2_roam, subdomain1_fas)",3
147,0.006344,"(assess_REMOVE, subdomain2_unix, subdomain1_fas)",3
148,0.003429,"(assess_REMOVE, subdomain2_webroots, subdomain...",3
149,0.162304,"(subdomain2_wrls-client, assess_REMOVE, subdom...",3
150,0.001029,"(subdomain3_com, subdomain2_ezp-prod1, subdoma...",3
151,0.002115,"(assess_KEEP, subdomain1_hul, subdomain2_ezp-p...",3


## Association Rules for individual subdomains

### Confidence 
Confidence is the likelihood of the consequent (the following item) given the antecedent (the preceding item). For example, if the confidence that I buy beer given that I buy pizza is 0.8, then I buy beer 80% of the time I buy pizza. However, the reverse is not necessarily true: if I first buy beer, the confidence that I will buy pizza might be only 0.4.
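
The asymmetry is easy to check by hand. The sketch below uses made-up baskets (pizza in 5, beer in 10, both in 4) and a hypothetical helper `confidence`; neither comes from the notebook's data.

```python
def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    return len(with_both) / len(with_antecedent)


# Made-up baskets: 4 with both items, 1 with pizza only, 6 with beer only.
baskets = [{"pizza", "beer"}] * 4 + [{"pizza"}] * 1 + [{"beer"}] * 6

pizza_to_beer = confidence(baskets, "pizza", "beer")  # 4 / 5
beer_to_pizza = confidence(baskets, "beer", "pizza")  # 4 / 10
print(pizza_to_beer, beer_to_pizza)
```

The same joint count (4) is divided by a different antecedent count in each direction, which is why the two confidences differ.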

In [21]:
# Define the rules for finding associations: keep sets with a confidence of at least 0.9
rules = association_rules(frequent_domains, metric = "confidence", min_threshold = 0.9)

In [22]:
# Add a column to the rules dataframe that counts the occurrences of each antecedent in the data.
rules["attecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [23]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
0,(subdomain1_bidmc),(assess_REMOVE),0.001829,0.956795,0.001657,0.906250,0.947173,-0.000092,0.460853,32
1,(subdomain1_bwh),(assess_REMOVE),0.015716,0.956795,0.014687,0.934545,0.976746,-0.000350,0.660075,275
2,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.000000,1.045156,0.000077,inf,31
3,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.991370,1.036136,0.001832,5.006372,927
4,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.000000,1.045156,0.000165,inf,67
...,...,...,...,...,...,...,...,...,...,...
88,"(subdomain3_com, subdomain2_ezp-prod1, subdoma...",(assess_KEEP),0.001029,0.030975,0.001029,1.000000,32.284133,0.000997,inf,18
89,"(assess_KEEP, subdomain3_com)","(subdomain1_hul, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18
90,"(subdomain3_com, subdomain2_ezp-prod1)","(assess_KEEP, subdomain1_hul)",0.001029,0.002172,0.001029,1.000000,460.473684,0.001026,inf,18
91,"(subdomain3_com, subdomain1_hul)","(assess_KEEP, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18


In [24]:
find_ants("subdomain1_bidmc")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
0,(subdomain1_bidmc),(assess_REMOVE),0.001829,0.956795,0.001657,0.90625,0.947173,-9.2e-05,0.460853,32


In [25]:
find_fzset(["subdomain1_chem"])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
4,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.0,1.045156,0.000165,inf,67


In [26]:
data[data["subdomain1"] == "cfa"]["name"]

85        chandra.cfa.harvard.edu
309       library.cfa.harvard.edu
392      pinpoint.cfa.harvard.edu
485         usvoa.cfa.harvard.edu
549           xrs.cfa.harvard.edu
                   ...           
17485      zinc92.cfa.harvard.edu
17486      zinc93.cfa.harvard.edu
17487         zip.cfa.harvard.edu
17493        zoom.cfa.harvard.edu
17494       zop-v.cfa.harvard.edu
Name: name, Length: 927, dtype: object

In [27]:
sif.find_from_sd("bidmc",data)


Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,arftopaz.bidmc.harvard.edu,200.0,IIS7,"['arftopaz', 'bidmc', 'harvard', 'edu']",harvard.edu,arftopaz.bidmc,bidmc,arftopaz,,,...,1,0,0,http://arftopaz.bidmc.harvard.edu/,1,KEEP,VALID: http://arftopaz.bidmc.harvard.edu: 200,arftopaz,4,arfto
1,www.bidmc.harvard.edu,302.0,Beth Israel Deaconess Medical Center | BIDMC o...,"['www', 'bidmc', 'harvard', 'edu']",harvard.edu,www.bidmc,bidmc,www,,,...,1,0,0,https://www.bidmc.org/,0,CHECK,CHECK: requested http://www.bidmc.harvard.edu ...,www,4,www
2,127num2.bidmc.harvard.edu,-1.0,,"['127num2', 'bidmc', 'harvard', 'edu']",harvard.edu,127num2.bidmc,bidmc,127num2,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://127num2.bidmc.harvard.edu : HTTP...,127num2,4,127nu
3,adams2.bidmc.harvard.edu,-1.0,,"['adams2', 'bidmc', 'harvard', 'edu']",harvard.edu,adams2.bidmc,bidmc,adams2,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://adams2.bidmc.harvard.edu : HTTPC...,adams2,4,adams
4,careweb.bidmc.harvard.edu,-1.0,,"['careweb', 'bidmc', 'harvard', 'edu']",harvard.edu,careweb.bidmc,bidmc,careweb,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://careweb.bidmc.harvard.edu : HTTP...,careweb,4,carew
5,cirrus.bidmc.harvard.edu,-1.0,,"['cirrus', 'bidmc', 'harvard', 'edu']",harvard.edu,cirrus.bidmc,bidmc,cirrus,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cirrus.bidmc.harvard.edu : HTTPC...,cirrus,4,cirru
6,comnprt1.bidmc.harvard.edu,-1.0,,"['comnprt1', 'bidmc', 'harvard', 'edu']",harvard.edu,comnprt1.bidmc,bidmc,comnprt1,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://comnprt1.bidmc.harvard.edu : HTT...,comnprt1,4,comnp
7,demo-nu1c4rw5py.bidmc.harvard.edu,-1.0,,"['demo-nu1c4rw5py', 'bidmc', 'harvard', 'edu']",harvard.edu,demo-nu1c4rw5py.bidmc,bidmc,demo-nu1c4rw5py,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://demo-nu1c4rw5py.bidmc.harvard.ed...,demo-nu1c4rw5py,4,demo-
8,enterprise.bidmc.harvard.edu,,,"['enterprise', 'bidmc', 'harvard', 'edu']",harvard.edu,enterprise.bidmc,bidmc,enterprise,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://enterprise.bidmc.harvard.edu : H...,enterprise,4,enter
9,gmeimet.bidmc.harvard.edu,,,"['gmeimet', 'bidmc', 'harvard', 'edu']",harvard.edu,gmeimet.bidmc,bidmc,gmeimet,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://gmeimet.bidmc.harvard.edu : HTTP...,gmeimet,4,gmeim


In [28]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
0,(subdomain1_bidmc),(assess_REMOVE),0.001829,0.956795,0.001657,0.906250,0.947173,-0.000092,0.460853,32
1,(subdomain1_bwh),(assess_REMOVE),0.015716,0.956795,0.014687,0.934545,0.976746,-0.000350,0.660075,275
2,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.000000,1.045156,0.000077,inf,31
3,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.991370,1.036136,0.001832,5.006372,927
4,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.000000,1.045156,0.000165,inf,67
...,...,...,...,...,...,...,...,...,...,...
88,"(subdomain3_com, subdomain2_ezp-prod1, subdoma...",(assess_KEEP),0.001029,0.030975,0.001029,1.000000,32.284133,0.000997,inf,18
89,"(assess_KEEP, subdomain3_com)","(subdomain1_hul, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18
90,"(subdomain3_com, subdomain2_ezp-prod1)","(assess_KEEP, subdomain1_hul)",0.001029,0.002172,0.001029,1.000000,460.473684,0.001026,inf,18
91,"(subdomain3_com, subdomain1_hul)","(assess_KEEP, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18


In [29]:
# Look at the rows where sets resulted in REMOVE
rm = frozenset({'assess_REMOVE'})
rules[rules["consequents"]==rm]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
0,(subdomain1_bidmc),(assess_REMOVE),0.001829,0.956795,0.001657,0.90625,0.947173,-9.2e-05,0.460853,32
1,(subdomain1_bwh),(assess_REMOVE),0.015716,0.956795,0.014687,0.934545,0.976746,-0.00035,0.660075,275
2,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.0,1.045156,7.7e-05,inf,31
3,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.05252,0.99137,1.036136,0.001832,5.006372,927
4,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.0,1.045156,0.000165,inf,67
5,(subdomain1_dfci),(assess_REMOVE),0.076123,0.956795,0.075437,0.990991,1.03574,0.002603,4.795748,1331
12,(subdomain1_fas),(assess_REMOVE),0.264316,0.956795,0.257858,0.975568,1.01962,0.004962,1.768344,4625
13,(subdomain1_flybase),(assess_REMOVE),0.001429,0.956795,0.001429,1.0,1.045156,6.2e-05,inf,25
14,(subdomain1_gslb),(assess_REMOVE),0.001543,0.956795,0.001543,1.0,1.045156,6.7e-05,inf,27
15,(subdomain1_hcl),(assess_REMOVE),0.0028,0.956795,0.002743,0.979592,1.023826,6.4e-05,2.117042,49


In [30]:
# Look at the "dfci-" matches that returned no response
find = sif.find_from_sd("dfci-", data)
find[find["success"]<0]

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,dfci-5x889ewbj9.dfci.harvard.edu,-1.0,,"['dfci-5x889ewbj9', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-5x889ewbj9.dfci,dfci,dfci-5x889ewbj9,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-5x889ewbj9.dfci.harvard.edu...,dfci-5x889ewbj9,4,dfci-
1,dfci-6fa3f4f252.dfci.harvard.edu,-1.0,,"['dfci-6fa3f4f252', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-6fa3f4f252.dfci,dfci,dfci-6fa3f4f252,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-6fa3f4f252.dfci.harvard.edu...,dfci-6fa3f4f252,4,dfci-
2,dfci-brusic.dfci.harvard.edu,-1.0,,"['dfci-brusic', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-brusic.dfci,dfci,dfci-brusic,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-brusic.dfci.harvard.edu : H...,dfci-brusic,4,dfci-
3,dfci-dir.masco.harvard.edu,-1.0,,"['dfci-dir', 'masco', 'harvard', 'edu']",harvard.edu,dfci-dir.masco,masco,dfci-dir,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-dir.masco.harvard.edu : HTT...,dfci-dir,4,dfci-
4,dfci-fw-ex-failover.dfci.harvard.edu,-1.0,,"['dfci-fw-ex-failover', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-fw-ex-failover.dfci,dfci,dfci-fw-ex-failover,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-fw-ex-failover.dfci.harvard...,dfci-fw-ex-failover,4,dfci-
5,dfci-gw-v650.dfci.harvard.edu,-1.0,,"['dfci-gw-v650', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-gw-v650.dfci,dfci,dfci-gw-v650,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-gw-v650.dfci.harvard.edu : ...,dfci-gw-v650,4,dfci-
6,dfci-gw-v679.dfci.harvard.edu,-1.0,,"['dfci-gw-v679', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-gw-v679.dfci,dfci,dfci-gw-v679,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-gw-v679.dfci.harvard.edu : ...,dfci-gw-v679,4,dfci-
7,dfci-r-korsoff.dfci.harvard.edu,-1.0,,"['dfci-r-korsoff', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-r-korsoff.dfci,dfci,dfci-r-korsoff,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-r-korsoff.dfci.harvard.edu ...,dfci-r-korsoff,4,dfci-
8,dfci-r-w00sm756.dfci.harvard.edu,-1.0,,"['dfci-r-w00sm756', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-r-w00sm756.dfci,dfci,dfci-r-w00sm756,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-r-w00sm756.dfci.harvard.edu...,dfci-r-w00sm756,4,dfci-
9,dfci-rj812heinrich.dfci.harvard.edu,-1.0,,"['dfci-rj812heinrich', 'dfci', 'harvard', 'edu']",harvard.edu,dfci-rj812heinrich.dfci,dfci,dfci-rj812heinrich,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dfci-rj812heinrich.dfci.harvard....,dfci-rj812heinrich,4,dfci-


## Lift

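Lift compares how often item A and item B occur together with how often they would be expected to if they were independent. A lift of 1 indicates there is no association between the two items; values above 1 mean they co-occur more often than expected, and values below 1 mean less often.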
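
Lift can be recovered from columns already present in the rules table, since it equals the rule's confidence divided by the consequent's support. A small sketch using the `subdomain1_bidmc` and `subdomain1_cfa` rows from the confidence-based rules; the helper name `lift_from_confidence` is an assumption for illustration.

```python
def lift_from_confidence(confidence, consequent_support):
    """Lift = P(consequent | antecedent) / P(consequent)."""
    return confidence / consequent_support


# Confidence and consequent support copied from the rules output
# for (subdomain1_bidmc) -> (assess_REMOVE) and (subdomain1_cfa) -> (assess_REMOVE).
bidmc_lift = lift_from_confidence(0.906250, 0.956795)
cfa_lift = lift_from_confidence(0.991370, 0.956795)
print(round(bidmc_lift, 4), round(cfa_lift, 4))
```

Because almost 96% of all rows are already assessed as REMOVE, even a rule with 90% confidence can have a lift below 1, as the `bidmc` row shows.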

In [31]:
# Look at sets with a lift of at least 1 (above 1 means there is some positive association)
rules = association_rules(frequent_domains, metric = "lift", min_threshold = 1)

In [32]:
rules["attecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [33]:
rm = frozenset({'assess_REMOVE'})
rules[rules["consequents"]==rm]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
3,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.0,1.045156,7.7e-05,inf,31
5,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.05252,0.99137,1.036136,0.001832,5.006372,927
6,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.0,1.045156,0.000165,inf,67
9,(subdomain1_dfci),(assess_REMOVE),0.076123,0.956795,0.075437,0.990991,1.03574,0.002603,4.795748,1331
27,(subdomain1_fas),(assess_REMOVE),0.264316,0.956795,0.257858,0.975568,1.01962,0.004962,1.768344,4625
29,(subdomain1_flybase),(assess_REMOVE),0.001429,0.956795,0.001429,1.0,1.045156,6.2e-05,inf,25
33,(subdomain1_gslb),(assess_REMOVE),0.001543,0.956795,0.001543,1.0,1.045156,6.7e-05,inf,27
34,(subdomain1_hcl),(assess_REMOVE),0.0028,0.956795,0.002743,0.979592,1.023826,6.4e-05,2.117042,49
36,(subdomain1_hcs),(assess_REMOVE),0.001372,0.956795,0.001372,1.0,1.045156,5.9e-05,inf,23
43,(subdomain1_huh),(assess_REMOVE),0.001886,0.956795,0.001829,0.969697,1.013485,2.4e-05,1.425763,33


In [34]:
sif.find_from_sd("cadm", data)

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,apollo10.cadm.harvard.edu,-1.0,,"['apollo10', 'cadm', 'harvard', 'edu']",harvard.edu,apollo10.cadm,cadm,apollo10,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://apollo10.cadm.harvard.edu : HTTP...,apollo10,4,apoll
1,apps.itis-wcmint.cadm.harvard.edu,-1.0,,"['apps', 'itis-wcmint', 'cadm', 'harvard', 'edu']",harvard.edu,apps.itis-wcmint.cadm,cadm,itis-wcmint,apps,,...,0,0,0,-1,-1,REMOVE,ERROR: http://apps.itis-wcmint.cadm.harvard.ed...,apps,5,apps
2,boothill.cadm.harvard.edu,-1.0,,"['boothill', 'cadm', 'harvard', 'edu']",harvard.edu,boothill.cadm,cadm,boothill,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://boothill.cadm.harvard.edu : HTTP...,boothill,4,booth
3,brockman.cadm.harvard.edu,-1.0,,"['brockman', 'cadm', 'harvard', 'edu']",harvard.edu,brockman.cadm,cadm,brockman,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://brockman.cadm.harvard.edu : HTTP...,brockman,4,brock
4,caadsftp-lbx-tst.cadm.harvard.edu,-1.0,,"['caadsftp-lbx-tst', 'cadm', 'harvard', 'edu']",harvard.edu,caadsftp-lbx-tst.cadm,cadm,caadsftp-lbx-tst,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://caadsftp-lbx-tst.cadm.harvard.ed...,caadsftp-lbx-tst,4,caadsftp-
5,camail2-dr.cadm.harvard.edu,-1.0,,"['camail2-dr', 'cadm', 'harvard', 'edu']",harvard.edu,camail2-dr.cadm,cadm,camail2-dr,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://camail2-dr.cadm.harvard.edu : HT...,camail2-dr,4,camail2-
6,cletus.cadm.harvard.edu,-1.0,,"['cletus', 'cadm', 'harvard', 'edu']",harvard.edu,cletus.cadm,cadm,cletus,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cletus.cadm.harvard.edu : HTTPCo...,cletus,4,cletu
7,dev2.wds-appdev.cadm.harvard.edu,-1.0,,"['dev2', 'wds-appdev', 'cadm', 'harvard', 'edu']",harvard.edu,dev2.wds-appdev.cadm,cadm,wds-appdev,dev2,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dev2.wds-appdev.cadm.harvard.edu...,dev2,5,dev2
8,dr-hrmsprd1.cadm.harvard.edu,,,"['dr-hrmsprd1', 'cadm', 'harvard', 'edu']",harvard.edu,dr-hrmsprd1.cadm,cadm,dr-hrmsprd1,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://dr-hrmsprd1.cadm.harvard.edu : H...,dr-hrmsprd1,4,dr-
9,engyro-prod.cadm.harvard.edu,,,"['engyro-prod', 'cadm', 'harvard', 'edu']",harvard.edu,engyro-prod.cadm,cadm,engyro-prod,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://engyro-prod.cadm.harvard.edu : H...,engyro-prod,4,engyro-


## Lift and Confidence

Filter the association rules on both lift and confidence.

In [35]:
rules[(rules['lift'] >= 1) & (rules['confidence'] >=0.9)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
3,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.000000,1.045156,0.000077,inf,31
5,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.991370,1.036136,0.001832,5.006372,927
6,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.000000,1.045156,0.000165,inf,67
9,(subdomain1_dfci),(assess_REMOVE),0.076123,0.956795,0.075437,0.990991,1.035740,0.002603,4.795748,1331
15,(subdomain2_netmgt),(subdomain1_fas),0.001886,0.264316,0.001886,1.000000,3.783351,0.001387,inf,33
...,...,...,...,...,...,...,...,...,...,...
201,"(subdomain3_com, subdomain2_ezp-prod1, subdoma...",(assess_KEEP),0.001029,0.030975,0.001029,1.000000,32.284133,0.000997,inf,18
202,"(assess_KEEP, subdomain3_com)","(subdomain1_hul, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18
205,"(subdomain3_com, subdomain2_ezp-prod1)","(assess_KEEP, subdomain1_hul)",0.001029,0.002172,0.001029,1.000000,460.473684,0.001026,inf,18
206,"(subdomain3_com, subdomain1_hul)","(assess_KEEP, subdomain2_ezp-prod1)",0.001029,0.002115,0.001029,1.000000,472.918919,0.001027,inf,18


In [36]:
rules[(rules['lift'] >= 1) & (rules['confidence'] >=0.9) & (rules["consequents"]==rm)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
3,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.0,1.045156,7.7e-05,inf,31
5,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.05252,0.99137,1.036136,0.001832,5.006372,927
6,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.0,1.045156,0.000165,inf,67
9,(subdomain1_dfci),(assess_REMOVE),0.076123,0.956795,0.075437,0.990991,1.03574,0.002603,4.795748,1331
27,(subdomain1_fas),(assess_REMOVE),0.264316,0.956795,0.257858,0.975568,1.01962,0.004962,1.768344,4625
29,(subdomain1_flybase),(assess_REMOVE),0.001429,0.956795,0.001429,1.0,1.045156,6.2e-05,inf,25
33,(subdomain1_gslb),(assess_REMOVE),0.001543,0.956795,0.001543,1.0,1.045156,6.7e-05,inf,27
34,(subdomain1_hcl),(assess_REMOVE),0.0028,0.956795,0.002743,0.979592,1.023826,6.4e-05,2.117042,49
36,(subdomain1_hcs),(assess_REMOVE),0.001372,0.956795,0.001372,1.0,1.045156,5.9e-05,inf,23
43,(subdomain1_huh),(assess_REMOVE),0.001886,0.956795,0.001829,0.969697,1.013485,2.4e-05,1.425763,33


In [37]:
find_fzset(['subdomain1_cfa'])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
5,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.05252,0.99137,1.036136,0.001832,5.006372,927


In [38]:
sif.find_from_sd("cfa", data)

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,chandra.cfa.harvard.edu,200.0,,"['chandra', 'cfa', 'harvard', 'edu']",harvard.edu,chandra.cfa,cfa,chandra,,,...,1,0,0,https://chandra.cfa.harvard.edu/,1,KEEP,HTTP Check: requested http://chandra.cfa.harva...,chandra,4,chand
1,library.cfa.harvard.edu,200.0,John G. Wolbach Library,"['library', 'cfa', 'harvard', 'edu']",harvard.edu,library.cfa,cfa,library,,,...,1,0,0,https://library.cfa.harvard.edu/,1,KEEP,HTTP Check: requested http://library.cfa.harva...,library,4,libra
2,pinpoint.cfa.harvard.edu,200.0,PinpointWCS from the Chandra X-ray Center,"['pinpoint', 'cfa', 'harvard', 'edu']",harvard.edu,pinpoint.cfa,cfa,pinpoint,,,...,1,0,0,http://pinpoint.cfa.harvard.edu/,1,KEEP,VALID: http://pinpoint.cfa.harvard.edu: 200,pinpoint,4,pinpo
3,usvoa.cfa.harvard.edu,200.0,Site Maintenance,"['usvoa', 'cfa', 'harvard', 'edu']",harvard.edu,usvoa.cfa,cfa,usvoa,,,...,1,0,0,http://usvoa.cfa.harvard.edu/USVOA/maintenance...,1,KEEP,VALID: http://usvoa.cfa.harvard.edu: 200,usvoa,4,usvoa
4,xrs.cfa.harvard.edu,200.0,Lynx,"['xrs', 'cfa', 'harvard', 'edu']",harvard.edu,xrs.cfa,cfa,xrs,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://xrs.cfa.harvard.edu : HTTPSConne...,xrs,4,xrs
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
924,zip.cfa.harvard.edu,,,"['zip', 'cfa', 'harvard', 'edu']",harvard.edu,zip.cfa,cfa,zip,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://zip.cfa.harvard.edu : HTTPConnec...,zip,4,zip
925,zoom.cfa.harvard.edu,,,"['zoom', 'cfa', 'harvard', 'edu']",harvard.edu,zoom.cfa,cfa,zoom,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://zoom.cfa.harvard.edu : HTTPConne...,zoom,4,zoom
926,zop-v.cfa.harvard.edu,,,"['zop-v', 'cfa', 'harvard', 'edu']",harvard.edu,zop-v.cfa,cfa,zop-v,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://zop-v.cfa.harvard.edu : HTTPConn...,zop-v,4,zop-
927,cfa.lib.harvard.edu,302.0,,"['cfa', 'lib', 'harvard', 'edu']",harvard.edu,cfa.lib,lib,cfa,,,...,1,0,0,https://dataverse.harvard.edu/dataverse/cfa,0,CHECK,CHECK: requested http://cfa.lib.harvard.edu an...,cfa,4,cfa


## Association Rules for domains AND truncated bottom-domain

In [39]:
data_test_dttd = data[["subdomain1","subdomain2","subdomain3","subdomain4","subdomain5","subdomain6","assess","bottom_dom_trunc"]].copy()

In [40]:
domain_sets_dttd = pd.get_dummies(data_test_dttd)

### Look at association rules

In [41]:
support_to_count(.0002)

3

In [42]:
frequent_domains_dttd = apriori(domain_sets_dttd, min_support=0.00015, use_colnames=True)
frequent_domains_dttd['length'] = frequent_domains_dttd['itemsets'].apply(len)

frequent_domains_dttd

Unnamed: 0,support,itemsets,length
0,0.000171,(subdomain1_adsabs),1
1,0.001829,(subdomain1_bidmc),1
2,0.000457,(subdomain1_bih),1
3,0.015716,(subdomain1_bwh),1
4,0.001772,(subdomain1_cadm),1
...,...,...,...
2408,0.000171,"(assess_KEEP, subdomain5_journals, subdomain2_...",4
2409,0.000171,"(subdomain4_stage, subdomain3_ats, subdomain2_...",5
2410,0.000171,"(subdomain4_prod, subdomain2_cloud, subdomain1...",5
2411,0.000343,"(assess_KEEP, subdomain1_hul, subdomain2_ezp-p...",5


In [43]:
frequent_domains_dttd[frequent_domains_dttd['length'] > 2].shape

(575, 3)

#### CONFIDENCE

In [44]:
rules = association_rules(frequent_domains_dttd, metric = "confidence", min_threshold = 0.99)

In [45]:
rules["antecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [46]:
rm = frozenset({'assess_REMOVE'})
exclude = rules[rules["consequents"]==rm]
exclude

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent occurences
0,(subdomain1_bih),(assess_REMOVE),0.000457,0.956795,0.000457,1.00000,1.045156,0.000020,inf,8
8,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.00000,1.045156,0.000077,inf,31
9,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.99137,1.036136,0.001832,5.006372,927
26,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.00000,1.045156,0.000165,inf,67
32,(subdomain1_cyber),(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
...,...,...,...,...,...,...,...,...,...,...
2240,"(bottom_dom_trunc_nmr-, subdomain2_nmr, subdom...",(assess_REMOVE),0.000286,0.956795,0.000286,1.00000,1.045156,0.000012,inf,5
2244,"(bottom_dom_trunc_syngo, subdomain2_nmr, subdo...",(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
2247,"(subdomain2_client, subdomain1_student, bottom...",(assess_REMOVE),0.002686,0.956795,0.002686,1.00000,1.045156,0.000116,inf,47
2255,"(subdomain3_campusservices, subdomain2_cloud, ...",(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
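
The metric columns in this table all follow from the three support values. As a sanity check, the sketch below recomputes them for the `subdomain1_cfa` rule using the standard definitions documented for `association_rules`. `rule_metrics` is a hypothetical helper, and the inputs are the rounded values displayed above, so the results agree with the table only to a few decimal places.

```python
import math


def rule_metrics(antecedent_support, consequent_support, support):
    """Recompute association-rule metrics from the three support values."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    leverage = support - antecedent_support * consequent_support
    # Conviction is infinite when the rule never fails (confidence == 1).
    if confidence >= 1:
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# (subdomain1_cfa) -> (assess_REMOVE), values copied from the table above.
cfa = rule_metrics(0.052977, 0.956795, 0.052520)
print(cfa)

# (subdomain1_bih) -> (assess_REMOVE): every matching row is REMOVE.
bih = rule_metrics(0.000457, 0.956795, 0.000457)
print(bih)
```

A conviction of `inf` therefore just means the antecedent was never seen with anything other than `assess_REMOVE`.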


In [47]:
sif.search("lenovo", data, ["name"])

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,fstrf-lenovo-01.dfci.harvard.edu,,,"['fstrf-lenovo-01', 'dfci', 'harvard', 'edu']",harvard.edu,fstrf-lenovo-01.dfci,dfci,fstrf-lenovo-01,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://fstrf-lenovo-01.dfci.harvard.edu...,fstrf-lenovo-01,4,fstrf-
1,fstrf-lenovo-03.dfci.harvard.edu,,,"['fstrf-lenovo-03', 'dfci', 'harvard', 'edu']",harvard.edu,fstrf-lenovo-03.dfci,dfci,fstrf-lenovo-03,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://fstrf-lenovo-03.dfci.harvard.edu...,fstrf-lenovo-03,4,fstrf-
2,fstrf-lenovo-81.dfci.harvard.edu,,,"['fstrf-lenovo-81', 'dfci', 'harvard', 'edu']",harvard.edu,fstrf-lenovo-81.dfci,dfci,fstrf-lenovo-81,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://fstrf-lenovo-81.dfci.harvard.edu...,fstrf-lenovo-81,4,fstrf-
3,lenovo-54447439.mgh.harvard.edu,,,"['lenovo-54447439', 'mgh', 'harvard', 'edu']",harvard.edu,lenovo-54447439.mgh,mgh,lenovo-54447439,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://lenovo-54447439.mgh.harvard.edu ...,lenovo-54447439,4,lenovo-
4,lenovo-a3551f0f.dfci.harvard.edu,,,"['lenovo-a3551f0f', 'dfci', 'harvard', 'edu']",harvard.edu,lenovo-a3551f0f.dfci,dfci,lenovo-a3551f0f,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://lenovo-a3551f0f.dfci.harvard.edu...,lenovo-a3551f0f,4,lenovo-
5,lenovo-e4eec8c5.mclean.harvard.edu,,,"['lenovo-e4eec8c5', 'mclean', 'harvard', 'edu']",harvard.edu,lenovo-e4eec8c5.mclean,mclean,lenovo-e4eec8c5,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://lenovo-e4eec8c5.mclean.harvard.e...,lenovo-e4eec8c5,4,lenovo-


In [48]:
summary(exclude)

Description: 
 excludes 40134 URLs from 17498.
 and has 946 exclusion rules.


#### LIFT

In [49]:
rules = association_rules(frequent_domains_dttd, metric="lift", min_threshold=1)

In [50]:
rules["attecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [51]:
rules[rules["consequents"]==rm]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
2,(subdomain1_bih),(assess_REMOVE),0.000457,0.956795,0.000457,1.00000,1.045156,0.000020,inf,8
29,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.00000,1.045156,0.000077,inf,31
35,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.99137,1.036136,0.001832,5.006372,927
70,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.00000,1.045156,0.000165,inf,67
85,(subdomain1_cyber),(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
...,...,...,...,...,...,...,...,...,...,...
5715,"(bottom_dom_trunc_nmr-, subdomain2_nmr, subdom...",(assess_REMOVE),0.000286,0.956795,0.000286,1.00000,1.045156,0.000012,inf,5
5728,"(bottom_dom_trunc_syngo, subdomain2_nmr, subdo...",(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
5740,"(subdomain2_client, subdomain1_student, bottom...",(assess_REMOVE),0.002686,0.956795,0.002686,1.00000,1.045156,0.000116,inf,47
5768,"(subdomain3_campusservices, subdomain2_cloud, ...",(assess_REMOVE),0.000171,0.956795,0.000171,1.00000,1.045156,0.000007,inf,2
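
Almost every rule here shows the same lift, 1.045156. That follows from how common `assess_REMOVE` is: lift is confidence divided by consequent support, and confidence cannot exceed 1, so lift for any rule predicting `assess_REMOVE` is capped at the reciprocal of its support. A quick check, using the consequent support from the table:

```python
consequent_support = 0.956795  # support of assess_REMOVE, from the table above

# Lift = confidence / consequent support, and confidence is at most 1,
# so this is the largest lift any rule with this consequent can reach.
max_lift = 1.0 / consequent_support
print(round(max_lift, 6))
```

With a ceiling this close to 1, a `lift >= 1` threshold does little to separate these rules from one another; confidence and antecedent occurrence counts carry most of the signal.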


#### LIFT AND CONFIDENCE

In [52]:
rules[(rules['lift'] >= 1) & (rules['confidence'] >= 0.98) & (rules["consequents"] == rm) & (rules["attecedent occurences"] > 5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
2,(subdomain1_bih),(assess_REMOVE),0.000457,0.956795,0.000457,1.000000,1.045156,0.000020,inf,8
29,(subdomain1_cadm),(assess_REMOVE),0.001772,0.956795,0.001772,1.000000,1.045156,0.000077,inf,31
35,(subdomain1_cfa),(assess_REMOVE),0.052977,0.956795,0.052520,0.991370,1.036136,0.001832,5.006372,927
70,(subdomain1_chem),(assess_REMOVE),0.003829,0.956795,0.003829,1.000000,1.045156,0.000165,inf,67
99,(subdomain1_dfci),(assess_REMOVE),0.076123,0.956795,0.075437,0.990991,1.035740,0.002603,4.795748,1331
...,...,...,...,...,...,...,...,...,...,...
5264,"(bottom_dom_trunc_core-, subdomain1_fas, subdo...",(assess_REMOVE),0.001886,0.956795,0.001886,1.000000,1.045156,0.000081,inf,33
5306,"(bottom_dom_trunc_net-, subdomain2_roam, subdo...",(assess_REMOVE),0.000857,0.956795,0.000857,1.000000,1.045156,0.000037,inf,15
5334,"(subdomain2_wrls-client, bottom_dom_trunc_wrls...",(assess_REMOVE),0.162304,0.956795,0.162304,1.000000,1.045156,0.007012,inf,2840
5661,"(subdomain2_client, subdomain1_law, bottom_dom...",(assess_REMOVE),0.002572,0.956795,0.002572,1.000000,1.045156,0.000111,inf,45


In [55]:
rules[(rules['lift'] >= 1) & (rules['confidence'] >= 0.99) & (rules["consequents"] == rm)].sort_values("antecedent support", ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
622,(subdomain1_mgh),(assess_REMOVE),0.223111,0.956795,0.221911,0.994621,1.039534,0.008439,8.032004,3904
2119,(bottom_dom_trunc_wrls-),(assess_REMOVE),0.180935,0.956795,0.180935,1.000000,1.045156,0.007817,inf,3166
3045,"(subdomain1_fas, bottom_dom_trunc_wrls-)",(assess_REMOVE),0.162304,0.956795,0.162304,1.000000,1.045156,0.007012,inf,2840
2861,"(subdomain1_fas, subdomain2_wrls-client)",(assess_REMOVE),0.162304,0.956795,0.162304,1.000000,1.045156,0.007012,inf,2840
5334,"(subdomain1_fas, bottom_dom_trunc_wrls-, subdo...",(assess_REMOVE),0.162304,0.956795,0.162304,1.000000,1.045156,0.007012,inf,2840
...,...,...,...,...,...,...,...,...,...,...
2945,"(bottom_dom_trunc_h2-, subdomain1_fas)",(assess_REMOVE),0.000171,0.956795,0.000171,1.000000,1.045156,0.000007,inf,2
2938,"(bottom_dom_trunc_fasit-, subdomain1_fas)",(assess_REMOVE),0.000171,0.956795,0.000171,1.000000,1.045156,0.000007,inf,2
2933,"(subdomain1_fas, bottom_dom_trunc_econw)",(assess_REMOVE),0.000171,0.956795,0.000171,1.000000,1.045156,0.000007,inf,2
1448,(bottom_dom_trunc_event),(assess_REMOVE),0.000171,0.956795,0.000171,1.000000,1.045156,0.000007,inf,2
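
Several of the top rules overlap: any row matched by `(subdomain1_fas, bottom_dom_trunc_wrls-)` is also matched by `(bottom_dom_trunc_wrls-)` alone. If the shorter rule already passes the confidence filter and has the same consequent, the longer one removes no additional URLs. One possible way to thin the list is sketched below; `drop_subsumed` is a hypothetical helper, not something used in this notebook, and it only uses antecedents that are fully visible in the table above.

```python
def drop_subsumed(antecedents):
    """Keep only antecedents that have no proper subset elsewhere in the list."""
    unique = set(antecedents)
    kept = [a for a in unique if not any(b < a for b in unique)]
    # Sort for a stable, readable order: shortest antecedents first.
    return sorted(kept, key=lambda a: (len(a), sorted(a)))


# Antecedents of rules that all predict assess_REMOVE.
antecedents = [
    frozenset({"subdomain1_mgh"}),
    frozenset({"bottom_dom_trunc_wrls-"}),
    frozenset({"subdomain1_fas", "bottom_dom_trunc_wrls-"}),
    frozenset({"subdomain1_fas", "subdomain2_wrls-client"}),
]

kept = drop_subsumed(antecedents)
for antecedent in kept:
    print(sorted(antecedent))
```

The `(subdomain1_fas, subdomain2_wrls-client)` rule survives here only because neither of its items appears on its own in this small list.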


In [57]:
sif.find_from_sd("test-", data)

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,test-176793.mgh.harvard.edu,,,"['test-176793', 'mgh', 'harvard', 'edu']",harvard.edu,test-176793.mgh,mgh,test-176793,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-176793.mgh.harvard.edu : HT...,test-176793,4,test-
1,test-dom03.med.harvard.edu,,,"['test-dom03', 'med', 'harvard', 'edu']",harvard.edu,test-dom03.med,med,test-dom03,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-dom03.med.harvard.edu : HTT...,test-dom03,4,test-
2,test-gfiler.cfa.harvard.edu,,,"['test-gfiler', 'cfa', 'harvard', 'edu']",harvard.edu,test-gfiler.cfa,cfa,test-gfiler,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-gfiler.cfa.harvard.edu : HT...,test-gfiler,4,test-
3,test-ldap-2.dce.harvard.edu,,,"['test-ldap-2', 'dce', 'harvard', 'edu']",harvard.edu,test-ldap-2.dce,dce,test-ldap-2,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-ldap-2.dce.harvard.edu : HT...,test-ldap-2,4,test-
4,test-oi.mgh.harvard.edu,,,"['test-oi', 'mgh', 'harvard', 'edu']",harvard.edu,test-oi.mgh,mgh,test-oi,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-oi.mgh.harvard.edu : HTTPCo...,test-oi,4,test-
5,test-vpn.dce.harvard.edu,,,"['test-vpn', 'dce', 'harvard', 'edu']",harvard.edu,test-vpn.dce,dce,test-vpn,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://test-vpn.dce.harvard.edu : HTTPC...,test-vpn,4,test-


In [58]:
sif.search("test", data, ["name"])

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,as-app-test.hsl.harvard.edu,200.0,IIS Windows Server,"['as-app-test', 'hsl', 'harvard', 'edu']",harvard.edu,as-app-test.hsl,hsl,as-app-test,,,...,1,0,0,http://as-app-test.hsl.harvard.edu/,1,KEEP,VALID: http://as-app-test.hsl.harvard.edu: 200,as-app-test,4,as-
1,nora.test.ats.cloud.huit.harvard.edu,200.0,GSAS Non-Residential Application Portal,"['nora', 'test', 'ats', 'cloud', 'huit', 'harv...",harvard.edu,nora.test.ats.cloud.huit,huit,cloud,ats,test,...,1,0,0,https://nora.test.ats.cloud.huit.harvard.edu:443/,1,KEEP,HTTP Check: requested http://nora.test.ats.clo...,nora,7,nora
2,test.hbsp.harvard.edu,200.0,Harvard Business Publishing Education,"['test', 'hbsp', 'harvard', 'edu']",harvard.edu,test.hbsp,hbsp,test,,,...,1,0,0,https://test.hbsp.harvard.edu/,1,KEEP,HTTP Check: requested http://test.hbsp.harvard...,test,4,test
3,upload-test.hio.harvard.edu,200.0,HarvardKey: Error,"['upload-test', 'hio', 'harvard', 'edu']",harvard.edu,upload-test.hio,hio,upload-test,,,...,0,1,1,https://www.pin1.harvard.edu/cas/login?service...,1,KEEP,HARVARD KEY: http://upload-test.hio.harvard.ed...,upload-test,4,upload-
4,bractontest.law.harvard.edu,302.0,,"['bractontest', 'law', 'harvard', 'edu']",harvard.edu,bractontest.law,law,bractontest,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://bractontest.law.harvard.edu : ('...,bractontest,4,bract
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
104,wjh-1380-test-1.wjh.harvard.edu,,,"['wjh-1380-test-1', 'wjh', 'harvard', 'edu']",harvard.edu,wjh-1380-test-1.wjh,wjh,wjh-1380-test-1,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://wjh-1380-test-1.wjh.harvard.edu ...,wjh-1380-test-1,4,wjh-
105,wjh-test1.client.fas.harvard.edu,,,"['wjh-test1', 'client', 'fas', 'harvard', 'edu']",harvard.edu,wjh-test1.client.fas,fas,client,wjh-test1,,...,0,0,0,-1,-1,REMOVE,ERROR: http://wjh-test1.client.fas.harvard.edu...,wjh-test1,5,wjh-
106,wjh-test2.client.fas.harvard.edu,,,"['wjh-test2', 'client', 'fas', 'harvard', 'edu']",harvard.edu,wjh-test2.client.fas,fas,client,wjh-test2,,...,0,0,0,-1,-1,REMOVE,ERROR: http://wjh-test2.client.fas.harvard.edu...,wjh-test2,5,wjh-
107,wjhcs-test-print.wjh.harvard.edu,,,"['wjhcs-test-print', 'wjh', 'harvard', 'edu']",harvard.edu,wjhcs-test-print.wjh,wjh,wjhcs-test-print,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://wjhcs-test-print.wjh.harvard.edu...,wjhcs-test-print,4,wjhcs-


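## Association for bottom_domain_truncated

This pass uses only `bottom_dom_trunc`, the truncated leading part of each hostname's first label (for example `cmir-` or `w0082`), together with `assess`.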
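
The code that builds `bottom_dom_trunc` is not part of this section. Judging only from the rows displayed in this notebook, the value looks like the first label cut at its first hyphen (hyphen kept), or the first five characters when there is no hyphen. The sketch below encodes that guess; `truncate_label` and its `width` parameter are assumptions and may not match the real derivation.

```python
def truncate_label(label, width=5):
    """Guess at the bottom_dom_trunc rule, inferred from displayed rows only."""
    head, sep, _ = label.partition("-")
    if sep:
        # Keep everything up to and including the first hyphen.
        return head + sep
    # No hyphen: keep at most `width` leading characters.
    return label[:width]


# bottom_domain values copied from rows shown in this notebook.
for label in ["zop-v", "fstrf-lenovo-01", "bractontest", "nora", "cmir-5031-xp"]:
    print(label, "->", truncate_label(label))
```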

In [59]:
data_test_td_trunc = data[["bottom_dom_trunc","assess"]].copy()

In [60]:
domain_sets_td_trunc = pd.get_dummies(data_test_td_trunc)

In [61]:
apriori(domain_sets_td_trunc, min_support=0.002, use_colnames=True)

Unnamed: 0,support,itemsets
0,0.005258,(bottom_dom_trunc_core-)
1,0.002172,(bottom_dom_trunc_csc-)
2,0.034633,(bottom_dom_trunc_dhcp-)
3,0.006801,(bottom_dom_trunc_ksg-)
4,0.043376,(bottom_dom_trunc_meei-)
...,...,...
115,0.002000,"(assess_REMOVE, bottom_dom_trunc_w0082)"
116,0.002172,"(assess_REMOVE, bottom_dom_trunc_w0084)"
117,0.002400,"(bottom_dom_trunc_w0087, assess_REMOVE)"
118,0.002629,"(assess_REMOVE, bottom_dom_trunc_w0117)"


In [62]:
frequent_domains_td_trunc = apriori(domain_sets_td_trunc, min_support=0.002, use_colnames=True)
frequent_domains_td_trunc['length'] = frequent_domains_td_trunc['itemsets'].apply(len)

frequent_domains_td_trunc

Unnamed: 0,support,itemsets,length
0,0.005258,(bottom_dom_trunc_core-),1
1,0.002172,(bottom_dom_trunc_csc-),1
2,0.034633,(bottom_dom_trunc_dhcp-),1
3,0.006801,(bottom_dom_trunc_ksg-),1
4,0.043376,(bottom_dom_trunc_meei-),1
...,...,...,...
115,0.002000,"(assess_REMOVE, bottom_dom_trunc_w0082)",2
116,0.002172,"(assess_REMOVE, bottom_dom_trunc_w0084)",2
117,0.002400,"(bottom_dom_trunc_w0087, assess_REMOVE)",2
118,0.002629,"(assess_REMOVE, bottom_dom_trunc_w0117)",2


In [66]:
frequent_domains_td_trunc[frequent_domains_td_trunc['length'] > 1][0:30]

Unnamed: 0,support,itemsets,length
62,0.005258,"(assess_REMOVE, bottom_dom_trunc_core-)",2
63,0.002172,"(assess_REMOVE, bottom_dom_trunc_csc-)",2
64,0.034118,"(bottom_dom_trunc_dhcp-, assess_REMOVE)",2
65,0.006801,"(assess_REMOVE, bottom_dom_trunc_ksg-)",2
66,0.043376,"(bottom_dom_trunc_meei-, assess_REMOVE)",2
67,0.013316,"(assess_REMOVE, bottom_dom_trunc_net-)",2
68,0.0024,"(bottom_dom_trunc_otp00, assess_REMOVE)",2
69,0.003143,"(assess_REMOVE, bottom_dom_trunc_sao-)",2
70,0.013202,"(assess_REMOVE, bottom_dom_trunc_sfp-)",2
71,0.003372,"(bottom_dom_trunc_sph16-, assess_REMOVE)",2


### CONFIDENCE

In [67]:
rules = association_rules(frequent_domains_td_trunc, metric="confidence", min_threshold=0.9)

In [68]:
rules["attecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [69]:
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
0,(bottom_dom_trunc_core-),(assess_REMOVE),0.005258,0.956795,0.005258,1.0,1.045156,0.000227,inf,92
1,(bottom_dom_trunc_csc-),(assess_REMOVE),0.002172,0.956795,0.002172,1.0,1.045156,9.4e-05,inf,38
2,(bottom_dom_trunc_dhcp-),(assess_REMOVE),0.034633,0.956795,0.034118,0.985149,1.029634,0.000982,2.909132,606
3,(bottom_dom_trunc_ksg-),(assess_REMOVE),0.006801,0.956795,0.006801,1.0,1.045156,0.000294,inf,119
4,(bottom_dom_trunc_meei-),(assess_REMOVE),0.043376,0.956795,0.043376,1.0,1.045156,0.001874,inf,759
5,(bottom_dom_trunc_net-),(assess_REMOVE),0.013316,0.956795,0.013316,1.0,1.045156,0.000575,inf,233
6,(bottom_dom_trunc_otp00),(assess_REMOVE),0.0024,0.956795,0.0024,1.0,1.045156,0.000104,inf,42
7,(bottom_dom_trunc_sao-),(assess_REMOVE),0.003143,0.956795,0.003143,1.0,1.045156,0.000136,inf,55
8,(bottom_dom_trunc_sfp-),(assess_REMOVE),0.013202,0.956795,0.013202,1.0,1.045156,0.00057,inf,231
9,(bottom_dom_trunc_sph16-),(assess_REMOVE),0.003372,0.956795,0.003372,1.0,1.045156,0.000146,inf,59
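
With a single antecedent column, these rules can be cross-checked without `apriori`: the confidence of `prefix -> assess_REMOVE` is simply the share of rows with that prefix that are marked REMOVE. A minimal sketch with plain pandas, on invented rows (the counts in `toy` are illustrative, not the notebook's data):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "bottom_dom_trunc": ["dhcp-", "dhcp-", "dhcp-", "dhcp-", "core-", "core-", "www"],
    "assess": ["REMOVE", "REMOVE", "REMOVE", "KEEP", "REMOVE", "REMOVE", "KEEP"],
})

n = len(toy)
toy["removed"] = toy["assess"].eq("REMOVE")

# Per prefix: how many rows there are, and how many of them are REMOVE.
per_prefix = toy.groupby("bottom_dom_trunc")["removed"].agg(["size", "sum"])
per_prefix = per_prefix.rename(columns={"size": "occurrences", "sum": "removed"})

per_prefix["antecedent support"] = per_prefix["occurrences"] / n
per_prefix["support"] = per_prefix["removed"] / n
per_prefix["confidence"] = per_prefix["removed"] / per_prefix["occurrences"]
print(per_prefix)
```

Run against the real `data`, the same grouping should reproduce the `antecedent support`, `support`, and `confidence` columns above for every prefix that clears `min_support`.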


### LIFT

In [70]:
rules = association_rules(frequent_domains_td_trunc, metric="lift", min_threshold=1)

In [71]:
rules["attecedent occurences"] = rules["antecedent support"].apply(lambda x: support_to_count(x))

In [72]:
rules[rules["consequents"]==rm].sort_values("antecedent support")

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
107,(bottom_dom_trunc_w0082),(assess_REMOVE),0.002,0.956795,0.002,1.0,1.045156,8.6e-05,inf,35
89,(bottom_dom_trunc_w0048),(assess_REMOVE),0.002,0.956795,0.002,1.0,1.045156,8.6e-05,inf,35
69,(bottom_dom_trunc_w0024),(assess_REMOVE),0.002057,0.956795,0.002057,1.0,1.045156,8.9e-05,inf,36
3,(bottom_dom_trunc_csc-),(assess_REMOVE),0.002172,0.956795,0.002172,1.0,1.045156,9.4e-05,inf,38
109,(bottom_dom_trunc_w0084),(assess_REMOVE),0.002172,0.956795,0.002172,1.0,1.045156,9.4e-05,inf,38
104,(bottom_dom_trunc_w0081),(assess_REMOVE),0.002229,0.956795,0.002229,1.0,1.045156,9.6e-05,inf,39
72,(bottom_dom_trunc_w0028),(assess_REMOVE),0.002229,0.956795,0.002229,1.0,1.045156,9.6e-05,inf,39
81,(bottom_dom_trunc_w0042),(assess_REMOVE),0.002286,0.956795,0.002286,1.0,1.045156,9.9e-05,inf,40
29,(bottom_dom_trunc_sph180-),(assess_REMOVE),0.002286,0.956795,0.002286,1.0,1.045156,9.9e-05,inf,40
97,(bottom_dom_trunc_w0057),(assess_REMOVE),0.002343,0.956795,0.002343,1.0,1.045156,0.000101,inf,41


In [73]:
support_to_count(.128419)

2247

### LIFT AND CONFIDENCE

In [74]:
rules[(rules['lift'] >= 1) & (rules['confidence'] >= 0.98) & (rules["consequents"] == rm)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,attecedent occurences
1,(bottom_dom_trunc_core-),(assess_REMOVE),0.005258,0.956795,0.005258,1.0,1.045156,0.000227,inf,92
3,(bottom_dom_trunc_csc-),(assess_REMOVE),0.002172,0.956795,0.002172,1.0,1.045156,9.4e-05,inf,38
4,(bottom_dom_trunc_dhcp-),(assess_REMOVE),0.034633,0.956795,0.034118,0.985149,1.029634,0.000982,2.909132,606
7,(bottom_dom_trunc_ksg-),(assess_REMOVE),0.006801,0.956795,0.006801,1.0,1.045156,0.000294,inf,119
8,(bottom_dom_trunc_meei-),(assess_REMOVE),0.043376,0.956795,0.043376,1.0,1.045156,0.001874,inf,759
11,(bottom_dom_trunc_net-),(assess_REMOVE),0.013316,0.956795,0.013316,1.0,1.045156,0.000575,inf,233
12,(bottom_dom_trunc_otp00),(assess_REMOVE),0.0024,0.956795,0.0024,1.0,1.045156,0.000104,inf,42
15,(bottom_dom_trunc_sao-),(assess_REMOVE),0.003143,0.956795,0.003143,1.0,1.045156,0.000136,inf,55
17,(bottom_dom_trunc_sfp-),(assess_REMOVE),0.013202,0.956795,0.013202,1.0,1.045156,0.00057,inf,231
18,(bottom_dom_trunc_sph16-),(assess_REMOVE),0.003372,0.956795,0.003372,1.0,1.045156,0.000146,inf,59


In [76]:
sif.find_from_sd('cmir-', data)

Unnamed: 0,name,http status code,title,url,domain,predomain,subdomain1,subdomain2,subdomain3,subdomain4,...,public,login,harvard_key,resolved_url,success,assess,note,bottom_domain,domain_count,bottom_dom_trunc
0,cmir-5031-xp.mgh.harvard.edu,-1.0,,"['cmir-5031-xp', 'mgh', 'harvard', 'edu']",harvard.edu,cmir-5031-xp.mgh,mgh,cmir-5031-xp,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cmir-5031-xp.mgh.harvard.edu : H...,cmir-5031-xp,4,cmir-
1,cmir-clbit.mgh.harvard.edu,-1.0,,"['cmir-clbit', 'mgh', 'harvard', 'edu']",harvard.edu,cmir-clbit.mgh,mgh,cmir-clbit,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cmir-clbit.mgh.harvard.edu : HTT...,cmir-clbit,4,cmir-
2,cmir-server.mgh.harvard.edu,-1.0,,"['cmir-server', 'mgh', 'harvard', 'edu']",harvard.edu,cmir-server.mgh,mgh,cmir-server,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cmir-server.mgh.harvard.edu : HT...,cmir-server,4,cmir-
3,cmir-t32.mgh.harvard.edu,-1.0,,"['cmir-t32', 'mgh', 'harvard', 'edu']",harvard.edu,cmir-t32.mgh,mgh,cmir-t32,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cmir-t32.mgh.harvard.edu : HTTPC...,cmir-t32,4,cmir-
4,cmir-xraid.mgh.harvard.edu,-1.0,,"['cmir-xraid', 'mgh', 'harvard', 'edu']",harvard.edu,cmir-xraid.mgh,mgh,cmir-xraid,,,...,0,0,0,-1,-1,REMOVE,ERROR: http://cmir-xraid.mgh.harvard.edu : HTT...,cmir-xraid,4,cmir-
