# Association rule discovery

To complete the exercise we will need the `mlxtend` library, since `scikit-learn` does not provide any tools for frequent itemset mining or association rule discovery.

In [1]:
!pip install mlxtend



Our first step is to download the Wikipedia article on Poznań and to extract the text of its paragraphs (the `<p>` elements).

In [2]:
from bs4 import BeautifulSoup
import requests

respond = requests.get("https://en.wikipedia.org/wiki/Poznań")
soup = BeautifulSoup(respond.text, "lxml")
page = soup.find_all('p')

raw_text = [paragraph.text for paragraph in page]

print(raw_text)

['\n', "Poznań (Polish:\xa0[ˈpɔznaɲ] (listen))[a] is a city on the River Warta in west-central Poland, within the Greater Poland region. The city is an important cultural and business centre, and one of Poland's most populous regions with many regional customs such as Saint John's Fair (Jarmark Świętojański), traditional Saint Martin's croissants and a local dialect. Among its most important heritage sites are the Renaissance Old Town, Town Hall and Gothic Cathedral.\n", "Poznań is the fifth-largest and one of the oldest cities in Poland. As of 2020, the city's population is 532,048, while the Poznań metropolitan area (Metropolia Poznań) comprising Poznań County and several other communities is inhabited by over 1.1\xa0million people.[2] It is one of four historical capitals of medieval Poland and the ancient capital of the Greater Poland region, currently the administrative capital of the province called Greater Poland Voivodeship.\n", "Poznań is a center of trade, sports, education, 

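Next, we will split each paragraph into a list of words. Near-empty paragraphs (strings of two characters or fewer, such as the leading `'\n'`) are dropped. Note that the `len(line) > 2` check counts characters of the raw string, not words.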

In [3]:
text = [ line.split() for line in raw_text if len(line) > 2 ]

for line in text[:10]:
    print(line)

['Poznań', '(Polish:', '[ˈpɔznaɲ]', '(listen))[a]', 'is', 'a', 'city', 'on', 'the', 'River', 'Warta', 'in', 'west-central', 'Poland,', 'within', 'the', 'Greater', 'Poland', 'region.', 'The', 'city', 'is', 'an', 'important', 'cultural', 'and', 'business', 'centre,', 'and', 'one', 'of', "Poland's", 'most', 'populous', 'regions', 'with', 'many', 'regional', 'customs', 'such', 'as', 'Saint', "John's", 'Fair', '(Jarmark', 'Świętojański),', 'traditional', 'Saint', "Martin's", 'croissants', 'and', 'a', 'local', 'dialect.', 'Among', 'its', 'most', 'important', 'heritage', 'sites', 'are', 'the', 'Renaissance', 'Old', 'Town,', 'Town', 'Hall', 'and', 'Gothic', 'Cathedral.']
['Poznań', 'is', 'the', 'fifth-largest', 'and', 'one', 'of', 'the', 'oldest', 'cities', 'in', 'Poland.', 'As', 'of', '2020,', 'the', "city's", 'population', 'is', '532,048,', 'while', 'the', 'Poznań', 'metropolitan', 'area', '(Metropolia', 'Poznań)', 'comprising', 'Poznań', 'County', 'and', 'several', 'other', 'communities', '

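Our text still contains a lot of stop-words and some additional tokens such as 1.2, [2], etc. We will use the `nltk` library to remove the stop-words, lowercase every token, and keep only purely alphabetic tokens. Note that `str.isalpha()` also discards words with attached punctuation, such as `Poland,` or `region.`.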

In [4]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\1625203\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [6]:
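from nltk.corpus import stopwords

# build the stop-word set once; a set lookup is much cheaper
# than re-reading and scanning the list for every token
stop_words = set(stopwords.words('english'))

clean_text = [
    [
        word.lower()
        for word in line
        if word.isalpha() and word.lower() not in stop_words
    ]
    for line in text
]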

for line in clean_text[:10]:
    print(line)

['poznań', 'city', 'river', 'warta', 'within', 'greater', 'poland', 'city', 'important', 'cultural', 'business', 'one', 'populous', 'regions', 'many', 'regional', 'customs', 'saint', 'fair', 'traditional', 'saint', 'croissants', 'local', 'among', 'important', 'heritage', 'sites', 'renaissance', 'old', 'town', 'hall', 'gothic']
['poznań', 'one', 'oldest', 'cities', 'population', 'poznań', 'metropolitan', 'area', 'comprising', 'poznań', 'county', 'several', 'communities', 'inhabited', 'million', 'one', 'four', 'historical', 'capitals', 'medieval', 'poland', 'ancient', 'capital', 'greater', 'poland', 'currently', 'administrative', 'capital', 'province', 'called', 'greater', 'poland']
['poznań', 'center', 'technology', 'important', 'academic', 'students', 'adam', 'mickiewicz', 'third', 'largest', 'polish', 'city', 'serves', 'seat', 'oldest', 'polish', 'one', 'populous', 'catholic', 'archdioceses', 'city', 'also', 'hosts', 'poznań', 'international', 'fair', 'biggest', 'industrial', 'fair', 
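The `isalpha()` filter above silently drops words such as `Poland,` because of the trailing comma. One possible refinement, shown here only as a minimal sketch, is to strip surrounding punctuation before the check. The tiny `STOP_WORDS` set and the `clean_tokens` helper are assumptions made to keep the example self-contained; they are not part of the notebook, which uses the `nltk` list.

```python
import string

# Hypothetical, tiny stop-word list used only to keep this sketch self-contained;
# the notebook itself uses nltk's English stop-word list.
STOP_WORDS = {"is", "a", "on", "the", "in", "and", "of"}

def clean_tokens(tokens):
    """Strip surrounding punctuation, lowercase, keep alphabetic non-stop-words."""
    cleaned = []
    for token in tokens:
        word = token.strip(string.punctuation).lower()
        if word.isalpha() and word not in STOP_WORDS:
            cleaned.append(word)
    return cleaned

sample = ["Poznań", "is", "a", "city", "in", "west-central", "Poland,", "[2]", "region."]
print(clean_tokens(sample))
# ['poznań', 'city', 'poland', 'region']
```

Hyphenated words such as `west-central` are still discarded, because `strip` only removes punctuation at the ends of a token.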

Now we are ready to transform the list of lists into a format suitable for association rule mining, i.e., a boolean matrix with one row per paragraph and one column per distinct word.

In [7]:
import pandas as pd

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

te = TransactionEncoder()
te_array = te.fit(clean_text).transform(clean_text)

In [8]:
# te_array contains a boolean (one-hot) version of the input data

te_array

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False,  True, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

In [9]:
te_array.shape

(91, 1341)

In [10]:
# original tokens are preserved in the columns_ field

te.columns_[:10]

['ab',
 'academic',
 'academy',
 'access',
 'according',
 'accounted',
 'achieving',
 'acoustics',
 'acquire',
 'acquired']
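To make the encoding concrete, the following sketch reproduces the same idea in plain Python on toy data. It is not the `TransactionEncoder` implementation, only an illustration of the resulting layout; the transactions are made up for the example.

```python
# Toy transactions (hypothetical data, not taken from the Wikipedia page).
transactions = [
    ["city", "poland", "river"],
    ["capital", "poland"],
    ["capital", "city", "city"],   # repeated words collapse into a single flag
]

# One column per distinct item, in sorted order.
columns = sorted({item for transaction in transactions for item in transaction})

# One row per transaction; True where the item occurs at least once.
encoded = [
    [item in set(transaction) for item in columns]
    for transaction in transactions
]

print(columns)
for row in encoded:
    print(row)
# ['capital', 'city', 'poland', 'river']
# [False, True, True, True]
# [True, False, True, False]
# [True, True, False, False]
```

The encoding records only presence or absence, so how many times a word occurs inside a paragraph is lost.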

The `mlxtend` package expects the input data to be stored as a `pandas.DataFrame`.

In [None]:
df = pd.DataFrame(te_array, columns=te.columns_)

df.head()

Now we are ready to find frequent itemsets, i.e., sets of words that occur together in at least 5% of the paragraphs.

In [None]:
frequent_itemsets = apriori(df, min_support=0.05, use_colnames=True)

frequent_itemsets
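The support of an itemset is the fraction of transactions that contain all of its items. With 91 paragraphs, `min_support=0.05` corresponds to 0.05 × 91 = 4.55, so an itemset must appear in at least 5 paragraphs. The sketch below computes support by brute-force enumeration on toy data; it is not the Apriori algorithm, which additionally prunes candidates whose subsets are infrequent. The data and the helper names `support` and `frequent_itemsets_bruteforce` are assumptions introduced for illustration.

```python
from itertools import combinations

# Toy transactions (hypothetical data).
transactions = [
    {"city", "poland", "river"},
    {"capital", "poland"},
    {"capital", "city", "poland"},
    {"city", "river"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)

def frequent_itemsets_bruteforce(transactions, min_support, max_size=2):
    """Enumerate all itemsets up to `max_size` items and keep the frequent ones."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result

frequent = frequent_itemsets_bruteforce(transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Brute force is only feasible for a handful of items; with 1341 distinct words the number of candidate pairs alone is close to 900,000, which is why a pruning algorithm such as Apriori is used in practice.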

We can also mine association rules, which come with additional measures of quality and interestingness such as confidence and lift.

In [None]:
from mlxtend.frequent_patterns import association_rules
?association_rules

In [None]:
from mlxtend.frequent_patterns import association_rules

association_rules(frequent_itemsets, 
                  metric='confidence', 
                  min_threshold=0.7)

In [None]:
association_rules(frequent_itemsets, metric='lift', min_threshold=5.0)
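For a rule A → C, confidence is support(A ∪ C) / support(A), and lift is confidence / support(C). A lift above 1 means that A and C co-occur more often than they would if they were independent. The following sketch computes both measures by hand on toy data; the transactions and the helper names `support` and `rule_metrics` are assumptions made for illustration and are not part of `mlxtend`.

```python
# Toy transactions (hypothetical data).
transactions = [
    {"city", "poland", "river"},
    {"capital", "poland"},
    {"capital", "city", "poland"},
    {"city", "river"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent -> consequent.

    Assumes both sides occur in at least one transaction,
    otherwise the divisions below would fail.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    rule_support = support(antecedent | consequent, transactions)
    confidence = rule_support / support(antecedent, transactions)
    lift = confidence / support(consequent, transactions)
    return rule_support, confidence, lift

rule_support, confidence, lift = rule_metrics({"capital"}, {"poland"}, transactions)
print(rule_support, confidence, round(lift, 3))
# 0.5 1.0 1.333
```

Here every transaction containing `capital` also contains `poland`, so the confidence is 1.0, but since `poland` is common anyway (support 0.75) the lift is only about 1.33.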

Both frequent itemsets and association rules (antecedents and consequents) are returned as `frozenset`s, so we can use [standard API calls](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset) to find subsets, supersets, etc.

In [None]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

# boolean mask: True for rules whose antecedent contains the word 'capital'
capital_idx = rules['antecedents'].apply(lambda x: x.issuperset({'capital'}))
rules[capital_idx]
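The same set operations can be tried without `pandas`. The sketch below uses a hand-written list of (antecedent, consequent) pairs, which is made-up data standing in for the rules table, to show how `issuperset` and `issubset` select rules.

```python
# Hypothetical rules as (antecedent, consequent) pairs of frozensets.
toy_rules = [
    (frozenset({"capital"}), frozenset({"poland"})),
    (frozenset({"city", "river"}), frozenset({"poland"})),
    (frozenset({"capital", "city"}), frozenset({"poland"})),
]

# Rules whose antecedent contains 'capital'.
with_capital = [rule for rule in toy_rules if rule[0].issuperset({"capital"})]

# Rules whose antecedent uses only words from a given vocabulary.
allowed = {"city", "river", "warta"}
within_allowed = [rule for rule in toy_rules if rule[0].issubset(allowed)]

print(len(with_capital))    # 2
print(len(within_allowed))  # 1
```

Because `frozenset` is hashable, antecedents and consequents can also be used as dictionary keys or collected into sets.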