# Topic modeling

| Student Name         | Student-ID |
|----------------------|------------|
| Marco Di Francesco   | 100632815  |
| Loreto García Tejada | 100643862  |
| György Bence Józsa   | 100633270  |
| József-Hunor Jánosi  | 100516724  |
| Sara-Jane Bittner    | 100498554  |

_Learning goal: Processing text data, converting it into a numerical format and performing topic analysis using SVD._

To complete the assignment you are allowed to use the NLTK natural language toolkit.

In [1]:
import re
import nltk
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from typing import Union, Tuple, Dict, List
from functools import reduce
import numpy as np

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gyuri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

a) Load _applications_of_DM.csv_, e.g. with `pandas` in Python. It contains the titles and text content of Wikipedia articles on data mining. In the following tasks, you are to work with the "text" attribute.

In [3]:
df = pd.read_csv('applications_of_DM.csv')
df

Unnamed: 0,title,text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...
2,Business analytics,Business analysis is a professional discipline...
3,CORE (research service),CORE (Connecting Repositories) is a service pr...
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...
5,Data Applied,Data Applied is a software vendor headquartere...
6,Data mining in agriculture,Data mining in agriculture is a recent researc...
7,Data thinking,"Data thinking is a buzzword for the generic ""m..."
8,Document processing,Document processing is a field of research and...
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as..."


b) Process and tokenize the text data: lowercase all words and remove digits, punctuation, any special characters, and NLTK’s common English stopwords. List the $10$ most frequent $n$-grams for $n = 1, 2, 3$, together with their raw counts. Then convert the $1700$ most frequent $n$-grams, across all $n$-values, into numerical matrices, i.e., features $X_\mathrm{raw}, X_\mathrm{tf-idf} \in \mathbb{R}^{docs \times terms}$. Features $X_\mathrm{raw}$ are raw $n$-gram counts; $X_\mathrm{tf-idf}$ are tf-idf values (see the specific format below).

In [4]:
def process_text(text: str) -> list[str]:
    """Removes special characters and numbers from a given text and returns a list of words of that text in lower case, removing common english stopwords.

    :param text: the raw text to be processed
    :return: a list of words in the original text with no special characters, numbers and common english stopwords in lower case.
    """
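    # replace everything that is not a letter (digits, punctuation, special characters) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())
    words = text.split()
    # load the stopword list once per call; a set gives constant-time membership tests
    stop_words = set(stopwords.words('english'))
    # filter out common english stopwords and return list
    return [word for word in words if word not in stop_words]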
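
To make the effect of the cleaning steps concrete, here is a minimal sketch of the same pipeline on one sentence. It assumes a tiny hand-written stopword set (`TOY_STOPWORDS`, hypothetical) in place of NLTK's English list, so it runs without downloading any corpus; `process_text_toy` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"is", "a", "of", "the", "in", "and"}

def process_text_toy(text: str, stop_words: set[str] = TOY_STOPWORDS) -> list[str]:
    """Lowercase, replace every non-letter with a space, split, and drop stopwords."""
    cleaned = re.sub(r"[^a-z]", " ", text.lower())
    return [word for word in cleaned.split() if word not in stop_words]

sample = "Data mining, since 1990, is a sub-field of Computer Science!"
tokens = process_text_toy(sample)
print(tokens)
# ['data', 'mining', 'since', 'sub', 'field', 'computer', 'science']
```

Note that hyphenated words such as "sub-field" are split into two tokens, because the hyphen is replaced by a space before splitting.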

In [5]:
df['filtered_text'] = df['text'].map(process_text)
df

Unnamed: 0,title,text,filtered_text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know..."


In [6]:
from collections import Counter

def make_ngram(corpus: list[str], n: int) -> Dict[Tuple[str, ...], int]:
    """Creates a frequency-list of specified n-grams in the given corpus

    :param corpus: Ordered list of words in the corpus
    :param n: number of elements in n-gram creation
    :return: dictionary with keys being unique n-grams and values being the frequencies of them
    """
    # Counter tallies every n-gram in a single pass, instead of calling list.count() once per unique n-gram
    return dict(Counter(ngrams(corpus, n)))
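
As a small worked example of what an n-gram frequency dictionary looks like, the sketch below counts n-grams on a six-word token list using only the standard library. `count_ngrams` is a hypothetical stand-in for the NLTK-based function above; it builds the n-grams by zipping shifted views of the list.

```python
from collections import Counter

def count_ngrams(tokens: list[str], n: int) -> Counter:
    """Count all n-grams of a token list in a single pass."""
    # zip over n shifted views of the list yields the consecutive n-tuples
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = ["text", "mining", "and", "text", "mining", "software"]
bigram_counts = count_ngrams(tokens, 2)
trigram_counts = count_ngrams(tokens, 3)
print(bigram_counts[("text", "mining")])   # 2
print(sum(bigram_counts.values()))         # 5 bigrams in a 6-token list
```

A list of `m` tokens always yields `m - n + 1` n-grams, which is why higher-order n-grams have lower raw counts.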

In [7]:
df['unigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=1))
df['bigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=2))
df['trigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=3))
df

Unnamed: 0,title,text,filtered_text,unigrams,bigrams,trigrams
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ...","{('may',): 1, ('insider',): 4, ('wikileaks',):...","{('darpa', 'information'): 1, ('employee', 'ab...","{('insider', 'becoming', 'malicious'): 1, ('us..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand...","{('verification',): 1, ('early',): 9, ('instru...","{('observer', 'recalls'): 1, ('concept', 'mind...","{('used', 'desired', 'actions'): 1, ('explains..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,...","{('outsourced',): 1, ('transformation',): 2, (...","{('describes', 'character'): 1, ('costs', 'div...","{('done', 'well', 'e'): 1, ('forward', 'tactic..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov...","{('monitor',): 2, ('ideas',): 1, ('base',): 1,...","{('overtaken', 'internet'): 1, ('million', 'co...","{('sustainability', 'model', 'core'): 1, ('ava..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in...","{('helps',): 1, ('east',): 1, ('ranked',): 1, ...","{('supermarkets', 'determine'): 1, ('fraudulen...","{('company', 'moved', 'suburban'): 1, ('canadi..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere...","{('headquartered',): 1, ('microsoft',): 1, ('s...","{('headquartered', 'washington'): 1, ('maps', ...","{('forecasting', 'correlation', 'analysis'): 1..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ...","{('early',): 2, ('practices',): 1, ('monitor',...","{('allow', 'farmer'): 1, ('pesticide', 'use'):...","{('birth', 'type', 'platform'): 1, ('conveyor'..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa...","{('concrete',): 2, ('proofed',): 1, ('transfor...","{('driven', 'transformation'): 1, ('approach',...","{('rogerio', 'panigassi', 'writing'): 1, ('com..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p...","{('volumes',): 1, ('automation',): 2, ('nap',)...","{('scanners', 'must'): 1, ('broadly', 'transcr...","{('character', 'recognition', 'ocr'): 2, ('pro..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know...","{('verification',): 2, ('submitted',): 2, ('ea...","{('talx', 'business'): 1, ('revenue', 'profit'...","{('service', 'work', 'number'): 2, ('known', '..."


In [8]:
def merge_frequency_dictionaries(first: Dict[Tuple[str, ...], int], second: Dict[Tuple[str, ...], int]) -> Dict[Tuple[str, ...], int]:
    """Merges two dictionaries of frequencies

    :param first: first dictionary containing frequencies
    :param second: second dictionary containing frequencies
    :return: a merged dictionary
    """
    res = {}
    for key, value in first.items():
        res[key] = value
    for key, value in second.items():
        if key in res:
            res[key] += value
        else:
            res[key] = value
    return res

In [9]:
def combine(series):
    """ Combines a list of dictionaries into a single dictionary

    :param series: a list of dictionaries
    :return: a single dictionary
    """
    return reduce(merge_frequency_dictionaries, series)

In [10]:
def get_most_frequent(grams: pd.Series, n: int) -> list[Tuple[int, Tuple[str, ...]]]:
    """ Returns the n most frequent n-grams

    :param grams: a list of dictionaries
    :param n: the number of n-grams to return
    :return: a list of tuples containing the frequency and the n-gram
    """
    combined_grams = grams.aggregate(combine)
    return sorted([(v, k) for k, v in combined_grams.items()], reverse=True)[:n]
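
The merge-then-sort logic above can also be expressed with `collections.Counter`, which supports addition and ranking directly. The following is a sketch under the assumption that the per-document counts are available as `Counter` objects; the three toy documents are invented for illustration.

```python
from collections import Counter

# Hypothetical per-document unigram counts, keyed by 1-tuples as in the notebook
doc_counts = [
    Counter({("data",): 3, ("mining",): 2}),
    Counter({("data",): 1, ("text",): 4}),
    Counter({("text",): 1, ("plate",): 2}),
]

# sum() with an empty Counter as the start value adds the counts key by key
corpus_counts = sum(doc_counts, Counter())
top_two = corpus_counts.most_common(2)
print(top_two)
# [(('text',), 5), (('data',), 4)]
```

One difference to keep in mind: `sorted(..., reverse=True)` on `(count, ngram)` pairs breaks ties by the n-gram itself in reverse order, whereas `most_common` keeps tied entries in the order they were first encountered.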

In [11]:
get_most_frequent(df['unigrams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

In [12]:
get_most_frequent(df['bigrams'], 10)


[(43, ('text', 'mining')),
 (43, ('license', 'plate')),
 (33, ('business', 'analysis')),
 (22, ('data', 'mining')),
 (20, ('open', 'access')),
 (19, ('document', 'processing')),
 (17, ('business', 'analysts')),
 (17, ('anpr', 'systems')),
 (16, ('radical', 'behaviorism')),
 (16, ('behavior', 'analysis'))]

In [13]:
get_most_frequent(df['trigrams'], 10)


[(8, ('number', 'plate', 'recognition')),
 (7, ('b', 'f', 'skinner')),
 (7, ('automatic', 'number', 'plate')),
 (6, ('optical', 'character', 'recognition')),
 (6, ('license', 'plate', 'capture')),
 (6, ('average', 'speed', 'cameras')),
 (5, ('text', 'data', 'mining')),
 (5, ('open', 'access', 'content')),
 (5, ('natural', 'language', 'processing')),
 (4, ('text', 'mining', 'software'))]

In [16]:
def get_count_matrix(ngram: str, n_terms: int = 1700) -> pd.DataFrame:
    """Returns a raw count matrix for the given n-gram column.

    The matrix contains the n_terms most frequent n-grams as columns and the titles as rows.

    :param ngram: the n-gram column to use. Can be 'unigrams', 'bigrams', or 'trigrams'
    :param n_terms: the number of most frequent n-grams to keep as columns
    :return: a count matrix. The columns are the n-grams, the rows are the titles.
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams']
    # column labels: the most frequent n-grams over the whole corpus (may be fewer than n_terms)
    terms = [words for _, words in get_most_frequent(df[ngram], n_terms)]
    # size the array from the data so it works for any number of documents and terms
    arr = np.zeros([len(df), len(terms)])
    for doc_idx, doc_counts in enumerate(df[ngram]):
        for w_idx, words in enumerate(terms):
            # n-grams that do not occur in this document get a count of 0
            arr[doc_idx][w_idx] = doc_counts.get(words, 0)
    return pd.DataFrame(arr, columns=terms, index=df['title'])

In [19]:
X_raw = get_count_matrix("unigrams")
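
The call above builds the matrix for unigrams only, while the task asks for the most frequent n-grams across all n-values. One possible way to pool them is sketched below on two invented token lists; `ngram_counts`, `docs`, and `top_k` are hypothetical names, and `top_k = 4` stands in for the 1700 used in the assignment.

```python
from collections import Counter
import numpy as np

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Hypothetical tokenized documents standing in for df['filtered_text']
docs = [
    ["text", "mining", "software", "text", "mining"],
    ["license", "plate", "recognition", "software"],
]

# Per document, pool the counts of unigrams, bigrams and trigrams into one Counter
pooled = [sum((ngram_counts(doc, n) for n in (1, 2, 3)), Counter()) for doc in docs]

# Rank all n-grams together by corpus-wide count and keep the top_k as columns
top_k = 4  # the assignment uses 1700
corpus = sum(pooled, Counter())
terms = [term for term, _ in corpus.most_common(top_k)]

# Rows are documents, columns are the selected n-grams (a missing key counts as 0)
X = np.array([[counts[term] for term in terms] for counts in pooled], dtype=float)
print(terms)
print(X)
```

Because shorter n-grams repeat more often, a pooled ranking like this tends to be dominated by unigrams, which is worth keeping in mind when comparing the results for different values of n.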

In [49]:
def tf(term: Tuple[str, ...], document: str, X_raw: pd.DataFrame, all_data: pd.DataFrame, ngram: str) -> float:
    assert ngram in ['unigrams', 'bigrams', 'trigrams']
    # number of times the term appears in the document
    numerator = X_raw.loc[document][term]
    terms: dict = all_data.loc[all_data['title'] == document][ngram].values[0]
    # total number of terms in the document: the sum of all counts, not the number of distinct n-grams
    denominator = sum(terms.values())
    return numerator/denominator

In [47]:
def idf(term: Tuple[str, ...], all_data: pd.DataFrame, ngram: str) -> float:
    assert ngram in ['unigrams', 'bigrams', 'trigrams']
    total_documents = all_data.shape[0]
    has_term = all_data[ngram].map(lambda x: 1 if term in x else 0)
    docs_with_term = has_term.sum()
    return np.log(total_documents/(docs_with_term + 1) + 1)

In [48]:
def tf_idf(tf: float, idf: float) -> float:
    return (1 + np.log(tf)) * idf

1.3437347467010947

c) Using SVD, decompose and truncate your numerical features $X = U \Sigma V^\top$ into ($docs \times topics$) and ($topics \times terms$) matrices (left/right singular vectors, respectively) using $6$ topics. List $5$ most significant $n$-grams for each topic, measured by values of the ($topics \times terms$) matrix. Do this for both features $X_\mathrm{raw}$ and $X_\mathrm{tf-idf}$.
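
A minimal sketch of the truncation step with `numpy.linalg.svd` follows. It uses an invented 4×4 count matrix with two obvious word groups instead of the real features, and two topics instead of six. `truncated_svd_topics` is a hypothetical helper, and ranking terms by absolute value is an assumption (the sign of a singular vector is arbitrary); the task statement itself only says "measured by values".

```python
import numpy as np

def truncated_svd_topics(X, terms, n_topics, n_top):
    """Return U_k (docs x topics), s_k, Vt_k (topics x terms) and the top terms per topic."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    # keep only the first n_topics singular triplets
    U_k, s_k, Vt_k = U[:, :n_topics], s[:n_topics], Vt[:n_topics, :]
    top_terms = []
    for topic_row in Vt_k:
        # Assumption: rank by absolute value, since the sign of a singular vector is arbitrary
        order = np.argsort(-np.abs(topic_row))[:n_top]
        top_terms.append([terms[i] for i in order])
    return U_k, s_k, Vt_k, top_terms

terms = ["data", "mining", "plate", "license"]
X_toy = np.array([[3, 3, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 2, 2]])
U_k, s_k, Vt_k, top_terms = truncated_svd_topics(X_toy, terms, n_topics=2, n_top=2)
for topic, words in enumerate(top_terms):
    print(topic, words)
```

In this toy case the first topic groups "data" and "mining" and the second groups "plate" and "license", because the two blocks of the matrix share no terms. With the real features, the same function would be called with `n_topics=6` and `n_top=5`.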

d) Compare and comment on the results with respect to the selection of features and the $n$ value in $n$-grams.

**The format of tf-idf you need to use:**

$$ f(tf, idf) = (1 + \ln (tf)) \cdot idf $$

where:

$$ tf = \frac{\text{number of times term } w \text{ appears in a document}}{\text{total number of terms in that document}} $$

and

$$ idf = \ln\left(\frac{\text{total number of documents}}{\text{number of documents with term } w \text{ in them} + 1} + 1\right) $$
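
The three formulas can be applied to a whole count matrix at once. The sketch below is one possible vectorized version, not the notebook's implementation: `tfidf_matrix` and `doc_lengths` are hypothetical names, and it assumes that a term absent from a document gets weight 0, since $\ln(0)$ is undefined and the formula does not say how to treat that case.

```python
import numpy as np

def tfidf_matrix(X_raw, doc_lengths):
    """Apply f(tf, idf) = (1 + ln(tf)) * idf to a docs x terms count matrix.

    doc_lengths holds the total number of terms in each document.
    """
    X_raw = np.asarray(X_raw, dtype=float)
    n_docs = X_raw.shape[0]
    # tf: count divided by the total number of terms in that document
    tf = X_raw / np.asarray(doc_lengths, dtype=float)[:, None]
    # idf: ln(N / (number of documents containing the term + 1) + 1)
    docs_with_term = (X_raw > 0).sum(axis=0)
    idf = np.log(n_docs / (docs_with_term + 1) + 1)
    weights = np.zeros_like(tf)
    present = tf > 0
    # Assumption: absent terms keep weight 0 instead of evaluating ln(0)
    weights[present] = (1 + np.log(tf[present])) * np.broadcast_to(idf, tf.shape)[present]
    return weights

# Toy example: 2 documents, 2 terms, document lengths 4 and 2
X_tfidf = tfidf_matrix([[2, 0], [1, 1]], doc_lengths=[4, 2])
print(X_tfidf)
```

Since $tf \le 1$, the factor $1 + \ln(tf)$ is negative whenever $tf < 1/e$, so with this format many weights in long documents come out negative.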