# Topic modeling

| Student Name         | Student-ID |
|----------------------|------------|
| Marco Di Francesco   | 100632815  |
| Loreto García Tejada | 100643862  |
| György Bence Józsa   | 100633270  |
| József-Hunor Jánosi  | 100516724  |
| Sara-Jane Bittner    | 100498554  |

_Learning goal: Processing text data, converting it into a numerical format and performing topic analysis using SVD._

To complete the assignment you are allowed to use the NLTK natural language toolkit.

In [1]:
import re
import nltk
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from typing import Union, Tuple, Dict, List
from functools import reduce

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sara-\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

a) Load _applications_of_DM.csv_, e.g. with `pandas` in Python. It contains the titles and text content of Wikipedia articles on data mining. In the following tasks, you are to work with the "text" attribute.

In [3]:
df = pd.read_csv('applications_of_DM.csv')
df

| Unnamed: 0 | title | text |
|------------|-------|------|
| 0 | Anomaly Detection at Multiple Scales | Anomaly Detection at Multiple Scales, or ADAMS... |
| 1 | Behavioral analytics | Behaviorism is a systematic approach to unders... |
| 2 | Business analytics | Business analysis is a professional discipline... |
| 3 | CORE (research service) | CORE (Connecting Repositories) is a service pr... |
| 4 | Daisy Intelligence | Daisy Intelligence is a Canadian Artificial In... |
| 5 | Data Applied | Data Applied is a software vendor headquartere... |
| 6 | Data mining in agriculture | Data mining in agriculture is a recent researc... |
| 7 | Data thinking | Data thinking is a buzzword for the generic "m... |
| 8 | Document processing | Document processing is a field of research and... |
| 9 | Equifax Workforce Solutions | Equifax Workforce Solutions, formerly known as... |


b) Process and tokenize the text data: lowercase all words, remove digits, punctuation, any special characters and NLTK’s common English stopwords. List $10$ most frequent $n$-grams, $n = 1, 2, 3,$ and their raw counts. Then, convert $1700$ most frequent $n$-grams, across all $n$-values, into numerical matrices, i.e., features $X_\mathrm{raw},X_\mathrm{tf-idf} \in \mathbb{R}^{docs \times terms}$. Features $X_\mathrm{raw}$ are raw $n$-gram counts, $X_\mathrm{tf-idf}$ are tf-idf values (see the specific format below).

In [4]:
def process_text(text: str) -> list[str]:
    """Lowercases a text, strips digits, punctuation and special characters, and removes common English stopwords.

    :param text: the raw text to be processed
    :return: the lowercase words of the text, without special characters, numbers or common English stopwords
    """
    # replace every non-letter character (digits, punctuation, symbols) with a space
    text = re.sub(r'[^a-z]', ' ', text.lower())
    words = text.split()
    # build the stopword set once instead of reloading the list for every word
    stop_words = set(stopwords.words('english'))
    return [word for word in words if word not in stop_words]
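
As a quick sanity check of the cleaning steps, the sketch below applies the same regex-and-filter logic to one sentence. `process_text_demo` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set only stands in for NLTK's English list so that the example runs without downloading any corpus.

```python
import re

# Hypothetical miniature stopword set, standing in for stopwords.words('english')
DEMO_STOPWORDS = frozenset({"is", "a", "of", "the", "in", "and", "or", "at"})


def process_text_demo(text, stop_words=DEMO_STOPWORDS):
    # Replace every non-letter (digits, punctuation, symbols) with a space
    cleaned = re.sub(r'[^a-z]', ' ', text.lower())
    # Split on whitespace and drop the stopwords
    return [word for word in cleaned.split() if word not in stop_words]


tokens = process_text_demo("Data mining in 2023: a field of AI & statistics!")
print(tokens)
```

Digits and punctuation disappear entirely, so tokens that were separated by them in the raw text become neighbours in the filtered list.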

In [5]:
df['filtered_text'] = df['text'].map(process_text)
df

| Unnamed: 0 | title | text | filtered_text |
|------------|-------|------|---------------|
| 0 | Anomaly Detection at Multiple Scales | Anomaly Detection at Multiple Scales, or ADAMS... | [anomaly, detection, multiple, scales, adams, ... |
| 1 | Behavioral analytics | Behaviorism is a systematic approach to unders... | [behaviorism, systematic, approach, understand... |
| 2 | Business analytics | Business analysis is a professional discipline... | [business, analysis, professional, discipline,... |
| 3 | CORE (research service) | CORE (Connecting Repositories) is a service pr... | [core, connecting, repositories, service, prov... |
| 4 | Daisy Intelligence | Daisy Intelligence is a Canadian Artificial In... | [daisy, intelligence, canadian, artificial, in... |
| 5 | Data Applied | Data Applied is a software vendor headquartere... | [data, applied, software, vendor, headquartere... |
| 6 | Data mining in agriculture | Data mining in agriculture is a recent researc... | [data, mining, agriculture, recent, research, ... |
| 7 | Data thinking | Data thinking is a buzzword for the generic "m... | [data, thinking, buzzword, generic, mental, pa... |
| 8 | Document processing | Document processing is a field of research and... | [document, processing, field, research, set, p... |
| 9 | Equifax Workforce Solutions | Equifax Workforce Solutions, formerly known as... | [equifax, workforce, solutions, formerly, know... |


In [6]:
from collections import Counter

def make_ngram(corpus: list[str], n: int) -> Dict[Tuple[str, ...], int]:
    """Creates a frequency-list of specified n-grams in the given corpus

    :param corpus: Ordered list of words in the corpus
    :param n: number of elements in n-gram creation
    :return: dictionary with keys being unique n-grams and values being the frequencies of them
    """
    # count every n-gram in a single pass instead of calling list.count per unique n-gram
    return dict(Counter(ngrams(corpus, n)))
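
To make the sliding-window idea concrete, here is a small sketch that counts n-grams on a toy token list using only the standard library. `count_ngrams` is a hypothetical helper; for a plain list of tokens the `zip` construction should yield the same tuples as `nltk.ngrams` without padding.

```python
from collections import Counter


def count_ngrams(tokens, n):
    # zip over n shifted views of the list yields every run of n consecutive tokens
    return Counter(zip(*(tokens[i:] for i in range(n))))


toy_tokens = ["text", "mining", "text", "mining", "software"]
toy_bigrams = count_ngrams(toy_tokens, 2)
toy_trigrams = count_ngrams(toy_tokens, 3)
print(toy_bigrams)
print(toy_trigrams)
```

A list of `m` tokens produces `m - n + 1` n-grams, which is one reason trigram counts are much sparser than unigram counts on short documents.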

In [7]:
df['unigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=1))
df['bigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=2))
df['trigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=3))
df

| Unnamed: 0 | title | text | filtered_text | unigrams | bigrams | trigrams |
|------------|-------|------|---------------|----------|---------|----------|
| 0 | Anomaly Detection at Multiple Scales | Anomaly Detection at Multiple Scales, or ADAMS... | [anomaly, detection, multiple, scales, adams, ... | {('recipients',): 1, ('access',): 1, ('specifi... | {('project', 'intended'): 1, ('noted', 'high')... | {('information', 'specific', 'cases'): 1, ('ma... |
| 1 | Behavioral analytics | Behaviorism is a systematic approach to unders... | [behaviorism, systematic, approach, understand... | {('certain',): 2, ('number',): 1, ('physics',)... | {('video', 'games'): 2, ('understanding', 'cov... | {('bell', 'ring', 'number'): 1, ('viewpoint', ... |
| 2 | Business analytics | Business analysis is a professional discipline... | [business, analysis, professional, discipline,... | {('incomplete',): 1, ('scheduling',): 1, ('num... | {('ongoing', 'massive'): 1, ('tasks', 'task'):... | {('storage', 'databases', 'analysis'): 1, ('es... |
| 3 | CORE (research service) | CORE (Connecting Repositories) is a service pr... | [core, connecting, repositories, service, prov... | {('along',): 1, ('kingdom',): 1, ('number',): ... | {('tool', 'uk'): 1, ('text', 'open'): 1, ('acc... | {('applications', 'making', 'use'): 1, ('knoth... |
| 4 | Daisy Intelligence | Daisy Intelligence is a Canadian Artificial In... | [daisy, intelligence, canadian, artificial, in... | {('concentrated',): 1, ('help',): 1, ('uses',)... | {('provides', 'data'): 1, ('promotional', 'mix... | {('mail', 'annual', 'list'): 1, ('retailers', ... |
| 5 | Data Applied | Data Applied is a software vendor headquartere... | [data, applied, software, vendor, headquartere... | {('implements',): 1, ('vendor',): 1, ('product... | {('data', 'mining'): 2, ('tree', 'maps'): 1, (... | {('trees', 'association', 'rules'): 1, ('self'... |
| 6 | Data mining in agriculture | Data mining in agriculture is a recent researc... | [data, mining, agriculture, recent, research, ... | {('along',): 2, ('cotton',): 5, ('results',): ... | {('affect', 'longevity'): 1, ('animals', 'caus... | {('sometimes', 'insurance', 'reasons'): 1, ('a... |
| 7 | Data thinking | Data thinking is a buzzword for the generic "m... | [data, thinking, buzzword, generic, mental, pa... | {('profitability',): 2, ('defined',): 1, ('con... | {('areas', 'successful'): 1, ('ai', 'driven'):... | {('feasibility', 'stage', 'also'): 1, ('data',... |
| 8 | Document processing | Document processing is a field of research and... | [document, processing, field, research, set, p... | {('world',): 1, ('mechanical',): 1, ('orders',... | {('obtain', 'digital'): 1, ('massively', 'extr... | {('processing', 'involved', 'data'): 1, ('bill... |
| 9 | Equifax Workforce Solutions | Equifax Workforce Solutions, formerly known as... | [equifax, workforce, solutions, formerly, know... | {('number',): 6, ('certain',): 3, ('w',): 3, (... | {('credit', 'approvals'): 1, ('focus', 'core')... | {('auditors', 'sec', 'sought'): 1, ('ceo', 'ad... |


In [8]:
def merge_frequency_dictionaries(first: Dict[Tuple[str, ...], int], second: Dict[Tuple[str, ...], int]) -> Dict[Tuple[str, ...], int]:
    """Merges two dictionaries of frequencies

    :param first: first dictionary containing frequencies
    :param second: second dictionary containing frequencies
    :return: a merged dictionary
    """
    res = {}
    for key, value in first.items():
        res[key] = value
    for key, value in second.items():
        if key in res:
            res[key] += value
        else:
            res[key] = value
    return res

In [9]:
def combine(series):
    """ Combines a list of dictionaries into a single dictionary

    :param series: a list of dictionaries
    :return: a single dictionary
    """
    # pass the merge function directly; wrapping it in a lambda adds nothing
    return reduce(merge_frequency_dictionaries, series)

In [10]:
def get_most_frequent(grams: pd.Series, n: int) -> list[Tuple[int, Tuple[str, ...]]]:
    """ Returns the n most frequent n-grams

    :param grams: a list of dictionaries
    :param n: the number of n-grams to return
    :return: a list of tuples containing the frequency and the n-gram
    """
    combined_grams = grams.aggregate(combine)
    return sorted([(v, k) for k, v in combined_grams.items()], reverse=True)[:n]
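
The merge-then-sort steps above can also be sketched with `collections.Counter`, which adds counts key by key and ranks them directly. The per-document dictionaries here are made-up toy data, not values from the corpus.

```python
from collections import Counter

# Toy per-document frequency dictionaries (illustrative values only)
toy_doc_counts = [
    {("data",): 3, ("mining",): 1},
    {("data",): 2, ("text",): 4},
]

toy_total = Counter()
for counts in toy_doc_counts:
    toy_total.update(counts)  # adds the counts of matching keys

toy_top = toy_total.most_common(2)
print(toy_top)
```

One difference worth keeping in mind: sorting `(count, ngram)` tuples in reverse, as in the cell above, breaks ties by reverse alphabetical order of the n-gram, whereas `most_common` keeps tied entries in insertion order. The two approaches can therefore disagree on which tied n-grams make the cut.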

<h4>10 Most Frequent Unigrams</h4>

In [11]:
get_most_frequent(df['unigrams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

<h4>10 Most Frequent Bigrams</h4>

In [12]:
get_most_frequent(df['bigrams'], 10)


[(43, ('text', 'mining')),
 (43, ('license', 'plate')),
 (33, ('business', 'analysis')),
 (22, ('data', 'mining')),
 (20, ('open', 'access')),
 (19, ('document', 'processing')),
 (17, ('business', 'analysts')),
 (17, ('anpr', 'systems')),
 (16, ('radical', 'behaviorism')),
 (16, ('behavior', 'analysis'))]

<h4>10 Most Frequent Trigrams</h4>

In [13]:
get_most_frequent(df['trigrams'], 10)


[(8, ('number', 'plate', 'recognition')),
 (7, ('b', 'f', 'skinner')),
 (7, ('automatic', 'number', 'plate')),
 (6, ('optical', 'character', 'recognition')),
 (6, ('license', 'plate', 'capture')),
 (6, ('average', 'speed', 'cameras')),
 (5, ('text', 'data', 'mining')),
 (5, ('open', 'access', 'content')),
 (5, ('natural', 'language', 'processing')),
 (4, ('text', 'mining', 'software'))]

<h4>Building the Two Matrices</h4>
Columns (x-axis): the 1700 most frequent n-grams
Rows (y-axis): the documents

<ol>
<li> Raw n-gram counts </li>
<li> TF-IDF values </li>
</ol>
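
The notebook does not show how the two matrices are built, so the following is only a minimal sketch of one way to do it, following the tf-idf format given at the end of this notebook. `build_feature_matrices` is a hypothetical helper. Two assumptions are ours and not stated in the task: the tf denominator is the document's total n-gram count, and entries for n-grams that do not occur in a document are set to 0 (since $\ln 0$ is undefined). Documents are assumed to be non-empty.

```python
from collections import Counter

import numpy as np


def build_feature_matrices(doc_counts, k):
    """Hypothetical helper: doc_counts holds one {n-gram: raw count} dict per document."""
    total = Counter()
    for counts in doc_counts:
        total.update(counts)  # adds the counts key by key
    # Vocabulary: the k most frequent n-grams across all documents and all n
    vocab = [gram for gram, _ in total.most_common(k)]

    # X_raw: rows are documents, columns are n-grams
    x_raw = np.array([[counts.get(gram, 0) for gram in vocab] for counts in doc_counts],
                     dtype=float)

    # tf: raw count divided by the document's total n-gram count (assumed denominator)
    doc_totals = np.array([sum(counts.values()) for counts in doc_counts], dtype=float)
    tf = x_raw / doc_totals[:, None]

    # idf = ln(N / (df + 1) + 1), df = number of documents containing the n-gram
    n_docs = x_raw.shape[0]
    df = (x_raw > 0).sum(axis=0)
    idf = np.log(n_docs / (df + 1) + 1)

    # (1 + ln tf) * idf where the n-gram occurs, 0 elsewhere (assumption)
    present = x_raw > 0
    x_tfidf = np.zeros_like(x_raw)
    x_tfidf[present] = (1 + np.log(tf[present])) * np.broadcast_to(idf, x_raw.shape)[present]
    return vocab, x_raw, x_tfidf


# Toy input: two documents with mixed unigrams and bigrams
demo_counts = [
    {("data",): 2, ("mining",): 1, ("data", "mining"): 1},
    {("data",): 1, ("text",): 3},
]
demo_vocab, demo_raw, demo_tfidf = build_feature_matrices(demo_counts, k=3)
print(demo_vocab)
print(demo_raw)
print(demo_tfidf)
```

In the notebook, the per-document input could be obtained by merging each row's `unigrams`, `bigrams` and `trigrams` dictionaries and calling the helper with `k=1700`.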

c) Using SVD, decompose and truncate your numerical features $X = U \Sigma V^\top$ into ($docs \times topics$) and ($topics \times terms$) matrices (left/right singular vectors, respectively) using $6$ topics. List $5$ most significant $n$-grams for each topic, measured by values of the ($topics \times terms$) matrix. Do this for both features $X_\mathrm{raw}$ and $X_\mathrm{tf-idf}$.

In [14]:
from sklearn.decomposition import TruncatedSVD 

In [None]:
# one SVD model per feature matrix, each truncated to 6 topics
truncatedSVD_count = TruncatedSVD(n_components=6)
truncatedSVD_tfidf = TruncatedSVD(n_components=6)

In [None]:
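# count_matrix and tfidf_matrix are the (docs x terms) matrices X_raw and X_tfidf from part b)
X_truncated_count = truncatedSVD_count.fit_transform(count_matrix)
X_truncated_tfidf = truncatedSVD_tfidf.fit_transform(tfidf_matrix)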
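
Listing the most significant n-grams per topic is not shown in the cells above. The sketch below illustrates the idea on a toy matrix. It uses `numpy.linalg.svd` instead of `TruncatedSVD` so that it stays self-contained, and it ranks n-grams by absolute value because the sign of a singular vector is arbitrary. Both choices are assumptions on our side, and `top_terms_per_topic` is a hypothetical helper.

```python
import numpy as np


def top_terms_per_topic(x, vocab, n_topics, n_terms):
    # Thin SVD: x = U @ diag(s) @ Vt, singular values in descending order
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    docs_topics = u[:, :n_topics] * s[:n_topics]  # (docs x topics)
    topics_terms = vt[:n_topics]                  # (topics x terms)
    top = []
    for row in topics_terms:
        # indices of the n_terms largest entries by absolute value
        idx = np.argsort(-np.abs(row))[:n_terms]
        top.append([(vocab[i], float(row[i])) for i in idx])
    return docs_topics, topics_terms, top


# Toy (docs x terms) matrix with two clearly separated groups of terms
toy_vocab = ["data", "mining", "plate", "anpr"]
toy_x = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])
toy_docs_topics, toy_topics_terms, toy_top = top_terms_per_topic(
    toy_x, toy_vocab, n_topics=2, n_terms=2)
for topic_id, terms in enumerate(toy_top):
    print(topic_id, terms)
```

With a fitted `TruncatedSVD`, the ($topics \times terms$) matrix should be available as the `components_` attribute, so the same ranking loop could be applied to `truncatedSVD_count.components_` and `truncatedSVD_tfidf.components_` with `n_terms=5`.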

d) Compare and comment on the results with respect to the selection of features and the $n$ value in $n$-grams.

**The format of tf-idf you need to use:**

$$ f(tf, idf) = (1 + \ln (tf)) \cdot idf $$

where:

$$ tf = \frac{\text{number of times term } w \text{ appears in a document}}{\text{total number of terms in that document}} $$

and

$$ idf = \ln\left(\frac{\text{total number of documents}}{\text{number of documents with term } w \text{ in them} + 1} + 1\right) $$
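
As a worked example with made-up numbers (not taken from the corpus), suppose a term appears $5$ times in a document of $100$ terms, and $3$ of $10$ documents contain it:

$$ tf = \frac{5}{100} = 0.05, \qquad idf = \ln\left(\frac{10}{3 + 1} + 1\right) = \ln(3.5) \approx 1.2528 $$

$$ f(tf, idf) = (1 + \ln(0.05)) \cdot 1.2528 \approx (1 - 2.9957) \cdot 1.2528 \approx -2.500 $$

Because $tf \le 1$ under this definition, $1 + \ln(tf)$ is negative whenever $tf < 1/e \approx 0.368$, so most entries of $X_\mathrm{tf-idf}$ can be expected to be negative. This is worth keeping in mind when comparing the topics obtained from the two feature matrices.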