# Topic modeling

| Student Name         | Student-ID |
|----------------------|------------|
| Marco Di Francesco   | 100632815  |
| Loreto García Tejada | 100643862  |
| György Bence Józsa   | 100633270  |
| József-Hunor Jánosi  | 100516724  |
| Sara-Jane Bittner    | 100498554  |

_Learning goal: Processing text data, converting it into a numerical format and performing topic analysis using SVD._

To complete the assignment you are allowed to use the NLTK natural language toolkit.

In [1]:
import re
import nltk
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from typing import Union, Tuple, Dict, List
from functools import reduce
import numpy as np
from sklearn.decomposition import TruncatedSVD

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/marco/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

a) Load _applications_of_DM.csv_, e.g. with `pandas` in Python. It has titles and text content of Wikipedia articles on data mining. In the following tasks, you are to work with the "tex"-attribute.

In [3]:
df = pd.read_csv('applications_of_DM.csv')
df

Unnamed: 0,title,text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...
2,Business analytics,Business analysis is a professional discipline...
3,CORE (research service),CORE (Connecting Repositories) is a service pr...
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...
5,Data Applied,Data Applied is a software vendor headquartere...
6,Data mining in agriculture,Data mining in agriculture is a recent researc...
7,Data thinking,"Data thinking is a buzzword for the generic ""m..."
8,Document processing,Document processing is a field of research and...
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as..."


b) Process and tokenize the text data: lowercase all words, remove digits, punctuation, any special characters and NLTK’s common English stopwords. List $10$ most frequent $n$-grams, $n = 1, 2, 3,$ and their raw counts. Then, convert $1700$ most frequent $n$-grams, across all $n$-values, into numerical matrices, i.e., features $X_\mathrm{raw},X_\mathrm{tf-idf} \in \mathbb{R}^{docs \times terms}$. Features $X_\mathrm{raw}$ are raw $n$-gram counts, $X_\mathrm{tf-idf}$ are tf-idf values (see the specific format below).

In [4]:
def process_text(text: str) -> list[str]:
    """Removes special characters and numbers from a given text and returns a list of words of that text in lower case, removing common english stopwords.

    :param text: the raw text to be processed
    :return: a list of words in the original text with no special characters, numbers and common english stopwords in lower case.
    """
    # remove special characters and numbers
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())
    words = text.split()
    # filter out common english stopwords and return list
    return [word for word in words if word not in stopwords.words('english')]

In [5]:
df['filtered_text'] = df['text'].map(process_text)
df

Unnamed: 0,title,text,filtered_text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know..."


In [6]:
def make_ngram(corpus: list[str], n: int) -> Dict[Tuple[str, ...], int]:
    """Creates a frequency-list of specified n-grams in the given corpus

    :param corpus: Ordered list of words in the corpus
    :param n: number of elements in n-gram creation
    :return: dictionary with keys being unique n-grams and values being the frequencies of them
    """
    grams = list(ngrams(corpus, n))
    return {i: grams.count(i) for i in set(grams)}

In [7]:
df['unigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=1))
df['bigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=2))
df['trigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=3))
df["all_grams"] = "{}"

for idx, row in df.iterrows():
    row["all_grams"] = {**row['unigrams'], **row["bigrams"], **row["trigrams"]}

for _, g in df.iterrows():
    print(len(g["unigrams"]), len(g["bigrams"]), len(g["trigrams"]), len(g["all_grams"]))

df

89 100 100 289
1515 3273 3576 8364
936 1843 2000 4779
308 529 578 1415
59 66 65 190
47 54 55 156
412 617 656 1685
204 312 336 852
303 484 518 1305
373 576 611 1560
59 70 72 201
95 124 127 346
290 505 544 1339
1892 4231 4571 10694
131 155 156 442
878 1605 1731 4214
141 210 222 573


Unnamed: 0,title,text,filtered_text,unigrams,bigrams,trigrams,all_grams
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ...","{('us',): 1, ('government',): 1, ('large',): 1...","{('program', 'threat'): 1, ('high', 'performan...","{('high', 'performance', 'computing'): 1, ('ag...","{('us',): 1, ('government',): 1, ('large',): 1..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand...","{('pecked',): 1, ('chamber',): 2, ('started',)...","{('use', 'consequences'): 1, ('interests', 'am...","{('example', 'person', 'teaching'): 1, ('love'...","{('pecked',): 1, ('chamber',): 2, ('started',)..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,...","{('user',): 2, ('improvements',): 2, ('else',)...","{('analysts', 'understanding'): 1, ('time', 'l...","{('ideas', 'options', 'useful'): 1, ('business...","{('user',): 2, ('improvements',): 2, ('else',)..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov...","{('dataset',): 1, ('managers',): 1, ('technica...","{('research', 'validate'): 1, ('searchable', '...","{('repository', 'dashboard', 'tool'): 1, ('fun...","{('dataset',): 1, ('managers',): 1, ('technica..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in...","{('growth',): 1, ('subset',): 1, ('known',): 1...","{('east', 'area'): 1, ('intelligence', 'canadi...","{('suburban', 'vaughan', 'ontario'): 1, ('tech...","{('growth',): 1, ('subset',): 1, ('known',): 1..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere...","{('series',): 1, ('types',): 1, ('time',): 1, ...","{('product', 'supports'): 1, ('environments', ...","{('series', 'forecasting', 'correlation'): 1, ...","{('series',): 1, ('types',): 1, ('time',): 1, ..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ...","{('details',): 1, ('longevity',): 1, ('meal',)...","{('farm', 'also'): 1, ('analysis', 'querying')...","{('weight', 'birth', 'type'): 2, ('data', 'min...","{('details',): 1, ('longevity',): 1, ('meal',)..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa...","{('user',): 3, ('could',): 1, ('data',): 31, (...","{('analysis', 'business'): 1, ('use', 'centere...","{('exploration', 'phase', 'concrete'): 1, ('bu...","{('user',): 3, ('could',): 1, ('data',): 31, (..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p...","{('digitally',): 1, ('contracts',): 1, ('entry...","{('algorithms', 'technologies'): 1, ('document...","{('sometimes', 'also', 'uses'): 1, ('particula...","{('digitally',): 1, ('contracts',): 1, ('entry..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know...","{('injunction',): 1, ('persons',): 1, ('articl...","{('incentive', 'management'): 1, ('online', 'p...","{('sec', 'investigation', 'accounting'): 1, ('...","{('injunction',): 1, ('persons',): 1, ('articl..."


In [8]:
def merge_frequency_dictionaries(first: Dict[Tuple[str, ...], int], second: Dict[Tuple[str, ...], int]) -> Dict[
    Tuple[str, ...], int]:
    """Merges two dictionaries of frequencies

    :param first: first dictionary containing frequencies
    :param second: second dictionary containing frequencies
    :return: a merged dictionary
    """
    res = {}
    for key, value in first.items():
        res[key] = value
    for key, value in second.items():
        if key in res:
            res[key] += value
        else:
            res[key] = value
    return res

In [9]:
def combine(series):
    """ Combines a list of dictionaries into a single dictionary

    :param series: a list of dictionaries
    :return: a single dictionary
    """
    return reduce(lambda x, y: merge_frequency_dictionaries(x, y), series)

In [10]:
def get_most_frequent(grams: pd.Series, n: int) -> list[Tuple[int, Tuple[str, ...]]]:
    """ Returns the n most frequent n-grams

    :param grams: a list of dictionaries
    :param n: the number of n-grams to return
    :return: a list of tuples containing the frequency and the n-gram
    """
    combined_grams = grams.aggregate(combine)
    return sorted([(v, k) for k, v in combined_grams.items()], reverse=True)[:n]

In [11]:
get_most_frequent(df['unigrams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

In [12]:
get_most_frequent(df['bigrams'], 10)

[(43, ('text', 'mining')),
 (43, ('license', 'plate')),
 (33, ('business', 'analysis')),
 (22, ('data', 'mining')),
 (20, ('open', 'access')),
 (19, ('document', 'processing')),
 (17, ('business', 'analysts')),
 (17, ('anpr', 'systems')),
 (16, ('radical', 'behaviorism')),
 (16, ('behavior', 'analysis'))]

In [13]:
get_most_frequent(df['trigrams'], 10)

[(8, ('number', 'plate', 'recognition')),
 (7, ('b', 'f', 'skinner')),
 (7, ('automatic', 'number', 'plate')),
 (6, ('optical', 'character', 'recognition')),
 (6, ('license', 'plate', 'capture')),
 (6, ('average', 'speed', 'cameras')),
 (5, ('text', 'data', 'mining')),
 (5, ('open', 'access', 'content')),
 (5, ('natural', 'language', 'processing')),
 (4, ('text', 'mining', 'software'))]

In [14]:
get_most_frequent(df['all_grams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

In [15]:
def get_count_matrix(ngram: str = "all_grams") -> pd.DataFrame:
    """Returns a raw count matrix for the given n-gram.

    The matrix should contain the n most frequent n-grams as columns and the titles as rows.

    :param ngram: the n-gram to use. Can be 'unigram', 'bigram', or 'trigram'
    :return: a count matrix. The columns are the n-grams, the rows are the titles.
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', "all_grams"]
    mfw = get_most_frequent(df[ngram], 1700)
    arr = np.zeros([17, 1700])
    for doc_idx, doc in df.iterrows():
        mfw_doc = doc[ngram]
        for w_idx, (_, words) in enumerate(mfw):
            count = mfw_doc[words] if words in mfw_doc else 0
            arr[doc_idx][w_idx] = count
    words_ls = [words for _, words in mfw]
    return pd.DataFrame(arr, columns=words_ls, index=df['title'])


X_raw = get_count_matrix()
X_raw

Unnamed: 0_level_0,"(behavior,)","(analysis,)","(data,)","(business,)","(also,)","(text,)","(anpr,)","(use,)","(used,)","(plate,)",...,"(defense,)","(dedicated,)","(decades,)","(decade,)","(dbt,)","(dataset,)","(data, science)","(data, mining, techniques)","(data, entry)","(data, centre)"
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anomaly Detection at Multiple Scales,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Behavioral analytics,154.0,30.0,2.0,1.0,26.0,0.0,0.0,10.0,10.0,0.0,...,0.0,0.0,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0
Business analytics,0.0,55.0,6.0,122.0,8.0,0.0,0.0,3.0,13.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CORE (research service),0.0,1.0,9.0,0.0,3.0,7.0,0.0,4.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Daisy Intelligence,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data Applied,0.0,2.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data mining in agriculture,0.0,2.0,19.0,0.0,5.0,0.0,0.0,6.0,9.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
Data thinking,0.0,4.0,31.0,5.0,4.0,0.0,0.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
Document processing,0.0,1.0,9.0,2.0,11.0,6.0,0.0,4.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
Equifax Workforce Solutions,0.0,0.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
def tf(term: Tuple[str, ...], document: str, X_raw: pd.DataFrame, all_data: pd.DataFrame, ngram: str) -> float:
    """Calculates the term frequency of a term in a document.

    The term frequency is calculated as the number of times the term appears in the document divided by the number of words in the document.

    :param term: the term to calculate the frequency of, e.g. ('word1', 'word2')
    :param document: the title of the document to calculate the frequency in
    :param X_raw: the raw count matrix
    :param all_data: the dataframe containing all the data
    :param ngram: e.g. 'unigrams'
    :return: the term frequency
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', 'all_grams']
    numerator = X_raw.loc[document][term]
    terms: dict = all_data.loc[all_data['title'] == document][ngram].values[0]
    denominator = len(terms)
    return numerator / denominator

In [17]:
def idf(term: Tuple[str, ...], all_data: pd.DataFrame, ngram: str) -> float:
    """Calculates the inverse document frequency of a term.

    :param term: the term to calculate the frequency of, e.g. ('word1', 'word2')
    :param all_data: the dataframe containing all the data
    :param ngram: the n-gram to use. Can be 'unigram', 'bigram', or 'trigram'
    :return: the inverse document frequency of the term
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', 'all_grams']
    total_documents = all_data.shape[0]
    has_term = all_data[ngram].map(lambda x: 1 if term in x else 0)
    docs_with_term = has_term.aggregate(sum)
    return np.log(total_documents / (docs_with_term + 1) + 1)

In [18]:
def tf_idf(tf: float, idf: float) -> float:
    """Calculates the tf-idf score of a term in a document.

    :param tf: the term frequency of the term in the document
    :param idf: the inverse document frequency of the term
    :return: the tf-idf score of the term in the document
    """
    tf = (1 + np.log(tf)) if tf > 0 else 0
    return (1 + np.log(tf)) * idf

In [23]:
def get_tfidf_matrix(ngram: str = "all_grams") -> pd.DataFrame:
    """Returns a raw count matrix for the given n-gram.

    The matrix should contain the n most frequent n-grams as columns and the titles as rows.

    :param ngram: the n-gram to use. Can be 'unigram', 'bigram', or 'trigram'
    :return: a count matrix. The columns are the n-grams, the rows are the titles.
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', "all_grams"]
    mfw = get_most_frequent(df[ngram], 1700)
    arr = np.zeros([17, 1700])
    # Documents / Articles
    for doc_idx, doc in df.iterrows():
        # Most frequent words - {('girvan',): 1, ('operates',): 2, ('aclu',): 7, ...}
        mfw_doc = doc[ngram]
        for w_idx, (word_freq, word) in enumerate(mfw):
            tf_val = tf(term=word, document=doc["title"], X_raw=X_raw, all_data=df, ngram=ngram)
            idf_val = idf(term=word, all_data=df, ngram=ngram)
            arr[doc_idx][w_idx] = tf_idf(tf_val, idf_val)
    words_ls = [words for _, words in mfw]
    return pd.DataFrame(arr, columns=words_ls, index=df['title'])


X_tfidf = get_tfidf_matrix()
X_tfidf

  return (1 + np.log(tf)) * idf
  return (1 + np.log(tf)) * idf


Unnamed: 0_level_0,"(behavior,)","(analysis,)","(data,)","(business,)","(also,)","(text,)","(anpr,)","(use,)","(used,)","(plate,)",...,"(defense,)","(dedicated,)","(decades,)","(decade,)","(dbt,)","(dataset,)","(data, science)","(data, mining, techniques)","(data, entry)","(data, centre)"
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Anomaly Detection at Multiple Scales,-inf,,,-inf,-inf,-inf,-inf,-inf,-inf,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf
Behavioral analytics,,,,,,-inf,-inf,,,-inf,...,-inf,-inf,,,,-inf,-inf,-inf,-inf,-inf
Business analytics,-inf,,,,,-inf,-inf,,,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf
CORE (research service),-inf,,,-inf,,,-inf,,,-inf,...,-inf,-inf,-inf,-inf,-inf,,-inf,-inf,-inf,-inf
Daisy Intelligence,-inf,,,-inf,,-inf,-inf,-inf,-inf,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf
Data Applied,-inf,,,,-inf,-inf,-inf,-inf,-inf,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf
Data mining in agriculture,-inf,,,-inf,,-inf,-inf,,,-inf,...,-inf,-inf,,-inf,-inf,-inf,-inf,,-inf,-inf
Data thinking,-inf,,,,,-inf,-inf,,-inf,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,,-inf,-inf,-inf
Document processing,-inf,,,,,,-inf,,,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,,-inf
Equifax Workforce Solutions,-inf,-inf,-inf,,,-inf,-inf,-inf,-inf,-inf,...,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf,-inf


c) Using SVD, decompose and truncate your numerical features $X = U \Sigma V^\top$ into ($docs \times topics$) and ($topics \times terms$) matrices (left/right singular vectors, respectively) using $6$ topics. List $5$ most significant $n$-grams for each topic, measured by values of the ($topics \times terms$) matrix. Do this for both features $X_\mathrm{raw}$ and $X_\mathrm{tf-idf}$.

In [20]:
def get_left_right_svd(X):
    """Performs SVD and truncation.
    Returns the left and right singular vectors for matrix X.

    :param X: matrix (raw n-gram count or tf-idf) of size (docs, terms)
    :return left: left singular vector of X, size (docs, topics)
    :return right: right singular vector of X, size (topics, terms)
    """
    svd = TruncatedSVD(n_components=6, random_state=42)
    # left singular vectors. (docs x topics)
    left = svd.fit_transform(X)
    # right singular vectors. (topics x terms)
    right = svd.components_

    return left, right

In [21]:
def get_top5_ngrams_svd(X):
    """
    Returns top 5 most significant n-grams for the 6 topics.

    :param X: matrix (raw n-gram count or tf-idf) of size (docs, terms)
    :return top5: list of lists. For every topic (6) returns the top 5 most
        significant n-grams based on X's SVD.
    """
    # perform SVD
    left, right = get_left_right_svd(X)

    # create Pandas df of the right vector
    right_df = pd.DataFrame(right.copy(), columns=X_raw.columns)

    # temp value for getting the top5
    min = -99
    # 6 empty lists
    top5 = [[] for i in range(len(right_df))]
    for i in range(5):
        # argmax for topics
        ids = right_df.idxmax(axis=1)
        # putting negative values for the max values so we don't get the same values again as max
        for j, idx in enumerate(ids):
            right_df[idx][j] = min

        # inverting the ids list so we have topic[]=[n-grams], instead of top[]=[ngram-topic1, ngram-topic2...]
        for j in range(len(right_df)):
            top5[j].append(ids[j])

    print('Top 5 n-grams for every topic:')
    for i in range(len(right_df)):
        print(f'Topic {i+1}: {top5[i]}')

    return top5


In [22]:
# get top 5 most significant words per topics
# raw n-gram counts
print('Raw n-gram counts:')
top5 = get_top5_ngrams_svd(X_raw)

Raw n-gram counts:
Top 5 n-grams for every topic:
Topic 1: [('anpr',), ('plate',), ('behavior',), ('license',), ('system',)]
Topic 2: [('behavior',), ('behaviorism',), ('skinner',), ('analysis',), ('stimulus',)]
Topic 3: [('business',), ('analysis',), ('requirements',), ('business', 'analysis'), ('text',)]
Topic 4: [('text',), ('mining',), ('text', 'mining'), ('data',), ('information',)]
Topic 5: [('access',), ('core',), ('open',), ('open', 'access'), ('content',)]
Topic 6: [('document',), ('processing',), ('data',), ('document', 'processing'), ('also',)]


d) Compare and comment the results with respect to the selection of features and the $n$ value in $n$-grams.

**The format of tf-idf you need to use:**

$$ f(tf, idf) = (1 + \ln (tf)) \cdot idf $$

where:

$$ tf = \frac{number\ of\ times\ term\ w\ appears\ in\ a\ document}{total\ number\ of\ terms\ in\ that\ document} $$

and

$$ idf = \ln(\frac{total\ number\ of\ documents}{number\ of\ documents\ with\ term\ w\ in\ them + 1} + 1) $$