# Topic modeling

| Student Name         | Student-ID |
|----------------------|------------|
| Marco Di Francesco   | 100632815  |
| Loreto García Tejada | 100643862  |
| György Bence Józsa   | 100633270  |
| József-Hunor Jánosi  | 100516724  |
| Sara-Jane Bittner    | 100498554  |

_Learning goal: Processing text data, converting it into a numerical format and performing topic analysis using SVD._

To complete the assignment you are allowed to use the NLTK natural language toolkit.

In [2]:
import re
import nltk
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from typing import Union, Tuple, Dict, List
from functools import reduce
import numpy as np
from sklearn.decomposition import TruncatedSVD

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Gyuri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

a) Load _applications_of_DM.csv_, e.g. with `pandas` in Python. It has titles and text content of Wikipedia articles on data mining. In the following tasks, you are to work with the "text"-attribute.

In [None]:
df = pd.read_csv('applications_of_DM.csv')
df

Unnamed: 0,title,text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...
2,Business analytics,Business analysis is a professional discipline...
3,CORE (research service),CORE (Connecting Repositories) is a service pr...
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...
5,Data Applied,Data Applied is a software vendor headquartere...
6,Data mining in agriculture,Data mining in agriculture is a recent researc...
7,Data thinking,"Data thinking is a buzzword for the generic ""m..."
8,Document processing,Document processing is a field of research and...
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as..."


b) Process and tokenize the text data: lowercase all words, remove digits, punctuation, any special characters and NLTK’s common English stopwords. List $10$ most frequent $n$-grams, $n = 1, 2, 3,$ and their raw counts. Then, convert $1700$ most frequent $n$-grams, across all $n$-values, into numerical matrices, i.e., features $X_\mathrm{raw},X_\mathrm{tf-idf} \in \mathbb{R}^{docs \times terms}$. Features $X_\mathrm{raw}$ are raw $n$-gram counts, $X_\mathrm{tf-idf}$ are tf-idf values (see the specific format below).

In [None]:
# Build the stopword set once: stopwords.words() re-reads the list on every call,
# and a set makes each membership test constant time.
ENGLISH_STOPWORDS = set(stopwords.words('english'))


def process_text(text: str) -> list[str]:
    """Removes special characters and numbers from a given text and returns a list of words of that text in lower case, removing common English stopwords.

    :param text: the raw text to be processed
    :return: a list of lower-case words from the original text with special characters, numbers and common English stopwords removed.
    """
    # lowercase, then replace everything that is not a letter (digits, punctuation, special characters)
    text = re.sub(r'[^a-z]', ' ', text.lower())
    words = text.split()
    # filter out common English stopwords and return list
    return [word for word in words if word not in ENGLISH_STOPWORDS]
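
As a small illustration of the cleaning step, the sketch below applies the same kind of regular expression to a made-up sentence. To stay runnable without the NLTK download, it uses a tiny hand-picked stopword set (`DEMO_STOPWORDS`, an assumption for this example only) instead of NLTK's English list; `demo_process_text` is likewise a hypothetical name.

```python
import re

# Hypothetical mini stopword list; the notebook itself uses NLTK's English list.
DEMO_STOPWORDS = {"is", "a", "of", "the", "in", "and"}


def demo_process_text(text: str) -> list[str]:
    """Lowercase, replace every non-letter with a space, split, drop stopwords."""
    cleaned = re.sub(r"[^a-z]", " ", text.lower())
    return [word for word in cleaned.split() if word not in DEMO_STOPWORDS]


tokens = demo_process_text("Data mining is a sub-field of AI (since the 1990s)!")
print(tokens)
```

Hyphenated words are split into two tokens, and `1990s` leaves a stray `s` behind. Single letters that are not stopwords survive in the notebook in the same way, as in the tri-gram `('b', 'f', 'skinner')`.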

In [None]:
df['filtered_text'] = df['text'].map(process_text)
df

Unnamed: 0,title,text,filtered_text
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know..."


In [None]:
def make_ngram(corpus: list[str], n: int) -> Dict[Tuple[str, ...], int]:
    """Creates a frequency-list of specified n-grams in the given corpus

    :param corpus: Ordered list of words in the corpus
    :param n: number of elements in n-gram creation
    :return: dictionary with keys being unique n-grams and values being the frequencies of them
    """
    grams = list(ngrams(corpus, n))
    return {i: grams.count(i) for i in set(grams)}
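
Calling `list.count` once per distinct n-gram rescans the whole list each time. A possible alternative, sketched here with only the standard library (`count_ngrams` is a hypothetical helper, not part of the notebook), counts all n-grams in a single pass with `collections.Counter`:

```python
from collections import Counter


def count_ngrams(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens in a single pass."""
    # zip over n staggered views of the token list yields the n-grams as tuples
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


tokens = ["text", "mining", "and", "text", "mining", "software"]
bigrams = count_ngrams(tokens, 2)
print(bigrams[("text", "mining")])  # 2
```

A `Counter` is a `dict` subclass, so it could be stored in the DataFrame columns in the same way as the plain dictionaries.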

In [None]:
df['unigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=1))
df['bigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=2))
df['trigrams'] = df['filtered_text'].map(lambda x: make_ngram(x, n=3))
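# Merge the three dictionaries per document. Keys cannot collide because
# uni-, bi- and tri-gram tuples have different lengths.
# Assigning to the rows yielded by iterrows() is not guaranteed to modify df,
# so the column is built in one step instead.
df["all_grams"] = [
    {**uni, **bi, **tri}
    for uni, bi, tri in zip(df['unigrams'], df['bigrams'], df['trigrams'])
]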

for _, g in df.iterrows():
    print(len(g["unigrams"]), len(g["bigrams"]), len(g["trigrams"]), len(g["all_grams"]))

df

89 100 100 289
1515 3273 3576 8364
936 1843 2000 4779
308 529 578 1415
59 66 65 190
47 54 55 156
412 617 656 1685
204 312 336 852
303 484 518 1305
373 576 611 1560
59 70 72 201
95 124 127 346
290 505 544 1339
1892 4231 4571 10694
131 155 156 442
878 1605 1731 4214
141 210 222 573


Unnamed: 0,title,text,filtered_text,unigrams,bigrams,trigrams,all_grams
0,Anomaly Detection at Multiple Scales,"Anomaly Detection at Multiple Scales, or ADAMS...","[anomaly, detection, multiple, scales, adams, ...","{('adams',): 2, ('office',): 1, ('detection',)...","{('good', 'mental'): 1, ('project', 'georgia')...","{('researcher', 'david', 'bader'): 1, ('innoce...","{('adams',): 2, ('office',): 1, ('detection',)..."
1,Behavioral analytics,Behaviorism is a systematic approach to unders...,"[behaviorism, systematic, approach, understand...","{('otherwise',): 1, ('versions',): 1, ('others...","{('rating', 'modern'): 1, ('children', 'used')...","{('groups', 'sigs', 'within'): 1, ('analysis',...","{('otherwise',): 1, ('versions',): 1, ('others..."
2,Business analytics,Business analysis is a professional discipline...,"[business, analysis, professional, discipline,...","{('change',): 11, ('otherwise',): 1, ('effecti...","{('resources', 'costs'): 1, ('need', 'well'): ...","{('matrix', 'example', 'table'): 1, ('generall...","{('change',): 11, ('otherwise',): 1, ('effecti..."
3,CORE (research service),CORE (Connecting Repositories) is a service pr...,"[core, connecting, repositories, service, prov...","{('analytics',): 1, ('used',): 1, ('including'...","{('must', 'openly'): 1, ('point', 'develop'): ...","{('mine', 'large', 'amounts'): 1, ('project', ...","{('analytics',): 1, ('used',): 1, ('including'..."
4,Daisy Intelligence,Daisy Intelligence is a Canadian Artificial In...,"[daisy, intelligence, canadian, artificial, in...","{('ai',): 3, ('suburban',): 1, ('promotional',...","{('claims', 'company'): 1, ('ai', 'technology'...","{('globe', 'mail', 'annual'): 1, ('ai', 'techn...","{('ai',): 3, ('suburban',): 1, ('promotional',..."
5,Data Applied,Data Applied is a software vendor headquartere...,"[data, applied, software, vendor, headquartere...","{('vendor',): 1, ('rules',): 1, ('analytical',...","{('mining', 'product'): 1, ('maps', 'time'): 1...","{('reporting', 'tree', 'maps'): 1, ('outlier',...","{('vendor',): 1, ('rules',): 1, ('analytical',..."
6,Data mining in agriculture,Data mining in agriculture is a recent researc...,"[data, mining, agriculture, recent, research, ...","{('primary',): 1, ('correct',): 1, ('done',): ...","{('agriculture', 'data'): 1, ('exclusively', '...","{('techniques', 'k', 'means'): 1, ('allow', 'f...","{('primary',): 1, ('correct',): 1, ('done',): ..."
7,Data thinking,"Data thinking is a buzzword for the generic ""m...","[data, thinking, buzzword, generic, mental, pa...","{('assessments',): 1, ('digital',): 1, ('model...","{('typically', 'applied'): 1, ('proof', 'conce...","{('applied', 'step', 'developed'): 1, ('availa...","{('assessments',): 1, ('digital',): 1, ('model..."
8,Document processing,Document processing is a field of research and...,"[document, processing, field, research, set, p...","{('line',): 1, ('digital',): 4, ('emerged',): ...","{('massively', 'extracting'): 1, ('automatical...","{('documents', 'newspaper', 'archives'): 1, ('...","{('line',): 1, ('digital',): 4, ('emerged',): ..."
9,Equifax Workforce Solutions,"Equifax Workforce Solutions, formerly known as...","[equifax, workforce, solutions, formerly, know...","{('knew',): 1, ('items',): 1, ('correct',): 1,...","{('percent', 'inflated'): 1, ('financial', 'ta...","{('competition', 'talx', 'believed'): 1, ('sho...","{('knew',): 1, ('items',): 1, ('correct',): 1,..."


In [None]:
def merge_frequency_dictionaries(first: Dict[Tuple[str, ...], int], second: Dict[Tuple[str, ...], int]) -> Dict[
    Tuple[str, ...], int]:
    """Merges two dictionaries of frequencies

    :param first: first dictionary containing frequencies
    :param second: second dictionary containing frequencies
    :return: a merged dictionary
    """
    res = {}
    for key, value in first.items():
        res[key] = value
    for key, value in second.items():
        if key in res:
            res[key] += value
        else:
            res[key] = value
    return res

In [None]:
def combine(series):
    """ Combines a list of dictionaries into a single dictionary

    :param series: a list of dictionaries
    :return: a single dictionary
    """
    return reduce(lambda x, y: merge_frequency_dictionaries(x, y), series)

In [None]:
def get_most_frequent(grams: pd.Series, n: int) -> list[Tuple[int, Tuple[str, ...]]]:
    """ Returns the n most frequent n-grams

    :param grams: a list of dictionaries
    :param n: the number of n-grams to return
    :return: a list of tuples containing the frequency and the n-gram
    """
    combined_grams = grams.aggregate(combine)
    return sorted([(v, k) for k, v in combined_grams.items()], reverse=True)[:n]
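
The merge-then-rank logic can also be sketched with `collections.Counter`, whose `update` method adds counts key by key. The three small dictionaries below are made-up data for illustration only:

```python
from collections import Counter

# Hypothetical per-document n-gram counts, for illustration only.
doc_counts = [
    Counter({("data",): 3, ("mining",): 1}),
    Counter({("data",): 2, ("text",): 4}),
    Counter({("text",): 1, ("mining",): 1}),
]

# Counter.update adds counts key by key, so no manual merge loop is needed.
total = Counter()
for counts in doc_counts:
    total.update(counts)

# Sort by (count, n-gram) descending, the same tie-breaking rule as get_most_frequent.
top2 = sorted(((v, k) for k, v in total.items()), reverse=True)[:2]
print(top2)
```

Because the sort key is the pair `(count, n-gram)`, ties in the count are broken by the n-gram itself in reverse alphabetical order, which is why `('plate',)` and `('used',)` with 84 occurrences each appear in that order in the results.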

In [None]:
get_most_frequent(df['unigrams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

In [None]:
get_most_frequent(df['bigrams'], 10)

[(43, ('text', 'mining')),
 (43, ('license', 'plate')),
 (33, ('business', 'analysis')),
 (22, ('data', 'mining')),
 (20, ('open', 'access')),
 (19, ('document', 'processing')),
 (17, ('business', 'analysts')),
 (17, ('anpr', 'systems')),
 (16, ('radical', 'behaviorism')),
 (16, ('behavior', 'analysis'))]

In [None]:
get_most_frequent(df['trigrams'], 10)

[(8, ('number', 'plate', 'recognition')),
 (7, ('b', 'f', 'skinner')),
 (7, ('automatic', 'number', 'plate')),
 (6, ('optical', 'character', 'recognition')),
 (6, ('license', 'plate', 'capture')),
 (6, ('average', 'speed', 'cameras')),
 (5, ('text', 'data', 'mining')),
 (5, ('open', 'access', 'content')),
 (5, ('natural', 'language', 'processing')),
 (4, ('text', 'mining', 'software'))]

In [None]:
get_most_frequent(df['all_grams'], 10)

[(160, ('behavior',)),
 (156, ('analysis',)),
 (150, ('data',)),
 (148, ('business',)),
 (103, ('also',)),
 (102, ('text',)),
 (86, ('anpr',)),
 (85, ('use',)),
 (84, ('used',)),
 (84, ('plate',))]

In [None]:
def get_count_matrix(ngram: str = "all_grams") -> pd.DataFrame:
    """Returns a raw count matrix for the given n-gram column.

    The matrix contains the 1700 most frequent n-grams as columns and the titles as rows.

    :param ngram: the column to use. Can be 'unigrams', 'bigrams', 'trigrams' or 'all_grams'
    :return: a count matrix. The columns are the n-grams, the rows are the titles.
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', "all_grams"]
    mfw = get_most_frequent(df[ngram], 1700)
    # size the matrix from the data instead of hard-coding 17 x 1700
    arr = np.zeros([len(df), len(mfw)])
    # enumerate gives a positional row index, independent of the DataFrame index
    for doc_idx, (_, doc) in enumerate(df.iterrows()):
        mfw_doc = doc[ngram]
        for w_idx, (_, words) in enumerate(mfw):
            # n-grams absent from this document get a count of 0
            arr[doc_idx][w_idx] = mfw_doc.get(words, 0)
    words_ls = [words for _, words in mfw]
    return pd.DataFrame(arr, columns=words_ls, index=df['title'])


X_raw = get_count_matrix()
X_raw

title,"(behavior,)","(analysis,)","(data,)","(business,)","(also,)","(text,)","(anpr,)","(use,)","(used,)","(plate,)",...,"(defense,)","(dedicated,)","(decades,)","(decade,)","(dbt,)","(dataset,)","(data, science)","(data, mining, techniques)","(data, entry)","(data, centre)"
Anomaly Detection at Multiple Scales,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Behavioral analytics,154.0,30.0,2.0,1.0,26.0,0.0,0.0,10.0,10.0,0.0,...,0.0,0.0,1.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0
Business analytics,0.0,55.0,6.0,122.0,8.0,0.0,0.0,3.0,13.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CORE (research service),0.0,1.0,9.0,0.0,3.0,7.0,0.0,4.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Daisy Intelligence,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data Applied,0.0,2.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data mining in agriculture,0.0,2.0,19.0,0.0,5.0,0.0,0.0,6.0,9.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
Data thinking,0.0,4.0,31.0,5.0,4.0,0.0,0.0,3.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0
Document processing,0.0,1.0,9.0,2.0,11.0,6.0,0.0,4.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
Equifax Workforce Solutions,0.0,0.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
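
The construction of the count matrix can be shown on a toy scale. The sketch below uses two made-up documents and a three-term vocabulary (both assumptions for illustration) and fills a `(docs x terms)` NumPy array in the same way:

```python
import numpy as np

# Hypothetical toy input: one n-gram -> count dictionary per document.
docs = [
    {("data",): 2, ("data", "mining"): 1},
    {("text",): 3, ("data",): 1},
]
# Hypothetical vocabulary: the selected n-grams, in column order.
vocab = [("data",), ("text",), ("data", "mining")]

# Rows are documents, columns are n-grams; the shape follows the inputs.
X = np.zeros((len(docs), len(vocab)))
for row, counts in enumerate(docs):
    for col, gram in enumerate(vocab):
        X[row, col] = counts.get(gram, 0)  # 0 when the n-gram is absent

print(X)
```

Most entries of such a matrix are zero, since each article only uses a small part of the shared vocabulary.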


In [None]:
def tf(term: Tuple[str, ...], document: str, X_raw: pd.DataFrame, all_data: pd.DataFrame, ngram: str) -> float:
    """Calculates the term frequency of a term in a document.

    The term frequency is calculated as the number of times the term appears in the document divided by the total number of terms (n-gram occurrences) in the document.

    :param term: the term to calculate the frequency of, e.g. ('word1', 'word2')
    :param document: the title of the document to calculate the frequency in
    :param X_raw: the raw count matrix
    :param all_data: the dataframe containing all the data
    :param ngram: e.g. 'unigrams'
    :return: the term frequency
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', 'all_grams']
    numerator = X_raw.loc[document][term]
    terms: dict = all_data.loc[all_data['title'] == document][ngram].values[0]
    # sum the counts: len(terms) would only give the number of distinct n-grams
    denominator = sum(terms.values())
    return numerator / denominator

In [None]:
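def idf(term: Tuple[str, ...], all_data: pd.DataFrame) -> float:
    """Calculates the inverse document frequency of a term.

    Computed as ln(total documents / documents containing the term) + 1.
    The term must occur in at least one document, otherwise the division fails.

    :param term: the term to calculate the inverse document frequency of
    :param all_data: the dataframe containing all the data (uses the 'all_grams' column)
    :return: the inverse document frequency of the term
    """
    total_documents = all_data.shape[0]
    # 1 for every document whose n-gram dictionary contains the term
    has_term = all_data['all_grams'].map(lambda x: 1 if term in x else 0)
    docs_with_term = has_term.sum()
    return np.log(total_documents / docs_with_term) + 1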

In [None]:
def tf_idf(term: Tuple[str, ...], document: str, X_raw: pd.DataFrame, all_data: pd.DataFrame) -> float:
    """Calculates the tf-idf score of a term in a document.

    Computed as ln(1 + raw count) * idf, so a term that is absent from the
    document gets a score of 0.

    :param term: the term to calculate the score of
    :param document: the title of the document to calculate the score in
    :param X_raw: the raw count matrix
    :param all_data: the dataframe containing all the data
    :return: the tf-idf score of the term in the document
    """
    return (np.log(1 + X_raw.loc[document][term])) * idf(term, all_data)
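
To make the weighting concrete, here is a minimal scalar sketch of what `tf_idf` and `idf` compute together. The helper `demo_tf_idf` is hypothetical, and the document frequency of 14 is an assumed value; 17 is the number of rows in the count matrix.

```python
import math


def demo_tf_idf(count: int, n_docs: int, docs_with_term: int) -> float:
    """ln(1 + count) * (ln(n_docs / docs_with_term) + 1), as in tf_idf and idf above."""
    idf = math.log(n_docs / docs_with_term) + 1
    return math.log(1 + count) * idf


# Assumed inputs: 17 documents and a term that occurs in 14 of them.
print(round(demo_tf_idf(1, 17, 14), 6))  # 0.827726
print(round(demo_tf_idf(0, 17, 14), 6))  # 0.0
```

With these assumed inputs, a count of 1 gives 0.827726, and a count of 0 always gives 0 because `ln(1 + 0) = 0`.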

In [None]:
def get_tfidf_matrix(ngram: str = "all_grams") -> pd.DataFrame:
    """Returns a tf-idf matrix for the given n-gram column.

    The matrix contains the 1700 most frequent n-grams as columns and the titles as rows.
    Counts are read from the global X_raw, so X_raw must be built from the same column.

    :param ngram: the column to use. Can be 'unigrams', 'bigrams', 'trigrams' or 'all_grams'
    :return: a tf-idf matrix. The columns are the n-grams, the rows are the titles.
    """
    assert ngram in ['unigrams', 'bigrams', 'trigrams', "all_grams"]
    mfw = get_most_frequent(df[ngram], 1700)
    # size the matrix from the data instead of hard-coding 17 x 1700
    arr = np.zeros([len(df), len(mfw)])
    # Documents / Articles
    for doc_idx, (_, doc) in enumerate(df.iterrows()):
        # Most frequent n-grams, e.g. (160, ('behavior',)), (156, ('analysis',)), ...
        for w_idx, (_, word) in enumerate(mfw):
            arr[doc_idx][w_idx] = tf_idf(term=word, document=doc['title'], X_raw=X_raw, all_data=df)
    words_ls = [words for _, words in mfw]
    return pd.DataFrame(arr, columns=words_ls, index=df['title'])


X_tfidf = get_tfidf_matrix()
X_tfidf

title,"(behavior,)","(analysis,)","(data,)","(business,)","(also,)","(text,)","(anpr,)","(use,)","(used,)","(plate,)",...,"(defense,)","(dedicated,)","(decades,)","(decade,)","(dbt,)","(dataset,)","(data, science)","(data, mining, techniques)","(data, entry)","(data, centre)"
Anomaly Detection at Multiple Scales,0.0,0.827726,0.779904,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Behavioral analytics,11.215445,4.100716,1.236118,1.133981,4.179991,0.0,0.0,3.670286,3.92293,0.0,...,0.0,0.0,1.895481,2.176528,5.313962,0.0,0.0,0.0,0.0,0.0
Business analytics,0.0,4.806898,2.189466,7.87268,2.786661,0.0,0.0,2.121901,4.317468,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
CORE (research service),0.0,0.827726,2.590784,0.0,1.758187,4.245084,0.0,2.463451,1.133981,0.0,...,0.0,0.0,0.0,0.0,0.0,2.176528,0.0,0.0,0.0,0.0
Daisy Intelligence,0.0,0.827726,0.779904,0.0,0.879094,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data Applied,0.0,1.311914,2.189466,1.133981,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Data mining in agriculture,0.0,1.311914,3.370688,0.0,2.272424,0.0,0.0,2.978465,3.767003,0.0,...,0.0,0.0,1.895481,0.0,0.0,0.0,0.0,3.449715,0.0,0.0
Data thinking,0.0,1.92192,3.899518,2.931298,2.041192,0.0,0.0,2.121901,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.313962,0.0,0.0,0.0
Document processing,0.0,0.827726,2.590784,1.797317,3.151518,3.972486,0.0,2.463451,3.183487,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.313962,0.0
Equifax Workforce Solutions,0.0,0.0,0.0,2.633022,2.467928,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


c) Using SVD, decompose and truncate your numerical features $X = U \Sigma V^\top$ into ($docs \times topics$) and ($topics \times terms$) matrices (left/right singular vectors, respectively) using $6$ topics. List $5$ most significant $n$-grams for each topic, measured by values of the ($topics \times terms$) matrix. Do this for both features $X_\mathrm{raw}$ and $X_\mathrm{tf-idf}$.

In [None]:
def get_left_right_svd(X):
    """Performs a truncated SVD with 6 components.
    Returns the document-topic and topic-term matrices for matrix X.

    :param X: matrix (raw n-gram count or tf-idf) of size (docs, terms)
    :return left: documents projected onto the topics, i.e. the left singular
        vectors scaled by the singular values (U * Sigma), size (docs, topics)
    :return right: right singular vectors of X (rows of V^T), size (topics, terms)
    """
    svd = TruncatedSVD(n_components=6, random_state=42)
    # fit_transform returns U * Sigma. (docs x topics)
    left = svd.fit_transform(X)
    # components_ holds the right singular vectors. (topics x terms)
    right = svd.components_

    return left, right
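
For reference, the relation between the two returned matrices and the factorization $X = U \Sigma V^\top$ can be sketched with plain NumPy. This is an illustration on a random matrix, not the notebook's method: `np.linalg.svd` computes an exact SVD, while `TruncatedSVD` uses a randomized solver by default, so values may differ slightly and signs may be flipped.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))          # hypothetical (docs x terms) matrix
k = 2                           # number of topics to keep

# Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :k] * s[:k]    # (docs x topics), scaled by singular values
topic_term = Vt[:k, :]          # (topics x terms), rows are right singular vectors

X_k = doc_topic @ topic_term    # best rank-k approximation in the Frobenius norm
print(doc_topic.shape, topic_term.shape)
```

The rows of `topic_term` are orthonormal, and the approximation error of `X_k` is determined by the discarded singular values `s[k:]`.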

In [None]:
def get_top5_ngrams_svd(X):
    """
    Returns top 5 most significant n-grams for the 6 topics.

    :param X: DataFrame (raw n-gram count or tf-idf) of size (docs, terms)
    :return top5: list of lists. For every topic (6) returns the top 5 most
        significant n-grams based on X's SVD.
    """
    # perform SVD; only the (topics x terms) matrix is needed here
    _, right = get_left_right_svd(X)

    # column labels of X are the n-grams, in the same order as the columns of `right`
    terms = list(X.columns)

    top5 = []
    for topic in right:
        # indices of the 5 largest values; the stable sort keeps the first column on ties
        order = np.argsort(-topic, kind='stable')[:5]
        top5.append([terms[idx] for idx in order])

    print('Top 5 n-grams for every topic:')
    for i, grams in enumerate(top5):
        print(f'Topic {i+1}: {grams}')

    return top5


In [None]:
# get top 5 most significant words per topics
# raw n-gram counts
print('Raw n-gram counts:')
top5_raw = get_top5_ngrams_svd(X_raw)

Raw n-gram counts:
Top 5 n-grams for every topic:
Topic 1: [('anpr',), ('plate',), ('behavior',), ('license',), ('system',)]
Topic 2: [('behavior',), ('behaviorism',), ('skinner',), ('analysis',), ('stimulus',)]
Topic 3: [('business',), ('analysis',), ('requirements',), ('business', 'analysis'), ('text',)]
Topic 4: [('text',), ('mining',), ('text', 'mining'), ('data',), ('information',)]
Topic 5: [('access',), ('core',), ('open',), ('open', 'access'), ('content',)]
Topic 6: [('document',), ('processing',), ('data',), ('document', 'processing'), ('also',)]




In [None]:
# get top 5 most significant n-grams per topic
# tf-idf features
print('tf-idf n-gram counts:')
top5_tfidf = get_top5_ngrams_svd(X_tfidf)

tf-idf n-gram counts:
Top 5 n-grams for every topic:
Topic 1: [('anpr',), ('plate',), ('cameras',), ('license', 'plate'), ('plates',)]
Topic 2: [('behaviorism',), ('skinner',), ('stimulus',), ('conditioning',), ('cognitive',)]
Topic 3: [('business', 'analysis'), ('business', 'analysts'), ('business',), ('business', 'analyst'), ('costs',)]
Topic 4: [('text', 'mining'), ('text',), ('text', 'analytics'), ('copyright',), ('sentiment',)]
Topic 5: [('agriculture',), ('fruit',), ('pesticide',), ('defects',), ('cotton',)]
Topic 6: [('talx',), ('sec',), ('cohen',), ('inc',), ('equifax',)]




d) Compare and comment on the results with respect to the selection of features and the $n$ value in $n$-grams. How many mono-/bi-/tri-grams were there among the 1700 features, and among the topics?

The top 5 n-grams per topic differ between the two feature sets, although several topics share terms (for example, `anpr` and `plate` lead Topic 1 in both). $X_\mathrm{raw}$ holds raw n-gram counts, so frequent but uninformative words such as `also` and `data` reach the top lists. $X_\mathrm{tf-idf}$ weights each count by how rare the term is across documents, which favours terms that are specific to a single article. Topics 5 and 6, for instance, change from open-access and document-processing vocabulary to agriculture and Equifax vocabulary.

Mono-grams dominate the top lists for both feature sets: 26 of the 30 listed n-grams for $X_\mathrm{raw}$ and 24 of 30 for $X_\mathrm{tf-idf}$. The remaining entries are bi-grams (4 and 6, respectively), and neither list contains a tri-gram. This is expected, because only the 5 most significant n-grams are listed per topic and tri-grams are far less frequent: the most common tri-gram occurs 8 times, against 160 occurrences of the most common mono-gram. The number of mono-/bi-/tri-grams among the 1700 features is computed below.
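
The split among the topic lists can be tallied directly from the two printed results. The sketch below copies those lists by hand; `size_counts`, `raw_topics` and `tfidf_topics` are hypothetical names introduced for this example.

```python
from collections import Counter

# Top-5 lists copied from the printed raw-count result.
raw_topics = [
    [("anpr",), ("plate",), ("behavior",), ("license",), ("system",)],
    [("behavior",), ("behaviorism",), ("skinner",), ("analysis",), ("stimulus",)],
    [("business",), ("analysis",), ("requirements",), ("business", "analysis"), ("text",)],
    [("text",), ("mining",), ("text", "mining"), ("data",), ("information",)],
    [("access",), ("core",), ("open",), ("open", "access"), ("content",)],
    [("document",), ("processing",), ("data",), ("document", "processing"), ("also",)],
]
# Top-5 lists copied from the printed tf-idf result.
tfidf_topics = [
    [("anpr",), ("plate",), ("cameras",), ("license", "plate"), ("plates",)],
    [("behaviorism",), ("skinner",), ("stimulus",), ("conditioning",), ("cognitive",)],
    [("business", "analysis"), ("business", "analysts"), ("business",),
     ("business", "analyst"), ("costs",)],
    [("text", "mining"), ("text",), ("text", "analytics"), ("copyright",), ("sentiment",)],
    [("agriculture",), ("fruit",), ("pesticide",), ("defects",), ("cotton",)],
    [("talx",), ("sec",), ("cohen",), ("inc",), ("equifax",)],
]


def size_counts(topics):
    """Count how many listed n-grams have length 1, 2 and 3."""
    sizes = Counter(len(gram) for topic in topics for gram in topic)
    return {n: sizes.get(n, 0) for n in (1, 2, 3)}


print(size_counts(raw_topics))    # {1: 26, 2: 4, 3: 0}
print(size_counts(tfidf_topics))  # {1: 24, 2: 6, 3: 0}
```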

In [None]:
def count_grams(data):
    """
    Prints the number of mono-/bi-/tri-grams.

    :param data: list of (frequency, n-gram) tuples, e.g. the 1700 selected features
    :return: None; the counts are printed
    """
    mono, bi, tri = 0, 0, 0
    for _, gram in data:
        # the length of the tuple is the n of the n-gram
        if len(gram) == 1:
            mono += 1
        elif len(gram) == 2:
            bi += 1
        elif len(gram) == 3:
            tri += 1
    print("Number of mono-grams: " + str(mono))
    print("Number of bi-grams: " + str(bi))
    print("Number of tri-grams: " + str(tri))

In [None]:
all_grams = get_most_frequent(df['all_grams'], 1700)
count_grams(all_grams)

Number of mono-grams: 1429
Number of bi-grams: 231
Number of tri-grams: 40


**The format of tf-idf you need to use:**

$$ f(tf, idf) = (1 + \ln(tf)) \cdot idf $$

where:

$$ tf = \frac{\text{number of times term } w \text{ appears in a document}}{\text{total number of terms in that document}} $$

and

$$ idf = \ln\left(\frac{\text{total number of documents}}{\text{number of documents with term } w \text{ in them} + 1} + 1\right) $$
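
For comparison, the format stated above can be written as a small scalar function. This is a sketch under stated assumptions: `spec_tf_idf` is a hypothetical helper, the example numbers are made up, and returning 0 for an absent term is a choice made here because $\ln(0)$ is undefined.

```python
import math


def spec_tf_idf(count: int, doc_len: int, n_docs: int, docs_with_term: int) -> float:
    """(1 + ln(tf)) * idf, with tf = count / doc_len and
    idf = ln(n_docs / (docs_with_term + 1) + 1)."""
    if count == 0:
        return 0.0  # assumption: ln(0) is undefined, so absent terms get weight 0
    tf = count / doc_len
    idf = math.log(n_docs / (docs_with_term + 1) + 1)
    return (1 + math.log(tf)) * idf


# Hypothetical numbers: a term seen 5 times in a 100-term document,
# present in 3 of 17 documents.
print(spec_tf_idf(5, 100, 17, 3))
```

Since $tf \le 1$, $\ln(tf) \le 0$, and the weight turns negative whenever $tf < 1/e \approx 0.37$, as it does for the example values. The notebook's `tf_idf` function uses $\ln(1 + \text{raw count})$ and $\ln(N / df) + 1$ instead, which keeps all weights non-negative.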