## Topic Modelling

- Topic modelling in NLP discovers abstract themes ("topics") in a collection of documents, using algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA).
- It is unsupervised: it groups words that tend to co-occur and the documents that share them, which helps with organizing, summarizing, and analyzing large text datasets.

In [1]:
%load_ext watermark
%watermark -v -p numpy,pandas,polars,omegaconf --conda

Python implementation: CPython
Python version       : 3.11.8
IPython version      : 8.22.2

numpy    : 1.26.4
pandas   : 2.2.1
polars   : 0.20.18
omegaconf: 2.3.0

conda environment: torch_p11



In [2]:
# Built-in library
from pathlib import Path
import re
import json
from typing import Any, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import numpy.typing as npt
from pprint import pprint
import pandas as pd
import polars as pl
from rich.console import Console
from rich.theme import Theme

custom_theme = Theme(
    {
        "info": "#76FF7B",
        "warning": "#FBDDFE",
        "error": "#FF0000",
    }
)
console = Console(theme=custom_theme)

# Visualization
import matplotlib.pyplot as plt

# NumPy settings
np.set_printoptions(precision=4)

# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

# Polars settings
pl.Config.set_fmt_str_lengths(1_000)
pl.Config.set_tbl_cols(n=1_000)

warnings.filterwarnings("ignore")


# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

### Latent Dirichlet Allocation (LDA)

- LDA is a generative probabilistic model that tries to find groups of words that appear frequently together across different documents.
- It assumes that each document is a mixture of topics, and each topic is a distribution over words.
- LDA treats each document as a "bag of words", i.e., word order is ignored and only word frequencies are considered.
  - This is a simplification, as word order can be important for understanding meaning.
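
The generative story in the bullets above can be sketched in a few lines of NumPy. This is a toy illustration rather than how scikit-learn fits the model: the vocabulary, the two topic distributions, and the `generate_document` helper are all made up for the example.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = np.array(["actor", "plot", "scene", "goal", "team", "match"])
# Each row is one topic, i.e. a probability distribution over the vocabulary.
topic_word = np.array(
    [
        [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # a "film"-like topic
        [0.00, 0.00, 0.00, 0.30, 0.40, 0.30],  # a "sport"-like topic
    ]
)


def generate_document(n_words: int, alpha: float = 0.5) -> list[str]:
    """Sample one document following the LDA generative story."""
    n_topics = topic_word.shape[0]
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(alpha * np.ones(n_topics))
    words: list[str] = []
    for _ in range(n_words):
        # 2. Pick a topic for this word slot according to the mixture.
        z = rng.choice(n_topics, p=theta)
        # 3. Pick a word from that topic's word distribution.
        words.append(str(rng.choice(vocab, p=topic_word[z])))
    return words


doc = generate_document(n_words=8)

# Bag of words: shuffling the document leaves the word counts unchanged.
shuffled = [str(w) for w in rng.permutation(doc)]
print(Counter(doc) == Counter(shuffled))
```

Fitting LDA runs this story in reverse: given only the observed word counts, it estimates the topic mixtures and the topic-word distributions that could plausibly have produced them.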

In [3]:
fp: str = "../../data/ImDB_data.parquet"
df: pl.DataFrame = pl.read_parquet(fp)
df.head(2)

review,sentiment
str,i64
"""I felt this movie was as much about human sexuality as anything else, whether intentionally or not. We are also shown how absurd and paradoxical it is for women not to be allowed to such a nationally important event, meanwhile forgetting the pasts of our respective ""advanced"" nations. I write from Japan, where women merely got the right to vote 60 years ago, and female technical engineers are a recent phenomenon. Pubs in England were once all-male, the business world was totally off-limits for women in America until rather recently, and women in China had their feet bound so they couldn't develop feet strong enough to escape their husbands. Iran is conveniently going through this stage in our time, and we get a good look at how ridiculous we have all looked at one time or another. Back to the issue of sexuality, we are made to wonder what it may be intrinsically about women that make them unfit for a soccer game (the official reason is that the men are bad). Especially such boyish gir…",1
"""Let's face it, a truly awful movie, no...I mean a ""truly"" awful movie, is a rare, strange, and beautiful thing to behold. I admite that there is a special place in my heart for films like Plan 9 From Outer Space, Half Caste, Species, etc. And although I'm giving this film a 1, I highly urge anyone who enjoys a bad film for what it truly is (a bad film) to find a friend, snacks, something to drink, and make the special occasion it deserves out of: Aussie Park Boyz. <br /><br />From the very first moments of the lead actor's side to side eye-rolling performance as he attempts to inject intensity directly into the film without ever looking at a camera (a slice of ham straight out of silent pictures--eat your heart out Rudolph Valentino) to the sudden hey-we're-out-of-film conclusion, you...will...not...stop...laughing. <br /><br />To sum the film up, its a poor man's Warriors down under, complete--and that description alone should be enough, but then comes the wonders of ""the spaghetti e…",0


In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.sparse import csr_matrix


count: CountVectorizer = CountVectorizer(
    stop_words="english", max_df=0.1, max_features=10_000
)
X: csr_matrix = count.fit_transform(df["review"].to_list())


lda = LatentDirichletAllocation(
    n_components=10, random_state=123, learning_method="online"
)
X_topics = lda.fit_transform(X)

In [5]:
# One weight per vocabulary term for each topic, i.e. 10 topics x 10_000 features.
lda.components_.shape

(10, 10000)
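
The rows of `lda.components_` are unnormalized word weights (pseudo-counts), so they do not sum to 1. A minimal sketch of turning them into per-topic word distributions is shown below; the small `components` array is a made-up stand-in for `lda.components_`, used so the snippet runs on its own.

```python
import numpy as np

# Stand-in for `lda.components_`: 3 topics x 4 vocabulary terms (made-up values).
components = np.array(
    [
        [8.0, 1.0, 0.5, 0.5],
        [0.2, 6.0, 3.0, 0.8],
        [1.0, 1.0, 1.0, 7.0],
    ]
)

# Divide each row by its sum so every topic becomes a distribution over terms.
topic_word = components / components.sum(axis=1, keepdims=True)

print(topic_word.shape)
print(topic_word.sum(axis=1))
```

The same expression should apply unchanged to the fitted model by replacing `components` with `lda.components_`.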

In [7]:
np.argsort?

Signature:       np.argsort(a, axis=-1, kind=None, order=None)
Call signature:  np.argsort(*args, **kwargs)
Type:            _ArrayFunctionDispatcher
String form:     <function argsort at 0x104b36d40>
File:            ~/miniconda3/envs/torch_p11/lib/python3.11/site-packages/numpy/core/fromnumeric.py
Docstring:
Returns the indices that would sort an array.

Perform an indirect sort along the given axis using the algorithm specified
by the `kind` keyword. It returns an array of indices of the same shape as
`a` that index data along the given axis in sorted order.

Parameters
-------
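
Since `argsort` returns indices in ascending order of value, the top-`n` entries are the last `n` indices read backwards. The small sketch below, with made-up weights, shows how a slice of the form `[: -n - 1 : -1]` picks them out.

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])  # made-up weights for one topic
n_top = 3

ascending = weights.argsort()  # indices from smallest to largest weight
top = ascending[: -n_top - 1 : -1]  # walk backwards from the end for n_top steps

print(top)  # [1 3 4]
print(weights[top])  # [0.7 0.4 0.3]
```

An equivalent and arguably more readable spelling is `ascending[::-1][:n_top]`.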

In [8]:
n_top_words: int = 5
feature_names: np.ndarray = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx + 1}:")
    print(
        " ".join([feature_names[i] for i in topic.argsort()[: -n_top_words - 1 : -1]])
    )

Topic 1:
girl sex guy women woman
Topic 2:
horror series original episode tv
Topic 3:
role performance john comedy played
Topic 4:
american war english french history
Topic 5:
worst minutes watched ll maybe
Topic 6:
action game space fight effects
Topic 7:
art documentary human reality subject
Topic 8:
book kids comedy read children
Topic 9:
family father mother beautiful novel
Topic 10:
killer murder death police dead
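
Besides the top words per topic, the document-topic matrix returned by `fit_transform` (`X_topics` above) can be used to give each review a dominant topic. The sketch below uses a small made-up matrix in place of `X_topics` and assumes each row is a topic distribution for one document.

```python
import numpy as np

# Stand-in for `X_topics` (documents x topics); each row sums to 1.
doc_topic = np.array(
    [
        [0.70, 0.20, 0.10],
        [0.05, 0.15, 0.80],
        [0.30, 0.60, 0.10],
        [0.10, 0.10, 0.80],
    ]
)

# Most probable topic for each document.
dominant = doc_topic.argmax(axis=1)
# Number of documents assigned to each topic.
counts = np.bincount(dominant, minlength=doc_topic.shape[1])

print(dominant)  # [0 2 1 2]
print(counts)  # [1 1 2]
```

Sorting a single column of the matrix in descending order would likewise surface the reviews that are most representative of that topic.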


In [None]:
from glob import glob


def create_id_text_mapping(filepath: str) -> pl.DataFrame:
    """
    Create a mapping DataFrame from a Parquet file.

    This function reads a Parquet file, processes the data, and creates a mapping
    DataFrame with 'id', 'text', and 'label' columns.

    Parameters
    ----------
    filepath : str
        The path to the Parquet file.

    Returns
    -------
    pl.DataFrame
        A DataFrame with shape (1, 3) containing 'id', 'text', and 'label' columns.
    """
    pattern: str = r"salary|gigworker"
    delimiter: str = "|"

    df: pl.DataFrame = (
        pl.scan_parquet(filepath)
        .with_columns(tags=pl.col("tags").map_elements(lambda x: "".join(x)))
        .filter(pl.col("tags").str.to_lowercase().str.contains(pattern))
        .with_columns(
            label=pl.col("tags")
            .str.extract_all(pattern)
            .map_elements(lambda x: "".join(set(x)))
        )
        .drop("tags")
        .collect()
    )
    # Some files name the identifier column "analysisId"; normalise it to "id".
    if "analysisId" in df.columns:
        df = df.rename({"analysisId": "id"})

    df_grpby: pl.DataFrame = df.group_by("id").agg(
        text=(pl.struct(["date", "description", "amount"]))
    )
    body: list[str] = []
    for row in df_grpby.select("text").to_dicts():
        for data_ in row["text"]:
            date: str = data_["date"]
            description: str = data_["description"]
            amount: float = data_["amount"]
            b_str: str = f"{date} {delimiter} {description} {delimiter} {amount} "
            value: str = f"{b_str}\n"
            body.append(value)

    # `id_` avoids shadowing the built-in `id`.
    id_: str = str(df.select("id").unique().to_numpy().squeeze())
    label: str = str(df.select("label").unique().to_numpy().squeeze())
    data: dict[str, Any] = {"id": id_, "text": body, "label": label}
    mapping_df: pl.DataFrame = pl.DataFrame([data])

    return mapping_df


def create_dataset(
    filepath: str = "./data/*.parquet", output_path: str | None = None
) -> pl.DataFrame:
    """
    Create a dataset by combining multiple Parquet files and save as JSONL.

    Parameters
    ----------
    filepath : str, optional
        Glob pattern for input Parquet files, by default "./data/*.parquet"
    output_path : str | None, optional
        Path to save the output JSONL file, by default None

    Returns
    -------
    pl.DataFrame
        Combined DataFrame from all input files

    Notes
    -----
    If output_path is None, the function will use "output.jsonl" as default.
    """

    files: list[str] = glob(filepath)

    # Build one single-row frame per file, then concatenate them in one pass.
    frames: list[pl.DataFrame] = [create_id_text_mapping(filepath=f) for f in files]
    all_df: pl.DataFrame = pl.concat(frames) if frames else pl.DataFrame()

    # Convert the DataFrame to JSONL
    if output_path is None:
        output_path = "output.jsonl"
    all_df.write_ndjson(output_path)
    return all_df
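
The JSONL (newline-delimited JSON) layout produced by `write_ndjson` is one JSON object per line, so it can be read back with the standard library alone. The round trip below is a minimal sketch; the two records are hypothetical and only mimic the `id`, `text`, and `label` fields built above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records shaped like the rows built above.
records = [
    {"id": "a1", "text": ["2024-01-05 | ACME PAYROLL | 2500.0 \n"], "label": "salary"},
    {"id": "b2", "text": ["2024-01-07 | RIDE PAYOUT | 84.5 \n"], "label": "gigworker"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "output.jsonl"
    # Write one JSON object per line.
    path.write_text(
        "".join(json.dumps(r) + "\n" for r in records), encoding="utf-8"
    )
    # Read the file back line by line.
    loaded = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line
    ]

print(loaded == records)
```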