# Topic modeling with AutoTM

Topic modeling uncovers the hidden structure of a text corpus by representing it as a set of topics and describing each text as a mixture of those topics. This makes large collections easier to interpret and helps extract meaningful insights from them.

In this tutorial we will train a topic model on the 20 Newsgroups dataset to find its main topics.

### Installation

The pip package is currently available only for Linux. You should also download the ```en_core_web_sm``` model for ```spacy``` so that the dataset is preprocessed correctly.

In [None]:
! pip install autotm
! python -m spacy download en_core_web_sm

In [1]:
from autotm.base import AutoTM
import pandas as pd
import logging

Now let's download the nltk resource required for English datasets.

In [2]:
import nltk 
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger()

### Dataset

First, let's download the dataset from Hugging Face Datasets. If you do not have the Datasets library, install it first with ```pip install --quiet datasets```; alternatively, you can load your own ```csv``` dataset.
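
For the CSV route, a minimal sketch with pandas might look like the following. The in-memory CSV and its two rows are made-up stand-ins for a real file, and the column name "text" is only an assumption that matches the ```texts_column_name``` used later in this tutorial.

```python
import io

import pandas as pd

# Stand-in for a file on disk; in practice pass a path such as
# "my_documents.csv" (hypothetical name) to pd.read_csv instead.
csv_source = io.StringIO(
    "text\n"
    '"The graphics card driver crashes on startup."\n'
    '"The team won the hockey game last night."\n'
)

own_dataset = pd.read_csv(csv_source)

# Drop rows without text so that every remaining row is a usable document.
own_dataset = own_dataset.dropna(subset=["text"]).reset_index(drop=True)
print(own_dataset.shape)
```

The resulting DataFrame can then be used in place of ```pd_dataset``` in the cells that follow.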

In [4]:
from datasets import load_dataset

dataset = load_dataset("SetFit/20_newsgroups")
pd_dataset = dataset['train'].to_pandas()

Repo card metadata block was not found. Setting CardData to empty.


In [None]:
pd_dataset.shape

In the 20 Newsgroups dataset the text is stored in the "text" column, and we will use it as the modeling target.

The basic object in AutoTM is ```AutoTM```, which can be used with default parameters or configured for your specific dataset.
- At a minimum, set ```topic_count``` (the number of topics to obtain), ```texts_column_name``` (the name of the column that contains the text to process) and ```working_dir_path``` (where the results are stored).
- AutoTM implements dataset preprocessing itself, so you only need to define the language (special preprocessing is implemented for 'en' and 'ru').
- You can also adjust ```alg_params``` to switch the algorithm from genetic to Bayesian or to select another way of calculating quality.

In [5]:
autotm = AutoTM(
        topic_count=25,
        texts_column_name='text',
        preprocessing_params={
            "lang": "en", # languages with special preprocessing: "ru" and "en"; do not set this parameter for a dataset in another language
        },
        working_dir_path='tmp',
        alg_params={
            "num_iterations": 50, # number of iterations (50 here); omit it to use the default value
            "use_pipeline": True, # latest GA-based algorithm (the default); set it to False to use the previous version
            # "individual_type": "llm", # if you want to use llm as a quality measure 
            # "surrogate_name": "random-forest-regressor" # enable surrogate modeling to speed up computation
        },
    )

If you have worked with the ```sklearn``` library, then the ```AutoTM``` methods ```fit```, ```predict``` and their combined version ```fit_predict``` should feel familiar. As a result of ```fit``` you get a fitted ```autotm``` model that is tuned to your data.
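
The two-step variant of that workflow can be sketched as a small helper. This is only an illustration: ```fit_then_predict``` is a hypothetical name, and it assumes that ```fit``` and ```predict``` accept the same kind of input as ```fit_predict``` does in this tutorial.

```python
def fit_then_predict(model, train_docs, new_docs):
    """Tune the model on train_docs, then infer topic mixtures for new_docs."""
    model.fit(train_docs)           # tunes the model to the training data
    return model.predict(new_docs)  # assumption: returns mixtures for new_docs


# Hypothetical usage with the objects from this tutorial:
# sample = pd_dataset.sample(1000)
# new_mixtures = fit_then_predict(autotm, sample.iloc[:900], sample.iloc[900:])
```

Splitting the steps this way is useful when the documents you want to describe arrive after the model has been tuned.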

Let's process the dataset

In [None]:
mixtures = autotm.fit_predict(pd_dataset.sample(1000)) # we will do the modeling on 1000 random examples from the 20 Newsgroups dataset

2024-06-09 23:43:20 - autotm.base - INFO - Stage 0: Create working dir tmp if not exists
2024-06-09 23:43:20 - autotm.base - INFO - Stage 1: Dataset preparation
  return bound(*args, **kwds)
2024-06-09 23:43:31 - autotm.preprocessing.text_preprocessing - INFO - Saved to tmp/01cefc7c-1438-4248-8789-9aeb6997bba7/dataset_processed.csv
2024-06-09 23:43:31 - autotm.preprocessing.dictionaries_preparation - INFO -  batches tmp/01cefc7c-1438-4248-8789-9aeb6997bba7/batches 
 vocabulary tmp/01cefc7c-1438-4248-8789-9aeb6997bba7/test_set_data_voc.txt 
 are ready
E0609 23:43:31.881521 52299 dictionary_operations.cc:381] Error at line 1, file tmp/01cefc7c-1438-4248-8789-9aeb6997bba7/test_set_data_voc.txt. Expected format: <token> [<class_id>], dictionary will be gathered in random token order
2024-06-09 23:43:32 - autotm.preprocessing.cooc - INFO - Calculating cooc: 10 batches were found in tmp/01cefc7c-1438-4248-8789-9aeb6997bba7/batches, start processing
2024-06-09 23:43:35 - autotm.preprocessing.

Now let's look at the resulting topics. We requested 25 topics, so they are available under the keys "main0" to "main24". A key outside that set, such as "back0" in the next cell, raises a ```KeyError```.

In [8]:
print(autotm.topics['back0'])

KeyError: 'back0'

In [27]:
print(autotm.topics['main5'])

['get', 'also', 'new', 'know', 'two', 'people', 'well', 'think', 'make', 'even', 'would', 'see', 'time', 'right', 'give', 'need', 'do', 'work', 'find', 'god', 'want', 'window', 'include', 'back', 'use', 'really', 'case', 'one', 'please', 'way', 'game', 'post', 'since', 'fact', 'etc', 'mean', 'space', 'ask', 'chip', 'cause', 'day', 'state', 'name', 'say', 'put', 'order', 'drive', 'go', 'support', 'information']
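
To scan all topics at once rather than one key at a time, a helper along these lines may be convenient. It is a sketch that assumes ```autotm.topics``` behaves like a dictionary mapping topic names to word lists, as the lookups above suggest; ```summarize_topics``` and the toy data are illustrative, not part of AutoTM.

```python
def summarize_topics(topics, top_n=10, prefix="main"):
    """Return the first top_n stored words of every topic whose key starts with prefix."""
    return {
        name: words[:top_n]
        for name, words in topics.items()
        if name.startswith(prefix)
    }


# Toy stand-in for autotm.topics
toy_topics = {
    "main0": ["window", "drive", "chip", "support"],
    "main1": ["game", "team", "play", "win"],
}

for name, words in summarize_topics(toy_topics, top_n=3).items():
    print(name, ", ".join(words))
```

With a fitted model, passing ```autotm.topics``` instead of ```toy_topics``` would print a short word list per topic.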


In [17]:
mixtures

Unnamed: 0,main0,main1,main2,main3,main4,main5,main6,main7,main8,main9,...,main15,main16,main17,main18,main19,main20,main21,main22,main23,main24
300,0.040013,0.039991,0.039977,0.040017,0.039987,0.040010,0.040007,0.039989,0.040043,0.039974,...,0.040000,0.039997,0.040009,0.040018,0.040017,0.039996,0.039998,0.040009,0.039990,0.040011
301,0.039987,0.040000,0.039999,0.040006,0.039980,0.040005,0.039993,0.040001,0.039997,0.040015,...,0.039999,0.040011,0.040001,0.040015,0.039994,0.039992,0.040013,0.039992,0.040004,0.039996
302,0.039994,0.040013,0.040014,0.039979,0.039965,0.039984,0.039986,0.039996,0.040050,0.040006,...,0.039965,0.040004,0.040048,0.040026,0.040020,0.039977,0.040009,0.039967,0.039987,0.039990
303,0.040001,0.040000,0.040001,0.039996,0.040004,0.040004,0.040002,0.039997,0.040002,0.039998,...,0.039999,0.039997,0.039999,0.040000,0.039998,0.040000,0.040000,0.040000,0.040004,0.039999
304,0.040010,0.040009,0.040002,0.039993,0.039984,0.040017,0.040008,0.040002,0.039996,0.039998,...,0.040012,0.040021,0.039996,0.039998,0.040004,0.040002,0.039989,0.040009,0.039992,0.039980
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
595,0.039995,0.040007,0.040051,0.039997,0.040017,0.039974,0.039970,0.039976,0.040007,0.040028,...,0.039988,0.040037,0.039951,0.040013,0.039995,0.039979,0.039991,0.039989,0.039983,0.040013
596,0.040006,0.040001,0.040007,0.039995,0.039999,0.040005,0.039996,0.040007,0.039998,0.040000,...,0.039999,0.039998,0.039990,0.039998,0.040008,0.039997,0.040013,0.039995,0.039991,0.039998
597,0.040013,0.039992,0.040032,0.039978,0.039998,0.039989,0.039985,0.040022,0.039992,0.040025,...,0.039977,0.040013,0.040019,0.039997,0.039979,0.040021,0.040022,0.039993,0.040013,0.039989
598,0.039991,0.040000,0.040021,0.039992,0.039988,0.040005,0.040011,0.039996,0.040022,0.040028,...,0.040002,0.040004,0.040008,0.040025,0.039996,0.039985,0.039982,0.039995,0.040011,0.039989
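
Each row of ```mixtures``` holds the topic weights of one document. A common next step is to pick the strongest topic per document; the sketch below does that with pandas on a toy frame. ```dominant_topics``` is a hypothetical helper, and it assumes the topic columns are the ones whose names start with "main". Note that in the output above every weight is close to 0.04, that is 1/25, so the dominant topic of those rows would carry little signal.

```python
import pandas as pd


def dominant_topics(mixtures: pd.DataFrame) -> pd.DataFrame:
    """Return the highest-weight topic and its weight for every document."""
    topic_cols = [c for c in mixtures.columns if str(c).startswith("main")]
    weights = mixtures[topic_cols]
    return pd.DataFrame(
        {
            "dominant_topic": weights.idxmax(axis=1),  # column name of the largest weight
            "weight": weights.max(axis=1),             # the largest weight itself
        }
    )


# Toy stand-in for the mixtures DataFrame
toy_mixtures = pd.DataFrame(
    {"main0": [0.7, 0.2], "main1": [0.3, 0.8]},
    index=[300, 301],
)
result = dominant_topics(toy_mixtures)
print(result)
```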


To save the resulting model, call ```save```:

In [None]:
autotm.save('model_artm')

Trained model structure:
```
|model_artm
| -- artm_model
| -- | -- n_wt.bin
| -- | -- p_wt.bin
| -- | -- parameters.bin
| -- | -- parameters.json
| -- | -- score_tracker.bin
| -- autotm_data
```
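
To check that a saved model directory matches this layout, the files can be listed with the standard library. This is a generic sketch that does not depend on AutoTM; ```list_saved_files``` is a hypothetical helper name.

```python
from pathlib import Path


def list_saved_files(model_dir):
    """Return the relative paths of all files under model_dir, sorted."""
    root = Path(model_dir)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


# Hypothetical usage after autotm.save('model_artm'):
# for path in list_saved_files("model_artm"):
#     print(path)
```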