# Topic modeling with AutoTM

Topic modeling is a technique that uncovers the hidden structure of a text corpus, representing it as a set of interpretable topics and describing each text as a mixture of those topics. This makes large collections easier to interpret and helps extract meaningful insights from them.

In this tutorial we will train a topic model on a set of 20 Newsgroups posts to understand its main topics.

### Installation

The pip package is currently available only for Linux. You should also download the ```en_core_web_sm``` model for ```spacy```, which is needed for correct dataset preprocessing.

In [None]:
! pip install autotm
! python -m spacy download en_core_web_sm

In [None]:
from autotm.base import AutoTM
import pandas as pd
import logging

Now let's download the ```nltk``` resource needed for English datasets.

In [None]:
import nltk 
nltk.download('averaged_perceptron_tagger')

In [None]:
logging.basicConfig(
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger()

### Dataset

First, let's download the dataset from Hugging Face Datasets. If you do not have the Datasets library, install it first with ```pip install --quiet datasets```; alternatively, you can load your own ```csv``` dataset.
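
If you prefer to use your own data, a minimal sketch of loading a CSV file with pandas might look like the following. The file name `my_corpus.csv` and the helper `load_text_csv` are hypothetical; the only requirement taken from this tutorial is that the texts live in a single column, whose name is later passed as ```texts_column_name```.

```python
import pandas as pd


def load_text_csv(path, text_column="text"):
    """Load a CSV file and keep only rows with non-empty text.

    `path` and `text_column` are illustrative assumptions; adapt them to your file.
    """
    df = pd.read_csv(path)
    if text_column not in df.columns:
        raise KeyError(f"column {text_column!r} not found in {path}")
    # Drop missing values and strings that contain only whitespace.
    df = df.dropna(subset=[text_column])
    df = df[df[text_column].astype(str).str.strip() != ""]
    return df.reset_index(drop=True)


# Hypothetical usage:
# pd_dataset = load_text_csv("my_corpus.csv", text_column="text")
```

Filtering out empty rows up front keeps the later preprocessing step from receiving documents with no content.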

In [None]:
from datasets import load_dataset

dataset = load_dataset("SetFit/20_newsgroups")
pd_dataset = dataset['train'].to_pandas()

In [None]:
pd_dataset.shape

In the 20 Newsgroups dataset the text is in the "text" column, and we will use it as the modeling target.

AutoTM provides a basic ```AutoTM``` object that can be used with default parameters or configured for your specific dataset.
- At a minimum, set ```topic_count``` (the number of topics to obtain), ```texts_column_name``` (the name of the column that contains the text to process) and ```working_dir_path``` (where the results are stored).
- AutoTM implements its own dataset preprocessing, so you only need to specify the language (special preprocessing is implemented for 'en' and 'ru').
- You can also adjust ```alg_params``` to switch the algorithm from genetic to Bayesian or to select another way of calculating quality.

In [None]:
autotm = AutoTM(
        topic_count=25,
        texts_column_name='text',
        preprocessing_params={
            "lang": "en", # languages with special preprocessing: "ru" and "en"; if your dataset is in another language, do not set this parameter
        },
        working_dir_path='tmp',
        alg_params={
            "num_iterations": 50, # number of iterations (here 50); omit to use the default value
            "use_pipeline": True, # use the latest GA-based algorithm (the default); set to False to use the previous version
            # "individual_type": "llm", # if you want to use llm as a quality measure 
            # "surrogate_name": "random-forest-regressor" # enable surrogate modeling to speed up computation
        },
    )

If you have worked with the ```sklearn``` library, then ```fit```, ```predict``` and their combined version ```fit_predict``` in ```AutoTM``` should feel familiar. As a result of ```fit``` you get a fitted ```autotm``` model that is tuned to your data.

Let's process the dataset:

In [None]:
mixtures = autotm.fit_predict(pd_dataset.sample(1000)) # we model 1000 random examples from the 20 Newsgroups dataset
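
The exact return type of ```fit_predict``` is not shown in this tutorial. Assuming `mixtures` is a pandas DataFrame with one row per document and one column per topic (an assumption, not something confirmed here), a sketch for finding each document's dominant topic could look like this. The helper `dominant_topics` and the toy data are hypothetical.

```python
import pandas as pd


def dominant_topics(mixtures):
    """Return the highest-weighted topic and its weight for every document.

    Assumes rows are documents and columns are topics.
    """
    return pd.DataFrame(
        {
            "topic": mixtures.idxmax(axis=1),
            "weight": mixtures.max(axis=1),
        }
    )


# Synthetic stand-in for the real output, for illustration only.
toy_mixtures = pd.DataFrame(
    {"main0": [0.7, 0.1], "main1": [0.2, 0.6], "main2": [0.1, 0.3]},
    index=["doc_a", "doc_b"],
)
print(dominant_topics(toy_mixtures))
```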

Now let's look at the resulting topics. We requested 25 topics; each one can be accessed by a "mainN" key, where N is the topic index.

In [None]:
print(autotm.topics['main11'])
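
To scan all topics at once rather than one key at a time, a small helper can format them. This sketch assumes ```autotm.topics``` behaves like a dictionary that maps topic names to lists of words, which the indexing above suggests but does not confirm; `format_topics` and the toy data are hypothetical.

```python
def format_topics(topics, top_n=5):
    """Build one line per topic showing its first `top_n` words."""
    lines = []
    for name, words in topics.items():
        lines.append(f"{name}: {', '.join(list(words)[:top_n])}")
    return lines


# Toy stand-in for autotm.topics, for illustration only.
toy_topics = {
    "main0": ["space", "nasa", "orbit", "launch", "moon", "shuttle"],
    "main1": ["game", "team", "season", "player", "hockey", "score"],
}
for line in format_topics(toy_topics, top_n=3):
    print(line)
```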

To save the resulting model:

In [None]:
autotm.save('model_artm')

Trained model structure:
```
|model_artm
| -- artm_model
| -- | -- n_wt.bin
| -- | -- p_wt.bin
| -- | -- parameters.bin
| -- | -- parameters.json
| -- | -- score_tracker.bin
| -- autotm_data
```
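
Before sharing or reloading a saved model, it can be useful to check that the directory has the expected layout. The helper below is a sketch that only inspects the file system and does not call AutoTM; `missing_model_files` is a hypothetical name, and the entries it checks are a subset of the listing above.

```python
from pathlib import Path

# A subset of the entries from the listing above.
EXPECTED_ENTRIES = [
    "artm_model/n_wt.bin",
    "artm_model/p_wt.bin",
    "artm_model/parameters.json",
    "autotm_data",
]


def missing_model_files(model_dir):
    """Return the expected entries that are absent from `model_dir`."""
    root = Path(model_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# Usage: an empty list means all checked entries are present.
# print(missing_model_files("model_artm"))
```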