# Group messages through Latent Dirichlet Allocation
### Groups objects using LDA, visualises each group as a word cloud, and tags the objects with their LDA grouping.

## Expected message input:

| Property | Data Type | Description |
| :------- | :-------- | :---------- |
| object_id | string | Id of the tweet, post or comment |
| clean_text  | string | Message data to be analysed |
| {grouping_column} | string | Optional column name used to run separate LDAs per group |
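
A minimal sketch of an input dataframe matching this schema, using made-up rows. The id column is written `object_id`, as in the merge cell later in this notebook; `topic` stands in for `{grouping_column}`, and the `required` check is illustrative rather than part of phoenix.

```python
import pandas as pd

# Made-up rows for illustration; real data comes from the "objects" artifact.
example_input = pd.DataFrame(
    {
        "object_id": ["1", "2", "3"],
        "clean_text": ["flood warning issued", "new bridge opens", "river levels rising"],
        "topic": ["weather", "transport", "weather"],  # optional grouping column
    }
)

# Illustrative check that the two mandatory columns are present.
required = {"object_id", "clean_text"}
missing = required - set(example_input.columns)
print(missing or "all required columns present")
```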

## Expected grouping output:

| Property | Data Type | Description |
| :------- | :-------- | :---------- |
| object_id | string | Id of the tweet, post or comment |
| clean_text  | string | Message data to be analysed |
| {grouping_column} | string | Optional column name used to run separate LDAs per group |
| lda_name | string | Name of the group the LDA was run on: the object's value in {grouping_column} if a grouping column is given, otherwise 'all' |
| lda_cloud | int | Number of the cloud the object belongs to (1-10 by default) |
| lda_cloud_confidence | float | Confidence that the object belongs to the lda_cloud group (0.0-1.0) |
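
One plausible reading of `lda_cloud` and `lda_cloud_confidence` is sketched below. It assumes the cloud is the most probable topic for each object and the confidence is that topic's probability. This notebook does not show how phoenix derives them, so treat it as an illustration only; `doc_topic` is a hypothetical topic distribution.

```python
import numpy as np

# Hypothetical per-object topic distribution: 2 objects, 3 clouds, rows sum to 1.
doc_topic = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])

# Assumption: cloud = most probable topic (1-based), confidence = its probability.
lda_cloud = doc_topic.argmax(axis=1) + 1
lda_cloud_confidence = doc_topic.max(axis=1)
print(lda_cloud, lda_cloud_confidence)
```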


### The fitted LatentDirichletAllocator is also saved as a pickle. To load it, uncomment and run the last cell.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import datetime

import pandas as pd
import tentaclio

from phoenix.common import artifacts
from phoenix.common import utils
from phoenix.tag.clustering import latent_dirichlet_allocation


In [None]:
# Parametrise the run execution date.
# Format of the run date
RUN_DATE_FORMAT = "%Y-%m-%d"
# Papermill can override this at execution time to enable historic runs and backfills.
RUN_DATE = datetime.datetime.today().strftime(RUN_DATE_FORMAT)

# Set Artefacts URL
ARTIFACTS_BASE_URL = f"{artifacts.urls.get_local()}{RUN_DATE}/"

# Input
ALL_OBJECTS = artifacts.dataframes.url(ARTIFACTS_BASE_URL, "objects")
TOPICS_DF = artifacts.dataframes.url(ARTIFACTS_BASE_URL, "topics")
# Group names in dataframe
GROUP_NAMES = "topic"
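
Because `RUN_DATE` becomes part of the artifacts URL, an overridden value should match `RUN_DATE_FORMAT`. The sketch below shows one way to check such a value. The `override` string is a made-up example, and the check itself is not part of this notebook.

```python
import datetime

RUN_DATE_FORMAT = "%Y-%m-%d"

# Hypothetical override value, as might be injected for a backfill run.
override = "2021-05-01"

# strptime raises ValueError if the string does not match the format.
parsed = datetime.datetime.strptime(override, RUN_DATE_FORMAT)
round_trip = parsed.strftime(RUN_DATE_FORMAT)
print(round_trip)
```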

In [None]:
utils.setup_notebook_output()
utils.setup_notebook_logging()

In [None]:
# Display params.
print(
    ARTIFACTS_BASE_URL,
    ALL_OBJECTS,
    TOPICS_DF,
    GROUP_NAMES,
    RUN_DATE,
    sep="\n",
)

In [None]:
object_df = artifacts.dataframes.get(ALL_OBJECTS).dataframe

In [None]:
object_df

In [None]:
if GROUP_NAMES == "topic":
    topic_df = artifacts.dataframes.get(TOPICS_DF).dataframe
    object_df = topic_df.merge(object_df[["object_id","clean_text"]], on="object_id")

In [None]:
object_df.shape

In [None]:
# This will immediately fit a StemmedCountVectorizer and might take a while to complete.
lda = latent_dirichlet_allocation.LatentDirichletAllocator(object_df, grouping_column=GROUP_NAMES)

In [None]:
print(lda.dfs.items())

In [None]:
lda.vectorizers

In [None]:
# This will train the Latent Dirichlet Allocation model and use GridSearch
# to find the best hyperparameters. This will take quite a while to complete.
lda.train()
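
For readers unfamiliar with the approach, here is a rough sketch of LDA with a grid search in plain scikit-learn. It is not the phoenix implementation: it uses a plain `CountVectorizer` rather than the `StemmedCountVectorizer` mentioned above, and the toy corpus, parameter grid and `max_iter` are made up for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus, for illustration only.
docs = [
    "flood warning river rain",
    "rain storm flood river",
    "bridge road traffic bus",
    "bus road traffic train",
    "storm rain wind flood",
    "train bus bridge road",
]

# Bag-of-words counts, one row per document.
word_matrix = CountVectorizer().fit_transform(docs)

# With no explicit scoring, GridSearchCV uses the estimator's own score(),
# which for LDA is an approximate log-likelihood.
search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, random_state=0),
    param_grid={"n_components": [2, 3]},
    cv=2,
)
search.fit(word_matrix)

# Per-document topic distribution from the refitted best model.
doc_topic = search.best_estimator_.transform(word_matrix)
print(search.best_params_, doc_topic.shape)
```

The grid here only varies the number of topics. A real search over more parameters and a larger corpus is what makes the training cell slow.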

In [None]:
lda.save_plot(ARTIFACTS_BASE_URL)

In [None]:
lda.tag_dataframe()

In [None]:
lda.persist(ARTIFACTS_BASE_URL)

In [None]:
lda.persist_model(ARTIFACTS_BASE_URL)

### To load the pickled LatentDirichletAllocator, uncomment and run the cell below.

In [None]:
# import pickle
# with tentaclio.open(f"{ARTIFACTS_BASE_URL}latent_dirichlet_allocator_model.sav", 'rb') as f:
#     lda_loaded = pickle.load(f)