## Notebook for exploring similar terms/sentences

This notebook turns tweets/posts into 512-dimensional vectors using a pretrained model, then uses a dimensionality reduction algorithm to project those vectors down to 2D. The 2D vectors can then be used to visualise the tweets/posts in an interactive graph, explored together with an analyst (currently using the `bulk` package). This lets us highlight snippets that contain a particular word and see which other snippets lie close by.

This would help analysts explore similar text snippets and:

1. Give them a better idea of the size and scope of the topics they are interested in (denoted by those words).

2. Provide inspiration for other words related to that cluster, which can be used to bootstrap the SFLM model, or a spaCy model using `patterns`.

- [x] Load data
- [x] Load spaCy Arabic model
    - Used `distiluse-base-multilingual-cased-v1` instead of spaCy
- [x] Add spaCy model to sklearn pipeline
    - Used Hugging Face through embetter to get a BERT model
- [x] Prep and export dataset to show similar sentences through bulk
    - [x] Run text through embedding
    - [x] Use UMAP for dimensionality reduction
    - [x] Run bulk to create a small 2D graph of similar sentences
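
The idea of highlighting snippets that contain a word and then looking at what lies close by can also be sketched outside the interactive viewer. The block below is a minimal illustration only: the toy texts, their coordinates, the `neighbours_of_keyword` helper and the `radius` value are all hypothetical, and in practice the `x`/`y` columns would come from the dimensionality reduction step in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the exported dataframe: hypothetical texts with made-up 2D coordinates.
toy_df = pd.DataFrame(
    {
        "text": [
            "water shortage in the city",
            "no water again today",
            "the taps have been dry since morning",
            "football match tonight",
            "great goal in the match",
        ],
        "x": [0.0, 0.1, 0.2, 5.0, 5.1],
        "y": [0.0, 0.1, 0.0, 5.0, 5.2],
    }
)


def neighbours_of_keyword(df, keyword, radius=0.5):
    """Return rows that lack `keyword` but lie within `radius` of a row containing it."""
    has_kw = df["text"].str.contains(keyword, case=False, regex=False, na=False).to_numpy()
    coords = df[["x", "y"]].to_numpy(dtype=float)
    if not has_kw.any():
        # No snippet mentions the keyword, so there is nothing to be close to.
        return df.iloc[0:0]
    # Distances from every row to every keyword row: shape (n_rows, n_keyword_rows).
    diffs = coords[:, None, :] - coords[has_kw][None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    near = (dists <= radius).any(axis=1)
    return df[near & ~has_kw]


nearby = neighbours_of_keyword(toy_df, "water")
print(nearby["text"].tolist())
```

Snippets returned this way do not mention the keyword themselves, so their vocabulary ("taps", "dry" in the toy data) is a possible source of new candidate words for the topic. The radius is in the units of the 2D projection and would need tuning per dataset.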

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import datetime

import pandas as pd
import tentaclio
import embetter
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

import umap
import hdbscan
import sklearn.cluster as cluster
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

from phoenix.common import artifacts, run_params, utils
from phoenix.tag.labelling import prodigy_utils

In [None]:
# !pip install embetter
# !pip install "embetter[sentence-tfm]"
# !pip install umap-learn hdbscan

In [None]:
utils.setup_notebook_output()
utils.setup_notebook_logging()

In [None]:
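# Directory holding the prodigy artifacts
prodigy_dmaps_df_path = f"{artifacts.urls.get_local()}/prodigy/"
# Source tweets CSV (defined here, but not read anywhere below)
tweets_dmaps_path = f"{artifacts.urls.get_local()}/prodigy/dmaps_jordan_tweets.csv"
# Hard-coded, machine-specific path to the CSV that is actually loaded below
written_path = "/Users/andrewsutjahjo/git/python/phoenix/local_artifacts//prodigy/dmaps_jordan_tweets-11.csv"

# Destination for the dataframe once the 2D coordinates have been added
output_path = f"{artifacts.urls.get_local()}/prodigy/dmaps_jordan_tweets-11.csv"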

In [None]:
df = pd.read_csv(written_path)

In [None]:
# df = df[:10]

In [None]:
# Select the "text" column and encode each entry as a sentence embedding
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder("distiluse-base-multilingual-cased-v1"),
)

In [None]:
embeddings_array = text_emb_pipeline.transform(df)

In [None]:
# Project the embeddings down to 2D (UMAP's default n_components is 2)
umap_embeddings = umap.UMAP().fit_transform(embeddings_array)

In [None]:
umap_embeddings

In [None]:
umap_embeddings.shape[0]

In [None]:
# Attach the 2D coordinates to the dataframe as plotting columns
df["x"] = umap_embeddings[:, 0]
df["y"] = umap_embeddings[:, 1]

In [None]:
with tentaclio.open(output_path, "w") as fb:
    df.to_csv(fb)
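
The export above uses pandas' default `index=True`, so the CSV gains an extra unnamed index column each time it is written. The sketch below shows the effect with an in-memory round trip on toy data. It assumes the downstream viewer only needs `text`, `x` and `y` columns, which is an assumption worth checking against the `bulk` documentation; `toy_df`, `REQUIRED_COLUMNS` and `export_roundtrip` are hypothetical names used for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the dataframe after the x/y columns were added.
toy_df = pd.DataFrame(
    {"text": ["first post", "second post"], "x": [0.1, 0.9], "y": [0.3, 0.7]}
)

# Assumed (not confirmed by this notebook): the columns the viewer needs.
REQUIRED_COLUMNS = {"text", "x", "y"}


def export_roundtrip(df, index):
    """Write df to an in-memory CSV and read it back, returning the reloaded frame."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = export_roundtrip(toy_df, index=True)
without_index = export_roundtrip(toy_df, index=False)

# Columns still missing after the round trip (empty set means the export is usable).
missing = REQUIRED_COLUMNS - set(without_index.columns)

print(list(with_index.columns))
print(list(without_index.columns))
print(missing)
```

Passing `index=False` to `to_csv` keeps the file limited to the dataframe's own columns, which avoids accumulating `Unnamed: 0`-style columns if the same file is read and rewritten repeatedly.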