## Notebook for exploring similar terms/sentences

Turns tweets/posts into fixed-length vectors using a pretrained model (512d for `distiluse-base-multilingual-cased-v1`; other models may differ), then uses a dimensionality reduction algorithm to project those vectors down to 2d. The 2d vectors can then be plotted in an interactive graph (currently using the `bulk` package) and explored together with an analyst. The graph lets us highlight snippets that contain a particular word and see which other snippets lie close by.

This would help analysts explore similar text snippets, and would:

1. Give them a better idea of the size and scope of the topics that they are interested in (denoted by those words).

2. Provide inspiration for other words related to that cluster, which can be used to bootstrap the SFLM model or a spaCy model using `patterns`.
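
As a rough illustration of the second point, words picked up while exploring a cluster could be turned into token patterns in the format that spaCy's `EntityRuler` accepts. This is only a sketch: the seed words, the `TOPIC` label and the helper name `words_to_patterns` are made up for illustration and do not come from the project.

```python
def words_to_patterns(words, label="TOPIC"):
    """Build one case-insensitive token pattern per distinct word (hypothetical helper)."""
    patterns = []
    seen = set()
    for word in words:
        key = word.strip().lower()
        if not key or key in seen:
            continue  # skip blanks and duplicates
        seen.add(key)
        # One {"LOWER": ...} dict per whitespace-separated token.
        patterns.append(
            {"label": label, "pattern": [{"LOWER": tok} for tok in key.split()]}
        )
    return patterns


# Hypothetical seed words found while exploring a cluster.
patterns = words_to_patterns(["Biashara", "mkutano wa biashara", "biashara"])
```

In spaCy v3 a list like this could be loaded with `ruler = nlp.add_pipe("entity_ruler")` followed by `ruler.add_patterns(patterns)`. Splitting on whitespace is a simplification, since spaCy's tokenizer may split some words differently.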

- [x] Load data
- [x] Load spaCy Arabic model
    - Used `distiluse-base-multilingual-cased-v1` instead of spaCy
- [x] Add spaCy model to sklearn pipeline
    - Used Hugging Face through embetter to get a BERT model
- [x] Prep and export dataset to show similar sentences through bulk
    - [x] Run text through the embedding pipeline
    - [x] UMAP for dimensionality reduction
    - [x] Run bulk to create a small 2d graph of similar sentences

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import datetime

import pandas as pd
import tentaclio
import embetter
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from embetter.grab import ColumnGrabber
from embetter.text import SentenceEncoder

import umap
import hdbscan
import sklearn.cluster as cluster
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

from phoenix.common import artifacts, run_params, utils
from phoenix.tag.labelling import prodigy_utils

In [None]:
# !pip install embetter
# !pip install "embetter[sentence-tfm]"
# !pip install umap-learn hdbscan
# !pip install ipywidgets
# !pip install --upgrade jupyter

In [3]:
utils.setup_notebook_output()
utils.setup_notebook_logging()

LOG:2023-02-03 11:49:45,435 - pid:11 - [/src/phoenix/common/utils.py:24] - root - INFO - Outputting logs within notebook enabled. Set level:INFO.


<RootLogger root (INFO)>

In [4]:
prodigy_dmaps_df_path = f"{artifacts.urls.get_local()}/prodigy/"
tweets_dmaps_path = f"{artifacts.urls.get_local()}/prodigy/dmaps_jordan_tweets.csv"
written_path = "/Users/andrewsutjahjo/git/python/phoenix/local_artifacts//prodigy/dmaps_jordan_tweets-11.csv"
fb_posts_tanzania_path = f"{prodigy_dmaps_df_path}tanzania_facebook_posts_final.parquet"
output_path = f"{artifacts.urls.get_local()}/prodigy/dmaps_jordan_tweets-11.csv"

In [11]:
# wiki_data = f"{artifacts.urls.get_local()}/prodigy/sw_small.txt"
wiki_data = f"{artifacts.urls.get_local()}/prodigy/sw_tiny.txt"
output_path = f"{artifacts.urls.get_local()}/prodigy/sw_small_umap.csv"

In [12]:
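# One row per line; lines that do not parse as a single field are skipped (see output).
# NOTE: `error_bad_lines` is deprecated since pandas 1.3 (`on_bad_lines="skip"` replaces it).
with tentaclio.open(wiki_data, "r") as fb:
    df_wiki = pd.read_csv(fb, sep="\n", error_bad_lines=False)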

b'Skipping line 507: expected 1 fields, saw 2\nSkipping line 653: expected 1 fields, saw 2\nSkipping line 993: expected 1 fields, saw 2\nSkipping line 1703: expected 1 fields, saw 2\nSkipping line 1736: expected 1 fields, saw 2\nSkipping line 2092: expected 1 fields, saw 2\nSkipping line 2219: expected 1 fields, saw 2\nSkipping line 2360: expected 1 fields, saw 2\nSkipping line 2361: expected 1 fields, saw 2\nSkipping line 3064: expected 1 fields, saw 2\nSkipping line 3441: expected 1 fields, saw 2\nSkipping line 5025: expected 1 fields, saw 2\nSkipping line 5566: expected 1 fields, saw 2\nSkipping line 6050: expected 1 fields, saw 2\nSkipping line 6574: expected 1 fields, saw 2\nSkipping line 6668: expected 1 fields, saw 2\nSkipping line 7725: expected 1 fields, saw 2\nSkipping line 8852: expected 1 fields, saw 2\nSkipping line 8922: expected 1 fields, saw 2\n'
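
The skipped lines above come from parsing a plain text file as CSV. If every line should be kept as one snippet, one alternative is to read the lines directly. A minimal sketch, assuming the opened handle can be iterated like a regular text file object; the helper name `read_snippets` is hypothetical and the sample lines are invented.

```python
import io


def read_snippets(handle):
    """Return every non-empty line of a text handle, stripped of surrounding whitespace."""
    snippets = []
    for line in handle:
        text = line.strip()
        if text:  # drop blank lines only; commas and quotes are left untouched
            snippets.append(text)
    return snippets


# Stand-in for the opened file; the sample lines are invented for this example.
sample = io.StringIO('MKUTANO WA BIASHARA\n\nmstari wa pili, wenye koma\n"nukuu" ndani\n')
snippets = read_snippets(sample)
```

With the real file, `pd.DataFrame({"text": read_snippets(fb)})` inside the `tentaclio.open` block would give a one-column frame directly, without the first line being consumed as a header.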


In [13]:
# The first line of the file ("MKUTANO WA BIASHARA") was read as the header; rename it to "text".
df_wiki.rename({"MKUTANO WA BIASHARA": "text"}, axis=1, inplace=True)

In [14]:
df_wiki

Unnamed: 0,text
0,▪ Je Ungependa Kupata Mualiko Maalum Kuhudhuri...
1,▪ Nikupe Maelezo Zaidi Ya Namna Ya Ushiriki Wa...
2,Labels: MKUTANO WA BIASHARA
3,"Lakini inakubaliwa hadhi ya ""dini"" katika nchi..."
4,"Kadiri ya hesabu yake, hao mwaka 2005 walikuwa..."
...,...
8167,Mwenyekiti wa Tanzania Saccos For Women Entrep...
8168,"Alisema kuwa TASWE inamatawi 14, katika matawi..."
8169,"Mshereheshaji, Angela Bondo akizungumza katika..."
8170,Rais wa Chama cha wenye viwanda Biashara na Ki...


In [5]:
df = artifacts.dataframes.get(fb_posts_tanzania_path).dataframe

In [6]:
df.groupby("language_from_api").count()

#  df = df[:10]

Unnamed: 0_level_0,phoenix_post_id,platform_id,platform,date,updated,type,text,link,post_url,subscriber_count,total_interactions,video_length_ms,id,image_text,title,caption,description,account_name,account_handle,account_platform_id,account_page_category,account_page_admin_top_country,account_page_description,account_url,account_page_created_date,statistics_actual_like_count,statistics_actual_comment_count,statistics_actual_share_count,statistics_actual_love_count,statistics_actual_wow_count,statistics_actual_haha_count,statistics_actual_sad_count,statistics_actual_angry_count,statistics_actual_care_count,overperforming_score,interaction_rate,underperforming_score,post_created,timestamp_filter,date_filter,year_filter,month_filter,day_filter,medium_type,text_link,text_hash,scrape_url,url_post_id,file_timestamp
language_from_api,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1
ar,4161,4161,4161,4161,4161,4161,4161,3922,4161,4161,4161,514,4161,313,2430,2434,2430,4161,4156,4161,4161,4161,4161,4161,4159,4161,4161,4161,4161,4161,4161,4161,4161,4161,0,0,0,4161,4161,4161,4161,4161,4161,4161,4161,4161,4161,4161,4161
en,40,40,40,40,40,40,40,40,40,40,40,3,40,16,9,9,10,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,0,0,0,40,40,40,40,40,40,40,40,40,40,40,40
fr,111,111,111,111,111,111,111,111,111,111,111,0,111,2,108,108,106,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,111,0,0,0,111,111,111,111,111,111,111,111,111,111,111,111
und,294,294,294,294,294,294,294,292,294,294,294,1,294,8,276,275,277,294,294,294,294,294,294,294,294,294,294,294,294,294,294,294,294,294,0,0,0,294,294,294,294,294,294,294,294,294,294,294,294


In [7]:
df[df["language_from_api"] == "und"]

Unnamed: 0,phoenix_post_id,platform_id,platform,date,updated,type,text,link,post_url,subscriber_count,total_interactions,video_length_ms,language_from_api,id,image_text,title,caption,description,account_name,account_handle,account_platform_id,account_page_category,account_page_admin_top_country,account_page_description,account_url,account_page_created_date,statistics_actual_like_count,statistics_actual_comment_count,statistics_actual_share_count,statistics_actual_love_count,statistics_actual_wow_count,statistics_actual_haha_count,statistics_actual_sad_count,statistics_actual_angry_count,statistics_actual_care_count,overperforming_score,interaction_rate,underperforming_score,post_created,timestamp_filter,date_filter,year_filter,month_filter,day_filter,medium_type,text_link,text_hash,scrape_url,url_post_id,file_timestamp
137,100037423069130-f0c594ec84786b46,1.000374e+29,Facebook,2023-01-27 09:26:38,2023-01-27 12:01:37+00:00,photo,#بلدية_الهري,https://www.facebook.com/photo.php?fbid=896086...,https://www.facebook.com/100037423069130/posts...,99898,1.0,,und,11785489|896086731648754,,,,,KOURA ONLINE,kouraon,100037423069130,MEDIA_NEWS_COMPANY,LB,\nصفحة إخبارية،اجتماعية،منوعة. مستقلة،لا تنتمي...,https://www.facebook.com/360303224491343,2018-05-13 10:29:18+00:00,1,0,0,0,0,0,0,0,0,,,,2023-01-27 09:26:38+00:00,2023-01-27 09:26:38+00:00,2023-01-27,2023,1,27,photo,#بلدية_الهري-https://www.facebook.com/photo.ph...,f0c594ec84786b46,https://mbasic.facebook.com/100037423069130/po...,896086731648754,2023-01-27 12:29:38.172827+00:00
155,100044236668951-aa5d33c1c81107cb,1.000442e+29,Facebook,2023-01-25 16:42:41,2023-01-27 10:42:51+00:00,photo,"""تجدد"": مستمرّون في النضال إلى جانب أهالي ضحاي...",https://www.facebook.com/kutlattajadod/photos/...,https://www.facebook.com/100044236668951/posts...,101220,85.0,,und,885198|755065189311363,,,,"اجتمعت كتلة ""تجدد"" في مقرها في سن الفيل وأصدرت...",Michel Moawad,michelmoawadofficial,100044236668951,POLITICIAN,LB,"Member of the 🇱🇧 Parliament, proudly represent...",https://www.facebook.com/144651662390924,2013-06-04 06:42:26+00:00,70,7,3,5,0,0,0,0,0,,,,2023-01-25 16:42:41+00:00,2023-01-25 16:42:41+00:00,2023-01-25,2023,1,25,photo,"""تجدد"": مستمرّون في النضال إلى جانب أهالي ضحاي...",aa5d33c1c81107cb,https://mbasic.facebook.com/100044236668951/po...,755065189311363,2023-01-27 12:29:38.172827+00:00
177,100044242578779-5ced73c0afb8d956,1.000442e+29,Facebook,2023-01-27 07:19:56,2023-01-27 11:23:38+00:00,link,⬆️,https://maghapress.blogspot.com/p/blog-page.html,https://www.facebook.com/100044242578779/posts...,128689,3.0,,und,10287848|725574232260656,,أسعار الدولار وصيرفة والعملات الرقمية,maghapress.blogspot.com,,آغابرس اخبار صيدا,aghapress1,100044242578779,TOPIC_PUBLISHER,LB,NEWS\nصفحة إخبارية متنوعة,https://www.facebook.com/1415405748740902,2014-06-04 18:10:31+00:00,3,0,0,0,0,0,0,0,0,,,,2023-01-27 07:19:56+00:00,2023-01-27 07:19:56+00:00,2023-01-27,2023,1,27,link,⬆️-https://maghapress.blogspot.com/p/blog-page...,5ced73c0afb8d956,https://mbasic.facebook.com/100044242578779/po...,725574232260656,2023-01-27 12:29:38.172827+00:00
178,100044242578779-66793e9fb0b7e20e,1.000442e+29,Facebook,2023-01-27 06:11:41,2023-01-27 11:23:38+00:00,photo,Antar tours,https://www.facebook.com/photo.php?fbid=725545...,https://www.facebook.com/100044242578779/posts...,128689,1.0,,und,10287848|725545175596895,"‎one or more people, ‎'‎رحله الى الشام rours A...",,,,آغابرس اخبار صيدا,aghapress1,100044242578779,TOPIC_PUBLISHER,LB,NEWS\nصفحة إخبارية متنوعة,https://www.facebook.com/1415405748740902,2014-06-04 18:10:31+00:00,1,0,0,0,0,0,0,0,0,,,,2023-01-27 06:11:41+00:00,2023-01-27 06:11:41+00:00,2023-01-27,2023,1,27,photo,Antar tours-https://www.facebook.com/photo.php...,66793e9fb0b7e20e,https://mbasic.facebook.com/100044242578779/po...,725545175596895,2023-01-27 12:29:38.172827+00:00
198,100044242578779-ee3b8fddbcc03963,1.000442e+29,Facebook,2023-01-27 05:42:16,2023-01-27 11:23:38+00:00,photo,Antar tours,https://www.facebook.com/photo.php?fbid=725532...,https://www.facebook.com/100044242578779/posts...,128689,0.0,,und,10287848|725532812264798,‎'‎عرض بعده مستمر 22$ tours Antar رحلة منامي ا...,,,,آغابرس اخبار صيدا,aghapress1,100044242578779,TOPIC_PUBLISHER,LB,NEWS\nصفحة إخبارية متنوعة,https://www.facebook.com/1415405748740902,2014-06-04 18:10:31+00:00,0,0,0,0,0,0,0,0,0,,,,2023-01-27 05:42:16+00:00,2023-01-27 05:42:16+00:00,2023-01-27,2023,1,27,photo,Antar tours-https://www.facebook.com/photo.php...,ee3b8fddbcc03963,https://mbasic.facebook.com/100044242578779/po...,725532812264798,2023-01-27 12:29:38.172827+00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4165,322181790311-fd01f93aeadf7180,3.221818e+28,Facebook,2023-01-26 16:46:15,2023-01-27 11:54:07+00:00,link,#فلسطين #إسرائيل,https://bit.ly/3He1Fws,https://www.facebook.com/322181790311/posts/10...,347536,1.0,,und,766084|10161080324360312,,القيادة الفلسطينية تعلن وقف التنسيق الأمني مع ...,lebanese-forces.com,أعلنت القيادة الفلسطينية، عن وقف التنسيق الأمن...,Lebanese Forces News,lebaneseforcesnews,322181790311,NEWS_SITE,LB,Lebanese Forces News Website,https://www.facebook.com/322181790311,2010-02-09 11:53:27+00:00,1,0,0,0,0,0,0,0,0,,,,2023-01-26 16:46:15+00:00,2023-01-26 16:46:15+00:00,2023-01-26,2023,1,26,link,#فلسطين #إسرائيل-https://bit.ly/3He1Fws,fd01f93aeadf7180,https://mbasic.facebook.com/322181790311/posts...,10161080324360312,2023-01-27 12:29:38.172827+00:00
4166,322181790311-fd0a67f60a4505a4,3.221818e+28,Facebook,2023-01-27 11:48:22,2023-01-27 11:54:01+00:00,link,#أوكرانيا,https://bit.ly/3jdgcAH,https://www.facebook.com/322181790311/posts/10...,347534,0.0,,und,766084|10161081920725312,,أوكرانيا تهدّد بمقاطعة الألعاب الأولمبية - Leb...,lebanese-forces.com,"هدّدت أوكرانيا بـ""مقاطعة دورة الألعاب الأولمبي...",Lebanese Forces News,lebaneseforcesnews,322181790311,NEWS_SITE,LB,Lebanese Forces News Website,https://www.facebook.com/322181790311,2010-02-09 11:53:27+00:00,0,0,0,0,0,0,0,0,0,,,,2023-01-27 11:48:22+00:00,2023-01-27 11:48:22+00:00,2023-01-27,2023,1,27,link,#أوكرانيا-https://bit.ly/3jdgcAH,fd0a67f60a4505a4,https://mbasic.facebook.com/322181790311/posts...,10161081920725312,2023-01-27 12:29:38.172827+00:00
4168,322181790311-fdbd0f090e1d079d,3.221818e+28,Facebook,2023-01-25 14:51:41,2023-01-27 11:54:13+00:00,link,#لبنان #القوات_اللبنانية #عماد_واكيم #القضاء #...,https://bit.ly/3wv87dE,https://www.facebook.com/322181790311/posts/10...,347531,19.0,,und,766084|10161078061795312,,واكيم: لمصلحة من محاولة التغطية على جريمة المر...,lebanese-forces.com,"سأل النائب السابق عماد واكيم، ""ماذا يجري داخل ...",Lebanese Forces News,lebaneseforcesnews,322181790311,NEWS_SITE,LB,Lebanese Forces News Website,https://www.facebook.com/322181790311,2010-02-09 11:53:27+00:00,15,1,2,1,0,0,0,0,0,,,,2023-01-25 14:51:41+00:00,2023-01-25 14:51:41+00:00,2023-01-25,2023,1,25,link,#لبنان #القوات_اللبنانية #عماد_واكيم #القضاء #...,fdbd0f090e1d079d,https://mbasic.facebook.com/322181790311/posts...,10161078061795312,2023-01-27 12:29:38.172827+00:00
4169,322181790311-ff97c2a0a50a790b,3.221818e+28,Facebook,2023-01-25 09:40:47,2023-01-27 07:58:24+00:00,link,#رئاسة_الجمهورية #قصر_بعبدا Almassira المسيرة,https://www.lebanese-forces.com/2023/01/25/leb...,https://www.facebook.com/322181790311/posts/10...,347531,3.0,,und,766084|10161077646160312,,الرئيس حليف الخير... ولكن! - Lebanese Forces O...,lebanese-forces.com,كتب العميد الركن المتقاعد والوزير السابق فرنسو...,Lebanese Forces News,lebaneseforcesnews,322181790311,NEWS_SITE,LB,Lebanese Forces News Website,https://www.facebook.com/322181790311,2010-02-09 11:53:27+00:00,3,0,0,0,0,0,0,0,0,,,,2023-01-25 09:40:47+00:00,2023-01-25 09:40:47+00:00,2023-01-25,2023,1,25,link,#رئاسة_الجمهورية #قصر_بعبدا Almassira المسيرة-...,ff97c2a0a50a790b,https://mbasic.facebook.com/322181790311/posts...,10161077646160312,2023-01-27 12:29:38.172827+00:00


In [9]:
text_emb_pipeline = make_pipeline(
    ColumnGrabber("text"),
    SentenceEncoder("Davlan/bert-base-multilingual-cased-finetuned-swahili")
)


LOG:2023-02-03 11:50:02,090 - pid:11 - [/usr/local/lib/python3.9/site-packages/sentence_transformers/SentenceTransformer.py:66] - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: Davlan/bert-base-multilingual-cased-finetuned-swahili


Some weights of the model checkpoint at /root/.cache/torch/sentence_transformers/Davlan_bert-base-multilingual-cased-finetuned-swahili were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at /root/.cache/torch/sentence_tr

In [15]:
# embeddings_array = text_emb_pipeline.transform(df)
embeddings_array = text_emb_pipeline.transform(df_wiki)

Batches:   0%|          | 0/256 [00:00<?, ?it/s]

In [None]:
# Project the embeddings down to 2d (UMAP's default n_components is 2).
umap_embeddings = umap.UMAP().fit_transform(embeddings_array)

In [None]:
umap_embeddings

In [None]:
umap_embeddings.shape[0]

In [None]:
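# The embeddings were computed from df_wiki, so the coordinates belong on df_wiki (not df).
df_wiki["x"] = umap_embeddings[:, 0]
df_wiki["y"] = umap_embeddings[:, 1]

In [None]:
with tentaclio.open(output_path, "w") as fb:
    df_wiki.to_csv(fb)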
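
To support the highlighting described at the top of the notebook, a boolean column can mark the snippets that contain a given word before the CSV is written. A minimal sketch with pandas on invented rows; the column name `has_keyword` and the helper `flag_keyword` are assumptions for illustration, not something `bulk` requires.

```python
import pandas as pd


def flag_keyword(frame, keyword, text_col="text", flag_col="has_keyword"):
    """Return a copy of `frame` with a boolean column marking rows whose text contains `keyword`."""
    out = frame.copy()
    # Literal, case-insensitive match; missing text counts as no match.
    out[flag_col] = out[text_col].str.contains(keyword, case=False, regex=False, na=False)
    return out


# Invented rows standing in for the exported snippets.
demo = pd.DataFrame({"text": ["Mkutano wa BIASHARA", "Rais wa chama", None]})
flagged = flag_keyword(demo, "biashara")
```

Because the helper returns a copy, the original frame is left unchanged and several keywords can be flagged into separate columns by passing a different `flag_col` each time.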