- Load the model.
- Load hierarchy dictionary.
- Process text: lower, split, keep words in model.vocab.
- Run inference with model.search_topics().
 - Get topic_words, word_scores, topic_scores, topic_nums.
 - Use a score threshold as the cutoff for the number of topics returned.
 - Handle the case where no topic meets the cutoff: return "unknown".
- Enhancement: Exclude redundant words such as ipad and ipads.
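A minimal sketch of the enhancement in the last bullet, assuming "redundant" means simple singular/plural pairs such as ipad and ipads. The helper name `drop_plural_variants` and the strip-a-trailing-"s" rule are illustrative assumptions, not part of the notebook or of Top2Vec; a lemmatizer would likely be more robust.

```python
def drop_plural_variants(words):
    """Keep the first occurrence of each word, skipping naive plural duplicates."""
    seen = set()
    kept = []
    for w in words:
        # Assumption: a trailing "s" on a word longer than 3 characters marks a plural.
        key = w[:-1] if w.endswith("s") and len(w) > 3 else w
        if key not in seen:
            seen.add(key)
            kept.append(w)
    return kept


deduped = drop_plural_variants(
    ["ipad", "ipads", "nutritionist", "nutritional", "nutritionists"]
)
print(deduped)
```

Applied to the first few topic words of a topic, this would keep the label from spending slots on near-identical terms.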

In [1]:
from top2vec import Top2Vec
import pandas as pd
from pathlib import Path
import json

In [2]:

DATASET = "sample50k_health_tech"

MODEL_PATH = Path("../results/models")/DATASET
TOPIC_PATH = Path("../results/topics")

Load Model

In [3]:
embedding_module = 'universal-sentence-encoder'
speed = 'learn'
model = Top2Vec.load(f"{MODEL_PATH}/top2vec_{embedding_module}_{speed}")

Load the dictionary with the topic grouping (hierarchy). Keys are the topic numbers of the lowest-level topics.

In [4]:
with open(f"{TOPIC_PATH}/{DATASET}_top2vec.json") as f:
    # JSON object keys are strings, so look topics up with str(topic_num).
    topics = json.load(f)

Data preprocessing:

Note: the only preprocessing before training was .lower().split()! The extra cleaning below was a test to see whether removing stopwords before inference helps.
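For comparison, the training-style preprocessing mentioned in the note can be sketched as follows. The function name `preprocess_like_training` and the toy `vocab` set are assumptions for illustration; in the notebook the vocabulary would come from `model.vocab`, and wrapping it in `set(...)` first keeps the membership test fast.

```python
def preprocess_like_training(text, vocab):
    """Lowercase, split on whitespace, and keep only tokens found in vocab."""
    tokens = text.lower().split()
    return [t for t in tokens if t in vocab]


# Toy vocabulary standing in for set(model.vocab); hypothetical values.
vocab = {"protein", "diets", "health"}
print(preprocess_like_training("Low protein diets improve Health", vocab))
```

Note that plain whitespace splitting leaves punctuation attached to tokens ("diet," would not match "diet"), which is one reason some words drop out at the vocabulary filter.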

In [5]:
from src.data_cleaning import decontracted, clean, create_lemmas
import spacy
from spacy.lang.en.stop_words import STOP_WORDS as en_stop

nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])


In [19]:
# txt = df.loc[0].article

txt = """
A new study from Pennington Biomedical Research Center, published in the journal Nature Communications, found that reducing the amount of protein in the diet produced an array of favorable health outcomes, including an extension of lifespan, and that these effects depend on a liver-derived metabolic hormone called Fibroblast Growth Factor 21 (FGF21).

It has long been known that reducing the amount you eat improves health and extends lifespan, and there has been increasing interest in the possibility that reducing protein or amino acid intake contributes to this beneficial effect. Several recent studies suggest that diets that are low in protein, but not so low that they produce malnutrition, can improve health. Conversely, overconsumption of high-protein diets has been linked to increased mortality in certain age groups.
"""

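# Normalize the raw text with the project helpers from src.data_cleaning.
txt = clean(decontracted(txt))
# Lemmatize into a list of tokens.
txt = create_lemmas(txt)
# Keep only tokens the model knows.
txt_proc = [w for w in txt if w in model.vocab]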

In [20]:
print(txt_proc)

['new', 'study', 'biomedical', 'research', 'center', 'publish', 'journal', 'nature', 'communication', 'find', 'reduce', 'protein', 'diet', 'produce', 'array', 'favorable', 'health', 'outcome', 'include', 'extension', 'lifespan', 'effect', 'depend', 'metabolic', 'hormone', 'growth', 'factor', 'long', 'know', 'reduce', 'eat', 'improve', 'health', 'extend', 'lifespan', 'increase', 'interest', 'possibility', 'reduce', 'protein', 'amino', 'acid', 'intake', 'contribute', 'beneficial', 'effect', 'recent', 'study', 'suggest', 'diet', 'low', 'protein', 'low', 'produce', 'malnutrition', 'improve', 'health', 'diet', 'link', 'increase', 'mortality', 'certain', 'age', 'group']


In [21]:
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=txt_proc, num_topics=5)

In [22]:
# Keep only the (topic_num, score) pairs whose score exceeds the cutoff.
THRESHOLD = 0.02
best_topics = [(t, s) for t, s in zip(topic_nums, topic_scores) if s > THRESHOLD]

for t, _ in best_topics:
    print(topics[str(t)]["main_topic"])
    print(topics[str(t)]["topic_level1_descr"])
    print("_".join(topics[str(t)]["topic_words_level2"][:5]))
    print("-----")


health
celebrity
un_feeds_parenthood_tout_episodes
-----
health
plastic surgery
bariatric_ripa_toned_liposuction_tess
-----
health
nutrition
nutritionist_nutritional_nutritionists_dietitian_statin
-----
health
nutrition
keto_ketosis_diets_carbs_nutritionist
-----
health
plastic surgery
surgeries_implants_surgery_mastectomy_botox
-----
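The case where no topic clears the cutoff is not handled yet. A minimal sketch of that step, assuming the `topics` dictionary layout used above (string keys, a "main_topic" field); the function name `main_topics_or_unknown` and the toy inputs are hypothetical.

```python
def main_topics_or_unknown(topic_nums, topic_scores, topics, threshold=0.02):
    """Return main_topic labels for topics above the threshold, or ["unknown"]."""
    labels = []
    for num, score in zip(topic_nums, topic_scores):
        if score > threshold:
            # Keys in the loaded JSON are strings, hence str(num).
            labels.append(topics[str(num)]["main_topic"])
    # An empty list means no topic met the cutoff.
    return labels or ["unknown"]


# Toy stand-ins for the loaded hierarchy and the search_topics() output.
toy_topics = {"3": {"main_topic": "health"}, "7": {"main_topic": "tech"}}
print(main_topics_or_unknown([3, 7], [0.05, 0.01], toy_topics))
print(main_topics_or_unknown([3, 7], [0.01, 0.005], toy_topics))
```

Returning a one-element list for the fallback keeps the return type the same in both cases, so callers do not need a special branch.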
