Automatic classification of stigmatizing mental illness articles in online news journals - April 2022
Author: Alina Yanchuk - alinayanchuk@ua.pt

### Table of contents:

* [4. Topic modeling](#chapter4)
    * [4.1 Requirements](#section_4_1)
    * [4.2 Imports](#section_4_2)
    * [4.3 Get data](#section_4_3)
    * [4.4 With top2vec](#section_4_4)

# 4. Topic modeling <a class="anchor" id="chapter4"></a>

Topic modeling is an unsupervised machine learning technique that automatically analyzes text data to find groups of words (topics) that characterize a set of documents.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects the topics present in a collection of texts and generates jointly embedded topic, document and word vectors. Its main benefits: it finds the number of topics automatically, works on short texts, and doesn't ignore semantics.
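
To make the idea of a joint embedding more concrete, here is a toy sketch. It is not Top2Vec's implementation: the 2-D vectors are invented for illustration and `rank_words` is a hypothetical helper. It only shows how, once words and topics live in the same vector space, the words closest to a topic vector (by cosine similarity) can be used to describe that topic.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_words(topic_vector, word_vectors, top_n=3):
    """Return the top_n (word, score) pairs most similar to the topic vector."""
    scored = [
        (word, cosine_similarity(topic_vector, vector))
        for word, vector in word_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented 2-D vectors (assumption): real embeddings have hundreds of dimensions
toy_word_vectors = {
    "tratamento": (0.9, 0.1),
    "sintomas": (0.8, 0.3),
    "tribunal": (0.1, 0.9),
    "crime": (0.2, 0.8),
}
toy_topic_vector = (1.0, 0.2)  # hypothetical "health" topic vector

ranked = rank_words(toy_topic_vector, toy_word_vectors, top_n=2)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy space the two health-related words rank first because their vectors point in nearly the same direction as the topic vector.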

Note: run this notebook on a GPU.

References:

1. https://github.com/ddangelov/Top2Vec

## 4.1 Requirements <a class="anchor" id="section_4_1"></a>

In [None]:
#pip install gensim==3.8.3

In [None]:
#pip install top2vec

## 4.2 Imports <a class="anchor" id="section_4_2"></a>

In [8]:
import pandas as pd

from top2vec import Top2Vec

2022-05-18 17:24:53.067600: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-05-18 17:24:53.067726: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


## 4.3 Get data <a class="anchor" id="section_4_3"></a>

In [3]:
# top2vec doesn't need pre-processing, but we will still use the already cleaned dataset

data = pd.read_pickle('data_preprocessed_tm.pkl')
data.insert(0, 'ID', range(0, len(data)))
data.head()

Unnamed: 0,ID,label,content
0,0,0,prisão perpétua homem tentou assassinar senado...
1,1,0,john nash matemático mente brilhante morre aci...
2,2,1,mito reeleição mínima garantida cavaco sairá d...
3,3,0,morreu rita levintalcini grande dama ciência i...
4,4,0,trás porta amarela homem problemas psicológico...


In [4]:
# Map each row position (the index that Top2Vec reports) to its document ID

ids = {}
for position, doc_id in enumerate(data['ID']):
  ids[position] = doc_id

In [5]:
content = data.loc[:,'content']
content.head()

0    prisão perpétua homem tentou assassinar senado...
1    john nash matemático mente brilhante morre aci...
2    mito reeleição mínima garantida cavaco sairá d...
3    morreu rita levintalcini grande dama ciência i...
4    trás porta amarela homem problemas psicológico...
Name: content, dtype: object

## 4.4 With top2vec <a class="anchor" id="section_4_4"></a>

In [None]:
# Convert dataset to list of strings

documents = list(content.values.flatten())

In [None]:
# Train a Top2Vec model on our news dataset

model = Top2Vec(documents, speed="learn", workers=8)

In [4]:
# Total number of topics found

total_topics = model.get_num_topics()

print(f"Found {total_topics} topics.")


Found 10 topics.



In [None]:
# For each topic, the top 50 words are returned, in order of semantic similarity to the topic

topic_words, word_scores, topic_nums = model.get_topics(total_topics)

In [5]:
# 50 most relevant words for each topic

topic_words


array([['doencas', 'estudo', 'doenca', 'medicamentos', 'ansiedade',
        'sintomas', 'doentes', 'estudos', 'saude', 'tratamentos',
        'tratamento', 'mental', 'mentais', 'pacientes', 'investigadores',
        'existem', 'efeitos', 'utilizacao', 'genetica', 'comportamentos',
        'medica', 'secundarios', 'risco', 'sofrem', 'substancia',
        'aumentar', 'depressao', 'psicoticos', 'psiquiatria',
        'investigador', 'destes', 'perturbacoes', 'associado', 'tipos',
        'esquizofrenia', 'cerebro', 'casos', 'graves', 'psiquiatricos',
        'uso', 'medico', 'medicacao', 'perda', 'alucinacoes',
        'cientistas', 'desenvolver', 'psiquiatra', 'tratar', 'cuidados',
        'consumo'],
       ['homicidio', 'prisao', 'policia', 'crime', 'encontrado',
        'crimes', 'inimputavel', 'tribunal', 'matou', 'sofre',
        'psiquiatrica', 'vitima', 'arguido', 'psiquiatrico',
        'internamento', 'internado', 'matar', 'acusacao', 'acusado',
        'condenado', 'suspeito',

In [6]:
# And their scores

word_scores


array([[0.66990924, 0.63975966, 0.62038016, 0.61935025, 0.617918  ,
        0.6161783 , 0.6157184 , 0.6069458 , 0.59013665, 0.5872875 ,
        0.57665765, 0.57321715, 0.5730446 , 0.5637206 , 0.5499758 ,
        0.5452902 , 0.5446592 , 0.5391974 , 0.539122  , 0.53110605,
        0.5299418 , 0.5295639 , 0.5276976 , 0.5265435 , 0.52526313,
        0.5237713 , 0.5234479 , 0.52059764, 0.51766413, 0.5146994 ,
        0.506858  , 0.5032734 , 0.502999  , 0.50271577, 0.5025605 ,
        0.50228304, 0.4988374 , 0.4969626 , 0.49252406, 0.4922303 ,
        0.4845404 , 0.4840532 , 0.48088893, 0.47902697, 0.4772811 ,
        0.4766344 , 0.47609985, 0.4746457 , 0.47357175, 0.47322702],
       [0.63625103, 0.62617075, 0.6212035 , 0.618063  , 0.5796766 ,
        0.57817936, 0.5777791 , 0.5777204 , 0.564726  , 0.5645703 ,
        0.56097203, 0.5569245 , 0.54993206, 0.5369404 , 0.5229149 ,
        0.5223444 , 0.51159775, 0.51016814, 0.508513  , 0.5083684 ,
        0.5010875 , 0.48722336, 0.47801363, 0.

In [None]:
# Wordcloud for each topic

for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

In [3]:
# Number of documents in each topic

topic_sizes, topic_nums = model.get_topic_sizes()
for i in topic_nums:
  print("Topic "+str(i)+" has "+str(topic_sizes[i])+" documents.")


Topic 0 has 232 documents.
Topic 1 has 158 documents.
Topic 2 has 112 documents.
Topic 3 has 92 documents.
Topic 4 has 85 documents.
Topic 5 has 80 documents.
Topic 6 has 70 documents.
Topic 7 has 70 documents.
Topic 8 has 41 documents.
Topic 9 has 38 documents.
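
As a quick sanity check on these sizes, the sketch below computes each topic's share of the corpus. The list is copied from the output above; `sizes_from_output` is a name introduced only for this illustration, and the calculation assumes each document is assigned to exactly one topic.

```python
# Topic sizes copied from the printed output above (topics 0 to 9)
sizes_from_output = [232, 158, 112, 92, 85, 80, 70, 70, 41, 38]

# Assumption: every document belongs to exactly one topic,
# so the sizes add up to the number of documents in the dataset
total_documents = sum(sizes_from_output)
shares = [100 * size / total_documents for size in sizes_from_output]

for topic_num, (size, share) in enumerate(zip(sizes_from_output, shares)):
    print(f"Topic {topic_num}: {size} documents ({share:.1f}% of {total_documents})")
```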



In [None]:
# Search documents by topic. Ordered by (decreasing) similarity.
# Note: in every execution of this notebook, the topics retrieved may be slightly different. Adapt this part to your results.

def document_ids_for_topic(topic_num):
  # Return the dataset IDs of every document assigned to the topic, most similar first
  _, _, document_indexes = model.search_documents_by_topic(topic_num=topic_num, num_docs=topic_sizes[topic_num])
  return [ids.get(index) for index in document_indexes]

documents_topic0 = document_ids_for_topic(0)
documents_topic1 = document_ids_for_topic(1)
documents_topic2 = document_ids_for_topic(2)
documents_topic3 = document_ids_for_topic(3)
documents_topic4 = document_ids_for_topic(4)
documents_topic5 = document_ids_for_topic(5)
documents_topic6 = document_ids_for_topic(6)
documents_topic7 = document_ids_for_topic(7)
documents_topic8 = document_ids_for_topic(8)
documents_topic9 = document_ids_for_topic(9)

In [16]:
# Add topics to Visualization and Analysis file 
# Note: in every execution of this notebook, the topics retrieved may be slightly different. Adapt this part to your results.

import os

parent = os.path.dirname(os.getcwd())

data_va = pd.read_pickle(parent+'/4.visualization and analysis/data_preprocessed_va.pkl')
data_va.head()

data_va.insert(10, 'topic', "")

def add_topics(doc_id):
    # Return the topic label for a document ID, or " " if it is in no topic list

    topic = " "

    if doc_id in documents_topic0:
        topic = "Saúde"  # Health
    elif doc_id in documents_topic1:
        topic = "Crime"  # Crime
    elif doc_id in documents_topic2:
        topic = "Cinema"  # Cinema
    elif doc_id in documents_topic3:
        topic = "Economia"  # Economy
    elif doc_id in documents_topic4:
        topic = "Conflitos militares"  # Military conflicts
    elif doc_id in documents_topic5:
        topic = "Política"  # Politics
    elif doc_id in documents_topic6:
        topic = "Literatura"  # Literature
    elif doc_id in documents_topic7:
        topic = "Música"  # Music
    elif doc_id in documents_topic8:
        topic = "Desporto"  # Sports
    elif doc_id in documents_topic9:
        topic = "Justiça"  # Justice

    return topic

data_va["topic"] = data_va.ID.apply(add_topics)
data_va.head()

data_va.to_pickle(parent+"/4.visualization and analysis/data_preprocessed_va.pkl")


Unnamed: 0,ID,label,journal,journalTitle,content,authors,publishDate,archiveDate,year,linkToArchive,topic
0,0,literal,publico.pt,Público,dia janeiro jared loughner tentou matar sucess...,[],,2012-12-30,2012,https://arquivo.pt/wayback/20121230181331/http...,Crime
1,1,literal,publico.pt,Público,john nash matemático nobel economia retratado ...,[],,2016-01-17,2016,https://arquivo.pt/wayback/20160117223452/http...,Cinema
2,2,estigmatizante,publico.pt,Público,cavaco sairá desta campanha pior entrou casos ...,"['Nuno Ferreira Santos', 'Arquivo']",,2011-01-21,2011,https://arquivo.pt/wayback/20110121142608/http...,Política
3,3,literal,publico.pt,Público,cientista senadora italiana rita levintalcini ...,['Clara Barata'],,2013-01-17,2013,https://arquivo.pt/wayback/20130117170513/http...,Conflitos Militares
4,4,literal,publico.pt,Público,ninguém sabe fazer ninguém sabe pensa come sob...,[],,2015-04-20,2015,https://arquivo.pt/wayback/20150420143056/http...,Crime
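
The labelling step above checks up to ten lists for every row. One possible alternative, sketched here with toy data, is to build a single ID-to-label dictionary first and then look each ID up once. `build_topic_lookup` and `toy_ids_by_label` are hypothetical names introduced for this illustration; in the notebook, the toy lists would be replaced by the per-topic ID lists.

```python
def build_topic_lookup(ids_by_label):
    """Map each document ID to the first label whose ID list contains it."""
    lookup = {}
    for label, doc_ids in ids_by_label.items():
        for doc_id in doc_ids:
            # setdefault keeps the first label seen, like the if/elif chain
            lookup.setdefault(doc_id, label)
    return lookup


# Toy ID lists (assumption) standing in for the per-topic document ID lists
toy_ids_by_label = {
    "Saúde": [3, 7],
    "Crime": [0, 4],
    "Cinema": [1],
}

lookup = build_topic_lookup(toy_ids_by_label)

# IDs that appear in no list fall back to " ", the same default as add_topics
topics = [lookup.get(doc_id, " ") for doc_id in range(6)]
print(topics)
```

With pandas, the same dictionary could be applied to the `ID` column through a per-row lookup, which avoids scanning every list for every document.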
