## **Introduction to exploratory GDSMM usage in determining contextualization of men and women in news outlets**

As discussed in the general introduction about GDSMM in file **GDSMM.ipynb**, GDSMM is used as well in order to extract the context in which women and men are most commonly mentioned within the quotes. 

Since this data analysis is still on an exploratory level, the two GDSMM pipelines have not yet been streamlined, and we will explore the advantages of both before the final analyses. The main difference lies in the treatment of stopwords, which has been more proactive in this approach due to the inherent task. Words for which has been filtered - eg woman, man -- and thus occur in every quote, need to be removed in any case. Further words - already mentioned in the other notebook - such as think, would, become, have already been removed here and were determined by looking at early runs of the output. The here used library for stopword removal was the well known nlp module nltk.

It shall be noted as well, that in this case, the non-cleaned dataset has been used, as the cleaning mainly consisted of dropping quotes where no speaker (nor their gender) could be determined. As for the purpose of this analysis, the gender of the speaker is of less relevance, the dataset has not been altered as of yet, but for the final analysis it will be streamlined to the same dataset throughout.



# Importing and updating

To use Vaex, python has to be updated and runtime restarted


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install vaex
!pip install --upgrade ipython
!pip install gensim
!pip install tld
!pip install h5py

In [2]:
import numpy as np
import pandas as pd
import vaex
import matplotlib.pyplot as plt

#Data used
This analysis focuses on the content of the quote alone, therefore it could initially be run on the raw data, as no additional information about the speaker was needed. For the finalized report it will be run on the streamlined data will be used.

In [3]:
quotes_2020 = vaex.open('/content/drive/MyDrive/Q-Bank/DATA_HDF5/quotes-2020/*.hdf5')

#quotes_2019 = vaex.open('/content/drive/MyDrive/Q-Bank/DATA_HDF5/quotes-2019/*.hdf5')

In [4]:
quotes_2020

#,Unnamed: 0,quoteID,quotation,speaker,qids,date,probas,urls,phase,numOccurrences
0,0.0,2020-01-28-000082,'[ D ] espite the efforts of the partners to cre...,,[],2020-01-28 08:04:05,"""[['None', '0.7272'], ['Prime Minister Netanyahu...","""['http://israelnationalnews.com/News/News.aspx/...",E,--
1,0.0,2020-01-16-000088,'[ Department of Homeland Security ] was livid a...,Sue Myrick,['Q367796'],2020-01-16 12:00:13,"""[['Sue Myrick', '0.8867'], ['None', '0.0992'], ...","""['http://thehill.com/opinion/international/4782...",E,--
2,0.0,2020-02-10-000142,'... He (Madhav) also disclosed that the illegal...,,[],2020-02-10 23:45:54,"[['None', '0.8926'], ['Prakash Rai', '0.1074']]","""['https://indianexpress.com/article/business/ec...",E,--
3,0.0,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,"[['None', '0.581'], ['Andy Harris', '0.4191']]","""['https://patriotpost.us/opinion/68622-trump-bu...",E,--
4,0.0,2020-01-24-000168,'[ I met them ] when they just turned 4 and 7. T...,Meghan King Edmonds,['Q20684375'],2020-01-24 20:37:09,"""[['Meghan King Edmonds', '0.5446'], ['None', '0...","""['https://people.com/parents/meghan-king-edmond...",E,--
...,...,...,...,...,...,...,...,...,...,...
5244917,0.0,2020-02-24-080186,"""you're seeing a young team that's maturing, tha...",Brendan Whittet,['Q18115465'],2020-02-24 05:00:28,"""[['Brendan Whittet', '0.7077'], ['None', '0.292...","""['http://feeds.browndailyherald.com/~r/BrownDai...",E,1
5244918,0.0,2020-02-07-122251,"""You're talking about African-Americans, right? ...",Barry Michael Cooper,['Q3635235'],2020-02-07 00:00:00,"""[['Barry Michael Cooper', '0.5605'], ['None', '...","""['https://www.villagevoice.com/2020/02/07/1980-...",E,1
5244919,0.0,2020-02-27-098715,"""You've got to own the team. I see Broosky and h...",,[],2020-02-27 05:59:00,"""[['None', '0.8899'], ['Trent Barrett', '0.0539'...","""['https://www.foxsports.com.au/nrl/nrl-premiers...",E,1
5244920,0.0,2020-02-04-118820,"""You've got to sometimes take that leap of faith...",Brad Gushue,['Q896796'],2020-02-04 14:47:00,"""[['Brad Gushue', '0.706'], ['None', '0.2919'], ...","""['http://timescolonist.com/in-the-rings-steski-...",E,10


In [None]:
print(len(quotes_2020))

5244922


# Filtering quotes for associated keywords
 - Women/girls/etc appearing in quote including female pronouns
 - Men/boys/etc. including male pronouns

In [103]:
#word_women_extended = ['woman', 'women', 'lady', 'dame', 'girl', 'bitch', 'sister', 'mother', 'daughter', 'wife']


word_women = ['woman', 'women', 'lady', 'dame', 'girl', 'ladies', 'girls', 'she', 'her',]
words_men = [' man ', ' men ', 'boy', 'boys', 'gentlemen', 'gentleman', 'sir', 'he', 'his', 'him']


#initmethod for list does not work - write function create_subframes_with_words(wordlist)
df_word_girl = quotes_2020[quotes_2020['quotation'].str.lower().str.contains('girl')]

In [104]:
#Function to create new dataframes with only these keywords - separate for men and women

def create_subframes_with_words(word_list):
  ''' Input Vaex dataframe with all the quotes
      Output: Vaex dataframe filtered with only relevant quotes'''

  df_filtered_words = quotes_2020[quotes_2020['quotation'].str.lower().str.contains(word_list[1])]
  for i in word_list[1:]:
    df_word = quotes_2020[quotes_2020['quotation'].str.lower().str.contains(i)]
    df_filtered_words = vaex.concat((df_filtered_words, df_word), resolver='flexible')
    
    return df_filtered_words

## Creation of Dataframes

In [105]:
df_filtered_words_women = create_subframes_with_words(word_women)


In [106]:
print(len(df_filtered_words_women))

88520


In [101]:
df_filtered_words_men = create_subframes_with_words(words_men)


In [102]:
print(len(df_filtered_words_men))

27232


In [None]:
df_filtered_words_women.head()

In [None]:
df_filtered_words_men.head()

#,Unnamed: 0,quoteID,quotation,speaker,qids,date,probas,urls,phase,numOccurrences
0,0,2020-01-23-005418,"""And I think we've come to a time where we are p...",Sheila Oliver,['Q7493177'],2020-01-23 21:15:14,"[['Sheila Oliver', '0.9228'], ['None', '0.0772']]","""['https://wobm.com/how-toxic-is-nj-political-cl...",E,--
1,0,2020-02-20-015931,"'Finally, we reached the point a few weeks ago w...",Elizabeth Warren,['Q434706'],2020-02-20 00:00:00,"""[['Elizabeth Warren', '0.7201'], ['None', '0.25...","""['http://feeds.foxnews.com/~r/foxnews/politics/...",E,--
2,0,2020-01-23-005418,"""And I think we've come to a time where we are p...",Sheila Oliver,['Q7493177'],2020-01-23 21:15:14,"[['Sheila Oliver', '0.9228'], ['None', '0.0772']]","""['https://wobm.com/how-toxic-is-nj-political-cl...",E,--
3,0,2020-02-20-015931,"'Finally, we reached the point a few weeks ago w...",Elizabeth Warren,['Q434706'],2020-02-20 00:00:00,"""[['Elizabeth Warren', '0.7201'], ['None', '0.25...","""['http://feeds.foxnews.com/~r/foxnews/politics/...",E,--
4,0,2020-03-02-032147,'It got expanded to black men playing basketball...,,[],2020-03-02 03:05:24,"[['None', '0.6239'], ['Doug Bruno', '0.3761']]","""['https://depauliaonline.com/47011/sports/depau...",E,--
5,0,2020-02-21-041290,'(Last year) was heartbreaking for the black com...,,[],2020-02-21 20:57:05,"""[['None', '0.8859'], ['Tim Walz', '0.069'], ['K...","""['https://www.twincities.com/2020/02/21/mn-hous...",E,--
6,0,2020-02-03-057804,'On the right hand side of the mural are the Spa...,,[],2020-02-03 01:59:19,"""[['None', '0.7019'], ['Carmen Guerrero Nakpil',...","""['http://abs-cbnnews.com/ancx/culture/spotlight...",E,--
7,0,2020-04-12-019275,'Our lights must shine on men in these crucial t...,Sir John,"['Q28124344', 'Q45996744']",2020-04-12 06:24:20,"[['Sir John', '0.9398'], ['None', '0.0602']]","""['http://graphic.com.gh/news/politics/christ-s-...",E,--
8,0,2020-02-05-106321,'We are in the midst of one of the greatest gend...,,[],2020-02-05 18:21:00,"[['None', '0.9579'], ['Eric Coble', '0.0421']]","""['https://www.broadwayworld.com/sarasota/articl...",E,--
9,0,2020-01-18-003714,'At a time when we should be celebrating in my o...,,[],2020-01-18 00:31:37,"[['None', '0.729'], ['Odell Beckham', '0.271']]","""['https://sportsnaut.com/2020/01/new-orleans-di...",E,--


## Processing of data to be able to use GDSMM




In [13]:
#Removal of irrelevant columns of databank for this analysis

df_women = df_filtered_words_men['quotation'].str.lower()
df_men = df_filtered_words_men['quotation'].str.lower()

In [15]:
# Functions to remove punctionations and numbers from quotes

def remove_from_str(vaex_df, list):
  for i in list:
    vaex_df = vaex_df.str.replace(str(i), "")
  return(vaex_df)

def repl_wh_from_str(vaex_df, list):
  for i in list:
    vaex_df = vaex_df.str.replace(str(i), " ")
  return(vaex_df)



In [16]:
#Clean women quotes
df_women = repl_wh_from_str(df_women, [".", ",", "'", ":", ".;", "!", "?", "'", "-"])
df_women = remove_from_str(df_women, ["[", "]", "(", ")", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"])

#Clean men quotes
df_men = repl_wh_from_str(df_men, [".", ",", "'", ":", ".;", "!", "?", "'", "-"])
df_men = remove_from_str(df_men, ["[", "]", "(", ")", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"])



## Further preprocessing
  - tokenize the quotes
  - lemmatize (filter for word groups)
  - stop word removal 

In [17]:
!pip install git+https://github.com/rwalk/gsdmm.git

Collecting git+https://github.com/rwalk/gsdmm.git
  Cloning https://github.com/rwalk/gsdmm.git to /tmp/pip-req-build-01p_rwqn
  Running command git clone -q https://github.com/rwalk/gsdmm.git /tmp/pip-req-build-01p_rwqn
Building wheels for collected packages: gsdmm
  Building wheel for gsdmm (setup.py) ... [?25l[?25hdone
  Created wheel for gsdmm: filename=gsdmm-0.1-py3-none-any.whl size=4603 sha256=f0ff07268744548f353113f8e1fd79f7fded0ff30a548eb50eaa352cc0420872
  Stored in directory: /tmp/pip-ephem-wheel-cache-r1iab0ef/wheels/34/65/a6/7eef67b88abae954fecd22587bd755c27b58a9ffe488d6b0de
Successfully built gsdmm
Installing collected packages: gsdmm
Successfully installed gsdmm-0.1


In [18]:
import gensim
from gsdmm import MovieGroupProcess
import pyarrow as pa
from ast import literal_eval

In [21]:
#For further analysis need to convert vaex to numpy
women_numpy = df_filtered_words_women.evaluate(df_filtered_words_women['quotation'], i1=1, i2=88520).to_pandas().to_numpy()
men_numpy = df_filtered_words_men.evaluate(df_filtered_words_men['quotation'], i1=1, i2=27232).to_pandas().to_numpy()

In [25]:
#Tokenization, ie. make each word one list element

def tokenize_sentences_of_np_array(np_array):
  '''returns array of words for every sentence and a sentence is a list'''
  numpy_quotes = []
  for quote in np_array:
    list_quote = []
    list_quote = list(gensim.utils.tokenize(quote))
    numpy_quotes.append(list_quote)
  numpy_quotes = np.asarray(numpy_quotes)

  return numpy_quotes

In [28]:
women_numpy_tokenized = tokenize_sentences_of_np_array(women_numpy)
men_numpy_tokenized = tokenize_sentences_of_np_array(men_numpy)

  return array(a, dtype, copy=False, order=order)


In [31]:
# Remove numbers, but not words that contain numbers.

def remove_numbers(numpy_quotes):
  numpy_quotes = [[token for token in doc if not token.isnumeric()] for doc in numpy_quotes]
  return numpy_quotes


# Remove words that are only one character

def remove_one_char_words(numpy_quotes):
  numpy_quotes = [[token for token in doc if len(token) > 4] for doc in numpy_quotes]
  return numpy_quotes

In [32]:
women_numpy_tokenized = remove_numbers(women_numpy_tokenized)
men_numpy_tokenized = remove_numbers(men_numpy_tokenized)

women_numpy_tokenized = remove_one_char_words(women_numpy_tokenized)
men_numpy_tokenized = remove_one_char_words(men_numpy_tokenized)

### Lemmatization
Acknowledging that for topic/portrayal analyis mainly nouns and verbs and adjective are relevant, we remove all the rest and in addition to previous experiments generic words like think, as well as all other words for which we filtered.

In [None]:
! python -m spacy download en_core_web_sm
# English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.

In [34]:
import spacy

In [35]:
#Function for lemmatization

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [37]:
# do lemmatization keeping only noun, vb, adv
women_lemmatized = lemmatization(women_numpy_tokenized, allowed_postags=['NOUN', 'ADJ', 'VERB'])
men_lemmatized = lemmatization(men_numpy_tokenized, allowed_postags=['NOUN', 'ADJ', 'VERB'])


## Stopwords

- remove words that obscure topic analysis (such as woman is not of help if appears as most common word if we filtered for quotes containing woman)
- iterative process: run once and look at very common words that appear void of content (think, people, women, man, world, believe...)


In [117]:
#Import nltk library for stopword custimzation
#Function to remove stopwords from data

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

new_stopwords_women = ['woman', 'women', 'lady', 'dame', 'girl', 'ladies', 'girls', 'she', 'her']
new_stopwords_men = [' man ', ' men ', 'boy', 'boys', 'gentlemen', 'gentleman', 'sir', 'he', 'his', 'him']
general_stopwords = ['woman', 'people', 'take', 'would', 'believe', 'could', 'make', 'want', 'happen', 'thing', 'go', 'great', 'think', 'include']


def remove_custom_stopwords(text, custom_stopwords, general_stopwords):
  '''Input the analysed text and the general additional stopwords as well as custom stopwords for each analysis subgroup)'''
  stpwrd = nltk.corpus.stopwords.words('english')
  stpwrd.extend(custom_stopwords)
  stpwrd.extend(general_stopwords)
  text_wo_stopwords = []
  for sentence in text:
    sentence_wo_stopwords = [words for words in sentence if not words in stpwrd]
    text_wo_stopwords.append(sentence_wo_stopwords)
  return text_wo_stopwords



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [118]:
# remove stop words after lemmatization
women_lem_stopwords = remove_custom_stopwords(women_lemmatized, new_stopwords_women, general_stopwords)
men_lem_stopwords = remove_custom_stopwords(men_lemmatized, new_stopwords_men, general_stopwords)


## Run GSDMM on Women Quotes

In [121]:
# initialize GSDMM
gsdmm = MovieGroupProcess(K=10, alpha=0.1, beta=0.3, n_iters=12)

vocab = set(x for quote in women_lem_stopwords for x in quote)
vocab_size = len(vocab)

# fit GSDMM model
y = gsdmm.fit(women_lem_stopwords, vocab_size)

In stage 0: transferred 74823 clusters with 10 clusters populated
In stage 1: transferred 56076 clusters with 10 clusters populated
In stage 2: transferred 42203 clusters with 10 clusters populated
In stage 3: transferred 34315 clusters with 10 clusters populated
In stage 4: transferred 29374 clusters with 10 clusters populated
In stage 5: transferred 26559 clusters with 10 clusters populated
In stage 6: transferred 25076 clusters with 10 clusters populated
In stage 7: transferred 23986 clusters with 10 clusters populated
In stage 8: transferred 22931 clusters with 10 clusters populated
In stage 9: transferred 22588 clusters with 10 clusters populated
In stage 10: transferred 22286 clusters with 10 clusters populated
In stage 11: transferred 21801 clusters with 10 clusters populated


In [122]:
# print number of documents per topic
doc_count = np.array(gsdmm.cluster_doc_count)
print('Number of documents per topic :', doc_count)

# Topics sorted by the number of document they are allocated to
top_index = doc_count.argsort()[-15:][::-1]
print('Most important clusters (by number of docs inside):', top_index)


# define function to get top words per topic
def top_words(cluster_word_distribution, top_cluster, values):
  sort_dicts = {}
  for cluster in top_cluster:
      sort_list = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
      sort_dicts.update({'Cluster' + str(cluster): sort_list})
      print("\nCluster %s : %s"%(cluster, sort_list))
  return sort_dicts


# get top words in topics
topic_dict = top_words(gsdmm.cluster_word_distribution, top_index, 10)

Number of documents per topic : [10206  8306  9103  9558  6848  6030 15807 12163  2476  8022]
Most important clusters (by number of docs inside): [6 7 0 3 2 1 9 4 5 8]

Cluster 6 : [('story', 1672), ('world', 1234), ('young', 1082), ('change', 913), ('black', 880), ('female', 864), ('strong', 861), ('celebrate', 774), ('inspire', 774), ('work', 762)]

Cluster 7 : [('support', 1535), ('community', 1168), ('opportunity', 936), ('work', 847), ('business', 831), ('provide', 829), ('country', 815), ('continue', 728), ('serve', 711), ('leadership', 698)]

Cluster 0 : [('sexual', 1068), ('child', 1067), ('violence', 866), ('abuse', 782), ('victim', 613), ('assault', 601), ('speak', 543), ('crime', 515), ('police', 502), ('young', 448)]

Cluster 3 : [('sport', 1475), ('player', 1170), ('basketball', 992), ('play', 752), ('football', 685), ('world', 675), ('cricket', 659), ('opportunity', 629), ('coach', 593), ('support', 577)]

Cluster 2 : [('right', 2502), ('fight', 740), ('black', 649), ('eq

# Visualization of word clusters with wordclouds to infer the topic

In [123]:
# Import the wordcloud library
from wordcloud import WordCloud

topic_number = 10
values = 10
for i in range(topic_number):

  # Get topic word distributions from gsdmm model
  cluster_word_distribution = gsdmm.cluster_word_distribution

  # Generate a word cloud image
  wordcloud = WordCloud(background_color='#fcf2ed', 
                              width=1800,
                              height=700,
                              font_path='/content/drive/MyDrive/Q-Bank/Notebooks/Experimental_GDSMM/GDSCMM_Genders_in_quotes2020/ArialCE.ttf',
                              colormap='flag').generate_from_frequencies(cluster_word_distribution[i])

  # Print the generated word cloud to the screen
  fig, ax = plt.subplots(figsize=[20,10])
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off");

  # Save the word cloud to the disk
  wordcloud.to_file('/content/drive/MyDrive/Q-Bank/Notebooks/Experimental_GDSMM/GDSCMM_Genders_in_quotes2020/test_word_maps_female_topic_' + str(i) + '.png')

Output hidden; open in https://colab.research.google.com to view.

## Run GSDMM on Men Quotes

In [124]:
# initialize GSDMM
gsdmm = MovieGroupProcess(K=10, alpha=0.1, beta=0.3, n_iters=12)

vocab = set(x for quote in women_lem_stopwords for x in quote)
vocab_size = len(vocab)

# fit GSDMM model
y = gsdmm.fit(men_lem_stopwords, vocab_size)

In stage 0: transferred 21670 clusters with 10 clusters populated
In stage 1: transferred 12686 clusters with 10 clusters populated
In stage 2: transferred 9618 clusters with 10 clusters populated
In stage 3: transferred 8285 clusters with 10 clusters populated
In stage 4: transferred 7662 clusters with 10 clusters populated
In stage 5: transferred 7253 clusters with 10 clusters populated
In stage 6: transferred 7004 clusters with 10 clusters populated
In stage 7: transferred 6752 clusters with 10 clusters populated
In stage 8: transferred 6563 clusters with 10 clusters populated
In stage 9: transferred 6426 clusters with 10 clusters populated
In stage 10: transferred 6290 clusters with 10 clusters populated
In stage 11: transferred 6046 clusters with 10 clusters populated


In [125]:
# print number of documents per topic
doc_count = np.array(gsdmm.cluster_doc_count)
print('Number of documents per topic :', doc_count)

# Topics sorted by the number of document they are allocated to
top_index = doc_count.argsort()[-15:][::-1]
print('Most important clusters (by number of docs inside):', top_index)


# define function to get top words per topic
def top_words(cluster_word_distribution, top_cluster, values):
  sort_dicts = {}
  for cluster in top_cluster:
      sort_list = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
      sort_dicts.update({'Cluster' + str(cluster): sort_list})
      print("\nCluster %s : %s"%(cluster, sort_list))
  return sort_dicts


# get top words in topics
topic_dict = top_words(gsdmm.cluster_word_distribution, top_index, 10)

Number of documents per topic : [2032 4873 2231 2019 2737 2033 1460 1813 4707 3326]
Most important clusters (by number of docs inside): [1 8 9 4 2 5 0 3 7 6]

Cluster 1 : [('young', 948), ('player', 391), ('coach', 350), ('play', 321), ('look', 319), ('stage', 311), ('year', 259), ('start', 237), ('group', 223), ('election', 217)]

Cluster 8 : [('serve', 545), ('support', 490), ('community', 413), ('country', 405), ('young', 403), ('family', 394), ('protect', 352), ('service', 344), ('work', 343), ('nation', 324)]

Cluster 9 : [('country', 425), ('year', 336), ('world', 327), ('right', 241), ('work', 237), ('support', 214), ('young', 213), ('issue', 212), ('equal', 180), ('interest', 173)]

Cluster 4 : [('young', 482), ('year', 195), ('change', 162), ('world', 161), ('girl', 130), ('give', 130), ('power', 126), ('family', 119), ('important', 118), ('work', 113)]

Cluster 2 : [('black', 349), ('young', 244), ('white', 133), ('child', 131), ('world', 115), ('family', 108), ('place', 95),

In [126]:
# Import the wordcloud library
from wordcloud import WordCloud

topic_number = 10
values = 10
for i in range(topic_number):

  # Get topic word distributions from gsdmm model
  cluster_word_distribution = gsdmm.cluster_word_distribution

  # Generate a word cloud image
  wordcloud = WordCloud(background_color='#fcf2ed', 
                              width=1800,
                              height=700,
                              font_path='/content/drive/MyDrive/Q-Bank/Notebooks/Experimental_GDSMM/GDSCMM_Genders_in_quotes2020/ArialCE.ttf',
                              colormap='flag').generate_from_frequencies(cluster_word_distribution[i])

  # Print the generated word cloud to the screen
  fig, ax = plt.subplots(figsize=[20,10])
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off");

  # Save the word cloud to the disk
  wordcloud.to_file('/content/drive/MyDrive/Q-Bank/Notebooks/Experimental_GDSMM/GDSCMM_Genders_in_quotes2020/test_word_maps_male_topic_' + str(i) + '.png')

Output hidden; open in https://colab.research.google.com to view.

## Discussion of Preliminary Results

The preliminary results we obtain with the 2020 dataset are quite promising, although should always be taken with a grain of salt, since maximally trends of topic clusters can be deduced from such methods such as GDSMM. Depending on the input, the stopwords, they are subject to a high variance, in particular with respect to the ordering of the most "common" clusters. 

Nevertheless, the clusters between men and women show striking differences, that have also been visualized as word clouds. 
This encourages us to further explore this approach and extend it to the analysis of bigger portions of the dataset and across differet dimensions.