# Setup

The [autoreload](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html) extension below ensures that if any locally imported Python files change, the modules defined in them are reloaded.

In [1]:
%load_ext autoreload
%autoreload 2

The cell below imports [`here`](https://pypi.org/project/pyprojroot/), which lets you refer to the root directory of the project consistently across execution environments. It then adds `here()` (the root directory) to the system path so that Python modules defined in the project can be loaded.

In [2]:
from hereutil import here, add_to_sys_path
add_to_sys_path(here())
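
The effect of the cell above can be approximated with the standard library alone. The following is only a sketch under stated assumptions: `find_project_root` and `add_to_sys_path_sketch` are hypothetical names, and using a `.git` marker to locate the root is an assumption, not necessarily how `here()` determines it.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = ".git") -> Path:
    """Walk upwards from `start` until a directory containing `marker` is found.

    Assumption: the project root is identified by a marker entry such as `.git`.
    Falls back to `start` itself if no marker is found.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    return start


def add_to_sys_path_sketch(path: Path) -> None:
    """Put `path` at the front of sys.path unless it is already listed."""
    entry = str(path)
    if entry not in sys.path:
        sys.path.insert(0, entry)
```

Calling `add_to_sys_path_sketch(find_project_root(Path.cwd()))` from a notebook anywhere inside the project would then make `import src.common_basis` resolvable, provided the marker assumption holds.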

Having ensured that the root path of the project is in the system path, we can load common basis functions from [src/common_basis.py](/src/common_basis.py). The template assumes that functions useful for most work are defined in `common_basis.py`, whereas code useful only for individual analyses is defined where it is needed.

Naturally, if more refined organisation of common code is needed, one is also free to define whichever other modules one wants.

The central object defined in `common_basis` is `con`, the [MariaDB](https://mariadb.com/) (MySQL) database connection (an [SQLAlchemy Connection](https://docs.sqlalchemy.org/en/14/core/connections.html)) through which ready data is accessed and new data is stored for others to reuse. Below, you will see both how to use `con` to store data in the database and how to query it.

The details of the database connection are stored in [`db_params.yaml`](/db_params.yaml). The password is given separately. **DO NOT INCLUDE THE PASSWORD IN ANY CODE YOU COMMIT TO GITHUB**. The first time you run this notebook, it asks for the password and then stores it separately in your keyring. This requires a working keyring implementation on your system. Consult the [`keyring`](https://pypi.org/project/keyring/) package documentation if you have problems. If you cannot get it to work, a second option is to create a `db_secret.yaml` file in the project root directory with `db_pass: [PASSWORD]` as the content. This file is already set to be ignored by Git so that it does not accidentally get included in a commit, but still, if you do this, **DON'T MAKE THE MISTAKE OF COMMITTING THE FILE TO GITHUB**.
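
The lookup order described above (keyring first, `db_secret.yaml` as a fallback) could look roughly like the sketch below. This is not the code in `common_basis.py`: the function names, the keyring service name, and the user name are all hypothetical, and the file is parsed as a plain `key: value` line rather than with a YAML library so that the sketch has no extra dependencies.

```python
from pathlib import Path
from typing import Optional


def read_db_secret(path: Path) -> Optional[str]:
    """Return the value of `db_pass` from a minimal `key: value` file, or None."""
    if not path.exists():
        return None
    for line in path.read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "db_pass":
            # Strip whitespace and optional surrounding quotes.
            return value.strip().strip("'\"") or None
    return None


def get_db_password(secret_path: Path,
                    service: str = "hypothetical-db-service",
                    user: str = "hypothetical-user") -> Optional[str]:
    """Try the system keyring first, then fall back to the secret file."""
    try:
        import keyring  # third-party; may be missing or lack a usable backend
        stored = keyring.get_password(service, user)
        if stored:
            return stored
    except Exception:
        pass  # no working keyring: fall through to the file
    return read_db_secret(secret_path)
```

Either way the password stays out of the notebook itself, which is the point of the warning above.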


In [3]:
from src.common_basis import *
i = load_incel_parquet()
i

IncelData(incel_threads, incel_posts, incel_users, incel_quotes)

## Example of reading tweets from lynching_tweets_a table

The following code reads a random sample of tweets from the `lynching_tweets_a` table with the following specifications:

- **keyword**: each tweet returned by the query contains the keyword as a substring
- **n**: upper limit on how many tweets are fetched
- **start_date**: fetches tweets where `date_created_at` is `start_date` or later
- **end_date**: fetches tweets up to, but not including, `end_date` (that is, through the day before `end_date`)
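
A query matching the four specifications above can be sketched as follows. To keep the example runnable without the project database, it uses the standard-library `sqlite3` module instead of `con`; the function name `sample_tweets` and the `id` and `text` columns are assumptions for illustration (only `date_created_at` is named above). On MariaDB the same idea would likely use `CONCAT('%', :keyword, '%')` and `ORDER BY RAND()`, passed through `sqlalchemy.text` with bound parameters.

```python
import sqlite3


def sample_tweets(con, keyword, n, start_date, end_date):
    """Return up to `n` random rows whose text contains `keyword`.

    Rows are restricted to start_date <= date_created_at < end_date.
    Assumption: the table has `id`, `text` and `date_created_at` columns,
    with dates stored as ISO-formatted strings.
    """
    query = """
        SELECT id, text, date_created_at
        FROM lynching_tweets_a
        WHERE text LIKE '%' || :keyword || '%'
          AND date_created_at >= :start_date
          AND date_created_at < :end_date
        ORDER BY RANDOM()
        LIMIT :n
    """
    params = {"keyword": keyword, "n": n,
              "start_date": start_date, "end_date": end_date}
    # Bound parameters keep user-supplied values out of the SQL string.
    return con.execute(query, params).fetchall()
```

Using a half-open date range (`>= start_date` and `< end_date`) avoids having to compute the last timestamp of the final day.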

In [4]:
from sqlalchemy import text

i.incel_posts

Unnamed: 0,post_id,post_id_str,poster_id,time_posted,post_content,post_html,thread_id,post_order_in_thread
0,1,post-8897672,0,2022-06-02 04:02:35,Or maybe she just realized Chad will never com...,"<div class=""message-content js-messageContent""...",0,1
1,2,post-8897678,1,2022-06-02 04:04:16,Ofc Chad will never commit to a crazy bpd toilet.,"<div class=""message-content js-messageContent""...",0,2
2,3,post-8897741,2,2022-06-02 04:12:32,Ill give her a ride in exchange for some head.,"<div class=""message-content js-messageContent""...",0,3
3,4,post-8897798,0,2022-06-02 04:20:35,A man doing the same thing would have been sub...,"<div class=""message-content js-messageContent""...",0,4
4,5,post-8898099,3,2022-06-02 05:25:56,ControlledInsanity said: A man doing the ...,"<div class=""message-content js-messageContent""...",0,5
...,...,...,...,...,...,...,...,...
2266019,2266019,post-65986,1111,2017-12-01 02:25:46,WarmIncelation said: From my perspective ...,"<div class=""message-content js-messageContent""...",132761,33
2266020,2266020,post-66068,444,2017-12-01 03:00:02,universallyabhorred said: Finally it is w...,"<div class=""message-content js-messageContent""...",132761,34
2266021,2266021,post-66078,1111,2017-12-01 03:03:54,nausea said: I am sure the admin and the ...,"<div class=""message-content js-messageContent""...",132761,35
2266022,2266022,post-66110,444,2017-12-01 03:23:43,universallyabhorred said: Admins don't gi...,"<div class=""message-content js-messageContent""...",132761,36


# Topic Modelling

In [5]:
import nltk
import spacy

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel, TfidfModel

# Visualisation
import pyLDAvis
import pyLDAvis.gensim

nltk.download('stopwords')
from nltk.corpus import stopwords

spacy.load("en_core_web_sm")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nceck\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<spacy.lang.en.English at 0x1fa0818ab80>

In [6]:

from bs4 import BeautifulSoup


def prep_data(data) -> list:
    """
    Converts the dataset into a list of post texts.
    :param data: Dataset to process (a DataFrame with a 'cleaned_text' column).
    :return: List of post contents from the dataset.
    """
    print("data prep")
    text_list = []
    for index, row in data.iterrows():
        # NOTE: the total of 250000 is hard-coded and is not the actual number of rows.
        print("index", index, "/", str(250000))

        #post_content = row["post_content"]
        post_content = row['cleaned_text']
        text_list.append(post_content)
    return text_list

def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    """
    Lemmatizes the post texts, keeping only tokens with an allowed part-of-speech tag.
    :param texts: List of post texts from the dataset.
    :param allowed_postags: Part-of-speech tags of the words that are kept, e.g. nouns.
    :return: Lemmatized texts, i.e. the filtered texts.
    """
    print("lemmatization")
    # Source: https://github.com/explosion/spaCy/issues/7453
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    texts_out = []
    for idx, text in enumerate(texts):
        print("idx", idx, "/", len(texts))
        if text is None:
            continue  # skip posts without text
        doc = nlp(text)
        # Keep the lemma of every token whose POS tag is allowed
        lemmas = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(lemmas))

    return texts_out

def gen_words(texts):
    """
    Gensim-specific preprocessing: tokenizes and lowercases each text and strips accents.
    :param texts: List of post texts from the dataset.
    :return: Processed texts as lists of tokens.
    """
    final = []
    print("gen_words")
    for text in texts:
        new = gensim.utils.simple_preprocess(text, deacc=True)
        final.append(new)
    return final

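def make_bigram(texts, bigram):
    print("make_bigram")
    # Return a list instead of rebinding `bigram` to a lazy generator,
    # which would end up indexing the generator itself when consumed.
    return [bigram[doc] for doc in texts]

def make_trigram(texts, trigram, bigram):
    print("make_trigram")
    # Iterate over the texts directly; enumerate() would yield (index, text) tuples.
    return [trigram[bigram[text]] for text in texts]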

In [7]:



def run_topicmodelling(data, target_file_path):
    """
    Base method for performing topic modelling. It is divided into the following steps: data preparation, preprocessing, forming bigrams and trigrams,
    processing the data with the TF-IDF model, and visualisation with pyLDAvis.
    :param data: Dataset to process; described as a JSON object loaded in main.py, passed as a pandas DataFrame in this notebook.
    :param target_file_path: Target path for saving the pyLDAvis visualisation.
    """
    print("topic modelling")
    # prepare data
    data_posts = prep_data(data)
    print("preprocessing")
    # Preprocessing: reduce the words to their base form (lemmatization)
    # NOTE: the stop word set is built here but is not applied anywhere below.
    stop = set(stopwords.words('english'))
    lemmatized_texts = lemmatization(data_posts)
    lemmatized_data = list(gen_words(lemmatized_texts))
    print("Lemmatized Data Example:", lemmatized_data[0])

    # bigrams and trigrams
    bigrams_phrases = gensim.models.Phrases(lemmatized_data, min_count=5, threshold=100)
    trigram_phrases = gensim.models.Phrases(bigrams_phrases[lemmatized_data], threshold=100)

    bigram = gensim.models.phrases.Phraser(bigrams_phrases)
    trigram = gensim.models.phrases.Phraser(trigram_phrases)

    # Apply the bigram model first, then the trigram model, to every document
    trigram_total = [trigram[bigram[doc]] for doc in lemmatized_data]

    id2word = corpora.Dictionary(trigram_total)
    print("id2word - Unique Tokens Examples:", id2word)
    corpus = [id2word.doc2bow(text) for text in trigram_total]

    # Process the data with TF-IDF
    tfidf = TfidfModel(corpus, id2word=id2word)

    low_value = 0.03
    print("corpus analysis")
    for i, bow in enumerate(corpus):
        print("corpus analysis:", i, "/", len(corpus))
        doc_tfidf = tfidf[bow]  # compute the TF-IDF weights once per document
        tfidf_ids = {term_id for term_id, value in doc_tfidf}
        low_value_words = {term_id for term_id, value in doc_tfidf if value < low_value}
        # Words with a TF-IDF score of 0 are missing from the TF-IDF output
        words_missing_in_tfidf = {term_id for term_id, count in bow if term_id not in tfidf_ids}

        # Keep only the words that are neither low-value nor missing
        corpus[i] = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    # Build the LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=20,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha="auto")

    # Visualise and save
    print("visualisation")
    vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
    pyLDAvis.save_html(vis, target_file_path)

# https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#1introduction
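
To make the TF-IDF filtering step in `run_topicmodelling` easier to follow, here is a small pure-Python sketch of the same idea on a toy corpus. It assumes the weighting that gensim's `TfidfModel` uses by default, as far as I know (term count times `log2(N / df)`, then L2 normalisation per document); the function name `tfidf_filter` is hypothetical and the sketch does not call gensim.

```python
import math
from collections import Counter


def tfidf_filter(corpus, low_value=0.03):
    """Drop low-scoring terms from each bag-of-words document.

    `corpus` is a list of documents, each a list of (term_id, count) pairs.
    A term is kept only if its normalised TF-IDF weight is >= `low_value`.
    Terms occurring in every document get weight 0 and are always dropped.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for bow in corpus for term, _ in bow)
    filtered = []
    for bow in corpus:
        weights = {term: count * math.log2(n_docs / doc_freq[term])
                   for term, count in bow}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        keep = {term for term, w in weights.items()
                if norm > 0 and w / norm >= low_value}
        filtered.append([(term, count) for term, count in bow if term in keep])
    return filtered


toy_corpus = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
print(tfidf_filter(toy_corpus))  # term 0 occurs in both documents and is removed
```

The filter removes words that carry little information about a particular document, so the LDA model sees fewer uninformative terms.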



In [10]:

print("type", type(i.incel_posts))

# Copy the slice so that adding a column does not trigger pandas' SettingWithCopyWarning
data = i.incel_posts.iloc[0:50000].copy()
# Keep only the text nodes directly inside the 'bbWrapper' div of each post
data['cleaned_text'] = data['post_html']\
    .apply(lambda x: BeautifulSoup(x, 'html.parser')
       .find('div', class_='bbWrapper')
       .find_all(string=True, recursive=False))\
    .apply(lambda x: ' '.join(x))


run_topicmodelling(data, "test_50000.html")

type <class 'pandas.core.frame.DataFrame'>


topic modelling
data prep
index 0 / 250000
index 1 / 250000
index 2 / 250000
index 3 / 250000
index 4 / 250000
index 5 / 250000
index 6 / 250000
index 7 / 250000
index 8 / 250000
index 9 / 250000
index 10 / 250000
index 11 / 250000
index 12 / 250000
index 13 / 250000
index 14 / 250000
index 15 / 250000
index 16 / 250000
index 17 / 250000
index 18 / 250000
index 19 / 250000
index 20 / 250000
index 21 / 250000
index 22 / 250000
index 23 / 250000
index 24 / 250000
index 25 / 250000
index 26 / 250000
index 27 / 250000
index 28 / 250000
index 29 / 250000
index 30 / 250000
index 31 / 250000
index 32 / 250000
index 33 / 250000
index 34 / 250000
index 35 / 250000
index 36 / 250000
index 37 / 250000
index 38 / 250000
index 39 / 250000
index 40 / 250000
index 41 / 250000
index 42 / 250000
index 43 / 250000
index 44 / 250000
index 45 / 250000
index 46 / 250000
index 47 / 250000
index 48 / 250000
index 49 / 250000
index 50 / 250000
index 51 / 250000
index 52 / 250000
index 53 / 250000
index 54 / 2

