# Setup

The [autoreload](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html) extension loaded below ensures that whenever a locally imported Python file changes, the modules defined in it are reloaded automatically.

In [1]:
%load_ext autoreload
%autoreload 2

The cell below imports [`here`](https://pypi.org/project/pyprojroot/), which lets you refer to the root directory of the project consistently across execution environments. It then adds `here()` (the root directory) to the system path so that Python modules defined in the project can be imported.

In [2]:
#!pip install hereutil flask_sqlalchemy
import pandas as pd
from hereutil import here, add_to_sys_path
add_to_sys_path(here())
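
To make the idea concrete, here is a minimal sketch of how a project root can be located and put on the system path using only the standard library. It illustrates the general technique and is not the implementation used by `hereutil`: the function names `find_project_root` and `add_root_to_sys_path`, and the `.here` marker file, are assumptions made for this example.

```python
import sys
import tempfile
from pathlib import Path


def find_project_root(start, marker=".here"):
    """Walk upwards from `start` until a directory containing `marker` is found.

    The marker file name is a hypothetical convention chosen for this sketch.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} above {start}")


def add_root_to_sys_path(root):
    """Put the root directory at the front of sys.path unless it is already there."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Demonstration on a throwaway directory tree: project/notebooks/drafts
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    nested = project / "notebooks" / "drafts"
    nested.mkdir(parents=True)
    (project / ".here").touch()  # mark the project root

    root = find_project_root(nested)
    added = add_root_to_sys_path(root)
    found_root_name = root.name
    on_path = added in sys.path
    sys.path.remove(added)  # clean up so the demo leaves sys.path unchanged
```

Because the root is found by walking upwards from the current location, the same import statements work whether the notebook is started from the project root or from a nested folder.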

Having ensured that the root path of the project is in the system path, we can load common basis functions from [src/common_basis.py](/src/common_basis.py). The template assumes that functions useful for most work are defined in `common_basis.py`, whereas code useful only for individual analyses is defined where it is needed.

Naturally, if more refined organisation of common code is needed, one is also free to define whichever other modules one wants.

The central object defined in `common_basis` is `con`, the [MariaDB](https://mariadb.com/) (MySQL) database connection (an [SQLAlchemy Connection](https://docs.sqlalchemy.org/en/14/core/connections.html)) through which existing data is read and new data is stored for others to reuse. Below, you will see how to use `con` both to store data in the database and to query it.

The details of the database connection are stored in [`db_params.yaml`](/db_params.yaml). The password is given separately. **DO NOT INCLUDE THE PASSWORD IN ANY CODE YOU COMMIT TO GITHUB**. The first time you run this notebook, it asks for the password and then stores it separately in your keyring. This requires a working keyring implementation on your system. Consult the [`keyring`](https://pypi.org/project/keyring/) package documentation if you have problems. If you cannot get it to work, a second option is to create a `db_secret.yaml` file in the project root directory with `db_pass: [PASSWORD]` as the content. This file is already set to be ignored by Git so that it is not accidentally included in a commit, but still, if you do this, **DON'T MAKE THE MISTAKE OF COMMITTING THE FILE TO GITHUB**.


In [3]:
from src.common_basis import *

eng, con = get_connection()

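## Example of reading tweets from the lynching_tweets_a table

The following code reads a random sample of tweets from the `lynching_tweets_a` table according to these parameters:

- **keyword**: each tweet returned by the query matches the keyword in a full-text search on the `text` column (`MATCH ... AGAINST`), so the keyword is matched as a word rather than as an arbitrary substring
- **n**: upper limit on how many tweets are fetched
- **start_date**: fetches tweets whose `created_at` is on `start_date` or later
- **end_date**: fetches tweets up to the end of the day before `end_date`; assuming `created_at` is a datetime column, the bare date is compared as midnight at the start of `end_date`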
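
The effect of these four parameters can be tried out without database access. The sketch below uses an in-memory `sqlite3` database as a stand-in for the MariaDB connection, so several details are assumptions: the three-column table layout is invented for the demo, `LIKE` replaces the full-text `MATCH ... AGAINST` search, and `RANDOM()` replaces `RAND()`. It also passes the values as bound parameters instead of formatting them into the SQL string, which avoids quoting problems when a keyword contains an apostrophe.

```python
import sqlite3


def sample_tweets(con, keyword, n, start_date, end_date):
    """Return up to n random rows that contain keyword and fall in the date range."""
    query = """
        SELECT id, text, created_at
        FROM lynching_tweets_a
        WHERE text LIKE '%' || :keyword || '%'
          AND created_at BETWEEN :start_date AND :end_date
        ORDER BY RANDOM()
        LIMIT :n
    """
    params = {"keyword": keyword, "n": n,
              "start_date": start_date, "end_date": end_date}
    return con.execute(query, params).fetchall()


# Hypothetical stand-in table with a handful of rows
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE lynching_tweets_a "
    "(id INTEGER PRIMARY KEY, text TEXT, created_at TEXT)"
)
con.executemany(
    "INSERT INTO lynching_tweets_a (text, created_at) VALUES (?, ?)",
    [
        ("news from India today", "2020-02-01 00:00:00"),       # on start_date
        ("india late on the last day", "2023-02-04 23:59:59"),  # day before end_date
        ("india after the cut-off", "2023-02-05 08:30:00"),     # during end_date
        ("unrelated tweet", "2021-06-01 12:00:00"),             # no keyword
    ],
)

rows = sample_tweets(con, "india", 100, "2020-02-01", "2023-02-05")
matched_ids = sorted(row[0] for row in rows)
```

Only the first two rows come back: the third is excluded because it was created after midnight at the start of `end_date`, and the fourth does not contain the keyword. With SQLAlchemy's `text()` the same named placeholders (`:keyword` and so on) can be used, with the values supplied separately from the query string.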

In [4]:
from sqlalchemy import text

keyword = "india"
n = 100
start_date = '2020-02-01'
end_date = '2023-02-05'

query = f"""
SELECT *
FROM lynching_tweets_a
WHERE MATCH(text) AGAINST('{keyword}' IN BOOLEAN MODE) 
AND created_at BETWEEN '{start_date}' AND '{end_date}'
ORDER BY RAND()
LIMIT {n};
"""

print("Query: ", query)

# Run the query; the resulting DataFrame is used in the next cell
df = pd.read_sql(text(query), con)


Query:  
SELECT *
FROM lynching_tweets_a
WHERE MATCH(text) AGAINST('india' IN BOOLEAN MODE) 
AND created_at BETWEEN '2020-02-01' AND '2023-02-05'
ORDER BY RAND()
LIMIT 100;



In [5]:
# Show text column of the result 
df['text']

0     RT @SpiritOfCongres: India under BJP:\n\n⚫Riot...
1     @azharulhaq @cjwerleman @ShiningSadaf @UN Indi...
2     RT @bainjal: New normal in new India cops watc...
3     RT @Shubham_fd: Lynching of Brahmins by Dalits...
4     Dumbledore’s Army has spoken. Trans Women are ...
                            ...                        
95    RT @imMAK02: Muslims are systematically target...
96    @FaroghYusuf @vandurocks Because I am a human....
97    RT @imMAK02: Hindutva Nazis are trending "I st...
98    RT @BachiBashith: Bjp leader nation wide calle...
99    RT @Aakashhassan: Kashmiri boy who was lynched...
Name: text, Length: 100, dtype: object

# Topic Modelling

In [None]:
import nltk

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel, TfidfModel

# spaCy
import spacy

# NLTK stop word list (downloaded on first use)
nltk.download('stopwords')
from nltk.corpus import stopwords

# Visualisation
# Note: newer pyLDAvis releases expose this module as pyLDAvis.gensim_models instead.
import pyLDAvis
import pyLDAvis.gensim


In [None]:

def prep_data(data) -> list:
    """
    Converts the JSON structure of the dataset into a list.
    :param data: Dataset to process.
    :return: List of post contents from the dataset.
    """
    # Each entry is expected to be a mapping with a 'content' key.
    return [post['content'] for post in data]

def lemmatization(texts, allowed_postags=("NOUN", "ADJ", "VERB", "ADV")):
    """
    Lemmatizes the post texts and keeps only tokens with an allowed part-of-speech tag.
    :param texts: List of post texts from the dataset.
    :param allowed_postags: Part-of-speech tags that are kept, e.g. nouns.
    :return: Lemmatized texts, i.e. the filtered texts.
    """
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # Source: https://github.com/explosion/spaCy/issues/7453
    texts_out = []
    for text in texts:
        if text is None:
            # Skip missing posts.
            continue
        doc = nlp(text)
        # Keep the lemma of every token whose part of speech is allowed.
        new_text = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
        texts_out.append(" ".join(new_text))

    return texts_out

def gen_words(texts):
    """
    Gensim-specific preprocessing.
    :param texts: List of post texts from the dataset.
    :return: Processed texts.
    """
    # simple_preprocess lowercases and tokenizes; deacc=True strips accent marks.
    return [gensim.utils.simple_preprocess(text, deacc=True) for text in texts]

def make_bigram(texts, bigram):
    # Lazily apply the bigram model to every tokenized document.
    return (bigram[doc] for doc in texts)

def make_trigram(texts, trigram, bigram):
    # Apply the bigram model first, then the trigram model on top of it.
    return (trigram[bigram[doc]] for doc in texts)

In [5]:
def run_topicmodelling(data, target_file_path):
    """
    Base method for performing topic modelling. It is divided into the following steps: data preparation, preprocessing, building bigrams and trigrams,
    filtering the data with the TF-IDF model, and visualization with pyLDAvis.
    :param data: JSON object of the dataset to process; loaded in main.py
    :param target_file_path: Target path for saving the pyLDAvis visualization.
    """
    # prepare data
    data_posts = prep_data(data)

    # Preprocessing - filter the words and reduce them to their base form (lemmatization)
    stop = set(stopwords.words('english'))  # loaded here, but not applied in the steps below
    lemmatized_texts = lemmatization(data_posts)
    lemmatized_data = gen_words(lemmatized_texts)
    print("Lemmatized Data Example:", lemmatized_data[0])

    # bigrams and trigrams
    bigrams_phrases = gensim.models.Phrases(lemmatized_data, min_count=5, threshold=100)
    trigram_phrases = gensim.models.Phrases(bigrams_phrases[lemmatized_data], threshold=100)

    bigram = gensim.models.phrases.Phraser(bigrams_phrases)
    trigram = gensim.models.phrases.Phraser(trigram_phrases)

    # Merge frequent word pairs and triples into single tokens in every document
    trigram_total = [trigram[bigram[doc]] for doc in lemmatized_data]

    # Map each unique token to an integer id and convert the documents to bag-of-words vectors
    id2word = corpora.Dictionary(trigram_total)
    print("id2word - Unique Tokens Examples:", id2word)
    corpus = [id2word.doc2bow(text) for text in trigram_total]

    # Filter the data with TF-IDF: drop tokens whose score falls below low_value
    tfidf = TfidfModel(corpus, id2word=id2word)

    low_value = 0.03

    for i, bow in enumerate(corpus):
        scores = tfidf[bow]
        tfidf_ids = [token_id for token_id, value in scores]
        low_value_words = [token_id for token_id, value in scores if value < low_value]
        # Tokens with a TF-IDF score of 0 are missing from the TF-IDF output
        words_missing_in_tfidf = [token_id for token_id, count in bow if token_id not in tfidf_ids]

        # Keep only tokens that are neither low-scoring nor missing, and reassign
        corpus[i] = [b for b in bow if b[0] not in low_value_words and b[0] not in words_missing_in_tfidf]
    # Build the LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=20,
                                                random_state=100,
                                                update_every=1,
                                                chunksize=100,
                                                passes=10,
                                                alpha="auto")

    # Visualize and save
    vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word, mds="mmds", R=30)
    pyLDAvis.save_html(vis, target_file_path)

# https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#1introduction
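
The TF-IDF filtering step inside `run_topicmodelling` is the least obvious part of the pipeline, so here is a small standard-library sketch of the same idea on a toy corpus. It is an illustration under assumptions, not a replacement for gensim's `TfidfModel`: the weighting (raw count times `log2(N / document frequency)`, then L2 normalisation per document) follows what I understand to be gensim's default scheme, and the helper names `tfidf_weights` and `drop_low_value` are invented for this example.

```python
import math
from collections import Counter


def tfidf_weights(corpus):
    """Return one {token_id: weight} dict per bag-of-words document.

    Assumed weighting: count * log2(n_docs / doc_freq), L2-normalised per document.
    Tokens whose weight is 0 are left out of the result.
    """
    n_docs = len(corpus)
    # Number of documents each token occurs in (each id appears once per document)
    doc_freq = Counter(token_id for bow in corpus for token_id, _ in bow)
    weighted = []
    for bow in corpus:
        raw = {tid: count * math.log2(n_docs / doc_freq[tid]) for tid, count in bow}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        if norm == 0:
            weighted.append({})
        else:
            weighted.append({tid: w / norm for tid, w in raw.items() if w > 0})
    return weighted


def drop_low_value(corpus, low_value=0.03):
    """Remove tokens whose TF-IDF weight is missing or below low_value."""
    filtered = []
    for bow, weights in zip(corpus, tfidf_weights(corpus)):
        filtered.append(
            [(tid, count) for tid, count in bow if weights.get(tid, 0.0) >= low_value]
        )
    return filtered


# Toy corpus of (token_id, count) pairs; token 0 occurs in every document
toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (3, 1)],
]
filtered_corpus = drop_low_value(toy_corpus)
```

Token `0` appears in all three documents, so its inverse document frequency is `log2(3/3) = 0` and it is removed everywhere, while the document-specific tokens survive. This is the effect the filtering step aims for: words that occur across nearly all tweets carry little information for separating topics.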

