## Group 3 - Big Data WS2021

## Use Case: Topic Modeling with LDA on Wikipedia Data

### Questions:
- Why is topic modeling with Wikipedia articles a big data use case?
- What options are there for implementing it?
- How does the program address "scaling", "parallelization", and "fault tolerance"?
- Where exactly can these properties be found?

---

**Initialize PySpark**

_In this Jupyter Notebook setup we use "findspark" so that Python can locate the Spark installation and import pyspark._

In [1]:
import findspark
findspark.init()

import pyspark # only run after findspark.init()

**Imports**

pyspark.sql.SparkSession: "Main entry point for DataFrame and SQL functionality." **Source:** Spark Documentation

gensim: Python library for unsupervised topic modeling and NLP
corpora: gensim subpackage for working with corpora (collections of text documents) and dictionaries

pyLDAvis.gensim: Visualizes our topic model

nltk: Collection of Python modules for working with human language data

re: Regular expressions

wikipedia: Python package for fetching Wikipedia articles

In [None]:
from pyspark.sql import SparkSession

import gensim
from gensim import corpora, models
from gensim.test.utils import datapath, get_tmpfile
from gensim.corpora import WikiCorpus, MmCorpus

# Note: pyLDAvis 3.3 and later rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

import re

import wikipedia

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

**Functions**

These functions are used later on to extract and clean our data.

get_title: returns the title of the article

get_content: returns the content of the article

check_if_person: heuristic that returns True if a date appears in the first 150 characters, which we take as a sign that the article is about a person

clean: cleans the data by tokenizing the text, removing stopwords, and stemming the words (reducing them to their stem)

In [None]:
def get_title(content):
    # Remove any leading or trailing whitespace
    content = content.strip()
    if content == '':
        return ''
    try:
        # Line 1 is the <doc ...> header, line 2 the title, the rest is the article text
        _header, title, _text = content.split("\n", 2)
    except ValueError:
        # Fewer than three lines: the chunk is malformed
        return 'error'
    return title

def get_content(content):
    # Remove any leading or trailing whitespace
    content = content.strip()
    if content == '':
        return ''
    try:
        # Line 1 is the <doc ...> header, line 2 the title, the rest is the article text
        _header, _title, actual_content = content.split("\n", 2)
    except ValueError:
        # Fewer than three lines: the chunk is malformed
        return 'error'
    return actual_content

def check_if_person(content):
    # Heuristic: look for a date within the first 150 characters
    content = content[:150]
    # Day first, e.g. "14 March 1879"
    list1 = re.findall(r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}", content)
    # Month first, e.g. "March 14, 1879"
    list2 = re.findall(r"[ADFJMNOS]\w* [\d]{1,2}[,] [\d]{4}", content)
    return len(list1) > 0 or len(list2) > 0

def clean(article):
    title = article[0]
    document = article[1]
    tokens = RegexpTokenizer(r'\w+').tokenize(document.lower())
    # Build the stopword set and the stemmer once per article instead of once per token
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens_clean = [token for token in tokens if token not in stop_words]
    tokens_stemmed = [stemmer.stem(token) for token in tokens_clean]
    return (title, tokens_stemmed)
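
The two patterns in check_if_person look for a date near the start of an article. The following is a minimal sketch of what they match, using only the standard library. The sample sentences are invented for illustration and `has_date` is a hypothetical helper name, not part of the notebook.

```python
import re

# Same two patterns as in check_if_person
DAY_MONTH_YEAR = r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}"     # e.g. "14 March 1879"
MONTH_DAY_YEAR = r"[ADFJMNOS]\w* [\d]{1,2}[,] [\d]{4}"  # e.g. "March 14, 1879"

def has_date(text):
    """Return True if a date-like string occurs in the first 150 characters."""
    head = text[:150]
    return bool(re.search(DAY_MONTH_YEAR, head) or re.search(MONTH_DAY_YEAR, head))

# Invented sample openings, not taken from the data set
person = "Jane Doe (14 March 1879 - 18 April 1955) was a physicist."
person_us = "John Doe (December 9, 1906 - January 1, 1992) was an engineer."
thing = "A chair is a piece of furniture with a raised surface."
company = "Example Motors was founded on 16 July 1909."

print(has_date(person), has_date(person_us), has_date(thing), has_date(company))
```

The last sample shows the limit of the heuristic: any article that mentions a full date early on is flagged, whether or not it is about a person.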

**Create SparkContext**

SparkContext: Entry point to Spark to create RDDs

In [None]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [2]:
spark

**1 - Get data**

We use 14 articles, seven of which belong to the category "car", while the others are unrelated to it.
This lets us identify the topic "car" more clearly in our tests.

sc.wholeTextFiles creates an RDD of (file path, file content) pairs, one per file, which contains our articles.

In [None]:
data = sc.wholeTextFiles("C:/Users/Alina/Big Data/Wikipedia Exports/cars/*/*")

**2 - Clean data**

In [None]:
# Split each file into articles at the closing </doc> tag (this also removes the tag)
pagesRaw = data.flatMap(lambda x: x[1].split('</doc>'))

# Create key-value pairs of title and content
pagesTitleContent = pagesRaw.map(lambda x: (get_title(x), get_content(x)))

# See "Functions"
cleanedPages = pagesTitleContent.map(lambda x: clean(x))
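
To see what the first two transformations do without starting Spark, the same steps can be sketched on a plain Python string. This assumes that the export wraps each article as a `<doc ...>` header line, a title line, the text, and a closing `</doc>`. The sample file content and the helper name `split_articles` are invented for illustration.

```python
# Invented sample in the assumed export layout: header line, title line, text, </doc>
sample_file = (
    '<doc id="1" title="Audi">\n'
    'Audi\n'
    '\n'
    'Audi is a car manufacturer.\n'
    '</doc>\n'
    '<doc id="2" title="Chair">\n'
    'Chair\n'
    '\n'
    'A chair is a piece of furniture.\n'
    '</doc>\n'
)

def split_articles(file_content):
    """Plain-Python stand-in for the flatMap and map steps (hypothetical helper)."""
    pairs = []
    for chunk in file_content.split('</doc>'):  # corresponds to flatMap
        chunk = chunk.strip()
        if chunk == '':
            # The text after the last </doc> is empty and yields an empty pair
            pairs.append(('', ''))
            continue
        _header, title, text = chunk.split('\n', 2)  # corresponds to map
        pairs.append((title, text))
    return pairs

pairs = split_articles(sample_file)
print([title for title, _ in pairs])

# Dropping the empty trailing pair would be one possible extra step
non_empty = [(title, text) for title, text in pairs if title != '']
print(len(pairs), len(non_empty))
```

The empty chunk after the final `</doc>` is one plausible source of an empty title in the results.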

**3 - Create Dictionary**

Map words to their ids.
https://radimrehurek.com/gensim/corpora/dictionary.html

In [None]:
article_contents = cleanedPages.map(lambda x: x[1])
dictionary = corpora.Dictionary(article_contents.collect())

**4 - Create corpus through BOW**

Use doc2bow to convert a list of words into the bag-of-words format, a list of (token_id, token_count) pairs.
In other words, each document becomes a sparse vector of word counts.
https://www.kite.com/python/docs/gensim.corpora.Dictionary.doc2bow

https://machinelearningmastery.com/gentle-introduction-bag-words-model/

In [None]:
corpus = article_contents.map(lambda x: dictionary.doc2bow(x)).collect()
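
Steps 3 and 4 can be illustrated without gensim. The sketch below builds a token-to-id mapping and converts each document into (token_id, token_count) pairs. The documents, `token2id`, and `doc2bow_sketch` are invented stand-ins, and gensim may assign ids in a different order.

```python
from collections import Counter

# Invented, already cleaned documents (lists of stemmed tokens)
docs = [
    ["car", "model", "car", "engin"],
    ["plant", "water", "plant"],
]

# Stand-in for step 3: map every distinct token to an integer id
token2id = {}
for doc in docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, token_count) pairs for tokens in the mapping."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

# Stand-in for step 4: one bag-of-words vector per document
corpus_sketch = [doc2bow_sketch(doc, token2id) for doc in docs]
print(token2id)
print(corpus_sketch)
```

Only tokens that actually occur in a document appear in its vector, which is why the format stays compact even for a large vocabulary.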

**5 - Apply LDA to our corpus to create topics**

The LdaModel class is trained on our prepared bag-of-words corpus and derives the topics from it.

## ToDo: Insert LDA description
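#### Why do we now see numbers between 0.006 and 0.030 here?

Alternatives:
- gensim's models.ldamulticore (LdaMulticore, which parallelizes training across CPU cores)
- Spark ML (pyspark.ml.clustering.LDA) from the PySpark library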

In [5]:
# passes: number of passes over the whole corpus during training -> takes longer but gives more precise results
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=100)

# num_words: number of words shown per topic, ordered by weight
print(lda_model.print_topics(num_topics=5, num_words=5))

[(0, '0.029*"bmw" + 0.019*"merced" + 0.015*"benz" + 0.012*"car" + 0.011*"model"'), (1, '0.028*"human" + 0.022*"window" + 0.007*"use" + 0.006*"popul" + 0.006*"year"'), (2, '0.022*"audi" + 0.012*"car" + 0.012*"opel" + 0.010*"chair" + 0.008*"model"'), (3, '0.013*"page" + 0.012*"notebook" + 0.007*"paper" + 0.007*"pad" + 0.006*"bound"'), (4, '0.030*"plant" + 0.008*"natur" + 0.007*"water" + 0.007*"life" + 0.006*"earth"')]
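
Each factor in the printed topic strings, such as 0.029*"bmw", is the weight (estimated probability) of that word within the topic. The values are small because every topic spreads its probability over the whole vocabulary and only the five top words are shown. The sketch below parses two of the strings from the output above; `parse_topic` is a hypothetical helper, not a gensim function.

```python
import re

# Two topic strings copied from the printed output
topics = [
    (0, '0.029*"bmw" + 0.019*"merced" + 0.015*"benz" + 0.012*"car" + 0.011*"model"'),
    (4, '0.030*"plant" + 0.008*"natur" + 0.007*"water" + 0.007*"life" + 0.006*"earth"'),
]

def parse_topic(topic_string):
    """Turn '0.029*"bmw" + ...' into [('bmw', 0.029), ...]."""
    matches = re.findall(r'([\d.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in matches]

for topic_id, topic_string in topics:
    words = parse_topic(topic_string)
    # Share of the topic's probability covered by the five listed words
    shown_mass = sum(weight for _, weight in words)
    print(topic_id, words[0], round(shown_mass, 3))
```

For both topics the five listed words account for less than a tenth of the topic's total probability.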


**6 - Test our result with a new article**

We see that an article about another car brand is assigned mainly to topic 2 (audi, car, opel), with a probability of about 77%, followed by topic 0 (bmw, merced, benz) with about 19%.
The remaining topics have very low probabilities.

In [8]:
article_title = "Lamborghini"

cleaned_article_content = clean([article_title, wikipedia.page(article_title).content])[1]
print(list(lda_model[[dictionary.doc2bow(cleaned_article_content)]]))

[[(0, 0.19287233), (1, 0.010782254), (2, 0.76519686), (4, 0.026896305)]]
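
The result is a list of (topic_id, probability) pairs for the new article. The sketch below picks the dominant topic and checks which topics are not listed. To our understanding, gensim omits topics below its `minimum_probability` threshold (0.01 by default), which would explain why topic 3 is absent. The variable names are illustrative.

```python
# Topic distribution copied from the output above
distribution = [(0, 0.19287233), (1, 0.010782254), (2, 0.76519686), (4, 0.026896305)]
num_topics = 5

# Topic with the highest probability
best_topic, best_prob = max(distribution, key=lambda pair: pair[1])

# Topics that are not listed, and the probability mass left over for them
listed = {topic_id for topic_id, _ in distribution}
missing = [topic_id for topic_id in range(num_topics) if topic_id not in listed]
missing_mass = 1.0 - sum(prob for _, prob in distribution)

print(best_topic, round(best_prob, 3))
print(missing, round(missing_mass, 4))
```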


In [9]:
# ? Do we still need this?

cleanedPagesTitles = cleanedPages.map(lambda x: x[0])
cleanedPagesTitles.collect()
#cleanedPagesTitles.count()

['Audi',
 'BMW',
 'Opel',
 'Mercedes-Benz',
 'Human',
 'Nature',
 'Chair',
 'Table',
 'Plant',
 'Window',
 'Notebook',
 'BYD',
 'Volvo',
 'Škoda Auto',
 '']

**7 - Visualize our results and clusters**

pyLDAvis creates an interactive visualization of the topics and their most relevant words.

In [7]:
lda_display = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)