# Create LDA for Corpus<a id='top'></a>

0. Download an available corpus or create a new one. For the latter, create a JSON file for each blog or social media profile of your corpus; each entry or post is a line in a JSON file. One way to do this is to crawl websites using [scrapy](https://scrapy.org) with these flags: "-o result.json -t json" (see [sample crawlers](./scripts/scraper/spiders) and [example item file](./scripts/scraper/items.py)). An example JSON file is [here](./scripts/example.json).
1. [Prepare corpus for the LDA](#prepare). This notebook demonstrates how to load a (German) TEI XML file, extract metadata and texts, and filter out unwanted parts of speech (POS) so that only one word class remains (nouns or adjectives, depending on `keep_only`). The result is saved as a JSON file that the subsequent cells use. You can also prepare your corpus externally; see my [example](./scripts/text.py), which is tailored to Russian texts. It removes all non-Cyrillic characters, drops all words that are not nouns, and uses POS tagging to reduce each noun to its base form (nominative singular). The result is again saved in a JSON file.
2. [Create LDA model for the corpus](#create)
3. [Compute topic distribution for corpus](#compute)
4. [Explore corpus](corpus.ipynb) (different notebook)

For copyright reasons, I cannot publish the scraped raw data. The results of the smoothing process in step 2 are [here](./corpus/); they are used in the examples below.
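
As a rough sketch of the file format the later cells expect: each corpus file holds a list of entries with the fields `title`, `url`, `date`, `author`, `comment_count` and `text`. The field names are taken from the cells in this notebook; the values, the file name `example_blog.json` and the helper `tokens_of` are made up for illustration. `text` may be either a token list or a whitespace-separated string.

```python
import json
import os
import tempfile

# Hypothetical entry: field names follow the cells in this notebook, values are invented.
entries = [
    {
        "title": "Beispielbeitrag",
        "url": "example_blog.xml",
        "date": "1900",
        "author": "N. N.",
        "comment_count": 0,
        "text": ["hohen", "steilen", "kleinen", "ersten", "weit"],
    }
]


def tokens_of(entry):
    """Return the entry's text as a token list, whether it is stored as a list or a string."""
    text = entry["text"]
    return text.split() if isinstance(text, str) else list(text)


# Round trip through a temporary file to show the on-disk layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_blog.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(entries, outfile, ensure_ascii=False)
    with open(path, encoding="utf-8") as infile:
        loaded = json.load(infile)

print(tokens_of(loaded[0]))
```

Accepting both representations mirrors the `try`/`except AttributeError` pattern used in the later cells.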

In [1]:
import os
import sys
from gensim import corpora, models
import logging
import errno
import pandas as pd
from dateutil import parser
import pytz
import numpy as np
import json
import xml.etree.ElementTree as ET
import re
from tqdm import tqdm
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# set global paths for corpus etc.
corpus_path = "alpenwort"
result_path = "results_alpenwort_adj"
model_name = "model"
topics_name = "topics"

## Prepare corpus<a id='prepare'></a>

This cell demonstrates how to load a German TEI XML file, extract metadata and texts, and keep only the tokens of one part of speech (set via `keep_only`).

[Back to top](#top)
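
The extraction step can be tried in isolation, without spaCy, on a small inline document. This is a minimal sketch under assumptions: the sample XML, its attribute values and the helper name `extract_entries` are invented, and real TEI files will usually be structured differently. It relies on the `{*}` namespace wildcard of `xml.etree.ElementTree`, which needs Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# Invented sample; real TEI files carry far more structure.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text title="Beispiel" year="1900" author="N. N.">
    <p>Der hohe Gipfel war in dichten Nebel gehüllt.</p>
    <p>Kurz.</p>
  </text>
</TEI>"""


def extract_entries(xml_string, min_words=4):
    """Collect metadata and sufficiently long child texts of every <text> element."""
    root = ET.fromstring(xml_string)
    entries = []
    # "{*}" matches <text> in any namespace, including the TEI default namespace
    for text_node in root.findall(".//{*}text"):
        paragraphs = [
            child.text
            for child in text_node
            if child.text is not None and len(child.text.split()) >= min_words
        ]
        entries.append(
            {
                "title": text_node.get("title"),
                "date": text_node.get("year"),
                "author": text_node.get("author"),
                "paragraphs": paragraphs,
            }
        )
    return entries


sample_entries = extract_entries(SAMPLE)
print(sample_entries[0]["title"], len(sample_entries[0]["paragraphs"]))
```

The `min_words=4` default corresponds to the "more than three words" check in the cell; the POS filter would then run on each retained paragraph.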

In [2]:
keep_only = "ADJ"
# keep_only = "NOUN"

import spacy
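# quote the interpreter path: an unquoted path containing spaces fails (see the error in the output below)
!"{sys.executable}" -m spacy download de_core_news_sm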
nlp = spacy.load('de_core_news_sm')

for xml_file in tqdm(sorted(os.listdir(corpus_path))):
    # skip everything that is not a TEI xml file; otherwise json files written
    # by an earlier run would be overwritten with an empty list
    if not xml_file.endswith(".xml"):
        continue
    output_json = []
    # get TEI xml data
    tree = ET.parse(os.path.join(corpus_path, xml_file))
    root = tree.getroot()
    for text_node in root.findall(".//{*}text"):
        entry = {}
        entry["title"] = text_node.get("title")
        entry["url"] = xml_file
        entry["date"] = text_node.get("year")
        entry["author"] = text_node.get("author")
        entry["comment_count"] = 0
        entry["text"] = []
        for txt in text_node:
            # POS filtering: only segments with more than three words are tagged
            if txt.text is not None and len(txt.text.split()) > 3:
                doc = nlp(txt.text)
                for w in doc:
                    if w.pos_ == keep_only:
                        entry["text"].append(w.orth_)
        output_json.append(entry)

    # write one json file per xml file
    with open(os.path.join(corpus_path, xml_file.split(".")[0] + ".json"), 'w') as outfile:
        json.dump(output_json, outfile)

'C:\Users\dr.' is not recognized as an internal or external command,
operable program or batch file.
100%|██████████| 92/92 [24:47<00:00, 16.16s/it]


## Create LDA model for corpus<a id='create'></a>

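This cell creates the topic model for the corpus stored in the JSON files. If a serialized dictionary and corpus already exist in `result_path`, they are reused; otherwise they are built from the JSON files first.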

[Back to top](#top)
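
To make the intermediate representation concrete: before training, each document (a token list) is converted to a bag of words, that is, a list of `(token_id, count)` pairs. The following plain-Python sketch only illustrates that idea. It is not gensim's implementation, its id assignment may differ from `corpora.Dictionary`, and the helper names `build_vocabulary` and `to_bow` are made up.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert a token list to sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


docs = [["großen", "hohen", "großen"], ["steilen", "hohen"]]
vocab = build_vocabulary(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```

Word order is discarded in this representation; only the counts per document enter the model.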

In [3]:
number_of_topics = 50

try:
    os.makedirs(result_path)
except OSError as exception:
    if exception.errno != errno.EEXIST:
        raise

# load corpus
corpus = []   
try:
    # load prepared corpus
    corpus = corpora.MmCorpus(os.path.join(result_path, model_name + ".corp"))
    dictionary = corpora.Dictionary.load(os.path.join(result_path, model_name + ".dict"))
except FileNotFoundError:
    # convert json corpus
    for json_file in sorted(os.listdir(corpus_path)):
        print("File: ", json_file)
        if json_file.endswith(".json"):
            # get data
            with open(os.path.join(corpus_path, json_file)) as json_data:
                data = json.load(json_data)
            for entry in data:
                try:
                    corpus.append(entry["text"].split())
                except AttributeError:
                    corpus.append(entry["text"])

    print("File extraction complete.")

    dictionary = corpora.Dictionary(corpus)
    dictionary.save(os.path.join(result_path, model_name + ".dict"))

    corpus = [dictionary.doc2bow(text) for text in corpus]
    corpora.MmCorpus.serialize(os.path.join(result_path, model_name + ".corp"), corpus)    

lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=number_of_topics, alpha='auto', eval_every=5, passes=20)

start = 1
while os.path.isfile(os.path.join(result_path, model_name + "-" +str(start)+ ".lda")):
    start += 1

lda.save(os.path.join(result_path, model_name + "-" +str(start)+ ".lda"))

print("LDA saved as", os.path.join(result_path, model_name + "-" +str(start)+ ".lda"))

2020-11-16 23:47:16,642 : INFO : initializing cython corpus reader from results_alpenwort_adj\model.corp


File:  av_1900_031_TEI.json
File:  av_1900_031_TEI.xml
File:  av_1901_032_TEI.json
File:  av_1901_032_TEI.xml
File:  av_1902_033_TEI.json
File:  av_1902_033_TEI.xml
File:  av_1903_034_TEI.json
File:  av_1903_034_TEI.xml
File:  av_1904_035_TEI.json
File:  av_1904_035_TEI.xml
File:  av_1905_036_TEI.json
File:  av_1905_036_TEI.xml
File:  av_1906_037_TEI.json
File:  av_1906_037_TEI.xml
File:  av_1907_038_TEI.json
File:  av_1907_038_TEI.xml
File:  av_1908_039_TEI.json
File:  av_1908_039_TEI.xml
File:  av_1909_040_TEI.json
File:  av_1909_040_TEI.xml
File:  av_1910_041_TEI.json
File:  av_1910_041_TEI.xml
File:  av_1911_042_TEI.json
File:  av_1911_042_TEI.xml
File:  av_1912_043_TEI.json
File:  av_1912_043_TEI.xml
File:  av_1913_044_TEI.json
File:  av_1913_044_TEI.xml
File:  av_1914_045_TEI.json
File:  av_1914_045_TEI.xml
File:  av_1915_046_TEI.json
File:  av_1915_046_TEI.xml
File:  av_1916_047_TEI.json
File:  av_1916_047_TEI.xml
File:  av_1917_048_TEI.json
File:  av_1917_048_TEI.xml
File:  av_

2020-11-16 23:47:17,172 : INFO : adding document #0 to Dictionary(0 unique tokens: [])


 av_1929_060_TEI.xml
File:  av_1930_061_TEI.json
File:  av_1930_061_TEI.xml
File:  av_1931_062_TEI.json
File:  av_1931_062_TEI.xml
File:  av_1932_063_TEI.json
File:  av_1932_063_TEI.xml
File:  av_1933_064_TEI.json
File:  av_1933_064_TEI.xml
File:  av_1934_065_TEI.json
File:  av_1934_065_TEI.xml
File:  av_1935_066_TEI.json
File:  av_1935_066_TEI.xml
File:  av_1936_067_TEI.json
File:  av_1936_067_TEI.xml
File:  av_1937_068_TEI.json
File:  av_1937_068_TEI.xml
File:  av_1938_069_TEI.json
File:  av_1938_069_TEI.xml
File:  av_1939_070_TEI.json
File:  av_1939_070_TEI.xml
File:  av_1940_071_TEI.json
File:  av_1940_071_TEI.xml
File:  av_1941_072_TEI.json
File:  av_1941_072_TEI.xml
File:  av_1942_073_TEI.json
File:  av_1942_073_TEI.xml
File:  av_1943_000_TEI.json
File:  av_1943_000_TEI.xml
File:  av_1949_074_TEI.json
File:  av_1949_074_TEI.xml
File:  av_1950_075_TEI.json
File:  av_1950_075_TEI.xml
File extraction complete.


2020-11-16 23:47:18,152 : INFO : built Dictionary(64967 unique tokens: ['11.', 'Commerzienrathes', 'Dementsprechend', 'Deutschen', 'Hohe']...) from 682 documents (total 696571 corpus positions)
2020-11-16 23:47:18,152 : INFO : saving Dictionary object under results_alpenwort_adj\model.dict, separately None
2020-11-16 23:47:18,192 : INFO : saved results_alpenwort_adj\model.dict
2020-11-16 23:47:18,782 : INFO : storing corpus in Matrix Market format to results_alpenwort_adj\model.corp
2020-11-16 23:47:18,782 : INFO : saving sparse matrix to results_alpenwort_adj\model.corp
2020-11-16 23:47:18,782 : INFO : PROGRESS: saving document #0
2020-11-16 23:47:19,392 : INFO : saved 682x64967 matrix, density=1.026% (454421/44307494)
2020-11-16 23:47:19,392 : INFO : saving MmCorpus index to results_alpenwort_adj\model.corp.index
2020-11-16 23:47:19,402 : INFO : using autotuned alpha, starting with [0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 

2020-11-16 23:47:56,704 : INFO : topic #40 (0.022): 0.001*"großen" + 0.001*"ganzen" + 0.001*"erste" + 0.001*"ganze" + 0.001*"kleinen" + 0.001*"ersten" + 0.001*"gut" + 0.001*"alten" + 0.001*"nächsten" + 0.001*"später"
2020-11-16 23:47:56,704 : INFO : topic #36 (0.023): 0.002*"ersten" + 0.002*"großen" + 0.002*"gut" + 0.001*"weit" + 0.001*"hoch" + 0.001*"leicht" + 0.001*"ganzen" + 0.001*"große" + 0.001*"letzten" + 0.001*"rasch"
2020-11-16 23:47:56,704 : INFO : topic #38 (0.058): 0.011*"großen" + 0.007*"große" + 0.006*"hoch" + 0.006*"ersten" + 0.005*"weit" + 0.005*"anderen" + 0.005*"ganzen" + 0.005*"ganze" + 0.005*"kleinen" + 0.005*"hohen"
2020-11-16 23:47:56,714 : INFO : topic #16 (0.060): 0.007*"ersten" + 0.007*"leicht" + 0.005*"steilen" + 0.005*"kleinen" + 0.005*"ganzen" + 0.005*"weit" + 0.005*"gut" + 0.005*"erste" + 0.005*"steile" + 0.005*"hoch"
2020-11-16 23:47:56,714 : INFO : topic #35 (0.075): 0.008*"großen" + 0.008*"ersten" + 0.007*"weit" + 0.006*"letzten" + 0.005*"kleinen" + 0.005

2020-11-16 23:48:48,134 : INFO : topic #36 (0.018): 0.000*"ersten" + 0.000*"großen" + 0.000*"gut" + 0.000*"weit" + 0.000*"hoch" + 0.000*"leicht" + 0.000*"ganzen" + 0.000*"große" + 0.000*"letzten" + 0.000*"rasch"
2020-11-16 23:48:48,134 : INFO : topic #38 (0.073): 0.012*"großen" + 0.007*"große" + 0.007*"hoch" + 0.006*"weit" + 0.005*"anderen" + 0.005*"ersten" + 0.005*"hohen" + 0.005*"alten" + 0.005*"ganze" + 0.005*"ganzen"
2020-11-16 23:48:48,134 : INFO : topic #16 (0.084): 0.007*"ersten" + 0.007*"leicht" + 0.006*"kleinen" + 0.006*"steilen" + 0.005*"steile" + 0.005*"erste" + 0.005*"ganzen" + 0.005*"weit" + 0.005*"steil" + 0.005*"gut"
2020-11-16 23:48:48,134 : INFO : topic #35 (0.128): 0.008*"großen" + 0.008*"ersten" + 0.006*"weit" + 0.006*"letzten" + 0.006*"gut" + 0.006*"nächsten" + 0.006*"kleinen" + 0.005*"große" + 0.005*"hoch" + 0.005*"ganze"
2020-11-16 23:48:48,154 : INFO : topic diff=0.993816, rho=0.333333
2020-11-16 23:48:55,646 : INFO : -11.698 per-word bound, 3322.7 perplexity est

2020-11-16 23:49:38,708 : INFO : topic #2 (0.090): 0.009*"großen" + 0.007*"alpinen" + 0.007*"alten" + 0.007*"große" + 0.006*"anderen" + 0.006*"früher" + 0.006*"ersten" + 0.006*"andere" + 0.005*"neue" + 0.005*"ganze"
2020-11-16 23:49:38,718 : INFO : topic #16 (0.106): 0.008*"leicht" + 0.008*"ersten" + 0.006*"kleinen" + 0.006*"steilen" + 0.005*"steile" + 0.005*"steil" + 0.005*"erste" + 0.005*"ganzen" + 0.005*"weit" + 0.005*"oberen"
2020-11-16 23:49:38,718 : INFO : topic #35 (0.175): 0.008*"großen" + 0.008*"ersten" + 0.006*"weit" + 0.006*"letzten" + 0.006*"gut" + 0.006*"nächsten" + 0.006*"kleinen" + 0.005*"große" + 0.005*"hoch" + 0.005*"steilen"
2020-11-16 23:49:38,728 : INFO : topic diff=0.326763, rho=0.277350
2020-11-16 23:49:45,908 : INFO : -11.767 per-word bound, 3484.6 perplexity estimate based on a held-out corpus of 682 documents with 696571 words
2020-11-16 23:49:45,908 : INFO : PROGRESS: pass 12, at document #682/682
2020-11-16 23:49:51,048 : INFO : optimized alpha [0.016773973, 

2020-11-16 23:50:27,162 : INFO : topic #2 (0.112): 0.009*"großen" + 0.008*"alpinen" + 0.007*"alten" + 0.007*"große" + 0.006*"anderen" + 0.006*"früher" + 0.006*"ersten" + 0.006*"andere" + 0.005*"neue" + 0.005*"ganze"
2020-11-16 23:50:27,162 : INFO : topic #16 (0.124): 0.008*"leicht" + 0.008*"ersten" + 0.006*"kleinen" + 0.006*"steilen" + 0.006*"steile" + 0.006*"steil" + 0.005*"erste" + 0.005*"ganzen" + 0.005*"weit" + 0.005*"oberen"
2020-11-16 23:50:27,172 : INFO : topic #35 (0.215): 0.008*"großen" + 0.008*"ersten" + 0.006*"weit" + 0.006*"letzten" + 0.006*"gut" + 0.006*"nächsten" + 0.006*"kleinen" + 0.006*"große" + 0.005*"hoch" + 0.005*"steilen"
2020-11-16 23:50:27,182 : INFO : topic diff=0.125191, rho=0.242536
2020-11-16 23:50:33,094 : INFO : -11.769 per-word bound, 3489.7 perplexity estimate based on a held-out corpus of 682 documents with 696571 words
2020-11-16 23:50:33,094 : INFO : PROGRESS: pass 16, at document #682/682
2020-11-16 23:50:36,804 : INFO : optimized alpha [0.015170747, 

2020-11-16 23:51:06,224 : INFO : topic #2 (0.132): 0.009*"großen" + 0.008*"alpinen" + 0.007*"alten" + 0.007*"große" + 0.006*"anderen" + 0.006*"früher" + 0.006*"ersten" + 0.006*"andere" + 0.005*"neue" + 0.005*"ganze"
2020-11-16 23:51:06,234 : INFO : topic #16 (0.139): 0.008*"leicht" + 0.008*"ersten" + 0.006*"kleinen" + 0.006*"steilen" + 0.006*"steil" + 0.006*"steile" + 0.005*"erste" + 0.005*"ganzen" + 0.005*"weit" + 0.005*"oberen"
2020-11-16 23:51:06,234 : INFO : topic #35 (0.248): 0.008*"großen" + 0.008*"ersten" + 0.006*"weit" + 0.006*"letzten" + 0.006*"gut" + 0.006*"nächsten" + 0.006*"kleinen" + 0.006*"große" + 0.005*"hoch" + 0.005*"steilen"
2020-11-16 23:51:06,244 : INFO : topic diff=0.056062, rho=0.218218
2020-11-16 23:51:06,254 : INFO : saving LdaState object under results_alpenwort_adj\model-1.lda.state, separately None
2020-11-16 23:51:06,354 : INFO : saved results_alpenwort_adj\model-1.lda.state
2020-11-16 23:51:06,394 : INFO : saving LdaModel object under results_alpenwort_adj\

LDA saved as results_alpenwort_adj\model-1.lda


## Compute topic distribution for corpus<a id='compute'></a>

[Back to top](#top)
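
The cell below picks the model file with the highest number from the result directory. A slightly stricter variant of that lookup is sketched here with a regular expression, so that side files such as `model-1.lda.state` are not matched. The helper name `latest_model_file` is hypothetical and not part of the notebook, and unlike the cell it returns `None` when no model file exists.

```python
import os
import re
import tempfile


def latest_model_file(directory, model_name="model"):
    """Return the file name '<model_name>-<n>.lda' with the largest n, or a fallback."""
    pattern = re.compile(r"^" + re.escape(model_name) + r"-(\d+)\.lda$")
    best_number, best_file = 0, None
    for file_name in os.listdir(directory):
        match = pattern.match(file_name)
        if match and int(match.group(1)) > best_number:
            best_number, best_file = int(match.group(1)), file_name
    # fall back to the unnumbered model file if it exists
    if best_file is None and os.path.isfile(os.path.join(directory, model_name + ".lda")):
        best_file = model_name + ".lda"
    return best_file


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["model-1.lda", "model-2.lda", "model-10.lda", "model-10.lda.state", "model.dict"]:
        open(os.path.join(tmp_dir, name), "w").close()
    newest = latest_model_file(tmp_dir)

print(newest)
```

Comparing the numbers as integers matters here: a plain string sort would rank `model-2.lda` above `model-10.lda`.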

In [4]:
# entries published after max_date are ignored
utc = pytz.UTC
max_date = parser.parse("2014-12-31 23:59:59")
max_date_utc = utc.localize(parser.parse("2014-12-31 23:59:59"))

number = 0
for f in os.listdir(result_path):
    try:
        number = max(number, int(f.split(model_name+"-")[1].split(".lda")[0]))
    except IndexError:
        continue
if number > 0:
    file_name = model_name + "-" + str(number) + ".lda"
else: 
    file_name = model_name + ".lda"

# load LDA model and dictionary
dictionary = corpora.Dictionary.load(os.path.join(result_path, model_name + ".dict"))
model = models.LdaModel.load(os.path.join(result_path, file_name))

# new fields for compatibility, default values from
# https://radimrehurek.com/gensim/models/ldamodel.html
try:
    x = model.minimum_probability
except AttributeError:
    model.minimum_probability = 0.01
    model.minimum_phi_value = 0.01
    model.per_word_topics = False
    model.random_state = np.random.RandomState()

columns = ['group', 'url', 'date', 'comment_count', 'words']
columns.extend([str(topic) for topic in range(model.num_topics)])

result = []

# sort files
for json_file in sorted(os.listdir(corpus_path)):

    print("File: ", json_file)

    if json_file.endswith(".json"):
        # get data
        with open(os.path.join(corpus_path, json_file)) as json_data:
            data = json.load(json_data)

        removed = 0
        too_short = 0

        for entry in data:
            # check if entry is within data range
            try:
                date = parser.parse(entry["date"])
                try:
                    if date > max_date:
                        removed += 1
                        continue
                except TypeError:
                    if date > max_date_utc:
                        removed += 1
                        continue
            except ValueError:
                print("Wrong format", entry["date"])

            # get topic distribution for entry
            line = {}
            # tokenize on any whitespace, as in the model-creation cell
            try:
                text = entry["text"].split()
            except AttributeError:
                text = entry["text"]
                
            # filter too short entries
            if len(text) < 5:
                too_short += 1
                continue

            topics = [0] * model.num_topics
            for (topic, prop) in model[dictionary.doc2bow(text)]:
                topics[topic] = prop
            line["group"] = json_file.split(".json")[0]
            line["url"] = entry["url"]
            line["date"] = entry['date']
            line["words"] = len(text)
            line["comment_count"] = entry["comment_count"]
            for counter in range(len(topics)):
                line[str(counter)] = topics[counter]
            result.append(line)

        print("Total number of entries:", len(data))
        print("Removed because of date: ", removed)
        print("Removed because too short: ", too_short)
        print("Remaining:", (len(data) - removed - too_short))
            
frame = pd.DataFrame(result)
print(columns)
frame = frame[columns]
start = 1
while os.path.isfile(os.path.join(result_path, topics_name + "-" +str(start)+ ".json")):
    start += 1

frame.to_json(os.path.join(result_path, topics_name + "-" + str(start) + ".json"), orient='split')
print("Created", os.path.join(result_path, topics_name + "-" + str(start) + ".json"))

2020-11-16 23:51:51,984 : INFO : loading Dictionary object from results_alpenwort_adj\model.dict
2020-11-16 23:51:52,074 : INFO : loaded results_alpenwort_adj\model.dict
2020-11-16 23:51:52,074 : INFO : loading LdaModel object from results_alpenwort_adj\model-1.lda
2020-11-16 23:51:52,084 : INFO : loading expElogbeta from results_alpenwort_adj\model-1.lda.expElogbeta.npy with mmap=None
2020-11-16 23:51:52,114 : INFO : setting ignored attribute state to None
2020-11-16 23:51:52,114 : INFO : setting ignored attribute dispatcher to None
2020-11-16 23:51:52,114 : INFO : setting ignored attribute id2word to None
2020-11-16 23:51:52,114 : INFO : loaded results_alpenwort_adj\model-1.lda
2020-11-16 23:51:52,114 : INFO : loading LdaState object from results_alpenwort_adj\model-1.lda.state
2020-11-16 23:51:52,224 : INFO : loaded results_alpenwort_adj\model-1.lda.state


File:  av_1900_031_TEI.json
Total number of entries: 22
Removed because of date:  0
Removed because too short:  0
Remaining: 22
File:  av_1900_031_TEI.xml
File:  av_1901_032_TEI.json
Total number of entries: 19
Removed because of date:  0
Removed because too short:  0
Remaining: 19
File:  av_1901_032_TEI.xml
File:  av_1902_033_TEI.json
Total number of entries: 18
Removed because of date:  0
Removed because too short:  0
Remaining: 18
File:  av_1902_033_TEI.xml
File:  av_1903_034_TEI.json
Total number of entries: 15
Removed because of date:  0
Removed because too short:  0
Remaining: 15
File:  av_1903_034_TEI.xml
File:  av_1904_035_TEI.json
Total number of entries: 20
Removed because of date:  0
Removed because too short:  0
Remaining: 20
File:  av_1904_035_TEI.xml
File:  av_1905_036_TEI.json
Total number of entries: 17
Removed because of date:  0
Removed because too short:  0
Remaining: 17
File:  av_1905_036_TEI.xml
File:  av_1906_037_TEI.json
Total number of entries: 16
Removed becaus