# Prepare a corpus for topic modeling<a id='top'></a>

0. Create a new project directory in the [projects subfolder](projects). It should have the following structure:
    * *projects/your_project_name/raw* for your raw corpus (e.g. as XML files or plain text)
    * *projects/your_project_name/corpus* for your prepared corpus (will be filled by this notebook)
    * *projects/your_project_name/results* for the results of the topic modeling process (will be filled by [create_lda.ipynb](create_lda.ipynb))
    
1. Put your raw corpus files into the corresponding folder.
2. [Create corpus from XML files](#prepare_xml) or [create corpus from plain text files](#prepare_txt). This notebook demonstrates how to load texts, extract metadata and filter out unwanted (German) parts of speech (POS), e.g. so that only nouns remain. The result is saved as a JSON file that the subsequent notebooks use.
3. [Create LDA](create_lda.ipynb) (different notebook)
4. [Explore corpus](corpus.ipynb) (different notebook)
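
The directory layout from step 0 can also be created programmatically. The following is a minimal sketch, assuming the notebook is run from the repository root; `create_project_dirs` is a hypothetical helper and not part of this repository.

```python
import os

def create_project_dirs(project_name, base_dir="projects"):
    """Create raw/, corpus/ and results/ below base_dir/project_name.

    Hypothetical helper for illustration; returns the created paths.
    """
    paths = {}
    for sub in ("raw", "corpus", "results"):
        path = os.path.join(base_dir, project_name, sub)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths[sub] = path
    return paths
```

Calling `create_project_dirs("zeit")` would create the three folders for the example project used below.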

In [None]:
import os
from tqdm import tqdm
import sys
import json
import re

In [None]:
# set the name of your project (the folder name below projects/)
your_project_name = "zeit"

In [None]:
# set global paths for corpus etc.
raw_path = os.path.join("projects", your_project_name, "raw")
corpus_path = os.path.join("projects", your_project_name, "corpus")
result_path = os.path.join("projects", your_project_name, "results")
model_name = "model"
topics_name = "topics"

## Prepare corpus from XML<a id='prepare_xml'></a>

This cell demonstrates how to load a German TEI XML file, extract metadata and texts, and filter out unwanted POS.

[Back to top](#top)
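
To illustrate the metadata extraction on its own, here is a small sketch that parses an in-memory document instead of a file. The sample XML is hypothetical: it only mirrors the attributes this notebook reads (`title`, `year`, `author` on the `text` element), and real TEI files are likely to be structured differently. The `{*}` namespace wildcard requires Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; real TEI files are far richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text title="Beispiel" year="1999" author="N.N.">
    <p>Der Hund jagt die Katze durch den Garten.</p>
    <p>Zu kurz.</p>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)
entries = []
# "{*}" matches the element name in any namespace
for text_node in root.findall(".//{*}text"):
    # keep only child elements whose text has more than three words
    paragraphs = [child.text for child in text_node
                  if child.text is not None and len(child.text.split()) > 3]
    entries.append({
        "title": text_node.get("title"),
        "date": text_node.get("year"),
        "author": text_node.get("author"),
        "paragraphs": paragraphs,
    })

print(entries)
```

The second paragraph is dropped because it has only two words, which matches the length check used in the cell below.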

In [None]:
# keep_only = "ADJ"
keep_only = "NOUN"

import xml.etree.ElementTree as ET
import spacy
!{sys.executable} -m spacy download de_core_news_sm
nlp = spacy.load('de_core_news_sm')

for xml_file in tqdm(sorted(os.listdir(raw_path))):
    # skip anything that is not an XML file
    if not xml_file.endswith(".xml"):
        continue
    print(xml_file)
    output_json = []
    # get TEI xml data
    tree = ET.parse(os.path.join(raw_path, xml_file))
    root = tree.getroot()
    # "{*}" matches any namespace (requires Python 3.8+)
    for text_node in root.findall(".//{*}text"):
        entry = {}
        entry["title"] = text_node.get("title")
        entry["url"] = xml_file
        entry["date"] = text_node.get("year")
        entry["author"] = text_node.get("author")
        entry["comment_count"] = 0
        entry["text"] = []
        for txt in text_node:
            # POS filtering; passages of three words or fewer are skipped
            if txt.text is not None and len(txt.text.split()) > 3:
                doc = nlp(txt.text)
                for w in doc:
                    if w.pos_ == keep_only:
                        entry["text"].append(w.orth_)
        output_json.append(entry)

    # write one JSON file per XML file
    with open(os.path.join(corpus_path, os.path.splitext(xml_file)[0] + ".json"), 'w') as outfile:
        json.dump(output_json, outfile)
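
The POS filter itself is a simple attribute comparison. The sketch below isolates it so that it can be tried without loading a spaCy model. `filter_by_pos` is a hypothetical helper, and the stand-in tokens carry hand-assigned tags rather than real tagger output; a spaCy `Doc` yields tokens that expose the same `pos_` and `orth_` attributes.

```python
from types import SimpleNamespace

def filter_by_pos(tokens, keep_only="NOUN"):
    """Return the surface forms of all tokens whose coarse POS tag matches."""
    return [tok.orth_ for tok in tokens if tok.pos_ == keep_only]

# Stand-in tokens with hand-assigned tags (not produced by a tagger)
tokens = [
    SimpleNamespace(orth_="Der", pos_="DET"),
    SimpleNamespace(orth_="Hund", pos_="NOUN"),
    SimpleNamespace(orth_="jagt", pos_="VERB"),
    SimpleNamespace(orth_="die", pos_="DET"),
    SimpleNamespace(orth_="Katze", pos_="NOUN"),
]

nouns = filter_by_pos(tokens)
print(nouns)
```

With a loaded model, the same function could be called as `filter_by_pos(nlp(text), keep_only)`.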

## Prepare corpus from txt<a id='prepare_txt'></a>

This cell demonstrates how to load German plain text files, extract metadata from the file names, and filter out unwanted POS. It expects one subfolder per collection inside *raw*, with files named `<date>-<title>.txt`.

[Back to top](#top)
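
The file name parsing can be sketched in isolation. The naming scheme `<date>-<title>.txt` is an assumption inferred from how the cell below splits file names, and `parse_file_name` is a hypothetical helper.

```python
import os

def parse_file_name(txt_file):
    """Split '<date>-<title>.txt' into a (date, title) pair.

    The date gets a midnight timestamp appended, as in the cell below.
    """
    stem, _ext = os.path.splitext(txt_file)    # drop the ".txt" extension
    date, _sep, title = stem.partition("-")    # split at the first hyphen only
    return date + " 00:00:00", title

print(parse_file_name("1999-Beispiel.txt"))
```

Unlike a plain `split("-")[1]`, `partition` keeps hyphens that occur inside the title.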

In [None]:
# keep_only = "ADJ"
keep_only = "NOUN"

import spacy
!{sys.executable} -m spacy download de_core_news_sm
nlp = spacy.load('de_core_news_sm')

# increase max length for texts
nlp.max_length = 2000000

for folder in tqdm(sorted(os.listdir(raw_path))):
    # only subfolders of raw_path contain text files
    if not os.path.isdir(os.path.join(raw_path, folder)):
        continue
    output_json = []
    for txt_file in sorted(os.listdir(os.path.join(raw_path, folder))):
        # file names are expected to look like "<date>-<title>.txt";
        # split at the first hyphen only so that hyphens in the title survive
        entry = {}
        entry["title"] = re.sub(r"\.txt$", "", txt_file.split("-", 1)[1])
        entry["url"] = os.path.join(folder, txt_file)
        entry["date"] = re.sub(r"\.txt$", "", txt_file.split("-", 1)[0]) + " 00:00:00"
        entry["author"] = ""
        entry["comment_count"] = 0
        entry["text"] = []
        with open(os.path.join(raw_path, folder, txt_file), "r", encoding="utf-8") as f:
            text = f.read()

        # POS filtering; texts of three words or fewer are skipped
        if len(text.split()) > 3:
            doc = nlp(text)
            for w in doc:
                if w.pos_ == keep_only:
                    entry["text"].append(w.orth_)
        output_json.append(entry)

    # write one JSON file per subfolder
    with open(os.path.join(corpus_path, folder + ".json"), 'w') as outfile:
        json.dump(output_json, outfile)
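
To check the result, a written file can be loaded again and inspected. The sketch below round-trips a single hand-made entry through a temporary file. The entry is hypothetical and only mirrors the keys used in this notebook; `ensure_ascii=False` is an optional choice that keeps umlauts readable in the file and is not what the cells above use.

```python
import json
import os
import tempfile

# Hypothetical entry in the shape produced by the cells in this notebook
entries = [{
    "title": "Beispiel",
    "url": os.path.join("folder", "1999-Beispiel.txt"),
    "date": "1999 00:00:00",
    "author": "",
    "comment_count": 0,
    "text": ["Hund", "Katze", "Gärten"],
}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "folder.json")
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(entries, outfile, ensure_ascii=False)
    with open(path, encoding="utf-8") as infile:
        loaded = json.load(infile)

# number of kept tokens per document
token_counts = {entry["title"]: len(entry["text"]) for entry in loaded}
print(token_counts)
```

Documents with a token count of zero indicate texts that were too short or contained no token with the selected POS tag.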