# Chinese Topic Modeling — Harry Potter Fanfiction

In these lessons, we're learning about a text analysis method called *topic modeling*. This method will help us identify the main topics or discourses within a collection of texts, or within a single text that has been separated into smaller text chunks.

---

## Dataset

### Harry Potter Fanfiction (Chinese)

In this particular lesson, we're going to use [Little MALLET Wrapper](https://github.com/maria-antoniak/little-mallet-wrapper), a Python wrapper for [MALLET](http://mallet.cs.umass.edu/topics.php), to topic model a CSV file with fanfiction stories from the website [Archive of Our Own (AO3)](https://archiveofourown.org/).

___

<div class="attention" name="html-admonition" style="background: lightyellow;padding: 10px">  
<p class="title">Attention</p>  
    
<p>If you're working in this Jupyter notebook on your own computer, you'll need to have both the Java Development Kit and MALLET pre-installed. For setup instructions, please see <a href="http://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/Topic-Modeling-Set-Up.html">the previous lesson</a>.  </p>
    
If you're working in this Jupyter notebook in the cloud via Binder/JupyterHub, then the Java Development Kit and MALLET will already be installed. You're good to go!  
     
</div>  

## Set MALLET Path

Since Little MALLET Wrapper is a Python package built around MALLET, we first need to tell it where the bigger, Java-based MALLET lives.

We're going to make a variable called `path_to_mallet` and assign it the file path of our MALLET program. We need to point it, specifically, to the "mallet" file inside the "bin" folder inside the "mallet-2.0.8" folder. 

In [None]:
path_to_mallet = '../../mallet-2.0.8/bin/mallet'

If MALLET is located in another directory, then set your `path_to_mallet` to that file path.
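Before going further, it can help to confirm that the path really points at a file. The following is a minimal sketch, not part of Little MALLET Wrapper; the helper name `mallet_path_looks_valid` is hypothetical, and the path is simply the one used in this lesson.

```python
from pathlib import Path

def mallet_path_looks_valid(path_to_mallet):
    """Return True if the given path points to an existing file."""
    return Path(path_to_mallet).is_file()

# Same relative path as in this lesson; change it if MALLET lives elsewhere
path_to_mallet = '../../mallet-2.0.8/bin/mallet'

if not mallet_path_looks_valid(path_to_mallet):
    print(f"No MALLET file found at {path_to_mallet}; adjust path_to_mallet.")
```

This only checks that the file exists. It does not check that Java is installed or that MALLET runs.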

## Install Packages

In [None]:
#!pip install little_mallet_wrapper
#!pip install seaborn
#To install the most updated version of little_mallet_wrapper:
#!pip install git+https://github.com/maria-antoniak/little-mallet-wrapper.git

In [None]:
!pip install scipy

## Import Packages

In [None]:
import little_mallet_wrapper
import seaborn
import pandas as pd
import random
from pathlib import Path
pd.options.display.max_colwidth = 100

## Get Training Data From CSV File

In [None]:
hp_df_chinese = pd.read_csv("harry_potter_chinese_stories_first_chaps.csv")

In [None]:
hp_df_chinese

Drop rows with no story text

In [None]:
hp_df_chinese = hp_df_chinese.dropna(subset=['Text'])

## Text Pre-Processing

To topic model Chinese texts, we're going to segment the texts and then remove Chinese stop words. 

In [None]:
!python -m spacy download zh_core_web_sm

In [None]:
import zh_core_web_sm
import spacy
nlp = spacy.load("zh_core_web_sm")

Run the spacy nlp model on each text

In [None]:
nlp_documents = list(nlp.pipe(hp_df_chinese[:200]['Text'].to_list()))

Pull out the tokens (segment the text)

In [None]:
all_tokens = []
for doc in nlp_documents:
    doc_tokens = []
    for token in doc:
        token_text = token.text
        doc_tokens.append(token_text)
    doc_tokens = " ".join(doc_tokens)    
    all_tokens.append(doc_tokens)
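Written Chinese does not put spaces between words, so joining the segmented tokens with spaces makes the word boundaries explicit for the later steps. A minimal sketch of that idea, using hand-segmented token lists as a stand-in for spaCy's output (the example sentences are made up for illustration):

```python
# Hand-segmented tokens standing in for spaCy output (illustrative only)
segmented_docs = [
    ["哈利", "拿", "起", "魔杖"],
    ["赫敏", "在", "图书馆"],
]

# Join each document's tokens with single spaces, as the loop above does
joined_docs = [" ".join(tokens) for tokens in segmented_docs]

for doc in joined_docs:
    print(doc)
```

Each string in `joined_docs` can now be split back into words with a plain `.split()`.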

Load Chinese stop words so we can remove them. We're using the list from [stopwords-iso/stopwords-zh](https://github.com/stopwords-iso/stopwords-zh). It's important to closely inspect the stop words that you choose!

In [None]:
chinese_stop_words = Path("chinese_stop_words.txt").read_text(encoding="utf-8").split()
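To see what stop word removal does to a segmented text, here is a minimal sketch in plain Python. The three-word stop list and the sentence are hypothetical; the real list is much longer, and the actual removal in this lesson is handled by `little_mallet_wrapper.process_string()`.

```python
# A tiny, hypothetical stop word list for illustration
stop_words = {"的", "了", "在"}

# A space-separated (already segmented) text
text = "哈利 的 魔杖 在 桌子 上"

# Keep only the tokens that are not in the stop list
kept = [token for token in text.split() if token not in stop_words]
filtered = " ".join(kept)

print(filtered)
```

Using a `set` for the stop list keeps each membership check fast, which matters when the list has hundreds of entries.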

`little_mallet_wrapper.process_string(text, numbers='remove')`

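Next we're going to process our texts with the function `little_mallet_wrapper.process_string()`. This function takes each story, transforms the text to lowercase, and removes stop words and numbers; we then add the processed text to our master list `training_data`. Here we pass `remove_punctuation=False` and `remove_short_words=False` (many Chinese words are very short, so we don't want short words removed), supply our Chinese stop words through `stop_words_extra`, and strip punctuation separately in the next step.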

In [None]:
training_data = [little_mallet_wrapper.process_string(text, numbers='remove', remove_punctuation=False, remove_short_words=False, stop_words_extra=chinese_stop_words) for text in all_tokens]

Strip punctuation from Chinese characters (strip anything that's not a word)

In [None]:
import re
training_data = [re.sub(r'\W+', ' ', text) for text in training_data]
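In Python 3, `\W` is Unicode-aware for string patterns, so Chinese characters count as word characters while full-width punctuation such as `，` and `。` does not. A small sketch with a made-up sentence:

```python
import re

# Segmented text that still contains full-width Chinese punctuation
text = "哈利 ， 你 好 。"

# Replace every run of non-word characters (spaces and punctuation) with one space
cleaned = re.sub(r'\W+', ' ', text).strip()

print(cleaned)
```

The `.strip()` here is an addition for the example; it removes the trailing space left where the final `。` used to be.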

Keep original fanfiction stories so we can examine them later

In [None]:
original_texts = [text for text in hp_df_chinese['Text'][:200]]

Keep fanfiction story titles so we can examine them later

In [None]:
hp_chinese_titles = [title for title in hp_df_chinese['Title'][:200]]

### Get Dataset Statistics

We can get training data summary statistics by using the function `little_mallet_wrapper.print_dataset_stats()`.

In [None]:
little_mallet_wrapper.print_dataset_stats(training_data)
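For intuition, here is a minimal sketch of the kind of summary statistics one might compute by hand for a space-separated corpus: document count, mean document length, and vocabulary size. Whether these match the exact output of `print_dataset_stats()` is an assumption, and the three documents are invented for illustration.

```python
# A toy corpus of processed, space-separated documents (illustrative only)
docs = ["哈利 魔杖 桌子", "赫敏 图书馆", "哈利 赫敏"]

token_lists = [doc.split() for doc in docs]

num_docs = len(token_lists)
mean_len = sum(len(tokens) for tokens in token_lists) / num_docs
vocab = {token for tokens in token_lists for token in tokens}

print(f"Number of documents: {num_docs}")
print(f"Mean words per document: {mean_len:.1f}")
print(f"Vocabulary size: {len(vocab)}")
```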

## Training the Topic Model

We need to make a variable `num_topics` and assign it the number of topics we want returned. Then we're going to set a file path where we want all our MALLET topic modeling data to be dumped.

In [None]:
# your choice of topics
num_topics = 50

path_to_mallet = '../../mallet-2.0.8/bin/mallet'

#Set output directory
output_directory_path = 'topic-model-output/hp-fanfiction/chinese-firstchaps'

#Create output directory
Path(f"{output_directory_path}").mkdir(parents=True, exist_ok=True)

#Create output files
path_to_training_data           = f"{output_directory_path}/training.txt"
path_to_formatted_training_data = f"{output_directory_path}/mallet.training"
path_to_model                   = f"{output_directory_path}/mallet.model.{str(num_topics)}"
path_to_topic_keys              = f"{output_directory_path}/mallet.topic_keys.{str(num_topics)}"
path_to_topic_distributions     = f"{output_directory_path}/mallet.topic_distributions.{str(num_topics)}"

### Train Topic Model

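Then we're going to train our topic model with `little_mallet_wrapper.quick_train_topic_model()`. The call is commented out in the cell below; uncomment it the first time you run the notebook, since the later cells read the output files that training produces.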

In [None]:
#little_mallet_wrapper.quick_train_topic_model(path_to_mallet,
#                                             output_directory_path,
#                                             num_topics,
#                                             training_data)

When the topic model finishes, it will output your results to your `output_directory_path`.

## Display Topics and Top Words

To examine the topics that the topic model extracted from the fanfiction stories, run the cell below. This code uses the `little_mallet_wrapper.load_topic_keys()` function to read and process the MALLET topic model output.

In [None]:
topics = little_mallet_wrapper.load_topic_keys(path_to_topic_keys)

for topic_number, topic in enumerate(topics):
    
    print(f"✨Topic {topic_number}✨\n\n{topic}\n")

## Load Topic Distributions

MALLET also calculates the likely mixture of these topics for every single document in the corpus. This mixture is really a probability distribution, that is, the probability that each topic exists in the document. We can use these probability distributions to examine which of the above topics are strongly associated with which specific documents.

To get the topic distributions, we're going to use the `little_mallet_wrapper.load_topic_distributions()` function, which will read and process the MALLET topic model output.

In [None]:
topic_distributions = little_mallet_wrapper.load_topic_distributions(path_to_topic_distributions)

In [None]:
topic_distributions[0]
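Each entry in `topic_distributions` is a list of probabilities, one per topic, for a single document. A minimal sketch of how to read one, using a hypothetical four-topic distribution rather than real model output:

```python
# Hypothetical distribution over 4 topics for one document (not real output)
distribution = [0.05, 0.70, 0.15, 0.10]

# The probabilities for one document should sum to (approximately) 1
total = sum(distribution)

# Index of the topic with the highest probability in this document
dominant_topic = max(range(len(distribution)), key=lambda i: distribution[i])

print(f"Sum of probabilities: {total:.2f}")
print(f"Dominant topic: {dominant_topic} ({distribution[dominant_topic]:.2f})")
```

With a real model, the same `max(...)` expression applied to `topic_distributions[0]` would give the most prominent topic in the first story.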

## Display Top Documents Per Topic

In [None]:
from IPython.display import Markdown, display
import re

def make_md(string):
    """A function that transforms string data into Markdown
    so it can be nicely formatted with bolding and emojis
    """
    display(Markdown(str(string)))

def get_top_docs(docs, topic_distributions, topic_index=1, n=5, doc_length = 2000):
    
    """A function that shows the top documents for a given set of topic distributions
    and a specific topic number
    """
    
    sorted_data = sorted([(_distribution[topic_index], _document) for _distribution, _document in zip(topic_distributions, docs)], reverse=True)
    topic_words = topics[topic_index]
    make_md(f"### ✨Topic {topic_index}✨\n\n{topic_words}\n\n---")
    
    for probability, doc in sorted_data[:n]:
        # Make topic words bolded
        for word in topic_words:
            if word in doc.lower():
                doc = re.sub(f"\\b{re.escape(word)}\\b", f"**{word}**", doc, flags=re.IGNORECASE)
        make_md(f'✨  \n**Topic Probability**: {probability}  \n**Document**: {doc[:doc_length]}\n\n')
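One thing to keep in mind when bolding topic words in Chinese: `\b` marks a boundary between a word character and a non-word character, and Chinese characters are all word characters. In unsegmented text, a word surrounded by other Chinese characters therefore has no `\b` on either side. The sketch below illustrates this with a made-up sentence; the plain `str.replace` at the end is one possible alternative, not what the function above does.

```python
import re

word = "魔杖"
pattern = rf"\b{re.escape(word)}\b"

unsegmented = "哈利拿起魔杖走了"    # original text, no spaces
segmented = "哈利 拿起 魔杖 走 了"  # space-separated tokens

# No match: the word is surrounded by other word characters
match_unsegmented = re.search(pattern, unsegmented)

# Match: spaces create word boundaries around the word
match_segmented = re.search(pattern, segmented)

# A simple substring replacement works regardless of boundaries
bolded_plain = unsegmented.replace(word, f"**{word}**")

print(match_unsegmented, match_segmented is not None, bolded_plain)
```

Substring replacement has its own trade-off: it will also bold the word when it appears inside a longer word.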

# Voldemort

In [None]:
get_top_docs(original_texts, topic_distributions, topic_index=4, n=5)

# Sex?

In [None]:
get_top_docs(original_texts, topic_distributions, topic_index=14, n=5)

# Old Wizards?

In [None]:
get_top_docs(original_texts, topic_distributions, topic_index=22, n=5)