# Run Postprocess

(author: uhk, based on run_prepare.py by cs, created on: October 2017)

This notebook calls functions from TMW's postprocess module.

General advice on how to use the notebook:

* Run the code blocks one after the other. The blocks "Imports" and "General settings" have to be run first.
* Each visualization block, e.g. plot words-in-topics-treemap, can be run separately.
* Some settings have to be adapted to your setup, e.g. the working directory, the number of topics and other parameters. The comments say "set ..." wherever this is the case. No other changes to the code are needed.

## Imports

In [1]:
from os.path import join
from os.path import abspath
import importlib
import sys

In [23]:
"""
set the path to the directory where you stored TMW
"""
tmw_path = "/home/ulrike/Dokumente/GS/Veranstaltungen/2017_QuestioningModels_TM/scripts/tmw"

In [24]:
sys.path.append(abspath(tmw_path))
"""
this shows your path settings
"""
print(sys.path)

['', '/usr/lib/python35.zip', '/usr/lib/python3.5', '/usr/lib/python3.5/plat-x86_64-linux-gnu', '/usr/lib/python3.5/lib-dynload', '/home/ulrike/.local/lib/python3.5/site-packages', '/usr/local/lib/python3.5/dist-packages', '/usr/local/lib/python3.5/dist-packages/wikiextractor-2.69-py3.5.egg', '/usr/lib/python3/dist-packages', '/home/ulrike/.local/lib/python3.5/site-packages/IPython/extensions', '/home/ulrike/.ipython', '/home/ulrike/Git/tmw/scripts_verona', '/home/ulrike/Dokumente/GS/Veranstaltungen/2017_QuestioningModels_TM/scripts/tmw']
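Re-running the path cell appends the same directory to `sys.path` again each time. The following is a minimal sketch of a guard against such duplicates; `add_to_sys_path` is a hypothetical helper and not part of TMW, and the path in the usage line is a placeholder.

```python
import sys
from os.path import abspath


def add_to_sys_path(path):
    """Append the absolute form of `path` to sys.path unless it is already there."""
    full_path = abspath(path)
    if full_path not in sys.path:
        sys.path.append(full_path)
    return full_path


# Placeholder path; replace it with your own tmw_path.
added = add_to_sys_path("scripts/tmw")
```

Calling the helper a second time with the same path leaves `sys.path` unchanged.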


In [25]:
import postprocess

importlib.reload(postprocess)

<module 'postprocess' from '/home/ulrike/Dokumente/GS/Veranstaltungen/2017_QuestioningModels_TM/scripts/tmw/postprocess.py'>

## General settings

In [5]:
"""
set the working directory
- a folder which contains the metadata file, the corpus folder, etc.
- the topic modeling output will be stored in new subfolders inside the working directory
"""
wdir = "/home/ulrike/Dokumente/GS/Veranstaltungen/2017_QuestioningModels_TM/data_MALLET"

"""
set the corpus folder name
"""
corpus_folder = "segs"

"""
set the metadata file name (this should end with .csv)
"""
metadata_file = "english-novels-N/metadata-english-novels.csv"

"""
set the MALLET version which was used
"""
version = "208+"

"""
set the model folder name (the folder in which the MALLET model was stored)
"""
model_folder = "model"

"""
set the file name of the MALLET --output-doc-topics option, which is usually "topics-in-texts"
(without the ending .txt)
"""
topics_in_texts = "topics-in-texts"

"""
set the file name of the MALLET --output-topic-keys option, which is usually "topics-with-words"
(without the ending .txt)
"""
topics_with_words = "topics-with-words"

"""
set the aggregate folder name
"""
aggregates_folder = "aggregates"

In [6]:
"""
set the parameters as they were used in the topic modeling
- these settings will be used for output file names
"""
NumTopics = 20
NumIterations = 5000
OptimizeIntervals = 100
TopTopics = 30

In [7]:
param_settings = str(NumTopics) + "tp-" + str(NumIterations) + "it-" + str(OptimizeIntervals) + "in-" + str(TopTopics) + "tt"
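The concatenation above can also be written with `str.format`, which may be easier to read and also works on the Python 3.5 installation shown in the path output. This is only an equivalent sketch with the same values as in the settings cell, not a required change.

```python
# Same parameter values as in the settings cell
NumTopics = 20
NumIterations = 5000
OptimizeIntervals = 100
TopTopics = 30

# Build the label that is later used as the name of the aggregates subfolder
param_settings = "{}tp-{}it-{}in-{}tt".format(
    NumTopics, NumIterations, OptimizeIntervals, TopTopics
)
print(param_settings)  # 20tp-5000it-100in-30tt
```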

In [8]:
metadata_dir = join(wdir, metadata_file)
aggregates_dir = join(wdir, aggregates_folder, param_settings)
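Before running the postprocessing functions, it can help to confirm that the input files named in the settings exist. The sketch below assumes the folder layout described in the comments above; `missing_inputs` is a hypothetical helper, not a TMW function, and the working directory in the usage part is a placeholder.

```python
from os.path import exists, join


def missing_inputs(wdir, relative_paths):
    """Return the joined paths that do not exist on disk."""
    candidates = [join(wdir, rel) for rel in relative_paths]
    return [path for path in candidates if not exists(path)]


# File names taken from the settings cells; the working directory is a placeholder.
required = [
    "english-novels-N/metadata-english-novels.csv",
    join("model", "topics-in-texts.txt"),
    join("model", "topics-with-words.txt"),
]
for path in missing_inputs("/path/to/working/directory", required):
    print("missing:", path)
```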

## create_mastermatrix

In [9]:
"""
This part joins all the information from the topic model in one matrix, the "mastermatrix".
"""

"""
set the file name for the mastermatrix (this should end with .csv)
"""
mastermatrix_file_name = "mastermatrix.csv"

In [10]:
corpus_path = join(wdir, corpus_folder, "*.txt")
useBins = False
topics_in_texts_path = join(wdir, model_folder, topics_in_texts + ".txt")
binDataFile = ""

In [12]:
postprocess.create_mastermatrix(corpus_path, aggregates_dir, mastermatrix_file_name, metadata_dir, topics_in_texts_path, NumTopics, useBins, binDataFile, version)


Launched create_mastermatrix.
- getting data...
- getting metadata...
- getting docmatrix...
- getting topicscores...
- merging data...
Done. Saved mastermatrix. Segments and columns: (1190, 27)
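To check the saved file independently of TMW, its rows and columns can be counted with the standard library. This sketch assumes that the mastermatrix is a comma-separated file with one header line; `csv_shape` is a hypothetical helper. The counts may differ slightly from the shape printed above, for example if an index column was written to the file.

```python
import csv


def csv_shape(path, delimiter=","):
    """Return (data_rows, columns) of a delimited file with one header line."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)


# Possible use in this notebook (assumes the cells above have been run):
# print(csv_shape(join(aggregates_dir, mastermatrix_file_name)))
```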


## calculate_averageTopicScores

In [13]:
"""
Based on the mastermatrix, average topic scores are calculated.
"""

"""
set the metadata categories for which average topic scores should be calculated
note: "idno" should always be there
"""
targets = ["idno", "author-name", "author-gender", "title", "publication-decade"]

In [14]:
mastermatrixfile = join(aggregates_dir, mastermatrix_file_name)
averages_outfolder = aggregates_dir

In [15]:
postprocess.calculate_averageTopicScores(mastermatrixfile, targets, averages_outfolder)


Launched calculate_averageTopicScores.
  Saved average topic scores for: idno
  Saved average topic scores for: author-name
  Saved average topic scores for: author-gender
  Saved average topic scores for: title
  Saved average topic scores for: publication-decade
Done.
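Conceptually, an average topic score for a metadata category is the mean of each topic column over all segments that share the same category value. The sketch below illustrates this idea in plain Python. It is not TMW's implementation, and the column names `tp0` and `tp1` as well as the values are invented for illustration.

```python
from collections import defaultdict


def average_scores(rows, target, topic_columns):
    """Average each topic column over all rows sharing the same `target` value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[target]].append(row)
    return {
        value: {
            col: sum(member[col] for member in members) / len(members)
            for col in topic_columns
        }
        for value, members in groups.items()
    }


# Invented toy segments; real rows would come from the mastermatrix.
segments = [
    {"idno": "A1", "author-gender": "female", "tp0": 0.2, "tp1": 0.8},
    {"idno": "A1", "author-gender": "female", "tp0": 0.4, "tp1": 0.6},
    {"idno": "B2", "author-gender": "male", "tp0": 0.5, "tp1": 0.5},
]
averages = average_scores(segments, "idno", ["tp0", "tp1"])
```

Here the two segments of `A1` are averaged into one row per topic, while `B2` keeps the scores of its single segment.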


## save_firstWords

In [16]:
"""
this function saves the top three words for each topic (so that they can be used in visualizations later)
"""
topicWordFile = join(wdir, model_folder, topics_with_words + ".txt")
firstWords_out = aggregates_dir
filename = "firstWords.csv"

In [17]:
postprocess.save_firstWords(topicWordFile, firstWords_out, filename)

Launched save_someFirstWords.
Done.
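MALLET's topic-keys file contains one tab-separated line per topic: the topic number, a weight, and the space-separated top words. The sketch below shows how the first three words of each topic could be extracted from such lines. It is an illustration and not TMW's code; the sample lines are invented, and joining the words with "-" is an assumption.

```python
def first_words(topic_keys_lines, n_words=3):
    """Map each topic id to its first `n_words` words, joined by '-'."""
    result = {}
    for line in topic_keys_lines:
        line = line.strip()
        if not line:
            continue
        # Columns: topic id, weight, space-separated words
        topic_id, _weight, words = line.split("\t", 2)
        result[int(topic_id)] = "-".join(words.split()[:n_words])
    return result


# Invented sample lines in the topic-keys layout
sample = [
    "0\t0.05\tsea ship captain boat water",
    "1\t0.12\thouse room door window night",
]
labels = first_words(sample)
```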


## save_topicRanks

In [18]:
"""
Save a list of the topics with their rank by topic score.
"""
topicWordFile = join(wdir, model_folder, topics_with_words + ".txt")
topicRanks_out = aggregates_dir
filename = "topicRanks.csv"

In [19]:
postprocess.save_topicRanks(topicWordFile, topicRanks_out, filename)

Launched save_topicRanks.
Done.
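One plausible reading of "rank by topic score" is to sort the topics by the weight in the second column of the topic-keys file. The sketch below follows that assumption; whether TMW computes the ranks in exactly this way is not confirmed here. `topic_ranks` is a hypothetical helper and the sample lines are invented.

```python
def topic_ranks(topic_keys_lines):
    """Return {topic_id: rank}, where rank 1 is the topic with the highest weight."""
    weights = {}
    for line in topic_keys_lines:
        if line.strip():
            topic_id, weight = line.split("\t")[:2]
            weights[int(topic_id)] = float(weight)
    ordered = sorted(weights, key=weights.get, reverse=True)
    return {topic_id: rank for rank, topic_id in enumerate(ordered, start=1)}


# Invented sample lines in the topic-keys layout
sample = [
    "0\t0.05\tsea ship captain",
    "1\t0.12\thouse room door",
    "2\t0.08\tlove heart letter",
]
ranks = topic_ranks(sample)
```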


## calculate_complexAverageTopicScores

In [20]:
"""
Calculate the average topic scores for two criteria (e.g. "title" and "author-gender")
"""

"""
set the metadata categories to combine
"""
targets = ["title", "author-gender"]

In [21]:
mastermatrixfile = join(aggregates_dir, mastermatrix_file_name)
complexAverage_out = aggregates_dir

In [22]:
postprocess.calculate_complexAverageTopicScores(mastermatrixfile, targets, complexAverage_out)


Launched calculate_complexAverageTopicScores.
Done. Saved average topic scores for: title+author-gender
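Averaging over two criteria means grouping the segments by each combination of the two category values. The sketch below shows this for a single topic column. It is an illustration under invented toy data, not TMW's implementation; `average_by_two` and the column name `tp0` are hypothetical.

```python
from collections import defaultdict


def average_by_two(rows, targets, topic_column):
    """Average one topic column per combination of the two `targets` values."""
    first, second = targets
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = (row[first], row[second])
        sums[key] += row[topic_column]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Invented toy segments
rows = [
    {"title": "Novel A", "author-gender": "female", "tp0": 0.1},
    {"title": "Novel A", "author-gender": "female", "tp0": 0.3},
    {"title": "Novel B", "author-gender": "male", "tp0": 0.6},
]
combined = average_by_two(rows, ["title", "author-gender"], "tp0")
```

Each key of `combined` is a pair such as `("Novel A", "female")`, which corresponds to the "title+author-gender" combination reported in the output above.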
