# Run Visualize

(author: uhk, based on run_visualize.py by cs, created on: October 2017)

This notebook is for calling functions from TMW's visualize module.

General advice on how to use the notebook:

* Run the code blocks in order. The "Imports" and "General settings" blocks must be run first.
* After that, each visualization block (e.g. "plot words-in-topics treemap") can be run on its own.
* Some settings need to be adapted to your setup, e.g. the working directory, the number of topics, and other parameters. The comments say "set ..." wherever this is the case. No other changes to the code are needed.

## Imports

In [1]:
from os.path import abspath, join
import importlib
import sys

In [2]:
"""
set the path to the directory where you stored TMW
"""
tmw_path = "/home/ulrike/Git/tmw"

In [5]:
sys.path.append(abspath(join(tmw_path, "scripts_verona")))
"""
this shows your path settings
"""
print(sys.path)

['', '/usr/lib/python35.zip', '/usr/lib/python3.5', '/usr/lib/python3.5/plat-x86_64-linux-gnu', '/usr/lib/python3.5/lib-dynload', '/home/ulrike/.local/lib/python3.5/site-packages', '/usr/local/lib/python3.5/dist-packages', '/usr/local/lib/python3.5/dist-packages/wikiextractor-2.69-py3.5.egg', '/usr/lib/python3/dist-packages', '/home/ulrike/.local/lib/python3.5/site-packages/IPython/extensions', '/home/ulrike/.ipython', '/home/ulrike/Git/tmw/scripts', '/home/ulrike/Git/tmw/scripts_verona']
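If `tmw_path` is mistyped, the `sys.path.append` call above still succeeds and the error only surfaces later, when the import fails. A minimal sketch of a more defensive variant follows. The helper name `add_to_sys_path` is hypothetical and not part of TMW; it checks that the directory exists and avoids adding the same entry twice when the cell is re-run.

```python
import os
import sys


def add_to_sys_path(directory):
    """Append an existing directory to sys.path once (hypothetical helper)."""
    directory = os.path.abspath(directory)
    # Fail early with a clear message instead of a later ImportError.
    if not os.path.isdir(directory):
        raise FileNotFoundError("Directory not found: " + directory)
    # Re-running the notebook cell should not create duplicate entries.
    if directory not in sys.path:
        sys.path.append(directory)
    return directory
```

In this notebook it could be called as `add_to_sys_path(join(tmw_path, "scripts_verona"))` in place of the plain `sys.path.append(...)` line.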


In [66]:
import visualize_verona

importlib.reload(visualize_verona)

<module 'visualize_verona' from '/home/ulrike/Git/tmw/scripts_verona/visualize_verona.py'>

## General settings

In [34]:
"""
set the working directory
- a folder which contains the metadata file, the corpus folder, etc.
- the topic modeling output will be stored in new subfolders inside the working directory
"""
wdir = "/home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises_English"

"""
set the model folder name
"""
model_folder = "mallet_model"

"""
set the name of the file written by MALLET's --topic-word-weights-file option, which is usually "word-weights"
(without the .txt extension)
"""
word_weights = "word-weights"


"""
set the aggregate folder name
"""
aggregates_folder = "8_aggregates"

"""
set the output folder name for visualizations
"""
out_folder = "9_visuals"


In [16]:
out_dir = join(wdir, out_folder)
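Whether the functions in `visualize_verona` create missing output folders themselves is not shown in this notebook. As a precaution, the output directory can be created up front. The sketch below uses a hypothetical helper `ensure_dir`, which relies only on the standard library.

```python
import os
from os.path import join


def ensure_dir(path):
    """Create path and any missing parent folders (hypothetical helper)."""
    # exist_ok=True makes the call safe to repeat when the cell is re-run.
    os.makedirs(path, exist_ok=True)
    return path
```

For example, `ensure_dir(out_dir)` would create the visualization folder inside the working directory if it does not exist yet.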

In [17]:
"""
set the parameters as they were used in the topic modeling
- these settings will be used for output file names
"""
NumTopics = 30
NumIterations = 500
OptimizeIntervals = 50
TopTopics = 20

In [26]:
param_settings = str(NumTopics) + "tp-" + str(NumIterations) + "it-" + str(OptimizeIntervals) + "in-" + str(TopTopics) + "tt"
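The resulting string is used as a subfolder name for both the aggregates and the visualizations, so it has to match the name produced during topic modeling. The same string can be built with `str.format`, sketched here as a small function (the function name `build_param_settings` is an assumption made for illustration):

```python
def build_param_settings(num_topics, num_iterations, optimize_intervals, top_topics):
    """Return the parameter string used in folder names, e.g. '30tp-500it-50in-20tt'."""
    return "{}tp-{}it-{}in-{}tt".format(
        num_topics, num_iterations, optimize_intervals, top_topics
    )


# With the values set above this prints: 30tp-500it-50in-20tt
print(build_param_settings(30, 500, 50, 20))
```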

## make_wordle_from_mallet

In [35]:
"""
create a wordle for each topic from MALLET
"""

"""
set the number of words to include in each word cloud
"""
num_words = 20

"""
set the resolution for the images (in dots per inch)
"""
dpi = 300

In [36]:
word_weights_file = join(wdir, model_folder, word_weights + ".txt")

wordles_out = join(wdir, out_folder, param_settings, "wordles")
#font_path = join(wdir, "font", "AlegreyaSans-Regular.otf")

num_topics = NumTopics
TopicRanksFile = join(wdir, aggregates_folder, param_settings, "topicRanks.csv")
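To check the input before plotting, the word weights file can be inspected directly. The sketch below assumes the usual MALLET layout of one tab-separated `topic<TAB>word<TAB>weight` triple per line; it is not the code used inside `visualize_verona`, and the function name `top_words_per_topic` is hypothetical.

```python
from collections import defaultdict


def top_words_per_topic(word_weights_path, num_words):
    """Return {topic: [(word, weight), ...]} with the heaviest words first.

    Assumes tab-separated lines of the form: topic, word, weight.
    """
    weights = defaultdict(list)
    with open(word_weights_path, encoding="utf-8") as infile:
        for line in infile:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip blank or malformed lines
            topic, word, weight = parts
            weights[int(topic)].append((word, float(weight)))
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:num_words]
        for topic, pairs in weights.items()
    }
```

Calling `top_words_per_topic(word_weights_file, num_words)` would then show which words are candidates for each wordle.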


In [39]:
visualize_verona.make_wordle_from_mallet(word_weights_file, NumTopics, num_words, TopicRanksFile, wordles_out, dpi)


Launched make_wordle_from_mallet.
Done.


## plot words-in-topics treemap

In [19]:
"""
set the number of words to plot
"""
words_to_plot = 10

In [20]:
word_weights_file = join(wdir, model_folder, word_weights + ".txt")
wordsintopics_treemap_out = join(out_dir, param_settings, "wordsintopics_treemap")

visualize_verona.plot_words_in_topics_treemap(NumTopics, words_to_plot, word_weights_file, wordsintopics_treemap_out)


Launched plot_words_in_topics_treemap.
Done.


## plot topics-in-docs treemap

In [21]:
"""
set the number of topics to plot
"""
topics_to_plot = 10

In [25]:
doc_topic_file = join(wdir, aggregates_folder, param_settings, "avgtopicscores_by-idno.csv")
first_words_file = join(wdir, aggregates_folder, param_settings, "firstWords.csv")
topicsindocs_treemap_out = join(out_dir, param_settings, "topicsindocs_treemap")

visualize_verona.plot_topics_in_docs_treemap(topics_to_plot, doc_topic_file, first_words_file, topicsindocs_treemap_out)


Launched plot_topics_in_docs_treemap.
Done.


## plot_topTopics

In [53]:
"""
creates a bar chart of the top topics for each metadata category
"""

"""
set the metadata categories which should be used (currently does not work for "publication-year")
"""
target = ["idno", "author-name", "author-gender", "title"]

"""
set the number of top topics to be shown
"""
topTopicsShown = 10

"""
set the font size for the charts
"""
fontscale = 1.0

"""
set the resolution for the images (in dots per inch)
"""
dpi = 300

"""
set the mode: "normalized" or "absolute" topic scores
"""
mode = "normalized"

In [54]:
averageDatasets = join(wdir, aggregates_folder, param_settings, "avg*.csv") 
firstWordsFile = join(wdir, aggregates_folder, param_settings, "firstWords.csv")
height = 0 # 0=automatic and variable
topTopics_out = join(wdir, out_folder, param_settings, "topTopics")
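Note that `averageDatasets` is a wildcard pattern, not a single file. A quick way to see which aggregate files it matches (for instance `avgtopicscores_by-idno.csv`) is to expand it with the standard `glob` module. This is only a sketch for checking the inputs; how `visualize_verona` expands the pattern internally is not shown here, and the helper name is hypothetical.

```python
import glob


def list_average_datasets(pattern):
    """Return the sorted list of files matching a wildcard pattern."""
    return sorted(glob.glob(pattern))
```

An empty result from `list_average_datasets(averageDatasets)` usually means that `wdir`, `aggregates_folder`, or `param_settings` does not match the folder created during aggregation.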

In [55]:
visualize_verona.plot_topTopics(averageDatasets, firstWordsFile, NumTopics, target, mode, topTopicsShown, fontscale, height, dpi, topTopics_out)

Launched plot_topTopics.
 Getting targetItems for: author-name
(30, 33)
30
  Creating plot for: Barclay_Florence_L
(30, 33)
30
  Creating plot for: Bennet_Arnold
(30, 33)
30
  Creating plot for: Blackmore_R_D
(30, 33)
30
  Creating plot for: Braddon_Mary_Elizabeth
(30, 33)
30
  Creating plot for: Bronte_Charlotte
(30, 33)
30
  Creating plot for: Bulwer_Lytton_Edward
(30, 33)
30
  Creating plot for: Burnett_Frances_Hodgson
(30, 33)
30
  Creating plot for: Chesterton_G_K
(30, 33)
30
  Creating plot for: Collins_Wilkie
(30, 33)
30
  Creating plot for: Conrad_Joseph
(30, 33)
30
  Creating plot for: Corelli_Marie
(30, 33)
30
  Creating plot for: Dickens_Charles
(30, 33)
30
  Creating plot for: Doyle_Arthur_Conan
(30, 33)
30
  Creating plot for: Eliot_George
(30, 33)
30
  Creating plot for: Ford_Madox_Ford
(30, 33)
30
  Creating plot for: Forster_E_M
(30, 33)
30
  Creating plot for: Galsworthy_John
(30, 33)
30
  Creating plot for: Gaskell_Elizabeth
(30, 33)
30
  Creating plot for: Gissing_Ge

## plot_topItems

In [57]:
"""
creates a bar chart for each topic, with the top items (titles, authors, etc.) for that topic
"""

"""
set the metadata categories to plot
"""
target = ["idno", "author-name", "author-gender", "title", "publication-year"]

"""
set the number of top items to show
"""
topItemsShown = 20

"""
set the font size
"""
fontscale = 0.8

"""
set the resolution of the images (in dots per inch)
"""
dpi = 300

In [59]:
averageDatasets = join(wdir, aggregates_folder, param_settings, "avg*.csv")
topItems_out = join(wdir, out_folder, param_settings, "topItems")
firstWordsFile = join(wdir, aggregates_folder, param_settings, "firstWords.csv")
height = 0 # 0=automatic and flexible

In [61]:
visualize_verona.plot_topItems(averageDatasets, topItems_out, firstWordsFile, NumTopics, target, topItemsShown, fontscale, height, dpi)

Launched plot_topItems
 Plotting for: author-name
  Creating plot for topic: 0


  dataToPlot = dataToPlot.order(ascending=False)


  Creating plot for topic: 1
  Creating plot for topic: 2
  Creating plot for topic: 3
  Creating plot for topic: 4
  Creating plot for topic: 5
  Creating plot for topic: 6
  Creating plot for topic: 7
  Creating plot for topic: 8
  Creating plot for topic: 9
  Creating plot for topic: 10
  Creating plot for topic: 11
  Creating plot for topic: 12
  Creating plot for topic: 13
  Creating plot for topic: 14
  Creating plot for topic: 15
  Creating plot for topic: 16
  Creating plot for topic: 17
  Creating plot for topic: 18
  Creating plot for topic: 19
  Creating plot for topic: 20
  Creating plot for topic: 21
  Creating plot for topic: 22
  Creating plot for topic: 23
  Creating plot for topic: 24
  Creating plot for topic: 25
  Creating plot for topic: 26
  Creating plot for topic: 27
  Creating plot for topic: 28
  Creating plot for topic: 29
 Plotting for: author-gender
  Creating plot for topic: 0
  Creating plot for topic: 1
  Creating plot for topic: 2
  Creating plot for top

## plot_distinctiveness_heatmap

In [63]:
"""
for each category, make a heatmap of most distinctive topics
"""

"""
set the target metadata categories to use for the heatmap
"""
target = ["author-name", "author-gender", "title"]

"""
set the number of top topics shown
"""
topTopicsShown = 20

"""
set the mode for normalization of the heatmap values
possible values: meannorm|mediannorm|zscores|absolute
"""
mode = "zscores"

"""
set the font scale
"""
fontscale = 1.0

"""
set the resolution for the images (in dots per inch)
"""
dpi = 300

In [64]:
averageDatasets = join(wdir, aggregates_folder, param_settings, "avg*.csv") 
firstWordsFile = join(wdir, aggregates_folder, param_settings, "firstWords.csv")
out_distinctiveness = join(wdir, out_folder, param_settings, "distinctiveness")
sorting = "std"
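To give an intuition for the `"zscores"` mode and the `"std"` sorting, here is a small standard-library sketch. It is an assumption about what these settings mean, not the implementation in `visualize_verona`: z-scores express each topic score as its distance from the topic's mean in units of standard deviation, and sorting by standard deviation puts the topics that vary most across categories first. The population standard deviation (`pstdev`) is used here; the module may use the sample version instead.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize one topic's scores across categories (assumed meaning of 'zscores')."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A topic with identical scores everywhere is not distinctive at all.
        return [0.0] * len(values)
    return [(value - mu) / sigma for value in values]


def rank_topics_by_std(scores_by_topic):
    """Order topic labels by the spread of their scores, largest first (assumed meaning of 'std')."""
    return sorted(
        scores_by_topic,
        key=lambda topic: pstdev(scores_by_topic[topic]),
        reverse=True,
    )
```

Under this reading, a topic with similar scores in every category ends up at the bottom of the ranking and would be cut off first by `topTopicsShown`.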

In [67]:
visualize_verona.plot_distinctiveness_heatmap(averageDatasets, firstWordsFile, out_distinctiveness, target, NumTopics, topTopicsShown, mode, sorting, fontscale, dpi)

Launched plot_distinctiveness_heatmap.
- working on: author-name
- getting dataToPlot...
- working on: author-gender
- getting dataToPlot...
- working on: title
- getting dataToPlot...
Done.
