# Run Prepare

(author: uhk, based on run_prepare.py by cs, created on: October 2017)

This notebook is for calling functions from TMW's prepare module.

General advice on how to use the notebook:

* Run the code blocks one after the other. The blocks "Imports" and "General settings" have to be run first.
* Each of the following blocks, e.g. "Segmenter" or "Stopword list", can be run separately.
* Some settings have to be adapted, e.g. your working directory, the segment size, and other parameters. The comments say "set ..." wherever this is the case. No other changes to the code are needed.

## Imports

In [1]:
from os.path import join
from os.path import abspath
import importlib
import sys

In [2]:
# set the path to the directory where you stored TMW
tmw_path = "/home/ulrike/Git/tmw"

In [3]:
# make the scripts_verona folder inside TMW importable
sys.path.append(abspath(join(tmw_path, "scripts_verona")))

# this shows your path settings
print(sys.path)

['', '/usr/lib/python35.zip', '/usr/lib/python3.5', '/usr/lib/python3.5/plat-x86_64-linux-gnu', '/usr/lib/python3.5/lib-dynload', '/home/ulrike/.local/lib/python3.5/site-packages', '/usr/local/lib/python3.5/dist-packages', '/usr/local/lib/python3.5/dist-packages/wikiextractor-2.69-py3.5.egg', '/usr/lib/python3/dist-packages', '/home/ulrike/.local/lib/python3.5/site-packages/IPython/extensions', '/home/ulrike/.ipython', '/home/ulrike/Git/tmw/scripts_verona']


In [4]:
import prepare_verona

# reload the module so that edits to prepare_verona.py take effect without restarting the kernel
importlib.reload(prepare_verona)

<module 'prepare_verona' from '/home/ulrike/Git/tmw/scripts_verona/prepare_verona.py'>
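
If the import above fails, it can help to check whether Python can find the module on the current search path at all. The following is a minimal sketch using only the standard library; the helper name `module_is_importable` is made up for this illustration and is not part of TMW.

```python
import importlib.util


def module_is_importable(name):
    """Return True if a top-level module called `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


# "json" ships with Python, so it should always be found;
# replace it with "prepare_verona" to test your own path settings
print(module_is_importable("json"))
```

If the check returns `False` for `prepare_verona`, the value of `tmw_path` is the first thing to verify.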

## General settings

In [10]:
# set the working directory
# - a folder which contains the metadata file, the corpus folder, etc.
# - the topic modeling output will be stored in new subfolders inside the working directory
wdir = "/home/ulrike/Dokumente/GS/Veranstaltungen/2017_Verona/exercises_Italian"

# set the corpus folder name
corpus_folder = "5_lemmata_N"

## Segmenter

Splits entire texts into smaller segments.
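
To illustrate the basic idea of segmenting, here is a minimal sketch that cuts a list of tokens into consecutive chunks of a fixed size. It is not the code of `prepare_verona.segmenter`; the function name `split_into_segments` and the whitespace tokenization are assumptions made for this example only.

```python
def split_into_segments(tokens, segment_size):
    """Cut a token list into consecutive chunks of at most `segment_size` tokens."""
    if segment_size < 1:
        raise ValueError("segment_size must be a positive integer")
    return [tokens[i:i + segment_size] for i in range(0, len(tokens), segment_size)]


# toy example: whitespace tokenization, three tokens per segment
tokens = "nel mezzo del cammin di nostra vita".split()
segments = split_into_segments(tokens, 3)
print(segments)
```

The last segment is simply shorter than the others in this sketch; a real segmenter may treat such a remainder differently.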

In [11]:
# set the segment size in tokens
segment_size = 2000

# set the size tolerance factor
# 1 = exact target size
# > 1 = with some tolerance, e.g. 1.1 = target size +/- 10%
size_tolerance_factor = 1.1

# set preserve paragraphs
# True: the segmenter will try to preserve paragraphs if possible (considering the size tolerance factor)
# False: paragraphs will not be preserved when segmenting
preserve_paragraphs = True
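
How a size tolerance and paragraph preservation can work together is sketched below. This is one possible reading of the settings above, not the logic of `prepare_verona.segmenter`: the function `segment_paragraphs` is hypothetical, and it assumes that a segment is closed as soon as adding the next paragraph would push it beyond `segment_size * tolerance` tokens.

```python
def segment_paragraphs(paragraphs, segment_size, tolerance=1.1):
    """Group whole paragraphs into segments of roughly `segment_size` tokens.

    Assumption for this sketch: a segment may grow up to
    segment_size * tolerance tokens before a new one is started.
    """
    max_size = segment_size * tolerance
    segments, current = [], []
    for paragraph in paragraphs:
        words = paragraph.split()
        # close the current segment if the next paragraph would not fit
        if current and len(current) + len(words) > max_size:
            segments.append(current)
            current = []
        current.extend(words)
    if current:
        segments.append(current)
    return segments


# toy example: target size 5 tokens, 20% tolerance (upper limit 6 tokens)
paragraphs = ["a b c", "d e", "f g h i"]
result = segment_paragraphs(paragraphs, segment_size=5, tolerance=1.2)
print([len(segment) for segment in result])
```

In this sketch a single paragraph that is longer than the upper limit is kept whole rather than cut.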

In [12]:
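# all plain text files in the corpus folder
inpath = join(wdir, corpus_folder, "*.txt")
# the empty last argument makes the path end with a separator
outfolder = join(wdir, "segs", "")

prepare_verona.segmenter(inpath, outfolder, segment_size, size_tolerance_factor, preserve_paragraphs)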


Launched segmenter.
Done.


## Stopword list

In [15]:
# Create a stopword list that can be used for the topic modeling later on.

# set the number of most frequent words that shall be stopwords
mfw = 50

# set the file name for the stopwords output file
stopwords_out = "it_stopwords.txt"

In [16]:
corpus_dir = join(wdir, corpus_folder)
stopwords_out_path = join(wdir, stopwords_out)

prepare_verona.create_stopword_list(mfw, corpus_dir, stopwords_out_path)


Launched create_stopword_list.
Done.
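
The idea of treating the most frequent words as stopwords can be sketched with the standard library as follows. This is an illustration under simple assumptions (lowercasing and whitespace tokenization), not the code of `prepare_verona.create_stopword_list`; the function name `most_frequent_words` is made up for this example.

```python
from collections import Counter


def most_frequent_words(texts, mfw):
    """Return the `mfw` most frequent words across all texts."""
    counts = Counter()
    for text in texts:
        # assumption: lowercase and split on whitespace
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(mfw)]


# toy corpus: "il" occurs four times, "e" three times
texts = ["il gatto e il cane", "il cane e il gatto e la casa"]
stopwords = most_frequent_words(texts, 2)
print(stopwords)
```

A list built this way depends entirely on the corpus, so it is worth inspecting the output file before using it for topic modeling.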
