1916 Letters Topic Modeling Project
##Letter analysis project##
The letter analysis project is a student project created during my internship at the Center for High Performance Computing in TCD and the Letters of 1916 project. The goal of the internship is to create a text analysis tool for the Letters of 1916 project.
##The Letters of 1916: Creating History##
The Letters of 1916: Creating History is the first crowd-sourced humanities project in Ireland. The project was launched on Friday, September 27th, 2013, at Discover Research Night and invites people all over the world to share letters written during, or related to, the Easter Rising of 1916. Images of letters can be uploaded, read online, and transcribed. The project focuses especially on private collections and on the letters and voices of people who were less well known or even forgotten.
##Module documentation##
###Core modules:###
- importer.py - imports texts from txt files or from an Excel file
- cleaner.py - cleans the texts using a stopword list, a cleaning pattern in the form of a list of regular expressions, a spell checker, and a stemmer
- analyse.py - is built on the gensim Python library and has several functions to extract topics via gensim
- outputter.py - several functions to format the data generated from analyse.py, and prepare it for visualisation in Gephi
All main modules can be used via the command line. For each module a help command is available: python module_name.py -h
When calling importer.py the following arguments are required:
- -m or --mode - either "txt" or "excel"
- -f or --file_name_excel, or -d or --txt_dir_path - depending on the mode chosen
- -c or --corpus_dir - a directory where the text corpus should be saved

Example arguments: -m excel -f c:\path_to_excel\all_transcriptions_until_16_06_2014.xlsx
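
The importer arguments described above could be wired up with argparse roughly as follows. This is a minimal sketch and not the actual importer.py: the function names `build_parser` and `parse_importer_args` and the mode check are assumptions, only the flag names are taken from the description above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented importer flags (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Import letters into a text corpus")
    parser.add_argument("-m", "--mode", choices=["txt", "excel"], required=True,
                        help="import from a directory of txt files or from an Excel file")
    parser.add_argument("-f", "--file_name_excel",
                        help="path to the Excel file (mode 'excel')")
    parser.add_argument("-d", "--txt_dir_path",
                        help="directory containing the txt files (mode 'txt')")
    parser.add_argument("-c", "--corpus_dir", required=True,
                        help="directory where the text corpus is saved")
    return parser


def parse_importer_args(argv):
    """Parse argv and check that the source argument matches the chosen mode."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.mode == "excel" and not args.file_name_excel:
        parser.error("mode 'excel' requires -f/--file_name_excel")
    if args.mode == "txt" and not args.txt_dir_path:
        parser.error("mode 'txt' requires -d/--txt_dir_path")
    return args


# Example call with placeholder paths
args = parse_importer_args(["-m", "excel", "-f", "letters.xlsx", "-c", "corpus"])
print(args.mode, args.file_name_excel, args.corpus_dir)
```

Passing the argument list explicitly, instead of reading `sys.argv` inside the function, keeps the parsing step easy to unit test.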
When calling cleaner.py the following arguments are required:
- mode - possible values are "pat+stop" (default), "spell" and "stem". With "pat+stop" the texts are cleaned using only the cleaning pattern and the stopword list. If "spell" or "stem" is set, the spell checker and stemmer are used as well. It is recommended to run the default setting first; this also generates two txt files listing the changes the spell checker and stemmer would make if run. Warning: the spell checker and stemmer are quite radical, and if used without caution they can change the meaning of the texts and deliver wrong results.
- files_dir - the path to a directory that contains only the files that should be cleaned. These should be plain text files with a txt extension.
- clean_files_dir - a directory path where the cleaned files should be saved
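
The default "pat+stop" cleaning can be pictured with the following minimal sketch. It is not the code of cleaner.py: the pattern list, the stopword set and the function name `clean_text` are hypothetical, and the real stopword file and cleaning pattern are the ones referenced in settings.py.

```python
import re

# Hypothetical cleaning pattern (a list of regular expressions) and stopword list
CLEANING_PATTERN = [r"<[^>]+>", r"[^\w\s]", r"\d+"]
STOPWORDS = {"the", "a", "and", "of", "to"}


def clean_text(text, patterns=CLEANING_PATTERN, stopwords=STOPWORDS):
    """Blank out every regex match, lowercase the text, and drop stopwords."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]


print(clean_text("<p>The Rising of 1916, and the letters to Dublin.</p>"))
# prints ['rising', 'letters', 'dublin']
```

Replacing matches with a space instead of an empty string avoids gluing neighbouring words together when markup or punctuation is removed.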
When calling analyse.py the following arguments are required:
- mode - "replace" or "analyse"

The mode "replace" is used after a cleaning process. The cleaning process creates new files in a different folder, so the file path stored in each txt item of the corpus has to be changed to point to the new location. If mode "replace" is set, "path_to_txt" and "new_text_dir" are required. The first is the path to the pickle file containing all the TxtItem instances; the second is the path to the directory with the cleaned files. Keep in mind that the file names have to stay the same.

If the mode is set to "lsi" (default), the arguments "path_to_corpus" and "num_topics" are required. Instead of "path_to_corpus" the argument "path_to_txt" can be used. In that case a new corpus is first created from the specified txt items, and the "lsi" analysis is carried out afterwards. The result of the "lsi" analysis is written to two files: topic-keys.txt and topic-compostion.txt. The format of the content is similar to the output of Mallet, which makes the later transformation into Gephi edge files easier.
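
The idea behind the "replace" mode can be sketched as follows. The `TxtItem` class here is only a stand-in for the class in txt_classes.py, and the attribute name `file_path` and the function name `repoint_items` are assumptions made for illustration. Loading and saving the pickle file is left out.

```python
import os


class TxtItem:
    """Stand-in for the project's TxtItem; only the attributes used here are modelled."""

    def __init__(self, unique_name, file_path):
        self.unique_name = unique_name
        self.file_path = file_path  # hypothetical attribute name


def repoint_items(items, new_text_dir):
    """Point each item at the file with the same name inside new_text_dir."""
    for item in items:
        file_name = os.path.basename(item.file_path)
        item.file_path = os.path.join(new_text_dir, file_name)
    return items


items = [TxtItem("letter_1", os.path.join("raw", "letter_1.txt"))]
repoint_items(items, "cleaned")
print(items[0].file_path)
```

Because only the directory part of the path is swapped, the approach works only as long as the cleaned files keep their original file names.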
When calling outputter.py the following arguments are required:
- mode - possible values: "search" (default) and "gephi". "search" allows searching a TxtCorpus; "gephi" creates Gephi edge files from topic-compostion.txt (a Mallet-like file listing all texts in the corpus and their closeness to the generated topics).
The following parameters have to be passed if mode ‘search’ is used:
- corpus_file_path - file path to an existing TxtCorpus file
- attrs - the attributes of the TxtItems that should be included in the result output in addition to the "unique_name" of the item. A comma-separated list has to be given, e.g. Language, Category.
- python_expr - a valid Python expression using the attributes of the TxtItems, e.g. "Language != 'English'"
- to_file - if set to True, the results are printed to a file, otherwise only to the command line
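
How a search with "attrs" and "python_expr" might work is sketched below. This is an assumption about the mechanism, not the code of outputter.py: the items are modelled as plain dicts instead of TxtItem instances, the function name `search_items` is hypothetical, and the use of `eval` is a guess. Note that `eval` should only ever be given expressions from a trusted user.

```python
def search_items(items, python_expr, attrs):
    """Return unique_name plus the requested attrs for items matching python_expr."""
    results = []
    for item in items:
        # Evaluate the expression with the item's attributes as the only visible names
        if eval(python_expr, {"__builtins__": {}}, dict(item)):
            results.append([item["unique_name"]] + [item[attr] for attr in attrs])
    return results


# Hypothetical items with made-up metadata values
items = [
    {"unique_name": "letter_1", "Language": "English", "Category": "Family"},
    {"unique_name": "letter_2", "Language": "Irish", "Category": "Politics"},
]
print(search_items(items, "Language != 'English'", ["Language", "Category"]))
# prints [['letter_2', 'Irish', 'Politics']]
```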
The following parameters have to be passed if mode ‘gephi’ is used:
- imp_file_comp - path to an existing topic-compostion.txt file, like the ones generated by Mallet or analyse.py
- exp_file - a file to which the Gephi edges output will be saved
- limit - is the proximity limit below which the values will not be included as a Gephi edge.
- dist0to1 - if set to True the proximity values used for Gephi edges will be spread wider from 0 to 1
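
The effect of "limit" and "dist0to1" can be illustrated with the following sketch. It assumes the topic composition file has already been parsed into (text, topic, proximity) rows, that "dist0to1" means a min-max rescaling, and that the limit is applied after rescaling; none of this is confirmed by outputter.py, and all function names are hypothetical. The Source, Target and Weight column headers are the ones Gephi recognises when importing an edges CSV file.

```python
import csv
import io


def rescale(values):
    """Min-max rescale values so the smallest becomes 0 and the largest 1."""
    low, high = min(values), max(values)
    if high == low:
        return [1.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def build_edges(rows, limit=0.0, dist0to1=False):
    """Turn (text, topic, proximity) rows into (Source, Target, Weight) edges."""
    weights = [proximity for _, _, proximity in rows]
    if dist0to1 and weights:
        weights = rescale(weights)
    edges = []
    for (text, topic, _), weight in zip(rows, weights):
        # Values below the proximity limit are not turned into edges
        if weight >= limit:
            edges.append((text, "topic_%s" % topic, round(weight, 3)))
    return edges


def write_edges(edges, handle):
    """Write the edges as CSV with the column headers Gephi expects."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)


# Made-up proximity values for two letters and two topics
rows = [("letter_1", 0, 0.8), ("letter_1", 1, 0.2), ("letter_2", 1, 0.5)]
buffer = io.StringIO()
write_edges(build_edges(rows, limit=0.3), buffer)
print(buffer.getvalue())
```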
###Unittests###
Following Test Driven Development, each core module has a unittest module to ensure that it runs correctly and produces the right output.
###Further modules###
In addition to the core modules the following modules were created:
- helper.py - module with helper functions
- txt_classes.py - module containing the text and corpus classes
- settings.py - the settings module specifies which columns of an Excel file the importer should use, and where cleaner.py finds the stopword file, the personal word list and the cleaning pattern
The core modules can be called using the command line. After the command line functionality was added, the development of the following modules was stopped:
- gui.py - originally a GUI that allowed browsing of the letters and metadata. Development of the module was dropped and it is no longer compatible with the rest of the program
- run_all.py - a module originally developed to run all modules from one place. This was useful for testing, but development was dropped and the module is no longer compatible with the rest of the program