
semantometrics-python

semantometrics-python is a Python toolset that aims to reproduce the semantometrics research contribution measure described by Petr Knoth and Drahomira Herrmannova in their 2014 article. This repository contains tools for calculating this and other measures, as well as preprocessing scripts to convert .html and .pdf files to .txt files.

Preprocessing

PDF to text

  • File: pdf2txt.py
  • Usage: Just move the script to a folder with .pdf-files and run it. It will output .txt-files.
  • Packages used: To get from PDF files (the de facto standard for research papers) to plain text files, we use the PDFMiner package. An alternative would be the Java package Apache Tika, which is what Knoth and Herrmannova actually use. A minimal conversion sketch is shown below this list.
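For reference, this is roughly what such a conversion looks like with pdfminer.six's high-level API; the repository's pdf2txt.py may differ in its details.

```python
# Hedged sketch: convert every .pdf in the current directory to a .txt file
# using pdfminer.six's extract_text; pdf2txt.py in this repository may differ.
from pathlib import Path
from pdfminer.high_level import extract_text

for pdf_path in Path(".").glob("*.pdf"):
    text = extract_text(str(pdf_path))              # plain text of the whole PDF
    pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```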

HTML to text

  • Files: html2txt_cultmach.py, html2txt_theoryandevent.py
  • Usage: Just move the script to a folder with .html-files and run it. It will output .txt-files.
  • Packages used: To get from HTML files to plain text files, we use the BeautifulSoup package. A small number of research papers are published as HTML, each journal with its own layout (YMMV), which is why the repository contains a separate converter per journal; a generic sketch follows this list.
    Currently supported journals: Culture Machine (html2txt_cultmach.py) and Theory & Event (html2txt_theoryandevent.py).
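As an illustration of the BeautifulSoup step, a minimal, journal-agnostic sketch; the actual converters select journal-specific elements rather than taking the whole page text.

```python
# Hedged sketch: strip HTML tags with BeautifulSoup and keep the visible text.
# The journal-specific converters in this repository select particular elements.
from pathlib import Path
from bs4 import BeautifulSoup

for html_path in Path(".").glob("*.html"):
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    text = soup.get_text(separator="\n")            # visible text of the page
    html_path.with_suffix(".txt").write_text(text, encoding="utf-8")
```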

Tools

Semantic Similarity

  • File: semantometrics.py
  • Usage: The script is set up to work with two sets of files: references of the research article in question (A-set, .txt-files starting with the letter 'a') and citations of the research article (B-set, .txt-files starting with the letter 'b'). The script expects these files to be in the same directory. The script returns the result of the contribution function defined by Knoth and Herrmannova.
  • Packages used: For the tf-idf calculation (the measure used in Knoth and Herrmannova's article) we use the scikit-learn implementation with a custom tokenizer based on NLTK's implementation of the Porter2 (Snowball) stemmer. We also opted to remove punctuation and tokens that contain digits (e.g. '1987' and 'a2c'). A sketch of this tokenization is shown below.
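A rough sketch of a tokenizer along these lines, plugged into scikit-learn's TfidfVectorizer; the exact filtering in semantometrics.py may differ.

```python
# Hedged sketch: Snowball (Porter2) stemming with punctuation and digit filtering,
# used as a custom tokenizer for scikit-learn's TfidfVectorizer.
# Requires the NLTK 'punkt' tokenizer data: nltk.download('punkt')
from pathlib import Path
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

def tokenize(text):
    return [
        stemmer.stem(token)
        for token in word_tokenize(text.lower())
        if any(c.isalpha() for c in token)          # drop pure punctuation
        and not any(c.isdigit() for c in token)     # drop '1987', 'a2c', etc.
    ]

paths = sorted(Path(".").glob("[ab]*.txt"))         # A-set and B-set files
texts = [p.read_text(encoding="utf-8") for p in paths]
tfidf = TfidfVectorizer(tokenizer=tokenize).fit_transform(texts)
```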

Gephi output

  • File: gephi_input.py
  • Usage: We can also use tf-idf to calculate semantic similarity between individual files instead of sets of files. This measure (cosine similarity) can then be used as edge weights in an undirected graph, which allows all sorts of cluster analysis with tools like Gephi. The script creates two input files for Gephi:
    • nodes.csv, which contains the nodes of the network
    • edges.csv, which contains the weighted edges of the network.
  • Packages used: We use the same measure of similarity as in the semantic similarity tool above. A sketch of the graph export is shown below.
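A condensed sketch of building such Gephi input from pairwise cosine similarities; the file layout and column choices in gephi_input.py may differ.

```python
# Hedged sketch: pairwise cosine similarity over tf-idf vectors, written out as a
# Gephi node table (Id, Label) and an undirected edge table (Source, Target, Weight, Type).
import csv
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = sorted(Path(".").glob("*.txt"))
texts = [p.read_text(encoding="utf-8") for p in paths]
similarity = cosine_similarity(TfidfVectorizer().fit_transform(texts))

with open("nodes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Label"])
    for i, p in enumerate(paths):
        writer.writerow([i, p.stem])

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target", "Weight", "Type"])
    for i in range(len(paths)):
        for j in range(i + 1, len(paths)):          # undirected graph: each pair once
            writer.writerow([i, j, similarity[i, j], "Undirected"])
```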

Topic extraction

  • File: topic_extraction.py
  • Usage: This script can be used to generate clusters from a corpus based on semantic similarity, using k-means clustering. Per cluster, the top terms can be identified, which makes the result somewhat more interpretable than the Gephi-based clustering above. The script creates two output files per number of clusters k:
    • clusters.txt, which contains the top terms per cluster
    • clusters.csv, which contains the cluster assignment for each input file.
    In addition, the inertia per number of clusters k is plotted, so that a value for k can be chosen with the elbow method.
  • Packages used: The script uses the scikit-learn implementation of k-means clustering, and again the tf-idf calculation from the same package, this time with the built-in tokenization. For the most part, the script is adapted from an example on the scikit-learn website, with the loop over k, the output files and the inertia plot added.
    The inertia is plotted with matplotlib. A condensed sketch of the core loop follows.
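A condensed sketch of that core loop (tf-idf, k-means per k, inertia plot, top terms); the actual script additionally writes clusters.txt and clusters.csv.

```python
# Hedged sketch: k-means over tf-idf vectors, with an inertia ("elbow") plot and
# the top terms per cluster for one chosen k; topic_extraction.py may differ.
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [p.read_text(encoding="utf-8") for p in sorted(Path(".").glob("*.txt"))]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

ks = range(2, 11)
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")            # look for the "elbow" in this curve
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()

# Top terms per cluster for a k chosen from the elbow plot (here: 5, as an example).
km = KMeans(n_clusters=5, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()          # scikit-learn >= 1.0
for i, center in enumerate(km.cluster_centers_):
    print(i, [terms[j] for j in center.argsort()[::-1][:10]])
```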

Corpus counting

  • File: corpus_count.py
  • Usage: This script can be used to count the number of occurrences of a certain list of words in a corpus.
  • Packages used: The script uses the scikit-learn implementation of word counting (CountVectorizer). A sketch is shown below.
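A minimal sketch of such a count with CountVectorizer; the word list here is hypothetical, and corpus_count.py may read its list from elsewhere.

```python
# Hedged sketch: count occurrences of a fixed word list over a corpus of .txt files
# with scikit-learn's CountVectorizer; the word list below is only an example.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

words = ["semantic", "contribution", "citation"]    # hypothetical word list
texts = [p.read_text(encoding="utf-8") for p in sorted(Path(".").glob("*.txt"))]

# Restricting the vocabulary makes CountVectorizer count only the listed words.
counts = CountVectorizer(vocabulary=words).fit_transform(texts)
totals = counts.sum(axis=0).A1                      # total occurrences per word
print(dict(zip(words, totals)))
```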