DHLW/TRU Case Study
This repository contains the code to scrape, process, and analyze OSTI data for a case study of DHLW and TRU.
You will need to install the relevant libraries. First ensure that you have pip installed, then use it to install all of the libraries the scripts import.
All the code below was run using Python 3, with the exception of count_subjects.py, which must be run with Python 2 (see below).
The data for this case study was collected by scraping OSTI.
To scrape the TRU search results, simply run:
py -3 scrape_ost.py
To scrape the DHLW search results, run:
py -3 scrape_ost.py --dhlw
Keep in mind that the URLs for the two search results, as well as the number of search result pages, are statically defined in scrape_ost.py as constants (e.g., dhlw_pages). Before running the code, please verify the URLs for the search results and change the number of pages of results as needed.
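For reference, the statically defined values at the top of scrape_ost.py might look something like the sketch below. Only the name dhlw_pages appears in this README; every other name and all of the values here are hypothetical placeholders, so verify them against the actual script before editing.

```python
# Hypothetical sketch of the constants in scrape_ost.py.
# Only dhlw_pages is named in the README; the other names and all
# values are placeholders -- check the real script before editing.
TRU_SEARCH_URL = "https://www.osti.gov/..."   # placeholder, not a real URL
DHLW_SEARCH_URL = "https://www.osti.gov/..."  # placeholder, not a real URL

tru_pages = 10   # placeholder: number of TRU search result pages
dhlw_pages = 10  # placeholder: number of DHLW search result pages
```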
Accessing the Data
After running the web scraping code above, the data should be accessible as CSV files in the data folder (e.g., dhlw.csv). Sometimes the URL calls will fail; the failed URLs are also saved in the data folder under doc_urls_failed.txt. You can either manually enter these failed URLs into your browser afterwards or run API calls on them later.
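If you would rather retry the failed calls programmatically, a minimal sketch follows. It assumes doc_urls_failed.txt holds one URL per line; the retry helper and its behavior are illustrations, not part of this repository.

```python
import urllib.request

def read_failed_urls(path):
    """Read one URL per line, skipping blank lines (assumed file format)."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def retry_urls(urls):
    """Re-request each failed URL; return the ones that still fail."""
    still_failed = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                resp.read()
        except Exception:
            still_failed.append(url)
    return still_failed

if __name__ == "__main__":
    import os
    # Only attempt the retries if the scrape has actually been run.
    if os.path.exists("data/doc_urls_failed.txt"):
        urls = read_failed_urls("data/doc_urls_failed.txt")
        print(retry_urls(urls))
```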
Data Cleaning and Aggregation
Initial Subject Cleaning and Getting Subject Counts
To count the number of subjects and do an initial cleaning of the subjects data (stripping of whitespace; singularizing; removing any numbers, dashes, or spaces from the beginning of the string; and turning them all to lowercase), run:
py -2 count_subjects.py [--dhlw]
We singularize the last word of each subject using pattern.text.en, which only works with Python 2; this is why the command above is run with Python 2.
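The cleaning steps above can be sketched as follows. Note that this Python 3 sketch substitutes a naive suffix-stripping singularizer for pattern.text.en, so it is an approximation of what count_subjects.py does, not the actual implementation.

```python
import re

def naive_singularize(word):
    """Crude stand-in for pattern.text.en's singularize (Python 2 only)."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean_subject(subject):
    """Strip whitespace, drop leading numbers/dashes/spaces,
    lowercase, and singularize the last word."""
    s = subject.strip()
    s = re.sub(r"^[0-9\- ]+", "", s)  # remove leading numbers, dashes, spaces
    s = s.lower()
    words = s.split()
    if words:
        words[-1] = naive_singularize(words[-1])
    return " ".join(words)

print(clean_subject("  54 - Environmental Sciences "))  # -> "environmental science"
```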
The resulting counts should be under data/*_subject_counts.csv, and the data files with the cleaned subjects should also be written to the data folder.
Some subjects were ignored because they were too generic or ambiguous. ignore_subjects.py reads in a list of conditions and removes any subject that satisfies one of them. To do this, run:
py -3 ignore_subjects.py [--dhlw]
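As a rough sketch of the filtering, assume the ignore conditions boil down to a list of terms to drop on exact match; the real conditions in ignore_subjects.py may be more elaborate, and the function and sample data below are hypothetical.

```python
def filter_subjects(subject_counts, ignore_terms):
    """Drop subjects matching any ignore term (exact match assumed)."""
    ignored = set(ignore_terms)
    return {s: c for s, c in subject_counts.items() if s not in ignored}

# Hypothetical counts and ignore list for illustration only.
counts = {"waste management": 12, "data": 5, "transuranic waste": 9}
print(filter_subjects(counts, ["data"]))  # drops the too-generic "data"
```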
The resulting subjects that were not ignored, together with their counts, should also be written to the data folder.
To aggregate the subjects to their corresponding subcategories, run:
py -3 aggregate_subjects.py [--dhlw]
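Conceptually, the aggregation maps each subject to a subcategory and sums the counts. The mapping, fallback label, and sample data below are hypothetical illustrations, not the actual contents of aggregate_subjects.py.

```python
from collections import defaultdict

def aggregate(subject_counts, subject_to_subcategory):
    """Sum subject counts into their mapped subcategories."""
    totals = defaultdict(int)
    for subject, count in subject_counts.items():
        # "other" is a hypothetical fallback for unmapped subjects.
        subcategory = subject_to_subcategory.get(subject, "other")
        totals[subcategory] += count
    return dict(totals)

# Hypothetical mapping and counts for illustration only.
mapping = {"transuranic waste": "waste types", "borosilicate glass": "waste forms"}
print(aggregate({"transuranic waste": 9, "borosilicate glass": 4, "misc": 1}, mapping))
```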
The resulting subject aggregations should also be written to the data folder.
Labeling of Data
To label the documents with their corresponding categories and subcategories, run:
py -3 label_data.py [--dhlw]
The resulting labeled data file should also be written to the data folder.
Because we are using a modified version of the Global Migrations plot, we also generated JSON files containing the matrices with the data link information.
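Each of those JSON files essentially holds a square matrix of link weights between categories. A minimal sketch of generating one follows; the "names"/"matrix" structure, labels, and weights are assumptions about what the modified plot expects, not the repository's actual format.

```python
import json

def build_matrix(labels, links):
    """Build an N x N matrix where entry [i][j] is the weight of the
    link from labels[i] to labels[j] (zero where no link exists)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for src, dst, weight in links:
        matrix[index[src]][index[dst]] = weight
    return matrix

labels = ["DHLW", "TRU"]  # hypothetical category labels
links = [("DHLW", "TRU", 3), ("TRU", "DHLW", 3)]  # bidirectional, as in the plot
print(json.dumps({"names": labels, "matrix": build_matrix(labels, links)}))
```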
Creating the Visualization
To create the visualization, open the plot/chart_*.html files or plot/charts.html and copy the JSON object into the data variable in each file. Then open the HTML file in your browser.
This plot was adapted from The Global Flow of People plot by Nikola Sander, Guy J. Abel, and Ramon Bauer, with source code by null2 GmbH. Their work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License, copyright (c) 2013 null2 GmbH Berlin.
Minor changes were made, including:
- Making links bidirectional
- Taking out the timeline
- Changing the region labels to be perpendicular
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.