Code and docs for the Internet Archive Transcript Project
The Industry Documents Library is a vast collection of resources encompassing documents, images, videos, and recordings. These materials can be studied individually, but increasingly, researchers are interested in examining trends across whole collections or subsets of them. In this way, the Industry Documents Library is also a trove of data that can be used to uncover trends and patterns in the history of industries impacting public health.
There are many ways to generate data from digital collections. In this project we focused on a combination of collections metadata and computer-generated transcripts of video files. We know that, like all information, data is not objective but constructed. Metadata is usually entered manually and is subject to human error. Video transcripts generated by computer programs are never 100% accurate. If accuracy varies based on factors such as the age of the video or the type of event being recorded, how might this impact conclusions drawn by researchers who are treating all video transcriptions as equally accurate? What guidance can the library provide to prevent researchers from drawing inaccurate conclusions from computer-generated text?
This project is a case study that evaluates the accuracy of computer-generated transcripts for videos within the Industry Documents Library’s Tobacco Collection. Specifically, the project investigates how transcript accuracy differs between television commercials and court proceedings. Other factors impacting accuracy, such as year and runtime, are also considered. These findings provide a foundation for UCSF's Industry Documents Library to create guidelines for researchers using video transcripts for text analysis. This case study also acts as a roadmap for similar studies to be conducted on other collections.
Initial IDL-DSI project proposal: Proposal.MD
To extract metadata from the Internet Archive, we used the "search_items" and "download" methods from the internetarchive module, which provides an API for targeted searches and downloads.
https://internetarchive.readthedocs.io/en/stable/internetarchive.html
We retrieved metadata and links for all items in the 'UCSF Industry Archives Videos' collection, though we only ended up extracting transcripts for a small subset.
download_from_ia.py
The Internet Archive API's download module provides methods to download metadata for each item in a collection (along with other requested files). Files are downloaded locally to a directory named after the item identifier. To create a more analysis-friendly format, we extracted the following metadata elements into a pandas dataframe (which we later exported as a CSV).
For our study, we extracted:
- identifier
- collection
- title
- mediatype
- year
- description
- subject
- file size
- file length
- url
generate_links_metadata.ipynb
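The metadata pull above can be sketched as follows. This is a minimal sketch, not the notebook's exact code: the details-URL pattern and missing-field handling are assumptions, and the internetarchive calls themselves appear only as comments because they require network access.

```python
FIELDS = ["identifier", "collection", "title", "mediatype", "year",
          "description", "subject"]

def metadata_row(item_metadata):
    # Flatten one item's metadata dict into the columns we keep;
    # absent fields become None so every row has the same shape.
    row = {field: item_metadata.get(field) for field in FIELDS}
    # Assumed URL pattern: archive.org item pages are keyed by identifier
    row["url"] = "https://archive.org/details/" + item_metadata["identifier"]
    return row

# With the internetarchive package (not invoked here, needs network access):
#   from internetarchive import search_items, get_item
#   results = search_items('collection:"UCSF Industry Archives Videos"')
#   rows = [metadata_row(get_item(r["identifier"]).metadata) for r in results]
#   df = pandas.DataFrame(rows)  # then df.to_csv(...)
```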
After identifying a subset of videos for analysis, we extracted transcripts from each URL using the Google AutoML API for transcription. This required a pipeline for downloading, transforming, extracting, and storing transcripts in a way that could be cross-referenced with metadata.
Google AutoML provides a transcription service that extracts transcripts from audio files. Unfortunately, this means we need to extract the audio (in our case, as a FLAC file) before using the transcription service. The extraction notebook:
- downloads the MP4 file from the Internet Archive URL
- uses ffmpeg to extract a FLAC audio file
- uploads the FLAC file to a Google Cloud Storage bucket (a GS bucket URI is required by the transcription service)
- extracts the transcript from the FLAC file as a JSON file
- writes the JSON transcript to local storage
Note: this process is very time-consuming and can take over 20 minutes for a 2-3 hour file. You can speed it up by running the code on a Google Cloud cluster with multiple processors and/or splitting the load across multiple clusters. The total cost to process ~50 videos ranging from 2 minutes to 2 hours was around $100.
extract_transcript_from_mp4_url.ipynb
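The first steps of the pipeline can be sketched as below. Only the command and URI construction is executable here; the bucket name is hypothetical, and the Google Cloud upload and transcription calls appear as comments because they need credentials.

```python
def ffmpeg_flac_command(mp4_path, flac_path):
    # ffmpeg: -vn drops the video stream, leaving FLAC-encoded audio only;
    # -y overwrites an existing output file
    return ["ffmpeg", "-y", "-i", mp4_path, "-vn", "-acodec", "flac", flac_path]

def gcs_uri(bucket, identifier):
    # The transcription service's long-running recognition requires a gs:// URI
    return f"gs://{bucket}/{identifier}.flac"

# Remaining steps (not run here; they require Google Cloud credentials):
#   import subprocess
#   from google.cloud import storage
#   subprocess.run(ffmpeg_flac_command("vid1.mp4", "vid1.flac"), check=True)
#   storage.Client().bucket("my-transcript-bucket") \
#       .blob("vid1.flac").upload_from_filename("vid1.flac")
#   # then pass gcs_uri("my-transcript-bucket", "vid1") to the transcription call
```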
We create a dataframe with the transcript text, overall accuracy, min accuracy, max accuracy, and identifier name, which can be joined with the metadata containing collection, title, mediatype, year, description, and subject. Note that the AutoML transcription service assigns an accuracy score to each fraction of a transcript in short (often half-sentence) increments, so we calculated the average of these scores as the overall accuracy for each transcript. As a result, even a transcript with a high (or low) overall accuracy rating could contain smaller sections with much lower (or higher) accuracy.
format_json.ipynb
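Assuming the transcription JSON has the Speech-to-Text layout (a list of short results, each with a top alternative carrying a transcript fragment and a confidence score), the per-transcript summary could be computed like this:

```python
def summarize_transcript(response_json, identifier):
    # Each result covers a short stretch of audio; keep the top alternative
    alts = [r["alternatives"][0] for r in response_json["results"]]
    confidences = [a["confidence"] for a in alts]
    return {
        "identifier": identifier,
        "transcript": " ".join(a["transcript"].strip() for a in alts),
        "confidence_avg": sum(confidences) / len(confidences),
        "confidence_min": min(confidences),
        "confidence_max": max(confidences),
    }
```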
Because we are interested in assessing how sentiment scores vary across transcripts with different attributes, we generate a sentiment score for each transcript. In this case, we used the pretrained, general-purpose sentiment model from Google AutoML. We store the results in a tabular data format containing identifier, sentiment, and magnitude.
Predict_Text_Sentiment.ipynb
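A sketch of how the sentiment rows could be assembled. The dict layout of `analysis` is an assumption mirroring the API's document-sentiment fields (a score in [-1, 1] and a non-negative magnitude); the client calls appear only as comments because they require credentials.

```python
def sentiment_row(identifier, analysis):
    # analysis: assumed dict with the document-level "score" and "magnitude"
    return {
        "identifier": identifier,
        "sentiment": analysis["score"],
        "magnitude": analysis["magnitude"],
    }

# With the google-cloud-language client (not invoked here):
#   from google.cloud import language_v1
#   client = language_v1.LanguageServiceClient()
#   doc = language_v1.Document(content=text,
#                              type_=language_v1.Document.Type.PLAIN_TEXT)
#   s = client.analyze_sentiment(request={"document": doc}).document_sentiment
#   row = sentiment_row(identifier, {"score": s.score, "magnitude": s.magnitude})
```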
Because we drew our transcripts from two sources (legal documents and advertising documents), we ran a k-means topic model with two clusters to see how closely the unsupervised clusters match the predetermined categories.
Topic-Modeling-Kmeans.ipynb
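As a toy illustration of the idea, here is a self-contained two-cluster k-means over raw term counts. The notebook presumably uses a library implementation on real transcripts; the seeding strategy and example documents below are illustrative assumptions only.

```python
from collections import Counter
import math

def bow(text, vocab):
    # Bag-of-words vector: count of each vocabulary word in the text
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans2(vectors, iters=10):
    # Plain 2-means: seed with the first and last vectors, then alternate
    # assignment and centroid-update steps
    centers = [list(vectors[0]), list(vectors[-1])]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min((0, 1), key=lambda k: dist(v, centers[k]))
                  for v in vectors]
        for k in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```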
In order to evaluate transcript accuracy between different factors, we created a dataset containing all variables of interest generated above. One observation (or row) of the dataset = one video in the collection. The dataset contains the variables described below.
Variable | Description | Source/Calculation
--- | --- | ---
id | Video identifier | Unique id from internetarchive |
runtime | Length of the video (in seconds) | Metadata from internetarchive converted to seconds |
category | Video type | Determined by us to be either "Advertising" or "Legal/Court" based on video content |
url | Video URL | Metadata from internetarchive |
year | Year of video | Metadata from internetarchive |
fellow_accuracy_rating | One of "Poor," "Fair," "Good," or "Excellent" | Determined by junior data science fellows based on experience editing the computer transcript |
automl_confidence_avg | Average Google AutoML confidence score for the transcript | Metadata from AutoML |
automl_confidence_min | Minimum Google AutoML confidence score for the transcript | Metadata from AutoML |
automl_confidence_max | Maximum Google AutoML confidence score for the transcript | Metadata from AutoML |
computer_transcript | Computer-generated transcript | Generated from AutoML |
human_transcript | Human-edited transcript | Transcript after editing by junior data science fellows |
sentiment | Sentiment score of the computer-generated transcript | Generated from AutoML |
magnitude | Sentiment magnitude of the computer-generated transcript | Generated from AutoML |
human_sentiment | Sentiment of the human-edited transcript | Generated from AutoML |
human_magnitude | Sentiment magnitude of the human-edited transcript | Generated from AutoML |
war | Word Accuracy Rate | Calculated with the Python jiwer package
bleu_score | BLEU Score | Calculated with the Python nltk package
transcript_accuracy_assessment.ipynb
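For the `war` column, here is a standard-library sketch of word error rate, the quantity `jiwer.wer` computes; treating word accuracy rate as its complement is an assumption based on the column name.

```python
def word_error_rate(reference, hypothesis):
    # Levenshtein distance over words (substitutions + insertions + deletions),
    # normalized by the reference word count
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)

def word_accuracy_rate(reference, hypothesis):
    # Assumed definition of the "war" column: accuracy = 1 - WER
    return 1 - word_error_rate(reference, hypothesis)
```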
Finally, using the research dataset we created, we performed an analysis to investigate our research questions. Using R, we generated summary statistics, visualizations, and statistical tests. The findings from this analysis can be found in this blog post summarizing the project.
final_analysis_viz.R