

This repository contains a variety of data analysis scripts. Projects are launched with a Python script that can build and test a project, upload the jar file, start an EMR Spark cluster, and run the job. Before first use, its configuration file must be copied and filled in with the values for the AWS account. The example command below builds the word_count project, uploads the jar, starts a cluster, and submits the new job.

python --project word_count --build --job_upload --job_submit_cluster_full
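The flags in the command above could be parsed along the following lines. This is a minimal sketch, not the repository's launcher: only the four flag names come from the example command, while the function names, help texts, and step labels are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Parser for the four flags that appear in the example command."""
    parser = argparse.ArgumentParser(description="Build and launch a project")
    parser.add_argument("--project", required=True,
                        help="project to operate on, e.g. word_count")
    parser.add_argument("--build", action="store_true",
                        help="build the project jar")
    parser.add_argument("--job_upload", action="store_true",
                        help="upload the built jar")
    parser.add_argument("--job_submit_cluster_full", action="store_true",
                        help="start a cluster and submit the job")
    return parser


def planned_steps(args):
    """Return the requested steps in the order they would run (labels are hypothetical)."""
    steps = []
    if args.build:
        steps.append("build")
    if args.job_upload:
        steps.append("upload")
    if args.job_submit_cluster_full:
        steps.append("start cluster and submit")
    return steps


# Parse an explicit argument list so the sketch runs without a command line.
example_args = build_parser().parse_args(
    ["--project", "word_count", "--build", "--job_upload",
     "--job_submit_cluster_full"]
)
print(example_args.project, planned_steps(example_args))
```

Keeping each stage behind its own flag means a single stage, such as the build, can be rerun without starting a new cluster.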


The first project is for natural language processing using Spark and CoreNLP on the 2015 Reddit comments corpus. More information on this is available in this blog post.

The Spark and CoreNLP code is in the nlp project. As it is a hefty processing step, CoreNLP is run first, with its training data added to the jar uploaded to EMR. This step performs tokenization, tagging, stemming, and named entity recognition. Input and output S3 folders are listed in the project configuration. For local testing, the file RC_2015-05 can be used or swapped for a file with more data.
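For local testing it can help to look at a few records before running the full job. The sketch below assumes the usual layout of the Reddit comment dumps: one JSON object per line, with `body` and `subreddit` fields. The function name and the choice to skip `[deleted]` comments are illustrative assumptions, not part of the nlp project.

```python
import json


def iter_comments(lines):
    """Yield (subreddit, body) pairs from newline-delimited JSON comment records."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        body = record.get("body", "")
        # Assumption: empty and deleted comments carry no text worth processing.
        if body in ("", "[deleted]"):
            continue
        yield record.get("subreddit", ""), body


# Two made-up records standing in for lines read from a file such as RC_2015-05.
sample_lines = [
    '{"subreddit": "technology", "body": "Google released a new phone."}',
    '{"subreddit": "news", "body": "[deleted]"}',
]
for subreddit, body in iter_comments(sample_lines):
    print(subreddit, "|", body)
```

To read from disk, pass an open file handle instead of the list, since iterating a text file also yields one line at a time.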

The next step in the machine learning pipeline is the word_count project. It does simple text parsing and outputs JSON containing the organization, subreddit, count, direct adjectives, and connected adjectives, for display in react-tabulator. A conversion script turns the Spark output into the table's input JSON.
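The conversion from Spark output to table input might look roughly like this. The actual output layout of the word_count job is not described here, so the sketch assumes one tab-separated line per row with the five fields in the order listed above; the field keys and the function name are hypothetical.

```python
import json

# Hypothetical column keys, in the order the fields are listed in the text.
FIELDS = ("organization", "subreddit", "count",
          "direct_adjectives", "connected_adjectives")


def rows_to_table_json(lines):
    """Convert tab-separated lines (assumed format) into a JSON array of row objects."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the assumed layout
        row = dict(zip(FIELDS, parts))
        row["count"] = int(row["count"])
        rows.append(row)
    # Most frequent rows first, a common default ordering for a table.
    rows.sort(key=lambda r: r["count"], reverse=True)
    return json.dumps(rows, indent=2)


# Made-up rows for illustration only.
sample = [
    "Google\ttechnology\t12\tnew,big\tfast",
    "NASA\tspace\t30\tgreat\tambitious",
]
print(rows_to_table_json(sample))
```

An array of flat objects is used because a table component can take such an array as its data and map each key to a column.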


The next project is for Word2Vec from Spark on the Reddit comment corpus. More information on this is available in this blog post.

This project continues the pipeline from word_count: it runs Spark's Word2Vec and outputs the organization, subreddit, count, and similarities. Each similarity entry lists a word together with its cosine similarity score. A conversion script can also be used to create the output JSON for react-tabulator.
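To make the similarity metric concrete, here is a small plain-Python sketch of cosine similarity and a nearest-word lookup. It is not Spark's implementation, and the two-dimensional vectors are made-up values standing in for learned word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, top_n=3):
    """Return up to top_n (word, similarity) pairs, most similar first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors for illustration; real Word2Vec vectors have many more dimensions.
toy_vectors = {
    "google": [1.0, 0.0],
    "search": [0.9, 0.1],
    "banana": [0.0, 1.0],
}
print(most_similar("google", toy_vectors, top_n=2))
```

Scores close to 1 indicate vectors pointing in nearly the same direction, which is why the output pairs each word with its score.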