A final project for CS 108: Ethics of Intelligent Systems at Harvard.



  1. Get Twitter API keys.
    • Create a file called APIKeys.json and store your API keys in it. You can use APIkeyexample.txt as a reference.
    • Note that APIKeys.json will not be pushed to git unless you change the .gitignore.
  2. Generate tweets for a user or set of users
    • Navigate to the src directory
    • Run python main.py --names <NAME1> <NAME2> ..., replacing each <NAMEi> with a Twitter handle.
    • The code will pull tweets and save them to the data directory
    • This will also print generated tweets to the console
  3. Determine sentence similarity
    • Navigate to the src directory
    • Run python model_test.py <tweet_file> <K>, where <tweet_file> is the relative path to a file in the data folder (for example, ../data/Harvard.csv) and <K> is the size of each K-mer. K must be at least 2.
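
To make the K parameter concrete, here is a minimal sketch of K-mer extraction. It assumes a K-mer is a run of K consecutive whitespace-separated words, which is suggested by the "current K-1 words" wording in the file descriptions. The function name extract_kmers is hypothetical and is not taken from model_generator.py.

```python
def extract_kmers(text, k):
    """Return every run of k consecutive words in text as a tuple.

    Hypothetical helper: words are split on whitespace, which may differ
    from the tokenization the project actually uses.
    """
    if k < 2:
        # With k = 1 there would be no preceding words to predict from.
        raise ValueError("K must be at least 2")
    words = text.split()
    return [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]


if __name__ == "__main__":
    tweet = "the quick brown fox jumps"
    print(extract_kmers(tweet, 2))
    print(extract_kmers(tweet, 3))
```

A larger K gives each prediction more context, but fewer K-mers repeat across tweets, so generated sentences may track the originals more closely.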

Important files

  • main.py: Contains code to generate sentences given a list of Twitter handles at the command line.
  • model_generator.py: Contains functions to generate the Markov model for a user. This includes getting tweets from a file, extracting K-mers, forming the model, and determining next words given the current K-1 words.
  • model_test.py: Contains functions to generate sentences from a model and test their similarity to the original tweets. Note that when run as the driver program, this file will default to determining sentence accuracy.
  • twitter_extractor.py: Contains functions to connect to the Twitter API and extract tweets for one or more users.
  • comparison.py: Contains functions to compare words/sentences for quantitative analysis.
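
The Markov model described for model_generator.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the names build_model, next_word, and generate are hypothetical, words are split on whitespace, and the next word is chosen uniformly from the words observed after the current K-1 words.

```python
import random
from collections import defaultdict


def build_model(tweets, k):
    """Map each (k-1)-word context to the list of words seen after it."""
    model = defaultdict(list)
    for tweet in tweets:
        words = tweet.split()
        for i in range(len(words) - k + 1):
            context = tuple(words[i:i + k - 1])
            model[context].append(words[i + k - 1])
    return model


def next_word(model, context, rng=random):
    """Pick a follower of context at random, or None if it was never seen."""
    candidates = model.get(tuple(context))
    if not candidates:
        return None
    # Repeated followers appear multiple times, so frequent words are likelier.
    return rng.choice(candidates)


def generate(model, seed, max_words=20, rng=random):
    """Extend the seed context word by word until no follower exists."""
    words = list(seed)
    context_len = len(seed)
    while len(words) < max_words:
        word = next_word(model, words[-context_len:], rng)
        if word is None:
            break
        words.append(word)
    return " ".join(words)


if __name__ == "__main__":
    sample = ["the cat sat on the mat", "the cat ran to the door"]
    model = build_model(sample, 3)
    print(generate(model, ("the", "cat")))
```

Storing followers in a list keeps duplicates, so sampling reflects how often each word followed a given context in the source tweets.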


Dependencies: numpy, scipy, Tweepy, NLTK