Yes, this is a bit messy, and it could get a whole lot cleaner if we used it all the time. It's basically a one-off, so it's not the cleanest bit of scripting you've ever seen. I'd probably build the whole thing in Python/Pandas if I had to do it again, but csvkit and bash get the job done. Searching is significantly slower than it would be in Pandas, however.
- Jupyter Notebook (`pip install jupyter` if you're already using Python). Highly recommend using virtual environments.
- Pandas (`pip install pandas`)
- markegge's get-comments-with-api notebook
- csvkit (`pip install csvkit`)
- jot (included in MacOS, must compile from source on other platforms. Alternately, use another random number generator in line 19 of generate-random.sh.)
- GNU core utilities (included in Linux, must install on MacOS using `brew install coreutils`)
- Run get-comments-with-api from Jupyter Notebook to download the full comment set. (Alternately, export the notebook to a .py file and run that from the command line.) Note that you need an API key from data.gov to download all the comments.
- Copy comments.csv into your working directory.
- `sh match-random.sh` to clean comments.csv and pick 1000 random comments from it.
- `sh search-comments.sh utah-residents.txt` to find possible comments from Utah residents (output is in utah-residents.csv).
- `sh random-from-search.sh 1000 utah-residents.csv` to pick 1000 random comments.
- Import `utah-residents-random.csv` into a spreadsheet (we used Google Docs for simultaneous editing) and code each comment by hand.
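For anyone who would rather do the random-pick step in Python, here is a minimal sketch using only the standard library. It is not what the shell scripts do (they rely on jot for random numbers). The function name `sample_rows` and the seed value are illustrative assumptions, and the demo rows are made up.

```python
import csv
import io
import random


def sample_rows(csv_text, n, seed=None):
    """Return the header plus up to n randomly chosen data rows from CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    rng = random.Random(seed)  # passing a seed makes the pick repeatable
    # sample() picks without replacement; cap n at the number of rows available
    picked = rng.sample(rows, min(n, len(rows)))
    return header, picked


if __name__ == "__main__":
    # Hypothetical stand-in for a cleaned comment file
    demo = 'id,comment\n1,I live in Utah\n2,Keep it public\n3,"Moab, Utah 84532"\n'
    header, picked = sample_rows(demo, 2, seed=42)
    print(header)
    for row in picked:
        print(row)
```

To run it against a real file, read the file's contents with `open("utah-residents.csv", newline="").read()` and pass that string in as `csv_text`.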
If you just want to search the comment set for a bunch of terms, first generate a clean copy of it (clean.csv):
csvclean -l comments.csv && mv comments_out.csv clean.csv
Then put your search terms into a .txt file, one term per line. (csvgrep uses regex, so terms like `liv(e|ed|ing) in utah` will find people who live, lived, or are living in Utah. `, utah (\d*)` finds digits (like a zip code) after comma-space-utah.)
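To see how terms like these behave before running a full search, here is a small Python sketch that tries each term against a few made-up comments. It assumes case-insensitive matching, which may not be how search-comments.sh calls csvgrep. The helpers `load_terms` and `find_matches` and the sample comments are hypothetical.

```python
import re


def load_terms(text):
    """Turn the contents of a terms file (one regex per line) into compiled patterns."""
    return [
        re.compile(line.strip(), re.IGNORECASE)  # assumption: ignore case
        for line in text.splitlines()
        if line.strip()  # skip blank lines
    ]


def find_matches(comments, patterns):
    """Return the comments that match at least one pattern."""
    return [c for c in comments if any(p.search(c) for p in patterns)]


# The two example terms from the text, one per line as in the .txt file
terms_text = r"""liv(e|ed|ing) in utah
, utah (\d*)
"""

# Made-up comments for illustration only
comments = [
    "I lived in Utah for twenty years.",
    "Sincerely, Jane Doe, Moab, Utah 84532",
    "I have visited Utah twice.",
]

matches = find_matches(comments, load_terms(terms_text))

if __name__ == "__main__":
    for comment in matches:
        print(comment)
```

Note that `\d*` also matches zero digits, so the second term catches any "comma-space-utah-space" even when no zip code follows. Use `\d+` or `\d{5}` if you only want hits with digits.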
Run `sh search-comments.sh [myfile.txt]` to search clean.csv for all the terms in your text file. Output will be in
This all works for me on MacOS Sierra. It should work fine on Linux, but in line 19 of generate-random.sh, you'll need to change