Your data is too big for excel? Command line tools for journalists
Oh no why?
What is this terminal thing you ask? And why do we want to use it?
- What if the CSV file we want to open is too big for your spreadsheet program?
- Feel like having a quick look at what the data is without opening the file and needing to take a smoke break?
- An inexplicable reason your file does not open in your chosen spreadsheet program...
- There is power here. Grab bits of the internet, manipulate them quickly, and pop them into files or databases without needing to learn the complexities of coding.
The terminal is a text based interface that allows you to run text commands and interact with your file system.
There are alternate names for the terminal including; console, shell, command line, command prompt.
The following tutorial is tailored to the Mac and Linux terminals.
You will need to open up the terminal;
- On a Mac or Linux machine just use the standard terminal program.
- If you have a windows machine please install Cygwin (a Unix terminal emulator) in order use the same commands as this tutorial - Cygwin (Unix emulator) - https://cygwin.com/install.html
For those on Linux or Mac go right ahead and install csvkit for the last part of the tutorial:
$ easy_install pip $ pip install csvkit
What are those commands? Package management
Quick note easy_install, pip, and brew are all package managers that allow you to install what we need from the command line. Rather than downloading and trying to install and configure them yourself.
- easy_install is a package manager to install Python packages, however it isn't a fully fledged one. Pip however is, so first we install easy_install and then pip.
- pip is another Python package manager. A lot of these packages are on the Python Package Index (PyPI)
When you have trouble installing
There could be a number of issues;
With dependencies, you should 'oogle the specific error. Perhaps you already have somethings installed, perhaps you need to upgrade, perhaps your operating system has a specific issue. Stackoverflow is a good place to get answers.
Permissions can be a bit more confusing. All the commands we have shown so far are run with permissions that you (the default user) have. Sometimes this isn't enough and you need to run with bigger, grander permissions. Like a super hero, or in this case a super user (also known as root).
$ sudo <your command>
Be careful though, there are some ramifications to doing this.
TODO: explain this
##Getting all the data There have been a lot of tutorials on scraping, getting data out of documents and freedom of information requests.
For this exercise I grabbed all the tweets relating to the #dataharvest16 twitter handle and popped them into a CSV file using a tool called T. Which has it's own tutorial we don't have time for
$ t search all 'EIJC16' --csv -n 3200 > Dataharvest16tweets.csv
This is how I'll write commands in this tutorial. These are the commands we will pop into the terminal in order to do stuff, like processing our data.
Don't worry if you don't have a CSV, you can convert to one
in2csv input_file.xls output_file.csv
When you open your terminal you'll be somewhere - but where is that?
Will show you the path that you are in right now.
TODO: mkdir, cd, navigation
List the shizzle in your directory. Command first and options after. List. All. With readable file sizes.
$ ls -alk
Notice that there is your username or perhaps other usernames (like root) listed. These are the owners of the files, and the letters on the very left are permissions (i.e -rw-r--r--).
Don't know what a command does?
$ man ls
How will you get the data?
You will use wget to the file with out data (or whole sites)
$ wget https://github.com/arc64/dataharvest-2016-commandline/blob/master/dataharvest16tweets.csv
You may find you don't have wget installed. To install it, the easiest way is with another package manager called Homebrew. And then run
$ brew install wget
If you don't feel like installing the fabulous tool wget, just grab the file (click raw in github), and put it into the directory where you want to use it.
So what does the file look like?
Count the words
wc -w dataharvest16tweets.csv 3739 dataharvest16tweets.csv
wc -l README.md
In a CSV though we want to count the number of rows
csvstat --count dataharvest16tweets.csv Row count: 197
Looking at the beginning and end of the file
And searching for specific text
grep "@nrecherch" dataharvest16tweets.csv
Return only the top 10
grep "@nrecherch" dataharvest16tweets.csv | head
Get all the headers for the file
$ csvcut -n dataharvest16tweets.csv 1: ID 2: Posted at 3: Screen name 4: Text
Grab some columns
$ csvcut -c 3, 4 dataharvest16tweets.csv
Print it pretty
head dataharvest16tweets.csv | csvcut -c 3,4 dataharvest16tweets.csv | csvlook
Get some stats about your tweeters
csvcut -c 3 dataharvest16tweets.csv | csvstat
csvgrep -c 3 -m "Hackette7" | csvlook
$ csvcut -c 3 dataharvest16tweets.csv
Getting all the unique handles
$ csvcut -c 'Screen name' dataharvest16tweets.csv | sort | uniq -c | sort -rn
Getting just tweets by Hackette7
csvgrep -c 3 -m "Hackette7" dataharvest16tweets.csv | csvlook
What about just getting the RT
csvgrep -c 4 -r "^RT" dataharvest16tweets.csv
You can run sql queries on it, which if your file is very big will run slow (it is in memory!)
csvsql --query "SELECT * FROM dataharvest16tweets" dataharvest16tweets.csv > dataharvest16tweets_nrecherch.csv
If you really want to you can push this all to a sqllite database
$ cat dataharvest16tweets.csv | csvsql --table tweets --db sqlite:///tweets.sqlite --insert $ sqlite3 tweets.sqlite sqlite> SELECT * FROM tweets; sqlite> .exit
What can CSVkit do?
- do sql
- fuzzy matching
- pop into a database
- tab completion: use to finish command, directory or file names, or to see directories or files that match
- ctrl + a: move to the beginning of a line
- ctrl + e: move to the end of a line
- ctrl + c: terminate a command that is running
- up arrow: cycle back through previous commands
UNIX features you should know
Commands you have have heard in this tutorial
in the beginning was the command line
wget http://www.cryptonomicon.com/command.zip ; unzip command.zip ; nano command.txt
With specific thanks to these tutorials and authors for their inspiraton