tweetharvest is a Python utility to monitor Twitter conversations around a small set of hashtags, and to store statuses (tweets) from that stream to a MongoDB database. The intended use case: collecting tweets from discussions around a given event or campaign, and storing them locally for later analysis.
tweetharvest does not contain any analytic functions; it aims to do one thing well: data collection from the Twitter API.
The program has been developed on Python 2.7 on Mac OSX. It has run successfully on Windows 7 and on Ubuntu.
The setup process assumes Python 2.7 is installed on the system you are using. Further installation requires:
- Installation of MongoDB
- Cloning this repository
- Starting up the MongoDB server
- Installation of selected Python libraries
- Creation of a Twitter App and authorisation of the App on the harvesting machine
- Selection of hashtags to be monitored
- Running a harvest session
These steps will be described in detail.
Download the appropriate MongoDB installer for your system and follow the instructions to set it up on Linux (installation instructions vary by distro; see the relevant download page), Windows, or Mac OSX.
Note: Do not start up the server just yet, even if that is part of the installation instructions.
Clone this repository
You can download this repository as a zip file or clone it:
$ git clone https://github.com/ggData/tweetharvest
After unpacking the zip archive or cloning, cd into the tweetharvest folder:
$ cd tweetharvest
Start the MongoDB Server
In order to start storing statuses, we now start up the MongoDB server, serving data out of the data directory in the tweetharvest folder:
$ mongod --dbpath ./data
MongoDB starts up, reserves disk space, and creates blank journal files, all ready to start receiving tweets for storage.
Note: if at any time you want to stop the MongoDB server, go to the console window where it is running and press Control-C.
Leave the MongoDB server running in this window and open a new terminal/console window.
cd to the tweetharvest directory:
$ cd path/to/tweetharvest
Installation of Python Packages
The harvest program requires three external Python packages, which now need to be installed.
$ pip install delorean
PyMongo "is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python". Installation instructions are provided here, but again in most cases, all we need is:
$ pip install pymongo
Finally we need the twitter package, a "minimalist Twitter API for Python". The installation instructions suggest you should use setuptools to install the package but again, I found this to be sufficient:
$ pip install twitter
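Before moving on, it can be useful to confirm that the three packages are importable from the Python interpreter you intend to use. The following is a minimal sketch, not part of tweetharvest; the helper name missing_packages is hypothetical, and it assumes the packages import under the same names used with pip above (delorean, pymongo, twitter).

```python
import importlib

# Import names assumed to match the pip package names used above.
REQUIRED = ("delorean", "pymongo", "twitter")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be imported in this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Package is not installed (or not visible to this interpreter).
            missing.append(name)
    return missing
```

Calling missing_packages() returns an empty list when everything is installed; any name it returns still needs a pip install.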
Creation of a Twitter App
In order to harvest statuses from the Twitter stream, you need to have a Twitter account and to create a "Twitter App". Both are free and easy to create.
- Sign in to Twitter: You probably already have a Twitter account. If not, head over to the home page and sign up.
- Create an app: Now go to the application management dashboard and hit the Create New App button.
When creating an app, provide the app name (e.g. 'Happy Harvester'), a description (e.g. 'An app to collect tweets with emotional hashtags like #happy'), and a website (if you do not have a website, you may use a placeholder such as http://www.example.com and update this later when you do set up a website). Do not fill in the Callback URL field. Accept the Developer Agreement, and click the button to Create your Twitter application.
Authorisation of the App
If the application creation process succeeded, you will be taken to its home screen. Switch to the tab Keys and Access Tokens. Make a note of the Consumer Key and the Consumer Secret that you see there (for example, copy and paste them to a text editor). These should never be shared with anyone, as they represent your app's credentials with Twitter: anyone who has access to them could use them to masquerade as your app.
You will now need to insert these credentials into the program. Navigate to the lib/secret folder:
$ cd path/to/tweetharvest/lib/secret
Copy the file called twitconfig.py.bak to twitconfig.py:
$ cp twitconfig.py.bak twitconfig.py
Open twitconfig.py and insert your consumer key and consumer secret in between the quotation marks on lines 6 and 7. They look like this when you first open the file:
CONSUMER_KEY = 'InsertYourConsumerKeyHere'
CONSUMER_SECRET = 'InsertYourConsumerSecretHere'
Make sure you preserve the quotes when you paste your tokens. The end result should look something like this:
CONSUMER_KEY = 'Tte0jQJPFUph6hX66h8Rai6g5'
CONSUMER_SECRET = 'XpQ2AvcEYhMkyXTwMkOT9tQAtddB7UusbHFon0BS5JeHkEliB0'
Check Authorization is Working
You now need to check that the authorization is working. Navigate back to the root folder and run the auth.py script:
$ cd path/to/tweetharvest
$ python auth.py
If all is working, you should get the following output:
App is authorised, congratulations!
If you have made a mistake in the above process, you will get something like the following (with a printout of the detailed error received):
Unable to authorise app. Full report follows.
Hopefully you will have been successful and now have authorized your harvester to collect statuses from Twitter. If there has been an error, the most likely issue is that the credentials were entered incorrectly or that a network connection failed. Try to troubleshoot and use the normal discussion fora to check on solutions. Consider submitting an issue here if you think this is a general problem or a bug in the program.
Selection of hashtags to be monitored
By now, we have a MongoDB server running and ready to receive tweets. We have an app that has been authorized to collect statuses from the Twitter API. All we need is to select what hashtags we want to monitor.
This part of the process is best illustrated by example. Let us imagine we are interested in monitoring expressions of two emotions and we decide to monitor two hashtags: #happy and #sad. We shall call our project emotweets. This information is all we need to configure our app:
- Create a file called tags_emotweets.txt in the tweetharvest root folder, beside main.py. (Note: for any project called projectname, our program expects to find a file called tags_projectname.txt in the root folder).
- We insert each of the hashtags that we want to monitor on a separate line in this file. An example file is provided as a template. (In the example project, we insert the words happy and sad onto two lines and save the file).
This is all we need to run the emotweets harvest! In the case of this example, the tags_emotweets.txt configuration file is provided as a model for your own projects. There should be one such tags_xxx.txt file per project (where xxx stands for projectname). Please also note that you can only monitor one project at a time (one client per IP address as stipulated by Twitter API terms of service).
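To make the file convention concrete, here is a minimal sketch of how a tags file of this shape could be located and parsed. It is an illustration only and not the code in main.py; the function names tags_filename, parse_tags, and read_tags are hypothetical, and tolerating a leading '#' or blank lines is an assumption rather than documented behaviour.

```python
import io
import os


def tags_filename(project):
    """Build the expected file name, e.g. 'emotweets' -> 'tags_emotweets.txt'."""
    return "tags_{0}.txt".format(project)


def parse_tags(text):
    """Return one hashtag per non-blank line, without any leading '#'."""
    tags = []
    for line in text.splitlines():
        tag = line.strip().lstrip("#")
        if tag:
            tags.append(tag)
    return tags


def read_tags(project, root="."):
    """Read and parse tags_<project>.txt from the given root folder."""
    path = os.path.join(root, tags_filename(project))
    with io.open(path, encoding="utf-8") as handle:
        return parse_tags(handle.read())
```

With the example project, read_tags("emotweets") run from the tweetharvest root folder would yield the two words stored in the file.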
Running a harvest session
It is assumed that you have MongoDB running in the background. If you have done this setup process in one session, it should still be running. If not, then go to the section 'Start the MongoDB Server' above and start it up...
Navigate to the root directory again and run the main.py script, giving it the project name as an argument:
$ cd path/to/tweetharvest
$ python main.py emotweets
If successful, you should now start getting outputs of this sort:
sad -1
happy -1
100 / 100 #sad
100 / 100 #happy
100 / 100 #sad
100 / 100 #happy
99 / 100 #sad
96 / 100 #happy
These lines appear about 3 seconds apart, which keeps us within Twitter's rate-limiting policies. The lines tell us that:
- the program is working and actively collecting tweets
- the initial lines report the hashtags that we are monitoring and the id of the most recent tweet for that hashtag in our database. If there are no tweets yet, we get -1, as in this instance.
- every few seconds, our program retrieves up to 100 tweets from Twitter for a given hashtag. It reports how many of these tweets are new in our database. The last line in our example output says that we retrieved 100 tweets with the hashtag #happy but only 96 of them were new.
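The bookkeeping behind these report lines can be sketched as follows. This is an in-memory illustration under stated assumptions, not the program's actual implementation: tweetharvest keeps its tweets in MongoDB, whereas here a plain Python set of ids stands in for the database, and the function names latest_id, count_new, and report are hypothetical. Each status is assumed to be a dictionary with an "id" key.

```python
def latest_id(seen_ids):
    """Return the highest stored tweet id, or -1 when nothing is stored yet."""
    return max(seen_ids) if seen_ids else -1


def count_new(statuses, seen_ids):
    """Record unseen statuses and return (new_count, batch_size)."""
    new = 0
    for status in statuses:
        if status["id"] not in seen_ids:
            seen_ids.add(status["id"])  # stand-in for a database insert
            new += 1
    return new, len(statuses)


def report(tag, new, total):
    """Format one progress line in the style shown in the output above."""
    return "{0} / {1} #{2}".format(new, total, tag)
```

A batch in which some ids are already stored yields a first number smaller than the second, which is how a line such as 96 / 100 arises.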
Inside the MongoDB datastore, a database has been created called tweets_db and our tweets are being stored there in a collection that bears the same name as the projectname we used as argument to main.py. In the case of this example, the collection is emotweets.
Stopping and Starting
At this stage the console reports that tweetharvest is merrily downloading emotional tweets for us. Eventually a number of things may happen:
- We may decide to stop the harvest. This can be done by pressing Control-C at any time and Python will exit the harvest. One reason we may want to stop is that we are no longer getting fresh tweets in the sets we download (for instance, we start to see 0 / 100 #happy repeatedly in the output). It may be good practice to stop harvesting and return to it later. Generally Twitter will let us access the tweets from the past two weeks, so we may want to run our harvest for a couple of hours every day and this will usually be sufficient for a complete collection.
- We may notice that the program has stopped with an error report. There is no effort to compensate for these in the program, as there are many errors possible (Twitter may temporarily be down, your network may temporarily be down, etc). It is a design decision to allow these to crash the program, giving a natural break. The harvest can simply be restarted by hand at a convenient time.
Kick-Starting Your Analysis
After a few hours, we will have accumulated an extensive collection of tweets in the Mongo database. They will be located inside a database called tweets_db, in a collection named emotweets (or whatever is our projectname). We can now analyze the data using any tools we prefer.
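As a starting point, the collection can be inspected from Python with PyMongo. The sketch below is an assumption-laden illustration, not part of tweetharvest: the helper names open_collection and summarise are hypothetical, the server is assumed to be listening on PyMongo's default localhost address, count_documents requires a reasonably recent PyMongo (3.7 or later), and each stored document is assumed to carry the tweet body in a "text" field.

```python
def open_collection(project, db_name="tweets_db"):
    """Connect to the local MongoDB server and return the project collection."""
    # Imported here so the helpers below can be used without a live server.
    from pymongo import MongoClient

    client = MongoClient()  # assumption: default host and port
    return client[db_name][project]


def summarise(collection, limit=3):
    """Return (total_documents, first_few_texts) for a PyMongo-style collection."""
    total = collection.count_documents({})
    # Project only the "text" field (assumed to hold the tweet body).
    cursor = collection.find({}, {"text": 1}).limit(limit)
    sample = [doc.get("text", "") for doc in cursor]
    return total, sample
```

With the MongoDB server running, summarise(open_collection("emotweets")) gives a quick count and a few sample texts before moving on to a fuller analysis.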
Note: The data for the example notebook is not provided so you can read it but not run it immediately. If you desire to run it locally, you must first run the emotweets project using the provided tags file.