# `tweetharvest`: Example Analysis

This example notebook demonstrates how to connect to a database of tweets collected using [`tweetharvest`](https://github.com/ggData/tweetharvest). It presupposes that all [the setup instructions](https://github.com/ggData/tweetharvest/blob/master/README.md) in that repository's README file have been completed and that the MongoDB server is running as described there. We start by importing the [PyMongo package](http://api.mongodb.org/python/current/index.html), the official Python package for accessing MongoDB databases.

In [1]:
import pymongo

Next we establish a link with the database. We know that the database created by `tweetharvest` is called `tweets_db` and that within it is a collection of tweets named after the project, in this example `emotweets`.

In [2]:
db = pymongo.MongoClient().tweets_db
coll = db.emotweets
coll

Collection(Database(MongoClient('localhost', 27017), u'tweets_db'), u'emotweets')

We now have an object, `coll`, that offers full access to the MongoDB API, which we can use to analyse the data in the collected tweets. For instance, in our small example collection, we can count the number of tweets:

In [3]:
coll.count()

1091

Or we can count the number of tweets that are geolocated, that is, tweets with a field containing the latitude and longitude of the user when they sent the tweet. We construct a MongoDB query that looks for a non-empty field called `coordinates`.

In [4]:
query = {'coordinates': {'$ne': None}}
coll.find(query).count()

33

Or how many tweets had the hashtag `#happy` in them?

In [5]:
query = {'hashtags': {'$in': ['happy']}}
coll.find(query).count()

501
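The two queries above have a regular shape, so building them can be wrapped in small helpers. The following is a minimal sketch: the names `geolocated_query` and `hashtag_query` are hypothetical and are not part of `tweetharvest` or PyMongo. The helpers only construct query dictionaries, so they run without a database connection.

```python
def geolocated_query():
    """Query for tweets whose `coordinates` field is present and not null."""
    return {'coordinates': {'$ne': None}}


def hashtag_query(*tags):
    """Query for tweets carrying at least one of the given hashtags."""
    # Strip a leading '#' so that 'happy' and '#happy' behave the same;
    # the query shown above uses the tag text without the '#'.
    cleaned = [tag.lstrip('#') for tag in tags]
    return {'hashtags': {'$in': cleaned}}


query = hashtag_query('#happy')
print(query)
```

With a live collection, the resulting dictionary can be passed to `coll.find(query)` exactly as in the cells above. If your installed PyMongo is version 3.7 or later, `coll.count_documents(query)` is the preferred way to count, because `count()` was deprecated in 3.7 and removed in 4.0.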

## Pre-requisites for Analysis

In order to perform these analyses there are a few things one needs to know:

1. At the risk of stating the obvious: how to code in [Python](http://www.python.org) (there is also [an excellent tutorial](https://docs.python.org/2/tutorial/)). Please note that the current version of `tweetharvest` uses Python 2.7, and not Python 3.
2. How to perform MongoDB queries, including aggregation, counting, and grouping of subsets of data. There is an effective short introduction ([The Little Book on MongoDB](http://openmymind.net/mongodb.pdf) by Karl Seguin), as well as [extremely rich documentation](http://docs.mongodb.org/manual/reference/) on the parent website.
3. [How to use PyMongo](http://api.mongodb.org/python/current/) to interface with the MongoDB API.
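As an illustration of the aggregation and grouping mentioned in item 2, here is a sketch of a pipeline that counts tweets per hashtag, using the `hashtags` field queried earlier. The pipeline stages (`$unwind`, `$group`, `$sort`) are standard MongoDB operators. The helper `count_hashtags` and the sample documents are made up for this example; the helper roughly mirrors the pipeline in pure Python so the idea can be tried without a running server.

```python
from collections import Counter

# Aggregation pipeline: one output document per hashtag, most frequent first.
pipeline = [
    {'$unwind': '$hashtags'},                                # one document per tag
    {'$group': {'_id': '$hashtags', 'count': {'$sum': 1}}},  # tally each tag
    {'$sort': {'count': -1}},                                # highest count first
]


def count_hashtags(tweets):
    """Pure-Python counterpart of the pipeline, for already-fetched tweets."""
    counts = Counter()
    for tweet in tweets:
        # Tweets without hashtags contribute nothing, as with $unwind.
        counts.update(tweet.get('hashtags', []))
    return counts.most_common()


# Made-up sample documents, only to exercise the function.
sample = [
    {'hashtags': ['happy']},
    {'hashtags': ['happy', 'sad']},
    {'hashtags': []},
]
print(count_hashtags(sample))
```

Against a live collection, the pipeline would be passed to `coll.aggregate(pipeline)`, which performs the grouping inside MongoDB rather than in Python.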

Apart from these skills, one needs to know how each status is stored in the database. Here is an easy way to look at the data structure of one tweet.

In [6]:
coll.find_one()

{u'_id': 608980543552794624L,
 u'contributors': None,
 u'coordinates': None,
 u'created_at': datetime.datetime(2015, 6, 11, 12, 54, 10),
 u'entities': {u'hashtags': [{u'indices': [109, 113], u'text': u'sad'}],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': [{u'id': 330185970,
    u'id_str': u'330185970',
    u'indices': [0, 10],
    u'name': u'Pause Pub',
    u'screen_name': u'pause_pub'}]},
 u'favorite_count': 0,
 u'favorited': False,
 u'geo': None,
 u'hashtags': [u'sad'],
 u'id_str': u'608980543552794624',
 u'in_reply_to_screen_name': u'pause_pub',
 u'in_reply_to_status_id': 608977999069872128L,
 u'in_reply_to_status_id_str': u'608977999069872128',
 u'in_reply_to_user_id': 330185970,
 u'in_reply_to_user_id_str': u'330185970',
 u'is_quote_status': False,
 u'lang': u'fr',
 u'metadata': {u'iso_language_code': u'fr', u'result_type': u'recent'},
 u'place': None,
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twit

This JSON data structure is [documented on the Twitter API website](https://dev.twitter.com/overview/api/tweets), where each field is described in detail. Studying that description is recommended in order to understand how to construct valid queries.

`tweetharvest` is faithful to the core structure of the tweets as described in that documentation, but with minor differences created for convenience:

1. All date fields are stored as MongoDB `Date` objects and returned as Python `datetime` objects. This makes it easier to work on date ranges, sort by date, and do other date and time related manipulation.
2. A `hashtags` field is created for convenience. It contains a simple array of all the hashtags in a particular tweet and can be queried directly, instead of extracting each tag's `text` from the list of dictionaries nested under `entities.hashtags`. It is included for ease of querying but may be ignored if one prefers.
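Both conveniences can be illustrated with a short sketch. The helper names `date_range_query` and `flatten_hashtags` are hypothetical, and the tweet is a trimmed copy of the document shown earlier. Whether `tweetharvest` derives the `hashtags` field in exactly this way is an assumption; the code only shows how the flat list relates to the nested `entities` structure.

```python
from datetime import datetime


def date_range_query(start, end):
    """Query for tweets created in the half-open interval [start, end)."""
    return {'created_at': {'$gte': start, '$lt': end}}


def flatten_hashtags(tweet):
    """Rebuild a flat list of tag texts from the nested `entities` field."""
    return [tag['text'] for tag in tweet['entities']['hashtags']]


# Trimmed copy of the tweet shown above, keeping only the relevant fields.
tweet = {
    'created_at': datetime(2015, 6, 11, 12, 54, 10),
    'entities': {'hashtags': [{'indices': [109, 113], 'text': 'sad'}]},
    'hashtags': ['sad'],
}

query = date_range_query(datetime(2015, 6, 11), datetime(2015, 6, 12))

# Check locally that the sample tweet falls inside the queried day.
bounds = query['created_at']
in_range = bounds['$gte'] <= tweet['created_at'] < bounds['$lt']
print(in_range)

# The convenience field matches what the nested structure yields.
print(flatten_hashtags(tweet) == tweet['hashtags'])
```

Because the dates are stored as real date objects, a query like this can be combined with sorting on a live collection, for example `coll.find(query).sort('created_at', pymongo.ASCENDING)`.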

## Next Steps

This notebook establishes how you can connect to the database of tweets that you have harvested and how you can use the power of Python and MongoDB to access and analyse your collections. Good luck!