## Introduction
We will explore the aggregation framework for some analysis and then look at how we could use it for data cleaning.

## Example of Aggregation Framework

Let's find out who tweeted the most:
- group tweets by user
- count each user's tweets
- sort into descending order
- select user at top

![](pipeline.png)

In [1]:
import pprint

def get_client():
    from pymongo import MongoClient
    return MongoClient('mongodb://localhost:27017/')

def get_collection():
    return get_client().examples.twitter

In [2]:
collection = get_collection()

In [9]:
def aggregate_and_show(collection, query, limit=True):
    # Copy the pipeline so the caller's list is not modified
    _query = query[:]
    if limit:
        # Only show the first five results
        _query.append({"$limit": 5})
    result = collection.aggregate(_query)
    pprint.pprint(list(result))

In [10]:
query = [
    {"$group": {"_id": "$user.screen_name",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}
]
aggregate_and_show(collection, query)

[{u'_id': u'behcolin', u'count': 8},
 {u'_id': u'JBTeenageDream', u'count': 7},
 {u'_id': u'mysterytrick', u'count': 7},
 {u'_id': u'juhbs', u'count': 6},
 {u'_id': u'officialjamesj', u'count': 6}]
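
To see what each stage contributes, the same four steps can be sketched in plain Python. This is only an illustration on a small made-up list: the `tweets` data and the `top_tweeters` helper are hypothetical and are not part of the dataset or of pymongo.

```python
from collections import Counter

# Hypothetical sample documents; only the field used by the pipeline is included.
tweets = [
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "carol"}},
    {"user": {"screen_name": "alice"}},
    {"user": {"screen_name": "bob"}},
]

def top_tweeters(docs, n=1):
    # $group + $sum: count one per tweet, keyed by screen name
    counts = Counter(doc["user"]["screen_name"] for doc in docs)
    # $sort (descending) + $limit: most_common orders by count, highest first
    return [{"_id": name, "count": count}
            for name, count in counts.most_common(n)]

print(top_tweeters(tweets))
```

With `n=1` this corresponds to "select user at top"; the notebook helper uses a limit of 5 instead.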


## Aggregation Operators
- `$project` - shapes documents, e.g. selects which fields to pass on
- `$match` - filters documents
- `$skip` - skips a given number of documents at the start
- `$limit` - passes on only the first n documents
- `$unwind` - for every element of the array field it is applied to, outputs a copy of the document in which the array is replaced by that single element. This is useful before grouping on array values.
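
`$skip` and `$limit` are often combined to page through results. A minimal sketch, assuming a hypothetical helper named `paged` (not part of pymongo) that appends both stages to an existing pipeline:

```python
def paged(pipeline, page, page_size=5):
    """Return a copy of `pipeline` that yields only the given zero-based page."""
    return list(pipeline) + [
        {"$skip": page * page_size},  # drop the documents of earlier pages
        {"$limit": page_size},        # keep at most one page of documents
    ]

top_tweeters = [
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

# Stages for results 6 to 10; this list could be passed to collection.aggregate()
second_page = paged(top_tweeters, page=1)
```

The order matters here: `$skip` has to come before `$limit`, otherwise the limit is applied first and the skip then discards part of the page.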

## Match operator
Who has the highest followers-to-friends ratio?

In [5]:
query = [
    {"$match": {"user.friends_count": {"$gt": 0},
                "user.followers_count": {"$gt": 0}}},
    {"$project": {"ratio": {"$divide": ["$user.followers_count", 
                                        "$user.friends_count"]},
                  "screen_name": "$user.screen_name"}},
    {"$sort": {"ratio": -1}}
]

aggregate_and_show(collection, query)

[{u'_id': ObjectId('57bdd40a6d9f21e6de8a9c1a'),
  u'ratio': 19221.5,
  u'screen_name': u'Twitterrific'},
 {u'_id': ObjectId('57bdd4096d9f21e6de8a90da'),
  u'ratio': 17124.0,
  u'screen_name': u'steve_berra'},
 {u'_id': ObjectId('57bdd4096d9f21e6de8a938a'),
  u'ratio': 16847.0,
  u'screen_name': u'2dopeboyz'},
 {u'_id': ObjectId('57bdd40b6d9f21e6de8ab9a3'),
  u'ratio': 13894.222222222223,
  u'screen_name': u'backstreetboys'},
 {u'_id': ObjectId('57bdd4076d9f21e6de8a3a2d'),
  u'ratio': 13155.666666666666,
  u'screen_name': u'tedouumdado'}]


`$match` uses the same query syntax that we use for read operations such as `find`.
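
As a sketch of that equivalence, the filter from the query above can be written once and reused in both places. The two helper functions are hypothetical names introduced for illustration; `collection` is assumed to be a pymongo collection such as the one returned by `get_collection()`.

```python
# The same filter document works for a plain read and for a $match stage.
has_friends_and_followers = {
    "user.friends_count": {"$gt": 0},
    "user.followers_count": {"$gt": 0},
}

def count_with_find(collection):
    # Read operation: the filter is passed straight to find()
    return sum(1 for _ in collection.find(has_friends_and_followers))

def count_with_match(collection):
    # Aggregation: the identical filter becomes the body of $match
    pipeline = [
        {"$match": has_friends_and_followers},
        {"$group": {"_id": None, "count": {"$sum": 1}}},
    ]
    result = list(collection.aggregate(pipeline))
    return result[0]["count"] if result else 0
```

Both functions should report the same number of tweets for the same collection.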

## Project operator
- include fields from the original document
- insert computed fields
- rename fields
- create fields that hold sub documents
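
The four uses listed above can be combined in a single stage. A minimal sketch, assuming each tweet has a top-level `text` field (an assumption about the dataset); the output field names `ratio`, `screen_name` and `counts` are chosen for illustration only.

```python
project_stage = {
    "$project": {
        "_id": 0,    # drop the _id field, which is included by default
        "text": 1,   # include a field from the original document
        # insert a computed field
        "ratio": {"$divide": ["$user.followers_count",
                              "$user.friends_count"]},
        # rename: expose the nested field under a top-level name
        "screen_name": "$user.screen_name",
        # create a field that holds a sub-document
        "counts": {
            "followers": "$user.followers_count",
            "friends": "$user.friends_count",
        },
    }
}

pipeline = [
    # Filter first so that $divide never sees a zero friends_count
    {"$match": {"user.friends_count": {"$gt": 0}}},
    project_stage,
]
```

The pipeline can be run with `aggregate_and_show(collection, pipeline)`.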

## Unwind operator
- an array has to be expanded into one document per element before its values can be grouped or counted
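
To make the behaviour concrete, here is a simplified plain-Python emulation of `$unwind`. It is a sketch, not MongoDB's implementation: the `unwind` function and the sample `tweet` are hypothetical, and documents whose field is missing, empty, or not an array are simply skipped.

```python
import copy

def unwind(docs, path):
    """Emulate $unwind for a dotted path such as "entities.user_mentions"."""
    keys = path.split(".")
    for doc in docs:
        # Walk down to the array; a missing key ends up as None
        value = doc
        for key in keys:
            value = value.get(key) if isinstance(value, dict) else None
        if not isinstance(value, list):
            continue
        for element in value:
            clone = copy.deepcopy(doc)
            parent = clone
            for key in keys[:-1]:
                parent = parent[key]
            parent[keys[-1]] = element  # the array is replaced by one element
            yield clone

tweet = {
    "user": {"screen_name": "alice"},
    "entities": {"user_mentions": [{"screen_name": "bob"},
                                   {"screen_name": "carol"}]},
}

for doc in unwind([tweet], "entities.user_mentions"):
    print(doc["entities"]["user_mentions"])
```

One tweet with two mentions becomes two documents, each carrying the full tweet plus a single mention, which is what makes the later `$group` count mentions rather than tweets.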

Let's find out who included the most user mentions:

In [6]:
query = [
    {"$unwind": "$entities.user_mentions"},
    {"$group": {"_id": "$user.screen_name",
                "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}
]

aggregate_and_show(collection, query)

[{u'_id': u'ThizBoySwagLoud', u'count': 21},
 {u'_id': u'MULA_BSB', u'count': 21},
 {u'_id': u'vanilla3450', u'count': 18},
 {u'_id': u'Democracy_Work', u'count': 17},
 {u'_id': u'itsajstuerd', u'count': 16}]


## Group operators
- `$sum`
- `$first`
- `$last`
- `$max`
- `$min`
- `$avg`

Array operators
- `$push`
- `$addToSet`
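
Several of these operators can be used in one `$group` stage. A sketch using only fields that appear in the queries in this document; the output field names are made up for illustration.

```python
pipeline = [
    {"$unwind": "$entities.hashtags"},
    {"$group": {
        "_id": "$user.screen_name",
        # number of hashtag uses, duplicates included
        "uses": {"$sum": 1},
        # $push keeps every value, so repeated hashtags appear repeatedly
        "all_hashtags": {"$push": "$entities.hashtags.text"},
        # $addToSet keeps each distinct value once
        "unique_hashtags": {"$addToSet": "$entities.hashtags.text"},
        # numeric accumulators over the documents in the group
        "max_followers": {"$max": "$user.followers_count"},
        "min_followers": {"$min": "$user.followers_count"},
        "avg_followers": {"$avg": "$user.followers_count"},
    }},
    {"$sort": {"uses": -1}},
]
```

`$first` and `$last` take their value from the first or last document that reaches each group, so they are usually only meaningful when a `$sort` stage precedes the `$group`.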

In [15]:
# get unique hashtags by user
query = [
    {"$unwind": "$entities.hashtags"},
    {"$group": {"_id": "$user.screen_name",
                "unique_hashtags": {
                    "$addToSet": "$entities.hashtags.text"
                }}},
    {"$sort": {"_id": -1}}
]

aggregate_and_show(collection, query)

[{u'_id': u'zzj090728', u'unique_hashtags': [u'yaplog']},
 {u'_id': u'zyosouzai', u'unique_hashtags': [u'lv25834899']},
 {u'_id': u'zudyezezwe65',
  u'unique_hashtags': [u'line', u'all', u'data', u'mun', u'kl', u'pop']},
 {u'_id': u'zorapataki', u'unique_hashtags': [u'zorapataki']},
 {u'_id': u'zootcadillac', u'unique_hashtags': [u'TwitterJokeTrial']}]


In [31]:
# find number of unique user mentions
query = [
    {"$unwind": "$entities.user_mentions"},
    {"$group": {
            "_id": "$user.screen_name",
            "mset": {
                "$addToSet": "$entities.user_mentions.screen_name"
            }
        }},
    {"$unwind": "$mset"},
    {"$group": {"_id": "$_id", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}
]

aggregate_and_show(collection, query)

[{u'_id': u'Democracy_Work', u'count': 17},
 {u'_id': u'ThizBoySwagLoud', u'count': 16},
 {u'_id': u'itsajstuerd', u'count': 15},
 {u'_id': u'FollowersNeeded', u'count': 15},
 {u'_id': u'Egreeedy', u'count': 12}]


## Indexes

The order of the fields in an index is important:
![](index.png)

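We can create indexes in the mongo shell using `createIndex` (`ensureIndex` is the older, deprecated name for the same operation):

`db.collection.createIndex({"tg" : 1})`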
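
From Python, pymongo exposes the same operation as `create_index`. A minimal sketch, assuming `collection` is a pymongo collection; the helper name `create_indexes` and the choice of fields are illustrative only.

```python
def create_indexes(collection):
    # Single-field index; 1 means ascending, -1 descending
    single = collection.create_index([("user.screen_name", 1)])
    # Compound index: the order of the keys matters. This index can serve
    # queries on user.screen_name alone or on both fields together, but it
    # does not help queries on user.followers_count alone.
    compound = collection.create_index([("user.screen_name", 1),
                                        ("user.followers_count", -1)])
    # create_index returns the name of each index
    return single, compound
```

It would be called as `create_indexes(get_collection())`.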

## Geospatial Indexes
- allow us to query for locations near a given location

![](2d_index.png)

- The location needs to be stored as an array `[X, Y]`.
- Then we need to create a geospatial (`2d`) index on that field.
- Then we can query the field with `$near`.
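
The three steps can be sketched with pymongo as follows. The field name `loc` and both helper functions are assumptions made for illustration; `collection` is assumed to be a pymongo collection whose documents store their position as `[x, y]` in that field.

```python
def create_2d_index(collection, field="loc"):
    # "2d" is the index type for legacy [x, y] coordinate pairs
    return collection.create_index([(field, "2d")])

def find_near(collection, x, y, field="loc", limit=5):
    # $near returns documents ordered from nearest to farthest
    cursor = collection.find({field: {"$near": [x, y]}})
    return list(cursor.limit(limit))
```

A document that fits this sketch would look like `{"name": "some place", "loc": [10, 20]}`, and `find_near(collection, 10, 20)` would then return up to five of the closest documents.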