![title](http://ocausal.imbv.net/wp-content/uploads/2017/02/banner-autocop-3.jpg)
[AutoCop](http://ocausal.imbv.net/proyecto-autocop-es/), a proof of concept of the Observatorio de Contenidos Audiovisuales ([OCA](http://ocausal.imbv.net/proyecto-autocop-es/)), funded by the University of Salamanca Foundation [Plan TCUE 2015-2017 Fase 2].
Principal Investigator: Carlos Arcila Calderón. Researchers: Félix Ortega, Javier Amores, Sofía Trullenque, Miguel Vicente, Mateo Álvarez, Javier Ramírez.

# AutoCop running on Spark (English version)

# Twitter sentiment streaming visualization

We're going to use Bokeh to visualize the output of our streaming sentiment analysis. This notebook connects to a MongoDB server and polls it for the latest results at a fixed interval. On each poll it fetches the new data, reduces the batch to a single score and adds that point to a line chart.

## Imports

In [1]:
import time
import numpy as np

from bokeh.models.sources import ColumnDataSource
from bokeh.plotting import figure
from bokeh.io import output_notebook, show, push_notebook

import pymongo
from datetime import datetime, timedelta
from bson.objectid import ObjectId
import pprint

In [2]:
output_notebook()

## Global configuration

In [3]:
SERVER_URL = "mongodb://localhost:27017"

In [4]:
client = pymongo.MongoClient(SERVER_URL)
db = client.twitter
coll = db.labels

## Process Mongo data for visualization

The processing scripts classify tweets in batches and write their results to the database in batches too.

On our side, we fetch batches of processed tweets. Each document has the form:

```
    {'_id': ObjectId('5900f5bda822ec58140eae67'),
     'in_batch_id': 3,
     'label': -1,
     'timestamp': datetime.datetime(2017, 4, 26, 17, 32, 12, 60000)}
```

Where:

| Field         | Description                                                                         |
|---------------|-------------------------------------------------------------------------------------|
| `_id`         | Unique ID auto-generated by MongoDB                                                 |
| `in_batch_id` | Position of the result within the processed batch (not very relevant)               |
| `label`       | Either 1 or -1. 1 means the processed tweet was positive, -1 means it was negative. |
| `timestamp`   | Time at which the tweet was processed and written into the database                 |

For each batch we simply sum all the `label` values (`+1` and `-1`) to get the batch's net score.

In [5]:
def compute_batch_score(batch):
    """Return the net score of a batch: the sum of its +1/-1 labels."""
    score = 0
    for result in batch:
        score += result["label"]  # TODO: rename variable
    return score
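
As a quick sanity check, the scoring can be exercised on a hand-made batch without touching the database. The documents below are made-up stand-ins for what the `labels` collection returns, and the function is restated here only so the snippet runs on its own:

```python
from datetime import datetime


def compute_batch_score(batch):
    """Net score of a batch: the sum of its +1/-1 labels."""
    return sum(result["label"] for result in batch)


# Hypothetical batch: three positive results and two negative ones.
ts = datetime(2017, 4, 26, 17, 32, 12)
sample_batch = [
    {"in_batch_id": 0, "label": 1, "timestamp": ts},
    {"in_batch_id": 1, "label": -1, "timestamp": ts},
    {"in_batch_id": 2, "label": 1, "timestamp": ts},
    {"in_batch_id": 3, "label": -1, "timestamp": ts},
    {"in_batch_id": 4, "label": 1, "timestamp": ts},
]

net_score = compute_batch_score(sample_batch)
n_positive = sum(1 for r in sample_batch if r["label"] == 1)
n_negative = sum(1 for r in sample_batch if r["label"] == -1)

print(net_score, n_positive, n_negative)  # 1 3 2
```

Note that the net score alone cannot distinguish a quiet period from a balanced one: an empty batch and a batch with as many positives as negatives both score 0.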

## Visualization

Configurable parameters:

`period`: time in seconds to wait to poll the database for the next batch

`n_show`: number of points that will be shown simultaneously on screen


How it works: we create a Bokeh `figure` and we add a `line` to it. Inside an infinite loop, we poll for a new batch, advance one step, calculate the current y value (batch score), stream the data for Bokeh to update the chart and sleep for `period` seconds.
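
The running total and the rolling window can be illustrated without Bokeh or MongoDB. This is only a sketch under assumptions: `batch_scores` is made-up data, and a `collections.deque` with `maxlen` stands in for the way `source.stream(new_data, n_show)` discards the oldest points once more than `n_show` are held.

```python
from collections import deque

n_show = 5  # deliberately small so the rollover is visible
batch_scores = [2, -1, 0, 3, -2, 1, 1, -4]  # made-up net scores, one per poll

# Bounded buffers: appending beyond maxlen drops the oldest entry.
xs = deque(maxlen=n_show)
ys = deque(maxlen=n_show)

running_total = 0
for step, score in enumerate(batch_scores):
    running_total += score  # y is cumulative, as in the notebook loop
    xs.append(step)
    ys.append(running_total)

print(list(xs))  # [3, 4, 5, 6, 7]
print(list(ys))  # [4, 2, 3, 4, 0]
```

After eight polls only the last five points remain, which is what the chart would display with `n_show = 5`.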

Note that this cell runs until it is stopped manually, which prevents any other cell from executing in the meantime.
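
If blocking the kernel indefinitely is inconvenient, one option is to bound the loop and to handle the interrupt cleanly. The helper below is hypothetical (`run_polling_loop`, `poll_once` and `max_steps` are names introduced here, not part of the notebook) and only sketches the control flow:

```python
import time


def run_polling_loop(poll_once, period=2.0, max_steps=None):
    """Call poll_once(step) every `period` seconds.

    Stops after `max_steps` iterations (None means run until interrupted)
    and returns the number of completed steps.
    """
    step = 0
    try:
        while max_steps is None or step < max_steps:
            poll_once(step)
            step += 1
            time.sleep(period)
    except KeyboardInterrupt:
        # Raised when the running cell is interrupted from the notebook UI.
        pass
    return step


# Usage with a trivial callback that just records the step numbers.
seen = []
completed = run_polling_loop(seen.append, period=0.01, max_steps=3)
print(completed, seen)  # 3 [0, 1, 2]
```

In the notebook, `poll_once` would wrap the body of the loop (query, score, `source.stream`, `push_notebook`).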

In [None]:
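source = ColumnDataSource(dict(x=[], y=[]))

# Note: Bokeh 3.x renamed plot_width/plot_height to width/height.
my_figure = figure(plot_width=800, plot_height=400)
my_figure.line(source=source, x="x", y="y", line_width=2, alpha=.85, color='blue')

# notebook_handle=True returns a handle that push_notebook() uses to update the plot in place
handle = show(my_figure, notebook_handle=True)

new_data = dict(x=[0], y=[0])  # last point streamed; y holds the running total

step = 0
period = 2  # in seconds
n_show = 300  # number of points to keep and show

# Lower bound for the query: UTC now minus 2 hours and 10 seconds
timenow = datetime.utcnow() - timedelta(hours=2, seconds=10)

while True:
    # Fetch every result written after the lower bound, newest first
    batch = coll.find({'timestamp':{'$gt': timenow}}).sort([("timestamp", -1)])

    # Add the batch net score to the previous running total
    latest_value = new_data['y'][0]
    new_data = dict(x=[step], y=[latest_value + compute_batch_score(batch)])

    # Append the new point, keeping only the last n_show points
    source.stream(new_data, n_show)

    push_notebook(handle=handle)
    step += 1

    # Move the lower bound forward before the next poll
    timenow = datetime.utcnow() - timedelta(hours=2, seconds=10)
    time.sleep(period)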
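
Because the query looks back 10 seconds while the loop polls every 2 seconds, consecutive polls may return some of the same documents. If that overlap is not intended, one possible alternative is to remember the newest timestamp already processed and only ask for documents after it. The sketch below is an assumption-laden illustration, not the notebook's method: `fetch_newer_than` is an in-memory stand-in for `coll.find({'timestamp': {'$gt': cutoff}})`, and `poll_increment` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def fetch_newer_than(documents, cutoff):
    """In-memory stand-in for coll.find({'timestamp': {'$gt': cutoff}})."""
    return [doc for doc in documents if doc["timestamp"] > cutoff]


def poll_increment(documents, last_seen):
    """Return (net score of unseen documents, updated high-water mark)."""
    new_docs = fetch_newer_than(documents, last_seen)
    if not new_docs:
        return 0, last_seen
    score = sum(doc["label"] for doc in new_docs)
    newest = max(doc["timestamp"] for doc in new_docs)
    return score, newest


# Made-up stored results.
t0 = datetime(2017, 4, 26, 17, 32, 0)
stored = [
    {"label": 1, "timestamp": t0 + timedelta(seconds=1)},
    {"label": -1, "timestamp": t0 + timedelta(seconds=2)},
    {"label": 1, "timestamp": t0 + timedelta(seconds=3)},
]

first_score, mark = poll_increment(stored, t0)
# A second poll with no new writes contributes nothing.
second_score, mark = poll_increment(stored, mark)
# A new result arrives and is counted exactly once.
stored.append({"label": -1, "timestamp": t0 + timedelta(seconds=5)})
third_score, mark = poll_increment(stored, mark)

print(first_score, second_score, third_score)  # 1 0 -1
```

One caveat of this design: with a strict `$gt` on the newest timestamp, a document written later with exactly that same timestamp would be skipped, so tracking the last `_id` instead may be worth considering.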