## 1. Business understanding

### Objectives

Perform sentiment analysis of tweets in real time.

### Success criteria

The project will be considered successful if the resulting model reaches an accuracy of 70% or higher and produces results with a delay of no more than 1 second.
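The two criteria above can be expressed as a simple check. This is a minimal sketch, not part of the project code: the function names are hypothetical, and it assumes the 1-second bound applies to the slowest individual result rather than to an average.

```python
# Thresholds taken from the success criteria above.
MIN_ACCURACY = 0.70
MAX_LATENCY_SECONDS = 1.0


def accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def meets_success_criteria(predicted, expected, latencies_seconds):
    """Check both criteria: accuracy >= 70% and worst-case delay <= 1 second.

    Assumption: the delay bound is applied to the slowest result.
    """
    worst_latency = max(latencies_seconds, default=0.0)
    return (
        accuracy(predicted, expected) >= MIN_ACCURACY
        and worst_latency <= MAX_LATENCY_SECONDS
    )
```

If an average or percentile delay is what matters, only the `worst_latency` line would need to change.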

### Project plan

1. Examine the data provided by the Twitter Streaming API.
1. Fetch a set of tweets that will be used for training purposes.
1. Explore the data gathered in the sample.
1. Label the training data.
1. Build and evaluate models.
1. Create a Spark job that will use the best prediction model.
1. Deploy the job to a preconfigured Spark cluster.

## 2. Data understanding

### Collect data

Fetch tweets using the Twitter Streaming API and store them in Kafka.

In [1]:
from notebook_client.notebook_client import NotebookClient

In [256]:
nc = NotebookClient()
nc.initialize_producers_manager_connection()
nc.initialize_sampler_manager_connection()

In [322]:
producer_pid = nc.start_streaming('usa_stream', { 'locations': '-125.75,30.8,-70,45' })
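The `locations` value passed above is a bounding box. In the Twitter Streaming API it is given as longitude,latitude pairs with the southwest corner first. The sketch below shows how such a string could be read and checked against a point. `BoundingBox`, `parse_bounding_box` and `contains_point` are hypothetical helpers, not part of `NotebookClient`, and only a single box is handled.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    west: float   # minimum longitude
    south: float  # minimum latitude
    east: float   # maximum longitude
    north: float  # maximum latitude


def parse_bounding_box(locations):
    """Parse 'west,south,east,north' (southwest corner first) into a BoundingBox."""
    values = [float(v) for v in locations.split(",")]
    if len(values) != 4:
        raise ValueError("expected exactly four comma-separated numbers")
    box = BoundingBox(*values)
    if box.west >= box.east or box.south >= box.north:
        raise ValueError("southwest corner must come before northeast corner")
    return box


def contains_point(box, longitude, latitude):
    """Return True if the point lies inside the box (edges included)."""
    return (
        box.west <= longitude <= box.east
        and box.south <= latitude <= box.north
    )
```

For example, `contains_point(parse_bounding_box('-125.75,30.8,-70,45'), -94.55, 39.2)` tests a point in Missouri against the box used for `usa_stream`.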

In [47]:
reservoir_size = 5
limit = 10
nc.start_sampling('san_stream', 'london_sample', reservoir_size, limit)
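The `reservoir_size` and `limit` arguments suggest that the sampler keeps a fixed-size random sample of the stream. The sampler's implementation is not shown in this notebook, so the following is only a sketch of classic reservoir sampling (Algorithm R). It assumes `limit` is the number of stream items to read before stopping. `reservoir_sample` is a hypothetical name.

```python
import random
from itertools import islice


def reservoir_sample(stream, reservoir_size, limit, rng=None):
    """Return a uniform random sample of up to reservoir_size items.

    Only the first `limit` items of the stream are considered (assumption
    about what `limit` means). Memory use is bounded by reservoir_size.
    """
    rng = rng or random.Random()
    reservoir = []
    for seen, item in enumerate(islice(stream, limit)):
        if seen < reservoir_size:
            # Fill the reservoir with the first reservoir_size items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, seen]; replace only if it falls in the reservoir.
            slot = rng.randint(0, seen)
            if slot < reservoir_size:
                reservoir[slot] = item
    return reservoir
```

With `reservoir_size = 5` and `limit = 10`, each of the first 10 items ends up in the sample with probability 5/10.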

In [317]:
nc.streaming_status(producer_pid)

'not found'

In [50]:
nc.sampling_status()

'finished'

In [338]:
nc.stop_streaming(producer_pid)

'stopped'

In [43]:
nc.stop_sampling()

'stopped'

### Explore fetched data

In [2]:
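from kafka import KafkaConsumer

# Subscribe to the 'usa_stream' topic.
consumer = KafkaConsumer('usa_stream', bootstrap_servers='kafka')
# List the topics visible to this consumer (forces a metadata fetch).
consumer.topics()
# Pick one of the partitions assigned to this consumer
# (raises KeyError if no partition has been assigned yet).
partition = consumer.assignment().pop()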

In [54]:
consumer.seek_to_end()
tweets_count = consumer.position(partition)
consumer.seek_to_beginning()
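Reading the position after `seek_to_end()` gives the message count only when the partition starts at offset 0. A possible alternative, sketched below, subtracts the beginning offsets from the end offsets using kafka-python's `end_offsets` and `beginning_offsets` methods. `count_messages` is a hypothetical helper, and the result is an upper bound if the topic is compacted.

```python
def count_messages(consumer, partitions):
    """Estimate how many messages are currently stored in the given partitions.

    `consumer` is expected to behave like kafka.KafkaConsumer: both
    end_offsets() and beginning_offsets() take a list of partitions and
    return a dict mapping each partition to an offset.
    """
    if not partitions:
        return 0
    end = consumer.end_offsets(partitions)
    begin = consumer.beginning_offsets(partitions)
    return sum(end[tp] - begin[tp] for tp in partitions)
```

In this notebook it could be called as `count_messages(consumer, list(consumer.assignment()))`, which also leaves the consumer's read position untouched.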

In [46]:
consumer.position(partition)

0

In [47]:
import json

# Lazily decode each Kafka message value from UTF-8 JSON into a dict.
data = (json.loads(t.value.decode('utf-8')) for t in consumer)

#### Check what a sample tweet object looks like.

In [48]:
next(data)

{'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Aug 31 10:41:10 +0000 2017',
 'display_text_range': [0, 55],
 'entities': {'hashtags': [{'indices': [11, 29], 'text': 'TrinidadandTobago'},
   {'indices': [31, 38], 'text': 'Repost'}],
  'media': [{'display_url': 'pic.twitter.com/tUZbpY8meO',
    'expanded_url': 'https://twitter.com/dyschick/status/903206020339699713/video/1',
    'id': 903205935727779840,
    'id_str': '903205935727779840',
    'indices': [56, 79],
    'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/903205935727779840/pu/img/79Rjyir-dcmfFMRs.jpg',
    'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/903205935727779840/pu/img/79Rjyir-dcmfFMRs.jpg',
    'sizes': {'large': {'h': 640, 'resize': 'fit', 'w': 640},
     'medium': {'h': 600, 'resize': 'fit', 'w': 600},
     'small': {'h': 340, 'resize': 'fit', 'w': 340},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/tUZbpY8meO'}],
  'symbol

#### Extract useful properties.

In [50]:
def parse_tweet(tweet):
    # Keep only the fields needed for sentiment analysis.
    return {'id': tweet['id'], 'text': tweet['text']}
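The stream can also carry messages that are not tweets (for example delete or limit notices), and tweets longer than 140 characters carry their complete text under `extended_tweet.full_text`. A more defensive variant of the parser might look like the sketch below. `parse_tweet_safe` is a hypothetical name, and whether these cases occur in the collected data has not been checked here.

```python
def parse_tweet_safe(message):
    """Return {'id', 'text'} for a tweet, or None for any other message.

    Assumption: messages without both 'id' and 'text' (such as delete or
    limit notices) are not tweets and can be skipped.
    """
    if 'id' not in message or 'text' not in message:
        return None
    # Prefer the untruncated text when the tweet carries an extended payload.
    extended = message.get('extended_tweet') or {}
    text = extended.get('full_text', message['text'])
    return {'id': message['id'], 'text': text}
```

Callers would then filter out the `None` results, for example with `(t for t in map(parse_tweet_safe, data) if t is not None)`.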

In [52]:
import pandas as pd

# Load only the first 10 tweets into the DataFrame.
tweets_count = 10
consumer.seek_to_beginning()
parsed_data = map(parse_tweet, data)
df = pd.DataFrame([next(parsed_data) for _ in range(tweets_count)])

In [53]:
df.head()

| | id | text |
|---|---|---|
| 0 | 903206020339699713 | Happy 55th #TrinidadandTobago!\n#Repost @RemyR... |
| 1 | 903206024332660736 | Stupid Easy Trivia happens now with @toddandja... |
| 2 | 903206024638816256 | Yes but I miss Hillary more https://t.co/FS4hJ... |
| 3 | 903206028006883328 | Just posted a photo @ Gladstone, Missouri http... |
| 4 | 903206026643611648 | i have so much fun here. wish you were here. h... |
