## 1. Business understanding

### Objectives

Perform sentiment analysis of tweets in real time.

### Success criteria

The project is considered successful if the resulting model achieves an accuracy of at least 70% and produces results with a delay of no more than 1 second.
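
Both criteria can be checked mechanically once a model exists. The following is a minimal sketch, not part of the original notebook: `meets_success_criteria` and the `predict` callable are hypothetical names, and the timing covers only the prediction call, not the full streaming pipeline delay.

```python
import time


def meets_success_criteria(predict, texts, labels,
                           min_accuracy=0.70, max_latency_s=1.0):
    """Return (ok, accuracy, worst_latency) for a hypothetical `predict` callable."""
    if not texts:
        raise ValueError("at least one labeled example is required")
    correct = 0
    worst_latency = 0.0
    for text, label in zip(texts, labels):
        start = time.perf_counter()
        predicted = predict(text)
        # Track the slowest single prediction, since the criterion is a per-result bound.
        worst_latency = max(worst_latency, time.perf_counter() - start)
        correct += int(predicted == label)
    accuracy = correct / len(texts)
    ok = accuracy >= min_accuracy and worst_latency <= max_latency_s
    return ok, accuracy, worst_latency
```

Using the worst observed latency rather than the mean is a deliberate choice here, because the criterion is stated as an upper bound on the delay.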

### Project plan

1. Examine the data provided by the Twitter Streaming API.
1. Fetch a set of tweets that will be used for training purposes.
1. Explore the data gathered in the sample.
1. Label the training data.
1. Build and evaluate models.
1. Create a Spark job that uses the best prediction model.
1. Deploy the job to a preconfigured Spark cluster.

## 2. Data understanding

### Collect data
Fetch tweets using the Twitter Streaming API and store them in Kafka.

In [149]:
from notebook_client.notebook_client import NotebookClient

In [150]:
nc = NotebookClient()
nc.initialize_producers_manager_connection()
nc.initialize_sampler_manager_connection()

In [322]:
producer_pid = nc.start_streaming('usa_stream', { 'locations': '-125.75,30.8,-70,45' })

In [47]:
reservoir_size = 5
limit = 10
nc.start_sampling('san_stream', 'london_sample', reservoir_size, limit)
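
The `reservoir_size` and `limit` arguments suggest that the sampler uses reservoir sampling, although the internals of `NotebookClient` are not shown here. The sketch below illustrates the classic Algorithm R under that assumption; `reservoir_sample` is a hypothetical helper, not part of `notebook_client`, and it assumes `limit` caps how many stream items are read.

```python
import random
from itertools import islice


def reservoir_sample(stream, reservoir_size, limit=None, rng=None):
    """Keep a uniform random sample of `reservoir_size` items from `stream`.

    Reads at most `limit` items (all of them if `limit` is None).
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(islice(stream, limit)):
        if i < reservoir_size:
            # Fill the reservoir with the first items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability reservoir_size / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < reservoir_size:
                reservoir[j] = item
    return reservoir
```

The appeal of this approach for a tweet stream is that it needs memory only for the reservoir, regardless of how many tweets pass through.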

In [317]:
nc.streaming_status(producer_pid)

'not found'

In [50]:
nc.sampling_status()

'finished'

In [338]:
nc.stop_streaming(producer_pid)

'stopped'

In [43]:
nc.stop_sampling()

'stopped'

#### Load collected data from Kafka

In [151]:
from kafka import KafkaConsumer

consumer = KafkaConsumer('usa_stream', bootstrap_servers='kafka')
consumer.topics()
partition = consumer.assignment().pop()

In [152]:
consumer.seek_to_end()
tweets_count = consumer.position(partition)
consumer.seek_to_beginning()

In [56]:
tweets_count

808336

In [61]:
import json

In [153]:
data = map(lambda t: json.loads(t.value.decode('utf-8')), consumer)

#### Check what a sample tweet object looks like.

In [48]:
next(data)

{'contributors': None,
 'coordinates': None,
 'created_at': 'Thu Aug 31 10:41:10 +0000 2017',
 'display_text_range': [0, 55],
 'entities': {'hashtags': [{'indices': [11, 29], 'text': 'TrinidadandTobago'},
   {'indices': [31, 38], 'text': 'Repost'}],
  'media': [{'display_url': 'pic.twitter.com/tUZbpY8meO',
    'expanded_url': 'https://twitter.com/dyschick/status/903206020339699713/video/1',
    'id': 903205935727779840,
    'id_str': '903205935727779840',
    'indices': [56, 79],
    'media_url': 'http://pbs.twimg.com/ext_tw_video_thumb/903205935727779840/pu/img/79Rjyir-dcmfFMRs.jpg',
    'media_url_https': 'https://pbs.twimg.com/ext_tw_video_thumb/903205935727779840/pu/img/79Rjyir-dcmfFMRs.jpg',
    'sizes': {'large': {'h': 640, 'resize': 'fit', 'w': 640},
     'medium': {'h': 600, 'resize': 'fit', 'w': 600},
     'small': {'h': 340, 'resize': 'fit', 'w': 340},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/tUZbpY8meO'}],
  'symbol

Tweet objects contain many properties that are not needed here. The two relevant fields are _id_ (numeric) and _text_ (string).

#### Extract useful properties and load them into a data frame.

In [58]:
def parse_tweet(data):
    return { 'id': data.get('id'), 'text': data.get('text') }

In [84]:
import pandas as pd

pd.set_option("display.max_colwidth", 150)

In [154]:
consumer.seek_to_beginning()
parsed_data = map(parse_tweet, data)
df = pd.DataFrame([next(parsed_data) for _ in range(tweets_count)])

#### Explore the data

In [155]:
df.head()

Unnamed: 0,id,text
0,9.03206e+17,Happy 55th #TrinidadandTobago!\n#Repost @RemyRemBunction https://t.co/tUZbpY8meO
1,9.03206e+17,Stupid Easy Trivia happens now with @toddandjayde! Play with us to CLAIM YOUR THRONE!
2,9.03206e+17,Yes but I miss Hillary more https://t.co/FS4hJTVzTl
3,9.03206e+17,"Just posted a photo @ Gladstone, Missouri https://t.co/YFpvmBPTxs"
4,9.03206e+17,i have so much fun here. wish you were here. https://t.co/DvBlLLXEV3


In [86]:
df.tail()

Unnamed: 0,id,text
808330,9.039561e+17,Lmaooooooo https://t.co/CyyZewelQL
808331,9.039561e+17,"These last 2 days, boyyyyy. Have been the worst."
808332,9.039561e+17,This nigga is tweeting like he know me. All I said was shut up and he tweeted me 4 times lmaooooooooo https://t.co/4GI3Ogykrd
808333,9.039561e+17,"If you're looking for work in #MtPleasant, MI, check out this #job: https://t.co/Ik6kX1FChQ #CustomerService #Hiring #CareerArc"
808334,9.039561e+17,Haha to funny 😂 https://t.co/UoqVOjj06a


In [95]:
df.sample()

Unnamed: 0,id,text
232069,9.03354e+17,I'm literally my worst enemy.


In [156]:
df['text'].describe()

count     807271
unique    800093
top          Lol
freq          70
Name: text, dtype: object

Number of empty entries.

In [157]:
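# Count rows, not cells: DataFrame.size is rows * columns, which doubles
# the count on this two-column frame (2130 cells = 1065 rows).
no_id = df['id'].isnull().sum()
no_text = df['text'].isnull().sum()
print("Entries without id: {0}".format(no_id))
print("Entries without text: {0}".format(no_text))

Entries without id: 1065
Entries without text: 1065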
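
When counting incomplete entries it matters whether rows or cells are counted: `DataFrame.size` is the number of rows times the number of columns, while `len(...)` or `isnull().sum()` counts rows. A toy illustration follows; the frame `toy` is made up and is not the tweet data.

```python
import pandas as pd

# Three rows, two columns; the middle row is missing both values.
toy = pd.DataFrame({"id": [1.0, None, 3.0], "text": ["a", None, "c"]})

missing_rows = toy[toy["id"].isnull()]

cells = missing_rows.size                      # rows * columns
rows = len(missing_rows)                       # rows only
rows_via_mask = int(toy["id"].isnull().sum())  # rows only, without filtering

print(cells, rows, rows_via_mask)
```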


Word count statistics.

In [124]:
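# Drop rows lacking id or text in one step; indexing nn_df with a mask built
# from df would rely on implicit index alignment.
nn_df = df.dropna(subset=['id', 'text'])
nn_df['text'].str.split().apply(len).describe()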

  


count    807270.000000
mean         11.346310
std           6.729883
min           1.000000
25%           6.000000
50%          11.000000
75%          16.000000
max          69.000000
Name: text, dtype: float64

In [135]:
corpus = set()
nn_df['text'].str.lower().str.split().apply(corpus.update)
print("Corpus size (before cleaning it up): {}".format(len(corpus)))
del corpus

Corpus size (before cleaning it up): 1017250


#### Data quality report
The data is largely complete: 807271 of the 808336 collected tweets have text, so about 1065 entries (roughly 0.13%) are incomplete (lacking text and id) and should be excluded from further analysis. These are row counts; `DataFrame.size` would count cells (rows times columns) instead. The text contains many hashtags, emojis, URLs, and user mentions, which must be handled during text cleaning.
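
The cleaning concerns named above can be sketched with a few regular expressions. This is one possible approach, not the notebook's actual cleaning step: `clean_tweet` and the patterns are illustrative assumptions, and the final filter also drops non-ASCII letters, which may be too aggressive for non-English text.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_WORD_RE = re.compile(r"[^a-z0-9\s']")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, mentions, hashtag marks, and emojis."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(r"\1", text)   # keep the hashtag word, drop the '#'
    text = NON_WORD_RE.sub(" ", text)    # drops emojis, punctuation, non-ASCII letters
    return " ".join(text.split())        # collapse whitespace, including newlines


print(clean_tweet("Happy 55th #TrinidadandTobago!\n#Repost @RemyRemBunction https://t.co/tUZbpY8meO"))
# prints: happy 55th trinidadandtobago repost
```

Whether to keep hashtag words, as done here, or drop them entirely is a modeling decision that would need to be evaluated against the labeled data.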

## 3. Data preparation

To reduce memory usage and make the text data easier to inspect, the tweet properties used to build the classification model (_id_ and _text_) were selected up front, after an initial review of the Twitter Streaming API response.

#### Cleaning the text data
Remove entries that contain null values.

In [158]:
df = df[df['id'].notnull()]
df = df[df['text'].notnull()]

#### Fetch labels from http://www.sentiment140.com

In [208]:
import json
from urllib.request import Request, urlopen


def payload(df):
    # Build the bulk-classification request body: one {text, id} record per row.
    records = [{'text': row['text'], 'id': row['id']} for _, row in df.iterrows()]
    return json.dumps({'data': records}).encode('utf-8')


def fetch_labels(df):
    api_url = 'http://www.sentiment140.com/api/bulkClassifyJson'
    request = Request(api_url, payload(df), {'Content-Type': 'application/json'})
    # Do not name the response `json`; that would shadow the json module.
    response = urlopen(request).read().decode()
    print(response)
    # Return the parsed response so that callers can store the labels.
    return json.loads(response)

In [209]:
fetch_labels(df[:5])

{"data":[{"id":9.0320602033969971E17,"text":"Happy 55th #TrinidadandTobago!\n#Repost @RemyRemBunction https://t.co/tUZbpY8meO","polarity":4,"meta":{"headline":true,"language":"en"}},{"id":9.0320602433266074E17,"text":"Stupid Easy Trivia happens now with @toddandjayde! Play with us to CLAIM YOUR THRONE!","polarity":2,"meta":{"language":"en"}},{"id":9.0320602463881626E17,"text":"Yes but I miss Hillary more https://t.co/FS4hJTVzTl","polarity":2,"meta":{"language":"en"}},{"id":9.0320602800688333E17,"text":"Just posted a photo @ Gladstone, Missouri https://t.co/YFpvmBPTxs","polarity":2,"meta":{"language":"en"}},{"id":9.0320602664361165E17,"text":"i have so much fun here. wish you were here. https://t.co/DvBlLLXEV3","polarity":2,"meta":{"language":"en"}}]}
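
The `polarity` field in the response uses the Sentiment140 encoding: 0 for negative, 2 for neutral, and 4 for positive. A minimal sketch of turning a raw response into labeled records follows; `labels_from_response` and `POLARITY_LABELS` are hypothetical names. The sketch does not rely on the returned `id` values, because they come back as floats and so may not match the original 64-bit tweet ids exactly.

```python
import json

# Sentiment140 polarity encoding.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}


def labels_from_response(response_text):
    """Parse a bulkClassifyJson response body into a list of labeled records."""
    records = json.loads(response_text)["data"]
    return [
        {
            "text": record["text"],
            "polarity": record["polarity"],
            "label": POLARITY_LABELS.get(record["polarity"], "unknown"),
        }
        for record in records
    ]
```

In the five-tweet response above, this would label the first tweet as positive and the remaining four as neutral.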



In [173]:
chunk_size = 10
for i in range(0, df['text'].size - chunk_size + 1, chunk_size):
    labels = fetch_labels(df[i:i+chunk_size])
#     store_labels_in_cassandra(labels)
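
Note that the loop above stops at the last full chunk, so up to `chunk_size - 1` trailing rows are never sent for labeling. A small sketch of a chunking helper that also yields the final partial chunk is shown below; `iter_chunks` is a hypothetical name, and it works on anything that supports `len()` and slicing, including a pandas data frame.

```python
def iter_chunks(items, chunk_size):
    """Yield consecutive slices of `items`, including a shorter final chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Example with a plain list: 25 items in chunks of 10.
sizes = [len(chunk) for chunk in iter_chunks(list(range(25)), 10)]
print(sizes)  # prints: [10, 10, 5]
```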

In [193]:
r = next(df.iterrows())

In [200]:
r[1]['text']

'Happy 55th #TrinidadandTobago!\n#Repost @RemyRemBunction https://t.co/tUZbpY8meO'