# Week 4 (Tuesday)- Streaming Data and Text Analysis

**Objectives**: Today we are going to explore streaming data and text analysis. Specifically, we will cover the following:
  
* Streaming Data
* Build code to connect to the Twitter streaming API
* Experiment with that code to generate different streams
* Use TextBlob to analyze results from the API

## Streaming Data

Streaming data often embodies all three of the "big data" V's: variety, velocity, and volume. Much of this data is also unstructured. In this context, we are discussing streaming data, not streaming media like video or audio, although streaming media is also a component of "big data".

For analytics and streaming data, a typical use case might look like the following example from Amazon Web Services.

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/Kinesis-Streams_Diagram.png">
(source: https://aws.amazon.com/kinesis/)

A streaming analytics pipeline can be used for many applications including:

* real-time machine learning (recommendations, predictions, etc.)
* real-time analytics
* distributed data generation and collection (JSON activity streams and xAPI)
* Internet of Things data collection

Open source platforms used to handle and analyze streaming data include:

* Apache Kafka for messaging- http://kafka.apache.org/
* Apache Storm for real-time computation- http://storm.apache.org/

Advanced firms like Palantir are expanding beyond traditional single-stream applications like recommendation to build systems that analyze and integrate multiple, disparate data streams, as can be seen in these product descriptions:

* https://www.palantir.com/solutions/insider-threat/
* https://www.palantir.com/palantir-gotham/

Financial firms are also using Twitter data, either directly or from aggregators like RavenPack, to inform or in some cases trigger trades.

* http://www.bloomberg.com/bw/articles/2013-04-24/how-many-hft-firms-actually-use-twitter-to-trade
* http://www.ravenpack.com/

A key tool in the example of financial trading is sentiment analysis. Sentiment analysis uses natural language processing and text processing to infer attitudes about the subject of a text. In simplistic financial terms, if the public has positive attitudes about a given firm or its stock, that is usually correlated with price stability or increases. Recent examples have even included April Fools' Twitter jokes that seemed to impact stock prices:

* http://blogs.wsj.com/moneybeat/2015/04/01/tesla-stock-moves-on-april-fools-joke/

Real-time analysis of the text data associated with these streams was likely involved in trading decisions like this. More broadly, the use of text analysis and streaming data can be a critical component of any firm's analytics efforts.
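To make the idea of sentiment scoring concrete, here is a minimal sketch of a word-list approach. The `POSITIVE` and `NEGATIVE` sets and the `sentiment_score` function are hypothetical names invented for this illustration, and the word lists are far too small for real use. Production systems rely on much richer models, so treat this only as a picture of the basic idea.

```python
import re

# Toy word lists (assumptions for illustration only).
POSITIVE = {"good", "great", "love", "gain", "up"}
NEGATIVE = {"bad", "terrible", "hate", "loss", "down"}

def sentiment_score(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    matched = pos + neg
    if matched == 0:
        # No opinion words found, so treat the text as neutral.
        return 0.0
    return float(pos - neg) / matched

print(sentiment_score("Great quarter, love the new product"))  # 1.0
print(sentiment_score("Terrible news, stock is down"))         # -1.0
print(sentiment_score("The meeting is on Tuesday"))            # 0.0
```

A score near 1 suggests mostly positive wording, a score near -1 mostly negative wording, and 0 either a balance or no opinion words at all.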

Today we are going to get experience with Twitter's streaming API and basic text analysis of that streaming data.

## Twitter Streaming API

Details about the API can be found here:

* https://dev.twitter.com/streaming/overview

Luckily for us, there is also a Python library which makes access to the API fast and easy:

* http://tweepy.readthedocs.org/en/v3.5.0/index.html

**Setting up API Credentials**

To use the Twitter API, you must first register an application with Twitter in order to get the required access credentials. Go to the following website to create an account and register an application so you can get the credentials that are required to run the code below.

* https://dev.twitter.com/

Your apps can be managed at:

* https://apps.twitter.com/

When your application is properly configured, you should be able to access "Keys and Access Tokens" in a page that looks like this:

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/twitter_apps.png">

**Getting Started**

To get some experience with the API, we will begin by importing the helper classes and function we built to make using Tweepy easier. Insert your Twitter application credentials into the following code block and run it.

In [None]:
# Import Tweepy functions and include access keys and tokens in global namespace.

from tweet_stream import TwitterAuth, PrintStream, FileStream, get_stream

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

Next, we need to build a Python connector to the streaming API with your credentials. 

In [None]:
# Create an OAuth object and make a connector

auth = TwitterAuth(consumer_key, consumer_secret, access_token, access_token_secret)
con = auth.make_connector()

Finally, we need to use the connector and listener to establish a stream. Streaming is fundamentally different from REST APIs because the HTTP connection stays open and keeps listening for new data instead of closing after a single response. The following diagram represents this process for Twitter:

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/twitter_streaming-intro-2_1.png">
(source: https://dev.twitter.com/streaming/overview)
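The long-lived connection can be pictured as a loop that keeps reading whatever arrives and splits it into messages. The sketch below simulates this without any network access. It assumes messages are newline-delimited JSON and that blank lines act as keep-alive signals. The `iter_messages` function and `fake_chunks` list are hypothetical names for this illustration, not part of Tweepy or the course module.

```python
import json

def iter_messages(chunks):
    """Yield parsed JSON messages from an iterable of text chunks.

    Assumes newline-delimited messages; blank lines are treated as
    keep-alive signals and skipped.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently sitting in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if line:
                yield json.loads(line)

# Simulated network chunks; the second message is split across two chunks.
fake_chunks = ['{"text": "first"}\r\n', '\r\n', '{"text": "sec', 'ond"}\r\n']

for message in iter_messages(fake_chunks):
    print(message["text"])
```

Because the consumer never knows when the next chunk will arrive, it has to buffer partial data until a full line is available, which is one of the main differences from a single request and response.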

We have defined a listener to retrieve the data as it is streamed to your connection. This listener is built using the <code>PrintStream()</code> class, which prints the results of the stream to the console.

Once the stream is set up, we use the <code>filter</code> function to pass a list of search terms in the <code>track</code> parameter, which the Twitter API uses to decide which tweets to send over the connection.
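As a rough mental model of <code>track</code>, each list entry is a phrase, a tweet matches if it satisfies any one phrase, and a phrase is satisfied when all of its space-separated terms appear in the tweet, ignoring case and order. The sketch below imitates that rule locally. The `matches_track` function is a hypothetical helper for illustration, and the real service applies additional rules (for example around hashtags, mentions, and URLs) that are not modeled here.

```python
import re

def matches_track(text, track):
    """Return True if the text satisfies at least one phrase in track."""
    tokens = set(re.findall(r"\w+", text.lower()))
    for phrase in track:
        terms = phrase.lower().split()
        # Every term of the phrase must be present (logical AND);
        # any one matching phrase is enough (logical OR).
        if terms and all(term in tokens for term in terms):
            return True
    return False

print(matches_track("Go BRONCOS!", ['Broncos', 'Cardinals']))          # True
print(matches_track("Nothing about football", ['Broncos', 'Cardinals']))  # False
print(matches_track("a bowl of super soup", ['super bowl']))           # True
```

The last example shows why multi-word phrases can return surprising tweets: the terms only need to be present, not adjacent.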

Given this is a streaming connection, the output from the API will continue until you cancel it.  To cancel it, use the "interrupt kernel" button which is a black square in the Jupyter UI.

In [None]:
# Set up listener and start stream with defined search terms.

listener = PrintStream()
stream = get_stream(con, listener)
stream.filter(track=['Broncos','Cardinals'])

While you may have stopped the console printing, the stream is most likely still active, which you can check by reading the stream's <code>running</code> attribute. Do that in the next code block.

In [None]:
# Check if the stream is still running

stream.running

To stop the connection, use the <code>disconnect()</code> function on the stream and then check if it is still running.

In [None]:
# Disconnect the stream

stream.disconnect()

In [None]:
# Check to see if the stream is still active.

stream.running

Next, instead of printing the output of the API to the console, let's write it to a file. To stop the listener, you again have to interrupt the kernel and disconnect the listener.

In [None]:
# Set up listener and start stream with defined search terms.

listener = FileStream(filepath='naruto.txt')
stream = get_stream(con, listener)
stream.filter(track=['Naruto'])
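The internals of <code>FileStream</code> live in the course's `tweet_stream` module and are not shown here. As a guess at the general shape, a file-writing listener can append each raw message to the file as one line. The `LineFileListener` class below is a hypothetical stand-in, not the course's implementation. Its `on_data` method and `True` return value mirror the Tweepy 3.x listener convention, where returning `False` stops the stream.

```python
import json
import os
import tempfile

class LineFileListener(object):
    """Hypothetical listener that appends each raw message to a file."""

    def __init__(self, filepath):
        self.filepath = filepath

    def on_data(self, raw_data):
        # Store one JSON document per line so the file is easy to re-read.
        with open(self.filepath, "a") as f:
            f.write(raw_data.strip() + "\n")
        return True  # keep listening

# Feed the listener two fake messages instead of a live stream.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_stream.txt")
demo_listener = LineFileListener(demo_path)
demo_listener.on_data('{"text": "hello"}\r\n')
demo_listener.on_data('{"text": "world"}\r\n')

with open(demo_path) as f:
    print([json.loads(line)["text"] for line in f])  # ['hello', 'world']
```

Opening the file in append mode for every message is slow but simple, and it means an interrupted kernel loses at most the message being written.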

Finally, let's open that file and read its JSON contents into a list of Python dictionaries. We built a small function called <code>tweets_list</code> to make this easier.

Each JSON document contains a range of different values most of which can be referenced here:

* https://dev.twitter.com/overview/api/tweets


In [None]:
import json

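def tweets_list(filename):
    """
    Read lines from file into a list of dictionaries.
    """
    tweets = []
    # The with block closes the file even if an error occurs.
    with open(filename, 'r') as f:
        for line in f:
            try:
                tweets.append(json.loads(line))
            except ValueError:
                # Skip blank or truncated lines that are not valid JSON.
                continue
    return tweets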
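Files written from an interrupted stream often contain blank lines or a final line that was cut off mid-write, which is why the reader skips lines that fail to parse. The sketch below shows that behavior on a small temporary file. The `load_json_lines` function is a hypothetical variant written for this illustration that also counts the skipped lines. It is not the function defined above.

```python
import json
import os
import tempfile

def load_json_lines(path):
    """Return (records, skipped) for a file with one JSON document per line."""
    records, skipped = [], 0
    with open(path) as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except ValueError:
                # Blank or truncated line: count it and move on.
                skipped += 1
    return records, skipped

demo_path = os.path.join(tempfile.mkdtemp(), "demo_tweets.txt")
with open(demo_path, "w") as f:
    f.write('{"text": "complete tweet"}\n')
    f.write('\n')                      # blank keep-alive style line
    f.write('{"text": "cut off mid')   # truncated by an interrupted write

records, skipped = load_json_lines(demo_path)
print(len(records))
print(skipped)
```

Here one record loads and two lines are skipped. Tracking the skipped count is a cheap way to notice when a file is more damaged than expected.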

In [None]:
lists = tweets_list('naruto.txt')

In [None]:
from pprint import pprint
import pandas as pd
df = pd.DataFrame(lists)

df



In [None]:
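# Inspect one row's retweeted_status; it is NaN when the tweet is not a retweet.
# Row 26 is an arbitrary example and must exist in your data.
test_me = df.loc[26, 'retweeted_status']

# Keep only original tweets, i.e. rows without a retweeted_status.
org_tweets = df[df['retweeted_status'].isnull()]

# Count how often each tweet text appears across the whole stream.
df['text'].value_counts()

# Collect the text of the original tweets, dropping rows with no text.
org_list = org_tweets['text'].dropna().tolist()

# Join the tweets into one string, separated by spaces so words do not fuse.
text_blob = ' '.join(org_list)

# Preview the first 3000 characters.
text_blob[:3000]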
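Once the tweet text is collected, a simple first pass at text analysis is a word-frequency count, which can be done before handing the text to a library such as TextBlob. The sketch below uses only the standard library on a made-up sample. The `top_words` function, the `STOPWORDS` set, and the `sample` tweets are all assumptions for illustration.

```python
import re
from collections import Counter

# A tiny stop-word list (assumption); real lists are much longer.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "rt"}

def top_words(texts, n=3):
    """Return the n most common non-stop-words across a list of texts."""
    counts = Counter()
    for text in texts:
        # Remove URLs so their fragments are not counted as words.
        cleaned = re.sub(r"https?://\S+", " ", text.lower())
        words = re.findall(r"[a-z0-9#@']+", cleaned)
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

sample = [
    "Naruto is the best anime",
    "Watching Naruto tonight http://t.co/abc",
    "the new naruto episode",
]
print(top_words(sample, n=1))  # [('naruto', 3)]
```

With real data you would pass in the list of tweet texts rather than `sample`, and the search term itself will usually dominate the counts.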