# Week 4 (Tuesday): Streaming Data and Text Analysis

**Objectives**: Today we are going to explore streaming data and text analysis. Specifically, we will cover the following:
  
* Streaming Data
* Build code to connect to the Twitter streaming API
* Experiment with that code to generate different streams
* Use TextBlob to analyze results from the API

## Streaming Data

Streaming data often embodies all three of the "big data" V's: variety, velocity, and volume. Much of this data is also unstructured. In this context, we are discussing streaming data, not streaming media such as video or audio, although streaming media is also a component of "big data".

For analytics and streaming data, a typical use case might look like the following example from Amazon Web Services.

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/Kinesis-Streams_Diagram.png">
(source: https://aws.amazon.com/kinesis/)

A streaming analytics pipeline can be used for many applications including:

* real-time machine learning (recommendations, predictions, etc.)
* real-time analytics
* distributed data generation and collection (JSON activity streams and xAPI)
* Internet of Things data collection
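
To make the idea of real-time analytics concrete, here is a minimal sketch of one pipeline stage: a rolling count of events per key over the most recent events. It is an illustration only. The function name `rolling_counts` and the click data are made up, and an in-memory list stands in for a real stream source such as Kinesis or Kafka.

```python
from collections import Counter, deque


def rolling_counts(events, window_size):
    """Yield a snapshot of per-key counts over the last `window_size` events.

    `events` can be any iterable, including an unbounded generator; each
    event is processed as it arrives and only the window is kept in memory.
    """
    window = deque()
    counts = Counter()
    for key in events:
        window.append(key)
        counts[key] += 1
        if len(window) > window_size:
            # Drop the oldest event so the counts cover only the window.
            oldest = window.popleft()
            counts[oldest] -= 1
            if counts[oldest] == 0:
                del counts[oldest]
        yield dict(counts)


# Hypothetical event stream: page names from a click log.
clicks = ["home", "cart", "home", "checkout", "home"]
snapshots = list(rolling_counts(clicks, window_size=3))
print(snapshots[-1])
```

Because the function is a generator, a downstream consumer sees an updated snapshot after every event instead of waiting for the whole data set, which is the main difference between stream and batch processing.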

Open source platforms used to handle and analyze streaming data include:

* Apache Kafka for messaging: http://kafka.apache.org/
* Apache Storm for real-time computation: http://storm.apache.org/

Advanced firms like Palantir are expanding beyond traditional single-stream applications, such as recommendation, to systems built on analyzing and integrating multiple, disparate data streams, as can be seen in these product descriptions:

* https://www.palantir.com/solutions/insider-threat/
* https://www.palantir.com/palantir-gotham/

Financial firms are also using Twitter data, either directly or from aggregators like RavenPack, to inform or in some cases trigger trades.

* http://www.bloomberg.com/bw/articles/2013-04-24/how-many-hft-firms-actually-use-twitter-to-trade
* http://www.ravenpack.com/

A key tool in the example of financial trading is sentiment analysis. Sentiment analysis uses natural language processing and text processing to infer attitudes about the subject of a text. In simplistic financial terms, if the public has positive attitudes about a given firm or its stock, that is usually correlated with price stability or increases. Recent examples have even included April Fools' jokes on Twitter that seemed to affect stock prices:

* http://blogs.wsj.com/moneybeat/2015/04/01/tesla-stock-moves-on-april-fools-joke/
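
As a rough illustration of the idea, the sketch below scores a piece of text by counting words from two tiny hand-made word lists. The word lists, the `polarity` function, and the sample tweets are all invented for this example; real tools such as TextBlob rely on much larger lexicons and more careful language handling.

```python
import re

# Tiny hand-made lexicons, invented for this example only.
POSITIVE = {"great", "good", "love", "strong", "up", "win"}
NEGATIVE = {"bad", "terrible", "hate", "weak", "down", "loss"}


def polarity(text):
    """Return a score in [-1, 1]: +1 if all opinion words are positive,
    -1 if all are negative, 0.0 if none are found or they balance out."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos + neg == 0:
        return 0.0  # no opinion words found
    return (pos - neg) / (pos + neg)


# Made-up tweets for demonstration.
for tweet in [
    "Love the new model, great quarter!",
    "Terrible earnings, stock is down",
    "Shares opened at 10am",
]:
    print(f"{polarity(tweet):+.2f}  {tweet}")
```

A word-counting approach like this misses sarcasm, negation ("not good"), and jokes, which is one reason automated reactions to tweets can go wrong.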

Real-time analysis of the text data associated with these streams was likely involved in trading decisions like this. More broadly, the use of text analysis and streaming data can be a critical component of any firm's analytics efforts.

Today we are going to get experience with Twitter's streaming API and basic text analysis of that streaming data.

## Twitter Streaming API

Details about the API can be found here:

* https://dev.twitter.com/streaming/overview

Luckily for us, there is also a Python library, Tweepy, that makes access to the API fast and easy:

* http://tweepy.readthedocs.org/en/v3.5.0/index.html

**Setting up API Credentials**

To use the Twitter API, you must first register an application with Twitter in order to get the required access credentials. Go to the following website to create an account and register an application so you can get the credentials that are required to run the code below.

* https://dev.twitter.com/

Your apps can be managed at:

* https://apps.twitter.com/

In [None]:
from tweet_stream import TwitterAuth, PrintStream, FileStream, get_stream

# Paste the credentials from your registered Twitter application here.
consumer_key = 'insert_here'
consumer_secret = 'insert_here'
access_token = 'insert_here'
access_token_secret = 'insert_here'

# Authenticate and open a stream with a PrintStream listener.
auth = TwitterAuth(consumer_key, consumer_secret, access_token, access_token_secret)
con = auth.make_connector()
listener = PrintStream()
stream = get_stream(con, listener)

# Track tweets that mention either team name.
stream.filter(track=['Broncos', 'Cardinals'])
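
The `tweet_stream` module is not shown here, so the following is only a sketch of the pattern it appears to follow: a listener object receives each raw message from the stream, and a `track` list limits which messages are delivered. Everything in it (`FileListener`, `run_stream`, the sample messages) is hypothetical and runs without network access. With the real API, keyword filtering happens on Twitter's servers rather than in your code.

```python
import json
import os
import tempfile


class FileListener:
    """Hypothetical listener that appends each raw message to a file."""

    def __init__(self, path):
        self.path = path

    def on_data(self, raw):
        with open(self.path, "a") as out:
            out.write(raw.strip() + "\n")  # one JSON object per line


def run_stream(messages, listener, track):
    """Deliver messages whose text mentions any tracked keyword."""
    keywords = [word.lower() for word in track]
    delivered = 0
    for raw in messages:
        text = json.loads(raw).get("text", "").lower()
        if any(word in text for word in keywords):
            listener.on_data(raw)
            delivered += 1
    return delivered


# Made-up messages standing in for the live stream.
sample = [
    json.dumps({"text": "Go Broncos!"}),
    json.dumps({"text": "Nice weather today"}),
    json.dumps({"text": "Cardinals up by 3"}),
]

path = os.path.join(tempfile.mkdtemp(), "tweets.txt")
count = run_stream(sample, FileListener(path), track=["Broncos", "Cardinals"])
print(count, "messages saved to", path)
```

Writing one JSON object per line matches the layout the next cell expects when it reads `tweets.txt` line by line.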

In [None]:
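import json

import pandas as pd

# Load the saved tweets: one JSON object per line.
tweets = []
with open('tweets.txt', 'r') as f:
    for line in f:
        try:
            tweets.append(json.loads(line))
        except ValueError:
            # Skip blank or malformed lines.
            continue

df = pd.DataFrame(tweets)

# Retweets carry a 'retweeted_status' field; rows without one are original tweets.
# (.loc replaces the .ix indexer, which has been removed from pandas.)
test_me = df.loc[26, 'retweeted_status']
org_tweets = df[df['retweeted_status'].isnull()]

# Most frequently repeated tweet texts, then a look at the original tweets.
df['text'].value_counts()
org_tweets.loc[0:1000, 'text']

# Combine the text of the original tweets into one string for analysis.
# A space separator keeps the last word of one tweet from fusing with the next.
org_list = org_tweets['text'].dropna().tolist()
text_blob = ' '.join(org_list)
df['text'][1]
text_blob[:3000]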
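
Once the tweet text is in a single string, a simple first analysis is to count the most common words. The sketch below uses only the standard library. The `top_words` function, the sample string, and the small stop-word list are made up for illustration; in the notebook you would pass `text_blob` instead of `sample`.

```python
import re
from collections import Counter

# Small, made-up stop-word list; real analyses use a fuller one.
STOP_WORDS = {"the", "a", "to", "and", "of", "in", "is", "rt"}


def top_words(text, n=3):
    """Return the n most common words, ignoring URLs, case, and stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    # Keep a leading # or @ so hashtags and mentions stay distinct tokens.
    words = re.findall(r"[#@]?[a-z0-9_']+", text)
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


sample = (
    "RT @fan: Broncos win! #Broncos http://t.co/abc "
    "The Cardinals lose to the Broncos"
)
print(top_words(sample))
```

Hashtags and mentions are counted separately from plain words here, so `#broncos` and `broncos` are different tokens; whether to merge them depends on the question you are asking.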