# Week 3 - Cloud-Based Analysis Technologies and External Data Sources

**Objectives**: Today we are going to discuss the category of cloud-based analytics tools and extend our Python workflow to work with one example of such tools. We will also work with a streaming data source. Specifically, we will cover the following:
  
* The Larger Ecosystem of Big Data Technologies
* PaaS Analytics Tools
* Google BigQuery
* External Data Sources
* Twitter streaming

## The Larger Ecosystem of Big Data Technologies

**Analytics PaaS Products**

While Python on its own is an important tool of data science, the nature of "big data" requires a multi-technology approach in most cases. 

Early discussions of the technology of "big data" typically revolved around Hadoop and MapReduce, which were some of the first tools that could handle Internet-scale data sources. More recently, a variety of technologies have emerged in response not only to the larger scale, but also to the increased focus on "analytics 3.0" applications. This week we will explore the general category of cloud-based analytics technologies that usually fall into the Platform as a Service (PaaS) category. These cloud offerings enable firms to outsource the management of various "big data" functions to technology firms both large and small. [Amazon (Amazon Web Services)](https://aws.amazon.com/big-data/), [Google (Cloud Platform)](https://cloud.google.com/solutions/bigdata/), and [Microsoft (Azure)](https://azure.microsoft.com/en-us/blog/topics/big-data/) all have extensive PaaS "big data" offerings. More specialist providers like [HortonWorks](http://hortonworks.com/) and [Databricks](https://databricks.com/) are even hoping to make the entire process of data science accessible to their customers. Databricks, for example, describes their product as:

>Data science made easy, from ingest to production.
>We believe big data should be simple. 
>Apache Spark™ made a big step towards this goal. Databricks makes Spark easy through a cloud-based integrated workspace. (https://databricks.com/product/databricks - Nov 2015)

**Python in the PaaS Ecosystem**

While Python is a strong technology contender in desktop and server-based data science, it is also being used in these PaaS products as both an underlying technical foundation and as a common data science API. Two examples of this are Amazon's Redshift, which now has [user-defined functions (UDFs) that are written in Python](https://aws.amazon.com/blogs/aws/user-defined-functions-for-amazon-redshift/), and [Apache Spark, which has a robust Python API](http://spark.apache.org/docs/latest/api/python/).

Jupyter Notebooks like this one are also part of the "big data" offerings of Databricks, Google, and Amazon:

* https://databricks.com/product/databricks#notebooks
* https://cloud.google.com/datalab/
* https://blogs.aws.amazon.com/bigdata/post/TxX4BY5T1PQ7BQ/Using-IPython-Notebook-to-Analyze-Data-with-Amazon-EMR

Today, we are going to add Google's BigQuery analytics platform to our workflow as an exemplar of the broader category of "big data" PaaS offerings. We will begin with a short description of BigQuery and then move on to working with some of the public datasets on BigQuery.


## Google BigQuery

Google BigQuery is the "productization" of the technology that was code named "Dremel" at Google. In their 2010 whitepaper, Google described Dremel as:

>Dremel is a scalable, interactive ad-hoc query system for analysis
of read-only nested data. By combining multi-level execution
trees and columnar data layout, it is capable of running aggregation
queries over trillion-row tables in seconds. The system scales
to thousands of CPUs and petabytes of data, and has thousands
of users at Google. (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf - Nov 2015)

The whitepaper gave the following examples of how Google has been using Dremel since 2006:

>* Analysis of crawled web documents.
>* Tracking install data for applications on Android Market.
>* Crash reporting for Google products.
>* OCR results from Google Books.
>* Spam analysis.
>* Debugging of map tiles on Google Maps.
>* Tablet migrations in managed Bigtable instances.
>* Results of tests run on Google’s distributed build system.
>* Disk I/O statistics for hundreds of thousands of disks.
>* Resource monitoring for jobs run in Google’s data centers.
>* Symbols and dependencies in Google’s codebase.

At a high level, by focusing on a read-only, columnar data structure instead of a traditional relational database, Google was able to achieve high scale and good interactive performance. Compared to MapReduce models, which are batch based, the technology behind Dremel could enable an interactive data science workflow.
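
To make the columnar idea concrete, here is a toy sketch in plain Python. It is only an illustration of row-oriented versus column-oriented layouts, not a description of how Dremel or BigQuery is actually implemented, and the field names and values are made up for the example.

```python
# Toy illustration of row-oriented versus column-oriented storage.
# This is NOT how Dremel or BigQuery is implemented; the data is made up.

rows = [
    {"user": "ann", "country": "US", "bytes_sent": 120},
    {"user": "bob", "country": "FR", "bytes_sent": 300},
    {"user": "cho", "country": "US", "bytes_sent": 80},
]

# Column-oriented layout: one list per field, aligned by position.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def total_from_rows(records, field):
    """Sum one field by visiting every record in turn."""
    return sum(record[field] for record in records)


def total_from_columns(table, field):
    """Sum one field by reading only that field's column."""
    return sum(table[field])


row_total = total_from_rows(rows, "bytes_sent")
column_total = total_from_columns(columns, "bytes_sent")
print(row_total, column_total)
```

Both functions return the same answer. The difference is that the columnar version never touches the `user` or `country` values, which is the property that helps aggregation queries when a table has many columns and many rows.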

Following the Amazon model of turning internal technologies into PaaS offerings, Google launched BigQuery as a product in 2010. They describe the product as:

>BigQuery is Google's fully managed, NoOps, low cost data analytics service. With BigQuery you have no infrastructure to manage and don't need a database administrator, use familiar SQL and can take advantage of pay-as-you-go model. This collection of features allows you to focus on analyzing data to find meaningful insights. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from startups to Fortune 500 companies.

## Signing up for Google BigQuery

One of the advantages of BigQuery for data science is that it has a web UI, so you can explore data from a web interface instead of having to use a SQL client or API. In our case, this will allow you to test your SQL result prior to importing the results into Python for more detailed analysis.

For our class, BigQuery is also convenient in that Google has several different public datasets that you can access once you set up an account. These public datasets are maintained by Google's Felipe Hoffa, and details are available here:

* https://www.reddit.com/r/bigquery/wiki/datasets


Google has a free trial that includes a $300 credit that can be used over 60 days. BigQuery also has a free usage tier of up to 1 TB of data processed per month, so if you are using the public datasets (which have no storage costs) and only doing exploratory analysis, it is unlikely you will incur any charges using BigQuery in this course.
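
As a rough way to reason about the 1 TB free usage tier, the sketch below adds up the bytes processed by a set of queries and reports how much of the monthly allowance remains. The query sizes are hypothetical, and treating 1 TB as 10**12 bytes is an assumption made for the example; check the billing page for how Google actually measures usage.

```python
# Assumption: 1 TB is treated as 10**12 bytes for this estimate.
FREE_TIER_BYTES = 10**12


def free_tier_remaining(bytes_processed_per_query, free_tier_bytes=FREE_TIER_BYTES):
    """Return the bytes left in the monthly free tier (negative if exceeded)."""
    return free_tier_bytes - sum(bytes_processed_per_query)


# Hypothetical bytes processed by three exploratory queries.
queries = [25 * 10**9, 150 * 10**9, 4 * 10**9]

remaining = free_tier_remaining(queries)
percent_used = 100.0 * sum(queries) / FREE_TIER_BYTES
print("Used {:.1f}% of the free tier, {} bytes remaining".format(percent_used, remaining))
```

A negative result from `free_tier_remaining` would mean the queries went past the free allowance.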

My actual costs running the exercises in these notebooks were $XXXX.

The BigQuery console (https://console.developers.google.com/billing) will allow you to track your usage to ensure you don't incur any charges. The trial credits and free usage tier *should* be more than enough to allow you to complete all the exercises in this course. However, please note:

**IF YOU EXCEED THESE LIMITS THEY WILL CHARGE YOUR CREDIT CARD. YOU ARE RESPONSIBLE FOR MANAGING YOUR GOOGLE USAGE.** 

If you have any concerns about this, please let me know and I can try to accommodate you.

You can sign up with either your ASU Google account or a personal Google account. To sign up from ASU, log in to your ASU account and visit this page, which includes all of the relevant service details:

https://cloud.google.com/



## The BigQuery UI

Once you have signed up for the service and are logged into the BigQuery console, you can add the public datasets to your console by pasting their URLs into your browser. For example, paste or click on the following links after you are logged into the Google Cloud and they should be added to your available projects:

* https://bigquery.cloud.google.com/table/bigquery-samples:reddit.full
* https://bigquery.cloud.google.com/dataset/imjasonh-storage:nfl

The console should look like this:

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/week3-gbq_console.png">



## pandas and BigQuery
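
One possible way to pull BigQuery results into a pandas DataFrame is sketched below. It assumes the separate `pandas-gbq` package is installed and that you have a Google Cloud project to bill the query to; `your-project-id` is a placeholder, and the helper names `build_count_query` and `run_query` are made up for this example. The table is the public Reddit table linked in the previous section, written in standard SQL form, and it may no longer be available. A `COUNT(*)` query is used because it is a cheap way to confirm that the connection works.

```python
def build_count_query(table):
    """Return a standard SQL query that counts the rows in `table`."""
    return "SELECT COUNT(*) AS row_count FROM `{}`".format(table)


def run_query(query, project_id):
    """Run `query` on BigQuery and return the result as a pandas DataFrame.

    Assumes the pandas-gbq package is installed and that you can
    authenticate against the Google Cloud project `project_id`.
    """
    import pandas_gbq  # imported here so the sketch loads without the package

    return pandas_gbq.read_gbq(query, project_id=project_id, dialect="standard")


query = build_count_query("bigquery-samples.reddit.full")
print(query)

# Uncomment once you have a project and credentials in place:
# df = run_query(query, project_id="your-project-id")
```

Testing the query text in the BigQuery web UI first, as suggested earlier, is a good habit before running it from Python.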

## Twitter streaming

In [None]:
from tweet_stream import TwitterAuth, PrintStream, FileStream, get_stream

# Replace each placeholder with the credentials for your own Twitter app.
consumer_key = 'insert_here'
consumer_secret = 'insert_here'
access_token = 'insert_here'
access_token_secret = 'insert_here'

auth = TwitterAuth(consumer_key, consumer_secret, access_token, access_token_secret)
con = auth.make_connector()
listener = PrintStream()
stream = get_stream(con, listener)
stream.filter(track=['Broncos','Cardinals'])

In [14]:
json.load?

In [19]:
import json

# Read one JSON-encoded tweet per line, skipping lines that fail to parse.
tweets = []
with open('tweets.txt', 'r') as f:
    for line in f:
        try:
            tweets.append(json.loads(line))
        except ValueError:
            continue
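
The loading loop above can be tried without a real `tweets.txt` file. The sketch below runs the same idea on a small in-memory sample, assuming one JSON object per line; the sample records and the `parse_json_lines` helper are made up for illustration. It also counts the lines it skips, which makes it easier to notice when a capture file is badly damaged.

```python
import io
import json


def parse_json_lines(lines):
    """Parse one JSON object per line; return (records, skipped_count)."""
    records = []
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines without counting them as errors
        try:
            records.append(json.loads(line))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped += 1
    return records, skipped


# Made-up sample: two valid records, one blank line, one truncated record.
sample = io.StringIO(
    '{"id": 1, "text": "first tweet"}\n'
    '\n'
    '{"id": 2, "text": "truncated\n'
    '{"id": 3, "text": "third tweet"}\n'
)

records, skipped = parse_json_lines(sample)
print(len(records), skipped)
```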

In [36]:
from pprint import pprint
import pandas as pd
df = pd.DataFrame(tweets)


In [76]:
# .ix was removed from pandas; .loc does the same label-based lookup
test_me = df.loc[26, 'retweeted_status']

In [78]:
import numpy as np

In [93]:
org_tweets=df[df['retweeted_status'].isnull()]
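
The same filter can be checked on a tiny, made-up DataFrame. The sketch assumes, as the cell above does, that retweets carry a `retweeted_status` value and original tweets do not. The sample rows are invented for the example.

```python
import pandas as pd

# Made-up rows standing in for parsed tweets; real tweets have many more fields.
sample = pd.DataFrame({
    "text": ["RT @someone: breaking news", "my own thoughts", None, "another original"],
    "retweeted_status": [{"id": 1}, None, None, None],
})

# A missing 'retweeted_status' marks a row as an original tweet.
originals = sample[sample["retweeted_status"].isnull()]

# Some rows have no text at all; drop them before building a list of strings.
original_texts = originals["text"].dropna().tolist()

print(len(sample), len(originals), original_texts)
```

Dropping the missing text values matters later on, because joining a list that contains `NaN` into one string raises a `TypeError`.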

In [96]:
df['text'].value_counts()

RT @StyleFashionHub: People Are Upset With Rob Lowe After He Tweeted About The Paris Attacks -  https://t.co/oo4izI2Lge https://t.co/Xo9uwS…           1532
RT @policia: La Policía francesa solicita colaboración ciudadana para localizar a este sujeto. Si lo ves, avisa #091 #062 #Paris https://t.…           1498
RT @ABC: JUST IN: French jets have begun bombing the defacto ISIS capital of Raqqa, French defense ministry says                                        655
RT @CNN: French Ministry of Defense has announced major bombardment of #ISIS targets in Raqqa, Syria https://t.co/PN8vs7J49z https://t.co/z…            629
RT @ABC: DETAILS: France has bombed a weapons warehouse, a command post and a terrorist recruiting center in de facto ISIS capital of Raqqa.            523
RT @Raqqa_SL: #Raqqa no Civilian got killed or Wounded by the Warplanes Airstrikes until now according to the #Raqqa Hospitals #Syria #ISIL…            456
RT @FoxNews: French Official: Massive airstrikes destroy two jih

In [92]:
len(df)

102336

In [94]:
# Label-based slice over the original row labels (.ix was removed from pandas)
org_tweets.loc[0:1000, 'text']

10     @Independent Is this one of the forgeries made...
11     Shocking: As Paris Burned, UK Muslims Told To ...
13     melhor falar de paris e EI do que mariana porq...
26                                                   NaN
33     Les Libanais se demandent où est leur «Safety ...
40                                                   NaN
45     #фоловинг Киев присоединился к акции солидарно...
51                                                   NaN
54     N'oubliez pas, il faut prier pour toutes les v...
60     Si tu est un bon musulman tu doit poser des bo...
61     @VirusRai2 @vfeltri SU LA7  C'E' UN INTERVISTA...
63     @Louis_Tomlinson Thank you for supporting Pari...
64     france bombing isis? great, other nations shou...
71                           Lit https://t.co/3W3ZyHdpco
78     La Préfecture de Police de Paris interdit les ...
80     Yung pinsan ko gusto din palitan yung dp sa FB...
82     Muslims worldwide take up viral campaign conde...
83     Well war has definitely 

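In [112]:
# Joining the DataFrame directly would only concatenate its column names,
# and missing text values (NaN) would break the join, so keep only the
# non-null values of the 'text' column.
org_list = org_tweets['text'][~org_tweets['text'].isnull()].tolist()

In [113]:
# Concatenate all original tweets into one string for text analysis.
text_blob = ''.join(org_list)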

In [119]:
df['text'][1]

u'RT @RadioLondra_: Qui Radio Londra: La Francia bombarda Raqqa, capitale Isis https://t.co/7emzNmtCbw'

In [116]:
text_blob[:3000]

u'@Independent Is this one of the forgeries made in Turkey? Isis want Europe to turn against refugeesShocking: As Paris Burned, UK Muslims Told To \u2018Struggle\u2019 For Islamic State In Unprecedented Islamist Show Of Force https://t.co/MEbJZ1fW48melhor falar de paris e EI do que mariana porque esse governo de merda n\xe3o escuta ninguem, ninguem escuta ninguemLes Libanais se demandent o\xf9 est leur \xabSafety check\xbb sur @facebook #ISIS #BeirutAttacks #France #Facebook https://t.co/e6hewr4wZY#\u0444\u043e\u043b\u043e\u0432\u0438\u043d\u0433 \u041a\u0438\u0435\u0432 \u043f\u0440\u0438\u0441\u043e\u0435\u0434\u0438\u043d\u0438\u043b\u0441\u044f \u043a \u0430\u043a\u0446\u0438\u0438 \u0441\u043e\u043b\u0438\u0434\u0430\u0440\u043d\u043e\u0441\u0442\u0438 Pray for Paris https://t.co/UrTt5PtJnWN\'oubliez pas, il faut prier pour toutes les victimes d\'actes terribles, pas seulement celles de Paris #PrayForWorld \U0001f64f\U0001f3fd\U0001f30e\U0001f30f\U0001f30d\u2764\ufe0fSi tu est un 