# Week 3 - Cloud-Based Analysis Technologies and External Data Sources

**Objectives**: Today we are going to discuss the category of cloud-based analytics tools and extend our Python workflow to work with one example of such tools. We will also work with a streaming data source. Specifically, we will cover the following:
  
* The Larger Ecosystem of Big Data Technologies
* PaaS Analytics Tools
* Google BigQuery
* pandas and BigQuery
* More dataframe operations

## The Larger Ecosystem of Big Data Technologies

**Analytics PaaS Products**

While Python on its own is an important tool of data science, the nature of "big data" requires a multi-technology approach in most cases. 

Early discussions of the technology of "big data" typically revolved around Hadoop and MapReduce, which were some of the first tools that could handle Internet-scale data sources. More recently, a variety of technologies have emerged in response not only to the larger scale, but also to the increased focus on "analytics 3.0" applications. This week we will explore the general category of cloud-based analytics technologies that usually fall into the Platform as a Service (PaaS) category. These cloud offerings enable firms to outsource the management of various "big data" functions to technology firms both large and small. [Amazon (Amazon Web Services)](https://aws.amazon.com/big-data/), [Google (Cloud Platform)](https://cloud.google.com/solutions/bigdata/), and [Microsoft (Azure)](https://azure.microsoft.com/en-us/blog/topics/big-data/) all have extensive PaaS "big data" offerings. More specialist providers like [HortonWorks](http://hortonworks.com/) and [Databricks](https://databricks.com/) are even hoping to make the entire process of data science accessible to their customers. Databricks, for example, describes their product as:

>Data science made easy, from ingest to production.
>We believe big data should be simple. 
>Apache Spark™ made a big step towards this goal. Databricks makes Spark easy through a cloud-based integrated workspace. (https://databricks.com/product/databricks - Nov 2015)

**Python in the PaaS Ecosystem**

While Python is a strong technology contender in desktop and server-based data science, it is also being used in these PaaS products as both an underlying technical foundation and as a common data science API. Two examples of this include Amazon's Redshift, which now has [user defined functions (UDFs) that are written in Python](https://aws.amazon.com/blogs/aws/user-defined-functions-for-amazon-redshift/), and [Apache Spark, which has a robust Python API](http://spark.apache.org/docs/latest/api/python/).

Jupyter Notebooks like this one are also part of the "big data" offerings of Databricks, Google, and Amazon:

* https://databricks.com/product/databricks#notebooks
* https://cloud.google.com/datalab/
* https://blogs.aws.amazon.com/bigdata/post/TxX4BY5T1PQ7BQ/Using-IPython-Notebook-to-Analyze-Data-with-Amazon-EMR

Today, we are going to add Google's BigQuery analytics platform to our workflow as an exemplar of the broader category of "big data" PaaS offerings. We will begin with a short description of BigQuery and then move on to working with some of the public datasets on BigQuery.


## Google BigQuery

Google BigQuery is the "productization" of the technology that was code named "Dremel" at Google. In their 2010 whitepaper, Google described Dremel as:

>Dremel is a scalable, interactive ad-hoc query system for analysis
of read-only nested data. By combining multi-level execution
trees and columnar data layout, it is capable of running aggregation
queries over trillion-row tables in seconds. The system scales
to thousands of CPUs and petabytes of data, and has thousands
of users at Google. (http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf - Nov 2015)

The whitepaper gave the following examples of how Google has been using Dremel since 2006:

>* Analysis of crawled web documents.
>* Tracking install data for applications on Android Market.
>* Crash reporting for Google products.
>* OCR results from Google Books.
>* Spam analysis.
>* Debugging of map tiles on Google Maps.
>* Tablet migrations in managed Bigtable instances.
>* Results of tests run on Google’s distributed build system.
>* Disk I/O statistics for hundreds of thousands of disks.
>* Resource monitoring for jobs run in Google’s data centers.
>* Symbols and dependencies in Google’s codebase.

At a high level, by focusing on a read-only, columnar data structure instead of a traditional relational database, Google was able to achieve high scale and good interactive performance. Compared to MapReduce models, which are batch based, the technology behind Dremel could enable an interactive data science workflow.
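
To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. It is only an illustration of the two layouts, not of how Dremel or BigQuery store data internally, and the field names and values are made up for the example.

```python
# Toy illustration only: the same three records stored row-wise and column-wise.
rows = [
    {"station": "A", "year": 1989, "mean_temp": 51.2},
    {"station": "B", "year": 1989, "mean_temp": 64.0},
    {"station": "A", "year": 1990, "mean_temp": 49.8},
]

# Columnar layout: one list per field, aligned by position.
columns = {
    "station": [r["station"] for r in rows],
    "year": [r["year"] for r in rows],
    "mean_temp": [r["mean_temp"] for r in rows],
}

# Row-wise aggregation walks every record, including fields it does not need.
row_avg = sum(r["mean_temp"] for r in rows) / len(rows)

# Column-wise aggregation reads only the single column involved.
temps = columns["mean_temp"]
col_avg = sum(temps) / len(temps)

# Both layouts hold the same data, so the two averages agree.
print(row_avg == col_avg)
```

With only three fields the difference is invisible, but for wide tables an aggregation over one column can skip most of the stored data when the layout is columnar.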

Following the Amazon model of turning internal technologies into PaaS offerings, Google launched BigQuery as a product in 2010. They describe the product as:

>BigQuery is Google's fully managed, NoOps, low cost data analytics service. With BigQuery you have no infrastructure to manage and don't need a database administrator, use familiar SQL and can take advantage of pay-as-you-go model. This collection of features allows you to focus on analyzing data to find meaningful insights. BigQuery is a powerful Big Data analytics platform used by all types of organizations, from startups to Fortune 500 companies.

## Signing up for Google BigQuery

One of the advantages of BigQuery for data science is that it has a web UI, so you can explore data from a web interface instead of having to use a SQL client or API. In our case, this will allow you to test your SQL result prior to importing the results into Python for more detailed analysis.

For our class, BigQuery is also convenient in that Google has several publicly available datasets that you can access once you set up an account. These public datasets are maintained by Google's Felipe Hoffa, and details are available here:

* https://www.reddit.com/r/bigquery/wiki/datasets


Google has a free trial that includes a $300 credit that can be used over 60 days. BigQuery also has a free usage tier up to 1 TB of data processed per month, so if you are using the public datasets (which have no storage costs) and only doing exploratory analysis, it is unlikely you will incur any charges using BigQuery in this course.

My actual costs running the exercises in these notebooks were $XXXX.

The BigQuery console (https://console.developers.google.com/billing) will allow you to track your usage to ensure you don't incur any charges. The trial credits and free usage tier *should* be more than enough to allow you to complete all the exercises in this course, but keep the following in mind:

**IF YOU EXCEED THESE LIMITS THEY WILL CHARGE YOUR CREDIT CARD. YOU ARE RESPONSIBLE FOR MANAGING YOUR GOOGLE USAGE.** 

If you have any concerns about this, please let me know and I can try to accommodate you.

You can sign up with either your ASU Google account or a personal Google account. To sign up from ASU, log in to your ASU account and visit this page, which includes all of the relevant service details:

https://cloud.google.com/



## The BigQuery UI

Once you have signed up for the service and are logged into the BigQuery console, you can add the public datasets to your console by pasting their URLs into your browser. For example, paste or click on the following links after you are logged into Google Cloud and they should be added to your available projects:

* https://bigquery.cloud.google.com/table/bigquery-samples:reddit.full
* https://bigquery.cloud.google.com/dataset/imjasonh-storage:nfl

The console should look like this:

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/week3-gbq_console.png">

From the console, you can select a dataset from the left nav which will then make the query box active. In the example below, I have selected the [GSOD dataset which contains weather data from NOAA](https://data.noaa.gov/dataset/global-surface-summary-of-the-day-gsod). Under the query box, there are options to view the table schema and details of the dataset.

The query screen looks like this:

<img src="https://raw.githubusercontent.com/azbones/big_data/master/images/week3-gbq_query.png">



## BigQuery Syntax

Google's BigQuery product uses SQL-like syntax, but does not conform to ANSI SQL. Details of the syntax are included here:

* https://cloud.google.com/bigquery/query-reference

To get some experience in BigQuery, use the UI to derive the following metrics using the specified BigQuery commands. For this exercise, we are not running code in the notebook, so just check your own work.

Use the GSOD weather dataset:

* https://bigquery.cloud.google.com/table/publicdata:samples.gsod

**BigQuery Exercises**

All of these statements begin with <code>SELECT</code> and build out criteria to define the selection.

* Use <code>COUNT</code> and <code>GROUP BY</code> to calculate the number of observations in the year 1989

* Use <code>WHERE</code> to return all of the observations with mean wind speeds greater than 75

* Use <code>COUNT</code> and <code>WHERE</code> to return the number of observations with mean wind speeds greater than 75

* Use <code>AVG</code>, <code>STDDEV</code>, and <code>AS</code> to calculate the average of the mean temperatures across all observations and to give the results a descriptive name

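
If you would like to rehearse these keywords before spending any BigQuery quota, the sketch below runs the same kinds of statements against a tiny in-memory table using Python's built-in `sqlite3` module. The table `obs` and its columns (`year`, `mean_wind`, `mean_temp`) are made up for illustration and are not the GSOD schema. SQLite's dialect also differs from BigQuery's (for example, it has no built-in `STDDEV`), so treat this as practice with the general pattern rather than as answers to the exercises.

```python
import sqlite3

# Build a tiny in-memory table with hypothetical columns and values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (year INTEGER, mean_wind REAL, mean_temp REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [(1989, 12.0, 50.0), (1989, 80.0, 60.0), (1990, 76.5, 70.0), (1990, 5.0, 40.0)],
)

# COUNT + GROUP BY: number of observations per year.
per_year = dict(conn.execute("SELECT year, COUNT(*) FROM obs GROUP BY year").fetchall())

# COUNT + WHERE: number of observations above a wind threshold.
windy = conn.execute("SELECT COUNT(*) FROM obs WHERE mean_wind > 75").fetchone()[0]

# AVG + AS: an aggregate with a descriptive column name.
cur = conn.execute("SELECT AVG(mean_temp) AS avg_mean_temp FROM obs")
col_name = cur.description[0][0]
avg_temp = cur.fetchone()[0]

conn.close()
print(per_year, windy, col_name, avg_temp)
```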

**BigQuery Table Organization**

A common model for sending data to BigQuery splits tables into different time intervals. Querying across such intervals is easy, because multiple tables with the same schema can be listed in the <code>FROM</code> clause, separated by commas. For example, Google Analytics Premium customers can get raw, hit-level data exported to BigQuery. The table naming convention for this appends the specific date of the extract to the table name like this:

<code>dataset.ga_sessions_20151115</code>

BigQuery then allows queries that define a time-based wildcard to retrieve specific date ranges without a scan of all tables like:

````
SELECT COUNT(*)
FROM (TABLE_DATE_RANGE(dataset.ga_sessions_, 
      TIMESTAMP('2015-11-01'), 
      TIMESTAMP('2015-11-15'))) 
WHERE device.browser = "Chrome"
````

With ongoing, time-based data extracts, this kind of organization can further improve performance on multi-terabyte datasets, because many research questions are limited to a specific time period and only the tables for that period need to be scanned.
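
To see which tables a date range like the one above covers under this naming convention, here is a small sketch in Python. The helper `daily_table_names` is hypothetical and written only for this illustration; it mirrors the date-suffix convention and says nothing about how BigQuery resolves `TABLE_DATE_RANGE` internally.

```python
from datetime import date, timedelta


def daily_table_names(prefix, start, end):
    """Return one table name per day from start to end, inclusive."""
    days = (end - start).days
    return [
        prefix + (start + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(days + 1)
    ]


# Same prefix and date range as the query above.
tables = daily_table_names("dataset.ga_sessions_", date(2015, 11, 1), date(2015, 11, 15))
print(len(tables), tables[0], tables[-1])
```

A list like this can also be handy for checking that every daily extract you expect is actually present before you run a query.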

## pandas and BigQuery

The pandas library has a submodule called <code>gbq</code> which offers direct read and write access to Google's BigQuery. For desktop analysis, this submodule works by using [OAuth 2.0](http://oauth.net/2/) to authenticate the desktop application as a valid user of the BigQuery account. Practically, you accomplish this by being logged into BigQuery in your default browser and then running the <code>read_gbq</code> method.

The basic read syntax follows:

```
import pandas as pd

# Replace the placeholder with your own project id from the BigQuery console.
projectid = "xxxxxxxx"

df = pd.read_gbq('SELECT * FROM test_dataset.test_table', projectid)
```

When populated with a valid project id from the BigQuery console, the <code>read_gbq</code> method will open the default browser to an authentication page for BigQuery and, if successful, then save an access token to the local file system to provide access in the future. Once access is granted, <code>read_gbq</code> will send the SQL command in parameter <code>query</code> to BigQuery and return the results. The method's parameters include:


Parameters:	
* query : str : SQL-Like Query to return data values
* project_id : str : Google BigQuery Account project ID.
* index_col : str (optional) : Name of result column to use for index in results DataFrame
* col_order : list(str) (optional) : List of BigQuery column names in the desired order for results DataFrame
* reauth : boolean (default False) : Force Google BigQuery to reauthenticate the user. This is useful if multiple accounts are used.
* verbose : boolean (default True) : Verbose output

An example of the code and the output from a first authentication run is in the code and markdown blocks below. Note that the <code>%time</code> magic command entered before a statement in IPython returns various timing information related to its execution. This information can be useful, especially when calling an external API like BigQuery, to measure performance.


In [None]:
import pandas as pd
from pandas.io import gbq

query = """SELECT count(*)
           FROM publicdata:samples.gsod"""

project_id = 'XXXXXXXXX'  # replace with your own project id (as a string)

%time gsod_year = gbq.read_gbq(query, project_id)

Example Output From First Run:

```
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fbigquery&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&response_type=code&client_id=XXXXXX.apps.googleusercontent.com&access_type=offline

If your browser is on a different machine then exit and re-run this
application with the command-line parameter 

  --noauth_local_webserver

Authentication successful.
Job not yet complete...
CPU times: user 183 ms, sys: 1.36 s, total: 1.54 s
Wall time: 14.4 s
```
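
The `%time` magic is only available inside IPython and Jupyter. As a rough equivalent for a plain Python script, the sketch below measures wall-clock time with the standard library. The `timed` helper is a hypothetical name introduced here, and the `sum` call is just a stand-in workload; in the notebook it would be the `read_gbq` call.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the call you want to measure.
result, elapsed = timed(sum, range(1_000_000))
print("Wall time: %.3f s" % elapsed)
```

Note that this reports wall time only. For a remote service such as BigQuery, wall time is usually the figure of interest, since most of the waiting happens outside your local CPU.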

## More Dataframe Operations

In [None]:
from tweet_stream import TwitterAuth, PrintStream, FileStream, get_stream

# Replace the placeholders below with your own Twitter API credentials.

consumer_key = 'insert_here'
consumer_secret = 'insert_here'
access_token = 'insert_here'
access_token_secret = 'insert_here'

auth = TwitterAuth(consumer_key, consumer_secret, access_token, access_token_secret)
con = auth.make_connector()
listener = PrintStream()
stream = get_stream(con, listener)
stream.filter(track=['Broncos','Cardinals'])

In [None]:
json.load?

In [None]:
import json

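tweets = []
# Read one JSON document per line; the context manager closes the file for us.
with open('tweets.txt', 'r') as f:
    for line in f:
        try:
            tweets.append(json.loads(line))
        except ValueError:
            # Skip blank or malformed lines (for example, partial writes from the stream).
            continue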
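
The line-by-line parsing in the cell above can be tried out without a real `tweets.txt`. The sketch below applies the same pattern to an in-memory sample; the helper name `parse_json_lines` and the sample records are made up for illustration.

```python
import io
import json


def parse_json_lines(handle):
    """Parse one JSON document per line, counting lines that fail to parse."""
    parsed, skipped = [], 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank keep-alive lines
        try:
            parsed.append(json.loads(line))
        except ValueError:
            skipped += 1  # malformed or truncated record
    return parsed, skipped


# Hypothetical sample: two valid records, one blank line, one truncated record.
sample = io.StringIO(
    '{"id": 1, "text": "Go Broncos"}\n'
    '\n'
    '{"id": 2, "text": "truncated...\n'
    '{"id": 3, "text": "Cardinals win"}\n'
)
tweets, skipped = parse_json_lines(sample)
print(len(tweets), "parsed;", skipped, "skipped")
```

Counting the skipped lines, rather than silently dropping them, makes it easier to notice when a capture file is more damaged than expected.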

In [None]:
from pprint import pprint
import pandas as pd
df = pd.DataFrame(tweets)


In [None]:
# .ix was removed from pandas; .loc performs the same label-based lookup here.
test_me=df.loc[26, 'retweeted_status']

In [None]:
import numpy as np

In [None]:
org_tweets=df[df['retweeted_status'].isnull()]

In [None]:
df['text'].value_counts()

In [None]:
len(df)

In [None]:
org_tweets.loc[0:1000, 'text']

In [None]:
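# Joining the DataFrame itself would concatenate its column names, so join the text column.
org_tweets_blob=''.join(org_tweets['text'].dropna())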

In [None]:
org_list=org_tweets['text'][~org_tweets['text'].isnull()].tolist()

In [None]:
# Drop missing text values first; ''.join() fails on non-string entries such as NaN.
org_list=org_tweets['text'].dropna().tolist()

In [None]:
text_blob = ''.join(org_list)

In [None]:
df['text'][1]

In [None]:
text_blob[:3000]
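
The cells above can be summarized on a small example. The sketch below repeats the same steps (keep only original tweets, drop missing text, join into one string) on a few made-up records, assuming pandas is installed. It joins with a space rather than an empty string so that adjacent tweets do not run together; that is a choice made for this sketch, not something the cells above do.

```python
import pandas as pd

# Hypothetical records shaped like parsed tweets; a retweet carries a
# 'retweeted_status' object, while an original tweet has none.
tweets = [
    {"text": "Go Broncos", "retweeted_status": None},
    {"text": "RT Go Broncos", "retweeted_status": {"id": 1}},
    {"text": None, "retweeted_status": None},
    {"text": "Cardinals win", "retweeted_status": None},
]
df = pd.DataFrame(tweets)

# Keep only rows with no retweeted_status, i.e. original tweets.
org_tweets = df[df["retweeted_status"].isnull()]

# Drop rows without text before converting to a list of strings.
org_list = org_tweets["text"].dropna().tolist()

# Join into a single blob for later text analysis.
text_blob = " ".join(org_list)
print(text_blob)
```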