## Week 3 Exercise

**Assignment**: This exercise will use Google BigQuery to explore the GSOD weather dataset.  Specifically, you will:
  
* Use the <code>read_gbq</code> function to retrieve data from the GSOD dataset
* Use pandas to describe the data
* Use SQL-like syntax to return summarized data to pandas from BigQuery
* Use <code>pivot_table</code> to display the data by month and year
* Compare and contrast PaaS versus build-it-yourself analytics solutions
* Discuss approaches to optimizing aggregation performance in a PaaS workflow

For this exercise, complete all the tasks within this notebook, save the entire notebook, and upload it to the Week 3 Assignment for your group on Blackboard. Save the notebook under a new name in the following format:

**Week_3_Exercise_Group_group_number.ipynb**

These in-class exercises are designed to allow you to explore Python with your group and **DO NOT** include step-by-step directions or answers that have only one possibility. Use your team and other resources to determine how best to complete them. Make sure before you turn in your notebook that it runs without errors and the requested output is visible in the notebook. If you go through multiple steps in your code, make sure all those steps are included so that we can evaluate your work.

We are going to begin by exploring the GSOD dataset. Additional details about it can be found here:

* http://www1.ncdc.noaa.gov/pub/data/gsod/readme.txt

To begin, use the <code>read_gbq</code> function to retrieve the temperatures and dew points for the years 2013 and 2014 for days when there was hail from the <code>fh-bigquery:weather_gsod</code> dataset in BigQuery. Then complete the remaining tasks in the code cells that follow.
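The general call pattern for <code>read_gbq</code> can be sketched as follows. This is a minimal illustration, not a solution to the task. It assumes the `pandas-gbq` package is installed and that you have a Google Cloud project to bill queries to. The project ID, table name, and column names are hypothetical placeholders, so check the actual GSOD schema before writing your own query.

```python
def build_query(table, columns, year, flag_column):
    """Assemble a standard-SQL query string selecting rows where a flag is set."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list} "
        f"FROM `{table}` "
        f"WHERE year = {year} AND {flag_column} = TRUE"
    )


def run_query(query, project_id):
    """Send the query to BigQuery and return the result as a pandas DataFrame."""
    # Imported lazily so the query can be built and inspected without pandas-gbq.
    import pandas_gbq
    return pandas_gbq.read_gbq(query, project_id=project_id, dialect="standard")


# Hypothetical names for illustration only; none of these come from the GSOD schema.
query = build_query(
    table="my-project.my_dataset.my_table",
    columns=["station_id", "temperature"],
    year=2013,
    flag_column="some_flag",
)
print(query)
# df = run_query(query, project_id="my-project")  # requires Google Cloud credentials
```

The dataset names in this notebook are written in the legacy `project:dataset` form. Standard SQL writes a table as `project.dataset.table` inside backticks, as in the placeholder above.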

In [None]:
# Create a dataframe from the criteria above and print it.


In [None]:
# Print the average temperature of that dataframe


In [None]:
# Print the average dewpoint of that dataframe


In [None]:
# Print the number of observations in this dataframe


In [None]:
# Print the number of unique stations these observations represent (hint: try value_counts)


In [None]:
# Print the mean, max, and min temperatures and dew points


Next, retrieve temperature and pressure observations in the dataset <code>fh-bigquery:weather_gsod</code> from 2013 that had hail or funnel clouds.
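Once flagged observations like these are in a dataframe, rows can be selected with boolean indexing. The sketch below uses made-up data with illustrative column names (`station`, `temp`, `hail`, `funnel`), which are assumptions and not the GSOD schema. The flags in your own result may also arrive in a different type, so inspect them first.

```python
import pandas as pd

# Made-up observations; the column names are illustrative, not the GSOD schema.
obs = pd.DataFrame({
    "station": ["A", "A", "B", "C"],
    "temp": [50.0, 60.0, 70.0, 80.0],
    "hail": [True, False, True, False],
    "funnel": [False, False, True, True],
})

# A boolean mask keeps only the rows where the condition holds.
hail_days = obs[obs["hail"]]
mean_temp_hail = hail_days["temp"].mean()

# Combine masks with | (or) and & (and).
either = obs[obs["hail"] | obs["funnel"]]

# nunique() counts the distinct values in a column.
funnel_stations = obs.loc[obs["funnel"], "station"].nunique()

print(mean_temp_hail, len(either), funnel_stations)
```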

In [None]:
# Create a dataframe from BigQuery and the criteria above and print it.


In [None]:
# Print the number of observations in the dataframe


In [None]:
# Print some descriptive statistics for the dataframe


In [None]:
# Print the mean temperature on days it hailed


In [None]:
# Print the mean temperature on days when a tornado or funnel cloud was observed
# Remember http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing



In [None]:
# Print the number of stations that observed a tornado


Next, use BigQuery to return the average temperatures by month and year for the larger <code>publicdata:samples.gsod</code> dataset to a new dataframe. Make sure the returned data is ordered by year and month in ascending order. Hint: <code>GROUP BY</code>.

The format should be similar to:

```
YEAR|MONTH|AVERAGE_TEMP
```
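To make the target shape concrete, here is a local pandas sketch on made-up data that produces one row per year and month. It only illustrates the expected layout. The exercise itself asks for the aggregation to happen in BigQuery, where the equivalent clauses are `GROUP BY` and `ORDER BY`. The column names are assumptions chosen for the example.

```python
import pandas as pd

# Made-up daily readings; names are illustrative only.
daily = pd.DataFrame({
    "year": [2001, 2001, 2000, 2000, 2000],
    "month": [1, 1, 2, 1, 1],
    "temp": [30.0, 40.0, 50.0, 20.0, 30.0],
})

# Average within each (year, month) group, then order ascending by year and month.
monthly = (
    daily.groupby(["year", "month"], as_index=False)["temp"]
    .mean()
    .rename(columns={"temp": "average_temp"})
    .sort_values(["year", "month"])
    .reset_index(drop=True)
)
print(monthly)
```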

In [None]:
# Create new dataframe with average temps by year and month


Next, use <code>pivot_table</code> to create a new dataframe that has years as the rows and months as the columns with the average temperatures as the values.
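As a rough illustration of the reshaping step, the sketch below pivots a small made-up long-format dataframe into a year-by-month grid. The column names (`year`, `month`, `average_temp`) are assumptions; use whatever names your query returned.

```python
import pandas as pd

# Made-up long-format data: one row per (year, month) pair.
monthly = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "month": [1, 2, 1, 2],
    "average_temp": [25.0, 50.0, 35.0, 55.0],
})

# Years become the row index, months become the columns.
wide = monthly.pivot_table(index="year", columns="month", values="average_temp")
print(wide)
```

By default <code>pivot_table</code> aggregates duplicate year and month pairs with the mean, so it is harmless here where each pair appears once.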

In [None]:
# Create new dataframe with years as rows and months as columns


To save time, I have created a function called <code>box_plot</code> which will take a properly formatted dataframe and create an inline boxplot. Run the code in the following block to read it into the namespace.

In [None]:
import matplotlib.pyplot as plt
import pylab
import numpy as np
import calendar
%matplotlib inline
pylab.rcParams['figure.figsize'] = 16, 10  #default image size for this interactive session

def box_plot(weather):
    """
    Take a dataframe of years in rows, months in columns, temps in values and display a boxplot.
    """
    # Drop years with missing months without modifying the caller's dataframe
    weather = weather.dropna()
    plt.figure('weather')
    plt.boxplot(weather.values)
    # Label the twelve boxes with abbreviated month names
    month_list=[calendar.month_abbr[i] for i in np.arange(1,13)]
    plt.xticks(range(1,13),month_list, rotation=15)
    plt.xlabel('Month')
    plt.ylabel(u'Average Temperature in \u00b0F')
    plt.title('GSOD Temperatures by Month for All Years and All Stations')
    plt.show()

Use <code>box_plot</code> to create the plot.

In [None]:
# Use box_plot to plot your dataframe


## Written Response 1
(Enter Your Response in This Cell)

What observations can you make about this dataset from the boxplot (you can use bullets in your response)?

## Written Response 2
(Enter Your Response in This Cell)

* Discuss some examples of data you may have had experience with that was in a record format.

## Written Response 3
(Enter Your Response in This Cell)

Some recent IT articles have suggested most firms deal with small or medium data rather than Internet-scale or "big data".

* What upsides and downsides do you see for a firm that has medium or small data using technologies that were built for large-scale applications at companies like Google and Amazon?

## Written Response 4
(Enter Your Response in This Cell)

What are the benefits and drawbacks of a "big data" technology strategy that:

* uses a PaaS analytics provider like Google?
* uses open-source software like Hadoop or Spark on the firm's server infrastructure? 

## Written Response 5
(Enter Your Response in This Cell)

Given a data science workflow that uses Google BigQuery in the cloud and Python on the desktop, how would you approach determining where to perform aggregate functions? Specifically, when should you use SQL aggregate functions on BigQuery versus in-memory pandas aggregate functions on your desktop?