# Weather Data in BigQuery
In this lab you will analyze historical weather observations using BigQuery and combine weather data with other datasets. This lab uses two public datasets in BigQuery: weather data from NOAA and citizen complaints data from New York City.

## What you'll learn
In this lab, you will:

* Carry out interactive queries on the BigQuery console.
* Combine and run analytics on multiple datasets.

# Content
1. [Explore weather data](#1)
2. [Explore New York citizen complaints data](#2)
3. [Find correlation between weather and complaints](#3)
4. [Summary]()

## Provide your credentials to the runtime

In [None]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

## Optional: Enable data table display

Colab includes the google.colab.data_table package that can be used to display large pandas dataframes as an interactive data table. It can be enabled with:

In [None]:
%load_ext google.colab.data_table

If you would prefer to return to the classic Pandas dataframe display, you can disable this by running:

In [None]:
%unload_ext google.colab.data_table

## 1. Explore weather data <a name="1"></a>

### 1. Use BigQuery via magics

The `google.cloud.bigquery` library includes a `%%bigquery` cell magic that runs a query and either displays the result or saves it to a variable as a `DataFrame`.

In [None]:
%%bigquery --project bigquery-288322
SELECT
  stn,
  -- Create a timestamp from the date components.
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace the sentinel values that mark missing readings:
  -- NULL for temperature and wind speed, 0 for precipitation.
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC

Unnamed: 0,stn,timestamp,temperature,wind_speed,precipitation
0,725030,2020-06-12 00:00:00+00:00,77.9,6.8,0.05
1,744860,2020-06-12 00:00:00+00:00,71.5,7.5,0.77
2,725030,2019-06-12 00:00:00+00:00,68.2,9.2,0.37
3,744860,2019-06-12 00:00:00+00:00,67.3,9.8,0.2
4,725030,2018-06-12 00:00:00+00:00,66.3,7.8,0.0
5,744860,2018-06-12 00:00:00+00:00,61.3,6.8,0.02
6,725030,2017-06-12 00:00:00+00:00,86.7,8.5,0.0
7,744860,2017-06-12 00:00:00+00:00,79.6,9.4,0.0
8,725030,2016-06-12 00:00:00+00:00,80.2,14.0,0.1
9,744860,2016-06-12 00:00:00+00:00,80.7,15.0,0.0
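
The query replaces sentinel values (`9999.9`, `"999.9"`, `99.99`) before averaging. Because SQL `AVG` skips NULL inputs, replacing a sentinel with NULL and replacing it with 0 give different averages. The following is a minimal pure-Python sketch of that difference; the helper names `replace_sentinel` and `sql_avg` and the sample readings are hypothetical and exist only for illustration.

```python
from typing import Iterable, Optional


def replace_sentinel(value: float, sentinel: float,
                     replacement: Optional[float] = None) -> Optional[float]:
    """Mirror IF(value = sentinel, replacement, value) from the query."""
    return replacement if value == sentinel else value


def sql_avg(values: Iterable[Optional[float]]) -> Optional[float]:
    """Average that skips None, the way SQL AVG skips NULL."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Hypothetical readings; 9999.9 marks a missing temperature.
temps = [70.0, 9999.9, 80.0]

# Sentinel mapped to None (NULL): the missing reading is ignored.
as_null = sql_avg(replace_sentinel(t, 9999.9) for t in temps)
# Sentinel mapped to 0: the missing reading counts as a real zero.
as_zero = sql_avg(replace_sentinel(t, 9999.9, 0.0) for t in temps)

print(as_null)  # 75.0
print(as_zero)  # 50.0
```

Mapping a missing precipitation value to 0, as the query does, keeps that row in the average and treats the day as dry, whereas mapping to NULL would leave it out of the average entirely.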


### 2. Use BigQuery through google-cloud-bigquery
See the BigQuery documentation and the client library reference documentation.

In [None]:
from google.cloud import bigquery

# Declare the Cloud project ID which will be used throughout this notebook
project_id = 'bigquery-288322'

client = bigquery.Client(project=project_id)
sql = """
SELECT
  stn,
  -- Create a timestamp from the date components.
  TIMESTAMP(CONCAT(year,"-",mo,"-",da)) AS timestamp,
  -- Replace the sentinel values that mark missing readings:
  -- NULL for temperature and wind speed, 0 for precipitation.
  AVG(IF (temp=9999.9,
      null,
      temp)) AS temperature,
  AVG(IF (wdsp="999.9",
      null,
      CAST(wdsp AS Float64))) AS wind_speed,
  AVG(IF (prcp=99.99,
      0,
      prcp)) AS precipitation
FROM
  `bigquery-public-data.noaa_gsod.gsod20*`
WHERE
  CAST(YEAR AS INT64) > 2010
  AND CAST(MO AS INT64) = 6
  AND CAST(DA AS INT64) = 12
  AND (stn="725030" OR  -- La Guardia
    stn="744860")    -- JFK
GROUP BY
  stn,
  timestamp
ORDER BY
  timestamp DESC,
  stn ASC
"""
df = client.query(sql).to_dataframe()
df.head()

Unnamed: 0,stn,timestamp,temperature,wind_speed,precipitation
0,725030,2020-06-12 00:00:00+00:00,77.9,6.8,0.05
1,744860,2020-06-12 00:00:00+00:00,71.5,7.5,0.77
2,725030,2019-06-12 00:00:00+00:00,68.2,9.2,0.37
3,744860,2019-06-12 00:00:00+00:00,67.3,9.8,0.2
4,725030,2018-06-12 00:00:00+00:00,66.3,7.8,0.0
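
The result has one row per station per date, so comparing the two airports side by side requires reshaping it. The sketch below does this with the standard library only, using the first four result rows copied from the output above; the function name `temperature_by_date` is hypothetical. In the notebook, `df.pivot(index="timestamp", columns="stn", values="temperature")` would be a plausible pandas equivalent.

```python
import csv
import io

# First four rows of the result shown above.
RESULT_CSV = """stn,timestamp,temperature,wind_speed,precipitation
725030,2020-06-12 00:00:00+00:00,77.9,6.8,0.05
744860,2020-06-12 00:00:00+00:00,71.5,7.5,0.77
725030,2019-06-12 00:00:00+00:00,68.2,9.2,0.37
744860,2019-06-12 00:00:00+00:00,67.3,9.8,0.2
"""

# Station IDs and names as given in the query comments.
STATION_NAMES = {"725030": "La Guardia", "744860": "JFK"}


def temperature_by_date(csv_text):
    """Return {date: {station_name: temperature}} from CSV text."""
    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        date = row["timestamp"][:10]  # keep only YYYY-MM-DD
        name = STATION_NAMES[row["stn"]]
        table.setdefault(date, {})[name] = float(row["temperature"])
    return table


by_date = temperature_by_date(RESULT_CSV)
for date, temps in sorted(by_date.items()):
    gap = round(temps["La Guardia"] - temps["JFK"], 1)
    print(date, temps, "gap:", gap)
```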


## 2. Explore New York citizen complaints data <a name="2"></a>

What are the most common complaints?

In [None]:
%%bigquery --project bigquery-288322
SELECT
  EXTRACT(YEAR
  FROM
    created_date) AS year,
  complaint_type,
  COUNT(1) AS num_complaints
FROM
  `bigquery-public-data.new_york.311_service_requests`
GROUP BY
  year,
  complaint_type
ORDER BY
  num_complaints DESC

Unnamed: 0,year,complaint_type,num_complaints
0,2017,Noise - Residential,230152
1,2016,HEAT/HOT WATER,227959
2,2015,HEAT/HOT WATER,225706
3,2016,Noise - Residential,221906
4,2010,HEATING,214218
...,...,...,...
2024,2018,Sprinkler - Mechanical,1
2025,2018,Fire Alarm - Modification,1
2026,2017,Advocate-Business Tax,1
2027,2015,Advocate-Prop Class Incorrect,1
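
Because the query groups by both `year` and `complaint_type`, the same complaint type appears once per year in the result. To rank complaint types overall, the yearly counts need to be summed. The sketch below shows the idea with `collections.Counter`, using only the five top rows visible in the output above, so the totals are partial and purely illustrative. Dropping `year` from the `SELECT` and `GROUP BY` clauses should give the full ranking directly in SQL.

```python
from collections import Counter

# (year, complaint_type, num_complaints) rows copied from the output above.
rows = [
    (2017, "Noise - Residential", 230152),
    (2016, "HEAT/HOT WATER", 227959),
    (2015, "HEAT/HOT WATER", 225706),
    (2016, "Noise - Residential", 221906),
    (2010, "HEATING", 214218),
]

# Sum the yearly counts for each complaint type.
totals = Counter()
for _year, complaint_type, num_complaints in rows:
    totals[complaint_type] += num_complaints

# most_common() returns the types sorted by descending total.
for complaint_type, total in totals.most_common():
    print(f"{complaint_type}: {total}")
```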


## 3. Find correlation between weather and complaints <a name="3"></a>

In [None]:
sql = """
SELECT
  descriptor,
  sum(complaint_count) as total_complaint_count,
  count(temperature) as data_count,
  ROUND(corr(temperature, avg_count),3) AS corr_count,
  ROUND(corr(temperature, avg_pct_count),3) AS corr_pct
FROM (
SELECT
  avg(pct_count) as avg_pct_count,
  avg(day_count) as avg_count,
  sum(day_count) as complaint_count,
  descriptor,
  temperature
FROM (
  SELECT
    DATE(timestamp) AS date,
    temperature
  FROM
    demos.nyc_weather) a
  JOIN (
  SELECT x.date, descriptor, day_count, day_count / all_calls_count as pct_count
  FROM
    (SELECT
      DATE(created_date) AS date,
      concat(complaint_type, ": ", descriptor) as descriptor,
      COUNT(*) AS day_count
    FROM
      `bigquery-public-data.new_york.311_service_requests`
    GROUP BY
      date,
      descriptor)x
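    JOIN (
      -- Total 311 calls per day, used as the denominator for pct_count.
      SELECT
        DATE(created_date) AS date,
        COUNT(*) AS all_calls_count
      FROM `bigquery-public-data.new_york.311_service_requests`
      GROUP BY date
    )y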
  ON x.date=y.date
)b
ON
  a.date = b.date
GROUP BY
  descriptor,
  temperature
)
GROUP BY descriptor
HAVING
  total_complaint_count > 5000 AND
  ABS(corr_pct) > 0.5 AND
  data_count > 5
ORDER BY
  ABS(corr_pct) DESC
"""
df1 = client.query(sql).to_dataframe()
df1.head()
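
The query relies on BigQuery's `CORR` function, which returns the Pearson correlation coefficient, and keeps only descriptors whose absolute correlation with temperature exceeds 0.5. The following is a minimal sketch of that calculation and filter in plain Python. The descriptors and the numbers in `samples` are hypothetical, chosen only to show one strongly correlated series and one weakly correlated series.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (temperatures, average daily counts) per complaint descriptor.
samples = {
    "Heating: no heat": ([30, 40, 50, 60, 70], [90, 70, 50, 30, 10]),
    "Noise: loud music": ([30, 40, 50, 60, 70], [20, 21, 19, 20, 20]),
}

for descriptor, (temps, counts) in samples.items():
    corr = round(pearson_corr(temps, counts), 3)
    kept = abs(corr) > 0.5  # same threshold as the HAVING clause
    print(descriptor, corr, "kept" if kept else "dropped")
```

A coefficient near -1 means complaints fall as temperature rises, a coefficient near +1 means they rise with temperature, and values near 0 indicate no linear relationship, which is why the query filters on the absolute value.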