# Tutorial 2.2. Introduction to Google Cloud Platform and BigQuery

by Nadzeya Laurentsyeva @ nadzeya.laurentsyeva@econ.lmu.de

# 1. Introduction 

## Motivation for learning about Google Cloud and BigQuery
* Great computational capacities!
* Publicly available big datasets https://cloud.google.com/bigquery/public-data/
* Many additional apps for data analysis 

**Note:** the service is not entirely free. 

## Covered in this lecture 
**Tools** 
* Google Cloud Platform/BigQuery Dashboard
* Python client for Google Cloud - optional; if there is interest, we can discuss it in office hours/buffer time

**Operations**  
* Querying databases using Google BigQuery and exporting the results 

**Software requirements** 
* Python 3
* Python packages: pandas, google-cloud-bigquery, virtualenv (recommended for installing google-cloud packages)
    * https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/
    * https://googleapis.dev/python/bigquery/latest/index.html 

**Data**
* ghtorrentmysql1906:MySQL1906 (stored on the Google Cloud); it is a full copy of the latest GHtorrent download https://ghtorrent.org/downloads.html 

## 2. Basic concepts

Go to https://cloud.google.com/, open the Console, and log in with your Google Account credentials. This will bring you to the Dashboard.

Relevant elements on the Dashboard
* **Projects** - organise and govern your activities in the Cloud
    * Navigate and launch cloud tools (e.g. BigQuery, Storage, Dataprep, etc.)
    * Work collaboratively 
    * API manager 
    
* **Resources** - tools and apps that a project uses in the Cloud, two examples:
    * Storage - used to upload large (raw) files for further analysis or to export query results; these data can then be used to create tables and analyse them in BigQuery
    * Datasets in BigQuery - tables ready to be queried 
    
* **Billing** - you are billed for the resources you use https://cloud.google.com/pricing/list, https://cloud.google.com/products/calculator
    * Storage - billed for bucket storage 
    * BigQuery
        * for Query processing 
        * for Table storage 
    * Some usage is free, though: https://cloud.google.com/free/docs/gcp-free-tier 
    * Example: BigQuery pricing https://cloud.google.com/bigquery/pricing includes a free tier:
        * 1 TB of query processing per month
        * 10 GB of storage per month
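
To get a feeling for what the free query quota means in practice, the arithmetic can be sketched in a few lines of Python. This is only a rough illustration: the 50 GiB query size is a made-up example, and interpreting "1 TB" as 2**40 bytes is an assumption that should be checked against the pricing page.

```python
# Assumption: the monthly free query quota of "1 TB" is taken as 2**40 bytes (1 TiB).
# Check the pricing page for the current definition before relying on it.
FREE_QUERY_BYTES_PER_MONTH = 2**40


def free_tier_share(bytes_processed, free_bytes=FREE_QUERY_BYTES_PER_MONTH):
    """Fraction of the monthly free query quota that one query consumes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / free_bytes


def queries_within_free_tier(bytes_per_query, free_bytes=FREE_QUERY_BYTES_PER_MONTH):
    """How many queries of the same size fit into the monthly free quota."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return free_bytes // bytes_per_query


# Hypothetical example: a query that scans 50 GiB of data.
example_query_bytes = 50 * 2**30
print(f"Share of free quota: {free_tier_share(example_query_bytes):.1%}")
print(f"Queries per month within the free tier: {queries_within_free_tier(example_query_bytes)}")
```

Under these assumptions, roughly twenty such queries fit into one month of free usage, so a few careless full-table scans can use up the quota quickly.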
    

## 3. Exercises

1. In the Google Cloud Dashboard: create a new project (you can give it any name)

2. In the Navigation panel: find and click BigQuery (for convenience you can pin tools that you use often)

3. Resources: lets you add and keep (pin) data for your queries
    * Your own data: you upload a raw CSV file to Storage and then create a table based on the uploaded file(s)
    * Publicly available datasets https://cloud.google.com/bigquery/public-data/
        * Those explicitly shared by Google
        * Those shared by other organisations/users
        * Sample datasets 
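
For your own data, the "upload to Storage, then create a table" step can also be done from Python instead of the Dashboard. The following is a minimal sketch using the google-cloud-bigquery package, assuming the CSV file is already in a Storage bucket and that you are authenticated; the bucket, file, project, dataset and table names are all hypothetical placeholders.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that BigQuery uses to locate a file in Cloud Storage."""
    return f"gs://{bucket_name}/{blob_name.lstrip('/')}"


def load_csv_into_table(uri, table_id, client=None):
    """Create (or append to) a BigQuery table from a CSV file stored in Cloud Storage."""
    # Imported here so the helper above works even without the client installed.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the first row holds column names
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
    return client.get_table(table_id).num_rows


# Hypothetical names; replace with your own bucket, file, project, dataset and table.
uri = gcs_uri("my-bucket", "raw/commits.csv")
print(uri)
# rows = load_csv_into_table(uri, "my-project.my_dataset.commits")
```

The last line is commented out because it needs valid credentials and an existing dataset; schema autodetection is convenient for a first look, but an explicit schema is safer for real work.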
    
4. How to pin a dataset to your project
    * If you know the dataset's exact name, you can pin it directly: try ADD Data > Pin project > ghtorrentmysql1906 
    * Otherwise, you can do it from the webpage of a relevant public dataset. Example:   
    https://console.cloud.google.com/marketplace/details/johnshopkins/covid19_jhu_global_cases
    * Once you press 'View Dataset', BigQuery opens and the dataset is automatically pinned in your project
    * Another very useful feature is the set of sample queries provided for publicly available datasets. If you click _Run this query_, the query opens in BigQuery and the dataset is pinned for you. 
    
5. We can now repeat the steps I took to generate the GitHub datasets for the previous lecture. 

I extracted all commits in October 2018

https://console.cloud.google.com/bigquery?sq=349652669346:2e2174ccf7514243a07c2a2852cad36a
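
If the saved query link does not open for you, a query of this kind might look roughly like the sketch below. It is not necessarily the query behind the link: the table name `ghtorrentmysql1906.MySQL1906.commits` and the column names are assumptions based on the GHTorrent schema, so check them in the BigQuery schema view before running anything.

```python
from datetime import date

# Assumed table reference (project.dataset.table); verify it in the BigQuery console.
COMMITS_TABLE = "ghtorrentmysql1906.MySQL1906.commits"


def month_bounds(year, month):
    """Return the first day of the month and the first day of the following month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def commits_in_month_sql(year, month, table=COMMITS_TABLE):
    """Build a standard SQL query selecting all commits created in the given month."""
    start, end = month_bounds(year, month)
    # Column names are assumed; selecting only the needed columns keeps the query small.
    return (
        "SELECT id, author_id, project_id, created_at\n"
        f"FROM `{table}`\n"
        f"WHERE created_at >= TIMESTAMP('{start.isoformat()}')\n"
        f"  AND created_at < TIMESTAMP('{end.isoformat()}')"
    )


print(commits_in_month_sql(2018, 10))
```

The printed text can be pasted into the BigQuery query editor. Using a half-open interval (from the first day of the month up to, but not including, the first day of the next month) avoids having to know how many days the month has.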


I then extracted repositories and users that had commits in October 2018

* https://console.cloud.google.com/bigquery?sq=349652669346:ee4c923ca8f34cb8b52a97b947ceb4ba

* https://console.cloud.google.com/bigquery?sq=349652669346:d5c64f3b40474b20a8bb57a98a1cc4e7

I saved the results to Google Drive and you know what happened next :) 


6. You can now practice on your own and, for example, check whether our results for October 2018 are representative of all commits in 2018. Notice how fast the merging is! Just watch out for the size of your queries, since query processing is billed. 

7. The Dashboard is good for familiarising yourself with Google Cloud tools. If you want to use it for research/work, it makes sense to install a client for your preferred programming language.
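
As an illustration of the Python client, the sketch below first asks BigQuery how many bytes a query would process (a "dry run", which does not execute the query), refuses to continue above a chosen limit, and only then runs the query and saves the result as a CSV file. It assumes the google-cloud-bigquery and pandas packages are installed and that you are authenticated; depending on the client version, `to_dataframe()` may need additional packages. The 10 GiB limit, the project ID and the file name are arbitrary placeholders.

```python
def format_bytes(num_bytes):
    """Format a byte count using binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def within_budget(estimated_bytes, max_bytes):
    """True if the estimated query size does not exceed the chosen limit."""
    return estimated_bytes <= max_bytes


def estimate_query_bytes(sql, client):
    """Dry-run the query: BigQuery validates it and reports the bytes it would process."""
    from google.cloud import bigquery  # imported lazily; requires google-cloud-bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def run_query_to_csv(sql, csv_path, client, max_bytes=10 * 2**30):
    """Run the query only if its estimated size is within max_bytes, then export to CSV."""
    estimated = estimate_query_bytes(sql, client)
    if not within_budget(estimated, max_bytes):
        raise ValueError(
            f"Query would process {format_bytes(estimated)}, "
            f"above the limit of {format_bytes(max_bytes)}"
        )
    df = client.query(sql).to_dataframe()  # downloads the result as a pandas DataFrame
    df.to_csv(csv_path, index=False)
    return df


# Usage (needs credentials, so it is left commented out; names are placeholders):
# from google.cloud import bigquery
# client = bigquery.Client(project="my-project")
# df = run_query_to_csv("SELECT 1 AS x", "result.csv", client)
```

Checking the estimate before running is a cheap habit that guards against accidentally scanning a very large table.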
    

## Resources

Coursera Cloud Platform Courses
https://www.coursera.org/specializations/from-data-to-insights-google-cloud-platform 

Optimising queries
https://medium.com/google-cloud/bigquery-optimized-cluster-your-tables-65e2f684594b

GitHub queries
* https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/?fbclid=IwAR1E01NhM1kFZE4TM_XC6aDhkWSm2s8oCIsKXA4EcsiixnNdsBo22Kjlwho 

* https://github.com/fhoffa/analyzing_github/

* Note: there are currently several GitHub datasets available on BigQuery
    * GHtorrent data: ghtorrentmysql1906 - contains a publicly available GHtorrent dump from June 2019
    * GHtorrent data 2: ghtorrent-bq - GHtorrent dumps from 2017 and 2018 https://ghtorrent.org/ 
    * GitHub Activity data: bigquery-public-data:github_repos. Contains contents from 2.9M public, open source licensed repositories on GitHub. https://console.cloud.google.com/marketplace/details/github/github-repos?filter=solution-type:dataset&q=github&id=46ee22ab-2ca4-4750-81a7-3ee0f0150dcb
    * GitHub Archive data: githubarchive. Contains data on GitHub events. https://www.gharchive.org/
    
