## Setting up and accessing Google Cloud Platform (BigQuery) via the client

**Installing packages** 

* virtualenv (recommended for installing google-cloud packages)
    * https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/
    * https://janakiev.com/blog/jupyter-virtual-envs/ (good instructions on creating a virtual environment with Anaconda and adding it to your Jupyter notebook)

* google-cloud-bigquery 
    * https://cloud.google.com/bigquery/docs/reference/libraries 
    
**Creating a virtual environment with Anaconda and adding it to Jupyter** 

1. Open the Anaconda prompt.
2. _conda create -n myenv_, where myenv is any name you choose for the virtual environment. It is stored in the envs folder of your Anaconda directory.
3. To start working in it: _conda activate myenv_
    * to stop: _conda deactivate_
    * to list all available environments: _conda env list_
    * to remove an environment: _conda env remove -n myenv_
4. After activating the virtual environment, you need to add it to Jupyter. First, install ipykernel, which provides the IPython kernel for Jupyter: _pip install --user ipykernel_
5. This command adds the environment to Jupyter: _python -m ipykernel install --user --name=myenv_
6. You should see output like the following: _Installed kernelspec myenv in /home/user/.local/share/jupyter/kernels/myenv_
7. Now you are able to choose this new environment as a kernel in Jupyter 
    * In an open notebook: Kernel --> Change Kernel --> myenv
    * If you later remove the virtual environment, you can also remove its kernel from Jupyter: _jupyter kernelspec uninstall myenv_
8. Again from Anaconda prompt (make sure you are in the virtual environment): _pip install google-cloud-bigquery_
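
After selecting the new kernel, a quick check from inside the notebook can confirm that it really runs in the new environment and that the package is visible to it. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the module can be found by the current interpreter."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "google.cloud") is missing.
        return False


# The interpreter path should point into the envs/myenv folder.
print("Interpreter:", sys.executable)
print("google-cloud-bigquery available:", is_installed("google.cloud.bigquery"))
```

If the interpreter path points somewhere else, the notebook is probably still using a different kernel.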

**Setting up authentication on Google Cloud** (you can do the same using the command line)
1. In the Cloud Console, select the relevant project and go to the Create service account key page.  https://console.cloud.google.com/apis/credentials/serviceaccountkey 
2. From the Service account list, select New service account. In the Service account name field, enter a name. From the Role list, select Project > Owner.
3. Save the generated JSON file.
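
Before pointing the client at the saved file, it can help to confirm that the download is in fact a service account key. This is a sketch under the assumption that the file follows the usual layout of such keys (a JSON object with `type`, `project_id`, `private_key` and `client_email` fields); `check_key_file` is a hypothetical helper, not part of any Google library.

```python
import json

# Fields assumed to be present in a service account key file.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path):
    """Read the key file and return (project_id, client_email) if it looks valid."""
    with open(path, encoding="utf-8") as fh:
        key = json.load(fh)
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")
    return key["project_id"], key["client_email"]


# Usage (the path is a placeholder for wherever the JSON file was saved):
# project_id, email = check_key_file("client.json")
```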

In [1]:
# Import the packages
import os

import pandas as pd
from google.cloud import bigquery

In [2]:
# Add path to your .json file with credentials
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.path.abspath("C:/Users/nadzeya/Documents/teaching/Data_science_2020_ss/lectures/platform client/client.json")

In [3]:
# Construct a BigQuery client object.
client = bigquery.Client()

# GitHub datasets
GitHub queries:
* https://github.blog/2017-01-19-github-data-ready-for-you-to-explore-with-bigquery/?fbclid=IwAR1E01NhM1kFZE4TM_XC6aDhkWSm2s8oCIsKXA4EcsiixnNdsBo22Kjlwho 

* https://github.com/fhoffa/analyzing_github/

* Note: several GitHub datasets are currently available on BigQuery:
    * GHTorrent data: ghtorrentmysql1906 contains the GHTorrent dump from June 2019.
    * GHTorrent data 2: ghtorrent-bq contains GHTorrent dumps from 2017 and 2018. https://ghtorrent.org/
    * GitHub Activity data: bigquery-public-data:github_repos contains the contents of 2.9M public, open-source-licensed repositories on GitHub. https://console.cloud.google.com/marketplace/details/github/github-repos?filter=solution-type:dataset&q=github&id=46ee22ab-2ca4-4750-81a7-3ee0f0150dcb
    * GitHub Archive data: githubarchive contains data on GitHub events. https://www.gharchive.org/

In [4]:
# Summary statistics: user activity by country code
# Used to check the share of events for which a user's location is known

query = """
SELECT y, m, country_code, city, long, lat, type,
       SUM(events) AS Events, COUNT(DISTINCT login) AS Users
FROM
  -- a: event counts per month, user and event type, October to December 2019
  (SELECT EXTRACT(YEAR FROM created_at) AS y, EXTRACT(MONTH FROM created_at) AS m,
          actor.login AS actor_login, type, COUNT(*) AS events
   FROM `githubarchive.month.2019*`
   WHERE _TABLE_SUFFIX BETWEEN '10' AND '12'
   GROUP BY y, m, actor.login, type
  ) a
JOIN
  -- b: user locations from the June 2019 GHTorrent dump
  (SELECT login, country_code, city, long, lat
   FROM `ghtorrentmysql1906.MySQL1906.users`) b
ON a.actor_login = b.login
GROUP BY y, m, country_code, city, long, lat, type
"""
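
The wildcard in the `FROM` clause and the `_TABLE_SUFFIX` filter together select which monthly tables are scanned. The sketch below mimics that selection in plain Python, assuming the dataset holds one table per month named `201901` to `201912`; the function `matched_tables` is purely illustrative and does not talk to BigQuery.

```python
def matched_tables(prefix, suffixes, low, high):
    """Return the table names whose suffix lies between low and high (string comparison)."""
    return [prefix + suffix for suffix in suffixes if low <= suffix <= high]


# Assumed suffixes: whatever follows "2019" in the monthly table names.
months = [f"{month:02d}" for month in range(1, 13)]

tables = matched_tables("githubarchive.month.2019", months, "10", "12")
print(tables)
# ['githubarchive.month.201910', 'githubarchive.month.201911', 'githubarchive.month.201912']
```

The comparison is on strings, which is why the suffixes are zero-padded: `'9'` would sort after `'12'`, whereas `'09'` sorts before `'10'`.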

In [5]:
# Job configurations: a dry run only estimates the bytes a query would process
dry = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
run = bigquery.QueryJobConfig(dry_run=False, use_query_cache=True)

In [6]:
# Check how much data the query would process (dry run, so the query is not executed)
job = client.query(query, job_config=dry)
print("Total GB that will be processed: ", job.total_bytes_processed / 1e9)
print("Bytes billed: ", job.total_bytes_billed)

Total GB that will be processed:  6.272740366
Bytes billed:  0
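
The dry-run figure can also serve as a guard before the real query is started. The following is a small sketch in plain Python; `check_budget` and the 10 GB limit are hypothetical choices for illustration, and the byte count is the one reported above.

```python
def check_budget(bytes_processed, max_gb):
    """Return the estimate in GB, or raise if it exceeds the chosen limit."""
    gb = bytes_processed / 1e9
    if gb > max_gb:
        raise RuntimeError(f"query would process {gb:.2f} GB, limit is {max_gb} GB")
    return gb


# Value of job.total_bytes_processed from the dry run above.
estimate = check_budget(6_272_740_366, max_gb=10)
print(f"{estimate:.2f} GB")  # 6.27 GB
```

A similar limit can be enforced on the BigQuery side through the `maximum_bytes_billed` option of `QueryJobConfig`, which makes the job fail instead of running when the limit would be exceeded.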


In [7]:
# Actual job: run only once!
# Run the query and convert the results to a DataFrame
job = client.query(query, job_config=run)  # comment out if the query does not need to be repeated
sum_users = job.to_dataframe()
sum_users.head()

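|   | y | m | country_code | city | long | lat | type | Events | Users |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | 10 |  |  | 0.0 | 0.0 | ForkEvent | 513465 | 251386 |
| 1 | 2019 | 11 |  |  | 0.0 | 0.0 | ForkEvent | 468176 | 237888 |
| 2 | 2019 | 10 |  |  |  |  | ForkEvent | 379607 | 163456 |
| 3 | 2019 | 11 |  |  |  |  | ForkEvent | 313965 | 146971 |
| 4 | 2019 | 12 |  |  | 0.0 | 0.0 | ForkEvent | 447537 | 223686 |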
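
One way to make sure the query really runs only once is to store the result on disk and reload it on later runs. The sketch below uses `pickle` from the standard library; `load_or_compute` and the file name are hypothetical, and any other storage format would work as well.

```python
import pickle
from pathlib import Path


def load_or_compute(cache_path, compute):
    """Return the cached result if the file exists, otherwise compute and store it."""
    path = Path(cache_path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Usage with the objects from this notebook (runs the billed query only on the first call):
# sum_users = load_or_compute(
#     "sum_users.pkl",
#     lambda: client.query(query, job_config=run).to_dataframe(),
# )
```

Deleting the cache file forces the query to run again.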


In [43]:
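# Countries by activity and users
# Note: "Users" sums per-group distinct counts, so a user active in several
# months or event types is counted more than once
sum_users.groupby(["country_code"]).agg({"Events": "sum", "Users": "sum"}).sort_values("Events", ascending=False)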

| country_code | Events | Users |
|---|---|---|
| us | 10360073 | 1022618 |
| hu | 2855270 | 14121 |
| cn | 2485174 | 336471 |
| de | 2264084 | 244765 |
| gb | 2066987 | 218785 |
| ... | ... | ... |
| ne | 4 | 4 |
| kn | 2 | 1 |
| mh | 2 | 2 |
| tv | 1 | 1 |
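
The Users column in this table should be read with care: the query counts distinct logins per month and event type, and adding those counts up is not the same as counting distinct users overall. A toy example in plain Python, with made-up logins, shows the difference.

```python
# Made-up activity records: (login, month, event type).
rows = [
    ("alice", 10, "PushEvent"),
    ("alice", 11, "PushEvent"),
    ("alice", 10, "ForkEvent"),
    ("bob", 10, "PushEvent"),
]

# Distinct logins per (month, event type) group, as COUNT(DISTINCT login) would give.
per_group = {}
for login, month, event_type in rows:
    per_group.setdefault((month, event_type), set()).add(login)

summed = sum(len(logins) for logins in per_group.values())
unique = len({login for login, _, _ in rows})

print(summed)  # 4: alice is counted in three groups, bob in one
print(unique)  # 2: only two distinct users exist
```

The summed figure is therefore better read as user-month-type combinations, an upper bound on the number of distinct users.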


In [44]:
# Flag rows for which no country code is available
sum_users['country_missing'] = sum_users['country_code'].isnull()

In [46]:
sum_users.groupby(["country_missing"]).agg({"Events": "sum", "Users": "sum"}).sort_values("Events", ascending=False)

| country_missing | Events | Users |
|---|---|---|
| True | 79356479 | 10418350 |
| False | 36661837 | 3920666 |


In [67]:
# Same totals, expressed as shares of each column's sum
df = sum_users.groupby(["country_missing"]).agg({"Events": "sum", "Users": "sum"}).apply(lambda x: x / x.sum())
df

| country_missing | Events | Users |
|---|---|---|
| False | 0.316 | 0.273426 |
| True | 0.684 | 0.726574 |
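
The shares can be checked by hand from the totals in the previous output. A short sketch in plain Python, where the `shares` helper is only for illustration:

```python
# Totals from the previous cell's output, keyed by the country_missing flag.
totals = {
    True: {"Events": 79356479, "Users": 10418350},
    False: {"Events": 36661837, "Users": 3920666},
}


def shares(totals, column):
    """Divide each group's value by the column sum."""
    column_sum = sum(row[column] for row in totals.values())
    return {flag: row[column] / column_sum for flag, row in totals.items()}


print(round(shares(totals, "Events")[True], 3))   # 0.684
print(round(shares(totals, "Users")[False], 6))   # 0.273426
```

In other words, about 68% of the matched events come from users without a country code.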


In [38]:
# TODO: select the most active repositories as of 2019 and repeat the same check

In [39]:
# Export the notebook to HTML if needed (os is already imported in the first cell)
os.system('jupyter nbconvert --to html GH.ipynb')

0
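
The `0` above is the exit status returned by `os.system`, which indicates that the command succeeded. An alternative is `subprocess.run` with `check=True`, which raises an exception when the export fails instead of returning a status code that is easy to overlook. A minimal sketch, with hypothetical helper names:

```python
import shutil
import subprocess


def nbconvert_command(notebook, to="html"):
    """Build the same command line as above, as a list of arguments."""
    return ["jupyter", "nbconvert", "--to", to, notebook]


def export_notebook(notebook, to="html"):
    """Run nbconvert and raise if jupyter is missing or the export fails."""
    if shutil.which("jupyter") is None:
        raise FileNotFoundError("jupyter executable not found on PATH")
    subprocess.run(nbconvert_command(notebook, to), check=True)


# Usage:
# export_notebook("GH.ipynb")
```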