# Google Cloud Platform Project Creation Workbook 
 
Use this workbook to create a Google Cloud project with everything needed to collect new data and host your own web app.
 
Prerequisites:  
+ Create a Google user account  <br><br>
+ Create your own personal Google Cloud Project and Enable Billing
    - Enable a Free Tier account by selecting "Try it Free" here: [Try Google Cloud Platform for free](https://cloud.google.com/cloud-console)
    - Follow steps to activate billing found here: [Create New Billing Account](https://cloud.google.com/billing/docs/how-to/manage-billing-account#create_a_new_billing_account)
        - Billing account is required for APIs used in this project
        - You will not exceed the $300 free trial setting up this project but make sure to delete the project if you do not want to be charged
        - Take note of project name created because this billing account will be used with the new project <br><br>
+ Install and initialize Google Cloud SDK by following instructions found here: [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart) <br><br>
+ Set default region and zone following instructions here:

## Step 1 - Check Prerequisites Successfully Completed
Check that you have successfully installed and initialized the Cloud SDK by running the config list command. If you get an error, refer to the troubleshooting steps in the [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart).  
You should see output that includes your account along with any other configuration you set up when running `gcloud init`.

In [162]:
!gcloud config list

[accessibility]
screen_reader = False
[compute]
region = us-central1
zone = us-central1-c
[core]
account = cwilbar04@gmail.com
disable_usage_reporting = True
project = nba-predictions-dev



Your active configuration is: [default]
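The check above can also be done in code. The following is a minimal sketch, assuming you pass in only the INI-style portion of the `gcloud config list` output (the `[section]` headers and `key = value` lines). The helper names `parse_gcloud_config` and `missing_settings`, and the list of settings treated as required, are illustrative assumptions and not part of the Cloud SDK.

```python
import configparser


def parse_gcloud_config(text):
    """Parse INI-style `gcloud config list` output into a nested dict."""
    # interpolation=None so a literal '%' in a value is not treated specially
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def missing_settings(config):
    """Return the settings this workbook relies on that are not set."""
    # Hypothetical list of required settings, based on the prerequisites above
    required = [
        ("core", "account"),
        ("core", "project"),
        ("compute", "region"),
        ("compute", "zone"),
    ]
    return [f"{section}/{key}" for section, key in required
            if not config.get(section, {}).get(key)]


# Sample text shaped like the output above (placeholder values)
sample_output = """\
[compute]
region = us-central1
zone = us-central1-c
[core]
account = you@example.com
project = my-project
"""

config = parse_gcloud_config(sample_output)
print(missing_settings(config))
```

An empty list means every setting in the assumed `required` list is present; any entry such as `compute/zone` points at a prerequisite to revisit.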


Update all gcloud components to latest release.

In [163]:
!gcloud components update

Beginning update. This process may take several minutes.

All components are up to date.


## Step 2 - Create GCP Project

###### TO DO: Enter a name for the new project and billing project, then change this cell to a Code block and run it
###### Note: The project name must be unique across GCP. If you get an error when creating the project, change the project name here and try again.
new_project_id = 'YOUR_NEW_UNIQUE_PROJECT_NAME'
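Before calling `gcloud projects create`, it can help to check the ID format locally. This is a rough sketch based on the documented project ID rules (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen). The function name `is_valid_project_id` is hypothetical, and the check cannot tell you whether the ID is already taken by another project.

```python
import re

# Starts with a letter, ends with a letter or digit, 6-30 characters in total
PROJECT_ID_PATTERN = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def is_valid_project_id(project_id):
    """Return True if the ID matches the format rules (not a uniqueness check)."""
    return bool(PROJECT_ID_PATTERN.match(project_id))


print(is_valid_project_id("nba-predictions-test"))
print(is_valid_project_id("YOUR_NEW_UNIQUE_PROJECT_NAME"))
```

The placeholder value fails the check because it contains uppercase letters and underscores, which is a quick reminder to replace it before running the create command.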

In [30]:
new_project_id = 'nba-predictions-test'

In [3]:
!gcloud projects create {new_project_id}

ERROR: (gcloud.projects.create) Project creation failed. The project ID you specified is already in use by another project. Please try an alternative ID.


**TO DO: Navigate to [Cloud Console](https://console.cloud.google.com/), Change to new project, and enable billing following instructions found here: [Enable Billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project)**

## Step 3 - Enable Necessary Cloud Services

This project uses:
+ BigQuery to Store Model Data 
+ Google Cloud Functions scheduled using Google Cloud Scheduler to Load new Data Daily
+ Google App Engine to Host Website
+ Google Firestore in Native Mode to store data used by the Web Page  
  
List below contains all services needed at time of creation of this workbook. Please add/remove from this list if the names/necessary services have changed.

In [171]:
enable_services_list = [
    'appengine.googleapis.com',
    'bigquery.googleapis.com',
    'bigquerystorage.googleapis.com',
    'cloudapis.googleapis.com',
    'cloudbuild.googleapis.com',
    'clouddebugger.googleapis.com',
    'cloudfunctions.googleapis.com',
    'cloudresourcemanager.googleapis.com',
    'cloudscheduler.googleapis.com',
    'cloudtrace.googleapis.com',
    'compute.googleapis.com',
    'datastudio.googleapis.com',
    'deploymentmanager.googleapis.com',
    'firebaserules.googleapis.com',
    'firestore.googleapis.com',
    'logging.googleapis.com',
    'monitoring.googleapis.com',
    'oslogin.googleapis.com',
    'servicemanagement.googleapis.com',
    'serviceusage.googleapis.com',
    'sql-component.googleapis.com',
    'storage-api.googleapis.com',
    'storage-component.googleapis.com',
    'storage.googleapis.com'    
]

In [172]:
## At the time of workbook creation, services can only be enabled 20 at a time. Use this loop to enable them in batches of 20.
for x in range(0, len(enable_services_list), 20):
    !gcloud services enable {' '.join(enable_services_list[x:(x+20)])} --project={new_project_id}

Operation "operations/acf.p2-130738074716-2d57f6c4-b755-4e1f-b14d-392c602ef21f" finished successfully.
Operation "operations/acf.p2-130738074716-7812275a-cb95-4bcc-acd4-a232feeaea7e" finished successfully.
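The batching idea in the loop above can be written as a small reusable helper. This is a sketch only: `chunked` and `build_enable_commands` are hypothetical names, and the code builds the command strings without running them so you can inspect them first. The limit of 20 comes from the note in the cell above and may have changed since.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_enable_commands(services, project_id, batch_size=20):
    """Build one `gcloud services enable` command string per batch."""
    return [
        f"gcloud services enable {' '.join(batch)} --project={project_id}"
        for batch in chunked(services, batch_size)
    ]


# 24 placeholder service names produce two commands (20 + 4)
example_services = [f"service{i}.googleapis.com" for i in range(24)]
commands = build_enable_commands(example_services, "my-project")
print(len(commands))
```

With the 24 services listed in this workbook, two batches are produced, which matches the two "finished successfully" operations shown in the output above.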


In [4]:
!gcloud services list --project={new_project_id}

NAME                                 TITLE
appengine.googleapis.com             App Engine Admin API
bigquery.googleapis.com              BigQuery API
bigquerydatatransfer.googleapis.com  BigQuery Data Transfer API
bigquerystorage.googleapis.com       BigQuery Storage API
cloudapis.googleapis.com             Google Cloud APIs
cloudbuild.googleapis.com            Cloud Build API
clouddebugger.googleapis.com         Cloud Debugger API
cloudfunctions.googleapis.com        Cloud Functions API
cloudresourcemanager.googleapis.com  Cloud Resource Manager API
cloudscheduler.googleapis.com        Cloud Scheduler API
cloudtrace.googleapis.com            Cloud Trace API
compute.googleapis.com               Compute Engine API
containerregistry.googleapis.com     Container Registry API
datastore.googleapis.com             Cloud Datastore API
datastudio.googleapis.com            Data Studio API
deploymentmanager.googleapis.com     Cloud Deployment Manager V2 API
firebaserules.googleapis.com         

## Step 4 - Create Necessary Service Accounts

There are four primary service accounts used in this project:  
- **App Engine default service account**
    - This gets created automatically when the App Engine API is enabled
    - Generally your_project_id@appspot.gserviceaccount.com  <br><br>
      
- **Compute Engine default service account**
    - This gets created automatically when the Compute Engine API is enabled
    - Generally your_project_number-compute@developer.gserviceaccount.com  <br><br>
      
- **Cloud Function service account**
    - We create this and add necessary roles below using the Cloud SDK
    - cloudfunction-service-account@your_project_id.iam.gserviceaccount.com
    - This account is used as the service account to run all Cloud Functions in this project  <br><br>
      
- **CircleCI Service Account**
    - We create this and add necessary roles below using the Cloud SDK
    - circleci-deployer@your_project_id.iam.gserviceaccount.com
    - This account is used in CircleCI for CI/CD to deploy and test App Engine and Cloud Functions
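The four addresses described above follow predictable patterns, so they can be derived from the project ID and project number. A minimal sketch, assuming the naming shown in this workbook; the function name `expected_service_accounts` is hypothetical, and the project number must be looked up separately (for example in the Cloud Console).

```python
def expected_service_accounts(project_id, project_number):
    """Return the service account emails this workbook expects, keyed by purpose."""
    return {
        "app_engine_default": f"{project_id}@appspot.gserviceaccount.com",
        "compute_engine_default": f"{project_number}-compute@developer.gserviceaccount.com",
        "cloud_function": f"cloudfunction-service-account@{project_id}.iam.gserviceaccount.com",
        "circleci": f"circleci-deployer@{project_id}.iam.gserviceaccount.com",
    }


accounts = expected_service_accounts("nba-predictions-prod", "130738074716")
for purpose, email in accounts.items():
    print(f"{purpose}: {email}")
```

Comparing this dictionary against the output of `gcloud iam service-accounts list` is a quick way to confirm that nothing is missing or misnamed.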

Check which service accounts have already been created (there should be the two default ones described above)

In [174]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                               DISABLED
App Engine default service account      nba-predictions-prod@appspot.gserviceaccount.com    False
Compute Engine default service account  130738074716-compute@developer.gserviceaccount.com  False


In [175]:
!gcloud iam service-accounts create cloudfunction-service-account \
    --display-name="Cloud Function Service Account" \
    --description="Account used to run all Cloud Functions with necessary BigQuery and Firestore Permissions" \
    --project={new_project_id}

Created service account [cloudfunction-service-account].


In [177]:
!gcloud iam service-accounts create circleci-deployer \
    --display-name="Circle CI Service Account" \
    --description="Account used by Circle CI with necessary permissions to Deploy to Cloud Functions and App Engine" \
    --project={new_project_id}

Created service account [circleci-deployer].


Check service accounts were created successfully and display e-mail needed in the next step

In [178]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                                                       DISABLED
App Engine default service account      nba-predictions-prod@appspot.gserviceaccount.com                            False
Compute Engine default service account  130738074716-compute@developer.gserviceaccount.com                          False
Circle CI Service Account               circleci-deployer@nba-predictions-prod.iam.gserviceaccount.com              False
Cloud Function Service Account          cloudfunction-service-account@nba-predictions-prod.iam.gserviceaccount.com  False


Programmatically update the roles for the new service accounts using the guide found here: [Programmatic Change Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#programmatic)

In [179]:
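# Save the policy file in the directory above where the repo is saved so that it is not committed to GitHub
# Raw string so the backslashes in the Windows path are not treated as escape sequences
file_directory = r'..\..\policy.json'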

In [180]:
# Write current policy to file directory
!gcloud projects get-iam-policy {new_project_id} --format json > {file_directory}

**If running in a Jupyter notebook, run the cell below to load and modify the policy file.**

In [181]:
import json

with open(r'..\..\policy.json') as f:
    policy = json.load(f)

def modify_policy_add_role(policy, role, member):
    """Adds a new role binding to a policy."""

    binding = {"members": [member],"role": role }
    policy["bindings"].append(binding)
    return policy

members = [f'serviceAccount:cloudfunction-service-account@{new_project_id}.iam.gserviceaccount.com', 
           f'serviceAccount:circleci-deployer@{new_project_id}.iam.gserviceaccount.com']
roles = {members[0]:['roles/bigquery.user','roles/datastore.user','roles/run.serviceAgent'],
        members[1]:['roles/appengine.deployer','roles/appengine.serviceAdmin','roles/cloudbuild.builds.builder',
                   'roles/cloudfunctions.admin','roles/compute.storageAdmin','roles/iam.serviceAccountUser']}

for member in members:
    for role in roles[member]:
        policy = modify_policy_add_role(policy, role, member)

with open(r'..\..\policy.json', 'w') as json_file:
    json.dump(policy, json_file)
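The cell above appends a new binding for every role, even when the policy already has a binding for that role. A possible alternative, shown here as a sketch and not as the method used in this workbook, is to add the member to an existing unconditional binding when one exists. The function name `add_member_to_role` and the example account names are hypothetical.

```python
def add_member_to_role(policy, role, member):
    """Add `member` to the binding for `role`, creating the binding if needed."""
    bindings = policy.setdefault("bindings", [])
    for binding in bindings:
        # Only reuse unconditional bindings; conditional ones are left untouched
        if binding.get("role") == role and "condition" not in binding:
            members = binding.setdefault("members", [])
            if member not in members:
                members.append(member)
            return policy
    bindings.append({"role": role, "members": [member]})
    return policy


# Placeholder policy with one existing binding
example_policy = {
    "bindings": [
        {"role": "roles/cloudbuild.builds.builder",
         "members": ["serviceAccount:existing@example.iam.gserviceaccount.com"]}
    ]
}
new_member = "serviceAccount:circleci-deployer@my-project.iam.gserviceaccount.com"
add_member_to_role(example_policy, "roles/cloudbuild.builds.builder", new_member)
add_member_to_role(example_policy, "roles/cloudfunctions.admin", new_member)
print(len(example_policy["bindings"]))
```

Calling the function twice with the same role and member leaves the policy unchanged, so rerunning the cell does not pile up duplicate bindings.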

**If running the code directly in a console, open the policy file and add the members and roles below to its bindings list.**  
**Change "your_project_id" to your project ID.**

{"members": ["serviceAccount:cloudfunction-service-account@your_project_id.iam.gserviceaccount.com"], "role": "roles/bigquery.user"},  
{"members": ["serviceAccount:cloudfunction-service-account@your_project_id.iam.gserviceaccount.com"], "role": "roles/datastore.user"},  
{"members": ["serviceAccount:cloudfunction-service-account@your_project_id.iam.gserviceaccount.com"], "role": "roles/run.serviceAgent"},  
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/appengine.deployer"},   
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/appengine.serviceAdmin"},   
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/cloudbuild.builds.builder"},   
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/cloudfunctions.admin"},  
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/compute.storageAdmin"},  
{"members": ["serviceAccount:circleci-deployer@your_project_id.iam.gserviceaccount.com"], "role": "roles/iam.serviceAccountUser"}

In [182]:
!gcloud projects set-iam-policy {new_project_id} {file_directory}

bindings:
- members:
  - serviceAccount:circleci-deployer@nba-predictions-prod.iam.gserviceaccount.com
  role: roles/appengine.deployer
- members:
  - serviceAccount:circleci-deployer@nba-predictions-prod.iam.gserviceaccount.com
  role: roles/appengine.serviceAdmin
- members:
  - serviceAccount:cloudfunction-service-account@nba-predictions-prod.iam.gserviceaccount.com
  role: roles/bigquery.user
- members:
  - serviceAccount:130738074716@cloudbuild.gserviceaccount.com
  - serviceAccount:circleci-deployer@nba-predictions-prod.iam.gserviceaccount.com
  role: roles/cloudbuild.builds.builder
- members:
  - serviceAccount:service-130738074716@gcp-sa-cloudbuild.iam.gserviceaccount.com
  role: roles/cloudbuild.serviceAgent
- members:
  - serviceAccount:circleci-deployer@nba-predictions-prod.iam.gserviceaccount.com
  role: roles/cloudfunctions.admin
- members:
  - serviceAccount:service-130738074716@gcf-admin-robot.iam.gserviceaccount.com
  role: roles/cloudfunctions.serviceAgent
- members:
  

Updated IAM policy for project [nba-predictions-prod].


In [183]:
# Remove policy file 
!del {file_directory}

## Step 5 - Create App Engine Application

In order to deploy a specific application you first need to create a placeholder application. 

**Change YOUR_REGION to your default region**  
See [Regions and Zones](https://cloud.google.com/compute/docs/regions-zones) for more info

In [184]:
!gcloud app create --region=YOUR_REGION --project={new_project_id}

You are creating an app for project [nba-predictions-prod].
cannot be changed. More information about regions is at
<https://cloud.google.com/appengine/docs/locations>.

Creating App Engine application in project [nba-predictions-prod] and region [us-central]....
.................................done.
Success! The app is now created. Please use `gcloud app deploy` to deploy your first app.


## Step 6 - Create BigQuery Dataset

Your new project will need a dataset to store the data if you plan on copying/creating your own repository of data.  

This has to be a unique name per project.  

In my workflows I have named the dataset 'nba' but feel free to change it. Note that if you do change it, then you will also need to change the dataset name in any of the other python scripts in this project appropriately. 

In [6]:
new_project_id = 'nba-predictions-test'

In [5]:
dataset_name = 'nba'

In [7]:
!bq --location=US mk --dataset \
--description "Stores all National Basketball Association Data. Created using Project Creation workbook found at https://github.com/cwilbar04/nba-predictions/tree/main/notebooks" \
{new_project_id}:{dataset_name}  

Dataset 'nba-predictions-test:nba' successfully created.


## Step 7 - Load BigQuery Tables

All data in this project is taken from [Basketball Reference](https://www.basketball-reference.com/).

There are two options for loading the data to BigQuery:  
1. **Load the data yourself** 
    - Part 1: Raw Data
        - Navigate to [Initial Load Workbook](https://github.com/cwilbar04/nba-predictions/blob/main/notebooks/NBA%20Data%20Initial%20Load.ipynb) and change the start date to your desired starting date. For my model I loaded data starting from '10-1-1999'. Always choose a start date in between seasons if you don't want to get partial season data. Warning: this may take a couple of days and require restarts.
    - Part 2: Model Data
        - Navigate to [Initial Model Load Workbook](https://github.com/cwilbar04/nba-predictions/blob/main/notebooks/NBA%20Model%20Table%20Initial%20Load.ipynb) and change project and dataset names to what you used in the workbook then run all. <br><br>
2. **Copy Data**
    - For a quicker load process, simply copy the data directly from my public dataset by running the code blocks below. You must complete Step 6 - Create BigQuery Dataset first. Be careful of costs if the dataset you create is in a region other than US. At the time of creation this feature is still in beta and there is no cost. See the documentation for the latest info: [Copy Datasets](https://cloud.google.com/bigquery/docs/copying-datasets)

In [6]:
##### Copy Dataset Code Block. Only run if choosing option 2 above ####
## You first have to enable Data Transfer Service API ##
!gcloud services enable bigquerydatatransfer.googleapis.com --project={new_project_id}

In [92]:
##### Copy Dataset Code Block. Only run if choosing option 2 above ####
## Enabling the Data Transfer Service API can take a minute. Please wait and retry if you get an error.     ##
## The code below must be run in Python. To run outside of Python, replace {} with the correct information. ##
## Params must be JSON formatted                                                                            ##
## Data will be transferred from my public dataset to the dataset you created in Step 6 above               ##

import json
source_parameters = '{"source_dataset_id":"nba", "source_project_id":"nba-predictions-dev", "overwrite_destination_table":"true"}'
source_parameters_json = json.dumps(source_parameters)
run = f'bq mk --transfer_config\
                --project_id={new_project_id}\
                --data_source=cross_region_copy\
                --target_dataset={dataset_name}\
                --display_name="Initial load of public NBA dataset"\
                --no_auto_scheduling\
                --params={source_parameters_json}'
!{run}
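The cell above calls `json.dumps` on a value that is already a JSON string. The second encoding wraps the text in double quotes and escapes the inner quotes, which appears to be what lets it travel through the shell as a single `--params` argument. The sketch below illustrates that round trip starting from a dictionary; whether a given shell unwraps the quoting the same way is an assumption worth testing in your own environment.

```python
import json

# Same parameters as the cell above, written as a dictionary
params = {
    "source_dataset_id": "nba",
    "source_project_id": "nba-predictions-dev",
    "overwrite_destination_table": "true",
}

# First encoding: dictionary -> JSON text
params_json = json.dumps(params)

# Second encoding: JSON text -> double-quoted string with escaped inner quotes
quoted_for_shell = json.dumps(params_json)

print(params_json)
print(quoted_for_shell)

# Decoding twice recovers the original dictionary
round_trip = json.loads(json.loads(quoted_for_shell))
print(round_trip == params)
```

Starting from a dictionary avoids hand-writing the JSON text, so a missing quote or comma surfaces as a Python error before the command is run.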

## Step 8 - Deploy Cloud Functions

This project uses three cloud functions that we will set up schedules for using Cloud Scheduler in order to update the data daily:
1. **nba_basketball_reference_scraper**
    - This function loads game box scores and game player box scores from [Basketball Reference](https://www.basketball-reference.com/) to nba.raw_basketballreference_game and nba.raw_basketballreference_playerbox. You can specify a start date and end date in a JSON header ({"StartDate":"1-1-1000","EndDate":"1-1-100"}).
    - If you don't provide a start date then it automatically uses the max game date from the raw_basketballreference_game table.
    - If you don't specify an end date then it automatically loads data up to yesterday (aka the last day games were guaranteed to be completed).
    - When we schedule this job we will not provide a start date or end date so it will always load the most recent data that is not already in the raw_basketballreference_game and raw_basketballreference_playerbox tables. <br><br>
       
2. **nba_model_game_refresh**
    - This function uses the view we will create in the next step to identify games that have been loaded to the raw_basketballreference_game table but have not been loaded in to the model_game_data table yet. It then performs all of the necessary transformations to combine specific player data stats and create moving average columns and load the data in to the model_game_data table.
    - This job also loads the most recent information for each team to Firestore that our web app uses when making predictions.
    - This job does not care what is in the JSON header.
    - We will schedule this to run daily one hour after the scraper function. <br><br>
    
3. **nba_get_upcoming_games**
    - This function gets the schedule from [Basketball Reference](https://www.basketball-reference.com/) for one week, including "today", and overwrites the schedule file stored in the App Engine default Cloud Storage bucket. This schedule will be used to display upcoming games on our web page.
    - This function will be scheduled to run one hour before the scraper function.
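The start and end date defaults described for the scraper can be sketched as follows. This is an illustration of the described behavior, not the deployed function's code: the function name `resolve_date_range`, its arguments, and the `%m-%d-%Y` date format are assumptions (the format is inferred from the '10-1-1999' example used elsewhere in this workbook).

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%m-%d-%Y"  # assumed format, e.g. "10-1-1999"


def resolve_date_range(payload, max_loaded_game_date, today):
    """Pick the start and end dates to scrape, applying the described defaults."""
    start_text = payload.get("StartDate")
    end_text = payload.get("EndDate")
    # No start date: continue from the latest game date already loaded
    start = (datetime.strptime(start_text, DATE_FORMAT).date()
             if start_text else max_loaded_game_date)
    # No end date: stop at yesterday, the last day games are guaranteed complete
    end = (datetime.strptime(end_text, DATE_FORMAT).date()
           if end_text else today - timedelta(days=1))
    return start, end


# A scheduled run sends no dates, so both defaults apply
scheduled = resolve_date_range({}, date(2021, 3, 1), date(2021, 3, 5))
print(scheduled)

# A manual backfill can pin the start date
backfill = resolve_date_range({"StartDate": "10-1-1999"}, date(2021, 3, 1), date(2021, 3, 5))
print(backfill)
```

Passing `today` in as an argument keeps the logic easy to test, since the result does not depend on when the code runs.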
  
**NOTE:** All three functions are set to allow all users to invoke them in the current build. This avoids setting up credentials for Cloud Scheduler. A future build will seek to remove this vulnerability by properly setting up Cloud Scheduler credentials.

**IMPORTANT** The deploy functions will only run if you have launched this notebook from a git cloned folder. Otherwise, you will need to change the "source" to the file path where the folders containing the relevant functions and requirements exist.

In [93]:
## Set variables used in each deploy. You should not need to change these if you have followed
# all of the steps above for creating the service account and creating the App Engine application.
CLOUD_FUNCTION_SERVICE_ACCOUNT = f'cloudfunction-service-account@{new_project_id}.iam.gserviceaccount.com'
CLOUD_STORAGE_BUCKET = f'{new_project_id}.appspot.com'

In [98]:
FUNCTION_NAME='nba_basketball_reference_scraper'

In [None]:
# Deploy function
FUNCTION_NAME='nba_basketball_reference_scraper'

!gcloud functions deploy {FUNCTION_NAME} \
  --source=../scraper \
  --project={new_project_id} \
  --allow-unauthenticated \
  --entry-point=nba_basketballreference_scraper \
  --memory=1024MB \
  --runtime=python38 \
  --service-account={CLOUD_FUNCTION_SERVICE_ACCOUNT} \
  --trigger-http \
  --timeout=300

# Set policy on function to allow allUsers to invoke
!gcloud functions add-iam-policy-binding {FUNCTION_NAME} \
  --member=allUsers \
  --role=roles/cloudfunctions.invoker \
  --project={new_project_id}

In [None]:
#Deploy function
FUNCTION_NAME='nba_model_game_refresh'

!gcloud functions deploy {FUNCTION_NAME} \
  --source=../data_model \
  --project={new_project_id} \
  --allow-unauthenticated \
  --entry-point=create_model_data \
  --memory=1024MB \
  --runtime=python38 \
  --service-account={CLOUD_FUNCTION_SERVICE_ACCOUNT} \
  --trigger-http \
  --timeout=300

# Set policy on function to allow allUsers to invoke
!gcloud functions add-iam-policy-binding {FUNCTION_NAME} \
  --member=allUsers \
  --role=roles/cloudfunctions.invoker \
  --project={new_project_id}

In [None]:
# Deploy function
FUNCTION_NAME='nba_get_upcoming_games'

!gcloud functions deploy {FUNCTION_NAME} \
  --source=../get_schedule \
  --project={new_project_id} \
  --allow-unauthenticated \
  --entry-point=write_to_bucket \
  --memory=512MB \
  --runtime=python38 \
  --service-account={CLOUD_FUNCTION_SERVICE_ACCOUNT} \
  --trigger-http \
  --timeout=60 \
  --set-env-vars=CLOUD_STORAGE_BUCKET={CLOUD_STORAGE_BUCKET}

# Set policy on function to allow allUsers to invoke
!gcloud functions add-iam-policy-binding {FUNCTION_NAME} \
  --member=allUsers \
  --role=roles/cloudfunctions.invoker \
  --project={new_project_id}

## Step 9 - Create BigQuery View

In order to use the nba_model_game_refresh function we need to create a BigQuery view that identifies which games have been loaded in to the raw_basketballreference_game table but have not been loaded in to the model_game_data table yet. Copying datasets does not copy views, so you will always need to run this step even if you copied the entire dataset directly.

**IMPORTANT** If you ever change the number of games used for the weighted moving average (W), you will need to update this view as well. The `game_number <=` filter needs to change to match however many games you are averaging over. A future release will seek to remove this dependency, as it is too easy to miss.

In [None]:
OPTIONS( \
  description="Games to Load to Model View. \
      IMPORTANT: If you ever change the number of games to use for the weighted moving average (W) then you \
      will need to update this view as well. The game_number < filter needs to change to however many games you are \
      averaging over.")

In [178]:
## Change dataset name (nba) if you chose a different dataset name earlier
view_name = 'nba.games_to_load_to_model'
view_query = f'CREATE OR REPLACE VIEW `{view_name}` AS \
WITH model_load_games as (SELECT \
distinct left(game_key,length(game_key)-1) as game_key \
FROM `nba.model_game` \
) \
    SELECT distinct order_of_games_per_team.game_key, \
    CASE WHEN model_load_games.game_key is NULL THEN 1 ELSE 0 END as NEEDS_TO_LOAD_TO_MODEL \
    FROM ( \
            SELECT team, game_key, row_number() OVER (PARTITION BY team ORDER BY game_date desc) as game_number \
            FROM ( \
                    SELECT \
                        home_team_name as team, game_date, game_key \
                    FROM  `nba.raw_basketballreference_game` \
                    UNION DISTINCT \
                    SELECT \
                        visitor_team_name as team, game_date, game_key \
                    FROM  `nba.raw_basketballreference_game` \
                 ) games_per_team \
            )order_of_games_per_team \
    LEFT JOIN model_load_games ON model_load_games.game_key = order_of_games_per_team.game_key \
    WHERE \
        game_number <= 11 \
        and team in ( \
                    SELECT \
                        distinct home_team_name as team_to_load \
                    FROM `nba.raw_basketballreference_game` \
                    WHERE \
                    game_date >= (SELECT date_sub(max(game_date), INTERVAL 1 YEAR) FROM `nba.raw_basketballreference_game` ) \
                    and game_key not in (SELECT game_key FROM model_load_games) \
                    UNION DISTINCT \
                    SELECT \
                        distinct visitor_team_name as team_to_load \
                    FROM `nba.raw_basketballreference_game` \
                    WHERE \
                    game_date >= (SELECT date_sub(max(game_date), INTERVAL 1 YEAR) FROM `nba.raw_basketballreference_game`) \
                    and game_key not in (SELECT game_key FROM model_load_games))'

run_view = f'''bq query --use_legacy_sql=false --project_id={new_project_id} "{view_query}"'''
!{run_view}
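To make the view's logic easier to follow, here is a rough pure-Python equivalent. It is a sketch under stated assumptions: the function name `games_to_load` and the dictionary field names are hypothetical, the one-year window is approximated as 365 days, and `loaded_keys` stands in for the already-trimmed game keys from the model table. The query above remains the source of truth.

```python
from collections import defaultdict
from datetime import date, timedelta


def games_to_load(raw_games, loaded_keys, max_games_per_team=11):
    """Map game_key -> 1 if the game still needs loading to the model, else 0."""
    # Every game counts once for the home team and once for the visitor
    per_team = defaultdict(list)
    for game in raw_games:
        per_team[game["home"]].append(game)
        per_team[game["visitor"]].append(game)

    # Teams with at least one unloaded game in the last year of data
    cutoff = max(g["game_date"] for g in raw_games) - timedelta(days=365)
    teams = set()
    for game in raw_games:
        if game["game_date"] >= cutoff and game["game_key"] not in loaded_keys:
            teams.update((game["home"], game["visitor"]))

    # For those teams, keep only their most recent games (the row_number filter)
    result = {}
    for team in teams:
        recent = sorted(per_team[team], key=lambda g: g["game_date"], reverse=True)
        for game in recent[:max_games_per_team]:
            result[game["game_key"]] = 0 if game["game_key"] in loaded_keys else 1
    return result


example_games = [
    {"game_key": "k1", "game_date": date(2021, 1, 1), "home": "A", "visitor": "B"},
    {"game_key": "k2", "game_date": date(2021, 1, 2), "home": "A", "visitor": "C"},
    {"game_key": "k3", "game_date": date(2021, 1, 3), "home": "B", "visitor": "D"},
]
print(games_to_load(example_games, loaded_keys={"k1", "k3"}))
```

In the example, only `k2` is unloaded, so only teams A and C are selected. Game `k1` still appears with a flag of 0 because it is one of team A's recent games, which gives the moving-average calculation the history it needs.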

## Step 10 - Create Cloud Scheduler Jobs

This is only required if you wish to keep your data up to date. If you do not, make sure you execute the nba_model_game_refresh and nba_get_upcoming_games functions once so that the web app has the most recent game and upcoming schedule information.

In [None]:

# Assumes the function was deployed to us-central1; change the region prefix if yours differs
uri = f'https://us-central1-{new_project_id}.cloudfunctions.net/nba_basketball_reference_scraper'
!gcloud scheduler jobs create http nba_basketball_reference_scraper_daily \
--schedule "0 6 * * *" --uri "{uri}" --http-method GET --project={new_project_id}
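Step 8 describes three schedules: the upcoming-games function one hour before the scraper, and the model refresh one hour after it. The sketch below builds the three command strings from a single scraper hour so the ordering stays consistent. The helper names, the `_daily` job suffix for the two jobs not shown above, and the `us-central1` default region are assumptions; the strings are only printed, not executed.

```python
def function_url(project_id, function_name, region="us-central1"):
    """Build the HTTP trigger URL for a function (region is an assumption)."""
    return f"https://{region}-{project_id}.cloudfunctions.net/{function_name}"


def scheduler_commands(project_id, scraper_hour=6, region="us-central1"):
    """Build one `gcloud scheduler jobs create http` command per function."""
    if not 1 <= scraper_hour <= 22:
        # Keeps all three jobs on the same day and in the intended order
        raise ValueError("scraper_hour must be between 1 and 22")
    jobs = [
        ("nba_get_upcoming_games", scraper_hour - 1),   # one hour before the scraper
        ("nba_basketball_reference_scraper", scraper_hour),
        ("nba_model_game_refresh", scraper_hour + 1),   # one hour after the scraper
    ]
    return [
        f'gcloud scheduler jobs create http {name}_daily '
        f'--schedule "0 {hour} * * *" '
        f'--uri "{function_url(project_id, name, region)}" '
        f'--http-method GET --project={project_id}'
        for name, hour in jobs
    ]


example_commands = scheduler_commands("my-project")
for command in example_commands:
    print(command)
```

Review the printed commands before running them, since the job names and region need to match what you actually deployed.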