# README

For 2014-01, both the 'full' and 'mid' runs fail on YARN after 2 hours; the error message says spark.yarn.executor.memoryOverhead (3g) is too small. Switching to Spark standalone mode succeeds.
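To stay on YARN instead, one option would be to raise that overhead. A minimal sketch, assuming the job builds its own session; the 4g value is an untested guess, and the property name is the Spark 2.x one from the error above (it can also be passed to spark-submit with --conf):

```python
from pyspark.sql import SparkSession

# Raise the executor memory overhead above the 3g that proved too small.
# 4g is an untested guess, not a tuned value.
spark = (
    SparkSession.builder
    .appName("eval-models-yarn")
    .config("spark.yarn.executor.memoryOverhead", "4g")
    .getOrCreate()
)
```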
On standalone:
- nFold=2, nParamSets=2, 2014-01: trains on 500,000 (mid) in 3 minutes; on 13,782,494 (full) in 8 minutes.
- nFold=5, nParamSets=3*2*3=18, 2014-01: on mid in ??? minutes (grid setup sketched below).

maxDepth is supported only up to 30 in Spark's tree implementation.
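For reference, the nFold/nParamSets setup maps onto PySpark's CrossValidator roughly as follows. This is a minimal sketch assuming a regression formulation; the three hyperparameters in the 3*2*3 grid and the column names are illustrative guesses, not necessarily what the notebooks tune.

```python
from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("rf-cv-sketch").getOrCreate()

# Hypothetical column names; the real features/label come from the preprocessing notebooks.
rf = RandomForestRegressor(featuresCol="features", labelCol="demand")

# 3 * 2 * 3 = 18 parameter combinations, matching nParamSets=18 above.
# maxDepth cannot exceed Spark's limit of 30.
grid = (
    ParamGridBuilder()
    .addGrid(rf.maxDepth, [10, 20, 30])
    .addGrid(rf.numTrees, [50, 100])
    .addGrid(rf.minInstancesPerNode, [1, 5, 10])
    .build()
)

cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="demand", metricName="rmse"),
    numFolds=5,  # nFold=5
)

# train_df = <preprocessed feature DataFrame>
# cv_model = cv.fit(train_df)
```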

## Refs
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

## DataProc Cluster
What does the master do? It doesn't seem to be utilizing all of its cores, while the workers use all of theirs.
Try a larger number of cores on the workers:

    gcloud dataproc --region us-west1 clusters create pyspark --subnet default --zone us-west1-b --master-machine-type n1-standard-2 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 1.3 --scopes 'https://www.googleapis.com/auth/cloud-platform' --project spark-rf --initialization-actions gs://nyc-taxi-8472/python3-initscript.sh
Standalone (single node):

    gcloud dataproc --region us-west1 clusters create pyspark --subnet default --zone us-west1-b --single-node --master-machine-type n1-standard-8 --master-boot-disk-size 500 --image-version 1.3 --project spark-rf --initialization-actions gs://nyc-taxi-8472/python3-initscript.sh


Command:

    gcloud dataproc --region us-west1 clusters create pyspark --subnet default --zone us-west1-b --master-machine-type n1-standard-2 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-standard-4 --worker-boot-disk-size 500 --image-version 1.3 --scopes 'https://www.googleapis.com/auth/cloud-platform' --project spark-rf --initialization-actions gs://nyc-taxi-8472/python3-initscript.sh
- Consider using high-memory instances instead of n1-standard:

    gcloud dataproc --region us-west1 clusters create pyspark --subnet default --zone us-west1-b --master-machine-type n1-highmem-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-highmem-2 --worker-boot-disk-size 500 --image-version 1.3 --scopes 'https://www.googleapis.com/auth/cloud-platform' --project spark-rf
- Ensure python3-initscript.sh is in an appropriate bucket.
    DataProc only supports python2 by default.
- Delete: gcloud dataproc clusters delete pyspark --region us-west1

- Input files must be in a Cloud Storage bucket; read them with gs://bucket-name/filename paths (see the sketch after this list).
- Job submit command: spark-submit --py-files sparkutils.zip --verbose eval_models.py
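A minimal sketch of reading a bucket file from PySpark; the bucket and file name below are placeholders, not the repo's actual paths. Dataproc clusters can read gs:// paths out of the box via the preinstalled Cloud Storage connector.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcs-read-sketch").getOrCreate()

# Placeholder bucket/object name -- substitute a real path.
rides = spark.read.csv("gs://bucket-name/filename.csv", header=True, inferSchema=True)
rides.printSchema()
```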

## Jupyter notebook instructions
Deploy:
0. Cluster initialization (set the timeout high enough; the default 600s = 10m may not be enough):
    gcloud dataproc --region us-west1 clusters create pyspark --subnet default --zone us-west1-b --master-machine-type n1-highmem-4 --master-boot-disk-size 500 --num-workers 2 --worker-machine-type n1-highmem-2 --worker-boot-disk-size 500 --image-version 1.2 --project spark-rf --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh --initialization-action-timeout 3000
1. Set up an SSH tunnel. Run locally:
    gcloud compute ssh pyspark-m --project=spark-rf -- -D 8082 -N
2. Run a browser through the tunnel:
    /bin/google-chrome-stable \
      --proxy-server="socks5://localhost:8082" \
      --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
      --user-data-dir=/tmp/pyspark-m
3. Go to http://pyspark-m:8123
4. Notebooks are saved in gs://$DATAPROC_BUCKET/notebooks/ -- how do we get the code out of them?

## Submit job
Submit dependencies as a zip? sparkutils.zip is passed via --py-files in the spark-submit command above; a runtime alternative is sketched below.
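As an alternative to --py-files, the zip can be added at runtime. A sketch, assuming the zip has been uploaded to the same bucket used for the init script (the path and module name are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eval-models-sketch").getOrCreate()

# Ship the helper package to the driver and executors at runtime.
# The gs:// path is an assumption; a local path to sparkutils.zip also works.
spark.sparkContext.addPyFile("gs://nyc-taxi-8472/sparkutils.zip")

# Modules inside the zip become importable after addPyFile, e.g.:
# import datautils  # assuming datautils.py is packaged in sparkutils.zip
```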

## Self-managed Cluster
Provisioning:
1. Install git and default-jre (which provides Java 8).
2. Install Python 3.6: add "deb http://ftp.de.debian.org/debian testing main" to /etc/apt/sources.list,
    then 'apt-get install -t testing python3.6'.
    Also install venv support from testing: 'apt-get install -t testing python3-venv'.
    Create a virtualenv with 'python3.6 -m venv venv' and activate it ('source venv/bin/activate').
3. pip install pyspark pandas (pandas shouldn't be necessary here, but datautils.py depends on it).
    A quick local smoke test is sketched below.
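After step 3, a local-mode sanity check (not part of the repo) can confirm the venv's pyspark install works before pointing it at a cluster:

```python
from pyspark.sql import SparkSession

# Local-mode smoke test: runs entirely inside the venv, no cluster needed.
spark = SparkSession.builder.master("local[2]").appName("smoke-test").getOrCreate()
print(spark.version)
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).show()
spark.stop()
```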