<a href="https://colab.research.google.com/github/andiub97/CovidPubRank/blob/master/CovidPageRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Page Ranking Algorithms on Google Cloud Dataproc

- Use the [Cloud Resource Manager](https://cloud.google.com/resource-manager) to create a project if you do not already have one.
- [Enable billing](https://support.google.com/cloud/answer/6293499#enable-billing) for the project.
- See [Google Cloud Storage (GCS) Documentation](https://cloud.google.com/storage/) for more info.


In [None]:
from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'scalaproject-343716'
!gcloud config set project {project_id}

In [None]:
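# Enlarge the dataset in place by appending copies of its edge list.
with open('./sample_data/citations_500.txt', 'r+') as f:
  content = f.read().split("\n")

  # Edge lines start with "e ".
  edges = [line for line in content if line.startswith("e ")]

  # After read() the file position is at the end, so these writes append.
  f.write("\n")
  for _ in range(17500):
    f.writelines("{}\n".format(x) for x in edges)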
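
The cell above modifies the input file in place, so running it twice enlarges the file twice. A minimal sketch of the same replication step as a function that writes to a separate file instead; `replicate_edges`, its parameters, and the demo file contents are hypothetical, and only the `e ` prefix for edge lines is taken from the cell above.

```python
import os
import tempfile


def replicate_edges(src_path, dst_path, copies):
    """Copy src_path to dst_path, then append every edge line `copies` more times.

    Edge lines are those starting with "e ", as in the cell above.
    Returns the number of edge lines found in the source file.
    """
    with open(src_path) as src:
        lines = src.read().split("\n")
    edges = [line for line in lines if line.startswith("e ")]

    with open(dst_path, "w") as dst:
        # Original content first, ending with exactly one newline.
        dst.write("\n".join(lines).rstrip("\n") + "\n")
        # Then the extra copies of the edge list.
        for _ in range(copies):
            dst.writelines(f"{edge}\n" for edge in edges)
    return len(edges)


# Tiny self-contained demo on a temporary file (made-up content).
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "tiny.txt")
demo_dst = os.path.join(demo_dir, "tiny_x3.txt")
with open(demo_src, "w") as f:
    f.write("n 1 paperA\nn 2 paperB\ne 1 2\ne 2 1\n")

edge_count = replicate_edges(demo_src, demo_dst, copies=3)
with open(demo_dst) as f:
    demo_lines = f.read().splitlines()
```

Writing to a new path keeps the step repeatable: the source file never changes, so the size of the result depends only on `copies`.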

## Send file to Google Cloud Bucket

In [None]:
import uuid

# Make a unique bucket to which we'll upload the file.
# (GCS buckets are part of a single global namespace.)
bucket_name = 'colab-sample-bucket-' + str(uuid.uuid1())

# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/mb
!gsutil mb -l us-central1 -b on gs://{bucket_name}

# Copy the file to our new bucket.
# Full reference: https://cloud.google.com/storage/docs/gsutil/commands/cp
!gsutil cp sample_data/citations_500.txt gs://{bucket_name}/
  
# Finally, dump the contents of our newly copied file to make sure everything worked.
#!gsutil cat gs://{bucket_name}/citations_500.txt
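
Because bucket names share one global namespace, the cell above appends a UUID to a fixed prefix. A minimal sketch that builds such a name and checks it against a subset of the GCS naming rules before calling `gsutil mb`; `make_bucket_name` and the regular expression are illustrative assumptions, not part of any Google library, and the check does not cover every documented rule.

```python
import re
import uuid

# Subset of the GCS bucket naming rules for names without dots:
# 3-63 characters, lowercase letters, digits, dashes and underscores,
# starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix="colab-sample-bucket-"):
    """Return prefix + UUID, raising ValueError if the result looks invalid."""
    name = prefix + str(uuid.uuid1())
    if not _BUCKET_NAME_RE.match(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


example_name = make_bucket_name()
```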

## Create Dataproc cluster with specified parameters

In [None]:
!gcloud dataproc clusters create 3workers-cluster \
  --region us-central1 \
  --zone us-central1-a \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 600 \
  --worker-machine-type n1-standard-4 \
  --num-workers 3 \
  --worker-boot-disk-size 600

# Weak scalability
### Send Dataproc jobs to the single-node cluster for all datasets

In [None]:
!gcloud dataproc clusters create single-node-cluster \
  --region us-central1 \
  --zone us-central1-a \
  --single-node 

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=single-node-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "allAlgorithms" "gs://citations_bucket/citations_500.txt" "gs://ranking_output_bucket/single-node/distributed"

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=single-node-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "allAlgorithms" "gs://citations_bucket/citations_100.txt" "gs://ranking_output_bucket/single-node/notDistributed"

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=single-node-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "allAlgorithms" "gs://citations_bucket/citations_50.txt" "gs://ranking_output_bucket/single-node/notDistributed"

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=single-node-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "allAlgorithms" "gs://citations_bucket/citations_10.txt" "gs://ranking_output_bucket/output1"
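
The four submissions above differ only in the input dataset and the output path. A minimal sketch that builds the same `gcloud` argument lists in a loop; `submit_command` is a hypothetical helper, and the one-directory-per-dataset output layout is an assumption rather than the paths used above.

```python
JAR = "gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar"


def submit_command(cluster, mode, input_uri, output_uri, region="us-central1"):
    """Build the argument list for one `gcloud dataproc jobs submit spark` call."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--jar={JAR}",
        "--",  # everything after this is passed to the Spark program
        mode, input_uri, output_uri,
    ]


# Dataset suffixes used in the cells above.
sizes = ["500", "100", "50", "10"]
commands = [
    submit_command(
        "single-node-cluster",
        "allAlgorithms",
        f"gs://citations_bucket/citations_{size}.txt",
        # Hypothetical layout: one output directory per dataset.
        f"gs://ranking_output_bucket/single-node/citations_{size}",
    )
    for size in sizes
]
```

Each list can be passed to `subprocess.run(cmd, check=True)` to run the job from Python instead of a `!` shell line.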

### Delete cluster

In [None]:
!gcloud dataproc clusters delete single-node-cluster \
    --region=us-central1

# Strong scalability
1. Send Dataproc jobs to a 2-worker cluster for the "citations_1.txt" dataset

In [None]:
!gcloud dataproc clusters create 2workers-cluster \
  --region us-central1 \
  --zone us-central1-a \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 600 \
  --worker-machine-type n1-standard-4 \
  --num-workers 2 \
  --worker-boot-disk-size 600

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=2workers-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "DistributedAlgorithms" "gs://citations_bucket/citations_1.txt" "gs://ranking_output_bucket/2workers/distributed"

### Delete cluster

In [None]:
!gcloud dataproc clusters delete 2workers-cluster \
    --region=us-central1

2. Send Dataproc jobs to a 4-worker cluster for the "citations_1.txt" dataset

In [None]:
!gcloud dataproc clusters create 4workers-cluster \
  --region us-central1 \
  --zone us-central1-a \
  --master-machine-type n1-standard-4 \
  --master-boot-disk-size 600 \
  --worker-machine-type n1-standard-4 \
  --num-workers 4 \
  --worker-boot-disk-size 600

In [None]:
!gcloud dataproc jobs submit spark \
    --cluster=4workers-cluster \
    --region=us-central1 \
    --jar=gs://covid_program/covidpubrank_2.12-0.1.0-SNAPSHOT.jar \
    -- "DistributedAlgorithms" "gs://citations_bucket/citations_1.txt" "gs://ranking_output_bucket/4workers/distributed"

### Delete cluster

In [None]:
!gcloud dataproc clusters delete 4workers-cluster \
    --region=us-central1
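
Strong scalability is commonly summarized by the speedup and parallel efficiency of each cluster size relative to a baseline run on the same dataset. A minimal sketch, assuming the smallest cluster is the baseline; `speedup_table` is a hypothetical helper, and the timings are placeholders, not measurements from these jobs.

```python
def speedup_table(times_by_workers):
    """Return {workers: (speedup, efficiency)} relative to the smallest cluster.

    speedup    = T_base / T_n
    efficiency = speedup / (n / n_base)
    """
    base_workers = min(times_by_workers)
    base_time = times_by_workers[base_workers]
    table = {}
    for workers, seconds in sorted(times_by_workers.items()):
        speedup = base_time / seconds
        efficiency = speedup / (workers / base_workers)
        table[workers] = (speedup, efficiency)
    return table


# Placeholder timings in seconds (NOT measured results).
placeholder_times = {2: 120.0, 4: 80.0}
scaling = speedup_table(placeholder_times)
```

An efficiency close to 1 means that doubling the workers roughly halves the run time; lower values indicate overhead from communication or scheduling.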

Delete jobs

## Get the job output (the ranking algorithms' execution times)

In [None]:
# Create the output directory.
!mkdir -p ./sample_data/output

# Download the job output from the Google Cloud Storage bucket.
# -r is needed because the bucket contains directories.
!gsutil cp -r "gs://ranking_output_bucket/*" ./sample_data/output
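
Once the bucket has been downloaded, the result files sit in nested directories, one per job. A minimal sketch for collecting them; `find_part_files` is a hypothetical helper, and the demo directory layout and file contents are made up to mirror the output paths used above.

```python
import glob
import os
import tempfile


def find_part_files(output_dir):
    """Return all Spark part files below output_dir, sorted by path."""
    pattern = os.path.join(output_dir, "**", "part-*")
    return sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))


# Self-contained demo with a made-up directory layout.
demo_root = tempfile.mkdtemp()
for sub in ("single-node/distributed", "2workers/distributed"):
    os.makedirs(os.path.join(demo_root, sub))
    with open(os.path.join(demo_root, sub, "part-00000"), "w") as f:
        f.write("(exampleAlgorithm,1.0)\n")
    # Marker files such as _SUCCESS do not match the pattern and are skipped.
    open(os.path.join(demo_root, sub, "_SUCCESS"), "w").close()

found = find_part_files(demo_root)
```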

In [None]:
import matplotlib.pyplot as plt

names = []
times = []

# Result files to plot; each row is expected to look like "(name,seconds)".
part_files = ['/content/sample_data/part-00000',
              '/content/sample_data/part-00000_1']

for path in part_files:
    with open(path) as f:
        for row in f:
            # Drop the newline and the surrounding parentheses.
            row = row.strip().strip('()')
            if not row:
                continue
            name, seconds = row.split(',')[:2]
            names.append(name)
            times.append(float(seconds))

plt.bar(names, times, color = 'b', label = 'Results')

plt.xlabel('Algorithm Name', fontsize = 12)
plt.ylabel('Time in sec.', fontsize = 12)

plt.title('Algorithms performance', fontsize = 20)

fig1 = plt.gcf()
fig1.set_figwidth(10)
fig1.set_figheight(10)
# Save before show(), which may release the figure.
fig1.savefig('result.png')
plt.show()