AutoSpark

Auto-spinning Spark clusters for text analysis and machine learning.

Usage Demo for AutoSpark

https://youtu.be/gPppTDGynoU (view in 1080p to read the on-screen text)

Setting up AutoSpark

Set up AWS Access Keys

  1. Go to the AWS console
  2. Create a new user and copy the AWS access key and secret access key
  3. Add the user to the AdministratorAccess group (required to launch instances via the boto API)
  4. Store the access and secret keys in a safe place
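The stored keys are typically made available to command-line tools through environment variables. The variable names below follow the standard AWS convention; the values are placeholders, not real credentials:

```shell
# Placeholder credentials -- substitute the values copied from the AWS console.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="exampleSecretAccessKey"

# Sanity check: both variables should be non-empty before launching instances.
[ -n "$AWS_ACCESS_KEY_ID" ] && [ -n "$AWS_SECRET_ACCESS_KEY" ] && echo "AWS keys set"
```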

Set up a DigitalOcean token

  1. Log in to DigitalOcean
  2. Go to the API tab
  3. Generate a new token
  4. Copy the token to a safe place
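The generated token can likewise be kept in an environment variable for the session. DIGITALOCEAN_TOKEN is a common convention, not a name AutoSpark requires, and the value here is a placeholder:

```shell
# Placeholder token -- substitute the token generated in the API tab.
export DIGITALOCEAN_TOKEN="dop_v1_exampletoken"

# Sanity check before launching droplets.
[ -n "$DIGITALOCEAN_TOKEN" ] && echo "DigitalOcean token set"
```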

Instructions to run on Docker

Steps to build and run the Docker image

docker build https://github.com/alt-code/AutoSpark.git#master:docker -t saurabhsvj/autospark
docker run -it saurabhsvj/autospark /bin/bash

The docker run command starts a bash prompt inside the container; run the commands below at that prompt.

Execute commands on docker container bash

These commands generate an SSH key pair, copy it to the /ssh_keys folder, and disable strict host-key checking:

docker-bash $: ssh-keygen -t rsa
docker-bash $: mkdir /ssh_keys
docker-bash $: cp ~/.ssh/id_rsa /ssh_keys/id_rsa
docker-bash $: cp ~/.ssh/id_rsa.pub /ssh_keys/id_rsa.pub
docker-bash $: echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config

Running the logscanner job using docker

Initial setup to get datasets

sudo apt-get install wget
wget ftp://ita.ee.lbl.gov/traces/NASA_access_log_Jul95.gz
gzip -d NASA_access_log_Jul95.gz
mv NASA_access_log_Jul95 nasalogs

Loading the data onto the cluster

node autospark-load.js

Note: Follow the instructions on the command line

Submitting the job to spark cluster

node autospark-submit.js

Note: Follow the instructions on the command line

Tear down the cluster

node autospark-teardown.js

Note: Follow the instructions on the command line

Instructions for launching clusters using Ubuntu OS

Update the driver machine

sudo apt-get update -y

Install git on the system

sudo apt-get install git

Clone the AutoSpark Repo

git clone https://github.com/alt-code/AutoSpark.git

Run the setup script

cd AutoSpark/scripts
sudo ./setup_machine.sh

If an SSH key pair is not already present, create one and copy it to a known folder, e.g. ssh_keys:

ssh-keygen -t rsa
mkdir /home/ubuntu/ssh_keys
cp ~/.ssh/id_rsa /home/ubuntu/ssh_keys/id_rsa
cp ~/.ssh/id_rsa.pub /home/ubuntu/ssh_keys/id_rsa.pub

Edit the ssh_config file and set strict host-key checking to no

sudo vi /etc/ssh/ssh_config

Set:

StrictHostKeyChecking no
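Instead of editing the system-wide file with vi, the option can be appended to the current user's ssh config, which OpenSSH consults before /etc/ssh/ssh_config. This is a sketch that creates ~/.ssh/config if it is missing:

```shell
# The per-user ssh config is read before the system-wide one,
# so appending the option here avoids the need for sudo.
mkdir -p ~/.ssh
echo "StrictHostKeyChecking no" >> ~/.ssh/config
chmod 600 ~/.ssh/config
```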

AutoSpark Usage:

Install AutoSpark Dependencies

cd AutoSpark/driver

npm install

Launch cluster

node autospark-cluster-launcher.js

Note: Follow the prompts. Keep the AWS keys and DigitalOcean token handy.

Load Data onto the cluster

node autospark-load.js

Submit Spark Job on the cluster

node autospark-submit.js

Tear down the cluster

node autospark-teardown.js

Note: Follow the prompts. Keep the AWS keys and DigitalOcean token handy.

Notes:

Detailed Steps for setting up a Spark Cluster in Standalone mode:

https://docs.google.com/document/d/1RrwooqTfAZzn0L8kq4EvvGLNYQBRi_FNeyB4C221qMo/edit#