Allowing AWS, updating links (#6)
* first pass at aws scripts

* tweaking aws instance to be loaded

* updating readme

* tiny tweak on example Spark Shell

* final commit of evening; am tired so making too many little ones

* shoutout to docnow team for inspiration

* changed vm box provider to aws-supported one

* tweaking box provision
ianmilligan1 authored and ruebot committed Nov 17, 2016
1 parent 5bcd681 commit 74495e3
Showing 4 changed files with 118 additions and 54 deletions.
62 changes: 48 additions & 14 deletions README.md
@@ -24,7 +24,7 @@ To install this virtual machine, you have two options.

[You can download it from this link and "import the appliance" using VirtualBox](http://alpha.library.yorku.ca/releases/warcbase_workshop/Warcbase_workshop_VM.ova). Note that this is a 6.4GB download. If you do this, [skip to "Spark Notebook" below](https://github.com/web-archive-group/warcbase_workshop_vagrant#spark-notebook).

Or you can use vagrant to build it yourself.
Or you can use vagrant to build it yourself, or provision it using `aws`.

## Use

@@ -38,6 +38,38 @@ From a working directory, please run the following commands.

Once you run these three commands, you will have a running virtual machine with the latest version of warcbase installed.

## Cloud Deployment

You can also deploy this as an AWS machine. To do so, install [vagrant-aws](https://github.com/mitchellh/vagrant-aws).

`vagrant plugin install vagrant-aws`

And then modify the `Vagrantfile` to point to your AWS information. The following block will need to be changed:

```
config.vm.provider :aws do |aws, override|
aws.access_key_id = "KEYHERE"
aws.secret_access_key = "SECRETKEYHERE"
aws.region = "us-west-2"
aws.region_config "us-west-2" do |region|
region.ami = "ami-01f05461"
# by default, spins up lightweight m3.medium. If want powerful, uncomment below.
# region.instance_type = "c3.4xlarge"
region.keypair_name = "KEYPAIRNAME"
end
override.ssh.username = "ubuntu"
override.ssh.private_key_path = "PATHTOPRIVATEKEY"
```

You can then load it by typing:

`vagrant up --provider aws`

Note that you will need to change your AWS Security Group to allow incoming connections on ports 22 (SSH) and 9000 (Spark Notebook). By default, it launches a lightweight m3.medium. To do real work, you will need a larger (and, sadly, more expensive) instance.

## Connect

Now you need to connect to the machine. This will be done through your command line, but also through your browser through Spark Notebook.
@@ -47,14 +79,14 @@ We use three commands to connect to this virtual machine. `ssh` to connect to it
To get started, type `vagrant ssh` in the directory where you installed the VM.

When prompted:
- username: `vagrant`
- password: `vagrant`
- username: `ubuntu`
- password: `ubuntu`

Here are some other example commands:
* `ssh -p 2222 vagrant@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt vagrant@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz vagrant@localhost:/home/vagrant` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir vagrant@localhost:/home/vagrant` - will sync `somedir` to your home directory of the vagrant machine.
* `ssh -p 2222 ubuntu@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt ubuntu@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz ubuntu@localhost:/home/ubuntu` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir ubuntu@localhost:/home/ubuntu` - will sync `somedir` to your home directory of the vagrant machine.

## Environment

@@ -72,9 +104,11 @@ Here are some other example commands:
To run spark notebook, type the following:

* `vagrant ssh` (if on vagrant; if you downloaded the ova file and are running with VirtualBox you do not need to do this)
* `cd project/spark-notebook-0.6.2-SNAPSHOT-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2/bin`
* `cd /home/ubuntu/project/spark-notebook-0.6.3-scala-2.11.7-spark-1.6.2-hadoop-2.7.2/bin`
* `./spark-notebook -Dhttp.port=9000 -J-Xms1024m`
* Visit http://127.0.0.1:9000/ in your web browser.
* Visit http://127.0.0.1:9000/ in your web browser.

If you are connecting via AWS, visit the IP address of your instance (found on the EC2 dashboard) on port 9000 (e.g. `35.162.32.51:9000`).

![Spark Notebook](https://cloud.githubusercontent.com/assets/218561/14062458/f8c6a842-f375-11e5-991b-c5d6a80c6f1a.png)

@@ -84,11 +118,11 @@ To run spark shell:

* `vagrant ssh` (if you did not run that in the previous step)
* `cd project/spark-1.5.1-bin-hadoop2.6/bin`
* `./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar`
* `./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar`

Example:
```bash
vagrant@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
ubuntu@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
@@ -116,7 +150,7 @@ scala> :paste
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
val r = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
@@ -134,7 +168,7 @@ To quit Spark Shell, you can exit using Ctrl+C.
## Resources
This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/vagrant/project/warcbase-resources/Sample-Data/`).
This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/ubuntu/project/warcbase-resources/Sample-Data/`).
The ARC and WARC file are drawn from the [Canadian Political Parties & Political Interest Groups Archive-It Collection](https://archive-it.org/collections/227), collected by the University of Toronto. We are grateful that they've provided this material to us.
@@ -151,4 +185,4 @@ You can find more information about this collection at [WebArchives.ca](http://w
## Acknowlegements
This research has been supported by the Social Sciences and Humanities Research Council with Insight Grant 435-2015-0011. Additional funding for student labour on this project comes from an Ontario Ministry of Research and Innovation Early Researcher Award.
This research has been supported by the Social Sciences and Humanities Research Council with Insight Grant 435-2015-0011. Additional funding for student labour on this project comes from an Ontario Ministry of Research and Innovation Early Researcher Award. The idea for the AWS deployment came from the DocNow team and their [repository here](https://github.com/web-archive-group/warcbase_workshop_vagrant/tree/aws).
34 changes: 32 additions & 2 deletions Vagrantfile
@@ -11,13 +11,43 @@ Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.provider "virtualbox" do |v|
v.name = "Warcbase workshop VM"
end
config.vm.hostname = "warcbase"

config.vm.hostname = "warcbase"

# Every Vagrant virtual environment requires a box to build off of.
config.vm.box = "ubuntu/trusty64"
config.vm.box = "ubuntu/xenial64"

config.vm.network :forwarded_port, guest: 9000, host: 9000 # Spark Notebook

config.vm.provider :aws do |aws, override|
aws.access_key_id = "KEY"
aws.secret_access_key = "SECRETKEY"
override.vm.box = "lattice/ubuntu-trusty-64"
override.ssh.username = "ubuntu"
override.ssh.private_key_path = "/PATH/TO/KEY"
aws.region = "us-west-2"
aws.region_config "us-west-2" do |region|
region.ami = "ami-01f05461"
# by default, spins up lightweight m3.medium. If want powerful, uncomment below.
# region.instance_type = "c3.4xlarge"
region.keypair_name = "KEYPAIRNAME"
end
end

# This should work fine out of the box if environment variables are declared
config.vm.provider :digital_ocean do |provider, override|
provider.ssh_key_name = ENV['DIGITALOCEAN_KEYNAME']
override.ssh.private_key_path = ENV['DIGITALOCEAN_KEYPATH']
override.ssh.username = "vagrant"
override.vm.box = 'digital_ocean'
override.vm.box_url = "https://github.com/smdahlen/vagrant-digitalocean/raw/master/box/digital_ocean.box"
provider.token = ENV['DIGITALOCEAN_TOKEN']
provider.image = 'ubuntu-14-04-x64'
provider.region = 'tor1'
provider.size = '4gb'
override.vm.network :forwarded_port, guest: 80, host: 80
end

config.vm.provider "virtualbox" do |vb|
vb.customize ["modifyvm", :id, "--memory", '2056']
vb.customize ["modifyvm", :id, "--cpus", "2"]
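
Review note: the `:aws` provider block in the Vagrantfile hunk above uses placeholder strings for the access key, secret key, keypair name, and private key path, while the `:digital_ocean` block reads its values from `ENV`. The following is a minimal sketch, not part of this commit, of how the AWS values could be read from the environment in the same way. The helper names `require_env` and `aws_settings` are hypothetical, and the environment variable names are assumptions chosen for illustration.

```ruby
# Hypothetical helper (not part of this commit): read a required setting
# from an environment-like hash and fail early if it is missing or blank.
def require_env(name, env = ENV)
  value = env[name]
  if value.nil? || value.strip.empty?
    raise ArgumentError, "Environment variable #{name} must be set"
  end
  value
end

# Collect the values the :aws provider block needs.
# The variable names below are assumptions chosen for illustration.
def aws_settings(env = ENV)
  {
    access_key_id:     require_env("AWS_ACCESS_KEY_ID", env),
    secret_access_key: require_env("AWS_SECRET_ACCESS_KEY", env),
    keypair_name:      require_env("AWS_KEYPAIR_NAME", env),
    private_key_path:  require_env("AWS_PRIVATE_KEY_PATH", env)
  }
end
```

Inside the provider block, one could then call `settings = aws_settings` and assign, for example, `aws.access_key_id = settings[:access_key_id]` and `override.ssh.private_key_path = settings[:private_key_path]`. This keeps real credentials out of the committed Vagrantfile, and a missing variable raises an error when the file is evaluated instead of failing later during provisioning.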