Allowing AWS, updating links #6

Merged: 8 commits, Nov 17, 2016
Changes from 5 commits
60 changes: 47 additions & 13 deletions README.md
@@ -24,7 +24,7 @@ To install this virtual machine, you have two options.

[You can download it from this link and "import the appliance" using VirtualBox](http://alpha.library.yorku.ca/releases/warcbase_workshop/Warcbase_workshop_VM.ova). Note that this is a 6.4GB download. If you do this, [skip to "Spark Notebook" below](https://github.com/web-archive-group/warcbase_workshop_vagrant#spark-notebook).

Or you can use vagrant to build it yourself.
Or you can use Vagrant to build it yourself, or provision it on AWS using the `aws` provider (see "Cloud Deployment" below).

## Use

@@ -38,6 +38,38 @@ From a working directory, please run the following commands.

Once you run these three commands, you will have a running virtual machine with the latest version of warcbase installed.
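
(The commands themselves are collapsed above this hunk; as a rough sketch of the usual workflow, with the repository URL taken from the link above, they look something like this.)

```bash
# Sketch only; the authoritative commands are in the full README above this hunk.
git clone https://github.com/web-archive-group/warcbase_workshop_vagrant.git
cd warcbase_workshop_vagrant
vagrant up
```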

## Cloud Deployment

You can also deploy this as an AWS instance. To do so, install the [vagrant-aws](https://github.com/mitchellh/vagrant-aws) plugin:

`vagrant plugin install vagrant-aws`
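
If you have not used vagrant-aws before, you will likely also need the placeholder "dummy" box that the updated `Vagrantfile` references; this step comes from the vagrant-aws README rather than from this repository:

`vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box`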

Then modify the `Vagrantfile` to point to your AWS information. The following block will need to be changed:

```ruby
config.vm.provider :aws do |aws, override|
  aws.access_key_id = "KEYHERE"
  aws.secret_access_key = "SECRETKEYHERE"
  aws.region = "us-west-2"

  aws.region_config "us-west-2" do |region|
    region.ami = "ami-01f05461"
    # By default, this spins up a lightweight m3.medium. If you want a more
    # powerful instance, uncomment the line below.
    # region.instance_type = "c3.4xlarge"

    region.keypair_name = "KEYPAIRNAME"
  end

  override.ssh.username = "ubuntu"
  override.ssh.private_key_path = "PATHTOPRIVATEKEY"
end
```

You can then bring the machine up by typing:

`vagrant up --provider aws`
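
Once the instance is up, `vagrant ssh` should connect to it over SSH as usual, and `vagrant destroy` terminates the instance when you are finished (so you stop paying for it).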

Note: you will need to change your AWS Security Group to allow incoming connections on ports 22 (SSH) and 9000 (Spark Notebook). By default, this launches a lightweight m3.medium. To do real work, you will need a larger (and sadly more expensive) instance.
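
If you manage security groups from the command line, a sketch along these lines opens both ports with the AWS CLI (the group ID below is a placeholder; substitute your own, and tighten the CIDR range if you do not want the ports open to the world):

```bash
# Placeholder group ID; replace with the security group attached to your instance.
SG_ID=sg-0123456789abcdef0
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 22 --cidr 0.0.0.0/0    # SSH
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 9000 --cidr 0.0.0.0/0  # Spark Notebook
```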

## Connect

Now you need to connect to the machine. You will do this through your command line, and also through your browser via Spark Notebook.
@@ -47,14 +79,14 @@ We use three commands to connect to this virtual machine. `ssh` to connect to it
To get started, type `vagrant ssh` in the directory where you installed the VM.

When prompted:
- username: `vagrant`
- password: `vagrant`
- username: `ubuntu`
- password: `ubuntu`

Here are some other example commands:
* `ssh -p 2222 vagrant@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt vagrant@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz vagrant@localhost:/home/vagrant` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir vagrant@localhost:/home/vagrant` - will sync `somedir` to your home directory of the vagrant machine.
* `ssh -p 2222 ubuntu@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt ubuntu@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz ubuntu@localhost:/home/ubuntu` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir ubuntu@localhost:/home/ubuntu` - will sync `somedir` to your home directory of the vagrant machine.
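
If you provisioned on AWS instead, the same ideas apply, but you connect to the instance's public IP with your key pair rather than to `localhost`. For example (the IP address and key path are placeholders): `scp -i /PATH/TO/PRIVATE/key.pem WARC.warc.gz ubuntu@35.162.32.51:/home/ubuntu`.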

## Environment

@@ -72,9 +104,11 @@ Here are some other example commands:
To run Spark Notebook, type the following:

* `vagrant ssh` (if on Vagrant; if you downloaded the OVA file and are running it with VirtualBox, you do not need to do this)
* `cd project/spark-notebook-0.6.2-SNAPSHOT-scala-2.10.4-spark-1.5.1-hadoop-2.6.0-cdh5.4.2/bin`
* `cd /home/ubuntu/project/spark-notebook-0.6.3-scala-2.11.7-spark-1.6.2-hadoop-2.7.2/bin`
* `./spark-notebook -Dhttp.port=9000 -J-Xms1024m`
* Visit http://127.0.0.1:9000/ in your web browser.

If you are connecting via AWS, visit the IP address of your instance (found on the EC2 dashboard) on port 9000 (e.g. `35.162.32.51:9000`).
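
If you prefer the command line, the public IP can also be pulled with the AWS CLI (a sketch, assuming your credentials and default region are already configured; it lists the IPs of all running instances):

```bash
aws ec2 describe-instances \
  --query "Reservations[].Instances[].PublicIpAddress" --output text
```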

![Spark Notebook](https://cloud.githubusercontent.com/assets/218561/14062458/f8c6a842-f375-11e5-991b-c5d6a80c6f1a.png)

@@ -84,11 +118,11 @@ To run spark shell:

* `vagrant ssh` (if you did not run that in the previous step)
* `cd project/spark-1.5.1-bin-hadoop2.6/bin`
* `./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar`
* `./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar`

Example:
```bash
vagrant@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
ubuntu@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
@@ -116,7 +150,7 @@ scala> :paste

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
val r = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
@@ -134,7 +168,7 @@ To quit Spark Shell, you can exit using Ctrl+C.

## Resources

This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/vagrant/project/warcbase-resources/Sample-Data/`).
This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/ubuntu/project/warcbase-resources/Sample-Data/`).

The ARC and WARC files are drawn from the [Canadian Political Parties & Political Interest Groups Archive-It Collection](https://archive-it.org/collections/227), collected by the University of Toronto. We are grateful that they've provided this material to us.

21 changes: 20 additions & 1 deletion Vagrantfile
@@ -14,10 +14,29 @@ Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
config.vm.hostname = "warcbase"

# Every Vagrant virtual environment requires a box to build off of.
config.vm.box = "ubuntu/trusty64"
config.vm.box = "dummy"

config.vm.network :forwarded_port, guest: 9000, host: 9000 # Spark Notebook

config.vm.provider :aws do |aws, override|
aws.access_key_id = "KEY"
aws.secret_access_key = "SECRETKEY"
#aws.security_groups = "sg-eaf78b93"

#aws.session_token = ""
aws.region = "us-west-2"

aws.region_config "us-west-2" do |region|
region.ami = "ami-01f05461"
# By default, this spins up a lightweight m3.medium. If you want a more powerful instance, uncomment the line below.
# region.instance_type = "c3.4xlarge"
region.keypair_name = "KEYPAIRNAME"
end

override.ssh.username = "ubuntu"
override.ssh.private_key_path = "/PATH/TO/PRIVATE/key.pem"
end

config.vm.provider "virtualbox" do |vb|
vb.customize ["modifyvm", :id, "--memory", '2056']
vb.customize ["modifyvm", :id, "--cpus", "2"]
52 changes: 26 additions & 26 deletions coursework/lessonplan.md
@@ -44,8 +44,8 @@ Go to your File menu, select "Import Appliance," and select the OVA file. Press
Then press "Start."

If you're lucky, the terminal window will appear. If you're asked for a username or password, it is:
- username: `vagrant`
- password: `vagrant`
- username: `ubuntu`
- password: `ubuntu`

### Option Two: Vagrant

@@ -78,33 +78,33 @@ When prompted:
- password: `vagrant`

Here are some other example commands:
* `ssh -p 2222 vagrant@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt vagrant@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz vagrant@localhost:/home/vagrant` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir vagrant@localhost:/home/vagrant` - will sync `somedir` to your home directory of the vagrant machine.
* `ssh -p 2222 ubuntu@localhost` - will connect to the machine using `ssh`;
* `scp -P 2222 somefile.txt ubuntu@localhost:/destination/path` - will copy `somefile.txt` to your vagrant machine.
- You'll need to specify the destination. For example, `scp -P 2222 WARC.warc.gz ubuntu@localhost:/home/ubuntu` will copy WARC.warc.gz to the home directory of the vagrant machine.
* `rsync --rsh='ssh -p2222' -av somedir ubuntu@localhost:/home/ubuntu` - will sync `somedir` to your home directory of the vagrant machine.

## Testing

Let's make sure we can get Spark Notebook running. On Vagrant, connect using `vagrant ssh`.

If you used VirtualBox, you have two options. On OS X or Linux, you can minimize your window, open your terminal, and connect to it using: `ssh -p 2222 vagrant@localhost`.
If you used VirtualBox, you have two options. On OS X or Linux, you can minimize your window, open your terminal, and connect to it using: `ssh -p 2222 ubuntu@localhost`.

On Windows, you'll have to use your VirtualBox terminal.

Either way, you should be at a prompt that looks like:

```
vagrant@warcbase:~$
ubuntu@warcbase:~$
```

### Testing Spark Shell

* `cd project/spark-1.5.1-bin-hadoop2.6/bin`
* `./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar`
* `./spark-shell --jars /home/ubuntu/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar`

Example:
```bash
vagrant@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
ubuntu@warcbase:~/project/spark-1.5.1-bin-hadoop2.6/bin$ ./spark-shell --jars /home/ubuntu/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
WARN NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
@@ -158,7 +158,7 @@ Let's start a new notebook. Click the "new" button in the upper right, and then
First, you need to load the warcbase jar. Paste this into the first cell and press the play button.

```bash
:cp /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
:cp /home/ubuntu/project/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
```

Second, you need to import the classes.
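
(The import cell itself is collapsed in this diff; it is the same pair of imports used elsewhere in the walkthrough.)

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
```
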
@@ -172,7 +172,7 @@ Third, let's run a test script. The following will load one of the ARC files fro

```scala
val r =
RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz",
RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz",
sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
@@ -192,7 +192,7 @@ Let's give it a try by adapting some of the scripts that we might run in the She

```scala
val r =
RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz",
RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz",
sc)
.keepValidPages()
.map(r => {
@@ -209,7 +209,7 @@ Again, change a variable. Right now, we see 100 characters of each webpage. Let'
Sometimes it can get boring typing out the same thing over and over again. We can set variables to make our life easier, such as:

```scala
val warc="/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz"
val warc="/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz"
```

Now instead of typing the path, we can just use `warc`. Try running that cell and replacing it in the script above. For the lazy, it looks like:
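
(The cell that follows is collapsed in this diff; reconstructed as a sketch, it is the earlier domain-count script with `warc` substituted for the full path.)

```scala
val r =
  RecordLoader.loadArchives(warc, sc)
    .keepValidPages()
    .map(r => ExtractDomain(r.getUrl))
    .countItems()
```
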
@@ -260,13 +260,13 @@ For example, to grab the plain text from the collection and **save it to a file*
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}

RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/home/vagrant/WARC-plain-text")
.saveAsTextFile("/home/ubuntu/WARC-plain-text")
```

You should now have a directory in `/home/vagrant/` with the plain text. I will show you it.
You should now have a directory in `/home/ubuntu/` with the plain text. I will show it to you.
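
If you want to peek at the output from the shell, something like the following should work (Spark writes its results as `part-*` files inside that directory):

```bash
ls /home/ubuntu/WARC-plain-text
head /home/ubuntu/WARC-plain-text/part-00000
```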

##### Text by Domain

@@ -276,11 +276,11 @@ Above, we saw that there were 34 pages belonging to `davidsuzuki.org`. Imagine w
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.keepDomains(Set("www.davidsuzuki.org"))
.map(r => (r.getCrawldate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
.saveAsTextFile("/home/vagrant/WARC-plain-text-David-Suzuki")
.saveAsTextFile("/home/ubuntu/WARC-plain-text-David-Suzuki")
```

It should work as well. Note that your command `keepDomains(Set("www.davidsuzuki.org"))` needs to match the string you found above.
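
Since `keepDomains` takes a `Set`, you can also filter several domains in one pass, e.g. `keepDomains(Set("www.davidsuzuki.org", "www.greenparty.ca"))`; the second domain here is only an arbitrary illustration, so use whatever strings you actually found in your own counts.
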
Expand All @@ -300,15 +300,15 @@ import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import StringUtils._

val links = RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
val links = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
.map(r => (ExtractDomain(r._1).removePrefixWWW(), ExtractDomain(r._2).removePrefixWWW()))
.filter(r => r._1 != "" && r._2 != "")
.countItems()
.filter(r => r._2 > 5)

links.saveAsTextFile("/home/vagrant/WARC-links-all/")
links.saveAsTextFile("/home/ubuntu/WARC-links-all/")
```

By now this should seem pretty straightforward. In your other window, visit the resulting file (the `part-00000` file in your `WARC-links-all` directory) and type:
@@ -340,7 +340,7 @@ You may want to do work with images. The following script finds all the image UR
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val links = RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
val links = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.flatMap(r => ExtractImageLinks(r.getUrl, r.getContentString))
.countItems()
@@ -364,13 +364,13 @@ We won't have much time for Spark Shell today, but we wanted to briefly show it.
To run, navigate to the spark-shell directory by

```bash
cd /home/vagrant/project/spark-1.5.1-bin-hadoop2.6/bin
cd /home/ubuntu/project/spark-1.5.1-bin-hadoop2.6/bin
```

Then run with:

```bash
./spark-shell --jars /home/vagrant/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
./spark-shell --jars /home/ubuntu/project/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar
```

>On your own system, you might want to pass different variables to allocate more memory and such (e.g. on our server, we often use `/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 60G --jars ~/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar` to give it 60GB of memory; or on the cluster, we use `spark-shell --jars ~/warcbase/target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5 --executor-memory 20G --driver-memory 26G`).
@@ -386,7 +386,7 @@ Then you can paste the following script. When it's looking right, press `Ctrl` a
```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("/home/vagrant/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
val r = RecordLoader.loadArchives("/home/ubuntu/project/warcbase-resources/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
@@ -405,7 +405,7 @@ Let's try setting it up on your own servers, or in a real production environmen

# Acknowledgements and Final Notes

This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/vagrant/project/warcbase-resources/Sample-Data/`).
This build also includes the [warcbase resources](https://github.com/lintool/warcbase-resources) repository, which contains NER libraries as well as sample data from the University of Toronto (located in `/home/ubuntu/project/warcbase-resources/Sample-Data/`).

The ARC and WARC files are drawn from the [Canadian Political Parties & Political Interest Groups Archive-It Collection](https://archive-it.org/collections/227), collected by the University of Toronto. We are grateful that they've provided this material to us.
