# Project Setup

*Andrea Soto*  
*MIDS W205 Final Project*  
*Project Name: Graph Model of the Million Song Dataset*

---

# Baseline Requirements:

- W205 AMI with Hadoop and Spark. A **c3.8xlarge** instance with 32 CPUs was launched to process the entire MSD. This instance took approximately 14 hours to extract all the data and create the CSV files for Neo4j.
- AWS CLI installed and configured:

> **`$ pip install awscli`**  
> **`$ aws configure`**  
> AWS Access Key ID [None]: *enter_access_key*  
> AWS Secret Access Key [None]: *enter_secret_access_key*  
> Default region name [None]: *us-east-1*  
> Default output format [None]: *json*  
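
Before running the scripts in this notebook, it may help to confirm that the CLI is installed and that a default region is set. The following is a minimal sketch, not part of the project; `check_aws_cli` is a hypothetical helper name, and it assumes the `aws` binary is on the PATH.

```shell
# Hypothetical helper (not part of the project): checks that the AWS CLI
# is on the PATH and that a default region has been configured.
check_aws_cli() {
    if ! command -v aws > /dev/null 2>&1; then
        echo "aws CLI not found; run: pip install awscli" >&2
        return 1
    fi
    local region
    region=$(aws configure get region)
    if [ -z "$region" ]; then
        echo "No default region set; run: aws configure" >&2
        return 1
    fi
    echo "AWS CLI configured for region: $region"
}

# Usage: check_aws_cli && aws sts get-caller-identity
```

If the credentials are valid, `aws sts get-caller-identity` should return the account and user ARN as JSON.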

## Additional Requirements

Additional requirements are installed by the configuration scripts in this notebook. These scripts are run from within the EC2 instance and assume that the project's GitHub repository has been cloned:

> `git clone git@github.com:andrea-soto/W205_FinalProject.git`

The configuration scripts described in this notebook are located under the folder **'config'**

## Environment Names

The following environment names should exist after the setup:

- **NEO4J_HOME**="/graph/neo4j/bin"
- **INSTANCE_PDNS**="ec2-54-155-21-219.compute-1.amazonaws.com"
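
One way to check that these names are present in the current shell is sketched below. `require_env` is a hypothetical helper, not part of the project, and it only sees variables that have been exported.

```shell
# Hypothetical helper: report every named variable that is missing from
# the environment (unset, not exported, or empty).
require_env() {
    local name missing=0
    for name in "$@"; do
        # printenv prints the value of an exported variable, or nothing
        if [ -z "$(printenv "$name")" ]; then
            echo "Missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: require_env NEO4J_HOME INSTANCE_PDNS
```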

--- 

## Step 1: Install jq and ec2-metadata (as root user)

**Script Path and Name:** config/install-jq-ec2meta.sh  
**Script Description:** Install [jq](https://stedolan.github.io/jq/) to parse JSON on the command line and [ec2-metadata](https://aws.amazon.com/code/1825) to query information about the current instance

In [1]:
%%writefile config/install-jq-ec2meta.sh
#!/usr/bin/env bash

# Install jq to parse JSON in shell
sudo yum install -y jq

# Install EC2 Instance Metadata Query Tool
wget http://s3.amazonaws.com/ec2metadata/ec2-metadata
chmod a+x ec2-metadata
mv ec2-metadata /usr/bin

Writing config/install-jq-ec2meta.sh


In [2]:
!chmod a+x config/install-jq-ec2meta.sh
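
After running the script, a quick check that both tools ended up on the PATH could look like the sketch below. `missing_tools` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: print the name of every listed tool that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Usage: prints nothing when both tools from Step 1 are installed.
# missing_tools jq ec2-metadata
```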

## Step 2: Create and Attach EBS volumes to current instance (as root)

Attach 2 volumes to this instance:

- **Graph Volume:** 200GB volume created to store the graph and interim files. This volume was mounted under the folder '/graph'
- **MSD Volume:** 500GB volume created from the AWS snapshot *snap-5178cf30* with the entire Million Song Dataset (MSD). This volume was mounted under the folder '/msong_dataset'. For details about the snapshot, see [AWS Datasets](https://aws.amazon.com/datasets/million-song-dataset/).

**Script Path and Name:** config/create_volumes.sh  
**Script Description:** Create the volumes described above, attach them to the instance, and mount them for use

The environment names used in the script are set using ec2-metadata and should look something like this:

- INSTANCE_ID=i-ds107669
- INSTANCE_PDNS=ec2-54-155-21-219.compute-1.amazonaws.com
- INSTANCE_ZONE=us-east-1c
- GRAPH_VOL_ID=vol-7e18339d
- MSD_VOL_ID=vol-7e18339d
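
The instance values above come from `ec2-metadata`, which prints lines in a `key: value` shape. The sketch below isolates the `cut` pipeline that the script uses, applied to a made-up sample line; `metadata_value` is a hypothetical name and the instance ID is not a real one.

```shell
# Hypothetical helper mirroring the cut pipeline used in create_volumes.sh:
# take the second colon-separated field, then drop the leading space.
metadata_value() {
    cut -d: -f2 | cut -d' ' -f2
}

# Sample line in the "key: value" shape (made-up instance ID).
echo "instance-id: i-0123456789abcdef0" | metadata_value
# prints: i-0123456789abcdef0
```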

In [3]:
%%writefile config/create_volumes.sh
#!/usr/bin/env bash

# =============================================
# RUN SCRIPT AS ROOT USER
# Attaches 2 volumes to this instance: a Graph Volume of 200 GB mounted to /graph
# and a MSD Volume 280GB from snap-5178cf30 with the entire dataset mounted to /msong_dataset

cd ~

# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Save instance info in environment variables
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

# Get instance id
INSTANCE_ID=$(ec2-metadata -i | cut -d:  -f2| cut -d' ' -f2)
export INSTANCE_ID
# Get instance public hostname
INSTANCE_PDNS=$(ec2-metadata -p | cut -d:  -f2| cut -d' ' -f2)
export INSTANCE_PDNS
# Get instance availability zone
INSTANCE_ZONE=$(ec2-metadata -z | cut -d:  -f2| cut -d' ' -f2)
export INSTANCE_ZONE

# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Create Volumes
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

mkdir -p aws-info
 
### Create volume to store graph
echo LOG: Creating graph volume...
aws ec2 create-volume --size 200 --availability-zone $INSTANCE_ZONE --volume-type gp2 > aws-info/graph-volume.json
wait
# -r prints the raw string, so the surrounding quotes do not need to be stripped
GRAPH_VOL_ID=$(jq -r '.VolumeId' aws-info/graph-volume.json)
export GRAPH_VOL_ID

### Create volume from AWS snapshot of Million Song Dataset (full dataset)
echo LOG: Copying Million Song Dataset volume...
aws ec2 create-volume --availability-zone $INSTANCE_ZONE \
--snapshot-id snap-5178cf30 --volume-type gp2 > aws-info/msd-volume.json
wait
# -r prints the raw string, so the surrounding quotes do not need to be stripped
MSD_VOL_ID=$(jq -r '.VolumeId' aws-info/msd-volume.json)
export MSD_VOL_ID

echo LOG: Wait for volumes to become available...

sleep 30

# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Attach volumes to this instance
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

echo LOG: Attaching graph volume...
aws ec2 attach-volume --volume-id $GRAPH_VOL_ID --instance-id $INSTANCE_ID --device /dev/xvdj

echo LOG: Attaching Million Song Dataset volume...
aws ec2 attach-volume --volume-id $MSD_VOL_ID --instance-id $INSTANCE_ID --device /dev/xvdk

echo LOG: Wait for volumes to be attached...
    
sleep 30 

# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Mount volumes to instance
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

mkdir -p /graph
# Format the new graph volume (attached above as /dev/xvdj)
sudo mkfs -t ext4 /dev/xvdj
sudo mount -t ext4 /dev/xvdj /graph
chmod g+rwx -R /graph/

mkdir -p /msong_dataset
sudo mount /dev/xvdk /msong_dataset

echo LOG: Check volumes were created and mounted...
lsblk

Overwriting config/create_volumes.sh


In [4]:
!chmod a+x config/create_volumes.sh
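
The script pauses for a fixed 30 seconds after creating and after attaching the volumes, which may be too short for a volume restored from a large snapshot. A possible alternative, sketched here under the assumption that the installed AWS CLI provides the `aws ec2 wait` subcommands, is to block until each volume reports the expected state. The helper names are hypothetical.

```shell
# Hypothetical helpers: block until a volume reaches the expected state
# instead of sleeping for a fixed 30 seconds.
wait_until_available() {
    aws ec2 wait volume-available --volume-ids "$1"
}

wait_until_attached() {
    aws ec2 wait volume-in-use --volume-ids "$1"
}

# Usage inside create_volumes.sh (variable names as in the script):
# wait_until_available "$GRAPH_VOL_ID" && wait_until_available "$MSD_VOL_ID"
# aws ec2 attach-volume --volume-id "$GRAPH_VOL_ID" --instance-id "$INSTANCE_ID" --device /dev/xvdj
# wait_until_attached "$GRAPH_VOL_ID"
```

The `volume-in-use` waiter reports the volume state only, so checking `lsblk` before mounting remains a sensible final step.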

## Step 3: Install Anaconda and h5py (as user)

**Script Path and Name:** config/install_anaconda.sh  
**Script Description:** Install Anaconda and then h5py under the directory **'/graph'**

In [None]:
!su - asoto

In [5]:
%%writefile config/install_anaconda.sh
#!/usr/bin/env bash

cd /graph
wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda2-2.4.1-Linux-x86_64.sh
bash Anaconda2-2.4.1-Linux-x86_64.sh
conda install h5py    

Writing config/install_anaconda.sh
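
As written, the installer prompts for input, and `conda` is only found afterwards if the Anaconda `bin` directory is on the PATH. The sketch below shows an unattended variant using the installer's batch (`-b`) and prefix (`-p`) options. The helper name and the prefix `/graph/anaconda2` are assumptions, not taken from the project.

```shell
# Hypothetical helper: run the Anaconda installer unattended and put its
# bin directory first on the PATH for the current shell.
install_anaconda_batch() {
    local installer="$1" prefix="$2"
    # -b: batch mode (no prompts), -p: installation prefix
    bash "$installer" -b -p "$prefix" || return 1
    export PATH="$prefix/bin:$PATH"
}

# Usage (the prefix /graph/anaconda2 is an assumption):
# install_anaconda_batch Anaconda2-2.4.1-Linux-x86_64.sh /graph/anaconda2
# conda install -y h5py
```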


## Step 4: Install Neo4j in /graph directory (as user)

**Script Path and Name:** config/install_neo4j.sh  
**Script Description:** Install Neo4j under the directory **'/graph'**

In [10]:
%%writefile config/install_neo4j.sh
#!/usr/bin/env bash

# === Install Neo4j in /graph directory ===

echo LOG: Installing Neo4j in /graph ...
cd ~
cd /graph
# Quote the URL so the shell does not treat '?' as a glob, and name the download explicitly
wget -O neo4j-community-2.3.1-unix.tar.gz "http://neo4j.com/artifact.php?name=neo4j-community-2.3.1-unix.tar.gz"
tar -xf neo4j-community-2.3.1-unix.tar.gz
rm neo4j-community-2.3.1-unix.tar.gz
mv neo4j-community-2.3.1/ neo4j/
NEO4J_HOME="/graph/neo4j/bin"
export NEO4J_HOME

echo 'org.neo4j.server.webserver.address = 0.0.0.0' >> /graph/neo4j/conf/neo4j-server.properties

echo LOG: Installing py2neo and updating password of user 'neo4j' ...
pip install py2neo
neoauth neo4j neo4j redpill

echo LOG: Password set to 'redpill'...
# Quoted so that '<' and '>' are printed instead of being parsed as redirections
echo "To change the password run: neoauth neo4j redpill <new-password>"

Overwriting config/install_neo4j.sh
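
Once the server has been started (for example with `/graph/neo4j/bin/neo4j start`), a simple reachability check could look like the sketch below. It assumes the default HTTP port 7474, the REST root `/db/data/`, and the password set by the script; `neo4j_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the Neo4j REST endpoint answers
# with a successful HTTP status for the given credentials.
neo4j_ready() {
    local host="${1:-localhost}" user="${2:-neo4j}" pass="${3:-redpill}"
    curl --silent --fail --user "${user}:${pass}" \
        "http://${host}:7474/db/data/" > /dev/null
}

# Usage: neo4j_ready && echo "Neo4j is up"
# From another machine: neo4j_ready "$INSTANCE_PDNS"
```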


## Step 5: Download last.fm dataset and mismatch data (as user)

**Script Path and Name:** config/download_lastfm.sh  
**Script Description:** Download Last.fm dataset and store it under the directory **'/graph/lastfm'**. Download mismatch data and store it under the directory **'/graph/import'**

> The MSD team found some matching errors between tracks and songs in the data. They created a list of (song id, track id) pairs that are not trusted, and they suggest removing these pairs from the data. These mismatches were removed from the data as part of the transformation process. The list is not included in the original Last.fm dataset.

> For more details see:
>
> - http://labrosa.ee.columbia.edu/millionsong/blog/12-1-2-matching-errors-taste-profile-and-msd
> - http://labrosa.ee.columbia.edu/millionsong/blog/12-2-12-fixing-matching-errors
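
To illustrate how the untrusted (song id, track id) pairs might be pulled out of the mismatch file, the sketch below assumes each line has the shape `ERROR: <SONG_ID TRACK_ID> ...`. That line format is an assumption and should be checked against the downloaded file. `mismatch_pairs` is a hypothetical helper, not the project's transformation code.

```shell
# Hypothetical helper: print "song_id,track_id" for each line that matches
# the ASSUMED format "ERROR: <SONG_ID TRACK_ID> ..."; other lines are skipped.
mismatch_pairs() {
    sed -n 's/^ERROR: <\([A-Z0-9]*\) \([A-Z0-9]*\)>.*$/\1,\2/p' "$@"
}

# Usage: mismatch_pairs /graph/import/sid_mismatches.txt | head
```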

In [9]:
%%writefile config/download_lastfm.sh
#!/usr/bin/env bash

# -p avoids an error when the script is re-run and the directories already exist
mkdir -p /graph/import
mkdir -p /graph/lastfm

cd /graph/lastfm
mkdir data

wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip

unzip -q lastfm_train.zip
mv lastfm_train/* data/

unzip -q lastfm_test.zip
rsync -av lastfm_test/ data/

rm lastfm_train.zip
rm lastfm_test.zip
rm -r lastfm_train/
rm -r lastfm_test/

cd /graph/import
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tasteprofile/sid_mismatches.txt
    
cd /graph/

Overwriting config/download_lastfm.sh
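
After the download, a rough sanity check is to count the JSON files that ended up under `/graph/lastfm/data`. This is a minimal sketch that assumes the archives contain one `.json` file per track; `count_json_files` is a hypothetical helper name.

```shell
# Hypothetical helper: count the .json files anywhere below a directory.
count_json_files() {
    find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

# Usage: count_json_files /graph/lastfm/data
```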


In [None]:
#echo 'export INSTANCE_ID='$INSTANCE_ID >> ~/.bashrc
#echo 'export INSTANCE_PDNS='$INSTANCE_PDNS >> ~/.bashrc
#echo 'export INSTANCE_ZONE='$INSTANCE_ZONE >> ~/.bashrc
#source ~/.bashrc
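
Variables exported inside a script that runs as a child process are gone once the script exits, which is why the commented lines above write them to `~/.bashrc`. A sketch of the same idea that avoids duplicate entries on repeated runs is shown below; `persist_env` is a hypothetical helper name.

```shell
# Hypothetical helper: append 'export NAME="value"' to a profile file
# unless an export line for NAME is already there.
persist_env() {
    local name="$1" value="$2" file="${3:-$HOME/.bashrc}"
    if ! grep -q "^export ${name}=" "$file" 2> /dev/null; then
        printf 'export %s="%s"\n' "$name" "$value" >> "$file"
    fi
}

# Usage: persist_env INSTANCE_ZONE "$INSTANCE_ZONE" && source ~/.bashrc
```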

In [None]:
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Unmount volumes from instance
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
umount /graph
umount /msong_dataset

# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Detaching volumes from this instance
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    
echo LOG: Detaching graph volume...
aws ec2 detach-volume --volume-id $GRAPH_VOL_ID --instance-id $INSTANCE_ID

echo LOG: Detaching Million Song Dataset volume...
aws ec2 detach-volume --volume-id $MSD_VOL_ID --instance-id $INSTANCE_ID
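
The teardown above fails partway if a directory is not mounted. A more forgiving variant is sketched below, assuming the `mountpoint` utility from util-linux is available on the instance; `unmount_if_mounted` is a hypothetical helper name.

```shell
# Hypothetical helper: unmount a directory only if it is currently a mount point.
unmount_if_mounted() {
    local dir="$1"
    if mountpoint -q "$dir"; then
        umount "$dir"
    else
        echo "LOG: $dir is not mounted, skipping"
    fi
}

# Usage:
# unmount_if_mounted /graph
# unmount_if_mounted /msong_dataset
```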