# Part 1. SageMaker & Dask

In this part, you'll work with a dataset of [YouTube Spam Comments](https://christophm.github.io/interpretable-ml-book/spam-data.html).

> We work with 1956 comments from 5 different YouTube videos. The comments were collected via the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All 5 are music videos. One of them is “Gangnam Style” by Korean artist Psy. The other artists were Katy Perry, LMFAO, Eminem, and Shakira.

> The comments were manually labeled as spam or legitimate. Spam was coded with a “1” and legitimate comments with a “0”.

Start an Amazon SageMaker Notebook instance. (Any instance type is ok. This can take a few minutes.) Open Jupyter. 

### Terminal
In the Jupyter dashboard, choose **New**, and then choose **Terminal.**

Run these commands in the terminal:

1. Upgrade Dask in the conda environment named `python3`. (This command upgrades Bokeh too, even though you won't use it directly, because the two packages seem to depend on compatible versions of each other. This can take a few minutes.)
```
conda install -n python3 bokeh dask
```

2. Change directory to SageMaker
```
cd SageMaker
```

3. Download data
```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
```

4. Unzip data
```
unzip YouTube-Spam-Collection-v1.zip
```

5. Confirm that there are five CSV files
```
ls *.csv
```

Then you can close the terminal window. 

### Notebook
Create a new notebook, with the **conda_python3** kernel.

For this Sprint Challenge, you *don't* need to create a Dask Distributed Client. You can just use a Dask DataFrame.

Load the five CSV files into one Dask DataFrame. It should have 1956 rows and 5 columns.

Use the Dask DataFrame to compute the count of spam comments (1005) versus the count of legitimate comments (951).

Spammers often tell people to check out their stuff! When the comments are converted to lowercase, 461 spam comments contain the word "check", versus only 19 legitimate comments. Use the Dask DataFrame to compute these counts.

### Optional bonus
To score a 3, do extra work, such as creating the Dask Distributed Client, or creating a visualization with this dataset.

In [51]:
# import pandas
import pandas as pd

# import dask
import dask.dataframe as dd
from dask import compute, delayed

from dask.distributed import Client

client = Client(n_workers=16)
client

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:33335 | Workers: 16 |
| Dashboard: http://127.0.0.1:34047/status | Cores: 16 |
| | Memory: 33.10 GB |


In [52]:
# Ensure all 5 csv files are ready
%ls -lh *.csv

-rw-r--r-- 1 ec2-user ec2-user 57K Mar 26  2017 Youtube01-Psy.csv
-rw-r--r-- 1 ec2-user ec2-user 63K Mar 26  2017 Youtube02-KatyPerry.csv
-rw-r--r-- 1 ec2-user ec2-user 63K Mar 26  2017 Youtube03-LMFAO.csv
-rw-r--r-- 1 ec2-user ec2-user 81K Mar 26  2017 Youtube04-Eminem.csv
-rw-r--r-- 1 ec2-user ec2-user 72K Mar 26  2017 Youtube05-Shakira.csv


In [53]:
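# load all csvs into a dask dataframe
# NOTE: %time on its own line times an empty statement, so the figures below
# do not measure read_csv; write `%time yt_spam = dd.read_csv(...)` to time the call
%time
yt_spam = dd.read_csv('Youtube*.csv')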

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 12.4 µs


In [54]:
yt_spam

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| npartitions=5 | | | | | |
| | object | object | object | object | int64 |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |


In [55]:
# Count spam (1) versus legitimate (0) comments
yt_spam['CLASS'].value_counts().compute()

1    1005
0     951
Name: CLASS, dtype: int64

In [40]:
# Add a boolean column: True where the lowercased comment contains 'check'
yt_spam['check'] = yt_spam['CONTENT'].str.lower().str.contains('check')
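
Note that `str.contains('check')` is a substring match, so it also flags words such as "checking" or "paycheck". The sketch below contrasts that with a whole-word match. It uses plain pandas on a few made-up comments (not rows from the dataset), on the assumption that Dask's `.str` accessor mirrors the pandas one for these calls.

```python
import pandas as pd

# Made-up comments for illustration only (not rows from the dataset).
comments = pd.Series([
    "CHECK out my channel!",
    "Checking in from Brazil",
    "I spent my paycheck on this album",
    "Great song",
])

lowered = comments.str.lower()

# Substring match, equivalent to the cell above: any occurrence of "check" counts.
substring_hits = lowered.str.contains("check", regex=False)

# Whole-word match: \b marks a word boundary, so "checking" and "paycheck" are excluded.
word_hits = lowered.str.contains(r"\bcheck\b", regex=True)

print(substring_hits.sum(), word_hits.sum())
```

The expected counts in the challenge (461 and 19) correspond to the substring version, so that is the one to use here.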

In [47]:
# Split the comments into spam and legitimate subsets
spam = yt_spam[yt_spam['CLASS'] == 1]
not_spam = yt_spam[yt_spam['CLASS'] == 0]

In [48]:
# Count spam comments that contain 'check'
# (filter with the subset's own column so the mask and the frame stay aligned)
check_spam = spam[spam['check']]
len(check_spam)

461

In [49]:
# Count legitimate comments that contain 'check'
# (filter with the subset's own column so the mask and the frame stay aligned)
check_not_spam = not_spam[not_spam['check']]
len(check_not_spam)

19

# Part 2. Big data options
You've been introduced to a variety of platforms (AWS SageMaker, AWS EMR, Databricks), libraries (Numba, Dask, MapReduce, Spark), and languages (Python, SQL, Scala, Java) that can "scale up" or "scale out" for faster processing of big data.

Write a paragraph comparing some of these technology options. For example, you could describe which technology you may personally prefer to use, in what circumstances, for what reasons.

(You can add your paragraph as a Markdown cell at the bottom of your SageMaker Notebook.)

### Part 2 Answer

Numba and Dask are great tools to use if you already have experience with pandas and Python. Dask's DataFrame syntax closely mirrors pandas, and Numba works by decorating ordinary Python functions, so neither requires learning a new language. Both are useful when you want to speed up or parallelize a workflow, for example when transforming and cleaning data. The rough order of speed is Numba > Dask > plain Python, though it depends on the workload. You can see the difference by running similar processes with each tool and timing them with the IPython magic command `%time`.
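
A timing comparison of this kind can be sketched outside a notebook with `time.perf_counter`, since `%time` is only available in IPython. The function and problem size below are made up for illustration, and the fallback decorator is an assumption added so that the sketch still runs when Numba is not installed (in which case both timings measure plain Python).

```python
import time

try:
    from numba import njit  # JIT-compiles the wrapped function on first call
except ImportError:
    # Assumption for portability: without Numba, fall back to a no-op decorator.
    def njit(func):
        return func


def sum_of_squares(n):
    """Plain Python loop over the integers below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Same function body, compiled by Numba when it is available.
fast_sum_of_squares = njit(sum_of_squares)


def timed(func, *args):
    """Return (result, elapsed wall-clock seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


n = 200_000
fast_sum_of_squares(10)  # warm-up call so compile time is not measured
slow_result, slow_seconds = timed(sum_of_squares, n)
fast_result, fast_seconds = timed(fast_sum_of_squares, n)
print(f"plain: {slow_seconds:.4f}s  jit: {fast_seconds:.4f}s")
```

A single call is a noisy measurement; repeating the call several times and taking the best time gives a fairer comparison.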

Even though AWS costs can accumulate, I prefer AWS over other tools like Databricks. AWS tools can have a steeper learning curve, but that pays off in the long run because of the broader support and the integration with the larger ecosystem of AWS tools. Databricks works well as a standalone product: notebooks and clusters are easy to set up, and it is tough to beat on cost, since it is free. However, it can be frustrating when you run into an error message whose solution is not obvious. As a beginning developer, and probably as a developer at any stage, having a good linter can save you a lot of time.