# Part 1. SageMaker & Dask
In this part, you'll work with a dataset of [YouTube Spam Comments](https://christophm.github.io/interpretable-ml-book/spam-data.html).

> We work with 1956 comments from 5 different YouTube videos. The comments were collected via the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All 5 are music videos. One of them is “Gangnam Style” by Korean artist Psy. The other artists were Katy Perry, LMFAO, Eminem, and Shakira.

> The comments were manually labeled as spam or legitimate. Spam was coded with a “1” and legitimate comments with a “0”.

Start an Amazon SageMaker Notebook instance. (Any instance type is ok. This can take a few minutes.) Open Jupyter. 

### Terminal
In the Jupyter dashboard, choose **New**, and then choose **Terminal.**

Run these commands in the terminal:

1. Upgrade Dask in the conda environment named python3. (This command also upgrades Bokeh, which you won't use directly, because the two packages appear to share dependencies. This can take a few minutes.)
```
conda install -n python3 bokeh dask
```

2. Change directory to SageMaker
```
cd SageMaker
```

3. Download data
```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
```

4. Unzip data
```
unzip YouTube-Spam-Collection-v1.zip
```

5. Confirm that there are five CSV files
```
ls *.csv
```

### Notebook
Create a new notebook, with the **conda_python3** kernel.

- For this Sprint Challenge, you *don't* need to create a Dask Distributed Client. You can just use a Dask Dataframe.

In [86]:
# Creating merged Dask DataFrame

import dask.dataframe as dd

df = dd.read_csv('/home/ec2-user/SageMaker/*.csv')

df.head()

| | COMMENT_ID | AUTHOR | DATE | CONTENT | CLASS |
|---|---|---|---|---|---|
| 0 | LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU | Julius NM | 2013-11-07T06:20:48 | Huh, anyway check out this you[tube] channel: ... | 1 |
| 1 | LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A | adam riyati | 2013-11-07T12:37:15 | Hey guys check out my new channel and our firs... | 1 |
| 2 | LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8 | Evgeny Murashkin | 2013-11-08T17:34:21 | just for test I have to say murdev.com | 1 |
| 3 | z13jhp0bxqncu512g22wvzkasxmvvzjaz04 | ElNino Melendez | 2013-11-09T08:28:43 | me shaking my sexy ass on my channel enjoy ^_^ ﻿ | 1 |
| 4 | z13fwbwp1oujthgqj04chlngpvzmtt3r3dw | GsMega | 2013-11-10T16:05:38 | watch?v=vtaRGgvGtWQ Check this out .﻿ | 1 |
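
Because the wildcard in `read_csv` picks up every CSV in the directory, it can help to check which files match and how many data rows each holds before trusting the Dask length. The following is a minimal, Dask-free sketch using only the standard library. The helper name `count_csv_rows` and the two tiny demo files are hypothetical stand-ins, not part of the challenge.

```python
import csv
import glob
import os
import tempfile


def count_csv_rows(pattern):
    """Return {file name: number of data rows} for every file matching pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # newline="" lets the csv module handle quoted fields that contain newlines
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            counts[os.path.basename(path)] = sum(1 for _ in reader)
    return counts


# Hypothetical stand-in data: two tiny files instead of the five real ones.
demo_dir = tempfile.mkdtemp()
for name, rows in [("a.csv", 2), ("b.csv", 3)]:
    with open(os.path.join(demo_dir, name), "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["COMMENT_ID", "AUTHOR", "DATE", "CONTENT", "CLASS"])
        for i in range(rows):
            # The embedded newline checks that quoted multi-line comments count once
            writer.writerow([f"id{i}", "someone", "", "multi\nline comment", i % 2])

demo_counts = count_csv_rows(os.path.join(demo_dir, "*.csv"))
print(demo_counts, sum(demo_counts.values()))
```

On the notebook instance, calling `count_csv_rows('/home/ec2-user/SageMaker/*.csv')` should list five files whose row counts sum to 1956, provided no other CSV files are present in that directory.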


- Load the five csv files into one Dask Dataframe. It should have a length of 1956 rows, and 5 columns.

In [87]:
# Dask DataFrame Shape

len(df), len(df.columns)

(1956, 5)

- Use the Dask Dataframe to compute the counts of spam (1005 comments) versus the counts of legitimate comments (951).

In [183]:
# Spam comments versus Legitimate comments Value Counts

print('Spam = 1, Legitimate = 0')

df['CLASS'].compute().value_counts()

Spam = 1, Legitimate = 0


1    1005
0     951
Name: CLASS, dtype: int64

- Spammers often tell people to check out their stuff! When the comments are converted to lowercase, 461 spam comments contain the word "check", versus only 19 legitimate comments. Use the Dask Dataframe to compute these counts.

In [387]:
# Feature engineering lowercase content column

df['content'] = df['CONTENT'].str.lower()

In [225]:
# Feature engineering: boolean flag for whether the word 'check' appears in the lowercased content column

df['check'] = df['content'].str.contains('check')

In [386]:
# Counting legitimate comments versus spam comments with the word 'check'

group = df.groupby(['CLASS', 'check']).size().compute()

# Look up by explicit (CLASS, check) labels rather than by position
print("Legitimate comments with 'check':", group.loc[(0, True)])
print("Spam comments with 'check':", group.loc[(1, True)])

Legitimate comments with 'check': 19
Spam comments with 'check': 461


# Part 2. Big data options
You've been introduced to a variety of platforms (AWS SageMaker, AWS EMR, Databricks), libraries (Numba, Dask, MapReduce, Spark), and languages (Python, SQL, Scala, Java) that can "scale up" or "scale out" for faster processing of big data.

Write a paragraph comparing some of these technology options. For example, you could describe which technology you may personally prefer to use, in what circumstances, for what reasons.

(You can add your paragraph as a Markdown cell at the bottom of your SageMaker Notebook.)

### Optional bonus
Well-written, detailed paragraphs can score a 3. Or create a diagram comparing some of these technology options, or a flowchart to illustrate your decision-making process. 

You can use text-based diagram tools, such as:
- https://www.tablesgenerator.com/markdown_tables
- https://mermaidjs.github.io/mermaid-live-editor/

Or you can use presentation or drawing software, and commit your diagram to your GitHub repo as an image file. Or sketch on the back of a napkin, and take a photo with your phone. (If you choose to create a diagram, then you should also consider publishing it with a blog post later, after the Sprint Challenge.)

## Part 2 Response

A number of platforms and libraries are available for processing big data. The platforms covered here are AWS SageMaker, AWS EMR, and Databricks; the libraries are Numba, Dask, MapReduce, and Spark. Each can be used with some combination of Python, SQL, Scala, and Java. Which to choose depends on the business need and the model framework you are working with. If you are in an environment where scaling up is the model, then it would make sense to use an AWS platform with either MapReduce or Dask, along with Python, SQL, or Java. If the business dynamic is to scale out, then the best option would be to use either AWS SageMaker or Databricks and harness the power of Spark, because you can use Scala as the core language and incorporate Python and SQL where and when needed.