# Part 1. SageMaker & Dask
In this part, you'll work with a dataset of [YouTube Spam Comments](https://christophm.github.io/interpretable-ml-book/spam-data.html).

> We work with 1956 comments from 5 different YouTube videos. The comments were collected via the YouTube API from five of the ten most viewed videos on YouTube in the first half of 2015. All 5 are music videos. One of them is “Gangnam Style” by Korean artist Psy. The other artists were Katy Perry, LMFAO, Eminem, and Shakira.

> The comments were manually labeled as spam or legitimate. Spam was coded with a “1” and legitimate comments with a “0”.

### Notebook
Create a new notebook, with the **conda_python3** kernel.

For this Sprint Challenge, you *don't* need to create a Dask Distributed Client. You can just use a Dask Dataframe.

Load the five CSV files into one Dask DataFrame. It should have 1956 rows and 5 columns.

Use the Dask Dataframe to compute the counts of spam (1005 comments) versus the counts of legitimate comments (951).

Spammers often tell people to check out their stuff! After the comments are converted to lowercase, 461 spam comments contain the word "check", versus only 19 legitimate comments. Use the Dask DataFrame to compute these counts.

### Optional bonus
To score a 3, do extra work, such as creating the Dask Distributed Client, or creating a visualization with this dataset.

In [1]:
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=16)
client

Client: Scheduler: tcp://127.0.0.1:40617, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 16, Cores: 16, Memory: 67.53 GB


In [12]:
# Read all five CSV files into a single Dask DataFrame (the glob matches every Youtube*.csv)

yt_comments = dd.read_csv('Youtube*.csv')
print(yt_comments.columns)
print(len(yt_comments))

Index(['COMMENT_ID', 'AUTHOR', 'DATE', 'CONTENT', 'CLASS'], dtype='object')
1956


In [19]:
spam = yt_comments[yt_comments['CLASS'] == 1]

print(len(spam))

1005


In [20]:
legit = yt_comments[yt_comments['CLASS'] == 0]

print(len(legit))

951


In [39]:
check_spam = spam[spam['CONTENT'].str.lower().str.contains('check')]

len(check_spam)

461

In [40]:
check_legit = legit[legit['CONTENT'].str.lower().str.contains('check')]

len(check_legit)

19

# Part 2. Big data options
You've been introduced to a variety of platforms (AWS SageMaker, AWS EMR, Databricks), libraries (Numba, Dask, MapReduce, Spark), and languages (Python, SQL, Scala, Java) that can "scale up" or "scale out" for faster processing of big data.

Write a paragraph comparing some of these technology options. For example, you could describe which technology you may personally prefer to use, in what circumstances, for what reasons.

(You can add your paragraph as a Markdown cell at the bottom of your SageMaker Notebook.)

### Optional bonus
To score a 3, create a diagram comparing some of these technology options, or a flowchart to illustrate your decision-making process. 

You can use text-based diagram tools, such as:
- https://www.tablesgenerator.com/markdown_tables
- https://mermaidjs.github.io/mermaid-live-editor/

Or you can use presentation or drawing software, and commit your diagram to your GitHub repo as an image file. Or sketch on the back of a napkin, and take a photo with your phone. (If you choose to create a diagram, then you should also consider publishing it with a blog post later, after the Sprint Challenge.)

### GitHub
Commit your SageMaker notebook for parts 1 & 2 to GitHub. You can use git directly from the SageMaker terminal. Or you can download the .ipynb file from SageMaker to your local machine, and then commit the file to GitHub.

### Stop your instance
Stop your SageMaker Notebook instance, so you don't use excessive AWS credits. 

### Answer

From all the work we've been doing and from previous experience, I would prefer not to use Databricks. While its notebooks and clusters are a little easier to use, its error messages are rarely helpful, and the editor does little to assist with writing code (for example, it does not always auto-close quotes, curly braces, or parentheses). AWS SageMaker and EMR may be more expensive, but in my experience they have more support and more active development, which makes them better tools in the big data space.

When working with big data, I would definitely choose Dask for getting more processing power. You can distribute the work across however many workers you choose to define. It feels familiar because its API mirrors the Python libraries we've learned, such as pandas, so I find it more intuitive than Spark.

As far as languages go, I prefer SQL simply because it is the most familiar to me, and I have enjoyed using it alongside Python and Scala. I don't find Scala's DataFrame API very intuitive, so I have resorted to using `spark.sql` instead. I definitely prefer working with Python over Scala, so the combination of Python + SQL would be my choice.