#### Download Data

In [5]:
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00380/YouTube-Spam-Collection-v1.zip
!unzip YouTube-Spam-Collection-v1.zip
!mkdir -p data/youtube
!mv *.csv data/youtube
!rm -rf __MACOSX/

In [8]:
!ls data/youtube

Youtube01-Psy.csv	 Youtube03-LMFAO.csv   Youtube05-Shakira.csv
Youtube02-KatyPerry.csv  Youtube04-Eminem.csv


#### Load Dependencies

In [22]:
import dask.dataframe as dd
import dask as ds

#### Load Data

In [13]:
youtube_comments = dd.read_csv("data/youtube/*.csv")

#### Verify Shape

In [23]:
ds.compute(youtube_comments.shape)

((1956, 5),)

#### Verify Classification Counts

In [95]:
youtube_comments['CLASS']\
    .mask(youtube_comments['CLASS'] == 1, 'spam')\
    .mask(youtube_comments['CLASS'] == 0, 'ham')\
    .compute()\
    .value_counts()

spam    1005
ham      951
Name: CLASS, dtype: int64
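The relabeling above chains two `mask` calls. A minimal alternative sketch uses a single `map` with a dictionary. It runs on a small hypothetical pandas Series rather than the real data, so the values in `labels` are illustrative only. Dask Series also offer `map` and `value_counts`, so the same idea should carry over and would let the counting happen before `.compute()`, but that is an assumption not verified here.

```python
import pandas as pd

# Hypothetical sample standing in for youtube_comments['CLASS'] (1 = spam, 0 = ham).
labels = pd.Series([1, 0, 1, 1, 0], name="CLASS")

# Replace the numeric codes with readable names in one pass.
named = labels.map({1: "spam", 0: "ham"})

# Count how many comments fall in each class.
counts = named.value_counts()
print(counts)
```

A dictionary keeps the code-to-name mapping in one place, which may be easier to extend than a chain of `mask` calls if more classes were ever added.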

#### How Many Non-Spam Comments Contain The Word 'check'?

In [112]:
youtube_comments[['CLASS', 'CONTENT']]\
    .assign(
        contains_check=youtube_comments['CONTENT'].str.lower().str.contains('check'),
        CLASS=youtube_comments['CLASS']\
                .mask(youtube_comments['CLASS'] == 1, 'spam')\
                .mask(youtube_comments['CLASS'] == 0, 'ham')
    )\
    .groupby(['CLASS', 'contains_check'])\
    .count()\
    .compute()\
    .rename(columns={
        'CONTENT': 'comments'
    })

CLASS,contains_check,comments
ham,False,932
ham,True,19
spam,False,544
spam,True,461
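To read the answer off the grouped counts directly, the sketch below rebuilds that small table in pandas (the numbers are copied from the output above; the variable names are my own) and computes, for each class, how many comments contain 'check' and what share of the class that represents.

```python
import pandas as pd

# Counts copied from the grouped output above.
counts = pd.DataFrame(
    {
        "CLASS": ["ham", "ham", "spam", "spam"],
        "contains_check": [False, True, False, True],
        "comments": [932, 19, 544, 461],
    }
)

# Total number of comments per class.
total = counts.groupby("CLASS")["comments"].sum()

# Number of comments per class that contain the word 'check'.
with_check = counts[counts["contains_check"]].set_index("CLASS")["comments"]

# Fraction of each class that contains 'check'.
share = with_check / total

print(with_check["ham"])  # non-spam comments containing 'check'
print(share.round(3))
```

So 19 of the 951 non-spam comments contain 'check' (about 2%), against 461 of the 1005 spam comments (about 46%).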


#### Platforms, Languages, and Libraries. What have we learned?

Much of the benefit offered by the tools we've been using this past week has been determined by the development goals and the ecosystem within which each project was built. For example, Dask and Spark both broadly solve the same problem of distributed computation on large data. But Spark, as a JVM project (written mainly in Scala) with heavy adoption in industry, is large and vertically integrated, whereas Dask focuses on the narrower problem of coordinating distributed versions of familiar Python tools such as pandas.

Around these technologies, platforms such as Amazon EMR and Databricks have emerged to simplify operations: they let one outsource the management of compute clusters and worry only about the computations themselves.

Of these technologies, I prefer those with a small footprint: Python, SQL, and Dask. Because they have remained well segmented, they compose well, which in turn allows for more powerful abstractions. These tools are most useful when there is time to build a problem-specific solution. Otherwise, large and wide-ranging technologies like Spark allow for immediate integration and adoption, making them useful for those who just want to solve the problem at hand and nothing more.