In [117]:
from dask import compute, delayed
import dask.dataframe as dd
import pandas as pd

In [118]:
!ls -l

total 544
drwxrwxr-x 4 ec2-user ec2-user   4096 Feb 25 23:08 DS-Unit-3-Sprint-3-Big-Data
drwx------ 2 ec2-user ec2-user  16384 Feb 22 22:35 lost+found
drwxrwxr-x 2 ec2-user ec2-user   4096 Mar 26  2017 __MACOSX
-rw-rw-r-- 1 ec2-user ec2-user  15512 Mar  1 17:34 Sprint-Challenge-DS-Unit-3-Sprint-3-Big-Data.ipynb
-rw-r--r-- 1 ec2-user ec2-user  57438 Mar 26  2017 Youtube01-Psy.csv
-rw-r--r-- 1 ec2-user ec2-user  64279 Mar 26  2017 Youtube02-KatyPerry.csv
-rw-r--r-- 1 ec2-user ec2-user  64419 Mar 26  2017 Youtube03-LMFAO.csv
-rw-r--r-- 1 ec2-user ec2-user  82896 Mar 26  2017 Youtube04-Eminem.csv
-rw-r--r-- 1 ec2-user ec2-user  72706 Mar 26  2017 Youtube05-Shakira.csv
-rw-rw-r-- 1 ec2-user ec2-user 163567 Mar 26  2017 YouTube-Spam-Collection-v1.zip


In [119]:
dd_all = dd.read_csv("Youtube*.csv").persist() # Persist to make future computations faster

In [120]:
dd_all.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [121]:
dd_all.shape

(Delayed('int-a22df5de-32c2-4b48-9a6b-ce9cc6a23d4b'), 5)

In [122]:
print("Num rows:", len(dd_all))
print("Num cols:", dd_all.shape[1])

Num rows: 1956
Num cols: 5


In [123]:
dd_all.count().compute()

COMMENT_ID    1956
AUTHOR        1956
DATE          1711
CONTENT       1956
CLASS         1956
dtype: int64

### From the above, DATE has only 1711 non-null values against 1956 rows, so 245 dates are missing. Keep this in mind for later.
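
A minimal sketch of counting the gaps per column. It uses a small hypothetical pandas frame, not the real data: the column names mirror the ones above, and the values are made up. On the Dask dataframe the same idea would be `dd_all.isnull().sum().compute()`.

```python
import pandas as pd

# Hypothetical toy data: one DATE value is missing.
toy = pd.DataFrame({
    "AUTHOR": ["a", "b", "c"],
    "DATE": ["2013-11-07T06:20:48", None, "2013-11-08T17:34:21"],
})

# count() skips nulls, so rows minus count gives the missing values per column.
missing = len(toy) - toy.count()
print(missing)
```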

In [124]:
df_spam_or_no = dd_all.groupby("CLASS").CLASS.count().compute()
df_spam_or_no

CLASS
0     951
1    1005
Name: CLASS, dtype: int64

In [125]:
print("Number of spam comments:", df_spam_or_no[1])
print("Number of non-spam comments:", df_spam_or_no[0])

Number of spam comments: 1005
Number of non-spam comments: 951


In [127]:
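def lower_and_check_exists(s):
    # NOTE: lower_s is computed but not used in the test below, so the
    # match is case-sensitive ("Check" is not counted). The outputs shown
    # in this notebook come from this case-sensitive version.
    lower_s = s.lower()
    if "check" in s:
        return 1
    else:
        return 0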
    
dd_all["CHECK_EXISTS"] = \
    dd_all["CONTENT"] \
        .apply(lower_and_check_exists,
            meta=pd.Series(dtype='int',
                           name='CHECK_EXISTS'))
dd_all.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS,CHECK_EXISTS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1,1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1,0
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1,0
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1,0


In [130]:
dd_all.groupby(["CHECK_EXISTS", "CLASS"]).CLASS.count().compute()

CHECK_EXISTS  CLASS
0             0        932
              1        819
1             0         19
              1        186
Name: CLASS, dtype: int64

### From the above, 19 legitimate (non-spam) comments contain the word "check", and 186 spam comments contain it (vs. 461 for Ryan). The match here is case-sensitive: row 4 ("Check this out") is flagged 0, which may explain the lower spam count.
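
A minimal sketch of how case sensitivity changes the flag. The comments here are hypothetical strings, not rows from the dataset.

```python
import pandas as pd

# Hypothetical comments, not taken from the dataset.
comments = pd.Series(["Check this out", "please check my channel", "nice song"])

# Case-sensitive flag, mirroring the `"check" in s` test used above.
case_sensitive = comments.apply(lambda s: int("check" in s))

# Case-insensitive flag: lowercase the text first, then test.
case_insensitive = comments.apply(lambda s: int("check" in s.lower()))

print(case_sensitive.tolist())    # [0, 1, 0]
print(case_insensitive.tolist())  # [1, 1, 0]
```

Only the capitalized "Check" comment changes between the two versions.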

# Part 2

## Big Data options
Dask lets you work with a cluster of machines.
You specify what you want to do with the dataframe
(in this case a distributed Dask dataframe).
The specification is lazy: nothing runs until you call compute()
on the last step, which executes the whole
computation and returns the output.
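
The lazy specify-then-compute pattern can be sketched with `dask.delayed` rather than a dataframe, to keep the example small. The helper functions `double` and `total` are hypothetical and the inputs are made up.

```python
from dask import delayed

@delayed
def double(x):
    return 2 * x

@delayed
def total(values):
    return sum(values)

# Building the graph runs nothing yet; `result` is a Delayed object.
result = total([double(i) for i in range(4)])

# compute() executes the whole graph and returns a plain value.
value = result.compute()
print(value)  # 12
```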

Spark is similar to Dask in that you can distribute
your data across partitions, declare what you want
to do to that data, and trigger execution only when you
ask for results, for example with collect().
Spark also runs SQL queries through the spark.sql()
command, which makes it easier for people who know SQL
to move to Spark.
Scala adds a steep learning curve to Spark.
I have read that almost anything you can do with
Scala + Spark can also be done with PySpark,
which makes Spark accessible to users who know Python.

If I want to distribute some calculations to
speed up my analysis of the data, I will use Python + Dask,
simply because I will be doing the analysis in Python anyway
and would not want to switch to Scala + Spark (another environment).

If I'm working in a production environment, I will use
Scala + Spark. These are more mature technologies than
Dask, and have been battle-tested.
I have read that Dask is faster than Spark, but I have
not looked at benchmarks yet. In any case, for production I
value stability over speed.

In the area where I live, there are many Scala + Spark jobs
and few Dask jobs (if any). For job prospects, this is another
reason I would go with Scala + Spark.