# Part 1. SageMaker and Dask

In [1]:
import dask.dataframe as dd

## Create dataframe

In [2]:
# Read in all 5 CSV files at once
df = dd.read_csv('*.csv')

In [4]:
df.head()

,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [13]:
# df.shape knows the column count from the schema, but the row count stays
# lazy (a Delayed object) until something forces a computation
df.shape

(Delayed('int-95fc256d-e7a2-4198-adf6-7dcc46ccf4e2'), 5)

In [9]:
# There we go, 1956 rows.
len(df)

1956

## How many comments are spam?

In [34]:
# Group by spam or not spam
df.groupby('CLASS').count().compute()

CLASS,COMMENT_ID,AUTHOR,DATE,CONTENT,lowercase,check
0,951,951,951,951,951,951
1,1005,1005,760,1005,1005,1005


In [23]:
# Create lowercase column (meta declares the output name and dtype,
# so Dask does not have to guess them and warn)
df['lowercase'] = df['CONTENT'].apply(str.lower, meta=('lowercase', 'object'))


In [25]:
# Create column that checks for the word 'check' (meta declares a boolean result)
df['check'] = df['lowercase'].apply(lambda x: 'check' in x, meta=('check', 'bool'))


In [33]:
# Groupby spam status and the word 'check'
df.groupby(['check','CLASS']).count().compute()

check,CLASS,COMMENT_ID,AUTHOR,DATE,CONTENT,lowercase
False,0,932,932,932,932,932
False,1,544,544,466,544,544
True,0,19,19,19,19,19
True,1,461,461,294,461,461


And there we go. Among comments containing 'check' (check=True), 19 are ham (CLASS=0) and 461 are spam (CLASS=1).

## Part 1 bonus!

In [37]:
# Creating a distributed client
from dask.distributed import Client
client = Client()

In [38]:
client

Client: Scheduler: tcp://127.0.0.1:42625, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 16, Memory: 67.53 GB


# Part 2. Big Data Options

My favorite platform is definitely SageMaker, if only because it lets me stick with the Jupyter notebooks (running Python) that I know and love. The main downside is that I have to reinstall all dependencies in every session, which takes a while and seems like it shouldn't be necessary. I'm glad to extend that workflow with Dask, since it deliberately appeals to people who already run Python libraries locally (pandas, scikit-learn) and copies their syntax as closely as possible while adding distributed computation for speed.

That being said, I've also become fond of SQL. That's partly because I know it's the business standard, so I need at least basic SQL literacy. It's also just really simple: Python is a large language where you have to do all sorts of things, but the main task in SQL is just the humble query. Since all queries have basically the same purpose, it doesn't take long to figure out how the keywords fit together and how to use them. In contexts such as writing Scala to interface with Spark on Databricks, I'm only too glad that I can create a temporary view and query it with SQL.
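To show what I mean by the humble query, here is a small sketch of the spam count from Part 1 written in SQL. It runs on Python's built-in `sqlite3` module rather than Spark, and the `comments` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory database with a made-up comments table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (content TEXT, class INTEGER)")
rows = [
    ("check out my channel", 1),
    ("great song", 0),
    ("subscribe to me", 1),
    ("love it", 0),
    ("check this out", 1),
]
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# The same question as the Dask groupby: how many comments per class?
query = """
    SELECT class, COUNT(*) AS n
    FROM comments
    GROUP BY class
    ORDER BY class
"""
result = conn.execute(query).fetchall()
print(result)
conn.close()
```

The `SELECT ... FROM ... GROUP BY` skeleton stays the same whatever the question is, which is most of why it's quick to pick up.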

Support for multiple languages is actually one of the most useful features of Spark, and probably part of the reason it has become such a standard. It's a much larger, more all-inclusive framework than Dask, and better suited to business intelligence. Dask is better for scrappy projects written in Python that depend on other Python libraries (NumPy, pandas).

Oh, and Numba is also extremely useful for the right use cases. I've run into a few situations (usually in coding challenges) where my own functions include a lot of loops that take forever to run. It feels like magic that Numba can just compile that code into a more efficient version that runs in a fraction of the time.
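Here's a minimal sketch of the kind of loop-heavy function I have in mind. The function `total_collatz_steps` is a made-up example, and the `try`/`except` fallback is only an assumption so the cell still runs where Numba isn't installed. I haven't timed it here, so treat any speedup as something to measure rather than a given.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for portability: without Numba, fall back to plain Python
    def njit(func):
        return func


@njit
def total_collatz_steps(limit):
    """Sum the number of Collatz steps needed to reach 1 for every start below limit."""
    total = 0
    for start in range(1, limit):
        n = start
        while n != 1:
            if n % 2 == 0:
                n = n // 2
            else:
                n = 3 * n + 1
            total += 1
    return total


# The first call compiles the function (when Numba is available); later calls reuse it
small = total_collatz_steps(6)
print(small)
```

The decorator is the only change to the function itself, which is what makes it feel like magic.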