## Part 1: SageMaker and Dask

In [8]:
# Import dask, read all CSVs, compute df.shape
import dask.dataframe as dd

df = dd.read_csv('*.csv')
dd.compute(df.shape)

((1956, 5),)

In [12]:
df.head()

Unnamed: 0,COMMENT_ID,AUTHOR,DATE,CONTENT,CLASS
0,LZQPQhLyRh80UYxNuaDWhIGQYNQ96IuCg-AYWqNPjpU,Julius NM,2013-11-07T06:20:48,"Huh, anyway check out this you[tube] channel: ...",1
1,LZQPQhLyRh_C2cTtd9MvFRJedxydaVW-2sNg5Diuo4A,adam riyati,2013-11-07T12:37:15,Hey guys check out my new channel and our firs...,1
2,LZQPQhLyRh9MSZYnf8djyk0gEF9BHDPYrrK-qCczIY8,Evgeny Murashkin,2013-11-08T17:34:21,just for test I have to say murdev.com,1
3,z13jhp0bxqncu512g22wvzkasxmvvzjaz04,ElNino Melendez,2013-11-09T08:28:43,me shaking my sexy ass on my channel enjoy ^_^ ﻿,1
4,z13fwbwp1oujthgqj04chlngpvzmtt3r3dw,GsMega,2013-11-10T16:05:38,watch?v=vtaRGgvGtWQ Check this out .﻿,1


In [15]:
# Calculate the count of spam and non-spam comments
dd.compute(df['CLASS'].value_counts())

(1    1005
 0     951
 Name: CLASS, dtype: int64,)

## Part 2: Big Data Options

First, I will discuss the platforms used in this week's sprint: AWS SageMaker and Databricks. These PaaS offerings differ mainly in their scaling methodology. As we used it, AWS SageMaker offers a single container whose allocated compute resources can be configured. This strategy is called _scaling up_, and it's a good option when traditional tooling can perform the data operations but requires more resources than a typical workstation or laptop provides. Databricks, conversely, _scales out_: it's a good option when the data is so large that a single AWS instance would be too expensive, or when the compute required is so extensive that it is better split among a _cluster_ of nodes.
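To make the idea of splitting work across nodes concrete, here is a minimal single-machine sketch of the partition-then-combine pattern. It is only an illustration: the threads stand in for cluster nodes, and the `partitions` data and the helper names `count_partition` and `combine` are hypothetical, not part of Databricks or any library.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CLASS labels (1 = spam, 0 = not spam), already split into
# partitions. On a real cluster each partition would live on a different node.
partitions = [[1, 1, 0], [0, 1], [1, 0, 0, 1]]


def count_partition(labels):
    """Count the labels inside a single partition."""
    return Counter(labels)


def combine(counters):
    """Merge the per-partition counts into one overall count."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


# Threads are an assumption made for illustration; they play the role of workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_partition, partitions))

class_counts = combine(partial_counts)
print(dict(class_counts))
```

Scaling up would instead run `count_partition` once over all of the data on a single, larger machine.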

Next, there's the choice of programming languages and libraries for big data operations. This week we used Python, Scala, and SQL, and for APIs we used Numba, Dask, MapReduce, and Spark. To choose a programming language effectively, one needs to think about the platform being used, the source of the data (a particular _kind_ of database, a data lake, etc.), and the scale of the operation (data size and compute). When the choice is mine, I prefer to use SQL for querying rather than some possibly half-baked query extension for an API, and otherwise whichever language the necessary API is natively written in. I think the language for data science operations should be dictated by which API best suits the problem, not the other way around. You can miss large "for free" performance optimizations by choosing the language before the "tool"/API. For example, when Spark is necessary, using Scala will most likely result in better performance than using the Python Spark API.
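As a small illustration of querying with plain SQL, the sketch below repeats the spam/non-spam count from Part 1 as a `GROUP BY` query. It uses Python's built-in `sqlite3` module purely as a convenient stand-in engine; the `comments` table, its `label` column, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (content TEXT, label INTEGER)")

# Hypothetical rows: label 1 = spam, label 0 = not spam.
rows = [
    ("check out my channel", 1),
    ("great song", 0),
    ("subscribe to me", 1),
    ("love this video", 0),
    ("free gift cards here", 1),
]
conn.executemany("INSERT INTO comments VALUES (?, ?)", rows)

# The same question as df['CLASS'].value_counts(), expressed in SQL.
query = "SELECT label, COUNT(*) FROM comments GROUP BY label ORDER BY label"
label_counts = conn.execute(query).fetchall()
conn.close()

print(label_counts)
```

The query text itself would carry over largely unchanged to other SQL engines, which is part of the appeal of reaching for SQL first.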

The right big data tool (Numba, Dask, or Spark) is somewhat dictated by the platform used for the big data operation. Numba's strength lies in its ability to JIT-compile Python snippets, and it has a mature integration with NumPy. Numba is a good option, in my opinion, for all big data platforms and workflows using Python because it can parallelize and optimize computations across the cores of a single machine and can also be used alongside a distributed framework like Dask or Spark. Finally, there's the choice of Spark or Dask for a distributed execution model. This choice _does_ depend on the language being used to some degree. Dask is implemented in pure Python and its API is smaller than Spark's, which makes it more composable with other modules like pandas. Spark, being written in Scala, is a more heavyweight option that builds on the MapReduce model. It's intended to be a more robust, "batteries included" library compared to Dask, though I'm willing to say that Spark hasn't fully achieved this goal yet. Dask, for its part, is years newer than Spark and is still working toward feature parity with some other tools in the big data ecosystem. Spark does have a distinct advantage in that there's a SQL dialect built in, with the killer feature being its query optimizer.
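For a sense of what JIT-compiling a Python snippet with Numba looks like, here is a minimal sketch. The function `sum_of_squares` is a hypothetical example, and the fallback decorator is an assumption added so the snippet still runs (without any speedup) where Numba is not installed.

```python
try:
    from numba import njit
except ImportError:
    # Assumption for illustration: without Numba, fall back to plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in range(n) using an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
print(result)
```

With Numba available, the first call compiles the loop to machine code, so any benefit shows up on later calls and on much larger values of `n`.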