In [33]:
import dask

import dask.dataframe as dd


In [34]:
df = dd.read_csv("*csv")

In [35]:
# Dataset shape as (rows, columns)

len(df), len(df.columns)

(1956, 5)

In [36]:
print("Spam Count: " , len(df[df['CLASS'] == 1]),'\n',
      "Legit Comments: ", len(df[df['CLASS'] != 1]))

Spam Count:  1005 
 Legit Comments:  951


In [37]:
# Count comments that mention "check". Summing the boolean mask counts the
# True values directly, instead of relying on the ordering of value_counts().
print("Spam Promotion: " , (df[df["CLASS"]==1]['CONTENT']
                            .str.lower()
                            .str.contains("check")
                            .sum()
                            .compute()), '\n',
      "Legit Promotion: " , (df[df["CLASS"]!=1]['CONTENT']
                            .str.lower()
                            .str.contains("check")
                            .sum()
                            .compute()), '\n')

Spam Promotion:  461 
 Legit Promotion:  19 



There are numerous options on the market for scaling a project. Depending on the size and requirements of the project, each has its advantages and workflows. On the smaller scale (0 - 16 GB), Pandas and a local machine can be utilized. Pandas creates dataframes with rather large memory footprints. A function like [this](https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65) is incredibly useful for scaling down the size of a dataframe so that it still fits into memory. Object columns can be categorically encoded to further these memory reductions. Once the data comfortably fits in memory, the next challenge is compute time. Oftentimes, a perfectly efficient 5 GB dataframe consists of **billions** of rows, which can take a long time to process. To optimize computation time, the Numba library is incredibly effective. It essentially compiles a function to machine code, which runs far more efficiently. It also allows for parallel processing, spreading a particular computation across more CPU cores.
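
As a rough illustration of the memory reductions described above, here is a minimal sketch. It is not the linked Kaggle function: `shrink` and its `max_unique_ratio` parameter are hypothetical names made up for this example, and the sample data is invented. Note that downcasting floats to `float32` can lose precision, so it is worth checking whether that is acceptable for your data.

```python
import numpy as np
import pandas as pd


def shrink(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df using smaller dtypes where the values allow it.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        kind = out[col].dtype.kind
        if kind == "i":
            # Pick the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif kind == "f":
            # float64 -> float32 when pandas judges the values close enough.
            out[col] = pd.to_numeric(out[col], downcast="float")
        elif kind == "O" and len(out) > 0:
            # Encode text-like columns as categories only when values repeat a lot.
            if out[col].nunique() / len(out) <= max_unique_ratio:
                out[col] = out[col].astype("category")
    return out


# Made-up sample data: one integer, one float, and one low-cardinality text column.
raw = pd.DataFrame({
    "CLASS": np.tile([0, 1], 5_000),
    "SCORE": np.linspace(0.0, 1.0, 10_000),
    "AUTHOR": np.tile(["ann", "bob", "cy", "dee"], 2_500),
})

small = shrink(raw)
before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Categorical encoding only pays off when a column has few distinct values relative to its length, which is why the sketch guards it with a ratio check.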

This approach has a couple of downsides, however. First, Numba does not work on Pandas datatypes, so the programmer must convert DataFrames and Series into NumPy arrays to get the benefit they were looking for. That changes the workflow and will not feel natural to a lot of programmers. Second, it doesn't help in situations where bodies of text are present. Those Pandas object columns carry a much higher memory footprint and do not benefit from datatype optimization. The next step in scaling a data science project is utilizing Dask.

Dask is a library that operates similarly to Pandas on the surface, but is designed for data that does not comfortably fit in local memory. Instead of reading a .csv file into a dataframe right away, it relies on lazy computation. Dask makes a note of whatever transformations the programmer wants to apply, and then waits for a `compute` command to execute them. Only when it computes an action does it read the file(s) named in the `read_csv` call. That way, the full dataframe stays out of memory until it is needed. This kind of computation also allows for parallelization, but that requires a specific workflow as well. These two methods allow larger dataframes to be processed locally. However, when you need a remote solution or even more power, that's where the cloud comes in.

At Lambda School we use AWS, but other companies offer their own services (Microsoft Azure, Google, etc.). AWS allows the user to pick any number of instances with specific hardware depending on their needs. This includes trendy things like GPU acceleration. AWS SageMaker is a fantastic service for running Jupyter projects on scalable hardware. For example, I could start an instance that provides close to a TB of RAM, which gets me around almost any memory scaling issue. The downside is that it costs money per usage. For enterprises with mid-size data challenges, this could very well be the best choice.

Finally, we come to Spark. Spark is designed for **BIG** data. Enterprises should not be utilizing Spark unless they have outgrown the appropriate AWS instances. Spark is a very versatile tool (we use the Databricks service) that can interpret several languages (Scala, Python, SQL, etc.). However, Scala is the preferred language, as it operates very quickly over the distributed computing model associated with Spark and Databricks. In other words, Scala and Spark store the data across a cluster of computers, operate via lazy computation, and can handle big problems very quickly. The downside is that Spark does not handle small problems efficiently, so it should only be used for **BIG** data and computations.