# Big Data and Cloud

# Objectives

- Understand what "big data" is
- Know why big data is different and how it can be processed
- Understand how data can be handled in distributed and parallel systems
- Understand how MapReduce is run
- Explain the general concept of "the cloud"
- Understand the cases where ***hardware acceleration*** is useful
- Understand the cases where ***cloud storage*** and the **Boto3** library in particular are useful

# What is Big Data?

> A different amount makes a different kind

There is no clear, agreed-upon definition, but typically we say we're working with **big data** if we have to use something like a distributed computing system (not just one local machine).

Interactive numbers: https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

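**Thrashing**: when a machine runs out of fast memory and spends most of its time shuffling data between memory and (much slower) disk, so the CPU sits bored, waiting on its slowpoke friends instead of doing useful work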

## The 3 Vs of Data

+ Volume --> Large Amounts
+ Velocity --> Quickly Generated
+ Variety --> Unstructured 

![](images/3vs.png)

Data is *big* when it is better/faster to split the work over the network amongst more machines (in parallel) because of one or more of these Vs.

# Applying it via Tools

## Hadoop Framework

![](images/hadoop_logo.png)

> Considered "old-school"
>
> Slower since it has to write to disk each time

- Storage (usually HDFS) 
- Data Processing (MapReduce)
- Resource Management

## Apache Spark

![](images/apache_spark_logo.png)

> Holds data in memory whenever possible (faster)
>
> Can still run on top of Hadoop (HDFS), but can also work with other storage such as S3 on AWS

Spark has become king of data since it does a good job with ETL (Extract-Transform-Load) & ML in distributed systems

##### _Aside: More Detail on Spark_

**Some Resources**

>[Here](https://towardsdatascience.com/a-neanderthals-guide-to-apache-spark-in-python-9ef1f156d427) is a great walkthrough of Spark basics!
>
> And [here](https://towardsdatascience.com/apache-spark-a-conceptual-orientation-e326f8c57a64)'s another from our very own alum, Alex Shropshire!
>
> Spark has APIs for Scala (this is ur-Spark), Java, Python, and R.

N.B. Unless otherwise marked, page references are to [Salloum, Dautov, et al., "Big Data Analytics on Apache Spark", 2016](https://link.springer.com/content/pdf/10.1007%2Fs41060-016-0027-9.pdf).

Spark is a tool for the management of big data. Sometimes data science professionals will refer to the [five "V"s](https://www.bbva.com/en/five-vs-big-data/) of big data. Clearly, the availability and size of datasets are growing rapidly. What counts as "big data"? Roughly speaking, we're talking about datasets that are too large to be processed on a single machine.

Many large companies are relying on big data these days, and Spark is a major player in the big data game. Examples can be found [here](https://www.icas.com/news/10-companies-using-big-data) and [here](https://enlyft.com/tech/products/apache-spark) and [here](https://www.quora.com/Which-are-the-companies-that-use-apache-spark).

So ... how in the world *do* you process a dataset that's too large for a single machine? You use multiple machines linked together! Let's call each machine a *node*, and the group of all machines working in parallel a *cluster*.

The origin story of Spark starts with [MapReduce](https://en.wikipedia.org/wiki/MapReduce), whose programs comprise (unsurprisingly) a "map" routine (for filtering and sorting) and a "reduce" routine (for performing some aggregate operation).

Let's look at an [example](https://en.wikipedia.org/wiki/MapReduce#Logical_view):
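To make the map and reduce routines concrete, here is a minimal single-machine sketch of a word count in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`) and the sample documents are hypothetical, and the "shuffle" step that groups values by key is written out explicitly even though a real framework handles it for you.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }


def word_count(documents):
    # On a real cluster, each document (or block) would be mapped on a
    # different node; here we simply loop over them on one machine.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample data
documents = ["the cat sat", "the dog sat", "the cat ran"]
counts = word_count(documents)
print(counts)
```

Because each document is mapped independently, the map step is the part that can be spread across many nodes; only the grouped results need to be brought together for the reduce step.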

An early major player in big data that used MapReduce was [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop). Hadoop was (and still is) a framework for distributed data processing. Its processing component used MapReduce, but it also had a storage component called the "Hadoop Distributed File System".

From Wikipedia: "Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking".

But Spark appeared as open source in 2010, and had some advantages over Hadoop MapReduce.

Spark's advances over Hadoop MapReduce:

- data processing in memory rather than on disks
- a single framework for machine learning, graph analysis, and processing of streaming data (pp. 159-160)

For more on the advantages of Spark over MapReduce, see [this piece](https://research.ijcaonline.org/volume113/number1/pxc3900531.pdf).

Distributed computing can help enormously with speed. Check out [this website](http://sortbenchmark.org) for the latest in speed records.

"As a framework, it combines a core engine for distributed computing with an advanced programming model for in-memory processing. Although it has the same linear scalability and fault tolerance capabilities as those of MapReduce, it comes with a multistage in-memory programming model comparing to the rigid map-then-reduce disk-based model" (146).

Illustration, p. 148, of Spark guts.

"Running a Spark application involves five key entities ... a driver program, a cluster manager, workers, executors and tasks. A driver program is an application that uses Spark as a library and defines a high-level control flow of the target computation. While a worker provides CPU, memory and storage resources to a Spark application, an executer \[sic\] is a JVM (Java Virtual Machine) process that Spark creates on each worker for that application. A job is a set of computations (e.g., a data processing algorithm) that Spark performs on a cluster to get results to the driver program. A Spark application can launch multiple jobs. Spark splits a job into a directed acyclic graph (DAG) of stages where each stage is a collection of tasks. A task is the smallest unit of work that Spark sends to an executor. The main entry point for Spark functionalities is a SparkContext through which the driver program access \[sic\] Spark. A SparkContext represents a connection to a computing cluster" (149).

RDDs, Transformations, and Actions:

Fault tolerance is achieved by keeping a record of each RDD's lineage, i.e. the sequence of transformations that produced it. In the event of node failure, the lost partitions can be recomputed from that lineage on other nodes, so the data can be recovered. This is what makes these RDDs *resilient*.

- Transformations take one from an RDD to another RDD;
- Actions take one from an RDD to an output value.
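The difference between transformations and actions can be illustrated with a toy class in plain Python. This is only an analogy, assuming a hypothetical `ToyRDD` class; it is not Spark's implementation or API. Transformations just append a step to the recorded lineage and return a new object, while actions replay the lineage and return an ordinary Python value.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD that records its lineage."""

    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage  # tuple of ("map" | "filter", function) steps

    # --- Transformations: RDD -> RDD, no work is done yet ---
    def map(self, fn):
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded lineage over the source data, lazily.
        items = iter(self._data)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # --- Actions: RDD -> output value, this triggers the computation ---
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


numbers = ToyRDD(range(1, 11))

# Two chained transformations: nothing has been computed so far.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Actions replay the lineage and produce plain Python values.
result = evens_squared.collect()
total = evens_squared.count()
print(result, total)
```

Note that each action recomputes the result from the source data by replaying the lineage, which is the same idea that lets lost data be rebuilt after a failure.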

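Broadcast variables and accumulators act as shared ("global") variables across the cluster: broadcast variables are read-only values made available on every node, while accumulators can only be added to and are used for counters or sums.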

Surveys of Big Data tools [here](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-015-0032-1) and [here](https://ieeexplore.ieee.org/document/7300948).

Debugging can be a challenge in Spark. [This project](https://sites.google.com/site/sparkbigdebug/) was started to help with that.

Also check out Paco Nathan's [massive slide show presentation](http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf) on Spark. Let's just look at slides 66-7 and 82.

###### Aside: The story of Spark (a timeline)

|<p align="left justify">Date</p>|<p align="left justify">Product</p>|<p align="left justify">Update</p>|
|:----|:-----|:-----|
| 2002 | Hadoop | <p align="left justify">Doug Cutting starts `Apache Nutch` researching sort/merge processing</p> |
| 2006 | Hadoop |  <p align="left justify">Leaves `Nutch` and joins `Yahoo`, renaming the project `Hadoop` </p>|
| 2008 | Hadoop |  <p align="left justify">`Hadoop` was made `Apache’s` top level project </p> |
| Jan 2008 | Hadoop |  <p align="left justify">v 0.10.1 released </p>|
| 2009 | Spark | <p align="left justify">started as a research project at the UC Berkeley AMPLab  </p>|
| 2010 | Spark |  <p align="left justify">open sourced </p>|
| Sept 2012 | Spark |  <p align="left justify">0.6.0 released </p>|
| 2013 | Spark |  <p align="left justify">moved to the `Apache` Software Foundation </p>|
| Feb 2013| Spark |  <p align="left justify">Spark 0.7 adds a Python API called `PySpark` </p>|
| Sept 2013 | Spark | <p align="left justify">0.8.0 introduces `MLlib` </p>|
| 2013 | Databricks |  <p align="left justify">Original Spark research team at UC Berkeley found Databricks</p> |
| May 2014 |Spark |  <p align="left justify">v 1.0 introduces Spark SQL, for loading and manipulating structured data in Spark</p>|
| Sept 2014 | Spark|  <p align="left justify">v 1.1.0 provided support for registering Python lambda functions as UDFs</p>|
|Mar 2015 | Spark | <p align="left justify"> v 1.3.0 brings a new DataFrame API</p> |
| Jun 2015 | Spark | <p align="left justify"> v 1.4.0 brings an R API to Spark</p> |
| 2015 | Databricks | <p align="left justify"> The Databricks Apache Spark cloud platform goes public</p> |
| Jan 2016|  Spark | <p align="left justify"> v 1.6.0 brings a new Dataset API <br> - A new Spark API, similar to RDDs, that allows users to work with custom objects and lambda functions while still gaining the benefits of the Spark SQL execution engine.</p> |
| Jul 2016 | Spark | <p align="left justify"> v 2.0.0 **big update**! <Br> - Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface. <br> - SparkSession: new entry point that replaces the old SQLContext<br>- Native CSV data source, based on Databricks’ spark-csv module<br>- MLlib - The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode </p> |
| 2016 | Databricks | <p align="left justify"> Databricks Launches Free Community Edition As Companion To Free Online Spark Courses </p>|
| Jul 2017| Spark | <p align="left justify"> v 2.2.0 drops support for Python 2.6 </p>|
| Nov 2018 | Spark | <p align="left justify"> v 2.4.0<br> - This release adds Barrier Execution Mode for better integration with deep learning frameworks<br> - more integration between pandas UDF and spark DataFrames </p>|
| June 2020 | Spark | <p align="left justify"> v 3.0<br> - This release adds adaptive query execution <br> - ANSI SQL compliance <br> - pandas API improvements </p>|
| March 2021 | Spark | <p align="left justify"> v 3.1.2<br> - Stability fixes </p>|
| October 2021 | Spark | <p align="left justify"> v 3.2.0<br> - Added support for pandas API </p>|
| June 2022 | Spark | <p align="left justify"> v 3.3.0<br> - Increased join query performance<br> - Extends pandas API coverage </p>|

### Spark Data Objects

***In PySpark there are only RDDs and DataFrames***

In compiled languages (Scala and Java), there is a further distinction between DataFrames and Datasets.

![dataframe image](https://databricks.com/wp-content/uploads/2018/05/DataFrames.png)

#### Use an RDD when:

[quoted from databricks](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

> - you want low-level transformation and actions and control on your dataset;
> - your data is unstructured, such as media streams or streams of text;
> - you want to manipulate your data with functional programming constructs than domain specific expressions;
> - you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column

#### Use a dataframe when:

[also quoted from databricks](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html)

> - you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame
> - your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame
> - you want higher degree of type-safety at compile time, want typed JVM objects, take advantage of Catalyst optimization, and benefit from Tungsten’s efficient code generation, use Dataset.
> - you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
> - If you are a R user, use DataFrames.
> - If you are a Python user, use DataFrames and resort back to RDDs if you need more control.

**Note**: Machine learning algorithms are run on _DataFrames_

## But Spark Isn't Always the Best Tool!

![](images/tech_stack.png)

# What Do We Mean by "Parallel" & "Distributed"?

## Distributed

![](images/types_of_network.png)

> tasks split up and executed by different workers

+ Multiple CPUs each have their own memory
+ Multiple CPUs share via a network (using "messages")

## Sequential

![](images/sequential.png)

> Take a step at a time

## Parallel

![](images/parallel.png)

> Execute multiple tasks at the same time rather than one after another

+ Multiple CPUs share the same memory to "communicate"
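As a rough illustration of sequential versus parallel execution on one machine, the sketch below runs the same hypothetical task (`slow_square`, which just sleeps to simulate waiting) first one step at a time and then with a pool of threads from Python's standard library. Threads share the same memory, which matches the picture above, but note that in standard CPython threads mainly help with tasks that spend their time waiting; CPU-heavy work usually needs separate processes or machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Hypothetical task: wait on something slow, then compute a result."""
    time.sleep(0.1)  # stands in for a slow disk read or network call
    return x * x


inputs = [1, 2, 3, 4]

# Sequential: take one step at a time.
start = time.perf_counter()
sequential_results = [slow_square(x) for x in inputs]
sequential_seconds = time.perf_counter() - start

# Parallel: four workers handle the tasks at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_results = list(pool.map(slow_square, inputs))  # keeps input order
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both approaches produce the same results; the parallel version simply overlaps the waiting.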

# Aside: MapReduce

Describes two jobs: **Map** & **Reduce**

Software best for **clusters**

![](images/MapReduceZooExample.drawio.png)

## Steps in MapReduce

![](images/mapreduce_visual.jpg)

# Cloud Services

<img src="https://nerds.net/wp-content/uploads/2018/02/cloud-computer-reality-750x646.jpg" width="400px">

## What Is "The Cloud"?

For ***computationally intensive*** or ***long-running*** tasks, it doesn't make much sense to use a personal computer. Personal computers are not particularly powerful, and you also might want to turn them off or use their computing power to do other things.

<a title="GNOME Project, CC BY-SA 3.0 &lt;https://creativecommons.org/licenses/by-sa/3.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Gnome-computer.svg"><img width="240" alt="Gnome-computer" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Gnome-computer.svg/240px-Gnome-computer.svg.png"></a>

Before the cloud, organizations would typically have ***on-premises dedicated hardware*** for these tasks. This meant that they also needed IT systems administrators to manage the physical hardware.

<a title="Akramusns, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Unique_Server_Racks.png"><img width="512" alt="Unique Server Racks" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/db/Unique_Server_Racks.png/512px-Unique_Server_Racks.png"></a>

<small><i>Server racks used for web hosting (2014).</i></small>

With cloud computing, ***hardware details are abstracted away*** and you can get on-demand computing power. The code is still running on a server maintained by the cloud provider, but you don't need an IT systems administrator to coordinate how the servers will be used.

<a title="百楽兎, CC BY-SA 3.0 &lt;https://creativecommons.org/licenses/by-sa/3.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Cloud_computing_icon.svg"><img width="320" alt="Cloud computing icon" src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/12/Cloud_computing_icon.svg/320px-Cloud_computing_icon.svg.png"></a>

## Cloud Providers

Below is a graph showing some of the top cloud providers used by enterprise customers today:

![public cloud adoption for enterprises](https://www.flexera.com/blog/wp-content/uploads/2021/03/Picture9.png)

### Choosing a Cloud Provider

***On the job*** there will likely already be a preferred cloud provider that your employer uses, so you won't need to make a decision here. But ***as a student*** here are some things to consider:

#### Big-Name Providers

Consider choosing to use one of the most popular providers, because this may help you in the job search.

<img src="https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png" width=350 alt="aws logo"/>

**AWS** (Amazon Web Services) is currently the most popular cloud provider. In a previous Flatiron School analysis of the job market, we found that about 6% of entry-level Data Scientist roles specifically mentioned AWS as a required skill. AWS was the first true "cloud services" provider -- launching Simple Storage Service (S3) and Elastic Compute Cloud (EC2) in 2006, and is still very popular in part because they were the first of their kind. Check out this [Introduction to AWS for Data Scientists](https://www.dataquest.io/blog/introduction-to-aws-for-data-scientists/) for more information on navigating all of the available services.

<a title="Microsoft Corporation, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Microsoft_Azure_Logo.svg"><img width="320" alt="Microsoft Azure Logo" src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a8/Microsoft_Azure_Logo.svg/320px-Microsoft_Azure_Logo.svg.png"></a>

**Microsoft Azure** (also just referred to as Azure) is the second most popular cloud provider. Our analysis found that about 2% of entry-level Data Scientist roles specifically mentioned Azure as a required skill. Azure launched later than AWS, and has very good compatibility with Windows tools and software.

<img src="https://www.vectorlogo.zone/logos/google_cloud/google_cloud-ar21.png" width=350 alt="google cloud logo" />

**Google Cloud** is the third most popular cloud provider. It did not appear in our entry-level Data Scientist role analysis as a requirement. It also launched later than AWS, and is compatible with other Google products.

#### Cost

In general, more expensive services will perform better than cheaper services:

<img src="https://miro.medium.com/max/630/1*Ao2QhhVEBEr2mJXL9jsuWQ.png" alt="cost per hour vs. time to train"/>
<small><i>From <a href="https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a" target="_blank">Best Deals in Deep Learning Cloud Providers</a></i></small>

As a student, you have no obligation to pay anything for cloud services! We are just letting you know that they exist, and what they can do for you.

Some cloud services offer a 100% **free version** where you do not need to enter a credit card. These include:

* [Google Colab](https://research.google.com/colaboratory/)
* [Kaggle Kernels](https://www.kaggle.com/code)
* [Databricks Community Edition](https://databricks.com/product/faq/community-edition)
* [MongoDB Atlas](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/)
* [Heroku](https://www.heroku.com/)

Other services will offer **free credits**. This includes AWS, Azure, Google Cloud, and others. They usually offer a default amount of free credit but will occasionally have special promotions for additional credit.

To use free credits, you will typically need to enter credit card information, so make sure you pay attention to your free credit balance so you don't spend money that you don't intend to spend!
  

## Why Would a Data Scientist Use Cloud Services?

The two main reasons a data scientist would use cloud services are to ***get more computing power*** and to ***deploy machine learning models***.

## More Computing Power

Particularly with large datasets and tools like grid search that fit many different model iterations, training a model can take a **long time** on a personal computer. Maybe you have already had the experience of running a model and having to step away from the computer for minutes or even hours as the fan spins and the computer works hard to perform all of the necessary computations.

### Hardware Acceleration

<a title="Nick Stathas, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Under_the_GPU.jpg"><img width="256" alt="Under the GPU" src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/27/Under_the_GPU.jpg/256px-Under_the_GPU.jpg"></a>

<small><i>A zoomed-in photo of the capacitors inside of a GPU.</i></small>

As much as software libraries like NumPy or Spark can improve the efficiency of code, there is a limit to how much of a difference they can make, depending on the actual hardware of your computer.

As a general concept, [hardware acceleration](https://www.omnisci.com/learn/resources/technical-glossary/hardware-acceleration) means using purpose-built hardware rather than general-purpose hardware.

In the case of machine learning, this typically means running your code on a **GPU**, rather than a CPU.  A CPU _can_ do everything that a GPU can do, but it is not optimized for it, so it will likely take more time.  [This blog](https://towardsdatascience.com/maximize-your-gpu-dollars-a9133f4e546a) argues that a CPU is to a GPU as a horse and buggy is to a car.

One approach you might take would be to purchase a more powerful computer, with a GPU, more RAM, etc. and just use it for training models. But that can easily get very expensive!

With a cloud service, you can train a machine learning model using GPU hardware, so the training should complete much more quickly than on a typical personal computer. And unlike having a dedicated computer, you're only paying for the computing power when you need it!

### Cloud Instances/Containers with GPUs

<a title="Unknown authorUnknown author, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:SSH_diagram.png"><img width="512" alt="SSH diagram" src="https://upload.wikimedia.org/wikipedia/commons/c/c8/SSH_diagram.png"></a>

<small><i>SSH diagram</i></small>

A cloud instance is a customized, fully-fledged computer that you run in the cloud. Instances often give you the most fine-grained control but can also be much more expensive, because they are not designed specifically for machine learning. Typically you will need to connect to a cloud instance via SSH, and you'll need to be comfortable navigating in a terminal interface.

AWS Elastic Compute Cloud (EC2) is probably the most well-known cloud instance service, and our analysis found that it was mentioned in about 2% of entry-level Data Scientist job postings.

Here are some cloud container options with GPUs:

 - [AWS EC2](https://aws.amazon.com/blogs/machine-learning/train-deep-learning-models-on-gpus-using-amazon-ec2-spot-instances/)
 - [Google Cloud Platform](https://cloud.google.com/ml-engine/docs/using-gpus)
 - [IBM Watson Studio](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml_dlaas_gpus.html)
 - [Azure VM](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-gpu)
 - [Oracle Cloud](https://www.oracle.com/cloud/compute/gpu/)

### Cloud Notebooks

<img src="https://d2908q01vomqb2.cloudfront.net/da4b9237bacccdf19c0760cab7aec4a8359010b0/2019/12/01/Rhinestone-SageMaker-Studio-Page-2-v2.png" width=500 alt="amazon sagemaker studio"/>

<small><i>From <b>Amazon SageMaker Studio: The First Fully Integrated Development Environment For Machine Learning</b> on the <a href="https://aws.amazon.com/blogs/aws/amazon-sagemaker-studio-the-first-fully-integrated-development-environment-for-machine-learning/" target="blank_">AWS News Blog</a></i></small>

Compared to virtual machines, cloud notebooks tend to be easier to work with because they allow you to use the familiar notebook interface. Some of them even have free GPUs or TPUs! Even without hardware acceleration, cloud notebooks can allow you to train models in the cloud and free up resources on your personal computer to do other tasks.

Here are some cloud notebooks to consider:

 - [AWS Sagemaker](https://aws.amazon.com/machine-learning/accelerate-machine-learning-P3/)
 - [Databricks Community Edition](https://community.cloud.databricks.com/)
 - [Google Colab](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c)
 - [Kaggle kernels](https://www.kaggle.com/dansbecker/running-kaggle-kernels-with-a-gpu)
 - [data.world](https://jupyter.data.world/)
 
Because there is a GPU available in the free tier, Google Colab is the most popular of these tools for our students.

### Cloud Storage

<a title="Kottakkalnet, CC BY-SA 4.0 &lt;https://creativecommons.org/licenses/by-sa/4.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Plastic_bucket_IMG_20160701_161628956.jpg"><img width="512" alt="Plastic bucket IMG 20160701 161628956" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Plastic_bucket_IMG_20160701_161628956.jpg/512px-Plastic_bucket_IMG_20160701_161628956.jpg"></a>

It's annoying to have huge data files taking up space on your laptop, and if you want to train your model in the cloud, your data also needs to be in the cloud.  But for reasons related to hardware acceleration, it can get pretty expensive to store large datasets in general-purpose cloud services like an EC2 instance or a cloud VM.  That's when cloud storage services become useful.

#### Cloud Storage Buckets

The major providers of storage "buckets" are:

 - [AWS S3](https://aws.amazon.com/s3/getting-started/)
 - [Google Cloud Storage](https://cloud.google.com/storage/)
 - [Azure Storage](https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction)

These tools are designed for uploads of raw files, e.g. folders full of images, CSVs, or JSONs.

They each cost about 2-5 cents per GB per month.  AWS S3 is the oldest and tends to have the most integration support with other platforms, although you may need to use Google storage if you're using other Google products or Azure storage if you're using other Azure products.
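To get a feel for what that price range means in practice, here is a back-of-the-envelope calculation. The 50 GB dataset size is a made-up example, and the rates are simply the 2-5 cent range quoted above; actual pricing varies by provider, region, and storage class.

```python
def monthly_storage_cost(size_gb, dollars_per_gb_month):
    """Estimate the monthly bill for storing `size_gb` gigabytes."""
    return size_gb * dollars_per_gb_month


dataset_gb = 50  # hypothetical dataset size

low_estimate = monthly_storage_cost(dataset_gb, 0.02)   # 2 cents per GB
high_estimate = monthly_storage_cost(dataset_gb, 0.05)  # 5 cents per GB

print(f"${low_estimate:.2f} to ${high_estimate:.2f} per month")
```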

**Boto3** is the Python library used to connect to S3, and there is a demonstration of how to use it in the "Level Up" portion of this notebook.
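As a small preview, the sketch below wraps the two Boto3 S3 client calls you are most likely to need, `upload_file` and `download_file`. The helper names, bucket name, and file paths are hypothetical, and the helpers take the client as an argument so that the block itself does not require AWS credentials to define.

```python
def upload_dataset(s3_client, local_path, bucket, key):
    """Upload a local file to the given bucket and return its S3 URI."""
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def download_dataset(s3_client, bucket, key, local_path):
    """Download an object from the given bucket to a local path."""
    s3_client.download_file(bucket, key, local_path)
    return local_path


# Usage (requires `pip install boto3` and configured AWS credentials;
# the bucket and paths below are placeholders, not real resources):
#
# import boto3
# s3 = boto3.client("s3")
# upload_dataset(s3, "data/train.csv", "my-example-bucket", "datasets/train.csv")
# download_dataset(s3, "my-example-bucket", "datasets/train.csv", "train_copy.csv")
```

Note the argument order: `upload_file` takes the local filename first, while `download_file` takes the bucket and key first.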

#### Cloud Databases

If you want to deploy a website where new information gets saved (what kinds of queries users perform, user ratings of the quality of predictions, etc.) then you need a cloud database.  These work roughly the same as a database running on your computer.

Using a cloud database is mainly an opportunity to practice with tooling that you are likely to use on the job, where shared databases assist with collaboration.

Some popular providers are:

 - [Heroku Postgres](https://www.heroku.com/postgres)
 - [MongoDB Atlas](https://www.mongodb.com/cloud/atlas)
 - [AWS Aurora](https://aws.amazon.com/rds/aurora/)
 - [AWS RDS](https://aws.amazon.com/rds/)

Most of these tools have a free tier, which permits a limited number of records to be stored.

## Deploying ML Models

Typically in this program we have used Jupyter Notebooks to build, train, and evaluate models. Jupyter Notebook is a very useful interface, but a predictive model that only exists in the context of a notebook is not particularly useful in the real world!

Some key tools and techniques to be aware of for deploying ML models include model persistence (pickling), deploying as an API, and deploying as a full-stack web app.

### Model Persistence

<a title="Renee Comet (photographer), Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Pickle.jpg"><img width="256" alt="Pickle" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/bb/Pickle.jpg/256px-Pickle.jpg"></a>

Recall that there is a difference between a file ***on disk*** and a variable ***in memory***. Variables in memory only persist until the notebook kernel is shut down, whereas files on disk persist as long as there is functional storage hardware.

When you first train a model, it only exists in memory. The process for storing it on disk is called ***pickling***. This is a type of serialization where the trained model gets stored in a file, conventionally using a `.pkl` extension. There is an example in the "Level Up" section of this notebook demonstrating the fitting, pickling, and un-pickling of a model.
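Here is a minimal sketch of that round trip using Python's built-in `pickle` module. To keep it self-contained, the "trained model" is just a hypothetical dictionary of learned parameters for a tiny linear model, and `predict` is a made-up helper; with a real scikit-learn model the dump and load steps look the same.

```python
import os
import pickle
import tempfile

# Hypothetical "trained model": learned parameters of a tiny linear model.
model = {"coefficients": [0.5, -1.2], "intercept": 3.0}


def predict(model, features):
    """Apply the linear model to one observation."""
    weighted = sum(c * x for c, x in zip(model["coefficients"], features))
    return model["intercept"] + weighted


# Write to a temporary directory so the example does not clutter your project.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Pickle: serialize the in-memory object to a file on disk ("wb" = write bytes).
with open(path, "wb") as f:
    pickle.dump(model, f)

# Un-pickle: read the file back into a new in-memory object ("rb" = read bytes).
with open(path, "rb") as f:
    restored = pickle.load(f)

print(predict(restored, [2.0, 1.0]))
```

Only un-pickle files from sources you trust, since loading a pickle can execute arbitrary code.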

### Deploying a Model as an API

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/request_response_cycle.png" width=600 alt="client-server model and request-response cycle"/>

<small><i>Icons made by <a href="https://www.flaticon.com/authors/freepik" target="blank_">Freepik</a> from www.flaticon.com</i></small>

Once you have a pickled model, in theory anyone who can execute Python code can then un-pickle the model and use it to make predictions.

However, what if you want someone to be able to use your model even if they aren't running Python code? For example, what if the model is being used in the context of a website or a mobile app?

In that case, developing an HTTP ***API*** can be useful because this protocol is compatible with many different programming languages.

An example interaction with a model deployed as an API would be:

1. Client request: `POST /predict '{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}'`
2. Server response: `'{"predicted_class": 0}'`

Just like there are multiple tools that can make a `POST` request (Python `requests` library, `cURL` in the terminal, JavaScript `fetch` in a web application), there are multiple ways you can build your server.
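On the client side, the request from the example interaction could be built with nothing but the Python standard library, as sketched below. The base URL `http://localhost:5000` and the helper name `build_prediction_request` are assumptions for illustration; the request is only constructed here, and the lines that would actually send it are commented out because they need a running server.

```python
import json
import urllib.request


def build_prediction_request(base_url, features):
    """Build (but do not send) a JSON POST request to the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


features = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}

# Hypothetical server address; replace with wherever your API is deployed.
req = build_prediction_request("http://localhost:5000", features)
print(req.get_method(), req.full_url)

# To send the request once a server is running:
#
# with urllib.request.urlopen(req) as response:
#     print(json.loads(response.read()))
```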

#### Deploying a Model as an API with Cloud Functions

<a title="Wvbailey, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Function_machine2.svg"><img width="243" alt="Function machine2" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Function_machine2.svg/243px-Function_machine2.svg.png"></a>

The most minimal way to deploy a machine learning model is to create a ***cloud function***. This means that you create a pickled model, write a few lines of Python code specifying how to un-pickle the model and use it to make predictions, and deploy it with a cloud service designed for this purpose.

[This curriculum lesson](https://github.com/learn-co-curriculum/dsc-pickling-pipelines) walks through the process of developing and deploying a Google Cloud function, including the process of pickling a full pipeline. The complete Google Cloud function from that lesson looks like this:

```python
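import json
import joblib

def iris_prediction(sepal_length, sepal_width, petal_length, petal_width):
    # Load the pickled model from disk
    with open("model.pkl", "rb") as f:
        model = joblib.load(f)
    # The model expects a 2D structure: one row per observation
    X = [[sepal_length, sepal_width, petal_length, petal_width]]
    predictions = model.predict(X)
    # Convert the predicted label to a plain int so it can be serialized as JSON
    prediction = int(predictions[0])
    return {"predicted_class": prediction}

def predict(request):
    # Entry point called by the cloud service with the incoming HTTP request
    request_json = request.get_json()
    # Unpack the JSON keys as keyword arguments
    result = iris_prediction(**request_json)
    return json.dumps(result)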
```

While they involve writing the least amount of code of any model deployment option, cloud functions can be tricky to configure within the cloud service. Looking at the code above you might notice that the `predict` function is never actually invoked in the code -- when you configure the cloud function, you have to specify that this function should be called. You will also need to configure the cloud function so that it can accept public web requests, and typically you won't be able to test anything on your local computer, so this can be a slow back-and-forth of tweaking the configuration until it works.

We found that the Google Cloud functions were the easiest to work with, but you also might want to check out [AWS Lambda Functions](https://docs.aws.amazon.com/lambda/latest/dg/python-programming-model.html) and [Azure Functions](https://docs.microsoft.com/en-us/azure/azure-functions/functions-create-first-function-python).

#### Deploying a Model as an API with Flask

<a title="Armin Ronacher, Copyrighted free use, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:Flask_logo.svg"><img width="256" alt="Flask logo" src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3c/Flask_logo.svg/256px-Flask_logo.svg.png"></a>

***Flask*** is a [microframework](https://flask.palletsprojects.com/en/2.0.x/foreword/#what-does-micro-mean) for web development with Python. It can be used for full-stack web applications, but it's also very popular for deploying machine learning models as APIs. In fact, the Google Cloud function Python tooling that is used in the curriculum lesson linked above uses Flask "under the hood"!

Here is that same cloud function, rewritten as a Flask app:

```python
from flask import Flask, request
import joblib
import json

app = Flask(__name__)

def iris_prediction(sepal_length, sepal_width, petal_length, petal_width):
    with open("model.pkl", "rb") as f:
        model = joblib.load(f)
    X = [[sepal_length, sepal_width, petal_length, petal_width]]
    predictions = model.predict(X)
    prediction = int(predictions[0])
    return {"predicted_class": prediction}

@app.route('/', methods=['GET'])
def index():
    return 'Hello, world!'

@app.route('/predict', methods=['POST'])
def predict():
    request_json = request.get_json()
    result = iris_prediction(**request_json)
    return json.dumps(result)
```

It's definitely a bit longer than the cloud function version, but it has two benefits:

1. You can actually run the server on your local computer, which makes **debugging** much faster and easier.
2. Rather than using Google Cloud functions (a paid service, although there are free credits when you sign up), you can use **Heroku** to deploy your API, which is **100% free** for up to 5 apps. (You can also deploy using an EC2 instance or other cloud container if your model is too large/slow for Heroku.)
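
The first benefit can be sketched with Flask's built-in test client, which sends a request to the app without deploying it or even starting a server. This is a sketch under assumptions: `iris_prediction` is replaced by a hypothetical hard-coded rule so that no `model.pkl` file is needed, and the feature values are made up.

```python
import json

from flask import Flask, request

app = Flask(__name__)


def iris_prediction(sepal_length, sepal_width, petal_length, petal_width):
    # Hypothetical stand-in for the pickled model: a single hard-coded rule
    return {"predicted_class": 0 if petal_length < 2.5 else 1}


@app.route("/predict", methods=["POST"])
def predict():
    request_json = request.get_json()
    result = iris_prediction(**request_json)
    return json.dumps(result)


# The test client sends the request in-process, so no server needs to be running
client = app.test_client()
response = client.post("/predict", json={
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
})

# The view returns a JSON string, so parse the response body explicitly
parsed = json.loads(response.get_data(as_text=True))
print(response.status_code, parsed)
```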

Check out these curriculum lessons for more info: [Introduction to Flask](https://github.com/learn-co-curriculum/dsc-flask-intro), [Deploying a Model with Flask](https://github.com/learn-co-curriculum/dsc-flask-deployment).

Besides Heroku and AWS EC2, you can also host Flask apps with [Azure App Service](https://docs.microsoft.com/en-us/learn/modules/host-a-web-app-with-azure-app-service/index), [DigitalOcean](https://www.digitalocean.com/community/tutorials/how-to-deploy-a-flask-application-on-an-ubuntu-vps), [Google Cloud App Engine](https://cloud.google.com/python/getting-started/hello-world), and [AWS Elastic Beanstalk](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/Welcome.html).

### Deploying a Model as a Full-Stack Web App

An API interface is very useful, but you also might want something more interactive and impressive for your data science portfolio. Developing a ***full-stack web app*** means that there is a web page interface, so once your app is deployed, someone can load it directly in the browser and generate model predictions by clicking on the page.

#### Deploying a Model as a Full-Stack Web App with Flask

<img src="images/model_view_controller.png" width=600 alt="model view controller diagram" />
<small><i>Icons made by <a href="https://www.flaticon.com/authors/freepik" target="_blank">Freepik</a> from www.flaticon.com</i></small>

The same microframework described for deploying a model as an API can also be used to make a full-stack web app. This requires that you write HTML and CSS (and optionally JavaScript) as well as Python in order to create the "view" component of the model-view-controller (MVC) framework.
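
As a rough illustration of the "view" piece, the sketch below serves a small HTML form and shows the prediction on the same page. It is not the template repository's code: the inline `PAGE` template, the `FEATURES` list, and the rule-based `iris_prediction` are all hypothetical, and a real app would normally keep its HTML in separate template files and add CSS.

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# The "view": an HTML form, plus the prediction once one is available
PAGE = """
<!doctype html>
<title>Iris predictor</title>
<h1>Iris predictor</h1>
<form method="post">
  {% for name in features %}
    <label>{{ name }} <input name="{{ name }}" type="number" step="any" required></label><br>
  {% endfor %}
  <button type="submit">Predict</button>
</form>
{% if prediction is not none %}
  <p>Predicted class: {{ prediction }}</p>
{% endif %}
"""


def iris_prediction(sepal_length, sepal_width, petal_length, petal_width):
    # Hypothetical stand-in for the pickled model: a single hard-coded rule
    return {"predicted_class": 0 if petal_length < 2.5 else 1}


# The "controller": read the form values, call the model, re-render the page
@app.route("/", methods=["GET", "POST"])
def index():
    prediction = None
    if request.method == "POST":
        values = {name: float(request.form[name]) for name in FEATURES}
        prediction = iris_prediction(**values)["predicted_class"]
    return render_template_string(PAGE, features=FEATURES, prediction=prediction)
```

Loading the page in a browser shows the form; submitting it sends a POST request to the same route, which re-renders the page with the predicted class filled in.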

**Pros:**

* If you already have experience with HTML and CSS, this approach can allow you to flex your creativity and make something very polished that shows off all of your skills. You can add as many pages as you want (e.g. to create a portfolio website rather than a single-page application)
* You can generate data visualizations in Python with Matplotlib (don't need to learn any new plotting libraries)

**Cons:**

* If you don't have any experience with HTML and CSS, the learning curve can be steep. These languages don't produce straightforward error messages like Python does, so it can be quite difficult to know where your mistakes are
* Unless you know JavaScript and are prepared to work with a library like [D3.js](https://d3js.org/), your visualizations aren't going to be interactive

If you are interested in using this approach, check out this [template repository](https://github.com/learn-co-students/capstone-flask-app-template-082420), which includes instructions in the README for modifying the HTML and Python code so that it will work for your project.

#### Deploying a Model as a Full-Stack Web App with Dash

<img src="images/dash_app.png" />

***Plotly Dash*** (also just referred to as Dash) is "the most downloaded, trusted framework for building machine learning web apps in Python" ([source](https://plotly.com/building-machine-learning-web-apps-in-python/)). It allows you to create interactive websites that make predictions from machine learning models, all without writing HTML, CSS, or JavaScript directly.

Dash is built on top of Flask, so it can be deployed using Heroku as well. Check out [this link](https://calm-bastion-07515.herokuapp.com/) for an example app hosted on Heroku (might take a while to load if it hasn't been used in a while), and these curriculum lessons for more info: [Introduction to Dash](https://github.com/learn-co-curriculum/dsc-dash-intro), [Deploying a Model with Dash](https://github.com/learn-co-curriculum/dsc-dash-deployment).

## Level Up: Pickling a Model for Deployment Demo

This shows the basic outline for training a model, evaluating it, then using it in a "production" context to make predictions about new data.

### Model Training and Evaluation

We'll use the wine dataset from scikit-learn for a simple example.

```python
# scikit-learn imports
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# get premade wine dataset from sklearn
data = load_wine()
print(data.DESCR)
```

```python
# let's build a model to predict the class of wine
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
classifier = RandomForestClassifier(max_depth=2, random_state=0, n_estimators=100)
classifier.fit(X_train, y_train)

# get the model accuracy
classifier.score(X_test, y_test)
```

```python
# display the confusion matrix
metrics.confusion_matrix(y_test, classifier.predict(X_test))
```

### Pickling the Model

The [`pickle` format](https://docs.python.org/3/library/pickle.html) is built into the Python language. It's called pickling because it is a way of preserving an in-memory object for later use. This is achieved by serializing the Python object into a stream of bytes that can be stored in a file.
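
As a small illustration of that round trip, here is a sketch using only the standard library. The `settings` dictionary and the temporary file name are made up for the example. Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import pickle
import tempfile
from pathlib import Path

# Any picklable Python object works; this dictionary is just an example
settings = {"model_name": "wine_classifier", "max_depth": 2, "classes": [0, 1, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "settings.pkl"

    # Serialize the object to bytes and write them to disk
    with open(path, "wb") as f:
        pickle.dump(settings, f)

    # Read the bytes back and rebuild an equivalent object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == settings)  # True: same contents
print(restored is settings)  # False: it is a new object rebuilt from the bytes
```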

For scikit-learn models, [the `joblib` library is recommended instead](https://scikit-learn.org/stable/modules/model_persistence.html). This works similarly to `pickle` but has built-in functionality that works better with NumPy arrays.

In the cell below, we take our `classifier` variable and turn it into a file on disk called `wine_classifier.pkl`.

```python
import joblib

# use the built-in open() function to open a file ("wb" means "write as bytes")
# the with block closes the file afterwards, ensuring nothing stays in the buffer
with open("wine_classifier.pkl", "wb") as output_file:
    # dump the variable's contents into the file
    joblib.dump(classifier, output_file)
```

### Loading the Model

This part would actually almost never be in the same file as the previous step. The goal is to take information that was stored in memory at one time, then save it so it can be used later. Here specifically this is useful because training a model is usually a lot slower than using the model to make a prediction, so this saves us from having to re-run that costly operation each time.

```python
import joblib

# use the built-in open() function again, this time to read ("rb" means "read as bytes")
# the with block closes the file afterwards
with open("wine_classifier.pkl", "rb") as model_file:
    # load the variable's contents from the file into a variable
    loaded_model = joblib.load(model_file)
```

### Making a Prediction with the Loaded Model

In this section I'm constructing a request JSON that resembles what would come from a user who wants a predicted class of wine based on these feature values. This code would not actually exist in your deployed application; it would be created automatically by whatever protocol generated the request.

```python
# make a fake request JSON from the user with all the headings
expected_features = (
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
    "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
    "Proanthocyanins", "Color intensity", "Hue",
    "OD280/OD315 of diluted wines", "Proline",
)
example_values = [
    12.82, 3.37, 2.3, 19.5, 88.0, 1.48,
    0.66, 0.4, 0.97, 10.26, 0.72, 1.75,
    685.0,
]

# pair each feature name with its example value
request_json = dict(zip(expected_features, example_values))
request_json
```

This is the section that more closely resembles what you might have in your application. I'm checking to make sure that the expected values are in the `request_json`, transforming them into the right format to make a prediction, then printing out that prediction. In your actual deployed code, you would most likely be **returning** the response, not printing it.

```python
if request_json and all(feature in request_json for feature in expected_features):
    # unpack all of the relevant values from the request into a list
    test_value = [request_json[feature] for feature in expected_features]
    # make a prediction from the "user input"
    predicted_class = int(loaded_model.predict([test_value])[0])
    # construct a response
    response_json = {"prediction": predicted_class}
    print(response_json)
else:
    print("something was missing from the request")
```
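
Since deployed code would return the response rather than print it, the same check can be sketched as a function. This is only one possible shape: `handle_request`, the error format, and `StubModel` (a stand-in so the sketch runs without the pickled classifier) are all hypothetical.

```python
EXPECTED_FEATURES = (
    "Alcohol", "Malic acid", "Ash", "Alcalinity of ash",
    "Magnesium", "Total phenols", "Flavanoids", "Nonflavanoid phenols",
    "Proanthocyanins", "Color intensity", "Hue",
    "OD280/OD315 of diluted wines", "Proline",
)


class StubModel:
    """Hypothetical stand-in for the loaded classifier; always predicts class 1."""

    def predict(self, rows):
        return [1 for _ in rows]


def handle_request(request_json, model, expected_features=EXPECTED_FEATURES):
    """Validate the request and return a response dictionary instead of printing."""
    if not request_json:
        return {"error": "empty request"}

    missing = [feature for feature in expected_features if feature not in request_json]
    if missing:
        return {"error": "missing features", "missing": missing}

    # Put the values in the order the model was trained on
    row = [request_json[feature] for feature in expected_features]
    predicted_class = int(model.predict([row])[0])
    return {"prediction": predicted_class}


complete_request = {feature: 1.0 for feature in EXPECTED_FEATURES}
print(handle_request(complete_request, StubModel()))
print(handle_request({"Alcohol": 12.82}, StubModel())["error"])
```

Returning a dictionary in both branches means the calling code (for example a Flask view) can serialize the result the same way whether the request succeeded or not.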

For a more extended explanation of pickling, check out these curriculum lessons: [Pickle](https://github.com/learn-co-curriculum/dsc-pickle) and [Pickling and Deploying Pipelines](https://github.com/learn-co-curriculum/dsc-pickling-pipelines).

## Level Up: AWS S3 Buckets with Boto3 Demo

For the purpose of this example, the `wine_classifier.pkl` file has already been uploaded to the Flatiron School Curriculum AWS account. To make your own pickle file available from code, you would need to make an account and upload the file.

### Optional Prerequisite: AWS CLI

Installation instructions [here](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html), CLI docs [here](https://docs.aws.amazon.com/cli/latest/reference/s3/). You will need a tool like this to upload files that are too large for the S3 web console (the AWS documentation lists a console limit of 160 GB per object), but it's clunkier than integrating directly into Python and won't work with all deployment techniques.

If you want to upload a file using the CLI, the command will look something like this:

```bash
aws s3 cp wine_classifier.pkl s3://<your bucket name>/wine_classifier.pkl
```
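
For comparison, an upload can also be done from Python with Boto3 (introduced in the next section). The sketch below wraps the client's `upload_file` method. The helper name `upload_file_to_s3` and its optional `s3_client` argument are hypothetical conveniences, and a real upload assumes that AWS credentials with write access to the bucket are configured on your machine.

```python
def upload_file_to_s3(local_path, bucket_name, key, s3_client=None):
    """Upload a local file to s3://<bucket_name>/<key> and return that URI."""
    if s3_client is None:
        # Imported here so the helper can also be exercised with a stand-in client
        import boto3

        # Uses whatever AWS credentials are configured on this machine
        s3_client = boto3.client("s3")

    s3_client.upload_file(Filename=local_path, Bucket=bucket_name, Key=key)
    return f"s3://{bucket_name}/{key}"


# Example call (replace the placeholder with a bucket you own):
# upload_file_to_s3("wine_classifier.pkl", "<your bucket name>", "wine_classifier.pkl")
```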

### Using Boto3 to Retrieve a Pickled Model

![AWS CLI plus Python equals Boto3](https://curriculum-content.s3.amazonaws.com/data-science/images/boto3.png)

Boto3 is a library that allows Python developers to access many different Amazon Web Services offerings, not just S3 (you can find the full list [here](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/index.html)). We'll focus on using S3 with Boto3 because that is one of the most common use cases for data scientists.

If you are accessing public resources in an S3 bucket, the interface is pretty simple. For example, here is how you would load the pickled wine classifier from an S3 bucket:

```python
import boto3

# instantiate a connection to the S3 resource
s3 = boto3.resource("s3")

# load an object from that resource
pkl_obj = s3.Object(bucket_name="curriculum-content", key="data-science/wine_classifier.pkl")

# get the response (under the hood this is similar to `requests`)
pkl_resp = pkl_obj.get()["Body"].read()

# look at what's in there (first 100 bytes)
pkl_resp[:100]
```

### Loading a Pickled Model from Boto3

As you can see from the print-out above, we have a `RandomForestClassifier`. But right now it's still encoded as a string of bytes rather than loaded into a scikit-learn model.

To load it into an actual model, we'll need to use `BytesIO` (a class in the built-in [Python `io` module](https://docs.python.org/3/library/io.html)) and then we'll be able to load it with `joblib`.

```python
# adding this so we'll ignore warnings about our scikit-learn version being different
import warnings
warnings.filterwarnings('ignore')

from io import BytesIO

# read the string of bytes into BytesIO object
pkl_bytes = BytesIO(pkl_resp)

# load the model using joblib
boto3_loaded_model = joblib.load(pkl_bytes)
boto3_loaded_model
```

```python
# display model params
boto3_loaded_model.get_params()
```

```python
boto3_loaded_model.feature_importances_
```
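
To see what `BytesIO` contributes in this step, here is a standard-library sketch. It uses `pickle` and a made-up dictionary in place of `joblib` and the real classifier, but the idea is the same: the bytes already in memory get wrapped in a file-like object, so a loader that expects an open file can read them without anything being written to disk.

```python
import pickle
from io import BytesIO

# Stand-in for the bytes downloaded from S3 (made-up object, not the real model)
original = {"classes": [0, 1, 2], "n_features": 13}
raw_bytes = pickle.dumps(original)

# BytesIO gives the bytes the same read() interface as a file opened with "rb"
file_like = BytesIO(raw_bytes)

# pickle.load expects a file-like object, not a bytes value
restored = pickle.load(file_like)

print(type(raw_bytes).__name__)  # bytes
print(restored == original)  # True
```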

### Using Boto3 to Retrieve a CSV

The process for loading a CSV is essentially the same as loading a pickled model, except that you can pass the `BytesIO` object directly to `pandas`, rather than needing `joblib` to perform an additional loading step first.

```python
# load another object (same bucket name, different key this time)
csv_obj = s3.Object(bucket_name="curriculum-content", key="data-science/data/wine.csv")

# get the response
csv_resp = csv_obj.get()["Body"].read()

# look at what's in there (first 100 bytes)
csv_resp[:100]
```

```python
import pandas as pd

# read the string of bytes into BytesIO object
csv_bytes = BytesIO(csv_resp)

# read the csv file into a dataframe using pandas
boto3_loaded_df = pd.read_csv(csv_bytes)
boto3_loaded_df
```
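
The same pattern can be tried without any AWS access. In the sketch below, `sample_csv_bytes` is a made-up stand-in for the bytes returned by S3 (the column names and values are illustrative, not taken from the real `wine.csv`).

```python
from io import BytesIO

import pandas as pd

# Made-up bytes standing in for the S3 response body
sample_csv_bytes = b"alcohol,hue,target\n12.82,0.72,2\n13.05,1.12,0\n"

# pandas accepts a file-like object, so the bytes never need to be saved to disk
sample_df = pd.read_csv(BytesIO(sample_csv_bytes))
print(sample_df)
```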