# Milestone 4

In this milestone, we deploy our trained model to Amazon EC2, using Flask to host the API.

## Develop the API

Here, we created a simple Flask application that returns a prediction from a `POST` endpoint.

We downloaded `model.joblib` from Milestone 3, installed the dependencies (`flask`, `scikit-learn`, `pandas`), and placed the model file in a project directory. We saved the code below as `main.py` in the same directory. Then we ran:

```
flask --app main --debug run --host 0.0.0.0
```

```python
#! pip install flask scikit-learn joblib pandas

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)

# Load the model trained in Milestone 3 once, when the app starts.
model = joblib.load("model.joblib")

def compute_prediction(data):
    # `data` is a flat list of feature values; transposing turns the
    # single column into a single row, i.e. one sample for the model.
    df = pd.DataFrame(data).T
    return model.predict(df).tolist()

@app.route("/")
def index():
    # Landing page with short usage instructions.
    return """
    <h1>Rain Prediction Service</h1>
    <p>
        <strong>Usage</strong>:<br />
        Make a JSON post request to the /predict url with 25 climate model outputs.
    </p>
    <p>
        <strong>Example</strong>:<br />
        <code>curl http://127.0.0.1:5000/predict
            -d '{"data":[1,2,3,4,53,11,22,37,41,53,11,24,31,44,53,11,22,35,42,53,12,23,31,42,53]}'
            -H "Content-Type: application/json"
        </code>
    </p>
    """

@app.route('/predict', methods=['POST'])
def rainfall_prediction():
    # Expects a JSON body of the form {"data": [...]}.
    content = request.json
    prediction = compute_prediction(content["data"])
    # Echo the input back alongside the prediction.
    results = {
        "prediction": prediction,
        "input": content["data"]
    }
    return jsonify(results)
```
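
The `/predict` handler above passes `content["data"]` straight to the model, so a missing key or a list of the wrong length would surface as a server error. The following is a minimal sketch of how the payload could be checked first. It is not part of our deployed `main.py`; the names `EXPECTED_FEATURES` and `validate_payload` are hypothetical, and the value 25 is taken from the usage text on the index page.

```python
# Hypothetical constant: the index page asks for 25 climate model outputs.
EXPECTED_FEATURES = 25


def validate_payload(content):
    """Return (values, error); exactly one of the two is None."""
    if not isinstance(content, dict) or "data" not in content:
        return None, 'request body must be a JSON object with a "data" key'
    data = content["data"]
    if not isinstance(data, list) or len(data) != EXPECTED_FEATURES:
        return None, f'"data" must be a list of {EXPECTED_FEATURES} numbers'
    # bool is a subclass of int in Python, so reject it explicitly.
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in data):
        return None, '"data" must contain only numbers'
    return [float(x) for x in data], None
```

One way to use it inside `rainfall_prediction` would be to call `validate_payload(request.get_json(silent=True))` and return `jsonify({"error": error}), 400` whenever `error` is not `None`, so that clients receive a readable message instead of a stack trace.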

## Deploy the API

We used AWS EC2 for this lab. After making sure that the instance's security group allows inbound traffic on the port we need (`5000` in our case), we ran the Flask app. The screenshots below show the API in action.

![Screenshot showing the browser output of the `/` endpoint](images/m4-browser.png)

![Screenshot showing the curl output of the `/predict` endpoint](images/m4-curl.png)
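
For completeness, the same request that `curl` sends can also be made from Python. The sketch below uses only the standard library; the helper names `build_predict_request` and `predict` are hypothetical, and the base URL is whatever address the Flask app is reachable at (for example the instance's public DNS name on port `5000`).

```python
import json
import urllib.request


def build_predict_request(base_url, values):
    """Build (but do not send) a POST request for the /predict endpoint."""
    body = json.dumps({"data": list(values)}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, values, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(base_url, values)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `predict(base_url, values)` with a list of 25 numbers should return a dictionary with the `"prediction"` and `"input"` keys produced by the Flask handler, assuming the server is running and the port is open.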

## Milestone Summaries

This is a summary of what we have done across the milestones:

**Milestone 1**:
In Milestone 1, we downloaded the data, combined the CSV files, and performed EDA in both Python and R. During this process, we compared the performance of CSV joining, loading, and data type transformation across our computers (with different CPUs and RAM) while measuring the RAM used. We then exported the data from Python to R using the Parquet file format to perform further EDA in R.

**Milestone 2**: In Milestone 2, we moved from working on our local machines to the cloud using Amazon Web Services (AWS). We created a collaborative environment by setting up an EC2 instance with JupyterHub, installing the packages needed on the server, and adding all team members as users. We then set up an S3 bucket to read and store the data from the Parquet file created in Milestone 1. Once the data was loaded into the bucket, it was wrangled in preparation for use in machine learning models in Milestone 3.

**Milestone 3**: We first performed some basic exploratory data analysis on the CSV file retrieved from S3. We then split the data into training and testing sets and trained an ensemble machine learning model using `RandomForestRegressor` on the `ml_data_SYD.csv` file generated in Milestone 2. We evaluated the model using RMSE (root-mean-square error). To identify the best model, we used PySpark's MLlib to tune some hyperparameters of a random forest and deployed the resulting model using Amazon EMR.

**Milestone 4**:
We used AWS EC2, a scalable virtual server, to host a Flask app that returns a prediction from the model developed and trained in previous milestones via the `POST` endpoint `/predict`.