## Productionizing Notebook Tutorial
This notebook provides a guide on how to make your notebook code ready for production using Versatile Data Kit.

### 1.1 Good to Know Before You Start
This tutorial aims to be user-friendly and easy to follow. 
However, you will have a smoother experience if you already have some familiarity with the following topics:
- **Python and SQL**: Understanding of basic commands and queries.
- **Data Concepts**: Knowledge of simple data modeling and API usage.
- **Tools**: Comfort using Jupyter Notebook.



### 1.2 Useful notebook shortcuts
To run a cell, use any of the following:
* Click the **Play icon** in the left gutter of the cell;
* Type **Cmd/Ctrl+Enter** to run the cell in place;
* Type **Shift+Enter** to run the cell and move focus to the next cell (adding one if none exists); or
* Type **Alt+Enter** to run the cell and insert a new code cell immediately below it.

There are additional options for running some or all cells in the **Runtime** menu on top.

### 2. Objectives we will be following
1. **Retrieve Data:** Extract data from the specified URL using pandas.

2. **Data Cleansing:** Eliminate records associated with 'testuser'.

3. **Score Classification:** Assign scores to predefined categories for clarity.

4. **Data Ingestion:** Use VDK job_input to ingest the organized data.

### 3. Initialize new VDK job (input)

In [None]:
"""
vdk.plugin.ipython extension introduces a magic command for Jupyter.
The command enables the user to load VDK for the current notebook.
VDK provides the job_input API, which has methods for:
    * executing queries to an OLAP database;
    * ingesting data into a database;
    * processing data into a database.
Type help(job_input) to see its documentation.

"""

%reload_ext vdk.plugin.ipython
%reload_VDK
job_input = VDK.get_initialized_job_input()

### 3.1 Explore what you can do (Task 1)
For a full guide on how to use VDK in Jupyter, refer to the Getting Started Notebook.

In [None]:
# See all methods with help:
help(job_input)

### 4. Start working on the objectives

#### 4.1 Meet the first challenge in Productionizing Notebooks
Reproducibility is a major challenge in notebooks because cells can be executed in a non-linear order, which leads to hidden dependencies and altered state when cells are run out of sequence.
Continue with the code below to verify this for yourself!

#### 4.2 Retrieve Data

In [None]:
import pandas as pd

In [None]:
# Reading the CSV file from the provided URL and storing the data in a DataFrame 'df'
url = "https://raw.githubusercontent.com/duyguHsnHsn/nps-data/main/nps_data.csv"
df = pd.read_csv(url)

In [None]:
# Check the DataFrame
df

#### 4.2 Data Cleansing

In [None]:
# Cleaning the DataFrame: Removing all records associated with 'testuser'
df = df[df['User'] != 'testuser']
# Rerun the data check that was performed in the previous step to validate data cleanliness

#### 4.3 Check our code for reproducibility problem
Rerunning the data check cell replaced the output of its first execution, so the earlier view of the DataFrame is gone.

In this case that was harmless, because the cell only displays the DataFrame. Had it been a data transformation, rerunning it or running it out of order could have changed the result of the data cleansing step. This is how notebooks complicate code reproducibility.
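
The difference between a harmless rerun and a harmful one can be sketched in plain Python. This is only an illustration: the `state` dictionary and the two functions are hypothetical stand-ins for notebook cells that share state, and they are not part of this tutorial's dataset or of VDK.

```python
# Plain-Python stand-in for notebook cells: each function plays the role of
# one cell that reads or writes shared notebook state (hypothetical example).
state = {"scores": [3, 7, 10]}


def display_cell():
    """Read-only cell: safe to run any number of times."""
    return list(state["scores"])


def rescale_cell():
    """Transformation cell: every run changes the shared state again."""
    state["scores"] = [score * 10 for score in state["scores"]]


display_cell()
after_display = display_cell()    # rerunning a read-only cell changes nothing

rescale_cell()
after_one_run = display_cell()    # state after the intended single run

rescale_cell()                    # accidental second run of the same cell
after_two_runs = display_cell()   # state has silently drifted
```

After the accidental second run, the scores are a hundred times their original values, even though no code was edited. Only the execution history differs.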

#### 4.4 Solve the reproducibility problem
VDK introduces “VDK cells”: notebook cells tagged with "vdk".

  - **Implementation**:
    - Assign a "vdk" tag and a specific number to a cell.
    - The number dictates the order in which the cell will be executed in production.
    
  - **Benefits**:
    - Ensures only the tagged cells are executed, and in the determined sequence.
    - Resolves the reproducibility issue by clearly defining the execution order.
    
  - **What your notebook should look like and how to tag cells**
    <img src="../images/reproducibility_solution.png" width="1000" length="546" alt="Run a job">
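
To make the tagging idea concrete, here is a minimal sketch of how tagged cells could be picked out of a notebook file. It relies only on the Jupyter notebook JSON layout, where each cell stores its tags under `metadata["tags"]`. The sample `notebook` dictionary and the `tagged_code_cells` function are hypothetical, and the sketch assumes cells are taken in the order they appear in the notebook; it does not show how VDK itself selects or orders cells.

```python
# Minimal notebook structure following the Jupyter notebook JSON layout:
# each cell keeps its tags under metadata["tags"] (sample data is made up).
notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["vdk"]},
         "source": "df = pd.read_csv(url)"},
        {"cell_type": "code", "metadata": {},
         "source": "df"},
        {"cell_type": "code", "metadata": {"tags": ["vdk"]},
         "source": "df = df[df['User'] != 'testuser']"},
        {"cell_type": "markdown", "metadata": {},
         "source": "#### Notes"},
    ]
}


def tagged_code_cells(nb, tag="vdk"):
    """Return the sources of code cells carrying `tag`, in notebook order."""
    return [
        cell["source"]
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
        and tag in cell.get("metadata", {}).get("tags", [])
    ]


production_steps = tagged_code_cells(notebook)
```

In this sample, the untagged `df` check and the markdown cell are left out, so only the read and cleansing steps remain.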



#### 4.5 Meet the second challenge in Productionizing Notebooks
Notebooks tend to accumulate code that is irrelevant to production, such as leftover print statements and unrelated snippets. During development it is common to mix critical algorithms with ad hoc print statements used for quick debugging or data checks. This is useful in the experimental stage because it gives quick insight and supports iterative changes. When the notebook moves to production, however, this mixture of essential and incidental code becomes a problem.

#### 4.6 Score Classification
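
The `helper` module used in this section is not shown in this notebook. As a rough idea of what a classification function could look like, the sketch below assumes the conventional Net Promoter Score buckets (9 to 10 promoter, 7 to 8 passive, 0 to 6 detractor). The name `classify_score_sketch` and the returned labels are hypothetical; the real `classify_score` may use different categories.

```python
def classify_score_sketch(score):
    """Hypothetical stand-in for helper.classify_score (the real one may differ).

    Assumes the conventional Net Promoter Score buckets.
    """
    if score >= 9:
        return "Promoter"
    if score >= 7:
        return "Passive"
    return "Detractor"


# Classify a few sample scores.
labels = [classify_score_sketch(score) for score in (10, 8, 3)]
```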

In [None]:
# Import the functions used below from the 'helper' module,
# which contains the necessary logic for classification and data visualization
from helper import classify_score, visualize_data

In [None]:
# Apply the classification function to the 'Score' column to determine the 'Type'
# Note: this cell might fail on its first run. 
# If it does, simply run it again, and it should work as expected.
df.loc[:, 'Type'] = df['Score'].apply(classify_score)

In [None]:
# Check the DataFrame
df

In [None]:
# Visualise the types of users
visualize_data(df)

#### 4.7 Data Ingestion

In [None]:
# Sending data for ingestion 
job_input.send_tabular_data_for_ingestion(
    df.itertuples(index=False),
    destination_table="nps_data",
    column_names=df.columns.tolist()
)

#### 4.8 Check for code that is irrelevant to production
Our primary goals were to classify the users and ingest our data. Printing the DataFrame and visualizing the types were not necessary steps. 
Including these cells in production could potentially slow down our code, as they are not essential to achieving our objectives.

#### 4.9 Solve the irrelevant code problem
- **Simple Solution**: Delete the irrelevant cells.
- **VDK’s Alternative**:
  - **Introduction of “Non VDK Cells”**:
    - Cells without the "vdk" tag.
  - **Implementation**:
    - Leave the cell untagged and in its default state.
  - **Benefits**:
    - **Exclusion in Production**: These cells won’t be executed in production, maintaining efficiency.
    - **Available for Development**: The cells remain accessible for quick checks during development since they are not removed.
  - **What your notebook should look like**
    <img src="../images/irrelevant_code_solution.png" width="1000" length="546" alt="Run a job">

In [None]:
print("Data processing complete.")

### 5. We have completed our objectives
Congratulations! You have successfully ingested the required data.

To verify the ingestion, you can use the `vdksql` cell magic command.

In [None]:
%%vdksql
SELECT * FROM nps_data

### 6. Other problems linked to productionizing notebooks

#### 6.1 Testing 
Testing code in Jupyter notebooks can be tricky. Unlike some other coding tools, notebooks have no straightforward, well-organized methods or frameworks designed specifically for testing them, so it is harder to check that the code does what it is supposed to do. We may have to come up with our own solutions, which can be messy and unreliable.

  ##### VDK’s Solution for Easier End-to-End Testing

- **VDK Run Command**: The `vdk run` command enables end-to-end (e2e) testing.
  - **Execution Simplicity**: Run your code as if it's in the production environment with a simple command.
  - **Error Handling**: Get detailed error messages and stack traces for troubleshooting if the code fails.
  - **Success Confirmation**: Receive success messages and logs for successful executions.
 
 For guidance on running jobs, consult the [Getting Started Guide](./getting-started.ipynb).
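
Alongside end-to-end runs, logic that lives in a plain Python function can also be checked in isolation. The sketch below is not a VDK feature. It uses Python's standard `unittest` module on a hypothetical `remove_test_users` function that mirrors the data cleansing step from this tutorial, operating on plain dictionaries instead of a DataFrame.

```python
import unittest


def remove_test_users(rows, test_user="testuser"):
    """Hypothetical helper: drop rows whose 'User' field equals `test_user`."""
    return [row for row in rows if row["User"] != test_user]


class RemoveTestUsersTest(unittest.TestCase):
    def test_drops_only_test_user_rows(self):
        rows = [
            {"User": "alice", "Score": 9},
            {"User": "testuser", "Score": 1},
            {"User": "bob", "Score": 6},
        ]
        cleaned = remove_test_users(rows)
        self.assertEqual([row["User"] for row in cleaned], ["alice", "bob"])


# Run the test case programmatically so the sketch also works in a notebook cell.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RemoveTestUsersTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping such functions in a module like `helper` makes them importable from both the notebook and a test file.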


#### 6.2 Version Control
Version control with notebooks can be complex because their JSON-based format carries a lot of noise.

##### VDK’s Solution
- **Noise Reduction**: VDK cleans the notebook's JSON by removing non-essential elements, such as execution counts and outputs, which typically cause "noise."
- **Seamless Integration with Git**: With VDK's integration, your code is committed to Git on deployment in a cleaner state, free of unnecessary metadata, which simplifies version control and reduces clutter.
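
To illustrate what this kind of noise looks like, the sketch below clears outputs and execution counts from a notebook held as a Python dictionary. It is not VDK's implementation. The `strip_notebook_noise` function and the sample notebook are hypothetical and rely only on the standard Jupyter notebook JSON fields `outputs` and `execution_count`.

```python
import copy
import json


def strip_notebook_noise(nb):
    """Return a copy of a notebook dict with outputs and execution counts cleared."""
    cleaned = copy.deepcopy(nb)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Made-up sample notebook with one executed code cell and one markdown cell.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {"tags": ["vdk"]},
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["ok\n"]}
            ],
            "source": ["print('ok')"],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_notebook_noise(notebook)
# Stable key order keeps diffs between commits small.
serialized = json.dumps(cleaned, indent=1, sort_keys=True)
```

The cell source and tags are preserved, so the cleaned file still describes the same job while producing smaller diffs.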

# Congratulations! 🎉

You've successfully completed the Productionizing Jupyter Notebooks with VDK tutorial! We hope you found this tutorial useful.

## Your Feedback Matters!

We continuously strive to improve and your feedback is invaluable to us. Please take a moment to complete our survey. It will only take a few minutes.

### [**👉 Complete the Survey Here 👈**](to be added)

Thank you for participating in this tutorial!
