## Productionizing Notebook Tutorial
This notebook provides a guide on how to make your notebook code production-ready using Versatile Data Kit.
You can find the Versatile Data Kit repository [here](https://github.com/vmware/versatile-data-kit).

### 1.1 Good to Know Before You Start
<details>
    <summary><strong style="color: Tomato;">Click to check!</strong></summary>
    <p>This tutorial aims to be user-friendly and easy to follow. 
    However, you will have a smoother experience if you already have some familiarity with the following topics:</p>
    <ul>
        <li><strong>Python and SQL</strong>: Understanding of basic commands and queries.</li>
        <li><strong>Data Concepts</strong>: Knowledge of simple data modeling and API usage.</li>
        <li><strong>Tools</strong>: Familiarity with Jupyter Notebook.</li>
    </ul>
</details>

### 1.2 Useful notebook shortcuts

<details>
    <summary><strong style="color: Tomato;">Click to see them!</strong></summary>
    <p>To run the code in a cell, select it and use one of the following options:</p>
    <ul>
        <li>Click the <strong>Play icon</strong> in the left gutter of the cell;</li>
        <li>Type <strong>Cmd/Ctrl+Enter</strong> to run the cell in place;</li>
        <li>Type <strong>Shift+Enter</strong> to run the cell and move focus to the next cell (adding one if none exists); or</li>
        <li>Type <strong>Alt+Enter</strong> to run the cell and insert a new code cell immediately below it.</li>
    </ul>
    <p>There are additional options for running some or all cells in the <strong>Runtime</strong> menu at the top.</p>
</details>


### 2. Objectives & Your Action Items

🔹 **Task 1: The Reproducibility Problem**
   > **Objective:** Extract data from the specified URL using pandas. 
   
   > **Objective:** Eliminate records associated with 'testuser'.


🔹 **Task 2: The Irrelevant Code Problem**
   > **Objective:** Assign scores into predefined categories for clarity.
   
   > **Objective:** Use VDK job_input to ingest the organized data.

🔹 **Task 3: Testing Challenges**

🔹 **Task 4: Version Control with Notebooks**

### 3. Initialize the VDK Job Input object

##### Prerequisite: Run the environment configuration

In [None]:
%env db_default_type=SQLITE
%env ingest_method_default=SQLITE
%env ingest_target_default=vdk-sqlite.db
%env vdk_sqlite_file=vdk-sqlite.db
%env INGESTER_WAIT_TO_FINISH_AFTER_EVERY_SEND=true

In [None]:
"""
vdk.plugin.ipython extension introduces a magic command for Jupyter.
The command enables the user to load VDK for the current notebook.
VDK provides the job_input API, which has methods for:
    * executing queries to an OLAP database;
    * ingesting data into a database;
    * processing data into a database.
Type help(job_input) to see its documentation.

"""

%reload_ext vdk.plugin.ipython
%reload_VDK
job_input = VDK.get_initialized_job_input()

### 3.1 Explore what you can do (Task 0)
For a full guide on how to use VDK in Jupyter, refer to the Getting Started Notebook.

In [None]:
# See all methods:
help(job_input)

### 4 The Reproducibility Problem in Productionizing Notebooks
Reproducibility poses a major challenge when using notebooks because of the non-linearity of code execution, leading to hidden dependencies and altered states when cells are run out of sequence.
  <details>
    <summary><strong style="color: Blue;">Check example!</strong></summary>
    <img src="../images/reproducibility_problem_example.png" width="600" >
</details>

Continue with the code below to understand this problem further!

#### 4.1 Retrieve Data

In [None]:
import pandas as pd

In [None]:
# Reading the CSV file from the provided URL and storing the data in a DataFrame 'df'
url = "https://raw.githubusercontent.com/duyguHsnHsn/nps-data/main/nps_data.csv"
df = pd.read_csv(url)

In [None]:
# Check the DataFrame
df

#### 4.2 Data Cleansing

In [None]:
# Check the count of test users
testuser_count = df[df['User'] == 'testuser'].shape[0]
testuser_count

In [None]:
# Cleaning the DataFrame: Removing all records associated with 'testuser'
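# .copy() makes the filtered result an independent DataFrame, so adding columns later does not operate on a slice of the original
df = df[df['User'] != 'testuser'].copy()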
# Rerun the data check that was performed in the previous step to validate data cleanliness

#### 4.3 Check our code for problems with reproducibility
Watch for reproducibility issues in your code. The result of a cell depends on which cells ran before it: the data check cell reports the original number of test users when it runs before the cleaning cell, and zero when it is rerun afterwards, because `df` has been overwritten in the meantime.

The data check only reads the data, so rerunning it does no harm. A cell that transforms data is a different matter: running it more than once, or out of order, can silently compromise your data cleaning process. This is a common challenge with notebooks, where reproducibility depends on executing cells in the right order.
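
The effect can be reproduced outside the tutorial data. The sketch below is a minimal illustration that uses a small, made-up DataFrame (the rows and the `count_test_users` helper are assumptions, not part of the tutorial dataset) to show how the same check gives different answers depending on execution order.

```python
import pandas as pd

# Made-up rows that mimic the tutorial's 'User' and 'Score' columns
df = pd.DataFrame({
    "User": ["alice", "testuser", "bob", "testuser"],
    "Score": [9, 3, 7, 10],
})


def count_test_users(frame):
    """Stand-in for the data check cell: count rows that belong to 'testuser'."""
    return frame[frame["User"] == "testuser"].shape[0]


first_check = count_test_users(df)    # check run before the cleaning step

df = df[df["User"] != "testuser"]     # cleaning step overwrites df

second_check = count_test_users(df)   # same check, rerun after cleaning

print(first_check, second_check)
```

The two values differ even though the check itself never changed, which is exactly the hidden state that makes notebooks hard to reproduce.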


#### 4.4 Solve the reproducibility problem
VDK introduces “VDK cells”: notebook cells tagged with "vdk".

> **Implementation**:
> - Assign a "vdk" tag and a specific number to a cell.
> - The number dictates the order in which the cell will be executed in production.

> **Benefits**:
> - Ensures only the tagged cells are executed, and in the determined sequence.
> - Resolves the reproducibility issue by clearly defining the execution order.
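
To make the idea concrete, here is a rough sketch of how tagged cells could be picked out of a notebook's JSON. It is not VDK's actual implementation: the `production_cells` helper and the sample notebook are hypothetical, and the sketch assumes the tagged cells are numbered in their top-to-bottom order.

```python
# A notebook is stored as JSON; each cell can carry tags in its metadata.
# This sample notebook is made up for illustration.
notebook = {
    "cells": [
        {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": "df = load()"},
        {"cell_type": "code", "metadata": {}, "source": "df  # quick look"},
        {"cell_type": "markdown", "metadata": {}, "source": "## Notes"},
        {"cell_type": "code", "metadata": {"tags": ["vdk"]}, "source": "df = clean(df)"},
    ]
}


def production_cells(nb, tag="vdk"):
    """Return the sources of tagged code cells, in notebook order."""
    return [
        cell["source"]
        for cell in nb["cells"]
        if cell["cell_type"] == "code"
        and tag in cell.get("metadata", {}).get("tags", [])
    ]


steps = production_cells(notebook)
for number, source in enumerate(steps, start=1):
    print(number, source)
```

Untagged cells, such as the quick look at `df`, are simply left out of the resulting sequence.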

    
  <font color='red'>**ACTION! Tag your cells!**</font> 
  <details>
    <summary><strong style="color: Green;">Check solution!</strong></summary>
    <p>The cells in your notebook should be tagged as in the picture below.</p>
    <img src="../images/reproducibility_solution.png" width="1000" height="546" alt="Run a job">
</details>

    


### 5 The Irrelevant Code Problem in Notebooks
In the development phase, it's common to include a mixture of code - from algorithms to print statements - meant for quick debugging or data checks. This is incredibly useful during the experimental stage, allowing for easy insights and iterative changes. However, when it's time to transition to production, this notebook full of relevant and random code becomes problematic. 
  <details>
    <summary><strong style="color: Blue;">Check example!</strong></summary>
    <img src="../images/irrelevant-code-example.png" width="600" >
</details>

#### 5.1 Score Classification
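
The `helper` module used in this section is not shown in this notebook. Purely as an illustration, a classifier of this kind could look like the sketch below. It assumes the conventional NPS bands (9-10 promoter, 7-8 passive, 0-6 detractor); the real `classify_score` may use different labels or thresholds, so the function here is given a different, hypothetical name.

```python
def classify_score_sketch(score):
    """Map a 0-10 survey score to an NPS category (assumed thresholds)."""
    if score >= 9:
        return "Promoter"
    if score >= 7:
        return "Passive"
    return "Detractor"


labels = [classify_score_sketch(s) for s in (10, 8, 3)]
print(labels)
```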

In [None]:
# Import the classification and visualization functions from the 'helper' module
from helper import visualize_data, classify_score

In [None]:
# Apply the classification function to the 'Score' column to determine the 'Type'
# Note: this cell might fail on its first run. 
# If it does, simply run it again, and it should work as expected.
df.loc[:, 'Type'] = df['Score'].apply(classify_score)

In [None]:
# Check the DataFrame
df

In [None]:
# Visualize the types of users
visualize_data(df)

#### 5.2 Data Ingestion

In [None]:
# Sending data for ingestion 
job_input.send_tabular_data_for_ingestion(
    df.itertuples(index=False),
    destination_table="nps_data",
    column_names=df.columns.tolist()
)

#### 5.3 Check for irrelevant code
Our primary goals were to classify the users and ingest the data. Printing the DataFrame and visualizing the user types were not necessary steps.
Running these cells in production could slow the job down without contributing to either goal.

#### 5.4 Solve the irrelevant code problem
- **Simple Solution**: Delete the irrelevant cells.
- **VDK’s Alternative**:

 > **Introduction of “Non VDK Cells”**:
 >   - Cells without the "vdk" tag.
 
 > **Implementation**:
 >   - Leave the cell untagged and in its default state.
 
 > **Benefits**:
 >   - **Exclusion in Production**: These cells won’t be executed in production, maintaining efficiency.
 >   - **Available for Development**: The cells remain accessible for quick checks during development since they are not removed.
    
 <font color='red'>**ACTION! Tag your cells!**</font> 
  <details>
    <summary><strong style="color: Green;">Check solution!</strong></summary>
    <p>The cells in your notebook should be tagged as in the picture below.</p>
    <img src="../images/irrelevant_code_solution.png" width="1000" height="546" alt="Run a job">
</details>

In [None]:
print("Data processing complete.")

### 6. We are ready with our objectives
Congratulations! You have successfully ingested the required data.

To verify the ingestion process, you can use the `vdksql` cell magic command.
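
The same kind of check can also be sketched in plain Python with the standard-library `sqlite3` module, which may be handy outside the notebook. The sketch is illustrative only: it runs against an in-memory database with made-up rows, and `preview_table` is a hypothetical helper, not part of VDK. To inspect the tutorial's data, the connection would instead point at the `vdk-sqlite.db` file configured earlier, assuming it sits in the working directory.

```python
import sqlite3


def preview_table(conn, table, limit=5):
    """Return the row count and the first few rows of a table."""
    # Table names cannot be bound as SQL parameters, so pass trusted names only.
    count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    rows = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()
    return count, rows


# Demo on an in-memory database with made-up rows.
# For the tutorial's database, one would use sqlite3.connect("vdk-sqlite.db").
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE nps_data ("User" TEXT, "Score" INTEGER, "Type" TEXT)')
conn.executemany(
    "INSERT INTO nps_data VALUES (?, ?, ?)",
    [("alice", 9, "Promoter"), ("bob", 4, "Detractor")],
)
demo_count, demo_rows = preview_table(conn, "nps_data")
conn.close()

print(demo_count, demo_rows)
```

Inside the notebook, the `vdksql` cell that follows remains the shortest route.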

In [None]:
%%vdksql
SELECT * FROM nps_data

### 7. Other problems with productionizing notebooks

#### 7.1 Testing 
Testing code in Jupyter notebooks can be tricky. Unlike some other coding tools, notebooks offer no straightforward way to test the code they contain, and there are few well-established methods or frameworks designed specifically for testing them. As a result, it is harder to check that the code does what it should, and we often fall back on ad hoc solutions that can be messy and less reliable.

  ##### VDK’s Solution for Easier End-to-End Testing

> **VDK Run Command**: This command enables end-to-end (e2e) testing.
>   - **Execution Simplicity**: Run your code as if it's in the production environment with a simple command.
>   - **Error Handling**: Get detailed error messages and stack traces for troubleshooting if the code fails.
>   - **Success Confirmation**: Receive success messages and logs for successful executions.
 
 For guidance on running jobs, consult the [Getting Started Guide](./getting-started.ipynb).
 
 The VDK Run command can be configured to execute automatically before deployment. This ensures that only code which passes this pre-deployment check is promoted to production, reducing the risk of errors.
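
A pre-deployment gate of this kind can be sketched as a small script that runs a command and only allows deployment when it exits successfully. This is a hedged illustration, not VDK's own mechanism: `passes_check` is a hypothetical helper, and the commands below are stand-ins so the sketch runs anywhere. In a real pipeline the command would be something like `["vdk", "run", "my-job"]`, where `my-job` is an assumed job directory.

```python
import subprocess
import sys


def passes_check(command):
    """Run a command and report whether it exited successfully."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the error output so the failure can be diagnosed
        print(result.stderr, file=sys.stderr)
    return result.returncode == 0


# Stand-in commands: one that succeeds and one that fails
ok = passes_check([sys.executable, "-c", "print('job finished')"])
failed = passes_check([sys.executable, "-c", "raise SystemExit(1)"])

print("deploy" if ok else "block deployment")
```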


#### 7.2 Version Control
Version control with notebooks can be complex because their JSON-based format stores more than just code, which adds noise to every diff.
 ##### VDK’s Solution
> - **Noise Reduction**: VDK cleanses the notebook's JSON by removing non-essential elements, such as execution counts and outputs, which typically cause "noise."
> - **Seamless Integration with Git**: With VDK's integration, your code is committed to a dedicated Git repository on deployment in a cleaner state, devoid of unnecessary metadata, simplifying version control and reducing clutter.
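
The noise reduction step can be pictured with a short sketch that clears outputs and execution counts from a notebook's JSON. This is an illustration of the idea under stated assumptions, not VDK's actual code: `strip_noise` is a hypothetical helper, and the sample notebook is made up.

```python
import copy
import json


def strip_noise(nb):
    """Return a copy of the notebook without outputs and execution counts."""
    cleaned = copy.deepcopy(nb)  # leave the original notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Made-up notebook with one executed code cell and one markdown cell
raw = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 8,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "done\n"}],
            "source": "print('done')",
        },
        {"cell_type": "markdown", "metadata": {}, "source": "## Notes"},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_noise(raw)
print(json.dumps(cleaned, indent=1))
```

With outputs and counters removed, two commits of the same code produce the same JSON, so diffs show only real changes.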

# Congratulations! 🎉

You've successfully completed the Productionizing Jupyter Notebooks with VDK tutorial! We hope you found this tutorial useful.

## Your Feedback Matters!

We continuously strive to improve and your feedback is invaluable to us. Please take a moment to complete our survey. It will only take a few minutes.

### [**👉 Complete the Survey Here 👈**](https://forms.office.com/Pages/ResponsePage.aspx?id=yjiRs-48Skuk1s2D2d1i8AGV0VaygrpPnt7Tz5bBbeBUNFA5NkU3QzlNWEQyUFJCTTQwRUszWk9GUS4u)


### Alternatively, Complete the Survey Directly in Jupyter!
We've made it even easier for you to provide feedback. Just run the code cell below once, and you're done! We appreciate your participation in making this tutorial better.



Thank you for joining us and sharing your thoughts!


In [None]:
from IPython.display import display, HTML

iframe_html = """
<iframe width="100%" height="580px" src="https://forms.office.com/Pages/ResponsePage.aspx?id=yjiRs-48Skuk1s2D2d1i8AGV0VaygrpPnt7Tz5bBbeBUNFA5NkU3QzlNWEQyUFJCTTQwRUszWk9GUS4u&embed=true" frameborder="0" marginwidth="0" marginheight="0" style="border: none; max-width:100%; max-height:100vh" allowfullscreen webkitallowfullscreen mozallowfullscreen msallowfullscreen> </iframe>
"""

display(HTML(iframe_html))