
# Week 1 Project: Sentiment Analysis Project for Lamada E-commerce Platform
Welcome to the Week 1 Project! In this project, you will take on the role of an MLOps Engineer working on a new greenfield project at Lamada, a leading e-commerce giant.

![Lamada](https://drive.google.com/uc?id=17r13P5Wy9DtjmLEaiaXf3DUVViUXmFso)

## Background
Lamada, an e-commerce platform, currently relies on manual review analysis by customer support specialists and product analysts. This process is time-consuming, error-prone, and struggles to scale with increasing review volumes.

## Project Goal
Implement automated sentiment analysis to improve the efficiency and accuracy of review processing at Lamada.

## Benefits of Automated Sentiment Analysis
- Real-time processing for quicker customer feedback responses
- Systematic identification of common themes and issues
- Data-driven insights for targeted marketing strategies
- Efficient prioritization of customer support tasks
- Trend analysis for product satisfaction and inventory management

In this project, you will develop a machine learning model to automate the sentiment analysis process, addressing Lamada's current challenges and unlocking these benefits.

## MLOps Focus
This greenfield project presents an opportunity to address issues that have historically challenged machine learning teams at Lamada. By implementing MLOps best practices, we aim to enhance the entire ML lifecycle, resulting in:

- Improved model quality and reliability
- Faster development and deployment cycles
- Better collaboration between data scientists and operations teams
- Increased reproducibility of results
- Enhanced monitoring and maintenance of models in production

In the following section, we will dive into the data currently available and perform some quick EDA to understand it and its quirks.

In [1]:
# Installing all the necessary packages

!pip install \
ucimlrepo \
ydata-profiling \
pandas \
snorkel \
ipytest \
pytest \
great_expectations==0.18.19 \
scikit-learn \
wandb \
skl2onnx \
onnxruntime \
checklist



# Exploratory Data Analysis (EDA)

## Importance of EDA in Sentiment Analysis

Exploratory Data Analysis is a crucial step in any data science project, particularly in sentiment analysis. For our Lamada e-commerce platform project using the Drugs.com dataset, EDA will help us:

1. Understand the data distribution and quality
2. Identify patterns and relationships between variables
3. Detect anomalies or outliers that might affect our model
4. Inform feature engineering and selection
5. Guide our choice of machine learning algorithms

## Key Areas to Explore

Given the structure of our Drugs.com dataset, we should focus on:

1. Text Analysis:
   - Review length distribution
   - Common words and phrases
   - Correlation between review text and ratings

2. Rating Distribution:
   - Overall rating distribution
   - Rating patterns across different drugs and conditions

3. Temporal Trends:
   - Changes in sentiment over time
   - Seasonal patterns in reviews or ratings

4. Drug and Condition Analysis:
   - Most reviewed drugs and conditions
   - Relationship between conditions and ratings

5. Useful Count Analysis:
   - Distribution of useful votes
   - Correlation between usefulness and sentiment

## Potential Challenges

During EDA, we need to be aware of:

1. Class Imbalance: The rating distribution might be skewed, affecting our model's performance.

2. Data Quality: Look for missing values, inconsistencies in drug names or conditions, and potential data entry errors.

3. Text Preprocessing Needs: Identify requirements for text cleaning, such as handling abbreviations, misspellings, or medical jargon.

4. Bias Detection: Check for potential biases in the data, such as overrepresentation of certain drugs or conditions.

5. Feature Relevance: Assess which features contribute most to sentiment and which might introduce noise.
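
As a rough illustration of the areas and challenges listed above, here is a minimal sketch of a few manual checks in pandas. The `quick_eda_summary` helper and the toy rows are assumptions made for illustration; only the column names come from the Drugs.com dataset.

```python
import pandas as pd

# Toy stand-in for the Drugs.com frame; the column names match the dataset,
# the rows are made up for illustration.
sample = pd.DataFrame(
    {
        "drugName": ["Drug A", "Drug A", "Drug B", "Drug C"],
        "condition": ["Pain", "Pain", None, "ADHD"],
        "review": [
            "Worked well for me",
            "No effect at all",
            "Bad headaches",
            "Helped my son focus",
        ],
        "rating": [9, 2, 1, 8],
        "date": ["20-May-12", "27-Apr-10", "14-Dec-09", "3-Nov-15"],
        "usefulCount": [27, 0, 3, 192],
    }
)


def quick_eda_summary(df: pd.DataFrame) -> dict:
    """Collect a few basic EDA checks into one dictionary (hypothetical helper)."""
    review_words = df["review"].str.split().str.len()
    return {
        # Class imbalance: how many reviews fall on each rating value.
        "rating_counts": df["rating"].value_counts().sort_index().to_dict(),
        # Data quality: number of missing values per column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Text analysis: typical review length in words.
        "median_review_words": float(review_words.median()),
        # Bias detection: which drugs dominate the data.
        "top_drugs": df["drugName"].value_counts().head(3).to_dict(),
        # Rank correlation (Spearman-style) between rating and usefulness.
        "rating_vs_useful_rank_corr": float(
            df["rating"].rank().corr(df["usefulCount"].rank())
        ),
    }


summary = quick_eda_summary(sample)
print(summary)
```

The same function can be called on the full `df` once it has been loaded; on the real data the numbers are only a starting point for the deeper questions above.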


In [4]:
# Fetching the dataset that we will be using throughout this course.
# Read more about it here: https://archive.ics.uci.edu/dataset/462/drug+review+dataset+drugs+com

from ucimlrepo import fetch_ucirepo
import pandas as pd


drug_reviews_drugs_com = fetch_ucirepo(id=462)
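# axis=1 places features and targets side by side; the default axis=0 would stack a non-empty targets frame as extra rows.
df = pd.concat([drug_reviews_drugs_com.data.features, drug_reviews_drugs_com.data.targets], axis=1)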

In [5]:
print(drug_reviews_drugs_com)


{'data': {'ids':             id
0       206461
1        95260
2        92703
3       138000
4        35696
...        ...
215058  159999
215059  140714
215060  130945
215061   47656
215062  113712

[215063 rows x 1 columns], 'features':                         drugName                     condition  \
0                      Valsartan  Left Ventricular Dysfunction   
1                     Guanfacine                          ADHD   
2                         Lybrel                 Birth Control   
3                     Ortho Evra                 Birth Control   
4       Buprenorphine / naloxone             Opiate Dependence   
...                          ...                           ...   
215058                 Tamoxifen     Breast Cancer, Prevention   
215059              Escitalopram                       Anxiety   
215060            Levonorgestrel                 Birth Control   
215061                Tapentadol                          Pain   
215062                 Arthrotec     

In [10]:
df.head()

| | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|
| 0 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati..." | 9 | 20-May-12 | 27 |
| 1 | Guanfacine | ADHD | "My son is halfway through his fourth week of ..." | 8 | 27-Apr-10 | 192 |
| 2 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh..." | 5 | 14-Dec-09 | 17 |
| 3 | Ortho Evra | Birth Control | "This is my first time using any form of birth..." | 8 | 3-Nov-15 | 10 |
| 4 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around..." | 9 | 27-Nov-16 | 37 |


## Quick & Easy EDA using ydata-profiling

While manual EDA is crucial for deep understanding, we can accelerate our initial data exploration using automated tools. For this project, we'll be utilizing [ydata-profiling](https://github.com/ydataai/ydata-profiling), a powerful library that generates comprehensive exploratory data analysis reports.

### Benefits of ydata-profiling

1. **Speed**: Quickly generates an in-depth EDA report, saving time in the initial exploration phase.
2. **Comprehensiveness**: Provides a wide range of statistics and visualizations for each variable in the dataset.
3. **Interactivity**: Creates an interactive HTML report that allows for easy navigation and exploration.
4. **Correlation Analysis**: Automatically detects and visualizes relationships between variables.
5. **Missing Data Overview**: Clearly highlights missing data patterns across the dataset.

While ydata-profiling will provide us with a solid starting point, remember that *it's a complement to, not a replacement for, thoughtful manual EDA*. We'll use its insights as a springboard for more targeted analysis and feature engineering.

In [11]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Drug Reviews Dataset Profiling Report")

In [14]:
profile




In [15]:
profile.to_notebook_iframe()

### Analyzing the ydata-profiling Report

Now that you've generated the ydata-profiling report for our Drugs.com dataset, it's time to dive in and extract meaningful insights.

## [TODO] Add in your observations
1. List at least five interesting findings from the report.
2. Identify any potential data quality issues that need addressing.
3. Based on this initial analysis, propose three hypotheses about sentiment in drug reviews that we could test in our deeper analysis.

Remember, this automated report is a starting point. Use these insights to guide your manual EDA and feature engineering in the next steps of our project.

#### Interesting findings from the report
1. `usefulCount`, the number of users who found a review useful, is 0 for nearly 3.9% of rows. Its distribution is heavily concentrated near 0, with a long tail of larger values.
2. There is a clear correlation between `usefulCount` and `rating`. This is plausible because readers check ratings and reviews and upvote the ones they find useful. Any user who leaves a review also leaves a rating, and the review is the rationale behind that rating.
3. The `condition` column contains null values.
4. The duplicate rows contain the same review text and other values for different drugs. These do not look like authentic reviews, and they may be paid reviews.
5. Most of the data comes from recent dates, between 2015 and 2017, and the dataset covers 3671 distinct drugs.


#### Potential data quality issues
1. The duplicate rows raise a question about the authenticity of the review text. It's good that there are not too many duplicates in the dataset.

#### Hypotheses about sentiment in drug reviews to test in deeper analysis
1. Temporal changes in sentiment: Sentiment about particular drugs may change over time, perhaps as more long-term effects become apparent.
2. Demographic differences in sentiment: If demographic data were available, an analysis could test for differences in sentiment expression across age groups, genders, or other relevant factors.
3. Sentiment varies based on specific drug effects: An analysis could test whether certain types of drug effects (e.g. physical side effects vs. mood changes) correlate more strongly with positive or negative sentiment.
4. High-rated reviews have more positive sentiment: Multiple studies found that high-rated drug reviews had higher average sentiment scores compared to low-rated reviews. This hypothesis could be further validated across different datasets and drug types.
5. Adverse drug reactions correlate with negative sentiment: There is an intuition that patients express negative sentiment when posting about adverse drug reactions. A more rigorous analysis could test the strength of this correlation.


# Designing the ML System

Before diving into model development, it's crucial to plan and scope out our machine learning project. This ensures that we're building a system that meets stakeholders' needs and aligns with business objectives. We'll use the Machine Learning Canvas to structure our planning process.

## Project Expectations from Stakeholders

Imagine the following stakeholder requests for our sentiment analysis system at Lamada:

- Chief Customer Officer (CCO): "We need a system that can automatically categorize the sentiment of drug reviews in real-time. This will help us respond quickly to negative feedback and highlight positive experiences."
- Head of Product: "The system should be able to process reviews as they come in, with a latency of no more than 200ms per review. We want to use this data to inform our product recommendations and marketing strategies."
- CTO: "We need to ensure the system can handle our current load of about 1000 reviews per hour, with the ability to scale up to 5000 reviews per hour during peak times."
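
To see what these requests imply together, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes reviews arrive evenly over the hour and that a single worker handles one review at a time, neither of which the stakeholders stated.

```python
# Figures quoted by the stakeholders above.
NORMAL_REVIEWS_PER_HOUR = 1000
PEAK_REVIEWS_PER_HOUR = 5000
MAX_LATENCY_SECONDS = 0.200

SECONDS_PER_HOUR = 3600

# Average arrival rates in reviews per second.
normal_arrival_rate = NORMAL_REVIEWS_PER_HOUR / SECONDS_PER_HOUR
peak_arrival_rate = PEAK_REVIEWS_PER_HOUR / SECONDS_PER_HOUR

# Worst case for one worker: every prediction uses the full latency budget
# and requests are processed strictly one after another.
per_worker_rate = 1 / MAX_LATENCY_SECONDS
per_worker_hourly_capacity = per_worker_rate * SECONDS_PER_HOUR

# Share of a single worker's time that the peak load would occupy.
peak_utilisation = peak_arrival_rate / per_worker_rate

print(f"Normal arrival rate: {normal_arrival_rate:.2f} reviews/s")
print(f"Peak arrival rate:   {peak_arrival_rate:.2f} reviews/s")
print(f"One worker at 200 ms: {per_worker_hourly_capacity:.0f} reviews/hour")
print(f"Peak utilisation of one worker: {peak_utilisation:.0%}")
```

Under these assumptions the 200ms latency budget is the tighter constraint, while the hourly volume is modest. Real traffic is bursty, so queueing during spikes still needs to be checked separately.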

## Machine Learning Canvas Overview

Let's break down each component of the Machine Learning Canvas and what you should consider:

1. **Background**:
   - Consider Lamada's current manual review process and its limitations.
   - Think about the volume of reviews and the impact of slow response times on customer satisfaction.

2. **Value Proposition**:
   - How will automated sentiment analysis improve Lamada's operations and customer experience?
   - What specific pain points will this solution address?

3. **Objectives**:
   - Break down the high-level goal of "sentiment analysis" into specific, measurable objectives.
   - Consider accuracy targets, processing speed, and scalability requirements.

4. **Solution**:
   - Outline the key features of your sentiment analysis system.
   - Consider how it will integrate with Lamada's existing e-commerce platform.
   - Define what's in scope (e.g., English language reviews) and out of scope (e.g., image-based reviews).

5. **Feasibility**:
   - Assess if you have the necessary data, computational resources, and expertise to build this system.
   - Consider any potential technical or ethical challenges.

6. **Data**:
   - Describe the Drugs.com dataset and how it will be used for training.
   - Consider how you'll handle real-time incoming review data in production.
   - Think about data privacy and security considerations.

7. **Metrics**:
   - Define key performance indicators (KPIs) for your model, such as accuracy, F1 score, and latency.
   - Consider business metrics like customer satisfaction scores or response time improvements.

8. **Evaluation**:
   - Design your offline evaluation strategy using the Drugs.com dataset.
   - Plan how you'll conduct online evaluation once the system is deployed.

9. **Modeling**:
   - Outline your approach to building the sentiment analysis model.
   - Consider starting with a baseline model and iterating to more complex approaches.

10. **Inference**:
    - Based on the stakeholder requirements, you'll need to design for real-time inference.
    - Consider how to optimize your model for low-latency predictions.

11. **Feedback**:
    - Plan how you'll collect feedback on the model's performance in production.
    - Consider implementing a human-in-the-loop system for reviewing uncertain predictions.

12. **Project**:
    - Outline the team members needed (e.g., data scientists, ML engineers, DevOps).
    - Create a timeline for development, testing, and deployment phases.


![ML Canvas](https://drive.google.com/uc?id=1j1dXJ3PLdpbvbAMIeyUNAL-vJjiV8P-w)

## [OPTIONAL] Fill out the ML Canvas
Go through the notebook and finish the minimum requirements before filling out the Machine Learning Canvas for our Lamada sentiment analysis project. Be sure to consider the stakeholder requirements and the insights gained from our EDA phase. This canvas will serve as a roadmap for the rest of our project, ensuring that we're building a system that meets both technical and business needs. Make sure to include how we can create ground truth labels for this use case (for example, we could use the `rating` column and map a score of `1` to `Negative` and a score of `10` to `Positive`).

**Disclaimer**:
For the purposes of this course project, you are not required to fill out all sections of the Machine Learning Canvas in full detail. Specifically, the "Project" section, including team member requirements and timelines, is optional. However, we encourage you to think about these aspects as they are crucial in real-world ML projects. Considering the full spectrum of project planning will give you a more comprehensive understanding of what goes into deploying a machine learning system in a production environment.

Focus primarily on the technical aspects such as the data, modeling, metrics, and evaluation strategies. These elements will directly inform the implementation phase of our project. The business and operational considerations (like team composition and timelines) are included to give you a holistic view of ML project planning, which will be valuable in your future career.

```
YOUR ANSWER GOES HERE
```

# Data Preparation

While our Drugs.com dataset is conveniently packaged for this project, it's important to understand that in real-world scenarios like Lamada, data is rarely so neatly organized. Let's explore how data is typically handled in production environments.

## Data in the Wild: The Reality of Enterprise Data

In most enterprises, including e-commerce platforms like Lamada, data is:

1. Distributed: Stored across multiple databases, data lakes, and other storage systems.
2. Heterogeneous: Comes in various formats (structured, semi-structured, and unstructured).
3. Dynamic: Constantly updated and growing.
4. Raw: Often requires significant processing before it's usable for analysis or modeling.

## The Role of Data Engineering

Data Engineers play a crucial role in making raw data usable for Data Scientists. Their responsibilities include:

1. Data Integration: Combining data from various sources.
2. Data Transformation: Converting data into a suitable format for analysis.
3. Data Quality Assurance: Ensuring data accuracy, completeness, and consistency.
4. Data Pipeline Management: Creating and maintaining data flows.

## ETL vs ELT Processes


![ELT Pipeline](https://drive.google.com/uc?id=11AGHxqmvfNgYGCL7kZmInyQhxml0X-6r)

Two common approaches to data preparation are:

1. Extract, Transform, Load (ETL):
   - Data is extracted from source systems.
   - Transformed to fit operational needs.
   - Loaded into the target system (e.g., Data Warehouse).

2. Extract, Load, Transform (ELT):
   - Data is extracted from source systems.
   - Loaded into the target system in its original format.
   - Transformed within the target system as needed.

ELT is becoming increasingly popular due to the decreasing cost of storage and the increasing power of modern data warehouses to handle transformations.
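
The ELT ordering can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the in-memory SQLite database stands in for a data warehouse, and the table names, column names, and rows are made up.

```python
import sqlite3

# An in-memory SQLite database stands in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")

# Extract: rows as they might arrive from a source system (made-up values).
raw_rows = [
    ("Drug A", " Pain ", "Worked well", "9"),
    ("Drug B", None, "Bad headaches", "1"),
    ("Drug C", "ADHD", "Helped a lot", "8"),
]

# Load: store the data untouched, with every column kept as text.
conn.execute(
    "CREATE TABLE raw_reviews (drug_name TEXT, condition TEXT, review TEXT, rating TEXT)"
)
conn.executemany("INSERT INTO raw_reviews VALUES (?, ?, ?, ?)", raw_rows)

# Transform: clean and type the data inside the target system using SQL.
conn.execute(
    """
    CREATE TABLE clean_reviews AS
    SELECT
        drug_name,
        COALESCE(TRIM(condition), 'Unknown') AS condition,
        review,
        CAST(rating AS INTEGER) AS rating
    FROM raw_reviews
    """
)

clean_rows = conn.execute(
    "SELECT drug_name, condition, rating FROM clean_reviews ORDER BY drug_name"
).fetchall()
print(clean_rows)
conn.close()
```

In an ETL pipeline, the trimming, default filling, and type casting would instead happen in application code before the insert, and only the cleaned table would ever reach the warehouse.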

## Data Warehouses and Their Role


![Data Systems](https://drive.google.com/uc?id=1Rd1WLEm30gWlJzaJiQXEzlFQ6mJNWfMl)


Data Warehouses like Snowflake, AWS Redshift, or Google BigQuery serve as centralized repositories for integrated data from various sources. They offer:

1. Scalability: Can handle large volumes of data.
2. Performance: Optimized for complex queries and analytics.
3. Integration: Can combine data from multiple sources.
4. Historical Data: Maintain historical records for trend analysis.

## From Raw Data to ML-Ready Datasets

For our Lamada sentiment analysis project, the process might look like this:

1. Data Collection:
   - Customer reviews are collected from web forms, mobile apps, and customer service interactions.
   - Product information is stored in product databases.
   - User data is kept in customer relationship management (CRM) systems.

2. Data Integration:
   - Data Engineers create pipelines to extract this data from various sources.
   - The data is loaded into a staging area in the Data Warehouse.

3. Data Transformation:
   - Engineers apply transformations to clean the data (e.g., handling missing values, standardizing formats).
   - They join different tables to create a unified view of reviews with associated metadata.

4. Feature Engineering:
   - Data Scientists work with the integrated data to create relevant features.
   - This might include text preprocessing, sentiment score calculation, or deriving new features from existing data.

5. Data Labelling:
   - If manual labelling is required (e.g., for a subset of reviews to train or validate the model), a labelling workflow is set up.
   - This could involve a team of human annotators or a crowdsourcing platform.

6. Dataset Creation:
   - The final ML-ready dataset is created, combining the engineered features and labels.
   - This dataset is versioned and stored, often in a format optimized for ML workflows (e.g., Parquet files in a data lake).
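
One lightweight way to version a dataset, shown here purely as a sketch, is to derive an identifier from its content. The `dataset_fingerprint` helper is hypothetical; production setups usually rely on a dedicated data versioning tool instead.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Hash rows in a canonical JSON form so identical data yields an identical id."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Made-up rows; the key order inside each row does not affect the fingerprint.
version_a = dataset_fingerprint([{"review": "Worked well", "rating": 9}])
version_b = dataset_fingerprint([{"rating": 9, "review": "Worked well"}])
version_c = dataset_fingerprint([{"review": "Worked well", "rating": 8}])

print(version_a, version_b, version_c)
```

Storing such an identifier next to each trained model makes it possible to tell later which exact data the model was trained on.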

## Data Quality Checks

### Importance of Data Quality Checks

Data quality is crucial for any machine learning project. Poor data quality can lead to unreliable models, incorrect insights, and wasted time and resources. Implementing data quality checks on raw data is essential because:

1. **Garbage In, Garbage Out**: The quality of your model's output is directly dependent on the quality of input data.

2. **Early Error Detection**: Catching data issues early in the pipeline saves time and prevents downstream problems.

3. **Consistency**: Ensures that data meets predefined standards and is consistent across different batches or sources.

4. **Trust**: Builds confidence in the data and subsequent analysis among stakeholders.

5. **Compliance**: Helps meet regulatory requirements in industries where data quality is mandated.

6. **Efficiency**: Automates the process of data validation, reducing manual checks and human error.

7. **Documentation**: Creates a clear record of data expectations and quality over time.
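
To make "early error detection" concrete, here is a minimal, library-free sketch of the kind of checks a validation step might run before any training starts. The `find_quality_issues` helper and the toy frames are assumptions for illustration; the column names are those of the Drugs.com dataset.

```python
import pandas as pd

EXPECTED_COLUMNS = ["drugName", "condition", "review", "rating", "date", "usefulCount"]


def find_quality_issues(df: pd.DataFrame) -> list[str]:
    """Return a description of every failed check (hypothetical helper)."""
    issues = []
    if list(df.columns) != EXPECTED_COLUMNS:
        # The remaining checks rely on these columns, so stop here.
        return ["unexpected columns"]
    if not df["rating"].between(1, 10).all():
        issues.append("rating outside 1-10")
    if df["review"].isna().any():
        issues.append("null review text")
    if (df["usefulCount"] < 0).any():
        issues.append("negative usefulCount")
    return issues


# Made-up rows: one clean frame and one with three deliberate problems.
good = pd.DataFrame(
    [["Drug A", "Pain", "Worked well", 9, "20-May-12", 27]],
    columns=EXPECTED_COLUMNS,
)
bad = pd.DataFrame(
    [
        ["Drug B", "ADHD", None, 11, "27-Apr-10", -1],
        ["Drug C", "Pain", "No effect", 3, "3-Nov-15", 4],
    ],
    columns=EXPECTED_COLUMNS,
)

print(find_quality_issues(good))
print(find_quality_issues(bad))
```

A pipeline could refuse to continue whenever the returned list is non-empty, which keeps bad batches from reaching feature engineering or training.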


### [TODO] Implementing Data Quality Checks with Great Expectations

Great Expectations is a powerful tool for data validation and documentation. It allows you to express what you "expect" from your data and then validates those expectations.

[**TODO**]: Implement the following data quality checks using Great Expectations for our Drugs.com review dataset



In [2]:
!great_expectations init

  ___              _     ___                  _        _   _
 / __|_ _ ___ __ _| |_  | __|_ ___ __  ___ __| |_ __ _| |_(_)___ _ _  ___
| (_ | '_/ -_) _` |  _| | _|\ \ / '_ \/ -_) _|  _/ _` |  _| / _ \ ' \(_-<
 \___|_| \___\__,_|\__| |___/_\_\ .__/\___\__|\__\__,_|\__|_\___/_||_/__/
                                |_|
             ~ Always know what to expect from your data ~
Let's create a new Data Context to hold your project configuration.

Great Expectations will create a new directory with the following structure:

    great_expectations
    |-- great_expectations.yml
    |-- expectations
    |-- checkpoints
    |-- plugins
    |-- .gitignore
    |-- uncommitted
        |-- config_variables.yml
        |-- data_docs
        |-- validations

OK to proceed? [Y/n]: Y


Congratulations! You are now ready to customize your Great Expectations configuration.

You can customize your configuration in many ways. Here are some examples:

  Use the CLI to:

In [9]:
import great_expectations as gx

context = gx.get_context()

datasource = context.sources.add_pandas(name="my_pandas_datasource_1")
data_asset = datasource.add_dataframe_asset(name="drug_reviews", dataframe=df)

# Create an Expectation Suite
expectation_suite_name = "drug_reviews_suite"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)

# Create a validator
validator = context.get_validator(
    datasource_name="my_pandas_datasource_1",
    data_asset_name="drug_reviews",
    expectation_suite_name=expectation_suite_name
)

In [10]:
df.columns

Index(['drugName', 'condition', 'review', 'rating', 'date', 'usefulCount'], dtype='object')

In [13]:
# Refer to the GX Expectations Gallery to solve this section:
# https://greatexpectations.io/expectations

# TODO 1: Check for the presence and order of all expected columns
# Hint: This expectation ensures that the table has exactly these columns in this order
expected_column = ['drugName', 'condition', 'review', 'rating', 'date', 'usefulCount']
validator.expect_table_columns_to_match_ordered_list(column_list=expected_column)

# TODO 2: Ensure 'rating' is between 1 and 10
# Hint: This expectation checks if all values in the 'rating' column are within the specified range
validator.expect_column_values_to_be_between(column='rating', min_value=1, max_value=10)


# TODO 3: Check that 'review' column doesn't contain null values
# Hint: This expectation verifies that there are no null values in the 'review' column
validator.expect_column_values_to_not_be_null(column='review')

# TODO 4: Verify that 'usefulCount' is non-negative
# Hint: This expectation ensures all values in 'usefulCount' are greater than or equal to zero
validator.expect_column_values_to_be_between(column='usefulCount', min_value=0)

validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint_name = "my_checkpoint"
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validator=validator,
)

checkpoint_result = checkpoint.run()

print(checkpoint_result)



{
  "run_id": {
    "run_name": null,
    "run_time": "2024-10-13T00:44:56.631746+00:00"
  },
  "run_results": {
    "ValidationResultIdentifier::drug_reviews_suite/__none__/20241013T004456.631746Z/my_pandas_datasource_1-drug_reviews": {
      "validation_result": {
        "success": true,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_table_columns_to_match_ordered_list",
              "kwargs": {
                "column_list": [
                  "drugName",
                  "condition",
                  "review",
                  "rating",
                  "date",
                  "usefulCount"
                ],
                "batch_id": "my_pandas_datasource_1-drug_reviews"
              },
              "meta": {}
            },
            "result": {
              "observed_value": [
                "drugName",
                "condition",
                "review",
             

### [OPTIONAL] Create more checks for the data

In this section, you can add more checks based on the ones listed below, or any others that you feel would suit our use case.

In [14]:
# TODO 1: Check for the presence and order of all expected columns
# Hint: Use expect_table_columns_to_match_ordered_list() method
# Expected columns: "drugName", "condition", "review", "rating", "date", "usefulCount"
expected_column = ['drugName', 'condition', 'review', 'rating', 'date', 'usefulCount']
validator.expect_table_columns_to_match_ordered_list(column_list=expected_column)

# TODO 2: Ensure 'rating' is between 1 and 10
# Hint: Use expect_column_values_to_be_between() method
validator.expect_column_values_to_be_between(column='rating', min_value=1, max_value=10)

# TODO 3: Check that 'review' column doesn't contain null values
# Hint: Use expect_column_values_to_not_be_null() method
validator.expect_column_values_to_not_be_null(column='review')

# TODO 4: Verify that 'usefulCount' is non-negative
# Hint: Use expect_column_values_to_be_between() method with only a min_value
validator.expect_column_values_to_be_between(column='usefulCount', min_value=0)

# TODO 5: Ensure 'date' follows the expected format (DD-Mon-YY)
# Hint: Use expect_column_values_to_match_strftime_format() method
# The strftime format for DD-Mon-YY is "%d-%b-%y"
validator.expect_column_values_to_match_strftime_format(column='date', strftime_format="%d-%b-%y")

checkpoint_name = "my_checkpoint"
checkpoint = context.add_or_update_checkpoint(
    name=checkpoint_name,
    validator=validator,
)

checkpoint_result = checkpoint.run()

print(checkpoint_result)



{
  "run_id": {
    "run_name": null,
    "run_time": "2024-10-13T00:46:59.774275+00:00"
  },
  "run_results": {
    "ValidationResultIdentifier::drug_reviews_suite/__none__/20241013T004659.774275Z/my_pandas_datasource_1-drug_reviews": {
      "validation_result": {
        "success": true,
        "results": [
          {
            "success": true,
            "expectation_config": {
              "expectation_type": "expect_table_columns_to_match_ordered_list",
              "kwargs": {
                "column_list": [
                  "drugName",
                  "condition",
                  "review",
                  "rating",
                  "date",
                  "usefulCount"
                ],
                "batch_id": "my_pandas_datasource_1-drug_reviews"
              },
              "meta": {}
            },
            "result": {
              "observed_value": [
                "drugName",
                "condition",
                "review",
             

## Data Labeling for Sentiment Analysis

Looking at our sample dataset, we can see that we have reviews with associated ratings. For our sentiment analysis task, we need to convert these ratings into sentiment labels. Let's explore different labeling techniques:

### 1. Natural Labels

In our case, we're fortunate to have natural labels in the form of ratings.

> Tasks with natural labels are tasks where the model's predictions can be automatically evaluated or partially evaluated by the system.

These ratings can be used to infer sentiment without additional manual labeling.

### 2. Threshold-Based Labeling (Programmatic Labeling)

![Programmatic Labeling](https://drive.google.com/uc?id=1Esp2XapZggGpyUyST5X_clzlrIQnTRfo)

A simple approach to start with:
- `Ratings 1-4`: Negative sentiment
- `Ratings 5-6`: Neutral sentiment
- `Ratings 7-10`: Positive sentiment

This method is quick and easy but may oversimplify the nuances in the reviews. You can use a tool like [Snorkel](https://github.com/snorkel-team/snorkel) for this task.

### 3. Hand Labeling

![Hand Labeling](https://drive.google.com/uc?id=14tb6DFFpe3ApCYlI8n0vnBv5MM7Z6Xtf)

For more nuanced labeling, we could manually review a subset of the data:
- Read each review
- Assign a sentiment label (Negative, Neutral, Positive)
- Consider the rating as a guide, but allow for discrepancies

While this method can be more accurate, it is **time-consuming**, prone to annotator bias, and may not be feasible for large datasets.

### 4. LLM-Assisted Labeling
![Data Labelling using LLMs](https://drive.google.com/uc?id=1jRIMZFUUDr3_03I6btVsOlv6y9thlcZA)

Large Language Models (LLMs) have revolutionized the labeling process:
- Use an LLM to analyze the review text and suggest a sentiment label
- Optionally, have a human review the LLM's suggestions for quality control

This method can significantly speed up the labeling process, and human spot checks of the LLM's suggestions help keep label quality high. More information about it can be found [here](https://www.refuel.ai/blog-posts/llm-labeling-technical-report).
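
The two steps above can be sketched as follows. No real LLM is called here: `keyword_stub` is a made-up stand-in so the example runs offline, and the `(label, confidence)` interface and the 0.7 threshold are assumptions rather than any particular provider's API.

```python
from typing import Callable

# Hypothetical interface: any function that maps review text to a
# (label, confidence) pair. A real LLM client would be wrapped to fit it.
Labeler = Callable[[str], tuple[str, float]]


def keyword_stub(review: str) -> tuple[str, float]:
    """Stand-in for an LLM call so the sketch runs offline (not a real model)."""
    text = review.lower()
    if "great" in text or "love" in text:
        return "Positive", 0.9
    if "terrible" in text or "worse" in text:
        return "Negative", 0.9
    return "Neutral", 0.4


def label_with_review_queue(reviews, labeler: Labeler, min_confidence: float = 0.7):
    """Accept confident suggestions and queue the rest for a human reviewer."""
    accepted, needs_human = [], []
    for review in reviews:
        label, confidence = labeler(review)
        target = accepted if confidence >= min_confidence else needs_human
        target.append((review, label))
    return accepted, needs_human


reviews = [
    "I love this medication",
    "Made my symptoms worse",
    "Took it for two weeks",
]
accepted, needs_human = label_with_review_queue(reviews, keyword_stub)
print("accepted:", accepted)
print("needs human review:", needs_human)
```

Keeping the labeler behind a plain function signature means the stub can later be swapped for a real model call without changing the routing logic.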

### 5. Active Learning

![Active Learning](https://drive.google.com/uc?id=1xEUrPHaKHSA4G9PYGA35013QbdHRwvfl)



If resources are limited:
1. Label a small initial dataset
2. Train a model on this dataset
3. Use the model to predict labels for unlabeled data
4. Select the most uncertain predictions for human review
5. Add these newly labeled examples to the training set
6. Repeat steps 2-5

This iterative process can efficiently improve your model with minimal labeling effort.
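
Step 4 is the core of the loop. A minimal sketch of one common selection rule, entropy-based uncertainty sampling, is shown below. The predicted probabilities are made up, and other rules (for example margin sampling) would work just as well.

```python
import math


def entropy(probabilities):
    """Shannon entropy of one predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def most_uncertain(predictions, k):
    """Return the ids of the k items whose predictions have the highest entropy."""
    ranked = sorted(predictions, key=lambda item: entropy(item[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Made-up model outputs: (review id, [P(negative), P(neutral), P(positive)]).
predictions = [
    ("r1", [0.98, 0.01, 0.01]),
    ("r2", [0.34, 0.33, 0.33]),
    ("r3", [0.10, 0.45, 0.45]),
    ("r4", [0.05, 0.05, 0.90]),
]

to_review = most_uncertain(predictions, k=2)
print(to_review)
```

The selected reviews are the ones sent to human annotators in step 4, and their labels are added to the training set in step 5.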

If you would like to dive a bit deeper into Active Learning then check out this [video](https://youtu.be/7kX6rhUGtzA?si=sLhi6gRZFaPeMA4X) about the topic.

For our project, let's start with the threshold-based approach for quick results. As we progress, we can explore LLM-assisted labeling to refine our dataset and potentially improve model performance.

Remember, the quality of your labels directly impacts your model's performance. Always validate a sample of your labels, regardless of the method used.
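
One simple way to validate a sample, sketched below, is to draw a few random rows per label so that rare classes are checked too. The `sample_for_audit` helper, the `sentiment_label` key, and the rows are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_for_audit(labeled_rows, per_label, seed=0):
    """Pick up to `per_label` random rows for each label so every class gets checked."""
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    by_label = defaultdict(list)
    for row in labeled_rows:
        by_label[row["sentiment_label"]].append(row)
    audit = []
    for label in sorted(by_label):
        rows = by_label[label]
        audit.extend(rng.sample(rows, min(per_label, len(rows))))
    return audit


# Made-up labeled rows: three negative, one neutral, two positive.
labeled_rows = [
    {"review": "No effect", "sentiment_label": 0},
    {"review": "Awful side effects", "sentiment_label": 0},
    {"review": "Did not help", "sentiment_label": 0},
    {"review": "It was okay", "sentiment_label": 1},
    {"review": "Works well", "sentiment_label": 2},
    {"review": "Life changing", "sentiment_label": 2},
]

audit_sample = sample_for_audit(labeled_rows, per_label=2)
for row in audit_sample:
    print(row["sentiment_label"], row["review"])
```

A reviewer then reads each sampled review and notes whether the assigned label matches the text; a high disagreement rate suggests the labeling rule needs revisiting.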

In [15]:
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis
import pytest
import ipytest

ipytest.autoconfig()

# TODO: Create the labeling function based on the threshold we created and
# leveraging the LABEL_MAPPING below:
# Label 0 => Ratings 1-4: Negative sentiment
# Label 1 => Ratings 5-6: Neutral sentiment
# Label 2 => Ratings 7-10: Positive sentiment

LABEL_MAPPING = {
    "NEGATIVE": 0,
    "NEUTRAL": 1,
    "POSITIVE": 2,
}

@labeling_function()
def label_sentiment(x):
    # Map the 1-10 rating onto the three sentiment classes.
    if 1 <= x['rating'] <= 4:
        return LABEL_MAPPING['NEGATIVE']
    elif 5 <= x['rating'] <= 6:
        return LABEL_MAPPING['NEUTRAL']
    elif 7 <= x['rating'] <= 10:
        return LABEL_MAPPING['POSITIVE']
    # Snorkel's convention is to return -1 (abstain) when no rule applies,
    # rather than implicitly returning None for out-of-range ratings.
    return -1

In [16]:
def label_data(df):
    lf_applier = PandasLFApplier([label_sentiment])
    labels = lf_applier.apply(df)
    df['sentiment_label'] = labels
    return df

In [17]:
%%ipytest -vv -s

@pytest.mark.parametrize("rating, expected_label", [
    (2, 0),
    (5, 1),
    (7, 2),
    (10, 2),
    (1, 0),
    (3, 0),
    (4, 0),
    (6, 1),
    (8, 2),
])
def test_sentiment_labeling(rating, expected_label):
    # Create a sample dataframe with a single row
    data = {
        'drugName': ['Drug A'],
        'condition': ['Condition A'],
        'review': ['Review A'],
        'rating': [rating],
        'date': ['2022-01-01'],
        'usefulCount': [10]
    }
    df = pd.DataFrame(data)
    labeled_df = label_data(df)
    assert labeled_df['sentiment_label'].iloc[0] == expected_label, f"Labeling doesn't match expected output for rating {rating}"
    assert all(labeled_df[col].iloc[0] == df[col].iloc[0] for col in df.columns), "Original data was modified"
    assert 'sentiment_label' in labeled_df.columns, "sentiment_label column not added"

platform linux -- Python 3.10.12, pytest-7.4.4, pluggy-1.5.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content
plugins: anyio-3.7.1, typeguard-4.3.0
collecting ... collected 9 items

t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[2-0] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[5-1] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[7-2] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[10-2] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[1-0] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[3-0] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[4-0] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[6-1] PASSED
t_252b8d801b494b6395e5d0a4e82b7255.py::test_sentiment_labeling[8-2] PASSED

## [OPTIONAL] LLM-Assisted Labeling

In this section, you'll explore how to use a Large Language Model (LLM) to assist in labeling our dataset.
This can be especially useful for more nuanced sentiment analysis or when dealing with large datasets.

Steps to complete:
1. Choose an LLM API (e.g., OpenAI's GPT-3, Hugging Face's API, or any other accessible LLM)
2. Set up the necessary API credentials
3. Create a function to send reviews to the LLM and interpret its responses
4. Apply the LLM labeling to a subset of our data
5. Compare LLM-generated labels with our rule-based labels

**NOTE**: Be mindful of API usage costs and rate limits when using LLM services.

You could also try out [autolabel](https://github.com/refuel-ai/autolabel) which does all of this right out of the box for you!

**Discussion Questions**:
1. How does the agreement rate between rule-based and LLM-generated labels compare?
2. In cases of disagreement, which labeling method seems more accurate? Why?
3. What are the pros and cons of using an LLM for labeling compared to our rule-based approach?
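
For step 5 and the first discussion question, the comparison could be sketched as follows. This is only an illustration: the `llm_labels` list is made-up data standing in for whatever your chosen LLM API returns, and the helper names `agreement_rate` and `disagreement_counts` are hypothetical.

```python
from collections import Counter

LABEL_NAMES = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}

def agreement_rate(rule_labels, llm_labels):
    """Fraction of examples on which both labeling methods agree."""
    if len(rule_labels) != len(llm_labels):
        raise ValueError("Both label lists must have the same length")
    matches = sum(r == l for r, l in zip(rule_labels, llm_labels))
    return matches / len(rule_labels)

def disagreement_counts(rule_labels, llm_labels):
    """Count each (rule-based, LLM) label pair where the two methods disagree."""
    return Counter(
        (LABEL_NAMES[r], LABEL_NAMES[l])
        for r, l in zip(rule_labels, llm_labels)
        if r != l
    )

# Made-up labels for five reviews (rule-based vs. LLM-generated)
rule_labels = [2, 2, 1, 0, 2]
llm_labels = [2, 1, 1, 0, 0]

print(f"Agreement rate: {agreement_rate(rule_labels, llm_labels):.2f}")
print(disagreement_counts(rule_labels, llm_labels))
```

Reading a handful of reviews from the most frequent disagreement pairs is usually a quick way to judge which method is closer to the truth.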

In [None]:
# OPTIONAL TASK GOES HERE

## Data Quality Checks on Features


In [18]:
from sklearn.model_selection import train_test_split


labels = label_data(df)
sentiment_df = labels[['review', 'sentiment_label']]
sentiment_df = sentiment_df.rename(columns={'review': 'text', 'sentiment_label': 'label'})
train_df, test_df = train_test_split(sentiment_df, test_size=0.2, random_state=42)

print("Sample of sentiment_df:")
print(sentiment_df.head())

print(f"\nTrain set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

100%|██████████| 215063/215063 [00:05<00:00, 38676.10it/s]


Sample of sentiment_df:
                                                text  label
0  "It has no side effect, I take it in combinati...      2
1  "My son is halfway through his fourth week of ...      2
2  "I used to take another oral contraceptive, wh...      1
3  "This is my first time using any form of birth...      2
4  "Suboxone has completely turned my life around...      2

Train set shape: (172050, 2)
Test set shape: (43013, 2)


In [31]:
import great_expectations as gx
from great_expectations.dataset import PandasDataset
import pandas as pd

def perform_data_quality_checks(train_df, test_df):
    train_ge = gx.dataset.PandasDataset(train_df)
    test_ge = gx.dataset.PandasDataset(test_df)

    print("Performing data quality checks on processed data...")

    # TODO: Check that the labels are of values 0, 1, 2
    # Hint: Use the expect_column_values_to_be_in_set() method
    train_validation_1 = train_ge.expect_column_values_to_be_in_set(column='label', value_set=[0, 1, 2])
    test_validation_1 = test_ge.expect_column_values_to_be_in_set(column='label', value_set=[0, 1, 2])
    overall_success = train_validation_1["success"] and test_validation_1["success"]

    print("\n1. Checking label values:")
    # Print results here
    if not train_validation_1["success"]:
        print(f"Train set validation result: {train_validation_1}")
    if not test_validation_1["success"]:
        print(f"Test set validation result: {test_validation_1}")
    print(f"Overall validation result: {overall_success}")


    # TODO: Verify that we only have columns text and label
    # Hint: Use the expect_table_columns_to_match_ordered_list() method
    train_validation_2 = train_ge.expect_table_columns_to_match_ordered_list(column_list=['text', 'label'])
    test_validation_2 = test_ge.expect_table_columns_to_match_ordered_list(column_list=['text', 'label'])
    overall_success = train_validation_2["success"] and test_validation_2["success"]
    print("\n2. Checking columns:")
    # Print results here
    if not train_validation_2["success"]:
        print(f"Train set validation result: {train_validation_2}")
    if not test_validation_2["success"]:
        print(f"Test set validation result: {test_validation_2}")
    print(f"Overall validation result: {overall_success}")


    # TODO: Check for data leakage between train and test data on text
    # Hint: Compare the 'text' columns of train and test dataframes
    # Look for any duplicate texts between the two datasets
    common_texts = set(train_df['text']).intersection(set(test_df['text']))
    num_common_texts = len(common_texts)

    # The check passes only when no review text appears in both splits.
    overall_success = num_common_texts == 0
    print("\n3. Checking for data leakage:")
    # Print results here
    print(f"Number of common texts between train and test data: {num_common_texts}")
    print(f"Overall validation result: {overall_success}")

    # TODO: Check for duplicates in each dataset
    # Hint: Use the expect_column_values_to_be_unique() method on the 'text' column
    train_validation_4 = train_ge.expect_column_values_to_be_unique(column='text')
    test_validation_4 = test_ge.expect_column_values_to_be_unique(column='text')
    overall_success = train_validation_4["success"] and test_validation_4["success"]
    print("\n4. Checking for duplicate reviews within each dataset:")
    # Print results here
    if not train_validation_4["success"]:
        print(f"Train set validation result: {train_validation_4}")
    if not test_validation_4["success"]:
        print(f"Test set validation result: {test_validation_4}")
    print(f"Overall validation result: {overall_success}")

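    # Combine the results of all previous checks so that a failure in any
    # one of them fails the overall data quality check.
    overall_success = all([
        train_validation_1["success"], test_validation_1["success"],
        train_validation_2["success"], test_validation_2["success"],
        num_common_texts == 0,
        train_validation_4["success"], test_validation_4["success"],
    ])
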
    print("\nOverall data quality check result:", "Passed" if overall_success else "Failed")
    return overall_success

quality_check_passed = perform_data_quality_checks(train_df, test_df)

if quality_check_passed:
    print("All data quality checks passed. Proceeding with model training...")
else:
    print("Data quality checks failed. Please address the issues before proceeding.")

Performing data quality checks on processed data...

1. Checking label values:
Overall validation result: True

2. Checking columns:
Overall validation result: True

3. Checking for data leakage:
Number of common texts between train and test data: 27545
Overall validation result: False

4. Checking for duplicate reviews within each dataset:
Train set validation result: {
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "text",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 172050,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 110601,
    "unexpected_percent": 64.28421970357454,
    "unexpected_percent_total": 64.28421970357454,
    "unexpected_percent_nonmissing": 64.28421970357454,
    "partial_unexpected_list": [
      "\"I have chronic panic disorder with agoraphobia and due to the fact I have not responded to any antide

## Addressing Data Issues

Our data leakage check has revealed a significant number of duplicate reviews between the train and test sets. This is a critical issue that needs to be resolved before we can proceed with model training. There are a couple of things that we can do of varying complexity:

### TODO: Drop Duplicates (Simple Fix)
A simple fix is to drop the duplicate rows that exist in the original dataset and then split the deduplicated data into train and test sets again.

### [OPTIONAL] Investigate and Handle Duplicates
Perform a deeper EDA into the dataset looking into why this happens in the first place and from there decide whether we should be



In [32]:
import pandas as pd
from sklearn.model_selection import train_test_split

print(f"Original dataset shape: {df.shape}")
labels = label_data(df)
sentiment_df = labels[['review', 'sentiment_label']]
sentiment_df = sentiment_df.rename(columns={'review': 'text', 'sentiment_label': 'label'})
# TODO: Drop duplicates here
df_deduplicated = sentiment_df.drop_duplicates(subset='text')
print(f"Shape after removing duplicates: {df_deduplicated.shape}")

train_df, test_df = train_test_split(df_deduplicated, test_size=0.2, random_state=42)

Original dataset shape: (215063, 7)

Shape after removing duplicates: (128478, 2)


In [33]:
quality_check_passed = perform_data_quality_checks(train_df, test_df)

if quality_check_passed:
    print("All data quality checks passed. Proceeding with model training...")
else:
    print("Data quality checks failed. Please review the processing steps.")

Performing data quality checks on processed data...

1. Checking label values:
Overall validation result: True

2. Checking columns:
Overall validation result: True

3. Checking for data leakage:
Number of common texts between train and test data: 0
Overall validation result: True

4. Checking for duplicate reviews within each dataset:
Overall validation result: True

Overall data quality check result: Passed
All data quality checks passed. Proceeding with model training...


## Model Development

Now that we have prepared and validated our data, we can focus our efforts on modeling. Before diving into complex algorithms, it's crucial to establish a baseline model. This step, often overlooked, is fundamental in the machine learning development process.

### The Importance of Baseline Models

1. **Benchmark for Comparison**:
   A baseline model provides a point of reference against which we can compare more sophisticated models. It helps answer the question: "Is our complex model actually performing better than a simple approach?"

2. **Justification for Complexity**:
   If a simple model performs nearly as well as a complex one, it may not be worth the additional computational cost and potential overfitting risk of the complex model.

3. **Problem Understanding**:
   Implementing a baseline forces us to understand the fundamental aspects of our problem and data.

4. **Quick Insights**:
   Baseline models can provide quick insights into the problem, potentially revealing simple patterns or biases in the data.

5. **Sanity Check**:
   If a complex model performs worse than the baseline, it's a clear sign that something is wrong – either with the model, the data, or our approach.

### Baseline Models for Sentiment Analysis

For our drug review sentiment analysis task, we can consider the following baseline approaches:

**Majority Class Predictor**:
Always predict the most common sentiment in our training data. This is the simplest possible baseline.

```python
from sklearn.dummy import DummyClassifier

majority_baseline = DummyClassifier(strategy='most_frequent')
majority_baseline.fit(X_train, y_train)
majority_accuracy = majority_baseline.score(X_test, y_test)
print(f"Majority Class Baseline Accuracy: {majority_accuracy:.4f}")
```

In [58]:
LABELS = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

In [59]:
!wandb login

wandb: Currently logged in as: kbhuvi (kbhuvi-uplimit). Use `wandb login --relogin` to force relogin


In [60]:
import os
import wandb
import pandas as pd

def save_and_log_datasets(train_df, test_df, y_probas, run):
    """
    Save the split datasets and probabilities to CSV files and log them as artifacts in W&B.

    Args:
    train_df (pd.DataFrame): Training dataframe
    test_df (pd.DataFrame): Test dataframe
    y_probas (np.array): Predicted probabilities for test set, (n_samples, n_classes)
    run (wandb.Run): The current W&B run
    """

    os.makedirs('datasets', exist_ok=True)

    train_df.to_csv('datasets/train.csv', index=False)
    test_df.to_csv('datasets/test.csv', index=False)
    LABELS = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

    probas_df = pd.DataFrame(y_probas, columns=LABELS)
    probas_df.to_csv('datasets/test_probas.csv', index=False)

    artifact = wandb.Artifact(name="drug-review-dataset", type="dataset")

    artifact.add_file(local_path="datasets/train.csv", name="train.csv")
    artifact.add_file(local_path="datasets/test.csv", name="test.csv")
    artifact.add_file(local_path="datasets/test_probas.csv", name="test_probas.csv")

    run.log_artifact(artifact)

    print("Datasets and probabilities saved and logged as artifacts in W&B.")

In [61]:
X_train, y_train = train_df[['text']].values.flatten(), train_df[['label']].values.flatten()
X_test, y_test = test_df[['text']].values.flatten(), test_df[['label']].values.flatten()

In [62]:
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_probas = model.predict_proba(X_test)

    metrics = {
        'accuracy': accuracy_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred, average='weighted'),
        'precision': precision_score(y_test, y_pred, average='weighted'),
        'recall': recall_score(y_test, y_pred, average='weighted')
    }
    # Ensure that we save the dataset used so that we can use it for debugging
    # or in the future for comparisons with other models.
    # Reuse the run started by the caller instead of calling wandb.init() a
    # second time (this assumes wandb.init() has already been called).
    run = wandb.run
    print("run", run)
    save_and_log_datasets(train_df, test_df, y_probas, run)

    # Log all metrics in a single call so that they share one step
    wandb.log(metrics)

    wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=['Negative', 'Neutral', 'Positive'])
    wandb.sklearn.plot_roc(y_test, y_probas, labels=['Negative', 'Neutral', 'Positive'])
    wandb.sklearn.plot_precision_recall(y_test, y_probas, labels=['Negative', 'Neutral', 'Positive'])
    wandb.sklearn.plot_class_proportions(y_train, y_test, ['Negative', 'Neutral', 'Positive'])

    return metrics

# Initialize W&B run
wandb.init(project="Drug Review MLOps Uplimit", name="DummyClassifier_Stratified",
           notes="Dummy Classifier Baseline", tags=["baseline", "dummy", "stratified"])

config = {
    "model": "DummyClassifier",
    "strategy": "stratified"
}
wandb.config.update(config)

dummy_clf = DummyClassifier(strategy='stratified', random_state=42)
dummy_metrics = train_and_evaluate(dummy_clf, X_train, y_train, X_test, y_test)

wandb.finish()

run <wandb.sdk.wandb_run.Run object at 0x7c2c74e2a740>
Datasets and probabilities saved and logged as artifacts in W&B.

accuracy: 0.51086
f1: 0.5115
precision: 0.51215
recall: 0.51086


In [63]:
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def check_batch_training(X, y, n_gram_size, max_iter, batch_size=32):
    """
    Check if the model can train on a batch of specified size using a scikit-learn Pipeline.
    """
    X_batch = X[:batch_size]
    y_batch = y[:batch_size]

    pipeline = Pipeline([
        ('featurizer', CountVectorizer(ngram_range=(1, n_gram_size))),
        ('classifier', LogisticRegression(max_iter=max_iter))
    ])

    try:
        pipeline.fit(X_batch, y_batch)
        return True
    except Exception as e:
        print(f"Error training on batch of size {batch_size}: {str(e)}")
        return False

# Perform batch check before main training loop
print("Performing batch training check...")
N_GRAM_SIZE = 3
LR_MAX_ITER = 100
batch_success = check_batch_training(X_train, y_train, N_GRAM_SIZE, LR_MAX_ITER, batch_size=32)

if not batch_success:
    print("Batch training failed. Please review your model and data.")
else:
    print("Batch training successful. Proceeding with full training...")

Performing batch training check...
Batch training successful. Proceeding with full training...


In [64]:
import wandb
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType
import os
from pathlib import Path

# Constants
N_GRAM_SIZE = 3
LR_MAX_ITER = 100
TRAIN_SIZE_EVALS = [1000, 5000, 10000, len(X_train)]

def train_and_evaluate(pipeline, X_train, y_train, X_test, y_test, run_name, run):
    pipeline.fit(X_train, y_train)
    registered_model_name = "review-sentiment-analysis-dev"

    y_pred = pipeline.predict(X_test)
    y_probas = pipeline.predict_proba(X_test)

    X_train_vec = pipeline.named_steps['featurizer'].transform(X_train)
    X_test_vec = pipeline.named_steps['featurizer'].transform(X_test)

    wandb.sklearn.plot_confusion_matrix(y_test, y_pred, labels=LABELS)
    wandb.sklearn.plot_roc(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_precision_recall(y_test, y_probas, labels=LABELS)
    wandb.sklearn.plot_class_proportions(y_train, y_test, LABELS)

    save_and_log_datasets(train_df, test_df, y_probas, run)

    # Export the model to ONNX format
    initial_type = [('text_input', StringTensorType([None, 1]))]
    onx = convert_sklearn(pipeline, initial_types=initial_type)

    # Save the ONNX model locally
    onnx_filename = f"logreg_model_{run_name}.onnx"
    onnx_filepath = Path(onnx_filename)
    with open(onnx_filepath, "wb") as f:
        f.write(onx.SerializeToString())

    run.link_model(
        onnx_filepath,
        registered_model_name
    )

    # Clean up the local file
    os.remove(onnx_filepath)

# OPTIONAL: Initially, let's train on just a small sample. If you have the time,
# go ahead and try training on the entire dataset!
n = TRAIN_SIZE_EVALS[0]
run_name = f"LR_train_size_{n}"
run = wandb.init(project="Drug Review MLOps Uplimit", name=run_name,
            notes="Logistic Regression with various train sizes",
            tags=["logistic-regression", "experiment"])

config = {
    "model": "LogisticRegression",
    "n_gram_size": N_GRAM_SIZE,
    "max_iter": LR_MAX_ITER,
    "train_size": n
}
run.config.update(config)

X_train_i = X_train[:n]
y_train_i = y_train[:n]

pipeline = Pipeline([
    ('featurizer', CountVectorizer(ngram_range=(1, N_GRAM_SIZE))),
    ('classifier', LogisticRegression(max_iter=LR_MAX_ITER))
])

train_and_evaluate(pipeline, X_train_i, y_train_i, X_test, y_test, run_name, run)

wandb.finish()

print("Experiment tracking completed.")

Datasets and probabilities saved and logged as artifacts in W&B.

Experiment tracking completed.


## Compare Models in W&B

Once we have trained our models, we can use W&B to compare them by heading to our project and looking at the generated charts. W&B compares the models across the charts that we specified during training, so it's important to decide which plots we want ahead of time to ensure that we can always compare models easily!

![Compare ML Models in W&B](https://drive.google.com/uc?id=1WSKNK3rlwTr_e-mhVhjCNy-LcSBg3amw)

## Fetch and Run Inference using the Model from Model Registry

Now that we have logged our models to the model registry, let's load one here and run inference with it using the ONNX Runtime to verify that everything works as expected!

In [65]:
import wandb
run = wandb.init()
model_name = "kbhuvi-uplimit/Drug Review MLOps Uplimit/run-y3m59we9-logreg_model_LR_train_size_1000.onnx:v0"
downloaded_model_path = run.use_model(model_name)
print(downloaded_model_path)

wandb:   1 of 1 files downloaded.

/content/artifacts/run-y3m59we9-logreg_model_LR_train_size_1000.onnx:v0/logreg_model_LR_train_size_1000.onnx


In [67]:
import numpy as np
import onnxruntime as rt

# First we must start a session.
sess = rt.InferenceSession(downloaded_model_path)
# The name of the input is saved as part of the .onnx file.
# We are retrieving it because we will need it later.
input_name = sess.get_inputs()[0].name
print(f"{input_name=}")
# This code will run the model on our behalf.
query = "effective" #"I loved the product!"
_, probas = sess.run(None, {input_name: np.array([[query]])})
print(probas[0])

input_name='text_input'
{0: 0.16362914443016052, 1: 0.0402657687664032, 2: 0.7961050868034363}
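
The probability dictionary printed above is keyed by class index. As a small illustrative sketch (the helper name `top_label` is hypothetical), it can be mapped back to a readable label using the same class order as `LABELS`:

```python
LABELS = ["NEGATIVE", "NEUTRAL", "POSITIVE"]

def top_label(class_probas):
    """Return the (label name, probability) pair with the highest probability."""
    best_class = max(class_probas, key=class_probas.get)
    return LABELS[best_class], class_probas[best_class]

# Probabilities copied from the inference output above
example = {0: 0.16362914443016052, 1: 0.0402657687664032, 2: 0.7961050868034363}

label, proba = top_label(example)
print(f"{label} ({proba:.3f})")
```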


## [OPTIONAL] Advanced Models for Sentiment Analysis

For those interested in exploring more sophisticated approaches, this section introduces two advanced techniques: using BERT embeddings as a featurizer and leveraging Large Language Models (LLMs) for sentiment analysis.

### 1. BERT Embeddings with Logistic Regression

This approach uses BERT to create embeddings, which are then fed into a logistic regression classifier.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer
from sklearn.metrics import accuracy_score, f1_score

class TransformerFeaturizer(BaseEstimator, TransformerMixin):
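    def __init__(self, model_name='all-MiniLM-L6-v2'):
        # Store the parameter so the method has a body and scikit-learn can clone the estimator
        self.model_name = model_name
        # TODO: Initialize the SentenceTransformer model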

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # TODO: Use the SentenceTransformer model to create embeddings for the input text
        pass

# Training and evaluation
models_advanced = {}
for n in TRAIN_SIZE_EVALS:
    print(f"Evaluating BERT+LR for training data size = {n}")
    X_train_i = X_train[:n]
    y_train_i = y_train[:n]

    pipeline = Pipeline([
        ('featurizer', TransformerFeaturizer()),
        ('classifier', LogisticRegression(max_iter=1000))
    ])

    # TODO: Fit the pipeline, make predictions, and calculate metrics
    # Store results in models_advanced[n]

    print(f"Accuracy on test set: {models_advanced[n]['accuracy']}")
```

### 2. LLM-based Sentiment Analysis

This approach uses a Large Language Model for zero-shot and few-shot sentiment analysis.

```python
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Initialize your LLM (replace with your preferred model)
llm = OpenAI(temperature=0)

# Zero-shot prompting
zero_shot_template = """
YOUR PROMPT
"""
zero_shot_prompt = PromptTemplate(input_variables=["review"], template=zero_shot_template)
zero_shot_chain = LLMChain(llm=llm, prompt=zero_shot_prompt)

# TODO: Implement zero-shot sentiment analysis on a sample of drug reviews

# Few-shot prompting
few_shot_template = """
YOUR PROMPT
"""
few_shot_prompt = PromptTemplate(input_variables=["review"], template=few_shot_template)
few_shot_chain = LLMChain(llm=llm, prompt=few_shot_prompt)

# TODO: Implement few-shot sentiment analysis on a sample of drug reviews

# TODO: Compare the performance of zero-shot and few-shot approaches
```

## [OPTIONAL] Model Evaluation: Estimating Confidence Intervals with Bootstrap Sampling

When evaluating machine learning models, it's crucial to understand not just the point estimates of performance metrics, but also the uncertainty around these estimates. Bootstrap sampling is a powerful technique that allows us to estimate confidence intervals for our model's performance metrics.

## Why Bootstrap Sampling?

1. **Quantify Uncertainty**: Bootstrap sampling helps us quantify the uncertainty in our model's performance metrics.
2. **Robustness**: It provides a more robust estimate of model performance than a single point estimate.
3. **No Distributional Assumptions**: Bootstrap sampling doesn't require assumptions about the underlying distribution of the data.

## Implementing Bootstrap Sampling

Here's a step-by-step guide to implement bootstrap sampling for estimating confidence intervals:

1. **Generate Bootstrap Samples**:
   Create N (e.g., 1000) bootstrap samples, each the same size as your original test set. Each sample is created by randomly selecting instances from the test set with replacement.

2. **Calculate Metrics for Each Sample**:
   For each bootstrap sample, calculate the performance metrics you're interested in (e.g., accuracy, F1-score).

3. **Sort the Results**:
   Sort the N values for each metric in ascending order.

4. **Compute Confidence Intervals**:
   The 95% confidence interval is given by the 2.5th and 97.5th percentiles of the sorted values.
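
The four steps above can be sketched with the standard library alone. This is a simplified illustration rather than the solution to the exercise below: it works on made-up labels and predictions, only covers accuracy, and picks the percentiles by index into the sorted scores, which is an approximation. The function name `bootstrap_accuracy_ci` is hypothetical.

```python
import random

def bootstrap_accuracy_ci(y_true, y_pred, n_iterations=1000, alpha=0.05, seed=42):
    """Approximate (1 - alpha) confidence interval for accuracy via bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_iterations):
        # Step 1: resample test-set indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        # Step 2: compute the metric on the resampled set
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        scores.append(correct / n)
    # Step 3: sort the metric values
    scores.sort()
    # Step 4: read off the lower and upper percentiles by position
    lower = scores[int((alpha / 2) * n_iterations)]
    upper = scores[min(int((1 - alpha / 2) * n_iterations), n_iterations - 1)]
    return lower, upper

# Made-up test set: 100 examples, 80 of them predicted correctly
y_true_toy = [1] * 100
y_pred_toy = [1] * 80 + [0] * 20

lower, upper = bootstrap_accuracy_ci(y_true_toy, y_pred_toy)
print(f"95% CI for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Resampling the predictions instead of re-running the model keeps each iteration cheap, since the model's output for a given test example does not change between samples.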

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm

def bootstrap_sample(X, y, n_samples):
    # TODO: Implement bootstrap sampling
    # Hint: Use np.random.randint to generate random indices
    pass

def bootstrap_confidence_interval(model, X_test, y_test, n_iterations=1000):
    accuracies = []
    f1_scores = []
    n_samples = len(X_test)

    for _ in tqdm(range(n_iterations)):
        # TODO: Generate bootstrap sample
        # TODO: Make predictions on bootstrap sample
        # TODO: Calculate accuracy and F1-score
        # TODO: Append results to accuracies and f1_scores lists
        pass

    # TODO: Calculate mean and confidence intervals for accuracy and F1-score
    # Hint: Use np.percentile for confidence intervals

    # TODO: Return results in a dictionary format

# Usage
# TODO: Call the bootstrap_confidence_interval function with your model and test data
# TODO: Print the results

# [OPTIONAL] Post-Training Tests on Model Behavior

After training your sentiment analysis model, it's crucial to thoroughly test its behavior beyond just accuracy metrics. This section introduces three types of behavioral tests that will help you understand your model's strengths, weaknesses, and potential biases.

## Invariance Tests

Invariance tests check whether your model's output remains unchanged when irrelevant input features are modified.

### Example: Name Invariance Test

In sentiment analysis, a person's name mentioned in a review should not affect the sentiment prediction.

**Test Setup:**
1. Select a set of reviews from your test dataset.
2. Create copies of these reviews, replacing any mentioned names with different names.
3. Run both the original and modified reviews through your model.
4. Compare the sentiment predictions.

**Expected Outcome:** The sentiment predictions should remain the same for both original and name-modified reviews.
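
A hand-rolled version of this test could look like the sketch below. It is only an illustration: `toy_predict` is a stand-in keyword classifier, not the trained model, and the helper name `name_invariance_failures` and the example reviews are made up.

```python
import re

def toy_predict(text):
    """Stand-in classifier: 2 (positive) if 'great' appears, else 0 (negative)."""
    return 2 if "great" in text.lower() else 0

def name_invariance_failures(predict, reviews, original_name, replacement_names):
    """Return (original, modified) pairs whose prediction changes after a name swap."""
    failures = []
    for review in reviews:
        baseline = predict(review)
        for name in replacement_names:
            # Replace whole-word occurrences of the original name only
            modified = re.sub(rf"\b{re.escape(original_name)}\b", name, review)
            if predict(modified) != baseline:
                failures.append((review, modified))
    return failures

reviews = [
    "Dr. Smith said this drug is great.",
    "Dr. Smith warned me about the side effects.",
]
failures = name_invariance_failures(toy_predict, reviews, "Smith", ["Garcia", "Tanaka"])
print(f"Invariance failures: {len(failures)}")
```

To test the real model, pass a function that wraps its prediction call in place of `toy_predict`.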

## Directional Expectation Tests

Directional Expectation Tests check if changes in input lead to expected changes in output.

### Example: Intensifier Test

Adding intensity-related words should increase the strength of the sentiment prediction.

**Test Setup:**
1. Select a set of positive and negative reviews from your test dataset.
2. Create copies of these reviews, adding intensifiers like "very", "extremely", or "incredibly".
3. Run both the original and modified reviews through your model.
4. Compare the sentiment prediction probabilities.

**Expected Outcome:** The sentiment prediction probability should increase in the direction of the original sentiment.
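
As a rough sketch of how this check might be coded by hand, the snippet below compares class probabilities before and after adding an intensifier. `toy_predict_proba` is a made-up scorer used only to make the example runnable, and `directional_failures` is a hypothetical helper name.

```python
def toy_predict_proba(text):
    """Stand-in scorer returning [P(negative), P(neutral), P(positive)]."""
    words = text.lower().replace(".", "").split()
    boost = 0.2 if "very" in words else 0.0
    if "good" in words:
        pos = 0.7 + boost
        return [(1 - pos) / 2, (1 - pos) / 2, pos]
    if "bad" in words:
        neg = 0.7 + boost
        return [neg, (1 - neg) / 2, (1 - neg) / 2]
    return [1 / 3, 1 / 3, 1 / 3]

def directional_failures(predict_proba, cases):
    """Flag cases where the probability of the expected class drops after intensifying.

    Each case is (original text, intensified text, class index of the original sentiment).
    """
    failures = []
    for original, intensified, cls in cases:
        before = predict_proba(original)[cls]
        after = predict_proba(intensified)[cls]
        if after < before:
            failures.append((original, intensified, before, after))
    return failures

cases = [
    ("This drug is good.", "This drug is very good.", 2),  # positive review
    ("This drug is bad.", "This drug is very bad.", 0),    # negative review
]
dir_failures = directional_failures(toy_predict_proba, cases)
print(f"Directional failures: {len(dir_failures)}")
```

In practice a small tolerance on the comparison can be useful, since tiny probability decreases are often just noise.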

## Minimum Functionality Tests

Minimum Functionality Tests check if your model performs correctly on very simple or critical cases.

### Example: Explicit Sentiment Words Test

Reviews containing explicit sentiment words should be correctly classified.

**Test Setup:**
1. Create a list of reviews using explicit positive and negative sentiment words.
2. Run these through your model.
3. Check if the predictions match the expected sentiments.

**Expected Outcome:** The model should correctly classify these simple, explicit sentiment expressions with high accuracy.
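As a library-free illustration of the steps above, the sketch below builds simple templated reviews and reports the fraction classified correctly. The helper names (`build_mft_cases`, `mft_pass_rate`, `toy_predict`), the product list, and the label convention (`1` = positive, `0` = negative) are assumptions made for this example; swap in your own model's prediction function.

```python
from itertools import product

def build_mft_cases(products, positive_words, negative_words):
    """Build (text, expected_label) pairs from a simple template."""
    cases = []
    for item, word in product(products, positive_words):
        cases.append((f"This {item} is {word}.", 1))
    for item, word in product(products, negative_words):
        cases.append((f"This {item} is {word}.", 0))
    return cases

def mft_pass_rate(predict_fn, cases):
    """Return the fraction of cases where the prediction equals the expected label."""
    texts = [text for text, _ in cases]
    preds = predict_fn(texts)
    passed = sum(int(pred == label) for pred, (_, label) in zip(preds, cases))
    return passed / len(cases)

positive_words = ["excellent", "amazing", "fantastic"]
negative_words = ["terrible", "awful", "horrible"]

# Hypothetical stand-in for a trained model; replace with your own predictor.
def toy_predict(texts):
    return [1 if any(w in t for w in positive_words) else 0 for t in texts]

cases = build_mft_cases(["blender", "charger"], positive_words, negative_words)
print(f"Pass rate: {mft_pass_rate(toy_predict, cases):.0%} on {len(cases)} cases")
```

Because these cases are deliberately easy, anything noticeably below a 100% pass rate usually points to a preprocessing or label-mapping problem rather than a subtle modeling issue.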

In [None]:
import checklist
from checklist.editor import Editor
from checklist.test_types import MFT
from checklist.pred_wrapper import PredictorWrapper

editor = Editor()

# Part 1: Test Data Generation

# TODO: Define variables for test data generation
# Example:
# positive_words = ['excellent', 'amazing', 'fantastic']
# negative_words = ['terrible', 'awful', 'horrible']
# neutral_words = ['okay', 'average', 'mediocre']
# products = ['movie', 'book', 'restaurant', 'hotel']

# TODO: Create templates for Invariance Tests
# Example: Name Invariance Test
# ret = editor.template('According to {name}, the {product} was {quality}.',
#                       name=['John', 'Emma', 'Mohammed', 'Yuki', 'Maria'],
#                       product=products,
#                       quality=positive_words + negative_words + neutral_words,
#                       labels='Sentiment')

# TODO: Create templates for Directional Expectation Tests
# Example: Intensifier Test
# ret += editor.template('The {product} was {intensifier} {quality}.',
#                        product=products,
#                        intensifier=['', 'very', 'extremely'],
#                        quality=positive_words + negative_words,
#                        labels='Sentiment')

# TODO: Create templates for Minimum Functionality Tests
# Example: Explicit Sentiment Words Test
# ret += editor.template('This {product} is {quality}.',
#                        product=products,
#                        quality=positive_words + negative_words,
#                        labels='Sentiment')

# Part 2: Test Configuration

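# TODO: Configure the MFT test
# NOTE: MFT compares each prediction against `labels`, so the templates need
# real class labels (the values your predictor returns) in place of the
# 'Sentiment' placeholder. Invariance and directional checks also have their
# own test types in checklist.test_types (INV and DIR).
# test = MFT(**ret, name='Sentiment Analysis Behavioral Tests')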

# Part 3: Test Run & Results summary

# TODO: Implement a function to use your trained model for predictions
def predict_sentiment(texts):
    # Your code here to make predictions using your trained model
    pass

# TODO: Wrap your prediction function
# wrapped_predictor = PredictorWrapper.wrap_predict(predict_sentiment)

# TODO: Run the test and display results
# test.run(wrapped_predictor)
# test.summary()

# Selecting the Model to Productionize

Now that you've trained, evaluated, and performed behavioral testing on your sentiment analysis models, it's time to select the best model for production deployment. This decision should be based on a comprehensive analysis of each model's performance, behavior, and suitability for the real-world application. For this project, the full justification step is left as an optional task. We recommend picking a simple model that can be deployed easily, such as the Logistic Regression model.

## [OPTIONAL] Model Selection and Justification

Example table of factors to consider:

| Aspect | Logistic Regression | BERT | [Other Models] |
|--------|---------------------|------|----------------|
| Accuracy | | | |
| F1 Score | | | |
| Invariance Test | | | |
| Directional Test | | | |
| MF Test | | | |
| Pros | | | |
| Cons | | | |
| Training Time | | | |
| Inference Time | | | |
| Explainability | | | |

1. Review your models' performance, considering factors such as:
   - Accuracy, F1 Score, Precision, Recall
   - Results from Invariance, Directional Expectation, and Minimum Functionality Tests
   - Training and inference time
   - Model size and resource requirements
   - Explainability and interpretability
   - Robustness and potential biases

2. Write a brief justification for your chosen model, addressing:
   - Why this model is the best fit for the product review sentiment analysis task
   - How it balances performance, efficiency, and robustness
   - Any potential challenges or limitations, and how you plan to address them
   - How this model aligns with the business requirements and constraints

3. Promote the selected model to production in Weights & Biases (W&B):
   - Log into your W&B account
   - Navigate to your project
   - Find the run corresponding to your selected model
   - Use the W&B UI to promote this model by adding the `production` alias to it
   - Provide any necessary metadata or tags
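For the inference-time comparison mentioned in step 1, a small timing helper can give comparable numbers across models. This is only a sketch: `median_latency_ms` is a hypothetical helper, and the lambda below is a placeholder for each model's batch prediction function.

```python
import statistics
import time

def median_latency_ms(predict_fn, texts, repeats=5):
    """Return the median wall-clock time (in ms) of `predict_fn(texts)` over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(texts)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Placeholder predictor and inputs; substitute your model and a sample of real reviews.
sample_texts = ["This product is excellent."] * 100
latency = median_latency_ms(lambda texts: [1 for _ in texts], sample_texts)
print(f"Median batch latency: {latency:.3f} ms")
```

Using the median over several runs reduces the effect of one-off slowdowns such as a cold cache on the first call.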