## (3) Case Study: Transform data for modeling using a data integration tool
##### (GenAI Life Cycle Phase 3: Data Preparation self-assessment)

---

In [27]:
import ipywidgets as widgets
from IPython.display import display

# EDA Pre-Reading Data Content
eda_pre_reading_data = [
    ["<b>(2a) Perform Exploratory Data Analysis (EDA)</b>", 
     ("In your EDA, you are expected to get an overview of the categorical, ordinal, and interval variables present in the datasets "
      "and identify how they may later be used for the Retrieval-Augmented Generation (RAG) of a virtual agent instance in this toolkit. "
      "This dataset contains information on restaurants, including reviews, user ratings, and business details.<br>"
      "<ul style='margin-left: 20px;'>"
      "<li>Understand the data structure by analyzing key variables and their distributions.</li>"
      "<li>Identify data types, such as categorical, ordinal, and interval variables.</li>"
      "<li>Recognize potential use cases for RAG-based integration.</li>"
      "</ul>"
      "<b>NOTE:</b> The file you will be directed to below may take a few minutes to load given the size of the datasets used in this Case Study.<br>"
      "<a href='case-files/ailtk-running-code-case-2.ipynb' target='_blank' style='color: #1e7e34; text-decoration: underline;'>"
      "(Click here to open Solution: Case Study 2 in Visual Studio Code)</a><br>"
      "<a href='case-files/solution-case-study-2.pdf' target='_blank' style='color: #1e7e34; text-decoration: underline;'>"
      "(Alternative: PDF version)</a><br>")
    ],
    
    ["<b>Schema Analysis:</b>", 
     ("<ul style='margin-left: 20px;'>"
      "<li><b>yelp_academic_dataset_business.csv</b>: Contains business information.<br>"
      "Key columns: <code>business_id</code>, <code>name</code>, <code>categories</code>, <code>stars</code>, <code>review_count</code>, etc.</li>"
      "<li><b>yelp_academic_dataset_review.csv</b>: Contains reviews of businesses.<br>"
      "Key columns: <code>review_id</code>, <code>user_id</code>, <code>business_id</code>, <code>stars</code>, <code>text</code>, <code>date</code>, etc.</li>"
      "<li><b>yelp_academic_dataset_user.csv</b>: Contains user data.<br>"
      "Key columns: <code>user_id</code>, <code>name</code>, <code>review_count</code>, <code>friends</code>, <code>yelping_since</code>, etc.</li>"
      "</ul>")],
    
    ["<b>EDA Summary:</b>", 
     ("The data primarily consists of interval and categorical variables, with <b>'stars'</b> as an example of ordinal data. "
      "These variables provide a basis for our upcoming data transformation."
      "<ul style='margin-left: 20px;'>"
      "<li><b>Business Dataset</b>: Provides detailed information on businesses, including location, categories, and operational details. "
      "Key for identifying review sources and geographical trends.</li>"
      "<li><b>Review Dataset</b>: Links users to businesses via reviews. Essential for analyzing customer feedback and ratings.</li>"
      "<li><b>User Dataset</b>: Captures user activity and social connections, valuable for understanding reviewer demographics and engagement.</li>"
      "</ul>")],

    ["<b>Data Relationships:</b>", 
     ("<ul style='margin-left: 20px;'>"
      "<li><code>business_id</code>: Links <b>businesses</b> to <b>reviews</b>.</li>"
      "<li><code>user_id</code>: Links <b>reviews</b> to <b>users</b>.</li>"
      "</ul>")]
]

# Create content for the widget
eda_pre_reading_content = widgets.VBox([widgets.HTML(value=f"{item[0]}<br>{item[1]}") for item in eda_pre_reading_data])

# Styled Box for EDA Pre-Reading
styled_eda_box = widgets.Box(
    [widgets.HTML(value="<h3 style='color: #1e7e34; display: inline;'>PRE-READING: Solution of \"(2) Case Study: Source and investigate usable data sources\" </h3>"),
     widgets.HTML(value="<hr style='border: 1px solid #1e7e34;'>"),  # Horizontal line for separation
     eda_pre_reading_content],
    layout=widgets.Layout(
        border="2px solid #1e7e34",
        padding="20px",
        width="90%",
        margin="20px 0px"
    )
)

# Display the styled box
display(styled_eda_box)



---

#### **Case Scenario:**

> With your exploratory data analysis (EDA) from Activity 2 complete, Welp now requires the transformation of the analyzed data into a format suitable for the AI-powered virtual assistant. Management envisions a virtual agent capable of personalized **restaurant and bar** recommendations, and this step is crucial to ensure the data is usable for the virtual assistant’s recommendation engine.
>
> Your role as an AI developer is to process and structure the dataset so that the virtual assistant can use it effectively. This includes transforming the raw data into a format that captures essential features such as restaurant names, cuisine types, user ratings, review highlights, ambiance descriptions, and other relevant data, while ensuring compatibility with the AI system's needs.
>
> The transformation process will require you to clean, filter, organize, merge, and label the datasets in a format that is later usable for Retrieval-Augmented Generation (RAG).
>
> Your tasks are as follows:
>
> (a) **Identify the file type and format (or shape) that the data will be transformed into.** The dataset format must be compatible with machine learning models and easily accessible to the recommendation engine. Export the dataset into a file format suitable for training and real-time recommendation (e.g., CSV, Excel).
>
> (b) **Use Apache Hop to transform data into a shape and file type suitable for RAG**, ensuring that it is in a structured and consistent format. Handle any missing or incomplete information, normalize textual data, and prepare the dataset for AI consumption. 
>
> By completing these tasks, you will gain hands-on experience in data preprocessing, cleaning, and structuring data for AI-based systems, preparing the dataset for deployment in a personalized recommendation system.

---

##### Pre-requisites:
- [Open Apache Hop](../ailtk_learning-management-module/learning-files/ailtk-apachehop-howto.ipynb)

#### (a) **Identify the file type and format (or shape) that the data will be transformed into.** The dataset format must be compatible with machine learning models and easily accessible to the recommendation engine. Export the dataset into a file format suitable for training and real-time recommendation (e.g., CSV, Excel).
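
One possible target shape, assuming the recommendation engine retrieves one text document per business, is a flat table with one row per business. The sketch below writes such a table as CSV using only the Python standard library. The column names reuse those from the schema analysis, except `review_text`, which is a hypothetical column for concatenated review text. The sample rows are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Hypothetical, already-merged records (illustrative values, not real dataset rows).
records = [
    {"business_id": "b1", "name": "Luigi's", "categories": "Italian, Restaurants",
     "stars": 4.5, "review_text": "Great pasta and a cozy room."},
    {"business_id": "b2", "name": "The Tap", "categories": "Bars, Nightlife",
     "stars": 4.0, "review_text": "Friendly staff, loud on weekends."},
]

# Fixed column order so every export has the same layout.
FIELDS = ["business_id", "name", "categories", "stars", "review_text"]


def to_csv(rows, fields=FIELDS):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

Values that contain commas, such as `categories`, are quoted automatically by the `csv` module, so the file can be read back without splitting those fields.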

#### (b) **Use Apache Hop to transform data into a shape and file type suitable for RAG**, ensuring that it is in a structured and consistent format. Handle any missing or incomplete information, normalize textual data, and prepare the dataset for AI consumption.
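
The cleaning logic that such a pipeline needs can be sketched in plain Python before it is built in Apache Hop. This is not Hop code. It is a minimal illustration, under assumed rules, of handling missing values and normalizing text: rows without a `business_id` or without review text are dropped, a missing rating is kept as `None`, and text is lowercased with whitespace collapsed. The helper names `normalize_text` and `clean_reviews` and the sample rows are hypothetical.

```python
import re

# Hypothetical raw review rows; None and empty strings stand in for missing values.
raw_reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": "5", "text": "  Great   PASTA!\n Cozy room. "},
    {"review_id": "r2", "business_id": "b1", "stars": "", "text": "Okay."},
    {"review_id": "r3", "business_id": None, "stars": "3", "text": "Average."},
    {"review_id": "r4", "business_id": "b2", "stars": "4", "text": ""},
]


def normalize_text(text):
    """Collapse runs of whitespace into single spaces, trim, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_reviews(rows):
    """Drop rows that cannot be joined or have no text; keep a missing rating as None."""
    cleaned = []
    for row in rows:
        has_key = bool(row.get("business_id"))
        has_text = bool((row.get("text") or "").strip())
        if not (has_key and has_text):
            continue  # no join key or nothing to retrieve later
        stars = row.get("stars")
        cleaned.append({
            "review_id": row["review_id"],
            "business_id": row["business_id"],
            "stars": int(stars) if stars else None,
            "text": normalize_text(row["text"]),
        })
    return cleaned


cleaned = clean_reviews(raw_reviews)
for row in cleaned:
    print(row)
```

Lowercasing is only one normalization choice. If the downstream model benefits from original casing, trimming and whitespace collapsing alone may be enough.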

> SOLUTION:
>
> You can view a completed pipeline and the resulting spreadsheet for this task here:
>
> <a href='case-files/ailtk-solutions-case-3.ipynb' target='_blank'>Opens the file manager</a>
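
The pre-reading notes that `business_id` links businesses to reviews, so the pipeline needs a join step. The sketch below is a generic sort-merge join in Python, not Apache Hop's implementation. It illustrates why merge-style joins generally expect both inputs to be sorted on the join key: the algorithm only moves forward through each input, so unsorted rows are skipped without any error. The function name `merge_join` and the sample rows are hypothetical.

```python
def merge_join(left, right, key):
    """Inner join of two lists of dicts that are ALREADY sorted by `key`."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row sharing this key, then advance the left side.
            k = j
            while k < len(right) and right[k][key] == lk:
                # On a column-name clash, the right-hand value wins.
                joined.append({**left[i], **right[k]})
                k += 1
            i += 1
    return joined


# Hypothetical sample rows, deliberately not sorted by business_id.
businesses = [
    {"business_id": "b2", "name": "The Tap"},
    {"business_id": "b1", "name": "Luigi's"},
]
reviews = [
    {"business_id": "b2", "review_id": "r3", "rating": 4},
    {"business_id": "b1", "review_id": "r1", "rating": 5},
    {"business_id": "b1", "review_id": "r2", "rating": 3},
]


def by_id(row):
    return row["business_id"]


# Sorting both inputs first yields all matching pairs.
joined = merge_join(sorted(businesses, key=by_id), sorted(reviews, key=by_id), "business_id")

# Skipping the sort silently drops matches.
unsorted_joined = merge_join(businesses, reviews, "business_id")

print(len(joined), len(unsorted_joined))
```

With the sorted inputs all three reviews find their business, while the unsorted call returns only one row.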

##### Answer the following to proceed:

In [26]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Define questions and options
questions = [
    {
        "question": "How can data transformation enhance recommendation accuracy?",
        "options": [
            "By ensuring consistent, relevant, and structured inputs for the model to learn from",
            "By reducing the volume of data available for analysis",
            "By eliminating all inconsistencies in the dataset",
            "By focusing solely on user demographics"
        ],
        "answer": "By ensuring consistent, relevant, and structured inputs for the model to learn from"
    },
    {
        "question": "What are the challenges of integrating multiple data sources?",
        "options": [
            "Disparate formats, large sizes, and missing or inconsistent data",
            "Having too few data sources",
            "Maintaining the same file type across sources",
            "Lack of variety in the data"
        ],
        "answer": "Disparate formats, large sizes, and missing or inconsistent data"
    },
    {
        "question": "How can the recommendation system leverage EDA (Exploratory Data Analysis) findings?",
        "options": [
            "By providing insights like cuisine popularity, user sentiment trends, and ambiance preferences to refine filtering criteria",
            "By eliminating irrelevant data points",
            "By creating more data points from the existing ones",
            "By using only the most recent user feedback"
        ],
        "answer": "By providing insights like cuisine popularity, user sentiment trends, and ambiance preferences to refine filtering criteria"
    },
    {
        "question": "What tools or techniques will ensure scalability for large datasets?",
        "options": [
            "Using chunking, parallel processing, and database storage (e.g., SQLite or PostgreSQL)",
            "Using a single-threaded process for data handling",
            "Limiting the dataset size to 100 entries",
            "Compressing the data to reduce storage requirements"
        ],
        "answer": "Using chunking, parallel processing, and database storage (e.g., SQLite or PostgreSQL)"
    },
    {
        "question": "What must be done before merging datasets in Apache Hop?",
        "options": [
            "Datasets must be sorted using a 'Sort Rows' transform prior to two inputs being joined",
            "Datasets must be formatted as CSV files before merging",
            "All rows must be the same before merging",
            "Nothing"
        ],
        "answer": "Datasets must be sorted using a 'Sort Rows' transform prior to two inputs being joined"
    }
]

# Widgets for questions
quiz_widgets = []
for i, q in enumerate(questions):
    question_label = widgets.Label(value=f"Q{i+1}: {q['question']}")
    options = widgets.RadioButtons(
        options=q['options'],
        description='',
        disabled=False,
        value=None,
        layout=widgets.Layout(width='90%', height='auto')  # Ensures proper layout for longer options
    )
    quiz_widgets.append((question_label, options))

# Button to submit answers
submit_button = widgets.Button(description="Submit Answers", button_style="primary")
output = widgets.Output()

# Flag to track if the error message is already displayed
error_displayed = False

# Define button click event
def on_submit_click(b):
    global error_displayed
    # Disable the submit button while the answers are checked
    submit_button.disabled = True
    # Clear the Output widget itself so a stale error message is replaced by the next output
    output.clear_output(wait=True)
    score = 0

    # Check whether any question is left unanswered
    unanswered = any(options.value is None for _, options in quiz_widgets)

    with output:
        if unanswered:
            if not error_displayed:  # Only display the error if it hasn't been shown already
                error_displayed = True
                # Display error message in red
                display(widgets.HTML(
                    '<p style="color: red; font-weight: bold;">Please answer all the questions before submitting.</p>'
                ))
            submit_button.disabled = False  # Re-enable button if there's an error
        else:
            error_displayed = False  # Reset the flag if all questions are answered
            submit_button.button_style = ""  # Reset button style to default after click
            # Calculate score
            for i, (label, options) in enumerate(quiz_widgets):
                user_answer = options.value
                correct_answer = questions[i]["answer"]
                if user_answer == correct_answer:
                    score += 1
                print(f"Q{i+1}: {questions[i]['question']}")
                print(f"  - Your answer: {user_answer}")
                print(f"  - Correct answer: {correct_answer}")
                print()

            print(f"You scored {score}/{len(questions)}! ({(score / len(questions)) * 100:.2f}%)")
            
            # Show Continue or Try Again button based on score
            if score >= 0.8 * len(questions):
                continue_button = widgets.HTML(
                    '<a href="case-study-4.ipynb" style="display: inline-block; padding: 10px 15px; '
                    'background-color: #28a745; color: white; text-decoration: none; border-radius: 5px;">'
                    'Continue</a>'
                )
                display(continue_button)
            else:
                try_again_button = widgets.HTML(
                    '<a href="case-study-3.ipynb" style="display: inline-block; padding: 10px 15px; '
                    'background-color: #dc3545; color: white; text-decoration: none; border-radius: 5px;">'
                    'Score at least 80% to continue. Try Again</a>'
                )
                display(try_again_button)

# Attach event to the submit button
submit_button.on_click(on_submit_click)

# Display the quiz
for label, options in quiz_widgets:
    display(label, options)
display(submit_button, output)

