## (3) Case Study: Transform data for modeling using a data integration tool
##### (GenAI Life Cycle Phase 3: Data Preparation self-assessment)

---

In [13]:
import ipywidgets as widgets
from IPython.display import display

# EDA Pre-Reading Data Content
eda_pre_reading_data = [
    ["<b>Expected EDA Goals:</b>", 
     ("In your EDA, you are expected to get an overview of the categorical, ordinal, and interval variables present in the datasets "
      "and identify how they may later be used for the Retrieval-Augmented Generation (RAG) of a virtual agent instance in this toolkit. "
      "This dataset provides a rich source of information on restaurants, including reviews, user ratings, and business details. "
      "It will be crucial in building a robust and personalized restaurant recommendation system.")],
    
    ["<b>Schema Analysis:</b>", 
     ("<ul>"
      "<li><b>yelp_academic_dataset_business.csv</b>: Contains business information.<br>"
      "Key columns: <code>business_id</code>, <code>name</code>, <code>categories</code>, <code>stars</code>, <code>review_count</code>, etc.</li>"
      "<li><b>yelp_academic_dataset_review.csv</b>: Contains reviews of businesses.<br>"
      "Key columns: <code>review_id</code>, <code>user_id</code>, <code>business_id</code>, <code>stars</code>, <code>text</code>, <code>date</code>, etc.</li>"
      "<li><b>yelp_academic_dataset_user.csv</b>: Contains user data.<br>"
      "Key columns: <code>user_id</code>, <code>name</code>, <code>review_count</code>, <code>friends</code>, <code>yelping_since</code>, etc.</li>"
      "</ul>")],
    
    ["<b>EDA Summary:</b>", 
     ("The data primarily consists of interval and categorical variables, with <b>'stars'</b> as an example of ordinal data. "
      "These variables provide a basis for our upcoming data transformation."
      "<ul>"
      "<li><b>Business Dataset</b>: Provides detailed information on businesses, including location, categories, and operational details. "
      "Key for identifying review sources and geographical trends.</li>"
      "<li><b>Review Dataset</b>: Links users to businesses via reviews. Essential for analyzing customer feedback and ratings.</li>"
      "<li><b>User Dataset</b>: Captures user activity and social connections, valuable for understanding reviewer demographics and engagement.</li>"
      "</ul>")],

    
    ["<b>Data Relationships:</b>", 
     ("<ul>"
      "<li><code>business_id</code>: Links <b>businesses</b> to <b>reviews</b>.</li>"
      "<li><code>user_id</code>: Links <b>reviews</b> to <b>users</b>.</li>"
      "</ul>")]
]

# Create content for the widget
eda_pre_reading_content = widgets.VBox([widgets.HTML(value=f"{item[0]}<br>{item[1]}") for item in eda_pre_reading_data])

# Styled Box for EDA Pre-Reading
styled_eda_box = widgets.Box(
    [widgets.HTML(value="<h3 style='color: #1e7e34; display: inline;'>Pre-Reading Material: EDA and Data Schema Overview</h3>"),
     widgets.HTML(value="<hr style='border: 1px solid #1e7e34;'>"),  # Horizontal line for separation
     eda_pre_reading_content],
    layout=widgets.Layout(
        border="2px solid #1e7e34",
        padding="20px",
        width="90%",
        margin="20px 0px"
    )
)

# Display the styled box
display(styled_eda_box)



NOTE: The file linked below may take a few minutes to load because of the size of the datasets used in this Case Study.
<a href="case-files/ailtk-running-code-case2.ipynb" target="_blank">(Click here to open Solution: Case Study 2 in Visual Studio Code)</a>
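
The data relationships listed in the pre-reading material (`business_id` links businesses to reviews, `user_id` links reviews to users) can be sketched as two joins. This is a minimal pandas illustration, not the Apache Hop pipeline itself; the sample rows are made up, and only the column names come from the schema overview.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the three CSV files (column names from the schema overview).
business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "name": ["Taco Spot", "Noodle Bar"],
    "categories": ["Mexican, Restaurants", "Ramen, Restaurants"],
})
review = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "user_id": ["u1", "u2", "u1"],
    "business_id": ["b1", "b1", "b2"],
    "stars": [5, 3, 4],
})
user = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "name": ["Ana", "Ben"],
})

# reviews -> businesses via business_id, then reviews -> users via user_id.
# Both lookups carry a "name" column, so suffixes keep them apart.
# validate="many_to_one" raises if a key is duplicated on the lookup side.
joined = (
    review
    .merge(business, on="business_id", how="left", validate="many_to_one")
    .merge(user, on="user_id", how="left",
           suffixes=("_business", "_user"), validate="many_to_one")
)
print(joined[["review_id", "name_business", "name_user", "stars"]])
```

In the full dataset, `stars` and `review_count` also appear in more than one file, so the same suffix approach (or renaming before the join) would likely be needed there as well.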

---

#### **Case Scenario:**

> With your exploratory data analysis (EDA) from Activity 2 complete, Welp now requires the transformation of the analyzed data into a format suitable for the AI-powered virtual assistant. Management envisions a virtual agent capable of personalized **restaurant and bar** recommendations, and this step is crucial to ensure the data is usable for the virtual assistant’s recommendation engine.
>
> Your role as an AI developer is to process and structure the dataset so that the virtual assistant can use it effectively. This includes transforming the raw data from the Yelp dataset into a format that captures essential features such as restaurant names, cuisine types, user ratings, review highlights, ambiance descriptions, and other relevant data, while ensuring compatibility with the AI system's needs.
>
> The transformation process will require you to clean the data, handle missing values, and organize the dataset in a way that the AI virtual assistant can access and interpret easily. The dataset should be formatted to allow the assistant to:
> - Filter restaurants based on user preferences such as cuisine type, dietary restrictions, location, ambiance, and more.
> - Provide detailed restaurant information including user reviews and highlights, to enhance the personalized recommendations.
> - Offer recommendations that prioritize the most relevant and trusted data based on the user’s historical preferences and current inputs.
>
> Management has specifically requested the output to include:
> - Cleaned and structured data, with features like cuisine type, location, and ambiance clearly labeled and formatted.
> - An enriched dataset that includes not only basic restaurant information but also sentiment analysis or summary statistics derived from user reviews.
> - A consistent structure suitable for easy integration into the virtual assistant system, with the possibility to be expanded as the project grows.
>
> Your tasks are as follows:
>
> (a) **Identify the features and attributes that need to be included in the dataset for use by the virtual assistant.** These features should directly support the AI’s ability to provide personalized recommendations.
>
> (b) **Process and clean the data**, ensuring that it is in a structured and consistent format. Handle any missing or incomplete information, normalize textual data, and prepare the dataset for AI consumption.
>
> (c) **Prepare the dataset for integration with the virtual assistant**, ensuring that the dataset format is compatible with machine learning models and can be easily accessed by the recommendation engine. Export the dataset into a file format suitable for training and real-time recommendation (e.g., CSV, Excel).
>
> By completing these tasks, you will gain hands-on experience in data preprocessing, cleaning, and structuring data for AI-based systems, preparing the dataset for deployment in a personalized recommendation system.
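
As a rough illustration of tasks (b) and (c), the sketch below cleans a few rows, derives per-business summary statistics from reviews, and exports the result as CSV. It uses pandas rather than Apache Hop, and everything in it is an assumption made for the example: the sample rows, the `city` column, the `"Unknown"` placeholder, and the output column names `avg_stars` and `n_reviews`.

```python
import io
import pandas as pd

# Hypothetical sample rows; real data would come from the Yelp CSV files.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["  Taco Spot ", "Noodle Bar", None],
    "categories": ["Mexican, Restaurants", None, "Bars, Nightlife"],
    "city": ["Tucson", "tucson ", "Reno"],
})
review = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3"],
    "stars": [5, 3, 4, 2],
})

# (b) Clean: drop rows without a name, fill missing categories, normalize text.
clean = business.dropna(subset=["name"]).copy()
clean["name"] = clean["name"].str.strip()
clean["categories"] = clean["categories"].fillna("Unknown")
clean["city"] = clean["city"].str.strip().str.title()

# Enrich: per-business summary statistics derived from reviews.
stats = (
    review.groupby("business_id")["stars"]
    .agg(avg_stars="mean", n_reviews="count")
    .reset_index()
)
enriched = clean.merge(stats, on="business_id", how="left")

# (c) Export: written to an in-memory buffer here; pass a file path in practice.
buffer = io.StringIO()
enriched.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Whether to drop or impute incomplete rows is a design choice; dropping rows with no name is used here only because a nameless business cannot be recommended.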


- [Open Apache Hop](../ailtk_learning-management-module/learning-files/ailtk-apachehop-howto.ipynb)

---

##### Answer the following:

In [14]:
import ipywidgets as widgets
from IPython.display import display, clear_output

# Define questions and options
questions = [
    {
        "question": "SAMPLE QUESTION ONLY: What is the main problem that the virtual assistant for Welp is trying to solve?",
        "options": [
            "Difficulty in finding trustworthy reviews for restaurants.",
            "Decision fatigue caused by too many dining options and lack of personalized filtering.",
            "Lack of information about restaurant locations.",
            "Inconsistent pricing across restaurants."
        ],
        "answer": "Decision fatigue caused by too many dining options and lack of personalized filtering."
    },
]


# Widgets for questions
quiz_widgets = []
for i, q in enumerate(questions):
    question_label = widgets.Label(value=f"Q{i+1}: {q['question']}")
    options = widgets.RadioButtons(
        options=q['options'],
        description='',
        disabled=False,
        value=None,
        layout=widgets.Layout(width='90%', height='auto')  # Ensures proper layout for longer options
    )
    quiz_widgets.append((question_label, options))

# Button to submit answers
submit_button = widgets.Button(description="Submit Answers", button_style="primary")
output = widgets.Output()

# Flag to track if the error message is already displayed
error_displayed = False

# Define button click event
def on_submit_click(b):
    global error_displayed
    # Disable the submit button
    submit_button.disabled = True
    # Clear only the quiz feedback area; a bare clear_output() here can wipe the cell's widgets
    output.clear_output(wait=True)
    unanswered = False
    score = 0

    # Check if all questions are answered
    unanswered = any(options.value is None for _, options in quiz_widgets)

    with output:
        if unanswered:
            if not error_displayed:  # Only display the error if it hasn't been shown already
                error_displayed = True
                # Display error message in red
                display(widgets.HTML(
                    '<p style="color: red; font-weight: bold;">Please answer all the questions before submitting.</p>'
                ))
            submit_button.disabled = False  # Re-enable button if there's an error
        else:
            error_displayed = False  # Reset the flag if all questions are answered
            submit_button.button_style = ""  # Reset button style to default after click
            # Calculate score
            for i, (label, options) in enumerate(quiz_widgets):
                user_answer = options.value
                correct_answer = questions[i]["answer"]
                if user_answer == correct_answer:
                    score += 1
                print(f"Q{i+1}: {questions[i]['question']}")
                print(f"  - Your answer: {user_answer}")
                print(f"  - Correct answer: {correct_answer}")
                print()

            print(f"You scored {score}/{len(questions)}! ({(score / len(questions)) * 100:.2f}%)")
            
            # Show Continue or Try Again button based on score
            if score >= 0.8 * len(questions):
                continue_button = widgets.HTML(
                    '<a href="case-study-4.ipynb" style="display: inline-block; padding: 10px 15px; '
                    'background-color: #28a745; color: white; text-decoration: none; border-radius: 5px;">'
                    'Continue</a>'
                )
                display(continue_button)
            else:
                try_again_button = widgets.HTML(
                    '<a href="case-study-3.ipynb" style="display: inline-block; padding: 10px 15px; '
                    'background-color: #dc3545; color: white; text-decoration: none; border-radius: 5px;">'
                    'Score at least 80% to continue. Try Again</a>'
                )
                display(try_again_button)

# Attach event to the submit button
submit_button.on_click(on_submit_click)

# Display the quiz
for label, options in quiz_widgets:
    display(label, options)
display(submit_button, output)



##### Questions

**a. How can data transformation enhance recommendation accuracy?**

It ensures consistent, relevant, and structured inputs, enabling the model to learn and generalize better.

**b. What are the challenges of integrating multiple data sources?**

Disparate formats, large sizes, and missing or inconsistent data can complicate integration.

**c. How can the recommendation system leverage EDA findings?**

Insights like cuisine popularity, user sentiment trends, and ambiance preferences can fine-tune the assistant’s filtering criteria.

**d. What tools or techniques will ensure scalability for large datasets?**

Chunking, parallel processing, and database storage (e.g., SQLite or PostgreSQL) can handle large datasets effectively.
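
One way the chunking and database storage mentioned in answer (d) might look is sketched below: a CSV is read in fixed-size chunks with pandas and appended to a SQLite table, and the aggregation then runs inside the database. The generated CSV text, the chunk size of 4, and the table name `reviews` are assumptions chosen to keep the example small.

```python
import io
import sqlite3
import pandas as pd

# Stand-in for a large review CSV; in practice pass the file path to read_csv.
csv_text = "review_id,business_id,stars\n" + "\n".join(
    f"r{i},b{i % 3},{1 + i % 5}" for i in range(10)
)

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database

# Read 4 rows at a time so the full file never has to fit in memory.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk.to_sql("reviews", conn, if_exists="append", index=False)

# Aggregate inside the database rather than in pandas.
summary = pd.read_sql_query(
    "SELECT business_id, COUNT(*) AS n_reviews, AVG(stars) AS avg_stars "
    "FROM reviews GROUP BY business_id ORDER BY business_id",
    conn,
)
print(summary)
conn.close()
```

For the real review file a much larger `chunksize` (tens or hundreds of thousands of rows) would be more typical; the right value depends on available memory.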