# Practice Learning Activity 2: Source and investigate usable data sources 

#### **Case Scenario:** 
> Provided to you are access to view the SQL Product Database which includes Coffee bean information (e.g., origin, roast, flavor profile, recommended brew method), and brewing method recommendation. You were also specifically instructed to use a compilation of online resources already provided to you in a .csv file, including videos and online articles of the brand's endorsers, as basis for the coffee and brewing guidance. You are expected to perform exploratory data analysis (EDA) on the datasets provided in order to see how what features can be used later on for fine-tuning. 
>
> In your EDA, you are expected to get an overview of the categorical, ordinal, and interval variables present in the datasets and identify how they may later be used for the fine-tuning of a large-language model (LLM) instance later in this toolkit.

Sourcing and investigating usable data sources involves identifying relevant data that can fine-tune the LLM to ensure the agent’s responses and recommendations have accurate and comprehensive information to interact with users effectively. Developers must be adept at evaluating and selecting the right data sources to maximize virtual agent performance, making it more reliable and relevant in addressing user queries and providing tailored assistance.

Data is what powers AI models. The quality and quantity of your data directly impact the accuracy and performance of your AI applications.

---

Write down your answers somewhere you can refer to later. You can make a copy of the template for this activity found [here (redirects to a link)](https://docs.google.com/document/d/1jl664PnyPubTO61Ge-S68rR2NzmvRmuSdsm1YLJU4Yc/edit?usp=sharing).

### Pre-requisites: 
- [Ensure MySQL is running](../learning-files/ailtk-mysql-howto.ipynb).
- [Be able to run the code provided below using Visual Studio Code](../learning-files/ailtk-running-python-with-vscode.ipynb).

### (a) Use Python and Jupyter Notebooks to perform exploratory data analysis on the SQL database

1. [Access Visual Studio Code and the Jupyter Notebook prepared for this Practice Learning Activity (Click here to open)](test)

2. Run the code segment below in a Python code cell to import the necessary Python modules by clicking the button to the left of the cell. 

In [93]:
# Code segment
import ipywidgets as widgets
from IPython.display import display

# Define the Python code you want users to copy
code_snippet = """
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy
from sqlalchemy import inspect
import warnings
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Ignore warnings
warnings.filterwarnings('ignore')
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value="\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as …

3. Upon running a Python code cell for the first time, Visual Studio Code will prompt you to choose kernel source. Select `Python Environments`. 

    ![image.png](attachment:image.png)

- These have been installed already on the virtual machine under the `ailtk-env` virtual environment.

    ![image-2.png](attachment:image-2.png)

4. Afterward you should see a green check mark on the bottom left of the code cell as such: 

    ![image.png](attachment:image.png)

5. Now that the necessary Python modules are uploaded we can now proceed to wokring with the MySQL Database from our scenario. Here are the credentials that you can use to login to the MySQL server:

      | **Username:** | `ailtk-learner`|
      | **Password:** | `DLSU1234!`    |

   - The database for this practice learning activity is `ailtk_db`.

   Run the code segment below first establish connection to MySQL server and retrieve the databse's table names

In [94]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Connect to MySQL Database
engine = sqlalchemy.create_engine('mysql+pymysql://ailtk-learner:DLSU1234!@localhost:3306/ailtk_db')

# Inspect the database to get the table names
inspector = inspect(engine)
table_names = inspector.get_table_names()

# Print the table names
print("Tables in the database:", table_names)
"""
# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Connect to MySQL Database\nengine = sqlalchemy.create_engine(\'mysql+pymysql://ailtk-learn…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

*What tables can you see in the database?*

In [95]:
# Create input text box
input_box1 = widgets.Textarea(
    placeholder='What tables can you see in the database? Type your answer here...',
    description='Answer:',
    layout=widgets.Layout(width='400px')
)

# Create button
submit_button1 = widgets.Button(
    description="Submit",
    button_style='primary',  # Optional: styling
)

# Create output widget
output1 = widgets.Output()

# Define the button click event
def on_submit_click(b):
    # Clear previous output
    output1.clear_output()
    
    # Access the input text and generate an answer
    question = input_box1.value
    answer = f"""
    From the output we can see we have six tables in the database selected:
    'products_beans', 'products_beans_origins', 'products_beans_reviews',
    'roasters', 'roasters_countries', and 'roasts'.
    
    Let's go over them more.
    """
    
    # Display the answer in the output widget
    with output1:
        print(answer)

# Set the button's on-click function
submit_button1.on_click(on_submit_click)

# Display the widgets
display(input_box1, submit_button1, output1)

Textarea(value='', description='Answer:', layout=Layout(width='400px'), placeholder='What tables can you see i…

Button(button_style='primary', description='Submit', style=ButtonStyle())

Output()

6.  Run the code segment below in a Python code cell to load data and print basic statistics for our Exploratory Data Analysis for each of the tables.
    - Inspect the content of the SQL tables by printing the head of the data. This can be done by loading the table into a pandas dataframe and using the built-in `df.head()` function.

In [96]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename to select
table_name = "products_beans"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename to select\ntable_name = "products_beans"\n\n# Load the table into a D…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

In [97]:
# Create input text box
input_box2 = widgets.Textarea(
    placeholder='(a) Define the Problem: Type your answer here...',
    description='Answer:',
    layout=widgets.Layout(width='400px')
)

# Create button
submit_button2 = widgets.Button(
    description="Submit",
    button_style='primary',  # Optional: styling
)

# Create output widget
output2 = widgets.Output()

# Define the button click event
def on_submit_click(b):
    # Clear previous output
    output2.clear_output()
    
    # Access the input text and generate an answer
    question = input_box2.value
    answer = f"""
    From the head of the table `products_beans`, we can see it includes
    columns for the table's primary key, the bean product's name, roast type
    roaster, and origin. The presence of foreign key columns (with the other
    ids referenced) indicates that the table can be linked to other tables
    containing more detailed information about roast types, roasters, and
    origins of the coffee products:
    - products_beans_id: Product identifier used to link other
        product-related information.
    - name: The name of the product (e.g., “Sweety Espresso Blend,”
        “Ethiopia Shakiso Mormora”). Useful in providing recommendations.
    - roast_id: The type of roast (e.g., light, medium, dark) associated with
        each product.
    - roaster_id: Identifies the coffee roaster, helping the model suggest
        products by a particular roaster if asked.
    - origin_id: Links to the coffee's geographical origin.

    Let's explore those referenced tables as they may provide more information that could
    contribute to the case.
    """
    
    # Display the answer in the output widget
    with output2:
        print(answer)

# Set the button's on-click function
submit_button2.on_click(on_submit_click)

# Display the widgets
display(input_box2, submit_button2, output2)

Textarea(value='', description='Answer:', layout=Layout(width='400px'), placeholder='(a) Define the Problem: T…

Button(button_style='primary', description='Submit', style=ButtonStyle())

Output()

In [98]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename 
table_name = "roasts"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename \ntable_name = "roasts"\n\n# Load the table into a DataFrame\nquery =…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

In [99]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename 
table_name = "products_beans_reviews"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename \ntable_name = "products_beans_reviews"\n\n# Load the table into a Da…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

In [100]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename 
table_name = "products_beans_origins"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename \ntable_name = "products_beans_origins"\n\n# Load the table into a Da…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

In [101]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename 
table_name = "roasters"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename \ntable_name = "roasters"\n\n# Load the table into a DataFrame\nquery…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

*What can we observe the following about the columns of `roasters`?*

In [102]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the tablename 
table_name = "roasters_countries"

# Load the table into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Show first few rows
display(df.head())
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the tablename \ntable_name = "roasters_countries"\n\n# Load the table into a DataFr…

*What can we observe the following about the columns of `roasters_countries`?*

7. Next we look at distributions of categorical data. In this example, categorical variables primarily entail columns of the `product_beans` featuring two or more categories of which have no intrinsic ord
ering. Let's take a look of the distribution of roasters.  *[Review categorical, ordinal and interval variables here.](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/)*

- We can first look at the number of `product_beans` rows in relationship to the `roasters` table, which we seen is referenced through the "roaster_id" column in the former table.

In [103]:
# Display distrubution of roasters code snippet

# Define the Python code you want users to copy
code_snippet = """
# Load the 'roasters' table
query = "SELECT * FROM roasters"
roasters_df = pd.read_sql(query, engine)

# Load the 'roasters_countries' table
query = "SELECT * FROM roasters_countries"
countries_df = pd.read_sql(query, engine)

# Show first few rows
display(roasters_df.head())
display(countries_df.head())

# Calculate total number of roasters and unique roasters
total_roasters = roasters_df['roaster_id'].count()
unique_roasters = roasters_df['roaster'].nunique()

#Display number of roasters and unique roasters
print("Total number of roasters:", total_roasters)
print("Total number of roasters:", unique_roasters)
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='550px', height='350px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Load the \'roasters\' table\nquery = "SELECT * FROM roasters"\nroasters_df = pd.read_sql(q…

> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)

We can observe that every product's roaster is unique

7. *(cont.)* We can look at the `product_beans` and `roasters` further by adding the dimension of the `roasters_countries`, which are also present in the MySQL database. Observe the distribution of the total number of `roasters` and `product_beans` by `roasters_countries` by running the code below. 

In [105]:
# Display distrubution of roasters code snippet

# Define the Python code you want users to copy
code_snippet = """
# Merge roasters with countries to get country names
roasters_with_countries = pd.merge(roasters_df, countries_df, on="country_id", how="left")

# Calculate number of roasters per country
roasters_per_country = roasters_with_countries.groupby("roaster_country")['roaster_id'].count().reset_index()
roasters_per_country.columns = ['Country', 'Number of Roasters']

# Separate countries with more than 1 roaster and those with exactly 1 roaster
multiple_roasters = roasters_per_country[roasters_per_country['Number of Roasters'] > 1]
single_roasters = roasters_per_country[roasters_per_country['Number of Roasters'] == 1]

# Sum the single-roaster countries and create an "Other Countries" row
other_countries_count = single_roasters['Number of Roasters'].sum()
other_countries_row = pd.DataFrame({'Country': ['Other Countries \n (Appearing only once)'], 'Number of Roasters': [other_countries_count]})

# Combine the multiple_roasters and other_countries_row DataFrames
final_roasters_per_country = pd.concat([multiple_roasters, other_countries_row], ignore_index=True)

# Display results
print("Total number of roasters:", total_roasters)
print("Total number of unique roasters:", unique_roasters)
print("\nNumber of roasters per country (with 'Other Countries' grouped):")
display(final_roasters_per_country)

# Plotting the number of roasters per country
plt.figure(figsize=(10, 6))
sns.barplot(data=final_roasters_per_country, x="Country", y="Number of Roasters")
plt.title("Number of Roasters per Country")
plt.xticks(rotation=45)
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='1000px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Merge roasters with countries to get country names\nroasters_with_countries = pd.merge(roa…

7. *(cont.)* Next, we can get an overview of the "origin_id" column by running the code cell below to visualize the relationship between `product_beans` and `product_beans_origins`. 

In [106]:
# Display distrubution of roasters and origin

# Define the Python code you want users to copy
code_snippet = """
# Merge roasters with countries to get country names
roasters_with_countries = pd.merge(roasters_df, countries_df, on="country_id", how="left")

# Calculate number of roasters per country
roasters_per_country = roasters_with_countries.groupby("roaster_country")['roaster_id'].count().reset_index()
roasters_per_country.columns = ['Country', 'Number of Roasters']

# Separate countries with more than 1 roaster and those with exactly 1 roaster
multiple_roasters = roasters_per_country[roasters_per_country['Number of Roasters'] > 1]
single_roasters = roasters_per_country[roasters_per_country['Number of Roasters'] == 1]

# Sum the single-roaster countries and create an "Other Countries" row
other_countries_count = single_roasters['Number of Roasters'].sum()
other_countries_row = pd.DataFrame({'Country': ['Other Countries \n (Appearing only once)'], 'Number of Roasters': [other_countries_count]})

# Combine the multiple_roasters and other_countries_row DataFrames
final_roasters_per_country = pd.concat([multiple_roasters, other_countries_row], ignore_index=True)

# Display results
print("Total number of roasters:", total_roasters)
print("Total number of unique roasters:", unique_roasters)
print("\nNumber of roasters per country (with 'Other Countries' grouped):")
display(final_roasters_per_country)

# Plotting the number of roasters per country
plt.figure(figsize=(10, 6))
sns.barplot(data=final_roasters_per_country, x="Country", y="Number of Roasters")
plt.title("Number of Roasters per Country")
plt.xticks(rotation=45)
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='1000px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Merge roasters with countries to get country names\nroasters_with_countries = pd.merge(roa…

7. *(cont.)* Let's explore origins and and roasting country to see if they're redundant or if they present different data. Run the code below in a Python cell.

In [107]:
# Display distrubution of origins and 

# Define the Python code you want users to copy
code_snippet = """
# Load tables from the database
products_beans = pd.read_sql("SELECT * FROM products_beans", engine)
roasters = pd.read_sql("SELECT * FROM roasters", engine)
roasters_countries = pd.read_sql("SELECT * FROM roasters_countries", engine)
beans_origins = pd.read_sql("SELECT * FROM products_beans_origins", engine)

# Merge the beans with origin and roaster country data
beans_with_origin_country = pd.merge(products_beans, beans_origins, on="origin_id", how="left")
beans_with_origin_country = pd.merge(beans_with_origin_country, roasters, on="roaster_id", how="left")
beans_with_origin_country = pd.merge(beans_with_origin_country, roasters_countries, on="country_id", how="left")

# Scenario 1: Coffee Beans with Similar Origins but Different Roaster Countries

# Group by 'origin' and 'roaster_country' and count the products
origin_country_summary = beans_with_origin_country.groupby(['origin', 'roaster_country']).size().reset_index(name='Count')

# Filter for origins that have multiple roaster countries
origins_with_multiple_roasters = origin_country_summary.groupby('origin').filter(lambda x: x['roaster_country'].nunique() > 1)

# Plotting Scenario 1
plt.figure(figsize=(12, 6))
sns.barplot(data=origins_with_multiple_roasters, x="origin", y="Count", hue="roaster_country")
plt.title("Coffee Beans with Similar Origins but Different Roaster Countries")
plt.xticks(rotation=45)
plt.show()

# Scenario 2: Coffee Beans with Similar Roaster Countries but Different Origins

# Group by 'roaster_country' and 'origin' and count the products
country_origin_summary = beans_with_origin_country.groupby(['roaster_country', 'origin']).size().reset_index(name='Count')

# Filter for roaster countries that have multiple origins
roasters_with_multiple_origins = country_origin_summary.groupby('roaster_country').filter(lambda x: x['origin'].nunique() > 1)

# Plotting Scenario 2
plt.figure(figsize=(12, 6))
sns.barplot(data=roasters_with_multiple_origins, x="roaster_country", y="Count", hue="origin")
plt.title("Coffee Beans with Similar Roaster Countries but Different Origins")
plt.xticks(rotation=45)
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='1000px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Load tables from the database\nproducts_beans = pd.read_sql("SELECT * FROM products_beans"…

*What can we make out of these graphs?*

The origin ("origin_id" column under `product_beans`) of coffee beans are separate entities from the country in which it is roasted (`roasters_countries column` under `roasters`). Given that, we must ensure that both data are included in our data for fine-tuning our LLM instance later on and are not confused with each other. We must also be knowledgable of the data set at hand in order to make clear prompts that allow the fine-tuned LLM to give proper and accurate responses.

8. Another table with data that will serve useful to the project is the `product_bean_reviews` Rather than being referenced by `product_beans` like our previous two tables of interset, `product_beans_reviews` references `product_beans`. Logically, this may imply that there are `product_beans_reviews` rows referring to the same `product_beans` entry. Although longer than the previous examples of categorical data, the "description" field containing multi-line entries of the flavor profiles of the specific `product_beans` rows they refer to.

Run the code cell below to visualize the char_count, word_count, mean_word_length, and mean_sent_length of the "description" column's entries.

In [119]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the table name
table_name = "products_beans_reviews"

# Load the table's data into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

from nltk import tokenize

nltk.download('punkt_tab')

# Character count
df['char_count'] = df['description'].str.len()

# Word count
def word_count(text):
    return len(text.split())

df['word_count'] = df['description'].apply(word_count)

# Mean word length
df['mean_word_length'] = df['description'].apply(lambda rev: np.mean([len(word) for word in rev.split()]))


# Mean sentence length
df['mean_sent_length'] = df['description'].apply(lambda rev: np.mean([len(sent) for sent in tokenize.sent_tokenize(rev)]))

def visualize(col):
    plt.figure(figsize=(15, 6))
    
    # Boxplot
    plt.subplot(1, 2, 1)
    sns.boxplot(data=df, y=col, x='Rating')
    plt.ylabel(col)
    plt.title(f'{col} by Rating')

    # KDE Plot
    plt.subplot(1, 2, 2)
    sns.kdeplot(data=df, x=col, hue='Rating')
    plt.title(f'Distribution of {col} by Rating')
    
    plt.tight_layout()
    plt.show()

# Columns to visualize
features = ['char_count', 'word_count', 'mean_word_length', 'mean_sent_length']
for feature in features:
    visualize(feature)

"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='1200px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Define the table name\ntable_name = "products_beans_reviews"\n\n# Load the table\'s data i…

- We can dive deeper into the free form text data under reviews beyond statistics for length. We can perform a sentiment analysis to see [importance of sentiment analysis]

In [120]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Sentiment analysis

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon', quiet=True)

# Initialize sentiment analyzer
sid = SentimentIntensityAnalyzer()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='550px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)


Textarea(value="\n# Sentiment analysis\n\n# Download the VADER lexicon for sentiment analysis\nnltk.download('…

In [123]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Apply VADER sentiment analysis on the description column
df['sentiment_score'] = df['description'].apply(lambda x: sid.polarity_scores(x)['compound'])

# Plotting sentiment scores
plt.figure(figsize=(10, 6))
sns.histplot(df['sentiment_score'], bins=30, kde=True)
plt.title("Sentiment Score Distribution")
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='1000px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value='\n# Apply VADER sentiment analysis on the description column\ndf[\'sentiment_score\'] = df[\'d…

*A note on sentiment scores:*
- Positive values indicate positive sentiment.
- Negative values indicate negative sentiment.
- Values close to zero indicate neutral sentiment.

9. Going back to the `product_beans` table we can first observe the distribution of the `roasts`. From our preliminary inspection we can see that the `roasts` table contains for entries, namely: Medium-Light, Medium, Light, Medium-Dark, Dark. This makes it an example of ordinal data. 

- This can easily be done using the Pandas and Seaborn Python libraries imported earlier. Run the code to produce a graph of `roasts` distribution: 


10. We tackle the interval (numerical) variables in our data. In this case it is the ratings contained the the ratings found in the `product_bean_reviews` table. We can plot a box plot to get an overview of the distribution of the ratings as well as any outliers that may need to be taken note of.



In [None]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Define the table name
table_name = "products_beans_reviews"

# Load the table's data into a DataFrame
query = f"SELECT * FROM {table_name}"
df = pd.read_sql(query, engine)

# Display boxplots of ratings
plt.figure(figsize=(10, 6))
sns.boxplot(df['rating'])
plt.title("Boxplot of Ratings")
plt.xlabel("Rating")
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

---

### (b) Perform exploratory data analysis on the .csv file provided

1. First load the .csv file provided into a Python dataframe and display the head of the dataset. Run the Python code cell below in order to do so. 

In [112]:
# df_csv = pd.read_csv('../learning-files/coffeepro-online-resources-exported.csv')


> Your output should look something like this:
> > ![placeholder.png](attachment:placeholder.png)


*From the output we can see that the `coffeepro-online-resources-exported.csv` file contains categorical columns*

In [115]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of content types
plt.figure(figsize=(8, 6))
sns.countplot(data=df_csv, x='Type')
plt.title('Distribution of Content Types (Video vs Article)')
plt.xlabel('Type of Content')
plt.ylabel('Count')
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value="\nimport matplotlib.pyplot as plt\nimport seaborn as sns\n\n# Plot the distribution of content…

In [116]:
# Code Segment

# Define the Python code you want users to copy
code_snippet = """
# Bar plot for top products
plt.figure(figsize=(10, 6))
common_products.plot(kind='bar', color='lightgreen')
plt.title('Top 10 Coffee Products Featured')
plt.xlabel('Product')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.show()
"""

# Create a TextArea widget to display the code
code_widget = widgets.Textarea(
    value=code_snippet,
    placeholder='Python code',
    description='Code:',
    disabled=True,  # Disable editing to make it read-only
    layout=widgets.Layout(width='500px', height='250px')  # Adjust size as needed
)

# Display the widget
display(code_widget)

Textarea(value="\n# Bar plot for top products\nplt.figure(figsize=(10, 6))\ncommon_products.plot(kind='bar', c…


*What does the `coffeepro-online-resources-exported.csv` file focus on content wise?*

1. Product
The name of coffee-related equipment or accessory, which can help the LLM recommend specific items for brewing.
2. Content Focus
Describes the key focus of the resource (e.g., "Step-by-step brewing instructions," "Tips for achieving optimal flavor"). This can help the LLM provide useful brewing tips or product guidance.
3. Online Resource
URLs linking to additional resources such as instructional videos or articles. These could be used to suggest supplementary learning resources to users.
4. Type
The type of online resource (e.g., video, blog), helpful when users are looking for a specific kind of content (e.g., video tutorials).
5. Content Summary
Summarizes the content in the resource, which can help the LLM generate concise answers to user queries or offer step-by-step guidance based on detailed information.
Features for Fine-Tuning: Product, Content Focus, Type, Content Summary

---

### In your own words, how would you describe the data provided in the case?

### What can you conclude from your exploratory data analysis? Can the data's features can we use for fine-tuning later on (i.e. input columns,  output columns)? How?



##### *[Click here to view the sample solution](sample-solutions/sample-solution-2.ipynb)*

#### [ Back to Learning Instructions 2](../learning-instructions-2.ipynb)