## Case 1 - Create Dataset with Specific Criteria

"create a dataframe in python with random sales for a period of 1 month. I want the Columns Name Surname, Age, Quantity, Budget, Date of Purchase and Store Name. Purchase date cannot be greater than today and men should be 70% of the dataset. Total rows 100."

In [6]:
%pip install --upgrade pip setuptools wheel
%pip install pandas numpy plotly
%pip install --upgrade nbformat

Note: you may need to restart the kernel to use updated packages.

In [5]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Sample data for names and stores
male_names = ['John', 'Michael', 'David', 'James', 'Robert', 'William', 'Richard']
female_names = ['Mary', 'Patricia', 'Linda', 'Barbara', 'Elizabeth', 'Jennifer', 'Maria']
surnames = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller']
stores = ['Store A', 'Store B', 'Store C', 'Store D']

num_rows = 100
num_men = int(num_rows * 0.7)
num_women = num_rows - num_men

# Generate gender list
genders = ['M'] * num_men + ['F'] * num_women
random.shuffle(genders)

# Generate data
names = [random.choice(male_names if g == 'M' else female_names) for g in genders]
surnames_list = [random.choice(surnames) for _ in range(num_rows)]
ages = np.random.randint(18, 65, size=num_rows)
quantities = np.random.randint(1, 10, size=num_rows)
budgets = np.round(np.random.uniform(10, 500, size=num_rows), 2)
today = datetime.today()
start_date = today - timedelta(days=30)
dates = [start_date + timedelta(days=random.randint(0, (today - start_date).days)) for _ in range(num_rows)]
store_names = [random.choice(stores) for _ in range(num_rows)]

# Create DataFrame
df = pd.DataFrame({
    'Name': names,
    'Surname': surnames_list,
    'Age': ages,
    'Quantity': quantities,
    'Budget': budgets,
    'Date of Purchase': [d.date() for d in dates],
    'Store Name': store_names
})

df

Unnamed: 0,Name,Surname,Age,Quantity,Budget,Date of Purchase,Store Name
0,Robert,Williams,42,4,277.36,2025-06-04,Store C
1,Barbara,Garcia,59,3,93.29,2025-06-09,Store B
2,James,Williams,50,2,388.44,2025-06-12,Store C
3,Richard,Johnson,38,6,478.46,2025-06-04,Store D
4,David,Williams,22,6,286.99,2025-07-01,Store A
...,...,...,...,...,...,...,...
95,Michael,Johnson,63,4,146.10,2025-06-16,Store B
96,Patricia,Smith,23,6,14.06,2025-06-29,Store C
97,Michael,Jones,38,9,81.85,2025-06-16,Store C
98,Linda,Williams,45,4,447.93,2025-06-08,Store C


In [8]:
from collections import Counter

# Validate the rules for the generated dataset


validation_results = {}

# 1. Check total rows
validation_results['total_rows'] = len(df) == 100

# 2. Check columns
expected_columns = {'Name', 'Surname', 'Age', 'Quantity', 'Budget', 'Date of Purchase', 'Store Name'}
validation_results['columns'] = set(df.columns) == expected_columns

# 3. Check men are 70% of dataset
gender_count = Counter(genders)
validation_results['men_70_percent'] = abs(gender_count['M'] / len(genders) - 0.7) < 0.01

# 4. Check purchase date is not greater than today
df_dates = pd.to_datetime(df['Date of Purchase'])
validation_results['dates_not_future'] = all(df_dates <= today.date())

# 5. Check date range is within 1 month
validation_results['dates_within_1_month'] = (df_dates.max() - df_dates.min()).days <= 31

validation_results

TypeError: Invalid comparison between dtype=datetime64[ns] and date

In [9]:
from collections import Counter

# Validate the rules for the generated dataset


validation_results = {}

# 1. Check total rows
validation_results['total_rows'] = len(df) == 100

# 2. Check columns
expected_columns = {'Name', 'Surname', 'Age', 'Quantity', 'Budget', 'Date of Purchase', 'Store Name'}
validation_results['columns'] = set(df.columns) == expected_columns

# 3. Check men are 70% of dataset
gender_count = Counter(genders)
validation_results['men_70_percent'] = abs(gender_count['M'] / len(genders) - 0.7) < 0.01

# 4. Check purchase date is not greater than today
df_dates = pd.to_datetime(df['Date of Purchase'])
today_ts = pd.Timestamp(datetime.now())  # or: today_ts = pd.Timestamp.today()

validation_results['dates_not_future'] = (df_dates <= today_ts).all()

# 5. Check date range is within 1 month
validation_results['dates_within_1_month'] = (df_dates.max() - df_dates.min()).days <= 31

validation_results

{'total_rows': True,
 'columns': True,
 'men_70_percent': True,
 'dates_not_future': np.True_,
 'dates_within_1_month': True}

### notes/findings:
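- A pip install step (pandas, numpy, plotly, nbformat) was required before the code would run.
- The model understood the prompt and created the dataset as requested.
- When asked for a validation script, it made a logic error: it compared a datetime64[ns] Series against a plain date object, which raised a TypeError. A manual fix (comparing against a pd.Timestamp instead) was needed.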
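
The failing check can be reproduced and avoided in a few lines. The following is a minimal sketch, assuming pandas 2.x and using made-up dates; the names `purchase_dates` and `cutoff` are illustrative and do not come from the notebook.

```python
import pandas as pd
from datetime import date, timedelta

# Illustrative data: three purchase dates ending today (not the notebook's data).
purchase_dates = pd.Series([date.today() - timedelta(days=n) for n in range(3)])

# pd.to_datetime turns the date objects into a datetime64[ns] Series.
as_timestamps = pd.to_datetime(purchase_dates)

# Comparing that Series with a plain datetime.date is what raised the TypeError.
# Wrapping the date in pd.Timestamp gives both sides a compatible type.
cutoff = pd.Timestamp(date.today())
no_future_dates = bool((as_timestamps <= cutoff).all())
print(no_future_dates)
```

Wrapping the result in `bool(...)` also avoids the `np.True_` value that appears in the validation output.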


## Case 2 - Data Manipulation


Now I want to add +50 in the quantities of the woman and - 50 from the quantities of the men. Also create a new column which shows the rate quantity/budget. Finally add the country in a new column after the store name. Note that The same name cannot be in 2 different countries  

In [10]:
df_orig = df.copy()

# Adjust quantities: +50 for women, -50 for men
df['Quantity'] = df['Quantity'] + [50 if g == 'F' else -50 for g in genders]

# Create rate column: Quantity / Budget
df['Rate'] = df['Quantity'] / df['Budget']

# Assign countries ensuring the same name is not in 2 different countries
countries = ['USA', 'Canada', 'UK', 'Germany']
name_to_country = {}
used_countries = {name: set() for name in set(df['Name'])}

assigned_countries = []
for name in df['Name']:
    if name not in name_to_country:
        # Assign a country not already used for this name
        available = [c for c in countries if c not in used_countries[name]]
        country = random.choice(available)
        name_to_country[name] = country
        used_countries[name].add(country)
    assigned_countries.append(name_to_country[name])

# Insert country column after 'Store Name'
df.insert(df.columns.get_loc('Store Name') + 1, 'Country', assigned_countries)

df.head()

Unnamed: 0,Name,Surname,Age,Quantity,Budget,Date of Purchase,Store Name,Country,Rate
0,Robert,Williams,42,-46,277.36,2025-06-04,Store C,Germany,-0.165849
1,Barbara,Garcia,59,53,93.29,2025-06-09,Store B,UK,0.568121
2,James,Williams,50,-48,388.44,2025-06-12,Store C,UK,-0.123571
3,Richard,Johnson,38,-44,478.46,2025-06-04,Store D,USA,-0.091962
4,David,Williams,22,-44,286.99,2025-07-01,Store A,Canada,-0.153315


In [11]:
# Validate the manipulation by comparing df and df_orig

validation_manipulation = {}

# 1. Check that for women, Quantity increased by 50; for men, decreased by 50
quantity_diff = df['Quantity'].values - df_orig['Quantity'].values
expected_diff = np.array([50 if g == 'F' else -50 for g in genders])
validation_manipulation['quantity_adjustment'] = np.array_equal(quantity_diff, expected_diff)

# 2. Check that Rate column is correct: Quantity / Budget
rate_correct = np.allclose(df['Rate'], df['Quantity'] / df['Budget'])
validation_manipulation['rate_column'] = rate_correct

# 3. Check that Country column exists and same name is not in 2 different countries
country_column_exists = 'Country' in df.columns
name_country_unique = df.groupby('Name')['Country'].nunique().max() == 1
validation_manipulation['country_column_exists'] = country_column_exists
validation_manipulation['name_country_unique'] = name_country_unique

validation_manipulation

{'quantity_adjustment': True,
 'rate_column': True,
 'country_column_exists': True,
 'name_country_unique': np.True_}

### notes/findings
- Perfect implementation; all validation checks passed.
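
The "same name cannot be in 2 different countries" rule boils down to a one-to-one lookup from name to country. A simplified sketch of that idea, using only the standard library, might look like this; `assign_countries` and its `seed` parameter are hypothetical helpers, not part of the generated code.

```python
import random

def assign_countries(names, countries, seed=42):
    """Give every distinct name exactly one country and reuse it on repeats."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    name_to_country = {}
    for name in names:
        if name not in name_to_country:
            name_to_country[name] = rng.choice(countries)
    return [name_to_country[name] for name in names]

names = ["John", "Mary", "John", "Linda", "Mary", "John"]
assigned = assign_countries(names, ["USA", "Canada", "UK", "Germany"])

# Collect the set of countries seen per name; each set should have one element.
countries_per_name = {}
for name, country in zip(names, assigned):
    countries_per_name.setdefault(name, set()).add(country)
print(countries_per_name)
```

Because a name is only ever written into the dictionary once, no extra bookkeeping of "used" countries per name is needed.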

## Case 3 - Complex Data Creation, Validation & Postprocessing

“Create a streaming Python script that produces 100 rows of sales data for the last calendar month, writing each record immediately to output.csv without keeping the entire DataFrame in memory.

Columns: Name (random Greek first names), Surname (random Greek surnames), Age (Poisson with μ=35, clamped to [18,70]), Quantity (Uniform integer 1–20), UnitPrice (Normal distribution per store: Store A→N(10,2), Store B→Uniform(5,15), Store C→Triangular(3,12,6)), Budget = Quantity×UnitPrice rounded to 2 decimals, Date of Purchase (ISO 8601 with +03:00, uniform on weekdays 08:00–20:00, weekends only 10:00–18:00, include Feb 29 if applicable), Store Name (pick from JSON file), Gender (M=70%±0.5%, F rest), with random.seed(42).

Verify at runtime:

- Exactly 70 males (±0.5% error)
- No dates > today (Europe/Athens timezone)
- CSV valid UTF-8, no missing fields
- Memory footprint ≤ 10 MB
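
The gender requirement can be read as a share check with a tolerance. With 100 rows, ±0.5% leaves no room for anything other than exactly 70 males. A small sketch of that check follows; the function name and defaults are assumptions made for illustration.

```python
def share_within_tolerance(count, total, target=0.70, tolerance=0.005):
    """Return True if count/total is within `tolerance` of `target`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return abs(count / total - target) <= tolerance

# With 100 rows, only exactly 70 males satisfies 70% ± 0.5%.
results = {n: share_within_tolerance(n, 100) for n in (69, 70, 71)}
print(results)
```

A check such as `abs(male_count - 70) <= 1` on 100 rows is looser than this: it allows ±1%, not ±0.5%.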

### zero shot generated code

In [None]:
import csv
import json
import random
import numpy as np
from datetime import datetime, timedelta, time
import pytz
import os
import sys

# Set random seed
random.seed(42)
np.random.seed(42)

# Greek names and surnames
greek_first_names = [
    "Giorgos", "Dimitris", "Kostas", "Nikos", "Vasilis", "Giannis", "Panagiotis",
    "Maria", "Eleni", "Katerina", "Vasiliki", "Sofia", "Georgia", "Dimitra"
]
greek_surnames = [
    "Papadopoulos", "Nikolaidis", "Georgiou", "Papanikolaou", "Ioannidis",
    "Konstantinou", "Christodoulou", "Papadakis", "Oikonomou", "Alexiou"
]

# Store names from JSON file (simulate file)
stores_json = '["Store A", "Store B", "Store C"]'
store_names = json.loads(stores_json)

# Gender distribution
num_rows = 100
num_males = 70
num_females = num_rows - num_males
genders = ['M'] * num_males + ['F'] * num_females
random.shuffle(genders)

# Date range: last full calendar month
athens = pytz.timezone('Europe/Athens')
today = datetime.now(athens)
first_of_this_month = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
last_month_end = first_of_this_month - timedelta(days=1)
last_month_start = last_month_end.replace(day=1)

def random_datetime_in_month():
    # Pick a random date in last calendar month
    while True:
        day = random.randint(1, last_month_end.day)
        dt = last_month_start.replace(day=day)
        weekday = dt.weekday()
        if weekday < 5:
            # Weekday: 08:00–20:00
            hour = random.randint(8, 20)
        else:
            # Weekend: 10:00–18:00
            hour = random.randint(10, 18)
        minute = random.randint(0, 59)
        second = random.randint(0, 59)
        dt = dt.replace(hour=hour, minute=minute, second=second)
        dt = athens.localize(dt)
        if dt <= today:
            return dt.isoformat()

def unit_price(store):
    if store == "Store A":
        return max(1, np.random.normal(10, 2))
    elif store == "Store B":
        return np.random.uniform(5, 15)
    elif store == "Store C":
        return np.random.triangular(3, 12, 6)
    else:
        return np.random.uniform(5, 15)

# Prepare CSV
fieldnames = [
    "Name", "Surname", "Age", "Quantity", "UnitPrice", "Budget",
    "Date of Purchase", "Store Name", "Gender"
]
with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for i in range(num_rows):
        gender = genders[i]
        name = random.choice(greek_first_names[:7] if gender == 'M' else greek_first_names[7:])
        surname = random.choice(greek_surnames)
        age = int(np.clip(np.random.poisson(35), 18, 70))
        quantity = random.randint(1, 20)
        store = random.choice(store_names)
        price = round(unit_price(store), 2)
        budget = round(quantity * price, 2)
        date_of_purchase = random_datetime_in_month()
        row = {
            "Name": name,
            "Surname": surname,
            "Age": age,
            "Quantity": quantity,
            "UnitPrice": price,
            "Budget": budget,
            "Date of Purchase": date_of_purchase,
            "Store Name": store,
            "Gender": gender
        }
        # Check for missing fields
        assert all(str(v) != "" for v in row.values())
        writer.writerow(row)

# Validation
with open("output.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = list(reader)
    assert len(rows) == 100
    male_count = sum(1 for r in rows if r["Gender"] == "M")
    assert abs(male_count - 70) <= 1
    for r in rows:
        # Check date not in future
        dt = datetime.fromisoformat(r["Date of Purchase"])
        assert dt <= today
        # Check no missing fields
        assert all(r[k] != "" for k in fieldnames)
    # Check memory usage
    assert sys.getsizeof(rows) < 10 * 1024 * 1024  # <10MB

print("CSV generated and validated successfully.")

UnboundLocalError: cannot access local variable 'hour' where it is not associated with a value

### polished/corrected code

In [18]:
import csv
import json
import random
import numpy as np
from datetime import datetime, timedelta
import pytz
import sys

# Set random seed
random.seed(42)
np.random.seed(42)

# Greek names and surnames
greek_first_names = [
    "Giorgos", "Dimitris", "Kostas", "Nikos", "Vasilis", "Giannis", "Panagiotis",
    "Maria", "Eleni", "Katerina", "Vasiliki", "Sofia", "Georgia", "Dimitra"
]
greek_surnames = [
    "Papadopoulos", "Nikolaidis", "Georgiou", "Papanikolaou", "Ioannidis",
    "Konstantinou", "Christodoulou", "Papadakis", "Oikonomou", "Alexiou"
]

# Store names from JSON file (simulate file)
stores_json = '["Store A", "Store B", "Store C"]'
store_names = json.loads(stores_json)

# Gender distribution
num_rows = 100
num_males = 70
num_females = num_rows - num_males
genders = ['M'] * num_males + ['F'] * num_females
random.shuffle(genders)

# Date range: last full calendar month
athens = pytz.timezone('Europe/Athens')
today = datetime.now(athens)
first_of_this_month = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
last_month_end = first_of_this_month - timedelta(days=1)
last_month_start = last_month_end.replace(day=1)

def random_datetime_in_month():
    """Return an ISO 8601 timestamp within the previous month,
    with hours 08–20 Mon–Fri and 10–18 Sat–Sun, in the Europe/Athens timezone."""
    while True:
        # pick a random day
        day = random.randint(1, last_month_end.day)

        # temporary naive datetime, used only to determine the weekday
        dt0 = datetime(
            year=last_month_start.year,
            month=last_month_start.month,
            day=day
        )
        # choose the hour
        if dt0.weekday() < 5:
            hour = random.randint(8, 20)
        else:
            hour = random.randint(10, 18)
        minute = random.randint(0, 59)
        second = random.randint(0, 59)

        # build a naive datetime that includes the time
        dt_naive = datetime(
            year=last_month_start.year,
            month=last_month_start.month,
            day=day,
            hour=hour,
            minute=minute,
            second=second
        )
        # make it tz-aware
        dt_aware = athens.localize(dt_naive)

        # make sure it is not in the future
        if dt_aware <= today:
            return dt_aware.isoformat()

def unit_price(store):
    if store == "Store A":
        return max(1, np.random.normal(10, 2))
    elif store == "Store B":
        return np.random.uniform(5, 15)
    elif store == "Store C":
        # left=3, mode=6, right=12 so that mode does not exceed right
        return np.random.triangular(3, 6, 12)
    else:
        return np.random.uniform(5, 15)


# Prepare CSV
fieldnames = [
    "Name", "Surname", "Age", "Quantity", "UnitPrice", "Budget",
    "Date of Purchase", "Store Name", "Gender"
]
with open("output.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for i in range(num_rows):
        gender = genders[i]
        name = random.choice(greek_first_names[:7] if gender == 'M' else greek_first_names[7:])
        surname = random.choice(greek_surnames)
        age = int(np.clip(np.random.poisson(35), 18, 70))
        quantity = random.randint(1, 20)
        store = random.choice(store_names)
        price = round(unit_price(store), 2)
        budget = round(quantity * price, 2)
        date_of_purchase = random_datetime_in_month()
        row = {
            "Name": name,
            "Surname": surname,
            "Age": age,
            "Quantity": quantity,
            "UnitPrice": price,
            "Budget": budget,
            "Date of Purchase": date_of_purchase,
            "Store Name": store,
            "Gender": gender
        }
        # Check for missing fields
        assert all(v is not None and str(v) != "" for v in row.values())
        writer.writerow(row)

# Validation
with open("output.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    rows = list(reader)
    assert len(rows) == num_rows
    male_count = sum(1 for r in rows if r["Gender"] == "M")
    assert male_count == num_males  # exactly 70

    for r in rows:
        # Check date not in future (aware vs aware)
        dt = datetime.fromisoformat(r["Date of Purchase"])
        assert dt <= today

        # Check no missing fields
        assert all(r[k] != "" for k in fieldnames)

    # You could measure real memory via psutil, but for simplicity:
    assert sys.getsizeof(rows) < 10 * 1024 * 1024  # <10MB

print("CSV generated and validated successfully.")


CSV generated and validated successfully.
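
The `sys.getsizeof(rows)` check above only measures the list container, not the row contents. One way to get a closer estimate without extra dependencies is `tracemalloc`, which tracks the peak of Python-level allocations. The sketch below assumes a simplified two-column row and writes to an in-memory buffer in place of `output.csv`; `stream_rows` and `write_and_measure` are hypothetical names.

```python
import csv
import io
import random
import tracemalloc

def stream_rows(n_rows, seed=42):
    """Yield one row at a time so the full dataset is never held in memory."""
    rng = random.Random(seed)
    for i in range(n_rows):
        yield {"id": i, "quantity": rng.randint(1, 20)}

def write_and_measure(n_rows=100):
    buffer = io.StringIO()  # stand-in for open("output.csv", "w", ...)
    tracemalloc.start()
    writer = csv.DictWriter(buffer, fieldnames=["id", "quantity"])
    writer.writeheader()
    for row in stream_rows(n_rows):
        writer.writerow(row)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return buffer.getvalue(), peak_bytes

csv_text, peak_bytes = write_and_measure()
within_budget = peak_bytes <= 10 * 1024 * 1024  # 10 MB
print(f"peak traced memory: {peak_bytes} bytes, within budget: {within_budget}")
```

`tracemalloc` reports memory allocated through Python's allocator only, so it is an approximation of the process footprint, not an exact figure.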


### notes/findings

- In the 1st run, the error "ValueError: Not naive datetime (tzinfo is already set)" was raised. On inspection, last_month_start is already a tz-aware datetime (tzinfo=Europe/Athens), whereas localize() expects a naive datetime (one without tzinfo).
- In the 2nd run, the error "ValueError: mode > right" was raised. The call passed mode=12 as the second argument and right=6 as the third, whereas triangular(left, mode, right) requires left ≤ mode ≤ right.
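
The argument order for triangular distributions is easy to get wrong because it differs between libraries: the standard library's `random.triangular` takes `(low, high, mode)`, while `numpy.random.triangular` takes `(left, mode, right)`. Whether this caused the swapped arguments here is an assumption; the prompt itself wrote Triangular(3,12,6). The sketch below uses only the standard library, and `check_triangular_args` is a hypothetical helper that mirrors the NumPy ordering rule.

```python
import random

rng = random.Random(42)
LOW, MODE, HIGH = 3, 6, 12

# Standard library order is (low, high, mode).
stdlib_samples = [rng.triangular(LOW, HIGH, MODE) for _ in range(1000)]

def check_triangular_args(left, mode, right):
    """Validate arguments given in NumPy's (left, mode, right) order."""
    if not (left <= mode <= right):
        raise ValueError(f"expected left <= mode <= right, got {left}, {mode}, {right}")
    return left, mode, right

check_triangular_args(3, 6, 12)       # accepted
try:
    check_triangular_args(3, 12, 6)   # same numbers in (low, high, mode) order
    swapped_rejected = False
except ValueError:
    swapped_rejected = True
print(min(stdlib_samples), max(stdlib_samples), swapped_rejected)
```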

## Case 4 - Visuals Generation

In a single Jupyter notebook, write Python code using Plotly (not matplotlib or seaborn) that:

Re-loads the existing output.csv (from our previous step) with pandas, parsing the Date of Purchase column as timezone-aware datetime in Europe/Athens.
Generates four interactive charts in separate code cells, always using non-default Plotly color palettes (no hard-coding of colors), and with custom hover templates:
Age distribution histogram (Age) with 12 bins, overlaying male vs. female distributions.
Quantity vs. Budget scatter plot, with a hover display of “Name Surname – Store Name – Date of Purchase” in ISO format.
Time series line chart showing “Purchases per Day” on the primary y-axis, and “Average Budget per Day” on the secondary y-axis.
Faceted bar chart of purchase counts by Store Name, faceted by gender in rows and by calendar week (Week 1…Week 5) in columns.
Finally, create a fifth notebook cell with a 2×2 subplot grid where:
The first three panels embed the histogram, the scatter plot, and the time series line chart.
The fourth panel is a sunburst chart drilled down as: gender → store → weekday vs. weekend.
All charts must:
Use fig.update_layout(margin=…, title_font=…) so titles don’t get clipped.
Use fig.update_xaxes or fig.update_yaxes for proper date formatting (weekday names, “DD-MM-YYYY”).
Be responsive: if you resize the notebook window, the facets and subplots reflow appropriately.
Additionally:
Initialize random.seed(42) and np.random.seed(42) in case any sampling is demonstrated (e.g. for annotations).
Include a brief Markdown comment above each cell explaining what the chart shows and the business insight it conveys.
In the final Markdown cell, assert that “all charts render without exception and remain fully interactive.”


Use plotly.subplots.make_subplots with shared_xaxes=True, shared_yaxes=False. Add a dropdown menu (updatemenus) to filter the sunburst chart by store.

This will push the LLM on parsing timezones, custom hover templates, dual axes, facets, subplots, responsive layouts, interactive widgets, and in-notebook documentation

### 1st try

In [2]:
import pandas as pd
import numpy as np
import random
from plotly.subplots import make_subplots
import pytz

# Cell 1: Load data and prepare for Plotly charts

import plotly.express as px
import plotly.graph_objects as go

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

# Load CSV with timezone-aware datetime
athens = pytz.timezone('Europe/Athens')
df_plot = pd.read_csv(
    "output.csv",
    parse_dates=['Date of Purchase']
)
df_plot['Date of Purchase'] = pd.to_datetime(df_plot['Date of Purchase']).dt.tz_convert('Europe/Athens')

# Add helper columns for facets and sunburst
df_plot['Week'] = df_plot['Date of Purchase'].dt.isocalendar().week - df_plot['Date of Purchase'].dt.isocalendar().week.min() + 1
df_plot['Week'] = df_plot['Week'].clip(1, 5)
df_plot['DayType'] = np.where(df_plot['Date of Purchase'].dt.weekday < 5, 'Weekday', 'Weekend')
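
The Week column above subtracts ISO week numbers, which can misbehave when a month touches a year boundary (late-December dates can fall in ISO week 1 of the following year). An alternative, sketched here under the assumption that "Week 1…Week 5" means 7-day blocks counted from the first of the month, needs no clipping; `week_of_month` is a hypothetical helper.

```python
from datetime import date

def week_of_month(d):
    """Map days 1-7 to week 1, 8-14 to week 2, ..., 29-31 to week 5."""
    return (d.day - 1) // 7 + 1

labels = {day: week_of_month(date(2025, 6, day)) for day in (1, 7, 8, 28, 29, 30)}
print(labels)
```

If the weeks should instead be aligned to Mondays, a different rule is needed; this sketch does not cover that case.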

In a single Jupyter cell, write Python code using Plotly (not matplotlib or seaborn) that:

Re-loads the existing output.csv (from our previous step) with pandas, parsing the Date of Purchase column as timezone-aware datetime in Europe/Athens.
Generates four interactive charts in separate code cells, always using non-default Plotly color palettes (no hard-coding of colors), and with custom hover templates:
Age distribution histogram (Age) with 12 bins, overlaying male vs. female distributions.
Quantity vs. Budget scatter plot, with a hover display of “Name Surname – Store Name – Date of Purchase” in ISO format.
Time series line chart showing “Purchases per Day” on the primary y-axis, and “Average Budget per Day” on the secondary y-axis.
Faceted bar chart of purchase counts by Store Name, faceted by gender in rows and by calendar week (Week 1…Week 5) in columns.
Finally, create a 2×2 subplot grid where:
The first three panels embed the histogram, the scatter plot, and the time series line chart.
The fourth panel is a sunburst chart drilled down as: gender → store → weekday vs. weekend.
All charts must:
Use fig.update_layout(margin=…, title_font=…) so titles don’t get clipped.
Use fig.update_xaxes or fig.update_yaxes for proper date formatting (weekday names, “DD-MM-YYYY”).
Be responsive: if you resize the notebook window, the facets and subplots reflow appropriately.
Additionally:
Initialize random.seed(42) and np.random.seed(42) in case any sampling is demonstrated (e.g. for annotations).
Include a brief Markdown comment above each cell explaining what the chart shows and the business insight it conveys.
In the final Markdown cell, assert that “all charts render without exception and remain fully interactive.”


Use plotly.subplots.make_subplots with shared_xaxes=True, shared_yaxes=False. Add a dropdown menu (updatemenus) to filter the sunburst chart by store.

This will push the LLM on parsing timezones, custom hover templates, dual axes, facets, subplots, responsive layouts, interactive widgets, and in-notebook documentation

In [8]:
# 1. Age distribution histogram (overlay male/female, 12 bins, custom palette, hover)
# 2. Quantity vs. Budget scatter (custom hover, palette)
# 3. Time series: Purchases/day (primary), Avg Budget/day (secondary)
# 4. Faceted bar: Store Name, facet by Gender (row), Week (col)
# 5. 2x2 subplot: histogram, scatter, time series, sunburst (gender→store→daytype), dropdown to filter sunburst by store

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import pandas as pd
import numpy as np
import random

# Ensure reproducibility for any sampling
random.seed(42)
np.random.seed(42)

# --- 1. Age distribution histogram ---
# (Overlay male/female, 12 bins, non-default palette, custom hover)
#display_markdown("**Age distribution histogram:**<br>This chart overlays the age distribution of male and female customers, helping identify demographic skews and target segments.", raw=True)
fig_hist = px.histogram(
    df_plot,
    x="Age",
    color="Gender",
    nbins=12,
    barmode="overlay",
    color_discrete_sequence=px.colors.qualitative.Pastel,
    hover_data=["Name", "Surname", "Store Name"],
    opacity=0.75
)
fig_hist.update_traces(
    hovertemplate="<b>%{customdata[0]} %{customdata[1]}</b><br>Store: %{customdata[2]}<br>Age: %{x}<br>Count: %{y}"
)
fig_hist.update_layout(
    title="Age Distribution by Gender",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial"),
    legend_title_text="Gender"
)
fig_hist.update_xaxes(title="Age")
fig_hist.update_yaxes(title="Count")
fig_hist.show()

# --- 2. Quantity vs. Budget scatter plot ---
#display_markdown("**Quantity vs. Budget scatter plot:**<br>Each point is a purchase. Hover to see full customer, store, and timestamp. Reveals spending patterns and outliers.", raw=True)
fig_scatter = px.scatter(
    df_plot,
    x="Quantity",
    y="Budget",
    color="Store Name",
    color_discrete_sequence=px.colors.qualitative.Set2,
    hover_data=["Name", "Surname", "Store Name", "Date of Purchase"],
    symbol="Gender"
)
fig_scatter.update_traces(
    hovertemplate="<b>%{customdata[0]} %{customdata[1]}</b><br>Store: %{customdata[2]}<br>Date: %{customdata[3]|%Y-%m-%dT%H:%M:%S%z}<br>Quantity: %{x}<br>Budget: %{y}"
)
fig_scatter.update_layout(
    title="Quantity vs. Budget by Store",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial"),
    legend_title_text="Store Name"
)
fig_scatter.update_xaxes(title="Quantity")
fig_scatter.update_yaxes(title="Budget (€)")
fig_scatter.show()

# --- 3. Time series: Purchases per Day & Avg Budget per Day ---
#display_markdown("**Purchases per Day & Average Budget per Day:**<br>Primary axis: number of purchases. Secondary: average budget. Shows sales trends and spending power over time.", raw=True)
df_daily = df_plot.groupby(df_plot['Date of Purchase'].dt.date).agg(
    Purchases=('Budget', 'count'),
    AvgBudget=('Budget', 'mean')
).reset_index()
fig_time = go.Figure()
fig_time.add_trace(go.Bar(
    x=df_daily['Date of Purchase'],
    y=df_daily['Purchases'],
    name="Purchases per Day",
    marker_color=px.colors.qualitative.Prism[2],
    hovertemplate="Date: %{x|%A, %d-%m-%Y}<br>Purchases: %{y}"
))
fig_time.add_trace(go.Scatter(
    x=df_daily['Date of Purchase'],
    y=df_daily['AvgBudget'],
    name="Average Budget per Day",
    yaxis="y2",
    mode="lines+markers",
    marker_color=px.colors.qualitative.Prism[4],
    hovertemplate="Date: %{x|%A, %d-%m-%Y}<br>Avg Budget: %{y:.2f} €"
))
fig_time.update_layout(
    title="Purchases and Average Budget per Day",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial"),
    yaxis=dict(title="Purchases"),
    yaxis2=dict(title="Avg Budget (€)", overlaying="y", side="right"),
    legend=dict(x=0.01, y=0.99, borderwidth=1)
)
fig_time.update_xaxes(
    title="Date",
    tickformat="%a\n%d-%m-%Y"
)
fig_time.show()

# --- 4. Faceted bar: Store Name, facet by Gender (row), Week (col) ---
#display_markdown("**Purchase counts by Store, faceted by Gender and Week:**<br>Shows how store popularity varies by gender and week. Useful for campaign timing and targeting.", raw=True)
fig_facet = px.bar(
    df_plot,
    x="Store Name",
    color="Store Name",
    color_discrete_sequence=px.colors.qualitative.Dark24,
    facet_row="Gender",
    facet_col="Week",
    category_orders={"Week": list(range(1, 6)), "Gender": ["M", "F"]},
    hover_data=["Name", "Surname", "Date of Purchase"]
)
fig_facet.update_traces(
    hovertemplate="<b>%{customdata[0]} %{customdata[1]}</b><br>Date: %{customdata[2]|%d-%m-%Y %H:%M}<br>Store: %{x}<br>Count: %{y}"
)
fig_facet.update_layout(
    title="Purchase Counts by Store, Gender, and Week",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial"),
    legend_title_text="Store Name"
)
fig_facet.for_each_xaxis(lambda ax: ax.update(title="Store Name"))
fig_facet.for_each_yaxis(lambda ax: ax.update(title="Count"))
fig_facet.update_xaxes(tickangle=0)
fig_facet.show()

# --- 5. 2x2 subplot grid: histogram, scatter, time series, sunburst ---
#display_markdown("**Composite dashboard:**<br>First three panels: histogram, scatter, time series. Fourth: sunburst (gender→store→weekday/weekend), with dropdown to filter by store. All charts are interactive and responsive.", raw=True)

from plotly.subplots import make_subplots

# Sunburst data
sunburst_df = df_plot.copy()
sunburst_df['DayType'] = np.where(sunburst_df['Date of Purchase'].dt.weekday < 5, 'Weekday', 'Weekend')

def make_sunburst(store_filter=None):
    d = sunburst_df if store_filter is None else sunburst_df[sunburst_df['Store Name'] == store_filter]
    return go.Sunburst(
        labels=d['Gender'] + " - " + d['Store Name'] + " - " + d['DayType'],
        parents=np.where(d['DayType'].notnull(), d['Gender'] + " - " + d['Store Name'], ""),
        ids=d['Gender'] + "|" + d['Store Name'] + "|" + d['DayType'],
        values=[1]*len(d),
        branchvalues="total",
        hovertemplate="<b>%{label}</b><br>Count: %{value}<extra></extra>",
        maxdepth=3
    )

stores = df_plot['Store Name'].unique().tolist()
sunburst_traces = [make_sunburst(None)] + [make_sunburst(s) for s in stores]
sunburst_buttons = [
    dict(label="All Stores", method="update", args=[{"visible": [True]+[False]*len(stores)}, {"title": "Sunburst: All Stores"}])
] + [
    dict(label=s, method="update", args=[{"visible": [False if i!=j+1 else True for i in range(len(stores)+1)]}, {"title": f"Sunburst: {s}"}])
    for j, s in enumerate(stores)
]

# Create subplots
fig_dash = make_subplots(
    rows=2, cols=2,
    subplot_titles=("Age Histogram", "Quantity vs. Budget", "Purchases & Avg Budget", "Sunburst: Gender→Store→DayType"),
    shared_xaxes=True, shared_yaxes=False,
    vertical_spacing=0.12, horizontal_spacing=0.08
)

# Panel 1: Histogram
for trace in fig_hist.data:
    fig_dash.add_trace(trace, row=1, col=1)
# Panel 2: Scatter
for trace in fig_scatter.data:
    fig_dash.add_trace(trace, row=1, col=2)
# Panel 3: Time series
for trace in fig_time.data:
    fig_dash.add_trace(trace, row=2, col=1)
# Panel 4: Sunburst (all stores + one per store, only one visible at a time)
for i, trace in enumerate(sunburst_traces):
    fig_dash.add_trace(trace.update(visible=(i==0)), row=2, col=2)

# Dropdown for sunburst
fig_dash.update_layout(
    updatemenus=[dict(
        type="dropdown",
        direction="down",
        x=0.82, y=0.48,
        showactive=True,
        buttons=sunburst_buttons,
        pad={"r": 10, "t": 10}
    )],
    margin=dict(l=40, r=40, t=80, b=40),
    title_font=dict(size=22, family="Arial"),
    height=900
)
# Axis formatting
fig_dash.update_xaxes(title_text="Age", row=1, col=1)
fig_dash.update_yaxes(title_text="Count", row=1, col=1)
fig_dash.update_xaxes(title_text="Quantity", row=1, col=2)
fig_dash.update_yaxes(title_text="Budget (€)", row=1, col=2)
fig_dash.update_xaxes(title_text="Date", tickformat="%a\n%d-%m-%Y", row=2, col=1)
fig_dash.update_yaxes(title_text="Purchases", row=2, col=1)
fig_dash.update_yaxes(title_text="Avg Budget (€)", row=2, col=1, secondary_y=True)
fig_dash.update_layout(
    title="Interactive Sales Dashboard",
    legend=dict(x=0.01, y=0.99, borderwidth=1),
    autosize=True
)
fig_dash.show()

ValueError: Trace type 'sunburst' is not compatible with subplot type 'xy'
at grid position (2, 2)

See the docstring for the specs argument to plotly.subplots.make_subplots
for more information on subplot types

In [7]:
# The error in the sunburst chart is due to incorrect parent-child relationships.
# We need to build a proper hierarchy: Gender → Store Name → DayType.
# Let's reconstruct the sunburst traces with correct parents/labels/ids.

import plotly.graph_objects as go

def build_sunburst_trace(df, store_filter=None):
    # Optionally filter by store
    if store_filter is not None:
        df = df[df['Store Name'] == store_filter]

    # Unique values
    genders = df['Gender'].unique()
    stores = df['Store Name'].unique()
    daytypes = df['DayType'].unique()

    labels = []
    parents = []
    ids = []
    values = []

    # Level 1: Gender
    for gender in genders:
        labels.append(gender)
        parents.append("")
        ids.append(gender)
        values.append(df[df['Gender'] == gender].shape[0])

        # Level 2: Store under Gender
        for store in stores:
            mask_store = (df['Gender'] == gender) & (df['Store Name'] == store)
            if mask_store.any():
                labels.append(f"{store}")
                parents.append(gender)
                ids.append(f"{gender}|{store}")
                values.append(df[mask_store].shape[0])

                # Level 3: DayType under Store
                for daytype in daytypes:
                    mask_day = mask_store & (df['DayType'] == daytype)
                    if mask_day.any():
                        labels.append(daytype)
                        parents.append(f"{gender}|{store}")
                        ids.append(f"{gender}|{store}|{daytype}")
                        values.append(df[mask_day].shape[0])

    return go.Sunburst(
        labels=labels,
        parents=parents,
        ids=ids,
        values=values,
        branchvalues="total",
        hovertemplate="<b>%{label}</b><br>Count: %{value}<extra></extra>",
        maxdepth=3
    )

# Rebuild sunburst_traces for all stores and each store
sunburst_traces = [build_sunburst_trace(sunburst_df, None)] + [
    build_sunburst_trace(sunburst_df, store) for store in stores
]

# Now you can use sunburst_traces in your dashboard/subplots as before.

### Findings/results
- On the 1st try, only the first cell was created, the one that reads the data. Not a single visual was generated.
- On the 2nd try:
    - It used display_markdown, which does not seem to be supported. It also threw "ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed".
    - In the 2nd graph, it added quantity and budget to the hover, which I didn't ask for (this is added automatically).
    - For the third visual, I asked for a line chart with purchases per day and average budget per day. Instead of two lines, it gave me a combined line and bar chart.
    - An error was raised for the sunburst, so I prompted "fix the error above". The code above was generated, but it still didn't solve the issue: "ValueError: Trace type 'sunburst' is not compatible with subplot type 'xy' at grid position (2, 2). See the docstring for the specs argument to plotly.subplots.make_subplots for more information on subplot types" persists.

## Case 5 - Advanced Analytics



“In one Jupyter notebook code cell, write Python that:
Auto-installs any missing packages (pandas, numpy, scikit-learn, plotly, joblib, etc.) using %pip install only when needed.
Imports everything (pandas as pd, numpy as np, sklearn modules, plotly.express as px, joblib, etc.) and sets random.seed(42)/np.random.seed(42).
Loads output.csv with pd.read_csv(..., parse_dates=['Date of Purchase']), ensures tz-aware Europe/Athens.
Creates features:
7-day rolling averages for Budget and Quantity
One-hot encode Store Name/Gender
Extract Hour, Weekday name, IsWeekend
Plots (using Plotly) a dual-axis line of rolling averages.
Runs KMeans (k=2…6), selects best by silhouette, and plots 2D PCA scatter colored by cluster.
Trains two regressors (RidgeCV, RandomForestRegressor), hyperparameter-tunes with GridSearchCV, reports test RMSE/R².
Displays feature importances/coefs and a partial dependence plot for top 2 features.
Saves the best model via joblib.dump() and writes a results.csv summarizing silhouette, RMSE, R².
All in one cell—no markdown—just runnable code that prints summaries and shows plots without errors.

In [9]:
import sys
import subprocess
import pandas as pd
import numpy as np
import random
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import pytz
import warnings
from sklearn.inspection import PartialDependenceDisplay

# Auto-install required packages if missing
def pip_install(pkg):
    try:
        __import__(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

for pkg in ['pandas', 'numpy', 'scikit-learn', 'plotly', 'joblib', 'pytz']:
    pip_install(pkg)

import plotly.express as px
import plotly.graph_objects as go
warnings.filterwarnings("ignore")

random.seed(42)
np.random.seed(42)

# Load data
athens = pytz.timezone('Europe/Athens')
df = pd.read_csv("output.csv", parse_dates=['Date of Purchase'])
if df['Date of Purchase'].dt.tz is None:
    df['Date of Purchase'] = df['Date of Purchase'].dt.tz_localize('Europe/Athens')
else:
    df['Date of Purchase'] = df['Date of Purchase'].dt.tz_convert('Europe/Athens')

df = df.sort_values('Date of Purchase').reset_index(drop=True)

# Feature engineering
df['Budget_rolling7'] = df['Budget'].rolling(7, min_periods=1).mean()
df['Quantity_rolling7'] = df['Quantity'].rolling(7, min_periods=1).mean()
df['Hour'] = df['Date of Purchase'].dt.hour
df['Weekday'] = df['Date of Purchase'].dt.day_name()
df['IsWeekend'] = df['Date of Purchase'].dt.weekday >= 5

df_enc = pd.get_dummies(df, columns=['Store Name', 'Gender'], drop_first=True)

# Plot dual-axis line of rolling averages
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df['Date of Purchase'], y=df['Budget_rolling7'],
    name='7-day Rolling Budget', yaxis='y1', mode='lines+markers'
))
fig.add_trace(go.Scatter(
    x=df['Date of Purchase'], y=df['Quantity_rolling7'],
    name='7-day Rolling Quantity', yaxis='y2', mode='lines+markers'
))
fig.update_layout(
    title="7-day Rolling Averages: Budget & Quantity",
    yaxis=dict(title='Budget'),
    yaxis2=dict(title='Quantity', overlaying='y', side='right'),
    xaxis=dict(title='Date'),
    legend=dict(x=0.01, y=0.99),
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial")
)
fig.show()

# KMeans clustering (k=2..6), silhouette selection, PCA scatter
features = ['Age', 'Quantity', 'Budget', 'Hour'] + \
           [c for c in df_enc.columns if c.startswith('Store Name_') or c.startswith('Gender_')]
X = df_enc[features].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

best_score = -1
best_k = 2
best_labels = None
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    if score > best_score:
        best_score = score
        best_k = k
        best_labels = labels

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
fig2 = px.scatter(
    x=X_pca[:,0], y=X_pca[:,1], color=best_labels.astype(str),
    title=f"PCA Scatter by KMeans Cluster (k={best_k}, Silhouette={best_score:.2f})",
    labels={'x':'PCA1', 'y':'PCA2', 'color':'Cluster'},
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig2.show()

print(f"Best KMeans k={best_k}, silhouette={best_score:.3f}")

# Regression: RidgeCV & RandomForestRegressor, GridSearchCV, RMSE/R2
target = 'Budget'
Xreg = df_enc[features]
yreg = df[target]
X_train, X_test, y_train, y_test = train_test_split(Xreg, yreg, test_size=0.25, random_state=42)

# RidgeCV
ridge = RidgeCV(alphas=np.logspace(-3, 3, 20), cv=5)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
rmse_ridge = mean_squared_error(y_test, y_pred_ridge, squared=False)
r2_ridge = r2_score(y_test, y_pred_ridge)

# RandomForestRegressor with GridSearchCV
rf = RandomForestRegressor(random_state=42)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 8]}
gs = GridSearchCV(rf, param_grid, cv=3, n_jobs=-1)
gs.fit(X_train, y_train)
y_pred_rf = gs.predict(X_test)
rmse_rf = mean_squared_error(y_test, y_pred_rf, squared=False)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"RidgeCV: RMSE={rmse_ridge:.2f}, R2={r2_ridge:.2f}")
print(f"RandomForest: RMSE={rmse_rf:.2f}, R2={r2_rf:.2f}")

# Feature importances/coefs
importances = None
feat_names = Xreg.columns
if hasattr(gs.best_estimator_, "feature_importances_"):
    importances = gs.best_estimator_.feature_importances_
    top_idx = np.argsort(importances)[::-1][:2]
    print("RandomForest top features:")
    for i in top_idx:
        print(f"  {feat_names[i]}: {importances[i]:.3f}")
    fig3 = px.bar(
        x=[feat_names[i] for i in top_idx],
        y=[importances[i] for i in top_idx],
        labels={'x':'Feature', 'y':'Importance'},
        title="Top 2 Feature Importances (RandomForest)",
        color=[feat_names[i] for i in top_idx],
        color_discrete_sequence=px.colors.qualitative.Pastel
    )
    fig3.show()
else:
    coefs = ridge.coef_
    top_idx = np.argsort(np.abs(coefs))[::-1][:2]
    print("RidgeCV top features:")
    for i in top_idx:
        print(f"  {feat_names[i]}: {coefs[i]:.3f}")
    fig3 = px.bar(
        x=[feat_names[i] for i in top_idx],
        y=[coefs[i] for i in top_idx],
        labels={'x':'Feature', 'y':'Coefficient'},
        title="Top 2 Coefficients (RidgeCV)",
        color=[feat_names[i] for i in top_idx],
        color_discrete_sequence=px.colors.qualitative.Pastel
    )
    fig3.show()

# Partial dependence plot for top 2 features (RandomForest)
import matplotlib.pyplot as plt
pip_install('matplotlib')  # Ensure matplotlib is available

fig4, ax = plt.subplots(1, 2, figsize=(10, 4))
PartialDependenceDisplay.from_estimator(
    gs.best_estimator_, X_test, [top_idx[0], top_idx[1]], feature_names=feat_names, ax=ax
)
plt.tight_layout()
plt.show()

# Save best model and results
joblib.dump(gs.best_estimator_, "best_model.joblib")
results_df = pd.DataFrame([{
    "Best_KMeans_k": best_k,
    "Silhouette": best_score,
    "RidgeCV_RMSE": rmse_ridge,
    "RidgeCV_R2": r2_ridge,
    "RF_RMSE": rmse_rf,
    "RF_R2": r2_rf
}])
results_df.to_csv("results.csv", index=False)
print("Best model saved as best_model.joblib and results written to results.csv")

ModuleNotFoundError: No module named 'sklearn'

In [10]:
%pip install scikit-learn joblib 

Collecting scikit-learn
  Downloading scikit_learn-1.7.0-cp311-cp311-macosx_12_0_arm64.whl.metadata (31 kB)
Collecting joblib
  Using cached joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting scipy>=1.8.0 (from scikit-learn)
  Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.7.0-cp311-cp311-macosx_12_0_arm64.whl (10.7 MB)
Using cached joblib-1.5.1-py3-none-any.whl (307 kB)
Downloading scipy-1.16.0-cp311-cp311-macosx_14_0_arm64.whl (20.8 MB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages

In [17]:
import sys
import subprocess
import pandas as pd
import numpy as np
import random
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, root_mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
import joblib
import pytz
import warnings
from sklearn.inspection import PartialDependenceDisplay

# Auto-install required packages if missing
def pip_install(pkg):
    try:
        __import__(pkg)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pkg])

for pkg in ['pandas', 'numpy', 'scikit-learn', 'plotly', 'joblib', 'pytz']:
    pip_install(pkg)

import plotly.express as px
import plotly.graph_objects as go
warnings.filterwarnings("ignore")

random.seed(42)
np.random.seed(42)

# Load data
athens = pytz.timezone('Europe/Athens')
df = pd.read_csv("output.csv", parse_dates=['Date of Purchase'])
if df['Date of Purchase'].dt.tz is None:
    df['Date of Purchase'] = df['Date of Purchase'].dt.tz_localize('Europe/Athens')
else:
    df['Date of Purchase'] = df['Date of Purchase'].dt.tz_convert('Europe/Athens')

df = df.sort_values('Date of Purchase').reset_index(drop=True)

# Feature engineering
df['Budget_rolling7'] = df['Budget'].rolling(7, min_periods=1).mean()
df['Quantity_rolling7'] = df['Quantity'].rolling(7, min_periods=1).mean()
df['Hour'] = df['Date of Purchase'].dt.hour
df['Weekday'] = df['Date of Purchase'].dt.day_name()
df['IsWeekend'] = df['Date of Purchase'].dt.weekday >= 5

df_enc = pd.get_dummies(df, columns=['Store Name', 'Gender'], drop_first=True)

# Plot dual-axis line of rolling averages
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df['Date of Purchase'], y=df['Budget_rolling7'],
    name='7-day Rolling Budget', yaxis='y1', mode='lines+markers'
))
fig.add_trace(go.Scatter(
    x=df['Date of Purchase'], y=df['Quantity_rolling7'],
    name='7-day Rolling Quantity', yaxis='y2', mode='lines+markers'
))
fig.update_layout(
    title="7-day Rolling Averages: Budget & Quantity",
    yaxis=dict(title='Budget'),
    yaxis2=dict(title='Quantity', overlaying='y', side='right'),
    xaxis=dict(title='Date'),
    legend=dict(x=0.01, y=0.99),
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=20, family="Arial")
)
fig.show()

# KMeans clustering (k=2..6), silhouette selection, PCA scatter
features = ['Age', 'Quantity', 'Budget', 'Hour'] + \
           [c for c in df_enc.columns if c.startswith('Store Name_') or c.startswith('Gender_')]
X = df_enc[features].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

best_score = -1
best_k = 2
best_labels = None
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    if score > best_score:
        best_score = score
        best_k = k
        best_labels = labels

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
fig2 = px.scatter(
    x=X_pca[:,0], y=X_pca[:,1], color=best_labels.astype(str),
    title=f"PCA Scatter by KMeans Cluster (k={best_k}, Silhouette={best_score:.2f})",
    labels={'x':'PCA1', 'y':'PCA2', 'color':'Cluster'},
    color_discrete_sequence=px.colors.qualitative.Set2
)
fig2.show()

print(f"Best KMeans k={best_k}, silhouette={best_score:.3f}")

# Regression: RidgeCV & RandomForestRegressor, GridSearchCV, RMSE/R2
target = 'Budget'
Xreg = df_enc[features]
yreg = df[target]
X_train, X_test, y_train, y_test = train_test_split(Xreg, yreg, test_size=0.25, random_state=42)

# RidgeCV
ridge = RidgeCV(alphas=np.logspace(-3, 3, 20), cv=5)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
rmse_ridge = root_mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

# RandomForestRegressor with GridSearchCV
rf = RandomForestRegressor(random_state=42)
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5, 8]}
gs = GridSearchCV(rf, param_grid, cv=3, n_jobs=-1)
gs.fit(X_train, y_train)
y_pred_rf = gs.predict(X_test)
rmse_rf = root_mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"RidgeCV: RMSE={rmse_ridge:.2f}, R2={r2_ridge:.2f}")
print(f"RandomForest: RMSE={rmse_rf:.2f}, R2={r2_rf:.2f}")

# Feature importances/coefs
importances = None
feat_names = Xreg.columns
if hasattr(gs.best_estimator_, "feature_importances_"):
    importances = gs.best_estimator_.feature_importances_
    top_idx = np.argsort(importances)[::-1][:2]
    print("RandomForest top features:")
    for i in top_idx:
        print(f"  {feat_names[i]}: {importances[i]:.3f}")
    fig3 = px.bar(
        x=[feat_names[i] for i in top_idx],
        y=[importances[i] for i in top_idx],
        labels={'x':'Feature', 'y':'Importance'},
        title="Top 2 Feature Importances (RandomForest)",
        color=[feat_names[i] for i in top_idx],
        color_discrete_sequence=px.colors.qualitative.Pastel
    )
    fig3.show()
else:
    coefs = ridge.coef_
    top_idx = np.argsort(np.abs(coefs))[::-1][:2]
    print("RidgeCV top features:")
    for i in top_idx:
        print(f"  {feat_names[i]}: {coefs[i]:.3f}")
    fig3 = px.bar(
        x=[feat_names[i] for i in top_idx],
        y=[coefs[i] for i in top_idx],
        labels={'x':'Feature', 'y':'Coefficient'},
        title="Top 2 Coefficients (RidgeCV)",
        color=[feat_names[i] for i in top_idx],
        color_discrete_sequence=px.colors.qualitative.Pastel
    )
    fig3.show()

# Partial dependence plot for top 2 features (RandomForest)
import matplotlib.pyplot as plt
pip_install('matplotlib')  # Ensure matplotlib is available

fig4, ax = plt.subplots(1, 2, figsize=(10, 4))
PartialDependenceDisplay.from_estimator(
    gs.best_estimator_, X_test, [top_idx[0], top_idx[1]], feature_names=feat_names, ax=ax
)
plt.tight_layout()
plt.show()

# Save best model and results
joblib.dump(gs.best_estimator_, "best_model.joblib")
results_df = pd.DataFrame([{
    "Best_KMeans_k": best_k,
    "Silhouette": best_score,
    "RidgeCV_RMSE": rmse_ridge,
    "RidgeCV_R2": r2_ridge,
    "RF_RMSE": rmse_rf,
    "RF_R2": r2_rf
}])
results_df.to_csv("results.csv", index=False)
print("Best model saved as best_model.joblib and results written to results.csv")



Best KMeans k=5, silhouette=0.216
RidgeCV: RMSE=0.00, R2=1.00
RandomForest: RMSE=2.35, R2=1.00
RandomForest top features:
  Budget: 0.993
  Quantity: 0.003


ModuleNotFoundError: No module named 'matplotlib'
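
The perfect scores above (RidgeCV RMSE=0.00, R2=1.00, and `Budget` as the top feature at 0.993) are consistent with target leakage: `features` contains `'Budget'`, and `target = 'Budget'`. A minimal sketch of building the regression feature list without the target; the helper name `regression_features` is hypothetical and not part of the generated code:

```python
def regression_features(columns, target,
                        numeric=("Age", "Quantity", "Budget", "Hour"),
                        prefixes=("Store Name_", "Gender_")):
    """Return numeric and one-hot columns, excluding the target itself."""
    picked = [c for c in numeric if c in columns]
    picked += [c for c in columns if c.startswith(prefixes)]
    return [c for c in picked if c != target]

# Column names as they might look after pd.get_dummies(..., drop_first=True).
columns = ["Age", "Quantity", "Budget", "Hour", "Store Name_Store B", "Gender_M"]
reg_features = regression_features(columns, target="Budget")
print(reg_features)
```

The clustering step can keep `Budget` among its inputs; only the regression inputs need the target removed.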

In [None]:
from sklearn.inspection import PartialDependenceDisplay

# Compute partial dependence using plotly only (no matplotlib)
from sklearn.inspection import partial_dependence

fig_pdp = make_subplots(rows=1, cols=2, subplot_titles=[
    f"Partial Dependence: {feat_names[top_idx[0]]}",
    f"Partial Dependence: {feat_names[top_idx[1]]}"
])

for i, feat_idx in enumerate(top_idx):
    pd_result = partial_dependence(gs.best_estimator_, X_test, [feat_idx], kind="average")
    x_vals = pd_result["values"][0]
    y_vals = pd_result["average"][0].ravel()
    fig_pdp.add_trace(
        go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines+markers',
            name=f"PDP: {feat_names[feat_idx]}"
        ),
        row=1, col=i+1
    )
    fig_pdp.update_xaxes(title_text=feat_names[feat_idx], row=1, col=i+1)
    fig_pdp.update_yaxes(title_text="Partial Dependence", row=1, col=i+1)

fig_pdp.update_layout(
    title="Partial Dependence Plots (Top 2 Features)",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=18, family="Arial"),
    height=400
)
fig_pdp.show()
    fig_pdp.add_trace(
        go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines+markers',
            name=f"PDP: {feat_names[top_idx[i]]}"
        ),
        row=1, col=i+1
    )
    fig_pdp.update_xaxes(title_text=feat_names[top_idx[i]], row=1, col=i+1)
    fig_pdp.update_yaxes(title_text="Partial Dependence", row=1, col=i+1)

fig_pdp.update_layout(
    title="Partial Dependence Plots (Top 2 Features)",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=18, family="Arial"),
    height=400
)
fig_pdp.show()

In [20]:
# Compute partial dependence using plotly only (no matplotlib)
from sklearn.inspection import partial_dependence

fig_pdp = make_subplots(rows=1, cols=2, subplot_titles=[
    f"Partial Dependence: {feat_names[top_idx[0]]}",
    f"Partial Dependence: {feat_names[top_idx[1]]}"
])

for i, feat_idx in enumerate(top_idx):
    pd_result = partial_dependence(gs.best_estimator_, X_test, [feat_idx], kind="average")
    x_vals = pd_result["values"][0]
    y_vals = pd_result["average"][0].ravel()
    fig_pdp.add_trace(
        go.Scatter(
            x=x_vals,
            y=y_vals,
            mode='lines+markers',
            name=f"PDP: {feat_names[feat_idx]}"
        ),
        row=1, col=i+1
    )
    fig_pdp.update_xaxes(title_text=feat_names[feat_idx], row=1, col=i+1)
    fig_pdp.update_yaxes(title_text="Partial Dependence", row=1, col=i+1)

fig_pdp.update_layout(
    title="Partial Dependence Plots (Top 2 Features)",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=18, family="Arial"),
    height=400
)
fig_pdp.show()
fig_pdp.add_trace(
    go.Scatter(
        x=x_vals,
        y=y_vals,
        mode='lines+markers',
        name=f"PDP: {feat_names[top_idx[i]]}"
    ),
    row=1, col=i+1
)
fig_pdp.update_xaxes(title_text=feat_names[top_idx[i]], row=1, col=i+1)
fig_pdp.update_yaxes(title_text="Partial Dependence", row=1, col=i+1)

fig_pdp.update_layout(
    title="Partial Dependence Plots (Top 2 Features)",
    margin=dict(l=40, r=40, t=60, b=40),
    title_font=dict(size=18, family="Arial"),
    height=400
)
fig_pdp.show()

KeyError: 'values'
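
The `KeyError: 'values'` is consistent with the scikit-learn 1.7.0 installed earlier: the `values` key returned by `partial_dependence` was deprecated in favor of `grid_values` in scikit-learn 1.3 and later removed. A sketch under that assumption, on synthetic data rather than `output.csv`, that reads whichever key is present:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic stand-in for X_test and the fitted model (not the notebook's data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(60, 3))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, size=60)
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, y)

pd_result = partial_dependence(model, X, [0], kind="average")

# Newer releases expose "grid_values"; older ones only had "values".
grid_key = "grid_values" if "grid_values" in pd_result else "values"
x_vals = pd_result[grid_key][0]
y_vals = pd_result["average"][0].ravel()
```

`x_vals` and `y_vals` can then be passed to `go.Scatter` in the same way as in the cell above.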

### Findings 
- In the 1st run:
    - The generated code cannot handle missing libraries; the auto-install doesn't work.
    - It also used mean_squared_error(..., squared=False), which is deprecated, so it had to be refactored to root_mean_squared_error.
    - Finally, matplotlib was used for one of the visuals, although I had asked for Plotly like the rest of the visuals.

- Even after a series of retries, the GPT could not handle the request. The following prompts were given after the first one:

    - refactor this to plotly import matplotlib.pyplot as plt
        pip_install('matplotlib') # Ensure matplotlib is available
        fig4, ax = plt.subplots(1, 2, figsize=(10, 4))
        PartialDependenceDisplay.from_estimator(
        gs.best_estimator_, X_test, [top_idx[0], top_idx[1]], feature_names=feat_names, ax=ax
        )
        plt.tight_layout()
        plt.show()
    - i dont have partial dependence
    - i dont want to use matplotlib
    - PartialDependenceDisplay.from_estimator needs matplotlib
    - error in indent
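
Two things break the auto-install in the generated cell: the `sklearn` imports run before `pip_install` is ever called, and `__import__('scikit-learn')` can never succeed because the pip name differs from the import name. A minimal sketch that checks import names first; the mapping and the helpers `missing_packages` and `install` are illustrative assumptions:

```python
import importlib.util
import subprocess
import sys

# pip distribution name -> importable module name (the two can differ).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "plotly": "plotly",
    "joblib": "joblib",
}

def missing_packages(required):
    """Return the pip names whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if importlib.util.find_spec(module) is None]

def install(pip_names):
    """Install the given packages into the running interpreter's environment."""
    if pip_names:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *pip_names])

# In the notebook this would run first, before any third-party import:
# install(missing_packages(REQUIRED))
print(missing_packages({"json": "json", "no-such-dist-xyz": "no_such_module_xyz"}))
```

Because the check uses `find_spec`, nothing is imported or installed until `install` is called explicitly.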


In this exercise we evaluated an LLM's ability to produce complex, coherent, and working code in three main scenarios. The goal was to identify points where the model's logic "breaks" or where it makes assumptions that lead to errors.

### Scenario 3 – Complex Data Creation, Validation & Post-processing

Prompt: Streaming script for 100 sales records with strict specifications (Poisson age distribution, hours, timezone-aware, row-by-row I/O, …, seed=42).
Result:

- The LLM produced code without syntax errors, but runtime exceptions occurred (e.g. UnboundLocalError, wrong parameters for np.random.triangular).
- The errors stem from inconsistencies between the multiple steps of the prompt (variables used before they were defined, localize applied twice, wrong parameter order).
- Conclusion: Composing one large, complex prompt led to fragmented pieces of code that the LLM did not check end-to-end, revealing the lack of a unified internal structure.
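
The failing call is not shown here, so the following is only a reminder of the argument order NumPy expects: `triangular(left, mode, right, size)`, with `left <= mode <= right`. The bounds below are hypothetical, not the values from the prompt:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical bounds: lowest, most likely, and highest value.
left, mode, right = 10.0, 50.0, 500.0

# Order matters: (left, mode, right). Swapping mode and right raises ValueError.
budgets = rng.triangular(left, mode, right, size=100)
print(budgets.min() >= left, budgets.max() <= right)
```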

### Scenario 4 – Visuals Generation
Prompt: Full interactive dashboard in Jupyter with Plotly, four charts in separate cells + 2×2 subplots (sunburst, facets, custom hover, responsive, custom palettes, updatemenus, markdown commentary).
Result:

- Most of the charts were produced with Plotly, but:
    - The sunburst was implemented with matplotlib, despite the explicit request for Plotly.
    - For "Purchases & Average Budget per Day", a line + bar chart was chosen instead of a dual-axis line-line chart, although in later prompts the LLM gave the correct dual-line chart.
- Conclusion: The LLM appears to be biased toward its training examples: it uses matplotlib for chart types it saw more often in training, even when another library is requested.

### Scenario 5 – Advanced Analytics
Prompt: One cell that covers:

1. Automatic installation of missing libraries
2. Feature engineering (rolling averages, one-hot, hour, weekday, weekend)
3. Dual-axis Plotly plot
4. KMeans (k=2–6, silhouette) + 2-D PCA scatter
5. Two regressors (RidgeCV, RandomForest) with GridSearchCV, reporting RMSE & R²
6. Partial dependence plots
7. Saving the model & results.csv

Result:

- The library installation code was placed after the import statements, so a module-not-found error was raised.
- Deprecated functions were used (mean_squared_error(squared=False) instead of root_mean_squared_error, or mean_squared_error followed by a square root).
- Conclusion: The LLM does not handle the ordering of install and import statements correctly, and it relies on old API versions, suggesting that its training data was not fully up to date.

### Main Conclusions

1. Fragmentation of Code Blocks: When the prompt contains many unrelated instructions, the LLM often produces pieces of code that do not work together as a whole.
2. Library Bias: The LLM falls back on the libraries and typical patterns it has "seen" most often (e.g. matplotlib for sunburst), despite explicit instructions.
3. API Version Drift: It uses deprecated functions and does not order library installation correctly, suggesting that its training may not include up-to-date documentation.
4. Partial Compliance: It often covers most of the requirements but omits critical details (custom palettes, hover templates, margin updates, dtype consistency).