# Custom Chatbot Project

I chose to use the Wikipedia page for the 2024 World Surf League (WSL) to supplement ChatGPT 3.5. While experimenting with ChatGPT 4.0, I noticed that even though its training data ends in 2023, it gave accurate answers because it could automatically search the web for up-to-date information. (I verified this by asking it to explain how it arrived at the answer.) Although ChatGPT 4.0 is better than 3.5, there is still a legitimate use case for the older model: because it is cheaper per token than 4.0, a business case could be made for a custom RAG solution, depending on the scale of the application.
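
The cost argument can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the prices and traffic figures are hypothetical placeholders, not actual OpenAI pricing, and `monthly_cost` is a helper introduced here for illustration.

```python
def monthly_cost(requests_per_month, tokens_per_request, price_per_1k_tokens):
    """Estimated monthly spend for a given traffic level and per-1K-token price."""
    return requests_per_month * tokens_per_request / 1000 * price_per_1k_tokens

# Hypothetical placeholder prices (USD per 1K tokens); check current pricing before use
CHEAP_MODEL_PRICE = 0.002
PREMIUM_MODEL_PRICE = 0.03

# RAG prepends retrieved context, so assume more tokens per request for the cheaper model
rag_cost = monthly_cost(100_000, tokens_per_request=1_500, price_per_1k_tokens=CHEAP_MODEL_PRICE)
premium_cost = monthly_cost(100_000, tokens_per_request=500, price_per_1k_tokens=PREMIUM_MODEL_PRICE)

print(f"cheaper model + RAG: ${rag_cost:,.2f}/month")
print(f"premium model:       ${premium_cost:,.2f}/month")
```

Whether the cheaper model wins depends on how much retrieved context each request carries, which is why the scale of the application matters.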

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

Get the 2022-2024 WSL rankings, event schedules, and results from the WSL website.

Event request format: `https://www.worldsurfleague.com/events?all=1&year=<year>`

Rankings request format: `https://www.worldsurfleague.com/athletes/tour/<wct|mct>?year=<year>`
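
The two request formats can be wrapped in small helpers so the query strings are built in one place. This is a minimal sketch; `BASE_URL`, `events_url`, and `rankings_url` are names introduced here for illustration and are not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.worldsurfleague.com"

def events_url(year):
    """URL listing all events for the given year."""
    return f"{BASE_URL}/events?{urlencode({'all': 1, 'year': year})}"

def rankings_url(tour, year):
    """URL for the men's ('mct') or women's ('wct') championship tour rankings."""
    if tour not in ("mct", "wct"):
        raise ValueError(f"unknown tour code: {tour!r}")
    return f"{BASE_URL}/athletes/tour/{tour}?{urlencode({'year': year})}"

print(events_url(2024))
print(rankings_url("wct", 2023))
```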

In [448]:
from datetime import datetime as dt

import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import chain

In [449]:
years = ['2022','2023','2024']

In [450]:
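def get_rankings_by_year(year):
    """Scrape the men's (mct) and women's (wct) tour rankings for one year."""
    gender_map = {"mct": "male", "wct": "female"}
    data = []
    for tour, gender in gender_map.items():
        res = requests.get(f"https://www.worldsurfleague.com/athletes/tour/{tour}?year={year}")
        # Name the parser explicitly so results do not depend on which parsers are installed
        soup = BeautifulSoup(res.text, "html.parser")
        # Athlete links are assumed to appear in ranking order, so position gives the rank
        rankings = soup.find_all('a', class_='athlete-name')
        data.append([{"year": year, "gender": gender, "rank": rank, "name": el.text} for rank, el in enumerate(rankings, start=1)])
    return data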

In [451]:
results = []
for year in years:
    data = get_rankings_by_year(year)
    results.append(data)

In [452]:
df = pd.DataFrame(chain(*chain(*results)))
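
The double `chain` call works because the scraped results are nested two levels deep: one list per year, containing one list per tour, containing the row dicts. A small offline sketch of the same flattening, using placeholder rows and the equivalent `chain.from_iterable` spelling:

```python
from itertools import chain

# Placeholder rows nested as nested[year][tour] -> list of row dicts
nested = [
    [[{"year": "2022", "rank": 1}, {"year": "2022", "rank": 2}], [{"year": "2022", "rank": 1}]],
    [[{"year": "2023", "rank": 1}], []],
]

per_year = chain.from_iterable(nested)        # removes the year level
rows = list(chain.from_iterable(per_year))    # removes the tour level

print(len(rows))
```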

In [453]:
df[["first_name", "last_name"]] = df['name'].str.split(' ', n=1, expand=True)

In [454]:
df = df.replace("Joao","João").replace("Joao Chianca", "João Chianca")

In [455]:
df["text"] = df.apply(lambda x: f"{x['name']} {'is' if int(x.year) == dt.now().year else 'was' } the number {x['rank']} ranked {x.gender} surfer in {x.year}".lower(), axis=1)

In [456]:
df_athletes = df[["name","gender"]].drop_duplicates().reset_index(drop=True)

In [457]:
import ast
import re
from daterangeparser import parse
def get_events_by_year(year):
    """Fetch the event schedule for one year and return its table rows."""
    res = requests.get(f"https://www.worldsurfleague.com/events?all=1&year={year}")
    soup = BeautifulSoup(res.text, "html.parser")
    # Matches any row whose class starts with "event" (the trailing "-*" allows zero or more hyphens)
    event_pattern = re.compile(r"^event-*")
    event_table_rows = soup.find("div", class_="events-schedule-table").find_all("tr", class_=event_pattern)
    return event_table_rows, year

def parse_event_cols(rows, year):
    result = []
    for row in rows:
        date_range, location, tour, status = row.find_all("td")
        event_details = location.find("a")
        event_link = event_details["href"] if event_details else ""
        # literal_eval parses the attribute as a plain literal instead of executing it like eval
        event_data = ast.literal_eval(event_details["data-gtm-event"]) if event_details else ""
        tour_code = event_data["tour_code"] if event_details else ""
        event_name = event_data["event_name"] if event_details else ""
        start, end = parse(f'{date_range.text} {year}')
        if tour_code and tour_code.lower() == "mct":
            result.append({
                "start_date": start,
                "end_date": end,
                "year": year,
                "event_name": event_name,
                "tour_code": tour_code,
                "event_link": event_link.strip(),
                "status": status.text.lower().strip()
            })
    return result

In [458]:
results = []

for year in years:
    row_data, year = get_events_by_year(year)
    result = parse_event_cols(row_data, year)
    results.append(result)
df_events = pd.DataFrame(chain(*results))

### Extract Links from Contest Results

In [459]:
df_events["id"] = pd.util.hash_pandas_object(df_events, index=False)

In [460]:
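def get_event_by_url(url):
    """Fetch an event's results page and return its status-message element (or None)."""
    url = url.replace("/main","/results")
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    # The status message is assumed to name the event champions
    champs = soup.find("span", class_="status-module__status-message")
    return champs

def parse_event_results(data, comp_id: int, athletes: pd.DataFrame):
    """Return one record per known athlete whose name appears in the status message."""
    results = []
    if data is None:  # the page had no status message, so there is nothing to match
        return results
    for _, row in athletes.iterrows():
        if row["name"] in data.text:
            results.append({"comp_id": comp_id, "athlete_name": row["name"], "athlete_gender": row["gender"]})
    return results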
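
The matching step is a plain substring test, so it only finds an athlete when the name is spelled exactly as on the results page, accents included. One possible alternative, sketched below, compares accent-stripped, case-folded strings instead. This is not what the notebook does; `strip_accents` and `find_named_athletes` are hypothetical helpers, and the message and roster are made up for illustration.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks so 'João' and 'Joao' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_named_athletes(message, athlete_names):
    """Return the athlete names that occur in the message, ignoring accents and case."""
    haystack = strip_accents(message).casefold()
    return [name for name in athlete_names if strip_accents(name).casefold() in haystack]

# Made-up status message and roster, for illustration only
message = "Congratulations to Surfer Álpha and Surfer Beta on winning the event"
roster = ["Surfer Alpha", "Surfer Beta", "Surfer Gamma"]
print(find_named_athletes(message, roster))
```

Substring matching can still produce false positives when one name is contained in another, so the matches are worth spot-checking.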
        

In [461]:
comp_results = []

for idx, e in df_events[df_events["end_date"] < dt.now()].iterrows():
    data = get_event_by_url(e.event_link)
    results = parse_event_results(data, e.id, df_athletes)
    comp_results.append(results)

In [462]:
df_athletes[df_athletes["name"].isin(["João Chianca"])]

Unnamed: 0,name,gender
33,João Chianca,male


In [463]:
comp_results_df = pd.DataFrame(list(chain(*comp_results)))

In [464]:
comp_results_df[comp_results_df["athlete_gender"] == "male"].sort_values("comp_id").shape

(26, 3)

In [465]:
comp_results_df = comp_results_df.merge(df_events, left_on="comp_id", right_on="id").drop(["event_link", "tour_code", "id", "comp_id"], axis=1)

In [468]:
comp_results_df["text"] = comp_results_df.apply(lambda x: f"{x.athlete_name} won the {'mens' if x.athlete_gender == 'male' else 'womens'} {x.year} {x.event_name}".lower(), axis=1)

In [474]:
all_df = pd.concat([comp_results_df["text"], df["text"]])

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [491]:
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 10

# pd.concat of two Series returns a Series; convert it to a DataFrame with a "text"
# column and a fresh 0..n-1 index so the embeddings line up row by row
all_df = pd.DataFrame(all_df).reset_index(drop=True)

embeddings = []
for i in range(0, len(all_df), batch_size):
    # Slice all_df (not the shorter df): an empty input list is rejected by the API
    # with a 400 "'$.input' is invalid" error
    response = client.embeddings.create(
        input=all_df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    # Add embeddings to list
    embeddings.extend([data.embedding for data in response.data])

# Add embeddings list to dataframe
all_df["embeddings"] = embeddings
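
Once every row has an embedding, the usual next step in a RAG pipeline is to embed the question and rank the rows by similarity to it. The sketch below shows that ranking with cosine similarity. It uses tiny hand-written vectors in place of real `text-embedding-ada-002` output so that it runs offline; `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and the sentences are placeholders that follow the notebook's text template.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(question_embedding, rows):
    """Sort (text, embedding) rows from most to least similar to the question."""
    scored = [(cosine_similarity(question_embedding, emb), text) for text, emb in rows]
    return [text for _, text in sorted(scored, key=lambda pair: pair[0], reverse=True)]

# Toy 3-dimensional vectors standing in for real embeddings
rows = [
    ("surfer a was the number 1 ranked female surfer in 2023", [0.9, 0.1, 0.0]),
    ("surfer b won the mens 2023 example pro", [0.1, 0.9, 0.1]),
    ("surfer c was the number 2 ranked male surfer in 2022", [0.6, 0.2, 0.5]),
]
question_embedding = [1.0, 0.0, 0.0]  # pretend embedding of a rankings question

for text in rank_by_similarity(question_embedding, rows):
    print(text)
```

With real embeddings the question would be embedded by the same model as the rows, since vectors from different models are not comparable.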

In [None]:
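import os
import openai
# Read the key from the environment instead of hard-coding it in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")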
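
Given context sentences already sorted by relevance to the question, a custom query can be composed by prepending as many of them as fit within a size budget. The sketch below makes several assumptions: the template wording is illustrative, `build_prompt` is a hypothetical helper, the sentences are placeholders, and a whitespace word count stands in for real token counting (a tokenizer would be more accurate).

```python
PROMPT_TEMPLATE = """Answer the question based on the context below. If the question cannot be answered using the context, say "I don't know".

Context:

{context}

---

Question: {question}
Answer:"""

def build_prompt(question, ranked_texts, max_words=50):
    """Fill the template with as many ranked context lines as fit in the word budget."""
    selected = []
    used = 0
    for text in ranked_texts:
        n_words = len(text.split())  # crude stand-in for a token count
        if used + n_words > max_words:
            break
        selected.append(text)
        used += n_words
    return PROMPT_TEMPLATE.format(context="\n".join(selected), question=question)

# Placeholder sentences, most relevant first
ranked_texts = [
    "surfer a was the number 1 ranked female surfer in 2023",
    "surfer c was the number 2 ranked male surfer in 2022",
    "surfer b won the mens 2023 example pro",
]
prompt = build_prompt("Who was the number 1 ranked female surfer in 2023?", ranked_texts, max_words=25)
print(prompt)
```

The resulting string can be passed as the `prompt` argument of `client.completions.create`, in place of the bare question, to obtain the custom answer for comparison with the baseline.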

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [None]:
from openai import OpenAI

In [None]:
client = OpenAI()

In [None]:
surf_prompt = """
Question: "Who was the number 1 ranked female surfer in 2024 in the WSL?"
Answer:
"""
# Baseline: ask the model directly, without any retrieved context
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=surf_prompt,
    stream=False,
    max_tokens=150
)
print(response.choices[0].text)

### Question 2

In [None]:
surf_prompt = """
Question: "Who won the Pipe Pro surfing event in 2023?"
Answer:
"""
# Baseline: ask the model directly, without any retrieved context
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=surf_prompt,
    stream=False,
    max_tokens=150
)
print(response.choices[0].text)