# Custom Chatbot Project

I chose the 2024 World Surf League (WSL) Wikipedia page to supplement ChatGPT 3.5. While experimenting with ChatGPT 4.0, I noticed that even though its training data ends in 2023, it gave accurate answers because it could search the web for up-to-date information. (I verified this by asking it to explain how it arrived at its answer.) Although ChatGPT 4.0 is the better model, there is still a legitimate use case for the older one: because 3.5 is cheaper per token, a business case can be made for a custom RAG solution, depending on the scale of the application.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

Get the 2022 to 2024 WSL rankings, event schedules, and results from the WSL website.

Event request format: `https://www.worldsurfleague.com/events?all=1&year=<year>`

Rankings request format: `https://www.worldsurfleague.com/athletes/tour/<wct|mct>?year=<year>`
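
The two request formats above can be captured in a pair of small helpers. This is only a sketch: `BASE_URL`, `events_url`, and `rankings_url` are hypothetical names introduced for illustration and do not appear elsewhere in the notebook.

```python
from urllib.parse import urlencode

# Hypothetical constant for illustration; the host is taken from the formats above.
BASE_URL = "https://www.worldsurfleague.com"


def events_url(year):
    """Build the event schedule URL for one year."""
    return f"{BASE_URL}/events?{urlencode({'all': 1, 'year': year})}"


def rankings_url(tour, year):
    """Build the rankings URL; tour must be 'mct' or 'wct'."""
    if tour not in ("mct", "wct"):
        raise ValueError(f"unknown tour code: {tour!r}")
    return f"{BASE_URL}/athletes/tour/{tour}?{urlencode({'year': year})}"


print(events_url(2024))
print(rankings_url("wct", 2024))
```

Validating the tour code up front turns a typo into an immediate error rather than a request for a page that does not exist.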

In [6]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from itertools import chain

In [7]:
years = ['2022','2023','2024']

In [8]:
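def get_rankings_by_year(year):
    """Scrape the men's (mct) and women's (wct) rankings for one year."""
    gender_map = {"mct": "male", "wct": "female"}
    data = []
    for tour, gender in gender_map.items():
        res = requests.get(f"https://www.worldsurfleague.com/athletes/tour/{tour}?year={year}")
        # Name the parser explicitly so the result does not depend on which parsers are installed
        soup = BeautifulSoup(res.text, "html.parser")
        rankings = soup.find_all('a', class_='athlete-name')
        # One list of row dicts per tour; the position on the page is taken as the rank
        data.append([{"year": year, "gender": gender, "rank": rank + 1, "name": el.text} for rank, el in enumerate(rankings)])
    return data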

In [9]:
results = []
for year in years:
    data = get_rankings_by_year(year)
    results.append(data)

In [10]:
df = pd.DataFrame(chain(*chain(*results)))

In [11]:
df

| | year | gender | rank | name |
|---|---|---|---|---|
| 0 | 2022 | male | 1 | Filipe Toledo |
| 1 | 2022 | male | 2 | Italo Ferreira |
| 2 | 2022 | male | 3 | Jack Robinson |
| 3 | 2022 | male | 4 | Ethan Ewing |
| 4 | 2022 | male | 5 | Kanoa Igarashi |
| ... | ... | ... | ... | ... |
| 174 | 2024 | female | 14 | Isabella Nichols |
| 175 | 2024 | female | 15 | India Robinson |
| 176 | 2024 | female | 16 | Alyssa Spencer |
| 177 | 2024 | female | 17 | Sophie McCulloch |
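
The project instructions call for a `"text"` column, which the rankings dataframe does not yet have. One possible way to derive it is to render each row as a sentence. This is a minimal sketch, assuming the column names shown in the output above; `ranking_to_text` and the sentence wording are hypothetical choices, and `sample` holds two rows copied from that output.

```python
import pandas as pd

# Two rows copied from the rankings output, used as a stand-in for the full dataframe.
sample = pd.DataFrame([
    {"year": "2022", "gender": "male", "rank": 1, "name": "Filipe Toledo"},
    {"year": "2024", "gender": "female", "rank": 14, "name": "Isabella Nichols"},
])


def ranking_to_text(row):
    """Render one ranking row as a plain-English sentence (hypothetical wording)."""
    return (f"In {row['year']}, {row['name']} was ranked number {row['rank']} "
            f"among {row['gender']} surfers in the WSL rankings.")


# axis=1 passes each row to the function as a Series keyed by column name.
sample["text"] = sample.apply(ranking_to_text, axis=1)
print(sample["text"].tolist())
```

Full sentences tend to work better as retrieval context than bare column values, because the year, gender, and rank stay attached to the surfer's name in a single string.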


In [126]:
import ast
import re
from daterangeparser import parse

def get_events_by_year(year):
    """Fetch the event schedule page for one year and return its event rows."""
    res = requests.get(f"https://www.worldsurfleague.com/events?all=1&year={year}")
    soup = BeautifulSoup(res.text, "html.parser")
    # Matches any row whose class starts with "event"
    event_pattern = re.compile(r"^event")
    event_table_rows = soup.find("div", class_="events-schedule-table").find_all("tr", class_=event_pattern)
    return event_table_rows, year

def parse_event_cols(rows, year):
    """Turn schedule rows into dicts, keeping only events whose tour code is "mct"."""
    result = []
    for row in rows:
        date_range, location, tour, status = row.find_all("td")
        event_details = location.find("a")
        if event_details is None:
            # No event link means no tour code, so the row could never pass the filter below
            continue
        # literal_eval parses the attribute as a plain literal and, unlike eval, cannot execute code
        event_data = ast.literal_eval(event_details["data-gtm-event"])
        tour_code = event_data["tour_code"]
        event_name = event_data["event_name"]
        # The schedule shows dates without a year, so append it before parsing
        start, end = parse(f'{date_range.text} {year}')
        if tour_code and tour_code.lower() == "mct":
            result.append({
                "start_date": start,
                "end_date": end,
                "event_name": event_name,
                "tour_code": tour_code,
                "event_link": event_details["href"].strip(),
                "status": status.text.lower().strip()
            })
    return result

In [135]:
results = []

for year in years:
    row_data, year = get_events_by_year(year)
    result = parse_event_cols(row_data, year)
    results.append(result)
df_events = pd.DataFrame(chain(*results))

### Extract Links from Contest Results

In [133]:
df_events[df_events["status"] == "completed"][["event_name","end_date","event_link"]]

| | event_name | end_date | event_link |
|---|---|---|---|
| 0 | Billabong Pro Pipeline | 2022-02-10 | https://www.worldsurfleague.com/events/2022/ct/2/billabong-pro-pipeline/main |
| 1 | Hurley Pro Sunset Beach | 2022-02-23 | https://www.worldsurfleague.com/events/2022/ct/3/hurley-pro-sunset-beach/main |
| 2 | MEO Pro Portugal | 2022-03-13 | https://www.worldsurfleague.com/events/2022/ct/6/meo-portugal-pro/main |
| 3 | Rip Curl Pro Bells Beach | 2022-04-20 | https://www.worldsurfleague.com/events/2022/ct/9/rip-curl-pro-bells-beach/main |
| 4 | Margaret River Pro | 2022-05-04 | https://www.worldsurfleague.com/events/2022/ct/5/margaret-river-pro/main |
| 5 | Quiksilver Pro G-Land | 2022-06-06 | https://www.worldsurfleague.com/events/2022/ct/8/quiksilverroxy-pro-g-land/main |
| 6 | Surf City El Salvador Pro | 2022-06-20 | https://www.worldsurfleague.com/events/2022/ct/12/surf-city-el-salvador-pro/main |
| 7 | Oi Rio Pro | 2022-06-30 | https://www.worldsurfleague.com/events/2022/ct/7/oi-rio-pro/main |
| 8 | Corona Open J-Bay | 2022-07-21 | https://www.worldsurfleague.com/events/2022/ct/4/corona-open-j-bay/main |
| 9 | Rip Curl WSL Finals | 2022-09-16 | https://www.worldsurfleague.com/events/2022/ct/10/rip-curl-wsl-finals/main |


In [140]:
df_events["id"] = pd.util.hash_pandas_object(df_events, index=False)
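
Hashing the whole row means an event's ID changes whenever any column changes, for example when its status is updated. A possible alternative, sketched below, hashes only the columns that identify an event. The choice of `event_name` and `end_date` as identifying columns is an assumption, and `events` is a two-row stand-in built from the table above.

```python
import pandas as pd

# Stand-in for df_events, using two rows from the completed-events table.
events = pd.DataFrame({
    "event_name": ["Billabong Pro Pipeline", "Hurley Pro Sunset Beach"],
    "end_date": pd.to_datetime(["2022-02-10", "2022-02-23"]),
    "status": ["completed", "completed"],
})

# Assumed identifying columns; "status" is left out so that it can change freely.
id_columns = ["event_name", "end_date"]
events["id"] = pd.util.hash_pandas_object(events[id_columns], index=False)

# Changing a non-identifying column and re-hashing reproduces the same IDs.
events["status"] = "archived"
rehashed = pd.util.hash_pandas_object(events[id_columns], index=False)
print(events["id"].equals(rehashed))
```

Passing `index=False` keeps the row position out of the hash, so re-scraping the schedule in a different order would still produce the same ID for the same event.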

In [None]:
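def get_event_by_url(event_id, url):
    """Fetch one event page; extracting the results is not implemented yet."""
    result = []
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "html.parser")
    # TODO: Find men's and women's results
    # TODO: Store by event_id
    # TODO: Going forward, store surfers by id
    return result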

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [27]:
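import os
import openai
# Read the key from the environment rather than hard-coding it in the notebook
openai.api_key = os.environ.get("OPENAI_API_KEY")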
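
A custom query needs a prompt that places retrieved rows ahead of the question. The sketch below shows one way to assemble it. `PROMPT_TEMPLATE`, `build_prompt`, and the character budget are hypothetical; a character count stands in for a proper token count, and the two context rows restate rankings from the output above.

```python
# Hypothetical template; the wording is an assumption, not taken from the notebook.
PROMPT_TEMPLATE = """Answer the question based on the context below. If the question cannot be answered using the context, say "I don't know".

Context:
{context}

---

Question: {question}
Answer:"""


def build_prompt(question, context_rows, max_chars=2000):
    """Pack context rows into the prompt until the character budget is used up."""
    selected = []
    used = 0
    for row in context_rows:
        if used + len(row) > max_chars:
            break  # Stop at the first row that would exceed the budget.
        selected.append(row)
        used += len(row)
    return PROMPT_TEMPLATE.format(context="\n".join(selected), question=question)


rows = [
    "In 2022, Filipe Toledo was ranked number 1 among male surfers.",
    "In 2022, Italo Ferreira was ranked number 2 among male surfers.",
]
prompt = build_prompt("Who was the number 1 ranked male surfer in 2022?", rows)
print(prompt)
```

The resulting string can be passed as the `prompt` argument of a completion request. Rows are assumed to arrive sorted by relevance, so truncating at the budget drops the least relevant ones first.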

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [77]:
from openai import OpenAI

In [78]:
client = OpenAI()

In [100]:
surf_prompt = """
Question: "Who was the number 1 ranked female surfer in 2024 in the WSL?"
Answer:
"""
# Baseline query: the model gets no custom context
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=surf_prompt,
    stream=False,
    max_tokens=150
)
print(response.choices[0].text)


As a language model AI, I don't have access to real-time data or predictions for the future. It is impossible for me to accurately answer this question.
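
The baseline answer fails because the model has no 2024 data, so the custom query first has to find the dataset rows relevant to the question. The sketch below ranks rows by word overlap with the question. It is a deliberately simple stand-in for embedding-based similarity, not the method used in the course materials; `tokenize` and `retrieve` are hypothetical names, and the sample texts restate rows from the rankings output.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, texts, k=3):
    """Return the k texts that share the most tokens with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(texts, key=lambda t: len(q_tokens & tokenize(t)), reverse=True)
    return ranked[:k]


texts = [
    "In 2022, Filipe Toledo was ranked number 1 among male surfers in the WSL.",
    "In 2024, Isabella Nichols was ranked number 14 among female surfers in the WSL.",
    "In 2024, India Robinson was ranked number 15 among female surfers in the WSL.",
]
question = "Who was ranked number 14 among female surfers in 2024 in the WSL?"
print(retrieve(question, texts, k=1))
```

Word overlap is enough to separate rows that differ by year, gender, or rank, but it has no notion of meaning, so a question phrased with different vocabulary would need embeddings instead.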


### Question 2