# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from openai import OpenAI
from scipy.spatial.distance import cosine
import tiktoken

In [2]:
OPENAI_API_KEY = 'YOUR_API_KEY'
EMBEDDINGS_MODEL = 'text-embedding-3-small'
GPT_MODEL = 'gpt-4o-mini-2024-07-18'
BASE_URL = 'https://openai.vocareum.com/v1'

In [3]:
#URL of the page to scrape and the local file where its HTML is saved
url = 'https://theanalyst.com/eu/2024/05/premier-league-best-facts-2023-24'
file = 'premier_league_2023-24.html'

In [4]:
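#Function to fetch the HTML page at the given URL
def fetch_html_page(url):
    #A timeout keeps the request from hanging indefinitely
    response = requests.get(url, timeout=30)
    if response.status_code == 200:
        return response.content
    else:
        raise Exception(f'Connection error: HTTP {response.status_code}')

#Fetch first so that a failed request does not leave an empty file behind
html_page = fetch_html_page(url)
with open(file, mode='wb') as html_file:
    html_file.write(html_page)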

In [5]:
#Function to extract data from the given html page and return it as a pandas dataframe
def extract_data(html_file):
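    #The page was saved as raw bytes, so read it as bytes and let BeautifulSoup detect the encoding
    with open(html_file, 'rb') as fp: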
        soup = BeautifulSoup(fp, 'html.parser')

    nodes = soup.find('h2', {'class': 'wp-block-heading has-text-align-center'})
    month_tags = []
    text = []
    current_month = None
    for month in soup.find_all('h2', {'class': 'wp-block-heading has-text-align-center'}):
        month_tags.append(month.find_next('strong'))

    for node in nodes.find_all_next():
        if node in month_tags:
            current_month = node.text

        elif node.name == 'ul':
            tags = node.find_all('li')
            for tag in tags:
                event = tag.text.strip()
                text.append(f'{current_month}: {event}') 

    return pd.DataFrame(text, columns=['text'])

#Extracting the data and adding it to a dataframe
df = extract_data(file)
pd.set_option('display.max_colwidth', None)
df.head()

Unnamed: 0,text
0,"August: With his brace against Burnley, Erling Haaland became only the second player to score 2+ goals in a team’s opening game of a Premier League season in consecutive campaigns (also two vs West Ham in 2022-23), after Didier Drogba in 2009-10 (two v Hull) and 2010-11 (three v West Brom) for Chelsea."
1,"August: Vincent Kompany became the first Belgian to manage in the Premier League, while he became just the third manager to face a team he previously played for in the competition in his maiden fixture (also Roberto Di Matteo for West Brom v Chelsea in August 2010 and Scott Parker for Fulham v Chelsea in March 2019), with all three losing those games."
2,"August: James Milner appeared in his 22nd Premier League season, equalling the competition’s record held by Ryan Giggs."
3,"August: Luton became the sixth side whose first ever Premier League goal was a penalty, and first since Swansea in September 2011."
4,August: Aston Villa’s 5-1 loss at Newcastle on MD1 was the first time they had conceded 5+ in their first league match since 1951-52 (2-5 v Bolton Wanderers).


In [6]:
#Check the total number of tokens in the dataset using the tokenizer for the GPT model
encoding = tiktoken.encoding_for_model(GPT_MODEL) 

total_text = " ".join(df['text'].tolist())
tokens = encoding.encode(total_text)
num_tokens = len(tokens)

print(f"Total number of tokens: {num_tokens}")

Total number of tokens: 6724


In [8]:
#Initializing OpenAI Client
client = OpenAI(base_url=BASE_URL, api_key=OPENAI_API_KEY)

In [9]:
#Function to get embeddings for the given text based on the given embeddings model
def get_embeddings(text, model):
    #Use the model passed in by the caller rather than the global constant
    response = client.embeddings.create(input=text, model=model)
    return [data.embedding for data in response.data]

#Function to create embeddings for the given dataframe based on the given embeddings model
def create_embeddings(df, model):
    embeddings = []
    batch_size = 50
    for i in range(0, len(df), batch_size):
        batch = df.iloc[i : i + batch_size]['text'].tolist()
        embeddings.extend(get_embeddings(batch, model))
    return embeddings
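
The batching in `create_embeddings` can be checked without calling the API. The sketch below is a minimal stand-alone illustration; `iter_batches` and the sample `rows` list are hypothetical names used only here, and the batch size of 50 simply mirrors the value in the cell above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 120 placeholder rows are split into batches of 50, 50 and 20,
# so three embedding requests would be sent instead of 120.
rows = [f"fact {i}" for i in range(120)]
batches = list(iter_batches(rows, 50))
print([len(batch) for batch in batches])
```

The last batch is simply shorter than the others, which is the same behaviour as slicing with `df.iloc[i : i + batch_size]`.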

In [10]:
#Creating embeddings for the dataset and adding them to an embeddings column
df['embeddings'] = create_embeddings(df, EMBEDDINGS_MODEL)

In [11]:
#Saving the dataframe with embeddings to a csv file
df.to_csv('text_embeddings.csv')
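
One thing to keep in mind when saving embeddings to CSV: a list-valued cell is written as its string representation, so it comes back as a string when the file is read again. The sketch below reproduces this with the standard `csv` module to stay dependency-free; the sample row is made up. With pandas, the equivalent conversion after `pd.read_csv` would be along the lines of `df['embeddings'].apply(ast.literal_eval)`.

```python
import ast
import csv
import io

# Write one row whose "embeddings" cell is a list, stored as text like in the CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["August: example fact", str([0.1, -0.2, 0.3])])

# Read the row back: the cell is a plain string, not a list of floats.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["embeddings"]
print(type(raw).__name__)

# ast.literal_eval safely parses the string back into a Python list.
parsed = ast.literal_eval(raw)
print(parsed)
```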

In [12]:
df.head()

Unnamed: 0,text,embeddings
0,"August: With his brace against Burnley, Erling Haaland became only the second player to score 2+ goals in a team’s opening game of a Premier League season in consecutive campaigns (also two vs West Ham in 2022-23), after Didier Drogba in 2009-10 (two v Hull) and 2010-11 (three v West Brom) for Chelsea.","[-0.0053741345182061195, -0.00987408310174942, 0.09268473088741302, -0.006345744244754314, 0.043646082282066345, 0.014713700860738754, 0.017589032649993896, 0.01740998402237892, 0.011259088292717934, 0.011111634783446789, -0.026688989251852036, -0.007493770681321621, -0.022518176585435867, 0.004739559721201658, 0.007314720656722784, 0.035788942128419876, -0.0344829298555851, -0.010837793350219727, -0.05847563594579697, -0.011680382303893566, -0.0275315772742033, 0.0031412749085575342, 0.006114032119512558, 0.03361927717924118, 0.018336830660700798, 0.0176100991666317, -0.0026949665043503046, -0.0139448381960392, -0.02833203598856926, 0.019621778279542923, 0.032397523522377014, -0.03214474767446518, 0.06690151989459991, 0.004394625313580036, 0.05666407197713852, 0.018673866987228394, -0.0004114201292395592, -0.004405157640576363, 0.035051677376031876, -0.024203352630138397, 0.004555243533104658, -0.03536764904856682, -0.0023079023230820894, -0.00512135773897171, -0.02847948856651783, -0.014144953340291977, 0.02871120162308216, -0.020295849069952965, 0.013460350222885609, 0.03722134232521057, -0.007988791912794113, -0.001346824923530221, 0.020464366301894188, 0.039243556559085846, 0.027468383312225342, -0.02173878252506256, 0.02348715253174305, 0.018663333728909492, -0.009647637605667114, -0.06112978979945183, 0.02407696470618248, 0.04191877320408821, -0.007714950479567051, 0.011827834881842136, -0.005792795680463314, -0.07499036937952042, -0.04490996524691582, 0.03810606151819229, 0.056200649589300156, 0.020801402628421783, -0.0056927381083369255, 0.009368530474603176, -0.023613540455698967, -0.0008590452489443123, -0.018410557880997658, 0.00031284388387575746, 
-0.013281300663948059, 0.017146674916148186, -0.00030494460952468216, 0.008599667809903622, -0.00736211659386754, -0.028795460239052773, -0.06664874404668808, -0.05101872980594635, 0.011185361072421074, 0.021317487582564354, 0.00282530440017581, -0.012112208642065525, -0.001302720746025443, -0.004713228903710842, 0.0209277905523777, -0.026225565001368523, 0.004678999073803425, -0.06336265057325363, 0.05388353019952774, 0.05316733196377754, 0.024919552728533745, -0.044151633977890015, 0.03905397281050682, 0.02628875896334648, ...]"
1,"August: Vincent Kompany became the first Belgian to manage in the Premier League, while he became just the third manager to face a team he previously played for in the competition in his maiden fixture (also Roberto Di Matteo for West Brom v Chelsea in August 2010 and Scott Parker for Fulham v Chelsea in March 2019), with all three losing those games.","[-0.021047059446573257, -0.0463116392493248, 0.0290360189974308, -0.024757668375968933, -0.041850797832012177, -0.0023546144366264343, 0.04533836618065834, -0.012429525144398212, -0.011679292656481266, 0.03157058730721474, 0.033476583659648895, -0.00022034907306078821, -0.023318031802773476, -0.021817568689584732, 0.01346362940967083, 0.024068264290690422, -0.06257343292236328, 0.05328677222132683, -0.04241854324936867, -0.017640598118305206, 0.01907009445130825, 0.04134388640522957, 0.030070124194025993, 0.04558168351650238, 0.012388971634209156, 0.0015803036512807012, -0.025690387934446335, -0.0420130118727684, -0.060505226254463196, 0.005454392172396183, 0.0503263957798481, -0.030637867748737335, -0.007051170337945223, 0.012946576811373234, 0.06743980199098587, -0.019688529893755913, -0.02398715913295746, -0.015389901585876942, 0.02504153922200203, 0.002141710603609681, 0.010422146879136562, -0.01625165529549122, 0.016028612852096558, -0.004029964096844196, -0.038221295922994614, 0.0010645188158378005, -0.0023596833925694227, 0.03256414085626602, -0.01654566451907158, 0.04142499342560768, 0.03517981246113777, 0.012895885854959488, 0.018025852739810944, 0.030049847438931465, 0.0260959193110466, -0.015754878520965576, 0.014365935698151588, -0.005353009328246117, 0.0013040356570854783, -0.018471937626600266, 0.01628207042813301, 0.00031587062403559685, 0.014305106364190578, 0.029725423082709312, -0.07603706419467926, -0.07875411957502365, -0.04655496031045914, -0.0038018531631678343, 0.06439832597970963, -0.014832296408712864, 0.04176969453692436, 0.03353741392493248, -0.013321693055331707, 0.036558620631694794, 
-0.01958714798092842, -0.014021234586834908, 0.008288039825856686, 0.0509752482175827, -0.029380720108747482, -0.03021205961704254, -0.010422146879136562, -0.05077248066663742, -0.02658255770802498, -0.03473373129963875, -0.030779803171753883, 0.008455321192741394, -0.03167197108268738, -0.01784336380660534, 0.03315215930342674, 0.005560843739658594, 0.01121293194591999, -0.03177335485816002, -0.06873750686645508, -0.04627108573913574, 0.030779803171753883, 0.040208399295806885, 0.005135036073625088, -0.04501394182443619, 0.015035062097012997, 0.004440564662218094, ...]"
2,"August: James Milner appeared in his 22nd Premier League season, equalling the competition’s record held by Ryan Giggs.","[0.004917625337839127, 0.03175695613026619, 0.011986400000751019, -0.026354070752859116, 0.0026789302937686443, -0.018239738419651985, 0.037359945476055145, 0.024613142013549805, 0.0022086792159825563, -0.02293224446475506, -0.006513477768748999, 0.0032617414835840464, -0.015908492729067802, 0.032637424767017365, 0.046624891459941864, 0.021631550043821335, -0.06435436010360718, -0.006533488165587187, -0.05743066221475601, -0.010925833135843277, 0.026994412764906883, 0.013917430303990841, 0.013567243702709675, 0.035398900508880615, 0.011776287108659744, -0.04950643330812454, -0.013537227176129818, -0.013777355663478374, -0.028495213016867638, 0.022532029077410698, 0.05418893322348595, 0.004002136643975973, 0.08748670667409897, 0.01146612223237753, 0.057510703802108765, -0.040761761367321014, 0.00768410274758935, -0.01644878275692463, 0.0017797001637518406, -0.010935838334262371, -0.011516148224473, 0.003952110186219215, 0.008694642223417759, -0.0026839328929781914, 0.002294975332915783, -0.005658020731061697, 0.023192383348941803, -0.016979064792394638, 0.008714652620255947, 0.03924095258116722, -0.02939569391310215, -0.01582845114171505, -0.02189168892800808, -0.005557967349886894, 0.03227723389863968, -0.009004808031022549, 0.0004552430473268032, 0.020190779119729996, -0.009840253740549088, -0.01084579061716795, -0.006033221259713173, 0.003964616917073727, 0.01676895283162594, -0.010735731571912766, -0.0723586305975914, -0.046104613691568375, -0.01490795984864235, 0.06019213795661926, 0.022652093321084976, -0.038300447165966034, 0.005567972548305988, -0.044143568724393845, -0.007519014645367861, 0.020290832966566086, -0.007714118808507919, 0.026694253087043762, -0.020891154184937477, -0.041262030601501465, 0.04694506525993347, 0.01643877662718296, 0.040641698986291885, -0.0235325638204813, -0.05478925257921219, -0.014507745392620564, 
0.0044098543003201485, 0.04350322484970093, -0.018780026584863663, -0.011626207269728184, -0.008329447358846664, 0.05294826999306679, 0.004094686359167099, -0.043623290956020355, -0.01954043284058571, -0.02609393186867237, 0.03577910363674164, -0.01982058212161064, 0.0221918486058712, -0.06159288436174393, -0.005863130558282137, 0.01443770807236433, ...]"
3,"August: Luton became the sixth side whose first ever Premier League goal was a penalty, and first since Swansea in September 2011.","[-0.036540597677230835, -0.003917684778571129, 0.06347959488630295, -0.014391922391951084, -0.023249296471476555, -0.01230598520487547, 0.018846815451979637, -0.001184477237984538, -0.0002433812478557229, -0.022284943610429764, -0.02572307176887989, 0.015576399862766266, -0.006436008960008621, 0.0017609926871955395, 0.02155119739472866, -0.0152619369328022, 0.013228409923613071, -0.014098423533141613, -0.08955905586481094, -0.011404524557292461, 0.05417149141430855, -0.01828078180551529, 0.060293037444353104, -0.0029061620589345694, -0.0029559521935880184, -0.007458013948053122, -0.0454084537923336, -0.06184438616037369, -0.07375205308198929, 0.0400206558406353, 0.051949284970760345, -0.019308026880025864, 0.020901307463645935, 0.046959806233644485, 0.0711105614900589, 0.017494624480605125, 0.030104590579867363, 0.0115407919511199, 0.005062853917479515, -0.055261630564928055, 0.018175961449742317, -0.010167636908590794, -0.007326987572014332, -0.09215862303972244, -0.042494431138038635, -0.0003861998557113111, 0.05517777055501938, -0.0018435392994433641, -0.046624377369880676, -0.002564183669164777, -0.011677059344947338, -0.004939689300954342, 0.020670700818300247, 0.026372961699962616, 0.02045057713985443, -0.05538741499185562, 0.0217818021774292, -0.0018867779290303588, -0.01470638532191515, -0.02211723104119301, 0.045995451509952545, 0.0027698948979377747, 0.0013305714819580317, -0.0035403291694819927, -0.0077934409491717815, -0.07253612577915192, -0.04310239478945732, 0.011813325807452202, 0.027232494205236435, 0.024360399693250656, 0.0618024580180645, -0.02005225606262684, 0.0344441793859005, 6.870688957860693e-05, 0.014769278466701508, 0.008333269506692886, -0.01756799966096878, 0.020419130101799965, -0.008438089862465858, 0.04165586456656456, 0.029769163578748703, -0.023878222331404686, -0.052955567836761475, 
-0.04027222841978073, -0.0018828471656888723, 0.0017976800445467234, -0.04314432293176651, 0.0056079234927892685, -0.023689545691013336, 0.047672588378190994, 0.0037814173847436905, -0.025219932198524475, -0.06540830433368683, -0.027672743424773216, 0.003852171590551734, 0.038469307124614716, -0.015272418968379498, -0.03662445768713951, 0.035010211169719696, 0.0029297468718141317, ...]"
4,August: Aston Villa’s 5-1 loss at Newcastle on MD1 was the first time they had conceded 5+ in their first league match since 1951-52 (2-5 v Bolton Wanderers).,"[-0.03858356177806854, 0.00453770300373435, 0.01843343675136566, 0.002221747301518917, 0.013157769106328487, 0.008363609202206135, 0.037892699241638184, -0.03883478417992592, 0.016821427270770073, 0.02669237181544304, 0.009740098379552364, -0.004467046819627285, -0.009672058746218681, -0.028785889968276024, -0.011933060362935066, 0.04668547958135605, 0.031821493059396744, 0.01706218160688877, -0.06372673064470291, -0.0013195713981986046, 0.038395144045352936, 0.019940771162509918, 0.039086006581783295, -0.020286202430725098, 0.0034647744614630938, -0.0103105828166008, 0.00041248873458243906, -0.07151462137699127, -0.02794848196208477, -0.04706231504678726, 0.04831842705607414, -0.022358786314725876, 0.07281260192394257, 0.03640630096197128, 0.04999323934316635, 0.016737686470150948, 0.031025955453515053, -0.03592479228973389, 0.01408938504755497, -0.04706231504678726, 0.055645741522312164, -0.03705529123544693, -0.027404168620705605, -0.009787202812731266, -0.052086759358644485, -0.022421590983867645, 0.0028969072736799717, 0.0014118170365691185, -0.04021650552749634, -0.02415921352803707, 0.006296259351074696, -0.01672721840441227, -0.005532124545425177, 0.01866372488439083, 0.05455711483955383, -0.02451511099934578, 0.04140981286764145, 0.020349007099866867, 0.021175947040319443, -0.013283380307257175, -0.03584105148911476, -0.0030015832744538784, 0.03849982097744942, 0.0030879410915076733, -0.038332339376211166, -0.08290336281061172, -0.04107484966516495, 0.04609929397702217, 0.03429184481501579, 0.0022309066262096167, -0.0090754060074687, -0.032386742532253265, -0.018873076885938644, 0.03561076521873474, -0.0077460212633013725, 0.011901657097041607, -0.0006260931259021163, -0.013126365840435028, 0.02010825276374817, 0.001707526738755405, 0.04266592487692833, -0.012069138698279858, -0.06983980536460876, 
-0.01655973680317402, 0.05158431455492973, 0.009792436845600605, -0.052547335624694824, 0.03573637455701828, -0.006458506919443607, 0.03305666893720627, 0.03276357799768448, 0.009483642876148224, -0.04184944927692413, -0.03722277283668518, 0.03795550391077995, 0.019396455958485603, 0.022589072585105896, -0.009426070377230644, -0.0013038699980825186, -0.013723019510507584, ...]"


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [13]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    """
    # Get embeddings for the question text
    question_embeddings = get_embeddings(question, EMBEDDINGS_MODEL)[0]
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()  
    df_copy['distances'] = df_copy['embeddings'].apply(lambda embedding: cosine(embedding, question_embeddings))
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant, so we sort in ascending order)
    df_copy.sort_values('distances', ascending=True, inplace=True)
    return df_copy
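
To make the ordering concrete, here is a small sketch of cosine-distance ranking on toy two-dimensional vectors. It uses a hand-written `cosine_distance` helper (a stand-in for `scipy.spatial.distance.cosine`, computing 1 minus the cosine similarity), and the vectors and labels are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],   # distance 0: most relevant
    "orthogonal": [0.0, 3.0],       # distance 1
    "opposite": [-1.0, 0.0],        # distance 2: least relevant
}

# Sorting in ascending order of distance puts the most relevant row first.
ranked = sorted(rows, key=lambda name: cosine_distance(rows[name], question_vec))
print(ranked)
```

Vector length does not matter here, only direction, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.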

In [14]:
def create_context_prompt(question, df, max_token_count=1500):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """

    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    prompt = '''
    Use the context below consisting of some facts about the 2023-24 English Premier League season to answer the subsequent question. If the answer cannot be found, write "I don't know."

    Context:
    \"\"\"
    {}
    \"\"\"

    Question: {}'''

    # Count the number of tokens in the prompt template and question
    current_token_count = len(tokenizer.encode(prompt)) + len(tokenizer.encode(question))
    context = []
    for text in get_rows_sorted_by_relevance(question, df)['text'].values:
        text_token_count = len(tokenizer.encode(text))
        # Increase the counter based on the number of tokens in this row
        current_token_count += text_token_count
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    # Join the selected rows with newlines so each fact stays on its own line
    return prompt.format('\n'.join(context), question)
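
The context-building loop is a greedy fill against a token budget. The sketch below isolates that idea under simplifying assumptions: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of calling `tiktoken`, `pack_context` is an illustrative name, and joining the rows with newlines is a choice made for this sketch.

```python
def count_tokens(text):
    # Stand-in tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_context(rows, budget, used=0):
    """Add rows in order until the next one would exceed `budget` tokens."""
    selected = []
    for row in rows:
        used += count_tokens(row)
        if used > budget:
            break  # stop at the first row that does not fit
        selected.append(row)
    return "\n".join(selected)


# Rows are assumed to be ordered from most to least relevant.
rows = ["a b c", "d e", "f g h i", "j"]
context = pack_context(rows, budget=6)
print(context)
```

Because the loop stops at the first row that overflows the budget, the short final row `"j"` is not considered even though it would have fit. That keeps the context strictly in relevance order at the cost of leaving some of the budget unused.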

In [15]:
#Function to create a basic prompt containing the given question
def create_simple_prompt(question):
    prompt = '''

    Question: {}

    Answer: 
    '''
    return prompt.format(question)

In [16]:
def answer_question(prompt):
    """
    Given a fully formatted prompt string, send it to the OpenAI chat
    completions endpoint and return the text of the model's reply

    Any error raised by the API call propagates to the caller
    """
    response = client.chat.completions.create(
        messages=[
            {'role': 'system', 'content': 'You answer questions about the 2023-24 English Premier League season.'},
            {'role': 'user', 'content': prompt},
        ],
        model=GPT_MODEL,
    )

    return response.choices[0].message.content

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
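
One possible way to keep the paired answers together is a small helper that runs the same question through both prompt styles. The sketch below is only an illustration: `compare_answers` and `fake_answer` are hypothetical names, and the model call is replaced by a stub so the block runs without an API key.

```python
def compare_answers(question, answer_fn, simple_prompt_fn, context_prompt_fn):
    """Return the answers for one question with and without added context."""
    return {
        "question": question,
        "without_context": answer_fn(simple_prompt_fn(question)),
        "with_context": answer_fn(context_prompt_fn(question)),
    }


def fake_answer(prompt):
    # Stub standing in for the model call: reports whether context was supplied.
    return "context seen" if "Context:" in prompt else "no context"


result = compare_answers(
    "Who scored?",
    answer_fn=fake_answer,
    simple_prompt_fn=lambda q: f"Question: {q}",
    context_prompt_fn=lambda q: f"Context:\nsome fact\n\nQuestion: {q}",
)
print(result)
```

In this notebook the stub arguments would presumably be swapped for `answer_question`, `create_simple_prompt`, and `lambda q: create_context_prompt(q, df)`.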

In [67]:
question_1 = 'Has any player scored 10+ goals and provided 10+ assists in three consecutive English Premier League seasons?'
print('Answer without context: ', answer_question(create_simple_prompt(question_1)))

Answer without context:  As of the 2023-24 Premier League season, no player has scored 10 or more goals and provided 10 or more assists in three consecutive seasons. While there have been players who have consistently performed at a high level, achieving this particular milestone over three successive seasons is quite rare in the league's history. Players like Kevin De Bruyne and Mohamed Salah have come close, but not specifically in three consecutive seasons with that exact statistic.


In [26]:
print('Answer with context: ', answer_question(create_context_prompt(question_1, df)))

Answer with context:  Yes, Liverpool’s Mohamed Salah became the first player in Premier League history to score 10+ goals and provide 10+ assists in three consecutive seasons.


In [32]:
question_2 = "Has any Chelsea player scored a first-half hat-trick in the Premier League?"
print('Answer without context: ', answer_question(create_simple_prompt(question_2)))

Answer without context:  As of October 2023, there has been no record of a Chelsea player scoring a first-half hat-trick specifically in the Premier League. However, players from Chelsea have achieved hat-tricks in Premier League matches, but details about specific instances of first-half hat-tricks may vary. For the most accurate and updated statistics, it is always best to check the official Premier League records or Chelsea FC's official communications.


In [33]:
print('Answer with context: ', answer_question(create_context_prompt(question_2, df)))

Answer with context:  Yes, Cole Palmer became the first Chelsea player to score a first half hat-trick in the Premier League during the 2023-24 season against Everton.


In [34]:
question_3 = "Which goalkeeper has scored the most own goals in the premier league?"
print('Answer without context: ', answer_question(create_simple_prompt(question_3)))

Answer without context:  As of the 2023-24 Premier League season, the record for the most own goals scored by a goalkeeper in the Premier League is held by Peter Schmeichel, who scored 4 own goals during his career in the league.


In [36]:
print('Answer with context: ', answer_question(create_context_prompt(question_3, df)))

Answer with context:  Emiliano Martínez has scored the most own goals in Premier League history, with three own goals.


In [45]:
question_4 = "Who has the most premier league appearances for Everton?"
print('Answer without context: ', answer_question(create_simple_prompt(question_4)))

Answer without context:  The player with the most Premier League appearances for Everton is Gareth Barry. He made a total of 422 appearances for the club in the Premier League during his time at Everton from 2013 to 2018.


In [44]:
print('Answer with context: ', answer_question(create_context_prompt(question_4, df)))

Answer with context:  Séamus Coleman has the most Premier League appearances for Everton, having made his 355th appearance for the club.


In [63]:
question_5 = "Who is the youngest player to score in four consecutive Premier League appearances for Manchester United?"
print('Answer without context: ', answer_question(create_simple_prompt(question_5)))

Answer without context:  The youngest player to score in four consecutive Premier League appearances for Manchester United is Mason Greenwood.


In [64]:
print('Answer with context: ', answer_question(create_context_prompt(question_5, df)))

Answer with context:  Rasmus Højlund is the youngest player to score in four consecutive Premier League appearances for Manchester United.
