# Custom Chatbot Project

## 2024 - T20 Cricket World cup

The 2024 T20 Cricket World Cup was hosted for the first time in the United States.  Twenty countries from all geographies have participated and played 50 matches. Millions of people watched tournment across the globe.

This chatbot helps in answering various questions regarding this event conducted during June 2024.

The data from Wikipedia is extracted and used for fine tuning GPT.

* https://en.wikipedia.org/wiki/2024_ICC_Men's_T20_World_Cup



#### Import required libraries

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import nltk
#from nltk.corpus import stopwords
#from nltk.tokenize import word_tokenize
#from nltk.stem import WordNetLemmatizer
#from sklearn.feature_extraction.text import TfidfVectorizer
#from sklearn.metrics.pairwise import cosine_similarity



#### Helper function to use Wikipedia API and to extract data  

In [2]:
import requests

def get_wikipedia_content(page_title):
    # Construct the URL for the Wikipedia API
    url = f"https://en.wikipedia.org/w/api.php"

    # Define the parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": True,
        "titles": page_title,
        #"exlimit": 1,
        #"exintro": True
    }

    # Send the request to the Wikipedia API
    response = requests.get(url, params=params)

    if response.status_code != 200:
        raise Exception("Failed to retrieve data from Wikipedia API")

    # Extract the page content from the JSON response
    data = response.json()
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))

    if "extract" not in page:
        raise Exception(f"Failed to retrieve extract for page '{page_title}'")

    return page["extract"]



#### Extract data from T20 World cup website

In [3]:
page_title = "2024 ICC Men's T20 World Cup"
content = get_wikipedia_content(page_title)

# Save content to a file
with open("wikipedia_content.txt", "w", encoding="utf-8") as file:
    file.write(content)

### Check the content

In [4]:
content

'The 2024 ICC Men\'s T20 World Cup was the ninth edition of the ICC Men\'s T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.\nThe tournament field expanded from 16 to 20 teams, including the two hosts, the top eight teams from the 2022 edition, the best placed two teams in the ICC Men\'s T20I Team Rankings not already qualified, and eight other teams determined by regional qualifiers. Canada and Uganda qualified for the men\'s T20 World Cup for the first time, while the United States participated for the first time by virtue of being co-hosts.\nEngland were the defending champions and were beaten in the semi-finals by India, who went on to win their second T20 World Cup title, defeating South Africa by 7 runs and equalling England and West Indies with the most titles in the T

In [5]:
# Parse the conents
content = content.split("\n")

In [6]:
content

["The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.",
 "The tournament field expanded from 16 to 20 teams, including the two hosts, the top eight teams from the 2022 edition, the best placed two teams in the ICC Men's T20I Team Rankings not already qualified, and eight other teams determined by regional qualifiers. Canada and Uganda qualified for the men's T20 World Cup for the first time, while the United States participated for the first time by virtue of being co-hosts.",
 'England were the defending champions and were beaten in the semi-finals by India, who went on to win their second T20 World Cup title, defeating South Africa by 7 runs and equalling England and West Indies with the most titles in th

In [7]:
# Load the content to the dataframe
df_t20 = pd.DataFrame(content, columns=["text"])

In [8]:
df_t20

Unnamed: 0,text
0,The 2024 ICC Men's T20 World Cup was the ninth...
1,The tournament field expanded from 16 to 20 te...
2,England were the defending champions and were ...
3,
4,The ICC Men's T20 World Cup is a professional ...
...,...
89,"In India, Disney Star handled host broadcastin..."
90,In an effort to help promote the sport to U.S....
91,
92,Official website


In [9]:
# Removing blank lines
df_t20["text"] = df_t20[df_t20["text"].str.len() > 0]

In [10]:
df_t20

Unnamed: 0,text
0,The 2024 ICC Men's T20 World Cup was the ninth...
1,The tournament field expanded from 16 to 20 te...
2,England were the defending champions and were ...
3,
4,The ICC Men's T20 World Cup is a professional ...
...,...
89,"In India, Disney Star handled host broadcastin..."
90,In an effort to help promote the sport to U.S....
91,
92,Official website


In [11]:
# Removing Headers
df_t20 = df_t20[~df_t20["text"].astype(str).str.startswith("==")]

In [12]:
df_t20

Unnamed: 0,text
0,The 2024 ICC Men's T20 World Cup was the ninth...
1,The tournament field expanded from 16 to 20 te...
2,England were the defending champions and were ...
3,
4,The ICC Men's T20 World Cup is a professional ...
...,...
89,"In India, Disney Star handled host broadcastin..."
90,In an effort to help promote the sport to U.S....
91,
92,Official website


In [13]:
df_t20 = df_t20.dropna()

In [14]:
df_t20.tail(30)

Unnamed: 0,text
49,"On 7 June 2024, Canada earned their first T20 ..."
51,The second week began with South Africa taking...
53,"On 9 June 2024, the second biggest rivalry in ..."
54,"In the next match, South Africa batting first ..."
55,"On 12 June 2024, India and United States faced..."
56,"In the next match, Bangladesh set a target of ..."
58,Match 33 between Canada and India on 15 June 2...
59,"On 16 June 2024, three matches were played, in..."
60,"The second match, Bangladesh beat Nepal by 21 ..."
61,"On the last day of group stage, New Zealand de..."


In [15]:
# Checking the shape of data
df_t20.shape

(62, 1)

## Data Wrangling

Load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [16]:
df_t20.head()

Unnamed: 0,text
0,The 2024 ICC Men's T20 World Cup was the ninth...
1,The tournament field expanded from 16 to 20 te...
2,England were the defending champions and were ...
4,The ICC Men's T20 World Cup is a professional ...
6,"In November 2021, the ICC announced that the 2..."


## Install OPENAI

In [17]:
!pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl (76 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m76.5/76.5 kB[0m [31m1.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: openai
Successfully installed openai-0.28.0


In [18]:
#!openai migrate

In [19]:
OPENAI_API_KEY = 'sk-oF92SjdPydqhZwxxxxxxxxxxxnOsAYzo'

In [20]:
import openai as openai
import os

openai.api_key = OPENAI_API_KEY
#os.getenv("OPENAI_API_KEY")


In [21]:
!openai --version

openai 0.28.0


#### Embedding Model

In [22]:
# Ask the model a question
model = "text-embedding-ada-002"
#"gpt-3.5-turbo"

###  Check with GPT3.5 before finetuning

In [23]:
def get_completion(prompt, model="gpt-3.5-turbo"):
  messages = [{"role": "user", "content": prompt}]
  response = openai.ChatCompletion.create(
    model=model,
    messages=messages,
    temperature=0, # this is the degree of randomness of the model's output
  )
  return response.choices[0].message["content"]


#### Ask GPT3.5 to know who won the T20 World cup

In [24]:
prompt = "who has won T20 cricket world cup in 2024?"
response = get_completion(prompt)
print(response)

I'm sorry, but I am unable to provide real-time information as I do not have access to the internet. Please check the latest updates on the T20 Cricket World Cup in 2024 from a reliable source.


In [25]:
prompt2 = 'Who has played in finals of T20 world cup in 2024?'

In [26]:
response = get_completion(prompt2)
print(response)

As of now, the T20 World Cup in 2024 has not taken place yet. Therefore, the teams that will play in the finals of the T20 World Cup in 2024 are not known.


In [99]:
prompt3 = 'Who has hosted 2024 T20 world ckicket matches?'

In [100]:
response = get_completion(prompt3)
print(response)

The host country for the 2024 T20 World Cup has not been officially announced yet. Typically, the International Cricket Council (ICC) announces the host country a few years before the tournament takes place.


### Observation :
#### The above answer indicates, that the GPT3.5 doesn't know the details of the event happened in June 2024.

#### Create embeddings for the text in dataframe

In [27]:
response = openai.Embedding.create(
  model="text-embedding-ada-002", # Text embedding model
  input=df_t20["text"].tolist()
)


In [28]:
type(response)

In [29]:
# Check the keys
response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [30]:
response["data"][0]["embedding"]

[-0.00999754574149847,
 -0.025412291288375854,
 0.034930720925331116,
 -0.0010205162689089775,
 -0.01276363991200924,
 0.009205407463014126,
 -0.005231942515820265,
 -0.0018541773315519094,
 -0.0007869473774917424,
 -0.013939069584012032,
 0.027469294145703316,
 0.010131698101758957,
 0.0006851353100501001,
 -0.013095826841890812,
 0.012156759388744831,
 0.004602703731507063,
 0.010847177356481552,
 -0.019177399575710297,
 -0.013581329956650734,
 -0.02810811437666416,
 -0.025642266497015953,
 0.008917937986552715,
 -0.02456904761493206,
 0.023278629407286644,
 0.0004363944462966174,
 0.002373219234868884,
 0.007704178802669048,
 -0.018027523532509804,
 0.020442266017198563,
 -0.015191158279776573,
 -0.011767079122364521,
 0.002548895077779889,
 -0.0033889443147927523,
 -0.017043739557266235,
 -0.020506147295236588,
 -0.018244722858071327,
 -0.0026686738710850477,
 -0.022422609850764275,
 0.008745457045733929,
 -0.009103196673095226,
 0.0022007375955581665,
 0.019087964668869972,
 0.000

In [31]:
len(response["data"][0]["embedding"])

1536

In [32]:
# getthe embeddings in dataframe
embeddings = [data["embedding"] for data in response["data"]]

In [33]:
embeddings

[[-0.00999754574149847,
  -0.025412291288375854,
  0.034930720925331116,
  -0.0010205162689089775,
  -0.01276363991200924,
  0.009205407463014126,
  -0.005231942515820265,
  -0.0018541773315519094,
  -0.0007869473774917424,
  -0.013939069584012032,
  0.027469294145703316,
  0.010131698101758957,
  0.0006851353100501001,
  -0.013095826841890812,
  0.012156759388744831,
  0.004602703731507063,
  0.010847177356481552,
  -0.019177399575710297,
  -0.013581329956650734,
  -0.02810811437666416,
  -0.025642266497015953,
  0.008917937986552715,
  -0.02456904761493206,
  0.023278629407286644,
  0.0004363944462966174,
  0.002373219234868884,
  0.007704178802669048,
  -0.018027523532509804,
  0.020442266017198563,
  -0.015191158279776573,
  -0.011767079122364521,
  0.002548895077779889,
  -0.0033889443147927523,
  -0.017043739557266235,
  -0.020506147295236588,
  -0.018244722858071327,
  -0.0026686738710850477,
  -0.022422609850764275,
  0.008745457045733929,
  -0.009103196673095226,
  0.002200737

In [34]:
df_t20['embeddings']=embeddings

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_t20['embeddings']=embeddings


In [35]:
# Check the text and corresponding embeddings
df_t20

Unnamed: 0,text,embeddings
0,The 2024 ICC Men's T20 World Cup was the ninth...,"[-0.00999754574149847, -0.025412291288375854, ..."
1,The tournament field expanded from 16 to 20 te...,"[0.002700002631172538, -0.03767475485801697, 0..."
2,England were the defending champions and were ...,"[-0.009014622308313847, -0.01883614808320999, ..."
4,The ICC Men's T20 World Cup is a professional ...,"[-0.01135981734842062, -0.027375193312764168, ..."
6,"In November 2021, the ICC announced that the 2...","[-0.005802159663289785, -0.029249241575598717,..."
...,...,...
87,"On 30 June 2024, the ICC announced its team of...","[-0.01073498371988535, -0.0021276825573295355,..."
89,"In India, Disney Star handled host broadcastin...","[-0.008782428689301014, 0.0031884482596069574,..."
90,In an effort to help promote the sport to U.S....,"[0.00023881436209194362, -0.007622278295457363..."
92,Official website,"[-0.01566064916551113, -0.007410381454974413, ..."


In [36]:
# Save the embeddings to embeddings.csv
df_t20.to_csv("embeddings.csv")

In [37]:
# Read embeddings from csv file
df_embeddings = pd.read_csv("embeddings.csv", index_col=0)

In [38]:
df_embeddings.columns


Index(['text', 'embeddings'], dtype='object')

In [39]:
df_embeddings['embeddings']=df_embeddings['embeddings'].apply(eval).apply(np.array)

In [40]:
df_embeddings.head()

Unnamed: 0,text,embeddings
0,The 2024 ICC Men's T20 World Cup was the ninth...,"[-0.00999754574149847, -0.025412291288375854, ..."
1,The tournament field expanded from 16 to 20 te...,"[0.002700002631172538, -0.03767475485801697, 0..."
2,England were the defending champions and were ...,"[-0.009014622308313847, -0.01883614808320999, ..."
4,The ICC Men's T20 World Cup is a professional ...,"[-0.01135981734842062, -0.027375193312764168, ..."
6,"In November 2021, the ICC announced that the 2...","[-0.005802159663289785, -0.029249241575598717,..."


## Custom Query Completion



In [41]:
question = "who has won T20 cricket world cup in 2024?"

In [42]:
import openai
from openai.embeddings_utils import cosine_similarity
from openai.embeddings_utils import get_embedding


#### Get embeddings of the question

In [43]:
question_embedding = get_embedding(question, engine='text-embedding-ada-002')

In [44]:
# Check the embeddings of a question
question_embedding

[-0.0029320362955331802,
 -0.027124682441353798,
 0.010101467370986938,
 0.023054642602801323,
 -0.03293520212173462,
 0.030926957726478577,
 -0.019198812544345856,
 -0.011018565855920315,
 -0.01752527616918087,
 0.007550997659564018,
 0.04013810306787491,
 -0.012149876914918423,
 0.006844765041023493,
 -0.01682908460497856,
 0.01163442712277174,
 -0.0015722877578809857,
 0.0007543466635979712,
 -0.023724056780338287,
 0.0053017642349004745,
 -0.026067007333040237,
 -0.02134094014763832,
 0.00735686719417572,
 -0.004053305834531784,
 -0.001294480636715889,
 0.01034245640039444,
 0.012169959023594856,
 0.011975828558206558,
 -0.016721978783607483,
 0.01605256460607052,
 -0.015289431437849998,
 0.0032031491864472628,
 0.015035053715109825,
 -0.017163792625069618,
 -0.027285343036055565,
 -0.02602684311568737,
 -0.02392488159239292,
 -0.004307683557271957,
 -0.000788235804066062,
 0.03194446861743927,
 0.010516504757106304,
 0.0006836397224105895,
 0.02068491280078888,
 0.0113465785980224

In [45]:
# Distance from embeddings
from openai.embeddings_utils import get_embedding, cosine_similarity, distances_from_embeddings


### Check the distance between embeddings and find cosine similarity

In [46]:
distance = distances_from_embeddings(question_embedding, df_embeddings['embeddings'].tolist(), distance_metric='cosine')
distance

[0.1422600338175951,
 0.1786981104712606,
 0.14762174754509327,
 0.16080114766559495,
 0.1447072318279634,
 0.18022290764243387,
 0.23898636858304212,
 0.22205363340007378,
 0.19527045006276622,
 0.23005152131142925,
 0.1862697677760814,
 0.2145978497197164,
 0.18041670487882855,
 0.1726529986151144,
 0.1973959414095937,
 0.1884517852664651,
 0.19492841996733634,
 0.20185848860694466,
 0.24945097195758736,
 0.18038908708215262,
 0.1918498390888801,
 0.2706265058782814,
 0.25680405106351256,
 0.18121785291849335,
 0.17125819008143217,
 0.2854863379806324,
 0.18428814144969308,
 0.18009498277636138,
 0.1813773874254585,
 0.23515820592737957,
 0.1665302778494091,
 0.15737886822678615,
 0.1427352804696681,
 0.2013326006956455,
 0.16978860244584826,
 0.17985701059066428,
 0.16743126268387187,
 0.18469576568980306,
 0.17059193362425862,
 0.17864462980190732,
 0.20133277904904467,
 0.1832946087050139,
 0.22490720847589352,
 0.18296645389781918,
 0.20688299713499958,
 0.19989962324845312,
 0.1

In [47]:
df_embeddings['distance'] = distance

In [48]:
# Sort the values in ascending order to find closet one
df_embeddings.sort_values('distance', ascending=True)

Unnamed: 0,text,embeddings,distance
0,The 2024 ICC Men's T20 World Cup was the ninth...,"[-0.00999754574149847, -0.025412291288375854, ...",0.142260
49,"On 7 June 2024, Canada earned their first T20 ...","[-0.006749629508703947, -0.012705976143479347,...",0.142735
6,"In November 2021, the ICC announced that the 2...","[-0.005802159663289785, -0.029249241575598717,...",0.144707
2,England were the defending champions and were ...,"[-0.009014622308313847, -0.01883614808320999, ...",0.147622
48,"On 6 June 2024, hosts USA and former World Cup...","[-0.00797905120998621, -0.018593383952975273, ...",0.157379
...,...,...,...
74,"The knockout stage bracket is shown below, wit...","[-0.008113885298371315, -0.016345174983143806,...",0.252309
35,Umpires,"[-0.001412971643730998, -0.01937602087855339, ...",0.256804
33,Match referees,"[-0.01628486067056656, -0.002085704356431961, ...",0.270627
40,Teams are placed in alphabetical order.,"[-0.003911144100129604, -0.03168421611189842, ...",0.285486


In [49]:
# Save distance between embeddings to a 'distance.csv'
df_embeddings.to_csv("distance.csv")

### Get the Shortest distance

In [50]:
df_embeddings.iloc[0]["distance"]

0.1422600338175951

In [51]:
current_distance = df_embeddings['distance'].min()

In [52]:
current_distance

0.1422600338175951

In [53]:
# Check the current shortest distance and corresponding key
current_shortest = df_embeddings.iloc[0]["distance"]
current_shortest_index = 0

current_shortest, current_shortest_index

(0.1422600338175951, 0)

In [54]:
for index, distance in enumerate(df_embeddings["distance"].values):
    if distance < current_shortest:
        current_shortest = distance
        current_shortest_index = index

current_shortest, current_shortest_index

(0.1422600338175951, 0)

In [55]:
# Check the best answer provided
df_embeddings.iloc[current_shortest_index]["text"]

"The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States."

### Observation:
#### **The first choice provided doesn't show correct answer . Though info provided is relavant**

In [56]:
df_embeddings.iloc[2]["text"]

'England were the defending champions and were beaten in the semi-finals by India, who went on to win their second T20 World Cup title, defeating South Africa by 7 runs and equalling England and West Indies with the most titles in the T20 World Cup.'

In [57]:
df_embeddings.iloc[10]["text"]

'The ICC hosted a "trophy tour" before the tournament which began on 19 March in New York and the trophy was taken to various locations around the world. Former cricketers Yuvraj Singh, Chris Gayle, and Shahid Afridi as well as Olympian athlete Usain Bolt were named as the ambassadors of the tournament.'

### Observation : **The Third index has provided correct answer . Though is not crisp answer.**

This indicates that if we could improve training by providing proper prompt, the results will be exactly correct

In [58]:
df_embeddings.sort_values('distance', ascending=True)

Unnamed: 0,text,embeddings,distance
0,The 2024 ICC Men's T20 World Cup was the ninth...,"[-0.00999754574149847, -0.025412291288375854, ...",0.142260
49,"On 7 June 2024, Canada earned their first T20 ...","[-0.006749629508703947, -0.012705976143479347,...",0.142735
6,"In November 2021, the ICC announced that the 2...","[-0.005802159663289785, -0.029249241575598717,...",0.144707
2,England were the defending champions and were ...,"[-0.009014622308313847, -0.01883614808320999, ...",0.147622
48,"On 6 June 2024, hosts USA and former World Cup...","[-0.00797905120998621, -0.018593383952975273, ...",0.157379
...,...,...,...
74,"The knockout stage bracket is shown below, wit...","[-0.008113885298371315, -0.016345174983143806,...",0.252309
35,Umpires,"[-0.001412971643730998, -0.01937602087855339, ...",0.256804
33,Match referees,"[-0.01628486067056656, -0.002085704356431961, ...",0.270627
40,Teams are placed in alphabetical order.,"[-0.003911144100129604, -0.03168421611189842, ...",0.285486


In [59]:
## Save the distance to 'distance_sorted.csv' file
df_embeddings.sort_values('distance', ascending=True).to_csv("distances_sorted.csv")

### Providing better context to the prompt

#### Using Tiktokens

In [60]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m7.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tiktoken
Successfully installed tiktoken-0.7.0


In [61]:
import tiktoken

In [62]:
tokenizer = tiktoken.get_encoding('cl100k_base')
tokenizer

<Encoding 'cl100k_base'>

#### Let's give  and check the tokens consumed

In [63]:
prompt_template = """
Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "I don't know".

Context: {context}

---

Question: {question}
Answer:"""


In [64]:
question = "who has won T20 cricket world cup in 2024?"

In [65]:
print(prompt_template.format(context=df_embeddings.iloc[0]["text"], question=question))



Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "I don't know".

Context: The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.

---

Question: who has won T20 cricket world cup in 2024?
Answer:


In [66]:
# Max tokens allowed is 1000
max_token_count = 1000

In [67]:
tokenizer.encode(question)

[14965, 706, 2834, 350, 508, 37099, 1917, 10747, 304, 220, 2366, 19, 30]

In [68]:
# Find the length of the tokens
current_token_count = len(tokenizer.encode(prompt_template.format(context=df_embeddings.iloc[0]["text"], question=question)))
current_token_count


130

In [69]:
tokenizer.encode(prompt_template.format(context=df_embeddings.iloc[0]["text"], question=question))

[198,
 16533,
 279,
 3488,
 3196,
 389,
 279,
 2317,
 3770,
 13,
 1442,
 279,
 198,
 7998,
 4250,
 387,
 19089,
 1701,
 279,
 2038,
 3984,
 4320,
 198,
 4291,
 330,
 40,
 1541,
 956,
 1440,
 11690,
 2014,
 25,
 578,
 220,
 2366,
 19,
 59332,
 11258,
 596,
 350,
 508,
 4435,
 11098,
 574,
 279,
 43641,
 14002,
 315,
 279,
 59332,
 11258,
 596,
 350,
 508,
 4435,
 11098,
 13,
 1102,
 574,
 1080,
 39689,
 291,
 555,
 279,
 4410,
 85318,
 323,
 279,
 3723,
 4273,
 505,
 220,
 16,
 311,
 220,
 1682,
 5651,
 220,
 2366,
 19,
 26,
 279,
 16520,
 1694,
 21685,
 555,
 279,
 4410,
 85318,
 369,
 279,
 2132,
 892,
 11,
 1418,
 420,
 574,
 1101,
 279,
 1176,
 3682,
 59332,
 16520,
 311,
 4668,
 9248,
 6476,
 304,
 279,
 3723,
 4273,
 382,
 45464,
 14924,
 25,
 889,
 706,
 2834,
 350,
 508,
 37099,
 1917,
 10747,
 304,
 220,
 2366,
 19,
 5380,
 16533,
 25]

In [70]:
#Initialize the context
context = []

In [71]:
# this dataset is sorted specific to the question
import pandas as pd
df = pd.read_csv("distances_sorted.csv")
df

Unnamed: 0.1,Unnamed: 0,text,embeddings,distance
0,0,The 2024 ICC Men's T20 World Cup was the ninth...,[-0.00999755 -0.02541229 0.03493072 ... -0.00...,0.142260
1,49,"On 7 June 2024, Canada earned their first T20 ...",[-0.00674963 -0.01270598 0.02530439 ... 0.00...,0.142735
2,6,"In November 2021, the ICC announced that the 2...",[-0.00580216 -0.02924924 0.03433607 ... -0.00...,0.144707
3,2,England were the defending champions and were ...,[-0.00901462 -0.01883615 0.00646783 ... -0.00...,0.147622
4,48,"On 6 June 2024, hosts USA and former World Cup...",[-0.00797905 -0.01859338 0.03383277 ... 0.00...,0.157379
...,...,...,...,...
57,74,"The knockout stage bracket is shown below, wit...",[-0.00811389 -0.01634517 0.01380143 ... -0.00...,0.252309
58,35,Umpires,[-0.00141297 -0.01937602 0.01110016 ... -0.00...,0.256804
59,33,Match referees,[-0.01628486 -0.0020857 -0.01063392 ... -0.01...,0.270627
60,40,Teams are placed in alphabetical order.,[-0.00391114 -0.03168422 0.00738772 ... -0.00...,0.285486


#### Append the context for remaining tokens

In [72]:
for text in df["text"].values:
  text_token_count = len(tokenizer.encode(text))
  current_token_count += text_token_count
  context.append(text)
  print(f"Current token count: {current_token_count}")
  print(f"Text token count: {text_token_count}")
  print()

  if current_token_count < max_token_count:
    context.append(text)
  else:
    break

Current token count: 210
Text token count: 80

Current token count: 382
Text token count: 172

Current token count: 443
Text token count: 61

Current token count: 496
Text token count: 53

Current token count: 721
Text token count: 225

Current token count: 761
Text token count: 40

Current token count: 848
Text token count: 87

Current token count: 1078
Text token count: 230



In [73]:
context

["The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.",
 "The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.",
 "On 7 June 2024, Canada earned their first T20 World Cup victory by beating Ireland by 12 runs. While Afghanistan earned their first win against New Zealand in T20Is with Rahmanullah Gurbaz scoring 80 from 56 balls to ensure Afghanistan reach 159/6. In the second innings, Rashid Khan and Fazalhaq Farooqi picked up 4 w

In [74]:
# Format the context - Line spaces and paragraph spaces
print(prompt_template.format(context="\n\n##\n\n".join(context), question=question))


Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "I don't know".

Context: The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.

##

The 2024 ICC Men's T20 World Cup was the ninth edition of the ICC Men's T20 World Cup. It was co-hosted by the West Indies and the United States from 1 to 29 June 2024; the tournament being hosted by the West Indies for the second time, while this was also the first major ICC tournament to feature matches played in the United States.

##

On 7 June 2024, Canada earned their first T20 World Cup victory by beating Ireland by 12 runs. While Afghanistan earned their first win against New Zealand in T20Is 

## Custom Performance Demonstration

Demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

  **question = "who has won T20 cricket world cup in 2024?"**

In [75]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.36.0-py3-none-any.whl (328 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m328.7/328.7 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.6/75.6 kB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.9/77.9 kB[0m [31m4.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m58.3/58.3 kB[0m [31m5.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: h11, httpcore, httpx, openai
  Attempting uninstall: openai
    Found existing i

In [76]:
import openai

#### Let's check our prompt

In [77]:
openai.Completion.create(
    model="gpt-3.5-turbo-instruct", # Use a current model
    prompt = prompt_template.format(context="\n\n##\n\n".join(context),question=question))["choices"][0]["text"]


' India'

### Observation: ** India Won - Here is the exact answer *

### Question 2

In [78]:
question2 = 'Who has played in finals of T20 world cup in 2024?'

In [79]:
openai.Completion.create(
    model="gpt-3.5-turbo-instruct", # Use a current model
    prompt = prompt_template.format(context="\n\n##\n\n".join(context), question=question2))["choices"][0]["text"]


' India and South Africa.'

#### Observation:
  * Great. The trained model provides good results

#### Question3

In [95]:
question3 = 'Who has hosted 2024 T20 world ckicket matches?'

In [96]:
openai.Completion.create(
    model="gpt-3.5-turbo-instruct", # Use a current model
    prompt = prompt_template.format(context="\n\n##\n\n".join(context),question=question3))["choices"][0]["text"]


' The West Indies and the United States co-hosted the 2024 T20'

### Observation:
  With fine tuning by using custom data , the performance of LLM has improved drastically, though there is still some scope for further improvement

####  *** Thanks a lot ***