*Final Project for Abanti Ghosh enrolled in April'23 Batch of Artificial Intelligence*

*Assigned: Chatbot using TF-IDF and Cosine Similarity*

# ChattyBetty

This is a chatbot named ChattyBetty. ChattyBetty employs TF-IDF and Cosine Similarity to process user input.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection of documents (the corpus). A word's weight increases in proportion to the number of times it appears in the document, but is offset by how frequently the word occurs across the corpus (data-set).

**Term Frequency:** The term frequency is the number of times a given word t occurs in document d. A word becomes more relevant to a document the more often it appears in it. Since the ordering of terms is not significant, each document can be described as a vector in the bag-of-words model: for each distinct term in the document, there is an entry whose value is the term frequency.

The weight of a term that occurs in a document is simply proportional to the term frequency.

        tf(t,d) = count of t in d / number of words in d
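
As a minimal sketch of the formula above, the snippet below computes tf(t, d) for every term of one document. The helper name `term_frequency` and the lowercase, whitespace-based tokenization are assumptions made for illustration; they are not part of ChattyBetty itself.

```python
from collections import Counter


def term_frequency(document):
    """Return {term: tf(t, d)} for a whitespace-tokenized document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # tf(t, d) = count of t in d / number of words in d
    return {term: count / total for term, count in counts.items()}


tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```

Because every count is divided by the document length, the values for one document always sum to 1.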

**Document Frequency:** This measures how common a term is across the whole corpus. It is similar to TF, except that TF counts the occurrences of a term t within a single document d, while DF counts how many of the N documents in the corpus contain t. In other words, DF is the number of documents in which the word is present.

        df(t) = number of documents containing t

**Inverse Document Frequency:** This measures how informative a word is. The aim of a search is to locate the documents that best match the query. Because tf treats all terms as equally significant, term frequency alone is not enough to weigh a term in a document: very common words would score highly everywhere. First, we find the document frequency of a term t by counting the number of documents containing the term:

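        df(t) = N(t)
        idf(t) = N / df(t) = N / N(t)

        where
        N      = Total number of documents in the corpus
        df(t)  = Document frequency of a term t
        N(t)   = Number of documents containing the term t

In practice the logarithm of this ratio is normally used, idf(t) = log(N / df(t)), so that very rare terms do not dominate the score.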
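
The following sketch computes df(t) and the ratio N / df(t) for a toy corpus, with an optional logarithm as a common damping choice. The function names, the `use_log` flag, and the decision to return 0.0 for a term that appears in no document are all assumptions made for this illustration.

```python
import math


def document_frequency(term, documents):
    """Count the documents that contain the term at least once."""
    return sum(1 for doc in documents if term in doc.lower().split())


def inverse_document_frequency(term, documents, use_log=True):
    """Return N / df(t), or its natural log when use_log is True."""
    n_docs = len(documents)
    df = document_frequency(term, documents)
    if df == 0:
        return 0.0  # assumption: a term seen nowhere gets zero weight
    ratio = n_docs / df
    return math.log(ratio) if use_log else ratio


docs = ["the cat sat", "the dog barked", "a bird sang"]
print(document_frequency("the", docs))                         # in 2 of 3 documents
print(inverse_document_frequency("bird", docs, use_log=False)) # plain ratio N / df
print(inverse_document_frequency("the", docs))                 # log-damped ratio
```

A term found in only one document ("bird") gets a larger idf than a term shared by several documents ("the").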

Putting it together:

        tf-idf(t, d) = tf(t, d) * idf(t)
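
To make the combined weight concrete, here is a small pure-Python sketch that multiplies the length-normalized term frequency by a log-scaled idf. The `tf_idf` helper and the use of the natural logarithm are assumptions for illustration. scikit-learn's TfidfVectorizer, by default, uses raw counts for tf, a smoothed idf, and L2-normalizes each row, so its numbers will differ from this sketch.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # df(t): number of documents containing t (set() counts each document once)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the bird sang"]
weights = tf_idf(docs)
print(weights[0])  # "the" occurs in every document, so its weight is zero
```

A word that occurs in every document carries no discriminating information, which is why its weight collapses to zero under the log-scaled idf.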

Cosine similarity is a metric for determining how similar two data objects are, irrespective of their size. Each data object in a dataset is treated as a vector, so the similarity between two sentences can be measured in Python by comparing their vectors. The formula for the cosine similarity between two vectors is:

        cos(x, y) = (x . y) / (||x|| * ||y||)

        where

        x . y = dot product of the vectors ‘x’ and ‘y’.
        ||x|| and ||y|| = lengths (Euclidean norms) of the two vectors ‘x’ and ‘y’.
        ||x|| * ||y|| = ordinary product of the two lengths.
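
The formula can be checked by hand with a few lines of Python. This is only a sketch: the name `cosine_similarity_vec` is hypothetical (chosen to avoid clashing with scikit-learn's `cosine_similarity`), and returning 0.0 for a zero-length vector is an assumption, since the formula is undefined in that case.

```python
import math


def cosine_similarity_vec(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: treat an all-zero vector as matching nothing
    return dot / (norm_x * norm_y)


print(cosine_similarity_vec([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine_similarity_vec([1, 0], [0, 1]))        # orthogonal vectors
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why the metric suits comparing a short query with longer sentences.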

        


ChattyBetty uses the TfidfVectorizer class from the scikit-learn library to compute the document vectors. The cosine_similarity function calculates the similarity scores between the user query vector and each document vector. The corpus sentence with the highest similarity score is then selected and returned as ChattyBetty's reply.


---






In [None]:
# import all necessary libraries

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize
from urllib.request import urlopen
import nltk
import numpy as np

nltk.download('punkt')

#decoding an online text document
#this document contains chat messages that the bot will be trained on

file_url="https://github.com/abanti1/Text_Corpus/raw/main/Text%20Corpus%202.txt"
text_file=urlopen(file_url).read().decode('utf-16le')

#separating the file into individual tokens or sentences

corpus = sent_tokenize(text_file)

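# initialize the TF-IDF vectorizer and vectorize the corpus once,
# so the corpus is not re-transformed on every user message

vec = TfidfVectorizer()
corpus_vectors = vec.fit_transform(corpus)

# this function processes user input and generates a corresponding output

def bettys_response(input_text):

    # vectorize the user input with the fitted vocabulary

    input_vector = vec.transform([input_text])

    # compute cosine similarity between the input and every corpus sentence

    similarities = cosine_similarity(input_vector, corpus_vectors)

    # find index of the corpus sentence with the highest similarity score

    most_similar_index = np.argmax(similarities)

    # a score of 0 means the input shares no known words with the corpus

    if similarities[0, most_similar_index] == 0:
        return "Sorry, I don't understand."

    # return the corresponding output

    return corpus[most_similar_index]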

# introductory message

print("Hello, this is ChattyBetty!")

# interaction with the bot

flag = True

while flag:
    user_input = input("You: ")

    # 'bye' ends the conversation after ChattyBetty's final reply

    if user_input.lower() == 'bye':
        flag = False

    response = bettys_response(user_input)
    print("ChattyBetty:", response)