# Chatbot using Natural Language Processing on Custom Data

This notebook is based on a tutorial from the Intellipaat YouTube channel.

## 1. Importing Necessary Libraries
- numpy: Used for numerical computations (not directly used in the chatbot code).
- nltk: The Natural Language Toolkit, essential for NLP tasks like tokenization and lemmatization.
- string: Used for string manipulations, like removing punctuation.
- random: Picks a random reply for greeting messages.

In [1]:
import numpy as np
import nltk
import string
import random

## 2. Loading and Preprocessing the Dataset

In [2]:
with open('AI_Dataset.txt', 'r', errors = 'ignore') as f: # the with block closes the file automatically
    raw_doc = f.read()

## 3. Tokenization

In [3]:
raw_doc = raw_doc.lower() # Converting entire text to lowercase
nltk.download('punkt') # Downloading the Punkt tokenizer models
nltk.download('wordnet') # Downloading the WordNet dictionary used by the lemmatizer
nltk.download('omw-1.4') # Downloading the Open Multilingual WordNet data

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Rishuraj\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rishuraj\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Rishuraj\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [4]:
raw_doc



In [5]:
sentence_tokens = nltk.sent_tokenize(raw_doc)
word_tokens = nltk.word_tokenize(raw_doc)

- Sentence Tokenization: Splits the text into individual sentences using sent_tokenize.
- Word Tokenization: Splits the text into individual words using word_tokenize.
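
To illustrate the difference between the two levels of tokenization, here is a minimal sketch that uses only the standard library. The regular expressions are simplified stand-ins chosen for illustration, not the Punkt model that sent_tokenize actually uses, and the helper names `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical.

```python
import re

def simple_sent_tokenize(text):
    # Split after '.', '!' or '?' when followed by whitespace (a rough approximation).
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def simple_word_tokenize(text):
    # Runs of word characters, or single punctuation marks, become tokens.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "ai is a field of research. it studies intelligent agents!"
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sample)
print(sentences)   # ['ai is a field of research.', 'it studies intelligent agents!']
print(words[:7])   # ['ai', 'is', 'a', 'field', 'of', 'research', '.']
```

Sentence tokens are what the bot later returns as answers, so a poorly split sentence (such as the page-navigation text in the first token of the output) becomes a poor answer.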

In [6]:
sentence_tokens[:5]

['\n\nwikipediathe free encyclopedia\nsearch wikipedia\nsearch\ndonate\ncreate account\nlog in\n\ncontents hide\n(top)\ngoals\n\ntechniques\n\napplications\n\nethics\n\nhistory\nphilosophy\n\nfuture\n\nin fiction\nsee also\nexplanatory notes\nreferences\n\nfurther reading\nexternal links\nartificial intelligence\n\narticle\ntalk\nread\nview source\nview history\n\ntools\nappearance hide\ntext\n\nsmall\n\nstandard\n\nlarge\nwidth\n\nstandard\n\nwide\ncolor (beta)\n\nautomatic\n\nlight\n\ndark\npage semi-protected\nfrom wikipedia, the free encyclopedia\n"ai" redirects here.',
 'for other uses, see ai (disambiguation), artificial intelligence (disambiguation), and intelligent agent.',
 'part of a series on\nartificial intelligence\n\nmajor goals\napproaches\napplications\nphilosophy\nhistory\nglossary\nvte\nartificial intelligence (ai), in its broadest sense, is intelligence exhibited by machines, particularly computer systems.',
 'it is a field of research in computer science that develo

In [7]:
word_tokens[:5]

['wikipediathe', 'free', 'encyclopedia', 'search', 'wikipedia']

## 4. Performing Text-Preprocessing Steps - Lemmatization
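- Lemmatizer: Converts words to their base (dictionary) form. WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so here it maps "networks" → "network" but leaves "running" unchanged.
- Remove Punctuation: Strips punctuation from the text.
- Normalization: Combines lowercasing, punctuation removal, tokenization, and lemmatization in a single function (LemNormalize).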
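
The punctuation-removal step relies on `str.translate` with a table that maps each punctuation character to `None`. A minimal sketch of that idea, using only the standard library: the function name `normalize` is hypothetical, whitespace splitting stands in for nltk.word_tokenize, and lemmatization is left out.

```python
import string

# Translation table mapping every punctuation code point to None (i.e. delete it).
remove_punct = {ord(ch): None for ch in string.punctuation}

def normalize(text):
    # Lowercase, strip punctuation, then split on whitespace.
    # Whitespace splitting is a stand-in for nltk.word_tokenize; no lemmatization here.
    return text.lower().translate(remove_punct).split()

tokens = normalize("What is Local Search, exactly?")
print(tokens)  # ['what', 'is', 'local', 'search', 'exactly']
```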

In [18]:
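lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    # Lemmatize each token (treated as a noun by default)
    return [lemmer.lemmatize(token) for token in tokens]
# Translation table that maps every punctuation character to None (deletes it)
remove_punc_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    # Lowercase, strip punctuation, tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punc_dict)))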

## 5. Greeting Function 
Purpose: Responds to greetings like "hi" or "hello" with random friendly replies.

In [19]:
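greeting_inputs = ('hello', 'hi', 'whassup', 'how are you?')
greeting_responses = ('hi', 'Hey', 'Hey There!', 'There there!!')
def greet(sentence):
    # Check the whole input first so a multi-word greeting like 'how are you?' can match
    if sentence.lower().strip() in greeting_inputs:
        return random.choice(greeting_responses)
    # Otherwise look for a single greeting word anywhere in the input
    for word in sentence.split():
        if word.lower() in greeting_inputs:
            return random.choice(greeting_responses)
    return None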

## 6. Response Generation by the BOT
- TF-IDF Vectorizer: Converts text into numerical vectors based on Term Frequency-Inverse Document Frequency.
- Cosine Similarity: Measures the similarity between user input and dataset sentences.

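Response Logic:
- The user's input is appended to sentence_tokens before response is called, so the last row of the TF-IDF matrix (tfidf[-1]) is the query.
- The highest similarity is always the query compared with itself, so the bot uses the second-highest score.
- If that score is zero (req_tfidf == 0), no sentence shares a term with the input and the bot apologizes.
- Otherwise, it returns the most similar sentence from the dataset.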
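
To make the two ideas above concrete, here is a small pure-Python sketch of TF-IDF weighting and cosine similarity. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as the TfidfVectorizer defaults; the function names and the toy documents are made up for illustration, and this is not the library's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})  # scale to unit length
    return vectors

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())

docs = [
    ["local", "search", "uses", "optimization"],
    ["neural", "networks", "learn", "patterns"],
    ["local", "search"],  # the user's query, appended last
]
vecs = tfidf_vectors(docs)
scores = [cosine(vecs[-1], v) for v in vecs]
print([round(s, 3) for s in scores])
```

The query scores 1.0 against itself, 0.0 against the sentence with no shared terms, and something in between against the sentence that mentions "local" and "search".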

In [20]:
# TF-IDF vectorizer and cosine similarity, used to match user input against the dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
def response(user_response):
    # Assumes user_response was already appended to sentence_tokens,
    # so the last row of the TF-IDF matrix is the query.
    TfidfVec = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english')
    tfidf = TfidfVec.fit_transform(sentence_tokens)
    vals = cosine_similarity(tfidf[-1], tfidf) # similarity of the query to every sentence
    idx = vals.argsort()[0][-2] # second-highest score; the highest is the query itself
    flat = vals.flatten()
    flat.sort()
    req_tfidf = flat[-2] # similarity score of that best match
    if req_tfidf == 0:
        # No sentence shares any term with the query
        return "I am sorry. Unable to understand you!"
    return sentence_tokens[idx]
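
The `argsort()[0][-2]` step picks the second-highest score because the query always matches itself best. An alternative way to express the same intent is to exclude the query by position instead of by rank. The sketch below is a hypothetical helper (`best_match` is not part of the notebook) and assumes the query's score is the last entry and that there is at least one other sentence.

```python
def best_match(scores):
    # scores[i] is the similarity between the query and sentence i;
    # the query itself is assumed to be the last entry.
    candidates = range(len(scores) - 1)  # skip the query itself
    idx = max(candidates, key=lambda i: scores[i])
    if scores[idx] == 0:
        return None  # nothing in the corpus shares a term with the query
    return idx

print(best_match([0.61, 0.0, 1.0]))  # 0
print(best_match([0.0, 0.0, 1.0]))   # None
```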

## 7. Defining the Chatflow
- Greeting: Starts with an introductory message.
- Conversation Loop:
  - If the input is "bye", the bot says goodbye and ends the chat.
  - If the input is "thank you" or "thanks", the bot replies politely and also ends the chat.
  - For other inputs:
     - If it's a greeting, responds with a random greeting.
     - Otherwise, processes the input through the response function.
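
The branching described above can be sketched as a single function that handles one user message, which makes the flow testable without calling `input()`. The names `handle_turn`, `fake_greet`, and `fake_respond` are hypothetical; the last two are stand-ins for the notebook's greet and response functions. Unlike the notebook's loop, this sketch also strips surrounding whitespace from the input.

```python
def handle_turn(user_input, greet_fn, respond_fn):
    # Returns (reply, keep_going) for one user message.
    text = user_input.lower().strip()
    if text == 'bye':
        return 'Goodbye!', False
    if text in ('thank you', 'thanks'):
        return 'You are welcome.', False
    greeting = greet_fn(text)
    if greeting is not None:
        return greeting, True
    return respond_fn(text), True

# Hypothetical stand-ins for the notebook's greet and response functions.
fake_greet = lambda s: 'hi' if s in ('hi', 'hello') else None
fake_respond = lambda s: 'echo: ' + s

print(handle_turn('Hi', fake_greet, fake_respond))          # ('hi', True)
print(handle_turn('thanks', fake_greet, fake_respond))      # ('You are welcome.', False)
print(handle_turn('what is ai', fake_greet, fake_respond))  # ('echo: what is ai', True)
print(handle_turn('bye', fake_greet, fake_respond))         # ('Goodbye!', False)
```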

In [24]:
flag = True
print("Hello! I am the Retrieval Learning Bot. Start typing your text after greeting to talk to me. For ending conversation type bye!")
while flag:
    user_response = input()
    user_response = user_response.lower()
    if user_response != 'bye':
        if user_response == 'thank you' or user_response == 'thanks':
            flag = False
            print('Bot: You are welcome.')
        else:
            greeting = greet(user_response) # call once so the check and the reply use the same result
            if greeting is not None:
                print('Bot:' + greeting)
            else:
                # Temporarily add the query so response() can vectorize it with the dataset
                sentence_tokens.append(user_response)
                print('Bot: ', end = '')
                print(response(user_response))
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print('Bot: Goodbye!')

Hello! I am the Retrieval Learning Bot. Start typing your text after greeting to talk to me. For ending conversation type bye!


 Hi

Bot:hi

 What is General intelligence

Bot: 

artificial general intelligence.

 Can you tell me about turing test

Bot: "ai is closer than ever to passing the turing test for 'intelligence'.

 What is Local search

Bot: [68] there are two very different kinds of search used in ai: state space search and local search.

 Artificial neural networks

Bot: neural networks.

 Tell me more about Artificial neural networks

Bot: neural networks.

 Full form of GPT

Bot: [j]

gpt
generative pre-trained transformers (gpt) are large language models (llms) that generate text based on the semantic relationships between words in sentences.

 Applications of artificial intelligence

Bot: "artificial intelligence".

 bye

Bot: Goodbye!


## Workflow Summary
- The bot preprocesses the custom dataset using tokenization, normalization, and TF-IDF vectorization.
- User input is matched with the dataset using cosine similarity.
- The bot either returns the most relevant response or apologizes for not understanding.