In [1]:
import pandas as pd
df = pd.read_csv("test.csv", encoding='unicode_escape')

In [2]:
questions_list = df['Questions'].tolist()

In [3]:
answers_list = df['Answers'].tolist()

## Importing Libraries

- **nltk**: The Natural Language Toolkit, a library for natural language processing (NLP). Here it provides tokenization, lemmatization, stemming, and the stopword list.
- **numpy**: A library for numerical computing. Here it is used to find the highest similarity score and its position.
- **TfidfVectorizer from sklearn**: Converts a collection of texts into a matrix of TF-IDF weights, where each weight reflects how important a word is to one text relative to the whole collection.
- **cosine_similarity from sklearn**: Measures how similar two texts are by comparing the angle between their numerical vectors.
- `nltk.download('punkt')`: Downloads the Punkt tokenizer models, which `nltk.word_tokenize` relies on to split text into sentences and words.
- `nltk.download('wordnet')`: Downloads WordNet, a lexical database of English that `WordNetLemmatizer` uses to look up the base forms of words.
- `nltk.download('stopwords')`: Downloads a list of common words (like "the", "and", "is") that are usually ignored in text analysis because they don't carry much meaning.
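
To make the idea behind `cosine_similarity` concrete, here is a minimal pure-Python sketch of the same measure on two small vectors. The helper name `cosine` and the example vectors are illustrative assumptions; this is not scikit-learn's implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors
no_overlap = cosine([1, 0, 0], [0, 3, 0])      # no shared non-zero dimension
print(same_direction, no_overlap)
```

Vectors that point the same way score close to 1 regardless of length, and vectors with no dimension in common score 0, which is why the measure suits comparing texts of different lengths.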


In [5]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\14807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\14807\AppData\Roaming\nltk_data...
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\14807\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True


- **WordNetLemmatizer** and **PorterStemmer** are tools from the **NLTK** library for normalizing words.
- `WordNetLemmatizer` reduces a word to its dictionary form (lemma). Without a part-of-speech argument it treats every word as a noun, so "cars" becomes "car" but "running" is left unchanged.
- `PorterStemmer` reduces a word to a stem by stripping suffixes with fixed rules, so "learning" becomes "learn" and "machine" becomes "machin". Stems are not always real words.

- **stopwords** is a list of common words (like "the", "is", "in") that are often removed in text processing.
- **re** is Python's standard module for regular expressions, which lets us search and manipulate strings. Here it is used to strip punctuation.
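
The punctuation and stopword steps can be sketched with the standard library alone. This is only an illustration: `TINY_STOPWORDS` and `strip_and_filter` are hypothetical names, the stopword set is deliberately tiny compared with NLTK's, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TINY_STOPWORDS = {"the", "is", "in", "of", "what"}

def strip_and_filter(text):
    """Remove punctuation, lowercase, split on whitespace, and drop stopwords."""
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    tokens = text.lower().split()        # simple split, not nltk.word_tokenize
    return [t for t in tokens if t not in TINY_STOPWORDS]

tokens = strip_and_filter("What is the role of machine learning in data analytics?")
print(tokens)
```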





In [8]:
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import re

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
STOP_WORDS = set(stopwords.words('english'))  # Build the stopword set once, not once per token.

def preprocess(text, remove_stopwords=True):
    # Assumption: `preprocess` is used below but is not defined in the cells shown.
    # It is reconstructed here as the stopword-removing variant of the same pipeline.
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation (anything that is neither a word character nor whitespace).
    tokens = nltk.word_tokenize(text.lower())  # Lowercase the text and split it into tokens.
    if remove_stopwords:
        tokens = [token for token in tokens if token not in STOP_WORDS]  # Drop common words that carry little meaning.
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Noun lemmas by default, e.g. "cars" -> "car".
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]  # Strip suffixes, e.g. "learning" -> "learn".
    return ' '.join(stemmed_tokens)  # Join the tokens back into one space-separated string.

def preprocess_with_stopwords(text):
    # Same pipeline with stopwords kept, which matches the printed output further down (e.g. "what is machin learn").
    return preprocess(text, remove_stopwords=False)

**TfidfVectorizer:** This is a tool that converts text data into numbers. It uses TF-IDF (Term Frequency-Inverse Document Frequency), which gives a higher weight to words that are frequent in one text but rare across the whole collection.

**tokenizer=nltk.word_tokenize:** This tells the vectorizer to use the `nltk.word_tokenize` function to break the text into individual words.

- `[preprocess(q) for q in questions_list]`: This builds a new list in which every question in `questions_list` has been cleaned and normalized by the `preprocess` function.
- `vectorizer.fit_transform(...)`: This does two things:
  - **fit**: It learns the vocabulary and the inverse document frequency of each word from the processed questions.
  - **transform**: It converts the processed questions into a numerical form (a matrix with one row per question and one column per vocabulary word).
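
To show what fit and transform compute, here is a small pure-Python sketch of TF-IDF. It follows the smoothed IDF formula, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as its defaults, but it is not scikit-learn's code. The function name `tfidf_fit_transform` and the three sample documents are assumptions made for illustration, and a whitespace split replaces `nltk.word_tokenize`.

```python
import math
from collections import Counter

def tfidf_fit_transform(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

vocab, rows = tfidf_fit_transform(["machin learn data", "data clean", "data visual"])
print(vocab)
print(rows[0])
```

In the first row, "machin" gets a larger weight than "data" because "data" occurs in every document and therefore says little about any one of them.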


In [9]:
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
X = vectorizer.fit_transform([preprocess(q) for q in questions_list])



In [10]:
def get_response(text):
    # Clean and normalize the input text with preprocess_with_stopwords.
    processed_text = preprocess_with_stopwords(text)
    print("processed_text:", processed_text)

    # Turn the processed text into a TF-IDF vector using the vocabulary the vectorizer has already learned.
    vectorized_text = vectorizer.transform([processed_text])

    # Compute the cosine similarity between the input vector and every question vector stored in X.
    similarities = cosine_similarity(vectorized_text, X)
    print("similarities:", similarities)

    # Take the highest similarity score.
    max_similarity = np.max(similarities)
    print("max_similarity:", max_similarity)
    if max_similarity > 0.6:
        # Collect the indices of the questions whose similarity score is greater than 0.6.
        # Indices avoid questions_list.index(q), which returns only the first match when a question is duplicated.
        high_indices = [i for i, s in enumerate(similarities[0]) if s > 0.6]
        high_similarity_questions = [questions_list[i] for i in high_indices]
        print("high_similarity_questions:", high_similarity_questions)

        # Look up the answers that correspond to those questions.
        target_answers = [answers_list[i] for i in high_indices]
        print(target_answers)

        # Fit a separate vectorizer on the candidate questions only. Calling fit_transform on the global
        # vectorizer here would overwrite its vocabulary, so it would no longer match X on later calls.
        candidate_vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
        Z = candidate_vectorizer.fit_transform([preprocess_with_stopwords(q) for q in high_similarity_questions])

        # The input was already processed above, so reuse that result.
        processed_text_with_stopwords = processed_text
        print("processed_text_with_stopwords:", processed_text_with_stopwords)

        # Vectorize the input with the vocabulary of the candidate questions.
        vectorized_text_with_stopwords = candidate_vectorizer.transform([processed_text_with_stopwords])

        # Compare the input vector to the candidate question vectors.
        final_similarities = cosine_similarity(vectorized_text_with_stopwords, Z)
        # Find the index of the candidate with the highest similarity score.
        closest = np.argmax(final_similarities)
        return target_answers[closest]
    else:
        return "Unable to answer this question."
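
Stripped of the printing and the second vectorization pass, the decision rule in `get_response` is a threshold followed by picking the best candidate. The sketch below shows only that rule on plain lists. `pick_answer` and its inputs are hypothetical, and because it skips the re-ranking step it is a simplification, not an equivalent of the function above.

```python
def pick_answer(similarities, answers, threshold=0.6):
    """Return the answer with the best score above threshold, or None if there is none."""
    candidates = [i for i, s in enumerate(similarities) if s > threshold]
    if not candidates:
        return None  # corresponds to the "Unable to answer" branch
    best = max(candidates, key=lambda i: similarities[i])
    return answers[best]

answered = pick_answer([0.0, 0.78, 0.65], ["a", "b", "c"])
unanswered = pick_answer([0.1, 0.2], ["a", "b"])
print(answered, unanswered)
```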

In [11]:
get_response('Who is tom brady?')

processed_text: who is tom bradi
similarities: [[0. 0. 0. 0. 0. 0. 0. 0.]]
max_similarity: 0.0


'Unable to answer this question.'
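
One likely explanation for the all-zero similarity row is that none of the input's tokens occur in the vocabulary learned from the questions: `transform` ignores words it did not see during fitting, so the input becomes a zero vector. The sketch below imitates that with simple counts. The vocabulary shown is a made-up example, not the one fitted from `test.csv`, and `count_vector` is a hypothetical helper.

```python
def count_vector(text, vocab):
    """Count how often each vocabulary term occurs in text; unseen words are ignored."""
    tokens = text.split()
    return [tokens.count(term) for term in vocab]

vocab = ["analyt", "data", "learn", "machin"]  # hypothetical fitted vocabulary
seen = count_vector("what is machin learn", vocab)
unseen = count_vector("who is tom bradi", vocab)
print(seen, unseen)
```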

In [12]:
get_response('what is machine learning')

processed_text: what is machin learn
similarities: [[0.         0.         0.         0.         0.77627227 0.
  0.         0.        ]]
max_similarity: 0.7762722680124386
high_similarity_questions: ['What is the role of machine learning in data analytics?']
['Machine learning plays a crucial role in data analytics by enabling the development of algorithms that can automatically learn from data and make predictions or take actions without being explicitly programmed. It is used for tasks such as classification, regression, clustering, and anomaly detection.']
processed_text_with_stopwords: what is machin learn




'Machine learning plays a crucial role in data analytics by enabling the development of algorithms that can automatically learn from data and make predictions or take actions without being explicitly programmed. It is used for tasks such as classification, regression, clustering, and anomaly detection.'

In [19]:
from autocorrect import Speller

# Create a spell checker object
spell = Speller()

# Define the text to be checked
text = 'What is Data Anlytics'

# Correct the text
corrected_text = spell(text)

print("Original text:", text)
print("Corrected text:", corrected_text)


Original text: What is Data Anlytics
Corrected text: What is Data Analytics
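
The notebook does not show how the spell checker is connected to `get_response`. One possible arrangement, sketched here as an assumption, is to correct the raw input first and pass the result on. `respond_with_correction` is a hypothetical helper, and the two stand-in functions exist only so that the sketch runs on its own; in the notebook they would be `spell` and `get_response`.

```python
def respond_with_correction(text, correct, respond):
    """Run `correct` on the raw text, then pass the corrected text to `respond`."""
    corrected = correct(text)
    return respond(corrected)

# Stand-ins for the Speller instance and get_response (assumptions for illustration).
def fake_correct(s):
    return s.replace("Anlytics", "Analytics")

def fake_respond(s):
    return f"answering: {s}"

result = respond_with_correction("What is Data Anlytics", fake_correct, fake_respond)
print(result)
```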
