<center><h1>ChatBot</h1></center>

<p>In the first session, we mentioned chatbots as one of the main real-world applications 
of natural language processing. We now know enough to build a basic chatbot 
that can be "trained" on a predefined corpus and answer queries 
using similarity measures. In this lab, you are asked to develop a chatbot using 
vectorization and cosine similarity.</p>
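<p>To make the idea concrete, here is a minimal sketch of cosine similarity between bag-of-words vectors, written in plain Python. The whitespace tokenizer and the three example sentences are illustrative assumptions, not part of the lab material.</p>

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

query = bag_of_words("does it fit nook glowlight")
close = bag_of_words("will it fit the nook color")
far = bag_of_words("how long is the power cable")

print(cosine(query, close))  # three shared tokens, so the score is well above zero
print(cosine(query, far))    # no shared token, so the score is 0.0
```

<p>The score depends only on the angle between the two vectors, so a long question and a short one can still match closely if they use the same words in similar proportions.</p>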
<p>The most important requirement for building a chatbot concerns the corpus, that is, the text data 
on which the chatbot will be trained. The corpus must be relevant and comprehensive. If you are building a 
chatbot for your organization's human resources (HR) department, you will 
generally need a corpus containing all HR policies to train the chatbot, not a 
corpus of presidential speeches. You must also make sure that the response time 
is acceptable and that the bot does not take too long to reply. Ideally, the chatbot should 
also sound like a human and reach an acceptable accuracy rate.</p>
<p>For the chatbot you are going to develop, you will use a database of questions 
and answers collected from the Amazon website for various product categories <b>(http://jmcauley.ucsd.edu/data/amazon/qa/)</b>. At first, you will focus only 
on the data for electronic products, in json format <b>(http://jmcauley.ucsd.edu/data/amazon/qa/qa_Electronics.json.gz)</b>.</p>
<p>Extract of the first five lines of the json file:</p>
<ul><li>{'questionType': 'yes/no', 'asin': '0594033926', 'answerTime': 'Dec 27, 2013', 'unixTime': 1388131200, 
'question': 'Is this cover the one that fits the old nook color? Which I believe is 8x5.', 'answerType': 'Y', 
'answer': 'Yes this fits both the nook color and the same-shaped nook tablet'}</li><li>{'questionType': 'yes/no', 'asin': '0594033926', 'answerTime': 'Jan 5, 2015', 'unixTime': 1420444800, 
'question': 'Does it fit Nook GlowLight?', 'answerType': 'N', 'answer': 'No. The nook color or color 
    tablet'}</li><li>{'answer': "I don't think so. The nook color is 5 x 8 so not sure anything smaller would stay locked in, 
but would be close.", 'asin': '0594033926', 'answerTime': '2 days ago', 'question': 'Would it fit Nook 1st 
    Edition? 4.9in x 7.7in ?', 'questionType': 'open-ended'}</li>
<li>{'questionType': 'yes/no', 'asin': '0594033926', 'answerTime': '17 days ago', 'question': "Will this fit a 
    Nook Color that's 5 x 8?", 'answerType': 'Y', 'answer': 'yes'}</li><li>{'questionType': 'yes/no', 'asin': '0594033926', 'answerTime': 'Feb 10, 2015', 'unixTime': 1423555200, 
'question': 'will this fit the Samsung Galaxy Tab 4 Nook 10.1', 'answerType': 'N', 'answer': "No, the tab 
    is smaller than the 'color'"}</li></ul>
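<p>The records above use Python literal syntax (single-quoted strings) rather than strict JSON, so a standard JSON parser would reject them. As a small sketch, one record can be parsed safely with <code>ast.literal_eval</code>. The line below is copied from the extract, with the wrapped whitespace inside the answer collapsed.</p>

```python
import ast

# One record from the extract above (Python dict literal, single quotes).
line = ("{'questionType': 'yes/no', 'asin': '0594033926', "
        "'answerTime': 'Jan 5, 2015', 'unixTime': 1420444800, "
        "'question': 'Does it fit Nook GlowLight?', 'answerType': 'N', "
        "'answer': 'No. The nook color or color tablet'}")

# literal_eval only accepts literals (dicts, strings, numbers, ...);
# unlike eval, it refuses function calls and other executable code.
record = ast.literal_eval(line)

print(record['question'])
print(record.get('unixTime'))  # some records omit this key, so .get() avoids a KeyError
```

<p>Using <code>.get()</code> matters here because the third and fourth records in the extract have no <code>unixTime</code> field.</p>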
<p>As we can see, each line of data is formatted as a dictionary with various 
key-value pairs.</p><p>The main steps for designing the chatbot are the numbered sections below, subject to the following requirements:</p>

<p>The program must be interactive, that is, it must let the English-speaking human user and the 
chatbot exchange messages continuously until the user explicitly asks to stop.</p>
<p>The program must offer two types of vectorization: bag of words and tf-idf.</p>
<p>To test your program, draw up a list of 6 questions, 3 of which are taken from the 
corpus of questions. The remaining questions must be similar to corpus questions without being 
exactly identical.</p> 
<p>Test your program with bag of words and with tf-idf. Is there a significant difference in the 
accuracy of the answers between these two vector representations?</p>
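<p>As a rough illustration of the comparison asked for above, the sketch below scores one query against a three-sentence toy corpus with both representations. The corpus and the query are made up for the example, and scikit-learn's default tokenizer is used rather than the custom one defined later in this notebook.</p>

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus and query (illustrative assumptions, not taken from the data file).
corpus = [
    "will this fit the nook color",
    "will this work with the power adapter",
    "does this have a flip stand",
]
query = ["will this fit the nook glowlight"]

best = {}
for name, vectorizer in [("bag of words", CountVectorizer()),
                         ("tf-idf", TfidfVectorizer())]:
    corpus_matrix = vectorizer.fit_transform(corpus)   # learn the vocabulary on the corpus
    query_vector = vectorizer.transform(query)         # project the query onto it
    scores = cosine_similarity(query_vector, corpus_matrix)[0]
    best[name] = int(scores.argmax())
    print(name, scores.round(3), "best match:", best[name])
```

<p>Both representations pick the first sentence here. The scores differ because tf-idf gives less weight to words that appear in many corpus sentences, such as "this", and more weight to rarer ones, such as "nook".</p>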

<h2>1. Store all the corpus questions in a list</h2>

In [1]:
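import ast
import gzip

import pandas as pd

def parse(path):
    # Each line is a Python dict literal (single quotes), not strict JSON.
    # ast.literal_eval parses it without executing arbitrary code, unlike eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield ast.literal_eval(l)

def getDF(path):
    # Build a DataFrame with one row per record, indexed 0..n-1
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

df = getDF('data/qa_Electronics.json.gz')
# Keep only the first 25000 questions for performance reasons
questions = df.question.tolist()[:25000]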

<h2>2. Store all the corresponding corpus answers in a list</h2>

In [2]:
# Same 25000-row slice as the questions, so both lists have the same length
answers = df.answer.tolist()[:25000]

<h2>3. Vectorize and preprocess the question data</h2>

In [3]:
import pandas
import re
import nltk
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.metrics.pairwise import cosine_similarity,cosine_distances
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import warnings
warnings.filterwarnings("ignore")
    
def get_cleanText(text):
    # Replace URLs, numbers, hashtags and user mentions with placeholder tokens
    text = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', 'URL', text)
    # www-style links, plus / ? = & % when directly followed by a word character
    text = re.sub(r'(www[\.\w]+|\/|\?|\=|\&|\%)\b', 'URL', text)
    text = re.sub(r'[0-9]+', 'DIGIT', text)
    text = re.sub(r'#([a-zA-Z]+|[0-9]+)+', 'HASHTAG', text)
    text = re.sub(r'@\w+', 'USER', text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

def get_emptyWordsRemoving(text):
    # Drop English stop words
    enStopWords = set(stopwords.words('english'))
    return ' '.join(w for w in text.split(' ') if w not in enStopWords)

def get_racination(text):
    # Reduce each word to its stem with the Snowball stemmer
    snow_stemmer = SnowballStemmer(language='english')
    return ' '.join(snow_stemmer.stem(w) for w in text.split(' '))

def get_lematizer(text):
    # Reduce each word to its WordNet lemma
    lemmatizer = WordNetLemmatizer()
    return ' '.join(lemmatizer.lemmatize(w) for w in text.split(' '))

def get_tokens(text):
    # Tokens start with a letter, are at least two characters long,
    # and may contain one apostrophe
    nltkTokenizer = nltk.RegexpTokenizer(r"[a-zA-Z]\w+'?\w*")
    return nltkTokenizer.tokenize(text)

def preprocess_texts(texts, action):
    # Apply one preprocessing step to every text and return a new list,
    # leaving the input list (for example the global questions) unchanged
    processed = []
    for text in texts :
        text = text.lower()
        match action:
            case 'clean':
                text = get_cleanText(text)
            case 'racinisation':
                text = get_racination(text)
            case 'lemmatisation':
                text = get_lematizer(text)
            case 'emptyWords' :
                text = get_emptyWordsRemoving(text)
            case 'tokenize' :
                text = get_tokens(text)
            case _:
                # Unknown action: keep the lowercased text as-is
                pass

        processed.append(text)
    return processed

def similarity_cos(firstList, secondList): 
    result= cosine_similarity(firstList.reshape(1,-1),secondList.reshape(1,-1))
    return result

def myVectorizer(choice):
    # Return a bag-of-words vectorizer for 'wordBag', tf-idf for anything else
    match choice:
        case 'wordBag':
            vectorizer = CountVectorizer(tokenizer=get_tokens)
        case _:
            # 'tf_idf' and any unrecognized value fall back to tf-idf
            vectorizer = TfidfVectorizer(tokenizer=get_tokens)
    return vectorizer

In [4]:
def get_questionsBotVectorized(choice):
    questions_toProcess = questions

    #Processing
    questions_toProcess = preprocess_texts(questions_toProcess,'clean')
    questions_toProcess = preprocess_texts(questions_toProcess,'emptyWords')
    questions_toProcess = preprocess_texts(questions_toProcess,'racinisation')
    questions_toProcess = preprocess_texts(questions_toProcess,'lemmatisation')

    vectorizer = myVectorizer(choice)
    texts = vectorizer.fit_transform(questions_toProcess)
    texts = texts.toarray()

    colsName = vectorizer.get_feature_names_out()

    return pandas.DataFrame(texts, columns = colsName)

<h2>4. Vectorize and preprocess the user's query</h2>

In [5]:
def get_questionUserVectorized(text_input, choice) :
    text = [text_input]
    text = preprocess_texts(text,'clean')
    text = preprocess_texts(text,'emptyWords')
    text = preprocess_texts(text,'racinisation')
    text = preprocess_texts(text,'lemmatisation')
    
    vectorizer = myVectorizer(choice)
    texts = vectorizer.fit_transform(text)
    texts = texts.toarray()

    colsName = vectorizer.get_feature_names_out()
    return pandas.DataFrame(texts, columns = colsName)
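
<p>The function above fits a new vectorizer on the query alone, so the columns later have to be aligned with those of the corpus. A common alternative, sketched below under stated assumptions, is to fit the vectorizer once on the corpus and call <code>transform</code> on each query. The names <code>toy_questions</code>, <code>toy_answers</code> and <code>answer_query</code> are hypothetical and the data is made up. This is not the notebook's method, only a point of comparison.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the corpus lists (hypothetical data, not taken from the file).
toy_questions = ["does it fit the nook color",
                 "how long is the cable",
                 "is the stand adjustable"]
toy_answers = ["Yes, it fits.", "About six feet.", "Yes, it tilts."]

vectorizer = TfidfVectorizer()
question_matrix = vectorizer.fit_transform(toy_questions)  # learn vocabulary and idf once

def answer_query(query):
    """Return the answer paired with the corpus question closest to the query."""
    query_vector = vectorizer.transform([query])  # reuse the fitted vocabulary
    scores = cosine_similarity(query_vector, question_matrix)[0]
    return toy_answers[scores.argmax()]

print(answer_query("how long is the power cable"))
```

<p>With this layout the query vector has the same columns as the corpus matrix by construction, words unknown to the corpus are simply ignored, and the matrices can stay sparse instead of being converted to dense DataFrames.</p>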
    

<h2>5. Find the question most similar to the user's query using cosine similarity</h2>

In [6]:
def similarityCalc(userQuestion, botQuestions):
    # Score every corpus question against the user's query in a single call
    # (much faster than looping over the rows one by one).
    scores = cosine_similarity(botQuestions.to_numpy(), userQuestion.to_numpy())[:, 0]
    # Position of the best-scoring corpus question
    bestFit = int(scores.argmax())
    return bestFit

<h2>6. Return the answer paired with the most similar question as the chat reply</h2>

In [7]:
def filteringColumns(botQ, userQ):
    columnToShow = list(filter(lambda token: token in userQ.columns, botQ.columns))
    
    newUserQ = userQ[list(columnToShow)]
    newBotQ = botQ[list(columnToShow)]

    return [newUserQ, newBotQ]   

def get_answer(index):
    return answers[index]

In [17]:
# Questions list for testing purpose (3 exact questions + 3 questions near to the original)
#questionsTest = [
#    'Is this cover the one that fits the old nook color? Which I believe is 8x5.',
#    'does this have a flip stand',
#    'my vizio has 200 ht x 600 width mounting holes. will this mount handle that?',
#    'arm extend, but, how far please ?',
#    'It is working with Viso VX37L?',
#    'It is working with mac mini?'
#]
# Expected answers 
#answersTest = [
#    'Yes this fits both the nook color and the same-shaped nook tablet',
#    "Hi, no it doesn't",
#    "I'm sorry mine is mounted already so I could not measure for you but as long as it's within the specified size it should fit just fine I recommend",
#    '18 inches on our TV.',
#    'Yes, definitely. I bought me a Vizio 42 inch and install it without the extended brackets. Make sure the wall bracket is installed with the stud.',
#    'Yes'
#]
print("Enter 'exit', to stop the program.\n\n")

userQuestion = input('Question : ')

if userQuestion.lower() != 'exit' :
    userVectorizedChoice = input('Vectorizing : Press 1 for tf_idf or 2 for wordBag => ')
    
    # input() returns a string, so the cases must be '1' and '2', not 1 and 2
    match userVectorizedChoice.strip():
        case '1':
            userVectorizedChoice = 'tf_idf'
        case '2':
            userVectorizedChoice = 'wordBag'
        case _:
            # Any other input falls back to tf-idf
            userVectorizedChoice = 'tf_idf'
        
    questionsBotVectorized = get_questionsBotVectorized(userVectorizedChoice)
    
    while(userQuestion.lower() != 'exit') :
        questionUserVectorized = get_questionUserVectorized(userQuestion, userVectorizedChoice)

        # Keep only the columns (tokens) shared by the query and the corpus
        questionUserVectorizedFiltered, questionsBotVectorizedFiltered = filteringColumns(questionsBotVectorized, questionUserVectorized)

        questionBotIndex = similarityCalc(questionUserVectorizedFiltered, questionsBotVectorizedFiltered)
        answer = get_answer(questionBotIndex)
        print("Answer => {}".format(answer))

        userQuestion = input('\nQuestion : ')
    
print("\nProgram stopped")

Enter 'exit', to stop the program.


Question : Is this cover the one that fits the old nook color? Which I believe is 8x5.
Vectorizing : Press 1 for tf_idf or 2 for wordBag => 1
Answer => Yes this fits both the nook color and the same-shaped nook tablet

Question : does this have a flip stand
Answer => No, there is not a flip stand. It has a pocket in the front flap. It is a very nice cover.

Question : my vizio has 200 ht x 600 width mounting holes. will this mount handle that?
Answer => I'm sorry mine is mounted already so I could not measure for you but as long as it's within the specified size it should fit just fine I recommend

Question : arm extend, but, how far please ?
Answer => 18 inches on our TV.

Question : It is working with Viso VX37L?
Answer => Yes, definitely. I bought me a Vizio 42 inch and install it without the extended brackets. Make sure the wall bracket is installed with the stud.

Question : It is working with mac mini?
Answer => yes. my computer actually auto-

<p>So far you have built a chatbot specialized in electronic products. We now want to 
generalize this chatbot to other product categories. From the site 
    <b>http://jmcauley.ucsd.edu/data/amazon/qa/</b>, download the json files for three other product 
categories far removed from electronics. Then adapt your program to these new files.</p><p>
The test remains the same as before. For each category, define 6 questions (3 taken from the 
corpus and the others from outside the corpus).</p>
<p>Is the chatbot accurate in its answers to your queries?</p>
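<p>One way to answer this question with a number is to run each test question through the chatbot and count exact matches with the expected answers. The sketch below assumes a hypothetical <code>answer_fn</code> that maps a question to an answer. A lookup table stands in for the real chatbot here, and the second test pair is invented for the example.</p>

```python
def evaluate(answer_fn, test_questions, expected_answers):
    """Return (accuracy, mismatches) for a question -> answer function."""
    mismatches = []
    for question, expected in zip(test_questions, expected_answers):
        got = answer_fn(question)
        if got != expected:
            mismatches.append((question, expected, got))
    accuracy = 1 - len(mismatches) / len(test_questions)
    return accuracy, mismatches

# Hypothetical stand-in for the chatbot: a lookup table with a default reply.
canned = {"what is the size of the bottles?": "Each bottle is .75 ounce."}

def fake_bot(question):
    return canned.get(question.lower(), "Yes")

accuracy, misses = evaluate(
    fake_bot,
    ["What is the size of the bottles?", "Does it need batteries?"],
    ["Each bottle is .75 ounce.", "No"],
)
print(accuracy)  # one of the two answers matches
print(misses)
```

<p>Exact string matching is strict: for the questions that are not in the corpus, a manual check of whether the returned answer is relevant may be more informative.</p>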

In [15]:
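import ast
import glob

def parse(path):
    # Lines are Python dict literals; ast.literal_eval parses them
    # without executing arbitrary code, unlike eval.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for l in g:
            yield ast.literal_eval(l)

def getDF(path):
    # Build a DataFrame with one row per record, indexed 0..n-1
    i = 0
    df = {}
    for d in parse(path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')

# Sorted so that the row order is the same on every run
allFiles = sorted(glob.glob('data/*.json.gz'))

df = None
for path in allFiles:
    # Keep only the first 2500 rows of each file for performance reasons
    data = getDF(path)[:2500]
    if df is not None:
        df = pd.concat([df, data])
    else :
        df = data

questions = df.question.tolist()
answers = df.answer.tolist()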

In [18]:
# Questions list for testing purpose (3 exact questions + 3 questions near to the original)
#questionsTest = [
#    "What is the heat of this compared to the yellow and red curry?",
#    "I have Windows 8, Will this work on my computer?",
#    "does this game works on windows 7 vista hp laptop and windows 7 hp laptop computers let me know right away please and thank you",
#    "what is the size of the bottles?",
#    "What is the version of it ?",
#    "Do we have a map editor on this version of battle chest ?"
#]
# Expected answers 
#answersTest = [
#    "I think that the yellow is the most mild. The green has a much deeper flavor profile than the yellow and red though.",
#    "Yes",
#    "I really don't know, I do know it works on Vista, though I currently not gotten it reinstalled when I had to reinstall Vista over a year ago. I would imagine it might install on Win7 and work fine, unless there is a significant difference internally between Vista and Win7. I should note that I often played it on a standard aspect ratio monitor instead of my widescreen monitor (in dual screen mode) and it did fine. After all, it IS a Microsoft product so getting it to work on later OS may be easier than some others if you have issues initially as there is a make compatible mode for such older programs in newer OS'.",
#    "Each bottle is .75 ounce. 3/4 of an ounce. That doesn't seem like a lot but believe me a little goes a long way. I am so happy I bought these. Best food coloring I have used in a long time. And I have been baking for 45 years.",
#    "Hi Leah... I am not sure what you are asking? This product has a description::Rosetta Stone Homeschool teaches your student a new language naturally, the same way they mastered their first language. Innovative solutions get them speaking new words, right from the start. Rosetta Stone Homeschool moves forward only when your student is ready--you set the schedule and your student drives the pace. Parent Administrative Tools allow you to formulate lesson plans, manage your student's progress and track their success. Audio Companion CDs let them reinforce the Rosetta Stone experience anytime, anywhere. My daughter hadn't taken spanish for about 4 years and used this to brush up so she could take spanish in college. She found this very helpful. She continues to use this at higher levels as well. Hope this answers your question.",
#    "Yes it does. Thanks for your interest."
#]

print("Enter 'exit', to stop the program.\n\n")

userQuestion = input('Question : ')

if userQuestion.lower() != 'exit' :
    userVectorizedChoice = input('Vectorizing : Press 1 for tf_idf or 2 for wordBag => ')
    
    # input() returns a string, so the cases must be '1' and '2', not 1 and 2
    match userVectorizedChoice.strip():
        case '1':
            userVectorizedChoice = 'tf_idf'
        case '2':
            userVectorizedChoice = 'wordBag'
        case _:
            # Any other input falls back to tf-idf
            userVectorizedChoice = 'tf_idf'
        
    questionsBotVectorized = get_questionsBotVectorized(userVectorizedChoice)
    
    while(userQuestion.lower() != 'exit') :
        questionUserVectorized = get_questionUserVectorized(userQuestion, userVectorizedChoice)

        # Keep only the columns (tokens) shared by the query and the corpus
        questionUserVectorizedFiltered, questionsBotVectorizedFiltered = filteringColumns(questionsBotVectorized, questionUserVectorized)

        questionBotIndex = similarityCalc(questionUserVectorizedFiltered, questionsBotVectorizedFiltered)
        answer = get_answer(questionBotIndex)
        print("Answer => {}".format(answer))

        userQuestion = input('\nQuestion : ')
    
print("\nProgram stopped")

Enter 'exit', to stop the program.


Question : What is the heat of this compared to the yellow and red curry?
Vectorizing : Press 1 for tf_idf or 2 for wordBag => 1
Answer => I think that the yellow is the most mild. The green has a much deeper flavor profile than the yellow and red though.

Question : I have Windows 8, Will this work on my computer?
Answer => It appears that it will but I am running a Mac Maybe you can search around on Google... Top result was:Windows 8.1 compatibility for The Sims 3 version 1 http://www.microsoft.com/en-us/windows/compatibility/CompatCenter/ProductDetailsViewer?Type=Software&Name=The+Sims+3&ModelOrVersion=1&Vendor=EA&Locale=1033&LastSearchTerm=&BreadcrumbPath=The+Sims+3&TempOsid=Windows+8.1

Question : does this game works on windows 7 vista hp laptop and windows 7 hp laptop computers let me know right away please and thank you
Answer => I really don't know, I do know it works on Vista, though I currently not gotten it reinstalled when I had to rein

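### Accuracy analysis

The chatbot is not very accurate: it can match a query to an unrelated question in the corpus. For example, in the run above, the corpus question "I have Windows 8, Will this work on my computer?" returned a different answer from the expected "Yes". Still, it is a good start.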
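
One possible mitigation, sketched here as an assumption rather than something the notebook implements, is to refuse to answer when the best similarity score is too low. The function name `pick_answer`, the threshold of 0.5 and the fallback message are all hypothetical and would need tuning on real queries.

```python
def pick_answer(scores, answers, threshold=0.5, fallback="Sorry, I don't know."):
    """Return the best-scoring answer, or a fallback when the match is too weak."""
    best = max(range(len(scores)), key=scores.__getitem__)  # index of the highest score
    if scores[best] < threshold:
        return fallback
    return answers[best]

print(pick_answer([0.1, 0.9], ["a", "b"]))  # strong match: returns "b"
print(pick_answer([0.1, 0.2], ["a", "b"]))  # weak match: returns the fallback
```

A threshold trades coverage for precision: the chatbot answers fewer queries, but the answers it does give are less likely to be off-topic.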