Take one of the example bots used (either of the two) and build your own.
Draw conclusions from the results.

In [55]:
import string
import re
import urllib.request
import bs4 as bs
import nltk
import unicodedata
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import gradio as gr

In [56]:
nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Usuario\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

### Data
The data comes from an English-language Wikipedia article on the FIFA World Cup and its history.

In [57]:
url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'

In [58]:
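def get_article_text(url: str) -> str:
    # Download the raw HTML of the article.
    with urllib.request.urlopen(url) as response:
        raw_html = response.read()

    # Name the parser explicitly to avoid bs4's "no parser specified" warning.
    article_html = bs.BeautifulSoup(raw_html, 'html.parser')
    article_paragraphs = article_html.find_all('p')

    # Concatenate the text of every <p> element and lowercase it.
    article_text = ''.join(p.text for p in article_paragraphs)

    return article_text.lower()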

In [59]:
article_text = get_article_text(url)

In [60]:
article_text

'\nthe fifa world cup, often simply called the world cup, is an international association football competition between the senior men\'s national teams of the members of the fédération internationale de football association (fifa), the sport\'s global governing body. the tournament has been held every four years since the inaugural tournament in 1930, with the exception of 1942 and 1946 due to the second world war. the reigning champions are argentina, who won their third title at the 2022 tournament.\nthe contest starts with the qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. in the tournament phase, 32 teams compete for the title at venues within the host nation(s) over about a month. the host nation(s) automatically qualify for the group stage of the tournament. the next fifa world cup is scheduled to expand to 48 teams for the 2026 tournament.\nas of the 2022 fifa world cup, 22 final tournaments have 

In [61]:
len(article_text)

33901

In [62]:
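def preprocess_text(text: str) -> str:
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r'\s+', ' ', text)
    # Remove citation markers such as [1] or [23].
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    # Drop apostrophes inside words (men's -> mens).
    text = re.sub(r"(?<=[a-z])'(?=[a-z])", '', text)

    # Strip accents: decompose with NFKD, then discard the non-ASCII marks.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # Keep only letters, digits, basic punctuation and whitespace.
    # The range must be A-Z; A-z would also match [ \ ] ^ _ and `.
    pattern = r'[^a-zA-Z0-9.,!?/:;\"\'\s]'
    text = re.sub(pattern, '', text)

    return text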
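
To make the effect of these steps concrete, here is a minimal sketch that applies the citation removal, accent stripping and whitespace collapsing to a short sample string. The sample text and the helper name `strip_accents` are illustrative assumptions, not part of the notebook.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters (NFKD) and drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample resembling the scraped Wikipedia text.
sample = "the fédération internationale de football association[1] (fifa)"

no_citations = re.sub(r'\[[0-9]*\]', ' ', sample)  # remove markers such as [1]
ascii_only = strip_accents(no_citations)           # 'fédération' -> 'federation'
collapsed = re.sub(r'\s+', ' ', ascii_only)        # merge repeated whitespace

print(collapsed)
```

Collapsing whitespace last, as in this sketch, also cleans up the double spaces that the citation removal leaves behind.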

In [63]:
preprocessed_text = preprocess_text(article_text)

In [64]:
preprocessed_text

' the fifa world cup, often simply called the world cup, is an international association football competition between the senior mens national teams of the members of the federation internationale de football association fifa, the sports global governing body. the tournament has been held every four years since the inaugural tournament in 1930, with the exception of 1942 and 1946 due to the second world war. the reigning champions are argentina, who won their third title at the 2022 tournament. the contest starts with the qualification phase, which takes place over the preceding three years, to determine which teams qualify for the tournament phase. in the tournament phase, 32 teams compete for the title at venues within the host nations over about a month. the host nations automatically qualify for the group stage of the tournament. the next fifa world cup is scheduled to expand to 48 teams for the 2026 tournament. as of the 2022 fifa world cup, 22 final tournaments have been held sin

In [65]:
len(preprocessed_text)

33149

In [66]:
def divide_in_corpus_and_words(text: str) -> tuple[list[str], list[str]]:
    # Sentences form the corpus the bot answers from; words are the individual tokens.
    corpus = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    return corpus, words

In [67]:
corpus, words = divide_in_corpus_and_words(preprocessed_text)

In [68]:
print("Tokens:", len(words))

Tokens: 6169
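
Note that `len(words)` counts every token occurrence, whereas the vocabulary is the set of distinct tokens. A minimal sketch of the difference, using a plain whitespace split in place of `nltk.word_tokenize` (an assumption made to keep the example dependency-free) and a made-up sample sentence:

```python
# Hypothetical sample sentence; whitespace splitting stands in for nltk.word_tokenize.
sample_tokens = "the world cup is the biggest cup in the world".split()

token_count = len(sample_tokens)         # every occurrence counts
vocabulary = sorted(set(sample_tokens))  # distinct tokens only

print("Tokens:", token_count)
print("Vocabulary size:", len(vocabulary))
```

Applied to the notebook's data, the equivalent of the second figure would be `len(set(words))`.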


In [69]:
lemmatizer = WordNetLemmatizer()

In [70]:
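def perform_lemmatization(tokens):
    # WordNetLemmatizer treats every token as a noun unless a POS tag is given.
    return [lemmatizer.lemmatize(token) for token in tokens]

# Translation table that deletes every ASCII punctuation character.
punctuation_removal = str.maketrans('', '', string.punctuation)

def get_processed_text(document):
    # Lowercase, strip punctuation, tokenize, then lemmatize.
    return perform_lemmatization(nltk.word_tokenize(document.lower().translate(punctuation_removal)))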
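
The punctuation-stripping step relies on `str.translate` with a table that maps punctuation characters to `None`. Below is a minimal sketch of that step in isolation, with the table built through `str.maketrans` (an equivalent way to construct it). The sample sentence is illustrative, and lemmatization is left out so the example runs without NLTK.

```python
import string

# Map every ASCII punctuation character to None so that translate() deletes it.
punctuation_table = str.maketrans('', '', string.punctuation)

sample = "brazil, winners in 2002, were the first defending champions!"  # hypothetical input
cleaned = sample.lower().translate(punctuation_table)

print(cleaned.split())
```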

In [71]:
def generate_response(user_input, corpus):
    # Vectorize the corpus plus the query without mutating the shared corpus list.
    documents = corpus + [user_input]
    word_vectorizer = TfidfVectorizer(tokenizer=get_processed_text, stop_words='english')
    all_word_vectors = word_vectorizer.fit_transform(documents)

    # Similarity of the query (last row) against the corpus sentences only.
    similarities = cosine_similarity(all_word_vectors[-1], all_word_vectors[:-1]).flatten()
    best_index = similarities.argmax()

    # A zero score means the query shares no terms with any sentence.
    if similarities[best_index] == 0:
        return "I am sorry, I could not understand you"
    return corpus[best_index]
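
To illustrate the idea behind `generate_response`, the following sketch ranks sentences against a query with a hand-rolled TF-IDF and cosine similarity. It is a simplified stand-in rather than a reimplementation of scikit-learn's `TfidfVectorizer`: it uses a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, splits on whitespace, and applies no stop-word filtering or lemmatization, so its scores will not match the notebook's. The helper names and the three-sentence corpus are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using a smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, sentences):
    """Return (sentence, score) for the most similar sentence, or (None, 0.0) if nothing overlaps."""
    vectors = tfidf_vectors(sentences + [query])
    query_vector, sentence_vectors = vectors[-1], vectors[:-1]
    scores = [cosine(query_vector, v) for v in sentence_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return (sentences[best], scores[best]) if scores[best] > 0 else (None, 0.0)


# Hypothetical mini-corpus, already lowercased and stripped of punctuation.
sentences = [
    "brazil won the tournament in 2002",
    "the trophy was redesigned after 1970",
    "argentina won the tournament in 2022",
]

print(best_match("who won in 2022", sentences))
print(best_match("tennis", sentences))
```

The second query shares no term with any sentence, which is the situation that triggers the fallback message in the bot.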

In [72]:

def bot_response(human_text):
    # Log each exchange in the notebook output and return the answer to the UI.
    print("Q:", human_text)
    resp = generate_response(human_text.lower(), corpus)
    print("A:", resp)
    return resp

iface = gr.Interface(
    fn=bot_response,
    inputs="textbox",
    outputs="text")

iface.launch(debug=True)


Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Q: champion 2002
A: brazil, winners in 2002, were the first defending champions to play qualifying matches.




Q: tennis
A: I am sorry, I could not understand you
Q: maradona
A: I am sorry, I could not understand you
Q: argentina
A: both argentina and uruguay thus boycotted the 1938 fifa world cup.
Q: 2022
A: the reigning champions are argentina, who won their third title at the 2022 tournament.
Q: 2018
A: the same day, fifa postponed the bidding process for the 2026 fifa world cup in light of the allegations surrounding bribery in the awarding of the 2018 and 2022 tournaments.
Q: 1986
A: fifa has licensed world cup video games since 1986, sponsored by electronic arts.
Q: mexico 1986
A: the 2026 tournament will be jointly hosted by canada, the united states and mexico, which will give mexico the distinction of being the first country to host games in three world cups.
Q: south africa
A: so far, south africa 2010 and qatar 2022 failed to advance beyond the first round.
Q: country with most trophies
A: after 1970, a new trophy, known as the fifa world cup trophy, was designed.
Q: most winner
A: s



Conclusions

The corpus sentences are too long and not specific enough; some of them cover more than one concept. When asked about a country, the bot simply returns a sentence that mentions it, whether or not that sentence is relevant. Even so, some answers are correct, for example when it was asked for the top scorer or for the winner in 2022.
Embeddings or some other technique would be needed so that the bot captures, in some way, the meaning behind each country name instead of only matching one string against another.
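
The limitation can be seen with a minimal sketch of purely lexical matching: a query only scores against a sentence when they share an exact token, regardless of meaning. The helper `shared_terms` is hypothetical, and the sentence is one of the bot's answers above with its punctuation removed.

```python
def shared_terms(query: str, sentence: str) -> set:
    """Return the tokens that appear verbatim in both strings."""
    return set(query.lower().split()) & set(sentence.lower().split())


sentence = "the reigning champions are argentina who won their third title at the 2022 tournament"

print(shared_terms("2022", sentence))         # exact token present in the sentence
print(shared_terms("last winner", sentence))  # related meaning, but no token in common
```

An embedding-based approach would aim to place "winner" and "champions" close together, which term matching cannot do on its own.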