<img src="https://raw.githubusercontent.com/dmlls/jizt/c2d7b9b81783e298d1898b5743b147d1faff8f29/images/JIZT-logo.svg" title="JIZT" alt="JIZT" width="230" align="left" style="margin-top:15px;margin-right:30px;" />

---

### Basic text pre-processing
[Diego Miguel Lozano](https://github.com/dmlls) \
GPL-3.0 License

*Last updated: November 9, 2020*

---

# Introduction

This notebook focuses on pre-processing the input text to adapt it to the `BART` and `T5` models, which will later be used to summarize it. The example texts are in English, since these models are optimized for that language.

---


# Requirements

To run this notebook, the latest version of the following packages must be installed:

- `NLTK`
- `spaCy` with `en_core_web_sm`
- [`blingfire`](https://github.com/microsoft/BlingFire)

---

# Text pre-processing

Pre-processing the text to adapt it to these two models consists of:
- Removing line breaks, tabs (`\n`, `\t`), and extra spaces between words (e.g., `I    am` → `I am`).
- Adding a space at the start of intermediate sentences (e.g., `How's it going?Great!` → `How's it going? Great!`). This is especially relevant for the `BART` model, which uses that leading space to distinguish initial sentences from intermediate ones.
- Establishing a mechanism to split the text into sentences. This matters because the models have a maximum input size (given as a number of encoded tokens). Having the text split into sentences lets us adjust the size of the input text while preserving its coherence, that is, without cutting sentences in half and losing their meaning.
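
To make the last point concrete, here is a minimal sketch of how a list of sentences could be packed into chunks that respect a token budget without ever cutting a sentence. The names `pack_sentences`, `max_tokens`, and `count_tokens` are hypothetical, and the whitespace word count is only a stand-in: a real pipeline would count the tokens produced by the model's own tokenizer.

```python
def pack_sentences(sentences, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily group whole sentences into chunks of at most `max_tokens` tokens.

    `count_tokens` is an assumption (whitespace word count); replace it with the
    token count of the actual model tokenizer.
    """
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n > max_tokens:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        # a sentence longer than the budget is kept whole in its own chunk
        current.append(sent)
        current_len += n
    if current:
        chunks.append(' '.join(current))
    return chunks


sentences = ["How's it going?", "Great!", "Let's just say it's not going too bad."]
chunks = pack_sentences(sentences, max_tokens=5)
print(chunks)
# ["How's it going? Great!", "Let's just say it's not going too bad."]
```

Joining with a single space also gives every intermediate sentence the leading space mentioned in the second point.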

In addition, we will assume that:
- A period (`.`) marks the end of a sentence only if the next word starts with an uppercase *letter*. For example, `Your idea is interesting. However, I would... ` would be split into two sentences, whereas `We already mentioned in section 1.1 that this example shows...` would form a single sentence. The same applies to question marks (`?`) and exclamation marks (`!`). For example, `She asked "How's it going?", and I said "Great!".` will be treated as a single sentence.
- By requiring that the character following the period, question mark, or exclamation mark be an uppercase *letter* in order to start a new sentence, punctuation such as ellipses is also handled correctly. However, this assumption fails in cases such as `NLP (i.e. Natural Language Processing) is a subfield of Linguistics, Computer Science, and Artificial Intelligence.`, where the text would be split into `NLP (i.e.` and `Natural Language Processing) is a subfield...`, that is, two sentences when there is really only one.

---

Let's look at an example. We will use the following text (badly formatted on purpose):

In [1]:
text = "How's your        day going???!It's     going...\n Let's just say it's not going to \t bad."
print(text)

How's your        day going???!It's     going...
 Let's just say it's not going to 	 bad.


---

The first step is to remove line breaks, tabs (`\n`, `\t`), and extra spaces between words:

In [2]:
text = ' '.join(text.split())
print(text)

How's your day going???!It's going... Let's just say it's not going to bad.
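
For comparison, the same clean-up can also be written with a regular expression. This is only an alternative sketch using the standard `re` module, not what the notebook relies on; the variable names are illustrative.

```python
import re

text = "How's your        day going???!It's     going...\n Let's just say it's not going to \t bad."

# approach used in this notebook: split on any whitespace run, join with one space
with_split = ' '.join(text.split())

# alternative: collapse every whitespace run into one space, then trim the ends
with_regex = re.sub(r'\s+', ' ', text).strip()

print(with_split == with_regex)  # True
print(with_regex)
```

The `split`/`join` version needs no import and trims leading and trailing whitespace implicitly, which is presumably why it is the one used here.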


---

Next, we split the text into sentences with the help of the following regular expression:

In [3]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]+[^A-Z]*')

The regular expression has three parts:
- `[^.!?]+`: matches any character that is not a sentence terminator (that is, a period, exclamation mark, or question mark).
- `[.!?]+`: then matches one or more terminators. Allowing more than one ensures that ellipses (`...`), exclamation-question combinations (`!?`), and repetitions of any of these characters (`????`) are parsed correctly.
- `[^A-Z]*`: finally, whatever follows is also included as long as it is not an uppercase letter. This way, sentences such as `As we can see in Figure 1.1, the model is overfitting` are not split incorrectly.
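
To see what each of the three parts consumes, the sketch below wraps them in capturing groups and uses the standard `re` module directly. The groups and the helper `explain` are added purely for illustration, and the second sample sentence is made up; the sketch assumes that `RegexpTokenizer` returns the same matches that `re` finds for this pattern.

```python
import re

# The notebook's pattern with one capturing group per part (groups added for illustration)
PARTS_RE = re.compile(r'([^.!?]+)([.!?]+)([^A-Z]*)')

def explain(text):
    """Return (body, terminators, trailing) for each sentence matched in `text`."""
    return [m.groups() for m in PARTS_RE.finditer(text)]

print(explain("As we can see in Figure 1.1, the model is overfitting."))
# [('As we can see in Figure 1', '.', '1, the model is overfitting.')]

print(explain("Wait... what?!Really?"))
# [('Wait', '...', ' what?!'), ('Really', '?', '')]
```

In the first sample the third part swallows everything after `1.` because no uppercase letter follows, which is what keeps `Figure 1.1` in one piece.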

In [4]:
tokenizer.tokenize(text)

["How's your day going???!",
 "It's going... ",
 "Let's just say it's not going to bad."]

---

The reason we do not use the `sent_tokenize` function from the `nltk` library is that it fails in some specific cases. For example:

In [5]:
from nltk import sent_tokenize

print(sent_tokenize("Hello.Goodbye.")) # fails: there are two sentences, it parses everything as one
print(sent_tokenize("Seriously??!That can't be true.")) # fails: takes '!' as part of the second sentence

['Hello.Goodbye.']
['Seriously??', "!That can't be true."]


---

Our simple regular expression, in contrast, splits them correctly:

In [6]:
print(tokenizer.tokenize("Hello.Goodbye."))
print(tokenizer.tokenize("Seriously??!That can't be true."))

['Hello.', 'Goodbye.']
['Seriously??!', "That can't be true."]


---

However, our regular expression is not perfect. There are cases in which it inevitably fails. For example:

In [7]:
tokenizer.tokenize("I was born in 02.28.1980 in New York.")

['I was born in 02.28.1980 in ', 'New York.']

---

Finding a regular expression that handles both this case and the previous ones is no easy task, and it would in any case make the expression much more complicated. It is therefore easier to check whether any sentence does not end in a period, exclamation mark, or question mark and, if so, concatenate it with the following sentence:

In [8]:
sentences = tokenizer.tokenize("I was born in 02.28.1980 in New York.")

final_sentences = [sentences[0].strip()]  # remove leading and trailing whitespace

for sent in sentences[1:]:
    sent = sent.strip()
    # if the previous sentence doesn't end with '.', '!' or '?',
    # concatenate the current sentence to it
    if not final_sentences[-1].endswith(('.', '!', '?')):
        final_sentences[-1] += ' ' + sent
    else:
        final_sentences.append(sent)

final_sentences

['I was born in 02.28.1980 in New York.']

---

All that remains is to put all the steps together in a function:

In [9]:
from nltk.tokenize import RegexpTokenizer

def preprocess_text(text, tokenizer=None, return_as_list=False):
    if tokenizer is None:
        # if next letter after period is lowercase, consider it part of the same sentence
        # ex: "As we can see in Figure 1.1. the sentence will not be split."
        tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]+[^A-Z]*')
        # if there's no final period, add it (this makes the assumption that the last
        # sentence is not interrogative or exclamative, i.e., ends with '?' or '!')
        if text[-1] != '.' and text[-1] != '?' and text[-1] != '!':
            text += '.'
    
    text = ' '.join(text.split()) # remove '\n', '\t', etc.
    
    sentences = tokenizer.tokenize(text)

    final_sentences = [sentences[0].strip()] # remove leading and trailing whitespaces
    
    for sent in sentences[1:]:
        sent = sent.strip()
        # if the previous sentence doesn't end with a '.', '!' or '?' we concatenate the current sentence to it
        if final_sentences[-1][-1] != '.' and final_sentences[-1][-1] != '!' and final_sentences[-1][-1] != '?':
            final_sentences[-1] += (' ' + sent)
        else:
            final_sentences.append(sent)
                                       
    return final_sentences if return_as_list else ' '.join(final_sentences)

---

Have we covered every case? The answer is no. Our function does not take Named Entities into account and will fail in cases such as:

In [10]:
print(preprocess_text("Mr. Elster looked worried.", return_as_list=True)) # fails: it's only one sentence, not two
print(preprocess_text("London is the capital of U.K.", return_as_list=True)) # fails: splits U.K.
print(preprocess_text("The soldier was declared A.W.O.L.", return_as_list=True)) # fails: splits A.W.O.L.

['Mr.', 'Elster looked worried.']
['London is the capital of U.', 'K.']
['The soldier was declared A.', 'W.', 'O.', 'L.']


---

It is because of issues like this that rule-based tokenizers are starting to be replaced by probabilistic models, which are more powerful.

Let's try, then, a more powerful model. To do so, we will use the `spaCy` library, which, along with `NLTK`, is very well known in the world of Natural Language Processing in Python. We will also use its English model `en_core_web_sm`, which implements a convolutional neural network trained on `OntoNotes`.

In [11]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

---

Let's see how this model behaves on the examples seen earlier. To begin with, we will test how well it handles Named Entity Recognition (NER):

In [12]:
texts_NER = ["Mr. Elster looked worried.", "London is the capital of U.K.", "The soldier was declared A.W.O.L."]

sentences = []

for text in texts_NER:
    sentences += [str(sen) for sen in nlp(text).sents]
sentences

['Mr. Elster looked worried.',
 'London is the capital of U.K.',
 'The soldier was declared A.W.O.L.']

We can see that the model recognizes the Named Entities correctly and does not split the previous sentences by mistake, as our function did.

---

But, once again, there are cases in which the model does not work as it should. Some of the examples we saw earlier, which our function split correctly, fail with the `spaCy` model:

In [13]:
texts_fail = ["Seriously??!That can't be true.", # fails: there are two sentences; the model sees only one
              "As we can see in Figure 1.1. the model will fail."] # fails: there's only one sentence, not two

sentences = []

for text in texts_fail:
    sentences += [str(sen) for sen in nlp(text).sents]
sentences

["Seriously??!That can't be true.",
 'As we can see in Figure 1.1.',
 'the model will fail.']

---

With everything seen so far, we can ask ourselves: why not use both our function and the pre-trained model to pre-process the text as accurately as possible? Let's try:

In [15]:
from nltk.tokenize import RegexpTokenizer
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

def preprocess_text(text, tokenizer=None, return_as_list=False):
    if tokenizer is None:
        # if next letter after period is lowercase, consider it part of the same sentence
        # ex: "As we can see in Figure 1.1. the sentence will not be split."
        tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]+[^A-Z]*')
        # if there's no final period, add it (this makes the assumption that the last
        # sentence is not interrogative or exclamative, i.e., ends with '?' or '!')
        if text[-1] != '.' and text[-1] != '?' and text[-1] != '!':
            text += '.'
    
    text = ' '.join(text.split()) # remove '\n', '\t', etc.
    
    sentences = ' '.join(tokenizer.tokenize(text)).replace('  ', ' ') # ensure there's 1 whitespace at most
    
    sentences = [str(sent).strip() for sent in nlp(sentences).sents] # Spacy model

    final_sentences = [sentences[0]]
    
    for sent in sentences[1:]:
        # if the previous sentence doesn't end with a '.', '!' or '?' we concatenate the current sentence to it
        if final_sentences[-1][-1] != '.' and final_sentences[-1][-1] != '!' and final_sentences[-1][-1] != '?':
            final_sentences[-1] += ' ' + sent
        # if the next sentence doesn't start with a letter or a number, concatenate it to the previous
        elif not sent[0].isalpha() and not sent[0].isdigit():
            final_sentences[-1] += sent
        else:
            final_sentences.append(sent)
                                       
    return final_sentences if return_as_list else ' '.join(final_sentences)

---

Let's try, one last time, the examples seen so far:

In [6]:
examples = ["How's your        day going???!It's     going...\n Let's just say it's not going to \t bad.",
            "Hello.Goodbye.",
            "Seriously??!That can't be true.",
            "Mr. Elster looked worried.",
            "London is the capital of U.K.",
            "I was born in 02.28.1980 in New York",
            "She asked \"How's it going?\", and I said \"Great!\"",
            "As we can see in Figure 1.1. the model will fail."]

for text in examples:
    print(preprocess_text(text, return_as_list=True))

["How's your day going???!", "It's going... Let's just say it's not going to bad."]
['Hello.', 'Goodbye.']
['Seriously??!', "That can't be true."]
['Mr. Elster looked worried.']
['London is the capital of U. K.']
['I was born in 02.28.1980 in New York.']
['She asked "How\'s it going?", and I said "Great!".']
['As we can see in Figure 1.1. the model will fail.']


So far, this is the best result obtained.

However, in very specific situations, as in the case of the last sentence, the model still fails. In addition, the function is fairly expensive in terms of time:

In [17]:
quite_long_text = """Our country is being held hostage by mad scientists and MPs afraid of blame.And the most important vision they should be using, hindsight they all appear to be blind too. I will try and break it down in this long thread. #WhyAreTheyDoingThis #pensionerprisoners #Covid_19 1 - Covid-19 arrived far earlier than anyone knew. It was likely in Britain in November or December and by January and February it was spreading rapidly. This took our government and other countries by surprise. And the resulting overload on health services was massive. 2 - there was little choice by mid March for most leaders and action to prevent a catastrophic failure and collapse of the NHS was essential. Cases were flooding in stadd sickness in NHS was rising fast. We were on the brink. 3 - the virus was so new the medical knowledge just hadn\'t accumulated to make good decisions on treatments. Doctors were fighting for peoples lives based on at times best guess using all their previous knowledge of other viruses. But covid was different and its impact nuanced. 4 - Although lockdown was a hammer to hit a flea it seemed at the time the only logical way to wrestle control back, to give the health service and the government time to find a strategy to fight this. Treatments were still so new no one knew for sure what was best. It was a mess 5 - So we all suffered while doctors, sage and government learnt on the hoof, but as we gathered more and more information we began to understand the virus more, who it targeted, what treatments seemed to work, the fog of covid was lifting. 6 - Since March we have seen 43,000 lives lost. But we since March we have not moved from the panic days to what is now a very clear picture of the virus. Instead fear is still being used to cover up for the failures of the first half of the year. Fear is now the driver. 
7 - But now we know much more, but we had a government desperate to stop the hemorrhaging of money and the suspension of the economy which to anyone with a brain can see will cost more lives over time. The country was facing a mental health breakdown. 8 - But we now know more, we have data from over a million days in hospitals, data from 43,000 deaths and the victims profiles, we have data on 450,000 survivors. We have insight on how big wave one was, the speed of spread and where. We have knowledge on treatments that work. 9 - we now know that the average age of victims is 82. We know which conditions make people vulnerable. We know the virus has a long incubation period and can stay on some surfaces for days. We know 10 times what we knew before lockdown. We know young people are resilient to it. 10 - What we also know is that by the time we see a covid spike any action we take is about mitigation. The spike is an indicator. We also know lockdowns are the worst form of treatment, the hammer and the flea. It is a steam roller to crack a nut. Lockdowns are madness. 11 - So what is the answer? Well we know many things, we know that by the time of national lockdown the cases were running at up to 120,000 new infections per day. Therefore it is highly likely we have seen since January 5 million people at least be infected. That is huge. 12 - We know that the virus mortality rate is not 3 or 4% its more like 0.13%. Statistically a person living in the UK has a 0.06% chance of dying of covid in their lifetime. Car accidents, robbings you are 6 times greater chance of being involved in. So why the fear, the panic? 13 - so now we know who it targets, who are resistant, how it spreads, who is at risk, the mortality of it, better treatments. The why are we still using the hammer? Fear. Fear of being wrong, fear of being blamed, fear of well just about anything covid related. 14 - So what is happening. 
#sage are badly advising thw government, they are not united as they seem. Too many egos, and quite frankly to many personal scientific theories to be proved before they see the damage they are doing to you, me the economy and our mental well-being. 15 - The data makes it very clear that a slightly more risky but far more sensible policy is controlled spread with shielding of our vulnerable people is a far better and sane, yes sane way forward. Why is it risky, well peoples lives of course. But that is why we must change. 16 - Lives, livelihoods, businesses, debt nental health, cancer patients, heart and lung patients all are dying either actually or financially or mentally now. Children at risk, educations destroyed. Because of Fear! 17 - The impact of this is going to ripple for a decade or more, lockdowns will kill far more than they save and that is a fact. Why? Because unlike #sage i see the impact of financial hardship on ordinary people, the damage is like a car crash, its long term effect deep! 18 - I have modelled recessions and impact for over twenty years, this gives me a terrible foresight on the damage we are now doing. I see the faces of those that will soon be flooding jobcentres desperate, in real crisis. Lives will be lost. Abuse will rise, crimes go up. 19 - I also see the damage we are doing to the nations health, trust me, the data is clear we are stocking up a health crisis so large it will hit every single family in every home everywhere, no one will escape untouched. And it is growing daily. It is insane self harm. 20 - 2 million cancer screenings lost. 62,000 urgent cancer referrals lost. A&amp;E referrals down 30%. Operations cancelled for up to 2 years. 1.5 million jobs lost. That will rise to 3 million over next 3 months. Businesses failing left right and centre. The tsunami is coming. 21 and last. I know a different way may seem scary, so much fear peddled by media and MPs like "let it rip" an awful nasty and evil set of words. 
But there is a better way and we all need to push for it. Be safe all. Statistics Guy"""
%timeit preprocess_text(quite_long_text)

116 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


---

At this point, the last question we can ask ourselves is: do we really need a model as powerful as `spaCy`'s? Or will it be enough to combine the fast and fairly effective `sent_tokenize` function from the `nltk` library with our function?

Let's run one last test.

We will only replace the line:

`sentences = [str(sent).strip() for sent in nlp(sentences).sents] # Spacy model`

with

`sentences = sent_tokenize(sentences)`.

In [14]:
from nltk.tokenize import RegexpTokenizer
from nltk import sent_tokenize

def preprocess_text(text, tokenizer=None, return_as_list=False):
    if tokenizer is None:
        # if next letter after period is lowercase, consider it part of the same sentence
        # ex: "As we can see in Figure 1.1. the sentence will not be split."
        tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]+[^A-Z]*')
        # if there's no final period, add it (this makes the assumption that the last
        # sentence is not interrogative or exclamative, i.e., ends with '?' or '!')
        if text[-1] != '.' and text[-1] != '?' and text[-1] != '!':
            text += '.'
    
    text = ' '.join(text.split()) # remove '\n', '\t', etc.
    
    sentences = ' '.join(tokenizer.tokenize(text)).replace('  ', ' ') # ensure there's 1 whitespace at most

    sentences = sent_tokenize(sentences)

    final_sentences = [sentences[0]]
    
    for sent in sentences[1:]:
        # if the previous sentence doesn't end with a '.', '!' or '?' we concatenate the current sentence to it
        if final_sentences[-1][-1] != '.' and final_sentences[-1][-1] != '!' and final_sentences[-1][-1] != '?':
            final_sentences[-1] += ' ' + sent
        # if the next sentence doesn't start with a letter or a number, concatenate it to the previous
        elif not sent[0].isalpha() and not sent[0].isdigit():
            final_sentences[-1] += sent
        else:
            final_sentences.append(sent)
                                       
    return final_sentences if return_as_list else ' '.join(final_sentences)

In [18]:
examples = ["How's your        day going???!It's     going...\n Let's just say it's not going to \t bad.",
            "Hello.Goodbye.",
            "Seriously??!That can't be true.",
            "Mr. Elster looked worried.",
            "London is the capital of U.K.",
            
            "I was born in 02.28.1980 in New York",
            "She asked \"How's it going?\", and I said \"Great!\"",
            "As we can see in Figure 1.1. the model will fail."]

for text in examples:
    print(preprocess_text(text, return_as_list=True))

["How's your day going???!", "It's going... Let's just say it's not going to bad."]
['Hello.', 'Goodbye.']
['Seriously??!', "That can't be true."]
['Mr. Elster looked worried.']
['London is the capital of U. K.']
['I was born in 02.28.1980 in New York.']
['She asked "How\'s it going?", and I said "Great!".']
['As we can see in Figure 1.1. the model will fail.']


We can see that the only difference is that, in the first example, the sentences separated by an ellipsis have been taken as a single sentence. This is not very serious, since in many cases it really is a single sentence (e.g., "Let's say it was... interesting").

In addition, the last two examples have been classified correctly.

Let's measure the time:

In [24]:
# quite_long_text is the same text defined in the earlier timing cell (In [17])
%timeit preprocess_text(quite_long_text)

1.7 ms ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


That is roughly 68 times faster (116 ms versus 1.7 ms per call).
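
The `%timeit` magic used above only works inside IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. The workload below (`clean_whitespace`) and the sample text are stand-ins chosen so that the example has no third-party dependencies; they are not the full `preprocess_text` function, and the sketch reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit

def clean_whitespace(text):
    """Stand-in workload: the whitespace clean-up step, not the full preprocess_text."""
    return ' '.join(text.split())

sample = "It was spreading rapidly.   This took our government \n by surprise. " * 200

# 7 repetitions of 100 calls each, similar in spirit to the %timeit reports above
runs = timeit.repeat(lambda: clean_whitespace(sample), repeat=7, number=100)
best_ms = min(runs) / 100 * 1000  # best average time per call, in milliseconds
print(f"best of 7 runs: {best_ms:.3f} ms per call")
```

Swapping `clean_whitespace` for each version of `preprocess_text` would allow the same comparison to be reproduced from a plain script.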

---

This last version is still not perfect. For example, it fails in the following case:

In [105]:
preprocess_text("NLP (i.e. Natural Language Processing) is a subfield of Linguistics, " +
                "Computer Science, and Artificial Intelligence.", return_as_list=True)

['NLP (i.e.',
 'Natural Language Processing) is a subfield of Linguistics, Computer Science, and Artificial Intelligence.']

We can see that it splits the sentence in two.

---

However, if we added a comma after the `i.e.` (as is common), it would work correctly:

In [4]:
preprocess_text("Tomorrow I work the morning shift, i.e., from 6 am to 1 pm.", return_as_list=True)

['Tomorrow I work the morning shift, i.e., from 6 am to 1 pm.']

# Conclusion

In conclusion, we must admit that it is very difficult to develop a model that handles each and every case.

Recall that what we were looking for was to:
- "Clean" and format the text correctly.
- Identify the sentences in the text. This lets us split it in order to fit the maximum input size of the summarization models.

In our case, we believe we have found a good trade-off between *accuracy* and *efficiency*.

---

## Update (v0.2)

Some time after writing this notebook, we came across another library, this time from Microsoft (`blingfire`), which also implements a sentence-splitting function (`text_to_sentences`). After a few quick tests, it appeared, at first glance, to work better than NLTK's `sent_tokenize` function.

Let's modify the function we implemented earlier to check whether this is true.

We will also modify the regular expression used by the `tokenizer` so that it captures acronyms containing periods, such as `U.K.`, `U.S.`, `A.K.A.`, `R.I.P.`, etc. We will match this kind of string with the expression `(?:[A-Z][.])+`, that is, an uppercase letter followed by a period, one or more times. The `?:` marks a non-capturing group, so that the tokenizer returns the whole match rather than only the last group.

This solves the earlier problem in which a space was inserted into acronyms containing periods, producing `U. K.` instead of `U.K.`, `B. C.` instead of `B.C.`, etc.
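
The effect of the new pattern, and of the `?:`, can be checked in isolation with the standard `re` module. This is a sketch under the assumption that `RegexpTokenizer` returns the same matches as `re.findall`; the constant names are illustrative, and `CAPTURING_RE` is a deliberately altered variant that is not used anywhere in the notebook.

```python
import re

OLD_RE = re.compile(r'[^.!?]+[.!?]+[^A-Z]*')
NEW_RE = re.compile(r'[^.!?]+(?:(?:[A-Z][.])+|[.!?]+)+[^A-Z]*')
# same as NEW_RE but with a capturing outer group, to show why `?:` matters
CAPTURING_RE = re.compile(r'[^.!?]+((?:[A-Z][.])+|[.!?]+)+[^A-Z]*')

sample = "London is the capital of U.K."

print(OLD_RE.findall(sample))        # ['London is the capital of U.', 'K.']
print(NEW_RE.findall(sample))        # ['London is the capital of U.K.']
print(CAPTURING_RE.findall(sample))  # ['K.']  (only the last repetition of the group)

print(NEW_RE.findall("The soldier was declared A.W.O.L."))
# ['The soldier was declared A.W.O.L.']
```

Because the acronym is no longer split into separate tokens, joining the tokens with spaces no longer inserts a space in the middle of it.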

In [23]:
from nltk.tokenize import RegexpTokenizer
from blingfire import text_to_sentences

def preprocess_text(text, tokenizer=None, return_as_list=False):
    text = ' '.join(text.split()) # remove '\n', '\t', etc.
    if not text:
        # nothing to split; also avoids indexing into an empty string below
        return [] if return_as_list else ''

    if tokenizer is None:
        # if next letter after period is lowercase, consider it part of the same sentence
        # ex: "As we can see in Figure 1.1. the sentence will not be split."
        tokenizer = RegexpTokenizer(r'[^.!?]+(?:(?:[A-Z][.])+|[.!?]+)+[^A-Z]*')
        # if there's no final period, add it (this makes the assumption that the last
        # sentence is not interrogative or exclamative, i.e., ends with '?' or '!')
        if not text.endswith(('.', '?', '!')):
            text += '.'
    
    sentences = ' '.join(tokenizer.tokenize(text)).replace('  ', ' ') # ensure there's 1 whitespace at most

    sentences = text_to_sentences(sentences).split('\n')

    final_sentences = [sentences[0]]
    
    for sent in sentences[1:]:
        # if the previous sentence doesn't end with a '.', '!' or '?' we concatenate the current sentence to it
        if final_sentences[-1][-1] != '.' and final_sentences[-1][-1] != '!' and final_sentences[-1][-1] != '?':
            final_sentences[-1] += ' ' + sent
        # if the next sentence doesn't start with a letter or a number, concatenate it to the previous
        elif not sent[0].isalpha() and not sent[0].isdigit():
            final_sentences[-1] += sent
        else:
            final_sentences.append(sent)
                                       
    return final_sentences if return_as_list else ' '.join(final_sentences)

In [28]:
examples = ["How's your        day going???!It's     going...\n Let's just say it's not going to \t bad.",
            "Hello.Goodbye.",
            "Seriously??!That can't be true.",
            "Mr. Elster looked worried.",
            "London is the capital of U.K.",
            "I was born in 02.28.1980 in New York",
            "She asked \"How's it going?\", and I said \"Great!\"",
            "As we can see in Figure 1.1. the model will fail."]

for text in examples:
    print(preprocess_text(text, return_as_list=True))

["How's your day going???!", "It's going...", "Let's just say it's not going to bad."]
['Hello.', 'Goodbye.']
['Seriously??!', "That can't be true."]
['Mr. Elster looked worried.']
['London is the capital of U.K.']
['I was born in 02.28.1980 in New York.']
['She asked "How\'s it going?", and I said "Great!".']
['As we can see in Figure 1.1. the model will fail.']


We can see that everything still works correctly.

---

Let's try the case in which our function previously failed:

In [99]:
preprocess_text("NLP (i.e. Natural Language Processing) is a subfield of Linguistics, " +
                "Computer Science, and Artificial Intelligence.", return_as_list=True)

['NLP (i.e. Natural Language Processing) is a subfield of Linguistics,Computer Science, and Artificial Intelligence.']

It works!

---
