# Preprocessing text
This notebook preprocesses text so that it can be used in further downstream processing operations.

It does the following:
 - Normalizes text
 - Lowercases text
 - Removes stopwords
 - Lemmatizes text
 - Removes special characters
 - Increases tokenization accuracy by using named entity recognition to create custom tokenization rules
 - Saves the processed text to a CSV file, one document per line, with tokens separated by commas

## Attribution
Many of the ideas and much of the code in this notebook were taken from "Blueprints for Text Analytics Using Python" - Jens Albrecht, Sidharth Ramachandran, Christian Winkler,
Chapter 4: Preparing Textual Data For Statistics And Machine Learning.

The example data used in this notebook is the ["UN General Debate Corpus (UNGDC)"](https://github.com/sjankin/UnitedNations), an English-language corpus of 8,093 General Debate statements from 1970 (Session 25) to 2018 (Session 73).

In [1]:
import pandas as pd
import re
import spacy
from spacy.symbols import ORTH  #ORTH = exact verbatim text of a token
import textacy

from tqdm import tqdm
from pprint import pprint

entities_file = 'entities.txt'
preprocessed_savepath = 'out.csv'
lemmatize = True
keep_stop = False




In [2]:
#pd.set_option('display.max_rows', 500)
#pd.set_option('display.max_columns', 500)
#pd.set_option('display.width', 1000)
pd.set_option('max_colwidth', 800)

Load data in CSV format into a dataframe. Here we load the UNGDC Corpus.

In [3]:
#https://github.com/blueprints-for-text-analytics-python/blueprints-text/blob/master/data/un-general-debates/un-general-debates-blueprint.csv.gz
df = pd.read_csv("un.gz")

Normalize the text in several ways. For details, see the textacy preprocessing reference: [https://textacy.readthedocs.io/en/0.12.0/api_reference/preprocessing.html](https://textacy.readthedocs.io/en/0.12.0/api_reference/preprocessing.html)

In [4]:
#Taken from "Blueprints for Text Analytics Using Python" - Jens Albrecht, Sidharth Ramachandran, Christian Winkler
#Chapter 4: Preparing Textual Data For Statistics And Machine Learning
import textacy.preprocessing as tprep

def normalize(text):
    text = tprep.normalize.bullet_points(text)
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.whitespace(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    #Don't lowercase the text here if you go on to extract named entities. spaCy's named entity recognizer relies heavily on
    #capitalization to decide whether a combination of tokens is an entity
    return text
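
To give a feel for what one of these normalization steps does, here is a minimal standard-library sketch of accent removal. It is an approximation for illustration only, not textacy's actual implementation, and the helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text):
    """Approximate accent removal: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

sample = "São Tomé and Príncipe"
stripped = strip_accents(sample)
print(stripped)
```

Stripping accents this way makes spelling variants of the same name compare equal, at the cost of losing the original diacritics.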

Create a function "impurity" that returns the share of suspicious characters in a text, as a fraction between 0 and 1. Then show the three texts with the highest share of these characters.

In [5]:
#Taken from "Blueprints for Text Analytics Using Python" - Jens Albrecht, Sidharth Ramachandran, Christian Winkler
#Chapter 4: Preparing Textual Data For Statistics And Machine Learning
RE_SUSPICIOUS = re.compile(r'[\^\~&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    return len(RE_SUSPICIOUS.findall(text))/len(text)
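
As a quick sanity check, the function can be tried on a short hand-made string. The sketch below repeats the definition so that it runs on its own; the sample strings are made up for illustration.

```python
import re

RE_SUSPICIOUS = re.compile(r'[\^\~&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    return len(RE_SUSPICIOUS.findall(text)) / len(text)

# 12 characters, 3 of them suspicious: '#', '<' and '>'
score = impurity("abc#def<ghi>")
print(score)

# Texts shorter than min_len are not scored at all
short_score = impurity("a#b")
print(short_score)
```

The `min_len` guard keeps very short texts from producing misleadingly high shares.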

In [6]:
df['text'] = df['text'].apply(normalize)
df['impurity'] = df['text'].apply(impurity, min_len=10)
df[['text', 'impurity']].sort_values(by='impurity', ascending=False).head(3)

Unnamed: 0,text,impurity
599,"Sir, I have the honor to congratulate you on behalf of the Finnish Government on your important election to the presidency. We greet you as an eminent European statesman who has contributed much to European reconciliation and to international co-operation in general.\n2. The thirtieth anniversary of the United Nations gives us an opportunity to survey the full spectrum of our activities in order to understand better the value of the Organization in present-day international politics. An examination of the main trends in the development of the United Nations gives us confidence in the future, The Organization has been able to enlarge its membership, which now approaches universality. We very warmly welcome this trend: it makes the United Nations unique as a tool for international co-ope...",0.001708
2319,"My delegation warmly congratulates you, Sir, on your election to the presidency of the forty-second session of the General Assembly. Our thanks and compliments go to your predecessor, Mr. Humayun Rasheed Choudhury, the Foreign Minister of Bangladesh, for his outstanding leadership of the forty-first session.\nThe international community is aware of the recent political and constitutional developments which have led to the change of Government in Fiji. The situation is an Internal matter which the people of Fiji must be allowed to resolve in their own way. Interference of any kind from outside will not help in resolving our domestic difficulties, and we urge all Member States to show understanding of our situation.\nLooking back at the international scene over this past year, my Governm...",0.001378
115,"116. I should like, Sir, on behalf of the Irish delegation, to add my warm congratulations on your election as President of the twenty-sixth session of the General Assembly. The Assembly has chosen a President who knows how to bear wisdom lightly and, in electing you to this high honor, has paid tribute to you as a distinguished leader and statesman of your great country, a country of rich cultural 'diversity in unity' and one which has such an important role in the affairs of your region and among all the nations.\n117. I am happy to join in the universal tribute that has also been paid to. Mr. Hambro, the distinguished representative of Norway, whose patience, skill and dynamism were so brilliantly displayed as President of the twenty-fifth session of the Assembly.\n113. It is my ple...",0.001312


Now create a function that removes patterns based on regular expressions.

In [7]:
#Taken from "Blueprints for Text Analytics Using Python" - Jens Albrecht, Sidharth Ramachandran, Christian Winkler
#Chapter 4: Preparing Textual Data For Statistics And Machine Learning

#clean/masking step after determining impurity
def clean(text, min_len=10):
    if text is None or len(text) < min_len:
        return text
    
    patterns = [
                ("\n", " "), #change new lines to white space 
                (r"(\[|\()ibid\..*?(\]|\))", ""), #remove ibid references 
                (r"\[.*?\d+.*?\]", ""), #remove resolution references
               ]
    for regex, sub in patterns:
        text = re.sub(regex, sub, text)
    return text
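
To see what the three patterns do, the sketch below applies the same function to a made-up sentence containing a line break, a resolution reference and an ibid reference. The sample text is hypothetical and not taken from the corpus.

```python
import re

def clean(text, min_len=10):
    if text is None or len(text) < min_len:
        return text
    patterns = [
        ("\n", " "),                          # change new lines to white space
        (r"(\[|\()ibid\..*?(\]|\))", ""),     # remove ibid references
        (r"\[.*?\d+.*?\]", ""),               # remove resolution references
    ]
    for regex, sub in patterns:
        text = re.sub(regex, sub, text)
    return text

sample = "See resolution [A/RES/60/1] and the report (ibid., para. 4).\nNext point."
cleaned = clean(sample)
print(repr(cleaned))
```

Note that removing a reference leaves the surrounding spaces behind, so a final whitespace normalization pass may still be worthwhile.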


Rerun the next cell, updating the clean function above each time, until the level of impurity is acceptable.

In [8]:
df['text'] = df['text'].apply(clean, min_len=10)
df['impurity'] = df['text'].apply(impurity, min_len=10)
df[['text', 'impurity']].sort_values(by='impurity', ascending=False).head(3)

Unnamed: 0,text,impurity
2319,"My delegation warmly congratulates you, Sir, on your election to the presidency of the forty-second session of the General Assembly. Our thanks and compliments go to your predecessor, Mr. Humayun Rasheed Choudhury, the Foreign Minister of Bangladesh, for his outstanding leadership of the forty-first session. The international community is aware of the recent political and constitutional developments which have led to the change of Government in Fiji. The situation is an Internal matter which the people of Fiji must be allowed to resolve in their own way. Interference of any kind from outside will not help in resolving our domestic difficulties, and we urge all Member States to show understanding of our situation. Looking back at the international scene over this past year, my Governmen...",0.001378
3079,"Mr. President, the contribution that your country, Bulgaria, is making to the building of a new world order accords with the genius of its people. He are gratified by your election to the presidency of the General Assembly at its forty-seventh session, and we wish you every success in the discharge of your mandate. He take pleasure in hailing and paying tribute to His Excellency Ambassador Shihabi of Saudi Arabia, who skilfully handled his responsibilities at the forty-sixth session, which was marked by, among other major events, the election of the new Secretary-General of our Organization. He also take pleasure in addressing heartfelt congratulations to His Excellency Mr. Boutros Boutros Ghali. The energy and distinction with which he has assumed his new functions reinforce our convi...",0.000785
2724,"Mr. President, ascending the steps which lead to this podium# and speaking before the United Nations General Assembly is always a great honour and privilege. Today this honour and privilege take on added significance for our delegation. I have the pleasure of conveying to you, and to the Government and people of the Federal Republic of Nigeria# congratulations and best wishes on your unanimous election as President of the forty-fourth session of the General Assembly. Furthermore# I have the honour of conveying to you a personal congratulatory message from Father Halter Prime Minister of the Republic of Vanuatu. The Prime Minister: sends his fondest regards to you, a friend and a man he knows to be not only a proud son of Nigeria, but also a proud son of Vanuatu. We recall that the hono...",0.00062


Next, improve tokenization accuracy by using spaCy's named entity recognition (NER). Domain-specific text usually needs custom tokenization rules, and writing them by hand can mean hundreds if not thousands of rules. If we let spaCy's NER propose candidate rules, we may be able to cut down on the number we need to write ourselves.

Let's take a look at an example of how this works. 

In [9]:
nlp = spacy.load("en_core_web_sm")
sentence_text = 'The People\'s Republic of China has a large population. Boutros Boutros-Ghali is a former Secretary-General of the United Nations. The Red Army had a decisive victory over the Nazi hordes. Vietnam is in East Asia.'
print(sentence_text)
print()
sentence = nlp(sentence_text)
tokens = {token.text for token in sentence}
print("This is a set (no duplicates present) of the tokens that were parsed from the above sentence")
print()
print(tokens)
print()
print("These are the named entities")
print()
ents = {ent.text for ent in sentence.ents}
print(ents)

The People's Republic of China has a large population. Boutros Boutros-Ghali is a former Secretary-General of the United Nations. The Red Army had a decisive victory over the Nazi hordes. Vietnam is in East Asia.

This is a set (no duplicates present) of the tokens that were parsed from the above sentence

{'over', 'hordes', "'s", 'East', 'in', 'Republic', 'Vietnam', '-', 'large', 'has', 'Red', 'Army', 'former', 'Nations', 'Secretary', 'victory', 'China', '.', 'Ghali', 'General', 'decisive', 'population', 'had', 'Asia', 'of', 'Boutros', 'is', 'Nazi', 'People', 'the', 'a', 'United', 'The'}

These are the named entities

{'the United Nations', 'East Asia', 'Vietnam', 'Boutros Boutros-Ghali', 'Nazi', 'The Red Army', "The People's Republic of China"}


As you can see, 'Boutros Boutros-Ghali' was tokenized as 'Boutros', 'Boutros', '-', 'Ghali'; 'East Asia' as 'East' and 'Asia'; 'the United Nations' as 'the', 'United', 'Nations'; and so on. The tokenizer failed to preserve these phrases, but the NER recognized them. When text is tokenized in this way, we lose the information that these words were ever joined together. It's important to keep this information available for later processing steps.

spaCy defines a number of entity types. Run the code below to see the ones we are interested in. Excluded from this list are DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL, and CARDINAL.

In [10]:
einterest = ['EVENT','FAC','GPE','LANGUAGE','LAW','LOC','NORP','ORG','PERSON','PRODUCT','WORK_OF_ART']
for ner_type in einterest:
    print(f"{ner_type}:{spacy.explain(ner_type)}") 

EVENT:Named hurricanes, battles, wars, sports events, etc.
FAC:Buildings, airports, highways, bridges, etc.
GPE:Countries, cities, states
LANGUAGE:Any named language
LAW:Named documents made into laws.
LOC:Non-GPE locations, mountain ranges, bodies of water
NORP:Nationalities or religious or political groups
ORG:Companies, agencies, institutions, etc.
PERSON:People, including fictional
PRODUCT:Objects, vehicles, foods, etc. (not services)
WORK_OF_ART:Titles of books, songs, etc.


This next code section extracts sets of tokens and entities from the corpus and compares them. The idea is to create custom tokenizer rules for any entities not found in the tokens. Since NER is a statistical process and is prone to error, we also need to verify the entities are correct.

The process looks like this:
 - Compare entities to tokens
 - Write out a list of entities (one per line)
 - Manually inspect entities, fixing as necessary
 - Transform the list of entities into tokenization rules 
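
The comparison step can be illustrated with the token and entity sets printed in the earlier example. This is a minimal sketch using plain Python sets, with both sides lowercased in the same way as the corpus-wide code.

```python
# Tokens and entities as printed by the earlier example cell
tokens = {'over', 'hordes', "'s", 'East', 'in', 'Republic', 'Vietnam', '-', 'large',
          'has', 'Red', 'Army', 'former', 'Nations', 'Secretary', 'victory', 'China',
          '.', 'Ghali', 'General', 'decisive', 'population', 'had', 'Asia', 'of',
          'Boutros', 'is', 'Nazi', 'People', 'the', 'a', 'United', 'The'}
entities = {'the United Nations', 'East Asia', 'Vietnam', 'Boutros Boutros-Ghali',
            'Nazi', 'The Red Army', "The People's Republic of China"}

tokens_lower = {t.lower() for t in tokens}
entities_lower = {e.lower() for e in entities}

# Entities that already exist as single tokens ('vietnam', 'nazi') drop out;
# what remains are the candidates for custom tokenization rules.
difference = sorted(entities_lower - tokens_lower)
for entity in difference:
    print(entity)
```

Only the multi-word entities survive the set difference, which is exactly the list that needs manual inspection.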

In [None]:
def extract_entities(text, output=entities_file):
    nlp = spacy.load("en_core_web_sm")
    for pipe_component in ['tok2vec','lemmatizer','tagger','parser','senter','attribute_ruler']:
        nlp.disable_pipe(pipe_component)

    print("Determining entities")
    with tqdm(total=len(text)) as pbar:
        entities = set()
        for doc in nlp.pipe(text):
            for ent in doc.ents:
                if ent.label_ in einterest:
                    entities.add(ent.text.lower())
            pbar.update(1)

    print("Determining tokens")
    with tqdm(total=len(text)) as pbar:
        tokens = set()
        for doc in nlp.pipe(text):
            for token in doc:
                tokens.add(token.text.lower())
            pbar.update(1)

    print("Determining difference")
    difference = entities - tokens

    print("Saving difference")
    with open(output,'w') as ent_file:
        ent_file.write('\n'.join(sorted(difference)))
    print("Done")

extract_entities(df['text'])

Determining entities


100%|██████████| 7507/7507 [10:00<00:00, 12.50it/s]


Determining tokens


 47%|████▋     | 3565/7507 [05:46<02:37, 25.00it/s]

Once any necessary edits to the entities.txt file have been made, run this code block to load the entities as special tokenization rules. This will ensure that these entities are treated as single tokens.

The code then tokenizes the texts in the corpus and saves the tokens in a file "out.csv". Each line in the output represents a single processed text and tokens are lemmatized after excluding stopwords.
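
A file in this format can be read back for downstream use with Python's standard csv module. The sketch below assumes the output is ordinary comma-delimited CSV; it writes a small made-up sample file to a temporary directory so that it runs on its own, and the file name `out_sample.csv` is hypothetical.

```python
import csv
import os
import tempfile

# Made-up sample in the same shape as the output: one document per row
sample_rows = [
    ["delegation", "congratulate", "sir"],
    ["the united nations", "east asia"],
]

path = os.path.join(tempfile.mkdtemp(), "out_sample.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(sample_rows)

# Read the file back: each row becomes one list of tokens
with open(path, newline="", encoding="utf-8") as f:
    documents = [row for row in csv.reader(f)]

print(documents)
```

Using a CSV reader rather than splitting each line on commas matters because a token may itself contain a comma or a quote character.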

In [None]:
nlp = spacy.load("en_core_web_sm")
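with open(entities_file,'r') as ent_file:
    for entity in ent_file:
        entity = entity.strip()
        if entity:  #skip blank lines
            nlp.tokenizer.add_special_case(entity, [{ORTH:entity}])

#Keep 'tok2vec', 'tagger' and 'attribute_ruler' enabled: the rule-based lemmatizer relies on the
#part-of-speech tags they produce. Only the parser, sentence segmenter and NER are not needed here.
for pipe_component in ['parser','senter','ner']:
    nlp.disable_pipe(pipe_component)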

#lowercase the text so that our lowercased tokenization rules can match
lowercased_text = df['text'].apply(lambda t:t.lower())    

with tqdm(total=len(lowercased_text)) as pbar:
    sentences = []                                   
    for doc in nlp.pipe(lowercased_text):
        if lemmatize:
            sentences.append([token.lemma_ for token in doc if keep_stop or not token.is_stop])
        else:
            sentences.append([token.text for token in doc if keep_stop or not token.is_stop])
        pbar.update(1)

textacy.io.csv.write_csv(sentences, preprocessed_savepath)    