# Text cleaning

To use `create_pretraining_data.py` from the BERT repository, the input must be a plain text file with one sentence per line. These must be actual sentences, because the "next sentence prediction" task depends on them. The BERT authors advise performing sentence segmentation with an off-the-shelf NLP toolkit such as spaCy.
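As an illustration of the expected input format, the sketch below turns already-segmented sentences into text with one sentence per line. It also separates documents with a blank line, which is how the BERT README says documents are delimited. The helper name `format_for_bert` and the sample sentences are assumptions made for this example and are not part of the notebook.

```python
from typing import Iterable, List


def format_for_bert(documents: Iterable[List[str]]) -> str:
    """Return text with one sentence per line and a blank line between documents."""
    blocks = []
    for sentences in documents:
        # Skip empty strings and collapse internal whitespace (including
        # newlines) so that every sentence stays on a single line.
        lines = [" ".join(s.split()) for s in sentences if s.strip()]
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample data: two documents, already split into sentences.
docs = [
    ["This is the first sentence.", "This is the second one."],
    ["A new document starts here."],
]
print(format_for_bert(docs), end="")
```

The returned string can be written to a `.txt` file and passed to the pretraining script as its input file.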

In [34]:
# Open and read the file (the context manager closes it automatically)
with open("sample.txt", "r", encoding="utf-8") as f:
    contents = f.read()
contents



In [37]:
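import re
# Optional: drop all non-ASCII characters
#contents = contents.encode('ascii', 'ignore').decode("utf-8")
# Replace slashes with spaces so that terms such as "free/open" become separate words
contents = re.sub(r'/', ' ', contents)
contents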



## 1. Tokenization
Perform sentence segmentation on plain text.

In [36]:
# Sentence segmentation with spaCy
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(contents)
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text + '\n')

Americas Headquarters:

Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA

95134-1706 USA  Open Source Used In Cisco IOS XR Release 4.3.0  

This document contains the licenses and notices for open source software used in this product.

With respect to the free/open source software listed in this document, if you have any questions or wish to receive a copy of the source code to which you are entitled under the applicable free/open source license(s) (such as the GNU Lesser/General Public License), please contact us at external-opensource-requests@cisco.com.  

In your requests please include the following reference number 78EE117C99-25360040  Contents 1.1 commons-logging 1.0.3  1.1.1 Notifications :  

1.1.2 Available under license :  1.2

Expect.pm 1.20  

1.2.1

Available under license :  

1.3 GNU Bison 1.28  

1.3.1 Available under license :  

1.4 GNU Nano Editor 2.0.1  

1.4.1 Available under license :  1.5 JED

0.99.16  

1.5.1 Available under license :  1.6 libxml2 2.6.3

In [7]:
# Sentence segmentation with NLTK
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
sentences = sent_tokenize(contents)
for sentence in sentences:
    print(sentence+'\n')

[nltk_data] Downloading package punkt to /home/antoloui/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Americas Headquarters: Cisco Systems, Inc., 170 West Tasman Drive, San Jose, CA 95134-1706 USA  Open Source Used In Cisco IOS XR Release 4.3.0  This document contains the licenses and notices for open source software used in this product.With respect to the free/open source software listed in this document, if you have any questions or wish to receive a copy of the source code to which you are entitled under the applicable free/open source license(s) (such as the GNU Lesser/General Public License), please contact us at external-opensource-requests@cisco.com.

In your requests please include the following reference number 78EE117C99-25360040  Contents 1.1 commons-logging 1.0.3  1.1.1 Notifications :  1.1.2 Available under license :  1.2 Expect.pm 1.20  1.2.1 Available under license :  1.3 GNU Bison 1.28  1.3.1 Available under license :  1.4 GNU Nano Editor 2.0.1  1.4.1 Available under license :  1.5 JED 0.99.16  1.5.1 Available under license :  1.6 libxml2 2.6.31  1.6.1 Available under 
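The two outputs above differ noticeably: spaCy splits the table of contents into very short fragments such as "1.2.1", while NLTK keeps long runs together. One crude way to reduce such fragments, shown here only as a sketch, is to append any piece with fewer than a few whitespace-separated tokens to the preceding sentence. The function `merge_short_fragments` and the `min_tokens` threshold are hypothetical and are not part of spaCy, NLTK, or the original notebook.

```python
from typing import List


def merge_short_fragments(sentences: List[str], min_tokens: int = 3) -> List[str]:
    """Append fragments with fewer than min_tokens tokens to the previous sentence."""
    merged: List[str] = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue
        # A fragment is only merged if there is a previous sentence to attach it to.
        if merged and len(text.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


# Fragments copied from the spaCy output above.
fragments = [
    "1.1.2 Available under license :  1.2",
    "Expect.pm 1.20",
    "1.2.1",
    "Available under license :",
]
for sentence in merge_short_fragments(fragments):
    print(sentence)
```

This heuristic always merges backwards, so a fragment that really belongs to the following line (such as a section number) still ends up attached to the previous one. It reduces noise but does not recover true sentence boundaries.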

## 2. Cleaning
Remove all undesirable content.

In [12]:
from multiprocessing import Pool
import argparse
import glob
import os
import io
import time
import logging
import re



del_pattern = re.compile(r'''(?x)
                        \([^\(\)]*\)
                        |<[^<>]*>.*
                        |\*\*\*
                        |\#\#\#
                        |^\w*\s*\w*[:\-–]+\s*
                        |(https?://)?www\.\S*\.com
                        |\d*(\-\d+)+
                        |.*©.*
                        |(?<=\s)'
                        |'(?=\s)
                        |(?<=\s)\-
                        |\-(?=\s)
                        ''')

post_tok_del = re.compile(r'''(?x)
                        ;
                        |^\d+$
                        |^'.*
                        |–+
                        |\++
                        |•+
                        |_+
                        |@+
                        |&+
                        |`+
                        |\*+
                        |\{\{\{1
                        |\[+
                        |\]+
                        |\{+
                        |\}+
                        |\(+
                        |\)+
                        |\$+
                        |>+
                        |<+
                        |=+
                        |\#+
                        |~+
                        |!\[\]
                        ''')

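def process(input_file):
    """Clean input_file line by line; write the result to a file with '.txt' replaced by '.txt.re'."""
    output_file = input_file.replace('.txt', '.txt.re')
    with io.open(input_file, 'r', encoding="utf-8") as fin, \
         io.open(output_file, 'w', encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            # Remove parenthesised text, markup, URLs, copyright lines and stray quotes or dashes
            line = del_pattern.sub("", line)
            #line = re.sub(r"‘|’", "'", line)
            # Remove leftover symbols and lines that contain only digits
            line = post_tok_del.sub("", line)
            # Replace remaining separators (slashes, ellipses, arrows, dash runs) with a space
            line = re.sub(r"/|…|->|=|-(-)+", " ", line)
            fout.write(line + "\n")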
                
                
tic = time.time()
process('sample.txt')
toc = time.time()
print("Processed in %.2f sec"%(toc-tic))

Processed in 1.39 sec
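Deleting matches can leave repeated spaces and completely empty lines in the output. Since the BERT README states that empty lines delimit documents, stray blank lines could split one document into several. The sketch below collapses whitespace and drops blank lines. It assumes the cleaned file represents a single document, and the helper name `tidy_lines` is hypothetical.

```python
import re
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def tidy_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield non-empty lines with runs of whitespace collapsed to one space."""
    for line in lines:
        line = _WHITESPACE.sub(" ", line).strip()
        # Assumption: a blank line here is a cleaning artifact, not a
        # deliberate document boundary, so it is dropped.
        if line:
            yield line


# Hypothetical lines as they might look after cleaning.
cleaned = ["Cisco  Systems   Inc.\n", "\n", "   \n", "Open Source Used In Cisco IOS XR\n"]
print(list(tidy_lines(cleaned)))
```

If the input does contain several documents, the boundaries would need to be tracked separately before applying a step like this.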
