## Tidy up the 'Text' column

Since the main thrust of this research is to use natural language processing on the 'Text' column, it makes sense to do a bit of cleaning at this early stage. I had a look at what was in the 'Text' column and noticed a few problems: encoding/decoding errors, extra whitespace, run-on sentences, extra punctuation, spelling mistakes, etc. The difficulty is that this is not a normal, conversational English set of texts. There are, quite rightly, a lot of characters from other languages, words that are unlikely to be in typical language dictionaries, etc.

I set up a few processes to loop over the text to clean these up. We need a few more import/download functions here, so let's start with that. 


In [123]:
# importing the nltk suite
import nltk
from nltk import word_tokenize                     # a useful function from nltk that splits text into individual words

# importing jaccard distance and edit distance from nltk.metrics,
# and ngrams from nltk.util
from nltk.metrics.distance import jaccard_distance, edit_distance
from nltk.util import ngrams

# downloading and importing the 'words' corpus from nltk
nltk.download('words')
from nltk.corpus import words
correct_words = words.words()

# unidecode transliterates Unicode text to plain ASCII
!pip install unidecode
from unidecode import unidecode

# SpellChecker is provided by the pyspellchecker package
from spellchecker import SpellChecker
spell = SpellChecker()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mzyssjkc\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!

Collecting unidecode
  Downloading Unidecode-1.3.7-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.7
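
To make the purpose of the `jaccard_distance` and `ngrams` imports more concrete, here is a rough sketch of how Jaccard distance over character bigrams can rank candidate spellings. It is written in plain Python so it runs without any downloads. The helper names (`char_ngrams`, `jaccard`, `closest_word`) and the tiny word list are made up for illustration and are not part of the cleaning pipeline.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

def closest_word(token, vocabulary, n=2):
    """Pick the vocabulary word whose n-gram set is nearest to the token's."""
    target = char_ngrams(token, n)
    return min(vocabulary, key=lambda w: jaccard(target, char_ngrams(w, n)))

# hypothetical mini word list, standing in for a real dictionary
toy_vocabulary = ['sequencing', 'sequence', 'mutation', 'rotation']
print(closest_word('sequencng', toy_vocabulary))
```

On a real word list the gap between the best and second-best candidate can be small, so it would be sensible to combine this with edit distance or word frequency before accepting a correction.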


In [15]:
# load the word list (one word per line); the with-block closes the file automatically
with open('words_alpha.txt', 'r') as file:
    words_alpha = file.readlines()

# strip the trailing newline from each entry
dictionary = [entry.rstrip('\n') for entry in words_alpha]

print(dictionary[:100])

['a', 'aa', 'aaa', 'aah', 'aahed', 'aahing', 'aahs', 'aal', 'aalii', 'aaliis', 'aals', 'aam', 'aani', 'aardvark', 'aardvarks', 'aardwolf', 'aardwolves', 'aargh', 'aaron', 'aaronic', 'aaronical', 'aaronite', 'aaronitic', 'aarrgh', 'aarrghh', 'aaru', 'aas', 'aasvogel', 'aasvogels', 'ab', 'aba', 'ababdeh', 'ababua', 'abac', 'abaca', 'abacay', 'abacas', 'abacate', 'abacaxi', 'abaci', 'abacinate', 'abacination', 'abacisci', 'abaciscus', 'abacist', 'aback', 'abacli', 'abacot', 'abacterial', 'abactinal', 'abactinally', 'abaction', 'abactor', 'abaculi', 'abaculus', 'abacus', 'abacuses', 'abada', 'abaddon', 'abadejo', 'abadengo', 'abadia', 'abadite', 'abaff', 'abaft', 'abay', 'abayah', 'abaisance', 'abaised', 'abaiser', 'abaisse', 'abaissed', 'abaka', 'abakas', 'abalation', 'abalienate', 'abalienated', 'abalienating', 'abalienation', 'abalone', 'abalones', 'abama', 'abamp', 'abampere', 'abamperes', 'abamps', 'aband', 'abandon', 'abandonable', 'abandoned', 'abandonedly', 'abandonee', 'abandoner'
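
One design note: `dictionary` is a list, so every `in` check scans it from the start. With a few hundred thousand entries and many tokens per text, a set is likely to be much quicker. The sketch below shows the idea on a made-up four-word list; `toy_words`, `dictionary_set` and `is_known` are hypothetical names, not something defined elsewhere in this notebook.

```python
# hypothetical stand-in for the full word list loaded above
toy_words = ['a', 'aa', 'aardvark', 'abacus']

# a set gives constant-time membership tests on average
dictionary_set = set(toy_words)

def is_known(token, vocabulary=dictionary_set):
    """Case-insensitive check of whether a token is a known word."""
    return token.lower() in vocabulary

print(is_known('Aardvark'), is_known('areonly'))
```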

In [16]:
no_null_texts['Text'][1]                                        # First, let's have a look at the 'Text' column to spot some
                                                                # issues. Right away, I can see "areonly", "earthÃ¢ÂÂs", 
                                                                # "sequen cing", etc. Work to be done!

1    The mechanisms whereby inherited DNA mutations cause disease areonly beginning to be understood. These are best understood in the contextof knowledge of the three dimensional structure of the rele...
1    Gene therapy is an attractive option for a number of genetic  disorders. Genetic supplementation could in theory lead to long  lasting disease phenotype correction. However, efficient targeting,  ...
1                                                                                                                                                                                     No abstract available.
1    Four decades ago homocystinuria due to cystathionine beta synthase (CBS) has been described as a typical inborn error of metabolism partially resembling the Marfan syndrome. As extremely high conc...
1    The earthÃ¢ÂÂs rotation causes 24 hour cycles in many aspects of the  physical environment, while the earthÃ¢ÂÂs revolution around the sun causes seasonal changes . Most l

In [129]:
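import re

test_string = "This is a test.String. It has   problems that areonly going to get better."

# punctuation characters that should pass through untouched
PUNCTUATION = set("-!\"#$%&()'*–+,./:;<=>?@[\\]^_`{|}~“”")

def remove_errors(text):
    no_extra_spaces = re.sub(r'(\s)\s+', r'\1', text)                     # cuts runs of 2 or more whitespace characters down to 1
    no_run_ons = re.sub(r'([a-z]\.)([A-Z])', r'\1 \2', no_extra_spaces)   # splits run-ons (e.g. "word.New sentence"); the dot is escaped so "mRNA" is left alone
    normalised = unidecode(no_run_ons)                                    # transliterates non-ASCII characters to ASCII
    tokens = word_tokenize(normalised)                                    # splits the text into word and punctuation tokens
    output = []
    for token in tokens:
        if token.lower() in dictionary:                                   # known word: keep as is
            output.append(token)
        elif all(char in PUNCTUATION for char in token):                  # punctuation: keep as is
            output.append(token)
        else:
            output.append(get_segments(token))                            # unknown token: split into likely words (a nested list); get_segments is defined in cell In [114]
    return output

remove_errors(test_string)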

['This',
 'is',
 'a',
 'test',
 '.',
 'String',
 '.',
 'It',
 'has',
 'problems',
 'that',
 ['are', 'only'],
 'going',
 'to',
 'get',
 'better',
 '.']
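
Note that the segmented token comes back as a nested list (`['are', 'only']`). If a flat token list is needed downstream, something like the following sketch would do it. `flatten_tokens` is a hypothetical helper, not part of the code above.

```python
def flatten_tokens(tokens):
    """Flatten one level of nesting: ['that', ['are', 'only']] -> ['that', 'are', 'only']."""
    flat = []
    for token in tokens:
        if isinstance(token, list):
            flat.extend(token)      # segmented token: add each word separately
        else:
            flat.append(token)      # ordinary token: keep as is
    return flat

nested = ['that', ['are', 'only'], 'going', 'to', 'get', 'better', '.']
print(flatten_tokens(nested))
```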

In [128]:
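import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)      # decompose accented characters into base character + combining mark
    only_ascii = nfkd_form.encode('ASCII', 'ignore')          # drop everything that is not plain ASCII
    return only_ascii.decode('ASCII')                         # return a str rather than bytes

remove_accents('Ã¢ÂÂclock')                                   # the argument must be a quoted string; unquoted, it raised a SyntaxError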
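
Stripping accents removes the garbled characters but does not recover what they were meant to be. The garbling seen in the 'Text' output looks like an apostrophe whose UTF-8 bytes were decoded as Latin-1 more than once. If that assumption holds, the damage can be reversed with the standard library alone. The sketch below is one way to try it; `fix_mojibake` is a hypothetical helper, and the sample string is built in code rather than copied from the data.

```python
def fix_mojibake(text, max_rounds=3):
    """Undo repeated 'UTF-8 bytes read as Latin-1' damage, if that is what happened."""
    for _ in range(max_rounds):
        try:
            repaired = text.encode('latin-1').decode('utf-8')
        except (UnicodeEncodeError, UnicodeDecodeError):
            break                   # text is not (or no longer) of this damaged form
        if repaired == text:
            break                   # plain ASCII: nothing left to repair
        text = repaired
    return text

clean = 'the earth\u2019s rotation'

# simulate the damage: two rounds of UTF-8 bytes mis-read as Latin-1
garbled = clean.encode('utf-8').decode('latin-1').encode('utf-8').decode('latin-1')

print(fix_mojibake(garbled) == clean)
```

This works on the whole string at once, so a text that mixes damaged sequences with characters outside Latin-1 would be left unchanged. If the original mis-decoding used Windows-1252 rather than Latin-1, the codec name would need to change accordingly.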

In [None]:
cleaned_texts = []
for text in no_null_texts['Text'][1]:
    cleaned_texts.append(remove_errors(text))                 # keep the cleaned tokens instead of discarding them

In [114]:
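import functools

# OneGramDist, onegram_log and segment are assumed to be in scope already
# (their import is not shown in this section)
onegrams = OneGramDist(filename='count_1w.txt')               # load the unigram counts once, not on every call
onegram_fitness = functools.partial(onegram_log, onegrams)

def get_segments(token):
    # split a run-together token (e.g. "areonly") into its most likely words
    return segment(token.lower(), word_seq_fitness=onegram_fitness)

get_segments('areonly')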

['are', 'only']
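
For intuition about what unigram-based segmentation does, here is a toy sketch: try every split point, score each candidate word by its log probability under some word counts, and keep the best-scoring sequence. This is not the code behind `segment`, which isn't shown here. The counts, the penalty for unseen words, and the names `toy_counts`, `word_log_prob` and `toy_segment` are all invented for illustration.

```python
import math
from functools import lru_cache

# hypothetical word counts; a real run would read them from a file such as count_1w.txt
toy_counts = {'are': 500, 'only': 300, 'a': 900, 'on': 400, 're': 5, 'ly': 2}
total = sum(toy_counts.values())

def word_log_prob(word):
    """Log10 probability of a word; unseen words get a penalty that grows with length."""
    if word in toy_counts:
        return math.log10(toy_counts[word] / total)
    return math.log10(1 / (total * 10 ** len(word)))

def toy_segment(text):
    """Return the highest-scoring split of text into words."""
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return 0.0, ()
        candidates = []
        for end in range(start + 1, len(text) + 1):
            head = text[start:end]
            tail_score, tail_words = best(end)
            candidates.append((word_log_prob(head) + tail_score, (head,) + tail_words))
        return max(candidates)
    return list(best(0)[1])

print(toy_segment('areonly'))
```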