# Advanced model
Rather than the basic linear and ensemble models from sklearn, or a basic DNN, we'll do something more complicated that others use (though it is not necessarily the current SOTA in either model or technique).

- Cleaning of data (with the additional step of mapping everything from British English to American English).

## Cleaning Data
British --> American: get a mapping from the internet, collect all unique words in the whole corpus, find the ones that are British, look up their American equivalents, and apply the replacements throughout the whole corpus using regex.
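The steps just described could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: the `brit_to_us` mapping and the two-document `corpus` are made up for the example, and the pattern assumes at least one British word is present (an empty mapping would need a guard).

```python
import re

# Hypothetical mapping and corpus, for illustration only.
brit_to_us = {"colour": "color", "favour": "favor", "organise": "organize"}
corpus = ["The colour of the car.", "Do me a favour and organise this."]

# 1. Collect the unique lowercase words in the corpus.
vocab = {w for doc in corpus for w in re.findall(r"[a-z]+", doc.lower())}

# 2. Keep only the British spellings that actually occur in the corpus.
present = {b: a for b, a in brit_to_us.items() if b in vocab}

# 3. Replace whole words only (\b), in a single pass over each document.
#    Assumes `present` is non-empty; capitalised forms are left untouched.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, present)) + r")\b")
normalized = [pattern.sub(lambda m: present[m.group(1)], doc) for doc in corpus]
```

The word boundaries matter: without them, a rule for a short word could rewrite the middle of a longer, unrelated word.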

In [1]:
from pathlib import Path
import xml.etree.ElementTree as ET

In [2]:
path = Path("/home/fastai2/Music/BNC_Corpus/download/Texts/news")
this = path/"A1E.xml"

In [3]:
tree = ET.parse(this)
root = tree.getroot()
root.tag, root.attrib

('bncDoc', {'{http://www.w3.org/XML/1998/namespace}id': 'A1E'})

In [4]:
wtext = list(root)[1]
div1 = list(wtext)[0]
head1 = list(div1)[0]
s1 = list(head1)[0]
w1 = list(s1)[0]

In [5]:
w1.text

'Latest '
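The step-by-step indexing above could probably be generalized with `Element.iter`, which walks the whole tree. The sketch below runs on a made-up, heavily simplified document; the tag names `s`, `w` and `c` (sentence, word, punctuation) are assumptions based on the variable names above, not something verified against the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified document in the spirit of the layout above.
sample = (
    "<bncDoc><teiHeader/><wtext><div>"
    "<head><s n='1'><w>Latest </w><w>news</w></s></head>"
    "<p><s n='2'><w>Hello </w><w>world</w><c>.</c></s></p>"
    "</div></wtext></bncDoc>"
)
root = ET.fromstring(sample)

# One string per <s> element, joining the text of its direct <w>/<c> children.
# Tokens carry their own trailing whitespace, as in 'Latest ' above.
sentences = [
    "".join(tok.text or "" for tok in s if tok.tag in ("w", "c"))
    for s in root.iter("s")
]
```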

Let's skip that for now and we'll see how it goes later. 

In [6]:
from sklearn.datasets import fetch_20newsgroups
all_xs, all_y = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'),
                    shuffle=True, return_X_y=True)

In [7]:
import sys
sys.path

['/home/fastai2/notebooks/DataGlacier/NLP_GroupProject_DG/Week_12',
 '/home/fastai2/.vscode-server/extensions/ms-toolsai.jupyter-2021.10.1101450599/pythonFiles',
 '/home/fastai2/.vscode-server/extensions/ms-toolsai.jupyter-2021.10.1101450599/pythonFiles/lib/python',
 '/anaconda/envs/fastai/lib/python38.zip',
 '/anaconda/envs/fastai/lib/python3.8',
 '/anaconda/envs/fastai/lib/python3.8/lib-dynload',
 '',
 '/anaconda/envs/fastai/lib/python3.8/site-packages',
 '/anaconda/envs/fastai/lib/python3.8/site-packages/locket-0.2.1-py3.8.egg',
 '/anaconda/envs/fastai/lib/python3.8/site-packages/IPython/extensions',
 '/home/fastai2/.ipython']

In [8]:
from fastai.text.all import *
from tqdm.notebook import tqdm
import sys
parent_path = Path("/home/fastai2/notebooks/DataGlacier")
sys.path.append(str(parent_path/"NLP_GroupProject_DG/python_files"))

from nlputils import *

In [9]:
k = 7

choice = np.load(f"choice_{k}.npy")
all_xs = np.array(all_xs)[threshold_subset(all_xs, k)]
print("Finish thresholding. ")
all_xs = all_xs[choice < 0.1]
print("Finish removing xxunk above threshold.")
all_xs = np.array([clean_data(x, False) for x in tqdm(all_xs)])
print("Finish cleaning data.")
len(all_xs)

Finish thresholding. 
Finish removing xxunk above threshold.


  0%|          | 0/11708 [00:00<?, ?it/s]

Finish cleaning data.


11708

One option is to do this manually. The catch is that we can't be sure we would catch everything; there's no guarantee.

And we don't have to do it for **everything**, just for the words that are confusing. We can still mix British English and American English **as long as there is no confusion** (confusion meaning that both spellings of a word are in use). If a word only ever appears in its British spelling, just leave it as it is.
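A minimal sketch of that idea: keep a British --> American pair only when both spellings occur in the vocabulary. The helper name `confusable_pairs` and the toy data are hypothetical.

```python
def confusable_pairs(vocab, brit_to_us):
    """Return the British -> American pairs whose two spellings both occur in vocab."""
    return {b: a for b, a in brit_to_us.items() if b in vocab and a in vocab}


# Toy data: only colour/color appear side by side.
vocab = {"colour", "color", "flavour", "honor"}
candidates = {"colour": "color", "flavour": "flavor", "honour": "honor"}
pairs = confusable_pairs(vocab, candidates)
```

Here `flavour` is left alone because `flavor` never appears, and `honour` is skipped because it is not in the vocabulary at all.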

**Weakness**: when applying this to a real product, you need to check again for British English, because the corpus **cannot contain every** British spelling.

There are also some non-English words. One is thinking of just leaving them in, because quite a few documents contain them and we don't know how to clean them out without a lot of work (a brute-force comparison against an English dictionary, followed by case-by-case decisions for technical terms such as SCSI, a computer term that is not in the English dictionary). So we'll leave them in as noise.

In [10]:
try: counter = load_pickle(f"counter_{k}.pkl")
except Exception: 
    # No cached counts yet: count every whitespace-separated token, then cache the result.
    counter = Counter()
    for data in tqdm(all_xs): counter.update(data.split())
    save_pickle(f"counter_{k}.pkl", counter)

- Lowercase everything
- Remove all punctuation

In [14]:
def strip_punc(s): 
    return s.translate(str.maketrans('', '', string.punctuation))

# lower all cases
our_vocab = {v.lower() for v in counter}
del counter

# remove all punctuations
our_vocab = {strip_punc(v) for v in our_vocab}

We ignore part-of-speech subtleties, where a spelling may be used only as a verb in one variety but as both a verb and a noun in the other (the case one had in mind was `endeavor` versus `endeavour`). That is too complicated to handle here.

The common differences fall into two groups: [spelling patterns](https://www.oxfordinternationalenglish.com/differences-in-british-and-american-spelling/) and [everyday words](https://www.thoughtco.com/american-english-to-british-english-4010264) that differ between the two.

In [None]:
# Discovery method
from IPython.display import clear_output


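def brit_to_amer(ending, brit_rule, americ_rule):
    """Rewrite every vocab word matching `ending` with brit_rule --> americ_rule.
    Return {original: rewritten} only for pairs where both words are in `our_vocab`."""
    r = re.compile(ending)
    compare_list = list(filter(r.match, our_vocab))
    ame_list = [re.sub(brit_rule, americ_rule, s) for s in compare_list]

    non_empty = {}
    for b, a in tqdm(zip(compare_list, ame_list)):
        # Keep the pair only if the rewritten word is itself in the vocabulary.
        if a in our_vocab: non_empty[b] = a

    return non_empty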


british_eng = {
    "rumour": "rumor",
    "vapour": "vapor",
    "arbour": "arbor",
    "colour": "color",
    "behaviour": "behavior",
    "saviour": "savior",
    "favour": "favor",
    "armour": "armor",
    "honour": "honor",
    "inferiour": "inferior",
    "labour": "labor",
    "humour": "humor",
    "endeavour": "endeavor",
    "harbour": "harbor",
    "fervour": "fervor",
    "parlour": "parlor",
    "neighbour": "neighbor",
    "flavour": "flavor",
    "belabour": "belabor",
    'survivour': 'survivor',  # end of our --> or.
    'aerial': "antenna",
    'anywhere': 'anyplace', 
    # 'autumn': 'fall',  # ambiguous: 'fall' the season or 'fall' the verb?
    "solicitor": "attorney",
    'biscuit': 'cookie',
    'bonnet': 'hood',
    'caretaker': 'janitor',
    'constable': 'patrolman',
    'dynamo': 'generator',
    # There are many more; one decided to stop here for now.
}

# The our --> or rule is too noisy to automate, so those pairs are listed by hand above.
british_eng.update(brit_to_amer(r'[a-z]+ise$', r'ise', r'ize')) # ise --> ize. 
british_eng.update(brit_to_amer(r'[a-z]+yse$', 'yse', 'yze'))  # yse --> yze
british_eng.update(brit_to_amer(r'[a-z]+ae[a-z]$', 'ae', 'e'))  # ae --> e
del british_eng["michael"], british_eng["laer"], british_eng["caen"]
del british_eng["raes"]
# oe --> e: checked, nothing useful (most matches turn into rubbish once translated).
# 'absence' and 'essence' are correct in both varieties, so they are not remapped.
british_eng.update({'defence': 'defense',
 'sence': 'sense',
 'selfdefence': 'selfdefense',
 'nonsence': 'nonsense',
 'pretence': 'pretense',
 'licence': 'license',
 'offence': 'offense'})  # ence --> ense after deleting rubbish.
british_eng.update({'catalogue': 'catalog'})  # ogue --> og. 
# spelling


clear_output()
british_eng = defaultdict(str, british_eng)
# british_eng

Spelling mistakes are fixed manually as we spot them while checking the output, and hence depend on my expertise in English.

In [62]:
spelling_mistakes = {
    "bahaviour": "behavior",
    "excercise": "exercise",
    "supprise": "surprise",
    "suprise": "surprise",
    "appologise": "apologize",
    "appologize": "apologize",
    'excersise': "exercise",
    'oterwise': "otherwise",
    "frnachise": "franchise",
    "fulfullment": "fulfillment",
    'usuallu': "usually",
    'specfically': "specifically",
    'espically': "especially",
    "talll": "tall",
    'usally': "usually",
    'ususally': "usually",
    'adventually': 'eventually',
    'oscialltor': 'oscillator',
    'xcellerator': 'accelerator',
    'reccollecting': 'recollecting',
    'osciallator': 'oscillator',
    'unballance': 'unbalance',
    'congroller': 'controller',
    'weeeeelllllll': 'well',
    'killig': 'killing',
    'oscilliscope': "oscilloscope",
    "ussually": "usually",
    'knoew': 'knew',
    "hense": "hence",
    
}

spelling_mistakes.update(brit_to_amer(r'[a-z]+lll[a-z]+$', 'lll', 'll'))

0it [00:00, ?it/s]

In [23]:
all_xs[251].replace("rumour", "rumor")

'\nThe rumor was basically everywhere in Toronto based on reports\nthat Keenan has told both San Jose and Philadelphia that he\nwas no longer interested in pursuing further negotiations with\neither team. \n\nThe Ranger announcement is supposed to happen tomorrow supposedly.\n\nThe Rangers have so many veterans that they had to get a coach\nwith "weight" and a proven record...and whom they know Messier respects.'

Replace spelling mistakes before converting British to American.
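One possible way to respect that order is to fold the two dictionaries into a single lookup, so a misspelling whose correction is a British word ends up at the American form in one pass. The helpers `compose` and `normalize` and the toy dictionaries are hypothetical; the tokenization is a plain `split()`, which collapses whitespace and ignores punctuation attached to words.

```python
def compose(spelling_fixes, brit_to_us):
    """Merge both mappings so spelling fixes are resolved before British -> American."""
    merged = dict(brit_to_us)
    for wrong, right in spelling_fixes.items():
        # If the corrected word is itself British, map straight to the American form.
        merged[wrong] = brit_to_us.get(right, right)
    return merged


def normalize(text, mapping):
    """Replace each whitespace-separated token found in mapping; leave the rest as is."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


# Toy dictionaries, for illustration only.
spelling = {"colur": "colour", "suprise": "surprise"}
british = {"colour": "color"}
merged = compose(spelling, british)
cleaned = normalize("a colur suprise", merged)
```

Applying the British --> American step first would leave `colur` corrected to `colour` with nothing left to convert it, which is why the order matters.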