<span style='color:red'> This notebook is a copy of "hunspell-run-evaluation.ipynb" in which Hunspell is replaced by the Nuspell spell checker. </span>

# Spelling correction with nuspell on litkey data set

Contents:
1. Preparation
    - Imports
    - Load data set
        - **Types**
        - **Token**
    - Preprocessing
2. Running nuspell on litkey data set
    - **Types**
    - **Tokens**
3. Evaluation

## 1 - Preparation

In [1]:
# IMPORTS
#%autoreload 2

import re
import pandas as pd
import matplotlib as mpl
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
from tqdm.notebook import tqdm # A fast, extensible progress bar

import sys

sys.path.insert(0, '../../')
# sys.path

from Lisa import litkey_2
#import Lisa.litkey_2

In [2]:
# Configuration
# Set figure size for sns plot
%config InlineBackend.figure_format = 'retina'
mpl.rc('figure', figsize=(8, 6), dpi=100)
sns.set()
sns.set_style('darkgrid')

# Set tqdm on pandas
tqdm.pandas(desc="Progress so far...")

# Do not truncate rows when displaying a DataFrame
pd.set_option('display.max_rows', None)

In [3]:
#tqdm.pandas(desc='my bar')
#import numpy as np
#tqdm.pandas()
#df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
#df.progress_apply(lambda x: x**2)

In [3]:
# LOAD LITKEY DATA SET
# The data directory is two levels up from this notebook
data_error_types = litkey_2.load(litkey_data_path ='../../litkey-data/')

data_error_token = litkey_2.load(litkey_data_path ='../../litkey-data/', toss_duplicates=False) 

In [4]:
# PREPROCESSING

# The spell checker treats words containing some punctuation marks as two words (e.g. Seil=bahn),
# which would break the alignment between output lines and DataFrame rows.
# Therefore "=", "-" (which had to be removed additionally), '"', ":", "," and "'" are deleted
# prior to the analysis. The untouched target word is kept in 'unchanged_corrected'.

for df in (data_error_types, data_error_token):
    df['unchanged_corrected'] = df['corrected']
    for char in ('=', '-', '"', ':', ',', "'"):
        # regex=False: treat each character literally
        df['corrected'] = df['corrected'].str.replace(char, '', regex=False)
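
The character removal in the preprocessing cell can also be written as a single translation table. The following is a minimal sketch under the assumption that exactly the six characters listed above are to be deleted; `strip_punctuation` is a hypothetical helper and not part of `litkey_2`, and `geht's` is a made-up sample word.

```python
# Characters deleted before spell checking (same set as in the preprocessing cell)
REMOVE_CHARS = '=-":,\''
_DELETE_TABLE = str.maketrans('', '', REMOVE_CHARS)


def strip_punctuation(word: str) -> str:
    """Return `word` with every character in REMOVE_CHARS deleted."""
    return word.translate(_DELETE_TABLE)


samples = ['Seil=bahn', '1234-Schule', "geht's", 'bellt']
cleaned = [strip_punctuation(w) for w in samples]
print(cleaned)
```

On a DataFrame column, the helper could be applied with `data_error_types['corrected'].map(strip_punctuation)`.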

In [20]:
#display(data_error_types.head(100))
#display(data_error_token.head(100))

In [5]:
data_error_types[~data_error_types.corrected.str.isalpha()]

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected
5614,1234schulle,1234Schule,06-477-3-IV-Weg.csv,1,1,1,1234-Schule
9426,10000mal,10000mal,10-603-4-IV-Weg.csv,1,1,1,10000-mal


In [6]:
data_error_types[data_error_types.corrected.str.isnumeric()]

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected


In [7]:
data_error_types[~data_error_types.corrected.str.isalnum()]

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected


In [8]:
data_error_types[data_error_types.corrected.str.contains(r' ')]

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected


In [9]:
data_error_types[data_error_types.corrected.str.isspace()]

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected


## 2 - Running nuspell on litkey data set
### a) Types

In [10]:
import sys
print(sys.version)

3.9.6 (default, Aug 18 2021, 12:38:10) 
[Clang 10.0.0 ]


In [11]:
# Write the 'corrected' column (the target spellings) to a csv file, one word per line, as input for nuspell
data_error_types.corrected.to_csv('corrected_nu.csv', header=False, index=False)

# Run nuspell with the German dictionary on that csv file
# and write its output to the text file 'output_nu.txt'
!cat corrected_nu.csv | nuspell -d de_DE > output_nu.txt



# Build a list (later the 'suggestions' column) with one entry per word:
# '' if the word is accepted, '?' if it is rejected without suggestions,
# otherwise the list of correction suggestions from nuspell
hs = []
with open('output_nu.txt') as f:
    #next(f) # Why did this work/was necessary in hunspell version though?
    # Strip every line and skip empty ones
    for l in [l for l in [line.strip() for line in f] if l]:
        # Words recognized as correct
        # (* = dictionary stem (e.g. "man"), + = affixed form of the following dictionary stem (e.g. "wollt" - wollen))
        # Append '' to hs
        
        # TODO: Why minus flag?
        if re.match(r'\+|\*|-', l):
            hs.append('')
            
        # Rejected words without suggestions (flag '#')
        # Append '?' to hs
        elif re.match('#', l):
            hs.append('?')
            
        # Rejected words with suggestions (flag '&')
        # Append the list of suggested words to hs
        else:
            hs.append(l.split(': ')[1].split(', '))

            
# Add nuspell's corrections as column to data
print(data_error_types.shape[0])
print(len(hs))

data_error_types['suggestions'] = hs

data_error_types.head(40)    # Print excerpt from DataFrame

INFO: Locale LC_CTYPE=de_DE.UTF-8, Used encoding=UTF-8
INFO: Pointed dictionary /Users/lisaprepens/Library/Spelling/de_DE.{dic,aff}
9484
9484


Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected,suggestions
0,belt,bellt,01-005-2-III-Eis.csv,91,138,91,bellt,
1,kukt,kuckt,01-005-2-III-Eis.csv,73,152,73,kuckt,"[kickt, juckt, duckt, guckt, zuckt, Kuckuck]"
2,dan,dann,01-005-2-III-Eis.csv,627,651,621,dann,
3,gekricht,gekriegt,01-005-2-III-Eis.csv,2,15,2,gekriegt,
4,leken,lecken,01-005-2-III-Eis.csv,14,17,14,lecken,
5,felt,fällt,01-005-2-III-Eis.csv,93,198,90,fällt,
6,wolte,wollte,01-005-2-III-Eis.csv,173,201,173,wollte,
7,lekt,leckt,01-005-2-III-Eis.csv,20,42,19,leckt,
8,fom,vom,01-005-2-III-Eis.csv,13,16,13,vom,
9,gawen,kaufen,01-006-2-III-Eis.csv,1,6,1,kaufen,
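
Since the same parsing loop is used for types and tokens, it could be factored into a small function. The following is a sketch, not the notebook's actual code: `parse_checker_line` and `parse_checker_output` are hypothetical names, and the sample lines are made up in the Hunspell-style pipe format that the loop above assumes (flag first, suggestions after `': '`, separated by `', '`).

```python
import re


def parse_checker_line(line: str):
    """Map one non-empty output line to '', '?' or a list of suggestions."""
    if re.match(r'\+|\*|-', line):      # accepted word
        return ''
    if line.startswith('#'):            # rejected, no suggestions
        return '?'
    head, sep, tail = line.partition(': ')
    if not sep:                         # neither a known flag nor a suggestion list
        raise ValueError(f'unexpected checker output: {line!r}')
    return tail.split(', ')             # rejected, with suggestions


def parse_checker_output(lines):
    """Parse all non-empty lines, keeping their order."""
    stripped = (line.strip() for line in lines)
    return [parse_checker_line(s) for s in stripped if s]


# Made-up sample output, for illustration only
sample_output = ['*', '', '+ wollen', '# xyzzy 0', '& kuckt 6 0: kickt, juckt']
parsed = parse_checker_output(sample_output)
print(parsed)
```

Raising on unexpected lines is a design choice of this sketch: a silently skipped line would shift all later entries against the DataFrame rows.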


In [12]:
data_error_types.head(2)
data_error_types.tail(2)

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected,suggestions
9482,mutzte,nutzte,10-693-4-IV-Weg.csv,1,1,1,nutzte,
9483,momen,Moment,10-693-4-IV-Weg.csv,1,4,1,Moment,


### b) Token

In [13]:
# Write the 'corrected' column (the target spellings) to a csv file, one word per line, as input for nuspell
data_error_token.corrected.to_csv('corrected_nu_token.csv', header=False, index=False)

# Run nuspell with the German dictionary on that csv file
# and write its output to the text file 'output_nu_token.txt'
!cat corrected_nu_token.csv | nuspell -d de_DE > output_nu_token.txt



# Build a list (later the 'suggestions' column) with one entry per word:
# '' if the word is accepted, '?' if it is rejected without suggestions,
# otherwise the list of correction suggestions from nuspell
hs = []
with open('output_nu_token.txt') as f:
    #next(f)
    # Strip every line and skip empty ones
    for l in [l for l in [line.strip() for line in f] if l]:
        # Words recognized as correct
        # (* = dictionary stem (e.g. "man"), + = affixed form of the following dictionary stem (e.g. "wollt" - wollen))
        # Append '' to hs
        
        # TODO: Why minus flag?
        if re.match(r'\+|\*|-', l):
            hs.append('')
            
        # Rejected words without suggestions (flag '#')
        # Append '?' to hs
        elif re.match('#', l):
            hs.append('?')
            
        # Rejected words with suggestions (flag '&')
        # Append the list of suggested words to hs
        else:
            hs.append(l.split(': ')[1].split(', '))

            
# Add nuspell's corrections as column to data
print(data_error_token.shape[0])
print(len(hs))

data_error_token['suggestions'] = hs

data_error_token.head(40)    # Print excerpt from DataFrame

INFO: Locale LC_CTYPE=de_DE.UTF-8, Used encoding=UTF-8
INFO: Pointed dictionary /Users/lisaprepens/Library/Spelling/de_DE.{dic,aff}
24601
24601


Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected,suggestions
0,belt,bellt,01-005-2-III-Eis.csv,91,138,91,bellt,
1,kukt,kuckt,01-005-2-III-Eis.csv,73,152,73,kuckt,"[kickt, juckt, duckt, guckt, zuckt, Kuckuck]"
2,dan,dann,01-005-2-III-Eis.csv,627,651,621,dann,
3,gekricht,gekriegt,01-005-2-III-Eis.csv,2,15,2,gekriegt,
4,leken,lecken,01-005-2-III-Eis.csv,14,17,14,lecken,
5,felt,fällt,01-005-2-III-Eis.csv,93,198,90,fällt,
6,wolte,wollte,01-005-2-III-Eis.csv,173,201,173,wollte,
7,lekt,leckt,01-005-2-III-Eis.csv,20,42,19,leckt,
8,fom,vom,01-005-2-III-Eis.csv,13,16,13,vom,
9,dan,dann,01-005-2-III-Eis.csv,627,651,621,dann,
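
The `!cat ... | nuspell ...` shell call depends on a Unix shell. A more portable variant might pipe the words through `subprocess` instead. This is only a sketch: `run_checker` is a hypothetical helper, and the demo uses a stand-in command that echoes its input so that it runs without nuspell installed.

```python
import subprocess
import sys


def run_checker(words, command):
    """Send one word per line to `command` on stdin and return its stdout lines."""
    result = subprocess.run(
        command,
        input='\n'.join(words) + '\n',
        capture_output=True,
        text=True,
        encoding='utf-8',
        check=True,          # raise if the command exits with an error
    )
    return result.stdout.splitlines()


# Stand-in for the spell checker: a Python one-liner that echoes stdin
echo_cmd = [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read())']
echoed = run_checker(['bellt', 'kuckt'], echo_cmd)
print(echoed)

# Assumption: with nuspell on the PATH, the command used in the cells above
# would correspond to ['nuspell', '-d', 'de_DE'].
```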


In [14]:
# Write pickle
data_error_types.to_pickle('target_error_types_nuspell_evaluation.pkl')
data_error_token.to_pickle('target_error_token_nuspell_evaluation.pkl')

## Get upper bound
_Here: For case insensitive_ <br>
There are 3 cases to be distinguished:
- not recognized
    - Nuspell flag = '#', i.e. '?' is appended to hs
- recognized as right
    - Nuspell flag = '+', '*' or '-', i.e. the empty string '' is appended to hs (see above)
- recognized as wrong
    - a list of suggestions is appended to hs
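
The three cases can be told apart from the entry stored in the `suggestions` column alone. A minimal sketch, assuming the encoding used in the parsing cells (`''`, `'?'`, or a list); `classify` is a hypothetical helper.

```python
def classify(entry):
    """Name the case that a `suggestions` entry stands for."""
    if entry == '':
        return 'recognized as right'
    if entry == '?':
        return 'not recognized'
    return 'recognized as wrong'


for entry in ['', '?', ['kickt', 'juckt']]:
    # len(entry) > 0 holds for both '?' and a suggestion list,
    # so a length check groups these two cases as "not right"
    print(repr(entry), '->', classify(entry), '| not right:', len(entry) > 0)
```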

In [2]:
# Read pickle
data_error_token = pd.read_pickle('target_error_token_nuspell_evaluation.pkl')
data_error_types = pd.read_pickle('target_error_types_nuspell_evaluation.pkl')

### a) Types

In [15]:
# RECOGNIZED AS RIGHT / NOT RIGHT (wrong or not recognized)

# If the suggestions entry is empty, the word has been recognized as correct
# <-> if the entry is not empty (suggestion list or '?'), the word has not been recognized as correct

ct_nu_not_right = data_error_types[data_error_types.suggestions.apply(len).gt(0)].shape[0] # length greater than 0, i.e. entry not empty: either wrong or not recognized
ct_nu_right = data_error_types[~data_error_types.suggestions.apply(len).gt(0)].shape[0] # empty entry: recognized as right
display('Nuspell recognizes as not correct...', data_error_types[data_error_types.suggestions.apply(len).gt(0)].head(20))

'Nuspell recognizes as not correct...'

Unnamed: 0,original,corrected,filename,freq_ori,freq_cor,freq_tup,unchanged_corrected,suggestions
1,kukt,kuckt,01-005-2-III-Eis.csv,73,152,73,kuckt,"[kickt, juckt, duckt, guckt, zuckt, Kuckuck]"
14,nahaise,nachhause,01-006-2-III-Eis.csv,1,10,1,nachhause,"[nach hause, nach-hause, nachschaue, Nachschau]"
28,weitagegangen,weitergegangen,01-025-2-III-Eis.csv,1,1,1,weitergegangen,"[weiter gegangen, weiter-gegangen, weitergegeb..."
63,runtergefalen,runtergefallen,01-029-2-III-Eis.csv,12,28,12,runtergefallen,"[runter gefallen, runter-gefallen, heruntergef..."
77,wegetan,wehgetan,01-045-2-III-Eis.csv,11,17,11,wehgetan,"[weggetan, weh getan, weh-getan, angeweht]"
79,rich,riech,01-045-2-III-Eis.csv,2,2,2,riech,"[reich, rieche, riecht, siech, Riecher]"
84,Nachause,nachhause,01-049-2-III-Eis.csv,1,10,1,nachhause,"[nach hause, nach-hause, nachschaue, Nachschau]"
103,schtoiper,stolper,01-057-2-III-Eis.csv,1,1,1,stolper,"[stolpre, stolpere, stolpern, stolpert, stolpe..."
146,pas,pass,01-065-2-III-Eis.csv,8,10,8,pass,"[Pass, passe, passt, passé, -pass, pass-, nass..."
221,cemast,zermatscht,01-113-2-III-Eis.csv,1,1,1,zermatscht,[schmatze]


In [16]:
display(ct_nu_not_right)
display(ct_nu_right)
display(data_error_types.shape[0])

ratio = ct_nu_right/data_error_types.shape[0]
print('New upper bound for types is', round(ratio,4)*100,'%')
print(ratio)

731

8753

9484

New upper bound for types is 92.29 %
0.9229228173766343


### b) Token

In [17]:
ct_nu_not_right_token = data_error_token[data_error_token.suggestions.apply(len).gt(0)].shape[0]
ct_nu_right_token = data_error_token[~data_error_token.suggestions.apply(len).gt(0)].shape[0]

display(ct_nu_not_right_token)
display(ct_nu_right_token)
display(data_error_token.shape[0])

ratio_token = ct_nu_right_token/data_error_token.shape[0]
print('New upper bound for tokens is', round(ratio_token,4)*100,'%')
print(ratio_token)

1473

23128

24601

New upper bound for tokens is 94.01 %
0.940124385187594
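
The upper bound is the share of target words that nuspell accepts. The sketch below wraps that ratio in a hypothetical helper `upper_bound`, runs it on a made-up toy list, and then redoes the arithmetic with the counts reported above.

```python
def upper_bound(suggestions):
    """Share of entries that the checker accepted (empty entry)."""
    entries = list(suggestions)
    if not entries:
        raise ValueError('no entries to evaluate')
    accepted = sum(1 for entry in entries if len(entry) == 0)
    return accepted / len(entries)


# Made-up toy column: two accepted words, one with suggestions, one unrecognized
toy = ['', ['kickt', 'juckt'], '?', '']
print(upper_bound(toy))

# Counts from the cells above: accepted / total
types_ratio = 8753 / 9484
tokens_ratio = 23128 / 24601
print(f'types: {types_ratio:.2%}, tokens: {tokens_ratio:.2%}')
```

Formatting with `:.2%` avoids the floating-point artifacts that `round(ratio, 4) * 100` can produce for some values.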
