<h1 style="text-align:center;color:red;">The Lord of the Rings - Grammar</h1>
<p style="text-align:center;">By Maycon Cypriano Batestin</p>



### About the Dataset

The goal is to build a grammar checker in Portuguese based on the book The Lord of the Rings by J.R.R. Tolkien.

- **Source:** The book The Lord of the Rings by J.R.R. Tolkien
- **Release:** Maycon Batestin
- **Licence:** Creative Commons Attribution-ShareAlike 4.0 International ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))



<h1 style="text-align:center;color:red;">Glossary</h1>


Field | Type   | Description                                 |
------|:------:|:--------------------------------------------|
word  | string | A single word from the book                 |
count | int    | Number of times the word occurs in the book |
freq  | float  | Relative frequency of the word              |
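
The relationship between `count` and `freq` can be illustrated with a small sketch. It assumes that `freq` is `count` divided by the total number of word occurrences. The sample rows shown further down are consistent with that, but `create_dataset.py` is not shown here, so this is not confirmed. The `word_stats` helper and the sample sentence are hypothetical.

```python
from collections import Counter

def word_stats(words):
    """Return {word: (count, freq)} for a list of words.

    Assumption: freq = count / total number of words.
    """
    counts = Counter(words)
    total = sum(counts.values())
    return {word: (count, count / total) for word, count in counts.items()}

# Hypothetical sample text, not taken from the dataset
sample = "the ring and the hobbit and the wizard".split()
stats = word_stats(sample)
print(stats["the"])  # (3, 0.375)
```

Under this assumption, the `freq` values of all words sum to 1.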






<h1 style="text-align:center;color:red;">Getting the Dataset </h1>


In [172]:
!clear
!python3 /Users/mayconcyprianobatestin/Documents/repositorios/DATA_SCIENCE/GRAMMAR/scripts/create_dataset.py

[H[2JXref table not zero-indexed. ID numbers for objects will be corrected.


<h1 style="text-align:center;color:red;">Libraries</h1>


In [173]:
import pandas as pd
import re
import nltk
import string
import PyPDF2
import requests
import random
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import linear_model
from sklearn.preprocessing import LabelEncoder
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords


nltk.download("punkt")



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mayconcyprianobatestin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [174]:
def load_dataset(path):
    # Read the word-frequency CSV produced by create_dataset.py
    return pd.read_csv(path)

df = load_dataset("/Users/mayconcyprianobatestin/Documents/repositorios/DATA_SCIENCE/GRAMMAR/dataset/lotr.csv")
df

Unnamed: 0,word,count,freq
0,special,19,0.000032
1,note,26,0.000044
2,in,8775,0.014918
3,this,1664,0.002829
4,reprint,1,0.000002
...,...,...,...
16395,archival,1,0.000002
16396,utmost,1,0.000002
16397,xxxstmxxx,1,0.000002
16398,06,1,0.000002


In [175]:
def checkNAN(df):
    if df.isnull().values.any():
        df.dropna(inplace=True) 
        df.reset_index(drop=True, inplace=True)
        print("Checking for NaN values and fixing!.")
    else:
        print("There are no NaN values in your dataset")

checkNAN(df)


Checking for NaN values and fixing!.


In [176]:
# Variables

text = " ".join([l for l in  df.word ])
token = set([word for word in nltk.tokenize.word_tokenize(text) if word.isalpha()])
alpha = string.ascii_letters
numbers = string.digits
puncts = string.punctuation
printable = string.printable
space = string.whitespace
token_string = "".join(token)


In [177]:
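# Create a test set: each correct word paired with a copy missing one random letter

def token_wrongs(token):
    # Single-character words cannot lose a letter and are returned unchanged
    if len(token) > 1:
        index = random.randint(0, len(token) - 1)
        token_list = list(token)
        token_list.pop(index)
        return ''.join(token_list)
    else:
        return token


# (correct word, corrupted word) pairs for the whole vocabulary
tuples = [(word, token_wrongs(word)) for word in token]

tId = list(enumerate(tuples))

test_df = pd.DataFrame(tId, columns=["id", 'test'])


# Despite its name, test_df ends up as a plain list of pairs
test_df = list(test_df.test)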
test_df

[('motor', 'motr'),
 ('furnace', 'funace'),
 ('greed', 'reed'),
 ('beaklike', 'beaklik'),
 ('hardly', 'hardl'),
 ('himsel', 'imsel'),
 ('htly', 'hty'),
 ('pps', 'ps'),
 ('eh', 'e'),
 ('suspense', 'susense'),
 ('suspected', 'suspecte'),
 ('bridled', 'ridled'),
 ('ges', 'gs'),
 ('goon', 'goo'),
 ('idleness', 'idlenes'),
 ('stupid', 'supid'),
 ('bodyguards', 'bodygurds'),
 ('unearthly', 'unearthy'),
 ('fallohides', 'fallohids'),
 ('highland', 'highlnd'),
 ('reciting', 'recitng'),
 ('uruks', 'uruk'),
 ('advantage', 'advntage'),
 ('celebdil', 'celbdil'),
 ('sawn', 'saw'),
 ('object', 'bject'),
 ('tuesday', 'tuesda'),
 ('forehead', 'foreead'),
 ('brightened', 'brightene'),
 ('burnished', 'burished'),
 ('heeding', 'heedig'),
 ('disappeared', 'disapeared'),
 ('goldwine', 'goldwne'),
 ('stoo', 'sto'),
 ('tentacle', 'entacle'),
 ('wafer', 'waer'),
 ('walked', 'waled'),
 ('matches', 'mathes'),
 ('ping', 'png'),
 ('leaflock', 'eaflock'),
 ('quieted', 'queted'),
 ('tion', 'tio'),
 ('retentive', 're
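
Because `token_wrongs` draws from the global `random` state, each run produces a different test set. A minimal sketch of a reproducible variant follows. The `drop_random_letter` name, the seed value and the three sample words are assumptions for illustration, not part of the original notebook. The vocabulary is sorted first because the iteration order of a set of strings can differ between interpreter runs.

```python
import random

def drop_random_letter(word, rng):
    """Remove one randomly chosen character; single-character words are returned unchanged."""
    if len(word) <= 1:
        return word
    index = rng.randrange(len(word))
    return word[:index] + word[index + 1:]

rng = random.Random(42)  # hypothetical seed
vocabulary = {"aragorn", "hobbit", "ring"}  # hypothetical sample words

# Sort the set so the pairs are generated in a stable order
pairs = [(word, drop_random_letter(word, rng)) for word in sorted(vocabulary)]
```

With a fixed seed and a sorted vocabulary, the same pairs are generated on every run, which makes accuracy figures comparable across runs.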

In [178]:
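def grammar(word):
    name = word.lower()
    # Split the word at every position: (left part, right part)
    splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
    # Candidate corrections: insert each lowercase letter at every split point
    candidates = [f"{left}{letter}{right}" for left, right in splits for letter in string.ascii_lowercase]
    # token is a set, so every known word gets the same count here
    frequency = nltk.FreqDist(token)
    relative_frequency = {w: count / len(token) for w, count in frequency.items()}
    # Return the candidate with the highest relative frequency (0 if unknown)
    return max(candidates, key=lambda w: relative_frequency.get(w, 0))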
    


In [179]:
grammar("aragor")



'aragorn'
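
In `grammar`, the frequencies are computed from `token`, which is a set, so every known word has a count of 1 and the first candidate found in the vocabulary wins. The sketch below shows one way the ranking could use real occurrence counts instead, such as those in the `count` column of the dataset. The `counts` dictionary (with made-up numbers) and the `insertion_candidates` and `correct` functions are hypothetical stand-ins, not the notebook's method.

```python
import string

# Hypothetical counts standing in for the dataset's `count` column
counts = {"ring": 120, "rang": 7, "wring": 2}

def insertion_candidates(word):
    """Every string obtained by inserting one lowercase letter into `word`."""
    return [
        word[:i] + letter + word[i:]
        for i in range(len(word) + 1)
        for letter in string.ascii_lowercase
    ]

def correct(word, counts):
    """Pick the known candidate with the highest count; fall back to the input."""
    word = word.lower()
    known = [c for c in insertion_candidates(word) if c in counts]
    if not known:
        return word
    return max(known, key=counts.get)

print(correct("rng", counts))  # ring
print(correct("xyz", counts))  # xyz
```

Both `ring` and `rang` are one insertion away from `rng`; the higher count decides between them. When no candidate is known, this sketch returns the lowercased input instead of an arbitrary candidate.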

In [180]:
def accuracy(test_df):
    number_words = len(test_df)
    right = 0
    # r is the correct word, w is its corrupted copy
    for r, w in test_df:
        right_word = grammar(w)
        if right_word == r:
            right = right + 1

    final = round(right * 100 / number_words, 3)

    return f"accuracy: {final}%"

accuracy(test_df)

'accuracy: 78.133%'
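
For further analysis it can help to get the score as a number and to see which words were missed. The sketch below assumes the corrector is passed in as a function. The `evaluate` function, the toy `pairs` list and the lookup-based corrector are hypothetical.

```python
def evaluate(pairs, corrector):
    """Return (accuracy in percent, list of (expected, misspelled, predicted) misses)."""
    misses = []
    for expected, misspelled in pairs:
        predicted = corrector(misspelled)
        if predicted != expected:
            misses.append((expected, misspelled, predicted))
    if not pairs:
        # Avoid dividing by zero on an empty test set
        return 0.0, misses
    accuracy = round((len(pairs) - len(misses)) * 100 / len(pairs), 3)
    return accuracy, misses

# Toy data and a toy corrector that leaves unknown words unchanged
lookup = {"motr": "motor", "funace": "furnace"}
pairs = [("motor", "motr"), ("furnace", "funace"), ("greed", "reed")]
accuracy, misses = evaluate(pairs, lambda w: lookup.get(w, w))
print(accuracy)  # 66.667
print(misses)    # [('greed', 'reed', 'reed')]
```

The list of misses makes cases like `greed` and `reed` visible, where the corrupted word is itself a valid word.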