# NLP Practice Assignments
Day 4
    Create Your Own Spell Checker
    Objective: Create a spell checker that corrects the misspelled words in a given sentence.
    
    Problem Statement: While typing or sending a message to someone, we often make 
    spelling mistakes. Write a script that corrects the misspelled words in a sentence.
    The input will be a raw string, and the output will be a string with the case normalized 
    and the misspelled words corrected.
    
    Domain: General
    
    Analysis to be done: Word availability in the corpus
    
    Content: 
    Dataset: None
    We will use NLTK’s built-in corpora (words, stop words, etc.) rather than a specific dataset.
    
    Steps to perform:
    While there are several approaches to correcting spelling, you will use the Levenshtein 
    (edit distance) approach. 
    
    The approach to correcting a word is straightforward: 
        ▪ If the word is present in a list of valid words, the word is correct.
        ▪ If the word is absent from the valid word list, we will find the correct 
        word, i.e., the word from the valid word list that has the lowest edit 
        distance from the target word.
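
The two rules above can be sketched in plain Python. This is a minimal illustration, not the assignment's required solution: `levenshtein`, `correct_word`, and `toy_words` are hypothetical names introduced here, and the tiny word list stands in for the NLTK corpus.

```python
from typing import Collection


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def correct_word(word: str, valid_words: Collection[str]) -> str:
    """Return the word unchanged if it is valid, otherwise the valid
    word with the lowest edit distance from it."""
    if word in valid_words:
        return word
    return min(valid_words, key=lambda candidate: levenshtein(word, candidate))


# Hypothetical toy word list (assumption: stands in for the NLTK words corpus).
toy_words = ["this", "sentence", "some", "spelling", "mistakes", "work"]

print(levenshtein("kitten", "sitting"))    # 3
print(correct_word("sentece", toy_words))  # sentence
print(correct_word("work", toy_words))     # work
```

When several candidates share the lowest distance, `min` returns the first one it encounters, so the order of the word list decides ties.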
    
    Once you have defined this function, you will iterate over the terms in the given sentence, 
    correct the words identified as incorrect, and return a joined string of all the terms. 
    To speed up execution, you will not apply the spell check to stop words 
    or punctuation.
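
The sentence-level loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the stop words, valid words, and correction table are toy data, `correct_term` is a hypothetical stand-in for an edit-distance corrector, and a regular expression replaces NLTK's tokenizer so the example runs without downloads.

```python
import re
import string

# Toy stand-ins (assumption): the assignment combines NLTK's English
# stop words with the punctuation characters from the string module.
toy_stop_words = {"the", "is", "has", "a"} | set(string.punctuation)
toy_valid_words = {"sentence", "mistakes", "great"}
toy_corrections = {"sentece": "sentence", "mistkes": "mistakes"}


def correct_term(term):
    """Hypothetical corrector: looks the term up in a fixed table
    instead of computing edit distances."""
    return toy_corrections.get(term, term)


def correct_sentence(sentence, corrector=correct_term):
    # Lowercase first so the output is case-normalized, then split the
    # text into word tokens and single punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    corrected = []
    for token in tokens:
        if token in toy_stop_words or token in toy_valid_words:
            corrected.append(token)  # skip the expensive correction step
        else:
            corrected.append(corrector(token))
    return " ".join(corrected)


print(correct_sentence("The sentece has mistkes."))  # the sentence has mistakes .
```

Joining with a single space leaves a gap before punctuation; that matches the simple join described in the steps and could be tidied in a later pass if needed.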


In [9]:
import nltk
from nltk.corpus import words
from nltk.metrics import edit_distance
import string

# Download the NLTK word list (skipped if it is already present).
# nltk.word_tokenize (used below) also needs the 'punkt' tokenizer models
# ('punkt_tab' in newer NLTK releases).
nltk.download('words')

# Get the set of valid English words
valid_words = set(words.words())

def is_valid_word(word):
    # Check if the word is a valid English word
    return word.lower() in valid_words

def correct_spelling(word):
    # If the word is valid, return it as is
    if is_valid_word(word):
        return word

    # Restrict candidates to valid words sharing the first letter (a speed heuristic),
    # then pick the one with the lowest edit distance
    suggestions = [w for w in valid_words if w[0].lower() == word[0].lower()]
    
    if not suggestions:
        return word  # No suggestions, return the original word
    
    corrected_word = min(suggestions, key=lambda w: edit_distance(word, w))
    return corrected_word

def spell_checker(sentence):
    # Tokenize the sentence ('tokens' avoids shadowing the imported 'words' corpus)
    tokens = nltk.word_tokenize(sentence)

    # Correct the spelling of each token, leaving punctuation untouched
    corrected_words = [correct_spelling(token) if token not in string.punctuation else token for token in tokens]

    # Join the corrected words to form the final corrected sentence
    corrected_sentence = ' '.join(corrected_words)

    return corrected_sentence

# Example usage:
input_sentence = "Ths sentece has sme spelng mistkes. Wll ths wrk?"
output_sentence = spell_checker(input_sentence)
print("Input Sentence:", input_sentence)
print("Corrected Sentence:", output_sentence)


[nltk_data] Downloading package words to C:\Users\Abhisek
[nltk_data]     Das\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Input Sentence: Ths sentece has sme spelng mistkes. Wll ths wrk?
Corrected Sentence: The sentence his sie sprang misken . Will the wro ?


# Tasks: 
    1. Get a list of valid English words from NLTK’s list of words. (Hint: use 
    nltk.download('words') to get the raw list.)
    
    2. Look at the first 20 words in the list. Is the casing normalized?
    
    3. Normalize the casing for all the terms.
    
    4. Normalizing will have introduced some duplicates; create a unique list after normalizing.
    
    5. Create a list of stop words, which should include: 
        i. Stop words from NLTK
        ii. All punctuation (Hint: use ‘punctuation’ from the string module)
        iii. The final list should be a combination of these two
    
    6. Define a function to correct a single term:
        • For a given term, find its edit distance from each term in the valid word 
        list. To speed up execution, you can use the first 20,000 entries in the 
        valid word list.
        • Store the result in a dictionary, with the term as the key and the edit 
        distance as the value.
        • Sort the dictionary in ascending order of the values.
        • Return the first entry in the sorted result (the term with the minimum edit 
        distance).
        • Using the function, get the correct word for committee.
    
    7. Make a set from the list of valid words, for faster lookup of whether a word is in the valid list.
    
    8. Define a function for spelling correction in any given input sentence:
        1. Convert all the terms to lowercase, then tokenize the sentence.
        2. For each term in the tokenized sentence, check whether the term is in the list of valid words (valid_words_set).
        3. If yes, keep the word as is.
        4. If no, get the correct word using the get_correct_term function.
        5. Return the joined string as output.
    
    9. Test the function on the input sentence “The new abacos is great”.
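
Tasks 3, 4, 5, and 7 are mostly list and set handling, which the short sketch below illustrates on toy data. The `raw_words` and `toy_stop_words` lists are hypothetical stand-ins for `words.words()` and `stopwords.words('english')`; `dict.fromkeys` is one possible way to drop duplicates while keeping the original order, which matters if "the first 20,000 entries" should stay well defined.

```python
import string

# Hypothetical raw entries standing in for nltk.corpus.words.words().
raw_words = ["A", "a", "aa", "Aaron", "abacus", "Abacus", "the", "The"]

# Task 3: normalize the casing.
lowered = [word.lower() for word in raw_words]

# Task 4: drop duplicates while keeping first-seen order
# (dictionaries preserve insertion order).
unique_words = list(dict.fromkeys(lowered))

# Task 5: combine stop words with the punctuation characters.
toy_stop_words = ["the", "is", "a"]  # stand-in for stopwords.words('english')
stop_list = toy_stop_words + list(string.punctuation)

# Task 7: a set gives fast membership checks.
valid_words_set = set(unique_words)

print(unique_words)                  # ['a', 'aa', 'aaron', 'abacus', 'the']
print("abacus" in valid_words_set)   # True
```

Building the set only for lookups, and keeping the ordered list for slicing, avoids relying on set iteration order.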

In [10]:
import nltk
from nltk.corpus import words, stopwords
from nltk.metrics import edit_distance
from string import punctuation

# Task 1: Get a list of valid words in the English language
nltk.download('words')
valid_words = set(words.words())

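# Task 2: Look at the first 20 words in the list. Is the casing normalized?
# Note: valid_words is a set, so these 20 entries are an arbitrary sample, not the head of the corpus list.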
print("Task 2 - First 20 words:")
print(list(valid_words)[:20])

# Task 3: Normalize the casing for all the terms
normalized_valid_words = set(word.lower() for word in valid_words)

# Task 4: Create a unique list after normalizing
unique_normalized_valid_words = list(normalized_valid_words)

# Task 5: Create a list of stop words
stop_words = set(stopwords.words('english') + list(punctuation))

# Task 6: Define a function to correct a single term
def get_correct_term(term):
    # Use 20,000 entries from the valid word list
    # (valid_words is a set, so which entries are taken can differ between runs)
    subset_valid_words = list(valid_words)[:20000]
    
    # Store edit distances in a dictionary
    edit_distances = {word: edit_distance(term, word) for word in subset_valid_words}
    
    # Sort the dictionary by edit distances
    sorted_distances = sorted(edit_distances.items(), key=lambda x: x[1])
    
    # Return the correct word with the minimum edit distance
    return sorted_distances[0][0]

# Task 7: Make a set from the normalized valid words, so lowercased tokens also match capitalized entries
valid_words_set = set(unique_normalized_valid_words)

# Task 8: Define a function for spelling correction
def correct_spelling(sentence):
    # Tokenize and make all terms lowercase
    tokens = nltk.word_tokenize(sentence)
    tokens_lower = [token.lower() for token in tokens]

    # Check and correct each term
    corrected_tokens = [token if token in valid_words_set else get_correct_term(token) for token in tokens_lower]

    # Return the joined string as output
    return ' '.join(corrected_tokens)

# Task 9: Test the function for the input sentence
input_sentence = "The new abacos is great"
output_sentence = correct_spelling(input_sentence)
print("Task 9 - Corrected Sentence:", output_sentence)


[nltk_data] Downloading package words to C:\Users\Abhisek
[nltk_data]     Das\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Task 2 - First 20 words:
['skelderdrake', 'wrencher', 'lixive', 'Jamnia', 'anteprohibition', 'hepatopneumonic', 'ostracism', 'teethily', 'toyishly', 'hypnotically', 'gnosticizer', 'engild', 'plowline', 'francium', 'sidebones', 'bantayan', 'unbenefitable', 'oilman', 'Gorkiesque', 'Necrophorus']
Task 9 - Corrected Sentence: the new abac is great
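
The correction returned for "abacos" depends on which 20,000 entries end up in the candidate subset: a Python set has no defined order, so slicing `list(valid_words)` takes an arbitrary sample. "abacos" is one substitution away from "abacus" but two deletions away from "abac", so the closer word wins only if it is among the candidates. The sketch below walks through the Task 6 steps on a small hypothetical word list that is sorted before slicing; `edit_distance` here is a memoized pure-Python stand-in for `nltk.metrics.edit_distance`, and `toy_words` is an assumption made for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Levenshtein distance via memoized recursion
    (stand-in for nltk.metrics.edit_distance)."""
    if not a:
        return len(b)
    if not b:
        return len(a)
    cost = 0 if a[-1] == b[-1] else 1
    return min(
        edit_distance(a[:-1], b) + 1,          # deletion
        edit_distance(a, b[:-1]) + 1,          # insertion
        edit_distance(a[:-1], b[:-1]) + cost,  # match or substitution
    )


def get_correct_term(term, candidates):
    # Key: candidate word, value: its edit distance from the term.
    distances = {word: edit_distance(term, word) for word in candidates}
    # Sort ascending by distance; sorted() is stable, so ties keep candidate order.
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return ranked[0][0]


# Hypothetical word list; sorting before slicing makes the
# "first N entries" reproducible from run to run.
toy_words = sorted({"great", "abacus", "abac", "new", "committee"})
subset = toy_words[:20000]

print(edit_distance("abacos", "abacus"))     # 1
print(edit_distance("abacos", "abac"))       # 2
print(get_correct_term("abacos", subset))    # abacus
print(get_correct_term("comittee", subset))  # committee
```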
