## Spell Checking

In this notebook we construct a data frame consisting of essays and the number of spelling errors detected in them.

To analyse the words in our essays we will need to clean the text. This will consist of expanding contractions, removing non-letter and non-number symbols, and making the text lowercase.

We compare the words in the essays with a list of the 125k most common English words.

The code below is the same as Derek's (essay topic analysis). It uses the os module to read each text file in our data set into a string, collects the strings in text_list, and then displays the second entry (index 1).

In [1]:
import os

folder_path = 'small test'

file_list = [file for file in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, file))]
file_list.sort()
# os.listdir() returns entries in arbitrary order. Sort to get a reproducible ordering.

text_list = []

for file_name in file_list:
    file_path = os.path.join(folder_path, file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()
        text_list.append(text)

In [2]:
text_list[1]

"Driverless cars are exaclty what you would expect them to be. Cars that will drive without a person actually behind the wheel controlling the actions of the vehicle. The idea of driverless cars going in to developement shows the amount of technological increase that the wolrd has made. The leader of this idea of driverless cars are the automobiles they call Google cars. The arduous task of creating safe driverless cars has not been fully mastered yet. The developement of these cars should be stopped immediately because there are too many hazardous and dangerous events that could occur.\n\nOne thing that the article mentions is that the driver will be alerted when they will need to take over the driving responsibilites of the car. This is such a dangerous thing because we all know that whenever humans get their attention drawn in on something interesting it is hard to draw their focus somewhere else. The article explains that companies are trying to implement vibrations when the car is
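The same loading step can also be written with `pathlib`. The sketch below is an alternative, not the code used in this notebook; the function name `read_essays` is hypothetical and only there for illustration.

```python
from pathlib import Path


def read_essays(folder_path):
    """Return the contents of every regular file in folder_path, ordered by file name."""
    folder = Path(folder_path)
    # iterdir() yields entries in arbitrary order, so sort by name for reproducibility.
    files = sorted((p for p in folder.iterdir() if p.is_file()), key=lambda p: p.name)
    return [p.read_text(encoding="utf-8") for p in files]
```

Calling `read_essays('small test')` should then give the same list as `text_list` above, assuming the folder contains only UTF-8 text files.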

We now expand contractions from words. For example, we replace "don't" with "do not".

The code below installs the contractions package with pip.

After that we create a list "text_no_contr" which contains the essays with their contractions expanded.

In [3]:
import sys  
!{sys.executable} -m pip install contractions



In [4]:
import contractions
text_no_contr = [contractions.fix(text) for text in text_list]


In [5]:
print(text_no_contr[1])

Driverless cars are exaclty what you would expect them to be. Cars that will drive without a person actually behind the wheel controlling the actions of the vehicle. The idea of driverless cars going in to developement shows the amount of technological increase that the wolrd has made. The leader of this idea of driverless cars are the automobiles they call Google cars. The arduous task of creating safe driverless cars has not been fully mastered yet. The developement of these cars should be stopped immediately because there are too many hazardous and dangerous events that could occur.

One thing that the article mentions is that the driver will be alerted when they will need to take over the driving responsibilites of the car. This is such a dangerous thing because we all know that whenever humans get their attention drawn in on something interesting it is hard to draw their focus somewhere else. The article explains that companies are trying to implement vibrations when the car is in
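To make the idea of contraction expansion concrete, here is a minimal sketch based on a small lookup table and a regular expression. It is only an illustration of the technique and makes no claim about how the `contractions` package works internally; the mapping `CONTRACTIONS` and the function `expand_contractions` are hypothetical names, and the table is deliberately tiny.

```python
import re

# Hypothetical, deliberately tiny mapping used only for this illustration.
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "they're": "they are",
}

# One alternation over all known contractions, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form (the replacement is lowercase)."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I don't think it's fair."))  # I do not think it is fair.
```

Note that a word such as "its" is left untouched, because only exact contractions from the table are replaced.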

We now remove all non-letter and non-number symbols from the essays. To do this we import RegexpTokenizer from nltk and create a new list "words_no_punct". The i-th entry of this list contains the lowercase words of the i-th essay, with all non-letter and non-number symbols removed.

In [6]:
import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')  # keep only runs of ASCII letters and digits, i.e. drop all other symbols
words_no_punct = [tokenizer.tokenize(text.lower()) for text in text_no_contr]


In [7]:
print(words_no_punct[15])

['dear', 'teacher', 'name', 'i', 'agree', '100', 'with', 'you', 'i', 'think', 'in', 'order', 'to', 'be', 'on', 'a', 'team', 'sport', 'you', 'will', 'have', 'to', 'have', 'you', 'grade', 'up', 'to', 'a', 'b', 'i', 'think', 'that', 'because', 'its', 'not', 'fair', 'to', 'the', 'teachers', 'if', 'all', 'you', 'think', 'about', 'is', 'sports', 'and', 'your', 'grades', 'are', 'going', 'down', 'in', 'school', 'but', 'you', 'are', 'doing', 'sports', 'that', 'telling', 'the', 'teachers', 'you', 'just', 'do', 'not', 'care', 'also', 'i', 'think', 'its', 'unfair', 'to', 'the', 'other', 'students', 'that', 'you', 'are', 'on', 'the', 'soccer', 'team', 'but', 'have', 'grades', 'that', 'are', 'failing', 'and', 'you', 'miss', 'practice', 'all', 'the', 'time', 'because', 'you', 'have', 'detention', 'its', 'fair', 'do', 'have', 'the', 'honor', 'students', 'play', 'for', 'are', 'school', 'even', 'if', 'there', 'not', 'the', 'best', 'in', 'the', 'school', 'also', 'its', 'unfair', 'to', 'you', 'you', 'will
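To see what the pattern `[a-zA-Z0-9]+` keeps and drops, the sketch below applies the same pattern with the standard library's `re.findall` instead of NLTK. The function name `simple_tokenize` is hypothetical. I would expect it to give the same tokens as the RegexpTokenizer call above for this pattern, but that equivalence is an assumption and is not checked here.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]+")


def simple_tokenize(text):
    """Lowercase the text and return the maximal runs of ASCII letters and digits."""
    return TOKEN_PATTERN.findall(text.lower())


print(simple_tokenize("I agree 100% with you!"))  # ['i', 'agree', '100', 'with', 'you']
print(simple_tokenize("It's not fair."))          # ['it', 's', 'not', 'fair']
```

The second example shows that a contraction which has not been expanded is split into two tokens at the apostrophe, which suggests why the contractions are expanded before this step. Accented letters are also treated as separators, so a word like "café" would be cut down to "caf".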

This code reads a text file containing the most common English words, one per line. It then adds the strings of the numbers from 0 to 999999, as I assume we don't want to count numbers as spelling errors. The resulting words are stored in a set named "word_set".

In [8]:
with open("word_list.txt", "r") as my_words:  # the with block closes the file after reading
    text1 = my_words.read()  # this is a text file with 1 word per line
words_into_list = text1.split("\n")  # split the text file at every line
words_into_list = words_into_list + [str(i) for i in range(0, 1000000)]  # add the number strings "0" to "999999"
word_set = set(words_into_list)

print(words_into_list[0:6]) 

['the', 'of', 'in', 'a', 'and', 'is']
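An alternative to adding a million number strings to the set is to test whether a token is made up of digits at lookup time. The sketch below is one possible way to do that, not the approach used in this notebook; the helper `is_known` and the toy vocabulary are hypothetical.

```python
def is_known(word, vocabulary):
    """Return True if word is in the vocabulary or consists only of ASCII digits."""
    return word in vocabulary or (word.isascii() and word.isdigit())


toy_vocabulary = {"the", "of", "in"}  # stand-in for word_set, for illustration only

print(is_known("the", toy_vocabulary))      # True
print(is_known("1234567", toy_vocabulary))  # True
print(is_known("becuase", toy_vocabulary))  # False
```

This behaves slightly differently from the set-based version: numbers above 999999 and strings with leading zeros such as "007" are also accepted.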


Now we will create a data frame consisting of essays and their error count.

We import pandas. We then loop through each word of each essay, checking whether the word is contained in word_set. If it is not, we increment the error count.

data_spell is a list of pairs [essay, number of errors].

In [9]:
import pandas as pd 

In [10]:
data_spell = []
for essay_index in range(0, len(words_no_punct)):  # loop through essays
    errors = 0
    for word in words_no_punct[essay_index]:  # loop through words in each essay
        if word not in word_set:  # a word missing from word_set counts as one error
            errors = errors + 1
    data_spell.append([text_list[essay_index], errors])  # append the pair [original essay, number of errors]



I include the code below to serve as a test.
The code counts the errors in one essay, then prints the tokenised essay, the error count, and the erroneous words. This can be used to check whether the spell checker is working well.

In [11]:
errors = 0
i = 12  # change i to analyse different essays
bad = []  # this will be a list of misspelled words
for word in words_no_punct[i]:
    if word not in word_set:
        errors=errors+1
        bad.append(word)
print(words_no_punct[i])
print(errors)
print(bad)


['so', 'if', 'you', 'are', 'a', 'nasa', 'scientist', 'you', 'should', 'be', 'able', 'to', 'tell', 'me', 'the', 'whole', 'story', 'about', 'the', 'face', 'on', 'mars', 'which', 'obviously', 'is', 'evidence', 'that', 'there', 'is', 'life', 'on', 'mars', 'and', 'that', 'the', 'face', 'was', 'created', 'by', 'aliens', 'correct', 'no', 'twenty', 'five', 'years', 'ago', 'our', 'viking', '1', 'spacecraft', 'was', 'circling', 'the', 'planet', 'snapping', 'photos', 'when', 'it', 'spotted', 'the', 'shadowy', 'likeness', 'of', 'a', 'human', 'face', 'us', 'scientists', 'figured', 'out', 'that', 'it', 'was', 'just', 'another', 'martian', 'mesa', 'common', 'around', 'cydonia', 'only', 'this', 'one', 'had', 'shadows', 'that', 'made', 'it', 'look', 'like', 'an', 'egyption', 'pharaoh', 'very', 'few', 'days', 'later', 'we', 'revealed', 'the', 'image', 'for', 'all', 'to', 'see', 'and', 'we', 'made', 'sure', 'to', 'note', 'that', 'it', 'was', 'a', 'huge', 'rock', 'formation', 'that', 'just', 'resembled', 
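When checking the spell checker by hand, it can help to see how often each flagged word occurs. The sketch below does this with `collections.Counter`; the function `misspelled_counts` and the toy data are hypothetical and are not taken from the essay set.

```python
from collections import Counter


def misspelled_counts(words, vocabulary):
    """Count how often each out-of-vocabulary word occurs in a tokenised essay."""
    return Counter(word for word in words if word not in vocabulary)


# Toy data for illustration only.
toy_vocabulary = {"the", "cars", "are", "what", "you", "expect"}
toy_words = ["the", "cars", "are", "exaclty", "what", "you", "expect", "exaclty", "wolrd"]

counts = misspelled_counts(toy_words, toy_vocabulary)
print(sum(counts.values()))  # 3 (total errors, counting repeats)
print(counts.most_common())  # [('exaclty', 2), ('wolrd', 1)]
```

With the notebook's variables, `misspelled_counts(words_no_punct[i], word_set)` would give the same total as `errors` in the test cell above, since both count every occurrence of a word missing from the set.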

Below we create the dataframe consisting of essays and the corresponding number of errors.

In [12]:
df_errors = pd.DataFrame(data_spell, columns=['essay', 'number of spelling errors'])
df_errors

| | essay | number of spelling errors |
|---|---|---|
| 0 | Some people belive that the so called "face" o... | 4 |
| 1 | Driverless cars are exaclty what you would exp... | 10 |
| 2 | Dear: Principal\n\nI am arguing against the po... | 2 |
| 3 | Would you be able to give your car up? Having ... | 5 |
| 4 | I think that students would benefit from learn... | 0 |
| 5 | The seagoing cowboy program is the best thing ... | 0 |
| 6 | Venus also known as the Earth's "twin" is simi... | 8 |
| 7 | It is every student's dream to be able to loun... | 0 |
| 8 | Cars have been an issue to our community for a... | 4 |
| 9 | Phones & Driving\n\nWaking up from a wonderful... | 9 |


## Notes:
- The spell checker will not detect correctly spelled words that are used incorrectly. For example, "its" used in place of "it's" is not counted as an error.
- The English word list contains mistakes. For example, it contains "becuase".
- I have not analysed ChatGPT essays yet.