# CS 2300 Lab 2: Search Terms 

By: Cecelia Henson 

### Introduction

The purpose of this lab is to analyze search terms using lists, sets, and dictionaries in Python for basic data cleaning and natural language processing tasks. Search terms provided by Direct Supply's DSS eProcedure system containing queries entere by food-service users over a 60 day period from 2019

The csv file will filter search queries and create single tokens with no spaces and a frequency dictrionary from that list and is then searched from most searched to least searched terms. Then a second dictionary is created with frequency of the original filtered list but that has been spellchecked with the spellchecker library.


#### Importing SpellChecker and csv and creating list and dictionary

In [1]:
import csv
from spellchecker import SpellChecker

csv_freq_dict = {}
csv_freq_spellchecked = []

## Upload List:

The **import_csv_list** function imports a CSV file and creates a list of the first item in each row 

In [2]:
def import_csv_list(csv):
    temp = []
    csv_raw_data = []
    i = 0
    with open(csv ,encoding = "utf8") as file:
        for line in file:
            if i == 70000:
                break
            temp.append(line.rstrip('\n').split(','))
            i+=1
    csv_raw_data = [str(row[0]) for row in temp]
    file.close
    return csv_raw_data

#import_csv_list('searchTerms.csv')
    

## Split Tokens:

The **split_tokens** function takes a list of string and creates a new list where each string is split by spaces

In [3]:
def split_tokens(original_list):
    new_list = []
    for item in original_list:
        new_list.extend(item.split(" "))
    return new_list

#split_tokens(import_csv_list('searchTerms.csv'))

## Web Spaces: 

The **remove_web_spaces** function relaces the web spaces (%20) with a regular space from a string token

In [4]:
def remove_web_spaces(token):
    token.replace("%20", " ")
    return token

## Non-Alphabet:

The remove **non_alphabet** function removed any non-alphabet characters from the string token 

In [5]:
def remove_non_alphabet(token):
    fixed_token = ""
    for char in token:
        if(char.isalpha() or char == " "):
            fixed_token = fixed_token + char
    return fixed_token

## List Frequency:

The **list_to_freq_dict** function creates a frequency dictionary given a list of strings where the key is a string and the key-value is how many time the string apprears in the list

In [6]:
def list_to_freq_dict(input_list):
    freq_dict = {}
    for i in input_list:
        freq_dict[i] = input_list.count(i)
    return freq_dict

## Sort Frequency:

The **sort_freq_dict** function creates a sorted frequency list given a frequency dictionary

In [7]:
def sort_freq_dict(freq_dict):
    sorted_list = [(freq_dict[key], key) for key in freq_dict]
    sorted_list.sort()
    sorted_list.reverse()
    return sorted_list

## SpellChecker: 

The **spellcheck_dict_init** function creates a spellchecker dictionary where the key is the misspelled word and they key-value is the most likely correct spelled word

In [8]:
def spellcheck_dict_init(input_list):
    spell = SpellChecker(distance=1)
    spellchecked_dict = {}
    for word in input_list:
        spellchecked_dict[word] = spell.correction(word)
    return spellchecked_dict

## SpellCheck Token: 

The **spellcheck_token** function returns the most likely corrected word given a misspelled string token

In [9]:
def spellcheck_token(token):
    fixed_token = csv_spellcheck_dict[token]
    return fixed_token

## Create CSV:

The **list_to_csv** function creates a csv files from a frequency list

In [10]:
def list_to_csv(input_list):
    fields = ["Frequency", "Word"]
    with open("frequencyOfSearchTerms.csv", "w", newline = "") as file:
        write = csv.writer(file)
        write.writerow(fields)
        write.writerows(input_list)

## Run Benchmark

In [11]:
# Importing csv to searchterms list
%time csv_raw = import_csv_list("searchTerms.csv")

Wall time: 103 ms


In [12]:
%%time
# This filters the data from the csv file by removing non 
# alphabet character and replacing web spaces with spaces, 
# while splitting search terms by word
csv_filtered = []
for i in range(len(csv_raw)):
    temp = remove_non_alphabet(remove_web_spaces(csv_raw[i]))
    csv_filtered.append(temp)
    
#print(csv_filtered)

Wall time: 94 ms


In [13]:
# This splits the filtered list into tokens
%time csv_filtered = split_tokens(csv_filtered)

#print(csv_filtered)

Wall time: 19.4 ms


In [14]:
%%time
# This section removes all the blank serch terms
# that are generated by the removal of non-alphabet 
# characters
csv_fixed = []
for word in csv_filtered:
    if len(word) != 0:
        csv_fixed.append(word)
        
#print(csv_fixed)

Wall time: 18.9 ms


In [15]:
# This createsa frequency dictionary of the filterd data
# A list is also creatd to sort the frequency dictionary
# from most frequent to least frequent
%time csv_freq_dict = list_to_freq_dict(csv_fixed)
%time csv_freq_list = sort_freq_dict(csv_freq_dict)


#print(csv_freq_dict)
#print(csv_freq_list)

Wall time: 1min 57s
Wall time: 15.1 ms


In [16]:
%%time
# This creates a spellchecked version of the fully filteres 
# version and creates a frequency dictionary and sorted
# list of the most frequent 
csv_spellcheck_dict = spellcheck_dict_init(csv_fixed)
csv_spellchecked = []
for word in csv_fixed:
    csv_spellchecked.append(spellcheck_token(word))
##print(csv_spellchecked)

Wall time: 4.63 s


In [17]:
%time csv_spellchecked_dict = list_to_freq_dict(csv_spellchecked)
%time csv_spellchecked_freq = sort_freq_dict(csv_spellchecked_dict)
#print(csv_spellchecked_freq)

Wall time: 1min 35s
Wall time: 2.04 ms


In [18]:
# Creates a csv file of spellchecked search terms 
# frequency list from most frequent to least
# frequent
%time list_to_csv(csv_spellchecked_freq)

Wall time: 6.8 ms


## Conclusion

**1:** The most frequent search token for both the non-spellchecked data and spell-checked data was bacon which had 459 search queries, followed by 180 queries of milk, 168 queries of chicken and 140 queries of beef. I believe that this could be a result of the "typical" American breakfast that people have in their household and nursing homes. Humans require food to function and having a nutritious breakfast is important for health and wellness which shows in the high queries for items that are in the American breakfast

**2:** My hypothesis for the "nonwords" within the original list of search terms csv is that they could be a lookup code associated to a specific product that certain companies or nursing homes order frequently. That could relate to allergy reqirements or food restrictions and by having the "nonwords" makes the organization or lookup a lot easier to confirm it is correct.

**3:** I think that they spellchecked data is more accurate than the non-spellchecked version of the data but only by a small amount. The accuracy is due to how to how it fixed the misspelling containing small errors and the spellchecker has a decent accuracy in its spelling prediction. The only aspect that the spellchecking could be an issue is that spelling properties like spelling potato as potatoe. The spellchecking could see that as accidentally misspelling the plural form and add a "s" instead of correcting it to potato and removing the "e"

**4:** The longest running cell is **list_to_freq_dict** because of the count method that had to iterate through each item in the list. The Big-O would be O(n^2) operation due to having to iterate through the each element in the list. When running the non-spellchecked the wall time was 3.23s and spellchecked the wall time was 3.41s.

**5:** I predict that if the  list is 10x bigger it would take 100x longer and at 100x bigger it would take 10000x longer