# CS 2300 Lab 3: Search Terms with Pandas

Cecelia Henson

### Introduction
The purpose of this lab is to build on my **Lab 2: Search Terms**, where I analyzed search terms using lists, sets, and dictionaries in Python for basic data cleaning and natural language processing tasks. The search terms were provided by Direct Supply's DSS eProcedure system and contain queries entered by food-service users over a 60-day period in 2019.

In this lab, however, I use pandas to store the full list of tokens instead of the built-in Python data structures combined with functional programming, with the goal of improving performance while getting the same results as in **Lab 2**.

**Importing necessary libraries and creating list and dictionary**

In [1]:
import csv
from spellchecker import SpellChecker
import pandas as pd
import sys
freq_dict = {}
freq_dict_spellchecked = []

#### Upload List:
The **import_csv_list** function reads a CSV file, takes the first item in each row, replaces the web-encoded spaces (`%20`) with real spaces, splits the search terms on spaces, and returns the resulting tokens as a dataframe

Note: I had to limit the size of the data set (to the first 200,000 lines) so that the list version would run in a reasonable amount of time; the dataframe version had no runtime issues

In [2]:
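def import_csv_list(csv_path):
    # csv_path (not csv) so the parameter does not shadow the imported csv module
    first_fields = []
    # The with-block closes the file automatically, so no explicit close is needed
    with open(csv_path, encoding="utf8") as file:
        for i, line in enumerate(file):
            if i == 200000:
                break
            # Keep only the first comma-separated item of each row
            first_fields.append(line.rstrip('\n').split(',')[0])
    # Replace web-encoded spaces, then split each search term into tokens
    remove_web_spaces = [field.replace("%20", " ") for field in first_fields]
    split_on_space_list = []
    for item in remove_web_spaces:
        split_on_space_list.extend(item.split(" "))
    df = pd.DataFrame({"Raw Data": split_on_space_list})
    return df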
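
The same decode-and-split step can also be expressed with pandas string methods. The following is a minimal sketch rather than the version used in this lab; `tokenize_terms` and the sample terms are hypothetical and only for illustration.

```python
import pandas as pd

def tokenize_terms(terms):
    """Hypothetical helper: decode %20 and split each search term into tokens."""
    series = pd.Series(terms, dtype="object")
    tokens = (
        series.str.replace("%20", " ", regex=False)  # literal replacement, no regex
        .str.split(" ")                              # one list of tokens per term
        .explode()                                   # one row per token
        .reset_index(drop=True)
    )
    return pd.DataFrame({"Raw Data": tokens})

# Made-up sample input, not taken from searchTerms.csv
sample = ["diced%20chicken", "bacon", "orange%20juice"]
print(tokenize_terms(sample)["Raw Data"].tolist())
# ['diced', 'chicken', 'bacon', 'orange', 'juice']
```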

#### List Frequency:
Creates a frequency dictionary from a list of strings, where each key is a string and its value is the number of times that string appeared in the list

In [3]:
def list_to_freq_dict(input_list):
    freq_dict = {}
    for i in input_list:
        freq_dict[i] = input_list.count(i)
    return freq_dict
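
Note that `input_list.count(i)` scans the entire list once for every element, so the work grows with the square of the list length. One possible single-pass alternative, shown here only as a sketch, uses `collections.Counter` from the standard library; the function names and sample tokens below are hypothetical.

```python
from collections import Counter

def list_to_freq_dict_single_pass(input_list):
    """Hypothetical alternative: count every token in one pass over the list."""
    return dict(Counter(input_list))

def list_to_freq_dict_count(input_list):
    # Same logic as list_to_freq_dict: one full list scan per element
    return {token: input_list.count(token) for token in input_list}

tokens = ["chicken", "bacon", "chicken", "", "juice", "chicken"]  # made-up tokens
# Both approaches produce the same dictionary
assert list_to_freq_dict_single_pass(tokens) == list_to_freq_dict_count(tokens)
print(list_to_freq_dict_single_pass(tokens))
# {'chicken': 3, 'bacon': 1, '': 1, 'juice': 1}
```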

#### Sort Frequency:
Creates a sorted frequency list given a frequency dictionary

In [4]:
def sort_freq_dict(freq_dict):
    sorted_list = [(freq_dict[key], key) for key in freq_dict]
    sorted_list.sort()
    sorted_list.reverse()
    return sorted_list
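
Because the list holds `(count, key)` tuples, the sort orders by count first and breaks ties by key, so after reversing, tied words appear in reverse alphabetical order. A small sketch with made-up counts, using a hypothetical variant built on `sorted(..., reverse=True)`:

```python
def sort_freq_dict_sorted(freq_dict):
    """Hypothetical variant of sort_freq_dict using sorted() with reverse=True."""
    return sorted(((count, key) for key, count in freq_dict.items()), reverse=True)

toy_counts = {"bacon": 2, "chicken": 3, "beef": 2}  # made-up counts
print(sort_freq_dict_sorted(toy_counts))
# [(3, 'chicken'), (2, 'beef'), (2, 'bacon')]
```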

#### SpellChecker
Creates a spellchecker dictionary where each key is a (possibly misspelled) word and its value is the most likely corrected word

In [5]:
def spellcheck_dict_init(input_list):
    spell = SpellChecker(distance=1)
    spellchecked_dict = {}
    for word in input_list:
        spellchecked_dict[word] = spell.correction(word)
    return spellchecked_dict
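
Since many tokens repeat, the correction only needs to be computed once per distinct token. The sketch below illustrates that idea with a stand-in corrector so it runs without the spellchecker package; `build_correction_map`, `toy_correct`, and the sample data are hypothetical.

```python
import pandas as pd

def build_correction_map(tokens, correct):
    """Hypothetical helper: call `correct` once per distinct non-empty token."""
    return {token: correct(token) for token in set(tokens) if token != ""}

# Stand-in corrector for illustration only; in this lab that role is played
# by SpellChecker(distance=1).correction
toy_corrections = {"chiken": "chicken", "bacon": "bacon"}

def toy_correct(token):
    return toy_corrections.get(token.lower(), token.lower())

tokens = pd.Series(["chiken", "bacon", "", "chiken"])  # made-up tokens
correction_map = build_correction_map(tokens.tolist(), toy_correct)
# Empty tokens are not in the map, so map() yields NaN for them; restore ""
spellchecked = tokens.map(correction_map).fillna("")
print(spellchecked.tolist())
# ['chicken', 'bacon', '', 'chicken']
```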

#### Spell check token
Given a misspelled string token, returns the most likely corrected word

In [6]:
def spellcheck_token(token):
    fixed_token = ''
    if(token != ''):
        fixed_token = spellcheck_dict[token]
    return fixed_token

### Testing
This cell imports the CSV into the search term data frame and filters the data by removing non-alphabetic characters

In [7]:
df = import_csv_list("searchTerms.csv")

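# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
%time df["Removed Numbers"] = df["Raw Data"].str.replace('[0-9]', '', regex=True)
%time df["Letters Only"] = df["Removed Numbers"].str.replace('[^a-zA-Z]', '', regex=True)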



Wall time: 206 ms
Wall time: 175 ms
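
Because `[^a-zA-Z]` matches every character that is not a letter, digits included, the second pattern alone should produce the same "Letters Only" column; the intermediate column is still handy for inspecting each step. A minimal sketch on made-up tokens:

```python
import pandas as pd

raw = pd.Series(["CMED", "36969", "DYNC1815H", "2%-milk"])  # made-up tokens

# Two-step filter, using the same patterns as the cell above
two_step = (
    raw.str.replace("[0-9]", "", regex=True)
    .str.replace("[^a-zA-Z]", "", regex=True)
)
# One-step filter: digits already fall outside a-z and A-Z
one_step = raw.str.replace("[^a-zA-Z]", "", regex=True)

assert two_step.equals(one_step)
print(one_step.tolist())
# ['CMED', '', 'DYNCH', 'milk']
```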




This cell spellchecks the dataset

In [8]:
%%time
spellcheck_dict = spellcheck_dict_init(df["Letters Only"].tolist())

df["Spellchecked"] = df["Letters Only"].map(lambda token: spellcheck_token(token))
df.head(10)

Wall time: 10.4 s


| | Raw Data | Removed Numbers | Letters Only | Spellchecked |
|---|---|---|---|---|
| 0 | SearchTerm | SearchTerm | SearchTerm | |
| 1 | 36969 | | | |
| 2 | CMED | CMED | CMED | med |
| 3 | 500100 | | | |
| 4 | KEND | KEND | KEND | kind |
| 5 | 5750 | | | |
| 6 | CMED | CMED | CMED | med |
| 7 | 980228 | | | |
| 8 | DYNC1815H | DYNCH | DYNCH | lynch |
| 9 | DYND70642 | DYND | DYND | dyed |


This cell benchmarks how long the data frame approach takes to build the sorted frequency list of search terms

In [9]:
%time series_freq = df["Spellchecked"].value_counts(dropna = True)
series_freq.head(10)

Wall time: 25.9 ms


           42701
chicken     3640
bacon       2698
juice       2627
beef        2581
cream       2575
cheese      2443
pork        2193
green       2143
diced       2017
Name: Spellchecked, dtype: int64
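
The first row of the output has a blank label: it counts the tokens that became empty strings after filtering, such as purely numeric search terms. If only real words are of interest, that entry could be dropped. A small sketch on made-up tokens:

```python
import pandas as pd

spellchecked = pd.Series(["chicken", "", "bacon", "chicken", "", ""])  # made-up tokens

# value_counts returns the counts sorted from most to least frequent
counts = spellchecked.value_counts(dropna=True)
# Drop the empty-string token; errors="ignore" avoids a KeyError if it is absent
word_counts = counts.drop(labels="", errors="ignore")

print(word_counts.to_dict())
# {'chicken': 2, 'bacon': 1}
```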

This cell benchmarks how long the dictionary and list approach takes to build the sorted frequency list of search terms

In [10]:
%%time
spellcheck_freq_dict = list_to_freq_dict(df["Spellchecked"].tolist())
spellcheck_freq_list = sort_freq_dict(spellcheck_freq_dict)

Wall time: 19min 1s
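
To see how the count-based approach scales without waiting 19 minutes, one could time it on small synthetic inputs. The sketch below uses made-up tokens and hypothetical helper names; actual timings depend on the machine, so none are shown.

```python
import random
import timeit
from collections import Counter

def count_based(tokens):
    # Same approach as list_to_freq_dict: one list.count() scan per element
    return {token: tokens.count(token) for token in tokens}

def time_both(n, seed=0):
    """Hypothetical benchmark on n synthetic tokens.

    Returns (count_based_seconds, counter_seconds).
    """
    rng = random.Random(seed)
    tokens = [f"w{rng.randrange(500)}" for _ in range(n)]
    slow = timeit.timeit(lambda: count_based(tokens), number=1)
    fast = timeit.timeit(lambda: Counter(tokens), number=1)
    return slow, fast

for n in (1000, 2000, 4000):
    slow, fast = time_both(n)
    print(f"n={n}: count-based {slow:.4f} s, Counter {fast:.4f} s")
```

If each doubling of `n` roughly quadruples the count-based time while the `Counter` time only roughly doubles, that points to the repeated list scans as the bottleneck.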


This cell benchmarks the size, in bytes, of the sorted frequency data frame

In [11]:
series_freq.memory_usage(deep = True)

371725

This cell benchmarks the size, in bytes, of the sorted frequency list

In [12]:
sys.getsizeof(spellcheck_freq_list)

47160
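
`sys.getsizeof` reports only the list object itself (essentially its array of references), not the tuples, integers, and strings it points to, whereas `memory_usage(deep=True)` also counts string contents. A closer comparison would add up the referenced objects as well. The helper below is a rough, hypothetical sketch; it may overcount objects that Python shares, such as small integers.

```python
import sys

def deep_size_of_freq_list(freq_list):
    """Hypothetical helper: bytes used by the list plus its tuples, counts, and strings."""
    total = sys.getsizeof(freq_list)
    for entry in freq_list:
        count, token = entry
        # Add the tuple itself and the two objects it references
        total += sys.getsizeof(entry) + sys.getsizeof(count) + sys.getsizeof(token)
    return total

toy_freq_list = [(3, "chicken"), (2, "bacon"), (1, "juice")]  # made-up data
shallow = sys.getsizeof(toy_freq_list)
deep = deep_size_of_freq_list(toy_freq_list)
print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```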

## Conclusion



Removing all non-alphabetic characters from the dataset took two steps: 206 ms to remove the digits, and then 175 ms to remove every remaining character that is not a letter, in the run shown above. The second filter was most likely faster because it had fewer operations to do: the digits had already been stripped, so its input strings were shorter.

Building the sorted frequency list took 25.9 ms with the data frame, while the list version took 19 minutes and 1 second. The difference comes from how the counting is done. The **value_counts** function in the pandas library counts every token in a single pass through the data. The list version first goes through the list_to_freq_dict function, which calls `input_list.count(i)` for every token and so rescans the whole list each time, and then passes that result to the sort_freq_dict function to sort and reverse it. With the way I programmed this lab, the runtime therefore grows much faster than the size of the data, which is why I could only use 200,000 entries: with any more, the list version would take too long.

Compared with the previous lab, pandas proved useful for reducing the amount of code needed for data manipulation while also giving a faster runtime. However, the measurements above suggest that it comes at a cost in memory: the list version of the spellchecked data took 19 minutes but measured only 47,160 bytes, while the data frame version took about 25 ms but measured 371,725 bytes, and memory can get very expensive. These two figures are not directly comparable, though, because `sys.getsizeof` counts only the list object itself and not the tuples and strings it refers to, while `memory_usage(deep=True)` also counts the string contents. Data frames can also be difficult to follow at times: since there are no function headers describing what each one contains, you have to read from top to bottom when you are handed someone else's project with many data frames, or when you return to a personal project that you haven't worked on in a while.