<h1>Preprocessing</h1>

We want to preprocess the review text with the following steps:
- lemmatization
- remove stop words
- normalize spaces
- remove special characters
- lowercase everything

In [1]:
from dask.distributed import Client 
import dask.dataframe as dd

In [2]:
# Load the raw reviews lazily; head() only computes the first partition
raw_df = dd.read_parquet('./data/reviews.parquet')
raw_df.head()

Unnamed: 0,user_id,name,time,rating,text,gmap_id
0,1.049447e+20,Jaron Whitfield,1517731762839,5,"Joe is quite unique of his line of work, he as...",0x89c6c63c8cd87141:0x54d0d283872eecbb
1,1.176097e+20,Jonathan McCarthy,1476276291163,5,For such a small place their impact on my life...,0x89c6c63c8cd87141:0x54d0d283872eecbb
2,1.105632e+20,Rocky Kev,1338826945578,5,I usually give them a call before I stop by to...,0x89c6c63c8cd87141:0x54d0d283872eecbb
3,1.12896e+20,Josep Valls,1363286110554,5,My bike had been sitting outdoors for a good w...,0x89c6c63c8cd87141:0x54d0d283872eecbb
4,1.106197e+20,Timaree Schmit,1548798329760,5,Always an easy experience. Service is knowledg...,0x89c6c63c8cd87141:0x54d0d283872eecbb


### Lowercase

Lowercase the text, collapse runs of whitespace, and strip punctuation and other special characters. Digits are kept, since `\w` matches them.

In [3]:
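# Lowercase and drop special characters first, then collapse whitespace,
# so that a removed symbol cannot leave a double space behind.
text = raw_df['text'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
raw_df['text'] = text.str.replace(r'\s+', ' ', regex=True).str.strip()
raw_df.head()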


Unnamed: 0,user_id,name,time,rating,text,gmap_id
0,1.049447e+20,Jaron Whitfield,1517731762839,5,joe is quite unique of his line of work he ass...,0x89c6c63c8cd87141:0x54d0d283872eecbb
1,1.176097e+20,Jonathan McCarthy,1476276291163,5,for such a small place their impact on my life...,0x89c6c63c8cd87141:0x54d0d283872eecbb
2,1.105632e+20,Rocky Kev,1338826945578,5,i usually give them a call before i stop by to...,0x89c6c63c8cd87141:0x54d0d283872eecbb
3,1.12896e+20,Josep Valls,1363286110554,5,my bike had been sitting outdoors for a good w...,0x89c6c63c8cd87141:0x54d0d283872eecbb
4,1.106197e+20,Timaree Schmit,1548798329760,5,always an easy experience service is knowledge...,0x89c6c63c8cd87141:0x54d0d283872eecbb
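
To see what the two patterns do, it can help to apply them to a single string with Python's `re` module. The sketch below is illustrative only: the sample sentence and the helper names `strip_special` and `collapse_spaces` are made up and are not part of the notebook. It shows that digits survive because `\w` matches them, and that a symbol sitting between two spaces leaves a double space unless whitespace is collapsed last.

```python
import re

def strip_special(text: str) -> str:
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', text)

def collapse_spaces(text: str) -> str:
    # Collapse each run of whitespace to one space and trim both ends.
    return re.sub(r'\s+', ' ', text).strip()

sample = "Great  shop & staff, fixed my bike in 2 days!"  # made-up review text
lowered = sample.lower()

spaces_first = strip_special(collapse_spaces(lowered))
spaces_last = collapse_spaces(strip_special(lowered))

print(repr(spaces_first))  # 'great shop  staff fixed my bike in 2 days'
print(repr(spaces_last))   # 'great shop staff fixed my bike in 2 days'
```

Collapsing whitespace as the final step is the safer order if later steps split on single spaces.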


### Lemmatization

Lemmatize the text with NLTK's WordNet lemmatizer, then remove stop words.

In [4]:
import nltk 
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('stopwords')

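lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    """Lemmatize each whitespace-separated token; pass non-strings through."""
    if not isinstance(text, str):
        return text
    # lemmatize() defaults to pos='n', so only noun forms are reduced;
    # verb forms such as 'sitting' are left unchanged.
    words = text.split()
    lemmatized = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized)

raw_df['lemmatized_text'] = raw_df['text'].map(lemmatize_text, meta=('lemmatized_text', 'str'))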
raw_df.head()
 

[nltk_data] Downloading package wordnet to /Users/alanwu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alanwu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,user_id,name,time,rating,text,gmap_id,lemmatized_text
0,1.049447e+20,Jaron Whitfield,1517731762839,5,joe is quite unique of his line of work he ass...,0x89c6c63c8cd87141:0x54d0d283872eecbb,joe is quite unique of his line of work he ass...
1,1.176097e+20,Jonathan McCarthy,1476276291163,5,for such a small place their impact on my life...,0x89c6c63c8cd87141:0x54d0d283872eecbb,for such a small place their impact on my life...
2,1.105632e+20,Rocky Kev,1338826945578,5,i usually give them a call before i stop by to...,0x89c6c63c8cd87141:0x54d0d283872eecbb,i usually give them a call before i stop by to...
3,1.12896e+20,Josep Valls,1363286110554,5,my bike had been sitting outdoors for a good w...,0x89c6c63c8cd87141:0x54d0d283872eecbb,my bike had been sitting outdoors for a good w...
4,1.106197e+20,Timaree Schmit,1548798329760,5,always an easy experience service is knowledge...,0x89c6c63c8cd87141:0x54d0d283872eecbb,always an easy experience service is knowledge...


In [5]:
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

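def remove_stopwords(text):
    """Drop NLTK's English stop words; pass non-strings through."""
    if not isinstance(text, str):
        return text
    # word_tokenize needs NLTK's Punkt tokenizer data to be installed.
    words = word_tokenize(text)
    filtered = [word for word in words if word not in stop_words]
    return ' '.join(filtered)

raw_df['without_stopwords'] = raw_df['lemmatized_text'].map(remove_stopwords, meta=('without_stopwords', 'str'))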

In [6]:
raw_df.head()

Unnamed: 0,user_id,name,time,rating,text,gmap_id,lemmatized_text,without_stopwords
0,1.049447e+20,Jaron Whitfield,1517731762839,5,joe is quite unique of his line of work he ass...,0x89c6c63c8cd87141:0x54d0d283872eecbb,joe is quite unique of his line of work he ass...,joe quite unique line work assist problem bicy...
1,1.176097e+20,Jonathan McCarthy,1476276291163,5,for such a small place their impact on my life...,0x89c6c63c8cd87141:0x54d0d283872eecbb,for such a small place their impact on my life...,small place impact life huge keep family bike ...
2,1.105632e+20,Rocky Kev,1338826945578,5,i usually give them a call before i stop by to...,0x89c6c63c8cd87141:0x54d0d283872eecbb,i usually give them a call before i stop by to...,usually give call stop see store bought random...
3,1.12896e+20,Josep Valls,1363286110554,5,my bike had been sitting outdoors for a good w...,0x89c6c63c8cd87141:0x54d0d283872eecbb,my bike had been sitting outdoors for a good w...,bike sitting outdoors good expectation ever ab...
4,1.106197e+20,Timaree Schmit,1548798329760,5,always an easy experience service is knowledge...,0x89c6c63c8cd87141:0x54d0d283872eecbb,always an easy experience service is knowledge...,always easy experience service knowledgeable s...
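
The stop-word filter boils down to a set-membership test per token. Here is a minimal sketch, assuming a tiny hand-picked stop list and plain `str.split` in place of NLTK's English list and `word_tokenize`; `TOY_STOP_WORDS` and `drop_stopwords` are hypothetical names used only for this illustration.

```python
# Hand-picked stand-in for NLTK's English stop list (assumption, not the real list).
TOY_STOP_WORDS = {"is", "of", "his", "he", "a", "the", "my", "on"}

def drop_stopwords(text, stop_words=TOY_STOP_WORDS):
    # Non-string values (for example missing reviews) are returned unchanged.
    if not isinstance(text, str):
        return text
    # Keep only the tokens that are not in the stop set.
    return ' '.join(word for word in text.split() if word not in stop_words)

print(drop_stopwords("joe is quite unique of his line of work"))
# joe quite unique line work
print(drop_stopwords(None))
# None
```

Using a `set` keeps each membership check constant-time on average, which matters when the function is mapped over every review.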


In [None]:
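# Keep only the fully cleaned text, drop the intermediate columns,
# and write the result to Parquet.
raw_df['text'] = raw_df['without_stopwords']
raw_df = raw_df.drop(columns=['lemmatized_text', 'without_stopwords'])
raw_df.to_parquet('./data/cleaned_reviews.parquet', engine='pyarrow')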