# **Text Preprocessing Using spaCy**

## **Challenge**

Data preprocessing is a crucial step in building any machine learning model, and the model's performance depends on how well the data is preprocessed. Likewise, text preprocessing is the first step in any natural language processing task, before any model is built. Text preprocessing involves several steps and techniques, supported by various built-in modules in Python, but which steps and techniques to use depends solely on the problem domain.

In this challenge, you will work with the spaCy and NLTK libraries and note your observations from each task provided in the subsequent slides.

About the dataset: Extracted from Kaggle, 'cooking-light'.

This dataset includes 15,404 reviews related to cooking experiences. Consider only the content attribute for these challenge tasks.

## **Task 1.1: Import spaCy and the English module of the spaCy library**


In [2]:
import spacy
from spacy.lang.en import English

## **Task 1.2: Load the en_core_web_sm model**

In [3]:
nlp = spacy.load("en_core_web_sm")

## **Task 1.3: Load the data**

In [None]:
from google.colab import files
upload = files.upload()

Saving DS3_C2_S3_CookingReview_Data_Challenge.csv to DS3_C2_S3_CookingReview_Data_Challenge.csv


In [None]:
import pandas as pd

Data = pd.read_csv("DS3_C2_S3_CookingReview_Data_Challenge.csv")
Data

Unnamed: 0,id,title,content,tags
0,1,get chewy chocolate chip cookies,chocolate chips cookies always crisp get chewy...,"['baking', 'cookies', 'texture']"
1,2,cook bacon oven,heard people cooking bacon oven laying strips ...,"['oven', 'cooking-time', 'bacon']"
2,3,difference white brown eggs,always use brown extra large eggs honestly say...,['eggs']
3,4,difference baking soda baking powder,use one place certain recipes,"['substitutions', 'please-remove-this-tag', 'b..."
4,5,tomato sauce recipe cut acidity,seems every time make tomato sauce pasta sauce...,"['sauce', 'pasta', 'tomatoes', 'italian-cuisine']"
...,...,...,...,...
15399,73670,poached eggs altitude,recently signed america test kitchen cooking s...,"['eggs', 'poaching', 'high-altitude']"
15400,73678,thicken buttercream without adding sugar,made buttercream frosting brownies using recip...,['frosting']
15401,73680,looking old italian recipe chamellas,italian mom gowould pour flour board place egg...,['baking']
15402,73681,make ice cream artificial sweetener,wonder artificial sweetener like sucralose ery...,['ice-cream']


## **Task 1.4: Check for missing values**

In [None]:
Data.isnull().sum()

id         0
title      0
content    0
tags       0
dtype: int64

## **Task 2.1: Use NLTK to remove the stop words from the review text of the 'content' attribute**

In [None]:
Data_string = " ".join(x for x in Data['content'])

Data_string[0:2000]    # [0:2000] = for first 2000 characters including white space

'chocolate chips cookies always crisp get chewy cookies like starbucks thank everyone answered far tip biggest impact chill rest dough however also increased brown sugar ratio increased bit butter also adding maple syrup helped heard people cooking bacon oven laying strips cookie sheet using method long cook bacon temperature always use brown extra large eggs honestly say habit point distinct advantages disadvantages like flavor shelf life etc use one place certain recipes seems every time make tomato sauce pasta sauce little bit acid taste tried using sugar sodium bicarbonate satisfied results recipe calls fresh parsley substituted fresh herbs dried equivalents fresh dried parsley something else ex another dried herb use instead parsley know used mainly looks rather taste pasta recipe calls 2 tablespoons parsley sauce another 2 tablespoons top done know parsley top looks must something taste otherwise would call parsley within sauce well would especially like hear substitutes availabl

In [None]:
len(Data_string)

4571879

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Remove stop words using NLTK
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

ex_sent = Data_string[0:1000]   # first 1,000 characters of the joined reviews
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(ex_sent)

# Lowercase each token before the lookup, because NLTK's stop-word list is lowercase
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print(filtered_sentence)

['chocolate', 'chips', 'cookies', 'always', 'crisp', 'get', 'chewy', 'cookies', 'like', 'starbucks', 'thank', 'everyone', 'answered', 'far', 'tip', 'biggest', 'impact', 'chill', 'rest', 'dough', 'however', 'also', 'increased', 'brown', 'sugar', 'ratio', 'increased', 'bit', 'butter', 'also', 'adding', 'maple', 'syrup', 'helped', 'heard', 'people', 'cooking', 'bacon', 'oven', 'laying', 'strips', 'cookie', 'sheet', 'using', 'method', 'long', 'cook', 'bacon', 'temperature', 'always', 'use', 'brown', 'extra', 'large', 'eggs', 'honestly', 'say', 'habit', 'point', 'distinct', 'advantages', 'disadvantages', 'like', 'flavor', 'shelf', 'life', 'etc', 'use', 'one', 'place', 'certain', 'recipes', 'seems', 'every', 'time', 'make', 'tomato', 'sauce', 'pasta', 'sauce', 'little', 'bit', 'acid', 'taste', 'tried', 'using', 'sugar', 'sodium', 'bicarbonate', 'satisfied', 'results', 'recipe', 'calls', 'fresh', 'parsley', 'substituted', 'fresh', 'herbs', 'dried', 'equivalents', 'fresh', 'dried', 'parsley', 
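NLTK's English stop-word list is lowercase, so the membership test only works reliably when each token is lowercased first. The reviews here are already lowercase, so it makes no difference to the output above, but it matters on raw text. A minimal sketch of the effect, using a small hypothetical stop list (`SAMPLE_STOP_WORDS` is an assumption for illustration, not NLTK's full list):

```python
# Hypothetical, tiny stop list used only for illustration (not NLTK's list)
SAMPLE_STOP_WORDS = {"the", "is", "a", "and", "to"}

def remove_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "The dough is chilled and rested".split()

# Case-sensitive check: the capitalized "The" slips through
case_sensitive = [t for t in tokens if t not in SAMPLE_STOP_WORDS]
# Case-insensitive check: "The" is removed as expected
case_insensitive = remove_stop_words(tokens)

print(case_sensitive)    # ['The', 'dough', 'chilled', 'rested']
print(case_insensitive)  # ['dough', 'chilled', 'rested']
```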

## **Task 2.2: Use spaCy to remove the stop words from the review text of the 'content' attribute**

In [None]:
# Remove stop words using spaCy
filtered_sent = []
doc = nlp(Data_string[0:1000])  # run the pipeline: tokenization, tagging, lemmatization, etc.
for word in doc:
  if not word.is_stop:  # is_stop is True for tokens in spaCy's stop-word list
    filtered_sent.append(word)
print("\n \nFiltered Sentences :", filtered_sent)


 
Filtered Sentences : [chocolate, chips, cookies, crisp, chewy, cookies, like, starbucks, thank, answered, far, tip, biggest, impact, chill, rest, dough, increased, brown, sugar, ratio, increased, bit, butter, adding, maple, syrup, helped, heard, people, cooking, bacon, oven, laying, strips, cookie, sheet, method, long, cook, bacon, temperature, use, brown, extra, large, eggs, honestly, habit, point, distinct, advantages, disadvantages, like, flavor, shelf, life, etc, use, place, certain, recipes, time, tomato, sauce, pasta, sauce, little, bit, acid, taste, tried, sugar, sodium, bicarbonate, satisfied, results, recipe, calls, fresh, parsley, substituted, fresh, herbs, dried, equivalents, fresh, dried, parsley, ex, dried, herb, use, instead, parsley, know, mainly, looks, taste, pasta, recipe, calls, 2, tablespoons, parsley, sauce, 2, tablespoons, know, parsley, looks, taste, parsley, sauce, especially, like, hear, substitutes, available]


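Observation: spaCy removes more stop words than NLTK on this sample. Words such as 'always', 'get', 'everyone', 'however', and 'also' are dropped by spaCy but kept by NLTK, so spaCy is the more thorough option for stop-word removal here.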
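To make the comparison concrete, the sketch below takes the tokens each library kept for the first review (copied from the two outputs above, so it runs without NLTK or spaCy) and lists the words that survive NLTK's filter but not spaCy's. The variable names are illustrative only.

```python
from collections import Counter

# Tokens kept for the first review, copied from the two outputs above
nltk_kept = ("chocolate chips cookies always crisp get chewy cookies like starbucks "
             "thank everyone answered far tip biggest impact chill rest dough however "
             "also increased brown sugar ratio increased bit butter also adding maple "
             "syrup helped").split()
spacy_kept = ("chocolate chips cookies crisp chewy cookies like starbucks thank "
              "answered far tip biggest impact chill rest dough increased brown sugar "
              "ratio increased bit butter adding maple syrup helped").split()

# Multiset difference: words NLTK kept that spaCy removed, with their counts
removed_only_by_spacy = Counter(nltk_kept) - Counter(spacy_kept)

print(len(nltk_kept), len(spacy_kept))        # 34 28
print(sorted(removed_only_by_spacy.items()))
# [('also', 2), ('always', 1), ('everyone', 1), ('get', 1), ('however', 1)]
```

Using `Counter` subtraction rather than a set difference keeps repeated words, which is why 'also' is reported twice.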

## **Task 3.1: Capture the review text from the first row**

In [None]:
a=str(Data['content'][0]).split()
a

['chocolate',
 'chips',
 'cookies',
 'always',
 'crisp',
 'get',
 'chewy',
 'cookies',
 'like',
 'starbucks',
 'thank',
 'everyone',
 'answered',
 'far',
 'tip',
 'biggest',
 'impact',
 'chill',
 'rest',
 'dough',
 'however',
 'also',
 'increased',
 'brown',
 'sugar',
 'ratio',
 'increased',
 'bit',
 'butter',
 'also',
 'adding',
 'maple',
 'syrup',
 'helped']

## **Task 3.2: Apply stemming using NLTK on the review text and print each stem**

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')
tokens = a  # whitespace-split words of the first review (from Task 3.1)

# Print each token next to its stem
for token in tokens:
  print(token + " | " + stemmer.stem(token))

chocolate | chocol
chips | chip
cookies | cooki
always | alway
crisp | crisp
get | get
chewy | chewi
cookies | cooki
like | like
starbucks | starbuck
thank | thank
everyone | everyon
answered | answer
far | far
tip | tip
biggest | biggest
impact | impact
chill | chill
rest | rest
dough | dough
however | howev
also | also
increased | increas
brown | brown
sugar | sugar
ratio | ratio
increased | increas
bit | bit
butter | butter
also | also
adding | ad
maple | mapl
syrup | syrup
helped | help


## **Task 3.3: Apply lemmatization using the spaCy library and print each lemma**

In [None]:
doc = nlp(Data['content'][0])
for token in doc:
  print(token.text, "|",token.lemma_)

chocolate | chocolate
chips | chip
cookies | cookie
always | always
crisp | crisp
get | get
chewy | chewy
cookies | cookie
like | like
starbucks | starbuck
thank | thank
everyone | everyone
answered | answer
far | far
tip | tip
biggest | big
impact | impact
chill | chill
rest | rest
dough | dough
however | however
also | also
increased | increase
brown | brown
sugar | sugar
ratio | ratio
increased | increase
bit | bit
butter | butter
also | also
adding | add
maple | maple
syrup | syrup
helped | help


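Observation: Lemmatization gives more accurate base forms than stemming. It returns real words such as 'chocolate', 'cookie', and 'add', where the stemmer produces the truncated forms 'chocol', 'cooki', and 'ad'.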
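The two outputs are easier to compare side by side. The sketch below hard-codes a few (token, stem, lemma) triples copied from the Task 3.2 and Task 3.3 outputs above, so it runs without NLTK or spaCy; the `comparison` and `disagreements` names are only for illustration.

```python
# (token, Snowball stem, spaCy lemma) triples copied from the outputs above
comparison = [
    ("chocolate", "chocol", "chocolate"),
    ("cookies", "cooki", "cookie"),
    ("chewy", "chewi", "chewy"),
    ("biggest", "biggest", "big"),
    ("increased", "increas", "increase"),
    ("adding", "ad", "add"),
    ("helped", "help", "help"),
]

# Keep only the tokens where the stemmer and the lemmatizer disagree
disagreements = [(tok, stem, lemma) for tok, stem, lemma in comparison if stem != lemma]

for tok, stem, lemma in disagreements:
    print(f"{tok:<10} stem={stem:<8} lemma={lemma}")
```

In this sample only 'helped' reduces to the same form under both methods. 'biggest' shows the opposite failure of the stemmer: it is left unchanged, while the lemmatizer maps it to 'big'.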

## **Task 4: Print the part-of-speech tags and count each part of speech**

In [None]:
# 1) print pos and tag
for token in doc:
  print(token.text," | ", token.pos_, " | ", token.tag_ )

chocolate  |  NOUN  |  NN
chips  |  NOUN  |  NNS
cookies  |  NOUN  |  NNS
always  |  ADV  |  RB
crisp  |  ADV  |  RB
get  |  VERB  |  VB
chewy  |  NOUN  |  NNS
cookies  |  NOUN  |  NNS
like  |  ADP  |  IN
starbucks  |  NOUN  |  NNS
thank  |  VERB  |  VBP
everyone  |  PRON  |  NN
answered  |  VERB  |  VBD
far  |  ADV  |  RB
tip  |  NOUN  |  NN
biggest  |  ADJ  |  JJS
impact  |  NOUN  |  NN
chill  |  NOUN  |  NN
rest  |  NOUN  |  NN
dough  |  NOUN  |  NN
however  |  ADV  |  RB
also  |  ADV  |  RB
increased  |  VERB  |  VBD
brown  |  ADJ  |  JJ
sugar  |  NOUN  |  NN
ratio  |  NOUN  |  NN
increased  |  VERB  |  VBD
bit  |  NOUN  |  NN
butter  |  NOUN  |  NN
also  |  ADV  |  RB
adding  |  VERB  |  VBG
maple  |  ADJ  |  JJ
syrup  |  NOUN  |  NN
helped  |  VERB  |  VBD


In [None]:
# 2) Print the count of each part of speech
noun = []
adv = []
verb = []
adp = []
adj = []

# Sort each token into a list according to its coarse-grained POS tag
for word in doc:
    if word.pos_ == 'NOUN':
        noun.append(word)
    if word.pos_ == 'ADV':
        adv.append(word)
    if word.pos_ == 'VERB':
        verb.append(word)
    if word.pos_ == 'ADP':
        adp.append(word)
    if word.pos_ == 'ADJ':
        adj.append(word)

print("count of noun words = {}".format(len(noun)))
print("count of adv words = {}".format(len(adv)))
print("count of verb words = {}".format(len(verb)))
print("count of adp words = {}".format(len(adp)))
print("count of adj words = {}".format(len(adj)))

count of noun words = 16
count of adv words = 6
count of verb words = 7
count of adp words = 1
count of adj words = 3
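The five hand-written lists only count the tags they name, so any other tag (here PRON for 'everyone') goes uncounted. A `collections.Counter` tallies every tag in one pass. The sketch below is a minimal illustration that hard-codes the coarse POS tags copied from the Task 4 output above so it runs without spaCy; with a real `doc`, the equivalent would be `Counter(token.pos_ for token in doc)`.

```python
from collections import Counter

# Coarse POS tags of the 34 tokens, copied in order from the output above
pos_tags = ("NOUN NOUN NOUN ADV ADV VERB NOUN NOUN ADP NOUN VERB PRON VERB ADV "
            "NOUN ADJ NOUN NOUN NOUN NOUN ADV ADV VERB ADJ NOUN NOUN VERB NOUN "
            "NOUN ADV VERB ADJ NOUN VERB").split()

# Tally every tag, including ones not anticipated in advance
pos_counts = Counter(pos_tags)

for pos, count in pos_counts.most_common():
    print("count of {} words = {}".format(pos.lower(), count))
```

The totals match the counts printed above (16 nouns, 7 verbs, 6 adverbs, 3 adjectives, 1 adposition) and add the single pronoun that the manual lists skip.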
