# ICE-4 Text Data: Flattening, Filtering, and Chunking

## (Tutorial) Bag of X
The following example applies bag-of-n-grams to the reviews in the Yelp academic dataset. Download the data from the following link:

https://github.com/knowitall/yelp-dataset-challenge/blob/master/data/yelp_phoenix_academic_dataset/yelp_academic_dataset_review.json

In [6]:
import pandas as pd
import json

In [7]:
# Adjust the path to wherever you saved the downloaded file
with open('yelp_academic_dataset_review.json', encoding='utf-8') as f:
    # Read only the first 10,000 reviews (one JSON object per line)
    js = [json.loads(f.readline()) for _ in range(10000)]
review_df = pd.DataFrame(js)
review_df.shape

(10000, 8)

In [8]:
review_df.head()

| | votes | user_id | review_id | stars | date | text | type | business_id |
|---|---|---|---|---|---|---|---|---|
| 0 | {'funny': 0, 'useful': 5, 'cool': 2} | rLtl8ZkDX5vH5nAx9C3q5Q | fWKvX83p0-ka4JS3dc6E5A | 5 | 2011-01-26 | My wife took me here on my birthday for breakf... | review | 9yKzy9PApeiPPOUJEtnvkg |
| 1 | {'funny': 0, 'useful': 0, 'cool': 0} | 0a2KyEL0d3Yb1V6aivbIuQ | IjZ33sJrzXqU-0X6U8NwyA | 5 | 2011-07-27 | I have no idea why some people give bad review... | review | ZRJwVLyzEJq1VAihDhYiow |
| 2 | {'funny': 0, 'useful': 1, 'cool': 0} | 0hT2KtfLiobPvh6cDC8JQg | IESLBzqUCLdSzSqm0eCSxQ | 4 | 2012-06-14 | love the gyro plate. Rice is so good and I als... | review | 6oRAC4uyJCsJl1X0WZpVSA |
| 3 | {'funny': 0, 'useful': 2, 'cool': 1} | uZetl9T0NcROGOyFfughhg | G-WvGaISbqqaMHlNnByodA | 5 | 2010-05-27 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | _1QQZuf4zZOyFCvXc0o6Vg |
| 4 | {'funny': 0, 'useful': 0, 'cool': 0} | vYmM4KTsC8ZfQBg-j5MWkw | 1uJFq2r5QfJG_6ExMRCaGw | 5 | 2012-01-05 | General Manager Scott Petello is a good egg!!!... | review | 6ozycU1RpktNG2-1BroVtw |


Note: by default, CountVectorizer uses token_pattern = '(?u)\\b\\w\\w+\\b', which ignores single-character words. We use token_pattern = '(?u)\\b\\w+\\b' to include single-character words.
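
To see what the two patterns match, they can be applied directly with Python's re module. This is a minimal sketch on a made-up sentence (not taken from the dataset); it mimics CountVectorizer's default lowercasing with str.lower() and is not a substitute for the vectorizer itself.

```python
import re

# A made-up sentence containing single-character words
sentence = "I had a gyro"

default_pattern = r"(?u)\b\w\w+\b"    # CountVectorizer default: two or more word characters
single_char_pattern = r"(?u)\b\w+\b"  # pattern used in this tutorial: one or more word characters

# Lowercase first, as CountVectorizer does by default
default_tokens = re.findall(default_pattern, sentence.lower())
all_tokens = re.findall(single_char_pattern, sentence.lower())

print(default_tokens)  # ['had', 'gyro']
print(all_tokens)      # ['i', 'had', 'a', 'gyro']
```

The default pattern silently drops "i" and "a", which is why the unigram output later in this notebook contains tokens such as 'a', 'i', 'm', and 't' only when the second pattern is used.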

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
bow_converter = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
x = bow_converter.fit_transform(review_df['text'])

In [10]:
unigram = bow_converter.get_feature_names_out()

In [11]:
bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
x2 = bigram_converter.fit_transform(review_df['text'])

In [12]:
bigram = bigram_converter.get_feature_names_out()

In [13]:
trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')
x3 = trigram_converter.fit_transform(review_df['text'])

In [14]:
trigram = trigram_converter.get_feature_names_out()

In [15]:
unigram

array(['0', '00', '000', ..., 'école', 'ém', 'òc'], dtype=object)

In [16]:
bigram

array(['0 0', '0 20', '0 39', ..., 'école lenôtre', 'ém all', 'òc châm'],
      dtype=object)

In [17]:
trigram

array(['0 0 eye', '0 20 less', '0 39 oz', ..., 'école lenôtre trained',
       'ém all they', 'òc châm a'], dtype=object)

In [18]:
print (len(unigram), len(bigram), len(trigram))

29222 368943 881620


In [19]:
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn as sns

In [20]:
sns.set_style("darkgrid")
counts = [len(unigram), len(bigram), len(trigram)]
plt.plot(counts, color='cornflowerblue')
plt.plot(counts, 'bo')
plt.margins(0.1)
plt.xticks(range(3), ['unigram', 'bigram', 'trigram'])
plt.tick_params(labelsize=14)
plt.title('Number of ngrams in the first 10,000 reviews of the Yelp dataset', {'fontsize':16})
plt.show()

[Figure: line plot of the number of distinct unigrams, bigrams, and trigrams in the first 10,000 reviews of the Yelp dataset]

## Task 1.1 Apply the unigram, bigram, and trigram tokenization methods to the text given below.

In [21]:
train_text = """My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.
Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.
I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.
It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.
It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!"""

# write your code here
x = bow_converter.fit_transform([train_text])
# Tokens using unigram
unigram = bow_converter.get_feature_names_out()
# unigram results
unigram

array(['2', 'a', 'absolute', 'absolutely', 'amazing', 'an', 'and',
       'anyway', 'arrived', 'back', 'best', 'better', 'birthday', 'blend',
       'bloody', 'bread', 'breakfast', 'busy', 'came', 'can', 'complete',
       'delicious', 'do', 'earlier', 'eggs', 'ever', 'everything',
       'excellent', 'favor', 'fills', 'food', 'for', 'fresh', 'from',
       'garden', 'get', 'go', 'griddled', 'grounds', 'had', 'here', 'i',
       'ingredients', 'it', 'like', 'looked', 'looks', 'm', 'made',
       'mary', 'me', 'meal', 'menu', 'morning', 'my', 'of', 'on', 'only',
       'order', 'our', 'outside', 'overlooking', 'perfect', 'phenomenal',
       'pieces', 'place', 'pleasure', 'pretty', 'quickly', 'saturday',
       'scrambled', 'semi', 'simply', 'sitting', 'skillet', 'so', 'sure',
       't', 'tasty', 'the', 'their', 'them', 'they', 'to', 'toast',
       'took', 'truffle', 'up', 'use', 've', 'vegetable', 'wait',
       'waitress', 'was', 'weather', 'when', 'which', 'while', 'white',
       

In [22]:
bigram_converter = CountVectorizer(ngram_range=(2,2), token_pattern='(?u)\\b\\w+\\b')
x2 = bigram_converter.fit_transform([train_text])
bigram = bigram_converter.get_feature_names_out()
bigram

array(['2 pieces', 'a favor', 'absolute pleasure', 'absolutely made',
       'amazing and', 'amazing while', 'an absolute', 'and blend',
       'and delicious', 'and get', 'and it', 'and our', 'and simply',
       'anyway i', 'arrived quickly', 'best i', 'best toast', 'better do',
       'birthday for', 'blend them', 'bloody mary', 'bread with',
       'breakfast and', 'busy saturday', 'came with', 'can t',
       'complete it', 'delicious it', 'do yourself', 'earlier you',
       'eggs vegetable', 'ever had', 'everything on', 'excellent and',
       'excellent i', 'excellent the', 'favor and', 'fills up',
       'food arrived', 'for breakfast', 'fresh when', 'from their',
       'garden and', 'get here', 'get their', 'go back', 'griddled bread',
       'grounds an', 'had anyway', 'had i', 'had the', 'here on',
       'here the', 'i can', 'i had', 'i m', 'i ve', 'ingredients from',
       'it absolutely', 'it came', 'it it', 'it looked', 'it was',
       'like the', 'looked like', 'loo

In [23]:
trigram_converter = CountVectorizer(ngram_range=(3,3), token_pattern='(?u)\\b\\w+\\b')
x3 = trigram_converter.fit_transform([train_text])
trigram = trigram_converter.get_feature_names_out()
trigram

array(['2 pieces of', 'a favor and', 'absolute pleasure our',
       'absolutely made the', 'amazing and it',
       'amazing while everything', 'an absolute pleasure',
       'and blend them', 'and delicious it', 'and get their',
       'and it absolutely', 'and it was', 'and our food',
       'and simply the', 'anyway i can', 'arrived quickly on',
       'best i ve', 'best toast i', 'better do yourself',
       'birthday for breakfast', 'blend them fresh', 'bloody mary it',
       'bread with was', 'breakfast and it', 'busy saturday morning',
       'came with 2', 'can t wait', 'complete it was',
       'delicious it came', 'do yourself a', 'earlier you get',
       'eggs vegetable skillet', 'ever had anyway', 'ever had i',
       'everything on the', 'excellent and our', 'excellent i had',
       'excellent the weather', 'favor and get', 'fills up pretty',
       'food arrived quickly', 'for breakfast and', 'fresh when you',
       'from their garden', 'garden and blend', 'get here 

## Task 1.2 Create your own naive (whitespace-based) tokenization method and apply it to the text given in Task 1.1
Note: 1. do not use the existing tokenization methods provided by NLP libraries; 2. split the words on whitespace characters, so the output should resemble the unigram output; 3. the output must not contain repeated elements.

In [24]:
# tokenization method
word_tokens = train_text.split()
storing_tokens = []

for i in word_tokens:
    if i not in storing_tokens:
        storing_tokens.append(i)

print(storing_tokens)

['My', 'wife', 'took', 'me', 'here', 'on', 'my', 'birthday', 'for', 'breakfast', 'and', 'it', 'was', 'excellent.', 'The', 'weather', 'perfect', 'which', 'made', 'sitting', 'outside', 'overlooking', 'their', 'grounds', 'an', 'absolute', 'pleasure.', 'Our', 'waitress', 'excellent', 'our', 'food', 'arrived', 'quickly', 'the', 'semi-busy', 'Saturday', 'morning.', 'It', 'looked', 'like', 'place', 'fills', 'up', 'pretty', 'so', 'earlier', 'you', 'get', 'better.', 'Do', 'yourself', 'a', 'favor', 'Bloody', 'Mary.', 'phenomenal', 'simply', 'best', "I've", 'ever', 'had.', "I'm", 'sure', 'they', 'only', 'use', 'ingredients', 'from', 'garden', 'blend', 'them', 'fresh', 'when', 'order', 'it.', 'amazing.', 'While', 'EVERYTHING', 'menu', 'looks', 'excellent,', 'I', 'had', 'white', 'truffle', 'scrambled', 'eggs', 'vegetable', 'skillet', 'tasty', 'delicious.', 'came', 'with', '2', 'pieces', 'of', 'griddled', 'bread', 'amazing', 'absolutely', 'meal', 'complete.', '"toast"', 'Anyway,', "can't", 'wait', '
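
The loop above scans the whole result list for every token, which is fine for one review but slows down on long texts. A possible alternative, sketched here on a short made-up string standing in for train_text, relies on the fact that dict keys are unique and keep insertion order:

```python
# A short made-up string stands in for train_text
text = "the food was great and the service was great too"

# dict keys are unique and keep insertion order (Python 3.7+),
# so this removes repeats while preserving first-seen order
unique_tokens = list(dict.fromkeys(text.split()))

print(unique_tokens)
```

As with the loop version, punctuation stays attached to the words and case is preserved, so 'excellent.' and 'excellent' would still count as different tokens.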

## **Question 1**. Given the sentence "He likes cat": in unigram representation, it could be "He", "likes", "cat". In bigram representation, it could be "He likes", "likes cat". In trigram representation, it could be "He likes cat". Explain why the storage and computation costs increase as n grows in n-gram methods.

Answer to Q1:
Storage and computation costs both increase as n grows in n-gram methods, for several reasons:
1. The vocabulary grows with n. With V distinct words there are up to V^n possible n-grams; in the tutorial above, the first 10,000 reviews yield 29,222 unigrams, 368,943 bigrams, and 881,620 trigrams. A larger vocabulary needs more storage and more computation.
2. A larger n extracts more features from the raw text, so any model built on those features takes longer to train and to evaluate.
3. The data becomes sparser: most n-grams occur only a few times, so the feature matrix consists mostly of zeros.
4. The dimensionality of the feature space increases with n, which leads to the curse of dimensionality.
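
The growth can be checked on a toy example. The sketch below uses a made-up sentence and a hypothetical helper named ngrams (not part of any library) to count the distinct n-grams actually observed and the theoretical upper bound V^n for n = 1, 2, 3.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sentence, tokenized on whitespace
tokens = "the cat sat on the mat the cat ate the rat".split()
vocab_size = len(set(tokens))  # V: number of distinct words

distinct_counts = {}
upper_bounds = {}
for n in (1, 2, 3):
    distinct_counts[n] = len(set(ngrams(tokens, n)))  # n-grams actually observed
    upper_bounds[n] = vocab_size ** n                 # V^n possible n-grams
    print(n, distinct_counts[n], upper_bounds[n])
```

On a text this short the observed counts level off quickly, but the upper bound grows exponentially, and a large corpus fills much more of that space, as the unigram, bigram, and trigram counts in the tutorial show.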

---

## (Tutorial) Stemming and Lemmatization

In [25]:
# import the PorterStemmer class from the nltk.stem.porter module
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

stem = stemmer.stem('flowers')
print(f"'flowers' after stemming: {stem}")

stem = stemmer.stem('zeroes')
print(f"'zeroes' after stemming: {stem}")

stem = stemmer.stem('better')
print(f"'better' after stemming: {stem}")

stem = stemmer.stem('sixties')
print(f"'sixties' after stemming: {stem}")

stem = stemmer.stem('goes')
print(f"'goes' after stemming: {stem}")

stem = stemmer.stem('go')
print(f"'go' after stemming: {stem}")

'flowers' after stemming: flower
'zeroes' after stemming: zero
'better' after stemming: better
'sixties' after stemming: sixti
'goes' after stemming: goe
'go' after stemming: go


In [26]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [27]:
# import lemmatizer class from nltk.stem module
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

lemma = lemmatizer.lemmatize('flowers')
print(f"'flowers' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('zeros')
print(f"'zeros' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('better')
print(f"'better' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('sixties')
print(f"'sixties' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('goes')
print(f"'goes' after lemmatization: {lemma}")

lemma = lemmatizer.lemmatize('go')
print(f"'go' after lemmatization: {lemma}")

print("\n\n")
lemma = lemmatizer.lemmatize('better', pos='a')   # 'a' denotes the ADJECTIVE part of speech
print(f"'better' (as an adjective) after lemmatization: {lemma}")

'flowers' after lemmatization: flower
'zeros' after lemmatization: zero
'better' after lemmatization: better
'sixties' after lemmatization: sixty
'goes' after lemmatization: go
'go' after lemmatization: go



'better' (as an adjective) after lemmatization: good


## Task 2. Text filtering for cleaner features
1. clean the text used in Task 1; 2. remove all punctuation; 3. convert all characters to lowercase; 4. remove all stopwords; 5. remove relatively meaningless tokens such as " 've ", " 's ", etc.; 6. after finishing the above operations, apply stemming and lemmatization to the cleaned text separately.

In [28]:
# Removing all punctuations
import string
translator = str.maketrans("", "", string.punctuation)
train_text = train_text.translate(translator)

print(train_text)

My wife took me here on my birthday for breakfast and it was excellent  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure
Our waitress was excellent and our food arrived quickly on the semibusy Saturday morning  It looked like the place fills up pretty quickly so the earlier you get here the better

Do yourself a favor and get their Bloody Mary  It was phenomenal and simply the best Ive ever had
Im pretty sure they only use ingredients from their garden and blend them fresh when you order it  It was amazing

While EVERYTHING on the menu looks excellent I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious
It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete
It was the best toast Ive ever had

Anyway I cant wait to go back


In [29]:
# Convert to lowercase
train_text = train_text.lower()

print(train_text)

my wife took me here on my birthday for breakfast and it was excellent  the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure
our waitress was excellent and our food arrived quickly on the semibusy saturday morning  it looked like the place fills up pretty quickly so the earlier you get here the better

do yourself a favor and get their bloody mary  it was phenomenal and simply the best ive ever had
im pretty sure they only use ingredients from their garden and blend them fresh when you order it  it was amazing

while everything on the menu looks excellent i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious
it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete
it was the best toast ive ever had

anyway i cant wait to go back


In [30]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
# Remove stopwords
stop_words = set(stopwords.words('english'))
words = nltk.word_tokenize(train_text)
words = [word for word in words if word not in stop_words]
cleaned_text = ' '.join(words)

print(cleaned_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


wife took birthday breakfast excellent weather perfect made sitting outside overlooking grounds absolute pleasure waitress excellent food arrived quickly semibusy saturday morning looked like place fills pretty quickly earlier get better favor get bloody mary phenomenal simply best ive ever im pretty sure use ingredients garden blend fresh order amazing everything menu looks excellent white truffle scrambled eggs vegetable skillet tasty delicious came 2 pieces griddled bread amazing absolutely made meal complete best toast ive ever anyway cant wait go back


In [31]:
import re
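# Contraction fragments to treat as meaningless
meaningless_words = ["'ve", "'s"]

# Discard them with a regular expression.
# Note: the apostrophes were already stripped with the rest of the punctuation,
# so these patterns no longer occur and cleaned_text is left unchanged.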
final_text = re.sub(r"\b(?:%s)\b" % "|".join(meaningless_words), "", cleaned_text)

print(final_text)

wife took birthday breakfast excellent weather perfect made sitting outside overlooking grounds absolute pleasure waitress excellent food arrived quickly semibusy saturday morning looked like place fills pretty quickly earlier get better favor get bloody mary phenomenal simply best ive ever im pretty sure use ingredients garden blend fresh order amazing everything menu looks excellent white truffle scrambled eggs vegetable skillet tasty delicious came 2 pieces griddled bread amazing absolutely made meal complete best toast ive ever anyway cant wait go back
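
The output above is identical to the input because the apostrophes were removed together with the rest of the punctuation, which leaves fragments such as "ive", "im", and "cant" in the text. One possible fix, sketched here on a made-up sample, is to drop the contraction suffixes while the apostrophe is still present and strip punctuation afterwards. The suffix list is an illustrative assumption and is not exhaustive.

```python
import re
import string

# Made-up sample; the suffix list is an illustrative assumption, not exhaustive
sample = "I've ever had. I'm sure it's fine"
contraction_suffixes = ["ve", "s", "m"]

# 1. Drop the contraction suffixes while the apostrophe is still present
pattern = r"'(?:%s)\b" % "|".join(map(re.escape, contraction_suffixes))
without_suffixes = re.sub(pattern, "", sample)

# 2. Only then strip punctuation and lowercase
cleaned = without_suffixes.translate(str.maketrans("", "", string.punctuation)).lower()

print(cleaned)  # i ever had i sure it fine
```

Applying this ordering to the notebook would change the later stemming and lemmatization outputs, so the results shown below still reflect the original order of operations.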


In [32]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Tokenize the text
tokens = nltk.word_tokenize(final_text)
# Initialize the stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Apply stemming
stemmed_tokens = [stemmer.stem(token) for token in tokens]

# Apply lemmatization
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

print("Stemmed Tokens:")
print(stemmed_tokens)
print("\nLemmatized Tokens:")
print(lemmatized_tokens)

Stemmed Tokens:
['wife', 'took', 'birthday', 'breakfast', 'excel', 'weather', 'perfect', 'made', 'sit', 'outsid', 'overlook', 'ground', 'absolut', 'pleasur', 'waitress', 'excel', 'food', 'arriv', 'quickli', 'semibusi', 'saturday', 'morn', 'look', 'like', 'place', 'fill', 'pretti', 'quickli', 'earlier', 'get', 'better', 'favor', 'get', 'bloodi', 'mari', 'phenomen', 'simpli', 'best', 'ive', 'ever', 'im', 'pretti', 'sure', 'use', 'ingredi', 'garden', 'blend', 'fresh', 'order', 'amaz', 'everyth', 'menu', 'look', 'excel', 'white', 'truffl', 'scrambl', 'egg', 'veget', 'skillet', 'tasti', 'delici', 'came', '2', 'piec', 'griddl', 'bread', 'amaz', 'absolut', 'made', 'meal', 'complet', 'best', 'toast', 'ive', 'ever', 'anyway', 'cant', 'wait', 'go', 'back']

Lemmatized Tokens:
['wife', 'took', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'made', 'sitting', 'outside', 'overlooking', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrived', 'quickly', 'semibusy', 

## **Question 2.** Based on the examples and the output of your code, which one has the better performance, Stemming or Lemmatization? Try to analyze it.

**Answer to Q2**:
The stemming output contains several tokens that are not dictionary words: for example, "outside" becomes "outsid". Stemming applies suffix-stripping rules, so the result is a truncated root form rather than a real word.
Examples of stemmed tokens from this text: "excel" (from "excellent"), "outsid" (from "outside"), "arriv" (from "arrived"), and "amaz" (from "amazing").
Lemmatization maps each token to its base or dictionary form, which gives a more standardized and readable representation.
Examples of lemmatized tokens from this text: "excellent" (instead of "excel"), "outside" (instead of "outsid"), "arrived" (instead of "arriv"), and "amazing" (instead of "amaz"). Note that "arrived" is left unchanged because WordNetLemmatizer treats every token as a noun unless a pos argument is passed, as the "better" example in the tutorial shows.

Based on this analysis, lemmatization performs better than stemming on this text, although it needs part-of-speech information to reduce verbs and adjectives to their base forms.

---

## (Tutorial) PoS tagging and chunking

**Note:** you need to install the spacy and textblob modules before running the following code.
If you have problems installing spacy, follow the instructions at the following link:
https://stackoverflow.com/questions/66149878/e053-could-not-read-config-cfg-resumeparser
If you have problems using textblob, install the NLTK data as described at the following link:
http://www.nltk.org/data.html

In [33]:
# Load the first 10 reviews
with open('yelp_academic_dataset_review.json', encoding='utf-8') as f:
    js = [json.loads(f.readline()) for _ in range(10)]
review_df = pd.DataFrame(js)
review_df.shape

(10, 8)

In [34]:
# chunking in spaCy
import spacy
spacy.info('en_core_web_sm')

{'lang': 'en',
 'name': 'core_web_sm',
 'version': '3.5.0',
 'description': 'English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.',
 'author': 'Explosion',
 'email': 'contact@explosion.ai',
 'url': 'https://explosion.ai',
 'license': 'MIT',
 'spacy_version': '>=3.5.0,<3.6.0',
 'spacy_git_version': '9e0322de1',
 'vectors': {'width': 0, 'vectors': 0, 'keys': 0, 'name': None},
 'labels': {'tok2vec': [],
  'tagger': ['$',
   "''",
   ',',
   '-LRB-',
   '-RRB-',
   '.',
   ':',
   'ADD',
   'AFX',
   'CC',
   'CD',
   'DT',
   'EX',
   'FW',
   'HYPH',
   'IN',
   'JJ',
   'JJR',
   'JJS',
   'LS',
   'MD',
   'NFP',
   'NN',
   'NNP',
   'NNPS',
   'NNS',
   'PDT',
   'POS',
   'PRP',
   'PRP$',
   'RB',
   'RBR',
   'RBS',
   'RP',
   'SYM',
   'TO',
   'UH',
   'VB',
   'VBD',
   'VBG',
   'VBN',
   'VBP',
   'VBZ',
   'WDT',
   'WP',
   'WP$',
   'WRB',
   'XX',
   '_SP',
   '``'],
  'parser': ['ROOT',
   'acl',
   'acomp',


In [35]:
nlp = spacy.load("en_core_web_sm")
doc_df = review_df['text'].apply(nlp)
type(doc_df)

pandas.core.series.Series

In [36]:
type(doc_df[0])

spacy.tokens.doc.Doc

In [37]:
doc_df[4]

General Manager Scott Petello is a good egg!!! Not to go into detail, but let me assure you if you have any issues (albeit rare) speak with Scott and treat the guy with some respect as you state your case and I'd be surprised if you don't walk out totally satisfied as I just did. Like I always say..... "Mistakes are inevitable, it's how we recover from them that is important"!!!

Thanks to Scott and his awesome staff. You've got a customer for life!! .......... :^)

In [38]:
# Each element of a Doc is a token
for token in doc_df[4]:
    print(token.text, token.pos_, token.tag_)

General PROPN NNP
Manager PROPN NNP
Scott PROPN NNP
Petello PROPN NNP
is AUX VBZ
a DET DT
good ADJ JJ
egg NOUN NN
! PUNCT .
! PUNCT .
! PUNCT .
Not PART RB
to PART TO
go VERB VB
into ADP IN
detail NOUN NN
, PUNCT ,
but CCONJ CC
let VERB VB
me PRON PRP
assure VERB VB
you PRON PRP
if SCONJ IN
you PRON PRP
have VERB VBP
any DET DT
issues NOUN NNS
( PUNCT -LRB-
albeit ADV RB
rare ADJ JJ
) PUNCT -RRB-
speak VERB VBP
with ADP IN
Scott PROPN NNP
and CCONJ CC
treat VERB VB
the DET DT
guy NOUN NN
with ADP IN
some DET DT
respect NOUN NN
as SCONJ IN
you PRON PRP
state VERB VBP
your PRON PRP$
case NOUN NN
and CCONJ CC
I PRON PRP
'd AUX MD
be AUX VB
surprised ADJ JJ
if SCONJ IN
you PRON PRP
do AUX VBP
n't PART RB
walk VERB VB
out ADP RP
totally ADV RB
satisfied ADJ JJ
as SCONJ IN
I PRON PRP
just ADV RB
did VERB VBD
. PUNCT .
Like INTJ UH
I PRON PRP
always ADV RB
say VERB VBP
..... PUNCT :
" PUNCT ``
Mistakes NOUN NNS
are AUX VBP
inevitable ADJ JJ
, PUNCT ,
it PRON PRP
's AUX VBZ
how SCONJ WRB
we PR

In [39]:
# spaCy also does some basic noun chunking
print([chunk for chunk in doc_df[4].noun_chunks])

[General Manager Scott Petello, a good egg, detail, me, you, you, any issues, Scott, the guy, some respect, you, your case, I, you, I, I, Mistakes, it, we, them, that, Thanks, Scott, his awesome staff, You, a customer, life]


In [40]:
# chunking in textblob

from textblob import TextBlob
blob_df = review_df['text'].apply(TextBlob)
type(blob_df)

pandas.core.series.Series

In [41]:
type(blob_df[4])

textblob.blob.TextBlob

In [42]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [43]:
blob_df[4].tags

[('General', 'NNP'),
 ('Manager', 'NNP'),
 ('Scott', 'NNP'),
 ('Petello', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('good', 'JJ'),
 ('egg', 'NN'),
 ('Not', 'RB'),
 ('to', 'TO'),
 ('go', 'VB'),
 ('into', 'IN'),
 ('detail', 'NN'),
 ('but', 'CC'),
 ('let', 'VB'),
 ('me', 'PRP'),
 ('assure', 'VB'),
 ('you', 'PRP'),
 ('if', 'IN'),
 ('you', 'PRP'),
 ('have', 'VBP'),
 ('any', 'DT'),
 ('issues', 'NNS'),
 ('albeit', 'IN'),
 ('rare', 'NN'),
 ('speak', 'NN'),
 ('with', 'IN'),
 ('Scott', 'NNP'),
 ('and', 'CC'),
 ('treat', 'VB'),
 ('the', 'DT'),
 ('guy', 'NN'),
 ('with', 'IN'),
 ('some', 'DT'),
 ('respect', 'NN'),
 ('as', 'IN'),
 ('you', 'PRP'),
 ('state', 'NN'),
 ('your', 'PRP$'),
 ('case', 'NN'),
 ('and', 'CC'),
 ('I', 'PRP'),
 ("'d", 'MD'),
 ('be', 'VB'),
 ('surprised', 'VBN'),
 ('if', 'IN'),
 ('you', 'PRP'),
 ('do', 'VBP'),
 ("n't", 'RB'),
 ('walk', 'VB'),
 ('out', 'RP'),
 ('totally', 'RB'),
 ('satisfied', 'JJ'),
 ('as', 'IN'),
 ('I', 'PRP'),
 ('just', 'RB'),
 ('did', 'VBD'),
 ('Like', 'IN'),
 ('

In [44]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [45]:
# textblob can do some basic noun chunking
print([np for np in blob_df[4].noun_phrases])

['general manager', 'scott petello', 'good egg', 'scott', "n't walk", 'mistakes', 'thanks', 'scott', 'awesome staff']


## Task 3. Apply spaCy and TextBlob chunking to the text used in Task 1, and output the noun phrase chunking results

In [47]:
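# spaCy chunking for train_text
# Note: train_text was lowercased and stripped of punctuation in Task 2, so the parser sees the cleaned text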
nlp = spacy.load("en_core_web_sm")
doc_df = nlp(train_text)
type(doc_df)

spacy.tokens.doc.Doc

In [50]:
# Each element of a Doc is a token
for token in doc_df:
    print(token.text, token.pos_, token.tag_)

my PRON PRP$
wife NOUN NN
took VERB VBD
me PRON PRP
here ADV RB
on ADP IN
my PRON PRP$
birthday NOUN NN
for ADP IN
breakfast NOUN NN
and CCONJ CC
it PRON PRP
was AUX VBD
excellent ADJ JJ
  SPACE _SP
the DET DT
weather NOUN NN
was AUX VBD
perfect ADJ JJ
which PRON WDT
made VERB VBD
sitting VERB VBG
outside ADV RB
overlooking VERB VBG
their PRON PRP$
grounds NOUN NNS
an DET DT
absolute ADJ JJ
pleasure NOUN NN

 SPACE _SP
our PRON PRP$
waitress NOUN NN
was AUX VBD
excellent ADJ JJ
and CCONJ CC
our PRON PRP$
food NOUN NN
arrived VERB VBD
quickly ADV RB
on ADP IN
the DET DT
semibusy ADJ JJ
saturday PROPN NNP
morning NOUN NN
  SPACE _SP
it PRON PRP
looked VERB VBD
like SCONJ IN
the DET DT
place NOUN NN
fills VERB VBZ
up ADP RP
pretty ADV RB
quickly ADV RB
so SCONJ IN
the DET DT
earlier ADV RBR
you PRON PRP
get VERB VBP
here ADV RB
the DET DT
better ADJ JJR


 SPACE _SP
do VERB VBP
yourself PRON PRP
a DET DT
favor NOUN NN
and CCONJ CC
get VERB VB
their PRON PRP$
bloody ADJ JJ
mary NOUN NN
  S

In [51]:
# spaCy also does some basic noun chunking
print([chunk for chunk in doc_df.noun_chunks])

[my wife, me, my birthday, breakfast, it, the weather, which, their grounds, an absolute pleasure, our waitress, our food, it, the place, you, yourself, a favor, their bloody mary, it, i, i, they, ingredients, their garden, them, you, it, it, everything, the menu, i, the white truffle scrambled eggs vegetable skillet, it, it, 2 pieces, their griddled bread, it, the meal, it, the best toast, i, i]


In [52]:
from textblob import TextBlob
blob_df = TextBlob(train_text)
type(blob_df)

textblob.blob.TextBlob

In [53]:
blob_df.tags

[('my', 'PRP$'),
 ('wife', 'NN'),
 ('took', 'VBD'),
 ('me', 'PRP'),
 ('here', 'RB'),
 ('on', 'IN'),
 ('my', 'PRP$'),
 ('birthday', 'NN'),
 ('for', 'IN'),
 ('breakfast', 'NN'),
 ('and', 'CC'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('excellent', 'JJ'),
 ('the', 'DT'),
 ('weather', 'NN'),
 ('was', 'VBD'),
 ('perfect', 'JJ'),
 ('which', 'WDT'),
 ('made', 'VBD'),
 ('sitting', 'VBG'),
 ('outside', 'IN'),
 ('overlooking', 'VBG'),
 ('their', 'PRP$'),
 ('grounds', 'NNS'),
 ('an', 'DT'),
 ('absolute', 'JJ'),
 ('pleasure', 'NN'),
 ('our', 'PRP$'),
 ('waitress', 'NN'),
 ('was', 'VBD'),
 ('excellent', 'JJ'),
 ('and', 'CC'),
 ('our', 'PRP$'),
 ('food', 'NN'),
 ('arrived', 'VBD'),
 ('quickly', 'RB'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('semibusy', 'JJ'),
 ('saturday', 'NN'),
 ('morning', 'NN'),
 ('it', 'PRP'),
 ('looked', 'VBD'),
 ('like', 'IN'),
 ('the', 'DT'),
 ('place', 'NN'),
 ('fills', 'VBZ'),
 ('up', 'RP'),
 ('pretty', 'RB'),
 ('quickly', 'RB'),
 ('so', 'IN'),
 ('the', 'DT'),
 ('earlier', 'JJR'),
 ('y

In [54]:
# textblob can do some basic noun chunking
print([np for np in blob_df.noun_phrases])

['absolute pleasure', 'semibusy saturday morning', 'place fills', 'bloody mary', 'excellent i', 'white truffle', 'vegetable skillet', 'i cant']


## **Question 3**. Comparing the outputs of spaCy and TextBlob chunking in Task 3, which one would you like to use in your application? Explain it.

**Answer to Q3**:
Both spaCy and TextBlob find the noun phrases "absolute pleasure" and "bloody mary"; spaCy keeps the leading determiner or possessive ("an absolute pleasure", "their bloody mary"), while TextBlob drops it.
spaCy returns far more chunks (41 versus 8). It includes pronouns such as "it" and "i", and it finds phrases that TextBlob misses, such as "my wife", "the weather", "2 pieces", "their griddled bread", "the meal", and "the best toast".
TextBlob splits the dish name into "white truffle" and "vegetable skillet", whereas spaCy returns "the white truffle scrambled eggs vegetable skillet" as a single chunk. TextBlob also returns strings that are not noun phrases, such as "place fills", "excellent i", and "i cant", and it is the only one to return "semibusy saturday morning".
These differences come from the underlying methods: spaCy derives its noun chunks from a statistical dependency parse, while TextBlob's default extractor relies on simpler tagging and pattern rules. Results also depend on the library version, configuration, and language model, and here both tools received text that was already lowercased and stripped of punctuation, which removes cues they normally rely on.
For an application that needs accurate chunking, deeper linguistic analysis, multi-language support, or additional NLP functionality, I would choose spaCy. If simplicity, ease of use, and integration with NLTK matter more, TextBlob may be a suitable option.
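
The overlap between the two outputs can also be checked mechanically. The sketch below copies the distinct chunks from the spaCy and TextBlob outputs of Task 3 and compares them after dropping one leading determiner or possessive from each spaCy chunk. The helper strip_leading_word and the word list LEADING_WORDS are hypothetical names introduced only for this comparison.

```python
# Distinct noun chunks copied from the spaCy output of Task 3
spacy_chunks = [
    "my wife", "me", "my birthday", "breakfast", "it", "the weather", "which",
    "their grounds", "an absolute pleasure", "our waitress", "our food",
    "the place", "you", "yourself", "a favor", "their bloody mary", "i", "they",
    "ingredients", "their garden", "them", "everything", "the menu",
    "the white truffle scrambled eggs vegetable skillet", "2 pieces",
    "their griddled bread", "the meal", "the best toast",
]

# Noun phrases copied from the TextBlob output of Task 3
textblob_phrases = [
    "absolute pleasure", "semibusy saturday morning", "place fills",
    "bloody mary", "excellent i", "white truffle", "vegetable skillet", "i cant",
]

# Hypothetical normalization: drop one leading determiner or possessive
LEADING_WORDS = {"a", "an", "the", "my", "our", "their"}

def strip_leading_word(chunk):
    words = chunk.split()
    if len(words) > 1 and words[0] in LEADING_WORDS:
        words = words[1:]
    return " ".join(words)

normalized_spacy = {strip_leading_word(c) for c in spacy_chunks}
shared = sorted(normalized_spacy & set(textblob_phrases))
only_textblob = sorted(set(textblob_phrases) - normalized_spacy)

print(shared)         # ['absolute pleasure', 'bloody mary']
print(only_textblob)
```

Only two of TextBlob's eight phrases match a spaCy chunk even after normalization, which supports treating the two extractors as producing quite different results on this text.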


---

## Question 4. What is the disadvantage of bag of words? Please explain in your own words with an example.

## Write code for the example.

### Answer to Q4:
The main disadvantage of bag of words is that it does not preserve the order or structure of the words in a sentence. It treats each word as an independent feature and stores only how often each word occurs in a document. As a result, it loses the meaning that depends on word order and context, which can hurt text classification tasks: "the dog bit the man" and "the man bit the dog" get exactly the same representation.
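
A minimal sketch of this order-blindness, using collections.Counter as a stand-in for CountVectorizer and two made-up sentences with opposite meanings. The helper bag_of_words is a hypothetical name, and the whitespace tokenization is a simplification.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count how often each lowercase, whitespace-separated word occurs."""
    return Counter(sentence.lower().split())

bag_a = bag_of_words("the dog bit the man")
bag_b = bag_of_words("the man bit the dog")

print(bag_a)
print(bag_a == bag_b)  # True: the two sentences are indistinguishable
```

Any classifier trained on these counts sees the two sentences as the same input, even though who bit whom is reversed.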

In [1]:
text_doc=[
"Politics is the practice and study of influencing decisions that apply to members of a group or society, and it involves the exercise of power and authority to achieve collective goals.",
"Political systems can vary from democracies to autocracies, and political ideologies shape the policies and principles guiding governance.",
"Elections, political campaigns, and public opinion play crucial roles in determining the composition of governments and shaping policy outcomes.",
"Political discourse encompasses debates on issues such as social justice, economic inequality, human rights, environmental sustainability, and foreign relations.",
"Political leaders and policymakers formulate and implement laws, regulations, and public policies that have a direct impact on citizens' lives and shape the direction of a nation or region."
]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# Defining object for count vectorizer
bagofwords= CountVectorizer()

# Applying bagofwords to text_doc
bagofwordsarray = bagofwords.fit_transform(text_doc)
# array representation
bagofwordsarray.toarray()

array([[1, 3, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,
        0, 0, 0, 0, 1, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 2, 2, 0],
       [0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
       [0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
        1, 0, 0

In [5]:
namesfeature = bagofwords.get_feature_names_out()
namesfeature

array(['achieve', 'and', 'apply', 'as', 'authority', 'autocracies',
       'campaigns', 'can', 'citizens', 'collective', 'composition',
       'crucial', 'debates', 'decisions', 'democracies', 'determining',
       'direct', 'direction', 'discourse', 'economic', 'elections',
       'encompasses', 'environmental', 'exercise', 'foreign', 'formulate',
       'from', 'goals', 'governance', 'governments', 'group', 'guiding',
       'have', 'human', 'ideologies', 'impact', 'implement', 'in',
       'inequality', 'influencing', 'involves', 'is', 'issues', 'it',
       'justice', 'laws', 'leaders', 'lives', 'members', 'nation', 'of',
       'on', 'opinion', 'or', 'outcomes', 'play', 'policies', 'policy',
       'policymakers', 'political', 'politics', 'power', 'practice',
       'principles', 'public', 'region', 'regulations', 'relations',
       'rights', 'roles', 'shape', 'shaping', 'social', 'society',
       'study', 'such', 'sustainability', 'systems', 'that', 'the', 'to',
       'vary'],