# <font color="maroon"> NLP Toolkits and Preprocessing Techniques </font>
Python libraries for natural language processing
1. Converting text to a meaningful format for analysis
2. Preprocessing and cleaning text

Open-Source Libraries<br>
1. <font color="red">NLTK<br> </font>
2. <font color="red">TextBlob<br></font>
3. SpaCy<br>
4. GenSim<br>

Cloud-Based NLP Services<br>
1. IBM Watson<br>
2. Google Cloud Natural Language API
3. Amazon Comprehend
4. Microsoft Azure

## How to Install NLTK?

### Method (i) Command Line
pip install nltk<br>
import nltk<br>
nltk.download()

### Method (ii) Anaconda Navigator (Environment)
![Installation of NLTK library](NLTK.png)

### Method (iii) Download the Package and Place It in the site-packages Directory
Download the NLTK toolkit from https://sourceforge.net/projects/nltk/<br>
![Installation of NLTK library](nltk_package.png)
<br>Place the package in the site-packages directory<br>
(to find the path:<br> import site <br>site.getsitepackages())


In [1]:
import site
site.getsitepackages()

['/usr/local/lib/python3.11/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/lib/python3.11/dist-packages']

# Method 1

In [2]:
pip install nltk  #complete this



In [3]:
import nltk
# nltk.download()
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.


True

In [4]:
print(nltk.data.path)


['/root/nltk_data', '/usr/nltk_data', '/usr/share/nltk_data', '/usr/lib/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


## Sample Text Data

Consider this sentence:
<br>**Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers) from the
store. Should I pick up some black-eyed peas as well?**

Text data is messy and unstructured. To analyze this data, we need to preprocess the text.


![](https://i.imgur.com/pt5p6Hb.png)

# Code: Tokenization (Words)

In [5]:
from nltk.tokenize import word_tokenize  #complete this

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(word_tokenize(my_text))

['Hi', 'Mr.', 'Smith', '!', 'I', 'am', 'going', 'to', 'buy', 'some', 'vegetables', '(', '3', 'tomatoes', 'and', '3', 'cucumbers', ')', 'from', 'the', 'store', '.', 'Should', 'I', 'pick', 'up', 'some', 'black-eyed', 'peas', 'as', 'well', '?']


# Code: Tokenization (Sentences)

In [6]:
from nltk.tokenize import sent_tokenize   #complete this

my_text = '''Hi Mr. Smith! I am going to buy some vegetables (3 tomatoes and 3 cucumbers)
from the store. Should I pick up some black-eyed peas as well?'''

print(sent_tokenize(my_text))

['Hi Mr. Smith!', 'I am going to buy some vegetables (3 tomatoes and 3 cucumbers)\nfrom the store.', 'Should I pick up some black-eyed peas as well?']


![](https://i.imgur.com/3L6x92C.png)

# Code: Remove Punctuation

In [7]:
import re              #complete this
import string

# Remove punctuation: delete every character that is neither a word character nor whitespace
s = re.sub(r'[^\w\s]', '', my_text)            #complete this
s

# OR replace each punctuation character with a space instead.
# string.punctuation is a constant in Python's string module that contains the ASCII punctuation characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# clean_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', my_text)
# clean_text

'Hi Mr Smith I am going to buy some vegetables 3 tomatoes and 3 cucumbers\nfrom the store Should I pick up some blackeyed peas as well'

# Code: Make All Text Lowercase

In [8]:
clean_text = s.upper()           #complete this
clean_text

'HI MR SMITH I AM GOING TO BUY SOME VEGETABLES 3 TOMATOES AND 3 CUCUMBERS\nFROM THE STORE SHOULD I PICK UP SOME BLACKEYED PEAS AS WELL'

In [9]:
clean_text = s.lower()           #complete this
clean_text

'hi mr smith i am going to buy some vegetables 3 tomatoes and 3 cucumbers\nfrom the store should i pick up some blackeyed peas as well'

# Code: Remove Numbers

In [10]:
# Remove every digit character (a word that contains digits keeps its other characters)
clean_text = re.sub(r'\d', '', clean_text)  #complete this
clean_text

'hi mr smith i am going to buy some vegetables  tomatoes and  cucumbers\nfrom the store should i pick up some blackeyed peas as well'
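The three cleaning steps above (punctuation, case, digits) can be chained into one helper. The following is a minimal sketch, not part of the original notebook: the function name `preprocess` is hypothetical, and the final whitespace-collapsing step is an extra assumption added to tidy the double spaces that digit removal leaves behind.

```python
import re

def preprocess(text):
    """Apply the cleaning steps from the cells above in one pass."""
    text = re.sub(r'[^\w\s]', '', text)       # remove punctuation
    text = text.lower()                       # make all text lowercase
    text = re.sub(r'\d', '', text)            # remove digit characters
    text = re.sub(r'\s+', ' ', text).strip()  # extra step (assumption): collapse repeated whitespace
    return text

sample = "Hi Mr. Smith! I am going to buy 3 tomatoes and 3 cucumbers."
print(preprocess(sample))
# hi mr smith i am going to buy tomatoes and cucumbers
```

The order matters little here, but lowercasing before any dictionary lookup (such as stop-word removal) is usually the safer choice.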

# <font color='blue'>Preprocessing: Stop Words</font>

![](https://i.imgur.com/T5RJXrX.png)

# Code: Stop Words

In [11]:
from nltk.corpus import stopwords      #complete this
set(stopwords.words('english'))        #complete this

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 "he's",
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 "i'll",
 "i'm",
 "i've",
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

# Code: Remove Stop Words

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">CountVectorizer</a>

In [12]:
from sklearn.feature_extraction.text import CountVectorizer        #complete this
import pandas as pd

my_text = ["Hi Mr. Smith! I’m black going Smith to buy black going Smith some vegetables \
(3 tomatoes and 3 cucumbers from goingSmith going the store. Should Smith I pick up going some black-eyed peas as well?"]

# Incorporate stop words when creating the count vectorizer
cv = CountVectorizer(stop_words='english')           #complete this
X = cv.fit_transform(my_text)                        #complete this
print (X)
pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

# Reference: https://www.geeksforgeeks.org/difference-between-pandas-vs-numpy/
# self learn pandas: https://www.w3schools.com/python/pandas/pandas_intro.asp
# self learn numpy: https://www.w3schools.com/python/numpy/numpy_intro.asp

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 14 stored elements and shape (1, 14)>
  Coords	Values
  (0, 6)	1
  (0, 7)	1
  (0, 10)	4
  (0, 0)	3
  (0, 4)	4
  (0, 1)	1
  (0, 13)	1
  (0, 12)	1
  (0, 2)	1
  (0, 5)	1
  (0, 11)	1
  (0, 9)	1
  (0, 3)	1
  (0, 8)	1


Unnamed: 0,black,buy,cucumbers,eyed,going,goingsmith,hi,mr,peas,pick,smith,store,tomatoes,vegetables
0,3,1,1,1,4,1,1,1,1,1,4,1,1,1


Calling CountVectorizer.fit_transform involves the following steps:

(1) Tokenization: Each document is lowercased and split into individual tokens. With the default settings, punctuation is dropped and only tokens of two or more characters are kept, which is why "I" and "3" do not appear in the table above. Because stop_words='english' was passed, common English words are removed as well.

(2) Vocabulary building (fit): CountVectorizer builds a vocabulary, which is a dictionary mapping each unique token in the documents to an integer column index.

(3) Counting (transform): It then counts the occurrences of each token in each document and stores these counts in a sparse matrix, where rows represent documents and columns represent vocabulary tokens. Each element of the matrix is the frequency of the corresponding token in the respective document.
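To make the three steps concrete, here is a simplified pure-Python sketch of the same idea. It is not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, stop-word removal is left out, and a dense list of lists stands in for the sparse matrix.

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    # Step 2: map each unique token to a column index (alphabetical order)
    tokens = set()
    for doc in documents:
        tokens.update(tokenize(doc))
    return {token: index for index, token in enumerate(sorted(tokens))}

def transform(documents, vocabulary):
    # Step 3: one row of counts per document, one column per vocabulary token
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

docs = ["One hot encoding", "Hot chocolate, hot milk"]
vocabulary = fit_vocabulary(docs)
matrix = transform(docs, vocabulary)
print(vocabulary)  # {'chocolate': 0, 'encoding': 1, 'hot': 2, 'milk': 3, 'one': 4}
print(matrix)      # [[0, 1, 1, 0, 1], [1, 0, 2, 1, 0]]
```

Separating `fit_vocabulary` from `transform` mirrors why a fitted vectorizer can later be applied to new documents using the same columns.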

![](https://i.imgur.com/9qllh8j.png)

# Code: Stemming

In [14]:
from nltk.stem import LancasterStemmer              #complete this
stemmer = LancasterStemmer()                        #complete this

# Try some stems
print('drive:{}'.format(stemmer.stem('drive')))
print('drives:{}'.format(stemmer.stem('drives')))
print('driver:{}'.format(stemmer.stem('driver')))
print('drivers:{}'.format(stemmer.stem('drivers')))
print('driven:{}'.format(stemmer.stem('driven')))

drive:driv
drives:driv
driver:driv
drivers:driv
driven:driv


# Code: Lemmatization

In [15]:
from nltk.stem import WordNetLemmatizer   #complete this # Reference: https://www.nltk.org/api/nltk.stem.wordnet.html
from nltk.tokenize import word_tokenize
lemmatizer=WordNetLemmatizer()

input_str = "been had done languages cities mice running flies"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
running
fly


In [16]:
print(lemmatizer.lemmatize("running", pos="v"))

run


![](https://i.imgur.com/8edVsCR.png)

# Code: Parts of Speech Tagging

In [17]:
from nltk.tag import pos_tag                         #complete this
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

my_text = "James Smith lives in the Penang."
my_text2 = "James Smith is having a live band in the United States."
tokens = pos_tag(word_tokenize(my_text))
tokens2 = pos_tag(word_tokenize(my_text2))
print("Sentence 1:",tokens)
print("Sentence 2:",tokens2)

#Reference:https://pythonspot.com/nltk-speech-tagging/

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Sentence 1: [('James', 'NNP'), ('Smith', 'NNP'), ('lives', 'VBZ'), ('in', 'IN'), ('the', 'DT'), ('Penang', 'NNP'), ('.', '.')]
Sentence 2: [('James', 'NNP'), ('Smith', 'NNP'), ('is', 'VBZ'), ('having', 'VBG'), ('a', 'DT'), ('live', 'JJ'), ('band', 'NN'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('.', '.')]


![POS](nltk-speech-codes.png)

## Named Entity Recognition

In [18]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk  #complete this

my_text = "James Smith lives in the Penang."
tokens = pos_tag(word_tokenize(my_text)) # this labels each word as a part of speech
entities = ne_chunk(tokens) # this extracts entities from the list of words
# entities.draw() # This requires a graphical display, which is not available in Colab.
print(entities) # Print the tree structure to the console instead.

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


(S
  (PERSON James/NNP)
  (PERSON Smith/NNP)
  lives/VBZ
  in/IN
  the/DT
  (ORGANIZATION Penang/NNP)
  ./.)


# <font color="blue"> Preprocessing: Compound Term Extraction </font>

![](https://i.imgur.com/q1WuWai.png)

# Code: Compound Term Extraction

In [19]:
from nltk.tokenize import MWETokenizer       #complete this

my_text = "You all are the greatest students of all time."

mwe_tokenizer = MWETokenizer([('You','all'), ('of', 'all', 'time')])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(my_text))

mwe_tokens

# Other examples of compound terms: New York City, take into account, make use of, high probability, kick the bucket

['You_all', 'are', 'the', 'greatest', 'students', 'of_all_time', '.']

# Lambda Function

In [20]:
# Basic example, https://www.w3schools.com/python/python_lambda.asp
square_me = lambda x: x*x

my_numbers = [9, 3, 4, 100, 2, 1]
my_numbers_squared = list(map(square_me, my_numbers)) #map = applies a function to all the items in an input_list
                                                      #map(function, iterable)
print(my_numbers_squared)

[81, 9, 16, 10000, 4, 1]


# <font color=red>Preprocessing Exercise </font>



# Introduction

We will be using review data from Kaggle to practice preprocessing text data. The dataset contains user reviews for many products, but today we'll be focusing on the product in the dataset that had the most reviews - an oatmeal cookie.

The following code will help you load in the data. If this is your first time using nltk, you'll need to pip install it first.


In [21]:
import nltk
# nltk.download() <-- Run this if it's your first time using nltk to download all of the datasets and models
import pandas as pd

In [22]:
df = pd.read_csv('cookie_reviews.csv')
df.head(10)

Unnamed: 0,user_id,stars,reviews
0,A368Z46FIKHSEZ,5,I love these cookies! Not only are they healt...
1,A1JAPP1CXRG57A,5,Quaker Soft Baked Oatmeal Cookies with raisins...
2,A2Z9JNXPIEL2B9,5,I am usually not a huge fan of oatmeal cookies...
3,A31CYJQO3FL586,5,I participated in a product review that includ...
4,A2KXQ2EKFF3K2G,5,My kids loved these. I was very pleased to giv...
5,A2U5TAIAQ675BL,5,I really enjoyed these individually wrapped bi...
6,A1R4PIBZBD3NZ0,4,I was surprised at how soft the cookie was. I ...
7,A1ECQ8LJMXG4WI,5,Filled with oats and raisins you'll love this ...
8,A3MSG4E5MLI1XP,5,"I was recently given a complimentary ""vox box""..."
9,A3BUDUV9GORLWH,5,the best and freshest cookie that comes in a p...


**Question 1:**

Determine how many reviews there are in total.
   

In [23]:
total=df.count()
print(total)

user_id    913
stars      913
reviews    913
dtype: int64


**Question 2:**
    
Determine the percentage of 1, 2, 3, 4 and 5 star reviews.

In [24]:
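#stars_col = df.stars
val = df['stars'].value_counts()
print(val)

# Look up the total by column label; positional access such as total[1] on a labelled Series is deprecated in recent pandas
print(val / total['stars'] * 100)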

stars
5    624
4    217
3     56
2     12
1      4
Name: count, dtype: int64
stars
5    68.346112
4    23.767798
3     6.133625
2     1.314348
1     0.438116
Name: count, dtype: float64




**Question 3:**

(a) Remove stop words

In [25]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

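# Exclude stop words with a list comprehension applied to every review via pandas.Series.apply.
# Note: the comparison is case-sensitive and NLTK's list is lowercase, so capitalized words such as 'I' or 'Not' are kept.
df['reviews_without_stopwords'] = df['reviews'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# x.split() splits each review on whitespace so it can be checked word by word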

r=(df['reviews_without_stopwords'])
print(r)

0      I love cookies! Not healthy taste great soft! ...
1      Quaker Soft Baked Oatmeal Cookies raisins deli...
2      I usually huge fan oatmeal cookies, literally ...
3      I participated product review included sample ...
4      My kids loved these. I pleased give kids quick...
                             ...                        
908    I loved cookies kids. You read full review [.....
909    This great tasting cookie. It soft texture smo...
910    These great quick snack! They satisfying even ...
911    I love Quaker soft baked cookies. The really s...
912    This cookie really good works really well fami...
Name: reviews_without_stopwords, Length: 913, dtype: object


1. df['reviews'] refers to the 'reviews' column in your DataFrame df
2. .apply(lambda x: ...) applies a function (here defined as a lambda) to every value in the column, that is, to each review.
3. lambda x: ' '.join([word for word in x.split() if word not in (stop)]) is a lambda function that:
   <br>a. Splits each review x into a list of words (x.split()).
   <br>b. Iterates through each word in this list (for word in x.split()).
   <br>c. Checks if each word is not in the stop list (i.e., if it's not a stopword).
   <br>d. If the word is not a stopword, it includes it in the list comprehension ([word for word in x.split() if word not in (stop)]).
   <br>e. Joins these words back into a single string with spaces separating them (' '.join(...)).
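The output above still contains words such as "I" and "Not" because the stop list is lowercase while the reviews are not. One possible variation, sketched below, compares the lowercased form of each word. The `stop_words` set here is a tiny hypothetical sample for illustration, not NLTK's full list, and `remove_stop_words` is a made-up helper name.

```python
# Hypothetical, shortened stop-word list for illustration only
stop_words = {"i", "not", "only", "are", "they", "these", "and"}

def remove_stop_words(review, stop_words):
    # Compare the lowercased word so that "I" and "Not" are filtered too
    kept = [word for word in review.split() if word.lower() not in stop_words]
    return " ".join(kept)

review = "I love these cookies! Not only are they healthy"
print(remove_stop_words(review, stop_words))
# love cookies! healthy
```

Using a set rather than a list also makes each membership check faster, which can matter on larger datasets.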

(b) Change to lower case

In [26]:
l_case=r.str.lower()
l_case

Unnamed: 0,reviews_without_stopwords
0,i love cookies! not healthy taste great soft! ...
1,quaker soft baked oatmeal cookies raisins deli...
2,"i usually huge fan oatmeal cookies, literally ..."
3,i participated product review included sample ...
4,my kids loved these. i pleased give kids quick...
...,...
908,i loved cookies kids. you read full review [.....
909,this great tasting cookie. it soft texture smo...
910,these great quick snack! they satisfying even ...
911,i love quaker soft baked cookies. the really s...


(c) Perform stemming

In [27]:
sno = nltk.stem.SnowballStemmer('english')

documents = [[sno.stem(word) for word in x.split(" ")] for x in l_case]
print(documents)

[['i', 'love', 'cookies!', 'not', 'healthi', 'tast', 'great', 'soft!', 'i', 'definit', 'add', 'groceri', 'list!'], ['quaker', 'soft', 'bake', 'oatmeal', 'cooki', 'raisin', 'delici', 'treat,', 'great', 'anytim', 'day.', 'for', 'example:<br', '/><br', '/>--at', 'breakfast,', 'i', 'one', 'larg', 'banana', 'cup', 'coffee,', 'felt', "i'd", 'relat', '"healthy"', 'start', 'day.<br', '/><br', '/>--the', 'next', 'day', 'lunch,', 'follow', 'tuna', 'sandwich,', 'i', 'one', 'glass', 'milk,', 'satisfi', 'enough', 'need', 'snack', 'dinner', '6:30.<br', '/><br', '/>--the', 'follow', 'night,', 'dinner,', 'i', 'one', 'remaind', 'glass', 'wine.', '(delicious!)', 'and', 'again,', 'feel', 'need', 'snack', 'later', 'evening.<br', '/><br', '/>each', 'cooki', 'individu', 'packaged,', 'textur', 'soft', 'moist,', 'right', 'amount', 'sweetness.', 'natur', 'flavor', 'use', 'make', 'cinnamon', 'all', 'spice.', 'these', 'flavor', 'give', 'cooki', 'real', 'old-fashioned,', 'homemad', 'taste.<br', '/><br', '/>nutrit

1. The outer list comprehension builds a new list (documents) by iterating over each review x in l_case.
2. For each review x, the inner list comprehension splits x into words using x.split(" ").
3. It then stems each word with sno.stem(word), where sno is the SnowballStemmer created above.
4. The result is one list of stemmed words per review, in the same order as l_case. Because the text is split on spaces only, punctuation stays attached to the tokens (for example, 'cookies!' and 'day.').
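The nested comprehension can also be written as two explicit loops, which some readers find easier to follow. In this sketch `toy_stem` is a hypothetical stand-in for `sno.stem` (it only strips a trailing "s"), and `l_case_sample` is made-up data, so the block runs without NLTK.

```python
def toy_stem(word):
    # Hypothetical stand-in for sno.stem: strips a trailing "s" only
    return word[:-1] if word.endswith("s") else word

l_case_sample = ["i love cookies", "great soft snacks"]

documents = []
for x in l_case_sample:            # outer loop: one review at a time
    stemmed_words = []
    for word in x.split(" "):      # inner loop: one word at a time
        stemmed_words.append(toy_stem(word))
    documents.append(stemmed_words)

print(documents)
# [['i', 'love', 'cookie'], ['great', 'soft', 'snack']]
```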

# TextBlob

### Another toolkit other than NLTK

- Wraps around NLTK and makes it easier to use

# TextBlob Demo: Tokenization

In [28]:
pip install textblob  #Install the library before importing



In [29]:
from textblob import TextBlob                #complete this

my_text = TextBlob("We're moving from NLTK to TextBlob. How fun!")
my_text.words

WordList(['We', "'re", 'moving', 'from', 'NLTK', 'to', 'TextBlob', 'How', 'fun'])

# TextBlob Demo: Spell Check

In [30]:
blob = TextBlob("I'm graat at speling.")
print(blob.correct())                           #complete this

I'm great at spelling.


<font color="blue">
## How does the correct function work?  <br>
    
- Calculates the Levenshtein distance between the word ‘graat’ and all words in its word list <br>
- Of the words with the smallest Levenshtein distance, it outputs the most frequently used word <br></font>
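The two steps described above can be sketched in plain Python. This is an illustration of the idea only, not TextBlob's actual code: the `word_counts` dictionary and its frequencies are made up, and `edit_distance` and `correct_word` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row of the table at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical word list with made-up frequency counts
word_counts = {"great": 900, "grant": 150, "goat": 80, "spelling": 60, "spewing": 5}

def correct_word(word, word_counts):
    if word in word_counts:
        return word  # already a known word
    # Step 1: find the smallest distance to any known word
    best = min(edit_distance(word, candidate) for candidate in word_counts)
    closest = [c for c in word_counts if edit_distance(word, c) == best]
    # Step 2: among the closest candidates, pick the most frequent one
    return max(closest, key=word_counts.get)

print(correct_word("graat", word_counts))    # great
print(correct_word("speling", word_counts))  # spelling
```

Here "graat" is one edit away from both "great" and "grant", and the higher frequency of "great" breaks the tie.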

# TextBlob Demo: Tagging

In [31]:
blob = TextBlob("John hits the ball.")
for words, tag in blob.tags:                     #complete this
    print (words, tag)

John NNP
hits VBZ
the DT
ball NN


# TextBlob Demo: Language Detection and Translation

In [32]:
from textblob import TextBlob

text = "This is a sample text in English."
blob = TextBlob(text)

In [33]:
!pip install langdetect             #complete this

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 981.5/981.5 kB 5.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=4176ac59f410d4f220c22493b1ea14085caa3dd8512ec03693e04b5fc107b66a
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd6272c167640072e
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.9


In [34]:
from langdetect import detect           #complete this

text = "没关系"
language = detect(text)

print("Detected Language:", language)

Detected Language: zh-cn


In [35]:
!pip install googletrans==4.0.0-rc1         #complete this

Collecting googletrans==4.0.0-rc1
  Downloading googletrans-4.0.0rc1.tar.gz (20 kB)
  Preparing metadata (setup.py) ... done
Collecting httpx==0.13.3 (from googletrans==4.0.0-rc1)
  Downloading httpx-0.13.3-py3-none-any.whl.metadata (25 kB)
Collecting hstspreload (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading hstspreload-2025.1.1-py3-none-any.whl.metadata (2.1 kB)
Collecting chardet==3.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading chardet-3.0.4-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting idna==2.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading idna-2.10-py2.py3-none-any.whl.metadata (9.1 kB)
Collecting rfc3986<2,>=1.3 (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading rfc3986-1.5.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting httpcore==0.9.* (from httpx==0.13.3->googletrans==4.0.0-rc1)
  Downloading httpcore-0.9.1-py3-none-any.whl.metadata (4.6 kB)
Collecting h11<0.10,>=0.8 (from httpcore==0.9.*->httpx==0.13.3->googl

In [36]:
from langdetect import detect
from googletrans import Translator

text = "Arch Chinese is a premier Chinese learning system crafted by Chinese teachers in the United States for Mandarin Chinese language learners at K-12 schools and universities."

# Detect the language
detected_lang = detect(text)

# Translate to Traditional Chinese ('zh-tw')
translator = Translator()
translated_text = translator.translate(text, src=detected_lang, dest='zh-tw').text

print("Detected Language:", detected_lang)
print("Translated Text (to Traditional Chinese):", translated_text)

Detected Language: en
Translated Text (to Traditional Chinese): Arch Chinese是美國中國教師在美國學校和大學中漢語中漢語學習者制定的中國首要學習系統。


# Exercise

In [37]:
# Write a Python function using TextBlob to tokenize a given sentence and count the number of tokens.

from textblob import TextBlob

def tokenize_and_count(sentence):
    blob = TextBlob(sentence)
    tokens = blob.words

    print("Tokens:", tokens)
    num_tokens = len(tokens)
    return num_tokens

# Example sentence, you may change to user input as well
sentence = "TextBlob is a simple and powerful library for text processing."

# Call the function
num_tokens = tokenize_and_count(sentence)
print("Number of Tokens:", num_tokens)

Tokens: ['TextBlob', 'is', 'a', 'simple', 'and', 'powerful', 'library', 'for', 'text', 'processing']
Number of Tokens: 10


In [38]:
# Write a Python function using TextBlob to perform Parts of Speech (POS) tagging on a given sentence.

from textblob import TextBlob

def pos_tagging(sentence):

    blob = TextBlob(sentence)
    pos_tags = blob.tags
    print("POS Tags:")
    for word, tag in pos_tags:
        print(word, tag)

# Example sentence
sentence = "TextBlob is a simple and powerful library for text processing."

# Call the function
pos_tagging(sentence)

POS Tags:
TextBlob NNP
is VBZ
a DT
simple JJ
and CC
powerful JJ
library NN
for IN
text NN
processing NN


In [39]:
# Write a Python function using TextBlob to perform spell checking on a given text and suggest corrections.
from textblob import TextBlob

def spell_check(text):
    blob = TextBlob(text)
    corrected_text = blob.correct()
    print("Original Text:")
    print(text)
    print("\nCorrected Text:")
    print(corrected_text)

# Example text with intentional errors
text = "Thes arre somme speling errors in thiss sentance."

# Call the function
spell_check(text)

Original Text:
Thes arre somme speling errors in thiss sentance.

Corrected Text:
The are some spelling errors in this sentence.


In [None]:
# Write a Python function using langdetect and googletrans to translate a given text from English to Chinese

In [40]:
from langdetect import detect
from googletrans import Translator

def translate_to_chinese(text):
    # Detect the language of the text
    detected_lang = detect(text)

    # Translate to Chinese ('zh-CN')
    translator = Translator()
    translated_text = translator.translate(text, src=detected_lang, dest='zh-CN').text

    return translated_text

# Example text in English
text = "Hello, how are you?"

# Translate to Chinese
translated_text = translate_to_chinese(text)

print(f"Original Text: {text}")
print(f"Translated Text (to Chinese): {translated_text}")

Original Text: Hello, how are you?
Translated Text (to Chinese): 你好吗？


# <font color="maroon"> Some other functions in NLP: Text Similarity Measures </font>

- Measure the distance between two strings

Applications
- Information retrieval
- Text classification
- Document clustering
- Topic Modeling
- Matrix decomposition

To measure the word similarity, we use **<font color="blue"><a href="https://pypi.org/project/python-Levenshtein/" target="_blank">Levenshtein distance</a></font>**.
- Minimum number of operations to get from one word to another.

![](https://i.imgur.com/FkdJmPi.png)

In [41]:
pip install python-Levenshtein  #complete this

Collecting python-Levenshtein
  Downloading python_levenshtein-0.27.1-py3-none-any.whl.metadata (3.7 kB)
Collecting Levenshtein==0.27.1 (from python-Levenshtein)
  Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting rapidfuzz<4.0.0,>=3.9.0 (from Levenshtein==0.27.1->python-Levenshtein)
  Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Downloading python_levenshtein-0.27.1-py3-none-any.whl (9.4 kB)
Downloading levenshtein-0.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (161 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.7/161.7 kB 1.5 MB/s eta 0:00:00
Downloading rapidfuzz-3.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 20.8 MB/s eta 0:00:00
[?25hInstalling collected packages:

In [42]:
from Levenshtein import distance as lev
lev('party', 'park')

2

In [43]:
#concept behind lev('party', 'park')
def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i

    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    return dp[m][n]

# Example usage
string1 = "party"
string2 = "park"
distance = levenshtein_distance(string1, string2)                #complete this
print("Levenshtein distance:", distance)

Levenshtein distance: 2


## Let's use the Levenshtein distance to measure the similarity between two sentences:
<br>sentence1 = "The quick brown fox jumps over the lazy dog."
<br>sentence2 = "The quick brown fox jumps over lazy dog."

In [46]:
from Levenshtein import distance as lev

sentence1 = "The quick brown fox jumps over the lazy dog."
sentence2 = "The quick brown fox jumps over lazy dog."

words1 = sentence1.lower().split()
words2 = sentence2.lower().split()

# Passing two lists of words makes lev count word-level insertions, deletions and substitutions
distance = lev(words1, words2)

# Calculate similarity (adjust based on your specific needs)
max_length = max(len(words1), len(words2))
# print (max_length)
similarity = 1 - (distance / max_length)

print("Levenshtein distance between sentence 1 and sentence 2:", distance)
print("Similarity between sentence 1 and sentence 2:", similarity)

# Note: Levenshtein distance is usually applied to sequences of characters; here it is applied to lists of words.
# It is sensitive to word order and ignores meaning, so sentences that say the same thing with reordered or different words score poorly.
# To capture that kind of similarity, consider TF-IDF or word embeddings (pretrained models such as Word2Vec, GloVe, or FastText)
# combined with a similarity metric such as cosine similarity.

Levenshtein distance between sentence 1 and sentence 2: 1
Similarity between sentence 1 and sentence 2: 0.8888888888888888


# Text Format for Analysis: Count Vectorizer

In [47]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This is the second document.', 'And the third one. One is fun.'] # corpus = collection of texts
cv = CountVectorizer()
X = cv.fit_transform(corpus)
pd.DataFrame(X.toarray(),columns=cv.get_feature_names_out())

Unnamed: 0,and,document,first,fun,is,one,second,the,third,this
0,0,1,1,0,1,0,0,1,0,1
1,0,1,0,0,1,0,1,1,0,1
2,1,0,0,1,1,2,0,1,1,0


![](https://i.imgur.com/OQDeQlb.png)

# Document Similarity: Example

![](https://i.imgur.com/PyirXsy.png)

In [49]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']
# create the document-term matrix with count vectorizer
cv = CountVectorizer(stop_words="english")
X = cv.fit_transform(corpus).toarray()
dt = pd.DataFrame(X, columns=cv.get_feature_names_out())
dt

Unnamed: 0,chai,chocolate,encoding,hot,latte,make,milk,sale,sun,today,weather
0,0,0,0,1,0,0,0,0,1,0,1
1,0,1,0,1,0,1,1,0,0,0,0
2,0,0,1,1,0,0,0,0,0,0,0
3,1,0,0,0,1,0,1,0,0,0,0
4,0,0,0,1,0,0,0,1,0,1,0


# Document Similarity: Example

In [50]:
# calculate the cosine similarity between all combinations of documents
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# list all index pairs (5 choose 2) along with the corresponding pairs of phrases
pairs = list(combinations(range(len(corpus)),2)) # sentence pairs (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), ..., (3, 4)
print(pairs)
combos = [(corpus[a_index], corpus[b_index]) for (a_index, b_index) in pairs]
print (combos)

# calculate the cosine similarity for all pairs of phrases and sort by most similar
results = [cosine_similarity([X[a_index]], [X[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results, combos), reverse=True)

[(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
[('The weather is hot under the sun', 'I make my hot chocolate with milk'), ('The weather is hot under the sun', 'One hot encoding'), ('The weather is hot under the sun', 'I will have a chai latte with milk'), ('The weather is hot under the sun', 'There is a hot sale today'), ('I make my hot chocolate with milk', 'One hot encoding'), ('I make my hot chocolate with milk', 'I will have a chai latte with milk'), ('I make my hot chocolate with milk', 'There is a hot sale today'), ('One hot encoding', 'I will have a chai latte with milk'), ('One hot encoding', 'There is a hot sale today'), ('I will have a chai latte with milk', 'There is a hot sale today')]


[(array([[0.40824829]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.40824829]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.35355339]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.33333333]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.28867513]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]
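
As a rough check on the scores above, cosine similarity can be computed by hand as the dot product of two vectors divided by the product of their lengths. The sketch below assumes the count vectors were built with English stop words removed, so that 'The weather is hot under the sun' reduces to weather/hot/sun and 'One hot encoding' to hot/encoding. The four-word vocabulary and the helper name `cosine` are illustrative, not part of the notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative count vectors over the vocabulary [encoding, hot, sun, weather]
weather_doc = [0, 1, 1, 1]    # hot, sun, weather
encoding_doc = [1, 1, 0, 0]   # encoding, hot

score = cosine(weather_doc, encoding_doc)
print(score)  # equals 1 / sqrt(3 * 2), roughly 0.408
```

The two phrases share one word ("hot") out of three and two remaining words, which is where the 1 / sqrt(3 * 2) comes from.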

In [51]:
pairs = list(combinations(range(5),2))
pairs

[(0, 1),
 (0, 2),
 (0, 3),
 (0, 4),
 (1, 2),
 (1, 3),
 (1, 4),
 (2, 3),
 (2, 4),
 (3, 4)]
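
The number of unordered pairs grows as n(n-1)/2, so five phrases give ten pairs. A small sketch to confirm the count (the variable names are illustrative):

```python
from itertools import combinations
from math import comb

n_docs = 5
index_pairs = list(combinations(range(n_docs), 2))

# combinations() yields each unordered pair exactly once, in lexicographic order
print(len(index_pairs), comb(n_docs, 2))
print(index_pairs[0], index_pairs[-1])
```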

![](https://i.imgur.com/jrfN6Jj.png)

![](https://i.imgur.com/BI8XP92.png)

![](https://i.imgur.com/3IbfQXT.png)

![](https://i.imgur.com/pnNqzql.png)

In [53]:
import pandas as pd
corpus = ['This is the first document.',
         'This is the second document.',
         'And the third one. One is fun.']
# original Count Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=cv.get_feature_names_out())

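| | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 |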
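
To see where these counts come from, here is a minimal bag-of-words sketch using only the standard library. It assumes `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the names `docs`, `vocab`, and `counts` are illustrative.

```python
import re
from collections import Counter

docs = ['This is the first document.',
        'This is the second document.',
        'And the third one. One is fun.']

# Lowercase, then keep tokens of two or more word characters
token_re = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_re.findall(doc.lower()) for doc in docs]

# The sorted vocabulary gives the column order
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row of term counts per document
counts = []
for toks in tokenized:
    term_counts = Counter(toks)
    counts.append([term_counts[term] for term in vocab])

print(vocab)
for row in counts:
    print(row)
```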


In [55]:
# new TF-IDF Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
cv_tfidf = TfidfVectorizer()
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
pd.DataFrame(X_tfidf, columns=cv_tfidf.get_feature_names_out())

| | and | document | first | fun | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.450145 | 0.591887 | 0.0 | 0.349578 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.450145 |
| 1 | 0.0 | 0.450145 | 0.0 | 0.0 | 0.349578 | 0.0 | 0.591887 | 0.349578 | 0.0 | 0.450145 |
| 2 | 0.36043 | 0.0 | 0.0 | 0.36043 | 0.212876 | 0.72086 | 0.0 | 0.212876 | 0.36043 | 0.0 |
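
The weights above can be reproduced by hand. This sketch assumes scikit-learn's default `TfidfVectorizer` settings: a smoothed inverse document frequency, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The count matrix is copied from the CountVectorizer output; the variable names are illustrative.

```python
import numpy as np

# Count matrix from the CountVectorizer example; columns are
# and, document, first, fun, is, one, second, the, third, this
counts = np.array([
    [0, 1, 1, 0, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 1, 2, 0, 1, 1, 0],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed inverse document frequency

weighted = counts * idf                     # term frequency times IDF
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

print(np.round(tfidf, 6))
```

Words that appear in every document ("is", "the") get the smallest IDF, so their weight drops relative to rarer words such as "first" or "one".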


![](https://i.imgur.com/xlJibKw.png)

## Document Similarity: Example with TF-IDF

In [57]:
corpus = ['The weather is hot under the sun',
'I make my hot chocolate with milk',
'One hot encoding',
'I will have a chai latte with milk',
'There is a hot sale today']

from sklearn.feature_extraction.text import TfidfVectorizer
# create the document-term matrix with TF-IDF vectorizer
cv_tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = cv_tfidf.fit_transform(corpus).toarray()
dt_tfidf = pd.DataFrame(X_tfidf,columns=cv_tfidf.get_feature_names_out())
dt_tfidf

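| | chai | chocolate | encoding | hot | latte | make | milk | sale | sun | today | weather |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 |
| 1 | 0.0 | 0.580423 | 0.0 | 0.327 | 0.0 | 0.580423 | 0.468282 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.871247 | 0.490845 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.614189 | 0.0 | 0.0 | 0.0 | 0.614189 | 0.0 | 0.495524 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.370086 | 0.0 | 0.0 | 0.0 | 0.6569 | 0.0 | 0.6569 | 0.0 |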


In [58]:
# calculate the cosine similarity for all pairs of phrases and sort by most similar
results_tfidf = [cosine_similarity([X_tfidf[a_index]], [X_tfidf[b_index]]) for (a_index, b_index) in pairs]
sorted(zip(results_tfidf, combos), reverse=True)

[(array([[0.23204486]]),
  ('I make my hot chocolate with milk', 'I will have a chai latte with milk')),
 (array([[0.18165505]]),
  ('The weather is hot under the sun', 'One hot encoding')),
 (array([[0.18165505]]), ('One hot encoding', 'There is a hot sale today')),
 (array([[0.16050661]]),
  ('I make my hot chocolate with milk', 'One hot encoding')),
 (array([[0.1369638]]),
  ('The weather is hot under the sun', 'There is a hot sale today')),
 (array([[0.12101835]]),
  ('The weather is hot under the sun', 'I make my hot chocolate with milk')),
 (array([[0.12101835]]),
  ('I make my hot chocolate with milk', 'There is a hot sale today')),
 (array([[0.]]),
  ('The weather is hot under the sun', 'I will have a chai latte with milk')),
 (array([[0.]]), ('One hot encoding', 'I will have a chai latte with milk')),
 (array([[0.]]),
  ('I will have a chai latte with milk', 'There is a hot sale today'))]
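
An alternative sketch, not the approach used in the cell above: `cosine_similarity` also accepts the whole document-term matrix and returns every pairwise score at once, so each pair can be ranked by a plain float rather than a one-element array. The names `docs`, `sim`, and `ranked` are illustrative.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['The weather is hot under the sun',
        'I make my hot chocolate with milk',
        'One hot encoding',
        'I will have a chai latte with milk',
        'There is a hot sale today']

# TF-IDF document-term matrix with English stop words removed
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# sim[i, j] is the cosine similarity between documents i and j
sim = cosine_similarity(X)

# Rank each unordered pair by its scalar score, most similar first
ranked = sorted(
    ((float(sim[i, j]), docs[i], docs[j])
     for i, j in combinations(range(len(docs)), 2)),
    reverse=True,
)

for score, first, second in ranked[:3]:
    print(f"{score:.4f}  {first!r} / {second!r}")
```

With TF-IDF weights the two "milk" phrases rank first, because "hot" appears in four of the five documents and is therefore down-weighted.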

![](https://i.imgur.com/mj4J60v.png)

# <font color=red>Sentiment Analysis Exercise</font>

## Introduction

We will use a song lyric dataset from Kaggle to identify songs with similar lyrics. The dataset contains the artist, song title, and lyrics for more than 55,000 songs, but today we will focus on songs by one group in particular: The Beatles.

The following code will help you load in the data and get set up for this exercise.


In [59]:
import nltk
import pandas as pd

In [60]:
data = pd.read_csv('songdata.csv')
data.head(10)

| | artist | song | link | text |
|---|---|---|---|---|
| 0 | ABBA | Ahe's My Kind Of Girl | /a/abba/ahes+my+kind+of+girl_20598417.html | Look at her face, it's a wonderful face \nAnd... |
| 1 | ABBA | Andante, Andante | /a/abba/andante+andante_20002708.html | Take it easy with me, please \nTouch me gentl... |
| 2 | ABBA | As Good As New | /a/abba/as+good+as+new_20003033.html | I'll never know why I had to go \nWhy I had t... |
| 3 | ABBA | Bang | /a/abba/bang_20598415.html | Making somebody happy is a question of give an... |
| 4 | ABBA | Bang-A-Boomerang | /a/abba/bang+a+boomerang_20002668.html | Making somebody happy is a question of give an... |
| 5 | ABBA | Burning My Bridges | /a/abba/burning+my+bridges_20003011.html | Well, you hoot and you holler and you make me ... |
| 6 | ABBA | Cassandra | /a/abba/cassandra_20002811.html | Down in the street they're all singing and sho... |
| 7 | ABBA | Chiquitita | /a/abba/chiquitita_20002978.html | Chiquitita, tell me what's wrong \nYou're enc... |
| 8 | ABBA | Crazy World | /a/abba/crazy+world_20003013.html | I was out with the morning sun \nCouldn't sle... |
| 9 | ABBA | Crying Over You | /a/abba/crying+over+you_20177611.html | I'm waitin' for you baby \nI'm sitting all al... |


## Question 1

Apply the following preprocessing steps:

- Note the '\n' (new line) characters in the lyrics. Remove them using regular expressions.


In [61]:
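# first bullet: remove the '\n' characters with a regular expression

data = pd.read_csv('songdata.csv')

txt_col = data['text']

# regex=True treats the pattern as a regular expression
# (pandas 2.0 and later default to a literal match)
txt_col = txt_col.str.replace(r"\n", "", regex=True)
print(txt_col)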

0        Look at her face, it's a wonderful face  And i...
1        Take it easy with me, please  Touch me gently ...
2        I'll never know why I had to go  Why I had to ...
3        Making somebody happy is a question of give an...
4        Making somebody happy is a question of give an...
                               ...                        
57645    Irie days come on play  Let the angels fly let...
57646    Power to the workers  More power  Power to the...
57647    all you need  is something i'll believe  flash...
57648    northern star  am i frightened  where can i go...
57649    come in  make yourself at home  i'm a bit late...
Name: text, Length: 57650, dtype: object
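
Replacing '\n' with an empty string leaves double spaces behind, as the output shows. A hedged variant, shown on a small made-up Series rather than the real dataset, collapses every run of whitespace (newlines included) into a single space:

```python
import pandas as pd

# Hypothetical sample standing in for data['text']
lyrics = pd.Series(["Look at her face  \nAnd it means something  \n",
                    "Take it easy  \nTouch me gently"])

cleaned = (lyrics
           .str.replace(r"\s+", " ", regex=True)  # any whitespace run -> one space
           .str.strip())                          # trim leading/trailing spaces

print(cleaned.tolist())
```

Whether to keep or collapse the extra spaces depends on the later steps; most tokenizers ignore repeated whitespace anyway.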


## Question 2

(a) List all the rows with "Imagine" in the title


In [62]:
data[data['song'].str.contains('Imagine')]

| | artist | song | link | text |
|---|---|---|---|---|
| 1769 | Bon Jovi | Imagine | /b/bon+jovi/imagine_20525130.html | Imagine there's no heaven, \nIt's easy if you... |
| 4215 | Diana Ross | Imagine | /d/diana+ross/imagine_20040404.html | Imagine there's no heaven \nIt's easy if you ... |
| 6885 | Glee | Imagine | /g/glee/imagine_20854234.html | Imagine there's no countries, \nIt isn't hard... |
| 7340 | Guns N' Roses | Imagine | /g/guns+n+roses/imagine_20254363.html | Imagine there's no heaven \nIt's easy if you ... |
| 15678 | Pearl Jam | Hard To Imagine | /p/pearl+jam/hard+to+imagine_20106382.html | Paint a picture, using only gray \nLight your... |
| 19748 | Train | Imagine | /t/train/imagine_21054702.html | Finally met Virginia on a slow summer night \... |
| 24406 | Avril Lavigne | Imagine | /a/avril+lavigne/imagine_20785697.html | Imagine there's no Heaven \nIt's easy if you ... |
| 24783 | The Beatles | Imagine | /b/beatles/imagine_20254326.html | Imagine there's no heaven \nIt's easy if you ... |
| 29441 | Demi Lovato | I Can Only Imagine | /d/demi+lovato/i+can+only+imagine_20868017.html | I can only imagine \nSurrounded by your glory... |
| 40519 | Kirk Franklin | Imagine Me | /k/kirk+franklin/imagine+me_20370453.html | Imagine me \nLoving what I see when the mirro... |
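
`str.contains` matches substrings, which is why titles such as 'Hard To Imagine' and 'Imagine Me' appear in the result. A sketch of two ways to adjust the filter, using a tiny hypothetical DataFrame in place of `songdata.csv`:

```python
import pandas as pd

# Hypothetical rows standing in for the song dataset
songs = pd.DataFrame({
    "artist": ["The Beatles", "Pearl Jam", "Kirk Franklin", "ABBA"],
    "song": ["Imagine", "Hard To Imagine", "Imagine Me", "Bang"],
})

# Substring match, ignoring case; na=False treats missing titles as non-matches
substring_hits = songs[songs["song"].str.contains("imagine", case=False, na=False)]

# Exact title match combined with an artist filter
exact_hits = songs[(songs["song"] == "Imagine") & (songs["artist"] == "The Beatles")]

print(substring_hits["song"].tolist())
print(exact_hits)
```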


## Question 3

(a) Extract the first line of the lyrics from the first song.


In [63]:
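#print(txt_col)
# Note: '\n' was already removed from txt_col in Question 1, so for the first song
# the split on "\r" finds nothing to split on and returns the whole lyric
first_sentence = txt_col.str.split("\r").str.get(0)

print(first_sentence[0])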

Look at her face, it's a wonderful face  And it means something special to me  Look at the way that she smiles when she sees me  How lucky can one fellow be?    She's just my kind of girl, she makes me feel fine  Who could ever believe that she could be mine?  She's just my kind of girl, without her I'm blue  And if she ever leaves me what could I do, what could I do?    And when we go for a walk in the park  And she holds me and squeezes my hand  We'll go on walking for hours and talking  About all the things that we plan    She's just my kind of girl, she makes me feel fine  Who could ever believe that she could be mine?  She's just my kind of girl, without her I'm blue  And if she ever leaves me what could I do, what could I do?
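
The output above is the full lyric rather than a single line. One way to get only the first line, sketched here on a made-up string that imitates the ' \n' line endings seen in the dataset preview, is to split the original (uncleaned) text on '\n' and take the first piece:

```python
import pandas as pd

# Hypothetical lyric standing in for data['text'] before any cleaning
raw = pd.Series(
    ["Look at her face, it's a wonderful face  \nAnd it means something special to me  \n"]
)

# Split on the newline first, keep the first piece, then trim whitespace
first_line = raw.str.split("\n").str.get(0).str.strip()

print(first_line[0])
```

Applied to the real data, the same chain would start from `data['text']` instead of `txt_col`, since `txt_col` no longer contains the line breaks.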


(b) Find out the sentiment of the extracted lyric.

In [64]:
from textblob import TextBlob

t=TextBlob(first_sentence[0])
t.sentiment

Sentiment(polarity=0.4476190476190476, subjectivity=0.654978354978355)