Lowercasing, Punctuation removal, Stop words removal
--------------------------------------------------------------------------------

A> Converting Text Data to Lowercase
-------------------------------------------------------
This recipe shows how to lowercase text data so that all of it is in a
uniform format and tokens such as “NLP” and “nlp” are treated as the
same.

Problem
------------
How to lowercase the text data?

Solution
------------
The simplest way to do this is with Python's built-in str.lower() method.

lower() returns a new string in which every uppercase character has been
converted to lowercase. The original string is not modified.

In [10]:
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity','There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']

#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


In [11]:
# Execute lower() function on the text data
# When there is just the string, apply the lower() function directly as shown below:
x = 'Testing'
x2 = x.lower()
print(x2)

# But lower() is a string method, so it cannot be called directly on a data frame.

testing


In [12]:
# When you want to perform lowercasing on a data frame, 
# use the apply function as shown below:
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df['tweet']

0                          this is introduction to nlp
1                 it is likely to be useful, to people
2               machine learning is the new electrcity
3    there would be less hype around ai and more ac...
4                             python is the best tool!
5                                  r is good langauage
6                                     i like this book
7                          i want more books like this
Name: tweet, dtype: object
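
As an alternative to apply() with a lambda, pandas exposes string methods on a column through the .str accessor. The sketch below uses a two-row stand-in for the data frame above. Note that, unlike the split/join version, str.lower() leaves whitespace untouched, so trailing spaces are kept unless you also call .str.strip().

```python
import pandas as pd

# Two-row stand-in for the tweet data frame used above
df = pd.DataFrame({'tweet': ['This is introduction to NLP',
                             'python is the best tool!']})

# Series.str.lower() applies str.lower() to every element of the column
lowered = df['tweet'].str.lower()
print(lowered.tolist())
```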

B> Removing Punctuation
---------------------------------
This recipe shows how to remove punctuation from the text data. This step
is useful for many tasks because punctuation often adds little information.
Removing it reduces the size of the data and improves computational
efficiency.

Problem
------------
You want to remove punctuation from the text data.

Solution
------------
The simplest way to do this is with a regular expression, using re.sub()
for a single string or the pandas str.replace() method for a column.

In [13]:
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage', 'I like this book','I want more books like this']

#convert list to dataframe
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)


                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


In [14]:
# re.sub() with a regex removes every character that is not a word character or whitespace

import re

s="I. like. This book!"
s1=re.sub(r'[^\w\s]','',s)
s1



'I like This book'

In [15]:
# Or:

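# regex=True is required in pandas 2.0+, where str.replace() treats the pattern as a literal string by default
df['tweet']=df['tweet'].str.replace(r'[^\w\s]','',regex=True)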
df['tweet']


0                          This is introduction to NLP
1                 It is likely to be useful to people 
2               Machine learning is the new electrcity
3    There would be less hype around AI and more ac...
4                              python is the best tool
5                                  R is good langauage
6                                     I like this book
7                          I want more books like this
Name: tweet, dtype: object

In [17]:
# Or:

import string
s='I, like. This book!'

for c in string.punctuation:
    s=s.replace(c,'')

s


'I like This book'
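
The loop over string.punctuation can also be written as a single str.translate() call. This is a minimal sketch of that variant. Keep in mind that string.punctuation covers ASCII punctuation only and includes the underscore, whereas the regex [^\w\s] keeps underscores because \w matches them.

```python
import string

# Translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)

s = 'I, like. This book!'
cleaned = s.translate(table)
print(cleaned)
```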

C> Removing Stop Words
---------------------------------

This recipe shows how to remove stop words.

Stop words are very common words that carry little or no meaning compared
to other keywords. If we remove these very common words, we can focus on
the important keywords instead. For example, suppose a search engine
receives the query “How to develop chatbot using python.” If it looks for
pages containing every term (“how,” “to,” “develop,” “chatbot,” “using,”
“python”), it will find far more pages containing “how” and “to” than pages
about developing chatbots, because “how” and “to” are so common in English.
If we remove such terms, the search engine can focus on retrieving pages
that contain the keywords “develop,” “chatbot,” and “python,” which are more
likely to be of real interest. In the same way, we can also remove other
very common words and very rare words.
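
The search query example can be sketched in a few lines. The stop word set here is a tiny hand-picked one chosen only for illustration. It is an assumption and is not NLTK's list.

```python
# Tiny hand-picked stop word set, for illustration only (not NLTK's list)
stop_words = {'how', 'to', 'using'}

query = 'How to develop chatbot using python'

# Lowercase the query, split on whitespace, and keep the non-stop words
keywords = [w for w in query.lower().split() if w not in stop_words]
print(keywords)
```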

Problem
------------
You want to remove stop words.

Solution
------------
The simplest way to do this is by using the NLTK library. Alternatively,
you can build your own stop words file.
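
If you prefer your own stop words file, one possible layout is a plain-text file with one word per line. The sketch below assumes that layout. The file name my_stopwords.txt and the helper load_stop_words() are hypothetical names introduced for illustration, and the file is written to a temporary directory so the example runs on its own.

```python
import os
import tempfile

# Write a small, hypothetical stop word file: one word per line
path = os.path.join(tempfile.mkdtemp(), 'my_stopwords.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('is\nthe\nto\nthis\n')

def load_stop_words(path):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}

custom_stop = load_stop_words(path)

sentence = 'This is introduction to NLP'
filtered = ' '.join(w for w in sentence.split() if w.lower() not in custom_stop)
print(filtered)
```

Storing the words in a set rather than a list keeps each membership test fast, which matters once the corpus grows.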

In [29]:
text=['This is introduction to NLP','It is likely to be useful, to people ','Machine learning is the new electrcity', 'There would be less hype around AI and more action going forward','python is the best tool!','R is good langauage','I like this book','I want more books like this']

#convert list to data frame
import pandas as pd
df = pd.DataFrame({'tweet':text})
print(df)

                                               tweet
0                        This is introduction to NLP
1              It is likely to be useful, to people 
2             Machine learning is the new electrcity
3  There would be less hype around AI and more ac...
4                           python is the best tool!
5                                R is good langauage
6                                   I like this book
7                        I want more books like this


In [31]:
#install and import libraries
# !pip install nltk

import nltk
# nltk.download('stopwords')  # one-time download of the stop words corpus
from nltk.corpus import stopwords

#remove stop words
stop = stopwords.words('english')
# print(stop)
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x.lower() not in stop))
print(df['tweet'])

0                                  introduction NLP
1                             likely useful, people
2                   Machine learning new electrcity
3    would less hype around AI action going forward
4                                 python best tool!
5                                  R good langauage
6                                         like book
7                                   want books like
Name: tweet, dtype: object


# Self-study exercises in class
"""
1> Find the frequency (count) of the non-stop words and rank them in descending order. Use the df['tweet'] data above.
"""


In [32]:

# Version-1 : Using the Counter class to solve the above problem.
from collections import Counter
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1

print(cnt)



Counter({'blue': 3, 'red': 2, 'green': 1})


In [37]:
# Be careful: sorted() on the items orders them alphabetically by word, not by count.
print(sorted(cnt.items()))



[('blue', 3), ('green', 1), ('red', 2)]


In [38]:

# For reverse alphabetical sorting !!!!! 
sorted(cnt.items(), reverse=True)


[('red', 2), ('green', 1), ('blue', 3)]

In [40]:

# Solution : sort the (word, count) tuples by count, in descending order
l = cnt.items()
print(sorted(l, key = lambda item: item[1], reverse=True))


[('blue', 3), ('red', 2), ('green', 1)]


In [41]:


# Counter can read directly from an iterable, so no for loop is needed.
# It accepts any iterable of hashable items, such as a list,
# so the code above can be rewritten like this:
mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)

print(cnt)



Counter({'blue': 3, 'red': 2, 'green': 1})
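
Counter also has a most_common() method that returns the (word, count) pairs sorted by count in descending order, so the explicit sorted() call with a key function is not strictly needed. A minimal sketch with the same color list:

```python
from collections import Counter

cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])

# All pairs, highest count first
ranked = cnt.most_common()
print(ranked)

# Only the top two pairs
top_two = cnt.most_common(2)
print(top_two)
```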


In [42]:

#--------------------------------------------

# Version-2 : Not using Counter class

wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

wordlist = wordstring.split()

wordfreq = []
for w in wordlist:
    wordfreq.append(wordlist.count(w))

print("String\n" + wordstring +"\n")
print("List\n" + str(wordlist) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(zip(wordlist, wordfreq))) # prints only the zip object, not its contents


# Recall from Python Course : to print contents of zip object 
# convert to list or set
list(zip(wordlist, wordfreq))


String
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

List
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']

Frequencies
[4, 4, 4, 1, 4, 2, 4, 4, 4, 1, 4, 2, 4, 4, 4, 2, 4, 1, 4, 4, 4, 2, 4, 1]

Pairs
<zip object at 0x00000292A56743C8>


[('it', 4),
 ('was', 4),
 ('the', 4),
 ('best', 1),
 ('of', 4),
 ('times', 2),
 ('it', 4),
 ('was', 4),
 ('the', 4),
 ('worst', 1),
 ('of', 4),
 ('times', 2),
 ('it', 4),
 ('was', 4),
 ('the', 4),
 ('age', 2),
 ('of', 4),
 ('wisdom', 1),
 ('it', 4),
 ('was', 4),
 ('the', 4),
 ('age', 2),
 ('of', 4),
 ('foolishness', 1)]
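
The list-based approach above repeats a pair for every occurrence of a word and rescans the whole list on each call to count(). A possible variant, still without Counter, keeps one entry per word in a plain dict. This is a sketch on a shortened version of the same sentence, not a replacement for the cell above.

```python
wordstring = 'it was the best of times it was the worst of times'

# One dict entry per distinct word; each word is visited exactly once
wordfreq = {}
for w in wordstring.split():
    wordfreq[w] = wordfreq.get(w, 0) + 1

# Rank the (word, count) pairs by count, highest first
ranked = sorted(wordfreq.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```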

In [50]:


#---------------------------------------

# all concepts clear.
# Now focusing on our problem df['tweet']

import nltk
from nltk.corpus import stopwords

# removing all punctuation marks 
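# regex=True is required in pandas 2.0+, where the pattern is otherwise treated as a literal string
df['tweet'] = df['tweet'].str.replace(r'[^\w\s]', '', regex=True)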

# convert all words to the lower case
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x.lower() for x in x.split()))

# remove stop words
stop = stopwords.words('english')
df['tweet'] = df['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

# converting df to list of string
all_words =   [str(i) for i in df['tweet'].values]
print(all_words) # note: it's a list of phrases, NOT a list of individual words

all_split_words = []

print("----------------------------------------")

# extracting each word and making a single list of strings
for word in all_words :
    for x in word.split():
        all_split_words.append(x) 

# see the list of all string words
print(all_split_words)

# Create a frequency distribution. it needs a list of string words only.
freq = nltk.FreqDist(all_split_words)
# print(freq)
# Show the words in the list, with counts in desc order.
# print(freq.items())
sorted(list(freq.items()), key = lambda item: item[1], reverse=True )



['introduction nlp', 'likely useful people', 'machine learning new electrcity', 'would less hype around ai action going forward', 'python best tool', 'r good langauage', 'like book', 'want books like']
----------------------------------------
['introduction', 'nlp', 'likely', 'useful', 'people', 'machine', 'learning', 'new', 'electrcity', 'would', 'less', 'hype', 'around', 'ai', 'action', 'going', 'forward', 'python', 'best', 'tool', 'r', 'good', 'langauage', 'like', 'book', 'want', 'books', 'like']


[('like', 2),
 ('introduction', 1),
 ('nlp', 1),
 ('likely', 1),
 ('useful', 1),
 ('people', 1),
 ('machine', 1),
 ('learning', 1),
 ('new', 1),
 ('electrcity', 1),
 ('would', 1),
 ('less', 1),
 ('hype', 1),
 ('around', 1),
 ('ai', 1),
 ('action', 1),
 ('going', 1),
 ('forward', 1),
 ('python', 1),
 ('best', 1),
 ('tool', 1),
 ('r', 1),
 ('good', 1),
 ('langauage', 1),
 ('book', 1),
 ('want', 1),
 ('books', 1)]
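
The flattening loop and the frequency count can also be combined in one step. The sketch below uses a plain Counter in place of nltk.FreqDist so that it needs no extra install, and the phrases list is a three-item stand-in for the cleaned df['tweet'] values printed above.

```python
from collections import Counter

# Stand-in for a few of the cleaned phrases printed above
phrases = ['like book', 'want books like', 'python best tool']

# Flatten the phrases into words and count them in a single pass
freq = Counter(word for phrase in phrases for word in phrase.split())

# Most frequent word first
top = freq.most_common(1)
print(top)
```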

#--------------------------------

"""
Extra Reading 
https://programminghistorian.org/en/lessons/counting-frequencies
"""
