<a href="https://colab.research.google.com/github/heizsen/Ai/blob/main/NLP_Demo_Basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Tutorial adapted from: https://medium.com/pythoneers/basics-of-natural-language-processing-in-10-minutes-2ed51e6d5d32

In [None]:
import nltk
nltk.download("popular")
from nltk.tokenize import sent_tokenize, word_tokenize

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

**Tokenization** is the process of splitting a text into smaller units called tokens, typically sentences or words.

In [None]:
example_text = "Hello there, how are you doing today? The weather is great today. The sky is blue. Python is awesome"

# sent_tokenize splits the text into sentences
sentences = sent_tokenize(example_text)
print(sentences)
# word_tokenize splits the text into words and punctuation marks
words = word_tokenize(example_text)
print(words)

['Hello there, how are you doing today?', 'The weather is great today.', 'The sky is blue.', 'Python is awesome']
['Hello', 'there', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', 'today', '.', 'The', 'sky', 'is', 'blue', '.', 'Python', 'is', 'awesome']
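
To make the idea concrete, here is a minimal regex-based sketch of both kinds of tokenization. It is only an approximation for illustration: the helper names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this example, and NLTK's tokenizers handle many cases (abbreviations, contractions, quotes) that these two patterns ignore.

```python
import re

def simple_word_tokenize(text):
    # A token is either a run of word characters or a single
    # punctuation character (anything that is not a word char or space).
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '?' or '!'.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

sample = "The weather is great today. The sky is blue."
toy_words = simple_word_tokenize(sample)
toy_sentences = simple_sent_tokenize(sample)
print(toy_sentences)
print(toy_words)
```

Note that punctuation marks become tokens of their own, just as in the `word_tokenize` output above.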


In general, **stopwords** are the words in a language that do not add much meaning to a sentence, such as "the", "is" and "a". In NLP they are usually removed because they contribute little to the analysis of the data.

In [None]:
# Stopwords
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
from nltk.corpus import stopwords
text = 'he is a good boy. he is very good in coding'
words = word_tokenize(text)
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
words_without_stopwords = [word for word in words if word not in stop_words]
print(words_without_stopwords)

['good', 'boy', '.', 'good', 'coding']
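
The stopword list printed earlier is all lowercase, so a plain membership test is case-sensitive and capitalized stopwords such as "The" or "If" pass through the filter. The sketch below shows one possible way to handle this, lowercasing each token for the comparison and dropping punctuation with `str.isalpha()`. The tiny `toy_stopwords` set is a hand-picked subset used only so the example runs without NLTK.

```python
# Illustrative subset only; in the notebook the full list would come
# from stopwords.words('english').
toy_stopwords = {"the", "is", "a", "if", "he", "in", "very"}

tokens = ["If", "the", "sky", "is", "blue", ",", "The", "day", "is", "good", "."]

# Case-sensitive filtering keeps "If" and "The" because the list is lowercase.
case_sensitive = [t for t in tokens if t not in toy_stopwords]

# Compare in lowercase and keep only alphabetic tokens, which also
# removes the punctuation tokens.
case_insensitive = [t for t in tokens if t.isalpha() and t.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```

Whether punctuation and capitalized stopwords should be removed depends on the task; this is a design choice rather than a requirement.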


**Stemming** is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The resulting stem is not necessarily a valid dictionary word.

In [None]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()    # Create a PorterStemmer object
example_words = ['earn', 'earning', 'earned', 'earns']  # Example words
for w in example_words:
    print(ps.stem(w))    # Stem each word with the ps object

earn
earn
earn
earn
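
As a rough illustration of the suffix-stripping idea, the following toy stemmer removes the first matching suffix from a short rule list. It is a deliberately simplified sketch, not the Porter algorithm: the rule list `TOY_SUFFIXES`, the function `toy_stem` and the minimum stem length are all assumptions made for this example.

```python
# Toy rule list, checked in order; a hypothetical simplification,
# far smaller than the rule set of a real stemmer.
TOY_SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    for suffix in TOY_SUFFIXES:
        stem = word[: -len(suffix)]
        # Strip the suffix only if enough of the word remains.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

toy_stems = {w: toy_stem(w) for w in ["earn", "earning", "earned", "earns", "studies", "is"]}
print(toy_stems)
```

The entry for "studies" comes out as "studi", which shows why a stem need not be a dictionary word.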


**Lemmatization** does similar work to stemming. The difference is that lemmatization returns a meaningful dictionary word (the lemma) rather than a truncated stem.
(It is commonly used in chatbots.)


In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() ## Create object for lemmatizer
example_words = ['history','formality','changes']
for w in example_words:
    print(lemmatizer.lemmatize(w))

history
formality
change


**WordNet** is a lexical database (a kind of dictionary) for the English language that is widely used in natural language processing.
We can use WordNet to find synonyms and antonyms.

In [None]:
from nltk.corpus import wordnet
synonyms = []   # Empty list to collect the synonyms
antonyms = []   # Empty list to collect the antonyms
for syn in wordnet.synsets("happy"):  # Every synset (sense) of the word
    for lemma in syn.lemmas():        # Every lemma in that synset
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())  # First antonym only
print(set(synonyms))  # Convert to a set to keep unique values
print(set(antonyms))

{'happy', 'well-chosen', 'felicitous', 'glad'}
{'unhappy'}


**Part-of-Speech (PoS) tagging** is the process of converting a sentence, given as a list of words, into a list of tuples of the form (word, tag). The tag is a part-of-speech tag that signifies whether the word is a noun, adjective, verb, and so on.

```
 CC coordinating conjunction
 CD cardinal number
 DT determiner
 EX existential there (like: “there is” … think of it like “there”)
 FW foreign word
 IN preposition/subordinating conjunction
 JJ adjective ‘big’
 JJR adjective, comparative ‘bigger’
 JJS adjective, superlative ‘biggest’
 LS list marker 1)
 MD modal could, will
 NN noun, singular ‘desk’
 NNS noun plural ‘desks’
 NNP proper noun, singular ‘Harrison’
 NNPS proper noun, plural ‘Americans’
 PDT predeterminer ‘all the kids’
 POS possessive ending parent’s
 PRP personal pronoun I, he, she
 PRP$ possessive pronoun my, his, hers
 RB adverb very, silently,
 RBR adverb, comparative better
 RBS adverb, superlative best
 RP particle give up
 TO to go ‘to’ the store.
 UH interjection errrrrrrrm
 VB verb, base form take
 VBD verb, past tense took
 VBG verb, gerund/present participle taking
 VBN verb, past participle taken
 VBP verb, sing. present, non-3rd person take
 VBZ verb, 3rd person sing. present takes
 WDT wh-determiner which
 WP wh-pronoun who, what
 WP$ possessive wh-pronoun whose
 WRB wh-adverb where, when
```



In [None]:
sample_text = '''
An sincerity so extremity he additions. Her yet there truth merit. Mrs all projecting favourable now unpleasing. Son law garden chatty temper. Oh children provided to mr elegance marriage strongly. Off can admiration prosperous now devonshire diminution law.
'''
words = word_tokenize(sample_text)
print(nltk.pos_tag(words))

[('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'), ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.'), ('Her', 'PRP$'), ('yet', 'RB'), ('there', 'EX'), ('truth', 'NN'), ('merit', 'NN'), ('.', '.'), ('Mrs', 'NNP'), ('all', 'DT'), ('projecting', 'VBG'), ('favourable', 'JJ'), ('now', 'RB'), ('unpleasing', 'VBG'), ('.', '.'), ('Son', 'NNP'), ('law', 'NN'), ('garden', 'NN'), ('chatty', 'JJ'), ('temper', 'NN'), ('.', '.'), ('Oh', 'UH'), ('children', 'NNS'), ('provided', 'VBD'), ('to', 'TO'), ('mr', 'VB'), ('elegance', 'NN'), ('marriage', 'NN'), ('strongly', 'RB'), ('.', '.'), ('Off', 'CC'), ('can', 'MD'), ('admiration', 'VB'), ('prosperous', 'JJ'), ('now', 'RB'), ('devonshire', 'VBP'), ('diminution', 'NN'), ('law', 'NN'), ('.', '.')]
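
Because the tagger returns plain (word, tag) tuples, the result can be processed with ordinary Python. The sketch below copies the first few tuples from the output above and shows one way to count tags and pick out nouns and verbs by tag prefix. The variable names are chosen for this example only.

```python
from collections import Counter

# First sentence of the tagged output above, copied by hand.
tagged = [('An', 'DT'), ('sincerity', 'NN'), ('so', 'RB'), ('extremity', 'NN'),
          ('he', 'PRP'), ('additions', 'VBZ'), ('.', '.')]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# All noun tags start with 'NN' and all verb tags start with 'VB'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(tag_counts)
print(nouns)
print(verbs)
```

The sample text is nonsense prose, so some tags (for example "additions" as a verb) reflect the tagger's best guess rather than a meaningful reading.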


# **Bag of words**

After cleaning the text, we need to convert it into a numerical representation (a vector) so that we can feed the data to a machine learning model for further processing. The bag-of-words model is one such representation: each text becomes a vector of word occurrences.



```
sent1 = he is a good boy
sent2 = she is a good girl
sent3 = boy and girl are good
        |
        |
  After removal of stopwords, lemmatization or stemming
sent1 = good boy
sent2 = good girl
sent3 = boy girl good  
        | ### Now we will calculate the frequency for each word by
        |     calculating the occurrence of each word
word  frequency
good     3
boy      2
girl     2
         | ## Then we assign 0 or 1 to each word according to
         |    whether it occurs in the sentence
         | ## 1 for present and 0 for not present
         f1  f2   f3
        girl good boy
sent1    0    1    1
sent2    1    1    0
sent3    1    1    1
### After this we pass the vector form to a machine learning model
```
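
The steps in the illustration above can be sketched in plain Python, without any library. This is only a minimal sketch that assumes the sentences have already been cleaned; the names `cleaned`, `vocabulary` and `bow_matrix` are chosen for this example. The columns are sorted alphabetically (boy, girl, good), so their order differs from the illustration.

```python
# Sentences after stopword removal and lemmatization (taken from the illustration).
cleaned = ["good boy", "good girl", "boy girl good"]

# Vocabulary: every distinct word, sorted so the column order is stable.
vocabulary = sorted({word for sent in cleaned for word in sent.split()})

# One row per sentence, one column per vocabulary word, holding the word count.
bow_matrix = [[sent.split().count(word) for word in vocabulary] for sent in cleaned]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each word occurs at most once per sentence here, so the counts coincide with the 0/1 presence flags.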



In [None]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['he is a good boy', 'she is a good girl', 'boy and girl are good']
stop_words = set(stopwords.words('english'))  # Build the stopword set once
corpus = []
for sent in sentences:
    words = word_tokenize(sent)
    # Drop stopwords, then lemmatize the remaining words
    kept = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    corpus.append(' '.join(kept))
print(corpus)   # Cleaned data
cv = CountVectorizer()  # Columns follow the alphabetically sorted vocabulary
word_counts = cv.fit_transform(corpus).toarray()
print(word_counts)

['good boy', 'good girl', 'boy girl good']
[[1 0 1]
 [0 1 1]
 [1 1 1]]


# Classification of Text/Article using Bag-of-Words

Given three articles, each with a topic label, we will classify the topic of a new article.

In [None]:
# Three texts with labeled topic
text_sport = "Liverpool held off a late charge from Tottenham after Mohamed Salah struck twice to win 2-1 in north London and lift themselves back into the Premier League top-four race. Jurgen Klopp side had suffered shock defeats to relegation-threatened pair Nottingham Forest and Leeds in their last two league outings but started fast against Spurs, with a sharp touch and finish into the bottom corner from in-form Salah giving them the lead on 11 minutes. Ivan Perisic had a header deflected onto the post by Liverpool goalkeeper Alisson and Ryan Sessegnon saw a penalty shout from a challenge by Trent Alexander-Arnold waved away as Spurs came to life - but the Reds struck again just before the break thanks to a gift from Eric Dier. The centre-back miscued a header towards his own goal and Salah (40) raced through to chip in his ninth goal in eight games. Tottenham were sent out early for the second half and - not for the first time this season - were better after the break, with Alisson again pushing a Perisic effort onto the woodwork before Harry Kane (70) fired home a brilliant strike when played in by sub Dejan Kulusevski. Rodrigo Bentancur went close with a couple of headers and Kane glanced wide as Spurs desperately sought another late goal but Liverpool clung on for their first away win in the Premier League this season to move up to eighth and within seven points of fourth-placed Spurs, with a game in hand."
text_medical = "There is growing popularity of the Hospital Incident Command System (HICS) as an organizational tool for hospital management in the COVID-19 pandemic. We specifically describe implementation of HICS at the Isfahan province reference hospital (Isabn-e-Maryam) during the COVID-19 pandemic and try to explore performance of it. Methods: To document the actions taken during the COVID-19 pandemic, standard, open-ended interviews were conducted with individuals occupying activated HICS leadership positions during the event. A checklist based on the job action sheets of the HICS was used for performance assessment. Results: With the onset of the pandemic, hospital director revised ICS structure that adheres to span of better control of COVID-19. Methods of expanding hospital inpatient capacity to enable surge capacity were considered. The highest performance score was in the field of planning. Performance was intermediate in Financial/Administration section and good in other fields. Discussion: In the current COVID-19 pandemic, establishing HICS with some consideration about long-standing events can help improve communication, resource use, staff and patient protection, and maintenance of roles."
text_finance = "According to Fidelity’s Financial Resolutions survey, saving more money was the number one-resolution for respondents. Close to half (43%) said this was a goal they wanted to work toward in the new year. Building a nest egg can help you pay for big ticket items — like a house, a vacation, a wedding or even just an expensive item you really want — without taking on additional debt. Having savings can also come in handy if an emergency expense comes up. If you’re looking to increase your savings in the new year, it helps to start small — even if you’re only transferring $10 a week into your savings account. Starting small helps you build a muscle for saving. This way, when you receive salary bumps, bonuses and gift money, you’ve already gotten into the habit of saving, and you’ll be more likely to transfer that money to your savings account. You may also want to consider automating that process instead of just manually moving money into your savings. Relying on manual transfers leaves a lot of room for procrastination — and before we know it, we’ve spent the money we intended to save. But when you set up automatic transfers into your savings account, you take away the need to make that decision altogether. You can usually schedule automatic transfers through your bank’s mobile app. Lastly, if you want to see your savings grow just a little faster, you can opt for a high-yield savings account instead of a traditional savings account. High-yield savings accounts — like the Marcus by Goldman Sachs Online Savings Account or the Ally Online Savings Account — pay you more in interest each month compared to traditional savings accounts."

texts = [text_sport, text_medical, text_finance]
stop_words = set(stopwords.words('english'))  # Build the stopword set once
bow_keys = set()
corpus_texts = []
for text in texts:
    words = word_tokenize(text)
    # Use a separate name so the list being iterated is not shadowed
    kept = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    bow_keys.update(kept)
    corpus_texts.append(' '.join(kept))
print(bow_keys)       # Vocabulary of the cleaned texts
print(corpus_texts)   # Cleaned data

{'Financial', 'looking', 'salary', 'take', 'another', 'surge', 'role', 'manually', 'patient', 'receive', 'interview', 'enable', 'Trent', 'Kane', 'COVID-19', 'minute', 'Ivan', '43', '%', 'know', 'waved', 'span', 'sub', 'expanding', 'vacation', 'communication', 'set', 'fourth-placed', 'traditional', 'revised', 'came', 'really', 'altogether', 'side', 'centre-back', 'fast', 'Rodrigo', 'held', 'establishing', 'played', 'If', 'need', 'expensive', 'opt', 'miscued', 'maintenance', '10', '’', 'checklist', 'touch', 'may', 'want', 'Savings', 'considered', 'spent', 'second', 'Nottingham', 'taking', 'tool', 'muscle', 'hospital', 'Leeds', 'adheres', 'week', 'London', 'finish', 'compared', 'Liverpool', 'charge', 'fired', 'schedule', 'activated', 'long-standing', 'Perisic', 'pay', 'brilliant', 'couple', 'ticket', 'staff', '(', ')', 'Alisson', 'goal', 'shock', 'improve', 'save', '.', 'sharp', 'gift', 'make', 'top-four', 'conducted', 'help', 'consideration', 'implementation', 'Building', 'first', 'time'

In [None]:
# A new text to be classified based on topic
query_text = "Federal revenue for the period from January to September 2022 totalled approximately €256.7bn, up by 10.1% (about €23.6bn) on the year. Tax receipts (including EU own resources that are subtracted from the total) increased by 10.1% (about €22.0bn) on the year. Revenue from value added taxes rose by 22.2% (about €18.5bn), while receipts from income tax and corporation tax grew by 7.9% (about €8.9bn). Federal revenue fell as a result of a year-on-year increase of approximately €4.1bn in public transport subsidies to the Länder. These additional subsidies were used to offset revenue losses in the public transport sector and to finance the 9-euro ticket scheme (a temporary reduced-rate public transport ticket costing €9 per month in the months of June, July and August 2022)."
query_words = word_tokenize(query_text)
stop_words = set(stopwords.words('english'))  # Build the stopword set once
query_words_clean = [lemmatizer.lemmatize(word) for word in query_words if word not in stop_words]
# Keep only the words that already occur in the labeled texts
query_words_corpus = [word for word in query_words_clean if word in bow_keys]
query_text_corpus = ' '.join(query_words_corpus)
corpus_texts.append(query_text_corpus)

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  # Create a CountVectorizer object
bow_vectors = cv.fit_transform(corpus_texts).toarray()
print(bow_vectors)
print(len(bow_vectors[0]))  # Number of columns, i.e. the vocabulary size

[[0 1 0 ... 0 0 0]
 [0 0 5 ... 0 0 0]
 [1 0 0 ... 2 2 2]
 [0 0 0 ... 2 0 0]]
330


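Classification based on maximum similarity: each bag-of-words vector is normalized to unit length, so the dot product of two vectors equals their cosine similarity. The query text is assigned the topic of the labeled text it is most similar to.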

In [None]:
# Normalize the BoW vectors to unit length
bow_texts_norm = []
for bow in bow_vectors:
    length = (sum(i*i for i in bow)) ** 0.5
    bow_texts_norm.append(bow / length)

# Compute similarity using the dot product
bow_norm_query = bow_texts_norm[3]
similarity_vector = [sum(i*j for i, j in zip(bow, bow_norm_query)) for bow in bow_texts_norm[:3]]
print(similarity_vector)

# Find the highest similarity and map its index to the topic label
labels = ["Sport", "Medical", "Finance"]
id_max_sim = similarity_vector.index(max(similarity_vector))
print("The query text is classified as:", labels[id_max_sim])

[0.0, 0.03214121732666125, 0.11158046066936295]
The query text is classified as: Finance
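
The normalize-then-dot-product procedure used above is the cosine similarity of two vectors. The sketch below wraps it in one small function and adds a guard for all-zero vectors, which would otherwise cause a division by zero when a query shares no words with the labeled texts. The function name `cosine_similarity` and the short toy vectors are assumptions made for this illustration; they are not the 330-dimensional vectors computed in the notebook.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # An empty vector is treated as similar to nothing.
    return dot / (norm_u * norm_v)

# Toy count vectors over a hypothetical 4-word vocabulary.
doc_vectors = {
    "Sport": [2, 0, 1, 0],
    "Medical": [0, 3, 0, 0],
    "Finance": [0, 0, 1, 2],
}
query = [0, 0, 1, 1]

scores = {label: cosine_similarity(vec, query) for label, vec in doc_vectors.items()}
best = max(scores, key=scores.get)
print(scores)
print("The query text is classified as:", best)
```

Keeping the labels in a dictionary next to their vectors avoids having to map list indices back to topic names by hand.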
