<a href="https://colab.research.google.com/github/gacheru101/ML_NPL/blob/main/NLP_Assignment1_COVID_Tweetsipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [31]:
# Import the necessary libraries
import pandas as pd
import nltk
import spacy
import string
import re


In [21]:
# Loading the dataset for Covid19 Tweets
df = pd.read_csv("/content/sample_data/covid19_tweets.csv")
df = df.dropna(subset=['text'])   # drop empty tweets
df.head(5)


Unnamed: 0,user_name,user_location,user_description,user_created,user_followers,user_friends,user_favourites,user_verified,date,text,hashtags,source,is_retweet
0,ᏉᎥ☻լꂅϮ,astroworld,wednesday addams as a disney princess keepin i...,2017-05-26 05:46:42,624,950,18775,False,2020-07-25 12:27:21,If I smelled the scent of hand sanitizers toda...,,Twitter for iPhone,False
1,Tom Basile 🇺🇸,"New York, NY","Husband, Father, Columnist & Commentator. Auth...",2009-04-16 20:06:23,2253,1677,24,True,2020-07-25 12:27:17,Hey @Yankees @YankeesPR and @MLB - wouldn't it...,,Twitter for Android,False
2,Time4fisticuffs,"Pewee Valley, KY",#Christian #Catholic #Conservative #Reagan #Re...,2009-02-28 18:57:41,9275,9525,7254,False,2020-07-25 12:27:14,@diane3443 @wdunlap @realDonaldTrump Trump nev...,['COVID19'],Twitter for Android,False
3,ethel mertz,Stuck in the Middle,#Browns #Indians #ClevelandProud #[]_[] #Cavs ...,2019-03-07 01:45:06,197,987,1488,False,2020-07-25 12:27:10,@brookbanktv The one gift #COVID19 has give me...,['COVID19'],Twitter for iPhone,False
4,DIPR-J&K,Jammu and Kashmir,🖊️Official Twitter handle of Department of Inf...,2017-02-12 06:45:15,101009,168,101,False,2020-07-25 12:27:08,25 July : Media Bulletin on Novel #CoronaVirus...,"['CoronaVirusUpdates', 'COVID19']",Twitter for Android,False


##  Stemming

**Definition:**  
Stemming is a technique that reduces words to a common stem by stripping affixes; the Porter stemmer used here removes suffixes only. It is rule-based and does not always produce valid dictionary words.

**Example Transformations:**  
- "increasing" → "increas"  
- "rapidly" → "rapidli"  
- "spreading" → "spread"  

**Results Obtained:**  

- **Original Tokens:**  
`['COVID', 'cases', 'are', 'increasing', 'rapidly', 'and', 'spreading', 'faster', 'than', 'expected']`

- **Stemmed Tokens:**  
`['covid', 'case', 'are', 'increas', 'rapidli', 'and', 'spread', 'faster', 'than', 'expect']`

**Interpretation:**  
Stemming reduces words to their base forms, but it may produce words that are not real English (like *rapidli*). Despite this limitation, it is fast and useful for many NLP tasks.
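To make the rule-based idea concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only, not the Porter algorithm: the rule list `SUFFIX_RULES`, the function `toy_stem`, and the `min_stem_len` parameter are all hypothetical names, and the real Porter stemmer applies many more rules and conditions.

```python
# Toy suffix-stripping stemmer: an illustration of rule-based stemming.
# NOT the Porter algorithm; the rules below are a hypothetical, tiny sample.
SUFFIX_RULES = [
    ("ing", ""),   # increasing -> increas
    ("ed", ""),    # expected   -> expect
    ("s", ""),     # cases      -> case
    ("y", "i"),    # rapidly    -> rapidli
]

def toy_stem(word, min_stem_len=3):
    """Lowercase the word and apply the first matching suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Only strip when enough of the word is left to be meaningful.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word

tokens = ["COVID", "cases", "are", "increasing", "rapidly",
          "and", "spreading", "faster", "than", "expected"]
toy_stems = [toy_stem(t) for t in tokens]
print(toy_stems)
```

Because the rules only look at the surface form of each word, nothing checks whether the result is a dictionary word, which is why output such as *rapidli* appears.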


In [22]:
from nltk.stem import PorterStemmer

# Download required tokenizers
nltk.download('punkt')
nltk.download('punkt_tab')



# Taking the first tweet as an example
# (.iloc[0] selects by position, so it still works if dropna() removed index label 0)
sample_text = df['text'].iloc[0]
print("Original Tweet:", sample_text)

# Tokenize
tokens = nltk.word_tokenize(sample_text)

# Apply stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]

print("\nOriginal Tokens:", tokens)
print("Stemmed Tokens:", stems)

Original Tweet: If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0

Original Tokens: ['If', 'I', 'smelled', 'the', 'scent', 'of', 'hand', 'sanitizers', 'today', 'on', 'someone', 'in', 'the', 'past', ',', 'I', 'would', 'think', 'they', 'were', 'so', 'intoxicated', 'that…', 'https', ':', '//t.co/QZvYbrOgb0']
Stemmed Tokens: ['if', 'i', 'smell', 'the', 'scent', 'of', 'hand', 'sanit', 'today', 'on', 'someon', 'in', 'the', 'past', ',', 'i', 'would', 'think', 'they', 'were', 'so', 'intox', 'that…', 'http', ':', '//t.co/qzvybrogb0']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


##  Lemmatization

**Definition:**  
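Lemmatization is the process of reducing words to their **dictionary (base) form**, using linguistic knowledge such as vocabulary and part of speech. Unlike stemming, it normally produces valid English words, although noisy tokens such as @mentions can still yield odd lemmas (for example, `@Yankees` becomes `@yankee` in the output below).  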

**Example Transformations:**  
- "increasing" → "increase"  
- "rapidly" → "rapidly" (adverb remains same)  
- "spreading" → "spread"  

**Results Obtained:**  

- **Original Tokens:**  
`['COVID', 'cases', 'are', 'increasing', 'rapidly', 'and', 'spreading', 'faster', 'than', 'expected']`

- **Lemmatized Tokens:**  
`['COVID', 'case', 'be', 'increase', 'rapidly', 'and', 'spread', 'fast', 'than', 'expect']`

**Interpretation:**  
Lemmatization gives cleaner and valid words compared to stemming. For example, *rapidly* stays correct, and *faster* is reduced to *fast*. This makes it better for tasks like text analysis and search.
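The contrast with stemming can be sketched with a toy lookup lemmatizer. This is only an illustration of the idea that lemmas come from linguistic knowledge (a vocabulary plus the part of speech) rather than suffix rules; it is not how spaCy is implemented. The table `LEMMA_TABLE`, the function `toy_lemmatize`, and the hand-assigned POS tags are all assumptions made for this example.

```python
# Toy lemmatizer: a lookup keyed on (lowercased word, part of speech).
# LEMMA_TABLE is a hypothetical hand-made sample, not spaCy's lemma data.
LEMMA_TABLE = {
    ("cases", "NOUN"): "case",
    ("are", "AUX"): "be",
    ("increasing", "VERB"): "increase",
    ("spreading", "VERB"): "spread",
    ("faster", "ADV"): "fast",
    ("expected", "VERB"): "expect",
}

def toy_lemmatize(word, pos):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

# POS tags here are assigned by hand for illustration (an assumption).
tagged = [
    ("COVID", "PROPN"), ("cases", "NOUN"), ("are", "AUX"),
    ("increasing", "VERB"), ("rapidly", "ADV"), ("and", "CCONJ"),
    ("spreading", "VERB"), ("faster", "ADV"), ("than", "ADP"),
    ("expected", "VERB"),
]
toy_lemmas = [toy_lemmatize(word, pos) for word, pos in tagged]
print(toy_lemmas)

# The same surface form can map differently depending on its POS:
print(toy_lemmatize("spreading", "NOUN"))  # no NOUN entry, so it is returned as-is
```

Unknown words fall through unchanged, which mirrors why *rapidly* keeps its valid form instead of being cut to *rapidli*.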


In [32]:
# Loading SpaCy model
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [33]:
# Lemmatization function
def spacy_lemmatize(text):
    doc = nlp(str(text))  # coerce to str in case of non-string values
    return [token.lemma_ for token in doc]

# Apply to the first 5 tweets, reusing the function defined above
for i, tweet in enumerate(df["text"].head(5)):
    lemmas = spacy_lemmatize(tweet)
    print(f"\nTweet {i+1}: {tweet}")
    print("Lemmatized:", lemmas)



Tweet 1: If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
Lemmatized: ['if', 'I', 'smell', 'the', 'scent', 'of', 'hand', 'sanitizer', 'today', 'on', 'someone', 'in', 'the', 'past', ',', 'I', 'would', 'think', 'they', 'be', 'so', 'intoxicated', 'that', '…', 'https://t.co/QZvYbrOgb0']

Tweet 2: Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu
Lemmatized: ['hey', '@yankee', '@yankeespr', 'and', '@mlb', '-', 'would', 'not', 'it', 'have', 'make', 'more', 'sense', 'to', 'have', 'the', 'player', 'pay', 'their', 'respect', 'to', 'the', 'A', '…', 'https://t.co/1qvw0zgypu']

Tweet 3: @diane3443 @wdunlap @realDonaldTrump Trump never once claimed #COVID19 was a hoax. We all claim that this effort to… https://t.co/Jkk8vHWHb3
Lemmatized: ['@diane3443', '@wdunlap', '@realdonaldtrump', 'Trump', 'never', 'once'

## Normalization

Normalization is the process of transforming raw text into a standard and consistent format before applying other Natural Language Processing techniques. Raw tweets often contain noise such as URLs, mentions, hashtags, numbers, emojis, and inconsistent casing (e.g., COVID, covid, Covid-19).

In this step, we applied the following normalization techniques to the Covid-19 Tweets dataset:

- **Lowercasing** – converts all characters to lowercase so that Covid, COVID, and covid are treated the same.
- **Removing URLs** – eliminates hyperlinks that do not add semantic meaning.
- **Removing mentions and hashtags** – deletes @usernames and whole #hashtags (the tag text included) that are not essential for sentiment or semantic meaning.
- **Removing numbers** – discards numerical values unless specifically required for analysis.
- **Removing punctuation** – strips ASCII punctuation such as `!`, `?`, and `,`.
- **Whitespace handling** – trims leading/trailing spaces and reduces multiple spaces to a single space.

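**Example Output**

Original:
Breaking: COVID-19 cases rise in New York! Follow updates here 👉 https://t.co/xyz123 #Covid19

Normalized:
breaking covid cases rise in new york follow updates here 👉

The emoji is kept because none of the steps listed above removes it: the punctuation step only covers the ASCII characters in `string.punctuation`.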

Normalization ensures consistency in the dataset, making it easier for downstream tasks such as Lemmatization, POS tagging, and Named Entity Recognition (NER).
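Emojis and typographic characters such as the ellipsis `…` are named above as noise, but `string.punctuation` contains only ASCII symbols, so they pass through an ASCII-only punctuation filter. One possible extension, sketched here as an assumption rather than as part of the pipeline used in this notebook, is to filter on Unicode character categories with the standard-library `unicodedata` module. The function name `strip_symbols_and_punctuation` is hypothetical.

```python
import re
import unicodedata

def strip_symbols_and_punctuation(text):
    """Replace Unicode punctuation (categories P*) and symbols (S*) with spaces.

    Most emoji fall in category 'So'. Some emoji sequences also use joiners
    or variation selectors, which this simple sketch does not handle.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in ("P", "S") else ch
        for ch in text
    )
    # Collapse the whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", cleaned).strip()

sample = "so intoxicated that… follow updates here 👉 now!"
result = strip_symbols_and_punctuation(sample)
print(result)
```

Replacing with a space instead of deleting keeps neighbouring words from being glued together; the final regex then collapses any doubled spaces.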

In [26]:
# Loading spaCy English model
nlp = spacy.load("en_core_web_sm")

# Normalization function
def normalize_text(text):
    # 1. Lowercasing
    text = text.lower()

    # 2. Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)

    # 3. Remove mentions and hashtags
    text = re.sub(r'@\w+|#\w+', '', text)

    # 4. Remove numbers
    text = re.sub(r'\d+', '', text)

    # 5. Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # 6. Collapse runs of whitespace and trim the ends
    #    (raw string avoids the invalid-escape warning for '\s')
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Applying normalization to first 5 tweets
for i, tweet in enumerate(df["text"].head(5)):
    print(f"\nOriginal Tweet {i+1}: {tweet}")
    print("Normalized:", normalize_text(tweet))



Original Tweet 1: If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
Normalized: if i smelled the scent of hand sanitizers today on someone in the past i would think they were so intoxicated that…

Original Tweet 2: Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu
Normalized: hey and wouldnt it have made more sense to have the players pay their respects to the a…

Original Tweet 3: @diane3443 @wdunlap @realDonaldTrump Trump never once claimed #COVID19 was a hoax. We all claim that this effort to… https://t.co/Jkk8vHWHb3
Normalized: trump never once claimed was a hoax we all claim that this effort to…

Original Tweet 4: @brookbanktv The one gift #COVID19 has give me is an appreciation for the simple things that were always around me… https://t.co/Z0pOAlFXcW
Normalized: the one gift has give me

## Text Enrichment / Augmentation (Part-of-Speech Tagging)

**Concept:**
Part-of-Speech (POS) tagging is the process of labeling each word in a text with its grammatical role, such as noun, verb, adjective, adverb, pronoun, preposition, etc. POS tagging enriches raw text by adding syntactic information that helps in understanding sentence structure and meaning.

In Natural Language Processing (NLP), POS tagging is useful in:

- Information extraction (e.g., extracting names, actions, places).
- Text classification and sentiment analysis.
- Named Entity Recognition (NER).
- Building more accurate language models.
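As a small illustration of how POS tags can feed these tasks, the sketch below filters content words and counts tag frequencies from a list of `(token, tag)` pairs. The pairs are a short hand-copied sample in the Universal POS tag style, and the helper name `tokens_with_tags` is hypothetical; this is one possible way to use the tags, not a prescribed method.

```python
from collections import Counter

# A short hand-copied sample of (token, tag) pairs in Universal POS style.
tagged_tokens = [
    ("If", "SCONJ"), ("I", "PRON"), ("smelled", "VERB"), ("the", "DET"),
    ("scent", "NOUN"), ("of", "ADP"), ("hand", "NOUN"),
    ("sanitizers", "NOUN"), ("today", "NOUN"),
]

def tokens_with_tags(pairs, wanted):
    """Keep only the tokens whose POS tag is in the wanted set."""
    return [text for text, tag in pairs if tag in wanted]

# Content words are often more informative than function words
# for classification or information extraction.
content_words = tokens_with_tags(tagged_tokens, {"NOUN", "VERB", "ADJ"})
print(content_words)

# Tag frequencies give a quick profile of the sentence structure.
tag_counts = Counter(tag for _, tag in tagged_tokens)
print(tag_counts.most_common(3))
```

Filtering on tags like this is a cheap alternative to a stop-word list when the goal is to keep only nouns, verbs, and adjectives.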

In [28]:
#Function to extract POS tags
def pos_tagging(text):
    doc = nlp(str(text))  # Ensure input is string
    return [(token.text, token.pos_) for token in doc]

#Applying POS tagging to first 5 tweets
for i, tweet in enumerate(df["text"].head(5)):
    print(f"\nTweet {i+1}: {tweet}")
    print("POS Tags:", pos_tagging(tweet))


Tweet 1: If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
POS Tags: [('If', 'SCONJ'), ('I', 'PRON'), ('smelled', 'VERB'), ('the', 'DET'), ('scent', 'NOUN'), ('of', 'ADP'), ('hand', 'NOUN'), ('sanitizers', 'NOUN'), ('today', 'NOUN'), ('on', 'ADP'), ('someone', 'PRON'), ('in', 'ADP'), ('the', 'DET'), ('past', 'NOUN'), (',', 'PUNCT'), ('I', 'PRON'), ('would', 'AUX'), ('think', 'VERB'), ('they', 'PRON'), ('were', 'AUX'), ('so', 'ADV'), ('intoxicated', 'ADJ'), ('that', 'SCONJ'), ('…', 'PUNCT'), ('https://t.co/QZvYbrOgb0', 'NOUN')]

Tweet 2: Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu
POS Tags: [('Hey', 'INTJ'), ('@Yankees', 'VERB'), ('@YankeesPR', 'NOUN'), ('and', 'CCONJ'), ('@MLB', 'NOUN'), ('-', 'PUNCT'), ('would', 'AUX'), ("n't", 'PART'), ('it', 'PRON'), ('have', 'AUX'), ('made', 'VERB'), ('more


In [30]:
#Function for NER
def named_entity_recognition(text):
    doc = nlp(str(text))
    return [(ent.text, ent.label_) for ent in doc.ents]

#Applying NER to first 5 tweets
for i, tweet in enumerate(df["text"].head(5)):
    print(f"\nTweet {i+1}: {tweet}")
    print("Named Entities:", named_entity_recognition(tweet))


Tweet 1: If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0
Named Entities: [('today', 'DATE')]

Tweet 2: Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu
Named Entities: [('@MLB', 'GPE')]

Tweet 3: @diane3443 @wdunlap @realDonaldTrump Trump never once claimed #COVID19 was a hoax. We all claim that this effort to… https://t.co/Jkk8vHWHb3
Named Entities: [('@diane3443', 'CARDINAL'), ('@wdunlap', 'ORG'), ('@realDonaldTrump Trump', 'PERSON'), ('https://t.co/Jkk8vHWHb3', 'ORG')]

Tweet 4: @brookbanktv The one gift #COVID19 has give me is an appreciation for the simple things that were always around me… https://t.co/Z0pOAlFXcW
Named Entities: [('one', 'CARDINAL'), ('https://t.co/Z0pOAlFXcW', 'PERSON')]

Tweet 5: 25 July : Media Bulletin on Novel #CoronaVirusUpdates #COVID19 
@kansalrohit69 @DrSyedSehris

## Named Entity Recognition (NER)

**Concept:**
Named Entity Recognition (NER) is an NLP technique that identifies and classifies real-world entities mentioned in text into predefined categories such as:

- PERSON → Names of people (e.g., Donald Trump).
- ORG → Organizations (e.g., WHO, CDC).
- GPE → Geo-political entities like countries and cities (e.g., Kenya, United States).
- DATE → Absolute or relative dates and periods (e.g., 2020, March); times of day fall under the separate TIME label.
- MONEY, TIME, PERCENT, LOC, PRODUCT, etc.
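The NER output on the raw tweets above tags several @mentions and URLs as entities (for example a `t.co` link labelled ORG). A minimal post-processing sketch, assuming entities arrive as `(text, label)` pairs, is to drop anything that looks like a mention, hashtag, or link and then group the rest by label. The helper names `looks_like_noise` and `group_by_label` are hypothetical, and the prefix check is a rough heuristic that can also discard real names attached to a mention.

```python
from collections import defaultdict

# Sample (text, label) pairs copied from the NER output on the tweets.
entities = [
    ("today", "DATE"),
    ("@diane3443", "CARDINAL"),
    ("@wdunlap", "ORG"),
    ("@realDonaldTrump Trump", "PERSON"),
    ("https://t.co/Jkk8vHWHb3", "ORG"),
]

def looks_like_noise(entity_text):
    """Heuristic: mentions, hashtags, and links are not real-world entities."""
    return entity_text.startswith(("@", "#", "http://", "https://"))

def group_by_label(pairs):
    """Group entity texts by label, skipping noisy spans."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if not looks_like_noise(text):
            grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(entities)
print(grouped)
```

Running NER on text with mentions and URLs already stripped is another option; filtering afterwards, as here, keeps the original tweet intact for the model.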

