## Key Features of This Preprocessing Strategy (for Twitter-based Sentiment Classification)

1. **Encoding Robustness & HTML Entity Handling**
   The pipeline begins with **encoding normalization** (`latin1` → `utf-8`) to handle character encoding issues common in social media data. **HTML entities** are then handled explicitly: `&lt;`, `&gt;`, and `&nbsp;` become spaces, `&quot;` becomes a literal double quote, and `&amp;` (along with any bare `&`) is normalized to "and" to preserve semantic meaning.

2. **Contextual Placeholder Replacement**
   **User mentions** (`@username`) are replaced with the consistent token `"user"` rather than removal, preserving the **conversational context** and interpersonal dynamics that may influence sentiment. **URLs** are similarly replaced with `"http"` to maintain the presence of **link-sharing behavior** without introducing noise from diverse domain names.

3. **COVID-19 Domain-Specific Normalization**
   A **targeted COVID unification strategy** consolidates hashtag variations (`#covid19`, `#coronavirus`, `#corona`, `#corvid`, etc.) into the single normalized form `#covid`. This **domain-aware preprocessing** improves model consistency and generalization across pandemic-related expressions while maintaining topic coherence.

4. **Intelligent Hashtag Processing with Semantic Preservation**
   **Hashtags are retained with their `#` symbols** to preserve their semantic function as topic markers. **CamelCase hashtags** are intelligently split (`#StayHomeSaveLives` → `#Stay #Home #Save #Lives`) using regex patterns that handle mixed cases, numbers, and all-caps terms, allowing RoBERTa's subword tokenization to better recognize individual concepts while maintaining hashtag context.

5. **Gibberish Hashtag Filtering**
   A **two-stage cleaning process** first applies general normalization, then removes **meaningless or spam hashtags** from a curated gibberish list. This targeted filtering eliminates noise while preserving legitimate hashtags that carry sentiment-relevant information.

6. **Conservative Text Normalization**
   The approach applies **measured cleaning** including emoji removal, currency symbol replacement (`$` → `"money"`), and non-ASCII character filtering, while **avoiding over-aggressive punctuation normalization** that could remove emotionally significant patterns (e.g., preserving question marks and exclamation points that indicate sentiment intensity).

7. **Deduplication and Coherence Enhancement**
   **Duplicate word removal** and **duplicate hashtag consolidation** reduce redundancy without losing semantic content. **Whitespace normalization** and **repeated punctuation reduction** (`!!!` → `!`) clean the text while preserving emotional markers that are crucial for sentiment analysis.

8. **Transformer-Optimized Design**
   Unlike aggressive cleaning pipelines, this strategy **preserves linguistic structure** and **contextual cues** that modern transformer models (RoBERTa/BERT) can leverage. The approach maintains **enough textual complexity** for the model to learn nuanced sentiment patterns while removing genuine noise that could impair tokenization or training stability.

9. **Twitter-Specific Robustness**
   The preprocessing pipeline is specifically designed for **social media text characteristics**: informal language, hashtags, mentions, URLs, and encoding issues. This domain-specific approach targets Twitter sentiment classification by balancing noise reduction against information preservation.
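
As a rough illustration of features 2 and 4, the sketch below applies the mention and URL placeholders and then splits a CamelCase hashtag. The helper names `split_hashtag` and `demo_placeholders` are hypothetical, and the all-caps branch of the word pattern uses a negative lookahead, which is an assumption of this sketch rather than a quote of the pipeline code.

```python
import re

# Word pieces inside a hashtag: all-caps runs, Capitalized words, digits, lowercase runs.
# Assumption: (?![a-z]) stops an all-caps run before the capital that starts a new word.
WORD_PATTERN = r"[A-Z]+(?![a-z])|[A-Z][a-z]+|\d+|[a-z]+"

def split_hashtag(match):
    """Turn '#StayHome' into '#Stay #Home'; fall back to the original tag."""
    words = re.findall(WORD_PATTERN, match.group(1))
    return "#" + " #".join(words) if words else "#" + match.group(1)

def demo_placeholders(text):
    """Hypothetical helper: mention/URL placeholders plus hashtag splitting."""
    text = re.sub(r"@\w+", "user", text)            # any @handle becomes "user"
    text = re.sub(r"http\S+|www\S+", "http", text)  # any link becomes "http"
    return re.sub(r"#([A-Za-z0-9]{3,})", split_hashtag, text)

print(demo_placeholders("@WHO says #StayHomeSaveLives https://t.co/abc"))
# user says #Stay #Home #Save #Lives http
```

Keeping the `#` on every piece means each split word is still marked as part of a hashtag when it reaches the tokenizer.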

### Imports

In [1]:
! pip install nlpaug

Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


In [2]:
! pip install torch
! pip install emoji

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import nltk
from nltk.corpus import stopwords
import string
import re
import html
import emoji
import matplotlib.ticker as ticker
from collections import defaultdict
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
import numpy as np
import random
from tqdm import tqdm
import nlpaug.augmenter.word as naw

import gc
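# Try to import torch for back translation check (optional dependency)
try:
    import torch
except ImportError:
    torch = None

# Set random seeds
random.seed(42)
np.random.seed(42)
if torch is not None:  # avoid AttributeError when torch is unavailable
    torch.manual_seed(42)
# Colab: mount Google Drive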
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


#### Load Data

In [4]:
train = pd.read_csv('/content/drive/MyDrive/deep_learning/Corona_NLP_train.csv', encoding='latin1')
test = pd.read_csv('/content/drive/MyDrive/deep_learning/Corona_NLP_test.csv', encoding='latin1')

Light data cleaning:

In [5]:
def clean_for_cardiffnlp(text):
    if pd.isnull(text):
        return ""

    tokens = []
    for t in text.split(" "):
        if t.startswith("@") and len(t) > 1:
            tokens.append("@user")
        elif t.startswith("http"):
            tokens.append("http")
        else:
            tokens.append(t)
    text = " ".join(tokens)

    # Normalize common COVID variants to "covid"
    text = re.sub(r"\b(coronaviruspandemic|covid[_\s-]*2019|covid[_\s-]*19|covid2019|coronavirus2019|coronavirus|corona)\b", "covid", text, flags=re.IGNORECASE)

    # Decode HTML entities
    text = html.unescape(text)

    # Normalize whitespace and repeated punctuation (optional)
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(\.\s*){2,}", ". ", text)
    text = re.sub(r"([!?]){2,}", r"\1", text)
    text = re.sub(r"(\?\s+){2,}", "?", text)
    text = re.sub(r"(\!\s+){2,}", "!", text)

    return text

# Apply to train and test
train['ProcessedTweet'] = train['OriginalTweet'].apply(clean_for_cardiffnlp)
test['ProcessedTweet'] = test['OriginalTweet'].apply(clean_for_cardiffnlp)
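
Because `clean_for_cardiffnlp` splits on a single space, a mention or URL that directly follows a line break (such as the `\r\r\n` sequences in this dataset) does not start its token and is left unchanged. A minimal sketch of a regex-based alternative is shown below; `replace_mentions_and_urls` is a hypothetical name, and the choice to treat any whitespace as a token boundary is an assumption, not part of the original cell.

```python
import re

def replace_mentions_and_urls(text):
    """Hypothetical variant: placeholders for tokens that start after any whitespace."""
    # (?<!\S) means "at the start of the text or right after a whitespace character",
    # so tokens following \r or \n are treated like tokens following a space.
    text = re.sub(r"(?<!\S)@\S+", "@user", text)    # "@" plus at least one character
    text = re.sub(r"(?<!\S)http\S*", "http", text)  # any token starting with "http"
    return text

sample = "loans.\r\r\nhttps://t.co/c3AWzpDR5v... via @GovMikeDeWine"
print(repr(replace_mentions_and_urls(sample)))
# 'loans.\r\r\nhttp via @user'
```

Like the token loop, this replaces the whole token including trailing punctuation, and it leaves strings such as e-mail addresses alone because the `@` there does not start a token.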

#### Cleaning data:

In [None]:
def normalize_text(text):
    if pd.isnull(text):
        return ""

    # Fix common encoding issues
    text = text.encode('latin1', 'ignore').decode('utf-8', 'ignore')

    # Step 1: Replace user mentions and URLs
    # Replace mentions with the placeholder "user" (\w+ covers letters, digits, and underscores)
    text = re.sub(r"@\w+", "user", text)
    # Replace URLs with placeholder
    text = re.sub(r"http\S+|www\S+|https\S+", "http", text)

    # Step 2: Replace HTML entities
    text = text.replace("&lt;", " ")
    text = text.replace("&gt;", " ")
    text = text.replace("&quot;", '"')
    text = text.replace("&nbsp;", " ")

    # Replace "&amp;" and any remaining bare "&" with "and"
    text = re.sub(r"&amp;|&", " and ", text)

    # Step 3: Extract and normalize COVID-related hashtags
    hashtags = re.findall(r"#(\w+)", text)
    for tag in hashtags:
        tag_lower = tag.lower()
        if 'covid' in tag_lower or 'corona' in tag_lower:
            text = text.replace(f"#{tag}", "#covid")

    # FIX common typos in COVID-related hashtags before splitting
    covid_variants = [
        r"#corvid[\w_]*", r"#convid[\w_]*", r"#covd[\w_]*",
        r"#covid[\w_]*", r"#corona[\w_]*", r"#coronavirus[\w_]*"
    ]
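    for pattern in covid_variants:
        # Case-insensitive so that capitalized typos (e.g. "#Corvid19") are caught too
        text = re.sub(pattern, "#covid", text, flags=re.IGNORECASE)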

    # Step 4: Split CamelCase and attached hashtags
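    def split_hashtag_words(m):
        # Pieces: all-caps runs, Capitalized words, digit runs, lowercase runs.
        # (?![a-z]) keeps all-caps runs that are followed by digits (e.g. "NY14" -> "NY", "14").
        words = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|\d+|[a-z]+", m.group(1))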
        if words:
            return "#" + " #".join(words)
        else:
            return "#" + m.group(1)  # Fallback

    text = re.sub(r"#([A-Za-z0-9]{3,})", split_hashtag_words, text)

    # Step 5: Replace underscores and dashes with spaces
    text = re.sub(r"[_\-]", " ", text)

    # Step 6: Remove emojis
    text = emoji.replace_emoji(text, replace='')
    #text = re.sub(r'\s*\?\s*', ' ', text)  # Clean up emoji remnants

    # Step 7: Lowercase everything
    text = text.lower()

    # Replace currency symbols with word
    text = re.sub(r"[$£€]", " money ", text)

    # Remove weird non-ASCII characters (like Â)
    text = re.sub(r"[^\x00-\x7F]+", "", text)

    # Step 8: Normalize whitespace and punctuation spacing
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"(\s*\.\s*){2,}", ". ", text)
    #text = re.sub(r"\s*([.,!?])\s*", r"\1 ", text)
    text = text.replace("...", ".")
    text = text.replace("–", "-").replace("—", "-")

    # Step 9: Normalize repeated punctuation
    text = re.sub(r"([!?]){2,}", r"\1", text)
    text = re.sub(r"(\?\s+){2,}", "?", text)
    text = re.sub(r"(\!\s+){2,}", "!", text)

    # Step 10: Remove consecutive duplicate words
    text = re.sub(r'\b(\w+)( \1\b)+', r'\1', text)

    # Step 11: Remove empty hashtags (just "#")
    text = re.sub(r"#(?!\w)", "", text)

    # Step 12: Remove consecutive duplicate hashtags
    # (\b keeps "#oil #oilprice" from being merged into "#oilprice")
    text = re.sub(r'(#\w+)(\s+\1\b)+', r'\1', text)

    return text


In [None]:
# Apply to train and test
train['ProcessedTweet'] = train['OriginalTweet'].apply(normalize_text)
test['ProcessedTweet'] = test['OriginalTweet'].apply(normalize_text)
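
The first step of `normalize_text` re-encodes the string as `latin1` and decodes it as `utf-8`. The sketch below isolates that round trip to show what it repairs and what it silently drops; `fix_mojibake` is a hypothetical name used only for this illustration.

```python
def fix_mojibake(text):
    """Re-interpret text that was read as latin1 but was originally UTF-8."""
    # Characters outside latin1 are dropped on encode; byte sequences that are
    # not valid UTF-8 are dropped on decode (both because of 'ignore').
    return text.encode("latin1", "ignore").decode("utf-8", "ignore")

print(fix_mojibake("a Â£ donation"))     # bytes C2 A3 are the UTF-8 encoding of the pound sign
print(fix_mojibake("café"))              # a lone E9 byte is not valid UTF-8, so it is dropped
print(fix_mojibake("plain ascii text"))  # ASCII is unaffected
```

The repaired pound sign is what later lets the currency step replace it with "money". The round trip is lossy for text that really was latin1, so it is only appropriate when the mojibake assumption holds for the data.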

In [6]:
pd.set_option('display.max_colwidth', None)  # so full text is shown

In [None]:
train_sample = train[['OriginalTweet', 'ProcessedTweet','Sentiment']].sample(10, random_state=1)
display(train_sample)
test_sample = test[['OriginalTweet', 'ProcessedTweet','Sentiment']].sample(10, random_state=1)
display(test_sample)

Unnamed: 0,OriginalTweet,ProcessedTweet,Sentiment
32828,"My mum used to say ""come with us to the supermarket"" and i would say nopee, I am regretting tht shit ):\r\r\n#Covid_19","My mum used to say ""come with us to the supermarket"" and i would say nopee, I am regretting tht shit ): #covid",Extremely Negative
766,The breakfast program at my kids school is suspended from tomorrow cos the volunteers can t source any bread For some kids this is a real life saver the only food they might get until lunchtime,The breakfast program at my kids school is suspended from tomorrow cos the volunteers can t source any bread For some kids this is a real life saver the only food they might get until lunchtime,Negative
35742,Russia to develop additional business support program amid COVID-19 pandemic to preserve employment &amp; salaries at maximum rate possible. Program includes tax payment deferment as well as repayment extensions on consumer &amp; mortgage loans.\r\r\nhttps://t.co/c3AWzpDR5v...,Russia to develop additional business support program amid covid pandemic to preserve employment & salaries at maximum rate possible. Program includes tax payment deferment as well as repayment extensions on consumer & mortgage loans. https://t.co/c3AWzpDR5v.,Positive
39955,"Top oil-producing countries agreed on ""historic"" output cuts to prop up prices hammered by #coronavirus crisis \r\r\n#OPEC #OilPriceWar #CoronavirusOutbreak \r\r\n\r\r\nhttps://t.co/XAudAqKBr9","Top oil-producing countries agreed on ""historic"" output cuts to prop up prices hammered by #covid crisis #OPEC #OilPriceWar #CoronavirusOutbreak https://t.co/XAudAqKBr9",Negative
36059,"Busy #legostreet at first glance, but actually it's kinda' quiet due to #covid_19 measures. The iShop #supermarket is open, but the #coffeeshop is closed!? Don't know if the owner will survive the #coronacrisis businesswise.. #legocityscene #legomodulars #legocreatorexpert #Â https://t.co/dKMAy0VWLS","Busy #legostreet at first glance, but actually it's kinda' quiet due to #covid measures. The iShop #supermarket is open, but the #coffeeshop is closed? Don't know if the owner will survive the #coronacrisis businesswise. #legocityscene #legomodulars #legocreatorexpert #Â http",Neutral
21681,Delivery drivers face pandemic with no resources #usbiz #cdnbiz #coronavirus https://t.co/uduVwjRmSJ,Delivery drivers face pandemic with no resources #usbiz #cdnbiz #covid http,Negative
15770,@GovMikeDeWine #coronavirus #Construction workers.. this is what we deal with! Per osha ONE PORTALET PER FORTY WORKERS. rarely do we have access to running water and almost never have hand sanitizer how are we to #FlattenTheCuve !? All non essential const,@user #covid #Construction workers. this is what we deal with! Per osha ONE PORTALET PER FORTY WORKERS. rarely do we have access to running water and almost never have hand sanitizer how are we to #FlattenTheCuve ? All non essential const,Negative
40509,How Luxury Consumer Preferences in China Are Shifting After COVID-19 | Jing Daily https://t.co/IN7NWEvt2V,How Luxury Consumer Preferences in China Are Shifting After covid | Jing Daily http,Neutral
38299,Thank you #Indianapolis for shopping ? ? #SmallBiz \r\r\n.\r\r\nhttps://t.co/0BKtOdif4A\r\r\n.\r\r\n#CustomOrders #Masks #Kenya #Naptown #LinkinBio #Nairobi\r\r\n.\r\r\nAfrican print. Spider-Man. Wakanda Manenoz. Sewing. \r\r\n.\r\r\nMade in @NoblesvilleIN @KetepaLtd @NoblesvilleCOC #Makers #TembeaKenya https://t.co/amR4Ss8PS2,Thank you #Indianapolis for shopping ?#SmallBiz . https://t.co/0BKtOdif4A . #CustomOrders #Masks #Kenya #Naptown #LinkinBio #Nairobi . African print. Spider-Man. Wakanda Manenoz. Sewing. Made in @user @user @user #Makers #TembeaKenya http,Positive
11188,@AOC i live in NY14 i went a supermarket on ditmars blvd this am\r\r\n\r\r\nI chatted w the cashier for 1 min\r\r\n\r\r\nShe said in nyc the poor are stealing soap and toilet paper \r\r\n\r\r\n#coronavirus #COVID19,@user i live in NY14 i went a supermarket on ditmars blvd this am I chatted w the cashier for 1 min She said in nyc the poor are stealing soap and toilet paper #covid #covid,Extremely Negative


Unnamed: 0,OriginalTweet,ProcessedTweet,Sentiment
427,ThereÂs def gatherings of more than 100 people in every supermarket in the country rn?? close all schools and colleges but letÂs all go shop at the same time and stand right beside each other in queues ????? #Covid_19,ThereÂs def gatherings of more than 100 people in every supermarket in the country rn? close all schools and colleges but letÂs all go shop at the same time and stand right beside each other in queues ? #covid,Neutral
3526,"Since its outbreak, COVID-19 has disrupted commerce and trade, as well as giving rise to commercial and consumer disputes. Here is what companies can do to prepare for potential litigations claims. https://t.co/tGMLesGTMn https://t.co/hXKwD87Ac6","Since its outbreak, covid has disrupted commerce and trade, as well as giving rise to commercial and consumer disputes. Here is what companies can do to prepare for potential litigations claims. http http",Positive
725,"ÂMake sure you stock up on non-perishables!Â \r\r\nÂ\r\r\n&lt;Leaves grocery store with milk, cheese, yogurt, fruit, and vegetables&gt; #Covid_19","ÂMake sure you stock up on non-perishables!Â Â <Leaves grocery store with milk, cheese, yogurt, fruit, and vegetables> #covid",Positive
2346,"You know who are also heroes? Those working the checkout counters and stocking shelves at supermarkets and pharmacies. Their work, at some risk to their own health, is vital to the health and safety of our country.","You know who are also heroes? Those working the checkout counters and stocking shelves at supermarkets and pharmacies. Their work, at some risk to their own health, is vital to the health and safety of our country.",Extremely Positive
664,".@LME_news #metals , #CrudeOil prices down 5% as #USA puts #europetravelban . @WHO labelled #Covid_19 a pandemic. Fiscal stimulus by several central banks has failed to soothe fears . #global #manufacturing , #industrial , #aviation hit. #commodity",".@LME_news #metals , #CrudeOil prices down 5% as #USA puts #europetravelban . @user labelled #covid a pandemic. Fiscal stimulus by several central banks has failed to soothe fears . #global #manufacturing , #industrial , #aviation hit. #commodity",Negative
108,"Reminder: a State of Emergency is in effect statewide due to #Coronavirus.\r\r\n\r\r\nCalifornians are protected from illegal price gouging on housing, gas, food, and other essential supplies. https://t.co/nw5FHsodTj https://t.co/r8wH8MSped","Reminder: a State of Emergency is in effect statewide due to #covid. Californians are protected from illegal price gouging on housing, gas, food, and other essential supplies. http http",Negative
857,NEW VLOG!\r\r\n\r\r\nMany people who feared for a possible lockdown in Metro Manila are reportedly panic-buying basic food and household items. \r\r\n\r\r\nBut will it help or will only aggravate the COVID-19 situation we are facing now?\r\r\n\r\r\nKnow what Kuya Daniel says about it:\r\r\nhttps://t.co/4Fd0ZiN8Vz,NEW VLOG! Many people who feared for a possible lockdown in Metro Manila are reportedly panic-buying basic food and household items. But will it help or will only aggravate the covid situation we are facing now? Know what Kuya Daniel says about it: https://t.co/4Fd0ZiN8Vz,Extremely Negative
2384,ÂÂThe toilet paper virusÂ.##Coronavirus preparation: What to stock-up on https://t.co/mpYL7CH1aI #FoxNews,ÂÂThe toilet paper virusÂ.##covid preparation: What to stock-up on http #FoxNews,Neutral
3477,"3. Open Q after COVID-19 of what happens to blockbusters (e.g, MCU, Mulan) in terms of audience demand for theatergoing experience vs. preference to consume at home. Closing of theatrical windows solves for consumer fears &amp; allows studios to address those fears.","3. Open Q after covid of what happens to blockbusters (e.g, MCU, Mulan) in terms of audience demand for theatergoing experience vs. preference to consume at home. Closing of theatrical windows solves for consumer fears & allows studios to address those fears.",Extremely Negative
481,Donated to @McrFoodbank as donations have slowed down due to #Covid_19. If you can make a Â£ donation to help people whoÂll be especially vulnerable and unable to stock up their cupboards in the next few weeks please donate to your local food bank. It will be a lifeline for many. https://t.co/JVKYv2CWxT,Donated to @user as donations have slowed down due to #covid. If you can make a Â£ donation to help people whoÂll be especially vulnerable and unable to stock up their cupboards in the next few weeks please donate to your local food bank. It will be a lifeline for many. http,Positive


#### Check for gibberish hashtags:

In [None]:
combined_df = pd.concat([train, test], ignore_index=True)

# Download English words from NLTK
nltk.download('words')
from nltk.corpus import words
english_vocab = set(w.lower() for w in words.words())

# Function to extract hashtags
def extract_hashtags(text):
    if pd.isnull(text):
        return []
    return re.findall(r"#(\w+)", text)

# Collect all hashtags
combined_df['hashtags'] = combined_df['ProcessedTweet'].astype(str).apply(extract_hashtags)
all_hashtags = [tag.lower() for tags in combined_df['hashtags'] for tag in tags]

# Count frequency of each hashtag
hashtag_counts = Counter(all_hashtags)

# Filter rare hashtags (appearing fewer than 5 times)
rare_hashtags = [tag for tag, count in hashtag_counts.items() if count < 5]

# Function to detect gibberish
def is_gibberish(hashtag):
    tag = hashtag.lower()
    if re.search(r'(.)\1{2,}', tag):  # repeated characters
        return True
    if re.search(r'[^a-z0-9]', tag):  # non-alphanumeric
        return True
    if not any(word in tag for word in english_vocab):  # not containing known English word
        return True
    return False

# Extract gibberish tags and their counts
gibberish_tags = [(tag, hashtag_counts[tag]) for tag in rare_hashtags if is_gibberish(tag)]

# Create DataFrame for display
df_gibberish = pd.DataFrame(gibberish_tags, columns=["Hashtag", "Count"])
df_gibberish

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Unnamed: 0,Hashtag,Count
0,320,1
1,965,1
2,316,1
3,01,2
4,850,1
...,...,...
134,901,1
135,100029,1
136,07,1
137,mmmbread,1
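
The substring test in `is_gibberish` accepts a tag as soon as any vocabulary entry occurs inside it. If the word list contains very short entries (which appears to be the case for the NLTK `words` corpus, though that is an assumption here), most alphabetic tags pass and mainly numeric tags get flagged, which would be consistent with the table above. A hedged sketch of a stricter variant follows; `is_gibberish_strict`, `min_word_len`, and the toy vocabulary are all hypothetical.

```python
import re

def is_gibberish_strict(hashtag, vocab, min_word_len=3):
    """Hypothetical stricter check: only vocabulary words with at least
    `min_word_len` characters count as evidence of a real word."""
    tag = hashtag.lower()
    if re.search(r"(.)\1{2,}", tag):   # same character three or more times in a row
        return True
    if re.search(r"[^a-z0-9]", tag):   # anything other than a-z or 0-9
        return True
    long_words = (w for w in vocab if len(w) >= min_word_len)
    return not any(w in tag for w in long_words)

toy_vocab = {"a", "i", "bread", "home", "stay"}  # stand-in for the real word list
for tag in ["stayhome", "xqzj", "mmmbread", "2020"]:
    print(tag, is_gibberish_strict(tag, toy_vocab))
```

With the toy vocabulary only `stayhome` is kept; `xqzj` is flagged because no word of three or more letters occurs in it, even though the vocabulary contains single letters.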


In [None]:
def remove_gibberish_hashtags(text, gibberish_set):
    if pd.isnull(text):
        return ""

    # Only remove actual hashtags (starting with #)
    for hashtag in gibberish_set:
        # Remove the # if it's there, then add it back for matching
        clean_hashtag = hashtag.lstrip('#')
        # Only match when it's actually a hashtag (preceded by #)
        text = re.sub(r'#' + re.escape(clean_hashtag) + r'\b', '', text, flags=re.IGNORECASE)

    # Clean up extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Create a set of gibberish hashtags for faster lookup
gibberish_hashtags = set(df_gibberish['Hashtag'].str.lower())

# Apply
train['ProcessedTweet'] = train['ProcessedTweet'].apply(lambda x: remove_gibberish_hashtags(x, gibberish_hashtags))
test['ProcessedTweet'] = test['ProcessedTweet'].apply(lambda x: remove_gibberish_hashtags(x, gibberish_hashtags))
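
`remove_gibberish_hashtags` runs one `re.sub` per gibberish tag for every tweet. A possible alternative, sketched below under the assumption that inputs are plain strings, compiles a single alternation pattern once and reuses it; `build_hashtag_remover` is a hypothetical helper, not part of the notebook.

```python
import re

def build_hashtag_remover(gibberish_set):
    """Hypothetical helper: compile one pattern matching any listed hashtag."""
    tags = sorted(re.escape(t.lstrip("#")) for t in gibberish_set)
    # "#(?:tag1|tag2|...)\b": the trailing \b prevents "#01" from matching inside "#012".
    pattern = (
        re.compile(r"#(?:" + "|".join(tags) + r")\b", flags=re.IGNORECASE)
        if tags else None
    )

    def remove(text):
        if pattern is not None:
            text = pattern.sub("", text)
        # Collapse the gaps left behind by removed hashtags.
        return re.sub(r"\s+", " ", text).strip()

    return remove

remove = build_hashtag_remover({"01", "mmmbread"})
print(remove("fresh #mmmbread today #01 #012"))
# fresh today #012
```

The returned function could be passed directly to `Series.apply`; null handling would still need to be added to match the original function.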

In [None]:
train_sample = train[['OriginalTweet', 'ProcessedTweet','Sentiment']].sample(10, random_state=1)
display(train_sample)
test_sample = test[['OriginalTweet', 'ProcessedTweet','Sentiment']].sample(10, random_state=1)
display(test_sample)

Unnamed: 0,OriginalTweet,ProcessedTweet,Sentiment
32828,"My mum used to say ""come with us to the supermarket"" and i would say nopee, I am regretting tht shit ):\r\r\n#Covid_19","my mum used to say ""come with us to the supermarket"" and i would say nopee, i am regretting tht shit ): #covid",Extremely Negative
766,The breakfast program at my kids school is suspended from tomorrow cos the volunteers can t source any bread For some kids this is a real life saver the only food they might get until lunchtime,the breakfast program at my kids school is suspended from tomorrow cos the volunteers can t source any bread for some kids this is a real life saver the only food they might get until lunchtime,Negative
35742,Russia to develop additional business support program amid COVID-19 pandemic to preserve employment &amp; salaries at maximum rate possible. Program includes tax payment deferment as well as repayment extensions on consumer &amp; mortgage loans.\r\r\nhttps://t.co/c3AWzpDR5v...,russia to develop additional business support program amid covid 19 pandemic to preserve employment and salaries at maximum rate possible. program includes tax payment deferment as well as repayment extensions on consumer and mortgage loans. http,Positive
39955,"Top oil-producing countries agreed on ""historic"" output cuts to prop up prices hammered by #coronavirus crisis \r\r\n#OPEC #OilPriceWar #CoronavirusOutbreak \r\r\n\r\r\nhttps://t.co/XAudAqKBr9","top oil producing countries agreed on ""historic"" output cuts to prop up prices hammered by #covid crisis #opec #oil #price #war #covid http",Negative
36059,"Busy #legostreet at first glance, but actually it's kinda' quiet due to #covid_19 measures. The iShop #supermarket is open, but the #coffeeshop is closed!? Don't know if the owner will survive the #coronacrisis businesswise.. #legocityscene #legomodulars #legocreatorexpert #Â https://t.co/dKMAy0VWLS","busy #legostreet at first glance, but actually it's kinda' quiet due to #covid measures. the ishop #supermarket is open, but the #coffeeshop is closed? don't know if the owner will survive the #covid businesswise. #legocityscene #legomodulars #legocreatorexpert http",Neutral
21681,Delivery drivers face pandemic with no resources #usbiz #cdnbiz #coronavirus https://t.co/uduVwjRmSJ,delivery drivers face pandemic with no resources #usbiz #cdnbiz #covid http,Negative
15770,@GovMikeDeWine #coronavirus #Construction workers.. this is what we deal with! Per osha ONE PORTALET PER FORTY WORKERS. rarely do we have access to running water and almost never have hand sanitizer how are we to #FlattenTheCuve !? All non essential const,user #covid #construction workers. this is what we deal with! per osha one portalet per forty workers. rarely do we have access to running water and almost never have hand sanitizer how are we to #flatten #the #cuve ? all non essential const,Negative
40509,How Luxury Consumer Preferences in China Are Shifting After COVID-19 | Jing Daily https://t.co/IN7NWEvt2V,how luxury consumer preferences in china are shifting after covid 19 | jing daily http,Neutral
38299,Thank you #Indianapolis for shopping ? ? #SmallBiz \r\r\n.\r\r\nhttps://t.co/0BKtOdif4A\r\r\n.\r\r\n#CustomOrders #Masks #Kenya #Naptown #LinkinBio #Nairobi\r\r\n.\r\r\nAfrican print. Spider-Man. Wakanda Manenoz. Sewing. \r\r\n.\r\r\nMade in @NoblesvilleIN @KetepaLtd @NoblesvilleCOC #Makers #TembeaKenya https://t.co/amR4Ss8PS2,thank you #indianapolis for shopping ?#small #biz . http . #custom #orders #masks #kenya #naptown #linkin #bio #nairobi . african print. spider man. wakanda manenoz. sewing. made in user #makers #tembea #kenya http,Positive
11188,@AOC i live in NY14 i went a supermarket on ditmars blvd this am\r\r\n\r\r\nI chatted w the cashier for 1 min\r\r\n\r\r\nShe said in nyc the poor are stealing soap and toilet paper \r\r\n\r\r\n#coronavirus #COVID19,user i live in ny14 i went a supermarket on ditmars blvd this am i chatted w the cashier for 1 min she said in nyc the poor are stealing soap and toilet paper #covid,Extremely Negative


Unnamed: 0,OriginalTweet,ProcessedTweet,Sentiment
427,ThereÂs def gatherings of more than 100 people in every supermarket in the country rn?? close all schools and colleges but letÂs all go shop at the same time and stand right beside each other in queues ????? #Covid_19,theres def gatherings of more than 100 people in every supermarket in the country rn? close all schools and colleges but lets all go shop at the same time and stand right beside each other in queues ? #covid,Neutral
3526,"Since its outbreak, COVID-19 has disrupted commerce and trade, as well as giving rise to commercial and consumer disputes. Here is what companies can do to prepare for potential litigations claims. https://t.co/tGMLesGTMn https://t.co/hXKwD87Ac6","since its outbreak, covid 19 has disrupted commerce and trade, as well as giving rise to commercial and consumer disputes. here is what companies can do to prepare for potential litigations claims. http",Positive
725,"ÂMake sure you stock up on non-perishables!Â \r\r\nÂ\r\r\n&lt;Leaves grocery store with milk, cheese, yogurt, fruit, and vegetables&gt; #Covid_19","make sure you stock up on non perishables! leaves grocery store with milk, cheese, yogurt, fruit, and vegetables #covid",Positive
2346,"You know who are also heroes? Those working the checkout counters and stocking shelves at supermarkets and pharmacies. Their work, at some risk to their own health, is vital to the health and safety of our country.","you know who are also heroes? those working the checkout counters and stocking shelves at supermarkets and pharmacies. their work, at some risk to their own health, is vital to the health and safety of our country.",Extremely Positive
664,".@LME_news #metals , #CrudeOil prices down 5% as #USA puts #europetravelban . @WHO labelled #Covid_19 a pandemic. Fiscal stimulus by several central banks has failed to soothe fears . #global #manufacturing , #industrial , #aviation hit. #commodity",".user #metals , #crude #oil prices down 5% as #usa puts #europetravelban . user labelled #covid a pandemic. fiscal stimulus by several central banks has failed to soothe fears . #global #manufacturing , #industrial , #aviation hit. #commodity",Negative
108,"Reminder: a State of Emergency is in effect statewide due to #Coronavirus.\r\r\n\r\r\nCalifornians are protected from illegal price gouging on housing, gas, food, and other essential supplies. https://t.co/nw5FHsodTj https://t.co/r8wH8MSped","reminder: a state of emergency is in effect statewide due to #covid. californians are protected from illegal price gouging on housing, gas, food, and other essential supplies. http",Negative
857,NEW VLOG!\r\r\n\r\r\nMany people who feared for a possible lockdown in Metro Manila are reportedly panic-buying basic food and household items. \r\r\n\r\r\nBut will it help or will only aggravate the COVID-19 situation we are facing now?\r\r\n\r\r\nKnow what Kuya Daniel says about it:\r\r\nhttps://t.co/4Fd0ZiN8Vz,new vlog! many people who feared for a possible lockdown in metro manila are reportedly panic buying basic food and household items. but will it help or will only aggravate the covid 19 situation we are facing now? know what kuya daniel says about it: http,Extremely Negative
2384,ÂÂThe toilet paper virusÂ.##Coronavirus preparation: What to stock-up on https://t.co/mpYL7CH1aI #FoxNews,the toilet paper virus.#covid preparation: what to stock up on http #fox #news,Neutral
3477,"3. Open Q after COVID-19 of what happens to blockbusters (e.g, MCU, Mulan) in terms of audience demand for theatergoing experience vs. preference to consume at home. Closing of theatrical windows solves for consumer fears &amp; allows studios to address those fears.","3. open q after covid 19 of what happens to blockbusters (e.g, mcu, mulan) in terms of audience demand for theatergoing experience vs. preference to consume at home. closing of theatrical windows solves for consumer fears and allows studios to address those fears.",Extremely Negative
481,Donated to @McrFoodbank as donations have slowed down due to #Covid_19. If you can make a Â£ donation to help people whoÂll be especially vulnerable and unable to stock up their cupboards in the next few weeks please donate to your local food bank. It will be a lifeline for many. https://t.co/JVKYv2CWxT,donated to user as donations have slowed down due to #covid. if you can make a money donation to help people wholl be especially vulnerable and unable to stock up their cupboards in the next few weeks please donate to your local food bank. it will be a lifeline for many. http,Positive


#### Make sure that there are no gibberish hashtags:

In [None]:
combined_df = pd.concat([train, test], ignore_index=True)

# Download English words from NLTK
nltk.download('words')
from nltk.corpus import words
english_vocab = set(w.lower() for w in words.words())

# Function to extract hashtags
def extract_hashtags(text):
    if pd.isnull(text):
        return []
    return re.findall(r"#(\w+)", text)

# Collect all hashtags
combined_df['hashtags'] = combined_df['ProcessedTweet'].astype(str).apply(extract_hashtags)
all_hashtags = [tag.lower() for tags in combined_df['hashtags'] for tag in tags]

# Count frequency of each hashtag
hashtag_counts = Counter(all_hashtags)

# Filter rare hashtags (appearing fewer than 5 times)
rare_hashtags = [tag for tag, count in hashtag_counts.items() if count < 5]

# Function to detect gibberish
def is_gibberish(hashtag):
    tag = hashtag.lower()
    if re.search(r'(.)\1{2,}', tag):  # repeated characters
        return True
    if re.search(r'[^a-z0-9]', tag):  # non-alphanumeric
        return True
    if not any(word in tag for word in english_vocab):  # not containing known English word
        return True
    return False

# Extract gibberish tags and their counts
gibberish_tags = [(tag, hashtag_counts[tag]) for tag in rare_hashtags if is_gibberish(tag)]

# Create DataFrame for display
df_gibberish = pd.DataFrame(gibberish_tags, columns=["Hashtag", "Count"])
display(df_gibberish.sort_values(by='Count', ascending=False))

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Unnamed: 0,Hashtag,Count
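
One possible reason for the empty table: the vocabulary test passes whenever *any* vocabulary entry is a substring of the tag, so very short entries (the NLTK word list appears to include single letters such as `a`) let almost every tag through. A minimal sketch of the same heuristic with a length floor on vocabulary words; the `vocab` argument, the `min_len` parameter, and the toy vocabulary are illustrative assumptions, not part of the original pipeline:

```python
import re

def is_gibberish(hashtag, vocab, min_len=3):
    """Heuristic gibberish check; `vocab` and `min_len` are illustrative assumptions."""
    tag = hashtag.lower()
    if re.search(r"(.)\1{2,}", tag):   # same character three or more times in a row
        return True
    if re.search(r"[^a-z0-9]", tag):   # anything outside a-z / 0-9
        return True
    # Only vocabulary words with at least `min_len` characters count as evidence
    long_words = [w for w in vocab if len(w) >= min_len]
    return not any(w in tag for w in long_words)

toy_vocab = {"a", "stay", "home", "food"}  # stand-in for the NLTK word list
for tag in ["stayhome", "xqzvbt", "looool", "qwrta"]:
    print(tag, is_gibberish(tag, toy_vocab))
```

With `min_len=1` the tag `qwrta` is accepted because it contains the letter `a`; with the default floor it is flagged.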


#### **Split into Train and Val:**

In [None]:
# Split training data into train and validation sets
# Using 80/20 split with stratification to maintain sentiment distribution
train_data, val_data = train_test_split(
    train,
    test_size=0.2,           # 20% for validation
    random_state=42,         # For reproducibility
    stratify=train['Sentiment']  # Maintain sentiment distribution
)

# Display the splits
print("Original train size:", len(train))
print("New train size:", len(train_data))
print("Validation size:", len(val_data))
print()

# Check sentiment distribution
print("Original sentiment distribution:")
print(train['Sentiment'].value_counts(normalize=True).round(3))
print()

print("Train split sentiment distribution:")
print(train_data['Sentiment'].value_counts(normalize=True).round(3))
print()

print("Validation split sentiment distribution:")
print(val_data['Sentiment'].value_counts(normalize=True).round(3))

# Reset indices
train_data = train_data.reset_index(drop=True)
val_data = val_data.reset_index(drop=True)

Original train size: 41157
New train size: 32925
Validation size: 8232

Original sentiment distribution:
Sentiment
Positive              0.278
Negative              0.241
Neutral               0.187
Extremely Positive    0.161
Extremely Negative    0.133
Name: proportion, dtype: float64

Train split sentiment distribution:
Sentiment
Positive              0.278
Negative              0.241
Neutral               0.187
Extremely Positive    0.161
Extremely Negative    0.133
Name: proportion, dtype: float64

Validation split sentiment distribution:
Sentiment
Positive              0.278
Negative              0.241
Neutral               0.187
Extremely Positive    0.161
Extremely Negative    0.133
Name: proportion, dtype: float64
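
The reported sizes follow directly from the 20% validation fraction. A small sketch of the arithmetic, assuming (as I understand `train_test_split`'s handling of float sizes) that the validation count is rounded up and the train count rounded down:

```python
import math

n_total = 41157   # rows in the original training set
test_size = 0.2   # validation fraction passed to train_test_split

n_val = math.ceil(test_size * n_total)           # 8231.4 rounds up
n_train = math.floor((1 - test_size) * n_total)  # 32925.6 rounds down

print(n_train, n_val, n_train + n_val)
```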


#### Save this version:

In [None]:
# Save the processed datasets as CSV files
train_data.to_csv('train_processed.csv', index=False, encoding='utf-8')
test.to_csv('test_processed.csv', index=False, encoding='utf-8')
val_data.to_csv('val_processed.csv', index=False, encoding='utf-8')

print(f"Train dataset saved: {len(train_data)} rows")
print(f"Test dataset saved: {len(test)} rows")
print(f"Val dataset saved: {len(val_data)} rows")

Train dataset saved: 32925 rows
Test dataset saved: 3798 rows
Val dataset saved: 8232 rows


Light cleaning save:

In [None]:
# Save the processed datasets as CSV files
train_data.to_csv('train_light.csv', index=False, encoding='utf-8')
test.to_csv('test_light.csv', index=False, encoding='utf-8')
val_data.to_csv('val_light.csv', index=False, encoding='utf-8')

print(f"Train dataset saved: {len(train_data)} rows")
print(f"Test dataset saved: {len(test)} rows")
print(f"Val dataset saved: {len(val_data)} rows")

#### Check the data:

In [None]:
sentiment_counts_train = train_data['Sentiment'].value_counts()

print("Sentiment label distribution in train:")
print(sentiment_counts_train)

sentiment_counts_test = test['Sentiment'].value_counts()

print("Sentiment label distribution in test:")
print(sentiment_counts_test)

sentiment_counts_val = val_data['Sentiment'].value_counts()

print("Sentiment label distribution in val:")
print(sentiment_counts_val)


Sentiment label distribution in train:
Sentiment
Positive              9137
Negative              7934
Neutral               6170
Extremely Positive    5299
Extremely Negative    4385
Name: count, dtype: int64
Sentiment label distribution in test:
Sentiment
Negative              1041
Positive               947
Neutral                619
Extremely Positive     599
Extremely Negative     592
Name: count, dtype: int64
Sentiment label distribution in val:
Sentiment
Positive              2285
Negative              1983
Neutral               1543
Extremely Positive    1325
Extremely Negative    1096
Name: count, dtype: int64


#### **Augmentations:**

### Data Augmentation Strategy for Sentiment Balance

### Current Class Distribution Analysis

| Sentiment | Count | Percentage | Imbalance Ratio |
|-----------|-------|------------|-----------------|
| Positive | 9,137 | 27.8% | 1.0x (baseline) |
| Negative | 7,934 | 24.1% | 0.87x |
| Neutral | 6,170 | 18.7% | 0.68x |
| Extremely Positive | 5,299 | 16.1% | 0.58x |
| Extremely Negative | 4,385 | 13.3% | 0.48x |

**Problem**: The majority class (Positive) has **2.08x as many samples** as the smallest class (Extremely Negative), which can bias the model toward the larger classes.

### Why Augmentation is Necessary

1. **Model Bias Prevention**: Without balancing, the model will be biased toward predicting "Positive" and "Negative" sentiments, potentially misclassifying extreme sentiments as moderate ones.

2. **Improved Extreme Sentiment Detection**: The "Extremely Positive" and "Extremely Negative" classes are most underrepresented, but these are often the most valuable for sentiment analysis applications.

3. **Better Generalization**: Balanced training data helps the model learn more robust decision boundaries between sentiment classes.

4. **Reduced Overfitting**: More diverse examples for minority classes prevent the model from memorizing limited patterns.

### Recommended Augmentation Strategy

**Target Distribution**: Upsampling minority classes only (keeping all original data + class weights)

| Sentiment | Current | Target | Augmentation Needed | Augmentation Factor |
|-----------|---------|--------|-------------------|-------------------|
| Positive | 9,137 | 9,137 | 0 (keep all) | 1.0x |
| Negative | 7,934 | 8,000-8,500 | +66-566 | 1.01-1.07x |
| Neutral | 6,170 | 7,500-8,000 | +1,330-1,830 | 1.2-1.3x |
| Extremely Positive | 5,299 | 7,000-7,500 | +1,700-2,200 | 1.3-1.4x |
| Extremely Negative | 4,385 | 6,500-7,000 | +2,100-2,600 | 1.5-1.6x |
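
The "Augmentation Needed" and "Augmentation Factor" columns are simple functions of the current and target counts. A sketch that recomputes them for the lower end of each target range (the choice of the lower bound is an assumption made for illustration):

```python
current = {
    "Positive": 9137, "Negative": 7934, "Neutral": 6170,
    "Extremely Positive": 5299, "Extremely Negative": 4385,
}
# Lower bound of each target range in the table (illustrative choice)
targets = {
    "Negative": 8000, "Neutral": 7500,
    "Extremely Positive": 7000, "Extremely Negative": 6500,
}

plan = {}
for label, target in targets.items():
    needed = max(0, target - current[label])   # samples to generate
    factor = target / current[label]           # target size relative to current
    plan[label] = (needed, factor)
    print(f"{label:<20} +{needed:>5,}  x{factor:.2f}")
```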

### Augmentation Techniques for Each Class

#### 1. **Extremely Negative** (+2,100-2,600 samples, ~48-60% increase)
- **Back-translation** (English → Spanish → English): Preserves strong negative sentiment
- **Synonym replacement**: Replace negative words with stronger negatives
- **Paraphrasing**: Rephrase complaints, fears, and frustrations
- **Contextual word insertion**: Add intensifiers ("extremely", "absolutely", "completely")

#### 2. **Extremely Positive** (+1,700-2,200 samples, ~32-42% increase)
- **Back-translation**: Maintains enthusiastic tone
- **Synonym replacement**: Replace positive words with more intense variants
- **Emoji-to-text conversion**: Convert positive emojis to emotional words
- **Exclamation enhancement**: Add emphasis to existing positive statements

#### 3. **Neutral** (+1,330-1,830 samples, ~22-30% increase)
- **Sentence reordering**: Maintain factual tone
- **Back-translation**: Preserves neutral, informational content
- **Paraphrasing**: Rephrase news-like or factual statements
- **Random word substitution**: Small changes that maintain neutrality

#### 4. **Negative** (+66-566 samples, ~1-7% increase)
- **Minimal augmentation needed**: Just light back-translation or paraphrasing

#### 5. **Positive** (No changes needed)
- **Keep all original data**: More data = better model performance
- **Handle imbalance with class weights**: Let the loss function manage the imbalance
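
The notebook does not say which weighting scheme it has in mind. One common option is inverse-frequency ("balanced") weights, `n_total / (n_classes * n_class)`, sketched here on the training-split counts as an assumption rather than the method actually used:

```python
counts = {
    "Positive": 9137, "Negative": 7934, "Neutral": 6170,
    "Extremely Positive": 5299, "Extremely Negative": 4385,
}

n_total = sum(counts.values())   # 32,925 training rows
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get a larger weight in the loss
class_weights = {
    label: n_total / (n_classes * n) for label, n in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label:<20} {weight:.3f}")
```

Weights of this form average out so that each class contributes equally to the total loss; they could be passed to a weighted cross-entropy loss in whatever framework the training code uses.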


In [7]:
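import gc
import random

import nlpaug.augmenter.word as naw
import pandas as pd
import torch
from tqdm import tqdm

def check_gpu_setup():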
    """Check GPU availability and memory"""
    if torch.cuda.is_available():
        gpu_count = torch.cuda.device_count()
        gpu_name = torch.cuda.get_device_name(0)
        gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

        print(f"GPU Available: {gpu_name}")
        print(f"GPU Memory: {gpu_memory:.1f} GB")
        print(f"GPU Count: {gpu_count}")
        return True
    else:
        print("No GPU available, falling back to CPU")
        return False

def clear_gpu_memory():
    """Clear GPU memory cache"""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        gc.collect()

def gpu_augment_sentiment_class(train_data, target_sentiment, target_count, batch_size=200):
    """
    GPU-optimized augmentation with larger batches and memory management
    """

    sentiment_df = train_data[train_data['Sentiment'] == target_sentiment].copy()
    current_count = len(sentiment_df)
    needed_augmentations = target_count - current_count

    print(f"\n=== GPU AUGMENTING {target_sentiment.upper()} ===")
    print(f"Current: {current_count}, Target: {target_count}, Need: {needed_augmentations}")

    if needed_augmentations <= 0:
        return pd.DataFrame()

    # Clear GPU memory before starting
    clear_gpu_memory()

    # Protected words (from your frequency analysis)
    cross_class_protected = [
        'store', 'prices', 'food', 'supermarket', 'grocery', 'people', 'amp',
        'consumer', 'out', 'about', 'how', 'now', 'during', 'get', 'online',
        'shopping', 'hand', 'need', 'like'
    ]

    sentiment_specific = {
        'Extremely Negative': ['panic', 'crisis', 'buying', 'no', 'who', 'just', 'there'],
        'Extremely Positive': ['help', 'sanitizer', 'please', 'workers', 'us', 'who'],
        'Negative': ['demand', 'panic'],
        'Positive': [],
        'Neutral': ['coronavirus', 'pandemic', 'stock', 'go', 'what', 'just']
    }

    base_protected = ['covid', 'coronavirus', 'pandemic', 'virus', 'user', 'http']
    protected_words = base_protected + cross_class_protected + sentiment_specific.get(target_sentiment, [])

    print(f"Protected words: {len(protected_words)}")

    # Initialize GPU-optimized augmenters
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Using device: {device}")

    try:
        # BERT contextual augmentation on GPU
        aug_context = naw.ContextualWordEmbsAug(
            model_path='bert-base-uncased',
            action="substitute",
            aug_p=0.12,  # Slightly higher since GPU can handle it
            device=device,  # This is key!
            stopwords=protected_words
        )

        # Synonym augmentation (CPU is fine for this)
        aug_synonym = naw.SynonymAug(
            aug_src='wordnet',
            aug_p=0.08,
            stopwords=protected_words
        )

        augmenters = [
            ('context', aug_context, 0.6),
            ('synonym', aug_synonym, 0.4)
        ]

        print("GPU augmenters initialized successfully")

    except Exception as e:
        print(f"GPU augmentation failed: {e}")
        return pd.DataFrame()

    augmented_data = []

    # Process in larger batches since GPU is faster
    num_batches = (needed_augmentations + batch_size - 1) // batch_size

    print(f"Processing {num_batches} batches of ~{batch_size} samples each")

    for batch_idx in tqdm(range(num_batches), desc="Augmenting batches"):
        batch_start = batch_idx * batch_size
        batch_end = min((batch_idx + 1) * batch_size, needed_augmentations)
        batch_size_actual = batch_end - batch_start

        # Sample tweets for this batch
        batch_tweets = sentiment_df.sample(
            n=batch_size_actual,
            replace=True,
            random_state=42 + batch_idx
        )

        batch_augmented = []

        for _, row in batch_tweets.iterrows():
            original_tweet = row['ProcessedTweet']

            # Skip very short tweets
            if len(original_tweet.split()) < 4:
                continue

            try:
                # Choose augmentation method based on weights
                method_name, augmenter, weight = random.choices(
                    augmenters,
                    weights=[w for _, _, w in augmenters]
                )[0]

                # Perform augmentation
                augmented_tweet = augmenter.augment(original_tweet)[0]

                # Quality check
                if (augmented_tweet != original_tweet and
                    len(augmented_tweet.strip()) > 10 and
                    len(augmented_tweet.split()) >= 3):

                    new_row = row.copy()
                    new_row['ProcessedTweet'] = augmented_tweet
                    new_row['OriginalTweet'] = f"[GPU_{method_name.upper()}] {row['OriginalTweet'][:80]}..."
                    batch_augmented.append(new_row)

            except Exception as e:
                continue

        augmented_data.extend(batch_augmented)

        # Clear GPU cache every few batches
        if batch_idx % 3 == 0:
            clear_gpu_memory()

        # Stop if we have enough
        if len(augmented_data) >= needed_augmentations:
            break

    # Final cleanup
    clear_gpu_memory()

    # Trim to exact number needed
    augmented_data = augmented_data[:needed_augmentations]

    print(f"Generated {len(augmented_data)} augmented samples using GPU")

    return pd.DataFrame(augmented_data).reset_index(drop=True) if augmented_data else pd.DataFrame()


def gpu_augment_multiple_classes(train_data, augmentation_targets):
    """
    GPU-optimized multi-class augmentation
    """

    # Check GPU setup
    gpu_available = check_gpu_setup()
    if not gpu_available:
        print("No GPU found - skipping augmentation; consider using a lightweight CPU version")
        return pd.DataFrame()

    print("\n=== GPU-ACCELERATED AUGMENTATION ===")
    print(f"Original training size: {len(train_data)}")
    print("\nCurrent distribution:")
    print(train_data['Sentiment'].value_counts())

    all_augmented = []

    for sentiment, target_count in augmentation_targets.items():
        print(f"\n{'='*50}")

        # Clear memory before each class
        clear_gpu_memory()

        try:
            augmented_class = gpu_augment_sentiment_class(
                train_data, sentiment, target_count, batch_size=150
            )

            if not augmented_class.empty:
                all_augmented.append(augmented_class)
                print(f" {sentiment}: {len(augmented_class)} samples")
            else:
                print(f"{sentiment}: No samples generated")

        except Exception as e:
            print(f" {sentiment} failed: {e}")
            # Try to continue with other classes
            continue

    # Final cleanup
    clear_gpu_memory()

    if all_augmented:
        final_augmented = pd.concat(all_augmented, ignore_index=True)
        print(f"\nTOTAL AUGMENTED: {len(final_augmented)} samples")
        print("\nAugmented distribution:")
        print(final_augmented['Sentiment'].value_counts())
        return final_augmented
    else:
        print(" No augmented samples generated")
        return pd.DataFrame()


# GPU Memory monitoring
def monitor_gpu_memory():
    """Monitor GPU memory usage"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        cached = torch.cuda.memory_reserved(0) / 1e9
        print(f"GPU Memory - Allocated: {allocated:.2f}GB, Cached: {cached:.2f}GB")
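
The augmenter for each tweet is drawn with `random.choices` using the 0.6 / 0.4 weights. A stdlib-only sketch of how those weights translate into observed shares; the seeded `random.Random` instance is an addition for reproducibility and is not in the original code, which draws from the global `random` state:

```python
import random
from collections import Counter

# (method name, sampling weight), mirroring the augmenter list above
methods = [("context", 0.6), ("synonym", 0.4)]

rng = random.Random(42)  # seeded generator so the draw is repeatable
picks = rng.choices(
    [name for name, _ in methods],
    weights=[w for _, w in methods],
    k=10_000,
)

share = Counter(picks)
for name, n in share.items():
    print(f"{name}: {n / len(picks):.3f}")
```

Because the per-batch `sample(..., random_state=...)` call is seeded but the method choice is not, two runs may pick the same tweets yet augment them differently unless the global generator is also seeded.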

In [None]:
# First augmentation pass: original per-class targets
augmentation_targets = {
        'Extremely Negative': 6500,  # +2,115 samples
        'Extremely Positive': 7000,  # +1,701 samples
        'Neutral': 7500,             # +1,330 samples
        'Negative': 8200             # +266 samples
    }

print("Starting GPU-accelerated augmentation...")
monitor_gpu_memory()

# Run GPU augmentation
augmented_samples = gpu_augment_multiple_classes(train_data, augmentation_targets)

# Memory summary
monitor_gpu_memory()

Starting GPU-accelerated augmentation...
GPU Memory - Allocated: 0.00GB, Cached: 0.00GB
GPU Available: NVIDIA A100-SXM4-40GB
GPU Memory: 42.5 GB
GPU Count: 1
\n=== GPU-ACCELERATED AUGMENTATION ===
Original training size: 32925
\nCurrent distribution:
Sentiment
Positive              9137
Negative              7934
Neutral               6170
Extremely Positive    5299
Extremely Negative    4385
Name: count, dtype: int64
\n=== GPU AUGMENTING EXTREMELY NEGATIVE ===
Current: 4385, Target: 6500, Need: 2115
Protected words: 32
Using device: cuda


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


GPU augmenters initialized successfully
Processing 15 batches of ~150 samples each


Augmenting batches:   0%|          | 0/15 [00:00<?, ?it/s][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
  return forward_call(*args, **kwargs)

Generated 1259 augmented samples using GPU
 Extremely Negative: 1259 samples
\n=== GPU AUGMENTING EXTREMELY POSITIVE ===
Current: 5299, Target: 7500, Need: 2201
Protected words: 31
Using device: cuda
GPU augmenters initialized successfully
Processing 15 batches of ~150 samples each


Augmenting batches:   0%|          | 0/15 [00:00<?, ?it/s][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

Generated 1339 augmented samples using GPU
 Extremely Positive: 1339 samples
\n=== GPU AUGMENTING NEUTRAL ===
Current: 6170, Target: 8200, Need: 2030
Protected words: 31
Using device: cuda
GPU augmenters initialized successfully
Processing 14 batches of ~150 samples each


Augmenting batches:   0%|          | 0/14 [00:00<?, ?it/s][nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

Generated 1217 augmented samples using GPU
 Neutral: 1217 samples
\n TOTAL AUGMENTED: 3815 samples
\nAugmented distribution:
Sentiment
Extremely Positive    1339
Extremely Negative    1259
Neutral               1217
Name: count, dtype: int64
GPU Memory - Allocated: 0.45GB, Cached: 0.49GB
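
The log shows each class falling short of its requested count. That is consistent with the single-pass design: the batches together sample exactly `needed` tweets, and any tweet that is skipped (too short, unchanged after augmentation, or raising an exception) is never replaced. A sketch using the numbers printed above:

```python
def num_batches(needed, batch_size):
    """Ceiling division, as used to size the batch loop."""
    return (needed + batch_size - 1) // batch_size

# (requested, generated) pairs copied from the log output
runs = {
    "Extremely Negative": (2115, 1259),
    "Extremely Positive": (2201, 1339),
    "Neutral": (2030, 1217),
}

yields = {}
for label, (needed, generated) in runs.items():
    yields[label] = generated / needed
    print(f"{label:<20} batches={num_batches(needed, 150):>2}  yield={yields[label]:.1%}")
```

Roughly 60% of sampled tweets survive the quality check in each class, so a single pass would need to oversample by about 1.7x to reach the targets.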


In [None]:
sentiment_counts_augment= augmented_samples['Sentiment'].value_counts()
print("Sentiment label distribution in augmented samples:")
print(sentiment_counts_augment)

augmented_samples_1 = augmented_samples[['ProcessedTweet','Sentiment']].sample(15, random_state=1)
display(augmented_samples_1)

Sentiment label distribution in augmented samples:
Sentiment
Extremely Positive    1339
Extremely Negative    1259
Neutral               1217
Name: count, dtype: int64


Unnamed: 0,ProcessedTweet,Sentiment
1140,come on user your supermarket shelves have been empty for hours. no hero grams for real. this is a small crisis and lives risk falling. # [UNK] # panicbuying,Extremely Negative
2329,lysol lot 2 and 2 [UNK] clean and fresh 48oz x 20 gal spray http # [UNK] http,Extremely Positive
1399,gp friend told by hospital that fuel oxide problems are reportedly a significant problem for [UNK] 19 spreading. wear gloves or use paper towel when filling up and discard immediately. rt please http,Extremely Positive
17,businesses inflating prices amid this global [UNK] people who are panick buying as making other people suffer especially the elderly days on end of [UNK],Extremely Negative
2731,food produce sales spike as consumers react to [UNK] at http,Neutral
725,chris tiernay vice of chief operating officer at coatings manufacturer talks amongst us about industry leadership during the [UNK] 2008 crisis and why oil prices are a big factor us watch during any potential recovery,Extremely Negative
2801,going over # supermarket after 28 days # lockdown? # award # win # citizen # goes # to? ( me )? # nomask # time # mask # hope # smile # [UNK] # stay # work # stay # live # restez # chez # s?? http,Neutral
1905,printed store flyers may not come back as [UNK] it changes retail habits http would love it if these just popped out. http,Extremely Positive
3681,ge sells n95 and now link nspx into decn nbdr spy qqq,Neutral
2742,jerk your hand away from your face and lick those hands regularly or use sanitizer? # [UNK] # amr # usha # [UNK] # amr # 2 # eradicating # hunger http,Neutral


In [8]:
# Check if [UNK] tokens are real or just display masking

import pandas as pd
import re

def analyze_unk_tokens(augmented_df):
    """
    Analyze whether [UNK] tokens are real or display artifacts
    """
    print("=== ANALYZING [UNK] TOKENS ===")

    # Sample a few texts with apparent [UNK] tokens
    sample_texts = augmented_df['ProcessedTweet'].head(10).tolist()

    print("\n1. RAW STRING INSPECTION:")
    print("-" * 50)

    for i, text in enumerate(sample_texts):
        print(f"\nSample {i+1}:")
        print(f"Display: {text}")
        print(f"Raw repr: {repr(text)}")
        print(f"Contains '[UNK]': {'[UNK]' in text}")
        print(f"Contains 'covid': {'covid' in text.lower()}")
        print(f"Contains 'coronavirus': {'coronavirus' in text.lower()}")
        print("-" * 30)

    print("\n2. STATISTICAL ANALYSIS:")
    print("-" * 50)

    # Count rows containing [UNK]
    total_samples = len(augmented_df)
    unk_samples = augmented_df['ProcessedTweet'].str.contains(r'\[UNK\]', regex=True).sum()

    print(f"Total augmented samples: {total_samples}")
    print(f"Samples with [UNK]: {unk_samples}")
    print(f"Percentage with [UNK]: {unk_samples/total_samples*100:.1f}%")

    # Find all [UNK] patterns (the marker plus any word that directly follows it)
    all_text = ' '.join(augmented_df['ProcessedTweet'].tolist())
    unk_matches = re.findall(r'\[UNK\]\s*\w*', all_text)

    print(f"\nTotal [UNK] tokens found: {len(unk_matches)}")
    print("Most common [UNK] patterns:")
    from collections import Counter
    unk_counter = Counter(unk_matches)
    for pattern, count in unk_counter.most_common(10):
        print(f"  '{pattern}': {count} times")

    print("\n3. CONTEXT ANALYSIS:")
    print("-" * 50)

    # Look at context around [UNK] tokens
    unk_contexts = []
    for text in augmented_df['ProcessedTweet']:
        if '[UNK]' in text:
            # Extract 5 words before and after each [UNK]
            words = text.split()
            for i, word in enumerate(words):
                if '[UNK]' in word:
                    start = max(0, i-5)
                    end = min(len(words), i+6)
                    context = ' '.join(words[start:end])
                    unk_contexts.append(context)

    print("Sample contexts with [UNK]:")
    for i, context in enumerate(unk_contexts[:5]):
        print(f"  {i+1}. ...{context}...")

    print("\n4. TOKENIZER TEST:")
    print("-" * 50)

    # Test with a tokenizer to see what happens
    try:
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

        # Test a sample with [UNK]
        sample_with_unk = None
        for text in sample_texts:
            if '[UNK]' in text:
                sample_with_unk = text
                break

        if sample_with_unk:
            print(f"Testing tokenizer on: {sample_with_unk[:100]}...")
            tokens = tokenizer.tokenize(sample_with_unk)
            print(f"Tokenized: {tokens[:20]}...")  # First 20 tokens
            print(f"Contains [UNK] tokens: {'[UNK]' in tokens}")

            # Test if replacing [UNK] with 'covid' makes sense
            replaced_text = sample_with_unk.replace('[UNK]', 'covid')
            print("\nIf we replace [UNK] with 'covid':")
            print(f"Result: {replaced_text[:100]}...")
            replaced_tokens = tokenizer.tokenize(replaced_text)
            print(f"Tokenized: {replaced_tokens[:20]}...")

    except ImportError:
        print("Tokenizer test skipped (transformers not available)")

    print("\n5. RECOMMENDATION:")
    print("-" * 50)

    if unk_samples > total_samples * 0.3:  # More than 30% have [UNK]
        print("⚠️  HIGH [UNK] RATE - Likely real unknown tokens")
        print("   → Augmentation may be problematic")
        print("   → Consider using class weights instead")
    elif unk_samples > 0:
        print("🤔 SOME [UNK] TOKENS FOUND")
        print("   → Check if they're replacing specific words (like 'covid')")
        print("   → If it's just masking, data might be usable")
        print("   → Consider post-processing to replace [UNK] with original words")
    else:
        print("✅ NO [UNK] TOKENS FOUND")
        print("   → Display masking was the issue, not real tokens")
        print("   → Augmented data is likely good to use!")


def fix_unk_tokens(augmented_df, replacement_dict=None):
    """
    Attempt to fix [UNK] tokens by replacing with likely words
    """
    if replacement_dict is None:
        replacement_dict = {
            '[UNK]': 'covid',
            '[UNK] 19': 'covid 19',
            '[UNK] [UNK]': 'covid 19',
            '# [UNK]': '#covid',
            '# [UNK] [UNK]': '#covid 19'
        }

    print("\n=== ATTEMPTING TO FIX [UNK] TOKENS ===")

    fixed_df = augmented_df.copy()
    total_replacements = 0

    # Apply longer patterns first so '[UNK] [UNK]' is not consumed by the bare '[UNK]' rule
    for old_token, new_token in sorted(replacement_dict.items(), key=lambda kv: -len(kv[0])):
        # regex=False: treat '[UNK]' as literal text, not as a character class
        count = fixed_df['ProcessedTweet'].str.contains(old_token, regex=False).sum()
        if count > 0:
            fixed_df['ProcessedTweet'] = fixed_df['ProcessedTweet'].str.replace(old_token, new_token, regex=False)
            total_replacements += count
            print(f"Replaced '{old_token}' → '{new_token}': {count} rows")

    print(f"\nTotal replacements made (rows, summed over rules): {total_replacements}")

    # Check remaining [UNK] tokens
    remaining_unk = fixed_df['ProcessedTweet'].str.contains('[UNK]', regex=False).sum()
    print(f"Remaining rows with [UNK]: {remaining_unk}")

    return fixed_df


# Analyze your augmented data
print("Analyzing augmented samples for [UNK] tokens...")
analyze_unk_tokens(augmented_samples)


Analyzing augmented samples for [UNK] tokens...


NameError: name 'augmented_samples' is not defined
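
When counting `[UNK]` markers with a regex, the brackets must be escaped exactly once. A doubled backslash inside a raw string asks for a literal backslash followed by a character class, which never matches these tweets. A sketch on made-up sample strings, using `re.escape` to avoid hand-written escapes:

```python
import re

samples = [
    "prices amid this global [UNK] people",
    "# [UNK] # panicbuying and [UNK] 19",
    "no marker here",
]

literal = re.compile(re.escape("[UNK]"))   # matches the five characters [UNK]
doubled = re.compile(r"\\[UNK\\]")         # backslash + one of U, N, K, backslash

rows_with_unk = sum(bool(literal.search(s)) for s in samples)
total_unk = sum(len(literal.findall(s)) for s in samples)
doubled_hits = sum(len(doubled.findall(s)) for s in samples)

print(rows_with_unk, total_unk, doubled_hits)
```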

In [None]:
from google.colab import files

# Save the CSV file in Colab
augmented_samples.to_csv('augmented_1.csv', index=False, encoding='utf-8')
print(f"Train augmented dataset saved: {len(augmented_samples)} rows")

# Download the file to your computer
#files.download('augmented.csv')
files.download('augmented_1.csv')

Train augmented dataset saved: 3815 rows





---

Second augmentation:

In [9]:
def gpu_augment_sentiment_class(train_data, target_sentiment, target_count, batch_size=200):
    """
    GPU-optimized augmentation with retry logic until enough valid samples are generated
    """
    import random
    from tqdm import tqdm
    import torch
    import gc
    import nlpaug.augmenter.word as naw

    sentiment_df = train_data[train_data['Sentiment'] == target_sentiment].copy()
    current_count = len(sentiment_df)
    needed_augmentations = target_count - current_count

    print(f"\n=== GPU AUGMENTING {target_sentiment.upper()} ===")
    print(f"Current: {current_count}, Target: {target_count}, Need: {needed_augmentations}")

    if needed_augmentations <= 0:
        return pd.DataFrame()

    # Clear GPU memory before starting
    clear_gpu_memory()

    # Protected words
    cross_class_protected = [
        'store', 'prices', 'food', 'supermarket', 'grocery', 'people', 'amp',
        'consumer', 'out', 'about', 'how', 'now', 'during', 'get', 'online',
        'shopping', 'hand', 'need', 'like'
    ]

    sentiment_specific = {
        'Extremely Negative': ['panic', 'crisis', 'buying', 'no', 'who', 'just', 'there'],
        'Extremely Positive': ['help', 'sanitizer', 'please', 'workers', 'us', 'who'],
        'Negative': ['demand', 'panic'],
        'Positive': [],
        'Neutral': ['coronavirus', 'pandemic', 'stock', 'go', 'what', 'just']
    }

    base_protected = ['covid', 'coronavirus', 'pandemic', 'virus', 'user', 'http']
    protected_words = base_protected + cross_class_protected + sentiment_specific.get(target_sentiment, [])

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    aug_context = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased',
        action="substitute",
        aug_p=0.12,
        device=device,
        stopwords=protected_words
    )

    aug_synonym = naw.SynonymAug(
        aug_src='wordnet',
        aug_p=0.08,
        stopwords=protected_words
    )

    augmenters = [
        ('context', aug_context, 0.6),
        ('synonym', aug_synonym, 0.4)
    ]

    augmented_data = []
    attempts = 0
    max_attempts = needed_augmentations * 30  # safety limit to avoid infinite loops

    while len(augmented_data) < needed_augmentations and attempts < max_attempts:
        attempts += 1

        row = sentiment_df.sample(n=1, replace=True).iloc[0]
        original_tweet = row['ProcessedTweet']

        if len(original_tweet.split()) < 2:
            continue

        try:
            method_name, augmenter, weight = random.choices(augmenters, weights=[w for _, _, w in augmenters])[0]
            augmented_tweet = augmenter.augment(original_tweet)[0]

            if (augmented_tweet != original_tweet and
                len(augmented_tweet.split()) >= 2):

                new_row = row.copy()
                new_row['ProcessedTweet'] = augmented_tweet
                new_row['OriginalTweet'] = f"[GPU_{method_name.upper()}] {row['OriginalTweet'][:80]}..."
                augmented_data.append(new_row)

        except Exception:
            # Skip tweets the augmenter cannot process and draw another sample
            continue

        if attempts % 500 == 0:
            print(f"Generated {len(augmented_data)} so far (after {attempts} attempts)...")
            clear_gpu_memory()

    clear_gpu_memory()

    if len(augmented_data) < needed_augmentations:
        print(f"⚠️ Only generated {len(augmented_data)} out of {needed_augmentations} requested for {target_sentiment}")

    print(f"✅ Final count: {len(augmented_data)} augmented samples for {target_sentiment}")

    return pd.DataFrame(augmented_data).reset_index(drop=True) if augmented_data else pd.DataFrame()
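
The weighted choice between the two augmenters can be looked at in isolation. The sketch below is illustrative only: the `None` placeholders stand in for the real nlpaug objects, and `pick_augmenter` is a hypothetical helper name that does not appear in the notebook. It shows that `random.choices` with the 0.6 / 0.4 weights selects the contextual augmenter for roughly 60% of attempts.

```python
import random
from collections import Counter

# Stand-ins for the (name, augmenter, weight) triples used in the function above.
# The real augmenters are nlpaug objects; None is used here so the sketch runs anywhere.
augmenters = [
    ("context", None, 0.6),
    ("synonym", None, 0.4),
]


def pick_augmenter(rng):
    """Return one (name, augmenter, weight) triple, sampled in proportion to its weight."""
    weights = [w for _, _, w in augmenters]
    return rng.choices(augmenters, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
counts = Counter(pick_augmenter(rng)[0] for _ in range(10_000))
print(counts)  # roughly 60% "context" and 40% "synonym"
```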


In [None]:
# Final targets = original count + augmentation
augmentation_targets = {
    'Extremely Negative': 5885,  # 4385 + 1500
    'Neutral': 8170,             # 6170 + 2000
    'Extremely Positive': 7499   # 5299 + 2200
}
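
As a sanity check on these targets, the number of samples to generate per class is simply the target minus the current count, which matches the "Need" values printed in the run below. This is a small sketch under assumptions: `needed_per_class` is a hypothetical helper, not part of the notebook, and clamping at zero for classes already above target is my own choice.

```python
# Class counts taken from the "Current distribution" output of the run below.
current_counts = {
    "Positive": 9137,
    "Negative": 7934,
    "Neutral": 6170,
    "Extremely Positive": 5299,
    "Extremely Negative": 4385,
}

augmentation_targets = {
    "Extremely Negative": 5885,
    "Neutral": 8170,
    "Extremely Positive": 7499,
}


def needed_per_class(targets, counts):
    """Return how many new samples each targeted class needs (never negative)."""
    return {
        label: max(0, target - counts.get(label, 0))
        for label, target in targets.items()
    }


needed = needed_per_class(augmentation_targets, current_counts)
print(needed)
# {'Extremely Negative': 1500, 'Neutral': 2000, 'Extremely Positive': 2200}
```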


In [None]:
augmented_df = gpu_augment_multiple_classes(train_data, augmentation_targets)

GPU Available: NVIDIA A100-SXM4-40GB
GPU Memory: 42.5 GB
GPU Count: 1
\n=== GPU-ACCELERATED AUGMENTATION ===
Original training size: 32925
\nCurrent distribution:
Sentiment
Positive              9137
Negative              7934
Neutral               6170
Extremely Positive    5299
Extremely Negative    4385
Name: count, dtype: int64

=== GPU AUGMENTING EXTREMELY NEGATIVE ===
Current: 4385, Target: 5885, Need: 1500



Generated 586 so far (after 1000 attempts)...



Generated 879 so far (after 1500 attempts)...



Generated 1179 so far (after 2000 attempts)...



Generated 1454 so far (after 2500 attempts)...



✅ Final count: 1500 augmented samples for Extremely Negative
 Extremely Negative: 1500 samples

=== GPU AUGMENTING NEUTRAL ===
Current: 6170, Target: 8170, Need: 2000



Generated 610 so far (after 1000 attempts)...



Generated 1209 so far (after 2000 attempts)...



Generated 1792 so far (after 3000 attempts)...



✅ Final count: 2000 augmented samples for Neutral
 Neutral: 2000 samples

=== GPU AUGMENTING EXTREMELY POSITIVE ===
Current: 5299, Target: 7499, Need: 2200



Generated 320 so far (after 500 attempts)...



Generated 638 so far (after 1000 attempts)...



Generated 940 so far (after 1500 attempts)...



Generated 1239 so far (after 2000 attempts)...



Generated 1849 so far (after 3000 attempts)...



✅ Final count: 2200 augmented samples for Extremely Positive
 Extremely Positive: 2200 samples
\n TOTAL AUGMENTED: 5700 samples
\nAugmented distribution:
Sentiment
Extremely Positive    2200
Neutral               2000
Extremely Negative    1500
Name: count, dtype: int64


In [None]:
sentiment_counts_augment = augmented_df['Sentiment'].value_counts()
print("Sentiment label distribution in augmented samples:")
print(sentiment_counts_augment)

augmented_samples_1 = augmented_df[['ProcessedTweet','Sentiment']].sample(15, random_state=1)
display(augmented_samples_1)

Sentiment label distribution in augmented samples:
Sentiment
Extremely Positive    2200
Neutral               2000
Extremely Negative    1500
Name: count, dtype: int64


Unnamed: 0,ProcessedTweet,Sentiment
5308,"our corporate director of health and adult services user, on how its is best to exercise at get some fresh air, but please adhere to # social # distancing advice during a # [UNK] outbreak. http",Extremely Positive
2896,waiting for long fucking line to get into a bloody supermarket! # kroger money hungry # supermarket # foodsupply # [UNK],Neutral
2900,if these ' re out getting stock of food then you should try some soup!. # [UNK],Neutral
1457,il b shocked if in kanye west we r told there will be a water crisis n prices for potable water do stay high n the reason they will give us wud when we wasted too much water during the # e # outbreak. so be sorry. # [UNK] # watercrisis soon,Extremely Negative
21,some idiot stole money 20 out of my play console sunday night and didnt take my hand sanitizer? # [UNK] http,Extremely Negative
4721,"the aim of joint procurement is to advance admin, get better prices through bulk purchasing programmes and take advantage key medical manufacturing skills that may not be equally shared ( especially available for smaller countries ).",Extremely Positive
2632,msd manuals [UNK] 19 information and electronic board manual consumer version http,Neutral
2946,china consumer impact from [UNK] business? money baba money jd inc pdd system tcehy money tcom money wynn money lvs money mlco friend iq money bili money yumc dm or email money. look for more data http,Neutral
2097,about [UNK] came to this : http ( user ),Neutral
5540,"thank you to the amazing professionals, scientists, all honest, dedicated government leaders, the grocery store and other essential business / service workers. you are just a great here! i love this. # [UNK]",Extremely Positive


Light augmentations:

In [10]:
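# Final targets, carried over from the earlier (smaller) training split.
# `train` is larger in this run (41157 rows), so the amounts actually added
# are only 404 / 657 / 175 (see the "Need" values in the output of the next cell).
augmentation_targets = {
    'Extremely Negative': 5885,   # originally 4385 + 1500 (to restore lost recall)
    'Neutral': 8370,              # originally 6170 + 2200 (to keep focus on weakest class)
    'Extremely Positive': 6799    # originally 5299 + 1500 (slightly less than before)
}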


In [12]:
augmented_df = gpu_augment_multiple_classes(train, augmentation_targets)

GPU Available: NVIDIA A100-SXM4-40GB
GPU Memory: 42.5 GB
GPU Count: 1
\n=== GPU-ACCELERATED AUGMENTATION ===
Original training size: 41157
\nCurrent distribution:
Sentiment
Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: count, dtype: int64

=== GPU AUGMENTING EXTREMELY NEGATIVE ===
Current: 5481, Target: 5885, Need: 404



✅ Final count: 404 augmented samples for Extremely Negative
 Extremely Negative: 404 samples

=== GPU AUGMENTING NEUTRAL ===
Current: 7713, Target: 8370, Need: 657



Generated 300 so far (after 500 attempts)...



Generated 593 so far (after 1000 attempts)...



✅ Final count: 657 augmented samples for Neutral
 Neutral: 657 samples

=== GPU AUGMENTING EXTREMELY POSITIVE ===
Current: 6624, Target: 6799, Need: 175



✅ Final count: 175 augmented samples for Extremely Positive
 Extremely Positive: 175 samples
\n TOTAL AUGMENTED: 1236 samples
\nAugmented distribution:
Sentiment
Neutral               657
Extremely Negative    404
Extremely Positive    175
Name: count, dtype: int64


In [13]:
from google.colab import files

# Save the CSV file in Colab
augmented_df.to_csv('augmented_light.csv', index=False, encoding='utf-8')
print(f"Train augmented dataset saved: {len(augmented_df)} rows")

# Download the file to your computer
#files.download('augmented.csv')
files.download('augmented_light.csv')

Train augmented dataset saved: 1236 rows



In [16]:
import random

def random_deletion(text, deletion_prob=0.15, protected_words=None):
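    """
    Randomly delete words from `text`, dropping each with probability `deletion_prob`.

    Words whose lower-cased form is in `protected_words` are always kept.
    Matching is on whole whitespace-separated tokens, so a token with attached
    punctuation (e.g. "panic!") is not protected. Texts of three words or
    fewer are returned unchanged.
    """
    # A set gives fast membership checks; None means nothing is protected
    protected_words = set(protected_words or [])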

    words = text.split()
    if len(words) <= 3:
        return text  # Skip very short tweets

    new_words = []
    for word in words:
        if word.lower() in protected_words:
            new_words.append(word)
        elif random.random() > deletion_prob:
            new_words.append(word)

    return ' '.join(new_words) if new_words else text
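
To see what random deletion does to a single tweet, the following sketch repeats the same logic in a compact, self-contained form and runs it a few times on a made-up sentence. The example tweet, the seed, and the higher `deletion_prob=0.3` are assumptions chosen only to make the effect visible; they are not values used in the notebook.

```python
import random


def random_deletion(text, deletion_prob=0.15, protected_words=None):
    """Compact copy of the function above: drop unprotected words at random."""
    protected = set(protected_words or [])
    words = text.split()
    if len(words) <= 3:
        return text  # very short texts are left untouched
    kept = [
        w for w in words
        if w.lower() in protected or random.random() > deletion_prob
    ]
    return " ".join(kept) if kept else text


random.seed(42)  # seeded only so repeated runs of the demo look the same
tweet = "panic buying at the supermarket left the shelves empty again today"
protected = ["panic", "supermarket"]

for _ in range(3):
    # "panic" and "supermarket" survive every time; other words may disappear
    print(random_deletion(tweet, deletion_prob=0.3, protected_words=protected))
```

Because surviving words keep their original order, each output is a subsequence of the input, which is what makes this a light augmentation compared with substitution.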


In [17]:
import pandas as pd

def augment_with_random_deletion(train_df, target_sentiment, num_augmented=600, protected_words=None):
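    """
    Generate up to num_augmented new samples from a given sentiment class using random deletion.

    Each of the num_augmented draws is kept only if the deletion changed the text and
    left at least 3 words, so the result can contain fewer rows than requested.
    """
    source_df = train_df[train_df["Sentiment"] == target_sentiment]

    augmented_rows = []
    for _ in range(num_augmented):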
        row = source_df.sample(n=1, replace=True).iloc[0]
        original_text = row["ProcessedTweet"]
        augmented_text = random_deletion(original_text, protected_words=protected_words)

        if augmented_text != original_text and len(augmented_text.split()) >= 3:
            new_row = row.copy()
            new_row["ProcessedTweet"] = augmented_text
            new_row["OriginalTweet"] = f"[RD] {row['OriginalTweet'][:80]}..."
            augmented_rows.append(new_row)

    return pd.DataFrame(augmented_rows)
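
Because unchanged or too-short results are discarded, a fixed number of draws can fall short of the requested count. If an exact count matters, one option is a retry loop with an attempt cap, similar in spirit to the safety limit in the GPU augmentation function. The sketch below uses plain Python lists rather than a DataFrame; `collect_augmented` and `toy_augment` are hypothetical names introduced for illustration, and the default cap of 30 attempts per requested sample is an assumption borrowed from that earlier function.

```python
import random


def collect_augmented(texts, augment_fn, n_target, max_attempts=None, rng=None):
    """Draw texts at random and augment them until n_target accepted samples exist.

    Returns (pairs, attempts), where pairs is a list of (original, augmented).
    Stops early once max_attempts draws have been made.
    """
    rng = rng or random.Random()
    if max_attempts is None:
        max_attempts = n_target * 30  # cap so the loop always terminates
    pairs = []
    attempts = 0
    while len(pairs) < n_target and attempts < max_attempts:
        attempts += 1
        original = rng.choice(texts)
        candidate = augment_fn(original)
        # Same acceptance rule as above: text must change and keep >= 3 words
        if candidate != original and len(candidate.split()) >= 3:
            pairs.append((original, candidate))
    return pairs, attempts


demo_rng = random.Random(1)


def toy_augment(text):
    """Toy augmenter for the demo: drops the last word about half of the time."""
    words = text.split()
    return " ".join(words[:-1]) if demo_rng.random() < 0.5 else text


texts = ["stock up on food now please", "hand sanitizer sold out again today"]
pairs, attempts = collect_augmented(texts, toy_augment, n_target=5, rng=random.Random(2))
print(len(pairs), "samples after", attempts, "attempts")
```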


In [18]:
protected_words = [
    # cross-class words
    'store', 'prices', 'food', 'supermarket', 'grocery', 'people', 'amp',
    'consumer', 'out', 'about', 'how', 'now', 'during', 'get', 'online',
    'shopping', 'hand', 'need', 'like',
    # sentiment-specific words (duplicates removed)
    'panic', 'crisis', 'buying', 'no', 'who', 'just', 'there',
    'help', 'sanitizer', 'please', 'workers', 'us',
    'demand', 'stock', 'go', 'what',
    # base protected tokens
    'covid', 'coronavirus', 'pandemic', 'virus', 'user', 'http'
]


In [19]:
aug_neutral = augment_with_random_deletion(train, "Neutral", 600, protected_words)
aug_ex_pos = augment_with_random_deletion(train, "Extremely Positive", 600, protected_words)
aug_ex_neg = augment_with_random_deletion(train, "Extremely Negative", 600, protected_words)

# Combine all
aug_combined = pd.concat([aug_neutral, aug_ex_pos, aug_ex_neg], ignore_index=True)



In [20]:
from google.colab import files

# Save the CSV file in Colab
aug_combined.to_csv('augmented_light333.csv', index=False, encoding='utf-8')
print(f"Train augmented dataset saved: {len(aug_combined)} rows")

# Download the file to your computer
#files.download('augmented.csv')
files.download('augmented_light333.csv')

Train augmented dataset saved: 1700 rows

