<a href="https://colab.research.google.com/github/adimehta97/DSA-PROJECT/blob/master/Copy_of_HF_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Hugging Face Datasets & Pipelines in Colab

In [1]:
# Install necessary libraries
!pip install datasets transformers

Collecting datasets
  Downloading datasets-3.3.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.3.1-py3-none-any.whl (484 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
Downloading x

In [10]:
# Import required libraries
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer
import pandas as pd

## 1. Overview of Popular Hugging Face Datasets
Below is a table listing some popular datasets, categorized by size and task:

| **Dataset Name** | **Size** | **Loading Approach** | **Task** |
|-----------------|----------|----------------|------------|
| `imdb` | Small (~80MB) | Sampling | Sentiment Analysis |
| `ag_news` | Small (~30MB) | Sampling | Text Classification |
| `yelp_review_full` | Medium (~200MB) | Sampling | Sentiment Analysis |
| `squad` | Medium (~150MB) | Sampling | Question Answering |
| `cnn_dailymail` | Large (~1GB) | Streaming | Summarization |
| `wikipedia` | Large (GBs) | Streaming | Knowledge Extraction |
| `common_voice` | Large (Multiple GBs) | Streaming | Speech Recognition |
| `laion400m` | Huge (TBs) | Streaming | Vision-Language Tasks |
| `financial_phrasebank` | Small (~1MB) | Sampling | Financial Sentiment |
| `amazon_reviews_multi` | Medium (~300MB) | Sampling | Sentiment Analysis |

For this notebook, we will use **IMDB** for sentiment analysis with a pipeline.


# Watchouts When Using Hugging Face Datasets
Using Hugging Face datasets is convenient, but there are some important considerations to keep in mind.

1. Dataset Size & Memory Issues
✔ Some datasets (e.g., wikipedia, laion400m) are too large to fit in memory.
✔ Solution: Stream large datasets instead of downloading them in full:
dataset = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
✔ In Colab, avoid fully downloading datasets larger than 1-2GB.

2. Model Compatibility with Dataset
✔ Not all datasets are directly compatible with every NLP task.
✔ Example: Sentiment analysis models work with positive/negative labels, but a dataset may have numerical ratings instead.
✔ Solution: Preprocess labels before using in Hugging Face pipelines:

df["label"] = df["rating"].apply(lambda x: "positive" if x > 3 else "negative")

3. Multilingual & Hinglish Issues
✔ If working with Hinglish, Arabic, or other non-English or code-mixed text, English-only models may perform poorly.
✔ Solution: Use a multilingual or language-specific model:

classifier = pipeline("sentiment-analysis", model="mrm8488/distilbert-multi-uncased-finetuned-sentiment")

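4. Dataset Versioning & Updates
✔ Some datasets get deprecated or removed from the Hub (e.g., amazon_reviews_multi was removed).
✔ Solution: Always check for an updated version or alternatives:

from huggingface_hub import list_datasets  # datasets.list_datasets was removed in datasets 3.x
print([d.id for d in list_datasets(search="amazon_reviews", limit=10)])  # Search available datasets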

5. Missing or Incomplete Data
✔ Some datasets have missing values, which can affect performance.
✔ Solution: Check for NaNs and handle them appropriately:
df.dropna(inplace=True)  # Remove missing values

6. Missing Train/Test Splits
✔ Some datasets may not have clear splits (train/test). Always check:
print(dataset.keys())  # Ensure train/test splits exist

7. Text Encoding Issues
✔ Some datasets (especially multilingual ones) may contain special characters or non-UTF-8 encodings.
✔ Solution: Always specify encoding when reading CSVs or handling text:

df = pd.read_csv("dataset.csv", encoding="utf-8")

8. Dataset Format & Structure
✔ Some datasets don’t follow a standard tabular structure—they may be JSON or contain nested fields.
✔ Solution: Convert to Pandas and inspect structure:

df = pd.DataFrame(dataset["train"])  # Convert to Pandas
print(df.head())  # Check structure
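
For watchout 1, a streamed dataset behaves like an iterable, so a small sample can be pulled without downloading the full dataset. The sketch below is only an illustration: `take_examples` and `stream_sample` are hypothetical helper names, and the demo uses a plain generator in place of a real stream so that it runs offline.

```python
from itertools import islice


def take_examples(stream, n):
    """Return the first n records from any iterable of examples."""
    return list(islice(stream, n))


def stream_sample(name, n, config=None, split="train"):
    """Hypothetical helper: stream a Hub dataset and keep only n records."""
    from datasets import load_dataset  # imported lazily; needs network access

    stream = load_dataset(name, config, split=split, streaming=True)
    return take_examples(stream, n)


# Offline demo: an endless generator stands in for a streamed dataset.
def fake_stream():
    i = 0
    while True:
        yield {"id": i, "text": f"record {i}"}
        i += 1


sample = take_examples(fake_stream(), 3)
print(sample)
```

Because `islice` stops after `n` records, the rest of the stream is never read, which is what keeps memory use low in Colab.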



# Demo Exercise: IMDB Sentiment Analysis

# Loading and exploring the dataset

In [3]:
## 2. Load the IMDB Dataset

dataset = load_dataset("imdb")
print(dataset.keys())  # Shows available splits (train/test)
print(dataset["train"][0])  # Example data point


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

dict_keys(['train', 'test', 'unsupervised'])
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really,

In [4]:
# Reload the IMDB dataset with a longer network timeout
import socket

# Set the default socket timeout to 60 seconds to tolerate slow downloads
socket.setdefaulttimeout(60)
dataset = load_dataset("imdb")
print(dataset.keys())  # Shows available splits (train/test/unsupervised)
print(dataset["train"][0])  # Example data point



# Convert to Pandas

In [5]:
# Convert to Pandas for easy manipulation
df = pd.DataFrame(dataset["train"])
print(df.head())

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0
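
The example review printed earlier contains HTML `<br />` tags, which carry no sentiment information. A minimal cleanup sketch, assuming that line-break tags are the only markup worth removing (`strip_line_breaks` is a hypothetical helper name, not part of any library):

```python
import re

# Matches <br>, <br/>, <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_line_breaks(text):
    """Replace HTML <br> tags with spaces and collapse repeated whitespace."""
    without_tags = _BR_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


raw = (
    "I really had to see this for myself.<br /><br />"
    "The plot is centered around a young Swedish drama student."
)
print(strip_line_breaks(raw))
```

It could be applied to the whole column with `df["text"].map(strip_line_breaks)` before sampling or classification.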


# Sampling a subset

In [6]:
## 3. Sampling a Subset for Quick Processing

# Taking a random sample of 1000 rows
df_sample = df.sample(n=1000, random_state=42)
print(df_sample.head())


                                                    text  label
6868   Dumb is as dumb does, in this thoroughly unint...      0
24016  I dug out from my garage some old musicals and...      1
9668   After watching this movie I was honestly disap...      0
13640  This movie was nominated for best picture but ...      1
14018  Just like Al Gore shook us up with his painful...      1
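
`df.sample` draws rows uniformly at random, so the share of each label in the sample can drift away from the 50/50 split of the full training set. If a balanced subset matters, one option is to sample per class. The sketch below assumes pandas 1.1 or later (for `groupby(...).sample`) and uses a toy frame in place of the IMDB DataFrame; `stratified_sample` is a hypothetical helper name.

```python
import pandas as pd


def stratified_sample(df, label_col, n_per_class, seed=42):
    """Draw the same number of rows from each class."""
    return df.groupby(label_col).sample(n=n_per_class, random_state=seed)


# Toy frame standing in for the IMDB DataFrame (rows sorted by label).
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [0] * 10 + [1] * 10,
})
balanced = stratified_sample(toy, "label", n_per_class=3)
print(balanced["label"].value_counts().to_dict())
```

For the IMDB sample used here, `stratified_sample(df, "label", n_per_class=500)` would give 1000 rows with equal numbers of each label.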


# Using the Sentiment Analysis Pipeline

In [7]:
## 4. Using a Hugging Face Pipeline for Sentiment Analysis

# Load the sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


## 5. Applying Pipeline to a Dataset Row-by-Row
A pipeline call returns one prediction per input text, so we **apply it row-by-row** to the `text` column of our sampled DataFrame.


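The row-by-row call raised an error, apparently because long reviews exceed the model's 512-token input limit. The cell below, worked out with help from Gemini, truncates each review to 512 tokens before classifying it.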

In [11]:
from transformers import pipeline, AutoTokenizer

# Load the sentiment analysis pipeline and the corresponding tokenizer
classifier = pipeline("sentiment-analysis")
tokenizer = AutoTokenizer.from_pretrained(classifier.model.config._name_or_path)

# Define a function to truncate text to the maximum sequence length
# accounting for special tokens
def truncate_text(text):
  tokens = tokenizer(text, truncation=True,
                     max_length=512, # Maximum length for the model
                     padding="max_length", # Pad to max_length
                     add_special_tokens = True)  # Add special tokens
  # Truncate to 512 tokens, including special tokens
  truncated_text = tokenizer.decode(tokens["input_ids"], skip_special_tokens=True)
  return truncated_text

# Apply the truncation function to the 'text' column before sentiment analysis
df_sample["predicted_sentiment"] = df_sample["text"].apply(lambda x: classifier(truncate_text(x))[0]['label'])
print(df_sample.head())

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cuda:0


                                                    text  label  \
6868   Dumb is as dumb does, in this thoroughly unint...      0   
24016  I dug out from my garage some old musicals and...      1   
9668   After watching this movie I was honestly disap...      0   
13640  This movie was nominated for best picture but ...      1   
14018  Just like Al Gore shook us up with his painful...      1   

      predicted_sentiment  
6868             NEGATIVE  
24016            POSITIVE  
9668             NEGATIVE  
13640            NEGATIVE  
14018            POSITIVE  
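
As an alternative to decoding and re-encoding the text, a text-classification pipeline also accepts a list of strings and forwards tokenizer arguments such as `truncation=True` and `max_length`. The sketch below shows that call shape; `predict_labels` is a hypothetical helper, and a stand-in classifier is used so the example runs without downloading a model.

```python
def predict_labels(classifier, texts, batch_size=16, max_length=512):
    """Classify texts in one call, truncating inputs longer than max_length tokens."""
    results = classifier(
        list(texts),
        truncation=True,        # let the tokenizer cut long inputs
        max_length=max_length,  # model limit, including special tokens
        batch_size=batch_size,  # number of texts per forward pass
    )
    return [r["label"] for r in results]


# Stand-in classifier so the sketch runs offline; a real
# transformers sentiment pipeline would be passed here instead.
def fake_classifier(texts, **kwargs):
    return [
        {"label": "POSITIVE" if "good" in t else "NEGATIVE", "score": 1.0}
        for t in texts
    ]


print(predict_labels(fake_classifier, ["a good film", "a dull film"]))
```

With the real pipeline this would be used as `df_sample["predicted_sentiment"] = predict_labels(classifier, df_sample["text"])`, which also lets the GPU process several reviews per batch.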


In [12]:
# Save the processed data to a CSV file
df_sample.to_csv("imdb_sentiment_predictions.csv", index=False)

# Download the CSV from the Colab runtime to the local machine
from google.colab import files

files.download('imdb_sentiment_predictions.csv')



# Class Exercise

Use the Sentiment140 dataset.

The Sentiment140 dataset contains 1.6 million tweets labeled for sentiment analysis, making it suitable for training and evaluating sentiment models.

This is a large dataset, so it needs to be sampled before running the pipeline.

Start by checking the structure of the dataset.

In [13]:


# Load the Sentiment140 dataset
dataset = load_dataset("sentiment140")
print(dataset.keys())  # Shows available splits (train/test)
print(dataset["train"][0])  # Example data point


README.md:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

sentiment140.py:   0%|          | 0.00/4.03k [00:00<?, ?B/s]

The repository for sentiment140 contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/sentiment140.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Downloading data:   0%|          | 0.00/81.4M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1600000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/498 [00:00<?, ? examples/s]

dict_keys(['train', 'test'])
{'text': "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", 'date': 'Mon Apr 06 22:19:45 PDT 2009', 'user': '_TheSpecialOne_', 'sentiment': 0, 'query': 'NO_QUERY'}


In [15]:
# Convert to a Pandas DataFrame
df = pd.DataFrame(dataset["train"])

# Sample 1000 rows for efficient processing
df_sample = df.sample(n=1000, random_state=42)

# Display the first few rows




In [18]:
df_sample

Unnamed: 0,text,date,user,sentiment,query
541200,@chrishasboobs AHHH I HOPE YOUR OK!!!,Tue Jun 16 18:18:12 PDT 2009,LaLaLindsey0609,0,NO_QUERY
750,"@misstoriblack cool , i have no tweet apps fo...",Mon Apr 06 23:11:14 PDT 2009,sexygrneyes,0,NO_QUERY
766711,@TiannaChaos i know just family drama. its la...,Tue Jun 23 13:40:11 PDT 2009,sammydearr,0,NO_QUERY
285055,School email won't open and I have geography ...,Mon Jun 01 10:26:07 PDT 2009,Lamb_Leanne,0,NO_QUERY
705995,upper airways problem,Sat Jun 20 12:56:51 PDT 2009,yogicerdito,0,NO_QUERY
...,...,...,...,...,...
338333,"@girrlonthewing Ha, well you'd be surprised at...",Wed Jun 03 01:24:58 PDT 2009,sydeshow,0,NO_QUERY
109574,Some dark clouds in #indiavotes #indiavotes09 ...,Sun May 17 02:04:56 PDT 2009,dineshah,0,NO_QUERY
1349309,@wolfgnards awesome. Thanks for letting me kno...,Fri Jun 05 10:22:33 PDT 2009,MDobson84,4,NO_QUERY
671510,I left my heart @holdenbeach Hoping to go bac...,Fri Jun 19 18:16:30 PDT 2009,AnNa_HaLe,0,NO_QUERY


In [19]:
# Load a pre-trained sentiment analysis pipeline (model="distilbert-base-uncased-finetuned-sst-2-english")
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")


# Apply sentiment analysis on the text column
df_sample["predicted_sentiment"] = df_sample["text"].apply(lambda x: classifier(x)[0]['label'])


# Display the DataFrame with predictions
display(df_sample)


# Save processed data




config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cuda:0


Unnamed: 0,text,date,user,sentiment,query,predicted_sentiment
541200,@chrishasboobs AHHH I HOPE YOUR OK!!!,Tue Jun 16 18:18:12 PDT 2009,LaLaLindsey0609,0,NO_QUERY,POSITIVE
750,"@misstoriblack cool , i have no tweet apps fo...",Mon Apr 06 23:11:14 PDT 2009,sexygrneyes,0,NO_QUERY,POSITIVE
766711,@TiannaChaos i know just family drama. its la...,Tue Jun 23 13:40:11 PDT 2009,sammydearr,0,NO_QUERY,NEGATIVE
285055,School email won't open and I have geography ...,Mon Jun 01 10:26:07 PDT 2009,Lamb_Leanne,0,NO_QUERY,NEGATIVE
705995,upper airways problem,Sat Jun 20 12:56:51 PDT 2009,yogicerdito,0,NO_QUERY,NEGATIVE
...,...,...,...,...,...,...
338333,"@girrlonthewing Ha, well you'd be surprised at...",Wed Jun 03 01:24:58 PDT 2009,sydeshow,0,NO_QUERY,POSITIVE
109574,Some dark clouds in #indiavotes #indiavotes09 ...,Sun May 17 02:04:56 PDT 2009,dineshah,0,NO_QUERY,NEGATIVE
1349309,@wolfgnards awesome. Thanks for letting me kno...,Fri Jun 05 10:22:33 PDT 2009,MDobson84,4,NO_QUERY,POSITIVE
671510,I left my heart @holdenbeach Hoping to go bac...,Fri Jun 19 18:16:30 PDT 2009,AnNa_HaLe,0,NO_QUERY,NEGATIVE
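
The `sentiment` column and the `predicted_sentiment` column use different encodings, so they cannot be compared directly. Sentiment140 documents its codes as 0 = negative, 2 = neutral, 4 = positive; the sketch below assumes that encoding and skips any code without a binary counterpart. `accuracy` and `SENTIMENT140_TO_LABEL` are hypothetical names introduced for illustration.

```python
# Sentiment140 codes mapped to the label strings the SST-2 model emits.
# Code 2 (neutral) has no binary counterpart and is left out on purpose.
SENTIMENT140_TO_LABEL = {0: "NEGATIVE", 4: "POSITIVE"}


def accuracy(gold_codes, predicted_labels, code_to_label=SENTIMENT140_TO_LABEL):
    """Fraction of rows whose mapped gold label equals the predicted label."""
    pairs = [
        (code_to_label[int(code)], pred)
        for code, pred in zip(gold_codes, predicted_labels)
        if int(code) in code_to_label  # skip codes with no mapping
    ]
    if not pairs:
        return float("nan")
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


# The first five rows visible in the output above.
gold = [0, 0, 0, 0, 0]
pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
print(accuracy(gold, pred))  # 3 of 5 predictions match the gold label
```

On the full sample this would be called as `accuracy(df_sample["sentiment"], df_sample["predicted_sentiment"])`. Five rows are far too few to judge the model; they only show the calculation.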


In [20]:
df_sample.to_csv("sentiment140_sentiment_predictions.csv", index=False)

In [21]:
from google.colab import files

files.download('sentiment140_sentiment_predictions.csv')



# Exercise 2

Load the "takala/financial_phrasebank" dataset and use only the train set

Sample 1000 rows

Set up classifier using model="distilbert-base-uncased-finetuned-sst-2-english"




In [None]:
# Install necessary libraries
!pip install datasets transformers

# Import required libraries
from datasets import load_dataset
from transformers import pipeline
import pandas as pd



# Load the Financial PhraseBank dataset, specifying the configuration
#dataset = load_dataset("takala/financial_phrasebank", name="sentences_allagree", split="train", trust_remote_code=True) # Choose one of the available configs
#print(dataset.column_names)  # With split="train" this is a single Dataset, so list its columns
#print(dataset[0])  # Example data point

# Convert the dataset to a Pandas DataFrame


# Sample 1000 rows for efficient processing


# Display the first few rows


# Load a pre-trained sentiment analysis pipeline (model="distilbert-base-uncased-finetuned-sst-2-english")



# Apply sentiment analysis on the text column


# Display the DataFrame with predictions


# Save processed data

# Call files.download to download the saved file.







README.md:   0%|          | 0.00/8.88k [00:00<?, ?B/s]

financial_phrasebank.py:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

# Zero Shot Classification

The AG News dataset contains news articles labeled into four categories: World, Sports, Business, and Sci/Tech. It is available on the Hugging Face Hub.

Exercise Objectives:

1. Load the AG News Dataset.

2. Implement Zero-Shot Classification: Using the facebook/bart-large-mnli model, classify news articles into the predefined categories without explicit training.

3. Evaluate Model Performance: Assess the model's predictions and discuss its effectiveness in a zero-shot setting.



In [None]:



# Load the AG News dataset
#dataset = load_dataset("ag_news", split="test")

#print(dataset.column_names)  # With split="test" this is a single Dataset, so list its columns
#print(dataset[0])  # Example data point


# Convert the dataset to a Pandas DataFrame

# sample 100 records
#df_sample = df.sample(n=100, random_state=42)

# Display the first few rows


# Initialize the zero-shot classification pipeline, model="facebook/bart-large-mnli"


# Define candidate labels

# Apply zero-shot classification to a sample of 100 articles for efficiency

#df_sample['predicted_category'] = df_sample['text'].apply(
    #lambda x: classifier(x, candidate_labels)['labels'][0]
#)

# Display the DataFrame with predictions


# Save the results to a CSV file


                                                text  label
0  Fears for T N pension after talks Unions repre...      2
1  The Race is On: Second Private Team Sets Launc...      3
2  Ky. Company Wins Grant to Study Peptides (AP) ...      3
3  Prediction Unit Helps Forecast Wildfires (AP) ...      3
4  Calif. Aims to Limit Farm-Related Smog (AP) AP...      3


Device set to use cuda:0


                                                   text  label  \
7094  Fan v Fan: Manchester City-Tottenham Hotspur T...      1   
1017  Paris Tourists Search for Key to 'Da Vinci Cod...      0   
2850  Net firms: Don't tax VoIP The Spanish-American...      3   
1452  Dependent species risk extinction The global e...      3   
457   EDS Is Charter Member of Siebel BPO Alliance (...      3   

     predicted_category  
7094             Sports  
1017              World  
2850              World  
1452              World  
457            Business  
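
To evaluate the predictions, the integer `label` column has to be mapped to category names first. AG News uses 0 = World, 1 = Sports, 2 = Business, 3 = Sci/Tech. The sketch below is one possible way to do the comparison; `predict_category`, `accuracy`, and the stand-in classifier are hypothetical names, and the stand-in only mimics the shape of a zero-shot result (a dict whose `labels` list is sorted by descending score).

```python
# AG News label ids in order: index 0 is World, index 3 is Sci/Tech.
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def predict_category(classifier, text, candidate_labels=AG_NEWS_LABELS):
    """Return the highest-scoring candidate label for one text."""
    result = classifier(text, candidate_labels)
    return result["labels"][0]  # labels are sorted by descending score


def accuracy(label_ids, predicted, id_to_name=AG_NEWS_LABELS):
    """Fraction of rows where the predicted name matches the gold label id."""
    gold = [id_to_name[i] for i in label_ids]
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Stand-in for a zero-shot pipeline so the sketch runs offline:
# labels that literally appear in the text are ranked first.
def fake_zero_shot(text, candidate_labels):
    ranked = sorted(candidate_labels, key=lambda name: name.lower() not in text.lower())
    return {"sequence": text, "labels": ranked}


print(predict_category(fake_zero_shot, "Sports fans gather before the match"))

# The five rows shown in the output above.
label_ids = [1, 0, 3, 3, 3]
predicted = ["Sports", "World", "World", "World", "Business"]
print(accuracy(label_ids, predicted))  # 2 of 5 predictions match
```

In the rows shown, all three Sci/Tech articles were assigned to other categories, which is the kind of pattern worth discussing for the third objective. Five rows are not a reliable estimate of overall performance.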

