## Introduction
Sentiment analysis is the process of analyzing digital texts to determine the emotional tone. The emotional tone of a text could be positive, negative, or neutral.

In this lab, we will learn how to analyze text data according to sentiment. Then, we will integrate our model with the *Hugging Face* platform, a popular MLOps platform that helps users build, train, and deploy machine learning models.

*Hugging Face* provides developers with lightweight tools to smoothly track their machine learning training experiments, evaluate model performance, reproduce models, and visualize results. The platform also supports teamwork and shared projects.


## Objectives
The main objectives of this Jupyter notebook are:

* To learn how to build a sentiment analysis classification application using the PyTorch library.
* To observe and analyze the visualizations and results of a machine learning experiment.
* To learn how to integrate the sentiment analysis application with the Hugging Face platform.


## Tools and Libraries
For this Jupyter notebook, we will need the following tools and libraries:

1. Python 3.x
2. PyTorch deep learning library
3. Hugging Face Transformers library (`transformers`)

## What is?
* Transformers. Transformers are a kind of neural network architecture commonly used in the field of natural language processing (NLP). Key features of transformers include the attention mechanism, parallel processing, and scalability. Popular examples of transformers are BERT and GPT.
* PyTorch. PyTorch is an open-source ML library developed by the AI Research lab at Facebook (now Meta). The library is used mostly for computer vision and natural language processing applications.
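
To make the attention mechanism less abstract, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: real transformers first apply learned projections to obtain queries, keys, and values, whereas this sketch uses the toy input vectors directly. The function names `softmax` and `attention` are illustrative and do not come from any library.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # For each query: score every key, normalize the scores, then average the values
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy input: three "token" vectors of dimension 2 (made-up values)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)  # self-attention without learned projections
print(out)
```

Each output vector is a weighted average of all input vectors, which is how every token can take the rest of the sentence into account.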

## Step 1. Importing Libraries

First, you need to install the libraries necessary to run the lab activity:
1) `transformers`: the Hugging Face library that implements transformer models. The transformer is a deep learning architecture, first proposed in 2017, that has been widely adopted for training large language models on large text datasets.
2) `torch`: PyTorch is a Python-based scientific computing package used to implement neural networks.

In [1]:
#!pip install transformers
#!pip install tf-keras

In [2]:
#! pip3 install torch torchvision

In [3]:
import torch
print(torch.__version__)

2.7.1+cu118


In [4]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

2025-12-22 21:05:03.295677: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-12-22 21:05:05.800504: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.


## Step 2. Load the Model

In [5]:
# Pipelines in the Hugging Face Transformers library provide a user-friendly and efficient way to run ML models for various tasks.
# pipeline(task, model=...) takes the task to perform (for example "sentiment-analysis") and the model to use.
# Even if you don't define a model name, this model is used by default for the task.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
classifier = pipeline("sentiment-analysis", model=model_name)

Device set to use cuda:0


In [6]:
text_1 = classifier("The weather today is nice. It is dark, cloudy and expected to snow in the evening.")
text_2 = classifier("It is dark, cloudy and expected to snow in the evening.")
text_3 = classifier("We are not sure wether we like the AI Engineering course or not, but it is an important course.")
print(f'The predicted sentiment of text_1 is {text_1}')
print(f'The predicted sentiment of text_2 is {text_2}')
print(f'The predicted sentiment of text_3 is {text_3}')

The predicted sentiment of text_1 is [{'label': 'POSITIVE', 'score': 0.9986727237701416}]
The predicted sentiment of text_2 is [{'label': 'NEGATIVE', 'score': 0.994065523147583}]
The predicted sentiment of text_3 is [{'label': 'POSITIVE', 'score': 0.9982800483703613}]


In [7]:
# Another way to pass the text: give the classifier a list of texts and loop over the results

results = classifier(["The weather today is nice. It is dark, cloudy and expected to snow in the evening.", "I hope you don't hate this weather."])
for r in results:
    print (r)

{'label': 'POSITIVE', 'score': 0.9986727237701416}
{'label': 'NEGATIVE', 'score': 0.8883466720581055}
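
The `score` in each result can be read as the model's confidence in the predicted label. The following is a minimal sketch, in plain Python, of how such a label and score can be derived from a model's raw outputs (logits) with a softmax. The logit values, the `ID2LABEL` mapping, and the helper name `to_prediction` are assumptions made for illustration; they are not read from the model or the library.

```python
import math

# Assumed label order for a two-class sentiment model (index 1 = positive)
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def softmax(logits):
    # Turn raw scores into probabilities that sum to 1
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits):
    # Pick the most probable class and report its probability as the score
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

example_logits = [-3.2, 3.4]  # made-up values, not real model output
print(to_prediction(example_logits))
```

A score close to 1 means the two logits are far apart; a score near 0.5 means the model is close to undecided between the two labels.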


## Step 3. Using a Tokenizer

A tokenizer is used in natural language processing (NLP) to break down text into smaller units (tokens), which could be words, subwords, or characters, so that they can be processed by a machine learning model. Tokenization involves multiple steps, including:
* Tokenization: Splitting text into tokens.
* Indexing: Assigning unique numerical IDs to tokens. The main reason for assigning IDs to tokens is that neural networks work with numerical representations only. Another advantage is that the mapping from tokens to IDs is consistent, which ensures that the same text is always represented in the same way, so models learn consistently.
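
The two steps above can be sketched with a toy tokenizer in plain Python. This is only an illustration: the vocabulary `VOCAB` and its IDs are made up, and the two functions are simplified stand-ins that merely borrow the names of the real tokenizer methods. The actual tokenizer relies on a much larger, learned vocabulary.

```python
import re

# Hypothetical toy vocabulary; unknown tokens map to the [UNK] entry
VOCAB = {"[UNK]": 0, "i": 1, "like": 2, "this": 3, "weather": 4, ".": 5}

def tokenize(text):
    # Tokenization: lowercase the text, then split it into words and punctuation marks
    return re.findall(r"\w+|[^\w\s]", text.lower())

def convert_tokens_to_ids(tokens):
    # Indexing: look up each token's ID, falling back to [UNK] for unknown tokens
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

tokens = tokenize("I like this weather.")
ids = convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```

Because the lookup is deterministic, the same text always yields the same list of IDs, which is the consistency property described above.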

In [8]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)  # load the pre-trained model
tokenizer = AutoTokenizer.from_pretrained(model_name)  # load the tokenizer that matches the model

In [9]:
# Pass the text to the tokenizer and observe the results

tokens = tokenizer.tokenize("I hope you don't hate the cold dark weather.")
print(f' Tokens: {tokens}\n')
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f' Token IDs: {token_ids}. Each token is assigned a unique ID, a numerical representation that the ML model understands\n')
# Calling the tokenizer directly also adds the special tokens and builds the attention mask.
# Note: this call uses a slightly different sentence ("this weather"), so some IDs differ from the list above.
input_ids = tokenizer("I hope you don't hate this weather.")
print(f' Input IDs passed to the ML model: {input_ids}. \n The IDs 101 and 102 are the special tokens [CLS] and [SEP], which mark the beginning and the end of the sequence')

 Tokens: ['i', 'hope', 'you', 'don', "'", 't', 'hate', 'the', 'cold', 'dark', 'weather', '.']

 Token IDs: [1045, 3246, 2017, 2123, 1005, 1056, 5223, 1996, 3147, 2601, 4633, 1012]. Each token is assigned a unique ID, a numerical representation that the ML model understands

 Input IDs passed to the ML model: {'input_ids': [101, 1045, 3246, 2017, 2123, 1005, 1056, 5223, 2023, 4633, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}. 
 The IDs 101 and 102 are the special tokens [CLS] and [SEP], which mark the beginning and the end of the sequence
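
The difference between the plain token IDs and the input IDs in the output above comes down to the two extra IDs, 101 and 102, wrapped around the sequence. A minimal sketch of that wrapping step in plain Python follows; the helper name `add_special_tokens` is hypothetical, and the token IDs are copied from the output above.

```python
CLS_ID = 101  # [CLS], marks the start of the sequence
SEP_ID = 102  # [SEP], marks the end of the sequence

def add_special_tokens(token_ids):
    # Wrap the plain token IDs with the start and end markers
    return [CLS_ID] + list(token_ids) + [SEP_ID]

# Token IDs of "I hope you don't hate the cold dark weather." from the output above
token_ids = [1045, 3246, 2017, 2123, 1005, 1056, 5223, 1996, 3147, 2601, 4633, 1012]
model_input = add_special_tokens(token_ids)
attention_mask = [1] * len(model_input)  # every position holds a real token, so all ones
print(model_input)
print(attention_mask)
```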


In [10]:
#Observe the output after passing a tokenizer to the pipeline.

classifier = pipeline ("sentiment-analysis", model=model, tokenizer=tokenizer)
results = classifier(["The weather today is nice. It is dark, cloudy and expected to snow in the evening.", "I hope you don't hate the cold dark weather."])
for r in results:
    print (r)

Device set to use cuda:0


{'label': 'POSITIVE', 'score': 0.9986727237701416}
{'label': 'NEGATIVE', 'score': 0.7226830720901489}


In [11]:
X_train = ["The weather today is nice. It is dark, cloudy and expected to snow in the evening.", "I hope you don't hate the cold dark weather."]
# padding=True pads shorter texts to the longest one in the batch; return_tensors="pt" returns PyTorch tensors
batch = tokenizer(X_train, padding=True, truncation=True, max_length=512, return_tensors="pt")

In [12]:
print(batch)

{'input_ids': tensor([[  101,  1996,  4633,  2651,  2003,  3835,  1012,  2009,  2003,  2601,
          1010, 24706,  1998,  3517,  2000,  4586,  1999,  1996,  3944,  1012,
           102],
        [  101,  1045,  3246,  2017,  2123,  1005,  1056,  5223,  1996,  3147,
          2601,  4633,  1012,   102,     0,     0,     0,     0,     0,     0,
             0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}
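
In the batch above, the shorter sentence is filled with zeros up to the length of the longer one, and the attention mask marks those filler positions with 0 so the model ignores them. Here is a minimal sketch of that padding logic in plain Python. The function `pad_batch` is a hypothetical helper, the example IDs are shortened for readability, and the truncation is simplified: it just cuts the list, whereas real tokenizers keep the closing special token.

```python
PAD_ID = 0  # padding ID, matching the zeros in the batch output above

def pad_batch(sequences, max_length=None):
    # Target length: the longest sequence, optionally capped by max_length
    longest = max(len(seq) for seq in sequences)
    if max_length is not None:
        longest = min(longest, max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq[:longest])  # simplified truncation
        n_pad = longest - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch_sketch = pad_batch([[101, 1996, 4633, 102], [101, 1045, 102]])
print(batch_sketch)
```

Padding is what allows sentences of different lengths to be stacked into one rectangular tensor and processed together.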


## Step 4.1 Wrapping in Well-Structured Functions

In [13]:
import torch

# Creating the pipeline earlier may have moved the model to the GPU, so look up the device it lives on.
device = next(model.parameters()).device

def get_sentiment(text):
    """Return "positive" or "negative" for a single text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Move the input tensors to the model's device; inputs and model must be on the same device.
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():  # inference only, so gradients are not needed
        outputs = model(**inputs)
    pred = outputs.logits.argmax(dim=-1).item()  # index of the highest logit
    return "positive" if pred == 1 else "negative"

In [14]:
print(get_sentiment("I love learning new things!"))

positive


## Step 4.2 Write `analyze_file()` function

`analyze_file()` takes a file path as input, predicts the sentiment of each line of the file with `get_sentiment()`, and reports the overall sentiment by majority vote (a tie counts as positive).


In [15]:
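def analyze_file(file_path):
    # Read the file and keep only the non-empty lines
    with open(file_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    # Net vote: +1 for each positive line, -1 for each negative line
    pos_to_neg = 0
    for line in lines:
        if get_sentiment(line) == "positive":
            pos_to_neg += 1
        else:
            pos_to_neg -= 1
    # A tie counts as positive
    if pos_to_neg >= 0:
        return "Overall Sentiment: Positive"
    else:
        return "Overall Sentiment: Negative"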

In [16]:
file_path = 'test_sentiment.txt'
analyze_file(file_path)

FileNotFoundError: [Errno 2] No such file or directory: 'test_sentiment.txt'
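
The `FileNotFoundError` above means that `test_sentiment.txt` does not exist in the notebook's working directory, so the file has to be created before the analysis can run. The sketch below writes a small sample file and runs the same majority-vote logic on it. To stay self-contained, it uses a hypothetical keyword-based stand-in, `fake_sentiment`, instead of the model; in the notebook you would pass `get_sentiment` instead. The sample sentences and the helper name `overall_sentiment` are likewise made up for illustration.

```python
import os
import tempfile

def fake_sentiment(text):
    # Hypothetical stand-in for get_sentiment(): a crude keyword check, not a real model
    negative_words = {"hate", "bad", "terrible"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "negative" if words & negative_words else "positive"

def overall_sentiment(file_path, classify):
    # Same voting rule as analyze_file(): majority wins, a tie counts as positive
    with open(file_path, "r", encoding="utf-8") as f:
        lines = [line.strip() for line in f if line.strip()]
    positives = sum(1 for line in lines if classify(line) == "positive")
    negatives = len(lines) - positives
    if positives >= negatives:
        return "Overall Sentiment: Positive"
    return "Overall Sentiment: Negative"

sample_lines = [
    "I love learning new things!",
    "I hate this weather.",
    "The course is great.",
]

# Write the sample file to a temporary directory, then analyze it
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "test_sentiment.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_lines) + "\n")
    result = overall_sentiment(sample_path, fake_sentiment)

print(result)
```

Passing the classifier as an argument keeps the file handling separate from the model, which also makes the voting logic easy to check without loading any weights.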

## Step 5. Save the Model Locally

Saving the model locally is important for versioning and reusability.

In [None]:
model_save_path = "sentiment_model_testing"  # directory in which all files are saved
model.save_pretrained(model_save_path)  # save the model weights and configuration
tokenizer.save_pretrained(model_save_path)  # save the tokenizer files in the same directory

## Step 6. Set Up an Account on Hugging Face

In [None]:
#!pip install huggingface_hub
from huggingface_hub import notebook_login

notebook_login()
#checkout git
# To get a token: log in to Hugging Face, open Settings > Access Tokens, and generate a new token with write access

## Step 7. Upload the Model to Hugging Face

Uploading the model from your local environment to Hugging Face makes it publicly available for others to download and use.


In [None]:
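from huggingface_hub import HfApi

api = HfApi()

# Repository ID in the form "<username>/<repo-name>"; replace it with your own
repo_id = "Vagabond98/dvae26-lab5-test"

# Create the repository if it does not exist yet (public, since private=False), then upload
api.create_repo(repo_id=repo_id, exist_ok=True, private=False)
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)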