# **Final Lab ANLP**

**Wajd Alrabiah**

**443007641**

**72U**

In [None]:
# Install necessary libraries
!pip install transformers -q # For using pre-trained models
!pip install tabulate -q # For formatting the output

In [None]:
# Importing the necessary classes and functions
import transformers  # General library import
from transformers import pipeline  # Used to create NLP pipelines
from sklearn.metrics.pairwise import cosine_similarity  # Used for similarity calculation
from transformers import AutoTokenizer, AutoModel
import torch  # PyTorch library
import numpy as np
from tabulate import tabulate # For formatting the output

## **Question 1: Sentiment Analysis**

In [None]:
# Load a pre-trained sentiment analysis model using the Transformers pipeline
# This automatically downloads and sets up a model optimized for sentiment analysis
sentiment_analyzer = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
# Define a dataset of text reviews to analyze
reviews = [
    "I absolutely love my stay at this hotel! The staff was incredibly friendly and went above and beyond ensure I was comfortable",  # Example of a Positive sentiment
    "The hotel was okay for the price. The room was clean, but the decor felt outdate.", # Example of a Neutral sentiment
    "This was the worst hotel experience I've ever had. The room was dirty, with stains on the sheets and a terrible odor." # Example of a Negative sentiment
]

In [None]:
# Perform sentiment analysis on the dataset (reviews)
# The analyzer returns a list of dictionaries, each containing:
# - 'label': Predicted sentiment ("POSITIVE" or "NEGATIVE"; the default SST-2 model has no neutral class)
# - 'score': Confidence score for the predicted label
results = sentiment_analyzer(reviews)

In [None]:
# Build one table row per review: text, predicted label, confidence score
table_data = [
    [review, result['label'], result['score']]
    for review, result in zip(reviews, results)
]

# Print the table using tabulate
print(tabulate(table_data, headers=["Review", "Sentiment", "Score"], tablefmt="grid"))

+-------------------------------------------------------------------------------------------------------------------------------+-------------+----------+
| Review                                                                                                                        | Sentiment   |    Score |
| I absolutely love my stay at this hotel! The staff was incredibly friendly and went above and beyond ensure I was comfortable | POSITIVE    | 0.999868 |
+-------------------------------------------------------------------------------------------------------------------------------+-------------+----------+
| The hotel was okay for the price. The room was clean, but the decor felt outdate.                                             | NEGATIVE    | 0.956286 |
+-------------------------------------------------------------------------------------------------------------------------------+-------------+----------+
| This was the worst hotel experience I've ever had. The room was dirt
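The default checkpoint was fine-tuned on SST-2, which has only two labels, so the review intended as neutral can only come back as POSITIVE or NEGATIVE (here NEGATIVE with a score of 0.956286). One rough workaround, sketched below, is to treat lower-confidence predictions as neutral. The function name `to_three_way` and the 0.98 cutoff are hypothetical and chosen only for illustration; a model trained with an explicit neutral class would likely be a more principled choice.

```python
def to_three_way(label, score, threshold=0.98):
    """Map a binary prediction to POSITIVE / NEUTRAL / NEGATIVE.

    `threshold` is an illustrative cutoff, not a tuned value.
    """
    if score < threshold:
        return "NEUTRAL"
    return label

# Predictions copied from the first two rows of the table above (label, score)
predictions = [("POSITIVE", 0.999868), ("NEGATIVE", 0.956286)]
for label, score in predictions:
    print(label, score, "->", to_three_way(label, score))
```

A cutoff like this is sensitive to the model's calibration, so it should be checked against labeled examples before being relied on.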

## **Question 2: Retrieval-Augmented Generation (RAG)**


In [None]:
# Sample documents for similarity search
documents = [
    "Princess Nourah bint Abdulrahman University (PNU) aspires to be a beacon of knowledge and values for women.",
    "In 2006, a royal decree was issued that established the first university for girls in Riyadh.",
    "PNU is the largest women's university in the world.",
    "PNU is one fo the successes of the care and attention that has been directed toward women's higher education.",
]

In [None]:
# Step 1: Load a pre-trained model and tokenizer for embedding generation
model_name = "sentence-transformers/all-MiniLM-L6-v2"  # Lightweight and efficient embedding model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [None]:
# Step 2: Function to generate embeddings from a list of texts
def generate_embeddings(texts):
    # Tokenize and encode the input texts
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Generate embeddings using the model
    with torch.no_grad():
        embeddings = model(**inputs).last_hidden_state[:, 0, :]  # Use CLS token embeddings
    return embeddings.numpy()
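The function above takes the first ([CLS]) token vector as the sentence embedding. The model card for `all-MiniLM-L6-v2` instead describes mean pooling over all tokens, weighted by the attention mask, followed by L2 normalization. The sketch below shows that pooling step in NumPy on a toy input; the helper name `mean_pool` is hypothetical, and swapping it in would change the similarity scores reported later in this notebook.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding, then L2-normalize.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input: one sentence with two real tokens and one padding token
tokens = np.array([[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))
```

Inside `generate_embeddings`, the two arguments would correspond to `model(**inputs).last_hidden_state.numpy()` and `inputs["attention_mask"].numpy()`.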

In [None]:
# Generate embeddings for the documents
document_embeddings = generate_embeddings(documents)

In [None]:
# Step 3: Create a simple in-memory vector database
vector_db = {
    "documents": documents,
    "vectors": document_embeddings
}

In [None]:
# Step 4: Function to query the vector database
def query_vector_db(query_text):
    # Generate embedding for the query text
    query_embedding = generate_embeddings([query_text])
    # Calculate cosine similarities with the vectors stored in the database
    similarities = cosine_similarity(query_embedding, vector_db["vectors"]).flatten()
    # Sort document indices by similarity in descending order
    top_indices = np.argsort(similarities)[::-1]
    # Return every document paired with its similarity score, best match first
    return [(vector_db["documents"][i], similarities[i]) for i in top_indices]
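To make the ranking step concrete, here is a minimal NumPy sketch of what cosine similarity and the descending sort compute, using toy 2-D vectors in place of real embeddings. The helper names `cosine_scores` and `top_k` and the `k` parameter are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_units @ query_unit

def top_k(query_vec, doc_matrix, k=2):
    """Return (index, score) pairs for the k most similar rows, best first."""
    scores = cosine_scores(query_vec, doc_matrix)
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors standing in for real embeddings
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
print(top_k(query, docs, k=2))
```

Because cosine similarity divides out vector length, the query `[2.0, 0.0]` matches the first document exactly even though their magnitudes differ.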

In [None]:
# Example query
query = "Which city is PNU in?"
results = query_vector_db(query)

# Prepare one table row per retrieved document
table_data = [[query, doc, score] for doc, score in results]

# Display query results in a table
print("Query:", query)
print(tabulate(table_data, headers=["Query", "Document", "Similarity Score"], tablefmt="grid"))

Query: Which city is PNU in?
+-----------------------+---------------------------------------------------------------------------------------------------------------+--------------------+
| Query                 | Document                                                                                                      |   Similarity Score |
| Which city is PNU in? | PNU is one fo the successes of the care and attention that has been directed toward women's higher education. |           0.836401 |
+-----------------------+---------------------------------------------------------------------------------------------------------------+--------------------+
| Which city is PNU in? | PNU is the largest women's university in the world.                                                           |           0.823708 |
+-----------------------+---------------------------------------------------------------------------------------------------------------+--------------------+
| Which city is P
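The cells in this question stop at retrieval. The "generation" half of RAG usually passes the best-ranked documents to a language model together with the question. The sketch below shows only that prompt-assembly step; the function name `build_prompt`, the prompt wording, and the choice of `k=2` are assumptions made for illustration.

```python
def build_prompt(query, ranked_docs, k=2):
    """Combine the k best-ranked documents and the question into one prompt.

    ranked_docs: list of (document, score) pairs sorted best first,
                 the same shape that query_vector_db returns.
    """
    context = "\n".join(f"- {doc}" for doc, _ in ranked_docs[:k])
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Two rows copied verbatim from the results table above
ranked = [
    ("PNU is one fo the successes of the care and attention that has been "
     "directed toward women's higher education.", 0.836401),
    ("PNU is the largest women's university in the world.", 0.823708),
]
print(build_prompt("Which city is PNU in?", ranked, k=2))
```

Neither of the two top-ranked documents mentions a city, so a generator given this prompt would have no grounded answer available; retrieval quality limits what the generation step can do.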

## **Question 3: Text Generation**

In [None]:
# Step 1: Text Generation using GPT-2
# Define the task and specify the model
task = "text-generation"
model_name = "gpt2"  # Pre-trained GPT-2 model for text generation

In [None]:
# Setting parameters for text generation
max_output_length = 50  # Maximum total length in tokens (prompt plus generated text)
num_of_return_sequences = 2  # Number of outputs to return
input_text = "Princess Noura University is"  # Input prompt for text generation
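Requesting more than one return sequence is only useful when the model samples its next token rather than always taking the most likely one, which is why the two generated sentences later in this section differ. The toy sketch below illustrates one common sampling scheme, top-k sampling with temperature, on made-up scores. The function `sample_top_k`, the vocabulary, and the logits are all hypothetical, and this is not a description of GPT-2's exact decoding settings.

```python
import math
import random

def sample_top_k(logits, k=2, temperature=1.0, rng=random):
    """Sample one token id from the k highest-scoring logits."""
    # Keep the indices of the k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the kept logits, with temperature scaling
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy vocabulary and made-up scores for the next token
vocab = ["a", "the", "located", "banana"]
logits = [2.0, 1.5, 1.0, -3.0]
rng = random.Random(0)
print([vocab[sample_top_k(logits, k=2, rng=rng)] for _ in range(5)])
```

With `k=2`, only "a" and "the" can ever be chosen; lowering `temperature` pushes the choice further toward the highest-scoring token.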

In [None]:
# Create a text generation pipeline with the specified model
text_generator = pipeline(task, model=model_name)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
# Perform text generation
generated_text = text_generator(
    input_text,
    max_length=max_output_length,
    num_return_sequences=num_of_return_sequences,
    truncation=True,  # Explicitly truncate the prompt to max_length
    pad_token_id=text_generator.tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS
)


In [None]:
# Display the generated text results
print("\nGenerated Text Results:")
for i, result in enumerate(generated_text):
    print(f"Sentence {i + 1}: {result['generated_text']}")


Generated Text Results:
Sentence 1: Princess Noura University is an official UNESCO World Heritage Site.


As UNESCO, we aim to bring about the highest standards of cultural preservation and cultural development in the world, through the right to preserve the natural heritage, to meet the needs of
Sentence 2: Princess Noura University is the only institution to have accredited a French course in linguistics.

"The students were able to apply to English-language programs of their choice. This made it more accessible, and it is currently the only Latin
