<a href="https://colab.research.google.com/github/abishaanadar/TextAnalytics_AN/blob/main/TA_AN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Step 1: Setting Up the Environment**

First, we set up the environment by importing the necessary libraries: Hugging Face's transformers for the LLM, sentence-transformers for the embeddings, and FAISS, a vector similarity search library, for storing and retrieving the embedded chunks.

In [15]:
# Import necessary libraries
import pandas as pd
#!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from transformers import pipeline, AutoModelForSeq2SeqLM, AutoTokenizer
#!pip install faiss-gpu
import faiss
import numpy as np

# **Step 2: Loading and Preprocessing the Dataset**

Next, we'll load the Disneyland Reviews dataset and preprocess it. This involves reading the CSV file (trying several encodings until one succeeds) and splitting each review into chunks of at most 512 words so the text is ready for embedding.

In [25]:
# Load the dataset, trying each encoding in order until one decodes the file.
# latin1 (an alias of iso-8859-1) maps every byte to a character, so it never
# raises UnicodeDecodeError and acts as the final fallback.
encodings = ['utf-8', 'latin1']
for enc in encodings:
    try:
        df = pd.read_csv('DisneylandReviews.csv', encoding=enc)
        print(f"Successfully loaded with encoding: {enc}")
        break
    except UnicodeDecodeError:
        print(f"Failed to load with encoding: {enc}")
else:
    raise ValueError("All attempted encodings failed. Please check the file encoding.")

# Print the columns to inspect
print("Columns in the dataset:", df.columns)

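# Preprocess the dataset: split each review into chunks of at most max_length words
def chunk_text(text, max_length=512):
    """Split text on whitespace into consecutive chunks of up to max_length words."""
    words = text.split()
    return [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]

# Note: all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces, so only the
# beginning of a 512-word chunk is embedded; a smaller max_length avoids that loss.
df['chunks'] = df['Review_Text'].apply(chunk_text)

# Flatten the per-review lists into a single list of chunks
all_chunks = [chunk for sublist in df['chunks'] for chunk in sublist]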

Failed to load with encoding: utf-8
Successfully loaded with encoding: latin1
Columns in the dataset: Index(['Review_ID', 'Rating', 'Year_Month', 'Reviewer_Location', 'Review_Text',
       'Branch'],
      dtype='object')
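To see what the chunking step produces, here is a small self-contained illustration that applies the same word-based splitting to a made-up sentence. The sample text and the tiny max_length of 4 are assumptions chosen only to keep the result readable; they are not taken from the dataset.

```python
def chunk_text(text, max_length=512):
    """Split text into consecutive chunks of at most max_length words."""
    words = text.split()
    return [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]

# Hypothetical sample review, not taken from the dataset
sample = "The park was clean and the staff were friendly but queues were long"
chunks = chunk_text(sample, max_length=4)
for chunk in chunks:
    print(chunk)
```

The last chunk holds whatever words remain, and chunks do not overlap, so a sentence that crosses a chunk boundary is split between two chunks.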


# **Step 3: Embed the Knowledge Base**

Use a pretrained embedding model (all-MiniLM-L6-v2) to embed the text chunks, then store the resulting vectors in a FAISS index so they can be searched by similarity.

In [26]:
# Load the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the text chunks as a float32 NumPy array (the dtype FAISS expects)
embeddings = embedding_model.encode(all_chunks, convert_to_numpy=True, show_progress_bar=True)
embeddings = np.asarray(embeddings, dtype='float32')

# Create a flat (exact, brute-force) L2 index and add the embeddings
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)



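For intuition about what a flat L2 index computes, the following plain-Python sketch performs the same kind of exact search: it compares the query against every stored vector by squared Euclidean distance (the quantity IndexFlatL2 reports) and returns the closest ones. This is an illustration only, not how FAISS is implemented; the function name flat_l2_search and the toy 2-D vectors are assumptions.

```python
def flat_l2_search(vectors, query, top_k):
    """Return (distances, indices) of the top_k vectors closest to query.

    Distances are squared Euclidean distances.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first; ties are broken by index
    best = scored[:top_k]
    return [d for d, _ in best], [i for _, i in best]

# Toy 2-D "embeddings" (hypothetical values)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 0.1], top_k=2)
print(indices, distances)
```

Every query is compared against every stored vector, so the search is exact but its cost grows linearly with the number of chunks.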

# **Step 4: Develop an Appropriate Prompt Template**

Create a template that combines the retrieved text and the user's query into a single prompt for the LLM.

In [27]:
# Define a prompt template for querying the LLM
def create_prompt(query, retrieved_text):
    prompt = f"Here is some information retrieved from the knowledge base:\n{retrieved_text}\n\nBased on the above information, answer the following query:\n{query}"
    return prompt
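As a quick check of what the template renders, the sketch below calls create_prompt with a made-up query and a made-up retrieved sentence. Both strings are assumptions for illustration only.

```python
def create_prompt(query, retrieved_text):
    prompt = f"Here is some information retrieved from the knowledge base:\n{retrieved_text}\n\nBased on the above information, answer the following query:\n{query}"
    return prompt

# Hypothetical retrieved text and query, for illustration only
example_prompt = create_prompt(
    query="What do visitors say about the food?",
    retrieved_text="The food was tasty but expensive.",
)
print(example_prompt)
```

The query is placed last in the prompt, so if a long prompt is ever truncated from the right to fit a model's input limit, the query is the first part to be cut off.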

# **Step 5: Create the Query Pipeline**

Develop the pipeline to process the query, retrieve relevant information, and generate the response using the LLM.

In [28]:
# Load the LLM for text generation.
# Note: facebook/bart-large-cnn is fine-tuned for summarization, so its output
# tends to be a summary of the prompt rather than a direct answer to the query.
llm_tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
llm_model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-large-cnn')
llm_pipeline = pipeline('text2text-generation', model=llm_model, tokenizer=llm_tokenizer)

def query_pipeline(query, top_k=5):
    # Embed the query; passing a list yields a 2-D array of shape (1, dimension)
    query_embedding = embedding_model.encode([query], convert_to_numpy=True)

    # Retrieve the top_k nearest chunks from the FAISS index
    _, indices = index.search(query_embedding, top_k)
    retrieved_texts = [all_chunks[idx] for idx in indices[0]]
    retrieved_text = ' '.join(retrieved_texts)

    # Create the prompt
    prompt = create_prompt(query, retrieved_text)

    # Generate the response; truncation=True keeps over-long prompts within
    # the model's input limit instead of failing on them
    response = llm_pipeline(prompt, truncation=True)[0]['generated_text']
    return response
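Because the pipeline joins top_k chunks of up to 512 words each, the prompt can grow beyond what the model accepts (facebook/bart-large-cnn takes at most 1,024 input tokens). One possible safeguard, sketched below, is to cap the retrieved text at a word budget before building the prompt. The helper name build_context and the 600-word default are assumptions, and a word count is only a rough proxy for a token count.

```python
def build_context(chunks, max_words=600):
    """Join chunks in ranked order, stopping once max_words is reached."""
    selected = []
    used = 0
    for chunk in chunks:
        words = chunk.split()
        remaining = max_words - used
        if remaining <= 0:
            break
        # Keep the whole chunk if it fits, otherwise only its first words
        selected.append(' '.join(words[:remaining]))
        used += min(len(words), remaining)
    return ' '.join(selected)

# Hypothetical retrieved chunks, most relevant first
retrieved = ["great rides and shows", "food is pricey", "long lines at noon"]
context = build_context(retrieved, max_words=6)
print(context)
```

Inside query_pipeline, the result of build_context(retrieved_texts) could take the place of the plain ' '.join(retrieved_texts), so the most relevant chunks are kept and the tail is dropped.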

# **Step 6: Define a Set of Tests**

Determine a set of queries to test the RAG system.

In [35]:
# Define test queries
test_queries = [
    "What are the most common complaints about Disneyland?",
    "Can you summarize the positive aspects of Disneyland based on recent reviews?",
    "What do visitors say about the food at Disneyland?",
    "Which Disneyland ride is the best?",
    "How are the ride experiences described in the reviews?",
    "Are there any tips from visitors on how to avoid long lines?"
]
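To make the tests checkable without reading every response by hand, each query can be paired with a few keywords that a reasonable answer might mention. The sketch below shows one crude way to do that. The keyword lists, the function name keyword_hit_rate, and the sample response are all assumptions made for illustration and are not part of the original notebook.

```python
def keyword_hit_rate(response, keywords):
    """Return the fraction of keywords that appear in response (case-insensitive)."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)

# Hypothetical expected keywords for two of the test queries
expected_keywords = {
    "What do visitors say about the food at Disneyland?": ["food", "price", "meal"],
    "Are there any tips from visitors on how to avoid long lines?": ["line", "early", "weekday"],
}

# Made-up response used only to demonstrate the check
sample_response = "The food is good but a little pricey."
score = keyword_hit_rate(
    sample_response,
    expected_keywords["What do visitors say about the food at Disneyland?"],
)
print(f"Keyword hit rate: {score:.2f}")
```

Substring matching is deliberately loose (for example, "price" matches "pricey"), so a high score only suggests the response is on topic; it does not confirm that the answer is correct.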

# **Step 7: Test the System**

Run each test query through the pipeline and review the responses to evaluate the performance of the RAG system.

In [36]:
# Test the RAG system
for query in test_queries:
    print(f"Query: {query}")
    print(f"Response: {query_pipeline(query)}\n")

Query: What are the most common complaints about Disneyland?
Response: The Disneyland Park in Paris is now a disgrace to the Disney brand. Poor customer service and rude staff on the rides, in the restaurants and in the shops. Huge crowds, long lines, and horribly expensive food combine to make things less enjoyable. Several key attractions like Indiana Jones, flying dumbo and Aladin's flying carpet were shut.

Query: Can you summarize the positive aspects of Disneyland based on recent reviews?
Response: The only thing I can criticize is the fact that the rates keep flying upward but they keep closing the park earlier and earlier. Crowds also keep getting bigger even in the off season. If you have never been to Disneyland I recommend it no other theme park tops Disneyland!! Very clean park try visiting on a weekday to avoid less crowds.

Query: What do visitors say about the food at Disneyland?
Response: The food is good a little pricey but hey, you're in Disneyland! The kids' meals ar