<a href="https://colab.research.google.com/github/bilguunmyagmar-commits/data4600/blob/main/A3_DATA5000_Bilguun.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessment 3: AI-Driven Embeddings for Enhanced Retail Customer Insights

**Subject:** DATA5000 - Artificial Intelligence Programming in Business Analytics  
**Assessment Type:** Report Writing, Questionnaire Completion and Practical Python Programming  
**Weighting:** 40% Individual

**Student Name:** Bilguun Myagmar  
**Student Number:** 1830798

---
### Assessment Overview
This assessment evaluates your ability to implement a Retrieval-Augmented Generation (RAG) system for real-world business intelligence applications. You will work with multi-modal business data from MobTel, a telecommunications company, to demonstrate your proficiency in integrating heterogeneous data sources, building vector databases, and generating actionable business insights using modern AI tools.The assessment simulates a realistic business analytics scenario where you are hired as a data scientist consultant to help MobTel understand their current business position and identify strategic opportunities for growth. You will need to synthesize information from sales records, customer feedback, and financial reports to provide comprehensive business intelligence.

### Company Overiview
MobTel Communications is a mid-sized mobile phone retailer established in 2018, headquartered in Austin, Texas. The company initially entered the market focusing on affordable, reliable smartphones for budget-conscious consumers. Over the past six years, MobTel has expanded its product portfolio to compete across multiple market segments, from entry-level devices to flagship smartphones.
### Data Sources
- **sales_data.xlsx**: MobTel sales records
- **customer_reviews.csv**: Customer feedback on phone models
- **financial_statement.pdf**: Company financial report
## Your Tasks
1. Go through each of the dataset, play with it to understand the business context of this company
2. Get your own problem statement understanding
3. Implement document storage and retrieval using ChromaDB
4. Perform cosine similarity analysis to compare customer reviews and product descriptions
5. Utilize Gemini LLM to enhance search results using RAG and twist the prompt to get the good report for business intelligence and **propose the data-driven visualizations**
6. In Part 6, you are required to finish the prompt and the business questions to retrieve insightful recommendations for the business
7. Part 4 is the basic search in Vector database while Part 6 is LLM-driven search with Gemini. Your job is to see how the results from the two are different and which one is more helpful
8. **Please write the report following the instructions in Part 7**

## Setup and Imports

Install and import the required libraries for AI-powered document analysis.

In [None]:
# Install required packages
!pip install chromadb sentence-transformers openpyxl PyPDF2 google-genai

Collecting chromadb
  Downloading chromadb-1.1.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.2-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (8.7 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.23.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.37.0-py3-none-any.whl.metadata (2.4 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.48.9.tar.gz (67 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import chromadb
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import PyPDF2
import json
import pathlib
from google import genai
from google.genai import types
import warnings
import os
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## Gemini API Configuration

Set up the Gemini API for LLM-powered business insights generation. You must get your API in here [link](https://aistudio.google.com/apikey). Copy the API and paste it in the variable GEMINI_API_KEY below

In [None]:
# Gemini API Setup
# Enter your Gemini API key here
GEMINI_API_KEY = "AIzaSyDojNhOSwV-ctWyDvrcBJd85gPiWs6sszo"

# Initialize Gemini client
os.environ['GOOGLE_API_KEY'] = GEMINI_API_KEY #Fill in your API here

client = genai.Client()
print("Gemini API client initialized successfully!")

# Test the API connection
try:
    test_response = client.models.generate_content(
        model="gemini-2.0-flash-lite",
        contents=["Test connection. Respond with 'API Working'"]
    )
    print(f"API Test: {test_response.text}")
except Exception as e:
    print(f"API connection error: {e}")
    print("Please check your API key and try again.")

Gemini API client initialized successfully!
API Test: API Working



## Part 1: Data Loading and Preprocessing

Load the provided datasets: customer_reviews.csv, sales_data.xlsx, and finance.pdf

In [None]:
# Load customer reviews CSV
reviews_df = pd.read_csv('customer_reviews.csv') #FILL IN THE DATASET
print(f"Customer Reviews Dataset Shape: {reviews_df.shape}")
print("\nDataset Columns:")
print(reviews_df.columns.tolist())
print("\nFirst 3 rows:")
print(reviews_df.head(3))

Customer Reviews Dataset Shape: (620, 12)

Dataset Columns:
['review_id', 'customer_id', 'product_model', 'rating', 'review_date', 'review_title', 'review_text', 'verified_purchase', 'helpful_votes', 'total_votes', 'purchase_date', 'aspects_mentioned']

First 3 rows:
       review_id customer_id             product_model  rating review_date  \
0  REV2024000217  CUST504271  Xiaomi Redmi Note 12 Pro     3.0  2024-12-03   
1  REV2024000334  CUST461273        Huawei Mate 50 Pro     4.1  2024-11-12   
2  REV2024000110  CUST958754           BlackBerry Key2     4.1  2024-04-23   

             review_title                                        review_text  \
0      Average experience  my experience has been mixed. While it offers ...   
1  Impressive performance  My upgrade to Huawei Mate 50 Pro has been, so ...   
2    Fantastic all around  so far it has impressed me with smooth gaming ...   

   verified_purchase  helpful_votes  total_votes purchase_date  \
0               True            

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load sales data Excel file
sales_df = pd.read_excel('sales_data.xlsx') #FILL IN THE DATASET
descriptions = {
    "Samsung Galaxy A14": "The Samsung Galaxy A14 delivers everyday reliability with a sleek modern design. Its long-lasting battery ensures uninterrupted performance throughout the day. Perfect for those who value affordability without compromising quality.",
    "iPhone 14 Pro": "Apple's iPhone 14 Pro redefines flagship excellence with its cutting-edge A16 Bionic chip. The Dynamic Island interface introduces a seamless new way to interact with your device. Designed for professionals, it balances performance, elegance, and innovation.",
    "Xiaomi Redmi 12": "The Xiaomi Redmi 12 is built for users who demand value and performance in one package. It features a crisp display and a versatile camera system that captures moments beautifully. This phone makes premium features accessible to everyone.",
    "Xiaomi Redmi Note 12 Pro": "Packed with advanced imaging technology, the Redmi Note 12 Pro delivers vibrant photos and videos. Its 5G support ensures blazing-fast connectivity for modern lifestyles. A balanced choice for users seeking performance and style in the mid-range segment.",
    "Samsung Galaxy A54": "The Galaxy A54 brings Samsung’s signature design language into the mid-tier market. It offers a smooth display experience and dependable performance for gaming and streaming. A versatile companion for both work and entertainment needs.",
    "BlackBerry Key2": "The BlackBerry Key2 revives the iconic keyboard experience for productivity enthusiasts. With its enhanced security features, it appeals to professionals who prioritize privacy. A rare blend of nostalgia and modern utility in the smartphone world.",
    "Oppo Reno 10": "Oppo Reno 10 is crafted with a stylish design that captures attention. Its fast charging technology keeps you moving without long waits. Tailored for photography lovers, it produces sharp and vivid shots in any environment.",
    "Huawei P50 Pro": "The Huawei P50 Pro emphasizes photography excellence with its Leica-engineered camera setup. Its sleek curves and vibrant display make it a pleasure to use daily. Built with innovation in mind, it pushes the boundaries of mobile creativity.",
    "Huawei Mate 50 Pro": "Huawei Mate 50 Pro combines power and elegance, boasting a futuristic design. Its high-end performance makes multitasking effortless and smooth. The camera system introduces precision and detail that rival professional gear.",
    "Oppo A57": "The Oppo A57 is an entry-level smartphone with a polished look and practical performance. Its lightweight build makes it comfortable to carry and use all day. Ideal for budget-conscious users, it covers all essentials without compromise.",
    "Samsung Galaxy S23 Ultra": "The Galaxy S23 Ultra is Samsung’s ultimate powerhouse, featuring a stunning AMOLED display. Its S Pen integration adds creativity and productivity in one device. Designed for enthusiasts, it brings desktop-level performance into your pocket.",
    "Samsung Galaxy Note 20": "Samsung Galaxy Note 20 blends productivity with entertainment through its iconic S Pen. Its large screen and powerful internals make multitasking effortless. A device that bridges the gap between a phone and a productivity tool."
}

# Map the product_model column to the new descriptions
sales_df['product_description'] = sales_df['product_model'].map(descriptions)
print(f"Sales Data Shape: {sales_df.shape}")
print("\nSales Data Columns:")
print(sales_df.columns.tolist())
print("\nFirst 3 rows:")
print(sales_df.head(3))

Sales Data Shape: (1300, 13)

Sales Data Columns:
['transaction_id', 'sale_date', 'product_model', 'product_category', 'quantity', 'unit_price', 'revenue', 'customer_id', 'store_location', 'payment_method', 'discount_applied', 'sales_rep_id', 'product_description']

First 3 rows:
  transaction_id  sale_date       product_model product_category  quantity  \
0  TXN2023000001 2023-11-01  Samsung Galaxy A14           Budget         2   
1  TXN2024000002 2024-12-07       iPhone 14 Pro         Flagship         1   
2  TXN2024000003 2024-07-31     Xiaomi Redmi 12           Budget         4   

   unit_price  revenue customer_id store_location payment_method  \
0      306.76   490.14  CUST989340    East Region    Credit Card   
1      916.30   838.05  CUST535436         Online    Credit Card   
2      336.41  1189.28  CUST249014    East Region    Installment   

   discount_applied sales_rep_id  \
0             20.11       REP018   
1              8.54       REP013   
2             11.62      

In [None]:
# Load and extract text from PDF

def extract_pdf_text(pdf_path):
    """Extract text from PDF file"""
    text = "" # Initialize text variable
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                text += page.extract_text() + "\n"
    except Exception as e:
        print(f"Error reading PDF: {e}")
        return ""
    return text

# Extract text from the finance PDF
finance_text = extract_pdf_text('finance.pdf') #FILL IN THE DATASET
print(f"Finance document length: {len(finance_text)} characters")
print(f"First 300 characters: {finance_text[:300]}...")

Finance document length: 14716 characters
First 300 characters: MobTel Corporation
Annual Financial Report
Fiscal Year 2024
mobtel_logo.png
Connecting Innovation to Excellence
Prepared by: Finance Department
Date: March 2025
Audited by: Gary Link
MobTel Corporation 2024 Annual Financial Report
Contents
1 Executive Summary 2
1.1 Company Overview and Mission . . ....


## Part 2: Document Processing and Embeddings

Process the documents and create embeddings using sentence transformers.

In [None]:
# Initialize the sentence transformer model

model = SentenceTransformer('all-MiniLM-L6-v2')
print("Sentence Transformer model loaded successfully!")
print(f"Model embedding dimension: {model.get_sentence_embedding_dimension()}")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Sentence Transformer model loaded successfully!
Model embedding dimension: 384


In [None]:
# Prepare documents for vector storage
documents = []
metadatas = []
ids = []

# Add customer reviews to documents
for idx, row in reviews_df.iterrows():
    if pd.notna(row['review_text']) and len(str(row['review_text']).strip()) > 10:
        documents.append(str(row['review_text']))
        metadatas.append({
            'type': 'review',
            'product_model': str(row['product_model']),
            'rating': float(row['rating']),
            'customer_id': str(row['customer_id']),
            'verified_purchase': bool(row['verified_purchase'])
        })
        ids.append(f"review_{idx}")

# Create product descriptions from sales data (if available)
if 'product_description' in sales_df.columns:
    for idx, row in sales_df.iterrows():
        if pd.notna(row['product_description']):
            documents.append(str(row['product_description']))
            metadatas.append({
                'type': 'product_description',
                'product_name': str(row.get('product_name', 'Unknown')),
                'category': str(row.get('category', 'Electronics'))
            })
            ids.append(f"product_desc_{idx}")

# Split finance document into chunks and add to documents
chunk_size = 500
overlap = 50
finance_chunks = []

for i in range(0, len(finance_text), chunk_size - overlap):
    chunk = finance_text[i:i + chunk_size]
    finance_chunks.append(chunk)

for i, chunk in enumerate(finance_chunks):
    if len(chunk.strip()) > 50:  # Only add substantial chunks
        documents.append(chunk)
        metadatas.append({
            'type': 'finance_report',
            'chunk_id': i,
            'source': 'MobTel Annual Report 2024',
            'chunk_length': len(chunk)
        })
        ids.append(f"finance_chunk_{i}")

print(f"Total documents prepared: {len(documents)}")
print(f"Reviews: {len([d for d in metadatas if d['type'] == 'review'])}")
print(f"Product descriptions: {len([d for d in metadatas if d['type'] == 'product_description'])}")
print(f"Finance chunks: {len([d for d in metadatas if d['type'] == 'finance_report'])}")

Total documents prepared: 1953
Reviews: 620
Product descriptions: 1300
Finance chunks: 33


## Part 3: ChromaDB Setup and Document Storage

Create a ChromaDB collection and store the documents with their embeddings.

In [None]:
# Initialize ChromaDB client and create collection
client_chroma = chromadb.Client()

# Delete collection if it exists (for fresh start)
try:
    client_chroma.delete_collection(name="mobtel_documents")
except:
    pass

# Create new collection
collection = client_chroma.create_collection(
    name="mobtel_documents",
    metadata={"description": "MobTel customer reviews, product descriptions, and financial documents"}
)

print("ChromaDB collection created successfully!")
print(f"Collection name: {collection.name}")

ChromaDB collection created successfully!
Collection name: mobtel_documents


In [None]:
# Generate embeddings and add documents to ChromaDB
print("Generating embeddings and storing documents...")

# Generate embeddings for all documents
embeddings = model.encode(documents, show_progress_bar=True)
print(f"Generated embeddings shape: {embeddings.shape}")

# Add documents to the collection in batches (ChromaDB has limits)
batch_size = 100
for i in range(0, len(documents), batch_size):
    end_idx = min(i + batch_size, len(documents))

    collection.add(
        embeddings=embeddings[i:end_idx].tolist(),
        documents=documents[i:end_idx],
        metadatas=metadatas[i:end_idx],
        ids=ids[i:end_idx]
    )
    print(f"Added batch {i//batch_size + 1}: documents {i}-{end_idx-1}")

print(f"Successfully stored {len(documents)} documents in ChromaDB!")
print(f"Collection count: {collection.count()}")

Generating embeddings and storing documents...


Batches:   0%|          | 0/62 [00:00<?, ?it/s]

Generated embeddings shape: (1953, 384)
Added batch 1: documents 0-99
Added batch 2: documents 100-199
Added batch 3: documents 200-299
Added batch 4: documents 300-399
Added batch 5: documents 400-499
Added batch 6: documents 500-599
Added batch 7: documents 600-699
Added batch 8: documents 700-799
Added batch 9: documents 800-899
Added batch 10: documents 900-999
Added batch 11: documents 1000-1099
Added batch 12: documents 1100-1199
Added batch 13: documents 1200-1299
Added batch 14: documents 1300-1399
Added batch 15: documents 1400-1499
Added batch 16: documents 1500-1599
Added batch 17: documents 1600-1699
Added batch 18: documents 1700-1799
Added batch 19: documents 1800-1899
Added batch 20: documents 1900-1952
Successfully stored 1953 documents in ChromaDB!
Collection count: 1953


## Part 4: Document Retrieval and Vector Search - BASIC VECTOR SEARCH

Implement search functionality to retrieve relevant documents.

In [None]:
def search_documents(query, n_results=5, document_type=None):
    """Search for relevant documents using vector similarity"""

    # Create where clause for filtering by document type if specified
    where_clause = None
    if document_type:
        where_clause = {"type": document_type}

    # Query the ChromaDB collection
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        where=where_clause
    )

    return results

# Test the search function with different queries, YOU MIGHT NEED TO CHANGE THE QUERIES SO YOU CAN COMPARE WITH GEMINI OUTPUT LATER
test_queries = [
    "battery life and performance",
    "financial performance revenue",
    "customer satisfaction product quality"
]

for query in test_queries:
    print(f"\n{'='*60}")
    print(f"Search Query: '{query}'")
    print(f"{'='*60}")

    search_results = search_documents(query, n_results=3)

    print(f"Found {len(search_results['documents'][0])} relevant documents")

    # Display top 3 results
    for i in range(len(search_results['documents'][0])):
        print(f"\n--- Result {i+1} ---")
        print(f"Document: {search_results['documents'][0][i][:150]}...")
        print(f"Type: {search_results['metadatas'][0][i]['type']}")
        print(f"Distance Score: {search_results['distances'][0][i]:.4f}")


Search Query: 'battery life and performance'


/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 35.3MiB/s]


Found 3 relevant documents

--- Result 1 ---
Document: my experience has been mixed. While it offers some lag during multitasking, it also suffers from not exciting but reliable. Regarding battery, perform...
Type: review
Distance Score: 0.9153

--- Result 2 ---
Document: so far it has impressed me with long battery life and smooth gaming performance. In terms of camera, durability, it delivers solid results. Apps open ...
Type: review
Distance Score: 0.9197

--- Result 3 ---
Document: This is my first Xiaomi Redmi Note 12 Pro, so far it has impressed me with long battery life and premium look and feel. In terms of battery, it delive...
Type: review
Distance Score: 0.9207

Search Query: 'financial performance revenue'
Found 3 relevant documents

--- Result 1 ---
Document: n represents 6.8% of total
revenue, demonstrating our commitment to innovation and product development.
2.3 Profitability Metrics
4
MobTel Corporation...
Type: finance_report
Distance Score: 0.8594

--- Result 2 ---
D

## Part 5: Cosine Similarity Analysis

Calculate cosine similarity between customer reviews and product descriptions.

In [None]:
# Create product descriptions based on available data
# Extract unique products from reviews
unique_products = reviews_df['product_model'].unique()

# Create product descriptions (based on MobTel's product line from the finance report)
product_descriptions = descriptions
# If we have product descriptions in sales data, use those instead
if 'product_description' in sales_df.columns and 'product_name' in sales_df.columns:
    for _, row in sales_df.iterrows():
        if pd.notna(row['product_description']) and pd.notna(row['product_name']):
            product_descriptions[row['product_name']] = row['product_description']

print(f"Product descriptions for similarity analysis: {len(product_descriptions)}")
for product, desc in product_descriptions.items():
    print(f"- {product}: {desc[:100]}...")

Product descriptions for similarity analysis: 12
- Samsung Galaxy A14: The Samsung Galaxy A14 delivers everyday reliability with a sleek modern design. Its long-lasting ba...
- iPhone 14 Pro: Apple's iPhone 14 Pro redefines flagship excellence with its cutting-edge A16 Bionic chip. The Dynam...
- Xiaomi Redmi 12: The Xiaomi Redmi 12 is built for users who demand value and performance in one package. It features ...
- Xiaomi Redmi Note 12 Pro: Packed with advanced imaging technology, the Redmi Note 12 Pro delivers vibrant photos and videos. I...
- Samsung Galaxy A54: The Galaxy A54 brings Samsung’s signature design language into the mid-tier market. It offers a smoo...
- BlackBerry Key2: The BlackBerry Key2 revives the iconic keyboard experience for productivity enthusiasts. With its en...
- Oppo Reno 10: Oppo Reno 10 is crafted with a stylish design that captures attention. Its fast charging technology ...
- Huawei P50 Pro: The Huawei P50 Pro emphasizes photography excellence with its 

In [None]:
# Calculate cosine similarity between reviews and product descriptions
def interpret_similarity(score):
    """Interpret cosine similarity score for business context"""
    if score >= 0.8:
        return "High alignment - Strong product-customer expectation match"
    elif score >= 0.6:
        return "Good alignment - Minor gaps in customer perception"
    elif score >= 0.4:
        return "Moderate alignment - Some disconnect in customer perception"
    else:
        return "Low alignment - Significant issues with product positioning"

similarity_results = []

for product, description in product_descriptions.items():
    # Get reviews for this product
    product_reviews = reviews_df[reviews_df['product_model'] == product]['review_text'].dropna()

    if len(product_reviews) > 0:
        # Calculate embeddings for product description and reviews
        desc_embedding = model.encode([description])
        review_embeddings = model.encode(product_reviews.tolist())

        # Calculate cosine similarity between description and each review
        similarities = cosine_similarity(desc_embedding, review_embeddings)[0]

        # Calculate statistics
        avg_similarity = np.mean(similarities)
        max_similarity = np.max(similarities)
        min_similarity = np.min(similarities)
        std_similarity = np.std(similarities)

        # Get average rating for this product
        avg_rating = reviews_df[reviews_df['product_model'] == product]['rating'].mean()

        similarity_results.append({
            'Product': product,
            'Avg_Cosine_Similarity': round(avg_similarity, 3),
            'Max_Similarity': round(max_similarity, 3),
            'Min_Similarity': round(min_similarity, 3),
            'Std_Similarity': round(std_similarity, 3),
            'Review_Count': len(product_reviews),
            'Avg_Rating': round(avg_rating, 2),
            'Business_Interpretation': interpret_similarity(avg_similarity)
        })
    else:
        similarity_results.append({
            'Product': product,
            'Avg_Cosine_Similarity': 0,
            'Max_Similarity': 0,
            'Min_Similarity': 0,
            'Std_Similarity': 0,
            'Review_Count': 0,
            'Avg_Rating': 0,
            'Business_Interpretation': 'No reviews available'
        })

# Create results DataFrame
similarity_df = pd.DataFrame(similarity_results)
similarity_df = similarity_df.sort_values('Avg_Cosine_Similarity', ascending=False)

print("\nCosine Similarity Analysis Results:")
print("="*80)
print(similarity_df.to_string(index=False))

# Summary statistics
print("\n\nSummary Statistics:")
print(f"Average similarity across all products: {similarity_df['Avg_Cosine_Similarity'].mean():.3f}")
print(f"Products with high alignment (>0.6): {len(similarity_df[similarity_df['Avg_Cosine_Similarity'] > 0.6])}")
print(f"Products needing attention (<0.4): {len(similarity_df[similarity_df['Avg_Cosine_Similarity'] < 0.4])}")


Cosine Similarity Analysis Results:
                 Product  Avg_Cosine_Similarity  Max_Similarity  Min_Similarity  Std_Similarity  Review_Count  Avg_Rating                                     Business_Interpretation
                Oppo A57                  0.570           0.836           0.299           0.180            71        3.54 Moderate alignment - Some disconnect in customer perception
      Samsung Galaxy A14                  0.527           0.784           0.347           0.138            95        3.73 Moderate alignment - Some disconnect in customer perception
      Samsung Galaxy A54                  0.507           0.731           0.275           0.121            77        3.74 Moderate alignment - Some disconnect in customer perception
Xiaomi Redmi Note 12 Pro                  0.478           0.761           0.283           0.157            70        3.77 Moderate alignment - Some disconnect in customer perception
Samsung Galaxy S23 Ultra                  0.473      

***Hint: Which product category has the highest rate of low alignment? How does it impact on sales performance for low alignment?***

In [15]:
# This block is optional for you to investigate further the analysis of low alignment with sales/revenue using correlation charts

##Code here (Optional)

## Part 6: RAG Implementation with Gemini LLM - Complete the prompt below

Implement Retrieval-Augmented Generation using ChromaDB and Gemini AI. **You are required to finish the below prompt in the variable "prompt" **that would enable insightful information from the contextual knowledge

In [16]:
def rag_query(question, n_retrieve=5, document_type=None):
    """Implement RAG: Retrieve relevant documents and generate insights using Gemini"""

    # Step 1: Retrieve relevant documents
    retrieved_docs = search_documents(question, n_retrieve, document_type)

    # Step 2: Prepare context from retrieved documents
    context_docs = retrieved_docs['documents'][0]
    context_metadata = retrieved_docs['metadatas'][0]

    # Create rich context with metadata
    context_with_metadata = []
    for i, doc in enumerate(context_docs):
        meta = context_metadata[i]
        context_with_metadata.append(f"Document Type: {meta['type']}\nContent: {doc}")

    context = "\n\n---\n\n".join(context_with_metadata)

    # Step 3: Generate business insights using Gemini AI
    insights = generate_business_insights_with_gemini(question, context, retrieved_docs)


    df = pd.DataFrame([{
        'question': question,
        **insights  # expands full_analysis, summary_points (list), generated_by, response_length
    }])
    return df

def generate_business_insights_with_gemini(question, context, retrieved_docs):
    """Generate business insights using Gemini AI based on retrieved documents"""

    # Create a comprehensive business analysis prompt by FINISHING THIS PROMPT (Add step 5 to make it more comprehensive in term of evaluating the recommendation/metrics). FEEL FREE TO CHANGE THE PROMPT TO SUPPORT YOUR ANALYSIS
    prompt = f"""
You are a senior business analyst at MobTel Corporation, an innovative mobile device retailer competing with major brands like Walmart, K-mart amd JBHiFi.
Your task is to analyze retrieved company data and provide strategic business insights.

BUSINESS QUESTION: {question}

RETRIEVED COMPANY DATA:
{context}

Analysis framework:
PLease provide a comprehensive business analytics answer for business stakeholders that covers the following key points:
1. Understand the question and look into the connection with the sales, revenue and customer satisfaction
2. Understand the patterns in the context knowledge and what should be done to resolve such issues
3. Think of strategy recommendations
4. Think of if we have data-driven evidence to backup these recommendations?

    """


    # Generate content using Gemini AI
    response = client.models.generate_content(
        model="gemini-2.0-flash-lite",
        contents=[prompt]
    )

    # Extract and process the insights
    insights_text = response.text

    # Return structured insights
    return {
        'full_analysis': insights_text,
        'summary_points': [point.strip() for point in insights_text.split('\n') if point.strip() and len(point.strip()) > 20][:10],
        'generated_by': 'gemini-2.0-flash-lite',
        'response_length': len(insights_text)
    }



print("RAG system with Gemini AI implemented successfully!")

RAG system with Gemini AI implemented successfully!


In [17]:
# Test RAG system with comprehensive business questions
business_questions = [
    "What are the main customer concerns about our mobile device products and how do they impact our business performance?",

    "What product development priorities should MobTel focus on to improve market competitiveness and customer retention?",

    # ADD MORE BUSINESS QUESTIONS FOR YOUR OWN ANALYSIS
    # MY Suggestion is to delete these questions and replace with new ones every time you run this code block to avoid quotas exhaustion 429 Error
]

rag_results = []

for i, question in enumerate(business_questions):
    print(f"\n{'='*80}")
    print(f"BUSINESS ANALYSIS {i+1}: {question}")
    print(f"{'='*80}")

    # Perform RAG analysis
    result = rag_query(question, n_retrieve=5)
    rag_results.append(result)

    print(f"\nGEMINI AI BUSINESS ANALYSIS:")

    if 'full_analysis' in result:
        print(f"\nFULL ANALYSIS:\n{result['full_analysis'][0]}")

    print(f"\nKEY SUMMARY POINTS:")
    for point in result['summary_points'][0]:
        print(f"• {point}")

    print("\n" + "-"*80)


BUSINESS ANALYSIS 1: What are the main customer concerns about our mobile device products and how do they impact our business performance?

GEMINI AI BUSINESS ANALYSIS:

FULL ANALYSIS:
Okay, here's a comprehensive business analysis based on the provided data, addressing the key points for MobTel Corporation:

**1. Understanding the Question & Connection to Sales, Revenue, and Customer Satisfaction**

The core question is: **What are the main customer concerns about our mobile device products and how do they impact our business performance?**

This question directly relates to:

*   **Sales and Revenue:** Customer concerns, if unresolved, can lead to decreased sales (fewer purchases, lower average selling price – e.g., customers choosing cheaper models) and ultimately impact revenue. Negative experiences can lead to returns, warranty claims, and a reduction in repeat purchases.
*   **Customer Satisfaction:** This is the heart of the matter. Dissatisfied customers are less likely to rec

## Part 7: Report Writing - below are some guidance questions to assist you in writing your report, you do not need to answer every question

### Report Structure: 1200 Words | Assessment Weight: 25 Marks

---

## 1. Executive Summary (100 words, 2 marks)

**Key Questions to Answer:**
- What AI-driven analysis did you conduct for MobTel's retail operations?
- What are the 2-3 most critical findings from your document retrieval and similarity analysis?
- How will implementing these AI insights impact MobTel's retail business performance?
- What is the primary business value proposition of your analysis?

---

## 2. Business Context and Problem Statement (100 words, 2 marks)

**Key Questions to Answer:**
- What specific retail challenges does MobTel face as an electronics retailer?
- Why is understanding customer sentiment and product alignment critical for retail success?
- How do current product search and recommendation systems limit MobTel's performance?
- What business problems will AI-powered customer insights solve?

---

## 3. Data Analytics Process (300 words total, 6 marks)

### 3.1 Document Retrieval and Vector Search Analysis (150 words, 3 marks)

**Key Questions to Answer:**
- How did you store and organize MobTel's customer reviews, product data, and financial documents?
- What vector search methodology did you implement using ChromaDB?
- Which documents were most relevant when querying for specific retail insights?
- How effective was your retrieval system at finding relevant customer feedback?

### 3.2 Cosine Similarity Analysis for Customer Sentiment (150 words, 3 marks)

**Key Questions to Answer:**
- How well do product descriptions align with actual customer experiences?
- Which products show strong alignment vs. significant gaps between marketing and reality?
- What do similarity scores reveal about customer expectation management?
- How can similarity analysis improve product positioning and marketing?

**Required Elements:**
- Present your cosine similarity results table or heatmap visualization
- Interpret similarity scores in business context:
  - High (0.8-1.0): Well-aligned product marketing
  - Medium (0.5-0.8): Some customer perception gaps
  - Low (0.0-0.5): Marketing-reality misalignment issues
- Identify specific products requiring immediate attention

---

## 4. AI-Generated Business Insights (RAG) (150 words, 3 marks)

**Key Questions to Answer:**
- How did the Gemini AI enhance your analysis beyond basic vector search?
- What unique insights did AI generate that wouldn't be apparent from manual analysis?
- How do AI-generated recommendations compare to traditional retail analytics?
- What specific business strategies did the AI suggest for MobTel?


---

## 5. Business Recommendations for Stakeholders (350 words, 8 marks)

### 5.1 Product Development Team (100 words)
**Key Questions to Answer:**
- Which product categories need improved descriptions based on similarity analysis?
- How should product information be restructured to match customer expectations?
- What AI tools should be integrated into product catalog management?


### 5.2 Marketing Strategy (125 words)
**Key Questions to Answer:**
- How should marketing messaging change based on customer sentiment analysis?
- What targeted campaigns would address customer perception gaps?
- How can AI insights improve advertising ROI and customer acquisition?


### 5.3 Customer Experience & Support (125 words)
**Key Questions to Answer:**
- How can AI-powered search improve customer product discovery?
- What customer support enhancements would AI-driven insights enable?
- How should customer service teams use sentiment analysis proactively?

---

## 6. Business Impact Analysis (100 words, 2 marks)

**Key Questions to Answer:**
- How will improved product-customer alignment affect conversion rates and sales?
- What operational efficiencies will AI-driven insights create?
- How does this give MobTel competitive advantages in the electronics retail market?
- What are the expected ROI and timeline for implementation?

---

## 7. Conclusion and Next Steps (100 words, 2 marks)

**Key Questions to Answer:**
- How do AI-powered analytics transform MobTel's retail decision-making capabilities?
- What are the immediate next steps for implementing these insights?
- What future AI integrations should MobTel consider for competitive advantage?
- How will success be measured and monitored?


---



**Remember: Your report should showcase how AI-driven customer insights can transform MobTel's retail operations, not just demonstrate technical capabilities.**