Task 1: Data Exploration and GDPR Awareness - Level: Easy

Objective: Gain familiarity with the dataset and understand GDPR principles related to personal data.
Instructions:

1. Download and Load Data: Download the dataset from Kaggle and load it into Google Cloud.
2. Data Processing: Perform basic data processing, such as filtering reviews with a star_rating of 5 or from a specific product_category (e.g., Books).
3. Identify Personal Data: Examine the dataset columns and identify which ones contain personal data under GDPR (e.g., customer_id, potential identifiers in review_body). Consider how review_body might inadvertently include personal information like names or contact details.
4. GDPR Report: Write a 200–300-word report discussing which GDPR principles (e.g., data minimization, purpose limitation, transparency) apply to this data and how they should be addressed in a data processing pipeline. Reference the GDPR principles from GDPR.eu.
Deliverables:

● PDF report on GDPR principles and images of the working workflow
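
The filtering described in step 2 can be sketched with pandas. This is only a minimal illustration: it assumes the column names `star_rating` and `product_category` from the task description, and the small inline frame is made-up data standing in for the Kaggle CSV.

```python
import pandas as pd

# Made-up stand-in rows; the real data would come from pd.read_csv(...)
df = pd.DataFrame({
    "star_rating": [5, 3, 5, 1],
    "product_category": ["Books", "Books", "Digital_Video_Games", "Books"],
    "review_body": ["Loved it", "Okay", "Awesome", "Bad"],
})

# Keep only 5-star reviews
five_star = df[df["star_rating"] == 5]

# Keep only reviews from one category
books = df[df["product_category"] == "Books"]

# Both conditions combined (each condition needs its own parentheses)
five_star_books = df[(df["star_rating"] == 5) & (df["product_category"] == "Books")]

print(len(five_star), len(books), len(five_star_books))
```

Boolean masks leave the original frame untouched, so each subset can be inspected independently.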

In [86]:

!pip install -U -q "google-generativeai>=0.8.5" pandas

# Imports
import os
import pandas as pd
import google.generativeai as genai  # matches the package installed above
from IPython.display import Markdown, display

# Configure Gemini: read the API key from the environment rather than hardcoding it
# (GEMINI_API_KEY is an assumed variable name; set it before running the notebook)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

MODEL_ID = "gemini-2.5-flash"
model = genai.GenerativeModel(MODEL_ID)

# Load CSV locally (path is relative to the notebook's working directory)
LOCAL_CSV_FILE_PATH = "amazon_reviews_real.csv"
if os.path.exists(LOCAL_CSV_FILE_PATH):
    df = pd.read_csv(LOCAL_CSV_FILE_PATH)
    print(f"✅ Loaded CSV: {df.shape}")
else:
    raise FileNotFoundError(f"File not found: {LOCAL_CSV_FILE_PATH}")

# Sample and prepare data (avoid hitting token limits)

subset = df[["star_rating", "review_body", "product_category"]].dropna().head(5000)
# Build one line per review; newlines inside a review are flattened to spaces
review_text_block = "\n".join(
    "Rating: {} | Category: {} | Review: {}".format(
        row.star_rating, row.product_category, str(row.review_body).replace("\n", " ")
    )
    for row in subset.itertuples(index=False)
)

# Build prompt

prompt = f"""
You are a data analyst working with Amazon product review data.

Please perform a basic data analysis:
1. Perform an EDA on the Digital Games dataset.
2. Analyze 5-star reviews.
3. Identify columns containing personal data (GDPR).
4. Recommend actions for GDPR compliance, referencing GDPR.eu.

Here are sampled reviews:

{review_text_block}
"""

#  Send to Gemini
response = model.generate_content(prompt)

# Display result
display(Markdown("## 📊 Gemini's Analysis:\n"))
display(Markdown(response.text))


✅ Loaded CSV: (144724, 15)


## 📊 Gemini's Analysis:


## Basic Data Analysis of Amazon Digital Games Review Data

This analysis is based on the provided sample of Amazon product reviews for digital video games.  Due to the small sample size, the conclusions drawn are preliminary and should be validated with a larger, more representative dataset.

**1. Exploratory Data Analysis (EDA)**

The data consists of three columns implicitly: `Rating`, `Category`, and `Review`.

* **Rating:** This is a numerical variable representing the star rating given by the reviewer (1-5 stars).  The distribution of ratings can inform us about the overall sentiment towards the games.  A histogram would be helpful to visualize this.

* **Category:**  A categorical variable indicating that all reviews are for Digital_Video_Games. In a larger dataset, this column might include multiple game categories (e.g., "Digital_Video_Games", "Mobile_Games", "PC_Games").

* **Review:** This is a textual variable containing the customer's review. Text analysis techniques (e.g., sentiment analysis, topic modeling) could be applied to extract valuable insights from the reviews.


**2. Analysis of 5-Star Reviews:**

Of the sample reviews, many are 5-star ratings.  These reviews are mostly short and positive (e.g., "Awesome", "Perfect"). However, some 5-star reviews provide more detailed and valuable feedback, highlighting specific aspects they liked about the game (e.g., great story, excellent graphics, fast and secure transaction). A qualitative analysis of longer 5-star reviews would reveal specific features or aspects appreciated by satisfied customers.


**3. Identification of Columns Containing Personal Data (GDPR):**

While the provided sample is anonymized, a full dataset of Amazon reviews might contain personally identifiable information (PII) under GDPR.  Potential PII columns in a complete dataset could include:

* **Reviewer ID/Username:** A unique identifier for each reviewer.
* **Email Address:**  If included in the dataset.
* **Location:** Explicitly stated or inferred from the IP address.
* **Purchase Date:** Could potentially be used to infer user behavior over time.
* **Payment Information:**  Highly sensitive data, if available.
* **Reviews containing PII:** Reviews themselves may inadvertently contain PII such as names, addresses, or specific details about the reviewer's personal life.


**4. Recommendations for GDPR Compliance (referencing GDPR.eu):**

To ensure compliance with the GDPR (General Data Protection Regulation), the following actions are recommended, based on information found on GDPR.eu:

* **Data Minimization (Article 5(1)(c)):**  Only collect and process the minimum amount of personal data necessary for the intended purpose of the analysis.  Anonymize data wherever possible.

* **Purpose Limitation (Article 5(1)(b)):** Clearly define the purpose for collecting and processing personal data.  The purpose should be explicitly stated and adhered to.

* **Data Security (Article 32):** Implement appropriate technical and organizational measures to protect personal data against unauthorized or unlawful processing and accidental loss, destruction, or damage.

* **Data Subject Rights (Articles 15-22):**  Establish procedures to allow individuals to exercise their rights under GDPR, such as the right to access, rectification, erasure, restriction of processing, data portability, and objection.

* **Lawfulness of Processing (Article 6):** Ensure that the processing of personal data is lawful.  This often involves obtaining consent from the data subject or relying on another lawful basis for processing.

* **Accountability (Article 5(2)):**  Be prepared to demonstrate compliance with the GDPR.  Maintain detailed records of processing activities.


**Specific Actions:**

* **Anonymization:** Remove or replace Reviewer IDs, email addresses, and potentially location information before conducting the analysis.  Consider using techniques like hashing or pseudonymization to protect identity while preserving data utility.

* **Consent:** If collecting PII, ensure that appropriate consent is obtained from users in accordance with GDPR guidelines.  Inform users about the purpose of data collection and how it will be used.

* **Data Security Policies:** Implement robust security protocols and encryption to protect the data from unauthorized access.

* **Data Retention Policy:** Define a clear policy on how long personal data will be stored and ensure data is deleted after it is no longer needed.

* **Data Breach Notification:** Establish procedures to handle data breaches in line with GDPR regulations, notifying the relevant authorities and affected individuals promptly.

* **Data Protection Officer (DPO):**  Consider appointing a DPO if required by GDPR guidelines, based on the volume and nature of data processing.

By carefully addressing these points, the data analysis can be performed while ensuring the privacy and rights of individuals are respected in compliance with the GDPR.  Remember that GDPR compliance is an ongoing process, and regular reviews and updates of data handling practices are crucial.
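
The hashing or pseudonymization suggested under "Anonymization" could look roughly like the sketch below. It uses a keyed hash (HMAC-SHA256) from the standard library; the helper name `pseudonymize` and the placeholder key are assumptions made for illustration, not part of the notebook. Note that a keyed hash is pseudonymization rather than anonymization: anyone holding the key can recompute the mapping, so the result generally still counts as personal data under GDPR.

```python
import hashlib
import hmac

def pseudonymize(customer_id: str, secret_key: bytes) -> str:
    """Return a stable keyed hash (hex string) for a customer_id."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

# Placeholder key for the example only; a real key would be stored outside the code
key = b"example-only-secret"

first = pseudonymize("12345678", key)
again = pseudonymize("12345678", key)
other = pseudonymize("87654321", key)

# Same input maps to the same token, different inputs to different tokens
print(first == again, first == other, len(first))
```

Because the mapping is stable, reviews by the same customer can still be grouped after the raw ID has been dropped.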


Task 2: Sentiment Analysis and Ethical Considerations - Level: Medium

Objective: Apply sentiment analysis to customer reviews and evaluate potential biases and ethical concerns.
Instructions:
1. Sentiment Analysis Workflow: analyze the sentiment of the review_body column for a subset of reviews (e.g., 100 reviews from the Electronics category).
2. Analyze Sentiment Distribution: Summarize the sentiment distribution (e.g., positive, negative, neutral) and note any patterns related to product_category or star_rating.
3. Bias and Ethics Discussion: Write a 200–300-word report discussing potential biases in the sentiment analysis model (e.g., performance differences across product categories, languages, or cultural contexts) and ethical concerns, such as processing customer reviews without explicit consent or misinterpreting nuanced feedback.
4. Configuration: Ensure the Sentiment Analysis node is configured with “Include Detailed Results” to capture sentiment strength and confidence scores, and set the language model temperature to 0 for deterministic results.
   
Deliverables:
● PDF report on biases and ethical concerns and images of the working workflow
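
Step 2 (summarizing the sentiment distribution and relating it to star_rating) can be sketched with pandas once every review has a label. The `sentiment` column and the rows below are made up for illustration; in practice the labels would come from the sentiment analysis step.

```python
import pandas as pd

# Made-up labels for illustration only
labeled = pd.DataFrame({
    "star_rating": [5, 5, 4, 1, 1, 3],
    "sentiment": ["positive", "positive", "positive", "negative", "positive", "neutral"],
})

# Overall distribution as shares of all reviews
distribution = labeled["sentiment"].value_counts(normalize=True)

# Sentiment counts per star rating
by_rating = pd.crosstab(labeled["star_rating"], labeled["sentiment"])

# Reviews where a low rating was labeled positive: candidates for manual checking
mismatches = labeled[(labeled["star_rating"] <= 2) & (labeled["sentiment"] == "positive")]

print(by_rating)
print(mismatches)
```

Rows in `mismatches` are a natural starting point for the bias discussion, since they may point to sarcasm or to misclassification by the model.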

In [41]:
# LEVEL 2

#  Imports
import os
import pandas as pd
import google.generativeai as genai

# Configure Gemini API
genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed env var name; avoids hardcoding the key

# Set file path and load dataset
LOCAL_CSV_FILE_PATH = "amazon_reviews_real.csv"  

if os.path.exists(LOCAL_CSV_FILE_PATH):
    try:
        df = pd.read_csv(LOCAL_CSV_FILE_PATH)
        print(f"✅ Loaded dataset with shape: {df.shape}")
    except Exception as e:
        print(f" Error loading CSV: {e}")
        df = pd.DataFrame()
else:
    print(" File not found.")
    df = pd.DataFrame()

# Take the first 5000 reviews
fiveK_reviews = df.head(5000)
print(f"✅ Filtered {len(fiveK_reviews)} reviews: {fiveK_reviews.shape}")

# Create reviews text block

reviews_text = ""
for i, row in fiveK_reviews.iterrows():
    rating = row.get("star_rating", "N/A")
    # str() guards against missing review bodies, which pandas loads as float NaN
    review = str(row.get("review_body", "")).strip().replace("\n", " ")
    reviews_text += f"\n--- Review {i+1} ---\nRating: {rating}\nReview: {review}\n"

# Prompt for Gemini

prompt = f"""
You are a data analyst.

Please analyze the following {len(fiveK_reviews)} product reviews.
Each review includes a star_rating and a review_body.

### TASKS:
1. Classify each review_body as Positive, Negative, or Neutral.
2. Provide a summary table with the count of each sentiment.
3. Analyze how sentiment aligns with the star_rating (e.g., are all 5-star reviews positive?).
4. Point out any trends, anomalies, or interesting findings.

Structure your response in clearly labeled sections with short explanations and stats.

### REVIEWS:
{reviews_text}
"""

# Send to Gemini model

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    prompt,
    generation_config=genai.types.GenerationConfig(
        temperature=0.0,  # Deterministic
        top_p=1.0,
        top_k=1,
        candidate_count=1,
        stop_sequences=[]
    )
)

# Print the response
print("\n📊 Gemini's Sentiment Analysis & Insights:\n")
print(response.text)
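
The cell above places all reviews in a single prompt. One possible alternative, sketched here under assumptions, is to split the reviews into smaller batches and send one prompt per batch, which keeps each request well below the token limit and makes per-review labels easier to check. The helper name `batched` and the batch size are hypothetical; the actual model call is left out.

```python
from typing import Iterator, List

def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up stand-ins for review texts
reviews = [f"review {n}" for n in range(1, 8)]

batches = list(batched(reviews, 3))
print([len(batch) for batch in batches])
```

Each batch would then be formatted into its own prompt, and the per-batch results combined afterwards.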


✅ Loaded dataset with shape: (144724, 15)
✅ Filtered 100 reviews: (5000, 15)

📊 Gemini's Sentiment Analysis & Insights:

## Analysis of 100 Electronics Product Reviews

This analysis examines 100 product reviews from the Electronics category, classifying sentiment and exploring its correlation with star ratings.  Due to the limited context provided by some reviews, some classifications might be subjective.

**1. Sentiment Classification:**

Each review body was classified into one of three sentiment categories: Positive, Negative, or Neutral.  The classification was based on the overall tone and keywords used in the review.

**2. Sentiment Summary Table:**

| Sentiment | Count | Percentage |
|---|---|---|
| Positive | 70 | 70% |
| Negative | 20 | 20% |
| Neutral | 10 | 10% |


**3. Sentiment Alignment with Star Rating:**

The following table shows the distribution of sentiments across different star ratings:

| Star Rating | Positive | Negative | Neutral | Total |
|---|---|---|---|---|

Task 3: Generative AI Data Analyst and Responsible AI - Level: Hard

Objective: Build a Generative AI Data Analyst using n8n to generate insights from the dataset and analyze them for ethical issues, proposing responsible AI practices.
Instructions:
1. Generative AI Workflow:  Create prompts to analyze the dataset, such as “Summarize key trends in customer reviews for [product_category], focusing on star_rating and common themes in review_body” or “Identify patterns in helpful_votes across product categories.” Generate insights for at least two product categories (e.g., Electronics and Books), processing a subset of the data (e.g., 100 reviews per category).
2. Ethical Analysis: Analyze the generated insights for potential ethical issues, such as biased interpretations (e.g., overemphasizing positive trends due to model training data), misrepresentations of customer sentiment, or privacy concerns if personal data is inadvertently referenced in summaries. Reference discussions on AI ethics, such as Alation’s Data Ethics Principles.
3. Mitigation Strategies: Propose methods to mitigate these issues, such as refining prompts to avoid bias, validating insights against raw data, or implementing checks to exclude personal data from outputs.
4. Report: Write a 200–300-word report detailing your findings, the insights generated, ethical concerns identified, and proposed mitigations, emphasizing Responsible AI principles like transparency and fairness.
Deliverables:

● PDF report on insights, ethical issues, and mitigation strategies and images of the working workflow
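
Step 1 asks for a fixed-size subset per product category. A minimal pandas sketch of one way to draw such a subset follows; the frame is made-up data, and `PER_CATEGORY` is a hypothetical constant set to 2 so the example stays small (the task suggests 100).

```python
import pandas as pd

# Made-up stand-in data: 5 Books reviews and 3 Electronics reviews
df = pd.DataFrame({
    "product_category": ["Books"] * 5 + ["Electronics"] * 3,
    "review_body": [f"review {n}" for n in range(8)],
})

PER_CATEGORY = 2  # hypothetical; the task suggests 100 per category

# Shuffle once with a fixed seed, then keep the first PER_CATEGORY rows of each category.
# Categories with fewer rows than PER_CATEGORY simply contribute all of their rows.
sample = (
    df.sample(frac=1, random_state=0)
      .groupby("product_category")
      .head(PER_CATEGORY)
)

print(sample["product_category"].value_counts())
```

Taking `df.head(...)` instead returns rows in file order, which can end up covering a single category only.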

In [61]:
# LEVEL 3


# Configure Gemini API
genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed env var name; avoids hardcoding the key

# Load local CSV
LOCAL_CSV_FILE_PATH = "amazon_reviews_real.csv"  

if os.path.exists(LOCAL_CSV_FILE_PATH):
    try:
        df = pd.read_csv(LOCAL_CSV_FILE_PATH)
        print(f"✅ Loaded dataset with shape: {df.shape}")
    except Exception as e:
        print(f" Error loading CSV: {e}")
        df = pd.DataFrame()
else:
    print(" File not found.")
    df = pd.DataFrame()

# Use the first 5000 reviews (to stay within Gemini's token limit)

subset = df.head(5000)
print(f"✅ Using first {len(subset)} reviews: {subset.shape}")

# Build review text block

reviews_text = ""
for i, row in subset.iterrows():
    rating = row.get("star_rating", "N/A")
    category = row.get("product_category", "Unknown")
    review = str(row.get("review_body", "")).strip().replace("\n", " ")
    helpful_votes = row.get("helpful_votes","N/A")
    reviews_text += f"\n--- Review {i+1} ---\nCategory: {category}\nRating: {rating}\nReview: {review}\nHelpful_votes: {helpful_votes}\n"

# Prompts
data_insight_prompt = f"""
You are a data analyst.

Below are {len(subset)} customer reviews from the Amazon dataset. Analyze:

1. Key trends by product_category
2. Average and distribution of star_rating
3. Common phrases in review_body
4. Helpful_votes patterns
5. Top 5 insights in bullet points

### Reviews:
{reviews_text}
"""

ethical_analysis_prompt = """
You are an AI ethics expert. Analyze the insights derived from Amazon review data.

Discuss:

- Bias: overrepresented categories? too many 5-star reviews?
- Privacy: risk of leaking personal info in review_body?
- Misinterpretation: sarcasm or short reviews being misunderstood?
- Transparency: is the analysis explainable?

Conclude with the top ethical risk.
"""

mitigation_prompt = """
You are a responsible AI advisor.

Propose methods to:

- Reduce bias (e.g. sampling, normalization)
- Protect privacy (e.g. regex or NER to remove PII)
- Handle tone/sarcasm
- Improve transparency (e.g. human validation, explainability)

Finish with a checklist: Preprocessing > Bias Check > Anonymization > Insight Gen > Human Review
"""

# Instantiate the Gemini model
model = genai.GenerativeModel("gemini-1.5-flash")

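# Generate responses; each later step receives the previous step's output as context,
# so the ethics and mitigation prompts have concrete material to analyze
insight_response = model.generate_content(data_insight_prompt)
ethics_response = model.generate_content(
    ethical_analysis_prompt + "\n### Insights:\n" + insight_response.text
)
mitigation_response = model.generate_content(
    mitigation_prompt + "\n### Ethical analysis:\n" + ethics_response.text
)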

# Print results
print("\n📊 DATA INSIGHTS:\n")
print(insight_response.text)

print("\n⚖️ ETHICAL ANALYSIS:\n")
print(ethics_response.text)

print("\n✅ MITIGATION STRATEGIES:\n")
print(mitigation_response.text)
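
The mitigation prompt mentions using regex to remove PII. A rough sketch of that idea, applied to review text before it is placed in a prompt, might look as follows. The patterns and the helper name `redact` are assumptions made for illustration: they only catch simple email and phone formats, the phone pattern can also match other long digit sequences, and names would need an NER step that is not shown here.

```python
import re

# Simple illustrative patterns; real-world PII detection needs broader coverage
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

example = "Contact me at jane.doe@example.com or +1 555-123-4567 for a refund."
print(redact(example))
```

In the cell above, such a helper could be applied to each `review` string before it is appended to `reviews_text`.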


✅ Loaded dataset with shape: (144724, 15)
✅ Using first 5000 reviews: (5000, 15)

📊 DATA INSIGHTS:

## Amazon Customer Review Analysis

This analysis examines 1760 Amazon customer reviews focusing on Digital Video Games.  Due to the large number of reviews, a summary of key trends is provided.

**1. Key Trends by Product Category:**

The data only contains reviews for the *Digital_Video_Games* category. Therefore, a cross-category analysis is not possible. However, within this category, several trends emerge:

* **High Proportion of Positive Reviews:**  A significant majority of reviews (approximately 70%) give 4 or 5-star ratings, indicating generally positive customer sentiment towards the digital video games purchased.
* **Technical Issues:** A considerable number of negative reviews (approximately 20%) cite technical problems such as game crashes, installation difficulties (particularly on Windows 10), corrupted files, and issues redeeming codes.  This suggests potential problems w