# **Objective:**  
In this lab, you will learn how to generate synthetic text data (e.g., product reviews) using GPT‑2 from Hugging Face, perform basic data analysis, and anonymize personal data in text using regular expressions.

### **Tasks Covered:**
1. Loading a pre-trained GPT‑2 model and tokenizer.
2. Creating a text generation function that produces synthetic reviews.
3. Generating synthetic reviews from a list of prompts.
4. Storing the generated reviews in a CSV file.
5. Computing and displaying basic statistics on the synthetic reviews.
6. Writing and testing simple regex patterns to detect personal data (e.g., email addresses, phone numbers, and full names).
7. Anonymizing detected personal data by replacing or masking it with placeholders (e.g., `[EMAIL]`, `[PHONE]`, or masked names).
8. Integrating the anonymization process into the synthetic review analysis pipeline.

### **Prerequisites:**  
- Python 3.7+  
- PyTorch and Transformers libraries (install via `pip install torch transformers`)  
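- Pandas library (install via `pip install pandas`)  
- spaCy and its small English model, used in the course-code exercise (install via `pip install spacy`, then `python -m spacy download en_core_web_sm`)  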

### **Instructions:**
- Run each cell sequentially.
- Read the markdown instructions before each code cell.
- Ensure your implementations for both text generation and data anonymization work as expected.


### 1. Load the GPT‑2 Model and Tokenizer [DO NOT EDIT]

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Specify the model name
model_name = "gpt2"

# Load the tokenizer and model from Hugging Face
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()  # Set the model to evaluation mode



GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### 2. Define a Text Generation Function [DO NOT EDIT]

In this step, we define a function `generate_text` that takes a prompt and generates text using GPT‑2.  

In [2]:
def generate_text(prompt, max_length=50, temperature=0.7, top_k=50, top_p=0.95):
    """
    Generate text using GPT-2 given an input prompt.
    """
    # Encode the input prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # Generate text using the model
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_k=top_k,
            top_p=top_p,
            pad_token_id=tokenizer.eos_token_id
        )
    # Decode the generated tokens to a string
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
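
To see what `temperature`, `top_k`, and `top_p` do during sampling, here is a minimal sketch on a toy four-word vocabulary. It is a conceptual illustration only, not the code that `model.generate` runs; the function `filter_distribution` and the scores in `toy_logits` are hypothetical.

```python
import math
import random

def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Softmax over the surviving tokens (subtract the max for stability).
    peak = ranked[0][1]
    exps = [(tok, math.exp(score - peak)) for tok, score in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}

# Hypothetical next-token scores for a toy vocabulary.
toy_logits = {"great": 2.0, "good": 1.5, "okay": 0.5, "bad": -1.0}
filtered = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.9)

# Sample one token from the filtered distribution.
rng = random.Random(0)
next_token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(filtered, next_token)
```

With these toy scores, `top_k=3` drops "bad" and `top_p=0.9` then drops "okay", so only "great" and "good" can be sampled. Lowering `top_k` or `top_p` narrows the candidate set further, while raising `temperature` flattens the probabilities.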

### 3. Test the Text Generation Function [DO NOT EDIT]

Here, we test the `generate_text` function with a sample prompt for a product review.  
Review the output in the console to ensure that the model generates a detailed review.

In [3]:
# Example prompt
prompt = "Write a detailed review for a smartphone:"
generated_review = generate_text(prompt, max_length=100)
print("Generated Review:\n", generated_review)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Review:
 Write a detailed review for a smartphone:

Why this review is important

The HTC One M9 is the most powerful smartphone we've ever tested. It's a smart, low-cost smartphone that is easy to use, and it's also easily the most customizable. It's also a very good phone for anyone who likes to be a smartphone user. That said, it's not the most powerful phone that we've tested, so it's not a great phone for most people.



### 4. Generate a Synthetic Dataset [DO NOT EDIT]

In this step, we will generate multiple synthetic reviews using a variety of prompts.  
We define a list of sample prompts and generate a specified number of reviews.  
The generated reviews are stored in a Pandas DataFrame.

### **Instructions:**  
- You can adjust the `num_samples` variable to generate more or fewer reviews.
- Run the cell and review the first few generated reviews printed from the DataFrame.

In [4]:
import pandas as pd
import random

# List of sample prompts
prompts = [
    "Write a detailed review for a new smartphone:",
    "Describe your experience with a recently launched laptop:",
    "Give a review of a new pair of wireless headphones:",
    "Share your thoughts on a cutting-edge smartwatch:"
]

# Number of synthetic examples to generate
num_samples = 5
synthetic_reviews = []

for _ in range(num_samples):
    # Randomly choose a prompt for variety
    prompt = random.choice(prompts)
    review = generate_text(prompt, max_length=100)
    synthetic_reviews.append(review)

# Create a DataFrame
df = pd.DataFrame(synthetic_reviews, columns=["Review"])
print("First 5 synthetic reviews:")
print(df.head())
print(df['Review'])

First 5 synthetic reviews:
                                              Review
0  Share your thoughts on a cutting-edge smartwat...
1  Write a detailed review for a new smartphone:\...
2  Describe your experience with a recently launc...
3  Describe your experience with a recently launc...
4  Describe your experience with a recently launc...
0    Share your thoughts on a cutting-edge smartwat...
1    Write a detailed review for a new smartphone:\...
2    Describe your experience with a recently launc...
3    Describe your experience with a recently launc...
4    Describe your experience with a recently launc...
Name: Review, dtype: object
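
Because `random.choice` picks each prompt independently, some prompts can repeat while others are never used, as in the output above. As a sketch of one alternative, the hypothetical helper `pick_prompts` below shuffles the prompt list in rounds so that every prompt is used before any repeats, and accepts an optional seed so the selection is repeatable. It controls prompt selection only; sampling inside `model.generate` depends on PyTorch's own random state (for example, `torch.manual_seed`).

```python
import random
from collections import Counter

prompts = [
    "Write a detailed review for a new smartphone:",
    "Describe your experience with a recently launched laptop:",
    "Give a review of a new pair of wireless headphones:",
    "Share your thoughts on a cutting-edge smartwatch:",
]

def pick_prompts(prompts, num_samples, seed=None):
    """Return num_samples prompts, cycling through shuffled rounds of the list."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    picked = []
    while len(picked) < num_samples:
        batch = list(prompts)
        rng.shuffle(batch)
        picked.extend(batch)
    return picked[:num_samples]

# The seed value 42 is an arbitrary choice for illustration.
chosen = pick_prompts(prompts, num_samples=5, seed=42)
counts = Counter(chosen)
print(counts)
```

With four prompts and five samples, each prompt appears at least once and exactly one appears twice.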


### 5. Save the Dataset & Compute Statistics [to be solved]
### **Instructions:**

### **Save the DataFrame:**
- Save your synthetic reviews to `synthetic_reviews.csv` (without the index).

### **Compute Statistics:**
- Calculate the total number of reviews.
- Compute the average review length in words.

In [5]:
# Store the DataFrame in a CSV file
df.to_csv("synthetic_reviews.csv", index=False)

# Ensure that each review is treated as a string
df['Review'] = df['Review'].astype(str)

# Calculate total number of reviews
num_reviews = len(df)

# Calculate average review length (in words)
df['Review_Length'] = df['Review'].apply(lambda x: len(x.split()))
avg_length = df['Review_Length'].mean()

print(f"\nTotal number of reviews: {num_reviews}")
print(f"Average review length: {avg_length:.2f} words")



Total number of reviews: 5
Average review length: 72.20 words
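
Note that each generated review begins with its prompt (see the output of step 3), so the word counts above include the prompt's words. The sketch below shows one way to count words with and without that prefix, using only the standard library. The helper `word_count` and the two sample reviews are hypothetical, written for illustration rather than produced by GPT‑2.

```python
import statistics

def word_count(review, prompt=""):
    """Count whitespace-separated words, optionally ignoring a leading prompt."""
    text = str(review)
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return len(text.split())

# Hypothetical (prompt, review) pairs for illustration.
samples = [
    ("Write a detailed review for a new smartphone:",
     "Write a detailed review for a new smartphone: The screen is sharp and bright."),
    ("Give a review of a new pair of wireless headphones:",
     "Give a review of a new pair of wireless headphones: Comfortable, but the bass is weak."),
]

with_prompt = [word_count(review) for _, review in samples]
without_prompt = [word_count(review, prompt) for prompt, review in samples]

print("Mean length with prompt:   ", statistics.mean(with_prompt))
print("Mean length without prompt:", statistics.mean(without_prompt))
print("Min / max without prompt:  ", min(without_prompt), max(without_prompt))
```

Keeping the prompt for each review alongside the generated text makes this kind of adjustment possible later.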


### Exploring Regular Expressions [to be solved]

**Objective**: Write and test simple regex patterns using Python.

**Instructions**:


Write code that uses re.search() or re.findall() to locate:

 - An email address

 - A phone number

 - A full name pattern (e.g., two words starting with capital letters)


In [6]:
import re

# Sample text strings
email_text = "Please contact ali@gmail.com for further details."
phone_text = "My phone number is +923034567890."
name_text  = "Hello, my name is Abdullah Asghar."

# Define regex patterns
email_pattern = r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}"  # Matches common email formats
phone_pattern = r"\+92\d{10}"  # Matches phone numbers in the format +923034567890
name_pattern  = r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b"  # Matches two words starting with uppercase letters

# Perform searches
email_match = re.search(email_pattern, email_text)
phone_match = re.search(phone_pattern, phone_text)
name_match = re.search(name_pattern, name_text)


print("Email Found:", email_match.group() if email_match else "None")
print("Phone Found:", phone_match.group() if phone_match else "None")
print("Name Found: ", name_match.group() if name_match else "None")

Email Found: ali@gmail.com
Phone Found: +923034567890
Name Found:  Abdullah Asghar
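
The cell above uses `re.search()`, which returns only the first match. When a text may contain several items, `re.findall()` returns all of them as a list. A minimal sketch, reusing the same email pattern on a hypothetical sentence:

```python
import re

email_pattern = r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}"

# Hypothetical sample text containing a repeated address.
sample = "Write to ali@gmail.com or sara@example.org, not to ali@gmail.com twice."

first_match = re.search(email_pattern, sample)   # first match only (or None)
all_emails = re.findall(email_pattern, sample)   # every non-overlapping match
unique_emails = list(dict.fromkeys(all_emails))  # drop duplicates, keep order

print("First:", first_match.group() if first_match else "None")
print("All:", all_emails)
print("Unique:", unique_emails)
```

`re.search()` is enough to test whether something is present; `re.findall()` is the better fit when you need to count or list every occurrence.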


### Anonymizing Personal Data with Regex [to be solved]

**Objective:**  
Anonymize emails, phone numbers, and full names in a text using Python regex.

**Instructions:**

1. Email Anonymization:
   - Use a regex to find email addresses.
   - Replace matches with `[EMAIL]` via `re.sub()`.

2. Phone Number Anonymization:
   - Use a regex to find phone numbers.
   - Replace matches with `[PHONE]` via `re.sub()`.

3. Name Anonymization:
   - Use a regex to find full names.
   - Replace matches with a masked version (e.g., "John Doe" → "J**** D****") via `re.sub()`.



In [7]:
import re

# Example containing personal data
text = """
Abdullah's email is abde@example.com and his phone number is +923034567890.
He lives at 123 Raya St, Springfield and his colleague Ali Smith has the email Ali.smith@company.org.
"""

# Anonymize email addresses
email_pattern = r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}"
text = re.sub(email_pattern, '[EMAIL]', text)

# Anonymize phone numbers (a simple version that matches numbers with optional + and separators)
phone_pattern = r"\+?\d[\d\s-]{7,}\d"
text = re.sub(phone_pattern, '[PHONE]', text)


# Anonymize names:
# This example assumes a name is two adjacent words that start with capital letters,
# so it also masks non-names such as "Raya St" and skips single names such as "Abdullah".
# It replaces each name with a masked version showing only the first letter of each part.
def mask_name(match):
    # The pattern guarantees exactly two whitespace-separated parts
    first, last = match.group().split()
    return f"{first[0]}**** {last[0]}****"

name_pattern = r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b"
text = re.sub(name_pattern, mask_name, text)


print(text)


Abdullah's email is [EMAIL] and his phone number is [PHONE].
He lives at 123 R**** S****, Springfield and his colleague A**** S**** has the email [EMAIL].
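
In the output above, the street name "Raya St" was masked as if it were a person's name, because the name pattern only looks for two capitalized words. The sketch below wraps the three steps in a single function that can be applied to each review, and skips matches whose second word is a street suffix. The function `anonymize`, the `STREET_SUFFIXES` set, and the sample reviews are hypothetical; this is one possible workaround, not a complete solution to name detection.

```python
import re

EMAIL_RE = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,5}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")
NAME_RE = re.compile(r"\b[A-Z][a-z]+\s[A-Z][a-z]+\b")

# Hypothetical, deliberately short list of words that signal an address.
STREET_SUFFIXES = {"St", "Rd", "Ave", "Street", "Road", "Avenue"}

def mask_name(match):
    """Mask a two-word name unless the second word looks like a street suffix."""
    first, last = match.group().split()
    if last in STREET_SUFFIXES:
        return match.group()  # leave addresses untouched
    return f"{first[0]}**** {last[0]}****"

def anonymize(text):
    """Replace emails and phone numbers with placeholders, then mask names."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return NAME_RE.sub(mask_name, text)

# Hypothetical reviews containing personal data.
reviews = [
    "Please contact Ali Smith at Ali.smith@company.org or +923034567890.",
    "He lives at 123 Raya St, Springfield.",
]
cleaned = [anonymize(review) for review in reviews]
for line in cleaned:
    print(line)
```

In the notebook, the same function could be applied to the DataFrame with `df['Review'].apply(anonymize)` before the statistics are computed. The pattern still produces false positives for any other pair of capitalized words, for example a capitalized sentence opener followed by a name.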



### Extracting Course Codes Using spaCy [to be solved]

**Objective:**

Extract course codes from text that contains course descriptions using spaCy tokenization instead of regex.

**Instructions:**

 - Load spaCy Model:
   Use `spacy.load('en_core_web_sm')` to load the English NLP model.
 - Tokenize Course Descriptions:
   Process text with `nlp()` and extract individual tokens.
 - Identify Course Codes:
   Find patterns where:
   - The first token consists of uppercase letters (Department Code).
   - The second token is a three-digit number (Course Number).
 - Extract and Return Course Code:
   Concatenate the department and course number (e.g., "CS" + "101" → "CS101").

In [8]:
import spacy

# Sample data
course_descriptions = [
    "CS 101 Introduction to Computer Science",
    "MATH 202 Calculus II",
    "ENG 150 English Literature",
    "BIO 303 Genetics and Evolution",
    "HIST 210 World History"
]

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

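def extract_course_code(text):
    # Tokenize the description with spaCy
    doc = nlp(text)
    print(doc)  # Echo the input text
    tokens = [token.text for token in doc]

    # A course code is an uppercase department token followed by a three-digit number
    if len(tokens) > 1 and tokens[0].isalpha() and tokens[0].isupper() and tokens[1].isdigit() and len(tokens[1]) == 3:
        return tokens[0] + tokens[1]
    return None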

# Iterate through the course descriptions
for description in course_descriptions:
    course_code = extract_course_code(description)
    if course_code:
        print(f"Extracted Course Code: {course_code}")
    else:
        print("No course code found.")


CS 101 Introduction to Computer Science
Extracted Course Code: CS101
MATH 202 Calculus II
Extracted Course Code: MATH202
ENG 150 English Literature
Extracted Course Code: ENG150
BIO 303 Genetics and Evolution
Extracted Course Code: BIO303
HIST 210 World History
Extracted Course Code: HIST210
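
The function above checks only the first two tokens, which works because every sample description starts with its course code. As a sketch of how the same token check extends to codes that appear anywhere in a sentence, the hypothetical helper below scans every adjacent pair of tokens. To stay self-contained it uses whitespace splitting with punctuation stripped as a rough stand-in for spaCy's tokenizer; in the notebook, the token list would come from `nlp(text)` instead.

```python
import string

def find_course_codes(text):
    """Return every DEPT + three-digit-number pair found in adjacent tokens."""
    # Rough stand-in for spaCy tokens: split on whitespace, trim punctuation.
    tokens = [tok.strip(string.punctuation) for tok in text.split()]
    codes = []
    for dept, number in zip(tokens, tokens[1:]):
        is_dept = dept.isalpha() and dept.isupper()
        is_number = number.isdigit() and len(number) == 3
        if is_dept and is_number:
            codes.append(dept + number)
    return codes

# Hypothetical sentence with codes in the middle and at the end.
sentence = "Students must pass CS 101 before taking MATH 202 or BIO 303."
codes = find_course_codes(sentence)
print(codes)
```

Returning a list rather than a single value also covers descriptions that mention more than one course, and an empty list signals that no code was found.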
