# AI-Powered ETL Pipeline for Customer Data Enrichment and Marketing Personalization

## 📌 Project Overview

This notebook presents an end-to-end ETL (Extract, Transform, Load) pipeline designed to simulate a real-world data workflow in a marketing analytics context.

The pipeline processes customer data, applies data privacy techniques, enriches records with AI-generated insights, and outputs a structured dataset ready for analytical or business use.

The objective of this project is to demonstrate practical skills in data engineering, data quality, and AI integration within a reproducible and scalable workflow.

## 🔄 Pipeline Stages

The ETL process is structured into three main stages:

### 1) Extract
- Load customer data from a CSV file.
- Convert raw data into structured Python objects for processing.

### 2) Transform
- Apply data masking to sensitive fields (e.g., account and card numbers).
- Perform data validation and basic quality checks.
- Generate personalized marketing messages using an AI model.
- Handle exceptions and ensure pipeline robustness.

### 3) Load
- Build a clean and enriched dataset.
- Export the final output to a structured CSV file for downstream use.

In [29]:
import pandas as pd
from openai import OpenAI

## 📥 Extract Phase

Load the CSV file and convert it into a list of dictionaries (one per customer record) for easier processing.

In [None]:
df = pd.read_csv("data/clients.csv")
data = df.to_dict(orient="records")

df.head()

|   | ID | Name | Account | Card |
|---|----|------|---------|------|
| 0 | 1 | Ana Silva | 12345 | 5555-4444-3333-2222 |
| 1 | 2 | João Pereira | 67890 | 1111-2222-3333-4444 |
| 2 | 3 | Marcos Lima | 54321 | 9999-8888-7777-6666 |
| 3 | 4 | Carla Mendes | 11223 | 4444-3333-2222-1111 |
| 4 | 5 | Ricardo Souza | 99887 | 2222-1111-4444-3333 |
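
The Transform stage lists data validation and basic quality checks. A minimal sketch of what such a check could look like before masking, assuming the four columns shown above. The `validate_record` helper and its rules (no missing values, numeric account, card in `dddd-dddd-dddd-dddd` form) are illustrative assumptions, not part of the original notebook:

```python
import re

# Assumed rule: cards look like 1234-5678-9012-3456 (taken from the sample data).
CARD_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{4}$")
REQUIRED_FIELDS = ("ID", "Name", "Account", "Card")


def validate_record(record):
    """Return a list of problems found in one customer record (empty if clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Missing key, None, NaN (the only value not equal to itself), or blank string.
        if value is None or value != value or str(value).strip() == "":
            problems.append(f"missing {field}")
    if "missing Account" not in problems and not str(record["Account"]).isdigit():
        problems.append("account is not numeric")
    if "missing Card" not in problems and not CARD_PATTERN.match(str(record["Card"])):
        problems.append("card has unexpected format")
    return problems


# Hypothetical records: the first is clean, the second breaks three rules.
sample = [
    {"ID": 1, "Name": "Ana Silva", "Account": 12345, "Card": "5555-4444-3333-2222"},
    {"ID": 2, "Name": "", "Account": "67A90", "Card": "1111-2222"},
]
report = {rec["ID"]: validate_record(rec) for rec in sample}
print(report)
```

Records with a non-empty problem list could be logged and excluded before the masking step, so that later stages only see well-formed rows.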


## 🔐 Data Privacy and Masking

To simulate real-world data governance practices, sensitive customer information is masked before further processing.

This step reflects common industry practices related to:
- Data privacy
- Compliance requirements
- Secure data handling in analytics pipelines

In [31]:
def mask_card(card):
    return "****-****-****-" + card[-4:]

def mask_account(account):
    return "*" * (len(str(account)) - 1) + str(account)[-1:]

for user in data:
    user["Card"] = mask_card(user["Card"])
    user["Account"] = mask_account(str(user["Account"]))

data[:3]

[{'ID': 1,
  'Name': 'Ana Silva',
  'Account': '****5',
  'Card': '****-****-****-2222'},
 {'ID': 2,
  'Name': 'João Pereira',
  'Account': '****0',
  'Card': '****-****-****-4444'},
 {'ID': 3,
  'Name': 'Marcos Lima',
  'Account': '****1',
  'Card': '****-****-****-6666'}]
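
The helpers above assume every value is present and well formed. A defensive variant is sketched below, under the assumption that missing values should become an empty string and that values too short to mask safely should be hidden entirely. `mask_tail` is a hypothetical helper name, not part of the original notebook:

```python
def mask_tail(value, visible=4, mask_char="*"):
    """Mask every letter or digit except the last `visible` ones.

    Separators such as '-' are kept so the field keeps its shape.
    """
    if value is None or value != value:  # None or NaN (NaN is not equal to itself)
        return ""
    text = str(value)
    total = sum(ch.isalnum() for ch in text)
    # If the value has no more characters than `visible`,
    # mask everything rather than reveal the whole value.
    keep = visible if total > visible else 0
    to_mask = total - keep
    out = []
    for ch in text:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_tail("5555-4444-3333-2222"))  # card: keep last 4 digits
print(mask_tail(12345, visible=1))       # account: keep last digit
print(mask_tail("12"))                   # too short: fully masked
```

For the sample data this produces the same masked values as `mask_card` and `mask_account`, while also tolerating integers, `None`, and `NaN`.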

## 🤖 AI-Driven Data Enrichment

An AI model is used to enrich customer records with personalized marketing messages.

This step demonstrates how AI can be integrated into data pipelines to:
- Enhance customer segmentation
- Support personalized communication strategies
- Add business value to raw datasets

In [32]:
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [33]:
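def generate_message(user):
    """Return a short marketing message for one customer, or None if the API call fails."""
    try:
        completion = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a marketing and banking account manager specialist."},
                {"role": "user", "content": f"Generate a message for {user['Name']} about the importance of investments and savings (200 characters max)."}
            ]
        )
        # Strip surrounding whitespace and any quotation marks the model wraps around the text.
        return completion.choices[0].message.content.strip().strip('"')
    except Exception as e:
        # Log the failure and return None so the caller can skip this record.
        print(f"Error generating message for {user['Name']}: {e}")
        return None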

In [34]:
for user in data:
    news = generate_message(user)

    if not news:
        print(f"Skipping user {user['Name']} due to AI failure")
        continue

    user["news"] = [{"description": news}]

data[:3]

[{'ID': 1,
  'Name': 'Ana Silva',
  'Account': '****5',
  'Card': '****-****-****-2222',
  'news': [{'description': 'Hi Ana! Investing and saving are key to financial security. They help grow your wealth over time, provide for future needs, and offer peace of mind. Start today for a brighter tomorrow!'}]},
 {'ID': 2,
  'Name': 'João Pereira',
  'Account': '****0',
  'Card': '****-****-****-4444',
  'news': [{'description': 'Hi João, investing and saving are key to financial security. They help grow wealth and prepare for future needs. Start small, stay consistent, and watch your financial goals come to life!'}]},
 {'ID': 3,
  'Name': 'Marcos Lima',
  'Account': '****1',
  'Card': '****-****-****-6666',
  'news': [{'description': 'Hi Marcos, investing and saving are key to financial security and growth. They can help you achieve goals and prepare for the future. Start small and build your wealth over time!'}]}]
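
Transient API failures such as timeouts or rate limits are a common reason for a record to be skipped. One way to make the enrichment step more robust is to retry with exponential backoff. The sketch below wraps any callable; `with_retries` and its parameters are illustrative names, and the `flaky` function stands in for a real API call:

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries are used up.

    Waits base_delay, then 2x, then 4x, ... between tries. Returns None
    when every attempt fails, so the caller can skip the record.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # deliberately broad in this sketch
            print(f"Attempt {attempt + 1}/{attempts} failed: {exc}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping, to keep the demo fast.
delays = []
result = with_retries(flaky, attempts=3, base_delay=0.1, sleep=delays.append)
print(result, delays)
```

Because `generate_message` catches its own exceptions, a wrapper like this would need to surround the raw API call inside it rather than `generate_message` itself.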

## 📤 Output Generation

After records without a generated message are filtered out, the pipeline produces a final dataset containing:
- Masked customer identifiers
- Cleaned and structured attributes
- AI-generated personalized messages

The resulting dataset is ready for analytical, reporting, or business applications.

In [None]:
valid_rows = [u for u in data if u.get("news")]

if not valid_rows:
    print("No valid messages generated. Aborting load phase.")
else:
    output_rows = []

    for user in valid_rows:
        output_rows.append({
            "ID": user["ID"],
            "Name": user["Name"],
            "Account": user["Account"],
            "Card": user["Card"],
            "Message": user["news"][0]["description"]
        })

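    df_out = pd.DataFrame(output_rows)
    # Create the target folder if needed; to_csv does not create missing directories.
    os.makedirs("output", exist_ok=True)
    df_out.to_csv("output/marketing_messages.csv", index=False, encoding="utf-8")
    # A bare expression inside an if/else block is not displayed by Jupyter, so print it.
    print(df_out.head())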
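
Since the exported file is meant for downstream use, a quick post-load check can confirm that no unmasked card number slipped through. This is a sketch, assuming cards follow the `dddd-dddd-dddd-dddd` pattern from the sample data; `find_unmasked_cards` is a hypothetical helper, and the temporary file stands in for `output/marketing_messages.csv`:

```python
import csv
import os
import re
import tempfile

# Assumed pattern for a full, unmasked card number.
FULL_CARD = re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}")


def find_unmasked_cards(csv_path):
    """Return the ID of every row in which any field contains a full card number."""
    leaked = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if any(FULL_CARD.search(str(value)) for value in row.values()):
                leaked.append(row["ID"])
    return leaked


# Hypothetical rows: the second one was (wrongly) left unmasked.
rows = [
    {"ID": "1", "Card": "****-****-****-2222", "Message": "Hi Ana!"},
    {"ID": "2", "Card": "1111-2222-3333-4444", "Message": "Hi João!"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "marketing_messages.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["ID", "Card", "Message"])
        writer.writeheader()
        writer.writerows(rows)
    leaked_ids = find_unmasked_cards(path)

print(leaked_ids)
```

The check scans every column, so it would also catch a card number that ended up inside a generated message.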

## 🎯 Key Takeaways

This project demonstrates practical skills in:
- Designing ETL workflows
- Data cleaning and validation
- Data privacy and masking techniques
- AI integration in data pipelines
- Structuring data projects for real-world use

This notebook is part of my professional data portfolio and reflects my approach to building scalable and business-oriented data solutions.