# Data Ingestion:

- Load the Customer Master Data CSV string into a Pandas DataFrame named df_customers.
- Parse the Simulated Transaction Events JSON list into a Pandas DataFrame named df_transactions.
- Print the first 5 rows and the info() for both DataFrames immediately after ingestion.
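
The cells below read the data from files under `/data/`. If the inputs are instead available as an in-memory CSV string and JSON string, as the task wording suggests, ingestion might look like the following minimal sketch. The sample strings `customer_csv` and `transactions_json` are hypothetical stand-ins, not the real data.

```python
import io
import json

import pandas as pd

# Hypothetical in-memory inputs; the notebook itself reads files under /data/.
customer_csv = """customer_id,customer_name,registration_date
C101,Alice Smith,2023-01-15
C102,Bob Johnson,2024-07-09
"""
transactions_json = (
    '[{"transaction_id": "TX001", "customer_id": "C101", "amount": 150.75},'
    ' {"transaction_id": "TX002", "customer_id": "C102", "amount": 25.0}]'
)

# read_csv accepts any file-like object, so StringIO wraps the CSV text.
df_customers = pd.read_csv(io.StringIO(customer_csv))

# json.loads turns the JSON text into a list of dicts, one per event.
df_transactions = pd.DataFrame(json.loads(transactions_json))

# Print the first rows and the info() summary right after ingestion.
for name, frame in [("df_customers", df_customers), ("df_transactions", df_transactions)]:
    print(f"--- {name} ---")
    print(frame.head())
    frame.info()  # info() prints directly and returns None
```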

In [11]:
import pandas as pd
import json

In [None]:
# Load the CSV file into a DataFrame
try:
    customer_df = pd.read_csv("/data/Customer Master Data.csv")
    display(customer_df.head())
except Exception as e:
    print(f"An error occurred while loading Customer Master Data.csv: {e}")

Unnamed: 0,customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
0,C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
1,C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
2,C103,Charlie Brown,charlie@example.com,2023-11-20,Bronze,2024-07-14
3,C104,Diana Prince,diana@example.com,2024-07-10,Silver,2024-07-16
4,C105,Eve Adams,eve@example.com,2023-05-01,Gold,2024-07-15


In [None]:
# Load the JSON file into another DataFrame
try:
    with open("/data/Simulated Transaction Events.json", 'r') as f:
        transaction_data = json.load(f) # Load the entire content as a single JSON object/array
    transaction_df = pd.DataFrame(transaction_data)
    display(transaction_df.head())
except Exception as e:
    print(f"An error occurred while loading Simulated Transaction Events.json: {e}")

Unnamed: 0,transaction_id,customer_id,amount,timestamp,currency,ip_address
0,TX001,C101,150.75,2024-07-16 10:00:00,USD,192.168.1.10
1,TX002,C102,25.0,2024-07-16 10:01:30,USD,192.168.1.11
2,TX003,C101,50.25,2024-07-16 10:02:00,USD,192.168.1.10
3,TX004,C103,1200.0,2024-07-16 10:03:15,EUR,192.168.1.12
4,TX005,C102,75.5,2024-07-16 10:04:00,USD,192.168.1.11


# Data Cleaning & Preprocessing (df_customers):

- Handle Missing Values: Ensure no missing values in critical columns like customer_id, customer_name, registration_date. For this dataset, assume all are present.
- Correct Data Types: Convert registration_date and last_login_date to datetime objects.
- Handle Duplicates: Identify and remove any duplicate customer_id entries, keeping the first occurrence.
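
The duplicate-handling step for `customer_id` can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame (`customers`), assuming that "keep the first occurrence" means the first row in file order.

```python
import pandas as pd

# Hypothetical sample with one repeated customer_id.
customers = pd.DataFrame({
    "customer_id": ["C101", "C102", "C101"],
    "customer_name": ["Alice Smith", "Bob Johnson", "Alice S."],
    "registration_date": ["2023-01-15", "2024-07-09", "2023-01-15"],
})

# Mark every repeat of a customer_id after its first appearance.
dup_mask = customers.duplicated(subset=["customer_id"], keep="first")
print(f"Duplicate customer_id rows: {dup_mask.sum()}")

# Drop the repeats and renumber the index.
deduped = customers.drop_duplicates(subset=["customer_id"], keep="first").reset_index(drop=True)
print(deduped)
```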

In [14]:
# Filter out records with missing values in critical columns
if 'customer_df' in locals():
    initial_customer_count = len(customer_df)
    columns_to_check = ['customer_id', 'customer_name', 'registration_date']
    customer_df_filtered = customer_df.dropna(subset=columns_to_check)

    print(f"Initial number of customer records: {initial_customer_count}")
    print(f"Number of customer records after filtering missing values: {len(customer_df_filtered)}")

    # Display the first 5 rows of the filtered DataFrame
    print("\nFiltered Customer Master Data DataFrame (first 5 rows):")
    display(customer_df_filtered.head())
else:
    print("customer_df DataFrame not loaded, skipping filtering.")

Initial number of customer records: 9
Number of customer records after filtering missing values: 9

Filtered Customer Master Data DataFrame (first 5 rows):


Unnamed: 0,customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
0,C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
1,C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
2,C103,Charlie Brown,charlie@example.com,2023-11-20,Bronze,2024-07-14
3,C104,Diana Prince,diana@example.com,2024-07-10,Silver,2024-07-16
4,C105,Eve Adams,eve@example.com,2023-05-01,Gold,2024-07-15


In [15]:
# Convert date columns to datetime objects
if 'customer_df_filtered' in locals():
    customer_df_filtered['registration_date'] = pd.to_datetime(customer_df_filtered['registration_date'])
    customer_df_filtered['last_login_date'] = pd.to_datetime(customer_df_filtered['last_login_date'])

    print("\nCustomer Master Data DataFrame after converting date columns:")
    display(customer_df_filtered.head())
    print("\nData types after conversion:")
    display(customer_df_filtered.info())
else:
    print("customer_df_filtered DataFrame not found. Please ensure the previous steps were executed.")


Customer Master Data DataFrame after converting date columns:


Unnamed: 0,customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
0,C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
1,C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
2,C103,Charlie Brown,charlie@example.com,2023-11-20,Bronze,2024-07-14
3,C104,Diana Prince,diana@example.com,2024-07-10,Silver,2024-07-16
4,C105,Eve Adams,eve@example.com,2023-05-01,Gold,2024-07-15



Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   customer_id        9 non-null      object        
 1   customer_name      9 non-null      object        
 2   customer_email     9 non-null      object        
 3   registration_date  9 non-null      datetime64[ns]
 4   customer_tier      9 non-null      object        
 5   last_login_date    9 non-null      datetime64[ns]
dtypes: datetime64[ns](2), object(4)
memory usage: 564.0+ bytes


None

In [16]:
# Filter out records with missing values in critical columns
if 'transaction_df' in locals():
    initial_transaction_count = len(transaction_df)
    columns_to_check = ['transaction_id', 'customer_id', 'timestamp', 'amount']
    transaction_df_filtered = transaction_df.dropna(subset=columns_to_check)

    print(f"Initial number of transaction records: {initial_transaction_count}")
    print(f"Number of transaction records after filtering missing values: {len(transaction_df_filtered)}")

    # Handle duplicates
    # Remove duplicates based on transaction_id, keeping the first occurrence
    transaction_df_no_id_duplicates = transaction_df_filtered.drop_duplicates(subset=['transaction_id'], keep='first')
    print(f"\nNumber of transaction records after removing duplicates based on transaction_id: {len(transaction_df_no_id_duplicates)}")

    # Remove duplicates based on customer_id, amount, and timestamp, keeping the first occurrence
    transaction_df_cleaned = transaction_df_no_id_duplicates.drop_duplicates(subset=['customer_id', 'amount', 'timestamp'], keep='first')
    print(f"Number of transaction records after removing duplicates based on customer_id, amount, and timestamp: {len(transaction_df_cleaned)}")

    # Work on an explicit copy so the assignments below do not write into a slice
    transaction_df_cleaned = transaction_df_cleaned.copy()

    # Convert timestamp to datetime
    transaction_df_cleaned['timestamp'] = pd.to_datetime(transaction_df_cleaned['timestamp'])

    # Convert amount to float; non-numeric strings such as 'NULL' become NaN
    transaction_df_cleaned['amount'] = pd.to_numeric(transaction_df_cleaned['amount'], errors='coerce').astype(float)


    print("\nTransaction Events DataFrame after cleaning:")
    display(transaction_df_cleaned.head())
else:
    print("transaction_df DataFrame not found. Please ensure the previous steps were executed.")

Initial number of transaction records: 16
Number of transaction records after filtering missing values: 16

Number of transaction records after removing duplicates based on transaction_id: 15
Number of transaction records after removing duplicates based on customer_id, amount, and timestamp: 15

Transaction Events DataFrame after cleaning:


Unnamed: 0,transaction_id,customer_id,amount,timestamp,currency,ip_address
0,TX001,C101,150.75,2024-07-16 10:00:00,USD,192.168.1.10
1,TX002,C102,25.0,2024-07-16 10:01:30,USD,192.168.1.11
2,TX003,C101,50.25,2024-07-16 10:02:00,USD,192.168.1.10
3,TX004,C103,1200.0,2024-07-16 10:03:15,EUR,192.168.1.12
4,TX005,C102,75.5,2024-07-16 10:04:00,USD,192.168.1.11



Data types after cleaning:
<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, 0 to 14
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   transaction_id  15 non-null     object        
 1   customer_id     15 non-null     object        
 2   amount          15 non-null     float64       
 3   timestamp       15 non-null     datetime64[ns]
 4   currency        15 non-null     object        
 5   ip_address      15 non-null     object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 840.0+ bytes


None

In [17]:
# Create Derived Column: Calculate transaction_hour
if 'transaction_df_cleaned' in locals() and not transaction_df_cleaned.empty:
    transaction_df_cleaned['transaction_hour'] = transaction_df_cleaned['timestamp'].dt.hour
    print("\nTransaction Events DataFrame with 'transaction_hour' column:")
    display(transaction_df_cleaned.head())
else:
    print("transaction_df_cleaned DataFrame not found or is empty. Please ensure the previous steps were executed and the DataFrame is populated.")


Transaction Events DataFrame with 'transaction_hour' column:


Unnamed: 0,transaction_id,customer_id,amount,timestamp,currency,ip_address,transaction_hour
0,TX001,C101,150.75,2024-07-16 10:00:00,USD,192.168.1.10,10
1,TX002,C102,25.0,2024-07-16 10:01:30,USD,192.168.1.11,10
2,TX003,C101,50.25,2024-07-16 10:02:00,USD,192.168.1.10,10
3,TX004,C103,1200.0,2024-07-16 10:03:15,EUR,192.168.1.12,10
4,TX005,C102,75.5,2024-07-16 10:04:00,USD,192.168.1.11,10


In [18]:
# Print the cleaned df_customers info and first 5 rows.
if 'customer_df_filtered' in locals():
    print("Cleaned Customer Master Data DataFrame Info:")
    customer_df_filtered.info()
    print("\nCleaned Customer Master Data DataFrame (first 5 rows):")
    display(customer_df_filtered.head())
else:
    print("customer_df_filtered DataFrame not found.")

# Print the cleaned df_transactions info and first 5 rows.
if 'transaction_df_cleaned' in locals():
    print("\nCleaned Simulated Transaction Events DataFrame Info:")
    transaction_df_cleaned.info()
    print("\nCleaned Simulated Transaction Events DataFrame (first 5 rows):")
    display(transaction_df_cleaned.head())
else:
    print("transaction_df_cleaned DataFrame not found.")

# Print the number of unique customers and transactions after cleaning.
if 'customer_df_filtered' in locals():
    unique_customers_count = customer_df_filtered['customer_id'].nunique()
    print(f"\nNumber of unique customers after cleaning: {unique_customers_count}")
else:
    print("\nCould not determine unique customer count as customer_df_filtered DataFrame was not found.")

if 'transaction_df_cleaned' in locals():
    unique_transactions_count = transaction_df_cleaned['transaction_id'].nunique()
    print(f"Number of unique transactions after cleaning: {unique_transactions_count}")
else:
    print("Could not determine unique transaction count as transaction_df_cleaned DataFrame was not found.")

Cleaned Customer Master Data DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   customer_id        9 non-null      object        
 1   customer_name      9 non-null      object        
 2   customer_email     9 non-null      object        
 3   registration_date  9 non-null      datetime64[ns]
 4   customer_tier      9 non-null      object        
 5   last_login_date    9 non-null      datetime64[ns]
dtypes: datetime64[ns](2), object(4)
memory usage: 564.0+ bytes

Cleaned Customer Master Data DataFrame (first 5 rows):


Unnamed: 0,customer_id,customer_name,customer_email,registration_date,customer_tier,last_login_date
0,C101,Alice Smith,alice@example.com,2023-01-15,Gold,2024-07-15
1,C102,Bob Johnson,bob@example.com,2024-07-09,Silver,2024-07-16
2,C103,Charlie Brown,charlie@example.com,2023-11-20,Bronze,2024-07-14
3,C104,Diana Prince,diana@example.com,2024-07-10,Silver,2024-07-16
4,C105,Eve Adams,eve@example.com,2023-05-01,Gold,2024-07-15



Cleaned Simulated Transaction Events DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, 0 to 14
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   transaction_id    15 non-null     object        
 1   customer_id       15 non-null     object        
 2   amount            15 non-null     float64       
 3   timestamp         15 non-null     datetime64[ns]
 4   currency          15 non-null     object        
 5   ip_address        15 non-null     object        
 6   transaction_hour  15 non-null     int32         
dtypes: datetime64[ns](1), float64(1), int32(1), object(4)
memory usage: 900.0+ bytes

Cleaned Simulated Transaction Events DataFrame (first 5 rows):


Unnamed: 0,transaction_id,customer_id,amount,timestamp,currency,ip_address,transaction_hour
0,TX001,C101,150.75,2024-07-16 10:00:00,USD,192.168.1.10,10
1,TX002,C102,25.0,2024-07-16 10:01:30,USD,192.168.1.11,10
2,TX003,C101,50.25,2024-07-16 10:02:00,USD,192.168.1.10,10
3,TX004,C103,1200.0,2024-07-16 10:03:15,EUR,192.168.1.12,10
4,TX005,C102,75.5,2024-07-16 10:04:00,USD,192.168.1.11,10



Number of unique customers after cleaning: 9
Number of unique transactions after cleaning: 15


In [20]:
# Save cleaned DataFrames to CSV files
if 'customer_df_filtered' in locals() and not customer_df_filtered.empty:
    customer_df_filtered.to_csv("cleaned_customer_data.csv", index=False)
    print("Cleaned customer data saved to 'cleaned_customer_data.csv'")
else:
    print("customer_df_filtered DataFrame not found or is empty, skipping save.")

if 'transaction_df_cleaned' in locals() and not transaction_df_cleaned.empty:
    transaction_df_cleaned.to_csv("cleaned_transaction_events.csv", index=False)
    print("Cleaned transaction events saved to 'cleaned_transaction_events.csv'")
else:
    print("transaction_df_cleaned DataFrame not found or is empty, skipping save.")

Cleaned customer data saved to 'cleaned_customer_data.csv'
Cleaned transaction events saved to 'cleaned_transaction_events.csv'


Explain your reasoning for each step, focusing on the following:

**Q. How did you handle missing values in both DataFrames and why did you choose that strategy (e.g., replacement with median, removal)?**
> A. For the transaction dataset, I removed every record that is missing a value for the transaction id, customer id, or timestamp (the code also drops rows with a missing amount), because such records are hard to interpret:
* If a transaction shares its customer id and timestamp with another record but has no transaction id, I cannot assume they are the same transaction.
* If a transaction has no customer id, I do not know who made it.
* If a transaction has no timestamp, it is not reliable, because I cannot substitute an arbitrary value for the missing timestamp (even when the transaction id and customer id are present).

For the customer dataset, the same strategy applies: records missing `customer_id`, `customer_name`, or `registration_date` are dropped.
    
**Q. Describe your approach to data type conversion for datetime and numeric columns.**
> A. To convert the data to the correct types, I cast the columns using `pd.to_datetime(column)` for dates and `df['amount'].astype(float)` for amounts. For the `amount` column, the notebook code first turns the string value `NULL` into a missing value (NaN) before the cast to float.
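
As an illustration of this conversion step, the sketch below uses `errors="coerce"` so that unparseable entries become missing values instead of raising. The frame `raw` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: numeric strings, a 'NULL' marker, and a bad date.
raw = pd.DataFrame({
    "amount": ["150.75", "NULL", 25],
    "timestamp": ["2024-07-16 10:00:00", "not a date", "2024-07-16 10:01:30"],
})

converted = raw.copy()

# Non-numeric entries such as 'NULL' become NaN.
converted["amount"] = pd.to_numeric(converted["amount"], errors="coerce")

# Unparseable timestamps become NaT.
converted["timestamp"] = pd.to_datetime(converted["timestamp"], errors="coerce")

print(converted.dtypes)
print(converted)
```

Coercing keeps the bad rows visible as NaN/NaT, so they can be counted or dropped in a separate, explicit step.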

**Q. Duplicate Handling: Explain your logic for identifying and removing duplicate entries in `customer_id` and `transaction_id`.**
> A. In the customer dataset, two records with the same `customer_id` indicate an error at registration time (for example, both were submitted at the same moment, or something went wrong in the source system). I would keep the first record untouched and assign the later record a new `customer_id` ("latest value + 1"). In the transaction dataset, if two records have the same `customer_id`, `transaction_id`, `amount`, and `timestamp`, I consider them duplicates and remove all but the first. If two records have the same `transaction_id` but different `customer_id`, `amount`, and `timestamp`, I would assign the later record a new transaction id (latest + 1). The notebook code above takes the simpler route for transactions: it drops repeated `transaction_id` values and keeps the first occurrence.
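
One way to separate the two cases described in this answer (exact repeats versus the same id on different records) is sketched below. It only flags the id collisions and does not reassign ids; the frame `tx` is a hypothetical sample.

```python
import pandas as pd

# Hypothetical sample: TX001 is an exact repeat, TX002 is reused by two different records.
tx = pd.DataFrame({
    "transaction_id": ["TX001", "TX001", "TX002", "TX002"],
    "customer_id": ["C101", "C101", "C102", "C103"],
    "amount": [150.75, 150.75, 25.0, 99.0],
    "timestamp": [
        "2024-07-16 10:00:00", "2024-07-16 10:00:00",
        "2024-07-16 10:01:30", "2024-07-16 10:05:00",
    ],
})

key_cols = ["transaction_id", "customer_id", "amount", "timestamp"]

# Case 1: rows identical on all key columns are true duplicates; keep the first.
no_exact = tx.drop_duplicates(subset=key_cols, keep="first")

# Case 2: ids that still appear more than once belong to different records.
collisions = no_exact[no_exact.duplicated(subset=["transaction_id"], keep=False)]

print(f"Rows after removing exact duplicates: {len(no_exact)}")
print("Transaction ids needing review or reassignment:")
print(collisions)
```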

**Q. Imagine the Simulated Transaction Events list contains millions of entries and cannot be loaded into memory all at once. Describe conceptually how you would handle this scenario (e.g., using Dask, processing in chunks, or a distributed framework like Spark). You don't need to implement this, but explain your approach.**
> A. I would split the dataset into chunks based on a size limit and distribute them with Spark. The same operations run on each chunk, and once the last one finishes the partial results are merged. All of the logic can run per chunk because the desired outcome is not a single polished dataset but a set of alerts and a dataset of fraudulent transactions.
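
The per-chunk idea can be illustrated without Spark. The sketch below uses plain Python instead, and assumes the events arrive as newline-delimited JSON so they can be read a few lines at a time. The helper names (`iter_chunks`, `flag_large`) and the alert threshold of 1000 are hypothetical.

```python
import io
import itertools
import json


def iter_chunks(lines, chunk_size):
    """Yield lists of parsed records, at most chunk_size records per list."""
    iterator = iter(lines)
    while True:
        batch = list(itertools.islice(iterator, chunk_size))
        if not batch:
            return
        yield [json.loads(line) for line in batch if line.strip()]


def flag_large(records, threshold):
    """Hypothetical per-chunk rule: return records whose amount exceeds threshold."""
    return [r for r in records if r["amount"] > threshold]


# Stand-in for a large file; only chunk_size lines are held in memory at once.
source = io.StringIO(
    '{"transaction_id": "TX001", "amount": 150.75}\n'
    '{"transaction_id": "TX002", "amount": 25.0}\n'
    '{"transaction_id": "TX003", "amount": 50.25}\n'
    '{"transaction_id": "TX004", "amount": 1200.0}\n'
    '{"transaction_id": "TX005", "amount": 75.5}\n'
)

alerts = []
chunks_seen = 0
for chunk in iter_chunks(source, chunk_size=2):
    chunks_seen += 1
    alerts.extend(flag_large(chunk, threshold=1000))  # merge partial results

print(f"Processed {chunks_seen} chunks, {len(alerts)} alert(s)")
```

Rules that need to see more than one record at a time, such as duplicate detection, would additionally need state shared across chunks (for example, a set of ids already seen).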


**Q. What if the `amount` occasionally comes with currency symbols (e.g., "£150.75") or uses a comma as a decimal separator (e.g., "150,75")? Describe how you would make your parsing more robust. Implement a basic version of this for the currency symbol in your Python script (e.g., remove "$").**
> A. I would strip the currency symbols and replace decimal commas with dots in the `amount` column using `str.replace()`: a regular expression removes the set of symbols, and a second replacement turns commas into dots. For instance:
`df['amount'].str.replace(r'[£$€]', '', regex=True).str.replace(',', '.', regex=False)`
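
A basic standalone version of this parsing is sketched below. The function name `parse_amount` and the symbol set are assumptions. It also assumes that a comma with no dot present is a decimal separator, so a value like "1,200" would be read as 1.2.

```python
import re

# Assumed set of symbols to strip: common currency signs and whitespace.
_CURRENCY_SYMBOLS = re.compile(r"[£$€\s]")


def parse_amount(value):
    """Return value as a float, or None when it is missing or cannot be parsed."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    text = _CURRENCY_SYMBOLS.sub("", str(value))
    if "," in text and "." not in text:
        # Assumption: a lone comma is a decimal separator ("150,75").
        text = text.replace(",", ".")
    else:
        # Otherwise treat commas as thousands separators ("1,200.50").
        text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


for sample in ["£150.75", "150,75", "$1,200.50", "NULL", 25]:
    print(repr(sample), "->", parse_amount(sample))
```

With pandas, the same function could be applied to a column with `df['amount'].map(parse_amount)`.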

**Q. What if a new column like `payment_processor` suddenly appears in some transaction events, or an expected column like `currency` is sometimes missing? Describe how you would handle such schema variations in a production ETL pipeline (e.g., flexible schema, error logging, schema evolution tools).**

> A. Let's analyze three cases:
1. A new column appears in the dataset. We can apply schema-on-read, enforcing restrictions and validations on the incoming schema at query time. Alternatively, we can let the query engine interpret the incoming schema, including the new column, and use the data.
2. An expected column is missing. We can use the logs to investigate why it is missing, then add it to the dataset with a standard default value.
3. Records are inconsistent, with different columns or column names. We can use a dynamic schema approach: the query engine interprets the schema, and we add extra columns that help explain the inconsistency of the records (such as sourceA, sourceB, etc.). Missing values can be filled with a standard value (such as "NULL").

Another option is to quarantine the affected records so that we can investigate the error, fix them through the ETL pipeline or manually, and then either re-ingest or discard them.
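
The cases above can be sketched per event in plain Python. This is an illustrative sketch rather than a production design: the column sets, the `None` default, the `extras` field, and the function name `normalize_event` are all assumptions.

```python
import logging

logger = logging.getLogger("etl.schema")

# Assumed schema, based on the transaction columns seen in this notebook.
EXPECTED_COLUMNS = ["transaction_id", "customer_id", "amount", "timestamp", "currency", "ip_address"]
REQUIRED_COLUMNS = {"transaction_id", "customer_id", "timestamp"}


def normalize_event(event):
    """Return (clean_event, None), or (None, reason) when the event is quarantined."""
    missing_required = REQUIRED_COLUMNS - set(event)
    if missing_required:
        return None, f"missing required columns: {sorted(missing_required)}"

    extra = set(event) - set(EXPECTED_COLUMNS)
    if extra:
        # New columns are logged and kept aside instead of breaking the load.
        logger.warning("Unexpected columns %s in %s", sorted(extra), event["transaction_id"])

    # Missing optional columns are filled with a default (None here).
    clean = {col: event.get(col) for col in EXPECTED_COLUMNS}
    clean["extras"] = {key: event[key] for key in sorted(extra)}
    return clean, None


events = [
    {"transaction_id": "TX001", "customer_id": "C101", "amount": 150.75,
     "timestamp": "2024-07-16 10:00:00", "ip_address": "192.168.1.10",
     "payment_processor": "Stripe"},                      # new column, no currency
    {"transaction_id": "TX002", "amount": 25.0,
     "timestamp": "2024-07-16 10:01:30"},                 # missing customer_id
]

accepted, quarantined = [], []
for event in events:
    clean, reason = normalize_event(event)
    if clean is None:
        quarantined.append((event, reason))
    else:
        accepted.append(clean)

print(f"accepted={len(accepted)} quarantined={len(quarantined)}")
```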

3.  **Assumptions:**
I assume that the data has column headers, that no other external data is needed, that `customer_id` and `transaction_id` are primary keys, and that all of the data is relevant.