### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
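
Step 2 can be made concrete with a few pandas checks. The sketch below is a minimal example on a small hypothetical DataFrame (the column names and the duplicated `amount_copy` column are made up for illustration and are not part of the exercise data). It counts nulls per column, counts duplicate rows, and flags constant or exactly duplicated columns as candidates for redundant features. It does not attempt to detect noisy values.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical sample, made up for illustration only
df = pd.DataFrame({
    'amount': [250.75, 500.10, np.nan, 500.10],
    'amount_copy': [250.75, 500.10, np.nan, 500.10],  # exact duplicate of 'amount'
    'currency': ['USD', 'USD', 'USD', 'USD'],          # constant column
    'age': [25, 34, 28, 34],
})

# Null values per column
null_counts = df.isna().sum()

# Fully duplicated rows (every column equal to an earlier row)
duplicate_rows = int(df.duplicated().sum())

# Redundant feature candidates: constant columns and exact duplicate columns.
# Series.equals treats NaNs in the same position as equal.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
duplicate_cols = [b for a, b in combinations(df.columns, 2) if df[a].equals(df[b])]

print(null_counts)
print("duplicate rows:", duplicate_rows)
print("constant columns:", constant_cols)
print("duplicate columns:", duplicate_cols)
```

On real data, exact-duplicate checks are usually only a starting point; highly correlated columns may also be redundant and would need a separate check.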

In [1]:
# Build a small sample transactions dataset and save it to CSV
import pandas as pd
import numpy as np

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV file
df.to_csv('sample_transactions.csv', index=False)

print("CSV file 'sample_transactions.csv' created successfully!")

CSV file 'sample_transactions.csv' created successfully!


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
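
The two steps above could look roughly like the following. This is a sketch under stated assumptions: the data is synthetic (generated with NumPy, not the transactions sample, which has too few rows to split), the `make_data` helper is hypothetical, and the "without preprocessing" baseline fills missing values with zero because scikit-learn's `LogisticRegression` does not accept NaN. It only illustrates the comparison workflow; the scores carry no meaning beyond this toy setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_data(n_rows=400, seed=0):
    """Hypothetical helper: synthetic features with injected missing values."""
    rng = np.random.default_rng(seed)
    amount = rng.normal(500, 300, n_rows)   # large-scale feature
    age = rng.normal(35, 10, n_rows)        # small-scale feature
    # Label depends on both features, plus some noise
    y = ((amount / 300 + age / 10 + rng.normal(0, 1, n_rows)) > 5.2).astype(int)
    X = np.column_stack([amount, age])
    # Knock out roughly 15% of the amounts to simulate missing data
    X[rng.random(n_rows) < 0.15, 0] = np.nan
    return X, y


X, y = make_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: no real preprocessing, missing values crudely replaced by 0
baseline = LogisticRegression(max_iter=1000)
baseline.fit(np.nan_to_num(X_train, nan=0.0), y_train)
acc_raw = accuracy_score(y_test, baseline.predict(np.nan_to_num(X_test, nan=0.0)))

# Preprocessed: mean imputation and standardization, fitted on the training split only
clean_model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
clean_model.fit(X_train, y_train)
acc_clean = accuracy_score(y_test, clean_model.predict(X_test))

print(f"Accuracy without preprocessing: {acc_raw:.3f}")
print(f"Accuracy with preprocessing:    {acc_clean:.3f}")
```

Wrapping the imputer and scaler in a pipeline keeps their statistics from leaking out of the test split, so the two accuracy figures are compared on equal footing.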

In [2]:
# Preprocess the transactions data: impute, standardize, and engineer a feature
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Identify numerical columns for imputation.
# Caveat: select_dtypes also matches transaction_id, so the identifier is
# imputed and scaled together with the real features.
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Step 2: Handle missing values (mean imputation on numerical columns only)
imputer = SimpleImputer(strategy='mean')
df[numerical_columns] = imputer.fit_transform(df[numerical_columns])

# Step 3: Standardize the numerical columns (zero mean, unit variance)
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Step 4: Feature engineering (example: derive a new feature from existing ones).
# Caveat: both inputs are already standardized here, so this ratio divides by
# values close to zero and is not a meaningful amount-per-age figure.
df['transaction_per_age'] = df['transaction_amount'] / df['customer_age']

# Check the processed data
print("Processed DataFrame:\n", df)

# Save processed data to a new CSV file
df.to_csv('processed_transactions.csv', index=False)

Processed DataFrame:
    transaction_id  transaction_amount  customer_age transaction_date  \
0        -1.46385           -0.665635     -0.785575       2025-05-01   
1        -0.87831            0.057509      0.426455       2025-05-02   
2        -0.29277           -0.957821     -0.381565       2025-05-03   
3         0.29277            0.000000      1.907826       2025-05-04   
4         0.87831            2.088750      0.022445       2025-05-05   
5         1.46385           -0.522804     -1.189585       2025-05-06   

      status  transaction_per_age  
0  completed             0.847321  
1  completed             0.134854  
2     failed             2.510243  
3  completed             0.000000  
4  completed            93.060801  
5     failed             0.439484  
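
In the output above, `transaction_id` has been standardized and `transaction_per_age` is a ratio of two standardized values, which is why the row with index 4 jumps to 93.06. One possible alternative ordering, shown here as a sketch rather than the intended solution, keeps the identifier out of the numeric feature list and derives the ratio from the imputed raw values before scaling. The names `feature_columns`, `scaled_columns`, and `ratio_raw` are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
})

# Only real features are imputed and scaled; the identifier is left alone
feature_columns = ['transaction_amount', 'customer_age']

# Impute first, while values are still in their original units
df[feature_columns] = SimpleImputer(strategy='mean').fit_transform(df[feature_columns])

# Engineer the ratio from raw (imputed) values, before any scaling
df['transaction_per_age'] = df['transaction_amount'] / df['customer_age']
ratio_raw = df['transaction_per_age'].copy()  # kept only to inspect the unscaled ratio

# Standardize the original features and the engineered one together
scaled_columns = feature_columns + ['transaction_per_age']
df[scaled_columns] = StandardScaler().fit_transform(df[scaled_columns])

print(ratio_raw.round(2).tolist())
print(df)
```

Whether the engineered ratio should itself be standardized depends on the downstream model; it is scaled here only so that all three features end up on a comparable scale.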
