### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
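
The issues named in step 2 can be checked programmatically before choosing a preprocessing method. The sketch below is one possible approach, not a prescribed one: the `summarize_issues` helper, the toy columns (`age_copy`, `currency`), and the 1.5 × IQR outlier rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value, one duplicated column, one constant column
df = pd.DataFrame({
    "amount": [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    "age": [25, 34, 28, 45, 31, 22],
    "age_copy": [25, 34, 28, 45, 31, 22],  # redundant: exact copy of "age"
    "currency": ["USD"] * 6,               # redundant: same value in every row
})


def summarize_issues(frame: pd.DataFrame) -> dict:
    """Report missing values, redundant columns, and IQR-based outlier counts."""
    cols = list(frame.columns)

    # Columns that carry no information because every non-null value is identical
    constant_cols = [c for c in cols if frame[c].nunique(dropna=True) <= 1]

    # Columns whose values exactly repeat an earlier column
    duplicate_cols = [
        later
        for i, later in enumerate(cols)
        if any(frame[later].equals(frame[earlier]) for earlier in cols[:i])
    ]

    # Count values outside 1.5 * IQR for each numeric column (a common rule of thumb)
    outliers = {}
    for col in frame.select_dtypes(include="number").columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        mask = (frame[col] < q1 - 1.5 * iqr) | (frame[col] > q3 + 1.5 * iqr)
        outliers[col] = int(mask.sum())

    return {
        "null_counts": frame.isna().sum().to_dict(),
        "constant_cols": constant_cols,
        "duplicate_cols": duplicate_cols,
        "outliers": outliers,
    }


report = summarize_issues(df)
for key, value in report.items():
    print(f"{key}: {value}")
```

A report like this helps decide which step 3 method applies: imputation for columns with nulls, dropping for constant or duplicated columns, and clipping or a robust scaler for columns with outliers.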

In [1]:
# Write your code from here
import pandas as pd
import numpy as np

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV file
df.to_csv('sample_transactions.csv', index=False)

print("CSV file 'sample_transactions.csv' created successfully!")

CSV file 'sample_transactions.csv' created successfully!


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
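
One way to make the comparison in step 2 fair is to keep the target in its original units for both models and to fit the imputer and scaler on the training split only. The sketch below assumes a synthetic dataset (the `account_balance` column and all generated values are hypothetical) and uses a k-nearest-neighbors regressor, because a distance-based model is sensitive to feature scale, whereas plain linear regression with an intercept predicts the same values whether or not the features are standardized.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features on very different scales
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "customer_age": rng.integers(18, 70, size=n).astype(float),
    "account_balance": rng.normal(50_000, 15_000, size=n),
})
y = 5.0 * X["customer_age"] + 0.01 * X["account_balance"] + rng.normal(0, 10, size=n)

# Blank out roughly 10% of one feature to simulate missing values
missing = rng.random(n) < 0.10
X.loc[missing, "customer_age"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: only the imputation needed for the model to run, no scaling
baseline = make_pipeline(
    SimpleImputer(strategy="mean"),
    KNeighborsRegressor(n_neighbors=5),
)

# Preprocessed: same imputation plus standardization of the features (not the target)
preprocessed = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    KNeighborsRegressor(n_neighbors=5),
)

results = {}
for name, pipe in {"baseline": baseline, "preprocessed": preprocessed}.items():
    # Imputer and scaler statistics are learned from the training split only
    pipe.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, pipe.predict(X_test))

print(results)
```

Because both MSE values are measured against the same unscaled `y_test`, they can be compared directly.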

In [2]:
# Write your code from here
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Step 1: Identify numerical columns for imputation
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Step 2: Handle missing values (Imputation only on numerical columns)
imputer = SimpleImputer(strategy='mean')  # Imputing missing values with the mean
df[numerical_columns] = imputer.fit_transform(df[numerical_columns])

# Step 3: Normalize the numerical columns (Standardization)
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

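# Step 4: Feature Engineering (Example: Create a new feature from existing ones)
# Note: both columns are already standardized at this point, so the ratio divides by
# values close to zero and can blow up (see row 4 in the output below). A ratio built
# from the raw values before scaling is usually easier to interpret.
df['transaction_per_age'] = df['transaction_amount'] / df['customer_age']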

# Check the processed data
print("Processed DataFrame:\n", df)

# Save processed data to a new CSV file
df.to_csv('processed_transactions.csv', index=False)

Processed DataFrame:
    transaction_id  transaction_amount  customer_age transaction_date  \
0        -1.46385           -0.665635     -0.785575       2025-05-01   
1        -0.87831            0.057509      0.426455       2025-05-02   
2        -0.29277           -0.957821     -0.381565       2025-05-03   
3         0.29277            0.000000      1.907826       2025-05-04   
4         0.87831            2.088750      0.022445       2025-05-05   
5         1.46385           -0.522804     -1.189585       2025-05-06   

      status  transaction_per_age  
0  completed             0.847321  
1  completed             0.134854  
2     failed             2.510243  
3  completed             0.000000  
4  completed            93.060801  
5     failed             0.439484  
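
The value 93.06 in row 4 comes from dividing a standardized amount by a standardized age of about 0.022. A possible alternative, sketched below under the assumption that the ratio is meant as "amount per year of age", is to impute first, build the ratio from the raw values, and standardize afterwards. The variable names `raw`, `feature_cols`, and `scaled` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "transaction_amount": [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    "customer_age": [25, 34, 28, 45, 31, 22],
})

# Impute first so the ratio is defined for every row
raw["transaction_amount"] = raw["transaction_amount"].fillna(
    raw["transaction_amount"].mean()
)

# Build the ratio from raw values, where the denominator is never near zero
raw["transaction_per_age"] = raw["transaction_amount"] / raw["customer_age"]

# Standardize all three columns only after the ratio exists
feature_cols = ["transaction_amount", "customer_age", "transaction_per_age"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(raw[feature_cols]),
    columns=feature_cols,
)

print(scaled.round(3))
```

With this ordering the engineered column ends up on the same scale as the other features instead of being dominated by a single extreme value.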


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, None, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create DataFrame
df = pd.DataFrame(data)

# Step 1: Handle missing values (Imputation on numerical columns only)
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
imputer = SimpleImputer(strategy='mean')  # Impute missing values with the mean
df[numerical_columns] = imputer.fit_transform(df[numerical_columns])

# Step 2: Normalize the numerical columns (Standardization)
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Step 3: Feature Engineering (Example: Create a new feature from existing ones)
df['transaction_per_age'] = df['transaction_amount'] / df['customer_age']

# Assume 'transaction_amount' is the target column you want to predict
# Replace 'target_column' with the actual name of your target column in your data
target_column = 'transaction_amount'

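# Step 4: Split data into training and testing sets
# Caution: 'transaction_per_age' was computed from the target, so keeping it in X
# leaks target information into the features and makes the error look smaller than it is.
X = df.drop(columns=[target_column, 'transaction_id', 'transaction_date', 'status'])  # Features (drop non-predictive columns)
y = df[target_column]  # Target variable (transaction amount)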

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train a model (e.g., Linear Regression)
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Predict and evaluate the model
y_pred = model.predict(X_test)

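# Step 7: Calculate Mean Squared Error for model evaluation
# Note: the target was standardized above, so this MSE is in standardized units,
# not in the original transaction currency.
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)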

Mean Squared Error (MSE): 0.12499952036502948


In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

# Sample data for transactions
data = {
    'transaction_id': [101, 102, 103, 104, 105, 106],
    'transaction_amount': [250.75, 500.10, 150.00, None, 1200.50, 300.00],
    'customer_age': [25, 34, 28, 45, 31, 22],
    'transaction_date': ['2025-05-01', '2025-05-02', '2025-05-03', '2025-05-04', '2025-05-05', '2025-05-06'],
    'status': ['completed', 'completed', 'failed', 'completed', 'completed', 'failed']
}

# Create DataFrame
df = pd.DataFrame(data)

# Handle missing values in the target column ('transaction_amount')
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to df at all
df['transaction_amount'] = df['transaction_amount'].fillna(df['transaction_amount'].mean())

# Step 1: Split data into training and testing sets
X = df.drop(columns=['transaction_amount', 'transaction_id', 'transaction_date', 'status'])  # Features
y = df['transaction_amount']  # Target variable (transaction amount)

# Check if there are any NaN values in the target column (y)
if y.isnull().any():
    print("Warning: Missing values found in target column. Filling with mean.")
    y = y.fillna(y.mean())

# Ensure the features (X) have no missing values
X = X.fillna(X.mean())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to train model without preprocessing
def train_model_without_preprocessing(X_train, X_test, y_train, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse

# Step 2: Train model without preprocessing
mse_without_preprocessing = train_model_without_preprocessing(X_train, X_test, y_train, y_test)

# Step 3: Preprocess the data (Imputation and Normalization)
# Handle missing values (Imputation on numerical columns only)
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
imputer = SimpleImputer(strategy='mean')  # Impute missing values with the mean
df[numerical_columns] = imputer.fit_transform(df[numerical_columns])

# Normalize the numerical columns (Standardization)
scaler = StandardScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Recreate the train-test split after preprocessing
X = df.drop(columns=['transaction_amount', 'transaction_id', 'transaction_date', 'status'])  # Features
y = df['transaction_amount']  # Target variable (transaction amount)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to train model with preprocessing
def train_model_with_preprocessing(X_train, X_test, y_train, y_test):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return mse

# Step 4: Train model with preprocessing
mse_with_preprocessing = train_model_with_preprocessing(X_train, X_test, y_train, y_test)

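# Step 5: Compare performance
# Note: the second model was trained on a standardized target, so its MSE is in
# squared standard-deviation units, while the first MSE is in squared currency units.
print("MSE without preprocessing:", mse_without_preprocessing)
print("MSE with preprocessing:", mse_with_preprocessing)

# Step 6: Conclusion based on comparison
# Caution: this check is only meaningful when both MSE values share the same units.
# As written, a smaller second value mostly reflects the rescaled target.
if mse_with_preprocessing < mse_without_preprocessing:
    print("\nData preprocessing improved model performance!")
else:
    print("\nData preprocessing did not improve model performance.")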

MSE without preprocessing: 26742.930732130044
MSE with preprocessing: 0.22492599498537885

Data preprocessing improved model performance!
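
The two MSE values above are not in the same units: the second model predicts a standardized target. Multiplying a standardized-target MSE by the target's variance converts it back to the original units. Here the population variance of the imputed `transaction_amount` is about 118,896.6, and 0.22493 × 118,896.6 ≈ 26,742.9, which equals the MSE without preprocessing. The sketch below reproduces that check; the helper `holdout_mse` and the variable names are illustrative assumptions, and only the two columns that reach the model are kept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "transaction_amount": [250.75, 500.10, 150.00, np.nan, 1200.50, 300.00],
    "customer_age": [25, 34, 28, 45, 31, 22],
})
df["transaction_amount"] = df["transaction_amount"].fillna(
    df["transaction_amount"].mean()
)


def holdout_mse(X, y):
    """Fit a linear regression on a fixed 80/20 split and return the test MSE."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))


# Model on raw values: MSE is in squared currency units
mse_raw = holdout_mse(df[["customer_age"]], df["transaction_amount"])

# Model on standardized feature and target: MSE is in squared standard deviations
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
mse_scaled = holdout_mse(scaled[["customer_age"]], scaled["transaction_amount"])

# StandardScaler divides by the population standard deviation (ddof=0),
# so multiplying by the population variance undoes the rescaling of the error
target_variance = df["transaction_amount"].var(ddof=0)
mse_scaled_in_raw_units = mse_scaled * target_variance

print("MSE on raw data:", mse_raw)
print("MSE on standardized data:", mse_scaled)
print("Standardized MSE converted to raw units:", mse_scaled_in_raw_units)
```

Once both errors are expressed in the same units they coincide, because a linear regression with an intercept makes equivalent predictions under any rescaling of its feature and target. For this model and dataset, standardization alone does not change predictive performance.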
