### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
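
Step 2 can be sketched as a small audit helper that reports the three issue types named above. This is a minimal sketch, not part of the exercise: `audit_dataframe`, the `iqr_factor` parameter, and the toy values are all hypothetical, and the 1.5 × IQR rule is only one common heuristic for flagging noisy values.

```python
import pandas as pd


def audit_dataframe(df: pd.DataFrame, iqr_factor: float = 1.5) -> dict:
    """Summarize nulls, constant columns, and IQR outliers (hypothetical helper)."""
    # Count missing values per column, keeping only columns that have any.
    null_counts = df.isna().sum()
    nulls = null_counts[null_counts > 0].to_dict()

    # A column with a single distinct non-null value carries no information.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

    # Flag numeric values outside [Q1 - k*IQR, Q3 + k*IQR] as potential noise.
    outliers = {}
    for col in df.select_dtypes(include="number").columns:
        if col in constant_cols:
            continue
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask = (df[col] < q1 - iqr_factor * iqr) | (df[col] > q3 + iqr_factor * iqr)
        if mask.any():
            outliers[col] = df.index[mask].tolist()

    return {"nulls": nulls, "constant": constant_cols, "outliers": outliers}


# Toy frame with illustrative values (assumed, not the exercise dataset).
toy = pd.DataFrame({
    "Age": [25, 30, None, 40, 35, 33],
    "Salary": [50000, 60000, 55000, 58000, 57000, 900000],  # last value is an obvious outlier
    "Constant": [1, 1, 1, 1, 1, 1],
})
report = audit_dataframe(toy)
print(report)
```

The report tells you which preprocessing step each column needs: imputation for columns listed under `nulls`, dropping for `constant`, and a closer look (clipping, correction, or removal) for rows listed under `outliers`.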

In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Sample dataset with common issues: missing values and a zero-variance (redundant) feature
data = {
    'Age': [25, 30, None, 45, 35],
    'Salary': [50000, 60000, 55000, None, 58000],
    'YearsExperience': [1, 5, 3, 10, 7],
    'RedundantFeature': [100, 100, 100, 100, 100],  # No variance, redundant
    'Department': ['Sales', 'Engineering', 'Engineering', 'HR', 'Sales']
}

df = pd.DataFrame(data)

print("Original Data:")
print(df)

# Step 1: Handle missing values (imputation)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# Step 2: Drop redundant feature (zero variance)
df = df.drop(columns=['RedundantFeature'])

# Step 3: Standardize numerical features to zero mean and unit variance (Age, Salary, YearsExperience)
scaler = StandardScaler()
df[['Age', 'Salary', 'YearsExperience']] = scaler.fit_transform(df[['Age', 'Salary', 'YearsExperience']])

# Step 4: Encode categorical feature (Department) using one-hot encoding
df = pd.get_dummies(df, columns=['Department'])

print("\nPreprocessed Data:")
print(df)


Original Data:
    Age   Salary  YearsExperience  RedundantFeature   Department
0  25.0  50000.0                1               100        Sales
1  30.0  60000.0                5               100  Engineering
2   NaN  55000.0                3               100  Engineering
3  45.0      NaN               10               100           HR
4  35.0  58000.0                7               100        Sales

Preprocessed Data:
        Age    Salary  YearsExperience  Department_Engineering  Department_HR  \
0 -1.322876 -1.706750        -1.344387                   False          False   
1 -0.566947  1.261511        -0.064018                    True          False   
2  0.000000 -0.222620        -0.704203                    True          False   
3  1.700840  0.000000         1.536443                   False           True   
4  0.188982  0.667859         0.576166                   False          False   

   Department_Sales  
0              True  
1             False  
2             False  
3             False  
4              True  
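
To see where the scaled numbers come from, the `Age` column can be reproduced by hand. The sketch below assumes only what the cell above does: mean imputation followed by `StandardScaler`, which computes z = (x - mean) / std using the population standard deviation (`ddof=0`).

```python
import numpy as np

# Age after mean imputation: the missing entry became 33.75,
# the mean of the four observed ages (25, 30, 45, 35).
age = np.array([25.0, 30.0, 33.75, 45.0, 35.0])

# StandardScaler divides by the population standard deviation (ddof=0),
# not the sample standard deviation that pandas' .std() uses by default.
z = (age - age.mean()) / age.std(ddof=0)
print(np.round(z, 6))
```

Rounded to six decimals, these values agree with the `Age` column in the preprocessed output. The imputed row lands at exactly 0 because it was filled with the column mean.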


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Compare performance metrics (accuracy, precision, recall, F1) for the two models to measure the effect of preprocessing.

In [2]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sample dataset with issues
data = {
    'Age': [25, 30, None, 45, 35, 40, None, 29, 50, 38],
    'Salary': [50000, 60000, 55000, None, 58000, 62000, 61000, None, 70000, 67000],
    'YearsExperience': [1, 5, 3, 10, 7, 8, 6, 4, 15, 9],
    'Department': ['Sales', 'Engineering', 'Engineering', 'HR', 'Sales', 'Engineering', 'HR', 'Sales', 'HR', 'Engineering'],
    'LeftCompany': [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # Target variable (binary classification)
}

df = pd.DataFrame(data)

# Split features and target
X = df.drop('LeftCompany', axis=1)
y = df['LeftCompany']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# ----- Model WITHOUT preprocessing (baseline) -----
# The baseline still needs minimal cleanup to run at all: LogisticRegression
# rejects NaN and string inputs, so rows with missing values are dropped and
# only the numeric columns are used (Department is ignored).
X_train_simple = X_train.dropna().select_dtypes(include=['number'])
y_train_simple = y_train.loc[X_train_simple.index]

X_test_simple = X_test.dropna().select_dtypes(include=['number'])
y_test_simple = y_test.loc[X_test_simple.index]

model_simple = LogisticRegression(max_iter=1000)
model_simple.fit(X_train_simple, y_train_simple)

y_pred_simple = model_simple.predict(X_test_simple)
print("Performance WITHOUT preprocessing:")
print(f"Accuracy: {accuracy_score(y_test_simple, y_pred_simple):.4f}")
print(classification_report(y_test_simple, y_pred_simple))


# ----- Model WITH preprocessing pipeline -----
numeric_features = ['Age', 'Salary', 'YearsExperience']
categorical_features = ['Department']

# Preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

# Full pipeline with classifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

model_pipeline.fit(X_train, y_train)
y_pred_pipeline = model_pipeline.predict(X_test)

print("\nPerformance WITH preprocessing:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_pipeline):.4f}")
print(classification_report(y_test, y_pred_pipeline))


Performance WITHOUT preprocessing:
Accuracy: 0.3333
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      0.50      0.50         2

    accuracy                           0.33         3
   macro avg       0.25      0.25      0.25         3
weighted avg       0.33      0.33      0.33         3


Performance WITH preprocessing:
Accuracy: 0.6667
              precision    recall  f1-score   support

           0       0.50      1.00      0.67         1
           1       1.00      0.50      0.67         2

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3
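
Both reports above rest on a three-row test set, so a single misclassified row moves accuracy by a third. One way to get a steadier estimate is cross-validation. The sketch below is an assumption layered on the exercise, not part of it: it reuses the same toy data and preprocessing pipeline and scores it with `StratifiedKFold` (3 folds chosen because the minority class has only four rows).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as the cell above.
df = pd.DataFrame({
    'Age': [25, 30, None, 45, 35, 40, None, 29, 50, 38],
    'Salary': [50000, 60000, 55000, None, 58000, 62000, 61000, None, 70000, 67000],
    'YearsExperience': [1, 5, 3, 10, 7, 8, 6, 4, 15, 9],
    'Department': ['Sales', 'Engineering', 'Engineering', 'HR', 'Sales',
                   'Engineering', 'HR', 'Sales', 'HR', 'Engineering'],
    'LeftCompany': [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})
X = df.drop(columns='LeftCompany')
y = df['LeftCompany']

# Numeric columns: mean imputation, then standardization.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['Age', 'Salary', 'YearsExperience']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Department']),
])
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep both classes in every training split. Because the
# preprocessing lives inside the pipeline, it is refit on each training fold,
# so no statistics leak from the held-out rows.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

With ten rows the fold scores will still vary widely, so treat the mean and its spread as a sanity check rather than a conclusion; the same pattern carries over unchanged to a realistically sized dataset.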

