### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
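
Step 2 can be sketched with a few pandas checks before any cleaning is applied. The snippet below is a minimal illustration, not a complete audit: the frame `df_raw` and its `Constant` column are hypothetical and exist only to show what each check reports.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
df_raw = pd.DataFrame({
    "Feature1": [1.0, 2.0, None, 4.0, 5.0],
    "Feature2": [10, 20, 30, None, 50],
    "Constant": [1, 1, 1, 1, 1],
})

# Count missing values per column
null_counts = df_raw.isna().sum()

# Columns with at most one distinct value carry no information for a model
constant_cols = [c for c in df_raw.columns if df_raw[c].nunique(dropna=True) <= 1]

# Count rows that are exact duplicates of an earlier row
n_duplicates = int(df_raw.duplicated().sum())

print(null_counts)
print("Constant columns:", constant_cols)
print("Duplicate rows:", n_duplicates)
```

Checks like these only flag candidates; whether a constant or duplicated entry is really redundant depends on the dataset.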

In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Sample dataset with common data issues
data = {
    'Feature1': [1.0, 2.0, None, 4.0, 5.0],
    'Feature2': [10, 20, 30, None, 50],
    'Feature3': ['A', 'B', 'A', 'B', None]
}
df = pd.DataFrame(data)

# Step 1: Handle missing values in the numeric columns
# (mean imputation is applied per column, so both can be fitted in one call)
imputer = SimpleImputer(strategy='mean')
df[['Feature1', 'Feature2']] = imputer.fit_transform(df[['Feature1', 'Feature2']])

# Step 2: Normalize numerical features
scaler = StandardScaler()
df[['Feature1', 'Feature2']] = scaler.fit_transform(df[['Feature1', 'Feature2']])

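# Step 3: Encode categorical features
# Note: get_dummies ignores missing values by default, so the row with a
# missing Feature3 gets no indicator set and, with drop_first=True, looks
# identical to category 'A'.
df = pd.get_dummies(df, columns=['Feature3'], drop_first=True)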

# Display the preprocessed dataset
print(df)

   Feature1  Feature2  Feature3_B
0 -1.414214 -1.322876       False
1 -0.707107 -0.566947        True
2  0.000000  0.188982       False
3  0.707107  0.000000        True
4  1.414214  1.700840       False
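
In the output above, row 4 had no `Feature3` value, yet it is encoded as `Feature3_B = False`, the same as category `A`. The sketch below shows two possible ways to keep that missingness visible; the `"Missing"` placeholder label is an assumption chosen for illustration, not something the exercise prescribes.

```python
import pandas as pd

cat = pd.DataFrame({"Feature3": ["A", "B", "A", "B", None]})

# Default encoding: the missing row has no indicator set at all
default = pd.get_dummies(cat, columns=["Feature3"])

# Option 1: dummy_na=True appends an explicit indicator column for missing values
with_flag = pd.get_dummies(cat, columns=["Feature3"], dummy_na=True)

# Option 2: replace missing entries with a placeholder label, then encode
filled = cat.fillna({"Feature3": "Missing"})
encoded = pd.get_dummies(filled, columns=["Feature3"])

print(default)
print(with_flag)
print(encoded)
```

Either option keeps the information that a value was absent, which matters if the encoded column is later used as a feature or a target.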


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
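
One way to set up the with/without comparison from the steps above is to train two pipelines on the same split. This is a sketch under stated assumptions: the data comes from `make_classification` and is deliberately degraded (one feature rescaled, roughly 10% of entries blanked), and the "baseline" still zero-fills missing values because `LogisticRegression` cannot be fitted on NaNs. None of this is the notebook's own dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=42)

# Degrade the data: stretch one feature's scale and blank out ~10% of entries
rng = np.random.default_rng(42)
X[:, 0] *= 1000
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: only the minimum needed to make the model run (zero-fill, no scaling)
baseline = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    LogisticRegression(max_iter=1000),
)

# Preprocessed: mean imputation followed by standardization
preprocessed = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

scores = {}
for name, pipe in [("baseline", baseline), ("preprocessed", preprocessed)]:
    pipe.fit(X_train, y_train)  # imputer and scaler are fitted on training data only
    scores[name] = accuracy_score(y_test, pipe.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

Wrapping the preprocessing in a pipeline means the imputer and scaler statistics come from the training split only, so the test split does not leak into the fitted transformers.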

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

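# Define features (X) and target (y)
# The target is the one-hot indicator for category 'B' of Feature3; the row
# whose Feature3 was missing is labeled False.
X = df[['Feature1', 'Feature2']]
y = df['Feature3_B']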

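# Split the data into training and testing sets
# With only 5 rows, test_size=0.2 leaves a single test sample, so the
# accuracy below can only be 0.00 or 1.00.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)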

# Train a simple Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Model Accuracy: 0.00
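
An accuracy of 0.00 here reflects a single misclassified test sample rather than a reliable estimate, since the 5-row dataset leaves only one row for testing. A more stable estimate usually needs more data and repeated splits. The sketch below uses stratified 5-fold cross-validation on a synthetic dataset; `X_demo` and `y_demo` are hypothetical stand-ins, not the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Scaling lives inside the pipeline so it is re-fitted on each training fold
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Stratified folds keep the class balance similar across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Mean accuracy: {cv_scores.mean():.2f} (+/- {cv_scores.std():.2f})")
```

Reporting the mean and spread across folds gives a sense of how much the score depends on which rows happen to land in the test set.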
