## Exercise 1: Experiment with Different Models

Objective: To understand how different models perform on the same dataset and to practice replacing models in a pipeline.

Instructions:

- Replace the `Logistic Regression` model with another classifier, such as `KNeighborsClassifier` or `DecisionTreeClassifier`.
- Train and evaluate the new model on the same training and test sets.
- Compare its accuracy with that of the `Logistic Regression` model.

Questions:

- How did the performance of the new model compare to Logistic Regression?
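One way to approach the instructions above is to loop over several candidate classifiers and score each on the same split. The following is a minimal sketch, not the notebook's own solution: the `candidates` dictionary, the use of `make_pipeline`, and the hyperparameters (`max_iter=1000`, `n_neighbors=5`) are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same dataset and split settings as the exercise
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate classifiers; the hyperparameters are illustrative assumptions
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, clf in candidates.items():
    # The pipeline fits the scaler on the training split only
    pipeline = make_pipeline(StandardScaler(), clf)
    pipeline.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Keeping each model inside its own pipeline means swapping a classifier is a one-line change, and every model sees exactly the same training and test rows.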

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

In [34]:
# Load the breast cancer dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

df['target'] = data.target

# Separate features and target variable
X = df.drop(columns='target')
y = df['target']

# Standardize the features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Train a model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.98


In [6]:
# Load the breast cancer dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target


In [7]:
df.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')

In [8]:
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [35]:
# Train a model
model2 = KNeighborsClassifier(n_neighbors=2)
model2.fit(X_train, y_train)

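# Make predictions on the test data with the new model
# (the original cell called model.predict, i.e. the Logistic Regression model)
y_pred = model2.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.98


How did the performance of the new model compare to Logistic Regression?
The recorded accuracy is the same (0.98), but that output was produced with `model.predict`, so it repeats the Logistic Regression score. The KNN accuracy has to be read from a rerun with `model2.predict`.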
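When two models are compared, a single accuracy number can hide which class each model gets wrong. A possible sketch, assuming the same split as above, prints the confusion matrix and per-class report for the KNN model. The names `knn` and `y_pred_knn` are hypothetical and chosen so the predictions of different models cannot be mixed up.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.3, random_state=42
)

# Scale and classify in one pipeline; n_neighbors=2 mirrors the cell above
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
knn.fit(X_train, y_train)

# Model-specific variable name for the predictions
y_pred_knn = knn.predict(X_test)

acc_knn = accuracy_score(y_test, y_pred_knn)
cm = confusion_matrix(y_test, y_pred_knn)

print(f"KNN accuracy: {acc_knn:.2f}")
print(cm)  # rows are true classes, columns are predicted classes
print(classification_report(y_test, y_pred_knn, target_names=data.target_names))
```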

## Exercise 2: Experiment with Different Datasets

Choose one of these datasets and compare the accuracies.

- Wine Dataset

```python
from sklearn.datasets import load_wine
data = load_wine()
```
- Diabetes Dataset
```python
from sklearn.datasets import load_diabetes
data = load_diabetes()
```

In [13]:
# Load the Wine Dataset
from sklearn.datasets import load_wine
data1 = load_wine()
df1 = pd.DataFrame(data1.data, columns=data1.feature_names)

df1['target'] = data1.target

# Separate features and target variable
X = df1.drop(columns='target')
y = df1['target']

# Standardize the features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Train a model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

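# Evaluate the predictions with MAE and R2
# (these are regression metrics: they treat the class labels 0, 1, 2 as numbers)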
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error is {mae} and R2 score is {r2}.")


Mean Absolute Error is 0.018518518518518517 and R2 score is 0.9692657939669892.


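The MAE of 0.0185 equals 1/54, which means one of the 54 test samples was misclassified. That corresponds to an accuracy of about 0.98, the same as on the breast cancer dataset.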
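Because the Wine target is a class label, accuracy can also be computed directly instead of being inferred from MAE. A minimal sketch, assuming the same split settings; `max_iter=1000` and the `make_pipeline` wrapper are assumptions and not part of the cell above, so the exact score may differ slightly from the recorded run.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaler is fitted on the training split only, then applied to the test split
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy is the fraction of test samples whose predicted class is correct
acc = accuracy_score(y_test, y_pred)
n_wrong = int((y_pred != y_test).sum())

print(f"Accuracy: {acc:.2f} ({n_wrong} of {len(y_test)} misclassified)")
```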

In [14]:
# Load the Diabetes Dataset
from sklearn.datasets import load_diabetes
data2 = load_diabetes()
df2 = pd.DataFrame(data2.data, columns=data2.feature_names)

df2['target'] = data2.target

# Separate features and target variable
X = df2.drop(columns='target')
y = df2['target']

# Standardize the features to similar ranges
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

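# Train a model
# (note: the diabetes target is continuous, so LogisticRegression treats each
# distinct target value as a separate class; a regressor is the usual choice)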
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate with regression metrics (MAE and R2)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error is {mae} and R2 score is {r2}.")

Mean Absolute Error is 47.669172932330824 and R2 score is 0.3261491756691697.


In [10]:
y_pred[0]

230.0

In [11]:
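# y_test is a pandas Series, so y_test[0] looks up index label 0, not the first test row;
# y_test.iloc[0] would be the value aligned with y_pred[0]
y_test[0]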

151.0
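Since the Diabetes target is a continuous disease-progression score, a regression model is the more natural fit for this dataset. The sketch below swaps in `LinearRegression`, which is an assumption and not the model used in the cells above. It also compares the first prediction with the first test value by position, using `.iloc`.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# as_frame=True returns a DataFrame of features and a Series target
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Regressor instead of a classifier, because the target is continuous
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error is {mae:.2f} and R2 score is {r2:.2f}.")

# Positional comparison: y_pred[0] corresponds to y_test.iloc[0]
first_pred = y_pred[0]
first_actual = y_test.iloc[0]
print(f"First test row: predicted {first_pred:.1f}, actual {first_actual:.1f}")
```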