
In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd

train_df = pd.read_csv('/content/sample_data/california_housing_train.csv')
test_df = pd.read_csv('/content/sample_data/california_housing_test.csv')

In [4]:
print("Training data head:")
display(train_df.head())
print("\nTesting data head:")
display(test_df.head())

Training data head:


,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0



Testing data head:


,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-122.05,37.37,27.0,3885.0,661.0,1537.0,606.0,6.6085,344700.0
1,-118.3,34.26,43.0,1510.0,310.0,809.0,277.0,3.599,176500.0
2,-117.81,33.78,27.0,3589.0,507.0,1484.0,495.0,5.7934,270500.0
3,-118.36,33.82,28.0,67.0,15.0,49.0,11.0,6.1359,330000.0
4,-119.67,36.33,19.0,1241.0,244.0,850.0,237.0,2.9375,81700.0


In [5]:
# Dataset description
print("Number of rows (training set):", train_df.shape[0])
print("Number of features:", train_df.shape[1] - 1) # Assuming one column is the target
print("Target variable:", "median_house_value") # Based on the loaded data
print("\nFeature names:")
print(train_df.drop(columns=['median_house_value']).columns.tolist())

Number of rows (training set): 17000
Number of features: 8
Target variable: median_house_value

Feature names:
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']


In [7]:
from sklearn.model_selection import train_test_split

X = train_df.drop("median_house_value", axis=1)
y = train_df["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (13600, 8)
Test set size: (3400, 8)


In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Train Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Predictions
y_pred_lr = lin_reg.predict(X_test)

# Evaluation
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
mae_lr = mean_absolute_error(y_test, y_pred_lr)

print("Linear Regression RMSE:", rmse_lr)
print("Linear Regression MAE:", mae_lr)


Linear Regression RMSE: 68078.32552452553
Linear Regression MAE: 49983.47465122931
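
For reference, the two error metrics reported above can be reproduced by hand. The following is a minimal sketch in plain Python; the three true and predicted values are made up for illustration and are not taken from the housing data.

```python
import math

# Hypothetical house values and predictions (illustrative only)
y_true_demo = [100_000.0, 200_000.0, 300_000.0]
y_pred_demo = [110_000.0, 190_000.0, 330_000.0]

errors = [p - t for p, t in zip(y_pred_demo, y_true_demo)]

# MAE: mean of the absolute errors
mae_demo = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean squared error
rmse_demo = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"MAE:  {mae_demo:.2f}")
print(f"RMSE: {rmse_demo:.2f}")
```

Because RMSE squares each error before averaging, it weights large misses more heavily and is never smaller than MAE, which is consistent with the RMSE (about 68,078) exceeding the MAE (about 49,983) in the output above.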


In [9]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Columns to apply polynomial features
poly_features = ["median_income", "housing_median_age"]

# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("poly", PolynomialFeatures(degree=2, include_bias=False), poly_features)
    ],
    remainder="passthrough"
)

# Pipeline
poly_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", LinearRegression())
])

# Train
poly_model.fit(X_train, y_train)

# Predict
y_pred_poly = poly_model.predict(X_test)

# Evaluation
rmse_poly = np.sqrt(mean_squared_error(y_test, y_pred_poly))
mae_poly = mean_absolute_error(y_test, y_pred_poly)

print("Polynomial Regression RMSE:", rmse_poly)
print("Polynomial Regression MAE:", mae_poly)

Polynomial Regression RMSE: 67620.01465480258
Polynomial Regression MAE: 49770.00050346282
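
To make the preprocessing step concrete, the sketch below expands two input values the way a degree-2 polynomial expansion without a bias term does: the two original columns, followed by their squares and their product. The helper name `expand_degree2` and the sample values are hypothetical; in the notebook itself this work is done by `PolynomialFeatures`.

```python
def expand_degree2(a, b):
    """Return the degree-2 polynomial terms of (a, b) without a bias column.

    Term order: a, b, a^2, a*b, b^2.
    """
    return [a, b, a * a, a * b, b * b]


# Hypothetical row: median_income = 3.0, housing_median_age = 20.0
terms = expand_degree2(3.0, 20.0)
print(terms)
```

Two input columns become five, so with the six passthrough columns the polynomial model is fit on 11 features instead of the original 8.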


In [10]:
print("Baseline Linear RMSE:", rmse_lr)
print("Polynomial RMSE:", rmse_poly)

print("Baseline Linear MAE:", mae_lr)
print("Polynomial MAE:", mae_poly)


Baseline Linear RMSE: 68078.32552452553
Polynomial RMSE: 67620.01465480258
Baseline Linear MAE: 49983.47465122931
Polynomial MAE: 49770.00050346282
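
To put the difference between the two models in perspective, the sketch below expresses it as a percentage of the baseline error. The metric values are copied from the output above; the helper name `pct_reduction` is introduced here for illustration only.

```python
# Metrics copied from the output above
rmse_lr, rmse_poly = 68078.32552452553, 67620.01465480258
mae_lr, mae_poly = 49983.47465122931, 49770.00050346282


def pct_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


rmse_gain = pct_reduction(rmse_lr, rmse_poly)
mae_gain = pct_reduction(mae_lr, mae_poly)

print(f"RMSE reduction: {rmse_gain:.2f}%")
print(f"MAE reduction:  {mae_gain:.2f}%")
```

Both reductions come out below 1%, so the added polynomial terms improve the fit only marginally on this split.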


In [12]:
# Create a binary target: 1 if the house value is above the median, else 0
median_value = y.median()
y_cls = (y > median_value).astype(int)

# Classification uses the same feature matrix as the regression models
X_cls = X

# Train/Test split
from sklearn.model_selection import train_test_split

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Train Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_c, y_train_c)

# Predict
y_pred_log = log_reg.predict(X_test_c)

# Metrics
print("Logistic Regression Metrics:")
print("Accuracy:", accuracy_score(y_test_c, y_pred_log))
print("Precision:", precision_score(y_test_c, y_pred_log))
print("Recall:", recall_score(y_test_c, y_pred_log))
print("F1:", f1_score(y_test_c, y_pred_log))

print("\nConfusion Matrix:")
confusion_matrix(y_test_c, y_pred_log)


Logistic Regression Metrics:
Accuracy: 0.8376470588235294
Precision: 0.8493552168815943
Recall: 0.830848623853211
F1: 0.84

Confusion Matrix:


array([[1399,  257],
       [ 295, 1449]])
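
The four metrics printed above follow directly from the confusion matrix. As a quick check, the sketch below recomputes them from the four counts, assuming scikit-learn's layout in which rows are actual classes and columns are predicted classes.

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 1399, 257
fn, tp = 295, 1449

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```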


In [15]:
from sklearn.tree import DecisionTreeClassifier

# Train Decision Tree
tree = DecisionTreeClassifier(
    max_depth=5,
    random_state=42
)
tree.fit(X_train_c, y_train_c)

# Predict
y_pred_tree = tree.predict(X_test_c)

# Metrics
print("Decision Tree Metrics:")
print("Accuracy:", accuracy_score(y_test_c, y_pred_tree))
print("Precision:", precision_score(y_test_c, y_pred_tree))
print("Recall:", recall_score(y_test_c, y_pred_tree))
print("F1:", f1_score(y_test_c, y_pred_tree))

print("\nConfusion Matrix:")
confusion_matrix(y_test_c, y_pred_tree)


Decision Tree Metrics:
Accuracy: 0.7864705882352941
Precision: 0.8775964391691394
Recall: 0.6783256880733946
F1: 0.7652005174644243

Confusion Matrix:


array([[1491,  165],
       [ 561, 1183]])
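
The two confusion matrices also show how the classifiers differ in the kind of mistakes they make. The sketch below, which only reuses the counts printed above, computes the false-positive rate and false-negative rate for each model; the dictionary names are introduced here for illustration.

```python
# Confusion matrix counts from the outputs above, as (tn, fp, fn, tp)
models = {
    "Logistic Regression": (1399, 257, 295, 1449),
    "Decision Tree": (1491, 165, 561, 1183),
}

rates = {}
for name, (tn, fp, fn, tp) in models.items():
    fpr = fp / (fp + tn)  # share of class-0 rows wrongly predicted as class 1
    fnr = fn / (fn + tp)  # share of class-1 rows wrongly predicted as class 0
    rates[name] = (fpr, fnr)
    print(f"{name}: FPR={fpr:.3f}, FNR={fnr:.3f}")
```

The depth-5 tree produces fewer false positives than logistic regression but misses nearly twice as many above-median rows, which explains its higher precision and lower recall.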

# Task
Standardize the features in `X_cls` using `StandardScaler`. Then, apply KMeans clustering for `k` values of 3, 4, and 5, calculating and comparing the silhouette score for each to determine the optimal number of clusters. Summarize the findings.

## Standardize Features

### Subtask:
Apply StandardScaler to the feature set (X_cls) to ensure all features contribute equally to the clustering process.


**Reasoning**:
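KMeans assigns points by Euclidean distance, so features measured on large scales (such as `total_rooms`, in the thousands) would dominate features on small scales (such as `median_income`). Standardizing `X_cls` with `StandardScaler` rescales each feature to zero mean and unit variance: import the scaler, instantiate it, then fit and transform the data.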
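
As a rough illustration of what standardization does to a single column, the sketch below applies the z-score formula `(x - mean) / std` in plain Python. The input values are hypothetical, and the population standard deviation (dividing by n) is used, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Hypothetical values for one feature column (illustrative only)
values = [1.0, 2.0, 3.0, 4.0, 5.0]

mu = mean(values)
sigma = pstdev(values)  # population standard deviation (divides by n)

# z-score: subtract the mean, then divide by the standard deviation
z_scores = [(v - mu) / sigma for v in values]
print([round(z, 3) for z in z_scores])
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original units.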



In [16]:
from sklearn.preprocessing import StandardScaler

# Instantiate StandardScaler
scaler = StandardScaler()

# Fit and transform X_cls
X_scaled = scaler.fit_transform(X_cls)

print("Shape of standardized features (X_scaled):", X_scaled.shape)
print("First 5 rows of standardized features:\n", X_scaled[:5])

Shape of standardized features (X_scaled): (17000, 8)
First 5 rows of standardized features:
 [[ 2.619365   -0.67152023 -1.07967114  1.36169494  1.76420407 -0.36118401
  -0.07599796 -1.25254316]
 [ 2.53956878 -0.57326437 -0.76187201  2.29660752  3.23044127 -0.26186523
  -0.09940441 -1.08148298]
 [ 2.4946834  -0.90546278 -0.92077158 -0.88246225 -0.86695622 -0.95535424
  -0.99925206 -1.17010515]
 [ 2.48969614 -0.92885703 -1.15912093 -0.52418613 -0.48023023 -0.79679269
  -0.71577404 -0.36260047]
 [ 2.48969614 -0.96160899 -0.68242223 -0.54574692 -0.5063283  -0.70182999
  -0.62214827 -1.02645443]]


## Apply KMeans and Evaluate Silhouette Score

### Subtask:
Apply KMeans clustering for k values of 3, 4, and 5. For each k, calculate the silhouette score to evaluate the quality of the clustering. Store these scores for comparison.


**Reasoning**:
For each k in {3, 4, 5}, fit KMeans on the standardized features, take the resulting cluster labels, and compute the silhouette score. The score ranges from -1 to 1, with higher values indicating tighter, better-separated clusters. The scores are stored in a dictionary keyed by k for comparison.
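
To show what the silhouette score measures, here is a minimal hand-rolled version for one-dimensional points. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the lowest mean distance to the members of any other cluster; the point's score is `(b - a) / max(a, b)`. The function name `silhouette_mean` and the toy data are hypothetical, and the sketch assumes at least two clusters and no duplicate points.

```python
def silhouette_mean(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    scores = []
    for i, (p, own_label) in enumerate(zip(points, labels)):
        # Distances to the other members of the point's own cluster
        own = [abs(p - q) for j, (q, lb) in enumerate(zip(points, labels))
               if lb == own_label and j != i]
        if not own:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(own) / len(own)
        # b: lowest mean distance to the members of any other cluster
        b = min(
            sum(abs(p - q) for q, lb in zip(points, labels) if lb == other)
            / labels.count(other)
            for other in set(labels) if other != own_label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data: two tight, well-separated groups
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
toy_score = silhouette_mean(points, labels)
print(round(toy_score, 3))
```

Tight, well-separated groups like these score close to 1; overlapping clusters score close to 0.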



In [17]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Dictionary to store silhouette scores
silhouette_scores = {}

k_values = [3, 4, 5]

for k in k_values:
    # Initialize KMeans model
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')

    # Fit the model to the standardized data and get the cluster labels
    cluster_labels = kmeans.fit_predict(X_scaled)

    # Calculate silhouette score
    score = silhouette_score(X_scaled, cluster_labels)

    # Store the score
    silhouette_scores[k] = score

    print(f"Silhouette Score for k={k}: {score}")

print("\nAll Silhouette Scores:", silhouette_scores)

Silhouette Score for k=3: 0.3206753067847732
Silhouette Score for k=4: 0.27151152757225405
Silhouette Score for k=5: 0.27694747654098495

All Silhouette Scores: {3: np.float64(0.3206753067847732), 4: np.float64(0.27151152757225405), 5: np.float64(0.27694747654098495)}


## Summarize Findings

The silhouette scores for the three tested values of k are:

- **k=3**: 0.321
- **k=4**: 0.272
- **k=5**: 0.277

**k=3** yields the highest silhouette score. The score ranges from -1 to 1; higher values mean that points sit closer to their own cluster than to the nearest neighboring cluster. By this metric, **k=3 is the best of the tested values**, although a score near 0.32 indicates only moderately separated clusters.
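
The selection itself can be expressed in one line. The sketch below copies the scores from the output above into a plain dictionary and picks the key with the largest value.

```python
# Silhouette scores copied from the output above
silhouette_scores = {
    3: 0.3206753067847732,
    4: 0.27151152757225405,
    5: 0.27694747654098495,
}

# The key whose score is largest
best_k = max(silhouette_scores, key=silhouette_scores.get)
print(f"Best k: {best_k} (score {silhouette_scores[best_k]:.3f})")
```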

## Final Task

### Subtask:
Summarize the entire clustering task, including the feature standardization, the KMeans application for different k values, and the comparison of silhouette scores, highlighting the optimal k found.


## Summary:

### Q&A
*   **What is the optimal number of clusters for the dataset among k=3, 4, and 5?**
    Based on the silhouette scores, k=3 is the optimal number of clusters, yielding the highest score of approximately 0.321.

### Data Analysis Key Findings
*   The features in `X_cls` were successfully standardized using `StandardScaler`, resulting in an `X_scaled` array with a shape of (17000, 8).
*   KMeans clustering was applied to the standardized data for three different cluster counts: k=3, k=4, and k=5.
*   The silhouette scores for each KMeans model were calculated:
    *   For k=3, the silhouette score was approximately 0.321.
    *   For k=4, the silhouette score was approximately 0.272.
    *   For k=5, the silhouette score was approximately 0.277.
*   Comparing these scores, k=3 produced the highest silhouette score (0.321), indicating a better-defined clustering structure compared to k=4 and k=5 for this dataset.

### Insights or Next Steps
*   k=3 scored best among the tested values, but a silhouette score of about 0.32 points to moderate rather than sharply separated structure, so the characteristics of the three clusters should be inspected before treating them as natural groups.
*   Visualizing the clusters (e.g., using PCA or t-SNE) and testing a broader range of k values or other clustering algorithms would help confirm this result.
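
One possible shape for these next steps is sketched below. It sweeps k from 2 to 8 and projects the standardized data onto two principal components for plotting. To keep the example self-contained it runs on synthetic data from `make_blobs` rather than the housing features, so `X_demo`, the sample size, and the chosen range of k are all assumptions; in the notebook, `X_scaled` would take the place of `X_demo_scaled`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 300 rows, 8 features)
X_demo, _ = make_blobs(n_samples=300, n_features=8, centers=3, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Silhouette score over a wider range of k than the three values tested above
wide_scores = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, random_state=42, n_init=10)
    demo_labels = model.fit_predict(X_demo_scaled)
    wide_scores[k] = silhouette_score(X_demo_scaled, demo_labels)

# Project onto two principal components, e.g. for a 2-D scatter plot
coords = PCA(n_components=2, random_state=42).fit_transform(X_demo_scaled)

for k, s in wide_scores.items():
    print(f"k={k}: {s:.3f}")
print("2-D projection shape:", coords.shape)
```

The columns of `coords` can be passed to a scatter plot colored by cluster label to see whether the clusters separate visually.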
