# Exercise 4.16 (logistic regression only)
---

I am running a logistic regression model using sklearn. First I will split the data into training and test sets for model evaluation. I will also be using StandardScaler to standardize the features before applying the logistic regression model.

I am going to use the entire set of features at first and then I will train and test the model with three different subsets.
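The plan above (split, standardize, fit, then repeat on feature subsets) can be sketched as one small helper. This is a minimal sketch, not the code used in the cells below: the helper name `evaluate_subset`, the column names `a`, `b`, `c`, and the synthetic data are all assumptions made for illustration. It wraps the scaler and the classifier in a pipeline so that the scaler is always refit on exactly the columns being tested.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_subset(X_train, X_test, y_train, y_test, columns):
    """Fit StandardScaler + LogisticRegression on `columns`; return test accuracy.

    Hypothetical helper, not part of the original notebook.
    """
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipeline.fit(X_train[columns], y_train)
    return accuracy_score(y_test, pipeline.predict(X_test[columns]))


# Synthetic stand-in data (an assumption; the notebook itself reads Boston.csv).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = (X_demo["a"] + 0.5 * X_demo["b"] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=9)

# Compare the full feature set with two subsets, mirroring the plan above.
for cols in (["a", "b", "c"], ["a", "b"], ["c"]):
    print(cols, round(evaluate_subset(Xtr, Xte, ytr, yte, cols), 3))
```

With the real data, the same helper could be called with `X_train`, `X_test`, `y_train`, `y_test` and each list of Boston column names.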

You can skip to the end of this page to see the summary and interpretation of my results.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
df = pd.read_csv('Boston.csv')
crim_median = df['crim'].median()
df['high_crime'] = (df['crim'] > crim_median).astype(int)
print(df.head())

   Unnamed: 0     crim    zn  indus  chas    nox     rm   age     dis  rad  \
0           1  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1   
1           2  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2   
2           3  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2   
3           4  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3   
4           5  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3   

   tax  ptratio  lstat  medv  high_crime  
0  296     15.3   4.98  24.0           0  
1  242     17.8   9.14  21.6           0  
2  242     17.8   4.03  34.7           0  
3  222     18.7   2.94  33.4           0  
4  222     18.7   5.33  36.2           0  


In [None]:
X = df.drop(columns=['Unnamed: 0', 'crim', 'high_crime'])
y = df['high_crime']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=9)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 0.875
[[74  6]
 [13 59]]
              precision    recall  f1-score   support

           0       0.85      0.93      0.89        80
           1       0.91      0.82      0.86        72

    accuracy                           0.88       152
   macro avg       0.88      0.87      0.87       152
weighted avg       0.88      0.88      0.87       152



In [None]:
subset_1 = ['rm', 'age', 'tax']
X_train_subset1 = X_train[subset_1]
X_test_subset1 = X_test[subset_1]

# Scale subset features
X_train_scaled_1 = scaler.fit_transform(X_train_subset1)
X_test_scaled_1 = scaler.transform(X_test_subset1)

# Fit logistic regression
model.fit(X_train_scaled_1, y_train)
y_pred_1 = model.predict(X_test_scaled_1)

# Evaluate performance
print("Subset 1 Accuracy:", accuracy_score(y_test, y_pred_1))
print(confusion_matrix(y_test, y_pred_1))
print(classification_report(y_test, y_pred_1))

Subset 1 Accuracy: 0.8421052631578947
[[71  9]
 [15 57]]
              precision    recall  f1-score   support

           0       0.83      0.89      0.86        80
           1       0.86      0.79      0.83        72

    accuracy                           0.84       152
   macro avg       0.84      0.84      0.84       152
weighted avg       0.84      0.84      0.84       152



In [None]:
subset_2 = ['indus', 'nox', 'ptratio', 'tax']
X_train_subset2 = X_train[subset_2]
X_test_subset2 = X_test[subset_2]

# Scale subset features
X_train_scaled_2 = scaler.fit_transform(X_train_subset2)
X_test_scaled_2 = scaler.transform(X_test_subset2)

# Fit logistic regression
model.fit(X_train_scaled_2, y_train)
y_pred_2 = model.predict(X_test_scaled_2)

# Evaluate performance
print("Subset 2 Accuracy:", accuracy_score(y_test, y_pred_2))
print(confusion_matrix(y_test, y_pred_2))
print(classification_report(y_test, y_pred_2))

Subset 2 Accuracy: 0.8947368421052632
[[75  5]
 [11 61]]
              precision    recall  f1-score   support

           0       0.87      0.94      0.90        80
           1       0.92      0.85      0.88        72

    accuracy                           0.89       152
   macro avg       0.90      0.89      0.89       152
weighted avg       0.90      0.89      0.89       152



In [None]:
subset_3 = ['dis', 'rad', 'medv', 'zn']
X_train_subset3 = X_train[subset_3]
X_test_subset3 = X_test[subset_3]

# Scale subset features
X_train_scaled_3 = scaler.fit_transform(X_train_subset3)
X_test_scaled_3 = scaler.transform(X_test_subset3)

# Fit logistic regression
model.fit(X_train_scaled_3, y_train)
y_pred_3 = model.predict(X_test_scaled_3)

# Evaluate performance
print("Subset 3 Accuracy:", accuracy_score(y_test, y_pred_3))
print(confusion_matrix(y_test, y_pred_3))
print(classification_report(y_test, y_pred_3))

Subset 3 Accuracy: 0.8223684210526315
[[67 13]
 [14 58]]
              precision    recall  f1-score   support

           0       0.83      0.84      0.83        80
           1       0.82      0.81      0.81        72

    accuracy                           0.82       152
   macro avg       0.82      0.82      0.82       152
weighted avg       0.82      0.82      0.82       152



### Results
---
### Original set (all features)
Confusion Matrix

|            | Predicted 0 | Predicted 1 |
|------------|-------------|-------------|
| Actual 0   | 74          | 6           |
| Actual 1   | 13          | 59          |

Classification Report

| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| 0          | 0.85      | 0.93   | 0.89     | 80      |
| 1          | 0.91      | 0.82   | 0.86     | 72      |
| **Accuracy**   |        |        | 0.88     | 152     |
| **Macro Avg**  | 0.88   | 0.87   | 0.87     | 152     |
| **Weighted Avg** | 0.88   | 0.88   | 0.87   | 152     |
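As a sanity check, the numbers in the classification report can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts shown above for the full-feature model; the function name `class_metrics` is an assumption introduced for illustration.

```python
# Confusion matrix for the full-feature model, copied from the table above.
# Rows are actual classes, columns are predicted classes.
cm = [[74, 6],
      [13, 59]]


def class_metrics(cm, k):
    """Return (precision, recall, f1) for class k of a square confusion matrix."""
    true_positives = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: everything predicted as k
    actual_k = sum(cm[k])                    # row sum: everything that really is k
    precision = true_positives / predicted_k
    recall = true_positives / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
print(f"accuracy = {accuracy:.3f}")

for k in (0, 1):
    precision, recall, f1 = class_metrics(cm, k)
    print(f"class {k}: precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For class 1, for example, precision is 59 / (59 + 6) and recall is 59 / (59 + 13), which round to the 0.91 and 0.82 shown in the report.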


### Subset 1 (features: rm, age, tax)

Confusion Matrix

|            | Predicted 0 | Predicted 1 |
|------------|-------------|-------------|
| Actual 0   | 71          | 9           |
| Actual 1   | 15          | 57          |

Classification Report

| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| 0          | 0.83      | 0.89   | 0.86     | 80      |
| 1          | 0.86      | 0.79   | 0.83     | 72      |
| **Accuracy**   |        |        | 0.84     | 152     |
| **Macro Avg**  | 0.84   | 0.84   | 0.84     | 152     |
| **Weighted Avg** | 0.84   | 0.84   | 0.84   | 152     |

### Subset 2 (features: indus, nox, ptratio, tax)

Confusion Matrix

|            | Predicted 0 | Predicted 1 |
|------------|-------------|-------------|
| Actual 0   | 75          | 5           |
| Actual 1   | 11          | 61          |

Classification Report

| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| 0          | 0.87      | 0.94   | 0.90     | 80      |
| 1          | 0.92      | 0.85   | 0.88     | 72      |
| **Accuracy**   |        |        | 0.89     | 152     |
| **Macro Avg**  | 0.90   | 0.89   | 0.89     | 152     |
| **Weighted Avg** | 0.90 | 0.89   | 0.89     | 152     |


### Subset 3 (features: dis, rad, medv, zn)
Confusion Matrix

|            | Predicted 0 | Predicted 1 |
|------------|-------------|-------------|
| Actual 0   | 67          | 13          |
| Actual 1   | 14          | 58          |

Classification Report

| Class      | Precision | Recall | F1-Score | Support |
|------------|-----------|--------|----------|---------|
| 0          | 0.83      | 0.84   | 0.83     | 80      |
| 1          | 0.82      | 0.81   | 0.81     | 72      |
| **Accuracy**   |        |        | 0.82     | 152     |
| **Macro Avg**  | 0.82   | 0.82   | 0.82     | 152     |
| **Weighted Avg** | 0.82   | 0.82   | 0.82   | 152     |



### Results summary
| Model Set   | Accuracy | Precision (0) | Recall (0) | F1-Score (0) | Precision (1) | Recall (1) | F1-Score (1) |
|-------------|----------|---------------|------------|--------------|---------------|------------|--------------|
| Original    | 0.875    | 0.85          | 0.93       | 0.89         | 0.91          | 0.82       | 0.86         |
| Subset 1    | 0.8421   | 0.83          | 0.89       | 0.86         | 0.86          | 0.79       | 0.83         |
| Subset 2    | **0.8947** | **0.87**      | **0.94**   | **0.90**     | **0.92**      | **0.85**   | **0.88**     |
| Subset 3    | 0.8224   | 0.83          | 0.84       | 0.83         | 0.82          | 0.81       | 0.81         |



*   Subset 2 has the highest accuracy (0.8947) and the best precision, recall, and F1-score for both classes, including class 1 (above-median crime rate).
*   The original set (all features) performs well, with an accuracy of 0.875 and similar F1-scores for the two classes (0.89 and 0.86).
*   Subset 1 performs slightly worse than the original set, particularly in recall for class 1, which drops from 0.82 to 0.79.
*   Subset 3 has the lowest accuracy (0.8224) and lower precision and recall than the original set for both classes.





Subset 2's strong performance (accuracy of 0.8947 with only four features) suggests that environmental (nox) and socio-economic factors (indus, ptratio, tax) provide valuable information for classifying crime rates.
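The ranking should be read with some caution, because the test set has only 152 observations: Subset 2 classifies 136 of them correctly and the original set 133, a gap of three observations. A rough way to gauge this is the normal-approximation standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below applies it to the accuracies from the summary table; the function name is an assumption, and the interval ignores the fact that all four models share the same test set, so it is only an approximate check.

```python
import math

n_test = 152  # test-set size, from the support column of the reports above

# Accuracies copied from the results summary table.
accuracies = {
    "Original": 0.875,
    "Subset 1": 0.8421,
    "Subset 2": 0.8947,
    "Subset 3": 0.8224,
}


def accuracy_standard_error(accuracy, n):
    """Binomial (normal-approximation) standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


for name, acc in accuracies.items():
    half_width = 1.96 * accuracy_standard_error(acc, n_test)
    print(f"{name}: {acc:.4f} +/- {half_width:.3f} (approx. 95% interval)")
```

The half-widths come out at roughly 0.05 to 0.06, which is larger than the differences between most of the models, so repeating the comparison over several random splits or with cross-validation would give a firmer ranking.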


# Exercise 6.11 (for LASSO and ridge only)
---

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler


In [None]:
# Load the Boston dataset
df = pd.read_csv('Boston.csv')
X = df.drop(columns=['Unnamed: 0', 'crim'])
y = df['crim']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Standardize the data (important for both Lasso and Ridge, whose penalties depend on feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


**LASSO**



In [None]:
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
y_pred = lasso.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print("Lasso Coefficients:", lasso.coef_)


Mean Squared Error: 47.31442490271197
Lasso Coefficients: [ 0.68552554 -0.         -0.22260961 -0.62485611  0.14489728  0.10394735
 -1.41375269  4.50887328  0.         -0.36020619  0.07751059 -1.59727234]


**Ridge**

In [None]:
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
y_pred = ridge.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print("Ridge Coefficients:", ridge.coef_)

Mean Squared Error: 46.72695922739183
Ridge Coefficients: [ 0.98599621 -0.00601307 -0.28694681 -1.47310971  0.40794745  0.35019343
 -2.07855074  5.13433673 -0.39439014 -0.7171805   0.1435563  -2.124527  ]


The results for both Ridge and Lasso regression show similar performance in terms of Mean Squared Error:

*   Lasso Regression MSE: 47.31
*   Ridge Regression MSE: 46.73

The MSE values are quite close, indicating that both models are performing similarly for predicting the per capita crime rate in the dataset.

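Lasso Regression: Some coefficients are set to zero, which means certain features are effectively removed from the model. Matching the coefficients to the column order of X (zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, lstat, medv), the features 'indus' and 'tax' have been set to zero, indicating that Lasso did some feature selection. Note that 'rad' has the largest coefficient (4.51), so it is the strongest predictor rather than a removed one. The slightly higher MSE may be due to the removal of some useful predictors, but it also simplifies the model by reducing the number of active variables.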
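One way to check which features Lasso dropped is to pair each printed coefficient with its column name. The sketch below hard-codes the coefficient values from the Lasso output above and the column order shown by `df.head()` after `Unnamed: 0` and `crim` are dropped; in the notebook itself the same pairing could be done with `X.columns` and `lasso.coef_`.

```python
# Column order of X after dropping 'Unnamed: 0' and 'crim' (from df.head()).
features = ["zn", "indus", "chas", "nox", "rm", "age",
            "dis", "rad", "tax", "ptratio", "lstat", "medv"]

# Coefficients copied from the Lasso output above.
lasso_coef = [0.68552554, -0.0, -0.22260961, -0.62485611, 0.14489728, 0.10394735,
              -1.41375269, 4.50887328, 0.0, -0.36020619, 0.07751059, -1.59727234]

# Features whose coefficient is exactly zero were removed by the L1 penalty.
zeroed = [name for name, coef in zip(features, lasso_coef) if coef == 0]
print("Zeroed features:", zeroed)

# Remaining features, ordered by absolute coefficient size.
active = sorted(
    ((name, coef) for name, coef in zip(features, lasso_coef) if coef != 0),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, coef in active:
    print(f"{name:>8}: {coef: .4f}")
```

Because the features were standardized before fitting, the absolute sizes of the remaining coefficients are roughly comparable with one another.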

Ridge Regression: This model retains all features and shrinks the coefficients to reduce overfitting, which can explain its slightly lower MSE compared to Lasso. Ridge doesn't perform feature selection but can handle correlated predictors better.
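The shrinkage described here can be illustrated with the closed-form ridge solution, beta = (X^T X + alpha I)^(-1) X^T y, for centered data. The sketch below is an illustration on synthetic data with two deliberately correlated columns (an assumption, not the Boston data), and `ridge_coefficients` is a hypothetical helper rather than sklearn's implementation.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution for centered X and y (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data: column 1 is nearly a copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=100)

# Center both so that no intercept is needed.
X = X - X.mean(axis=0)
y = y - y.mean()

# Larger alpha gives a smaller coefficient vector, but never exact zeros.
for alpha in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_coefficients(X, y, alpha)
    print(f"alpha={alpha:>6}: beta={np.round(beta, 3)}, norm={np.linalg.norm(beta):.3f}")
```

As alpha grows, the coefficient norm shrinks and the weight tends to be shared between the two correlated columns, but no coefficient is set exactly to zero, which is the contrast with Lasso.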

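The MSE of both regression models is fairly high (about 47, which corresponds to an RMSE of roughly 6.8 to 6.9), indicating that further tuning, such as choosing alpha by cross-validation instead of fixing it in advance, would be needed to obtain more effective regression models.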
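One possible form of that tuning is to let sklearn's `LassoCV` and `RidgeCV` pick alpha by cross-validation on the training set. The sketch below is a minimal example under stated assumptions: it runs on synthetic data from `make_regression` rather than Boston.csv, and the alpha grid and 5-fold setting are arbitrary illustrative choices, so the printed numbers say nothing about the crime-rate results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 12 features (an assumption for illustration).
X, y = make_regression(n_samples=300, n_features=12, n_informative=5,
                       noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Same preprocessing as in the notebook: fit the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Candidate penalty strengths (an illustrative grid).
alphas = np.logspace(-3, 2, 30)

lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_train_scaled, y_train)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_train_scaled, y_train)

for name, model in (("Lasso", lasso_cv), ("Ridge", ridge_cv)):
    mse = mean_squared_error(y_test, model.predict(X_test_scaled))
    print(f"{name}: chosen alpha={model.alpha_:.4g}, test MSE={mse:.2f}")
```

In the notebook, the same two estimators could be fit on the existing `X_train_scaled` and `y_train` and then compared against the fixed-alpha results above.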
