# **Data Science Bootcamp, Spring 2025**
---
**Take-Home Assignment #7** \
Cherron Griffith \
Due Date: April 20, 2025

---
---

### **Logistic Regression**
---
1. Try different thresholds for computing predictions. The default is 0.5. Use the `predict_proba` function to compute probabilities, then try custom thresholds and observe their impact on accuracy, precision, and recall.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set up data for model
glass = pd.read_csv('glass.csv')
glass.Type.value_counts().sort_index()
glass['household'] = glass.Type.map({1:0, 2:0, 3:0, 5:1, 6:1, 7:1})
glass.sort_values(by = 'Al', inplace = True)
X = np.array(glass.Al).reshape(-1, 1)
y = glass.household

In [None]:
from sklearn.linear_model import LogisticRegression

# Create logistic regression model
logreg = LogisticRegression()
logreg.fit(X,y)
pred = logreg.predict(X)
logreg.coef_, logreg.intercept_
glass_probs = logreg.predict_proba(X)[:, 1]
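
For a single feature, `predict_proba` in binary logistic regression applies the sigmoid to `coef_ * x + intercept_`. The sketch below uses made-up parameter values (`w` and `b` are hypothetical, not the fitted values from this notebook) to show that link, and how a probability threshold translates into a cutoff on the feature itself.

```python
import math

# Hypothetical parameters for illustration only; the fitted values
# live in logreg.coef_ and logreg.intercept_.
w, b = 4.0, -7.0

def sigmoid(z):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_positive(x):
    """P(y = 1 | x) for a one-feature logistic model."""
    return sigmoid(w * x + b)

def feature_cutoff(t):
    """Value of x at which the predicted probability equals t (needs w != 0)."""
    logit = math.log(t / (1.0 - t))
    return (logit - b) / w

# With w > 0, a higher probability threshold means a higher cutoff on x.
for t in (0.3, 0.5, 0.7):
    x_cut = feature_cutoff(t)
    print(f"threshold {t}: predict class 1 when x >= {x_cut:.3f}")
```

Read this way, changing the threshold simply moves the decision boundary along the feature axis.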

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Apply different thresholds to make predictions 
thresholds = [0.3, 0.5, 0.7]
for t in thresholds:
    glass_pred = (glass_probs >= t).astype(int)
    print(f"Threshold: {t}")
    print("Accuracy:", accuracy_score(y_true = y, y_pred = glass_pred))
    print("Precision:", precision_score(y_true = y, y_pred = glass_pred))
    print("Recall:", recall_score(y_true = y, y_pred = glass_pred))
    print("---")

Threshold: 0.3
Accuracy: 0.8644859813084113
Precision: 0.72
Recall: 0.7058823529411765
---
Threshold: 0.5
Accuracy: 0.8691588785046729
Precision: 0.896551724137931
Recall: 0.5098039215686274
---
Threshold: 0.7
Accuracy: 0.8364485981308412
Precision: 1.0
Recall: 0.3137254901960784
---


**Conclusion:** Lowering the threshold from 0.5 to 0.3 raises recall (0.51 to 0.71) but lowers precision (0.90 to 0.72) and leaves accuracy almost unchanged (0.869 to 0.864). Raising the threshold to 0.7 pushes precision to 1.0 but cuts recall to 0.31 and lowers accuracy to 0.836.
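
The trade-off follows directly from how the three metrics are built from the confusion-matrix counts. As a minimal sketch, the function below computes them by hand for a given threshold; `threshold_metrics` and the toy labels and probabilities are invented for illustration and are not the glass data.

```python
def threshold_metrics(y_true, probs, t):
    """Return (accuracy, precision, recall) for the rule: predict 1 when prob >= t."""
    preds = [1 if p >= t else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when nothing is predicted or nothing is positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels and probabilities, invented for illustration.
toy_y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
toy_p = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.65, 0.80, 0.90]

for t in (0.3, 0.5, 0.7):
    acc, prec, rec = threshold_metrics(toy_y, toy_p, t)
    print(f"t={t}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

On this toy data a higher threshold removes false positives first (precision rises) and then starts discarding true positives (recall falls), which is the same pattern as in the output above.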

2. Do the same analysis for other columns.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set up data for model
glass = pd.read_csv('glass.csv')
glass.head()
glass.Type.value_counts().sort_index()
glass['window'] = glass.Type.map({1:1, 2:1, 3:1, 5:0, 6:0, 7:0})
glass.sort_values( by = 'Al', inplace=True)
X = np.array(glass.Al).reshape(-1,1)
y = glass.window

In [8]:
from sklearn.linear_model import LogisticRegression

# Create logistic regression model
logreg = LogisticRegression()
logreg.fit(X,y)
pred = logreg.predict(X)
logreg.coef_, logreg.intercept_
glass_probs = logreg.predict_proba(X)[:, 1]

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Apply different thresholds to make predictions 
thresholds = [0.3, 0.5, 0.7]
for t in thresholds:
    glass_pred = (glass_probs >= t).astype(int)
    print(f"Threshold: {t}")
    print("Accuracy:", accuracy_score(y_true = y, y_pred = glass_pred))
    print("Precision:", precision_score(y_true = y, y_pred = glass_pred))
    print("Recall:", recall_score(y_true = y, y_pred = glass_pred))
    print("---")

Threshold: 0.3
Accuracy: 0.8364485981308412
Precision: 0.8232323232323232
Recall: 1.0
---
Threshold: 0.5
Accuracy: 0.8691588785046729
Precision: 0.8648648648648649
Recall: 0.9815950920245399
---
Threshold: 0.7
Accuracy: 0.8644859813084113
Precision: 0.9085365853658537
Recall: 0.9141104294478528
---


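**Conclusion:** The same trade-off appears for the `window` target. Lowering the threshold to 0.3 raises recall to 1.0 but lowers precision (0.86 to 0.82) and accuracy (0.869 to 0.836). Raising it to 0.7 raises precision to 0.91 and lowers recall (0.98 to 0.91), with only a slight drop in accuracy (0.869 to 0.864).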
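
The `window` target is the complement of `household`, so a one-feature logistic model fit to it should assign probabilities close to `1 - p`. Under that assumption, a threshold `t` on `window` acts like a threshold `1 - t` on `household`, which would explain why the accuracy at 0.3 here (0.8364) equals the accuracy at 0.7 in part 1, and the accuracy at 0.7 here (0.8645) equals the accuracy at 0.3 there. The sketch below checks the identity on invented scores; a score sitting exactly on the threshold is the one case where it can break.

```python
def accuracy_at(y_true, probs, t):
    """Accuracy of the rule: predict 1 when prob >= t."""
    preds = [1 if p >= t else 0 for p in probs]
    return sum(int(y == p) for y, p in zip(y_true, preds)) / len(y_true)

# Invented labels and class-1 probabilities (no score sits on a threshold).
household = [0, 0, 0, 1, 0, 1, 1, 1]
p_household = [0.10, 0.20, 0.35, 0.45, 0.60, 0.65, 0.80, 0.95]

# Complementary target and complementary probabilities.
window = [1 - y for y in household]
p_window = [1 - p for p in p_household]

for t in (0.3, 0.5, 0.7):
    a_window = accuracy_at(window, p_window, t)
    a_household = accuracy_at(household, p_household, 1 - t)
    print(f"window @ {t}: {a_window:.3f}   household @ {1 - t:.1f}: {a_household:.3f}")
```

Precision and recall do not mirror in the same way, because flipping the labels also swaps which class counts as positive.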

3. Fit a logistic regression model on all features. Remember to preprocess the data (e.g., normalization and one-hot encoding).

In [28]:
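from sklearn.model_selection import train_test_split

# Drop the target and also `Type`: `window` is derived from `Type`,
# so keeping it as a feature would leak the label into the model.
X = glass.drop(columns=['window', 'Type'])
y = glass.window
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)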


In [29]:
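from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Create model: standardize the features, then fit logistic regression
window_model = make_pipeline(StandardScaler(), LogisticRegression())
window_model.fit(X_train, y_train)

# Get probabilities for class 1
y_pred_probs = window_model.predict_proba(X_test)
y_scores = y_pred_probs[:, 1]
y_true = y_test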
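
The prompt asks for normalization before fitting. As a minimal sketch of what standardization does, the block below uses NumPy and invented numbers rather than the glass data: the mean and standard deviation are estimated on the training rows only and then reused for the test rows, so no test information enters the fit. scikit-learn packages the same idea as `StandardScaler`; the array names here are hypothetical.

```python
import numpy as np

# Invented feature matrix: two columns on very different scales.
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 300.0],
                        [3.0, 500.0]])
X_test_toy = np.array([[2.0, 200.0]])

# Estimate the statistics on the training rows only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# Apply the same transformation to both splits.
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print(X_train_std.mean(axis=0))  # approximately 0 for each column
print(X_train_std.std(axis=0))   # approximately 1 for each column
print(X_test_std)
```

Standardizing matters for the default `LogisticRegression` because its L2 penalty treats all coefficients alike, so features on large scales are otherwise penalized differently from features on small scales.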

4. Plot ROC Curves for each model.

In [30]:
from sklearn.metrics import roc_curve, auc

# Calculate FPR, TPR, and thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Calculate AUC
roc_auc = auc(fpr, tpr)

In [None]:
# Plot ROC curve 
plt.figure()
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
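
To show what `roc_curve` and `auc` summarize, the sketch below builds the ROC points by hand on four invented examples and integrates them with the trapezoid rule. The helper names (`roc_points`, `trapezoid_auc`) are hypothetical, and the code assumes distinct scores; it is meant as an illustration of the idea, not a replacement for the scikit-learn functions.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one score at a time.

    Assumes all scores are distinct.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Invented labels and scores.
toy_y = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_fpr, toy_tpr = roc_points(toy_y, toy_scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)

# Same quantity as the share of (positive, negative) pairs ranked correctly.
pos = [s for s, y in zip(toy_scores, toy_y) if y == 1]
neg = [s for s, y in zip(toy_scores, toy_y) if y == 0]
pair_auc = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(toy_auc, pair_auc)
```

Both numbers come out to 0.75 on this toy data, which is why AUC can be read as the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one.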

### **Clustering**
---
1. Repeat the above exercise for different values of k.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn import cluster, datasets, preprocessing, metrics
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
df = pd.read_csv("iris.csv")

In [14]:
cols = df.columns[:-1]
X_scaled = preprocessing.MinMaxScaler().fit_transform(df[cols])

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias = []
silhouettes = []
k_values = range(2,5)

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)  # fixed seed so reruns give the same clusters
    kmeans.fit(X_scaled)
    
    labels = kmeans.labels_
    inertia = kmeans.inertia_
    silhouette = silhouette_score(X_scaled, labels)

    # Record the metrics for this k
    inertias.append(inertia)
    silhouettes.append(silhouette)

    print(f"k = {k}")
    print(f"  Inertia: {inertia:.2f}")
    print(f"  Silhouette Score: {silhouette:.4f}")
    print("---")

k = 2
  Inertia: 12.14
  Silhouette Score: 0.6295
---
k = 3
  Inertia: 7.14
  Silhouette Score: 0.4825
---
k = 4
  Inertia: 6.18
  Silhouette Score: 0.3792
---
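
One way to read the printed numbers is to compare how much inertia drops with each extra cluster against where the silhouette score peaks. The sketch below hard-codes the values reported above and does only that arithmetic; it is a reading aid under that assumption, not a selection rule.

```python
# Values copied from the output above.
results = {
    2: {"inertia": 12.14, "silhouette": 0.6295},
    3: {"inertia": 7.14, "silhouette": 0.4825},
    4: {"inertia": 6.18, "silhouette": 0.3792},
}

ks = sorted(results)

# Drop in inertia gained by going from the previous k to this k.
inertia_drops = {
    k: round(results[prev]["inertia"] - results[k]["inertia"], 2)
    for prev, k in zip(ks, ks[1:])
}

# k with the highest silhouette score among those tried.
best_silhouette_k = max(ks, key=lambda k: results[k]["silhouette"])

print("Inertia drops:", inertia_drops)
print("Highest silhouette at k =", best_silhouette_k)
```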


- **How do the inertia and silhouette scores change?**

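    Both decreased as k increased. Inertia fell from 12.14 (k = 2) to 7.14 (k = 3) to 6.18 (k = 4), which is expected because more centroids generally bring points closer to their nearest centroid. The silhouette score fell from 0.6295 to 0.4825 to 0.3792, so the clusters became less well separated as k grew.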

- **What if you don't scale your features?**

    If you do not scale your features, the features with larger numeric ranges dominate the distance calculations, so the clustering is driven mostly by those features rather than by all of them equally.

- **Is there a "right" k? Why or why not?**

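    There is no single "right" k; it depends on the data and on the purpose of the clustering. The Elbow Method (where the drop in inertia levels off) and the peak silhouette score are common guides, and they need not agree: in the results above the silhouette score is highest at k = 2, while the largest drop in inertia occurs between k = 2 and k = 3.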
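
To make the scaling answer concrete, the sketch below compares Euclidean distances before and after min-max scaling on four invented points whose second feature has a much larger range. The points and helper names are made up for illustration; `MinMaxScaler` in the code above performs the same per-column rescaling.

```python
import numpy as np

# Invented points: column 0 ranges over [0, 1], column 1 over [0, 1000].
points = np.array([[0.0, 0.0],      # A
                   [0.1, 300.0],    # B
                   [1.0, 50.0],     # C
                   [1.0, 1000.0]])  # D (sets the range of column 1)

def min_max_scale(X):
    """Rescale each column to [0, 1] (assumes no constant columns)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    return (X - col_min) / (col_max - col_min)

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

scaled = min_max_scale(points)

# Distances from A to B and from A to C, before and after scaling.
raw_AB, raw_AC = dist(points[0], points[1]), dist(points[0], points[2])
scaled_AB, scaled_AC = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print(f"raw:    d(A,B)={raw_AB:.3f}  d(A,C)={raw_AC:.3f}")
print(f"scaled: d(A,B)={scaled_AB:.3f}  d(A,C)={scaled_AC:.3f}")
```

On the raw values C looks closer to A than B does, purely because of the second column; after scaling, B is the closer point. A distance-based method such as k-means could therefore group the same points differently depending on whether the features were scaled.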