---
## Part 1: Setup and Imports

First, let's import all the necessary libraries.

In [1]:
# Standard imports
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Check Python version
print(f"Python version: {sys.version}")
assert sys.version_info >= (3, 7), "Python 3.7 or above is required"

# Scikit-learn imports
import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

# Set plot defaults
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

Python version: 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:49:36) [Clang 16.0.6 ]
Scikit-learn version: 1.2.2


---
## Part 2: Load and Explore the Data

### 2.1 Load the Dataset

📝 **Your Task:** Load the CSV file containing the transcription data.

💡 **Hint:** Use `pd.read_csv()` to load the data from `Data_AUG_13.11.2024_output.csv`

In [None]:
# TODO: Load the dataset
# data = pd.read_csv(___)

# Your code here:

### 2.2 Take a Quick Look at the Data Structure

📝 **Your Task:** Explore the basic structure of the data:
1. Display the first few rows
2. Check the data types and non-null counts
3. Get statistical summary

💡 **Hint:** Use `.head()`, `.info()`, and `.describe()` methods

In [None]:
# TODO: Display the first 5 rows
# Your code here:

In [None]:
# TODO: Check data types and missing values
# Your code here:

In [None]:
# TODO: Get statistical summary of numerical columns
# Your code here:

### 2.3 Understand the Target Variable

📝 **Your Task:** Explore the target variable `Class_label`:
1. Check the unique values
2. Check the class distribution (value counts)
3. Visualize the class distribution

💡 **Hint:** Use `.value_counts()` and `.plot.bar()`

🤔 **Think About:** Is this a balanced or imbalanced dataset? How might this affect your model?
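
To see why class balance matters, here is a minimal sketch using made-up labels (not the course dataset). It assumes a hypothetical 90/10 split between two classes and shows that a "model" which always predicts the majority class can look accurate while finding none of the minority class.

```python
from collections import Counter

# Hypothetical, made-up labels: 90 samples of class 0 and 10 of class 1.
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# A "model" that always predicts the majority class, ignoring the features.
baseline_predictions = [majority_class] * len(labels)
baseline_accuracy = sum(
    pred == true for pred, true in zip(baseline_predictions, labels)
) / len(labels)

# Recall on the minority class: how many of the class-1 samples were found?
minority_recall = sum(
    pred == 1 and true == 1 for pred, true in zip(baseline_predictions, labels)
) / counts[1]

print(f"Class counts: {dict(counts)}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Minority-class recall: {minority_recall:.2f}")
```

If your own class counts are skewed like this, accuracy alone will probably not tell you much, which is one reason to look at precision and recall as well.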

In [None]:
# TODO: Check the distribution of the target variable (Class_label)
# Your code here:

In [None]:
# TODO: Visualize the class distribution with a bar plot
# Your code here:

### 2.4 Explore the Features

📝 **Your Task:** Identify which columns are:
- **Numerical features** (can be used directly)
- **Categorical features** (may need encoding)
- **Text columns** (transcriptions - may need special processing)
- **ID/metadata columns** (should be excluded from training)

💡 **Hint:** Look at columns like `token_count`, `type_token_ratio`, `filler_count`, etc.

In [None]:
# TODO: List all column names
# Your code here:

In [None]:
# TODO: Identify numerical columns that could be good features
# Hint: Columns like 'filler_count', 'token_count', 'type_count', 
# 'type_token_ratio', 'content_density', 'sentence_count', etc.

# Your code here:

### 2.5 Visualize Feature Distributions

📝 **Your Task:** Create histograms for the numerical features.

💡 **Hint:** Use `df[numerical_columns].hist()` like we did in Chapter 2

In [None]:
# TODO: Create histograms for numerical features
# Your code here:

---
## Part 3: Data Preparation

### 3.1 Select Features for Classification

📝 **Your Task:** Select the features you want to use for classification.

Consider using these numerical features:
- `filler_count` - Number of filler words (um, uh, etc.)
- `token_count` - Total number of words
- `type_count` - Number of unique words
- `type_token_ratio` - Vocabulary diversity measure
- `content_density` - Content word ratio
- `sentence_count` - Number of sentences
- `average_words_per_sentence` - Sentence complexity
- `Age` - Patient age
- `Converted-MMSE` - Cognitive assessment score

💡 **Hint:** Create a list of feature column names and extract them into X

In [None]:
# TODO: Define your feature columns
# feature_columns = [___]

# Your code here:

In [None]:
# TODO: Create X (features) and y (target)
# X = data[feature_columns]
# y = data['Class_label']

# Your code here:

In [None]:
# TODO: Check the shapes of X and y
# Your code here:

### 3.2 Handle Missing Values

📝 **Your Task:** Check for and handle missing values in your features.

💡 **Hint:** Use `.isnull().sum()` to check, and consider using `.fillna()` or `SimpleImputer` from sklearn
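
As a rough illustration of the check-then-fill pattern, the sketch below uses a tiny made-up table (the values are invented; only the column names are borrowed from the feature list). It counts missing entries per column and then fills each gap with that column's median.

```python
import numpy as np
import pandas as pd

# Hypothetical, made-up feature table with two missing entries.
X_demo = pd.DataFrame({
    "token_count": [120.0, np.nan, 95.0, 140.0],
    "filler_count": [3.0, 7.0, np.nan, 5.0],
})

# Number of missing values in each column.
print(X_demo.isnull().sum())

# Replace each missing value with the median of its own column.
X_filled = X_demo.fillna(X_demo.median())
print(X_filled)
```

The median is often preferred over the mean here because it is less sensitive to outliers, but either choice is a judgment call for your data.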

In [None]:
# TODO: Check for missing values in X
# Your code here:

In [None]:
# TODO: Handle missing values if any exist
# Option 1: Fill with median
# X = X.fillna(X.median())

# Option 2: Use SimpleImputer
# from sklearn.impute import SimpleImputer
# imputer = SimpleImputer(strategy='median')
# X = imputer.fit_transform(X)

# Your code here:

### 3.3 Create Train/Test Split

📝 **Your Task:** Split the data into training and test sets.

⚠️ **Important:** Use **stratified sampling** to maintain class proportions!

💡 **Hint:** Use `train_test_split()` with `stratify=y` parameter (like in Chapter 2)
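
To make the effect of `stratify` concrete, here is a small sketch on made-up data (100 invented samples with an assumed 80/20 class split, not the course dataset). The helper `class_share` is a hypothetical name introduced only for this illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, made-up data: 100 samples, 80 of class 0 and 20 of class 1.
X_demo = [[i] for i in range(100)]
y_demo = [0] * 80 + [1] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

def class_share(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in sorted(counts.items())}

print("Full set: ", class_share(y_demo))
print("Train set:", class_share(y_tr))
print("Test set: ", class_share(y_te))
```

With stratification, the class shares in the training and test sets should stay close to those of the full set; you can run the same kind of check on your own split.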

In [None]:
from sklearn.model_selection import train_test_split

# TODO: Split the data with stratification
# X_train, X_test, y_train, y_test = train_test_split(
#     X, y, test_size=0.2, random_state=42, stratify=y
# )

# Your code here:

In [None]:
# TODO: Verify the split - check shapes and class distribution in both sets
# Your code here:

### 3.4 Feature Scaling

📝 **Your Task:** Scale the features using StandardScaler.

⚠️ **Important:**
- Fit the scaler on training data only!
- Transform both training and test data with the same scaler

💡 **Hint:** Use `StandardScaler` from sklearn.preprocessing
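
To show what "fit on training data only" means, the sketch below reproduces the standardization arithmetic by hand with NumPy on made-up numbers. It is an illustration of the idea, not a replacement for `StandardScaler`; the variable names are invented for this example.

```python
import numpy as np

# Hypothetical, made-up values standing in for one feature column.
train_demo = np.array([[10.0], [20.0], [30.0], [40.0]])
test_demo = np.array([[25.0], [50.0]])

# "Fit": learn the statistics from the training data only.
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "Transform": apply the same training statistics to both sets.
train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Notice that the scaled test values are not guaranteed to have mean 0 and standard deviation 1. That is expected, because the test set is measured against statistics it did not contribute to.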

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO: Create scaler, fit on training data, transform both sets
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_test_scaled = scaler.transform(X_test)

# Your code here:

---
## Part 4: Training Binary Classifiers

Now let's train some classifiers! Recall from Chapter 3 that we can choose among various algorithms.

### 4.1 Train an SGDClassifier

📝 **Your Task:** Train an SGDClassifier on the training data.

💡 **Hint:** Like in Chapter 3, use `SGDClassifier(random_state=42)`

In [None]:
from sklearn.linear_model import SGDClassifier

# TODO: Create and train the SGDClassifier
# sgd_clf = SGDClassifier(random_state=42)
# sgd_clf.fit(X_train_scaled, y_train)

# Your code here:

In [None]:
# TODO: Make predictions on a few samples from the test set
# Your code here:

### 4.2 Train a Random Forest Classifier

📝 **Your Task:** Train a RandomForestClassifier.

💡 **Hint:** Random Forests often perform well and can work without scaling

In [None]:
from sklearn.ensemble import RandomForestClassifier

# TODO: Create and train the RandomForestClassifier
# forest_clf = RandomForestClassifier(random_state=42)
# forest_clf.fit(X_train, y_train)  # Note: RF doesn't need scaled features

# Your code here:

### 4.3 (Optional) Try Other Classifiers

📝 **Your Task:** Try at least one more classifier.

Options to try:
- `LogisticRegression`
- `SVC` (Support Vector Classifier)
- `KNeighborsClassifier`
- Feel free to be creative and try others too, such as a Hugging Face model like BERT or DistilBERT for text classification

In [None]:
# TODO: Try another classifier of your choice
# from sklearn.linear_model import LogisticRegression
# log_clf = LogisticRegression(random_state=42, max_iter=1000)
# log_clf.fit(X_train_scaled, y_train)

# Your code here:

---
## Part 5: Performance Evaluation

### 5.1 Cross-Validation Accuracy

📝 **Your Task:** Evaluate your classifiers using cross-validation.

💡 **Hint:** Use `cross_val_score()` with `cv=5` (5-fold cross-validation)
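
If it helps to picture what 5-fold cross-validation does to the data, here is a sketch on made-up labels (20 invented samples). It uses `StratifiedKFold` directly, which is the splitter scikit-learn applies when `cv=5` is passed for a classifier, and checks that every sample is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical, made-up labels: 20 samples, 15 of class 0 and 5 of class 1.
y_demo = np.array([0] * 15 + [1] * 5)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

# For a classifier, cv=5 in cross_val_score corresponds to 5 stratified folds.
skf = StratifiedKFold(n_splits=5)

validation_counts = np.zeros(len(y_demo), dtype=int)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: train={len(train_idx)}, validation={len(val_idx)}, "
          f"positives in validation={y_demo[val_idx].sum()}")

# Every sample should land in a validation fold exactly once.
print(validation_counts)
```

Each of the five scores returned by `cross_val_score()` comes from training on four folds and scoring on the remaining one.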

In [None]:
from sklearn.model_selection import cross_val_score

# TODO: Evaluate SGDClassifier with cross-validation
# scores = cross_val_score(sgd_clf, X_train_scaled, y_train, cv=5, scoring='accuracy')
# print(f"SGD Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Your code here:

In [None]:
# TODO: Evaluate RandomForestClassifier with cross-validation
# Your code here:

### 5.2 Confusion Matrix

📝 **Your Task:** Generate and visualize the confusion matrix.

💡 **Hint:** Use `cross_val_predict()` to get predictions, then `confusion_matrix()` or `ConfusionMatrixDisplay`

🤔 **Think About:** What do the different quadrants mean?
- True Positives (TP): Correctly predicted positive class
- True Negatives (TN): Correctly predicted negative class
- False Positives (FP): Incorrectly predicted as positive
- False Negatives (FN): Incorrectly predicted as negative
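
The four quadrants can be counted by hand. The sketch below does so in plain Python on made-up labels, assuming that `1` is the positive class, and arranges the counts in the same layout that `confusion_matrix()` uses for labels `[0, 1]`.

```python
# Hypothetical, made-up labels; 1 is treated as the positive class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, actually 1
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted 0, actually 0
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, actually 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, actually 1

# Same layout as sklearn's confusion_matrix for labels [0, 1]:
# rows are actual classes, columns are predicted classes.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # prints [[3, 1], [1, 3]]
```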

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# TODO: Get predictions using cross-validation
# y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=5)

# Your code here:

In [None]:
# TODO: Display the confusion matrix
# Option 1: Simple matrix
# cm = confusion_matrix(y_train, y_train_pred)
# print(cm)

# Option 2: Visual display
# ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred)
# plt.show()

# Your code here:

### 5.3 Precision, Recall, and F1-Score

📝 **Your Task:** Calculate precision, recall, and F1-score for your classifiers.

💡 **Hint:** Use `precision_score()`, `recall_score()`, and `f1_score()` from sklearn.metrics

🤔 **Think About:**
- **Precision** = TP / (TP + FP) - When we predict positive, how often are we correct?
- **Recall** = TP / (TP + FN) - Of all actual positives, how many did we catch?
- **F1** = Harmonic mean of precision and recall
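
As a quick worked example of these formulas, the sketch below plugs in made-up counts (they are invented for illustration and do not come from the dataset).

```python
# Hypothetical, made-up counts from a confusion matrix.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30 / 40
recall = tp / (tp + fn)     # 30 / 50
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"Precision: {precision:.3f}")  # 0.750
print(f"Recall:    {recall:.3f}")     # 0.600
print(f"F1:        {f1:.3f}")         # 0.667
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values than a simple average (0.675) would.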

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

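# TODO: Calculate and print precision, recall, and F1 score
# Note: these functions treat the label 1 as the positive class by default;
# if Class_label uses other values, pass pos_label=<your positive label>.
# precision = precision_score(y_train, y_train_pred)
# recall = recall_score(y_train, y_train_pred)
# f1 = f1_score(y_train, y_train_pred)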

# Your code here:

### 5.4 Precision-Recall Curve

📝 **Your Task:** Plot the precision-recall curve.

💡 **Hint:**
1. Get decision scores using `cross_val_predict()` with `method='decision_function'` or `method='predict_proba'`
2. Use `precision_recall_curve()` to compute the curve
3. Plot precisions vs recalls
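
To get a feel for what `precision_recall_curve()` returns before plotting it, here is a sketch on four made-up labels and scores (invented values, with higher scores assumed to mean "more likely positive").

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical, made-up labels and scores (higher score = more likely positive).
y_true = np.array([0, 0, 1, 1])
y_scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores_demo)

# precisions and recalls have one more entry than thresholds: the final
# point (precision=1, recall=0) has no corresponding threshold.
for p, r, t in zip(precisions, recalls, thresholds):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold generally lowers recall, while precision can move in either direction, which is why the curve is worth plotting rather than guessing.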

In [None]:
from sklearn.metrics import precision_recall_curve

# TODO: Get decision scores (for classifiers that support it)
# For SGDClassifier:
# y_scores = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=5, method='decision_function')

# For RandomForestClassifier (uses predict_proba):
# y_probas = cross_val_predict(forest_clf, X_train, y_train, cv=5, method='predict_proba')
# y_scores = y_probas[:, 1]  # Get probability of positive class

# Your code here:

In [None]:
# TODO: Compute and plot the precision-recall curve
# precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores)
# plt.plot(recalls, precisions)
# plt.xlabel('Recall')
# plt.ylabel('Precision')
# plt.title('Precision-Recall Curve')
# plt.grid(True)
# plt.show()

# Your code here:

### 5.5 ROC Curve and AUC

📝 **Your Task:** Plot the ROC curve and calculate the AUC score.

💡 **Hint:** Use `roc_curve()` and `roc_auc_score()` from sklearn.metrics

🤔 **Think About:**
- ROC shows True Positive Rate vs False Positive Rate
- AUC of 0.5 = random classifier, AUC of 1.0 = perfect classifier
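
One way to build intuition for AUC: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below checks this by hand on made-up labels and scores; it is an illustration of that interpretation, not how `roc_auc_score()` is implemented.

```python
# Hypothetical, made-up labels and scores (higher score = more likely positive).
y_true = [0, 0, 1, 1]
y_scores_demo = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, t in zip(y_scores_demo, y_true) if t == 1]
neg_scores = [s for s, t in zip(y_scores_demo, y_true) if t == 0]

# Count positive/negative pairs where the positive is ranked higher;
# ties count as half a win.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
auc_demo = wins / (len(pos_scores) * len(neg_scores))
print(f"AUC: {auc_demo:.2f}")  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.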

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

# TODO: Compute and plot the ROC curve
# fpr, tpr, thresholds = roc_curve(y_train, y_scores)
# 
# plt.figure(figsize=(6, 5))
# plt.plot(fpr, tpr, linewidth=2, label='ROC Curve')
# plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate (Recall)')
# plt.title('ROC Curve')
# plt.legend()
# plt.grid(True)
# plt.show()

# Your code here:

In [None]:
# TODO: Calculate the AUC score
# auc = roc_auc_score(y_train, y_scores)
# print(f"AUC Score: {auc:.3f}")

# Your code here:

---
## Part 6: Final Evaluation on Test Set

📝 **Your Task:** Choose your best model and evaluate it on the **held-out test set**.

⚠️ **Important:** Only do this ONCE at the very end! The test set should remain untouched until final evaluation.

💡 **Hint:** Use the same metrics you used before on the test set predictions

In [None]:
# TODO: Make predictions on the test set with your best model
# (best_clf stands for whichever trained classifier you selected, e.g. sgd_clf or forest_clf)
# y_test_pred = best_clf.predict(X_test_scaled)  # or X_test for RF

# Your code here:

In [None]:
# TODO: Calculate all metrics on the test set
# - Accuracy
# - Precision
# - Recall
# - F1 Score
# - AUC Score

from sklearn.metrics import accuracy_score, classification_report

# Your code here:

In [None]:
# TODO: Print a complete classification report
# print(classification_report(y_test, y_test_pred))

# Your code here:

---
## Part 7: Error Analysis (Optional but Recommended)

📝 **Your Task:** Analyze the errors your model makes.

🤔 **Think About:**
- Which samples are being misclassified?
- Is there a pattern in the errors?
- Which features might need more engineering?

In [None]:
# TODO: Analyze misclassified samples
# Hint: Find samples where y_train != y_train_pred and examine their features

# Your code here:

---
## Part 8: Feature Importance (Optional)

📝 **Your Task:** If using Random Forest, examine feature importances.

💡 **Hint:** Use `forest_clf.feature_importances_` to see which features matter most

In [None]:
# TODO: Plot feature importances from Random Forest
# importances = forest_clf.feature_importances_
# indices = np.argsort(importances)[::-1]
# 
# plt.figure(figsize=(10, 6))
# plt.title("Feature Importances")
# plt.bar(range(len(importances)), importances[indices])
# plt.xticks(range(len(importances)), [feature_columns[i] for i in indices], rotation=45)
# plt.tight_layout()
# plt.show()

# Your code here:

---
## Summary and Conclusions

📝 **Your Task:** Write a brief summary of your findings:

1. **Best Model:** Which classifier performed best?
2. **Key Metrics:** What were the final precision, recall, and F1 scores?
3. **Important Features:** Which features were most important for classification?
4. **Challenges:** What challenges did you encounter?
5. **Future Improvements:** What could be done to improve the model?

### Your Summary:

*Write your conclusions here...*

1. **Best Model:** 

2. **Key Metrics:**

3. **Important Features:**

4. **Challenges:**

5. **Future Improvements:**