## Lab 3 – Predicting a Categorical Target and Evaluating Performance

In this lab, we will predict the gender of people in the Howell dataset, encoded in the binary `male` column (male = 0 or male = 1). We will train multiple models, evaluate performance using key metrics, and create visualizations to interpret the results.

Start from your Lab 2 notebook. We will trim that notebook down and then do our training and analysis.

We will:
1. Prepare the data
2. Train 3 models: Decision Tree, Support Vector Machine (SVM), and a Neural Net (NN)
3. Get model performance on train and test sets
4. Create appropriate graphs

NOTE: We will just work with the adults since there is a nice break in the data that is not there for the children.

You should see train/test counts of approximately:
Train size: 276    Test size: 70

```python
# imports

from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.neural_network import MLPClassifier
```

### Section 1. Load and Inspect the Data
1.1 Load the dataset

```python
# Load Howell.csv from the same folder as this file

howell_full = pd.read_csv("Howell.csv", sep=";")
```

### Handling Missing Data
In our case we have no missing data, so the code here is just for reference.
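
For reference, a minimal sketch of how missing values could be checked and handled with pandas. The small `demo` frame is made-up data that only borrows the Howell column names; it is an assumption for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with Howell-style columns; the values are made up.
demo = pd.DataFrame({
    "height": [151.8, 139.7, np.nan, 156.8],
    "weight": [47.8, 36.5, 31.9, np.nan],
    "age": [63.0, 63.0, 65.0, 41.0],
    "male": [1, 0, 0, 1],
})

# Count missing values in each column.
missing_counts = demo.isna().sum()
print(missing_counts)

# Option 1: drop every row that contains a missing value.
dropped = demo.dropna()

# Option 2: fill numeric gaps with the column median.
filled = demo.fillna(demo.median(numeric_only=True))
print(filled)
```

Dropping rows is the simplest choice when only a few rows are affected; filling with the median keeps every row at the cost of inventing values.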

### Section 2. Data Exploration and Preparation

2.1 Create new features

- Compute BMI from height and weight
- Create BMI category
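
One way the two features could be computed is sketched below. It assumes height is in centimeters and weight in kilograms, and the category cutoffs (18.5, 25, 30) are the commonly used adult BMI thresholds, chosen here as an assumption. The `demo` rows are made up.

```python
import pandas as pd

# Made-up rows with Howell-style columns (height in cm, weight in kg).
demo = pd.DataFrame({
    "height": [150.0, 160.0, 150.0],
    "weight": [50.0, 45.0, 60.0],
})

# BMI = weight (kg) / height (m) squared; convert cm to m first.
demo["bmi"] = demo["weight"] / (demo["height"] / 100) ** 2

# Assumed category cutoffs; each bin includes its left edge.
bins = [0, 18.5, 25, 30, float("inf")]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
demo["bmi_category"] = pd.cut(demo["bmi"], bins=bins, labels=labels, right=False)

print(demo)
```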


### Plot with Masking
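
A possible sketch of this step: filter to adults, build boolean masks for each gender, and use the masks to plot the two groups separately. The synthetic `demo` data and the `age > 18` cutoff are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Howell data (values are made up).
rng = np.random.default_rng(0)
n = 60
male = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "weight": rng.normal(42 + 7 * male, 5.0),
    "age": rng.integers(5, 70, size=n),
    "male": male,
})

# Keep adults only (the cutoff is an assumption).
adults = demo[demo["age"] > 18]

# Boolean masks select the rows for each group.
male_mask = adults["male"] == 1
female_mask = ~male_mask

fig, ax = plt.subplots()
ax.scatter(adults.loc[male_mask, "height"], adults.loc[male_mask, "weight"],
           marker="s", label="Male")
ax.scatter(adults.loc[female_mask, "height"], adults.loc[female_mask, "weight"],
           marker="o", label="Female")
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.legend()
# In a script (rather than a notebook), call plt.show() to display the figure.
```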

### Section 3. Feature Selection and Justification
3.1 Choose features and target

First (Case 1):

- input features: Height
- target: Gender

Second (Case 2):

- input features: Weight
- target: Gender

Third (Case 3):

- input features: Height, Weight
- target: Gender

Justify your selections

- Height and weight are likely to show patterns based on gender.
- Age could contribute to secondary patterns. By restricting our data to adults, we help mitigate some of this.


### 3.2 Define X (features) and y (target)

Comment out or uncomment the appropriate feature set before splitting the data. This code is set to run Case 1, where height is the only input.
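
The comment/uncomment pattern might look like the sketch below. The `adults` frame is a made-up stand-in, and the lowercase column names `height`, `weight`, and `male` are assumed.

```python
import pandas as pd

# Made-up stand-in for the filtered adult rows.
adults = pd.DataFrame({
    "height": [151.8, 139.7, 156.8, 163.8],
    "weight": [47.8, 36.5, 53.0, 63.0],
    "male": [1, 0, 0, 1],
})

# Case 1: height only
X = adults[["height"]]
y = adults["male"]

# Case 2: weight only
# X = adults[["weight"]]
# y = adults["male"]

# Case 3: height and weight
# X = adults[["height", "weight"]]
# y = adults["male"]
```

The double brackets keep `X` as a two-dimensional DataFrame, which is the shape scikit-learn estimators expect even for a single feature.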

### Reflection 3:

- Why did you choose these features?
- How might they impact predictions or accuracy?


### Section 4. Train a Classification Model (Decision Tree)
 
4.1 Split the Data

Split the data into training and test sets. `train_test_split` works for a simple split, but here we use `StratifiedShuffleSplit` so that the target classes are distributed evenly between the training and test sets.
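
A minimal sketch of a stratified split follows. The data is synthetic, and `test_size=0.2` and `random_state=123` are assumed values; an 80/20 split would be consistent with the 276/70 counts quoted at the top of the lab.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "male": male,
})
X = adults[["height"]]
y = adults["male"]

# One stratified 80/20 split; the class ratio is preserved in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx, test_idx in splitter.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print("Train size:", len(X_train), "Test size:", len(X_test))
```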

### 4.2 Train Model (Decision Tree)

Create and train a decision tree model without passing a `random_state` argument.


### 4.3 Evaluate Model Performance

Evaluate model performance on training data

Evaluate model performance on test data:
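
Training and both evaluations could be sketched as follows. The data is synthetic, and `train_test_split` with `stratify=y` is used here only as a shorter stand-in for the stratified split in 4.1.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "male": male,
})
X, y = adults[["height"]], adults["male"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Decision tree with default settings (no random_state passed).
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)

# Evaluate on the training data.
y_train_pred = tree_model.predict(X_train)
train_acc = accuracy_score(y_train, y_train_pred)
print("Results for Decision Tree on training data:")
print(classification_report(y_train, y_train_pred))

# Evaluate on the held-out test data.
y_test_pred = tree_model.predict(X_test)
test_acc = accuracy_score(y_test, y_test_pred)
print("Results for Decision Tree on test data:")
print(classification_report(y_test, y_test_pred))
```

A large gap between the training and test reports is the usual sign that an unpruned tree has memorized the training set.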

### 4.4 Report Confusion Matrix (as a heatmap)

Plot a confusion matrix:
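
One way to draw the heatmap with matplotlib alone is sketched below. The data is synthetic, and the class names Female/Male for 0/1 are an assumption about how `male` is encoded; other plotting helpers would work equally well.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "male": male,
})
X, y = adults[["height"]], adults["male"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)
tree_model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
class_names = ["Female", "Male"]  # assumed meaning of male = 0 / male = 1

fig, ax = plt.subplots()
im = ax.imshow(cm, cmap="Blues")
fig.colorbar(im, ax=ax)
ax.set_xticks([0, 1])
ax.set_xticklabels(class_names)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_names)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
ax.set_title("Confusion Matrix - Decision Tree")

# Write the count into each cell.
for i in range(2):
    for j in range(2):
        ax.text(j, i, str(cm[i, j]), ha="center", va="center")
# In a script (rather than a notebook), call plt.show() to display the figure.
```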

### 4.5 Report Decision Tree Plot

Plot the DT model. We give the plotter the names of the features and the names of the categories for the target. Save the image so we can use it in other places.
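
A sketch of the tree plot using scikit-learn's `plot_tree`. The data is synthetic, and the output file name `tree_case1_height.png` is a hypothetical choice.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "male": male,
})
X, y = adults[["height"]], adults["male"]
tree_model = DecisionTreeClassifier().fit(X, y)

fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(
    tree_model,
    feature_names=list(X.columns),   # names of the input features
    class_names=["Female", "Male"],  # class names in ascending label order (0, 1)
    filled=True,
    ax=ax,
)

# Save the image so it can be reused elsewhere (hypothetical file name).
out_path = Path("tree_case1_height.png")
fig.savefig(out_path)
```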

### Repeat for All 3 Cases

Try this for the 3 different cases: 1) using height as the only input, 2) using weight as the only input, and 3) using height and weight together as inputs.

For each case, redefine the input features in Section 3 (comment out the old case's inputs X and target y and uncomment the new case's), then re-run Sections 4 and 5. Record your results in a Markdown table.
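
As an alternative to re-running the cells by hand, the three cases could be looped over and printed as Markdown table rows. This is only a sketch on synthetic data; the case labels and the choice of accuracy as the recorded metric are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "weight": rng.normal(42 + 7 * male, 5.0),
    "male": male,
})
y = adults["male"]

cases = {
    "Case 1: height": ["height"],
    "Case 2: weight": ["weight"],
    "Case 3: height + weight": ["height", "weight"],
}

rows = []
for name, cols in cases.items():
    X_train, X_test, y_train, y_test = train_test_split(
        adults[cols], y, test_size=0.2, stratify=y, random_state=123
    )
    model = DecisionTreeClassifier().fit(X_train, y_train)
    rows.append({
        "case": name,
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, model.predict(X_test)),
    })

# Print the results as Markdown table rows.
print("| Case | Train accuracy | Test accuracy |")
print("|------|----------------|---------------|")
for row in rows:
    print(f"| {row['case']} | {row['train_acc']:.2f} | {row['test_acc']:.2f} |")
```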

### Reflection 4:

- How well did the models perform?
- Are there any surprising results?
- Which worked better: just height, just weight, or using both together?


### Section 5. Compare Alternative Models (SVC, NN)

 
5.1 Train Support Vector Classifier (SVC) Model

Train an SVM model using height and weight. Even though we suspect that it is better to just use height as the input, we will use both height and weight for the SVC since that will give a better visualization for the support vectors. 

Predict and evaluate SVC model:
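
A minimal sketch of training and evaluating the SVC with its default settings, on synthetic data. `train_test_split` with `stratify=y` stands in for the stratified split used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "weight": rng.normal(42 + 7 * male, 5.0),
    "male": male,
})
X, y = adults[["height", "weight"]], adults["male"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Support Vector Classifier with default hyperparameters.
svc_model = SVC()
svc_model.fit(X_train, y_train)

y_pred_svc = svc_model.predict(X_test)
print("Results for SVC on test data:")
print(classification_report(y_test, y_pred_svc))
```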

Graph the support vectors.
We will add a third scatter plot showing the support vectors. To do that, we need to reach into the trained model and get the vectors. Then we can plot those points with black plus markers.
We extract the coordinates with a list comprehension, a Python shortcut for constructing a list.
The square brackets tell us we are building a list. We iterate over the source, `support_vectors_`, which is an attribute of the model that is built up during training. Each entry is a pair, which we pull out and unpack into its components (x, y). List comprehensions are one of the nice features of Python. Suppose I wanted a list of squares: the code [x*x for x in range(1,6)] will build the list [1, 4, 9, 16, 25].
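
The list-comprehension pattern can be tried on its own. In this small sketch, the `pairs` list is a made-up stand-in for the (x, y) pairs stored in the trained model.

```python
# Squares of 1 through 5 (range(1, 6) stops before 6).
squares = [x * x for x in range(1, 6)]
print(squares)  # [1, 4, 9, 16, 25]

# Cubes use the same pattern with a different expression.
cubes = [x ** 3 for x in range(1, 6)]
print(cubes)  # [1, 8, 27, 64, 125]

# Made-up (x, y) pairs standing in for the model's support vectors.
pairs = [(150.0, 45.0), (160.0, 52.0), (155.0, 48.0)]

# Unpack each pair and keep one component per list.
xs = [x for (x, y) in pairs]
ys = [y for (x, y) in pairs]
print(xs, ys)
```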

We now have two lists of data that we can hand over to matplotlib.
We need to plot the new data, and we will adjust the colors of the points:

1. We are using yellow squares for males.
2. We are using cyan for females.
3. We are using black pluses for the support vectors. Since we plot the support vectors last, they will not be obscured by the data points, and the thin plus marker lets the male/female points show through.

 - NOTE: The `support_vectors_` attribute exists only after the model has been fit, so accessing it before training raises an error. If training reports convergence problems or the results look poor, try adjusting the kernel (more on kernels in the Lab 3 Project) or tuning hyperparameters (more on this below).
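
The support-vector plot described above might be sketched like this. The data is synthetic, and the model is fit on all rows only to keep the example short.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "weight": rng.normal(42 + 7 * male, 5.0),
    "male": male,
})
X, y = adults[["height", "weight"]], adults["male"]
svc_model = SVC().fit(X, y)

# Pull the (height, weight) pairs out of the trained model.
support_x = [vec[0] for vec in svc_model.support_vectors_]
support_y = [vec[1] for vec in svc_model.support_vectors_]

males = adults[adults["male"] == 1]
females = adults[adults["male"] == 0]

fig, ax = plt.subplots()
ax.scatter(males["height"], males["weight"], c="yellow", marker="s", label="Male")
ax.scatter(females["height"], females["weight"], c="cyan", label="Female")
# Plot the support vectors last so they sit on top of the data points.
ax.scatter(support_x, support_y, c="black", marker="+", label="Support Vectors")
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
ax.legend()
# In a script (rather than a notebook), call plt.show() to display the figure.
```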

### 5.2 Train a Neural Network (NN) Model

Now we'll use the NN (Multi-Layer Perceptron) model. Again, we will give the neural net as much information as possible (both height and weight), with the understanding that it could overfit on the extra data.

We have some hyperparameters that we can adjust. For the other models, we just let them run with their defaults. Here we are going to use 3 hidden layers and switch the solver to one that is more likely to give good results on a small dataset.

Train a neural network model:

Predict and evaluate Neural Network model:
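
A sketch of training and evaluating the network is shown below. The layer sizes `(50, 25, 10)`, `max_iter=1000`, and `random_state=42` are assumed values, and `solver="lbfgs"` is one reasonable reading of "a solver suited to small datasets". The data is synthetic and unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the adult Howell rows (values are made up).
rng = np.random.default_rng(42)
male = np.repeat([0, 1], 50)
adults = pd.DataFrame({
    "height": rng.normal(150 + 10 * male, 5.0),
    "weight": rng.normal(42 + 7 * male, 5.0),
    "male": male,
})
X, y = adults[["height", "weight"]], adults["male"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

# Three hidden layers (sizes are assumptions) and the lbfgs solver.
nn_model = MLPClassifier(
    hidden_layer_sizes=(50, 25, 10),
    solver="lbfgs",
    max_iter=1000,
    random_state=42,
)
nn_model.fit(X_train, y_train)

y_pred_nn = nn_model.predict(X_test)
nn_acc = accuracy_score(y_test, y_pred_nn)
print("Results for Neural Network on test data:")
print(classification_report(y_test, y_pred_nn))
```

If the network performs poorly, scaling the inputs before training is a common thing to try, since neural networks tend to be sensitive to feature scale.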

Plot confusion matrix:

### Reflection 5:

- How well did each model perform?
- Are there any surprising results?
- Why might one model outperform the others?


### Section 6. Final Thoughts & Insights
6.1 Summarize Findings

**What indicators are strong predictors of gender?**

Based on the analysis of the Howell dataset above, here are the key findings:

**Strong Predictors (in order of importance):**

1. **Height**: The strongest single predictor of gender
   - Achieved 71% accuracy using height alone
   - Clear biological dimorphism between male and female heights
   - Simple, interpretable relationship

2. **Height + Weight Combined**: Most robust predictor set
   - SVM achieved 77% accuracy using both features
   - Decision Tree maintained 71% accuracy with both features
   - Captures complementary aspects of physical dimorphism

3. **Weight**: Moderate predictor on its own
   - Less reliable than height as a standalone predictor
   - More variability due to lifestyle factors

**Model Performance Analysis:**

1. **Decision Tree**: 
   - Training accuracy: 100% (clear overfitting)
   - Test accuracy: 71% 
   - Excellent interpretability but prone to overfitting

2. **Support Vector Machine (SVM)**:
   - Best overall performance: 77% test accuracy
   - More robust to overfitting
   - Good balance of precision and recall

3. **Neural Network**:
   - Poor performance: 53% accuracy (barely better than random guessing on a two-class target)
   - Failed to converge properly on this small dataset
   - Overly complex for the available data size

**Key Insights:**
- **Height emerges as the most reliable single predictor** due to consistent sexual dimorphism
- **SVM performs best overall**, handling the feature combination more effectively
- **Simple models work better** than complex ones for this small dataset (346 adults)
- **Overfitting is a major concern** - all models show significant training vs. test performance gaps

6.2 Discuss Challenges Faced

- Small sample size could limit generalizability.
- Missing values (if any) could bias the model.

6.3 Next Steps

- Test more features (e.g., BMI class).
- Try hyperparameter tuning for better results.
