## Course Assignment Instructions
You should have Python (version 3.8 or later) and Jupyter Notebook installed to complete this assignment. You will write code in the empty cell/cells below the problem. While most of this will be a programming assignment, some questions will ask you to "write a few sentences" in markdown cells. 

Submission Instructions:

1. Create a `labs` directory in your personal class repository (e.g., located in your home directory).
2. Clone the class repository.
3. Copy this Jupyter notebook file (`.ipynb`) into your `repo/labs` directory.
4. Make your edits, commit your changes, and push to your repository.

All submissions must be pushed before the due date to avoid late penalties.

Labs are graded out of 100 points. Each day late costs 5 points, up to a maximum penalty of 50 points after 10 days. After that, you may submit the lab anytime before the semester ends for a maximum score of 50.

Lab 2 is due on 2/18/25

## Basic Modeling
In the 342 class an example was given that considered a variable `x_3` which measured "criminality". In this example there are L = 4 levels "none", "infraction", "misdemeanor" and "felony". Create a variable `x_3` here with 100 random elements (equally probable). Create it as a nominal (i.e. unordered) factor. Hint: use random.choice from NumPy and Categorical from Pandas.

In [5]:
import numpy as np
import pandas as pd

categories = ["none", "infraction", "misdemeanor", "felony"]

# Generate 100 random elements with equal probability
x_3 = np.random.choice(categories, size = 100, replace = True)

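# Convert to a categorical (nominal) variable in pandas.
# pd.Categorical returns a new object, so the result must be assigned back to x_3.
x_3 = pd.Categorical(x_3, categories = categories, ordered = False)
print(np.asarray(x_3))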

['felony' 'none' 'misdemeanor' 'misdemeanor' 'misdemeanor' 'felony'
 'felony' 'infraction' 'felony' 'infraction' 'none' 'misdemeanor' 'none'
 'none' 'infraction' 'misdemeanor' 'felony' 'felony' 'infraction' 'none'
 'infraction' 'felony' 'infraction' 'none' 'infraction' 'felony'
 'misdemeanor' 'infraction' 'none' 'misdemeanor' 'felony' 'infraction'
 'none' 'felony' 'felony' 'none' 'infraction' 'misdemeanor' 'infraction'
 'none' 'felony' 'felony' 'infraction' 'infraction' 'infraction'
 'misdemeanor' 'misdemeanor' 'infraction' 'felony' 'misdemeanor'
 'infraction' 'none' 'felony' 'infraction' 'misdemeanor' 'none' 'none'
 'felony' 'misdemeanor' 'misdemeanor' 'infraction' 'none' 'infraction'
 'none' 'infraction' 'felony' 'infraction' 'infraction' 'none' 'felony'
 'none' 'felony' 'none' 'infraction' 'none' 'none' 'felony' 'felony'
 'felony' 'none' 'felony' 'infraction' 'none' 'infraction' 'felony'
 'misdemeanor' 'felony' 'none' 'felony' 'none' 'misdemeanor' 'infraction'
 'none' 'infraction' '

Use x_3 to create x_3_bin, a binary feature where 0 is no crime and 1 is any crime.

In [7]:
# Compare against "none" (True for any crime, False for no crime), then cast to 0/1 integers
x_3_bin = (x_3 != "none").astype(int)
print(x_3_bin)

[1 0 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 1 1 1 0 1 1 0 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 1 0 1 1 1 1 0 1 0 1 0 1
 0 0 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 0]


Use `x_3` to create `x_3_ord`, an ordered factor variable. Ensure the proper ordinal ordering.

In [9]:
x_3_ord = pd.Categorical(x_3, categories = categories, ordered = True)
x_3_ord

['felony', 'none', 'misdemeanor', 'misdemeanor', 'misdemeanor', ..., 'misdemeanor', 'misdemeanor', 'felony', 'felony', 'none']
Length: 100
Categories (4, object): ['none' < 'infraction' < 'misdemeanor' < 'felony']
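
To see what the `ordered` flag changes in practice, the sketch below builds both kinds of factor from a short hand-written list. The five values are made up for illustration and are not the lab's random sample.

```python
import pandas as pd

categories = ["none", "infraction", "misdemeanor", "felony"]
values = ["felony", "none", "misdemeanor", "infraction", "none"]  # illustrative data only

nominal = pd.Categorical(values, categories=categories, ordered=False)
ordinal = pd.Categorical(values, categories=categories, ordered=True)

# Both factors store the same integer codes (positions in `categories`)
nominal_codes = nominal.codes.tolist()
ordinal_codes = ordinal.codes.tolist()
print(nominal_codes)

# Only the ordered factor supports rank comparisons and max/min
at_least_misdemeanor = ordinal >= "misdemeanor"
worst = ordinal.max()
print(at_least_misdemeanor, worst)
```

The same comparison on `nominal` raises a `TypeError`, which is the practical difference between a nominal and an ordinal factor.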

Convert this variable into three binary variables without any information loss and put them into a data matrix. Hint: use column_stack from Numpy.

In [11]:
x_3_matrix = np.column_stack ([
    (x_3 == "infraction").astype(int),
    (x_3 == "misdemeanor").astype(int),
    (x_3 == "felony").astype(int)
])

x_3_matrix = pd.DataFrame(x_3_matrix, columns = ["infraction", "misdemeanor", "felony"])
print(x_3_matrix)

    infraction  misdemeanor  felony
0            0            0       1
1            0            0       0
2            0            1       0
3            0            1       0
4            0            1       0
..         ...          ...     ...
95           0            1       0
96           0            1       0
97           0            0       1
98           0            0       1
99           0            0       0

[100 rows x 3 columns]
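
As a cross-check on the hand-built matrix, pandas can produce the same kind of encoding with `pd.get_dummies`. This is a minimal sketch on made-up values, assuming "none" is the reference level that gets dropped; it is an alternative to `np.column_stack`, not the method the assignment asks for.

```python
import pandas as pd

categories = ["none", "infraction", "misdemeanor", "felony"]
values = ["felony", "none", "misdemeanor", "infraction", "none"]  # illustrative data only
factor = pd.Series(pd.Categorical(values, categories=categories))

# drop_first=True removes the first category ("none"), leaving L - 1 = 3 columns
dummies = pd.get_dummies(factor, drop_first=True, dtype=int)
row_sums = dummies.sum(axis=1)

print(dummies)
print(row_sums.tolist())
```

Rows for "none" are all zeros, so no information is lost by dropping that column.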


What should the sum of each row be (in English)? Write your answer in the markdown cell below

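Each row should sum to either 0 or 1. The matrix has columns for only three of the four categories, so a row sums to 1 when the person falls in "infraction", "misdemeanor" or "felony", and to 0 when the person falls in the reference category "none".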

Verify that in the code cell below

In [13]:
row_sum = x_3_matrix.sum(axis = 1)

print(row_sum.value_counts())

1    74
0    26
Name: count, dtype: int64


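How should the column sum look (in English)? Write your answer in the markdown cell below

Each column sum should be about 25, since we sampled 100 people and each has a 1/4 chance of falling in each category. The counts are random, so they will vary around 25 rather than equal it exactly.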

Verify that in the code cell below

In [15]:
col_sum = x_3_matrix.sum(axis = 0)
print(col_sum)

infraction     27
misdemeanor    18
felony         29
dtype: int64
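
How far from 25 can a column sum reasonably be? Each column count is Binomial(100, 1/4), so its standard deviation is sqrt(100 · 0.25 · 0.75). The sketch below, which uses only the counts printed above, expresses each observed count as a number of standard deviations from the expected value.

```python
import math

n, p = 100, 0.25
expected = n * p                    # expected count per category
sd = math.sqrt(n * p * (1 - p))     # binomial standard deviation

# Counts copied from the output above
observed = {"infraction": 27, "misdemeanor": 18, "felony": 29}
z_scores = {name: (count - expected) / sd for name, count in observed.items()}

for name, z in z_scores.items():
    print(f"{name}: {z:+.2f} standard deviations from {expected:.0f}")
```

All three counts fall within two standard deviations of 25, which is consistent with equally probable categories.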


Generate a matrix with 100 rows where the first column is a realization from a normal with mean 17 and variance 38, the second column is uniform between -10 and 10, the third column is Poisson with mean 6, the fourth column is exponential with lambda of 9, the fifth column is binomial with n = 20 and p = 0.12, and the sixth column is a binary variable with exactly 24% 1's dispersed randomly. Name the rows using the entries of the `fake_first_names` vector. You will need to use NumPy.

In [17]:
# Number of rows
num_rows = 100

# Assign row names (index) from fake_first_names
fake_first_names = [
    "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
    "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
    "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
    "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
    "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
    "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
    "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
    "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
    "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
    "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
    "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
    "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
    "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
    "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
    "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
    "Lincoln"
]

# Create a DataFrame with the specified distributions
x = pd.DataFrame({
    "Normal": np.random.normal(loc = 17, scale = np.sqrt(38), size = num_rows),  # Normal(17, variance 38)
    "Uniform": np.random.uniform(low = -10, high = 10, size = num_rows),         # Uniform(-10, 10)
    "Poisson": np.random.poisson(6, size = num_rows),                            # Poisson(6)
    "Exponential": np.random.exponential(1/9, size = num_rows),                  # Exponential(λ=9)
    "Binomial": np.random.binomial(n = 20, p = 0.12, size = num_rows),           # Binomial(n=20, p=0.12)
    "Binary": np.random.permutation([1] * int(num_rows*0.24) + [0] * int(num_rows*0.76))  # 24% 1s, shuffled   
})

x.index = fake_first_names[:num_rows]
print(x)


              Normal   Uniform  Poisson  Exponential  Binomial  Binary
Sophia     19.989274  8.538661        8     0.381689         3       0
Emma       23.465743 -1.383005        6     0.005752         3       0
Olivia     31.225979  8.835610        3     0.129792         0       0
Ava        13.119809 -0.215711        9     0.061004         4       0
Mia         7.066318  1.422628        4     0.204565         3       0
...              ...       ...      ...          ...       ...     ...
Christian  15.513906 -0.268287        5     0.263889         3       1
Andrew      5.929438 -4.810215        5     0.296789         2       0
Brayden     9.932049  6.229181        5     0.046820         1       1
John        7.725920  3.250591        7     0.038331         4       0
Lincoln    22.991791  0.872125        7     0.163250         2       0

[100 rows x 6 columns]
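
The same matrix can be generated reproducibly with NumPy's `Generator` interface. This is a sketch under stated assumptions: the seed value and the name `demo` are arbitrary choices for illustration. Note that NumPy parameterizes the normal by its standard deviation and the exponential by its scale (1/lambda), not by variance and rate.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=342)  # arbitrary seed, chosen only for reproducibility
n = 100

# Exactly 24% ones: fill the first 24 slots, then shuffle in place
binary = np.zeros(n, dtype=int)
binary[: round(0.24 * n)] = 1
rng.shuffle(binary)

demo = pd.DataFrame({
    "Normal": rng.normal(loc=17, scale=np.sqrt(38), size=n),  # sd = sqrt(variance)
    "Uniform": rng.uniform(-10, 10, size=n),
    "Poisson": rng.poisson(lam=6, size=n),
    "Exponential": rng.exponential(scale=1 / 9, size=n),      # scale = 1 / lambda
    "Binomial": rng.binomial(n=20, p=0.12, size=n),
    "Binary": binary,
})
print(demo.head())
```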


Create a data frame of the same data as above except make the binary variable a factor "DOMESTIC" vs "FOREIGN" for 0 and 1 respectively. In RStudio you used the `View` function to ensure this worked as desired. In Python, use `.head()` on the DataFrame. I recommend creating a copy of the DataFrame and then using `.replace` in conjunction with `.astype("category")` to make the binary variable a factor.

In [19]:
# Convert the binary variable of the DataFrame to a categorical
# Make a copy to keep x unchanged
x_copy = x.copy()

# Convert binary column (6th column) to categorical labels
x_copy["Binary"] = x_copy["Binary"].replace({0: "Domestic", 1: "Foreign"}).astype("category")

# Display first few rows
x_copy.head()

| | Normal | Uniform | Poisson | Exponential | Binomial | Binary |
|---|---|---|---|---|---|---|
| Sophia | 19.989274 | 8.538661 | 8 | 0.381689 | 3 | Domestic |
| Emma | 23.465743 | -1.383005 | 6 | 0.005752 | 3 | Domestic |
| Olivia | 31.225979 | 8.83561 | 3 | 0.129792 | 0 | Domestic |
| Ava | 13.119809 | -0.215711 | 9 | 0.061004 | 4 | Domestic |
| Mia | 7.066318 | 1.422628 | 4 | 0.204565 | 3 | Domestic |


Print out a table of the binary variable. Then print out the proportions of "DOMESTIC" vs "FOREIGN". pandas objects have a `.value_counts()` method.

In [21]:
x_copy["Binary"].value_counts(normalize = True)

Binary
Domestic    0.76
Foreign     0.24
Name: proportion, dtype: float64
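
The cell above prints only the proportions. A minimal sketch of getting both the raw table and the proportions is shown below; the series `labels` is a stand-in built to match the 76/24 split in the output, not the lab's `x_copy`.

```python
import pandas as pd

# Stand-in for x_copy["Binary"], with the same 76/24 split as the output above
labels = pd.Series(["Domestic"] * 76 + ["Foreign"] * 24, dtype="category")

counts = labels.value_counts()                      # raw table of counts
proportions = labels.value_counts(normalize=True)   # same table as fractions

print(counts)
print(proportions)
```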

Print out a summary of the whole dataframe.

In [23]:
x_copy.describe()

| | Normal | Uniform | Poisson | Exponential | Binomial |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 16.273458 | 0.709589 | 6.03 | 0.119086 | 2.48 |
| std | 6.630918 | 5.723737 | 2.36325 | 0.099141 | 1.527393 |
| min | 1.811336 | -9.990842 | 1.0 | 0.001592 | 0.0 |
| 25% | 11.355703 | -3.32851 | 4.75 | 0.044687 | 1.0 |
| 50% | 16.695459 | 0.942184 | 6.0 | 0.088486 | 2.0 |
| 75% | 20.866179 | 5.231241 | 7.0 | 0.172863 | 3.0 |
| max | 31.225979 | 9.953382 | 14.0 | 0.535741 | 8.0 |


## Dataframe creation
Imagine you are running an experiment with many manipulations. You have 14 levels in the variable "treatment" with levels a, b, c, etc. For each of those manipulations you have 3 submanipulations in a variable named "variation" with levels A, B, C. Then you have "gender" with levels M / F. Then you have "generation" with levels Boomer, GenX, Millenial. Then you will have 6 runs for each of these groups, which gives 14 × 3 × 2 × 3 × 6 = 1512 rows in total. In each set of 6 you will need to select names without duplication from the appropriate set of names (from the last question). Create a data frame with columns treatment, variation, gender, generation, name and y that will store all the unique unit information in this experiment. Leave y empty because it will be measured as the experiment is executed. In RStudio you used the `rep` function with the `times` argument. In Python, use `np.tile` and `np.repeat`.

In [None]:
# Define categories
treatments = list("abcdefghijklmn")            # 14 levels
variations = list("ABC")                       # 3 levels
genders = ["M", "F"]                           # 2 levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 levels
runs = 6                                       # runs per group


# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}


# Create experiment dataframe: np.repeat stretches each level into a block,
# np.tile cycles that block once per combination of the factors to its left
df = pd.DataFrame({
    "treatment": np.repeat(treatments, len(variations) * len(genders) * len(generations) * runs),
    "variation": np.tile(np.repeat(variations, len(genders) * len(generations) * runs), len(treatments)),
    "gender": np.tile(np.repeat(genders, len(generations) * runs), len(treatments) * len(variations)),
    "generation": np.tile(np.repeat(generations, runs), len(treatments) * len(variations) * len(genders)),
})


# Function to assign unique names per group
def assign_names(group):
    gender_val = group["gender"].iloc[0]  # Extract gender
    generation_val = group["generation"].iloc[0]  # Extract generation
    return np.random.choice(name_sets[gender_val][generation_val], runs, replace=False)

# Apply function to assign names. sort=False keeps the groups in row order;
# with the default sorting, "F" groups would come before "M" and the names would be misaligned.
df["name"] = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False, sort=False).apply(assign_names).explode().reset_index(drop=True)

# Add empty column y
df["y"] = np.nan

# Display first few rows
print(df.head())
df

Now that you've done it with `np.tile` and `np.repeat`, try doing this by importing `product` from the `itertools` module. This is analogous to using the `expand.grid` function in RStudio.

| **R Function** | **Python Equivalent** | **Use Case** |
|--------------|-----------------|-----------|
| `rep(x, times=n)` | `np.tile(x, n)` | Repeat the full sequence **`n` times** |
| `rep(x, each=n)` | `np.repeat(x, n)` | Repeat each element **`n` times** in order |
| `rep(x, length.out=n)` | `np.resize(x, n)` | Cycle through `x`, **truncating** or **extending** to length `n` |
| `expand.grid()` | `itertools.product()` | Generate **all combinations** of several factors |

**`expand.grid()` → `itertools.product()`** for generating **all combinations**  
**`rep(..., each=n)` → `np.repeat()`** for **repeating values in order**  
**`rep(..., times=n)` → `np.tile()`** for **cycling through values**  
**Combination of `np.repeat()` and `np.tile()`** replaces **nested `rep()`** in R
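
A quick way to confirm which NumPy function matches which `rep()` argument is to run each on a three-element vector. The comments give the corresponding R call on `x <- c("a", "b", "c")`.

```python
import numpy as np

x = np.array(["a", "b", "c"])

tiled = np.tile(x, 2)        # like rep(x, times = 2):      a b c a b c
repeated = np.repeat(x, 2)   # like rep(x, each = 2):       a a b b c c
resized = np.resize(x, 5)    # like rep(x, length.out = 5): a b c a b

# Nested use, as needed for a factorial design: rep(rep(x, each = 2), times = 2)
nested = np.tile(np.repeat(x, 2), 2)

print(tiled, repeated, resized, nested, sep="\n")
```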

In [25]:
from itertools import product

# Define categories - we need to expand this out by multiplying 
treatments = list("abcdefghijklmn")  # 14 levels
variations = list("ABC")             # 3 levels
genders = ["M", "F"]                 # 2 levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 levels

# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}

# Create experiment DataFrame
df = pd.DataFrame({
    "treatment": np.repeat(treatments, len(variations) * len(genders) * len(generations) * 6),
    "variation": np.tile(np.repeat(variations, len(genders) * len(generations) * 6), len(treatments)),
    "gender": np.tile(np.repeat(genders, len(generations) * 6), len(treatments) * len(variations)),
    "generation": np.tile(np.repeat(generations, 6), len(treatments) * len(variations) * len(genders)),
}) 

# Add a unique identifier to preserve the original order
df = df.reset_index().rename(columns={'index': 'orig_index'})

# Define a function that assigns 6 unique names per group and returns a DataFrame with the original index.
def assign_names_with_index(group):
    gender_val = group["gender"].iloc[0]       # Extract the group's gender
    generation_val = group["generation"].iloc[0]  # Extract the group's generation
    # Sample 6 unique names from the appropriate set (without replacement)
    names = np.random.choice(name_sets[gender_val][generation_val], 6, replace=False)
    # Return a DataFrame with the original indices and the assigned names
    return pd.DataFrame({
        "orig_index": group["orig_index"],
        "name": names
    })

# Group by the categorical variables and apply the function.
names_df = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names_with_index).reset_index(drop=True)

# Merge the assigned names back into the original DataFrame using the unique identifier.
df = df.merge(names_df, on="orig_index", how="left")

# Restore the original order and remove the temporary identifier.
df = df.sort_values("orig_index").reset_index(drop=True).drop(columns=["orig_index"])

# Add empty column y
df["y"] = np.nan

# Display DataFrame
df



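| | treatment | variation | gender | generation | name | y |
|---|---|---|---|---|---|---|
| 0 | a | A | M | Boomer | Alfred | |
| 1 | a | A | M | Boomer | Herbert | |
| 2 | a | A | M | Boomer | Lee | |
| 3 | a | A | M | Boomer | Theodore | |
| 4 | a | A | M | Boomer | Bernard | |
| ... | ... | ... | ... | ... | ... | ... |
| 1507 | n | C | F | Millenial | Alexis | |
| 1508 | n | C | F | Millenial | Bethany | |
| 1509 | n | C | F | Millenial | Lauren | |
| 1510 | n | C | F | Millenial | Latoya | |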
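
The cell above imports `product` but still builds the design with `np.repeat` and `np.tile`. A minimal sketch of the `itertools.product` route is given below. The `run` column and the name `grid` are assumptions added for illustration; names and `y` would be attached afterwards as in the cell above.

```python
from itertools import product

import pandas as pd

treatments = list("abcdefghijklmn")            # 14 levels
variations = list("ABC")                       # 3 levels
genders = ["M", "F"]                           # 2 levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 levels
runs = range(6)                                # 6 runs per group

# product() varies the last argument fastest, which reproduces the block
# structure of the np.repeat / np.tile version
rows = list(product(treatments, variations, genders, generations, runs))
grid = pd.DataFrame(rows, columns=["treatment", "variation", "gender", "generation", "run"])

print(len(grid))
print(grid.head())
```

Note that R's `expand.grid` varies its first argument fastest, so the row order differs from `product` even though the set of combinations is the same.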


## Basic Binary Classification Modeling

Load the famous `iris` data frame into the namespace. In RStudio you used the `skim` function from the package `skimr` to provide a summary of the columns. In Python we will use `df.describe()` and `ProfileReport` from the ydata-profiling package. The `iris` data set is not available in base Python, but we can get it from the sklearn package. Using the output of the code below, write a few descriptive sentences in English about the distributions.

In [47]:
# Install scikit-learn (comment this line out once the package is installed)
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [51]:
# Install ydata-profiling (comment this line out once the package is installed)
%pip install -U ydata-profiling[notebook]

Note: you may need to restart the kernel to use updated packages.


### **Comparing the `iris` Dataset in R vs Python**
| Feature  | **R (`datasets::iris`)**  | **Python (`sklearn.datasets.load_iris()`)**  |
|----------|-------------------------|--------------------------------|
| **Total Rows**  | 150 | 150 |
| **Columns (Features)** | 5 (`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) | 4 feature columns (`sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`) plus the species in a separate `target` array |
| **Species Encoding**  | `"setosa"`, `"versicolor"`, `"virginica"` (Categorical Factor) | `0` (setosa), `1` (versicolor), `2` (virginica) (Numerical Encoding) |
| **Data Type for Species** | Factor (Categorical) | Integer (0,1,2) |
| **Data Loading Method** | `data(iris)` (built-in dataset) | `datasets.load_iris()` (from `sklearn`) |

### **Key Differences**
- **Species Encoding:**  
  - **R uses categorical factor labels (`setosa`, `versicolor`, `virginica`).**  
  - **Python (`sklearn`) encodes species numerically as `0`, `1`, and `2`.**
- **Column Names:**  
  - **R:** `Sepal.Length`, `Sepal.Width`, etc.  
  - **Python:** `sepal length (cm)`, `sepal width (cm)`, etc.  

In [58]:
from sklearn import datasets
import ydata_profiling  

# Load the famous Iris dataset
iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)
#df_iris

df_iris["Species"] = iris.target
print(df_iris.describe())

profile = ydata_profiling.ProfileReport(df_iris, title = "Iris Summary", explorative = True)
# Write the profiling report to an HTML file (comment this line out to skip it)
profile.to_file("iris_report.html")


       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)     Species  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000  



TO-DO: describe this data

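All four features are measured in centimeters on 150 flowers. Sepal length ranges from 4.3 to 7.9 (mean 5.84, std 0.83) and sepal width from 2.0 to 4.4 (mean 3.06, std 0.44), so both are fairly concentrated around their means. Petal length (1.0 to 6.9, std 1.77) and petal width (0.1 to 2.5, std 0.76) are much more spread out, and the median petal length (4.35) sits well above its mean (3.76), which suggests a cluster of flowers with very short petals. Sepal length appears to increase with petal length and petal width, although the summary table alone does not show this; it should be checked in the correlation section of the profiling report.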

The outcome / label / response is `Species`. This is what we will be trying to predict. However, we only care about binary classification between "setosa" and "versicolor" for the purposes of this exercise. Thus the first order of business is to drop one class. Let's drop the data for the level "virginica" from the data frame.

In [60]:
# Filter out "virginica" from the dataset
df_iris_binary = df_iris[df_iris["Species"] != 2].copy()
df_iris_binary.head()
print(df_iris_binary["Species"].unique())
df_iris_binary

[0 1]


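| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 95 | 5.7 | 3.0 | 4.2 | 1.2 | 1 |
| 96 | 5.7 | 2.9 | 4.2 | 1.3 | 1 |
| 97 | 6.2 | 2.9 | 4.3 | 1.3 | 1 |
| 98 | 5.1 | 2.5 | 3.0 | 1.1 | 1 |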


Now create a vector `y` whose length is the number of remaining rows in the data frame and whose entries are 0 if "setosa" and 1 if "versicolor".

In [62]:
# Create binary target vector `y` (0 for setosa, 1 for versicolor)
y = (df_iris_binary["Species"] == 1).astype(int)
print(y.unique())

[0 1]


Write a function `mode` returning the sample mode of a vector of numeric values. Use np.random.choice from NumPy and import Counter from the collections module.

In [64]:
from collections import Counter

# Define mode function
def mode(v):
    return Counter(v).most_common(1)[0][0]
    
# Test with a random sample (equivalent to "sample(letters, 1000, replace = True)")
sample_data = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 1000, replace = True)
print("Mode of Sample Letters:", mode(sample_data))

# Test with binary target vector 'y'
print("Mode of y:", mode(y))

Mode of Sample Letters: l
Mode of y: 0
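
The result `Mode of y: 0` hides a tie: the summaries further down show 50 setosa and 50 versicolor rows, and `Counter.most_common` breaks ties by returning the element it encountered first. The sketch below is a hypothetical variant, named `modes` here for illustration, that returns every value sharing the top count.

```python
from collections import Counter

def modes(values):
    """Return all values that share the highest count, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

balanced = [0] * 50 + [1] * 50   # same class balance as y
tie_result = modes(balanced)
single_result = modes("aab")

print(tie_result, single_result)
```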


Fit a threshold model to `y` using the feature `Sepal.Length`. Write your own code to do this. What is the estimated value of the threshold parameter? Save the threshold value as `threshold`. Hint: use np.zeros and np.sum from Numpy. You will need to use a for loop using the range() function.  

In [46]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all candidate threshold values (each observed sepal length)
for i in range(n):
    candidate = sepal_length[i]  # Current candidate threshold
    num_errors = np.sum((sepal_length > candidate) != y_values)  # Count classification errors
    num_errors_by_parameter[i] = [candidate, num_errors]  # Store values

# Sort by number of errors
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]

# Get the threshold with the least number of errors
best_threshold = num_errors_by_parameter[0, 0]
threshold = best_threshold  # The assignment asks for the name `threshold`

# Print results
print(f"Optimal threshold for classification: {best_threshold}")

Optimal threshold for classification: 5.4
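
The loop above can also be written without an explicit `for` by letting NumPy broadcast every candidate threshold against every observation. This is a sketch on six made-up measurements, not the iris data, and it is an alternative to the loop the assignment asks for.

```python
import numpy as np

# Made-up feature values and labels, for illustration only
x = np.array([4.8, 5.0, 5.2, 5.6, 5.9, 6.1])
y = np.array([0, 0, 0, 1, 1, 1])

candidates = np.unique(x)                        # sorted candidate thresholds
# Row i holds the predictions made with candidates[i] as the threshold
predictions = x[None, :] > candidates[:, None]   # shape (n_candidates, n)
errors = (predictions != y[None, :]).sum(axis=1)

best = candidates[errors.argmin()]
print(errors.tolist(), best)
```

Using `np.unique` also avoids scoring the same threshold more than once when feature values repeat.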


What is the total number of errors this model makes? This requires a couple of minor modifications to the previous code.

In [48]:
import numpy as np

# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix for candidate thresholds and their error counts
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all candidate threshold values
for i in range(n):
    candidate = sepal_length[i]  # Current candidate threshold
    num_errors = np.sum((sepal_length > candidate) != y_values)  # Count classification errors

    # Store threshold and corresponding errors
    num_errors_by_parameter[i] = [candidate, num_errors]

# Sort by number of errors to find the best threshold
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]
best_threshold = num_errors_by_parameter[0, 0]  # Best threshold with the least errors

# The model's error count is the count at the best threshold,
# not the sum of the counts over every candidate threshold
best_num_errors = int(num_errors_by_parameter[0, 1])

# Print results
print(f"Optimal threshold for classification: {best_threshold}")
print(f"Total number of errors at the optimal threshold: {best_num_errors}")

Optimal threshold for classification: 5.4
Total number of errors at the optimal threshold: 11


Does the threshold model's performance make sense given the following summaries:

In [68]:
# Print the best threshold found earlier
print(f"Optimal threshold for classification: {best_threshold}")

# Summary statistics for setosa and versicolor Sepal.Length
setosa_summary = df_iris_binary[df_iris_binary["Species"] == 0]["sepal length (cm)"].describe()
versicolor_summary = df_iris_binary[df_iris_binary["Species"] == 1]["sepal length (cm)"].describe()

# Print summaries
print("\nSummary statistics for Setosa Sepal Length:")
print(setosa_summary)

print("\nSummary statistics for Versicolor Sepal Length:")
print(versicolor_summary)

Optimal threshold for classification: 5.4

Summary statistics for Setosa Sepal Length:
count    50.00000
mean      5.00600
std       0.35249
min       4.30000
25%       4.80000
50%       5.00000
75%       5.20000
max       5.80000
Name: sepal length (cm), dtype: float64

Summary statistics for Versicolor Sepal Length:
count    50.000000
mean      5.936000
std       0.516171
min       4.900000
25%       5.600000
50%       5.900000
75%       6.300000
max       7.000000
Name: sepal length (cm), dtype: float64


TO-DO: Write your answer here in English

The threshold makes sense: the mean setosa sepal length (5.01) is below the threshold of 5.4 and the mean versicolor sepal length (5.94) is above it, so the threshold is a decent way of separating the two species. It is not perfect, because the two ranges overlap. Some versicolor flowers are shorter than 5.4 (the minimum is 4.9) and some setosa flowers are longer than 5.4 (the maximum is 5.8), so those flowers will be misclassified.

Create the function `g` explicitly that can predict `y` from `x` being a new `Sepal.Length`. Hint: use `np.where` from NumPy; this can also be done using a lambda function.

In [None]:
# Define function `g` for threshold-based prediction

In [70]:
def g(x):
    return np.where(x > best_threshold, 1, 0)
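
A short usage sketch for `g`, assuming the threshold of 5.4 reported above. The three input lengths are made up for illustration, and `g_lambda` is a hypothetical name for the lambda form mentioned in the hint.

```python
import numpy as np

best_threshold = 5.4  # value reported by the threshold search above

def g(x):
    # 1 (versicolor) when the sepal length exceeds the threshold, else 0 (setosa)
    return np.where(x > best_threshold, 1, 0)

# Equivalent lambda form
g_lambda = lambda x: np.where(x > best_threshold, 1, 0)

new_lengths = np.array([4.9, 5.4, 6.0])  # illustrative inputs
array_pred = g(new_lengths)
scalar_pred = int(g(5.8))                # np.where returns a 0-d array for a scalar

print(array_pred, scalar_pred)
```

A length exactly equal to the threshold is predicted as 0, because the comparison is strict.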