
# 10/18/24 Homework

## Task



* Implement a k-nearest neighbors classifier, apply it to the Palmer penguins dataset, and assess its performance
* Use (a) logistic regression and (b) a support vector machine to create binary classifiers and assess their performance
* Use (a) Softmax regression and (b) a stochastic gradient descent classifier to create multinomial classifiers and assess their performance as well

Of course, we need data to have any hope of accomplishing this goal, so we'll use the well-known [Palmer penguins dataset](https://allisonhorst.github.io/palmerpenguins/).


## Load and inspect the data

In [24]:
import pandas as pd

In [25]:
penguins = pd.read_csv("https://github.com/benmanjackson/CS290/raw/refs/heads/main/penguins.csv")

In [26]:
penguins.head()

,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [27]:
print(penguins.columns)

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
      dtype='object')


## Calculate the ***prior probabilities***

In [28]:
priors = penguins["species"].value_counts( normalize=True )
priors

species,proportion
Adelie,0.44186
Gentoo,0.360465
Chinstrap,0.197674


According to this dataset,
* 44.2% of penguins are Adélie penguins,
* 36.0% of penguins are Gentoo penguins, and
* 19.8% of penguins are Chinstrap penguins.

One of our key assumptions is that this dataset is ***representative***, i.e., that these sample proportions accurately reflect how common each species is in the population we want to classify.
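
As a minimal sketch of what `value_counts(normalize=True)` computes, the snippet below rebuilds a species column from per-species counts and divides each count by the total. The counts (152, 124, 68) are an assumption chosen to reproduce the proportions printed above; they are not read from the CSV. The largest prior is also the accuracy of a classifier that always guesses the most common species, which gives a floor for judging the models that follow.

```python
import pandas as pd

# Assumed per-species counts, chosen to reproduce the proportions shown above.
assumed_counts = {"Adelie": 152, "Gentoo": 124, "Chinstrap": 68}

# Rebuild a species column with one entry per penguin.
species = pd.Series(
    [name for name, n in assumed_counts.items() for _ in range(n)],
    name="species",
)

# Prior probability of each class = class count / total count.
toy_priors = species.value_counts(normalize=True)
print(toy_priors.round(6))

# A classifier that always predicts the most common species is right
# this often; any useful model should beat it.
baseline_accuracy = toy_priors.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```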


## Create our pipeline / import additional libraries

In [29]:
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

In [30]:
penguins = penguins.dropna(subset=['species'])

In [31]:
numerical_features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_features = ['island', 'sex']
target = 'species'

In [32]:
#numerical pipeline
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Fills missing values with mean
    ('scaler', StandardScaler())])                # Standardize features

In [33]:
#categorical pipeline
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Fills missing categorical values with the most frequent one
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])   # One-hot encodes categorical variables

In [34]:
#Preprocessor that combines both pipelines
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)])

## Split into Training and Test sets



In [35]:
X = penguins[numerical_features + categorical_features]
y = penguins[target]
y = y.factorize()[0]  # Converts species into integer labels
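
`factorize` numbers classes in order of first appearance, so the meaning of the integer labels 0, 1, 2 in the classification reports depends on the row order of the CSV. The sketch below uses a hypothetical five-row species column to show how to keep the second return value and map labels back to names. The support counts in the reports (50, 36, 18) suggest 0 = Adelie, 1 = Gentoo, 2 = Chinstrap, but that is an inference rather than something the notebook prints.

```python
import pandas as pd

# Hypothetical species column; the order of first appearance matters.
species = pd.Series(["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"])

# factorize returns the integer codes and the unique values they stand for.
codes, uniques = species.factorize()
print(codes)          # integer label for each row
print(list(uniques))  # position i holds the name behind label i

# Map integer labels back to species names.
label_to_species = dict(enumerate(uniques))
print([label_to_species[c] for c in codes])
```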

In [36]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
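
The split above is purely random, so the class mix in the test set can drift from the priors: the support column of the k-NN report (50, 36, 18 out of 104) corresponds to roughly 48%, 35%, and 17%. One option, sketched here as an alternative rather than what the notebook does, is the `stratify` argument of `train_test_split`. The class counts and the placeholder feature column are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed class counts for labels 0, 1, 2, consistent with the priors computed earlier.
y_toy = np.repeat([0, 1, 2], [152, 124, 68])
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy asks for the same class mix in the train and test sets.
_, _, _, y_test_strat = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

overall = np.bincount(y_toy) / len(y_toy)
in_test = np.bincount(y_test_strat) / len(y_test_strat)
print("overall proportions: ", overall.round(3))
print("test-set proportions:", in_test.round(3))
```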

In [37]:
#k-nearest neighbors:
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
    ])
#Train model
knn_pipeline.fit(X_train, y_train)

In [38]:
#Assess Performance
y_pred_knn = knn_pipeline.predict(X_test)
print("k-NN Classifier Performance:")
print(classification_report(y_test, y_pred_knn))
print("Accuracy:", accuracy_score(y_test, y_pred_knn))

k-NN Classifier Performance:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        50
           1       1.00      1.00      1.00        36
           2       0.95      1.00      0.97        18

    accuracy                           0.99       104
   macro avg       0.98      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104

Accuracy: 0.9903846153846154
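
For intuition about what `KNeighborsClassifier` does at prediction time, here is a from-scratch sketch of the basic rule: find the k closest training points and take a majority vote. It assumes Euclidean distance and unweighted votes, which match scikit-learn's defaults, but it is a teaching sketch and not the library's implementation. The function name `knn_predict` and the demo points are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical, already-scaled 2-D points in two well-separated groups.
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.15, 0.15])))  # near the first group
print(knn_predict(X_demo, y_demo, np.array([2.8, 3.0])))    # near the second group
```

Because the vote depends on raw distances, features on larger scales (such as body mass in grams) would dominate without the `StandardScaler` step in the pipeline.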


In [42]:
#Softmax Regression:
softmax_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # multi_class is deprecated in recent scikit-learn; lbfgs already fits a multinomial (softmax) model for multiclass targets
    ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))
    ])
#Train model:
softmax_pipeline.fit(X_train, y_train)



In [43]:
#Assess the performance:
y_pred_softmax = softmax_pipeline.predict(X_test)
print("Softmax Regression Performance:")
print(classification_report(y_test, y_pred_softmax))
print("Accuracy:", accuracy_score(y_test, y_pred_softmax))

Softmax Regression Performance:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00        36
           2       1.00      1.00      1.00        18

    accuracy                           1.00       104
   macro avg       1.00      1.00      1.00       104
weighted avg       1.00      1.00      1.00       104

Accuracy: 1.0
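
Softmax regression computes one linear score per class and turns those scores into probabilities with the softmax function; the predicted class is the one with the highest probability. The sketch below shows that last step in isolation. The scores are hypothetical values for a single penguin, not outputs of the fitted model.

```python
import numpy as np


def softmax(scores):
    """Convert a vector of class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Hypothetical scores for classes 0, 1, 2 for a single penguin.
scores = np.array([2.0, 0.5, -1.0])
probs = softmax(scores)

print(probs.round(3))
print("predicted class:", int(np.argmax(probs)))
```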


In [48]:
#Stochastic Gradient Descent:
sgd_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3))
    ])
#Train model:
sgd_pipeline.fit(X_train, y_train)

In [50]:
#Editing penguins for binary classification:
penguins['species_binary'] = (penguins['species'] == 'Adelie').astype(int)
y_binary = penguins['species_binary']
#Train/test split (overwrites the multiclass X_train, X_test, y_train, y_test):
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)


In [51]:
#Rebuild Pipelines for binary:
#numerical:
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
    ])
#Categorical:
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])

In [52]:
#New Preprocessor:
preprocessor = ColumnTransformer(transformers=[
    ('num', num_pipeline, numerical_features),
    ('cat', cat_pipeline, categorical_features)
    ])

In [53]:
#Logistic Regression pipeline:
log_reg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
    ])
#Train model:
log_reg_pipeline.fit(X_train, y_train)

In [54]:
#Assess Performance:
y_pred_log = log_reg_pipeline.predict(X_test)
print("Logistic Regression (Binary) Performance:")
print(classification_report(y_test, y_pred_log))
print("Accuracy:", accuracy_score(y_test, y_pred_log))

Logistic Regression (Binary) Performance:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        54
           1       1.00      0.98      0.99        50

    accuracy                           0.99       104
   macro avg       0.99      0.99      0.99       104
weighted avg       0.99      0.99      0.99       104

Accuracy: 0.9903846153846154


In [55]:
#Support Vector Machine:
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC(kernel='linear'))
    ])
#Train model:
svm_pipeline.fit(X_train, y_train)

In [56]:
#Assessing Performance:
y_pred_svm = svm_pipeline.predict(X_test)
print("Support Vector Machine Performance:")
print(classification_report(y_test, y_pred_svm))
print("Accuracy:", accuracy_score(y_test, y_pred_svm))

Support Vector Machine Performance:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        54
           1       1.00      0.96      0.98        50

    accuracy                           0.98       104
   macro avg       0.98      0.98      0.98       104
weighted avg       0.98      0.98      0.98       104

Accuracy: 0.9807692307692307
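
The report above can be read back into a confusion matrix. With 54 non-Adelie and 50 Adelie test penguins, a recall of 1.00 for class 0 and 0.96 for class 1 is consistent with exactly two Adelie penguins being predicted as non-Adelie. The sketch below rebuilds label arrays on that inferred basis (they are reconstructed, not the notebook's actual `y_test` and `y_pred_svm`) and checks that the headline numbers come out the same.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Reconstructed labels: 54 non-Adelie (0) and 50 Adelie (1) test penguins,
# with 2 Adelie penguins predicted as non-Adelie (an inference from the report).
y_true = [0] * 54 + [1] * 50
y_pred = [0] * 54 + [0] * 2 + [1] * 48

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows = true class, columns = predicted class

svm_accuracy = accuracy_score(y_true, y_pred)
class0_precision = precision_score(y_true, y_pred, pos_label=0)
print("Accuracy:", svm_accuracy)                # 102 correct out of 104
print("Precision (class 0):", class0_precision) # 54 of 56 predicted zeros are correct
```

By the same reading, the logistic regression report corresponds to a single misclassified Adelie penguin (103 correct out of 104).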


## Have a great day!