# 🐧 Logistic Regression with the Penguins Dataset

Welcome to today's lab session! 🎉

In this notebook, we will:
- Perform **Exploratory Data Analysis (EDA)** on the `penguins` dataset 🏝️
- Train a **Logistic Regression Model** to predict penguin sex 🐧

Let's get started! 🚀

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
# Load the dataset
penguins = sns.load_dataset('penguins')
penguins.head()  # Display first few rows

In [None]:
# Check for missing values
penguins.isnull().sum()

In [None]:
# Drop rows with missing values
penguins = penguins.dropna()
penguins.isnull().sum()  # Verify missing values are gone

In [None]:
# Exploratory Data Analysis (EDA)
sns.pairplot(penguins, hue='sex', diag_kind='hist')
plt.show()

## 1️⃣ Modify and Expand EDA (15 min)

**Task:** Add at least two more visualizations to explore relationships between features.

Examples:
- Does body mass vary significantly between species? (Hint: use `sns.boxplot`) 📊
- Do male and female penguins have different bill lengths? (Hint: try `sns.violinplot`)

**Goal:** Think critically about which features might be useful for prediction! 🧠
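
As a starting point for the two hinted plots, here is a minimal sketch. The helper name `plot_eda_examples` is hypothetical (it is not part of seaborn), and it assumes the DataFrame still has the original `species`, `sex`, `body_mass_g`, and `bill_length_mm` columns, so it should run before the encoding cell.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_eda_examples(df):
    """Draw a boxplot and a violin plot side by side (hypothetical helper)."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Body mass by species: each box shows the median and quartiles
    sns.boxplot(data=df, x='species', y='body_mass_g', ax=axes[0])
    axes[0].set_title('Body mass by species')

    # Bill length by sex: each violin shows the shape of the distribution
    sns.violinplot(data=df, x='sex', y='bill_length_mm', ax=axes[1])
    axes[1].set_title('Bill length by sex')

    fig.tight_layout()
    return fig
```

Calling `plot_eda_examples(penguins)` followed by `plt.show()` should display both panels. Treat these as two examples only; the task still asks you to choose and justify your own visualizations.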

In [None]:
# Encode categorical variables
le = LabelEncoder()
penguins['sex'] = le.fit_transform(penguins['sex'])  # Classes are sorted alphabetically: Female=0, Male=1
penguins = pd.get_dummies(penguins, columns=['species', 'island'], drop_first=True)
penguins.head()

In [None]:
# Split data into training and testing sets
X = penguins.drop(columns=['sex'])  # Features
y = penguins['sex']  # Target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Train logistic regression model
# The default max_iter (100) may not converge on unscaled features, so raise it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

In [None]:
# Model evaluation
y_pred = model.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

## 2️⃣ Feature Selection & Engineering (20 min)

**Task:** Try training the logistic regression model with different feature sets:

- What happens if we remove body mass?
- Does including island or species improve accuracy?

**Goal:** Understand how feature choices impact model performance! 🚀
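
One way to run this comparison is sketched below. The helper `compare_feature_sets` is a hypothetical name, and wrapping the model in a `StandardScaler` pipeline is an added assumption (the notebook's own model is unscaled); it is included only to help the solver converge.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_feature_sets(X, y, feature_sets, random_state=42):
    """Return {name: test accuracy} for each named list of columns."""
    # Split once so every feature set is scored on the same test rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    scores = {}
    for name, columns in feature_sets.items():
        # Scaling is an added assumption; it helps the solver converge
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(X_train[columns], y_train)
        scores[name] = clf.score(X_test[columns], y_test)
    return scores
```

For example, with `all_cols = list(X.columns)`, passing `{'all': all_cols, 'no body mass': [c for c in all_cols if c != 'body_mass_g']}` as `feature_sets` compares the full model against one without body mass. With a test set this small, differences of a few percentage points may be noise rather than a real effect.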

## 3️⃣ Model Interpretation & Error Analysis (15 min)

**Task:** After training your model, interpret the results:

- Which features had the most impact? 📊
- Where did the model make the most mistakes? (Check the confusion matrix) 🧐
- Are male or female penguins harder to classify? Why? 🤔

**Goal:** Move beyond accuracy and think about real-world implications.
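
The questions above can be approached with two small helpers, sketched here under hypothetical names (`rank_coefficients`, `per_class_recall`). One caveat: raw coefficient sizes depend on each feature's units (grams versus millimetres), so the ranking is only a fair comparison when the features are on similar scales, for example after standardizing them.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def rank_coefficients(model, feature_names):
    """Return the model's coefficients sorted by absolute size."""
    # coef_[0] holds one weight per feature for a binary classifier
    coefs = pd.Series(model.coef_[0], index=feature_names)
    order = coefs.abs().sort_values(ascending=False).index
    return coefs.loc[order]


def per_class_recall(y_true, y_pred):
    """Return the fraction of each true class that was predicted correctly."""
    cm = confusion_matrix(y_true, y_pred)
    # Rows are true classes; the diagonal counts correct predictions
    return cm.diagonal() / cm.sum(axis=1)
```

In this notebook, `rank_coefficients(model, X_train.columns)` and `per_class_recall(y_test, y_pred)` would be the natural calls. Because `LabelEncoder` orders classes alphabetically, the first recall value corresponds to female penguins and the second to male penguins.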

## 4️⃣ Challenge: Make a Small Change (Optional, 20 min)

**Task:** Modify the logistic regression model and justify the change:

- Adjust the `solver` or `C` parameter and observe the effect. 🛠️ (Take some time to look these parameters up so you understand what they do.)
- Change the target variable (e.g., predict species instead of sex).

**Goal:** Get comfortable experimenting and defending your choices! 💡
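
For the `C` experiment, a minimal sketch of a parameter sweep follows. The helper name `sweep_C` and the particular grid of values are assumptions chosen for illustration. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more heavily.

```python
from sklearn.linear_model import LogisticRegression


def sweep_C(X_train, y_train, X_test, y_test, C_values=(0.01, 0.1, 1, 10, 100)):
    """Return {C: (train accuracy, test accuracy)} for each C value."""
    results = {}
    for C in C_values:
        # Smaller C means stronger regularization (C is the inverse strength)
        clf = LogisticRegression(C=C, max_iter=1000)
        clf.fit(X_train, y_train)
        results[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    return results
```

Note that the argument order here is `X_train, y_train, X_test, y_test`, which differs from the order returned by `train_test_split`. Comparing the train and test accuracy for each `C` gives a rough sense of whether the model is underfitting or overfitting.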

## ✅ Final Deliverables

At the end of class or as homework, submit the following to Google Classroom:

- A completed notebook with your modifications. 📂
- A short written reflection answering:
  - What patterns did you find in the data? 🔍
  - Which feature(s) were most useful for prediction? 🧩
  - If you could collect one extra feature, what would it be and why? 🤯