# Abstract

Describe the high-level aims of the analysis and give an overview of the findings.

First, we'll load the Palmer Penguins dataset:

In [64]:
import pandas as pd

url = "https://raw.githubusercontent.com/PhilChodrow/ml-notes/main/data/palmer-penguins/train.csv"
train = pd.read_csv(url)

We'll take a quick peek at how the data looks:

In [65]:
train.head(5)

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0809,31,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N63A1,Yes,11/24/08,40.9,16.6,187.0,3200.0,FEMALE,9.08458,-24.54903,
1,PAL0809,41,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N74A1,Yes,11/24/08,49.0,19.5,210.0,3950.0,MALE,9.53262,-24.66867,
2,PAL0708,4,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N32A2,Yes,11/27/07,50.0,15.2,218.0,5700.0,MALE,8.2554,-25.40075,
3,PAL0708,15,Gentoo penguin (Pygoscelis papua),Anvers,Biscoe,"Adult, 1 Egg Stage",N38A1,Yes,12/3/07,45.8,14.6,210.0,4200.0,FEMALE,7.79958,-25.62618,
4,PAL0809,34,Chinstrap penguin (Pygoscelis antarctica),Anvers,Dream,"Adult, 1 Egg Stage",N65A2,Yes,11/24/08,51.0,18.8,203.0,4100.0,MALE,9.23196,-24.17282,


# Data Preparation

Some columns aren't relevant to classifying species, so we'll remove them from the dataset. We'll also print the shape of the DataFrame before and after to confirm that the columns were removed.

In [66]:
print(train.shape)
train = train.drop(columns=["studyName", "Sample Number", "Individual ID", "Comments", "Date Egg", "Region"])
print(train.shape)

(275, 17)
(275, 11)
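
The before-and-after shape check can also be written as an assertion, so a mistyped column name or an unexpected column count fails loudly instead of relying on a visual comparison. The following is a minimal sketch on a toy frame, not the full dataset; the names `toy`, `to_drop`, and `trimmed` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the penguins table, using a few of the columns shown above
toy = pd.DataFrame({
    "studyName": ["PAL0809", "PAL0708"],
    "Species": ["Chinstrap", "Gentoo"],
    "Body Mass (g)": [3200.0, 5700.0],
    "Comments": [None, None],
})

to_drop = ["studyName", "Comments"]
n_rows, n_cols = toy.shape

trimmed = toy.drop(columns=to_drop)

# Rows are untouched; exactly len(to_drop) columns are gone
assert trimmed.shape == (n_rows, n_cols - len(to_drop))
print(toy.shape, "->", trimmed.shape)
```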


Since all categorical data must be converted to numerical form, let's check the data type of each column:

In [67]:
train.dtypes

Species                 object
Island                  object
Stage                   object
Clutch Completion       object
Culmen Length (mm)     float64
Culmen Depth (mm)      float64
Flipper Length (mm)    float64
Body Mass (g)          float64
Sex                     object
Delta 15 N (o/oo)      float64
Delta 13 C (o/oo)      float64
dtype: object
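
Rather than reading the categorical columns off the listing by eye, they can be collected programmatically with `select_dtypes`. This is a small sketch on a toy frame; the variable names are illustrative and not part of the analysis above.

```python
import pandas as pd

# Toy frame with two text columns and one numeric column
toy = pd.DataFrame({
    "Island": ["Dream", "Biscoe"],
    "Sex": ["FEMALE", "MALE"],
    "Culmen Length (mm)": [40.9, 50.0],
})

# Everything that is not numeric will need encoding later
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```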

## Data Preparation

First, we'll drop every row that contains a missing (N/A) value, then make the `Species` column more readable by keeping only the first word of each label, which removes the Latin name.

In [68]:
train = train.dropna()
train["Species"] = train["Species"].str.split().str.get(0)
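
To see what these two steps do, here is a small sketch on a toy frame (the frame and the name `clean` are illustrative). `str.split()` breaks each label on whitespace and `str.get(0)` keeps the first piece.

```python
import pandas as pd

toy = pd.DataFrame({
    "Species": [
        "Chinstrap penguin (Pygoscelis antarctica)",
        "Gentoo penguin (Pygoscelis papua)",
        "Gentoo penguin (Pygoscelis papua)",
    ],
    "Body Mass (g)": [3200.0, None, 5700.0],
})

# Drop the row with the missing body mass; copy so the assignment below
# clearly targets a new frame
clean = toy.dropna().copy()

# Keep only the first whitespace-separated word of each label
clean["Species"] = clean["Species"].str.split().str.get(0)

print(clean["Species"].tolist())
```

Note that `dropna()` keeps the original row labels, so the index of `clean` has a gap where the incomplete row was removed.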

We'll convert the categorical data to numbers in one of two ways: one-hot encoding with `pd.get_dummies`, or integer labels with `LabelEncoder`. `Species` gets integer labels because it's the column we're trying to predict.

In [69]:
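# Set the target column aside, then remove it from the feature table
species = train[["Species"]]
train = train.drop(columns=["Species"])

# One-hot encode the remaining categorical columns; multiplying by 1 turns booleans into 0/1
train = 1*pd.get_dummies(train)
# Drop the indicator column created by the stray "." entry in Sex
train = train.drop(columns=["Sex_."])
train.head()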

Unnamed: 0,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Delta 15 N (o/oo),Delta 13 C (o/oo),Island_Biscoe,Island_Dream,Island_Torgersen,"Stage_Adult, 1 Egg Stage",Clutch Completion_No,Clutch Completion_Yes,Sex_FEMALE,Sex_MALE
0,40.9,16.6,187.0,3200.0,9.08458,-24.54903,0,1,0,1,0,1,1,0
1,49.0,19.5,210.0,3950.0,9.53262,-24.66867,0,1,0,1,0,1,0,1
2,50.0,15.2,218.0,5700.0,8.2554,-25.40075,1,0,0,1,0,1,0,1
3,45.8,14.6,210.0,4200.0,7.79958,-25.62618,1,0,0,1,0,1,1,0
4,51.0,18.8,203.0,4100.0,9.23196,-24.17282,0,1,0,1,0,1,0,1
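
As a small illustration of one-hot encoding, the sketch below runs `pd.get_dummies` on a toy frame (illustrative names and a reduced set of columns). It passes `dtype=int`, which is one alternative to multiplying by 1 for getting 0/1 integers instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    "Island": ["Dream", "Biscoe", "Dream"],
    "Body Mass (g)": [3200.0, 5700.0, 3950.0],
})

# Numeric columns pass through unchanged; each category of a text column
# becomes its own 0/1 indicator column named "<column>_<category>"
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())
print(encoded["Island_Dream"].tolist())
```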


In [70]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-D array of labels, so pass the column rather than a one-column DataFrame
le = LabelEncoder()
le.fit_transform(species["Species"])


Unnamed: 0,Species
0,Chinstrap
1,Chinstrap
2,Gentoo
3,Gentoo
4,Chinstrap
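
For reference, here is a minimal sketch of how `LabelEncoder` maps the five labels shown above to integers and back. The list `labels` and the variable `y` are illustrative; the full dataset may contain more species than these two.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Chinstrap", "Chinstrap", "Gentoo", "Gentoo", "Chinstrap"]

le = LabelEncoder()
y = le.fit_transform(labels)

# Classes are stored in sorted order, so Chinstrap -> 0 and Gentoo -> 1
print(list(le.classes_))
print(y.tolist())

# inverse_transform recovers the original names from the integer codes
print(list(le.inverse_transform(y)))
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns a model's integer predictions back into species names.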


## Explore

To explore the data, we'll display two figures and one summary table: