# Palmer Penguins Analysis
An analysis of the Palmer Penguins dataset for the module 23-24: 8634 -- PRINCIPLES OF DATA ANALYTICS.

## Import Modules

In [3]:
import pandas as pd
import seaborn as sns

## Load Dataset
We can use the pandas `read_csv()` function to read the dataset into a pandas DataFrame from the CSV file hosted on GitHub. We'll assign it to the variable `penguins`. The file path (here, a URL) is the only argument needed.

In [4]:
penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
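
`read_csv()` accepts a local path, a URL, or a file-like object. The following is a minimal offline sketch of the same call, reading three rows in the dataset's layout from an in-memory buffer. The names `sample_csv` and `sample` are illustrative and not part of the original analysis.

```python
import io

import pandas as pd

# Three rows in the same layout as penguins.csv (illustrative sample only).
sample_csv = """species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
Adelie,Torgersen,,,,,
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(sample_csv))

# Empty fields are parsed as NaN, so the measurement columns become floats.
print(sample.shape)
print(sample.dtypes)
```

Reading from a buffer like this can be handy for experimenting with parsing options without downloading the full file.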

## Overview of Dataset
There are a number of functions within the pandas library that we can use to explore the dataset. 

The `display()` function, which Jupyter provides, renders the DataFrame. Because the dataset is long, only the first and last five rows are shown, which gives us information about the types of variables and the size of our dataset.

In [5]:
display(penguins)

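|     | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | FEMALE |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | MALE |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | FEMALE |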


We can see that our dataset has 344 rows, some of which contain missing datapoints as evidenced by the `NaN` values in rows 3 and 339. 
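
The row count and the missing values can also be checked directly instead of being read off the preview. A minimal sketch, assuming a small stand-in DataFrame (`df` is a hypothetical name; the same calls should apply to `penguins`):

```python
import numpy as np
import pandas as pd

# Small stand-in for the penguins DataFrame (illustrative values only).
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Adelie"],
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan],
        "sex": ["MALE", "FEMALE", "FEMALE", None],
    }
)

# .shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape

# Count of missing values in each column.
missing_per_column = df.isna().sum()

# Rows that contain at least one missing value.
rows_with_missing = df[df.isna().any(axis=1)]

print(n_rows, n_cols)
print(missing_per_column)
print(rows_with_missing.index.tolist())
```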

Our dataset has seven columns, four of which appear to contain floats and three of which appear to contain strings. To verify the types of data in our dataset, we can use the `.info()` method, which returns some descriptive information about a DataFrame.

In [9]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


This confirms our inference about our data types. The object data type can be used to store mixed data, but it is generally used to store strings.  
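
Because an object column can hold mixed data, one way to see what it really stores is to inspect the Python type of each element. The sketch below uses two hypothetical Series, `strings` and `mixed`, rather than the penguins data:

```python
import pandas as pd

# A Series that holds only strings.
strings = pd.Series(["Adelie", "Gentoo", "Chinstrap"])

# Forcing dtype=object lets one Series hold mixed Python objects.
mixed = pd.Series(["Adelie", 42, 3.5], dtype=object)

# Map each element to its Python type to see what is actually stored.
element_types = mixed.map(type).tolist()

# Check whether every element of a Series is a string.
all_strings = strings.map(lambda value: isinstance(value, str)).all()

print(mixed.dtype)
print(element_types)
print(all_strings)
```

Applied to a column with gaps, such as `sex`, a check like this would be expected to report the missing entries as floats, since `NaN` is a float value.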

The `.describe()` method detects numerical columns and provides some summary statistics from our dataset. 

In [7]:
penguins.describe()

|       | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |
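
By default `.describe()` summarises only the numeric columns. As a sketch of how the text columns could be summarised as well, the example below excludes numeric types on a small stand-in DataFrame; `df` and its values are illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame (illustrative values only).
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "body_mass_g": [3750.0, 3800.0, 4850.0, 5750.0],
    }
)

# Default behaviour: count, mean, std, min, quartiles, max for numeric columns.
numeric_summary = df.describe()

# Excluding numeric types yields count, unique, top, freq for the rest.
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```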


In [12]:
penguins["flipper_length_mm"].unique()

array([181., 186., 195.,  nan, 193., 190., 180., 182., 191., 198., 185.,
       197., 184., 194., 174., 189., 187., 183., 172., 178., 188., 196.,
       179., 200., 192., 202., 205., 208., 203., 199., 176., 210., 201.,
       212., 206., 207., 211., 230., 218., 215., 219., 209., 214., 216.,
       213., 217., 221., 222., 220., 225., 224., 231., 229., 223., 228.,
       226.])

## Types of Variables

We've determined that our dataset has four float columns, and three object columns (which likely contain strings). Let's examine them in more detail. 



### Species

The first column in our data is the species column. We can index our DataFrame using a string to return just this column as a pandas Series. 

In [20]:
penguins["species"]

0      Adelie
1      Adelie
2      Adelie
3      Adelie
4      Adelie
        ...  
339    Gentoo
340    Gentoo
341    Gentoo
342    Gentoo
343    Gentoo
Name: species, Length: 344, dtype: object
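
The kind of key used for indexing determines what comes back. A minimal sketch on a hypothetical two-row DataFrame `df`: a single string label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

# Two-row stand-in DataFrame (illustrative values only).
df = pd.DataFrame(
    {
        "species": ["Adelie", "Gentoo"],
        "island": ["Torgersen", "Biscoe"],
    }
)

# A single column label returns a pandas Series.
as_series = df["species"]

# A list of labels returns a DataFrame, even with one column.
as_frame = df[["species"]]

print(type(as_series).__name__)
print(type(as_frame).__name__)
```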

The species column appears to contain string data. From this view, we can see at least two unique elements, "Adelie" and "Gentoo". To view all of the unique elements in this Series, we can use the `.unique()` method.

In [24]:
penguins["species"].unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

We can see that there is actually a third unique element, "Chinstrap". Since the species column contains three unique elements, it makes sense to use strings to model this variable. 
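
Beyond listing the unique values, it can help to count how often each one occurs. The sketch below also shows pandas' `category` dtype as one optional alternative to plain strings for a variable with few distinct values. The Series `species` and its values are made up for illustration and do not reflect the real species counts.

```python
import pandas as pd

# Illustrative values only, not the real species counts.
species = pd.Series(
    ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"],
    name="species",
)

# Distinct values in order of first appearance.
unique_values = list(species.unique())

# Number of rows for each distinct value.
counts = species.value_counts()

# Optional: store the column as a categorical variable.
as_category = species.astype("category")

print(unique_values)
print(counts)
print(list(as_category.cat.categories))
```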

### Island

We can use the same commands to explore the island variable. 

In [25]:
penguins["island"]

0      Torgersen
1      Torgersen
2      Torgersen
3      Torgersen
4      Torgersen
         ...    
339       Biscoe
340       Biscoe
341       Biscoe
342       Biscoe
343       Biscoe
Name: island, Length: 344, dtype: object

In [26]:
penguins["island"].unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

From this, we can see that the island column also contains three unique elements: "Torgersen", "Biscoe" and "Dream".

## References

* https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
* https://www.analyticsvidhya.com/blog/2022/04/data-exploration-and-visualisation-using-palmer-penguins-dataset/
* https://stackoverflow.com/questions/48503192/pandas-what-does-object-type-really-mean
* https://pandas.pydata.org/docs/user_guide/indexing.html