# Week 2 Notebook 3 Intro to Pandas with Solutions

## The Pandas Library

The `Pandas` library contains useful functions for reading and manipulating data. 

It loads tabular data, such as the contents of a CSV file, into a data frame: a table with labelled rows and columns. In this notebook we will use `Pandas` to read in our data and explore it.

## Palmer Penguins Data

We will use the `penguins_size` dataset, compiled by [Allison Horst](https://allisonhorst.github.io/palmerpenguins/).

This is data about three types of penguins that were studied in Antarctica. 

The data we are using is available at [Parul Pandey's Kaggle page](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data?select=penguins_size.csv).

Select the `penguins_size.csv` file, download it, and save it in the same folder as this notebook so that you can read it in easily.

![image.png](attachment:image.png)



## Using Pandas

To use the pandas library, we have to import it. By convention it is imported under the alias `pd`, which serves as a shortcut when calling its functions.

In [1]:
import pandas as pd


Now we can use the pandas function `read_csv()` to read the csv file we have saved. If it is in your current working directory, you don't have to specify the folder path.

In [2]:
# Read the data from the csv file into a pandas dataframe called df.
df = pd.read_csv('penguins_size.csv')
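As a rough illustration of what `read_csv()` does with the text it reads, the sketch below parses a few lines of CSV held in memory instead of a file on disk. The `csv_text` contents and the name `sample_df` are made up for this example and are not the real penguins data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk (not the real penguins data).
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Gentoo,Biscoe,
Chinstrap,Dream,3500
"""

# read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.dtypes)
```

The first line of the text becomes the column names, and the empty field in the second data row is read as a missing value (`NaN`).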

Let's have a look at the first five rows using the `head()` method on the dataframe.

In [3]:
df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE


This also gives us a quick look at the data that we have, nicely formatted by `Pandas`.

We can see that row 3 has missing data in several of its columns. To view a different number of rows, pass the number to `head()` through its `n` argument, for example `df.head(n=10)` or simply `df.head(10)`.



In [4]:
# Show more rows by specifying the number required
df.head(10)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,MALE
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


The `tail()` method shows the last 5 rows of the data frame.

In [5]:
# tail() method shows us the last five rows
df.tail()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
339,Gentoo,Biscoe,,,,,
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE
343,Gentoo,Biscoe,49.9,16.1,213.0,5400.0,MALE


This shows that there are 344 rows of data (indexed from 0 to 343).
You can get information about the data frame using other `Pandas` methods:

In [6]:
# Get basic info about the data frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   culmen_length_mm   342 non-null    float64
 3   culmen_depth_mm    342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                334 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


You can see from the information shown that there are 344 rows and 7 columns in the data. However, some columns have a non-null count below 344, indicating that they contain missing values.
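The missing values can also be counted directly. The sketch below uses a small made-up frame called `toy` (not the real penguins data) to show how `isna()` and `count()` relate to the non-null counts reported by `info()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with a few gaps (not the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 5400.0],
    "sex": ["MALE", None, "FEMALE", None],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# count() reports the non-null entries per column, like "Non-Null Count" in info().
non_null_per_column = toy.count()
print(non_null_per_column)
```

For every column, the missing and non-null counts add up to the number of rows.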

## Data Exploration

We can explore the data that we have. For example, we can obtain basic descriptive statistics about the numerical columns using the `describe()` method on the data frame `df`:

In [7]:
# Basic descriptive statistics 
df.describe()

Unnamed: 0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
count,342.0,342.0,342.0,342.0
mean,43.92193,17.15117,200.915205,4201.754386
std,5.459584,1.974793,14.061714,801.954536
min,32.1,13.1,172.0,2700.0
25%,39.225,15.6,190.0,3550.0
50%,44.45,17.3,197.0,4050.0
75%,48.5,18.7,213.0,4750.0
max,59.6,21.5,231.0,6300.0


This gives us the count, mean, standard deviation, minimum and maximum values, as well as the lower quartile (25%), median (50%) and upper quartile (75%).
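To see where the percentile rows come from, the sketch below compares `describe()` with direct calls to `quantile()`. The `masses` values are invented for illustration only.

```python
import pandas as pd

# Hypothetical values chosen only for illustration.
masses = pd.Series([3000.0, 3500.0, 4000.0, 4500.0, 5000.0], name="body_mass_g")

summary = masses.describe()
print(summary)

# Each percentile row in describe() matches the corresponding quantile() call.
lower_quartile = masses.quantile(0.25)
median = masses.quantile(0.50)
upper_quartile = masses.quantile(0.75)
print(lower_quartile, median, upper_quartile)
```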

### Exploring by columns 

You can get a list of the columns in the dataset using the `columns` attribute.

In [8]:
df.columns

Index(['species', 'island', 'culmen_length_mm', 'culmen_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

To refer to a specific column, you can use the column name as a string in the square brackets:

    df["species"]

You can use either single quotes or double quotes, so you could refer to the column like this:

    df['species']
    
If the column name follows Python's rules for variable names, you can also refer to the column using dot notation.

    df.species
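Dot notation has limits, which the sketch below tries to show with a made-up frame called `toy`. Its column names are hypothetical and were picked to include a space and a Python reserved word.

```python
import pandas as pd

# Hypothetical frame; the column names are chosen to show the limits of dot notation.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "body mass": [3750.0, 5000.0],  # contains a space
    "class": ["a", "b"],            # a reserved word in Python
})

# Bracket and dot notation return the same column when the name is a valid identifier.
same = toy["species"].equals(toy.species)
print(same)

# Names with spaces or reserved words can only be reached with brackets.
print(toy["body mass"])
print(toy["class"])
```

Bracket notation is also the safer choice when a column name clashes with an existing data frame attribute or method, such as `count`.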

For example, you can apply the `describe()` method to the `species` column to obtain basic information about the column:
- the number of values
- how many unique values
- the value with the highest frequency. 

In [9]:
# You can refer to the data in the columns using the name of the column
# These three ways are equivalent

#df["species"].describe()
#df['species'].describe()
df.species.describe()

count        344
unique         3
top       Adelie
freq         152
Name: species, dtype: object

Another quick way of checking how many observations of each species there are is to use the `value_counts()` method.

In [10]:
df["species"].value_counts()

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

This shows us that there are three possible values for the `species` column, with 152 observations with the value `Adelie`, 124 observations for the value `Gentoo` and 68 for the value `Chinstrap`.
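If proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses an invented `species` series, not the real penguin counts.

```python
import pandas as pd

# Hypothetical species labels, not the real penguin counts.
species = pd.Series(["Adelie", "Adelie", "Adelie", "Gentoo"], name="species")

counts = species.value_counts()
shares = species.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```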

### Groupings using groupby 

When we have many categories of data, we would like to compare values between the different category groups.

The `groupby()` method can be used to form groups of data.

For example, if we want to find the means for the numeric values for each species type:

In [11]:
# find the mean of each numeric column for each species
# numeric_only=True skips text columns such as island and sex
df.groupby("species").mean(numeric_only=True)


Unnamed: 0_level_0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adelie,38.791391,18.346358,189.953642,3700.662252
Chinstrap,48.833824,18.420588,195.823529,3733.088235
Gentoo,47.504878,14.982114,217.186992,5076.01626


This calculates the mean of each of the numeric values, for each group of species. 

For example, in the results shown, the mean culmen length of penguins from the species `Adelie` is 38.791391.
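The following sketch checks a grouped mean by hand on a small made-up frame called `toy` (not the real penguins data). It passes `numeric_only=True` because recent pandas versions raise an error when asked to average text columns.

```python
import pandas as pd

# Hypothetical miniature frame (not the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "island": ["Dream", "Biscoe", "Biscoe", "Biscoe"],
    "body_mass_g": [3000.0, 4000.0, 5000.0, 6000.0],
})

# numeric_only=True skips text columns such as island.
group_means = toy.groupby("species").mean(numeric_only=True)
print(group_means)

# The same number computed by hand: filter the rows, then average one column.
adelie_mean = toy.loc[toy["species"] == "Adelie", "body_mass_g"].mean()
print(adelie_mean)
```

Grouping therefore amounts to splitting the rows by species, applying the mean to each part, and combining the results into one table.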

You can also **count** the number of values recorded.

In [12]:
# Counting the number of data values for each column using count()
print("Count of data values for each species")
print(df.groupby(['species']).count())

# Count the number of unique values for each column using nunique()
print("Count of unique data values for each species")
print(df.groupby(['species']).nunique())

Count of data values for each species
           island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
species                                                                   
Adelie        152               151              151                151   
Chinstrap      68                68               68                 68   
Gentoo        124               123              123                123   

           body_mass_g  sex  
species                      
Adelie             151  146  
Chinstrap           68   68  
Gentoo             123  120  
Count of unique data values for each species
           island  culmen_length_mm  culmen_depth_mm  flipper_length_mm  \
species                                                                   
Adelie          3                78               49                 32   
Chinstrap       1                55               33                 25   
Gentoo          1                75               39                 25   

           body
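The difference between these counting methods shows up most clearly when values are missing or repeated. The sketch below uses a made-up frame called `toy` and also includes `size()`, which is not used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one repeated and one missing value.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
    "body_mass_g": [3000.0, 3000.0, np.nan, 5000.0],
})

grouped = toy.groupby("species")

rows_per_group = grouped.size()         # all rows, missing or not
values_per_group = grouped.count()      # non-missing values per column
distinct_per_group = grouped.nunique()  # distinct non-missing values per column

print(rows_per_group)
print(values_per_group)
print(distinct_per_group)
```

This is why the `count()` results for Adelie are lower in some columns than the 152 rows reported by `value_counts()`.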

You can add another level to group by, for example, if we want to calculate the mean by `species`, then `sex`, we add the column to the list.

In [13]:
# adding another column to group by
# numeric_only=True skips the remaining text column (island)
df.groupby(["species","sex"]).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
species,sex,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Adelie,FEMALE,37.257534,17.621918,187.794521,3368.835616
Adelie,MALE,40.390411,19.072603,192.410959,4043.493151
Chinstrap,FEMALE,46.573529,17.588235,191.735294,3527.205882
Chinstrap,MALE,51.094118,19.252941,199.911765,3938.970588
Gentoo,.,44.5,15.7,217.0,4875.0
Gentoo,FEMALE,45.563793,14.237931,212.706897,4679.741379
Gentoo,MALE,49.47377,15.718033,221.540984,5484.836066


This shows the mean values by each species, but they are further grouped into the different levels by sex. 

However, for Gentoo the `sex` column contains a third value, `.`, alongside `FEMALE` and `MALE`. We'll have to try to manage that later.
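One possible way to handle a stray value like `.` is to treat it as missing before grouping. The sketch below is only an assumption about how this could be done, not necessarily the approach taken later in the course. The `toy` frame and its values are invented.

```python
import pandas as pd

# Hypothetical rows imitating the stray "." value (not the real penguins data).
toy = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Gentoo", "Gentoo"],
    "sex": ["MALE", "FEMALE", ".", "FEMALE"],
    "body_mass_g": [5500.0, 4700.0, 4875.0, 4600.0],
})

cleaned = toy.copy()
# where() keeps values that satisfy the condition and replaces the rest with NaN.
cleaned["sex"] = cleaned["sex"].where(cleaned["sex"].isin(["MALE", "FEMALE"]))

# groupby() drops rows whose key is missing by default, so "." no longer forms a group.
means = cleaned.groupby(["species", "sex"]).mean(numeric_only=True)
print(means)
```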

You can even try to group further, though groupings are generally only meaningful for categorical values.


In [15]:
df.groupby(["species","sex","island"]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
species,sex,island,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelie,FEMALE,Biscoe,37.359091,17.704545,187.181818,3369.318182
Adelie,FEMALE,Dream,36.911111,17.618519,187.851852,3344.444444
Adelie,FEMALE,Torgersen,37.554167,17.55,188.291667,3395.833333
Adelie,MALE,Biscoe,40.590909,19.036364,190.409091,4050.0
Adelie,MALE,Dream,40.071429,18.839286,191.928571,4045.535714
Adelie,MALE,Torgersen,40.586957,19.391304,194.913043,4034.782609
Chinstrap,FEMALE,Dream,46.573529,17.588235,191.735294,3527.205882
Chinstrap,MALE,Dream,51.094118,19.252941,199.911765,3938.970588
Gentoo,.,Biscoe,44.5,15.7,217.0,4875.0
Gentoo,FEMALE,Biscoe,45.563793,14.237931,212.706897,4679.741379
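Results grouped by more than one column are indexed by several levels at once. The sketch below shows two ways of working with such a result, using a made-up frame called `toy` with two grouping columns.

```python
import pandas as pd

# Hypothetical miniature frame (not the real penguins data).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex": ["FEMALE", "MALE", "FEMALE", "MALE"],
    "body_mass_g": [3400.0, 4000.0, 4700.0, 5500.0],
})

means = toy.groupby(["species", "sex"]).mean(numeric_only=True)

# A tuple of labels selects one row from the multi-level index.
adelie_female = means.loc[("Adelie", "FEMALE"), "body_mass_g"]
print(adelie_female)

# reset_index() turns the group labels back into ordinary columns.
flat = means.reset_index()
print(flat)
```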


## Exercises

Let's try out these operations with the [Iris](https://datahub.io/machine-learning/iris#resource-iris) data set from DataHub.

Download the `iris_csv.csv` dataset into your working directory and then proceed with the questions below.

In [16]:
# We should always start with the import, although it may have been run above
import pandas as pd

In [23]:
# read the data into a dataframe called irisdf
irisdf = pd.read_csv('iris_csv.csv')

**Q1. Display Data** 

Display the first 8 rows of the `DataFrame` object `irisdf`

In [24]:
# Q1 answer
irisdf.head(8)

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa


 **Q2. Basic Exploration**
 
 Explore the data using the methods `info()` and `describe()`

In [25]:
# Q2 view information about the data frame

# use the info() method
irisdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   sepallength  150 non-null    float64
 1   sepalwidth   150 non-null    float64
 2   petallength  150 non-null    float64
 3   petalwidth   150 non-null    float64
 4   class        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [26]:
# Q2 view information about the data frame

# use the describe() method
irisdf.describe()

Unnamed: 0,sepallength,sepalwidth,petallength,petalwidth
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


**Q3. Show Columns**

Show the names of the columns

In [27]:
# Q3 answer
irisdf.columns

Index(['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class'], dtype='object')

**Q4. Show Frequency**

Find how many observations there are for each unique value of `class`.

In [28]:
# Q4 answer
irisdf['class'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64

**Q5. Grouping**

Using the `groupby()` method, find the means of the numeric values by `class`.

In [29]:
#Q5 Answer
irisdf.groupby(['class']).mean()

Unnamed: 0_level_0,sepallength,sepalwidth,petallength,petalwidth
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Iris-setosa,5.006,3.418,1.464,0.244
Iris-versicolor,5.936,2.77,4.26,1.326
Iris-virginica,6.588,2.974,5.552,2.026
