# Quick intro to pandas

pandas is a Python library for manipulating and analyzing data.

We usually import pandas like this:

In [1]:
import pandas as pd

pandas has two core data structures that we care about: `Series` and `DataFrame`. A `Series` is roughly akin to a list: it's one-dimensional and can hold data of any type. A `DataFrame` is two-dimensional (it has rows and columns), and its rows are labeled by an index (by default, the integers 0, 1, 2, and so on). This means that a row can be identified by its index label. Each column of a `DataFrame` is itself a `Series`.
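
To make the relationship between the two concrete, here is a minimal sketch. The names `weights` and `toy` and the values in them are made up for illustration.

```python
import pandas as pd

# A Series is one-dimensional: a sequence of values plus an index of labels.
weights = pd.Series([22000, 3500], index=["Orca", "Narwhal"], name="Weight")
print(weights["Orca"])  # look up a value by its label: 22000

# A DataFrame is two-dimensional; selecting one column gives back a Series.
toy = pd.DataFrame({"Name": ["Orca", "Narwhal"], "Weight": [22000, 3500]})
print(type(toy["Weight"]))
print(toy.index.tolist())  # the default row labels: [0, 1]
```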

We'll focus on `DataFrames` for now. We can populate a `DataFrame` in a few different ways. First, let's instantiate a `DataFrame` with our own data.

## Making a DataFrame:

In [2]:
df = pd.DataFrame({
    "Name": ["Orca", "Blue Whale", "Beluga Whale", "Narwhal"],
    "Weight": [22000, 441000, 2530, 3500],
    "Cute": [True, True, True, True],
    "Habitat": ["Everywhere", "Everywhere", "Arctic", "Arctic"],
})

Let's look at the types the columns of `df` have: 

In [3]:
df.dtypes

Name       object
Weight      int64
Cute         bool
Habitat    object
dtype: object

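Wow, that's a lot of different types! `Weight` holds 64-bit integers, `Cute` holds booleans, and the two text columns show up as `object`, the dtype pandas reports here for strings.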

We can also access the column names like this:

In [4]:
df.columns

Index(['Name', 'Weight', 'Cute', 'Habitat'], dtype='object')

or see the first 3 entries like this:

In [5]:
df.head(3)

           Name  Weight  Cute     Habitat
0          Orca   22000  True  Everywhere
1    Blue Whale  441000  True  Everywhere
2  Beluga Whale    2530  True      Arctic


## Common pandas operations
### Sorting
Now we'll practice some common operations in pandas. One thing we might want to do is sort a dataframe by a particular column:

In [6]:
df.sort_values(by="Weight", ascending=False)

           Name  Weight  Cute     Habitat
1    Blue Whale  441000  True  Everywhere
0          Orca   22000  True  Everywhere
3       Narwhal    3500  True      Arctic
2  Beluga Whale    2530  True      Arctic


What happens if we change the `ascending` parameter to True?
Can you sort the dataframe alphabetically by species name? 


In [13]:
# Your turn!!!

### Filtering
Another common operation in pandas is filtering: selecting only the rows that match a specific condition. Let's filter for all the cetaceans whose weight is over 5000 pounds:

In [7]:
df[df.Weight > 5000]

         Name  Weight  Cute     Habitat
0        Orca   22000  True  Everywhere
1  Blue Whale  441000  True  Everywhere


Here, we're referring to the `Weight` column as an _attribute_ of our dataframe `df`. However, we can also refer to the column like we would a key in a dictionary, with the same result:

In [8]:
df[df['Weight'] > 5000]

         Name  Weight  Cute     Habitat
0        Orca   22000  True  Everywhere
1  Blue Whale  441000  True  Everywhere
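
Both forms work for the same reason: the expression inside the brackets is itself a `Series` of `True`/`False` values, one per row. The sketch below pulls that step apart on a cut-down copy of the whale data. The variable names `mask` and `heavy` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Orca", "Blue Whale", "Beluga Whale", "Narwhal"],
    "Weight": [22000, 441000, 2530, 3500],
})

# The comparison produces one True/False value per row.
mask = df["Weight"] > 5000
print(mask.tolist())  # [True, True, False, False]

# Indexing with the mask keeps only the rows where it is True.
heavy = df[mask]
print(heavy["Name"].tolist())  # ['Orca', 'Blue Whale']
```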


Can you filter the dataframe to get all the cetaceans that live in the Arctic?


In [10]:
# Your turn!!

### Adding a column
We can also add a column to the dataframe as long as it's the same length as the existing dataframe:

In [11]:
df["Food"] = ["Everything", "Krill", "Squid, Clams, Octopus, Cod, Herring", "Fish"]
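
To see why the length matters, here is a small sketch on a cut-down dataframe. The three-item list and the `Mammal` column are made up for illustration. A list of the wrong length raises a `ValueError`, while a single value is simply repeated for every row.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Orca", "Blue Whale", "Beluga Whale", "Narwhal"]})

try:
    # Three values for a four-row dataframe: pandas refuses the assignment.
    df["Food"] = ["Everything", "Krill", "Fish"]
except ValueError as err:
    print("Could not add column:", err)

# A single scalar value, by contrast, is repeated for every row.
df["Mammal"] = True
print(df)
```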


And we can create a column from an existing column using the `apply` function:


In [12]:
def convert_from_lbs_to_kg(wt):
    # Approximate conversion: 1 lb is about 0.45 kg
    return wt * 0.45
df["Weight_kg"] = df["Weight"].apply(convert_from_lbs_to_kg)

When we call `apply`, we can pass a named function, as we did above, or an __anonymous function__, also called a "lambda function":

In [13]:
df["Weight_tons"] = df["Weight"].apply(lambda b: b / 2000)
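
For simple arithmetic like this, `apply` is not the only option: arithmetic operators work on a whole column at once. The sketch below, on a one-column copy of the data, checks that the two approaches agree. The names `via_apply` and `via_arithmetic` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Weight": [22000, 441000, 2530, 3500]})

# Element by element, using a lambda function.
via_apply = df["Weight"].apply(lambda b: b / 2000)

# The same calculation written as arithmetic on the whole column.
via_arithmetic = df["Weight"] / 2000

print(via_arithmetic.tolist())
print(via_apply.tolist() == via_arithmetic.tolist())
```

`apply` is still the tool to reach for when the per-value logic can't be written as a simple expression on the column.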

### Grouping

We also might want to group our dataframe into chunks that have something in common.

In [14]:
groups = df.groupby('Habitat')
for habitat, df_gr in groups:
    print(habitat)
    print(df_gr.head())

Arctic
           Name  Weight  Cute Habitat                                 Food  \
2  Beluga Whale    2530  True  Arctic  Squid, Clams, Octopus, Cod, Herring   
3       Narwhal    3500  True  Arctic                                 Fish   

   Weight_kg  Weight_tons  
2     1138.5        1.265  
3     1575.0        1.750  
Everywhere
         Name  Weight  Cute     Habitat        Food  Weight_kg  Weight_tons
0        Orca   22000  True  Everywhere  Everything     9900.0         11.0
1  Blue Whale  441000  True  Everywhere       Krill   198450.0        220.5


Here, we're grouping our dataframe by the `Habitat` column. `groups` is an iterable (we can loop over the items in it), and each item is a pair of values. The first is one of the unique values in the column we grouped by (`Arctic` or `Everywhere`), and the second is the chunk of the dataframe whose rows have that value in that column.
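
Once the rows are grouped, we usually want one summary number per group rather than the chunks themselves. As a minimal sketch on a cut-down copy of the data (`mean_weight` is an illustrative name), selecting a column from the grouped object and calling `mean()` gives one value per habitat:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Orca", "Blue Whale", "Beluga Whale", "Narwhal"],
    "Weight": [22000, 441000, 2530, 3500],
    "Habitat": ["Everywhere", "Everywhere", "Arctic", "Arctic"],
})

groups = df.groupby("Habitat")

# Each iteration yields a (key, chunk-of-dataframe) pair.
for habitat, df_gr in groups:
    print(habitat, len(df_gr))

# Aggregating a column collapses each group to a single value.
mean_weight = groups["Weight"].mean()
print(mean_weight)
```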

Can you group the cetaceans that weigh more than 5000 pounds apart from the ones that weigh less than 5000 pounds? (Hint: You might have to create a new column!)

In [15]:
# Your Turn!!!

## Practicing with pandas
We can also load a pandas dataframe from a `.csv` (comma-separated values) file. We're going to load `WhaleFromSpaceDB_Whales.csv`, which we sourced from the [UK Polar Data Centre](https://ramadda.data.bas.ac.uk/repository/entry/show?entryid=c1afe32c-493c-4dc7-af9f-649593b97b2c).

In [16]:
df_whales = pd.read_csv('data/WhaleFromSpaceDB_Whales.csv')
print(df_whales.head(5))
print(df_whales.columns)

              MstLklSp SpAbbr PtlOtrSp  BoL  BoW  BoS  BoC  FlukeP  Blow  \
0  Eubalaena australis    SRW     None    2    1    1    2       0     0   
1  Eubalaena australis    SRW     None    2    1    1    2       0     0   
2  Eubalaena australis    SRW     None    1    1    1    2       0     0   
3  Eubalaena australis    SRW     None    1    1    1    2       0     0   
4  Eubalaena australis    SRW     None    1    1    1    2       0     0   

   Contour  ...        Long   ImageID  \
0        0  ...  166.281952  1.01E+15   
1        0  ...  166.278841  1.01E+15   
2        0  ...  166.274348  1.01E+15   
3        0  ...  166.291132  1.01E+15   
4        0  ...  166.268754  1.01E+15   

                                      ImageFile  ImageDate  Satellite  \
0  06AUG12231250-P2AS-052609152010_01_P001+M2AS   20060812        QB2   
1  06AUG12231250-P2AS-052609152010_01_P001+M2AS   20060812        QB2   
2  06AUG12231250-P2AS-052609152010_01_P001+M2AS   20060812        QB2   
3  0
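
If you don't have the data file handy, you can still try `read_csv` on a few lines of text. This is only a sketch: the CSV content below is made up, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file.
csv_text = """Name,Weight,Habitat
Orca,22000,Everywhere
Narwhal,3500,Arctic
"""

# read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # column types are inferred from the text
```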

Let's explore the dataset a little bit! The `value_counts` method tells us how many times each unique value appears in a column.

In [17]:
df_whales['MstLklSp'].value_counts()

Eubalaena australis       463
Eschrichtius robustus      80
Megaptera novaeangliae     56
Balaenoptera physalus      34
Name: MstLklSp, dtype: int64
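
The counts come back sorted from most to least frequent. If proportions are more useful than raw counts, `value_counts` also takes a `normalize=True` argument. A small sketch on made-up data (the `habitats` series is purely illustrative):

```python
import pandas as pd

habitats = pd.Series(["Everywhere", "Everywhere", "Arctic", "Arctic", "Arctic"])

# Absolute counts, most frequent value first.
counts = habitats.value_counts()

# The same tally expressed as fractions of the total.
shares = habitats.value_counts(normalize=True)

print(counts)
print(shares)
```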

Now let's check for flukes! Flukes are valuable to marine biologists because they help them identify individuals. For each whale species, can you count how many times flukes were and were not seen?

What about the average certainty for each species? (Hint: `df.column.mean()` gives the average value of a column, and `Certainty2` is the column that contains the certainty score!)

In [18]:
# Your Turn!!!