# Introduction to Pandas

- **Pandas** is a library for handling data in table-like structures. It is inspired by the statistical programming language **R**. [Documentation](https://pandas.pydata.org/pandas-docs/stable/)


The pandas library is conventionally imported under the alias **pd**:


In [0]:
import pandas as pd

A pandas table holds data in rows and columns and is called a **DataFrame**. Normally we would load data from a file, but it is also possible to create a DataFrame directly, for example from a dictionary that maps column names to lists of values:

In [0]:
countries_dictionary = { 
     'name' : ["Denmark", "Norway", "Sweden", "Iceland", "Finland"],
     'population' : [5.8, 5.4, 10.3, 0.3, 5.5],
     'capital': ["Copenhagen", "Oslo", "Stockholm", "Reykjavik", "Helsinki"]
}

df = pd.DataFrame(countries_dictionary)
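
The text notes that data is normally loaded from a file. As a minimal sketch, assuming the data is stored as CSV, `pd.read_csv` can build the same table. The names `csv_text` and `df_from_csv` are illustrative, and `io.StringIO` stands in for a real file so that the example runs without one:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """name,population,capital
Denmark,5.8,Copenhagen
Norway,5.4,Oslo
Sweden,10.3,Stockholm
Iceland,0.3,Reykjavik
Finland,5.5,Helsinki
"""

# pd.read_csv accepts a file path or any file-like object
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv.shape)  # (5, 3)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to that file.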

Colaboratory knows how to display a DataFrame nicely:

In [4]:
df

|   | name | population | capital |
|---|---|---|---|
| 0 | Denmark | 5.8 | Copenhagen |
| 1 | Norway | 5.4 | Oslo |
| 2 | Sweden | 10.3 | Stockholm |
| 3 | Iceland | 0.3 | Reykjavik |
| 4 | Finland | 5.5 | Helsinki |


We can see the list of columns:

In [5]:
list(df.columns)

['name', 'population', 'capital']

If we want a DataFrame with only some of the columns, we select them by passing a list of column names:

In [6]:
df[["name", "capital"]]

|   | name | capital |
|---|---|---|
| 0 | Denmark | Copenhagen |
| 1 | Norway | Oslo |
| 2 | Sweden | Stockholm |
| 3 | Iceland | Reykjavik |
| 4 | Finland | Helsinki |


It is possible to do operations on a column as if it were a scalar variable. The operation is applied to every element of the column:

In [7]:
df["population"] / 2

0    2.90
1    2.70
2    5.15
3    0.15
4    2.75
Name: population, dtype: float64
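
The same element-wise behaviour applies when both operands are columns. As a brief sketch (the names `doubled` and `labels` are illustrative), two columns of the same length are combined row by row:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Denmark", "Norway", "Sweden", "Iceland", "Finland"],
    "population": [5.8, 5.4, 10.3, 0.3, 5.5],
    "capital": ["Copenhagen", "Oslo", "Stockholm", "Reykjavik", "Helsinki"],
})

# Column combined with a scalar: every element is multiplied by 2
doubled = df["population"] * 2

# Column combined with a column: strings are joined row by row
labels = df["name"] + " (" + df["capital"] + ")"

print(labels[0])  # Denmark (Copenhagen)
```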

We can add new columns just as easily:

In [8]:
df["half_population"] = df["population"] / 2
df

|   | name | population | capital | half_population |
|---|---|---|---|---|
| 0 | Denmark | 5.8 | Copenhagen | 2.9 |
| 1 | Norway | 5.4 | Oslo | 2.7 |
| 2 | Sweden | 10.3 | Stockholm | 5.15 |
| 3 | Iceland | 0.3 | Reykjavik | 0.15 |
| 4 | Finland | 5.5 | Helsinki | 2.75 |


Pandas supports aggregate functions on columns. For example, we can find the mean population:

In [9]:
df["population"].mean()

5.46
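
Other aggregates follow the same pattern. A short sketch, using the Series methods `sum`, `min` and `max` (the variable names are illustrative):

```python
import pandas as pd

population = pd.Series([5.8, 5.4, 10.3, 0.3, 5.5], name="population")

total = population.sum()     # sum of all values
smallest = population.min()  # smallest value
largest = population.max()   # largest value

print(smallest, largest)  # 0.3 10.3
```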

Suppose we want to find the Nordic countries with a population higher than the mean.

We can start by comparing the populations to the mean:

In [11]:
df["population"] > df["population"].mean()

0     True
1    False
2     True
3    False
4     True
Name: population, dtype: bool

The resulting boolean Series can be used to filter the original DataFrame, so that we only see the countries whose population is higher than the mean:

In [12]:
df[df["population"] > df["population"].mean()]

|   | name | population | capital | half_population |
|---|---|---|---|---|
| 0 | Denmark | 5.8 | Copenhagen | 2.9 |
| 2 | Sweden | 10.3 | Stockholm | 5.15 |
| 4 | Finland | 5.5 | Helsinki | 2.75 |
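
Several conditions can be combined before filtering. The sketch below is one possible extension of the example: it uses the element-wise `&` operator (with `|` for "or" and `~` for "not" working the same way) to keep the countries that are above the mean but are not the largest. The names `above_mean`, `not_largest` and `result` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Denmark", "Norway", "Sweden", "Iceland", "Finland"],
    "population": [5.8, 5.4, 10.3, 0.3, 5.5],
    "capital": ["Copenhagen", "Oslo", "Stockholm", "Reykjavik", "Helsinki"],
})

# Each comparison yields a boolean Series with one entry per row
above_mean = df["population"] > df["population"].mean()
not_largest = df["population"] < df["population"].max()

# & combines the two Series element by element
result = df[above_mean & not_largest]
print(result["name"].tolist())  # ['Denmark', 'Finland']
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&` binds more tightly than `>` and `<` in Python.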
