# Pandas basics

In this notebook we will **learn** how to work with the two main data types in `pandas`: `DataFrame` and `Series`.

## Data structures (`pandas`)

### `Series`

In `pandas`, series are the building blocks of dataframes.

Think of a series as a column in a table. A series collects *observations* about a given *variable*. 

In [1]:
from random import random
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

#### Numerical series

In [5]:
# let's create a series containing 10 random numbers
# ranging between 0 and 1

s = pd.Series([random() for _ in range(10)])

A series has an **index** and a set of **values**, which can be accessed via the properties of the same name:

In [6]:
s.index

RangeIndex(start=0, stop=10, step=1)

In [7]:
list(s.index)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [8]:
s.values

array([0.59435936, 0.43685685, 0.51086363, 0.76952988, 0.52689493,
       0.68525707, 0.92024852, 0.79505981, 0.65153344, 0.55387761])
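
The index does not have to be a range of integers. As a minimal sketch (the city names and temperatures below are made up for illustration), a series can be given explicit labels, which can then be used to look up values:

```python
import pandas as pd

# A hypothetical series of temperatures, labelled by city name
temps = pd.Series([21.5, 18.0, 25.3], index=["Rome", "Paris", "Madrid"])

print(list(temps.index))  # the labels
print(temps.values)       # the underlying NumPy array
print(temps["Paris"])     # look up a single value by its label
```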

The `head()` and `tail()` methods let you look at the beginning and end of a series:

In [9]:
s.head()

0    0.594359
1    0.436857
2    0.510864
3    0.769530
4    0.526895
dtype: float64

In [10]:
s.tail()

5    0.685257
6    0.920249
7    0.795060
8    0.651533
9    0.553878
dtype: float64

The `value_counts()` method returns a count of distinct values within a series.

Is there any number in `s` that occurs twice?

In [11]:
# a `Series` can be easily cast into a list;
# here we count how many distinct values occur exactly twice

list(s.value_counts()).count(2)

0

Another way of verifying this:

In [12]:
s.is_unique

True
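
With random floats, duplicates are very unlikely, so the result above is not very telling. A small sketch on a made-up series that does contain repeated values (the dice rolls are invented for illustration) may make the behaviour of `value_counts()` and `is_unique` easier to see:

```python
import pandas as pd

# Made-up dice rolls with some repeated values
rolls = pd.Series([3, 6, 3, 1, 6, 3])

# value_counts() is indexed by the distinct values
# and holds the number of times each one occurs
counts = rolls.value_counts()
print(counts)

# keep only the values that occur more than once
repeated = counts[counts > 1].index.tolist()
print(sorted(repeated))

print(rolls.is_unique)
```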

In [13]:
s.min()

0.43685685182594836

In [14]:
s.max()

0.9202485242564136

In [15]:
s.mean()

0.6444481128304373

In [16]:
s.median()

0.622946404372272
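
These summary statistics can also be obtained in one go with `describe()`. A minimal sketch on a small, hand-written series (the values are arbitrary):

```python
import pandas as pd

s_demo = pd.Series([0.2, 0.4, 0.6, 0.8])

# describe() returns a Series holding count, mean, std,
# min, the quartiles and max
summary = s_demo.describe()
print(summary)

# individual statistics can be looked up by label
print(summary["mean"], summary["50%"])
```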

#### Datetime series

In [21]:
from random import randint
from datetime import date

In [22]:
# let's generate a list of random dates,
# one per year from 1900 to 1949

dates = [
    date(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1900,1950)
]

In [23]:
s1 = pd.Series(dates)

In [24]:
s1

0     1900-02-16
1     1901-12-20
2     1902-06-11
3     1903-06-12
4     1904-02-15
5     1905-01-14
6     1906-12-06
7     1907-01-28
8     1908-09-01
9     1909-08-17
10    1910-03-01
11    1911-08-26
12    1912-04-06
13    1913-08-27
14    1914-03-05
15    1915-12-23
16    1916-05-01
17    1917-11-05
18    1918-01-20
19    1919-08-03
20    1920-07-08
21    1921-01-15
22    1922-01-04
23    1923-01-17
24    1924-02-27
25    1925-03-21
26    1926-06-08
27    1927-04-17
28    1928-10-26
29    1929-07-18
30    1930-09-02
31    1931-05-26
32    1932-12-19
33    1933-11-03
34    1934-04-12
35    1935-11-02
36    1936-12-16
37    1937-08-25
38    1938-02-25
39    1939-01-02
40    1940-02-03
41    1941-04-19
42    1942-07-25
43    1943-04-28
44    1944-10-28
45    1945-04-23
46    1946-06-24
47    1947-11-01
48    1948-08-16
49    1949-01-19
dtype: object

In [25]:
type(s1[1])

datetime.date

In [26]:
s1 = Series(pd.to_datetime(dates))

In [27]:
type(s1[1])

pandas._libs.tslibs.timestamps.Timestamp

In [28]:
s1[1].day_name()

'Friday'

In [29]:
s1.min()

Timestamp('1900-02-16 00:00:00')

In [30]:
s1.max()

Timestamp('1949-01-19 00:00:00')

In [31]:
s1.mean()

Timestamp('1924-12-11 21:36:00')
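
Once a series holds timestamps, the `.dt` accessor applies datetime operations to every element at once. A short sketch, using three dates picked by hand from the output above:

```python
from datetime import date

import pandas as pd

some_dates = [date(1901, 12, 20), date(1924, 2, 27), date(1949, 1, 19)]
ts = pd.Series(pd.to_datetime(some_dates))

# .dt exposes datetime components for the whole series
print(ts.dt.year.tolist())
print(ts.dt.month.tolist())
print(ts.dt.day_name().tolist())
```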

### `DataFrame`


What is a `pandas.DataFrame`? Think of it as an in-memory spreadsheet that you can analyse and manipulate programmatically.

A `DataFrame` is a collection of `Series` that have the same length and share the same index. Being a *collection* means that each column of a dataframe is a series.

Let's create a toy `DataFrame` by hand. 

In [32]:
dates = [
    date(
        year,
        randint(1, 12),
        randint(1, 28) # try replacing with 31 and see what happens
    )
    for year in range(1980,1990)
]

In [33]:
counts = [
    randint(0, 10000)
    for i in range(0, 10)
]

In [34]:
event_types = ["fire", "flood", "car_crash", "plane_crash"]
events = [
    np.random.choice(event_types)
    for i in range(0, 10)
]

In [35]:
assert len(events) == len(counts) == len(dates)

In [36]:
toy_df = pd.DataFrame({
    "date": dates,
    "count": counts,
    "event": events
})

In [37]:
toy_df

         date  count        event
0  1980-01-23   8801  plane_crash
1  1981-04-15   8468  plane_crash
2  1982-11-12   9443        flood
3  1983-11-22   4587  plane_crash
4  1984-11-04    701         fire
5  1985-06-19   3336        flood
6  1986-09-14   6862  plane_crash
7  1987-01-05   5948         fire
8  1988-03-11   7606        flood
9  1989-10-10   9845    car_crash


**Try out**: what happens if you change the length of one of the three lists? Try e.g. passing 20 dates instead of 10.

In [39]:
# a df is a collection of series
# each column is a series

type(toy_df.date)

pandas.core.series.Series
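
The same idea works in the other direction: a dataframe can be assembled from existing series. The sketch below uses made-up names and values, and assumes two series whose indexes only partly overlap, to show how `pandas` lines the rows up by label rather than by position:

```python
import pandas as pd

# Two hypothetical series with partly overlapping labels
heights = pd.Series([1.70, 1.82], index=["ann", "bob"])
ages = pd.Series([31, 45, 27], index=["bob", "ann", "cid"])

# Rows are aligned on the index labels;
# labels missing from one series yield NaN in that column
people = pd.DataFrame({"height": heights, "age": ages})
print(people)

print(type(people["age"]))
```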

## Data manipulation in `pandas`

### Data types

Strings, datetimes (see above), categorical data.

In `pandas`, categories behave very much like strings, yet they lead to better performance (faster operations, optimized storage).

Bottom-up approach:

In [41]:
# transforms a Series with strings into categories

toy_df.event.astype('category')

0    plane_crash
1    plane_crash
2          flood
3    plane_crash
4           fire
5          flood
6    plane_crash
7           fire
8          flood
9      car_crash
Name: event, dtype: category
Categories (4, object): [car_crash, fire, flood, plane_crash]
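
To get a feel for the storage claim, one can compare the memory footprint of the same labels stored as strings and as categories. This is only a rough sketch on a made-up series; the exact numbers depend on the `pandas` version and on the data:

```python
import pandas as pd

# A made-up series with few distinct labels repeated many times
labels = pd.Series(["fire", "flood", "fire", "car_crash"] * 1000)
as_category = labels.astype("category")

# deep=True also counts the memory used by the strings themselves
print(labels.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# the distinct labels are stored only once
print(as_category.cat.categories.tolist())
```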

### Exploring a dataframe

Exploring a dataframe: `df.head()`, `df.tail()`, `df.info()`.

The method `info()` gives you information about a dataframe:
- how much space does it take in memory?
- what is the datatype of each column?
- how many records are there?
- how many `null` values does each column contain (!)?

In [42]:
toy_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   date    10 non-null     object
 1   count   10 non-null     int64 
 2   event   10 non-null     object
dtypes: int64(1), object(2)
memory usage: 368.0+ bytes


Alternatively, if you need to know only the number of columns and rows you can use the `.shape` property.

It returns a tuple with 1) number of rows, 2) number of columns.

In [43]:
toy_df.shape

(10, 3)

`head()` prints by default the first five rows of a dataframe:

In [44]:
toy_df.head()

         date  count        event
0  1980-01-23   8801  plane_crash
1  1981-04-15   8468  plane_crash
2  1982-11-12   9443        flood
3  1983-11-22   4587  plane_crash
4  1984-11-04    701         fire


But the number of lines displayed is a parameter that can be changed:

In [45]:
toy_df.head(2)

         date  count        event
0  1980-01-23   8801  plane_crash
1  1981-04-15   8468  plane_crash


`tail()` does the opposite, i.e. prints the last n rows in the dataframe (five by default):

In [46]:
toy_df.tail()

         date  count        event
5  1985-06-19   3336        flood
6  1986-09-14   6862  plane_crash
7  1987-01-05   5948         fire
8  1988-03-11   7606        flood
9  1989-10-10   9845    car_crash


#### Adding columns

Let's go back to our toy dataframe:

In [47]:
toy_df.head()

         date  count        event
0  1980-01-23   8801  plane_crash
1  1981-04-15   8468  plane_crash
2  1982-11-12   9443        flood
3  1983-11-22   4587  plane_crash
4  1984-11-04    701         fire


Using the column selector with the name of a column that does not exist yet adds that column, setting all of its rows to the value specified.

In [48]:
toy_df['country'] = "UK"

In [49]:
toy_df.head(3)

         date  count        event  country
0  1980-01-23   8801  plane_crash       UK
1  1981-04-15   8468  plane_crash       UK
2  1982-11-12   9443        flood       UK


But if the column already exists, its values are overwritten:

In [50]:
toy_df['country'] = "USA"

In [51]:
toy_df.head(3)

         date  count        event  country
0  1980-01-23   8801  plane_crash      USA
1  1981-04-15   8468  plane_crash      USA
2  1982-11-12   9443        flood      USA
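
The assigned value does not have to be a constant: a new column can also be computed from existing ones. A minimal sketch on a small stand-in dataframe (the column names `count_thousands` and `is_fire` are chosen here for illustration only):

```python
import pandas as pd

df = pd.DataFrame({
    "event": ["fire", "flood", "fire"],
    "count": [701, 3336, 5948],
})

# arithmetic on a column is applied row by row
df["count_thousands"] = df["count"] / 1000

# a comparison yields a column of booleans
df["is_fire"] = df["event"] == "fire"

print(df)
```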


#### Removing columns

The double square bracket notation ``[[...]]`` returns a dataframe having only the columns specified inside the inner brackets.

That said, one way to remove a column is to select all the others:

In [52]:
# here we remove the column `country`

toy_df2 = toy_df[['date', 'count', 'event']]

In [53]:
# it worked!

toy_df2.head()

         date  count        event
0  1980-01-23   8801  plane_crash
1  1981-04-15   8468  plane_crash
2  1982-11-12   9443        flood
3  1983-11-22   4587  plane_crash
4  1984-11-04    701         fire
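
Selecting the columns to keep gets tedious when a dataframe has many of them. An alternative, sketched here on a small stand-in dataframe, is to name the columns to discard with `drop()`:

```python
import pandas as pd

df = pd.DataFrame({
    "count": [8801, 8468],
    "event": ["plane_crash", "plane_crash"],
    "country": ["USA", "USA"],
})

# drop() returns a new dataframe; the original is left untouched
without_country = df.drop(columns=["country"])

print(without_country.columns.tolist())
print(df.columns.tolist())
```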


#### Setting a column as index

In [54]:
toy_df.set_index('date')

            count        event  country
date
1980-01-23   8801  plane_crash      USA
1981-04-15   8468  plane_crash      USA
1982-11-12   9443        flood      USA
1983-11-22   4587  plane_crash      USA
1984-11-04    701         fire      USA
1985-06-19   3336        flood      USA
1986-09-14   6862  plane_crash      USA
1987-01-05   5948         fire      USA
1988-03-11   7606        flood      USA
1989-10-10   9845    car_crash      USA


In [55]:
toy_df.head(3)

         date  count        event  country
0  1980-01-23   8801  plane_crash      USA
1  1981-04-15   8468  plane_crash      USA
2  1982-11-12   9443        flood      USA


In [56]:
toy_df.set_index('date', inplace=True)

In [57]:
toy_df.head(3)

            count        event  country
date
1980-01-23   8801  plane_crash      USA
1981-04-15   8468  plane_crash      USA
1982-11-12   9443        flood      USA


**Q**: can you explain the effect of the `inplace` parameter by looking at the cells above?
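
A related operation is `reset_index()`, which moves the index back into a regular column. The sketch below uses a separate, made-up dataframe (with a `year` column instead of `date`) so that `toy_df` is left as it is:

```python
import pandas as pd

df = pd.DataFrame({"year": [1980, 1981], "count": [8801, 8468]})

# promote the `year` column to index, keeping the result in a new variable
indexed = df.set_index("year")
print(indexed.index.name)

# reset_index() turns the index back into an ordinary column
restored = indexed.reset_index()
print(restored.columns.tolist())
```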

### Accessing data

`.loc`, `.iloc`, slicing, iteration over rows

In [58]:
toy_df.head(3)

            count        event  country
date
1980-01-23   8801  plane_crash      USA
1981-04-15   8468  plane_crash      USA
1982-11-12   9443        flood      USA


#### Label-based indexing

In [68]:
toy_df.loc[date(1980,1,1):date(1982,1,1)]

            count        event  country
date
1980-01-23   8801  plane_crash      USA
1981-04-15   8468  plane_crash      USA
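
`.loc` also accepts a column label after the row selector. The sketch below rebuilds three rows of the toy data by hand and, unlike the notebook (whose index holds plain `datetime.date` objects), assumes the index has been converted with `pd.to_datetime`, which additionally allows slicing with date strings:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "count": [8801, 8468, 9443],
        "event": ["plane_crash", "plane_crash", "flood"],
    },
    index=pd.to_datetime(["1980-01-23", "1981-04-15", "1982-11-12"]),
)

# rows by label slice (both ends are included), one column by name
subset = df.loc["1980-01-01":"1981-12-31", "count"]
print(subset)

# a single cell: row label first, then column label
cell = df.loc[pd.Timestamp("1982-11-12"), "event"]
print(cell)
```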


#### Integer-based indexing

In [69]:
# select a single row, the first one

toy_df.iloc[0]

count             8801
event      plane_crash
country            USA
Name: 1980-01-23, dtype: object

In [70]:
# select several rows by position

toy_df.iloc[[1,3,-1]]

            count        event  country
date
1981-04-15   8468  plane_crash      USA
1983-11-22   4587  plane_crash      USA
1989-10-10   9845    car_crash      USA


In [71]:
# select a range of rows with slicing

toy_df.iloc[0:5]

            count        event  country
date
1980-01-23   8801  plane_crash      USA
1981-04-15   8468  plane_crash      USA
1982-11-12   9443        flood      USA
1983-11-22   4587  plane_crash      USA
1984-11-04    701         fire      USA


In [72]:
toy_df.index

Index([1980-01-23, 1981-04-15, 1982-11-12, 1983-11-22, 1984-11-04, 1985-06-19,
       1986-09-14, 1987-01-05, 1988-03-11, 1989-10-10],
      dtype='object', name='date')
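
Like `.loc`, `.iloc` takes an optional second selector for the columns, again expressed as integer positions. A short sketch on a stand-in dataframe with four rows copied from the toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "count": [8801, 8468, 9443, 4587],
    "event": ["plane_crash", "plane_crash", "flood", "plane_crash"],
})

# a single cell: first row, first column
first_cell = df.iloc[0, 0]

# negative positions count from the end, as with Python lists
last_two = df.iloc[-2:]

# every other row of the first column
every_other_count = df.iloc[::2, 0]

print(first_cell)
print(last_two)
print(every_other_count.tolist())
```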

#### Iterating over rows

In [73]:
for n, row in toy_df.iterrows():
    print(n)

1980-01-23
1981-04-15
1982-11-12
1983-11-22
1984-11-04
1985-06-19
1986-09-14
1987-01-05
1988-03-11
1989-10-10


In [74]:
for n, row in toy_df.iterrows():
    print(n, row.event)

1980-01-23 plane_crash
1981-04-15 plane_crash
1982-11-12 flood
1983-11-22 plane_crash
1984-11-04 fire
1985-06-19 flood
1986-09-14 plane_crash
1987-01-05 fire
1988-03-11 flood
1989-10-10 car_crash


---