# Session 11: Introduction to Pandas and data manipulation

`pandas` is a Python library built on top of `NumPy`. It lets us analyse and manipulate tabular data.

We can use `pandas` by importing it:
```Python
import pandas
```
Or by using its well-known alias `pd`:
```Python
import pandas as pd
```

In [1]:
# importing pandas as its shorter alias: pd
import pandas as pd

Now that we have loaded `pandas` we can start using it. But first, let's look at some of its main classes and functionality.

## Pandas series

`pd.Series` is a pandas object that contains one-dimensional data in an array-like data structure.

Each element in a `pd.Series` has a label assigned to it; together, these labels form the `index`.

A `pd.Series` can be created from lists, dictionaries, `NumPy` arrays, etc., and can contain integers, floats, strings, booleans, datetimes, ...

Let's create a series that contains the elements `[1, 2, 3]` and assign them the labels (`index`) `["elem1", "elem2", "elem3"]`:

In [6]:
# specifying the index

s = pd.Series([1, 2, 3], index=["elem1", "elem2", "elem3"])

s

elem1    1
elem2    2
elem3    3
dtype: int64

In [7]:
# without specifying the index you get an index of 0, 1, 2

s = pd.Series([1, 2, 3])

s

0    1
1    2
2    3
dtype: int64

### Series elements

A `pd.Series` has two main properties: `values` and `index`.

We can extract the values contained in a series, or the index that labels those values.

In [8]:
print(s.values)
print(type(s.values))

[1 2 3]
<class 'numpy.ndarray'>


In [9]:
print(s.index)
print(type(s.index))

RangeIndex(start=0, stop=3, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>


As we can see, the `values` are a `NumPy` array, whereas the `index` is a `pandas.Index` object (here, a `RangeIndex`).

When `index` is not specified, `pandas` will use 0, 1, ..., N-1 as labels, where N is the number of elements.

In [10]:
s[0]

1
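
To make the difference between labels and positions concrete, here is a minimal sketch using a series with string labels. The variable name `s_labelled` and its data are illustrative only. `.loc` selects by label and `.iloc` selects by integer position.

```python
import pandas as pd

# a series with explicit string labels (illustrative data)
s_labelled = pd.Series([10, 20, 30], index=["elem1", "elem2", "elem3"])

# select by label
by_label = s_labelled.loc["elem2"]

# select by integer position (0-based)
by_position = s_labelled.iloc[1]

print(by_label, by_position)
```

Both expressions point at the same element here; they only differ in how the element is addressed.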

### Creating `pd.Series`

We can create a `pd.Series` from the following objects:

* `list`, `tuple`
* `np.array`
* `dict` (the keys become the index)
* `range()`

In [18]:
import numpy as np

# from list
s_list = pd.Series(["1", "2", "3"])

# from tuple
s_tuple = pd.Series(("1", "2", "3"))

# from np.array
s_array = pd.Series(np.array((1, 2, 3)))

# from dict
s_dict = pd.Series({"x": 1, "y": 2, "z": 3, })

# from range
s_range = pd.Series(range(5))

print(f"Series from list: {s_list}")
print(f"Series from tuple: {s_tuple}")
print(f"Series from array: {s_array}")
print(f"Series from dict: {s_dict}")
print(f"Series from range: {s_range}")

Series from list: 0    1
1    2
2    3
dtype: object
Series from tuple: 0    1
1    2
2    3
dtype: object
Series from array: 0    1
1    2
2    3
dtype: int64
Series from dict: x    1
y    2
z    3
dtype: int64
Series from range: 0    0
1    1
2    2
3    3
4    4
dtype: int64
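
Note that the series built from strings are reported with a non-numeric dtype, while the others are integers. If the strings represent numbers, one possible way to convert them is `astype`. This is a small sketch; `s_text` and `s_numbers` are illustrative names.

```python
import pandas as pd

# strings that happen to look like numbers
s_text = pd.Series(["1", "2", "3"])

# convert every element to an integer
s_numbers = s_text.astype(int)

print(s_numbers.dtype)
print(s_numbers.sum())
```

After the conversion, numeric operations such as `sum()` add the values instead of concatenating strings.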


### `pd.Series` basic properties and methods

In [19]:
s = pd.Series([1, 2, 3, 4])

In [20]:
# length: `len(s)`

print(len(s))

4


In [21]:
# shape: `s.shape`
s.shape

(4,)

In [22]:
# type of elements

s.dtype

dtype('int64')

In [23]:
# selecting a certain element according to the index: `get()`
ser = pd.Series({"a": 1, "b": 2, "c": 3})

ser.get("b")

2
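
`get()` is convenient when a label might be missing, because it returns a default value instead of raising an error. A minimal sketch, reusing the same data as above (the label `"z"` is deliberately absent):

```python
import pandas as pd

ser = pd.Series({"a": 1, "b": 2, "c": 3})

# existing label: returns the stored value
found = ser.get("b")

# missing label: returns None by default
missing = ser.get("z")

# missing label with an explicit default
missing_with_default = ser.get("z", 0)

print(found, missing, missing_with_default)
```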

## Pandas DataFrames

If we aggregate several series together we can build a `pd.DataFrame`. These objects are table-like structures in which each row is identified by its `index` label and each column by its column name.

Conceptually, a `pd.DataFrame` is a table of `n` rows and `m` columns, with labels for each row and column. Unlike a plain `NumPy` matrix, each column can hold a different data type.

### How to create a `pd.DataFrame`

We can create a `pd.DataFrame` from, among others, the following objects:

* dict of `pd.Series`: the keys will be the names of the columns
* list of dicts: the keys will be the names of the columns
* dict of lists: the keys will be the names of the columns

In [24]:
# from dict of pd.Series

series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([2, 3, 4, 5])

df = pd.DataFrame(
    {
        "var1": series1,
        "var2": series2
    }
)

df

|   | var1 | var2 |
|---|------|------|
| 0 | 1    | 2    |
| 1 | 2    | 3    |
| 2 | 3    | 4    |
| 3 | 4    | 5    |


In [25]:
df.values

array([[1, 2],
       [2, 3],
       [3, 4],
       [4, 5]])
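
When the dict values are `pd.Series` objects, pandas aligns them on their index labels rather than on their position. The sketch below uses two illustrative series (`left` and `right`) whose labels only partly overlap; labels missing from one of them produce `NaN`.

```python
import pandas as pd

# two series whose labels only partly overlap (illustrative data)
left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# rows are matched by label, not by position
aligned = pd.DataFrame({"left": left, "right": right})

print(aligned)
```

The resulting index is the union of both sets of labels, so the table has four rows even though each series has only three elements.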

In [26]:
# from list of dicts

list_of_dicts = [
    {"Name": "Daniel", "Age": 32, "Furry": False, "Height": 178},
    {"Name": "Churro", "Age": 6, "Furry": True, "Height": 60},
    {"Name": "Plant", "Age": 1, "Furry": False, "Height": 40},
]

pd.DataFrame(list_of_dicts)

|   | Name   | Age | Furry | Height |
|---|--------|-----|-------|--------|
| 0 | Daniel | 32  | False | 178    |
| 1 | Churro | 6   | True  | 60     |
| 2 | Plant  | 1   | False | 40     |


In [27]:
# from dict of lists
dict_lists = {
    "var1": ["Good", "Average", "Bad"],
    "var2": [32, 6, 1],
    "var3": [False, True, None],
    "var4": [178, 60, 40]
}

pd.DataFrame(dict_lists)

|   | var1    | var2 | var3  | var4 |
|---|---------|------|-------|------|
| 0 | Good    | 32   | False | 178  |
| 1 | Average | 6    | True  | 60   |
| 2 | Bad     | 1    | None  | 40   |


### `pd.DataFrame` basic properties and methods

In [28]:
df = pd.DataFrame({
    "col_float": [1.0, 2.3, 5.66],
    "col_int": [1, 2, 3],
    "col_string": ["abc", "abc", "ghi"],
    "col_boolean": [True, True, False]
})

df

|   | col_float | col_int | col_string | col_boolean |
|---|-----------|---------|------------|-------------|
| 0 | 1.0       | 1       | abc        | True        |
| 1 | 2.3       | 2       | abc        | True        |
| 2 | 5.66      | 3       | ghi        | False       |


#### Values:

`df.values` holds the actual data stored and labelled in our DataFrame.

It returns a two-dimensional `np.array` in which each inner array is one row of the table. Because the columns here have different types, the array has `dtype=object`.

In [29]:
# values: np.array
# each inner array of the 2D np.array holds one row of the table

df.values

array([[1.0, 1, 'abc', True],
       [2.3, 2, 'abc', True],
       [5.66, 3, 'ghi', False]], dtype=object)

#### Index:

We can see the index used in our DataFrame through the `index` property. It returns a `pandas.Index` object (a `RangeIndex` by default), which we can convert to a list with `list()`, for example.

We can also replace this index with one of the columns using `df.set_index()`.

By default the column is removed from the values once it becomes the index. If we pass the argument `drop=False`, it is kept both as the index AND as a regular column.

In [30]:
list(df.index)

[0, 1, 2]

In [31]:
# we can use an existing column as new index
df_new_index = df.set_index("col_string")

print(df_new_index.index)

df_new_index

Index(['abc', 'abc', 'ghi'], dtype='object', name='col_string')


| col_string | col_float | col_int | col_boolean |
|------------|-----------|---------|-------------|
| abc        | 1.0       | 1       | True        |
| abc        | 2.3       | 2       | True        |
| ghi        | 5.66      | 3       | False       |


We can reset our index to the original (0, 1, ...) with `df.reset_index()`

In [32]:
df_new_index.reset_index()

|   | col_string | col_float | col_int | col_boolean |
|---|------------|-----------|---------|-------------|
| 0 | abc        | 1.0       | 1       | True        |
| 1 | abc        | 2.3       | 2       | True        |
| 2 | ghi        | 5.66      | 3       | False       |
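
The effect of the `drop` argument of `set_index()` can be sketched as follows. `df_demo` is a small illustrative DataFrame, not the one used above.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "col_int": [1, 2, 3],
    "col_string": ["abc", "def", "ghi"],
})

# default (drop=True): the column moves to the index
dropped = df_demo.set_index("col_string")

# drop=False: the column is used as index and also kept as a column
kept = df_demo.set_index("col_string", drop=False)

print(list(dropped.columns))
print(list(kept.columns))
```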


#### Columns and their names:

`df.columns` returns the labels attached to the columns, i.e. their names.

We can change them by assigning new values to `df.columns`, or by using `df.rename()` with a `columns` dict whose keys are the old column names and whose values are the new names.

In [33]:
# get the columns names
df.columns

Index(['col_float', 'col_int', 'col_string', 'col_boolean'], dtype='object')

In [34]:
# updating df.columns by mutating the df.columns info directly
df.columns = ["type_" + colname.split("_")[1] for colname in df.columns]

print(df.columns)

Index(['type_float', 'type_int', 'type_string', 'type_boolean'], dtype='object')


In [35]:
# using df.rename(columns={old_col: new_col})
# rename() returns a new DataFrame: either pass `inplace=True` to modify `df` itself,
# or assign the result to a variable

df.rename(columns={
    "type_float": "type_num_float",
    "type_int": "type_num_int"
}, inplace=True)

In [36]:
df

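|   | type_num_float | type_num_int | type_string | type_boolean |
|---|----------------|--------------|-------------|--------------|
| 0 | 1.0            | 1            | abc         | True         |
| 1 | 2.3            | 2            | abc         | True         |
| 2 | 5.66           | 3            | ghi         | False        |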


#### Describe 

`df.describe()` returns a summary with statistics for the numeric columns, which is very useful for Exploratory Data Analysis.

In [37]:
df.describe()

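|       | type_num_float | type_num_int |
|-------|----------------|--------------|
| count | 3.0            | 3.0          |
| mean  | 2.986667       | 2.0          |
| std   | 2.40469        | 1.0          |
| min   | 1.0            | 1.0          |
| 25%   | 1.65           | 1.5          |
| 50%   | 2.3            | 2.0          |
| 75%   | 3.98           | 2.5          |
| max   | 5.66           | 3.0          |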
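
By default only numeric columns are summarised. As a small sketch, passing `include="all"` asks `describe()` to summarise the other columns too, adding rows such as `unique`, `top` and `freq`. `df_summary` is an illustrative DataFrame.

```python
import pandas as pd

df_summary = pd.DataFrame({
    "col_float": [1.0, 2.3, 5.66],
    "col_string": ["abc", "abc", "ghi"],
})

# default: numeric columns only
numeric_only = df_summary.describe()

# include="all": non-numeric columns are summarised as well
everything = df_summary.describe(include="all")

print(list(numeric_only.columns))
print(list(everything.columns))
```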


#### Transpose

Since DataFrames are two-dimensional, we can transpose them (swap rows and columns) with `df.T`.

In [38]:
df.T

|                | 0    | 1    | 2     |
|----------------|------|------|-------|
| type_num_float | 1.0  | 2.3  | 5.66  |
| type_num_int   | 1    | 2    | 3     |
| type_string    | abc  | abc  | ghi   |
| type_boolean   | True | True | False |
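
One side effect worth knowing: after transposing a DataFrame with mixed column types, each new column mixes values of different types, so pandas falls back to the generic `object` dtype. A minimal sketch with an illustrative `df_mixed`:

```python
import pandas as pd

df_mixed = pd.DataFrame({
    "col_float": [1.0, 2.3],
    "col_string": ["abc", "def"],
    "col_boolean": [True, False],
})

transposed = df_mixed.T

# each original column has its own dtype
print(df_mixed.dtypes)

# every transposed column holds mixed values
print(transposed.dtypes)
```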


## Indexing and slicing

We can create subsets of our `pandas` objects in different ways:

* `df.loc[label_row_start:label_row_end, label_col_start:label_col_end]` to slice by label
* `df.iloc[pos_row_start:pos_row_end, pos_col_start:pos_col_end]` to slice by position
* Good old []

In [39]:
df = pd.DataFrame({
    "col_float": [1.0, 2.3, 5.66, 9.99],
    "col_int": [1, 2, 3, 4],
    "col_string": ["abc", "def", "ghi", "jkl"],
    "col_boolean": [True, True, False, True]
})

# set `col_string` as index
df = df.set_index("col_string")

df

| col_string | col_float | col_int | col_boolean |
|------------|-----------|---------|-------------|
| abc        | 1.0       | 1       | True        |
| def        | 2.3       | 2       | True        |
| ghi        | 5.66      | 3       | False       |
| jkl        | 9.99      | 4       | True        |


Now we have string labels for rows (`"abc", "def", "ghi", "jkl"`) and columns (`"col_float", "col_int", "col_boolean"`).

### Slicing based on single values as arguments

* `loc[row_label, column_label]`
* `iloc[row_position, column_position]`

In [40]:
# value at second row and second column: using loc
df.loc["def", "col_int"]

2

In [41]:
# value at second row and second column: using iloc
df.iloc[1, 1]

2

We can also get an entire row or an entire column by using `:`

In [42]:
# getting the 3rd row using loc
df.loc["ghi", :]

col_float       5.66
col_int            3
col_boolean    False
Name: ghi, dtype: object

In [43]:
# getting the 3rd row using iloc
df.iloc[2, :]

col_float       5.66
col_int            3
col_boolean    False
Name: ghi, dtype: object

In [44]:
# to get the whole column we can use `:` in the rows position

# with loc
df.loc[:, "col_float"]

col_string
abc    1.00
def    2.30
ghi    5.66
jkl    9.99
Name: col_float, dtype: float64

In [45]:
# with iloc
df.iloc[:, 0]

col_string
abc    1.00
def    2.30
ghi    5.66
jkl    9.99
Name: col_float, dtype: float64

We can also select columns with just square brackets `[]` and the name of the column (or a list of names):

In [46]:
df["col_int"]

col_string
abc    1
def    2
ghi    3
jkl    4
Name: col_int, dtype: int64

In [47]:
df[["col_float", "col_int"]]

| col_string | col_float | col_int |
|------------|-----------|---------|
| abc        | 1.0       | 1       |
| def        | 2.3       | 2       |
| ghi        | 5.66      | 3       |
| jkl        | 9.99      | 4       |


### Slicing based on several values: lists or ranges

* Using ranges of values:
  * Using `loc`: the final label **WILL BE INCLUDED**
    * `df.loc[ini_row_label:end_row_label, ini_col_label:end_col_label]`
  * Using `iloc`: the final position is excluded, as in regular Python slicing
    * `df.iloc[ini_row_position:end_row_position, ini_col_position:end_col_position]`

* Using lists of values:
  * Using `loc`:
    * `df.loc[[row_labels_to_include], [col_labels_to_include]]`
  * Using `iloc`:
    * `df.iloc[[row_positions_to_include], [col_positions_to_include]]`
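
The difference in how `loc` and `iloc` treat the end of a range can be sketched like this. `df_slices` is a reduced, illustrative version of the DataFrame above.

```python
import pandas as pd

df_slices = pd.DataFrame(
    {"col_int": [1, 2, 3, 4]},
    index=["abc", "def", "ghi", "jkl"],
)

# label-based range: the end label "ghi" is included
by_label = df_slices.loc["abc":"ghi"]

# position-based range: the end position 2 is excluded
by_position = df_slices.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```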

In [48]:
# get the last 3 rows for the last 2 columns
# using ranges and loc
df.loc["def":"jkl", "col_int": "col_boolean"]

| col_string | col_int | col_boolean |
|------------|---------|-------------|
| def        | 2       | True        |
| ghi        | 3       | False       |
| jkl        | 4       | True        |


In [49]:
# get the last 3 rows for the last 2 columns
# using ranges and iloc
df.iloc[-3:, -2:]


| col_string | col_int | col_boolean |
|------------|---------|-------------|
| def        | 2       | True        |
| ghi        | 3       | False       |
| jkl        | 4       | True        |


In [50]:
# get the 2nd and 4th row for the 2nd column
# using lists and loc
df.loc[["def", "jkl"], "col_int"]

col_string
def    2
jkl    4
Name: col_int, dtype: int64

In [51]:
# get the 2nd and 4th row for the 2nd column
# using lists and iloc
df.iloc[[1, 3], 1]

col_string
def    2
jkl    4
Name: col_int, dtype: int64

### Slicing based on conditions

Before diving into it, let's see what `pandas` returns when we perform logical operations on a series.

In [52]:
# create series
s = pd.Series([1, 2, 3, 4, 5])

# condition
s > 3

0    False
1    False
2    False
3     True
4     True
dtype: bool

In [53]:
s[(s > 3) & (s % 2 == 0)]

3    4
dtype: int64

The result of a comparison is another `pd.Series` filled with `True/False` according to whether or not the condition was met for each element.

We can use this boolean series to *filter* a series, by placing the condition between brackets.

```Python
series[condition]  # e.g. series[series > 3]
```

### Logical operators in pandas
* and: `&`
* or: `|`
* not: `~`
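
A short sketch of these operators on an illustrative series `s_ops`. Each condition is wrapped in parentheses because `&`, `|` and `~` bind more tightly than comparisons. The last part shows that Python's own `and` raises a `ValueError` when applied to whole series.

```python
import pandas as pd

s_ops = pd.Series([1, 2, 3, 4, 5])

# and: both conditions must hold
both = s_ops[(s_ops > 1) & (s_ops < 5)]

# or: at least one condition must hold
either = s_ops[(s_ops < 2) | (s_ops > 4)]

# not: invert a condition
negated = s_ops[~(s_ops > 3)]

# the keyword `and` does not work element-wise on series
try:
    s_ops[(s_ops > 1) and (s_ops < 5)]
    and_failed = False
except ValueError:
    and_failed = True

print(both.tolist(), either.tolist(), negated.tolist(), and_failed)
```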

In [54]:
s = pd.Series(range(26))

s[s % 5 == 0]

0      0
5      5
10    10
15    15
20    20
25    25
dtype: int64

We can extend this behavior to DataFrames: we can filter the rows of our df based on logical conditions on the columns.

In [55]:
# defining the dataset
df = pd.DataFrame({
    "col_a": [1, 2, 3, 4],
    "col_b": [2, 4, 6, 8],
    "col_c": [3, 6, 9, 12],
    "col_d": [4, 8, 12, 16]
})

df

|   | col_a | col_b | col_c | col_d |
|---|-------|-------|-------|-------|
| 0 | 1     | 2     | 3     | 4     |
| 1 | 2     | 4     | 6     | 8     |
| 2 | 3     | 6     | 9     | 12    |
| 3 | 4     | 8     | 12    | 16    |


In [56]:
# filter rows that have `col_c` greater than or equal to 9

df[df["col_c"] >= 9]

|   | col_a | col_b | col_c | col_d |
|---|-------|-------|-------|-------|
| 2 | 3     | 6     | 9     | 12    |
| 3 | 4     | 8     | 12    | 16    |


We can combine several conditions in a single instruction:

In [57]:
# rows with even values of `col_a` AND values of `col_d` greater than 8
df[
    (df["col_a"] % 2 == 0) &
    (df["col_d"] > 8)
]

|   | col_a | col_b | col_c | col_d |
|---|-------|-------|-------|-------|
| 3 | 4     | 8     | 12    | 16    |
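
A condition can also be combined with a column selection by passing it as the row argument of `loc`. This is a minimal sketch on an illustrative `df_filter` with a subset of the columns used above.

```python
import pandas as pd

df_filter = pd.DataFrame({
    "col_a": [1, 2, 3, 4],
    "col_c": [3, 6, 9, 12],
    "col_d": [4, 8, 12, 16],
})

# rows where col_c >= 9, keeping only col_a and col_d
subset = df_filter.loc[df_filter["col_c"] >= 9, ["col_a", "col_d"]]

print(subset)
```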


## Adding data to a `pd.DataFrame`

* Adding columns:
```Python
df["new_column"] = data_to_include
```

In [58]:
# create df
df = pd.DataFrame({
    "sport": ["football", "basketball", "rugby"],
    "round_ball": [True, True, False],
    "is_cool": [False, True, True]
})

df

|   | sport      | round_ball | is_cool |
|---|------------|------------|---------|
| 0 | football   | True       | False   |
| 1 | basketball | True       | True    |
| 2 | rugby      | False      | True    |


In [59]:
# add a new column called players_per_team
df["players_per_team"] = [11, 5, 15]

# print df
df

|   | sport      | round_ball | is_cool | players_per_team |
|---|------------|------------|---------|------------------|
| 0 | football   | True       | False   | 11               |
| 1 | basketball | True       | True    | 5                |
| 2 | rugby      | False      | True    | 15               |
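
The new column does not have to come from a list: it can also be computed from existing columns. In this sketch, `df_sports` and the column names `players_per_match` and `big_team` are illustrative.

```python
import pandas as pd

df_sports = pd.DataFrame({
    "sport": ["football", "basketball", "rugby"],
    "players_per_team": [11, 5, 15],
})

# arithmetic on a column is applied element by element
df_sports["players_per_match"] = df_sports["players_per_team"] * 2

# a condition produces a boolean column
df_sports["big_team"] = df_sports["players_per_team"] > 10

print(df_sports)
```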


* Adding rows (`DataFrame.append` was removed in pandas 2.0, so we use `pd.concat`):
```Python
new_row = {col_1: data_1, col_2: data_2, ..., col_n: data_n}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
```

In [60]:
# define new row
am_football = {
    "sport": "american football",
    "round_ball": False,
    "is_cool": False,
    "players_per_team": 11
}

# add new row for american football
# (DataFrame.append was removed in pandas 2.0; pd.concat is its replacement)
df = pd.concat([df, pd.DataFrame([am_football])], ignore_index=True)

# print df
df

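|   | sport             | round_ball | is_cool | players_per_team |
|---|-------------------|------------|---------|------------------|
| 0 | football          | True       | False   | 11               |
| 1 | basketball        | True       | True    | 5                |
| 2 | rugby             | False      | True    | 15               |
| 3 | american football | False      | False   | 11               |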
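
When several rows need to be added, one option is to build a second DataFrame from a list of dicts and join both with `pd.concat`. A minimal sketch with illustrative data; `new_rows` and `combined` are hypothetical names.

```python
import pandas as pd

df_sports = pd.DataFrame({
    "sport": ["football", "basketball"],
    "players_per_team": [11, 5],
})

# each dict is one new row
new_rows = pd.DataFrame([
    {"sport": "rugby", "players_per_team": 15},
    {"sport": "volleyball", "players_per_team": 6},
])

# ignore_index=True renumbers the rows 0, 1, 2, ...
combined = pd.concat([df_sports, new_rows], ignore_index=True)

print(combined)
```

Concatenating once with all the new rows is generally preferable to adding rows one at a time in a loop, since every concatenation builds a new DataFrame.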


## Practice

### 1. Create the following DataFrame

Save it as the variable `dataset`

| name   | age | type    | is_furry | likes_cats |
|--------|-----|---------|----------|------------|
| dani   | 33  | human   | False    | False      |
| churro | 8   | dog     | True     | False      |
| plant  | 2   | plant   | False    | True       |
| cup    | 1   | object  | False    | True       |

In [61]:
dataset = pd.DataFrame({
    "name": ["dani", "churro", "plant", "cup"],
    "age": [33, 8, 2, 1],
    "type": ["human", "dog", "plant", "object"],
    "is_furry": [False, True, False, False],
    "likes_cats": [False, False, True, True]
})

dataset

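|   | name   | age | type   | is_furry | likes_cats |
|---|--------|-----|--------|----------|------------|
| 0 | dani   | 33  | human  | False    | False      |
| 1 | churro | 8   | dog    | True     | False      |
| 2 | plant  | 2   | plant  | False    | True       |
| 3 | cup    | 1   | object | False    | True       |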


### 2. Return the value of `is_furry` for the element with `index = 2`

In [62]:
dataset.loc[2, "is_furry"]

False

In [63]:
dataset["is_furry"][2]

False

In [64]:
dataset.iloc[2, 3]

False

### 3. Add a new column containing the length of the name in characters

In [65]:
dataset["length_name"] = list(map(len, dataset["name"]))

dataset

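|   | name   | age | type   | is_furry | likes_cats | length_name |
|---|--------|-----|--------|----------|------------|-------------|
| 0 | dani   | 33  | human  | False    | False      | 4           |
| 1 | churro | 8   | dog    | True     | False      | 6           |
| 2 | plant  | 2   | plant  | False    | True       | 5           |
| 3 | cup    | 1   | object | False    | True       | 3           |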


In [66]:
dataset["length_name"] = dataset["name"].str.len()

dataset

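|   | name   | age | type   | is_furry | likes_cats | length_name |
|---|--------|-----|--------|----------|------------|-------------|
| 0 | dani   | 33  | human  | False    | False      | 4           |
| 1 | churro | 8   | dog    | True     | False      | 6           |
| 2 | plant  | 2   | plant  | False    | True       | 5           |
| 3 | cup    | 1   | object | False    | True       | 3           |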


### 4. Add a new column named `logical_op` containing the following logical operation:

```Python
is_furry and likes_cats
```

Keep in mind that logical operators in pandas are not the ones we know (`not`, `and`, `or`) but rather (`~`, `&`, `|`).

Source: https://jakevdp.github.io/PythonDataScienceHandbook/02.06-boolean-arrays-and-masks.html#Boolean-operators

In [67]:
dataset["logical_op"] = dataset["is_furry"] & dataset["likes_cats"]

dataset

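|   | name   | age | type   | is_furry | likes_cats | length_name | logical_op |
|---|--------|-----|--------|----------|------------|-------------|------------|
| 0 | dani   | 33  | human  | False    | False      | 4           | False      |
| 1 | churro | 8   | dog    | True     | False      | 6           | False      |
| 2 | plant  | 2   | plant  | False    | True       | 5           | False      |
| 3 | cup    | 1   | object | False    | True       | 3           | False      |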


### 5. Change the `index` from the default to:

`element_1, element_2, element_3, element_4`

In [71]:
dataset.index = ["element_" + str(i) for i in range(1, 5)]

dataset

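|           | name   | age | type   | is_furry | likes_cats | length_name | logical_op |
|-----------|--------|-----|--------|----------|------------|-------------|------------|
| element_1 | dani   | 33  | human  | False    | False      | 4           | False      |
| element_2 | churro | 8   | dog    | True     | False      | 6           | False      |
| element_3 | plant  | 2   | plant  | False    | True       | 5           | False      |
| element_4 | cup    | 1   | object | False    | True       | 3           | False      |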


### 6. Create a new dataframe called `non_furry` that contains the rows with `is_furry = False`:

In [72]:
# `~` negates the boolean column, keeping the rows where is_furry is False
non_furry = dataset[~dataset["is_furry"]]

non_furry

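|           | name  | age | type   | is_furry | likes_cats | length_name | logical_op |
|-----------|-------|-----|--------|----------|------------|-------------|------------|
| element_1 | dani  | 33  | human  | False    | False      | 4           | False      |
| element_3 | plant | 2   | plant  | False    | True       | 5           | False      |
| element_4 | cup   | 1   | object | False    | True       | 3           | False      |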
