# Pandas basics
NumPy is highly efficient when working with large arrays, but those arrays are plain blocks of values, with no metadata attached to them.

We (*with a true engineer spirit*) want to work not only fast but also comfortably. Imagine that we load a dataset with features (i.e. columns) such as "sex", "age", "city"...
When we need the city of a specific observation (i.e. row), we have to remember that "city" is the third column (index 2, since indexing starts at 0).

Imagine if we could add some metadata to an array, so that when we want to access any column we can do it with readable names instead of indexes.

**Pandas** is a library that allows us to do exactly this (among other useful functionality).

Let's start by installing the library in our environment:

In [1]:
!pip install pandas



Pandas uses the concept of `Series` to represent one-dimensional data, and `DataFrame` for two-dimensional (tabular) data.

In [2]:
import pandas

print("Single dimension:\n")
print(pandas.Series([1, 2, 3, 4]))

print("\n\nTwo dimensions:\n")
print(pandas.DataFrame([[1, 2, 3], [4, 5, 6]]))

Single dimension:

0    1
1    2
2    3
3    4
dtype: int64


Two dimensions:

   0  1  2
0  1  2  3
1  4  5  6


As we can see, the rows and the columns already have "names", but by default they are just the integer positions, so this is no different from what we were doing with numpy. So let's name our rows and columns:

In [3]:
data = [[1, 2, 3], [4, 5, 6]]

pandas.DataFrame(data, columns=["c1", "c2", "c3"], index=["user 1", "user 2"])

        c1  c2  c3
user 1   1   2   3
user 2   4   5   6
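
The labels we pass are stored on the dataframe itself, next to the values. A minimal sketch, reusing the values from the cell above (the variable name `frame` is only illustrative):

```python
import pandas

data = [[1, 2, 3], [4, 5, 6]]
frame = pandas.DataFrame(data, columns=["c1", "c2", "c3"], index=["user 1", "user 2"])

# Column labels and row labels are kept as metadata next to the values.
print(list(frame.columns))  # column labels
print(list(frame.index))    # row labels
print(frame.shape)          # (rows, columns)
```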


We can even create dataframes from dictionaries! In this case, the column names are taken from the dictionary keys.

In [4]:
data = {
    "c1": [1, 4],
    "c2": [2, 5],
    "c55": [3, 6]
}

pandas.DataFrame(data)

   c1  c2  c55
0   1   2    3
1   4   5    6
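
To make the relation between the two constructions explicit, here is a small sketch that builds the same table both ways and compares them (the names `from_rows` and `from_dict` are only illustrative):

```python
import pandas

# Row-oriented input: one inner list per row, column names given separately.
rows = [[1, 2, 3], [4, 5, 6]]
from_rows = pandas.DataFrame(rows, columns=["c1", "c2", "c55"])

# Dictionary input: each key is a column name and each list holds that column's values.
from_dict = pandas.DataFrame({"c1": [1, 4], "c2": [2, 5], "c55": [3, 6]})

# Both constructions describe the same table.
print(from_rows.equals(from_dict))
```

Note that the dictionary is column-oriented, so the lists read "downwards", while the list-of-lists form reads row by row.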


Columns are not limited to integers; each one can hold a different type, such as strings or floats:

In [5]:
data2 = {
    "c1": ["a", "b", "c"],
    "c2": [1, 2, 3],
    "c3": [1., 2., 3.]
}

pandas.DataFrame(data2)

  c1  c2   c3
0  a   1  1.0
1  b   2  2.0
2  c   3  3.0
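
Each column keeps its own type, which we can inspect with the `dtypes` attribute. The sketch below assumes the same `data2` dictionary as above; the exact dtype name reported for the string column depends on the pandas version, so the checks only look at the numeric columns:

```python
import pandas

data2 = {"c1": ["a", "b", "c"], "c2": [1, 2, 3], "c3": [1.0, 2.0, 3.0]}
frame = pandas.DataFrame(data2)

# One dtype per column.
print(frame.dtypes)

# Type checks that do not depend on the exact dtype name.
c2_is_integer = pandas.api.types.is_integer_dtype(frame["c2"])
c3_is_float = pandas.api.types.is_float_dtype(frame["c3"])
print(c2_is_integer, c3_is_float)
```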


## Accessing values
Let's see how to access the values of the dataframe.

First, let's fetch a column by its name:

In [6]:
data = {
    "c1": [1, 4],
    "c2": [2, 5],
    "c55": [3, 6]
}

dataframe = pandas.DataFrame(data, index=["user 1", "user 2"])

dataframe["c2"]

user 1    2
user 2    5
Name: c2, dtype: int64

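We can also access a single value with `.loc`, passing the row label and the column label. Note that `.loc` is an indexer used with square brackets, not a function call: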

In [7]:
dataframe.loc["user 1", "c2"]

2

What if we want to use numeric positions instead of labels? For those cases, pandas provides `.iloc`, which behaves like `.loc` but takes integer positions (starting at 0) instead of labels.

In [8]:
dataframe.iloc[0, 1]

2
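
Both indexers also accept a single row or a slice. The following sketch, built on the same small `dataframe` as above, shows one difference worth remembering: label slices with `.loc` include the end label, while positional slices with `.iloc` exclude the end position.

```python
import pandas

data = {"c1": [1, 4], "c2": [2, 5], "c55": [3, 6]}
dataframe = pandas.DataFrame(data, index=["user 1", "user 2"])

# Same cell, addressed by label and by position.
by_label = dataframe.loc["user 1", "c2"]
by_position = dataframe.iloc[0, 1]

# A whole row comes back as a Series indexed by the column names.
row = dataframe.loc["user 2"]

# Label slices include the end label; positional slices exclude the end position.
label_slice = dataframe.loc["user 1", "c1":"c2"]  # columns c1 and c2
position_slice = dataframe.iloc[0, 0:2]           # positions 0 and 1

print(by_label, by_position)
print(row.tolist())
print(label_slice.tolist(), position_slice.tolist())
```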

### Accessing multiple columns

Imagine we have the following dataset (pandas allows us to read files from a URL directly):

In [9]:
iris = pandas.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
print(iris.shape)  # (rows, columns), same meaning as in numpy
iris.head()  # show only the first 5 rows

(150, 5)


   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


We have seen how to fetch a single column, but what about fetching both "sepal_length" and "sepal_width"? In that case, instead of a single label, we pass a list of labels:

In [10]:
iris[["sepal_length", "sepal_width"]].head()

   sepal_length  sepal_width
0           5.1          3.5
1           4.9          3.0
2           4.7          3.2
3           4.6          3.1
4           5.0          3.6


The result is also a dataframe, with all the rows but only the columns we selected.
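
The type of the result depends on what we put inside the brackets. The sketch below uses a small hand-written stand-in for the iris data (an assumption, so that the example needs no download):

```python
import pandas

# Small hypothetical stand-in for the iris data.
frame = pandas.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "species": ["setosa", "setosa", "setosa"],
})

one_column = frame["sepal_length"]                    # single label -> Series
one_column_frame = frame[["sepal_length"]]            # list with one label -> DataFrame
two_columns = frame[["sepal_width", "sepal_length"]]  # columns follow the order of the list

print(type(one_column).__name__, type(one_column_frame).__name__)
print(list(two_columns.columns))
```

A list with a single label is a handy way to keep the two-dimensional shape when a later step expects a dataframe rather than a series.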

### Filtering rows

Finally, we might also want to get only a subset of the rows. For example, in the iris dataset we may want all the rows that belong to a specific species. In this case, we would do:

In [11]:
iris[iris["species"] == "virginica"]

     sepal_length  sepal_width  petal_length  petal_width    species
100           6.3          3.3           6.0          2.5  virginica
101           5.8          2.7           5.1          1.9  virginica
102           7.1          3.0           5.9          2.1  virginica
103           6.3          2.9           5.6          1.8  virginica
104           6.5          3.0           5.8          2.2  virginica
105           7.6          3.0           6.6          2.1  virginica
106           4.9          2.5           4.5          1.7  virginica
107           7.3          2.9           6.3          1.8  virginica
108           6.7          2.5           5.8          1.8  virginica
109           7.2          3.6           6.1          2.5  virginica


Let's break this into two steps:

- Check which rows pass our condition
- Filter the dataset with the resulting list/pandas.Series of True/False values

In [12]:
indices = iris["species"] == "virginica"
indices

0      False
1      False
2      False
3      False
4      False
       ...  
145     True
146     True
147     True
148     True
149     True
Name: species, Length: 150, dtype: bool

In [13]:
iris[indices]

     sepal_length  sepal_width  petal_length  petal_width    species
100           6.3          3.3           6.0          2.5  virginica
101           5.8          2.7           5.1          1.9  virginica
102           7.1          3.0           5.9          2.1  virginica
103           6.3          2.9           5.6          1.8  virginica
104           6.5          3.0           5.8          2.2  virginica
105           7.6          3.0           6.6          2.1  virginica
106           4.9          2.5           4.5          1.7  virginica
107           7.3          2.9           6.3          1.8  virginica
108           6.7          2.5           5.8          1.8  virginica
109           7.2          3.6           6.1          2.5  virginica
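
Because the intermediate result is an ordinary boolean Series, we can also use it on its own. The sketch below works on a small hand-written table (an assumption, not the real iris file) and shows two common uses: counting the matching rows, and passing the mask to `.loc` together with a column label.

```python
import pandas

# Small hypothetical table with the same column names as the iris data.
frame = pandas.DataFrame({
    "sepal_length": [5.1, 6.3, 5.8, 7.1],
    "species": ["setosa", "virginica", "virginica", "versicolor"],
})

mask = frame["species"] == "virginica"

# True counts as 1, so summing the mask gives the number of matching rows.
matches = int(mask.sum())

# .loc accepts the mask for the rows plus a column label, filtering and selecting at once.
lengths = frame.loc[mask, "sepal_length"]

print(matches)
print(lengths.tolist())
```

The filtered result keeps the original row labels (here 1 and 2), just as the output above keeps the labels 100, 101, and so on.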


When filtering rows, the iterable we use in the second step must contain exactly one value per row of the dataframe; otherwise pandas raises a `ValueError`. In the cell below, the commented-out line would fail, while the second one has the required 150 values:

In [14]:
# iris[[True, False, False]]  # <-- ValueError: 3 values for 150 rows
iris[[False] * 10 + [True] * 140]

     sepal_length  sepal_width  petal_length  petal_width    species
10            5.4          3.7           1.5          0.2     setosa
11            4.8          3.4           1.6          0.2     setosa
12            4.8          3.0           1.4          0.1     setosa
13            4.3          3.0           1.1          0.1     setosa
14            5.8          4.0           1.2          0.2     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
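
To see the error without breaking the notebook, we can catch it. This is a minimal sketch on a three-row table (the table and the names `frame`, `kept` and `error_raised` are only illustrative):

```python
import pandas

frame = pandas.DataFrame({"value": [10, 20, 30]})

try:
    frame[[True, False]]  # only 2 flags for 3 rows
    error_raised = False
except ValueError as error:
    error_raised = True
    print("ValueError:", error)

# A list with one flag per row works as expected.
kept = frame[[True, False, True]]
print(kept["value"].tolist())
```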


Since the rows are selected with an iterable of True/False values, any condition that produces one can be used:

In [15]:
iris[iris["sepal_length"] < 4.5]

    sepal_length  sepal_width  petal_length  petal_width species
8            4.4          2.9           1.4          0.2  setosa
13           4.3          3.0           1.1          0.1  setosa
38           4.4          3.0           1.3          0.2  setosa
42           4.4          3.2           1.3          0.2  setosa


We can even combine multiple conditions. In that case, remember that we are applying element-wise boolean operations on pandas Series, so Python's `and`, `or` and `not` keywords do not work here.

From the pandas documentation:
> Another common operation is the use of boolean vectors to filter the data. The operators are: `|` for or, `&` for and, and `~` for not. These must be grouped by using parentheses.

In [16]:
iris[(iris["sepal_length"] < 4.5) & (iris["sepal_width"] < 3)]

   sepal_length  sepal_width  petal_length  petal_width species
8           4.4          2.9           1.4          0.2  setosa
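
The other two operators work the same way. The parentheses matter because `&` and `|` bind more tightly than `<`, so without them Python would try to evaluate `4.5 & iris["sepal_width"]` first. The sketch below uses a small hand-written table (an assumption, not the real iris file) and stores each condition in a variable, which avoids the precedence problem altogether:

```python
import pandas

# Small hypothetical table with two of the iris column names.
frame = pandas.DataFrame({
    "sepal_length": [4.4, 4.3, 5.1, 6.3],
    "sepal_width": [2.9, 3.0, 3.5, 3.3],
})

short = frame["sepal_length"] < 4.5
narrow = frame["sepal_width"] < 3

either = frame[short | narrow]  # rows matching at least one condition
not_short = frame[~short]       # rows that do NOT match the condition

# Python's `and` asks for a single True/False, which a Series cannot provide.
try:
    short and narrow
    and_failed = False
except ValueError:
    and_failed = True

print(list(either.index), list(not_short.index), and_failed)
```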


## Pandas and Numpy
Usually, we load our data with pandas and use its high-level methods to preprocess it. Then, we can either convert it to a numpy array using `dataframe.values` (the pandas documentation recommends the newer `dataframe.to_numpy()` method for this) or feed it directly to libraries that accept both types.

In [17]:
print(type(dataframe.values))
print(dataframe.values)

<class 'numpy.ndarray'>
[[1 2 3]
 [4 5 6]]
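
The same conversion can be sketched with `to_numpy()`. Note that the resulting array holds only the values: the row and column labels stay behind, so we have to keep them ourselves if we need them later (the names `column_names` and `row_names` are only illustrative).

```python
import pandas

data = {"c1": [1, 4], "c2": [2, 5], "c55": [3, 6]}
dataframe = pandas.DataFrame(data, index=["user 1", "user 2"])

array = dataframe.to_numpy()  # same values as dataframe.values
print(type(array).__name__, array.shape, array.dtype)

# The labels are not part of the array; keep them separately if they are needed later.
column_names = list(dataframe.columns)
row_names = list(dataframe.index)
print(column_names, row_names)
```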
