# Pandas basics

Pandas is a software library written as an extension to NumPy for high-level data manipulation and analysis. In particular, it provides data structures and operations for manipulating numerical tables and time series.

Its key *data structure* is the `DataFrame`, which lets you store and manipulate tabular data in rows of observations and columns of variables.

Open source software, distributed under a liberal BSD license, Pandas is developed and maintained publicly on GitHub by a vibrant, responsive, and diverse community.

- **Author:** Wes McKinney
- **Creation year:** 2008
- **Last stable version:** 1.2.1
- **Programmed in:** Python
- **Programming languages:** Python, C, Cython
- **License:** BSD (Berkeley Software Distribution)
- **Requires:** numpy, pytz, python-dateutil
- **Code base:** https://github.com/pandas-dev/pandas
- **Home-page:** https://pandas.pydata.org


### Why use Pandas?

Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets, and make them readable and relevant.

Relevant data is very important in data science.

> **Data Science**: a branch of computer science that studies how to store, use and analyze data in order to derive information from it.

### What can Pandas do?

Pandas gives you answers about the data like:

- Is there a correlation between two or more columns?
- What is the average value?
- Max value?
- Min value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called **cleaning the data**.
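
As a minimal sketch of the questions above, the following snippet builds a small made-up table and asks pandas for the average, maximum, minimum and correlation, then drops the incomplete row. The names `people`, `age`, `income` and all the values are illustrative assumptions, not data used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data; float("nan") marks a missing value.
people = pd.DataFrame({
    "age": [25.0, 32.0, 47.0, float("nan")],
    "income": [30000, 42000, 61000, 38000],
})

print(people["age"].mean())  # average value (missing values are skipped)
print(people["age"].max())   # maximum value
print(people["age"].min())   # minimum value
print(people["age"].corr(people["income"]))  # correlation between two columns

# Cleaning the data: drop every row that contains a missing value.
clean = people.dropna()
print(clean.shape)
```

Here `dropna()` returns a new `DataFrame` and leaves `people` untouched, which makes it easy to compare the table before and after cleaning.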


### Why is Pandas built on NumPy?

NumPy is highly efficient when working with large arrays, but these are plain arrays with no metadata attached to them.

We (*with a true engineering spirit*) want to work not only fast but also comfortably. Imagine that we load a dataset with features (i.e. columns) such as "sex", "age", "city"...
To get the city of a specific observation (i.e. row), we have to remember that "city" is the third column (index 2, since indexing starts at 0).

Imagine if we could add some metadata to an array, so that when we want to access any column we can do it with readable names instead of indexes.

That is why Pandas was developed: it is built on top of NumPy and provides an **abstraction layer** over its arrays.
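
The difference can be sketched with a tiny example. The rows below are made up for illustration (the names `rows`, `array` and `table` are assumptions), and the same value is fetched once by position with NumPy and once by name with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical observations: one (sex, age, city) record per row.
rows = [["F", 31, "Madrid"], ["M", 45, "Lisbon"]]

# Plain NumPy: we must remember that "city" lives at column index 2.
array = np.array(rows, dtype=object)
print(array[0, 2])

# pandas: the same values, but each column carries a readable name.
table = pd.DataFrame(rows, columns=["sex", "age", "city"])
print(table.loc[0, "city"])
```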


## Get Started

> **Note**: this guide uses *pip* as the package manager for the installation, but others such as *anaconda* can be used as well.

We can check whether the **Pandas Python module** is installed by running the following command in our shell. If it is, the command shows information about the module, including its version.

In [None]:
!pip show pandas

If we do not have it installed, we can install it using the following command in our shell (usually inside a virtualenv):

In [None]:
!pip install pandas

On the other hand, if the installed version is not the most recent one, we can update the module by running:

In [None]:
!pip install -U pandas

## First step

First we need to import the module so that our Python script can use it, renaming it to "**pd**" as the most common convention dictates.

In [1]:
import pandas as pd

Pandas uses the concept of `Series` to represent single-dimensional data, and `DataFrame` for data with multiple dimensions.

> **DataFrame**: a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.frame in R.

> **Series**: each column in a `DataFrame`.

In [None]:
print("Single dimension:\n")
print(pd.Series([1, 2, 3, 4]))

print("\n\nTwo dimensions:\n")
print(pd.DataFrame([[1, 2, 3], [4, 5, 6]]))

As we can see, the data has "labels" for the rows and the columns (integer positions by default when none are set). With these default labels, though, this is no different from what we were doing with NumPy.

> **Note**: a pandas `Series` has no column labels, as it is just a single column of a `DataFrame`. A `Series` does have row labels.

So let's label our data:

In [None]:
data = [[1, 2, 3], [4, 5, 6]]

pd.DataFrame(data, columns=["c1", "c2", "c3"], index=["user 1", "user 2"])

We can even create a `DataFrame` from a dictionary! In this case, pandas takes the column labels from the dictionary keys.

In [None]:
data = {
    "c1": [1, 4],
    "c2": [2, 5],
    "c55": [3, 6]
}

pd.DataFrame(data)

The values can also be of other types, such as strings:

In [None]:
data2 = {
    "c1": ["a", "b", "c"],
    "c2": [1, 2, 3],
    "c3": [1., 2., 3.]
}

pd.DataFrame(data2)
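
Each column keeps its own type. As a small sketch, the `dtypes` attribute lists the type pandas inferred for every column of the same dictionary (the variable name `frame` is only illustrative, and the exact name shown for the string column may vary between pandas versions).

```python
import pandas as pd

data2 = {
    "c1": ["a", "b", "c"],
    "c2": [1, 2, 3],
    "c3": [1., 2., 3.]
}

frame = pd.DataFrame(data2)

# One entry per column: strings, integers and floats are stored separately.
print(frame.dtypes)
```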

## Accessing values
Let's see how to access the values of the `DataFrame`.

First, let's fetch a column with its label:

> **Note**: if you are familiar with `Python dictionaries`, selecting a single column is similar to selecting a dictionary value by its key.

In [2]:
data = {
    "c1": [1, 4],
    "c2": [2, 5],
    "c55": [3, 6]
}

dataframe = pd.DataFrame(data, index=["user 1", "user 2"])
dataframe["c2"] # to select the column, use the column label in between square brackets []

user 1    2
user 2    5
Name: c2, dtype: int64

When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`. We can verify this by checking the type of the output:

In [None]:
print(type(dataframe["c2"]))

We can also access a single value using the `.loc` indexer (note the square brackets: it is not called like a function):

In [None]:
dataframe.loc["user 1", "c2"] # [row label, column label] # .loc takes labels, not positions

###### What if we want to use numeric indexing instead of the label?

For those cases, pandas provides the `.iloc` indexer, which behaves like `.loc` but takes integer positions instead of labels.

In [None]:
dataframe.iloc[0, 1] # [row position, column position] # integer positions only
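
Both indexers also accept a single row or a slice. The following is a minimal sketch that rebuilds the same `dataframe` so it can run on its own; the variable names on the left-hand side are illustrative.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {"c1": [1, 4], "c2": [2, 5], "c55": [3, 6]},
    index=["user 1", "user 2"],
)

row_by_label = dataframe.loc["user 2"]      # whole row, selected by its label
row_by_position = dataframe.iloc[1]         # the same row, selected by position
first_two_columns = dataframe.iloc[:, 0:2]  # all rows, columns at positions 0 and 1

print(row_by_label)
print(first_two_columns)
```

A single row comes back as a `Series` whose labels are the column names, while the slice keeps two dimensions and comes back as a `DataFrame`.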

## Accessing multiple columns

Imagine we have the following *dataset* (pandas allows us to read files directly from a URL):

In [3]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv') # .read_csv returns a DataFrame
print(iris.shape)  # (rows, columns), same meaning as in numpy
iris.head()  # method that returns only the first 5 rows

(150, 5)


|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


We have seen how to fetch a single column, but what about fetching both the "*sepal_length*" and "*sepal_width*" columns? In that case, instead of a single label, we can use a list of labels:

In [4]:
iris[["sepal_length", "sepal_width"]].head()

|   | sepal_length | sepal_width |
|---|---|---|
| 0 | 5.1 | 3.5 |
| 1 | 4.9 | 3.0 |
| 2 | 4.7 | 3.2 |
| 3 | 4.6 | 3.1 |
| 4 | 5.0 | 3.6 |


As we are selecting multiple columns, the result is also a `DataFrame`, with all the rows but only the selected columns.

> **Note**: the inner square brackets define a `Python list` with column labels, whereas the outer brackets are used to select the data from a pandas `DataFrame` as seen in the previous example.
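
The effect of the inner brackets can be checked on a small stand-in table. The `flowers` frame below is an illustrative assumption with only two rows, not the full iris dataset, so that the sketch runs without a download.

```python
import pandas as pd

# Small stand-in table with iris-like column names.
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "species": ["setosa", "setosa"],
})

one_column = flowers["sepal_length"]     # a label selects a Series
same_column = flowers[["sepal_length"]]  # a list with one label selects a DataFrame

print(type(one_column).__name__)
print(type(same_column).__name__)
```

Even with a single label inside the list, the result stays two-dimensional, which is useful when later code expects a `DataFrame`.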

### Filtering rows

Finally, we might also want to get only a subset of the rows. For example, in the *iris dataset* we may want all the rows that belong to a specific species.

In this case, we would do:

In [None]:
iris[iris["species"] == "virginica"]

Let's break this into two steps:

- Check which rows pass our condition, which returns a `Series` of True/False values.
- Filter the `DataFrame` using the `Series` obtained in the previous step.

In [None]:
indices = iris["species"] == "virginica" # get a True/False Series, checking the condition for each row/record
indices.head()

In [None]:
iris[indices] # keep only the rows where the True/False Series is True

When filtering rows, the boolean iterable used in the "second step" must have the same number of rows as the `DataFrame`; otherwise pandas raises a `ValueError`, such as:

In [None]:
# iris[[True, False, False]] # <-- error: 3 values for 150 rows
iris[[False] * 10 + [True] * 139] # ValueError: Item wrong length 149 instead of 150 # change 139 to 140 to fix it

Since we index with a True/False iterable, any condition that produces one can be used:

In [None]:
iris[iris["sepal_length"] < 4.5]

We can even combine multiple conditions. In those cases, remember that we are applying element-wise boolean operations on `Series`, so Python's `and`, `or` and `not` keywords do not work here.

From pandas documentation:
> Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

In [None]:
iris[(iris["sepal_length"] < 4.5) & (iris["sepal_width"] < 3)]
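
The other two operators from the quote work the same way. This is a minimal sketch on a made-up four-row table (the `flowers` frame and its values are illustrative assumptions, not rows taken from the iris file).

```python
import pandas as pd

# Small stand-in table with iris-like column names.
flowers = pd.DataFrame({
    "sepal_length": [4.3, 5.8, 7.1, 4.4],
    "sepal_width": [3.0, 2.7, 3.0, 2.9],
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

short = flowers["sepal_length"] < 4.5  # True/False Series
narrow = flowers["sepal_width"] < 3    # True/False Series

either = flowers[short | narrow]                         # OR: at least one condition holds
not_setosa = flowers[~(flowers["species"] == "setosa")]  # NOT: invert the condition

print(either)
print(not_setosa)
```

Storing each condition in its own variable keeps the parentheses manageable when several conditions are combined.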

## Pandas and Numpy
Usually, we load our data with pandas and preprocess it with its high-level methods. Then we can either convert it to a NumPy array using `dataframe.values` or feed it directly to libraries that accept both types.

In [None]:
print(type(dataframe.values))
print(dataframe.values)
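
Recent pandas versions also provide `DataFrame.to_numpy()`, which the pandas documentation recommends over `.values`. A small sketch, rebuilding the same `dataframe` so it can run on its own (the name `array` is illustrative):

```python
import numpy as np
import pandas as pd

dataframe = pd.DataFrame(
    {"c1": [1, 4], "c2": [2, 5], "c55": [3, 6]},
    index=["user 1", "user 2"],
)

array = dataframe.to_numpy()  # plain NumPy array; row and column labels are dropped

print(type(array))
print(array.shape)
```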