# Pandas

### What is Pandas?
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for many different kinds of data:
1. Tabular data with heterogeneously typed columns (that is, columns of different data types), as in an SQL table or Excel spreadsheet.

2. Ordered and unordered (not necessarily fixed-frequency) time series data.

3. Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

4. Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

The two primary data structures of pandas, ``Series`` (1-dimensional) and ``DataFrame`` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, ML, etc.

Let's first import pandas

In [1]:
import pandas as pd # 'pd' is the conventional alias, so we can write pd.<function> when calling pandas functions

### DataFrame
A ``DataFrame`` is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet or a SQL table. A ``Series`` is a single column of a ``DataFrame`` and is hence 1-dimensional.

![DataFrame](assets/dataframe.png)
![Series](assets/series.png)

Let's create a DataFrame and Series

To make a ``DataFrame`` we can call ``pd.DataFrame()`` with the data, and optionally the column names and the index, passed as lists, arrays or dictionaries.

In [2]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Gender": ["male", "male", "female"],
    }
) # Here we create a DataFrame with the help of a dictionary
df

Unnamed: 0,Name,Age,Gender
0,"Braund, Mr. Owen Harris",22,male
1,"Allen, Mr. William Henry",35,male
2,"Bonnell, Miss. Elizabeth",58,female


In [3]:
# Let's say we want to work only with the Age column
ages = df['Age']
ages
# This is a pandas Series since it is one-dimensional

0    22
1    35
2    58
Name: Age, dtype: int64
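
The dictionary form above is not the only option. As a minimal sketch, a ``Series`` can also be built directly from a list, and a ``DataFrame`` from a list of rows with explicit ``columns`` and ``index`` arguments. The row labels ``p1`` to ``p3`` and the shortened names are made up for this illustration.

```python
import pandas as pd

# A Series built directly from a list; `name` labels the values.
ages = pd.Series([22, 35, 58], name="Age")

# A DataFrame built from a list of rows. The column names and the
# row labels "p1".."p3" are hypothetical, chosen only for this sketch.
people = pd.DataFrame(
    [["Braund", 22], ["Allen", 35], ["Bonnell", 58]],
    columns=["Name", "Age"],
    index=["p1", "p2", "p3"],
)

print(ages.shape)    # (3,)   -> one dimension
print(people.shape)  # (3, 2) -> rows, columns
print(people.loc["p2", "Age"])  # 35
```

With a custom index, rows are addressed by those labels instead of the default 0, 1, 2.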

In most ML tasks, we won't be creating the ``DataFrame`` from scratch; instead we will be reading it from files.

pandas provides the ``read_csv()`` function to read data stored as a CSV file into a pandas ``DataFrame``. pandas supports many different file formats or data sources out of the box (CSV, Excel, SQL, JSON, Parquet, …), each of them with the prefix ``read_*``.

As an example, let us load the Titanic dataset from the ``datasets`` folder.

In [4]:
titanic = pd.read_csv("datasets/titanic.csv")
titanic # when viewing the DataFrame, we can see the first 5 rows and the last 5 rows if the DataFrame is large

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
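
If you do not have the CSV file at hand, the behaviour of ``read_csv()`` can still be tried out. The sketch below assumes a tiny, hand-written CSV string (three made-up rows in the style of the Titanic data) and wraps it in ``io.StringIO`` so that pandas can read it like a file.

```python
import io

import pandas as pd

# A small hand-written CSV snippet, used here instead of a real file.
# The last row has an empty Age field on purpose.
csv_text = """PassengerId,Name,Age
1,"Braund, Mr. Owen Harris",22
2,"Allen, Mr. William Henry",35
3,"Moran, Mr. James",
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)                # (3, 3)
print(sample["Age"].isna().sum())  # 1 -> the empty field became NaN
```

Note that the quoted names keep their commas, and that the missing age turns the whole ``Age`` column into floating point values.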


It is good practice to first see what kind of data we have in hand before proceeding with making ML models. pandas offers us tools to do exactly that.

If you want to see the first 5 rows, you can use the ``head()`` method. This method also takes an optional integer argument to view a custom number of rows. Similarly, the ``tail()`` method shows the last rows.

In [5]:
titanic.head(8)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [6]:
titanic.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q
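
To see the defaults of ``head()`` and ``tail()`` without the Titanic file, here is a small sketch on a made-up 20-row table of numbers and their squares.

```python
import pandas as pd

# A made-up table with 20 rows, just to have something to slice.
numbers = pd.DataFrame({"n": range(20), "square": [i * i for i in range(20)]})

first_default = numbers.head()  # no argument: the first 5 rows
first_eight = numbers.head(8)   # the first 8 rows
last_three = numbers.tail(3)    # the last 3 rows

print(len(first_default), len(first_eight), len(last_three))  # 5 8 3
print(last_three["n"].tolist())  # [17, 18, 19]
```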


To check how pandas interpreted the data type of each column, request the ``dtypes`` attribute. Note that we use ``dtypes`` and not ``dtypes()`` because ``dtypes`` is an **attribute** of ``pandas.DataFrame`` and not a **method**.

In [7]:
titanic.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
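
The difference between an attribute and a method can be seen directly. In this sketch (on a small made-up table), ``dtypes`` returns a ``Series`` of column types, and adding parentheses tries to call that ``Series``, which fails with a ``TypeError``.

```python
import pandas as pd

# A small made-up table with one float column and one integer column.
df = pd.DataFrame({"Age": [22.0, 35.5, 58.0], "Pclass": [3, 1, 2]})

col_types = df.dtypes  # attribute access: no parentheses
print(col_types)

call_error = None
try:
    df.dtypes()  # this tries to call the returned Series
except TypeError as err:
    call_error = err
    print("TypeError:", err)
```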

You can also see a technical summary of the ``DataFrame`` with the ``info()`` method.

In [8]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### Selection Tools and Methods
Let's say for some reason we are only interested in particular columns of the data. How do we proceed?
![Selection](assets/selection.png)

From the previous section, you might be thinking of doing something like ``titanic['Age']`` to select just the ages. If that's the case, you are correct. But what if I want to select more than one column?

Let's see how to do that

In [9]:
ages = titanic["Age"]
ages

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

In [10]:
# let's check the shape of this Series
ages.shape # remember that shape is an attribute, so we don't need to use ()

(891,)

In [12]:
# now I want to select Age and Cabin
age_cabin = titanic[["Age", "Cabin"]]
age_cabin.head()

Unnamed: 0,Age,Cabin
0,22.0,
1,38.0,C85
2,26.0,
3,35.0,C123
4,35.0,


In [13]:
# let's check the shape
age_cabin.shape

(891, 2)

The selection returned a ``DataFrame`` with 891 rows and 2 columns. Remember, a ``DataFrame`` is 2-dimensional with both a row and column dimension.
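
The inner pair of brackets is a plain Python list of column names, and it decides what comes back. As a small sketch on a made-up table: a single name gives a ``Series``, while a list of names (even a list with one name) gives a ``DataFrame``.

```python
import pandas as pd

# A small made-up table in the style of the Titanic data.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0],
        "Cabin": [None, "C85", None],
        "Pclass": [3, 1, 3],
    }
)

one_col_series = df["Age"]        # single label -> Series
one_col_frame = df[["Age"]]       # list with one label -> DataFrame
two_cols = df[["Age", "Cabin"]]   # list with two labels -> DataFrame

print(type(one_col_series).__name__, one_col_series.shape)  # Series (3,)
print(type(one_col_frame).__name__, one_col_frame.shape)    # DataFrame (3, 1)
print(type(two_cols).__name__, two_cols.shape)              # DataFrame (3, 2)
```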

Okay, that's great: now you can select the columns you want. What about rows? Can I filter specific rows?

![Filter rows](assets/filterrows.png)

Let's say I am interested in passengers above the age of 35. How do I filter them?

To select rows based on a conditional expression, use a condition inside the selection brackets ``[]``.
The condition inside the selection brackets, ``titanic["Age"] > 35``, checks for which rows the Age column has a value larger than 35:

In [16]:
above_35 = titanic[titanic["Age"] > 35]
above_35.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss Elizabeth",female,58.0,0,0,113783,26.55,C103,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,1,5,347082,31.275,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,0,0,248706,16.0,,S


The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas ``Series`` of boolean values (either ``True`` or ``False``) with the same number of rows as the original ``DataFrame``. Such a ``Series`` of boolean values can be used to filter the ``DataFrame`` by putting it in between the selection brackets ``[]``. Only rows for which the value is ``True`` will be selected.
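
To make the boolean ``Series`` visible, the sketch below builds the mask as a separate step on a made-up four-row table. One row has a missing age on purpose: a comparison with ``NaN`` evaluates to ``False``, so that row is dropped by the filter.

```python
import pandas as pd

# A made-up table; the third age is missing (NaN).
df = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, float("nan"), 54.0]}
)

mask = df["Age"] > 35   # a boolean Series, one value per row
print(mask.tolist())    # [False, True, False, True]

above_35 = df[mask]     # keeps only the rows where the mask is True
print(above_35["Name"].tolist())  # ['B', 'D']

# True counts as 1, so summing the mask counts the matching rows.
print(mask.sum())       # 2
```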

In [17]:
above_35.shape

(217, 12)

We can see that there are 217 passengers with age above 35.

Now let's say I want the data of the passengers travelling in class 2 or 3 (the ``Pclass`` column).

Similar to the conditional expression, the ``isin()`` conditional function returns ``True`` for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets ``[]``. In this case, the condition inside the selection brackets ``titanic["Pclass"].isin([2, 3])`` checks for which rows the Pclass column is either 2 or 3.

In [18]:
class_23 = titanic[titanic["Pclass"].isin([2, 3])]
class_23.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [19]:
class_23.shape

(675, 12)
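
The ``isin()`` call is a shorthand for several equality checks joined by "or". The sketch below, on a made-up ``Pclass`` column, compares both spellings. When combining conditions by hand, pandas needs the ``|`` operator (not the Python keyword ``or``) and parentheses around each condition.

```python
import pandas as pd

# A made-up column of passenger classes.
df = pd.DataFrame({"Pclass": [3, 1, 3, 2, 1]})

with_isin = df[df["Pclass"].isin([2, 3])]

# The same filter written as two conditions joined with | ("or").
with_or = df[(df["Pclass"] == 2) | (df["Pclass"] == 3)]

print(with_isin.index.tolist())   # [0, 2, 3]
print(with_isin.equals(with_or))  # True
```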

As a final example for conditional selections, let's say I want to work with passenger data for which the age is known.

The ``notna()`` conditional function returns ``True`` for each row whose value is not missing (``NaN``). As such, this can be combined with the selection brackets ``[]`` to filter the data table.

In [20]:
age_no_na = titanic[titanic["Age"].notna()]
age_no_na.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
age_no_na.shape

(714, 12)
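
The 714 rows match the non-null count that ``info()`` reported for ``Age`` earlier. As a small sketch on a made-up table, ``notna()`` and its opposite ``isna()`` split the rows into known and unknown ages, and ``count()`` gives the number of non-missing values directly.

```python
import pandas as pd

# A made-up table where two of the four ages are missing.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, float("nan"), 26.0, float("nan")],
    }
)

known = df[df["Age"].notna()]   # rows where Age has a value
unknown = df[df["Age"].isna()]  # rows where Age is missing

print(known.index.tolist())    # [0, 2]
print(unknown.index.tolist())  # [1, 3]
print(df["Age"].count())       # 2 -> number of non-missing ages
```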

Okay cool, now you can select columns or rows and filter them according to what you want. But what about rows **and** columns?
![Rows and cols](assets/rowsandcols.png)

Let's say I’m interested in the names of the passengers older than 35 years.

In this case, a subset of both rows and columns is made in one go and just using selection brackets ``[]`` is not sufficient anymore. The ``loc``/``iloc`` operators are required in front of the selection brackets ``[]``. When using ``loc``/``iloc``, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

In [22]:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]
adult_names.head()

1     Cumings, Mrs. John Bradley (Florence Briggs Th...
6                               McCarthy, Mr. Timothy J
11                              Bonnell, Miss Elizabeth
13                          Andersson, Mr. Anders Johan
15                     Hewlett, Mrs. (Mary D Kingcome) 
Name: Name, dtype: object

When using the column names, row labels or a condition expression, use the ``loc`` operator in front of the selection brackets ``[]``. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
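
The different forms that ``loc`` accepts can be tried side by side. This is a sketch on a made-up four-row table with the default 0 to 3 row labels. One detail worth noticing: a slice of labels includes both of its ends.

```python
import pandas as pd

# A made-up table with the default row labels 0..3.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Age": [22.0, 38.0, 54.0, 40.0],
        "Pclass": [3, 1, 1, 2],
    }
)

single = df.loc[1, "Name"]                # one row label, one column label
some_cols = df.loc[:, ["Name", "Age"]]    # colon = all rows, list of columns
label_slice = df.loc[1:2, "Name":"Age"]   # label slices include both ends
filtered = df.loc[df["Age"] > 35, ["Name", "Pclass"]]  # condition + list

print(single)                      # B
print(some_cols.shape)             # (4, 2)
print(label_slice.shape)           # (2, 2)
print(filtered["Name"].tolist())   # ['B', 'C', 'D']
```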

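Let's say I’m interested in rows 10 to 25 and columns 3 to 5. Because ``iloc`` positions start at 0 and the end of a slice is excluded, this becomes ``9:25`` and ``2:5``.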

In [23]:
titanic.iloc[9:25, 2:5]

Unnamed: 0,Pclass,Name,Sex
9,2,"Nasser, Mrs. Nicholas (Adele Achem)",female
10,3,"Sandstrom, Miss Marguerite Rut",female
11,1,"Bonnell, Miss Elizabeth",female
12,3,"Saundercock, Mr. William Henry",male
13,3,"Andersson, Mr. Anders Johan",male
14,3,"Vestrom, Miss Hulda Amanda Adolfina",female
15,2,"Hewlett, Mrs. (Mary D Kingcome)",female
16,3,"Rice, Master Eugene",male
17,2,"Williams, Mr. Charles Eugene",male
18,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female


Again, a subset of both rows and columns is made in one go and just using selection brackets ``[]`` is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the ``iloc`` operator in front of the selection brackets ``[]``.
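
The difference between labels and positions is easiest to see when the row labels do not start at 0. In this sketch the index values 100 to 103 are made up for the illustration. ``iloc`` counts positions from 0 and excludes the end of a slice, while ``loc`` uses the labels and includes both ends.

```python
import pandas as pd

# A made-up table whose row labels (100..103) differ from the positions (0..3).
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1, 2, 3, 4], "c": [5, 6, 7, 8]},
    index=[100, 101, 102, 103],
)

# Positions: rows 1 and 2, columns 0 and 1 (the end of each slice is excluded).
by_position = df.iloc[1:3, 0:2]

# Labels: rows 101 to 102, columns "a" to "b" (both ends are included).
by_label = df.loc[101:102, "a":"b"]

print(by_position.index.tolist())    # [101, 102]
print(list(by_position.columns))     # ['a', 'b']
print(by_position.equals(by_label))  # True
```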


When selecting specific rows and/or columns with ``loc`` or ``iloc``, new values can be assigned to the selected data. For example, to assign the name ``anonymous`` to the first 3 elements of the fourth column:

In [24]:
titanic.iloc[0:3, 3] = "anonymous"
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,anonymous,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,anonymous,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,anonymous,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
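
Assignment works the same way with ``loc`` and a condition. The sketch below uses a made-up table; the replacement value ``senior`` is only an illustration. Both assignments change ``df`` in place.

```python
import pandas as pd

# A made-up table to assign into.
df = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, 54.0, 40.0]}
)

# By position: the first two rows of the first column.
df.iloc[0:2, 0] = "anonymous"

# By condition and label: the Name of every passenger older than 50.
df.loc[df["Age"] > 50, "Name"] = "senior"

print(df["Name"].tolist())  # ['anonymous', 'anonymous', 'senior', 'D']
```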


Before ending this notebook, here is a summary of things you should remember:
1. When selecting subsets of data, square brackets ``[]`` are used.

2. Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.

3. Select specific rows and/or columns using ``loc`` when using the row and column names.

4. Select specific rows and/or columns using ``iloc`` when using the positions in the table.

5. You can assign new values to a selection based on ``loc``/``iloc``.

6. Getting data into pandas from many different file formats or data sources is supported by ``read_*`` functions.

7. The ``head``/``tail``/``info`` methods and the ``dtypes`` attribute are convenient for a first check.