<a href="https://colab.research.google.com/github/cristripoli/Codility/blob/master/pyds_02_introduction_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://raw.githubusercontent.com/daitan-innovation/daitan-ml-course-resources/main/daitan-header.jpg" width="700">

<br />

<span style="font-size: 10px; font-style: italic;">
Privileged and confidential. If this content has been received in error, please delete it immediately.
</span>

# Machine Learning Track

- **Module:** Python for Data Science
- **Instructors:**
  - Alisson Hayasi da Costa
  - Lucas Silveira de Moura

# *pandas* Basics

## Introduction

In this lesson you will learn the basics of the *pandas* library, including:
- Pandas Series and their most useful operations
- Pandas DataFrames and their most useful operations

Here's a bit about what pandas is all about:
*pandas* is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

*pandas* is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

The two primary data structures of *pandas*, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. *pandas* is built on top of NumPy (from the previous class) and is intended to integrate well within a scientific computing environment with many other third-party libraries (for example, Jupyter itself).

In [None]:
import numpy as np
import pandas as pd

### *pandas* [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, objects, etc.). The axis labels are collectively called the index. Labels need not be unique but must be of a hashable type. The object provides a host of methods for performing operations involving both the index and the values.

Let's take a look at the most common way in which we can create a Series:

#### Series creation
Creating a Series by passing a list of values, letting pandas create a default integer index:
<br/>Note: This also works by passing a NumPy array instead of a common Python list.

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Using the default visualization method of Series in a Jupyter notebook like this one we can observe:
- On the left column, we can see the index. In this case, it was automatically generated keeping the order of the inserted list, and starting at 0 with increments of 1 (from 0 to 5 as we have 6 elements).
- On the right column, we can see the values themselves.
- At the bottom, we have the type of the values in this Series (in this case, float64).

#### Accessing an element of a Series
Let's access a value in the Series by specifying its index:

In [None]:
s[2]

#### Index of Series
If desired, a manually-defined index can be specified at the creation of the Series:

In [None]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['a', 'b', 'c', 'd', 'e', 'f'])
s

And now let's access the same value as before, but now with a different index:

In [None]:
s['c']

Another commonly used way to access a Series element based on its index is to use the *loc[index]* notation:

In [None]:
s.loc['c']

#### Naming a Series
A Series can be [named](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.name.html). Naming a Series can be useful when using it to compose a DataFrame (will be covered later in this class).

In [None]:
s.name = 'Example Series'
s

#### Deleting an element from a Series
Elements from the Series can be deleted (or *dropped*) by using the [drop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.drop.html) method and passing the list of indexes to be removed.
Notice that the size of the Series will be affected when at least one element is dropped, instead of marking the elements as empty and keeping the size.

In [None]:
s = s.drop(['a', 'e'])
s

#### Sorting a Series
We can also sort Series by either its [values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_values.html) or by its [index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_index.html).

In [None]:
s = s.sort_index(ascending=False)
s

In [None]:
s = s.sort_values(ascending=False)
s

If you want to access an element in the Series by its position in the current order of the Series, regardless of its index label, use the *iloc[position]* notation (0-based):

In [None]:
s.iloc[2]
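
To make the difference between label-based and position-based access concrete, here is a minimal sketch. The Series `prices` and its labels are made up for illustration and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical Series with string labels
prices = pd.Series([30, 10, 20], index=['x', 'y', 'z'])

# Sorting by value changes the order of the rows, but not their labels
by_value = prices.sort_values()  # order is now y, z, x

label_lookup = by_value.loc['x']    # label-based: the element labeled 'x'
position_lookup = by_value.iloc[0]  # position-based: the first element in the current order

print(label_lookup, position_lookup)
```

After sorting, `loc['x']` still finds the same element, while `iloc[0]` returns whichever element is currently first.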

#### Useful Series methods
Another useful operation for Series is [rank()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.rank.html). It ranks the values while keeping the order of the Series. In the example below we can see that the missing value (NaN) does not get ranked, and its rank shows as NaN as well. By default (`method='average'`), tied values all receive the same rank, which is the average of the positions they occupy, and the ranks that follow continue counting from there: the three 3s below occupy positions 3, 4 and 5, so each gets rank 4.0 and the next value gets rank 6.0. As a result, in a Series with 10 rankable elements where a single element is the highest, its rank will be 10.

In [None]:
ranked_s = pd.Series([1, 2, 3, 3, 3, np.nan, 4, 5])
ranked_s

In [None]:
ranked_s.rank()
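
The handling of ties can be changed with the `method` parameter of `rank()`. The sketch below compares the default with two other options, using the same values as `ranked_s`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 3, 3, np.nan, 4, 5])

average_rank = values.rank()              # default: ties get the average of their positions
min_rank = values.rank(method='min')      # ties get the lowest of their positions
dense_rank = values.rank(method='dense')  # like 'min', but with no gaps after ties

# Put the results side by side for comparison
comparison = pd.DataFrame({
    'value': values,
    'average': average_rank,
    'min': min_rank,
    'dense': dense_rank,
})
print(comparison)
```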

A quick and easy way to know how many non-NaN/null elements are present in a series is the [count()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.count.html) method. Let's compare it to simply computing the length of a Series:

In [None]:
print(f'The length of the Series "ranked_s" is {len(ranked_s)} but its count of non-null elements is {ranked_s.count()}')

We can also check how many elements per value are present in a series by using the [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) method. The result is a new Series where the indexes are the actual values found in the original Series, and the values are the counts of the respective value indicated by the index present. If you want to account for the null values as well, use the `dropna=False` parameter.

In [None]:
ranked_s.value_counts(dropna=False)

If you wish to fill in the values that are Null/NaN in a Series, use the [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html) method, passing as a parameter the value that the nulls should be replaced by:

In [None]:
ranked_s.fillna(10000)

Notice that this will return a new Series with the replaced values. If you want to apply this change to the original Series, use the `inplace=True` parameter as well.

Let's now take a look at some mathematical operations we can do with a single Series.
<br/><br/>A quick way to apply any operation to all elements in a Series is to use the [apply()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html) method. Its result is going to be a new Series where each value is the result of the defined operation applied to the respective element of the original Series. You can pass any function this way. As an example, let's take the square of each element in our Series.

In [None]:
def square(value):
    return value*value

z = s.apply(square)
z

If the method is simple enough, like the one shown above, you can also use lambda notation to make it even smaller and simpler:

In [None]:
z = s.apply(lambda value: value*value)
z

In order to find the sum of all values in a Series, use the method [sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sum.html). It is automatically going to ignore the null/non-numeric values, unless otherwise specified.

In [None]:
z = s.sum()
z

You can use similar methods to the sum() described above in order to extract other kinds of values from the Series, like:
- [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mean.html): Mean value (will work for numeric types only)
- [std()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.std.html): Standard deviation (will work for numeric types only)
- [min()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.min.html): Minimum value
- [max()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.max.html): Maximum value
- [median()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.median.html): Median value (will work for numeric types only)
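- [mode()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mode.html): Most frequent value or mode. If there are values tied for most frequent, all of them will be returned in a new Series.

These methods skip null values by default. To make `sum()`, `mean()`, `std()`, `min()`, `max()` or `median()` take null values into account (in which case the result is NaN whenever a null is present), pass `skipna=False` as a parameter. `mode()` uses `dropna=False` for the same purpose.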
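
As a small sketch of how null values affect these reductions, the example below uses a made-up Series `data`; none of the names come from the lesson.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, 2.0, np.nan, 4.0])

total_skipping_nulls = data.sum()               # NaN is skipped by default
total_with_nulls = data.sum(skipna=False)       # NaN propagates into the result
mean_skipping_nulls = data.mean()               # average of the 3 non-null values

# mode() takes dropna instead of skipna
mode_with_nulls = pd.Series([1.0, np.nan, np.nan]).mode(dropna=False)

print(total_skipping_nulls, total_with_nulls, mean_skipping_nulls)
print(mode_with_nulls)
```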

#### Operations between 2 Series
Common mathematical operations between 2 Series are also possible. Notice however, that these operations will use the indexes of both Series in order to match the elements. Let's create another Series to test some of these operations with our old Series *s*, and also remember what *s* is so it all makes a little more sense.

In [None]:
s

In [None]:
s2 = pd.Series([np.nan, 2, np.nan, 4], index=['a', 'b', 'c', 'd'], name="Example Series 2")
s2

Let's begin with the [add()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.add.html) operation, which will add the elements of the specified Series to the elements of the caller Series. In the example we're going to use the `fill_value` parameter to specify the value that stands in for a missing value (NaN) on one side of the operation, including labels that exist in only one of the two Series.

In [None]:
s.add(s2, fill_value=0)

Let's take a moment to understand the results above:
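- *a* gets resolved to "NaN" since *a* is not present in the "s" Series (it was dropped earlier) and its value in the "s2" Series was NaN. When the value is missing on both sides, "fill_value" is not applied and the result stays NaN.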
- *b* gets resolved to 5.0 since in the "s" Series its value was 3.0 and in the "s2" Series its value was 2.0 (2.0 + 3.0 = 5.0)
- *c* gets resolved to 5.0 since in the "s" Series its value was 5.0 and in the "s2" Series its value was NaN, which due to "fill_value=0" got replaced by 0 (5.0 + 0 = 5.0).
- *d* gets resolved to 4.0 since in the "s" Series its value was NaN which due to "fill_value=0" gets replaced by 0, and in the "s2" Series its value was 4.0 (0 + 4.0 = 4.0).
- *f* gets resolved to 8.0 since it only exists in the "s" Series, where its value was 8.0. It is then added to the "fill_value" of 0 (8.0 + 0 = 8.0).
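
The sketch below rebuilds two Series with the same labels and values as `s` and `s2` at this point of the lesson (here named `left` and `right`, which are illustrative names) and compares `add()` with `fill_value=0` against the plain `+` operator.

```python
import numpy as np
import pandas as pd

left = pd.Series([3, 5, np.nan, 8], index=['b', 'c', 'd', 'f'])
right = pd.Series([np.nan, 2, np.nan, 4], index=['a', 'b', 'c', 'd'])

# fill_value replaces a missing value only when the other side has a value
with_fill = left.add(right, fill_value=0)

# The + operator has no fill value: any missing side yields NaN
plain = left + right

print(pd.DataFrame({'with_fill': with_fill, 'plain': plain}))
```

With the plain operator only *b* survives, since it is the only label with a value on both sides.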

In a very similar fashion to the add() method described above, Series can also be subtracted, multiplied and divided by each other by using [sub()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sub.html), [mul()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.mul.html) and [div()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.div.html), respectively.

### *pandas* [DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Pandas DataFrames consist of three principal components: rows, columns and the data.

With a basic understanding of *pandas* Series, it becomes much easier to grasp the concept of DataFrames, which can be perceived as a Series of Series. For example, you can interpret each row in a *pandas* DataFrame as a Series whose index is the column labels of that DataFrame, or you could instead interpret each column in a *pandas* DataFrame as a Series whose index is the same index used for the rows in the DataFrame. It all depends on your needs for each situation.

DataFrames are usually the most used structures in *pandas*, since in most uses of the library people deal with lots of data, usually arranged into tables, which are easily represented inside a DataFrame.
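
The "Series of Series" view described above can be checked directly. This is a minimal sketch with a made-up DataFrame `people`; selecting a column with `[]` and a row with `loc[]` is covered in more detail later in this lesson.

```python
import pandas as pd

# Hypothetical data with custom row labels
people = pd.DataFrame(
    {'name': ['Ana', 'Bo'], 'age': [31, 25]},
    index=['r1', 'r2'],
)

age_column = people['age']    # a column: Series indexed by the row labels
first_row = people.loc['r1']  # a row: Series indexed by the column names

print(type(age_column).__name__, age_column.index.tolist())
print(type(first_row).__name__, first_row.index.tolist())
```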

#### DataFrame creation
There are several possible ways of creating a *pandas* DataFrame. Let's take a look into the most used and common methods:

From a CSV file - If you have a comma-separated values file (csv) in which the values are present in a tabular fashion, you can import it directly into your DataFrame by using the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) method. The only required parameter to this method is the path and name to the csv file you wish to read from. This way, the first row in the CSV file will by default name the columns, and from the second row onwards, the data will be imported as the values in the DataFrame. 

In [None]:
url = 'https://github.com/daitan-innovation/daitan-ml-course-resources/blob/main/sample.csv?raw=true'
df = pd.read_csv(url)
df

You can also write back to a CSV file using the `df.to_csv('file_name')` method:

In [None]:
df.to_csv('new.csv')

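Written this way, the index is exported to the CSV as well, so it comes back as an extra unnamed column (shown as `Unnamed: 0`) when the file is read again. That makes the file harder to work with if you do not intend to use the index values in any meaningful way. To avoid it, pass the parameter `index=False` so the index does not get written into the CSV file. The cells below first read the file back to show the extra column, then write it again with `index=False` and read it once more: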

In [None]:
local_df = pd.read_csv('new.csv')
local_df

In [None]:
df.to_csv('new.csv', index=False)

In [None]:
local_df = pd.read_csv('new.csv')
local_df
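
The same round trip can be tried without touching the file system. The sketch below is an illustration only: it relies on `to_csv()` returning the CSV text when no path is given, and on `io.StringIO` to read that text back. The DataFrame `table` is made up.

```python
import io
import pandas as pd

table = pd.DataFrame({'letter': ['A', 'B'], 'number': [1, 2]})

# With the index exported, reading back produces an extra unnamed column
with_index = pd.read_csv(io.StringIO(table.to_csv()))

# With index=False, only the real columns are written
without_index = pd.read_csv(io.StringIO(table.to_csv(index=False)))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```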

From a list - As DataFrames can be considered a two-dimensional data structure, we can also use a one- or two-dimensional list to create them. NumPy arrays can also be used in this same fashion.

In [None]:
df = pd.DataFrame([[1,'A'],[2,'Z'],[3,'B'],[4,'X']])
df

Notice that using a one-dimensional list will generate a DataFrame with only one column:

In [None]:
df = pd.DataFrame([5,4,3,2,1])
df

To create a DataFrame with a single row, use only one list inside another list:

In [None]:
df = pd.DataFrame([[2,4,6,8]])
df

#### DataFrame metadata

Let's now take a look into some information related to a DataFrame that is not the values contained in its rows.

[Column](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html) names:

Column names (or labels) are useful to identify each of the columns in your DataFrame. In the examples above, some DataFrames were created without specifying the column names, which then default to zero-based integer increments (0, 1, 2, ...). To name the columns when creating a DataFrame with the `DataFrame()` constructor, you can pass the `columns` parameter with a list of names for the columns:

In [None]:
df = pd.DataFrame([[2,4,6,8]], columns=['first', 'second', 'third', 'fourth'])
df
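
The `columns` parameter works the same way when the data comes from a two-dimensional NumPy array. A short sketch, with illustrative names and values:

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2], [3, 4], [5, 6]])

# Each inner row of the array becomes a row of the DataFrame
frame = pd.DataFrame(matrix, columns=['left', 'right'])

print(frame)
print(frame.columns.tolist())
```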

Column [data types](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html):

Each column in a DataFrame has a single data type (dtype). If there are several different data types within the same column, *pandas* will assign that column the most generic dtype that can hold all of its values (often `object`). To check which types are present in your DataFrame, you can use `df.dtypes`:

In [None]:
df = pd.read_csv('new.csv')
df

In [None]:
df.dtypes

Above we can check that the values from `sample_column_b` were classified as "int64", which is the default choice for integer values. If we know that a 32-bit integer is enough for the whole column, we can force the DataFrame to use that instead. To "force" a dtype interpretation when reading a DataFrame from a CSV file, we can use the `dtype` parameter and pass a dictionary like this:

In [None]:
df = pd.read_csv('new.csv', dtype={'sample_column_b': np.int32})
df

In [None]:
df.dtypes
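
The sketch below shows both ideas on in-memory data: forcing a dtype with the `dtype` parameter, and a column with mixed values falling back to the generic `object` dtype. The CSV text and the column names are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

csv_text = "label,count\nA,1\nB,2\n"

default_types = pd.read_csv(io.StringIO(csv_text))
forced_types = pd.read_csv(io.StringIO(csv_text), dtype={'count': np.int32})

# A column mixing integers, strings and floats can only be stored as generic objects
mixed = pd.DataFrame({'mixed': [1, 'two', 3.0]})

print(default_types['count'].dtype, forced_types['count'].dtype, mixed['mixed'].dtype)
```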

DataFrame [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html):

The shape of a DataFrame simply tells its dimensions; i.e. how many rows and columns it is currently made of (excluding the index). The return is a tuple that says: `(number_of_rows, number_of_columns)`

In [None]:
df.shape

DataFrame [index](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html):

Similarly to a Series, DataFrames also contain indexes. In the examples we saw so far, they appear as the left-most column, which does not have a name on top, and is by default a zero-based incremental integer (0, 1, 2, ...). In the example of the DataFrame read from the CSV file seen above, we can see it as the **bold 0, 1, 2, 3, 4** numbers on the left-hand side, with one number for each row. Indexes in a DataFrame serve the purpose of identifying the rows. If wanted, just like in a Series, an index can be specified to be something different from the default one. Let's see an example where we use `sample_column_a` as the index for the DataFrame read from the CSV:

In [None]:
custom_index_df = pd.read_csv('new.csv', index_col='sample_column_a')
custom_index_df

Now we can see that, as we have already specified a custom index based on another column (using the parameter `index_col='column_name'`) the default index has not been generated, and instead we are using the values of that specified column as the index for this new DataFrame (which also show up in **bold**). We can also notice that the name of the column `sample_column_a` appears on top of the column, instead of having an unnamed index, like in previous examples, and is slightly below the column names that are not indexes.

#### DataFrame data selection and access

If desired, we can select only a subset of the columns of the DataFrame by using this notation: `df[['wanted', 'columns']]`:

In [None]:
df = pd.read_csv('new.csv')
df

In [None]:
df[['sample_column_b']]

In order to check the first and last rows in a DataFrame, we can use the [head(n)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and [tail(n)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) methods, where n is the number of rows to be checked from the top or from the bottom (If no number is passed, the default is going to be 5):

In [None]:
df.head(3)

In [None]:
df.tail(3)

To select a row based on its **index**, we can use the [`df.loc[index]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) notation, which will yield all rows that have that **index**.
- If only one row matches the index, it will be returned as a Series whose index is the names of the columns in the original DataFrame.
- If more than one row matches the index, a sub-DataFrame will be returned, with all the rows that apply.
- If no rows satisfy the index, then a KeyError will be thrown.

In [None]:
df.loc[1]

In [None]:
custom_index_df = pd.read_csv('new.csv', index_col='sample_column_a')
custom_index_df

In [None]:
custom_index_df.loc['A']

We can also access the exact value in a row (identified by its index) and column (identified by its name) using this notation:
<br/>`df.loc['index_value', 'column_name']`
<br/>The result will be the single value found.

In [None]:
df.loc[1, 'sample_column_a']
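
The different outcomes of `loc` described above can be seen together in one small sketch. The DataFrame `frame` and its labels are made up, with the label 'A' repeated on purpose.

```python
import pandas as pd

frame = pd.DataFrame({'value': [10, 20, 30]}, index=['A', 'B', 'A'])

single = frame.loc['B']         # one matching row: returned as a Series
several = frame.loc['A']        # two matching rows: returned as a DataFrame
cell = frame.loc['B', 'value']  # row label plus column name: a single value

# A label that does not exist raises KeyError
try:
    frame.loc['Z']
    missing_raised = False
except KeyError:
    missing_raised = True

print(type(single).__name__, type(several).__name__, cell, missing_raised)
```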

To select a row purely based on its position in the DataFrame, respecting its current order, we can use the [`df.iloc[position]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) notation (similarly to what we did previously in the Series section). As before, this is also zero-based.
- The result for a single position specified will be a Series, with the index being the names of the DataFrame's columns.
- The result for multiple positions specified will be a DataFrame containing the appropriate rows.
- If any position cannot be found in the DataFrame, an out-of-bounds IndexError will be thrown.

In [None]:
df.iloc[4]

In [None]:
df.iloc[[0, 2]]

In order to select data from your DataFrame based on the values of the columns, we can use this notation: `df[df['column_name'] == 'value']`. Notice that the `==` sign can be swapped for any other valid comparison operator. The result in this case will always be a sub-DataFrame, even if no matching rows were found.

Some examples:

In [None]:
df[df['sample_column_a'] == 'A']

In [None]:
df[df['sample_column_b'] == 3]

In [None]:
df[df['sample_column_b'] == 0]
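
It may help to see that the inner expression is itself a Series of booleans (a mask), and that masks can be combined with `&` (and) or `|` (or), wrapping each comparison in parentheses. A minimal sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'B', 'C', 'A'], 'number': [1, 2, 3, 4]})

mask = frame['number'] > 1   # boolean Series, one entry per row
filtered = frame[mask]       # keeps only the rows where the mask is True

# Combining two conditions: parentheses are required around each comparison
combined = frame[(frame['letter'] == 'A') & (frame['number'] > 1)]

# No row matches: the result is still a DataFrame, just an empty one
empty = frame[frame['number'] == 0]

print(mask.tolist())
print(combined)
```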

Let's now take a look at different ways in which we can iterate through the data in our DataFrames.

One of the most common and straightforward ways to iterate through a DataFrame is to iterate through its rows. We can use a method called [`iterrows()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iterrows.html) to achieve that. In each iteration, we receive a tuple with 2 values:
- The first value (in the position 0) is the index of that row in the original DataFrame.
- The second value (in the position 1) is a Series where the indexes are the column names and the values are the values themselves for that row and columns. The Series's name is going to be the same as the respective row's index.

Let's take a look into a single iteration over the rows in the `df` DataFrame:

In [None]:
for item in df.iterrows():
    print('The whole item:', item, '\n')
    print('Only the index:', item[0], '\n')
    print('Only the data:', item[1], '\n')
    break
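
Since each item is a 2-tuple, it is common to unpack it directly in the `for` statement. The sketch below does that on a made-up DataFrame; the names `frame`, `index`, `row` and `seen` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'B'], 'number': [1, 2]})

seen = []
for index, row in frame.iterrows():
    # row is a Series indexed by the column names
    seen.append((index, row['letter'], row['number']))

print(seen)
```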

Another useful way of iterating through a DataFrame is to iterate through its columns. We can use a method called [`items()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.items.html) to do that (older tutorials use `iteritems()`, which was deprecated in pandas 1.5 and removed in pandas 2.0). In each iteration, we also receive a tuple with 2 values:
- The first value (in the position 0) is the name of the respective column in the original DataFrame.
- The second value (in the position 1) is a Series where the indexes are the same indexes as in the original DataFrame, and its values are the respective values from that column and indexes. The Series' name is going to be the same as the respective column's name.

Let's take a look into a single iteration over the columns in the `df` DataFrame:

In [None]:
for item in df.items():
    print('The whole item:', item, '\n')
    print('Only the column name:', item[0], '\n')
    print('Only the data:', item[1], '\n')
    break

#### Useful DataFrame methods

Let's take a look now into some methods we can use to summarize and manipulate data within a single DataFrame.

Statistical summarization of numerical data:

Firstly, the [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method provides a quick statistical summary of all numeric data present in your DataFrame, separated by column. The result is a new dataframe in which each column represents a column in your DataFrame that contains a numeric dtype. Each row then accounts for a different statistic measure. Take a look below:

In [None]:
df.describe()
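
Because the result of `describe()` is itself a DataFrame, individual statistics can be read from it with `loc`. A small sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'B', 'C'], 'number': [1, 2, 3]})

summary = frame.describe()  # only the numeric column is summarized

mean_number = summary.loc['mean', 'number']
row_count = summary.loc['count', 'number']

print(summary.index.tolist())
print(mean_number, row_count)
```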

Removing duplicates:

If you need to remove duplicated rows from your DataFrame, you can use the [`drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) method. By default, it is going to drop only the rows that are completely identical (i.e. the values from every column are the same, excluding the index). Let's see some examples:

In [None]:
df.drop_duplicates()

In the example above, no rows were dropped because all rows are different if we consider the values from all columns. Let's now consider only the `sample_column_a` by using the parameter `subset=['columns_to', 'be_considered']` so that rows 0 and 4 can be considered duplicates and remove one of them:

In [None]:
df.drop_duplicates(subset=['sample_column_a'])

We can also check that by default only the first row was kept. This can be tweaked by providing the `keep` parameter (`'first'`, `'last'` or `False`), with the following results:
- **keep='first'**: keeps only the first duplicate (it is the default keep value).
- **keep='last'**: keeps only the last duplicate.
- **keep=False**: removes all duplicates.

It is also important to notice that this method will by default return a new DataFrame with the changes applied. In order to actually remove the duplicates from the original DataFrame, use the `inplace=True` parameter.
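
The three `keep` options can be compared side by side. The sketch below uses a made-up DataFrame in which rows 0 and 2 share the same `letter`.

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'B', 'A'], 'number': [1, 2, 3]})

keep_first = frame.drop_duplicates(subset=['letter'])               # default keep='first'
keep_last = frame.drop_duplicates(subset=['letter'], keep='last')
keep_none = frame.drop_duplicates(subset=['letter'], keep=False)

print(keep_first.index.tolist(), keep_last.index.tolist(), keep_none.index.tolist())
print(len(frame))  # the original DataFrame is unchanged
```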

Filling nulls:

Just like in a Series, we can also directly fill all null values in a DataFrame by using the [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method. Once again, just like in the `drop_duplicates()` method described above, a new DataFrame will be returned. If you wish to apply this change to the same DataFrame, pass `inplace=True` as a parameter.

In [None]:
test_nan_df = pd.DataFrame([[0, np.nan], [np.nan, 2]])
test_nan_df

In [None]:
test_nan_df.fillna(10)

Sorting data:

Sorting data in a DataFrame can be useful in a lot of different situations, especially because the default display methods of Jupyter, as well as the `head()` and `tail()` only show the first or last few rows of your DataFrame. There are 2 main methods for sorting your DataFrame:
- [`sort_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html)
- [`sort_values()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html)

By using `sort_index()`, you do not specify which column is used in the sort (as it is already specified to be the index). You can however use some parameters like `ascending=False` if you want to sort from the highest index to the lowest (by default the opposite will occur), as well as `axis=1` if you want to sort the columns instead of the rows.

Let's check some examples:

In [None]:
df.sort_index(ascending=False)

In [None]:
df.sort_index(axis=1, ascending=False)

By using `sort_values()`, you need to specify the parameter `by=['columns_to', 'be_considered']`. The order of this list matters: rows are sorted by the first column, and each following column is only used to break ties among rows that have equal values in the columns before it. Let's take a look into some examples:

In [None]:
df.sort_values(by=['sample_column_b'], ascending=False)

You can also specify a list of sorting orders in the `ascending=` parameter. It should signal what is the sorting order for each of the columns specified in the `by=` parameter:

In [None]:
df.sort_values(by=['sample_column_a', 'sample_column_b'], ascending=[False, True])
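
To see the tie-breaking at work, the sketch below sorts a made-up DataFrame by `letter` in descending order and, within each letter, by `number` in ascending order.

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['B', 'A', 'B', 'A'], 'number': [2, 4, 1, 3]})

ordered = frame.sort_values(by=['letter', 'number'], ascending=[False, True])

# The original index labels travel with their rows
print(ordered)
```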

Finally, notice that just like in the `drop_duplicates()` method, the sorting methods will also return a new DataFrame. If you wish to apply the sorting to the same DataFrame that you are using, pass the parameter `inplace=True`.

Grouping data:

Sometimes you will need to aggregate or group data into buckets of similar features. *pandas* DataFrames provide you with the [`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method, which works very similarly to `GROUP BY` statements in SQL, for example. You will need to specify which columns should be used to identify the different groups by passing the `by=['columns_to', 'be_considered']` parameter. The direct result of the `groupby()` method is a collection of groups, each of which can be interpreted as a tuple:
- The first item in the tuple (position 0) is the identifier of the group, which is a single value if you passed a single column name (`by='single_column'`), or a tuple if you passed a list of columns (`by=['multiple', 'columns']`; in pandas 2.0 and later, a one-element list also yields a one-element tuple).
- The second and last item in the tuple (position 1) is a sub-DataFrame which contains all the rows that are part of that group.

Let's take a quick look into the first group generated by doing a `df.groupby()` and grouping by the column `sample_column_a`.

In [None]:
for item in df.groupby(by=['sample_column_a']):
    sub_df = item[1]
    break
sub_df

Usually, you will want to chain some aggregation function into the result of the groupby, so that it gets applied to all groups at the same time, and the result is a new DataFrame, which will contain one value per group found.

Let's see an example of that:

In [None]:
df

In [None]:
df.groupby(by=['sample_column_a']).min()

In [None]:
df.groupby(by=['sample_column_a']).max()
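
The sketch below shows both uses of `groupby()` on made-up data: iterating over the `(identifier, sub-DataFrame)` tuples, and chaining aggregations. Selecting a column with `['number']` before aggregating and calling `agg()` with a list of function names are additions not shown elsewhere in this lesson.

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'B', 'A', 'B'], 'number': [1, 2, 3, 4]})

# Iterating: each item is (group identifier, sub-DataFrame)
group_sizes = {key: len(group) for key, group in frame.groupby('letter')}

# Aggregating: one value per group
totals = frame.groupby('letter')['number'].sum()

# Several aggregations at once, one column per function
summary = frame.groupby('letter')['number'].agg(['min', 'max'])

print(group_sizes)
print(totals)
print(summary)
```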

Applying functions to the whole DataFrame:

Similarly to what we saw before in the Series section, it is also possible to apply a function to the whole DataFrame, by using the [`apply()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html) method. Just like before, we can either define a function explicitly and pass it as the parameter for `apply()`, or quickly define a function inside the signature itself, using lambda notation.
However, in the case of the DataFrame version of this method, there is an important catch. We can pass the parameter `axis=0` (which already is the default value) or `axis=1`. Here's what it means:
- `axis=0`: The function is applied once per column, and each call receives one of the columns as its argument (in the format of a Series). This is the default value of `axis` if not otherwise specified.
- `axis=1`: The function is applied once per row, and each call receives one of the rows as its argument (in the format of a Series).

Let's check some examples for both:

In [None]:
df

In [None]:
df.apply(lambda column: column.min())

In the example above, what happened was that the method `min()` was used in each of the 2 Series (one for each column, as `axis` was not specified and thus was 0). This resulted in us getting back the minimum of the `sample_column_a`, which was "A", and the minimum of the `sample_column_b`, which was 1. Notice that the resulting Series had the column names as its index, and the resulting value of the operation as its values.

In [None]:
df.apply(lambda row: f"{row.loc['sample_column_a']}->{row.loc['sample_column_b']*2}", axis=1)

In the example above, we created a custom string for each of the original rows (`axis=1`), by using some selecting operators into each Series, and then multiplied the numeric value, found in `sample_column_b` by 2. Notice that the resulting Series has the original DataFrame's index as its own index and the result of the applied lambda function as its values.
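
A purely numeric sketch can make the two axes easier to compare. The DataFrame `frame` and its column names are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'width': [1, 2, 3], 'height': [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series
column_ranges = frame.apply(lambda column: column.max() - column.min())

# axis=1: the function receives each row as a Series
row_areas = frame.apply(lambda row: row['width'] * row['height'], axis=1)

print(column_ranges)  # indexed by column name
print(row_areas)      # indexed by the original row index
```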

Is it possible, then, to apply a single function to all elements of the DataFrame?

Yes. In order to do that, we can use the [`applymap()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.applymap.html) method. It will apply the same function to every single element in the DataFrame, instead of applying it to each row or column. Note that since pandas 2.1 this method is also available as `DataFrame.map()`, and `applymap()` is deprecated in favor of it. Let's take a look into an example:

In [None]:
df.applymap(lambda element: len(str(element)))

In the example above, we just applied a simple verification of how long the string of the element is, which ended up keeping the original shape of the DataFrame, as well as the indexes and column names.
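
Since pandas 2.1, the element-wise method is named `DataFrame.map()` and `applymap()` is deprecated. The sketch below is one way to write code that runs on either version, under the assumption that the DataFrame has no column named `map`; the helper name `elementwise` is made up.

```python
import pandas as pd

frame = pd.DataFrame({'letter': ['A', 'BB'], 'number': [1, 100]})

# Use DataFrame.map when it exists (pandas >= 2.1), else fall back to applymap
elementwise = getattr(frame, 'map', None) or frame.applymap

lengths = elementwise(lambda element: len(str(element)))
print(lengths)
```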

#### Operations between 2 DataFrames

Let's now take a look into several different methods and operations that use 2 DataFrames.

Merge and join:

The methods [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) and [`join()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) are very similar to each other in what they accomplish, but differ in the flexibility of how to do it. Both can be very similar to a common database join, where we get rows from two different tables and extend the columns based on some join key. However, here is the difference between these two methods in *pandas*:
- With `merge()`, you can choose any column or set of columns that should match from both tables in order to perform the operation.
- With `join()`, the index is automatically used as the key for the operation.

Let's take a look into some examples:

Merge example:

In [None]:
df

In [None]:
df_to_merge = pd.DataFrame([['A', 'aa'], ['C', 'cc'], ['D', 'dd']], columns=['sample_column_a', 'extra_column'])
df_to_merge

In [None]:
df.merge(df_to_merge, on=['sample_column_a'])

In the example above we can notice that the `merge()` method used the following notation: 
<br/>`left_df.merge(right_df, on=['merge', 'columns'])`
<br/>There are also some other common parameters that we could add in order to make this merge more like what we need in different cases:
- `how='left'/'right'/'outer'/'inner'/'cross'`: This chooses what the [merge method](https://www.w3schools.com/sql/sql_join.asp) should be. Default is `inner`.
- `right_on=['right_df', 'merge_columns']`: If the column names of the columns that should be merged do not match across both DataFrames, you can instead use this field to signal which columns from the right DataFrame should be used in the merge.
- `left_on=['left_df', 'merge_columns']`: Same as the one described above, but signals the columns used in the left DataFrame.
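
The parameters listed above can be tried on a small pair of made-up DataFrames whose key columns have different names (`key` on the left, `code` on the right):

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C'], 'left_value': [1, 2, 3]})
right = pd.DataFrame({'code': ['A', 'C', 'D'], 'right_value': ['aa', 'cc', 'dd']})

# Default how='inner': only keys present on both sides
inner = left.merge(right, left_on='key', right_on='code')

# how='left': every row of the left DataFrame, nulls where there is no match
left_join = left.merge(right, left_on='key', right_on='code', how='left')

# how='outer': every key from either side
outer = left.merge(right, left_on='key', right_on='code', how='outer')

print(inner)
print(left_join)
print(outer)
```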

Join example:

In [None]:
df

In [None]:
df2 = pd.DataFrame([300, 400, 500], columns=['new_column'])
df2

In [None]:
df.join(df2)

In the example above we can notice that the `join()` method used the following notation: 
<br/>`left_df.join(right_df)`
<br/>As this method is less flexible than the one described before, there are fewer custom parameters that can be passed, and in this course we will focus only on `how`:
- `how='left'/'right'/'outer'/'inner'`: This chooses what the [merge method](https://www.w3schools.com/sql/sql_join.asp) should be. Differently from the merge method, the default here is `left`.
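
The effect of `how` on `join()` can be seen with a right DataFrame that only has some of the index values of the left one. The data below is made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'letter': ['A', 'B', 'C']})               # index 0, 1, 2
right = pd.DataFrame({'new_column': [300, 400]}, index=[0, 2])  # no row for index 1

default_join = left.join(right)             # how='left': keeps every left row
inner_join = left.join(right, how='inner')  # keeps only the shared index values

print(default_join)
print(inner_join)
```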

Append:

Older versions of *pandas* offered a `DataFrame.append()` method for this; it was deprecated in pandas 1.4 and removed in pandas 2.0, so we use [`pd.concat()`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) instead. Passing a list of DataFrames to `pd.concat()` appends the rows of each DataFrame after the rows of the previous one. Columns that are present in only one of the DataFrames will be added as well, with null values for the rows that come from the DataFrame that does not contain the column. Passing the parameter `ignore_index=True` will regenerate the index using zero-based incremental integers; otherwise the indices from both DataFrames will be kept.

In [None]:
df

In [None]:
df2 = pd.DataFrame([[300, 'z'], [400, 'x'],  [500, 'y']], columns=['sample_column_a', 'new_column'])
df2

In [None]:
pd.concat([df, df2], ignore_index=True)

### Thank you!

This concludes the basics on *pandas*. We hope that this helps as an introduction to the subject and that the next steps into data science with Python become easier for you to grasp with this knowledge. If you want to further your learning into this subject, please check the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html), a [quick introduction guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) and some [common uses of the tool](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook).

We wish you all happy learning and good luck!