# Table of Contents

1. [Pandas Imports](#intro)
2. [Introduction to Pandas Objects](#objects)
3. [Introduction to Viewing Data](#view)
4. [Introduction to Selecting Columns](#select)
5. [Introduction to Data Aggregations](#aggregate)
6. [Introduction to Merging Data Together](#merge)
7. [Introduction to Grouping Data](#group)
8. [Introduction to Reshaping Data](#reshape)
9. [Additional Resources and Tutorials](#references)

<hr />

# 10 Minutes to Pandas <a class="anchor" id="intro"></a>

This is a short introduction to pandas, geared mainly for new users. Essentially, I used [this walkthrough](https://pandas.pydata.org/docs/user_guide/10min.html) from the official Pandas site, but made a few additional comments.

Customarily, we import the following packages. We'll obviously want the *pandas* package, but we'll also want the *numpy* package. Python packages are often used together, and within our pandas code we'll frequently call functions from numpy. So, we'll import it now for future reference.

In [1]:
import numpy as np
import pandas as pd

<hr />

# Object Creation <a class="anchor" id="objects"></a>

The most important object in the Pandas package is the `DataFrame`. A DataFrame is a 2-dimensional, labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. More technically, it is a collection of `Series` objects that share the same index. It is also the most commonly used pandas object.

At a high-level, we can think of a Series as an individual column of a DataFrame, or we can think of it as an ordinary [Python list](https://www.w3schools.com/python/python_lists.asp) with labels attached. Behind the scenes, the values of *this ordinary Python list* are stored in a Series object.

More technically, a Series is a 1-dimensional, labeled array capable of holding any Python data type (integers, strings, floating point numbers, Python objects, etc.). For a more detailed explanation of the data structures offered in Pandas, please refer to [this introduction](https://pandas.pydata.org/docs/user_guide/dsintro.html) to Pandas data structures.

### Example of a `Series` Object

For now, let's just take a look at an example of a Series object:

In [7]:
s = pd.Series([1, 3, 5, 6, 8])
s

0    1
1    3
2    5
3    6
4    8
dtype: int64

Notice, a `Series` object just uses an ordinary Python list `[1, 3, 5, 6, 8]` as its input. So, if it helps, we can just think of it as an ordinary Python list with some cool, extra Pandas functions wrapped around it.
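
To make the "list with extras" idea a bit more concrete, here is a small sketch comparing a plain list with a Series built from it. The labels `'a'` through `'e'` are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Same values as the Series example, stored in a plain Python list.
values = [1, 3, 5, 6, 8]
s = pd.Series(values)

# Multiplying a list repeats it; multiplying a Series works element-wise.
repeated = values * 2
doubled = s * 2

# A Series also carries labels. These labels are hypothetical.
labeled = pd.Series(values, index=['a', 'b', 'c', 'd', 'e'])

print(repeated)          # [1, 3, 5, 6, 8, 1, 3, 5, 6, 8]
print(doubled.tolist())  # [2, 6, 10, 12, 16]
print(labeled['c'])      # 5
```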

### Example of a `DataFrame` Object

The most common way of creating a DataFrame is by passing a [Python dict](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) containing the column names (i.e. `name` and `age`) and column values. Notice, these column values are represented as their own Python lists. Behind the scenes, Pandas will receive this `dict` and convert all of the list values to Series objects.

For this example, we'll specify a DataFrame consisting of three individuals, their respective ages, and their genders. The individuals in our DataFrame are Anna, Rick, and May. Their respective ages are 20, 21, and 22.

In [50]:
df = pd.DataFrame({'name': ['Anna', 'Rick', 'May'], 'age': [20, 21, 22], 'gender': ['female', 'male', 'female']})
df

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 20 | female |
| 1 | Rick | 21 | male |
| 2 | May | 22 | female |
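
As a quick check of the claim that the list values become Series objects, the sketch below rebuilds the same DataFrame and inspects one of its columns. It only uses the data from the example above.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Each column pulled out of the DataFrame is a Series.
age_column = df['age']
print(type(age_column))

# The DataFrame has three rows and three columns.
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['name', 'age', 'gender']
```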


<hr />

# Viewing Data <a class="anchor" id="view"></a>

Now that we've created our own DataFrame, we'll discuss a few essential methods that can be called on it. This is only a brief introduction, so I'll only list a few here. If you're interested in learning about more functions used for viewing data, please refer to [this introduction](https://pandas.pydata.org/docs/user_guide/basics.html) on the official Pandas website.

### Catching Glimpses of Data

Here is how to view the top and bottom rows of the frame:

In [51]:
df.head()

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 20 | female |
| 1 | Rick | 21 | male |
| 2 | May | 22 | female |


In [52]:
df.tail(3)

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 20 | female |
| 1 | Rick | 21 | male |
| 2 | May | 22 | female |


Notice, the default number of rows to display is five. So, if we don't specify any number within the parentheses, `head` and `tail` will return five rows (or every row, if the DataFrame has fewer than five, as ours does). But we can also pass a custom number within these parentheses, and only that many rows will be returned. For a more detailed explanation about this function and the parameters we can pass to it, refer to the [function documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html).
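
Since our DataFrame only has three rows, the default of five is hard to see. The sketch below uses a hypothetical ten-row DataFrame called `numbers` (not part of the original walkthrough) to show the default and a custom row count.

```python
import pandas as pd

# A hypothetical ten-row frame, just to make the default visible.
numbers = pd.DataFrame({'n': range(10)})

default_head = numbers.head()   # no argument: first five rows
first_two = numbers.head(2)     # first two rows
last_three = numbers.tail(3)    # last three rows

print(len(default_head))         # 5
print(first_two['n'].tolist())   # [0, 1]
print(last_three['n'].tolist())  # [7, 8, 9]
```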

### Describing Numerical Data

Moving on, we can also describe the numerical columns in our data set. Specifically, calling the `describe` function will return a set of descriptive statistics about our numerical columns.

These descriptive statistics include information about the central tendency, dispersion, and shape of the distribution of each numerical column. Specifically, it will return the following information:
- `count:` The number of non-missing values in the column.
- `mean:` The average value in the column.
- `std:` The standard deviation of values in the column.
- `min:` The minimum value in the column.
- `25%:` The 25th percentile.
- `50%:` The 50th percentile (the median).
- `75%:` The 75th percentile.
- `max:` The maximum value in the column.

In [53]:
df.describe()

|  | age |
|---|---|
| count | 3.0 |
| mean | 21.0 |
| std | 1.0 |
| min | 20.0 |
| 25% | 20.5 |
| 50% | 21.0 |
| 75% | 21.5 |
| max | 22.0 |


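This function can analyze both numeric and object (string) Series, as well as DataFrames with mixed data types, and the output varies depending on what is provided. By default, calling `describe` on a DataFrame with mixed data types summarizes only the numeric columns, which is why the string columns `name` and `gender` don't appear above. To include non-numeric columns, we can pass `include='all'`.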
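
As a rough illustration of how the output changes, the sketch below calls `describe` with `include='all'` on the same data. String columns get `unique`, `top`, and `freq` rows, while numeric-only statistics such as `mean` are left as `NaN` for them.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Summarize every column, not only the numeric ones.
summary = df.describe(include='all')

print(summary.loc['unique', 'name'])   # 3 distinct names
print(summary.loc['top', 'gender'])    # female (most frequent value)
print(summary.loc['freq', 'gender'])   # 2 (how often it appears)
print(summary.loc['mean', 'age'])      # 21.0
```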

### Sorting Data

To use one final function as an example, we'll now sort the data set by a specific column. To do this, we'll call the `sort_values` function, which sorts the values along a given column.

In [54]:
df.sort_values(by='name')

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 20 | female |
| 2 | May | 22 | female |
| 1 | Rick | 21 | male |
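
Sorting can also run in descending order or use several columns at once. The following is a minimal sketch using the `ascending` parameter of `sort_values`. Note that `sort_values` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Oldest first.
by_age_desc = df.sort_values(by='age', ascending=False)

# Sort by gender (A to Z), then by age (oldest first) within each gender.
by_gender_then_age = df.sort_values(by=['gender', 'age'], ascending=[True, False])

print(by_age_desc['name'].tolist())         # ['May', 'Rick', 'Anna']
print(by_gender_then_age['name'].tolist())  # ['May', 'Anna', 'Rick']
print(df['name'].tolist())                  # ['Anna', 'Rick', 'May'] (unchanged)
```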


<hr />

# Selecting Data <a class="anchor" id="select"></a>

In this section, we will focus on columns and subsets of data from a DataFrame. Namely, we'll focus on how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame objects, as they have received more development attention in this area.

For a DataFrame, we can use the bracket syntax `[]` for indexing and selecting data. Specifically, we'll need to specify a column name within our DataFrame `['col_name']` in order to select that particular column. Keep in mind, we can also select multiple columns by passing a list of column names into the brackets. For a more detailed explanation of indexing and selecting data, refer to [this introduction](https://pandas.pydata.org/docs/user_guide/indexing.html) on the official Pandas website.

### Selecting a Single Column

For this example, we'll select the names of all three individuals in our data set. For now, this selection will only include a single column, which yields a Series object:

In [55]:
df['name']

0    Anna
1    Rick
2     May
Name: name, dtype: object
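
As mentioned earlier, passing a list of column names into the brackets selects multiple columns. A short sketch of that, which yields a DataFrame rather than a Series:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Note the double brackets: the inner pair is a Python list of column names.
name_and_age = df[['name', 'age']]

print(type(name_and_age))          # a DataFrame, not a Series
print(list(name_and_age.columns))  # ['name', 'age']
```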

### Selecting a Single Row

For this example, we'll select a single row from the DataFrame using the `iloc` indexer, which selects by integer position. For now, the selection will only include a single row, but we can also select multiple rows and multiple columns at a time. Similar to selecting a single column, selecting a single row will yield a Series object.

In [56]:
df.iloc[2]

name         May
age           22
gender    female
Name: 2, dtype: object
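
The sketch below shows a few other position-based selections with `iloc`: a slice of rows, a list of rows combined with a list of columns, and a single cell.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Rows at positions 0 and 1 (the end of the slice is excluded).
first_two_rows = df.iloc[0:2]

# Rows 0 and 2, columns 0 and 1 (name and age).
subset = df.iloc[[0, 2], [0, 1]]

# A single cell: row 1, column 0.
cell = df.iloc[1, 0]

print(first_two_rows['name'].tolist())  # ['Anna', 'Rick']
print(subset['name'].tolist())          # ['Anna', 'May']
print(cell)                             # Rick
```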

### Conditionally Selecting Rows

In most cases, we'll want to filter certain rows on a given condition or rule. In Pandas terminology, this is called *boolean indexing*. To do this, we just need to insert a condition that returns boolean values for each row.

Using boolean vectors to filter data is a very common operation. Similar to SQL syntax, we can perform an *or* condition by using the `|` operator, and we can use the `&` operator to perform an *and* condition. Also, we can use the `~` operator for a *not* condition. Please refer to [this reference](https://pandas.pydata.org/docs/user_guide/indexing.html#boolean-indexing) for a more detailed explanation of boolean indexing and the possible set of operators used in boolean indexing.

For this example, let's select the individuals who are at most 21 years old:

In [57]:
df[df['age'] <= 21]

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 20 | female |
| 1 | Rick | 21 | male |
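
The `&`, `|`, and `~` operators can be combined with several conditions, as sketched below. Each condition is wrapped in parentheses because these operators bind more tightly than comparisons like `==` and `>=`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# and: female AND at least 21 years old
female_21_plus = df[(df['gender'] == 'female') & (df['age'] >= 21)]

# or: younger than 21 OR male
young_or_male = df[(df['age'] < 21) | (df['gender'] == 'male')]

# not: everyone who is NOT female
not_female = df[~(df['gender'] == 'female')]

print(female_21_plus['name'].tolist())  # ['May']
print(young_or_male['name'].tolist())   # ['Anna', 'Rick']
print(not_female['name'].tolist())      # ['Rick']
```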


### Setting Data Values

Selecting columns and filtering rows is great, but sometimes we'll actually want to assign values to these filtered rows and selected columns. To do this, we'll use the `loc` indexer.

We've already used `iloc` to select individual rows based on their position. Now, we can use `loc` to filter on certain rows based on a condition, while also selecting a particular column. For more information about other ways to assign data values to selected data, refer to [this section](https://pandas.pydata.org/docs/user_guide/10min.html#setting) on the official Pandas website.

For this example, we'll filter on the individuals who are named Anna. Then, we'll change Anna's age from 20 to 21:

In [58]:
df.loc[df['name'] == 'Anna', 'age'] = 21
df

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 21 | female |
| 1 | Rick | 21 | male |
| 2 | May | 22 | female |
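
The same pattern works when the condition matches several rows. In the sketch below, the boolean mask is stored in a variable first, and the assignment is made on a copy (named `older` here for illustration) so the original data stays intact.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [20, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Work on a copy so the original frame is not modified.
older = df.copy()

# The condition is just a boolean Series, one value per row.
is_female = older['gender'] == 'female'

# Add one year to the age of every row where the mask is True.
older.loc[is_female, 'age'] = older.loc[is_female, 'age'] + 1

print(is_female.tolist())     # [True, False, True]
print(older['age'].tolist())  # [21, 21, 23]
print(df['age'].tolist())     # [20, 21, 22]
```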


<hr />

# Aggregations <a class="anchor" id="aggregate"></a>

There are hundreds of operations we can call on our DataFrame and its columns. We'll only walk through a few of them, including statistical methods and custom functions.

In particular, these methods include functions like the following:
- `mean():` Takes the average of the numerical columns in the data set
- `apply():` Applies a custom function to the columns in a data set
- `value_counts():` Counts how many times each unique value occurs in a data set

Again, there are hundreds of other pandas functions, including functions for adding columns together, substituting values within a DataFrame, filling in missing values within a DataFrame, etc. For a more comprehensive list of aggregations and operations we can call on our DataFrame, refer to [this introduction](https://pandas.pydata.org/docs/user_guide/basics.html#basics-binop).

### Averaging Numerical Data

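In this example, we'll compute the average of the numerical columns in the data set. Since our DataFrame also contains string columns, we pass `numeric_only=True`; recent versions of pandas raise an error if `mean` is asked to average string columns:

In [59]:
df.mean(numeric_only=True)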

age    21.333333
dtype: float64
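
Other statistical methods work the same way, and they can also be called on a single column. A brief sketch, using the DataFrame as it stands after Anna's age was changed to 21:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Calling the method on one column returns a single number.
average_age = df['age'].mean()
median_age = df['age'].median()
oldest = df['age'].max()

# Calling it on the whole frame returns one value per numeric column.
column_means = df.mean(numeric_only=True)

print(round(average_age, 6))  # 21.333333
print(median_age)             # 21.0
print(oldest)                 # 22
```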

### Applying Custom Functions

In this example, we'll compute a cumulative sum of the columns in the data set. Notice, the cumulative sum is applied to both numerical and character columns; for character columns, "summing" means concatenating the strings. For this example, we used a function taken from the numpy package, called `cumsum`. However, we can use any function in place of it, including custom functions we write ourselves.

In [60]:
df.apply(np.cumsum)

|  | name | age | gender |
|---|---|---|---|
| 0 | Anna | 21 | female |
| 1 | AnnaRick | 42 | femalemale |
| 2 | AnnaRickMay | 64 | femalemalefemale |
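
To illustrate applying a function we write ourselves, here is a small sketch. The helper `value_range` and the label format are invented for this example. By default `apply` passes each column to the function, while `axis=1` passes each row instead.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

def value_range(column):
    """Hypothetical helper: difference between the largest and smallest value."""
    return column.max() - column.min()

# Column-wise: the function receives each selected column as a Series.
ranges = df[['age']].apply(value_range)

# Row-wise: with axis=1 the function receives each row as a Series.
labels = df.apply(lambda row: f"{row['name']} ({row['age']})", axis=1)

print(ranges['age'])    # 1
print(labels.tolist())  # ['Anna (21)', 'Rick (21)', 'May (22)']
```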


### Histogramming Numerical Data

Often, we'll want to count how many times each unique value occurs within our data, either for a single column or for each unique combination of values across several columns. To do this, we can call the `value_counts()` function on the entire DataFrame (or on just one or more of its columns).

The `value_counts` function called on a specific column computes a histogram of a 1D array of values. It can also be used as a function on regular arrays, and it can be used to count combinations across multiple columns. For a more detailed explanation of histogramming and counting data within our DataFrame, please refer to [this introduction](https://pandas.pydata.org/docs/user_guide/basics.html#basics-discretization) on the Pandas website.

In this example, we'll just count the unique ages of individuals in our data set. So, we'll only count the unique values of a single column, rather than count the unique rows of the entire DataFrame.

In [61]:
df['age'].value_counts()

21    2
22    1
Name: age, dtype: int64
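
A sketch of two variations mentioned above: returning proportions instead of counts with `normalize=True`, and counting combinations across multiple columns by calling `value_counts` on the DataFrame with a `subset`.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Counts for a single column.
gender_counts = df['gender'].value_counts()

# Proportions instead of raw counts.
gender_shares = df['gender'].value_counts(normalize=True)

# Counts for each unique (age, gender) combination.
combos = df.value_counts(subset=['age', 'gender'])

print(gender_counts['female'])            # 2
print(round(gender_shares['female'], 3))  # 0.667
print(len(combos))                        # 3
```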

<hr />

# Merging Data <a class="anchor" id="merge"></a>

Pandas provides various facilities for easily combining together Series or DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join and merge-type operations. In addition, pandas also provides utilities to compare two Series or DataFrame and summarize their differences.

In Pandas, we can combine DataFrames together by concatenation or joining. Concatenation can be achieved using the `concat` function, which essentially stacks multiple DataFrames together. In SQL, this is similar to performing a union on multiple tables.

For a more comprehensive list of merging and concatenation methods callable on our DataFrame, refer to [this section](https://pandas.pydata.org/docs/user_guide/merging.html#merging) on the official Pandas website.

### Concatenating Data Together

The `concat` function does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes on the other axes. Meaning, we can stack DataFrames on top of each other (along the rows) or side by side (along the columns), and choose whether labels that appear in only some of the DataFrames are kept or dropped. For more information about concatenating DataFrames together, refer to [this section](https://pandas.pydata.org/docs/user_guide/merging.html#concatenating-objects) on the official Pandas website.

For this example, we'll combine the following DataFrames:

In [62]:
df1 = pd.DataFrame({'name': ['Sue', 'Jan', 'Todd'], 'age': [12, 28, 8]})
df2 = pd.DataFrame({'name': ['Tanner', 'Sue'], 'age': [31, 25]})
pd.concat([df1, df2])

|  | name | age |
|---|---|---|
| 0 | Sue | 12 |
| 1 | Jan | 28 |
| 2 | Todd | 8 |
| 0 | Tanner | 31 |
| 1 | Sue | 25 |
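
Notice that the original row labels (0, 1, 2, 0, 1) are kept, so some labels repeat. If that is not wanted, one option is `ignore_index=True`; another is `keys`, which tags each row with the DataFrame it came from. The key names `'first'` and `'second'` below are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Sue', 'Jan', 'Todd'], 'age': [12, 28, 8]})
df2 = pd.DataFrame({'name': ['Tanner', 'Sue'], 'age': [31, 25]})

# Renumber the rows 0..4 instead of keeping the original labels.
renumbered = pd.concat([df1, df2], ignore_index=True)

# Keep track of where each row came from with an extra index level.
tagged = pd.concat([df1, df2], keys=['first', 'second'])

print(renumbered.index.tolist())             # [0, 1, 2, 3, 4]
print(tagged.loc['second']['name'].tolist()) # ['Tanner', 'Sue']
```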


### Joining Data Together

The `merge` function performs a join operation, similar to SQL. So, similar to SQL, we can perform inner, left, right, outer, etc. joins on a set of DataFrames, given a set of parameters, like the column names for joining the tables together.

Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better) than other open source implementations. The reason for this is careful algorithmic design and the internal layout of the data in DataFrame. For more information about joining DataFrames together, refer to [this section](https://pandas.pydata.org/docs/user_guide/merging.html#merging-join) on the official Pandas website.

In the following example, we'll just perform a simple inner join on the two DataFrames specified from the previous section:

In [63]:
pd.merge(df1, df2, how='inner', on='name')

|  | name | age_x | age_y |
|---|---|---|---|
| 0 | Sue | 12 | 25 |
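
For comparison with the inner join, the sketch below runs a left join and an outer join on the same two DataFrames. The suffixes `_df1` and `_df2` are chosen here for readability in place of the default `_x` and `_y`, and `indicator=True` adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Sue', 'Jan', 'Todd'], 'age': [12, 28, 8]})
df2 = pd.DataFrame({'name': ['Tanner', 'Sue'], 'age': [31, 25]})

# Left join: keep every row of df1, fill in df2 values where a name matches.
left = pd.merge(df1, df2, how='left', on='name', suffixes=('_df1', '_df2'))

# Outer join: keep every name found in either DataFrame.
outer = pd.merge(df1, df2, how='outer', on='name', indicator=True)

print(left['name'].tolist())             # ['Sue', 'Jan', 'Todd']
print(left['age_df2'].isna().tolist())   # [False, True, True]
print(len(outer))                        # 4
```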


<hr />

# Grouping Data <a class="anchor" id="group"></a>

In Pandas, the `groupby` function performs a group-by function, similar to SQL. By *group-by* we are referring to a process involving one or more of the following steps:
- **Splitting** the data into groups based on some criteria
- **Applying** a function to each group independently
- **Combining** the results into a data structure

The split step is the most straightforward of the three steps. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do some type of aggregation, transformation, filtration, or a combination of these methods.

At a high-level, we can think of aggregation as a type of method that computes a summary statistic for each group (e.g. mean, counts, etc.). We can think of transformation as a type of method that performs some group-specific computations (e.g. filling NAs for missing values). Lastly, we can think of filtration as a type of method that discards some groups (e.g. filtering out data based on the group's mean). For a more comprehensive list of methods and operations callable on our DataFrame, refer to [this section](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby) on the official Pandas website.

In the following two examples, we'll combine the individuals into groups based on their ages. In the first example, we'll count the number of non-missing values in each remaining column for every age group. In the second example, we'll select the first name within each age group.

In [64]:
df.groupby('age').count()

| age | name | gender |
|---|---|---|
| 21 | 2 | 2 |
| 22 | 1 | 1 |


In [65]:
df.groupby('age')['name'].first()

age
21    Anna
22     May
Name: name, dtype: object
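
The aggregation, transformation, and filtration methods described above can be sketched as follows. Here the data is grouped by `gender` rather than `age`, purely so that the three results are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

grouped = df.groupby('gender')

# Aggregation: one summary value per group.
mean_age = grouped['age'].mean()

# Transformation: one value per original row, computed within its group.
group_mean_per_row = grouped['age'].transform('mean')

# Filtration: keep only the rows of groups with more than one member.
larger_groups = grouped.filter(lambda group: len(group) > 1)

print(mean_age['female'])              # 21.5
print(group_mean_per_row.tolist())     # [21.5, 21.0, 21.5]
print(larger_groups['name'].tolist())  # ['Anna', 'May']
```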

<hr />

# Reshaping Data <a class="anchor" id="reshape"></a>

There are a few more advanced methods used for reshaping our data set. These can be useful for transforming data from one structure into another. For a more comprehensive list of advanced reshaping methods and operations callable on our DataFrame, refer to [this section](https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-stacking) on the official Pandas website.

### Pivot Tables

The `pivot` function reshapes a DataFrame by turning the unique values of one column into new columns. It provides general purpose pivoting with various data types (strings, numerics, etc.), but it does not aggregate, so each index/column pair must appear only once. Pandas also provides the `pivot_table` function for creating spreadsheet-style pivot tables with aggregation of numeric data. For a more detailed explanation of pivot tables, refer to [this section](https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-pivot) on the official Pandas website.

In this example, we'll pivot the **age** values within our DataFrame, so the rows are indexed by the **gender** column, and the new columns come from the **name** column. Combinations that don't exist in the data are filled with `NaN`.

In [66]:
df.pivot(index='gender', columns='name', values='age')

| gender | Anna | May | Rick |
|---|---|---|---|
| female | 21.0 | 22.0 | NaN |
| male | NaN | NaN | 21.0 |
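
For contrast with `pivot`, here is a minimal sketch of `pivot_table` aggregating the same data. Both female rows are collapsed into one average, which `pivot` on its own would not do.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# One row per gender, holding the mean age of that group.
table = df.pivot_table(index='gender', values='age', aggfunc='mean')

print(table.loc['female', 'age'])  # 21.5
print(table.loc['male', 'age'])    # 21.0
```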


### Stacking Data

Closely related to the `pivot` method are the `stack` and `unstack` methods, which are available on Series and DataFrame objects. These methods are designed to work together, and they can be indexed on multiple columns from our original data set.

Essentially, the `stack` method pivots a level of the column labels, and it returns a DataFrame (or Series) whose index has a new inner-most level of row labels. For a more detailed explanation of stacking, please refer to [this section](https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-stacking) on the official Pandas website.

In the following example, we'll stack the entire DataFrame. This function is more useful to *de-pivot* an already pivoted DataFrame, whether it be a transformed DataFrame or a DataFrame loaded from a raw SQL table.

In [67]:
df.stack()

0  name        Anna
   age           21
   gender    female
1  name        Rick
   age           21
   gender      male
2  name         May
   age           22
   gender    female
dtype: object
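
To show the *de-pivot* use case, the sketch below stacks the pivoted DataFrame from the previous section and then unstacks it again. The explicit `dropna()` is an assumption made here so the result is the same whether or not the installed pandas version drops missing values while stacking.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Anna', 'Rick', 'May'],
    'age': [21, 21, 22],
    'gender': ['female', 'male', 'female'],
})

# Wide format: one row per gender, one column per name.
pivoted = df.pivot(index='gender', columns='name', values='age')

# Long format: one row per (gender, name) pair that actually exists.
long_format = pivoted.stack().dropna()

# unstack reverses the operation and brings back the wide layout.
restored = long_format.unstack()

print(len(long_format))                   # 3
print(long_format.loc[('female', 'May')]) # 22.0
print(restored.shape)                     # (2, 3)
```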

<hr />

# Additional Resources <a class="anchor" id="references"></a>

- [Comprehensive Pandas Tutorial for Beginners](https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/)
- [Basic Pandas Tutorial for Beginners](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Comprehensive User Guide for Intermediates](https://pandas.pydata.org/docs/user_guide/index.html)
- [Community Tutorials for Intermediates](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)
- [Pandas Cookbook for Experts](https://pandas.pydata.org/docs/user_guide/cookbook.html)
- [Paid Course for Pandas Learners](https://www.dataquest.io/m/291-introduction-to-pandas/)