# Pandas

# Table of Contents
  - [Pandas](#Pandas)
    - [Introduction](#Introduction)
      - [What is Pandas?](#What-is-Pandas?)
      - [Why We Need Pandas with Python](#Why-We-Need-Pandas-with-Python)
      - [When and Why It's Used](#When-and-Why-It's-Used)
      - [Pandas Performance](#Pandas-Performance)
      - [Alternatives and When They Are Better](#Alternatives-and-When-They-Are-Better)
    - [Working with Pandas `DataFrame`](#Working-with-Pandas-DataFrame)
      - [Pandas Data Structures](#Pandas-Data-Structures)
        - [Working with NumPy Arrays](#Working-with-NumPy-Arrays)
        - [`Series`](#Series)
        - [`Index`](#Index)
        - [`DataFrame`](#DataFrame)
      - [Creating `DataFrame` objects](#Creating-DataFrame-objects)
        - [Creating a `Series` object](#Creating-a-Series-object)
        - [Creating a `DataFrame` object from a `Series` object](#Creating-a-DataFrame-object-from-a-Series-object)
        - [Creating a `DataFrame` from Python Data Structures](#Creating-a-DataFrame-from-Python-Data-Structures)
          - [From a dictionary of list-like structures](#From-a-dictionary-of-list-like-structures)
          - [From a list of dictionaries](#From-a-list-of-dictionaries)
          - [From a list of tuples](#From-a-list-of-tuples)
          - [From a NumPy array](#From-a-NumPy-array)
      - [Creating a `DataFrame` object from the contents of a CSV File](#Creating-a-DataFrame-object-from-the-contents-of-a-CSV-File)
        - [Finding information on the file before reading it in](#Finding-information-on-the-file-before-reading-it-in)
          - [Examining a few rows](#Examining-a-few-rows)
          - [Column count](#Column-count)
        - [Reading in the file](#Reading-in-the-file)
        - [Writing a `DataFrame` Object to a CSV File](#Writing-a-DataFrame-Object-to-a-CSV-File)
      - [Writing a `DataFrame` Object to a Database](#Writing-a-DataFrame-Object-to-a-Database)
      - [Creating a `DataFrame` Object by Querying a Database](#Creating-a-DataFrame-Object-by-Querying-a-Database)
    - [Data wrangling](#Data-wrangling)
    - [Data aggregation](#Data-aggregation)

## Introduction



### What is Pandas?

[Pandas](https://pandas.pydata.org/docs/index.html) is an open-source data manipulation and analysis library for Python.
It provides powerful data structures and functions designed to make working with structured data intuitive and efficient. At the heart of Pandas are two primary data structures:

- **DataFrame**: A two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It's similar to a spreadsheet or SQL table and is generally the most commonly used Pandas object.
- **Series**: A one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

Pandas integrates well with various other Python libraries, such as Matplotlib for plotting and NumPy for numerical computations, making it a central library in the Python data science stack.

### Why We Need Pandas with Python

Python, while a powerful programming language, isn't designed specifically for data analysis.
It lacks built-in, high-level data structures and tools that are intuitive and efficient for these tasks.
Here's where Pandas comes in:

- **Data Cleaning and Preparation**: Data scientists spend a significant amount of time cleaning and preparing data. Pandas simplifies these tasks with built-in functions for filtering, selecting, and manipulating data.
- **Data Analysis**: With Pandas, analyzing and exploring data is more straightforward. It provides functions for aggregating, summarizing, and transforming data, making it easier to derive insights.
- **Data Visualization**: Though Pandas is not a data visualization library, it seamlessly interfaces with Matplotlib for plotting and visualizing data, allowing quick and informative visual analysis.
- **Handling Diverse Data Types**: Pandas efficiently handles a variety of data formats, including CSV, Excel files, SQL databases, and HDF5 format, making it a versatile tool for diverse data analysis needs.

### When and Why It's Used

Pandas is widely used in a variety of fields for data analysis and manipulation tasks.
Some common use cases include:

- **Data Cleaning**: Transforming raw data into a form that is suitable for analysis, such as filling missing values, removing duplicates, and converting data types.
- **Data Exploration and Analysis**: Quick examination of data for patterns, irregularities, and insights. This includes operations like sorting, filtering, and grouping data.
- **Data Visualization**: Creating plots and graphs to understand trends and patterns in data.
- **Machine Learning**: Preprocessing and cleaning datasets before feeding them into machine learning models.
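
As a rough illustration of the cleaning steps listed above, the sketch below removes a duplicate row, fills a missing value, and converts a column of numeric strings to floats. The data and the `'unknown'` placeholder are made up for this example.

```python
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing place, numbers stored as text
raw = pd.DataFrame({
    'place': ['California', 'Alaska', 'Alaska', None],
    'mag': ['5.2', '1.2', '1.2', '0.2'],
})

clean = (
    raw
    .drop_duplicates()               # drop the repeated Alaska row
    .fillna({'place': 'unknown'})    # fill the missing place name
    .astype({'mag': float})          # convert the text to floating point numbers
    .reset_index(drop=True)          # renumber the remaining rows
)
clean
```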

### Pandas Performance

Pandas is highly efficient for most data manipulation and analysis tasks, especially with small to moderately sized datasets.
It's optimized for performance in many scenarios, with critical code paths written in Cython or C.
However, when working with very large datasets (roughly tens of millions of rows or more), Pandas can face performance issues due to:

- **Memory Usage**: Pandas typically requires significantly more memory than the size of the data, making it less efficient for very large datasets.
- **Speed**: For extremely large datasets, some operations in Pandas can be slow, as it's not fully optimized for all use cases, especially those involving large-scale, distributed computing.
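
To get a feel for the memory point, we can measure a column ourselves. This is a minimal sketch with a made-up column of repeated strings; the exact byte counts depend on the platform and the pandas version, so only the comparisons matter here.

```python
import pandas as pd

# Hypothetical column: 10,000 rows but only two distinct strings
alerts = pd.Series(['green', 'red'] * 5_000, dtype=object, name='alert')

# Without deep=True only the pointers to the Python strings are counted
shallow_bytes = alerts.memory_usage(index=False)
# deep=True also counts the memory used by the strings themselves
deep_bytes = alerts.memory_usage(index=False, deep=True)
# A categorical column stores each distinct string once plus small integer codes
category_bytes = alerts.astype('category').memory_usage(index=False, deep=True)

print(shallow_bytes, deep_bytes, category_bytes)
```

Choosing a more compact data type, as with the categorical conversion here, is one way to reduce the footprint before reaching for a different library.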

### Alternatives and When They Are Better

One of the notable alternatives to Pandas is [**Polars**](https://pola.rs/).
Polars is a DataFrame library that is designed to handle larger datasets more efficiently than Pandas.
Here's why and when Polars can be a better choice:

- **Performance**: Polars is designed to be faster and more memory-efficient than Pandas, particularly with large datasets. It leverages modern hardware capabilities, like multi-threading, to speed up data processing.
- **Lazy Evaluation**: Polars supports lazy evaluation, where computations are queued and executed only when necessary. This approach can lead to performance improvements, especially in complex data pipelines.
- **Ease of Scaling**: For large-scale data processing, Polars can be a better fit. It's more adept at handling the kinds of big data tasks that are increasingly common in industry settings.

---

## Working with Pandas `DataFrame`

### Pandas Data Structures


In this section, we will discuss the `Series`, `Index`, and `DataFrame` classes. To do so, we will read in a snippet of the CSV file we will work with later. Don't worry about that part yet, though.

#### Working with NumPy Arrays
Let's read in a short CSV file (using `numpy`) for some sample data. 

In [None]:
import numpy as np

data = np.genfromtxt(
    'data/01/example_data.csv', delimiter=';', 
    names=True, dtype=None, encoding='utf-8'
)
data

We can find the dimensions with the `shape` attribute:

In [None]:
data.shape

We can find the data types with the `dtype` attribute:

In [None]:
data.dtype

Each element in the array corresponds to a row from the CSV file. Unlike lists that can hold multiple data types, NumPy arrays are limited to one, enabling quick, vectorized actions. The data import resulted in an array of `numpy.void` objects, designed to handle various types. This occurs as each row contains diverse data types: four strings, one float, and one integer. Consequently, we miss out on the performance benefits NumPy offers for arrays with uniform data types.
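
The sketch below reproduces this situation without the CSV file. The rows and field names are invented stand-ins, so treat it as an illustration of the `numpy.void` behavior rather than the actual data.

```python
import numpy as np

# Hypothetical rows standing in for the CSV contents (not the real file)
rows = [
    ('2018-10-13', 'Indonesia', 6.7, 1),
    ('2018-10-13', 'Alaska', 5.2, 0),
]
# One (name, type) pair per field: two strings, a float, and an integer
row_dtype = [('time', 'U10'), ('place', 'U20'), ('mag', 'f8'), ('tsunami', 'i8')]
sample = np.array(rows, dtype=row_dtype)

row_type = type(sample[0])        # each element is a whole row
field_names = sample.dtype.names  # the column names live in the dtype
print(row_type, field_names, sample.shape)
```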

Consider finding the highest magnitude. We can employ a [list comprehension](https://www.python.org/dev/peps/pep-0202/) to extract the value at index 3 (the magnitude) from each row, which is a `numpy.void` object. This gives us a list whose maximum we can find with the `max()` function.

In [None]:
%%timeit
max([row[3] for row in data])

If we, instead, create a NumPy array for each column, this operation is much easier (and more efficient) to perform. We can use a **[dictionary comprehension](https://www.python.org/dev/peps/pep-0274/)** to make a dictionary where the keys are the column names and the values are NumPy arrays of the data:

In [None]:
array_dict = {
    col: np.array([row[i] for row in data])
    for i, col in enumerate(data.dtype.names)
}
array_dict

Grabbing the maximum magnitude is now simply a matter of selecting the `mag` key and calling the `max()` method. This is nearly twice as fast as the list comprehension implementation, and that is with just 5 entries; imagine how much worse the first attempt will perform on large datasets:

In [None]:
%%timeit
array_dict['mag'].max()

However, this representation has other issues. Say we wanted to grab all the information for the earthquake with the maximum magnitude. How would we go about that? We would need to find the index of the maximum and then grab that index for each of the keys in the dictionary:

In [None]:
np.array([
    value[array_dict['mag'].argmax()] 
    for key, value in array_dict.items()
])

We now have a NumPy array consisting solely of strings: our numeric values were converted to strings, so we are back to the limitations of the first format. Additionally, if we want to sort the data by magnitude in ascending order, the first format requires sorting the rows by the value at index 3. In the second format, we need to determine the sort order from the `mag` array and then rearrange all the other arrays accordingly. Working with several NumPy arrays of different data types at once is cumbersome. This is where `pandas` comes in: it builds on NumPy arrays to make this kind of work easier. Let's begin exploring `pandas` with its data structures.
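
To make the sorting problem concrete, here is a minimal sketch of sorting a dictionary of arrays by magnitude. The `arrays` dictionary is a made-up stand-in for `array_dict` with only two columns.

```python
import numpy as np

# Hypothetical stand-in for array_dict with two of the columns
arrays = {
    'place': np.array(['Alaska', 'Indonesia', 'California']),
    'mag': np.array([5.2, 6.7, 1.2]),
}

# Positions that would put the magnitudes in ascending order
order = np.argsort(arrays['mag'])
# Every array has to be rearranged with the same positions to keep rows intact
sorted_arrays = {col: values[order] for col, values in arrays.items()}
sorted_arrays
```

Forgetting to reorder even one of the arrays would silently misalign the rows, which is the kind of bookkeeping `pandas` takes care of.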

#### `Series`
The `Series` class provides a data structure for arrays of a single type with some additional functionality.

In [None]:
import pandas as pd

place = pd.Series(array_dict['place'], name='place')
place

Here are some commonly used attributes with `Series` objects:

|Attribute | Returns |
| --- | --- |
| `name` | The name of the `Series` object |
| `dtype` | The data type of the `Series` object |
| `shape` | Dimensions of the `Series` object in a tuple of the form `(number of rows,)` |
| `index` | The `Index` object that is part of the `Series` object |
| `values` | The data in the `Series` object |

For the most part, `pandas` objects use NumPy arrays for their internal data representations. However, for some data types, `pandas` builds upon NumPy to create its own [arrays](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html). For this reason, depending on the data type, `values` can return either a pandas extension array or a `numpy.ndarray` object. Therefore, if we need to ensure we get a specific type back, it is recommended to use the `array` attribute or the `to_numpy()` method, respectively, instead of `values`.
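
The difference can be seen with two small, made-up `Series` objects, one numeric and one categorical:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NumPy-backed series and one using a pandas-specific type
numeric = pd.Series([1.5, 2.5, 3.5])
categorical = pd.Series(['green', 'red', 'green'], dtype='category')

# values depends on the data type of the series
numeric_values = numeric.values          # NumPy array
categorical_values = categorical.values  # pandas Categorical

# to_numpy() and array give a predictable type regardless of the data type
as_numpy = categorical.to_numpy()  # NumPy array
as_extension = numeric.array       # pandas extension array
```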

Now let's see some examples using these attributes.

**Getting the name of the series**

The NumPy array held the name of the data in the `dtype` attribute; here, we can access it directly: 

In [None]:
place.name

**Getting the data type**

A `Series` object holds a single data type.
Here it is `'O'` for object.

In [None]:
place.dtype

**Getting the dimensions of the series**

Just as with NumPy, we can use `shape` to get the dimensions.
`Series` objects are a single column, so the tuple only has one entry, the number of rows:

In [None]:
place.shape

**Isolating the values from the series**

This `Series` object is storing its values as a NumPy array:

In [None]:
place.values

#### `Index`
The addition of the `Index` class makes the `Series` class more powerful than a NumPy array. We can get the index from the `index` attribute of a `Series` object:

In [None]:
place_index = place.index
place_index

As with `Series` objects, we can access the underlying data via the `values` attribute. Note that this `Index` object is also built on top of a NumPy array:

In [None]:
place_index.values

Here are some commonly used attributes with `Index` objects:

|Attribute | Returns |
| --- | --- |
| `name` | The name of the `Index` object |
| `dtype` | The data type of the `Index` object |
| `shape` | Dimensions of the `Index` object |
| `values` | The data in the `Index` object |
| `is_unique` | Check if the `Index` object has all unique values |

We can check the type of the underlying data, just like with a `Series` object:

In [None]:
place_index.dtype

Same for the dimensions:

In [None]:
place_index.shape

We can check if the values are unique:

In [None]:
place_index.is_unique

With NumPy we can perform arithmetic operations element-wise between arrays:

In [None]:
np.array([1, 1, 1]) + np.array([-1, 0, 1])

Pandas supports this as well, and the index determines how element-wise operations are performed. With addition, only the matching indices are summed; labels that appear in only one of the objects get `NaN` in the result:

In [None]:
numbers = np.linspace(0, 10, num=5) # makes numpy array([0, 2.5, 5, 7.5, 10])
x = pd.Series(numbers) # index is [0, 1, 2, 3, 4]
y = pd.Series(numbers, index=pd.Index([1, 2, 3, 4, 5]))
x + y
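
The following self-contained sketch repeats that addition and checks which labels end up missing. It also shows the `add()` method with `fill_value=0`, which is one option when a missing partner should be treated as zero instead:

```python
import numpy as np
import pandas as pd

numbers = np.linspace(0, 10, num=5)
x = pd.Series(numbers)                         # index is [0, 1, 2, 3, 4]
y = pd.Series(numbers, index=[1, 2, 3, 4, 5])

# Labels 0 and 5 exist in only one series, so their sums are NaN
aligned = x + y
# fill_value=0 substitutes 0 for the missing side before adding
filled = x.add(y, fill_value=0)

print(aligned)
print(filled)
```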

We aren't limited to the integer indices of list-like structures, and we can label our rows. The labels can be altered at any time and be things like dates or even another column. In chapter 3, we will discuss how to perform some operations on the index in order to change it. Then, in chapter 4, we will use the index for operations merging data and aggregating it.



#### `DataFrame`

Using a `Series` object for each column enhances the NumPy approach, yet challenges persist in sorting by values or extracting full rows.
A `DataFrame` provides a tabular representation comprising multiple `Series` objects as columns and a unified `Index` object labeling the rows.
We can construct a `DataFrame` from either of the previously discussed NumPy formats.
While it's possible to create a `Series` object for each column, it's unnecessary:

In [None]:
df = pd.DataFrame(array_dict) 

# this will also work with the first representation
# df = pd.DataFrame(data)

df
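
The two tasks that were awkward with plain NumPy arrays, sorting by a column and pulling out a full row, become one-liners on a dataframe. The sketch below uses a small made-up dataframe rather than `df` so that it runs on its own:

```python
import pandas as pd

# Hypothetical stand-in for the earthquake data
quakes = pd.DataFrame({
    'place': ['Alaska', 'Indonesia', 'California'],
    'mag': [5.2, 6.7, 1.2],
    'tsunami': [0, 1, 0],
})

# Sorting by one column reorders every column together
by_magnitude = quakes.sort_values('mag')
# idxmax() gives the row label of the largest magnitude; loc[] fetches that row
strongest = quakes.loc[quakes['mag'].idxmax()]

print(by_magnitude)
print(strongest)
```

Note that the values in `strongest` are not all converted to strings, unlike the NumPy array we built earlier.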

We can check the type of the underlying data with `dtypes` (note that it is not `dtype` as with `Series` and `Index` objects since each column will have its own data type):

In [None]:
df.dtypes

We can get the underlying data with the `values` attribute. Note that this looks very similar to our initial NumPy representation:

In [None]:
df.values

We can isolate the columns with the `columns` attribute. Notice that the columns are actually an `Index` object just on a different axis (columns are the horizontal index while rows are the vertical index).

In [None]:
df.columns

Here are some commonly used attributes:

|Attribute | Returns |
| --- | --- |
| `dtypes` | The data types of each column |
| `shape` | Dimensions of the `DataFrame` object in a tuple of the form `(number of rows, number of columns)` |
| `index` | The `Index` object along the rows of the `DataFrame` object |
| `columns` | The name of the columns (as an `Index` object) |
| `values` | The data in the `DataFrame` object |
| `empty` | Check if the `DataFrame` object is empty |

The `Index` object along the rows of the dataframe can be accessed via the `index` attribute (just as with `Series` objects):

In [None]:
df.index

As with both `Series` and `Index` objects, we can get the dimensions of the dataframe with the `shape` attribute. The result is of the form `(nrows, ncols)`. Our dataframe has 5 rows and 6 columns:

In [None]:
df.shape

Pandas allows arithmetic operations on dataframes, matching both index and column for execution.
This example showcases addition.
In string columns (`time`, `place`, `magType`, and `alert`), pandas concatenates values across dataframes.
For numeric columns (`mag` and `tsunami`), the values are summed.

In [None]:
df + df

### Creating `DataFrame` objects

In [None]:
import datetime as dt
import numpy as np
import pandas as pd

#### Creating a `Series` object

In [None]:
np.random.seed(0) # set a seed for reproducibility
pd.Series(np.random.rand(5), name='random')

#### Creating a `DataFrame` object from a `Series` object
Use the `to_frame()` method:

In [None]:
pd.Series(np.linspace(0, 10, num=5)).to_frame()

#### Creating a `DataFrame` from Python Data Structures

##### From a dictionary of list-like structures

The dictionary values can be lists, NumPy arrays, etc. as long as they have length (generators don't have length so we can't use them here):

In [None]:
np.random.seed(0) # set seed so result is reproducible
pd.DataFrame(
    {
        'random': np.random.rand(5),
        'text': ['hot', 'warm', 'cool', 'cold', None],
        'truth': [np.random.choice([True, False]) for _ in range(5)]
    }, 
    index=pd.date_range(
        end=dt.date(2019, 4, 21),
        freq='1D',
        periods=5, 
        name='date'
    )
)

##### From a list of dictionaries

In [None]:
pd.DataFrame([
    {'mag': 5.2, 'place': 'California'},
    {'mag': 1.2, 'place': 'Alaska'},
    {'mag': 0.2, 'place': 'California'},
])

##### From a list of tuples

In [None]:
list_of_tuples = [(n, n**2, n**3) for n in range(5)]
list_of_tuples

In [None]:
pd.DataFrame(
    list_of_tuples, 
    columns=['n', 'n_squared', 'n_cubed']
)

##### From a NumPy array

In [None]:
pd.DataFrame(
    np.array([
        [0, 0, 0],
        [1, 1, 1],
        [2, 4, 8],
        [3, 9, 27],
        [4, 16, 64]
    ]), columns=['n', 'n_squared', 'n_cubed']
)

### Creating a `DataFrame` object from the contents of a CSV File

#### Finding information on the file before reading it in
Before attempting to read in a file, we can use the command line to see important information about the file that may determine how we read it in. We can run command line code from Jupyter Notebooks by using `!` before the code.

For example, we can find out how many lines are in the file by using the `wc` utility (word count) and counting lines in the file (`-l`). The file has 9,333 lines:

In [None]:
!wc -l data/earthquakes.csv

**Windows users**: if the above doesn't work for you (depends on your setup), then use this instead:

```python
!find /c /v "" data\earthquakes.csv
```


We can find the file size by using `ls` to list the files in the `data` directory, and passing in the flags `-lh` to include the file size in human readable format. Then we use `grep` to find the file in question. Note that `|` passes the result of `ls` to `grep`. The `grep` utility is used for finding items that match patterns.

This tells us the file is 3.4 MB:

In [None]:
!ls -lh data | grep earthquakes.csv

**Windows users**: if the above doesn't work for you (depends on your setup), then use this instead:

```python
!dir data | findstr "earthquakes.csv"
```

We can even capture the result of a command and use it in our Python code:

In [None]:
files = !ls -lh data
[file for file in files if 'earthquake' in file]

**Windows users**: if the above doesn't work for you (depends on your setup), then use this instead:

```python
files = !dir data
[file for file in files if 'earthquake' in file]
```

##### Examining a few rows

We can use `head` to look at the top `n` rows of the file. With the `-n` flag, we can specify how many. This shows us that the first row of the file contains headers and that it is comma-separated (just because the file extension is `.csv` doesn't mean it contains comma-separated values):

In [None]:
!head -n 2 data/earthquakes.csv

**Windows users**: if the above doesn't work for you (depends on your setup), then use this instead:

```python
n = 2
with open('data/earthquakes.csv', 'r') as file:
    for _ in range(n):
        print(file.readline(), end='')
```


Just like `head` gives rows from the top, `tail` gives rows from the bottom. This can help us check that there is no extraneous data at the bottom of the file, like perhaps some metadata about the fields that actually isn't part of the dataset:

In [None]:
!tail -n 1 data/earthquakes.csv

**Windows users**: if the above doesn't work for you (depends on your setup), then use this instead:

```python
import os

with open('data/earthquakes.csv', 'rb') as file:
    file.seek(0, os.SEEK_END)
    while file.read(1) != b'\n':
        file.seek(-2, os.SEEK_CUR)
    print(file.readline().decode())
```

*Note*: To inspect more than one row from the end of the file, you will have to use this instead, which requires reading the whole file:

```python
n = 2
with open('data/earthquakes.csv', 'r') as file:
    print(''.join(file.readlines()[-n:]))
```



##### Column count
We can use `awk` to find the column count. This is a utility for pattern scanning and processing. The `-F` flag allows us to specify the delimiter (comma, in this case). Then we specify what to do for each record in the file. We choose to print `NF` which is a predefined variable whose value is the number of fields in the current record. Here, we say `exit` so that we print the number of fields (columns, here) in the first row of the file, then we stop. 

This tells us we have 26 data columns:

In [None]:
!awk -F',' '{print NF; exit}' data/earthquakes.csv

**Windows users**: if the above or below don't work for you (depends on your setup), then use this instead:

```python
with open('data/earthquakes.csv', 'r') as file:
    print(len(file.readline().split(',')))
```
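
Splitting on commas gives the wrong count if a header contains a comma inside quotes. We have no indication that this is the case for `earthquakes.csv`, but for files where it might be, a sketch using Python's built-in `csv` module looks like this (the sample text is made up):

```python
import csv
import io

# Hypothetical header row in which one column name contains a comma
sample = io.StringIO('time,"place, region",mag\n2018-10-13,"Alaska, US",5.2\n')

# Naive splitting also splits inside the quoted column name
naive_count = len(sample.readline().split(','))

# csv.reader respects the quoting and returns one entry per column
sample.seek(0)
column_count = len(next(csv.reader(sample)))

print(naive_count, column_count)
```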


Since we know the 1st line of the file had headers, and the file is comma-separated, we can also count the columns by using `head` to get headers and parsing them in Python:

In [None]:
headers = !head -n 1 data/earthquakes.csv
len(headers[0].split(','))

**Windows users**: if you had to use the alternatives above, consider trying out [Cygwin](https://www.cygwin.com) or [Windows Subsystem for Linux (WSL)](https://docs.microsoft.com/en-us/windows/wsl/about).



#### Reading in the file

Our file is small in size, has headers in the first row, and is comma-separated, so we don't need to provide any additional arguments to read in the file with `pd.read_csv()`, but be sure to check the [documentation](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for possible arguments:

In [None]:
df = pd.read_csv('data/earthquakes.csv')

Note that we can also pass in a URL. Let's read this same file from GitHub:

In [None]:
df = pd.read_csv(
    'https://github.com/stefmolin/'
    'Hands-On-Data-Analysis-with-Pandas-2nd-edition'
    '/blob/master/ch_02/data/earthquakes.csv?raw=True'
)

Pandas is usually very good at figuring out which options to use based on the input data, so we often won't need to add arguments to the call; however, there are many options available should we need them, some of which include the following:

| Parameter | Purpose |
| --- | --- |
| `sep` | Specifies the delimiter |
| `header` | Row number where the column names are located; the default option has `pandas` infer whether they are present |
| `names` | List of column names to use as the header |
| `index_col` | Column to use as the index |
| `usecols` | Specifies which columns to read in |
| `dtype` | Specifies data types for the columns | 
| `converters` | Specifies functions for converting data in certain columns |
| `skiprows` | Rows to skip |
| `nrows` | Number of rows to read at a time (combine with `skiprows` to read a file bit by bit) |
| `parse_dates` | Automatically parse columns containing dates into datetime objects |
| `chunksize` | For reading the file in chunks |
| `compression` | For reading in compressed files without extracting beforehand |
| `encoding` | Specifies the file encoding |
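
Here is a short sketch of a few of these parameters in use. It reads from an in-memory string with invented contents instead of a file on disk, so the column names and values are assumptions for illustration only:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk
csv_text = (
    'time;place;mag\n'
    '2018-10-13 11:10:23;Indonesia;6.7\n'
    '2018-10-13 04:34:15;Alaska;5.2\n'
    '2018-10-13 00:13:46;California;1.2\n'
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                  # the delimiter is not a comma
    usecols=['time', 'mag'],  # skip the place column
    parse_dates=['time'],     # convert the time column to datetimes
    nrows=2,                  # read only the first two rows
)

# With chunksize we get an iterator of dataframes instead of a single dataframe
chunk_lengths = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(csv_text), sep=';', chunksize=2)
]

print(sample.dtypes)
print(chunk_lengths)
```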

#### Writing a `DataFrame` Object to a CSV File

To write our dataframe to a CSV file, we call its `to_csv()` method. We pass `index=False` because the index here is just the row numbers, which we don't need in the file. Be sure to check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) for other possible arguments:

In [None]:
df.to_csv('output.csv', index=False)
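
As a quick check of what ends up in the output, the sketch below writes a small made-up dataframe to an in-memory buffer instead of a file and reads it back:

```python
import io
import pandas as pd

# Hypothetical dataframe to round-trip
quakes = pd.DataFrame({'mag': [5.2, 1.2], 'place': ['California', 'Alaska']})

buffer = io.StringIO()
quakes.to_csv(buffer, index=False)  # index=False leaves out the row labels
header_line = buffer.getvalue().splitlines()[0]

# Reading the text back gives the same columns and values
buffer.seek(0)
restored = pd.read_csv(buffer)

print(header_line)
print(restored)
```

Without `index=False`, the file would start with an extra unnamed column holding the row labels.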

### Writing a `DataFrame` Object to a Database
Note the `if_exists` parameter. By default (`'fail'`), `to_sql()` raises an error if we try to write to a table that already exists. Here, we don't mind overwriting the table, so we pass `'replace'`. If we instead want to add new rows to an existing table, we set it to `'append'`.

In [None]:
import sqlite3

with sqlite3.connect('data/01/quakes.db') as connection:
    pd.read_csv('data/01/tsunamis.csv').to_sql(
        'tsunamis', connection, index=False, if_exists='replace'
    )
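
The three `if_exists` options can be compared side by side. This sketch uses a temporary in-memory SQLite database and made-up rows, so nothing is written to disk:

```python
import sqlite3
import pandas as pd

# Hypothetical data; ':memory:' creates a temporary database that is never saved
tsunamis = pd.DataFrame({'place': ['Indonesia', 'Alaska'], 'mag': [6.7, 5.2]})
connection = sqlite3.connect(':memory:')
count_query = 'SELECT COUNT(*) FROM tsunamis'

tsunamis.to_sql('tsunamis', connection, index=False)  # creates the table
tsunamis.to_sql('tsunamis', connection, index=False, if_exists='append')
rows_after_append = connection.execute(count_query).fetchone()[0]

tsunamis.to_sql('tsunamis', connection, index=False, if_exists='replace')
rows_after_replace = connection.execute(count_query).fetchone()[0]

try:
    # The default is if_exists='fail', and the table already exists
    tsunamis.to_sql('tsunamis', connection, index=False)
    raised = False
except ValueError:
    raised = True

connection.close()
print(rows_after_append, rows_after_replace, raised)
```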

### Creating a `DataFrame` Object by Querying a Database
Here we query a SQLite database using a connection from Python's built-in `sqlite3` module. For other databases, you will typically need to install [SQLAlchemy](https://www.sqlalchemy.org/).

In [None]:
import sqlite3

with sqlite3.connect('data/01/quakes.db') as connection:
    tsunamis = pd.read_sql('SELECT * FROM tsunamis', connection)

tsunamis.head()
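
Queries don't have to select everything. As a sketch, the example below filters rows in SQL and passes the threshold through the `params` argument rather than formatting it into the query string. The table name, columns, and values are made up, and the `?` placeholder style is the one used by `sqlite3`:

```python
import sqlite3
import pandas as pd

# Hypothetical table in a temporary in-memory database
connection = sqlite3.connect(':memory:')
pd.DataFrame({
    'place': ['Indonesia', 'Alaska', 'California'],
    'mag': [6.7, 5.2, 1.2],
}).to_sql('quakes', connection, index=False)

# The ? placeholder is filled in from params by the database driver
strong = pd.read_sql(
    'SELECT place, mag FROM quakes WHERE mag >= ? ORDER BY mag DESC',
    connection,
    params=(5.0,),
)
connection.close()

strong
```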

## Data wrangling

## Data aggregation