# Introduction to Pandas

## What is Pandas?

**pandas** is a Python package providing fast, flexible, and expressive data structures **designed to make working with “relational” or “labeled” data** both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, **real-world data analysis in Python**. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure
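
As a rough illustration of the first two kinds of data in the list above, the sketch below builds a small table with differently typed columns and an irregularly spaced time series. The column names and values are invented for the example.

```python
import pandas as pd

# Tabular data: each column has its own dtype, as in an SQL table.
# The column names and values here are invented for illustration.
table = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "age": [36, 45, 28],
        "score": [91.5, 88.0, 79.25],
    }
)
print(table.dtypes)

# Time series data: the index holds timestamps, which need not be evenly spaced.
timestamps = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
readings = pd.Series([1.0, 1.5, 0.7], index=timestamps)
print(readings)
```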

The two primary data structures of pandas, **Series (1-dimensional)** and **DataFrame (2-dimensional)**, handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

**pandas is built on top of NumPy** and is intended to integrate well within a scientific computing environment with many other third-party libraries.

Here are just a few of the things that pandas does well:
- Easy handling of **missing data** (represented as NaN) in floating point as well as non-floating point data
- Size mutability: columns can be **inserted and deleted** from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- Powerful, flexible **group by** functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it **easy to convert** ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent **label-based slicing, fancy indexing**, and subsetting of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling of axes** (possible to have multiple labels per tick)
- **Robust IO tools** for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
- **Time series-specific functionality**: date range generation and frequency conversion, moving window statistics, date shifting, and lagging
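
Two of the items above, automatic alignment and group by, can be sketched in a few lines. The labels, column names, and numbers are invented for the example.

```python
import pandas as pd

# Automatic alignment: values are matched by label, not by position.
left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])
total = left + right  # labels present in only one operand become NaN
print(total)

# Group by: split rows by key, apply an aggregation, combine the results.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "amount": [100, 250, 50, 75],
    }
)
per_region = sales.groupby("region")["amount"].sum()
print(per_region)
```

Note that the alignment step introduces missing values for `"a"` and `"d"` instead of raising an error, which ties in with the missing-data handling listed first.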

> **pandas is fast**. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance. So if you focus on one feature for your application, you may be able to create a faster specialized tool.

## Installing Pandas

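pandas can be installed from PyPI with pip. The commands in this section use `uv pip`, uv's pip-compatible interface; if you do not use uv, plain `pip install` takes the same arguments.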

Run: `uv pip install pandas`

> It is recommended to install and run pandas from a virtual environment, for example, using the Python standard library’s `venv`.

pandas can also be installed with sets of optional dependencies to enable certain functionality. For example, to install pandas with the optional dependencies needed to read Excel files:

Run: `uv pip install "pandas[excel]"`

[Optional dependencies](https://pandas.pydata.org/docs/getting_started/install.html#optional-dependencies): You are highly encouraged to install these libraries, as they provide speed improvements, especially when working with large data sets.

Run: `uv pip install "pandas[excel,performance,html]"`
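
After installing, a quick check from Python can confirm which version was picked up and whether an optional dependency is importable. This is a minimal sketch. It assumes `openpyxl` is one of the packages pulled in by the `excel` extra, so consult the optional dependencies page if it reports the package as missing.

```python
import importlib.util

import pandas as pd

# Report the installed pandas version.
print("pandas", pd.__version__)

# Assumption: openpyxl is one of the Excel engines installed by "pandas[excel]".
has_openpyxl = importlib.util.find_spec("openpyxl") is not None
print("openpyxl available:", has_openpyxl)
```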


## Data structures

<table class="table">
<colgroup>
<col style="width: 17.6%">
<col style="width: 23.5%">
<col style="width: 58.8%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Dimensions</p></th>
<th class="head"><p>Name</p></th>
<th class="head"><p>Description</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p>1</p></td>
<td><p>Series</p></td>
<td><p>1D labeled homogeneously-typed array</p></td>
</tr>
<tr class="row-odd"><td><p>2</p></td>
<td><p>DataFrame</p></td>
<td><p>General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns</p></td>
</tr>
</tbody>
</table>

<img alt="../../_images/01_table_dataframe.svg" class="align-center" src="https://pandas.pydata.org/docs/_images/01_table_dataframe.svg">

The best way to think about the pandas data structures is as **flexible containers for lower dimensional data**. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
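
The container idea can be sketched as follows: selecting a column from a DataFrame gives a Series, selecting an element from a Series gives a scalar, and columns are added and removed with dictionary-style syntax. The column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Each column of a DataFrame is a Series.
col = df["a"]
print(type(col).__name__)

# Each element of a Series is a scalar.
first = col.iloc[0]
print(first)

# Insert and remove columns much like keys in a dict.
df["c"] = df["a"] * 2
del df["b"]
print(list(df.columns))
```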

To start working with pandas, import the package. The community-agreed alias for pandas is `pd`, and all of the pandas documentation assumes it.

The fundamental behavior of data types, indexing, axis labeling, and alignment applies across all of the objects. To get started, import NumPy and pandas into your namespace:

In [4]:
import pandas as pd
import numpy as np

### Series

`Series` is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a `Series` is to call: `s = pd.Series(data, index=index)`

Here, data can be many different things:
- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

**From ndarray**

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [20]:
rng = np.random.default_rng()
s = pd.Series(rng.random(5), index=["a", "b", "c", "d", "e"])
s

a    0.197277
b    0.675288
c    0.923408
d    0.704963
e    0.789662
dtype: float64

In [8]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [21]:
pd.Series(rng.random(5))

0    0.810717
1    0.029372
2    0.272337
3    0.968102
4    0.196465
dtype: float64
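
The length rule for ndarray input can be checked directly. In this sketch, omitting the index yields the default integer labels, and passing an index whose length differs from the data raises a `ValueError`. The variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

data = np.array([1.5, 2.5, 3.5])

# No index: labels default to 0, 1, ..., len(data) - 1.
default = pd.Series(data)
print(list(default.index))

# Index of the wrong length: pandas refuses to build the Series.
length_error = None
try:
    pd.Series(data, index=["a", "b"])
except ValueError as exc:
    length_error = exc
    print("ValueError:", exc)
```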

> pandas supports non-unique index values. If an operation that does not support duplicate index values is attempted, an exception will be raised at that time.
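
A small sketch of what this looks like in practice: a Series with a repeated label can be built and queried, while `reindex`, which needs unique labels, fails only when it is called. The data is invented for the example.

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(dup.index.is_unique)

# Selecting a duplicated label returns every matching row.
matches = dup["a"]
print(matches)

# Reindexing needs unique labels, so the error appears only at this point.
reindex_error = None
try:
    dup.reindex(["a", "b", "c"])
except ValueError as exc:
    reindex_error = exc
    print("ValueError:", exc)
```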

**From dict**

`Series` can be instantiated from dicts:

In [10]:
d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)

b    1
a    0
c    2
dtype: int64

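Without an index, the Series takes its labels from the dict keys in insertion order. If an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels that are not in the dict get NaN.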

In [11]:
d = {"a": 0.0, "b": 1.0, "c": 2.0}
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [12]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

> NaN (not a number) is the standard missing data marker used in pandas.
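
Missing entries can be detected and handled with a few Series methods. The sketch below reuses the dict from the example above and shows `isna()`, `dropna()`, and a reduction that skips NaN.

```python
import pandas as pd

d = {"a": 0.0, "b": 1.0, "c": 2.0}
s = pd.Series(d, index=["b", "c", "d", "a"])

# isna() flags the missing entries; here only "d" has no value in the dict.
missing = s.isna()
print(missing)

# sum() skips NaN by default.
print(s.sum())  # 3.0

# dropna() returns a new Series without the missing entries.
clean = s.dropna()
print(clean)
```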

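**From scalar value**

If data is a scalar value, the value is repeated to match the length of the index.

Like a NumPy array, a pandas Series has a single dtype.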

In [22]:
s = pd.Series(5, index=["a", "b", "c", "d", "e"])
s.dtype

dtype('int64')

This is often a NumPy dtype. However, pandas and third-party libraries extend NumPy’s type system in a few places, in which case the dtype would be an `ExtensionDtype`. Some examples within pandas are Categorical data and the Nullable integer data type.
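
The two extension types named above can be requested through the `dtype` argument. This is a minimal sketch with invented values. It uses the string aliases `"category"` and `"Int64"` (capital I) and checks the result against `pd.api.extensions.ExtensionDtype`.

```python
import pandas as pd

# Categorical data: values are stored as codes into a set of categories.
cat = pd.Series(["low", "high", "low"], dtype="category")
print(cat.dtype)

# Nullable integer: "Int64" keeps integers alongside a missing value.
nullable = pd.Series([1, None, 3], dtype="Int64")
print(nullable)

# Both dtypes are ExtensionDtype instances rather than NumPy dtypes.
cat_is_ext = isinstance(cat.dtype, pd.api.extensions.ExtensionDtype)
nullable_is_ext = isinstance(nullable.dtype, pd.api.extensions.ExtensionDtype)
print(cat_is_ext, nullable_is_ext)
```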

If you need the actual array backing a Series, use `Series.array`. Accessing the array can be useful when you need to do some operation without the index.

In [23]:
s.array

<NumpyExtensionArray>
[np.int64(5), np.int64(5), np.int64(5), np.int64(5), np.int64(5)]
Length: 5, dtype: int64

While Series is ndarray-like, if you need an actual ndarray, then use `Series.to_numpy()`.

In [24]:
s.to_numpy()

array([5, 5, 5, 5, 5])
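
The difference between the two accessors can be seen by checking the returned types. In this sketch, `Series.array` gives a pandas extension array and `Series.to_numpy()` gives a plain NumPy array. Neither carries the index.

```python
import numpy as np
import pandas as pd

s = pd.Series(5, index=["a", "b", "c", "d", "e"])

backing = s.array     # pandas ExtensionArray wrapping the data
plain = s.to_numpy()  # plain NumPy ndarray

is_extension_array = isinstance(backing, pd.api.extensions.ExtensionArray)
is_ndarray = isinstance(plain, np.ndarray)
print(is_extension_array, is_ndarray)

# Without the index, operations on the ndarray are purely positional.
print(plain.sum())
```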

### DataFrame

https://pandas.pydata.org/docs/user_guide/dsintro.html#series