In [1]:
import numpy as np
import pandas as pd

# Introduction to Pandas

Pandas is a Python package that provides high-level data structures and methods for data manipulation and analysis. Wes McKinney began developing it in 2008, and it was open-sourced in 2009.

The core components of pandas are:

- **Series:** A labeled one-dimensional array.
- **DataFrame:** A tabular data structure with labeled rows and columns.

Initially built on top of NumPy, pandas extends numerical computing on n-dimensional homogeneous arrays with SQL-like data structures and manipulation idioms. It includes integrated time series support, automatic data alignment, and robust handling of missing data. These features make pandas highly effective for tasks like data selection, aggregation, element-wise transformations, user-defined function (UDF) applications, and relational operations (e.g., joins).

With over 200 million downloads per month, pandas is currently the most popular data manipulation tool in the Python ecosystem. It supports NumPy and Apache Arrow, and it integrates with numerous other tools that extend its capabilities.

# Fundamental Data Structures

Pandas data manipulation occurs through the core data structures: `Series` and `DataFrame`. 

Complementing these, pandas also has the `Index` data structure, which enhances the behavior and capabilities of both `Series` and `DataFrame`.

## Series 

A Series is a one-dimensional, array-like data structure that contains values of the same type and an associated index.

You can create a Series from scratch in various ways, such as using an array-like object (e.g., NumPy arrays), an iterable, a dictionary, or a scalar value. 
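
As a minimal sketch of the creation routes listed above, the following builds a Series from a NumPy array, a generic iterable, a dictionary, and a scalar. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# From a NumPy array: a default integer index (0 to N - 1) is assigned.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From any iterable, here a range object.
from_iterable = pd.Series(range(3))

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a scalar: the value is repeated for every label in the given index.
from_scalar = pd.Series(7, index=["x", "y", "z"])

print(from_dict)
print(from_scalar)
```

Note that the scalar form needs an explicit `index` to produce more than one element, since the index is what determines the length of the result.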

Series objects come with a range of useful metadata attributes that help in understanding the data being manipulated. Common attributes include:

- `dtype`/`dtypes`: data type of the values
- `shape`: shape of the underlying data
- `size`: number of elements in the Series
- `nbytes`: size in bytes of the Series data
- `values`: ndarray (or ndarray-like) of Series values


In [2]:
s = pd.Series([10, 20, 30, 40])
s

0    10
1    20
2    30
3    40
dtype: int64

In [3]:
s.dtypes

dtype('int64')

In [4]:
s.shape

(4,)

In [5]:
s.size

4

In [6]:
s.nbytes

32

In [7]:
s.values

array([10, 20, 30, 40])

Every pandas object (Series or DataFrame) has an associated Index. If an index is not explicitly defined, a Series is created with an index consisting of the integers 0 through N - 1 (where N is the length of the data).

The index values serve as labels and can be used for various data selection operations. Because indexes are so central to how pandas objects behave, you can also create a Series with custom labels.

Additionally, you can retrieve the Index of a Series using the `index` attribute.

In [8]:
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
s

a    10
b    20
c    30
d    40
dtype: int64

In [9]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [10]:
s['b']

20

Finally, with a Series, you can perform numerous data transformations, including filtering, binary operations, aggregations, element-wise transformations, and many others.
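
The transformations mentioned above can be sketched on the same labeled Series. This is an illustrative example, not an exhaustive list; the threshold values and the `"high"`/`"low"` labels are arbitrary choices.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Filtering: a boolean mask keeps only the matching elements and their labels.
filtered = s[s > 15]

# Binary operation: the scalar is applied to every element.
doubled = s * 2

# Aggregation: reduces the Series to a single value.
total = s.sum()

# Element-wise transformation: map applies a function to each value.
category = s.map(lambda v: "high" if v >= 30 else "low")

print(filtered)
print(doubled)
print(total)
print(category)
```

In each case except the aggregation, the result is a new Series that keeps the original labels.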

## DataFrames

A DataFrame is a two-dimensional heterogeneous data structure that behaves like a table or spreadsheet. It has rows and columns, where each column can have a different data type (e.g., integer, float, string). The DataFrame's rows and columns are labeled, allowing for easy indexing and retrieval of data.

There are many ways to create DataFrames from scratch. The most common inputs are a list of dictionaries, a dictionary of equal-length lists, or a NumPy array.

Just as with a Series, an index is automatically assigned if none is provided. Columns are `Index` objects too, so if they are not explicitly defined, they are labeled with the integers 0 through N - 1 (where N is the number of columns in the data). Columns follow the order of the keys in the data (for a dictionary, this is its insertion order). All Index methods can be applied to both the row index and the columns.
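
A short sketch of these construction routes follows, including the default integer labels that appear when a NumPy array is passed without an index or column names. The variable names and the `r1`/`x` style labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# List of dictionaries: each dictionary is one row.
from_records = pd.DataFrame([
    {"Fruit": "Apple", "Quantity": 10},
    {"Fruit": "Banana", "Quantity": 20},
])

# Dictionary of equal-length lists: each key is one column.
from_columns = pd.DataFrame({"Fruit": ["Apple", "Banana"], "Quantity": [10, 20]})

# 2-D NumPy array without labels: rows and columns both get 0 to N - 1.
from_array = pd.DataFrame(np.arange(6).reshape(2, 3))

# The same array with explicit row and column labels.
labeled = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["x", "y", "z"],
)

print(from_array.columns)
print(labeled)
```

The first two forms produce the same table here; which one is more convenient usually depends on whether the source data is organized by row or by column.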

Additionally, DataFrames have a range of useful metadata attributes that help in understanding the data being manipulated. Common attributes include:

- `dtypes`: The data types of the columns.
- `shape`: The dimensionality of the DataFrame (rows, columns).
- `size`: The number of elements in the DataFrame.
- `index`: The row labels.
- `columns`: The column labels.
- `values`: The underlying NumPy array of values in the DataFrame.

In [11]:
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry'],
    'Quantity': [10, 20, 15, 30, 5],
    'Price': [5, 3, 1, 15, 20]
})
df

          Fruit  Quantity  Price
0         Apple        10      5
1        Banana        20      3
2        Cherry        15      1
3  Dragon Fruit        30     15
4    Elderberry         5     20


In [12]:
df.dtypes

Fruit       object
Quantity     int64
Price        int64
dtype: object

In [13]:
df.shape

(5, 3)

In [14]:
df.size

15

In [15]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [16]:
df.columns

Index(['Fruit', 'Quantity', 'Price'], dtype='object')

In [17]:
df.values

array([['Apple', 10, 5],
       ['Banana', 20, 3],
       ['Cherry', 15, 1],
       ['Dragon Fruit', 30, 15],
       ['Elderberry', 5, 20]], dtype=object)

With a DataFrame, you can perform various data transformations, including filtering, grouping, merging, and applying user-defined functions (UDFs) to rows or columns.
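
The sketch below touches each of these transformations once. The `sales` and `origins` tables, the `Origin` codes, and the `revenue` helper are hypothetical data and names invented for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Apple", "Cherry"],
    "Quantity": [10, 20, 5, 15],
    "Price": [5, 3, 5, 1],
})
origins = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Cherry"],
    "Origin": ["US", "EC", "TR"],
})

# Filtering: keep rows where the condition holds.
large_orders = sales[sales["Quantity"] >= 10]

# Grouping: total quantity per fruit.
quantity_per_fruit = sales.groupby("Fruit")["Quantity"].sum()

# Merging: a relational join on the shared "Fruit" column.
merged = sales.merge(origins, on="Fruit", how="left")


# Applying a user-defined function to each row (axis=1).
def revenue(row):
    return row["Quantity"] * row["Price"]


sales["Revenue"] = sales.apply(revenue, axis=1)

print(large_orders)
print(quantity_per_fruit)
print(merged)
print(sales)
```

For simple arithmetic like this, the vectorized form `sales["Quantity"] * sales["Price"]` is usually preferable to a row-wise `apply`; the UDF is used here only to illustrate the mechanism.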

## Quick Description Methods

Pandas objects have some basic methods that provide a quick description and overview of the data structure. The most commonly used are:

- `head()`: returns the first n rows of the object. By default, it returns the first 5 rows.
- `tail()`: returns the last n rows of the object. By default, it returns the last 5 rows.
- `info()`: provides a concise summary of the DataFrame, including the index dtype and column dtypes, non-null values, and memory usage.
- `describe()`: provides descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
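
These methods are not limited to DataFrames. As a small sketch, here is how `head()`, `tail()`, and `describe()` behave on a Series; the `prices` name is illustrative.

```python
import pandas as pd

prices = pd.Series([5, 3, 1, 15, 20], name="Price")

first_two = prices.head(2)    # first two elements
last_two = prices.tail(2)     # last two elements
summary = prices.describe()   # a Series keyed by statistic name

print(first_two)
print(last_two)
print(summary)
```

Because `describe()` returns a Series here, individual statistics can be read by label, for example `summary["mean"]`.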

In [18]:
df.head(3)

    Fruit  Quantity  Price
0   Apple        10      5
1  Banana        20      3
2  Cherry        15      1


In [19]:
df.tail(3)

          Fruit  Quantity  Price
2        Cherry        15      1
3  Dragon Fruit        30     15
4    Elderberry         5     20


In [20]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Fruit     5 non-null      object
 1   Quantity  5 non-null      int64 
 2   Price     5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 532.0 bytes


In [21]:
df.describe()

        Quantity      Price
count   5.000000   5.000000
mean   16.000000   8.800000
std     9.617692   8.258329
min     5.000000   1.000000
25%    10.000000   3.000000
50%    15.000000   5.000000
75%    20.000000  15.000000
max    30.000000  20.000000


## Index Objects

Pandas' `Index` objects are responsible for holding axis labels (including a DataFrame's column names) and other metadata (like axis names). They provide the infrastructure necessary for lookups, data alignment, and reindexing.

Any array or sequence of labels passed to the `index` parameter when creating a Series or DataFrame is converted to an `Index`.

In [22]:
data = {
    'A': [1, 2, 3],
    'B': [4, 5, 6],
}
index = ['row1', 'row2', 'row3']

df = pd.DataFrame(data, index=index)
df

      A  B
row1  1  4
row2  2  5
row3  3  6
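
An `Index` can also be built directly with `pd.Index` and then used for lookups and reindexing. The following is a minimal sketch; the `row_id` axis name and the missing `row9` label are made up for illustration.

```python
import pandas as pd

# Build the Index explicitly and give the axis a name.
labels = pd.Index(["row1", "row2", "row3"], name="row_id")
frame = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=labels)

# Lookup: translate a label into its integer position.
position = frame.index.get_loc("row2")

# Reindexing: reorder rows by label; labels not present are filled with NaN.
reindexed = frame.reindex(["row3", "row1", "row9"])

print(position)
print(reindexed)
```

The row for `row9` is all NaN because that label has no counterpart in the original index.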


An important characteristic of `Index` objects is their immutability, meaning they cannot be modified by the user once created.

This immutability ensures that sharing `Index` objects among data structures is safe, as changes to an index result in a new `Index` object rather than altering the original.

For example, consider the following:

In [23]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
index = ['row1', 'row2', 'row3']
df = pd.DataFrame(data, index=index)

print("Original DataFrame:")
print(df)
print("Original Index object ID:", id(df.index))

df.loc['row4'] = [7, 8]
print("\nDataFrame after adding a row:")
print(df)
print("New Index object ID:", id(df.index))

df = df.drop('row1')
print("\nDataFrame after dropping a row:")
print(df)
print("New Index object ID:", id(df.index))

Original DataFrame:
      A  B
row1  1  4
row2  2  5
row3  3  6
Original Index object ID: 4811346448

DataFrame after adding a row:
      A  B
row1  1  4
row2  2  5
row3  3  6
row4  7  8
New Index object ID: 4811350336

DataFrame after dropping a row:
      A  B
row2  2  5
row3  3  6
row4  7  8
New Index object ID: 4811354752
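
Immutability can also be observed directly. In this sketch, item assignment on an `Index` raises a `TypeError`, while methods such as `append` return a new object and leave the original untouched. The `labels` and `frame` names are illustrative.

```python
import pandas as pd

labels = pd.Index(["row1", "row2", "row3"])

# In-place item assignment is rejected.
try:
    labels[0] = "first"
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

# Methods that "change" an Index return a new object instead.
extended = labels.append(pd.Index(["row4"]))
print(list(labels))     # original labels are unchanged
print(list(extended))

# To relabel a DataFrame, replace the whole index rather than editing it.
frame = pd.DataFrame({"A": [1, 2, 3]}, index=labels)
frame.index = ["x", "y", "z"]
print(frame)
```

Replacing `frame.index` does not affect `labels`, which is why the same `Index` can be shared safely between several objects.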


Index objects are powerful, but many users never take advantage of their capabilities explicitly. Even so, understanding how they work is important, especially in relation to data alignment.

In pandas, operations such as arithmetic, selection, and filtering are based on data alignment. This means that these operations depend on the matching of indexes and columns between the involved DataFrames or Series. If the indexes or columns do not align, pandas will reindex the data structures to enforce alignment before performing the operation. Consequently, any mismatched labels will result in NaN values in the output to indicate the lack of corresponding data points.

By maintaining and aligning indices and columns, pandas ensures that data operations always preserve context. This prevents errors that can occur when working with heterogeneous or misaligned data, providing a more reliable and consistent data manipulation experience.

In [24]:
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=['a', 'b', 'c'])
df1

   A  B
a  1  4
b  2  5
c  3  6


In [25]:
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]}, index=['b', 'c', 'd'])
df2

   A   B
b  7  10
c  8  11
d  9  12


In [26]:
df1 + df2

      A     B
a   NaN   NaN
b   9.0  15.0
c  11.0  17.0
d   NaN   NaN
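
When NaN for unmatched labels is not the desired outcome, there are two common ways to handle it, sketched below with the same two frames. One uses the `fill_value` parameter of `DataFrame.add`; the other restricts both operands to their shared labels first. Which one is appropriate depends on what a missing row should mean in your data.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["a", "b", "c"])
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]}, index=["b", "c", "d"])

# Option 1: treat a label missing from one operand as 0.
filled = df1.add(df2, fill_value=0)

# Option 2: operate only on the labels present in both frames.
common = df1.index.intersection(df2.index)
overlap = df1.loc[common] + df2.loc[common]

print(filled)
print(overlap)
```

With `fill_value=0`, rows `a` and `d` keep the values from the single frame that contains them, while the intersection approach drops them entirely.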


# References

- [Python for Data Analysis by Wes McKinney (3e)](https://wesmckinney.com/book/)
- [Pandas Official Documentation](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Frequently Asked Questions (FAQ) on Pandas](https://pandas.pydata.org/docs/user_guide/gotchas.html)