Pandas is an open source Python library specialized for data analysis. It has become the reference
tool for professionals who use Python to study and analyze data sets for statistical purposes and
decision making.
The library was designed and developed primarily by Wes McKinney, starting in 2008; in
2012 one of his colleagues, Chang She, joined the development. Together they built one of the most
used libraries in the Python community.
Pandas arose from the need for a dedicated data analysis library that provides, in the
simplest possible way, all the tools for data processing, data extraction, and data manipulation.
This Python package is built on top of the NumPy library, a choice that was critical
to the success and rapid spread of pandas. It not only makes the library compatible with
most other modules, but also lets it take advantage of the high computational performance of the
NumPy module.
Another fundamental choice was to design ad hoc data structures for data analysis.
Instead of using the existing data structures built into Python or provided by other libraries, pandas
introduces two new ones.
These data structures are designed to work with relational or labeled data, allowing you to manage
data with features similar to those of SQL relational databases and Excel spreadsheets.
Throughout the book we will see a series of basic data analysis operations normally performed on
database tables or spreadsheets. Pandas provides an extensive set of functions and methods that
let you perform these operations, often in the most effective way.
The main purpose of pandas, then, is to provide all the building blocks for anyone approaching the world
of data analysis.

In [2]:
conda list pandas

# packages in environment at /home/festus/.config/jupyterlab-desktop/envs/env_1:
#
# Name                    Version                   Build  Channel
pandas                    2.2.2           py312h1d6d2e6_1    conda-forge

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import numpy as np

Introduction to pandas Data Structures

At the heart of pandas are two primary data structures, around which all the operations typically
performed during data analysis are centered:

• Series

• DataFrame

The Series is the pandas object designed to represent one-dimensional data structures,
similar to an array but with some additional features. Its internal structure is simple: it is
composed of two arrays associated with each other. The main array holds the data (of
any NumPy type), and each of its elements is associated with a label contained in the other array, called
the Index.

Declaring a Series

To create a Series, simply call the Series() constructor, passing as an argument an
array (or list) containing the values to be included in it.

In [7]:
s = pd.Series([12,-4,7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

As you can see from the output of the Series, the left column shows the values of the Index, which is a
series of labels, and the right column shows the corresponding values.
If you do not specify an index when defining the Series, pandas assigns by default
integer labels increasing from 0. In this case the labels coincide with the positions of the
elements within the Series object.
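
The default labeling described above can be checked with a small sketch. It assumes a recent pandas version, where the automatically generated index is a RangeIndex; the variable name `s` simply follows the examples in this section.

```python
import pandas as pd

# No index argument: pandas generates integer labels 0, 1, 2, ...
s = pd.Series([12, -4, 7, 9])

print(s.index)        # the automatically generated index
print(list(s.index))  # the labels as plain integers

# With the default index, label 2 and position 2 refer to the same element.
print(s[2] == s.iloc[2])
```

Because labels and positions coincide here, `s[2]` and `s.iloc[2]` return the same value; this stops being true as soon as custom labels are supplied.
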
Often, however, it is preferable to create a Series with meaningful labels, so that each item can be
identified regardless of the order in which it was inserted into the Series.
To do so, pass the index option in the constructor call, assigning it an
array of strings containing the labels.

In [2]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s

a    12
b    -4
c     7
d     9
dtype: int64

If you want to see individually the two arrays that make up this data structure, you can use two
attributes of the Series: index and values.

In [10]:
s.values

array([12, -4,  7,  9])

In [11]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')
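
As a hedged illustration of how these two arrays relate to NumPy, the sketch below extracts the data with `to_numpy()` (an alternative to the `values` attribute) and applies a NumPy universal function to the Series. The names `values`, `labels`, and `magnitudes` are chosen only for this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

values = s.to_numpy()      # the data as a NumPy array
labels = s.index.tolist()  # the labels as a plain Python list

print(type(values).__name__)
print(labels)

# NumPy universal functions accept a Series and keep its labels.
magnitudes = np.abs(s)
print(magnitudes)
```

The result of `np.abs(s)` is again a Series with the labels a, b, c, d, which shows how pandas builds on NumPy while preserving the index.
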

In [14]:
# To select an individual element by position, use the iloc attribute.
# Plain s[2], as with an ordinary NumPy array, is no longer recommended: on a Series with
# string labels, recent pandas versions (such as the 2.2.2 used here) emit a deprecation warning.
s.iloc[2]

7

In [15]:
# Alternatively, select the element by its label.
s['b']

-4
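
The two kinds of selection can be made explicit with the `iloc` and `loc` accessors. This is a minimal sketch: `by_position`, `by_label`, and `last` are illustrative names, and `loc` is shown as the explicit counterpart of the `s['b']` form used above.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[2]  # third element, counting positions from 0
by_label = s.loc['b']    # element whose label is 'b'
last = s.iloc[-1]        # negative positions count from the end

print(by_position, by_label, last)
```

Using `iloc` for positions and `loc` for labels avoids any ambiguity about how a key is interpreted.
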

In [16]:
# Just as you select multiple items from a NumPy array, you can take a slice by position:
s[0:2]

a    12
b    -4
dtype: int64
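
One detail worth sketching here is how the two slice styles treat the end point: a positional slice excludes the stop position, while a label slice with `loc` includes the stop label. The variable names below are illustrative only.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[0:2]  # positions 0 and 1; the stop position is excluded
by_label = s.loc['a':'b']  # labels 'a' through 'b'; the stop label is included

# Both slices select the same two elements.
print(by_position.equals(by_label))
```
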

In [17]:
# You can also select several elements by label, passing a list of labels inside the brackets.
s[['b','c']]

b   -4
c    7
dtype: int64
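
The same multiple selection can be written with explicit accessors. The sketch below, using illustrative variable names, shows the label-based and position-based forms side by side and that the result follows the order of the list you pass.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

by_labels = s.loc[['b', 'c']]  # select by a list of labels
by_positions = s.iloc[[1, 2]]  # select the same elements by position
print(by_labels.equals(by_positions))

# The result follows the order of the list, not the order of the Series.
reordered = s.loc[['d', 'a']]
print(reordered.tolist())
```
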

In [3]:
s

a    12
b    -4
c     7
d     9
dtype: int64

Now that you know how to select individual elements, you also know how to assign new values to
them: select the element by position (with iloc) or by label and assign to it.

In [14]:
s.iloc[1] = 0

In [15]:
s

a    12
b     0
c     7
d     9
dtype: int64

In [16]:
s['b'] = 1

In [18]:
s

a    12
b     1
c     7
d     9
dtype: int64
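
As a compact recap, and as a sketch rather than the only way to do it, the following block repeats both assignment styles on a fresh Series and adds one variation not shown above: assigning to several labels at once through `loc`. The values 70 and 90 are arbitrary.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

s.iloc[1] = 0                 # assign by position
s.loc['b'] = 1                # assign by label (overwrites the previous line)
s.loc[['c', 'd']] = [70, 90]  # assign to several labels in one statement

print(s)
```

Each assignment modifies the Series in place, so the final Series holds 12, 1, 70, and 90 under the labels a to d.
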