Pandas is an open source Python library for data analysis. It is currently the reference tool for
professionals who use Python to study and analyze data sets for statistics and decision making.
The library was designed and developed primarily by Wes McKinney starting in 2008; in 2012 one of his
colleagues, Chang She, joined the development. Together they built one of the most widely used libraries
in the Python community.
Pandas arose from the need for a library dedicated to data analysis that provides, in the simplest
possible way, all the tools for data processing, data extraction, and data manipulation.
The package is built on top of the NumPy library. This choice was critical to the success and rapid
spread of pandas: it makes the library compatible with most other modules and lets it take advantage of
NumPy's high computational performance.
Another fundamental choice was to design ad hoc data structures for data analysis. Instead of using the
data structures built into Python or provided by other libraries, two new ones were developed.
These data structures are designed to work with relational or labeled data, allowing you to manage
data with features similar to those of SQL relational databases and Excel spreadsheets.
Throughout the book we will see a series of basic data analysis operations that are normally performed
on database tables or spreadsheets. Pandas provides an extensive set of functions and methods for
performing these operations, in many cases very efficiently.
The main purpose of pandas is therefore to provide all the building blocks for anyone approaching the
world of data analysis.

In [2]:
conda list pandas

# packages in environment at /home/festus/.config/jupyterlab-desktop/envs/env_1:
#
# Name                    Version                   Build  Channel
pandas                    2.2.2           py312h1d6d2e6_1    conda-forge



In [1]:
import pandas as pd
import numpy as np

Introduction to pandas Data Structures

At the heart of pandas are two primary data structures, around which all the operations typically
performed during data analysis revolve:

• Series

• DataFrame

The Series is the pandas object designed to represent one-dimensional data, similar to an array but
with some additional features. Its internal structure is simple: it is composed of two arrays associated
with each other. The main array holds the data (of any NumPy type), and each of its elements is
associated with a label contained in the other array, called the Index.

Declaring a Series

To create a Series, simply call the Series() constructor, passing as an argument an
array containing the values to be included in it.

In [7]:
s = pd.Series([12,-4,7,9])
s

0    12
1    -4
2     7
3     9
dtype: int64

As you can see from the output, the labels stored in the Index appear on the left and the corresponding
values on the right.
If you do not specify an index when defining the Series, pandas assigns increasing integers starting
from 0 as labels. In this case the labels coincide with the positions of the elements within the
Series object.
Often, however, it is preferable to create a Series with meaningful labels, so that each item can be
identified regardless of the order in which it was inserted.
In this case, pass the index option to the constructor, assigning to it an array of strings containing
the labels.

In [7]:
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s

a    12
b    -4
c     7
d     9
dtype: int64
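
As a minimal sketch of the two ways of declaring a Series described above, the following compares the default labels with explicit ones. The variable names `default_s` and `labeled_s` are illustrative and do not appear elsewhere in this chapter.

```python
import pandas as pd

# Without an index argument, pandas generates the positional labels 0..n-1.
default_s = pd.Series([12, -4, 7, 9])
print(list(default_s.index))   # [0, 1, 2, 3]

# With explicit labels, there must be exactly one label per value.
labeled_s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])
print(list(labeled_s.index))   # ['a', 'b', 'c', 'd']

# A mismatch between the number of values and labels raises a ValueError.
try:
    pd.Series([12, -4, 7, 9], index=['a', 'b', 'c'])
except ValueError as exc:
    print("ValueError:", exc)
```

The values are identical in both cases; only the labels used to address them differ.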

If you want to see individually the two arrays that make up this data structure, you can use two
attributes of the Series: values and index.

In [10]:
s.values

array([12, -4,  7,  9])

In [11]:
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [14]:
# To select an individual element by position, use the iloc attribute.
# (On a Series with string labels, a plain integer key such as s[2] is deprecated for positional access.)
s.iloc[2]

7

In [15]:
# Or you can select an element by specifying its label.
s['b']

-4

In [16]:
# Just as with a NumPy array, you can select multiple items with a slice of positions:
s[0:2]

a    12
b    -4
dtype: int64

In [17]:
# Or you can use the corresponding labels, passing them as a list.
s[['b','c']]

b   -4
c    7
dtype: int64
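
The selections above mix positions and labels in the same bracket syntax. A hedged sketch of the explicit alternative, using the `iloc` and `loc` indexers, is shown below; the names `by_position`, `by_label`, `pos_slice`, and `label_slice` are illustrative only.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

by_position = s.iloc[2]        # third element, whatever its label is
by_label = s.loc['c']          # element whose label is 'c'

# Positional slices exclude the end point; label slices include it.
pos_slice = s.iloc[0:2]        # labels 'a' and 'b'
label_slice = s.loc['a':'c']   # labels 'a', 'b' and 'c'

print(by_position, by_label)
print(list(pos_slice.index), list(label_slice.index))
```

Spelling out `iloc` or `loc` makes it unambiguous whether a key is a position or a label, which matters most when the labels are themselves integers.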

In [3]:
s

a    12
b    -4
c     7
d     9
dtype: int64

Now that you understand how to select individual elements, you also know how to assign new values to
them: select the element by position (with iloc) or by label, and assign the new value to it.

In [14]:
s.iloc[1] = 0

In [15]:
s

a    12
b     0
c     7
d     9
dtype: int64

In [16]:
s['b'] = 1

In [18]:
s

a    12
b     1
c     7
d     9
dtype: int64
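
Assignment is not limited to a single element. As a small sketch under the same assumptions as the cells above, the following assigns to several labels at once and then to every element that satisfies a condition. It works on its own copy of the data, named `t` here purely for illustration.

```python
import pandas as pd

t = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

# Assign the same value to several labels in one statement.
t.loc[['a', 'd']] = 0

# Assign to every element that satisfies a condition.
t[t < 0] = 0

print(t.tolist())
```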

You can also define a new Series starting from a NumPy array or from an existing Series.

In [2]:
arr = np.array([1,2,3,4])
s3 = pd.Series(arr)

In [8]:
s3

0    1
1    2
2    3
3    4
dtype: int64

In [9]:
s4 = pd.Series(s)
s4

a    12
b     1
c     7
d     9
dtype: int64

When doing this, however, keep in mind that, with the pandas version used here, the values contained in
the NumPy array or in the original Series are not copied by default but passed by reference. The new
Series is a view on the same data: if the original object changes, for example because one of its
elements is assigned a new value, the change also appears in the new Series object.

In [10]:
s3

0    1
1    2
2    3
3    4
dtype: int64

In [11]:
arr[2] = -2
s3

0    1
1    2
2   -2
3    4
dtype: int64

As we can see in this example, by changing the third element of the arr array we also modified the
corresponding element in the s3 Series.
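
If you need a Series that does not follow later changes to its source, one option is to request a copy explicitly through the constructor's `copy` parameter. The sketch below is an illustration only; whether the default constructor call shares memory with the array can depend on the pandas version and on its Copy-on-Write settings, so the explicit copy is the safer way to state the intent. The name `independent` is hypothetical.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4])

# copy=True asks the constructor to store its own copy of the data.
independent = pd.Series(arr, copy=True)

arr[2] = -2                    # modify the source array afterwards

print(arr.tolist())            # the array reflects the change
print(independent.tolist())    # the copied Series does not
```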

Because pandas and its data structures are built on the NumPy library, many operations applicable to
NumPy arrays extend to the Series. One of these is filtering the values contained in the data structure
through conditions.
For example, to find which elements of the Series have a value greater than 8, write the following:

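In [12]:
# Restore the original values, since the element labeled 'b' was reassigned above
s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
s[s > 8]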

a    12
d     9
dtype: int64
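
The filter above works in two steps: the comparison produces a Boolean Series, which is then used to select elements. The following sketch, built on the original values of the Series, makes the intermediate mask visible and shows how two conditions can be combined. The names `mask`, `selected`, and `between` are illustrative.

```python
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

# The comparison is evaluated element by element and keeps the labels.
mask = s > 8
print(mask.tolist())           # [True, False, False, True]

# Indexing with the mask keeps only the elements marked True.
selected = s[mask]
print(selected.tolist())       # [12, 9]

# Conditions are combined with & (and) or | (or), each in parentheses.
between = s[(s > 0) & (s < 10)]
print(between.tolist())        # [7, 9]
```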

Other operations, such as the arithmetic operators (+, -, *, /) and the mathematical functions applicable
to NumPy arrays, extend to Series objects as well.
For the operators, simply write the arithmetic expression.

In [14]:
s / 2

a    6.0
b   -2.0
c    3.5
d    4.5
dtype: float64

For the NumPy mathematical functions, however, you must call the function through the np namespace and
pass the Series instance as an argument. Because the logarithm of a negative number is undefined, the
element labeled b, which holds -4, yields NaN.

In [18]:
np.log(s)

a    2.484907
b         NaN
c    1.945910
d    2.197225
dtype: float64
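
As a brief sketch of what the NumPy function returns, the following checks that the result is still a Series with the same labels and locates the undefined entry with `isna()`. The `np.errstate` context manager is used here only to silence the warning NumPy emits for the negative input; the name `logs` is illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, -4, 7, 9], index=['a', 'b', 'c', 'd'])

# The logarithm of a negative number is invalid; ignore the warning locally.
with np.errstate(invalid='ignore'):
    logs = np.log(s)

print(type(logs).__name__)     # the result is a Series, not a bare array
print(list(logs.index))        # the labels are preserved
print(logs.isna().tolist())    # True marks the undefined result
```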

A Series often contains duplicate values, and you may need information about the samples it contains:
which distinct values occur, how many times each occurs, and whether a given value is present.
To illustrate this, declare a Series containing many duplicate values.

In [19]:
serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])
serd

white     1
white     0
blue      2
green     1
green     2
yellow    3
dtype: int64

To get all the values contained in the Series excluding duplicates, use the unique() method.
The return value is an array containing the unique values in order of first appearance, not sorted.

In [20]:
serd.unique()

array([1, 0, 2, 3])
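
Since the unique values are not returned in sorted order, a short sketch of how to sort them explicitly, and how to count them, may be useful. The names `uniq`, `sorted_uniq`, and `n_distinct` are illustrative.

```python
import numpy as np
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

uniq = serd.unique()           # distinct values, in order of first appearance
sorted_uniq = np.sort(uniq)    # sort explicitly when the order matters
n_distinct = serd.nunique()    # number of distinct values

print(uniq.tolist())           # [1, 0, 2, 3]
print(sorted_uniq.tolist())    # [0, 1, 2, 3]
print(n_distinct)              # 4
```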

A method similar to unique() is value_counts(), which returns not only the unique values but also the
number of times each occurs in the Series.

In [21]:
serd.value_counts()

1    2
2    2
0    1
3    1
Name: count, dtype: int64
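
In the result of value_counts(), the distinct values become the labels and the counts become the values. The following sketch reads individual counts by label, converts the counts to relative frequencies with the `normalize` parameter, and reorders the result by value. The names `counts`, `shares`, and `by_value` are illustrative.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

counts = serd.value_counts()
print(counts.loc[1], counts.loc[3])    # 2 1

# normalize=True returns relative frequencies instead of counts.
shares = serd.value_counts(normalize=True)
print(round(shares.loc[1], 3))         # 0.333

# Sort by the distinct values rather than by frequency.
by_value = counts.sort_index()
print(by_value.tolist())               # [1, 2, 2, 1]
```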

Finally, isin() evaluates membership: given a list of values, it tells you, element by element, whether
each value of the Series is contained in that list. The Boolean values returned are very useful for
filtering data in a Series or in a column of a DataFrame.

In [22]:
serd.isin([0,3])

white     False
white      True
blue      False
green     False
green     False
yellow     True
dtype: bool

In [23]:
serd[serd.isin([0,3])]

white     0
yellow    3
dtype: int64
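
The same Boolean mask can be inverted to exclude values instead of selecting them. A minimal sketch, using the `~` operator on the result of isin(), follows; the names `mask` and `excluded` are illustrative.

```python
import pandas as pd

serd = pd.Series([1, 0, 2, 1, 2, 3],
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

mask = serd.isin([0, 3])

# ~ negates the mask, keeping the elements that are NOT in the list.
excluded = serd[~mask]

print(excluded.tolist())       # [1, 2, 1, 2]
print(list(excluded.index))    # ['white', 'blue', 'green', 'green']
```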