
# Introduction to Pandas

**`pandas`** is a Python package providing ***fast, flexible, and expressive data structures*** designed to work with *relational* or *labeled* data or both.
<br></br>
<center>*It is a fundamental high-level building block for doing practical, real world data analysis in Python.*</center> 

Pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- **Hierarchical labeling** of axes
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5
- **Time series functionality**: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

In [1]:
from IPython.display import HTML
HTML('<iframe src="http://pandas.pydata.org" width="800" height="350"></iframe>')



In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set some Pandas options
pd.options.display.max_columns = 30
pd.options.display.max_rows = 20

## Pandas Data Structures

### Pandas Series and Data Storage

A **pandas.Series** is a one-dimensional labeled array that can hold data of any type (e.g., integers, floats, strings, or objects). Each element in a Series is associated with an index, making it easy to access and manipulate data.

#### **How Data is Stored in a Series**
1. **NumPy Array**: 
   - For standard data types such as integers and floats, the data in a Series is stored internally as a **NumPy array**.
   - NumPy arrays provide efficient storage and operations for homogeneous data (all elements of the same type).

2. **ExtensionArray**:
   - For specialized or custom data types (e.g., `category`, `Int64` for nullable integers, or `string`), the Series uses an **ExtensionArray** instead of a NumPy array.
   - ExtensionArrays are designed to handle features or data types not natively supported by NumPy.

#### **Key Points**
- The choice between a NumPy array and an ExtensionArray depends on the data type of the Series.
- NumPy arrays ensure high performance for numeric and homogeneous data.
- ExtensionArrays provide flexibility for working with specialized data types, while still supporting most pandas operations.
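
The storage distinction described above can be inspected directly. The following is a minimal sketch, assuming a reasonably recent pandas release; the variable names (`plain`, `nullable`, `cat`) are illustrative only and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A plain integer Series: the values can be obtained as a NumPy array
plain = pd.Series([1, 2, 3])
plain_values = plain.to_numpy()

# A nullable-integer Series and a categorical Series: backed by ExtensionArrays
nullable = pd.Series([1, 2, None], dtype="Int64")
cat = pd.Series(["a", "b", "a"], dtype="category")

print(type(plain_values).__name__, plain.dtype)
print(type(nullable.array).__name__, nullable.dtype)
print(type(cat.array).__name__, cat.dtype)
```

Note that the `Int64` Series keeps its integer dtype even though one value is missing, which a plain NumPy integer array cannot do.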

In [5]:
# Create an empty Series with an explicit dtype, then grow it by assigning to new labels
empty_series = pd.Series(dtype="float64")
empty_series[0] = 10.0
empty_series[1] = 20
empty_series[2] = 30

empty_series

0    10.0
1    20.0
2    30.0
dtype: float64

In [10]:
#help(pd.Series)

In [6]:
# Create an empty list to store data
data_list = []

# Loop to add data to the list
for i in range(1, 5):
    data_list.append(i * 200)

# Create a Series from the list
series_from_list = pd.Series(data_list)
series_from_list

0    200
1    400
2    600
3    800
dtype: int64

In [7]:
a = pd.Series([300, 500, 700, 900])
a

0    300
1    500
2    700
3    900
dtype: int64

In [8]:
type(a)

pandas.core.series.Series

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

Getting values out of a series:

In [9]:
a.values

array([300, 500, 700, 900])

In [10]:
a.values.dtype

dtype('int64')

In [11]:
for v in a.values:
    print(v)

300
500
700
900


In [12]:
c = a.values.tolist()
c

[300, 500, 700, 900]

Getting indexes of the series:

In [13]:
a.index

RangeIndex(start=0, stop=4, step=1)

In [14]:
for i in a.index:
    print(i)

0
1
2
3


In [15]:
a.index.dtype

dtype('int64')

In [16]:
s = pd.Series([1, 2, 3], ['a', 'b', 'c'])
print(s)

tuple(zip(s.index,s))

a    1
b    2
c    3
dtype: int64


(('a', 1), ('b', 2), ('c', 3))

In [17]:
s.index

Index(['a', 'b', 'c'], dtype='object')

### Example 1 - Bacteria

We can assign meaningful labels to the index, if they are available:

In [18]:
bt = pd.Series([200, 300, 400], index =["Firmicutes", "Proto", "Bacteroid"])
bt

Firmicutes    200
Proto         300
Bacteroid     400
dtype: int64

In [20]:
# pandas allows a non-unique index (duplicate labels)
bt1 = pd.Series([200, 300, 400], index =["Firmicutes", "Proto", "Proto"])
bt1

Firmicutes    200
Proto         300
Proto         400
dtype: int64

In [19]:
bt2 = pd.Series(index =["Firmicutes", "Proto", "Proto"], data = [200, 300, 400])
bt2

Firmicutes    200
Proto         300
Proto         400
dtype: int64

In [20]:
for i in bt.index:
    print(i)

Firmicutes
Proto
Bacteroid


In [21]:
for index, value in bt.items():
    print(f'Index: {index}, Value: {value}')

Index: Firmicutes, Value: 200
Index: Proto, Value: 300
Index: Bacteroid, Value: 400


In [22]:
bt.iloc[1]

300

In [23]:
bt2["Proto"]

Proto    300
Proto    400
dtype: int64
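
Because duplicate labels are allowed, a label lookup can return either a single value or a whole `Series`. The sketch below illustrates the difference and shows how `Index.is_unique` can be checked beforehand; the names `unique_labels` and `duplicate_labels` are illustrative and not part of the notebook.

```python
import pandas as pd

unique_labels = pd.Series([200, 300, 400], index=["Firmicutes", "Proto", "Bacteroid"])
duplicate_labels = pd.Series([200, 300, 400], index=["Firmicutes", "Proto", "Proto"])

# is_unique reports whether every label occurs exactly once
print(unique_labels.index.is_unique)
print(duplicate_labels.index.is_unique)

# A unique label gives back a scalar; a repeated label gives back a Series
single = unique_labels.loc["Proto"]
multiple = duplicate_labels.loc["Proto"]
print(single)
print(multiple)
```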

In [26]:
l = pd.Series([1,2,3,4], index = ['Sunday', "Monday", "Friday", "Tuesday"])

In [27]:
l.index

Index(['Sunday', 'Monday', 'Friday', 'Tuesday'], dtype='object')

In [28]:
l.index = ["ab", 'bc', 'cd', 'de']

In [29]:
l.index

Index(['ab', 'bc', 'cd', 'de'], dtype='object')

In [30]:
l.index = ["ab", 'bc', 'cd', 'de', 'me']

ValueError: Length mismatch: Expected axis has 4 elements, new values have 5 elements

In [31]:
l["ab"]=5

In [32]:
l

ab    5
bc    2
cd    3
de    4
dtype: int64

In [33]:
l["ab"]='5a'

In [34]:
l

ab    5a
bc     2
cd     3
de     4
dtype: object
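
Assigning a string into an integer `Series`, as above, changes the dtype of the whole `Series` to `object`. Recent pandas releases warn about this kind of implicit upcast, so one option is to convert explicitly first. This is a sketch under that assumption; `days` and `mixed` are illustrative names.

```python
import pandas as pd

days = pd.Series([1, 2, 3, 4], index=["ab", "bc", "cd", "de"])

# Convert to object dtype explicitly before storing a non-numeric value
mixed = days.astype(object)
mixed["ab"] = "5a"

print(days.dtype)   # the original Series is unchanged
print(mixed.dtype)
print(mixed)
```

The original `days` keeps its integer dtype because `astype` returns a new `Series`.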

In [35]:
l = pd.Series([1,2,3,4], index = ['Sunday', "Monday", "Friday", "Tues"])
l

Sunday    1
Monday    2
Friday    3
Tues      4
dtype: int64

These labels can be used to refer to the values in the `Series`.

In [36]:
[i for i in l.index if not i.endswith('day')]

['Tues']

In [37]:
[name.endswith("day") for name in l.index]

[True, True, True, False]

In [38]:
l[[name.endswith("day") for name in l.index]]

Sunday    1
Monday    2
Friday    3
dtype: int64

In [39]:
l[[True, True, True, False]]

Sunday    1
Monday    2
Friday    3
dtype: int64
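
The boolean masks built with list comprehensions above can also be produced with the index's `.str` accessor, which applies a string method to every label at once. A minimal sketch, with `days` and `mask` as illustrative names:

```python
import pandas as pd

days = pd.Series([1, 2, 3, 4], index=["Sunday", "Monday", "Friday", "Tues"])

# Vectorized equivalent of [name.endswith("day") for name in days.index]
mask = days.index.str.endswith("day")

selected = days[mask]    # labels ending in "day"
rest = days[~mask]       # ~ inverts the boolean mask
print(selected)
print(rest)
```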

In [40]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

In [41]:
bacteria['Actinobacteria']

569

In [42]:
bacteria[bacteria.values == 569]

Actinobacteria    569
dtype: int64

In [43]:
bacteria[bacteria == 569]

Actinobacteria    569
dtype: int64

In [44]:
bacteria[bacteria == 569].index

Index(['Actinobacteria'], dtype='object')

In [46]:
index_value = bacteria[bacteria == 569].index[0]
index_value

'Actinobacteria'

In [45]:
index_type = bacteria[bacteria == 569].index.dtype
index_type

dtype('O')

In [47]:
[name.startswith('Bact') for name in bacteria.index]

[False, False, False, True]

In [48]:
[name.lower().startswith('bact') for name in bacteria.index]

[False, False, False, True]

In [49]:
bacteria[[name.lower().endswith("bacteria") for name in bacteria.index]]

Proteobacteria    1638
Actinobacteria     569
dtype: int64

Notice that the indexing operation preserved the association between the values and the corresponding indices.

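We can still use positional indexing if we wish. With a label-based index, `.iloc` makes the positional intent explicit; recent pandas versions warn when a plain integer key such as `bacteria[0]` is treated as a position.

In [50]:
bacteria.iloc[0]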

632

In [51]:
bacteria.iloc[1]

1638
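
Label-based and positional access can be kept apart explicitly with `.loc` and `.iloc`. The sketch below rebuilds the bacteria counts under the illustrative name `counts` and also shows that label slices include the end label while positional slices exclude the end position.

```python
import pandas as pd

counts = pd.Series(
    [632, 1638, 569, 115],
    index=["Firmicutes", "Proteobacteria", "Actinobacteria", "Bacteroidetes"],
)

by_label = counts.loc["Proteobacteria"]   # lookup by index label
by_position = counts.iloc[1]              # lookup by integer position

# Label slices include both endpoints; positional slices exclude the stop
label_slice = counts.loc["Firmicutes":"Actinobacteria"]
position_slice = counts.iloc[0:2]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```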

We can give both the array of values and the index meaningful labels themselves:

In [52]:
l.index = [1,2,3,4]
l.index

Index([1, 2, 3, 4], dtype='int64')

In [53]:
l.name = 'column name'

In [54]:
l.index.name = 'index name'

In [55]:
l.name

'column name'

In [56]:
l

index name
1    1
2    2
3    3
4    4
Name: column name, dtype: int64

In [57]:
bacteria.name = 'counts'
bacteria.index.name = 'phylum'
bacteria

phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [58]:
np.sqrt(np.sum(bacteria))

54.35071296680477

In [59]:
np.sum(l)

10

In [60]:
np.log(bacteria)

phylum
Firmicutes        6.448889
Proteobacteria    7.401231
Actinobacteria    6.343880
Bacteroidetes     4.744932
Name: counts, dtype: float64

In [61]:
bacteria

phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

In [62]:
bacteria.index.str

<pandas.core.strings.accessor.StringMethods at 0x74f15a508dd0>

In [63]:
# Element-wise boolean operators on Series: & (and), | (or), ~ (not).
# Python's `and`, `or`, and `not` do not work element-wise on a Series.

In [64]:
bacteria[(bacteria>100) & (bacteria%2==0) & (bacteria.index.str.endswith("bacteria"))]

# TRUE & TRUE  & FALSE = FALSE
# TRUE & TRUE  & TRUE  = TRUE
# TRUE & FALSE & TRUE  = FALSE
# TRUE & FALSE & FALSE = FALSE

phylum
Proteobacteria    1638
Name: counts, dtype: int64

In [65]:
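# Without parentheses around bacteria%2==0, & binds tighter than ==, so this is
# evaluated as ((bacteria>100) & (bacteria%2)) == (0 & mask) and selects different rows
bacteria[(bacteria>100) & bacteria%2==0 & (bacteria.index.str.endswith("bacteria"))]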

phylum
Firmicutes         632
Proteobacteria    1638
Name: counts, dtype: int64

In [66]:
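# With no parentheses at all, the expression becomes a chained comparison,
# which needs the truth value of a whole Series and therefore raises
bacteria[bacteria>100 & bacteria%2==0 & bacteria.index.str.endswith("bacteria")]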

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
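
The differing results and the error above come from operator precedence: `&` binds more tightly than comparison operators such as `>` and `==`. One way to avoid the problem is to name each condition before combining them. This is a minimal sketch; `counts`, `is_large`, `is_even`, and `ends_with_bacteria` are illustrative names.

```python
import pandas as pd

counts = pd.Series(
    [632, 1638, 569, 115],
    index=["Firmicutes", "Proteobacteria", "Actinobacteria", "Bacteroidetes"],
)

# Build each boolean mask separately so precedence cannot interfere
is_large = counts > 100
is_even = counts % 2 == 0
ends_with_bacteria = counts.index.str.endswith("bacteria")

# Combine the masks element-wise
result = counts[is_large & is_even & ends_with_bacteria]
print(result)
```

Naming the masks gives the same result as the fully parenthesized expression and keeps each condition readable.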

In [67]:
bacteria[bacteria>500]

phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Name: counts, dtype: int64

A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [68]:
ll = pd.Series([1,2,3,4], index = ["a", "b", "c", "d"])
ll

a    1
b    2
c    3
d    4
dtype: int64

In [69]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
x=pd.Series(bacteria_dict)
x

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

In [70]:
type(x)

pandas.core.series.Series

In [71]:
dictionary = {'a':'1', 'b':2, 'c':3}
l = pd.Series(dictionary)
l

a    1
b    2
c    3
dtype: object

In [72]:
a = list(range(1,11))
a

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [73]:
b = list(range(1, 21, 2))
b

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

In [74]:
ll = pd.Series(b, index = a)
ll

1      1
2      3
3      5
4      7
5      9
6     11
7     13
8     15
9     17
10    19
dtype: int64

In [75]:
b1 = pd.Series(dict(zip(a,b)))
b1

1      1
2      3
3      5
4      7
5      9
6     11
7     13
8     15
9     17
10    19
dtype: int64

Notice that the `Series` keeps the keys in the order in which they appear in the dict; in this example that order also happens to be sorted.

If we pass a custom index to `Series`, it will select the corresponding values from the dict and treat index labels without corresponding values as missing. Pandas uses the floating-point value `NaN` (not a number) to mark missing values, which is why the result below has dtype `float64`.

In [76]:
bacteria_dict

{'Firmicutes': 632,
 'Proteobacteria': 1638,
 'Actinobacteria': 569,
 'Bacteroidetes': 115}

In [78]:
bacteria2 = pd.Series(bacteria_dict, index=['Firmicutes','Proteobacteria','Actinobacteria','Bacteroidetes', 'Cyanobacteria'])
bacteria2

Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
Bacteroidetes      115.0
Cyanobacteria        NaN
dtype: float64

In [79]:
bacteria2.isnull().sum()

1

In [80]:
bacteria3 = pd.Series([1,2,3], index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])

ValueError: Length of values (3) does not match length of index (4)

Critically, the labels are used to **align data** when used in operations with other Series objects:

In [81]:
bacteria

phylum
Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
Name: counts, dtype: int64

In [82]:
bacteria2

Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
Bacteroidetes      115.0
Cyanobacteria        NaN
dtype: float64

In [83]:
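# Any arithmetic involving NaN yields NaN: NaN + 5, 5 + NaN, and NaN + NaN are all NaN.
# A label that is missing from one of the two Series is treated as NaN as well.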
bacteria + bacteria2

Actinobacteria    1138.0
Bacteroidetes      230.0
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64

In [84]:
bacteria.equals(bacteria2)

False

In [85]:
a1=pd.Series([3,4,5,6], index = ["a","b", "c", "d"])
a2=pd.Series([3,4,5,6], index = ["a","b", "c", "d"])

a1.equals(a2)

True

In [86]:
a1+a2

a     6
b     8
c    10
d    12
dtype: int64

In [87]:
print(bacteria.index.tolist())
print(bacteria2.index.tolist())

['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']
['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes', 'Cyanobacteria']


Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position. Adding two Series instead combines the values that share a label, and the result is indexed by the union of the labels. Notice also that the missing values were propagated by the addition.
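
The contrast between positional and label-based addition can be shown with two small Series whose labels are in different orders. This is a minimal sketch; `left`, `right`, and `partial` are illustrative names, and `Series.add` with `fill_value` is shown as one option for treating absent labels as zero instead of propagating `NaN`.

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])

# NumPy adds by position and ignores the labels
positional = left.to_numpy() + right.to_numpy()

# pandas adds values that share a label
aligned = left + right

# A label missing from one operand produces NaN ...
partial = pd.Series([100], index=["a"])
with_nan = left + partial

# ... unless a fill value is supplied for the missing side
filled = left.add(partial, fill_value=0)

print(positional)
print(aligned)
print(with_nan)
print(filled)
```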