# 2.1.3 Pandas intro

The Pandas library is a core part of the Python data science ecosystem.
It provides easy-to-use data structures and data analysis tools.

Pandas has some great resources for getting started, including guides
tailored to those familiar with other software for manipulating data.
See the [getting started pages](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started).

For now, we’ll stick just to what we need for this course.

In [1]:
import pandas as pd

## Structures

Pandas has two main **labelled** data structures:
- `Series`: a one-dimensional labelled array


In [2]:
s = pd.Series([0.3, 4, 1, None, 9])
print(s)

0    0.3
1    4.0
2    1.0
3    NaN
4    9.0
dtype: float64
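The left-hand column in the output above is the index, which holds the label for each value. By default the labels are 0, 1, 2, and so on, but we can supply our own. A minimal sketch, in which the name `temps` and its labels and values are made up purely for illustration:

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])

print(temps.index.tolist())  # the labels
print(temps["tue"])          # look up a value by its label
```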


- `DataFrame`: a two-dimensional labelled table, in which each column is a `Series`

In [3]:
import numpy as np
df = pd.DataFrame(np.random.randn(10,2), index=np.arange(3, 13), columns=["random_A", "random_B"]) 
df

|    | random_A  | random_B  |
|----|-----------|-----------|
| 3  | 1.677205  | -1.948433 |
| 4  | 1.543037  | -0.086532 |
| 5  | 0.689541  | 1.616827  |
| 6  | -0.414047 | 0.430653  |
| 7  | 0.155389  | -0.396176 |
| 8  | 0.126505  | 0.593522  |
| 9  | 0.807657  | -0.555792 |
| 10 | 0.282047  | 0.832545  |
| 11 | -0.369983 | 0.672435  |
| 12 | -1.738524 | 0.74881   |



Once we have data in these Pandas structures, we can perform some useful
operations such as:

- `info()` (`DataFrame` only): prints a concise summary of a `DataFrame`
- `value_counts()`: returns a `Series` containing counts of unique values in the structure


In [4]:
s = pd.Series(np.random.randint(0,2,10))
print(s)
print() # blank line 
print("value counts:")
print(s.value_counts())

0    0
1    0
2    0
3    1
4    1
5    0
6    0
7    1
8    0
9    1
dtype: int64

value counts:
0    6
1    4
dtype: int64
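The cell above demonstrates `value_counts()`. For `info()`, a minimal sketch follows. The frame `demo` is a hypothetical stand-in with the same shape as the DataFrame built earlier, and the exact text printed may vary between Pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, similar in shape to the one built earlier
demo = pd.DataFrame(
    np.random.randn(10, 2),
    index=np.arange(3, 13),
    columns=["random_A", "random_B"],
)

# info() prints its summary (index, columns, dtypes, non-null counts)
# directly and returns None, so there is nothing useful to assign
result = demo.info()
```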



We’ll see more on how to use these structures, and other Pandas
capabilities, later.

## Indexing

Again, we’re just covering some basics here. For a complete guide to
indexing in Pandas, see the
[Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
section of the Pandas user guide.

Pandas allows us to use the same basic `[]` indexing and `.` attribute
access that we’re used to with Python and NumPy. However, Pandas also
provides the (often preferred) `.loc` indexer for label-based indexing
and the `.iloc` indexer for position-based indexing.

### `[]` Indexing

For basic `[]` indexing, we can select columns from a DataFrame and
items from a Series.

In [5]:
# Basic indexing using `[]` on DataFrame

# select a single column
print("single column from DataFrame, gives us a Series:")
display(df["random_A"])

# select two columns

print("two columns from DataFrame, gives us a DataFrame:")
display(df[["random_A", "random_B"]])

# and for a Series

# select single item

print("single item from Series, gives us an item (of type numpy.int64,in this case):") 
display(s[2])

# select two items

print("two items from Series, gives us a Series:")
display(s[[2,4]])


single column from DataFrame, gives us a Series:


3     1.677205
4     1.543037
5     0.689541
6    -0.414047
7     0.155389
8     0.126505
9     0.807657
10    0.282047
11   -0.369983
12   -1.738524
Name: random_A, dtype: float64

two columns from DataFrame, gives us a DataFrame:


|    | random_A  | random_B  |
|----|-----------|-----------|
| 3  | 1.677205  | -1.948433 |
| 4  | 1.543037  | -0.086532 |
| 5  | 0.689541  | 1.616827  |
| 6  | -0.414047 | 0.430653  |
| 7  | 0.155389  | -0.396176 |
| 8  | 0.126505  | 0.593522  |
| 9  | 0.807657  | -0.555792 |
| 10 | 0.282047  | 0.832545  |
| 11 | -0.369983 | 0.672435  |
| 12 | -1.738524 | 0.74881   |


single item from Series, gives us an item (of type numpy.int64,in this case):


0

two items from Series, gives us a Series:


2    0
4    1
dtype: int64
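Because `s` has the default index 0, 1, 2, and so on, its labels happen to coincide with positions. A small sketch with a hypothetical Series (`labelled`, not part of the original notebook) whose integer labels differ from positions suggests that `[]` on a Series with an integer index looks items up by label:

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from positions
labelled = pd.Series([10, 20, 30], index=[2, 4, 6])

print(labelled[2])        # label 2 is the first item, not the third
print(labelled[[2, 4]])   # a list of labels gives back a Series
```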

Note that we can't do:

In [6]:
import traceback # library used to print error without breaking python
try: 
    display(df[5])
except KeyError as e:
    traceback.print_exc()

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 5

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/folders/f4/m42_v8vj1fjc0wrwtb69xjt80000gr/T/ipykernel_16330/1043223323.py", line 3, in <cell line: 2>
    display(df[5])
  File "/usr/local/Caskroom/miniforge/base/envs/playground/lib/python3.10/site-packages/pandas/core/frame.py", line 3458, in __getit


as `[]` on a DataFrame looks for a column labelled `5`, whereas `5` is a row label here, not a column label.
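To make the distinction concrete, here is a minimal sketch using a hypothetical frame `demo`. It has the same index and column labels as `df` but placeholder integer values, so the results are predictable.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(20).reshape(10, 2),   # placeholder values 0..19
    index=np.arange(3, 13),
    columns=["random_A", "random_B"],
)

# [] with a scalar searches the column labels, so 5 is not found
print(5 in demo.columns)   # False

try:
    demo[5]
    raised = False
except KeyError:
    raised = True
print(raised)              # True

# .loc searches the row labels instead
row = demo.loc[5]
print(row["random_A"], row["random_B"])
```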

### Attribute Access

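Similarly, we can access a column from a DataFrame, or an item from a
Series, as an attribute. However, we can’t do this when the label is not
a valid Python identifier (for example, it contains a space or is an
integer), or when it clashes with an existing attribute or method name.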


In [7]:
display(df.random_A)

3     1.677205
4     1.543037
5     0.689541
6    -0.414047
7     0.155389
8     0.126505
9     0.807657
10    0.282047
11   -0.369983
12   -1.738524
Name: random_A, dtype: float64
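The limits of attribute access can be sketched as follows. The frame `demo` and its column names are hypothetical, picked so that one label contains a space and another shares its name with the `DataFrame.count` method.

```python
import pandas as pd

# Hypothetical column names: one contains a space, one shadows a method
demo = pd.DataFrame({"random A": [1, 2], "count": [3, 4], "ok_name": [5, 6]})

print(demo.ok_name.tolist())      # valid identifier: attribute access works
print(demo["random A"].tolist())  # space in the label: use [] instead
print(callable(demo.count))       # demo.count is the method, not the column
print(demo["count"].tolist())     # [] still returns the column
```

Using `[]` avoids both problems, which is one reason it is often the safer choice in scripts.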

### `.loc`

`.loc` provides label-based indexing. `.loc` can also be used for
slicing and we can even provide a `callable` as its input! However, here
we’ll just show single item access.


In [8]:
display(df.loc[5])

random_A    0.689541
random_B    1.616827
Name: 5, dtype: float64

In [9]:
# and for a Series

display(s.loc[2])

0
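For completeness, here is a hedged sketch of the slicing and callable forms mentioned above. It uses a hypothetical frame `demo` with the same labels as `df` but placeholder integer values. Note that a label slice with `.loc` includes both endpoints.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(20).reshape(10, 2),   # placeholder values 0..19
    index=np.arange(3, 13),
    columns=["random_A", "random_B"],
)

# Label slice: both endpoints (5 and 7) are included
print(demo.loc[5:7])

# A row label and a column label together select a single value
print(demo.loc[5, "random_B"])

# A callable receives the DataFrame and returns something .loc accepts,
# here a boolean mask keeping rows where random_A is greater than 10
print(demo.loc[lambda d: d["random_A"] > 10])
```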


### `.iloc`

`.iloc` provides integer position-based indexing, counting from 0 regardless of the index labels. This closely resembles Python and NumPy indexing and slicing. Again, we'll just show single item access.

In [10]:
# for DataFrame
display(df.iloc[5])

# and for a Series
display(s.iloc[2])

random_A    0.126505
random_B    0.593522
Name: 8, dtype: float64

0
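Although only single item access is shown here, `.iloc` slicing can be sketched in the same way. The frame `demo` below is hypothetical: it has the same labels as `df` but placeholder integer values. Unlike `.loc`, an `.iloc` slice excludes its end position, just as Python and NumPy slices do.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(20).reshape(10, 2),   # placeholder values 0..19
    index=np.arange(3, 13),
    columns=["random_A", "random_B"],
)

# Position slice: positions 5 and 6 only (end excluded), i.e. labels 8 and 9
print(demo.iloc[5:7])

# Row position and column position together select a single value
print(demo.iloc[0, 1])

# Negative positions count from the end, as in Python lists
print(demo.iloc[-1].name)   # label of the last row
```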