# Pandas: The basics

In [1]:
import pandas as pd

## Introduction to Pandas Objects
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. There are three basic data structures in Pandas: **Series**, **DataFrame**, and **Index**.

### Pandas Series

A Pandas Series is a one-dimensional array of indexed data. In its simplest form, it can be created from a list or array as follows:

In [2]:
series = pd.Series([1,2,3,4])

In [3]:
print(series)

0    1
1    2
2    3
3    4
dtype: int64


In [4]:
print(type(series.values))

<class 'numpy.ndarray'>


As the preceding output shows, a Series wraps both a sequence of values and a sequence of indices. We can access the values directly with the `values` attribute; they are stored as a NumPy array.

In [5]:
series.values

array([1, 2, 3, 4])

The index of the series can be accessed with the `index` attribute. It is a `pd.Index` object; for a default integer index, it is a `RangeIndex`.

In [6]:
print(series.index)

RangeIndex(start=0, stop=4, step=1)


In [7]:
print(type(series))

<class 'pandas.core.series.Series'>


As with a NumPy array, data can be accessed by the **associated index** via the familiar Python **square-bracket notation**:

In [8]:
# Get the second element in the series
series[1]

2

In [9]:
# Get the second and third elements in the series
series[1:3]

1    2
2    3
dtype: int64

**Series as a generalized NumPy array:** <br/>
From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an *implicitly defined integer index* used
to access the values, the Pandas Series has an *explicitly defined index* associated with
the values. This explicit index definition gives the Series object additional capabilities. For
example, the index does not have to be an integer, but can consist of values of any type.

Let's see how this works ...

In [10]:
# Creates a pandas series with some given values that should be used for the index
series = pd.Series([1,2,3,4], 
                   index=['a', 'b', 'c', 'd'])

In [11]:
print(series)

a    1
b    2
c    3
d    4
dtype: int64


We can now access the series based on the index values.

In [12]:
# Retrieve the item with the index value = b
series['b']

2

In [13]:
# We can even fetch multiple items by passing a list of index values
series[['b', 'c']]

b    2
c    3
dtype: int64

Alternatively, since individual indices uniquely map to individual values, we can define a Pandas series with a dictionary.

In [14]:
series = pd.Series({
    'a': 1,
    'b': 2,
    'c': 3,
    'd': 4
})

In [15]:
print(series)

a    1
b    2
c    3
d    4
dtype: int64


### Pandas DataFrame
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a
sequence of aligned Series objects. By "aligned" we mean that they share the same index.
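The idea of a DataFrame as a collection of aligned Series can be sketched as follows. The names `s1`, `s2`, and `frame` and the values are made up for illustration; the two Series deliberately list their labels in a different order to show that Pandas matches values by label, not by position.

```python
import pandas as pd

# Two illustrative Series with the same labels, listed in a different order
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Each Series becomes one column; values are aligned on the shared labels
frame = pd.DataFrame({'x': s1, 'y': s2})

# Label 'a' pairs s1['a'] with s2['a'], regardless of the original order
print(frame.loc['a', 'x'], frame.loc['a', 'y'])
```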

Let's construct our first dataframe by explicitly specifying the values in the columns.

In [16]:
df = pd.DataFrame(
    {
        'column_1': [0, 1, 2],
        'column_2': ['xasdf', 'xasdf', 'xasdf'],
    },
    index = ['row_1', 'row_2', 'row_3'] # Optional: Set the index
)

Alternatively, we can specify and create the same dataframe as follows:

In [17]:
df = pd.DataFrame([
        {'column_1': 0, 'column_2': 'xasdf'},
        {'column_1': 1, 'column_2': 'xasdf'},
        {'column_1': 2, 'column_2': 'xasdf'},
    ], 
    index = ['row_1', 'row_2', 'row_3']
)

In [18]:
print(df)

       column_1 column_2
row_1         0    xasdf
row_2         1    xasdf
row_3         2    xasdf


Like Series, DataFrames have `index` and `values` attributes. In addition, a `columns` attribute provides access to the column labels.

In [19]:
print(df.index)

Index(['row_1', 'row_2', 'row_3'], dtype='object')


In [20]:
df.values

array([[0, 'xasdf'],
       [1, 'xasdf'],
       [2, 'xasdf']], dtype=object)

In [21]:
df.columns

Index(['column_1', 'column_2'], dtype='object')

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In [22]:
print(type(df['column_1']))

<class 'pandas.core.series.Series'>


We can fetch a particular row of a dataframe via the `.loc[<index_value>]` attribute. Note that if we omit the `.loc` attribute and instead access the dataframe with `df[<something>]`, we obtain the column `<something>` rather than a row.

In [23]:
print(df.loc['row_2'])

column_1        1
column_2    xasdf
Name: row_2, dtype: object


In [24]:
print(df.loc[['row_1', 'row_2']])

       column_1 column_2
row_1         0    xasdf
row_2         1    xasdf


## Data Selection

We will look in detail at methods and tools to access values in Pandas dataframes.

### Data Selection in Series

The slicing and indexing conventions above can be a source of confusion. For example, if your series has an explicit integer index, an indexing operation such as `series[1]` will use the **explicit indices**, while a slicing operation such as `series[1:3]` will use the **implicit Python-style index**.

In [25]:
# Create a series with an explicitly defined index
series = pd.Series([10,11,12,13], 
                   index=[3,2,1,0])

In [26]:
# Access the element with index value = 0
print(series[0])

13


In [27]:
# Access the elements with index value = 0 and 1
print(series[[0,1]])

0    13
1    12
dtype: int64


However, if we use slicing ...

In [28]:
# We use slicing to access the series.
# Question: Which value will be returned ?
print(series[:1])

3    10
dtype: int64


**Pandas does not look at the explicitly defined index! Instead, it looks at the implicit index! (the order of rows)**

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes which explicitly access certain indexing schemes. 

#### Accessing Pandas Series using attributes

The `.loc` attribute allows indexing and slicing which **always references the explicit index**. We refer to the values in this index as **labels**.

In [29]:
# Access the index by label 0
print(series.loc[0])

13


In [30]:
# Get the element with label 0 and 1
print(series.loc[[0,1]])

0    13
1    12
dtype: int64


In [31]:
# Slicing with .loc uses the explicit index
# The label 1 is the third entry in the index. Hence, we obtain three values.
print(series.loc[:1])

3    10
2    11
1    12
dtype: int64


**Note that in contrast to conventional Python slicing, both the start and the stop are included when present in the index!**
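The difference between the inclusive label slice and the conventional half-open positional slice is easier to see with string labels. The following is a small sketch; the series `letters` and its values are invented for this illustration.

```python
import pandas as pd

# Illustrative series with string labels
letters = pd.Series([10, 11, 12, 13], index=['a', 'b', 'c', 'd'])

# Label-based slice: the stop label 'c' is included
by_label = letters.loc['a':'c']

# Position-based slice: the stop position 2 is excluded, as in plain Python
by_position = letters.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```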

The implicit index can be explicitly accessed with the `.iloc` attribute.

In [32]:
print(series.iloc[0])

10


In [33]:
print(series.iloc[[0,1]])

3    10
2    11
dtype: int64


In [34]:
print(series.iloc[:1])

3    10
dtype: int64


### Data Selection in DataFrames

In [35]:
df = pd.DataFrame([
        {'column_1': 0, 'column_2': 'xasdf'},
        {'column_1': 1, 'column_2': 'xasdf'},
        {'column_1': 2, 'column_2': 'xasdf'},
    ]
)

Unlike the case for series, the `[]` operator indexes the columns of a dataframe. It allows us to select certain columns based on their names.

In [36]:
print(df['column_1'])

0    0
1    1
2    2
Name: column_1, dtype: int64


In [37]:
print(df[['column_1', 'column_2']])

   column_1 column_2
0         0    xasdf
1         1    xasdf
2         2    xasdf


But again, if we start slicing, the `[]` operator will slice the implicit index.

In [38]:
# Returns the first row
df[:1]

   column_1 column_2
0         0    xasdf


However, note that trying to access a specific row by writing `df[<row_idx>]` will throw an error. <br/>
This happens because the provided value is not a slice. Hence, Pandas starts to look for a column named `<row_idx>`.

In [39]:
df[0]

KeyError: 0
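A minimal sketch of this behaviour, using a hypothetical dataframe `frame` with the same columns as above: the bare `[]` lookup fails because no column is named `0`, while `.iloc` retrieves the row by position.

```python
import pandas as pd

# Illustrative dataframe with a default integer row index
frame = pd.DataFrame({
    'column_1': [0, 1, 2],
    'column_2': ['xasdf', 'xasdf', 'xasdf'],
})

# frame[0] is interpreted as a column lookup and fails
try:
    frame[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Positional row access works through .iloc
first_row = frame.iloc[0]
print(lookup_failed)
print(first_row['column_1'])
```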

#### Accessing Pandas DataFrames using attributes

Again, to avoid any confusion, Pandas provides the `.iloc` and `.loc` attributes for dataframes.

In [None]:
df = pd.DataFrame([
        {'column_1': 0, 'column_2': 'xasdf'},
        {'column_1': 1, 'column_2': 'xasdf'},
        {'column_1': 2, 'column_2': 'xasdf'},
    ],
    index = ['row_1', 'row_2', 'row_3']
)

Again, the implicit index can be accessed with `.iloc`.

In [None]:
# Returns the first row
df.iloc[0]

In [None]:
# Returns the first two rows
df.iloc[:2]

However, there is also the possibility of accessing particular **cells** based on their position.

In [None]:
# Obtain the value at row position 2 and column position 0 (column_1)
df.iloc[2,0]
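Positional access also accepts slices on both axes. The sketch below rebuilds the example dataframe under the illustrative name `frame` and selects a single cell as well as a rectangular block.

```python
import pandas as pd

# Same layout as the example dataframe, rebuilt so the block is self-contained
frame = pd.DataFrame(
    {
        'column_1': [0, 1, 2],
        'column_2': ['xasdf', 'xasdf', 'xasdf'],
    },
    index=['row_1', 'row_2', 'row_3'],
)

# Single cell: third row, first column
cell = frame.iloc[2, 0]

# Block: first two rows, first column only (the result is still a DataFrame)
block = frame.iloc[:2, :1]

print(cell)
print(block)
```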

The `.loc` attribute can be used to access the explicitly defined index. We refer to the values in this index as **labels**.

In [None]:
# Get row with label "row_1"
print(df.loc['row_1'])

Similarly, we can also access a particular cell based on the row label and the column label.

In [None]:
# Access the value at row label "row_2" and column label "column_2"
df.loc['row_2'].loc['column_2']

Alternatively, we can use a slightly more compact notation ...

In [None]:
df.loc['row_2', 'column_2']

Note that `.loc` can also handle boolean masks ...

In [None]:
df.loc[[False, True, False]]
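In practice, such a mask is usually derived from a condition on the data rather than written out by hand. A small sketch, again with the illustrative name `frame` and an arbitrarily chosen threshold:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'column_1': [0, 1, 2],
        'column_2': ['xasdf', 'xasdf', 'xasdf'],
    },
    index=['row_1', 'row_2', 'row_3'],
)

# Comparing a column with a value yields a boolean Series, one entry per row
mask = frame['column_1'] >= 1

# Rows where the mask is True
selected = frame.loc[mask]

# The mask can be combined with a column label to select cells
selected_cells = frame.loc[mask, 'column_2']

print(selected)
```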

## Modifying a DataFrame

Now that we understand how specific rows, columns, or cells can be selected in a Pandas dataframe, we look at how to modify them.

In [None]:
df = pd.DataFrame([
        {'column_1': 0, 'column_2': 'xasdf'},
        {'column_1': 1, 'column_2': 'xasdf'},
        {'column_1': 2, 'column_2': 'xasdf'},
    ],
    index = ['row_1', 'row_2', 'row_3']
)

In [None]:
print(df)

### Set the value of a cell by its row and column label

In [None]:
df.loc['row_3', 'column_2'] = 'new'

In [None]:
print(df)

### Set cells with the value "xasdf" in column "column_2" to 4

In [None]:
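# Restrict the assignment to column_2; without the column label,
# every column of the matching rows would be set to 4.
df.loc[df['column_2'] == 'xasdf', 'column_2'] = 4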

In [None]:
print(df)

### Set the row labeled "row_3" to ['a', 'b']

In [None]:
# The order in which we define the values in the list needs to be the same as the order of the columns in the dataframe.
df.loc['row_3'] = ['a', 'b']

In [None]:
print(df)

In [None]:
# Alternatively, we can pass a dictionary where the keys represent column labels.
df.loc['row_3'] = {'column_1': 'x', 'column_2':'y'}

In [None]:
print(df)

### Set all rows where column_2 is equal to "x" to "y" 

In [None]:
df = pd.DataFrame([
        {'column_1': 0, 'column_2': 'xasdf'},
        {'column_1': 1, 'column_2': 'xasdf'},
        {'column_1': 2, 'column_2': 'x'},
    ],
    index = ['row_1', 'row_2', 'row_3']
)

In [None]:
df.loc[df['column_2'] == 'x'] = 'y'

In [None]:
print(df)
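Assigning through a row mask alone overwrites every column of the matching rows. If only one column should change, the column label can be passed as the second argument to `.loc`. The following sketch uses the illustrative name `frame` for a copy of the dataframe above.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'column_1': [0, 1, 2],
        'column_2': ['xasdf', 'xasdf', 'x'],
    },
    index=['row_1', 'row_2', 'row_3'],
)

# Rows where column_2 equals 'x'
mask = frame['column_2'] == 'x'

# Only column_2 is overwritten; column_1 keeps its original value
frame.loc[mask, 'column_2'] = 'y'

print(frame)
```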

### Add a new row with label row_4

In [None]:
df.loc['row_4'] = [1, 2]

In [None]:
print(df)