# Pandas - High level Data Manipulation

M. V. dos Santos (marcelo.santos at df.ufcg.edu.br)

The latest version of this [Jupyter notebook](https://jupyter.org/) lecture is available at [https://github.com/mvsantosdev/scientific-python-lectures.git](https://github.com/mvsantosdev/scientific-python-lectures.git).

In [1]:
import pandas as pd

## Introduction

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. (Wikipedia)

## DataFrames

The DataFrame is the primary pandas data structure: a two-dimensional, size-mutable, potentially heterogeneous table with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. A DataFrame can be thought of as a dict-like container for Series objects.
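To illustrate what "align on both row and column labels" means, here is a minimal sketch. The frames `left` and `right` and their values are made up for this example and do not appear elsewhere in the lecture.

```python
import pandas as pd

# Two small frames whose row labels only partly overlap (hypothetical data).
left = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['a', 'b'])
right = pd.DataFrame({'col1': [10, 20], 'col2': [30, 40]}, index=['b', 'c'])

# Addition matches values by row and column label, not by position.
# Labels present in only one operand ('a' and 'c') give NaN in the result.
total = left + right

# Each column is a Series, retrieved by key much like a dict value.
column = left['col1']

print(total)
print(type(column))
```

Only row `'b'` exists in both frames, so it is the only row of `total` with numeric values.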

In [2]:
df = pd.DataFrame()  # empty DataFrame

In [3]:
df

In [6]:
data = {'col1': [1, 2], 'col2': [3, 4]}

df = pd.DataFrame(data)

df

   col1  col2
0     1     3
1     2     4


In [8]:
data = {'col1': {8: 1, 9: 2}, 'col2': {8: 3, 9: 4}}

df = pd.DataFrame(data)

df

   col1  col2
8     1     3
9     2     4


In [17]:
data = [{'id': 5, 'col1': 1, 'col2': 3}, {'id': 6, 'col1': 2, 'col2': 4}]

df = pd.DataFrame(data)

df

   id  col1  col2
0   5     1     3
1   6     2     4


In [18]:
df = df.set_index('id')

df

    col1  col2
id
5      1     3
6      2     4


In [19]:
df.reset_index(drop=True, inplace=True)

df

   col1  col2
0     1     3
1     2     4
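The two cells above move a column into the index and then discard it. As a hedged sketch of how the `drop` argument changes the result, the example below rebuilds the same data under the illustrative name `df_ids` (not a name used in the lecture) and avoids `inplace=True` so that every intermediate result stays available.

```python
import pandas as pd

df_ids = pd.DataFrame([{'id': 5, 'col1': 1, 'col2': 3},
                       {'id': 6, 'col1': 2, 'col2': 4}])

# set_index returns a new frame in which 'id' provides the row labels.
indexed = df_ids.set_index('id')

# Default drop=False: the index is moved back into an ordinary column.
restored = indexed.reset_index()

# drop=True: the index values are discarded and replaced by 0, 1, ...
dropped = indexed.reset_index(drop=True)

print(restored.columns.tolist())
print(dropped.columns.tolist())
```

With `inplace=True` these methods modify the frame directly and return `None`, so the result should not be assigned back to a variable.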


### DataFrame properties

In [54]:
df.size

4

In [55]:
df.shape

(2, 2)

In [56]:
df.ndim

2

In [57]:
df.columns

Index(['col1', 'col2'], dtype='object')

In [58]:
df.columns = ['val1', 'val2']

df

   val1  val2
0     1     3
1     2     4


In [59]:
df.index

RangeIndex(start=0, stop=2, step=1)

In [60]:
df.index = ['a', 'b']

df

   val1  val2
a     1     3
b     2     4


In [61]:
df.index

Index(['a', 'b'], dtype='object')

In [62]:
df.values

array([[1, 3],
       [2, 4]])
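The properties above are related to one another. The sketch below checks those relations on a frame rebuilt under the illustrative name `frame`. It also uses `to_numpy()`, which the pandas documentation recommends over `.values`; the lecture itself only shows `.values`.

```python
import pandas as pd

frame = pd.DataFrame({'val1': [1, 2], 'val2': [3, 4]}, index=['a', 'b'])

# shape is (number of rows, number of columns).
n_rows, n_cols = frame.shape

# size counts every cell, so it equals rows * columns.
cells = n_rows * n_cols

# ndim is the number of axes, i.e. the length of shape.
axes = len(frame.shape)

# to_numpy() returns the underlying data as a NumPy array.
array = frame.to_numpy()

print(frame.size == cells, frame.ndim == axes, array.shape)
```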

## Accessing values

Accessing columns:

In [63]:
df['val1']

a    1
b    2
Name: val1, dtype: int64

In [64]:
df.val1

a    1
b    2
Name: val1, dtype: int64
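Attribute access such as `df.val1` is convenient but works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. A small sketch with deliberately awkward, made-up column names:

```python
import pandas as pd

frame = pd.DataFrame({'val1': [1, 2], 'size': [3, 4], 'my col': [5, 6]})

# Bracket access always returns the column with that name.
by_key = frame['size']

# Attribute access finds the DataFrame.size property first,
# so this is the number of cells, not the column.
by_attr = frame.size

# A name containing a space can only be reached with brackets.
spaced = frame['my col']

print(by_key.tolist(), by_attr, spaced.tolist())
```

For this reason bracket access is the safer default in reusable code.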

Accessing rows:

In [65]:
df.loc['a']

val1    1
val2    3
Name: a, dtype: int64

In [66]:
df.iloc[0]

val1    1
val2    3
Name: a, dtype: int64

In [67]:
df.loc['a', 'val1'], df.iloc[0, 0]

(1, 1)
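The difference between `loc` (labels) and `iloc` (positions) is easiest to see when the labels are integers that do not start at zero. The sketch below reuses the index values 8 and 9 from an earlier cell; the variable names are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=[8, 9])

# loc looks up the row whose label is 8.
by_label = frame.loc[8]

# iloc looks up the first row, whatever its label is.
by_position = frame.iloc[0]

# Label slices include both endpoints.
label_slice = frame.loc[8:9]

# Positional slices exclude the end, as in plain Python.
position_slice = frame.iloc[0:1]

print(len(label_slice), len(position_slice))
```

Here `frame.iloc[8]` would raise an `IndexError`, because the frame has only two positions.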

In [68]:
df.loc[:, 'val1']  # all rows of column 'val1'; same result as df['val1']

a    1
b    2
Name: val1, dtype: int64

Columns and rows are returned as Series objects. A Series is a one-dimensional ndarray with axis labels, and it is also the structure used for time series.

In [69]:
s = df.loc[:, 'val1']

type(s)

pandas.core.series.Series
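A Series can also be built directly, which makes its two parts, the labels and the values, easier to see. This is a minimal sketch; the name `s_demo` is chosen for illustration.

```python
import pandas as pd

# Values plus an index of labels, and an optional name.
s_demo = pd.Series([1, 2], index=['a', 'b'], name='val1')

# Elements are looked up by label.
second = s_demo['b']

# Arithmetic is applied element-wise and keeps the labels.
doubled = s_demo * 2

# to_frame() turns the Series into a one-column DataFrame
# whose column name is the Series name.
as_frame = s_demo.to_frame()

print(s_demo.index.tolist(), s_demo.to_numpy())
print(as_frame.columns.tolist())
```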

## Importing data