
# Introduction to pandas

In [1]:
import pandas as pd
import numpy as np
import itertools

## Dataframe construction from numpy arrays 

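Pandas DataFrames can be constructed from `np.ndarray` objects. Each row of a 2-D array becomes a row of the dataframe, and the `columns` argument names the columns.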

In [2]:
X = np.array([[0,0,5],[7,0,0],[3,3,3]])
X

array([[0, 0, 5],
       [7, 0, 0],
       [3, 3, 3]])

In [3]:
df = pd.DataFrame(X,columns=["a","b","c"])
df

   a  b  c
0  0  0  5
1  7  0  0
2  3  3  3


In [17]:
type(df)

pandas.core.frame.DataFrame
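
As a small sketch of the same construction, the constructor also accepts an `index` argument for the row labels, and `DataFrame.to_numpy()` gives back the values as an array. The row labels `r0`, `r1`, `r2` and the names `df_labeled` and `X_back` are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

X = np.array([[0, 0, 5], [7, 0, 0], [3, 3, 3]])

# Row labels and the name df_labeled are illustrative assumptions.
df_labeled = pd.DataFrame(X, columns=["a", "b", "c"], index=["r0", "r1", "r2"])

# The shape of the dataframe matches the shape of the source array.
print(df_labeled.shape)

# to_numpy() returns the values as an np.ndarray again.
X_back = df_labeled.to_numpy()
print(np.array_equal(X, X_back))
```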

# Columns in a pandas DataFrame

## Selecting columns of a `pd.DataFrame`

A single column can be selected with `df[colname]`, where `colname` is the name of the column.
The object returned by `df[colname]` is a `pd.Series`.

In [5]:
df["a"]

0    0
1    7
2    3
Name: a, dtype: int64

In [6]:
type(df["a"])

pandas.core.series.Series

We can also use `df[[colname]]`, passing a list that contains the column name. In that case the result is a `pd.DataFrame` with a single column.

In [7]:
df[["a"]]

   a
0  0
1  7
2  3


In [9]:
type(df[["a"]])

pandas.core.frame.DataFrame
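
The difference between the two forms can be sketched side by side. The variable names `col_series`, `col_frame` and `two_cols` are illustrative; the list form also extends naturally to several columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[0, 0, 5], [7, 0, 0], [3, 3, 3]]), columns=["a", "b", "c"])

col_series = df["a"]        # single label -> pd.Series (1-dimensional)
col_frame = df[["a"]]       # list with one label -> pd.DataFrame with one column
two_cols = df[["a", "c"]]   # list of labels -> pd.DataFrame with those columns

print(col_series.shape)            # (3,)
print(col_frame.shape)             # (3, 1)
print(two_cols.columns.tolist())   # ['a', 'c']
```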

## Column types

In [10]:
df.dtypes

a    int64
b    int64
c    int64
dtype: object

## Column heterogeneity

Columns in a pandas dataframe can hold numbers, strings or even arrays. A column whose cells are arrays is stored with the generic `object` dtype.

In [11]:
df["a"] = [np.array([2,3,4]),
           np.array([2,3,4]),
           np.array([3,3,2])]

In [12]:
df

           a  b  c
0  [2, 3, 4]  0  5
1  [2, 3, 4]  0  0
2  [3, 3, 2]  3  3


In [13]:
df.dtypes

a    object
b     int64
c     int64
dtype: object

In [14]:
df["a"][0]

array([2, 3, 4])

In [15]:
df["a"][1]

array([2, 3, 4])

In [16]:
df["a"][2]

array([3, 3, 2])
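
A brief sketch of working with such a column: each cell is an independent `np.ndarray`, so per-cell computations can be written with `Series.map`, and the cells can be stacked back into a regular 2-D array. The names `a_sums` and `stacked` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[0, 0, 5], [7, 0, 0], [3, 3, 3]]), columns=["a", "b", "c"])
df["a"] = [np.array([2, 3, 4]), np.array([2, 3, 4]), np.array([3, 3, 2])]

# Each cell of column "a" is a separate np.ndarray stored as a Python object.
print(df["a"].dtype)

# Per-cell computation: sum the entries of the array in each row.
a_sums = df["a"].map(lambda arr: int(arr.sum()))
print(a_sums.tolist())  # [9, 9, 8]

# Stacking the cells gives a regular 2-D numeric array again.
stacked = np.stack(df["a"].to_list())
print(stacked.shape)  # (3, 3)
```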

# Rows of a pandas dataframe

One can select rows of a pandas dataframe using the **`df.loc[]`** syntax, passing the index values (labels) of the rows one wants to select.

In the `df` defined above, the index values are the integers from 0 to 2.

In [25]:
ids_to_select = [0,1]
df.loc[ids_to_select]

           a  b  c
0  [2, 3, 4]  0  5
1  [2, 3, 4]  0  0


In [28]:
df.loc[2]

a    [3, 3, 2]
b            3
c            3
Name: 2, dtype: object

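One can also select rows by position using **`df.iloc[]`**. Passing a list of positions, as in `df.iloc[[2]]`, returns a dataframe, whereas a single integer returns a `pd.Series`.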

In [31]:
df.iloc[[2]]

           a  b  c
2  [3, 3, 2]  3  3
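
When the index is the default 0, 1, 2, labels and positions coincide, which hides the difference between `df.loc[]` and `df.iloc[]`. The sketch below uses a hypothetical frame `df_demo` with index labels 10, 20, 30 (values chosen only for illustration) to separate the two.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df_demo = pd.DataFrame({"b": [0, 0, 3], "c": [5, 0, 3]}, index=[10, 20, 30])

by_label = df_demo.loc[20]       # row whose index label is 20
by_position = df_demo.iloc[1]    # second row (position 1)
print(by_label.equals(by_position))  # True

# A list of labels or positions returns a DataFrame instead of a Series.
print(type(df_demo.loc[[20]]).__name__)   # DataFrame
print(type(df_demo.iloc[[1]]).__name__)   # DataFrame

# Label 1 does not exist in this index, so .loc raises KeyError.
try:
    df_demo.loc[1]
    found = True
except KeyError:
    found = False
print(found)  # False
```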


# Index of a Dataframe

The index of a dataframe holds the row labels. It behaves like a special column that is used to index/select the rows of a dataframe.

The index is important because it can allow faster selection of rows than filtering with a boolean vector, particularly when the index is sorted.

Let us create an example.

In [60]:
n_obs = 10_000_000
n_max = 1000

x = np.random.rand(n_obs)
indices = np.random.randint(0,n_max,n_obs)

In [86]:
df = pd.DataFrame({'x':x, 'indices':indices})

In [87]:
df

                x  indices
0        0.157628      658
1        0.077828      382
2        0.798671      950
3        0.109354      410
4        0.310457      682
...           ...      ...
9999995  0.727102       73
9999996  0.912861      688
9999997  0.793439      416
9999998  0.233583      383


Consider the problem of selecting all rows whose `indices` column has the value 232.

In [88]:
%%timeit
df[df.indices==232]

3.65 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


We can use the `indices` column as the index of the dataframe.

In [89]:
df.index = df.indices

In [90]:
%%timeit
df.loc[232]

6.13 ms ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


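Both approaches yield the same result. In the timings recorded above, the label lookup (6.13 ms) was not faster than the boolean mask (3.65 ms). The index here is unsorted and contains repeated labels, and label lookups generally benefit most when the index is sorted, for example after `df.sort_index()`.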

In [91]:
pd.testing.assert_frame_equal(df.loc[232], df[df.indices==232])

In [98]:
df.loc[232].head()

                x  indices
indices
232      0.210040      232
232      0.955793      232
232      0.892964      232
232      0.436349      232
232      0.805153      232
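
As a hedged sketch of a variation on the example above, the index can be set with `set_index` and then sorted, so that pandas can locate a label in a monotonic index without scanning every row. The sizes are reduced here to keep the sketch quick, the seed is arbitrary, and the names `by_mask`, `by_label` and `df_sorted` are illustrative. Actual timings depend on the pandas version and the data, so they are best measured with `%timeit` rather than assumed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_obs, n_max = 100_000, 1000     # smaller than the example above, to keep it quick

df = pd.DataFrame({"x": rng.random(n_obs), "indices": rng.integers(0, n_max, n_obs)})

# Selection with a boolean vector.
by_mask = df[df["indices"] == 232]

# drop=False keeps "indices" as a column as well as the index; mergesort is a
# stable sort, so rows with equal labels keep their original relative order.
df_sorted = df.set_index("indices", drop=False).sort_index(kind="mergesort")

# Passing a list of labels always returns a DataFrame.
by_label = df_sorted.loc[[232]]

print(df_sorted.index.is_monotonic_increasing)
print(np.array_equal(by_mask["x"].to_numpy(), by_label["x"].to_numpy()))
```

Both selections contain the same rows; only the index attached to them differs, since `by_mask` keeps the original integer positions as its index.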
