# DataFrames

DataFrames are the workhorse of pandas and are directly inspired by data frames in the R programming language. You can think of a DataFrame as a collection of Series objects that share the same index, with each Series forming one column.
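
To make the "Series sharing an index" idea concrete, here is a minimal sketch that builds a small DataFrame from two Series. The names w, x and small_df and their values are made up for illustration.

```python
import pandas as pd

# Two Series with the same index labels (illustrative values)
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'])
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'])

# Each Series becomes one column; the shared labels become the row index
small_df = pd.DataFrame({'W': w, 'X': x})
print(small_df)
```

Selecting small_df['W'] hands back a Series with the same values and index as w.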

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.DataFrame(np.random.rand(5, 4), index=['A', 'B', 'C', 'D', 'E'], columns=['W', 'X', 'Y', 'Z'])

In [7]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.006249 | 0.016105 | 0.668698 | 0.35299 |
| B | 0.920795 | 0.805295 | 0.043009 | 0.937955 |
| C | 0.395814 | 0.942302 | 0.009836 | 0.88547 |
| D | 0.51097 | 0.321547 | 0.435193 | 0.972378 |
| E | 0.521871 | 0.148523 | 0.506401 | 0.086488 |


## Selection and Indexing

Let's look at the various ways to select data from a DataFrame.

In [8]:
df['W']

A    0.006249
B    0.920795
C    0.395814
D    0.510970
E    0.521871
Name: W, dtype: float64

In [9]:
# Pass a list of column names
df[['W','Z']]

|   | W | Z |
|---|---|---|
| A | 0.006249 | 0.35299 |
| B | 0.920795 | 0.937955 |
| C | 0.395814 | 0.88547 |
| D | 0.51097 | 0.972378 |
| E | 0.521871 | 0.086488 |


DataFrame columns are just Series:

In [10]:
type(df['W'])

pandas.core.series.Series

**Creating a new column:**

In [11]:
df['new'] = df['W'] + df['Y']

In [12]:
df

|   | W | X | Y | Z | new |
|---|---|---|---|---|---|
| A | 0.006249 | 0.016105 | 0.668698 | 0.35299 | 0.674947 |
| B | 0.920795 | 0.805295 | 0.043009 | 0.937955 | 0.963804 |
| C | 0.395814 | 0.942302 | 0.009836 | 0.88547 | 0.40565 |
| D | 0.51097 | 0.321547 | 0.435193 | 0.972378 | 0.946163 |
| E | 0.521871 | 0.148523 | 0.506401 | 0.086488 | 1.028272 |
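
Assigning with brackets changes df itself. If you would rather keep the original unchanged, one option is DataFrame.assign, which returns a new DataFrame. This is only a sketch: the frame demo, its values, and the column name total are made up for illustration.

```python
import pandas as pd

# Illustrative data, not the random values shown above
demo = pd.DataFrame({'W': [1.0, 2.0], 'Y': [0.5, 1.5]}, index=['A', 'B'])

# assign returns a new DataFrame with the extra column; demo is left as it was
with_total = demo.assign(total=demo['W'] + demo['Y'])

print(with_total)
print(demo.columns.tolist())
```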


**Removing columns:**

In [13]:
df.drop('new',axis=1,inplace=True)

In [14]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.006249 | 0.016105 | 0.668698 | 0.35299 |
| B | 0.920795 | 0.805295 | 0.043009 | 0.937955 |
| C | 0.395814 | 0.942302 | 0.009836 | 0.88547 |
| D | 0.51097 | 0.321547 | 0.435193 | 0.972378 |
| E | 0.521871 | 0.148523 | 0.506401 | 0.086488 |


You can also drop rows this way by passing axis=0:

In [17]:
df.drop('E',axis=0,inplace=True)
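
The calls above pass inplace=True, so df itself is modified. As a small sketch of the other behavior: without inplace=True, drop returns a new DataFrame and leaves the original alone. The frame demo and its values are made up for illustration.

```python
import pandas as pd

# Illustrative data
demo = pd.DataFrame({'W': [1, 2, 3], 'X': [4, 5, 6]}, index=['A', 'B', 'C'])

# Without inplace=True, drop returns a new DataFrame
no_x = demo.drop('X', axis=1)   # axis=1 drops a column
no_c = demo.drop('C', axis=0)   # axis=0 drops a row

print(demo.shape)   # the original keeps all rows and columns
print(no_x.shape)
print(no_c.shape)
```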

**Selecting rows:**

In [21]:
df.loc[['A','B']]

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.006249 | 0.016105 | 0.668698 | 0.35299 |
| B | 0.920795 | 0.805295 | 0.043009 | 0.937955 |


In [22]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.006249 | 0.016105 | 0.668698 | 0.35299 |
| B | 0.920795 | 0.805295 | 0.043009 | 0.937955 |
| C | 0.395814 | 0.942302 | 0.009836 | 0.88547 |
| D | 0.51097 | 0.321547 | 0.435193 | 0.972378 |


In [23]:
df.loc[['A','B'],['W','X']]

|   | W | X |
|---|---|---|
| A | 0.006249 | 0.016105 |
| B | 0.920795 | 0.805295 |


Or select by integer position instead of label:

In [24]:
df.iloc[2]

W    0.395814
X    0.942302
Y    0.009836
Z    0.885470
Name: C, dtype: float64
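
To contrast the two accessors, here is a minimal sketch on a small made-up frame (demo and its values are illustrative). loc looks rows up by index label, while iloc counts positions starting from 0.

```python
import pandas as pd

# Illustrative data
demo = pd.DataFrame(
    {'W': [1, 2, 3], 'X': [4, 5, 6]},
    index=['A', 'B', 'C'],
)

by_label = demo.loc['C']      # the row whose index label is 'C'
by_position = demo.iloc[2]    # the third row, counting from 0

# Both lookups point at the same row here
print(by_label.equals(by_position))

# iloc also accepts slices and lists on both axes
first_two_rows_first_col = demo.iloc[0:2, [0]]
print(first_two_rows_first_col)
```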

**Selecting a subset of rows and columns:**

In [17]:
df.loc[['A','B'],['W','Y']]

|   | W | Y |
|---|---|---|
| A | 0.844933 | 0.981566 |
| B | 0.274244 | 0.821961 |


### Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

In [18]:
df

|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.844933 | 0.614515 | 0.981566 | 0.508522 |
| B | 0.274244 | 0.176267 | 0.821961 | 0.257647 |
| C | 0.621103 | 0.034201 | 0.207057 | 0.705593 |
| D | 0.234953 | 0.864842 | 0.88328 | 0.975084 |
| E | 0.382878 | 0.412758 | 0.81018 | 0.531792 |


In [26]:
df[df>0.5]

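|   | W | X | Y | Z |
|---|---|---|---|---|
| A | 0.844933 | 0.614515 | 0.981566 | 0.508522 |
| B | NaN | NaN | 0.821961 | NaN |
| C | 0.621103 | NaN | NaN | 0.705593 |
| D | NaN | 0.864842 | 0.88328 | 0.975084 |
| E | NaN | NaN | 0.81018 | 0.531792 |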

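
Comparing the whole DataFrame, as above, keeps its shape and leaves NaN wherever the condition is False. A common companion pattern is to filter whole rows with a condition on a single column. The sketch below uses a small made-up frame (demo and its values are illustrative, not the random data shown above).

```python
import pandas as pd

# Illustrative data
demo = pd.DataFrame(
    {'W': [0.8, 0.2, 0.6], 'Y': [0.9, 0.8, 0.1]},
    index=['A', 'B', 'C'],
)

# demo['W'] > 0.5 is a boolean Series; indexing with it keeps only the True rows
high_w = demo[demo['W'] > 0.5]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
high_both = demo[(demo['W'] > 0.5) & (demo['Y'] > 0.5)]

print(high_w)
print(high_both)
```

Python's plain and / or keywords do not work here because they expect a single True or False rather than a Series of them.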