# Data Analysis I Using `pandas`

## <img src='https://az712634.vo.msecnd.net/notebooks/python_course/v1/geekcup.png' alt="Smiley face" width="42" height="42" align="left">Learning Objectives
---
* See some basic options for importing data files
* Understand how to manipulate row and column names
* Get an idea of how to deal with missing data
* Become familiar with slicing data
* Become familiar with assignment
* See how broadcasting works
* Understand more data structure manipulation (adding and removing columns)

In [2]:
import pandas as pd
import numpy as np

### Data types in `pandas` - you will see through examples how these work
* `Series`
* `DataFrame`
* `Panel` (not covered here; removed from `pandas` as of version 0.25)

### Input from csv and excel files

In [7]:
# Check current directory for files

# Linux or macOS (comment out on Windows)
!ls ./anaconda3_410
# Windows (uncomment)
#!dir

LICENSE.txt  conda-meta  etc	  lib	 pkgs	ssl
bin	     envs	 include  lib64  share	var



In [3]:
# Reading a csv file with the read_csv function

import os

data = pd.read_csv('https://raw.githubusercontent.com/ogrisel/parallel_ml_tutorial/master/notebooks/titanic_train.csv', 
                    sep = ',')

In [4]:
# What are the dimensions
print(data.shape)

# What are the column names
print(data.columns)

# What do the first few rows look like
data.head()

(891, 12)
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age', u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'], dtype='object')


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


To read an Excel file, `pandas` needs a reader package for its `read_excel` method.  For legacy `.xls` files such as the one below, that is <b>`xlrd`</b>; recent `pandas` versions read `.xlsx` files with `openpyxl` instead.  For Windows binaries go [here](http://www.lfd.uci.edu/~gohlke/pythonlibs/).  If you have a conda install, just `conda install xlrd`.

With `pandas` and `xlrd` you can read an Excel file with a single call:

```python
# Reading from an excel file with read_excel
data = pd.read_excel(os.path.join('data', 'GDS4517.xls'))
```
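Beyond the defaults, `read_csv` accepts options that control which columns are loaded, which column becomes the row index, and which strings count as missing. Here is a minimal sketch of three of them. The CSV text and the name `people` are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """id,name,age,city
1,Ann,34,Oslo
2,Bob,?,Rome
3,Cy,51,Lima
"""

people = pd.read_csv(
    io.StringIO(csv_text),            # a file path or URL works here too
    usecols=['id', 'name', 'age'],    # load only these columns
    index_col='id',                   # use the 'id' column as row labels
    na_values=['?'],                  # treat '?' as missing (NaN)
)

print(people)
print(people.dtypes)
```

Because of the missing entry, the `age` column is read as floating point rather than integer.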

In [5]:
# Some toy data
a = np.arange(10)
b = np.sin(a)

# Place it into a dictionary
array_dict = {'a': a, 'b': b}

# Initialize a dataframe with toy data
df = pd.DataFrame(array_dict)
df

Unnamed: 0,a,b
0,0,0.0
1,1,0.841471
2,2,0.909297
3,3,0.14112
4,4,-0.756802
5,5,-0.958924
6,6,-0.279415
7,7,0.656987
8,8,0.989358
9,9,0.412118


### The idea behind `pandas`
* The most common data structure in `pandas` is the **DataFrame**, which is analogous to the data.frame in R.

```python
# Some toy data
a = np.arange(10)
b = np.sin(a)

# Place it into a dictionary
array_dict = {'a': a, 'b': b}

# Initialize a dataframe with toy data
df = pd.DataFrame(array_dict)
```

* `pandas` is built on top of `numpy` and provides higher-level data manipulation tools.  That richness has a cost: a `pandas` operation is often slower than the equivalent operation on a `numpy` array.  However, it is not hard to convert from one to the other.
* The building block of the DataFrame is the `Series` type: each column of a DataFrame is a `Series`.
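To make the last two points concrete, here is a small sketch showing that a single column is a `Series` and that moving between `numpy` and `pandas` takes one call. The variable names are illustrative. It assumes a `pandas` version that has `to_numpy()` (0.24 or later); on older versions `.values` plays the same role.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5) ** 2})

# One column label gives a Series; a list of labels gives a DataFrame
col = toy['a']
sub = toy[['a']]
print(type(col).__name__)   # Series
print(type(sub).__name__)   # DataFrame

# DataFrame -> numpy array -> DataFrame
arr = toy.to_numpy()
back = pd.DataFrame(arr, columns=toy.columns)
print(arr.shape)            # (5, 2)
print(back)
```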

In [8]:
# 2D numpy array
np_array = np.random.randint(1, 10, size = 16).reshape(4, 4)
print(np_array)
# Convert to pd DataFrame
df = pd.DataFrame(np_array)

df

[[9 4 9 7]
 [3 3 2 2]
 [8 2 6 9]
 [3 1 5 1]]


Unnamed: 0,0,1,2,3
0,9,4,9,7
1,3,3,2,2
2,8,2,6,9
3,3,1,5,1


In [9]:
# pandas DataFrame
df = pd.DataFrame(data = np.arange(12).reshape(3, 4), columns = list('abcd'))

df

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [10]:
# Convert to ndarray (TWO ways)

# first way
df.to_numpy() # returns a plain numpy array (replaces as_matrix(), which was removed in pandas 1.0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [11]:
# convert to ndarray

# Second way
df.values

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Hey!  Did you notice how nicely the Jupyter notebook renders a `pandas` DataFrame?  That's one reason some people convert to DataFrames in Jupyter notebooks: it makes the data easier to see.

### Renaming row and column names

Initialize a `pandas` dataframe with toy data:

In [12]:
# Note here we are initializing a dataframe with a dict of 1D ndarrays (numpy arrays)
df = pd.DataFrame({'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df

Unnamed: 0,data1,data2
0,0.018495,-0.749575
1,-0.980687,-0.127589
2,-1.321556,2.614226
3,-2.531159,0.649898
4,0.506157,-1.932074


Rename columns (with the <b>`columns`</b> keyword) and rows (with the <b>`index`</b> keyword) in place (note: we could have specified `columns` when initializing the DataFrame):

In [13]:
df.rename(index = {0: 'a', 
                   1: 'b',
                   2: 'c',
                   3: 'd',
                   4: 'e'}, 
          columns = {'data1': 'one', 'data2': 'two'}, inplace = True)
df

Unnamed: 0,one,two
a,0.018495,-0.749575
b,-0.980687,-0.127589
c,-1.321556,2.614226
d,-2.531159,0.649898
e,0.506157,-1.932074
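Besides a dict, `rename` also accepts a function that is applied to every label, and all column labels can be replaced at once by assigning to `.columns`. A brief sketch follows; the dataframe `scores` is a stand-in for the toy data above.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(np.zeros((2, 2)), columns=['data1', 'data2'])

# A function is applied to every label; here, upper-case the column names
upper = scores.rename(columns=str.upper)
print(list(upper.columns))      # ['DATA1', 'DATA2']

# Without inplace=True, rename returns a new object and leaves the original alone
print(list(scores.columns))     # ['data1', 'data2']

# Assigning to .columns replaces all column labels at once (lengths must match)
scores.columns = ['one', 'two']
print(list(scores.columns))     # ['one', 'two']
```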


### Reordering of things

Using the toy dataframe from above, we shall now reorder the <b>rows</b>:

In [14]:
df2 = pd.DataFrame(df, index = ['b', 'c', 'd', 'a', 'e'])
df2

Unnamed: 0,one,two
b,-0.980687,-0.127589
c,-1.321556,2.614226
d,-2.531159,0.649898
a,0.018495,-0.749575
e,0.506157,-1.932074


In [15]:
# How would you modify the above cell to do the same reordering,
#   but at the same time, remove one, say the one labeled 'e'

# Write your code here...

df3 = pd.DataFrame(df, index = ['b', 'c', 'd', 'a'])
df3

Unnamed: 0,one,two
b,-0.980687,-0.127589
c,-1.321556,2.614226
d,-2.531159,0.649898
a,0.018495,-0.749575


Another way is `reindex`, which gives the same result: it returns a reordered object and leaves `df` unchanged:

In [16]:
# This does NOT change df

df.reindex(['b', 'c', 'd', 'a', 'e']) # compare to df2 above

Unnamed: 0,one,two
b,-0.980687,-0.127589
c,-1.321556,2.614226
d,-2.531159,0.649898
a,0.018495,-0.749575
e,0.506157,-1.932074
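`reindex` also accepts labels that do not exist yet. By default the new rows are filled with NaN, and `fill_value` supplies a different default. A small sketch, using an illustrative dataframe named `small`:

```python
import pandas as pd

small = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# 'z' is not an existing label, so its row is filled with NaN
with_nan = small.reindex(['c', 'a', 'z'])
print(with_nan)

# fill_value supplies a default for labels that were missing
with_zero = small.reindex(['c', 'a', 'z'], fill_value=0)
print(with_zero)

# The original is unchanged
print(list(small.index))        # ['a', 'b', 'c']
```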


In [None]:
# How would you modify the above cell (using reindex still) 
#   to not only reorder rows, but remove one from the view, 
#   say the one labeled 'e'

# Write your code here...


A quick trick to switch around columns

In [17]:
# Quick inplace transformation
df[['one', 'two']] = df[['two', 'one']]
df

Unnamed: 0,one,two
a,-0.749575,0.018495
b,-0.127589,-0.980687
c,2.614226,-1.321556
d,0.649898,-2.531159
e,-1.932074,0.506157


### Introducing the `Series` object

<b>Properties of the `Series` object</b>
* alignment of data and label are intrinsic
* is a 1D array (essentially a `numpy` array with an index)
* slicing also slices the index
* can be initialized with a scalar, a dict or an ndarray (aka numpy array)
* if initialized with numpy array and an index is given, length must match data
* numpy functions can take a Series as input
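A few of these properties can be seen in a short sketch. The names `s1` and `s2` are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# Arithmetic aligns on labels, not positions: 'a' pairs with 'a', and so on
total = s1 + s2
print(total)

# Slicing the data also slices the index
tail = s1.iloc[1:]
print(tail.index.tolist())      # ['b', 'c']

# numpy functions accept a Series and return a Series with the same index
roots = np.sqrt(s1)
print(roots.index.tolist())     # ['a', 'b', 'c']
```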

<b>Examples of initializing a `Series`:</b>

In [21]:
# With a scalar only
a = pd.Series(5)
print('a:\n', a)
#print(a)

# With a scalar and index
b = pd.Series(5, index = ['Z'])
print('b:\n', b)

# With a scalar and a multi-label index (the scalar is repeated for each label)
c = pd.Series(5, index = ['X', 'Y', 'Z'])
print('c:\n', c)

# With a dict
d = pd.Series({'A': 1, 'B': 2})
print('d:\n', d)

# With a dict and an index: dict keys are matched to the index labels,
#   and labels with no matching key (here 'C') become NaN
e = pd.Series({'A': 1, 'B': 2}, index = ['A', 'B', 'C'])
print('e:\n', e)

# With an ndarray
f = pd.Series(np.random.randn(5))
print('f:\n', f)

# With an ndarray and index (lengths must match)
g = pd.Series(np.random.randn(5), index = ['M', 'N', 'O', 'P', 'Q'])
print('g:\n', g)

('a:\n', 0    5
dtype: int64)
('b:\n', Z    5
dtype: int64)
('c:\n', X    5
Y    5
Z    5
dtype: int64)
('d:\n', A    1
B    2
dtype: int64)
('e:\n', A     1
B     2
C   NaN
dtype: float64)
('f:\n', 0    1.078626
1   -0.137018
2    3.329960
3   -1.011534
4    2.037318
dtype: float64)
('g:\n', M    0.863232
N   -0.969351
O    0.036484
P    1.352242
Q    0.015220
dtype: float64)


### Missing data

Initialize `pandas` dataframe with some <b>`Series`</b> objects:

In [22]:
# Initialize a dataframe with a dict of pandas Series

df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

# Notice the introduction of NaNs (why did this happen?)

df

Unnamed: 0,one,three,two
a,0.355765,,-0.355987
b,-0.304327,1.312259,-0.635785
c,-0.891402,-0.144047,0.525341
d,,0.542831,0.642723


In [23]:
# Where are the NaNs?
pd.isnull(df)

Unnamed: 0,one,three,two
a,False,True,False
b,False,False,False
c,False,False,False
d,True,False,False


In [24]:
# Replace NaN with a scalar
df2 = df.fillna(0)
df2

Unnamed: 0,one,three,two
a,0.355765,0.0,-0.355987
b,-0.304327,1.312259,-0.635785
c,-0.891402,-0.144047,0.525341
d,0.0,0.542831,0.642723


In [25]:
# Drop any row with NA/NaN
# how = 'all' will drop only rows with ALL nan
df2 = df.dropna(how = 'any')
df2

Unnamed: 0,one,three,two
b,-0.304327,1.312259,-0.635785
c,-0.891402,-0.144047,0.525341


In [26]:
# Only look in column 'one' for NaNs and drop a row if any
df2 = df.dropna(subset = ['one'])
df2

Unnamed: 0,one,three,two
a,0.355765,,-0.355987
b,-0.304327,1.312259,-0.635785
c,-0.891402,-0.144047,0.525341
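Two related operations often come up alongside the ones above: counting the missing values per column, and filling each column with its own value instead of a single scalar. A minimal sketch on a made-up dataframe named `gaps`:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                     'two': [np.nan, np.nan, 6.0]})

# Number of missing values in each column
print(gaps.isnull().sum())

# fillna also accepts per-column values, here each column's mean
# (mean() skips NaN by default)
filled = gaps.fillna(gaps.mean())
print(filled)
```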


EXERCISE 1:  
```python 
alldates = pd.date_range('09-01-2013', '09-10-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
```

* expand `s` to include the dates that are in `alldates` but missing from `s`
* set the values for those missing dates to 0

In [45]:
alldates = pd.date_range('09-01-2013', '09-10-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})

# The dict keys are strings, so convert them to timestamps first;
# otherwise no label matches alldates and every value comes back NaN
s.index = pd.to_datetime(s.index)

# Expand to every date in alldates (missing dates become NaN)
s2 = s.reindex(alldates)
print(s2)

# Set the missing dates to 0
s3 = s2.fillna(0)
print(s3)

2013-09-01     NaN
2013-09-02     2.0
2013-09-03    10.0
2013-09-04     NaN
2013-09-05     NaN
2013-09-06     5.0
2013-09-07     1.0
2013-09-08     NaN
2013-09-09     NaN
2013-09-10     NaN
Freq: D, dtype: float64
2013-09-01     0.0
2013-09-02     2.0
2013-09-03    10.0
2013-09-04     0.0
2013-09-05     0.0
2013-09-06     5.0
2013-09-07     1.0
2013-09-08     0.0
2013-09-09     0.0
2013-09-10     0.0
Freq: D, dtype: float64


### Slicing

In [46]:
# Use pandas to create a range of dates
dates = pd.date_range('19740101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
df

Unnamed: 0,A,B,C,D
1974-01-01,-0.47262,-0.033557,0.532828,0.144537
1974-01-02,-0.035638,0.223567,-0.509015,0.569364
1974-01-03,0.931459,-1.551312,0.452477,0.768549
1974-01-04,-1.83128,-1.137377,0.814599,-0.765874
1974-01-05,0.629646,1.653523,1.190127,2.455193
1974-01-06,-2.187781,0.617605,-0.163634,-1.045497


In [47]:
# Slice out rows 2-4
df[1:4]

Unnamed: 0,A,B,C,D
1974-01-02,-0.035638,0.223567,-0.509015,0.569364
1974-01-03,0.931459,-1.551312,0.452477,0.768549
1974-01-04,-1.83128,-1.137377,0.814599,-0.765874


In [37]:
# Slice using index range (aka labels)
df['19740102':'19740104']

Unnamed: 0,A,B,C,D
1974-01-02,-0.035638,0.223567,-0.509015,0.569364
1974-01-03,0.931459,-1.551312,0.452477,0.768549
1974-01-04,-1.83128,-1.137377,0.814599,-0.765874


In [48]:
# Slice with names using loc
df.loc[:, ['B', 'D']] # notice lack of parentheses here!

Unnamed: 0,B,D
1974-01-01,-0.033557,0.144537
1974-01-02,0.223567,0.569364
1974-01-03,-1.551312,0.768549
1974-01-04,-1.137377,-0.765874
1974-01-05,1.653523,2.455193
1974-01-06,0.617605,-1.045497


In [49]:
# Slice with index using iloc
df.iloc[3,] # is this a row or column?

A   -1.831280
B   -1.137377
C    0.814599
D   -0.765874
Name: 1974-01-04 00:00:00, dtype: float64

In [52]:
# Slice out specific rows and/or columns with iloc
d4 = df.iloc[[0, 3], [1, 2]]

# Summary statistics of the selection
d4.describe()

Unnamed: 0,B,C
count,2.0,2.0
mean,-0.585467,0.673714
std,0.780519,0.199242
min,-1.137377,0.532828
25%,-0.861422,0.603271
50%,-0.585467,0.673714
75%,-0.309512,0.744157
max,-0.033557,0.814599


In [41]:
# Return types...

df = pd.DataFrame(np.random.randn(3, 4))

# What type is returned from loc and iloc - check here...


EXERCISE 2:  Slicing rows and columns by index<br>
Using this dataframe, 
```python
dates = pd.date_range('19740101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
```
Do the following:<br>
1. Slice out the first row by index
2. Slice out the first column by index
3. Slice out the first and last row, first and last column, by index

In [83]:
dates = pd.date_range('19740101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index = dates, columns = list('ABCD'))
print(df)

# get first row by index
df2 = df.iloc[0, :]
print(df2)

# get first column by index
df3 = df.iloc[:, 0]
#print(df3)

# get first and last row, first and last column, by index
df4 = df.iloc[[0, 5], [0, 3]]
print(df4)


                   A         B         C         D
1974-01-01  1.088854 -0.060348 -1.702012  0.949595
1974-01-02 -1.137522  0.102697 -1.581060 -0.438360
1974-01-03  0.235176 -1.305368  0.116977  0.306957
1974-01-04  1.776084 -0.372009 -0.244790  0.900835
1974-01-05  0.088288  1.149497 -1.510497 -0.181049
1974-01-06 -0.130298  2.471069 -0.256185  2.118328
A    1.088854
B   -0.060348
C   -1.702012
D    0.949595
Name: 1974-01-01 00:00:00, dtype: float64
                   A         D
1974-01-01  1.088854  0.949595
1974-01-06 -0.130298  2.118328


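<b>Like slicing numpy arrays, slicing `pandas` dataframes can produce a <i>view</i></b> rather than a copy.  When you modify a view, you also modify the original.  Whether a slice is a view or a copy depends on the operation and on the `pandas` version (with Copy-on-Write enabled, a change to the slice never reaches the original), so do not rely on either behavior: call `.copy()` when you need an independent object.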
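Because it is easy to lose track of whether an object is a view or a copy, a defensive pattern is to copy explicitly when you want independence and to index the original directly when you want to change it. A minimal sketch (the names `orig` and `independent` are illustrative):

```python
import numpy as np
import pandas as pd

orig = pd.DataFrame(np.arange(12.0).reshape(3, 4))

# An explicit copy is independent: changing it never touches the original
independent = orig[0:1].copy()
independent.iloc[0, 0] = -1
print(orig.iloc[0, 0])      # 0.0

# To change the original, index it directly in a single step
orig.iloc[0, 0] = 99
print(orig.iloc[0, 0])      # 99.0
```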

EXERCISE 3:  Slicing and views
* Write some code here to prove that dataframe slicing produces views...(might produce a warning which is very nice of the interpreter)

In [90]:
# Code up your solution/proof here...
# Solution to showing that slicing a dataframe produces a view, not a copy

# This will likely produce a warning...

# Data
n = pd.DataFrame(np.random.randn(12).reshape(3, 4))
print('original n:', n)#, sep ='\n')

# Slice n
view = n[0:1]
print('\nview:', view)#, sep = '\n')

# Set first element (0,0) to 0
view.iloc[0, 0] = 0
print('\nmodified view:', view)#, sep = '\n')

# Look for 0,0 being 0 when printed below
print('\nn', n)#, sep = '\n')

('original n:',           0         1         2         3
0  0.170749  0.426795  1.043635 -0.406827
1 -0.556874  0.399512  0.820149  1.781749
2  0.378850 -1.543736 -1.050161 -0.919281)
('\nview:',           0         1         2         3
0  0.170749  0.426795  1.043635 -0.406827)
('\nmodified view:',    0         1         2         3
0  0  0.426795  1.043635 -0.406827)
('\nn',           0         1         2         3
0  0.000000  0.426795  1.043635 -0.406827
1 -0.556874  0.399512  0.820149  1.781749
2  0.378850 -1.543736 -1.050161 -0.919281)


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


### Boolean indexing
<table style="width:50%" align="left">
  <tr>
    <td><b>Operator/Method</b></td>
    <td><b>Meaning</b></td>		
  </tr>
    <tr>
    <td>`isnull`</td>
    <td>Returns a df of boolean values representing if the value is null</td>		
  </tr>
    <tr>
    <td>`isin`</td>
    <td>Returns boolean values marking which elements are contained in a given list of values</td>		
  </tr>
  <tr>
    <td>`|`</td>
    <td>or</td>		
  </tr>
  <tr>
    <td>`&`</td>
    <td>and</td>		
  </tr>
  <tr>
    <td>`~`</td>
    <td>not</td>		
  </tr>
</table>
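The operators in the table can be combined to build a filter. The sketch below uses a small made-up dataframe named `pets`. Note that each comparison needs its own parentheses, because `&` and `|` bind more tightly than `<` or `==`.

```python
import pandas as pd

pets = pd.DataFrame({'kind': ['cat', 'dog', 'fish', 'dog'],
                     'age': [3, 7, 1, 2]})

# isin: True where the element is one of the listed values
is_mammal = pets['kind'].isin(['cat', 'dog'])

# Combine conditions with & (and), | (or), ~ (not)
young_mammals = pets[is_mammal & (pets['age'] < 5)]
dog_or_old = pets[(pets['kind'] == 'dog') | (pets['age'] > 5)]
not_mammals = pets[~is_mammal]

print(young_mammals)
print(dog_or_old)
print(not_mammals)
```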

In [95]:
# Initialize a dataframe with a dict of pandas Series and introduce NaNs
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(3), index=['a', 'b', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,three,two
a,-0.26562,,-1.333952
b,0.258648,-0.524998,0.838746
c,-0.220483,0.478359,
d,,1.211835,0.905704


In [98]:
print(df)

df[pd.isnull(df)] = 0.777

print(df)


# Take note of where the NaNs appear

        one     three       two
a -0.265620       NaN -1.333952
b  0.258648 -0.524998  0.838746
c -0.220483  0.478359       NaN
d       NaN  1.211835  0.905704
        one     three       two
a -0.265620  0.777000 -1.333952
b  0.258648 -0.524998  0.838746
c -0.220483  0.478359  0.777000
d  0.777000  1.211835  0.905704


EXERCISE 4:  Replace NaNs with scalar inplace<br><br>
Using the dataframe above, replace all NaNs with a scalar using a criterion (`pd.isnull()`) and inplace (`df[*criterion*]`).

In [None]:
# Code up your solution here...
# One possible solution: assign through a boolean mask of the NaN positions
df[pd.isnull(df)] = 0

### Assignment

In [99]:
# Initialize a dataframe with a dict of pandas Series and introduce NaNs
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,three,two
a,-0.296609,,1.782546
b,0.942898,-1.55364,0.549673
c,-1.39937,-0.515617,0.222548
d,,-1.57571,0.038742


In [100]:
# Assign a whole row
df2 = df.copy()
df2.iloc[3] = 0
df2

Unnamed: 0,one,three,two
a,-0.296609,,1.782546
b,0.942898,-1.55364,0.549673
c,-1.39937,-0.515617,0.222548
d,0.0,0.0,0.0


In [101]:
# Assign a whole column
df2 = df.copy()
df2['one'] = 0
df2

Unnamed: 0,one,three,two
a,0,,1.782546
b,0,-1.55364,0.549673
c,0,-0.515617,0.222548
d,0,-1.57571,0.038742


In [102]:
# Using a criterion to fill in missing values by assignment
df2 = df.copy()
df2[df2.isnull()] = 0
df2

Unnamed: 0,one,three,two
a,-0.296609,0.0,1.782546
b,0.942898,-1.55364,0.549673
c,-1.39937,-0.515617,0.222548
d,0.0,-1.57571,0.038742


EXERCISE 5:  Setting rows of an empty dataframe
<br>Using the following syntax create an empty 100x10 dataframe and assign each row to the same array of numbers
```python
pd.DataFrame(index = range(nrows), columns = range(ncols))
```

In [125]:
# Code up your solution here...
# One array of 10 numbers, reused for every row
x = np.random.randn(10)
df9 = pd.DataFrame(index = range(100), columns = range(10))

for i in range(df9.shape[0]):
    df9.iloc[i] = x
    
df9

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
1,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
2,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
3,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
4,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
5,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
6,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
7,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
8,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
9,0.09812588,1.091615,-0.1451485,-0.3398887,0.475404,-1.058019,0.4484893,0.03245215,0.7060423,-1.048482
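The loop above works, but filling a dataframe row by row is slow, and a dataframe created with only `index` and `columns` starts out with `object` columns rather than numeric ones. As a sketch of an alternative (not the syntax the exercise asks for), `numpy` can build the repeated rows first. The names `row` and `tiled` are illustrative.

```python
import numpy as np
import pandas as pd

row = np.random.randn(10)

# np.tile stacks 100 copies of the row into a (100, 10) array,
# which then becomes a numeric dataframe in one step
tiled = pd.DataFrame(np.tile(row, (100, 1)))

print(tiled.shape)              # (100, 10)
print(tiled.dtypes.unique())
```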


### Broadcasting
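* Broadcasting is how an arithmetic operation between operands of different shapes is carried out: the smaller operand (a scalar, or a 1D array) is stretched across the larger one so the operation can be applied element by element without an explicit loop.  The term comes from the `numpy` package.  Here, it is applied to `pandas` dataframes.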

In [128]:
# Let's create a simple dataframe from a range of numbers with column names
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns = ['a', 'b', 'c'])
df

Unnamed: 0,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


<b>Scalar value broadcasting</b>

In [132]:
# Addition
df = df + 101  # run this cell once; each re-run adds another 101
df

# Try subtraction, multiplication and division on your own


Unnamed: 0,a,b,c
0,101,102,103
1,104,105,106
2,107,108,109
3,110,111,112


<b>Array broadcasting</b>

In [133]:
d = [1, 2, 3]

df * d

# Is the broadcast happening row-wise or column-wise?

# The array 'd' could also be numpy array or pandas series...try these


Unnamed: 0,a,b,c
0,101,204,309
1,104,210,318
2,107,216,327
3,110,222,336
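The direction of the broadcast can be controlled explicitly. The sketch below uses an illustrative dataframe named `grid`. It shows the default behavior, where a length-3 sequence is matched against the columns, and then the `mul` method with `axis='index'`, which matches a `Series` against the row labels instead. A `Series` used with `*` is aligned on column labels, so its order does not matter.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

# Default: a length-3 sequence is matched against the 3 columns,
# so every row is multiplied by [1, 2, 3]
per_column = grid * [1, 2, 3]

# To match against the 4 rows instead, use mul() with axis='index';
# the Series index (0..3) lines up with the row labels
per_row = grid.mul(pd.Series([1, 10, 100, 1000]), axis='index')

# With *, a Series is aligned on column labels, not on position
by_label = grid * pd.Series({'c': 3, 'a': 1, 'b': 2})

print(per_column)
print(per_row)
print(by_label)
```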


### Adding and removing columns

In [134]:
# Our familiar pandas dataframe
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
    'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
    'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,three,two
a,0.396797,,0.438713
b,-1.328575,1.79435,0.318041
c,0.826842,-0.819218,0.05847
d,,0.399921,1.375172


In [135]:
# Create a new column and add it to dataframe
df['four'] = df['one'] + df['two']
df

Unnamed: 0,one,three,two,four
a,0.396797,,0.438713,0.83551
b,-1.328575,1.79435,0.318041,-1.010534
c,0.826842,-0.819218,0.05847,0.885312
d,,0.399921,1.375172,


In [136]:
# Remove a column by label
df.drop('four', axis = 'columns')

# Check to see if df was modified (if not how would we modify it inplace?)
df

Unnamed: 0,one,three,two,four
a,0.396797,,0.438713,0.83551
b,-1.328575,1.79435,0.318041,-1.010534
c,0.826842,-0.819218,0.05847,0.885312
d,,0.399921,1.375172,
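As the output shows, `drop` returned a new dataframe and left `df` untouched. A few ways to make the removal stick are sketched below on an illustrative dataframe named `table`.

```python
import pandas as pd

table = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'four': [4, 6]})

# drop returns a new dataframe; reassign the name to keep the result
table = table.drop('four', axis='columns')
print(list(table.columns))      # ['one', 'two']

# del removes a column from the dataframe itself
del table['two']
print(list(table.columns))      # ['one']

# pop removes a column and hands it back as a Series
one = table.pop('one')
print(one.tolist())             # [1, 2]
```

`drop` also accepts `inplace = True`, in the same way as `rename` earlier in this notebook.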


### References
[The basics from pandas documentation]: http://pandas.pydata.org/pandas-docs/version/0.16.2/basics.html
[Pandas cheatsheet from Notebook Gallery]: http://nbviewer.ipython.org/github/pybokeh/jupyter_notebooks/blob/master/pandas/PandasCheatSheet.ipynb
1. [The basics from pandas documentation]
2. [Pandas cheatsheet from Notebook Gallery]

Created by a Microsoft Employee.
	
The MIT License (MIT)<br>
Copyright (c) 2016