# Data munging with pandas

_Adapted from original materials by [manishamde](https://github.com/manishamde)._

Data munging or data wrangling is loosely the process of manually converting or mapping data from one 
"raw" form into another format that allows for more convenient consumption of the data with the help of 
semi-automated tools.

Data munging is basically the hip term for cleaning up a messy data set.

Data munging involves common operations such as: 
 - Indexing
 - Renaming
 - Handling missing values
 - map(), apply(), applymap()
 - New Columns = f(Existing Columns)
 - Basic stats
 - Merge, join
 - Plots

In [1]:
import pandas as pd
import numpy as np
%pylab inline

Populating the interactive namespace from numpy and matplotlib


We'll try the above operations on a very simple dataframe:

In [2]:
def defdf():
    df = pd.DataFrame({'int_col' : [1, 2, 6, 8, -1], 
                   'float_col' : [0.1, 0.2, 0.2, 10.1, None], 
                   'str_col' : ['a', 'b', None, 'c', 'a']})
    return df

df = defdf()
df

| | float_col | int_col | str_col |
|---|---|---|---|
| 0 | 0.1 | 1 | a |
| 1 | 0.2 | 2 | b |
| 2 | 0.2 | 6 | None |
| 3 | 10.1 | 8 | c |
| 4 | NaN | -1 | a |


## Indexing

### Selecting a subset of columns
 * Select only the float and string columns of the dataframe

In [None]:
df[['float_col', 'str_col']]

### Conditional indexing
 * Using boolean indexing, select the rows of the dataframe for which float column is larger than 0.15
 * Select the rows for which float column is larger than 0.1 and integer column is larger than 2. Then replace 'and' with 'or'
 * Select the rows for which string column is not 'a'

In [9]:
df[df['float_col']>0.15]

| | float_col | int_col | str_col |
|---|---|---|---|
| 1 | 0.2 | 2 | b |
| 2 | 0.2 | 6 | None |
| 3 | 10.1 | 8 | c |
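The remaining two exercises could be approached along these lines. This is only a sketch: the variable names `both`, `either` and `not_a` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses because & and | bind tighter than >.
both = df[(df['float_col'] > 0.1) & (df['int_col'] > 2)]
either = df[(df['float_col'] > 0.1) | (df['int_col'] > 2)]

# Rows whose string column is not 'a'.
# A missing value typically compares as "not equal", so that row is usually kept too.
not_a = df[df['str_col'] != 'a']
```

Python's plain `and` / `or` keywords do not work here, because they try to reduce a whole Series to a single truth value.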


## Renaming

 * Use the rename method to rename all three columns
 * Set inplace=True for the changes to affect the existing dataframe

In [4]:
# rename expects a mapping from old names to new names, not a list
df.rename(columns={'float_col': 'x', 'int_col': 'y', 'str_col': 'z'})

## Handling missing values

### Drop missing values

 * Use dropna to drop all rows with missing data (NaN). From now on, perform the rest of the exercises on this modified dataframe.

In [8]:
df.dropna()

| | float_col | int_col | str_col |
|---|---|---|---|
| 0 | 0.1 | 1 | a |
| 1 | 0.2 | 2 | b |
| 3 | 10.1 | 8 | c |


### Fill missing values
 * Use fillna to fill missing data. Fill float column with median of column and string column with a character of your choosing. Use inplace to alter the value in the original dataframe.

In [20]:
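# Fill via the dataframe itself; inplace on a selected column may not propagate in recent pandas
df.fillna({'float_col': df['float_col'].median()}, inplace=True)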
df

| | float_col | int_col | str_col |
|---|---|---|---|
| 0 | 0.1 | 1 | a |
| 1 | 0.2 | 2 | b |
| 2 | 0.2 | 6 | None |
| 3 | 10.1 | 8 | c |
| 4 | 0.2 | -1 | a |
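The string column can be filled in the same call by passing a dict of per-column fill values. The sketch below assigns the result back instead of using `inplace`, and the placeholder `'missing'` is an arbitrary choice made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# One fill value per column: the median for floats, a placeholder for strings.
fill_values = {'float_col': df['float_col'].median(), 'str_col': 'missing'}
df = df.fillna(fill_values)
```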



## Vectorized operations: map, apply
### map
The map method applies a function to each element of a Series and returns a new Series of the results.
 * Use map to generate a series that equals each element of integer column squared
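A minimal sketch of the squaring exercise, assuming the same example dataframe as above (the name `squared` is introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# map calls the lambda once per element of the integer column
squared = df['int_col'].map(lambda x: x ** 2)
```

For plain arithmetic like this, `df['int_col'] ** 2` gives the same values without a Python-level loop; `map` becomes useful when the per-element logic is not expressible as array arithmetic.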

### apply
The apply operation applies a function along any axis of the dataframe.
 - axis=0: apply function to each column
 - axis=1: apply function to each row

Depending on the return type of the function passed to apply(), the result will either be of lower dimension or the same dimension.


 * Use apply on columns to compute the square root of the float and integer columns
 * Use apply on rows to compute the cumulative sum by rows of the elements of the float and integer columns

In [22]:
df['float_col'].apply(np.sqrt)

0    0.316228
1    0.447214
2    0.447214
3    3.178050
4    0.447214
Name: float_col, dtype: float64
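Both exercises on the numeric columns could look roughly like this. The sketch works on the rows that survive `dropna`, which also avoids taking the square root of -1; the names `num`, `roots` and `row_cumsum` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Keep complete rows only, then select the two numeric columns.
num = df.dropna()[['float_col', 'int_col']]

# axis=0 (the default): the function receives one column at a time.
roots = num.apply(np.sqrt)

# axis=1: the function receives one row at a time.
# Returning a Series of the same length keeps the result two-dimensional.
row_cumsum = num.apply(lambda row: row.cumsum(), axis=1)
```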

### applymap
The applymap operation applies a function to every element of a dataframe, one element at a time. In pandas 2.1 and later, this method is named `DataFrame.map` and `applymap` is deprecated.

 * Use applymap to transform the dataframe in the following manner: duplicate elements of type string ('z' -> 'zz') and compute the exponential of numerical elements. Hint: define first the function fn that needs to be applied.
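One possible shape for `fn`, sketched on the rows without missing values. Because `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1, the sketch picks whichever of the two the installed version provides; the helper name `elementwise` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']}).dropna()

def fn(x):
    """Duplicate strings ('z' -> 'zz'); exponentiate numbers."""
    if isinstance(x, str):
        return x * 2
    return np.exp(x)

# Use DataFrame.map where it exists (pandas >= 2.1), otherwise fall back to applymap.
elementwise = getattr(pd.DataFrame, 'map', None) or pd.DataFrame.applymap
result = elementwise(df, fn)
```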

## New Columns = f(Existing Columns)

Generating new columns from existing columns in a data frame is an integral part of the data munging workflow.

### Multiple columns as a function of a single column

 * Use map in combination with zip to construct two new columns being the square and third power of the integer column. [Help](http://stackoverflow.com/questions/12356501/pandas-create-two-new-columns-in-a-dataframe-with-values-calculated-from-a-pre)
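A sketch of the map-plus-zip idiom: `map` returns one tuple per element, and `zip(*...)` transposes those tuples into one sequence per new column. The column names `int_sq` and `int_cube` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Each element becomes a (square, cube) pair; zip(*pairs) splits the pairs
# into two tuples, which are assigned to two new columns in one statement.
pairs = df['int_col'].map(lambda x: (x ** 2, x ** 3))
df['int_sq'], df['int_cube'] = zip(*pairs)
```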

### Single column as a function of multiple columns

 * Use apply to construct a new column sum of the float and integer columns. [Help](http://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe?lq=1)
 * Use apply to construct a new column composed of the concatenation of the string column and the integer column cast to string
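Both bullets can be sketched with a row-wise `apply`. The example assumes rows with missing values have already been dropped, and the new column names `sum_col` and `str_int` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

# Work on a copy of the complete rows so the original dataframe is untouched.
clean = df.dropna().copy()

# axis=1 hands each row to the lambda as a Series indexed by column name.
clean['sum_col'] = clean.apply(lambda row: row['float_col'] + row['int_col'], axis=1)
clean['str_int'] = clean.apply(lambda row: row['str_col'] + str(row['int_col']), axis=1)
```

The same results are available without `apply`, for example `clean['float_col'] + clean['int_col']` and `clean['str_col'] + clean['int_col'].astype(str)`, which is usually faster on large frames.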

### Multiple columns as a function of multiple columns

 * Use apply and a function that returns a Series to construct two new columns being the square root of the float and integer columns. [Help](http://stackoverflow.com/questions/10751127/returning-multiple-values-from-pandas-apply-on-a-dataframe)
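When the function passed to `apply` returns a Series, the keys of that Series become columns of the result. A sketch, again on the rows without missing values; the function name `roots` and the column names `float_sqrt` and `int_sqrt` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})

def roots(row):
    # The keys of the returned Series become the new column names.
    return pd.Series({'float_sqrt': np.sqrt(row['float_col']),
                      'int_sqrt': np.sqrt(row['int_col'])})

clean = df.dropna()
new_cols = clean[['float_col', 'int_col']].apply(roots, axis=1)

# Attach the two new columns to the cleaned dataframe by index.
clean = clean.join(new_cols)
```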

## Basic stats

### describe

 * Use describe to gather information on the distribution of the float and integer columns.
 * Use boxplot for a visual representation of the same information

In [23]:
df.describe()

| | float_col | int_col |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 2.16 | 3.2 |
| std | 4.438806 | 3.701351 |
| min | 0.1 | -1.0 |
| 25% | 0.2 | 1.0 |
| 50% | 0.2 | 2.0 |
| 75% | 0.2 | 6.0 |
| max | 10.1 | 8.0 |


## Merge and Join

Pandas supports database-like joins, which make it easy to link data frames.

 * Perform inner, outer, left and right joins of the dataframe with the second dataframe defined below

In [None]:
df2 = pd.DataFrame({'str_col_2' : ['a','b'], 'int_col_2' : [1, 2]})
df2
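The four join types can be sketched with `merge`. The exercise does not say which columns to join on, so matching `str_col` against `str_col_2` is an assumption made here; the dict name `joins` is also illustrative.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                   'float_col': [0.1, 0.2, 0.2, 10.1, None],
                   'str_col': ['a', 'b', None, 'c', 'a']})
df2 = pd.DataFrame({'str_col_2': ['a', 'b'], 'int_col_2': [1, 2]})

# Assumed join keys: str_col on the left, str_col_2 on the right.
joins = {
    how: df.merge(df2, left_on='str_col', right_on='str_col_2', how=how)
    for how in ['inner', 'outer', 'left', 'right']
}

for how, joined in joins.items():
    print(how, len(joined))
```

With these keys, the inner and right joins keep only the rows whose string is 'a' or 'b', while the left and outer joins keep every row of `df` and leave the `df2` columns empty where nothing matched.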

## Plots

Pandas is equipped with straightforward wrappers for quick plotting of data.

 * Use plot to visualize the values of the columns of the dataframe defined below
 * Use hist to visualize the distribution of the data in the form of a histogram

In [None]:
plot_df = pd.DataFrame(np.random.randn(1000,2),columns=['x','y'])
plot_df['y'] = plot_df['y'].map(lambda x : x + 1)
plot_df.head()
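Both plotting exercises could be sketched as follows, assuming matplotlib is installed. The `Agg` backend line is only there so the sketch also runs outside a notebook; inside a notebook with inline plotting it can be dropped. The choice of `bins=30` is arbitrary.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed in a notebook

import numpy as np
import pandas as pd

plot_df = pd.DataFrame(np.random.randn(1000, 2), columns=['x', 'y'])
plot_df['y'] = plot_df['y'] + 1

# Line plot: one line per column, with the row index on the horizontal axis.
ax = plot_df.plot()

# Histograms: one subplot per column.
hist_axes = plot_df.hist(bins=30)
```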