## Data Analysis in Python: Introduction to Pandas

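Pandas is built on top of NumPy and is designed to eliminate the need for explicit loops in filtering and aggregation work. Its performance-critical code paths are written in Cython or C, so vectorized operations are typically much faster than equivalent pure-Python loops; the exact speedup depends on the operation and the size of the data.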
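
As a rough illustration of the loop-free style, the sketch below computes the same filtered total twice: once with an explicit Python loop and once with a boolean mask. The `prices` values are made up for the example.

```python
import pandas as pd

# Hypothetical data, used only for this illustration
prices = pd.Series([12.5, 40.0, 7.25, 99.0, 55.5])

# Loop-based version: filter and aggregate by hand
total_loop = 0.0
for p in prices:
    if p > 20:
        total_loop += p

# Vectorized version: a boolean mask filters, .sum() aggregates
total_vec = prices[prices > 20].sum()

print(total_loop == total_vec)
```

Both versions give the same total; the vectorized form is shorter and pushes the iteration into compiled code.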

Key Features
* Easy handling of ***missing data.*** (`dropna, fillna, ffill, isnull, notnull`)
* Simple ***mutations*** of tables (add/remove columns)
* Easy ***slicing*** of data (fancy indexing and subsetting)
* Automatic ***data alignment*** (by index)
* Powerful ***split-apply-combine*** (`groupby`)
* Intuitive ***merge/join*** (`merge, concat, join`)
* Reshaping and ***Pivoting*** (`stack, pivot`)
* ***Hierarchical Labeling*** of axes indices
* Robust ***I/O tools*** to work with csv, Excel, flat files, ***databases and HDF5***
* Integrated ***Time Series*** Functionality
* Easy plotting (`plot`)

Stack Overflow page for handling big data workflows in Pandas [link](http://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas)

In [2]:
import os
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import matplotlib.pyplot as plt

%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [3]:
np.array?

## Using the `array()` function

In [None]:
arr_1d = np.array((1, 2, 3))

In [None]:
arr_1d.shape

In [None]:
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); arr_2d

In [None]:
arr_2d.shape

In [None]:
arr_3d = np.array([arr_2d, arr_2d])

In [None]:
arr_3d.shape

## Using generation functions

In [None]:
np.arange(16)

In [None]:
np.arange(16).reshape(4, 4)

## Using random numbers

In [None]:
arr_1d = np.random.randint(1, 100, 16)

In [None]:
arr_2d = np.random.randint(0, 1000, 16).reshape(4, 4)

In [None]:
np.random.randn(10).round(2)

In [None]:
np.random.randn(30).reshape(5, 6).round(2)

## Subsetting (get a number or many numbers from an array)

In [None]:
arr_1d

In [None]:
arr_1d[::2]

In [None]:
arr_2d

In [None]:
# Getting a single number
arr_2d[3, 0]

In [None]:
arr_2d[0, :]

In [None]:
arr_2d[:, 0]

In [None]:
arr_2d[:2, :]

In [None]:
arr_2d[:, 2:]

In [None]:
arr_2d[2:4, 2:4]

## Subsetting with Booleans

In [None]:
arr_1d

In [None]:
arr_1d > 50

In [None]:
arr_1d[arr_1d > 50]

In [None]:
arr_2d

In [None]:
arr_2d % 2 == 0

In [None]:
arr_2d[arr_2d % 2 == 0]

## Mathematical Operations

In [None]:
print arr_1d
arr_1d + arr_1d

In [None]:
print arr_2d
arr_2d + arr_2d

## Math Functions

In [None]:
np.sqrt(arr_1d).round(2)

In [None]:
np.log(arr_2d).round(2)

## Array Attributes and Methods

- commonly used: `reshape, round... `

In [None]:
np.array([True, True, False, True]).all()

In [None]:
np.array([True, True, False, True]).any()

In [None]:
arr_1d.argmax()

In [None]:
arr_1d.argsort()

In [None]:
arr_1d[arr_1d.argsort()]

In [None]:
arr_1d

In [None]:
arr_1d.clip(20, 90)

---
# Pandas

http://pandas.pydata.org/

---

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

---

In [None]:
from pandas import Series, DataFrame

In [None]:
print pd.__version__

In [None]:
ls

### 2.2 Importing a CSV File

Reading a CSV is as simple as calling the `read_csv` function. By default, `read_csv` expects the column separator to be a comma, but you can change that using the `sep` parameter.

Syntax: `pd.read_csv(filepath, sep=, header=, names=, skiprows=, na_values= ... )`
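
A minimal sketch of the `sep` and `na_values` parameters, reading from an in-memory string instead of a file. The column names and the `'missing'` marker are assumptions made up for this example.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a file on disk
raw = u"id;name;score\n1;Ann;9.5\n2;Bob;missing\n3;Cid;7.0\n"

df_demo = pd.read_csv(io.StringIO(raw),
                      sep=';',                # column separator
                      na_values=['missing'])  # extra strings to treat as NaN

print(df_demo)
```

Because `'missing'` is parsed as NaN, the `score` column stays numeric rather than becoming a column of strings.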

### Inspect file without importing it

In [None]:
!wc -l train.csv

In [None]:
!head -10 train.csv 

In [None]:
df_titanic = pd.read_csv('train.csv')

In [None]:
type(df_titanic)

In [None]:
df_titanic.tail(2)

---

## 3. Support for SQL Databases

pandas also has some support for reading/writing DataFrames directly from/to a database. 

You'll typically just need to pass a connection object to `pd.read_sql` (for reading) or to the DataFrame's `to_sql` method (for writing). Older pandas releases exposed these as the `read_frame` and `write_frame` functions in `pandas.io.sql`.

***Note*** Writing a DataFrame this way executes as a series of `INSERT INTO` statements and thus trades speed for simplicity. If you're writing a large DataFrame to a database, it might be quicker to write the DataFrame to CSV and load that directly using the database's bulk file import facility.
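
A small round trip, sketched with an in-memory SQLite database so that no file is touched. The table name `towed_demo` and its contents are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical table contents; an in-memory database avoids touching any file
towed_demo = pd.DataFrame({'make': ['FORD', 'HOND', 'FORD', 'TOYT'],
                           'color': ['RED', 'BLU', 'WHI', 'BLK']})

conn_demo = sqlite3.connect(':memory:')

# Write the DataFrame to a new table, then read an aggregate back
towed_demo.to_sql('towed_demo', conn_demo, index=False)
counts = pd.read_sql("SELECT make, COUNT(*) AS n FROM towed_demo "
                     "GROUP BY make ORDER BY n DESC, make", conn_demo)
conn_demo.close()

print(counts)
```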

In [None]:
[x for x in os.listdir(os.getcwd()) if 'db' in x]

In [None]:
pd.read_sql?

In [None]:
from pandas.io import sql
import sqlite3 

conn = sqlite3.connect('towed.db')
query = "SELECT * FROM towed WHERE make = 'FORD';"

results = pd.read_sql(query, con=conn)
print results.head()

In [None]:
pd.read_sql("SELECT distinct make, count(*) from towed group by 1 order by 2 desc limit 10", conn)

---

Read: 

1. Homepage http://www.sqlalchemy.org/
2. Engines http://docs.sqlalchemy.org/en/latest/core/engines.html

---

## 4. Reading from the Clipboard!

This is as straightforward as it ought to be.

Example: `df_2 = pd.read_clipboard(); df_2.head()`

---

---
## 1. Series
A Series is a 1-D array of data (similar to an array, list, or column in a table) with an associated **labeled _index_.**

It can be created in much the same way as a NumPy array.

> Syntax: `Series(data=, index=, dtype=, name=)`

In [None]:
from pandas import Series, DataFrame, read_csv

- Using an array

In [None]:
x_random = np.random.randn(10).round(2)

print type(x_random)
x_random

In [None]:
Series(x_random, 
       name='my_series_1', 
       dtype=object, 
       index=['ind_' + str(i) for i in range(10)])

In [None]:
s_random = Series(x_random)
# no index specified, numeric will be automatically generated
print type(s_random)
s_random

In [None]:
my_series = Series(x_random, index=list('aabbbccdef'))
# passing an index specifically
my_series

<big>
- Using a list, tuple or dict

In [None]:
dict_1 = dict(zip(list('abcdefghij'), np.random.random(10).round(2)))

In [None]:
dict_1

In [None]:
Series(dict_1)

In [None]:
Series(data=[1, 2, 3], 
       index=list('abc'), 
       name='Series_1', 
       dtype=float)

In [None]:
Series(data=(1, 2, 3), index=list('abc'), 
       name='Series_1', dtype=np.int64)

In [None]:
Series({'a': 1, 'b': 2, 'c':3}, dtype=str)

---
## Series Attributes

In [None]:
my_series

In [None]:
my_series.values

In [None]:
my_series.index

In [None]:
type(my_series.index)

### Modifying the Index

In [None]:
my_series.index = list('abcde' * 2)

In [None]:
my_series

In [None]:
my_series.name = 'ser1'

In [None]:
my_series

--- Convert a dictionary to a Series, using the **keys** of the dictionary as its **index**.

---
**Task 1:** Declare a Series using this data:
```
{'ham': 1, 'eggs': 3, 'bacon': 2, 'coffee': 1, 'toast': 0.5, 'jam': 0.2}
```
- The menu options should form the index.
- The series should be called 'menu'

In [None]:
diner = Series({'ham': 1, 'eggs': 3, 'bacon': 2, 'coffee': 1, 'toast': 0.5, 'jam': 0.2},
              name='menu')

In [None]:
diner_dict = diner.to_dict()

In [None]:
{k: diner_dict[k] for k in ['eggs', 'bacon']}

In [None]:
# A plain dict cannot be indexed with a list of keys: this raises a TypeError
diner_dict[['eggs', 'bacon']]

In [None]:
diner[['eggs', 'bacon']]

In [None]:
diner['coffee':'jam']

In [None]:
diner + 0.50

In [None]:
diner.map(lambda x: float(x) + 0.50)

---------------------------------------------------------------------------------------------------------------------

## 1.3 Subsetting a Series

<big>

The different methods of subsetting that we've seen so far include

- Using slices or positional indexers (for lists and arrays)
- Using keys (for dictionaries)
- Using bools (for arrays)

For a pandas Series, we can use any of the above strategies, as well as specialized indexers for pulling data from a Series.

In [None]:
my_series = Series(np.random.randn(5).round(2), index = list('abcde'))
my_series

In [None]:
# One Label
my_series['a']

In [None]:
# List of Labels
my_series[['a', 'b']] 

In [None]:
# Label Slice
my_series['b':'d']

In [None]:
my_series[1:4]

In [None]:
# positional slicing
my_series[:3]

In [None]:
my_series[:2]

In [None]:
my_series[::-1]

In [None]:
my_series[::-2]

In [None]:
ser_2 = Series(list('abcde'), index=range(1, 6))

## Series Slicing Methods

    loc and iloc
    at and iat
    ix

In [None]:
# LABEL BASED INDEXER METHOD
my_series.loc?

In [None]:
my_series.loc[['a', 'c', 'e']]

In [None]:
# INTEGER BASED INDEXER METHOD
my_series.iloc?

In [None]:
my_series.iloc[2:4]

In [None]:
# MIXED LABELS and INTEGERS BASED INDEXER METHOD
my_series.ix?

In [None]:
my_series.ix['a':'c']

In [None]:
my_series.ix[0:4]

## Boolean Series and Indexing

- Use conditional operators to create an equal-length Boolean series
- Subset the series using this boolean inside square brackets

In [None]:
ser_x = my_series.copy()

In [None]:
ser_x

In [None]:
%%timeit
positives = []
for i in ser_x:
    if i > 0:
        positives.append(round(i, 2))        
positives        

In [None]:
%%timeit

positives = ser_x[ser_x > 0]
positives

### pd.Series.ix?

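``.ix[]`` supports mixed integer and label based access. It is
primarily label based, but will fall back to integer positional
access when the index does not contain integers. Note that ``.ix`` was deprecated in pandas 0.20 and removed in pandas 1.0; use ``.loc`` (labels) or ``.iloc`` (positions) in current releases.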
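
For pandas versions that no longer ship `.ix`, the two kinds of access can be written explicitly. A minimal sketch, using a made-up Series `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

# Label-based slice: both end points are included
by_label = s_demo.loc['a':'c']

# Position-based slice: the stop position is excluded, as with lists
by_position = s_demo.iloc[0:2]

print(by_label.tolist())
print(by_position.tolist())
```

Spelling out `.loc` or `.iloc` removes the ambiguity about whether `0:2` means labels or positions.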

In [None]:
my_series.ix[0:2]

In [None]:
my_series.ix['c':'e']

### Series Methods for finding the largest and smallest values

In [None]:
my_series.max()

In [None]:
my_series.idxmax()

In [None]:
my_series.idxmin()

In [None]:
my_series.min()

---------------------------------------------------------------------------------------------------------------------

## 1.4 Array Operations on a Series
Array operations on a Series preserve the index-value links.

In [None]:
ser_x2 = Series(range(4, 21, 4))
print ser_x2

np.sqrt(ser_x2)

In [None]:
my_series_2 = Series({'c': 1, 'd': 0.14, 'e':10, 'f': 2, 'g':-0.5})
print my_series,'\n', my_series_2

In [None]:
print my_series + my_series_2

In [None]:
my_series / 2

Many dict-style operations (such as `in`, `keys()`, `get()` and lookup by key) also work on a Series.
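
A short sketch of dict-style access on a Series. The `menu_demo` data is hypothetical.

```python
import pandas as pd

menu_demo = pd.Series({'ham': 1.0, 'eggs': 3.0, 'jam': 0.2})

has_eggs = 'eggs' in menu_demo           # membership is tested against the index
labels = list(menu_demo.keys())          # keys() returns the index
tea_price = menu_demo.get('tea', 0.0)    # get() with a default, as on a dict
pairs = list(menu_demo.items())          # (label, value) pairs

print(has_eggs)
print(tea_price)
```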

---------------------------------------------------------------------------------------------------------------------

### 1.5 Check if an item exists in a Series

In [None]:
# `in` checks the index labels, not the values
'boo' in my_series

---
## The `.isin()` method

In [None]:
pls = Series(['c', 'py', 'java', 'scala'])

In [None]:
# Negate a boolean Series with ~ (unary minus is not supported for booleans)
~pls.isin(['c', 'py'])

In [None]:
pls[pls.isin(['java', 'py'])]

---------------------------------------------------------------------------------------------------------------------

## Reindexing

In [None]:
my_series

In [None]:
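# Older pandas returned NaN for the missing labels 'y' and 'x';
# pandas 1.0 and later raise a KeyError here, so prefer reindex (next cell)
my_series[['a', 'y', 'b', 'd', 'e', 'x']]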

In [None]:
my_series.reindex(['a', 'y', 'b', 'd', 'e', 'x'])

### 1.6 Detect Missing Values


- Missing values appear as NaN. The functions _isnull_ and _notnull_ are used to detect missing values.
- They both produce booleans that can be used for subsetting

In [None]:
cities = Series(data = [18, None, 5, None, 13], 
                index=['DEL', 'BOM', 'BLR', 'DXB', 'BKK'])
cities

In [None]:
pd.Series.isnull?
pd.Series.notnull?

In [None]:
print zip(cities, cities.isnull())

In [None]:
zip(cities, cities.notnull())

In [None]:
cities[cities.isnull()]

In [None]:
cities[cities.notnull()]

In [None]:
index2 = ['a', 'd', 'e', 'f', 'g']
my_series2 = my_series.reindex(index2)  # labels 'f' and 'g' are absent, so they become NaN
my_series2

In [None]:
my_series2[my_series2.notnull()]

In [None]:
my_series2[my_series2.isnull()]

In [None]:
my_series2.fillna(-999)

In [None]:
cities.ffill()  # forward fill; equivalent to the older fillna(method='ffill')

In [None]:
cities.fillna(cities.median())

## Difference between None and NaN

<big><br>

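- `NaN` (`np.nan`) is a floating-point value, defined by the IEEE 754 standard, that marks an undefined or missing number
- `None` is Python's null object (of type `NoneType`) and is commonly used to mark missing data of any type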
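
The practical differences can be checked directly. A minimal sketch (variable names are made up for this example):

```python
import numpy as np
import pandas as pd

nan_equals_itself = (np.nan == np.nan)   # NaN never compares equal, even to itself
none_is_none = (None is None)            # None is a singleton object

# In a numeric Series, None is converted to NaN and the dtype becomes float
s_missing = pd.Series([1.0, None, np.nan])

print(s_missing.dtype)
print(s_missing.isnull().tolist())
```

This is why `isnull()` is the reliable test for missing values: comparing against `np.nan` with `==` never matches.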

In [None]:
type(np.nan)

In [None]:
bool(np.nan)
# Truthiness value of np.nan is True

In [None]:
type(None)

In [None]:
bool(None)

In [None]:
# Series Methods do not discriminate between None and NaN
Series({'a': None, 'c': 101, 'b': np.nan, 'd': 'red'}).isnull()

---

<big>

Task: Generate a Series of 150 ages with a mean of 35 years. Set every fifteenth value to missing. Find the new mean. Fill the missing data with (a) the mean and (b) the median, and report the new means.

hint: use `np.random.randn` or `np.random.normal`

In [None]:
ages = Series(np.random.normal(loc=35, scale=1, size=150))

In [None]:
ages.plot.hist(bins=20, xlim=(30, 40))

In [None]:
ages.mean()

In [None]:
ages[::15] = np.nan

In [None]:
ages.mean()

In [None]:
ages.median()

In [None]:
ages.fillna(ages.mean()).mean()

In [None]:
ages.fillna(ages.median()).mean()

# Strategies for dealing with missing data

---

1. Drop the rows with missing values if there aren't too many of them
2. Impute with 0s (for quantities like amounts), with the mean (for a symmetric distribution), or with the median (for an asymmetric distribution)
3. Impute by cluster mean/median
4. Use kNN/regression, where the variable with missing values is the dependent variable (DV) and the others are independent variables (IVs), to predict the missing data.
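
The first three strategies can be sketched in a few lines. The `group` and `value` columns below are hypothetical, and the group column stands in for a cluster label.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'group': list('aabbb'),
                           'value': [1.0, np.nan, 10.0, 20.0, np.nan]})

# 1. Drop rows where 'value' is missing
dropped = df_missing.dropna(subset=['value'])

# 2. Impute with a single overall statistic (here, the median)
median_filled = df_missing['value'].fillna(df_missing['value'].median())

# 3. Impute with the mean of each row's own group (cluster)
group_mean = df_missing.groupby('group')['value'].transform('mean')
group_filled = df_missing['value'].fillna(group_mean)

print(group_filled.tolist())
```

`transform('mean')` returns one value per original row, which is what lets `fillna` line the group means up with the gaps.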

---

### 1.7 Alignment in Arithmetic Operations
Series with different indexes are automatically aligned, and NaNs are introduced wherever a label is missing from either Series.

The indexes are _unioned_.

> Think of binary operations as outer joins. 
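
A small sketch of the outer-join behaviour, together with the `add` method's `fill_value` argument for cases where a missing label should count as zero. The two Series are made up for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([10, 20, 30], index=list('bcd'))

# Plain addition: labels are unioned, unmatched labels become NaN
plain_sum = s1 + s2

# add() with fill_value: a label missing on one side is treated as 0
filled_sum = s1.add(s2, fill_value=0)

print(plain_sum)
print(filled_sum)
```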

In [None]:
my_series

In [None]:
my_series2 = Series(range(5), index=list('cdefg'))

In [None]:
my_series2

In [None]:
# Series have different indexes
# The indexes are UNION'd
# Missing values are used where there isn't a match.
my_series + my_series2

In [None]:
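# Unlike arithmetic, comparison operators do not align:
# Series with different indexes raise a ValueError
my_series > my_series2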

---
## Series Methods - Important

In [None]:
# Describe method on char series
Series(list('Dogs are descended from wolves.')).describe()

In [None]:
# Describe method on numeric series
ser_x = Series(np.random.normal(0, 1, 10000))
ser_x.describe()

In [None]:
ser_x.describe(percentiles=[0.01, 0.05, 0.97, 0.99])

In [None]:
ser_x.describe().loc[['min', 'mean', '50%', 'max']]

### More Series methods using the Titanic Data

In [None]:
df_x = pd.read_csv('train.csv')

In [None]:
df_x.head()

In [None]:
type(df_x)

In [None]:
type(df_x.loc[:, 'Survived'])

In [None]:
type(df_x.loc[0, :])

- `value_counts()` for frequency tables

In [None]:
df_x['Embarked'].value_counts()

In [None]:
%pylab inline

In [None]:
df_x['Embarked'].value_counts().plot.bar(figsize=(3, 3));

<big>
this is equivalent to

    SELECT distinct Embarked, count(*) 
    from titanic 
    group by 1
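
The same frequency table can also be written as an explicit group-by, which mirrors the SQL more closely. A minimal sketch on a made-up Series of port codes:

```python
import pandas as pd

embarked_demo = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C'])

# value_counts(): counts per distinct value, most frequent first
counts_vc = embarked_demo.value_counts()

# Equivalent group-by: group the Series by its own values and count each group
counts_gb = embarked_demo.groupby(embarked_demo).size()

print(counts_vc)
print(counts_gb)
```

The two results hold the same counts; only the ordering differs (by frequency versus by label).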

- `astype()` for type conversion

In [None]:
df_x['PassengerId'].head().astype(float).astype(str)

- `map()` for applying a function to each element of a series.

In [None]:
df_x['Fare'].head().map(lambda x: 'Rs. ' + str(int(x * 65.0)))

In [None]:
df_x['Age'].head().map(lambda x: x-1)

In [None]:
df_x['Name'].head().map(lambda x: 'Female' if (('Mrs' in x) or ('Miss' in x)) else 'Male' )

In [None]:
df_x['Sex'].head().map(lambda x: x.upper())

In [None]:
df_x['Name'].head().map(lambda x: (x[:20], len(x)))

- `idxmax()` for finding where the maximum value occurs

In [None]:
fares = Series(df_x['Fare'].values, index=df_x['Name'].values)
print fares.head()
print fares.idxmax(), fares.idxmin()

- `.sort_values()` to sort a series

In [None]:
fares.sort_values(ascending=False)[:5]

- `plot()` to visualize data

In [None]:
import seaborn as sns

In [None]:
fares.plot.hist(figsize=(5, 3));

In [None]:
df_x['Embarked'].value_counts().plot.barh(figsize=(3, 3));

In [None]:
df_x[['Age', 'Fare']].plot.box(figsize=(4, 2), ylim=(0, 80), subplots=True);

- `replace()` method for replacing values using a dict

In [None]:
fruits = Series(['apples', 'oranges', 'peaches', 'mangoes'])

In [None]:
rep = {'apples':'bananas', 'peaches':'grapes'}

In [None]:
fruits.replace(rep)

- `duplicated()` creates a boolean Series that flags repeated values; `drop_duplicates()` returns the Series with those repeats removed

In [None]:
ser_d = Series(list('ABCDEFABGHIABCD'))

In [None]:
zip(ser_d, ser_d.duplicated())

In [None]:
ser_d[~ser_d.duplicated()]

In [None]:
ser_d.drop_duplicates()

In [None]:
ser_d[~ser_d.duplicated(keep=False)]

In [None]:
ser_with_dups = Series(list('ABCDEFABGHIABCD'))

In [None]:
ser_with_dups.drop_duplicates()

In [None]:
ser_with_dups

In [None]:
ser_with_dups.drop_duplicates(inplace=True)

In [None]:
ser_with_dups

In [None]:
df_x.columns.tolist()

In [None]:
perx = df_x.Fare.describe(percentiles=[0.95])['95%']
# or: df_x.Fare.quantile(0.95)

In [None]:
df_x.Fare.clip(upper=perx).plot.hist(figsize=(3, 3))  # clip_upper() was removed in pandas 1.0

In [None]:
df_x = pd.read_csv('train.csv')
df_x.Fare.plot.hist(figsize=(3, 3))

---
<center>

# $DataFrame$

</center>
---

A DataFrame is a 2-D, table-like data structure that has both a row index and a column index, similar to the R data frame. Each column can have a different dtype.

It can be thought of as a dict of Series objects.
One of the most common ways of creating a DataFrame is from a dictionary of arrays or lists.

### 2.1 Creating a DataFrame
_Syntax_: `DataFrame(data=, index=, columns=)`

`data` could be a dict of equal-length lists or NumPy arrays.

***NOTE***: Each axis has an index (which is a self contained data structure), that is used to implement
+ Fast lookups
+ Data alignment and joins


In [None]:
pd.DataFrame?

### - Creating a DataFrame from a 2D Array

In [None]:
# 1D array
np.arange(20, 32)

In [None]:
# 1D array converted to 2D
np.arange(20, 32).reshape(3, 4)

In [None]:
# Creating a DF using all defaults
DataFrame(np.arange(20, 32).reshape(3, 4))

In [None]:
# Creating a DF using specific declarations
DataFrame(data = np.arange(20, 32).reshape(3, 4), 
          columns = list('WXYZ'), 
          index = list('ABC'))

In [None]:
my_df = DataFrame(np.random.random(16).reshape(4, 4), 
                  columns=['c1', 'c2', 'c3', 'c4'], 
                  index=['r1', 'r2', 'r3', 'r4']).round(2)
print my_df

In [None]:
# To rearrange columns pass the desired order as a list of colnames
my_df.loc[['r2', 'r4'], ['c4', 'c2', 'c3', 'c1']]

### - Creating a DataFrame from a dict of equal-length lists

In [None]:
my_dict = {'ints': np.arange(5),
           'floats': np.arange(0.1, 0.6, 0.1),
           'strings': list('abcde')}
my_dict

In [None]:
my_df2 = DataFrame(my_dict, 
                   index=list('vwxyz'))
my_df2

---
### Subsetting a column from a DataFrame using `[]` or `[[]]`


In [None]:
arr_1 = np.random.randn(56).reshape(7, 8).round(2)

In [None]:
df_1 = DataFrame(arr_1)

In [None]:
df_1

In [None]:
arr_1[:5, :5]

In [None]:
df_1.iloc[:5, :5].values

---
<big>

The difference between a single `[]` accessor and `[[]]` is that the latter always produces
a DataFrame, while the former always produces a Series.

---

In [None]:
my_df2.columns

In [None]:
# subset one column
print type(my_df2['floats'])
my_df2['floats']
     # my_df2.loc[:, 'floats']

In [None]:
# subset one column
print type(my_df2[['floats']])
my_df2[['floats']]

In [None]:
my_df2[['ints', 'floats']]
    # or, my_df2.loc[:, ['ints', 'floats']]

## Adding a column to a DataFrame

In [None]:
my_df2

In [None]:
print my_df2
print

# Add a new column to a DataFrame
my_df2['bools'] = [True, False, True, True, False]
print my_df2
print 

# Creating a derived column
my_df2['ints2'] = my_df2['ints'] * 7 
print my_df2

In [None]:
my_df2.loc[:, 'strings'] = my_df2['strings'].map(lambda x: x.upper())

In [None]:
print my_df2


### Drop Columns

In [None]:
my_df2.drop('ints2', axis=1)

In [None]:
my_df2

In [None]:
# For permanent deletion, use inplace=True (the call then returns None)
my_df2.drop('ints2', axis='columns', inplace=True)

In [None]:
my_df2

In [None]:
arr_1.shape
# without parentheses => attribute

In [None]:
arr_1.sum()
# with parentheses => method

---
## Pandas DataFrame Attributes

In [None]:
df_titanic = pd.read_csv('train.csv')

In [None]:
df_titanic.dtypes

In [None]:
df_titanic.shape

In [None]:
df_titanic.ndim

In [None]:
df_titanic.empty

In [None]:
my_df

In [None]:
# Transpose
my_df.T

In [None]:
# Subset using label lists
my_df.loc[['r1', 'r2'], ['c3', 'c4']]

In [None]:
# Subset using label slices
my_df.loc['r1':'r3', 'c1':'c2']

In [None]:
# Subset using integer lists
my_df.iloc[[0, 2], [2, 1]]

In [None]:
df_x.info()

In [None]:
df_x.dtypes.value_counts()  # get_dtype_counts() was removed in pandas 1.0

---
## Pandas DataFrame Methods

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

## Selected few DataFrame methods:

### - the `.to_csv` method

In [None]:
# explore write methods (in a notebook, type `DataFrame.to_` and press Tab)
[m for m in dir(DataFrame) if m.startswith('to_')]

In [None]:
[x for x in os.listdir(os.getcwd()) if '.csv' in x]

In [None]:
# Output to CSV
my_df2.to_csv('my_df2.csv')

In [None]:
[x for x in os.listdir(os.getcwd()) if '.csv' in x]

In [None]:
!head -2 my_df2.csv

### - The `drop` method 

- Remove one or more rows (default action)
- Remove one or more columns (pass axis=1) 

In [None]:
my_df2

In [None]:
# Drop a row
my_df2.drop('v')

In [None]:
# Drop a column
my_df2.drop('bools', axis=1, inplace=True)  # 'ints2' was already dropped above

In [None]:
my_df2

In [None]:
# Deleting a column with del (a throwaway 'const' column is added first)
my_df2['const'] = 1
del my_df2['const']; my_df2

### - the math methods

In [None]:
my_df

In [None]:
print my_df.sum()
my_df.sum(axis=1)

In [None]:
print my_df.mean()
my_df.mean(axis=1)

In [None]:
my_df.std(axis=1)

In [None]:
df_titanic.describe()

In [None]:
df_titanic.groupby(['Pclass', 'Sex'])['Survived'].quantile([0.5, 0.75, 0.95]).unstack().reset_index()

In [None]:
pd.pivot_table(data=df_titanic, index='Pclass', columns='Sex', values='Survived')

In [None]:
df_titanic.describe(include=[object])

In [None]:
df_titanic.loc[:, df_titanic.dtypes[(df_titanic.dtypes == object)].index.tolist()].describe()

---------------------------------------------------------------------------------------------------------------------

# `Revision Task`

In [None]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}

countries = DataFrame(data, index=list('pqrst'))

countries

In [None]:
countries.loc[countries['country'] == 'Germany', 'population']

In [None]:
countries.set_index('country').loc['Germany', 'population']

In [None]:
countries.loc[countries['population']>50, ['capital', 'area']]

In [None]:
countries.shape

In [None]:
# Check row names
countries.index

In [None]:
# Check column names
countries.columns

In [None]:
# To check the data types of the different columns:
countries.dtypes

In [None]:
type(countries)

In [None]:
# Overview of the data
countries.info()

In [None]:
countries.values

---

## Setting an arbitrary Index
If we don't like what the index looks like, we can replace it, for example by setting one of our columns as the index

-- the `set_index()` method

In [None]:
data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}

countries = DataFrame(data, index=list('pqrst'))

countries

In [None]:
countries.index = list('uvwxy')
# manually provided index

In [None]:
countries

In [None]:
countries.set_index('country', inplace=True)

In [None]:
countries

In [None]:
countries.loc['Belgium':'Netherlands']

In [None]:
countries.loc['Belgium', 'population']

In [None]:
countries.reset_index(inplace=True)

In [None]:
countries

In [None]:
countries.set_index(['country', 'capital'])

---------------------------------------------------------------------------------------------------------------------

### 2.4 Some DataFrame Methods

***--- `sort_values()`:*** used to sort the data

In [None]:
countries

In [None]:
countries.sort_values(by='population', ascending=False)

In [None]:
countries.sort_values(by=['population', 'area'], ascending=False)

**--- create a new column**

In [None]:
# Assuming population is in millions and area in km^2, this gives people per km^2
countries['pop_density'] = countries['population'] * 1e6 / countries['area']
countries

***--- `describe()`:*** computes summary statistics for each numerical column (the default)

In [None]:
countries.describe()

***--- `plot()`:*** used to quickly visualize the data in different ways

The available plotting types: ‘line’ (default), ‘bar’, ‘barh’, ‘hist’, ‘box’ , ‘kde’, ‘area’, ‘pie’, ‘scatter’, ‘hexbin’.

In [None]:
countries[['population', 'pop_density']].plot.scatter(x='population', 
                                                      y='pop_density', 
                                                      figsize=(3,3));

##### Barplot

In [None]:
countries.set_index('country').loc[:, 'population'].sort_values().plot.barh(figsize=(4, 3))
plt.savefig('countries_barplot.png', dpi=200)

In [None]:
[x for x in os.listdir(os.getcwd()) if 'jpg' in x or 'png' in x or 'jpeg' in x]

---------------------------------------------------------------------------------------------------------------------

### 2.5 Subsetting a DataFrame

***--- 1 Column:*** For a DataFrame, basic indexing selects the columns.

An individual column can be retrieved as a Series using `df['col']` or `df.col` 

This is especially helpful for creating boolean indexes.

In [None]:
my_df2

#### Subsetting one column

In [None]:
my_df2['floats']

In [None]:
my_df2[['floats']]

In [None]:
my_df2.floats
my_df2['floats']

In [None]:
all(countries.area == countries['area'])

In [None]:
# Compare the single column of the one-column DataFrame with the Series
my_df2[['floats']]['floats'].equals(my_df2['floats'])

***--- 2+ Columns:*** Multiple columns are retrieved as a DataFrame using a list of column names `df[['col1', 'col2']]`

In [None]:
type(my_df2[['ints', 'strings']])

In [None]:
countries[['area', 'population']]

---------------------------------------------------------------------------------------------------------------------

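***--- Rows:*** can be retrieved by position or by name using indexers such as ***.loc***, ***.iloc***, and ***.ix***.


In the pandas versions used in this notebook, ***.ix*** can subset the DataFrame using labels, positions, or ranges. It was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer ***.loc*** and ***.iloc*** in current releases.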


```
.loc and .iloc
.ix
.at and .iat
```

## Advanced Indexing - 1 [.loc and .iloc]

For more advanced indexing, you have some extra attributes:

***--- loc:*** selection by label

`Syntax: df.loc[[indices], [colnames]]`

`[indices]` could be specified as a list, a slice (start : end) or a boolean Series.


In [None]:
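# The label-based examples below need the country names as the index
countries.set_index('country', inplace=True)
countries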

In [None]:
# Using a row index (before comma) and a column name (after comma)
countries.loc[['Germany', 'United Kingdom'], 
              ['area', 'capital']]

In [None]:
# Using a row index slice and selecting all columns
countries.loc['Belgium':'Germany', :]

In [None]:
countries.loc['Belgium':'Germany', 'area':'population']

#### Subsetting with Booleans

In [None]:
(countries['population'] > 50)

In [None]:
countries.loc[countries['population'] > 50, :]

In [None]:
countries.population > 50
# Created using the dataframe
# The index-value links are preserved

In [None]:
Series([False, True, True, False, True])
# Arbit series
# Doesnt have the same index as the df

In [None]:
countries.loc[countries.population > 50, :]

In [None]:
countries.loc[Series([False, True, True, False, True]), :]
# Gives you an Indexing Error
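
One way around the indexing error, sketched under the assumption that the boolean values are already in row order, is to pass the raw array (`.values`) so that no index alignment is attempted. The small frame below is a stand-in for `countries`.

```python
import pandas as pd

df_pop = pd.DataFrame({'population': [11.3, 64.3, 81.3]},
                      index=['Belgium', 'France', 'Germany'])

# A boolean Series with the default integer index (0, 1, 2)
mask = pd.Series([False, True, True])

# .values drops the index, so the booleans are applied purely by position
selected = df_pop.loc[mask.values, :]

print(selected)
```

This only makes sense when the mask has the same length and order as the rows; a mask built from the DataFrame itself (as in the earlier cells) is the safer choice.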

### - The `reset_index()` method

In [None]:
countries

In [None]:
countries.reset_index(inplace=True)

In [None]:
countries

---

## Using Comprehensions to subset rows and columns.

In [None]:
['France' in x for x in countries.country]

In [None]:
countries.loc[['France' in x for x in countries.country], :]

In [None]:
[(('area' in x) or ('capital' in x)) for x in countries.columns]

In [None]:
countries.loc[['France' in x for x in countries.set_index('country').index],
             [(('area' in x) or ('capital' in x)) for x in countries.columns]]

In [None]:
# Using a boolean for rows and a list of column names
countries.loc[countries['population'] > 5, ['capital', 'area']]

## ***--- iloc:*** selection by position.

Selecting by position with `iloc` works much like indexing NumPy arrays.

In [None]:
countries

In [None]:
# Using slices for both rows and columns
countries.iloc[1:4, 1:3]

> Note: The different indexing methods can also be used to assign data.
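
A brief sketch of assignment through the indexers, on a made-up frame `df_assign`:

```python
import pandas as pd

df_assign = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=list('xyz'))

# Assign one cell by label
df_assign.loc['y', 'a'] = 20

# Assign one cell by position (first row, second column)
df_assign.iloc[0, 1] = 40

# Assign to every row matching a condition
df_assign.loc[df_assign['a'] > 2, 'b'] = 0

print(df_assign)
```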

---------------------------------------------------------------------------------------------------------------------

## Advanced Indexing - 2 [using the ***.ix*** method]



`Syntax: df.ix[<specify-rows>, <specify-cols>]`

- Here, `specify-cols` can be a single column name, a list of names, or a slice of names
    - Additionally, we can specify integer ranges (slices).

- Similarly, `specify-rows` can be given as index labels (if you want to subset rows by name)
    - or as integer slices (if you want to subset by position)
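
Where `.ix` is not available, mixing positions on one axis with labels on the other takes one extra step: translate one side so that both use the same kind of indexer. A sketch on a hypothetical frame `df_mix`:

```python
import pandas as pd

df_mix = pd.DataFrame({'ints': [0, 1, 2], 'strings': list('abc')},
                      index=list('xyz'), columns=['ints', 'strings'])

# Rows by position, columns by label: translate positions to labels first
first_two = df_mix.loc[df_mix.index[0:2], ['strings']]

# Rows by label, columns by position: translate the label to a position
row_pos = df_mix.index.get_loc('y')
y_first_col = df_mix.iloc[row_pos, 0]

print(first_two)
print(y_first_col)
```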

In [None]:
my_df2

In [None]:
# Columns

# select a column by name
my_df2.ix[:, 'strings']

In [None]:
# select multiple columns by name
my_df2.ix[:, ['strings', 'floats']] 

In [None]:
# select columns by position
print my_df2.ix[:, 0:2] 

In [None]:
# Subsetting Rows using integers and ix
print my_df2.ix[0, :] # first row
print my_df2.ix[2, :] # third row

In [None]:
print my_df2.ix[0:2] # by position: returns the first 2 rows

In [None]:
# Subsetting rows using labels and .ix

print my_df2.ix['x':'z'] # by index: returns the last three rows
# We can slice the DataFrame using an index object

In [None]:
# Use the :: for stepping through rows
my_df2.ix[::-1, ['strings', 'ints']]

In [None]:
my_df2

In [None]:
my_df2.ix[0:5:2]

---

NOTE:

> In the pandas versions used here, the column returned when indexing a DataFrame can be a view on the underlying
data, not a copy. Thus, in-place modifications to the Series
may be reflected in the DataFrame (with Copy-on-Write enabled they are not). The column can be explicitly copied
using the Series’s copy method.
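
A minimal sketch of the explicit copy, which behaves the same way regardless of pandas version. The frame `df_caps` is made up for this example.

```python
import pandas as pd

df_caps = pd.DataFrame({'capital': ['Brussels', 'Paris', 'Berlin']})

# Take an independent copy of the column before modifying it
cap_copy = df_caps['capital'].copy()
cap_copy.iloc[1] = 'Lyon'

print(cap_copy.tolist())
print(df_caps['capital'].tolist())   # the original frame is untouched
```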

In [None]:
cap = countries['capital']

In [None]:
cap[1] = 'Paris'

In [None]:
countries

---------------------------------------------------------------------------------------------------------------------

### 2.6 Reindexing
Reindexing creates a new object with the data _conformed_ to a new index.

If we reindex a Series or DataFrame with a new set of labels, the data is _rearranged_ to follow this new index, and missing values are introduced wherever the data was not present.

#### --- Reindexing a Series

In [None]:
countries

In [None]:
countries.set_index('capital', inplace=True)

In [None]:
countries

In [None]:
capital2 = ['Paris', 'Brussels', 'Munich', 'Amsterdam', 'Manchester', 'Madrid']

countries.reindex(capital2).fillna('Info NA')  # .loc raises a KeyError for missing labels in pandas 1.0+

In [None]:
countries.reindex(capital2)

## --- Reindexing a DataFrame

`Syntax: df.reindex(index=, columns=, fill_value=, method=)`

In [None]:
frame = DataFrame(np.random.randn(9).reshape(3, 3), index=list('acd'), columns=list('zwx')); frame

In [None]:
frame.reindex(list('abcde'))

In [None]:
frame

In [None]:
frame.reindex(index=list('abcde'), 
              columns=list('wxyz'))

In [None]:
# ffill will copy values from the previous row where possible, and if an entire column is NaN, fill_value will populate it.
(frame
 .reindex(index=list('abcde'), columns=list('wxyz'),
          method='ffill',
          fill_value=0.1)
)

> NOTE: `reindex` is quite helpful for aligning DataFrames before merging/appending tasks
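
As a sketch of that use, the example below conforms one frame to another's columns before stacking them. The frames `df_a` and `df_b` are hypothetical.

```python
import pandas as pd

df_a = pd.DataFrame({'w': [1, 2], 'x': [3, 4]})
df_b = pd.DataFrame({'x': [5], 'y': [6]})

# Conform df_b to df_a's columns: 'y' is dropped, 'w' is added as NaN
df_b_aligned = df_b.reindex(columns=df_a.columns)

# Both frames now share the same columns and can be stacked row-wise
stacked = pd.concat([df_a, df_b_aligned], ignore_index=True)

print(stacked)
```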

--------------------------------------------------------------------------------------------------------------------

## Binary Operations involving DataFrames: _Data Alignment by Index_

In [None]:
df1 = DataFrame(data=np.arange(1, 13).reshape(4,3), index=list('abcd'), columns=list('pqr'))
df2 = DataFrame(data=np.arange(1, 13).reshape(3,4), index=list('abc'), columns=list('pqrs')) 
print('df1', '\n', df1, '\n')
print('df2', '\n', df2, '\n')
print('df1 + df2', '\n', df1 + df2)

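***Note***: Alignment happens on both axes. No information is discarded by default: labels present in only one of the two DataFrames are kept, and the corresponding cells are filled with `NaN`.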

In [None]:
# We could choose to get rid of or impute missing rows/columns
(df1 + df2).fillna(0)
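
An alternative to filling after the fact is to fill during the operation with `DataFrame.add(..., fill_value=)`. The sketch below rebuilds the same `df1` and `df2` so that it runs on its own; `plain_sum` and `filled_sum` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1, 13).reshape(4, 3), index=list('abcd'), columns=list('pqr'))
df2 = pd.DataFrame(np.arange(1, 13).reshape(3, 4), index=list('abc'), columns=list('pqrs'))

# The + operator: any cell missing from either operand becomes NaN
plain_sum = df1 + df2

# add with fill_value: a cell missing from one operand is treated as 0,
# but a cell missing from BOTH operands (row 'd', column 's') stays NaN
filled_sum = df1.add(df2, fill_value=0)
```

The difference from `(df1 + df2).fillna(0)` is that `fill_value` keeps the value from the operand that does have data, instead of replacing the whole cell with 0.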

---

<big>

## Task 1

- Import the data from '`train.csv`' into Pandas
    - Report the shape of the table and the data type of each column.

- Subset the data to retain only the passengers that survived. Name this object 'survivors'
    - Within 'survivors', find the summary statistics of the 'Age' and 'Fare' variables.

- Subset the data to retain only Female passengers.
    - Find out how many females over the age of 30 survived.

- Are there any missing values in the Age column?    
    - Draw a histogram of this column.
    - Replace missing values with a) Mean b) Median
    - Report the new summary statistics on this variable
---    

In [None]:
df_titanic = pd.read_csv('train.csv')  # load the Titanic training data
df_titanic.loc[df_titanic['Survived'] == 1, ['Age', 'Fare']].describe()

In [None]:
(df_titanic.loc[((df_titanic['Sex'] == 'female') & (df_titanic['Survived'] == 1)) , 'Age'] > 30).value_counts()

In [None]:
df_titanic['Age'].isnull().value_counts()

In [None]:
df_titanic['Age'].isnull().sum()

In [None]:
df_titanic['Age'].plot.hist(figsize=(3, 3));

In [None]:
# Replace missing values with the mean
age_mean = df_titanic['Age'].mean()
df_titanic['Age'].fillna(age_mean).plot.hist(figsize=(3, 3));

In [None]:
df_titanic['Age'].fillna(df_titanic['Age'].median()).plot.hist(figsize=(3, 3));

---
### 2.7 Deleting Rows or Columns

This can be done using the `drop` method and specifying row indices or column names.

To drop **rows**

`Syntax: df.drop(indices-as-a-list)`

To drop **columns**

`Syntax: df.drop(colnames-as-a-list, axis=1)`


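Parameters

- `axis`: `0` (the default) drops rows, `1` drops columns
- `inplace`: if `True`, the DataFrame is modified in place; otherwise a new DataFrame is returned and the original is left unchanged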

In [None]:
df1

In [None]:
# Delete the row with index 'b'
df1.drop('b', inplace=True); df1

In [None]:
# Delete the column with name 'q'
df1.drop('q', axis=1, inplace=True); df1

In [None]:
df2

In [None]:
# Remove multiple rows
df2.drop(['a', 'b'], inplace=True); df2
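
The cells above all use `inplace=True`. As a complementary sketch, `drop` can also return a new object and accept the `index=` and `columns=` keywords in a single call. The `demo` table below is illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3), index=list('abcd'), columns=list('pqr'))

# Without inplace=True, drop returns a new DataFrame and leaves `demo` untouched
trimmed = demo.drop(index=['b', 'd'], columns=['q'])

# errors='ignore' skips labels that are not present instead of raising KeyError
same = demo.drop(index=['zzz'], errors='ignore')
```

Returning a new object makes it easier to chain further methods and to re-run a cell without losing rows a second time.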

---
## Using Boolean Indexing to conditionally replace data

In [None]:
print(df1, '\n', df2)

In [None]:
df3 = df1 + df2

In [None]:
df3

In [None]:
df3.isnull()

In [None]:
df3[df3.isnull()] = -99

In [None]:
df3

In [None]:
df3[df3>8] = 99; df3
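
The boolean assignments above modify `df3` in place. A non-mutating variant uses `mask` and `where`, sketched here on a small illustrative table (`demo`, `capped` and `flagged` are made-up names).

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'x': [1.0, np.nan, 12.0], 'y': [9.0, 3.0, np.nan]})

# mask replaces values where the condition is True and returns a new DataFrame
capped = demo.mask(demo > 8, 99)

# where is the mirror image: it keeps values where the condition is True
flagged = demo.where(demo.notnull(), -99)
```

Comparisons against `NaN` evaluate to `False`, so `capped` still contains the two missing cells, while `demo` itself is unchanged.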

In [None]:
d11 = DataFrame(np.random.randn(25).reshape(5, 5),
                index=list('abcde'),
                columns=list('vwxyz')).round(2); print(d11)

In [None]:
d11.quantile(np.arange(0.25, 1, 0.25))

In [None]:
d11.describe().round(2)

---------------------------------------------------------------------------------------------------------------------

## 2.8 Apply functions to each element/rows or columns of a DataFrame

Using 

- **`s.map()`**, apply a func to each element of a Series
- **`df.applymap()`** apply a func to each element of a DF
- **`df.apply()`** apply a func to rows/columns of a DF

> Lambda functions are also known as ANONYMOUS functions because they typically do not have a name.

- They are used extensively in Python, and even more with the `map`, `applymap` and `apply` functions

In [None]:
DataFrame(arr_1)

In [None]:
# Label each value 'neg' if it is below 0, otherwise 'pos' (so 0 is labelled 'pos')
DataFrame(arr_1).applymap(lambda x: 'neg' if x < 0 else 'pos')

In [None]:
# Named Function
def addTen(x):
    return x + 10

In [None]:
addTen(1)
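
A named function and a lambda are interchangeable wherever a function is expected. The sketch below passes both to `Series.map`; `add_ten` is a snake_case restatement of `addTen` above, and the Series `s` is illustrative.

```python
import pandas as pd

def add_ten(x):
    """Named function: add 10 to a single value."""
    return x + 10

s = pd.Series([1, 2, 3])

by_name = s.map(add_ten)                # pass the function object itself (no parentheses)
by_lambda = s.map(lambda x: x + 10)     # the same logic written inline, without a name
```

A lambda is convenient for one-off logic, whereas a named function is easier to test and reuse.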

---

In [None]:
# Create a dataframe to work with
df8 = DataFrame(np.random.randn(25).reshape(5, 5), 
                index=list('abcde'), 
                columns=list('vwxyz')).round(2); df8

## 2.8.1 Apply a function to each element

--- Calling ***NumPy ufuncs*** on DataFrame objects will apply the function to each element

> The ufuncs used here (such as `np.abs`) are unary: they take a single array-like argument and operate on each of its elements

In [None]:
np.abs(df8.values)

In [None]:
np.abs(df8)

In [None]:
df8.applymap(np.abs)

In [None]:
%%timeit
np.abs(df8)
## This is an example of a VECTORIZED OPERATION

In [None]:
%%timeit
df8.applymap(np.abs)

> It is possible to ***apply a user-defined function (udf) to each element*** in a Series (using **`map`**) or a DataFrame (using **`applymap`**)

In [None]:
# Write a function that formats a number to 2 decimal places
format8 = lambda x: '%.2f' % x

# SAME AS
# def format8(x):
#     return '%.2f' % x

format8(3.1341)

In [None]:
df8.applymap(format8)

---
## 2.8.2 Apply a udf to each row/column

Calling the `.apply(func, axis=)` method on a DataFrame does this: with `axis=0` (the default) `func` receives each column, and with `axis=1` it receives each row

In [None]:
df8.mean()

In [None]:
%timeit df8.mean()

In [None]:
df8.apply(np.mean, axis=0)

In [None]:
%timeit df8.apply(np.mean, axis=0)

In [None]:
df8

In [None]:
df8.max(axis=1)
# apply is implied

In [None]:
df8.apply(max, axis=1)
# apply is explicitly mentioned

In [None]:
df8.apply(lambda x: (x.max() - x.min()))

---

### Using different functions with `apply()`

---
<big>

When you write a lambda function to work with apply on a dataframe, remember

1. The input to the function will be a SERIES.
2. The output can be a number or a series.

---

In [None]:
# Use a general function that returns multiple values
def func8(x):
    return Series([x.min(), x.mean(), x.max()], 
                  index=['MIN.', 'MEAN.', 'MAX.'])

In [None]:
s_1 = Series(np.random.randn(10))

In [None]:
func8(s_1)

In [None]:
df8

In [None]:
df8.apply(func8)

In [None]:
df8.apply(func8, axis=1)
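
How the shape of the result depends on what the function returns can be sketched as follows. The `demo` table and the result names are illustrative, and the labels `MIN.` and `MAX.` simply mirror `func8` above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list('xyz'))

# Scalar per column -> a Series indexed by column name
spread = demo.apply(lambda col: col.max() - col.min())

# Series per column -> a DataFrame with one column per input column
summary = demo.apply(lambda col: pd.Series([col.min(), col.max()], index=['MIN.', 'MAX.']))

# axis=1 passes each row instead; the result then has one row per input row
row_summary = demo.apply(lambda row: pd.Series([row.min(), row.max()], index=['MIN.', 'MAX.']),
                         axis=1)
```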

---

## Big Task 

---

Create a function called DESCRIBE_X that takes as input a Series, 
and produces a series containing

- `min, max, sum, mean, std, missings, nonmissings, skew, kurtosis`
- `percentiles - as specified by the user`

---

`DESCRIBE_X(s_1, percentiles=[0.1, 0.3, 0.92, 0.99])`

---

Output

- min
- max
- sum
- mean
- std
- 0.1
- 0.3
- 0.92
- 0.99

---

Then apply this function to the rows and to the columns of `arr_1` as a DataFrame.

---



## Task 2

- Define a function called 'Standardizing' which works on a series as follows:
    - Find the mean and standard deviation of the series
    - Subtract the mean from each value of the series
    - Divide the result by the standard deviation
    - Apply this function to each numerical column (int or float) of the Titanic Dataset
    
---

In [None]:
# Solution to the Big Task (DESCRIBE_X)

In [None]:
%load task_describe_X.py

In [None]:
DESCRIBE_X(s=Series(np.random.random(100)), perc=[0.5])

In [None]:
arr_1 = DataFrame(np.random.randn(5000).reshape(1000, 5), columns=['Col_' + str(i) for i in range(5)])
arr_1.apply(lambda c: DESCRIBE_X(s=c, perc=[0.1, 0.2, 0.8, 0.9]))
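
The contents of `task_describe_X.py` are not shown in this notebook, so the following is only one possible sketch of `DESCRIBE_X`, not the original solution. The `s` and `perc` keywords match the calls above; the default value of `perc` and the output labels are assumptions.

```python
import numpy as np
import pandas as pd

def DESCRIBE_X(s, perc=(0.25, 0.5, 0.75)):
    """Return summary statistics plus the requested quantiles for a Series.

    Hypothetical implementation: the default `perc` and the labels are assumptions.
    """
    stats = pd.Series({
        'min': s.min(),
        'max': s.max(),
        'sum': s.sum(),
        'mean': s.mean(),
        'std': s.std(),
        'missings': s.isnull().sum(),
        'nonmissings': s.notnull().sum(),
        'skew': s.skew(),
        'kurtosis': s.kurt(),
    })
    # Series.quantile accepts a list and returns a Series indexed by the quantiles
    quantiles = s.quantile(list(perc))
    return pd.concat([stats, quantiles])

result = DESCRIBE_X(s=pd.Series([1.0, 2.0, 3.0, 4.0, np.nan]), perc=[0.5])
```

Because the function returns a Series, passing it to `DataFrame.apply` yields one column of statistics per input column.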

In [None]:
# Solution to task 2

In [None]:
df_4 = DataFrame(np.random.randn(1000).reshape(200, 5), 
                 columns=['Col_' + str(x) for x in range(5)]); df_4.head()

In [None]:
df_4.mean()

In [None]:
df_4.std()

In [None]:
df_4_NORM = df_4.apply(lambda s: (s - s.mean())/s.std())

In [None]:
df_4_NORM.mean()

In [None]:
df_4_NORM.std()
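
The task asks for numeric columns only, whereas `df_4` above is entirely numeric. A minimal sketch of selecting the numeric columns first, using a small made-up table in place of the Titanic data (the `demo` values and the `standardize` name are illustrative):

```python
import pandas as pd

def standardize(s):
    """Subtract the mean and divide by the (sample) standard deviation."""
    return (s - s.mean()) / s.std()

# Illustrative data with one non-numeric column
demo = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                     'age': [20, 30, 40, 50],
                     'fare': [7.5, 8.0, 71.0, 53.5]})

# select_dtypes keeps only int/float columns, so 'name' is skipped
numeric_cols = demo.select_dtypes(include='number').columns
standardized = demo[numeric_cols].apply(standardize)
```

After standardizing, each column has a mean of (approximately) 0 and a standard deviation of 1, which is what the `df_4_NORM` checks above confirm.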

## Revision Task

Create a 500 x 10 matrix/DataFrame filled with random numbers.
Name the columns as Col_1 ... Col_10

Declare a function that operates on a Series (hence row or column) and returns the square root of the sum of squares of the min and max numbers in each

1. Row
2. Column


---

## 2.9 Sorting 

**2.9.1 Sorting a Series**

To sort a series on its index, use 
- `my_series.sort_index()`

To sort a series on its values, use 
- `my_series.sort_values(ascending=<bool>)`

In [None]:
# Create a Series with explicit index 
s9 = Series(np.random.randint(0, 50, 5), 
            index=list('dcbae')); s9

In [None]:
s9.index

In [None]:
s9.values

In [None]:
# Sorting on the index
s9.sort_index(ascending=False)

In [None]:
# Sorting on the values 
s9.sort_values(ascending=False, inplace=True)

In [None]:
s9

---
## **2.9.2 Sorting a DataFrame**

Here we use the `sort_values()` method, specifying the columns to sort on with `by=` and the sort order with `ascending=`

In [None]:
d9 = DataFrame(np.random.randint(0, 100, 30).reshape(10, 3), 
               index=list('abcdefghij'), 
               columns=list('prq')); d9

In [None]:
d9.sort_values(by='p', ascending=False, inplace=True); d9

In [None]:
d9.sort_values(by=['p', 'q'], ascending=False, inplace=False)

In [None]:
# d9.sort_index? 
# Sort a DataFrame by its labels (along either axis); use sort_values to sort by the values in a column

d9.sort_index(axis=0)

In [None]:
df_titanic.sort_values(by=['Age', 'Pclass'], ascending=False)[['Name','Age', 'Pclass']][5:15]

**Reordering rows or columns**

In [None]:
# without arguments, it will sort the index of the dataframe
d9.sort_index()

In [None]:
# To sort column names
d9.sort_index(axis=1)

**Sorting on values**

In [None]:
# Sort the data by the values of a column
d9.sort_values(by='q', ascending=False)

In [None]:
# Sort the data by the values of 2 columns
d9.sort_values(by=['p', 'r'], ascending=False)
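
When sorting on several columns, `ascending` can also be a list with one flag per column. A small sketch with illustrative data:

```python
import pandas as pd

demo = pd.DataFrame({'p': [2, 1, 2, 1], 'r': [5, 7, 3, 9]}, index=list('abcd'))

# Sort by p descending, then break ties by r ascending
ordered = demo.sort_values(by=['p', 'r'], ascending=[False, True])
```

Rows `c` and `a` (both with `p == 2`) come first, ordered by their `r` values, followed by `b` and `d`.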

---
## 2.10 Ranking

This can be done by calling the `.rank()` method on a Series

`Syntax: obj.rank(axis=, ascending=, method=)`

Here, `method` refers to the method used to break ties if different elements have the same values.

In [None]:
s9 = Series(np.random.randint(0, 100, 5))

In [None]:
s9

In [None]:
s9.rank()
# .astype(int) - 1

In [None]:
# Create a series
s1 = Series(np.random.randint(80, 100, 15))

In [None]:
s1

In [None]:
s1.rank(ascending=False)

In [None]:
# Create a dataframe with series, ranks as the 2 columns
DataFrame({'a': s1, 'b': s1.rank(ascending=False)})
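
The effect of the `method` argument is easiest to see on a Series with a tie. The sketch below uses fixed illustrative values rather than random ones so that the ranks are reproducible.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    'value': s,
    'average': s.rank(),                # default: tied values share the mean of their ranks
    'min': s.rank(method='min'),        # tied values all get the lowest rank in the tie
    'first': s.rank(method='first'),    # ties broken by order of appearance
    'dense': s.rank(method='dense'),    # like 'min', but ranks increase by 1 between groups
})
```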

---------------------------------------------------------------------------------------------------------------------

## 2.11 Descriptive Statistics

**Categorical Data**

- Calling `describe()` on a categorical (string) Series shows the count, the number of unique values, the mode (`top`) and the mode's frequency (`freq`)

In [None]:
print(Series(list('abcxyzbbc')))

Series(list('abcxyzbbc')).describe()

In [None]:
Series(list('abc')*4).describe()

In [None]:
df_titanic['Embarked'].describe()

---
### **Numeric Data**

Pandas objects have a set of common math/stat methods that extract

--- a single value from a Series

--- a Series from a DataFrame (along a specified axis)

Methods include:

`count, sum, mean, median, min/max, idxmin/idxmax, skew, kurt, cumsum, cumprod, pct_change` and more.

> Note that these methods are applied along the axis you specify (each column by default, each row with `axis=1`), and the results are collated

In [None]:
d11 = DataFrame(np.random.randint(0, 100, 25).reshape(5, 5), 
                index=list('abcde'), 
                columns=list('vwxyz')); print(d11)

In [None]:
# Getting colsums is as simple as calling the .sum() method of a DataFrame
print(d11.sum())
print(d11.mean())

In [None]:
# Working on rows would require you to pass axis=1 to the .sum() method
d11.sum(axis=1)

> Note that missing values are ignored by default. Pass `skipna=False` to disable this.

In [None]:
# Find the min/max for each column/row
d11.min(axis=1)

---
## Task 2

- Which is faster: calling `.sum()` directly, or using `np.sum` inside an `apply` call?

---

In [None]:
%timeit d11.apply(lambda x: np.sum(x))

In [None]:
%timeit d11.sum()

### Locate the min and max

In [None]:
# Where does this value occur?
d11['v'].idxmin()

In [None]:
d11.loc['d'].idxmax()

### `describe()` works on all numeric (int or float) columns in a DataFrame

In [None]:
d11.describe()

In [None]:
titanic.dtypes

In [None]:
titanic.describe()

### Percentiles

In [None]:
# Calculate quantiles for each column 
titanic.quantile(np.arange(0.80, 1, 0.05))

---
## Task 3

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two tables instead
pd.concat([titanic.describe(),
           titanic.quantile(np.arange(0.80, 1, 0.05))])

---
# Two Numerical Variables: Correlation

DataFrames have `.corr()` and `.cov()` methods that return a full correlation/covariance matrix as a DataFrame

In [None]:
df_1 = DataFrame({'a': np.random.randn(1000), 'b': np.random.randn(1000)})

In [None]:
import seaborn as sns
df_1.plot.scatter(x='a', y='b', figsize=(3, 3));

In [None]:
df_1.corr()

In [None]:
print("Correlation Coefficient b/w Var A and Var B is ... ", df_1.corr().loc['a', 'b'])

# produce a correlation matrix, filled with correlation coefficients
# between any two variables, the corr coeff measures the strength of the relationship

# coeff close to zero means NO LINEAR RELATIONSHIP
# coeff close to +1 means STRONG POSITIVE LINEAR RELATIONSHIP (one rises as the other rises)
# coeff close to -1 means STRONG NEGATIVE LINEAR RELATIONSHIP (one falls as the other rises)

In [None]:
from IPython.display import Image
Image("https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Correlation_examples2.svg/2000px-Correlation_examples2.svg.png")

In [None]:
df_2 = df_1.assign(c = lambda d: -4 * d['a'] + 5)

In [None]:
df_2.plot.scatter(x='a', y='c', figsize=(4, 4));

In [None]:
df_2.corr()
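
To make clear what `.corr()` computes by default, here is a sketch that derives the Pearson coefficient by hand and compares it with the pandas result. It uses a seeded random column and the same `c = -4 * a + 5` relationship as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)          # fixed seed so the sketch is reproducible
a = rng.randn(1000)
demo = pd.DataFrame({'a': a, 'c': -4 * a + 5})

# Pearson r = cov(a, c) / (std(a) * std(c)), using the same n-1 denominator throughout
dev_a = demo['a'] - demo['a'].mean()
dev_c = demo['c'] - demo['c'].mean()
cov_ac = (dev_a * dev_c).sum() / (len(demo) - 1)
r_manual = cov_ac / (demo['a'].std() * demo['c'].std())

r_pandas = demo.corr().loc['a', 'c']
```

Because `c` is an exact linear function of `a` with a negative slope, both values come out at -1 up to floating-point error. The size of the slope (4) does not affect the coefficient, only its sign does.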

## Uncorrelated Variables

In [None]:
(DataFrame(np.random.randint(0, 1000, 5000).reshape(2500, 2), 
          columns=['X', 'Y'])
.plot.scatter(x='X', y='Y', figsize=(3, 3)));

In [None]:
(DataFrame(np.random.randint(0, 1000, 5000).reshape(2500, 2), 
          columns=['X', 'Y'])).corr()

## Positively Correlated

In [None]:
X = np.random.randint(0, 1000, 2500)
Y = 4 * X + 25 + 1000 * np.sin(X) + np.random.randint(0, 200, 2500)

In [None]:
DataFrame({'X': X, 'Y': Y}).plot.scatter(x='X', y='Y', figsize=(3, 3));

In [None]:
DataFrame({'X': X, 'Y': Y}).corr()

In [None]:
d11

In [None]:
df_titanic = pd.read_csv('train.csv')

In [None]:
df_titanic.describe().columns.tolist()[-4:]

In [None]:
sns.heatmap(df_titanic[df_titanic.describe().columns.tolist()[-4:]].corr().round(2))

In [None]:
# The Correlation Matrix
sns.heatmap(d11.corr().round(2))

In [None]:
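# Keep only the numeric columns; recent pandas raises an error if string columns reach corr()
df_titanic.select_dtypes(include='number').corr()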

In [None]:
sns.heatmap(titanic[['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].fillna(0).corr())

---
<big>  Task
    
- Find the correlation matrix for the Titanic data

- Report the correlation (and your interpretation) between 
    - Age and Fare 

- Print only the correlation values that are above 0.7 (or below -0.7)

---

---------------------------------------------------------------------------------------------------------------------

## 2.12 More Methods for Pandas objects

Now we look at methods for unique values (`unique`), frequency tables (`value_counts`), membership (`isin`)

In [None]:
# Getting distinct values in a Series
s12 = Series(list('the quick brown fox jumped over the lazy dog'))

set(s12) 
#  or
s12.unique()

In [None]:
s12.nunique()

In [None]:
titanic.dtypes

In [None]:
titanic.Survived.value_counts()

### the `isin` method

In [None]:
# isin returns a boolean Series that can be used to filter the original Series
s12.isin(list('pqrst'))[:10]

In [None]:
print(titanic.Sex.head())

titanic.Sex.head().isin(['male'])

In [None]:
colours = Series(['red', 'blue', 'white', 'green', 'black', 'white', None])

print(colours.isin(['white', 'blue', None]))
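
The boolean result of `isin` is mostly useful as a filter. A short sketch with an illustrative Series (the names `is_pale`, `selected` and `rejected` are made up):

```python
import pandas as pd

colours = pd.Series(['red', 'blue', 'white', 'green', 'black', 'white'])
is_pale = colours.isin(['white', 'blue'])

selected = colours[is_pale]       # rows whose value is in the list
rejected = colours[~is_pale]      # ~ negates the mask, giving the complement
```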

---------------------------------------------------------------------------------------------------------------------

## Difference between `None` and `np.nan`

In [None]:
type(None)

In [None]:
type(np.nan)
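
Beyond their types, the two values also behave differently under comparison, which is why `isnull` is the reliable way to detect them. A small sketch (variable names are illustrative):

```python
import numpy as np
import pandas as pd

print(type(None).__name__)      # NoneType
print(type(np.nan).__name__)    # float

# NaN is never equal to anything, including itself; None equals None
nan_equals_itself = (np.nan == np.nan)
none_equals_itself = (None == None)

# pandas treats both as missing
both_missing = pd.isnull(pd.Series([None, np.nan])).all()

# In a numeric Series, None is converted to NaN and the dtype becomes float
numeric = pd.Series([1, None])
```

A practical consequence is that `s == np.nan` never finds missing values, whereas `s.isnull()` does.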

---
## 2.13 Handling Missing Data

Pandas treats the numpy `NaN` and the Python `None` as missing values.

--- These can be **detected** in a Series or DataFrame using **`obj.notnull()`** and **`obj.isnull()`**, which return a boolean object of the same shape.

--- **To filter out missing data** from a Series, or to remove rows (default action) or columns with missing data in a DataFrame, we use **`obj.dropna()`**

--- Missing Value **imputation** is done using the **`obj.fillna()`** method.

In [None]:
# Create a string Series and set some values to missing
s12 = Series(['abc', 'pqr', np.nan, 'xyz', np.nan, 'ijk', None])

s12

In [None]:
# Detect missing values
s12.isnull()

In [None]:
# Replace missing values with 0
s12.fillna(0)

In [None]:
s12.fillna('missing')

In [None]:
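# Forward fill: carry the last valid value forward (fillna(method=...) is deprecated in recent pandas)
s12.ffill()

In [None]:
# Backward fill: use the next valid value
s12.bfill()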

In [None]:
# Create a numeric Series and set a few values to missing
s13 = Series(np.random.randint(0, 50, 16), index=list('abcdefghabcdefgh'))
s13[::2] = np.nan
s13

In [None]:
# Fill with median
s13.fillna(s13.median())

# We could use 0, or .mean() or some arbitrary method

In [None]:
s13.dropna()

---
<big>


### How to fill missing values?

- It depends on the type of your variable.

For numeric variables there are two common choices: use the median if the data is skewed, otherwise use the mean.

For categorical variables, use the mode.
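
These rules of thumb can be sketched on a small made-up table; the `demo` values are illustrative and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'fare': [7.0, 8.0, np.nan, 9.0, 500.0],
                     'port': ['S', 'C', 'S', None, 'S']})

# Skewed numeric column: the median is far less affected by the 500.0 outlier than the mean
fare_filled = demo['fare'].fillna(demo['fare'].median())

# Categorical column: mode() returns a Series (ties are possible), so take the first entry
port_filled = demo['port'].fillna(demo['port'].mode()[0])
```

Here the median of the observed fares is 8.5 while the mean is 131.0, so filling with the mean would place the imputed passenger far from every typical fare.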

----

#### Task

- Check which of the Titanic Data Variables have Missings.
- Find the means
- Impute missings with means.
- Find the means again.
- Have they shifted?


---------------------------------------------------------------------------------------------------------------------

## 3. Implementing Split-Apply-Combine: The _groupby_ method
+ You may group along the rows or columns.
+ `groupby` returns a GroupBy object that stores information on how to split the data
+ On this object we then run aggregations (which reduce the size of the data), transformations (which leave the size unchanged), or a general `apply`
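
The three kinds of operation listed above can be sketched on a tiny illustrative table (the `demo` values are made up; the column names mirror the example that follows):

```python
import pandas as pd

demo = pd.DataFrame({'string': list('aabbb'),
                     'floats': [1.0, 3.0, 2.0, 4.0, 6.0]})
grouped = demo.groupby('string')['floats']

aggregated = grouped.mean()                             # aggregation: one value per group
transformed = grouped.transform('mean')                 # transformation: one value per original row
applied = grouped.apply(lambda g: g.max() - g.min())    # apply: any function of each group
```

`aggregated` and `applied` have one entry per group (2), while `transformed` has one entry per row (5) and can be assigned straight back as a new column.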

In [None]:
from IPython.display import Image
Image("http://i.imgur.com/yjNkiwL.png")

In [None]:
df = DataFrame({'floats': np.random.randn(20), 
                'string': list('a' * 4 + 'b' * 6 + 'c' * 3 + 'd' * 7)})

In [None]:
df[:5]

In [None]:
df['string'].value_counts()

In [None]:
print(df.groupby('string').mean())

In [None]:
print(df.groupby('string').min())

In [None]:
df_titanic = pd.read_csv('train.csv')

print(df_titanic.groupby('Embarked')['Fare'].max())

In [None]:
print(df_titanic.groupby('Embarked').apply(lambda g: g[['Fare', 'Age']].max()))


## Creating a GroupBy Object

In [None]:
grouped = df.groupby('string')

In [None]:
type(grouped)

In [None]:
len(grouped)

In [None]:
# Iterating over a GroupBy object yields (group name, sub-DataFrame) tuples
for tab in grouped:
    print(type(tab))
    print(len(tab))
    print(tab[1][:3])
    print()

In [None]:
Series(map(lambda tup: {tup[0]: tup[1]['floats'].mean()}, grouped))

In [None]:
print(grouped.mean())

## Explore GroupbyObject Methods

In [None]:
# Specialized functions on grouped object directly
print(grouped.sum())
print(grouped.mean())
print(grouped.median())

# try min, max, idxmin, idxmax and so on

---
### An example using the Titanic Data

In [None]:
df_titanic.Embarked.value_counts()

In [36]:
(df_titanic
 .groupby('Embarked')
 .apply(lambda x: (x.set_index('Name').loc[:, 'Fare'].idxmax(),
                  x.set_index('Name').loc[:, 'Fare'].max())))

Embarked
C               (Ward, Miss. Anna, 512.3292)
Q        (Minahan, Dr. William Edward, 90.0)
S    (Fortune, Mr. Charles Alexander, 263.0)
dtype: object

In [37]:
(df_titanic
 .groupby('Sex')
 .apply(lambda x: x[['Age', 'Fare']].mean()).round(0))

         Age  Fare
Sex
female  28.0  44.0
male    31.0  26.0


In [38]:
# Top two Fares for Sex x Embarked categories

(df_titanic
 .set_index('Name')
 .groupby(['Sex', 'Embarked'])
 .apply(lambda x: x['Fare'].sort_values(ascending=False).head(2)))

Sex     Embarked  Name                                 
female  C         Ward, Miss. Anna                         512.3292
                  Ryerson, Miss. Susan Parker "Suzette"    262.3750
        Q         Minahan, Miss. Daisy E                    90.0000
                  Rice, Mrs. William (Margaret Norton)      29.1250
        S         Fortune, Miss. Mabel Helen               263.0000
                  Fortune, Miss. Alice Elizabeth           263.0000
male    C         Lesurer, Mr. Gustave J                   512.3292
                  Cardeza, Mr. Thomas Drake Martinez       512.3292
        Q         Minahan, Dr. William Edward               90.0000
                  Rice, Master. Eugene                      29.1250
        S         Fortune, Mr. Mark                        263.0000
                  Fortune, Mr. Charles Alexander           263.0000
Name: Fare, dtype: float64

In [None]:
    male  female
S   x     x2
C   y     y2
Q   z     z2

male    S  x
male    C  y
male    Q  z
female  S  x2
female  C  y2
female  Q  z2

In [None]:
pd.pivot_table(data = df_titanic, 
               index = 'Embarked', 
               columns = 'Sex', 
               values = 'Fare', 
               aggfunc = [np.mean, np.sum]).to_clipboard()

In [None]:
df_titanic.columns

In [None]:
df_titanic.dtypes

In [None]:
%timeit df_titanic.query("Parch <= 2")

In [None]:
%timeit df_titanic.loc[df_titanic.Parch <= 2, :]

In [None]:
pd.pivot_table(data = df_titanic.query("Parch <= 2"), 
               index='Pclass', 
               columns=['Sex','Parch'], 
               values='Survived').round(2)

In [None]:
%timeit pd.pivot_table(data = df_titanic, index='Pclass', values='Fare')

In [None]:
%timeit df_titanic.groupby('Pclass')['Fare'].mean()
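
The two cells timed above compute the same thing in different ways. As a sketch of that equivalence, a `pivot_table` with a `columns=` argument matches a two-key `groupby` followed by `unstack`. The `demo` table is made up for illustration, with column names borrowed from the Titanic data.

```python
import pandas as pd

demo = pd.DataFrame({'Embarked': ['S', 'S', 'C', 'C', 'C'],
                     'Sex': ['male', 'female', 'male', 'female', 'female'],
                     'Fare': [10.0, 20.0, 30.0, 40.0, 60.0]})

# pivot_table: rows from index=, columns from columns=, cells aggregated with aggfunc
pivoted = pd.pivot_table(data=demo, index='Embarked', columns='Sex',
                         values='Fare', aggfunc='mean')

# The same table via groupby: aggregate on both keys, then move 'Sex' to the columns
via_groupby = demo.groupby(['Embarked', 'Sex'])['Fare'].mean().unstack()
```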

---
<big>

*Task* | Using the variables Embarked, Pclass and Sex, find the group with the highest rate of survival.

---

In [40]:
grps = [['Embarked'], ['Pclass'], ['Sex'], 
        ['Embarked', 'Pclass'], ['Embarked', 'Sex'], ['Pclass', 'Sex'],
        ['Embarked', 'Pclass', 'Sex']]

per_survived = {}

for grp in grps:
    per_survived['_'.join(grp)] = df_titanic.groupby(grp)['Survived'].mean()

In [41]:
per_survived.keys()

['Embarked_Pclass_Sex',
 'Embarked',
 'Embarked_Pclass',
 'Embarked_Sex',
 'Pclass',
 'Sex',
 'Pclass_Sex']

In [45]:
series_concatd = pd.concat([per_survived.get(k) for k in per_survived])

In [46]:
series_concatd.sort_values(ascending=False)[:10]

(Q, 2, female)    1.000000
(C, 2, female)    1.000000
(Q, 1, female)    1.000000
(C, 1, female)    0.976744
(1, female)       0.968085
(S, 1, female)    0.958333
(2, female)       0.921053
(S, 2, female)    0.910448
(C, female)       0.876712
(Q, female)       0.750000
Name: Survived, dtype: float64

In [50]:
series_concatd.sort_values()[:10]

(Q, 2, male)    0.000000
(Q, 1, male)    0.000000
(Q, male)       0.073171
(Q, 3, male)    0.076923
(S, 3, male)    0.128302
(3, male)       0.135447
(S, 2, male)    0.154639
(2, male)       0.157407
(S, male)       0.174603
male            0.188908
Name: Survived, dtype: float64

In [62]:
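sizes = {}
for grp in grps:
    # NOTE: g.size counts cells (rows x columns), so with the 12 columns of train.csv
    # each value below is 12x the number of passengers; len(g) would give the head count
    sizes['_'.join(grp)] = df_titanic.groupby(grp).apply(lambda g: g.size)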

In [63]:
sizes = pd.concat([sizes[k] for k in sizes.keys()])
type(sizes)

pandas.core.series.Series

In [64]:
sizes.name = 'Num_of_Passengers'

In [68]:
pd.concat([series_concatd, sizes], axis=1)['Num_of_Passengers']

(C, 1, female)     516
(C, 1, male)       504
(C, 2, female)      84
(C, 2, male)       120
(C, 3, female)     276
(C, 3, male)       516
(Q, 1, female)      12
(Q, 1, male)        12
(Q, 2, female)      24
(Q, 2, male)        12
(Q, 3, female)     396
(Q, 3, male)       468
(S, 1, female)     576
(S, 1, male)       948
(S, 2, female)     804
(S, 2, male)      1164
(S, 3, female)    1056
(S, 3, male)      3180
C                 2016
Q                  924
S                 7728
(C, 1)            1020
(C, 2)             204
(C, 3)             792
(Q, 1)              24
(Q, 2)              36
(Q, 3)             864
(S, 1)            1524
(S, 2)            1968
(S, 3)            4236
(C, female)        876
(C, male)         1140
(Q, female)        432
(Q, male)          492
(S, female)       2436
(S, male)         5292
1                 2592
2                 2208
3                 5892
female            3768
male              6924
(1, female)       1128
(1, male)         1464
(2, female)

In [65]:
pd.concat([series_concatd, sizes], axis=1).sort_values('Survived', ascending=False)[:10]

                Survived  Num_of_Passengers
(C, 2, female)  1.0       84
(Q, 1, female)  1.0       12
(Q, 2, female)  1.0       24
(C, 1, female)  0.976744  516
(1, female)     0.968085  1128
(S, 1, female)  0.958333  576
(2, female)     0.921053  912
(S, 2, female)  0.910448  804
(C, female)     0.876712  876
(Q, female)     0.75      432


In [66]:
pd.concat([series_concatd, sizes], axis=1).sort_values('Survived', ascending=True)[:10]

              Survived  Num_of_Passengers
(Q, 2, male)  0.0       12
(Q, 1, male)  0.0       12
(Q, male)     0.073171  492
(Q, 3, male)  0.076923  468
(S, 3, male)  0.128302  3180
(3, male)     0.135447  4164
(S, 2, male)  0.154639  1164
(2, male)     0.157407  1296
(S, male)     0.174603  5292
male          0.188908  6924


----
## Reshaping your data with `stack`, `unstack` and `pivot_table`

### LONG to WIDE

In [None]:
df = pd.DataFrame({'A': list('x' * 5) + list('y' * 5), 
                   'B': list('abcde' * 2),
                   'C': np.random.randint(0, 100, 10)})

df.set_index(['A', 'B'], inplace=True)

type(df)

In [None]:
df.unstack()

In [None]:
df.unstack(level=0)

In [None]:
# Subsetting a DataFrame with Hierarchical Index
df.loc['y'].loc['a':'c']

In [None]:
df_1 = DataFrame(np.random.randn(16).reshape(4,4), 
                 index=list('abcd'), 
                 columns=list('pqrs'))
print('DataFrame ... \n', df_1)

In [None]:
print('\n\nStacking results in one column...\n', df_1.stack())

> To use stack/unstack, the values we want to shift from rows to columns (or the other way around) must be part of the index
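
A round trip makes the relationship between the two methods concrete. The sketch below uses a small illustrative table with fixed values (`wide`, `long` and `back` are made-up names).

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.arange(6).reshape(2, 3), index=list('ab'), columns=list('pqr'))

# stack: the column labels become an inner index level, giving one long Series
long = wide.stack()

# unstack: the inner index level moves back to the columns, restoring the wide layout
back = long.unstack()
```

`long` is indexed by `(row label, column label)` pairs, so a single cell is addressed as `long.loc[('a', 'q')]`.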