# Pandas Overview


One of the fundamental concepts across Python, R, and databases is the **tabular data structure**, with named columns and records stored as rows. Such tables are usually heterogeneous: each column can hold a different data type.
In Python, we use the **DataFrame** data structure within the **Pandas** library.
In R you will learn of the **native data frame** type.
In SQL, the `table` is the fundamental data storage concept.

In each of these cases, you have facilities to operate on a column or row of data. Additionally, you can subset the data by selecting a list of columns and by filtering to keep only the rows that pass boolean (T/F) tests of conditions.
Even more complex concepts are possible, as you will see, such as grouping rows for analytics and other operations.
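
As a rough preview of these operations in pandas, the sketch below selects columns, filters rows with a boolean test, and groups rows for a simple aggregate. The frame `df` and its column names are hypothetical stand-ins made up for illustration; they are not the lab's dataset.

```python
import pandas as pd

# Hypothetical stand-in data, not the lab's processing_examples.csv
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "group": ["x", "y", "x", "y"],
    "score": [1.0, 2.5, 3.0, 4.5],
})

subset = df[["name", "score"]]               # select a list of columns
high = df[df["score"] > 2.0]                 # keep rows passing a boolean test
means = df.groupby("group")["score"].mean()  # group rows, then aggregate

print(subset)
print(high)
print(means)
```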

<mark>In this lab and in the subsequent labs and practices, you may see repeated information. The repetition is deliberate, so that you become familiar with the most frequently used methods and topics.</mark>


This session covers the following examples:

1. Delete columns
1. Convert column into binary
1. Convert column into np.datetime
1. Find unique values in a column
1. Iterating through columns and excluding columns from iteration
1. Drop rows from data frame
1. Subset dataframe by columns
1. Subset dataframe by rows
1. Resample dataset


We will walk through them with minimal examples.

In [None]:
import os, sys
import itertools
import random
import numpy as np
import pandas as pd


def load_data(filepath='processing_examples.csv'):
    # Read the CSV at the given path (previously the argument was ignored)
    dataset = pd.read_csv(filepath)
    return dataset

def raw_state(dataset):
    print('====== before ======')
    print(dataset)

def curr_state(dataset):
    print('====== after ======')
    print(dataset)


We will use a toy dataset (`processing_examples.csv`) to go over various features of the pandas DataFrame.

In [None]:
# load the data
dataset = load_data()

## Show dimension/shape and size

In [None]:
print(f"Shape = {dataset.shape}")
print(f"Num rows = {dataset.shape[0]}")
print(f"Num cols = {dataset.shape[1]}")
print(f"Num elements = {dataset.size}")
print(f"Column names = {dataset.columns}")


## Show the first and last few lines in a dataframe

In [None]:
print(dataset.head())  # default 5 lines
print(dataset.tail())  # default 5 lines

print(dataset.head(2))  # show first 2 lines
print(dataset.tail(2))  # show last 2 lines

## Data summary

In [None]:
dataset.info()

In [None]:
dataset.describe()  # gives some basic stats on the numerical attributes

## Update all the values

This example zeroes out every element in the dataset. Its purpose is to introduce the syntax used throughout this lab and to show quickly what the helper functions defined above do.

In [None]:
dataset = load_data()
raw_state(dataset)
dataset.iloc[:, :] = 0
curr_state(dataset)

## Delete a column

In [None]:
dataset = load_data()
raw_state(dataset)

# Without inplace=True, drop() returns a new frame and leaves dataset unchanged
ret = dataset.drop('float', axis = 1)
print('====== returns ======')
print(ret)

curr_state(dataset)


In [None]:
dataset = load_data()
raw_state(dataset)

# With inplace=True, drop() modifies dataset itself and returns None
ret = dataset.drop('float', inplace = True, axis = 1)
print('====== returns ======')
print(ret)

curr_state(dataset)


**Recommended way**

In [None]:
dataset = load_data()
raw_state(dataset)

del dataset['float']

curr_state(dataset)
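
The difference between the drop variants above can be checked directly. The following is a minimal sketch on a hypothetical two-column frame (the column names mirror the lab's, but the values are made up); it uses the `columns=` keyword of `drop`, which is equivalent to passing `axis=1`.

```python
import pandas as pd

# Hypothetical stand-in frame
df = pd.DataFrame({"float": [0.1, 0.9], "int": [1, 2]})

# Default: a new frame is returned and df keeps all its columns
copy = df.drop(columns="float")
columns_after_copy = list(df.columns)

# inplace=True: df itself loses the column and the call returns None
result = df.drop(columns="float", inplace=True)

print(columns_after_copy, list(copy.columns), result, list(df.columns))
```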


## Convert column into binary

In [None]:
dataset = load_data()
raw_state(dataset)

# list.index returns the position in the list, so 'Yes' becomes 0 and 'No' becomes 1
dataset['yes/no'] = dataset['yes/no'].apply(['Yes', 'No'].index)

curr_state(dataset)


In [None]:
dataset = load_data()
raw_state(dataset)

dataset['yes/no'] = list(map(['Yes', 'No'].index, dataset['yes/no']))

curr_state(dataset)
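
Because the encoding above depends on the order of the list, an explicit dictionary can make the intended coding easier to read. This is a sketch on a hypothetical Series, assuming you want `'Yes'` to become 1 and `'No'` to become 0; it is an alternative, not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the 'yes/no' column
answers = pd.Series(["Yes", "No", "Yes"], name="yes/no")

# Position-based encoding, as in the cells above: 'Yes' is 0, 'No' is 1
by_position = answers.apply(["Yes", "No"].index)

# Explicit encoding: the dictionary states the mapping directly
explicit = answers.map({"Yes": 1, "No": 0})

print(by_position.tolist(), explicit.tolist())
```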


## Convert column into np.datetime

In [None]:
dataset = load_data()
raw_state(dataset)

dataset['date'] = dataset['date'].apply(np.datetime64)

curr_state(dataset)
    
print(type(dataset['date'][0]))

In [None]:
dataset = load_data()
raw_state(dataset)

dataset['date'] = dataset['date'].apply(np.datetime64)
print('====== day ======')
print(dataset['date'].apply(lambda d: d.day))

curr_state(dataset)


In [None]:
dataset = load_data()
raw_state(dataset)

dataset['date'] = dataset['date'].apply(np.datetime64)
print('====== month ======')
print(dataset['date'].apply(lambda d: d.month))

curr_state(dataset)
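
As an alternative to applying `np.datetime64` element by element, pandas offers `pd.to_datetime` for the conversion and the `.dt` accessor for extracting components from the whole column at once. The sketch below uses a hypothetical Series of ISO-formatted date strings; the real `date` column may be formatted differently.

```python
import pandas as pd

# Hypothetical date strings; the lab's 'date' column may differ
dates = pd.Series(["2020-01-15", "2021-12-01"])

parsed = pd.to_datetime(dates)  # vectorized conversion to a datetime64 column
days = parsed.dt.day            # day of month for every row
months = parsed.dt.month        # month number for every row

print(parsed.dtype, days.tolist(), months.tolist())
```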


## Find unique values in a column

In [None]:
dataset = load_data()
print(np.unique(dataset['categorical']))
# or 
print(dataset['categorical'].unique())

## Iterating through columns

In [None]:
dataset = load_data()

for column_name in ['int', 'categorical']:
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

for column_name in dataset.columns:
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

for column_name in np.array(dataset.columns)[[2,4]]:
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

for column_name in np.array(dataset.columns)[[False, True, False, False, True]]:
    print(dataset[column_name].head())

## Iterating through columns with exclusions

In [None]:
dataset = load_data()

exclusion = ['float', 'yes/no', 'date']

for column_name in set(dataset.columns)-set(exclusion):
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

exclusion = [0,2,3]

for column_name in set(dataset.columns)-set(np.array(dataset.columns)[exclusion]):
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

exclusion = [0,2,3]

for column_name in [v for i,v in enumerate(dataset.columns) if i not in exclusion]:
    print(dataset[column_name].head())

In [None]:
dataset = load_data()

exclusion = [True, False, True, True, False]

for column_name in np.array(dataset.columns)[~np.array(exclusion)]:
    print(dataset[column_name].head())
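
Note that the set-based variants above do not guarantee the original column order, since Python sets are unordered. A sketch of two order-preserving alternatives follows; the frame is a hypothetical stand-in whose column names are assumed from the examples in this lab.

```python
import pandas as pd

# Hypothetical stand-in frame with column names assumed from this lab
df = pd.DataFrame({
    "float": [0.1], "int": [1], "yes/no": ["Yes"],
    "date": ["2020-01-15"], "categorical": ["A"],
})
exclusion = ["float", "yes/no", "date"]

# List comprehension: keeps columns in their original order
kept = [c for c in df.columns if c not in exclusion]

# drop(columns=...) also preserves the order of the remaining columns
kept_via_drop = list(df.drop(columns=exclusion).columns)

for column_name in kept:
    print(df[column_name].head())
```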

## Drop rows from data frame

In [None]:
dataset = load_data()
raw_state(dataset)

dataset.drop([3,4,5], inplace=True)

curr_state(dataset)



In [None]:
dataset = load_data()
raw_state(dataset)

dataset.drop([3,4,5], inplace=True)
dataset.reset_index(drop=True, inplace=True)


curr_state(dataset)
    

## Subset data frame by columns

In [None]:
dataset = load_data()
dataset.iloc[:, [1, 2]]

In [None]:
dataset = load_data()
dataset.loc[:, ['int', 'yes/no']]

In [None]:
dataset = load_data()

dataset.loc[:, filter(lambda i: 'a' in i, dataset.columns)]

In [None]:
dataset = load_data()

dataset.loc[:, filter(lambda i: i.startswith('c') or i.startswith('d'), dataset.columns)]

In [None]:
dataset = load_data()

dataset.loc[:, [i for i in dataset.columns if i.startswith('c') or i.startswith('d')]]

## Subset data frame by rows

In [None]:
dataset = load_data()

dataset.iloc[[3,4,5], :]

In [None]:
dataset = load_data()

print(dataset['int']>5)

dataset[dataset['int']>5]

In [None]:
dataset = load_data()

dataset[(dataset['float']>0.5) & (dataset['yes/no']=='Yes')]

Sidebar: if you wrap the condition in `np.sum()` instead of `dataset[]`, you get the count of records satisfying the condition, because each `True` counts as 1.

In [None]:
dataset = load_data()

np.sum((dataset['float']>0.5) & (dataset['yes/no']=='Yes'))
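
The same count is available from the boolean mask itself, and its mean gives the fraction of matching rows. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical stand-in values for the 'float' and 'yes/no' columns
df = pd.DataFrame({
    "float": [0.2, 0.7, 0.9, 0.6],
    "yes/no": ["Yes", "Yes", "No", "Yes"],
})

mask = (df["float"] > 0.5) & (df["yes/no"] == "Yes")

count = int(mask.sum())  # True counts as 1, so the sum is the number of matches
share = mask.mean()      # fraction of rows satisfying the condition

print(count, share)
```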

### Notice the usage of dataset[]

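It acts on either rows or columns depending on what you pass in: a column name (or a list of names) selects columns, while a boolean list or Series with one entry per row selects rows.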

In [None]:
dataset = load_data()

obj = 'categorical'
dataset[obj]

In [None]:
dataset = load_data()

obj = [False,False,False,False,True,False,True,False]
dataset[obj]

That is why the following statement works: the comparison produces a boolean Series, which `dataset[]` then uses to select rows.

In [None]:
dataset = load_data()

dataset[dataset['categorical']=='C']
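
To make the two behaviours of the `[]` indexer concrete, the sketch below applies each kind of key to a small hypothetical frame and checks what comes back. The column names are borrowed from this lab; the values are made up.

```python
import pandas as pd

# Hypothetical stand-in frame
df = pd.DataFrame({"int": [1, 7, 9], "categorical": ["A", "C", "B"]})

col = df["categorical"]                # a single label returns a Series (one column)
cols = df[["int"]]                     # a list of labels returns a DataFrame of columns
rows = df[[False, True, True]]         # a boolean list selects rows
rows_from_condition = df[df["int"] > 5]  # a boolean Series selects rows too

print(type(col).__name__, cols.shape, rows.index.tolist())
```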

### And this indexer is also writable on both axes

In [None]:
dataset = load_data()
raw_state(dataset)

dataset['categorical'] = ['A']*8

curr_state(dataset)
    

In [None]:
dataset = load_data()
raw_state(dataset)

dataset[dataset['categorical']=='C'] = [-1, -128, 'No', '0000-00-00', '<<<']

curr_state(dataset)
    

In [None]:
dataset = load_data()
raw_state(dataset)

dataset[dataset['categorical']=='C'] = [
    [-1, -128, 'No', '0000-00-00', '<<<'],
    [0, 127, 'Yes', '1900-12-01', '<<<']
]

curr_state(dataset)
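
When only some columns of the matching rows need to change, `.loc` accepts a row mask and a column label together. This is a sketch on a hypothetical frame, offered as a complement to the whole-row assignments above rather than a replacement for them.

```python
import pandas as pd

# Hypothetical stand-in frame
df = pd.DataFrame({"int": [1, 2, 3], "categorical": ["A", "C", "C"]})

mask = df["categorical"] == "C"

# Overwrite only the 'int' column in the rows where the mask is True
df.loc[mask, "int"] = -1

print(df)
```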


## Resample dataset

In [None]:
dataset = load_data()
raw_state(dataset)

dataset = dataset.sample(frac = 0.2)
curr_state(dataset)
    

In [None]:
dataset = load_data()
raw_state(dataset)

dataset = dataset.sample(frac = 0.9, replace = True)
curr_state(dataset)
    

In [None]:
dataset = load_data()
raw_state(dataset)

dataset = dataset.sample(frac = 1)

curr_state(dataset)
        

In [None]:
dataset = load_data()
raw_state(dataset)

dataset = dataset.sample(frac = 1.5, replace = True)

curr_state(dataset)


    

In [None]:
dataset = load_data()
raw_state(dataset)

dataset = dataset.sample(frac = 1.5, replace = True).reset_index(drop=True)

curr_state(dataset)
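
The cells above give a different result on every run because `sample` draws at random. If reproducible output is wanted, `sample` takes a `random_state` seed. The sketch below assumes an 8-row hypothetical frame, and the seed value 0 is an arbitrary choice.

```python
import pandas as pd

# Hypothetical 8-row stand-in frame
df = pd.DataFrame({"int": range(8)})

# The same seed yields the same sample on every call
first = df.sample(frac=0.5, random_state=0)
second = df.sample(frac=0.5, random_state=0)

# Sampling more rows than exist requires replace=True;
# reset_index(drop=True) renumbers the rows from 0
boot = df.sample(frac=1.5, replace=True, random_state=0).reset_index(drop=True)

print(first.equals(second), len(first), len(boot))
```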
    

### All done!  Clear Cells, Save Notebook, `File > Close and Halt`