* Overview of Pandas read APIs
* Read sales data into Pandas DataFrame
* Read data from CSV file without Header
* Specifying Delimiter or Separator
* Filter for Valid Sales Records
* Compute Commission Amount and Sale Revenue
* Overview of Pandas write or to APIs
* Write Processed data to CSV file with header
* Validate by reviewing the file
* Exercise and Solution

In [None]:
# Overview of Pandas Read APIs
# pd.read_csv
# pd.read_json
# many more
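
As a minimal sketch of how the read APIs behave, the example below reads the same two records through `pd.read_csv` and `pd.read_json`. The in-memory strings and the `sale_id` column are hypothetical stand-ins for real files; `io.StringIO` is used only so the example runs without any data on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk
csv_text = "sale_id,sale_amount\n1,100.0\n2,250.5\n"
json_text = '[{"sale_id": 1, "sale_amount": 100.0}, {"sale_id": 2, "sale_amount": 250.5}]'

# Both functions accept a path or a file-like object and return a DataFrame
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.shape, json_df.shape)
```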

In [None]:
# Read sales data into Pandas DataFrame
import pandas as pd

In [None]:
sales_df = pd.read_csv('data/sales/part-00000')

In [None]:
sales_df.shape

In [None]:
sales_df.columns

In [None]:
# Read data from CSV file without Header
departments_df = pd.read_csv('data/retail_db/departments/part-00000')

In [None]:
departments_df # first row is used as the header by default, so the column names are wrong

In [None]:
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    header=None # no header row; pandas generates integer column names (0, 1, ...)
)

In [None]:
departments_df.columns

In [None]:
# Specifying column names
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    names=['department_id', 'department_name'] # uses specified column names
)

In [None]:
departments_df # first row is treated as data, not as the header
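
The three header options above can be compared side by side. This is a small sketch on hypothetical in-memory data (two department rows with no header line), not the actual `departments` file.

```python
import io
import pandas as pd

# Hypothetical two-row file with no header line
raw = "2,Fitness\n3,Footwear\n"
column_names = ['department_id', 'department_name']

default_df = pd.read_csv(io.StringIO(raw))                  # first row becomes the header
no_header_df = pd.read_csv(io.StringIO(raw), header=None)   # columns are named 0 and 1
named_df = pd.read_csv(io.StringIO(raw), names=column_names)  # columns use the given names

# The default read loses one data row to the header
print(len(default_df), len(no_header_df), len(named_df))  # 1 2 2
```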

In [None]:
# Specifying delimiter or separator
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    names=['department_id', 'department_name'],
    sep=',' # ',' is the default; delimiter is an alias for sep
)

In [None]:
departments_df
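
Since `,` is the default, the `sep` argument only matters for other delimiters. Here is a short sketch assuming a hypothetical pipe-delimited version of the same data, also showing that `delimiter` gives the same result as `sep`.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data with no header line
raw = "2|Fitness\n3|Footwear\n"
column_names = ['department_id', 'department_name']

sep_df = pd.read_csv(io.StringIO(raw), names=column_names, sep='|')
delimiter_df = pd.read_csv(io.StringIO(raw), names=column_names, delimiter='|')

print(sep_df.equals(delimiter_df))  # True
```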

In [None]:
# Filter for Valid Sales Records
# using query
sales_df.query('commission_pct != -1')

In [None]:
sales_df.query('(commission_pct != -1) & (commission_pct.notnull())')

In [None]:
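# copy the filtered result so that adding columns later does not raise SettingWithCopyWarning
sales_df = sales_df.query('(commission_pct != -1) & (commission_pct.notnull())').copy()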
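
The second condition is needed because a missing value is not equal to -1, so `commission_pct != -1` on its own keeps the rows with nulls. The sketch below shows the same filter written as a boolean mask on hypothetical data; the `sale_id` column and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records: one invalid (-1) and one missing commission_pct
sales = pd.DataFrame({
    'sale_id': [1, 2, 3, 4],
    'sale_amount': [100.0, 200.0, 300.0, 400.0],
    'commission_pct': [10.0, -1.0, np.nan, 5.0],
})

# NaN != -1 evaluates to True, so this alone keeps the row with the missing value
not_minus_one = sales['commission_pct'] != -1
print(sales[not_minus_one]['sale_id'].tolist())  # [1, 3, 4]

# Adding notnull() drops the missing value as well
mask = not_minus_one & sales['commission_pct'].notnull()
valid_sales = sales[mask].copy()
print(valid_sales['sale_id'].tolist())  # [1, 4]
```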

In [None]:
# Compute Commission Amount and Sale Revenue
# using apply

In [None]:
# create a lambda function to compute the commission amount for one row
calculate_commission = lambda row: row['sale_amount'] * row['commission_pct'] / 100

In [None]:
# create a lambda function to compute the sale revenue for one row (needs commission_amt)
calculate_revenue = lambda row: row['sale_amount'] - row['commission_amt']

In [None]:
# create new columns using apply method
sales_df['commission_amt'] = sales_df.apply(calculate_commission, axis=1)

In [None]:
sales_df

In [None]:
sales_df['sale_revenue'] = sales_df.apply(calculate_revenue, axis=1)

In [None]:
sales_df
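
`apply` with `axis=1` calls the function once per row. As an alternative sketch (not the approach used above), the same two columns can be computed with column arithmetic, which is usually shorter and tends to be faster on large DataFrames. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical records using the same column names as sales_df
sales = pd.DataFrame({
    'sale_amount': [1000.0, 2000.0],
    'commission_pct': [10.0, 5.0],
})

# Arithmetic on whole columns; no row-wise function is needed
sales['commission_amt'] = sales['sale_amount'] * sales['commission_pct'] / 100
sales['sale_revenue'] = sales['sale_amount'] - sales['commission_amt']

print(sales['sale_revenue'].tolist())  # [900.0, 1900.0]
```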

In [None]:
# Overview of Pandas write or to APIs
# to_csv
# to_json
# many more
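
A quick way to see what the write APIs produce is to call them without a path, in which case `to_csv` and `to_json` return the text instead of writing a file. This is a small sketch on a hypothetical one-row DataFrame.

```python
import pandas as pd

# Hypothetical one-row DataFrame
df = pd.DataFrame({'sale_id': [1], 'sale_amount': [100.0]})

# With no path given, both methods return a string
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient='records')

print(csv_text)
print(json_text)
```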

In [None]:
# Write processed data to CSV file with header
sales_df.to_csv('data/sales/sales_with_revenue.csv', header=True)

In [None]:
# Validate by reviewing the file
pd.read_csv('data/sales/sales_with_revenue.csv') # the DataFrame index was also written, so it shows up as an extra unnamed column

In [None]:
sales_df.to_csv(
    'data/sales/sales_with_revenue.csv',
    header=True,
    index=False
) # DataFrame index is not written to the file

In [None]:
pd.read_csv('data/sales/sales_with_revenue.csv')
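
The effect of `index=False` can be checked without touching the file system. The sketch below writes a hypothetical DataFrame to in-memory buffers (standing in for the CSV file) and compares the columns that come back.

```python
import io
import pandas as pd

# Hypothetical processed data
df = pd.DataFrame({'sale_id': [1, 2], 'sale_revenue': [900.0, 1900.0]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)  # ['Unnamed: 0', 'sale_id', 'sale_revenue']

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)  # ['sale_id', 'sale_revenue']
```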

* Exercise: Read data from `data/hr_db/employees/part-00000.csv` into a DataFrame. Filter for employees whose salary is greater than or equal to 15000 and write the result to `data/hr_db/employees_hni/part-00000.csv`.
  * Use `os` to create the `data/hr_db/employees_hni` folder if it does not exist.
  * Use the following column names.
    * `employee_id`
    * `first_name`
    * `last_name`
    * `email`
    * `phone_number`
    * `hire_date`
    * `job_id`
    * `salary`
    * `commission_pct`
    * `manager_id`
    * `department_id`
  * Make sure the newly created file does not contain the DataFrame index.

In [None]:
import pandas as pd

In [None]:
columns = ['employee_id', 'first_name', 'last_name', 'email',
           'phone_number', 'hire_date', 'job_id', 'salary',
           'commission_pct', 'manager_id', 'department_id']
employees_df = pd.read_csv(
    'data/hr_db/employees/part-00000.csv',
    sep='\t',
    names=columns
)

In [None]:
employees_hni = employees_df.query('salary >= 15000')

In [None]:
import os
os.makedirs('data/hr_db/employees_hni', exist_ok=True)

In [None]:
employees_hni.to_csv(
    'data/hr_db/employees_hni/part-00000.csv',
    header=True,
    index=False
)

In [None]:
pd.read_csv('data/hr_db/employees_hni/part-00000.csv')
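
To try the same flow without the `hr_db` data, here is a minimal sketch that runs in a temporary directory. The three employee rows are hypothetical, and only two of the exercise's columns are used; it also shows that calling `os.makedirs` with `exist_ok=True` a second time does not fail.

```python
import os
import tempfile
import pandas as pd

# Hypothetical employees; only two of the exercise's columns are used here
employees = pd.DataFrame({
    'employee_id': [100, 101, 102],
    'salary': [24000.0, 17000.0, 9000.0],
})

with tempfile.TemporaryDirectory() as base_dir:
    target_dir = os.path.join(base_dir, 'employees_hni')
    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(target_dir, exist_ok=True)  # no error when the folder already exists

    target_file = os.path.join(target_dir, 'part-00000.csv')
    employees.query('salary >= 15000').to_csv(target_file, index=False)

    # Read the file back to validate the filter and the missing index column
    written = pd.read_csv(target_file)

print(written['employee_id'].tolist())  # [100, 101]
print(written.columns.tolist())         # ['employee_id', 'salary']
```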