* Recap of data processing using low-level APIs
* Overview of Pandas read APIs
* Read sales data into Pandas DataFrame
* Read data from CSV file without header
* Specifying Delimiter or Separator
* Filter for Valid Sales Records
* Compute Commission Amount and Sale Revenue
* Overview of Pandas write (`to_*`) APIs
* Write processed data to CSV file with header
* Validate by reviewing the file
* Exercise and Solution

In [None]:
# Recap of data processing using low-level APIs
with open('data/sales/part-00000') as fp:
    sales = fp.read().splitlines()

sales_valid = []

# Skip the header line, then keep only records whose fourth field
# (commission_pct in the Pandas version below) is present and non-negative
for sale in sales[1:]:
    s = sale.split(',')
    if s[3] != '' and int(s[3]) >= 0:
        sales_valid.append((int(s[0]), int(s[1]), float(s[2]), int(s[3])))

sales_valid

In [None]:
import pandas as pd
sales_df = pd.read_csv('data/sales/part-00000')
sales_df.query('(commission_pct >= 0) & (commission_pct.notnull())')

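# Compared with the low-level version above, the Pandas version is:
# - more readable
# - easier to write
# - shorter (less code)
# - higher quality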

In [None]:
# Overview of Pandas read APIs
# pd.read_csv
# pd.read_json
# many more
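
As a small illustration of the read APIs listed above, the sketch below parses the same two records once from CSV text and once from JSON text. The in-memory strings and their values are made up for this example; with real data you would pass a file path instead of `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical sample data, held in memory instead of a file
csv_text = "sale_id,sale_amount\n1,100.0\n2,250.5\n"
json_text = '[{"sale_id": 1, "sale_amount": 100.0}, {"sale_id": 2, "sale_amount": 250.5}]'

# Both readers return a DataFrame
csv_df = pd.read_csv(io.StringIO(csv_text))
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df.columns.tolist())
print(json_df.columns.tolist())
```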

In [None]:
# Read sales data into Pandas DataFrame
import pandas as pd

In [None]:
sales_df = pd.read_csv('data/sales/part-00000')

In [None]:
sales_df

In [None]:
sales_df.shape

In [None]:
sales_df.columns

In [None]:
sales_df.dtypes

In [None]:
# Read data from CSV file without header
departments_df = pd.read_csv('data/retail_db/departments/part-00000')

In [None]:
departments_df

In [None]:
departments_df.shape

In [None]:
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    header=None
)

In [None]:
departments_df

In [None]:
departments_df.shape

In [None]:
departments_df.columns
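
The difference between the two reads above is easy to reproduce on a tiny in-memory sample. This is only a sketch: the two department records are invented for illustration and are not taken from `data/retail_db/departments/part-00000`.

```python
import io

import pandas as pd

# Hypothetical headerless data: two department records
text = "2,Fitness\n3,Footwear\n"

# Default behaviour: the first record is consumed as the header
with_default = pd.read_csv(io.StringIO(text))

# header=None keeps every record and labels the columns 0, 1, ...
no_header = pd.read_csv(io.StringIO(text), header=None)

print(with_default.shape)
print(no_header.shape)
print(no_header.columns.tolist())
```

With the default settings one record is lost to the header, which is why the row count changes once `header=None` is passed.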

In [None]:
help(pd.read_csv)

In [None]:
# Specifying columns while reading data without header
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    names=['department_id', 'department_name']
)

In [None]:
departments_df

In [None]:
departments_df.shape

In [None]:
departments_df.columns

In [None]:
departments_df.dtypes

In [None]:
# Specifying delimiter or separator
departments_df = pd.read_csv(
    'data/retail_db/departments/part-00000',
    names=['department_id', 'department_name'],
    sep=','
)

In [None]:
departments_df
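
Since `,` is already the default for `sep`, the effect of the argument is easier to see with a different delimiter. The sketch below assumes a pipe-delimited sample that is made up for this example.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data without a header
text = "2|Fitness\n3|Footwear\n"

departments = pd.read_csv(
    io.StringIO(text),
    names=["department_id", "department_name"],
    sep="|",
)

print(departments)
```

Without `sep="|"`, each line would be read as a single field, because the default delimiter `,` does not occur in the data.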

In [None]:
# Filter for Valid Sales Records
sales_df

In [None]:
help(sales_df.query)

In [None]:
sales_df.query('commission_pct.notnull()')

In [None]:
sales_df.query('commission_pct >= 0')

In [None]:
sales_df.query('commission_pct.notnull() & (commission_pct >= 0)')
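
The same filter can also be written with a boolean mask instead of a query string. The sketch below uses a small made-up DataFrame (only the column name `commission_pct` comes from the cells above) and checks that both styles select the same rows.

```python
import pandas as pd

# Hypothetical sample: one missing and one negative commission_pct
sales = pd.DataFrame({
    "sale_id": [1, 2, 3, 4],
    "commission_pct": [10.0, None, -5.0, 0.0],
})

# Boolean mask equivalent of the query expression
mask = sales["commission_pct"].notnull() & (sales["commission_pct"] >= 0)
valid = sales[mask]

# Comparing NaN with >= yields False, so this query drops the missing value too
valid_from_query = sales.query("commission_pct >= 0")

print(valid)
```

Note that neither form modifies `sales`. Assign the result to a variable if the filtered records are needed later.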

In [None]:
# Compute Commission Amount and Sale Revenue
calculate_commission = lambda row: (row['sale_amount'] * row['commission_pct']) / 100

In [None]:
calculate_revenue = lambda row: row['sale_amount'] - row['commission_amount']

In [None]:
sales_df['commission_amount'] = sales_df.apply(calculate_commission, axis=1)

In [None]:
sales_df

In [None]:
sales_df['sale_revenue'] = sales_df.apply(calculate_revenue, axis=1)

In [None]:
sales_df
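
Row-wise `apply` with a lambda works, but the same arithmetic can be expressed directly on whole columns, which is usually shorter and faster on large data. A minimal sketch, assuming a made-up two-row DataFrame with the same column names:

```python
import pandas as pd

# Hypothetical sample data
sales = pd.DataFrame({
    "sale_amount": [1000.0, 500.0],
    "commission_pct": [10.0, 20.0],
})

# Column-wise (vectorized) versions of the two lambdas
sales["commission_amount"] = sales["sale_amount"] * sales["commission_pct"] / 100
sales["sale_revenue"] = sales["sale_amount"] - sales["commission_amount"]

print(sales)
```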

In [None]:
# Overview of Pandas write or to APIs
# sales_df.to_csv
# sales_df.to_json
# sales_df.to_sql
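
To get a feel for the write APIs without touching the file system, note that `to_csv` and `to_json` return a string when no path is given. The sketch below uses a made-up departments DataFrame.

```python
import json

import pandas as pd

# Hypothetical sample data
departments = pd.DataFrame({
    "department_id": [2, 3],
    "department_name": ["Fitness", "Footwear"],
})

# With no path argument, both methods return the serialized text
csv_text = departments.to_csv(index=False)
json_text = departments.to_json(orient="records")

# Parse the JSON text back into Python objects to inspect it
records = json.loads(json_text)

print(csv_text)
print(records)
```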

In [None]:
# Write data in a DataFrame to CSV file without header
departments_df

In [None]:
help(departments_df.to_csv)

In [None]:
departments_df.to_csv(
    'data/retail_db/departments/dummy.csv',
    sep=';',
    header=False,
    index=False
)
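
A file written without a header and with `;` as the separator has to be read back with matching options. The round trip below is a sketch that writes to a temporary directory rather than `data/retail_db/departments`, using made-up records.

```python
import os
import tempfile

import pandas as pd

columns = ["department_id", "department_name"]

# Hypothetical sample data
departments = pd.DataFrame([[2, "Fitness"], [3, "Footwear"]], columns=columns)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dummy.csv")

    # Same options as in the cell above: no header, no index, ';' separator
    departments.to_csv(path, sep=";", header=False, index=False)

    # Raw file content, to confirm what was actually written
    with open(path) as fp:
        raw_lines = fp.read().splitlines()

    # Column names and separator must be supplied again when reading
    restored = pd.read_csv(path, names=columns, sep=";")

print(raw_lines)
print(restored)
```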

In [None]:
# Write processed sales data to CSV file with header
sales_df

In [None]:
sales_df.to_csv(
    'data/sales/sales_with_revenue.csv',
    header=True,
    index=False
)

In [None]:
# Validate by reviewing the file
pd.read_csv('data/sales/sales_with_revenue.csv')

* Exercise: Read data from `data/hr_db/employees/part-00000.csv` into a DataFrame. Filter for employees whose salary is greater than or equal to 15000 and write the result to `data/hr_db/employees_hni/part-00000.csv`.
  * Make sure the `data/hr_db/employees_hni` folder is created using `os` if it does not exist.
  * Here are the column names that are supposed to be used.
    * `employee_id`
    * `first_name`
    * `last_name`
    * `email`
    * `phone_number`
    * `hire_date`
    * `job_id`
    * `salary`
    * `commission_pct`
    * `manager_id`
    * `department_id`
  * Make sure the newly created file does not contain the DataFrame index.
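
One possible shape for a solution is sketched below. It rests on assumptions the exercise does not state: that the source file has no header line, that its delimiter can be passed in as `sep` (check the actual file before choosing a value), and that the output should keep a header. The function name `write_high_salary_employees` is hypothetical.

```python
import os

import pandas as pd

EMPLOYEE_COLUMNS = [
    "employee_id", "first_name", "last_name", "email", "phone_number",
    "hire_date", "job_id", "salary", "commission_pct", "manager_id",
    "department_id",
]


def write_high_salary_employees(src_path, tgt_dir, sep=",", min_salary=15000):
    """Filter employees by salary and write them to tgt_dir/part-00000.csv."""
    # Assumption: the source file has no header, so names are supplied here
    employees = pd.read_csv(src_path, names=EMPLOYEE_COLUMNS, sep=sep)
    high_earners = employees[employees["salary"] >= min_salary]

    # Create the target folder only if it does not exist yet
    os.makedirs(tgt_dir, exist_ok=True)

    tgt_path = os.path.join(tgt_dir, "part-00000.csv")
    # index=False keeps the DataFrame index out of the file
    high_earners.to_csv(tgt_path, index=False)
    return tgt_path
```

It could then be called as `write_high_salary_employees('data/hr_db/employees/part-00000.csv', 'data/hr_db/employees_hni', sep=...)`, with `sep` set to whatever delimiter the source file actually uses.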