## Group 4 - Lab Exercise 3

**This file only describes what the raw data looks like and guides how we process it.**
**It is not intended to be included in the data pipelines.**

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Load the data
df_branch_service = pd.read_json("branch_service_transaction_info.json")
df_customer_transaction = pd.read_json("customer_transaction_info.json")
# With no `on=` argument, pd.merge does an inner join on every column the two frames share
df_merged = pd.merge(df_customer_transaction, df_branch_service)

# **Data Profiling and Validation**


## Looking at the Raw Data


use .sample() to look at a random selection of rows

In [None]:
df_branch_service.sample(10)

In [None]:
df_customer_transaction.sample(10)

## Checking for Duplicates

.shape to get row and column count

In [None]:
print(df_branch_service.shape)
print(df_customer_transaction.shape)

to count unique txn_id, use .nunique()

In [None]:
print(df_branch_service['txn_id'].nunique())
print(df_customer_transaction['txn_id'].nunique())

a difference between the total row count and the unique txn_id count means that some txn_id values appear more than once, i.e. there are **duplicate rows**
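The comparison above can also be done row by row. The following is a minimal sketch on a toy frame (the values are made up and only stand in for df_branch_service); it assumes that keeping the first occurrence of each txn_id is acceptable, which the raw data should confirm.

```python
import pandas as pd

# Toy frame standing in for df_branch_service (hypothetical values)
toy = pd.DataFrame({
    "txn_id": ["T1", "T2", "T2", "T3"],
    "service": ["A", "B", "B", "C"],
})

# Number of surplus rows: total rows minus unique txn_id values
n_extra = len(toy) - toy["txn_id"].nunique()

# Every row whose txn_id appears more than once (keep=False marks all copies)
dupes = toy[toy.duplicated(subset="txn_id", keep=False)]

# Assumption: the first occurrence of each txn_id is the one to keep
deduped = toy.drop_duplicates(subset="txn_id", keep="first")

print(n_extra)
print(dupes)
print(deduped.shape)
```

Inspecting `dupes` before dropping anything helps to see whether the repeated rows are exact copies or differ in other columns.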

## Checking for Empty Values

checking null values count per column

In [None]:
df_branch_service.isnull().sum()

In [None]:
df_customer_transaction.isnull().sum()

branch_name and price have null values

### Empty Values of branch_name Column

look at the possible values of branch_name

In [None]:
df_branch_service['branch_name'].unique()

there are two kinds of empty value in the branch_name column: '' and None ('N/A' is not treated as empty since it has a different meaning)

Treat '' as null too, so that one fill method covers both kinds of empty value

The empty values can then be filled with a forward fill, followed by a backward fill for any leading nulls that the forward fill cannot reach

In [None]:
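# Convert '' to NaN first; otherwise ffill/bfill would leave the empty strings untouched
df_branch_service['branch_name'] = df_branch_service['branch_name'].replace('', np.nan).ffill().bfill()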

### Empty Values of price Column

since the empty values are in a numerical column, filling them with a mean is a reasonable choice

look at the price mean, the group mean from service, and the group mean from branch_name & service

In [None]:
df_branch_service['price'].describe()

In [None]:
df_branch_service.groupby(['service'])['price'].describe()

In [None]:
df_branch_service.groupby(['branch_name', 'service'])['price'].describe()

the group mean by branch_name and service is the most specific of the three, so it is the preferred value for filling the nulls within each group
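A minimal sketch of the group-mean fill, shown on a toy frame with made-up values (the column names follow df_branch_service; everything else is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_branch_service (hypothetical values)
toy = pd.DataFrame({
    "branch_name": ["X", "X", "X", "Y", "Y"],
    "service": ["S1", "S1", "S1", "S1", "S1"],
    "price": [10.0, 20.0, np.nan, 40.0, np.nan],
})

# Mean price of each (branch_name, service) group, aligned to the original rows
group_mean = toy.groupby(["branch_name", "service"])["price"].transform("mean")

# Replace only the null prices with the mean of their own group
toy["price"] = toy["price"].fillna(group_mean)

print(toy)
```

If every price in a group is null, that group's mean is also null and the rows stay empty; falling back to the service-level mean would be one way to handle that case.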

## Standardizing Data

### Dates

In [None]:
# describe must be called; without the parentheses only the bound method is printed
print(df_merged['avail_date'].describe())
print(df_merged['birthday'].describe())

currently the two date columns in our data have the object data type; convert them to the datetime data type

In [None]:
df_merged['avail_date'] = pd.to_datetime(df_merged['avail_date'], format='%Y-%m-%d')
df_merged['birthday'] = pd.to_datetime(df_merged['birthday'], format='%Y-%m-%d')

Looking at the date columns, some rows contain values that are **impossible**:

In [None]:
print(df_merged['avail_date'].max())

dates that are **later than the current date** are invalid and should be removed

In [70]:
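# Midnight of the current day, as a Timestamp that compares directly with datetime columns
today = pd.Timestamp.today().normalize()
# Keep only the rows where neither date lies in the future
df_merged = df_merged[(df_merged['avail_date'] <= today) & (df_merged['birthday'] <= today)]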

rows where avail_date falls **before** the birthday are also invalid and should be removed

In [None]:
df_merged = df_merged[(df_merged['avail_date'] > df_merged['birthday'])]

### first_name & last_name

The name columns have values that contain special characters. The values also do not follow a single format

In [None]:
df_merged.sample(20)

only allow letters in these columns and stick to one format (e.g. all uppercase)
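One possible way to apply this rule, sketched on a toy frame with made-up names. It assumes that only the ASCII letters A-Z count as letters, so spaces, hyphens and accented characters would also be removed; that may be too strict for the real data.

```python
import pandas as pd

# Toy frame standing in for the name columns of df_merged (hypothetical values)
toy = pd.DataFrame({
    "first_name": ["  jo-hn ", "MA.RIA", "an@na"],
    "last_name": ["d'cruz", "Santos!!", "lee_"],
})

for col in ["first_name", "last_name"]:
    toy[col] = (
        toy[col]
        .str.replace(r"[^A-Za-z]", "", regex=True)  # drop everything that is not an ASCII letter
        .str.upper()                                # one format: all uppercase
    )

print(toy)
```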

### price

The price column has more decimal places than monetary amounts normally use. Round it to two decimal places.

In [None]:
df_merged['price'] = df_merged['price'].round(2)