# **Great Expectations (E-Commerce Sales Data)**

---

Batch : FTDS-013-HCK   
Group : 3

**Objective**  

This notebook validates the E-Commerce Sales Data using Great Expectations (GX).

In [1]:
# Install GX library

# !pip install -q great-expectations

**Explanation**  

The code above installs the Great Expectations library used for data validation. The command is commented out; uncomment it if the library is not installed yet.

# **1. Initiate Data Context**

In [1]:
# Create a data context
from great_expectations.data_context import FileDataContext
context = FileDataContext.create(project_root_dir='./')

**Explanation**  

Code above creates a Great Expectations data context with a file-based configuration in the specified project directory.

# **2. Connect to a Data Source**

In [2]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-e_commerce'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'e_commerce'
path_to_data = 'e_commerce_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

**Explanation**  

The code above defines a pandas data source named 'csv-e_commerce', adds a CSV data asset named 'e_commerce' that reads the file 'e_commerce_data_clean.csv', and builds a batch request for that asset.

# **3. Create an Expectation Suite**

In [3]:
# Create an expectation suite
expectation_suite_name = 'expectation-e-commerce-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Transaction_Status,Year,Quarter,Continent,Sales,ProductCategory,ProductType
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,Completed,2010,Q4,Europe,15.3,T-LIGHT,HOME DECORATION
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,Completed,2010,Q4,Europe,20.34,LANTERN,HOME DECORATION
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,Completed,2010,Q4,Europe,22.0,COAT HANGER,UTILITY
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,Completed,2010,Q4,Europe,20.34,BOTTLE,KITCHENWARE
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,Completed,2010,Q4,Europe,20.34,HOTTIE,FASHION


**Explanation**  

The code above creates an expectation suite named 'expectation-e-commerce-dataset' (adding it to the data context, or updating it if it already exists), creates a validator that ties this suite to the batch request, and displays the first five rows of the batch.

The `Calculating Metrics` line is the progress bar GX prints while it computes the metrics needed for `validator.head()`; the captured text shows the bar in its initial state (0 of 1). The table below it is the first five rows of the dataset.

# **4. List of Expectations**

### **Expectation 1** - column `Sales` must be higher than `UnitPrice`

In [4]:
# Expectation 1 - Column `Sales` values must be greater than `UnitPrice`
validator.expect_column_pair_values_A_to_be_greater_than_B(
    column_A='Sales', 
    column_B='UnitPrice'
)


Calculating Metrics:   0%|          | 0/7 [00:00<?, ?it/s]

{
  "success": false,
  "result": {
    "element_count": 531129,
    "unexpected_count": 151264,
    "unexpected_percent": 28.47971020222959,
    "partial_unexpected_list": [
      [
        1.25,
        1.25
      ],
      [
        0.85,
        0.85
      ],
      [
        2.55,
        2.55
      ],
      [
        1.95,
        1.95
      ],
      [
        2.95,
        2.95
      ],
      [
        2.95,
        2.95
      ],
      [
        2.95,
        2.95
      ],
      [
        0.85,
        0.85
      ],
      [
        0.85,
        0.85
      ],
      [
        1.45,
        1.45
      ],
      [
        4.95,
        4.95
      ],
      [
        2.95,
        2.95
      ],
      [
        1.95,
        1.95
      ],
      [
        -4.65,
        4.65
      ],
      [
        19.95,
        19.95
      ],
      [
        -19.8,
        1.65
      ],
      [
        -6.959999999999999,
        0.29
      ],
      [
        -6.959999999999999,
        0.29
      ],
 

**Explanation**  

In this dataset, 'Sales' is the total revenue of each line item and 'UnitPrice' is the price of a single unit of the product. The expectation requires 'Sales' to be strictly greater than 'UnitPrice'. The check fails: 151,264 of 531,129 rows (28.48%) violate it. The sample of unexpected pairs shows two patterns: rows where 'Sales' equals 'UnitPrice' (for example 1.25 and 1.25) and rows where 'Sales' is negative (for example -4.65 against 4.65). These rows need to be reviewed before 'Sales' is used for revenue or profitability analysis.
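The two failure patterns can be separated with a small plain-Python sketch of the row-level comparison. It is an illustration only, not GX's implementation: `failing_rows` is a hypothetical helper, and the pairs are copied from the `partial_unexpected_list` above, assuming each pair is ordered as (`Sales`, `UnitPrice`). The `or_equal` flag mirrors the idea of allowing ties; whether ties should be allowed here is a judgment call about the data, not something the notebook states.

```python
# (Sales, UnitPrice) pairs copied from the partial_unexpected_list above
pairs = [(1.25, 1.25), (0.85, 0.85), (-4.65, 4.65), (-19.8, 1.65)]

def failing_rows(rows, or_equal=False):
    """Return the rows where Sales is not greater than UnitPrice.

    With or_equal=True, rows where Sales equals UnitPrice are accepted.
    """
    if or_equal:
        return [(sales, price) for sales, price in rows if not sales >= price]
    return [(sales, price) for sales, price in rows if not sales > price]

strict = failing_rows(pairs)                  # ties and negatives both fail
relaxed = failing_rows(pairs, or_equal=True)  # only negatives fail

print(len(strict), len(relaxed))
```

Under the strict comparison every sampled pair fails; once ties are accepted, only the negative 'Sales' rows remain. In the head of the data, 'Sales' matches 'Quantity' times 'UnitPrice', so ties are consistent with a quantity of one and negative values with negative quantities, but that reading is an assumption to verify against the data.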

### **Expectation 2** - `CustomerID` must be string datatype

In [5]:
# Expectation 2 - Column `CustomerID` must be string datatype
validator.expect_column_values_to_be_in_type_list(
    column='CustomerID', type_list=['object']
)


Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": false,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**  

The 'CustomerID' column is expected to hold categorical values because it serves as a foreign key in the main sales data and as the primary key in the customer data used in modeling. The check fails: GX reports the observed type as float64, so the column has to be converted to the object (string) type before it satisfies this expectation.
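One possible shape for that conversion is sketched below in plain Python. `customer_id_to_str` is a hypothetical helper, not part of the notebook or of GX; it assumes IDs such as 17850.0 are whole numbers stored as floats and that missing IDs should stay missing rather than become the string 'nan'. In pandas, the same idea would typically be applied to the whole column before the data is validated.

```python
import math

def customer_id_to_str(value):
    """Convert a float-typed customer ID (e.g. 17850.0) to a string ('17850').

    Missing values (None or NaN) are returned as None instead of 'nan'.
    """
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return str(int(value))

raw_ids = [17850.0, 17850.0, float("nan")]
clean_ids = [customer_id_to_str(v) for v in raw_ids]
print(clean_ids)
```

Dropping the trailing `.0` matters when the ID is used as a join key: '17850.0' and '17850' are different strings.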

### **Expectation 3** - `Transaction_Status` must contain only the two valid statuses

In [8]:
# Expectation 3 - Column `Transaction_Status` values must be in the status set (Completed & Cancelled)
validator.expect_column_values_to_be_in_set(
    column='Transaction_Status', value_set=['Completed', 'Cancelled']
)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 531129,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**Explanation**  

The 'Transaction_Status' column is expected to contain only the two statuses, Completed and Cancelled, since they describe the validity of a transaction. The check passes: none of the 531,129 rows holds a value outside this set, and no values are missing. Note that this expectation verifies that every value belongs to the set; it does not by itself confirm that both statuses actually occur in the data.
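The difference between "every value is in the set" and "every member of the set appears" can be sketched with plain Python sets. The helper names `all_in_set` and `covers_set` are hypothetical and only illustrate the logic; GX also ships expectations on a column's distinct values (such as `expect_column_distinct_values_to_contain_set`) that may be a closer fit if both statuses are required to be present.

```python
allowed = {"Completed", "Cancelled"}

def all_in_set(values, value_set):
    """True if every observed value belongs to value_set."""
    return set(values) <= set(value_set)

def covers_set(values, value_set):
    """True if every member of value_set is observed at least once."""
    return set(values) >= set(value_set)

only_completed = ["Completed", "Completed"]
both_statuses = ["Completed", "Cancelled", "Completed"]

# A column holding only 'Completed' passes the membership check
# but does not cover both statuses.
print(all_in_set(only_completed, allowed), covers_set(only_completed, allowed))
print(all_in_set(both_statuses, allowed), covers_set(both_statuses, allowed))
```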

# **5. Saving into Expectations Suite**  

In [9]:
# Save into Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

**Explanation**  

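The code above saves the expectations defined in this notebook to the expectation suite. Because `discard_failed_expectations=False`, the expectations that failed (Expectations 1 and 2) are kept along with the one that passed.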