=================================================
Flipkart Product Dataset : Analysis and Implementation Of Airflow

 
by __Angger Rizky Firdaus__  
This project is developed to accomplish milestone 3 of the FTDS Hacktiv8 program.  
This notebook contains the Great Expectations process used to validate, document, and profile the data workflow.  

=================================================

# Preparing Great Expectations Libs

In [1]:
# Install the Great Expectations library
!pip install -q great-expectations

The library has been successfully installed.

In [5]:
# Create a data context
from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

This code creates a file-based data context in the specified project directory. The context is the entry point that lets us use Great Expectations to validate, document, and profile the data in this project.

In [6]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'product-dataset'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'flipkart-table'
path_to_data = 'P2M3_Angger_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

This code establishes a datasource and defines a data asset within the Great Expectations context, facilitating subsequent validation and testing of the specified CSV data asset ('flipkart-table').

In [7]:
# Create an expectation suite
expectation_suite_name = 'expectation-flipkart-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,id,title,rating,main_category,platform,price,act_price,discount_percent,no_rating,no_reviews,5_star_rating,4_star_rating,3_star_rating,2_star_rating,1_star_rating,fulfilled
0,16695,Fashionable & Comfortable Bellies For Women (...,3.9,Women,Flipkart,698,999,30.13,38,7.0,17,9,6,3,3,0
1,5120,Combo Pack of 4 Casual Shoes Sneakers For Men ...,3.8,Men,Flipkart,999,1999,50.03,531,69.0,264,92,73,29,73,1
2,18391,Cilia Mode Leo Sneakers For Women (White),4.4,Women,Flipkart,2749,4999,45.01,17,4.0,11,3,2,1,0,1
3,495,Men Black Sports Sandal,4.2,Men,Flipkart,518,724,15.85,46413,6229.0,1045,12416,5352,701,4595,1
4,16408,Men Green Sports Sandal,3.9,Men,Flipkart,1379,2299,40.02,77,3.0,35,21,7,7,7,1


Now, we can utilize Great Expectations to validate our dataset.

# Expectations

## 1 : expect_column_values_to_be_unique

The function `expect_column_values_to_be_unique` is used to verify that no duplicate entries remain after the cleaning process. It is applied to the 'id' column, since each product should appear only once.

In [8]:
# Expectation 1  : expect_column_values_to_be_unique
validator.expect_column_values_to_be_unique('id')

{
  "success": true,
  "result": {
    "element_count": 14945,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The ID column has unique values for each entry, as evidenced by the successful validation process using the `expect_column_values_to_be_unique` function.
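To make the result fields above easier to read, here is a minimal plain-Python sketch of what a uniqueness check computes. It is not Great Expectations' own implementation: the function name `check_unique` is hypothetical, and it assumes that every occurrence of a repeated value is counted as unexpected. The ids come from the preview rows shown earlier, plus one made-up duplicate.

```python
from collections import Counter


def check_unique(values):
    """Plain-Python sketch of a uniqueness check (hypothetical helper)."""
    non_missing = [v for v in values if v is not None]
    counts = Counter(non_missing)
    # Assumption: every occurrence of a repeated value is flagged as unexpected.
    unexpected = [v for v in non_missing if counts[v] > 1]
    if non_missing:
        unexpected_percent = 100.0 * len(unexpected) / len(non_missing)
    else:
        unexpected_percent = 0.0
    return {
        "success": len(unexpected) == 0,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": unexpected_percent,
    }


clean_ids = [16695, 5120, 18391, 495, 16408]  # ids from the preview rows
dirty_ids = clean_ids + [495]                 # hypothetical duplicate, for illustration only

clean_result = check_unique(clean_ids)
dirty_result = check_unique(dirty_ids)
print(clean_result)
print(dirty_result)
```

With the duplicate added, `success` turns false and `unexpected_count` reports both rows that share the repeated id.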

## 2 : expect_column_min_to_be_between

The function `expect_column_min_to_be_between` checks whether the minimum value of a column falls within a predefined range. Here it is applied to the 'price' column with a range of 0 to 1000 to ensure that no price is below 0, since a negative price is not reasonable.

In [20]:
# Expectation 2  : expect_column_min_to_be_between
validator.expect_column_min_to_be_between('price',0,1000)

{
  "success": true,
  "result": {
    "observed_value": 69
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The 'price' column has a minimum value of 69, which is within the specified range of 0 to 1000, as validated successfully using the `expect_column_min_to_be_between` function.
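The logic of this check can be sketched in plain Python. This is an illustration rather than the library's code: `check_min_between` is a hypothetical helper, and the `strict_min` / `strict_max` flags are modeled on the parameters of the same name in Great Expectations, with inclusive bounds assumed as the default. The prices are taken from the preview rows, and the negative price is invented.

```python
def check_min_between(values, min_value, max_value, strict_min=False, strict_max=False):
    """Plain-Python sketch: is the column minimum inside [min_value, max_value]?"""
    observed = min(values)
    # Bounds are inclusive unless the corresponding strict flag is set.
    lower_ok = observed > min_value if strict_min else observed >= min_value
    upper_ok = observed < max_value if strict_max else observed <= max_value
    return {"success": lower_ok and upper_ok, "observed_value": observed}


prices = [698, 999, 2749, 518, 1379]   # prices from the preview rows
prices_with_negative = prices + [-5]   # hypothetical bad row, for illustration only

ok_result = check_min_between(prices, 0, 1000)
bad_result = check_min_between(prices_with_negative, 0, 1000)
print(ok_result)
print(bad_result)
```

Note that the upper bound matters too: with a range of 0 to 1000, the check would also fail if the cheapest product cost more than 1000.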

## 3 : expect_column_values_to_be_in_set

The function `expect_column_values_to_be_in_set` checks that the 'platform' column contains only the values 'Flipkart' and 'Amazon'. The dataset covers only these two platforms, so any other value would be anomalous and would warrant further investigation.

In [10]:
# Expectation 3  : expect_column_values_to_be_in_set
validator.expect_column_values_to_be_in_set(
        "platform",
        ['Flipkart','Amazon'])

{
  "success": true,
  "result": {
    "element_count": 14945,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

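Every value in the 'platform' column belongs to the set {'Flipkart', 'Amazon'}, as validated successfully using the `expect_column_values_to_be_in_set` function. Note that this confirms no other value is present; it does not by itself confirm that both platforms appear in the data.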
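A minimal plain-Python sketch of a set-membership check is shown below. It is not the library's implementation: `check_values_in_set` is a hypothetical helper, exact (case-sensitive) matching is an assumption, and the values 'flipkart' and 'Myntra' are invented to show what a failure looks like.

```python
def check_values_in_set(values, value_set):
    """Plain-Python sketch: flag every non-missing value outside value_set."""
    allowed = set(value_set)
    non_missing = [v for v in values if v is not None]
    # Assumption: matching is exact, so 'flipkart' differs from 'Flipkart'.
    unexpected = [v for v in non_missing if v not in allowed]
    return {
        "success": not unexpected,
        "unexpected_count": len(unexpected),
        "partial_unexpected_list": unexpected[:20],
    }


allowed_platforms = ["Flipkart", "Amazon"]

mixed = ["Flipkart", "Amazon", "Flipkart", "flipkart", "Myntra"]  # last two are made up
single_platform = ["Flipkart", "Flipkart", "Flipkart"]

mixed_result = check_values_in_set(mixed, allowed_platforms)
single_result = check_values_in_set(single_platform, allowed_platforms)
print(mixed_result)
print(single_result)
```

The second case passes even though 'Amazon' never appears, which is why a passing result only shows that nothing outside the set is present.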

## 4 : expect_column_values_to_be_in_type_list

The function `expect_column_values_to_be_in_type_list` is employed to verify whether the 'act_price' column consists of numerical data types such as integers or floats. If there are any data points outside of these numerical data types, such as objects, the process will fail.

In [11]:
# Expectation 4  : expect_column_values_to_be_in_type_list
validator.expect_column_values_to_be_in_type_list(
        "act_price",
        ['int64','float'])

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The 'act_price' column has the data type int64, which is in the allowed list of numeric types, as confirmed by the successful execution of the `expect_column_values_to_be_in_type_list` function.
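The sketch below illustrates why a single non-numeric entry is enough to fail a type check on a whole column. It is a simplified stand-in written with the standard library only, not how pandas or Great Expectations actually infer types: the helpers `infer_column_type` and `column_type_from_csv` are hypothetical, the type names 'int64', 'float64', and 'object' are used as labels only, and the value 'Rs. 1999' is invented.

```python
import csv
import io


def infer_column_type(raw_values):
    """Label a list of raw CSV strings as 'int64', 'float64', or 'object'."""
    def all_parse(caster):
        try:
            for value in raw_values:
                caster(value)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "int64"
    if all_parse(float):
        return "float64"
    # At least one value is not numeric, so the whole column is treated as text.
    return "object"


def column_type_from_csv(csv_text, column):
    rows = csv.DictReader(io.StringIO(csv_text))
    return infer_column_type([row[column] for row in rows])


allowed_types = ["int64", "float64"]

clean_csv = "act_price\n999\n1999\n4999\n"
dirty_csv = "act_price\n999\nRs. 1999\n4999\n"  # hypothetical malformed entry

clean_type = column_type_from_csv(clean_csv, "act_price")
dirty_type = column_type_from_csv(dirty_csv, "act_price")
print(clean_type, clean_type in allowed_types)
print(dirty_type, dirty_type in allowed_types)
```

Because the type is a property of the column as a whole, the check reports one `observed_value` rather than a count of unexpected rows.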

## 5 : expect_column_values_to_be_of_type

The function `expect_column_values_to_be_of_type` is used to verify that the 'price' column has the integer data type (int64). If the column were stored as another type, such as object, the expectation would fail. A price column should not have object as its data type.

In [19]:
# Expectation 5  : expect_column_values_to_be_of_type
validator.expect_column_values_to_be_of_type(
        "price",'int64')

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The 'price' column has the integer data type (int64), as evidenced by the successful execution of the `expect_column_values_to_be_of_type` function.

## 6 : expect_column_max_to_be_between

The function `expect_column_max_to_be_between` is utilized to check whether there are any values in the 'discount_percent' column that exceed 100%, as discounts should not be below 0% or above 100%.

In [13]:
# Expectation 6 : expect_column_max_to_be_between
validator.expect_column_max_to_be_between('discount_percent', 0, 100)

{
  "success": true,
  "result": {
    "observed_value": 88.93
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The maximum value in the 'discount_percent' column is 88.93, which is within the range of 0 to 100, as validated successfully using the `expect_column_max_to_be_between` function.
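One point worth illustrating is that a check on the column maximum only constrains the largest value. The plain-Python sketch below contrasts it with a row-level range check, which is what Great Expectations offers as `expect_column_values_to_be_between`. Both helper functions are hypothetical stand-ins, the discounts are taken from the preview rows, and the negative discount is invented.

```python
def check_max_between(values, min_value, max_value):
    """Sketch: only the largest value is compared against the range."""
    observed = max(values)
    return {"success": min_value <= observed <= max_value, "observed_value": observed}


def check_values_between(values, min_value, max_value):
    """Sketch: every value is compared against the range."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


discounts = [30.13, 50.03, 45.01, 15.85, 40.02]  # discounts from the preview rows
with_negative = discounts + [-10.0]              # hypothetical bad row, for illustration only

max_result = check_max_between(with_negative, 0, 100)
row_result = check_values_between(with_negative, 0, 100)
print(max_result)
print(row_result)
```

The maximum-based check still passes with the negative discount present, so a row-level check may be a useful complement when both bounds matter.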

## 7 : expect_column_stdev_to_be_between

The function `expect_column_stdev_to_be_between` is used to check whether the standard deviation of the 'rating' column lies between 0 and 0.5. Ratings are expected to fall between 1 and 5, so values far outside that range would inflate the standard deviation. The range of 0 to 0.5 therefore acts as a guard on the spread of the ratings: if the data contains large anomalies, the expectation will fail.

In [12]:
# Expectation 7 : expect_column_stdev_to_be_between
validator.expect_column_stdev_to_be_between('rating',0,0.5)

{
  "success": true,
  "result": {
    "observed_value": 0.2987805469746736
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The 'rating' column has a standard deviation within the range of 0 to 0.5, as validated successfully using the `expect_column_stdev_to_be_between` function.
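As a rough illustration of how sensitive this check is, the sketch below computes a standard deviation with Python's `statistics.stdev` (the sample standard deviation, which is assumed here to be comparable to what the validator reports). The helper `check_stdev_between` is hypothetical, the first five ratings come from the preview rows, and the out-of-range values are invented.

```python
import statistics


def check_stdev_between(values, min_value, max_value):
    """Sketch: compare the sample standard deviation against a range."""
    observed = statistics.stdev(values)
    return {"success": min_value <= observed <= max_value, "observed_value": observed}


ratings = [3.9, 3.8, 4.4, 4.2, 3.9]  # ratings from the preview rows
small_with_outlier = ratings + [45.0]  # hypothetical typo for 4.5

# A larger hypothetical sample with a single out-of-range rating of 9.0.
large_with_outlier = [3.9] * 500 + [4.1] * 500 + [9.0]

small_ok = check_stdev_between(ratings, 0, 0.5)
small_bad = check_stdev_between(small_with_outlier, 0, 0.5)
large_result = check_stdev_between(large_with_outlier, 0, 0.5)
print(small_ok)
print(small_bad)
print(large_result)
```

In the small sample the outlier pushes the standard deviation well past 0.5, but in the larger sample a single out-of-range rating barely moves it and the check still passes. The standard deviation is therefore a coarse guard, and a row-level range check on 'rating' may catch isolated bad values more reliably.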