Output is cleared due to confidentiality of the information.

# Data Quality Checks

Completeness – the percentage of data fields that actually contain a value. Critical fields (such as customer names, phone numbers, and email addresses) should be filled in first, since gaps in non-critical fields matter far less. Incomplete data is as dangerous as inaccurate data: gaps in data collection give only a partial view of how operations are running, and decisions made on that partial view are uninformed. To judge completeness, first define the full set of fields that a comprehensive dataset requires, then check whether those requirements are being fulfilled.
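
A minimal sketch of how completeness could be measured with pandas. The sample table, its column names, and the choice of which columns count as critical are assumptions made for illustration; they are not taken from the dataset analysed below.

```python
import pandas as pd

# Hypothetical sample; None marks a missing value.
sample = pd.DataFrame({
    "customer_name": ["Ann", "Bob", None, "Dee"],
    "email": ["ann@example.com", None, None, "dee@example.com"],
    "note": [None, None, None, "call back"],
})

# Share of non-missing values per column, expressed as a percentage.
completeness = sample.notna().mean() * 100

# Columns assumed to be critical for this illustration.
critical = ["customer_name", "email"]
print(completeness[critical])
```

Reporting the critical columns separately keeps a mostly empty optional column (here `note`) from dominating the picture.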

Granularity and Uniqueness – The level of detail at which data is collected matters, because the wrong level leads to confusion and inaccurate decisions. Aggregated, summarized, or otherwise manipulated data can carry a different meaning than the underlying lower-level data. An appropriate level of granularity must be defined so that each record is sufficiently unique and its distinctive properties remain visible. This is a requirement for operations to function effectively.
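
One way to test uniqueness at a chosen level of granularity is to check whether a key identifies each row exactly once. This is only a sketch: the table and the key `(order_id, product)` are hypothetical.

```python
import pandas as pd

# Hypothetical order lines; the intended grain is one row per (order_id, product).
lines = pd.DataFrame({
    "order_id": [1, 1, 2, 2],
    "product": ["cream", "serum", "cream", "cream"],
    "quantity": [1, 2, 1, 1],
})

key = ["order_id", "product"]

# keep=False flags every row that shares its key with another row,
# not only the second and later occurrences.
violations = lines[lines.duplicated(subset=key, keep=False)]
print(violations)
```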

Timeliness – How much of an impact do date and time have on the data? This applies to previous sales, product launches, or any information that must remain accurate over a period of time. Data collected too soon or too late could misrepresent a situation and drive inaccurate decisions.
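
A simple timeliness check is to flag records whose timestamp falls outside the period they are supposed to describe. The sketch below assumes a hypothetical `recorded_at` column and a reporting window chosen purely for illustration.

```python
import pandas as pd

# Hypothetical timestamps and reporting window.
events = pd.DataFrame({"recorded_at": pd.to_datetime([
    "2019-08-31 23:59",
    "2019-09-01 00:00",
    "2020-10-31 12:00",
    "2020-11-02 08:00",
])})
window_start = pd.Timestamp("2019-09-01")
window_end = pd.Timestamp("2020-11-01")

# between() includes both endpoints by default.
events["in_window"] = events["recorded_at"].between(window_start, window_end)
print(events)
```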

Validity – Does the data conform to the standards set for it? The requirements governing the data set the boundaries of this characteristic. For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to a set of options, and open answers are not permitted. Any answer outside those options is not valid under the survey’s requirements. The same holds for most data and must be carefully considered when determining its quality. The people in each department of an organization know which data is valid for their purposes, so their requirements should be used when evaluating data quality.
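
A validity rule of this kind can be sketched as a membership test against the permitted options. The option set and the answers below are made up for illustration.

```python
import pandas as pd

# Hypothetical set of permitted survey answers.
allowed = {"yes", "no", "prefer not to say"}

answers = pd.Series(["yes", "no", "maybe", None, "Yes"])

# isin() is an exact, case-sensitive match; missing values are never members.
is_valid = answers.isin(allowed)
invalid = answers[~is_valid]
print(invalid)
```

Note that `"Yes"` is rejected here because of its capitalisation; whether such values should be normalised first is a decision for whoever owns the requirement.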

Accuracy – How well does the data reflect the real-world person or thing that is identified by it?  It cannot have any erroneous elements and must convey the correct message without being misleading. 

Consistency – How well does the data follow a single agreed pattern? Dates are a common source of inconsistency: in the U.S. the standard format is MM/DD/YYYY, whereas Europe and other regions use DD/MM/YYYY.
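
The date-format problem is easy to reproduce: the same string parses to two different dates depending on the convention assumed. The value below is an arbitrary example.

```python
from datetime import datetime

raw = "03/04/2020"

as_us = datetime.strptime(raw, "%m/%d/%Y")  # read as March 4, 2020
as_eu = datetime.strptime(raw, "%d/%m/%Y")  # read as April 3, 2020
print(as_us.date(), as_eu.date())

# Storing dates in ISO 8601 form (YYYY-MM-DD) removes the ambiguity.
iso = as_us.date().isoformat()
print(iso)
```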

## Determine path

In [None]:
path_ipynb = r'C:\Users\luc57.DESKTOP-NB5DC80\AE\ipynb\\'
path_excel = r'C:\Users\luc57.DESKTOP-NB5DC80\AE\excel\\'
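
As an alternative to building paths by string concatenation, `pathlib` can join folder and file names without trailing separators. This is a sketch only; the base folder and file name below are hypothetical, and `PureWindowsPath` is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical base folder; no trailing backslash is needed.
base = PureWindowsPath(r"C:\data\AE")

excel_dir = base / "excel"
csv_path = excel_dir / "discounts.csv"
print(csv_path)
```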

## Import required libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

## Read the dataframe

In [None]:
#change the current path
#os.chdir()

In [None]:
#list every file in the current directory
os.listdir()

In [None]:
df = pd.read_csv(path_excel+'discounts_2019-09-01_2020-11-01.csv')
df.head(2)

## Data types

In [None]:
df.dtypes

In [None]:
df['total_returns'].value_counts()
# total returns is in Euro

Variables _hour_, _day_, _month_, _quarter_, _year_, and _day_of_week_ all have the wrong datatypes.

Except for the variables above, the others seem fine.

## Convert Datatypes

### Convert Datetime Properties

In [None]:
df[['hour','hour_of_day','month','month_of_year','quarter','year','day_of_week']]

In [None]:
#convert the hour variable from text to datetime
df['hour'] = pd.to_datetime(df['hour'])
df['hour']

In [None]:
#keep the full timestamp (date and time) in a separate column
df['date']=df['hour']
df['date']

In [None]:
#keep only the time of day in hour (as text)
df['hour'] = df['hour'].dt.strftime('%H:%M:%S')
df['hour']
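
The conversion steps above can also be sketched with the `.dt` accessor, which derives several date parts from a single parsed timestamp. The two sample values and the derived column names `time` and `day_name` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical timestamps stored as text, as they would be after read_csv.
demo = pd.DataFrame({"hour": ["2019-09-01 13:00:00", "2019-09-02 08:30:00"]})

demo["hour"] = pd.to_datetime(demo["hour"])           # text -> datetime64
demo["date"] = demo["hour"].dt.date                   # calendar date only
demo["time"] = demo["hour"].dt.strftime("%H:%M:%S")   # time of day as text
demo["day_name"] = demo["hour"].dt.day_name()         # weekday name
print(demo.dtypes)
```

Parsing once and deriving the other fields from the parsed column avoids having several independently typed date columns that can drift apart.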

## Value Distribution

In [None]:
df.describe()

Columns _marketing_event_target_, _marketing_event_type_, and _automatic_discount_title_ seem to contain many null values.

## Missing data?

In [None]:
df.isnull().sum()

There are 4535 rows. The variables _marketing_event_target_, _marketing_event_type_, and _automatic_discount_title_ are mostly null and should be deleted.
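
Rather than reading the null counts by eye, the share of missing values per column can be compared against a cutoff. This is a sketch on made-up data; the 50% threshold is an assumption, not a rule used in this analysis.

```python
import pandas as pd

# Hypothetical frame with one mostly empty column.
demo = pd.DataFrame({
    "orders": [1, 2, 3, 4],
    "marketing_event_type": [None, None, None, "ad"],
    "email": ["a@x.com", None, "c@x.com", "d@x.com"],
})

null_share = demo.isna().mean()   # fraction of missing values per column
threshold = 0.5                   # assumed cutoff for "mostly null"

mostly_null = null_share[null_share > threshold].index.tolist()
cleaned = demo.drop(columns=mostly_null)
print(mostly_null)
```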

## Duplicate rows?

In [None]:
# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df[df.duplicated()]
print("Duplicate Rows except first occurrence based on all columns are :")
print(duplicateRowsDF)

## Drop useless columns

In [None]:
df = df.drop(columns=['product_vendor','shipping_title','total_shipping_script_discount','total_returns',
                      'marketing_event_target','marketing_event_type','automatic_discount_title','total_line_item_script_discount'])

## Check other columns that also have a lot of missing data

In [None]:
df[['email','product_title','variant_title','total_gross_sales','total_net_sales']]

In [None]:
df['product_title'].value_counts()

In [None]:
df['variant_title'].value_counts()

## Check only done transactions

In [None]:
#take a copy so that later column assignments do not act on a view of df
done_transactions = df.loc[df['total_gross_sales']!=0].copy()
len(done_transactions)

In [None]:
done_transactions.tail(5)

In [None]:
done_transactions.isna().sum()

In [None]:
done_transactions['email'] = done_transactions['email'].fillna('user preference')
done_transactions['email'].isna().sum()

In [None]:
done_transactions['email'].value_counts()

By now, the only column containing null values should be _variant_title_.

In [None]:
done_transactions.isna().sum()

Let us check what is wrong with those transactions.

In [None]:
done_transactions[done_transactions.isna().any(axis=1)]

220 of the done transactions have no variant title, but they do have product names, so we can keep them.

## Check the value_counts of each column

In [None]:
for column in done_transactions.columns:
    print("\n" + column)
    print(done_transactions[column].value_counts())

In [None]:
del done_transactions['discount_applied'] #every row has the same value

## Select only numeric columns

In [None]:
done_transactions.select_dtypes([np.number])

In [None]:
# listing dataframes types
list(set(done_transactions.dtypes.tolist()))
# include only float and integer
done_transactions_num = done_transactions.select_dtypes(include = ['float64', 'int64'])
# display what has been selected
done_transactions_num.head()
# plot
done_transactions_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

It seems that the columns _total_shipping_price_, _total_shipping_discount_, and _total_quantity_ each have only one value.

In [None]:
print(done_transactions['total_shipping_price'].value_counts())
print(done_transactions['total_shipping_discount'].value_counts())
print(done_transactions['total_quantity'].value_counts())

In [None]:
#drop the two columns as they don't provide unique information
done_transactions.drop(columns=['total_shipping_price','total_shipping_discount'],inplace=True)
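
Constant columns like these can also be found programmatically instead of from the histograms. The frame below is made up for illustration and only borrows the column names.

```python
import pandas as pd

# Hypothetical frame: two constant columns and one that varies.
demo = pd.DataFrame({
    "total_shipping_price": [0.0, 0.0, 0.0],
    "total_quantity": [1, 1, 1],
    "total_net_sales": [10.0, 25.5, 7.0],
})

# nunique() counts distinct values per column; dropna=False counts NaN as a value too.
distinct = demo.nunique(dropna=False)
constant_cols = distinct[distinct <= 1].index.tolist()
print(constant_cols)
```

A constant column is not automatically useless: it may still be needed downstream, so the list is a starting point for review rather than a drop list.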

## Column _name_ equals column _discount_code_?

In [None]:
done_transactions['name']==done_transactions['discount_code']

In [None]:
done_transactions[['name', 'discount_code']].assign(NE=done_transactions.name != done_transactions.discount_code)

In [None]:
#despite its name, NE is True where the two columns are equal
done_transactions['NE']=done_transactions['name']==done_transactions['discount_code']

In [None]:
#rows where the two columns differ
done_transactions.loc[~done_transactions['NE']]

All rows have identical values in both columns, so the column _name_ can be deleted.

In [None]:
done_transactions.drop(columns=['NE','name'],inplace=True)
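
A compact way to compare two columns is to build a boolean mask and filter with the mask itself rather than with a string such as `'False'`. This is a sketch on made-up values; the third row shows how missing values behave.

```python
import pandas as pd

nan = float("nan")

# Hypothetical values; row 1 differs in case, row 2 is missing in both columns.
demo = pd.DataFrame({
    "name": ["WELCOME10", "SUMMER", nan],
    "discount_code": ["WELCOME10", "summer", nan],
})

# Elementwise != treats NaN as unequal to everything, including another NaN.
mismatch = demo["name"] != demo["discount_code"]
print(demo[mismatch])

# Series.equals() treats NaN in the same position as equal and returns one bool.
identical = demo["name"].equals(demo["discount_code"])
print(identical)
```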

In [None]:
done_transactions

## Exclude tests

In [None]:
print('Number of real transactions= ',len(done_transactions.loc[done_transactions['email']!='test']))
done_transactions = done_transactions.loc[done_transactions['email']!='test']

In [None]:
done_transactions

In [None]:
done_transactions.columns

In [None]:
suspicious_of_testing=['customer_name','discount_code','email','orders']
for column in suspicious_of_testing:
    print(column)
    print(done_transactions[column].value_counts())
    print('')

In [None]:
done_transactions[done_transactions['discount_code'].str.match('test')]

In [None]:
done_transactions = done_transactions.drop([2750,4004], axis=0)

In [None]:
done_transactions = done_transactions[~done_transactions.discount_code.str.contains("TEST")]
len(done_transactions)
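
The two filters above behave differently: `str.match` only looks at the start of the string, and both are case-sensitive by default. The sketch below uses made-up codes to show the difference and one pitfall of a case-insensitive substring search.

```python
import pandas as pd

# Hypothetical discount codes, including a missing one.
codes = pd.Series(["TEST10", "test-abc", "WELCOME", "Latest", None])

# match() is anchored at the start and case-sensitive by default.
starts_with_test = codes.str.match("test", na=False)

# contains() searches anywhere; case=False ignores case, na=False treats missing as no match.
contains_test = codes.str.contains("test", case=False, na=False)

print(pd.DataFrame({"code": codes,
                    "match": starts_with_test,
                    "contains": contains_test}))
```

The code `"Latest"` is caught by the substring search even though it is not a test code, so the matched rows are worth inspecting before they are dropped.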

## Why are there rows with sales but orders equal 0?

In [None]:
done_transactions.loc[done_transactions['orders']==0]

## Deselect rows where _total_quantity_returns_ equals -1

In [None]:
done_transactions = done_transactions.loc[done_transactions['total_quantity_returns']!=-1]
len(done_transactions)
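
When negating a comparison, note that `~` binds more tightly than `==`, so `~s == -1` is evaluated as `(~s) == -1`. On an integer column `~x` is the bitwise complement `-x - 1`, which changes which rows are selected. A small sketch on made-up values:

```python
import pandas as pd

returns = pd.Series([0, -1, 2])

# Evaluated as (~returns) == -1, which is True only where the value is 0.
unintended = (~returns == -1)

# Either of these keeps the rows whose value is not -1.
intended = returns != -1
also_intended = ~(returns == -1)

print(unintended.tolist(), intended.tolist())
```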

In [None]:
done_transactions = done_transactions.drop(columns='total_quantity_returns')
len(done_transactions.columns)

## Fix the column Variant Title

In [None]:
done_transactions.loc[(done_transactions['product_title']=='Day Cream')]

In [None]:
done_transactions.loc[done_transactions['variant_title']=='20ml']

In [None]:
done_transactions.to_excel(path_excel+'done_transactions.xlsx')

In [None]:
done_transactions[['discount_code','orders','total_quantity']]