# A module to check data quality and handle incompleteness with machine learning

The aim of this project is to build a data quality analyser. For any data set, the module provides:
- incompleteness checker: a list of the lines in the data set that contain missing values (incompleteness)
- duplication checker: a list of the lines that are duplicated
- accuracy: an overall quality score for the data set
- correction: a corrected data set in which the missing values are filled in

We use a random forest model, a machine learning algorithm, to fill in the missing values.

The module contains a function called **duplication_incompleteness_checker** that returns the rate of incompleteness, the rate of duplication, the list of incomplete lines (if any) and the list of duplicated lines (if any). It also produces two graphs. The left graph shows the distribution of incompleteness, duplication and accuracy for the input data set; the right graph shows the n columns with the highest percentage of missing values. The number of columns n is passed as a parameter (n=5 by default).
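To make the checks described above concrete, here is a minimal pandas sketch of how duplicated lines, incomplete lines and the n columns with the most missing values could be computed. The helper name `check_quality` and the sample data are hypothetical, and this is not the module's actual implementation; in particular, the sketch returns pandas index labels, whereas the module's line numbering may differ.

```python
import pandas as pd


def check_quality(df: pd.DataFrame, n: int = 5) -> dict:
    """Hypothetical helper (an assumption, not the module's real code)."""
    # Rows that repeat an earlier row; the first occurrence is not flagged.
    duplicated_mask = df.duplicated(keep="first")
    # Rows with at least one missing value.
    incomplete_mask = df.isna().any(axis=1)

    total = len(df)
    return {
        "duplication_rate": 100.0 * duplicated_mask.sum() / total if total else 0.0,
        "incompleteness_rate": 100.0 * incomplete_mask.sum() / total if total else 0.0,
        "duplicated_rows": df.index[duplicated_mask].tolist(),
        "incomplete_rows": df.index[incomplete_mask].tolist(),
        # The n columns with the highest percentage of missing values.
        "worst_columns": (100.0 * df.isna().mean()).nlargest(n),
    }


# Small made-up data set: row 1 duplicates row 0; rows 2 and 3 have gaps.
sample = pd.DataFrame(
    {
        "a": [1, 1, 2, None],
        "b": ["x", "x", "y", "z"],
        "c": [5.0, 5.0, None, 4.0],
    }
)
report = check_quality(sample, n=2)
print(report["duplicated_rows"], report["incomplete_rows"])
print(report["worst_columns"])
```

Returning the row lists together with the rates mirrors the kind of report described above, so a caller can both summarise the data set and locate the problem lines.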

In [3]:
import data_quality_analyzer # Import our library
import time # Used to measure the execution time
import pandas as pd 

#### Testing the incompleteness, duplication and evaluating the quality of the data

In [10]:
# Test with a CSV file
init_time = time.time()
data_quality_analyzer.data_quality_analyzer('county_statistics.csv').duplication_incompleteness_checker()
print("execution time", time.time()-init_time)

No duplication in the data
The race of duplication is 100.00%
--------------------------------------------------------
There is incompleteness at lines:[36, 65, 85, 122, 168, 193, 269, 271, 301, 343, 452, 477, 478, 599, 629, 643, 712, 793, 818, 872, 891, 897, 972, 975, 997, 1162, 1165, 1185, 1211, 1281, 1471, 1476, 1528, 1562, 1637, 1719, 1762, 1763, 1812, 1895, 1896, 2051, 2071, 2079, 2130, 2218, 2254, 2291, 2295, 2349, 2402, 2415, 2430, 2463, 2467, 2608, 2641, 2659, 2688, 2690, 2874, 2946, 2978, 3037, 3045, 3103, 3110, 3113, 3114, 3115, 3116, 3117, 3118, 3119, 3120, 3121, 3122, 3123, 3124, 3125, 3126, 3127, 3128, 3129, 3130, 3131, 3132, 3133, 3134, 3135, 3136, 3137, 3138, 3139, 3140, 3141, 3142, 3143, 3144, 3145, 3146, 3147, 3148, 3149, 3150, 3151, 3152, 3153, 3154, 3155, 3156, 3157, 3158, 3159, 3160, 3161, 3162, 3163, 3164, 3165, 3166, 3167, 3168, 3169, 3170, 3171, 3172, 3173, 3174, 3175, 3176, 3177, 3178, 3179, 3180, 3181, 3182, 3183, 3184, 3185, 3186, 3187, 3188, 3189, 3190, 3191,

execution time 0.26020240783691406


In [11]:
# Test with an Excel file
data_quality_analyzer.data_quality_analyzer('Manchester United 2018-19.xlsx').duplication_incompleteness_checker(7)

No duplication in the data
The race of duplication is 100.00%
--------------------------------------------------------
There is incompleteness at lines:[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
The race of completeness is 0.00%
--------------------------------------------------------
Accuracy of the data: 0.00% 
--------------------------------------------------------


The result above shows that the data set is evaluated as very poor, since its accuracy is 0%. This is because the rate of incompleteness is 100%.

In [12]:
init_time = time.time()
data_quality_analyzer.data_quality_analyzer('weatherAUS.csv').duplication_incompleteness_checker(7)
print(f'execution time {time.time()-init_time} seconds')


No duplication in the data
The race of duplication is 100.00%
--------------------------------------------------------
There is incompleteness at lines:[3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 1

execution time 3.435734748840332 seconds


The result above shows that the dataset contains 61.21% missing values and no duplication (no duplicated lines), and that the quality of the data is evaluated at 38.79%.

In [8]:
init_time = time.time()
data_quality_analyzer.data_quality_analyzer('Manchester United 2018-2019.xlsx').duplication_incompleteness_checker()
print(f"execution time {time.time()-init_time} seconds")

integrityd lines are: [7, 14, 17, 18, 19, 20, 21]
The race of duplication is 65.000000.%
--------------------------------------------------------
There is incompleteness at lines:[8, 14]
The race of completeness is 84.62%
--------------------------------------------------------
Accuracy of the data: 49.62% 
--------------------------------------------------------


execution time 0.09849143028259277 seconds


The dataset above contains duplicated lines (35%) and incomplete rows (15.38%), and its quality is evaluated at 49.62%.

In [7]:
# Optionally, pass -5 as a parameter to visualize only the first 5 rows
data_quality_analyzer.data_quality_analyzer('Manchester United 2018-2019.xlsx').duplication_incompleteness_checker()

integrityd lines are: [7, 14, 17, 18, 19, 20, 21]
The race of duplication is 65.000000.%
--------------------------------------------------------
There is incompleteness at lines:[8, 14]
The race of completeness is 84.62%
--------------------------------------------------------
Accuracy of the data: 49.62% 
--------------------------------------------------------


In [13]:
data_quality_analyzer.data_quality_analyzer('epl2020.csv').duplication_incompleteness_checker()

integrityd lines are: [578, 579, 580, 581, 582, 583, 584]
The race of duplication is 98.799314.%
--------------------------------------------------------
No imcompleteness in the data
The race of completeness is 100.00%
--------------------------------------------------------
Accuracy of the data: 98.80% 
--------------------------------------------------------


The result above shows that the dataset contains no missing values and only a few duplicated rows (1.2%); thus, the quality of the data is evaluated at 98.8%.
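The figures printed in the runs above suggest how the accuracy relates to the other two numbers. The value labelled "race of duplication" appears to be the share of non-duplicated lines (100% when there is no duplication), and the accuracy matches that share plus the completeness rate minus 100. The sketch below checks this against the reported outputs; the function `quality_score` is a hypothetical reconstruction inferred from those numbers, not taken from the module's source.

```python
import math


def quality_score(unique_rate: float, completeness_rate: float) -> float:
    """Hypothetical reconstruction inferred from the printed outputs."""
    # Equivalent to 100 - duplicated share - incomplete share.
    return unique_rate + completeness_rate - 100.0


# (printed duplication figure, printed completeness, printed accuracy)
reported = [
    (100.00, 0.00, 0.00),        # Manchester United 2018-19.xlsx
    (65.00, 84.62, 49.62),       # Manchester United 2018-2019.xlsx
    (98.799314, 100.00, 98.80),  # epl2020.csv
]

for unique_rate, completeness_rate, accuracy in reported:
    score = quality_score(unique_rate, completeness_rate)
    # The outputs are rounded to two decimals, so compare with a tolerance.
    matches = math.isclose(score, accuracy, abs_tol=0.005)
    print(f"{score:.2f} vs reported {accuracy:.2f}: {matches}")
```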

In [14]:
data_quality_analyzer.data_quality_analyzer('Manchester United 2018-119.xlsx').duplication_incompleteness_checker()

integrityd lines are: [8, 18, 20]
The race of duplication is 84.210526.%
--------------------------------------------------------
There is incompleteness at lines:[2, 14, 17]
The race of completeness is 81.25%
--------------------------------------------------------
Accuracy of the data: 65.46% 
--------------------------------------------------------


In [10]:
# If the input is not a string or not a csv/Excel file
data_quality_analyzer.data_quality_analyzer('Manchester United').duplication_incompleteness_checker()

Exception: The input should be a csv or excel file
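The check behind this error could look roughly like the sketch below, which dispatches on the file extension before reading the file. The function name `load_dataset` and the accepted extensions are assumptions made for illustration; the module's real validation may differ.

```python
from pathlib import Path

import pandas as pd

ERROR_MESSAGE = "The input should be a csv or excel file"


def load_dataset(path):
    """Hypothetical loader (an assumption, not the module's real code)."""
    # Reject anything that is not a string path.
    if not isinstance(path, str):
        raise Exception(ERROR_MESSAGE)
    suffix = Path(path).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".xlsx", ".xls"}:
        return pd.read_excel(path)
    # No recognised extension: refuse instead of guessing the format.
    raise Exception(ERROR_MESSAGE)


try:
    load_dataset("Manchester United")
except Exception as error:
    print(f"Exception: {error}")
```

Validating the input before any parsing keeps the failure message short and specific, rather than surfacing a lower-level pandas error.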

In [None]:
import package # Loading the package

In [14]:
# Fill in (correct) the missing values
result = data_quality_analyzer.data_quality_analyzer('Primary_Data_21.xlsx').fill_missing_values()
result.isna().sum()

Unnamed: 0                   0
S/N                          0
Torrefaction Temperature     0
Residence Time               0
Initial Weight               0
Weight After Torrefaction    0
Weight Loss (%)              0
%Weight Retained             0
%Volatile Matter             0
%Ash content                 0
%Fixed Carbon                0
Calorific Value (kJ/kg)      0
%Moisture Content            0
dtype: int64
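As an illustration of how a random forest can fill in missing values, the sketch below predicts each incomplete column from the other columns using scikit-learn's `RandomForestRegressor`. It is a simplified stand-in under stated assumptions (numeric columns only, at least two columns, a temporary median fill for the predictors); the function `fill_with_random_forest` and the synthetic data are hypothetical and do not reproduce the internals of `fill_missing_values`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def fill_with_random_forest(df: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """Hypothetical sketch for numeric columns (not the module's real code)."""
    filled = df.copy()
    # Temporary median fill so every column can be used as a predictor.
    features = df.fillna(df.median())
    for column in df.columns:
        missing = df[column].isna()
        # Skip columns with nothing to fill or nothing to learn from.
        if not missing.any() or missing.all():
            continue
        predictors = features.drop(columns=column)
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        # Train on rows where the target is known, then predict the gaps.
        model.fit(predictors[~missing], df.loc[~missing, column])
        filled.loc[missing, column] = model.predict(predictors[missing])
    return filled


# Synthetic data with a few missing entries (made up for this example).
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=40)})
data["y"] = 2.0 * data["x"] + rng.normal(scale=0.1, size=40)
data.loc[[3, 17], "y"] = np.nan
data.loc[[8], "x"] = np.nan

result = fill_with_random_forest(data)
print(result.isna().sum())
```

Categorical columns would need a classifier (for example `RandomForestClassifier`) and an encoding step, which this sketch leaves out.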