<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Integrity" data-toc-modified-id="Data-Integrity-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Integrity</a></span><ul class="toc-item"><li><span><a href="#Columns-Info" data-toc-modified-id="Columns-Info-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Columns Info</a></span></li><li><span><a href="#Read-my-personal-data" data-toc-modified-id="Read-my-personal-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Read my personal data</a></span></li><li><span><a href="#Conflicting-Labels" data-toc-modified-id="Conflicting-Labels-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Conflicting Labels</a></span></li></ul></li></ul></div>

# Data Integrity

[Source](https://docs.deepchecks.com/en/stable/checks_gallery/tabular.html)

- Columns Info
- Conflicting Labels
- Data Duplicates
- Feature Label Correlation
- Is Single Value
- Mixed Data Types
- Mixed Nulls
- Outlier Sample Detection
- Special Characters
- String Length Out Of Bounds
- String Mismatch




In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


import warnings
warnings.filterwarnings("ignore")
seed = 12345

In [1]:
# !pip install deepchecks --upgrade

## Columns Info


In [3]:
import numpy as np
import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ColumnsInfo

# Generating data (seeded with the `seed` defined above so reruns give the same values)
np.random.seed(seed)
num_fe = np.random.rand(500)
cat_fe = np.random.randint(3, size=500)
num_col = np.random.rand(500)
date = range(1635693229, 1635693729)
index = range(500)
data = {'index': index, 'date': date, 'a': cat_fe, 'b': num_fe, 'c': num_col, 'label': cat_fe}
df = pd.DataFrame.from_dict(data)

dataset = Dataset(df, label='label', datetime_name='date', index_name='index', features=['a', 'b'], cat_features=['a'])

# Running columns_info check
check = ColumnsInfo()
check.run(dataset=dataset)

## Conflicting Labels


In [4]:
from deepchecks.tabular.checks import ConflictingLabels
from deepchecks.tabular.datasets.classification.phishing import load_data

# Load data
phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataset = Dataset(phishing_dataframe, label='target', features=['urlLength', 'numDigits', 'numParams', 
                                                                         'num_%20', 'num_@', 'bodyLength', 'numTitles', 
                                                                         'numImages', 'numLinks', 'specialChars'])

## Run the Check
ConflictingLabels().run(phishing_dataset)


In [None]:
# We can also check label ambiguity on a subset of the features:

ConflictingLabels(n_to_show=1).run(phishing_dataset)

In [None]:
ConflictingLabels(columns=['urlLength', 'numDigits']).run(phishing_dataset)

In [None]:
## Define a Condition
# Now we define a condition that enforces that the ratio of samples with conflicting labels is 0.
# A condition is deepchecks’ way to validate model and data quality, and it lets you know if anything goes wrong.

check = ConflictingLabels()
check.add_condition_ratio_of_conflicting_labels_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

ConflictingLabels
	Conditions:
		0: Ambiguous sample ratio is not greater than 0%
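
To make the "ratio of samples with conflicting labels" concrete, here is a minimal pandas sketch of one way such a ratio could be computed by hand. It is not deepchecks' implementation; the helper name `conflicting_label_ratio` and the toy values are assumptions made for illustration only.

```python
import pandas as pd


def conflicting_label_ratio(df: pd.DataFrame, feature_cols: list, label_col: str) -> float:
    """Fraction of rows whose feature values also appear with a different label.

    Hypothetical helper for illustration; not part of deepchecks.
    """
    # For every row, count how many distinct labels share its feature combination.
    labels_per_group = df.groupby(feature_cols)[label_col].transform("nunique")
    # A row is "conflicting" when its feature combination maps to more than one label.
    return float((labels_per_group > 1).mean())


# Toy data (made up): rows 0 and 1 have identical features but different labels.
toy = pd.DataFrame({
    "urlLength": [10, 10, 20, 20, 30],
    "numDigits": [1, 1, 2, 2, 3],
    "target": [0, 1, 0, 0, 1],
})

ratio = conflicting_label_ratio(toy, ["urlLength", "numDigits"], "target")
print(ratio)  # 2 of the 5 rows conflict
```

A condition with a threshold of 0 would fail on this toy frame, because two of its five rows share features but disagree on the label.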

## Data Duplicates

- Why data duplicates?

The DataDuplicates check finds multiple instances of identical samples in the Dataset. Duplicate samples increase the weight the model gives to those samples. If the duplicates are there intentionally (e.g. as a result of intentional oversampling, or because the dataset naturally contains identical-looking samples), this may be valid. However, if this is a hidden issue we are not expecting, it may indicate a problem in the data pipeline that requires attention.
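
As a rough illustration of what a duplicate ratio means, the sketch below counts repeated rows with plain pandas. It is an approximation for intuition rather than deepchecks' own code; `duplicate_ratio` and the toy values are hypothetical.

```python
import pandas as pd


def duplicate_ratio(df: pd.DataFrame, columns=None) -> float:
    """Fraction of rows that repeat an earlier row (hypothetical helper)."""
    subset = df if columns is None else df[columns]
    # duplicated() marks every occurrence after the first one as True.
    return float(subset.duplicated().mean())


# Toy data (made up): rows 0 and 1 are identical on every column,
# and row 3 matches them on "entropy" and "numParams" only.
toy = pd.DataFrame({
    "entropy": [-4.3, -4.3, -3.5, -4.3],
    "numParams": [0, 0, 2, 0],
    "ext": ["net", "net", "com", "org"],
})

all_columns_ratio = duplicate_ratio(toy)
two_columns_ratio = duplicate_ratio(toy, ["entropy", "numParams"])
print(all_columns_ratio, two_columns_ratio)
```

Restricting the comparison to fewer columns can only keep or raise the ratio, which is worth remembering when choosing `columns` or `ignore_columns`.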

In [5]:
from deepchecks.tabular.datasets.classification.phishing import load_data
from deepchecks.tabular.checks import DataDuplicates

DataDuplicates().run(phishing_dataset)

In [6]:

# With Check Parameters
# ---------------------
# ``DataDuplicates`` check can also use a specific subset of columns (or alternatively
# use all columns except specific ignore_columns to check duplication):

DataDuplicates(columns=["entropy", "numParams"]).run(phishing_dataset)

In [7]:
DataDuplicates(ignore_columns=["scrape_date"], n_to_show=10).run(phishing_dataset)

In [None]:
# Define a Condition
# Now we define a condition that enforces the ratio of duplicates to be 0. A condition is deepchecks’
# way to validate model and data quality, and it lets you know if anything goes wrong.

check = DataDuplicates()
check.add_condition_ratio_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

DataDuplicates
	Conditions:
		0: Duplicate data ratio is not greater than 0%

## Feature Label Correlation


The Predictive Power Score (PPS) estimates how well a single feature can predict the label on its own. A high PPS (close to 1) can mean that the feature's success in predicting the label is actually due to data leakage, meaning that the feature holds information that is derived from the label to begin with.
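
The idea behind a PPS-style score for a numeric label can be sketched with NumPy alone: compare the error of a model that sees only one feature against a naive baseline that always predicts the median. The sketch below is a deliberate simplification and not the real PPS computation: it swaps in a straight-line fit scored in-sample with no cross-validation. The function name `pps_like_score` and the synthetic data are assumptions.

```python
import numpy as np


def pps_like_score(feature: np.ndarray, label: np.ndarray) -> float:
    """Simplified PPS-style score in [0, 1] for a numeric label (illustrative only)."""
    # Naive baseline: always predict the median of the label.
    naive_mae = np.mean(np.abs(label - np.median(label)))
    # Stand-in model (assumption): a straight-line fit on the single feature.
    slope, intercept = np.polyfit(feature, label, deg=1)
    model_mae = np.mean(np.abs(label - (slope * feature + intercept)))
    # 1 means the feature removes all error; 0 means it is no better than the baseline.
    return float(max(0.0, 1.0 - model_mae / naive_mae))


rng = np.random.default_rng(12345)
label = rng.normal(size=500)
leaky = label + rng.normal(scale=0.01, size=500)  # almost a copy of the label
noise = rng.normal(size=500)                      # unrelated to the label

leaky_score = pps_like_score(leaky, label)
noise_score = pps_like_score(noise, label)
print(leaky_score, noise_score)
```

A feature that is nearly a copy of the label scores close to 1, which is the pattern the check flags as possible leakage, while an unrelated feature stays near 0.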


In [9]:
import numpy as np
import pandas as pd

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import FeatureLabelCorrelation
df = pd.DataFrame(np.random.randn(100, 3), columns=['x1', 'x2', 'x3'])
df['x4'] = df['x1'] * 0.5 + df['x2']
df['label'] = df['x2'] + 0.1 * df['x1']
df['x5'] = df['label'].apply(lambda x: 'v1' if x < 0 else 'v2')

ds = Dataset(df, label='label', cat_features=[])
# Using the FeatureLabelCorrelation check class
my_check = FeatureLabelCorrelation(ppscore_params={'sample': 10})
my_check.run(dataset=ds)

## Is Single Value


In [12]:
# Imports
import pandas as pd
from sklearn.datasets import load_iris

from deepchecks.tabular.checks import IsSingleValue
iris = load_iris()
X = iris.data
df = pd.DataFrame({'a':[3,4,1], 'b':[2,2,2], 'c':[None, None, None], 'd':['a', 4, 6]})
df

Unnamed: 0,a,b,c,d
0,3,2,,a
1,4,2,,4
2,1,2,,6


In [13]:
# See functionality
IsSingleValue().run(pd.DataFrame(X))

In [14]:
IsSingleValue().run(pd.DataFrame({'a':[3,4], 'b':[2,2], 'c':[None, None], 'd':['a', 4]}))

In [15]:
sv = IsSingleValue()
sv.run(df)

## Mixed Data Types
What are Mixed Data Types?

Mixed data types occur when a column contains both string values and numeric values (either as a numeric type or as a numeric string such as “42.90”). This may indicate a problem in the data collection pipeline, or cause problems during model training.

This check searches for columns with a mix of strings and numeric values and returns them along with their respective ratios.
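
The string-versus-number split described above can be approximated by hand as follows. This is a sketch for intuition, not how deepchecks classifies values internally; `mixed_type_ratios` and its helper are hypothetical names, and any non-numeric value is simply counted as a string here.

```python
import numbers

import pandas as pd


def _is_numeric_like(value) -> bool:
    """True for real numbers and for strings that parse as a float, e.g. '42.90'."""
    if isinstance(value, bool):
        return False
    if isinstance(value, numbers.Number):
        return True
    if isinstance(value, str):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False


def mixed_type_ratios(col: pd.Series) -> dict:
    """Share of numeric-like vs. other (string) values, ignoring missing values."""
    values = col.dropna()
    numeric_share = float(values.map(_is_numeric_like).astype(bool).mean())
    return {"numbers": numeric_share, "strings": 1.0 - numeric_share}


# Toy column (made up): 4 numeric-like values, 3 plain strings, 1 missing value.
age = pd.Series([39, 50, "c", 53, "c", "42.90", "a", None], dtype=object)
ratios = mixed_type_ratios(age)
print(ratios)
```

A column counts as mixed only when both shares are non-zero; a purely numeric or purely textual column would report 1.0 on one side and 0.0 on the other.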

In [16]:
import pandas as pd
import numpy as np
from deepchecks.tabular.datasets.classification import adult

# Prepare functions to insert mixed data types

def insert_new_values_types(col: pd.Series, ratio_to_replace: float, values_list):
    col = col.to_numpy().astype(object)
    indices_to_replace = np.random.choice(range(len(col)), int(len(col) * ratio_to_replace), replace=False)
    new_values = np.random.choice(values_list, len(indices_to_replace))
    col[indices_to_replace] = new_values
    return col


def insert_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['a', 'b', 'c'])


def insert_numeric_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['1.0', '1', '10394.33'])


def insert_number_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, [66, 99.9])


# Load dataset and insert some data type mixing
adult_df, _ = adult.load_data(as_train_test=True, data_format='Dataframe')
adult_df['workclass'] = insert_numeric_string_types(adult_df['workclass'], ratio_to_replace=0.01)
adult_df['education'] = insert_number_types(adult_df['education'], ratio_to_replace=0.1)
adult_df['age'] = insert_string_types(adult_df['age'], ratio_to_replace=0.5)

In [18]:
adult_df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,c,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,c,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [19]:
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import MixedDataTypes

adult_dataset = Dataset(adult_df, cat_features=['workclass', 'education'])
check = MixedDataTypes()
result = check.run(adult_dataset)
result

## My personal data - Columns Info

In [None]:
# Read the CSV from its GitHub URL
data_raw = pd.read_csv("https://raw.githubusercontent.com/alinemati45/deepcheck/main/data/NMLoanDefault.csv")

# Drop the index column that was saved into the CSV
data_raw = data_raw.drop("Unnamed: 0", axis=1)

data_raw.head(7)  # Display the first 7 rows


#################################################################
##  Normalize headers: replace spaces with "_" and lowercase   ##
#################################################################
data_raw.columns = ['_'.join(col.split(' ')).lower() for col in data_raw.columns]


dataset = Dataset(data_raw, label='target', cat_features=['job_cde', 'reason_cde'])
check = ColumnsInfo(n_top_columns=14)
check.run(dataset=dataset)

Unnamed: 0,PROPERTY_VALUE_AMT,TARGET,CRDT_LINE_CNT,DEROG_CNT,DEBT_INC_RTIO_AMT,LOAN_AMT,REASON_CDE,YOJ_AMT,MORTGAGE_DUE_AMT,RCNT_CRDT_CNT,OLD_AGE_TRADE_AMT,JOB_CDE,DELINGQ_CNT
0,91704.0,0,20.0,0.0,30.206893,20000,DebtCon,10.0,28440.0,0.0,143.637439,ProfExe,0.0
1,88342.0,0,11.0,0.0,43.717635,4800,HomeImp,7.0,80482.0,0.0,275.032395,ProfExe,0.0
2,242602.0,0,26.0,0.0,41.277127,25700,DebtCon,8.0,197425.0,2.0,102.960346,Other,0.0
3,68500.0,0,42.0,0.0,,18000,DebtCon,10.0,45000.0,1.0,190.8,ProfExe,0.0
4,55500.0,0,11.0,0.0,,8600,DebtCon,6.0,41126.0,1.0,73.033333,Other,0.0
5,64386.0,0,17.0,0.0,36.787306,17200,DebtCon,27.0,51352.0,1.0,295.955414,Mgr,0.0
6,73380.0,0,18.0,0.0,24.374481,40300,DebtCon,4.0,33299.0,1.0,184.537589,ProfExe,0.0


## Conflicting Labels

In [None]:
from deepchecks.tabular.checks import ConflictingLabels
from deepchecks.tabular.datasets.classification.phishing import load_data

phishing_dataframe = load_data(as_train_test=False, data_format='Dataframe')
phishing_dataframe.head()
phishing_dataset = Dataset(phishing_dataframe, label='target', features=['urlLength', 'numDigits', 'numParams', 'num_%20', 'num_@', 'bodyLength', 'numTitles', 'numImages', 'numLinks', 'specialChars'])

ConflictingLabels().run(phishing_dataset)

Unnamed: 0,target,month,scrape_date,ext,urlLength,numDigits,numParams,num_%20,num_@,entropy,...,dse,bodyLength,numTitles,numImages,numLinks,specialChars,scriptLength,sbr,bscr,sscr
0,0,1,2019-01-01,net,102,8,0,0,0,-4.384032,...,191,32486,3,5,330,9419,23919,0.736286,0.28994,2.539442
1,0,1,2019-01-01,country,154,60,0,2,0,-3.566515,...,0,16199,0,4,39,2735,794,0.049015,0.168838,0.290311
2,0,1,2019-01-01,net,171,5,11,0,0,-4.608755,...,104,103344,18,9,302,27798,83817,0.811049,0.268985,2.412174
3,0,1,2019-01-01,com,94,10,0,0,0,-4.548921,...,466,34093,11,43,199,9087,19427,0.569824,0.266536,2.137889
4,0,1,2019-01-01,other,95,11,0,0,0,-4.717188,...,928,202,1,0,0,39,0,0.0,0.193069,0.0


In [None]:
# We can also check label ambiguity on a subset of the features:
ConflictingLabels(n_to_show=1).run(phishing_dataset)

In [None]:
ConflictingLabels(columns=['urlLength', 'numDigits']).run(phishing_dataset)

### Define a Condition
Now we define a condition that enforces that the ratio of samples with conflicting labels is 0. A condition is deepchecks’ way to validate model and data quality, and it lets you know if anything goes wrong.

In [None]:
check = ConflictingLabels()
check.add_condition_ratio_of_conflicting_labels_not_greater_than(0)
result = check.run(phishing_dataset)
result.show(show_additional_outputs=False)

ConflictingLabels
	Conditions:
		0: Ambiguous sample ratio is not greater than 0%

## My personal data - Conflicting Labels

In [None]:
from deepchecks.tabular.checks import ConflictingLabels
ConflictingLabels().run(dataset)

In [None]:
from deepchecks.tabular.datasets.classification.phishing import load_data
from deepchecks.tabular.checks import DataDuplicates

DataDuplicates().run(dataset)

# Data Duplicates