# Data Quality - project

The objective of the project is to deduplicate records from a dataset containing information about individuals (etudiant.csv). Each record in this dataset includes details such as the person's first name, last name, address, etc. Additionally, we have added a Record_ID attribute which serves as an identifier for the records. The goal is to identify records that represent the same individual.

The outcome of this part is a set of pairs of records (record_id1, record_id2) that represent the same physical person. In this section, you will use the deduplication technique covered in the Data Quality and Preparation course.

## Get Dataset

In [1]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
from Levenshtein import ratio as levenshtein_ratio


In [2]:
# Raw string so that the backslashes in the Windows path are not read as escape sequences
df_raw = pd.read_csv(r"C:\David\ML\Data Quality\etudiant.csv")

## Data Exploration


Let's take a first look at the dataset.

In [3]:
df_raw.head(100)

Unnamed: 0,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id,id record
0,matthwew,apted,18.0,atherton srteet,currumb in hill,port macquarie,3183,vic,19590825.0,9425976,B_0
1,william,badger,,dwyer s treet,glenlee,west lakes shore,3291,nsw,19001210.0,4656608,B_1
2,connor,bailke,186.0,lambie place,,port lincoln,7303,qld,19670529.0,4702335,B_2
3,kaitlin,goldsworthy,54.0,,,thirroul,2035,nsw,19640517.0,9127277,B_3
4,jasmyn,lowe,48.0,toohey place,grand ecntral,bicton,3085,nsw,19320918.0,1430128,B_4
...,...,...,...,...,...,...,...,...,...,...,...
95,cooper,mcneiwll,119.0,the sandys,maidment place,dungog,5075,nsw,19080127.0,2706283,B_95
96,lauren,jeffrues,9.0,endeavour street,,patonta,2910,wa,19850503.0,4934969,B_96
97,biacna,brummer-archer,311.0,henry melville crescent,rosetta village,strathdickie,3023,qld,19890705.0,9167456,B_97
98,kiana,pieris,33.0,,mungrum stud,hwthorn,4650,vic,19370813.0,5026429,B_98


The following code snippet generates a comprehensive profiling report for the dataset `df_raw`. The report provides detailed insights into the dataset's structure, statistics, and patterns, facilitating a thorough understanding of its contents.


In [4]:
profile = ProfileReport(df_raw, title="Profiling Report")
profile




Key insights:
1. 3.8% of the data is missing.
2. There are no obvious exact duplicates.
3. Some given names or surnames are missing; this has to be addressed.
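
These insights can be cross-checked directly in pandas. The sketch below uses a small made-up table (`toy`, hypothetical) rather than `df_raw`. It also illustrates one caveat: a row-level duplicate check has to ignore identifier columns such as `id record`, otherwise no two rows can ever match exactly.

```python
import pandas as pd

# Hypothetical toy data standing in for df_raw (values are made up)
toy = pd.DataFrame({
    "given_name": ["paul", "pia", None, "pia"],
    "surname": ["aaksoa", "aba", "abbie", "aba"],
    "postcode": [3143, 6100, 7310, 6100],
    "id record": ["B_0", "B_1", "B_2", "B_3"],
})

# Share of missing cells over the whole table
missing_share = toy.isna().to_numpy().mean()

# Exact duplicates: with and without the unique identifier column
content_columns = ["given_name", "surname", "postcode"]
duplicates_with_id = int(toy.duplicated().sum())
duplicates_without_id = int(toy.duplicated(subset=content_columns).sum())

# Rows where the given name or the surname is missing
missing_names = int(toy[["given_name", "surname"]].isna().any(axis=1).sum())

print(f"Missing cells: {missing_share:.1%}")
print(f"Exact duplicates (all columns): {duplicates_with_id}")
print(f"Exact duplicates (ignoring id): {duplicates_without_id}")
print(f"Rows with a missing name: {missing_names}")
```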


## Data Wrangling

In this section, we perform data wrangling tasks on the dataset `df_raw` to prepare it for analysis. We've divided the tasks into several steps, including handling missing values, adjusting data types, and ensuring data integrity.


In [5]:
df = df_raw.copy()

### Complete missing values



In [6]:
# Replace missing values (NaN) with empty strings
df = df.fillna('')

# Treat 0 as a missing value in postcode, date_of_birth and soc_sec_id
df['postcode'] = df['postcode'].replace(0, '')
df['date_of_birth'] = df['date_of_birth'].replace(0, '')
df['soc_sec_id'] = df['soc_sec_id'].replace(0, '')



### Remove rows with a missing given name or surname

In [7]:
df = df[np.logical_and(df.given_name != '', df.surname != '')]

### Adjust type

In [8]:
# Text columns are upper-cased so that comparisons are case-insensitive.
# street_number and date_of_birth keep '' for missing values and are cast to int otherwise.
# postcode and soc_sec_id are cast directly, which assumes they contain no empty values.
df['given_name'] = df['given_name'].str.upper()
df['surname'] = df['surname'].str.upper()
df['street_number'] = [item if item == '' else int(item) for item in df.street_number]
df['address_1'] = df['address_1'].str.upper()
df['address_2'] = df['address_2'].str.upper()
df['suburb'] = df['suburb'].str.upper()
df['postcode'] = df['postcode'].astype(int)
df['state'] = df['state'].str.upper()
df['date_of_birth'] = [item if item == '' else int(item) for item in df.date_of_birth]
df['soc_sec_id'] = df['soc_sec_id'].astype(int)
df['id record'] = df['id record'].astype(str)
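
As an alternative to columns that mix integers and empty strings, pandas offers a nullable integer dtype. The sketch below is not the approach used in this notebook; it only illustrates, on a made-up column (`date_of_birth`, hypothetical values), how missing entries could stay missing while the rest becomes integer.

```python
import pandas as pd

# Hypothetical column after fillna(''): numbers mixed with empty strings
date_of_birth = pd.Series([19590825.0, "", 19670529.0], dtype=object)

# Coerce empty strings to missing values, then use the nullable integer dtype
date_of_birth_int = pd.to_numeric(date_of_birth, errors="coerce").astype("Int64")

print(date_of_birth_int.dtype)
print(int(date_of_birth_int.isna().sum()))
```

With this representation the column has a single dtype, which avoids comparing strings with integers when sorting.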


## Search Space - Block method

To speed up the search for duplicates in a large dataset, we use the Block method. The idea is to sort the dataset so that likely duplicates end up close to each other, then compare each record only with the records inside a small window, which greatly reduces the search space.
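
To give an idea of the savings, the sketch below counts the comparisons needed with and without a window. The function names and the sizes (`n_records`, `window`) are hypothetical and chosen only for illustration.

```python
def count_window_pairs(n_records: int, window: int) -> int:
    """Number of (i, j) pairs when each record is compared with the
    window - 1 records that follow it in the sorted order."""
    return sum(min(window - 1, n_records - 1 - i) for i in range(n_records))


def count_all_pairs(n_records: int) -> int:
    """Number of pairs in an exhaustive comparison of all records."""
    return n_records * (n_records - 1) // 2


n_records, window = 1_000, 20  # hypothetical sizes, for illustration only
print(count_all_pairs(n_records))             # 499500
print(count_window_pairs(n_records, window))  # 18810
```

The exhaustive count grows quadratically with the number of records, while the windowed count grows linearly.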

In this case, we have a natural hierarchy of features to detect identical students: surname, given name, birth date, and postcode.
By ordering records based on this hierarchy, duplicates are likely to end up on consecutive rows.
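
A minimal sketch of this effect on made-up records (the `toy` table and its `T_` identifiers are hypothetical): two rows describing the same person, differing only by a typo in the postcode, become neighbours after sorting.

```python
import pandas as pd

# Hypothetical records; T_0 and T_2 describe the same person (postcode typo)
toy = pd.DataFrame({
    "surname": ["WHITE", "ABA", "WHITE", "MASON"],
    "given_name": ["CHLOE", "PIA", "CHLOE", "SOPHIE"],
    "date_of_birth": [19181228, 19030803, 19181228, 19500101],
    "postcode": [2212, 6100, 2221, 2000],
    "id record": ["T_0", "T_1", "T_2", "T_3"],
})

# Sort on the same hierarchy of features as the notebook
sorted_toy = toy.sort_values(
    by=["surname", "given_name", "date_of_birth", "postcode"]
).reset_index(drop=True)

print(sorted_toy["id record"].tolist())
```

One limitation worth keeping in mind: a typo in the first letters of the surname can move a record far from its duplicates, outside the window.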

Let's reorder our DataFrame based on these features.


In [9]:
df = df.sort_values(by=['surname', 'given_name', 'date_of_birth', 'postcode'], ascending=True)

In [10]:
df.head(20)

Unnamed: 0,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id,id record
106,PAUL,AAKSOA,35,BLAZVYPLACE,ELOUERA PARK,KINGSTON,3143,VIC,19280108,2229267,B_106
5848,PIA,ABA,16,TARDEN T STREET,,CLIFTON SPRINGS,6100,TASW,19030803,9804900,B_5848
18239,PIA,ABA,16,TARDEN T STREET,,CLIFTON SPRINGS,6100,TAS,19030803,9804900,B_18239
18555,PIA,ABA,16,TARDEN T STREET,,CLIFTON SPRINGS,6100,TAS,19030803,9804900,B_18555
22071,PIA,ABA,16,TARDEN T STREET,,CLIFTON SPRINGS,6100,TAS,19030803,9804900,B_22071
1848,LIMBERT,ABBIE,3,YURNG,BURRALY COURT,OYSTER BAY,7310,SA,19061013,2508428,B_1848
8757,LIMBERT,ABBIE,3,YURNG,BURRALY COURT,OYSTER BAY,7310,SA,19061013,2508428,B_8757
9780,LIMBERT,ABBIE,3,YURNG,BURRALY COURT,OYSTER BAY,7310,SA,19061013,2508428,B_9780
14231,LIMBERT,ABBIE,3,YURNG,BURRALY COURT,OYSTER BAY,7310,SA,19061013,2508428,B_14231
15381,LIMBERT,ABBIE,3,YURNG,BURRALY COURT,OYSTER BAY,7310,SA,19061013,2508428,B_15381


## Define our window based on the maximum occurrence of a given name and surname

In [11]:
df[['surname', 'given_name']].value_counts()

surname   given_name
WHITE     CHLOE         18
WEBB      JOSHUA        17
WHITE     MIA           16
CLARKE    HOLLY         15
MASON     SOPHIE        15
                        ..
MATTHEWF  ARREN          1
MATTHEW   SPEIGHTW       1
MATTHES   PONNY          1
MASONR    DIAMOND        1
ZXGXR     KYLE           1
Name: count, Length: 5426, dtype: int64

Let's inspect the most frequent name.

In [12]:
df[np.logical_and(df.surname == 'WHITE', df.given_name == 'CHLOE')]

Unnamed: 0,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id,id record
5460,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_5460
6209,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_6209
6328,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_6328
6522,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,SW,19181228,5862107,B_6522
8588,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_8588
10292,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_10292
18809,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_18809
19014,CHLOE,WHITE,101,GOLDSTEIN CTRESCENT,BRENTWOOD VLGE,STONEYFORD,2212,NSW,19181228,5862107,B_19014
985,CHLOE,WHITE,33,GADALI CRESCENT,RONDYAVOO,LOGANKEA,3069,VIC,19331024,8293935,B_985
3012,CHLOE,WHITE,33,GADALI CRESCENT,RONDYAVOO,LOGANKEA,3069,VIC,19331024,8293935,B_3012


The most frequent (surname, given name) pair appears 18 times, so we set our window to 20, slightly above this maximum.

In [13]:
block_window = 20
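
The window could also be derived from the counts instead of being typed by hand. The sketch below shows one way to do it on made-up data; the `margin` value is an assumption, not something fixed by the method.

```python
import pandas as pd

# Hypothetical records (made-up values)
toy = pd.DataFrame({
    "surname": ["ABA", "ABA", "ABA", "WHITE", "WHITE"],
    "given_name": ["PIA", "PIA", "PIA", "CHLOE", "MIA"],
})

# Largest number of records sharing the same (surname, given name) pair
max_group_size = int(toy[["surname", "given_name"]].value_counts().max())

# Hypothetical safety margin added on top of the largest group
margin = 2
toy_window = max_group_size + margin

print(max_group_size, toy_window)
```

Applied to the counts above, a maximum of 18 with a margin of 2 gives the window of 20 used in this notebook.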

## Comparison & Decision

Example using the Levenshtein similarity ratio:

In [15]:
name1 = 'JOHN DOE'
name2 = 'JOHN DOO'
threshold_score = 0.7


similarity_score = levenshtein_ratio(name1, name2)
is_duplicate = similarity_score > threshold_score

print(f"Similarity score: {similarity_score}")
print(f"Are they duplicates? {'Yes' if is_duplicate else 'No'}")


Similarity score: 0.875
Are they duplicates? Yes
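
To see where 0.875 comes from, here is a pure-Python sketch of a normalized insertion/deletion similarity. It assumes that `levenshtein_ratio` computes this quantity, which is my understanding of recent versions of the `Levenshtein` package; the helper names `lcs_length` and `indel_ratio` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized insertion/deletion similarity in [0, 1]."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    # Characters outside the common subsequence must be deleted or inserted
    indel_distance = total - 2 * lcs_length(a, b)
    return 1 - indel_distance / total


print(indel_ratio("JOHN DOE", "JOHN DOO"))  # 0.875
```

Replacing the final E by an O costs one deletion and one insertion over a combined length of 16, hence 1 - 2/16 = 0.875.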


Example using cosine similarity on character bigrams (n-grams with n = 2):

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

name1 = 'JOHN DOE'
name2 = 'JOHN DOO'
threshold_score = 0.7

vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit([name1, name2])
vector = vectorizer.transform([name1, name2])

similarity_score = cosine_similarity(vector)[0][1]

is_duplicate = similarity_score > threshold_score

print(f"Similarity score: {similarity_score}")
print(f"Are they duplicates? {'Yes' if is_duplicate else 'No'}")


Similarity score: 0.8571428571428569
Are they duplicates? Yes
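
The same score can be reproduced without scikit-learn, which makes the computation explicit. This is only an illustrative sketch with hypothetical helper names; it counts overlapping character bigrams and applies the cosine formula.

```python
from collections import Counter
from math import sqrt


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, e.g. 'JOHN' -> JO, OH, HN."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_bigram_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of a and b."""
    counts_a, counts_b = char_bigrams(a), char_bigrams(b)
    dot = sum(counts_a[gram] * counts_b[gram] for gram in counts_a)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_bigram_similarity("JOHN DOE", "JOHN DOO"))
```

Each name has 7 bigrams and they share 6 of them, so the similarity is 6/7, about 0.857.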


### Let's Find Duplicates

We define two records as duplicates if the average of the Levenshtein similarity ratios of their surname, given name, and date of birth is above a predefined threshold.
We decide not to use the address-related features, since a person can move.

### Compute similarity ratio on all records



In [17]:
distance_student = {'record_id1': [], 'record_id2': [], 'score': []}

# Extract the compared columns once: repeated df.iloc lookups inside the loop are slow
surnames = df['surname'].tolist()
given_names = df['given_name'].tolist()
births = df['date_of_birth'].astype(str).tolist()
record_ids = df['id record'].tolist()
n_records = len(df)

for i in range(n_records):
    # Compare record i with the records that follow it inside the window,
    # stopping at the end of the DataFrame so that the last rows are compared too
    for j in range(i + 1, min(i + block_window, n_records)):

        lev1 = levenshtein_ratio(surnames[i], surnames[j])
        lev2 = levenshtein_ratio(given_names[i], given_names[j])
        lev3 = levenshtein_ratio(births[i], births[j])

        distance_student['record_id1'].append(record_ids[i])
        distance_student['record_id2'].append(record_ids[j])
        distance_student['score'].append((lev1 + lev2 + lev3) / 3)

df_distance = pd.DataFrame(distance_student)


### Define threshold and keep duplicates only 

After an experimental analysis, we decide to set our threshold to 0.88. 

In [30]:
threshold = 0.88

df_duplicate = df_distance[df_distance.score > threshold].sort_values(by='score', ascending=True)
df_duplicate.head()

Unnamed: 0,record_id1,record_id2,score
23465,B_6057,B_15847,0.888889
214835,B_15220,B_13875,0.888889
214836,B_15220,B_15895,0.888889
214837,B_15220,B_17611,0.888889
23447,B_315,B_15847,0.888889
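
The experimental analysis behind the threshold is not detailed here. One simple way to explore it, shown below as an assumption rather than the procedure actually followed, is to count how many candidate pairs survive several thresholds and inspect the pairs near each cut. The table `toy_distance` and its scores are made up.

```python
import pandas as pd

# Hypothetical scores standing in for df_distance (made-up values)
toy_distance = pd.DataFrame({
    "record_id1": ["T_0", "T_0", "T_1", "T_2"],
    "record_id2": ["T_1", "T_2", "T_2", "T_3"],
    "score": [1.00, 0.89, 0.80, 0.45],
})

# Count how many candidate pairs are kept for each candidate threshold
candidate_thresholds = [0.70, 0.80, 0.88, 0.95]
kept = {t: int((toy_distance["score"] > t).sum()) for t in candidate_thresholds}

print(kept)
```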


### Inspecting the records with the lowest ratio above the threshold


In [48]:
df[df['id record'] == df_duplicate.iloc[0]['record_id1']]

Unnamed: 0,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id,id record
6057,KEEGAN,BERRY,28,HATTERSLEY COURT,GUNDAMAIN,CORRIMQL EAST,3073,QLD,19230913,8103633,B_6057


In [49]:
df[df['id record'] == df_duplicate.iloc[0]['record_id2']]

Unnamed: 0,given_name,surname,street_number,address_1,address_2,suburb,postcode,state,date_of_birth,soc_sec_id,id record
15847,KUUGAN,BERRY,28,HATTERSLEY COURT,GUNDAMAIN,CORRIMQL EAST,3073,QLD,19230913,8103633,B_15847
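
The score of 0.888889 for this pair can be reproduced by hand, assuming `levenshtein_ratio` is a normalized insertion/deletion similarity (an assumption about the installed library version, consistent with the 0.875 obtained earlier for 'JOHN DOE' and 'JOHN DOO'). The surnames and dates of birth are identical, so both of those ratios equal 1. Turning KEEGAN into KUUGAN takes two deletions and two insertions over a combined length of 12:

$$
\text{ratio}(\text{KEEGAN}, \text{KUUGAN}) = 1 - \frac{4}{6 + 6} = \frac{2}{3},
\qquad
\text{score} = \frac{1 + \frac{2}{3} + 1}{3} = \frac{8}{9} \approx 0.8889
$$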


## Conclusion

We successfully found duplicates in our dataset by:
1. cleaning the data
2. reordering rows to bring potential duplicates closer together
3. defining a relevant search space
4. computing the similarity between each pair of rows within our search space using the Levenshtein ratio
5. defining a relevant similarity threshold to decide whether or not two records are duplicates
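
The expected outcome is a set of (record_id1, record_id2) pairs. A minimal sketch of how such pairs could be extracted from a table shaped like `df_duplicate` is given below; `toy_duplicate` and its identifiers are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df_duplicate (made-up identifiers and scores)
toy_duplicate = pd.DataFrame({
    "record_id1": ["T_0", "T_5"],
    "record_id2": ["T_1", "T_9"],
    "score": [0.95, 0.89],
})

# Keep only the identifier columns and turn each row into a plain tuple
pairs = list(
    toy_duplicate[["record_id1", "record_id2"]].itertuples(index=False, name=None)
)

print(pairs)
```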

