# Data quality analysis

## Getting to know the dataset

### Loading the raw data

In [1]:
from getting_started import df_patient, df_pcr

Convert each attribute of the patient reference table to a more specific data type.

In [2]:
df_patient = df_patient.convert_dtypes()

df_patient.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   patient_id     20000 non-null  Int64 
 1   given_name     19560 non-null  string
 2   surname        19575 non-null  string
 3   street_number  19618 non-null  Int64 
 4   address_1      19204 non-null  string
 5   suburb         19788 non-null  string
 6   postcode       19801 non-null  string
 7   state          18010 non-null  string
 8   date_of_birth  17989 non-null  Int64 
 9   age            16003 non-null  Int64 
 10  phone_number   19081 non-null  string
 11  address_2      7893 non-null   string
dtypes: Int64(4), string(8)
memory usage: 1.9 MB


Convert each attribute of the PCR test sample to a more specific data type.

In [3]:
df_pcr = df_pcr.convert_dtypes()

df_pcr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8800 entries, 0 to 8799
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   patient_id  8800 non-null   Int64 
 1   pcr         8800 non-null   string
dtypes: Int64(1), string(1)
memory usage: 146.2 KB


### Duplicated identifiers

In [4]:
with_duplicated_id = df_patient.patient_id.duplicated(keep=False)

df_patient[with_duplicated_id].sort_values("patient_id")

| | patient_id | given_name | surname | street_number | address_1 | suburb | postcode | state | date_of_birth | age | phone_number | address_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12117 | 109304 | zachary | farronato | 30 | outtrim avenue | como | 2196 | vic | 19090801 | 31 | 07 22894061 | the reefs |
| 14839 | 109304 | bailey | donaldson | 20 | tardent street | ryde | 0812 | qld | 19580310 | 26 | 07 13479210 | |
| 4386 | 110207 | toby | brock | 4 | merriman crescent | baralaba | 3025 | nsw | 19000424 | 35 | 08 33842007 | leitrim |
| 12989 | 110207 | zali | brock | 32 | hedger street | toorak | 5038 | act | | 22 | 08 96818512 | |
| 10184 | 115791 | hannah | clarke | 70 | galmarra street | mayfield | 7010 | vic | 19830828 | 25 | 04 70760611 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10507 | 990695 | erin | braunack | 49 | moondarra street | broken hill | 2640 | qld | 19830122 | 30 | 03 69523317 | yuulong |
| 8764 | 990936 | amy | royle | 90 | whittell crescent | coramba | 5032 | sa | 19950326 | | 08 07309295 | tewantin plaza |
| 12563 | 990936 | samantha | green | 21 | brierly street | ardrossan | 2140 | | 19380210 | 29 | 02 51600621 | |
| 2385 | 994235 | trent | stewart-jones | 129 | macfarland crescent | wangaratta | 2732 | nsw | | | 07 98662458 | mountview |


There are 403 patient records whose identifier is shared with at least one other record.

### Duplicates across all attributes

In [5]:
with_same_attributes = df_patient.drop(columns="patient_id").duplicated(keep=False)

df_patient[with_same_attributes].sort_values(by=["surname", "given_name"])

| | patient_id | given_name | surname | street_number | address_1 | suburb | postcode | state | date_of_birth | age | phone_number | address_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6981 | 804259 | charlie | gamlin | 33 | nungara street | bayswater | 5251 | qld | 19190111.0 | 24.0 | 04 33326042 | |
| 12438 | 107928 | charlie | gamlin | 33 | nungara street | bayswater | 5251 | qld | 19190111.0 | 24.0 | 04 33326042 | |
| 15184 | 744576 | freya | jaffres | 50 | hart place | biggera waters | 4413 | wa | | 33.0 | 07 76055136 | |
| 17522 | 449738 | freya | jaffres | 50 | hart place | biggera waters | 4413 | wa | | 33.0 | 07 76055136 | |
| 5729 | 973030 | delaney | kermeen | 9 | wallen place | st kilda east | 4405 | qld | 19391226.0 | 25.0 | 08 37919311 | |
| 15337 | 373129 | delaney | kermeen | 9 | wallen place | st kilda east | 4405 | qld | 19391226.0 | 25.0 | 08 37919311 | |
| 1052 | 664037 | samantha | laundy | 2 | mannheim street | quairading | 4740 | qld | 19480111.0 | | 02 37735421 | stanton |
| 18867 | 421721 | samantha | laundy | 2 | mannheim street | quairading | 4740 | qld | 19480111.0 | | 02 37735421 | stanton |
| 1863 | 658924 | lewis | matthews | 24 | allawah flats | attunga | 3216 | wa | 19690623.0 | 26.0 | 03 95122427 | |
| 1944 | 669936 | lewis | matthews | 24 | allawah flats | attunga | 3216 | wa | 19690623.0 | 26.0 | 03 95122427 | |


There are 22 patient records that are exact duplicates of another record once the identifier is ignored.

These last two observations justify a deeper analysis of the patient reference table, in order to define a suitable deduplication and data reconciliation strategy before the exploratory analysis.
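
As a starting point for such a strategy, exact duplicates can be collapsed by ignoring the identifier. The sketch below runs on a small hypothetical frame (`toy`), not on the real data, and keeps the first record of each group, which is only one of several possible policies.

```python
import pandas as pd

# Hypothetical toy records: rows 0 and 2 differ only by patient_id.
toy = pd.DataFrame(
    {
        "patient_id": [1, 2, 3],
        "given_name": ["charlie", "freya", "charlie"],
        "surname": ["gamlin", "jaffres", "gamlin"],
    }
)

# Flag every record that repeats an earlier one on all non-identifier columns.
is_repeat = toy.drop(columns="patient_id").duplicated(keep="first")

# Keep the first record of each group of exact duplicates.
deduplicated = toy[~is_repeat]
print(deduplicated)
```

Records that share an identifier but differ on other attributes need a separate reconciliation rule, which this sketch does not attempt.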

## Analysis of the PCR test sample

### Distribution of PCR test values

In [6]:
df_pcr.pcr.value_counts()

N           3482
Negative    3134
Positive    1283
P            901
Name: pcr, dtype: Int64

Two conventions are used to represent the two possible outcomes of a PCR test (negative or positive): `N / P` and `Negative / Positive`.

These results will need to be normalized into a single ordered categorical variable.
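
A minimal sketch of that normalization, assuming the four labels observed above are the only ones and that `Negative < Positive` is the desired order. The names `raw`, `canonical` and `pcr_dtype` are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hypothetical sample mixing both conventions seen in the `pcr` column.
raw = pd.Series(["N", "Negative", "Positive", "P"])

# Assumed mapping from each observed label to a canonical one.
canonical = {
    "N": "Negative",
    "Negative": "Negative",
    "P": "Positive",
    "Positive": "Positive",
}

# Ordered categorical type with Negative < Positive (the order is an assumption).
pcr_dtype = pd.CategoricalDtype(categories=["Negative", "Positive"], ordered=True)

normalized = raw.map(canonical).astype(pcr_dtype)
print(normalized.value_counts())
```

Any label missing from `canonical` would become a missing value, which makes unexpected encodings easy to spot.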

### Completeness of the reference table

In [7]:
df_pcr.patient_id.isin(df_patient.patient_id).all()

True

Every identifier attached to a test in the sample is present in the patient reference table.

## Analysis of the patient reference table

### Surname and given name

#### Missing values

In [8]:
df_na_in_patient_name = df_patient[["surname", "given_name"]].isna()

df_na_in_patient_name.value_counts()

surname  given_name
False    False         19139
         True            436
True     False           421
         True              4
dtype: int64

There are 861 patients (436 + 421 + 4) whose surname and/or given name is missing.

Distribution of full names:

In [9]:
df_patient_full_name = df_patient[["surname", "given_name"]].dropna()

df_patient_full_name.value_counts()

surname   given_name
white     emiily        14
green     emiily        12
white     joshua        11
campbell  joshua        11
ryan      emiily        10
                        ..
newberry  jack           1
          hannah         1
          dominic        1
          daniel         1
aaberg    charlotte      1
Length: 16681, dtype: int64

#### Typographical errors

In [22]:
from jellyfish import damerau_levenshtein_distance

# Build a full name per patient. Missing parts are rendered as the literal
# text "<NA>", so the dropna() below only filters on patient_id and phone_number.
df_patient["full_name"] = df_patient.agg(
    lambda x: f"{x.given_name} {x.surname}", axis="columns")

# Self-join on the phone number to pair up patients sharing it, then drop
# pairs made of two records with the same identifier.
df_full_name = df_patient[["patient_id", "full_name", "phone_number"]].dropna()
df_full_name = df_full_name.merge(df_full_name, on="phone_number")
df_full_name = df_full_name[df_full_name.patient_id_x != df_full_name.patient_id_y]

# Keep each unordered pair of identifiers only once.
df_full_name["linked_ids"] = df_full_name[["patient_id_x", "patient_id_y"]].apply(
    lambda x: tuple(sorted(x)), axis="columns")
df_full_name = df_full_name.drop_duplicates("linked_ids")

# Despite its name, "similarity" holds an edit distance: lower means closer.
df_full_name["similarity"] = df_full_name.apply(
    lambda row: damerau_levenshtein_distance(row.full_name_x, row.full_name_y), axis="columns")

df_full_name[df_full_name.similarity == 1].sort_values("phone_number").head(10)

| | patient_id_x | full_name_x | phone_number | patient_id_y | full_name_y | linked_ids | similarity |
|---|---|---|---|---|---|---|---|
| 4192 | 311830 | taaila <NA> | 02 00325977 | 210155 | taalia <NA> | (210155, 311830) | 1 |
| 4193 | 311830 | taaila <NA> | 02 00325977 | 525466 | taalia <NA> | (311830, 525466) | 1 |
| 985 | 431593 | adam schumann | 02 01272164 | 123387 | adam schumajnn | (123387, 431593) | 1 |
| 991 | 123387 | adam schumajnn | 02 01272164 | 375877 | adam schumann | (123387, 375877) | 1 |
| 990 | 123387 | adam schumajnn | 02 01272164 | 505218 | adam schumann | (123387, 505218) | 1 |
| 1827 | 489678 | jacob svenson | 02 03546747 | 909797 | jaob svenson | (489678, 909797) | 1 |
| 1831 | 669354 | jacob svenson | 02 03546747 | 909797 | jaob svenson | (669354, 909797) | 1 |
| 1835 | 576055 | jacob svenson | 02 03546747 | 909797 | jaob svenson | (576055, 909797) | 1 |
| 579 | 399260 | <NA> peterssen | 02 03687263 | 776651 | <NA> petersen | (399260, 776651) | 1 |
| 574 | 950122 | <NA> petersen | 02 03687263 | 399260 | <NA> peterssen | (399260, 950122) | 1 |


In [23]:
df_full_name[df_full_name.similarity > 1].sort_values("phone_number").head(10)

| | patient_id_x | full_name_x | phone_number | patient_id_y | full_name_y | linked_ids | similarity |
|---|---|---|---|---|---|---|---|
| 6144 | 953966 | to godfrey | 02 01871708 | 405442 | thomas godfrey | (405442, 953966) | 4 |
| 16219 | 382081 | james <NA> | 02 03755662 | 687453 | jim <NA> | (382081, 687453) | 3 |
| 589 | 471602 | briekle tippuns | 02 04356284 | 106764 | brielle tippins | (106764, 471602) | 2 |
| 585 | 718467 | brielle tippins | 02 04356284 | 471602 | briekle tippuns | (471602, 718467) | 2 |
| 5705 | 496474 | stacia seddon | 02 05657798 | 963018 | anastasia seddon | (496474, 963018) | 4 |
| 5702 | 970678 | stacia seddon | 02 05657798 | 963018 | anastasia seddon | (963018, 970678) | 4 |
| 4075 | 131348 | timothy britten | 02 08766786 | 646113 | timothy bripen | (131348, 646113) | 2 |
| 10877 | 898204 | jordan ballantyne | 02 13710140 | 494861 | ballantyne jordan | (494861, 898204) | 14 |
| 10881 | 740337 | jordan ballantyne | 02 13710140 | 494861 | ballantyne jordan | (494861, 740337) | 14 |
| 10882 | 740337 | jordan ballantyne | 02 13710140 | 219095 | <NA> ballantyne | (219095, 740337) | 6 |
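
To help read the distances above, here is a pure-Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It counts insertions, deletions, substitutions and swaps of adjacent characters as one edit each. The function `osa_distance` is hypothetical, and this restricted variant is not guaranteed to agree with `jellyfish` on every input.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] is the distance between the prefixes a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(osa_distance("taaila", "taalia"))                    # 1 (adjacent swap)
print(osa_distance("jacob svenson", "jaob svenson"))       # 1 (missing letter)
print(osa_distance("briekle tippuns", "brielle tippins"))  # 2 (two substitutions)
```

A distance of 1 or 2 between names sharing a phone number is a plausible sign of a typo, whereas larger values such as swapped surname and given name call for a different comparison.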


Missing values for age and date of birth:

In [None]:
df_patient[["age", "date_of_birth"]].isna().value_counts()

Invalid dates of birth:

In [None]:
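from pandas import to_datetime

# Parse YYYYMMDD values; anything that is not a real calendar date becomes NaT.
df_patient["dob_datetime"] = to_datetime(
    df_patient.date_of_birth,
    errors="coerce",
    format="%Y%m%d",  # %m is the month; %M would be read as minutes
)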

df_patient[["age", "date_of_birth", "dob_datetime"]].sample(10, random_state=42)
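
To make the notion of an invalid date concrete, here is a minimal sketch in plain Python, assuming dates of birth are encoded as `YYYYMMDD` integers as in `date_of_birth`. The helper `parse_dob` is hypothetical and not part of the notebook.

```python
from datetime import date, datetime
from typing import Optional


def parse_dob(raw: Optional[int]) -> Optional[date]:
    """Return the date encoded as YYYYMMDD, or None if missing or invalid."""
    if raw is None:
        return None
    try:
        return datetime.strptime(str(raw), "%Y%m%d").date()
    except ValueError:
        # Raised for impossible months or days, e.g. month 13 or 29 February 1900.
        return None


print(parse_dob(19830828))  # 1983-08-28
print(parse_dob(19000229))  # None: 1900 was not a leap year
print(parse_dob(19001340))  # None: there is no month 13
```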

Inconsistencies between declared age and age computed from the date of birth:

In [None]:
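from datetime import datetime

# Approximate age in whole years. Dividing by 365 ignores leap days, so the
# result can be off by one near a birthday; rows without a valid date stay missing.
df_patient["age_from_dob"] = (datetime.today() - df_patient["dob_datetime"]).dt.days.floordiv(365)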

df_patient.head(20)
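
Dividing a number of days by 365 drifts as leap days accumulate. The sketch below, in plain Python, compares that approximation with an exact count of completed years. The function `age_on` and the reference date are illustrative assumptions, not part of the notebook.

```python
from datetime import date


def age_on(birth: date, reference: date) -> int:
    """Age in completed years at the reference date."""
    # One year is removed when the birthday has not yet occurred in the reference year.
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


birth = date(1983, 8, 28)
day_before_birthday = date(2021, 8, 27)

approximate = (day_before_birthday - birth).days // 365
exact = age_on(birth, day_before_birthday)

print(approximate)  # 38: the ten leap days since 1983 push the quotient up
print(exact)        # 37: the 38th birthday is still one day away
```

The comparison with the declared `age` also depends on the date at which that age was recorded, which the dataset does not state.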

## Analysis of states and postcodes

Missing values:

In [None]:
df_patient[["state", "postcode"]].isna().value_counts()

State values:

In [None]:
states = df_patient.state.dropna().value_counts()

print(states[:8])
print(states[8:])
print(states[8:].sum())

Inconsistencies between state and postcode:

In [None]:
postcodes = df_patient.postcode.dropna().value_counts()

Postcode and suburb swapped:

In [None]:
df_patient.loc[df_patient.suburb.str.isnumeric().fillna(False)]

## Sanitize postcode and state

- Assume that the postcode is more reliable than the state.
- Check whether every postcode is valid.
- If a postcode is invalid, try swapping it with the suburb.
- Check whether some states are invalid.
- Normalize states that contain typos.
- For missing or invalid states, infer the state from the postcode.
- Keep the existing state when the postcode is invalid.
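
The last rules can be sketched in plain Python as follows. The first-digit mapping is an assumed simplification of the Australian postcode ranges (the ACT ranges are deliberately left out), `resolve_state` is a hypothetical helper, and typo normalization and the suburb swap are not covered.

```python
import re
from typing import Optional

VALID_STATES = {"NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT"}

# Assumed, simplified mapping from the first postcode digit to a state.
# The ACT ranges, which sit inside the 0xxx and 2xxx blocks, are ignored here.
FIRST_DIGIT_TO_STATE = {
    "0": "NT", "1": "NSW", "2": "NSW", "3": "VIC", "4": "QLD",
    "5": "SA", "6": "WA", "7": "TAS", "8": "VIC", "9": "QLD",
}


def resolve_state(state: Optional[str], postcode: Optional[str]) -> Optional[str]:
    """Keep a valid state; otherwise guess it from a valid postcode."""
    normalized = state.strip().upper() if state else None
    if normalized in VALID_STATES:
        return normalized
    if postcode is not None and re.fullmatch(r"[0-9]{4}", postcode):
        return FIRST_DIGIT_TO_STATE[postcode[0]]
    # Postcode missing or invalid: keep whatever state was recorded.
    return normalized


print(resolve_state(None, "3025"))   # VIC
print(resolve_state("nws", "2640"))  # NSW
print(resolve_state("nws", "26x0"))  # NWS
```

Whether a valid state should be overridden when it conflicts with the postcode is left open here; this sketch only fills in missing or invalid states.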


In [None]:
df_patient.state = df_patient.state.str.upper()
# Restrict to records whose state is valid, then look at Victoria only.
df_patient_valid_state = df_patient[df_patient.state.isin(["NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT"])]
df_patient_vic = df_patient_valid_state.loc[df_patient_valid_state.state == "VIC"]
# Victorian postcodes are expected to be exactly four digits starting with 3 or 8.
df_patient_vic.loc[~df_patient_vic.postcode.str.fullmatch(r"[38]\d{3}")].head()

## Sanitize phone numbers

- Expected format: `XX XXXXXXXX`, where the two-digit prefix is `04` for mobile numbers.
- The area code must be consistent with the postcode.


In [None]:
# Count phone numbers that are not exactly two digits, a space, then eight digits.
df_patient.loc[~df_patient.phone_number.str.fullmatch(r"\d{2}\s\d{8}")].phone_number.count()
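
Both rules can be sketched in plain Python. The area-code mapping below is an assumption based on the usual Australian landline prefixes, and consistency is checked against the state rather than the postcode, on the assumption that the state has already been derived from it. `check_phone` is a hypothetical helper.

```python
import re
from typing import Optional

PHONE_PATTERN = re.compile(r"([0-9]{2}) ([0-9]{8})")

# Assumed mapping from landline area code to the states it serves.
AREA_CODE_TO_STATES = {
    "02": {"NSW", "ACT"},
    "03": {"VIC", "TAS"},
    "07": {"QLD"},
    "08": {"SA", "WA", "NT"},
}


def check_phone(phone: Optional[str], state: Optional[str]) -> str:
    """Classify a phone number against the expected format and the state."""
    match = PHONE_PATTERN.fullmatch(phone or "")
    if match is None:
        return "invalid format"
    area_code = match.group(1)
    if area_code == "04":
        return "mobile"  # mobile numbers are not tied to a state
    if area_code not in AREA_CODE_TO_STATES:
        return "unknown area code"
    if state is None:
        return "state unknown"
    return "consistent" if state in AREA_CODE_TO_STATES[area_code] else "inconsistent"


print(check_phone("07 22894061", "QLD"))  # consistent
print(check_phone("07 22894061", "VIC"))  # inconsistent
print(check_phone("04 70760611", "VIC"))  # mobile
print(check_phone("0722894061", "QLD"))   # invalid format
```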