# Data quality analysis

## Loading the raw data

In [1]:
from getting_started import df_patient, df_pcr

df_patient = df_patient.convert_dtypes()
df_pcr = df_pcr.convert_dtypes()

In [2]:
df_patient.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   patient_id     20000 non-null  Int64 
 1   given_name     19560 non-null  string
 2   surname        19575 non-null  string
 3   street_number  19618 non-null  Int64 
 4   address_1      19204 non-null  string
 5   suburb         19788 non-null  string
 6   postcode       19801 non-null  string
 7   state          18010 non-null  string
 8   date_of_birth  17989 non-null  Int64 
 9   age            16003 non-null  Int64 
 10  phone_number   19081 non-null  string
 11  address_2      7893 non-null   string
dtypes: Int64(4), string(8)
memory usage: 1.9 MB


In [3]:
df_patient.sample(20, random_state=42)

Unnamed: 0,patient_id,given_name,surname,street_number,address_1,suburb,postcode,state,date_of_birth,age,phone_number,address_2
10650,303462,,drilling,7,chant street,sippy downs,2600,wa,19650127.0,23.0,04 71886806,
2041,614257,luke,hammill,107,northbourne avenue,long plains,2446,qld,19840526.0,,,silverweir
8668,795416,nicholas,,22,hillebrand street,melton south,2035,tas,19841020.0,35.0,08 75018121,
1114,101062,cameron,wheatley,7,orchard place,coodanup,4878,qld,19260905.0,34.0,08 42055982,
13902,989589,emiily,wagnitz,17,rapke place,helena valley,2031,vic,19730214.0,25.0,08 01184523,templemore
11963,634140,emiily,tremellen,44,,willetton,3163,vic,19361231.0,33.0,,gannet house
11072,168274,madeleine,brammy,63,newlop street,cronulla,2031,nsw,19090301.0,32.0,07 38164479,
3002,728419,alyssa,buckoke,3,totterdell street,safety bay,4216,vic,19870126.0,28.0,02 53294110,homefield home
19771,124996,rachel,green,61,neville place,campsie,2120,qld,19090604.0,23.0,03 87755699,mlc centre
8115,298159,emiily,priest,54,,orelia,5374,vic,19041129.0,32.0,02 97502729,


In [4]:
df_pcr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8800 entries, 0 to 8799
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   patient_id  8800 non-null   Int64 
 1   pcr         8800 non-null   string
dtypes: Int64(1), string(1)
memory usage: 146.2 KB


In [5]:
df_pcr.sample(20, random_state=42)

Unnamed: 0,patient_id,pcr
95,887071,N
592,391464,P
4990,247603,N
8330,905428,N
360,126015,N
3501,579885,Negative
1780,820785,Negative
6815,352515,Positive
4098,471233,N
6633,424464,N


## Analysis of the business identifier `patient_id`

Number of patient rows that share their `patient_id` with at least one other row:

In [6]:
df_patient.duplicated(subset="patient_id", keep=False).sum()

403

Number of PCR tests linked to a duplicated `patient_id`:

In [7]:
patient_id_duplicated = df_patient.loc[
    df_patient.duplicated(subset="patient_id"),
    "patient_id"
]
df_pcr.patient_id.isin(patient_id_duplicated).sum()

168
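
The two cells above use different `keep` settings of `DataFrame.duplicated`, which changes what is counted. As a minimal sketch on made-up data (the `toy` frame is hypothetical and is not the patient table), the difference looks like this:

```python
import pandas as pd

# Toy data: id 2 appears twice, id 3 appears three times.
toy = pd.DataFrame({"patient_id": [1, 2, 2, 3, 3, 3]})

# keep=False flags every row that shares its id with another row.
rows_in_duplicate_groups = toy.duplicated(subset="patient_id", keep=False).sum()

# The default keep="first" leaves the first occurrence of each id unflagged.
extra_rows = toy.duplicated(subset="patient_id").sum()

# Number of distinct ids that occur more than once.
duplicated_ids = toy.loc[toy.duplicated(subset="patient_id"), "patient_id"].nunique()

print(rows_in_duplicate_groups, extra_rows, duplicated_ids)
```

Because `isin` compares id values, filtering the PCR table with the ids flagged under the default `keep` still catches every test attached to a duplicated id.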

## Analysis of the PCR tests

Values taken by the PCR test result:

In [8]:
df_pcr.pcr.value_counts()

N           3482
Negative    3134
Positive    1283
P            901
Name: pcr, dtype: Int64

A PCR test is either positive (values `Positive` or `P`) or negative (values `Negative` or `N`), so the results need a normalization step.

In [9]:
df_pcr["is_positive"] = df_pcr.pcr.str[0] == "P"

df_pcr.sample(10, random_state=42)

Unnamed: 0,patient_id,pcr,is_positive
95,887071,N,False
592,391464,P,True
4990,247603,N,False
8330,905428,N,False
360,126015,N,False
3501,579885,Negative,False
1780,820785,Negative,False
6815,352515,Positive,True
4098,471233,N,False
6633,424464,N,False
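
Testing only the first character treats every label that does not start with `P` as negative, including a label nobody anticipated. A slightly stricter variant, sketched here with a hypothetical helper `normalize_pcr` and an illustrative `"inconclusive"` label that does not occur in the data, uses an explicit lookup table so that unknown labels stand out:

```python
# Explicit lookup table for the four labels listed by value_counts().
PCR_LABELS = {"P": True, "Positive": True, "N": False, "Negative": False}


def normalize_pcr(label):
    """Return True/False for a known label, or None for anything else."""
    return PCR_LABELS.get(label)


labels = ["N", "Positive", "P", "Negative", "inconclusive"]
normalized = [normalize_pcr(label) for label in labels]
unexpected = [label for label in labels if normalize_pcr(label) is None]

print(normalized)
print(unexpected)
```

On the real table this could be applied with `df_pcr.pcr.map(normalize_pcr)`; with the four values observed here both approaches should give the same result.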


## Analysis of given names and surnames

Distribution of values:

In [28]:
df_patient[["given_name", "surname"]].value_counts()

given_name  surname   
emiily      white         14
            green         12
joshua      campbell      11
            white         11
william     white         10
                          ..
lukas       hanna          1
            gilbertson     1
            frahn          1
            clarke         1
aaliyah     bartel         1
Length: 16681, dtype: int64

Missing values:

In [137]:
df_patient[["given_name", "surname"]].isna().value_counts()

given_name  surname
False       False      19139
True        False        436
False       True         421
True        True           4
dtype: int64

## Analysis of ages and dates of birth

Missing values:

In [138]:
df_patient[["age", "date_of_birth"]].isna().value_counts()

age    date_of_birth
False  False            14391
True   False             3598
False  True              1612
True   True               399
dtype: int64

Invalid dates of birth:

In [40]:
from pandas import to_datetime

# date_of_birth is a YYYYMMDD integer; %m is the month directive (%M is minutes)
df_patient["dob_datetime"] = to_datetime(
    df_patient.date_of_birth,
    errors="coerce",
    format="%Y%m%d",
)

df_patient[["age", "date_of_birth", "dob_datetime"]].sample(10, random_state=42)

Unnamed: 0,age,date_of_birth,dob_datetime
10650,23.0,19650127,1965-01-27
2041,,19840526,1984-05-26
8668,35.0,19841020,1984-10-20
1114,34.0,19260905,1926-09-05
13902,25.0,19730214,1973-02-14
11963,33.0,19361231,1936-12-31
11072,32.0,19090301,1909-03-01
3002,28.0,19870126,1987-01-26
19771,23.0,19090604,1909-06-04
8115,32.0,19041129,1904-11-29
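
With `errors="coerce"`, values that cannot be parsed become `NaT`, so the invalid dates are the rows where `date_of_birth` is present but the parsed value is missing. The same validity check can be sketched per value with the standard library. `parse_dob` is a hypothetical helper, and 19840231 is a made-up value used only to show an impossible day:

```python
from datetime import datetime


def parse_dob(raw):
    """Parse a YYYYMMDD integer into a date; return None if missing or invalid."""
    try:
        text = str(int(raw))
        # Require exactly 8 digits so that shorter values are not parsed ambiguously.
        if len(text) != 8:
            return None
        return datetime.strptime(text, "%Y%m%d").date()
    except (TypeError, ValueError):
        return None


print(parse_dob(19650127))  # a valid date
print(parse_dob(19760001))  # month 00 does not exist
print(parse_dob(19840231))  # 31 February does not exist
print(parse_dob(None))      # missing value
```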


Inconsistencies between the declared age and the age computed from the date of birth:

In [93]:
from datetime import datetime

# Approximate age in whole years: elapsed days floor-divided by 365 (leap days are ignored)
elapsed = datetime.today() - df_patient["dob_datetime"]
df_patient["age_from_dob"] = elapsed.dt.days.floordiv(365)

df_patient.head(20)

Unnamed: 0,patient_id,given_name,surname,street_number,address_1,suburb,postcode,state,date_of_birth,age,phone_number,address_2,dob_datetime,age_from_dob
0,221958,matisse,clarke,13,rene street,ellenbrook,2527,wa,19710708.0,32.0,08 86018809,westella,1971-07-08,49.0
1,771155,joshua,elrick,23,andrea place,east preston,2074,nsw,19120921.0,34.0,02 97793152,foxdown,1912-09-21,108.0
2,231932,alice,conboy,35,mountain circuit,prospect,2305,nsw,19810905.0,22.0,02 20403934,,1981-09-05,39.0
3,465838,sienna,craswell,39,cumberlegeicrescent,henty,3620,wa,19840809.0,30.0,02 62832318,jodane,1984-08-09,36.0
4,359178,joshua,bastiaans,144,lowrie street,campbell town,4051,nsw,19340430.0,31.0,03 69359594,,1934-04-30,86.0
5,744167,ky,laing,448,nyawi place,barmera,3556,qld,19050919.0,32.0,03 59872070,,1905-09-19,115.0
6,210268,matthew,laing,11,barnes place,laurieton,2160,nsw,19061018.0,29.0,02 86925029,,1906-10-18,113.0
7,832180,jack,renfrey,27,osmand street,maribyrnong,2170,qld,19610518.0,31.0,03 15575583,dhurringill,1961-05-18,59.0
8,154886,adele,ryan,76,house circuit,new farm,2200,qld,19430102.0,33.0,07 37444521,,1943-01-02,77.0
9,237337,breeanne,wynne,12,cowper street,bonnet bay,2062,qld,19030606.0,35.0,08 24888117,,1903-06-06,117.0
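
Dividing elapsed days by 365 ignores leap days, so the result can be off by one around a birthday. A minimal sketch of an alternative that counts actual birthdays is shown below. The helpers `age_on` and `ages_disagree`, the one-year tolerance, and the fixed reference date are all assumptions made for illustration; the notebook itself uses the date of execution.

```python
from datetime import date


def age_on(dob, reference):
    """Completed years between dob and reference, counting actual birthdays."""
    years = reference.year - dob.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (dob.month, dob.day):
        years -= 1
    return years


def ages_disagree(declared_age, dob, reference, tolerance=1):
    """True when declared and computed ages differ by more than `tolerance` years."""
    return abs(declared_age - age_on(dob, reference)) > tolerance


reference = date(2021, 1, 1)  # assumed reference date, for illustration only
computed = age_on(date(1971, 7, 8), reference)
flagged = ages_disagree(32, date(1971, 7, 8), reference)
print(computed, flagged)
```

The example uses the first row shown above (born 1971-07-08, declared age 32), which is flagged as inconsistent under these assumptions.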


## Analysis of states and postcodes

Missing values:

In [41]:
df_patient[["state", "postcode"]].isna().value_counts()

state  postcode
False  False       17828
True   False        1973
False  True          182
True   True           17
dtype: int64

State values:

In [48]:
states = df_patient.state.dropna().value_counts()

print(states[:8])
print(states[8:])
print(states[8:].sum())

nsw    6143
vic    4352
qld    3516
wa     1580
sa     1391
tas     507
act     250
nt      132
Name: state, dtype: Int64
nss    7
ws     6
ns     6
ql     5
nsq    4
      ..
nze    1
ai     1
vi     1
w      1
nfw    1
Name: state, Length: 94, dtype: Int64
139
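
Most of the rare values look like typos of the eight valid abbreviations. One possible way to map them back, sketched here with the standard library's `difflib`, is fuzzy matching against the valid list. `normalize_state` is a hypothetical helper and the 0.6 cutoff is an arbitrary assumption; this is not necessarily the approach used later in this notebook.

```python
from difflib import get_close_matches

VALID_STATES = ["nsw", "vic", "qld", "wa", "sa", "tas", "act", "nt"]


def normalize_state(raw, cutoff=0.6):
    """Return the closest valid state abbreviation, or None if nothing is close enough."""
    if not isinstance(raw, str):
        return None
    value = raw.strip().lower()
    if value in VALID_STATES:
        return value
    matches = get_close_matches(value, VALID_STATES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for raw in ["nsw", "nss", "ql", "vi", "xyz"]:
    print(raw, "->", normalize_state(raw))
```

Very short or ambiguous inputs can stay unresolved or resolve to the wrong state, so the mapping is worth reviewing by hand before it is applied.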


State / postcode inconsistencies:

In [50]:
postcodes = df_patient.postcode.dropna().value_counts()

Postcode and suburb swapped:

In [53]:
df_patient.loc[df_patient.suburb.str.isnumeric()]

Unnamed: 0,patient_id,given_name,surname,street_number,address_1,suburb,postcode,state,date_of_birth,age,phone_number,address_2,dob_datetime
3976,810644,juliana,grosvenor,5,connelly pace,3023,port noarlunga south,tas,19991215.0,,03 55227740,,1999-12-15
4080,986559,kirra,choi-lundberg,102,centaurus street,6168,naremburn,vic,19261104.0,27.0,08 69584599,,1926-11-04
5792,752873,lochlan,blake,258,,4216,toowoobma,wa,19080821.0,31.0,02 84630666,,1908-08-21
6218,902348,isaac,nakoje,19,collier street,6017,brighton,,19640421.0,8.0,02 69439226,,1964-04-21
6618,678110,jaden,green,5,dovey place,3185,oraneg,vic,19151204.0,23.0,02 73534391,,1915-12-04
9653,690348,andrew,ryan,20,mainwaring rich circuit,3020,blacktown,wa,19760001.0,22.0,,,NaT
11333,738103,hugi,pascoe,167,leita court,3023,port lincoln,nsw,19040401.0,,07 84786511,,1904-04-01
14255,684359,sonia,green,50,kalgoorlie crescent,6112,ashfield,sa,,9.0,03 46671647,,NaT
15479,355033,abby,yoob,243,weston street,3181,forest hill,,19660615.0,26.0,02 68667816,,1966-06-15
15575,572694,isabella,beddimg,17,heidelberg street,3764,toowoomba,nsw,19510108.0,,03 01075733,laurel bank,1951-01-08
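
In these rows the suburb holds a 4-digit code and the postcode holds a place name. A minimal sketch of a repair on made-up rows follows. The `toy` frame is hypothetical, and the rule "suburb looks like a postcode and postcode does not" is an assumption rather than the notebook's confirmed method.

```python
import pandas as pd

# Illustrative rows: the first has suburb and postcode swapped, the others do not.
toy = pd.DataFrame(
    {
        "suburb": ["3023", "henty", None],
        "postcode": ["port noarlunga south", "3620", "2000"],
    }
)

# A row is treated as swapped when the suburb looks like a 4-digit postcode
# and the postcode does not. Missing values count as "no match".
suburb_is_code = toy["suburb"].str.fullmatch(r"\d{4}", na=False)
postcode_is_code = toy["postcode"].str.fullmatch(r"\d{4}", na=False)
swapped = suburb_is_code & ~postcode_is_code

# Exchange the two columns on those rows only; to_numpy() avoids label alignment.
toy.loc[swapped, ["suburb", "postcode"]] = toy.loc[
    swapped, ["postcode", "suburb"]
].to_numpy()

print(toy)
```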


## Sanitize postcode and state

- Assume the postcode is more reliable than the state.
- Check that every postcode is valid.
- If a postcode is invalid, try swapping it with the suburb.
- Check which states are invalid.
- Normalize states that contain typos.
- For missing or invalid states, guess the state from the postcode.
- Keep the existing state when the postcode is invalid.
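
The last steps of this plan can be sketched as a lookup from postcode to state. The ranges below are an approximation of the usual Australian allocation and are an assumption to verify against an authoritative postcode list. `guess_state` and `fill_state` are hypothetical helpers that encode one reading of the plan.

```python
VALID_STATES = {"NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT"}

# Approximate inclusive postcode ranges per state (assumption, to be verified).
POSTCODE_RANGES = [
    (200, 299, "ACT"),
    (800, 999, "NT"),
    (1000, 2599, "NSW"),
    (2600, 2618, "ACT"),
    (2619, 2899, "NSW"),
    (2900, 2920, "ACT"),
    (2921, 2999, "NSW"),
    (3000, 3999, "VIC"),
    (4000, 4999, "QLD"),
    (5000, 5999, "SA"),
    (6000, 6999, "WA"),
    (7000, 7999, "TAS"),
    (8000, 8999, "VIC"),
    (9000, 9999, "QLD"),
]


def guess_state(postcode):
    """Return the state implied by a 4-digit postcode, or None if it is not valid."""
    text = str(postcode).strip()
    if not (len(text) == 4 and text.isascii() and text.isdigit()):
        return None
    number = int(text)
    for low, high, state in POSTCODE_RANGES:
        if low <= number <= high:
            return state
    return None


def fill_state(state, postcode):
    """Keep a valid state; otherwise guess from the postcode, else keep the input."""
    if state in VALID_STATES:
        return state
    guessed = guess_state(postcode)
    return guessed if guessed is not None else state


print(guess_state("3620"), guess_state("2600"), guess_state("abcd"))
print(fill_state(None, "2527"), fill_state("QLD", "2527"))
```

Some border localities do not follow these ranges, so a lookup like this is best treated as a first pass.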


In [161]:
VALID_STATES = ["NSW", "VIC", "QLD", "WA", "SA", "TAS", "ACT", "NT"]

df_patient.state = df_patient.state.str.upper()
# Keep only the rows whose state is a recognised abbreviation
df_patient_valid_state = df_patient[df_patient.state.isin(VALID_STATES)]
df_patient_vic = df_patient_valid_state.loc[df_patient_valid_state.state == "VIC"]
# Victorian postcodes start with 3 or 8; list the rows that break that rule
df_patient_vic.loc[~df_patient_vic.postcode.str.match(r"[38]\d{3}")].head()

Unnamed: 0,patient_id,given_name,surname,street_number,address_1,suburb,postcode,state,date_of_birth,age,phone_number,address_2,dob_sanitized
24,299041,harrison,neumann,60,eggleston crescent,tweed heads,2560,VIC,19700617.0,22.0,04 13196831,,1970-06-17
35,256730,xani,soulemezis,34,macrossan rescent,mullumbimby creek,2192,VIC,19890205.0,,07 24955924,sec 1,1989-02-05
37,484681,chloe,gimbrere,53,o'connor circuit,wynnum west,2317,VIC,19130421.0,28.0,08 69503221,,1913-04-21
42,552366,dylan,leane,158,byrne street,landsborough,2528,VIC,,,07 12047561,,NaT
48,428947,abbie,fitzpatrick,7,jacka crescent,prospect,5091,VIC,19700122.0,35.0,03 41475097,,1970-01-22


## Sanitize phone numbers

- Format `XX XXXXXXXX`, where the prefix is `04` for mobile numbers
- Must be consistent with the postcode


In [170]:
df_patient.loc[~df_patient.phone_number.str.match(r"\d{2}\s\d{8}")].phone_number.count()

0
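
Every non-missing number already matches the expected format, so the remaining rule is consistency with the location. A minimal sketch of that check follows, assuming the usual landline area codes (02 for NSW and ACT, 03 for VIC and TAS, 07 for QLD, 08 for WA, SA and NT) and assuming the state has already been derived from the postcode. `phone_matches_state` is a hypothetical helper.

```python
import re

# Landline area codes and the states they usually cover (approximate; border
# areas can differ).
AREA_CODE_STATES = {
    "02": {"NSW", "ACT"},
    "03": {"VIC", "TAS"},
    "07": {"QLD"},
    "08": {"WA", "SA", "NT"},
}
PHONE_PATTERN = re.compile(r"(\d{2})\s(\d{8})")


def phone_matches_state(phone, state):
    """True/False for a well-formed number; None if the number is missing or malformed."""
    if not isinstance(phone, str):
        return None
    match = PHONE_PATTERN.fullmatch(phone)
    if match is None:
        return None
    prefix = match.group(1)
    if prefix == "04":
        # Mobile numbers are not tied to a state.
        return True
    return state in AREA_CODE_STATES.get(prefix, set())


print(phone_matches_state("02 97793152", "NSW"))
print(phone_matches_state("04 71886806", "WA"))
print(phone_matches_state("08 86018809", "VIC"))
```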