# Cleaning the COUGHVID crowdsourcing dataset
Cough audio signal classification has been successfully used to diagnose a variety of respiratory conditions, and there has been significant interest in leveraging Machine Learning (ML) to provide widespread COVID-19 screening. The COUGHVID dataset provides over 25,000 crowdsourced cough recordings representing a wide range of participant ages, genders, geographic locations, and COVID-19 statuses.
__[The link to the article](https://www.nature.com/articles/s41597-021-00937-4)__

## Initial data cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import librosa
import librosa.display

In [2]:
df = pd.read_csv('metadata_compiled.csv', index_col = 0)
df.head(10)

Unnamed: 0,uuid,datetime,cough_detected,latitude,longitude,age,gender,respiratory_condition,fever_muscle_pain,status,...,quality_4,cough_type_4,dyspnea_4,wheezing_4,stridor_4,choking_4,congestion_4,nothing_4,diagnosis_4,severity_4
0,00014dcc-0f06-4c27-8c7b-737b18a2cf4c,2020-11-25T18:58:50.488301+00:00,0.0155,48.9,2.4,,,,,,...,,,,,,,,,,
1,00039425-7f3a-42aa-ac13-834aaa2b6b92,2020-04-13T21:30:59.801831+00:00,0.9609,31.3,34.8,15.0,male,False,False,healthy,...,,,,,,,,,,
2,0007c6f1-5441-40e6-9aaf-a761d8f2da3b,2020-10-18T15:38:38.205870+00:00,0.1643,,,46.0,female,False,False,healthy,...,,,,,,,,,,
3,00098cdb-4da1-4aa7-825a-4f1b9abc214b,2021-01-22T22:08:06.742577+00:00,0.1133,47.4,9.4,66.0,female,False,False,healthy,...,,,,,,,,,,
4,0009eb28-d8be-4dc1-92bb-907e53bc5c7a,2020-04-12T04:02:18.159383+00:00,0.9301,40.0,-75.1,34.0,male,True,False,healthy,...,,,,,,,,,,
5,0012c608-33d0-4ef7-bde3-75a0b1a0024e,2020-04-15T01:03:59.029326+00:00,0.0482,-16.5,-71.5,,,,,,...,,,,,,,,,,
6,001328dc-ea5d-4847-9ccf-c5aa2a3f2d0f,2020-04-13T22:23:06.997578+00:00,0.9968,,,21.0,male,False,False,healthy,...,,,,,,,,,,
7,00196ba6-0087-484b-a104-3e8884599596,2021-05-28T15:47:24.337832+00:00,0.3079,,,,,,,,...,,,,,,,,,,
8,001c85a8-cc4d-4921-9297-848be52d4715,2020-04-17T15:24:35.822355+00:00,0.0735,40.6,-3.6,,,,,,...,,,,,,,,,,
9,001d8e33-a4af-4edb-98ba-b03f891d9a6c,2020-05-13T01:27:42.552773+00:00,0.0306,13.8,-89.6,,female,False,True,COVID-19,...,,,,,,,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34434 entries, 0 to 34433
Data columns (total 51 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   uuid                   34434 non-null  object 
 1   datetime               34434 non-null  object 
 2   cough_detected         34434 non-null  float64
 3   latitude               19431 non-null  float64
 4   longitude              19431 non-null  float64
 5   age                    19396 non-null  float64
 6   gender                 20664 non-null  object 
 7   respiratory_condition  20664 non-null  object 
 8   fever_muscle_pain      20664 non-null  object 
 9   status                 20664 non-null  object 
 10  status_SSL             8331 non-null   object 
 11  quality_1              820 non-null    object 
 12  cough_type_1           820 non-null    object 
 13  dyspnea_1              820 non-null    object 
 14  wheezing_1             820 non-null    object 
 15  st

In [4]:
df.isnull().sum()

uuid                         0
datetime                     0
cough_detected               0
latitude                 15003
longitude                15003
age                      15038
gender                   13770
respiratory_condition    13770
fever_muscle_pain        13770
status                   13770
status_SSL               26103
quality_1                33614
cough_type_1             33614
dyspnea_1                33614
wheezing_1               33614
stridor_1                33614
choking_1                33614
congestion_1             33614
nothing_1                33614
diagnosis_1              33614
severity_1               33614
quality_2                33614
cough_type_2             33615
dyspnea_2                33614
wheezing_2               33614
stridor_2                33614
choking_2                33614
congestion_2             33614
nothing_2                33614
diagnosis_2              33614
severity_2               33614
quality_3                33614
cough_ty
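
The raw counts above are easier to compare when expressed as a share of all rows. The following is a minimal sketch on a made-up toy frame; the `missing_summary` helper and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata table (values are invented for illustration).
toy = pd.DataFrame({
    "uuid": ["a", "b", "c", "d"],
    "age": [15.0, np.nan, 46.0, np.nan],
    "quality_1": [np.nan, np.nan, np.nan, "ok"],
})

def missing_summary(frame):
    """Return missing-value counts and percentages, most incomplete column first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Applied to the real table, the same helper could be called as `missing_summary(df)`.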

In [5]:
#Removing unnecessary columns
df = df.drop(['datetime', 'longitude', 'latitude'], axis = 1)
df.columns

Index(['uuid', 'cough_detected', 'age', 'gender', 'respiratory_condition',
       'fever_muscle_pain', 'status', 'status_SSL', 'quality_1',
       'cough_type_1', 'dyspnea_1', 'wheezing_1', 'stridor_1', 'choking_1',
       'congestion_1', 'nothing_1', 'diagnosis_1', 'severity_1', 'quality_2',
       'cough_type_2', 'dyspnea_2', 'wheezing_2', 'stridor_2', 'choking_2',
       'congestion_2', 'nothing_2', 'diagnosis_2', 'severity_2', 'quality_3',
       'cough_type_3', 'dyspnea_3', 'wheezing_3', 'stridor_3', 'choking_3',
       'congestion_3', 'nothing_3', 'diagnosis_3', 'severity_3', 'quality_4',
       'cough_type_4', 'dyspnea_4', 'wheezing_4', 'stridor_4', 'choking_4',
       'congestion_4', 'nothing_4', 'diagnosis_4', 'severity_4'],
      dtype='object')

Let's find out which columns are numerical and which are categorical.

In [6]:
numerical_cols = df.select_dtypes(include='number').columns
print('Numerical columns')
print(numerical_cols)

Numerical columns
Index(['cough_detected', 'age'], dtype='object')


In [7]:
categorical_cols = df.select_dtypes(include=['object']).columns
display('Categorical columns')
display(categorical_cols)

'Categorical columns'

Index(['uuid', 'gender', 'respiratory_condition', 'fever_muscle_pain',
       'status', 'status_SSL', 'quality_1', 'cough_type_1', 'dyspnea_1',
       'wheezing_1', 'stridor_1', 'choking_1', 'congestion_1', 'nothing_1',
       'diagnosis_1', 'severity_1', 'quality_2', 'cough_type_2', 'dyspnea_2',
       'wheezing_2', 'stridor_2', 'choking_2', 'congestion_2', 'nothing_2',
       'diagnosis_2', 'severity_2', 'quality_3', 'cough_type_3', 'dyspnea_3',
       'wheezing_3', 'stridor_3', 'choking_3', 'congestion_3', 'nothing_3',
       'diagnosis_3', 'severity_3', 'quality_4', 'cough_type_4', 'dyspnea_4',
       'wheezing_4', 'stridor_4', 'choking_4', 'congestion_4', 'nothing_4',
       'diagnosis_4', 'severity_4'],
      dtype='object')
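
Several of the columns listed as categorical (for example `respiratory_condition` or `dyspnea_1`) appear to hold only True/False flags with gaps, which is presumably why pandas stores them as `object`. The sketch below shows one possible way to convert such a column to pandas' nullable `boolean` dtype. The toy series is an assumption; the notebook itself does not perform this conversion.

```python
import numpy as np
import pandas as pd

# Toy column mimicking a flag column with gaps: Python bools mixed with NaN,
# stored with dtype object (assumed to match how the CSV is loaded).
flags = pd.Series(
    [True, False, np.nan, False], dtype=object, name="respiratory_condition"
)

# The nullable boolean dtype keeps the gap as <NA> instead of forcing object.
converted = flags.astype("boolean")

print(flags.dtype, "->", converted.dtype)
print(converted.sum(), "True values,", converted.isna().sum(), "missing")
```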

In [8]:
#let's find out unique values in each categorical column (except for 'uuid' column)

for column in categorical_cols[1:]:
    unique_values = df[column].value_counts()
    non_null_values = df[column].count()
    display(f'Column: {column} - Non value values: {non_null_values}')
    display(unique_values)
    display()

'Column: gender - Non value values: 20664'

male      12850
female     7682
other       132
Name: gender, dtype: int64

'Column: respiratory_condition - Non value values: 20664'

False    17107
True      3557
Name: respiratory_condition, dtype: int64

'Column: fever_muscle_pain - Non value values: 20664'

False    18179
True      2485
Name: fever_muscle_pain, dtype: int64

'Column: status - Non value values: 20664'

healthy        15476
symptomatic     3873
COVID-19        1315
Name: status, dtype: int64

'Column: status_SSL - Non value values: 8331'

healthy     8046
COVID-19     285
Name: status_SSL, dtype: int64

'Column: quality_1 - Non value values: 820'

ok          614
poor        156
good         32
no_cough     18
Name: quality_1, dtype: int64

'Column: cough_type_1 - Non value values: 820'

dry        425
unknown    323
wet         72
Name: cough_type_1, dtype: int64

'Column: dyspnea_1 - Non value values: 820'

False    815
True       5
Name: dyspnea_1, dtype: int64

'Column: wheezing_1 - Non value values: 820'

False    765
True      55
Name: wheezing_1, dtype: int64

'Column: stridor_1 - Non value values: 820'

False    820
Name: stridor_1, dtype: int64

'Column: choking_1 - Non value values: 820'

False    818
True       2
Name: choking_1, dtype: int64

'Column: congestion_1 - Non value values: 820'

False    806
True      14
Name: congestion_1, dtype: int64

'Column: nothing_1 - Non value values: 820'

True     746
False     74
Name: nothing_1, dtype: int64

'Column: diagnosis_1 - Non value values: 820'

COVID-19               279
healthy_cough          259
lower_infection        244
upper_infection         23
obstructive_disease     15
Name: diagnosis_1, dtype: int64

'Column: severity_1 - Non value values: 820'

mild           521
pseudocough    249
severe          34
unknown         16
Name: severity_1, dtype: int64

'Column: quality_2 - Non value values: 820'

good        422
ok          286
poor         93
no_cough     19
Name: quality_2, dtype: int64

'Column: cough_type_2 - Non value values: 819'

dry        600
wet        133
unknown     86
Name: cough_type_2, dtype: int64

'Column: dyspnea_2 - Non value values: 820'

False    724
True      96
Name: dyspnea_2, dtype: int64

'Column: wheezing_2 - Non value values: 820'

False    737
True      83
Name: wheezing_2, dtype: int64

'Column: stridor_2 - Non value values: 820'

False    788
True      32
Name: stridor_2, dtype: int64

'Column: choking_2 - Non value values: 820'

False    795
True      25
Name: choking_2, dtype: int64

'Column: congestion_2 - Non value values: 820'

False    797
True      23
Name: congestion_2, dtype: int64

'Column: nothing_2 - Non value values: 820'

True     586
False    234
Name: nothing_2, dtype: int64

'Column: diagnosis_2 - Non value values: 820'

COVID-19               285
upper_infection        183
lower_infection        173
obstructive_disease    112
healthy_cough           67
Name: diagnosis_2, dtype: int64

'Column: severity_2 - Non value values: 820'

mild           572
unknown        139
severe          59
pseudocough     50
Name: severity_2, dtype: int64

'Column: quality_3 - Non value values: 820'

good        749
ok           31
no_cough     25
poor         15
Name: quality_3, dtype: int64

'Column: cough_type_3 - Non value values: 795'

dry        358
wet        291
unknown    146
Name: cough_type_3, dtype: int64

'Column: dyspnea_3 - Non value values: 820'

False    818
True       2
Name: dyspnea_3, dtype: int64

'Column: wheezing_3 - Non value values: 820'

False    818
True       2
Name: wheezing_3, dtype: int64

'Column: stridor_3 - Non value values: 820'

False    819
True       1
Name: stridor_3, dtype: int64

'Column: choking_3 - Non value values: 820'

False    820
Name: choking_3, dtype: int64

'Column: congestion_3 - Non value values: 820'

False    817
True       3
Name: congestion_3, dtype: int64

'Column: nothing_3 - Non value values: 820'

True     787
False     33
Name: nothing_3, dtype: int64

'Column: diagnosis_3 - Non value values: 793'

upper_infection        364
healthy_cough          199
lower_infection        194
obstructive_disease     35
COVID-19                 1
Name: diagnosis_3, dtype: int64

'Column: severity_3 - Non value values: 796'

mild           447
pseudocough    196
severe         105
unknown         48
Name: severity_3, dtype: int64

'Column: quality_4 - Non value values: 820'

good        664
ok          111
poor         28
no_cough     17
Name: quality_4, dtype: int64

'Column: cough_type_4 - Non value values: 803'

dry        654
wet        120
unknown     29
Name: cough_type_4, dtype: int64

'Column: dyspnea_4 - Non value values: 820'

False    806
True      14
Name: dyspnea_4, dtype: int64

'Column: wheezing_4 - Non value values: 820'

False    804
True      16
Name: wheezing_4, dtype: int64

'Column: stridor_4 - Non value values: 820'

False    810
True      10
Name: stridor_4, dtype: int64

'Column: choking_4 - Non value values: 820'

False    820
Name: choking_4, dtype: int64

'Column: congestion_4 - Non value values: 820'

False    743
True      77
Name: congestion_4, dtype: int64

'Column: nothing_4 - Non value values: 820'

True     684
False    136
Name: nothing_4, dtype: int64

'Column: diagnosis_4 - Non value values: 791'

upper_infection        313
healthy_cough          221
lower_infection        121
COVID-19                84
obstructive_disease     52
Name: diagnosis_4, dtype: int64

'Column: severity_4 - Non value values: 802'

mild           472
pseudocough    221
severe          69
unknown         40
Name: severity_4, dtype: int64

## Analysing 'quality' columns
To enhance the quality of the dataset with clinically validated information, four expert physicians reviewed 1000 recordings, selecting one of the predefined options for audio quality:
 - Good: Cough present with minimal background noise
 - Ok: Cough present with background noise
 - Poor: Cough present with significant background noise
 - No cough present
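
The four options form a natural order from worst to best, so one way to compare experts is to map the labels to numbers. This is only a sketch: the `QUALITY_SCORE` mapping and the toy frame are assumptions made here, not an encoding defined by the dataset.

```python
import numpy as np
import pandas as pd

# Assumed ordinal encoding of the four labels; the scores are illustrative.
QUALITY_SCORE = {"no_cough": 0, "poor": 1, "ok": 2, "good": 3}

toy_quality = pd.DataFrame({
    "quality_1": ["ok", np.nan, "poor"],
    "quality_2": ["good", np.nan, np.nan],
    "quality_3": [np.nan, "no_cough", "good"],
})

# Map every label to its score; cells without a rating stay NaN.
scores = toy_quality.apply(lambda col: col.map(QUALITY_SCORE))

# Row-wise mean over the experts who actually rated the recording.
mean_score = scores.mean(axis=1)
print(mean_score)
```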

In [9]:
#Let's analyse the quality columns by collecting them in a separate dataframe
quality_df = pd.DataFrame().assign(quality_1 = df["quality_1"], 
                                   quality_2 = df["quality_2"], 
                                   quality_3 = df["quality_3"], 
                                   quality_4 = df["quality_4"])
quality_df.head(10)

Unnamed: 0,quality_1,quality_2,quality_3,quality_4
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
5,,,,
6,,,,
7,,,,
8,,,,
9,,,,


In [10]:
# There are rows with all NaNs, let's drop them
quality_df = quality_df.dropna(how='all')
quality_df

Unnamed: 0,quality_1,quality_2,quality_3,quality_4
14,,,,good
16,,,good,
42,,good,,
51,ok,,,
70,,,good,
...,...,...,...,...
34407,,,good,
34411,no_cough,,,
34413,,,,ok
34421,,,,poor


In [11]:
#checking unique values in each column
for column in quality_df:
    unique_values = quality_df[column].value_counts()
    display(unique_values)

ok          614
poor        156
good         32
no_cough     18
Name: quality_1, dtype: int64

good        422
ok          286
poor         93
no_cough     19
Name: quality_2, dtype: int64

good        749
ok           31
no_cough     25
poor         15
Name: quality_3, dtype: int64

good        664
ok          111
poor         28
no_cough     17
Name: quality_4, dtype: int64

In [12]:
#Let's create a separate dataframe with the rows where at least one expert rated the quality as 'good' or 'ok'
fair_quality_df = quality_df[quality_df.isin(['good','ok']).any(axis=1)]
fair_quality_df

Unnamed: 0,quality_1,quality_2,quality_3,quality_4
14,,,,good
16,,,good,
42,,good,,
51,ok,,,
70,,,good,
...,...,...,...,...
34380,ok,,,
34394,,,good,
34407,,,good,
34413,,,,ok


In [13]:
#Identifying rows where more than one expert assessed the audio quality
expert_fair_quality = fair_quality_df[fair_quality_df.notnull().sum(axis=1)>1]
expert_fair_quality

Unnamed: 0,quality_1,quality_2,quality_3,quality_4
186,good,good,good,good
266,poor,ok,good,good
2077,ok,poor,good,good
2079,ok,ok,good,good
3486,poor,poor,good,ok
...,...,...,...,...
33151,ok,good,good,good
33544,poor,good,good,good
34099,ok,good,good,good
34290,ok,ok,good,good
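
For rows rated by several experts, it can help to check how often the experts agree. The sketch below counts raters and distinct labels per row on a small toy frame; the toy values and the variable names `n_raters`, `n_labels`, `multi`, and `unanimous` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings; the index values imitate row labels from the metadata table.
toy_quality = pd.DataFrame(
    {
        "quality_1": ["good", "poor", "ok", np.nan],
        "quality_2": ["good", "ok", np.nan, np.nan],
        "quality_3": ["good", "good", np.nan, "good"],
        "quality_4": ["good", "good", np.nan, np.nan],
    },
    index=[186, 266, 51, 16],
)

# Number of experts who rated each recording.
n_raters = toy_quality.notnull().sum(axis=1)

# Number of distinct labels per recording (NaN is not counted).
n_labels = toy_quality.nunique(axis=1)

# Recordings rated by more than one expert, and those where all raters agree.
multi = toy_quality[n_raters > 1]
unanimous = toy_quality[(n_raters > 1) & (n_labels == 1)]

print(multi)
print(unanimous)
```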


In [14]:
#This is an example of a row containing all four experts' assessments.
#We may need it later in our analysis.
#186 is an index label taken from the table above, so we use label-based lookup.
df.loc[186]

uuid                     01567151-7bb2-45ee-9aa8-a1332b5941ea
cough_detected                                          0.982
age                                                       NaN
gender                                                    NaN
respiratory_condition                                     NaN
fever_muscle_pain                                         NaN
status                                                    NaN
status_SSL                                                NaN
quality_1                                                good
cough_type_1                                              dry
dyspnea_1                                               False
wheezing_1                                              False
stridor_1                                               False
choking_1                                                True
congestion_1                                            False
nothing_1                                               False
diagnosi

## Exploring records with a cough_detected value of 0.8 or more
A cough_detected threshold of 0.8 is recommended because this threshold exhibits an average precision of 95.4%. Therefore, only about 4.6% of the recordings with a cough_detected probability above this threshold can be expected to contain non-cough events, which is not a large enough portion of the dataset to significantly bias cough classification algorithms.
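
The arithmetic behind this can be sketched as follows: with a precision of 95.4%, the expected share of non-cough recordings among those kept is 1 - 0.954 = 0.046. The toy frame and the constant names `THRESHOLD` and `PRECISION` are assumptions for illustration; only the two numeric values come from the text.

```python
import pandas as pd

THRESHOLD = 0.8    # cough_detected cut-off quoted in the text
PRECISION = 0.954  # average precision quoted in the text

# Toy classifier outputs (invented values).
toy = pd.DataFrame({
    "uuid": ["a", "b", "c", "d", "e"],
    "cough_detected": [0.0155, 0.9609, 0.8, 0.1643, 0.9301],
})

# Keep recordings at or above the threshold.
kept = toy[toy["cough_detected"] >= THRESHOLD]

# Share of the toy data that survives the filter.
share_kept = len(kept) / len(toy)

# Expected number of kept recordings that do not actually contain a cough.
expected_non_cough = (1 - PRECISION) * len(kept)

print(len(kept), share_kept, round(expected_non_cough, 3))
```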

In [15]:
threshold_value_df = df[df['cough_detected']>=0.8]
threshold_value_df

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_1,cough_type_1,...,quality_4,cough_type_4,dyspnea_4,wheezing_4,stridor_4,choking_4,congestion_4,nothing_4,diagnosis_4,severity_4
1,00039425-7f3a-42aa-ac13-834aaa2b6b92,0.9609,15.0,male,False,False,healthy,healthy,,,...,,,,,,,,,,
4,0009eb28-d8be-4dc1-92bb-907e53bc5c7a,0.9301,34.0,male,True,False,healthy,healthy,,,...,,,,,,,,,,
6,001328dc-ea5d-4847-9ccf-c5aa2a3f2d0f,0.9968,21.0,male,False,False,healthy,healthy,,,...,,,,,,,,,,
12,0028b68c-aca4-4f4f-bb1d-cb4ed5bbd952,0.8937,28.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
13,00291cce-36a0-4a29-9e2d-c1d96ca17242,0.9883,15.0,male,False,False,healthy,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34425,ffedc843-bfc2-4ad6-a749-2bc86bdac84a,0.9498,23.0,male,False,False,healthy,healthy,,,...,good,dry,False,False,False,False,False,True,healthy_cough,pseudocough
34426,ffeea120-92a4-40f9-b692-c3865c7a983f,0.9784,22.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
34428,fff30afc-db62-4408-a585-07ca9a254fcc,0.9698,,,,,,,,,...,,,,,,,,,,
34432,fffce9f0-a5e8-4bee-b13b-c671aac4a61c,0.9754,,,,,,,,,...,,,,,,,,,,


In [16]:
#Let's remove all the rows without any expert assessment
quality_expert_df = threshold_value_df.dropna(
    subset = ['quality_1', 'quality_2', 'quality_3', 'quality_4'], 
    how='all')
quality_expert_df

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_1,cough_type_1,...,quality_4,cough_type_4,dyspnea_4,wheezing_4,stridor_4,choking_4,congestion_4,nothing_4,diagnosis_4,severity_4
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,...,good,dry,False,False,False,False,False,True,upper_infection,mild
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,...,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,...,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,...,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,...,ok,dry,False,False,False,False,False,True,COVID-19,mild
34421,ffe0658f-bade-4654-ad79-40a468aabb03,0.9846,22.0,male,True,True,COVID-19,,,,...,poor,unknown,False,False,False,False,False,False,,unknown


In [17]:
#Identifying rows where audio quality is either 'good' or 'ok'
good_quality_expert_df = quality_expert_df[quality_expert_df.isin(
    {'quality_1': ['good','ok'],
    'quality_2': ['good','ok'],
    'quality_3': ['good','ok'],
    'quality_4': ['good','ok']}).any(axis=1)]

good_quality_expert_df

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_1,cough_type_1,...,quality_4,cough_type_4,dyspnea_4,wheezing_4,stridor_4,choking_4,congestion_4,nothing_4,diagnosis_4,severity_4
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,...,good,dry,False,False,False,False,False,True,upper_infection,mild
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,...,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,...,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,ok,dry,...,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,...,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,...,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,...,ok,dry,False,False,False,False,False,True,COVID-19,mild
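
Passing a dictionary to `isin` restricts the check to the named columns, so a matching string in any other column is ignored. A minimal sketch on toy data (the frame and the deliberately misleading `status` value are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    # "ok" in this column is deliberately misleading and must not match.
    "status": ["healthy", "ok", "COVID-19"],
    "quality_1": ["poor", np.nan, "ok"],
    "quality_2": [np.nan, np.nan, "good"],
})

# Only the quality columns are checked against the allowed labels.
allowed = {"quality_1": ["good", "ok"], "quality_2": ["good", "ok"]}
mask = toy.isin(allowed)

# Keep rows where at least one quality column holds an allowed label.
selected = toy[mask.any(axis=1)]

print(mask)
print(selected)
```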


## Analysing each expert's dataframe
To create a dataframe that we can use to train an ML model, we first split the existing dataframe into four dataframes, one per expert, then unify the column names, and finally combine them into a single dataframe again.
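
The split, rename, and combine steps can also be expressed as a single loop. The sketch below uses a toy frame with two experts and two fields; the names `SHARED`, `FIELDS`, `N_EXPERTS`, and the extra `expert` column are assumptions introduced here, and the cells that follow perform the same steps explicitly for each expert.

```python
import numpy as np
import pandas as pd

SHARED = ["uuid", "cough_detected"]  # columns common to all experts
FIELDS = ["quality", "diagnosis"]    # per-expert fields (shortened for the toy)
N_EXPERTS = 2

toy = pd.DataFrame({
    "uuid": ["a", "b"],
    "cough_detected": [0.95, 0.98],
    "quality_1": ["ok", np.nan],
    "diagnosis_1": ["COVID-19", np.nan],
    "quality_2": [np.nan, "good"],
    "diagnosis_2": [np.nan, "healthy_cough"],
})

frames = []
for expert in range(1, N_EXPERTS + 1):
    suffix = f"_{expert}"
    expert_cols = [field + suffix for field in FIELDS]
    # Select the shared columns plus this expert's columns, then strip the suffix.
    part = toy[SHARED + expert_cols].rename(
        columns={col: col[: -len(suffix)] for col in expert_cols}
    )
    # Hypothetical extra column recording which expert gave the assessment.
    frames.append(part.assign(expert=expert))

combined_toy = pd.concat(frames, ignore_index=True)
print(combined_toy)
```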

In [18]:
#removing columns irrelevant to expert 1
columns_remove = ['cough_type_2','severity_2','dyspnea_2','wheezing_2','stridor_2','choking_2','congestion_2','nothing_2','diagnosis_2', 'quality_2',
                      'cough_type_3','severity_3','dyspnea_3','wheezing_3','stridor_3','choking_3','congestion_3','nothing_3','diagnosis_3', 'quality_3',
                      'cough_type_4','severity_4','dyspnea_4','wheezing_4','stridor_4','choking_4','congestion_4','nothing_4','diagnosis_4', 'quality_4']
 
expert_1 = good_quality_expert_df.drop(columns_remove, axis =1)
expert_1

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_1,cough_type_1,dyspnea_1,wheezing_1,stridor_1,choking_1,congestion_1,nothing_1,diagnosis_1,severity_1
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,ok,dry,False,False,False,False,False,False,COVID-19,mild
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [19]:
#removing columns irrelevant to expert 2
columns_remove = ['cough_type_1','severity_1','dyspnea_1','wheezing_1','stridor_1','choking_1','congestion_1','nothing_1','diagnosis_1', 'quality_1',
                      'cough_type_3','severity_3','dyspnea_3','wheezing_3','stridor_3','choking_3','congestion_3','nothing_3','diagnosis_3', 'quality_3',
                      'cough_type_4','severity_4','dyspnea_4','wheezing_4','stridor_4','choking_4','congestion_4','nothing_4','diagnosis_4', 'quality_4']
expert_2 = good_quality_expert_df.drop(columns_remove, axis =1)
expert_2

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_2,cough_type_2,dyspnea_2,wheezing_2,stridor_2,choking_2,congestion_2,nothing_2,diagnosis_2,severity_2
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,good,dry,False,False,False,False,False,True,lower_infection,mild
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [20]:
#removing columns irrelevant to expert 3
columns_remove = ['cough_type_2','severity_2','dyspnea_2','wheezing_2','stridor_2','choking_2','congestion_2','nothing_2','diagnosis_2', 'quality_2',
                      'cough_type_1','severity_1','dyspnea_1','wheezing_1','stridor_1','choking_1','congestion_1','nothing_1','diagnosis_1', 'quality_1',
                      'cough_type_4','severity_4','dyspnea_4','wheezing_4','stridor_4','choking_4','congestion_4','nothing_4','diagnosis_4', 'quality_4']
expert_3 = good_quality_expert_df.drop(columns_remove, axis =1)
expert_3

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_3,cough_type_3,dyspnea_3,wheezing_3,stridor_3,choking_3,congestion_3,nothing_3,diagnosis_3,severity_3
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,good,unknown,False,False,False,False,False,True,healthy_cough,pseudocough
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,good,wet,False,False,True,False,False,False,lower_infection,mild
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,good,wet,False,False,False,False,False,True,healthy_cough,pseudocough
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,good,dry,False,False,False,False,False,True,upper_infection,mild
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,good,wet,False,False,False,False,False,True,upper_infection,mild
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [21]:
#removing columns irrelevant to expert 4
columns_remove = ['cough_type_2','severity_2','dyspnea_2','wheezing_2','stridor_2','choking_2','congestion_2','nothing_2','diagnosis_2', 'quality_2',
                      'cough_type_3','severity_3','dyspnea_3','wheezing_3','stridor_3','choking_3','congestion_3','nothing_3','diagnosis_3', 'quality_3',
                      'cough_type_1','severity_1','dyspnea_1','wheezing_1','stridor_1','choking_1','congestion_1','nothing_1','diagnosis_1', 'quality_1']
expert_4 = good_quality_expert_df.drop(columns_remove, axis =1)
expert_4

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality_4,cough_type_4,dyspnea_4,wheezing_4,stridor_4,choking_4,congestion_4,nothing_4,diagnosis_4,severity_4
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,good,dry,False,False,False,False,False,True,upper_infection,mild
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,ok,dry,False,False,False,False,False,True,COVID-19,mild


In [22]:
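#unifying column names by stripping the trailing expert suffix ('_1' ... '_4')
def remove_last_two_characters(df, column_names):
    """Rename the given columns in place, dropping their last two characters."""
    rename_dict = {col: col[:-2] for col in column_names}
    df.rename(columns=rename_dict, inplace=True)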

column_names = [ 'cough_type_1','severity_1','dyspnea_1','wheezing_1','stridor_1','choking_1','congestion_1','nothing_1','diagnosis_1', 'quality_1']

remove_last_two_characters(expert_1, column_names)

In [23]:
#verifying results
expert_1

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,ok,dry,False,False,False,False,False,False,COVID-19,mild
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [24]:
column_names = ['cough_type_2','severity_2','dyspnea_2','wheezing_2','stridor_2','choking_2','congestion_2','nothing_2','diagnosis_2', 'quality_2']
remove_last_two_characters(expert_2, column_names)

#verifying results once again
expert_2

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,good,dry,False,False,False,False,False,True,lower_infection,mild
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [25]:
column_names = ['cough_type_3','severity_3','dyspnea_3','wheezing_3','stridor_3','choking_3','congestion_3','nothing_3','diagnosis_3', 'quality_3']
remove_last_two_characters(expert_3, column_names)

expert_3

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,,,,,,,,,,
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,good,unknown,False,False,False,False,False,True,healthy_cough,pseudocough
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,good,wet,False,False,True,False,False,False,lower_infection,mild
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,good,wet,False,False,False,False,False,True,healthy_cough,pseudocough
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,good,dry,False,False,False,False,False,True,upper_infection,mild
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,good,wet,False,False,False,False,False,True,upper_infection,mild
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,,,,,,,,,,


In [26]:
column_names = ['cough_type_4','severity_4','dyspnea_4','wheezing_4','stridor_4','choking_4','congestion_4','nothing_4','diagnosis_4', 'quality_4']
remove_last_two_characters(expert_4, column_names)

expert_4

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
14,0029d048-898a-4c70-89c7-0815cdcf7391,0.9456,35.0,male,True,False,symptomatic,healthy,good,dry,False,False,False,False,False,True,upper_infection,mild
16,002db0bd-e57f-4c30-ade0-16640d424eb7,0.9536,,,,,,,,,,,,,,,,
42,005b8518-03ba-4bf5-86d2-005541442357,0.9854,23.0,female,False,False,healthy,healthy,,,,,,,,,,
70,008ba489-31ad-44d8-856b-fcf72369dc46,0.9962,28.0,female,False,False,healthy,healthy,,,,,,,,,,
71,008c1c9e-aeef-40c5-846c-24f1b964f884,0.9751,44.0,male,False,False,symptomatic,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34380,ff8bfcc9-3df2-4752-8280-63f023fba31c,0.9830,,female,False,False,healthy,,,,,,,,,,,
34394,ffa718e8-da65-4602-8da8-cda7cdc568f2,0.9735,,,,,,,,,,,,,,,,
34407,ffbeb867-cdb7-4226-9456-e74c80acf2d9,0.9647,30.0,male,False,False,symptomatic,,,,,,,,,,,
34413,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,ok,dry,False,False,False,False,False,True,COVID-19,mild


In [27]:
# Stack the four per-expert dataframes into a single dataframe
combined = pd.concat([expert_1, expert_2, expert_3, expert_4], ignore_index = True)

# Drop rows that are missing any of the expert-annotation columns,
# i.e. rows where this expert did not label the recording
combined.dropna(subset = ['cough_type','severity','dyspnea','wheezing','stridor','choking','congestion','nothing','diagnosis', 'quality'], inplace=True)

# Verify the result
combined

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
11,01567151-7bb2-45ee-9aa8-a1332b5941ea,0.9820,,,,,,,good,dry,False,False,False,True,False,False,COVID-19,mild
16,018b40a1-c109-459a-9e31-86cbd2cb3918,0.9869,,,,,,,ok,wet,False,False,False,False,False,True,lower_infection,mild
18,01ff40e8-63e6-4570-a463-9778ea30cad7,0.9686,24.0,other,False,False,symptomatic,,poor,dry,False,False,False,False,False,True,healthy_cough,pseudocough
28,0379c586-c500-483c-83a6-95b63afe6931,0.9916,63.0,male,True,False,COVID-19,,ok,dry,False,False,False,False,False,True,healthy_cough,pseudocough
29,038592cb-c8db-4f55-8052-e20059146cb5,0.9824,28.0,male,False,False,healthy,,ok,dry,False,False,False,False,False,True,COVID-19,mild
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9066,fed255ec-4829-4f4a-b22d-9bb23f2dd89f,0.9502,,,,,,COVID-19,good,dry,False,False,False,False,False,True,upper_infection,mild
9067,ff1234d7-7837-4ba7-842f-99fdc916baa9,0.9947,29.0,male,False,True,symptomatic,,good,dry,False,False,False,False,False,True,upper_infection,mild
9069,ff8363d2-016d-4738-9499-4c62480886fb,0.9933,,female,False,False,COVID-19,,ok,dry,False,False,False,False,False,True,COVID-19,mild
9074,ffd18a56-096d-40fc-9862-e5c5a8ca1fcd,0.9953,25.0,female,False,False,healthy,healthy,ok,dry,False,False,False,False,False,True,COVID-19,mild
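
Note that after stacking with `ignore_index = True`, the combined dataframe no longer records which expert produced each row. The following is a minimal sketch of one way to keep that information, assuming it is useful for later analysis. The toy dataframes and the `expert` column name are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for two per-expert dataframes (hypothetical values).
toy_expert_1 = pd.DataFrame({
    "uuid": ["a", "b"],
    "quality": ["good", np.nan],
    "cough_type": ["dry", np.nan],
})
toy_expert_2 = pd.DataFrame({
    "uuid": ["a", "b"],
    "quality": [np.nan, "ok"],
    "cough_type": [np.nan, "wet"],
})

# Tag each dataframe with an (assumed) `expert` column before stacking,
# so the source of every row survives the concatenation.
frames = [toy_expert_1, toy_expert_2]
toy_combined = pd.concat(
    [frame.assign(expert=number) for number, frame in enumerate(frames, start=1)],
    ignore_index=True,
)

# Keep only the rows that the given expert actually annotated.
toy_combined = toy_combined.dropna(subset=["quality", "cough_type"])
print(toy_combined)
```

With real data, `frames` would be the list `[expert_1, expert_2, expert_3, expert_4]` and the `subset` list would be the same ten annotation columns used above.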


Let's verify that the dataframes were combined correctly. As we saw above, row 186 of the original dataframe was assessed by all four experts, so we should now expect four rows in `combined` that share its uuid.

In [28]:
df.iloc[186]

uuid                     01567151-7bb2-45ee-9aa8-a1332b5941ea
cough_detected                                          0.982
age                                                       NaN
gender                                                    NaN
respiratory_condition                                     NaN
fever_muscle_pain                                         NaN
status                                                    NaN
status_SSL                                                NaN
quality_1                                                good
cough_type_1                                              dry
dyspnea_1                                               False
wheezing_1                                              False
stridor_1                                               False
choking_1                                                True
congestion_1                                            False
nothing_1                                               False
diagnosi

In [29]:
row = combined['uuid'] == '01567151-7bb2-45ee-9aa8-a1332b5941ea'
true_rows = combined[row]
true_rows

Unnamed: 0,uuid,cough_detected,age,gender,respiratory_condition,fever_muscle_pain,status,status_SSL,quality,cough_type,dyspnea,wheezing,stridor,choking,congestion,nothing,diagnosis,severity
11,01567151-7bb2-45ee-9aa8-a1332b5941ea,0.982,,,,,,,good,dry,False,False,False,True,False,False,COVID-19,mild
2280,01567151-7bb2-45ee-9aa8-a1332b5941ea,0.982,,,,,,,good,dry,True,False,False,False,False,False,COVID-19,mild
4549,01567151-7bb2-45ee-9aa8-a1332b5941ea,0.982,,,,,,,good,wet,False,False,False,False,False,False,lower_infection,severe
6818,01567151-7bb2-45ee-9aa8-a1332b5941ea,0.982,,,,,,,good,dry,False,False,False,False,False,True,upper_infection,mild


As expected, there are four rows with the same uuid, one for each expert's assessment.
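
The check above looks at a single uuid. A rough sketch of how the same idea could be applied to every recording at once is shown below, using a small hypothetical dataframe in place of `combined`. It assumes each expert contributes at most one row per uuid, so no uuid should appear more than four times.

```python
import pandas as pd

# Toy stand-in for `combined` (hypothetical uuids and diagnoses).
toy_combined = pd.DataFrame({
    "uuid": ["a", "a", "a", "a", "b", "c", "c"],
    "diagnosis": [
        "COVID-19", "COVID-19", "lower_infection", "upper_infection",
        "healthy_cough", "upper_infection", "COVID-19",
    ],
})

# Number of expert assessments available for each recording.
ratings_per_uuid = toy_combined.groupby("uuid").size()

# With four experts, no recording should have more than four rows.
assert ratings_per_uuid.max() <= 4

# Recordings that were assessed by all four experts.
fully_rated = ratings_per_uuid[ratings_per_uuid == 4].index.tolist()
print(ratings_per_uuid)
print(fully_rated)
```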

In [30]:
combined.to_csv('cleaned_coughvid_data.csv', index = False)
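
Because `index = False` is passed, the row labels left over from the concatenation are not written to the file, and a fresh 0-based index is created when the file is read back. A small sketch of a round-trip check follows. It uses a hypothetical toy dataframe and an in-memory buffer instead of the real `cleaned_coughvid_data.csv`.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for `combined`, with non-contiguous row labels
# like those left behind after dropping rows (hypothetical values).
toy = pd.DataFrame(
    {"uuid": ["a", "b"], "age": [35.0, np.nan], "quality": ["good", "ok"]},
    index=[11, 28],
)

# Write without the index, then read the CSV text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The shape and columns are preserved; the old row labels are not.
print(reloaded.shape == toy.shape)
print(list(reloaded.columns))
print(list(reloaded.index))
```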