# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle, save it as a raw file, and unzip it.
* Inspect the data and save it under inputs/datasets/raw.
* Clean and trim the data, then save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: inputs/datasets/raw/trimmed_covid_dataset.csv (sampled) and outputs/datasets/collection/covid-19-dataset.csv (cleaned)

## Additional Comments

* The dataset originally contained over 1 million patient records (1,048,575 rows). It was trimmed to 57,728 rows in the run recorded in this notebook.


---

# Change working directory

* The notebooks are assumed to live in a subfolder, so when running a notebook in the editor you need to change the working directory.

The working directory must change from the notebook's folder to its parent folder.
* os.getcwd() returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-covid-19-study/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-covid-19-study'
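
The same step can be sketched with pathlib. This is a minimal illustration, assuming the notebook sits exactly one level below the project root; the helper name `move_to_parent` is hypothetical and not part of the notebook.

```python
import os
from pathlib import Path


def move_to_parent() -> Path:
    """Change the working directory to the parent of the current one."""
    parent = Path.cwd().parent
    os.chdir(parent)
    return parent
```

Calling the helper again moves up one more level, so the cell should run only once per kernel session.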

# Install and fetch data from Kaggle

Install Kaggle packages

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Add kaggle.json token

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
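
The token setup can also be done without shelling out. The sketch below assumes kaggle.json has already been placed in the given folder; the helper name `configure_kaggle_token` is hypothetical.

```python
import os
import stat
from pathlib import Path


def configure_kaggle_token(config_dir: str) -> Path:
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict the token's permissions."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected the Kaggle token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    # Owner read/write only, the equivalent of `chmod 600`
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Failing early when the token is missing gives a clearer message than letting the download command fail later.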

Define the Kaggle dataset and the destination folder, then download the dataset.

In [6]:
KaggleDatasetPath = "meirnizri/covid19-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading covid19-dataset.zip to inputs/datasets/raw
 64%|████████████████████████▍             | 3.00M/4.66M [00:00<00:00, 5.98MB/s]
100%|██████████████████████████████████████| 4.66M/4.66M [00:00<00:00, 7.26MB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/covid19-dataset.zip
  inflating: inputs/datasets/raw/Covid Data.csv  
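
For environments without the unzip command, the extraction step can be sketched with the standard zipfile module. The helper name `extract_and_remove` is hypothetical, and the sketch does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove(folder: str) -> list:
    """Extract every .zip in folder into that folder, then delete the archives."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted and closed
        archive.unlink()
    return extracted
```

Called as `extract_and_remove("inputs/datasets/raw")`, it returns the names of the extracted members.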


---

# Load and Inspect Kaggle Data

Import Pandas and Read CSV File

In [51]:
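import pandas as pd

# Load the raw Kaggle CSV (the filename contains a space)
df = pd.read_csv("inputs/datasets/raw/Covid Data.csv")
print(df.head())      # first five rows
print(df.describe())  # summary statistics for the numeric columns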

   USMER  MEDICAL_UNIT  SEX  PATIENT_TYPE   DATE_DIED  INTUBED  PNEUMONIA  \
0      2             1    1             1  03/05/2020       97          1   
1      2             1    2             1  03/06/2020       97          1   
2      2             1    2             2  09/06/2020        1          2   
3      2             1    1             1  12/06/2020       97          2   
4      2             1    2             1  21/06/2020       97          2   

   AGE  PREGNANT  DIABETES  ...  ASTHMA  INMSUPR  HIPERTENSION  OTHER_DISEASE  \
0   65         2         2  ...       2        2             1              2   
1   72        97         2  ...       2        2             1              2   
2   55        97         1  ...       2        2             2              2   
3   53         2         2  ...       2        2             2              2   
4   68        97         1  ...       2        2             1              2   

   CARDIOVASCULAR  OBESITY  RENAL_CHRONIC  TOBACCO

View DataFrame Summary

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 21 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   USMER                 1048575 non-null  int64 
 1   MEDICAL_UNIT          1048575 non-null  int64 
 2   SEX                   1048575 non-null  int64 
 3   PATIENT_TYPE          1048575 non-null  int64 
 4   DATE_DIED             1048575 non-null  object
 5   INTUBED               1048575 non-null  int64 
 6   PNEUMONIA             1048575 non-null  int64 
 7   AGE                   1048575 non-null  int64 
 8   PREGNANT              1048575 non-null  int64 
 9   DIABETES              1048575 non-null  int64 
 10  COPD                  1048575 non-null  int64 
 11  ASTHMA                1048575 non-null  int64 
 12  INMSUPR               1048575 non-null  int64 
 13  HIPERTENSION          1048575 non-null  int64 
 14  OTHER_DISEASE         1048575 non-null  int64 
 15

# Data cleaning and Dataset size reduction

Identify and handle missing data values

In [53]:
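import numpy as np

# Treat the codes 97 and 99 as missing values.
# Caution: the replacement applies to every column, so ages of 97 and 99
# in AGE are also set to NaN (see the AGE count in the output below).
df.replace([97, 99], np.nan, inplace=True)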
print("Missing values count after replacement:")
print(df.isna().sum())

Missing values count after replacement:
USMER                        0
MEDICAL_UNIT                 0
SEX                          0
PATIENT_TYPE                 0
DATE_DIED                    0
INTUBED                 855869
PNEUMONIA                16003
AGE                        221
PREGNANT                523511
DIABETES                     0
COPD                         0
ASTHMA                       0
INMSUPR                      0
HIPERTENSION                 0
OTHER_DISEASE                0
CARDIOVASCULAR               0
OBESITY                      0
RENAL_CHRONIC                0
TOBACCO                      0
CLASIFFICATION_FINAL         0
ICU                     856032
dtype: int64
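
AGE shows 221 missing values after the replacement, which suggests that genuine ages of 97 and 99 were masked along with the codes. One way to avoid this is to skip AGE when replacing. The following is a sketch on toy data, not the notebook's method; `mask_sentinels` and `SENTINELS` are hypothetical names.

```python
import numpy as np
import pandas as pd

SENTINELS = [97, 99]  # codes the notebook treats as missing


def mask_sentinels(frame: pd.DataFrame, skip=("AGE",)) -> pd.DataFrame:
    """Return a copy with sentinel codes set to NaN, except in the skipped columns."""
    coded = [col for col in frame.columns if col not in skip]
    out = frame.copy()
    out[coded] = out[coded].replace(SENTINELS, np.nan)
    return out


# Toy data: AGE holds real ages, the other columns hold coded answers
demo = pd.DataFrame({"AGE": [97, 40, 99], "INTUBED": [97, 1, 2], "ICU": [2, 99, 1]})
cleaned = mask_sentinels(demo)
print(cleaned.isna().sum())
```

In the toy frame, AGE keeps all three values while INTUBED and ICU each gain one missing value.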


Drop PREGNANT column

In [54]:
df.drop(columns=['PREGNANT'], inplace=True)

Drop rows with any remaining missing values

In [55]:
df.dropna(inplace=True)

Display missing values after cleaning

In [56]:
print("Missing values after cleaning:")
print(df.isna().sum())

Missing values after cleaning:
USMER                   0
MEDICAL_UNIT            0
SEX                     0
PATIENT_TYPE            0
DATE_DIED               0
INTUBED                 0
PNEUMONIA               0
AGE                     0
DIABETES                0
COPD                    0
ASTHMA                  0
INMSUPR                 0
HIPERTENSION            0
OTHER_DISEASE           0
CARDIOVASCULAR          0
OBESITY                 0
RENAL_CHRONIC           0
TOBACCO                 0
CLASIFFICATION_FINAL    0
ICU                     0
dtype: int64


Downsample the dataset to keep the saved file under a 5 MB target for the GitHub repository

In [57]:
df_sampled = df.sample(frac=0.3, random_state=1)
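
Fixing random_state makes the sample reproducible across runs. A small sketch on toy data (the `demo` frame is made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"SEX": [1, 2] * 50, "AGE": range(100)})

# Two draws with the same seed select exactly the same rows
first = demo.sample(frac=0.3, random_state=1)
second = demo.sample(frac=0.3, random_state=1)

print(len(first))            # 30 rows, i.e. 30% of 100
print(first.equals(second))  # True
```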

Save the trimmed dataset to a new CSV file

In [58]:
df_sampled.to_csv('trimmed_covid_dataset.csv', index=False)

Display value counts for the SEX column in the sampled dataset

In [59]:
print("SEX column value counts after sampling:")
print(df_sampled['SEX'].value_counts())

SEX column value counts after sampling:
2    34007
1    23721
Name: SEX, dtype: int64


Check the file size to ensure it fits within GitHub's limits

In [60]:
file_size = os.path.getsize('trimmed_covid_dataset.csv')
print(f'File size: {file_size / (1024 * 1024):.2f} MB')

File size: 3.22 MB
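
The size can also be estimated before anything is written to disk, which helps when choosing the sampling fraction. A minimal sketch, assuming UTF-8 output and the same to_csv options as the save step; `csv_size_mb` is a hypothetical helper.

```python
import pandas as pd


def csv_size_mb(frame: pd.DataFrame) -> float:
    """Size in MB of the CSV that frame.to_csv(index=False) would produce."""
    text = frame.to_csv(index=False)  # returns a string when no path is given
    return len(text.encode("utf-8")) / (1024 * 1024)


demo = pd.DataFrame({"SEX": [1, 2, 2], "AGE": [65, 72, 55]})
print(f"Estimated size: {csv_size_mb(demo):.6f} MB")
```

The whole CSV is built in memory, so this suits frames of the size used here rather than very large ones.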


Delete the original, untrimmed COVID dataset

In [61]:
directory = 'inputs/datasets/raw/'
files_to_delete = ['Covid Data.csv', 'covid19-dataset.zip']

for file in files_to_delete:
    file_path = os.path.join(directory, file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f'{file} has been deleted.')
    else:
        print(f'{file} does not exist.')

Covid Data.csv has been deleted.
covid19-dataset.zip does not exist.


Move the trimmed dataset into the inputs/datasets/raw folder

In [62]:
import shutil

source = 'trimmed_covid_dataset.csv'
destination = 'inputs/datasets/raw/trimmed_covid_dataset.csv'

shutil.move(source, destination)
print(f'{source} has been moved to {destination}.')

trimmed_covid_dataset.csv has been moved to inputs/datasets/raw/trimmed_covid_dataset.csv.


Check for duplicate rows

In [63]:
duplicates = df_sampled.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

Number of duplicate rows: 12627
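
The columns listed above include no patient identifier, so identical rows may well describe different patients rather than repeated entries. The toy sketch below shows how `duplicated()` counts repeats; whether to drop them is a judgment call that the notebook does not make.

```python
import pandas as pd

demo = pd.DataFrame({"SEX": [1, 1, 2, 1], "AGE": [30, 30, 45, 30]})

# Default keep="first": only the later copies are flagged
repeats = demo.duplicated()
# keep=False flags every row that has at least one identical twin
all_copies = demo.duplicated(keep=False)

print(repeats.sum())                # 2
print(all_copies.sum())             # 3
print(len(demo.drop_duplicates()))  # 2 unique rows remain
```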


# Push files to Repo

In [64]:
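import os

# Create the output folder; report (rather than fail) if it already exists
try:
    os.makedirs(name='outputs/datasets/collection')
except FileExistsError as e:
    print(e)

# Note: this saves the full cleaned DataFrame (df), not the sampled df_sampled
df.to_csv("outputs/datasets/collection/covid-19-dataset.csv", index=False)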

[Errno 17] File exists: 'outputs/datasets/collection'
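
The "File exists" message appears because the folder was created on an earlier run. As an alternative sketch, `os.makedirs` accepts `exist_ok=True`, which makes the call safe to repeat; `ensure_dir` is a hypothetical helper name.

```python
import os


def ensure_dir(path: str) -> str:
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

With this helper, `ensure_dir("outputs/datasets/collection")` can run on every execution without a try/except block.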


---