# **Data Collection Notebook**

## Objectives

* Fetch the data from Kaggle, save it as a raw file, and unzip it.
* Inspect the data and save it under inputs/datasets/raw
* Clean and trim the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/trimmed_covid_dataset.csv

## Additional Comments

* The dataset originally contained over 1 million patients. It was trimmed to roughly 51,000.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/milestone-covid-19-study/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/milestone-covid-19-study'
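Rerunning the cells above can move the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` as in the path shown earlier (the helper name `move_to_project_root` is hypothetical):

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Move to the parent folder only while still inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Only step up when we are inside the notebooks subfolder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling the helper a second time leaves the working directory unchanged, because the current folder is no longer the notebooks folder.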

# Install and fetch data from Kaggle

Install the Kaggle package

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Point Kaggle to the kaggle.json token in the current directory and restrict the file's permissions

In [6]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
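The same setup can be done in pure Python, which also gives a clearer error when the token is missing. This is a sketch, assuming a POSIX file system; the helper name `prepare_kaggle_token` is hypothetical:

```python
import os
import stat


def prepare_kaggle_token(config_dir: str) -> str:
    """Check that kaggle.json exists, restrict its permissions, and set the config dir."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected a Kaggle token at {token_path}")
    # Owner read/write only, equivalent to `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

In the notebook this would be called as `prepare_kaggle_token(os.getcwd())` before the download cell.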

Define the Kaggle dataset and destination folder, then download the dataset.

In [7]:
KaggleDatasetPath = "meirnizri/covid19-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading covid19-dataset.zip to inputs/datasets/raw
 64%|████████████████████████▍             | 3.00M/4.66M [00:00<00:00, 5.53MB/s]
100%|██████████████████████████████████████| 4.66M/4.66M [00:00<00:00, 6.73MB/s]


Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
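Where the `unzip` command is not available, the extraction step can be approximated with the standard library. This is a sketch only: the helper name `extract_and_remove_zips` is hypothetical, and it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder: str) -> list[str]:
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    extracted = []
    for zip_path in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been extracted successfully
        zip_path.unlink()
    return extracted
```

The returned list of member names is a quick way to confirm which files the archive contained.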

---

# Load and Inspect Kaggle Data

Import Pandas and Read CSV File

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/trimmed_covid_dataset.csv")
print(df.head())
print(df.describe())

   USMER  MEDICAL_UNIT  SEX  PATIENT_TYPE   DATE_DIED  PNEUMONIA   AGE  \
0      2            12    1             1  9999-99-99        2.0  27.0   
1      2            12    1             1  9999-99-99        2.0  23.0   
2      1            12    1             2  9999-99-99        2.0  22.0   
3      1            12    1             1  9999-99-99        2.0  37.0   
4      2            12    1             1  9999-99-99        2.0  29.0   

   PREGNANT  DIABETES  COPD  ASTHMA  INMSUPR  HIPERTENSION  OTHER_DISEASE  \
0       2.0         2     2       2        2             2              2   
1       2.0         2     2       2        2             2              2   
2       2.0         2     2       2        2             2              2   
3       2.0         2     2       2        2             2              2   
4       2.0         2     2       2        2             1              2   

   CARDIOVASCULAR  OBESITY  RENAL_CHRONIC  TOBACCO  CLASIFFICATION_FINAL  
0               2

View DataFrame Summary

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51630 entries, 0 to 51629
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   USMER                 51630 non-null  int64  
 1   MEDICAL_UNIT          51630 non-null  int64  
 2   SEX                   51630 non-null  int64  
 3   PATIENT_TYPE          51630 non-null  int64  
 4   DATE_DIED             51630 non-null  object 
 5   PNEUMONIA             51630 non-null  float64
 6   AGE                   51630 non-null  float64
 7   PREGNANT              51630 non-null  float64
 8   DIABETES              51630 non-null  int64  
 9   COPD                  51630 non-null  int64  
 10  ASTHMA                51630 non-null  int64  
 11  INMSUPR               51630 non-null  int64  
 12  HIPERTENSION          51630 non-null  int64  
 13  OTHER_DISEASE         51630 non-null  int64  
 14  CARDIOVASCULAR        51630 non-null  int64  
 15  OBESITY            

# Data cleaning and Dataset size reduction

Identify and handle missing data values

In [10]:
import numpy as np
df.replace([97, 99], np.nan, inplace=True)
print(df.isna().sum())

USMER                   0
MEDICAL_UNIT            0
SEX                     0
PATIENT_TYPE            0
DATE_DIED               0
PNEUMONIA               0
AGE                     0
PREGNANT                0
DIABETES                0
COPD                    0
ASTHMA                  0
INMSUPR                 0
HIPERTENSION            0
OTHER_DISEASE           0
CARDIOVASCULAR          0
OBESITY                 0
RENAL_CHRONIC           0
TOBACCO                 0
CLASIFFICATION_FINAL    0
dtype: int64
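Replacing 97 and 99 across the whole DataFrame also affects columns where those numbers could be genuine values, such as AGE. A minimal sketch of restricting the replacement to the coded columns, using toy data and assuming AGE is the only column to exclude:

```python
import numpy as np
import pandas as pd

# Toy data: 97 and 99 are real ages in AGE but missing-data codes elsewhere
toy = pd.DataFrame({
    "AGE": [27, 97, 99],
    "PNEUMONIA": [2, 99, 1],
    "PREGNANT": [97, 2, 1],
})

# Assumption: every column except AGE uses 97/99 as a missing-data code
coded_columns = [col for col in toy.columns if col != "AGE"]

cleaned = toy.copy()
for col in coded_columns:
    cleaned[col] = cleaned[col].replace([97, 99], np.nan)

print(cleaned.isna().sum())
```

Whether other columns also need excluding depends on the dataset's documentation, so the `coded_columns` list should be checked against it.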


Drop columns with more than 50% missing data

In [11]:
# thresh is the minimum number of non-null values a column needs to be kept
threshold = int(len(df) * 0.5)
df.dropna(thresh=threshold, axis=1, inplace=True)
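The `thresh` argument is the minimum number of non-null values a column must have to be kept, which is easy to misread as a count of missing values. A small sketch on toy data (the column names are made up for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],        # 3 of 4 values present
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
})

# Keep columns with at least 50% non-null values
min_non_null = int(len(toy) * 0.5)
kept = toy.dropna(thresh=min_non_null, axis=1)

print(list(kept.columns))
```

A column with exactly half of its values present meets the threshold and is kept.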

Drop rows with NaN values

In [12]:
df.dropna(inplace=True)

Display missing values after cleaning

In [13]:
print("Missing values after cleaning:")
print(df.isna().sum())

Missing values after cleaning:
USMER                   0
MEDICAL_UNIT            0
SEX                     0
PATIENT_TYPE            0
DATE_DIED               0
PNEUMONIA               0
AGE                     0
PREGNANT                0
DIABETES                0
COPD                    0
ASTHMA                  0
INMSUPR                 0
HIPERTENSION            0
OTHER_DISEASE           0
CARDIOVASCULAR          0
OBESITY                 0
RENAL_CHRONIC           0
TOBACCO                 0
CLASIFFICATION_FINAL    0
dtype: int64


Downsample the dataset, because GitHub only accepts 5 MB of data

In [26]:
df_sampled = df.sample(frac=0.1, random_state=1)
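Fixing `random_state` makes the 10% sample reproducible, so rerunning the notebook selects the same rows. A short sketch on a toy DataFrame (the `patient_id` column is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"patient_id": range(1000)})

# Two draws with the same seed select the same rows
first = toy.sample(frac=0.1, random_state=1)
second = toy.sample(frac=0.1, random_state=1)

print(len(first))
print(first.index.equals(second.index))
```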

Save the trimmed dataset to a new CSV file

In [27]:
df_sampled.to_csv('trimmed_covid_dataset.csv', index=False)

Check the file size to ensure it fits within GitHub's limits

In [28]:
file_size = os.path.getsize('trimmed_covid_dataset.csv')
print(f'File size: {file_size / (1024 * 1024):.2f} MB')

File size: 2.69 MB
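The size check can be wrapped in a small helper so the limit is stated once and tested explicitly. This is a sketch: the helper names are hypothetical, and the 5 MB default simply mirrors the figure used in this notebook.

```python
import os


def size_in_mb(path: str) -> float:
    """Return the file size in mebibytes (1 MB = 1024 * 1024 bytes here)."""
    return os.path.getsize(path) / (1024 * 1024)


def fits_limit(path: str, limit_mb: float = 5.0) -> bool:
    """Return True when the file is no larger than the given limit."""
    return size_in_mb(path) <= limit_mb
```

For example, `fits_limit('trimmed_covid_dataset.csv')` could be asserted before the file is committed.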


Delete the original, untrimmed COVID dataset

In [30]:
directory = 'inputs/datasets/raw/'
files_to_delete = ['Covid Data.csv', 'covid19-dataset.zip']

for file in files_to_delete:
    file_path = os.path.join(directory, file)
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f'{file} has been deleted.')
    else:
        print(f'{file} does not exist.')

Covid Data.csv has been deleted.
covid19-dataset.zip has been deleted.


Move the trimmed dataset into the correct folder

In [14]:
import shutil

source = 'trimmed_covid_dataset.csv'
destination = 'inputs/datasets/raw/trimmed_covid_dataset.csv'

shutil.move(source, destination)
print(f'{source} has been moved to {destination}.')

FileNotFoundError: [Errno 2] No such file or directory: 'trimmed_covid_dataset.csv'
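One plausible cause of the FileNotFoundError above is that the cell was rerun after the file had already been moved. A minimal sketch of a guarded move that tolerates reruns (the helper name `move_if_present` is hypothetical):

```python
import os
import shutil


def move_if_present(source: str, destination: str) -> bool:
    """Move `source` to `destination` if it exists; return True when a move happened."""
    if not os.path.exists(source):
        print(f'{source} not found; it may already have been moved.')
        return False
    # Make sure the destination folder exists before moving
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    shutil.move(source, destination)
    print(f'{source} has been moved to {destination}.')
    return True
```

The return value makes it possible to tell a fresh move from a rerun without raising an exception.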

Check for duplicates

In [None]:
duplicates = df_sampled.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")
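If duplicate rows need to be removed rather than only counted, `drop_duplicates()` keeps the first occurrence of each row. A sketch on toy data, with one caveat: assuming the dataset has no patient identifier, identical rows may describe different patients, so dropping them is a judgment call.

```python
import pandas as pd

toy = pd.DataFrame({"SEX": [1, 1, 2], "AGE": [27, 27, 40]})

# duplicated() marks every repeat of an earlier row as True
duplicate_mask = toy.duplicated()
print(f"Number of duplicate rows: {duplicate_mask.sum()}")

# drop_duplicates() keeps the first occurrence and removes the repeats
deduplicated = toy.drop_duplicates()
print(len(deduplicated))
```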

# Push files to Repo

In [15]:
import os

# Create the output folder; exist_ok avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/covid-19-dataset.csv", index=False)

---