
Getting and loading the comm_use_subset dataset #18

Closed
5 of 6 tasks
chendaniely opened this issue Apr 7, 2020 · 1 comment
Labels
setup Tasks to get things started

Comments

@chendaniely
Member

chendaniely commented Apr 7, 2020

#15 parsed out parts of all the JSON files found in the comm_use_subset folder.
Let's run the code so you get the dataset on your computer, then load it up in Python for you to explore.

Note: for all the code and commands I'm assuming you are in the root directory (i.e., the covid19 folder) and in the db_covid19 environment.

Getting the dataset

  1. Look at the README.md for how to get the latest master codebase and update your environments. Parse and write out comm_use_subset as tsv #15 brought a few new packages into the environment, mainly pandas and a package I wrote called pyprojroot.
  2. The Makefile has a new target, data_kgl_text. You can either run the data parsing script with make data_kgl_text, or look at the Makefile and copy+paste the command it runs (python ./analysis/db/dan/load_data.py).
    • This should trigger a progress bar of 9000+ items: the script is going through all 9000+ articles in the comm_use_subset folder and getting the text out of the papers. A rough sketch of what it does is shown after this list.
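
For reference, here is a minimal sketch of what the parsing step does, assuming the articles follow the CORD-19 JSON schema (paper_id, metadata.title, abstract, body_text). This is not the actual load_data.py; the input folder path and the use of tqdm for the progress bar are my assumptions.

# Hypothetical sketch -- the real logic lives in ./analysis/db/dan/load_data.py.
# The input folder path and the use of tqdm are assumptions, not the script itself.
import json
from pathlib import Path

import pandas as pd
from pyprojroot import here
from tqdm import tqdm

rows = []
# walk every article JSON in the comm_use_subset folder (location assumed)
json_files = sorted(Path(here("./data/db/original/kaggle/comm_use_subset")).glob("**/*.json"))
for path in tqdm(json_files):
    with open(path) as f:
        article = json.load(f)
    # pull the text fields out of each paper (per the CORD-19 JSON schema)
    rows.append({
        "paper_id": article["paper_id"],
        "title": article["metadata"]["title"],
        "abstract": " ".join(p["text"] for p in article.get("abstract", [])),
        "text": " ".join(p["text"] for p in article["body_text"]),
    })

# write out the tab-separated file that the loading step below reads back in
pd.DataFrame(rows).to_csv(
    here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"),
    sep="\t", index=False,
)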

Loading the dataset

The script ./analysis/db/dan/load_data.py saved the parsed dataset to ./data/db/final/kaggle/paper_text/comm_use_subset.tsv. In Python you can load the dataset using pandas with

import pandas as pd
from pyprojroot import here

# load the tab separated file
dat = pd.read_csv(here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"), sep="\t")

# look at the first few lines
dat.head()

# look at the column names, their data types, and number of non-missing elements
dat.info()

# number of rows and columns
dat.shape

Of interest would be the abstract and text columns of the dataset. We can look into the spaCy library for NLP functions.
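
As a starting point, here is a minimal spaCy sketch over the abstract column. It assumes spaCy and its small English model are installed; neither is in the db_covid19 environment yet.

import pandas as pd
import spacy
from pyprojroot import here

# assumes spaCy is installed and the small English model has been downloaded:
#   pip install spacy
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

dat = pd.read_csv(here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"), sep="\t")

# run the pipeline over one abstract as an example
doc = nlp(str(dat["abstract"].iloc[0]))

# lemmas for the first few tokens
print([token.lemma_ for token in doc[:15]])

# named entities spaCy picked out of the abstract
print([(ent.text, ent.label_) for ent in doc.ents])

# number of sentences in the abstract
print(len(list(doc.sents)))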

@chendaniely
Member Author

Everyone, please take a look at #26.

@chendaniely chendaniely added the setup Tasks to get things started label Apr 16, 2020