
Getting and loading the comm_use_subset dataset #18

Closed
5 of 6 tasks
chendaniely opened this issue Apr 7, 2020 · 1 comment
Labels
setup Tasks to get things started

Comments

@chendaniely
Member

chendaniely commented Apr 7, 2020

#15 parsed out parts of all the JSON files found in the comm_use_subset folder.
Let's run the code so you get the dataset on your computer, then load it up in Python for you to explore.

Note: for all the code and commands I'm assuming you are in the root directory (i.e., the covid19 folder) and in the db_covid19 environment.

Getting the dataset

  1. Look at the README.md for how to get the latest master codebase and update your environments. Parse and write out comm_use_subset as tsv #15 brought a few new packages into the environment, mainly pandas and a package I wrote called pyprojroot.
  2. The Makefile has a new target, data_kgl_text. You can either run the data parsing script with make data_kgl_text, or look at the Makefile and copy+paste the command it runs (python ./analysis/db/dan/load_data.py).
    • This should trigger a progress bar of 9000+ items: the script is going through all 9000+ articles in the comm_use_subset folder and getting the text out of the papers. A rough sketch of what it does is shown after this list.
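
For reference, here is a minimal sketch of what the parsing step does, assuming the articles follow the CORD-19 JSON schema (paper_id, metadata.title, abstract, body_text). This is not the actual load_data.py; the input folder path and the use of tqdm for the progress bar are my assumptions.

# Hypothetical sketch -- the real logic lives in ./analysis/db/dan/load_data.py.
# The input folder path and the use of tqdm are assumptions, not the script itself.
import json
from pathlib import Path

import pandas as pd
from pyprojroot import here
from tqdm import tqdm

rows = []
# walk every article JSON in the comm_use_subset folder (location assumed)
json_files = sorted(Path(here("./data/db/original/kaggle/comm_use_subset")).glob("**/*.json"))
for path in tqdm(json_files):
    with open(path) as f:
        article = json.load(f)
    # pull the text fields out of each paper (per the CORD-19 JSON schema)
    rows.append({
        "paper_id": article["paper_id"],
        "title": article["metadata"]["title"],
        "abstract": " ".join(p["text"] for p in article.get("abstract", [])),
        "text": " ".join(p["text"] for p in article["body_text"]),
    })

# write out the tab-separated file that the loading step below reads back in
pd.DataFrame(rows).to_csv(
    here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"),
    sep="\t", index=False,
)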

Loading the dataset

The script ./analysis/db/dan/load_data.py saved the parsed dataset to ./data/db/final/kaggle/paper_text/comm_use_subset.tsv. In Python you can load the dataset using pandas with

import pandas as pd
from pyprojroot import here

# load the tab separated file
dat = pd.read_csv(here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"), sep="\t")

# look at the first few lines
dat.head()

# look at the column names, their data types, and number of non-missing elements
dat.info()

# number of rows and columns
dat.shape

Of interest would be the abstract and text columns of the dataset. We can look into the spaCy library for NLP functions.
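
As a starting point, here is a minimal spaCy sketch over the abstract column. It assumes spaCy and its small English model are installed; neither is in the db_covid19 environment yet.

import pandas as pd
import spacy
from pyprojroot import here

# assumes spaCy is installed and the small English model has been downloaded:
#   pip install spacy
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

dat = pd.read_csv(here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"), sep="\t")

# run the pipeline over one abstract as an example
doc = nlp(str(dat["abstract"].iloc[0]))

# lemmas for the first few tokens
print([token.lemma_ for token in doc[:15]])

# named entities spaCy picked out of the abstract
print([(ent.text, ent.label_) for ent in doc.ents])

# number of sentences in the abstract
print(len(list(doc.sents)))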

@chendaniely
Member Author

Everyone, please take a look at #26.

@chendaniely chendaniely added the setup Tasks to get things started label Apr 16, 2020