#15 parsed out parts of all the JSON files found in the `comm_use_subset` folder.
Let's run the code to get the dataset onto your computer, then load it up in Python so you can explore it.
Note: for all the code and commands below, I'm assuming you are in the root directory (i.e., the `covid19` folder) and in the `db_covid19` environment.
Getting the dataset
Look at the README.md for how to get the latest `master` codebase and update your environments. Parse and write out comm_use_subset as tsv #15 brought a few new packages into the environment, mainly `pandas` and a package I wrote called `pyprojroot`.
The Makefile has a new target, `data_kgl_text`. You can either run the data parsing script with `make data_kgl_text`, or look at the Makefile and copy+paste the command (`python ./analysis/db/dan/load_data.py`).
This should trigger a progress bar of 9000+ items. It's going through all 9000+ articles in the comm_use_subset folder and getting the text from the papers.
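For context, the parsing step boils down to walking the folder of JSON files and pulling out the text fields. A hypothetical sketch of that loop (the actual logic lives in `./analysis/db/dan/load_data.py`; the `paper_id` and `body_text` field names follow the Kaggle CORD-19 JSON schema and are an assumption here):

```python
import json
from pathlib import Path

def parse_papers(folder):
    """Yield (paper_id, body text) for each JSON article in folder."""
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            paper = json.load(f)
        # join the paragraph-level body_text entries into one string
        body = " ".join(p["text"] for p in paper.get("body_text", []))
        yield paper["paper_id"], body
```

The real script also writes each paper's pieces out as a row of the TSV, which is what the progress bar is tracking.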
Loading the dataset
The script `./analysis/db/dan/load_data.py` saved the parsed dataset to `./data/db/final/kaggle/paper_text/comm_use_subset.tsv`. In Python, you can load the dataset using `pandas` with:
```python
import pandas as pd
from pyprojroot import here

# load the tab-separated file
dat = pd.read_csv(here("./data/db/final/kaggle/paper_text/comm_use_subset.tsv"), sep="\t")

# look at the first few lines
dat.head()

# look at the column names, their data types, and number of non-missing elements
dat.info()

# number of rows and columns
dat.shape
```
Of interest would be the `abstract` and `text` columns of the dataset. We can look into the spaCy library for NLP functions.
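As a first pass before reaching for spaCy, pandas string methods can already give quick statistics on those columns. A minimal sketch, using a toy DataFrame standing in for `dat` (the column names match the real dataset; the rows are made up):

```python
import pandas as pd

# toy stand-in for dat with the same abstract/text columns
dat = pd.DataFrame({
    "abstract": ["Coronavirus entry mechanisms.", "A survey of antiviral drugs."],
    "text": ["Full body text one.", "Full body text two."],
})

# rough word counts per paper via whitespace tokenization
dat["abstract_words"] = dat["abstract"].str.split().str.len()
print(dat["abstract_words"].tolist())  # → [3, 5]
```

For anything beyond rough counts (sentence splitting, lemmas, entities), a spaCy pipeline over `dat["abstract"]` and `dat["text"]` would be the next step.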