# Overview



In [1]:
import pandas as pd
import re
from glob import glob
from pathlib import Path
import os
from pprint import pprint
from RISparser import readris
import nbib
import time

In [2]:
start_time = time.time()
pd.options.display.max_columns = None


In [3]:
df = pd.read_csv('/Users/brian/Coding/COVID-research/data/Journals.csv')
df = df['Journals'].tolist()

# Create formatted query

This section takes the journal list and produces a formatted query to search the three major databases.  The procedure involves pasting search operators around the journal names and concatenating everything into a single query that can be copied from the notebook and pasted directly into the search bar.  All searches were performed on September 22, 2021.  



### _Web of Science_
The Web of Science (WoS) search was specified to capture all records from January 1, 2014 through September 22, 2021.  The exports included the cited references for each article, which capped each export batch at 500 records.  The records were downloaded as tab-delimited files, which were then combined and processed using this notebook.

```
web_of_science = ' OR '.join(df)
web_of_science = "SO = " + "(" + web_of_science + ")"
```

### _PubMed_ 
PubMed contains only a subset of social work journals.  No date limits were put on the search.  The search produced 10,033 articles, going back to the 1970s.  Because the maximum number of records that can be extracted is 10,000, the search results were sorted in the web interface from newest to oldest before the extraction was made.  The 33 articles that were not retained are therefore the oldest records, which will not be used in our current work.  Additionally, the current cleaning script excludes all articles published before 2014.  The extract from PubMed was in PubMed's tagged citation format (\*.nbib).  This file was imported into Zotero, exported as a csv file, and processed with this notebook.

```
pub_med = ["(\"" + k + "\"[Journal])" for k in df]
pub_med = ' OR '.join(pub_med)
```
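
The pre-2014 exclusion mentioned above happens in a separate cleaning script that is not shown here.  As a rough sketch of the idea, assuming the year is the leading four digits of PubMed-style date strings such as `2021 Jul 24` or `1976 Summer`, the filter could look like the following.  The helper name `publication_year` and the sample records are illustrative only.

```python
import re

def publication_year(date_string):
    """Return the leading four-digit year of a date string, or None."""
    match = re.match(r"\s*(\d{4})", date_string or "")
    return int(match.group(1)) if match else None

# Illustrative records using date formats seen in PubMed exports
records = [
    {"title": "A", "publication_date": "2021 Jul 24"},
    {"title": "B", "publication_date": "1976 Summer"},
    {"title": "C", "publication_date": "2014 Jan"},
    {"title": "D", "publication_date": ""},
]

# Keep records from 2014 onward; records without a parseable year are dropped
kept = [r for r in records
        if (publication_year(r["publication_date"]) or 0) >= 2014]

print([r["title"] for r in kept])
```

Whether records with a missing year should be dropped or kept for manual review is a design choice; this sketch drops them.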

### _EBSCOhost_

The search was performed on September 22, 2021 using the query below, capturing everything published since 2014 (inclusive).  For this pull, I extracted article records with all the cited references.  In doing so, the extract batch limit was 500, whereas exports without the cited references allow batches of 1,000.  The files were saved in RIS format, which is used by bibliography managers like Zotero and EndNote.  To process these files, we used a Python module called `RISparser`.  This converted the files to a Pandas dataframe, which was then prepared for analysis.

The search covered the following databases:

* Social Work Abstracts
* APA PsycInfo
* Abstracts in Social Gerontology
* Child Development & Adolescent Studies
* Family Studies Abstracts
* Political Science Complete
* Violence & Abuse Abstracts
* Women's Studies International

```
ebsco = ["SO " + "\"" + k + "\"" for k in df]
ebsco = " OR ".join(ebsco)
```
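
To make the three query formats concrete, here is a small sketch that applies the same string operations to two journal names.  The names in `sample_journals` are placeholders chosen for illustration; the real list comes from `Journals.csv`.

```python
# Hypothetical two-journal list; the notebook reads the real one from Journals.csv
sample_journals = ["Social Work", "Child Welfare"]

# Web of Science: one SO field tag around an OR-joined list
wos_query = "SO = (" + " OR ".join(sample_journals) + ")"

# PubMed: each title quoted and tagged with [Journal], then OR-joined
pubmed_query = " OR ".join('("' + j + '"[Journal])' for j in sample_journals)

# EBSCOhost: each title quoted and prefixed with the SO field code
ebsco_query = " OR ".join('SO "' + j + '"' for j in sample_journals)

print(wos_query)     # SO = (Social Work OR Child Welfare)
print(pubmed_query)  # ("Social Work"[Journal]) OR ("Child Welfare"[Journal])
print(ebsco_query)   # SO "Social Work" OR SO "Child Welfare"
```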

# Data Preparation

This section prepares the data for cleaning.  Each source is read into its own dataframe in the subsections below.

### Prepare _Web of Science_

In [4]:
# Read and combine the WoS data files

%cd /Users/brian/Coding/Data/SocialWorkJournals/WoS

# Read each export into its own frame, then concatenate once at the end
frames = []
for f in glob("save*.txt"):
    frames.append(pd.read_table(f, sep="\t", skip_blank_lines=True,
                                engine="python",
                                header=0,
                                usecols=[0,4,7,8,11,12,18,19,20,21,22,23,
                                         25,26,27,28,29,31,34,40,
                                         41,42,43,55,60,61,62,65],
                                parse_dates=True,
                                quotechar="\"",
                                on_bad_lines="skip"))
wos = pd.concat(frames, ignore_index=True)


# The columns are incorrectly named in the file.  This procedure renames the
# retrieved columns.  The mistake can be observed by looking at the raw
# text files.
wos.columns = ['authors', 'author_full_name', 'article_title', 'journal', 'language',
       'document_type', 'keywords_au', 'keywords_journal', 'abstract',
       'author_address', 'corresponding_author', 'email_address', 'ID_ORCID',
       'funding_agency', 'funding_text', 'cited_references',
       'cited_reference_count', 'total_times_cited', 'publisher',
       'journal_abbv', 'journal_abbv_dot', 'publication_date',
       'publication_year', 'early_access_date', 'ID_database', 'ID_PubMed',
       'open_access_indicator', 'date_of_data_pull']

/Users/brian/Coding/Data/SocialWorkJournals/WoS


In [5]:
wos['data_source'] = "wos"

# Keep the analysis columns, including the data_source flag set above;
# .copy() avoids a SettingWithCopyWarning on the assignments below
wos = wos[['ID_database', 'authors', 'article_title', 'journal', 'language', 'document_type',
           'abstract', 'author_address', 'cited_references', 'journal_abbv', 'publication_date',
           'publication_year', 'early_access_date', 'ID_PubMed', 'open_access_indicator',
           'date_of_data_pull', 'data_source']].copy()

wos['early_access_date'] = pd.to_datetime(wos['early_access_date'])
# Keep only the numeric part of the database ID
wos['ID_database'] = wos['ID_database'].str.extract(r'(\d+)')
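
The `str.extract` call keeps only the first run of digits in each database ID.  A minimal sketch of that behavior follows, assuming identifiers shaped like `WOS:` followed by digits; the sample values are made up for illustration.

```python
import pandas as pd

# Illustrative accession numbers; real values come from the WoS export
ids = pd.Series(["WOS:000123456700001", "WOS:000987654300012", None])

# expand=False returns a Series holding the first captured run of digits
digits = ids.str.extract(r"(\d+)", expand=False)

print(digits.tolist())
```

The extracted values are still strings, so leading zeros are preserved, and missing IDs stay missing.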
    


### Prepare _PubMed_

In [6]:
%cd '/Users/brian/Coding/Data/SocialWorkJournals/PubMed'

# Parse the PubMed export into one record per reference
refs = nbib.read_file('/Users/brian/Coding/Data/SocialWorkJournals/PubMed/pubmed_2021_09_22.nbib')
pubmed = pd.DataFrame.from_dict(refs)
pubmed.shape

/Users/brian/Coding/Data/SocialWorkJournals/PubMed


(10000, 39)

In [7]:
pubmed.columns

Index(['pubmed_id', 'citation_owner', 'nlm_status', 'last_revision_date',
       'print_issn', 'electronic_issn', 'linking_issn', 'publication_date',
       'title', 'abstract', 'copyright', 'authors', 'language',
       'publication_types', 'electronic_publication_date',
       'journal_abbreviated', 'journal', 'nlm_journal_id', 'pmcid', 'keywords',
       'conflict_of_interest', 'received_time', 'revised_time',
       'accepted_time', 'entrez_time', 'pubmed_time', 'medline_time', 'pii',
       'doi', 'publication_status', 'pages', 'place_of_publication',
       'journal_volume', 'journal_issue', 'grants', 'pmc-release_time',
       'descriptors', 'secondary_source', 'corporate_author'],
      dtype='object')

In [8]:
pubmed = pubmed.iloc[:, [0, 7, 8, 9, 11, 12, 13, 15, 16, 26, 31]]

# rename returns a new dataframe, so the result is assigned back;
# the source column is 'pubmed_id' (lowercase), as listed above
pubmed = pubmed.rename(columns={'pubmed_id': 'ID_PubMed', 'title': 'article_title',
                                'publication_types': 'document_type',
                                'journal_abbreviated': 'journal_abbv'})
pubmed

Unnamed: 0,ID_PubMed,publication_date,article_title,abstract,authors,language,document_type,journal_abbv,journal,medline_time,place_of_publication
0,34548838,2021 Jul 24,Risk and protective factors associated with gr...,COVID-19 and its related policy measures have ...,"[{'author': 'Xu, Yanfeng', 'author_abbreviated...",eng,[Journal Article],Child Fam Soc Work,Child & family social work,2021-09-23 06:00:00,
1,34545773,2021 Sep 21,Grandparents' Mental Health and Lived Experien...,Understanding grandparents' lived experiences ...,"[{'author': 'Zakari, Nazik M A', 'author_abbre...",eng,[Journal Article],J Gerontol Soc Work,Journal of gerontological social work,2021-09-22 06:00:00,United States
2,34545771,2021 Sep 21,A COVID Reset: The Future of the Long-Term Car...,,"[{'author': 'Nelson, H Wayne', 'author_abbrevi...",eng,[Journal Article],J Gerontol Soc Work,Journal of gerontological social work,2021-09-22 06:00:00,United States
3,34542018,2021 Sep 19,Experiences of LGBTQ Adults Who Have Accessed ...,This study examines the experiences of adults ...,"[{'author': 'Ecker, John', 'author_abbreviated...",eng,[Journal Article],Soc Work Public Health,Social work in public health,2021-09-21 06:00:00,United States
4,34533421,2021 Sep 17,Perceptions of older adults? Measuring positiv...,The COVID-19 pandemic has disproportionately i...,"[{'author': 'Carlson, Kristy J', 'author_abbre...",eng,[Journal Article],J Gerontol Soc Work,Journal of gerontological social work,2021-09-18 06:00:00,United States
...,...,...,...,...,...,...,...,...,...,...,...
9995,828309,1976 Summer,Social work in the nursing home: a need and an...,Working on a Nursing Home Demonstration Projec...,"[{'author': 'Jorgensen, L A', 'author_abbrevia...",eng,"[Journal Article, Research Support, U.S. Gov't...",Soc Work Health Care,Social work in health care,1976-01-01 00:01:00,United States
9996,828308,1976 Spring,"Soft services: a major, cost-effective compone...",,"[{'author': 'Nason, F', 'author_abbreviated': ...",eng,"[Case Reports, Journal Article]",Soc Work Health Care,Social work in health care,1976-01-01 00:01:00,United States
9997,798320,1976 Spring,Dialysis and transplantation: a mothers' group.,Two social workers are helping mothers of tran...,"[{'author': 'Glass, L', 'author_abbreviated': ...",eng,[Journal Article],Soc Work Health Care,Social work in health care,1976-01-01 00:01:00,United States
9998,138954,1976 Fall,The social work component in community-based a...,When the Delaware Valley chapter of the Commit...,"[{'author': 'Miller, E', 'author_abbreviated':...",eng,[Journal Article],Soc Work Health Care,Social work in health care,1976-01-01 00:01:00,United States


### Prepare _EBSCOhost_

In [None]:

%cd '/Users/brian/Coding/Data/SocialWorkJournals/EBSCO/'

# Make sure the directory is in the right place.
# Write to a file and read back in to increase performance

files = glob('*.ris')

# Collect one small dataframe per entry and concatenate once at the end
frames = []
for file in files:
    with open(file, errors="ignore") as bibliography_file:
        for entry in readris(bibliography_file):
            # Drop the 'unknown_tag' key, if present, so it does not become a column
            entry.pop('unknown_tag', None)
            # Change from dict form to dataframe form
            frames.append(pd.DataFrame.from_dict(entry, orient='index').transpose())

ebsco = pd.concat(frames)


/Users/brian/Coding/Data/SocialWorkJournals/EBSCO
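
RIS files are plain text: each line holds a two-character tag, two spaces, a hyphen, and a value, and an `ER` line closes a record.  The sketch below is a standard-library illustration of that structure, not `RISparser`'s implementation; the sample text and the `parse_ris` helper are made up for this example.  It shows why repeated tags such as `AU` naturally become list-valued fields.

```python
import re

# Illustrative RIS text with two records; not taken from the EBSCOhost export
ris_text = """TY  - JOUR
AU  - Doe, Jane
AU  - Roe, Richard
TI  - An example article
JO  - Example Journal
PY  - 2020
ER  -
TY  - JOUR
AU  - Poe, Alex
TI  - Another example
ER  -
"""

# Two-character tag, two spaces, a hyphen, an optional space, then the value
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")

def parse_ris(text):
    """Parse RIS text into a list of dicts mapping each tag to a list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # this sketch skips blank and continuation lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end of record
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records

records = parse_ris(ris_text)
print(len(records), records[0]["AU"])
```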


In [None]:

ebsco = ebsco.iloc[ :,[1,2,3,5,8,9,10,13,15,17]]

ebsco.columns = ['authors', 'article_title', 'journal', 'publication_date_full',\
                 'document_type', 'start_page', 'end_page', 'abstract', \
                 'notes', 'data_source']
ebsco.head()


# Prepare the final database

In [None]:
# Put in order of quality
final = pd.concat([pubmed, wos, ebsco])
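
The "order of quality" comment suggests that, when the same article appears in more than one source, the copy from the earlier frame should win.  That deduplication step is not shown in this notebook.  One possible sketch, assuming articles are matched on a normalized title (a hypothetical choice) and using toy frames in place of the real data, is:

```python
import pandas as pd

# Toy frames standing in for the three sources
pubmed_demo = pd.DataFrame({"article_title": ["Alpha study"],
                            "data_source": ["pubmed"]})
wos_demo = pd.DataFrame({"article_title": ["alpha study ", "Beta study"],
                         "data_source": ["wos", "wos"]})
ebsco_demo = pd.DataFrame({"article_title": ["Beta Study", "Gamma study"],
                           "data_source": ["ebsco", "ebsco"]})

# Concatenate in priority order so the preferred source comes first
combined = pd.concat([pubmed_demo, wos_demo, ebsco_demo], ignore_index=True)

# Hypothetical matching key: lowercased, whitespace-trimmed title
combined["title_key"] = combined["article_title"].str.lower().str.strip()

# keep="first" retains the row from the highest-priority source
deduped = (combined.drop_duplicates(subset="title_key", keep="first")
                   .drop(columns="title_key"))

print(deduped["data_source"].tolist())
```

Title matching is fragile in practice; a shared identifier such as a PubMed ID or DOI would be a sturdier key where one is available.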

In [None]:
final.head()


In [None]:
final.shape

In [None]:
# group_count = final.groupby('journal').size().reset_index()
# group_count.columns = ['journal', 'journal_count']
# final = pd.merge(final, group_count, left_on='journal', right_on = 'journal')



#df = pd.DataFrame(s).reset_index()
#df.columns = ['Gene', 'count']


In [None]:
end_time = time.time()
print((end_time-start_time)/60)