# Environmental Justice Data Handling after Screening
<div style="background-color: green; padding: 10px; border-radius: 5px;">
</div>

<div style="background-color: purple; padding: 10px; border-radius: 5px;">
This Jupyter notebook is intended to:
    
    1. Import the tables from the abstract screening Google spreadsheet,
    2. Check for errors in the data (people sometimes alter the table unintentionally),
    3. Organize the included articles by article type (conceptual, empirical, review, unknown),
    4. Download the full-text PDFs,
    5. Label the PDFs correctly and organize them into different folders,
    6. Come up with a solution for cases where articles should actually be excluded or belong to a different article type.
</div>

## Importing the Google spreadsheet and inspection
<div style="background-color: orange; padding: 10px; border-radius: 2px;">
</div>

In [2]:
import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
The id of the Google sheet is stored in a JSON file that is given to you separately, so it is not included in the repository.
</div>

In [3]:
#loading file with the id (the with-block closes the file even if parsing fails)
with open("gsheet.json") as con_file:
    file_key = json.load(con_file)

#loading google spreadsheet
spreadsheet_id = file_key["id"]  #from the json file
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=xlsx"
xls = pd.ExcelFile(url)

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
Check the dataframe: look for null values in the first 12 columns, columns added by others, etc.
</div>

In [4]:
#putting all sheets into a single huge df
sheets = xls.sheet_names
dataframes = []
for sheet in sheets:
    dataframe = xls.parse(sheet)
    dataframe["responsible"] = sheet
    dataframes.append(dataframe)

df = pd.concat(dataframes, ignore_index=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5676 entries, 0 to 5675
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   short_id          5676 non-null   int64 
 1   eid               5676 non-null   object
 2   doi               5676 non-null   object
 3   weblink           5676 non-null   object
 4   scholar_link      5676 non-null   object
 5   journal           5676 non-null   object
 6   author_names      5676 non-null   object
 7   year              5676 non-null   int64 
 8   title             5676 non-null   object
 9   abstract          5676 non-null   object
 10  included          5676 non-null   object
 11  article_type      5676 non-null   object
 12  responsible       5676 non-null   object
 13  intercoder check  57 non-null     object
 14  Unnamed: 13       3 non-null      object
 15  Comment           49 non-null     object
 16  Column 1          9 non-null      object
 17  Unnamed: 12   

In [6]:
#getting rid of unnecessary columns
df = df.iloc[:,:13]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5676 entries, 0 to 5675
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   short_id      5676 non-null   int64 
 1   eid           5676 non-null   object
 2   doi           5676 non-null   object
 3   weblink       5676 non-null   object
 4   scholar_link  5676 non-null   object
 5   journal       5676 non-null   object
 6   author_names  5676 non-null   object
 7   year          5676 non-null   int64 
 8   title         5676 non-null   object
 9   abstract      5676 non-null   object
 10  included      5676 non-null   object
 11  article_type  5676 non-null   object
 12  responsible   5676 non-null   object
dtypes: int64(2), object(11)
memory usage: 576.6+ KB
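
Slicing by position with `iloc[:, :13]` works only as long as the first sheet has exactly the expected layout. A minimal sketch of a name-based alternative follows; the column names are copied from the `df.info()` output above, while the helper names `unexpected_columns` and `keep_expected_columns` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column names taken from the df.info() output shown above.
EXPECTED_COLUMNS = [
    "short_id", "eid", "doi", "weblink", "scholar_link", "journal",
    "author_names", "year", "title", "abstract", "included",
    "article_type", "responsible",
]


def unexpected_columns(frame, expected=EXPECTED_COLUMNS):
    """List the columns that are not part of the expected layout."""
    return [col for col in frame.columns if col not in expected]


def keep_expected_columns(frame, expected=EXPECTED_COLUMNS):
    """Return only the expected columns, failing loudly if one is missing."""
    missing = [col for col in expected if col not in frame.columns]
    if missing:
        raise KeyError(f"missing expected columns: {missing}")
    return frame.loc[:, expected].copy()
```

Calling `unexpected_columns(df)` before dropping anything shows which columns were added by coders, and `keep_expected_columns(df)` would then replace the positional slice.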


<div style="background-color: purple; padding: 2px; border-radius: 2px;">
Check how many articles are included, and if there are some left to be coded yet.
</div>

In [8]:
#articles that have not been screened
df.loc[df["included"]=="-",]["short_id"].count()

1309

In [9]:
#double check if anyone still has abstracts to be screened
df.loc[df["included"]=="-",]["responsible"].unique()

array(['CG', 'HVW', 'MFK', 'KBB', 'DBM', 'EO', 'LK', 'MFK2', 'AU2', 'EX8',
       'EX9'], dtype=object)

In [11]:
#check if there are values other than "yes", "no" and "-". If so, you need to find out how that happened and tell Ellie
df["included"].unique()

array(['yes', 'no', '-', '?'], dtype=object)

In [17]:
#check if there are values other than the established ones in article_type. If so, you need to find out how that happened and tell Ellie
df["article_type"].unique() #there should be only "conceptual", "-", "empirical", "unknown", "review"

array(['conceptual', '-', 'empirical', 'unknown', 'review'], dtype=object)

In [18]:
#double check if anyone has included articles without selecting the article type
df.loc[(df["included"]=="yes") & (df["article_type"]=="-"),]["responsible"].unique()

array(['HVW', 'KBB', 'AU', 'DBM', 'PL', 'AU2'], dtype=object)

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
At the end, the "included" column should contain only "yes" and "no" values. For the included articles, the article type should be only "conceptual", "empirical", "review" or "unknown". After the table is cleaned, create a local copy and upload it to the EJ review Google Drive.
</div>
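
The rules above can be checked in one pass before saving the local copy. The following is a minimal sketch, assuming the column names shown in the `df.info()` output; the function name `find_problem_rows` is hypothetical.

```python
import pandas as pd

ALLOWED_INCLUDED = {"yes", "no"}
ALLOWED_TYPES = {"conceptual", "empirical", "review", "unknown"}


def find_problem_rows(frame):
    """Return the rows that still break the cleaning rules.

    A row is a problem if "included" is neither "yes" nor "no", or if it
    is included but has no valid article type.
    """
    bad_included = ~frame["included"].isin(ALLOWED_INCLUDED)
    is_included = frame["included"] == "yes"
    bad_type = is_included & ~frame["article_type"].isin(ALLOWED_TYPES)
    columns = ["short_id", "responsible", "included", "article_type"]
    return frame.loc[bad_included | bad_type, columns]
```

An empty result from `find_problem_rows(df)` would mean the table is ready to export; otherwise the `responsible` column shows who to contact.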

In [None]:
df.to_excel("yourpath", index=False)

## Organizing included articles
<div style="background-color: orange; padding: 10px; border-radius: 2px;">
</div>

In [23]:
#creating dataframe with the included articles
included_df = df.loc[df["included"]=="yes",].reset_index(drop=True)
included_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2771 entries, 0 to 2770
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   short_id      2771 non-null   int64 
 1   eid           2771 non-null   object
 2   doi           2771 non-null   object
 3   weblink       2771 non-null   object
 4   scholar_link  2771 non-null   object
 5   journal       2771 non-null   object
 6   author_names  2771 non-null   object
 7   year          2771 non-null   int64 
 8   title         2771 non-null   object
 9   abstract      2771 non-null   object
 10  included      2771 non-null   object
 11  article_type  2771 non-null   object
 12  responsible   2771 non-null   object
dtypes: int64(2), object(11)
memory usage: 281.6+ KB


<div style="background-color: darkblue; padding: 2px; border-radius: 2px;">
There is some metadata that I left out to make the screening process smoother. Some of it can help in the downloading process, for example the publisher or whether the article is open access. To retrieve it, you need to use the pybliometrics library:

https://pybliometrics.readthedocs.io/en/stable/

However, before that, you need to get your Scopus API key. You need to access Scopus through the Leuphana network:
https://dbis.ur.de/UBLUE/resources/3636?lang=de
</div>

In [24]:
from pybliometrics.scopus import ScopusSearch, AbstractRetrieval

import pybliometrics
pybliometrics.scopus.init()     #probably you need your API_KEY loaded in a .json file, similar to gsheet.json

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
Now the only thing you need is the eid, the unique identifier of each article
</div>

In [25]:
eids = included_df["eid"].tolist()
articles = []
for eid in eids:
    articles.append(AbstractRetrieval(eid, view='FULL'))  #this is the class to retrieve metadata
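
With a few thousand requests, a single failed call would stop the loop above and lose everything retrieved so far. Below is a minimal sketch of a more forgiving loop. The helper `retrieve_all` is hypothetical, and it takes the retrieval function as an argument so that it makes no assumption about the Scopus client or about which exceptions it raises.

```python
def retrieve_all(eids, retrieve):
    """Call retrieve(eid) for every eid and collect failures instead of stopping.

    `retrieve` is any callable that takes an eid and returns a record.
    Returns two dicts keyed by eid: the records and the error messages.
    """
    results = {}
    failed = {}
    for eid in eids:
        try:
            results[eid] = retrieve(eid)
        except Exception as error:  # broad on purpose: error types depend on the client
            failed[eid] = repr(error)
    return results, failed
```

In this notebook it could be called as `results, failed = retrieve_all(eids, lambda eid: AbstractRetrieval(eid, view='FULL'))`. Because the results are keyed by eid rather than by position, they would need to be matched back to `included_df` through the `eid` column.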

In [27]:
# for example getting the publisher of one article
articles[93].publisher

'John Wiley and Sons Inc'

In [34]:
#or checking whether it is open access
articles[1034].openaccessFlag 

True

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
You can create another dataframe with new columns
</div>

In [33]:
newdf = included_df.copy()

for i, article in enumerate(articles):
    newdf.loc[i,"publisher"] = article.publisher
    newdf.loc[i,"openaccess"] = article.openaccessFlag

In [35]:
newdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2771 entries, 0 to 2770
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   short_id      2771 non-null   int64 
 1   eid           2771 non-null   object
 2   doi           2771 non-null   object
 3   weblink       2771 non-null   object
 4   scholar_link  2771 non-null   object
 5   journal       2771 non-null   object
 6   author_names  2771 non-null   object
 7   year          2771 non-null   int64 
 8   title         2771 non-null   object
 9   abstract      2771 non-null   object
 10  included      2771 non-null   object
 11  article_type  2771 non-null   object
 12  responsible   2771 non-null   object
 13  publisher     2234 non-null   object
 14  openaccess    2389 non-null   object
dtypes: int64(2), object(13)
memory usage: 324.9+ KB


<div style="background-color: purple; padding: 2px; border-radius: 2px;">
Now you can split the dataframes into different ones depending on the type of article
</div>

In [38]:
empirical_df = newdf.loc[newdf["article_type"]=="empirical",].reset_index(drop=True)

In [41]:
empirical_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1628 entries, 0 to 1627
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   short_id      1628 non-null   int64 
 1   eid           1628 non-null   object
 2   doi           1628 non-null   object
 3   weblink       1628 non-null   object
 4   scholar_link  1628 non-null   object
 5   journal       1628 non-null   object
 6   author_names  1628 non-null   object
 7   year          1628 non-null   int64 
 8   title         1628 non-null   object
 9   abstract      1628 non-null   object
 10  included      1628 non-null   object
 11  article_type  1628 non-null   object
 12  responsible   1628 non-null   object
 13  publisher     1361 non-null   object
 14  openaccess    1441 non-null   object
dtypes: int64(2), object(13)
memory usage: 190.9+ KB
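
The same split can be done for all article types at once. A minimal sketch, assuming the `article_type` column used above; the function name `split_by_article_type` is hypothetical.

```python
import pandas as pd


def split_by_article_type(frame):
    """Return a dict mapping each article type to its own dataframe."""
    return {
        article_type: group.reset_index(drop=True)
        for article_type, group in frame.groupby("article_type")
    }
```

With this, `split_by_article_type(newdf)["empirical"]` would hold the same rows as `empirical_df`, and the other types would be available under their own keys.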


## Downloading
<div style="background-color: orange; padding: 10px; border-radius: 2px;">
</div>

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
I have tried different approaches such as APIs and web scraping, but automating the process takes much more time than the actual downloading. My personal recommendation is to download the PDFs manually and use the short id as the file name. The only way to save time is to minimize the number of clicks. You have one column with the original web page of the article and one with its Google Scholar link.
</div>
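
To keep track of manual downloads, it can help to list which articles still have no PDF. The following is a minimal sketch, assuming each downloaded file is saved as `<short_id>.pdf` in a single folder; the function name `missing_pdfs` and that file layout are assumptions, not something the notebook prescribes.

```python
from pathlib import Path


def missing_pdfs(short_ids, folder):
    """Return the short ids that have no '<short_id>.pdf' file in folder."""
    downloaded = {path.stem for path in Path(folder).glob("*.pdf")}
    return [short_id for short_id in short_ids if str(short_id) not in downloaded]
```

For example, `missing_pdfs(empirical_df["short_id"].tolist(), "empirical")` would return the empirical articles still to be downloaded, which can then be used to filter the `weblink` and `scholar_link` columns.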

<div style="background-color: violet; padding: 2px; border-radius: 2px;">
So, create four different folders (empirical, conceptual, review, unknown) in the same repository, or at least in the directory where you keep the Jupyter notebook file.
</div>
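
The folders can also be created from the notebook. A minimal sketch using the standard library; the function name `create_type_folders` is hypothetical, and the base directory defaults to the current working directory.

```python
from pathlib import Path

ARTICLE_TYPES = ("empirical", "conceptual", "review", "unknown")


def create_type_folders(base_dir="."):
    """Create one folder per article type and return their paths."""
    created = []
    for article_type in ARTICLE_TYPES:
        folder = Path(base_dir) / article_type
        folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(folder)
    return created
```

Running it twice is harmless, since existing folders are left untouched.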

## Labeling system
<div style="background-color: orange; padding: 10px; border-radius: 2px;">
</div>

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
Someone needs to check the "unknown" articles individually and either move them into the other folders manually or change the values of article_type in the <code>newdf</code> dataframe. You can also come up with another solution.
 
Once done, rename the files as follows (all in lowercase):

    "j_{type}_{first last name of first author}_{year}_{short_id}.pdf"

    
For example, a review article would be:

    "j_r_rodriguez_2021_467.pdf"


</div>
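
The naming pattern can be sketched as a small helper. Two assumptions are made here: the type code is the first letter of the article type (inferred from the "r" in the review example), and any whitespace inside the last name is removed. The function name `build_pdf_name` is hypothetical, and the last name is passed in directly because the format of `author_names` is not shown in this notebook.

```python
def build_pdf_name(article_type, last_name, year, short_id):
    """Build a lowercase file name following the j_{type}_{name}_{year}_{id} pattern."""
    type_letter = article_type.strip().lower()[0]   # assumption: first letter of the type
    author = "".join(last_name.lower().split())     # lowercase, whitespace removed
    return f"j_{type_letter}_{author}_{year}_{short_id}.pdf"
```

For the review example above, `build_pdf_name("review", "Rodriguez", 2021, 467)` gives `j_r_rodriguez_2021_467.pdf`.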

## Setting up table
<div style="background-color: orange; padding: 10px; border-radius: 2px;">
</div>

<div style="background-color: purple; padding: 2px; border-radius: 2px;">
To make the full-text review smoother, you need to make a readable table for the coders that only has the relevant columns. Add one column <code>to_exclude</code>, with preset values "-" and "yes". Also add another column <code>change_type_to</code> with preset values "empirical", "review", and "conceptual".
</div>
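
A minimal sketch of such a table follows. Which columns count as relevant is not specified, so the default selection below is an assumption, as are the function name `build_coder_table` and the choice to start with `to_exclude` set to "-" and `change_type_to` left empty. The preset values themselves would still have to be configured as dropdown options in the spreadsheet.

```python
import pandas as pd

# Assumption: a possible set of "relevant" columns for the coders.
CODER_COLUMNS = ("short_id", "author_names", "year", "title", "article_type")


def build_coder_table(frame, columns=CODER_COLUMNS):
    """Return a reduced copy of frame with the two extra coder columns."""
    table = frame.loc[:, list(columns)].copy()
    table["to_exclude"] = "-"        # coders switch this to "yes" when needed
    table["change_type_to"] = ""     # coders fill in a new type when needed
    return table
```

The result could then be saved with `to_excel`, as done for the cleaned table earlier in this notebook.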