## Load Dataset

### Import Libraries

In [1]:
import os
import re
from pathlib import Path

import pandas as pd
from tqdm import tqdm

### Define the ResearchThemeParser class

In [2]:
class ResearchThemeParser:

    @staticmethod
    def get_research_theme_files():
        """
        Return a list of all CSV files found (recursively) under the
        research_theme/documents directory.
        :return: list of pathlib.Path objects
        """
        return list(Path('research_theme').joinpath(
            'documents').glob('**/*.csv'))

    def create_paper_level_dataset(self) -> pd.DataFrame:
        """
        Build a DataFrame with one row per paper, where the 'text' column
        holds all sentences of the paper joined into a single string.
        :return: pd.DataFrame with a single 'text' column
        """
        docs = list()
        # read each CSV file returned by get_research_theme_files()

        for fpath in tqdm(self.get_research_theme_files()):
            df = pd.read_csv(fpath)
            
            # get the filename (last component of the path)
            filename = fpath.name

            # remove the .csv extension (literal dot, anchored at the end)
            filename = re.sub(r'\.csv$', '', filename)
            paper_corpora = ' '.join(df['sentence'])

            # construct a new document dictionary with the captured information
            doc = {'text': paper_corpora}
            docs.append(doc)

        return pd.DataFrame(docs)

### Run the Parser to extract data

In [3]:
if __name__ == "__main__":
    parser = ResearchThemeParser()
    paper_level_rt = parser.create_paper_level_dataset()
    print(paper_level_rt)

100%|█████████████████████████████████████████████████████████████████████████████████| 92/92 [00:00<00:00, 181.05it/s]

                                                 text
0   Multivariate Granger causality between CO2 emi...
1   Clean fuels for resource-poor settings: A syst...
2   Adverse health impacts of cooking with kerosen...
3   Effect of hydrogen supplementation on engine p...
4   Upcoming Power Crisis in India – Increasing El...
..                                                ...
87  Political, economic and environmental impacts ...
88  Environmental impacts of micro-wind turbines a...
89  A new model to assess the environmental impact...
90  Climatic physical snowpack properties for larg...
91  Empirical determinants of eco-innovation in Eu...

[92 rows x 1 columns]





## Exploratory Analysis

### Overview of the data

In [4]:
import glob

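# portable path to the documents directory
path = os.path.join('research_theme', 'documents')
all_files = glob.glob(os.path.join(path, '*.csv'))

# keep only the file name of each path
sliced_list = [os.path.basename(file_path) for file_path in all_files]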

In [5]:
# set the pandas display options for the notebook
pd.set_option('display.max_columns', 10) # display up to 10 columns
pd.set_option('display.max_rows', 10) # display up to 10 rows

paper_level_rt['filename'] = sliced_list

display(paper_level_rt)

                                                 text  filename
0   Multivariate Granger causality between CO2 emi...  ABC_G1B1_10.1016 j.energy.2010.09.041.csv
1   Clean fuels for resource-poor settings: A syst...  ABC_G1B1_10.1016_j.envres.2016.01.002.csv
2   Adverse health impacts of cooking with kerosen...  ABC_G1B1_10.1016_j.envres.2020.109851.csv
3   Effect of hydrogen supplementation on engine p...  ABC_G1B1_Corpus ID 104118586.csv
4   Upcoming Power Crisis in India – Increasing El...  ABC_G1B1_Corpus ID 190457136.csv
..                                                ...  ...
87  Political, economic and environmental impacts ...  RST_G7B4_S0306261909001688.csv
88  Environmental impacts of micro-wind turbines a...  RST_G7B4_S0360544213005355.csv
89  A new model to assess the environmental impact...  RST_G7B4_S0959652614006386.csv
90  Climatic physical snowpack properties for larg...  RST_G7B4_S1873965212000060.csv
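The `filename` column above is filled from a separate `glob.glob` listing, so it only lines up with the `text` column if both listings return the files in the same order. The sketch below shows one way to avoid relying on that: it records the file name in the same loop that reads the text. The helper name `build_dataset_with_filenames` and the demo files are hypothetical and are not part of the original notebook; only the `sentence` column name is taken from it.

```python
import tempfile
from pathlib import Path

import pandas as pd


def build_dataset_with_filenames(root: Path) -> pd.DataFrame:
    """Hypothetical helper: one row per CSV file found under root."""
    rows = []
    # sorted() gives a deterministic order regardless of the file system
    for fpath in sorted(root.glob('**/*.csv')):
        df = pd.read_csv(fpath)
        rows.append({
            'text': ' '.join(df['sentence'].astype(str)),
            'filename': fpath.name,  # taken from the same path as the text
        })
    return pd.DataFrame(rows)


# Small self-contained demo with made-up files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pd.DataFrame({'sentence': ['First paper.', 'More text.']}).to_csv(
        root / 'b_paper.csv', index=False)
    pd.DataFrame({'sentence': ['Second paper.']}).to_csv(
        root / 'a_paper.csv', index=False)
    demo = build_dataset_with_filenames(root)

print(demo)
```

Because each row is built from a single path, the text and the file name cannot drift apart even if the directory listing order changes.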


In [6]:
# See the text of the first paper
idx = 0

print(paper_level_rt.loc[idx, 'text'])

Multivariate Granger causality between CO2 emissions, energy consumption, FDI (foreign direct investment) and GDP (gross domestic product): Evidence from a panel of BRIC (Brazil, Russian Federation, India, and China) countries Abstract: This paper addresses the impact of both economic growth and financial development on environmental degradation using a panel cointegration technique for the period between 1980 and 2007, except for Russia (1992–2007). In long-run equilibrium, CO2 emissions appear to be energy consumption elastic and FDI inelastic, and the results seem to support the Environmental Kuznets Curve (EKC) hypothesis. The causality results indicate that there exists strong bidirectional causality between emissions and FDI and unidirectional strong causality running from output to FDI. The evidence seems to support the pollution haven and both the halo and scale effects. Therefore, in attracting FDI, developing countries should strictly examine the qualifications for foreign in

### Variables' types

In [7]:
paper_level_rt.dtypes

text        object
filename    object
dtype: object

### Dataset's size & shape

In [8]:
print("Dataset size:", len(paper_level_rt))
print('Dataset shape: {}'.format(paper_level_rt.shape))

Dataset size: 92
Dataset shape: (92, 2)


### Check for missing values

In [9]:
paper_level_rt.isna().sum() 

text        0
filename    0
dtype: int64
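`isna()` only counts truly missing values, so an empty or whitespace-only string would still pass the check above. The sketch below, which uses a made-up frame in place of `paper_level_rt`, shows one possible way to count such blank texts as well.

```python
import pandas as pd

# Made-up frame standing in for paper_level_rt
demo = pd.DataFrame({
    'text': ['A real abstract.', '', '   ', None],
    'filename': ['a.csv', 'b.csv', 'c.csv', 'd.csv'],
})

# Missing values only (what isna() reports)
n_missing = demo['text'].isna().sum()

# Treat missing values as empty, strip whitespace, then test for emptiness
is_blank = demo['text'].fillna('').str.strip().eq('')

print("missing:", n_missing)
print("blank or missing:", is_blank.sum())
```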

In [10]:
# Basic summary statistics for each column
paper_level_rt.describe().T 

          count  unique  top                                                freq
text      92     92      Title: Indigenous men's groups and social and ...  1
filename  92     92      LNR_G5B2_10.1007_s00267-008-9197-0.csv             1


### Drop duplicate papers

In [11]:
newdata = paper_level_rt.drop_duplicates(subset=['text'], keep='last')

In [12]:
# Sanity check: counts should match the original if no duplicates were removed
newdata.describe().T 

          count  unique  top                                                freq
text      92     92      Title: Indigenous men's groups and social and ...  1
filename  92     92      LNR_G5B2_10.1007_s00267-008-9197-0.csv             1
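To see which rows a deduplication step would affect before dropping anything, `DataFrame.duplicated` can be used alongside `drop_duplicates`. The following is a minimal sketch on a made-up frame, not the notebook's data, using the same `subset` and `keep` arguments as above.

```python
import pandas as pd

# Made-up frame with one repeated text
demo = pd.DataFrame({
    'text': ['same text', 'unique text', 'same text'],
    'filename': ['a.csv', 'b.csv', 'c.csv'],
})

# keep=False marks every member of a duplicate group, not only the extras
dupes = demo[demo.duplicated(subset=['text'], keep=False)]
print(dupes)

# keep='last' retains the final occurrence of each repeated text
deduped = demo.drop_duplicates(subset=['text'], keep='last')
print(len(demo) - len(deduped), "row(s) dropped")
```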


## Extract data

### Create the output folder (also provided on GitHub)

In [13]:
# Either download the "Dataset" directory from GitHub, or uncomment the line below and run it
# !mkdir Dataset
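As an alternative to the shell command, the folder can be created from Python, which behaves the same on every platform and does not fail if the folder already exists. The sketch below runs inside a temporary directory so that it has no side effects; in the notebook itself the call would presumably be `os.makedirs("Dataset", exist_ok=True)`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "Dataset")
    os.makedirs(out_dir, exist_ok=True)
    # A second call does nothing because the directory already exists
    os.makedirs(out_dir, exist_ok=True)
    created = os.path.isdir(out_dir)

print(created)
```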

### Export the DataFrame to a CSV file

In [14]:
path = "./Dataset/"

filename_write = os.path.join(path, "heldout_dataset.csv")
# write the deduplicated frame created in the "Drop duplicate papers" step
newdata.to_csv(filename_write, index=False)
print("Extraction of 'heldout_dataset.csv' is finished")


Extraction of 'heldout_dataset.csv' is finished
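A quick way to confirm that the export worked is to read the file back and compare it with the frame that was written. The sketch below does this with a made-up frame and a temporary directory; the file name `heldout_dataset.csv` is reused from above, and everything else is illustrative.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with commas and quotes, which CSV writers must escape
demo = pd.DataFrame({
    'text': ['First, with a comma', 'Second "quoted" text'],
    'filename': ['a.csv', 'b.csv'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'heldout_dataset.csv')
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Compare shape and contents of the reloaded frame with the original
roundtrip_ok = (
    reloaded.shape == demo.shape
    and reloaded['text'].tolist() == demo['text'].tolist()
    and reloaded['filename'].tolist() == demo['filename'].tolist()
)
print("round trip ok:", roundtrip_ok)
```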
