# Data Engineering Project 
## Importing the raw data, exporting the clean data

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to clean the main raw data frame and write a new, clean data frame for further use. In this notebook, the comparisons of different read- and write-methods are demonstrated.

First, we install and import the necessary libraries from one cell (to avoid having libraries in some individual cells below). The packages and their versions to be installed will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [1]:
## NB!! run the installs from terminal


########### Library Installations ##############
# !pip install opendatasets # install the library for downloading the data set
# ! pip install habanero
#!pip install genderize
!pip install pybliometrics
################################################

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations

### Specific-purpose libraries
import opendatasets as od # downloading the data set from Kaggle
# from habanero import Crossref # CrossRef API

### Misc
import warnings # suppress warnings
import time # tracking time
import os # accessing directories

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings



## 1. Data Import
In order to download the data from Kaggle to a machine, it would be necessary to create a Kaggle API token. Make sure to include the `kaggle.json` fle in the same directory as this notebook.

Some additional resources:
- How to download the datasets from kaggle with `opendatasets` library https://www.analyticsvidhya.com/blog/2021/04/how-to-download-kaggle-datasets-using-jupyter-notebook/
- Github repo for `opendatasets` library: https://github.com/JovianML/opendatasets

First download the file (should be around `1.09 GB`. It will be stored in the `.arxiv/` directory. In case the file already exists, the download will be ignored with the `force = False` argument.

In [None]:
# Initialize the time of pipeline
start_pipe = time.time()

print(f'Time of pipeline start: {time.ctime(start_pipe)}')

In [None]:
od.download("https://www.kaggle.com/datasets/Cornell-University/arxiv", 
                     force = True # force = True downloads the file even if it finds a file with the same name
                    )

Import the JSON file as pandas dataframe. For testing purposes, select how many rows are included. if `n_rows = "all"`, the entire data set is imported.

In [2]:
n_rows = 'all'

start_time = time.time()
if n_rows == "all":
    df_raw = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True)
else:
    df_raw  = pd.read_json("arxiv/arxiv-metadata-oai-snapshot.json", lines = True, nrows = n_rows)

end_time = time.time()

print(f'Time elapsed: {end_time - start_time} seconds.')
print(f'Memory usage of raw df: {df_raw.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
print(f'Dataframe dimensions: {df_raw.shape}')
df_raw.head(2)

Time elapsed: 226.76795101165771 seconds.
Memory usage of raw df: 4.094871174544096 GB.
Dataframe dimensions: (2178366, 14)


Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,ANL-HEP-PR-07-12,hep-ph,,A fully differential calculation in perturba...,"[{'version': 'v1', 'created': 'Mon, 2 Apr 2007...",2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0002,Louis Theran,Ileana Streinu and Louis Theran,Sparsity-certifying Graph Decompositions,To appear in Graphs and Combinatorics,,,,math.CO cs.CG,http://arxiv.org/licenses/nonexclusive-distrib...,"We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",2008-12-13,"[[Streinu, Ileana, ], [Theran, Louis, ]]"


## 2. Preliminary Data Cleaning
In this step, data cleaning is performed. Here are the guidelines from the assignment:

- You can drop the abstract as it is not required in the scope of this project,
- You can drop publications with very short titles, e.g. one word, with empty authors

What we do is we first drop all the columns that we are not planning to use in the project. Then, we are excluding the rows where works do not have a DOI. While we aknowledge that some valid publications do not have a DOI, a DOI demonstrates that this work is published (whether in a journal, as a pre-print, etc) and, hence, serves as a marker for publication quality. Finally, we exclude titles which have a length smaller than 10 characters - here, the main idea is to exclude all non-validly titled works, as <10 characters would amount to three words of three characters with two spaces - a rather rare title.

In [3]:
# Drop the abstract, submitter, comments, report-no, versions, journal-ref, and license, as these features are not used in this project
## Of note, journal name will be retrieved later with a more standard label
df_raw = df_raw.drop(['abstract', 'submitter', 'comments', 'report-no', 'license', 'versions', 'journal-ref'], axis = 1)
df_raw.shape

(2178366, 7)

In [4]:
# Drop duplicates 
df_raw = df_raw.drop_duplicates(subset=['id'])
df_raw.shape

(2178362, 7)

In [5]:
# Include only works with non-null values in doi
df_raw = df_raw[~df_raw['doi'].isnull()]
df_raw.shape

(1088467, 7)

In [6]:
# Drop the publications with very short titles (less than 3 words)
df_raw = df_raw[(df_raw['title'].map(len) > 10)]
df_raw = df_raw.reset_index(drop = True)
print(df_raw.shape)

# Set the index of each paper to 'id'
# df = df.set_index('id')
print(f'Dataframe dimensions: {df_raw.shape}')
print(f'Memory usage of raw pandas df: {df_raw.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
df_raw.head(3)

(1088269, 7)
Dataframe dimensions: (1088269, 7)
Memory usage of raw pandas df: 0.6853806916624308 GB.


Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
2,704.0007,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,gr-qc,2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."


## 3. Fact and Dimension tables for Data Warehouse (DWH)

Here, we create the tables with placeholder columns. In this data schema, we are using two factless fact tables: `authorship` that links articles (and its properties) with authors, and `article_category` which reflects scientific domain information for each article.

**Fact table** <br>
- `authorship`: links articles to authors
    - `article_id`: VARCHAR article id (allows to retrieve this id from the original, raw df)
    - `author_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)
    
    
- `article_category`: links articles to authors
    - `article_id`: VARCHAR article id (allows to retrieve this id from the original, raw df)
    - `category_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)

**Dimension tables** <br>
- `article`: contains the information about all unique publications and links the dimension tables. The columns are:
    - PK `article_id`: VARCHAR article id (allows to retrieve this id from the original, raw df)
    - `title`: VARCHAR article title
    - `doi`: VARCHAR article DOI
    - `journal_id`:VARCHAR journal ID based on ISSN linking to the `journal` table
    - `year`: INT year of publication
    - `n_cites`: INT the number of citations (FACT)
    - `n_authors`: INT the number of co-authors
    

- `author`: includes all individual authors of publications.
    - PK `author_id`: VARCHAR composed from author's last name and first name initial (e.g., LastF)
    - `lastname`: VARCHAR author's last name 
    - `first`: VARCHAR author's first name initial
    - `middle`: VARCHAR author's middle name initial (if any)
    - `gender`: INT (1 or 0), denoting 'Female' and 'Male', respectively (AUGMENTED VIA API!)
    - `affiliation`: VARCHAR author's affiliation (AUGMENTED VIA API!)
    - `hindex`: VARCHAR author's hindex (AUGMENTED VIA API OR COMPUTED (N PAPERS W/ N CITES)!
    
    
- `journal`: includes all unique journals in which works were published
    - PK `journal_id`: VARCHAR journal ID
    - `issn`: VARCHAR journal ISSN (necessary for augmentation)
    - `title`: VARCHAR journal title
    - `if_latest`: FLOAT journal's latest Impact Factor (AUGMENTED VIA API!)

- `category`: includes categories associated with articles
    - PK `category_id`: VARCHAR
    - `superdom`: VARCHAR super-domain of the category
    - `subdom`: VARCHAR sub-domain of the category
    
    
The DWH ERD figure is below:

<img src="images/dwh_erd.png"/>

**<font color = 'red'> USE A TEST DATA SET OF 1000 SAMPLES: </font>**

In [7]:
## Prepare data for small-scale testing
df = df_raw.iloc[:1000,:] # Take a thousand rows for testing
df.head()

Unnamed: 0,id,authors,title,doi,categories,update_date,authors_parsed
0,704.0001,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,hep-ph,2008-11-26,"[[Balázs, C., ], [Berger, E. L., ], [Nadolsky,..."
1,704.0006,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,cond-mat.mes-hall,2015-05-13,"[[Pong, Y. H., ], [Law, C. K., ]]"
2,704.0007,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,gr-qc,2008-11-26,"[[Corichi, Alejandro, ], [Vukasinac, Tatjana, ..."
3,704.0008,Damian C. Swift,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,cond-mat.mtrl-sci,2009-02-05,"[[Swift, Damian C., ]]"
4,704.0009,"Paul Harvey, Bruno Merin, Tracy L. Huard, Luis...","The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,astro-ph,2010-03-18,"[[Harvey, Paul, ], [Merin, Bruno, ], [Huard, T..."


### 3.1. Factless fact tables

#### 3.1.1. Factless fact table: `authorship`

In [8]:
# Create the table fro article id and authors list
## NB! Creating `authorship_raw` - for later authors extraction
authorship_raw = df[['id', 'authors_parsed']].set_index('id')
authorship_raw['n_authors'] = authorship_raw['authors_parsed'].str.len()
authorship_raw = pd.DataFrame(authorship_raw['authors_parsed'].explode()).reset_index()

# Create additional columns
authorship_raw['last_name'] = np.nan
authorship_raw['first_name'] = np.nan
authorship_raw['middle_name'] = np.nan

# Update the last, first, and middle names
for i in range(len(authorship_raw)):
    authorship_raw['last_name'][i] = authorship_raw['authors_parsed'][i][0]
    authorship_raw['first_name'][i] = authorship_raw['authors_parsed'][i][1]
    authorship_raw['middle_name'][i] = authorship_raw['authors_parsed'][i][2]

# Drop the redundant column
authorship_raw = authorship_raw.drop(columns = 'authors_parsed')

# Author_identifier
authorship_raw['author_id'] = authorship_raw['last_name'] + authorship_raw['first_name'].str[0]
# Rename article id column
authorship_raw = authorship_raw.rename({'id':'article_id'}, axis = 1)

# Final table
authorship = authorship_raw.drop(columns = ['last_name', 'first_name', 'middle_name'])

print(f'Dataframe dimensions: {authorship.shape}')
print(f'Memory usage of raw pandas df: {authorship.memory_usage(deep = True).sum()/1024/1024/1024} GB.')
authorship.head()

Dataframe dimensions: (3651, 2)
Memory usage of raw pandas df: 0.000444863922894001 GB.


Unnamed: 0,article_id,author_id
0,704.0001,BalázsC
1,704.0001,BergerE
2,704.0001,NadolskyP
3,704.0001,YuanC
4,704.0006,PongY


#### 3.1.2. Factless fact table: `article_category`

In [9]:
# Article-category factless fact table
article_category = df[['id', 'categories']].set_index('id')
article_category = pd.DataFrame(article_category['categories'].str.split(' ').explode()) # extract category codes for articles in long-df
article_category = article_category.reset_index()

article_category = article_category.rename(columns = {'id':'article_id', 'categories':'category_id'})

print(f'Dataframe dimensions: {article_category.shape}')
print(f'Memory usage of raw pandas df: {article_category.memory_usage(deep = True).sum()/1024/1024} MB.')
article_category.head()

Dataframe dimensions: (1470, 2)
Memory usage of raw pandas df: 0.18647289276123047 MB.


Unnamed: 0,article_id,category_id
0,704.0001,hep-ph
1,704.0006,cond-mat.mes-hall
2,704.0007,gr-qc
3,704.0008,cond-mat.mtrl-sci
4,704.0009,astro-ph


### 3.2. Dimensions tables

#### 3.2.1. Dimension table: `article`

In [10]:
article = pd.DataFrame(columns = ['article_id', 'title', 'doi', 'n_authors', 'journal_issn', 'n_cites', 'year'])
article['article_id'] = df['id']
article['title'] = df['title']
article['doi'] = df['doi']
article['n_authors'] = df['authors_parsed'].str.len() # get the number of authors
article['year'] = df['update_date'].str.split('-').map(lambda x: x[0]).astype(int)
#article = article.drop(column = 'date')

print(f'Dataframe dimensions: {article.shape}')
print(f'Memory usage of raw pandas df: {article.memory_usage(deep = True).sum()/1024/1024} MB.')
article.head()

Dataframe dimensions: (1000, 7)
Memory usage of raw pandas df: 0.3371152877807617 MB.


Unnamed: 0,article_id,title,doi,n_authors,journal_issn,n_cites,year
0,704.0001,Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,4,,,2008
1,704.0006,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,2,,,2015
2,704.0007,Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,3,,,2008
3,704.0008,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,1,,,2009
4,704.0009,"The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,7,,,2010


In [11]:
article.head()

Unnamed: 0,article_id,title,doi,n_authors,journal_issn,n_cites,year
0,704.0001,Calculation of prompt diphoton production cros...,10.1103/PhysRevD.76.013009,4,,,2008
1,704.0006,Bosonic characters of atomic Cooper pairs acro...,10.1103/PhysRevA.75.043613,2,,,2015
2,704.0007,Polymer Quantum Mechanics and its Continuum Limit,10.1103/PhysRevD.76.044016,3,,,2008
3,704.0008,Numerical solution of shock and ramp compressi...,10.1063/1.2975338,1,,,2009
4,704.0009,"The Spitzer c2d Survey of Large, Nearby, Inste...",10.1086/518646,7,,,2010


#### 3.2.2. Dimension table: `author`
NB! Dependency on `authorship_raw` table, i.e., data is extracted from it.

In [12]:
# Create the table from the `authorship` table
author = authorship_raw[['author_id', 'last_name', 'first_name', 'middle_name']]

# Drop duplicates
author.drop_duplicates(keep=False,inplace=True)

# Add the `gender` column to be augmented
author['gender'] = np.nan
author['affiliation'] = np.nan
author['hindex'] = np.nan

# Sort alphabetically by last name
author = author.sort_values('author_id').reset_index(drop = True)

# Final table
print(f'Dataframe dimensions: {author.shape}')
print(f'Memory usage of raw pandas df: {author.memory_usage(deep = True).sum()/1024/1024} MB.')
author.head()

Dataframe dimensions: (3214, 7)
Memory usage of raw pandas df: 0.8408908843994141 MB.


Unnamed: 0,author_id,last_name,first_name,middle_name,gender,affiliation,hindex
0,AarsethS,Aarseth,Sverre J.,,,,
1,AbabnehB,Ababneh,Bashar S.,,,,
2,AbbottD,Abbott,Derek,,,,
3,AbeE,Abe,Eisuke,,,,
4,AbrahamsE,Abrahams,E.,,,,


#### 3.2.3. Dimension table: `journal `

In [13]:
journal = pd.DataFrame(columns = ['issn', 'title', 'if_latest'])

print(f'Dataframe dimensions: {journal.shape}')
print(f'Memory usage of raw pandas df: {journal.memory_usage(deep = True).sum()/1024/1024} MB.')
journal.head()

Dataframe dimensions: (0, 3)
Memory usage of raw pandas df: 0.0 MB.


Unnamed: 0,issn,title,if_latest


#### 3.2.4. Dimension table: `category`
NB! Dependency on `article_category` table, i.e., data is extracted from it.

In [14]:
# Categories dimension table
category = pd.DataFrame(article_category['category_id'].copy().reset_index(drop = True))
category[['superdom', 'subdom']] = category['category_id'].str.split('.', expand = True) # exract supr- and subdomain
category = category.drop_duplicates() # drop duplicate rows
category = category.sort_values('category_id').reset_index(drop = True) # sort values, reset index

print(f'Dataframe dimensions: {category.shape}')
print(f'Memory usage of raw pandas df: {category.memory_usage(deep = True).sum()/1024/1024} MB.')
category.head()

Dataframe dimensions: (88, 3)
Memory usage of raw pandas df: 0.015672683715820312 MB.


Unnamed: 0,category_id,superdom,subdom
0,astro-ph,astro-ph,
1,astro-ph.HE,astro-ph,HE
2,cond-mat.dis-nn,cond-mat,dis-nn
3,cond-mat.mes-hall,cond-mat,mes-hall
4,cond-mat.mtrl-sci,cond-mat,mtrl-sci


# 2. Data Augmentation

In [None]:
# Tables:
## authorship
## article_category
## category
## journal <-- augment all data (use ISSN from DOI)
## article <-- augment with number of citations
## author <-- augment with gender and affiliation

In [15]:
import pandas as pd
import requests
from pybliometrics.scopus import AbstractRetrieval

# Sample a df of size N 
N = 50

#original
#test_journal = journal
#test_article = article

test_journal = journal.reset_index(drop = True)
test_article = article.sample(N).reset_index(drop = True)

# Set the API base URL
doi_url = "https://api.crossref.org/works/"
jornals_url = "https://api.crossref.org/journals/"

# Define a function to retrieve data from the API
def get_publication_data(number, doi = True):
    if doi:
        url = doi_url + number 
    else:
        url = jornals_url + number
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        return data
    else:
        return None
    
# With Scopus 
    
#index2 = 0
# Iterate over the rows of the dataframe
#for index, row in test_article.iterrows():
#    doi = row["doi"]
#    doi_data = AbstractRetrieval(doi)
#    doi = row["doi"]
#    data = AbstractRetrieval(doi)
    # Check if data is retrieved
#    if doi_data is not None:
#        test_article.loc[index, "n_cites"] = data.citedby_count
#        if "issn-type" in data["message"]:
#            value = doi_data["message"]["issn-type"][0]["value"]
#            test_article.loc[index, "journal_issn"] = value
#            if value not in test_journal:
#                data_issn = get_publication_data(value, False)
#                journal.loc[index2, "issn"] = value
#                if data_issn is not None:
#                    journal.loc[index2, "title"] = data_issn["message"]["title"]
#                index2 += 1
            
#With Crossref   
index2 = 0
#Iterate over the rows of the dataframe
for index, row in test_article.iterrows():
    doi = row["doi"]
    data = get_publication_data(doi)
    # Check if data is retrieved
    if data is not None:
        if "reference-count" in data["message"]:
            test_article.loc[index, "n_cites"] = data["message"]["reference-count"]
        if "issn-type" in data["message"]:
            value = data["message"]["issn-type"][0]["value"]
            test_article.loc[index, "journal_issn"] = value
            if value not in test_journal:
                data_issn = get_publication_data(value, False)
                journal.loc[index2, "issn"] = value
                if data_issn is not None:
                    journal.loc[index2, "title"] = data_issn["message"]["title"]
                index2 += 1

            
print(journal.sample())

         issn              title if_latest
28  1098-0121  Physical Review B       NaN


In [16]:
from pybliometrics.scopus import AbstractRetrieval
from pybliometrics.scopus import AuthorRetrieval

ab = AbstractRetrieval("10.1016/j.softx.2019.100263")

#Scopus429Error ?
#au2 = AuthorRetrieval(ab.authors[1].auid) 

In [None]:
import pandas as pd
import requests
from urllib.request import urlopen
from genderize import Genderize
from pybliometrics.scopus import AbstractRetrieval

#Test sets 
N = 50
test_authorship = authorship.sample(N).reset_index(drop = True)
test_article = article.sample(N).reset_index(drop = True)
test_author = author.sample(N).reset_index(drop = True)

# Set the API base URL
base_url = "https://api.crossref.org/works/"

gender_url = "https://api.genderize.io/?name="

# Define a function to retrieve data from the API
def get_publication_data(number): 
    url = base_url + number
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        return data
    else:
        return None

# Iterate over the rows of the dataframe
for index, row in test_authorship.iterrows():
    #Get row of Article
    art_row = article.loc[article["article_id"] == row["article_id"]]
    doi = art_row["doi"].iloc[0]
    data = get_publication_data(doi)
    if data is not None:
        # Update the dataframe with data from the API
        if "author" in data["message"]:
            message = data["message"]["author"]
            for i in range(len(message)):
                if "name" in message[i]['affiliation']:
                    test_author.loc[test_author['last_name'] == message[i]['family'], 'affiliation'] = message[i]['affiliation'][0]
        try:
            gender = Genderize().get(message[i]["given"])
            test_author.loc[test_author['last_name'] == message[i]['family'], 'gender'] = gender[0]['gender']
        except:
            print("Request limit too low to process request")

                                         
                
#test_author["gender"] = np.where(test_author["gender"] == "female", 0, 1)               

### To .csv

In [None]:
# Make a directory 'tables'
!mkdir tables

In [None]:
authorship.to_csv('tables/authorship.csv', index = False)
article_category.to_csv('tables/article_category.csv', index = False)
category.to_csv('tables/category.csv', index = False)
journal.to_csv('tables/journal.csv', index = False)
article.to_csv('tables/article.csv', index = False)
author.to_csv('tables/author.csv', index = False)

# 3. From Pandas to PostgreSQL

In [None]:
import psycopg2

In [None]:
# Import the data from Pandas
authorship = pd.read_csv('tables/authorship.csv')
article_category = pd.read_csv('tables/article_category.csv')
category = pd.read_csv('tables/category.csv')
journal = pd.read_csv('tables/journal.csv')
article = pd.read_csv('tables/article.csv')
author = pd.read_csv('tables/author.csv')

In [None]:
authorship

# Database Connection

In [None]:
# Connect to the database
conn = psycopg2.connect(host="postgres", user="postgres", password="password", database="postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()

# create sparkify database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS research_db")
cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

# Drop Tables

In [None]:
# Drop Tables 
authorship_drop = "DROP TABLE IF EXISTS authorship;"
article_category_drop = "DROP TABLE IF EXISTS article_category;"
category_drop = "DROP TABLE IF EXISTS category;"
journal_drop = "DROP TABLE IF EXISTS journal;"
article_drop = "DROP TABLE IF EXISTS article;"
author_drop = "DROP TABLE IF EXISTS author;"

drop_tables = [authorship_drop, article_category_drop, category_drop, journal_drop, article_drop, author_drop]

for query in drop_tables:
    cur.execute(query)
    conn.commit()

# Create Tables

In [None]:
# Create Tables
## authorship
authorship_create = ("""
CREATE TABLE IF NOT EXISTS authorship
(article_id VARCHAR, 
author_id VARCHAR,
PRIMARY KEY (article_id, author_id) 
);
""")

## article
article_category_create = ("""
CREATE TABLE IF NOT EXISTS article_category
(article_id VARCHAR, 
category_id VARCHAR,
PRIMARY KEY (article_id, category_id) 
);
""")

## category
category_create =  ("""
CREATE TABLE IF NOT EXISTS category
(category_id VARCHAR,
superdom VARCHAR,
subdom VARCHAR,
PRIMARY KEY (category_id) 
);
""")

## article
article_create =  ("""
CREATE TABLE IF NOT EXISTS article
(article_id VARCHAR,
title VARCHAR,
doi VARCHAR,
n_authors INT,
journal_issn VARCHAR,
n_cites VARCHAR,
year VARCHAR,
PRIMARY KEY (article_id) 
);
""")

## author
author_create =  ("""
CREATE TABLE IF NOT EXISTS author
(author_id VARCHAR,
last_name VARCHAR,
first_name VARCHAR,
middle_name VARCHAR,
gender VARCHAR,
affiliation VARCHAR,
hindex VARCHAR,
PRIMARY KEY (author_id) 
);
""")

## journal
journal_create =  ("""
CREATE TABLE IF NOT EXISTS journal
(journal_id VARCHAR,
issn VARCHAR,
title VARCHAR,
if_latest FLOAT,
PRIMARY KEY (journal_id) 
);
""")

create_tables = [authorship_create, article_category_create, category_create, article_create, author_create, journal_create]

for query in create_tables:
        cur.execute(query)
        conn.commit()

# Insert into Tables

In [None]:
authorship_insert = ("""
INSERT INTO authorship (article_id, author_id)
VALUES (%s, %s)
""") ## ON CONFLICT (article_id) DO NOTHING <-- might need to add to the end

article_category_insert = ("""
INSERT INTO article_category (article_id, category_id)
VALUES (%s, %s)
""")

category_insert = ("""
INSERT INTO category (category_id, superdom, subdom)
VALUES (%s, %s, %s)
""")

article_insert = ("""
INSERT INTO article (article_id, title, doi, n_authors, journal_issn, n_cites, year)
VALUES (%s, %s, %s, %s, %s, %s, %s)
""")

author_insert = ("""
INSERT INTO author (author_id, last_name, first_name, middle_name, gender, affiliation, hindex)
VALUES (%s, %s, %s, %s, %s, %s, %s)
""")

journal_insert = ("""
INSERT INTO journal (journal_id, issn, title, if_latest)
VALUES (%s, %s, %s, %s)
""")

# ----- #

# Name of tables (for later print)
tables = [authorship, article_category, category, article, author, journal]
authorship.name = 'authorship'
article_category.name = 'article_category'
category.name = 'category'
article.name = 'article'
author.name = 'author'
journal.name = 'journal'

insert_tables = [authorship_insert, article_category_insert, category_insert, article_insert, author_insert, journal_insert]


def insert_to_tables(table, query):
    ''' Helper function for inserting values to Postresql tables
    Args:
        table (pd.DataFrame): pandas table
        query (SQL query): correspondive SQL query for 'table' for data insertion in DB
    '''
    
    print(f'Trying to insert table -- {table.name} -- ...')
    
    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except:
        print(f'Error with table -- {table.name} --')
    print()
        
for  i in range(len(tables)):
    insert_to_tables(tables[i], insert_tables[i])

# Test Queries

In [None]:
#!pip3 install ipython-sql
%load_ext sql
%sql postgresql://postgres:password@postgres/postgres

In [None]:
%sql SELECT * FROM authorship LIMIT 10;

In [None]:
%sql SELECT * FROM article_category LIMIT 10;

In [None]:
%sql SELECT * FROM article LIMIT 10;

In [None]:
%sql SELECT * FROM author LIMIT 10;

In [None]:
%sql SELECT * FROM category LIMIT 10;

# 4. Preparing Graph DB Data
In essence, we need to (a) rename the attributes to be compliant with Neo4J notation, and (b) save the above-created tables to .csv-s: https://medium.com/@st3llasia/analyzing-arxiv-data-using-neo4j-part-1-ccce072a2027

- about network analysis with these data in Neo4J: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>

# 5. Example Queries

## 5.1. Data Warehouse

## 5.2. Graph Database

## Total Pipeline Runtime

In [None]:
end_pipe = time.time()

print(f'Time of pipeline start: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60} min.')