# Data Engineering Project 
## ETL

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this notebook is to clean the main raw data frame and write a new, clean data frame for further use. The notebook also compares different read and write methods.

First, we install and import the necessary libraries from one cell (to avoid having libraries in some individual cells below). The packages and their versions to be installed will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [None]:
#!pip install -r requirements.txt

In [1]:
## NB!! run the installs from terminal
########### Library Installations ##############

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations


### Specific-purpose libraries
# NB! Most of these need to be configured with an API key
#from pybliometrics.scopus import AbstractRetrieval
from habanero import Crossref # CrossRef API
from genderize import Genderize # Gender API

### Misc
from math import floor
import time
import requests
import warnings # suppress warnings
import os # accessing directories
from tqdm import tqdm # track loop runtime
from unidecode import unidecode # international encoding of names

### Custom Scripts (ETL, augmentations, SQL)
#from dags.scripts.raw_to_tables import *
#from dags.scripts.augmentations import *
#from dags.scripts.final_tables import *
from dags.scripts.sql_queries import *
from dags.scripts.neo4j_queries import *

### Database drivers
import psycopg2
from neo4j import GraphDatabase

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

## Pipeline start

In [2]:
def find_tables_or_ingest_raw():
    if os.path.exists('dags/tables') and len(os.listdir('dags/tables')) == 8: # directory + 7 tables
        print('Tables exist...')
        author = pd.read_csv('dags/tables/author.csv')
        authorship = pd.read_csv('dags/tables/authorship.csv')
        article = pd.read_csv('dags/tables/article.csv')
        article_category = pd.read_csv('dags/tables/article_category.csv')
        category = pd.read_csv('dags/tables/category.csv')
        journal = pd.read_csv('dags/tables/journal.csv')
        print('Tables are in the working directory!')

    ## If tables do not exist, pull from kaggle (or local machine), preprocess to tables
    else: 
        print('Preparing tables...')
        print()
        ingest_and_prepare()
        print('Tables are in the working directory!')

In [3]:
find_tables_or_ingest_raw()

Tables exist...
Tables are in the working directory!
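
The existence check above relies on the number of entries in `dags/tables`. As a minimal sketch of an alternative, the files can be checked by name, so that a stray file in the directory does not change the result. The helper `missing_tables` and the constant `EXPECTED_TABLES` are hypothetical names introduced here; the table names are taken from the `pd.read_csv` calls above.

```python
from pathlib import Path

# Hypothetical constant: names mirror the CSV files read in the cell above.
EXPECTED_TABLES = ("author", "authorship", "article",
                   "article_category", "category", "journal")

def missing_tables(directory, expected=EXPECTED_TABLES):
    """Return the expected table names that have no '<name>.csv' in `directory`."""
    base = Path(directory)
    return [name for name in expected if not (base / f"{name}.csv").is_file()]
```

With this helper, the condition could read `if not missing_tables('dags/tables'):`, and the returned list doubles as a message about which tables still need to be prepared.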


# 2. Loading Clean Data (or Running Additional Data Augmentation)

In [4]:
def check_or_augment():
    """Function to either check if clean tables exist
    or clean the data and write them to .csv
    """
    article = article_ready()
    journal = journal_ready()

    # Remove not found journals from articles
    article = article[article['journal_issn'].isin(journal['journal_issn'])].reset_index(drop = True)
    # Update 'article.csv' in 'data_ready' directory
    article.to_csv('dags/data_ready/article.csv', index = False)

    authorship = authorship_ready(article)
    author = author_ready(article, authorship)
    article_category = article_category_ready(article)
    category = category_ready(article_category)

In [5]:
check_or_augment()

NameError: name 'article_ready' is not defined

### Author update and augments
In order to query the 'gender' of a given author, we first extract all valid first names (length > 3). We acknowledge that some first names are shorter than four characters, but because the number of queries is limited, we use this conservative filter so that the available queries are spent on names that are likely to be real first names rather than initials.
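
A minimal sketch of the first-name filter described above, under stated assumptions: names are given as "First Last" strings, and the standard-library `unicodedata` module stands in for `unidecode`. The function names `to_ascii` and `valid_first_names` are hypothetical and not part of the project scripts.

```python
import unicodedata

def to_ascii(text):
    """Strip diacritics (a standard-library stand-in for unidecode)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def valid_first_names(full_names, min_length=4):
    """Return unique first names with at least `min_length` characters."""
    seen = set()
    result = []
    for full_name in full_names:
        parts = full_name.split()
        if not parts:
            continue
        first = to_ascii(parts[0]).strip(".")
        # Keep only letters and hyphens, which drops initials such as 'J.R'
        is_name = all(ch.isalpha() or ch == "-" for ch in first)
        if len(first) >= min_length and is_name and first not in seen:
            seen.add(first)
            result.append(first)
    return result

# 'J. Smith', 'Li Wei' and 'José García' are made-up illustration names
names = ["Dmitri Rozgonjuk", "Lisanne Siniväli", "Cheng-Han Chung",
         "J. Smith", "Li Wei", "José García"]
print(valid_first_names(names))
```

Deduplicating before querying matters here because each distinct first name only needs one gender lookup.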

### Journal
In order to get the journal information, we need the journal ISSN list from the `article` table. Although the journal Impact Factor is a more common metric, it is trademarked and, hence, cannot be retrieved from open sources. The alternative is SNIP (source-normalized impact per publication): the average number of citations per publication, corrected for differences in citation practice between research domains. Fortunately, the list of journals and their SNIP is available from the CWTS website (https://www.journalindicators.com/).
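
The ISSN-to-SNIP matching can be sketched as a dictionary lookup after normalising the ISSN format. This is an illustration only: the function names `normalise_issn` and `attach_snip` are hypothetical, and the ISSNs and SNIP values in the example are made up rather than taken from the CWTS list.

```python
def normalise_issn(issn):
    """Return an ISSN as 'NNNN-NNNC' (upper-case, hyphenated), or None if malformed."""
    compact = issn.replace("-", "").strip().upper()
    last_ok = len(compact) == 8 and (compact[7].isdigit() or compact[7] == "X")
    if not last_ok or not compact[:7].isdigit():
        return None
    return f"{compact[:4]}-{compact[4:]}"

def attach_snip(article_issns, snip_by_issn):
    """Split ISSNs into those with a SNIP value and those not found."""
    found, not_found = {}, []
    for raw in article_issns:
        issn = normalise_issn(raw)
        if issn is not None and issn in snip_by_issn:
            found[issn] = snip_by_issn[issn]
        else:
            not_found.append(raw)
    return found, not_found

# Made-up ISSNs and SNIP values, for illustration only
snip = {"1234-5678": 1.5, "0000-001X": 0.8}
found, not_found = attach_snip(["12345678", "0000-001x", "9999-9999"], snip)
print(found, not_found)
```

Articles whose ISSN ends up in `not_found` correspond to the rows that `check_or_augment()` removes with the `isin` filter.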

# 3. From Pandas to PostgreSQL

In [None]:
# Insert into tables (helper function)
def insert_to_tables(cur, table, query):
    ''' Helper function for inserting values into PostgreSQL tables
    Args:
        cur (psycopg2 cursor): open cursor on the target database
        table (pd.DataFrame): pandas table
        query (SQL query): corresponding SQL query for 'table' for data insertion in DB
    '''
    print(f'Inserting table -- {table.name} -- ...')
    
    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except Exception as e:
        # Report the cause instead of silently swallowing it
        print(f'Error with table -- {table.name} -- : {e}')
    print()

In [None]:
def pandas_to_dwh():
    # Import the data
    try:
        article = pd.read_csv('dags/data_ready/article.csv')
        author = pd.read_csv('dags/data_ready/author.csv')
        authorship = pd.read_csv('dags/data_ready/authorship.csv')
        category = pd.read_csv('dags/data_ready/category.csv')
        article_category = pd.read_csv('dags/data_ready/article_category.csv')
        journal = pd.read_csv('dags/data_ready/journal.csv')
        tables = [article, author, authorship, category, article_category, journal]

        # Name of tables (for later print)
        article.name = 'article'
        author.name = 'author'
        authorship.name = 'authorship'
        category.name = 'category'
        article_category.name = 'article_category'
        journal.name = 'journal'
        print(article.head(2))
        print(author.head(2))
        print(authorship.head(2))
        print(category.head(2))
        print(article_category.head(2))
        print(journal.head(2))
        print('All tables loaded from CSV.')
    except:
        print('Error with importing the data tables')
       
    # Connect to the database
    conn = psycopg2.connect(host="postgres", user="airflow", password="airflow", database ="airflow", port = 5432)
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # create research_db database with UTF8 encoding
    cur.execute("DROP DATABASE IF EXISTS research_db")
    cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

    # Drop Tables 
    try: 
        for query in drop_tables:
            cur.execute(query)
            conn.commit()
        print('All tables dropped.')
    except:
        print('Error with dropping tables.')
        
    # Create Tables
    try: 
        for query in create_tables:
            cur.execute(query)
            conn.commit()
        print('All tables created.')
    except:
        print('Error with creating tables.')

    # Insert into tables
    for i in tqdm(range(len(tables))):
        insert_to_tables(cur, tables[i], insert_tables[i])
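
The row-by-row insertion above can also be expressed as a single batched call. The sketch below uses the standard-library `sqlite3` module as a stand-in for the PostgreSQL connection so that it runs without a database server; the helper name `insert_rows`, the table layout, and the example rows are assumptions made for illustration.

```python
import sqlite3

def insert_rows(cur, table_name, query, rows):
    """Insert `rows` with one parameterised statement; return the number inserted."""
    rows = [tuple(row) for row in rows]
    print(f'Inserting table -- {table_name} -- ...')
    cur.executemany(query, rows)
    print(f'Table -- {table_name} -- successfully inserted!')
    return len(rows)

conn = sqlite3.connect(':memory:')   # stand-in for the PostgreSQL connection
cur = conn.cursor()
# Hypothetical two-column layout for the category table
cur.execute('CREATE TABLE category (category_id TEXT PRIMARY KEY, name TEXT)')
n = insert_rows(cur, 'category',
                'INSERT INTO category VALUES (?, ?)',   # psycopg2 uses %s placeholders
                [('cs.AI', 'Artificial Intelligence'), ('cs.DB', 'Databases')])
conn.commit()
print(n, 'rows inserted')
```

For a DataFrame, `table.itertuples(index=False, name=None)` yields plain tuples that can be passed as `rows`.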

# Data Warehouse Connection

## Load the extension for running SQL magic functions

In [6]:
%reload_ext sql
%sql postgresql://airflow:airflow@postgres/airflow

# Test Queries

In [8]:
%sql SELECT COUNT(*) FROM author;

 * postgresql://airflow:***@postgres/airflow
1 rows affected.


count
56202


In [None]:
%sql SELECT * FROM article_category LIMIT 10;

In [None]:
%sql SELECT * FROM article LIMIT 10;

In [None]:
%sql SELECT * FROM category LIMIT 10;

In [None]:
%sql SELECT * FROM journal LIMIT 10;

In [None]:
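# `authorship` is only loaded inside functions above, so read it here
authorship = pd.read_csv('dags/data_ready/authorship.csv')
author = 'WangX'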
# Get the articles
papers = authorship[authorship['author_id'] == author]['article_id'].values

# Get all authors
co_authors = authorship[authorship['article_id'].isin(papers)]

# N pubs with unique co-authors
npubs_coauthors = co_authors[co_authors['author_id'] != author].groupby(['author_id']).size()

# n Cites with unique co-authors
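
The co-author count computed in the cell above can be sketched without pandas, which makes the logic easy to check on a toy example. The function name `coauthor_counts` and the article ids are hypothetical; the input is assumed to be (article_id, author_id) pairs as in the `authorship` table.

```python
from collections import Counter

def coauthor_counts(authorship, author):
    """Count shared publications per co-author of `author`.

    `authorship` is an iterable of (article_id, author_id) pairs.
    """
    pairs = list(authorship)
    # Articles written by the focal author
    papers = {article for article, who in pairs if who == author}
    # Everyone else on those articles, counted once per shared article
    return Counter(who for article, who in pairs
                   if article in papers and who != author)

# Toy pairs with made-up article ids
pairs = [("a1", "WangX"), ("a1", "LiuY"),
         ("a2", "WangX"), ("a2", "LiuY"), ("a2", "ZhangJ"),
         ("a3", "ZhangJ")]
print(coauthor_counts(pairs, "WangX"))
```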


# 4. Preparing Graph DB Data

- about network analysis with these data in Neo4J: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>

Tutorial: https://www.youtube.com/watch?v=PfySvVqHAWo&t=33s

In [None]:
conn_neo = Neo4jConnection(uri='bolt://neo:7687', user='', pwd='')

In [None]:
# Delete all nodes (DETACH also removes their relationships)
# conn_neo.query('MATCH (a) DETACH DELETE a')

### Add constraints to ID variables

In [8]:
def pandas_to_neo():
    # Import the data
    try:
        article = pd.read_csv('dags/data_ready/article.csv')
        author = pd.read_csv('dags/data_ready/author.csv')
        authorship = pd.read_csv('dags/data_ready/authorship.csv')
        category = pd.read_csv('dags/data_ready/category.csv')
        article_category = pd.read_csv('dags/data_ready/article_category.csv')
        journal = pd.read_csv('dags/data_ready/journal.csv')
        tables = [article, author, authorship, category, article_category, journal]

        # Name of tables (for later print)
        article.name = 'article'
        author.name = 'author'
        authorship.name = 'authorship'
        category.name = 'category'
        article_category.name = 'article_category'
        journal.name = 'journal'
        print(article.head(2))
        print(author.head(2))
        print(authorship.head(2))
        print(category.head(2))
        print(article_category.head(2))
        print(journal.head(2))
        print('All tables staged for Neo4J.')
    except:
        print('Error with importing the data tables.')

    # Neo4J Connection
    conn_neo = Neo4jConnection(uri='bolt://neo:7687', user='', pwd='')

    # Add ID uniqueness constraint to optimize queries
    conn_neo.query('CREATE CONSTRAINT ON(n:Category) ASSERT n.id IS UNIQUE')
    conn_neo.query('CREATE CONSTRAINT ON(j:Journal) ASSERT j.id IS UNIQUE')
    conn_neo.query('CREATE CONSTRAINT ON(au:Author) ASSERT au.id IS UNIQUE')
    conn_neo.query('CREATE CONSTRAINT ON(ar:Article) ASSERT ar.id IS UNIQUE')
   
    print(f'Inserting pandas to NEO4J...')
    try:
        add_category(conn_neo, category)
        add_journal(conn_neo, journal)
        add_author(conn_neo, author)
        add_article(conn_neo, article) 
        add_article_category(conn_neo, article_category)
        add_authorship(conn_neo, authorship)
        print(f'pandas to Neo4J inserted!')
    except:
        print('Error or entities already exist (check the subsequent info)!')
        print('Below are the counts of entities in the Neo4J database (must be non-null):')
        n_articles = conn_neo.query('MATCH (n:Article) RETURN COUNT(n) AS ct')
        n_authors = conn_neo.query('MATCH (n:Author) RETURN COUNT(n) AS ct')
        n_journals = conn_neo.query('MATCH (n:Journal) RETURN COUNT(n) AS ct')   
        n_categories =  conn_neo.query('MATCH (n:Category) RETURN COUNT(n) AS ct')  
        
        print(f"Number of articles in the NEO4J database: {n_articles[0]['ct']}")
        print(f"Number of authors in the NEO4J database: {n_authors[0]['ct']}")
        print(f"Number of journals in the NEO4J database: {n_journals[0]['ct']}")
        print(f"Number of categories in the NEO4J database: {n_categories[0]['ct']}")
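
The `add_*` helpers are imported from the project scripts and are not shown in this notebook. As a sketch of one common loading pattern, and not necessarily the one used there, rows can be sent in batches to a Cypher query that starts with `UNWIND`. The helper `batches`, the query text in `ADD_CATEGORIES`, and the `category_id` field are assumptions.

```python
def batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError('batch_size must be positive')
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Hypothetical Cypher for one batch; MERGE avoids duplicates when the load is rerun.
ADD_CATEGORIES = '''
UNWIND $rows AS row
MERGE (c:Category {id: row.category_id})
'''

rows = [{'category_id': f'cat{i}'} for i in range(5)]   # made-up ids
for batch in batches(rows, batch_size=2):
    # With a live connection, each batch would be passed as the `rows`
    # parameter of ADD_CATEGORIES; here we only show the batch sizes.
    print(len(batch), 'rows in batch')
```

Because `MERGE` matches existing nodes before creating new ones, rerunning such a load would not trip the uniqueness constraints created above.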

[]

In [None]:
result = conn_neo.query('MATCH (n:Article) RETURN COUNT(n) AS ct')
print(result[0]['ct'])

In [None]:
result = conn_neo.query('MATCH (n:Author) RETURN COUNT(n) AS ct')
print(result[0]['ct'])

# 5. Example Queries

## 5.1. Data Warehouse

### Who are the top 0.01% scientists with the most publications in the sample?
Outcome: list of 0.01% top scientists, count of publications, ranking in terms of the total sample.

In [18]:
%%sql query_one <<
SELECT author_id, rank_total_pubs as rank, total_pubs as publications
FROM author 
ORDER BY rank_total_pubs 
LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;

 * postgresql://airflow:***@postgres/airflow
6 rows affected.
Returning data to local variable query_one


In [19]:
query_one

author_id,rank,publications
WangY,1,174
WangX,2,167
LiuY,3,159
ZhangJ,4,156
ZhangY,5,133
WangZ,6,107
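
The `LIMIT` expression in the query above evaluates to a fraction. The arithmetic below uses the author count reported earlier (56202) and assumes the fractional limit is rounded to the nearest integer, which is consistent with the six rows returned.

```python
from decimal import Decimal, ROUND_HALF_UP

n_authors = 56202                                  # COUNT(*) FROM author, reported above
raw_limit = Decimal('0.01') * n_authors / 100      # 0.01% of all authors
# Assumption: the fractional value is rounded to the nearest whole number
limit = int(raw_limit.to_integral_value(rounding=ROUND_HALF_UP))
print(raw_limit, '->', limit)
```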


### Proportionally, in which journals have the top 0.01% of scientists (in terms of publication count) published their work the most?

In [24]:
%%sql query_two <<
SELECT final.author_id, final.rank, final.publications, final.journal_title as top_journal,  TO_CHAR((final.number * 100 / final.publications), 'fm99%') as percentage_of_all_publications
FROM (select a.author_id, rank, publications, mode() within group (order by j.journal_title) AS journal_title, COUNT(j.journal_title) as number
      from (SELECT author_id, rank_total_pubs as rank, total_pubs as publications
      FROM author 
      ORDER BY rank_total_pubs 
      LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
      INNER JOIN authorship au ON a.author_id = au.author_id
      INNER JOIN article ar ON au.article_id = ar.article_id
      INNER JOIN journal j ON ar.journal_issn = j.journal_issn
      group by a.author_id, rank, publications,j.journal_title
      having j.journal_title = mode() within group (order by j.journal_title)) as final
LEFT JOIN (select a.author_id, rank, publications, mode() within group (order by j.journal_title) AS journal_title, COUNT(j.journal_title) as number
      from (SELECT author_id, rank_total_pubs as rank, total_pubs as publications
      FROM author 
      ORDER BY rank_total_pubs 
      LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
      INNER JOIN authorship au ON a.author_id = au.author_id
      INNER JOIN article ar ON au.article_id = ar.article_id
      INNER JOIN journal j ON ar.journal_issn = j.journal_issn
      group by a.author_id, rank, publications,j.journal_title
      having j.journal_title = mode() within group (order by j.journal_title)) as final1 ON 
    final.author_id = final1.author_id AND final.number < final1.number
WHERE final1.author_id IS NULL
ORDER BY final.rank 
LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;

 * postgresql://airflow:***@postgres/airflow
6 rows affected.
Returning data to local variable query_two


In [25]:
query_two

author_id,rank,publications,top_journal,percentage_of_all_publications
WangY,1,174,IEEE Transactions on Image Processing,7%
WangX,2,167,IEEE Transactions on Signal Processing,8%
LiuY,3,159,Lecture Notes in Computer Science,7%
LiuY,3,159,IEEE Transactions on Image Processing,7%
LiuY,3,159,IEEE Transactions on Signal Processing,7%
ZhangJ,4,156,IEEE Transactions on Wireless Communications,7%


### What was the most productive year (N publications) for top 0.01% scientists?

In [26]:
%%sql query_three <<

SELECT final.author_id, final.rank, final.publications, final.most_productive_year as most_productive_year, final.number as count_of_pub
FROM (SELECT a.author_id, rank, publications, mode() within group (order by ar.year) AS most_productive_year, sum(publications) as number
    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications
    FROM author 
    ORDER BY rank_total_pubs 
    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
    INNER JOIN authorship au ON a.author_id = au.author_id
    INNER JOIN article ar ON au.article_id = ar.article_id
    GROUP BY a.author_id, rank, publications, ar.year
    having ar.year = mode() within group (order by ar.year)) as final
LEFT JOIN (SELECT a.author_id, rank, publications, mode() within group (order by ar.year) AS most_productive_year, sum(publications) as number 
    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications
    FROM author 
    ORDER BY rank_total_pubs 
    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
    INNER JOIN authorship au ON a.author_id = au.author_id
    INNER JOIN article ar ON au.article_id = ar.article_id
    GROUP BY a.author_id, rank, publications, ar.year
    having ar.year = mode() within group (order by ar.year)) as final1 ON 
    final.author_id = final1.author_id AND final.number < final1.number
WHERE final1.author_id IS NULL
ORDER BY final.rank 
LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;

 * postgresql://airflow:***@postgres/airflow
6 rows affected.
Returning data to local variable query_three


In [27]:
query_three

author_id,rank,publications,most_productive_year,count_of_pub
WangY,1,174,2021,7482
WangX,2,167,2020,4509
LiuY,3,159,2022,5247
LiuY,3,159,2020,5247
LiuY,3,159,2021,5247
ZhangJ,4,156,2022,5460
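
The `count_of_pub` values are much larger than each author's total number of publications. One plausible reading, inferred from the numbers rather than stated in the notebook, is that `sum(publications)` adds the author's total once per joined article row, so `count_of_pub` equals the total multiplied by the number of articles in that year. The check below divides the reported values to recover that per-year count.

```python
# (author_id, total publications, count_of_pub) copied from the output above
rows = [('WangY', 174, 7482), ('WangX', 167, 4509),
        ('LiuY', 159, 5247), ('ZhangJ', 156, 5460)]

per_year = {}
for author_id, total_pubs, count_of_pub in rows:
    quotient, remainder = divmod(count_of_pub, total_pubs)
    assert remainder == 0        # every value is an exact multiple of the total
    per_year[author_id] = quotient

print(per_year)
```

Under this reading, the three rows for LiuY reflect a tie between 2020, 2021 and 2022, and replacing `sum(publications)` with `COUNT(*)` would report the per-year count directly.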


### What was the most influential (in terms of N citations / N publications) year for the top 0.01% scientists?
Outcome: list of top 0.01% scientists, most influential year, count of publications for that year, average N of citations per publication.

In [28]:
%%sql query_four <<

SELECT final.author_id, final.rank, final.hindex, final.pub, final.avg_cites, final.year
FROM (SELECT a.author_id, rank, sum(hindex::DECIMAL) as hindex, sum(publications::DECIMAL) as pub, sum(avg_cites::DECIMAL) as avg_cites, ar.year
    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications, hindex, avg_cites
    FROM author 
    ORDER BY rank_total_pubs 
    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
    INNER JOIN authorship au ON a.author_id = au.author_id
    INNER JOIN article ar ON au.article_id = ar.article_id
    GROUP BY a.author_id, rank, ar.year) as final
LEFT JOIN (SELECT a.author_id, rank, sum(hindex::DECIMAL) as hindex, sum(publications::DECIMAL) as pub, sum(avg_cites::DECIMAL) as avg_cites, ar.year 
    FROM (SELECT author_id, rank_total_pubs as rank, total_pubs as publications, hindex, avg_cites
    FROM author 
    ORDER BY rank_total_pubs 
    LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100) AS a
    INNER JOIN authorship au ON a.author_id = au.author_id
    INNER JOIN article ar ON au.article_id = ar.article_id
    GROUP BY a.author_id, rank, ar.year) as final1 ON 
    final.author_id = final1.author_id AND final.hindex < final1.hindex
WHERE final1.author_id IS NULL
ORDER BY final.rank 
LIMIT  0.01 * (SELECT COUNT(*) FROM author) / 100;

 * postgresql://airflow:***@postgres/airflow
6 rows affected.
Returning data to local variable query_four


In [29]:
query_four

author_id,rank,hindex,pub,avg_cites,year
WangY,1,1290,7482,1143.456,2021
WangX,2,1053,4509,1124.955,2020
LiuY,3,1089,5247,1154.373,2022
LiuY,3,1089,5247,1154.373,2020
LiuY,3,1089,5247,1154.373,2021
ZhangJ,4,1295,5460,1446.9,2022
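
The `hindex`, `pub` and `avg_cites` columns are sums over the joined article rows, so they scale with the number of articles in the selected year. Assuming that number equals `pub` divided by the author's total publications (an inference from the outputs above, not something the notebook states), dividing the sums recovers per-author values.

```python
from decimal import Decimal

# author_id: (total publications, hindex sum, pub sum, avg_cites sum), from the outputs above
rows = {
    'WangY':  (174, 1290, 7482, Decimal('1143.456')),
    'WangX':  (167, 1053, 4509, Decimal('1124.955')),
    'LiuY':   (159, 1089, 5247, Decimal('1154.373')),
    'ZhangJ': (156, 1295, 5460, Decimal('1446.9')),
}

recovered = {}
for author_id, (total_pubs, hindex_sum, pub_sum, cites_sum) in rows.items():
    n_rows = pub_sum // total_pubs          # inferred articles in the selected year
    recovered[author_id] = (n_rows, hindex_sum // n_rows, cites_sum / n_rows)

for author_id, values in recovered.items():
    print(author_id, values)
```

If this reading is right, the self-join on `final.hindex < final1.hindex` selects the year with the most articles rather than the year with the highest citations per publication, which would explain why the years match those of the previous query.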


## 5.2. Graph Database

In [None]:

result = conn_neo.query(
"""
MATCH (author:Author)-[:AUTHORED]->(article:Article) 
WHERE author.id = "GousiosG" 
WITH author, COUNT(article) AS number_of_articles, collect(article) AS articles
ORDER BY number_of_articles DESC 
UNWIND articles AS article
MATCH (coauthor:Author)-[:AUTHORED]->(article)
RETURN article, collect(coauthor), COUNT(article), COUNT(coauthor)
"""
)
for r in result:
    print(r)

In [None]:
// Ego-network WITH the author
MATCH (author:Author)-[:AUTHORED]->(article:Article) 
WHERE author.id = "GousiosG" // add specific name
WITH author, COUNT(article) AS number_of_articles, collect(article) AS articles
ORDER BY number_of_articles DESC 
LIMIT 1
UNWIND articles AS article
MATCH (coauthor:Author)-[:AUTHORED]->(article)
RETURN article, collect(coauthor), COUNT(article)

In [None]:
// Ego-network WITHOUT the author
// https://stackoverflow.com/questions/28816222/finding-a-list-of-neo4j-nodes-which-have-the-most-relationships-back-to-another
MATCH (author:Author)-[:AUTHORED]->(article:Article) 
WITH author, COUNT(article) AS number_of_articles, collect(article) AS articles
ORDER BY number_of_articles DESC 
LIMIT 1
UNWIND articles AS article
MATCH (coauthor:Author)-[:AUTHORED]->(article)
WHERE coauthor <> author
RETURN article, collect(coauthor)
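
To make the difference between the two ego-network variants concrete, the same idea can be sketched in plain Python on (article_id, author_id) pairs. Unlike the second Cypher query, which picks the most productive author, this sketch takes the focal author as an argument. The function name `ego_network`, the article ids, and the co-author ids `AuthorB` and `AuthorC` are hypothetical.

```python
from collections import defaultdict

def ego_network(authorship, author, include_ego=False):
    """Map each article of `author` to the sorted authors on that article.

    `authorship` is an iterable of (article_id, author_id) pairs.
    """
    by_article = defaultdict(set)
    for article_id, author_id in authorship:
        by_article[article_id].add(author_id)
    return {
        article_id: sorted(a for a in authors if include_ego or a != author)
        for article_id, authors in by_article.items()
        if author in authors
    }

# Toy pairs with made-up article and co-author ids
pairs = [("a1", "GousiosG"), ("a1", "AuthorB"),
         ("a2", "GousiosG"), ("a2", "AuthorB"), ("a2", "AuthorC"),
         ("a3", "AuthorC")]
print(ego_network(pairs, "GousiosG"))
print(ego_network(pairs, "GousiosG", include_ego=True))
```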

## Total Pipeline Runtime

In [None]:
# NB! assumes `start_pipe = time.time()` was recorded at the pipeline start
end_pipe = time.time()

print(f'Time of pipeline end: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60:.2f} min.')
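
As a sketch of an alternative that does not depend on a start variable defined in an earlier cell, the timing can be wrapped in a context manager. The name `timed` is hypothetical, and the `time.sleep` call is only a placeholder for the pipeline steps.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print the wall-clock minutes spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if a step raises, so the runtime is always reported
        elapsed_min = (time.perf_counter() - start) / 60
        print(f'{label} runtime: {elapsed_min:.2f} min.')

with timed('Total pipeline'):
    time.sleep(0.01)   # placeholder for the pipeline steps
```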