# Data Engineering Project 
## ETL

**Authors**: 
- Dmitri Rozgonjuk
- Eerik Sven Puudist
- Lisanne Siniväli
- Cheng-Han Chung


The aim of this script is to clean the main raw data frame and write a new, clean data frame for further use. In this notebook, the comparisons of different read- and write-methods are demonstrated.

First, we install and import all necessary libraries in a single cell (to avoid scattering imports across the cells below). The packages and their versions will later be added to the `requirements.txt` file.

We also use this section to set global environment parameters.

In [None]:
!conda install psycopg2 -y

In [None]:
!pip install -r requirements.txt

In [1]:
## NB!! run the installs from terminal
########### Library Installations ##############

################### Imports ####################
### Data wrangling
import pandas as pd # working with dataframes
import numpy as np # vector operations


### Specific-purpose libraries
# NB! Most must be configured with an API key
#from pybliometrics.scopus import AbstractRetrieval
from habanero import Crossref # CrossRef API
from genderize import Genderize # Gender API

### Misc
import requests
import time # timing the pipeline
import warnings # suppress warnings
import os # accessing directories
from tqdm import tqdm # track loop runtime
from unidecode import unidecode # transliterate international names to ASCII

### Custom Scripts (ETL, SQL)
from scripts.raw_to_tables import *
from scripts.sql_queries import *

import psycopg2 # PostgreSQL driver (used for the database connection in section 3)

########## SETTING ENV PARAMETERS ################
warnings.filterwarnings('ignore') # suppress warnings

## Pipeline start

In [2]:
start_pipe = time.time() # Initialize the time of pipeline

# First check if the tables are already in the system
## If tables exist, import from .csv

if os.path.exists('./tables') and len(os.listdir('./tables')) == 7: # directory + 6 tables
    print('Tables exist...')
    author = pd.read_csv('tables/author.csv')
    authorship = pd.read_csv('tables/authorship.csv')
    article = pd.read_csv('tables/article.csv')
    article_category = pd.read_csv('tables/article_category.csv')
    category = pd.read_csv('tables/category.csv')
    journal = pd.read_csv('tables/journal.csv')
    print('Tables in the working directory!')
    

## If tables do not exist, pull from kaggle (or local machine), preprocess to tables
else: 
    
    start_etl = time.time() # Initialize the time of ETL
    print(f'Time of pipeline start: {time.ctime(start_pipe)}')
    print()
    # Data ingestion
    df = ingest_and_process(force = False)

    # Prepare Pandas dataframes
    authorship, author = authorship_author_extract(df)
    article_category, category = article_category_category_extract(df)
    article = article_extract(df)
    journal = journal_extract()
    
    # Clean the data last time: remove all authors with NaNs or too short names
    ## NaNs
    author = author[~author['author_id'].isnull()]
    nan_authors = authorship[authorship['author_id'].isnull()]['article_id'].values
    article = article.loc[~article['article_id'].isin(nan_authors)]
    authorship = authorship.loc[~authorship['article_id'].isin(nan_authors)]

    ## Too short (< 4) names
    author = author[~(author['author_id'].str.len() < 4)].reset_index(drop = True)
    short_authors = authorship[(authorship['author_id'].str.len() < 4)]['article_id'].values
    article = article.loc[~article['article_id'].isin(short_authors)].reset_index(drop = True)
    authorship = authorship.loc[~authorship['article_id'].isin(short_authors)].reset_index(drop = True)
    
    ## Write .csv-s to 'tables' directory
    ### Create the 'tables' directory
    os.makedirs('tables', exist_ok = True)
    
    ### Write the tables as csv
    authorship.to_csv('tables/authorship.csv', index = False)
    article_category.to_csv('tables/article_category.csv', index = False)
    category.to_csv('tables/category.csv', index = False)
    journal.to_csv('tables/journal.csv', index = False)
    article.to_csv('tables/article.csv', index = False)
    author.to_csv('tables/author.csv', index = False)

    end_etl = time.time() # Endtime of ETL
    print(f'ETL Runtime: {round(end_etl - start_etl, 6)} sec.')

Tables exist...
Tables in the working directory!
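
The final cleaning step in the ETL branch drops every article that has at least one author with a missing or too-short identifier. The following is a minimal sketch of that idea on made-up toy tables; it combines the two filters into a single mask, which is a simplification of the cell above rather than the exact code used in the pipeline.

```python
import pandas as pd

# Toy stand-ins for the real tables (all values are made up for illustration)
authorship = pd.DataFrame({
    'article_id': ['a1', 'a1', 'a2', 'a3'],
    'author_id': ['SmithJ', None, 'LiX', 'DoeJane'],
})
article = pd.DataFrame({'article_id': ['a1', 'a2', 'a3']})

# Flag authorship rows whose author id is missing or shorter than 4 characters
bad_author = authorship['author_id'].isnull() | (authorship['author_id'].str.len() < 4)

# An article is dropped if ANY of its authors is flagged
bad_articles = authorship.loc[bad_author, 'article_id'].unique()

article_clean = article[~article['article_id'].isin(bad_articles)].reset_index(drop=True)
authorship_clean = authorship[~authorship['article_id'].isin(bad_articles)].reset_index(drop=True)

print(article_clean['article_id'].tolist())
```

Filtering by article (not only by author row) keeps `article` and `authorship` consistent: no article is left with a partial author list.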


# 2. Data Augmentation

In [None]:
# Tables:
## authorship
## journal <-- augment all data (use ISSN from DOI)
## article <-- augment with number of citations
## author <-- augment with gender and affiliation

### Article
In this section, we use the `requests` library to fetch citation data from the Crossref URL of the work's DOI. We have found this method to be faster than querying the Crossref API. We extract the work type and the number of citations that the work has received; additionally, the journal ISSN of the publication is retrieved if it is available.

We initially also wanted to fetch author affiliations, but this is not feasible because most of that information is missing.
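
To make the extraction step concrete, here is a small sketch of how the three fields could be pulled out of a Crossref-style JSON response. The helper name `parse_crossref_work` and the sample payload are hypothetical; only the keys `status`, `message`, `type`, `is-referenced-by-count` and `ISSN` are taken from the code in this notebook. The sketch works on a hand-written dictionary, so it runs without network access.

```python
def parse_crossref_work(payload):
    """Return (work_type, n_cites, issn) from a Crossref-style response dict."""
    if payload.get('status') != 'ok':
        return None, None, None
    msg = payload.get('message', {})
    issn_list = msg.get('ISSN') or []  # the ISSN key may be absent
    return (
        msg.get('type'),
        msg.get('is-referenced-by-count'),
        issn_list[0] if issn_list else None,
    )

# Hand-written payloads mimicking the fields used in this notebook (values are made up)
sample = {
    'status': 'ok',
    'message': {
        'type': 'journal-article',
        'is-referenced-by-count': 12,
        'ISSN': ['1234-5678', '8765-4321'],
    },
}
no_issn = {'status': 'ok', 'message': {'type': 'book-chapter', 'is-referenced-by-count': 0}}

result = parse_crossref_work(sample)
result_no_issn = parse_crossref_work(no_issn)
print(result, result_no_issn)
```

Separating the parsing from the HTTP request makes the missing-ISSN case explicit and easy to check in isolation.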

In [3]:
# article = pd.read_csv('tables/article.csv') # import if necessary
article.head()


| Unnamed: 0 | article_id | title | doi | n_authors | journal_issn | type | n_cites | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 704.0046 | A limit relation for entropy and channel capac... | 10.1063/1.2779138 | 3 | | | | 2009 |
| 1 | 704.0062 | On-line Viterbi Algorithm and Its Relationship... | 10.1007/978-3-540-74126-8_23 | 3 | | | | 2010 |
| 2 | 704.0098 | Sparsely-spread CDMA - a statistical mechanics... | 10.1088/1751-8113/40/41/004 | 2 | | | | 2009 |
| 3 | 704.0217 | Capacity of a Multiple-Antenna Fading Channel ... | 10.1109/TIT.2008.2011437 | 2 | | | | 2010 |
| 4 | 704.0301 | Differential Recursion and Differentially Alge... | 10.1145/1507244.1507252 | 1 | | | | 2009 |


In [10]:
# Candidate for citation updating!!

# Use the base url
base_url = 'http://api.crossref.org/works/'

def fetch_article_augments(start_range, end_range):
    for i in range(start_range, end_range): # don't use tqdm if range specified like that...
        try:
            # Query only rows whose work type is still missing (NaN)
            if pd.isnull(article.loc[i, 'type']):
                url = base_url + article.loc[i, 'doi'] # append the DOI to the base URL
                rqst = requests.get(url) # request by URL
                qr_result = rqst.json() # parse the JSON response

                if qr_result['status'] == 'ok': # if request successful, update fields
                    msg = qr_result['message']
                    article.loc[i, 'type'] = msg['type'] # add work type
                    article.loc[i, 'n_cites'] = msg['is-referenced-by-count'] # add citation count

                    try:
                        article.loc[i, 'journal_issn'] = msg['ISSN'][0] # add journal ISSN
                    except (KeyError, IndexError):
                        pass # ISSN not available for this work
        except Exception:
            continue # skip rows with a missing DOI or a failed request
    # Overwrite the csv
    article.to_csv('tables/article.csv', index = False)

In [None]:
print(time.ctime())
start_article = time.time()
start_range = 15001
end_range = 20000

fetch_article_augments(start_range, end_range)

end_article = time.time()
(end_article - start_article)/60 # runtime in minutes

# 5k records in appx 30min

Fri Dec 23 15:39:37 2022


In [None]:
article.iloc[19999]

### Journal
To get the journal information, we need the list of journal ISSNs from the `article` table. Although the journal Impact Factor is the more common metric, it is trademarked and hence cannot be retrieved from open sources. The alternative is SNIP (source-normalized impact per publication): the average number of citations per publication, corrected for differences in citation practice between research domains. Fortunately, the list of journals and their SNIP values is available from the CWTS website (https://www.journalindicators.com/).

In [33]:
cwts_data = pd.read_excel('https://www.journalindicators.com/Content/CWTS%20Journal%20Indicators%20April%202022.xlsx',
                         sheet_name = 'Sources')

# Filter out the unnecessary stuff
# ---

# save as a table to augmentations dir

# fetch from augmentations dir

# list unique ISSNs for the journal table

# extract the journal names and SNIPs from CWTS data

# save the journal table

34.44174875418345
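
The steps outlined in the comments of the cell above could be sketched roughly as follows. This is only an illustration on made-up data: the column names `Source title`, `Print ISSN`, `Year` and `SNIP` are assumptions about the CWTS sheet and should be checked against the actual file, and the output column names `journal_title` and `snip` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the CWTS 'Sources' sheet (column names are assumptions)
cwts = pd.DataFrame({
    'Source title': ['Journal A', 'Journal A', 'Journal B'],
    'Print ISSN': ['1111-1111', '1111-1111', '2222-2222'],
    'Year': [2020, 2021, 2021],
    'SNIP': [1.10, 1.25, 0.80],
})
# Toy stand-in for the article table
article = pd.DataFrame({'journal_issn': ['1111-1111', '3333-3333', '1111-1111', None]})

# Keep only the most recent year per ISSN and rename to the pipeline's naming
latest = (cwts.sort_values('Year')
              .drop_duplicates('Print ISSN', keep='last')
              .rename(columns={'Print ISSN': 'journal_issn',
                               'Source title': 'journal_title',
                               'SNIP': 'snip'}))

# Unique ISSNs seen in the article table, left-joined to the indicators
journal = (pd.DataFrame({'journal_issn': article['journal_issn'].dropna().unique()})
             .merge(latest[['journal_issn', 'journal_title', 'snip']],
                    on='journal_issn', how='left'))
print(journal)
```

A left join keeps ISSNs that are not covered by the indicator list; their SNIP simply stays missing.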

### Author: gender
To query the 'gender' of a given author, we first extract all valid (length > 3) first names. We acknowledge that some first names are shorter than four characters, but because the number of queries is limited, we opt for a more robust filter that still extracts as many names as possible.

To that end, we query the names via the Genderize.io API, which allows querying 1500 names per day. We extract the names and probabilities and update our own data table with them. Finally, we join the tables on first name to add the gender column.
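
Because of the daily quota, it helps to query only names that have not been resolved yet. A minimal sketch of that selection step, assuming a hypothetical helper `next_batch` and made-up names (the default limit of 1500 is the figure quoted above):

```python
def next_batch(names, checked, daily_limit=1500):
    """Return the next names to query, skipping those already checked."""
    pending = [n for n in names if n not in checked]
    return pending[:daily_limit]

names = ['Anna', 'Bela', 'Carl', 'Dora', 'Emil']
checked = {'Anna', 'Carl'}  # names resolved on an earlier day

batch = next_batch(names, checked, daily_limit=2)
print(batch)
```

Running this once per day against the saved results would resume where the previous run stopped, without relying on a hard-coded start index.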

In [None]:
# import the author table
author = pd.read_csv('tables/author.csv') 
author.head()

In [None]:
# Extract unique valid first names and create a temporary df with firstname and gender
names_genders = pd.DataFrame(np.sort(author[author['first_name'].str.len() > 3]['first_name'].unique()))
names_genders.columns = ['first_name']
names_genders['alph_value'] = names_genders['first_name'].str.extract('([A-Z]+)') # add a column with first letter
names_genders = names_genders.loc[(names_genders['alph_value'].str.len() < 2)].reset_index(drop = True) # remove rows where there's more than one letter
names_genders = pd.concat([names_genders,pd.DataFrame(columns=['gender', 'prob'])])
names_genders

In [None]:
names_genders_file = pd.read_csv('names_genders.csv')
names_genders_file['first_name'] = names_genders_file['first_name'].apply(unidecode)
names_genders_file = names_genders_file.drop_duplicates(['first_name', 'gender'])
names_genders_file
names_genders.head()

In [None]:
def search_gender_from_data(external_dataset, name_var, gender_var, prob_var = None):
    
    # Search for names from the UCI name data set
    for i in tqdm(range(len(names_genders))):

        if names_genders.loc[i, 'prob'] >= 0 and names_genders.loc[i, 'prob'] <= 1:
                pass
        else:
            # Extract the name and letter
            firstname = names_genders.loc[i, 'first_name']

            # Search in the external dataset
            idx = external_dataset[external_dataset[name_var] == firstname].index

            # If no index found, no name -> do nothing (augment later with API)
            if len(idx) == 0:
                continue

            idx = idx.values[0] # use the first match
            gender = external_dataset.loc[idx, gender_var]

            # If there is no gender data, do nothing
            if pd.isnull(gender):
                continue

            names_genders.loc[i, 'gender'] = gender # set the gender
            # Use the stored probability if prob_var is provided, otherwise set prob to 1
            names_genders.loc[i, 'prob'] = 1 if prob_var is None else external_dataset.loc[idx, prob_var]

In [None]:
search_gender_from_data(names_genders_file, 'first_name', 'gender', 'prob')
names_genders.to_csv('names_genders.csv', index = False)

In [None]:
# UCI dataset
uci = pd.read_csv('uci_name_gender_dataset.csv')[['Name', 'Gender']].sort_values('Name').reset_index(drop = True)
uci = uci[~(uci['Name'].str.len() < 3)].reset_index(drop = True) # remove too short names
uci['alph_value'] = uci['Name'].str.extract('([A-Z]+)') # create a column for partialling
uci = uci.loc[(uci['alph_value'].str.len() < 2)] # remove rows where there's more than one letter
uci['Name'] = uci['Name'].apply(unidecode)
uci['Name'] = uci['Name'].str.replace('[^a-zA-Z0-9]', '', regex=True).str.strip()
uci.head()

In [None]:
search_gender_from_data(uci, 'Name', 'Gender')
names_genders.to_csv('names_genders.csv', index = False)
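
As an alternative to the row-by-row loop, the lookup against a reference dataset can be expressed as a single join. The sketch below uses made-up data; the column names `Name` and `Gender` follow the UCI cell above, and setting `prob` to 1 for matched names mirrors the loop. It is an illustration of the idea, not the code used to produce the results in this notebook.

```python
import pandas as pd

# Toy stand-ins (all values are made up)
names_genders = pd.DataFrame({'first_name': ['Anna', 'Bela', 'Carl']})
reference = pd.DataFrame({'Name': ['Anna', 'Carl', 'Carl', 'Zoe'],
                          'Gender': ['F', 'M', 'M', 'F']})

# One reference row per name, renamed to match the names table
ref_unique = (reference.drop_duplicates('Name')
                       .rename(columns={'Name': 'first_name', 'Gender': 'gender'}))

# Left join: every name is kept, unmatched names get a missing gender
looked_up = names_genders.merge(ref_unique, on='first_name', how='left')

# Matched names get prob 1; the rest stay NaN and can be sent to the API later
looked_up['prob'] = pd.Series(1.0, index=looked_up.index).where(looked_up['gender'].notna())
print(looked_up)
```

On large name lists a join like this avoids scanning the reference table once per name.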

In [None]:
def update_names_table(start_n, names_genders):
    
    # For loop querying the genderize.io API
    for i in tqdm(range(start_n, len(names_genders), 1)):
        # Extract the name
        first_name = names_genders.loc[i, 'first_name'] # first name
        # Check if the name has already been checked
        ## Query only if the name hasn't been checked already
        if names_genders.loc[i, 'prob'] >= 0 and names_genders.loc[i, 'prob'] <= 1:
            pass
        else:
            try: 
                gender_info = Genderize().get([first_name])
                names_genders.loc[i, 'gender'] = gender_info[0]['gender']
                names_genders.loc[i,'prob'] = gender_info[0]['probability']
            except:
                print(f'Iteration nr {i}')
                print('Limit likely exceeded.')
                break
            finally:
                # Checkpoint to csv after every query attempt (also runs when the loop breaks)
                names_genders.to_csv('names_genders.csv', index = False)

In [None]:
update_names_table(3106, names_genders)

In [None]:
print(f"Number of names not yet checked: {names_genders['gender'].isnull().sum()}")

#### Merge author-names-genders

In [None]:
# Import gender table
names_genders = pd.read_csv('names_genders.csv')
# Exclude the names that were not found
found_names = names_genders[names_genders['prob']>0].copy() # copy to avoid SettingWithCopyWarning
# Gender values to 'M' and 'F'
found_names['gender'] = found_names['gender'].replace(to_replace=['male','female'], value=['M', 'F'])

# Right join: authors without a matched first name are dropped
author = author.merge(found_names[['first_name', 'gender']], on = ['first_name'], how = 'right')
author

# 3. From Pandas to PostgreSQL

In [None]:
# Import the data from Pandas
authorship = pd.read_csv('tables/authorship.csv')
article_category = pd.read_csv('tables/article_category.csv')
category = pd.read_csv('tables/category.csv')
article = pd.read_csv('tables/article.csv')
author = pd.read_csv('tables/author.csv')
journal = pd.read_csv('tables/journal.csv')

tables = [authorship, article_category, category, article, author, journal]

# Name of tables (for later print)
authorship.name = 'authorship'
article_category.name = 'article_category'
category.name = 'category'
article.name = 'article'
author.name = 'author'
journal.name = 'journal'

In [None]:
journal

# Database Connection

In [None]:
# Connect to the database
conn = psycopg2.connect(host="postgres", user="postgres", password="password", database="postgres")
conn.set_session(autocommit=True)
cur = conn.cursor()

# create research_db database with UTF8 encoding
cur.execute("DROP DATABASE IF EXISTS research_db")
cur.execute("CREATE DATABASE research_db WITH ENCODING 'utf8' TEMPLATE template0")

## Enable the SQL magic functions

In [None]:
%load_ext sql
%sql postgresql://postgres:password@postgres/postgres

# Drop Tables

In [None]:
# Drop Tables 
for query in drop_tables:
    cur.execute(query)
    conn.commit()

In [None]:
# Check that a table, e.g., 'journal', is not in the database
%sql SELECT * FROM journal

# Create Tables

In [None]:
for query in create_tables:
        cur.execute(query)
        conn.commit()

In [None]:
# Check that the tables (e.g., 'journal') are created
## Should be empty
%sql SELECT * FROM journal

# Insert into Tables

In [None]:
def insert_to_tables(table, query):
    ''' Helper function for inserting values into PostgreSQL tables
    Args:
        table (pd.DataFrame): pandas table
        query (SQL query): corresponding SQL insert query for 'table'
    '''
    
    print(f'Inserting table -- {table.name} -- ...')
    
    try:
        for i, row in table.iterrows():
            cur.execute(query, list(row))
        print(f'Table -- {table.name} -- successfully inserted!')
    except Exception as e:
        print(f'Error with table -- {table.name} --: {e}')
    print()
        
for  i in range(len(tables)):
    insert_to_tables(tables[i], insert_tables[i])
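
Inserting row by row with `cur.execute` can be slow for large tables. The Python DB-API also defines `executemany`, which takes the whole list of rows at once. The sketch below demonstrates the pattern with the standard-library `sqlite3` module as a stand-in for PostgreSQL, so that it runs without a database server; the table layout and rows are made up. Note that `sqlite3` uses `?` placeholders, whereas `psycopg2` uses `%s`.

```python
import sqlite3

rows = [('c1', 'cs.IT'), ('c2', 'math.ST'), ('c3', 'stat.ML')]  # made-up rows

conn = sqlite3.connect(':memory:')  # in-memory stand-in for the PostgreSQL connection
cur = conn.cursor()
cur.execute('CREATE TABLE category (category_id TEXT PRIMARY KEY, category_name TEXT)')

try:
    # One call for all rows; with psycopg2 the placeholders would be %s
    cur.executemany('INSERT INTO category VALUES (?, ?)', rows)
    conn.commit()
except sqlite3.Error as err:
    conn.rollback()  # leave the table unchanged if any row fails
    print(f'Error with table -- category --: {err}')

n_rows = cur.execute('SELECT COUNT(*) FROM category').fetchone()[0]
print(n_rows)
```

Catching the driver's own error type and printing the message makes failed inserts much easier to diagnose than a bare `except`.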

In [None]:
%sql SELECT * FROM author LIMIT 10

# Test Queries

In [None]:
%sql SELECT * FROM authorship LIMIT 10;

In [None]:
%sql SELECT * FROM article_category LIMIT 10;

In [None]:
%sql SELECT * FROM article LIMIT 10;

In [None]:
%sql SELECT * FROM category LIMIT 10;

In [None]:
%sql SELECT * FROM journal LIMIT 10;

# 4. Preparing Graph DB Data
In essence, we need to (a) rename the attributes to be compliant with Neo4J notation, and (b) save the above-created tables to .csv-s: https://medium.com/@st3llasia/analyzing-arxiv-data-using-neo4j-part-1-ccce072a2027

- about network analysis with these data in Neo4J: https://medium.com/swlh/network-analysis-of-arxiv-dataset-to-create-a-search-and-recommendation-engine-of-articles-cd18b36a185e

- link prediction: https://towardsdatascience.com/link-prediction-with-neo4j-part-2-predicting-co-authors-using-scikit-learn-78b42356b44c

The Graph Database Schema is pictured below:
<img src="images/graph_db_schema.png"/>
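
One possible way to do the renaming is sketched below, assuming the header convention of Neo4j's bulk CSV import (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The node labels `Author` and `Article`, the relationship type `AUTHORED`, and the toy rows are assumptions made for illustration; they are not taken from the schema image.

```python
import pandas as pd

# Toy stand-ins for two of the tables (values are made up)
author = pd.DataFrame({'author_id': ['SmithJ'], 'first_name': ['John']})
authorship = pd.DataFrame({'article_id': ['704.0046'], 'author_id': ['SmithJ']})

# Node file: mark the id column and add a label column
author_nodes = author.rename(columns={'author_id': 'author_id:ID(Author)'})
author_nodes[':LABEL'] = 'Author'

# Relationship file: start and end ids plus a relationship type
authorship_rels = authorship.rename(columns={'author_id': ':START_ID(Author)',
                                             'article_id': ':END_ID(Article)'})
authorship_rels[':TYPE'] = 'AUTHORED'

# First line of the CSV that would be written for the relationships
header = authorship_rels.to_csv(index=False).splitlines()[0]
print(header)
```

The renamed frames can then be written with `to_csv(..., index=False)` in the same way as the tables above.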

# 5. Example Queries

## 5.1. Data Warehouse

## 5.2. Graph Database

## Total Pipeline Runtime

In [None]:
end_pipe = time.time()

print(f'Time of pipeline end: {time.ctime(end_pipe)}')
print(f'Total pipeline runtime: {(end_pipe - start_pipe)/60} min.')