# Pre-Processing

This Jupyter Notebook contains the code used to pre-process the search results obtained from each database.  
The following steps are performed:

1. Removal of self duplicates
2. Removal of pairwise duplicates
3. Application of IC1 (filter by year)
4. Export data

## 1. Importing the libraries

In [1]:
import pandas as pd
from pathlib import Path

## 2. Loading the dataset
The CSV for each database was downloaded from its website after running the search string defined in the protocol.  
Each CSV is loaded into a DataFrame.

In [2]:
ROOT_DIR = Path('.').resolve().parent
DATA_DIR = ROOT_DIR / 'data/primary_selection'
RAW_DIR = DATA_DIR / 'raw'

acm = pd.read_csv(RAW_DIR / 'ACM.csv')
ieee = pd.read_csv(RAW_DIR / 'IEEE_mod.csv')
scopus = pd.read_csv(RAW_DIR / 'Scopus.csv')

### 2.1 Preprocessing the DataFrames

Select only the relevant columns from each DataFrame, rename them to a common set of names, and lowercase the paper titles so that later title comparisons are case-insensitive.

In [3]:
def preprocess(df, columns):
    new_columns = ['author', 'title', 'year', 'source']
    
    df = df[columns].copy()
    df.columns = new_columns
    
    df['title'] = df['title'].str.lower()
    
    return df

acm = preprocess(acm, ['author', 'title', 'year', 'booktitle'])
ieee = preprocess(ieee, ['Authors', 'Document Title', 'Publication_Year', 'Publication Title'])
scopus = preprocess(scopus, ['Authors', 'Title', 'Year', 'Source title'])
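
As a minimal sketch of what `preprocess` does, the snippet below runs it on a small made-up DataFrame. The rows, the extra `DOI` column, and the names `toy` and `toy_clean` are illustrative assumptions, not data from the actual exports.

```python
import pandas as pd

def preprocess(df, columns):
    """Keep `columns` (in order), rename them to a shared schema, lowercase titles."""
    new_columns = ['author', 'title', 'year', 'source']
    df = df[columns].copy()          # copy so the input DataFrame is not modified
    df.columns = new_columns
    df['title'] = df['title'].str.lower()
    return df

# Hypothetical export with Scopus-style headers plus one column that is not needed.
toy = pd.DataFrame({
    'Authors': ['Doe, J.', 'Roe, R.'],
    'Title': ['A Study of X', 'Another STUDY'],
    'Year': [2019, 2017],
    'Source title': ['Conf A', 'Journal B'],
    'DOI': ['10.0000/a', '10.0000/b'],
})

toy_clean = preprocess(toy, ['Authors', 'Title', 'Year', 'Source title'])
print(list(toy_clean.columns))
print(toy_clean['title'].tolist())
```

The order of the names passed in `columns` matters, because the new names are assigned by position.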

## 3. Duplicate papers
In this step, duplicates are removed both within each DataFrame and across the different databases.

### 3.1 Removing self duplicates
Uses an auxiliary function that receives the original DataFrame and the database name as parameters.  
The DataFrame's **duplicated** method produces a mask that flags repeated rows based on the first two columns (authors and paper title); the first occurrence of each paper is kept.  
The function returns the sanitized version of the DataFrame, which contains no self duplicates and only the relevant columns.

In [4]:
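def remove_self_duplicates(df, name):
    # After preprocess(), the first two columns are 'author' and 'title'.
    # duplicated() keeps the first occurrence and flags the later repeats.
    mask = df.duplicated(subset=df.columns[:2])
    dups = df[mask]
    sanitized = df[~mask]

    print(f'{name} (self) duplicates: {dups.shape[0]}')

    return sanitized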

acm = remove_self_duplicates(acm, 'ACM')
ieee = remove_self_duplicates(ieee, 'IEEE')
scopus = remove_self_duplicates(scopus, 'Scopus')

ACM (self) duplicates: 5
IEEE (self) duplicates: 1
Scopus (self) duplicates: 3
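
The following sketch illustrates how the mask behaves on made-up rows. The data and the names `toy`, `mask`, and `deduped` are assumptions for illustration; the column list is written out explicitly, which should match `df.columns[:2]` once the DataFrame has been through `preprocess`.

```python
import pandas as pd

# Hypothetical rows: row 1 repeats the author and title of row 0,
# row 2 shares the title but has a different author.
toy = pd.DataFrame({
    'author': ['Doe, J.', 'Doe, J.', 'Roe, R.', 'Doe, J.'],
    'title':  ['a study of x', 'a study of x', 'a study of x', 'another study'],
    'year':   [2019, 2020, 2019, 2019],
    'source': ['Conf A', 'Conf B', 'Conf A', 'Conf A'],
})

# Only 'author' and 'title' are compared; 'year' and 'source' are ignored.
mask = toy.duplicated(subset=['author', 'title'])
deduped = toy[~mask]

print(mask.tolist())
print(len(deduped))
```

Because both columns must match, two papers with the same title but different author strings are not treated as self duplicates.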


### 3.2 Removing pairwise duplicates
Uses an auxiliary function that receives two DataFrames, the name of the column to compare, and the name of the database pair.  
The **isin** method produces a mask of the papers in the first DataFrame whose value in that column also appears in the second.  
The function returns the sanitized version of the first DataFrame, which does not contain the duplicates found in the second.  
It contains only the relevant columns that were selected in the preprocessing step (2.1).

In [5]:
def remove_pairwise_duplicates(df1, df2, column, pair_name):
    # Flag rows of df1 whose value in `column` also appears in df2 (exact match).
    mask = df1[column].isin(df2[column].values)
    dups = df1[mask]
    sanitized = df1[~mask]

    print(f'{pair_name} (pairwise) duplicates: {dups.shape[0]}')

    return sanitized

ieee = remove_pairwise_duplicates(ieee, acm, 'title', 'IEEE on ACM')
scopus = remove_pairwise_duplicates(scopus, acm, 'title', 'Scopus on ACM')
scopus = remove_pairwise_duplicates(scopus, ieee, 'title', 'Scopus on IEEE')

IEEE on ACM (pairwise) duplicates: 17
Scopus on ACM (pairwise) duplicates: 58
Scopus on IEEE (pairwise) duplicates: 306
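
The sketch below shows the `isin` mask on made-up titles and why exact matching can miss near-identical titles. The helper `normalize_title` is hypothetical and is not part of this notebook; applying it would likely change the counts reported above.

```python
import pandas as pd

# Hypothetical titles; the last one differs only by a trailing period.
first = pd.DataFrame({'title': ['a study of x', 'deep learning for y', 'a study of x.']})
second = pd.DataFrame({'title': ['a study of x', 'unrelated paper']})

# Exact comparison, as done in the notebook.
mask = first['title'].isin(second['title'].values)
sanitized = first[~mask]

def normalize_title(titles):
    """Hypothetical helper: lowercase, drop punctuation, collapse whitespace."""
    return (titles.str.lower()
                  .str.replace(r'[^\w\s]', '', regex=True)
                  .str.replace(r'\s+', ' ', regex=True)
                  .str.strip())

# Looser comparison on normalized titles (an assumption, not the notebook's method).
loose_mask = normalize_title(first['title']).isin(normalize_title(second['title']))

print(mask.tolist())
print(loose_mask.tolist())
```

With exact matching, only the first title is flagged; the normalized comparison also flags the variant with the trailing period.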


## 4. Applying IC1 (year filter)
In this step, a mask selects the papers published in 2018 or later (the comparison is inclusive).

In [6]:
def filter_by_year(df, column, name):
    mask = df[column] >= 2018
    filtered = df[mask]
    
    print(f'{name} papers selected by year (IC1): {filtered.shape[0]}')

    return filtered

acm = filter_by_year(acm, 'year', 'ACM')
ieee = filter_by_year(ieee, 'year', 'IEEE')
scopus = filter_by_year(scopus, 'year', 'Scopus')

ACM papers selected by year (IC1): 9
IEEE papers selected by year (IC1): 57
Scopus papers selected by year (IC1): 122
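
The year comparison assumes a numeric `year` column. As a hedged sketch, if an export stored years as text, they could be converted with `pd.to_numeric` before filtering. The rows and the names `toy`, `years`, and `filtered` are made up for illustration, and this conversion is not performed in the notebook.

```python
import pandas as pd

# Hypothetical case: years were read as strings and one value is missing.
toy = pd.DataFrame({
    'title': ['p1', 'p2', 'p3', 'p4'],
    'year': ['2017', '2018', '2021', 'n/a'],
})

# Unparseable values become NaN instead of raising an error.
years = pd.to_numeric(toy['year'], errors='coerce')

# Inclusive lower bound: 2018 itself is kept. NaN compares as False and is dropped.
mask = years >= 2018
filtered = toy[mask]

print(filtered['title'].tolist())
```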


## 5. Saving the DataFrames as CSV files
Save each of the pre-processed DataFrames to a CSV file.

In [7]:
OUTPUT_DIR = DATA_DIR / 'automatic'
OUTPUT_DIR.mkdir(exist_ok=True)

acm.to_csv(OUTPUT_DIR / 'ACM.csv', index=False)
ieee.to_csv(OUTPUT_DIR / 'IEEE.csv', index=False)
scopus.to_csv(OUTPUT_DIR / 'Scopus.csv', index=False)
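
A quick way to check an export is to read the file back and compare its shape and columns. The sketch below does this in a temporary directory with a made-up DataFrame; the names `toy`, `out_dir`, and `restored` are illustrative, and the notebook itself does not run this check.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical pre-processed DataFrame with the shared column names.
toy = pd.DataFrame({
    'author': ['Doe, J.', 'Roe, R.'],
    'title': ['a study of x', 'another study'],
    'year': [2019, 2021],
    'source': ['Conf A', 'Journal B'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'automatic'
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / 'toy.csv'
    toy.to_csv(out_path, index=False)   # index=False avoids an extra unnamed column
    restored = pd.read_csv(out_path)

print(restored.shape == toy.shape)
print(list(restored.columns))
```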