# Cleaning data Kennisinstituut

Suggested improvements:
* Retrieve missing DOI links (can be done using [this script](https://github.com/asreview/synergy-dataset/blob/461a0f757439c226acbc6bc320359001ecd26c69/scripts/enrich.py))
* Anonymize the experts and report all inclusions and exclusions

## Data extraction from csv files

In this part we extract data from the CSV files that were exported from Rayyan.

First, load the packages:

In [26]:
# import packages
import os
import pandas as pd
import numpy as np # for np.nan
import re # for searching in notes column

Get all csv files from the folder

In [None]:
# List all .csv files in the current directory
csv_files = [file for file in os.listdir('.') if file.endswith('.csv')]

# Display the list of .csv files
print(csv_files)

Check how many columns each file has. The counts are not equal across files, so we then find the column names that all files share; these are the columns we can reliably extract data from.

In [None]:
ncols = [] # number of columns in each file
column_sets = []  # List to store the set of columns from each file

for file_name in csv_files:
    df = pd.read_csv(file_name, sep=';', engine='python')
    ncols.append(df.shape[1]) # Append the number of columns to the list
    
    # Append the set of column names to the list
    column_sets.append(set(df.columns))

print(ncols)

# Find columns present in all files
common_columns = set.intersection(*column_sets)
print("Columns present in all files:", common_columns)

## Load in a dataset
In the final script this will be done in a loop over all files. This document only explains the individual steps, so for now we load a single dataset:

In [29]:
# Read the CSV file with semicolon separator
df = pd.read_csv(csv_files[0], sep=';', engine='python')

### Looking for inclusion status

The inclusion status, that is, the labels provided by the experts, can usually be found in the notes column. However, for some rows in some files this information has ended up in a different column. Therefore, for each row we need to check whether the relevant information is present and, if so, where. We then store it in a new column.

The pattern that we look for goes like this: 
RAYYAN-INCLUSION: {""RATER 1""=>""Excluded"", ""RATER 2""=>""Excluded""}

So we define a function to look for the pattern.

In [30]:
# Function to search all columns in a row for inclusion status
def find_inclusion_status_in_row(row):
    for col in row.index:
        value = row[col]
        # Check for missing values before converting to str:
        # str(NaN) is the string 'nan', which is never null
        if pd.notnull(value):
            match = re.search(r'RAYYAN-INCLUSION:\s*({.*?})', str(value))
            if match:
                return match.group(1)
    return None
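
To make the row-wise search concrete, here is a minimal sketch on a toy DataFrame. The rows, the column names, and the helper name `find_inclusion_status` are hypothetical and only serve as illustration. The status strings use single double-quotes, on the assumption that pandas has already unescaped the doubled quotes of the raw CSV.

```python
import re
import pandas as pd

# Same pattern as above: capture the {...} block after the RAYYAN-INCLUSION marker
INCLUSION_PATTERN = re.compile(r'RAYYAN-INCLUSION:\s*({.*?})')

def find_inclusion_status(row):
    """Return the first {...} inclusion block found in any cell of the row."""
    for value in row:
        if pd.notnull(value):
            match = INCLUSION_PATTERN.search(str(value))
            if match:
                return match.group(1)
    return None

# Hypothetical rows: the status sits in 'notes' in the first row,
# has shifted to 'url' in the second, and is absent in the third
toy = pd.DataFrame({
    'title': ['Paper A', 'Paper B', 'Paper C'],
    'notes': ['RAYYAN-INCLUSION: {"RATER 1"=>"Excluded", "RATER 2"=>"Included"}',
              None,
              'no label here'],
    'url': [None, 'x | RAYYAN-INCLUSION: {"RATER 1"=>"Maybe"}', None],
})

toy_status = toy.apply(find_inclusion_status, axis=1)
print(toy_status.tolist())
```

The first two rows yield their `{...}` block regardless of which column holds it; the third row yields no status.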

This function can be used to create a new `inclusion_status` column as follows.

In [38]:
# Apply the function to each row to find inclusion_status
df['inclusion_status'] = df.apply(find_inclusion_status_in_row, axis=1)

Then we need to extract the inclusion status provided by each expert and map them to a final decision. For that we use the following code:

In [35]:
# Function to map decisions to codes
def map_decision(decision):
    if decision.lower() == "excluded":
        return 0
    elif decision.lower() == "included":
        return 1
    elif decision.lower() == "maybe":
        return 999
    else:
        return None

# Function to extract names and coded decisions from inclusion_status
def extract_decisions(inclusion_status):
    if pd.isnull(inclusion_status):
        return None
    decisions = re.findall(r'"(.*?)"\s*=>\s*"(.*?)"', inclusion_status)
    return {name: map_decision(decision) for name, decision in decisions}

Which can be applied by using:

In [37]:
# Apply the function to the 'inclusion_status' column
df['coded_decisions'] = df['inclusion_status'].apply(extract_decisions)
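
As an illustration of what the parsing step produces, the sketch below applies the same regular expression to a single status string. The rater names are placeholders, and `parse_decisions` and `DECISION_CODES` are hypothetical names that mirror `extract_decisions` and `map_decision` in compact form.

```python
import re

# Mapping mirrors map_decision above; 999 marks an undecided "Maybe"
DECISION_CODES = {"excluded": 0, "included": 1, "maybe": 999}

def parse_decisions(inclusion_status):
    """Turn '{"A"=>"Excluded", ...}' into {'A': 0, ...}; unknown labels map to None."""
    pairs = re.findall(r'"(.*?)"\s*=>\s*"(.*?)"', inclusion_status)
    return {name: DECISION_CODES.get(decision.lower()) for name, decision in pairs}

# Hypothetical status string with placeholder rater names
example = '{"RATER 1"=>"Excluded", "RATER 2"=>"Maybe", "RATER 3"=>"Included"}'
parsed = parse_decisions(example)
print(parsed)  # {'RATER 1': 0, 'RATER 2': 999, 'RATER 3': 1}
```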

Finally, the TI-AB label is created by means of:

In [40]:
# Create the TI-AB column based on the coded_decisions
df['TI-AB'] = df['coded_decisions'].apply(
    lambda decisions: np.nan if decisions is None or decisions == 'None' else (
        0 if decisions and all(decision == 0 for decision in decisions.values()) else 1
    )
)

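That is, a record is excluded (0) only if all experts agree to exclude it. If any expert does not agree, the record is included (1) in the title-abstract screening phase and moves on to the next phase of screening. Records for which no inclusion status was found get a missing value.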
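
The labelling rule can be checked on a few hand-made inputs. This is a sketch that restates the lambda above as a named function (`ti_ab_label` is a hypothetical name); the rater keys are placeholders. Note that, as in the lambda, an empty dictionary of decisions results in 1.

```python
import numpy as np

def ti_ab_label(decisions):
    """0 if every rater excluded, NaN if there are no decisions at all, else 1."""
    if decisions is None:
        return np.nan
    if decisions and all(code == 0 for code in decisions.values()):
        return 0
    return 1

# Hypothetical coded decisions (0 = excluded, 1 = included, 999 = maybe)
cases = [
    {'A': 0, 'B': 0},    # unanimous exclusion
    {'A': 0, 'B': 1},    # disagreement
    {'A': 0, 'B': 999},  # one rater undecided
    None,                # no inclusion status found
]
labels = [ti_ab_label(case) for case in cases]
print(labels)
```

Only the unanimous exclusion is labelled 0; a single "included" or "maybe" is enough to keep the record.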

### Removing irrelevant columns
Not all of the common columns are relevant, so after inspecting their content we drop the columns that we do not need.

Columns from the original data that are removed are:
- authors
- issue
- key
- language
- month
- volume
- publisher
- journal
- issn
- pages
- day
- pmc_id
- location
- year

Or phrased differently, we keep the original columns:

- title
- abstract
- pubmed_id
- url 

and the created column
- TI-AB

### Extracting doi links where possible

The `url` column contains links to the relevant papers. These can be DOI links or other kinds of links. We want to keep only the DOI links and use a function to extract them:

In [47]:
# Function to extract DOI link from a string
def extract_doi(url_string):
    if pd.isnull(url_string):
        return None
    # Regular expression pattern to match DOI URLs
    doi_pattern = r'(https?://(?:dx\.)?doi\.org/[^\s]+)'
    matches = re.findall(doi_pattern, url_string)
    if matches:
        # Return the first DOI link found
        return matches[0]
    else:
        return None

**Explanation:**

- **Regular Expression Breakdown:**
  - `https?://` matches `http://` or `https://`.
  - `(?:dx\.)?` matches `dx.` if present; the `?` makes it optional.
  - `doi\.org/` matches `doi.org/`.
  - `[^\s]+` matches one or more non-whitespace characters (the DOI identifier).
  - The parentheses `()` capture the entire DOI URL.
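
The breakdown above can be tried out on a few strings. This is a small sketch; the URLs and DOIs are made up for illustration and `DOI_URL_PATTERN` is a hypothetical name.

```python
import re

# Same pattern as in extract_doi
DOI_URL_PATTERN = re.compile(r'(https?://(?:dx\.)?doi\.org/[^\s]+)')

# Hypothetical url strings (the DOIs are invented)
samples = [
    'https://doi.org/10.1000/example.123',
    'http://dx.doi.org/10.1000/abc https://www.ncbi.nlm.nih.gov/pubmed/000',
    'https://www.ncbi.nlm.nih.gov/pubmed/000',
]

found = [DOI_URL_PATTERN.findall(s) for s in samples]
for sample, matches in zip(samples, found):
    print(sample, '->', matches)
```

Because `[^\s]+` stops only at whitespace, a separator such as a comma or semicolon directly after the DOI would become part of the match. If the `url` column uses such separators, the result may need trimming.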

Then run the code to get a `doi` column:

In [48]:
# Extract DOI links from the 'url' column
df['doi'] = df['url'].apply(extract_doi)

This means that we can now drop the `url` and `pubmed_id` columns and only include the `doi` column.  

In [53]:
# list the relevant columns
columns = ['title', 'abstract', 'doi', "TI-AB"]
# select the relevant columns
df_selected = df[columns]

## Clean export

Now we export the cleaned `df_selected` DataFrame to a new CSV file.

In [None]:
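# Export the df_selected DataFrame to a CSV file with the modified name.
# Use the file that was loaded above (csv_files[0]); the `file_name` left over
# from the earlier loop would point at the last file in csv_files instead.
file_name = csv_files[0]
output_file_name = file_name.replace('.csv', '_CLEAN.csv')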
# prepend output_file_name with 'TRAM_'
output_file_name = 'TRAM_' + output_file_name
df_selected.to_csv(output_file_name, index=False, sep=";")

# Display the name of the output file
print(f"DataFrame exported to {output_file_name}")
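
The final script is meant to run these steps in a loop over all files. The sketch below shows one way that could look; it is not the actual final script. The names `label_row`, `first_doi`, `clean_dataframe`, and `clean_all` are hypothetical, and the labelling step is condensed so that it goes straight from the status string to the TI-AB label.

```python
import re
import numpy as np
import pandas as pd

STATUS_RE = re.compile(r'RAYYAN-INCLUSION:\s*({.*?})')
PAIR_RE = re.compile(r'"(.*?)"\s*=>\s*"(.*?)"')
DOI_RE = re.compile(r'https?://(?:dx\.)?doi\.org/[^\s]+')

def label_row(row):
    """Return the TI-AB label for one row: 0, 1, or NaN if no status is found."""
    for value in row:
        if pd.notnull(value):
            match = STATUS_RE.search(str(value))
            if match:
                decisions = [d.lower() for _, d in PAIR_RE.findall(match.group(1))]
                # Exclude only when there is at least one decision and all are "excluded"
                return 0 if decisions and all(d == "excluded" for d in decisions) else 1
    return np.nan

def first_doi(url_string):
    """Return the first DOI URL in the string, or None."""
    if pd.isnull(url_string):
        return None
    match = DOI_RE.search(str(url_string))
    return match.group(0) if match else None

def clean_dataframe(df):
    """Add the TI-AB and doi columns, then keep the four output columns."""
    out = df.copy()
    out['TI-AB'] = out.apply(label_row, axis=1)
    out['doi'] = out['url'].apply(first_doi)
    return out[['title', 'abstract', 'doi', 'TI-AB']]

def clean_all(csv_files):
    """Hypothetical driver: clean each file and write TRAM_<name>_CLEAN.csv."""
    for file_name in csv_files:
        df = pd.read_csv(file_name, sep=';', engine='python')
        cleaned = clean_dataframe(df)
        output_file_name = 'TRAM_' + file_name.replace('.csv', '_CLEAN.csv')
        cleaned.to_csv(output_file_name, index=False, sep=';')
```

Keeping the per-DataFrame logic in `clean_dataframe` separate from the file handling in `clean_all` makes it possible to check the cleaning on a small in-memory DataFrame before running it on the exported files.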