1. **Text Cleaning**\
Remove any unnecessary characters, such as special characters, punctuation, or HTML tags.
Convert the text to lowercase to achieve case-insensitive matching.
Remove any extra whitespace or leading/trailing spaces.
2. **Tokenization**\
Split the text into individual words or tokens, so that each one can be filtered, normalized, and counted separately in the following steps.
3. **Stopword Removal**\
Remove common words, known as stopwords (e.g., "the," "is," "and"), which may not contribute much to the classification task.
You can use pre-defined stopword lists from libraries like NLTK or spaCy or create a custom list based on your specific domain.
4. **Lemmatization or Stemming**\
Reduce words to their base or root form to normalize the text and reduce vocabulary size.
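Lemmatization maps each word to its dictionary form (lemma) using a vocabulary and linguistic rules (e.g., "studies" becomes "study").
Stemming strips suffixes using simple heuristic rules; it is faster, but it can produce non-words (e.g., "studies" becomes "studi" with the Porter stemmer).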
5. **Handling Abbreviations and Acronyms**\
Decide whether to expand or keep abbreviations and acronyms as they are, based on their relevance to the classification task.
6. **Handling Numeric Data**\
Decide whether to replace numbers with a generic token or keep them as-is based on their importance in the text.
7. **Handling Rare Words or Outliers**\
Remove words that occur in very few documents, as they rarely contribute to the classification task and inflate the vocabulary.
Similarly, remove any outliers or unusual tokens (e.g., misspellings) that are not relevant to the task.
8. **Vectorization**\
Convert the pre-processed text data into numerical representations that machine learning models can understand.
Techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe can be employed for vectorization.
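
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in this notebook: the function names `preprocess` and `tfidf`, the tiny `STOPWORDS` set, the `<num>` placeholder, and the toy corpus are all assumptions made for the example. Lemmatization (step 4) and abbreviation handling (step 5) are left out, and the TF-IDF weight uses the textbook form `tf * log(N / df)`, whereas libraries usually apply smoothing.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text):
    """Apply steps 1-3 and 6 to one document and return its tokens."""
    text = re.sub(r"<[^>]+>", " ", text)  # step 1: strip HTML tags
    text = text.lower()                   # step 1: lowercase
    text = re.sub(r"[^\w\s]", " ", text)  # step 1: punctuation -> space
    tokens = text.split()                 # step 2: whitespace tokenization
    # Step 6: replace integer tokens with a generic placeholder.
    tokens = ["<num>" if tok.isdigit() else tok for tok in tokens]
    # Step 3: drop stopwords.
    return [tok for tok in tokens if tok not in STOPWORDS]


def tfidf(docs, min_df=1):
    """Steps 7-8: drop rare tokens, then weight the rest with TF-IDF."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    vocab = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        matrix.append(
            [counts[tok] / length * math.log(n_docs / doc_freq[tok]) for tok in vocab]
        )
    return vocab, matrix


# Invented toy corpus, for illustration only.
corpus = [
    "<p>Soil carbon and the Climate!</p>",
    "Soil carbon stocks in 2020",
    "Crop yield and soil",
]
tokenized = [preprocess(doc) for doc in corpus]
vocab, matrix = tfidf(tokenized, min_df=2)
print(tokenized[1])  # ['soil', 'carbon', 'stocks', '<num>']
print(vocab)         # ['carbon', 'soil']
```

Note that "soil" appears in every toy document, so its TF-IDF weight is zero everywhere: a term shared by all documents carries no information for telling them apart.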

In [11]:
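import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

# The corpus reader is imported under an alias so that the name `stopwords`
# can hold the set of English stopwords and the cell can be re-run safely
stopwords = set(nltk_stopwords.words("english"))
lemmatizer = WordNetLemmatizer()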

[nltk_data] Downloading package stopwords to /home/er/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/er/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/er/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [43]:
# ------- LOADING DATA INTO A DATAFRAME -------


# Loading database created by D. Beillouin et al.
xlsx_file = pd.ExcelFile(
    "/home/er/Documents/Cirad/SOCSciCompiler/data/trainset/trainset.xlsx"
)

# Adding label "excluded" or "included" for each MA
df_incl = xlsx_file.parse("retained_meta-analyses")
df_excl = xlsx_file.parse("non_retained_meta-analyses")
df_incl["Screening"] = "included"
df_excl["Screening"] = "excluded"

# Keeping only useful attributes
attributes_to_keep_incl = [
    "Screening",
    "link",
    "Article Title",
    "Abstract",
    "Keywords",
]
attributes_to_keep_excl = [
    "Screening",
    "lien pour accès",
    "title",
    "Abstract",
    "Keywords",
]
df_incl = df_incl[attributes_to_keep_incl]
df_excl = df_excl[attributes_to_keep_excl]

# Standardising column names
new_column_names_incl = {"Article Title": "Title", "link": "DOI"}
new_column_names_excl = {"title": "Title", "lien pour accès": "DOI"}
df_incl = df_incl.rename(columns=new_column_names_incl)
df_excl = df_excl.rename(columns=new_column_names_excl)

# Merging excluded and included MA into a single dataframe
raw_data = pd.concat([df_incl, df_excl], ignore_index=True)
raw_data = raw_data.fillna("")

size_1 = len(df_incl)
size_2 = len(df_excl)
size_3 = size_1 + size_2
print(
    f"Raw database contains {size_3} entries ({size_1} included MA and {size_2} excluded MA), stored into 'raw_data' variable."
)

Raw database contains 1007 entries (217 included MA and 790 excluded MA), stored into 'raw_data' variable.
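
The labelling, renaming, and merging pattern used above can be checked on a toy example. This is a hedged sketch: the two small frames and their values are invented stand-ins for the Excel sheets, and the schema check before `pd.concat` is an addition that the cell above does not perform. It is useful because `pd.concat` does not fail on mismatched column names; it silently creates extra columns filled with missing values.

```python
import pandas as pd

# Toy stand-ins for the two sheets; all values are invented for illustration.
incl = pd.DataFrame(
    {
        "Article Title": ["Paper A"],
        "link": ["https://doi.org/10.1000/a"],
        "Abstract": ["Some abstract"],
    }
)
excl = pd.DataFrame(
    {
        "title": ["Paper B", "Paper C"],
        "lien pour accès": ["https://doi.org/10.1000/b", None],
        "Abstract": ["Another abstract", None],
    }
)

# Label each source before merging so the origin of every row is kept.
incl["Screening"] = "included"
excl["Screening"] = "excluded"

# Bring both frames to the same column names.
incl = incl.rename(columns={"Article Title": "Title", "link": "DOI"})
excl = excl.rename(columns={"title": "Title", "lien pour accès": "DOI"})

# Added safety check (not in the original cell): schemas must match.
if set(incl.columns) != set(excl.columns):
    raise ValueError("Column names differ between the two frames")

merged = pd.concat([incl, excl], ignore_index=True).fillna("")
counts = merged["Screening"].value_counts()
print(len(merged), counts["included"], counts["excluded"])
```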


In [44]:
# ------- CLEANING -------


# Function to get DOIs from URLs
def extract_doi(url):
    if str(url).startswith("https://doi.org/"):
        return str(url)[len("https://doi.org/") :]
    else:
        return None


# Extracting DOIs from URLs
raw_data["DOI"] = raw_data["DOI"].apply(extract_doi)

# Removing rows without a usable DOI (empty, or not a "https://doi.org/" URL)
raw_data = raw_data.dropna(subset=["DOI"])
size_4 = len(raw_data)
size_5 = size_3 - size_4
print(f"{size_5} rows removed because of empty DOIs. Cannot check the uniqueness.")

# Removing rows with an empty title
raw_data["Title"] = raw_data["Title"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Title"])
size_6 = len(raw_data)
size_7 = size_4 - size_6
print(
    f"{size_7} rows removed because of empty titles. Cannot be processed by the ML model."
)

# Removing rows with an empty abstract
raw_data["Abstract"] = raw_data["Abstract"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Abstract"])
size_8 = len(raw_data)
size_9 = size_6 - size_8
print(
    f"{size_9} rows removed because of empty abstracts. Cannot be processed by the ML model."
)

# Removing duplicate DOIs, then duplicate titles
raw_data = raw_data.drop_duplicates(subset=["DOI"], keep="first")
raw_data = raw_data.drop_duplicates(subset="Title", keep="first")
size_10 = len(raw_data)
size_11 = size_8 - size_10
print(f"{size_11} DOI duplicates and title duplicates removed.")

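# Dropping the 'DOI' column now that rows are unique; it is not needed by the ML model
raw_data = raw_data.drop(columns=["DOI"])

# Storing the cleaned data under the name announced in the message below
train_set = raw_data.copy()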

size_12 = raw_data["Screening"].value_counts()
size_incl = size_12.loc["included"]
size_excl = size_12.loc["excluded"]
print(
    f"Cleaned database contains {size_10} entries ({size_incl} included MA and {size_excl} excluded MA), stored into 'train_set' variable."
)

151 rows removed because of empty DOIs. Cannot check the uniqueness.
0 rows removed because of empty titles. Cannot be processed by the ML model.
54 rows removed because of empty abstracts. Cannot be processed by the ML model.
8 DOI duplicates and title duplicates removed.
Cleaned database contains 794 entries (212 included MA and 582 excluded MA), stored into 'train_set' variable.
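
`extract_doi` only accepts links that start with `https://doi.org/`, so any other form (for example `http://dx.doi.org/...` or a bare DOI) is counted as an empty DOI. Whether such forms actually occur in this dataset is not known; the sketch below is one possible, more tolerant variant under that assumption. The name `normalize_doi`, the regular expression, and the example DOIs are illustrative. The pattern covers the common modern shape of a DOI (`10.`, a 4 to 9 digit prefix, a slash, a suffix) and may miss unusual ones. Lowercasing is safe for duplicate detection because DOIs are case-insensitive.

```python
import re

# "10." + 4-9 digit registrant prefix + "/" + suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalize_doi(value):
    """Return the lowercase DOI found in `value`, or None if there is none."""
    match = DOI_PATTERN.search(str(value))
    if match is None:
        return None
    # Trailing punctuation usually belongs to the surrounding text, not the DOI.
    return match.group(0).rstrip(".,;").lower()


# Invented examples, for illustration only.
print(normalize_doi("https://doi.org/10.1000/ABC.123"))    # 10.1000/abc.123
print(normalize_doi("http://dx.doi.org/10.1000/abc.123."))  # 10.1000/abc.123
print(normalize_doi("no identifier here"))                  # None
```

Used in place of `extract_doi`, this would keep more rows and would also treat two spellings of the same DOI as duplicates.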


In [42]:
# ------- PRE-PROCESSING -------


# for title in train_set["Title"]

# cleaned_titles = []
# cleaned_abstracts = []
# cleaned_keywords = []
# tmp_keywords = []

# for title, abstract, keywords in zip(
#     train_set["Title"], train_set["Abstract"], train_set["Keywords"]
# ):
#     title = re.sub(r"[^\w\s]", "", title)  # Remove special characters/punctuation
#     title = title.lower()  # Convert to lowercase
#     title = title.strip()  # Remove leading/trailing spaces
#     cleaned_titles.append(title)

#     abstract = re.sub(r"[^\w\s]", "", abstract)
#     abstract = abstract.lower()
#     abstract = abstract.strip()
#     cleaned_abstracts.append(abstract)

#     keywords = re.split(r"[,;\n]", keywords)  # Split keywords using separators (, ; \n)
#     tmp_keywords.append(
#         [keyword.strip() for keyword in keywords if keyword.strip() != ""]
#     )
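
# # Clean the keywords once every row has been split. This loop must run after
# # the loop above, not inside it: otherwise earlier rows are re-processed at
# # each iteration and 'cleaned_keywords' ends up longer than the dataframe.
# for list_kw in tmp_keywords:
#     tmp = []
#     for keyword in list_kw:
#         keyword = re.sub(r"[^\w\s]", "", keyword)
#         keyword = keyword.lower()
#         keyword = keyword.strip()
#         tmp.append(keyword)
#     cleaned_keywords.append(tmp)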

# # Tokenization, Stopword Removal, Lemmatization
# tokenized_titles = []
# tokenized_abstracts = []
# tokenized_keywords = []

# for title in cleaned_titles:
#     tokens = nltk.word_tokenize(title)  # Tokenization
#     filtered_tokens = [
#         token for token in tokens if token not in stopwords
#     ]  # Stopword Removal
#     lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatization
#     tokenized_titles.append(lemmas)

# for abstract in cleaned_abstracts:
#     tokens = nltk.word_tokenize(abstract)
#     filtered_tokens = [token for token in tokens if token not in stopwords]
#     lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]
#     tokenized_abstracts.append(lemmas)

# for keywords in cleaned_keywords:
#     tokenized_keyword_list = []
#     for keyword in keywords:
#         tokens = nltk.word_tokenize(keyword)
#         filtered_tokens = [token for token in tokens if token not in stopwords]
#         lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]
#         tokenized_keyword_list.extend(lemmas)
#     tokenized_keywords.append(tokenized_keyword_list)

# # Update the DataFrame with pre-processed data
# train_set["Tokenized Titles"] = tokenized_titles
# train_set["Tokenized Abstracts"] = tokenized_abstracts
# train_set["Tokenized Keywords"] = tokenized_keywords

# columns_to_keep = [
#     "Screening",
#     "Tokenized Titles",
#     "Tokenized Abstracts",
#     "Tokenized Keywords",
# ]

# train_set = train_set[columns_to_keep]
# train_set.columns = ["Screening", "Title", "Abstract", "Keyword"]

# train_set = train_set.sample(frac=1)
# train_set = train_set.drop(train_set[train_set["Abstract"].apply(len) == 0].index)
# train_set = train_set.reset_index(drop=True)

# empty_list_count = 0

# # Count the rows whose 'Keyword' column holds an empty list
# for keyword_list in train_set["Keyword"]:
#     if keyword_list == []:
#         empty_list_count += 1

# print("Number of empty lists in 'Keyword' column:", empty_list_count)

# raw_data = raw_data.reset_index(drop=True)
# raw_data.info()
# raw_data.head(10)

