1. **Text Cleaning**\
Remove characters that carry no useful signal, such as HTML tags, punctuation, and other special characters.
Convert the text to lowercase so that matching is case-insensitive.
Collapse repeated whitespace and strip leading and trailing spaces.
2. **Tokenization**\
Split the text into individual words or tokens, the units on which the following steps (stopword removal, lemmatization, vectorization) operate.
3. **Stopword Removal**\
Remove common words, known as stopwords (e.g., "the", "is", "and"), which usually contribute little to the classification task.
You can use a predefined stopword list from a library such as NLTK or spaCy, or build a custom list for your specific domain.
4. **Lemmatization or Stemming**\
Reduce words to their base or root form to normalize the text and reduce vocabulary size.
Lemmatization maps each word to its dictionary form (lemma) using a vocabulary and linguistic rules.
Stemming strips affixes with simple heuristic rules, which is faster but can produce stems that are not real words.
5. **Handling Abbreviations and Acronyms**\
Decide whether to expand or keep abbreviations and acronyms as they are, based on their relevance to the classification task.
6. **Handling Numeric Data**\
Decide whether to replace numbers with a generic token or keep them as-is based on their importance in the text.
7. **Handling Rare Words or Outliers**\
Remove words that occur in very few documents, as they rarely contribute to the classification task.
Likewise, remove outliers or unusual tokens that are not relevant to the task.
8. **Vectorization**\
Convert the pre-processed text into numerical representations that machine learning models can consume.
Common techniques include TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings such as Word2Vec or GloVe.
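
Steps 1, 5 and 6 do not depend on any NLP library, so they can be sketched with the standard library alone. The snippet below is a minimal illustration under stated assumptions, not the code used in the cells that follow: the function name `clean_text`, the `NUM` placeholder and the `ABBREVIATIONS` table are all made up for the example.

```python
import html
import re

# Hypothetical abbreviation table; a real one would be built for the target domain
ABBREVIATIONS = {"soc": "soil organic carbon", "ma": "meta-analysis"}


def clean_text(text: str, expand_abbreviations: bool = True) -> str:
    """Strip HTML, lowercase, expand abbreviations, mask numbers, drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)  # Drop HTML tags (enough for simple markup)
    text = html.unescape(text)  # Decode entities such as &amp;
    text = text.lower()
    if expand_abbreviations:
        # Match whole words only, so "soc" inside "social" is left alone
        for short, full in ABBREVIATIONS.items():
            text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    # Replace numbers with a generic token (upper case, so it cannot
    # collide with a real word after lowercasing)
    text = re.sub(r"\d+(?:\.\d+)?", " NUM ", text)
    text = re.sub(r"[^\w\s]", " ", text)  # Replace punctuation with a space
    return re.sub(r"\s+", " ", text).strip()  # Collapse whitespace


sample = "<p>SOC stocks rose by 12.5% in 2019 &amp; 2020 (MA of 30 studies).</p>"
print(clean_text(sample))
# soil organic carbon stocks rose by NUM in NUM NUM meta analysis of NUM studies
```

Unlike the `preprocessor()` function further down, this sketch replaces punctuation with a space instead of deleting it, so hyphenated terms are split into two tokens rather than merged into one. Which behaviour is preferable depends on the task.
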

In [67]:
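import re
import numpy as np
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
# Tokenizer data needed by nltk.word_tokenize (recent NLTK releases use "punkt_tab" instead)
nltk.download("punkt")

# The corpus module is imported under an alias so that the 'stopwords' set
# below does not shadow it, which would make this cell fail when re-run
stopwords = set(nltk_stopwords.words("english"))
lemmatizer = WordNetLemmatizer()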

[nltk_data] Downloading package stopwords to /home/er/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/er/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/er/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [68]:
# ------- LOADING DATA INTO A DATAFRAME -------


# Loading database created by D. Beillouin et al.
xlsx_file = pd.ExcelFile(
    "/home/er/Documents/Cirad/SOCSciCompiler/data/trainset/trainset.xlsx"
)

# Adding label "excluded" or "included" for each MA
df_incl = xlsx_file.parse("retained_meta-analyses")
df_excl = xlsx_file.parse("non_retained_meta-analyses")
df_incl["Screening"] = "included"
df_excl["Screening"] = "excluded"

# Keeping only useful attributes
attributes_to_keep_incl = [
    "Screening",
    "link",
    "Article Title",
    "Abstract",
    "Keywords",
]
attributes_to_keep_excl = [
    "Screening",
    "lien pour accès",
    "title",
    "Abstract",
    "Keywords",
]
df_incl = df_incl[attributes_to_keep_incl]
df_excl = df_excl[attributes_to_keep_excl]

# Standardising column names
new_column_names_incl = {"Article Title": "Title", "link": "DOI"}
new_column_names_excl = {"title": "Title", "lien pour accès": "DOI"}
df_incl = df_incl.rename(columns=new_column_names_incl)
df_excl = df_excl.rename(columns=new_column_names_excl)

# Merging excluded and included MA into a single dataframe
raw_data = pd.concat([df_incl, df_excl], ignore_index=True)
raw_data = raw_data.fillna("")

size_1 = len(df_incl)
size_2 = len(df_excl)
size_3 = size_1 + size_2
print(
    f"Raw database contains {size_3} entries ({size_1} included MA and {size_2} excluded MA), stored into 'raw_data' variable."
)

Raw database contains 1007 entries (217 included MA and 790 excluded MA), stored into 'raw_data' variable.


In [69]:
# ------- CLEANING -------


# Function to get DOIs from URLs
def extract_doi(url):
    if str(url).startswith("https://doi.org/"):
        return str(url)[len("https://doi.org/") :]
    else:
        return None


# Extracting DOIs from URLs
raw_data["DOI"] = raw_data["DOI"].apply(extract_doi)

# Removing rows without a DOI
raw_data = raw_data.dropna(subset=["DOI"])
size_4 = len(raw_data)
size_5 = size_3 - size_4
print(f"{size_5} rows removed because of empty DOIs. Cannot check the uniqueness.")

# Removing rows with an empty title
raw_data["Title"] = raw_data["Title"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Title"])
size_6 = len(raw_data)
size_7 = size_4 - size_6
print(
    f"{size_7} rows removed because of empty titles. Cannot be processed by the ML model."
)

# Removing rows with an empty abstract
raw_data["Abstract"] = raw_data["Abstract"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Abstract"])
size_8 = len(raw_data)
size_9 = size_6 - size_8
print(
    f"{size_9} rows removed because of empty abstracts. Cannot be processed by the ML model."
)

# Removing duplicate DOIs, then duplicate titles
raw_data = raw_data.drop_duplicates(subset=["DOI"], keep="first")
raw_data = raw_data.drop_duplicates(subset="Title", keep="first")
size_10 = len(raw_data)
size_11 = size_8 - size_10
print(f"{size_11} DOI duplicates and title duplicates removed.")

# Dropping column 'DOI' now that rows are unique; it is not needed by the ML model
train_set = raw_data.drop(columns=["DOI"])

size_12 = train_set["Screening"].value_counts()
size_incl = size_12.loc["included"]
size_excl = size_12.loc["excluded"]
print(
    f"Cleaned database contains {size_10} entries ({size_incl} included MA and {size_excl} excluded MA), stored into 'train_set' variable."
)

151 rows removed because of empty DOIs. Cannot check the uniqueness.
0 rows removed because of empty titles. Cannot be processed by the ML model.
54 rows removed because of empty abstracts. Cannot be processed by the ML model.
8 DOI duplicates and title duplicates removed.
Cleaned database contains 794 entries (212 included MA and 582 excluded MA), stored into 'train_set' variable.
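
`extract_doi()` above only accepts links that start with exactly `https://doi.org/`, so a row whose link uses another form of DOI URL is counted as having an empty DOI. Whether that actually happens in this dataset cannot be told from the output above. The variant below is a sketch that also accepts `http://`, the older `dx.doi.org` host and upper-case characters. The name `extract_doi_tolerant` and the sample links are made up for illustration.

```python
import re
from typing import Optional

# Matches DOI resolver links such as https://doi.org/10.x/... or http://dx.doi.org/10.x/...
DOI_URL = re.compile(r"^https?://(?:dx\.)?doi\.org/(10\.\S+)$", re.IGNORECASE)


def extract_doi_tolerant(url) -> Optional[str]:
    """Return the DOI in lowercase, or None if `url` is not a DOI link."""
    match = DOI_URL.match(str(url).strip())
    # DOIs are case-insensitive, so lowercasing makes duplicates comparable
    return match.group(1).lower() if match else None


print(extract_doi_tolerant("https://doi.org/10.1000/ABC.123"))  # 10.1000/abc.123
print(extract_doi_tolerant("http://dx.doi.org/10.1000/abc.123"))  # 10.1000/abc.123
print(extract_doi_tolerant("https://example.org/paper"))  # None
```

It could be swapped in with `raw_data["DOI"].apply(extract_doi_tolerant)`, in which case the row counts printed above may change.
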


In [70]:
# ------- PRE-PROCESSING -------


# PRE-PROCESSING TITLES, ABSTRACTS AND KEYWORDS


# Function to pre-process values of columns Title, Abstract and Keywords.
# It overwrites train_set[column] with one list of lemmas per row.
def preprocessor(column: str):
    token_col = []
    for val in train_set[column]:
        # Remove special characters/punctuation. They are deleted, not replaced
        # by a space, so hyphenated terms merge into a single token
        # (e.g. "cost-effectiveness" becomes "costeffectiveness")
        val = re.sub(r"[^\w\s]", "", val)
        val = val.lower()  # Convert to lowercase
        val = val.strip()  # Remove leading/trailing spaces
        tokens = nltk.word_tokenize(val)  # Tokenization
        filtered_tokens = [
            token for token in tokens if token not in stopwords
        ]  # Stopword Removal
        lemmas = [
            lemmatizer.lemmatize(token) for token in filtered_tokens
        ]  # Lemmatization
        token_col.append(lemmas)
    train_set[column] = token_col


# Applying 'preprocessor()' to each column
preprocessor("Title")
preprocessor("Abstract")
preprocessor("Keywords")

# Shuffling and re-indexing
train_set = train_set.sample(frac=1)
train_set = train_set.reset_index(drop=True)

print("Training set stored into 'train_set' variable and ready to be used.")
print("Preview of the first 20 rows:")
train_set.head(20)

Training set stored into 'train_set' variable and ready to be used.
Preview of the first 20 rows:


| | Screening | Title | Abstract | Keywords |
|---|---|---|---|---|
| 0 | included | [global, pattern, dynamic, soil, carbon, nitro... | [afforestation, proposed, effective, method, c... | [afforestation, carbonnitrogen, interaction, d... |
| 1 | included | [effect, different, fertilization, mode, soil,... | [evidence, shown, fertilizer, application, cou... | [paddy, field, fertilization, soil, organic, c... |
| 2 | excluded | [efficacy, tilmanocept, sentinel, lymph, mode,... | [sentinel, lymph, node, sln, mapping, common, ... | [] |
| 3 | included | [experimental, observational, study, find, con... | [manipulative, experiment, observation, along,... | [agriculturemethods, carbonanalysis, climate, ... |
| 4 | excluded | [atmospheric, co2, soil, extracellular, enzyme... | [rising, atmospheric, co2, concentration, alte... | [] |
| 5 | included | [effect, straw, retention, crop, yield, soil, ... | [crop, straw, retention, field, csrf, technolo... | [metaanalysizs, straw, retention, crop, yield,... |
| 6 | excluded | [costeffectiveness, analysis, bezlotoxumab, ad... | [introductionclostridium, difficile, infection... | [bezlotoxumab, clostridium, difficile, infecti... |
| 7 | excluded | [review, allometric, equation, major, land, co... | [review, biomass, study, conducted, 11, southe... | [allometry, wood, density, carbon, land, cover... |
| 8 | included | [grazing, improves, c, n, cycling, northern, g... | [grazing, potentially, alters, grassland, ecos... | [agriculture, animal, carbon, cycle, conservat... |
| 9 | included | [enhanced, top, soil, carbon, stock, organic, ... | [suggested, conversion, organic, farming, cont... | [climate, change, soil, quality, agricultural,... |
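
Step 7 (rare words) is not applied in the cells above. Since each cell of `train_set` now holds a list of tokens, one possible approach is to count in how many documents each token appears and drop the tokens below a threshold. The sketch below uses plain Python on a toy list of token lists. `drop_rare_tokens` and `min_docs` are hypothetical names, and the threshold value is arbitrary.

```python
from collections import Counter


def drop_rare_tokens(docs, min_docs=2):
    """Keep only tokens that appear in at least `min_docs` documents."""
    # Document frequency: each token is counted once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[token for token in doc if doc_freq[token] >= min_docs] for doc in docs]


docs = [
    ["soil", "carbon", "stock"],
    ["soil", "carbon", "tilmanocept"],
    ["carbon", "grazing", "soil", "soil"],
]
print(drop_rare_tokens(docs))
# [['soil', 'carbon'], ['soil', 'carbon'], ['carbon', 'soil', 'soil']]
```

On the notebook's data this could be applied per column, for example `train_set["Title"] = drop_rare_tokens(list(train_set["Title"]))`.
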


In [71]:
# import pandas as pd
# from sklearn.feature_extraction.text import TfidfVectorizer

# # Draft, kept disabled. Assumes a DataFrame called 'df' whose 'Title', 'Abstract'
# # and 'Keywords' columns hold lists of tokens, as produced by 'preprocessor()'

# # Steps 7 and 8: Handling Rare Words and Vectorization
# # 'min_df' drops tokens found in less than this fraction of the documents,
# # which replaces a manual rare-word filter.
# threshold = 0.01  # Adjust the threshold as per your requirement

# vectorized_dfs = []
# for column in ["Title", "Abstract", "Keywords"]:
#     # One vectorizer per column, so that each keeps its own vocabulary.
#     # The identity analyzer lets TfidfVectorizer accept pre-tokenized input
#     vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens, min_df=threshold)
#     vectorized = vectorizer.fit_transform(df[column])
#     # 'get_feature_names()' was removed in scikit-learn 1.2
#     feature_names = vectorizer.get_feature_names_out()
#     vectorized_dfs.append(
#         pd.DataFrame(
#             vectorized.toarray(),
#             # Prefix with the column name to avoid clashes between columns
#             columns=[f"{column}_{name}" for name in feature_names],
#             index=df.index,
#         )
#     )

# # Concatenate the vectorized DataFrames with the original DataFrame
# df = pd.concat([df, *vectorized_dfs], axis=1)

# # Display the updated DataFrame
# print(df)
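
To make the TF-IDF weighting of step 8 concrete, here is a small plain-Python illustration on token lists. It uses the textbook definition, tf = count / document length and idf = log(N / df). Library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing and normalisation by default, so their numbers differ. The function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {
                token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
                for token, count in counts.items()
            }
        )
    return weights


docs = [["soil", "carbon"], ["soil", "grazing"], ["soil", "carbon", "carbon", "straw"]]
for doc_weights in tf_idf(docs):
    print({token: round(weight, 3) for token, weight in doc_weights.items()})
# {'soil': 0.0, 'carbon': 0.203}
# {'soil': 0.0, 'grazing': 0.549}
# {'soil': 0.0, 'carbon': 0.203, 'straw': 0.275}
```

A token that appears in every document, like "soil" here, gets a weight of zero under this definition, while a token concentrated in few documents is weighted up.
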