### Text Cleaning
    Remove any unnecessary characters, such as special characters, punctuation, or HTML tags.
    Convert the text to lowercase to achieve case-insensitive matching.
    Remove any extra whitespace or leading/trailing spaces.
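
The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact pipeline: the function name `clean_text` and the sample string are assumptions, and punctuation is replaced by a space (rather than deleted) so that hyphenated words are split instead of fused.

```python
import html
import re


def clean_text(text):
    """Strip HTML tags and punctuation, lowercase, and collapse whitespace."""
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation and special characters
    text = text.lower()                       # case-insensitive matching
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


print(clean_text("  <p>Soil &amp; Carbon:   a META-analysis!</p> "))
# prints: soil carbon a meta analysis
```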

### Tokenization
    Split the text into individual words or tokens, so that the later steps (stopword removal, lemmatization, vectorization) can operate on each word separately.
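
As a rough sketch of what tokenization produces, a regular expression can stand in for a full tokenizer. The helper name `simple_tokenize` is an assumption; library tokenizers such as NLTK's handle contractions and punctuation more carefully than this.

```python
import re


def simple_tokenize(text):
    """Split text into lowercase word tokens (runs of letters, digits, underscores)."""
    return re.findall(r"\w+", text.lower())


print(simple_tokenize("Biochar effects on crop yields"))
# prints: ['biochar', 'effects', 'on', 'crop', 'yields']
```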

### Stopword Removal
    Remove common words, known as stopwords (e.g., "the," "is," "and"), which may not contribute much to the classification task.
    You can use pre-defined stopword lists from libraries like NLTK or spaCy or create a custom list based on your specific domain.
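
The custom-list option can be sketched as follows. Both word sets are hypothetical and far shorter than a real stopword list; in practice the base set would come from NLTK or spaCy and only the domain additions would be hand-written.

```python
# Tiny illustrative base list (a real list has well over a hundred entries)
BASE_STOPWORDS = {"the", "is", "and", "of", "on", "in", "a", "an", "to"}
# Hypothetical domain-specific additions for a corpus of scientific abstracts
DOMAIN_STOPWORDS = {"study", "results"}


def remove_stopwords(tokens, extra=frozenset()):
    """Return tokens that are in neither the base list nor the extra list."""
    blocked = BASE_STOPWORDS | set(extra)
    return [token for token in tokens if token not in blocked]


print(remove_stopwords(["the", "effects", "of", "biochar", "on", "soil"]))
# prints: ['effects', 'biochar', 'soil']
```

Passing `extra=DOMAIN_STOPWORDS` additionally filters the domain words without changing the base list.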

### Lemmatization or Stemming
    Reduce words to their base or root form to normalize the text and reduce vocabulary size.
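    Lemmatization maps each word to its dictionary form (lemma) using a vocabulary and morphological rules, so the result is a real word (e.g., "studies" becomes "study").
    Stemming strips suffixes with simple heuristic rules; it is faster but can produce non-words (e.g., the Porter stemmer turns "studies" into "studi").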
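
To make the difference concrete, here is a deliberately simplified sketch. `toy_stem` is a toy suffix stripper, not the Porter algorithm, and `toy_lemmatize` uses a tiny hypothetical lookup table in place of a real lexicon such as WordNet.

```python
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-lexicon; a real lemmatizer looks words up in a full dictionary
LEMMAS = {"studies": "study", "soils": "soil", "was": "be"}


def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)


print(toy_stem("studies"), toy_lemmatize("studies"))
# prints: stud study
```

The stem "stud" is not the intended word, while the lemma "study" is; this is the usual trade-off between the speed of stemming and the accuracy of lemmatization.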

### Handling Abbreviations and Acronyms
    Decide whether to expand or keep abbreviations and acronyms as they are, based on their relevance to the classification task.
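
If expansion is chosen, a lookup table is one simple way to do it. The mapping below is hypothetical and would need to be curated for the corpus at hand; unknown tokens pass through unchanged.

```python
# Hypothetical domain abbreviations; keys are lowercase
ABBREVIATIONS = {
    "soc": "soil organic carbon",
    "ghg": "greenhouse gas",
}


def expand_abbreviations(tokens, mapping=ABBREVIATIONS):
    """Replace known abbreviations by their expansion, one token per word."""
    expanded = []
    for token in tokens:
        expanded.extend(mapping.get(token.lower(), token).split())
    return expanded


print(expand_abbreviations(["SOC", "stocks", "increased"]))
# prints: ['soil', 'organic', 'carbon', 'stocks', 'increased']
```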

### Handling Numeric Data
    Decide whether to replace numbers with a generic token or keep them as-is based on their importance in the text.
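
One possible way to replace numbers with a generic token is sketched below. The placeholder `_num_` is an assumption, chosen because it survives punctuation stripping and lowercasing. The word boundaries keep digits inside terms such as "co2" or "ch4" untouched, which matters for this kind of corpus.

```python
import re

# Standalone numbers, optionally with decimal or thousands separators
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def replace_numbers(text, token="_num_"):
    """Replace standalone numbers with a placeholder token."""
    return NUMBER.sub(token, text)


print(replace_numbers("co2 emissions rose 12.5 % in 2010"))
# prints: co2 emissions rose _num_ % in _num_
```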

### Handling Rare Words or Outliers
    Remove words that occur in very few documents, as they give the classifier too little evidence to learn from and inflate the vocabulary.
    Similarly, remove outliers such as misspellings or extraction debris that are not relevant to the task.
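
A document-frequency threshold is one common way to implement this. The sketch below assumes each document is already a list of tokens; the function name and the default threshold of 2 are illustrative choices, not values taken from this notebook.

```python
from collections import Counter


def drop_rare_tokens(docs, min_doc_freq=2):
    """Remove tokens that appear in fewer than min_doc_freq documents."""
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in docs for token in set(doc))
    return [[token for token in doc if doc_freq[token] >= min_doc_freq] for doc in docs]


docs = [
    ["soil", "carbon", "biochar"],
    ["soil", "carbon", "fire"],
    ["soil", "agroforestry"],
]
print(drop_rare_tokens(docs))
# prints: [['soil', 'carbon'], ['soil', 'carbon'], ['soil']]
```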

### Vectorization
    Convert the pre-processed text data into numerical representations that machine learning models can understand.
    Techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe can be employed for vectorization.
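
To show what TF-IDF computes, here is a minimal from-scratch sketch using the plain formula tf × log(N / df). It is for illustration only: libraries such as scikit-learn apply smoothing and normalization, so their values will differ. The function name `tfidf` and the two sample documents are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vocab = sorted(doc_freq)
    idf = {token: math.log(n_docs / doc_freq[token]) for token in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[token] / total * idf[token] for token in vocab])
    return vocab, vectors


vocab, vectors = tfidf([["soil", "carbon"], ["soil", "fire"]])
print(vocab)
# prints: ['carbon', 'fire', 'soil']
```

Here "soil" occurs in every document, so its weight is 0 in both vectors, while "carbon" and "fire" each get a positive weight only in the document that contains them.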

In [22]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

[nltk_data] Downloading package stopwords to /home/er/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/er/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/er/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [23]:
# Loading database created by D. Beillouin et al.
xlsx_file = pd.ExcelFile(
    "/home/er/Documents/Cirad/SOCSciCompiler/data/trainset/trainset.xlsx"
)

# Adding label "excluded" or "included" for each MA
df_incl = xlsx_file.parse("retained_meta-analyses")
df_excl = xlsx_file.parse("non_retained_meta-analyses")
df_incl["Screening"] = "included"
df_excl["Screening"] = "excluded"

size_1 = len(df_incl)
size_2 = len(df_excl)
size_3 = size_1 + size_2
print(f"Raw database contains {size_3} entries ({size_1} included MA and {size_2} excluded MA)")

# Keeping only useful attributes
attributes_to_keep_incl = [
    "Screening",
    "link",
    "Article Title",
    "Abstract",
    "Keywords",
]
attributes_to_keep_excl = [
    "Screening",
    "lien pour accès",
    "title",
    "Abstract",
    "Keywords",
]
df_incl = df_incl[attributes_to_keep_incl]
df_excl = df_excl[attributes_to_keep_excl]

# Standardising column names
new_column_names_incl = {"Article Title": "Title", "link": "DOI"}
new_column_names_excl = {"title": "Title", "lien pour accès": "DOI"} 
df_incl = df_incl.rename(columns=new_column_names_incl)
df_excl = df_excl.rename(columns=new_column_names_excl)

# Merging excluded and included MA into a single dataframe
raw_data = pd.concat([df_incl, df_excl], ignore_index=True)
raw_data = raw_data.fillna("")

print("Raw database stored into 'raw_data' variable")

Raw database contains 1007 entries (217 included MA and 790 excluded MA)
Raw database stored into 'raw_data' variable
'raw_data' information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Screening  1007 non-null   object
 1   DOI        1007 non-null   object
 2   Title      1007 non-null   object
 3   Abstract   1007 non-null   object
 4   Keywords   1007 non-null   object
dtypes: object(5)
memory usage: 39.5+ KB


Unnamed: 0,Screening,DOI,Title,Abstract,Keywords
0,included,https://doi.org/10.1111/gcbb.12234,Emission of CO2 from biochar-amended soils and...,Soil amendment with pyrogenic organic matter (...,additive effects; carbon sequestration; decomp...
1,included,https://doi.org/10.1890/10-0660.1,Fire effects on temperate forest soil C and N ...,Temperate forest soils store globally signific...,carbon sinks; fire; forest management; meta-an...
2,included,https://doi.org/10.1016/j.still.2020.104575,A calculator to quantify cover crop effects on...,Many producers use cover crops as a means to i...,Conservation agriculture; Soil quality; Meta-A...
3,included,https://doi.org/10.1016/j.dib.2020.105376,Quantifying cover crop effects on soil health ...,The dataset presented here supports the resear...,Soil health\nSoil quality\nCover crop\nConserv...
4,included,https://doi.org/10.1016/j.foreco.2019.117808,Impacts of the Three-North shelter forest prog...,Vegetation restoration in arid and semi-arid a...,Three-North Shelter Forest; Soil organic carbo...
5,included,https://doi.org/10.1088/1748-9326/aaeb5f,Revisiting IPCC Tier 1 coefficients for soil o...,"Agroforestry systems comprise trees and crops,...",carbon sequestration; emission factor; climate...
6,included,https://doi.org/10.1111/sum.12546,Biochar effects on crop yields with and withou...,The added value of biochar when applied along ...,
7,included,https://doi.org/10.1016/j.agee.2016.04.011,Carbon sequestration and net emissions of CH4 ...,While there have been many valuable individual...,Agroforestry\nCarbon sequestration\nSoil\nBiom...
8,included,https://doi.org/10.1007/s10457-017-0147-9,Soil carbon sequestration in agroforestry syst...,Agroforestry systems may play an important rol...,Agroforestry; Carbon sequestration; Soil organ...
9,included,https://doi.org/10.1016/j.foreco.2018.07.029,The effects of forest restoration on ecosystem...,Ecological restoration has become an overarchi...,Pacific Northwest; Ecosystem services; Silvicu...


In [None]:
# Function to get DOIs from URLs
def extract_doi(url):
    """Return the DOI part of a https://doi.org/ URL, or None for any other value."""
    prefix = "https://doi.org/"
    url = str(url)
    return url[len(prefix):] if url.startswith(prefix) else None


# Extracting DOIs from URLs (the link columns were already renamed to "DOI" above)
df_incl["DOI"] = df_incl["DOI"].apply(extract_doi)
df_excl["DOI"] = df_excl["DOI"].apply(extract_doi)

# Removing exact duplicates (all columns identical), then every row whose DOI
# occurs more than once. keep=False drops all copies, and rows without a DOI
# (None) count as duplicates of each other.
df_incl = df_incl.drop_duplicates()
df_incl = df_incl[~df_incl["DOI"].duplicated(keep=False)]
df_excl = df_excl.drop_duplicates()
df_excl = df_excl[~df_excl["DOI"].duplicated(keep=False)]

# Assumption: 'train_set' (used below but never defined in this notebook) is the
# deduplicated merge of both tables, built the same way as 'raw_data'.
train_set = pd.concat([df_incl, df_excl], ignore_index=True).fillna("")

# ------ PRE-PROCESSING ------

# Tokenizer models required by nltk.word_tokenize (newer NLTK releases ask for "punkt_tab" instead)
nltk.download("punkt", quiet=True)
stop_words = set(stopwords.words("english"))  # Loading stopwords (new name avoids shadowing the nltk module)
lemmatizer = WordNetLemmatizer()  # Loading lemmatizer


def clean_text(text):
    """Remove special characters/punctuation, lowercase, strip leading/trailing spaces."""
    return re.sub(r"[^\w\s]", "", text).lower().strip()


def tokenize(text):
    """Tokenize a cleaned string, remove stopwords, and lemmatize the rest."""
    tokens = nltk.word_tokenize(text)
    return [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]


cleaned_titles = [clean_text(title) for title in train_set["Title"]]
cleaned_abstracts = [clean_text(abstract) for abstract in train_set["Abstract"]]

# Split keywords using separators (, ; \n), then clean each non-empty keyword.
# One list per row: the original nested loop re-appended earlier rows on every pass.
cleaned_keywords = []
for keywords in train_set["Keywords"]:
    parts = re.split(r"[,;\n]", keywords)
    cleaned_keywords.append([clean_text(part) for part in parts if part.strip() != ""])

# Tokenization, Stopword Removal, Lemmatization
tokenized_titles = [tokenize(title) for title in cleaned_titles]
tokenized_abstracts = [tokenize(abstract) for abstract in cleaned_abstracts]
# Keywords: flatten the tokens of all keywords of a row into one list
tokenized_keywords = [
    [token for keyword in keywords for token in tokenize(keyword)]
    for keywords in cleaned_keywords
]

# Update the DataFrame with pre-processed data
train_set["Tokenized Titles"] = tokenized_titles
train_set["Tokenized Abstracts"] = tokenized_abstracts
train_set["Tokenized Keywords"] = tokenized_keywords

columns_to_keep = [
    "Screening",
    "Tokenized Titles",
    "Tokenized Abstracts",
    "Tokenized Keywords",
]

train_set = train_set[columns_to_keep]
train_set.columns = ["Screening", "Title", "Abstract", "Keyword"]

# Shuffle rows, then drop entries whose abstract is empty after pre-processing
train_set = train_set.sample(frac=1)
train_set = train_set[train_set["Abstract"].apply(len) > 0]
train_set = train_set.reset_index(drop=True)

# Count rows whose 'Keyword' list is empty
empty_list_count = int((train_set["Keyword"].apply(len) == 0).sum())

print("Number of empty lists in 'Keyword' column:", empty_list_count)


train_set.info()
train_set.head(10)

In [21]:
a = [1, 2, 3]
b = ["a", "b", "c"]
c = a
print(c)
c = b
print(c)

[1, 2, 3]
['a', 'b', 'c']
