### Text Cleaning
    Remove any unnecessary characters, such as special characters, punctuation, or HTML tags.
    Convert the text to lowercase to achieve case-insensitive matching.
    Remove any extra whitespace or leading/trailing spaces.
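
The three cleaning steps above can be sketched as a single function. This is a minimal illustration using only the standard library; the name `clean_text` is hypothetical. It replaces punctuation with a space rather than deleting it, which is an assumption made here so that hyphenated terms such as "meta-analysis" are not fused into one token.

```python
import html
import re


def clean_text(text: str) -> str:
    """Strip HTML tags and punctuation, lowercase, and normalize whitespace."""
    text = html.unescape(text)            # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with a space
    text = text.lower()                   # case-insensitive matching
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()                   # remove leading/trailing spaces


print(clean_text("  <p>Soil &amp; Carbon:   a Meta-Analysis!</p> "))
# soil carbon a meta analysis
```

Whether punctuation should be deleted or replaced by a space depends on how hyphenated and slash-separated terms should be treated downstream.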

### Tokenization
    Split the text into individual words or tokens, the units that the later steps (stopword removal, lemmatization, vectorization) operate on.
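
As a rough illustration, a regex-based tokenizer is enough for text that has already been cleaned. The function `tokenize` below is a hypothetical stand-in, not the behavior of `nltk.word_tokenize`, which handles contractions and punctuation with more care.

```python
import re


def tokenize(text: str) -> list[str]:
    """Return the runs of word characters in `text`, in order."""
    return re.findall(r"\w+", text)


print(tokenize("biochar effects on crop yields"))
# ['biochar', 'effects', 'on', 'crop', 'yields']
```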

### Stopword Removal
    Remove common words, known as stopwords (e.g., "the," "is," "and"), which may not contribute much to the classification task.
    You can use pre-defined stopword lists from libraries like NLTK or spaCy or create a custom list based on your specific domain.
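
A minimal sketch of stopword filtering with a custom list follows. The sets `STOPWORDS` and `DOMAIN_STOPWORDS` are illustrative assumptions, not the lists shipped with NLTK or spaCy.

```python
# Illustrative lists only; a real project would start from a library list.
STOPWORDS = {"the", "is", "and", "of", "on", "a", "in"}
DOMAIN_STOPWORDS = STOPWORDS | {"study", "results"}  # hypothetical domain additions


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep the tokens whose lowercase form is not in `stopwords`."""
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["the", "effects", "of", "fire", "on", "soil"]
print(remove_stopwords(tokens))
# ['effects', 'fire', 'soil']
```

Storing the stopwords in a `set` keeps each membership test constant-time, which matters once abstracts run to hundreds of tokens.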

### Lemmatization or Stemming
    Reduce words to their base or root form to normalize the text and reduce vocabulary size.
    Lemmatization maps each word to its dictionary form (lemma) using a vocabulary and morphological rules, so the result is a valid word.
    Stemming strips affixes with simple heuristic rules; it is faster, but the resulting stem is not always a valid word.
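
The difference between the two approaches can be seen in a toy comparison. Both functions below are deliberately simplified assumptions: `toy_stem` is a naive suffix stripper (not the Porter algorithm) and `toy_lemmatize` is a tiny lookup table (not WordNet).

```python
SUFFIXES = ("ing", "ies", "es", "s")  # checked in this order

# Hypothetical lookup table standing in for a real lexicon.
LEMMAS = {"studies": "study", "soils": "soil", "mice": "mouse"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma table; unknown words pass through."""
    return LEMMAS.get(word, word)


for word in ["studies", "soils", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# studies stud study
# soils soil soil
# mice mice mouse
```

The stemmer produces the non-word "stud" and misses the irregular plural "mice", while the lookup handles both but only for words it knows.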

### Handling Abbreviations and Acronyms
    Decide whether to expand or keep abbreviations and acronyms as they are, based on their relevance to the classification task.
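
If expansion is chosen, one possible approach is a dictionary lookup applied with word boundaries. The mapping `ABBREVIATIONS` below is a hypothetical example and would need to be built for the actual domain.

```python
import re

# Hypothetical mapping; keys are lowercase abbreviations.
ABBREVIATIONS = {
    "soc": "soil organic carbon",
    "ghg": "greenhouse gas",
}

# \b ensures that "soc" does not match inside words such as "social".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    re.IGNORECASE,
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(0).lower()], text)


print(expand_abbreviations("SOC stocks and GHG emissions in social systems"))
# soil organic carbon stocks and greenhouse gas emissions in social systems
```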

### Handling Numeric Data
    Decide whether to replace numbers with a generic token or keep them as-is based on their importance in the text.
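
If numbers are replaced, a regex substitution is usually sufficient. The sketch below assumes a placeholder token `<num>` and only replaces standalone numbers, so that chemical formulas such as "CO2" are left untouched; both choices are assumptions for illustration.

```python
import re

# Standalone integers or decimals, with "." or "," as internal separators.
# The \b anchors keep digits inside words (e.g. "CO2") from matching.
NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def replace_numbers(text: str, token: str = "<num>") -> str:
    """Replace every standalone number in `text` with `token`."""
    return NUMBER.sub(token, text)


print(replace_numbers("CO2 rose 12.5% across 1,024 plots in 2019"))
# CO2 rose <num>% across <num> plots in <num>
```

If this step runs before punctuation removal, the placeholder should be one that survives the cleaning regex, for example a plain word instead of `<num>`.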

### Handling Rare Words or Outliers
    Remove words that occur only a handful of times in the corpus, as they rarely contribute to the classification task and inflate the vocabulary.
    Similarly, remove outliers, i.e. unusual tokens that are not relevant to the task.
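
A frequency threshold is one simple way to implement this. The sketch below counts occurrences across the whole corpus; the function name `drop_rare_tokens` and the default `min_count=2` are assumptions, and a document-frequency threshold would be an equally reasonable choice.

```python
from collections import Counter


def drop_rare_tokens(docs, min_count=2):
    """Remove tokens that occur fewer than `min_count` times in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return [[token for token in doc if counts[token] >= min_count] for doc in docs]


docs = [
    ["soil", "carbon", "biochar"],
    ["soil", "carbon", "fire"],
    ["soil", "agroforestry"],
]
print(drop_rare_tokens(docs))
# [['soil', 'carbon'], ['soil', 'carbon'], ['soil']]
```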

### Vectorization
    Convert the pre-processed text data into numerical representations that machine learning models can understand.
    Techniques such as TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings like Word2Vec or GloVe can be employed for vectorization.
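
To make the idea concrete, here is a hand-rolled TF-IDF sketch using only the standard library. It assumes the plain variant, tf = count / document length and idf = log(N / df); libraries such as scikit-learn use smoothed and normalized variants by default, so their numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each token appears.
    df = Counter(token for doc in docs for token in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf([["soil", "carbon"], ["soil", "fire"]])
print(vocab)  # ['carbon', 'fire', 'soil']
for vector in vectors:
    print(vector)
```

Here "soil" appears in every document, so its idf is log(1) = 0 and it carries no weight, while "carbon" and "fire" each distinguish one document.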

In [14]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd

xlsx_file = pd.ExcelFile(
    "/home/er/Documents/Cirad/SOCSciCompiler/data/trainset/trainset.xlsx"
)
df_incl = xlsx_file.parse("retained_meta-analyses")
df_excl = xlsx_file.parse("non_retained_meta-analyses")
df_incl["Screening"] = "included"
df_excl["Screening"] = "excluded"


def extract_doi(url):
    if str(url).startswith("https://doi.org/"):
        return str(url)[len("https://doi.org/") :]
    else:
        return None


df_incl["link"] = df_incl["link"].apply(extract_doi)
df_excl["lien pour accès"] = df_excl["lien pour accès"].apply(extract_doi)

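# Drop exact duplicate rows, then drop every row whose DOI is shared with
# another row. Note: keep=False removes all members of a duplicate group, and
# rows where extract_doi returned None are treated as duplicates of each other.
df_incl = df_incl.drop_duplicates()
df_incl = df_incl[~df_incl["link"].duplicated(keep=False)]
df_excl = df_excl.drop_duplicates()
df_excl = df_excl[~df_excl["lien pour accès"].duplicated(keep=False)]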

attributes_to_keep_incl = [
    "Screening",
    "Article Title",
    "Abstract",
    "Keywords",
]

attributes_to_keep_excl = [
    "Screening",
    "title",
    "Abstract",
    "Keywords",
]

df_incl = df_incl[attributes_to_keep_incl]
df_excl = df_excl[attributes_to_keep_excl]

new_column_names_incl = {"Article Title": "Title"}
new_column_names_excl = {"title": "Title"}

df_incl = df_incl.rename(columns=new_column_names_incl)
df_excl = df_excl.rename(columns=new_column_names_excl)

train_set = pd.concat([df_incl, df_excl], ignore_index=True)
train_set = train_set.fillna("")

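# Pre-processing

nltk.download("stopwords")
nltk.download("wordnet")
# Tokenizer models used by nltk.word_tokenize; newer NLTK releases may ask
# for "punkt_tab" instead.
nltk.download("punkt")
# Go through nltk.corpus so that re-running this cell still works after the
# name `stopwords` has been rebound from the corpus reader to a set.
stopwords = set(nltk.corpus.stopwords.words("english"))
lemmatizer = WordNetLemmatizer()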

cleaned_titles = []
cleaned_abstracts = []
cleaned_keywords = []

for title in train_set["Title"]:
    title = re.sub(r"[^\w\s]", "", title)  # Remove special characters/punctuation
    title = title.lower()  # Convert to lowercase
    title = title.strip()  # Remove leading/trailing spaces
    cleaned_titles.append(title)

for abstract in train_set["Abstract"]:
    abstract = re.sub(r"[^\w\s]", "", abstract)
    abstract = abstract.lower()
    abstract = abstract.strip()
    cleaned_abstracts.append(abstract)

for keywords in train_set["Keywords"]:
    keywords = re.split(r"[,;\n]", keywords)  # Split keywords using separators (, ; \n)
    cleaned_keywords.append(
        [keyword.strip() for keyword in keywords if keyword.strip() != ""]
    )  # Remove empty keywords


train_set.info()
train_set.head(50)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 852 entries, 0 to 851
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Screening  852 non-null    object
 1   Title      852 non-null    object
 2   Abstract   852 non-null    object
 3   Keywords   852 non-null    object
dtypes: object(4)
memory usage: 26.8+ KB


| | Screening | Title | Abstract | Keywords |
|---|---|---|---|---|
| 0 | included | Emission of CO2 from biochar-amended soils and... | Soil amendment with pyrogenic organic matter (... | additive effects; carbon sequestration; decomp... |
| 1 | included | Fire effects on temperate forest soil C and N ... | Temperate forest soils store globally signific... | carbon sinks; fire; forest management; meta-an... |
| 2 | included | A calculator to quantify cover crop effects on... | Many producers use cover crops as a means to i... | Conservation agriculture; Soil quality; Meta-A... |
| 3 | included | Quantifying cover crop effects on soil health ... | The dataset presented here supports the resear... | Soil health\nSoil quality\nCover crop\nConserv... |
| 4 | included | Impacts of the Three-North shelter forest prog... | Vegetation restoration in arid and semi-arid a... | Three-North Shelter Forest; Soil organic carbo... |
| 5 | included | Revisiting IPCC Tier 1 coefficients for soil o... | Agroforestry systems comprise trees and crops,... | carbon sequestration; emission factor; climate... |
| 6 | included | Biochar effects on crop yields with and withou... | The added value of biochar when applied along ... | |
| 7 | included | Carbon sequestration and net emissions of CH4 ... | While there have been many valuable individual... | Agroforestry\nCarbon sequestration\nSoil\nBiom... |
| 8 | included | Soil carbon sequestration in agroforestry syst... | Agroforestry systems may play an important rol... | Agroforestry; Carbon sequestration; Soil organ... |
| 9 | included | The effects of forest restoration on ecosystem... | Ecological restoration has become an overarchi... | Pacific Northwest; Ecosystem services; Silvicu... |


In [None]:

# Tokenization, Stopword Removal, Lemmatization
tokenized_titles = []
tokenized_abstracts = []
tokenized_keywords = []

for title in cleaned_titles:
    tokens = nltk.word_tokenize(title)  # Tokenization
    filtered_tokens = [
        token for token in tokens if token not in stopwords
    ]  # Stopword Removal
    lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]  # Lemmatization
    tokenized_titles.append(lemmas)

for abstract in cleaned_abstracts:
    tokens = nltk.word_tokenize(abstract)
    filtered_tokens = [token for token in tokens if token not in stopwords]
    lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    tokenized_abstracts.append(lemmas)

for keywords in cleaned_keywords:
    tokenized_keyword_list = []
    for keyword in keywords:
        tokens = nltk.word_tokenize(keyword)
        filtered_tokens = [token for token in tokens if token not in stopwords]
        lemmas = [lemmatizer.lemmatize(token) for token in filtered_tokens]
        tokenized_keyword_list.extend(lemmas)
    tokenized_keywords.append(tokenized_keyword_list)

# Update the DataFrame with pre-processed data.
# The DataFrame built in the previous cell is named `train_set`.
train_set["Cleaned Titles"] = cleaned_titles
train_set["Tokenized Titles"] = tokenized_titles
train_set["Cleaned Abstracts"] = cleaned_abstracts
train_set["Tokenized Abstracts"] = tokenized_abstracts
train_set["Cleaned Keywords"] = cleaned_keywords
train_set["Tokenized Keywords"] = tokenized_keywords

# Print the updated DataFrame
print(train_set)