## ML project for summarizing scientific articles
---> final analysis of article trends (Power BI)

### 1. Setup and article retrieval

Sources: PubMed, arXiv, etc.

The articles come as PDF files, so they first have to be converted into a machine-readable format. I use PyMuPDF for this.

In [3]:
!pip install PyMuPDF



We then process a single article as an example.

In [2]:
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Return the text of every page of the PDF as a single string."""
    # The `with` block closes the file once all pages have been read
    with fitz.open(pdf_path) as document:
        return "".join(page.get_text() for page in document)

# Example usage
pdf_path = "C:/Doc_cyril/mon_projet_ml.git/articles_pdf/brain_derived_neurotrophic_factor_signaling.pdf"
text = extract_text_from_pdf(pdf_path)
print(text)


394  ｜NEURAL REGENERATION RESEARCH｜Vol 20｜No. 2｜February 2025
NEURAL REGENERATION RESEARCH
www.nrronline.org
Review
Abstract  
During the development of the nervous system, there is an overproduction of neurons 
and synapses. Hebbian competition between neighboring nerve endings and synapses 
performing different activity levels leads to their elimination or strengthening. We have 
extensively studied the involvement of the brain-derived neurotrophic factor-Tropomyosin-
related kinase B receptor neurotrophic retrograde pathway, at the neuromuscular 
junction, in the axonal development and synapse elimination process versus the synapse 
consolidation. The purpose of this review is to describe the neurotrophic influence on 
developmental synapse elimination, in relation to other molecular pathways that we and 
others have found to regulate this process. In particular, we summarize our published 
results based on transmitter release analysis and axonal counts to show the different 
involv

Now that the text can be extracted in a usable form, it has to be preprocessed, that is, stripped of elements that carry little information, such as references or stop words. Preprocessing also tokenizes and lemmatizes the text: each word is reduced to its base form (its lemma), so that inflected variants of the same word are treated as one. We use the spaCy library with the scispaCy model `en_core_sci_sm`.

In [3]:
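import warnings

# Suppress a specific PyTorch deprecation warning.
# `message` is a regular expression matched against the start of the warning text,
# so the dot and parentheses are escaped to be matched literally.
warnings.filterwarnings("ignore", message=r"torch\.set_default_tensor_type\(\) is deprecated as of PyTorch 2\.1")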

import nmslib
import scispacy 
import pydantic
import spacy
nlp = spacy.load('en_core_sci_sm') 

def preprocess_text(text):
    
    # Process the text with spaCy
    doc = nlp(text)
    
    # Remove stop words and punctuation, and lemmatize the text
    cleaned_tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    
    # Join tokens back into a single string
    cleaned_text = ' '.join(cleaned_tokens)
    
    return cleaned_text

# test with the article used above
pdf_path = "C:/Doc_cyril/mon_projet_ml.git/articles_pdf/brain_derived_neurotrophic_factor_signaling.pdf"
text = extract_text_from_pdf(pdf_path)
cleaned_text = preprocess_text(text)
print(cleaned_text)




394   ｜neural REGENERATION RESEARCH｜Vol 20｜No 2｜february 2025 
 NEURAL REGENERATION research 
 www.nrronline.org 
 Review 
 Abstract  
 development nervous system overproduction neuron 
 synapsis hebbian competition neighbor nerve ending synapsis 
 perform different activity level lead elimination strengthening 
 extensively study involvement brain-derived neurotrophic factor-tropomyosin- 
 related kinase b receptor neurotrophic retrograde pathway neuromuscular 
 junction axonal development synapse elimination process versus synapse 
 consolidation purpose review describe neurotrophic influence 
 developmental synapse elimination relation molecular pathway 
 find regulate process particular summarize publish 
 result base transmitter release analysis axonal count different 
 involvement presynaptic acetylcholine muscarinic autoreceptor couple 
 downstream serine-threonine protein kinase c pka pkc voltage-gated 
 calcium channel different nerve ending developmental competition dynamic 
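
The preprocessing description mentions removing references, and the output above still carries the PDF's line breaks. Below is a minimal sketch of one way to handle both before calling `preprocess_text`. It assumes the reference list starts at a line containing only a heading such as "References", which is a heuristic and not something this notebook implements. `strip_references` and `normalize_whitespace` are hypothetical helper names introduced for illustration.

```python
import re

# Assumption: the reference list begins at a line consisting only of one of
# these headings. Real articles vary, so this is only a heuristic.
_REFERENCE_HEADING = re.compile(
    r"^[ \t]*(references|bibliography|literature cited)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text):
    """Drop everything from the last reference-style heading onward."""
    matches = list(_REFERENCE_HEADING.finditer(text))
    if not matches:
        return text  # no heading found: leave the text untouched
    return text[: matches[-1].start()]


def normalize_whitespace(text):
    """Collapse runs of whitespace (including PDF line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Abstract\nNeurons are overproduced.\n\nReferences\nSmith J (2020) A paper."
print(normalize_whitespace(strip_references(sample)))
# prints: Abstract Neurons are overproduced.
```

It could be chained as `preprocess_text(normalize_whitespace(strip_references(text)))`. Cutting at the last matching heading rather than the first is a deliberate choice: it removes less text if the word appears alone on a line earlier in the article.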


The preprocessing function works as expected. The next step is to run the whole pipeline on every article, iterating over the PDFs and saving each result as a text file in a dedicated folder.

In [6]:
import os

def save_text_to_file(text, file_path):  # write the text to a UTF-8 file at the given path
    with open(file_path, 'w', encoding='utf-8') as file:
        file.write(text)

# iterate over the PDF files in the input folder: extract the text, preprocess it, and save the result in a new folder
def process_pdfs(input_dir, output_dir):
    # create the output folder for the text files if it does not exist yet
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith('.pdf'):
            pdf_path = os.path.join(input_dir, filename)
            text = extract_text_from_pdf(pdf_path)
            cleaned_text = preprocess_text(text)
            output_path = os.path.join(output_dir, f"{os.path.splitext(filename)[0]}.txt")
            save_text_to_file(cleaned_text, output_path)
            print(f"Processed {filename}")

# apply the functions to our folders
input_dir = "C:/Doc_cyril/mon_projet_ml.git/articles_pdf"
output_dir = "C:/Doc_cyril/mon_projet_ml.git/articles_txt"
os.makedirs(output_dir, exist_ok=True)
process_pdfs(input_dir, output_dir)

Processed brain_derived_neurotrophic_factor_signaling.pdf
Processed fphys-15-1337442.pdf
Processed fphys-15-1349313.pdf
Processed jad-prepress_jad--1--1-jad231425_jad--1-jad231425.pdf
Processed jpd-prepress_jpd--1--1-jpd240075_jpd--1-jpd240075.pdf
Processed nanomaterials_mediated_lysosomal_regulation__a.11.pdf
Processed PIIS2211124724005448.pdf
Processed sciadv.adk3229.pdf
Processed small_extracellular_vesicles_derived_from_human.34.pdf
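
With the loop above, a single PDF that fails to parse would stop the whole batch. Below is a minimal sketch of a more defensive variant, assuming it is acceptable to skip a failing file and report it at the end. `process_pdfs_safely` and its `extract` and `clean` parameters (stand-ins for `extract_text_from_pdf` and `preprocess_text`) are hypothetical names introduced for illustration.

```python
from pathlib import Path


def process_pdfs_safely(input_dir, output_dir, extract, clean):
    """Convert every PDF in input_dir to a cleaned .txt file in output_dir.

    `extract` maps a PDF path (str) to raw text and `clean` maps raw text to
    cleaned text. Returns two lists of file names: (processed, failed).
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed, failed = [], []
    # sorted() makes the processing order reproducible across runs
    for pdf_path in sorted(input_dir.iterdir()):
        if pdf_path.suffix.lower() != ".pdf":
            continue  # ignore non-PDF files (the check is case-insensitive)
        try:
            cleaned = clean(extract(str(pdf_path)))
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            print(f"Failed {pdf_path.name}: {exc}")
            failed.append(pdf_path.name)
            continue
        (output_dir / f"{pdf_path.stem}.txt").write_text(cleaned, encoding="utf-8")
        processed.append(pdf_path.name)
    return processed, failed
```

In this notebook it could be called as `processed, failed = process_pdfs_safely(input_dir, output_dir, extract_text_from_pdf, preprocess_text)`. Passing the two functions as arguments keeps the sketch independent of PyMuPDF and spaCy, so it can be tried with stub functions first.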
