Machine Learning for tortured phrases extraction

Summary of the work presented in this repository

In a context where more and more scientific papers are partially or totally generated and plagiarised [1], some of these regular-looking papers actually contain tortured phrases [2] (e.g. "profund learning" which is used in the psychologic field by which should be "deep learning" in the field). In this repository, we propose a first machine learning approach to extract such phrases from suspicious scientific literature, based on the Problematic Paper Screener [3] (PPS) "Tortured" detector assessments.

This repository is separated in two parts, the dataset_analysis directory provides the used code to build and explore a dataset from scratch, and the supervised_learning provides the used code to train several machine learning models to automatically extract tortured phrases.

The dataset_analysis directory contains four files:

The assessments.csv file which contains the suspicious literature metadata such as DOI and manually extracted tortured phrases, gathered from the PPS.
The dataset.ipynb IPython notebook exlaining the whole process to build and explore the dataset used for tortured phrases extraction.
The dataset_metadata.csv file which contains the metadata (DOI and tortured phrases) of the dataset used for tortured phrases extraction.
The dependencies.txt file listing the used Python version and associated libraries.

The supervised_learning directory contains eight files:

The cnn.zip archive containing the CNN model with its best weights and its results for tortured phrases extraction.
The freeze.txt file listing the used Python version and associated libraries.
The lstm.zip archive containing the LSTM (RNN) model with its best weights and its results for tortured phrases extraction.
The models.py script containing the models implementations with Python.
The rcnn.zip archive containing the RCNN model with its best weights and its results for tortured phrases extraction.
The run_train.sh Shell script to run the models training.
The supervised_learning.ipynb IPython notebook explaining the whole process to build the labilsed dataset and display the obtained results.
The train_models.py script containing the Python code to train the implemented models.

Remark: the papers contents used in our dataset cannot be accessible through this repository since some of them are not freely available.

Acknowledgements and references

We thank Guillaume Cabanc, Cyril Labbé and Alexander Magazinov for their precious work on the depollution of the scientific litterature through the PPS.

[1] Cyril Labbé and Dominique Labbé. Duplicate and fake publications in the scientific literature: how many scigen papers in the computer science? Scientometrics, June 2012.

[2] Guillaume Cabanac and Cyril Labbé. Prevalence of nonsensical algorithmically generated papers in the scientific litearture. Journal of the Association for Information Science and Technology, 72(12):1461-1476, 2021.

[3] Guillaume Cabanac, Cyril Labbé, and Alexander Magazinov. The 'problematic paper screener' automatically selects suspect publications for postpublication (re)assessment, 2022.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
dataset_analysis		dataset_analysis
supervised_learning		supervised_learning
.gitattributes		.gitattributes
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Machine Learning for tortured phrases extraction

Summary of the work presented in this repository

Acknowledgements and references

About

Releases

Packages

Languages

License

alexcla99/ml_4_tortured_sentences

Folders and files

Latest commit

History

Repository files navigation

Machine Learning for tortured phrases extraction

Summary of the work presented in this repository

Acknowledgements and references

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages