- 🔭 I’m currently working on NLP applications at OppScience
- 📖 Master of Science in Computer Science at Université de Technologie de Compiègne (French Engineering degree), graduated in 2023
- 🌱 I'm always learning new technologies to stay up to date
- ⚡ I am passionate about finance, entrepreneurship, traveling and robotics
- 📫 Feel free to contact me on LinkedIn
To see the details of the certificates and the authenticity verification page, feel free to click on them. The diplomas are available on my LinkedIn profile.
To see the name of a skill, hover over it.
To improve the readability of my projects, here is a legend of the emojis in the title of the projects:
- 🔒 : private project, for a company for example, so I cannot show the code
- 🧮 : data science project
- 🤖 : robotics / IoT project
- 📚 : educational project
- 👨 : personal project
- 🧮 📚 Winning Space Race with Data Science
- 🧮 📚 Regression ML algorithms for CVSS estimation
- 🧮 🔒 Multilingual NER models evaluation
- 🧮 🔒 OCR benchmark and preprocess optimization
- 🧮 🔒 Machine Learning for Sentence Boundary Detection capabilities
- 🧮 🔒 Relation Extraction models using clustering and semantic vectors combination systems
- 🧮 🔒 Active Data Generation pipeline for automatic reinforcement learning systems using a multi-agent approach
- 🧮 📚 Classification model on the Adult Income dataset
- 🤖 📚 Following a green target with a TurtleBot3 Burger
To earn the IBM Data Science Professional Certificate, I developed solutions for this applied Data Science capstone.
SpaceX intends to reduce the cost of space flights by reusing the first stage of its rockets. The goal of this project is to predict whether the first stage of a Falcon 9 rocket will land back on Earth successfully.
The data was collected using the SpaceX REST API and a Wikipedia article about the Falcon 9 rockets. I performed web scraping to extract useful data from the second source.
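As an illustration, launch records can be pulled from the public SpaceX REST API in a few lines of Python (the endpoint and the selected fields below are only an example, not the exact code of the capstone):

```python
import requests
import pandas as pd

# Fetch all past launches from the public SpaceX REST API (v4).
response = requests.get("https://api.spacexdata.com/v4/launches/past")
response.raise_for_status()
launches = pd.json_normalize(response.json())

# Keep a few fields that matter for the landing-outcome analysis.
print(launches[["flight_number", "name", "date_utc", "success"]].head())
```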
I performed an Exploratory Data Analysis using Folium to build a map and study distances, Dash to create an interactive dashboard, and other Python libraries to create scatter plots, bar charts and many other visualizations.
I used 4 classification models and optimized their parameters to find the best solution to the problem.
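A minimal sketch of how such a comparison can be set up with scikit-learn; the four candidate models and their grids are illustrative, and `X_train` / `y_train` stand for the preprocessed launch features and landing labels:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# One parameter grid per candidate model.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "svm": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=10, scoring="accuracy")
    search.fit(X_train, y_train)
    best[name] = (search.best_score_, search.best_params_)
print(best)
```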
At the end of the project, I wrote a concise but complete 45-page report explaining the work with a storytelling approach.
This project addresses a practical issue: in some fields, it is important to evaluate the danger of a cyber attack in order to prioritize its remediation. The CVSS score, which ranges from 0 to 10, quantifies this criticality. Using Machine Learning to estimate this score, based on the stack and other information, could make the evaluation much quicker. In this notebook, we try this solution using a regression approach.
Most of the work is data exploration, relying mainly on matplotlib and seaborn to draw correlations and highlight useful information for later use. I used scikit-learn and pandas to manipulate the data and build preprocessing and modeling functions, in order to find the best preprocessing + model combination, with the median error as the most representative performance metric.
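A hedged sketch of that preprocessing + model grid with scikit-learn: the column names are hypothetical, `X` / `y` stand for the CVSS features and scores, and the median absolute error scorer is used here as a concrete choice of median-based metric.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical feature groups; the real ones depend on the CVSS dataset.
numeric_cols = ["n_affected_components", "year"]
categorical_cols = ["attack_vector", "vendor"]

preprocessors = {
    "scaled": ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]),
    "raw": ColumnTransformer([
        ("num", "passthrough", numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ]),
}
models = {"linear": LinearRegression(), "forest": RandomForestRegressor()}

# Score every preprocessing + model combination on the median absolute error.
for prep_name, prep in preprocessors.items():
    for model_name, model in models.items():
        pipe = Pipeline([("prep", prep), ("model", model)])
        scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_median_absolute_error")
        print(prep_name, model_name, -scores.mean())
```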
This project was carried out for university researchers working on the detection of cyber attacks on autonomous trains.
Named Entity Recognition (NER) is a Natural Language Processing task: it consists in automatically extracting typed entities from a text. A NER model can, for example, extract all the people, dates, locations, etc. from a document:
These models are usually monolingual. However, the company needed to explore the possibility of using a single model to extract entities from large volumes of documents in 5 languages. This would save its customers from deploying 5 different models, and from having to detect the language of each analyzed document.
To do so, I gathered many annotated datasets (from Kaggle and other sources) and several pre-trained models, and designed a benchmark to compute the models' metrics. To understand why some results were low, I analyzed the results per tag, like this:
This deeper study helped me understand the semantic reasons behind these disparities in results, and allowed the team to correct the issue using transfer learning, such as fine-tuning.
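As an illustration of the per-tag evaluation, the seqeval library can produce this kind of breakdown (toy tag sequences below; the actual benchmark covered several datasets and languages):

```python
from seqeval.metrics import classification_report

# Gold and predicted tag sequences in IOB2 format (toy example).
y_true = [["B-PER", "I-PER", "O", "B-LOC"],
          ["B-DATE", "O", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],
          ["B-DATE", "O", "B-LOC"]]

# Precision / recall / F1 broken down per entity type (PER, LOC, DATE, ...).
print(classification_report(y_true, y_pred))
```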
The company needed to quantify the performance of its OCR tool. Optical Character Recognition (OCR) is a Computer Vision technology that extracts text from an image.
Therefore, I designed a benchmark to do so. First, I had to think about how to evaluate an OCR:
- which preprocessing functions?
- what kind of data?
- how to quantify the quality of an OCR's output?
I chose three criteria to study their impact on the metrics: the font, the font size, and the quality (dimensions) of the document. I also studied the impact of the grayscale and rotation preprocessing functions. Based on these choices, I searched the internet for several datasets to obtain a representative sample of documents. I then standardized them to follow the hOCR format, splitting each picture into boxes to locate the text. This allowed me to match the extracted text to its ground truth.
To quantify the OCR's capabilities, I used an optimized Levenshtein distance function to compute the precision, recall and F1 score.
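One possible way to derive character-level precision, recall and F1 from Levenshtein edit operations is sketched below (using the python-Levenshtein package; the function actually used in the benchmark may differ):

```python
import Levenshtein  # python-Levenshtein package

def ocr_scores(predicted: str, reference: str):
    """Character-level precision / recall / F1 derived from Levenshtein edit operations."""
    ops = Levenshtein.editops(predicted, reference)
    substitutions = sum(op[0] == "replace" for op in ops)
    missing = sum(op[0] == "insert" for op in ops)   # reference characters the OCR missed
    correct = len(reference) - substitutions - missing
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(ocr_scores("He1lo world", "Hello world!"))  # ~ (0.91, 0.83, 0.87)
```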
After that, I improved the rotation preprocessing function, based on the detected angle, to reduce the processing time.
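Since the project code is private, here is only an illustrative sketch of an angle-based rotation step with OpenCV, applied as a single warpAffine call once the skew angle is known:

```python
import cv2

def rotate_document(image, angle_degrees):
    """Rotate the page by the detected skew angle in a single warpAffine call."""
    height, width = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((width / 2, height / 2), angle_degrees, 1.0)
    return cv2.warpAffine(image, matrix, (width, height),
                          flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_REPLICATE)
```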
At the end of the project, I gave a presentation to the rest of the team to explain all my proposals and decide which changes to integrate into the program.
To improve its NLP processing, the company needed to improve its data preprocessing. Tokenization is one of the essential steps. There are several ways to tokenize a text; one of them is to split it into paragraphs, extract the sentences from each paragraph, and then get each word and punctuation mark of these sentences. The interesting point of this approach is that it preserves the fact that two words in the same sentence are more closely related than two words in separate sentences. The issue is that a sentence can end with something other than a period, an exclamation mark or other punctuation. Sentence Boundary Detection (SBD) is an active NLP research problem, and models exist for this task. My job was therefore to evaluate the models currently used in the company's solution, and then try to find SBD models that perform better.
To do so, I had to find SBD datasets containing various data forms (tables, headers, lists, etc.) and standardize their format, to be able to evaluate models as completely as possible. I then wrote a state of the art to list the available models, and created a Python benchmark to test them, framing the problem as a binary classification (1 if the index is a boundary, else 0). I was then able to evaluate the precision, recall and F1 score of the 'is a boundary' class (1).
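A minimal sketch of that binary framing with scikit-learn (toy labels; the real benchmark also applies the tolerance discussed below):

```python
from sklearn.metrics import precision_recall_fscore_support

# Token-level labels: 1 if the index ends a sentence (is a boundary), else 0.
y_true = [0, 0, 1, 0, 0, 0, 1, 0, 1]   # gold boundaries
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0]   # model predictions

# Only the positive 'is a boundary' class (1) is scored.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(precision, recall, f1)
```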
The main issue I faced in the SBD problem is that there are many ways to decide whether a substring of a text is a sentence or not. It really depends on the annotated dataset used to train the model, as shown in the next picture. That is why I had to add some tolerance to keep the evaluation objective.
I tried many models, from purely syntactic ones (such as PySBD) to more complex models using Neural Networks (for example Stanza). At the end of the study, I presented my analysis of the limits of each model (mostly the impact of punctuation), in order to build a better one.
With the best model, I obtained very good results that greatly improved the solution:
This project came after the optimization of the NER models. Given the extracted entities, we wanted to know the relation between each pair of them. For instance, is the relation between a person and a date a "birthdate" relation, or not? The following picture shows how relations can be useful in NLP systems.
The problem becomes more complex because we wanted to be able to introduce new relations in few-shot contexts, and also to reject samples (no relation), without having to retrain the whole pipeline. Therefore, we designed a system that combines the vectors (BERT embeddings, for instance) of the two entities, projects them into a new space using a distance metric learning algorithm, and finally clusters all samples into labeled relations, as shown in the following picture.
After various optimizations, especially on the vector combination step, we achieved very good results, which allowed this system to be industrialized and moved to production.
This unique system allowed us to solve very complex real-world problems in few-shot contexts, with minimal training time since we used pre-trained encoders.
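Since the production code is private, here is only a rough sketch of the three steps (combine, project, cluster) with scikit-learn; the vectors are random stand-ins for BERT embeddings, and NCA / nearest-centroid are generic substitutes for the actual metric-learning and clustering components:

```python
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis, NearestCentroid

# Random stand-ins: one small embedding per entity of each pair, plus relation labels.
rng = np.random.default_rng(0)
head_vecs = rng.normal(size=(60, 64))    # first entity of each sample
tail_vecs = rng.normal(size=(60, 64))    # second entity of each sample
labels = rng.integers(0, 3, size=60)     # few-shot relation labels (0 = no relation)

# Step 1: combine the two entity vectors (plain concatenation here).
pairs = np.concatenate([head_vecs, tail_vecs], axis=1)

# Step 2: project into a space where same-relation pairs end up close together.
nca = NeighborhoodComponentsAnalysis(n_components=16, random_state=0)
projected = nca.fit_transform(pairs, labels)

# Step 3: assign samples to labeled relation clusters.
clusterer = NearestCentroid().fit(projected, labels)
print(clusterer.predict(projected[:5]))
```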
🧮 🔒 Active Data Generation pipeline for automatic reinforcement learning systems using a multi-agent approach
Data has always been a challenge for the Data Science industry. With the rise of LLMs, lots of projects around data generation were born. I pushed it a step further and designed a fully automatic system for NER and RE applications, in which a pre-trained model is sequentially fine-tuned on generated data that follows the style of a small production dataset, until it reaches satisfying results. The goal here is to optimize the fine-tuning of the model based on its weaknesses, focusing on its worst mistakes.
The algorithm, shown in the following picture, is pretty simple thanks to the multi-agent approach.
The first step is evaluating the pre-trained model on a small production dataset, from a client for instance. Then, based on the results for each entity class and relation class, scenarios are created for the data generation agent (a large language model). Samples of the production dataset are also injected into the prompt, as the generated data should have the same "style" (defined precisely through criteria in the prompt). After that, a second agent (a smaller LLM) evaluates the style similarity between the generated and the original data (scored out of 100, using criteria given in the prompt), and gives directions to the generation agent to improve the generated data. Using other metrics (self-BLEU on n-grams, for instance), we made sure to avoid duplicates in the generated dataset. The model is then fine-tuned on the generated dataset. In the last step, the model is evaluated on the production dataset, and if the score is below a threshold (for instance an f1-score of 0.8), the pipeline starts again. This effectively improves the models' results in few-shot contexts. The following picture shows the results obtained over 4 consecutive iterations of the pipeline, on a NER model.
Even though this might look like data contamination, since the model is trained on data generated to match the same production dataset used for evaluation, additional steps are taken to avoid that. This pipeline yielded improvements of up to 20% on the NER models used in production by the company.
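The control flow of the pipeline can be summarized with the following Python skeleton. Every function here is a stub standing in for a real component (evaluation, generation agent, critic agent, fine-tuning); only the loop structure reflects the system described above:

```python
import random

F1_THRESHOLD = 0.8

# --- Stubs: each one hides a real component of the pipeline ---
def evaluate(model, dataset):                     # per-class F1 on the production set
    return {label: random.uniform(0.5, 1.0) for label in ["PER", "ORG", "DATE"]}

def generation_agent(scenarios, feedback=None):   # large LLM writing new samples
    return [f"generated sample for {s}" for s in scenarios]

def critic_agent(generated, dataset):             # smaller LLM scoring style similarity /100
    return random.randint(60, 100), "make the sentences shorter"

def fine_tune(model, generated):                  # fine-tuning on the accepted generated data
    return model

model, production_set = "pretrained-ner-model", ["real sample 1", "real sample 2"]

for iteration in range(10):
    scores = evaluate(model, production_set)              # step 1: find the weaknesses
    if min(scores.values()) >= F1_THRESHOLD:
        break                                             # satisfying results: stop the loop
    scenarios = [label for label, f1 in scores.items() if f1 < F1_THRESHOLD]
    generated = generation_agent(scenarios)               # step 2: targeted data generation
    style_score, feedback = critic_agent(generated, production_set)
    while style_score < 80:                               # step 3: critic loop on style
        generated = generation_agent(scenarios, feedback=feedback)
        style_score, feedback = critic_agent(generated, production_set)
    model = fine_tune(model, generated)                   # step 4: fine-tune and iterate
```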
As a project for a Machine Learning course, I explored with a teammate solutions to predict whether people's income was below or above 50K dollars a year (binary classification). The only two rules to respect were:
- find a solution quickly (we had a maximum of 10 hours per person on this project)
- use the famous Adult Income dataset
We used a Jupyter Notebook to consolidate all the work done in the study. To run algorithms on the dataset, we used libraries such as Pandas and Seaborn (for data exploration), and Scikit-learn to test several models. We optimized the preprocessing to get the best processing chain.
As I usually do, I organized the study in 4 parts:
- exploratory data analysis, using graphs, statistics, plots, etc.
- data preparation, by creating several preprocessing chains
- modeling, by creating several models and testing every preprocessing-model combination
- evaluation, to measure our models' performance and make sure we had no under/overfitting (see the sketch below)
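As a sketch of the last step, the under/overfitting check can be as simple as comparing training and held-out accuracy (here `X` and `y` are placeholders for the preprocessed Adult Income features and labels, and the random forest is just one of the candidate models):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# X, y: preprocessed features and the >50K / <=50K labels (placeholders here).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# A large gap between the two scores is a sign of overfitting.
print(f"train={train_acc:.3f}  test={test_acc:.3f}")
```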
Because of the limited time, we did not optimize this model, but we could have tuned some parameters of the best model, using GridSearchCV for example.
After the study, we presented it orally with a PowerPoint presentation to explain our choices and the results we obtained.
The goal of this project was to develop a multithreaded program to make a robot follow a green target, using only:
- A camera
- A LiDAR
The project had requirements to meet:
- Stop the robot if any obstacle gets closer than 15 centimeters, all around it
- Make it follow a colored target moving in front of the robot
To meet these requirements, I implemented several features using Python and the ROS library, parallelizing the data processing across 7 nodes, as explained in the following picture.
To optimize the robot's motion, I implemented a distance estimator that uses the image of the target and its known real size. I also added a function to make the robot move backward if the target gets too close.
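A minimal sketch of that distance estimator based on the pinhole camera model; the focal length, target size and threshold below are assumptions for the example, not the values used on the robot:

```python
# All three constants are illustrative assumptions, not measured values.
FOCAL_LENGTH_PX = 530.0        # camera focal length, in pixels
TARGET_REAL_HEIGHT_M = 0.10    # real height of the green target, in meters
MIN_DISTANCE_M = 0.30          # below this distance the robot backs up

def estimate_distance(target_height_px: float) -> float:
    """Estimate the distance to the target from its apparent height in the image."""
    return FOCAL_LENGTH_PX * TARGET_REAL_HEIGHT_M / target_height_px

def linear_speed(target_height_px: float) -> float:
    """Move forward when the target is far, backward when it is too close."""
    return 0.1 if estimate_distance(target_height_px) > MIN_DISTANCE_M else -0.1

print(estimate_distance(200))  # ~0.27 m for a 200-pixel-high detection
```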