I'm Elias, a Junior Data Scientist 👨‍💻 working since June 2020 💻
- 🔭 I’m currently working on NLP applications at OppScience
- 📖 Master of Science in Computer Science at Université de Technologie de Compiègne (French Engineering degree), graduated in 2023
- 🌱 I'm always learning new technologies to keep myself up to date
- 📫 Feel free to contact me on LinkedIn
- ⚡ I love making robots, using Machine Learning to improve them
Certifications & Badges
To see the details of the certificates and the authenticity verification page, feel free to click on them. The diplomas are available on my LinkedIn profile.
To get the name of the skill, place your cursor on it.
Robotics & IoT
DevOps & Miscs
To make my projects easier to browse, here is a legend for the emojis used in the project titles:
- 🔒 : private project (for a company, for example), so I cannot show the code
- 🧮 : data science project
- 🤖 : robotics / IoT project
- 📚 : educational project
- 👨 : personal project
List of the projects:
- 🧮 📚 Winning Space race with Data Science
- 🧮 📚 Regression ML algorithms for CVSS estimation
- 🧮 🔒 Multilingual NER models evaluation
- 🧮 🔒 OCR benchmark and preprocess optimization
- 🧮 🔒 Machine Learning for Sentence Bounding Detection capabilities
- 🧮 📚 Classification model on the Adults Income dataset
- 🤖 📚 Following green target for Turtle Bot 3 Burger
To earn the IBM Professional Data Science Certificate, I developed solutions to this applied Data Science capstone.
SpaceX intends to reduce the cost of space flights by reusing the first stage of its rockets. The goal of this project is to predict whether the first stage of a Falcon 9 rocket will land back on Earth successfully.
The data was collected from the SpaceX REST API and from a Wikipedia article about the Falcon 9 rockets. I used web scraping to extract useful data from the second source.
I performed an Exploratory Data Analysis using Folium to build a map and study distances, Dash to create an interactive dashboard, and other Python libraries to create scatterplots, bar charts and many other visualizations.
I trained 4 classification models and tuned their parameters to find the best-performing one.
At the end of the project, I wrote a concise but complete 45-page report that explains the work using a storytelling approach.
This project stems from a practical issue: in some fields, it is important to evaluate the danger of a cyber attack in order to prioritize fixes. The CVSS, a score ranging from 0 to 10, measures this criticality. Using Machine Learning to estimate this score from the stack and other information might make the evaluation quicker. In this notebook, we try this solution using a regression approach.
Most of the work is data exploration, mainly using matplotlib and seaborn to draw correlations and highlight useful information for future use. I used scikit-learn and pandas to manipulate the data and to create preprocessing and modeling functions, in order to find the best preprocessing + model combination, using the mean median error as the most representative performance metric.
This project was carried out for university researchers working on detecting cyber attacks on autonomous trains.
🧮 🔒 Multilingual NER models evaluation
Named Entity Recognition (NER) is a Natural Language Processing task. It consists of automatically extracting entities of given types from a text. For example, a NER model can extract all the people, dates, locations etc. from a document:
These models are usually monolingual. However, the company needed to explore the possibility of using a single model to extract entities from many documents written in 5 languages. This would spare its customers from deploying 5 different models and from detecting the language of each analyzed document.
To do so, I gathered many annotated datasets (from Kaggle and other sources) and several pre-trained models, and designed a benchmark to compute the metrics of the models. To understand why some results were low, I analyzed the results by tag, like this:
This further study allowed me to understand the semantic reasons for these disparities in results. It allowed the team to address the issue with transfer learning, for example by fine-tuning.
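The benchmark code is private, but the idea of a per-tag analysis can be illustrated with a minimal sketch. It assumes token-level scoring (one gold tag and one predicted tag per token) and a hypothetical `per_tag_scores` helper; the tag names and the `"O"` outside label are illustrative choices, not the ones used in the actual project.

```python
from collections import Counter


def per_tag_scores(gold, predicted, outside="O"):
    """Token-level precision, recall and F1 for each entity tag.

    `gold` and `predicted` are equal-length lists of tags, one per token.
    Tokens tagged `outside` belong to no entity and are not scored as a class.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_tag, pred_tag in zip(gold, predicted):
        if gold_tag == pred_tag:
            if gold_tag != outside:
                tp[gold_tag] += 1
        else:
            if pred_tag != outside:
                fp[pred_tag] += 1  # the model predicted a tag that is wrong here
            if gold_tag != outside:
                fn[gold_tag] += 1  # the model missed the expected tag

    scores = {}
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        predicted_count = tp[tag] + fp[tag]
        gold_count = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_count if predicted_count else 0.0
        recall = tp[tag] / gold_count if gold_count else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


if __name__ == "__main__":
    gold = ["PER", "O", "LOC", "LOC", "O", "DATE"]
    predicted = ["PER", "O", "LOC", "O", "LOC", "O"]
    for tag, tag_scores in per_tag_scores(gold, predicted).items():
        print(tag, tag_scores)
```

Breaking the scores down this way makes it visible when a single tag pulls the global metrics down, which is the kind of disparity worth investigating.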
🧮 🔒 OCR benchmark and preprocess optimization
The company needed to quantify the performance of its OCR tool. Optical Character Recognition is a Computer Vision technology that extracts the text from an image.
Therefore, I designed a benchmark to do so. First, I had to think about how to evaluate an OCR:
- which preprocessing functions?
- what kind of data?
- how to quantify the quality of an OCR output?
I chose three criteria and measured their impact on the metrics: the font, the font size, and the quality (dimensions) of the document. I also studied the impact of the grayscale and rotation preprocessing functions. Based on these choices, I searched the internet for several datasets to get a representative sample of documents. I then standardized them to the hOCR format, cutting each picture into boxes to locate the text. This allowed me to match the extracted text to its true value.
To quantify the capabilities of the OCR, I used a Levenshtein distance function, which I optimized, to compute the precision, recall and F1 score.
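As the project code is private, here is only a minimal sketch of how a Levenshtein distance can feed such metrics. The two-row dynamic programming version is a common memory optimization, not necessarily the one used in the project, and the way `ocr_scores` derives precision and recall from the distance (counting `max(len) - distance` characters as matched) is an assumption made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, keeping only two rows of the table."""
    if len(a) < len(b):
        a, b = b, a  # iterate the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def ocr_scores(truth: str, extracted: str):
    """Character-level precision, recall and F1 (illustrative definition)."""
    distance = levenshtein(truth, extracted)
    # Assumption: characters not accounted for by an edit are counted as matched.
    matched = max(len(truth), len(extracted)) - distance
    precision = matched / len(extracted) if extracted else 0.0
    recall = matched / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(ocr_scores("hello world", "hallo wrld"))
```

Since the distance is never smaller than the length difference between the two strings, the matched count cannot exceed the shorter length, so both ratios stay between 0 and 1.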
After that, I improved the rotation preprocessing function to reduce the processing time, based on the determined angle.
At the end of the project, I gave a presentation to explain all the proposals to the other team members, so that we could decide which changes to integrate into the program.
🧮 🔒 Machine Learning for Sentence Bounding Detection capabilities
To improve its NLP pipeline, the company needed to improve the preprocessing of the data. Tokenization is an essential part of it. There are several ways to tokenize a text; one of them is to split it into paragraphs, extract the sentences from each paragraph, and then get each word and punctuation mark of these sentences. The interesting point of this approach is that it preserves the fact that two words in the same sentence are more closely linked than two words in separate sentences. The issue is that sentences do not always end with a period, an exclamation mark or any other punctuation mark. Sentence Boundary Detection (SBD) is a current NLP problem, and models exist for this task. Therefore, I had to evaluate the models currently used in the company's solution, and then try to find SBD models that perform better.
To do so, I had to find SBD datasets containing various forms of data such as tables, headers, lists etc., and standardize their format to evaluate the models as completely as possible. Then, I reviewed the state of the art to list the available models. Next, I created a benchmark in Python to test these models, framing the problem as a binary classification (1 if the index is a boundary, else 0). I was then able to evaluate the precision, recall and F1 score of the 'is a boundary' class (1).
The main issue I faced with SBD is that there are many ways to decide whether a substring of a text is a sentence or not. It really depends on the annotated dataset used to train the model, as shown in the next picture. That is why I had to add some tolerance to be objective.
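To give an idea of what such a tolerance can look like, here is a minimal sketch (the real benchmark is private). It assumes boundaries are represented as character indices and that a predicted boundary counts as correct when it lies within `tolerance` characters of a gold boundary that has not been matched yet. The `boundary_scores` function and its parameters are hypothetical names.

```python
def boundary_scores(gold, predicted, tolerance=0):
    """Precision, recall and F1 of the 'is a boundary' class.

    `gold` and `predicted` are iterables of character indices where a
    sentence ends. Each gold boundary can be matched at most once.
    """
    gold_sorted = sorted(set(gold))
    pred_sorted = sorted(set(predicted))

    matched = 0
    i = j = 0
    while i < len(gold_sorted) and j < len(pred_sorted):
        diff = pred_sorted[j] - gold_sorted[i]
        if abs(diff) <= tolerance:
            matched += 1  # close enough: count as a true positive
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # prediction is too far before the current gold boundary
        else:
            i += 1  # no prediction close enough to this gold boundary

    precision = matched / len(pred_sorted) if pred_sorted else 0.0
    recall = matched / len(gold_sorted) if gold_sorted else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = [10, 25, 40]
    predicted = [10, 26, 33, 41]
    print(boundary_scores(gold, predicted, tolerance=0))
    print(boundary_scores(gold, predicted, tolerance=1))
```

With a tolerance of zero, a boundary shifted by a single character (for example placed before or after a closing quote) is counted as both a false positive and a false negative, which penalizes models for annotation conventions rather than for real errors.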
I tried many models, some purely syntactic (such as PySBD) and some complex models based on Neural Networks (for example Stanza). At the end of the study, I gave a presentation with my analysis of the limits of each model (mostly the impact of punctuation), in order to build one better model.
On the best model, I obtained very good results that hugely improved the solution:
As a project for a Machine Learning course, I explored with a teammate solutions to predict whether people's income was below or above 50K dollars a year (binary classification). The only two rules were:
- to find a solution quickly (we had a maximum of 10 hours per person on this project)
- to use the famous Adults Income dataset
We used a Jupyter Notebook to record all the work done during the study. To process the dataset, we used libraries such as Pandas, Seaborn (for the data exploration), and Scikit-learn to test several models. We optimized the preprocessing to get the best processing chain.
As I usually do, I split the study into 4 parts:
- exploratory data analysis, using graphs, statistics, plots etc.
- data preparation, by creating several preprocessing chains
- modeling, by creating several models and testing every preprocessing-model combination
- evaluation, to measure the performance of our models and make sure there was no underfitting or overfitting
We did not tune this model because of the limited time we had, but we could have optimized some parameters of the best model we obtained, using GridSearchCV for example.
After the study, we presented it orally with a PowerPoint presentation to explain our choices and the results we obtained.
The goal of this project was to develop a multithreaded program to make a robot follow a green target, using only:
- A camera
- A LiDAR
The project had two requirements:
- Stop the robot if any obstacle, in any direction around it, is closer than 15 centimeters
- Make it follow a colored target that would move in front of the robot
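The first requirement can be illustrated with a minimal sketch, independent of ROS. It assumes the LiDAR scan is available as a list of ranges in meters covering 360 degrees (as in a ROS `LaserScan` message), and that invalid readings show up as `inf`, `NaN` or `0.0`. The `obstacle_too_close` function is a hypothetical name, not the one used in the project.

```python
import math

STOP_DISTANCE_M = 0.15  # 15 centimeters, from the project requirement


def obstacle_too_close(ranges, stop_distance=STOP_DISTANCE_M):
    """Return True if any valid LiDAR reading is closer than stop_distance."""
    # Assumption: non-finite or non-positive values are invalid measurements.
    valid = [r for r in ranges if math.isfinite(r) and r > 0.0]
    return any(r < stop_distance for r in valid)


if __name__ == "__main__":
    scan = [0.80, 0.42, float("inf"), 0.14, 1.20]
    if obstacle_too_close(scan):
        print("Obstacle detected: stop the robot")
```

Filtering invalid readings first avoids stopping the robot on a missing measurement reported as zero.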
To meet these requirements, I implemented several features, using Python and ROS to parallelize the data processing across 7 nodes, as shown in the following picture.
To optimize the robot’s motion, I implemented a distance estimator that uses the image of the target and its known real size. Using this estimate, I added a function that makes the robot move backward if the target is too close.
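A distance estimator of this kind can be sketched with the pinhole camera model, where the distance is the real width of the target times the focal length (in pixels) divided by the width of the target in the image. This is an assumption about the approach, and the function names, focal length, target distance and speed values below are all hypothetical.

```python
def estimate_distance(real_width_m, focal_length_px, width_px):
    """Pinhole-model distance (meters) to a target of known real width."""
    if width_px <= 0:
        raise ValueError("the target must be visible (width_px > 0)")
    return real_width_m * focal_length_px / width_px


def linear_command(distance_m, target_m=0.5, dead_band_m=0.05, speed=0.1):
    """Forward speed (m/s): positive to approach, negative to back away."""
    error = distance_m - target_m
    if abs(error) <= dead_band_m:
        return 0.0  # close enough to the desired distance: do not move
    return speed if error > 0 else -speed


if __name__ == "__main__":
    # Hypothetical values: a 10 cm wide target seen 100 px wide, focal 500 px.
    distance = estimate_distance(0.10, 500.0, 100.0)
    print(distance, linear_command(distance))
```

The dead band around the desired distance keeps the robot from oscillating between forward and backward when the estimate fluctuates slightly.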