banner that says Elias Limouni, portfolio of a junior data scientist

I'm Elias, a Senior Data Scientist 👨‍💻 working since June 2020 💻

  • 🔭 I’m currently working on NLP applications at OppScience
  • 📖 Master of Science in Computer Science at Université de Technologie de Compiègne (French Engineering degree), graduated in 2023
  • 🌱 I'm always learning new technologies to stay up to date
  • ⚡ I am passionate about finance, entrepreneurship, traveling and robotics
  • 📫 Feel free to contact me on LinkedIn

Certifications & Badges

To see the details of the certificates and the authenticity verification page, feel free to click on them. The diplomas are available on my LinkedIn profile.

IBM Data Science Professional Certificate

Machine Learning Specialization Certificate

Deep Learning Specialization Certificate

Machine Learning Engineering for Production (MLOps) Specialization Certificate

Skills

To see the name of a skill, hover over its badge.

Data Science

Python, Google Colab, Jupyter Notebook, scikit-learn, Seaborn, pandas, TensorFlow, PyTorch, OpenCV, Keras, R, MATLAB, Elasticsearch, MySQL

Robotics & IoT

Linux, Raspberry Pi, Arduino, C, C++, ROS (Robot Operating System)

DevOps & Misc.

Git, GitLab, GitHub, Bash, GCP, Docker, Nginx, Node.js, Express.js, WordPress, PHP, Java, HTML5, CSS3, LaTeX

Projects

To improve the readability of my projects, here is a legend of the emojis used in the project titles:

To earn the IBM Data Science Professional Certificate, I developed solutions for its Applied Data Science capstone.

SpaceX intends to reduce the cost of space flights by reusing the first stage of its rockets. The goal of this project is to predict whether the first stage of a Falcon 9 rocket will land back on Earth successfully.

Data Collection on two sources for the Applied Data Science Capstone from IBM

The data was collected from the SpaceX REST API and from a Wikipedia article about Falcon 9 launches. I performed web scraping to extract useful data from the second source.
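
As an illustration, here is a minimal sketch of such a collection step, assuming the public SpaceX v4 launches endpoint and pandas' HTML-table parser for the Wikipedia page; the exact endpoints and post-processing used in the capstone may differ.

```python
import requests
import pandas as pd

# REST API side: query the public SpaceX launches endpoint (v4 assumed here)
response = requests.get("https://api.spacexdata.com/v4/launches")
response.raise_for_status()
api_df = pd.json_normalize(response.json())  # flatten the JSON records into a DataFrame

# Web-scraping side: pandas can parse the HTML tables of the Wikipedia article
wiki_url = "https://en.wikipedia.org/wiki/List_of_Falcon_9_and_Falcon_Heavy_launches"
wiki_tables = pd.read_html(wiki_url)  # returns one DataFrame per <table> on the page

print(api_df.shape, len(wiki_tables))
```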

I performed an exploratory data analysis using Folium to build a map and study distances, Dash to create an interactive dashboard, and other Python libraries to create scatter plots, bar charts and many other visualizations.

Dashboard created with Dash from Plotly for the Applied Data Science Capstone from IBM
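
As an illustration, here is a minimal Folium sketch of the kind of map used to study distances; the coordinates below are approximate placeholders, not the capstone's data.

```python
import folium

# Approximate, illustrative launch-site coordinates (not the capstone dataset)
launch_sites = {
    "CCAFS SLC-40": (28.5632, -80.5768),
    "KSC LC-39A": (28.6080, -80.6043),
}

site_map = folium.Map(location=[28.58, -80.59], zoom_start=9)
for name, (lat, lon) in launch_sites.items():
    folium.Marker(location=[lat, lon], popup=name).add_to(site_map)
    folium.Circle(location=[lat, lon], radius=1000, color="blue").add_to(site_map)

site_map.save("launch_sites_map.html")  # open in a browser to explore the distances
```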

I trained 4 classification models and optimized their hyperparameters to find the best solution to the problem.
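
As an illustration, here is a minimal sketch of this kind of model comparison with GridSearchCV; the models, parameter grids and the `X_train` / `y_train` variables are illustrative and assumed to be prepared upstream.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# One small, illustrative parameter grid per candidate model
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "svm": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "tree": (DecisionTreeClassifier(), {"max_depth": [3, 5, 10, None]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}),
}

best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=10, scoring="accuracy")
    search.fit(X_train, y_train)  # X_train / y_train assumed prepared earlier
    best[name] = (search.best_score_, search.best_params_)

# Rank the models by cross-validated accuracy
print(sorted(best.items(), key=lambda kv: kv[1][0], reverse=True))
```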

At the end of the project, I wrote a concise but complete 45-page report presenting the work with a storytelling approach.

This project addresses a practical issue: in some fields it is important to evaluate how dangerous a cyber attack is, in order to prioritize its remediation. The CVSS score, which ranges from 0 to 10, quantifies this criticality. Using machine learning to estimate this score from the technology stack and other information could make the assessment much faster. In this notebook, I explore this solution with a regression approach.

Learning curve of the Random Forest model

Most of the work is data exploration, mainly with Matplotlib and Seaborn, to plot correlations and highlight useful information for the later steps. I used scikit-learn and pandas to manipulate the data and to build preprocessing and modeling functions, searching for the best preprocessing + model combination, with the median error as the most representative performance metric.
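
As an illustration, here is a minimal sketch of such a preprocessing + model comparison; the column names are hypothetical and the metric is assumed here to be the median absolute error, which may differ from the notebook's exact choice.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical feature split: the real dataset's columns differ
numeric_cols = ["n_open_ports", "asset_age"]
categorical_cols = ["os_family", "service"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

models = {"rf": RandomForestRegressor(n_estimators=200), "ridge": Ridge(alpha=1.0)}

# Compare each preprocessing + model combination with a median-error metric
for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_median_absolute_error")
    print(name, -scores.mean())  # X, y assumed loaded earlier in the notebook
```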

This project was done for university researchers, to detect cyber attacks on autonomous trains.

🧮 🔒 Multilingual NER models evaluation

Named Entity Recognition (NER) is a Natural Language Processing task that consists in extracting typed entities from a text. A NER model can, for example, extract all the people, dates, locations, etc. from a document:

Schema of entities extracted, NER

These models are usually monolingual. However, the company needed to explore the possibility of using a single model to extract entities from large volumes of documents in 5 languages. This would spare its customers from deploying 5 different models and from having to detect the language of each analyzed document.

To do so, I gathered many annotated datasets (from Kaggle and other sources) and several pre-trained models, and designed a benchmark to compute the models' metrics. To understand why some results were low, I analyzed the results per tag, like this:

Example of the scores of tags in a NER task

This deeper analysis allowed me to understand the semantic reasons for these disparities in results, and allowed the team to correct the issue through transfer learning, for example fine-tuning.
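
As an illustration of such a per-tag breakdown, seqeval is one common way to compute it; the BIO sequences below are toy examples, not the benchmark's multilingual data.

```python
from seqeval.metrics import classification_report

# Toy gold / predicted BIO sequences; the real benchmark ran on full annotated datasets
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE"]]

# Per-tag precision, recall and F1 score, the kind of table shown above
print(classification_report(y_true, y_pred))
```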

🧮 🔒 OCR benchmark and preprocess optimization

The company needed to quantify the performance of its OCR tool. Optical Character Recognition (OCR) is a Computer Vision technology that extracts the text from an image.

Illustration of how OCR works

Therefore, I designed a benchmark to do so. First, I had to think about how to evaluate an OCR:

  • which preprocessing functions?
  • what kind of data?
  • how to quantify the quality of the OCR's output?

I chose three criteria to measure their impact on the metrics: the font, the font size, and the quality (dimensions) of the document. I also studied the impact of the grayscale and rotation preprocessing functions. Based on these choices, I searched the internet for several datasets to obtain a representative sample of documents. I then standardized them to the hOCR format, splitting each picture into boxes to locate the text, which allowed me to match the extracted text to its ground truth.

To quantify the OCR's capabilities, I used a Levenshtein distance function, which I optimized, to compute the precision, recall and F1 score.
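
As an illustration, here is a simplified sketch of how character-level precision, recall and F1 can be derived from a Levenshtein distance; the actual, optimized function differs.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ocr_scores(extracted: str, ground_truth: str):
    """Rough character-level metrics derived from the edit distance (simplified view)."""
    dist = levenshtein(extracted, ground_truth)
    correct = max(len(ground_truth) - dist, 0)  # characters assumed correctly recovered
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(ocr_scores("He1lo world", "Hello world"))
```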

After that, I improved the rotation preprocessing function, based on the detected angle, to reduce the processing time.
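
As an illustration, here is a minimal sketch of these two preprocessing steps with OpenCV (grayscale conversion, then rotation by a previously detected angle); the optimized version used in the benchmark is not shown here.

```python
import cv2

def preprocess(image_path: str, angle: float):
    """Grayscale conversion followed by a rotation by the detected angle (sketch)."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    h, w = gray.shape
    center = (w / 2, h / 2)
    rotation = cv2.getRotationMatrix2D(center, angle, 1.0)  # angle in degrees
    deskewed = cv2.warpAffine(gray, rotation, (w, h),
                              flags=cv2.INTER_LINEAR,
                              borderMode=cv2.BORDER_REPLICATE)
    return deskewed
```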

At the end of the project, I made a presentation to explain all the propositions to the other members of the team to decide what changes we must integrate into the program.

🧮 🔒 Machine Learning for Sentence Boundary Detection capabilities

To improve an NLP pipeline, the company needed to improve its data preprocessing. Tokenization is one of these steps, and it is essential. There are several ways to tokenize a text; one of them is to split it into paragraphs, extract the sentences from each paragraph, and then get each word and punctuation mark of these sentences. The interesting point of this approach is that it preserves the fact that two words in the same sentence are more strongly linked than two words in separate sentences. The issue is that sentences can end with something other than a period, an exclamation mark or other punctuation. Sentence Boundary Detection (SBD) is a real NLP problem, and models exist for this task. Therefore, I had to evaluate the models currently used in the company's solution, and then try to find SBD models that perform better.

What is SBD

To do so, I had to find SBD datasets containing various data forms such as tables, headers, lists, etc., and standardize their format so the models could be evaluated as completely as possible. I then wrote a state of the art to list the available models, and created a Python benchmark to test them, framing the problem as a binary classification (1 if the character index is a boundary, else 0). I was then able to evaluate the precision, recall and F1 score of the 'is a boundary' class (1).
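
As an illustration, here is a minimal sketch of this binary-classification evaluation, using pySBD as the segmenter and scikit-learn for the metrics; the text, gold offsets and boundary convention below are toy examples, not the benchmark's datasets.

```python
import pysbd
from sklearn.metrics import precision_recall_fscore_support

text = "Dr. Smith arrived at 5 p.m. He was late. The meeting had already started."
# Illustrative gold boundary offsets; real annotated datasets place boundaries differently
gold_ends = {27, 40, 73}

seg = pysbd.Segmenter(language="en", clean=False, char_span=True)
pred_ends = {span.end for span in seg.segment(text)}

# Turn both sets into per-index binary labels: 1 = "is a boundary", 0 = "is not"
y_true = [1 if i in gold_ends else 0 for i in range(len(text) + 1)]
y_pred = [1 if i in pred_ends else 0 for i in range(len(text) + 1)]

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[1], zero_division=0)
print(p[0], r[0], f1[0])
```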

The main issue I faced with SBD is that there are many ways to decide whether a substring of a text is a sentence. It really depends on the annotated dataset used to train the model, as shown in the next picture. That is why I had to add some tolerance to keep the evaluation objective.

Different kinds of boundaries

I tried many models, some purely syntactic (such as PySBD) and more complex ones using neural networks (for example Stanza). At the end of the study, I gave a presentation with my analysis of the limits of each model (mostly the impact of punctuation), to help build a better one.

On the best model, I obtained very good results that hugely improved the solution:

Results obtained with the best SBD model

🧮 🔒 Relation Extraction models using clustering and semantic vector combination systems

This project came after the optimization of the NER models. Given the extracted entities, we wanted to know the relation between pairs of them. For instance, is the relation between a person and a date a "birthdate" relation, or not? The following picture shows how relations can be useful in NLP systems.

Relation Extraction use case example

The problem becomes more complex because we wanted to be able to introduce new relations in few-shot contexts, and to reject samples (no relation), without retraining the whole pipeline. Therefore, we designed a system that combines the vectors (e.g. BERT embeddings) of the two entities, projects them into a new space using a distance metric learning algorithm, and finally clusters all samples into labeled relations, as shown in the following picture.

Algorithm of the relation extraction solution
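
As an illustration, here is a minimal sketch of such a pipeline using scikit-learn's NeighborhoodComponentsAnalysis for the metric learning step and a nearest-centroid rule with a rejection threshold for the clustering step; the combination function, dimensions and threshold are placeholders, not the production system.

```python
import numpy as np
from sklearn.neighbors import NeighborhoodComponentsAnalysis, NearestCentroid

def combine(head_vec: np.ndarray, tail_vec: np.ndarray) -> np.ndarray:
    """One possible combination of the two entity vectors: concatenation + difference."""
    return np.concatenate([head_vec, tail_vec, head_vec - tail_vec])

# head_vecs / tail_vecs: pre-computed encoder embeddings (e.g. BERT) for each sample,
# relation_labels: known relation of each training pair -- all assumed available upstream
X_train = np.stack([combine(h, t) for h, t in zip(head_vecs, tail_vecs)])

# Supervised distance metric learning: project pairs so same-relation samples move closer
nca = NeighborhoodComponentsAnalysis(n_components=32, random_state=0)
X_proj = nca.fit_transform(X_train, relation_labels)

# Centroid step: label new samples by their nearest relation centroid,
# and reject them as "no relation" when the distance exceeds a threshold
centroids = NearestCentroid().fit(X_proj, relation_labels)

def predict(head_vec, tail_vec, reject_threshold=10.0):
    z = nca.transform(combine(head_vec, tail_vec).reshape(1, -1))
    distances = np.linalg.norm(centroids.centroids_ - z, axis=1)
    if distances.min() > reject_threshold:
        return "no_relation"
    return centroids.classes_[distances.argmin()]
```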

After various optimizations, especially on the vector combination step, we achieved very good results, which allowed this system to be industrialized and put into production.

Results of the relation extraction model

This unique system allowed us to solve very complex real-world problems in few-shot contexts, with minimal training time since we used pre-trained encoders.

🧮 🔒 Active Data Generation pipeline for automatic reinforcement learning systems using a multi-agent approach

Data has always been a challenge for the data science industry. With the rise of LLMs, many projects around data generation were born. I pushed this a step further and designed a fully automatic system for NER and RE applications, in which a pre-trained model is sequentially fine-tuned on generated data that follows the style of a small production dataset, until it reaches satisfactory results. The goal is to drive the fine-tuning of the model by its weaknesses, focusing on its worst mistakes.

The algorithm, shown in the following picture, is pretty simple thanks to the multi-agent approach.

Algorithm of the active data generation tool

The first step is to evaluate the pre-trained model on a small production dataset, from a client for instance. Then, based on the results of each entity class and relation class, scenarios are created for the data generation agent (a large language model). Samples of the production dataset are also injected into the prompt, since the generated data should have the same "style" (defined precisely through criteria in the prompt). After that, a second agent (a smaller LLM) evaluates the similarity in style between the generated and the original data (out of 100, using criteria given in the prompt) and gives directions to the generation agent to improve the generated data. Using other metrics (for instance self-BLEU on n-grams), we made sure to avoid duplicates in the generated dataset. Then, the model is fine-tuned on the generated dataset. As a last step, the model is evaluated on the production dataset, and if the score is lower than a threshold (for instance F1 score < 0.8), the pipeline starts again. This effectively improves the model's results in few-shot contexts. The following picture shows the results obtained over 4 consecutive iterations of the pipeline, on a NER model.

Evolution of the f1 score of the model using the active data generation pipeline
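
For illustration, the loop can be sketched as follows; every helper function here is a hypothetical placeholder for the agents and filters described above, not the company's actual code.

```python
# Sketch of the iterative loop described above. All helpers are hypothetical placeholders.
TARGET_F1 = 0.8          # stop once the production-set score passes this threshold
MAX_ITERATIONS = 5

model = load_pretrained_model()             # hypothetical: pre-trained NER/RE model
production_set = load_production_dataset()  # hypothetical: small client dataset

for iteration in range(MAX_ITERATIONS):
    report = evaluate(model, production_set)          # per-class precision/recall/F1
    if report["f1"] >= TARGET_F1:
        break

    # Generation agent (large LLM): scenarios target the worst classes; the prompt
    # includes production samples so the generated data keeps the same "style"
    scenarios = build_scenarios(report, production_set)
    generated = generation_agent(scenarios)

    # Critic agent (smaller LLM): scores style similarity out of 100 and returns
    # directions that are fed back into the generation prompt
    generated = critic_agent(generated, production_set)

    # Diversity filter (e.g. self-BLEU on n-grams) to drop near-duplicate samples
    generated = drop_duplicates(generated)

    model = fine_tune(model, generated)
```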

Even though this might look like data contamination, since the model is trained on augmented production data that is also used for evaluation, additional steps are added to avoid it. This pipeline yielded improvements of up to 20% on the NER models used in production by the company.

As a project for a Machine Learning course, a teammate and I had to explore solutions to predict whether a person's income was below or above 50K dollars a year (binary classification). The only two rules to respect were:

  • to find a solution quickly (we had a maximum of 10 hours per person on this project)
  • to use the famous Adult Income dataset

Metrics obtained with a model

We used a Jupyter Notebook to consolidate all the work done on the study. To run algorithms on the dataset, we used libraries such as pandas and Seaborn (for data exploration), and scikit-learn to test several models. We optimized the preprocessing to get the best processing chain.

As I usually do, I carried out the study in 4 parts:

  • exploratory data analysis, using graphs, statistics, plots, etc.
  • data preparation, by creating several preprocessing chains
  • modeling, by creating several models and testing every preprocessing-model combination
  • evaluation, to measure the performance of our models and make sure there was no under- or over-fitting

Training and validation scores depending on the data learned on
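
As an illustration, here is a sketch of how such training/validation curves can be produced with scikit-learn's learning_curve; the model, scoring and the `X` / `y` variables are illustrative and assumed to be prepared earlier in the notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y: preprocessed Adult Income features/labels, assumed prepared earlier
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

plt.plot(train_sizes, train_scores.mean(axis=1), label="training score")
plt.plot(train_sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("number of training samples")
plt.ylabel("accuracy")
plt.legend()
plt.show()  # a widening gap between the two curves would indicate over-fitting
```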

Because of the limited time, we did not optimize this model, but we could have tuned some hyperparameters of the best model, using GridSearchCV for example.

After completing the study, we presented it orally with a PowerPoint presentation to explain our choices and the results we obtained.

The goal of this project was to develop a multithreaded program to make a robot follow a green target, using only:

  • A camera
  • A LiDAR

The project had requirements to respect:

  • Stop the robot if any obstacle, anywhere around it, comes closer than 15 centimeters
  • Make it follow a colored target moving in front of the robot

To meet these requirements, I implemented several features, using Python and ROS to parallelize the data processing across 7 nodes, as explained in the following picture.

ROS nodes and topics

To optimize the robot's motion, I implemented a distance estimator that uses the apparent size of the target in the image, knowing its real size. I also added a function to make the robot move backward if the target is too close.

Relation between the size of an object and its size seen by the camera
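
As an illustration, here is a minimal sketch of the pinhole-camera relation behind such a distance estimator; the focal length, target size and gain below are placeholders, not the project's calibrated values.

```python
# Pinhole-camera relation: perceived size is inversely proportional to distance,
# so distance ≈ focal_length_px * real_size / size_in_pixels.
FOCAL_LENGTH_PX = 554.0   # placeholder: would come from the camera calibration
TARGET_HEIGHT_M = 0.20    # placeholder: known real height of the green target

def estimate_distance(target_height_px: float) -> float:
    """Distance to the target (in meters) from its apparent height in the image."""
    return FOCAL_LENGTH_PX * TARGET_HEIGHT_M / target_height_px

def velocity_command(distance_m: float, desired_m: float = 0.5, gain: float = 0.8) -> float:
    """Simple proportional command: negative (backward) when the target is too close."""
    return gain * (distance_m - desired_m)

print(estimate_distance(120.0), velocity_command(estimate_distance(120.0)))
```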

Connect with me

