🧬
Working on Deep Learning research topics
eliaccess/README.md

banner that says Elias Limouni, portfolio of a junior data scientist

I'm Elias, a Junior Data Scientist 👨‍💻 Working since June 2020💻

  • 🔭 I’m currently working on NLP applications at OppScience
  • 📖 Master of Science in Computer Science at Université de Technologie de Compiègne (French Engineering degree), graduated in 2023
  • 🌱 I'm always learning new technologies to keep myself up to date
  • 📫 Feel free to contact me on LinkedIn
  • ⚡ I love making robots, using Machine Learning to improve them

Certifications & Badges

To see the details of the certificates and the authenticity verification page, feel free to click on them. The diplomas are available on my LinkedIn profile.

IBM Data Science Professional Certificate

Machine Learning Specialization Certificate

Deep Learning Specialization Certificate

Machine Learning Engineering for Production (MLOps) Specialization Certificate

Skills

To get the name of the skill, place your cursor on it.

Data Science

Python, Google Colab, Jupyter Notebook, Scikit-learn, Seaborn, Pandas, TensorFlow, PyTorch, OpenCV, Keras, R, MATLAB, Elasticsearch, MySQL

Robotics & IoT

Linux, Raspberry Pi, Arduino, C, C++, ROS (Robot Operating System)

DevOps & Miscs

Git, GitLab, GitHub, Bash, GCP, Docker, Nginx, Node.js, Express.js, WordPress, PHP, Java, HTML5, CSS3, LaTeX

Projects

To improve the readability of my projects, here is a legend of the emojis used in the project titles:

  • 🔒 : private project, for a company for example, so I cannot show the code
  • 🧮 : data science project
  • 🤖 : robotics / IoT project
  • 📚 : educational project
  • 👨 : personal project

List of the projects:

To earn the IBM Data Science Professional Certificate, I developed solutions for its Applied Data Science Capstone.

SpaceX intends to reduce the cost of spaceflight by reusing the first stage of its rockets. The goal of this project is to predict whether the first stage of a Falcon 9 rocket will land back on Earth successfully.

Data Collection on two sources for the Applied Data Science Capstone from IBM

The data used was collected using the SpaceX REST API, and a Wikipedia article about the Falcon 9 rockets. I performed Web Scraping in order to extract useful data from the second data source.

I performed an Exploratory Data Analysis using Folium to make a map to study distances, Dash to create an interactive dashboard, and of course other Python libraries to create scatterplots, bar charts and lots of other visualizations.

Dashboard created with Dash from Plotly for the Applied Data Science Capstone from IBM

I used 4 classification models and optimized their parameters to find the best solution to solve the problem.

At the end of the project, I wrote a concise but complete 45-page report that explains the work using a storytelling method.

This project addresses a practical issue: in some fields, it is important to evaluate how dangerous a cyber attack is in order to prioritize its correction. The CVSS score, which ranges from 0 to 10, quantifies this criticality. Using Machine Learning to estimate the score from the stack and other information could make the evaluation quicker. In this notebook, we try this solution using a regression approach.

Learning curve of the Random Forest model

Most of the work is data exploration, mainly using matplotlib and seaborn to draw correlations and highlight information that is useful later on. I used scikit-learn and pandas to manipulate the data and to create preprocessing and modeling functions, in order to find the best preprocessing + model combination, using the mean median error as the most representative performance metric.

This project was carried out for university researchers, to detect cyber attacks on autonomous trains.

🧮 🔒 Multilingual NER models evaluation

Named Entity Recognition (NER) is a Natural Language Processing task. It consists in automatically extracting entities of given types from a text. For example, a NER model can extract all the people, dates, locations, etc. from a document:

Schema of entities extracted, NER

These models are usually monolingual. However, the company needed to explore the possibility of using a single model to extract entities from many documents in 5 languages. This would spare the company's customers from deploying 5 different models and from detecting the language of each analyzed document.

To do so, I gathered many annotated datasets (from Kaggle and other sources), selected several pre-trained models, and designed a benchmark to compute the metrics of the models. To understand why some results were low, I analyzed the results by tag, like this:

Example of the scores of tags in a NER task
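A per-tag breakdown of this kind can be sketched in a few lines of plain Python. This is a minimal token-level illustration, not the benchmark used in the project: the function name `per_tag_scores`, the tag names, and the sample sequences are all hypothetical.

```python
from collections import Counter


def per_tag_scores(gold, predicted, outside_tag="O"):
    """Return {tag: (precision, recall, f1)} for token-level NER labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for gold_tag, pred_tag in zip(gold, predicted):
        if gold_tag == pred_tag:
            if gold_tag != outside_tag:
                true_pos[gold_tag] += 1
            continue
        # A mismatch is a false positive for the predicted tag
        # and a false negative for the expected tag.
        if pred_tag != outside_tag:
            false_pos[pred_tag] += 1
        if gold_tag != outside_tag:
            false_neg[gold_tag] += 1

    scores = {}
    for tag in sorted(set(true_pos) | set(false_pos) | set(false_neg)):
        tp, fp, fn = true_pos[tag], false_pos[tag], false_neg[tag]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Hypothetical example: one label per token.
gold = ["PER", "O", "LOC", "LOC", "O", "DATE"]
pred = ["PER", "O", "LOC", "O", "LOC", "DATE"]
scores = per_tag_scores(gold, pred)
```

In this toy example, only the `LOC` tag loses points, which is the kind of disparity a per-tag view makes visible.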

This further study allowed me to understand the semantic reasons for these disparities in the results. It allowed the team to correct the issue by using transfer learning, for example through fine-tuning.

🧮 🔒 OCR benchmark and preprocess optimization

The company needed to quantify the performance of its OCR tool. Optical Character Recognition (OCR) is a Computer Vision technology that extracts the text from an image.

OCR picture working

Therefore, I designed a benchmark to do so. First, I had to think about how to evaluate an OCR:

  • which preprocessing functions?
  • what kind of data?
  • how to quantify the quality of the OCR output?

I chose three criteria and studied their impact on the metrics: the font, the font size, and the quality (dimensions) of the document. I also studied the impact of the grayscale and rotation preprocessing functions. Based on these choices, I searched the internet for several datasets to obtain a representative set of documents. I then standardized them to follow the hOCR format, cutting each picture into boxes to locate the text. This allowed me to match the extracted text to its true value.

To quantify the capabilities of the OCR, I used a Levenshtein distance function, which I optimized, to calculate the precision, recall and F1 score.
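One way to turn an edit distance into precision, recall and F1 is sketched below. It is an illustration under stated assumptions, not the project's implementation: a box is assumed to count as correct when its normalized similarity to the ground truth reaches a hypothetical `threshold`, and the names `similarity` and `ocr_scores`, the box identifiers, and the sample texts are invented for the example.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ocr_scores(expected, extracted, threshold=0.8):
    """expected / extracted map a box id to its text (assumed layout)."""
    correct = sum(
        1 for box_id, text in extracted.items()
        if box_id in expected
        and similarity(expected[box_id], text) >= threshold
    )
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical boxes: one OCR typo ("Tota1"), one missed box, one spurious box.
expected = {"b1": "Invoice", "b2": "Total: 42", "b3": "Thank you"}
extracted = {"b1": "Invoice", "b2": "Tota1: 42", "b4": "~~"}
precision, recall, f1 = ocr_scores(expected, extracted)
```

With the threshold at 0.8, the single-character typo in `b2` is still accepted, while the missed and spurious boxes lower recall and precision respectively.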

After that, I improved the rotation preprocessing function to reduce the processing time, based on the determined angle.

At the end of the project, I gave a presentation to the other members of the team to explain all the proposals, so that we could decide which changes to integrate into the program.

🧮 🔒 Machine Learning for Sentence Boundary Detection capabilities

To improve an NLP pipeline, the company needed to improve the preprocessing of the data. Tokenization is one of the essential preprocessing steps. There are several ways to tokenize a text; one of them is to cut it into paragraphs, extract the sentences from each paragraph, and then get each word and punctuation mark of these sentences. The interesting point of this analysis is that it preserves the fact that two words in the same sentence are more closely linked than two words in separate ones. The issue is that sentences do not always end with a period, an exclamation point or another punctuation mark. Sentence Boundary Detection (SBD) is a current NLP problem, and models exist to perform this task. Therefore, I had to evaluate the models currently used in the company's solution, and then try to find SBD models that do better.

What is SBD
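The paragraph, sentence, token hierarchy described above can be sketched with a naive rule-based splitter. This is only an illustration of why the problem is hard, not one of the evaluated models: the function names and the sample text are hypothetical, and the sentence rule is deliberately simplistic.

```python
import re


def split_paragraphs(text):
    """Paragraphs are separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences_naive(paragraph):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def split_tokens(sentence):
    """Words and punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokenize(text):
    """Return tokens nested as paragraphs > sentences > tokens."""
    return [[split_tokens(s) for s in split_sentences_naive(p)]
            for p in split_paragraphs(text)]


text = "Dr. Smith arrived. He was late!\n\nSecond paragraph here"
nested = tokenize(text)
```

On this sample, the naive rule wrongly cuts after the abbreviation "Dr." and produces three sentences instead of two in the first paragraph, which is the kind of error SBD models aim to avoid.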

To do so, I had to find SBD datasets containing various kinds of content such as tables, headers, lists, etc., and standardize their format so that models could be evaluated as completely as possible. I then surveyed the state of the art to list the available models, and created a benchmark in Python to test them, framing the problem as a binary classification (1 if the index is a bound, else 0). This let me evaluate the precision, recall, and F1 score of the 'is a bound' class (1).

The main issue I faced in the SBD problem is that there are many ways to decide whether a substring of a text is a sentence or not. It really depends on the annotated dataset used to train the model, as shown in the next picture. That is why I had to add some tolerance to be objective.

Different kind of bounds
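A tolerance of this kind can be sketched as follows. This is a minimal illustration under assumptions, not the benchmark itself: bounds are represented as character indexes, a predicted bound is accepted when an unmatched expected bound lies within `tolerance` characters, matching is a simple greedy pass, and the function name and sample indexes are hypothetical.

```python
def boundary_scores(gold, predicted, tolerance=0):
    """Precision, recall and F1 of the 'is a bound' class.

    gold / predicted: character indexes marked as sentence bounds.
    Each gold bound can be matched by at most one predicted bound.
    """
    unmatched = sorted(set(gold))
    predicted = sorted(set(predicted))
    n_gold = len(unmatched)

    true_pos = 0
    for index in predicted:
        candidates = [g for g in unmatched if abs(g - index) <= tolerance]
        if candidates:
            # Greedy choice: consume the closest gold bound still available.
            closest = min(candidates, key=lambda g: abs(g - index))
            unmatched.remove(closest)
            true_pos += 1

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical indexes: the second prediction is off by one character.
gold = [18, 31, 54]
pred = [18, 32, 40]
strict = boundary_scores(gold, pred, tolerance=0)
lenient = boundary_scores(gold, pred, tolerance=1)
```

With `tolerance=0` only the exact match at index 18 counts; with `tolerance=1` the bound predicted at 32 is also accepted for the expected bound at 31.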

I tried many models, some purely syntactic (such as PySBD), others more complex and based on Neural Networks (for example Stanza). At the end of the study, I presented my analysis of the limits of each model (mostly the impact of punctuation), with a view to building one better model.

On the best model, I obtained very good results that hugely improved the solution:

Different kind of bounds

As a project for a Machine Learning course, I had to explore, with a teammate, solutions to predict whether a person's income was below or above 50K dollars a year (binary classification). The only two rules to respect were:

  • to find a solution quickly (we had a maximum of 10 hours per person on this project)
  • to use the famous Adult Income dataset

Metrics obtained with a model

We used a Jupyter Notebook to gather all the work done during the study. To process the dataset, we used libraries such as Pandas and Seaborn (for the data exploration), and Scikit-learn to test several models. We optimized the preprocessing to get the best processing chain.

As I usually do, I carried out the study in 4 parts:

  • exploratory data analysis, using graphs, statistics, plots etc.
  • data preparation, by creating several preprocessing chains
  • modeling, by creating several models and testing every preprocessing-model combination
  • evaluation, to measure the performance of our models and make sure there was no underfitting or overfitting

Training and validation scores depending on the data learned on

We did no optimization on this model, because of the time we had, but we could have optimized some parameters of the best model we got, using GridSearchCV for example.

After completing the study, we presented it orally with a PowerPoint presentation to explain our choices and the results we obtained.

The goal of this project was to develop a multithreaded program to make a robot follow a green target, using only:

  • A camera
  • A LiDAR

The project had requirements to respect:

  • Stop the robot if any obstacle is closer than 15 centimeters from the robot, all around it
  • Make it follow a colored target that would move in front of the robot

To meet these requirements, I implemented several features, using Python and the ROS library to parallelize the data processing in 7 nodes, as shown in the following picture.

ROS nodes and topics
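The obstacle requirement (stop if anything is closer than 15 centimeters, all around the robot) can be sketched as a pure function, independent of ROS. This is an assumption-laden illustration rather than the code of the actual node: the scan is assumed to be a list of distances in meters covering 360 degrees, invalid readings are assumed to be non-finite or non-positive, and the name `must_stop` is hypothetical.

```python
import math

STOP_DISTANCE_M = 0.15  # 15 centimeters, from the project requirements


def must_stop(ranges_m, stop_distance_m=STOP_DISTANCE_M):
    """Return True if any valid LiDAR reading is closer than the threshold.

    ranges_m: distances in meters for a full scan around the robot.
    Non-finite or non-positive readings are treated as "no return"
    and ignored (an assumption about the sensor's output).
    """
    return any(
        math.isfinite(r) and 0.0 < r < stop_distance_m
        for r in ranges_m
    )


# Hypothetical scans.
clear_scan = [1.0, 0.5, float("inf"), 0.15]
blocked_scan = [1.0, 0.14, 2.0]
```

In a ROS setup, a function like this would typically be called from the callback of the node subscribed to the LiDAR topic, which then publishes a zero velocity when it returns `True`.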

To optimize the robot's motion, I implemented a distance estimator that uses the image of the target and its known real size. This let me add a function that makes the robot go backward if the target is too close.

Relation between the size of an object and its size seen by the camera
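The relation between real size and apparent size can be sketched with the pinhole camera model, where distance = focal length (in pixels) × real width / width in pixels. The snippet below is an illustration under assumptions, not the robot's actual code: the focal length, target width, desired following distance, dead band and speed are hypothetical values, as are the function names.

```python
def estimate_distance_m(real_width_m, width_px, focal_length_px):
    """Pinhole model: the apparent width shrinks linearly with distance."""
    if width_px <= 0:
        raise ValueError("the target must be visible (width_px > 0)")
    return focal_length_px * real_width_m / width_px


def linear_command(distance_m, target_m=0.5, dead_band_m=0.1, speed=0.2):
    """Return a signed linear speed: negative means going backward."""
    if distance_m < target_m - dead_band_m:
        return -speed  # target too close: back up
    if distance_m > target_m + dead_band_m:
        return speed   # target too far: move forward
    return 0.0         # within the dead band: hold position


# Hypothetical camera (600 px focal length) and 10 cm wide target.
near = estimate_distance_m(real_width_m=0.10, width_px=200, focal_length_px=600)
far = estimate_distance_m(real_width_m=0.10, width_px=60, focal_length_px=600)
```

The dead band is a design choice of this sketch: it avoids oscillating between forward and backward commands when the target sits near the desired distance.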

Connect with me


Popular repositories

  1. Subscriber-Count (Public)

    A simple JS script to show your subscriber count on a webpage.

    HTML 5 3

  2. NF92-Rapport-Latex (Public)

    Report to be written in LaTeX for the NF92 course at UTC

    TeX 1

  3. eliaccess (Public)

    My personal portfolio!

    1

  4. Following-Cart (Public)

    This project is a following cart that allows you to move heavy objects without effort, just by charging it and walking.

    C++

  5. NF92-Site-Auto-Ecole (Public)

    Driving school website to be built for the NF92 course at UTC

    PHP

  6. Site-Youascapegame (Public)

    Escape game by the "Youarille" clan for the integration of new students at UTC.

    HTML