Challenge #22 - Applying AI capabilities to address Operations challenges in ECMWF Products Team #6

Closed
EsperanzaCuartero opened this issue Jan 17, 2020 · 7 comments
@EsperanzaCuartero

EsperanzaCuartero commented Jan 17, 2020

Challenge #22 - Applying AI capabilities to address Operations challenges in ECMWF Products Team

Stream 2 - Machine-Learning and Artificial Intelligence

Goal

To apply AI capabilities to analyse log data in real time and predict issues before they occur.

Mentors and skills

  • Mentors: @dueben @Matthew-Manoussakis
  • Skills required
    • Data science experience
    • Strong experience in building AI/ML algorithms
    • Experience in Linux
    • Experience in Python3
    • Experience in applying AIOps ideas in operational environments would be desirable

Challenge description

Due to the explosion of data in recent years, known as the data avalanche, many companies can no longer cope with the rapid growth in data volumes and the variety of logs produced by their IT environments. At the same time, ensuring the availability and performance of services is more critical than ever for most businesses.
Leading companies are turning to artificial intelligence (AI) for IT operations (AIOps*) to analyze data in real time and predict issues before they occur.
This enables them to continuously track and assess the status of their services and so improve monitoring and troubleshooting.

Our services in brief

The ECMWF Meteorological Archival and Retrieval System (MARS) enables users to retrieve meteorological data in GRIB/NetCDF via:

  • the MARS client on ECMWF computers such as ecgate
  • the Web API service (supported Python client software)

In the Products team, we manage the services above and provide tailored data to Member State users, commercial users and public users.
These services produce massive amounts of multi-structured log files every day, spread across several disparate systems, and the logs contain valuable information that is underused or hidden.

Project description

Naturally, the scale and complexity of our services and infrastructure make monitoring and troubleshooting an increasing challenge.
The suggested project is exploratory research that investigates how AI/ML techniques can be used to improve Operations in the Products team.
This would enable our team to proactively understand the behaviour of our services, to take preventative action manually or, ideally, through automation, to reduce MTTR and to improve the user experience.
If successful, the developed tools could be extended to improve the operational fidelity of other ECMWF services.

Possible datasets available:

Machine logs produced by Web-API and MARS (stored in Splunk)

Expected Outcomes

  • Documentation
  • Analysis
  • Working Python software

Additional information

@jwagemann jwagemann self-assigned this Jan 17, 2020
@jwagemann jwagemann added the stream-2 Challenges under stream 2 label Jan 17, 2020
@adiah80
Member

adiah80 commented Apr 10, 2020

Hello, I am Aditya. I'm currently a pre-final year Computer Science undergrad at BITS Pilani. I'm specializing in Machine Learning.

I have worked on similar problems and datasets before wherein I used a combination of dimensionality reduction and clustering algorithms to analyze log files. I also have some experience with anomaly detection.

Do you have any representative dataset (machine logs) that I can do some preliminary analysis on? Also, could you explain a bit more about the kind of analysis expected?

Thanks!

@dueben

dueben commented Apr 12, 2020

Many thanks for your interest and sorry for the delayed response.

I am afraid that I cannot provide you with a sample of the dataset at this stage. However, Matthew may be able to provide some input here.

In general, we have a lot of data for many different diagnostics. With this challenge, we want to start exploring this space with machine learning methods. In particular, we are interested in identifying spikes in requests before they happen, for example a large number of data retrievals. This could be done with methods for 1-D time-series analysis, or with a more sophisticated analysis of the various properties available in the machine logs, for example a sensitivity analysis in a multi-dimensional space or a time-series analysis that takes many parameters into account. We would like to start simple and increase complexity during the project. However, the applicant can have significant influence on the future direction of the project.

The detection of outliers is the main motivation for the project. However, much more could be done, for example a detailed sensitivity analysis of the various diagnostics available within the machine logs.
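
As a rough illustration of the "start simple" option (1-D time-series analysis of request counts), here is a minimal Python sketch. It flags a point as a spike when it lies more than `threshold` standard deviations above the mean of a trailing window. The function name, window size, threshold and the synthetic counts are all assumptions made for illustration; nothing here is taken from the actual MARS or Web-API logs. Note also that this baseline detects spikes as they occur rather than predicting them in advance.

```python
from statistics import mean, stdev


def rolling_zscore_spikes(counts, window=24, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations above the mean of the preceding `window` values."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any increase as a spike.
            is_spike = counts[i] > mu
        else:
            is_spike = (counts[i] - mu) / sigma > threshold
        if is_spike:
            spikes.append(i)
    return spikes


# Synthetic, made-up request counts per time bucket (not real log data).
counts = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101, 99, 100]
print(rolling_zscore_spikes(counts, window=6, threshold=3.0))  # prints [8]
```

A trailing window keeps the baseline local, so a slow drift in load does not trigger alarms, but a spike that stays inside the window inflates the standard deviation for the points that follow it.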

@jwagemann
Contributor

Only 4 days left to apply to be part of ECMWF Summer of Weather Code 2020.
Application deadline: Wednesday, 22 April 2020 at 23:59 (BST).
Submit your proposal here.

@Adithya-MN

Adithya-MN commented Apr 18, 2020

Hi, I am Adithya Niranjan, a classmate of Aditya Ahuja's. I found this project really interesting and plan to work with him on this task. I've previously worked on projects applying deep meta-learning to time-series forecasting and online classification models to EEG-based time-series data, among others.

Just had a few questions to understand the problem better -

  1. Would these logs be univariate or multivariate? Asking because this would help us decide which algorithms would be suitable - in general, multivariate time-series have temporal correlations which make the task more complex.
  2. Could you give an idea of how the models will be integrated and deployed with the current system? This would help us plan the timeline in our project proposal, i.e. decide how much time to spend on building the models and how much on tying them together with the current system.
  3. Could you give us an estimate of how frequent the spikes/anomalies are? One possible approach I was considering was to use a forecasting model to predict load and then compare the loss with the actual occurrence. This could possibly be used along with another anomaly detection model as well. But I have a feeling this would depend on how frequent the anomalies are.
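
One possible reading of the approach in question 3, sketched in Python under stated assumptions: forecast the next value, compare it with what actually arrives, and flag the point when the error is unusually large. Simple exponential smoothing stands in for the forecasting model here purely for brevity, and the error scale is the median absolute error seen so far. The function name, the parameters `alpha`, `k` and `warmup`, and the synthetic series are hypothetical and not taken from the project.

```python
from statistics import median


def forecast_residual_anomalies(series, alpha=0.3, k=5.0, warmup=5):
    """Flag points whose one-step-ahead forecast error is unusually large.

    The forecast is simple exponential smoothing. A point is flagged when
    its absolute error exceeds k times the median absolute error of the
    points accepted so far.
    """
    level = series[0]  # current forecast for the next point
    abs_errors = []
    anomalies = []
    for t in range(1, len(series)):
        error = series[t] - level
        if len(abs_errors) >= warmup:
            scale = median(abs_errors)
            if scale > 0 and abs(error) > k * scale:
                anomalies.append(t)
                # Skip the update so the anomaly does not distort
                # the forecast or the error scale.
                continue
        abs_errors.append(abs(error))
        level = alpha * series[t] + (1 - alpha) * level
    return anomalies


# Synthetic, made-up load values (not real log data).
load = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101, 99, 100]
print(forecast_residual_anomalies(load))  # prints [8]
```

How well a threshold like `k` works does depend on how frequent the anomalies are: if spikes are common, they start to dominate the error scale unless they are excluded from it, as done above.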

Thanks!

@dueben

dueben commented Apr 19, 2020

Hi Adithya, It is great that you are interested in this challenge.

Here are my responses. Matthew can disagree if he thinks of this differently:

  1. I guess it would make sense to start with the assumption that the logs are univariate. However, as we increase the complexity of the solutions, we should also consider multivariate analysis later in the project.
  2. Well, this depends a bit on how far we get during the project. I guess the most likely scenario is an implementation as an automated alarm system. The primary focus should be on testing different methods to check what is possible. The deployment into the system could also be realised after the project.
  3. Matthew will know this much better but I would guess that spikes would happen on a weekly basis.

I hope this helps. Happy writing!

@Matthew-Manoussakis

Matthew-Manoussakis commented Apr 20, 2020 via email

@Adithya-MN

@dueben and @Matthew-Manoussakis, thank you for the info.

We'll cover both univariate and multivariate approaches in our proposal then. From your answers to Q2, I gather that the primary focus is on getting a good working ML/DL model on the data, so we will focus on that. For now, we have shortlisted several promising approaches, open implementations and papers on the problem, which we'll summarise in our proposal.
