Challenge #22 -Applying AI capabilities to address Operations challenges in ECMWF Products Team #6

EsperanzaCuartero · 2020-01-17T11:25:30Z

Challenge #22 - Applying AI capabilities to address Operations challenges in ECMWF Products Team

Stream 2 - Machine-Learning and Artificial Intelligence

Goal

To apply AI capabilities to analyse log data in real-time to be able to predict issues before they occur.

Mentors and skills

Mentors: @dueben @Matthew-Manoussakis
Skills required
- Data science experience
- Strong experience in building AI/ML algorithms
- Experience in Linux
- Experience in Python3
- Experience in applying AIOps ideas on Operational environments would be desirable

Challenge description

Due to the explosion of data in recent years - known as the data avalanche - many companies can no longer cope with the rapid growth in data volumes and the variety of logs produced by their IT environments. On the other hand, ensuring the services' availability and performance is more critical than ever for most businesses.
Leading companies are turning to artificial intelligence (AI) for IT operations (AIOps*) to analyze data in real time and predict issues before they occur.
This enables them to continuously track and assess the status of their services to improve monitoring and troubleshooting.

Our services in brief

The ECMWF Meteorological Archival and Retrieval System (MARS) enables users to retrieve meteorological data in GRIB/NetCDF via:

- the MARS client on ECMWF computers such as ecgate
- the Web API service (supported Python client software)

In the Products team, we manage the services above and provide tailored data to Member State users, commercial users and public users.
Every day these services produce massive amounts of multi-structured log files, spread across several disparate systems, which contain underused or hidden valuable information.

Project description

Naturally, the scale and complexity of our services and infrastructure make monitoring and troubleshooting an increasing challenge.
The suggested project is exploratory research that investigates how AI/ML techniques can be applied to improve Operations in the Products team.
This would enable our team to proactively understand the behaviour of our services, to take preventative actions manually or ideally through automation, to reduce MTTR and to improve user experience.
If successful, the developed tools could be extended to improve the operational fidelity of other ECMWF services.

Possible datasets available:

- Machine logs produced by Web-API and MARS (stored in Splunk)

Expected Outcomes

- Documentation
- Analysis
- Working Python software

Additional information

*AIOps stands for “artificial intelligence for IT operations”, a term originally coined by Gartner in 2017.

adiah80 · 2020-04-10T10:54:17Z

Hello, I am Aditya. I'm currently a pre-final year Computer Science undergrad at BITS Pilani. I'm specializing in Machine Learning.

I have worked on similar problems and datasets before wherein I used a combination of dimensionality reduction and clustering algorithms to analyze log files. I also have some experience with anomaly detection.

Do you have any representative dataset (machine logs) that I can do some preliminary analysis on? Also, could you explain a bit more about the kind of analysis expected?

Thanks!

dueben · 2020-04-12T19:25:20Z

Many thanks for your interest and sorry for the delayed response.

I am afraid that I cannot provide you with a sample of the dataset at this stage. However, Matthew may be able to provide some input here.

In general, we have a lot of data for a lot of different diagnostics. With this challenge, we want to start exploring this space with machine learning methods. In particular, we are interested in identifying spikes in requests, for example a large number of data retrievals, before they happen. This could be done with methods for 1-D timeseries analysis. However, it could also be done using a more sophisticated analysis of the various properties of the machine logs that are available, for example via a sensitivity analysis in a multi-dimensional space or a timeseries analysis that takes many parameters into account. We would like to start simple and increase complexity during the project. However, the applicant can have significant influence on the future directions of the project.
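
As a rough illustration of the simplest 1-D case, the sketch below flags points in a series of request counts that sit far above a trailing-window average. It is only a minimal example under stated assumptions: the function name `flag_spikes`, the window length and the threshold are all hypothetical, the counts are made up, and it detects a spike as it appears rather than predicting it in advance.

```python
from statistics import mean, stdev


def flag_spikes(counts, window=24, threshold=3.0):
    """Return indices whose value lies more than `threshold` sample
    standard deviations above the mean of the preceding `window` values.

    `counts` is assumed to be a list of request counts per time bucket
    (for example per hour); this layout is an assumption for the sketch.
    """
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against.
            continue
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Synthetic example: a steady load around 100 with one burst of 400.
counts = [100, 102, 98, 101, 99, 103, 97, 100] * 3 + [400, 100, 101]
print(flag_spikes(counts, window=8))  # -> [24]
```

Only values above the baseline are flagged, since the interest here is in bursts of requests rather than quiet periods.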

The detection of outliers is the main motivation for the project. However, there is much more that could be done, for example a detailed sensitivity analysis of the various diagnostics that are available within the machine logs.

jwagemann · 2020-04-18T15:32:48Z

Only 4 days left to apply to be part of ECMWF Summer of Weather Code 2020.
Application deadline: Wednesday, 22 April 2020 at 23:59 (BST).
Submit your proposal here.

Adithya-MN · 2020-04-18T20:12:35Z

Hi, I am Adithya Niranjan, a classmate of Aditya Ahuja's. I found this project really interesting and plan to work with him on this task. I've previously worked on projects applying deep meta-learning to time-series forecasting and online classification models to EEG-based time-series data, among other projects.

I just have a few questions to understand the problem better:

1. Would these logs be univariate or multivariate? Asking because this would help us decide which algorithms would be suitable - in general, multivariate time-series have temporal correlations which make the task more complex.
2. Could you give an idea of how the models made will be integrated and deployed with the current system? This would help us plan the timeline in our project proposal, i.e. to decide how much time to spend on building the models and how much to spend tying them together with the current system.
3. Could you give us an estimate of how frequent the spikes/anomalies are? One possible approach I was considering was to use a forecasting model to predict load and then compare the loss with the actual occurrence. This could possibly be used along with another anomaly detection model as well. But I have a feeling this would depend on how frequent the anomalies are.
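
To make the forecast-and-compare idea in the last question concrete, here is a minimal sketch. It assumes a load series with a known repeating period and uses a seasonal-naive forecast (the value one period earlier) as a stand-in for whatever forecasting model would actually be chosen; the function names, the period and the factor `k` are hypothetical, and the data are made up.

```python
from statistics import median


def seasonal_naive_forecast(series, period):
    """Forecast each point as the value observed one full period earlier."""
    return [series[t - period] for t in range(period, len(series))]


def flag_forecast_anomalies(series, period, k=5.0):
    """Flag indices where the actual load exceeds the forecast by more than
    `k` times the median absolute deviation (MAD) of the residuals."""
    forecast = seasonal_naive_forecast(series, period)
    actual = series[period:]
    residuals = [a - f for a, f in zip(actual, forecast)]
    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    # Fall back to a unit scale when the residuals are almost all identical.
    scale = mad if mad > 0 else 1.0
    # Only positive residuals are flagged: the interest is in load spikes,
    # and this also ignores the "echo" one period after a spike.
    return [period + i for i, r in enumerate(residuals) if r - med > k * scale]


# Synthetic load with period 4 and one unexpected burst at index 9.
series = [10, 20, 30, 20] * 4
series[9] = 120
print(flag_forecast_anomalies(series, period=4))  # -> [9]
```

How well this works would indeed depend on how frequent the anomalies are: a robust scale such as the MAD only stays meaningful while most residuals come from normal behaviour.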

Thanks!

dueben · 2020-04-19T19:02:39Z

Hi Adithya, It is great that you are interested in this challenge.

Here are my responses. Matthew can disagree if he thinks of this differently:

1. I guess it would make sense to start with the assumption that the logs are univariate. However, as we increase complexity of the solutions, we should also consider multivariate analysis later during the project.
2. Well, this depends a bit on how far we get during the project. I guess the most likely scenario is the implementation as an automated alarm system. The primary focus should be on the testing of different methods to check what is possible. The deployment into the system could also be realised after the project.
3. Matthew will know this much better but I would guess that spikes would happen on a weekly basis.

I hope this helps. Happy writing!

Matthew-Manoussakis · 2020-04-20T04:58:04Z

Hi Adithya,

1. The idea is to start with univariate, but we may consider multivariate later.
2. The objective of this project is the implementation of a proof-of-concept automated alarm system. Deployment in production may come later and is not part of this project.
3. Spikes may happen on a weekly basis, but in some cases may also happen on a daily basis.

I hope this helps.

Matthew
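
For illustration only, a proof-of-concept alarm of the kind described above could be as small as the following sketch. It is not the project's design: the class name `SpikeAlarm`, the exponentially weighted baseline, the ratio threshold and the cooldown are all assumptions chosen to keep the example short.

```python
class SpikeAlarm:
    """Streaming alarm over a single metric (e.g. requests per minute).

    Keeps an exponentially weighted baseline of normal values and fires
    when a new value exceeds `ratio` times that baseline. After firing it
    stays silent for `cooldown` further observations to avoid repeated
    alerts for the same incident.
    """

    def __init__(self, alpha=0.2, ratio=3.0, cooldown=3):
        self.alpha = alpha
        self.ratio = ratio
        self.cooldown = cooldown
        self.baseline = None
        self.quiet_for = 0

    def update(self, value):
        """Feed one observation; return True if an alarm should be raised."""
        if self.baseline is None:
            # The first observation only initialises the baseline.
            self.baseline = float(value)
            return False
        is_spike = value > self.ratio * self.baseline
        fire = is_spike and self.quiet_for == 0
        if fire:
            self.quiet_for = self.cooldown
        elif self.quiet_for > 0:
            self.quiet_for -= 1
        if not is_spike:
            # Learn only from normal values so a spike does not inflate
            # the baseline and mask the next one.
            self.baseline += self.alpha * (value - self.baseline)
        return fire


alarm = SpikeAlarm(alpha=0.5, ratio=3.0, cooldown=2)
for value in [100, 100, 100, 500, 500, 500, 500, 100]:
    if alarm.update(value):
        print("ALARM at value", value)
```

In this run the alarm fires on the first 500, stays quiet for the next two observations, and fires again on the fourth 500 because the load is still elevated once the cooldown has expired.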


Adithya-MN · 2020-04-21T08:44:20Z

@dueben and @Matthew-Manoussakis, thank you for the info.

We'll cover both univariate and multivariate approaches in our proposal then. From your answers to Q2, I gather that the primary focus is on getting a good working ML/DL model on the data, so we will focus on that. For now, we have shortlisted several promising approaches, open implementations and papers on the problem, which we'll summarise in our proposal.

jwagemann self-assigned this Jan 17, 2020

jwagemann added the stream-2 Challenges under stream 2 label Jan 17, 2020

jwagemann assigned dueben Jan 18, 2020

EsperanzaCuartero assigned Matthew-Manoussakis Jan 27, 2020

jwagemann closed this as completed May 3, 2021
