asavinov/machine-learning-and-data-processing


A collection of resources on machine learning, data processing and related areas

Analysis of different types of data

Time series and forecasting

Books and articles:

Implementations:

Stock price forecasting:

  • https://github.com/huseinzol05/Stock-Prediction-Models Gathers machine learning and deep learning models for stock forecasting, including trading bots and simulations
  • https://github.com/borisbanushev/stockpredictionai "In this notebook I will create a complete process for predicting stock price movements. Follow along and we will achieve some pretty good results. For that purpose we will use a Generative Adversarial Network (GAN) with LSTM, a type of Recurrent Neural Network, as generator, and a Convolutional Neural Network, CNN, as a discriminator"

Lists:

Spatial data

Audio and sound

General audio processing libraries:

Text to speech (TTS)

Pitch trackers:

Other

Books and other resources:

Text NLP

Lists:

Video

Images

Graphs, RDFs etc.

Graph stores:

Databases:

Visualizations and dashboards:

  • graphviz

Statistics

Reinforcement learning

AI, data mining, machine learning algorithms

Resources:

Algorithms:

  • XGBoost, CatBoost, LightGBM
  • https://github.com/spotify/annoy Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
    • USP: indexes are static files that are memory-mapped, so the same index can be shared across processes, which keeps multi-process use memory-efficient
    • e.g., music recommendation in Spotify, similar images (for labeling etc.)

Data and knowledge engineering

Feature engineering

Time series:

Feature extraction:

Feature selection:

Hyper-parameter optimization:

Lists:

Feature stores

Resources:

Model management

TBD

AutoML

AutoML libs:

Systems and companies:

  • DataRobot
  • DarwinAI
  • H2O.ai Driverless AI
  • OneClick.ai

Workflow/job management, Data integration, ETL, web scraping, orchestration

For workflows, it is important how the following concepts are implemented:

  • Workflow task: what can be executed and how we specify the tasks to be executed (programmatically, declaratively etc.)
  • Dependencies: when and how tasks are started. It is not only about simple conditions like whether a (previous) task has been finished and with which status. Conditions might be much more complex and be essentially separate tasks.
    • How to define conditional execution where execution of a task depends on dynamic conditions
    • Choosing a task to be executed dynamically, that is, task to be executed is not known at the time of graph definition
  • Triggering workflows: How a whole workflow execution can be started (from inside or outside). These functions make workflow managers similar to asynchronous systems:
    • From external systems, for example, by listening to some protocol
    • Scheduled (time-synchronized) execution, for example, using a scheduler that starts the workflow once per day

A conventional workflow management system deals with task arguments and task return values. The goal here is to make the whole workflow with all its tasks data-aware, for example, by sharing some data. From the point of view of data processing, there is an additional aspect:

  • If it is a graph of data processing, then the question is where and how the system takes the data properties into account:
    • Data state and data ranges like only data for the last month
    • Data structure like columns and tables, for example, only these tables
  • Such data-aware workflows are most important for data-driven business, and the question is what they know about the data and how this knowledge is used to manage and execute the workflow.

General purpose workflow management systems:

  • https://github.com/apache/airflow https://airflow.apache.org A platform to programmatically author, schedule, and monitor workflows. "Airflow is based on DAG representation and doesn’t have a concept of input or output, just of flow."
    • Tasks (Operators). Pluggable Python classes with many conventional execution types provided like PythonOperator or BashOperator. Note that these are task types which require custom task code for parameterization and instantiation. For PythonOperator, the custom code is provided as a Python function: python_callable=my_task_fn. Function arguments are passed in another parameter op_kwargs.
    • Data. Small amounts of data are shared between tasks via XCom (cross-communication messages).
    • Dependencies. It is done programmatically and statically, for example: my_task_1 >> my_task_2 >> [my_task_3, my_task_4]
    • Scheduling. It is possible to specify start_date and schedule_interval (using CRON expression or timedelta object).
  • https://github.com/celery/celery Distributed Task Queue http://celeryproject.org/
  • https://github.com/spotify/luigi Build complex pipelines of (long-running) batch jobs like Hadoop jobs, Spark jobs, dumping data to/from databases, running machine learning algorithms, Python snippets etc. "Luigi is based on pipelines of tasks that share input and output information and is target-based".
    • Dependencies. "Luigi doesn’t use DAGs. Instead, Luigi refers to "tasks" and "targets." Targets are both the results of a task and the input for the next task." "Luigi has 3 steps to construct a pipeline: requires() defines the dependencies between the tasks, output() defines the target of the task, run() defines the computation performed by each task"
    • Scheduling. "Luigi ... has a central scheduler and custom calendar schedule capabilities, providing users with lots of flexibility" (in contrast to Airflow).
  • https://github.com/prefecthq/prefect The easiest way to automate your data
  • https://github.com/azkaban/azkaban Azkaban workflow manager
  • https://github.com/dagster-io/dagster A data orchestrator for machine learning, analytics, and ETL
  • Oozie
  • https://github.com/ploomber/ploomber The fastest way to build data pipelines. Develop iteratively, deploy anywhere
  • https://github.com/d6t/d6tflow Python library for building highly effective data science workflows (on top of luigi)
  • https://github.com/grailbio/reflow A language and runtime for distributed, incremental data processing in the cloud
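
To make the idea of static task dependencies more concrete, here is a minimal sketch in plain Python. It does not use Airflow or any other system from the list. It only mirrors the dependency chain my_task_1 >> my_task_2 >> [my_task_3, my_task_4] with the standard-library graphlib module (Python 3.9+). The function run_workflow and the task bodies are hypothetical and exist only for illustration; passing upstream results to a task is a loose analogy to sharing small values between tasks.

```python
from graphlib import TopologicalSorter


def run_workflow(dependencies, tasks):
    """Run every task after all of its predecessors have finished.

    dependencies maps a task name to the set of tasks it depends on.
    tasks maps a task name to a callable taking a dict of upstream results.
    Both structures are assumptions of this sketch, not a real system's API.
    """
    order = []
    results = {}
    for name in TopologicalSorter(dependencies).static_order():
        # Collect the results of direct predecessors only
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
        order.append(name)
    return order, results


# Mirrors: my_task_1 >> my_task_2 >> [my_task_3, my_task_4]
dependencies = {
    "my_task_2": {"my_task_1"},
    "my_task_3": {"my_task_2"},
    "my_task_4": {"my_task_2"},
}
tasks = {
    "my_task_1": lambda upstream: 1,
    "my_task_2": lambda upstream: upstream["my_task_1"] + 1,
    "my_task_3": lambda upstream: upstream["my_task_2"] * 10,
    "my_task_4": lambda upstream: upstream["my_task_2"] * 100,
}

order, results = run_workflow(dependencies, tasks)
```

The graph here is fixed before execution starts, which is exactly the limitation discussed above: a task chosen dynamically at run time cannot be expressed in such a static dictionary without extra machinery.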

ML workflow, pipelines, training, deployment etc.

Data science support and tooling

ETL and data integration:

Stream processing:

Web scraping

Data labeling

Visualization and VA

Dashboards:

Publishing notebooks (from github etc.):

Asynchronous data processing

What is asynchronous data processing

TBD

Reactive programming

Approaches to asynchronous programming:

  • Callback model:

    • A callback function is provided as part of an asynchronous call
    • The call is non-blocking and the source program continues execution
    • It is not possible to await the result at the call site (consuming the result is the job of the callback function)
    • The callback function can be viewed as a one-time listener for a return event, that is, it represents code which consumes the result
    • The source code where the call is made and the consumer of the result are in different functions and cannot share (local) context
    • The callback function may make its own asynchronous calls which leads to a "callback hell"
  • Future/promise:

    • An asynchronous call is made as usual but returns a special wrapper object
    • A callback function is not specified and is not used
    • The returned result is consumed by the code which follows the call (as opposed to its use in a separate callback function)
    • The future/promise is supposed to be awaited. Awaiting denotes a point where we say that the next instruction needs the result
    • The awaiting point is like a one-time single-value listener for the result where the program execution is suspended until the return event is received
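
The difference between the two approaches can be sketched with the Python standard library. This is only an illustration under simple assumptions: the function names (square_with_callback, square_async, demo_callback, demo_future) are hypothetical, a thread stands in for the asynchronous operation in the callback variant, and asyncio.sleep(0) stands in for non-blocking I/O in the future variant.

```python
import asyncio
import threading


def square_with_callback(x, callback):
    """Callback model: return immediately, deliver the result to `callback`."""
    worker = threading.Thread(target=lambda: callback(x * x))
    worker.start()
    return worker  # returned only so that the demo can wait for completion


async def square_async(x):
    """Future model: the caller awaits the result where it is needed."""
    await asyncio.sleep(0)  # stand-in for a non-blocking operation
    return x * x


def demo_callback():
    received = []
    # The consumer of the result is a separate function (here: list.append)
    worker = square_with_callback(3, received.append)
    worker.join()
    return received


async def demo_future():
    offset = 1  # local context stays available after the await
    task = asyncio.create_task(square_async(3))  # returns a future-like Task
    result = await task  # awaiting point: the next line needs the result
    return result + offset


callback_result = demo_callback()
future_result = asyncio.run(demo_future())
```

In demo_future the local variable offset is used right after the awaiting point, whereas in the callback variant the consumer has no access to the caller's local variables unless they are passed in explicitly.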

Resources:

Reactive streaming

  • Listeners:

    • A callback function is registered and then automatically called for each incoming event
    • Callback functions are (normally) called only sequentially, that is, next event can be processed only when the previous event has been processed by the previous callback invocation. Callbacks are not executed concurrently.
    • The result of a callback invocation is frequently needed because it is executed by the event producer
  • Reactive streams:

    • It is a graph of producers and consumers
    • Consumers and producers are not supposed to communicate freely by sending messages to each other, where a sender knows the address(es) of the receivers
    • An element declares which messages it produces but it is unaware of who will subscribe to and consume its messages, and how they will be used (in contrast to the actor model)
    • An element must know what kind of messages it needs and explicitly subscribe to specific producers - messages will not come automatically just because somebody wants to send them to us
    • For data processing, reactive streams provide a number of operators which can be applied to one or more input streams and produce an output stream
    • Resources:
  • Actor model:

    • Each element has an identifier (address, reference) which is used by other actors to send messages
    • A sender (producer) must know what it wants to do when it sends messages and it must know its destination
    • Receiving elements can receive messages from any other element and their task is to respond according to their logic (as expected by the sender)
    • Actors receive messages without subscribing to any producer (as opposed to reactive streams where you will not receive anything until you subscribe to some event producer)
    • Each actor has a handler (callback) which is invoked for processing incoming messages
    • Actors are supposed to have state, and frequently this is the reason why we want to define different actors
    • Resources:
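
The contrast between reactive streams and the actor model can be sketched in a few lines of plain Python. The classes Producer and Actor below are hypothetical and deliberately synchronous; they are not taken from any reactive or actor library and only illustrate who needs to know whom.

```python
from collections import deque


class Producer:
    """Reactive-stream style: emits values without knowing its consumers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, on_next):
        # The consumer takes the initiative and registers itself
        self._subscribers.append(on_next)

    def emit(self, value):
        for on_next in self._subscribers:
            on_next(value)

    def map(self, fn):
        """Operator: derive an output stream from this input stream."""
        output = Producer()
        self.subscribe(lambda value: output.emit(fn(value)))
        return output


class Actor:
    """Actor style: addressed by reference, owns a mailbox and a state."""

    def __init__(self):
        self.total = 0  # state kept between messages
        self.mailbox = deque()

    def send(self, message):
        # The sender must hold a reference (address) to this actor
        self.mailbox.append(message)

    def process(self):
        # Handler invoked for each incoming message, one at a time
        while self.mailbox:
            self.total += self.mailbox.popleft()


# Reactive stream: the consumer subscribes to a derived stream
source = Producer()
doubled = source.map(lambda value: value * 2)
seen = []
doubled.subscribe(seen.append)
for value in (1, 2, 3):
    source.emit(value)

# Actor: the sender addresses the receiver directly
counter = Actor()
for value in (1, 2, 3):
    counter.send(value)
counter.process()
```

Note that source never references seen or doubled by name, while the code sending to counter must know the actor it is talking to.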

Event loops vs. threads

  • Both a thread task and an event loop task are executed until finished, that is, the code to execute is provided as a procedure
  • Thread tasks are dispatched by the system (not application) while dispatching logic of event tasks is part of the application
  • At each moment, there is a fixed number of threads concurrently executed by one process. The number of concurrently executed event loop tasks is not limited.
  • Thread tasks are (automatically) switched at the instruction level and the dispatcher is unaware of the needs of this thread or the application. Event loop tasks are switched at the level of logical application units depending on what this application needs.
  • In a multi-threaded application, we need to manage the threads ourselves, e.g., by creating and deleting them. In an event loop application, the task life cycle (starting, suspending, finishing) is managed by the event loop manager.
  • In an event loop application, tasks specify dependencies on other tasks, and these points are used while dispatching the execution of tasks. Threads cannot declare dependencies on the results provided by other tasks. If we need some external result, then the thread has to wait. This logic has to be implemented manually and the system dispatcher is unaware of these dependencies.
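
Some of these points can be observed with asyncio from the standard library. The sketch below is only an illustration: the coroutine names (worker, dependent, main) and the task count of 1000 are arbitrary choices. It starts many event loop tasks on a single thread and lets each dependent task declare its dependency on another task by awaiting it.

```python
import asyncio
import threading


async def worker(i, thread_ids):
    # Record which OS thread executes this event loop task
    thread_ids.add(threading.get_ident())
    await asyncio.sleep(0.01)  # suspension point known to the event loop
    return i


async def dependent(upstream):
    # Declared dependency: resumed by the loop once `upstream` has a result
    value = await upstream
    return value * 2


async def main(n=1000):
    thread_ids = set()
    tasks = [asyncio.create_task(worker(i, thread_ids)) for i in range(n)]
    doubled = await asyncio.gather(*(dependent(task) for task in tasks))
    return len(thread_ids), sum(doubled)


threads_used, total = asyncio.run(main())
```

All tasks run on one thread, and the event loop switches between them only at the await points, which is where the dependencies are declared.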

Event loops:

Resources:

Async networking libraries

  • https://github.com/gevent/gevent Coroutine-based Python networking library. "systems like gevent use lightweight threads to offer performance comparable to asynchronous systems, but they do not actually make things asynchronous"

    • greenlet to provide a high-level synchronous API
      • on top of the libev or libuv event loop (like libevent)
  • https://github.com/eventlet/eventlet Concurrent networking library for Python

    • epoll or kqueue or libevent for highly scalable non-blocking I/O
  • https://github.com/aio-libs/aiohttp Asynchronous HTTP client/server framework for asyncio and Python

  • https://github.com/twisted/twisted Event-driven networking engine written in Python.

    • Twisted projects variously support TCP, UDP, SSL/TLS, IP multicast, Unix domain sockets, many protocols (including HTTP, XMPP, NNTP, IMAP, SSH, IRC, FTP, and others), and much more.
    • Twisted supports all major system event loops:
      • select (all platforms),
      • poll (most POSIX platforms),
      • epoll (Linux),
      • kqueue (FreeBSD, macOS),
      • IOCP (Windows),
      • various GUI event loops (GTK+2/3, Qt, wxWidgets)

Async web frameworks

Utilities

Retry libraries:

Libraries, utilities, tools

Python

Resources:

Tools

  • Data structures:

    • https://github.com/mahmoud/boltons boltons should be builtins
    • https://github.com/pytoolz/toolz List processing tools and functional utilities (replaces itertools and functools)
    • Zict: Composable Mutable Mappings
    • HeapDict: a heap with decrease-key and increase-key operations
    • sortedcontainers: Python Sorted Container Types: SortedList, SortedDict, and SortedSet
  • Networking:

    • certifi: A carefully curated collection of Root Certificates for validating the trustworthiness of SSL certificates while verifying the identity of TLS hosts
    • urllib3: HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more

Other:

  • click: creating beautiful command line interfaces
  • chardet

Formats, persistence and serialization

Authentication, Authorization, Security

Identity and Access Management

Secrets management, encryption as a service, and privileged access management. A secret is anything that you want to tightly control access to, such as API keys, passwords, certificates, and more.

Policy Enforcement Point, Identity And Access Proxy (IAP), Zero-Trust Network Architecture, i.e. a reverse proxy in front of your upstream API or web server that rejects unauthorized requests and forwards authorized ones to your server.

Resources:

Linux and OS

Resources:

Platform and servers

Load balancing and proxy:

Dockerized automated https reverse proxy:

Discussions:

Service registry and orchestrator:

  • etcd
  • consul

Logging, tracing, monitoring

Computing

Resources:

GPU/ML hosting and clouds

Other resources

Data sources

Books

Python: