Title: Real-Time Bitcoin Price Processing using Apache Avro
Difficulty: 2 (medium)
Doc Link: https://docs.google.com/document/d/1fEd7_oLhFnA5ovzj_HMb9EeMU84nOGEGeWqNRZSz2wo/edit?usp=sharing
Description
Apache Avro is a data serialization framework developed within the Apache Hadoop project. It uses JSON to define data types and protocols and serializes data in a compact binary format, which makes it well suited to both batch and streaming data processing. Avro combines rich data structures with a compact, fast serialization format, making it a versatile data exchange framework.
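To make "uses JSON for defining data types" concrete, here is a minimal Avro record schema for a price reading. The field names and the `example.crypto` namespace are illustrative assumptions, not part of any official schema; only the standard-library `json` module is used to show that the schema is ordinary JSON:

```python
import json

# A minimal Avro record schema, written as plain JSON.
# Field names are assumptions chosen for this project sketch.
PRICE_SCHEMA_JSON = """
{
  "type": "record",
  "name": "BitcoinPrice",
  "namespace": "example.crypto",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "price",     "type": "double"},
    {"name": "currency",  "type": "string"}
  ]
}
"""

schema = json.loads(PRICE_SCHEMA_JSON)
print(schema["name"], [f["name"] for f in schema["fields"]])
```

An Avro library would parse this same JSON into a schema object and use it to encode records into the compact binary form described above.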
Describe technology
Compact Serialization: Avro serializes data in a compact binary format, minimizing storage space and making it efficient for data transmission.
Schema Evolution: Avro supports schema evolution, allowing data formats to change without requiring applications to be recompiled.
Interoperability: By using JSON to represent schemas, Avro allows data interchange between programs written in different languages.
Integration: Avro integrates with big data tools, such as Apache Kafka, Spark, and Hadoop, to process and store large datasets.
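The schema-evolution point above can be sketched with two schema versions. In Avro, a reader using a newer schema can still decode data written with an older one as long as each added field carries a default; the check below is a simplified stand-in for Avro's full schema-resolution rules, and the `volume` field is an assumption for illustration:

```python
# Version 1 of the record schema.
V1 = {"type": "record", "name": "BitcoinPrice", "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "price", "type": "double"},
]}

# Version 2 adds a field; the "default" lets readers using V2
# decode records that were written with V1.
V2 = {"type": "record", "name": "BitcoinPrice", "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "price", "type": "double"},
    {"name": "volume", "type": ["null", "double"], "default": None},
]}

def added_fields_have_defaults(old: dict, new: dict) -> bool:
    # Simplified compatibility check: every field that is new in
    # `new` must declare a default value.
    old_names = {f["name"] for f in old["fields"]}
    return all("default" in f for f in new["fields"] if f["name"] not in old_names)

print(added_fields_have_defaults(V1, V2))  # True
```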
Describe the project
Objective: To build a system using Apache Avro for ingesting and processing real-time Bitcoin price data to analyze trends using time series analysis.
Steps:
Data Ingestion: Retrieve Bitcoin price data from a public API (e.g., CoinGecko) every minute.
Schema Definition: Define an Avro schema that represents the Bitcoin price data structure, including fields such as timestamp, price, currency, and volume.
Serialization: Serialize incoming data using the defined Avro schema for efficient storage and transport.
Integration with Kafka: Set up an Apache Kafka topic to manage streaming data, using Kafka's Avro support for message serialization.
Processing: Write a Python script using Apache Spark to consume messages from Kafka, deserialize the data using Avro, and perform time series analysis to identify Bitcoin trends.
Storing Results: Store the results of the analysis in a structured format (e.g., CSV or Parquet) for further use.
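The ingestion step above can be sketched as follows. The URL and response shape follow CoinGecko's public `/simple/price` endpoint as an assumption and should be verified against the current API docs; the live request is kept in a separate function so the record-shaping logic can be exercised on a sample payload without a network call:

```python
import json
import time
import urllib.request

# Assumed CoinGecko endpoint; check the API docs before relying on it.
API_URL = ("https://api.coingecko.com/api/v3/simple/price"
           "?ids=bitcoin&vs_currencies=usd&include_24hr_vol=true")

def fetch_raw() -> bytes:
    # Live call, issued roughly once per minute in the real pipeline.
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        return resp.read()

def to_record(payload: bytes) -> dict:
    # Shape the API response into the fields of the Avro schema
    # (timestamp, price, currency, volume).
    data = json.loads(payload)["bitcoin"]
    return {
        "timestamp": int(time.time()),
        "price": float(data["usd"]),
        "currency": "usd",
        "volume": float(data.get("usd_24h_vol", 0.0)),
    }

# Sample payload matching the assumed response shape:
sample = b'{"bitcoin": {"usd": 65000.0, "usd_24h_vol": 12345678.9}}'
record = to_record(sample)
print(record["price"], record["currency"])
```

In the full pipeline, each record would then be Avro-serialized and produced to the Kafka topic from the integration step.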
Useful resources
Avro Specification
Apache Avro’s GitHub Repository
Kafka with Avro Integration
Avro with Spark
Is it free?
Yes, Apache Avro is open-source and free to use.
Python libraries / bindings
avro: The official Apache Avro Python library for schema parsing, serialization, and deserialization. Install via pip install avro (the older avro-python3 distribution is deprecated in favor of this package).
confluent-kafka-python: Provides a Kafka client for Python, supporting integration with Confluent's Schema Registry for Avro serialization. Install via pip install confluent-kafka.
pyspark: Used for processing the serialized data in Spark. Install via pip install pyspark.