Streaming data ingestion from the Twitter API into Apache Kafka, and counting the number of words in each tweet with Spark Structured Streaming.
As of June 2023, Twitter has changed the access levels for free accounts. You now need a Basic subscription to access most of the endpoints, including search tweets, which is the endpoint used in this project.
Consider using another free API instead of the Twitter API to test this project.
Install the necessary packages
pip install -r requirements.txt
First, go to `kafka_scripts` and run the `01`, `02`, and `03` scripts to get Kafka started.
# Produce tweets to your kafka topic
producer getting_started.ini
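The `getting_started.ini` argument suggests a Confluent-style config file holding the Kafka connection settings. A minimal sketch of parsing such a file with the standard library, assuming section and key names in the style of the Confluent Python quickstart (the actual names in this project's file may differ):

```python
import configparser

# Example config text; in the project this would live in getting_started.ini.
# The section and key names here are assumptions, not the project's actual file.
config_text = """
[default]
bootstrap.servers = localhost:9092

[consumer]
group.id = tweet_word_count
auto.offset.reset = earliest
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

# The producer only needs the connection settings; the consumer layers
# its group/offset settings on top of them.
producer_conf = dict(parser["default"])
consumer_conf = {**producer_conf, **dict(parser["consumer"])}

print(producer_conf["bootstrap.servers"])  # localhost:9092
print(consumer_conf["group.id"])           # tweet_word_count
```

Both scripts can then share one file while keeping consumer-only settings in their own section.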
# Consume the tweets using Spark Structured Streaming and count the number of words in each tweet
consumer getting_started.ini
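The Spark consumer needs a running broker, so as a quick sanity check the per-tweet computation can be reproduced in plain Python. A minimal sketch of the counting logic only (note that Python's `str.split()` collapses runs of whitespace, so it may differ from a Spark `split` on a single space character):

```python
def count_words(tweet: str) -> int:
    """Count whitespace-separated words in a tweet, analogous to what
    the streaming job computes for each consumed record."""
    return len(tweet.split())

tweets = [
    "Streaming tweets into Kafka",
    "  leading and trailing spaces  ",
    "",
]
print([count_words(t) for t in tweets])  # [4, 4, 0]
```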
The tests are automated with GitHub Actions on pushes to the `main` branch; to run them locally:
pytest
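For reference, a hypothetical unit test in the shape pytest discovers (`test_word_count.py` and the `count_words` helper are assumptions for illustration, not the project's actual test files):

```python
# test_word_count.py -- hypothetical example of a test pytest would
# collect; the count_words helper is an assumption, not project code.

def count_words(tweet: str) -> int:
    # Whitespace-separated word count, as computed per tweet downstream.
    return len(tweet.split())

def test_counts_words_in_a_tweet():
    assert count_words("hello kafka world") == 3

def test_empty_tweet_has_zero_words():
    assert count_words("") == 0
```

pytest collects any `test_*.py` file and runs every `test_*` function inside it, so keeping the counting logic in a small pure function makes it easy to test without Kafka or Spark running.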