
Twitter Updater Specification.

Introduction.

Tweets, just like living creatures, have a lifetime. Even when they don't get deleted they practically die: the older they get, the fewer people interact with them, until no one ever sees them, retweets them, or interacts with them in any way. This also means that a young tweet has a better chance of getting interactions: it will show up in feeds, giving it the opportunity to be retweeted, favorited, clicked, etc.

Some software applications can benefit from getting this interaction information in a fast and reliable way. Your software might be intended to pull accurate stats from a Twitter account, a specific set of tweets, hashtags, mentions, links that got shared, a keyword, etc.

This is the first version of this specification. It is meant to be used for keeping tweets updated in your system; the analysis algorithms and heuristics you apply to them depend entirely on the problem you are solving.

Rationale.

As mentioned before, we can benefit from knowing about the interactions with tweets and the retweets they produce. It tells us how particular tweets performed, and there are many ways to monetize this information.

This is a simple and powerful approach that focuses on getting updated, on-demand information about the tweets you are interested in. It is meant to be easily scaled up, simply by adding more instances of a process to the mix.

Simplicity is prerequisite for reliability. - Edsger Dijkstra

Problem.

Keep every tweet in a system updated according to its age, regardless of how many tweets there are to process.

Constraints.

  • API quota: we must request information from Twitter, and we can run out of API quota per credential.
  • We can request only 100 tweets at a time.
  • Each request has to go out with the maximum number of tweet ids (100), or as close to that as we can get, to make the best use of our credentials.
  • Avoid tweet starvation when buffering tweets to reach the 100 per request: if there are not enough tweets to flush a buffer after S seconds, just flush the current amount.
  • Finite amount of credentials.
  • Tweets can come in at any time, individually or in huge batches.
  • We need an easy-to-scale solution: a "just add more workers" approach must be enough.
  • Fail safe: if any worker suffers a sudden and horrible death, it must not break the flow or lose tweets.
  • Update times for each tweet must follow its lifespan cycle (age): the younger the tweet, the shorter the interval.

Background.

Research references

On a Tweet's lifespan

In the article When Is My Tweet's Prime of Life? (A brief statistical interlude) by Peter Bray for Moz, we can see many statistics on a tweet's lifespan. He determines that the average lifespan of a single tweet is 18 minutes.

Another article, The Short Lifespan of a Tweet: Retweets Only Happen Within the First Hour by Frederic Lardinois for ReadWrite, says that retweets (which some consider the currency of Twitter) happen mostly during the first hour after publication.

On Computing Pipelines

Pipelines are workflows that can be found in many industries. The production lines used in factories are pipelines: there is a series of steps in the process, each step is always working, and when one step finishes its work it provides input for the next step.

(1) In computing, a pipeline is a series of data processing units in which one unit's output serves as input for the next.

[1.] http://en.wikipedia.org/wiki/Pipeline_%28computing%29

On Queueing theory

(1) Queueing theory refers to the study of waiting lines. In computing we use queues for jobs: processes that act as workers take jobs (or receive them) from the queues, do their work, and notify the queue that the job is done; then they take another job or the queue sends them a new one.

[1.] http://en.wikipedia.org/wiki/Queueing_theory

Dead Approaches

Master Process Approach

This was the first idea proposed to solve this problem.

This approach has a few problems.

  • The Master Process does way too many things.
  • Scaling this approach requires increasing both Master Process instances and Updater Process instances.
  • It requires locking and the locking is deficient: just one lock is implemented when at least two are needed. One should prevent other Master Process instances from accessing the same tweets about to be queued; the currently implemented lock is held while a tweet is being processed, to prevent Master Process instances from finding it again.
  • The lock-release logic is weak: if the Updater suffers a sudden and horrible death it might not release its tweets, causing deadlock and starving the locked tweets. This forces a bigger query to detect such starved tweets and enqueue them regardless of their locked status (unreliable).
  • There are way too many queries, one every Master Process cycle, and they pull lots of data.
  • The map reduce responsible for getting all tweets to update, splitting them into groups of at most 100 elements, and sending them to the queue is big and hard to maintain.

The Master and Pipeline Approach

Another idea to fix the problem in a simpler way is to keep the Master Process pulling tweets, but instead of having it find the same tweets and re-schedule them for update, it would only find those that were never processed. Those tweets are sent to the Updater Pipeline, where they complete their processing. This relies entirely on the pipeline to finish the jobs, and the fetcher does far less work.

  • This approach removes some responsibility from Master Process.

But:

  • Still needs two locks.
  • The map reduce is still complex and big.
  • Master Process decides at which point of the pipeline a tweet should go.

There must be a better, simpler way.

Keep just the pipeline

The Master Process is an unnecessary entity; we can send that guy, with all his queries and locking complexity, straight to oblivion.

We know that we are getting the tweets from some source; we don't really care which one. What we want is to make that source start the pipeline by buffering the tweets and sending them to the job queue with a delay that matches the first update interval.

The job queue will handle the entire flow and each node of the pipeline will re-use the queue to push the data to the next step.

Hypothesis.

If: We have a pipeline for incoming tweets in which each node updates the tweets at the correct interval (In), passing the tweets to the next node through a delayed job queue.

Then: We will have updated tweets at any point of their lifespan while the job queue guarantees that no tweet gets lost in transit.

Assumptions.

  • A job queue capable of scheduling jobs (delayed jobs) exists.
  • On consistency and reliability: the job queue will not release jobs if they are not marked as finished, and failed jobs will be retried.
  • The job queue may work only in memory, but if it does, it must keep a dump backup to recover the jobs if it crashes.
  • On batches: a job includes just one tweet to request; the updater builds its buffer and flushes it when necessary.
  • On intervals: they are provided by configuration in the shape of delays that can be applied in the job queue. They are not part of the problem to solve but given as input; research (see On a Tweet's lifespan in Background) and the needs of the consumer platform produce these heuristics.
  • The same queue is re-used at the end of each node; the delay to use is selected by configuration given a tweet's age.
  • Pipeline nodes are instances of an Updater Process (workers).
  • Tweets and retweets are treated the same by the updater.
  • Tweets are provided to the system by a "starter" process; we consider them our input.

Solution

Blackbox

The pipeline flow.

Looking at it as a flow, we would see many processes interacting with each other through a job queue.

Looking closer at the queue and processes.

Removing one layer of abstraction, we can see exactly how the Twitter Updater process interacts with a Delayed Queue to move one step ahead in the pipeline, making it clear that the number of actors is fairly small.

Theory of Operation

  1. This process starts right after a tweet or retweet gets inserted into our system.
  2. The Starter Process (SP) gets a tweet from an API call, a stream connection, or any other input source.
  3. SP does whatever it needs to do with the received tweet; saving it to the database might be one of the steps.
  4. SP starts the pipeline by enqueueing the tweet with a delay of I1; this delay is given by configuration.
  5. The Updater Process (UP) gets the job to update and pushes it into its buffer; if the buffer was empty at this point, it starts a timer.
  6. When the buffer is full or the timer reaches its deadline, UP sends a request to 'statuses/lookup' with the buffered tweet ids and flushes the buffer.
  7. Twitter responds with the new information for the tweets.
  8. UP parses and saves each tweet individually.
  9. For each updated tweet, if it should be re-scheduled, UP grabs the delay for the tweet and re-schedules it. If the tweet's update cycle is over, this step is skipped.

Technical Specification

Job Queue with Scheduling/Delay capabilities.

It is assumed to exist; to be able to use it in the spec, we give it an API in the domain of the spec.

JobQueue.schedule(namespace, data, delay)

Adds a job to the queue; the job is released after the given delay in milliseconds.

Input:

  • namespace [String]: Namespace or category of the job; this string will be used to request jobs.
  • data [String]: JSON string with the data that will be used as input by the worker processing this job.
  • delay [Number]: Time to wait in milliseconds before releasing this job to workers.

JobQueue.get(namespace)

Gets a job from a namespace.

Input:

  • namespace [String]: Namespace or category of the jobs to pull.
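
A minimal TypeScript sketch of this assumed interface follows. The shapes are illustrative only; the backing implementation (Redis, a database, an in-memory store) is left open.

  // Minimal sketch of the assumed job queue interface. The backing
  // implementation is out of scope for this spec.
  interface Job {
    namespace: string;
    data: string; // JSON string payload
  }

  interface JobQueue {
    // Add a job that becomes available to workers after `delay` milliseconds.
    schedule(namespace: string, data: string, delay: number): Promise<void>;
    // Fetch the next available job in a namespace, or null if none is ready.
    get(namespace: string): Promise<Job | null>;
  }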

Configuration ADT.

Configuration is assumed to be stored in key-value pairs. We suggest creating a wrapper (an Abstract Data Type) with a set of functions to read such configuration and allow logic in it while hiding the actual data structure that holds the configuration.

Config.getTweetUpdateDelay(birthdate).

Given a tweet's creation time, this function determines how long to wait before the next update for this tweet.

Data:

  conf->tweetAge[18m]
  conf->tweetAge[60m]
  conf->tweetAge[1d]
  conf->tweetAge[7d]
  conf->tweetAge[30d]

Each key in tweetAge holds the interval of update when a tweet is younger than the time specified by the key.
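
For illustration, such a structure could look like the sketch below. The delay values are placeholders chosen for this example, not part of the specification.

  // Illustrative tweetAge configuration. Keys are age buckets, values are
  // update delays in milliseconds. The concrete numbers are placeholders.
  const conf = {
    tweetAge: {
      '18m': 5 * 60 * 1000,       // younger than 18 minutes: update every 5 minutes
      '60m': 15 * 60 * 1000,      // younger than 1 hour: every 15 minutes
      '1d': 60 * 60 * 1000,       // younger than 1 day: every hour
      '7d': 6 * 60 * 60 * 1000,   // younger than 7 days: every 6 hours
      '30d': 24 * 60 * 60 * 1000, // younger than 30 days: once a day
    },
  };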

Input:

  • birthdate [Number]: The tweet's creation time as a timestamp in milliseconds.

Output:

  • delay [Number]: The time in milliseconds to wait before updating this tweet. If -1 is returned it means you should not update this tweet.

Algorithm:

  1. Let conf be a configuration structure with access to a tweetAge structure with key-value pairs.
  2. Let birthdate be the tweet's creation timestamp in milliseconds.
  3. If birthdate is not provided, return -1.
  4. Let now be the current timestamp.
  5. Let delta be the difference now - birthdate.
  6. If delta is between 0 and 1080000 (18 minutes), return the value held by key '18m'.
  7. If delta is between 1080000 and 3600000, return the value held by key '60m'.
  8. If delta is between 3600000 and 86400000, return the value held by key '1d'.
  9. If delta is between 86400000 and 604800000, return the value held by key '7d'.
  10. If delta is between 604800000 and 2592000000, return the value held by key '30d'.
  11. If delta does not fit in any range, return -1.
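
A direct TypeScript translation of this algorithm might look as follows, reusing the illustrative conf structure from above.

  // Sketch of Config.getTweetUpdateDelay following the algorithm above.
  const MINUTE = 60 * 1000;
  const DAY = 24 * 60 * MINUTE;

  function getTweetUpdateDelay(birthdate?: number): number {
    if (birthdate === undefined) return -1;
    const delta = Date.now() - birthdate;
    if (delta < 0) return -1;
    if (delta <= 18 * MINUTE) return conf.tweetAge['18m'];
    if (delta <= 60 * MINUTE) return conf.tweetAge['60m'];
    if (delta <= DAY) return conf.tweetAge['1d'];
    if (delta <= 7 * DAY) return conf.tweetAge['7d'];
    if (delta <= 30 * DAY) return conf.tweetAge['30d'];
    return -1; // the tweet's update cycle is over
  }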

Starter Process

A grey box. It gets input from Twitter; it can come from a stream, from requests, or any other input. You decide what side effects you want here, like saving the tweet to the DB, parsing it, getting stats from it, etc. There is one part of this process we need to control: at the end, we add a call that starts the pipeline by queueing the tweet.

starterProcessInstance.queueTweet(tweet)

Given a tweet, enqueues it for its next update with the correct delay.

Input:

  • tweet [Object]: Instance of tweet.

Algorithm:

  1. Let t be an instance of a tweet.
  2. t must have an attribute id that holds the value of the id_str attribute that Twitter sends with each tweet.
  3. t must have an attribute createdAt that holds the value of the created_at attribute that Twitter sends with each tweet.
  4. Let delay be the waiting time before updating the tweet, given by the call Config.getTweetUpdateDelay(t.createdAt).
  5. Call JobQueue.schedule() with the following arguments: 'tweet:update', {'data' : t.id}, delay.
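
In TypeScript, and assuming the JobQueue and Config sketches above, the call could look like this. The Tweet shape and the jobQueue instance are assumptions for illustration, not part of the spec.

  // Sketch of starterProcessInstance.queueTweet. `jobQueue` and
  // `getTweetUpdateDelay` come from the illustrative sketches above.
  interface Tweet {
    id: string;        // Twitter's id_str
    createdAt: number; // created_at parsed to a millisecond timestamp
  }

  declare const jobQueue: JobQueue;

  async function queueTweet(tweet: Tweet): Promise<void> {
    const delay = getTweetUpdateDelay(tweet.createdAt);
    if (delay === -1) return; // past its update cycle, nothing to schedule
    await jobQueue.schedule('tweet:update', JSON.stringify({ data: tweet.id }), delay);
  }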

Updater Process.

This worker runs continually; it never stops unless it fails, in which case the exception should be caught and the process ended gracefully. The worker handles its internal state, and most of its functions rely on that state.

Constants from configuration

  • bufferTTL [Number]: Time in milliseconds to wait before flushing a buffer/requesting buffered tweets.

State variables.

  • buffer [Array]: Holds the list of tweet ids that are going to be requested. Its maximum size is 100; once it reaches 100 elements it must be flushed.
  • startTime [Date]: Holds the date and time at which a first element or batch of elements was inserted into the empty buffer.

run()

Starts the process. Recursive function: every time it ends its flow, it calls itself to start again. No arguments.

Algorithm:

  1. Call shouldStartTimer().
  2. If the response of shouldStartTimer() is true, call startTimer().
  3. Let t be the result of calling getTweetToUpdate().
  4. If t was returned and is not null, add it to the buffer by calling appendToTweets(t).
  5. Call shouldUpdate().
  6. If the response of shouldUpdate() is false, call run() again.
  7. If the response of shouldUpdate() is true, let tweets be the response from calling requestTweets().
  8. Call saveTweets(tweets).
  9. Call promoteTweets(tweets).
  10. Call emptyTweets() to flush the buffer.
  11. Call run() again.
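
A sketch of this flow in TypeScript follows. It is written as a while loop instead of literal recursion to avoid unbounded stack growth, and it assumes the state variables and helper functions specified in the rest of this section.

  // Sketch of the Updater Process main loop. The helpers are the ones
  // specified below; state lives in module-level variables.
  let buffer: string[] = [];
  let startTime = new Date();

  declare function shouldStartTimer(): boolean;
  declare function startTimer(): void;
  declare function getTweetToUpdate(): Promise<string | null>;
  declare function appendToTweets(id: string): void;
  declare function shouldUpdate(): boolean;
  declare function requestTweets(): Promise<any[]>;
  declare function saveTweets(tweets: any[]): Promise<any[]>;
  declare function promoteTweets(tweets: any[]): Promise<void>;
  declare function emptyTweets(): void;

  async function run(): Promise<void> {
    while (true) {
      if (shouldStartTimer()) startTimer();
      const t = await getTweetToUpdate();
      if (t !== null) appendToTweets(t);
      if (!shouldUpdate()) continue;
      const tweets = await requestTweets();
      await saveTweets(tweets);
      await promoteTweets(tweets);
      emptyTweets();
    }
  }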

shouldStartTimer()

Reads the tweets buffer and determines if the startTime should be set.

Algorithm:

  1. If buffer length is 0, return true.
  2. Else return false.

startTimer()

Algorithm:

  1. Set startTime to the current date.

getTweetToUpdate()

Grabs a tweet from the Job queue.

Algorithm:

  1. Call JobQueue.get('tweet:update').
  2. Let data be the job data from the previous call.
  3. Get the id from data.
  4. Return the id.

appendToTweets(t)

Adds a tweet to the buffer.

Algorithm:

  1. Get the tweet id from t.
  2. Push the id to the buffer.

shouldUpdate()

Validates whether the buffer is full or the bufferTTL is depleted.

Algorithm:

  1. If buffer length is greater than or equal to 100, return true.
  2. Let delta be the difference between the current date and startTime.
  3. If buffer length is greater than 0 and delta is greater than or equal to bufferTTL, return true.
  4. If no condition was met, return false.
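
As a sketch, assuming bufferTTL is read from configuration in milliseconds and the buffer and startTime state from the run() sketch above:

  // Sketch of shouldUpdate; `bufferTTL` comes from configuration.
  declare const bufferTTL: number;

  function shouldUpdate(): boolean {
    if (buffer.length >= 100) return true;
    const delta = Date.now() - startTime.getTime();
    return buffer.length > 0 && delta >= bufferTTL;
  }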

requestTweets()

Grabs ids from the buffer and uses them to request information from Twitter's statuses/lookup REST API endpoint.

Algorithm:

  1. Let query be the query parameter for the request.
  2. Add an attribute id to query, using the values from buffer joined by commas.
  3. Send request to statuses/lookup with the query parameter.
  4. Get response.
  5. Return response.
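
A sketch of such a request, reusing the buffer state from above. OAuth signing is elided: authHeaders() is a placeholder you would have to supply, not a real API.

  // Sketch of requestTweets against Twitter's v1.1 statuses/lookup endpoint.
  declare function authHeaders(): Record<string, string>; // placeholder for OAuth signing

  async function requestTweets(): Promise<any[]> {
    const query = { id: buffer.join(',') };
    const url = `https://api.twitter.com/1.1/statuses/lookup.json?id=${query.id}`;
    const res = await fetch(url, { headers: authHeaders() });
    return res.json(); // array of raw tweet objects, as Twitter sends them
  }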

saveTweets(tweets)

Gets the response from Twitter as input and stores the tweets in your system.

Algorithm:

  1. Let tweets be an array of raw tweet objects, as Twitter sends them.
  2. For each tweet:
  3. Parse the tweet; the parsing algorithm depends on the information you need and the particular logic of your application.
  4. Find your local record by using the tweet id.
  5. Save the parsed tweet to your DB.
  6. After all tweets have been saved, return tweets.

promoteTweets(tweets)

Sends the already updated tweets to the next step of the pipeline.

Algorithm:

  1. Let tweets be a collection of raw tweet objects.
  2. For each tweet:
  3. Call Config.getTweetUpdateDelay() with the tweet's created_at value parsed to a millisecond timestamp.
  4. Let delay be the result of the previous call.
  5. If delay is not -1, call JobQueue.schedule() with the following arguments: 'tweet:update', {'data' : tweet.id_str}, delay.
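
This mirrors the Starter Process scheduling step. A sketch, reusing the assumed helpers from the sections above:

  // Sketch of promoteTweets: re-schedule each updated tweet with the delay
  // that matches its current age, closing the pipeline loop.
  async function promoteTweets(tweets: any[]): Promise<void> {
    for (const tweet of tweets) {
      const delay = getTweetUpdateDelay(Date.parse(tweet.created_at));
      if (delay === -1) continue; // update cycle is over for this tweet
      await jobQueue.schedule('tweet:update', JSON.stringify({ data: tweet.id_str }), delay);
    }
  }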

emptyTweets()

Flushes the tweets buffer.

Algorithm:

  1. Set buffer to a new empty array.
