amrmagdy4/kite

#Kite

##An open-source system to query Twitter-like data in real time

http://kite.cs.umn.edu

Kite is a distributed system built on top of Apache Ignite to manage Twitter-like data at scale. If you analyze tweets or build applications on top of them, Kite lets you exploit this high-volume data without worrying about how to manage the data itself: just connect and fly with Kite. More generally, Kite manages any micro-length data, called Microblogs, not just Twitter data. Kite is tailored to the requirements of Microblogs applications and supports efficient queries on arbitrary Microblog attributes to serve a myriad of applications.

##About Kite

Kite is an open-source system to index and query Twitter-like data (Microblogs data). It is implemented as a distributed system on top of Apache Ignite and the Hadoop Distributed File System (HDFS). It is designed to digest and index both fast data in real time and large volumes of historical data, up to billions of Microblog records. Microblogs in general are the micro-length posts generated by hundreds of millions of web users every day, such as tweets, online reviews of products and movies, user comments on news media or social media, and user check-ins on location-aware web services. This data arrives at a rate of thousands of records every second, carrying rich user-generated content such as news, opinions, and discussions, as well as metadata including location, language, and personal information. The rich content and the popularity of microblogging platforms result in Microblogs being exploited in a wide variety of important applications, including news dissemination and citizen journalism, event detection and analysis, rescue services during natural disasters, geo-targeted advertising, and several research disciplines, including machine learning, human-computer interaction, social sciences, and medical sciences.

From a data management perspective, Microblogs data represents a unique kind of streaming data with new challenges in the areas of fast data indexing and main-memory management. Microblogs queries require managing incoming data in real time so that data that arrived only a few seconds ago can be queried. In addition, several important queries on Microblogs exploit large volumes of historical data, such as querying six months' worth of election tweets, analyzing Ebola or Zika outbreaks, or studying social behaviors of online communities over several months. These queries require searching a huge search space that might include hundreds of billions of data records. Microblogs queries also come with a unique signature that promotes the top-k and temporal aspects of the queries to first-class citizens, as we advocated in our research. Kite fills the gap in existing systems and fulfills the requirements of Microblogs queries so that application developers can exploit this unique data in their applications without worrying about the data management complications under the hood.

Kite scales to digest more than 10,000 Microblogs per second on each machine with tunable memory usage. It can organize billions of historical records in efficient temporal index structures that can be queried very fast. Kite also provides real-time query responses on the order of a few milliseconds for a variety of queries on spatial and non-spatial attributes.

##Quick Start

To start working with Kite, simply download its binaries here, edit the settings file, and run the jar file kite-console-xx.jar on the cluster machines using a Java 8 JVM (xx refers to the Kite version number). Kite is a distributed system that runs on commodity hardware clusters. The Kite jar file should be executed on each machine separately; there is no need to provide a list of the machines in the cluster ahead of time. When a new machine runs the Kite jar file, it is added to the cluster and automatically discovered by the other up-and-running machines, as long as they belong to the same network. When a running machine goes down, this is also automatically detected by the other machines. As introduced in About Kite, Kite writes its disk-based artifacts to the Hadoop Distributed File System (HDFS). So, before starting Kite machines, it is necessary to have an up-and-running HDFS instance. Check here to configure an HDFS cluster. Note that all machines of the same Kite instance should share the same settings for the underlying HDFS, as introduced in Kite Settings File below.

Once a Kite machine has started, it is ready to receive and execute MQL query language statements. In addition, the Kite jar file can be added as a dependency to Java projects to use Kite APIs in Java programs or in compatible programming languages. To gracefully stop a Kite machine, type quit or exit. We provide Examples for MQL statements and queries as well as a ready-made example of a streaming data source to start using Kite immediately.

####Main Features

Using Kite, system administrators can:

  1. Connect Microblogs streams of arbitrary attributes and schema from local and remote sources.
  2. Create index structures on arbitrary attributes of existing Microblogs streams. Kite provides both spatial and non-spatial index types.
  3. Add machines to and remove machines from the Kite cluster dynamically as needed, without restarting or interrupting the cluster operation.
  4. Search existing streams using the MQL query language and Java-compatible APIs. Kite automatically chooses the right index structures to process queries efficiently.
  5. Manage and administer existing streams and index structures with a variety of utility commands and tools.

####Kite Settings File

When running the Kite jar file on each machine, the system administrator should provide a settings file. By default, the Kite jar looks for a settings file named kite.settings located in the same folder as the jar file. If the settings file is located elsewhere or named differently, it should be provided as a command-line argument to the jar file.
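The lookup rule above can be sketched in a few lines of Java. This is only an illustration of the described behaviour, not Kite's actual code: the class `SettingsPathSketch`, the method `resolve`, and the assumption that the settings path is the first command-line argument are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, NOT part of Kite: illustrates the settings file lookup rule.
final class SettingsPathSketch {
    // Default file name stated in this README.
    static final String DEFAULT_NAME = "kite.settings";

    // Returns the path given on the command line if there is one (assumed here
    // to be the first argument), otherwise kite.settings next to the jar file.
    static Path resolve(String[] args, Path jarDirectory) {
        if (args != null && args.length > 0 && !args[0].trim().isEmpty()) {
            return Paths.get(args[0].trim());
        }
        return jarDirectory.resolve(DEFAULT_NAME);
    }

    public static void main(String[] args) {
        // Uses the current working directory as a stand-in for the jar folder.
        System.out.println(resolve(args, Paths.get("").toAbsolutePath()));
    }
}
```

Run without arguments, the sketch falls back to kite.settings in the given folder; run with a path argument, it returns that path unchanged.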

The settings file includes the mandatory HDFS settings, in addition to other optional settings that allow the system administrator to tune and control system performance and behaviour. The Kite settings file is a properties file that includes the following parameters:

  • hdfsHost: a mandatory URL to the master node of HDFS cluster. No default value provided. Example: hdfs://cs-cluster-1.cs.umn.edu:9000
  • hdfsRootDirectory: an optional path to the root HDFS directory. Default value is "/". Example: /user/admin/
  • hdfsUsername: a mandatory HDFS username for an authorized user with write permissions on the hdfsRootDirectory directory. No default value provided. Example: amr
  • hdfsGroupname: an optional HDFS groupname to which the hdfsUsername belongs. No default value provided. Example: supergroup
  • queryAnswerSize: an optional integer to determine the default answer size that is returned by Kite queries. Default value is 20. Example: 100
  • queryTimeMinutes: an optional integer to determine the default search time horizon in the past. The value of 180 means searching the last 180 minutes (3 hours) of data.
    This default value can be overridden by each query; it is used only when a query does not indicate a search time period. Default value is 180. Example: 10080
  • memoryIndexCapacity: an optional integer to determine the default in-memory index capacity in terms of number of data records. Default value is 1000000.
  • memoryIndexNumSegments: an optional integer to determine the default number of segments into which in-memory index structures are divided. Default value is 5.
  • logsDirectory: an optional path to a folder in which system logs are created. Default value is the directory that contains the Kite jar file.
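As an illustration of how these parameters fit together, the following sketch parses settings text with the standard `java.util.Properties` class, applies the default values listed above, and rejects input that lacks the two mandatory keys. The class `KiteSettingsSketch` and its methods are hypothetical and are not Kite's own loader; the sample values are the examples from the list above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, NOT part of Kite: parses settings text in properties format.
final class KiteSettingsSketch {
    // Keys documented above as mandatory.
    private static final String[] MANDATORY = {"hdfsHost", "hdfsUsername"};

    // Default values exactly as documented in the parameter list.
    static Properties defaults() {
        Properties d = new Properties();
        d.setProperty("hdfsRootDirectory", "/");
        d.setProperty("queryAnswerSize", "20");
        d.setProperty("queryTimeMinutes", "180");
        d.setProperty("memoryIndexCapacity", "1000000");
        d.setProperty("memoryIndexNumSegments", "5");
        return d;
    }

    // Parses the text, falling back to the defaults for any missing optional key.
    static Properties parse(String text) {
        Properties settings = new Properties(defaults());
        try {
            settings.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (String key : MANDATORY) {
            String value = settings.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalArgumentException("Missing mandatory setting: " + key);
            }
        }
        return settings;
    }

    // Reads an integer-valued setting such as queryAnswerSize.
    static int intSetting(Properties settings, String key) {
        return Integer.parseInt(settings.getProperty(key).trim());
    }

    public static void main(String[] args) {
        // Sample settings content built from the example values above.
        String sample =
            "hdfsHost=hdfs://cs-cluster-1.cs.umn.edu:9000\n"
          + "hdfsUsername=amr\n"
          + "queryTimeMinutes=10080\n";
        Properties s = parse(sample);
        System.out.println("queryAnswerSize = " + intSetting(s, "queryAnswerSize"));
        System.out.println("queryTimeMinutes = " + intSetting(s, "queryTimeMinutes"));
    }
}
```

In this sketch, the sample text sets only the two mandatory keys and queryTimeMinutes, so every other parameter resolves to its documented default.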

##MQL Query Language

Kite comes with a SQL-like query language, called Microblogs Query Language (MQL), that gives system administrators a declarative interface to the system features. MQL provides the main statements CREATE STREAM, CREATE INDEX, DROP STREAM, DROP INDEX, and SELECT to create, drop, and query streams and index structures. It also provides additional statements to manage and administer the system assets: SHOW, UNSHOW, PAUSE, RESUME, ACTIVATE, DEACTIVATE, RESTART, and DESC. The usage of each statement is detailed at http://kite.cs.umn.edu/tutorial.html#mql.

##Java APIs

All Kite features can be used from Java programs by adding the Kite jar file to the Java project and importing edu.umn.cs.kite.*. In fact, all MQL statements are executed by translating them into the equivalent lines of Java code. We describe how to launch a Kite machine and give the equivalent Java code for each MQL statement at http://kite.cs.umn.edu/tutorial.html#java