diff --git a/architecture/componentView/index.rst b/architecture/componentView/index.rst
deleted file mode 100644
index f60a117..0000000
--- a/architecture/componentView/index.rst
+++ /dev/null
@@ -1,184 +0,0 @@
-Component/𝜇-Service View
-========================
-
-.. toctree::
-   :maxdepth: 2
-
-   CatalogManager
-   IngestionManager
-   StorageManager
-   DatasetManager
-
-The main components/𝜇-services of the DAF platform are:
-
-- `CatalogManager <#CatalogManager>`__
-- `IngestionManager <#IngestionManager>`__
-- `StorageManager <#StorageManager>`__
-- `DatasetManager <#DatasetManager>`__
-
-The following image shows these components/𝜇-services and their mutual
-relationships.
-
-.. figure:: images/daf_arch_component_view.png
-   :alt: Component View
-
-   Component View
-
-CatalogManager
---------------
-
-The *CatalogManager* is responsible for the creation, update and
-deletion of datasets in DAF. Furthermore, it takes care of the metadata
-information associated with a dataset.
-
-The CatalogManager provides a common view and a common set of APIs for
-operating on datasets and on all the related metadata information and
-schemas (see the `CatalogManager API & endpoints `__).
-
-The CatalogManager is based on the services provided by the
-`CKAN `__ service. In fact, one of the most relevant architectural
-decisions is to reuse as much as possible the metadata and catalog
-features provided by the CKAN service. The idea behind it is simple:
-treat the data managed by the DAF platform in the same way CKAN treats
-open data. Part of the metadata is managed by the CKAN catalog, while
-additional metadata information is managed by the CatalogManager.
-
-The CatalogManager is also responsible for storing all the schemas
-associated with the datasets: these schemas are saved as
-`AVRO `__ schemas.
-
-IngestionManager
-----------------
-
-The *IngestionManager* manages all the data ingestion activities
-associated with datasets.
-
-The IngestionManager collaborates with the CatalogManager to associate
-the proper metadata with the ingested data.
-
-The IngestionManager provides an API to ingest data from a data source
-into the DAF platform (see the `IngestionManager API & endpoints `__).
-In particular, the IngestionManager takes as input the data and the
-information needed to identify the dataset with which the data is to be
-associated. Before actually storing the data in DAF, the
-IngestionManager performs a set of coherence checks between the metadata
-contained in the catalog and the data schema implied by the input data.
-There are two scenarios:
-
-1. The catalog entry for the dataset has already been set up. In this
-   case the *IngestionManager* will check whether the incoming data and
-   schemas are congruent with what has been configured in the catalog.
-2. There is no catalog entry for the dataset. In this case the
-   *IngestionManager* will automatically create an entry in the catalog,
-   checking that all the relevant information is provided during the
-   ingestion phase.
-
-The *IngestionManager* is also responsible for scheduling the ingestion
-tasks, based on the information associated with the datasets. The
-ingestion for static data (data at rest) is based on a pull model: the
-dataset catalog entry should contain information about where and when
-the data should be pulled from.
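-Since the catalog stores dataset schemas as AVRO, the coherence check
-between a catalog entry and incoming data can be pictured as a
-comparison between two AVRO schemas. The following is a minimal,
-hypothetical sketch (the names and the field-containment rule are
-assumptions, not the actual IngestionManager implementation):
-
-.. code-block:: scala
-
-   import org.apache.avro.Schema
-   import scala.collection.JavaConverters._
-
-   object CoherenceCheckSketch {
-
-     // Assumed rule: every field implied by the incoming data must exist
-     // in the catalog schema with the same top-level AVRO type.
-     def isCongruent(catalog: Schema, incoming: Schema): Boolean = {
-       val catalogFields =
-         catalog.getFields.asScala.map(f => f.name -> f.schema.getType).toMap
-       incoming.getFields.asScala.forall { f =>
-         catalogFields.get(f.name).contains(f.schema.getType)
-       }
-     }
-
-     def main(args: Array[String]): Unit = {
-       val parser = new Schema.Parser()
-       // Schema registered in the catalog by the CatalogManager.
-       val catalog = parser.parse(
-         """{"type": "record", "name": "CatalogEntry", "fields": [
-           |  {"name": "id", "type": "string"},
-           |  {"name": "value", "type": "double"}]}""".stripMargin)
-       // Schema implied by an incoming batch of data.
-       val incoming = parser.parse(
-         """{"type": "record", "name": "IncomingBatch", "fields": [
-           |  {"name": "id", "type": "string"},
-           |  {"name": "value", "type": "double"}]}""".stripMargin)
-       println(isCongruent(catalog, incoming)) // prints: true
-     }
-   }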
-StorageManager
---------------
-
-The *StorageManager* is responsible for abstracting the physical medium
-where the data is actually stored (see the `StorageManager API &
-endpoints `__).
-
-The StorageManager is based on the Spark dataset abstraction to hide the
-details of the specific storage platform. In fact, Spark provides a very
-powerful mechanism for describing a dataset source regardless of its
-actual physical location. We leverage this mechanism to define the
-physical URIs described before, that is:
-
-- ``dataset:hdfs://`` for HDFS,
-- ``dataset:kudu:dbname:tablename`` for Kudu,
-- ``dataset:hbase:dbname:tablename`` for HBase.
-
-The only restriction we have to impose to make this Spark-based
-mechanism work is to always have one dataset per HDFS directory.
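-As an illustration, such an abstraction layer can be pictured as a
-function mapping each physical URI onto the corresponding Spark reader.
-The sketch below is hypothetical: the Kudu master address, the connector
-format names and their option keys are assumptions that depend on the
-connectors actually deployed, and HDFS data is assumed to be
-Parquet-serialized.
-
-.. code-block:: scala
-
-   import org.apache.spark.sql.{DataFrame, SparkSession}
-
-   object PhysicalUriResolverSketch {
-
-     def read(spark: SparkSession, uri: String): DataFrame = uri match {
-       // e.g. dataset:hdfs://nameservice/daf/my_org/my_dataset
-       case u if u.startsWith("dataset:hdfs://") =>
-         spark.read.parquet(u.stripPrefix("dataset:"))
-
-       // e.g. dataset:kudu:mydb:mytable -> kudu-spark connector
-       case u if u.startsWith("dataset:kudu:") =>
-         val Array(db, table) = u.stripPrefix("dataset:kudu:").split(":")
-         spark.read.format("kudu")
-           .option("kudu.master", "kudu-master:7051") // placeholder address
-           .option("kudu.table", s"$db.$table")
-           .load()
-
-       // e.g. dataset:hbase:mydb:mytable -> an HBase-Spark connector
-       case u if u.startsWith("dataset:hbase:") =>
-         val Array(db, table) = u.stripPrefix("dataset:hbase:").split(":")
-         spark.read.format("org.apache.hadoop.hbase.spark")
-           .option("hbase.table", s"$db:$table") // option keys vary by connector
-           .load()
-
-       case other =>
-         throw new IllegalArgumentException(s"Unsupported dataset URI: $other")
-     }
-   }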
-DatasetManager
---------------
-
-The DatasetManager manages several operations related to datasets, such
-as:
-
-- returning the data of a dataset (or a sample of it) in a specified
-  format;
-- creating a specific view on top of a dataset;
-- getting the dataset schema in a given format (e.g. AVRO);
-- creating a new dataset based on an existing one, but saved into a
-  different storage mechanism or based on a transformation of the
-  existing dataset [not sure on this one yet, maybe this should be
-  managed by the catalog manager?]; etc.
-
-For a list of endpoints and functionalities currently provided by the
-DatasetManager, see the `DatasetManager API & endpoints `__.
-
-Technically speaking, the DatasetManager is responsible for all the
-tasks performed on top of the datasets identified by the `logical
-URIs <../logicalView>`__. For example, tasks like format conversion
-(e.g. AVRO to Parquet) and dataset import/movement (e.g. from HDFS to
-Kudu) will be managed by this 𝜇-service.
-
-The DatasetManager interacts with the CatalogManager to update the
-information about the dataset it is operating on. For example, a format
-conversion means triggering a Spark job that first creates a copy of the
-source dataset in the target format; the catalog entry is then updated
-to take the new format into account. A sketch of such a job is shown
-below.
-
-The DatasetManager is also responsible for publishing the dataset into a
-proper serving layer. For example, a dataset operation could create an
-Impala external table mapped onto the dataset directory sitting on HDFS.
-This publishing operation will provide the user with the JDBC/ODBC
-connection information needed to connect an external tool to that table.
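-A minimal sketch of the conversion job (the paths, the application name
-and the commented-out ``catalogClient`` call are hypothetical; reading
-the ``avro`` format assumes the spark-avro package is available):
-
-.. code-block:: scala
-
-   import org.apache.spark.sql.{SaveMode, SparkSession}
-
-   object AvroToParquetJobSketch {
-
-     def main(args: Array[String]): Unit = {
-       val spark = SparkSession.builder()
-         .appName("avro-to-parquet")
-         .getOrCreate()
-
-       val source = "hdfs:///daf/my_org/my_dataset/avro"    // assumed layout
-       val target = "hdfs:///daf/my_org/my_dataset/parquet"
-
-       // Step 1: copy the source dataset into the target format.
-       spark.read.format("avro").load(source)
-         .write.mode(SaveMode.Overwrite).parquet(target)
-
-       // Step 2: notify the CatalogManager (hypothetical client API):
-       // catalogClient.updateFormat("my_dataset", "parquet", target)
-
-       spark.stop()
-     }
-   }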
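-The publishing step can likewise be pictured as issuing a DDL statement
-over JDBC (the host, port, table name and paths are placeholders, and
-the Impala JDBC driver is assumed to be on the classpath):
-
-.. code-block:: scala
-
-   import java.sql.DriverManager
-
-   object ImpalaPublisherSketch {
-
-     def main(args: Array[String]): Unit = {
-       val conn =
-         DriverManager.getConnection("jdbc:impala://impala-host:21050/default")
-       try {
-         // Map the Parquet directory on HDFS onto an external Impala
-         // table, inferring the columns from one of the data files.
-         val ddl =
-           """CREATE EXTERNAL TABLE IF NOT EXISTS my_dataset
-             |LIKE PARQUET 'hdfs:///daf/my_org/my_dataset/parquet/part-00000.parquet'
-             |STORED AS PARQUET
-             |LOCATION 'hdfs:///daf/my_org/my_dataset/parquet'""".stripMargin
-         conn.createStatement().execute(ddl)
-       } finally conn.close()
-     }
-   }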
-SemanticManager
----------------
-
-The SemanticManager is the default access point for the OntoPA catalog,
-providing a set of functionalities that bridge the OntoPA front-end and
-the DAF platform itself.
-
-The component is still evolving: the first version does not handle RDF
-data directly, focusing instead on general-purpose abstractions for a
-repository of ontologies; such an extension is planned for the next
-phase of development, according to the evolution of DAF itself and of
-the various specialized components.
-
-The functionalities are decomposed into a group of different components,
-including operations for storing and querying ontologies on an
-underlying triplestore (*semantic-repository*), for indexing and
-searching the ontologies and core vocabularies of the network
-(*ontonethub*), and a small list of dedicated endpoints for specific
-usage (for example *semantic-validator*, designed to validate the
-compliance of a given ontology with the DCAT-AP_IT standard).
-
-The SemanticManager currently assists the DAF ingestion form with the
-information needed for a simple, standardized annotation of dataset
-fields. The component also offers summary metadata about the available
-ontologies to the public front-end of the daf/dataportal.
-
-For a list of components and functionalities currently provided by the
-SemanticManager and the related specialized components, please see the
-`daf-semantics `__ repository.
diff --git a/architecture/index.rst b/architecture/index.rst
deleted file mode 100644
index 6ff4a0a..0000000
--- a/architecture/index.rst
+++ /dev/null
@@ -1,55 +0,0 @@
-Big data platform
-=================
-
-More precisely, you can think of the Big Data platform as an environment
-offering capabilities for:
-
-- *storing and managing datasets*: users can register datasets and load
-  them onto the platform, specifying the ingestion models (e.g. batch,
-  streaming), the serialization formats (e.g. Avro, Parquet), the
-  desired serving layers (e.g. HBase, Impala), metadata, etc.;
-- *processing and analysing datasets*: the platform provides an
-  environment composed of a set of tools for data scientists and
-  analysts. Using these tools, they can perform analyses on data, run
-  statistical and machine learning models, and produce data
-  visualizations and reports;
-- *redistributing datasets, developing data applications, publishing
-  insights*: the platform provides tools enabling the publication of
-  open data, data stories, data applications, etc.
-
-The following image provides an architectural overview of the DAF:
-
-`Update and Insert Image `__
-
-- the `DAF architecture <../architecture/>`__
-
-DAF Architecture
-================
-
-The DAF (Data Analytics Framework) is a platform originally designed to
-gather and store data coming from different Italian public
-administrations. As a consequence, it provides efficient and easy-to-use
-ingestion mechanisms that allow external organisations to ingest their
-data into the platform with minimal human intervention. The DAF platform
-should provide support not only for data at rest and fast data
-(streaming), but also for storing and managing collections of
-unstructured data, such as textual documents. Besides those storage
-capabilities, the next main goal is to provide a powerful mechanism for
-data integration, i.e. a way to integrate data that traditionally
-resides in separate silos. Enabling the correlation of datasets that
-normally reside in different systems/organizations can become a very
-powerful enabling factor for discovering new insights into the data. The
-platform should also allow data scientists to access its computational
-power to implement advanced analytics algorithms.
-
-Take a tour of the DAF architecture by looking at:
-
-.. toctree::
-   :maxdepth: 1
-
-   Logical View
-   Component/microservice View
-   Deployment View
diff --git a/dataportal/dataportal-private.rst b/dataportal/dataportal-private.rst
index 4419e9e..a487627 100644
--- a/dataportal/dataportal-private.rst
+++ b/dataportal/dataportal-private.rst
@@ -1,6 +1,6 @@
-*****************
+******************
 Dataportal-private
-*****************
+******************
 
 ===========
 What it is?
diff --git a/index.rst b/index.rst
index dd1b556..bf8faf7 100644
--- a/index.rst
+++ b/index.rst
@@ -8,18 +8,47 @@ Data & Analytics Framework (DAF) - Developer Documentation
 
 .. NOTE::
 
-   This documentation refers to the Alpha version of the DAF
-   (released in October 2017) and it is daily updated and improved.
-   For comments and enhancement requests about the documentation please open
-   an issue on `Github `_.
+   This documentation refers to the Alpha version of the DAF (released in
+   October 2017) and it is updated and improved daily.
+   For comments and enhancement requests about the documentation, please
+   open an issue on `Github `_.
+
+The `Data & Analytics Framework `_ (DAF, in short) is an open source
+project developed in the context of the activities planned by the Italian
+`Three-Year Plan for ICT in Public Administration 2017 - 2019 `_,
+approved by the Italian Government in 2017.
+
+In this scenario, the main goal of the DAF is to promote the exchange of
+public data between Italian PAs, to support the diffusion of open data,
+and to enable data-driven policies.
+The Italian instance of the DAF is developed and maintained by a
+**Data Team** composed of data scientists and data engineers, which uses
+and evolves the framework to analyze data, create machine learning models,
+and build data applications and data visualization products.
+
+That said, the DAF is a generic enough tool to be reused in other
+countries and in other application domains. In fact, the DAF is composed
+of:
+
+- a **Dataportal**, a Web user interface providing:
+
+  - a catalog of open-data datasets based on `CKAN `_;
+  - a set of tools for data analysis and visualization;
+  - a tool to handle data ingestion, data and metadata management processes;
+  - a tool for publishing and sharing data stories.
+
+- a **Big Data platform** to centralize, store, manipulate, standardize
+  and redistribute data and insights.
+
+The DAF is under development. This is a snapshot of the roadmap:
+
+- By October 2017: Alpha release.
+- By November 2017: Beta release.
+- By December 2017: 1.0 release.
+
+Both the Alpha and Beta releases will be tested by selected communities
+and Italian PAs.
+
+All contributions are welcome!
 
 Contents:
 
 .. toctree::
    :maxdepth: 2
 
-   Introduction
    Overview
    Concepts
    Data Portal
diff --git a/microservices/example-microservice.rst b/microservices/example-microservice.rst
deleted file mode 100644
index 5bdb080..0000000
--- a/microservices/example-microservice.rst
+++ /dev/null
@@ -1,19 +0,0 @@
-
-This is an example microservice
-===============================
-
-What is it?
------------
-
-...
-
-Install
--------
-
-...
-
-Setup
------
-
-...