The objective of our project is to create an integrated dataset consisting of several attributes related to books. This task has several use cases, the most straightforward being building a catalog of books similar to those maintained by many national libraries. Other use cases include building a content-based recommender system or comparing pricing attributes across the different platforms where the books are sold. Relevant data is widely available across various platforms, such as Google Books, Goodreads, or Amazon. As book data is actively maintained, we can draw on many other data sources should unexpected issues arise during data integration.
This project works with three datasets, all in CSV format. The first contains 49,197 entities, the second 27,159, and the third 271,360.
The target schema includes 8 attributes: title, authors, rating, pages, year, publisher, genres, price.
Example of the target schema in XML format:
<book>
  <id>book1</id>
  <title>bookTitle</title>
  <authors>
    <author>
      <name>author</name>
    </author>
  </authors>
  <rating>4</rating>
  <pages>0</pages>
  <year>1000</year>
  <publisher>xyz</publisher>
  <genres>
    <genre_type>comedy</genre_type>
  </genres>
  <price>15</price>
  <language>english</language>
</book>
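A record following this target schema can be assembled programmatically, for instance with Python's standard `xml.etree.ElementTree`. This is only an illustrative sketch (the helper `build_book` is not part of the project code; the `language` field from the example above is omitted for brevity):

```python
import xml.etree.ElementTree as ET

def build_book(book_id, title, authors, rating, pages, year,
               publisher, genres, price):
    """Build one <book> element following the target schema."""
    book = ET.Element("book")
    ET.SubElement(book, "id").text = book_id
    ET.SubElement(book, "title").text = title
    authors_el = ET.SubElement(book, "authors")
    for name in authors:
        author_el = ET.SubElement(authors_el, "author")
        ET.SubElement(author_el, "name").text = name
    ET.SubElement(book, "rating").text = str(rating)
    ET.SubElement(book, "pages").text = str(pages)
    ET.SubElement(book, "year").text = str(year)
    ET.SubElement(book, "publisher").text = publisher
    genres_el = ET.SubElement(book, "genres")
    for g in genres:
        ET.SubElement(genres_el, "genre_type").text = g
    ET.SubElement(book, "price").text = str(price)
    return book

record = build_book("book1", "bookTitle", ["author"], 4, 0, 1000,
                    "xyz", ["comedy"], 15)
xml_string = ET.tostring(record, encoding="unicode")
print(xml_string)
```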
An overview of the datasets and the null counts per attribute can be found in Exploring the data.ipynb.
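The kind of overview computed in that notebook can be sketched with pandas; the helper below and the tiny in-memory sample are assumptions for illustration (the real notebook reads the three CSV files instead):

```python
import pandas as pd

def overview(df):
    """Return the row count and per-column null counts for one dataset."""
    return len(df), df.isnull().sum().to_dict()

# Tiny in-memory sample; the actual notebook loads the project CSVs.
sample = pd.DataFrame({"title": ["A", "B"], "authors": ["x", None]})
rows, nulls = overview(sample)
print(rows, nulls)
```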
The data translation was done using the Altova MapForce 2021 software. The files can be found in the Data Translation folder.
The data was cleaned and examined for duplicates; rows with a null Authors value were dropped. More details are in Cleaning.ipynb.
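In pandas, the two cleaning steps described above can be expressed in a few lines. This is a simplified sketch, not the actual notebook code (the column name `authors` and the sample data are assumptions):

```python
import pandas as pd

def clean(df):
    """Drop exact duplicate rows, then rows whose 'authors' value is null."""
    df = df.drop_duplicates()
    df = df.dropna(subset=["authors"])
    return df.reset_index(drop=True)

raw = pd.DataFrame({
    "title":   ["A", "A", "B", "C"],
    "authors": ["x", "x", None, "y"],
})
cleaned = clean(raw)
print(cleaned)
```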
The gold standard was created by hand, using a combination of Python scripts and manual inspection of the data. More details can be found in Creating_GS.ipynb.
Note that running this notebook generates files such as goodreads_recommendation_H.csv, which are meant to be examined by hand. These files are not included in the repository.
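The Python side of this semi-manual workflow typically generates candidate pairs whose titles look similar, which are then confirmed or rejected by hand. A minimal sketch of such candidate generation (function name, threshold, and sample data are assumptions, not the project's actual code):

```python
import pandas as pd
from difflib import SequenceMatcher

def candidate_pairs(left, right, threshold=0.8):
    """Pair records whose titles are similar; pairs are then checked by hand."""
    rows = []
    for i, lt in left["title"].items():
        for j, rt in right["title"].items():
            score = SequenceMatcher(None, lt.lower(), rt.lower()).ratio()
            if score >= threshold:
                rows.append({"left_id": i, "right_id": j,
                             "score": round(score, 2)})
    return pd.DataFrame(rows)

left = pd.DataFrame({"title": ["The Hobbit", "Dune"]})
right = pd.DataFrame({"title": ["The Hobbit ", "Emma"]})
pairs = candidate_pairs(left, right)
# pairs could then be written out with pairs.to_csv(...) for manual review
print(pairs)
```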
With the help of the WInte.r framework in Java, the identity resolution was performed and evaluated against the gold standard. The work is included in IR_Main.java.
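Conceptually, identity resolution of this kind scores candidate record pairs with a weighted combination of per-attribute similarities and accepts pairs above a threshold. The sketch below illustrates the idea in Python; it is not the project's WInte.r/Java code, and the weights and threshold are assumptions:

```python
from difflib import SequenceMatcher

WEIGHTS = {"title": 0.7, "authors": 0.3}  # assumed attribute weights
THRESHOLD = 0.8                            # assumed acceptance threshold

def sim(a, b):
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

def match_score(rec1, rec2):
    """Weighted linear combination of per-attribute similarities."""
    return sum(w * sim(rec1[attr], rec2[attr])
               for attr, w in WEIGHTS.items())

r1 = {"title": "The Hobbit", "authors": "J.R.R. Tolkien"}
r2 = {"title": "The Hobbit", "authors": "J. R. R. Tolkien"}
is_match = match_score(r1, r2) >= THRESHOLD
print(is_match)  # pairs above the threshold become correspondences
```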
The gold standard was created using the correspondences from the identity resolution and by going through the data manually. More details can be found in GSxml.ipynb.
With the help of the WInte.r framework in Java, the three datasets were merged into a single XML file and evaluated against the gold standard. The work is included in DataFusion_Main.java.
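At its core, data fusion resolves conflicting attribute values from the matched records, for example by majority vote while ignoring nulls. A minimal Python sketch of this idea (the `fuse` helper and sample records are assumptions; the project's actual fusion logic lives in the WInte.r/Java code):

```python
from collections import Counter

def fuse(records):
    """Fuse records for one book: majority vote per attribute, nulls ignored."""
    fused = {}
    attrs = {a for r in records for a in r}
    for attr in attrs:
        values = [r[attr] for r in records if r.get(attr) is not None]
        if values:
            fused[attr] = Counter(values).most_common(1)[0][0]
    return fused

sources = [
    {"title": "Dune", "year": 1965, "publisher": None},
    {"title": "Dune", "year": 1965, "publisher": "Chilton"},
    {"title": "DUNE", "year": None, "publisher": "Chilton"},
]
fused = fuse(sources)
print(fused)
```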