
DatasetsSummarizer describes the features of a large set of datasets and allows users to compare them in a single view.


egm68/dataset-visualization


Visualizing and Understanding Dataset Search Results

Group Members: Sonia Castelo, Erin McGowan, and Shirley Berry

Course: Big Data CS-GY 6513 Section C

Project structure

│
├── custom_functions.py                         <- Python functions used to compute dataset similarity based on the Jaccard index
├── netgraph_functions.py                       <- Python functions used to compute x and y values using a force-directed graph algorithm
├── taxi_metadata_2023_05_04.csv                <- Taxi metadata generated from a search for the "taxi" keyword on May 5, 2023
├── taxi_metadata_2023_05_04_with_similarity_coordinates.csv <- The same taxi metadata, plus the x and y coordinates calculated from the similarity measurements
├── requirements.txt                            <- Python package versions
├── DatasetsSummarizer_Tool_Demo.ipynb          <- Visualizations of taxi data in Jupyter notebook
├── full_pipeline.ipynb                         <- See below
├── start_from_metadata.ipynb                   <- See below

Notebooks

  • full_pipeline.ipynb: This notebook takes you through the entire pipeline: data ingestion, calculating dataset similarity, and visualizing the search results with the DatasetsSummarizer. This notebook will show you our results and walk you through the metadata pre-processing and similarity calculations in more detail. You can also use this notebook to change the search keyword and produce new results.
  • start_from_metadata.ipynb: This notebook takes you through the similarity calculations, starting from a metadata dataframe that has already been created. This notebook will show you our results, and walk you through the similarity calculations in more detail.
  • DatasetsSummarizer_Tool_Demo.ipynb: This notebook uses the DatasetsSummarizer library to create the summarizer visualization using pre-generated metadata.
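The similarity step mentioned in the notebooks is described as Jaccard-based. As a rough illustration of that idea only (this is not the code in custom_functions.py, and the function name and column names below are hypothetical), Jaccard similarity between two datasets could be computed over the sets of their column names:

```python
def jaccard_similarity(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: treat as having nothing in common (a convention, not a rule).
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical column names for two taxi-related datasets.
columns_a = ["pickup_time", "fare", "tip"]
columns_b = ["pickup_time", "fare", "distance"]

# Two shared names out of four distinct names.
score = jaccard_similarity(columns_a, columns_b)
print(score)
```

Which attributes are actually compared (column names, keywords, types, or something else) is an assumption here; the notebooks are the authoritative reference for that choice.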

DatasetsSummarizer

DatasetsSummarizer is compatible with Jupyter Notebooks. It needs x and y values, which can be derived from any similarity metric, to generate the similarity plot between datasets. It also supports the metadata format generated by the datamart-profiler library, which it uses to build the Detail View for exploring each dataset.
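To make the x and y requirement concrete, here is a minimal sketch of one way to turn a pairwise similarity matrix into 2D coordinates. It is a simple spring-style layout written for illustration, not the netgraph-based code used in this repository: each pair of datasets is treated as a spring whose rest length is 1 - similarity. The function name, parameters, and example matrix are all hypothetical.

```python
import math
import random


def layout_from_similarity(sim, iterations=500, step=0.05, seed=0):
    """Return [(x, y), ...] so that similar items end up close together.

    sim is a symmetric list-of-lists matrix with values in [0, 1].
    """
    n = len(sim)
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = math.hypot(dx, dy) or 1e-9
                target = 1.0 - sim[i][j]
                # Positive when the pair is too far apart, negative when too close.
                error = dist - target
                fx -= error * dx / dist
                fy -= error * dy / dist
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return [(x, y) for x, y in pos]


# Hypothetical similarities: datasets 0 and 1 are alike, dataset 2 is not.
similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
coords = layout_from_similarity(similarity)
for x, y in coords:
    print(round(x, 3), round(y, 3))
```

The resulting x and y values would then be attached to each dataset's metadata record before plotting. How the tool expects those values to be named or stored is not shown here; see the demo notebook for the actual format.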

Note that we used another GitHub repo (https://github.com/soniacq/DatasetsVis) to manage our DatasetsSummarizer tool versioning and dependencies.

System screen

(Click a dataset in the list of results to open the Detail View.)

Report: It is available here

Video - supported interactions: https://youtu.be/sMwO6fo4SyI

Demo

Live demo (Google Colab):

In Jupyter Notebook (find this demo here):

import DatasetsSummarizer
data = DatasetsSummarizer.get_taxi_data()
DatasetsSummarizer.plot_datasets_summary(data)

Install

Install via pip:

pip install datasets-summarizer

Reproducing

Figures 3, 4, 6, 7, and 8 as referenced in our paper can be reproduced and interacted with using the DatasetsSummarizer_Tool_Demo.ipynb notebook. This demo notebook uses the same metadata that we used to produce the figures in the paper. Using one of the other two notebooks could result in slight variations in results if any of the datasets have changed or been updated since our results were produced.

This project should not require any specialized hardware to reproduce, and any specialized packages are listed in the requirements.txt file or installed in the notebooks themselves. We were able to run this project end to end on a Mac with an M1 Pro chip (10 cores, 16 GB of memory).
