Visualizing and Understanding Dataset Search Results

Group Members: Sonia Castelo, Erin McGowan, and Shirley Berry

Course: Big Data CS-GY 6513 Section C

Project structure

│
├── custom_functions.py                                        <- Python functions used to compute dataset similarity based on Jaccard similarity
├── netgraph_functions.py                                      <- Python functions used to compute x and y values using a force-directed graph layout algorithm
├── taxi_metadata_2023_05_04.csv                               <- Taxi metadata generated from a search for the "taxi" keyword on May 5, 2023
├── taxi_metadata_2023_05_04_with_similarity_coordinates.csv   <- The same taxi metadata, plus the x and y coordinates calculated from the similarity measurements
├── requirements.txt                                           <- Python package versions
├── DatasetsSummarizer_Tool_Demo.ipynb                         <- Visualizations of taxi data in a Jupyter notebook
├── full_pipeline.ipynb                                        <- See below
└── start_from_metadata.ipynb                                  <- See below
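
The relationship between the two CSV files listed above can be sketched as follows. This is only an illustration that uses the standard library: the column names (`id`, `name`, `x`, `y`), the dataset ids, and the helper `add_coordinates` are hypothetical and are not taken from the project's files.

```python
import csv
import io


def add_coordinates(metadata_rows, coordinates, id_field="id"):
    """Return copies of the metadata rows with "x" and "y" fields appended.

    `coordinates` maps a dataset id to an (x, y) pair. Rows without
    coordinates are kept and receive empty "x" and "y" values.
    """
    enriched = []
    for row in metadata_rows:
        x, y = coordinates.get(row[id_field], ("", ""))
        enriched.append({**row, "x": x, "y": y})
    return enriched


# Tiny in-memory stand-in for a metadata CSV (hypothetical columns and ids).
raw = "id,name\nd1,Taxi trips\nd2,Taxi zones\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Hypothetical layout coordinates, one (x, y) pair per dataset id.
coords = {"d1": (0.12, -0.40), "d2": (0.75, 0.31)}

enriched = add_coordinates(rows, coords)

# Write the enriched rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name", "x", "y"])
writer.writeheader()
writer.writerows(enriched)
print(out.getvalue())
```

Keeping the coordinates in the same table as the metadata means a single file is enough to drive both the similarity plot and the per-dataset details.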

Notebooks

full_pipeline.ipynb: This notebook takes you through the entire pipeline: data ingestion, calculating dataset similarity, and visualizing the search results with the DatasetsSummarizer. This notebook will show you our results and walk you through the metadata pre-processing and similarity calculations in more detail. You can also use this notebook to change the search keyword and produce new results.
start_from_metadata.ipynb: This notebook takes you through the similarity calculations, starting from a metadata dataframe that has already been created. This notebook will show you our results, and walk you through the similarity calculations in more detail.
DatasetsSummarizer_Tool_Demo.ipynb: This notebook uses the DatasetsSummarizer library to create the summarizer visualization using pre-generated metadata.
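
As a rough illustration of the similarity step that the notebooks walk through, the sketch below computes Jaccard similarity between pairs of datasets. It is not the code in custom_functions.py: the function names are hypothetical, and it assumes, purely for the example, that each dataset is represented by its set of column names.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty sets: define the similarity as 0.0 to avoid dividing by zero.
        return 0.0
    return len(set_a & set_b) / len(union)


def pairwise_similarity(features_by_dataset):
    """Build a symmetric {(id_a, id_b): similarity} mapping for every pair."""
    scores = {}
    for id_a, id_b in combinations(sorted(features_by_dataset), 2):
        score = jaccard_similarity(features_by_dataset[id_a], features_by_dataset[id_b])
        scores[(id_a, id_b)] = score
        scores[(id_b, id_a)] = score
    return scores


# Hypothetical column names for three datasets (illustrative only).
example = {
    "trips_a": ["pickup_datetime", "dropoff_datetime", "fare_amount"],
    "trips_b": ["pickup_datetime", "dropoff_datetime", "tip_amount"],
    "zones": ["zone_id", "borough"],
}
scores = pairwise_similarity(example)
print(scores[("trips_a", "trips_b")])  # 2 shared names out of 4 distinct names: 0.5
print(scores[("trips_a", "zones")])    # no shared names: 0.0
```

Any other set-valued description of a dataset (keywords, data types, and so on) could be passed to the same function in place of column names.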

DatasetsSummarizer

DatasetsSummarizer is compatible with Jupyter Notebooks. It needs x and y values, computed from any similarity metric, to generate the similarity plot between datasets. It supports the metadata format produced by the datamart-profiler library, which it uses to generate the Detail View for exploring each dataset.
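
One way to obtain x and y values from a similarity matrix is a force-directed layout. The sketch below is a simplified Fruchterman-Reingold-style layout written with the standard library only. It is an assumption-laden illustration, not the implementation in netgraph_functions.py: the function name, the parameters, and the choice to scale attraction by similarity are all hypothetical.

```python
import math
import random


def force_directed_layout(similarity, iterations=200, seed=0):
    """Place n items in 2D so that more similar items tend to sit closer.

    `similarity` is a symmetric n x n list of lists with values in [0, 1].
    Returns a list of (x, y) tuples, one per item.
    """
    n = len(similarity)
    rng = random.Random(seed)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    k = 1.0 / math.sqrt(max(n, 1))  # preferred spacing between items
    step = 0.1                      # maximum displacement per iteration

    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                # Every pair repels; attraction is scaled by the similarity.
                force = (k * k) / dist - similarity[i][j] * (dist * dist) / k
                fx, fy = dx / dist * force, dy / dist * force
                disp[i][0] += fx
                disp[i][1] += fy
                disp[j][0] -= fx
                disp[j][1] -= fy
        for i in range(n):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            scale = min(length, step) / length  # cap the move at `step`
            pos[i][0] += disp[i][0] * scale
            pos[i][1] += disp[i][1] * scale
        step *= 0.98  # cool down so the layout settles
    return [(x, y) for x, y in pos]


# Items 0 and 1 are identical under the metric; item 2 is unrelated to both.
sim = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
coords = force_directed_layout(sim)
for index, (x, y) in enumerate(coords):
    print(index, round(x, 3), round(y, 3))
```

The fixed seed makes the layout reproducible across runs, which matters when the coordinates are saved and reused for a figure.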

Note that we used another GitHub repo (https://github.com/soniacq/DatasetsVis) to manage our DatasetsSummarizer tool versioning and dependencies.

(Click a dataset in the list of results to open the Detail View.)

Report: It is available here

Video - supported interactions: https://youtu.be/sMwO6fo4SyI

Demo

Live demo (Google Colab):

Dataset results for Taxi query

In Jupyter Notebook (find this demo here):

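import DatasetsSummarizer

# Load the example taxi search results.
data = DatasetsSummarizer.get_taxi_data()
# Render the datasets summary visualization in the notebook.
DatasetsSummarizer.plot_datasets_summary(data)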

Install

Install via pip:

pip install datasets-summarizer

Reproducing

Figures 3, 4, 6, 7, and 8 as referenced in our paper can be reproduced and interacted with using the DatasetsSummarizer_Tool_Demo.ipynb notebook. This demo notebook uses the same metadata that we used to produce the figures in the paper. Using one of the other two notebooks could result in slight variations in results if any of the datasets have changed or been updated since our results were produced.

This project should not require any specialized hardware to reproduce, and any specialized packages have been included in the requirements.txt file and the notebooks themselves. We were able to run this project end to end on a computer with the following specs: Mac with M1 Pro chip, 10 cores, 16 GB of memory.
