Group Members: Sonia Castelo, Erin McGowan, and Shirley Berry
Course: Big Data CS-GY 6513 Section C
│
├── custom_functions.py <- Python functions that compute dataset similarity using Jaccard similarity
├── netgraph_functions.py <- Python functions that compute x and y values using a force-directed graph layout algorithm
├── taxi_metadata_2023_05_04.csv <- Taxi metadata generated from a search for the "taxi" keyword on May 5, 2023
├── taxi_metadata_2023_05_04_with_similarity_coordinates.csv <- The same taxi metadata with the addition of the x and y coordinates as calculated by the similarity measurements.
├── requirements.txt <- Python package versions
├── DatasetsSummarizer_Tool_Demo.ipynb <- Visualizations of taxi data in Jupyter notebook
├── full_pipeline.ipynb <- See below
├── start_from_metadata.ipynb <- See below
full_pipeline.ipynb
: This notebook takes you through the entire pipeline: data ingestion, calculating dataset similarity, and visualizing the search results with the DatasetsSummarizer. This notebook will show you our results and walk you through the metadata pre-processing and similarity calculations in more detail. You can also use this notebook to change the search keyword and produce new results.

start_from_metadata.ipynb
: This notebook takes you through the similarity calculations, starting from a metadata dataframe that has already been created. This notebook will show you our results and walk you through the similarity calculations in more detail.

DatasetsSummarizer_Tool_Demo.ipynb
: This notebook uses the DatasetsSummarizer library to create the summarizer visualization using pre-generated metadata.
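The Jaccard-based similarity step that the notebooks walk through can be sketched roughly as follows. This is a minimal illustration and not the code in custom_functions.py: the function names `jaccard_similarity` and `pairwise_similarity` and the example column names are hypothetical, and the sketch assumes each dataset is represented by a set of tokens such as its column names.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |A & B| / |A | B| of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


def pairwise_similarity(token_sets):
    """Build a symmetric matrix of Jaccard similarities for all pairs."""
    n = len(token_sets)
    # The diagonal is fixed at 1.0 (each dataset is identical to itself).
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = jaccard_similarity(token_sets[i], token_sets[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical column-name sets for three datasets (illustrative only).
columns = [
    {"pickup_datetime", "dropoff_datetime", "fare_amount"},
    {"pickup_datetime", "fare_amount", "tip_amount"},
    {"license_number", "vehicle_type"},
]
similarity = pairwise_similarity(columns)
```

Here the first two hypothetical datasets share two of four distinct column names, so they score 0.5, while the third shares none with either and scores 0.0.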
Datasets Summarizer is compatible with Jupyter Notebooks. It needs x and y values, computed from any similarity metric, to generate the similarity plot between datasets. It supports the metadata format generated by the datamart-profiler library, which it uses to build the Detail View for exploring each dataset.
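As a rough illustration of how x and y values might be derived from a similarity matrix, the sketch below runs a simplified force-directed layout in plain Python. It is a stand-in written for this README, not the code in netgraph_functions.py and not the API of any library: the function name `force_directed_layout`, its parameters, and the example matrix are all assumptions. Every pair of items repels, and pairs attract in proportion to their similarity, so similar datasets tend to end up closer together.

```python
import math
import random


def force_directed_layout(similarity, iterations=200, seed=0):
    """Return one (x, y) pair per item from a symmetric similarity matrix.

    Simplified Fruchterman-Reingold-style scheme (an assumption of this
    sketch): all pairs repel, and pairs attract in proportion to similarity.
    """
    rng = random.Random(seed)  # seeded so the layout is reproducible
    n = len(similarity)
    pos = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    k = 1.0 / math.sqrt(max(n, 1))  # nominal spacing between items
    step = 0.1  # maximum movement per iteration, reduced over time
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                repulse = k * k / dist
                attract = similarity[i][j] * dist * dist / k
                force = repulse - attract  # positive pushes i away from j
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
        for i in range(n):
            length = math.hypot(disp[i][0], disp[i][1])
            if length > 0:
                scale = min(length, step) / length  # cap the move at `step`
                pos[i][0] += disp[i][0] * scale
                pos[i][1] += disp[i][1] * scale
        step *= 0.98  # cool down so the layout settles
    return [(x, y) for x, y in pos]


# Hypothetical similarity matrix for three datasets (illustrative only).
example_similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
coordinates = force_directed_layout(example_similarity)
```

The resulting pairs could then be stored as x and y columns alongside the metadata. The fixed seed only makes this sketch repeatable; the absolute coordinates carry no meaning beyond relative distances.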
Note that we used another GitHub repo (https://github.com/soniacq/DatasetsVis) to manage our DatasetsSummarizer tool versioning and dependencies.
(Click a dataset in the list of results to open the Detail View.)
Report: It is available here
Video - supported interactions: https://youtu.be/sMwO6fo4SyI
Live demo (Google Colab):
In Jupyter Notebook (find this demo here):
import DatasetsSummarizer
data = DatasetsSummarizer.get_taxi_data()
DatasetsSummarizer.plot_datasets_summary(data)
Install via pip:
pip install datasets-summarizer
Figures 3, 4, 6, 7, and 8 as referenced in our paper can be reproduced and interacted with using the DatasetsSummarizer_Tool_Demo.ipynb notebook. This demo notebook uses the same metadata that we used to produce the figures in the paper. Using one of the other two notebooks could result in slight variations in results if any of the datasets have changed or been updated since our results were produced.
This project should not require any specialized hardware to reproduce, and any specialized packages are listed in the requirements.txt file or installed in the notebooks themselves. We were able to run this project end to end on a Mac with an M1 Pro chip (10 cores) and 16 GB of memory.