Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[csv-to-neptune-bulk-format] : Multiple relational tables as source CSV files #282

Open
Anita-Lavania opened this issue Dec 29, 2022 · 3 comments

Comments

@Anita-Lavania
Copy link

Hi,

How will this work with multiple source CSV files with foreign key columns to join on? Taking the Spotify dataset as e.g., suppose you have source_track.csv that contains [track_ID, track_name, track_artist, album_ID] and another CSV file source_album.csv that contains [album_ID, album_name, album_release_date]. Now if we want to create track node CSV file, album node CSV file and track->album edge CSV file, including the given properties on both track nodes and album nodes, how will that be possible?

In other words, can we do what is described here?

Thanks in advance.

@Anita-Lavania Anita-Lavania changed the title Multiple relational tables as source CSV files [csv-to-neptune-bulk-format] : Multiple relational tables as source CSV files Dec 29, 2022
@bechbd
Copy link

bechbd commented Jan 3, 2023

This can be accomplished by adding multiple files into the fileNames portion of the JSON. For example here is what this might look like for the example you provided:

{
    "source_folder": "source",
    "data_folder": "data",
    "fileNames": [
        "track.csv",        
        "album.csv"
    ],
    "nodes": [
        {
            "csvFileName": "Track.csv",
            "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != ''",
            "id": "uuid()",
            "label": "'Track'",
            "uniqueKey": "row['track_artist'] + '-' + row['track_name']",
            "properties": [
                {
                    "property": "track_name",
                    "key": "track_name"
                },
                {
                    "property": "track_artist",
                    "key": "track_artist"
                },
                {
                    "property": "track_popularity",
                    "key": "track_popularity"
                },
                {
                    "property": "danceability",
                    "key": "danceability"
                },
                {
                    "property": "energy",
                    "key": "energy"
                },
                {
                    "property": "key",
                    "key": "key"
                },
                {
                    "property": "loudness",
                    "key": "loudness"
                },
                
                {
                    "property": "mode",
                    "key": "mode"
                },
                {
                    "property": "speechiness",
                    "key": "speechiness"
                },
                {
                    "property": "acousticness",
                    "key": "acousticness"
                },
                {
                    "property": "instrumentalness",
                    "key": "instrumentalness"
                },
                {
                    "property": "liveness",
                    "key": "liveness"
                },
                {
                    "property": "valence",
                    "key": "valence"
                },
                {
                    "property": "tempo",
                    "key": "tempo"
                },
                {
                    "property": "duration_ms",
                    "key": "duration_ms"
                }
            ]
        },
        {
            "csvFileName": "Album.csv",
            "select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
            "id": "uuid()",
            "label": "'Album'",
            "uniqueKey": "row['track_album_id']",
            "properties": [
                {
                    "property": "name",
                    "value": "row['track_album_name']"
                },
                {
                    "property": "album_release_date",
                    "value": "row['track_album_release_date']"
                }
            ]
        }
    ],
    "edges": [
        {
            "csvFileName": "Track_Album_Edges.csv",
            "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''",
            "id": "uuid()",
            "label": "'IN_ALBUM'",
            "from": "row['track_artist'] + '-' + row['track_name']",
            "to": "row['track_album_id']",
            "fromLabel": "'Track'",
            "toLabel": "'Album'",
            "properties": []
        }
    ]
}

@Anita-Lavania
Copy link
Author

Anita-Lavania commented Jan 4, 2023

Thanks @bechbd.
However, when you do:
{ "csvFileName": "Album.csv", "select": "'track_album_name' in row and row['track_album_name'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'Album'", "uniqueKey": "row['track_album_id']", "properties": [ { "property": "name", "value": "row['track_album_name']" }, { "property": "album_release_date", "value": "row['track_album_release_date']" } ] }

the alum_id, album_name, album_release_date come from the original album.csv.
Now when you go ahead and try to make edges, i.e.,

{ "csvFileName": "Track_Album_Edges.csv", "select": "'track_name' in row and 'track_artist' in row and row['track_name'] != '' and row['track_artist'] != '' and 'track_album_id' in row and row['track_album_id'] != ''", "id": "uuid()", "label": "'IN_ALBUM'", "from": "row['track_artist'] + '-' + row['track_name']", "to": "row['track_album_id']", "fromLabel": "'Track'", "toLabel": "'Album'", "properties": [] }

it will produce an empty edge file because there has been no mapping between tracks and albums.

@bechbd
Copy link

bechbd commented Jan 4, 2023

I think that this might be due to the configuration of your input files. When I ran this using a track CSV with the following header the edges were created correctly.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants