ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

sshivam95 · 2024-05-16T12:31:21Z

Hi. I am using dask.dataframe to read a very large dataset of 20TB containing 97 billion Knowledge graph triples. I am using dask.dataframe.read_csv method to read a smaller version of dataset containing 795 million triples of size 152GB. The .txt file contains 4 columns separated by white space. A sample of the dataset:

<http://whale.data.dice-research.org/resource#node21f760a41de19b4a8370fd8f49f6e87e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://purl.org/ontology/wo/kingdom> <http://whale.data.dice-research.org/resource#node4a5dd7cade315a1a7a63e7b6881f18a> .

Context: The dataset consists of KG data with subject, relation, and object as columns 0, 1, and 2 respectively. Column 4 contains '.' meaning the triple has ended. My task is to calculate the total number of triples, and the total number of unique entities and relations in this dataset. Since the file is very huge, I cannot use rdflib as it consumes all the memory.

I was using the following pandas code to read the dataset and calculate the stats required for the dataset:

import pandas as pd

dtype = {'subject': str, 'relation': str, 'object': str}
unique_entities = set()
unique_relations = set()
total_triples = 0

try:
    reader = pd.read_csv(file_path, sep="\s+", header=None, names=['subject', 'relation', 'object'], usecols=[0, 1, 2], dtype=dtype, chunksize=chunk_size, memory_map=True, on_bad_lines='warn')

    for chunk in reader:
        total_triples += len(chunk)
        unique_entities.update(chunk['subject'].unique())  # Update entities from 'subject'
        unique_entities.update(chunk['object'].unique())   # Update entities from 'object'
        unique_relations.update(chunk['relation'].unique())  # Update relations
        
except Exception as e:
    logging.error(f"An error occurred: {str(e)}", exc_info=True)

logging.info(f'Total number of triples: {total_triples}')
logging.info(f'Number of unique entities: {len(unique_entities)}')
logging.info(f'Number of unique relations: {len(unique_relations)}')

For a small dataset of size 84 MB, this works because it uses less RAM to store the data. But for a very very large dataset, it gives an out-of-memory event kill. Therefore, I used dask for that. Here's the code for dask:

import dask.dataframe as dd

ddf = dd.read_csv(file_path, sep="\s+", header=None, usecols=[0, 1, 2], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf[0], ddf[2]], axis=0).drop_duplicates()
unique_relations = ddf[1].drop_duplicates()
total_triples = len(ddf)

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")  # No need to call compute here
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

Now this code works. However, when I try to use the following code:

# Reading the dataset
ddf = dd.read_csv(file_path, sep="\s+", header=None, usecols=[0, 1, 2], names=['subject', 'relation', 'object'], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf['subject'], ddf['object']], axis=0).drop_duplicates()
unique_relations = ddf['relation'].drop_duplicates()

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")  # No need to call compute here
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

It gives me the error at num_unique_entities = unique_entities.compute() stating that "ValueError: An error occurred while calling the read_csv method registered to the pandas backend. Original Message: Number of passed names did not match number of header fields in the file"

I have discussed this issue on the forms here and it turns out to be an issue in the latest version of dask=2024.5.0. It worked when tested with dask=2024.2.1, and it did not raise the issue. It does raise an issue with dask=2024.4.1 like @guillaumeeb said in the form post.

Environment:

Dask version: 2024.5.0
Python version: 3.10.13
Operating System: Ubuntu 24.04 LTS
Install method (conda, pip, source): pip

The text was updated successfully, but these errors were encountered:

phofl · 2024-05-16T23:28:09Z

Hi, thanks for your report. Could you create a reproducible example that adds the content of the csv file? Can be a small sample, doesn't have to be anywhere close to the full file

sshivam95 · 2024-05-17T08:56:35Z

Hi. Sure let me give a detailed example. We have the following N-triples, separated by whitespace in a file called "data.txt":

<http://whale.data.dice-research.org/resource#node21f760a41de19b4a8370fd8f49f6e87e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://purl.org/ontology/wo/kingdom> <http://whale.data.dice-research.org/resource#node4a5dd7cade315a1a7a63e7b6881f18a> .

I am reading this data file using the following code:

# Reading the dataset
ddf = dd.read_csv(file_path, sep="\s+", header=None, usecols=[0, 1, 2], names=['subject', 'relation', 'object'], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf['subject'], ddf['object']], axis=0).drop_duplicates()
unique_relations = ddf['relation'].drop_duplicates()

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")  # No need to call compute here
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

This code gives me the ValueError exception in Dask version 2024.5.0. It works fine in the version dask=2024.2.1.

github-actions bot added the needs triage Needs a response from a contributor label May 16, 2024

phofl added dataframe needs info Needs further information from the user dask-expr and removed needs triage Needs a response from a contributor labels May 16, 2024

phofl mentioned this issue May 17, 2024

Fix read_csv with positional usecols dask/dask-expr#1069

Merged

phofl closed this as completed in dask/dask-expr#1069 May 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

sshivam95 commented May 16, 2024 •

edited

phofl commented May 16, 2024

sshivam95 commented May 17, 2024

ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

Comments

sshivam95 commented May 16, 2024 • edited

phofl commented May 16, 2024

sshivam95 commented May 17, 2024

sshivam95 commented May 16, 2024 •

edited