
ValueError: An error occurred while calling the read_csv method registered to the pandas backend #11125

Closed
sshivam95 opened this issue May 16, 2024 · 2 comments · Fixed by dask/dask-expr#1069
Labels
dask-expr dataframe needs info Needs further information from the user

Comments


sshivam95 commented May 16, 2024

Hi. I am using dask.dataframe to read a very large (20 TB) dataset containing 97 billion knowledge-graph triples. To start, I am using the dask.dataframe.read_csv method to read a smaller version of the dataset containing 795 million triples (152 GB). The .txt file contains 4 columns separated by whitespace. A sample of the dataset:

<http://whale.data.dice-research.org/resource#node21f760a41de19b4a8370fd8f49f6e87e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://purl.org/ontology/wo/kingdom> <http://whale.data.dice-research.org/resource#node4a5dd7cade315a1a7a63e7b6881f18a> .

Context: The dataset consists of KG data with subject, relation, and object as columns 0, 1, and 2 respectively. The fourth column (index 3) contains '.', marking the end of each triple. My task is to calculate the total number of triples and the total number of unique entities and relations in this dataset. Since the file is so large, I cannot use rdflib, as it consumes all the memory.

I was using the following pandas code to read the dataset and calculate the stats required for the dataset:

import logging

import pandas as pd

file_path = "data.txt"   # the N-triples file described above
chunk_size = 1_000_000   # rows per chunk; tuned to available RAM

dtype = {'subject': str, 'relation': str, 'object': str}
unique_entities = set()
unique_relations = set()
total_triples = 0

try:
    reader = pd.read_csv(file_path, sep=r"\s+", header=None,
                         names=['subject', 'relation', 'object'], usecols=[0, 1, 2],
                         dtype=dtype, chunksize=chunk_size, memory_map=True,
                         on_bad_lines='warn')

    for chunk in reader:
        total_triples += len(chunk)
        unique_entities.update(chunk['subject'].unique())  # Update entities from 'subject'
        unique_entities.update(chunk['object'].unique())   # Update entities from 'object'
        unique_relations.update(chunk['relation'].unique())  # Update relations
        
except Exception as e:
    logging.error(f"An error occurred: {str(e)}", exc_info=True)

logging.info(f'Total number of triples: {total_triples}')
logging.info(f'Number of unique entities: {len(unique_entities)}')
logging.info(f'Number of unique relations: {len(unique_relations)}')

For a small dataset of size 84 MB this works, because the chunked reader keeps RAM usage low. But for a very large dataset it ends in an out-of-memory kill, so I switched to Dask. Here's the Dask code:

import dask.dataframe as dd

ddf = dd.read_csv(file_path, sep=r"\s+", header=None, usecols=[0, 1, 2], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf[0], ddf[2]], axis=0).drop_duplicates()
unique_relations = ddf[1].drop_duplicates()
total_triples = len(ddf)

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")  # No need to call compute here
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

Now this code works. However, when I try to use the following code:

# Reading the dataset
ddf = dd.read_csv(file_path, sep=r"\s+", header=None, usecols=[0, 1, 2],
                  names=['subject', 'relation', 'object'], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf['subject'], ddf['object']], axis=0).drop_duplicates()
unique_relations = ddf['relation'].drop_duplicates()
total_triples = len(ddf)  # len() computes eagerly; no explicit compute() needed

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

It fails at num_unique_entities = unique_entities.compute() with: "ValueError: An error occurred while calling the read_csv method registered to the pandas backend. Original Message: Number of passed names did not match number of header fields in the file"
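For comparison, plain pandas accepts the same names/usecols combination on these four-field lines (which is why the chunked pandas reader above runs), so the mismatch seems to come from how Dask infers the schema rather than from the arguments themselves. A minimal, self-contained sketch with hypothetical sample triples inlined (URLs shortened for readability):

```python
import io

import pandas as pd

# Two sample lines in the 4-field layout: subject, relation, object, '.'
data = (
    "<http://example.org/s1> <http://example.org/p1> <http://example.org/o1> .\n"
    "<http://example.org/s2> <http://example.org/p1> <http://example.org/o2> .\n"
)

# names has 3 entries and usecols selects 3 of the 4 fields;
# pandas applies the names to the selected columns without complaint.
df = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None,
                 names=['subject', 'relation', 'object'],
                 usecols=[0, 1, 2], dtype=str)

print(list(df.columns))  # ['subject', 'relation', 'object']
print(len(df))           # 2
```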

I have discussed this issue on the Dask forum, and it turns out to be a regression in the latest version, dask=2024.5.0. The code worked when tested with dask=2024.2.1, but it also raises the error with dask=2024.4.1, as @guillaumeeb noted in the forum post.

Environment:

  • Dask version: 2024.5.0
  • Python version: 3.10.13
  • Operating System: Ubuntu 24.04 LTS
  • Install method (conda, pip, source): pip
@github-actions github-actions bot added the needs triage Needs a response from a contributor label May 16, 2024

phofl commented May 16, 2024

Hi, thanks for your report. Could you create a reproducible example that includes the content of the CSV file? A small sample is fine; it doesn't have to be anywhere near the full file.

@phofl phofl added dataframe needs info Needs further information from the user dask-expr and removed needs triage Needs a response from a contributor labels May 16, 2024
sshivam95 (Author) commented

Hi. Sure, let me give a detailed example. We have the following N-triples, separated by whitespace, in a file called "data.txt":

<http://whale.data.dice-research.org/resource#node21f760a41de19b4a8370fd8f49f6e87e> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/ontology/wo/species> .
<http://whale.data.dice-research.org/resource#nodea7ba3274fe56fb8342b740aef391a3> <http://purl.org/ontology/wo/kingdom> <http://whale.data.dice-research.org/resource#node4a5dd7cade315a1a7a63e7b6881f18a> .

I am reading this data file using the following code:

import dask.dataframe as dd

file_path = "data.txt"  # the sample file above

# Reading the dataset
ddf = dd.read_csv(file_path, sep=r"\s+", header=None, usecols=[0, 1, 2],
                  names=['subject', 'relation', 'object'], dtype=str, blocksize=25e6)

unique_entities = dd.concat([ddf['subject'], ddf['object']], axis=0).drop_duplicates()
unique_relations = ddf['relation'].drop_duplicates()
total_triples = len(ddf)  # len() computes eagerly; no explicit compute() needed

num_unique_entities = unique_entities.compute()
num_unique_relations = unique_relations.compute()

print(f"Total triples: {total_triples}")
print(f"Unique entities: {len(num_unique_entities)}")
print(f"Unique relations: {len(num_unique_relations)}")

This code raises the ValueError exception in Dask version 2024.5.0. It works fine with dask=2024.2.1.
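One possible workaround until the regression is fixed: name all four fields so the name count matches the file, then select the three wanted columns by label. The sketch below demonstrates the pattern with plain pandas (the 'dot' label for the trailing '.' column is a hypothetical name I chose); the same keyword arguments pass through dd.read_csv, though I have not verified this against dask=2024.5.0:

```python
import io

import pandas as pd

# Sample triples with hypothetical shortened URLs
data = (
    "<http://example.org/s1> <http://example.org/p1> <http://example.org/o1> .\n"
    "<http://example.org/s2> <http://example.org/p1> <http://example.org/o2> .\n"
)

# Name every field ('dot' is a throwaway label for the trailing '.'),
# then select by label, so the name count always matches the file.
df = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None,
                 names=['subject', 'relation', 'object', 'dot'],
                 usecols=['subject', 'relation', 'object'], dtype=str)

print(list(df.columns))  # ['subject', 'relation', 'object']
```

With label-based usecols the 'dot' column is dropped at parse time, so no extra memory is spent on it.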
