Override constructor with transformation function #33

tewe · 2020-07-23T14:48:05Z

This generalises #17.

Currently any class not special-cased by the library must have a constructor that takes a single string.

Even for built-in types this leads to workarounds like the following.

import datetime


class Time(datetime.time):
    def __new__(cls, value: str):
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))

I think supporting transformation functions would add much-needed flexibility.

One possible syntax might be this.

def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
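
To make the idea concrete, here is a minimal sketch that uses only the standard library, not dataclass-csv itself. The helper read_rows, its converters argument and the Shift class are assumptions made up for illustration; they are not part of the library.

```python
import csv
import dataclasses
import datetime
import io
from typing import Any, Callable, Dict, Iterator


@dataclasses.dataclass
class Shift:
    name: str
    time: datetime.time


def strptime(value: str) -> datetime.time:
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))


def read_rows(f, cls, converters: Dict[str, Callable[[str], Any]]) -> Iterator[Any]:
    """Yield one `cls` instance per CSV row (hypothetical helper)."""
    for row in csv.DictReader(f):
        kwargs = {}
        for field in dataclasses.fields(cls):
            # Use the registered function if there is one, else keep the string.
            convert = converters.get(field.name, str)
            kwargs[field.name] = convert(row[field.name])
        yield cls(**kwargs)


source = io.StringIO("name,time\nearly,06:30\n")
shifts = list(read_rows(source, Shift, {"time": strptime}))
print(shifts)
```

The dataclass itself stays free of any string handling; only the reading code knows about the CSV representation.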

dfurtado · 2020-07-27T18:04:13Z

I have actually been working on similar functionality, but only for datetime values, where I could see the most use cases. I'm implementing it as a new decorator, (temporarily) called dtfunc, with which you can specify a function that will be used to parse every datetime value.

The reason I have implemented it only for datetime and not for every type is that similar functionality can be achieved by overriding __post_init__ in the dataclass and modifying the value there.

For instance, if I had a CSV file with a column firstname that I wanted to map to the field name in a dataclass User, converting every value to uppercase, I could do:

CSV:

firstname
daniel

Code:

from dataclass_csv import DataclassReader
from dataclasses import dataclass

@dataclass
class User:
    name: str

    def __post_init__(self):
        self.name = self.name.upper()


def main():
    with open("users.csv") as f:
        reader = DataclassReader(f, User)
        reader.map("firstname").to("name")
        data = list(reader)
        print(data)


if __name__ == "__main__":
    main()

Output:

[User(name='DANIEL')]

That way it doesn't change the way of working with dataclasses very much. The __post_init__ method is called for every row that is processed, so it doesn't slow down the process of creating the instances of the dataclass.

tewe · 2020-07-28T00:26:07Z

Your upper example does not change the type of the attribute. My use-case is classes that cannot be instantiated with a string.

It is possible to use __post_init__ for this. But every such method needs to branch to support both strings and the actual type of the attribute. Otherwise the dataclass becomes unusable by code not related to CSV parsing. I feel like that defeats the purpose of the library.
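
To illustrate the branching, here is a minimal sketch that does not depend on the library. The Shift class and its start field are made up for the example.

```python
import dataclasses
import datetime
from typing import Union


@dataclasses.dataclass
class Shift:
    # The annotation has to admit str only because of the CSV code path.
    start: Union[str, datetime.time]

    def __post_init__(self):
        # CSV parsing passes a string; all other callers pass a real time.
        if isinstance(self.start, str):
            hour, minute = self.start.split(":")
            self.start = datetime.time(int(hour), int(minute))


from_csv = Shift("06:30")                # what a CSV reader would do
from_code = Shift(datetime.time(6, 30))  # what ordinary code would do
print(from_csv == from_code)
```

Every attribute of this kind needs its own isinstance check, and the type annotation no longer says what the attribute actually holds after construction.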

karthicraghupathi · 2022-02-28T02:14:25Z

@dfurtado Thank you for creating this library. It is great and helped me quickly convert my dataclass to a CSV file.

However I ran into a situation while writing the CSV file and I think @tewe's implementation suggestion could also work while writing. Consider this example:

import sys
from typing import List

from dataclasses import dataclass, field
from dataclass_csv import DataclassWriter


@dataclass
class Score:
    subject: str
    grade: str

    def __str__(self):
        return "{} - {}".format(self.subject, self.grade)


@dataclass
class Student:
    name: str
    scores: List[Score] = field(default_factory=list)

    def __str__(self):
        return self.name


s = Student(
    name="Student 1",
    scores=[Score(subject="Science", grade="A"), Score(subject="Math", grade="A")],
)

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.write()

The output for this is:

name,scores
Student 1,"[('Science', 'A'), ('Math', 'A')]"

Following @tewe's example, to achieve an output like this:

name,scores
Student 1,Science-A|Math-A

We could do something like this:

def format_scores(value) -> str:
    return "|".join(["{}-{}".format(item.subject, item.grade) for item in value])

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.map("scores").using(format_scores)
    writer.write()

Or is there a better way to achieve this?
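
Until something like this exists, one possible workaround is to bypass DataclassWriter and format the fields by hand with the standard csv module. This is only a sketch: write_rows and its formatters argument are hypothetical names, not part of dataclass-csv.

```python
import csv
import dataclasses
import io
from typing import Any, Callable, Dict, Iterable, List


@dataclasses.dataclass
class Score:
    subject: str
    grade: str


@dataclasses.dataclass
class Student:
    name: str
    scores: List[Score] = dataclasses.field(default_factory=list)


def format_scores(value: List[Score]) -> str:
    return "|".join("{}-{}".format(item.subject, item.grade) for item in value)


def write_rows(f, rows: Iterable[Any], cls, formatters: Dict[str, Callable[[Any], str]]) -> None:
    """Write a header and one line per row, formatting the named fields."""
    names = [field.name for field in dataclasses.fields(cls)]
    writer = csv.writer(f)
    writer.writerow(names)
    for row in rows:
        # getattr keeps nested dataclasses intact, so formatters see real objects.
        writer.writerow([formatters.get(n, str)(getattr(row, n)) for n in names])


buffer = io.StringIO()
student = Student("Student 1", [Score("Science", "A"), Score("Math", "A")])
write_rows(buffer, [student], Student, {"scores": format_scores})
print(buffer.getvalue())
```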

dfurtado · 2022-03-02T18:49:18Z

Hello @karthicraghupathi, thanks a lot for the kind words about my project. I really appreciate it.

Yes, I like this solution. I am actually working on something along those lines, trying out different solutions. I want to do something that will not feel unfamiliar when it comes to dataclasses usage and also the usage of the dataclass-csv package.

I'll ping you in this issue when I have something done.

liudonghua123 · 2022-11-08T17:07:10Z

I got this error when using .using.

Traceback (most recent call last):
  File "D:\code\python\cnki_crawler_playwright\main.py", line 202, in <module>
    main()
  File "D:\code\python\cnki_crawler_playwright\main.py", line 188, in main
    w.map("paper_name").using(format_link)
AttributeError: 'HeaderMapper' object has no attribute 'using'

And from the code, it seems that using does not exist.

dataclass-csv/dataclass_csv/header_mapper.py

Lines 4 to 19 in 2dc71be

    
class HeaderMapper:
    """The `HeaderMapper` class is used to explicitly map property in a
    dataclass to a header. Useful when the header on the CSV file needs to
    be different from a dataclass property name.
    """

    def __init__(self, callback: Callable[[str], None]):
        def to(header: str) -> None:
            """Specify how a property in the dataclass will be
            displayed in the CSV file

            :param header: Specify the CSV title for the dataclass property
            """
            callback(header)

        self.to: Callable[[str], None] = to

dataclass-csv/dataclass_csv/field_mapper.py

Lines 4 to 18 in 2dc71be

    
class FieldMapper:
    """The `FieldMapper` class is used to explicitly map a field
    in the CSV file to a specific `dataclass` field.
    """

    def __init__(self, callback: Callable[[str], None]):
        def to(property_name: str) -> None:
            """Specify the dataclass field to receive the value

            :param property_name: The dataclass property that
            will receive the csv value.
            """
            callback(property_name)

        self.to: Callable[[str], None] = to

karthicraghupathi · 2022-11-08T18:42:27Z

@liudonghua123 You are right. It does not exist yet. This thread is to discuss @tewe's proposal and other ways of achieving that. We'll need to wait on @dfurtado to see which direction they take.

dfurtado · 2022-11-09T07:32:20Z

Hi @karthicraghupathi and @liudonghua123, thanks for contributing to this thread and for using the lib.

Yes, I see the use case for this for sure. I will try to put something together and create a PR.

I think the first suggestion seems great; however, it might make the API a bit complicated. The .map function is used when a column in the CSV file is named differently from the dataclass field. E.g.:

Let's say we have a column First Name in the CSV and the dataclass field is defined as firstname:

reader.map("First Name").to("firstname")

In a case where the names don't differ, it would make the API inconsistent, since the argument to map is the name of the column in the CSV, e.g.:

reader.map("firstname").using(fn)

So it would be difficult to use .map in these cases. Perhaps we would need a second function like reader.transform("field").using(fn) or a decorator, e.g.:

from dataclasses import dataclass

from dataclass_csv import transform  # proposed, does not exist yet


def fn(value):
    ...

@dataclass
@transform("firstname", fn)
class User:
    firstname: str
    lastname: str

Please, share your thoughts about these solutions.
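
For discussion, here is a rough sketch of how such a decorator could behave. Everything in it is an assumption: transform is written from scratch here, the __csv_transforms__ attribute is an invented name, and build stands in for whatever the reader would do internally.

```python
import dataclasses
from typing import Any, Callable, Dict


def transform(field_name: str, fn: Callable[[str], Any]):
    """Hypothetical decorator: record `fn` as the converter for `field_name`."""
    def decorator(cls):
        # Copy the registry so a subclass does not mutate its parent's.
        registry = dict(getattr(cls, "__csv_transforms__", {}))
        registry[field_name] = fn
        cls.__csv_transforms__ = registry
        return cls
    return decorator


@dataclasses.dataclass
@transform("firstname", str.title)
class User:
    firstname: str
    lastname: str


def build(cls, row: Dict[str, str]):
    """Stand-in for the reader: apply registered converters, then construct."""
    transforms = getattr(cls, "__csv_transforms__", {})
    return cls(**{key: transforms.get(key, str)(value) for key, value in row.items()})


user = build(User, {"firstname": "daniel", "lastname": "furtado"})
print(user)
```

In this sketch the converters are stored on the class, so every reader that produces User shares the same ones.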

tewe · 2022-11-09T11:31:15Z

To me "map using" isn't any less intuitive than "map to using", so I'd avoid introducing another name like transform.

The decorator way breaks down when you have two kinds of CSV you want to map to the same class.

karthicraghupathi · 2022-11-09T16:28:34Z

@dfurtado thanks for continuing to work on this. I agree with @tewe. It just feels intuitive and pythonic when I see map.using() or map.to().using().

dfurtado · 2022-11-09T19:18:18Z

Hello 👋🏼 ,

As I explained above, having something like reader.map("name").using(fn), with name here being the name of the dataclass property, would be a breaking change, since the argument of .map is the name of the column in the CSV file. I really don't want to change that because I know there is a lot of code out there that would break.

It could work to do something like reader.map("name").to("name").using(fn); however, it would look strange, especially when the dataclass property name matches the name of the column in the CSV file. In this particular case it would require users to add code that is unnecessary and repetitive.

It would be fine for reader.map("First name").to("name").using(fn), but when the names match it seems wrong to write that explicitly when the lib does all the mapping automatically.

I have to put more thought into this one to find a good solution that will look nice without breaking the current functionality. 🤔

tewe · 2022-11-10T16:23:41Z

Sorry, I didn't catch that distinction the first time. But what's wrong with reader.map("csv_column_that_matches_a_dataclass_attribute").using(f)?
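
A sketch of what I mean, written from scratch rather than against the real FieldMapper (whose to currently returns None, so chaining would need it to return the mapper). FieldMapping and SketchReader are hypothetical names. The argument to map is always the CSV column, and the target property defaults to the same name unless .to() overrides it.

```python
from typing import Any, Callable, Dict, Optional


class FieldMapping:
    """Hypothetical fluent mapper; `.to()` and `.using()` are both optional."""

    def __init__(self, column: str):
        self.column = column
        self.property_name = column  # default: property named like the column
        self.converter: Optional[Callable[[str], Any]] = None

    def to(self, property_name: str) -> "FieldMapping":
        self.property_name = property_name
        return self  # returning self is what allows .to(...).using(...)

    def using(self, converter: Callable[[str], Any]) -> "FieldMapping":
        self.converter = converter
        return self


class SketchReader:
    def __init__(self):
        self.mappings: Dict[str, FieldMapping] = {}

    def map(self, column: str) -> FieldMapping:
        # The key is always the CSV column name, as with the existing .map().
        return self.mappings.setdefault(column, FieldMapping(column))

    def convert_row(self, row: Dict[str, str]) -> Dict[str, Any]:
        out: Dict[str, Any] = {}
        for column, raw in row.items():
            mapping = self.mappings.get(column)
            if mapping is None:
                out[column] = raw  # unmapped columns pass through unchanged
            elif mapping.converter is None:
                out[mapping.property_name] = raw
            else:
                out[mapping.property_name] = mapping.converter(raw)
        return out


reader = SketchReader()
reader.map("First Name").to("firstname")
reader.map("age").using(int)
result = reader.convert_row({"First Name": "Ada", "age": "36", "city": "London"})
print(result)
```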

liudonghua123 · 2022-11-15T07:26:28Z

I have another question: is there any way to split complex object properties into different columns?

Say I have the following classes to serialize to CSV.

from dataclasses import dataclass


@dataclass
class Link:
    title: str
    url: str


@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link

I would expect paper_name to be split into paper_name.title and paper_name.url columns.
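
Roughly what I have in mind, sketched with the standard library only. The flatten helper is a made-up name, and I left out the authors list here because it is not obvious how a list should map to columns.

```python
import csv
import dataclasses
import io
from typing import Any, Dict


@dataclasses.dataclass
class Link:
    title: str
    url: str


@dataclasses.dataclass
class SearchResult:
    paper_name: Link
    publication: Link


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dataclass instances into {"outer.inner": value} pairs."""
    flat: Dict[str, Any] = {}
    for field in dataclasses.fields(obj):
        value = getattr(obj, field.name)
        key = prefix + field.name
        # is_dataclass is also true for classes, so exclude types explicitly.
        if dataclasses.is_dataclass(value) and not isinstance(value, type):
            flat.update(flatten(value, key + "."))
        else:
            flat[key] = value
    return flat


result = SearchResult(
    paper_name=Link("A paper", "https://example.com/paper"),
    publication=Link("A journal", "https://example.com/journal"),
)
row = flatten(result)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```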

tewe · 2022-11-16T02:04:18Z

@liudonghua123 I think that is a separate issue.

I haven't tried whether mapping a field twice already works.

writer.map("paper_name").to("title")
writer.map("paper_name").to("url")

But you'd additionally need something like the proposed API.

writer.map("paper_name").to("url").using(lambda n: f"https://doi/{n}")

liudonghua123 · 2022-11-16T02:37:16Z

@tewe Thanks, I will open a new issue to track. 😄

mgperry · 2023-03-23T20:33:28Z

Hey, I took a look at this, and it's currently possible to do this with an ordinary function by overriding the type_hints attribute on the reader object.

test.csv:

name,values
A,1;2;3
B,8;9
C,3

then run:

from dataclass_csv import DataclassReader
import dataclasses

@dataclasses.dataclass
class Variable:
    name: str
    values: list[int]


# define our conversion function
def read_vals(s):
    return [int(x) for x in s.split(";")]


with open("test.csv") as fh:
    reader = DataclassReader(fh, Variable)

    # monkey patch reader
    reader.type_hints["values"] = read_vals

    for var in reader:
        print(var)

Of course you can package them up in a nice method if you want :)

liudonghua123 mentioned this issue Nov 16, 2022

Support to split object properties to separate columns. #54

Open
