Override constructor with transformation function #33

Open
tewe opened this issue Jul 23, 2020 · 15 comments

Comments

@tewe

tewe commented Jul 23, 2020

This generalises #17.

Currently any class not special-cased by the library must have a constructor that takes a single string.

Even for built-in types this leads to workarounds like the following.

class Time(datetime.time):
    def __new__(cls, value: str):
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))

I think supporting transformation functions would add much-needed flexibility.

One possible syntax might be this.

def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
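
To make the idea concrete, here is a minimal standard-library sketch of per-column transformation functions. It does not use dataclass-csv; `read_rows`, `converters`, and the `Event` class are hypothetical names chosen for illustration, not part of any existing API.

```python
import csv
import datetime
import io
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, Type, TypeVar

T = TypeVar("T")


@dataclass
class Event:
    name: str
    time: datetime.time


def strptime(value: str) -> datetime.time:
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))


def read_rows(
    f, cls: Type[T], converters: Dict[str, Callable[[str], Any]]
) -> Iterator[T]:
    """Hypothetical helper: build `cls` from each CSV row.

    Columns named in `converters` go through their function;
    every other column is passed through as a plain string.
    """
    for row in csv.DictReader(f):
        yield cls(**{k: converters.get(k, str)(v) for k, v in row.items()})


csv_text = "name,time\nstandup,09:30\n"
events = list(read_rows(io.StringIO(csv_text), Event, {"time": strptime}))
print(events)
```

The point of the sketch is that the conversion lives next to the reader, so `Event` itself never needs a string-accepting constructor.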
@dfurtado
Owner

dfurtado commented Jul 27, 2020

I have actually been working on adding similar functionality, but only for datetime values, where I could see the most use cases. I'm implementing it through a new decorator (temporarily) called dtfunc, with which you can specify a function that will be used to parse every datetime value.

The reason I have implemented it only for datetime and not for every type is that similar functionality can be achieved by overriding __post_init__ in the dataclass and modifying the value there.

For instance, if I had a CSV file with a column firstname that I would like to map to the field name in a dataclass User, converting every value to uppercase, I could do:

CSV:

firstname
daniel

Code:

from dataclass_csv import DataclassReader
from dataclasses import dataclass

@dataclass
class User:
    name: str

    def __post_init__(self):
        self.name = self.name.upper()


def main():
    with open("users.csv") as f:
        reader = DataclassReader(f, User)
        reader.map("firstname").to("name")
        data = list(reader)
        print(data)


if __name__ == "__main__":
    main()

Output:

[User(name='DANIEL')]

That way it doesn't change the way of working with dataclasses very much. The method __post_init__ is called for every row that is processed, so it doesn't slow down the process of creating the instances of the dataclass.

@tewe
Author

tewe commented Jul 28, 2020

Your upper example does not change the type of the attribute. My use-case is classes that cannot be instantiated with a string.

It is possible to use __post_init__ for this. But every such method needs to branch to support both strings and the actual type of the attribute. Otherwise the dataclass becomes unusable by code not related to CSV parsing. I feel like that defeats the purpose of the library.
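
As a rough illustration of the branching described above, here is a standalone sketch that does not use dataclass-csv. The `Appointment` class and its `Union` annotation are assumptions made for the example; the point is only that `__post_init__` has to handle both the raw string coming from a CSV row and the real type passed by ordinary code.

```python
import datetime
from dataclasses import dataclass
from typing import Union


@dataclass
class Appointment:
    # Hypothetical field: a CSV reader would hand over a str,
    # while the rest of the program passes a datetime.time.
    start: Union[datetime.time, str]

    def __post_init__(self):
        # The branch is needed so both callers keep working.
        if isinstance(self.start, str):
            hour, minute = self.start.split(":")
            self.start = datetime.time(int(hour), int(minute))


from_csv = Appointment("09:30")                 # value as read from a CSV row
from_code = Appointment(datetime.time(9, 30))   # value from non-CSV code
print(from_csv == from_code)
```

Every field whose type cannot be built from a single string would need a similar branch, which is the boilerplate a transformation function would avoid.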

@karthicraghupathi

@dfurtado Thank you for creating this library. It is great and helped me quickly convert my dataclass to a CSV file.

However I ran into a situation while writing the CSV file and I think @tewe's implementation suggestion could also work while writing. Consider this example:

import sys
from typing import List

from dataclasses import dataclass, field
from dataclass_csv import DataclassWriter


@dataclass
class Score:
    subject: str
    grade: str

    def __str__(self):
        return "{} - {}".format(self.subject, self.grade)


@dataclass
class Student:
    name: str
    scores: List[Score] = field(default_factory=list)

    def __str__(self):
        return self.name


s = Student(
    name="Student 1",
    scores=[Score(subject="Science", grade="A"), Score(subject="Math", grade="A")],
)

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.write()

The output for this is:

name,scores
Student 1,"[('Science', 'A'), ('Math', 'A')]"

Following @tewe's example, to achieve an output like this:

name,scores
Student 1,Science-A|Math-A

We could do something like this:

def format_scores(value) -> str:
    return "|".join(["{}-{}".format(item.subject, item.grade) for item in value])

with sys.stdout as csv_file:
    writer = DataclassWriter(csv_file, [s], Student)
    writer.map("scores").using(format_scores)
    writer.write()

Or is there a better way to achieve this?
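
Since `.using()` is only a proposal at this point, one possible workaround is to skip DataclassWriter and write the rows with the standard `csv` module. This is a minimal sketch under that assumption; `write_rows` and `formatters` are hypothetical names, not dataclass-csv API.

```python
import csv
import io
from dataclasses import dataclass, field, fields
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Score:
    subject: str
    grade: str


@dataclass
class Student:
    name: str
    scores: List[Score] = field(default_factory=list)


def format_scores(value: List[Score]) -> str:
    return "|".join("{}-{}".format(item.subject, item.grade) for item in value)


def write_rows(
    f, rows: Iterable[Any], cls: type, formatters: Dict[str, Callable[[Any], str]]
) -> None:
    """Hypothetical helper: write dataclass instances as CSV rows.

    Fields named in `formatters` are rendered by their function;
    all other fields fall back to str().
    """
    names = [fld.name for fld in fields(cls)]
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(names)
    for row in rows:
        writer.writerow([formatters.get(n, str)(getattr(row, n)) for n in names])


s = Student(
    name="Student 1",
    scores=[Score(subject="Science", grade="A"), Score(subject="Math", grade="A")],
)
out = io.StringIO()
write_rows(out, [s], Student, {"scores": format_scores})
print(out.getvalue())
```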

@dfurtado
Owner

dfurtado commented Mar 2, 2022

Hello @karthicraghupathi , thanks a lot for the kind words about my project. Really appreciate it.

Yes, I like this solution. I am actually working on something along those lines, trying out different solutions. I want to do something that will not feel unfamiliar when it comes to dataclasses usage and also the usage of the dataclass-csv package.

I'll ping you in this issue when I have something done.

@liudonghua123

I got this error when using .using.

Traceback (most recent call last):
  File "D:\code\python\cnki_crawler_playwright\main.py", line 202, in <module>
    main()
  File "D:\code\python\cnki_crawler_playwright\main.py", line 188, in main
    w.map("paper_name").using(format_link)
AttributeError: 'HeaderMapper' object has no attribute 'using'

And from the code, it seems that using does not exist.

class HeaderMapper:
    """The `HeaderMapper` class is used to explicitly map property in a
    dataclass to a header. Useful when the header on the CSV file needs to
    be different from a dataclass property name.
    """

    def __init__(self, callback: Callable[[str], None]):
        def to(header: str) -> None:
            """Specify how a property in the dataclass will be
            displayed in the CSV file

            :param header: Specify the CSV title for the dataclass property
            """
            callback(header)

        self.to: Callable[[str], None] = to


class FieldMapper:
    """The `FieldMapper` class is used to explicitly map a field
    in the CSV file to a specific `dataclass` field.
    """

    def __init__(self, callback: Callable[[str], None]):
        def to(property_name: str) -> None:
            """Specify the dataclass field to receive the value

            :param property_name: The dataclass property that
            will receive the csv value.
            """
            callback(property_name)

        self.to: Callable[[str], None] = to

@karthicraghupathi

@liudonghua123 You are right. It does not exist yet. This thread is to discuss @tewe's proposal and other ways of achieving that. We'll need to wait on @dfurtado to see which direction they take.

@dfurtado
Owner

dfurtado commented Nov 9, 2022

Hi @karthicraghupathi and @liudonghua123, thanks for contributing to this thread and for using the lib.

Yes, I see the use case for this for sure. I will try to put something together and create a PR.

I think the first suggestion seems great; however, it might make the API a bit complicated. The .map function is used when a column in the CSV file is named differently from the dataclass field. E.g.:

Let's say we have a column First Name in the CSV and the dataclass field is defined as firstname:

reader.map("First Name").to("firstname")

When the names don't differ, it would make the API inconsistent, since the argument to map is the name of the column in the CSV, e.g.:

reader.map("firstname").using(fn)

So it would be difficult to use .map in these cases. Perhaps we would need a second function like reader.transform("field").using(fn) or a decorator, e.g.:

from dataclass_csv import transform


def fn(value):
    ...

@dataclass
@transform("firstname", fn)
class User:
    firstname: str
    lastname: str

Please, share your thoughts about these solutions.

@tewe
Author

tewe commented Nov 9, 2022

To me "map using" isn't any less intuitive than "map to using", so I'd avoid introducing another name like transform.

The decorator way breaks down when you have two kinds of CSV you want to map to the same class.
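
A small standalone sketch of that situation, using only the standard library: two CSV layouts with different date formats feed the same class, so the converter has to be chosen per reader rather than fixed on the class. `Order`, `load`, and the two parsing functions are hypothetical names for this example.

```python
import csv
import datetime
import io
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Order:
    id: str
    placed: datetime.date


def iso_date(value: str) -> datetime.date:
    return datetime.date.fromisoformat(value)


def european_date(value: str) -> datetime.date:
    return datetime.datetime.strptime(value, "%d/%m/%Y").date()


def load(f, parse_date: Callable[[str], datetime.date]) -> List[Order]:
    # The converter is an argument of the read call, not a property of
    # Order, so each CSV layout can bring its own.
    return [Order(row["id"], parse_date(row["placed"])) for row in csv.DictReader(f)]


from_iso = load(io.StringIO("id,placed\n1,2020-07-23\n"), iso_date)
from_eu = load(io.StringIO("id,placed\n1,23/07/2020\n"), european_date)
print(from_iso == from_eu)
```

A class-level decorator could register only one of these functions for `placed`, which is why a per-reader mapping fits this case better.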

@karthicraghupathi

@dfurtado thanks for continuing to work on this. I agree with @tewe. It just feels intuitive and pythonic when I see map.using() or map.to().using().

@dfurtado
Owner

dfurtado commented Nov 9, 2022

Hello 👋🏼 ,

As I have explained above, having something like reader.map("name").using(fn), with name here being the name of the dataclass property, would be a breaking change, since the argument of .map is the name of the column in the CSV file. I really don't want to change that because I know there is a lot of code out there that would break.

It could work to do something like reader.map("name").to("name").using(fn); however, it would look strange, especially when the dataclass property name matches the name of the column in the CSV file. In this particular case it would require users to add code that is unnecessary and repetitive.

It would be fine for reader.map("First name").to("name").using(fn), but when the names match it seems wrong to write that explicitly when the lib does all the mapping automatically.

I have to put more thought into this one to find a good solution that will look nice without breaking the current functionality. 🤔

@tewe
Author

tewe commented Nov 10, 2022

Sorry, I didn't catch that distinction the first time. But what's wrong with reader.map("csv_column_that_matches_a_dataclass_attribute").using(f)?

@liudonghua123

I have another question: is there any way to split complex object properties into different columns?

Say I have the following classes to serialize to CSV.

@dataclass
class Link:
    title: str
    url: str

@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link

I would expect paper_name to be split into paper_name.title and paper_name.url columns.
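
For reference, this kind of flattening can be sketched with the standard library alone. The `flatten` function below is a hypothetical helper, not part of dataclass-csv, and the example leaves out the authors field because a list of links does not map onto a fixed set of columns.

```python
import csv
import io
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict


@dataclass
class Link:
    title: str
    url: str


@dataclass
class SearchResult:
    paper_name: Link
    publication: Link


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Recursively expand nested dataclasses into dotted column names."""
    row: Dict[str, Any] = {}
    for fld in fields(obj):
        value = getattr(obj, fld.name)
        key = prefix + fld.name
        if is_dataclass(value):
            row.update(flatten(value, key + "."))
        else:
            row[key] = value
    return row


result = SearchResult(
    paper_name=Link("A paper", "https://example.com/p"),
    publication=Link("A journal", "https://example.com/j"),
)
row = flatten(result)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(row), lineterminator="\n")
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```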

@tewe
Author

tewe commented Nov 16, 2022

@liudonghua123 I think that is a separate issue.

I haven't tried whether mapping a field twice already works.

writer.map("paper_name").to("title")
writer.map("paper_name").to("url")

But you'd additionally need something like the proposed API.

writer.map("paper_name").to("url").using(lambda n: f"https://doi/{n}")

@liudonghua123

@tewe Thanks, I will open a new issue to track. 😄

@mgperry

mgperry commented Mar 23, 2023

Hey, I took a look at this, and it's currently possible to do this with an ordinary function by overriding the type_hints attribute on the Reader class.

test.csv:

name,values
A,1;2;3
B,8;9
C,3

then run:

from dataclass_csv import DataclassReader
import dataclasses

@dataclasses.dataclass
class Variable:
    name: str
    values: list[int]


# define our conversion function
read_vals = lambda s: [int(x) for x in s.split(";")]

with open("test.csv") as fh:
    reader = DataclassReader(fh, Variable)

    # monkey patch reader: use our function instead of the list[int] hint
    reader.type_hints["values"] = read_vals

    for var in reader:
        print(var)

Of course you can package them up in a nice method if you want :)
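
One way such a wrapper might look is sketched below. `with_converters` is a hypothetical name; the only thing it relies on is the `type_hints` dictionary on the reader used in the snippet above, which is not a documented extension point and could change between releases.

```python
from typing import Any, Callable


def with_converters(reader: Any, **converters: Callable[[str], Any]) -> Any:
    """Swap entries of `reader.type_hints` for the given callables.

    Assumes `reader` exposes a mutable `type_hints` dict keyed by
    dataclass field name, as in the monkey patch shown above.
    """
    for field_name, fn in converters.items():
        reader.type_hints[field_name] = fn
    return reader


# Intended usage with the reader from the example above (not run here):
# reader = with_converters(
#     DataclassReader(fh, Variable),
#     values=lambda s: [int(x) for x in s.split(";")],
# )
```

Returning the reader lets the call be chained with the constructor, so the patching stays in one place.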


5 participants