Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Type discrepancy between null and undefined #49

Closed
smohiuddin opened this issue Oct 17, 2019 · 15 comments
Closed

Type discrepancy between null and undefined #49

smohiuddin opened this issue Oct 17, 2019 · 15 comments

Comments

@smohiuddin
Copy link

Hi,

I am loading a CSV file with the fromCSV function. When I run detectTypes() on the dataframe at that time, I see something like:
0 string 8.1 phone
1 object 91.9 phone

DataForge is treating empty values / nulls as an object.

After running a transform such as below to uppercase a value, there are some type differences.
dataFrame.transformSeries({ [sourceColumn]: value => value && value.toLocaleUpperCase() })

Running detectTypes() again generates:
0 string 8.1 phone
1 undefined 91.9 phone

Even explicitly returning a null such as:
dataFrame.transformSeries({ [sourceColumn]: value => value ? value.toLocaleUpperCase() : null })

returns an undefined.

It seems there is an inconsistency between fromCSV dynamic typing parsing and transformSeries handling nulls and/or undefined.

@ashleydavis
Copy link
Member

Thanks for your feedback.

Are you able to submit your data so I can test this?

@smohiuddin
Copy link
Author

Attached
test-data.csv.zip

@ashleydavis
Copy link
Member

ashleydavis commented Oct 18, 2019

Thanks. Now you need to help me understand the code that produces the problem.

I've tried the following code and it seems ok:

const dataForge = require("data-forge");
require("data-forge-fs");

const df = await dataForge.readFile("./test-data.csv").parseCSV();

display(df.detectTypes());

Please let me know what changes you would make to that code to produce the problem.

I used Data-Forge Notebook to test that code.

@ashleydavis
Copy link
Member

This is what the result looks like in Data-Forge Notebook:

image

@smohiuddin
Copy link
Author

const dataFrame = fromCSV(csvInput, { dynamicTyping: true });
console.log(df.detectTypes().toString());

__index__  Type     Frequency           Column       
---------  -------  ------------------  -------------
0          string   8.1                 phone        
1          object   91.9                phone  

const updatedDf = dataFrame.transformSeries({ ['phone']: value => value &&  `+1 ${value}` });
console.log(df.detectTypes().toString());

__index__  Type       Frequency           Column       
---------  ---------  ------------------  -------------
0          string     8.1                 phone        
1          undefined  91.9                phone      

Seems like theres a discrepancy between how types are evaluated in fromCSV and after a transformSeries. I have also tried something like:

dataFrame.transformSeries({ ['phone']: value => value ? `+1 ${value}` : null });

@ashleydavis
Copy link
Member

I think your use of transformSeries is wrong.

I suspect it should look like this:

dataFrame.transformSeries({ phone: value => value ? `+1 ${value}` : null });

Also I'm not really sure why you are outputting null as value.

Typically Data-Forge treats undefined as an 'empty' value. But I don't think it does anything special with null.

I think the main problem here is that PapaParse (the CSV parser that Data-Forge uses) is returning null values for empty fields when dynamicTyping is enabled. I might be able to fix this by making Data-Forge treat nulls the same as undefined.

image

@ashleydavis
Copy link
Member

ashleydavis commented Oct 19, 2019

I need to think this through.

For now though you might use this workaround:

image

Basically rewrite null with undefined and it might work the way you expect.

In the meantime I'll see if I can fix this properly.

@ashleydavis
Copy link
Member

@smohiuddin Just letting you know I've just published Data-Forge version 1.7.7 and it now treats null values like it does undefined. This should solve your problem.

Please try it out and let me know if it works for you.

@smohiuddin
Copy link
Author

Thanks for this fix. After further investigation, it appears that data forge is only checking the first element of a column to determine the type instead of the distribution from detectTypes.

test-data-inverse.csv.zip

I am attaching the same data set as before except I added a string as the first element in the zip column and a number as the first element in the city state column.

data forge now detects these column types as string and number respectively.

@ashleydavis
Copy link
Member

ashleydavis commented Oct 24, 2019

Ah, well that's something you can already change!

When you construct your dataframe you should use the considerAllRows option:

const dataframe = new DataFrame({ values: [...], considerAllRows: true });

The reason it doesn't do that by default is because it can be quite expensive to compute for large data sets!

@smohiuddin
Copy link
Author

makes sense!

Whats the best way to do that if using readFile or parseCSV since those options are not available through there.

@ashleydavis
Copy link
Member

ashleydavis commented Oct 24, 2019

I was about to say you shouldn't have this problem if you are using readFile/parseCSV, but then I realise it's again related to the use of dynamicTyping.

You might be best to remove the dynamicTyping field because it doesn't seem to play nicely with your data (this is a feature of PapaParse and I don't have much control over it).

If you remove that you can manually type your data.

See the example on the landing page where it calls parseInts and parseDates to manually parse data types from columns:

image

@smohiuddin
Copy link
Author

The issue is we want to auto-detect the types as we may not know it in advance. can the data forge constructor accept the output of readFileSync?

@ashleydavis
Copy link
Member

This is a change that could definitely be made. Would you like to have a go at it? Be great to have you as a contributor.

I think you would need to make a change to the fromCSV function in Data-Forge and the parseCSV in Data-Forge FS.

If you are interested in doing this maybe we can discuss it more in the Data-Forge Gitter channel?

@ashleydavis
Copy link
Member

Closing due to no activity. Please reopen if this issue still needs attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants