Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LatLong type #57

Merged
merged 21 commits into from Jan 18, 2018
Merged

LatLong type #57

merged 21 commits into from Jan 18, 2018

Conversation

Seth-Rothschild
Copy link
Contributor

The issue in testing comes from mock_ds.py where the mock retail entityset is made with es.entity_from_csv(entity, (line 292). This makes the latlong type in that entityset a string rather than a tuple. The options as I understand them are:

  1. Modify Latitude and Longitude to check if the latlong is a string
  2. Modify entity_from_csv to convert certain strings to tuples
  3. Change the test to do the pandas _from_csv, modify the dataframe and then make entity_from_dataframe.
  4. Leave Latitude and Longitude with no real tests for now.

My gut is to go with 3. Do you have a preference @kmax12?

@codecov-io
Copy link

codecov-io commented Jan 3, 2018

Codecov Report

Merging #57 into master will decrease coverage by 0.56%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #57      +/-   ##
=========================================
- Coverage   87.37%   86.8%   -0.57%     
=========================================
  Files          74      74              
  Lines        6993    7041      +48     
=========================================
+ Hits         6110    6112       +2     
- Misses        883     929      +46
Impacted Files Coverage Δ
tests/testing_utils/mock_ds.py 85.95% <0%> (-8.65%) ⬇️
entityset/entityset.py 83.6% <0%> (-6.42%) ⬇️
tests/entityset_tests/test_es.py 92.85% <0%> (-3.51%) ⬇️
utils/gen_utils.py 64.44% <0%> (-2.23%) ⬇️
tests/entityset_tests/test_pandas_es.py 99.73% <0%> (-0.01%) ⬇️
synthesis/deep_feature_synthesis.py 92.11% <0%> (ø) ⬆️
variable_types/variable.py 78.82% <0%> (+0.25%) ⬆️
primitives/transform_primitive.py 97.75% <0%> (+0.28%) ⬆️
.../feature_function_tests/test_transform_features.py 86.12% <0%> (+0.67%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbc82d5...737dae8. Read the comment docs.

@kmax12
Copy link
Contributor

kmax12 commented Jan 3, 2018

@Seth-Rothschild I agree. Let's go with 3

@Seth-Rothschild
Copy link
Contributor Author

Changed entity_from_csv to entity_from_dataframe in mock_ds.py which lets us put an honest tuple into the make_ecommerce_entityset function. This makes the test pass for Latitude and Longitude but induces 4 new entityset type errors. Three seem to deal with compression of the entityset and the fourth is unicode. They are:

  1. test_glob_entityset in test_es.py
  2. test_gzip_entityset in test_es.py
  3. test_gzip_glob_entityset in test_es.py and
  4. test_combine_variables in test_pandas_es.py

@rwedge any idea what's causing these?

time_index=ti_name,
secondary_time_index=secondary)

df = pd.read_csv(filenames[entity])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding encoding='utf-8' to this read_csv call should fix problem 4.

@rwedge
Copy link
Contributor

rwedge commented Jan 3, 2018

Errors 1-3 stem from tests for how entity_from_csv handles csv paths with glob patterns, csvs compressed with gzip, and a combination of the two situations. Changing the testing entityset to be loaded in using pandas.read_csv instead of entity_from_csv has caused these tests to error.

Taking a step back, if we don't want to support entity_from_csv working with all variable types we could remove/deprecate it and have entity_from_dataframe be the standard way to create entities. These test cases would no longer be necessary.

@Seth-Rothschild
Copy link
Contributor Author

There are three more explicit uses of entityset.from_csv in the codebase.

  1. Two of the examples in the docs (Creating entities and the retail data demo)
  2. The API reference
  3. The function definition itself

As a way forward I'd propose not testing from_csv, replacing the uses in the docs and putting a warning that it will be depreciated in the function call.

The two other things we might want to do with this are to add to the docs that a latlong type has to be explicitly constructed in the dataframe and add a primitive like Haversine to take advantage of the new type. Does that sound like a good plan?

@kmax12
Copy link
Contributor

kmax12 commented Jan 12, 2018

sounds like a good plan to me!

@Seth-Rothschild
Copy link
Contributor Author

Ok - this should be good to go after review. Overall changes to master:

  • Add two columns to mock_ds for testing: latlong and latlong2
  • Add Latitude variable type and description
  • Add Latitude, Longitude and Haversine primitives and tests
  • Add warning to entity_from_csv, remove tests gzip, glob, glob_gzip and test_time_index_components
  • Remove entity_from_csv from docs notebook and example

@Seth-Rothschild Seth-Rothschild changed the title (WIP) LatLong type LatLong type Jan 12, 2018
@kmax12
Copy link
Contributor

kmax12 commented Jan 17, 2018

@Seth-Rothschild what do you think about adding these as default primitives in DFS? And if we do, we should probably add NumCharacters and NumWords as well

@Seth-Rothschild
Copy link
Contributor Author

@kmax12 I think NumCharacters and NumWords are probably worth adding as defaults but Latitude, Longitude and Haversine might not be.

A reason to add them would be that latlongs don't show up in a feature matrix, because they're tuples. That being said, by the time a user has made a LatLong variable (i.e. combined the two columns and marked the variable type as LatLong) they're probably at a stage in processing where they'd want to actively choose the trans_primitives and they'd already have a choice of primitives in mind.

@kmax12
Copy link
Contributor

kmax12 commented Jan 17, 2018

got it. what if we just add the 2 text primitives and Haversine? I think if someone provides two latlong's it useful to automatically return Haversine

@Seth-Rothschild
Copy link
Contributor Author

That sounds like a reasonable plan. Should probably implement this after merging so that I can do both at once, since NumWords and NumCharacters came after this branch.

@Seth-Rothschild
Copy link
Contributor Author

With the merges there's the new error which stems from latlong1 no longer being an array of tuples in python3. Not sure which change has this breaking, but here's the output in python3 versus python2.

python3

(Pdb) latlong1
array([<map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>, <map object at 0x10fcf6860>,
       <map object at 0x10fcf6860>], dtype=object)

python2

(Pdb) latlong1
array([(0.0, 0.0), (5.0, 2.0), (10.0, 4.0), (15.0, 6.0), (20.0, 8.0),
       (0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (0.0, 0.0),
       (0.0, 0.0), (5.0, 2.0), (0.0, 0.0), (7.0, 3.0), (14.0, 6.0)], dtype=object)

@kmax12
Copy link
Contributor

kmax12 commented Jan 18, 2018

in python 3 map is an iterator. look for usages of map(...) and do list(map(...)).

df['latlong'] = map(lambda x: latlong_unstringify(x), df['latlong'])
df['latlong2'] = map(lambda x: latlong_unstringify(x), df['latlong2'])
df['latlong'] = list(map(
lambda x: latlong_unstringify(x), df['latlong']))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this line can actually be

df['latlong'] = df['latlong'].apply(latlong_unstringify)

@kmax12
Copy link
Contributor

kmax12 commented Jan 18, 2018

can we add the new primitives and variable types to the api reference as well as remove entity_from_csv?

@kmax12
Copy link
Contributor

kmax12 commented Jan 18, 2018

This looks good to merge (once Circle CI passes). @Seth-Rothschild can you put up a PR for removing entity_from_csv when you get a chance?

@kmax12 kmax12 merged commit 75d05ec into master Jan 18, 2018
@rwedge rwedge mentioned this pull request Jan 18, 2018
@Seth-Rothschild Seth-Rothschild deleted the latlong-type branch January 18, 2018 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants