
Serialize and Deserialize from S3/URLs #685

Merged
merged 86 commits into master on Aug 2, 2019

Conversation

@jeremyliweishih (Contributor) commented Jul 25, 2019

This pull request builds on #263 by adding read and write functionality for S3 and read functionality for URLs. This was largely done using smart_open. Included are code and tests for serializing/deserializing features and entity sets.

@jeremyliweishih jeremyliweishih changed the title Read from url Serialize and Deserialize from S3/URLs Jul 25, 2019

session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if _is_s3(features) and (session.get_credentials() is None or profile_name is False):

@rwedge (Contributor) commented Jul 30, 2019

If the user doesn't have a default profile but does have a named profile in their credentials file, this logic will fall back to the anonymous way of reading the file.

session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if session.get_credentials() is not None or profile_name is not False:

@rwedge (Contributor) commented Jul 31, 2019

If profile_name is None and session.get_credentials() is None, the logic will try to use smart_open instead of s3fs. Also, if profile_name is False and session.get_credentials() finds credentials, then it will use smart_open instead of s3fs.
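A hypothetical pure function capturing the decision table the two comments above are converging on (the name and shape are illustrative, not featuretools' actual code):

```python
def use_credentialed_access(profile_name, has_default_credentials):
    """Return True to read via s3fs with credentials, False to read
    anonymously via smart_open.

    Hypothetical helper: profile_name may be a profile name (str),
    False (force anonymous access), or None (use the default profile).
    """
    if profile_name is False:
        # The caller explicitly requested an anonymous connection.
        return False
    if isinstance(profile_name, str):
        # A named profile should be honored even when no default
        # profile exists in the credentials file.
        return True
    # profile_name is None: fall back to whatever boto3's default
    # session can find.
    return has_default_credentials
```

Separating the decision from the I/O makes both of the reported edge cases easy to unit-test.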

import pandas as pd

from featuretools.entityset.relationship import Relationship
from featuretools.entityset.serialize import FORMATS
from featuretools.utils.gen_utils import check_schema_version
from featuretools.utils import is_python_2

@rwedge (Contributor) commented Aug 2, 2019

is_python_2 is in the gen_utils file, so we can include it with the other gen_utils imports.

'''Write entityset to disk in the pickle format, location specified by `path`.
Args:
    path (str): location on disk to write to (will be created as a directory)
    compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
    profile_name (str) : Name of AWS profile to use or None.

@rwedge (Contributor) commented Aug 2, 2019

Should mention the False option.
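A sketch of how the docstring might read with the False option documented (the wording is a suggestion, and the method is shown as a free function for brevity):

```python
def to_pickle(path, compression=None, profile_name=None):
    '''Write entityset to disk in the pickle format, location specified by `path`.

    Args:
        path (str): location on disk to write to (will be created as a directory)
        compression (str) : Name of the compression to use.
            Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
        profile_name (str, bool) : Name of the AWS profile to use, False to
            use an anonymous connection, or None for the default profile.
    '''
```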

'''Write entityset to disk in the parquet format, location specified by `path`.
Args:
    path (str): location on disk to write to (will be created as a directory)
    engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow', 'fastparquet'}.
    compression (str) : Name of the compression to use. Possible values are: {'snappy', 'gzip', 'brotli', None}.
    profile_name (str) : Name of AWS profile to use or None.

@rwedge (Contributor) commented Aug 2, 2019

Same mention of False

import boto3

from featuretools.utils import is_python_2
from featuretools.utils.gen_utils import use_s3fs_es, use_smartopen_es

@rwedge (Contributor) commented Aug 2, 2019

move is_python_2 to here

def test_serialize_s3_csv(es, s3_client, s3_bucket):
    es.to_csv(TEST_S3_URL, encoding='utf-8', engine='python')
    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL='public-read-write')

@rwedge (Contributor) commented Aug 2, 2019

You can run one test without adding this ACL, but the second one fails. Maybe the teardown isn't going as we expected? Let's add a TODO about fixing this and we can look into it later

except EnvironmentError:
    pass

f = open(test_path, "w+")

@rwedge (Contributor) commented Aug 2, 2019

Let's convert these to the `with open(...) as f:` format.
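A minimal sketch of the suggested change (the path and file contents are placeholders):

```python
import os
import tempfile

test_path = os.path.join(tempfile.mkdtemp(), "credentials")

# Before: the handle leaks if an exception is raised before f.close().
# f = open(test_path, "w+")

# After: the context manager closes the file even when an error occurs.
with open(test_path, "w+") as f:
    f.write("[default]\n")

with open(test_path) as f:
    contents = f.read()
```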

try:
    os.remove(test_path)
    os.remove(test_path_config)
except EnvironmentError:

@rwedge (Contributor) commented Aug 2, 2019

EnvironmentError seems too broad. We only want to catch the error if os.remove fails because there is no file present, right?
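One way to narrow the handler (a sketch; the helper name is made up). Since the PR still supports Python 2, checking errno on OSError covers both versions; on Python 3 alone, `except FileNotFoundError` would be the idiomatic spelling:

```python
import errno
import os
import tempfile

def remove_if_present(path):
    # Hypothetical helper: swallow only "no such file or directory" and
    # let other OS errors (e.g. permission denied) propagate.
    try:
        os.remove(path)
    except OSError as e:
        if e.errno != errno.ENOENT:
            raise
```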

except EnvironmentError:
    pass

f = open(test_path, "w+")

@rwedge (Contributor) commented Aug 2, 2019

same comment about using with open

def test_features_anon_s3(es):
    # TODO: Feature ordering is different in py3.5 vs 3.6+
    features_original = sorted(ft.dfs(target_entity='sessions', entityset=es, features_only=True), key=lambda x: x.unique_name())
    features_deserialized = sorted(ft.load_features(S3_URL, profile_name=False), key=lambda x: x.unique_name())

@rwedge (Contributor) commented Aug 2, 2019

The only thing that changes between test_deserialize_features_default_s3, test_features_anon_s3, and test_deserialize_features_url is the url path and the profile. We could merge these into a single test that parametrizes url and profile as variables.
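The merge could look roughly like this sketch. The URLs are placeholders, and ft.load_features is stubbed out so the parametrization pattern itself is self-contained and runnable:

```python
import pytest

def load_features_stub(url, profile_name=None):
    # Stand-in for ft.load_features, just to make the pattern runnable here.
    return (url, profile_name)

# One parametrized test replaces test_deserialize_features_default_s3,
# test_features_anon_s3, and test_deserialize_features_url.
@pytest.mark.parametrize("url,profile_name", [
    ("s3://example-bucket/features.json", None),   # default-credential S3 read
    ("s3://example-bucket/features.json", False),  # anonymous S3 read
    ("https://example.com/features.json", None),   # plain URL read
])
def test_deserialize_features(url, profile_name):
    features = load_features_stub(url, profile_name=profile_name)
    assert features == (url, profile_name)
```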


features_deserialized = ft.load_features(TEST_S3_URL)

for feat_1, feat_2 in zip(features_original, features_deserialized):

@rwedge (Contributor) commented Aug 2, 2019

We can make this a helper function that the tests in this file call, to reduce code repetition.
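The helper might look like this sketch; the unique_name comparison mirrors the loop above, and the real tests may compare additional attributes:

```python
def assert_features_equal(features_original, features_deserialized):
    # Shared helper so each serialization test in the file calls one
    # function instead of repeating the pairwise comparison loop.
    assert len(features_original) == len(features_deserialized)
    for feat_1, feat_2 in zip(features_original, features_deserialized):
        assert feat_1.unique_name() == feat_2.unique_name()
```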

    for line in fin:
        fout.write(line)
else:
    with open(file_path + ".tar", 'rb') as fin:

@rwedge (Contributor) commented Aug 2, 2019

I think instead of modifying the file path by adding a tar extension we should update the docstrings to make it clear to users that uploading to s3 creates a tar archive instead of a directory.

'''Write entityset to disk in the pickle format, location specified by `path`.
def to_pickle(self, path, compression=None, profile_name=None):
'''Write entityset in the pickle format, location specified by `path`.
Path could be a local path or a S3 path (will serialize as a tar archive).

@rwedge (Contributor) commented Aug 2, 2019

This could be interpreted as saying that tar archives are created locally as well.
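A possible rewording that avoids the ambiguity (a sketch, shown as a free function; the phrasing is a suggestion, not the merged text):

```python
def to_pickle(path, compression=None, profile_name=None):
    '''Write entityset in the pickle format, location specified by `path`.

    A local path is written as a directory; an S3 path is serialized
    as a single tar archive before upload.
    '''
```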

@rwedge (Contributor) approved these changes on Aug 2, 2019, leaving a comment:

Looks good!

@jeremyliweishih merged commit d00799e into master on Aug 2, 2019

4 checks passed:
- codecov/patch: 98.25% of diff hit (target 97.49%)
- codecov/project: 97.5% (+<.01%) compared to c583ff1
- license/cla: Contributor License Agreement is signed
- test_all_python_versions: Workflow test_all_python_versions
@rwedge referenced this pull request on Aug 19, 2019