
Serialize and Deserialize from S3/URLs #685

Merged
jeremyliweishih merged 86 commits into master from read_from_url on Aug 2, 2019

Conversation

jeremyliweishih
Contributor

@jeremyliweishih jeremyliweishih commented Jul 25, 2019

This pull request builds on #263 by adding read and write support for S3 and read support for URLs, largely implemented using smart_open. It includes code and tests for serializing and deserializing both feature definitions and entity sets.
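
For illustration, the kind of usage this enables might look like the following sketch; the bucket name is made up, and the entry points are assumed to be the existing to_csv / read_entityset / save_features / load_features calls (profile_name=False requests anonymous access, as in the tests below).

    import featuretools as ft

    es = ft.demo.load_mock_customer(return_entityset=True)

    # write an entity set to S3 and read it back
    es.to_csv("s3://my-bucket/my-es", encoding='utf-8', engine='python')
    es_reloaded = ft.read_entityset("s3://my-bucket/my-es")

    # save feature definitions to S3, then load them back anonymously
    features = ft.dfs(target_entity='sessions', entityset=es, features_only=True)
    ft.save_features(features, "s3://my-bucket/features.json")
    features_reloaded = ft.load_features("s3://my-bucket/features.json", profile_name=False)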

@jeremyliweishih jeremyliweishih changed the title from "Read from url" to "Serialize and Deserialize from S3/URLs" on Jul 25, 2019
session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if _is_s3(features) and (session.get_credentials() is None or profile_name is False):

If the user doesn't have a default profile but does have a named profile in their credentials file, this logic will end up using the anonymous way of reading the file.

session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if session.get_credentials() is not None or profile_name is not False:

If profile_name is None and session.get_credentials() is None, the logic will try to use smartopen instead of s3fs. Also, if profile_name is False and session.get_credentials() finds credentials, then it will use smartopen instead of s3fs.
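
A minimal sketch of the kind of check being asked for, where an explicit False always means anonymous, and a named profile or discoverable default credentials mean authenticated access. This is illustrative only, not the merged code:

    import boto3

    def _use_anonymous_access(profile_name):
        # profile_name=False: the caller explicitly asked for anonymous access
        if profile_name is False:
            return True
        # a named profile should be used even if there is no default profile
        if isinstance(profile_name, str):
            return False
        # profile_name=None: fall back to anonymous only if no default credentials exist
        return boto3.Session().get_credentials() is None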

import pandas as pd

from featuretools.entityset.relationship import Relationship
from featuretools.entityset.serialize import FORMATS
from featuretools.utils.gen_utils import check_schema_version
from featuretools.utils import is_python_2

is_python_2 is in the gen_utils file, so we can include it with the other gen_utils imports.
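
For example, the two lines could become a single gen_utils import:

    from featuretools.utils.gen_utils import check_schema_version, is_python_2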

'''Write entityset to disk in the pickle format, location specified by `path`.

    Args:
        path (str): location on disk to write to (will be created as a directory)
        compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
        profile_name (str) : Name of AWS profile to use or None.

Should mention the False option.
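
One way the parameter description could read once the False option is mentioned (wording is a sketch, not the merged text):

        profile_name (str, bool) : Name of AWS profile to use, None to use the default profile, or False to read/write anonymously.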

'''Write entityset to disk in the parquet format, location specified by `path`.

    Args:
        path (str): location on disk to write to (will be created as a directory)
        engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow', 'fastparquet'}.
        compression (str) : Name of the compression to use. Possible values are: {'snappy', 'gzip', 'brotli', None}.
        profile_name (str) : Name of AWS profile to use or None.

Same comment about mentioning the False option.

import boto3

from featuretools.utils import is_python_2
from featuretools.utils.gen_utils import use_s3fs_es, use_smartopen_es

Move is_python_2 into this gen_utils import.
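
For example:

    from featuretools.utils.gen_utils import is_python_2, use_s3fs_es, use_smartopen_es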

def test_serialize_s3_csv(es, s3_client, s3_bucket):
    es.to_csv(TEST_S3_URL, encoding='utf-8', engine='python')
    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL='public-read-write')

You can run one test without adding this ACL, but the second one fails. Maybe the teardown isn't going as we expected? Let's add a TODO about fixing this, and we can look into it later.
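
The TODO could live right next to the ACL call, for example:

    # TODO: the ACL change should not be needed; investigate whether the bucket
    # fixture teardown is leaving state behind between tests
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL='public-read-write')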

except EnvironmentError:
    pass

f = open(test_path, "w+")

Let's convert these to the with open(...) as f format.
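
For example (the written contents are a stand-in for whatever the test writes today):

    with open(test_path, "w+") as f:
        f.write("")  # stand-in contents; closing the file is now handled by the context manager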

try:
    os.remove(test_path)
    os.remove(test_path_config)
except EnvironmentError:

EnvironmentError seems too broad. We only want to catch the error if os.remove fails because there is no file present, right?
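
Since the code also supports Python 2 (see the is_python_2 imports), one portable way to narrow this is to catch OSError and ignore only a missing file; a sketch:

    import errno
    import os

    try:
        os.remove(test_path)
        os.remove(test_path_config)
    except OSError as e:
        # only ignore "file does not exist"; re-raise anything else
        if e.errno != errno.ENOENT:
            raise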

except EnvironmentError:
    pass

f = open(test_path, "w+")

Same comment about using with open.

def test_features_anon_s3(es):
    # TODO: Feature ordering is different in py3.5 vs 3.6+
    features_original = sorted(ft.dfs(target_entity='sessions', entityset=es, features_only=True), key=lambda x: x.unique_name())
    features_deserialized = sorted(ft.load_features(S3_URL, profile_name=False), key=lambda x: x.unique_name())

The only thing that changes between test_deserialize_features_default_s3, test_features_anon_s3, and test_deserialize_features_url is the URL path and the profile. We could merge these into a single test that parametrizes the URL and profile as variables.
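
A parametrized version could look roughly like this, assuming pytest and featuretools are imported at module level as in the existing tests; URL here stands in for whichever HTTP path test_deserialize_features_url uses:

    import pytest

    @pytest.mark.parametrize("url, profile_name", [
        (S3_URL, None),    # default AWS credentials
        (S3_URL, False),   # anonymous access
        (URL, None),       # read over plain HTTP(S)
    ])
    def test_deserialize_features(es, url, profile_name):
        features_original = sorted(ft.dfs(target_entity='sessions', entityset=es, features_only=True),
                                   key=lambda x: x.unique_name())
        features_deserialized = sorted(ft.load_features(url, profile_name=profile_name),
                                       key=lambda x: x.unique_name())
        for feat_1, feat_2 in zip(features_original, features_deserialized):
            assert feat_1.unique_name() == feat_2.unique_name()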


features_deserialized = ft.load_features(TEST_S3_URL)

for feat_1, feat_2 in zip(features_original, features_deserialized):

We can make this a helper function that the tests in this file call, to reduce code repetition.
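
For example (the name and assertions are illustrative; the real comparison may check more than names):

    def assert_features_equal(features_original, features_deserialized):
        assert len(features_original) == len(features_deserialized)
        for feat_1, feat_2 in zip(features_original, features_deserialized):
            assert feat_1.unique_name() == feat_2.unique_name()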

    for line in fin:
        fout.write(line)
else:
    with open(file_path + ".tar", 'rb') as fin:

I think instead of modifying the file path by adding a tar extension, we should update the docstrings to make it clear to users that uploading to S3 creates a tar archive instead of a directory.

'''Write entityset to disk in the pickle format, location specified by `path`.
def to_pickle(self, path, compression=None, profile_name=None):
'''Write entityset in the pickle format, location specified by `path`.
Path could be a local path or a S3 path (will serialize as a tar archive).

This could be interpreted as saying that tar archives are created locally as well.
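
One wording that keeps the local and S3 behaviors separate (a sketch, not the merged text):

    '''Write entityset to disk in the pickle format, location specified by `path`.
    Path could be a local path or an S3 path; writing to S3 uploads a tar archive
    rather than creating a directory.
    '''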

@rwedge rwedge left a comment

Looks good!

@jeremyliweishih jeremyliweishih merged commit d00799e into master Aug 2, 2019
@rwedge rwedge mentioned this pull request Aug 19, 2019
@rwedge rwedge deleted the read_from_url branch February 19, 2021 22:30