Serialize and Deserialize from S3/URLs #685
Conversation
session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if _is_s3(features) and (session.get_credentials() is None or profile_name is False):
If the user doesn't have a default profile but does have a named profile in their credentials file, this logic will end up using the anonymous way of reading the file.
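The misroute the reviewer describes can be checked with a small pure-Python stand-in for the condition above (`uses_anonymous_read` is illustrative, not featuretools code; `default_creds` stands in for the default session's credentials):

```python
def uses_anonymous_read(default_creds, profile_name):
    # Mirrors the reviewed check for an S3 path:
    #   session.get_credentials() is None or profile_name is False
    # Note the named profile is never consulted here.
    return default_creds is None or profile_name is False

# A user with no default profile but a valid named profile still lands
# in the anonymous branch -- the bug the comment points out:
assert uses_anonymous_read(None, "my-named-profile") is True
```

The fix would be to check credentials on the named-profile session when `profile_name` is a string, before falling back to anonymous access.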
session = boto3.Session()
if isinstance(profile_name, str):
    transport_params = {'session': boto3.Session(profile_name=profile_name)}
if session.get_credentials() is not None or profile_name is not False:
If `profile_name` is None and `session.get_credentials()` is None, the logic will try to use smartopen instead of s3fs. Also, if `profile_name` is False and `session.get_credentials()` finds credentials, then it will use smartopen instead of s3fs.
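The two cases the reviewer lists can be verified directly against the condition (a sketch; `takes_smartopen_branch` is a stand-in, assuming the guarded branch dispatches to smartopen, which matches the reviewer's reading):

```python
def takes_smartopen_branch(creds_found, profile_name):
    # Mirrors the reviewed check:
    #   session.get_credentials() is not None or profile_name is not False
    return creds_found or profile_name is not False

# Case 1: no credentials and profile_name=None -- smartopen is chosen:
assert takes_smartopen_branch(False, None) is True
# Case 2: credentials found and profile_name=False -- smartopen again:
assert takes_smartopen_branch(True, False) is True
```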
import pandas as pd

from featuretools.entityset.relationship import Relationship
from featuretools.entityset.serialize import FORMATS
from featuretools.utils.gen_utils import check_schema_version
from featuretools.utils import is_python_2
`is_python_2` is in the gen_utils file, so we can include it with the other gen_utils imports.
featuretools/entityset/entityset.py
Outdated
'''Write entityset to disk in the pickle format, location specified by `path`.

Args:
    path (str): location on disk to write to (will be created as a directory)
    compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
    profile_name (str) : Name of AWS profile to use or None.
Should mention the `False` option.
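One way the `profile_name` line could read once the `False` option is documented (a sketch, not final wording):

```python
def to_pickle(self, path, compression=None, profile_name=None):
    '''Write entityset to disk in the pickle format, location specified by `path`.

    Args:
        path (str): location on disk to write to (will be created as a directory)
        compression (str) : Name of the compression to use. Possible values are: {'gzip', 'bz2', 'zip', 'xz', None}.
        profile_name (str, bool) : Name of AWS profile to use; False for an anonymous connection, or None for the default profile.
    '''
```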
featuretools/entityset/entityset.py
Outdated
'''Write entityset to disk in the parquet format, location specified by `path`.

Args:
    path (str): location on disk to write to (will be created as a directory)
    engine (str) : Name of the engine to use. Possible values are: {'auto', 'pyarrow', 'fastparquet'}.
    compression (str) : Name of the compression to use. Possible values are: {'snappy', 'gzip', 'brotli', None}.
    profile_name (str) : Name of AWS profile to use or None.
Same here: mention the `False` option.
featuretools/entityset/serialize.py
Outdated
import boto3

from featuretools.utils import is_python_2
from featuretools.utils.gen_utils import use_s3fs_es, use_smartopen_es
Move `is_python_2` to here.
def test_serialize_s3_csv(es, s3_client, s3_bucket):
    es.to_csv(TEST_S3_URL, encoding='utf-8', engine='python')
    obj = list(s3_bucket.objects.all())[0].key
    s3_client.ObjectAcl(BUCKET_NAME, obj).put(ACL='public-read-write')
You can run one test without adding this ACL, but the second one fails. Maybe the teardown isn't working as we expected? Let's add a TODO about fixing this, and we can look into it later.
except EnvironmentError:
    pass

f = open(test_path, "w+")
Let's convert these to `with open(...) as f` format.
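The conversion the comment asks for, sketched with a temp-file path and placeholder content standing in for the real `test_path` and config data:

```python
import os
import tempfile

test_path = os.path.join(tempfile.mkdtemp(), "config")  # placeholder path

# Before: the handle can leak if a write raises
# f = open(test_path, "w+")
# f.write("[test]\n")
# f.close()

# After: the context manager closes the file even on error
with open(test_path, "w+") as f:
    f.write("[test]\n")
```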
try:
    os.remove(test_path)
    os.remove(test_path_config)
except EnvironmentError:
EnvironmentError seems too broad. We only want to catch the error if os.remove fails because there is no file present, right?
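Two narrower alternatives, using a temp path as a stand-in for `test_path`:

```python
import contextlib
import os
import tempfile

test_path = os.path.join(tempfile.mkdtemp(), "does_not_exist")  # placeholder

# Catch only a missing file, not every OS error:
try:
    os.remove(test_path)
except FileNotFoundError:
    pass

# Or, equivalently:
with contextlib.suppress(FileNotFoundError):
    os.remove(test_path)
```

Caveat: the diff still imports `is_python_2`, and `FileNotFoundError` doesn't exist on Python 2; a 2/3-compatible version would catch `OSError` and re-raise unless `exc.errno == errno.ENOENT`.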
except EnvironmentError:
    pass

f = open(test_path, "w+")
Same comment about using `with open`.
def test_features_anon_s3(es):
    # TODO: Feature ordering is different in py3.5 vs 3.6+
    features_original = sorted(ft.dfs(target_entity='sessions', entityset=es, features_only=True), key=lambda x: x.unique_name())
    features_deserialized = sorted(ft.load_features(S3_URL, profile_name=False), key=lambda x: x.unique_name())
The only thing that changes between test_deserialize_features_default_s3, test_features_anon_s3, and test_deserialize_features_url is the URL path and the profile. We can merge these into a single test that parametrizes the URL and profile as variables.
features_deserialized = ft.load_features(TEST_S3_URL)

for feat_1, feat_2 in zip(features_original, features_deserialized):
We can make this a helper function that the tests in this file call, to reduce code repetition.
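A sketch of such a helper (the name and the compared attribute are illustrative; the real tests may check more than `unique_name`):

```python
def check_features_equal(features_original, features_deserialized):
    """Shared assertion helper for the (de)serialization tests."""
    assert len(features_original) == len(features_deserialized)
    for feat_1, feat_2 in zip(features_original, features_deserialized):
        assert feat_1.unique_name() == feat_2.unique_name()
```

Each test then sorts its two feature lists and calls `check_features_equal` instead of repeating the loop.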
featuretools/utils/gen_utils.py
Outdated
    for line in fin:
        fout.write(line)
else:
    with open(file_path + ".tar", 'rb') as fin:
I think instead of modifying the file path by adding a tar extension we should update the docstrings to make it clear to users that uploading to s3 creates a tar archive instead of a directory.
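The behavior the docstring should spell out, sketched with `tarfile` (paths and the metadata filename are placeholders):

```python
import os
import tarfile
import tempfile

# Locally the entityset is written as a directory of files; before an S3
# upload that directory is packed into a single .tar archive (illustrative).
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "data_description.json"), "w") as f:
    f.write("{}")  # placeholder standing in for the real metadata

archive_path = os.path.join(tempfile.mkdtemp(), "es.tar")
with tarfile.open(archive_path, "w") as tar:
    tar.add(data_dir, arcname="entityset")
```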
featuretools/entityset/entityset.py
Outdated
def to_pickle(self, path, compression=None, profile_name=None):
    '''Write entityset in the pickle format, location specified by `path`.
    Path could be a local path or a S3 path (will serialize as a tar archive).
This could be interpreted as saying that tar archives are created locally as well.
Looks good!
This pull request adds on to #263 by adding read and write functionality for S3 and read functionality for URLs. This was largely done using smart_open. Included are code and tests for serializing/deserializing features and entity sets.