XLinearModel (trained) when saving it generates an error OSError: [Errno 95] Operation not supported #247

Closed
preetbawa opened this issue Jul 17, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@preetbawa

Description

We are trying to use PECOS for an XMR problem: ranking autocomplete suggestions given a query (say "clinical") and a prefix ("ac"), where a suggestion could be "acupuncture". This appears to be implemented for a similar Amazon use case in pecosq2q.py under examples/qp2q/models/

I am able to get to the point of training an XLinear model. I am following the examples in pecosq2q.py but not using that class directly; I picked up snippets, built my label embeddings and cluster matrix chain, and then built the XLinear model.

How to Reproduce?

from pecos.xmc.xlinear.model import XLinearModel

xlinear_model = XLinearModel.train(
    input_feature_matrix,
    labels_y_ohe_matrix,
    cluster_matrix,
    threads=16,
    Cp=1.0,
    Cn=1.0,
    threshold=0.1,
)

xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")

Steps to reproduce


The last line above generates the error:
xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")

I noticed that the XLinearModel attribute self.model is of type HierarchicalMLModel,
and it successfully executes the following lines of the save method in the XLinearModel class:

param = self.append_meta({})
with open(f"{model_folder}/param.json", "w", encoding="utf-8") as fout:
    fout.write(json.dumps(param, indent=True))

I can tell because "param.json" exists under the base path I am using.

The problem arises at line 103 of pecos/xmc/xlinear/model.py:
self.model.save(path.join(model_folder, "ranker"))

I do see the ranker folder as well, with another param.json inside it, which seems to isolate the failure to this part:

for d in range(self.depth):
    local_folder = f"{folder}/{d}.model"
    self.model_chain[d].save(local_folder)

Error message or code output

OSError Traceback (most recent call last)
OSError: [Errno 95] Operation not supported

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
in <cell line: 15>()
13 )
14
---> 15 xlinear_model.save("/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/")

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/xlinear/model.py in save(self, model_folder)
101 with open(f"{model_folder}/param.json", "w", encoding="utf-8") as fout:
102 fout.write(json.dumps(param, indent=True))
--> 103 self.model.save(path.join(model_folder, "ranker"))
104
105 @classmethod

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
1317 for d in range(self.depth):
1318 local_folder = f"{folder}/{d}.model"
-> 1319 self.model_chain[d].save(local_folder)
1320
1321 @classmethod

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
789 with open("{}/param.json".format(folder), "w") as f:
790 f.write(json.dumps(param, indent=True))
--> 791 smat_util.save_matrix("{}/W.npz".format(folder), self.W)
792 smat_util.save_matrix("{}/C.npz".format(folder), self.C)
793

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/utils/smat_util.py in save_matrix(tgt, mat)
96 smat.save_npz(tgt_file, mat, compressed=False)
97 else:
---> 98 raise NotImplementedError("Save not implemented for matrix type {}".format(type(mat)))

Environment

  • Operating system: Databricks Runtime 11.2 LTS ML
  • Python version: 3.9
  • PECOS version: latest mainline


@preetbawa preetbawa added the bug Something isn't working label Jul 17, 2023
@preetbawa
Author

Even though I don't use the PecosQP2QModel class directly, I did use XLinearModel to train the model, and I used the following code to check the matrix types for each item in the model chain:

from pecos.xmc.xlinear.model import XLinearModel

xlinear_model = XLinearModel.train(
    input_feature_matrix,
    labels_y_ohe_matrix,
    cluster_matrix,
    threads=16,
    Cp=1.0,
    Cn=1.0,
    threshold=0.1,
)
for i in range(3):
    print(f"Cluster Matrix for model chain index {i} is of type {type(xlinear_model.model.model_chain[i].C)}")
    print(f"Weight matrix for model chain index {i} is of type {type(xlinear_model.model.model_chain[i].W)}")

This produced the following output:
Cluster Matrix for model chain index 0 is of type <class 'scipy.sparse.csc.csc_matrix'>
Weight matrix for model chain index 0 is of type <class 'scipy.sparse.csc.csc_matrix'>
Cluster Matrix for model chain index 1 is of type <class 'scipy.sparse.csc.csc_matrix'>
Weight matrix for model chain index 1 is of type <class 'scipy.sparse.csc.csc_matrix'>
Cluster Matrix for model chain index 2 is of type <class 'scipy.sparse.csc.csc_matrix'>
Weight matrix for model chain index 2 is of type <class 'scipy.sparse.csc.csc_matrix'>

I am not sure why it fails to save the model_chain while iterating up to depth. Why is it failing here?
for d in range(self.depth):
    local_folder = f"{folder}/{d}.model"
    self.model_chain[d].save(local_folder)

@preetbawa
Author

I would appreciate it if any of the core authors of PECOS could respond, as we can't serve this model via an API unless we save it to disk and then load it from there into our application.

@preetbawa
Author

@rofuyu @Patrick-H-Chen guys any ideas?

@preetbawa
Author

I noticed that the type check in the save_matrix method of smat_util.py looks like this:
with open(tgt, "wb") as tgt_file:
    if isinstance(mat, np.ndarray):
        np.save(tgt_file, mat, allow_pickle=False)
    elif isinstance(mat, smat.spmatrix):
        smat.save_npz(tgt_file, mat, compressed=False)

I think our cluster chain matrices are of type csc_matrix, and spmatrix is the base class, so I am not sure what is happening here; isinstance should return True for an instance of a subclass. I can try cloning the PECOS GitHub repo on my Databricks cluster and doing a local pip install with changes that print the matrix type.
@jiong-zhang any ideas? Help would be appreciated.
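
For reference, a minimal sketch of that type check outside of PECOS, assuming NumPy and SciPy are installed. The 3x3 identity matrix is a made-up stand-in for a weight matrix, and the in-memory buffer replaces the real target file. It suggests that the isinstance branch itself accepts a csc_matrix and that save_npz works when the target supports normal file operations:

```python
import io

import numpy as np
import scipy.sparse as smat

# Hypothetical stand-in for one model_chain[d].W matrix.
W = smat.csc_matrix(np.eye(3, dtype=np.float32))

# The same checks save_matrix performs, in the same order.
is_dense = isinstance(W, np.ndarray)
is_sparse = isinstance(W, smat.spmatrix)

# Saving to an in-memory buffer, which supports seek(), and loading it back.
buffer = io.BytesIO()
smat.save_npz(buffer, W, compressed=False)
buffer.seek(0)
W_loaded = smat.load_npz(buffer)

# Number of entries that differ between the original and the reloaded matrix.
num_mismatches = (W_loaded != W).nnz
print(is_dense, is_sparse, num_mismatches)
```

If this behaves the same on the cluster, the matrix type is probably not the cause and the target path is the next thing to check.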

@preetbawa
Author

I realize now that it is not the cluster matrix that is the issue; it is the weight matrix that is not being saved.
The traceback is:
/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/pecos/xmc/base.py in save(self, folder)
789 with open("{}/param.json".format(folder), "w") as f:
790 f.write(json.dumps(param, indent=True))
--> 791 smat_util.save_matrix("{}/W.npz".format(folder), self.W)


xlinear_model_1 is trained using XLinearModel.train.

To debug a bit more, I ran the following:
for d in range(xlinear_model_1.model.depth):
    print(type(xlinear_model_1.model.model_chain[d].W))
The model depth is 5, with 121k unique labels, now using HybridIndexer for the cluster chain.

<class 'scipy.sparse.csc.csc_matrix'>
<class 'scipy.sparse.csc.csc_matrix'>
<class 'scipy.sparse.csc.csc_matrix'>
<class 'scipy.sparse.csc.csc_matrix'>
<class 'scipy.sparse.csc.csc_matrix'>

@preetbawa
Author

@nishant4995 can you help here, as no one else is responding? I will also try to save each model weight matrix explicitly to see if that sheds some light.

@preetbawa
Author

So I tried saving a model chain weight matrix directly, and I see the error coming from numpy:

<array_function internals> in savez(*args, **kwargs)

/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in savez(file, *args, **kwds)
615 array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
616 """
--> 617 _savez(file, args, kwds, False)
618
619

/databricks/python/lib/python3.9/site-packages/numpy/lib/npyio.py in _savez(file, args, kwds, compress, allow_pickle, pickle_kwargs)
718 # always force zip64, gh-10776
719 with zipf.open(fname, 'w', force_zip64=True) as fid:
--> 720 format.write_array(fid, val,
721 allow_pickle=allow_pickle,
722 pickle_kwargs=pickle_kwargs)

/usr/lib/python3.9/zipfile.py in close(self)
1168 self._fileobj.seek(self._zinfo.header_offset)
1169 self._fileobj.write(self._zinfo.FileHeader(self._zip64))
-> 1170 self._fileobj.seek(self._zipfile.start_dir)

Could this be a numpy version compatibility issue?
I found this in setup.py. It is possible that the numpy version on the Databricks ML cluster I am using is causing this; let me look into it.
numpy_requires = [
    'numpy<1.20.0; python_version<"3.7"', # setup_requires needs correct version for <3.7
    'numpy>=1.19.5; python_version>="3.7"'
]

@preetbawa
Author

So this turns out to be bizarre: if you use a DBFS path in Databricks, even through the /dbfs/ mount, writing a zip file to that path doesn't work, but when I tried a /tmp/ path it worked!

So the issue is related to storing the weight matrices as zipped numpy .npz files: the seek fails when writing the zipped file to a DBFS path, whereas a local Databricks path like /tmp/ works.
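
To illustrate the failure mode, here is a rough simulation using only the standard library. It is not DBFS itself: NoRandomWriteFile is a hypothetical file object that rejects any repositioning, which is my assumption about how the mount behaves, and write_member imitates the zipf.open(fname, 'w', force_zip64=True) call visible in the numpy traceback above.

```python
import errno
import io
import zipfile


class NoRandomWriteFile(io.BytesIO):
    """Hypothetical stand-in for a file on a mount that rejects repositioning."""

    def seek(self, offset, whence=io.SEEK_SET):
        # Allow no-op seeks to the current position, reject everything else.
        if whence == io.SEEK_SET and offset == self.tell():
            return offset
        raise OSError(errno.EOPNOTSUPP, "Operation not supported")


def write_member(fileobj):
    """Write one zip member through zipf.open in 'w' mode, as numpy's _savez does."""
    with zipfile.ZipFile(fileobj, mode="w", allowZip64=True) as zipf:
        with zipf.open("W.npy", "w", force_zip64=True) as fid:
            fid.write(b"\x00" * 16)


def try_write(fileobj):
    """Return "ok" on success, or a short description of the OSError raised."""
    try:
        write_member(fileobj)
        return "ok"
    except OSError as exc:
        return f"OSError errno={exc.errno}"


print(try_write(io.BytesIO()))          # ordinary seekable target
print(try_write(NoRandomWriteFile()))   # target that refuses to seek back
```

When the member is closed, zipfile seeks back to rewrite the member header with the final sizes and CRC, so a target that cannot seek backwards fails at that point even though the plain writes before it succeeded.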

@preetbawa
Author

Something to be aware of for anyone training this model in Databricks: this is an internal zip issue when saving matrices to a DBFS path. Use a local path like /tmp and then move the matrices to a different path later if needed.
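
A minimal sketch of that workaround, with some assumptions: save_via_local_dir is a hypothetical helper name (not part of PECOS), save_fn is any callable that writes a model into a folder (for example xlinear_model.save), and I have not verified how shutil.copytree behaves on every DBFS mount.

```python
import shutil
import tempfile
from pathlib import Path


def save_via_local_dir(save_fn, final_dir):
    """Run save_fn(local_dir) on local disk, then copy the result to final_dir."""
    final_dir = Path(final_dir)
    with tempfile.TemporaryDirectory() as tmp:
        local_dir = Path(tmp) / "model"
        local_dir.mkdir()
        # The zip/npz writing happens here, on a local filesystem.
        save_fn(str(local_dir))
        # copyfile copies file contents only (no metadata), one file at a time.
        shutil.copytree(
            local_dir,
            final_dir,
            copy_function=shutil.copyfile,
            dirs_exist_ok=True,
        )
    return final_dir
```

Usage would look like save_via_local_dir(xlinear_model.save, "/dbfs/FileStore/pzn_ai/contextualized_autocomplete/model/"). The temporary directory is removed afterwards, so only the copy under the final path remains.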
