
Python3/Anaconda compatibility #12

Closed
huu4ontocord opened this issue Mar 12, 2017 · 7 comments

Comments

@huu4ontocord

I got it working for Anaconda3 by doing the following:

In cluster_pruning.py:

123c123
< records_index = np.arange(features.shape[0])
---
> records_index = list(np.arange(features.shape[0]))

131c131
< np.arange(clusters_selection.shape[0]))
---
> list(np.arange(clusters_selection.shape[0])))

223c223
< if feature <> None and record <> None:
---
> if feature != None and record != None:

273a274
> elements = list(elements)

In matrix_distance.py:

123c123,124
< arg_index = np.random.choice(len(scores), k, replace=False)
---
> lenScores = len(scores)
> arg_index = np.random.choice(lenScores, min(lenScores, k), replace=False)
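
(As an aside, not part of the diff: the min() guard matters because np.random.choice with replace=False raises a ValueError whenever it is asked for more samples than the population contains. A minimal, made-up illustration:)

import numpy as np

scores = np.array([0.9, 0.7, 0.4])   # hypothetical candidate scores
k = 5                                # caller asks for more results than exist

# The unguarded call fails:
#   np.random.choice(len(scores), k, replace=False)
#   ValueError: Cannot take a larger sample than population when 'replace=False'

# The guarded call from the patch above caps the sample size:
lenScores = len(scores)
arg_index = np.random.choice(lenScores, min(lenScores, k), replace=False)
print(arg_index)                     # at most len(scores) distinct indices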

329a331

In __init__.py:

7c7
< from cluster_pruning import ClusterIndex, MultiClusterIndex
---
> from .cluster_pruning import ClusterIndex, MultiClusterIndex

I think you should just create two more files for ClusterIndex and MultiClusterIndex. Otherwise it will cause issues with importing in Python 3 and with backwards compatibility for Python 2.
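
For what it's worth, here is a minimal sketch of an __init__.py that should import cleanly under both Python 2 and Python 3 (the try/except fallback is only one possible approach, and it assumes cluster_pruning.py sits inside the pysparnn package):

from __future__ import absolute_import

try:
    # Explicit relative import: required by Python 3, also valid inside a Python 2 package.
    from .cluster_pruning import ClusterIndex, MultiClusterIndex
except (ImportError, ValueError):
    # Fallback for code that still relies on the old implicit import under Python 2.
    from cluster_pruning import ClusterIndex, MultiClusterIndex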

@spencebeecher
Contributor

Wow, thanks! I'll take a look.

Do you have any intuition for why this change needs to happen?
records_index = np.arange(features.shape[0])
to
records_index = list(np.arange(features.shape[0]))

The other changes you suggest should be compatible. And this line
if feature != None and record != None:
should be something like
if (not feature is None) and (not record is None):
Thanks again!!!
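
As an aside on the is None suggestion: if feature or record can ever be a numpy array, != None triggers an elementwise comparison, and putting the resulting array in an if statement fails, while the identity check does not. A made-up snippet, not code from this repo:

import numpy as np

feature = np.array([1.0, 2.0, 3.0])

# if feature != None: ...
# On recent numpy this builds an elementwise boolean array, and "if <array>"
# raises: ValueError: The truth value of an array ... is ambiguous.

if feature is not None:   # identity check: unambiguous no matter what type feature is
    print("feature is set")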

@huu4ontocord
Author

np.arange produces a numpy array rather than a plain Python list (much like range in Python 3 no longer gives you a list), so you need to wrap a list() call around it wherever the code expects list behaviour.

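A small standalone sketch of the difference (the concrete failure mode depends on what the calling code does with records_index; concatenation versus broadcasting is just one example):

import numpy as np

features = np.zeros((5, 3))                      # hypothetical feature matrix

as_array = np.arange(features.shape[0])          # ndarray of row indices
as_list = list(np.arange(features.shape[0]))     # plain Python list of row indices

print(as_list + [99])    # list concatenation: appends 99 to the indices
print(as_array + [99])   # NOT concatenation: broadcasting adds 99 to every index
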
Btw, check out https://github.com/known-ai/KeyedVectorsANN

I folded your code into Gensim's KeyedVectors.

It was easier to fold all the code into one file, but I can refactor to use the pysparnn package once it is compatible with Python 3. I made some changes to add a new method, most_similar, and to store indexes as the records_data instead of the actual words. This saves some space.
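
Roughly, the space-saving idea looks like this. This is only a sketch: the ClusterIndex constructor and search signatures are assumed from this repo, and the vocabulary and vectors are made up.

import numpy as np
from pysparnn.cluster_pruning import ClusterIndex

words = ['apple', 'banana', 'cherry']       # hypothetical vocabulary
vectors = np.random.rand(len(words), 50)    # hypothetical word vectors

# Store row indices as records_data instead of the word strings themselves.
index = ClusterIndex(vectors, list(range(len(words))))

# Map the returned indices back to words only at query time.
results = index.search(vectors[:1], k=2, return_distance=False)
top_words = [words[int(i)] for i in results[0]]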

My model is 260 MB, and I'd like to find out how to reduce this size. I suspect it's mostly duplicates of the matrices.

Feel free to email me directly at ontocord@gmail.com

@spencebeecher
Contributor

Thanks @known-ai ! I made the requested changes in this diff - 1f976fa

@spencebeecher
Contributor

spencebeecher commented Mar 18, 2017

I'll send you an email.

@spencebeecher
Contributor

I am not sure that there is much extra that is kept around in memory.

Check this modification to DenseMatrix, dense_matrix-Copy1.pdf, which also includes a study of data sizes.

  • The input features matrix is about the size of the ClusterIndex data structure.
  • You can reduce the memory footprint by 4x (so long as your data fits well into an int16) - see the DenseIntCosineDistance class and the rough sketch below.
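
A back-of-the-envelope sketch of where the 4x comes from (float64 stores 8 bytes per value, int16 stores 2), assuming the int16 features are produced by uniformly scaling the float matrix:

import numpy as np

features = np.random.rand(10000, 300)                     # hypothetical float64 feature matrix

scale = np.iinfo(np.int16).max / np.abs(features).max()   # fit the values into the int16 range
features_i16 = np.round(features * scale).astype(np.int16)

print(features.nbytes / features_i16.nbytes)              # 4.0, i.e. a 4x smaller matrix

Since cosine distance is unchanged by a uniform positive scaling, the quantization mostly costs a little rounding precision rather than changing which neighbours come back.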

@huu4ontocord
Author

huu4ontocord commented Mar 19, 2017 via email

@spencebeecher
Contributor

^ Very cool. I think there is probably a 'better' (for some definition of better) way to pick the clusters other than picking at random. I am going to leave this open, but I'll close it in two weeks if the thread dies down.
