
Conversation

@VarIr (Contributor) commented Aug 22, 2019

Under certain circumstances ThunderSVM is surprisingly slow when invoked via the Python bindings. In my current use case, I replicate nested cross-validation procedures with relatively small data sets (that is, many calls to fit and predict). There, a significant share of the time is currently spent in list comprehensions that create numpy arrays from ctypes arrays.

A quick fix seemed to be replacing the list comprehensions with calls to np.frombuffer, which improved speed quite a bit; a sketch of the difference follows.
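For illustration, a minimal self-contained comparison of the two approaches (buf and n are made-up names here, not the actual binding code):

import numpy as np
from ctypes import c_float

n = 1_000_000
buf = (c_float * n)()  # a ctypes array, as filled by the C library

# Before: a list comprehension converts one element at a time,
# with Python-level type handling per item.
values_slow = np.array([buf[i] for i in range(n)], dtype=np.float32)

# After: np.frombuffer interprets the existing buffer directly,
# avoiding all per-element Python overhead.
values_fast = np.frombuffer(buf, dtype=np.float32)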

I created this PR in case you're interested in including these changes upstream.

@zeyiwen (Collaborator) commented Aug 23, 2019

Thanks. We'll run some tests and then merge the PR. Our team is currently working on paper deadlines, so it may take some time.

@zeyiwen zeyiwen requested a review from QinbinLi September 9, 2019 12:50
@QinbinLi (Member) left a comment

Thanks for your contribution! First, there is an error in your updates: in line 98, you should remove "self.". Second, I'm not very familiar with Python memory management. If you use "frombuffer", is it possible that the original memory (e.g., csr_row) gets released, so that the class member (e.g., self.row) becomes invalid while the class instance is still in use? If that case cannot happen, I can merge your pull request. Thanks.

@VarIr (Contributor, Author) commented Sep 10, 2019

I'm not very familiar with C/C++ memory management, so let's discuss this.

My current understanding is this: after fitting, the support vectors initially live in memory not managed by Python. On the Python side, ctypes arrays are allocated first, e.g.
csr_row = (c_int * arr_length)(). Then thundersvm.get_sv(csr_row, ...) fills the arrays; I assume the data is copied from its original location into the new one in this step (is that correct? I have not checked what the function actually does). np.frombuffer(csr_row, ...) then creates a numpy array referencing the new location. This should increment the ctypes array's reference count, so it will not be garbage collected. In that case, we should be fine.
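A small standalone example of this point (illustrative only; it assumes c_int maps to np.int32, which holds on common platforms):

import numpy as np
from ctypes import c_int

csr_row = (c_int * 4)(1, 2, 3, 4)
row = np.frombuffer(csr_row, dtype=np.int32)

# `row` holds a reference to the underlying buffer (see row.base),
# so the memory is not freed even when the original name goes away.
del csr_row
print(row)  # still valid: [1 2 3 4]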

Whether get_sv makes a copy I don't know, but if it does not, I would assume that self.row could get corrupted if the underlying data is modified. That could easily be avoided with np.frombuffer(...).copy(), however. There would then be no gain over the current implementation in terms of memory, but it would still be faster, since it avoids all the per-element type checks inside the list comprehension.
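That defensive variant would be a one-line sketch along these lines:

# Decouple the numpy array from the ctypes buffer: one bulk copy
# performed in C instead of arr_length Python-level conversions.
self.row = np.frombuffer(csr_row, dtype=np.int32).copy()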

@QinbinLi (Member) commented

Hi, @VarIr

You are right. There is a copy inside thundersvm.get_sv(csr_row, ...). As long as csr_row is not released by Python's garbage collector, it should be fine to use frombuffer. I'll merge the request. Thanks.

@QinbinLi QinbinLi merged commit 8445763 into Xtra-Computing:master Sep 10, 2019
@VarIr VarIr deleted the performance branch September 10, 2019 08:31
@beevabeeva commented Oct 11, 2019

EDIT: Please see issue #172

I am having a similar issue with the latest master branch of ThunderSVM: compared to serial sklearn (LibSVM), ThunderSVM is orders of magnitude slower. I am probably doing something wrong, though. I have tested this on a GTX 750 Ti and a 1060 Ti with the same results. I have to stop ThunderSVM because it looks like it will never finish, while serial sklearn takes 0.5 seconds on 50,000 instances of data (5 features).

Here is my test code if you would like to try to replicate this (dataset: BNG_COMET, https://www.openml.org/d/5648):

ThunderSVM test:

from thundersvm import SVC
import numpy as np
import time

# Dataset: BNG_COMET (https://www.openml.org/d/5648)
data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)

X = data[:5000, :-1]
y = data[:5000, -1]
xp_lots_of_test_samples = data[5100:5103, :-1]

print("X", X, X.shape)
print(y)

start = time.time()
clf = SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
          decision_function_shape='ovo', degree=3, gamma=12005.61948153516, gpu_id=1,
          kernel='linear', max_iter=5, max_mem_size=-1, n_jobs=-1, probability=True,
          random_state=None, shrinking=True, tol=0.001, verbose=True)
clf.fit(X, y)
end_time = time.time()

print('time: ', end_time - start)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[5100:5103, -1])

Sklearn test:

from sklearn import svm
import numpy as np
import time

# Dataset: BNG_COMET (https://www.openml.org/d/5648)
data = np.loadtxt(open("atm/demos/BNG_COMET.csv", "rb"), delimiter=",", skiprows=1)

X = data[:50000, :-1]
y = data[:50000, -1]
xp_lots_of_test_samples = data[50100:50103, :-1]

start = time.time()
clf = svm.SVC(C=176.6677880062673, cache_size=150, class_weight='balanced', coef0=0.0,
              decision_function_shape='ovo', degree=3, gamma=12005.61948153516,
              kernel='linear', max_iter=5, probability=True,
              random_state=None, shrinking=True, tol=0.001, verbose=True)
clf.fit(X, y)
end_time = time.time()

print('time: ', end_time - start)
print("predictions:")
print(clf.predict(xp_lots_of_test_samples))
print("true labels:")
print(data[50100:50103, -1])

@zeyiwen (Collaborator) commented Oct 13, 2019

Is the problem still there? If so, please open a new issue so that we can find time to fix it.

@beevabeeva commented

@zeyiwen thanks for your help. I will open a new one now.
