MemoryError: Unable to allocate 51.0 GiB #131

Closed
ssmoha7 opened this issue May 14, 2020 · 5 comments

ssmoha7 commented May 14, 2020

extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
For: MultipartiteRank
Used text file size = 11 MB
Platform: Windows 10 with 32 GB RAM
Error:

MemoryError Traceback (most recent call last)
in
4 extractor.candidate_weighting(alpha=1.1,
5 threshold=0.74,
----> 6 method='average')

~\Anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in candidate_weighting(self, threshold, method, alpha)
213
214 # cluster the candidates
--> 215 self.topic_clustering(threshold=threshold, method=method)
216
217 # build the topic graph

~\Anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in topic_clustering(self, threshold, method)
98
99 # compute the distance matrix
--> 100 Y = pdist(X, 'jaccard')
101 Y = np.nan_to_num(Y)
102

~\Anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, *args, **kwargs)
2002 out = kwargs.pop("out", None)
2003 if out is None:
-> 2004 dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
2005 else:
2006 if out.shape != (m * (m - 1) // 2,):

MemoryError: Unable to allocate 51.0 GiB for an array with shape (6848888203,) and data type float64

ygorg (Collaborator) commented May 14, 2020

Hi,
Please explain your issue; an error message alone does not provide enough information.
Good day.

ssmoha7 (Author) commented May 14, 2020

Hello,
I tried to run MultipartiteRank on a text file of data mining abstracts (the file is 11 MB).
# This is my code
extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input='../rawdata/dm_abstracts.txt', language='en', normalization='stemming', max_length=12000000)
pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')
extractor.candidate_selection(pos=pos, stoplist=stoplist)
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')  # ERROR HERE

I get the following error:
MemoryError Traceback (most recent call last)
in
4 extractor.candidate_weighting(alpha=1.1,
5 threshold=0.74,
----> 6 method='average')

~\Anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in candidate_weighting(self, threshold, method, alpha)
213
214 # cluster the candidates
--> 215 self.topic_clustering(threshold=threshold, method=method)
216
217 # build the topic graph

~\Anaconda3\lib\site-packages\pke\unsupervised\graph_based\multipartiterank.py in topic_clustering(self, threshold, method)
98
99 # compute the distance matrix
--> 100 Y = pdist(X, 'jaccard')
101 Y = np.nan_to_num(Y)
102

~\Anaconda3\lib\site-packages\scipy\spatial\distance.py in pdist(X, metric, *args, **kwargs)
2002 out = kwargs.pop("out", None)
2003 if out is None:
-> 2004 dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
2005 else:
2006 if out.shape != (m * (m - 1) // 2,):

MemoryError: Unable to allocate 51.0 GiB for an array with shape (6848888203,) and data type float64

ygorg (Collaborator) commented May 14, 2020

I think your input document might be too big for pke to process.
If you are processing abstracts, then I guess your file contains more than one abstract.
The code you executed:

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input='../rawdata/dm_abstracts.txt', language='en', normalization='stemming', max_length=12000000)
pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')
extractor.candidate_selection(pos=pos, stoplist=stoplist)
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')

should be executed for each document (so for each abstract) in the file.
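
For scale: pdist allocates a condensed distance matrix of m*(m-1)/2 float64 entries over all candidates in one document. Your traceback shows 6,848,888,203 entries, i.e. roughly 117,000 candidates, and 6,848,888,203 * 8 bytes is about 51 GiB. Processing one abstract at a time keeps m small. A minimal sketch of the per-abstract loop, assuming one abstract per line and that load_document accepts a raw string (adjust the split to however your abstracts are actually delimited):

import string

import pke
from nltk.corpus import stopwords

pos = {'NOUN', 'PROPN', 'ADJ'}
stoplist = list(string.punctuation)
stoplist += ['-lrb-', '-rrb-', '-lcb-', '-rcb-', '-lsb-', '-rsb-']
stoplist += stopwords.words('english')

# Assumption: one abstract per line; change this if the abstracts are
# separated differently (e.g. by blank lines).
with open('../rawdata/dm_abstracts.txt', encoding='utf-8') as f:
    abstracts = [line.strip() for line in f if line.strip()]

for abstract in abstracts:
    # a fresh extractor per document, so candidates do not accumulate
    extractor = pke.unsupervised.MultipartiteRank()
    extractor.load_document(input=abstract, language='en',
                            normalization='stemming')
    extractor.candidate_selection(pos=pos, stoplist=stoplist)
    extractor.candidate_weighting(alpha=1.1, threshold=0.74,
                                  method='average')
    print(extractor.get_n_best(n=10))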

ssmoha7 (Author) commented May 14, 2020

Okay, got it. But this will produce disconnected sets, one per abstract. Does this mean that the unsupervised methods in pke only support a single document, not a corpus of many documents?

Thanks,

ygorg (Collaborator) commented May 14, 2020

Well, pke is a library that aims at providing implementations of many keyphrase extraction methods. There are many methods in the literature; you can find the corresponding articles in the README.
Methods using topic modeling or TfIdf account for the corpus, but the task of keyphrase extraction is to get keyphrases for individual documents. Maybe keyphrase extraction does not suit your needs.
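
For corpus-aware weighting, one option is the TfIdf model with document frequencies counted over the whole collection beforehand (e.g. with pke.utils.compute_document_frequency). A minimal sketch, with hypothetical paths:

import string

import pke

# document-frequency counts precomputed over the corpus; path is hypothetical
df = pke.load_document_frequency_file(input_file='path/to/df_counts.tsv.gz')

extractor = pke.unsupervised.TfIdf()
extractor.load_document(input='path/to/one_abstract.txt', language='en',
                        normalization='stemming')

# select 1-3 grams that do not contain punctuation as candidates
extractor.candidate_selection(n=3, stoplist=list(string.punctuation))

# weight candidates with tf x idf, where idf comes from the corpus counts
extractor.candidate_weighting(df=df)
keyphrases = extractor.get_n_best(n=10)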
I am closing this issue as this is not a problem with pke.
