
Error in LDAResults #42

Closed
BrianMiner opened this issue Jan 26, 2015 · 12 comments

@BrianMiner

Following the example in https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md

There is an error when I run LDAResults(); the following error prints:

```
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      3 lda = LDAResults('C:\Users\Desktop\DATA\LDA\topics.dat',
      4                  'C:\Users\Desktop\DATA\LDA\predictions.dat', 'C:/Users/Desktop/DATA/LDA' + '/sff_basic.pkl',
----> 5                  num_topics=num_topics)
      6 lda.print_topics()

C:\Anaconda\lib\site-packages\rosetta\text\vw_helpers.pyc in __init__(self, topics_file, predictions_file, sfile_filter, num_topics, alpha, verbose)
    230
    231         if not isinstance(sfile_filter, text_processors.SFileFilter):
--> 232             sfile_filter = text_processors.SFileFilter.load(sfile_filter)
    233
    234         self.sfile_frame = sfile_filter.to_frame()

C:\Anaconda\lib\site-packages\rosetta\common_abc.pyc in load(cls, loadfile)
     40         """
     41         with smart_open(loadfile, 'rb') as f:
---> 42             return cPickle.load(f)

ImportError: No module named text_processors
```

@langmore
Contributor

The fact that text_processors is not seen points to an installation issue. I see you're on Windows. I'm not sure to what extent Windows + rosetta has been tested.

@BrianMiner
Author

Other functions up to this point appear to work, following https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md.

@BrianMiner
Author

I am probably missing something, but the bit that seems to cause the error is this:
text_processors.SFileFilter.load

But when I look on GitHub at the class SFileFilter under text_processors, I don't see a load function. There is a load_sfile function, though.

@dkrasner
Contributor

The load() method comes from the SaveLoad class, which SFileFilter inherits.

Perhaps you can trace through and point out exactly where the error occurs?

Also, as Ian has pointed out, we really don't develop or test in a Windows environment, so it's a bit hard to see what might or might not work.

@BrianMiner
Author

Any suggestion on how to trace to the error?

I have never seen an instance where a pure Python library failed on Windows, but I am sure it must happen.

@ApproximateIdentity
Contributor

Since it's an import error, can you import text_processors? I.e., does the following work?

from rosetta.text import text_processors

Could you post an (ideally minimal) gist that causes this error?

@BrianMiner
Author

Yes, I can import it, which is what is so strange. I don't have a gist set up, but here is what I was attempting. Note: I wasn't sure how to process a single file that had documents as rows, as opposed to representing each document as a file in a folder. So I broke such a document up into multiple documents (is this required?).

```python
#imports####################################################
import sklearn.datasets
import re
import pandas as pd
import numpy as np
import nltk
#nltk.download()
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from rosetta.text.text_processors import SFileFilter, VWFormatter
from rosetta.text.vw_helpers import LDAResults
from rosetta.text import text_processors, filefilter, streamers, vw_helpers

#GENERATE DATA############################################
dat = pd.Series(sklearn.datasets.fetch_20newsgroups(subset='train').data)

def clean(s):
    try:
        return " ".join(re.findall(r'\w+', s, flags=re.UNICODE | re.LOCALE)).lower()
    except:
        return 'ERROR'

dat = dat.apply(clean)

#WRITE OUT THE DOCS IN A FOLDER#######################################
i = 0
for doc in dat:
    pd.Series(doc).to_csv("C:\\Users\\Desktop\\DATA\\LDA\\DOCS\\doc%d.txt" % i, mode='wb', header=False, index=False)
    i = i + 1

#CREATE VW FILE#################################################

#create the VW format file
my_tokenizer = text_processors.TokenizerBasic()
stream = streamers.TextFileStreamer(text_base_path='C:\\Users\\Desktop\\DATA\\LDA\\DOCS', tokenizer=my_tokenizer)
stream.to_vw('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw', n_jobs=-1, raise_on_bad_id=False)

#load the file again
sff = SFileFilter(VWFormatter())
sff.load_sfile('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw')

#Remove extremes
#remove "gaps" in the sequence of numbers (ids)
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify()
sff.save('C:\\Users\\Desktop\\DATA\\LDA' + 'sff_file.pkl')

#Create filtered file for VW
sff.filter_sfile('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw', 'C:\\Users\\Desktop\\DATA\\LDA\\rosetta_filtered.vw')

#THEN RUN THIS##############################################################
#vw --lda 10 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 11314 --minibatch 256 --power_t 0.5 --initial_t 1 -b 22 -k --cache_file C:\Users\Desktop\DATA\LDA\vw.cache --passes 10 -p C:\Users\Desktop\DATA\LDA\predictions.dat --readable_model C:\Users\Desktop\DATA\LDA\topics.dat C:\Users\Desktop\DATA\LDA\rosetta_filtered.vw

#THIS IS WHAT THROWS THE ERROR:
num_topics = 5
lda = LDAResults('C:\\Users\\Desktop\\DATA\\LDA\\topics.dat',
                 'C:\\Users\\Desktop\\DATA\\LDA\\predictions.dat', 'C:\\Users\\Desktop\\DATA\\LDA' + 'sff_basic.pkl',
                 num_topics=num_topics)
```

@BrianMiner
Author

The markup is removing one of the backslashes in the code above; the paths do have double '\\'.

All the steps before the LDAResults appear to work fine.

@ApproximateIdentity
Copy link
Contributor

See here for a way to make your code show up as more readable: https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown

I'm not totally sure I understand what you're doing. How about you do the following:

  1. Go into some empty root folder (I'll call root/).
  2. Put your python file in that folder.
  3. Create a docs/ folder in the root/ folder.
  4. Put one of your text files in that root/ folder.
  5. Change all the paths to be relative to this structure.
  6. Finally upload the python file and the text file (this is probably where you should use a gist).

If you do all that, I can maybe figure out what's going wrong.

@BrianMiner
Author

I can try that; I am not that great a Python programmer, as should be apparent :)

I am trying to follow the example:
https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md
It concerns me that you are not sure what I am doing (that does not bode well for my efforts); I tried to copy the code in the example (using method 1). I did not have the data used in this example (providing it might help us newbies), so I created some in a reproducible way using scikit-learn.


@BrianMiner
Author

I gave up and installed rosetta on Ubuntu via VirtualBox. The install worked fine (except one failed test, which is already noted as an issue). The examples from the above all ran through without issue. So, indeed, the issue appears to be Windows (the install, the use of relative imports?).

@ApproximateIdentity
Contributor

Firstly, can you post information about which test fails? I thought the tests were all passing now and would like to know if they're not...

And sorry I didn't reply to your previous message. As a general rule when submitting error reports (which you're doing in an informal manner), it's good to (1) provide a script and the data necessary to reproduce the error and (2) make the script and data as "minimal" as possible. So really think about whether any lines of code can be removed while still keeping the error, and whether the data file can be minimized as well (it might be the case that you only need one line in your data file to produce the error; in that case, don't upload a file with 10,000 lines). That reduces noise for others to look at.

Once you have that, you should write it so that you can just run `python test.py` and get a failure. That failure should produce a stack trace, which you should post in full. Everything that is code or error output should be posted with either the markdown formatting I linked earlier or simply uploaded as a gist. That keeps the formatting from getting screwed up.

Regardless, it might still be hard to help because I think almost all users of the library have either Linux or Macs and so testing/bugs have focused on that. However, if you post a clean script with a full stack trace we might be able to see the error just by reading that. It's good to hear that it's working in Ubuntu/Virtual Box (I use that myself on the one Windows laptop I have), but I think these pointers will help you communicate errors to this or other projects more effectively in the future.
