
Error in LDAResults #42

Closed
BrianMiner opened this issue Jan 26, 2015 · 12 comments

@BrianMiner

Following the example in https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md

There is an error when I run LDAResults(); the following error prints:

```
ImportError                               Traceback (most recent call last)
<ipython-input> in <module>()
      3 lda = LDAResults('C:\Users\Desktop\DATA\LDA\topics.dat',
      4                  'C:\Users\Desktop\DATA\LDA\predictions.dat', 'C:/Users/Desktop/DATA/LDA' + '/sff_basic.pkl',
----> 5                  num_topics=num_topics)
      6 lda.print_topics()

C:\Anaconda\lib\site-packages\rosetta\text\vw_helpers.pyc in __init__(self, topics_file, predictions_file, sfile_filter, num_topics, alpha, verbose)
    230
    231         if not isinstance(sfile_filter, text_processors.SFileFilter):
--> 232             sfile_filter = text_processors.SFileFilter.load(sfile_filter)
    233
    234         self.sfile_frame = sfile_filter.to_frame()

C:\Anaconda\lib\site-packages\rosetta\common_abc.pyc in load(cls, loadfile)
     40         """
     41         with smart_open(loadfile, 'rb') as f:
---> 42             return cPickle.load(f)

ImportError: No module named text_processors
```

@langmore
Contributor

The fact that text_processors is not seen points to an installation issue. I see you're on Windows. I'm not sure to what extent Windows + rosetta has been tested.

@BrianMiner
Author

Other functions up to this point appear to work, following https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md.

@BrianMiner
Author

I am probably missing something, but the bit that seems to cause the error is this:
text_processors.SFileFilter.load

But when I look on GitHub at the class SFileFilter under text_processors, I don't see a load function. There is a load_sfile function, though.

@dkrasner
Contributor

The load() method comes from the SaveLoad class, which SFileFilter inherits.

Perhaps you can trace through and point out exactly where the error occurs?

Also, as Ian has pointed out, we really don't develop or test in a Windows environment, so it's a bit hard to see what might or might not work.

@BrianMiner
Author

Any suggestion on how to trace to the error?

I have never seen an instance where a pure Python library failed on Windows, but I am sure it must happen.

@ApproximateIdentity
Contributor

Since it's an import error, can you import text_processors? I.e., does the following work?

from rosetta.text import text_processors

Could you post an (ideally minimal) gist that causes this error?

@BrianMiner
Author

Yes, I can import it, which is what is so strange. I don't have a gist set up, but here is what I was attempting. Note: I wasn't sure how to process a single file that had documents as rows, as opposed to representing each document as a file in a folder. So I broke such a document up into multiple documents (is this required?).

```python
#imports####################################################
import sklearn.datasets
import re
import pandas as pd
import numpy as np
import nltk
#nltk.download()
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from rosetta.text.text_processors import SFileFilter, VWFormatter
from rosetta.text.vw_helpers import LDAResults
from rosetta.text import text_processors, filefilter, streamers, vw_helpers

#GENERATE DATA############################################
dat = pd.Series(sklearn.datasets.fetch_20newsgroups(subset='train').data)

def clean(s):
    try:
        return " ".join(re.findall(r'\w+', s, flags=re.UNICODE | re.LOCALE)).lower()
    except:
        return 'ERROR'

dat = dat.apply(clean)

#WRITE OUT THE DOCS IN A FOLDER#######################################
i = 0
for doc in dat:
    pd.Series(doc).to_csv("C:\\Users\\Desktop\\DATA\\LDA\\DOCS\\doc%d.txt" % i, mode='wb', header=False, index=False)
    i = i + 1

#CREATE VW FILE#################################################

#create the VW format file
my_tokenizer = text_processors.TokenizerBasic()
stream = streamers.TextFileStreamer(text_base_path='C:\\Users\\Desktop\\DATA\\LDA\\DOCS', tokenizer=my_tokenizer)
stream.to_vw('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw', n_jobs=-1, raise_on_bad_id=False)

#load the file again
sff = SFileFilter(VWFormatter())
sff.load_sfile('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw')

#Remove extremes
#remove "gaps" in the sequence of numbers (ids)
sff.filter_extremes(doc_freq_min=5, doc_fraction_max=0.8)
sff.compactify()
sff.save('C:\\Users\\Desktop\\DATA\\LDA' + 'sff_file.pkl')

#Create filtered file for VW
sff.filter_sfile('C:\\Users\\Desktop\\DATA\\LDA\\rosetta.vw', 'C:\\Users\\Desktop\\DATA\\LDA\\rosetta_filtered.vw')

#THEN RUN THIS##############################################################
#vw --lda 10 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 11314 --minibatch 256 --power_t 0.5 --initial_t 1 -b 22 -k --cache_file C:\Users\Desktop\DATA\LDA\vw.cache --passes 10 -p C:\Users\Desktop\DATA\LDA\predictions.dat --readable_model C:\Users\Desktop\DATA\LDA\topics.dat C:\Users\Desktop\DATA\LDA\rosetta_filtered.vw

#THIS IS WHAT THROWS THE ERROR:
num_topics = 5
lda = LDAResults('C:\\Users\\Desktop\\DATA\\LDA\\topics.dat',
                 'C:\\Users\\Desktop\\DATA\\LDA\\predictions.dat', 'C:\\Users\\Desktop\\DATA\\LDA' + 'sff_basic.pkl',
                 num_topics=num_topics)
```

@BrianMiner
Author

The markup is removing one of the backslashes in the code above; the paths do have double '\\'.

All the steps before the LDAResults appear to work fine.

@ApproximateIdentity
Copy link
Contributor

See here for a way to make your code show up as more readable: https://guides.github.com/features/mastering-markdown/#GitHub-flavored-markdown

I'm not totally sure I understand what you're doing. How about you do the following:

  1. Go into some empty root folder (I'll call root/).
  2. Put your python file in that folder.
  3. Create a docs/ folder in the root/ folder.
  4. Put one of your text files in that root/ folder.
  5. Change all the paths to be relative to this structure.
  6. Finally upload the python file and the text file (this is probably where you should use a gist).

If you do all that, I can maybe figure out what's going wrong.

@BrianMiner
Author

I can try that; I am not that great a Python programmer, as should be apparent :)

I am trying to follow the example:
https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md
It concerns me that you are not sure what I am doing (that does not bode well for my efforts); I tried to copy the code in the example (using method 1). I did not have the data used in this example (providing it might help us newbies), so I created some in a reproducible way using scikit-learn.


@BrianMiner
Author

I gave up and installed rosetta on Ubuntu via VirtualBox. The install worked fine (except one failed test, which is already noted as an issue). The examples from the above all ran through without issue. So, indeed, the issue appears to be Windows (the install, the use of relative imports?).

@ApproximateIdentity
Contributor

Firstly, can you post information about which test fails? I thought the tests were all passing now and would like to know if they're not...

And sorry I didn't reply to your previous message. As a general rule when submitting error reports (which you're doing in an informal manner), it's good to (1) provide a script and the data necessary to reproduce the error and (2) make the script and data as "minimal" as possible. So really think about whether any lines of code can be removed while still keeping the error, and whether the data file can be minimized as well (it might be the case that you only need one line in your data file to produce the error; in that case, don't upload a file with 10,000 lines). That reduces noise for others to look at.

Once you have that, you should write it so that you can just run `python test.py` and get a failure. That failure should produce a stack trace, which you should post in full. Everything that is code or error output should be posted with either the markdown formatting I linked earlier or simply uploaded as a gist. That keeps the formatting from getting screwed up.

Regardless, it might still be hard to help because I think almost all users of the library have either Linux or Macs and so testing/bugs have focused on that. However, if you post a clean script with a full stack trace we might be able to see the error just by reading that. It's good to hear that it's working in Ubuntu/Virtual Box (I use that myself on the one Windows laptop I have), but I think these pointers will help you communicate errors to this or other projects more effectively in the future.
