Classification using LDA #26

Open
amritbhanu opened this issue Jul 14, 2016 · 8 comments


amritbhanu commented Jul 14, 2016

Experiment Setup

  • Datasets: Manney Generator of Stack Exchange sites; 25 datasets.
  • Running the tuning experiment with a 5-term overlap; select the parameters with the maximum stability score.
  • Find clusters, and assign each topic a sequential label (1, 2, 3, ...).
  • Now each document will be labelled 1, 2, 3, ... rather than with tags.
  • Run SVM. Binary classification.

We have the baseline results for SVM without SMOTE and SVM with SMOTE.
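For the tuning step above (picking the parameters with the maximum stability score from a 5-term overlap), here is a minimal sketch of how such a score could be computed. This is only an illustration under assumptions, not the actual tuning code: it takes stability to be the median overlap of each topic's top-5 terms between two LDA runs with different seeds, and uses scikit-learn's LDA for brevity.

```python
# Illustrative sketch (assumptions, not the project's tuning code): score a
# parameter setting (k, alpha, beta) by how well the top-5 terms of its topics
# overlap across two independently seeded LDA runs.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_terms(lda, vocab, n=5):
    """Top-n terms for every topic of a fitted LDA model."""
    return [set(vocab[i] for i in comp.argsort()[-n:]) for comp in lda.components_]

def stability(docs, k, alpha, beta, n=5):
    """Median best-match overlap of top-n terms between two LDA runs."""
    vec = CountVectorizer(stop_words='english')
    X = vec.fit_transform(docs)
    vocab = vec.get_feature_names_out()
    runs = [LatentDirichletAllocation(n_components=k, doc_topic_prior=alpha,
                                      topic_word_prior=beta, random_state=s).fit(X)
            for s in (0, 1)]
    terms1, terms2 = (top_terms(m, vocab, n) for m in runs)
    # for each topic in run 1, take its best-matching topic (largest overlap) in run 2
    overlaps = [max(len(a & b) for b in terms2) for a in terms1]
    return np.median(overlaps)

# tuning would keep whichever (k, alpha, beta) maximises stability(docs, k, alpha, beta)
```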

amritbhanu added this to the To Dos milestone Jul 14, 2016

timm commented Jul 18, 2016

amrit... is the paper all done? like do that before moving on

t

@amritbhanu

I am on it, prof.


amritbhanu commented Aug 4, 2016

@timm Here is the result of using LDA to automatically label the documents and then using a learner.

We cannot reproduce the results from the paper, because:

  • Mylyn, Eclipse, Firefox, and NetBeans projects: the preprocessed datasets are not available, nor are the exact preprocessing steps. They followed some naming conventions which they haven't described.

Experiment:

  • Took this paper as an example: http://dl.acm.org/citation.cfm?id=2390074
  • After running LDA, they labeled each document with its top-weighted topic.
  • Each document will have a label 1, 2, 3, ...
  • Selected one target label as "yes" and the rest as "no", converting the task into binary classification.
  • 5-by-5 cross-validation. Hashing trick with 10k features. SVM classifier. (A rough sketch of this pipeline follows the list.)
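To make the steps above concrete, here is a rough sketch of the pipeline. The dataset loading, the number of topics, and the choice of scikit-learn classes are assumptions for illustration, not the actual experiment script.

```python
# Sketch (illustrative assumptions): LDA topic labels -> binary target ->
# hashing trick with 10k features -> linear SVM, 5x5 cross-validation.
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

docs = ["..."]  # placeholder: one Stack Exchange document per entry

# 1. LDA over raw term counts; each document gets its top-weighted topic as its label.
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0)  # k=10 is an assumed value
topic_labels = lda.fit_transform(counts).argmax(axis=1)

# 2. Binary target: the chosen target topic is "yes" (1), everything else "no" (0).
target_topic = 0
y = (topic_labels == target_topic).astype(int)

# 3. Hashing trick with 10k features, linear SVM, 5-by-5 cross-validation.
X = HashingVectorizer(n_features=10_000, alternate_sign=False).fit_transform(docs)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(LinearSVC(), X, y, cv=cv, scoring='f1')
print(scores.mean())
```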

Conclusion

  • Baseline SVM didn't perform well; this might be because of the tags we used to label the Stack Exchange sites. This can affect all the previous results we showed to LN: the numbers will change, though the conclusions may or may not remain the same.
  • LDA is able to correctly label the documents.

Results:

[attached results file]


timm commented Aug 4, 2016

am now lost in the details.

please bust fscore into precision and recall

this looks like no win with tuning... right?

please write this up as a 2-4 page pdf doc. define all your terms. don't worry about the start-up sections (motivation, background)

but what is your justification for "baseline"? what papers use "baseline"?

t

@amritbhanu

Yes, no win with tuning, but the result numbers we showed to LN might change. The conclusions may or may not remain the same.

My baseline results are from our BIGDSE paper, where we just used the hashing trick with SVM as the baseline.

I will compile all these terms and my thoughts into a white paper soon.


timm commented Aug 5, 2016

fyi- you may need to tune (1) the feature extraction (of the topics) AND (2) the learner to get improved performance.

right now you're just tuning (1), right?

without doing (2), what you could do is show conclusion instability (a venn diagram of documents classified X, Y, Z via untuned feature extraction, repeated 10 times on 10 different data orderings).

with (2) you might get the kinds of improvements Wei reported
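One way to read the Venn-diagram suggestion (a sketch under assumptions, not a prescribed implementation): repeat the untuned LDA labeling on shuffled data orderings and measure how often pairs of documents agree on landing in the same topic, since raw topic ids are not comparable across runs.

```python
# Sketch (illustrative only) of a conclusion-instability check: repeat untuned
# LDA labeling on shuffled data orderings and measure how consistently pairs of
# documents land in the same topic across runs.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def labels_for_ordering(docs, k, seed):
    """Shuffle the documents, fit untuned LDA, return each doc's top topic."""
    order = np.random.RandomState(seed).permutation(len(docs))
    X = CountVectorizer(stop_words='english').fit_transform([docs[i] for i in order])
    # online updates make the document order matter, which is the point of the shuffle
    lda = LatentDirichletAllocation(n_components=k, learning_method='online',
                                    random_state=seed)
    topics = lda.fit_transform(X).argmax(axis=1)
    labels = np.empty(len(docs), dtype=int)
    labels[order] = topics  # map labels back to the original document order
    return labels

def agreement(docs, k=10, repeats=10):
    """Mean pairwise agreement, across runs, on whether two docs share a topic."""
    runs = [labels_for_ordering(docs, k, seed) for seed in range(repeats)]
    # topic ids are arbitrary across runs, so compare co-membership, not raw labels
    same = [np.equal.outer(r, r) for r in runs]
    pairs = [(a, b) for i, a in enumerate(same) for b in same[i + 1:]]
    return np.mean([np.mean(a == b) for a, b in pairs])
```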


amritbhanu commented Aug 5, 2016

  • I did (1) tuning and then tried labeling the documents with topics X, Y, Z. The original datasets (Stack Exchange sites, i.e. the so-called manny dataset generator) were labeled with tags instead. Once I labeled the documents using LDA, (2) the feature extraction used was the feature hasher (hashing trick), followed by a learner.
    • My conclusion: with or without tuning, both performed better than the baseline results. So this has to do with the dataset (wrong data) which we used during the LN work.
  • Per your suggestion, I will try tuning both (1) the feature extraction of the topics and (2) the learner.
  • I didn't understand the Venn diagram point. From tuned results I will have documents classified as X1, Y1, Z1, ... and from untuned results documents classified as X2, Y2, Z2. What do you mean?
