GitHub - dridon/aml2: Applied Machine Learning Mini-Project 2

Repository contents: Code, Datasets, Results, report, .gitignore, readme.txt

**Preprocessing ideas:
	
	- Tokenize
	- Conversion to lower-case	
	- Remove stop words (the, with, to, for, a, we, etc.). We need to write a list.
	- Remove punctuation
	- Remove tokens with fewer than 2 characters
	- Stemming (e.g. forest, forests, forestation, forested ===> forest)
	//- Filter out Angus' error :P; i.e. the "Category" category. Can be done manually, only 3 entries.
	- Do we want to handle formulae? Count the number of formulae?
	
	1. make all words lower case
	2. remove punctuation
	
	3. remove tokens with fewer than two characters
	4. remove stop words
	5. stemming
	
	6. for groups 1 and 2, build dictionaries
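
A rough sketch of steps 1 to 5 in plain Python, just to pin down the order of operations. Everything named here is an assumption for illustration: `STOP_WORDS` is a tiny placeholder (the real list still has to be written), and `stem` is a crude suffix stripper standing in for a proper stemmer such as Porter's. Dictionary building (step 6) is not covered.

```python
import string

# Placeholder stop-word list (assumption); the real list still needs writing.
STOP_WORDS = {"the", "with", "to", "for", "a", "we", "and", "of", "in", "is"}

# Crude suffix stripping, NOT a real stemmer; longest suffix is tried first.
SUFFIXES = ("ation", "ed", "s")

# Map every punctuation character to a space so "a,b" splits into two tokens.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return the list of cleaned, stemmed tokens for one document."""
    text = text.lower()                                   # 1. lower case
    tokens = text.translate(PUNCT_TO_SPACE).split()       # 2. punctuation, tokenize
    tokens = [t for t in tokens if len(t) >= 2]           # 3. short tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. stop words
    return [stem(t) for t in tokens]                      # 5. stemming


print(preprocess("The forests, forested with forestation: a forest!"))
```

Stop words are removed before stemming on purpose, so the stop-word list can hold unstemmed forms.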
	
**Feature extraction:

	- Word presence/absence, bag of words or n-grams?
	- Need some kind of word occurrence threshold
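
To compare the options above, here is a minimal sketch that covers all three (presence/absence, bag of words, n-grams) plus an occurrence threshold. The function names, the `min_doc_freq` parameter and its default of 2 are assumptions, not decisions; the threshold here counts documents containing a term, which is only one possible definition.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, n=1, min_doc_freq=2):
    """Keep the n-grams that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens, n)))  # count each term once per doc
    return sorted(term for term, df in doc_freq.items() if df >= min_doc_freq)


def vectorize(tokens, vocabulary, n=1, binary=False):
    """Bag-of-n-grams counts, or presence/absence when binary=True."""
    counts = Counter(ngrams(tokens, n))
    if binary:
        return [1 if counts[term] else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]


# Toy documents, already preprocessed into token lists.
docs = [
    ["graph", "theory", "graph"],
    ["graph", "neural", "network"],
    ["neural", "network", "train"],
]
vocab = build_vocabulary(docs)           # unigrams seen in >= 2 documents
print(vocab)
print(vectorize(docs[0], vocab))         # counts
print(vectorize(docs[0], vocab, binary=True))
print(build_vocabulary(docs, n=2))       # bigrams seen in >= 2 documents
```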
	
**Classifiers:
	1) Basic: Naive Bayes
	2) Standard: To be covered in class (SVM?)
	3) Advanced: I suggest random forests	
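
For the basic classifier, a minimal multinomial Naive Bayes sketch in plain Python. It assumes add-one (Laplace) smoothing and that words never seen in training are ignored at prediction time; the class name, the toy documents and the labels "math" and "cs" are made up for illustration and are not the project's categories.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:                # skip unseen words
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (hypothetical labels).
train_docs = [
    ["graph", "edge", "node"],
    ["graph", "node"],
    ["neuron", "layer", "train"],
    ["layer", "train"],
]
train_labels = ["math", "math", "cs", "cs"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["graph", "node"]))
print(model.predict(["layer", "unknown"]))
```

Working in log space avoids underflow when documents are long.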
		
	
**Sources of info:
	https://de.dariah.eu/tatom/preprocessing.html
	
**Papers:
Keyword: text categorization

General: http://nmis.isti.cnr.it/sebastiani/Publications/TM05.pdf
N-gram: http://odur.let.rug.nl/vannoord/TextCat/textcat.pdf
Bigrams: http://www.cs.ucsb.edu/~yfwang/papers/igm.pdf --> Might be interesting to try that! Pretty straightforward.
SVM: http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf
Regression: http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf
Classifier comparison: http://www.inf.ufes.br/~claudine/courses/ct08/artigos/yang_sigir99.pdf
Preprocessing: http://www.di.uevora.pt/~pq/papers/enia-goncalves-quaresma.pdf