My code for the Columbia Data Science final project Kaggle competition: an automated essay grading system.


# chmullig's Kaggle Essay Code

For the in-class Kaggle competition at http://inclass.kaggle.com/c/columbia-university-introduction-to-data-science-fall-2012, as part of the class at http://columbiadatascience.wordpress.com.

Implements a few models using R and Python.

## Requirements

- Python (only tested with 2.7)
  - nltk
  - scikit-learn
  - pandas
  - PyEnchant
- R
  - randomForest
  - gbm
  - plyr
  - MASS
  - ggplot2 (soft requirement)
  - reshape (soft requirement)

## Features Created/Used

- number of characters
- number of sentences
- number of words
- number of syllables
- number of distinct words
- words / sentences
- characters / words
- syllables / words
- number of spelling mistakes
- correctly spelled words / total words
- flag for starting with "Dear"
- flag if the essay contains a semicolon
- flag if the essay contains an exclamation point
- flag if the essay contains a question mark
- number of double quotes
- flag if the essay contains at least 2 double quotes
- flag indicating whether proper quote punctuation is more common or not (1 if `."` is more common than `".`, -1 if less common, 0 if tied/neither)
- counts of parts of speech (from NLTK)
- rollups for number of nouns, verbs, adjectives, adverbs, superlatives
- flag for ending with a preposition
- counts of the NER tokens (e.g. number of times the essay used @MONEY)
- TF-IDF word and bigram frequencies, reduced via PCA down to 50 columns
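Most of the counts and flags above are simple string operations. As a rough illustration only (this is not the repo's actual `basic_tags.py`; the function name and return keys here are hypothetical), a handful of them might be computed like this:

```python
import re

def basic_features(essay):
    """Hypothetical sketch of a few of the hand-crafted features above."""
    words = essay.split()
    # crude sentence split on terminal punctuation
    sentences = [s for s in re.split(r'[.!?]+', essay) if s.strip()]
    n_chars = len(essay)
    n_words = len(words)
    n_sents = max(len(sentences), 1)
    return {
        "n_chars": n_chars,
        "n_words": n_words,
        "n_sents": n_sents,
        "n_distinct": len(set(w.lower() for w in words)),
        "words_per_sent": n_words / float(n_sents),
        "chars_per_word": n_chars / float(max(n_words, 1)),
        "starts_dear": int(essay.lower().lstrip().startswith("dear")),
        "has_semicolon": int(";" in essay),
        "has_exclaim": int("!" in essay),
        "has_question": int("?" in essay),
        "n_dquotes": essay.count('"'),
        # 1 if proper quote punctuation (." ) dominates, -1 if ". does, else 0
        "quote_punct": (1 if essay.count('."') > essay.count('".')
                        else -1 if essay.count('."') < essay.count('".')
                        else 0),
    }
```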

## Models Used

- First model: OLS linear regression using a subset of the variables. I trained 5 models, one per essay set, with identical formulas. Shockingly good.
- Second model: random forest regression, again 5 models, using more variables.
- Third model: GBM, same formula as the random forest, again 5 models.

I also tried fitting the random forest and GBM as single models using essay set as a predictor, but that didn't seem to perform as well.
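The one-model-per-essay-set approach can be sketched as follows. Note the repo actually fits its models in R (`lm`, `randomForest`, `gbm`); this is only an illustrative scikit-learn equivalent, and the function names are invented here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_per_set_models(X, y, essay_set, model_cls=LinearRegression, **kw):
    """Fit one model per essay set; returns {set_id: fitted_model}.

    Swap model_cls for RandomForestRegressor or GradientBoostingRegressor
    to mirror the second and third models above.
    """
    models = {}
    for s in np.unique(essay_set):
        mask = essay_set == s
        models[s] = model_cls(**kw).fit(X[mask], y[mask])
    return models

def predict_per_set(models, X, essay_set):
    """Route each row to the model for its essay set."""
    preds = np.empty(len(X))
    for s, model in models.items():
        mask = essay_set == s
        preds[mask] = model.predict(X[mask])
    return preds
```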

The basic workflow is in `buildModel.sh`:

1. Run `basic_tags.py` on `test.tsv` and `train.tsv`. This creates almost all the features/tags/variables we need.

2. Run `add_tfidf.py train_tagged.csv test_tagged.csv 50` to create TF-IDF word vectors for each essay and PCA them down to a more usable 50 variables.

3. Run the R script `basicModel.R` to build the models and predict.
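Step 2's TF-IDF-plus-PCA reduction can be approximated with scikit-learn. Since TF-IDF matrices are sparse, truncated SVD is the usual way to "PCA" them without densifying. This is a minimal sketch under those assumptions, not the actual `add_tfidf.py`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def tfidf_components(train_texts, test_texts, n_components=50):
    """Fit TF-IDF (words + bigrams) on the training essays, then reduce
    the sparse matrix to n_components columns via truncated SVD.
    Returns (train_matrix, test_matrix)."""
    vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
    X_train = vec.fit_transform(train_texts)
    X_test = vec.transform(test_texts)         # reuse training vocabulary
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(X_train), svd.transform(X_test)
```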