aabs/edx-ai-week11-project

edx-ai-week11-project

This assignment produces sentiment predictions over a collection of IMDB reviews using four text representations: unigram, unigram with tf-idf, bigram, and bigram with tf-idf.
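To make the four representations concrete, here is a minimal pure-Python sketch. It is an illustration only, not code from this repository: the helper names `ngrams` and `tf_idf` and the two sample reviews are hypothetical, and the weighting is the plain textbook tf × log(N/df), whereas libraries such as scikit-learn apply a smoothed variant by default.

```python
import math
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(docs_terms):
    """Weight each term count by log(N / df), the plain textbook tf-idf."""
    n_docs = len(docs_terms)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in docs_terms for term in set(terms))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(terms).items()}
        for terms in docs_terms
    ]


# Hypothetical sample reviews, standing in for rows of the real dataset.
reviews = ["great movie great cast", "boring movie"]

unigram_counts = [Counter(ngrams(r, 1)) for r in reviews]
bigram_counts = [Counter(ngrams(r, 2)) for r in reviews]
unigram_tfidf = tf_idf([ngrams(r, 1) for r in reviews])
bigram_tfidf = tf_idf([ngrams(r, 2) for r in reviews])
```

In this toy data `movie` occurs in both reviews, so its tf-idf weight drops to zero, while `great` (twice in one review, absent from the other) keeps a positive weight. That down-weighting of ubiquitous terms is what the tf-idf variants add over raw counts.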

Tasks

  • Combine the raw DB into a single CSV file (imdb_tr.csv, with 3 columns: row_number, text, polarity).
  • Remove all common stopwords.
  • Transform the text column of imdb_tr.csv into a term-document matrix using the unigram model.
  • Train an SGD classifier on it with loss="hinge" and penalty="l1".
  • Train an SGD classifier using the unigram representation, predict sentiments on imdb_te.csv, and write the output to unigram.output.txt.
  • Train an SGD classifier using the bigram representation, predict sentiments on imdb_te.csv, and write the output to bigram.output.txt.
  • Train an SGD classifier using the unigram representation with tf-idf, predict sentiments on imdb_te.csv, and write the output to unigramtfidf.output.txt.
  • Train an SGD classifier using the bigram representation with tf-idf, predict sentiments on imdb_te.csv, and write the output to bigramtfidf.output.txt.
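The steps above could be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the repository's actual code: the tiny in-memory lists stand in for the text and polarity columns of imdb_tr.csv and the reviews in imdb_te.csv, the names `REPRESENTATIONS`, `train_and_predict`, and `write_predictions` are hypothetical, scikit-learn's built-in English list is assumed for the "common stopwords", and the bigram model is assumed to include unigrams as well (`ngram_range=(1, 2)`).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy stand-ins for the text and polarity columns of imdb_tr.csv (assumed data).
train_texts = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "a terrible plot and awful acting",
    "awful film with a boring story",
]
train_polarity = [1, 1, 0, 0]

# Toy stand-in for the reviews in imdb_te.csv (assumed data).
test_texts = ["wonderful cast and great acting", "boring and terrible film"]

# One vectorizer per representation. stop_words="english" is an assumption;
# the assignment may supply its own stopword list. Use ngram_range=(2, 2)
# instead if the bigram model should contain bigrams only.
REPRESENTATIONS = {
    "unigram": CountVectorizer(ngram_range=(1, 1), stop_words="english"),
    "bigram": CountVectorizer(ngram_range=(1, 2), stop_words="english"),
    "unigramtfidf": TfidfVectorizer(ngram_range=(1, 1), stop_words="english"),
    "bigramtfidf": TfidfVectorizer(ngram_range=(1, 2), stop_words="english"),
}


def train_and_predict(vectorizer, texts, labels, new_texts):
    """Fit the vectorizer and an SGD classifier, then label `new_texts`."""
    # Build the term-document matrix from the training text only.
    X_train = vectorizer.fit_transform(texts)
    clf = SGDClassifier(loss="hinge", penalty="l1", random_state=0)
    clf.fit(X_train, labels)
    # Reuse the fitted vocabulary so test columns match the training columns.
    return clf.predict(vectorizer.transform(new_texts))


def write_predictions(predictions, path):
    """Write one predicted label per line to `path`."""
    with open(path, "w") as fh:
        fh.writelines(f"{label}\n" for label in predictions)


predictions = {
    name: train_and_predict(vec, train_texts, train_polarity, test_texts)
    for name, vec in REPRESENTATIONS.items()
}
```

Calling `write_predictions(predictions["unigram"], "unigram.output.txt")` would then produce the first output file. Fitting each vectorizer on the training text and only calling `transform` on the test text keeps the test reviews from influencing the vocabulary or the idf weights.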

References

Problem Statement

sklearn text feature extraction

sklearn SGD

pandas cheatsheet

Fast streaming of data from big data sources
