HW1: Amazon Review Classification
If you want to create a virtual environment, run virtualenv <the name of the environment, say 'tekwani'>
To begin using the virtual environment, you must activate it. $ source tekwani/bin/activate
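Now install the packages specified in requirements.txt by running pip install -r requirements.txt. (pip freeze > requirements.txt does the opposite: it writes the current state of the environment to requirements.txt, overwriting the file, so run it only if you want to record your own environment.)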
Depending on your installation, NLTK might require the WordNet and stopwords data. To install these, start python and open the NLTK downloader, typically with import nltk followed by nltk.download().
When the graphical installer appears, select WordNet and stopwords from Corpora and install.
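If the graphical installer is not available (for example on a machine without a display), the same two corpora can usually be fetched by name. The following is a minimal sketch, assuming NLTK's standard downloader and its usual package identifiers wordnet and stopwords. The helper name download_nltk_data is made up for illustration and is not part of this repository.

```python
# Package identifiers as used by NLTK's downloader (assumed to match
# the "WordNet" and "stopwords" entries listed under Corpora).
NLTK_PACKAGES = ("wordnet", "stopwords")


def download_nltk_data(packages=NLTK_PACKAGES):
    """Download each package and return {name: True/False} for success."""
    import nltk  # imported lazily, so NLTK is only needed when this is called

    # nltk.download returns True when the package is installed or up to date.
    return {name: nltk.download(name, quiet=True) for name in packages}
```

Calling download_nltk_data() once inside the activated virtual environment should be enough. Any entry that comes back False points to a failed download.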
Running the kNN classifier
The folder HW1_tekwani/data must contain the test.csv and train.csv files exactly as provided. I've cleaned the original test.data and train.data files for whitespace, formatting, and quote and line delimiters. Some of the steps I took are in preprocess.py, but I used a combination of Unix commands and Python. The classifier will not work on train.data and test.data directly.
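Before starting the long feature extraction run, it can help to confirm that both files are where the scripts expect them. This is a small sketch of such a check. The function name missing_data_files is hypothetical, and the folder path is passed in by the caller rather than assumed.

```python
from pathlib import Path

# File names taken from the description above.
REQUIRED_FILES = ("train.csv", "test.csv")


def missing_data_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (data_dir / name).is_file()]
```

For example, missing_data_files("HW1_tekwani/data") returns an empty list when both CSV files are in place.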
- Run src/feature_extraction.py first. This usually takes about 20 minutes and generates 19 CSV files that contain similarity measures for the train and test data.
- Now run knn.py without any command line arguments. To change the value of k, edit the last line in knn.py.
- knn.py creates a submission.txt file that contains 18506 rows, each with either a +1 or a -1.
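The output format described above is easy to sanity-check before submitting. This is a minimal sketch, assuming one label per line written exactly as +1 or -1. The function name check_submission is made up for illustration and is not part of knn.py.

```python
EXPECTED_ROWS = 18506          # row count stated above
VALID_LABELS = {"+1", "-1"}    # assumed to be written exactly like this


def check_submission(path, expected_rows=EXPECTED_ROWS):
    """Return a list of problems found in the submission file (empty if none)."""
    with open(path, encoding="utf-8") as handle:
        labels = [line.strip() for line in handle]

    problems = []
    if len(labels) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(labels)}")
    # Report every line whose content is not one of the two allowed labels.
    for number, label in enumerate(labels, start=1):
        if label not in VALID_LABELS:
            problems.append(f"line {number}: unexpected label {label!r}")
    return problems
```

Running check_submission("submission.txt") after knn.py finishes should return an empty list if the file matches the description.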