In [1]:
# Imports - these are all the imports needed for the assignment
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
#   PennTreeBank word tokenizer 
#   English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
#   SVM (Support Vector Machine) classifier 
#   Vectorizer, which transforms text data into bag-of-words features
#   TF-IDF Vectorizer, which down-weights words that appear in many documents when transforming text data
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

In [2]:
'''
For this assignment we will be using nltk: the Natural Language Toolkit.

To do so, we will need to download some text data.

Natural language processing (NLP) often requires corpus data (lists of words and/or example text data), 
which we will download now if you don’t already have it.
'''


In [3]:
# Download the NLTK English tokenizer and the stopwords of all languages
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/solavancininoceti/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/solavancininoceti/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
'''
Downloading Data

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.
'''


In [5]:
'''
Part 1: Sentiment Analysis on Movie Review Data (4.25 points)
In part 1 we will apply sentiment analysis to Movie Review (MR) data.

The MR data contains more than 10,000 reviews collected from the Rotten Tomatoes website, 
and each of the reviews is annotated as either positive or negative. 
The numbers of positive and negative reviews are roughly the same. 
For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/
For this homework assignment, we’ve already shuffled the data, and truncated the data to contain only 5000 reviews.
In this part of the assignment we will:

Transform the raw text data into vectors with the BoW encoding method
Split the data into training and test sets
Write a function to train an SVM classifier on the training set
Test this classifier on the test set and report the results
1a) Import data

Import the textfile ‘rt-polarity.tsv’ into a DataFrame called MR_df,

Set the column names as ‘index’, ‘label’, ‘review’

Note that ‘rt-polarity.tsv’ is a tab separated raw text file, in which data is separated by tabs (‘\t’). 
You can load this file with read_csv, specifying the sep (separator) argument as tabs (‘\t’). 
You will have to set header as None.
'''


In [29]:
MR_df = pd.read_csv('rt-polarity.tsv', sep='\t', header=None, names=['index', 'label', 'review'])

In [30]:
assert isinstance(MR_df, pd.DataFrame)
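
As a small illustration of the read_csv arguments described in 1a, the sketch below parses a tab-separated string held in memory instead of a file. The two rows and the name toy_df are made up for this example; only the sep, header and names arguments mirror the assignment.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same tab-separated layout as the data file
raw_text = "8477\tneg\tsome negative text\n4031\tpos\tsome positive text\n"

# header=None tells pandas the first row is data, so names supplies the columns
toy_df = pd.read_csv(
    io.StringIO(raw_text),
    sep='\t',
    header=None,
    names=['index', 'label', 'review'],
)
print(toy_df)
```

Without header=None, the first review would be consumed as the header row and silently dropped from the data.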

In [31]:
# Check the data
MR_df.head()

   index label                                             review
0   8477   neg  except as an acting exercise or an exceptional...
1   4031   pos  japanese director shohei imamura 's latest fil...
2  10240   neg  i walked away not really know who `` they `` w...
3   8252   neg  what could have been a neat little story about...
4   1346   pos  no screen fantasy-adventure in recent memory h...


In [32]:
'''
1b) Create a function that converts string labels to numerical labels

Function name: convert_label

The function should do the following:

if the input label is “pos” return 1.0
if the input label is “neg” return 0.0
otherwise, return the input label as is
'''


In [33]:
def convert_label(label):
    if label == 'pos':
        return 1.0
    elif label == 'neg':
        return 0.0
    else:
        return label

In [34]:
assert callable(convert_label)
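
To see the three branches of convert_label in action, here is a minimal sketch on a made-up Series. The values in toy_labels are hypothetical; the None entry stands in for a row that has no label and shows the pass-through branch.

```python
import pandas as pd


def convert_label(label):
    """Same mapping as in 1b: 'pos' -> 1.0, 'neg' -> 0.0, anything else unchanged."""
    if label == 'pos':
        return 1.0
    elif label == 'neg':
        return 0.0
    else:
        return label


# Hypothetical labels; None plays the role of a missing label
toy_labels = pd.Series(['pos', 'neg', None, 'pos'])
converted = toy_labels.apply(convert_label)

print(converted)
print("missing labels:", converted.isnull().sum())
```

The pass-through branch matters in part 3, where unlabeled test rows must stay missing rather than be forced to 0.0 or 1.0.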

In [35]:
'''
1c) Numerical Labels

Convert all labels in MR_df["label"] to numerical labels, using the convert_label function.

Save them as a new column named “y” in MR_df.
'''


In [37]:
MR_df['y'] = MR_df['label'].apply(convert_label)

In [38]:
assert sorted(set(MR_df['y'])) == [0., 1.]

In [39]:
# Check the MR_df data
MR_df.head()

   index label                                             review    y
0   8477   neg  except as an acting exercise or an exceptional...  0.0
1   4031   pos  japanese director shohei imamura 's latest fil...  1.0
2  10240   neg  i walked away not really know who `` they `` w...  0.0
3   8252   neg  what could have been a neat little story about...  0.0
4   1346   pos  no screen fantasy-adventure in recent memory h...  1.0


In [40]:
'''
1d) Convert Text data into vector

We will now create a CountVectorizer object to transform the text data into vectors with numerical values.

To do so, we will initialize a CountVectorizer object, and name it as vectorizer.

We need to pass 4 arguments to initialize a CountVectorizer:

analyzer: 'word' Specify to analyze data from word-level.
max_features: 2000 Set a max number of unique words.
tokenizer: word_tokenize Set to tokenize the text data by using word_tokenize from NLTK.
stop_words: stopwords.words('english') Set to remove all stopwords in English. 
We do this since they generally don’t provide useful discriminative information.
'''

In [41]:
vectorizer = CountVectorizer(analyzer='word', max_features=2000, tokenizer=word_tokenize, stop_words=stopwords.words('english'))

In [42]:
assert vectorizer.analyzer == 'word'
assert vectorizer.max_features == 2000
assert vectorizer.tokenizer == word_tokenize
assert vectorizer.stop_words == stopwords.words('english')
assert hasattr(vectorizer, "fit_transform")
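
The bag-of-words encoding can be illustrated on a tiny corpus. This is a sketch under assumptions: the three documents are made up, the default scikit-learn tokenizer is used instead of word_tokenize so that no NLTK data is needed, and max_features is set to 3 rather than 2000.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; corpus-wide counts: great=3, movie=2, dull=2, acting=1
toy_docs = ["great great movie", "dull movie", "great dull acting"]

# max_features=3 keeps only the 3 most frequent words, so "acting" is dropped
toy_vectorizer = CountVectorizer(analyzer='word', max_features=3)
toy_X = toy_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each kept word to its column index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_X)
```

Each row is one document and each column counts one vocabulary word, so word order inside a document is discarded.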

In [43]:
'''
1e) Vectorize reviews

Transform reviews MR_df["review"] into vectors using the vectorizer we created above:

The method you will be using is: MR_X = vectorizer.fit_transform(...).toarray()

Note that we apply the toarray method to cast the output to a numpy array. 
This is something we will do multiple times, turning custom sklearn objects back into arrays.

Note this may raise a warning about stopwords. This is ok.
'''



In [71]:
MR_X = vectorizer.fit_transform(MR_df['review']).toarray()
MR_X.shape

(5000, 2000)

In [45]:
assert type(MR_X) == np.ndarray
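
The reason toarray is needed is that fit_transform returns a SciPy sparse matrix rather than a numpy array. A minimal sketch with two made-up documents (the variable names are illustrative only):

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus; vocabulary becomes bad, film, good
sparse_X = CountVectorizer().fit_transform(["good film", "bad film"])

# The raw result stores only the non-zero counts
print(sparse.issparse(sparse_X))

# toarray() materializes every entry, zeros included, as a numpy array
dense_X = sparse_X.toarray()
print(type(dense_X).__name__, dense_X.shape)
```

With 5000 reviews and 2000 features the dense array is still small enough to hold in memory; for much larger vocabularies, keeping the sparse form is usually preferable.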

In [46]:
'''
1f) Outcome variable

Copy out the y column in MR_df and save it as an np.array named MR_y

Make sure the shape of MR_y is (5000,) - depending upon your earlier approach, you may have to use reshape to do so.
'''


In [47]:
MR_y = np.array(MR_df['y'])

In [67]:
MR_y.shape

(5000,)

In [50]:
assert MR_y.shape == (5000,)

In [51]:
'''
1g) Defining the train & test sets

We first set 80% of the data as the training set to train an SVM classifier. 
We will then test the learnt classifier on the remaining 20% of data samples (test set). 
(Reminder: For this homework assignment, we’ve already shuffled the data)

Calculate the number of training data samples (80% of total) and store it in num_training
Calculate the number of test data samples (20% of total) and store it in num_testing
Make sure both of these variables are of type int
'''


In [57]:
num_training = int(len(MR_y) * 0.8)
num_testing = len(MR_y) - num_training  # the remaining 20%, so the two counts always sum to the total

In [58]:
assert type(num_training) == int
assert type(num_testing) == int

In [59]:
'''
1h) Extracting train & test Data

Split the MR_X and MR_y into training set and test set. 
You should use the num_training variable to extract the data from MR_X and MR_y.

Extract the first num_training samples as training data, and extract the rest as test data.

Name them as:

MR_train_X and MR_train_y for the training set
MR_test_X and MR_test_y for the test set
'''


In [89]:
MR_train_X = MR_X[:num_training]
MR_train_y = MR_y[:num_training]
MR_test_X = MR_X[num_training:]
MR_test_y = MR_y[num_training:]

In [90]:
assert MR_train_X.shape[0] == MR_train_y.shape[0]
assert MR_test_X.shape[0] == MR_test_y.shape[0]

assert len(MR_train_X) == 4000
assert len(MR_test_y) == 1000
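
The slicing-based split can be cross-checked against scikit-learn's train_test_split. The sketch below uses made-up arrays of 10 samples; passing shuffle=False is an assumption made here so that the helper keeps the original order, which is what the manual slices do.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# Manual 80/20 split by position, as in 1g and 1h
n_train = int(len(toy_y) * 0.8)
train_X, test_X = toy_X[:n_train], toy_X[n_train:]
train_y, test_y = toy_y[:n_train], toy_y[n_train:]

# Same split via scikit-learn; shuffle=False preserves the row order
alt_train_X, alt_test_X, alt_train_y, alt_test_y = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=False
)

print(np.array_equal(train_X, alt_train_X), np.array_equal(test_y, alt_test_y))
```

Positional slicing is only safe because the data was shuffled beforehand; on unshuffled data, train_test_split with its default shuffling would be the safer choice.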

In [91]:
'''
1i) SVM

Define a function called train_SVM that initializes an SVM classifier and trains it

Inputs:

X: np.ndarray, training samples,
y: np.ndarray, training labels,
kernel: string, set the default value of “kernel” as “linear”
Output: a trained classifier clf

Hint: There are 2 steps involved in this function:

Initializing an SVM classifier: clf = SVC(...)
Training the classifier: clf.fit(X, y)
'''


In [94]:
def train_SVM(X, y, kernel='linear'):
    # Initialize an SVM classifier with the requested kernel, then fit it
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    return clf

In [95]:
assert callable(train_SVM)
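
A quick way to sanity-check a function with the signature described in 1i is to train it on a few points that are obviously separable. The four 2-D points below are made up for illustration; the function body follows the two steps given in the hint.

```python
import numpy as np
from sklearn.svm import SVC


def train_SVM(X, y, kernel='linear'):
    """Initialize an SVC with the given kernel, fit it, and return it."""
    clf = SVC(kernel=kernel)
    clf.fit(X, y)
    return clf


# Hypothetical data: class 0.0 near the origin, class 1.0 far from it
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0]])
toy_y = np.array([0.0, 0.0, 1.0, 1.0])

toy_clf = train_SVM(toy_X, toy_y)

# One new point close to each cluster
toy_pred = toy_clf.predict(np.array([[0.5, 0.5], [3.5, 3.5]]))
print(toy_pred)
```

Exposing kernel as a parameter makes it possible to try other kernels such as 'rbf' without rewriting the function.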

In [102]:
'''
1j) Train SVM

Train an SVM classifier with the default linear kernel on the samples MR_train_X and the labels MR_train_y

You need to call the function train_SVM you just created. Name the returned object as MR_clf.

Note that running this function may take many seconds / up to a few minutes to run.
'''


In [103]:
MR_clf = train_SVM(MR_train_X, MR_train_y)

In [104]:
assert isinstance(MR_clf, SVC)
assert hasattr(MR_clf, "predict")

In [106]:
'''
1k) Predict outcome

Predict labels for both training samples and test samples. You will need to use MR_clf.predict(...)

Name the predicted labels for the training samples as MR_predicted_train_y. 
Name the predicted labels for the testing samples as MR_predicted_test_y.

Your code here will also take a minute to run.

'''


In [108]:
MR_predicted_train_y = MR_clf.predict(MR_train_X)
MR_predicted_test_y = MR_clf.predict(MR_test_X)

In [110]:
'''
Now we will use the function classification_report to print out the performance of the classifier on the training set:

'''
# Your classifier should be able to reach above 90% accuracy 
# on the training set
print(classification_report(MR_train_y,MR_predicted_train_y))

              precision    recall  f1-score   support

         0.0       0.91      0.91      0.91      2008
         1.0       0.91      0.91      0.91      1992

    accuracy                           0.91      4000
   macro avg       0.91      0.91      0.91      4000
weighted avg       0.91      0.91      0.91      4000



In [111]:
'''
And finally, we check the performance of the trained classifier on the test set:
'''
# Your classifier should be able to reach around 70% accuracy on the test set.
print(classification_report(MR_test_y, MR_predicted_test_y))

              precision    recall  f1-score   support

         0.0       0.68      0.68      0.68       482
         1.0       0.70      0.70      0.70       518

    accuracy                           0.69      1000
   macro avg       0.69      0.69      0.69      1000
weighted avg       0.69      0.69      0.69      1000



In [112]:
assert MR_predicted_train_y.shape == (4000,)
assert MR_predicted_test_y.shape == (1000,)

precision, recall, _, _ = precision_recall_fscore_support(MR_train_y,MR_predicted_train_y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.92, 0.02)
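
The precision and recall columns in the reports above can be reproduced by hand from counts of true positives, false positives and false negatives. The label arrays below are made up; the sketch compares the hand-computed values for class 1 with precision_recall_fscore_support.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical true and predicted labels for 8 samples
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Counts for the positive class (label 1)
tp = np.sum((y_true == 1) & (y_pred == 1))  # predicted 1, truly 1
fp = np.sum((y_true == 0) & (y_pred == 1))  # predicted 1, truly 0
fn = np.sum((y_true == 1) & (y_pred == 0))  # predicted 0, truly 1

manual_precision = tp / (tp + fp)  # share of predicted positives that are correct
manual_recall = tp / (tp + fn)     # share of true positives that were found

# Arrays are indexed by class in sorted label order: index 1 is label 1
precision, recall, _, _ = precision_recall_fscore_support(y_true, y_pred)
print(manual_precision, precision[1])
print(manual_recall, recall[1])
```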

In [114]:
'''
Part 2: TF-IDF (1.25 points)
In this part, we will explore TF-IDF on sentiment analysis.

TF-IDF is used as an alternate way to encode text data, as compared to the BoW approach used in Part 1.

To do this, we will:

Transform the raw text data into vectors using TF-IDF
Train an SVM classifier on the training set and report the performance of this classifier on the test set
2a) Text Data to Vectors

We will create a TfidfVectorizer object to transform the text data into vectors with TF-IDF

To do so, we will initialize a TfidfVectorizer object, and name it as tfidf.

We need to pass 4 arguments into the “TfidfVectorizer” to initialize a “tfidf”:

sublinear_tf: True Set to apply TF scaling.
analyzer: 'word' Set to analyze the data at the word-level
max_features: 2000 Set the max number of unique words
tokenizer: word_tokenize Set to tokenize the text data by using the word_tokenizer from NLTK
'''

In [115]:
tfidf = TfidfVectorizer(sublinear_tf=True, analyzer='word', max_features=2000, tokenizer=word_tokenize)
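
To make the TF-IDF weighting concrete, the sketch below recomputes it by hand on a made-up corpus and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings as I understand them: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row; with sublinear_tf=True a raw count tf is replaced by 1 + ln(tf). The default tokenizer is used here instead of word_tokenize.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
toy_docs = ["great great movie", "dull movie", "great acting"]

# Raw counts, one row per document
counts = CountVectorizer().fit_transform(toy_docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each word appears
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Sublinear term frequency: 1 + ln(tf) for non-zero counts, 0 otherwise
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1 + np.log(counts[nonzero])

# Weight, then scale each row to unit Euclidean length
weights = tf * idf
manual = weights / np.linalg.norm(weights, axis=1, keepdims=True)

reference = TfidfVectorizer(sublinear_tf=True).fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

Words that occur in fewer documents get a larger idf, so "acting" outweighs "great" within the third document even though each appears there once.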

In [116]:
assert tfidf.analyzer == 'word'
assert tfidf.max_features == 2000
assert tfidf.tokenizer == word_tokenize
assert tfidf.stop_words is None
assert hasattr(tfidf, "fit_transform")

In [117]:
'''
2b) Transform Reviews

Transform the review column of MR_df into vectors using the tfidf we created above.

Save the transformed data into a variable called MR_tfidf_X

Hint: You might need to cast the datatype of MR_tfidf_X to numpy.ndarray by using .toarray()
'''


In [134]:
MR_tfidf_X = tfidf.fit_transform(MR_df['review']).toarray()

In [135]:
assert isinstance(MR_tfidf_X, np.ndarray)

assert "skills" in set(tfidf.stop_words_)
assert "risky" in set(tfidf.stop_words_)
assert "adopts" in set(tfidf.stop_words_)

In [136]:
'''
2c)

Split the MR_tfidf_X and MR_y into training set and test set.

Name these variables as:

MR_train_tfidf_X and MR_train_tfidf_y for the training set
MR_test_tfidf_X and MR_test_tfidf_y for the test set
We will use the same 80/20 split as in part 1. You can use the same num_training variable from part 1 to split up the data.
'''


In [137]:
MR_train_tfidf_X = MR_tfidf_X[:num_training]
MR_train_tfidf_y = MR_y[:num_training]
MR_test_tfidf_X = MR_tfidf_X[num_training:]
MR_test_tfidf_y = MR_y[num_training:]

In [138]:
assert MR_train_tfidf_X[0].tolist() == MR_tfidf_X[0].tolist()
assert MR_train_tfidf_X.shape == (4000, 2000)
assert MR_test_tfidf_X.shape == (1000, 2000)

In [139]:
'''
2d) Training

Train an SVM classifier on the samples MR_train_tfidf_X and the labels MR_train_tfidf_y.

You need to call the function train_SVM you created in part 1. Name the returned object as MR_tfidf_clf.

Note that this may take many seconds, up to a few minutes, to run.
'''


In [140]:
MR_tfidf_clf = train_SVM(MR_train_tfidf_X, MR_train_tfidf_y)

In [141]:
assert isinstance(MR_tfidf_clf, SVC)
assert hasattr(MR_tfidf_clf, "predict")

In [142]:
'''
2e) Prediction

Predict the labels for both the training and test samples (the ‘X’ data). You will need to use MR_tfidf_clf.predict(...)

Name the predicted labels on training samples as MR_pred_train_tfidf_y. 
Name the predicted labels on testing samples as MR_pred_test_tfidf_y
'''


In [143]:
MR_pred_train_tfidf_y = MR_tfidf_clf.predict(MR_train_tfidf_X)
MR_pred_test_tfidf_y = MR_tfidf_clf.predict(MR_test_tfidf_X)

In [144]:
'''Again, we use classification_report to check the performance on the training set.'''

# Your classifier should be able to reach above 85% accuracy.
print(classification_report(MR_train_tfidf_y, MR_pred_train_tfidf_y))

              precision    recall  f1-score   support

         0.0       0.86      0.88      0.87      2008
         1.0       0.87      0.85      0.86      1992

    accuracy                           0.86      4000
   macro avg       0.87      0.86      0.86      4000
weighted avg       0.87      0.86      0.86      4000



In [146]:
''' Again, check performance on the test set:'''

# Your classifier should be able to reach around 70% accuracy.
print(classification_report(MR_test_tfidf_y, MR_pred_test_tfidf_y))

precision, recall, _, _ = precision_recall_fscore_support(MR_train_tfidf_y, MR_pred_train_tfidf_y)
assert np.isclose(precision[0], 0.86, 0.02)
assert np.isclose(precision[1], 0.87, 0.02)

              precision    recall  f1-score   support

         0.0       0.72      0.71      0.72       482
         1.0       0.74      0.74      0.74       518

    accuracy                           0.73      1000
   macro avg       0.73      0.73      0.73      1000
weighted avg       0.73      0.73      0.73      1000
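
One way to compare the two encodings is to look at the gap between training and test accuracy. The sketch below only does arithmetic on the accuracy values printed in the four reports of parts 1 and 2; the dictionary layout is an illustrative choice.

```python
# Accuracy values copied from the classification reports in parts 1 and 2
reported = {
    "BoW": {"train": 0.91, "test": 0.69},
    "TF-IDF": {"train": 0.86, "test": 0.73},
}

# A smaller train-test gap suggests less overfitting to the training reviews
gaps = {
    name: round(acc["train"] - acc["test"], 2)
    for name, acc in reported.items()
}

for name, gap in gaps.items():
    print(f"{name}: test accuracy {reported[name]['test']:.2f}, gap {gap:.2f}")
```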



In [147]:
'''
Written Answer Question

How does the performance of the TF-IDF classifier compare to the classifier used in part 1?
'''


In [148]:
'''
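The TF-IDF classifier scores lower than the part 1 classifier on the training set (0.86 vs. 0.91 accuracy) but higher on the test set (0.73 vs. 0.69), so it generalizes better to unseen reviews.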
'''


In [149]:
'''
Part 3: Sentiment Analysis on Customer Review with TF-IDF (2 points)
In this part, we will use TF-IDF to analyse the sentiment of some Customer Review (CR) data.

The CR data contains around 3771 reviews, and they were all collected from the Amazon website. 
The reviews are annotated by humans as either positive reviews or negative reviews. 
In this dataset, the 2 classes are not balanced, as there are twice as many positive reviews as negative reviews.

For more information on this dataset, you can visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In this part, we have already split the data into a training set and a test set, 
in which the training set has labels for the reviews, but the test set doesn’t.

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

To do so, we will:

Use the TF-IDF feature engineering method to encode the raw text data into vectors
Train an SVM classifier on the training set
Predict labels for the reviews in the test set
The performance of your trained classifier on the test set will be checked by a hidden test.

3a) Loading the data

The customer review task has 2 files:

“custrev_train.tsv” contains training data with labels
“custrev_test.tsv” contains test data without labels which need to be predicted
Import raw textfile custrev_train.tsv into a DataFrame called CR_train_df. Set the column names as index, label, review.

Import raw textfile custrev_test.tsv into a DataFrame called CR_test_df. Set the column names as index, review

Note that both will need to be imported with sep and header arguments (like in 1a)
'''


In [151]:
CR_train_file = 'custrev_train.tsv'
CR_test_file = 'custrev_test.tsv'

CR_train_df = pd.read_csv(CR_train_file, names=['index', 'label', 'review'], header=None, sep='\t')
CR_test_df = pd.read_csv(CR_test_file, names=['index', 'review'], header=None, sep='\t')

In [152]:
assert isinstance(CR_train_df, pd.DataFrame)
assert isinstance(CR_test_df, pd.DataFrame)

In [153]:
'''
3b) Concatenation

Concatenate the 2 DataFrames from the last step into a single DataFrame, and name it CR_df.
'''


In [158]:
CR_df = pd.concat([CR_train_df, CR_test_df], axis=0)

In [159]:
assert len(CR_df) == 3771

In [161]:
'''
3c) Cleaning

Convert all labels in CR_df["label"] using the convert_label function we defined above. 
Save these numerical labels as a new column named y in CR_df.
'''


In [162]:
CR_df['y'] = CR_df['label'].apply(convert_label)

In [163]:
assert isinstance(CR_df['y'], pd.Series)

In [164]:
'''
3d) Use tfidf

Transform reviews CR_df["review"] into vectors using the tfidf vectorizer we created in part 2. 
Save the transformed data into a variable called CR_tfidf_X.
'''


In [168]:
CR_tfidf_X = tfidf.fit_transform(CR_df["review"]).toarray()

In [169]:
assert isinstance(CR_tfidf_X, np.ndarray)

In [170]:
'''
Here we will collect all training samples & numerical labels from CR_tfidf_X. 
The code provided below will extract all samples with labels from the dataframe:
'''
# code provided to collect labels
CR_train_X = CR_tfidf_X[~CR_df['y'].isnull()]
CR_train_y = CR_df['y'][~CR_df['y'].isnull()]

# Note: if these asserts fail, something went wrong
#  Go back and check your code (in part 3) above this cell
assert CR_train_X.shape == (3016, 2000)
assert CR_train_y.shape == (3016, )
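
The mask-based selection works because concatenating a labeled and an unlabeled DataFrame leaves missing values in the label column for the unlabeled rows. A minimal sketch with made-up frames follows; all names and values are hypothetical, and Series.map is used here in place of convert_label to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled and unlabeled frames
toy_train = pd.DataFrame({'label': ['pos', 'neg'],
                          'review': ['works well', 'broke fast']})
toy_test = pd.DataFrame({'review': ['arrived late']})

# The test frame has no label column, so its rows get a missing label
toy_df = pd.concat([toy_train, toy_test], axis=0)
toy_df['y'] = toy_df['label'].map({'pos': 1.0, 'neg': 0.0})

# Stand-in feature matrix with one row per review
toy_features = np.arange(6).reshape(3, 2)

# Rows with a label form the training set; rows without one form the test set
has_label = ~toy_df['y'].isnull()
toy_train_X = toy_features[has_label.to_numpy()]
toy_test_X = toy_features[~has_label.to_numpy()]

print(toy_train_X.shape, toy_test_X.shape)
```

Converting the mask with to_numpy() sidesteps the repeated index values that concatenation along axis=0 produces.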

In [171]:
'''
3e) SVM

Train an SVM classifier on the samples CR_train_X and the labels CR_train_y:

You need to call the function train_SVM you created above.
Name the returned object as CR_clf.
Note that this function will take many seconds / up to a few minutes to run.
'''


In [172]:
CR_clf = train_SVM(CR_train_X, CR_train_y)

In [173]:
assert isinstance(CR_clf, SVC)

In [174]:
'''
3f) Predict: training data

Predict labels on the training set, and name the returned variable as CR_pred_train_y
'''


In [175]:
CR_pred_train_y = CR_clf.predict(CR_train_X)

In [176]:
# Check the classifier accuracy on the train data
#   Note that your classifier should be able to reach above 90% accuracy.
print(classification_report(CR_train_y, CR_pred_train_y))

              precision    recall  f1-score   support

         0.0       0.91      0.84      0.87      1097
         1.0       0.91      0.95      0.93      1919

    accuracy                           0.91      3016
   macro avg       0.91      0.89      0.90      3016
weighted avg       0.91      0.91      0.91      3016



In [177]:
precision, recall, _, _ = precision_recall_fscore_support(CR_train_y, CR_pred_train_y)
assert np.isclose(precision[0], 0.90, 0.02)
assert np.isclose(precision[1], 0.91, 0.02)

In [178]:
# Collect all test samples from CR_tfidf_X
CR_test_X = CR_tfidf_X[CR_df['y'].isnull()]

In [179]:
'''
3g) Predict: test set

Predict the labels on the test set. Name the returned variable as CR_pred_test_y
'''


In [180]:
CR_pred_test_y = CR_clf.predict(CR_test_X)

In [181]:
assert isinstance(CR_test_X, np.ndarray)
assert isinstance(CR_pred_test_y, np.ndarray)

In [182]:
'''
3h) Convert labels

Convert the predicted numerical labels back to string labels (“pos” and “neg”).

Create a column called label in CR_test_df to store the converted labels.
'''


In [193]:
# Map numerical predictions back to string labels; any other value passes through unchanged
label_names = {1.0: 'pos', 0.0: 'neg'}
CR_test_df['label'] = [label_names.get(label, label) for label in CR_pred_test_y]

In [196]:
'''
The hidden assignment tests for the cell above will check that your model predicts the right number 
of pos/neg reviews in the test data provided.

We now have a model that can predict positive or negative sentiment!

In the cell below, as a written answer question, briefly explain, in your own words, 
what BoW and TF/IDF word representations are, and how they differ. 
Also, think about and write a quick example 
of when and why it might be useful to automatically analyze the sentiment of text data. 
[This whole answer can/should be a couple of sentences].

After you answer this question, you are done!
'''


In [197]:
'''
The BoW model represents each review by raw word counts, so words that appear more often carry more weight, 
while the TF/IDF model also down-weights words that appear in many reviews, since they discriminate less.
Automatically analyzing the sentiment of text data is useful when processing
customer feedback, surveys, and basically any data with open responses.
'''