### IMPLEMENTATION OF VARIOUS MACHINE LEARNING TECHNIQUES FOR SENTIMENT ANALYSIS.
* This notebook contains implementations of Multinomial Naive Bayes, LinearSVC, and Random Forest for sentiment analysis.
* The dataset used is **"Sentiment Labelled Sentences Dataset", from the UC Irvine Machine Learning Repository.**
* The sentences come from three different websites/fields:
    * amazon.com
    * imdb.com
    * yelp.com
* Each sentence is labelled as either 1 (for positive) or 0 (for negative).
* For each website, there exist 500 positive and 500 negative sentences.
* This dataset was created for the paper 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015.  *(Please cite the paper if you want to use it :))*

* Link to the dataset is: [Sentiment Labelled Sentences Data Set](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences)

In [1]:
from TextPreprocessing import TextPreprocessing
from Model import train_ml_model
import numpy as np
from numpy import array
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.svm import LinearSVC
from sklearn import ensemble
import joblib

### SENTIMENT ANALYSIS OF AMAZON REVIEWS

* Create an object of the TextPreprocessing class.
* This class contains all the methods for text preprocessing and vectorization.
* Passing "Amazon" as the argument to the constructor indicates that it must use the Amazon product reviews dataset.

In [3]:
tp = TextPreprocessing('Amazon')

In [4]:
corpus, labels = tp.get_data()

In [5]:
# Preprocess every sentence in the corpus
X = [tp.preprocess_text(c) for c in corpus]

In [6]:
#Splitting into training(80%) and testing data(20%)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20, random_state=42, shuffle=True)

In [7]:
print(X_train[56])

not impress would not recommend item anyone


### Feature engineering:
Raw text data is transformed into feature vectors. The methods used are:
1. Count Vectors. (Matrix holding the frequency count of each term in each document)
2. Word Level TF-IDF Vectors. (Matrix holding the tf-idf score of every term in each document)
3. N-gram Level TF-IDF Vectors. (N-grams are sequences of N consecutive terms. This matrix holds the tf-idf scores of N-grams)
4. Character Level TF-IDF Vectors. (Matrix holding the tf-idf scores of character-level n-grams in the corpus)
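
The vectorizer methods of `TextPreprocessing` are not shown in this notebook. As a minimal sketch, the four representations could be built with scikit-learn's `CountVectorizer` and `TfidfVectorizer` roughly as follows. The toy sentences and the `ngram_range=(2, 3)` settings are assumptions chosen for illustration, not the values used by `TextPreprocessing`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the preprocessed sentences (illustrative only).
train_docs = [
    "not impress would not recommend item anyone",
    "great phone love battery life",
    "work great good price",
    "waste money poor quality",
]
test_docs = ["great price would recommend", "poor battery"]

# Hypothetical settings: the real TextPreprocessing class may use others.
vectorizers = {
    "count": CountVectorizer(),
    "word_tfidf": TfidfVectorizer(analyzer="word"),
    "ngram_tfidf": TfidfVectorizer(analyzer="word", ngram_range=(2, 3)),
    "char_tfidf": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
}

features = {}
for name, vec in vectorizers.items():
    # Fit on the training split only, then reuse that vocabulary on the test split.
    x_train = vec.fit_transform(train_docs)
    x_test = vec.transform(test_docs)
    features[name] = (x_train, x_test)
    print(name, x_train.shape, x_test.shape)
```

Fitting each vectorizer on the training split and only transforming the test split keeps test-set vocabulary out of the features, so both matrices share the same columns.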

In [8]:
xtrain_count, xvalid_count = tp.count_vectorize(X_train, X_test)
xtrain_tfidf, xvalid_tfidf = tp.word_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram, xvalid_tfidf_ngram = tp.n_gram_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars = tp.char_TF_IDF_vectorize(X_train, X_test)

### Using Multinomial Naive Bayes for Sentiment Analysis
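
`train_ml_model` is imported from a local `Model` module whose source is not included here. Judging only from how it is called (a classifier and the train/test features and labels go in, a fitted model and an accuracy come out), a helper of this kind might look like the sketch below. The name `train_ml_model_sketch` and the toy data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def train_ml_model_sketch(classifier, x_train, x_test, y_train, y_test):
    """Fit `classifier` and return it with its accuracy on the held-out split."""
    classifier.fit(x_train, y_train)
    predictions = classifier.predict(x_test)
    return classifier, accuracy_score(y_test, predictions)


# Toy data, illustrative only (1 = positive, 0 = negative).
train_docs = ["great phone", "love it", "poor quality", "waste money"]
train_labels = [1, 1, 0, 0]
test_docs = ["great quality phone", "poor waste"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
x_train = vectorizer.fit_transform(train_docs)
x_test = vectorizer.transform(test_docs)

model, accuracy = train_ml_model_sketch(
    MultinomialNB(), x_train, x_test, train_labels, test_labels
)
print("Naive Bayes, Count Vectors (toy data):", accuracy)
```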

In [29]:
# Naive Bayes on Count Vectors
NB_cv, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_count, xvalid_count, y_train, y_test)
print("Naive Bayes, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
NB_word_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Naive Bayes, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
NB_n_gram_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Naive Bayes, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
NB_char_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Naive Bayes, CharLevel Vectors: ", accuracy)

Naive Bayes, Count Vectors:  0.82
Naive Bayes, WordLevel TF-IDF:  0.85
Naive Bayes, N-Gram Vectors:  0.61
Naive Bayes, CharLevel Vectors:  0.8


In [7]:
joblib.dump(NB_cv, '../models/ML/Amazon/NB_cv_amazon.pk1')
joblib.dump(NB_word_tf_idf, '../models/ML/Amazon/NB_word_tf_idf_amazon.pk1')
joblib.dump(NB_n_gram_tf_idf, '../models/ML/Amazon/NB_n_gram_tf_idf_amazon.pk1')
joblib.dump(NB_char_tf_idf, '../models/ML/Amazon/NB_char_tf_idf_amazon.pk1')
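
A saved classifier can later be restored with `joblib.load`. The sketch below round-trips a toy model through a temporary directory. It also saves the fitted vectorizer, because new text must be transformed with the same vocabulary the classifier was trained on; whether `TextPreprocessing` persists its vectorizers is not shown in this notebook, so that part is an assumption, as are the file names.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, illustrative only (1 = positive, 0 = negative).
train_docs = ["great phone", "love it", "poor quality", "waste money"]
train_labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_docs), train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names used only for this demonstration.
    model_path = os.path.join(tmp_dir, "nb_cv_demo.pkl")
    vectorizer_path = os.path.join(tmp_dir, "count_vectorizer_demo.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Restore both objects, as a separate prediction script would.
    restored_model = joblib.load(model_path)
    restored_vectorizer = joblib.load(vectorizer_path)

new_features = restored_vectorizer.transform(["poor quality"])
prediction = restored_model.predict(new_features)
print(prediction)
```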

### Using Random Forest for Sentiment Analysis

In [15]:
# Random Forest on Count Vectors
RF_cv, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_count, xvalid_count, y_train, y_test)
print("Random Forest, Count Vectors: ", accuracy)

# Random Forest on Word Level TF IDF Vectors
RF_word_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Random Forest, WordLevel TF-IDF: ", accuracy)

# Random Forest on Ngram Level TF IDF Vectors
RF_n_gram_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Random Forest, N-Gram Vectors: ", accuracy)

# Random Forest on Character Level TF IDF Vectors
RF_char_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =200), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Random Forest, CharLevel Vectors: ", accuracy)

Random Forest, Count Vectors:  0.83
Random Forest, WordLevel TF-IDF:  0.8
Random Forest, N-Gram Vectors:  0.645
Random Forest, CharLevel Vectors:  0.8


In [32]:
joblib.dump(RF_cv, '../models/ML/Amazon/RF_cv_amazon.pk1')
joblib.dump(RF_word_tf_idf, '../models/ML/Amazon/RF_word_tf_idf_amazon.pk1')
joblib.dump(RF_n_gram_tf_idf, '../models/ML/Amazon/RF_n_gram_tf_idf_amazon.pk1')
joblib.dump(RF_char_tf_idf, '../models/ML/Amazon/RF_char_tf_idf_amazon.pk1')

['../models/ML/RF_char_tf_idf_amazon.pk1']

### Using LinearSVC for Sentiment Analysis

In [16]:
# LinearSVC on Count Vectors
LinearSVC_cv, accuracy = train_ml_model(LinearSVC(), xtrain_count, xvalid_count, y_train, y_test)
print("LinearSVC, Count Vectors: ", accuracy)

# LinearSVC on Word Level TF IDF Vectors
LinearSVC_word_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("LinearSVC, WordLevel TF-IDF: ", accuracy)

# LinearSVC on Ngram Level TF IDF Vectors
LinearSVC_n_gram_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("LinearSVC, N-Gram Vectors: ", accuracy)

# LinearSVC on Character Level TF IDF Vectors
LinearSVC_char_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram_chars,  xvalid_tfidf_ngram_chars, y_train, y_test)
print("LinearSVC, CharLevel Vectors: ", accuracy)

LinearSVC, Count Vectors:  0.81
LinearSVC, WordLevel TF-IDF:  0.84
LinearSVC, N-Gram Vectors:  0.715
LinearSVC, CharLevel Vectors:  0.825


In [33]:
joblib.dump(LinearSVC_cv, '../models/ML/Amazon/LinearSVC_cv_amazon.pk1')
joblib.dump(LinearSVC_word_tf_idf, '../models/ML/Amazon/LinearSVC_word_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_n_gram_tf_idf, '../models/ML/Amazon/LinearSVC_n_gram_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_char_tf_idf, '../models/ML/Amazon/LinearSVC_char_tf_idff_amazon.pk1')

['../models/ML/LinearSVC_char_tf_idff_amazon.pk1']

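### Results:
* Count vectors and word-level tf-idf give the best or near-best accuracy for all three models; for LinearSVC, character-level tf-idf (0.825) falls between the two. Word n-gram tf-idf is clearly the weakest representation.
* Multinomial Naive Bayes has the highest single score (0.85 with word-level tf-idf), closely followed by LinearSVC (0.84) and Random Forest (0.83).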

### SENTIMENT ANALYSIS OF IMDB REVIEWS

* Create an object of the TextPreprocessing class.
* This class contains all the methods for text preprocessing and vectorization.
* Passing "IMDB" as the argument to the constructor indicates that it must use the IMDB movie reviews dataset.

In [34]:
tp = TextPreprocessing('IMDB')

In [35]:
corpus, labels = tp.get_data()

In [36]:
# Preprocess every sentence in the corpus
X = [tp.preprocess_text(c) for c in corpus]

In [37]:
#Splitting into training(80%) and testing data(20%)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20, random_state=42, shuffle=True)

### Feature engineering:
Raw text data is transformed into feature vectors. The methods used are:
1. Count Vectors. (Matrix holding the frequency count of each term in each document)
2. Word Level TF-IDF Vectors. (Matrix holding the tf-idf score of every term in each document)
3. N-gram Level TF-IDF Vectors. (N-grams are sequences of N consecutive terms. This matrix holds the tf-idf scores of N-grams)
4. Character Level TF-IDF Vectors. (Matrix holding the tf-idf scores of character-level n-grams in the corpus)

In [38]:
xtrain_count, xvalid_count = tp.count_vectorize(X_train, X_test)
xtrain_tfidf, xvalid_tfidf = tp.word_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram, xvalid_tfidf_ngram = tp.n_gram_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars = tp.char_TF_IDF_vectorize(X_train, X_test)

### Using Multinomial Naive Bayes for Sentiment Analysis

In [39]:
# Naive Bayes on Count Vectors
NB_cv, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_count, xvalid_count, y_train, y_test)
print("Naive Bayes, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
NB_word_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Naive Bayes, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
NB_n_gram_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Naive Bayes, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
NB_char_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Naive Bayes, CharLevel Vectors: ", accuracy)

Naive Bayes, Count Vectors:  0.82
Naive Bayes, WordLevel TF-IDF:  0.85
Naive Bayes, N-Gram Vectors:  0.61
Naive Bayes, CharLevel Vectors:  0.8


In [44]:
joblib.dump(NB_cv, '../models/ML/IMDB/NB_cv_amazon.pk1')
joblib.dump(NB_word_tf_idf, '../models/ML/IMDB/NB_word_tf_idf_amazon.pk1')
joblib.dump(NB_n_gram_tf_idf, '../models/ML/IMDB/NB_n_gram_tf_idf_amazon.pk1')
joblib.dump(NB_char_tf_idf, '../models/ML/IMDB/NB_char_tf_idf_amazon.pk1')

['../models/ML/IMDB/NB_char_tf_idf_amazon.pk1']

### Using Random Forest for Sentiment Analysis

In [40]:
# Random Forest on Count Vectors
RF_cv, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_count, xvalid_count, y_train, y_test)
print("Random Forest, Count Vectors: ", accuracy)

# Random Forest on Word Level TF IDF Vectors
RF_word_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Random Forest, WordLevel TF-IDF: ", accuracy)

# Random Forest on Ngram Level TF IDF Vectors
RF_n_gram_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Random Forest, N-Gram Vectors: ", accuracy)

# Random Forest on Character Level TF IDF Vectors
RF_char_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =200), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Random Forest, CharLevel Vectors: ", accuracy)

Random Forest, Count Vectors:  0.79
Random Forest, WordLevel TF-IDF:  0.79
Random Forest, N-Gram Vectors:  0.535
Random Forest, CharLevel Vectors:  0.77


In [43]:
joblib.dump(RF_cv, '../models/ML/IMDB/RF_cv_amazon.pk1')
joblib.dump(RF_word_tf_idf, '../models/ML/IMDB/RF_word_tf_idf_amazon.pk1')
joblib.dump(RF_n_gram_tf_idf, '../models/ML/IMDB/RF_n_gram_tf_idf_amazon.pk1')
joblib.dump(RF_char_tf_idf, '../models/ML/IMDB/RF_char_tf_idf_amazon.pk1')

['../models/ML/IMDB/RF_char_tf_idf_amazon.pk1']

### Using LinearSVC for Sentiment Analysis

In [41]:
# LinearSVC on Count Vectors
LinearSVC_cv, accuracy = train_ml_model(LinearSVC(), xtrain_count, xvalid_count, y_train, y_test)
print("LinearSVC, Count Vectors: ", accuracy)

# LinearSVC on Word Level TF IDF Vectors
LinearSVC_word_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("LinearSVC, WordLevel TF-IDF: ", accuracy)

# LinearSVC on Ngram Level TF IDF Vectors
LinearSVC_n_gram_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("LinearSVC, N-Gram Vectors: ", accuracy)

# LinearSVC on Character Level TF IDF Vectors
LinearSVC_char_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram_chars,  xvalid_tfidf_ngram_chars, y_train, y_test)
print("LinearSVC, CharLevel Vectors: ", accuracy)

LinearSVC, Count Vectors:  0.795
LinearSVC, WordLevel TF-IDF:  0.815
LinearSVC, N-Gram Vectors:  0.61
LinearSVC, CharLevel Vectors:  0.75


In [42]:
joblib.dump(LinearSVC_cv, '../models/ML/IMDB/LinearSVC_cv_amazon.pk1')
joblib.dump(LinearSVC_word_tf_idf, '../models/ML/IMDB/LinearSVC_word_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_n_gram_tf_idf, '../models/ML/IMDB/LinearSVC_n_gram_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_char_tf_idf, '../models/ML/IMDB/LinearSVC_char_tf_idff_amazon.pk1')

['../models/ML/IMDB/LinearSVC_char_tf_idff_amazon.pk1']

### Results:
* For the IMDB reviews dataset too, count vectors and word-level tf-idf give the best results for all three models.
* Multinomial Naive Bayes is again the best model (0.85 with word-level tf-idf), followed by LinearSVC (0.815) and Random Forest (0.79).

### SENTIMENT ANALYSIS OF YELP REVIEWS

* Create an object of the TextPreprocessing class.
* This class contains all the methods for text preprocessing and vectorization.
* Passing "Yelp" as the argument to the constructor indicates that it must use the Yelp restaurant reviews dataset.

In [46]:
tp = TextPreprocessing('Yelp')

In [47]:
corpus, labels = tp.get_data()

In [50]:
# Preprocess every sentence in the corpus
X = [tp.preprocess_text(c) for c in corpus]

In [51]:
#Splitting into training(80%) and testing data(20%)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20, random_state=42, shuffle=True)

### Feature engineering:
Raw text data is transformed into feature vectors. The methods used are:
1. Count Vectors. (Matrix holding the frequency count of each term in each document)
2. Word Level TF-IDF Vectors. (Matrix holding the tf-idf score of every term in each document)
3. N-gram Level TF-IDF Vectors. (N-grams are sequences of N consecutive terms. This matrix holds the tf-idf scores of N-grams)
4. Character Level TF-IDF Vectors. (Matrix holding the tf-idf scores of character-level n-grams in the corpus)

In [52]:
xtrain_count, xvalid_count = tp.count_vectorize(X_train, X_test)
xtrain_tfidf, xvalid_tfidf = tp.word_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram, xvalid_tfidf_ngram = tp.n_gram_TF_IDF_vectorize(X_train, X_test)
xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars = tp.char_TF_IDF_vectorize(X_train, X_test)

### Using Multinomial Naive Bayes for Sentiment Analysis

In [53]:
# Naive Bayes on Count Vectors
NB_cv, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_count, xvalid_count, y_train, y_test)
print("Naive Bayes, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF IDF Vectors
NB_word_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Naive Bayes, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF IDF Vectors
NB_n_gram_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Naive Bayes, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF IDF Vectors
NB_char_tf_idf, accuracy = train_ml_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Naive Bayes, CharLevel Vectors: ", accuracy)

Naive Bayes, Count Vectors:  0.795
Naive Bayes, WordLevel TF-IDF:  0.78
Naive Bayes, N-Gram Vectors:  0.62
Naive Bayes, CharLevel Vectors:  0.76


In [56]:
joblib.dump(NB_cv, '../models/ML/Yelp/NB_cv_amazon.pk1')
joblib.dump(NB_word_tf_idf, '../models/ML/Yelp/NB_word_tf_idf_amazon.pk1')
joblib.dump(NB_n_gram_tf_idf, '../models/ML/Yelp/NB_n_gram_tf_idf_amazon.pk1')
joblib.dump(NB_char_tf_idf, '../models/ML/Yelp/NB_char_tf_idf_amazon.pk1')

['../models/ML/Yelp/NB_char_tf_idf_amazon.pk1']

### Using Random Forest for Sentiment Analysis

In [54]:
# Random Forest on Count Vectors
RF_cv, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators=100), xtrain_count, xvalid_count, y_train, y_test)
print("Random Forest, Count Vectors: ", accuracy)

# Random Forest on Word Level TF IDF Vectors
RF_word_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("Random Forest, WordLevel TF-IDF: ", accuracy)

# Random Forest on Ngram Level TF IDF Vectors
RF_n_gram_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =100), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("Random Forest, N-Gram Vectors: ", accuracy)

# Random Forest on Character Level TF IDF Vectors
RF_char_tf_idf, accuracy = train_ml_model(ensemble.RandomForestClassifier(n_estimators =200), xtrain_tfidf_ngram_chars, xvalid_tfidf_ngram_chars, y_train, y_test)
print("Random Forest, CharLevel Vectors: ", accuracy)

Random Forest, Count Vectors:  0.8
Random Forest, WordLevel TF-IDF:  0.79
Random Forest, N-Gram Vectors:  0.61
Random Forest, CharLevel Vectors:  0.74


In [57]:
joblib.dump(RF_cv, '../models/ML/Yelp/RF_cv_amazon.pk1')
joblib.dump(RF_word_tf_idf, '../models/ML/Yelp/RF_word_tf_idf_amazon.pk1')
joblib.dump(RF_n_gram_tf_idf, '../models/ML/Yelp/RF_n_gram_tf_idf_amazon.pk1')
joblib.dump(RF_char_tf_idf, '../models/ML/Yelp/RF_char_tf_idf_amazon.pk1')

['../models/ML/Yelp/RF_char_tf_idf_amazon.pk1']

### Using LinearSVC for Sentiment Analysis

In [55]:
# LinearSVC on Count Vectors
LinearSVC_cv, accuracy = train_ml_model(LinearSVC(), xtrain_count, xvalid_count, y_train, y_test)
print("LinearSVC, Count Vectors: ", accuracy)

# LinearSVC on Word Level TF IDF Vectors
LinearSVC_word_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf, xvalid_tfidf, y_train, y_test)
print("LinearSVC, WordLevel TF-IDF: ", accuracy)

# LinearSVC on Ngram Level TF IDF Vectors
LinearSVC_n_gram_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram, xvalid_tfidf_ngram, y_train, y_test)
print("LinearSVC, N-Gram Vectors: ", accuracy)

# LinearSVC on Character Level TF IDF Vectors
LinearSVC_char_tf_idf, accuracy = train_ml_model(LinearSVC(), xtrain_tfidf_ngram_chars,  xvalid_tfidf_ngram_chars, y_train, y_test)
print("LinearSVC, CharLevel Vectors: ", accuracy)

LinearSVC, Count Vectors:  0.79
LinearSVC, WordLevel TF-IDF:  0.8
LinearSVC, N-Gram Vectors:  0.65
LinearSVC, CharLevel Vectors:  0.8


In [58]:
joblib.dump(LinearSVC_cv, '../models/ML/Yelp/LinearSVC_cv_amazon.pk1')
joblib.dump(LinearSVC_word_tf_idf, '../models/ML/Yelp/LinearSVC_word_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_n_gram_tf_idf, '../models/ML/Yelp/LinearSVC_n_gram_tf_idf_amazon.pk1')
joblib.dump(LinearSVC_char_tf_idf, '../models/ML/Yelp/LinearSVC_char_tf_idff_amazon.pk1')

['../models/ML/Yelp/LinearSVC_char_tf_idff_amazon.pk1']

### Results:
* For the Yelp reviews dataset too, count vectors and word-level tf-idf give the best results, although for LinearSVC character-level tf-idf ties word-level tf-idf at 0.8.
* Here the three models are nearly tied: LinearSVC and Random Forest both peak at 0.8, with Multinomial Naive Bayes just behind at 0.795.

## Conclusion:
* Count vectors / Bag of Words (BoW) and word-level tf-idf seem to be the best feature engineering (vectorization) methods for sentiment analysis using these machine learning algorithms.
* Multinomial Naive Bayes and LinearSVC outperform Random Forest on the Amazon and IMDB reviews; on the Yelp reviews the best scores of the three models are within 0.005 of each other.
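
To make the comparison easier to check, the accuracies printed by the cells above can be averaged per feature type and per model. The sketch below only rearranges the reported numbers; the dictionary layout and the choice of a plain mean are assumptions made for this summary.

```python
# Test accuracies copied from the cell outputs above, ordered as
# (count, word tf-idf, n-gram tf-idf, char tf-idf).
FEATURES = ("count", "word_tfidf", "ngram_tfidf", "char_tfidf")
reported = {
    "Amazon": {
        "Naive Bayes": (0.82, 0.85, 0.61, 0.80),
        "Random Forest": (0.83, 0.80, 0.645, 0.80),
        "LinearSVC": (0.81, 0.84, 0.715, 0.825),
    },
    "IMDB": {
        "Naive Bayes": (0.82, 0.85, 0.61, 0.80),
        "Random Forest": (0.79, 0.79, 0.535, 0.77),
        "LinearSVC": (0.795, 0.815, 0.61, 0.75),
    },
    "Yelp": {
        "Naive Bayes": (0.795, 0.78, 0.62, 0.76),
        "Random Forest": (0.80, 0.79, 0.61, 0.74),
        "LinearSVC": (0.79, 0.80, 0.65, 0.80),
    },
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Average each feature type over all datasets and models.
feature_means = {
    feature: mean(
        scores[i] for models in reported.values() for scores in models.values()
    )
    for i, feature in enumerate(FEATURES)
}

# Average each model over all datasets and feature types.
model_names = ("Naive Bayes", "Random Forest", "LinearSVC")
model_means = {
    name: mean(score for models in reported.values() for score in models[name])
    for name in model_names
}

for feature, value in sorted(feature_means.items(), key=lambda kv: -kv[1]):
    print(f"{feature:12s} {value:.3f}")
for name, value in sorted(model_means.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {value:.3f}")
```

On these figures, word-level tf-idf (about 0.813) and count vectors (about 0.806) have the highest mean accuracy and word n-gram tf-idf (about 0.623) the lowest. Averaged over all feature types, LinearSVC (about 0.767) edges out Multinomial Naive Bayes (about 0.760), with Random Forest at about 0.742. With only 200 test sentences per dataset, a difference of 0.005 corresponds to a single sentence, so small gaps should be read with caution.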