This notebook aggregates all the machine learning results.

<h3>Import libraries</h3>

In [None]:
import numpy as np
import pandas as pd

# Visualizations
import seaborn as sns 
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn.metrics import accuracy_score

# Text processing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# ML algorithms
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
# from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
# from xgboost import XGBClassifier

<h3>Load data</h3>


In [None]:
path_data = '../../data/yelp_academic_dataset_review.pickle'
data = pd.read_pickle(path_data)

In [None]:
# Replace newline characters with spaces so words on adjacent lines are not glued together
data['text'] = data['text'].str.replace('\n', ' ', regex=False)

# Taking only text and stars columns
data = data.loc[:, ['text', 'stars']]

<h3>Text representation</h3>

Classifiers and learning algorithms cannot process text documents directly in their original form: most expect fixed-size numerical feature vectors rather than raw text of variable length. During preprocessing, the texts are therefore converted to a numerical representation: word counts first, then TF-IDF weights.
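As a minimal sketch of this conversion, the cell below runs `CountVectorizer` and `TfidfTransformer` on three made-up reviews of different lengths. The `toy_*` names and the review texts are hypothetical and are not taken from the Yelp data. Every review ends up as a row with the same number of columns, one per vocabulary term.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy reviews of different lengths (not from the Yelp data)
toy_reviews = [
    "great food",
    "great service and great food",
    "terrible service",
]

# Step 1: learn the vocabulary and count term occurrences per review
toy_count_vectorizer = CountVectorizer()
toy_counts = toy_count_vectorizer.fit_transform(toy_reviews)

# Step 2: reweight the raw counts with TF-IDF (rows are L2-normalized by default)
toy_tfidf_transformer = TfidfTransformer()
toy_tfidf = toy_tfidf_transformer.fit_transform(toy_counts)

# Columns follow the alphabetical order of the learned vocabulary
toy_vocabulary = sorted(toy_count_vectorizer.vocabulary_)
print(toy_vocabulary)
print(toy_counts.toarray())
print(toy_tfidf.shape)
```

Whatever the length of a review, its row has one entry per vocabulary term, which is the fixed-size input the classifiers expect.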

In [None]:
X = data["text"].tolist()
y = data["stars"].tolist()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

def count_vectorize(data):
    count_vectorizer = CountVectorizer()
    
    embedding = count_vectorizer.fit_transform(data)
    
    return embedding, count_vectorizer

def tfidf_transform(data):
    tfidf_transformer = TfidfTransformer()
    
    text_freq = tfidf_transformer.fit_transform(data)
    
    return text_freq, tfidf_transformer

X_train_counts, count_vectorizer = count_vectorize(X_train)
X_test_counts = count_vectorizer.transform(X_test)

X_train_tfidf, tfidf_transformer = tfidf_transform(X_train_counts)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

<h3 style="text-align:center">Evaluate Algorithms: Baseline</h3>

Our test harness uses 5-fold cross-validation and evaluates each algorithm with the accuracy metric.
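To make the harness concrete, here is a small sketch on synthetic data; it assumes stratified folds. The `demo_*` names, the random count features and the choice of `MultinomialNB` are illustrative assumptions only. It shows that each of the 5 test folds keeps the class balance and that `cross_val_score` returns one accuracy value per fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data: 100 samples, 20 non-negative count features,
# and 5 balanced classes playing the role of star ratings 1..5
demo_rng = np.random.RandomState(42)
X_demo = demo_rng.randint(0, 5, size=(100, 20))
y_demo = np.repeat([1, 2, 3, 4, 5], 20)

# shuffle=True is needed for random_state to have any effect
demo_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in every test fold
demo_fold_class_counts = [
    np.bincount(y_demo[test_idx], minlength=6)[1:].tolist()
    for _, test_idx in demo_kfold.split(X_demo, y_demo)
]
print(demo_fold_class_counts)

# One accuracy score per fold
demo_scores = cross_val_score(MultinomialNB(), X_demo, y_demo,
                              cv=demo_kfold, scoring='accuracy')
print("%f (%f)" % (demo_scores.mean(), demo_scores.std()))
```

Because the features here are random, the accuracy itself is meaningless; only the mechanics of the folds matter.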

In [None]:
# Test options and evaluation metric
num_folds = 5
seed = 42
scoring = 'accuracy'

<i style="color:MediumSlateBlue;"> Let's create a performance baseline for this problem by spot-checking several different
algorithms.
</i>

In [None]:
# Spot check algorithms
## TODO : fill hyperparameters of the algorithms 

models = []

models.append( ('NB', MultinomialNB()) )
# n_iter was removed from SGDClassifier; max_iter=5 with tol=None runs exactly 5 epochs.
# All other arguments are left at their defaults.
models.append( ('SGD', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-7,
                                     max_iter=5, tol=None, random_state=42)) )
models.append( ('CART', DecisionTreeClassifier()) )
models.append( ('GBM', GradientBoostingClassifier()) )
models.append( ('RF', RandomForestClassifier(n_estimators=100)) )


We display the mean and standard deviation of accuracy for each algorithm as it is computed, and collect the results for later use.

In [None]:
results = []
names = []
for name, model in models:
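    # shuffle=True is required for random_state to take effect
    kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)
    # Cross-validate on the TF-IDF features: the models cannot consume raw text
    cv_results = cross_val_score(model, X_train_tfidf, y_train, cv=kfold, scoring=scoring)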
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std() )
    print(msg)
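The TF-IDF features in this notebook are fitted once on the whole training split, so every cross-validation fold shares the same vocabulary and document frequencies. One possible refinement, shown here only as a sketch, is to wrap the vectorizer, the transformer and the classifier in a scikit-learn `Pipeline`. The text representation is then refitted inside each fold and `cross_val_score` can take raw text. The toy corpus, the `toy_*` names and `text_clf` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Yelp reviews
toy_texts = [
    "great food and friendly staff", "loved the pizza", "excellent service",
    "amazing place, will return", "tasty and cheap",
    "terrible food", "rude staff and slow service", "never coming back",
    "cold pizza, awful", "overpriced and bland",
]
toy_stars = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]

# The vectorizer and transformer are refitted on the training part of each fold
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

toy_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy_scores = cross_val_score(text_clf, toy_texts, toy_stars,
                             cv=toy_kfold, scoring="accuracy")
print("NB pipeline: %f (%f)" % (toy_scores.mean(), toy_scores.std()))
```

The trade-off is runtime: the vocabulary is rebuilt once per fold and per model, which can be noticeable on a corpus as large as the Yelp reviews.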

In [None]:
# One box per algorithm, built from its per-fold accuracy scores
fig, ax = plt.subplots(figsize=(16, 12))
sns.boxplot(data=results, ax=ax)
ax.set_title('Algorithm Comparison', fontsize=12)
ax.set_xticklabels(names)
ax.set_ylabel('Accuracy')
plt.show()