# **MIS 515 HOMEWORK 5/6: ONLINE REVIEW CLASSIFICATION**
Your assignment is to create a tool that trains several machine learning models to classify online reviews. Some of these online reviews refer to hazardous products, so these machine learning models will help to identify the most serious product complaints.

The dataset is available at https://dgoldberg.sdsu.edu/515/appliance_reviews.json and contains approximately 1,000 reviews, approximately half of which refer to safety hazards. The data is formatted as a JSON array.

The purpose of the machine learning models is to predict the “Safety hazard” field, which is already formatted as a 0 or 1. A value of 1 indicates that the review refers to a safety hazard; a value of 0 indicates that the review does not refer to a safety hazard. However, to transform the reviews into a format usable by the machine learning models, perform the following steps:

- Throughout the problem, ensure that you handle case-sensitivity (for example, by converting all reviews to lowercase).
- Next, create a list of all the *unique* words in the dataset. For example, the word “plastic” occurs multiple times throughout the dataset. However, this is only one unique word, so only append it to your list one time.
- The dataset consists of many words, so the next step is to narrow down which words are relevant to the classification problem (otherwise, the machine learning models may have too many variables to consider and run very slowly). To do so, generate a “relevance score” for each word by first computing totals of A, B, C, and D.
- Next, create a 2D list to train the machine learning models based on the relevant words from the previous step. If a review contains a given word, then use a value of 1, and if not, then use a value of 0. For example, suppose that the relevant words are [“dangerous”, “hazard”, “broken”] and that you are considering the review “the product was dangerous and scary.” This review should be treated as [1, 0, 0] because it contains the word “dangerous” but does not contain the words “hazard” or “broken.”
- Finally, train decision tree, k-nearest neighbors, and neural network machine learning models. You may choose your own training-test split. Report the accuracy values from all three machine learning models and save a joblib file from the most accurate model.
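
The 0/1 encoding described in the steps above can be illustrated with a minimal sketch. The helper name `encode_review` is hypothetical and not part of the assignment, and this version matches whole tokens after stripping basic punctuation, which is an assumption; the notebook cells below use a plain substring test instead.

```python
def encode_review(review, relevant_words):
    """Return a 0/1 list marking which relevant words occur in the review."""
    # Lowercase first, then strip basic punctuation so "scary." matches "scary".
    tokens = {token.strip('.,!?;:"\'()') for token in review.lower().split()}
    return [1 if word in tokens else 0 for word in relevant_words]


# The example from the prompt: only "dangerous" appears in the review.
relevant_words = ["dangerous", "hazard", "broken"]
row = encode_review("The product was dangerous and scary.", relevant_words)
print(row)  # [1, 0, 0]
```

Applying the same helper to every review would produce one row per review, which together form the 2D list passed to the models.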


In [None]:
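#Calculate Relevance Score Helper Function
import math  # imported here so this cell runs on its own

def relevanceScore(a,b,c,d):
  # a, b, c, d are the per-word review counts tallied in the relevance-score cell below
  radicand1 = a + b + c + d
  radicand2 = (a + b) * (c + d)
  top = math.sqrt(radicand1) * ((a*d) - (c*b))
  bottom = math.sqrt(radicand2)
  # Avoid dividing by zero when either (a + b) or (c + d) is zero
  if bottom == 0:
    return 0
  return top / bottom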
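
A small worked example may help show how the score behaves. The sketch below repeats the same arithmetic under the hypothetical name `relevance_score` and feeds it made-up counts (not taken from the dataset), reading `a`, `b`, `c`, `d` the way the counting loop later in this notebook tallies them: reviews that contain the word and are hazards, contain it and are not, lack it and are hazards, and lack it and are not.

```python
import math


def relevance_score(a, b, c, d):
    # Same arithmetic as the relevanceScore helper in the notebook.
    bottom = math.sqrt((a + b) * (c + d))
    if bottom == 0:
        return 0
    return math.sqrt(a + b + c + d) * (a * d - c * b) / bottom


# Made-up counts for 100 reviews.
# A word concentrated in hazard reviews gets a large positive score:
strong = relevance_score(40, 10, 10, 40)   # 10 * 1500 / 50
# A word spread evenly across both classes scores zero:
neutral = relevance_score(25, 25, 25, 25)  # a*d - c*b == 0
# A word that never appears makes the denominator zero, so the guard returns 0:
absent = relevance_score(0, 0, 50, 50)

print(strong, neutral, absent)
```

Because the numerator grows with the product of the counts, scores on a dataset of about 1,000 reviews are much larger than in this toy case, which is why the cut-off used later in the notebook is in the thousands.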


In [None]:
import json,joblib,google.colab.files, requests, numpy as np, math, sklearn.neighbors, sklearn.neural_network, sklearn.metrics, sklearn.model_selection, sklearn.tree
from textblob import TextBlob
import nltk
nltk.download('punkt')

response = requests.get('https://dgoldberg.sdsu.edu/515/appliance_reviews.json')

# Parse the JSON only after confirming the request succeeded
if response:
  data = json.loads(response.text)
  print(json.dumps(data, indent=4))
else:
  print('Sorry, could not connect.')

In [None]:
#Separate data
print('Loading data...')
reviews = []
stars = []
hazard = []
x = []
y = []
for record in data:
  reviews.append(record['Review'].lower())  # lowercase to handle case-sensitivity
  stars.append(record['Stars'])
  hazard.append(record['Safety hazard'])
  y.append(record['Safety hazard'])  # labels for the models (same values as hazard)


Loading data...


In [None]:
#Create List of Unique Words
unique_words = []
seen_words = set()  # set membership checks are much faster than scanning the list
print('Identifying unique words...')
for line in reviews:
  blob = TextBlob(line)
  text = blob.words
  for word in text:
    if word not in seen_words:
      seen_words.add(word)
      unique_words.append(word)

Identifying unique words...


In [None]:
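#create relevant words list
# For each word, tally four review counts:
#   a: contains the word, hazard == 1    b: contains the word, hazard == 0
#   c: lacks the word, hazard == 1       d: lacks the word, hazard == 0
# Note: `word in reviews[i]` is a substring test, so "hot" also matches "photo".
relevant_words = []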
print('Generating relevance scores...')
for word in unique_words:
  a = 0
  b = 0
  c = 0
  d = 0
  for i in range(len(reviews)):
    if word in reviews[i] and hazard[i] == 1:
      a += 1
    elif word in reviews[i] and hazard[i] == 0:
      b += 1
    elif word not in reviews[i] and hazard[i] == 1:
      c += 1
    elif word not in reviews[i] and hazard[i] == 0:
      d += 1
  if relevanceScore(a,b,c,d) > 4000:
    relevant_words.append(word)

Generating relevance scores...


In [None]:
#create the 2D list from 3rd page of prompt
print('Formatting 2D list...')
for review in reviews:
  temp_list =[]
  for word in relevant_words:
    if word in review:
      temp_list.append(1)
    else:
      temp_list.append(0)
  x.append(temp_list)

Formatting 2D list...


In [None]:
#split data into training and testing data
print('Training machine learning models...')
# No random_state is set, so the split (and the accuracies below) vary between runs
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.33)
# Decision tree
dt_clf = sklearn.tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(x_train, y_train)
dt_predictions = dt_clf.predict(x_test)
dt_accuracy = sklearn.metrics.accuracy_score(y_test, dt_predictions)
print("DT accuracy:", dt_accuracy)
# KNN
knn_clf = sklearn.neighbors.KNeighborsClassifier(5)
knn_clf = knn_clf.fit(x_train, y_train)
knn_predictions = knn_clf.predict(x_test)
knn_accuracy = sklearn.metrics.accuracy_score(y_test, knn_predictions)
print("KNN accuracy:", knn_accuracy)

# Neural network
nn_clf = sklearn.neural_network.MLPClassifier()
nn_clf = nn_clf.fit(x_train, y_train)
nn_predictions = nn_clf.predict(x_test)
nn_accuracy = sklearn.metrics.accuracy_score(y_test, nn_predictions)
print("NN accuracy:", nn_accuracy)


Training machine learning models...
DT accuracy: 0.8333333333333334
KNN accuracy: 0.803030303030303
NN accuracy: 0.8575757575757575




In [None]:
#Find highest accuracy score and save joblib file
acc_scores = [dt_accuracy, knn_accuracy, nn_accuracy]
highest = max(acc_scores)
# Save under the filename that the printed message reports
if highest == acc_scores[0]:
  print('Decision Tree model performed best; saved to model.joblib.')
  joblib.dump(dt_clf, "model.joblib")
  google.colab.files.download("model.joblib")
elif highest == acc_scores[1]:
  print('K-Nearest Neighbors model performed best; saved to model.joblib.')
  joblib.dump(knn_clf, "model.joblib")
  google.colab.files.download("model.joblib")
elif highest == acc_scores[2]:
  print('Neural network model performed best; saved to model.joblib.')
  joblib.dump(nn_clf, "model.joblib")
  google.colab.files.download("model.joblib")

Neural network model performed best; saved to model.joblib.
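
The selection step can also be written more compactly by keying the accuracies by model name. The sketch below is one possible alternative, not the notebook's method; the dictionary keys are hypothetical labels, and the values are copied from the run shown earlier (they will differ on another run because the split is not seeded).

```python
# Accuracy values copied from the printed output of the training cell.
accuracies = {
    "dt": 0.8333333333333334,
    "knn": 0.803030303030303,
    "nn": 0.8575757575757575,
}

# max() over the keys, ranked by their accuracy value, returns the best label.
best_name = max(accuracies, key=accuracies.get)
print(best_name, accuracies[best_name])
```

With a second dictionary mapping the same labels to the fitted classifiers, the best model could then be passed to `joblib.dump` in a single call rather than in three branches.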

