# **Milestone 5:**

Final analysis:

*   distribution of the three categories
*   bigrams of the corpus
*   lists of key phrases
*   scatter plots of the three categories



### **Setting up the environment**

In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


Installing requirements

In [2]:
%%capture
!pip install --upgrade transformers  # make sure compatible with tokenizers
!wget https://raw.githubusercontent.com/crow-intelligence/growth-hacking-sentiment/master/requirements.txt
!pip install -r requirements.txt

Installing apex

In [3]:
%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

Overwriting setup.sh

In [4]:
%%capture
!sh setup.sh

### **Importing the required modules**

In [12]:
# importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import random

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

from simpletransformers.classification import ClassificationModel

from keyness import log_likelihood

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### **Making predictions with the saved model on the big corpus text file**

Loading the reviews from the big corpus

In [13]:
# importing and reading txt from drive
with open("/content/drive/MyDrive/Sentiment Analysis for Marketing/data/reviews_without_ratings.txt", "r") as f:
    reviews = f.read().split("\n")

print(len(reviews))

326170


In [14]:
# drawing a random sample of 150,000 reviews (no seed is set, so the sample differs between runs)
reviews = random.sample(reviews, 150000)
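
The sampling step above can be made repeatable. The following is a minimal sketch, not part of the original notebook: the helper name `sample_reviews`, the seed value, and the decision to drop blank lines (which `split("\n")` can produce) are all assumptions made for illustration.

```python
import random


def sample_reviews(lines, k, seed=42):
    """Drop blank lines, then draw up to k items without replacement.

    `seed` is an arbitrary illustrative value; a dedicated Random instance
    keeps the global random state untouched.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    rng = random.Random(seed)
    return rng.sample(cleaned, min(k, len(cleaned)))


# Toy data standing in for the review lines read from the text file.
toy = ["great product", "", "arrived late", "   ", "works fine"]
subset = sample_reviews(toy, k=2)
```

With a fixed seed, the same input list yields the same subset on every run, which makes the downstream predictions comparable between sessions.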

Loading the model

In [15]:
# loading the previously saved model from drive
model2 = ClassificationModel(
    model_type="distilbert",
    model_name= "/content/drive/MyDrive/Sentiment Analysis for Marketing/outputs_2/best_model",
    use_cuda=True,
    num_labels=3,
    args={
        "output_dir": "/content/drive/MyDrive/Sentiment Analysis for Marketing/outputs_2/best_model",
        "reprocess_input_data": True,
        "sliding_window": True,
        "max_seq_length": 512,
    },
)

Making predictions

In [16]:
# predicting results for all reviews
preds = model2.predict(reviews)

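# predict() returns the predicted labels and the raw model outputs
# (unnormalized scores, despite the variable name below)
predicted_class, predicted_probas = preds[0], preds[1]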

  0%|          | 0/150000 [00:00<?, ?it/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (535 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (519 > 512). Running this sequence through the model will result in indexing errors


  0%|          | 0/7612 [00:00<?, ?it/s]
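
The second value returned by `predict` holds raw model outputs rather than probabilities. If probabilities are needed later, one option is to apply a softmax. The sketch below assumes that, with `sliding_window` enabled, each review comes back as an array with one row of scores per window; the helper names are hypothetical, and averaging over windows is only one possible way to combine them, which may differ from how the library picks its final label.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def review_probabilities(raw_output):
    """Average the per-window softmax scores of a single review.

    Accepts shape (num_labels,) or (n_windows, num_labels).
    """
    windows = np.atleast_2d(np.asarray(raw_output, dtype=float))
    return softmax(windows).mean(axis=0)


# Toy raw output for one review split into two windows, three classes.
toy_output = [[2.0, 0.5, -1.0], [1.0, 1.0, 0.0]]
probs = review_probabilities(toy_output)
```

Applied to the real predictions, this would be called once per entry of `predicted_probas`, giving one probability vector of length three per review.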

In [17]:
# saving the results to pickle (pickle is already imported above)
with open("/content/drive/MyDrive/Sentiment Analysis for Marketing/predicted_class.pkl", "wb") as outfile:
    pickle.dump(predicted_class, outfile)

with open("/content/drive/MyDrive/Sentiment Analysis for Marketing/predicted_probs.pkl", "wb") as outfile:
    pickle.dump(predicted_probas, outfile)
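
The saved labels feed the first item of the final analysis, the distribution of the three categories. A minimal sketch of that step follows, assuming the pickled object is a flat sequence of integer class ids; the helper names and the toy labels are illustrative, and the mapping from ids to sentiment names is not given in this section.

```python
import os
import pickle
import tempfile
from collections import Counter


def load_pickle(path):
    """Read back an object written with pickle.dump."""
    with open(path, "rb") as infile:
        return pickle.load(infile)


def class_distribution(labels):
    """Return {class_id: (count, share)} ordered by class id."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}


# Round trip through a temporary file, standing in for the Drive path.
toy_labels = [0, 2, 2, 1, 2, 0]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "predicted_class.pkl")
    with open(path, "wb") as outfile:
        pickle.dump(toy_labels, outfile)
    restored = load_pickle(path)

distribution = class_distribution(restored)
```

On the real data, `load_pickle` would be pointed at the `predicted_class.pkl` file written above, and the resulting counts could be passed to a bar plot.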