<a href="https://colab.research.google.com/github/edward2018211/sentiment-analysis-SOS/blob/master/Sentiment_Analysis_Edward.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is a part of the News Sentiment Analysis project and contains snippets of code.

We will first scrape images from Google using a simple script for processing.

Next, we will manually label the scraped images to use as training data for our model in TensorFlow. We could also use data augmentation to obtain more data, but we need to be aware of photos for which augmentation would not make sense.

We will need to do some feature engineering for better prediction accuracy. Our deep neural network will need multiple layers, and we will probably want a larger stride length for a faster model that uses less memory. We will probably also add dropout to combat overfitting.
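
As a rough illustration of why a larger stride gives a smaller, faster model, the sketch below applies the standard output-size formula for a convolution along one dimension. The 224-pixel input and 3-pixel kernel are hypothetical values chosen for the example, not settings from this project.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical 224x224 input, 3x3 kernel, no padding.
for stride in (1, 2, 4):
    side = conv_output_size(224, 3, stride=stride)
    print(f"stride={stride}: {side}x{side} = {side * side} activations per channel")
```

Doubling the stride roughly quarters the number of activations each layer has to store and process, at the cost of spatial detail.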

Once the model is fine-tuned, it will be able to make predictions based on new images (data).

In addition, we would like to analyze the text associated with the news, not only the pictures. The classical approach is bag of words; instead, we plan to use word vectors, which should improve performance, to train our SVM. Below is the setup for our text model; it currently builds TF-IDF features over unigrams and bigrams rather than word vectors.
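
To make the limitation of plain bag of words concrete, here is a minimal sketch using only the standard library. The helper `bag_of_ngrams` is a hypothetical name introduced for this example; it is a simplified stand-in for what a vectorizer with `ngram_range=(1,2)` counts, not the scikit-learn implementation.

```python
from collections import Counter


def bag_of_ngrams(text: str, n_max: int = 1) -> Counter:
    """Count word n-grams of length 1..n_max, ignoring where they occur."""
    tokens = text.lower().split()
    grams = [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]
    return Counter(grams)


a, b = "good movie not bad", "bad movie not good"
# Unigrams only: both sentences contain the same words, so the bags are equal.
print(bag_of_ngrams(a) == bag_of_ngrams(b))
# Unigrams + bigrams: "not bad" and "not good" now tell the sentences apart.
print(bag_of_ngrams(a, n_max=2) == bag_of_ngrams(b, n_max=2))
```

Word order is lost with single words alone, which is one reason to add bigrams or move to richer representations such as word vectors.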

Note: The dataset contains around 1.6 million texts, so loading the data and training the model takes 20 to 30 minutes to complete. Currently we are able to extract information from all 1.6 million training points.

Next Steps:
1. We want to experiment with other ML models and compare their accuracy on the training and testing datasets.

In [2]:
# Download large model
!python -m spacy download en_core_web_lg
!python -m spacy download en

# Download Google API package
!pip install --upgrade pip
!pip install --upgrade google-api-python-client

# Import Libraries
import numpy as np
import nltk
nltk.download('wordnet')
import re
from nltk.stem import WordNetLemmatizer
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import io
import csv

# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

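# Load the default English model via the 'en' shortcut (en_core_web_sm).
# Note: the large model (en_core_web_lg) is the one needed to get word vectors.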
nlp = spacy.load('en')

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9MB)
     |████████████████████████████████| 827.9MB 1.2MB/s
Building wheels for collected packages: en-core-web-lg
  Building wheel for en-core-web-lg (setup.py) ... done
  Created wheel for en-core-web-lg: filename=en_core_web_lg-2.2.5-cp36-none-any.whl size=829180944 sha256=13324076c97eacdfcabdc58cf460896e5905f45f950b6c3571bfd6b08ced39db
  Stored in directory: /tmp/pip-ephem-wheel-cache-f4z3_fqv/wheels/2a/c1/a6/fc7a877b1efca9bc6a089d6f506f16d3868408f9ff89f8dbfc
Successfully built en-core-web-lg
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-2.2.5
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [3]:
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [4]:
# Loading the sentiment data
downloaded = drive.CreateFile({'id': '1YcCIUOA0-5lNI3xKZvi33Vzg3X0lABUh'})
downloaded.GetContentFile('training.1600000.processed.noemoticon.csv') 
sentiment = pd.read_csv('training.1600000.processed.noemoticon.csv', names=['polarity', 'id', 'date', 'query', 'user', 'text'], encoding="latin-1")
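
The file name matches the Sentiment140 training set, in which the `polarity` column uses 0 for negative and 4 for positive tweets. The sketch below parses one row in the same six-column layout and maps the label to 0/1. The sample row is made up for illustration and `to_binary` is a hypothetical helper, not part of the notebook.

```python
import csv
import io

COLUMNS = ['polarity', 'id', 'date', 'query', 'user', 'text']

# Made-up row in the same layout as the CSV (not taken from the data set).
sample = '"4","1","Mon Jan 01 00:00:00 UTC 2020","NO_QUERY","example_user","what a great day"\n'


def to_binary(polarity: str) -> int:
    """Map Sentiment140 polarity (0 = negative, 4 = positive) to 0/1."""
    return 1 if int(polarity) == 4 else 0


row = dict(zip(COLUMNS, next(csv.reader(io.StringIO(sample)))))
print(row['text'], to_binary(row['polarity']))
```

The notebook passes the raw 0/4 labels straight to the classifier, which works because only two distinct classes are present; a 0/1 mapping mainly matters when comparing against other label sources.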

In [5]:
# Shuffle data set
sentiment = sentiment.sample(frac=1)

# Use the full shuffled data set; for a quicker run, use a subset instead:
# sentimentHead = sentiment.head(50000)
sentimentHead = sentiment

In [6]:
# Dictionary mapping emoticons to their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

# List of English stopwords (currently unused: the stopword check below is commented out).
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

# Preprocess data
processedText = []
    
# Create the lemmatizer
wordLemm = WordNetLemmatizer()
    
# Defining regex patterns.
urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
userPattern       = r"@[^\s]+"        # raw string avoids an invalid-escape warning for \s
alphaPattern      = r"[^a-zA-Z0-9]"
sequencePattern   = r"(.)\1\1+"
seqReplacePattern = r"\1\1"
    
for tweet in sentimentHead['text']:
  # Make all tweets lowercase
  tweet = tweet.lower()
        
  # Replace all URLs with 'URL'
  tweet = re.sub(urlPattern,' URL',tweet)

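  # Replace each emoticon with 'EMOJI' + its meaning.
  # Note: the tweet is already lowercased, so keys containing uppercase letters (e.g. ':P') never match.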
  for emoji in emojis.keys():
    tweet = tweet.replace(emoji, "EMOJI" + emojis[emoji])        
    
  # Replace @USERNAME with 'USER'
  tweet = re.sub(userPattern,' USER', tweet)        
    
  # Replace all non-alphanumeric characters with a space
  tweet = re.sub(alphaPattern, " ", tweet)
    
  # Collapse 3 or more consecutive identical characters into 2
  tweet = re.sub(sequencePattern, seqReplacePattern, tweet)

  tweetwords = ''
  for word in tweet.split():
    # Stopword filtering is currently disabled:
    # if word not in stopwordlist:
    # Keep only tokens longer than one character
    if len(word) > 1:
      # Lemmatize word
      word = wordLemm.lemmatize(word)
      tweetwords += (word + ' ')
            
  processedText.append(tweetwords)

print("Preprocessed data")

# Collect the polarity labels as an array for input to the SVM
sentiment_binary = sentimentHead['polarity'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(processedText, sentiment_binary,
                                                    test_size=0.03, random_state=0)

# Vectorize data
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
clearXTrain = X_train

X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)

print("Data vectorized")

# dual=False solves the primal problem, which scikit-learn recommends when
# n_samples > n_features; it also speeds up training here
svc = LinearSVC(random_state=0, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Fitted data")
score = svc.score(X_test, y_test)
print(f"Accuracy: {score * 100:.3f}%")

Preprocessed data
Data vectorized
Fitted data
Accuracy: 81.550%
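
To see what the cleaning steps in the cell above do to a single tweet, here is a small self-contained sketch that reuses the same regular expressions. It is a simplification: the emoticon table is cut down to two entries, lemmatization is skipped so that no NLTK download is needed, and the sample tweet is invented.

```python
import re

URL_PATTERN = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
USER_PATTERN = r"@[^\s]+"
ALPHA_PATTERN = r"[^a-zA-Z0-9]"
SEQUENCE_PATTERN = r"(.)\1\1+"
SEQ_REPLACE_PATTERN = r"\1\1"
EMOJIS = {':)': 'smile', ':(': 'sad'}  # reduced table, for illustration only


def preprocess(tweet: str) -> str:
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, ' URL', tweet)         # URLs -> 'URL'
    for emoji, meaning in EMOJIS.items():              # emoticons -> 'EMOJI<meaning>'
        tweet = tweet.replace(emoji, 'EMOJI' + meaning)
    tweet = re.sub(USER_PATTERN, ' USER', tweet)       # @mentions -> 'USER'
    tweet = re.sub(ALPHA_PATTERN, ' ', tweet)          # punctuation -> space
    tweet = re.sub(SEQUENCE_PATTERN, SEQ_REPLACE_PATTERN, tweet)  # 'sooooo' -> 'soo'
    # Drop single-character tokens, as the notebook does (lemmatization omitted here).
    return ' '.join(word for word in tweet.split() if len(word) > 1)


print(preprocess("@bob Sooooo happy!!! check http://example.com :)"))
# prints: USER soo happy check URL EMOJIsmile
```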


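The next step in the process will be to make predictions on unseen data. To prepare for that, the code below saves the trained model together with the raw training text, the training labels, and the test accuracy.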

In [7]:
import pickle
tuple_objects = (svc, clearXTrain, y_train, score)

# Save the tuple; the context manager makes sure the file is closed
with open("text_model.pkl", 'wb') as f:
    pickle.dump(tuple_objects, f)
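
A later session would need to read the tuple back before it can predict anything. The sketch below shows the round trip with placeholder values standing in for `svc`, `clearXTrain`, `y_train`, and `score`; the temporary path and the placeholder contents are assumptions made for the example.

```python
import os
import pickle
import tempfile

# Placeholders for (svc, clearXTrain, y_train, score); the real objects come from the notebook.
tuple_objects = ("model-placeholder", ["good day", "bad day"], [4, 0], 0.8155)

path = os.path.join(tempfile.mkdtemp(), "text_model.pkl")
with open(path, "wb") as f:
    pickle.dump(tuple_objects, f)

# Unpack in the same order the tuple was saved.
with open(path, "rb") as f:
    model, train_text, train_labels, score = pickle.load(f)

print(f"Restored {len(train_text)} training texts, saved accuracy {score:.4f}")
```

Note that the fitted `vectoriser` is not part of the saved tuple, so new text would presumably have to be transformed by a `TfidfVectorizer` refit on the restored training text with the same settings, or the vectoriser could be pickled alongside the model.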

Separately, we can also use Google Cloud to analyze sentiment.

In [None]:
from google.cloud import language_v1
from google.cloud.language_v1 import enums
from googleapiclient.discovery import build
import getpass
import os

# Set Google credentials
# Note: Make sure to upload the json file into Colab
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]="/news-sentiment-analysis-283320-5af01bb429bf.json"


def analyze_sentiment(text_content):
    """
    Analyze the sentiment of a string.

    Args:
      text_content: The text content to analyze.
    """

    client = language_v1.LanguageServiceClient()

    # text_content = 'I am so happy and joyful.'

    # Available types: PLAIN_TEXT, HTML
    type_ = enums.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = enums.EncodingType.UTF8

    response = client.analyze_sentiment(document, encoding_type=encoding_type)
    # Get overall sentiment of the input document
    print(u"Document sentiment score: {}".format(response.document_sentiment.score))
    print(
        u"Document sentiment magnitude: {}".format(
            response.document_sentiment.magnitude
        )
    )
    # Get sentiment for all sentences in the document
    for sentence in response.sentences:
        print(u"Sentence text: {}".format(sentence.text.content))
        print(u"Sentence sentiment score: {}".format(sentence.sentiment.score))
        print(u"Sentence sentiment magnitude: {}".format(sentence.sentiment.magnitude))

    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    print(u"Language of the text: {}".format(response.language))

APIKEY = getpass.getpass()

lservice = build('language', 'v1beta1', developerKey=APIKEY)
sentimentEmotion = []
# Label only a small sample (the first 600 texts): every text costs one API request
shortText = processedText[:600]

for quote in shortText:
  response = lservice.documents().analyzeSentiment(
    body={
      'document': {
         'type': 'PLAIN_TEXT',
         'content': quote
      }
    }).execute()
  polarity = 1 if response['documentSentiment']['polarity'] == 1 else 0
  sentimentEmotion.append(polarity)
  #magnitude = response['documentSentiment']['magnitude']
  #print('POLARITY=%s MAGNITUDE=%s for %s' % (polarity, magnitude, quote))

X_train, X_test, y_train, y_test = train_test_split(shortText, sentimentEmotion,
                                                    test_size=0.03, random_state=0)

# Vectorize data
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)

X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)

print("Data vectorized")

# dual=False solves the primal problem, which scikit-learn recommends when
# n_samples > n_features
svc = LinearSVC(random_state=0, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Fitted data")
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

··········


Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/googleapiclient/discovery_cache/__init__.py", line 36, in autodetect
    try:
ModuleNotFoundError: No module named 'google.appengine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 33, in <module>
    from oauth2client.contrib.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.contrib.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 37, in <module>
    from oauth2client.locked_file import LockedFile
ModuleNotFoundError: No module named 'oauth2client.locked_file'

During handling of the above exception, another exception occurred:

Traceback (most recent c

HttpError: ignored
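
The labeling loop in the cell above marks a text as positive only when the returned polarity is exactly 1. Sentiment scores from the Natural Language API are continuous values between -1.0 and 1.0, so a threshold may be a more forgiving rule. The sketch below is one possible mapping; `score_to_label` and the default threshold of 0.0 are assumptions for illustration, not part of the notebook or the API.

```python
def score_to_label(score: float, threshold: float = 0.0) -> int:
    """Return 1 (positive) if the score is above the threshold, else 0."""
    return 1 if score > threshold else 0


# Hypothetical scores, as might be read from a sentiment response.
for s in (0.8, 0.1, 0.0, -0.4):
    print(s, score_to_label(s))
```

Raising the threshold trades recall for precision on the positive class, which may be worth tuning against the hand-labeled polarity column.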