<a href="https://colab.research.google.com/github/edward2018211/sentiment-analysis-SOS/blob/master/Sentiment_Analysis_Edward.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the News Sentiment Analysis project and contains code snippets. The initial setup is below.

In [None]:
# Download dependencies
!pip install ImageScraper # https://pypi.org/project/ImageScraper/

# Import libraries
import tensorflow as tf
import image_scraper as imagescraper

Collecting ImageScraper
  Downloading https://files.pythonhosted.org/packages/43/4b/e1e2af3b0892cf55fe4db06ce16e6d1d41c6bb95ae208b52109316da948c/ImageScraper-2.0.7-py2.py3-none-any.whl
Collecting setproctitle>=1.1.8
  Downloading https://files.pythonhosted.org/packages/5a/0d/dc0d2234aacba6cf1a729964383e3452c52096dc695581248b548786f2b3/setproctitle-1.1.10.tar.gz
Collecting SimplePool
  Downloading https://files.pythonhosted.org/packages/5f/05/1caf229f0baccbbc01978b4c77269e602125815403a7fb1079e63b83be05/SimplePool-0.1.tar.gz
Building wheels for collected packages: setproctitle, SimplePool
  Building wheel for setproctitle (setup.py) ... done
  Created wheel for setproctitle: filename=setproctitle-1.1.10-cp36-cp36m-linux_x86_64.whl size=33908 sha256=ded2ca8f7dea86d30e1907038fcb45814c1458e3ff417d598bcd73f807a6150f
  Stored in directory: /root/.cache/pip/wheels/e6/b1/a6/9719530228e258eba904501fef99d5d85c80d52bd8f14438a3
  Building wheel for SimplePool (setup.py) ...

We will first use a simple script to scrape images from Google for processing.

Next, we will manually label the scraped images to use as training data for our model in TensorFlow. We could also use data augmentation to generate more data, but we need to be aware of photos for which augmentation wouldn't make sense.
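As a rough illustration of the augmentation caveat above, the sketch below mirrors a tiny image horizontally. Nested Python lists stand in for pixel arrays, and `hflip`, `augment`, and the `flippable` flag are hypothetical names invented for this example, not part of the project; a real pipeline would more likely rely on TensorFlow's image utilities.

```python
def hflip(image):
    """Mirror an image left-to-right; `image` is a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment(samples):
    """Return the original samples plus mirrored copies of the flippable ones.

    Each sample is an (image, label, flippable) tuple. The `flippable` flag is
    a hypothetical manual annotation: mirroring a photo that contains text,
    such as a headline, yields an image that would not occur in practice.
    """
    augmented = []
    for image, label, flippable in samples:
        augmented.append((image, label))
        if flippable:
            augmented.append((hflip(image), label))
    return augmented


# Two toy 2x3 grayscale "images"
crowd = [[0, 1, 2], [3, 4, 5]]
headline = [[9, 8, 7], [6, 5, 4]]

data = augment([(crowd, "positive", True), (headline, "negative", False)])
print(len(data))  # 3: the headline image is not mirrored
```

Marking images by hand is only one option; the point is that augmentation should be applied selectively rather than to every scraped photo.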

We will need to do some feature engineering for better prediction accuracy. Our deep neural network will need multiple layers, and we'll probably want to increase the stride length for a faster model that uses less memory. We will probably also add dropout to combat overfitting.
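To make the stride trade-off concrete, the sketch below applies the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, along one spatial dimension. The 224x224 input and 3x3 kernel are hypothetical numbers chosen for illustration, not values from this project.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical setup: a 224x224 input and a 3x3 kernel with no padding
for stride in (1, 2):
    side = conv_output_size(224, kernel_size=3, stride=stride)
    print(f"stride={stride}: {side}x{side} = {side * side} activations per filter")
# stride=1: 222x222 = 49284 activations per filter
# stride=2: 111x111 = 12321 activations per filter
```

Doubling the stride cuts the number of activations per filter to roughly a quarter, which is where the speed and memory savings come from, at the cost of spatial resolution.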

Once the model is fine-tuned, it will be able to make predictions based on new images (data).

In addition, we would like to analyze the text associated with the news, not just the pictures. The classical approach is bag of words; instead, we plan to use word vectors, which should improve performance, to train our SVM. Below is the setup for our text model. The current code builds TF-IDF features over unigrams and bigrams.
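The difference between the two representations can be sketched with a toy example. The three-dimensional vectors below are invented for illustration (the vectors in spaCy's `en_core_web_lg` have 300 dimensions), and `bag_of_words` and `mean_vector` are hypothetical helpers rather than code from this project.

```python
from collections import Counter

# Hypothetical 3-dimensional word vectors, invented for this example
toy_vectors = {
    "great": [0.9, 0.1, 0.0],
    "good": [0.8, 0.2, 0.0],
    "awful": [-0.9, 0.1, 0.0],
    "news": [0.0, 0.0, 1.0],
}
vocabulary = sorted(toy_vectors)  # ['awful', 'good', 'great', 'news']


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def mean_vector(text, vectors):
    """Average the vectors of the known words in the text."""
    known = [vectors[word] for word in text.lower().split() if word in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


for text in ("great news", "good news", "awful news"):
    print(text, bag_of_words(text, vocabulary), mean_vector(text, toy_vectors))
```

In the bag-of-words representation, "great news" differs from "good news" in exactly as many positions as it differs from "awful news". The averaged vectors, by contrast, place the two positive phrases close together and the negative one far away, which is the property we hope an SVM can exploit.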

Note: The dataset contains around 1.6 million texts, so loading the data and training the model will take 20 to 30 minutes when this project is completed. Currently we are able to extract information from all 1.6 million training points.

Next Steps:
1. We want to experiment with other ML models and compare their accuracy on the training and testing datasets.

In [39]:
# Download the large spaCy model and the 'en' shortcut (links to en_core_web_sm)
!python -m spacy download en_core_web_lg
!python -m spacy download en

# Import Libraries
import numpy as np
import nltk
nltk.download('wordnet')
import re
from nltk.stem import WordNetLemmatizer
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from google.colab import files
import io
import csv

# Load the large model, which includes the word vectors
nlp = spacy.load('en_core_web_lg')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
# Loading the sentiment data
uploaded = files.upload()
sentiment = pd.read_csv(io.BytesIO(uploaded['training.1600000.processed.noemoticon.csv']), names=['polarity', 'id', 'date', 'query', 'user', 'text'], encoding="latin-1")

Saving training.1600000.processed.noemoticon.csv to training.1600000.processed.noemoticon (1).csv


In [61]:
# Shuffle data set
sentiment = sentiment.sample(frac=1)

# Use the full data set; for quicker experiments, take a subset instead,
# e.g. sentimentHead = sentiment.head(50000)
sentimentHead = sentiment

In [64]:
# Dictionary mapping emoticons to their meanings.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-E': 'vampire', ':(': 'sad', 
          ':-(': 'sad', ':-<': 'sad', ':P': 'raspberry', ':O': 'surprised',
          ':-@': 'shocked', ':@': 'shocked',':-$': 'confused', ':\\': 'annoyed', 
          ':#': 'mute', ':X': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':-D': 'smile', ':-0': 'yell', 'O.o': 'confused',
          '<(-_-)>': 'robot', 'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', 
          ';-)': 'wink', 'O:-)': 'angel','O*-)': 'angel','(:-D': 'gossip', '=^.^=': 'cat'}

# List of English stopwords (currently unused: the stopword check below is commented out).
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
             'and','any','are', 'as', 'at', 'be', 'because', 'been', 'before',
             'being', 'below', 'between','both', 'by', 'can', 'd', 'did', 'do',
             'does', 'doing', 'down', 'during', 'each','few', 'for', 'from', 
             'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
             'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
             'into','is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
             'me', 'more', 'most','my', 'myself', 'now', 'o', 'of', 'on', 'once',
             'only', 'or', 'other', 'our', 'ours','ourselves', 'out', 'own', 're',
             's', 'same', 'she', "shes", 'should', "shouldve",'so', 'some', 'such',
             't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
             'themselves', 'then', 'there', 'these', 'they', 'this', 'those', 
             'through', 'to', 'too','under', 'until', 'up', 've', 'very', 'was',
             'we', 'were', 'what', 'when', 'where','which','while', 'who', 'whom',
             'why', 'will', 'with', 'won', 'y', 'you', "youd","youll", "youre",
             "youve", 'your', 'yours', 'yourself', 'yourselves']

# Preprocess data
processedText = []
    
# Create the lemmatizer
wordLemm = WordNetLemmatizer()
    
# Defining regex patterns.
urlPattern        = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
userPattern       = r"@[^\s]+"
alphaPattern      = "[^a-zA-Z0-9]"
sequencePattern   = r"(.)\1\1+"
seqReplacePattern = r"\1\1"
    
for tweet in sentimentHead['text']:
  # Make all tweets lowercase
  tweet = tweet.lower()
        
  # Replace all URLs with 'URL'
  tweet = re.sub(urlPattern,' URL',tweet)

  # Replace all emoticons; keys are lowercased because the tweet was lowercased above
  for emoji, meaning in emojis.items():
    tweet = tweet.replace(emoji.lower(), "EMOJI" + meaning)
    
  # Replace @USERNAME with 'USER'
  tweet = re.sub(userPattern,' USER', tweet)        
    
  # Replace all non-alphanumeric characters with spaces
  tweet = re.sub(alphaPattern, " ", tweet)
    
  # Collapse runs of 3 or more identical characters to 2
  tweet = re.sub(sequencePattern, seqReplacePattern, tweet)

  tweetwords = ''
  for word in tweet.split():
    # Checking if the word is a stopword
    # if word not in stopwordlist:
    if len(word) > 1:
      # Lemmatize word
      word = wordLemm.lemmatize(word)
      tweetwords += (word + ' ')
            
  processedText.append(tweetwords)

print("Preprocessed data")

# Collect the polarity labels as an array for input to the SVM
sentiment_binary = np.array(sentimentHead['polarity'])

X_train, X_test, y_train, y_test = train_test_split(processedText, sentiment_binary,
                                                    test_size=0.03, random_state=0)

# Vectorize data
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)

X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)

print("Data vectorized")

# dual=False solves the primal problem, which is preferred when n_samples > n_features
svc = LinearSVC(random_state=0, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Fitted data")
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Preprocessed data
Data vectorized
Fitted data
Accuracy: 81.842%
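
To see what the regex steps in the preprocessing cell above do to a single tweet, a reduced sketch like the following can help. It reuses the same patterns but leaves out the emoticon mapping and the WordNet lemmatizer so that it runs with the standard library alone; `clean_tweet` and the sample tweet are hypothetical and not part of the notebook.

```python
import re

# Same patterns as in the preprocessing cell
URL_PATTERN = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
USER_PATTERN = r"@[^\s]+"
NON_ALNUM_PATTERN = r"[^a-zA-Z0-9]"
REPEAT_PATTERN = r"(.)\1\1+"


def clean_tweet(tweet):
    """Apply the regex cleaning steps (no emoticon mapping, no lemmatization)."""
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, " URL", tweet)         # URLs -> 'URL'
    tweet = re.sub(USER_PATTERN, " USER", tweet)       # @mentions -> 'USER'
    tweet = re.sub(NON_ALNUM_PATTERN, " ", tweet)      # punctuation -> spaces
    tweet = re.sub(REPEAT_PATTERN, r"\1\1", tweet)     # 'loooove' -> 'loove'
    # Drop single-character tokens, as the notebook does
    return " ".join(word for word in tweet.split() if len(word) > 1)


print(clean_tweet("@bob I loooove this!!! http://example.com/a"))
# USER loove this URL
```

The placeholder tokens keep the information that a URL or mention was present without adding one feature per distinct link or username.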
