<a href="https://colab.research.google.com/github/edward2018211/sentiment-analysis-SOS/blob/master/Sentiment_Analysis_Edward.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is part of the News Sentiment Analysis project and contains code snippets. The initial setup is below.

In [None]:
# Download dependencies
!pip install ImageScraper # https://pypi.org/project/ImageScraper/

# Import libraries
import tensorflow as tf
import image_scraper as imagescraper

Collecting ImageScraper
  Downloading https://files.pythonhosted.org/packages/43/4b/e1e2af3b0892cf55fe4db06ce16e6d1d41c6bb95ae208b52109316da948c/ImageScraper-2.0.7-py2.py3-none-any.whl
Collecting setproctitle>=1.1.8
  Downloading https://files.pythonhosted.org/packages/5a/0d/dc0d2234aacba6cf1a729964383e3452c52096dc695581248b548786f2b3/setproctitle-1.1.10.tar.gz
Collecting SimplePool
  Downloading https://files.pythonhosted.org/packages/5f/05/1caf229f0baccbbc01978b4c77269e602125815403a7fb1079e63b83be05/SimplePool-0.1.tar.gz
Building wheels for collected packages: setproctitle, SimplePool
  Building wheel for setproctitle (setup.py) ... done
  Created wheel for setproctitle: filename=setproctitle-1.1.10-cp36-cp36m-linux_x86_64.whl size=33908 sha256=ded2ca8f7dea86d30e1907038fcb45814c1458e3ff417d598bcd73f807a6150f
  Stored in directory: /root/.cache/pip/wheels/e6/b1/a6/9719530228e258eba904501fef99d5d85c80d52bd8f14438a3
  Building wheel for SimplePool (setup.py) ...

We will first scrape images from Google with a simple script so that they can be processed later.

Next, we will manually label the scraped images to use as training data for our TensorFlow model. We could also use data augmentation to get more data, but we need to be aware of photos for which augmentation wouldn't make sense.
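As a rough illustration of the augmentation caveat, the sketch below mirrors a tiny image left to right in plain Python. The `flip_horizontal` and `augment` helpers and the `safe_to_flip` flag are hypothetical names, not part of this project; in TensorFlow the equivalent operation would be `tf.image.flip_left_right`.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def flip_horizontal(image: Image) -> Image:
    """Return a left-right mirrored copy of the image."""
    return [list(reversed(row)) for row in image]


def augment(image: Image, safe_to_flip: bool) -> List[Image]:
    """Return the original plus a mirrored copy when mirroring is meaningful.

    safe_to_flip is a hypothetical manual flag: set it to False for photos
    containing text, logos, or maps, where a mirrored copy would not
    resemble anything the model sees in practice.
    """
    variants = [image]
    if safe_to_flip:
        variants.append(flip_horizontal(image))
    return variants


photo = [[1, 2, 3],
         [4, 5, 6]]
print(augment(photo, safe_to_flip=True))        # original and mirrored copy
print(len(augment(photo, safe_to_flip=False)))  # only the original is kept
```

The flag would have to be set during manual labeling, since whether a mirrored photo still makes sense depends on its content.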

We will need to do some feature engineering for better prediction accuracy. Our deep neural network will need multiple layers, and we'll probably want to increase the stride length for a faster model that uses less memory. We'll probably add dropout too, to combat overfitting.
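To see why a larger stride saves memory, recall that a convolution along one dimension produces floor((n + 2p - k) / s) + 1 outputs for input size n, kernel size k, padding p, and stride s. The sketch below applies that formula; the 224x224 input and 3x3 kernel are assumed values chosen for illustration, not settings from this project.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Hypothetical 224x224 input with a 3x3 kernel and no padding.
for stride in (1, 2):
    side = conv_output_size(224, kernel_size=3, stride=stride)
    print(f"stride={stride}: {side}x{side} = {side * side} activations per channel")
```

Doubling the stride roughly halves each spatial dimension, so every later layer works on about a quarter of the activations.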

Once the model is fine-tuned, it will be able to make predictions based on new images (data).

In addition, we would like to analyze the text associated with the news, not just the pictures. The classical approach is bag of words; instead, we'll use word vectors, which improve performance, to train our SVM. Below is the setup for our text model.
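The difference between the two representations can be sketched in plain Python. The three-dimensional vectors in `TOY_VECTORS` are made up for illustration; real spaCy vectors have hundreds of dimensions, and `Doc.vector` defaults to the average of the token vectors, which is what `mean_vector` imitates here.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical 3-dimensional word vectors, made up for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.0],
    "bad": [-0.9, 0.1, 0.0],
}


def bag_of_words(text: str) -> Counter:
    """Count each token; related words such as 'good' and 'great' share nothing."""
    return Counter(text.lower().split())


def mean_vector(text: str) -> List[float]:
    """Average the vectors of known tokens into one fixed-length document vector."""
    known = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


print(bag_of_words("good good bad"))
print(mean_vector("good great"))
```

With word vectors, texts that use different but related words end up close together, and the feature length stays fixed regardless of vocabulary size.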

Note: The data we use is in TSV format, and we will need to convert it to CSV format. A good website to use is: https://open.blockspring.com/pkpp1233/tsv-to-csv-converter#access-from-tools. In addition, since the dataset contains around 25K texts, loading the data and training the model will take 15 to 20 minutes once this project is completed. Currently, there are still parsing errors caused by commas, so we discard the affected texts, leaving us with about 15K data points.
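As an alternative to a converter website, the conversion could be done with Python's standard `csv` module, which quotes any field containing a comma. This is a minimal sketch, not the method used in this project: `tsv_to_csv` is a hypothetical helper, and the two-line `sample` stands in for the real file.

```python
import csv
import io


def tsv_to_csv(tsv_file, csv_file) -> int:
    """Copy rows from a tab-separated stream to a comma-separated one.

    Returns the number of rows written, header included.
    """
    # QUOTE_NONE on input: treat quote characters in the text as ordinary data.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # QUOTE_MINIMAL on output: fields containing a comma or quote get wrapped in quotes.
    writer = csv.writer(csv_file, quoting=csv.QUOTE_MINIMAL)
    rows = 0
    for row in reader:
        writer.writerow(row)
        rows += 1
    return rows


# Hypothetical two-line sample standing in for the real TSV file.
sample = "id\tsentiment\treview\n1\t1\tGreat plot, great cast\n"
out = io.StringIO()
tsv_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

With real files, open both with `newline=''` as the `csv` documentation recommends. A CSV written this way has to be read with default quoting, since `quoting=csv.QUOTE_NONE` would split on the commas inside quoted text again. pandas can also read the TSV directly with `pd.read_csv(path, sep='\t')`, which skips the conversion entirely.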

Next Steps:
1. We want to clean our data further by using find and replace in Numbers so that we can incorporate the whole 25K dataset.
2. We want to experiment with other ML models and compare their accuracy on the training and testing datasets.

In [None]:
# Download large model
!python -m spacy download en_core_web_lg
!python -m spacy download en

# Import Libraries
import numpy as np
import spacy
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from google.colab import files
import io
import csv

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


Saving labeledTrainData.csv to labeledTrainData (9).csv


b'Skipping line 4: expected 13 fields, saw 25\nSkipping line 5: expected 13 fields, saw 40\nSkipping line 6: expected 13 fields, saw 22\nSkipping line 14: expected 13 fields, saw 21\nSkipping line 17: expected 13 fields, saw 15\nSkipping line 18: expected 13 fields, saw 15\nSkipping line 19: expected 13 fields, saw 24\nSkipping line 20: expected 13 fields, saw 14\nSkipping line 24: expected 13 fields, saw 14\nSkipping line 30: expected 13 fields, saw 34\nSkipping line 35: expected 13 fields, saw 22\nSkipping line 37: expected 13 fields, saw 14\nSkipping line 39: expected 13 fields, saw 15\nSkipping line 41: expected 13 fields, saw 33\nSkipping line 46: expected 13 fields, saw 14\nSkipping line 50: expected 13 fields, saw 14\nSkipping line 51: expected 13 fields, saw 17\nSkipping line 52: expected 13 fields, saw 15\nSkipping line 61: expected 13 fields, saw 37\nSkipping line 62: expected 13 fields, saw 28\nSkipping line 63: expected 13 fields, saw 14\nSkipping line 68: expected 13 field

In [None]:
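# Loading the sentiment data
uploaded = files.upload()
# Use whichever file was uploaded rather than a hard-coded file name
filename = next(iter(uploaded))
# On pandas >= 1.3, use on_bad_lines='skip' instead of error_bad_lines=False
sentiment = pd.read_csv(io.BytesIO(uploaded[filename]), error_bad_lines=False, quoting=csv.QUOTE_NONE)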

In [None]:
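# Clean data: replace() returns a new DataFrame, so assign the result back.
# regex=True matches the patterns inside each string instead of whole cells.
import re
patterns = [re.escape(p) for p in [',', '\'', "<br /><br />", '`', '\\', '\\\\', '""']]
sentiment = sentiment.replace(to_replace=patterns, value=" ", regex=True)
print("Cleaned data")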

# Parse data and add all lines of text into sentimentText array
with nlp.disable_pipes():
  sentimentText = []
  sentimentBinary = []
  for row in sentiment.itertuples():
    reviewText = ""
    for line in row:
      if isinstance(line, tuple):
        sentimentBinary.append(line[1])
        counter = 0
        for section in line:
          if isinstance(section, str) and counter > 1:
            reviewText += section
          counter += 1
      elif isinstance(line, str):
        reviewText += line
    sentimentText.append(reviewText)
  
  doc_vectors = np.array([nlp(row).vector for row in sentimentText])
print("Parsed data")

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, sentimentBinary,
                                                    test_size=0.1, random_state=1)

# Set dual=False to speed up training; it is preferred when n_samples > n_features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print("Fitted data")
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Cleaned data
Parsed data
Fitted data
Accuracy: 69.358%
