# Sentiment Analysis Beta: Prototyping for Data Processing

This notebook is exploratory.
It tests and compares different data processing methods to determine which one is best suited to the sentiment analysis project.

In [5]:
import json
import numpy as np
import matplotlib.pyplot as plt
import spacy

In [11]:
# Example: use spaCy as a ready-made tokenizer so we don't have to write one ourselves
nlp = spacy.load("en_core_web_sm")
doc = nlp("some text")
# Note: token.dep is the integer ID of the dependency label; token.dep_ gives the readable string
print([(token.pos_, token.text, token.dep, token.is_alpha) for token in doc])

[('DET', 'some', 415, True), ('NOUN', 'text', 8206900633647566924, True)]


# The following question then arises: what would our data pipeline look like with spaCy?
<ol>
    <li>First, we have a document, or any other type of file that contains the text. These are our messages, our tweets, and our collections of text in txt, CSV, XML, and JSON files. They can also be text from HTML files that have not been scraped yet.</li>
    <li>After extracting the data from these formats, we have to cleanse it of noise and normalise it.</li>
    <li>Once we have methods for extracting data from such files, we need a way to store the result (we have not yet decided whether it will be a plain txt file, JSON, or something else). With the data stored, we run the spaCy pipeline to tokenise the document.</li>
    <li>After tokenising, we remove stop words: common words (e.g., “the,” “and”) that are unlikely to contribute meaningful sentiment. This matters because the sentiment analysis should rely only on the words that carry meaning.</li>
    <li>The last step is lemmatising the words and retrieving their stems, since the words have to be standardised before they are used for feature extraction and transformation.</li>
</ol>
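
The first two steps (extracting text from files, then cleansing and normalising it) could be sketched as follows. This is a minimal sketch using only the standard library, not a settled design: the helper names `extract_texts` and `clean_text`, the `"text"` field name, and the assumed file layouts are all hypothetical, since the storage format has not been decided yet.

```python
import csv
import html
import json
import re
import unicodedata
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")        # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+")   # bare http(s) links
WHITESPACE_RE = re.compile(r"\s+")     # runs of spaces, tabs, newlines


def extract_texts(path, field="text"):
    """Return the raw text entries stored in a .json, .csv, or plain text file.

    Assumed layouts (hypothetical): a list of objects for .json, a header row
    for .csv, with the text under `field`; one message per line otherwise.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            return [record[field] for record in json.load(fh)]
    if suffix == ".csv":
        with path.open(encoding="utf-8", newline="") as fh:
            return [row[field] for row in csv.DictReader(fh)]
    # Fall back to plain text: one message per non-empty line.
    with path.open(encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def clean_text(raw):
    """Strip markup and links, then normalise Unicode and whitespace."""
    text = TAG_RE.sub(" ", raw)                  # drop HTML tags
    text = html.unescape(text)                   # turn entities like &amp; into characters
    text = URL_RE.sub(" ", text)                 # drop links
    text = unicodedata.normalize("NFKC", text)   # unify compatibility characters
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace


print(clean_text("<p>I &amp; my friends   loved it!</p> https://example.com/x"))
```

Case is deliberately left untouched here, on the assumption that lowercasing is better done after spaCy has seen the original text.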

This is what the pipeline would look like:

<img src="diagram.png" alt="Overall image" width="800" height="500">
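
The tokenisation, stop-word removal, and lemmatisation steps of the pipeline could be sketched with spaCy roughly as below. The function name `preprocess` is hypothetical. The sketch assumes a blank English pipeline, which only tokenises and needs no model download; with `spacy.load("en_core_web_sm")` passed in instead, `token.lemma_` would be filled in and the function would return lemmas rather than lowercased surface forms.

```python
import spacy

# Assumption: a blank English pipeline is enough to demonstrate the flow.
# It tokenises and knows the default stop-word list, but has no lemmatizer.
nlp = spacy.blank("en")


def preprocess(text, nlp):
    """Tokenise `text`, drop stop words and non-alphabetic tokens, return base forms."""
    tokens = []
    for token in nlp(text):
        if token.is_stop or not token.is_alpha:
            continue
        # lemma_ is empty when the pipeline has no lemmatizer,
        # so fall back to the token text in that case.
        tokens.append((token.lemma_ or token.text).lower())
    return tokens


print(preprocess("The movie was great and the actors were amazing!", nlp))
```

One design choice to keep in mind: spaCy's default English stop-word list includes negations such as "not" and "never", which can flip the sentiment of a sentence, so it may be worth keeping those tokens rather than filtering them out.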