## Environment Setup

+ We will be working with Python 3
    + We suggest using JupyterLab, as we will use Python interactively
    + [Install Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/install.html) or [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html) if you don't already have Anaconda installed
+ Download the GitHub repository for this tutorial: [https://github.com/ftarlaci/pydata-austin-tutorial.git](https://github.com/ftarlaci/pydata-austin-tutorial.git)
    + Browse through the repository and locate the files in the folder

### <font color='blue'>Importing Libraries</font>

+ Importing the packages we will use today


In [None]:
''' 
    Natural Language Toolkit (nltk) is a suite of text processing libraries for classification, 
    tokenization, stemming, tagging, parsing, and semantic reasoning. 
    
    Installing nltk: 
    In the command line: 
    
    Windows: pip3 install nltk
    Mac: pip3 install -U nltk
    Anaconda: conda install -c conda-forge nltk 

'''
import nltk 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

import numpy as np 
import pandas as pd 
import re

In [None]:
# May be required to ensure the stopwords data is on your machine
nltk.download('stopwords')

# <center>Sentiment Analysis</center>

### <font color='blue'>What Is It?</font>

+ Sentiment analysis (SA) is **the process of understanding an opinion (sentiment) about a given subject from written or spoken language.** 
+ It is one of the subfields of Natural Language Processing (NLP) that extracts opinion and attributes from text (or speech). 
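+ In this tutorial we treat sentiment analysis as a supervised learning problem: a model learns from labeled examples. Lexicon-based (rule-based) approaches also exist.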

![NLP Tasks Chart](./images/nlpchart.jpg)

+ SA usually involves four tasks: 
    + *opinion identification*, identifying the text which contains an opinion
    + *feature extraction*, identifying the aspects being commented on, such as a product's price
    + *sentiment classification*, whether the opinion polarity is positive, negative, or neutral
    + *visualization and summarization* of results

### <font color='blue'>How Does It Bring Insight to Textual Data?</font>
+ It has many practical applications: **product reviews, marketing analysis, customer feedback.**
+ It helps us **make sense of data** by complementing the other techniques we use in data science and machine learning.
+ It is an example of **a classification problem**, where a classifier tells whether the sentiment in a piece of **text is positive, negative, or neutral**, or assigns any other class defined for the task.
+ It lets us gain **insight that helps automate and improve all kinds of processes** by transforming **unstructured data into structured data** about opinions on services, products, etc.

![Sentiment Smiley and Angry Faces](./images/sa.jpg)

*image credit [kdnuggets](https://www.kdnuggets.com/2018/03/5-things-sentiment-analysis-classification.html)*

### <font color='blue'>Why Is It Valuable?</font>

Organizations deal with enormous amounts of data: customer calls, emails, and social media posts, among many other forms. **Making these types of data more meaningful and useful requires a lot of effort and time.** NLP is one of the core skills for extracting information from textual data and gaining insight into it. **NLP is one of the must-have skills for all data scientists**, given the constant increase in textual data everywhere we look. 

+ **Discover how people feel about a particular topic** -- analyze the sentiments of users across various channels (social media, e-commerce, marketing campaigns, among others)
+ **Solve a number of business problems** -- product reviews, customer satisfaction...
+ **Customer-first philosophy** -- any insight into what customers like and don't like is extremely valuable 
+ **Gateway to data analytics**

### <font color='blue'>Basic Terminology</font>

+ **Stop Words** -- Common words that carry little information, such as 'the', 'is', 'a', etc.; they are usually removed during preprocessing
+ **Tokenization**  -- Segmentation of text into words or other units (tokens), a first step toward feature extraction
+ **Lemmatization** -- Reducing words to their dictionary base forms, called lemmas (the lemma of 'spoke' is 'speak' and the lemma of 'languages' is 'language')
+ **Stemming** -- Reducing a word to its stem or root form by stripping affixes ( car, cars, car’s, cars’ --> car ); unlike a lemma, a stem need not be a dictionary word
+ **Word Embedding** -- Mapping words to vectors of numbers where words with similar meaning have a similar numerical representation.
+ **Text Classification** -- Assigning categories to a document or parts of it
+ **N-grams** -- Sequences of *n* consecutive words (phrases) considered together rather than as single words. They help capture meaning that single words miss, e.g. the bi-gram 'not happy' instead of 'happy' alone. See the N-gram example below:

    Sentence:	The movie was not great.<br>
    Uni-grams:	[‘The’, ‘movie’, ‘was’, ‘not’, ‘great.’]<br>
    Bi-grams:	[‘The movie’, ‘movie was’, ‘was not’, ‘not great.’]
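
Several of the terms above can be sketched in a few lines of plain Python. This is only an illustration, not the NLTK implementation used later in the tutorial: the `STOP_WORDS` set is a tiny hand-picked list, and the helper names `tokenize`, `remove_stop_words`, and `ngrams` are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's list is much longer).
STOP_WORDS = {"the", "is", "a", "was"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The movie was not great.")
print(tokens)                     # ['the', 'movie', 'was', 'not', 'great']
print(remove_stop_words(tokens))  # ['movie', 'not', 'great']
print(ngrams(tokens, 2))          # ['the movie', 'movie was', 'was not', 'not great']
```

Note that the bi-gram 'not great' keeps the negation attached to the word it modifies, which single tokens cannot do.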


### <font color='blue'>Open Source NLP Libraries</font>

+ Many open source libraries are at your service when you want to implement NLP models

    * NLTK
    * Spark NLP
    * spaCy
    * ...

### <font color='blue'>Text Representation</font>

+ Classifiers and learning algorithms cannot directly process the text documents in their original form, as most of them expect numerical feature vectors with a fixed size rather than the raw text documents with variable length
+ During the text preprocessing step, the text is converted to a computationally more manageable representation
+ **Bag of Words (BoW)** is a common method for extracting features from text:
    + In BoW, the presence or count (frequency) of each word is taken into consideration. The order in which words occur is ignored.
+ Other methods include TF-IDF (term frequency-inverse document frequency), a weight that represents how important a word is to a document relative to the rest of the corpus.
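
To make the two representations concrete, here is a minimal standard-library sketch. The tutorial imports scikit-learn's `CountVectorizer` for this job; the code below only shows the idea. The three-document corpus is made up for illustration, the helper names are hypothetical, and the TF-IDF formula is the plain unsmoothed variant (scikit-learn uses a smoothed, normalized one, so its numbers will differ).

```python
import math
from collections import Counter

# Hypothetical toy corpus, made up for illustration.
docs = [
    "the food was great",
    "the service was slow",
    "great food great service",
]
tokenized = [doc.split() for doc in docs]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for tokens in tokenized for word in tokens})

def bow_vector(tokens):
    """Count how often each vocabulary word occurs; word order is ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in tokens for tokens in tokenized) for word in vocab}

def tfidf_vector(tokens):
    """Unsmoothed TF-IDF: (count / document length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [
        (counts[word] / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]

bow = [bow_vector(tokens) for tokens in tokenized]
tfidf = [tfidf_vector(tokens) for tokens in tokenized]

print(vocab)  # ['food', 'great', 'service', 'slow', 'the', 'was']
for row in bow:
    print(row)
# [1, 1, 0, 0, 1, 1]
# [0, 0, 1, 1, 1, 1]
# [1, 2, 1, 0, 0, 0]
```

Every document becomes a vector of the same length, which is what a classifier expects. In the TF-IDF vectors, 'slow' gets the largest weight in the second document because it appears in no other document.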

<!--- 
#### How do we measure the importance of words or the sentiment that originates from them?

There are a few methods: 

1) Term presence and term frequency  
  + Term is present 
  + Term is repeated = weighs more in determining the meaning

2) TF-IDF
  + TF-IDF stands for Term Frequency and Inverse Document Frequency and in simple terms it means the importance of a term to a document
    
3) Bag of Words
  + extract the words or tokens from the text and then push them in a bag (set) where the words are stored in the bag without any particular order. Thus the presence of a word in the bag is of main importance and the order of the occurrence of the word in the sentence, as well as its grammatical context carries no value. Bag of words scheme is the simplest way of converting text to numbers.  
  + --- Loses contextual (positional) data, had disadvantages for other downstream tasks --->

### <font color='blue'>What Will We Do Today?</font>

+ Gain a practical understanding of sentiment analysis and its influence on business processes
+ Build a sentiment analysis classifier in Python, using some of the open-source NLP and machine learning libraries
+ *Classify* the sentiments in a customer reviews dataset
    + Classification is the process of identifying the category of a new, unseen observation based on a training set of data whose categories are known.
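
As a rough preview of what classification looks like in code, the sketch below trains a classifier on a handful of made-up reviews using two of the scikit-learn classes imported at the top of this notebook. The reviews, labels, and parameter values are assumptions chosen for illustration; the model we build later in the tutorial uses a real dataset and proper preprocessing.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews and labels (1 = positive, 0 = negative).
train_texts = [
    "great food and friendly staff",
    "terrible service and cold food",
    "loved the place, great experience",
    "awful experience, never again",
    "friendly staff and great service",
    "cold food and terrible staff",
]
train_labels = [1, 0, 1, 0, 1, 0]

# Turn the training text into bag-of-words count vectors.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Fit the classifier on the labeled examples (the training set).
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, train_labels)

# Classify a new, unseen review, reusing the vocabulary learned above.
new_texts = ["the food was great"]
X_new = vectorizer.transform(new_texts)
predictions = clf.predict(X_new)
print(predictions)
```

With so little training data the prediction itself is not meaningful; the point is the shape of the workflow: vectorize, fit on labeled examples, then predict on unseen text.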

### <font color='blue'>Workflow:</font>
+ Import libraries and inspect the dataset 
+ Pre-process the data: clean it using NLTK methods
+ Create a sentiment classification model
+ Evaluate the model with metrics (Precision, Recall, F1 score, Accuracy)
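
The four evaluation metrics named in the last step can all be derived from the counts of true/false positives and negatives. The sketch below computes them by hand for a binary problem; the function name `classification_metrics` and the label lists are hypothetical, and in practice `sklearn.metrics` provides equivalent functions.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels for illustration: 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.625
# precision: 0.667
# recall: 0.500
# f1: 0.571
```

Precision asks how many predicted positives were correct, while recall asks how many actual positives were found; F1 is their harmonic mean.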

Continue to **`preprocessing_text_with_nltk.ipynb`**