# March 6, 2019 - Shashikant R. Dhuppe and Samiran Ghosh Roy

# Recap

 - What is the Bias-Variance Trade-off?
 - Why do we do feature engineering?
     - It allows us to take advantage of more features for prediction and allows us to compare things for inference.


# Unit 7 Case Study: Sentiment Analysis

## Q) What steps do we take when approaching a dataset?

 - Find out what kind of problem it is
 - Get the proper data
 - Choose the tools: Feature Engineering, SVM, Decision Trees, Linear/Logistic Regression
 
## The two main things we want to do are:

 A) Prediction <br>
 B) Inference
 
-  We will have several candidate models and we choose among them using cross validation
-  We split the data, fit each model on the training part, and select models based on validation performance
-  We choose based on model complexity and error (e.g. MSE)

 - Positive and Negative Words
    - Goal is to take a tweet and extract its sentiment based on the text/images
    
 - Goal of the session is to analyze tweets from the 2016 campaign
    - With NLP we want to extract subjective information from text and predict from it.

## How would you take text and predict sentiment?

A) We assign weights (polarities) to words using a dictionary built from some source, e.g. Twitter. The problem is that the context of the words is missing: if we use a Twitter-based dictionary to analyze classical music reviews, the polarity of a word can change with the context.
   - The problem with dictionaries in these cases is that they are fixed, cover limited text, and carry the bias of whoever made them.
   
   - We will use IMDB Database (ref slide 7.1)
   - We have 25,000 reviews
   
   ![](eg_of_sentiment_analysis.png)
   
   - We get a text, e.g. "I Love Hanoi", and take the average polarity of its words; if the average is > 0, then sentiment = Positive
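
The averaging rule above can be sketched in a few lines. This is only an illustration: the `POLARITY` dictionary and its values are made up, the average is taken over the words found in the dictionary, and labelling everything that is not positive as "Negative" is an assumption (the notes only state the positive case).

```python
# Hypothetical polarity dictionary; the words and values are illustrative only.
POLARITY = {"love": 1.0, "great": 0.8, "hate": -1.0, "boring": -0.6}


def dictionary_sentiment(text, polarity=POLARITY):
    """Average the polarity of the known words and compare it with zero."""
    words = text.lower().split()
    scores = [polarity[w] for w in words if w in polarity]
    if not scores:
        return "Unknown"  # no dictionary word appears in the text
    average = sum(scores) / len(scores)
    # Assumption: an average of exactly 0 or below is labelled Negative.
    return "Positive" if average > 0 else "Negative"


print(dictionary_sentiment("I Love Hanoi"))
```

A word that is missing from the dictionary contributes nothing here, which is exactly the limited-coverage problem noted above.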

## For Regression, MSE is the usual measure.
## For Classification, AUC is the usual measure

- What is AUC?
- What is Threshold?
    - The threshold is the predicted probability above which we decide to predict the positive class

If we are predicting spam and want to catch more of it, decrease the threshold: more messages are then predicted as positive.

| Actual class | Positive Prediction | Negative Prediction |
| --- | --- | --- |
| Positive | TP | FN |
| Negative | FP | TN |
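
To make the table concrete, here is a minimal sketch that counts the four cells for a given threshold. The probabilities and labels are made up, and treating a probability equal to the threshold as a positive prediction is an assumption.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Return (TP, FN, FP, TN), predicting positive when p >= threshold."""
    tp = fn = fp = tn = 0
    for p, y in zip(probabilities, labels):
        predicted_positive = p >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


probs = [0.9, 0.7, 0.4, 0.2]  # made-up predicted probabilities
labels = [1, 0, 1, 0]         # made-up true classes (1 = positive)
print(confusion_counts(probs, labels, threshold=0.5))
print(confusion_counts(probs, labels, threshold=0.3))
```

Lowering the threshold from 0.5 to 0.3 turns the missed positive (FN) into a TP in this toy data.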

![](ROC.png)
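
One way to read AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it by comparing all pairs, which is fine for small examples but slow for large data; the function name and the toy scores are illustrative.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(auc([0.9, 0.7, 0.4, 0.2], [1, 0, 1, 0]))
```

Because it only uses the ordering of the scores, AUC does not depend on any single threshold.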

## Performance is summarized using the F1 Score, the harmonic mean of Precision and Recall

__Precision__ = TP/(TP+FP), computed at a given threshold

Example: asset managers at Wegmans want high precision, to be sure that someone they suspect is actually stealing

__Recall__ = TP/(TP+FN), computed at a given threshold

Example: TSA during the Thanksgiving rush wants to catch the bad guys, so a low threshold will ensure high recall
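
A small sketch of the two formulas above, together with the F1 score (the harmonic mean of precision and recall). The counts in the example are made up, and returning 0 when a denominator is 0 is a convention chosen here, not something from the lecture.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision_recall_f1(tp=8, fp=2, fn=8))
```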

![](eg_of_ROC.png)

Accuracy must be 99.99% when predicting terrorists

## Let us take a data science approach

- Transform text to features
- Assign weights to features
- Use logistic regression to find proper weights of words
- Predict Positive or Negative Sentiment.

### Transform text to numbers:
There are many ways to do it; a simple and common way is

__TF__ and __IDF__, which stand for Term Frequency and Inverse Document Frequency respectively

- The idea is to assign a feature to each word
    - First find the unique words, then count the frequency of each of these words
    
![](eg_data.png)

   - We transform the original text into:

![](word_vector.png)
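
The counting step can be sketched as follows, assuming whitespace tokenization and lowercase words (a real tokenizer would also handle punctuation). The three toy documents are made up and are not the lecture data.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect the unique words of all documents in a fixed (sorted) order."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def count_vector(document, vocabulary):
    """Term frequency of each vocabulary word in one document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]


docs = ["good movie", "bad movie", "good good plot"]  # toy documents
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a row of numbers with one column per unique word, which is the form a regression model can consume.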


Then we do logistic regression, which has both a linear part (the weighted sum Z of the features) and a non-linear part (the sigmoid that maps Z to a probability)

![](logistic_regression.png)
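
A minimal sketch of the two parts of logistic regression: a linear score Z followed by the sigmoid. The weights and word counts are hypothetical; in practice the weights are estimated from the training data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function, maps any real z into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(weights, intercept, features):
    """Linear part: z = intercept + sum(w * x). Non-linear part: sigmoid(z)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights for the words ("great", "awful").
weights = [1.5, -2.0]
print(predict_probability(weights, 0.0, [2, 0]))  # review with "great" twice
print(predict_probability(weights, 0.0, [0, 2]))  # review with "awful" twice
```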

   - The more positive words we have, the higher the Z value and the higher the predicted probability of positive sentiment.
   - For negative words, the more we have of them, the more negative the Z value.
   - We notice that stop words have high frequency
       - So we down-weight them using Inverse Document Frequency
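
A sketch of how inverse document frequency down-weights words that appear in every document. It uses the smoothed formula that scikit-learn documents for `TfidfTransformer` with `smooth_idf=True`, namely `ln((1 + n) / (1 + df)) + 1`; whether the lecture slide uses exactly this variant is an assumption, and the toy documents are made up.

```python
import math


def smoothed_idf(documents):
    """IDF per word: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc.lower().split()):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}


docs = ["the movie was great", "the movie was awful", "the plot was great"]
idf = smoothed_idf(docs)
for word in ("the", "great", "awful"):
    print(word, round(idf[word], 3))
```

"the" occurs in all three toy documents and gets the smallest weight, while "awful" occurs in only one and gets the largest.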

![](tf-idf.png)

The above equation is the TF-IDF formula from scikit-learn, a machine learning library for Python
   - Now we split the weighted matrix of words

60% Training  |  30% Validation  |  10% Testing
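
A minimal sketch of such a 60/30/10 split, assuming the rows are shuffled first. The function name and the fixed seed are illustrative choices.

```python
import random


def split_60_30_10(rows, seed=0):
    """Shuffle the rows and cut them into training, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    train_end = n * 6 // 10       # first 60%
    validation_end = n * 9 // 10  # next 30%
    return (shuffled[:train_end],
            shuffled[train_end:validation_end],
            shuffled[validation_end:])


train, validation, test = split_60_30_10(range(100))
print(len(train), len(validation), len(test))
```

The test part is set aside until the very end; model selection uses only the training and validation parts.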
	
We use the Bernoulli loss function (log loss): $L_\theta (x,y)$
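
As an illustration, the Bernoulli (log) loss for predicted probabilities can be computed as below. Averaging over the examples and clipping probabilities with `eps` to avoid `log(0)` are assumptions of this sketch.

```python
import math


def bernoulli_loss(probabilities, labels, eps=1e-12):
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over all examples."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


print(bernoulli_loss([0.9, 0.1], [1, 0]))  # confident and correct
print(bernoulli_loss([0.6, 0.4], [1, 0]))  # correct but less confident
```

Confident correct predictions give a smaller loss than hesitant ones, and confident wrong predictions are punished heavily.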

   - Logistic regression estimates the weight of each word
        - Apply this to the IMDB data

We get the weights of positive and negative words

   - How to interpret the weights? We divide them by 4 <br>
    E.g. Octane: -7/4 = -1.75 <br>
    So if Octane appears in a review, the probability of the review being positive is reduced by 175%
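
Where the 4 comes from: the sigmoid is steepest at Z = 0, where its slope is exactly 1/4, so weight/4 is an upper bound on the change in probability per unit change of a feature. A small sketch, with a made-up weight of 0.4 in the example:

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_slope(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def max_probability_change(weight):
    """Divide-by-4 rule: bound on the probability change per unit of the feature."""
    return weight / 4.0


print(sigmoid_slope(0.0))           # slope at the steepest point
print(sigmoid_slope(3.0))           # the curve is flatter away from Z = 0
print(max_probability_change(0.4))  # 0.4 is a made-up weight
```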

What is wrong with a drop of more than 100% in probability?


## OVERFITTING

- The weights are suspiciously large; the problem is in the complexity of the model, not in logistic regression itself.
- If the model is too complex, it might be overfitted.
- In our case we have 25,000 reviews, which are the data points
- The number of features is M = 250,000 approximately
- We have fewer equations (data points) than unknowns (features), which means: too many features.

## To fix this

Regularized Logistic Regression

We automatically select and eliminate features; it copes with features even when there are many of them.
   - We modify the loss function by adding a term that penalizes large weights: <br> “Don’t Learn Too Much.”

Two ways to penalize the weights:

A) L1 – Lasso <br>
B) L2 – Ridge

The term added (for L2) is
$\lambda \sum_{j=1}^{M}\theta_j^2$

The better we fit the data, the smaller the loss.

$L_r(\theta) = \text{MSE} + \lambda \sum_{j=1}^{M}\theta_j^2$

If we make $\lambda \to \infty$, <br>

we penalize every weight except the intercept $\theta_0$ so heavily that they all shrink to 0.
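
A sketch of the penalized loss, assuming `theta[0]` is the intercept and is left out of the penalty as described above. The numbers are made up, and `data_loss` stands for whatever unpenalized loss is being used.

```python
def l2_penalty(theta, lam):
    """lam * sum of squared weights, skipping the intercept theta[0]."""
    return lam * sum(t * t for t in theta[1:])


def regularized_loss(data_loss, theta, lam):
    """Unpenalized loss plus the L2 penalty."""
    return data_loss + l2_penalty(theta, lam)


theta = [5.0, 1.0, -2.0]  # made-up parameters: intercept, then two weights
print(regularized_loss(data_loss=0.30, theta=theta, lam=0.0))
print(regularized_loss(data_loss=0.30, theta=theta, lam=0.1))
```

With a larger `lam`, the same weights cost more, so the optimizer is pushed toward smaller weights while the intercept stays free.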

![](bias_and_variance.png)

The problem with L2 Regularization is that even though Stop Words don’t matter they still get some weight.

![](penalization.png)

- L1 is better at driving weights to exactly 0
- L1 cannot select more features than there are data points
- A combination of L1 and L2 is Elastic Net Regularization
- If Alpha = 1 then the L1 share is 0 and the L2 share is 1, and vice versa for Alpha = 0
- Lambda controls how strongly we regularize
- The derivative of the L1 penalty is undefined at 0.
- Applying 30% L1, 70% L2 and Lambda = 0.02
    - We increase performance by 3%
    - Now divide the weights by 4
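
A sketch of an elastic net penalty with the 30% / 70% mix and Lambda = 0.02 from above. The parameter `l1_fraction` is a name chosen here for the L1 share, the weights are made up, and libraries may scale the two terms differently (for example with a factor 1/2 on the L2 part).

```python
def elastic_net_penalty(theta, lam, l1_fraction):
    """lam * (l1_fraction * sum|w| + (1 - l1_fraction) * sum w^2), intercept excluded."""
    weights = theta[1:]  # theta[0] is assumed to be the intercept
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return lam * (l1_fraction * l1 + (1.0 - l1_fraction) * l2)


theta = [0.5, 1.0, -2.0]  # made-up parameters: intercept, then two weights
print(elastic_net_penalty(theta, lam=0.02, l1_fraction=0.3))
```

Setting `l1_fraction` to 1 or 0 recovers a pure L1 or pure L2 penalty in this sketch.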

For most problems we use Linear + Logistic Regression

If a dictionary is good enough, then stop.

Always use cross validation

![](cross_validation.png)
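
A minimal sketch of k-fold cross validation, assuming the data is already shuffled: each fold serves once as the validation set while the rest is used for training. The function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (training_indices, validation_indices), one pair per fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        folds.append((training, validation))
        start += size
    return folds


for training, validation in k_fold_indices(10, 3):
    print(len(training), validation)
```

The model is fitted k times and the validation scores are averaged, so every data point is used for validation exactly once.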

Computing confidence in Logistic Regression (Standard Deviation)

![](variance_of_bernauli.png)
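
For reference, a Bernoulli variable with success probability p has variance p(1 - p), so its standard deviation is the square root of that. The sketch below assumes this is the quantity the figure refers to; how the lecture turns it into a confidence measure is not shown in these notes.

```python
import math


def bernoulli_std(p):
    """Standard deviation of a Bernoulli(p) variable: sqrt(p * (1 - p))."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    return math.sqrt(p * (1.0 - p))


for p in (0.5, 0.9, 0.99):
    print(p, bernoulli_std(p))
```

The spread is largest at p = 0.5 and shrinks as the predicted probability moves toward 0 or 1.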

There is a problem with generalizability, since the IMDB dataset was used to train the model while the data we actually want to analyze is Twitter data.

After tokenizing the words in the sentiment analysis notebook, we convert the text into numbers, then we construct a pipeline to create the TF features.

There are special objects in Spark (evaluators) that allow you to compute performance measures.

We call the evaluator object on the predictions.