# Sparse Linear Model | BAIS:6100

**Instructor: Qihang Lin**

## Predictive Analytics with Text Data

Predict categorical outcome using text data:
   - spam email / regular email
   - news about sports / news about politics
   - positive reviews / negative reviews
   - true reviews / fake reviews
   - useful reviews / useless reviews
   - election result (through social media posts)

Predict numeric outcome using text data:
   - Star rating 
   - Box office revenue 
   - Stock return
   - Demand of a product

A predictive model can be built on a vectorized form of the text data, such as the document-term matrix (DTM). Candidate models include:

* Sparse logistic/linear regression
* Decision/regression tree
* Random forest
* Boosting
* K-nearest neighbors
* Naive Bayesian classifier
* Support vector machine
* Deep neural network
* ......

## Training VS Testing
* Each predictive model can be evaluated by its **in-sample performance** and **out-of-sample performance**:
    * **In-sample performance**: The prediction performance on the data points that were used to build the model.
    * **Out-of-sample performance**: The prediction performance on the data points that were not involved in building the model (for example, the real-world instances coming in the future).
    
    
* A **model validation** procedure is used to estimate a model's out-of-sample performance:
    - Split our data into training and testing sets (e.g., 75% vs 25% or 80% vs 20%).
    - Pretend that the testing set is the data in the future.
    - Build the model using only the training set and then evaluate its performance on both the training and testing sets.
    - Training performance $=$ in-sample performance 
    - Testing performance $\approx$ out-of-sample performance
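
The validation procedure above can be sketched as follows. This is a minimal illustration on synthetic data: the dataset from `make_classification`, the 75%/25% split, and the variable names are assumptions, not the course data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a vectorized text dataset (assumption for illustration).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Hold out 25% of the data and pretend it arrives in the future.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Build the model using the training set only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# In-sample performance vs. an estimate of out-of-sample performance.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```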

## Underfit VS Overfit

* Underfitting: the model is too simple to characterize the underlying truth.
    - High error in both training and testing data.

* Overfitting: the model is too complex so it "memorizes" the label of each document in the training set but ignores the underlying truth.
    - Low error in training but high error in testing data.

* The complexity of a model depends on the size of vocabulary and some hyper-parameters.
    
* Regularization and model validation can help us avoid overfitting.

<img src=https://cdn.analyticsvidhya.com/wp-content/uploads/2020/02/Screenshot-2020-02-06-at-11.09.13.png width="400">


## Steps of Predictive Text Analytics 

An end-to-end predictive text analytics pipeline is composed of three main components:

**1. Dataset Preparation:** Load a dataset and split it into training and testing sets (e.g., 80% and 20%). Then perform basic preprocessing, such as tokenization, word replacement, stop-word removal, and stemming.

**2. Feature Engineering:** Create the DTM and determine a vocabulary. Choose between TF, TFIDF, and binary weighting, between words and n-grams, and whether to normalize. Decide whether to include other features such as document length, POS-tag counts, word embeddings, topic models, and sentiment scores.

**3. Model Training:** Train a machine learning model using the training set. Tune the **hyper-parameters**. Evaluate the model's performance on the testing set.

Repeat Steps 1, 2 and 3 with different choices to improve the performance.  

<img src=http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1535125878/NLTK3_zwbdgg.png width="500">

## Logistic Regression
* Vectorization of a document: $\mathbf{x}=(x_1,x_2,\dots,x_n)$ 
    * $\mathbf{x}$ is a row in DTM and also called a feature vector.
    
* Each document has a class label, denoted by $y$.
    * For example, $y=pos$ or $neg$ in binary classification.

* Linear score: 
    * $\alpha+\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n$
* Coefficients: 
    * Intercept: $\alpha$ 
    * Slopes: $\mathbf{\beta}=(\beta_1,\beta_2,\dots,\beta_n)$
* Logistic regression makes prediction based on the linear score:
    * Predict $y$ as "pos" if $\alpha+\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n>0$ 
    * Predict $y$ as "neg" if $\alpha+\beta_1x_1+\beta_2x_2+\cdots+\beta_nx_n<0$
* Impact of terms: 
    * $\beta_i>0$: A document with a high frequency of term $i$ is more likely to be predicted as positive. 
    * $\beta_i<0$: A document with a high frequency of term $i$ is more likely to be predicted as negative.
    * $\beta_i=0$: Term $i$ has no impact on the predicted class label $y$.
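
As a small worked example of the linear score, the sketch below uses made-up coefficients and a made-up feature vector for a three-term vocabulary. It also applies the logistic (sigmoid) function, which is how logistic regression converts the score into a probability of the positive class.

```python
import numpy as np

# Hypothetical coefficients for a 3-term vocabulary (made up for illustration).
alpha = -0.5
beta = np.array([1.2, -0.8, 0.0])   # the third term has no impact

# Feature vector of one document (a row of the DTM).
x = np.array([2, 1, 5])

# Linear score: alpha + beta_1*x_1 + ... + beta_n*x_n
score = alpha + beta @ x

# Predict "pos" when the score is positive, "neg" otherwise.
label = "pos" if score > 0 else "neg"

# The logistic (sigmoid) function turns the score into P(y = pos).
prob_pos = 1 / (1 + np.exp(-score))
print(score, label, prob_pos)
```

Because the third slope is zero, changing the frequency of the third term would not change the score.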

## Sparse Logistic Regression

* A logistic regression (LR) model is very likely to overfit when the vocabulary is very large. 


* A sparse logistic regression (SLR) model is an LR model with slope $\beta_i=0$ for terms with no significant impact. These terms are essentially removed from the model, which reduces the size of the vocabulary and helps avoid overfitting.

* The term "sparse" in SLR comes from the fact that the slopes $\mathbf{\beta}=(\beta_1,\beta_2,\dots,\beta_n)$ contain many zeros.


* In principle, SLR can be applied to a vocabulary of any size with much less risk of overfitting, provided that the regularization parameter is tuned properly.


* SLR searches for $\mathbf{\beta}$ that minimizes
$$
TrainingError + \frac{1}{C}\times\|\mathbf{\beta}\|_1 
$$
   - $\|\mathbf{\beta}\|_1=|\beta_1|+|\beta_2|+\dots+|\beta_n|$: L1 regularizer. $\mathbf{\beta}$ tends to be sparse when $\|\mathbf{\beta}\|_1$ is small, because penalizing $\|\mathbf{\beta}\|_1$ pushes many $\beta_i$'s to exactly zero. Therefore, we can make $\mathbf{\beta}$ sparse by minimizing $\|\mathbf{\beta}\|_1$.
   - $C$ is a **regularization parameter** that balances between $TrainingError$ and $\|\mathbf{\beta}\|_1$ during training.
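
The objective above can be evaluated by hand for a toy example. The sketch below assumes that $TrainingError$ is the total log loss (the usual choice for logistic regression) and that labels are coded as +1 and -1; the function name `slr_objective` and the numbers are hypothetical.

```python
import numpy as np

def slr_objective(alpha, beta, X, y, C):
    """Log-loss training error plus (1/C) times the L1 norm of beta.

    y holds +1 for the positive class and -1 for the negative class.
    """
    scores = alpha + X @ beta
    training_error = np.sum(np.log1p(np.exp(-y * scores)))
    l1_norm = np.sum(np.abs(beta))
    return training_error + l1_norm / C

# Two made-up documents over a 3-term vocabulary.
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 1.0]])
y = np.array([1, -1])
beta = np.array([1.0, -1.0, 0.0])

# The same beta is penalized more heavily when C is small.
obj_small_C = slr_objective(0.0, beta, X, y, C=0.1)
obj_large_C = slr_objective(0.0, beta, X, y, C=10.0)
print(obj_small_C, obj_large_C)
```

The training error is identical in both calls, so the difference between the two values comes entirely from the penalty term.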

## Sparsity and Regularization Parameter

* Regularization parameter $C$ must be provided when building a sparse logistic regression model. 
* Small $C$:
    * $\mathbf{\beta}$ becomes more sparse. SLR tends to be simple and underfitted.
* Large $C$:
    * $\mathbf{\beta}$ becomes more dense. SLR tends to be complex and overfitted.
* $C$ must be tuned carefully to achieve the best performance of SLR.
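
The effect of $C$ on sparsity can be checked empirically. The sketch below uses synthetic data from `make_classification` as a stand-in for a DTM (an assumption for illustration) and counts the non-zero slopes for a few values of $C$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a DTM (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

nonzero_counts = {}
for C in [0.001, 0.1, 10]:
    model = LogisticRegression(penalty="l1", solver="liblinear",
                               C=C, random_state=0, max_iter=1000)
    model.fit(X, y)
    # Count the slopes that were not driven to zero.
    nonzero_counts[C] = int(np.sum(model.coef_[0] != 0))

print(nonzero_counts)
```

In practice, the candidate values of $C$ would be compared by their validation performance rather than by sparsity alone.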

## Dataset Preparation

1. Clean the text data if needed. 

2. Split the text data into training and testing sets. 

3. Create DTMs for both sets. Make sure they have the same vocabulary and that vocabulary must be determined only based on the training set.
  
In principle, no information from the testing set should be used during training. 
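
A minimal sketch of step 3, using made-up toy documents and a plain `CountVectorizer`: the vocabulary is learned from the training documents only and then reused for the testing documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for illustration).
train_docs = ["great phone great battery", "poor battery"]
test_docs = ["great camera", "poor screen poor battery"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training documents only.
train_dtm = vectorizer.fit_transform(train_docs)

# Reuse that vocabulary for the testing documents.
# Words unseen in training ("camera", "screen") are ignored.
test_dtm = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))
print(train_dtm.shape, test_dtm.shape)
```

Both DTMs end up with the same columns, so a model trained on one can be applied to the other.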

In [1]:
import pandas as pd
import re
df = pd.read_csv("classdata/Lies_and_Truths.csv")
df = df[df.Domain=="Electronics"].copy()
df.reset_index(drop=True, inplace=True)
df=df[["Sentiment Polarity","Review"]]   #Keep only the columns we need.

#Clean the text if needed.
#Replace punctuation and underscores with spaces (raw string avoids invalid-escape warnings).
df["Review"]=[re.sub(r"[^\w\s]|_", " ", s) for s in df["Review"]]
df.head()

| | Sentiment Polarity | Review |
| --- | --- | --- |
| 0 | neg | egyd hr fhjfhjtjr rhjtjt tjfhjfwettert ryur yu... |
| 1 | neg | I have never purchased a George Foreman grill ... |
| 2 | neg | I ll sum it up in one word SCAM Those people... |
| 3 | pos | When our ten year old Mr Coffee stopped worki... |
| 4 | neg | I purchased the Stanley Bostitch QuietSharp Ex... |


In [2]:
#Split the data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.33, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

Note that a random partition changes from run to run, which makes the analysis result not reproducible. That's why **random_state=2021** is set: the partition is still random, but the same partition is produced every time the code runs.
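
The role of `random_state` can be verified with a short sketch on made-up data: two calls with the same seed return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

# Two calls with the same random_state produce the same partition.
train_a, test_a = train_test_split(data, test_size=0.3, random_state=2021)
train_b, test_b = train_test_split(data, test_size=0.3, random_state=2021)

print(test_a, test_b)
print(np.array_equal(test_a, test_b))
```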

## Feature Engineering

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk 

stemmer = nltk.stem.SnowballStemmer("english")
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

nltk_stopwords = nltk.corpus.stopwords.words("english") 

vectorizer=StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)

#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["Review"])
train_y = df_train["Sentiment Polarity"]
train_x.shape

(529, 3323)

By applying **vectorizer.fit_transform** to df_train["Review"] above, the vocabulary of train_x is learned from the training data and saved internally. 

Next, we apply **vectorizer.transform** to df_test["Review"], so the saved vocabulary will be used to create test_x. This makes sure train_x and test_x have the same vocabulary.

In [4]:
#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["Review"])
test_y = df_test["Sentiment Polarity"]
test_x.shape

(261, 3323)

If you are interested, you can see the frequencies of different labels in both sets:

In [5]:
pd.DataFrame({'Train': train_y.value_counts(),
              'Test': test_y.value_counts()})

| | Train | Test |
| --- | --- | --- |
| neg | 272 | 122 |
| pos | 257 | 139 |


In this example, the label equals "pos" or "neg". **In binary classification, sklearn sorts the class labels and treats the last one (alphabetically last for strings) as the positive class.** Fortunately, "pos" comes after "neg" in alphabetical order, so the class labels in this example are intuitive.
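
The label order used by a fitted model can be inspected through its `classes_` attribute. The sketch below uses a tiny made-up dataset rather than the review data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset with one feature and string labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["neg", "neg", "pos", "pos"])

model = LogisticRegression().fit(X, y)

# classes_ lists the labels in sorted order; the second one is the
# positive class in binary classification.
print(model.classes_)
positive_class = model.classes_[1]
```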

## Model Training

In the backend, SLR is actually built by a **solver**, which is an algorithm that updates the SLR model in iterations to make it better and better. More iterations generally make the model more accurate but also require a longer runtime. 

The solver stops and returns the SLR model when either of the following conditions is satisfied:
1. **Tolerance**, which is how much SLR changes between iterations, is smaller than a threshold.
2. **Max number of iterations** is reached.

In [6]:
from sklearn.linear_model import LogisticRegression
sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.0001,
                              max_iter=1000, 
                              C=1)
sparselr.fit(train_x,train_y)

* **penalty='l1'** means we are using an L1 regularizer, which makes the model an SLR model.  
* **solver='liblinear'** selects the 'liblinear' algorithm to build the SLR model. Solvers that support the L1 penalty (such as 'liblinear' and 'saga') are supposed to return approximately the same SLR model, but they need different amounts of runtime and memory.
* **random_state=2021**: The solver shuffles the data randomly while updating the model. Fix random_state to make sure the result is reproducible.
* **tol=0.0001** sets the threshold for tolerance.
* **max_iter=1000** sets the max number of iterations. 
* **C=1** sets the regularization parameter.

For a large dataset, we usually start with a small max_iter or a large tol, for example, max_iter=100 or tol=0.1, and see how long it takes. If the runtime is bearable, we can increase max_iter or decrease tol, and run the code again.
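
One way to follow this advice is to time the fit under loose and tight stopping settings. The sketch below uses synthetic data as a stand-in for a large DTM (an assumption for illustration) and reads the fitted model's `n_iter_` attribute to see how many iterations were actually run.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a large DTM (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=200, random_state=0)

results = []
for tol, max_iter in [(0.1, 100), (0.0001, 1000)]:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=1,
                               tol=tol, max_iter=max_iter, random_state=0)
    start = time.perf_counter()
    model.fit(X, y)
    runtime = time.perf_counter() - start
    # n_iter_ reports how many iterations the solver actually ran.
    results.append((tol, max_iter, int(model.n_iter_[0]), runtime))

for tol, max_iter, n_iter, runtime in results:
    print(f"tol={tol}, max_iter={max_iter}: {n_iter} iterations, {runtime:.3f}s")
```

If the reported number of iterations equals max_iter, the solver stopped because of the iteration limit rather than the tolerance.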

* More solvers: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

* More options: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [7]:
#Slope betas
sparselr.coef_[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [8]:
#Intercept alpha
sparselr.intercept_

array([0.])

In [9]:
#How many non-zero betas in total
sum(sparselr.coef_[0]!=0)

223

## Descriptive Analytics

We can identify the terms that have the largest impact on the pos and neg classes by sorting the $\beta_i$'s.

In [10]:
#get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Beta': sparselr.coef_[0]
                     })



In [11]:
#Show the most positive terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

| | Term | Beta |
| --- | --- | --- |
| 0 | love | 1.356127 |
| 1 | perfect | 1.064824 |
| 2 | sleek | 0.980339 |
| 3 | amaz | 0.746911 |
| 4 | bright | 0.736774 |
| 5 | easi | 0.73125 |
| 6 | favorit | 0.520354 |
| 7 | appeal | 0.512988 |
| 8 | dream | 0.476208 |
| 9 | awesom | 0.466109 |


In [12]:
#Show the most negative terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=True)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

| | Term | Beta |
| --- | --- | --- |
| 0 | disappoint | -0.665625 |
| 1 | howev | -0.551987 |
| 2 | slow | -0.529074 |
| 3 | expens | -0.468178 |
| 4 | confus | -0.450569 |
| 5 | despit | -0.45049 |
| 6 | seem | -0.449013 |
| 7 | see | -0.420707 |
| 8 | alreadi | -0.408001 |
| 9 | poor | -0.407091 |


Please keep in mind that the terms with positive (negative) $\beta$'s make the documents more likely to be in the positive (negative) class, but the positive class is determined by the alphabetical order. It is just a coincidence that "pos" is alphabetically after "neg".

## Predictive Analytics

In [13]:
#Apply the model to the testing set and predict the class labels
sparselr.predict(test_x)[0:10]

array(['neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos',
       'neg'], dtype=object)

In [14]:
#Predict the probability of each doc being in each class 
#The two columns correspond to "neg" and "pos" (in alphabetical order)
sparselr.predict_proba(test_x)[0:10]

array([[9.93675779e-01, 6.32422069e-03],
       [7.82184934e-01, 2.17815066e-01],
       [8.60953266e-01, 1.39046734e-01],
       [9.76884247e-01, 2.31157531e-02],
       [8.99561288e-01, 1.00438712e-01],
       [8.81510285e-01, 1.18489715e-01],
       [9.68023742e-01, 3.19762581e-02],
       [9.47590638e-01, 5.24093619e-02],
       [9.02624031e-09, 9.99999991e-01],
       [9.54815700e-01, 4.51842999e-02]])
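
The predicted labels and probabilities are both derived from the linear score. The sketch below checks this relationship on synthetic data (an assumption for illustration, not the review data): the sigmoid of the score matches the second column of `predict_proba`, and the sign of the score matches `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(penalty="l1", solver="liblinear", C=1,
                           random_state=0).fit(X, y)

# Linear score alpha + beta_1*x_1 + ... + beta_n*x_n for each row.
scores = model.intercept_[0] + X @ model.coef_[0]

# The sigmoid of the score is the probability of the positive class.
prob_pos = 1 / (1 + np.exp(-scores))

# predict picks the positive class (classes_[1]) when the score is positive.
manual_pred = np.where(scores > 0, model.classes_[1], model.classes_[0])

print(np.allclose(prob_pos, model.predict_proba(X)[:, 1]))
print(np.array_equal(manual_pred, model.predict(X)))
```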

## Performance Metric

**Confusion table**:

| | Pred Neg | Pred Pos |
| --- | --- | --- |
| True Neg | 31 | 11 |
| True Pos | 48 | 68 |

**Accuracy**: The percentage of correct predictions made by a model. 

- With the confusion table above, accuracy=$\frac{31+68}{31+68+11+48}=62.7\%$
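
The same calculation can be reproduced with **sklearn.metrics**. The sketch below rebuilds label vectors whose counts match the confusion table above; the vectors themselves are constructed for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Rebuild label vectors that reproduce the confusion table above.
y_true = np.array(["neg"] * 42 + ["pos"] * 116)
y_pred = np.array(["neg"] * 31 + ["pos"] * 11      # true neg: 31 right, 11 wrong
                  + ["neg"] * 48 + ["pos"] * 68)   # true pos: 48 wrong, 68 right

# Rows are true labels and columns are predicted labels.
table = confusion_matrix(y_true, y_pred, labels=["neg", "pos"])
accuracy = accuracy_score(y_true, y_pred)

print(table)
print(round(accuracy, 3))
```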

**AUC Score**: 

- Accuracy is problematic when the data is unbalanced (e.g., 990 pos but only 10 neg), because a model that always predicts pos already achieves 99% accuracy.
- AUC (Area Under the ROC Curve) is another performance measure that works better than accuracy on unbalanced data.



- Suppose we rank the data points in the descending order of their predicted probabilities of being positive. AUC measures how well the positive and negative data points are separated after ranking.

<img src=https://developers.google.com/machine-learning/crash-course/images/AUCPredictionsRanked.svg width="700">


- AUC represents the probability that a random positive (green) data point is ranked higher than a random negative (red) data point.

- AUC ranges in value from 0 to 1, with 1 indicating a perfect ranking and 0.5 corresponding to random guessing.
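
The probabilistic interpretation of AUC can be checked directly. The sketch below uses made-up labels and probabilities, compares every positive point with every negative point, and confirms that the fraction of correctly ordered pairs agrees with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of being positive.
y_true = np.array(["neg", "pos", "neg", "pos", "pos", "neg"])
prob_pos = np.array([0.10, 0.80, 0.40, 0.35, 0.90, 0.60])

pos_scores = prob_pos[y_true == "pos"]
neg_scores = prob_pos[y_true == "neg"]

# Fraction of (positive, negative) pairs where the positive point is
# ranked higher; ties count as one half.
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = np.mean((diff > 0) + 0.5 * (diff == 0))

print(pairwise_auc, roc_auc_score(y_true, prob_pos))
```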

More performance metrics: https://scikit-learn.org/stable/modules/model_evaluation.html

All of these metrics are available in **sklearn.metrics**. 

We often calculate these metrics in both training and testing sets to check if the model is overfitted.

In [15]:
from sklearn.metrics import accuracy_score, roc_auc_score

Accuracy:

In [16]:
print("Train:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test:")
print(accuracy_score(test_y,sparselr.predict(test_x)))

Train:
0.998109640831758
Test:
0.8314176245210728


AUC Score: The instances will be sorted by the predicted probability of being positive (the alphabetically last class), which is stored in **sparselr.predict_proba(x)[:, 1]**. 

In [17]:
print("Train:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

Train:
0.9999928473334859
Test:
0.915261233636042


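The model is almost perfect on the training set (accuracy 0.998, AUC about 1.000) but noticeably worse on the testing set (accuracy 0.831, AUC 0.915). The testing performance is still good, but the gap suggests some overfitting, so a smaller $C$ is worth trying.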