# Homework 3 | MSCI:6100

**10 Points**

This assignment has two parts. Each part has questions based on Module 5, Module 6, or both.

## Part 1: SMS Spam Detection by Text Analytics

The following code reads a collection of SMS messages into a data frame called **df**, with each message labeled as **ham** (legitimate) or **spam**. The code also splits **df** into training (70%) and testing (30%) sets, stored as two new data frames called **df_train** and **df_test**.

Your task is to use **df_train** to build a predictive model to detect spam messages and test its performance on **df_test**.

In [1]:
#Read raw data 
import pandas as pd
df = pd.read_csv("classdata/spam.csv")
#Split into training and testing sets
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.30, random_state=2021)
df.head()

| | Label | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Questions based on only Module 5:

1a. (0.5 point) To train and test the model, you first need to construct document-term matrices (DTMs) and labels. Create DTMs for the training and testing sets in any way you like. It is entirely your choice whether to remove stopwords, whether to stem, whether to use TF, TF-IDF, or binary scores, whether to use n-grams, and whether to apply row normalization. Save your DTMs as **train_x** and **test_x**. Create the class labels for the training and testing sets and save them as **train_y** and **test_y**. Print the shapes of the DTMs for the training and testing sets.

In [3]:
#Your answer here:
# Needed Imports
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk 

# Define a TF-IDF vectorizer with stemming
stemmer = nltk.stem.SnowballStemmer("english")
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Wrap the default analyzer so that every token it emits is stemmed
        analyzer = super().build_analyzer()
        return lambda doc: [stemmer.stem(w) for w in analyzer(doc)]

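# NLTK stop word list (requires a one-time nltk.download("stopwords"))
nltk_stopwords = nltk.corpus.stopwords.words("english")

# norm=None turns off row normalization
vectorizer = StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)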

#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["Message"])
train_y = df_train["Label"]

#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["Message"])
test_y = df_test["Label"]

#Check your answer:
print(train_x.shape)
print(test_x.shape)

(3900, 5974)
(1672, 5974)
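
Both DTMs have the same number of columns because the vocabulary is fitted on the training messages only and then reused for the test messages. The sketch below illustrates that idea in plain Python with binary scores. The messages and the helper names `build_vocab` and `to_binary_dtm` are made up for illustration, and the whitespace tokenizer is a simplification of what scikit-learn does.

```python
def tokenize(text):
    # Lowercase and split on whitespace (a simplification of sklearn's tokenizer)
    return text.lower().split()

def build_vocab(docs):
    # "fit": collect the sorted set of terms seen in the training documents
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: col for col, term in enumerate(terms)}

def to_binary_dtm(docs, vocab):
    # "transform": one row per document, one column per vocabulary term
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:  # terms never seen during fit are ignored
                row[vocab[tok]] = 1
        rows.append(row)
    return rows

# Hypothetical messages, not taken from spam.csv
train_docs = ["free entry win prize", "ok see you later", "win free cash now"]
test_docs = ["free cash for you", "totally unseen words"]

vocab = build_vocab(train_docs)              # fitted on training data only
train_dtm = to_binary_dtm(train_docs, vocab)
test_dtm = to_binary_dtm(test_docs, vocab)   # reuses the training vocabulary

print(len(train_dtm), len(train_dtm[0]))
print(len(test_dtm), len(test_dtm[0]))
```

The second test message contains only unseen words, so its row is all zeros. Refitting the vocabulary on the test messages instead would change the columns and make the matrix unusable with a model trained on **train_x**.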


1b. (1 point) Create a sparse logistic regression using the DTM and the class labels you created in the previous question. Set the parameters in **LogisticRegression** as follows:
  - random_state=2021   
  - tol=0.001           
  - max_iter=1000
  - C=0.1
 
Save your model as **sparselr**. Print the number of non-zero betas in  **sparselr**.

In [4]:
#Your answer here:
# Needed Imports
from sklearn.linear_model import LogisticRegression

# Initialize the model
sparselr = LogisticRegression(penalty='l1', 
                              solver='liblinear',
                              random_state=2021,
                              tol=0.001,
                              max_iter=1000, 
                              C=0.1)
sparselr.fit(train_x,train_y)

#Check your solution:
print(sum(sparselr.coef_[0]!=0))

128
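
The count of non-zero betas depends strongly on $C$: under the L1 penalty, a smaller $C$ means stronger regularization and usually fewer surviving terms. The sketch below shows this on synthetic data from `make_classification`, which is an assumption standing in for the spam DTM. The two values of $C$ are arbitrary illustrations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a DTM: 50 features, only a few of them informative
X, y = make_classification(n_samples=300, n_features=50,
                           n_informative=5, random_state=0)

def count_nonzero_betas(C):
    # Same solver and penalty as in the answer above, with a varying C
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C,
                               random_state=0, tol=0.001, max_iter=1000)
    model.fit(X, y)
    return int(np.count_nonzero(model.coef_[0]))

n_small = count_nonzero_betas(0.01)   # strong regularization
n_large = count_nonzero_betas(10.0)   # weak regularization
print("non-zero betas at C=0.01:", n_small)
print("non-zero betas at C=10:  ", n_large)
```

`np.count_nonzero` gives the same count as `sum(coef != 0)` in the answer cell.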


### Questions based on both Modules 5 and 6:

1c. (1 point) Create a sparse logistic regression using the DTM and the class labels you created in question 1a. This time, select $C$ by 5-fold cross validation from a grid of **20 candidates** that increase proportionally from **l1_min_c** to **l1_min_c$\times 10^{5}$**. Since this data is unbalanced, AUC is a better performance metric than accuracy, so use the AUC score as the criterion for selecting $C$. Set the parameters in **LogisticRegressionCV** as follows:
  - random_state=2021   
  - tol=0.001           
  - max_iter=1000
  - scoring='roc_auc' 
 
Save your model as **sparselr_cv**. Print the number of non-zero betas in  **sparselr_cv**.

In [7]:
#Your answer here:
# Needed Imports
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import l1_min_c
import numpy as np

param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 
sparselr_cv = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   
                                cv=5,            
                                scoring='roc_auc', 
                                random_state=2021,  
                                tol=0.001,
                                max_iter=1000)
sparselr_cv.fit(train_x, train_y)


#Check your solution:
print(sum(sparselr_cv.coef_[0]!=0))

542
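
The phrase "increase proportionally" describes a geometric grid, which is what `np.logspace` produces: each candidate is a constant multiple of the previous one. The sketch below checks this with NumPy alone. The value of `min_c` is a made-up stand-in for the result of `l1_min_c(train_x, train_y, loss='log')`.

```python
import numpy as np

# Hypothetical stand-in for l1_min_c(train_x, train_y, loss='log')
min_c = 0.002

# 20 candidates from min_c * 10^0 up to min_c * 10^5
grid = min_c * np.logspace(start=0, stop=5, num=20)

# Ratio between consecutive candidates: constant, equal to 10^(5/19)
ratios = grid[1:] / grid[:-1]

print(len(grid))
print(grid[0], grid[-1])
print(ratios.min(), ratios.max())
```

With 20 points spanning five decades, the step ratio is $10^{5/19}$, about 1.83.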


1d. (1 point) Evaluate and print the accuracy and AUC score of **sparselr_cv** from the previous question on the testing and training sets. If your **AUC score on the testing set** is less than 0.985, modify the DTM in your solution for question 1a and re-run your code for questions 1b and 1c until the AUC score on the testing set in this question is at least 0.985.

*Hint: It is your choice whether to remove stopwords, whether to stem, whether to use TF, TF-IDF, or binary scores, whether to use n-grams, and whether to apply row normalization. There is no best combination that works for all datasets. Just keep trying.*

In [9]:
#Your answer here:
# imports Needed
from sklearn.metrics import accuracy_score, roc_auc_score

print("Train Accuracy:")
print(accuracy_score(train_y,sparselr_cv.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,sparselr_cv.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,sparselr_cv.predict_proba(train_x)[:, 1]))
print("Test AUC:")
print(roc_auc_score(test_y,sparselr_cv.predict_proba(test_x)[:, 1]))

Train Accuracy:
1.0
Test Accuracy:
0.9826555023923444
Train AUC:
0.9999999999999999
Test AUC:
0.9873620002074122
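
Question 1c states that AUC is a better metric than accuracy for unbalanced data. The sketch below illustrates why, using made-up labels with 10% spam and a "model" that gives every message the same score. The `auc` helper is a plain-Python version of the pairwise-ranking definition of AUC, written for illustration rather than taken from scikit-learn.

```python
def accuracy(y_true, y_pred):
    # Share of predictions that match the true label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    # Probability that a random positive scores higher than a random negative,
    # counting ties as one half
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical unbalanced data: 90 ham (0) and 10 spam (1)
y_true = [0] * 90 + [1] * 10

# A useless model: the same low spam score for every message, so all "ham"
const_scores = [0.1] * 100
const_pred = [0] * 100

print("accuracy:", accuracy(y_true, const_pred))
print("AUC:     ", auc(y_true, const_scores))
```

The constant model reaches 0.9 accuracy just by predicting the majority class, while its AUC of 0.5 shows that it cannot rank spam above ham at all.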


1e. (1 point) Print the 10 terms in **sparselr_cv** that have the largest impact on class "spam", that is, the terms whose presence in a message makes that message most likely to be spam.

In [12]:
#Your answer here:

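# get_feature_names_out() replaces get_feature_names(), which was removed in
# scikit-learn 1.2 (use get_feature_names() on scikit-learn < 1.0)
dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Beta': sparselr_cv.coef_[0]
                     })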

#Show the most positive terms
dfbeta.sort_values(by="Beta",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)



| | Term | Beta |
|---|---|---|
| 0 | 146tf150p | 1.515027 |
| 1 | ac | 1.46187 |
| 2 | servic | 1.42541 |
| 3 | rington | 1.321306 |
| 4 | voicemail | 1.275736 |
| 5 | slower | 1.265541 |
| 6 | teenag | 1.235977 |
| 7 | uk | 1.17727 |
| 8 | gbp | 1.14521 |
| 9 | claim | 1.142113 |
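
Sorting by beta works because scikit-learn orders the classes alphabetically, so in a binary model the coefficients refer to the second class, here "spam". A positive beta pushes a message toward spam and a negative beta toward ham. The sketch below repeats the ranking step in plain Python on a few made-up terms and coefficients.

```python
# Hypothetical terms and coefficients, for illustration only
terms = ["call", "claim", "home", "later", "prize", "sorry"]
betas = [0.8, 1.1, -0.4, 0.0, 1.5, -0.9]

# Sort (term, beta) pairs from the most positive to the most negative beta
ranked = sorted(zip(terms, betas), key=lambda pair: pair[1], reverse=True)

top_spam = ranked[:3]    # terms that push hardest toward "spam"
top_ham = ranked[-1]     # term that pushes hardest toward "ham"

print(top_spam)
print(top_ham)
```

Terms with a beta of exactly zero, such as "later" here, were dropped by the L1 penalty and have no effect on the prediction.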


1f. (1 point) The following code creates a list of three messages. Apply **sparselr_cv** from question 1c to each message and print the predicted class of each message (ham or spam) and the probability of each message being spam.

*Hint: sparselr_cv cannot be directly applied to text. You must first convert the messages to a DTM using the same vectorizer as in question 1a. Would you use fit_transform() or transform()?*

In [14]:
NewMessage=["""Congrats! 1 year special cinema pass for 2 is yours. 
                call 09061209465 now! C Suprman V, Matrix3, StarWars3, 
                etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out! """,
            
            """Update_Now - 12Mths Half Price Orange line rental: 
               400mins...Call MobileUpd8 on 08000839402 or call2optout=J5Q""",
            
            """Yo carlos, a few friends are already asking me about you, 
               you working at all this weekend?"""]

In [15]:
#Your answer here:
#Convert the new messages to a DTM with the vectorizer fitted in question 1a
newMessage = vectorizer.transform(NewMessage)

#Check your answer:
print(newMessage.shape)

sparselr_cv.predict(newMessage)


(3, 5974)


array(['spam', 'spam', 'ham'], dtype=object)

In [16]:
sparselr_cv.predict_proba(newMessage)

array([[3.13467928e-03, 9.96865321e-01],
       [5.81316100e-03, 9.94186839e-01],
       [9.99997396e-01, 2.60447855e-06]])
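
The columns of `predict_proba` follow the sorted class labels, so the first column is the probability of "ham" and the second the probability of "spam". The predicted class is the one with the larger probability. The sketch below reads the probabilities printed above in plain Python; the `classes` list is written out by hand on the assumption that it matches `sparselr_cv.classes_`.

```python
# Assumed to match sparselr_cv.classes_ (labels in sorted order)
classes = ["ham", "spam"]

# Probabilities copied from the predict_proba output above
proba = [[3.13467928e-03, 9.96865321e-01],
         [5.81316100e-03, 9.94186839e-01],
         [9.99997396e-01, 2.60447855e-06]]

predicted = []
for row in proba:
    # Pick the class whose column holds the larger probability
    label = classes[row.index(max(row))]
    predicted.append(label)
    print(f"{label}: P(spam) = {row[1]:.4f}")
```

The labels recovered this way agree with the output of `predict` above.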

## Part 2: Predict Stock Price Direction by News Headlines

The following code loads the data file **"classdata/BA_newsline_direction.csv"** into a data frame called **df2**. The data contains the headlines of news about The Boeing Company published by Thomson Reuters each day in 2020 (column **headline**). It also contains the direction ("up" or "down") in which Boeing's stock price moved one day after the news was published (column **direction**). Note that multiple news items on Boeing can be published on the same day. If so, each headline is saved as a separate record in **df2**.

Your task is to compare the performance of sparse logistic regression (SLR) and XGBoost in predicting the price direction.

In [17]:
import pandas as pd
df2 = pd.read_csv("classdata/BA_newsline_direction.csv")
df2.head()

| | date | nextday_RET | headline | direction |
|---|---|---|---|---|
| 0 | 2020-01-02 | -0.00168 | BUZZ-Norwegian Air: Hopes for Boeing deal driv... | down |
| 1 | 2020-01-02 | -0.00168 | AIRBUS <AIR.PA> SHARES UP 1.3 PERCENT . | down |
| 2 | 2020-01-02 | -0.00168 | AIRBUS <AIR.PA> SHARES EXTEND GAINS, STOCK UP ... | down |
| 3 | 2020-01-02 | -0.00168 | HM DUNN AEROSYSTEMS - DOES NOT EXPECT TO FURLO... | down |
| 4 | 2020-01-02 | -0.00168 | Reuters Insider - Trading at Noon: Tracking oi... | down |


### Questions based on only Module 5:

2a. (0.5 point) Split **df2** into training (70%) and testing (30%) sets as two new data frames called **df_train** and **df_test**. Create DTMs for training and testing sets based on the following instructions:

- Use the default tokenizer from the sklearn library.
- Remove the stop words in the nltk list.
- Do not stem the terms.
- Create the DTM with binary scores, using bigrams.

Save your DTMs as **train_x** and **test_x**. Create the class labels for training and testing sets. Save them as **train_y** and **test_y**. Print the shapes of the DTMs for training and testing sets. 

In [18]:
#Your answer here:
from sklearn.feature_extraction.text import CountVectorizer


df_train, df_test = train_test_split(df2, test_size=0.30, random_state=2021)
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

nltk_stopwords = nltk.corpus.stopwords.words("english")
vectorizer=CountVectorizer(stop_words=nltk_stopwords, binary=True,ngram_range=(2,2))

#Create the training DTM and the labels
train_x = vectorizer.fit_transform(df_train["headline"])
train_y = df_train["direction"]

#Create the testing DTM and the labels
test_x = vectorizer.transform(df_test["headline"])
test_y = df_test["direction"]

#Check your answer:
print(train_x.shape)
print(test_x.shape)

(4090, 23191)
(1754, 23191)
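
The bigram setting explains the large number of columns. In scikit-learn's `CountVectorizer`, stop words are removed before the n-grams are formed, so a bigram can join two words that were separated by a stop word in the original headline. The sketch below mimics that order of operations in plain Python. The headline, the tiny stop word set, and the whitespace tokenizer are simplifying assumptions.

```python
# Tiny stand-in for the NLTK stop word list
STOP_WORDS = {"the", "of", "to", "a", "is"}

def bigrams(text):
    # 1) tokenize, 2) drop stop words, 3) pair up neighbouring tokens
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical headline, not taken from the data file
headline = "Boeing to resume production of the 737 MAX"
terms = bigrams(headline)
print(terms)

# Binary scoring only records whether a bigram occurs, not how often
binary_row = set(terms)
print(len(binary_row))
```

Here "boeing resume" becomes a term even though "to" stood between the two words. With `ngram_range=(2,2)` only bigrams are kept, whereas `ngram_range=(1,2)` would keep single words as well.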


### Questions based on both Modules 5 and 6:

2b. (1 point) Create a sparse logistic regression using the DTM and the class labels you created in question 2a. Select $C$ by 5-fold cross validation from a grid of **20 candidates** that increase proportionally from **l1_min_c** to **l1_min_c$\times 10^{5}$**. Use **accuracy** as the criterion for selecting $C$. Set the parameters in **LogisticRegressionCV** as follows:
  - random_state=2021   
  - tol=0.001           
  - max_iter=1000
  - scoring='accuracy' 
 
Save your model as **sparselr_cv**. Print the number of non-zero betas in  **sparselr_cv**.

In [21]:
#Your answer here:

import numpy as np
from sklearn.svm import l1_min_c
from sklearn.linear_model import LogisticRegressionCV

param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=5, num=20) 


sparselr_cv = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='accuracy', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=1000)
sparselr_cv.fit(train_x, train_y)
#Check your answer:
print(sum(sparselr_cv.coef_[0]!=0))

17211


2c. (0.5 point) Evaluate and print the accuracy and AUC score of **sparselr_cv** from the previous question on the testing and training sets. 

In [22]:
#Your answer here:
print("Train Accuracy:")
print(accuracy_score(train_y,sparselr_cv.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,sparselr_cv.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,sparselr_cv.predict_proba(train_x)[:, 1]))
print("Test AUC:")
print(roc_auc_score(test_y,sparselr_cv.predict_proba(test_x)[:, 1]))

Train Accuracy:
0.9946210268948655
Test Accuracy:
0.7309007981755986
Train AUC:
0.999938417754819
Test AUC:
0.8135793865528409


2d. (1 point) Use the DTM from question 2a to build an XGBoost model to predict the direction. Select the parameter 'max_depth' between 2 and 5 and the parameter 'n_estimators' between 10 and 100 by cross validation using **GridSearchCV**. Set the parameters in **XGBClassifier** as follows:
  - nthread=4
  - use_label_encoder=False
  - verbosity = 0
  - random_state=2021
  
Set the parameters in **GridSearchCV** as follows
  - cv=5
  - scoring = 'accuracy'
  
Save the XGBoost model as **xgb**.

In [24]:
#Your answer here:
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV  
from xgboost import XGBClassifier

# Encode the string labels as integers for XGBoost ("down" -> 0, "up" -> 1)
le = preprocessing.LabelEncoder()
train_y=le.fit_transform(train_y)
test_y=le.transform(test_y)

 
param_list = {  
 'max_depth':[2, 5],       #Candidate for max_depth
 'n_estimators':[10, 100]  #Candidate for n_estimators
}
xgb=XGBClassifier(nthread=4,
                  use_label_encoder=False,
                  verbosity = 0,
                  random_state=2021
                 )
xgb = GridSearchCV(estimator = xgb, 
                   param_grid = param_list,
                   scoring = 'accuracy',  #The performance metric to select the best parameters.
                   cv=5                   #Number of folds, i.e., K
                  )  

xgb.fit(train_x, train_y)
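
A grid search tries every combination of the listed candidates, so the cost grows with the product of the list lengths. The sketch below enumerates the combinations for the grid used in this answer with `itertools.product`, as a plain-Python illustration of what **GridSearchCV** iterates over.

```python
from itertools import product

# Same candidates as in the answer above
param_list = {"max_depth": [2, 5], "n_estimators": [10, 100]}
cv_folds = 5

# Cartesian product of all candidate values
names = list(param_list)
combos = [dict(zip(names, values)) for values in product(*param_list.values())]

for combo in combos:
    print(combo)

# Each combination is fitted once per fold
n_fits = len(combos) * cv_folds
print("cross-validation fits:", n_fits)
```

With two candidates per parameter this gives 4 combinations and 20 cross-validation fits, plus one final refit on the full training set under the default `refit=True`. Listing every depth from 2 to 5, for example with `list(range(2, 6))`, would double the number of combinations.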


2e. (0.5 point) Evaluate and print the accuracy and AUC score of **xgb** from the previous question on the testing and training sets. 

In [26]:
#Your answer here:


print("Train Accuracy:")
print(accuracy_score(train_y,xgb.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,xgb.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,xgb.predict_proba(train_x)[:, 1]))
print("Test AUC:")
print(roc_auc_score(test_y,xgb.predict_proba(test_x)[:, 1]))

print(xgb.best_params_)

Train Accuracy:
0.6968215158924206
Test Accuracy:
0.6174458380843786
Train AUC:
0.7920542162911943
Test AUC:
0.6881238387838711
{'max_depth': 5, 'n_estimators': 100}


2f. (1 point) The following code creates a list of three news headlines. Apply **xgb** to each headline and print the predicted price direction after the news (encoded as 1 for "up" or 0 for "down") and the probability of moving in each direction.

*Hint: xgb cannot be directly applied to text. You must first convert the headlines to a DTM using the same vectorizer as in question 2a.*

In [28]:
News=[""" BRIEF-United Airlines Is Set To Take Delivery Of 
          A 737 Max From Boeing As Early As Tuesday - CNBC """,
            
     """ BOEING - DELIVERIES TO THE RNLAF ARE EXPECTED TO CONTINUE INTO 2021.""",
            
     """BRIEF-United Airlines Holdings Says Entered Agreement With Unit Of BOC 
           Aviation Ltd To Finance Through Sale,Lease Transaction 6 Boeing 787-9 Aircraft"""]

In [30]:
#Your answer here:

# Refit a single XGBoost model with the best parameters reported in question 2e
xgb=XGBClassifier(max_depth=5,
                  n_estimators=100,
                  nthread=4,
                  use_label_encoder=False,
                  verbosity = 0,
                  random_state=2021
                 )
xgb.fit(train_x, train_y)

news = vectorizer.transform(News)

xgb.predict(news)


array([1, 0, 0])

In [31]:
xgb.predict_proba(news)

array([[0.47590506, 0.52409494],
       [0.51115847, 0.48884156],
       [0.7261682 , 0.27383178]], dtype=float32)
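
To read these predictions as directions, recall that `LabelEncoder` assigns integer codes in sorted label order, so "down" becomes 0 and "up" becomes 1. The sketch below reproduces that mapping in plain Python and decodes the predictions printed above. It is a stand-in for `le.inverse_transform`, written out for illustration.

```python
# The two labels in the direction column
labels = ["down", "up", "up", "down"]

# LabelEncoder-style mapping: codes follow the sorted unique labels
classes = sorted(set(labels))
encode = {label: code for code, label in enumerate(classes)}
print(encode)

# Predictions copied from the xgb.predict output above
preds = [1, 0, 0]
decoded = [classes[p] for p in preds]
print(decoded)
```

The first headline is predicted "up" with a probability of only about 0.52, so that prediction is close to a coin flip.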