# Exercise 6

The CSV file "classdata/twosports.csv" contains posts from a forum. Every post is about either baseball or hockey. Your task is to build a sparse logistic regression model that predicts whether a post is about baseball or hockey.

- Column "topic" contains the class labels ("baseball" or "hockey"). 
- Column "text" contains the texts of posts. 

The following code loads the libraries, reads the data, and displays the first few rows.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
import nltk 
import numpy as np
from sklearn.svm import l1_min_c
from sklearn.linear_model import LogisticRegressionCV

df = pd.read_csv("classdata/twosports.csv",encoding="latin-1")
df.head()

|   | topic | text |
|---|---|---|
| 0 | baseball | Umpires are not required to call time out just... |
| 1 | hockey | In article <1993Apr21.174430.24039@Virginia.ED... |
| 2 | hockey | I hear Daigle will eb the first pick next year... |
| 3 | hockey | If you wanted to send your own letter to the N... |
| 4 | baseball | In article <C51vwC.Lru@usenet.ucs.indiana.edu>... |


The following code splits the data into training and testing sets using a random seed of 2021. It also defines the stop-word list and the vectorizers that you may need for this question.

In [7]:
nltk_stopwords = nltk.corpus.stopwords.words("english")  # NLTK English stop-word list

df_train, df_test = train_test_split(df, test_size=0.30, 
                                     random_state=2021   #Random seed set to be 2021.
                                    )
df_train.reset_index(drop=True,inplace=True)
df_test.reset_index(drop=True,inplace=True)

stemmer = nltk.stem.SnowballStemmer("english", ignore_stopwords=True)
class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: ([stemmer.stem(w) for w in analyzer(doc)])

1. Use any document-term matrix (DTM) you like to create a sparse logistic regression model to predict column "topic". Select $C$ by 5-fold cross-validation from a grid of **30 candidates** that increase proportionally from **l1_min_c** to **l1_min_c** $\times 10^{8}$. Use AUC as the criterion for selecting $C$. Set the remaining parameters in **LogisticRegressionCV** as follows:
  
  - random_state=2021   
  - tol=0.001           
  - max_iter=100
  - scoring='roc_auc'

Calculate and print the accuracy and AUC score of your model on the testing set.
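
A grid that "increases proportionally" is a geometric one: each candidate equals the previous one multiplied by a fixed factor. The sketch below builds such a grid in plain Python to show the idea, and it should match what `np.logspace(start=0, stop=8, num=30)` multiplied by the starting value gives. The value of `c_min` is a made-up stand-in, since the real `l1_min_c` result depends on the training data.

```python
# Hypothetical stand-in for l1_min_c(train_x, train_y, loss='log');
# the real value depends on the training data.
c_min = 1e-4
num_candidates = 30
decades = 8  # the grid spans c_min up to c_min * 10**8

# Exponents are evenly spaced between 0 and 8, so every candidate is the
# previous one multiplied by the same constant factor.
grid = [c_min * 10 ** (decades * i / (num_candidates - 1))
        for i in range(num_candidates)]

# The common ratio between neighbouring candidates: 10 ** (8 / 29).
ratio = grid[1] / grid[0]

print(len(grid))
print(grid[0], grid[-1])
print(ratio)
```

Small values of $C$ mean a strong L1 penalty and few nonzero coefficients. Spacing the candidates on a log scale lets one grid cover eight orders of magnitude with only 30 fits per fold.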

In [9]:
#Your answer here: 
vectorizer=StemmedTfidfVectorizer(stop_words=nltk_stopwords, norm=None)

#Create the training and testing DTMs and the labels
train_x = vectorizer.fit_transform(df_train["text"])
train_y = df_train["topic"]
test_x = vectorizer.transform(df_test["text"])
test_y = df_test["topic"]

param_grid = l1_min_c(train_x, train_y, loss='log') * np.logspace(start=0, stop=8, num=30) 
sparselr = LogisticRegressionCV(penalty='l1', 
                                solver='liblinear', 
                                Cs=param_grid,   #Use the grid generated above
                                cv=5,            #Number of folds, that is, K
                                scoring='roc_auc', #The performance metric to select the best C.
                                random_state=2021,  #To make sure the result is reproducible
                                tol=0.001,
                                max_iter=100)
sparselr.fit(train_x, train_y)

print("Train Accuracy:")
print(accuracy_score(train_y,sparselr.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,sparselr.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,sparselr.predict_proba(train_x)[:, 1]))
print("Test AUC:")
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

#Check your answer
print(accuracy_score(test_y,sparselr.predict(test_x)))
print(roc_auc_score(test_y,sparselr.predict_proba(test_x)[:, 1]))

Train Accuracy:
0.9978494623655914
Test Accuracy:
0.9615384615384616
Train AUC:
1.0
Test AUC:
0.9953280975161611
0.9615384615384616
0.9953280975161611
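
To make the AUC numbers above easier to interpret: AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The helper below, `pairwise_auc`, is a hypothetical illustration written for this note, not part of scikit-learn, and the labels and scores are toy values rather than data from the exercise.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy data: two negatives (0) and two positives (1).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because AUC depends only on the ranking of the scores, it does not change if the 0.5 classification threshold is moved, which is one reason it is used here to select $C$ while accuracy is reported separately.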


2. Use the same DTM from the previous question to build an XGBoost model to predict column "topic". Select parameter 'max_depth' from $\{2,3,4\}$ and parameter 'n_estimators' from $\{100,500\}$ by 5-fold cross-validation. Use AUC as the criterion for selecting the parameters. Set the other parameters in **XGBClassifier** as follows:

  - nthread=4
  - use_label_encoder=False
  - verbosity = 0
  - random_state=2021
  
Save the output XGBoost model as **xgb**.
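
As a rough guide to the cost of this search, the sketch below enumerates the candidate combinations in plain Python, using the values given in the question. It only illustrates what a grid search iterates over and is not how `GridSearchCV` is implemented. The names `combos` and `total_cv_fits` are introduced here for illustration.

```python
from itertools import product

# Candidate values as stated in the question.
param_list = {"max_depth": [2, 3, 4], "n_estimators": [100, 500]}
n_folds = 5

# Every combination of one max_depth with one n_estimators.
names = list(param_list)
combos = [dict(zip(names, values)) for values in product(*param_list.values())]

for combo in combos:
    print(combo)

# Each combination is fitted once per fold during cross-validation.
total_cv_fits = len(combos) * n_folds
print(total_cv_fits)  # 30
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set using the best combination, so `predict` and `predict_proba` on the fitted search object use that final model.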

In [10]:
#Your answer here:
from sklearn.model_selection import GridSearchCV  
from xgboost import XGBClassifier
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
train_y = le.fit_transform(train_y)
test_y = le.transform(test_y)

param_list = {  
 'max_depth':[2,3,4],       #Candidate for max_depth
 'n_estimators':[100, 500]  #Candidate for n_estimators
}
xgb=XGBClassifier(nthread=4,
                  use_label_encoder=False,
                  verbosity = 0,
                  random_state=2021
                 )

xgb = GridSearchCV(estimator = xgb, 
                   param_grid = param_list,
                   scoring = 'roc_auc',  #The performance metric to select the best parameters.
                   cv=5                   #Number of folds, i.e., K
                  )  

xgb.fit(train_x, train_y)

#Check your answer:
xgb

3. What is the best combination of the parameters used in the XGBoost model in question 2?

In [5]:
#Your answer here:
xgb.best_params_

{'max_depth': 3, 'n_estimators': 100}

4. Print the accuracy and the AUC score on the testing set obtained by the XGBoost model in question 2.

In [6]:
#Your answer here:

print("Train Accuracy:")
print(accuracy_score(train_y,xgb.predict(train_x)))
print("Test Accuracy:")
print(accuracy_score(test_y,xgb.predict(test_x)))
print("Train AUC:")
print(roc_auc_score(train_y,xgb.predict_proba(train_x)[:, 1]))
print("Test AUC:")
print(roc_auc_score(test_y,xgb.predict_proba(test_x)[:, 1]))

Train Accuracy:
0.9842293906810036
Test Accuracy:
0.9515050167224081
Train AUC:
0.9994665060359856
Test AUC:
0.993697973268203


5. Use the same DTM from question 1 to build an XGBoost model to predict column "topic". Use the best combination of parameters identified in question 3, and set the other parameters the same as in question 2. What are the ten most important terms your XGBoost model uses to make predictions?

In [7]:
#Your answer here:
xgb=XGBClassifier(max_depth=3,
                  n_estimators=100,
                  nthread=4,
                  use_label_encoder=False,
                  verbosity = 0,
                  random_state=2021
                 )
xgb.fit(train_x, train_y)

dfbeta = pd.DataFrame({'Term': vectorizer.get_feature_names_out(),
                       'Importance': xgb.feature_importances_
                     })
dfbeta.sort_values(by="Importance",inplace=True,ascending=False)
dfbeta.reset_index(inplace=True,drop=True)
dfbeta.head(10)

#Check your answer:
dfbeta.head(10)



|   | Term | Importance |
|---|---|---|
| 0 | hockey | 0.087059 |
| 1 | playoff | 0.045106 |
| 2 | pitch | 0.041586 |
| 3 | goal | 0.038171 |
| 4 | devil | 0.037045 |
| 5 | bat | 0.028685 |
| 6 | wing | 0.026134 |
| 7 | pitcher | 0.02441 |
| 8 | basebal | 0.022069 |
| 9 | patrick | 0.019354 |
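
One way to read these scores is as shares of the total importance. The sketch below re-ranks the ten values printed above and adds them up. It assumes that the importances are normalized to sum to 1 over all terms in the DTM. That is an assumption about how `feature_importances_` is reported and is worth checking against the XGBoost documentation for the version in use.

```python
# The ten (term, importance) pairs copied from the output above.
top_terms = [
    ("hockey", 0.087059), ("playoff", 0.045106), ("pitch", 0.041586),
    ("goal", 0.038171), ("devil", 0.037045), ("bat", 0.028685),
    ("wing", 0.026134), ("pitcher", 0.02441), ("basebal", 0.022069),
    ("patrick", 0.019354),
]

# Same ordering rule as sort_values(by="Importance", ascending=False).
ranked = sorted(top_terms, key=lambda pair: pair[1], reverse=True)

# Combined share of the ten terms, ASSUMING importances sum to 1 overall.
share = sum(importance for _, importance in ranked)
print(f"{share:.6f}")  # 0.369619
```

Under that assumption, the ten terms account for roughly 37% of the total importance, and the rest is spread across the remaining terms. Terms such as "basebal" appear in stemmed form because the DTM was built with the stemmed vectorizer.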
