# Logistic Regression Example

This example uses a slightly modified version of the SMS Spam data set from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The UCI data set is a collection of 5574 SMS messages that have been labeled as ham or spam. The file is tab-delimited, with the label in the first column and the message content in the second. I edited the data set to remove some unwanted columns and add headings.



In [1]:
import pandas as pd
df = pd.read_csv('sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...


In [2]:
# text preprocessing
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

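# requires the NLTK stopword corpus: nltk.download('stopwords')
# keep a separate name so the imported stopwords module is not shadowed,
# and pass a list, which newer scikit-learn versions expect for stop_words
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words, binary=True)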

In [3]:
# set up X and y
X = vectorizer.fit_transform(df.text)
y = df.spam

In [4]:
# divide into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

### Train and test

Train on the train data and then evaluate on the test data.

In [5]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
pred = classifier.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.955578512397
precision score:  0.963855421687
recall score:  0.666666666667
f1 score:  0.788177339901


These accuracy and f1 scores are lower than those obtained with Naive Bayes. Recall that Naive Bayes and logistic regression are both considered linear classifiers because they divide the feature space with a hyperplane. NB uses a generative function to find the hyperplane that most likely generated the observed data. Logistic regression uses a discriminative function to find the decision boundary that best divides the data. Usually, logistic regression will outperform NB once the training data becomes sufficiently large.
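
To make the "hyperplane" idea concrete, here is a minimal sketch of how a logistic regression model scores a single example: a weighted sum of the features plus an intercept, passed through the sigmoid function. The weights, intercept, and feature values are made up for illustration; they are not taken from the classifier trained above.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, intercept, features):
    """Probability of the positive class (spam) for one example."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# hypothetical weights for three tf-idf features, not learned from the SMS data
weights = [2.0, -1.0, 0.5]
intercept = -0.5

examples = {
    'spam_like': [0.9, 0.0, 0.4],     # linear score = 1.5
    'ham_like': [0.0, 0.8, 0.1],      # linear score = -1.25
    'on_boundary': [0.25, 0.0, 0.0],  # linear score = 0
}

for name, x in examples.items():
    print(name, round(predict_proba(weights, intercept, x), 3))
```

The points where the linear score is exactly zero form the hyperplane: examples on one side get a probability above 0.5 and examples on the other side get a probability below 0.5.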

Let's change some parameters and see if we get better results. 

After trying a few parameters based on the [documentation page](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), the one that made a significant improvement was `class_weight='balanced'`, which weights each class inversely to its frequency in the training data. This probably improved performance because the data set is very imbalanced.
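
As a rough illustration of what balanced class weights mean, the scikit-learn documentation gives the weight of each class as `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to a made-up label list. The 90/10 split and the helper name `balanced_class_weights` are hypothetical and do not reflect the actual ratio in this data set.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# hypothetical labels: 90 ham (0) and 10 spam (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With these toy labels, an error on a spam message costs nine times as much as an error on a ham message, which pushes the model to stop ignoring the rare class.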

In [7]:
classifier2 = LogisticRegression(class_weight='balanced')
classifier2.fit(X_train, y_train)
pred2 = classifier2.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred2))
print('precision score: ', precision_score(y_test, pred2))
print('recall score: ', recall_score(y_test, pred2))
print('f1 score: ', f1_score(y_test, pred2))

accuracy score:  0.976239669421
precision score:  0.90756302521
recall score:  0.9
f1 score:  0.903765690377
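
The two sets of scores can be compared more directly as confusion-matrix counts. The counts below are not printed by the notebook. They are back-calculated from the reported scores, assuming the test split holds 968 messages (20% of 4837 rows, rounded up), so treat them as an inference and not as recorded output.

```python
def scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and f1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# inferred counts (assumption: 968 test messages, 120 of them spam)
default_counts = dict(tp=80, fp=3, fn=40, tn=845)
balanced_counts = dict(tp=108, fp=11, fn=12, tn=837)

for name, counts in [('default', default_counts), ('balanced', balanced_counts)]:
    acc, prec, rec, f1 = scores(**counts)
    print(name, 'accuracy:', acc, 'precision:', prec, 'recall:', rec, 'f1:', f1)
```

If these inferred counts are right, balancing the class weights caught 28 more spam messages (false negatives fell from 40 to 12) at the cost of 8 more ham messages flagged as spam, which is why recall rose sharply while precision dropped.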
