# Baseline Model: Logistic Regression

Import Data

## Preprocessing

In [None]:
import string

# translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation(text):
    '''a function for removing punctuation'''
    return text.translate(PUNCT_TABLE)

train_data['comment_text'] = train_data['comment_text'].apply(remove_punctuation)
valid_data['comment_text'] = valid_data['comment_text'].apply(remove_punctuation)

In [None]:
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

sw = set(stopwords.words('english'))  # a set makes membership tests fast

def removesw(text):
    '''a function for lowercasing the text and removing stop words'''
    # keep only the lowercased words that are not stop words
    words = [word.lower() for word in text.split() if word.lower() not in sw]
    # join the remaining words with a space separator
    return " ".join(words)

train_data['comment_text'] = train_data['comment_text'].apply(removesw)
valid_data['comment_text'] = valid_data['comment_text'].apply(removesw)

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stemming(text):
    '''a function which stems each word in the given text'''
    words = [stemmer.stem(word) for word in text.split()]
    return " ".join(words)

train_data['comment_text'] = train_data['comment_text'].apply(stemming)
valid_data['comment_text'] = valid_data['comment_text'].apply(stemming)

Logistic Regression is [well suited](https://blog.insightdatascience.com/always-start-with-a-stupid-model-no-exceptions-3a22314b9aaa) as a baseline model for classification and natural language processing. Baseline models take less time to construct since their architecture is relatively simple. They are easier and faster to train, so you can iterate on them quickly. This speed also helps to surface bugs that point to data issues.
Hence, they give you information to build on in a short time. Baseline models function as benchmarks for more complex models. By studying the shortcomings and struggles of our baseline model, we can decide which more complex model to try next. The baseline model therefore also provides methodological orientation.

**Vectorizer**

In [None]:
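from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer(max_df=0.7, stop_words='english')  # max_df=0.7 drops terms found in over 70% of comments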

**Regression Model: predicting 'toxic'**

In [None]:
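from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X = train_data['comment_text']
y = train_data['toxic']

# hold out 33% of the labelled comments for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# fit the vocabulary on the training split only, then reuse it for the test split
X_train_vec = tfidf_vec.fit_transform(X_train)
X_test_vec = tfidf_vec.transform(X_test)

log_toxic = LogisticRegression()
log_toxic.fit(X_train_vec, y_train)

predictions = log_toxic.predict(X_test_vec)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))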

Our baseline model (Logistic Regression) reaches about 95% accuracy on the held-out test split.
However, since we have imbalanced data, accuracy alone says little about the minority class, so we want to improve on the precision and recall values.
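To illustrate why accuracy can look good on imbalanced data while recall stays low, here is a minimal sketch that derives the three metrics from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not results from this notebook, and `classification_metrics` is a helper name invented for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # guard against division by zero when nothing is predicted/labelled positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts: 1000 comments, of which 100 are toxic
acc, prec, rec = classification_metrics(tp=60, fp=10, fn=40, tn=890)
print(f"model:       accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")

# A trivial classifier that never predicts "toxic" on the same data
acc0, prec0, rec0 = classification_metrics(tp=0, fp=0, fn=100, tn=900)
print(f"always-safe: accuracy={acc0:.3f} precision={prec0:.3f} recall={rec0:.3f}")
```

In this toy setting the trivial classifier still scores 90% accuracy while finding no toxic comment at all, which is why precision and recall are the more informative numbers here.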

In [None]:
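import pandas as pd
import seaborn as sns
from sklearn import metrics

# named conf_mat so it does not shadow sklearn's confusion_matrix function
conf_mat = pd.crosstab(y_test, predictions, rownames=['Actual'], colnames=['Predicted'])
sns.heatmap(conf_mat, annot=True)

print('Accuracy: ', metrics.accuracy_score(y_test, predictions))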

In [None]:
import numpy as np
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}  # l1 lasso, l2 ridge
# liblinear supports both penalties; the newer default solver (lbfgs) does not accept l1
logreg = LogisticRegression(solver="liblinear")
logreg_cv = GridSearchCV(logreg, grid, cv=10)
logreg_cv.fit(X_train_vec, y_train)

print("tuned hyperparameters :(best parameters) ", logreg_cv.best_params_)
print("accuracy :", logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 10.0, 'penalty': 'l2'}
accuracy : 0.950686690973072

**Resampling imbalanced data**

In [None]:
from imblearn.over_sampling import SMOTE

In [None]:
sm = SMOTE(random_state=2)

# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_train_res, y_train_res = sm.fit_resample(X_train_vec, y_train)
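For intuition about what the resampling step does, the following is a simplified sketch of the core SMOTE idea: a synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. It is not imbalanced-learn's implementation; the helper `smote_like_sample` and the toy 2-D points are assumptions made for illustration, whereas the real call above operates on the sparse TF-IDF vectors.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Interpolate between a random minority point and one of its
    k nearest minority neighbours; returns (base, neighbour, synthetic)."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    # all other minority points, sorted by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    synthetic = [a + gap * (b - a) for a, b in zip(base, neighbour)]
    return base, neighbour, synthetic

# Hypothetical minority-class points (toy data, not from the notebook)
minority_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.2], [3.0, 3.5]]
base, neighbour, synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(0))
print("base:", base, "neighbour:", neighbour, "synthetic:", synthetic)
```

Because every synthetic point is a blend of two existing minority samples, oversampling adds no new information; it only rebalances the class proportions that the classifier sees during training.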

**Execute GridSearch for Model Optimization**

In [None]:
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}  # l1 lasso, l2 ridge
# liblinear supports both penalties; the newer default solver (lbfgs) does not accept l1
logreg = LogisticRegression(solver="liblinear")
logreg_cv = GridSearchCV(logreg, grid, cv=10)
logreg_cv.fit(X_train_res, y_train_res)

print("tuned hyperparameters :(best parameters) ", logreg_cv.best_params_)
print("accuracy :", logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 1000.0, 'penalty': 'l2'}
accuracy : 0.9467282127031019