# Project 3: Advanced Running Retargeting using NLP

---

## Part 3: Cleaning/Pre-Processing

This section covers the NLP pre-processing done prior to modeling. Post length and post word count were added as engineered features. Sentiment analysis was then performed on the post column, recording negative, positive, neutral, and compound scores for each post. A pre-cleaning function was then written. The steps are as follows:
1. A regex tokenizer removes special characters from the post column
2. Each post is lowercased and tokenized
3. Each token is tagged with its part of speech
4. Tokens are lemmatized, rejoined into a string, and returned to the dataframe

The data was then split using train_test_split. TfidfVectorizer was used with BayesSearchCV to vectorize the text features to be run through the model, and the TfidfVectorizer hyperparameters were tuned. The transformed X_train and X_test were saved as dataframes together with the numerical columns, which will be scaled in the next step.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import re

# Pre-Processing Imports
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import RegexpTokenizer, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import wordnet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.pipeline import Pipeline 
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# plot_confusion_matrix is not used in this part and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import confusion_matrix, precision_score, recall_score, balanced_accuracy_score

In [2]:
#Read in cleaned csv
runners=pd.read_csv('../data/clean_runners.csv')
runners.head()

Unnamed: 0,author,post,is_advanced
0,Brojadyn2006,"Further college running Hello, so I was wonder...",1
1,Tea-reps,Race Report: Big breakthrough at the Boston Ha...,1
2,Caffeinated262,Garden of Life Palm Beaches Marathon I have th...,1
3,blueheeler9,2022 BAA Half Marathon | Wet &amp; Glorious 1:...,1
4,zzach_519,2022 Berkeley Half race report ### Race Inform...,1


#### Feature-engineered post_length and post_word_count

In [3]:
#feature engineering 

#new column for post length 
runners['post_length']=[len(i) for i in runners['post']]

#new column for word count
runners['post_word_count']=[len(i.split(' ')) for i in runners['post']]

runners.head()

Unnamed: 0,author,post,is_advanced,post_length,post_word_count
0,Brojadyn2006,"Further college running Hello, so I was wonder...",1,511,95
1,Tea-reps,Race Report: Big breakthrough at the Boston Ha...,1,8090,1406
2,Caffeinated262,Garden of Life Palm Beaches Marathon I have th...,1,440,82
3,blueheeler9,2022 BAA Half Marathon | Wet &amp; Glorious 1:...,1,7512,1379
4,zzach_519,2022 Berkeley Half race report ### Race Inform...,1,6934,1277
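
One caveat on the word count above: `split(' ')` breaks only on single space characters, so doubled spaces add empty strings to the count and newlines do not separate words. A minimal sketch on two made-up strings (not taken from the dataset) shows how that differs from `split()` with no argument:

```python
# Hypothetical sample strings, not taken from the runners data
sample_spaces = "Race  report: ran a 1:26 half"    # note the double space
sample_newline = "Race report:\nran a 1:26 half"   # note the newline

# split(' ') splits on every single space character
spaces_single = len(sample_spaces.split(' '))      # the double space yields an empty string
newline_single = len(sample_newline.split(' '))    # "report:\nran" stays one piece

# split() with no argument splits on any run of whitespace
spaces_any = len(sample_spaces.split())
newline_any = len(sample_newline.split())

print(spaces_single, spaces_any)
print(newline_single, newline_any)
```

Whether this matters depends on how much irregular whitespace the posts contain; for long markdown-formatted race reports the two counts may differ noticeably.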


#### Sentiment Analysis 

In [4]:
#Sentiment Analysis 
sia=SentimentIntensityAnalyzer()

for row in runners[['post']].iterrows():
    idx, vals = row 
    sentiments= sia.polarity_scores(vals['post'])
    runners.loc[idx, 'neg']=sentiments['neg']
    runners.loc[idx, 'pos']=sentiments['pos']
    runners.loc[idx, 'neu']=sentiments['neu']
    runners.loc[idx, 'compound']=sentiments['compound']
    
runners.head()

Unnamed: 0,author,post,is_advanced,post_length,post_word_count,neg,pos,neu,compound
0,Brojadyn2006,"Further college running Hello, so I was wonder...",1,511,95,0.028,0.077,0.895,0.6801
1,Tea-reps,Race Report: Big breakthrough at the Boston Ha...,1,8090,1406,0.044,0.144,0.813,0.9993
2,Caffeinated262,Garden of Life Palm Beaches Marathon I have th...,1,440,82,0.0,0.147,0.853,0.9078
3,blueheeler9,2022 BAA Half Marathon | Wet &amp; Glorious 1:...,1,7512,1379,0.046,0.119,0.835,0.9987
4,zzach_519,2022 Berkeley Half race report ### Race Inform...,1,6934,1277,0.041,0.106,0.853,0.9979
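
The compound scores for the long race reports above sit very close to 1. In the VADER implementation behind `SentimentIntensityAnalyzer`, the compound score is the summed word valence squashed into [-1, 1] by x / sqrt(x^2 + alpha) with alpha = 15, so a long post with many mildly positive words tends to saturate. The sketch below reproduces only that normalization step; the raw sums fed in are made-up values, not computed from the posts.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into (-1, 1), VADER-style."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Hypothetical raw valence sums for a short, a medium, and a long positive post
for raw_sum in (1.0, 5.0, 100.0):
    print(raw_sum, round(normalize(raw_sum), 4))
```

Because of this saturation, `compound` may partly track post length rather than tone alone, which is worth keeping in mind when it is modeled next to `post_length`.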


#### Function to clean data before vectorizing

In [110]:
#This function was partially found on stack overflow, and modified by me to fit my pre-cleaning needs 

#This function does all of the cleaning needed before vectorizing

#instantiate lemmatizer 
lemmatizer = WordNetLemmatizer()

#helper used in nlp_clean: maps an nltk (Penn Treebank) pos tag to the matching wordnet pos constant
def nltk2wn_tag(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:                    
        return None

#function to do all cleaning before vectorizing 
def nlp_clean(sentence):
    token = RegexpTokenizer(r"[\w']+") #keep runs of word characters and apostrophes, dropping other special characters
    token_tagged = token.tokenize(sentence.lower()) #lowercase, then tokenize
    pos_tagged=nltk.pos_tag(token_tagged) #part of speech tagging
    nltk_tagged = map(lambda x: (x[0], nltk2wn_tag(x[1])), pos_tagged) #replace nltk pos tag with the wordnet tag from nltk2wn_tag above
    
    words_lst = []
    for word, tag in nltk_tagged:
        if tag is None:                        
            words_lst.append(word)
        else:
            words_lst.append(lemmatizer.lemmatize(word, tag)) #lemmatize words with wanted pos tag
    return " ".join(words_lst) #join lemmatized words back to string to use in vectorizer 
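
To see what the tokenizing and tag-mapping steps do without loading NLTK, the sketch below applies the same pattern with `re.findall` (which is how `RegexpTokenizer` applies a pattern when `gaps` is left at its default) and maps tags to the one-letter codes that `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, and `wordnet.ADV` stand for. The sample string is the visible start of one post from the table above. The helper name `to_wordnet_pos` and the `html.unescape` step are assumptions added for illustration and are not part of the original function.

```python
import html
import re

# Visible start of one post from the table above (the rest is truncated in the output)
raw = "2022 BAA Half Marathon | Wet &amp; Glorious 1:"

# Same pattern as the tokenizer in nlp_clean, applied with re.findall
tokens = re.findall(r"[\w']+", raw.lower())

# Optional extra step (not in the original function): decode HTML entities first
tokens_unescaped = re.findall(r"[\w']+", html.unescape(raw).lower())

def to_wordnet_pos(penn_tag):
    """Hypothetical stand-in for nltk2wn_tag using WordNet's one-letter POS codes."""
    first_letter_to_pos = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}
    return first_letter_to_pos.get(penn_tag[:1])  # None for any other tag

print(tokens)
print(tokens_unescaped)
print([to_wordnet_pos(tag) for tag in ('JJ', 'VBD', 'NNS', 'RB', 'IN')])
```

The first token list contains a stray `amp` left over from the `&amp;` entity, which also shows up in the cleaned column below. Unescaping before tokenizing is one possible way to avoid it.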

In [6]:
#applying function on 'post' column 
runners['tok_pos_lemma_post']=runners['post'].apply(nlp_clean)
runners.head()

Unnamed: 0,author,post,is_advanced,post_length,post_word_count,neg,pos,neu,compound,tok_pos_lemma_post
0,Brojadyn2006,"Further college running Hello, so I was wonder...",1,511,95,0.028,0.077,0.895,0.6801,further college run hello so i be wonder for s...
1,Tea-reps,Race Report: Big breakthrough at the Boston Ha...,1,8090,1406,0.044,0.144,0.813,0.9993,race report big breakthrough at the boston hal...
2,Caffeinated262,Garden of Life Palm Beaches Marathon I have th...,1,440,82,0.0,0.147,0.853,0.9078,garden of life palm beach marathon i have the ...
3,blueheeler9,2022 BAA Half Marathon | Wet &amp; Glorious 1:...,1,7512,1379,0.046,0.119,0.835,0.9987,2022 baa half marathon wet amp glorious 1 26 o...
4,zzach_519,2022 Berkeley Half race report ### Race Inform...,1,6934,1277,0.041,0.106,0.853,0.9979,2022 berkeley half race report race informatio...


#### Train Test Split 

In [96]:
#Saving df for eda
runners.to_csv('sent_lemma.csv', index=False)

In [14]:
#define X and y 
X=runners[['tok_pos_lemma_post', 'post_length', 'post_word_count', 'neg', 'pos', 'neu', 'compound']]
y=runners['is_advanced']

#train test split 
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
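
Passing `stratify=y` asks for the class balance of `is_advanced` to be the same in the train and test sets. The sketch below is a simplified illustration of that idea on made-up labels, not scikit-learn's implementation; the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within each class
        n_test = round(len(indices) * test_size)    # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced labels: 70 advanced (1) and 30 non-advanced (0) posts
labels = [1] * 70 + [0] * 30
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

With equal shares in both sets, the accuracy, precision, and recall measured on the test set are computed against the same class mix the model was trained on.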

In [35]:
#making a pipeline for tfidf vectorizer 

#instantiating tfidf vectorizer
tvec=TfidfVectorizer(stop_words='english', ngram_range=(1,3))

#creating a column transformer that vectorizes the text column and passes the numerical columns through unchanged
ct=make_column_transformer(
    (tvec, 'tok_pos_lemma_post'),
    remainder='passthrough'
)

#creating a pipeline 
transformer_pipe= Pipeline([
    ('transformer', ct),
    ('bnb',BernoulliNB())
])

#Transformer pipe params (search space for the tfidf vectorizer)
transformer_pipe_params = {
    'transformer__tfidfvectorizer__max_features': Integer(1,15000), 
    'transformer__tfidfvectorizer__min_df': Integer(1,1000),     
    'transformer__tfidfvectorizer__max_df': Real(0.50,0.99),
}

#Instantiate BayesSearchCV
bs = BayesSearchCV(
    estimator = transformer_pipe,
    search_spaces=transformer_pipe_params,
    n_iter=50, 
    verbose=1,
    cv=5,
    n_jobs=-1
)
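
In the search space above, `min_df` is given as an integer, which `TfidfVectorizer` treats as an absolute number of documents, while `max_df` is a float, which it treats as a proportion of documents. The sketch below illustrates that document-frequency filtering on a toy corpus. It is a simplified unigram-only version written for illustration, not scikit-learn's code, and `filter_vocab` is a hypothetical name.

```python
from collections import Counter

def filter_vocab(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency is >= min_df (a count)
    and <= max_df (a proportion of all documents)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= max_count)

# Hypothetical toy corpus of already-cleaned posts
docs = [
    "run marathon tempo",
    "run marathon easy",
    "run tempo shoe",
    "run shoe blister",
]

# "run" appears in every document and is dropped by max_df;
# "easy" and "blister" appear once and are dropped by min_df
print(filter_vocab(docs, min_df=2, max_df=0.5))
```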

In [22]:
bs.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits

BayesSearchCV(cv=5,
              estimator=Pipeline(steps=[('transformer',
                                         ColumnTransformer(remainder='passthrough',
                                                           transformers=[('tfidfvectorizer',
                                                                          TfidfVectorizer(ngram_range=(1,
                                                                                                       3),
                                                                                          stop_words='english'),
                                                                          'tok_pos_lemma_post')])),
                                        ('bnb', BernoulliNB())]),
              n_jobs=-1,
              search_spaces={'transformer__tfidfvectorizer__max_df': Real(low=0.5, high=0.99, prior='uniform', transform='normalize'),
                             'transformer__tfidfvectorizer__max_features': Integer(low=1, high=

In [25]:
#Best params 
bs.best_params_

OrderedDict([('transformer__tfidfvectorizer__max_df', 0.5),
             ('transformer__tfidfvectorizer__max_features', 15000),
             ('transformer__tfidfvectorizer__min_df', 1000)])
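
All three selected values coincide with an edge of the search space defined earlier, which can be a sign that the best setting lies outside the ranges searched. A small check, using only the bounds and results shown above, makes this explicit:

```python
# Search ranges and selected values copied from the cells above
search_bounds = {
    'transformer__tfidfvectorizer__max_features': (1, 15000),
    'transformer__tfidfvectorizer__min_df': (1, 1000),
    'transformer__tfidfvectorizer__max_df': (0.50, 0.99),
}
best_params = {
    'transformer__tfidfvectorizer__max_df': 0.5,
    'transformer__tfidfvectorizer__max_features': 15000,
    'transformer__tfidfvectorizer__min_df': 1000,
}

# A parameter is "on the boundary" if it equals the low or high end of its range
on_boundary = {name: value in search_bounds[name] for name, value in best_params.items()}

for name, hit in on_boundary.items():
    print(f'{name}: on boundary = {hit}')
```

Widening the ranges and rerunning the search would be one way to test whether the result changes; that was not done here.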

In [24]:
#predicted values 
preds = bs.predict(X_test)

#best scores
print(f'The accuracy training score is    {bs.score(X_train, y_train)}')
print(f'The accuracy testing score is     {bs.score(X_test, y_test)}')
print(f'The bac score is                  {balanced_accuracy_score(y_test, preds)}')
print(f'The precision is                  {precision_score(y_test, preds)}')
print(f'The recall is                     {recall_score(y_test, preds)}')

The accuracy training score is    0.7553652021737128
The accuracy testing score is     0.7587476979742173
The bac score is                  0.7901597622486698
The precision is                  0.9307756463719766
The recall is                     0.6611374407582938
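
For a binary target, balanced accuracy is the mean of recall (sensitivity) and specificity, so the specificity that was not printed can be recovered from the two reported numbers:

```python
# Values copied from the output above
balanced_accuracy = 0.7901597622486698
recall = 0.6611374407582938

# balanced_accuracy = (recall + specificity) / 2, solved for specificity
specificity = 2 * balanced_accuracy - recall

print(f'The specificity is {specificity:.4f}')
```

Taken together with the precision of about 0.93, this suggests the model rarely labels a non-advanced post as advanced but misses roughly a third of the advanced posts.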


#### Transformed features put back together with numerical columns 

In [26]:
#Tvec with optimal params, create df for modeling 
tv=TfidfVectorizer(stop_words='english', max_df=0.5, max_features= 15000, min_df=1000, ngram_range=(1,3))

#get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2; get_feature_names_out() is its replacement
X_train_transformed=pd.DataFrame(tv.fit_transform(X_train['tok_pos_lemma_post']).todense(), columns=tv.get_feature_names_out(), index=X_train.index)
X_test_transformed=pd.DataFrame(tv.transform(X_test['tok_pos_lemma_post']).todense(), columns=tv.get_feature_names_out(), index=X_test.index)

In [107]:
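#Creating tvec vocab into a dataframe for eda 
#note: vocabulary_ maps each term to its column index in the tf-idf matrix, so the 'count' column below holds indices, not frequencies
vocab=tv.vocabulary_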
tvec_words_df=pd.DataFrame(vocab.items(), columns=['word', 'count'])
tvec_words_df=tvec_words_df.set_index('word')

#save as csv for eda
tvec_words_df.to_csv('tvec_words.csv')

In [31]:
#Concatenating additional columns back in 
X_train_vec=pd.concat([X_train.drop(columns='tok_pos_lemma_post'), X_train_transformed], axis=1)
X_test_vec=pd.concat([X_test.drop(columns='tok_pos_lemma_post'), X_test_transformed], axis=1)

In [91]:
#Saving dataframes for modeling 
X_train_vec.to_csv('X_train_vec.csv', index=False)
X_test_vec.to_csv('X_test_vec.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)

---
#### Next Section: Part 4: EDA