# Model Optimization

## Feature engineering

Now, we will work on feature engineering to optimize our model.
We can divide our features in two types: Meta and Text based.

### Meta features
These will include all the features that we get from the user or information we can obtain given our understanding of the english language.

1. Post upvotes 
2. Days since request
3. User subs on (count of subreddits) 
4. User activity comments
5. User activity comments raop 
6. User posts reddit 
7. User posts raop 
8. Comment count 
9. User rate start 
10. User rate end 
11. Account age (days)
12. Day of the week
13. Day of the month
14. Month
15. Week of the year
16. Text was edited

### Text based features
These will include all the features that are related to frequency, svd, sentiment.

1. Text length (word count) 
2. Text compound sentiment 
3. Title length (word count)
4. Title sentiment
5. TFIDF
6. SVD

## Let's import libraries and load data

In [3]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Data manipulation
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ML Scikit

# NLP
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import string

## SVD
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

## Decision Trees
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb

# Regression model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

# KNN Model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor

# Feature handling
from sklearn.model_selection import train_test_split

# Reports
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn import metrics

# For producing decision tree diagrams.
from IPython.core.display import Image, display
from six import StringIO

# MNN and DNN
from sklearn.neural_network import MLPClassifier
import tensorflow as tf
from tensorflow import keras
##Helper libraries
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
## Keras and TF
from tensorflow import keras
from tensorflow.keras import layers
from keras import backend as K

#Visualizing Confusion Matrix using Heatmap
import seaborn as sns


In [2]:
# load the data
df = pd.read_csv('../data/interim/logit_sentiments.csv')

In [3]:
df.columns

Index(['request_text', 'request_title', 'post_upvotes', 'text_word_count',
       'text_sentiment', 'title_word_count', 'title_sentiment',
       'days_since_request', 'user_activity_comments',
       'user_activity_comments_raop', 'user_posts_reddit', 'user_posts_raop',
       'user_rate_start', 'user_rate_end', 'day_request', 'day_month_request',
       'month_request', 'week_request', 'requester_received_pizza'],
      dtype='object')

# Split data to evaluate new features

We're using the baseline classifier - Logistic regression

In [197]:
# Data Handling for Model

X, Y = df.iloc[:,:-1], df.iloc[:,-1]

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X.iloc[shuffle], Y.iloc[shuffle]

# Split into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, test_size=0.4, random_state=1)


In [198]:
# Let's see the different tests

print("Train X,y - ", X_train.shape, " ", y_train.shape)
print("Dev X,y   -  ", X_dev.shape, "  ", y_dev.shape)
print("Test X,y  -  ", X_test.shape, "  ", y_test.shape)
print("Total: ", X_train.shape[0] + X_dev.shape[0] +  X_test.shape[0])

Train X,y -  (3030, 18)   (3030,)
Dev X,y   -   (606, 18)    (606,)
Test X,y  -   (404, 18)    (404,)
Total:  4040


# Pre-processing text with TFIDF vectorizer and SVD dimensionality reduction

Let's see first the text we are interested in analyzing.

In [109]:
# Check nan
# pd.isnull(X_train['request_text']).sum()
# X_train['request_text'].count()

# See text messages
X_train.loc[:,'request_text']

3956    Hello RAOP,\n\nfigured I'd give posting here a...
2162    Hi RAOP!\n\nMy two roommates and I are all ful...
2169    I've recently graduated university. I think I ...
2027    So, I didn't find out my wife was leaving me t...
483     (*throwaway acct*)\nI never thought I'd get to...
                              ...                        
1899    First time on Random Acts of Pizza, my wallet ...
3081    My son has recieved his meal, thanks veritas27...
2723    Hey, I'm a broke college student, would really...
2668    I live in Goodyear, AZ.\n\nI can receive pizza...
1992    Well, my rent is paid up until June, and [my c...
Name: request_text, Length: 3030, dtype: object

In [15]:
# Fills nan in text features
# Input: array, array name
# Output: fill array nan with 'the'
def fill_nan_text(texts, feature):
    texts = texts.replace(np.nan, 'the', regex=True)
    return texts

# Improve text tokenization
# Input: string
# Output: string without stop words, and applied
# lemmatization and stemming
def better_preprocessor(s):
    
    s = s.lower() # to lowercase
    
    #Stemming and lemmatization
    ps = PorterStemmer()
    lemmer = WordNetLemmatizer()
    words = word_tokenize(s)
    cleaned_words = [lemmer.lemmatize(ps.stem(word)) for word in words]
    
    #Remove stop words
    stop_words = set(stopwords.words('english'))
    new_cleaned_words = [w for w in cleaned_words if not w.lower() in stop_words]
    s = ''
    for word in new_cleaned_words:
        if word == '.' or word == '?' or word == ',':
            s = s + word
        else:
            s = s + ' ' + word
    
    return s

# Vectorizes the text data into separated tokens as features
# Input: array with text
# Output: print size of vocabulary
def create_text_features_no_processor(train):
    
    # Vectorization with empty preprocessing of text
    vectorizer = CountVectorizer(preprocessor= lambda x: x)
    X_train_counts_raw = vectorizer.fit_transform(train)
    print('Size of train Vocabulary with empty pre-processor is:', X_train_counts_raw.shape)

# Vectorizes the text data into separated tokens as features and applies better processor
# Input: array with text
# Output: print size of vocabulary
def create_text_features_with_processor(train, test):
    
    # Vectorization with better preprocessor
    # declare Vectorizer with counts of words
    tfidf_vectorizer = TfidfVectorizer(
                                       analyzer='word',
                                       token_pattern=r'\w{1,}',
#                                        min_df=3, 
                                       preprocessor = better_preprocessor,
                                       ngram_range=(1,3))

    #vectorizer on train data
    X_train_counts = tfidf_vectorizer.fit_transform(fill_nan_text(train,'request_text'))
    X_test = tfidf_vectorizer.transform(fill_nan_text(test,'request_text'))
    print('---------- \nSize of train Vocabulary with better preprocessor is:', X_train_counts.shape[1])
    
    #feature names
    feature_names = tfidf_vectorizer.get_feature_names()
#     print(len(feature_names))
    return tfidf_vectorizer,X_train_counts, X_test, feature_names

def svd_reduction(X_train_counts,X_test,svd_comps, feature_names):
    
    #SVD for dimensionality reduction
    svd = TruncatedSVD(n_components=svd_comps,n_iter=10,random_state=42)
    normalizer = Normalizer(copy=False) #Normalizing SVD results
    lsa = make_pipeline(svd, normalizer) #making pipeline for svd and normalizer

    ## reduced by SVD
    X_svd = lsa.fit_transform(X_train_counts)
    X_test_svd = lsa.transform(X_test)
    print('---------- \nSize of train Vocabulary with better preprocessor and SVD is:', X_svd.shape[1])
    
    #explained variance by SVD reduced dimensions
    explained_variance = svd.explained_variance_ratio_.sum()
    print("Explained variance of the SVD step: {}%".format(int(explained_variance * 100)))
    
    #extract feature names corresponding to the 1500 SVD components
    best_features = [feature_names[i] for i in svd.components_[0].argsort()[::-1]]
    #print(best_features[:1500])
    
    print(f"Train array: {X_svd.shape}\nTest array: {X_test_svd.shape}")
    
    #Convert to pandas df
#     df_orig = pd.DataFrame(X_train_counts.toarray(), columns=feature_names)
#     df = pd.DataFrame(df_orig, columns=best_features[:1500])
  
    return X_svd, X_test_svd

def clean_array(array):
    # Remove original text variables
    array.pop('request_title')
    array.pop('request_text')
    # X_dev.columns
    return array

# Merge meta and text data for Classification

In [17]:
# Constants

svd_components = 200 # tested with 200, 400, 600, 1000, 1500
cs=1 # tested with 0.01,0.1,1,10,100
depths = [2,3,4]
estimators = [15, 30, 80]
best=[0,0,0,0]

# Data Handling for Model

X, Y = df.iloc[:,:-1], df.iloc[:,-1]

# Shuffle the data, but make sure that the features and accompanying labels stay in sync.
np.random.seed(0)
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X.iloc[shuffle], Y.iloc[shuffle]

# Split into train and test.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=1)
X_dev, X_test, y_dev, y_test = train_test_split(X_test, y_test, test_size=0.4, random_state=1)


# Vectorize with Tfidf + SVD 
# Vectorize with TfIDF + SVD
count_vector, train_counts, test_counts, features = create_text_features_with_processor(
    X_train.loc[:,'request_text'], 
    X_dev.loc[:,'request_text'])

train_svd, test_svd = svd_reduction(train_counts, test_counts,
                                    svd_components,
                                   features)

# Clean text variables: title and request
X_train = clean_array(X_train)
X_dev = clean_array(X_dev)

print(f"Train meta: {X_train.shape} - Train Text SVD: {train_svd.shape}") 
print(f"Test meta: {X_dev.shape} - Test Text SVD: {test_svd.shape}")

X_train_mt = np.concatenate((X_train,train_svd), axis=1)
X_dev_mt = np.concatenate((X_dev,test_svd), axis=1)
print(f"Test meta + text: {X_train_mt.shape} - Test meta + text: {X_dev_mt.shape}")

# fit model with L2 Regularization
print("**************************")
print("Logistic regression")
lr_model = LogisticRegression(C=cs, max_iter= 1000, solver='liblinear',
                             multi_class="auto", penalty ='l2')
lr_model.fit(X_train_mt, y_train)

# predictions
y_pred = lr_model.predict(X_dev_mt)

# results
#     cnf_matrix = metrics.confusion_matrix(y_dev, y_pred)
# print(cnf_matrix)
#     print(classification_report(y_dev, y_pred))
report_log = classification_report(y_dev, y_pred, output_dict=True)
print("Accuracy for Logistic regression model is", round(report_log['accuracy'],2))
print("**************************")
print("Random forest classifier")

for d in depths:
    for e in estimators:
        # XG Boost classifier
        clf = xgb.XGBClassifier(max_depth=d, 
                                n_estimators=e, 
#                                 colsample_bytree=0.8,
#                                 subsample=0.8, 
#                                 nthread=10, 
                                learning_rate=0.1,
                                use_label_encoder=False,
                                eval_metric='mlogloss')
        clf.fit(X_train_mt, y_train)
        predictions = clf.predict_proba(X_dev_mt)
        if clf.score(X_dev_mt, y_dev) > best[0]:
            print (f'Accuracy (xg boost): {clf.score(X_dev_mt, y_dev):.2} - d={d} - e={e}')
            best[0] = clf.score(X_dev_mt, y_dev)
            best[1] = d
            best[2] = e
            best[3] = 'xg_boost'
        # Random Forest
        rfc = RandomForestClassifier(n_estimators=e, max_depth=d)
        rfc.fit(X_train_mt, y_train)
        if clf.score(X_dev_mt, y_dev) > best[0]:
            print (f'Accuracy (RF): {rfc.score(X_dev_mt, y_dev):.2} - d={d} - e={e}')
            best[0] = rfc.score(X_dev_mt, y_dev)
            best[1] = d
            best[2] = e
            best[3] = 'RF'
            
        # Ada Boost
        abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=d)
                                 , n_estimators=e, learning_rate=0.1)
        abc.fit(X_train_mt, y_train)
        if abc.score(X_dev_mt, y_dev) > best[0]:
            print(f'Accuracy (adaboost): {abc.score(X_dev_mt, y_dev):.2} - d={d} - e={e}')
            best[0] = rfc.score(X_dev_mt, y_dev)
            best[1] = d
            best[2] = e
            best[3] = 'ada'

print("**************************")
print(f"The best classifier is {best[3]} with acc={best[0]:.3} - d={best[1]} - e={best[2]}")

---------- 
Size of train Vocabulary with better preprocessor is: 196244
---------- 
Size of train Vocabulary with better preprocessor and SVD is: 200
Explained variance of the SVD step: 13%
Train array: (3030, 200)
Test array: (606, 200)
Train meta: (3030, 16) - Train Text SVD: (3030, 200)
Test meta: (606, 16) - Test Text SVD: (606, 200)
Test meta + text: (3030, 216) - Test meta + text: (606, 216)
**************************
Logistic regression
Accuracy for Logistic regression model is 0.85
**************************
Random forest classifier
Accuracy (xg boost): 0.82 - d=7 - e=100
Accuracy (xg boost): 0.82 - d=7 - e=150
Accuracy (adaboost): 0.83 - d=7 - e=250
Accuracy (xg boost): 0.81 - d=12 - e=100
Accuracy (xg boost): 0.81 - d=12 - e=150
Accuracy (xg boost): 0.82 - d=12 - e=250
The best classifier is xg_ with acc=0.8201320132013201 - d=12 - e=250


# Old code. We can remove in the end

In [112]:
# # Remove original text variables
# X_train.pop('request_title')
# X_train.pop('request_text')
# X_dev.pop('request_title')
# X_dev.pop('request_text')
# # X_dev.columns

# X_train_mt = np.concatenate((X_train,train_svd), axis=1)
# X_dev_mt = np.concatenate((X_dev,test_svd), axis=1)


# print(f"Test meta + text: {X_train_mt.shape} - Test meta + text: {X_dev_mt.shape}")

Test meta + text: (3030, 217) - Test meta + text: (606, 217)


In [116]:
# X_train_mt[1,1:]

In [114]:
# # fit model
# lr_model = LogisticRegression(max_iter= 1000, solver='liblinear')
# lr_model.fit(X_train_mt[:,1:], y_train)

# # predictions
# y_pred = lr_model.predict(X_dev_mt[:,1:])

# # results
# cnf_matrix = metrics.confusion_matrix(y_dev, y_pred)
# # print(cnf_matrix)
# print(classification_report(y_dev, y_pred))
# report_log = classification_report(y_dev, y_pred, output_dict=True)
# print("Accuracy for Logistic regression model is", round(report_log['accuracy'],2))

              precision    recall  f1-score   support

       False       0.87      0.94      0.90       455
        True       0.77      0.57      0.65       151

    accuracy                           0.85       606
   macro avg       0.82      0.76      0.78       606
weighted avg       0.84      0.85      0.84       606

Accuracy for Logistic regression model is 0.85


In [115]:
# # dt = DecisionTreeClassifier(criterion="entropy", splitter="best", random_state=0)
# # dt.fit(X_train_mt[:,1:], y_train)

# # print ('Accuracy (a decision tree):', dt.score(X_dev_mt[:,1:], y_dev))

# rfc = RandomForestClassifier(n_estimators=20, max_depth=4)
# rfc.fit(X_train_mt[:,1:], y_train)

# print ('Accuracy (a random forest):', rfc.score(X_dev_mt[:,1:], y_dev))

# abc = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, learning_rate=0.1)

# abc.fit(X_train_mt[:,1:], y_train)
# print ('Accuracy (adaboost with decision trees):', abc.score(X_dev_mt[:,1:], y_dev))

Accuracy (a random forest): 0.7706270627062707
Accuracy (adaboost with decision trees): 0.8283828382838284
