# <font color ='pickle'>**Final Pipeline: Data Preprocessing + Manual Features + ML Model pipeline**

The validation scores of the pipelines evaluated so far are:

- Pipeline 1: Data Preprocessing + Sparse Embeddings (TF-IDF) + ML Model = 0.928
- Pipeline 2: Data Preprocessing + Manual Features + ML Model pipeline = 0.972
- Pipeline 3: Manual Features combined with TF-IDF vectors = 0.975

I will use Pipeline 2 as the final pipeline. Pipeline 3 scores only 0.003 higher on validation (0.975 vs. 0.972), and Pipeline 2 is simpler because it uses the manual features alone rather than combining them with TF-IDF vectors.
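
The selection rule can be written down explicitly: among the pipelines whose validation score is within a small tolerance of the best score, prefer the simplest one. The sketch below is only an illustration of that reasoning. The tolerance of 0.005 and the `feature_blocks` complexity measure are assumptions made for this example and were not part of the original experiments.

```python
# Validation scores reported above; the block counts and tolerance are assumptions
pipelines = {
    "Pipeline 1": {"score": 0.928, "feature_blocks": 1},  # TF-IDF only
    "Pipeline 2": {"score": 0.972, "feature_blocks": 1},  # manual features only
    "Pipeline 3": {"score": 0.975, "feature_blocks": 2},  # manual features + TF-IDF
}
TOLERANCE = 0.005  # assumed threshold for "almost the same" score

# Keep every pipeline whose score is within the tolerance of the best score
best_score = max(p["score"] for p in pipelines.values())
candidates = [
    name for name, p in pipelines.items()
    if best_score - p["score"] <= TOLERANCE
]

# Among those candidates, pick the one with the fewest feature blocks
chosen = min(candidates, key=lambda name: pipelines[name]["feature_blocks"])
print(chosen)
```

With these assumed settings, Pipeline 1 is excluded by its score and Pipeline 3 by its extra feature block.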

## <font color = 'pickle'>**Install/Import Libraries**

In [None]:
# Import necessary libraries
import pandas as pd
from pathlib import Path

# Import the joblib library for saving and loading models
import joblib

# Import scikit-learn classes for building models
# from sklearn.linear_model import LogisticRegression
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin, BaseEstimator
##
from xgboost import XGBClassifier
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
##

# Import the scipy library for working with sparse matrices
# from scipy.sparse import csr_matrix


## <font color = 'indian red'>**Specify Base folder for Project**

In [None]:
# Check if the code is running in a Colab environment
import sys
if 'google.colab' in str(get_ipython()):# If the code is running in Colab

    # mount google drive
    from google.colab import drive
    drive.mount('/content/drive', force_remount= True)

    !pip install -U nltk -qq
    !pip install -U spacy -qq
    !python -m spacy download en_core_web_sm -qq
    !pip install -U pyspellchecker -qq

    # set the base path to a Google Drive folder
    basepath = '/content/drive/MyDrive/BUAN 6342.501 - Applied Natural Language Processing (Harpreet Singh)/Assignments/'
    sys.path.append('/content/drive/MyDrive/BUAN 6342.501 - Applied Natural Language Processing (Harpreet Singh)/Assignments/custom_functions')

Mounted at /content/drive

In [None]:
# Convert the base path to a Path object
base_folder = Path(basepath)

# Define the data folder path
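data_folder = base_folder/'HW2'
custom_functions = base_folder/'custom_functions'
# Saved models live inside the HW2 data folder
model_folder = data_folder/'models'
# parents=True also creates HW2 if it does not exist yet
model_folder.mkdir(parents=True, exist_ok=True)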

In [None]:
import custom_preprocessor_mod as cp
from featurizer import ManualFeatures
from plot_learning_curve import plot_learning_curve

## <font color ='pickle'>**Load test_subset (40% of raw data) dataset**

We load the test subset (40% of the raw data) that was created at the beginning of the previous notebook.

In [None]:
test_subset = pd.read_csv(data_folder/'test_subset.csv')

In [None]:
test_subset.head()

| | message | label |
|---|---|---|
| 0 | Funny fact Nobody teaches volcanoes 2 erupt, t... | 0 |
| 1 | I sent my scores to sophas and i had to do sec... | 0 |
| 2 | We know someone who you know that fancies you.... | 1 |
| 3 | Only if you promise your getting out as SOON a... | 0 |
| 4 | Congratulations ur awarded either å£500 of CD ... | 1 |


In [None]:
test_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2229 entries, 0 to 2228
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   message  2229 non-null   object
 1   label    2229 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 35.0+ KB


In [None]:
X_test = test_subset['message'].values
y_test = test_subset['label'].values

### <font color ='pickle'>**Load Saved Model**

In [None]:
file_best_estimator_pipeline2_round1 = model_folder / \
    'pipeline2_round1_best_estimator.pkl'
file_complete_grid_pipeline2_round1 = model_folder / \
    'pipeline2_round1_complete_grid.pkl'


In [None]:
# load the saved model
best_estimator_pipeline2_round1 = joblib.load(
    file_best_estimator_pipeline2_round1)
complete_grid_pipeline2_round1 = joblib.load(
    file_complete_grid_pipeline2_round1)
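
For readers unfamiliar with `joblib`, the save and load steps form a simple round trip. The sketch below is a minimal, self-contained illustration on toy data. The `LogisticRegression` model, the toy arrays, and the file name `toy_model.pkl` are assumptions for this example and are not the estimator saved in the previous notebook.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data (illustrative only, not the SMS dataset)
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_model.pkl"  # hypothetical file name
    joblib.dump(toy_model, toy_path)            # save the fitted estimator
    restored_model = joblib.load(toy_path)      # load it back from disk

# The restored estimator should reproduce the original predictions
same_predictions = (
    list(restored_model.predict(X_toy)) == list(toy_model.predict(X_toy))
)
print(same_predictions)
```

Pickled estimators are generally only safe to load with the same library versions that were used to save them, so it helps to keep the training and evaluation environments aligned.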

### <font color ='pickle'>**Evaluate model on test dataset**

In [None]:
featurizer = ManualFeatures(
    spacy_model='en_core_web_sm',
    spam_features=True,
    count_features=True,
    pos_features=False,
    ner_features=False,
)

In [None]:
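# Final pipeline: raw text -> manual features -> saved model -> predictions
def final_pipeline(text):
    """Return predicted labels for an iterable of raw messages."""
    # ManualFeatures returns the feature matrix and the feature names;
    # only the matrix is needed for prediction
    features, _ = featurizer.fit_transform(text)
    # Load the best estimator saved during training
    best_estimator = joblib.load(file_best_estimator_pipeline2_round1)
    return best_estimator.predict(features)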

In [None]:
# predicted values for Test data set
y_test_pred = final_pipeline(X_test)
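
Calling `fit_transform` on test data is only safe when the featurizer learns nothing from the data it sees. The sketch below illustrates that idea with a hypothetical stand-in, `SimpleCountFeatures`, whose `fit` is a no-op. It is not the `ManualFeatures` class from `featurizer.py`, and the three features are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleCountFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for ManualFeatures: stateless count features."""

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit is a no-op
        return self

    def transform(self, X):
        rows = []
        for text in X:
            rows.append([
                len(text),                         # number of characters
                len(text.split()),                 # number of whitespace-separated words
                sum(ch.isdigit() for ch in text),  # number of digit characters
            ])
        return np.array(rows)


messages = ["Win 500 now", "See you at 8"]
features = SimpleCountFeatures().fit_transform(messages)
print(features)
```

A transformer that does learn state during `fit` (a TF-IDF vocabulary, for example) should be fitted on the training data only and applied to test data with `transform`.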

### <font color ='pickle'>**Classification report for test dataset**

In [None]:
print('\nTest set classification report:\n\n',
      classification_report(y_test, y_test_pred))



Test set classification report:

               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1930
           1       0.93      0.91      0.92       299

    accuracy                           0.98      2229
   macro avg       0.96      0.95      0.95      2229
weighted avg       0.98      0.98      0.98      2229



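* The model performs well on the test data: the weighted-average F1 score is 0.98. Because spam makes up only 299 of the 2,229 test messages, the spam-class scores are the more informative ones: precision 0.93, recall 0.91, and F1 0.92.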
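
To see where the weighted average comes from, it can be recomputed from the per-class rows of the report: each class F1 score is weighted by its support. The sketch below uses the rounded values printed above, so the result is an approximation of what `classification_report` computes from the unrounded scores.

```python
# Per-class F1 and support copied from the classification report above
report = {
    0: {"f1": 0.99, "support": 1930},  # ham
    1: {"f1": 0.92, "support": 299},   # spam
}

total_support = sum(row["support"] for row in report.values())

# Weighted average: each class F1 contributes in proportion to its support
weighted_f1 = sum(
    row["f1"] * row["support"] for row in report.values()
) / total_support

# Share of spam messages in the test set, which shows the class imbalance
spam_share = report[1]["support"] / total_support

print(round(weighted_f1, 2), round(spam_share, 3))
```

Since the majority class dominates the weighted average, the spam row is the better indicator of how well the minority class is detected.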