# **Notebook B.** Classification Model based on Text Embeddings
----

Text embeddings are a way of representing a document (a sentence, paragraph, or, in this case, a patent abstract) as a vector of values. These embeddings (typically a vector of 300 values) have very little meaning on their own, but they provide a useful way of comparing sentences or documents, as those with similar meanings tend to have similar embeddings.
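
As a rough illustration of what "similar embeddings" means, a common choice is cosine similarity between two vectors. The sketch below uses made-up 3-dimensional vectors (the names and values are hypothetical, not taken from any real model) rather than the 300-dimension vectors used later in this notebook.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" invented for illustration only
neural_network = [0.9, 0.1, 0.3]
deep_learning = [0.8, 0.2, 0.4]
door_hinge = [0.1, 0.9, 0.0]

sim_related = cosine_similarity(neural_network, deep_learning)
sim_unrelated = cosine_similarity(neural_network, door_hinge)
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these toy values, the two related phrases score much closer to 1 than the unrelated pair.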

Embeddings are typically produced by algorithms trained on very large datasets (e.g. GloVe: Global Vectors for Word Representation). There are many different algorithms, and new ones appear regularly. However, the general intuition is that these algorithms predict words based on their neighbours (surrounding words). This allows us to capture context and account for: 
- Words that are the same but have different meanings (i.e. Homonyms)
- Words that are different but have similar meanings (i.e. Synonyms)
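
To make the "predict words from their neighbours" intuition concrete, here is a minimal sketch of how (focal word, neighbour) training pairs could be built from a sentence, in the spirit of skip-gram style models. The function name `context_pairs` and the window size are assumptions chosen for illustration; real implementations add subsampling, negative sampling, and other details.

```python
def context_pairs(tokens, window=2):
    """Return (focal_word, neighbour) pairs for neighbours within `window` positions."""
    pairs = []
    for i, focal in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((focal, tokens[j]))
    return pairs

tokens = "a neural network classifies patent abstracts".split()
pairs = context_pairs(tokens, window=1)
for focal, neighbour in pairs:
    print(focal, "->", neighbour)
```

Words that keep appearing with the same neighbours end up with similar vectors, which is how synonyms come to sit close together.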

Some "classical" papers that introduce the use of these embeddings are available here: 
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546. **LINK:** https://arxiv.org/pdf/1310.4546.pdf
- Le, Q., & Mikolov, T. (2014, June). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196). PMLR. **LINK:** http://proceedings.mlr.press/v32/le14.pdf
- Pennington, J., Socher, R., & Manning, C. D. (2014, October). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). **LINK:** https://www.aclweb.org/anthology/D14-1162.pdf
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., ... & Stoyanov, V. (2019). Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. **LINK:** https://arxiv.org/pdf/1907.11692v1.pdf

There are also numerous online guides which explain the intuition more extensively: https://en.wikipedia.org/wiki/Word_embedding 

**Aside:** Scholars may be more familiar with LDA (topic modeling) methods. The final product (e.g. a 300-dimension vector representing a text document) is similar. However, the underlying intuition behind this approach is different. LDA often relies on a bag-of-words approach (as shown in Notebook A), which does not take context into account. Embeddings are a textual representation based on the surrounding context (the words accompanying a focal word) and therefore account for nuances in the way language is used (e.g. synonyms, acronyms, etc.). 
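
A small sketch of what "does not take context into account" means for a bag-of-words representation: two sentences with the same words in a different order produce identical counts. The helper `bag_of_words` and the example sentences are hypothetical and only meant to illustrate the point.

```python
from collections import Counter

def bag_of_words(text):
    """Count word occurrences, ignoring word order entirely."""
    return Counter(text.lower().split())

sentence_a = "the model controls the robot"
sentence_b = "the robot controls the model"

same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)
print(same_bag)  # True: the two sentences are indistinguishable as bags of words
```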




**In this notebook**, we will use pre-trained embedding models to transform our text descriptions into 300-dimension vectors. We will use the embeddings built into the *Spacy* package, as they provide a very easy-to-use tool. However, researchers may want to use different embeddings depending on their application. 

Here is our workflow for this notebook: 

- Step 1) Load the Data 
- Step 2) Convert Text to Embeddings
- Step 3) Perform Classification using Different Models
- Step 4) Compare Model Outputs

# B.1. Load Packages 
---

In [None]:
# General Packages #
import os
import pandas as pd
import numpy as np

# Load TQDM to Show Progress Bars #
from tqdm import tqdm
from tqdm.notebook import tqdm as tqdm_notebook

# Sklearn Packages #
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer, TfidfVectorizer
from sklearn.metrics import mean_squared_error, r2_score, classification_report

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold, cross_val_predict
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix

# Import SkLearn Classifiers #
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.linear_model import Lasso, LassoCV, SGDClassifier, LinearRegression, LogisticRegression, RidgeCV, RidgeClassifierCV, HuberRegressor
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC

# Text Libraries #
from spacy.lang.en import STOP_WORDS
STOP_WORDS = list(STOP_WORDS)

In [None]:
# Turn off warnings, just to avoid pesky messages that might cause confusion here
# Remove when testing your own code #
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Mount Personal Google Drive on own Machine -- You have to follow the link to log in #
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# B.2. Load Training Data ##
---------------- 

In [None]:
# Change to Working Directory with Training Data # 
os.chdir("/content/drive/MyDrive/USPTO AI Patent Classification/")

# Load Training Data #
TrainingData = pd.read_csv("./Training Data/4K Patents - AI 20p.csv")

# Store Data in Lists for Text Classification #
IDs = np.array(TrainingData['app number'].values.tolist())
Abstract_Text = TrainingData['abstract'].values.tolist()
Classes = TrainingData['actual'].values.tolist()

# B.3. Convert Text Data to Numerical Vector Representation 
----

We are going to use *Spacy* (https://spacy.io/), a powerful natural-language-processing tool. 

One of the advantages of this tool is that it has built-in functionality that will make it easier to implement this process of converting the documents from text to embeddings. 

*Spacy* has four basic pipeline models (for English):

*   en_core_web_sm
*   en_core_web_md
*   en_core_web_lg
*   en_core_web_trf

The sm (small), md (medium) and lg (large) models are all designed to run on a CPU (regular computer). The trf model is transformer-based (RoBERTa) and runs best on a GPU. 

The larger files have embeddings for a larger number of words, which makes them more complete. However, these files also require more memory and take longer to run, so there is a tradeoff. The trf model is the most advanced but also potentially the most cumbersome to run. 

Notice that here we are just using embeddings to transform our text into a vector representation. There are other ways to use embeddings as shown in some of the other notebooks. 

You can access the list of models here: https://spacy.io/models
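
For pipelines that ship static word vectors (such as md and lg), *Spacy*'s `Doc.vector` defaults to the average of the token vectors. The sketch below reproduces that averaging on a made-up three-word vocabulary; the `toy_vectors` values and the helper `mean_vector` are hypothetical and stand in for the real 300-dimension vectors.

```python
def mean_vector(token_vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# Invented 3-dimensional word vectors, for illustration only
toy_vectors = {
    "neural":  [1.0, 0.0, 2.0],
    "network": [3.0, 2.0, 0.0],
    "model":   [2.0, 4.0, 1.0],
}

tokens = "neural network model".split()
doc_vector = mean_vector([toy_vectors[t] for t in tokens])
print(doc_vector)  # [2.0, 2.0, 1.0]
```

One consequence of averaging is that every abstract, whatever its length, maps to a vector of the same size, which is what the classifiers below require.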







In [None]:
# Load SPACY (Very Powerful NLP Library)
import spacy

In [None]:
# In case it's not already downloaded, then run this code to download it # 
import spacy.cli
spacy.cli.download("en_core_web_lg")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [None]:
# Load NLP Model from Spacy #
nlp = spacy.load('en_core_web_lg')
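
Once the model is loaded, a list of texts can also be converted in batches with `nlp.pipe`, which is usually faster than calling `nlp(doc)` one document at a time. The helper below is a sketch under that assumption; `texts_to_matrix` and its `batch_size` default are names chosen here for illustration, not part of *Spacy*.

```python
import numpy as np

def texts_to_matrix(nlp, texts, batch_size=64):
    """Stack one document vector per text into an (n_docs, n_dims) array.

    `nlp` is expected to be a loaded spaCy pipeline (anything exposing `.pipe`
    that yields objects with a `.vector` attribute will work).
    """
    return np.vstack([doc.vector for doc in nlp.pipe(texts, batch_size=batch_size)])

# Hypothetical usage with the objects defined in this notebook:
# Abstract_Vectors = texts_to_matrix(nlp, Abstract_Text)
```

The result has the same shape as the `np.concatenate` approach used in this notebook, so the rest of the workflow would not need to change.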

In [None]:
# Convert Each Abstract into Vector Representation #
Abstract_Vectors = np.concatenate([nlp(doc).vector.reshape(1,-1) for doc in tqdm_notebook(Abstract_Text, desc = 'Convert Description to Vector')])





# B.4. Perform Classification  #

--- 

We will be exploring the classification performance of different models based on the text documents that we have transformed using this embedding representation. 


In [None]:
# Define the Set of Classifiers and their parameters # 

CLASSIFIERS = [
               ["Nearest Neighbors", KNeighborsClassifier(10)],
               ["Support Vector Classifier (RBF)", SVC()],
               #["Naive Bayes", MultinomialNB()],  -- # Omitted: MultinomialNB requires non-negative features, and embeddings contain negative values
               ["Logistic Regression", LogisticRegression()],
               ["Ridge Regression", RidgeClassifierCV()],
               ["Random Forest", RandomForestClassifier(n_estimators= 1000)],
               ["Decision Tree with Boosting", AdaBoostClassifier()],
               ["Decision Tree with Bagging", BaggingClassifier()],
               ["Multi Layer Perceptron", MLPClassifier(hidden_layer_sizes = (100,))],
               ["Gradient Boosting Classifier", GradientBoostingClassifier()],
               ['LASSO', LassoCV()],
               ['Linear Regression', LinearRegression()],
               ['Robust Regression', HuberRegressor()]
               ]
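
Note that this list mixes classifiers, whose `predict` method returns class labels, with regressors (`LassoCV`, `LinearRegression`, `HuberRegressor`), whose `predict` method returns continuous scores. The evaluation loop in this notebook handles both by thresholding every prediction at 0.5. A minimal sketch of that step, with a hypothetical helper name `to_binary`:

```python
def to_binary(predictions, threshold=0.5):
    """Map continuous scores (or existing 0/1 labels) to 0/1 using a cutoff."""
    return [1 if p > threshold else 0 for p in predictions]

scores = [0.1, 0.5, 0.73, 1.0, 0]
labels = to_binary(scores)
print(labels)  # [0, 0, 1, 1, 0]
```

Labels that are already 0 or 1 pass through unchanged, and a score of exactly 0.5 is treated as 0.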


In [None]:
# Number of Folds (Splits) for Cross Validation #
NUM_OF_SPLITS = 5

# Define whether you want to manually reweight the sample by oversampling the smaller class 
Reweight = True

# Define arrays in which to store classification outputs # 
RESULTS = []
Classified_Values =[]

# Loop Through Different Classifiers #
for CL in tqdm_notebook(CLASSIFIERS, desc = "Evaluating Classifiers"):

    # Extract Classifier Names & Model #
    name  = CL[0]
    Model = CL[1]

    # Define Arrays to store Actual, Predicted and Ids variables (Because we are shuffling them in next step) # 
    y_actual = []
    y_predicted = []
    id_s = []

    # Loop through K Folds and Repeat Cross Validation #
    
    KFoldSplitter = StratifiedKFold(n_splits = NUM_OF_SPLITS, shuffle = True, random_state = 1)
    
    for train_i, test_i in tqdm_notebook(KFoldSplitter.split(Abstract_Vectors, Classes), 
                                         desc = 'Cross-Validating',
                                         leave = False,
                                         total = NUM_OF_SPLITS):

        # Select Rows in Data Based on Indexes [train_i, test_i]
        Y = np.array(Classes)

        train_X, test_X = Abstract_Vectors[train_i], Abstract_Vectors[test_i]
        train_y, test_y = Y[train_i], Y[test_i]
        Train_IDs, Test_IDs = IDs[train_i], IDs[test_i]

        # Rebuild the training data so the classes are approximately balanced 50/50 (only if Reweight is True) #
        temp_y = list(train_y)
        temp_X = list(train_X)

        if Reweight:
            # Repeat up to three times. This is arbitrary, but be cautious about repeating it more often.
            # The current sample needs only about 1.5 passes to balance.
            for j in range(3):
                # Loop through each observation and duplicate positive cases until the balance is met
                for i in range(len(train_y)):
                    if (train_y[i] != 0) and (np.mean(temp_y) < 0.5):
                        temp_y.append(temp_y[i])
                        temp_X.append(temp_X[i])

        # Train Model #
        Results = Model.fit(temp_X, temp_y)

        # Perform Prediction on Holdout Sample # 
        y_pred = Model.predict( list(test_X))

        # Convert Continuous Predicted Values to 0/1 values # 
        y_pred = [1 if y > 0.5 else 0 for y in y_pred]

        # Add to List with Final Results # 
        y_actual = y_actual + list(test_y)
        y_predicted = y_predicted + y_pred
        id_s = id_s + list(Test_IDs)

 
    # ---------------------------------------------------------- #
    # This runs only after all of the folds have been classified # 
    # ---------------------------------------------------------- #

    # Compute the Share of AI Patents #
    Share = np.round(np.mean(y_predicted), 3)

    # Calculate Model Performance Metrics #
    Accuracy = accuracy_score(y_actual, y_predicted)
    ROC = roc_auc_score(y_actual, y_predicted)
    Precision = precision_score(y_actual, y_predicted)
    Recall = recall_score(y_actual, y_predicted)
    F1 = f1_score(y_actual, y_predicted)
    CM = confusion_matrix(y_actual, y_predicted)

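    # Confusion-matrix shares, rounded to 3 decimal places #
    # sklearn's layout is CM = [[TN, FP], [FN, TP]] (rows = actual, columns = predicted) #
    # Alternative: shares over the actual class (row totals) #
    #TN = np.round(CM[0][0]/CM[0].sum(), 3)
    #FP = np.round(CM[0][1]/CM[0].sum(), 3)
    #FN = np.round(CM[1][0]/CM[1].sum(), 3)
    #TP = np.round(CM[1][1]/CM[1].sum(), 3)

    # Used here: shares over the predicted class (column totals), so TP equals Precision #
    TN = np.round(CM[0][0]/(CM[0][0] + CM[1][0]), 3)
    FP = np.round(CM[0][1]/(CM[0][1] + CM[1][1]), 3)
    FN = np.round(CM[1][0]/(CM[0][0] + CM[1][0]), 3)
    TP = np.round(CM[1][1]/(CM[0][1] + CM[1][1]), 3)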

    # Add Classification Performance Metrics to List#
    RESULTS.append([name, Share, TP, FN, FP, TN,
                                          np.round(Accuracy, 3),
                                          np.round(ROC, 3),
                                          np.round(Precision, 3),
                                          np.round(Recall, 3),
                                          np.round(F1, 3)])

    # Add Classification Results to List # 
    Classified_Values.append(list(zip(len(id_s)*[name],id_s, y_actual, y_predicted)))
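
The `Reweight` step in the loop above balances each training fold by duplicating positive (AI) examples until they make up about half of the training sample; the held-out fold is left untouched. The function below is a standalone sketch of the same idea on plain lists. The name `oversample_positives` and its parameters are assumptions made for illustration, and it tracks the positive count directly instead of recomputing the mean at every step.

```python
def oversample_positives(X, y, target_share=0.5, max_passes=3):
    """Duplicate positive rows until they reach `target_share` of the sample,
    scanning the original rows at most `max_passes` times."""
    X_out, y_out = list(X), list(y)
    n_pos = sum(1 for label in y_out if label != 0)
    for _ in range(max_passes):
        for features, label in zip(X, y):
            # Only positives are duplicated, and only while they are under-represented
            if label != 0 and n_pos / len(y_out) < target_share:
                X_out.append(features)
                y_out.append(label)
                n_pos += 1
    return X_out, y_out

# Toy data: one positive out of five observations
X = [[0.1], [0.2], [0.3], [0.4], [0.5]]
y = [1, 0, 0, 0, 0]
X_bal, y_bal = oversample_positives(X, y)
print(len(y_bal), sum(y_bal))  # 8 4
```

Because the cap on passes is arbitrary, a very rare positive class may still end up below the target share.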






# B.5. Output Classification Results # 
----


In [None]:
# Convert List of Model Performance Metrics to Dataframe #
RESULTS_TABLE = pd.DataFrame(RESULTS, columns = ["Name", "Share", "True-Positives", 
                                                 "False-Negatives", "False-Positives", 
                                                 "True-Negatives","Accuracy", "AUC", 
                                                 "Precision", "Recall", "F1"] )
RESULTS_TABLE["Type"] = "Embedding Vectors"
RESULTS_TABLE = RESULTS_TABLE[["Name", "Type", "Share", "True-Positives", 
                               "False-Negatives", "False-Positives", 
                               "True-Negatives","Accuracy", "AUC", 
                               "Precision", "Recall", "F1"]]

# Output Results #
RESULTS_TABLE.sort_values("Accuracy", ascending = False ).to_csv("./Output/Model Performance/Embedding Model Classification Performance.csv")

# Display Results -- Out of Sample (Holdout) prediction -- Sorted by Accuracy #
RESULTS_TABLE.sort_values("Accuracy", ascending = False )

|    | Name | Type | Share | True-Positives | False-Negatives | False-Positives | True-Negatives | Accuracy | AUC | Precision | Recall | F1 |
|----|------|------|-------|----------------|-----------------|-----------------|----------------|----------|-----|-----------|--------|----|
| 7  | Multi Layer Perceptron | Embedding Vectors | 0.213 | 0.761 | 0.049 | 0.239 | 0.951 | 0.911 | 0.873 | 0.761 | 0.809 | 0.784 |
| 4  | Random Forest | Embedding Vectors | 0.168 | 0.815 | 0.076 | 0.185 | 0.924 | 0.906 | 0.823 | 0.815 | 0.685 | 0.744 |
| 8  | Gradient Boosting Classifier | Embedding Vectors | 0.24 | 0.699 | 0.043 | 0.301 | 0.957 | 0.895 | 0.873 | 0.699 | 0.837 | 0.762 |
| 2  | Logistic Regression | Embedding Vectors | 0.271 | 0.67 | 0.026 | 0.33 | 0.974 | 0.892 | 0.897 | 0.67 | 0.906 | 0.771 |
| 3  | Ridge Regression | Embedding Vectors | 0.269 | 0.668 | 0.029 | 0.332 | 0.971 | 0.89 | 0.892 | 0.668 | 0.895 | 0.765 |
| 9  | LASSO | Embedding Vectors | 0.268 | 0.668 | 0.03 | 0.332 | 0.97 | 0.89 | 0.89 | 0.668 | 0.892 | 0.764 |
| 10 | Linear Regression | Embedding Vectors | 0.266 | 0.666 | 0.032 | 0.334 | 0.968 | 0.888 | 0.885 | 0.666 | 0.882 | 0.759 |
| 1  | Support Vector Classifier (RBF) | Embedding Vectors | 0.28 | 0.656 | 0.023 | 0.344 | 0.977 | 0.887 | 0.898 | 0.656 | 0.916 | 0.765 |
| 6  | Decision Tree with Bagging | Embedding Vectors | 0.166 | 0.75 | 0.091 | 0.25 | 0.909 | 0.882 | 0.784 | 0.75 | 0.62 | 0.678 |
| 11 | Robust Regression | Embedding Vectors | 0.285 | 0.631 | 0.029 | 0.369 | 0.971 | 0.874 | 0.882 | 0.631 | 0.897 | 0.74 |
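
For readers who want to see how the Accuracy, Precision, Recall and F1 columns relate to one another, the sketch below derives them from raw confusion counts on a tiny made-up sample. The function `binary_metrics` is a hypothetical helper written for illustration; the notebook itself uses the scikit-learn metric functions. It also shows why the "True-Positives" column matches "Precision" in the table: both are the share of predicted positives that are actually positive.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion counts and common metrics for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that are found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / len(y_true)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels invented for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy sample the accuracy is 0.75, while precision, recall and F1 are all 2/3, which illustrates how accuracy can look better than the other metrics when negatives dominate.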


In [None]:
# Output Classification Results for Training Dataset -- PREDICTED VALUES -- Out Of Sample (Holdout) Prediction # 

for i in range(0,len(Classified_Values), 1):

  Temp = pd.DataFrame(  Classified_Values[i],
                        columns = ['Model', 'id', 'Actual', 'Predicted'] )
  
  if i == 0: 
    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Actual', 'Predicted']]
    Temp.columns = ['id', 'Actual', name]
    Final = Temp

  else: 

    name = Temp.head(1)['Model'][0]
    Temp = Temp[['id', 'Predicted']]
    Temp.columns = ['id', name]

    Final = Final.merge(Temp, on = ['id'])

# Save Data Frame # 
Final.to_csv("./Output/Classification Output/Embedding Classification Results.csv")