# Baseline model to predict the first two digits of NAICS

The objective of this notebook is to create a baseline model to predict the first two digits of the NAICS code.

The main idea is to use TF-IDF to vectorize the text data and then try some simple models to predict the first two digits of the NAICS code.


In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
DATA_DIR = os.path.abspath(os.path.join(os.getcwd(), "../", "data"))
NAICS_DATA = os.path.join(DATA_DIR, "processed/coverwallet.xlsx")

In [3]:
df = pd.read_excel(NAICS_DATA)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14182 entries, 0 to 14181
Data columns (total 2 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   NAICS                 14180 non-null  float64
 1   BUSINESS_DESCRIPTION  14177 non-null  object 
dtypes: float64(1), object(1)
memory usage: 221.7+ KB


In [5]:
df.head()

   NAICS     BUSINESS_DESCRIPTION
0  722511.0  Zenyai Viet Cajun & Pho Restaurant is dedicate...
1  541330.0  Kilduff Underground Engineering, Inc. (KUE) is...
2  453998.0  024™ is a premium home fragrance brand that de...
3  561720.0  Our Services include Office Cleaning Carpet cl...
4  621610.0  NYS Licensed Home Health Agency


In [6]:
# null data rows
df[df.isnull().any(axis=1)]

       NAICS     BUSINESS_DESCRIPTION
1248   561311.0  NaN
1492   325992.0  NaN
1989   811310.0  NaN
5196   541430.0  NaN
9149   541990.0  NaN
11535  NaN       JRC GROUP LLC. Is an ASSET BASED freight broke...
13486  NaN       Consulting services for Facility Asset Managem...


In [7]:
df.dropna(inplace=True)

We will use the first two digits of NAICS, so let's create a new column with this information.


In [8]:
df["NAICS_2"] = df["NAICS"].astype(str).str[:2].astype(int)
df.head()

   NAICS     BUSINESS_DESCRIPTION                               NAICS_2
0  722511.0  Zenyai Viet Cajun & Pho Restaurant is dedicate...  72
1  541330.0  Kilduff Underground Engineering, Inc. (KUE) is...  54
2  453998.0  024™ is a premium home fragrance brand that de...  45
3  561720.0  Our Services include Office Cleaning Carpet cl...  56
4  621610.0  NYS Licensed Home Health Agency                    62
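The string slicing used above works because every code is a six-digit number stored as a float. An arithmetic alternative is sketched below. The helper name `naics_sector` is hypothetical, and the sketch assumes the column has no missing values (as is the case after `dropna`) and that all codes have exactly six digits.

```python
import pandas as pd


def naics_sector(codes: pd.Series) -> pd.Series:
    """Hypothetical helper: first two digits of six-digit NAICS codes."""
    # Integer-divide by 10_000 to drop the last four digits (722511 -> 72)
    return codes.astype("int64") // 10_000


sample = pd.Series([722511.0, 541330.0, 453998.0])
print(naics_sector(sample).tolist())

# Same result as the string-based approach used in the notebook
print(sample.astype(str).str[:2].astype(int).tolist())
```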


In [9]:
df["NAICS_2"].value_counts()

NAICS_2
54    4202
23    2976
56    1159
61     751
33     682
42     650
62     517
81     488
51     380
72     333
53     331
71     300
48     270
32     241
45     240
31     184
52     162
44     148
49      45
22      32
92      30
11      26
55      15
21      13
Name: count, dtype: int64

The dataset is heavily imbalanced: the largest class (54) has 4202 samples, while the smallest (21) has only 13. We therefore pass `stratify=y` to `train_test_split` so that the class proportions are approximately the same in the training and test sets.


In [10]:
from sklearn.model_selection import train_test_split

X = df["BUSINESS_DESCRIPTION"]
y = df["NAICS_2"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((11340,), (2835,), (11340,), (2835,))

In [11]:
print(y_train.value_counts())

NAICS_2
54    3361
23    2381
56     927
61     601
33     546
42     520
62     414
81     390
51     304
72     266
53     265
71     240
48     216
32     193
45     192
31     147
52     130
44     118
49      36
22      26
92      24
11      21
55      12
21      10
Name: count, dtype: int64


In [12]:
print(y_test.value_counts())

NAICS_2
54    841
23    595
56    232
61    150
33    136
42    130
62    103
81     98
51     76
72     67
53     66
71     60
48     54
45     48
32     48
31     37
52     32
44     30
49      9
92      6
22      6
11      5
55      3
21      3
Name: count, dtype: int64
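The counts above suggest that stratification worked: class 54, for instance, makes up roughly 30% of both splits (3361 of 11340 and 841 of 2835). A quick way to check this for all classes is to compare normalized value counts side by side. The sketch below uses synthetic labels, not the notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (illustrative only): 80% class "a", 20% class "b"
y = pd.Series(["a"] * 80 + ["b"] * 20)
X = pd.Series(range(len(y)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# One column per split, one row per class, values are class proportions
proportions = pd.DataFrame(
    {
        "full": y.value_counts(normalize=True),
        "train": y_tr.value_counts(normalize=True),
        "test": y_te.value_counts(normalize=True),
    }
)
print(proportions)
```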


## KNN classifier


In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

In [14]:
clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("KNN", KNeighborsClassifier(n_neighbors=5)),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.33      0.17      0.22         6
          23       0.68      0.87      0.76       595
          31       0.36      0.32      0.34        37
          32       0.52      0.35      0.42        48
          33       0.60      0.35      0.44       136
          42       0.20      0.61      0.30       130
          44       0.29      0.07      0.11        30
          45       0.64      0.15      0.24        48
          48       0.46      0.72      0.56        54
          49       1.00      0.33      0.50         9
          51       0.42      0.17      0.24        76
          52       0.67      0.69      0.68        32
          53       0.78      0.70      0.74        66
          54       0.79      0.73      0.76       841
          55       0.00      0.00      0.00         3
          56       0.77    

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Naive Bayes classifier


In [15]:
from sklearn.naive_bayes import MultinomialNB

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("NB", MultinomialNB()),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.64      0.84      0.72       595
          31       0.00      0.00      0.00        37
          32       0.00      0.00      0.00        48
          33       0.82      0.07      0.12       136
          42       0.77      0.08      0.14       130
          44       0.00      0.00      0.00        30
          45       0.00      0.00      0.00        48
          48       0.00      0.00      0.00        54
          49       0.00      0.00      0.00         9
          51       0.00      0.00      0.00        76
          52       0.00      0.00      0.00        32
          53       1.00      0.02      0.03        66
          54       0.42      0.97      0.58       841
          55       0.00      0.00      0.00         3
          56       0.90    



## Random Forest classifier


In [16]:
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("RF", RandomForestClassifier(n_estimators=100)),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.72      0.89      0.80       595
          31       0.50      0.05      0.10        37
          32       0.73      0.17      0.27        48
          33       0.55      0.38      0.45       136
          42       0.47      0.43      0.45       130
          44       0.33      0.03      0.06        30
          45       0.33      0.06      0.11        48
          48       0.40      0.31      0.35        54
          49       1.00      0.11      0.20         9
          51       1.00      0.03      0.05        76
          52       0.80      0.25      0.38        32
          53       0.85      0.44      0.58        66
          54       0.57      0.93      0.71       841
          55       0.00      0.00      0.00         3
          56       0.78    



## First conclusions

The KNN classifier had the best performance, according to the accuracy and F1-score metrics. The Naive Bayes classifier had the worst performance, and the Random Forest classifier had intermediate performance, close to that of the KNN classifier.
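To compare models on a single number rather than reading full reports, accuracy and averaged F1-scores can be computed directly. The sketch below uses made-up labels in which one class is never predicted, as happens with the rarest classes in the reports above. The `zero_division=0` argument (also accepted by `classification_report`) scores such classes as 0 without emitting a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration; class 21 is never predicted
y_true = [54, 54, 23, 23, 21, 56]
y_hat = [54, 54, 23, 54, 54, 56]

acc = accuracy_score(y_true, y_hat)
# Macro F1 weights every class equally, so rare classes count as much as large ones
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)
# Weighted F1 weights each class by its support
weighted_f1 = f1_score(y_true, y_hat, average="weighted", zero_division=0)

print(f"accuracy={acc:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
```

On an imbalanced dataset like this one, macro F1 is usually the stricter of the two averages, because the classes with very few samples pull it down.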

To improve the performance of the models, we can preprocess the text data, for example by removing stopwords and punctuation and by applying lemmatization or stemming.
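As a lightweight illustration of stopword removal, `TfidfVectorizer` can drop English stopwords on its own through its `stop_words` parameter. This is only a sketch on an invented sentence: it does not lemmatize, and it is not the spaCy-based preprocessing used in the next section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentence, for illustration only
text = ["Our services include office cleaning and carpet cleaning."]

plain = TfidfVectorizer().fit(text)
no_stop = TfidfVectorizer(stop_words="english").fit(text)

# The default tokenizer already lowercases the text and drops punctuation
print(sorted(plain.vocabulary_))
# With stop_words="english", words such as "our" and "and" are removed
print(sorted(no_stop.vocabulary_))
```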


## Text preprocessing


Let's preprocess the data and create the models.


In [17]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_trf")


def preprocess(sentence: str):
    doc = nlp(sentence)
    word_list = [word.lemma_ for word in doc if not (word.is_stop or word.is_punct)]
    return " ".join(word_list)

In [None]:
df["PREPROCESSED_DESCRIPTION"] = df["BUSINESS_DESCRIPTION"].apply(preprocess)

We save the preprocessed dataframe so that the slow preprocessing step does not have to be repeated in future runs.


In [None]:
df.to_parquet(os.path.join(DATA_DIR, "processed/coverwallet_preprocessed.parquet"))

In [18]:
df = pd.read_parquet(
    os.path.join(DATA_DIR, "processed/coverwallet_preprocessed.parquet")
)
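The save and reload cells above act as a manual cache. A minimal sketch of the same idea as a reusable function follows. The helper `load_or_build` is hypothetical, and the sketch swaps Parquet for pickle so that it runs without a Parquet engine installed.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path: str, build) -> pd.DataFrame:
    """Hypothetical helper: load a cached DataFrame, or build it and cache it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    frame = build()
    frame.to_pickle(path)
    return frame


calls = []


def expensive_build() -> pd.DataFrame:
    calls.append(1)  # count how often the slow step actually runs
    return pd.DataFrame({"PREPROCESSED_DESCRIPTION": ["office cleaning"]})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "preprocessed.pkl")
    first = load_or_build(cache, expensive_build)   # builds and writes the cache
    second = load_or_build(cache, expensive_build)  # reads the cache

print(len(calls))  # number of times the build function ran
```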

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    df["PREPROCESSED_DESCRIPTION"],
    df["NAICS_2"],
    test_size=0.2,
    random_state=42,
    stratify=df["NAICS_2"],
)

## KNN classifier for preprocessed data


In [20]:
clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("KNN", KNeighborsClassifier()),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.33      0.20      0.25         5
          21       0.00      0.00      0.00         3
          22       0.29      0.33      0.31         6
          23       0.69      0.89      0.78       595
          31       0.34      0.30      0.32        37
          32       0.44      0.33      0.38        48
          33       0.63      0.40      0.49       136
          42       0.21      0.58      0.31       130
          44       0.14      0.03      0.05        30
          45       0.56      0.19      0.28        48
          48       0.52      0.76      0.62        54
          49       0.80      0.44      0.57         9
          51       0.39      0.22      0.28        76
          52       0.64      0.78      0.70        32
          53       0.71      0.74      0.73        66
          54       0.80      0.71      0.75       841
          55       0.00      0.00      0.00         3
          56       0.75    



### Hyperparameter tuning


In [21]:
from sklearn.model_selection import GridSearchCV

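parameters = {
    # Keys follow "<step name>__<parameter>"; the step is named "RF" below,
    # although it holds a KNN classifier.
    "RF__n_neighbors": [10, 15, 20, 25],
}

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("RF", KNeighborsClassifier()),
    ]
)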

grid_search = GridSearchCV(clf, parameters, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)

grid_search.best_params_

Fitting 5 folds for each of 4 candidates, totalling 20 fits


{'RF__n_neighbors': 20}
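The key in `best_params_` follows scikit-learn's `<step name>__<parameter name>` convention for pipelines. The sketch below shows the same mechanics on a tiny invented corpus, with hypothetical step names `tfidf` and `knn`. After fitting, `GridSearchCV` also exposes the best cross-validated score and a pipeline refit on the whole training set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the example runnable
texts = [
    "roof repair", "roof install", "home builder", "house builder",
    "tax advice", "legal advice", "tax audit", "legal audit",
]
labels = [23, 23, 23, 23, 54, 54, 54, 54]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("knn", KNeighborsClassifier())])

# cv=2 because the toy corpus is too small for the default of 5 folds
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3]}, cv=2)
grid.fit(texts, labels)

print(grid.best_params_)     # dict keyed by "knn__n_neighbors"
print(grid.best_score_)      # mean cross-validated accuracy of the best candidate
print(grid.best_estimator_)  # pipeline refit on all of the training data
```

Because the best estimator is refit by default, calling `grid.predict(...)` already uses the selected hyperparameters.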

Let's train the KNN classifier with the preprocessed data and the best hyperparameters found.


In [23]:
clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("KNN", KNeighborsClassifier(n_neighbors=20)),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.71      0.93      0.80       595
          31       0.43      0.27      0.33        37
          32       0.62      0.33      0.43        48
          33       0.67      0.38      0.48       136
          42       0.45      0.51      0.48       130
          44       0.40      0.07      0.11        30
          45       0.64      0.19      0.29        48
          48       0.69      0.69      0.69        54
          49       0.67      0.22      0.33         9
          51       0.56      0.18      0.28        76
          52       0.67      0.69      0.68        32
          53       0.71      0.83      0.76        66
          54       0.73      0.86      0.79       841
          55       0.00      0.00      0.00         3
          56       0.78    



## Other models


Let's see whether other models, combined with some hyperparameter tuning, can improve on these results.


### Logistic Regression


In [24]:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("LR", LogisticRegression()),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.81      0.91      0.86       595
          31       0.67      0.16      0.26        37
          32       0.72      0.27      0.39        48
          33       0.62      0.61      0.61       136
          42       0.54      0.56      0.55       130
          44       1.00      0.03      0.06        30
          45       0.47      0.17      0.25        48
          48       0.79      0.63      0.70        54
          49       1.00      0.22      0.36         9
          51       0.58      0.20      0.29        76
          52       0.95      0.59      0.73        32
          53       0.86      0.64      0.73        66
          54       0.67      0.91      0.77       841
          55       0.00      0.00      0.00         3
          56       0.79    



In [44]:
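# regularization
from sklearn.preprocessing import StandardScaler

# NOTE: the "liblinear" solver supports only the "l1" and "l2" penalties, so the
# three candidates with penalty=None fail on every fold (3 x 5 = 15 failed fits).
parameters = {
    "LR__penalty": ["l1", "l2", None],
    "LR__C": [0.1, 1, 2],
}

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        # with_mean=False keeps the TF-IDF matrix sparse
        ("standard_scaler", StandardScaler(with_mean=False)),
        ("LR", LogisticRegression(solver="liblinear")),
    ]
)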

grid_search = GridSearchCV(clf, parameters, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)

print(classification_report(y_test, y_pred))

Fitting 5 folds for each of 9 candidates, totalling 45 fits


15 fits failed out of a total of 45.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/home/ernesto/.cache/pypoetry/virtualenvs/zrive-ds-ZlIsFOKS-py3.11/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/ernesto/.cache/pypoetry/virtualenvs/zrive-ds-ZlIsFOKS-py3.11/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ernesto/.cache/pypoetry/virtualenvs/zrive-ds-ZlIsFOKS-py3.11/lib/python3.11/site-packages/sklearn/pipelin

              precision    recall  f1-score   support

          11       0.25      0.20      0.22         5
          21       0.00      0.00      0.00         3
          22       0.33      0.17      0.22         6
          23       0.79      0.86      0.83       595
          31       0.42      0.22      0.29        37
          32       0.41      0.33      0.37        48
          33       0.55      0.52      0.54       136
          42       0.45      0.41      0.43       130
          44       0.25      0.17      0.20        30
          45       0.31      0.19      0.23        48
          48       0.73      0.65      0.69        54
          49       0.50      0.22      0.31         9
          51       0.43      0.25      0.32        76
          52       0.73      0.59      0.66        32
          53       0.72      0.62      0.67        66
          54       0.70      0.85      0.77       841
          55       0.00      0.00      0.00         3
          56       0.71    



In [43]:
grid_search.best_params_

{'LR__penalty': 'l2'}

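The grid search selects the L2 penalty, so regularization seems to help the Logistic Regression model on this data. Note that the default `LogisticRegression()` used earlier already applies an L2 penalty with `C=1`.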
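The 15 failed fits reported above come from combining `penalty=None` with the `liblinear` solver, which supports only the `l1` and `l2` penalties. A minimal sketch of a grid restricted to valid combinations is shown below, on synthetic data rather than the notebook's. Setting `error_score="raise"` makes any remaining invalid combination fail loudly instead of being scored as NaN.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# liblinear supports only the "l1" and "l2" penalties, so None is left out
param_grid = {"penalty": ["l1", "l2"], "C": [0.1, 1, 2]}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    error_score="raise",  # fail loudly instead of recording NaN scores
)
search.fit(X, y)

print(search.best_params_)
# True would indicate that at least one candidate failed to fit
print(np.isnan(search.cv_results_["mean_test_score"]).any())
```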


### Support Vector Machine


In [45]:
from sklearn.svm import SVC

parameters = {
    "SVC__C": [0.1, 1, 2],
    "SVC__kernel": ["linear", "poly", "rbf", "sigmoid"],
}

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("SVC", SVC()),
    ]
)

grid_search = GridSearchCV(clf, parameters, n_jobs=-1, verbose=1)

grid_search.fit(X_train, y_train)

y_pred = grid_search.predict(X_test)

print(classification_report(y_test, y_pred))

grid_search.best_params_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
              precision    recall  f1-score   support

          11       0.00      0.00      0.00         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.82      0.89      0.85       595
          31       0.42      0.35      0.38        37
          32       0.62      0.31      0.42        48
          33       0.59      0.58      0.59       136
          42       0.53      0.63      0.58       130
          44       0.33      0.03      0.06        30
          45       0.45      0.29      0.35        48
          48       0.80      0.74      0.77        54
          49       0.67      0.22      0.33         9
          51       0.53      0.26      0.35        76
          52       0.83      0.78      0.81        32
          53       0.76      0.68      0.72        66
          54       0.72      0.88      0.79       841
          55       0



{'SVC__C': 1, 'SVC__kernel': 'linear'}

### Gradient Boosting


In [26]:
from sklearn.ensemble import GradientBoostingClassifier

clf = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("GB", GradientBoostingClassifier()),
    ]
)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          11       0.17      0.20      0.18         5
          21       0.00      0.00      0.00         3
          22       0.00      0.00      0.00         6
          23       0.80      0.83      0.81       595
          31       0.27      0.11      0.15        37
          32       0.31      0.25      0.28        48
          33       0.59      0.49      0.53       136
          42       0.50      0.44      0.47       130
          44       0.19      0.13      0.16        30
          45       0.25      0.17      0.20        48
          48       0.67      0.63      0.65        54
          49       0.29      0.22      0.25         9
          51       0.48      0.21      0.29        76
          52       0.68      0.66      0.67        32
          53       0.75      0.65      0.70        66
          54       0.60      0.82      0.69       841
          55       0.00      0.00      0.00         3
          56       0.75    

## Final considerations

After preprocessing the data, the models performed better, and their results were all very similar.

The SVM model, with a linear kernel and `C=1`, had the best performance in terms of accuracy and F1-score, so it is the one we could use as a baseline model.
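A minimal sketch of such a baseline pipeline is shown below, using the hyperparameters reported by the grid search (`C=1`, `kernel="linear"`). The training texts are invented stand-ins for the preprocessed descriptions, so the prediction only illustrates the mechanics and says nothing about real performance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up training examples, standing in for the preprocessed descriptions
texts = [
    "roof repair contractor",
    "home builder contractor",
    "tax consulting firm",
    "legal consulting firm",
]
labels = [23, 23, 54, 54]

baseline = Pipeline(
    [
        ("vectorizer_tfidf", TfidfVectorizer()),
        ("SVC", SVC(C=1, kernel="linear")),  # values reported by the grid search
    ]
)
baseline.fit(texts, labels)

# Predict the two-digit code for a new, unseen description
print(baseline.predict(["roof contractor"]))
```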
