## Imports

* `os` for listing directories of resume docs.
* `textract` as a tool for reading `.doc` and `.docx` files.
* `numpy` for dealing with numpy arrays.
* `pandas` for making DataFrames.
* `sklearn` for vectorizing text, applying models, analyzing results.
* `sklearn-pandas` for a very useful DataFrameMapper tool.

In [1]:
import os
import textract
import numpy as np
import pandas as pd
import sklearn_pandas as skpd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn import metrics

## Read the docs

Assume we have 4 labels:

* `web_dev` - for web developers
* `dev` - for other developers
* `embed` - for embedded developers
* `neither` - for resumes where none of the above apply

Let's construct a data set:

**Note**
The `./doc` folder is just full of copies of my resume, so our classification isn't going to be useful. You'll need to obtain a diverse set of resumes. The more, the better.

In [2]:
y_labels = ('web_dev', 'dev', 'embed', 'neither')

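def process_docs(path, label):
    """Extract text from every .doc/.docx file in `path` and tag it with `label`."""
    data = []
    for doc in os.listdir(path):
        # splitext copes with names that have no dot or several dots
        ext = os.path.splitext(doc)[1].lower()
        if not ext.startswith('.doc'):
            continue
        try:
            text = textract.process(os.path.join(path, doc))
            # textract returns bytes; str(text) would keep the b'...' wrapper in the text
            data.append({'label': label, 'text': text.decode('utf-8', errors='replace')})
        except textract.exceptions.ShellError as e:
            print(f'{doc} is too small', e)
    return data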


data = []
for label in y_labels:
    data.extend(process_docs(f'./doc/{label}', label))
df = pd.DataFrame(data)
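
The extension check inside `process_docs` can be pulled out and tried on its own. The following is a minimal sketch using only the standard library; `is_word_doc` is a hypothetical helper name that is not part of the notebook, and it assumes only `.doc` and `.docx` files are wanted.

```python
import os


def is_word_doc(filename):
    """Return True when the final extension is .doc or .docx (case-insensitive)."""
    # Hypothetical helper, for illustration only.
    ext = os.path.splitext(filename)[1].lower()
    return ext in ('.doc', '.docx')


# Made-up file names covering the awkward cases: no dot, several dots, upper case.
names = ['resume.docx', 'john.smith.doc', 'README', 'notes.txt', 'CV.DOCX']
kept = [name for name in names if is_word_doc(name)]
print(kept)
```

`os.path.splitext` only looks at the last extension, so a name without a dot yields an empty extension instead of raising an error.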

## Training & Testing

Here we split our data set into a training set and a test set, so the model can be evaluated on documents it was not trained on.

In [3]:
train, test = train_test_split(df, test_size=0.4)

# Labels must be identical in train and test set. Increase test_size if they are not.
assert set(train['label']) == set(test['label']) == set(y_labels)
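
One way to make that assertion less dependent on luck is a stratified split, which keeps the label proportions the same on both sides. This is a sketch on made-up stand-in data, not the notebook's resumes; the `random_state` value is an arbitrary choice for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the resume DataFrame: 5 made-up documents per label.
y_labels = ('web_dev', 'dev', 'embed', 'neither')
df = pd.DataFrame(
    [{'label': label, 'text': f'{label} resume {i}'}
     for label in y_labels for i in range(5)]
)

# stratify keeps each label's share equal in train and test;
# random_state makes the shuffle reproducible.
train, test = train_test_split(
    df, test_size=0.4, stratify=df['label'], random_state=0
)

print(train['label'].value_counts().to_dict())
print(test['label'].value_counts().to_dict())
```

Stratification needs at least two documents per label, so very small classes still have to be topped up by hand.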

## Vectorize text and boost important terms

Here we use a `DataFrameMapper`, a class that maps pandas DataFrame columns to different sklearn transformations, to vectorize our text and boost the importance of certain terms.

In [4]:
def boost_terms(X, y=None):
    terms = ['html', 'css', 'javascript']
    rows = X.shape[0]
    results = np.zeros((rows, 1))
    for i in range(rows):
        value = X[i, 0].lower()
        if any(term in value for term in terms):
            results[i] = 1  # Increase feature weight
    return results


tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=1000)
boost = FunctionTransformer(boost_terms, validate=False)

preprocess = skpd.DataFrameMapper([
    ('text', tfidf),
    (['text'], boost),
])
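
To see what the boost column looks like, `boost_terms` can be called directly on a small array. The sketch below assumes the mapper hands the function a 2-D array with one text column, which is how `['text']` (a list selector) is generally described as behaving; the sample sentences are made up.

```python
import numpy as np


def boost_terms(X, y=None):
    """Return an (n_rows, 1) array: 1 where the text mentions a web term, else 0."""
    terms = ['html', 'css', 'javascript']
    results = np.zeros((X.shape[0], 1))
    for i in range(X.shape[0]):
        if any(term in X[i, 0].lower() for term in terms):
            results[i] = 1
    return results


# Made-up sample rows shaped like a single-column selection: (n_rows, 1).
X = np.array([
    ['Built sites with HTML and CSS'],
    ['Wrote firmware in C for ARM boards'],
    ['JavaScript and React'],
], dtype=object)

flags = boost_terms(X)
print(flags.ravel())  # prints [1. 0. 1.]
```

The flag is just one extra 0/1 feature next to the TF-IDF columns; how much it actually "boosts" a prediction depends on the weight the classifier learns for it.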

## Apply logistic regression

Since we have more than two labels, we're using multinomial classification in our logistic regression:

In [5]:
lr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='multinomial', class_weight='balanced')
steps = (
    ('preprocess', preprocess), 
    ('lr', lr),
)
pipe = Pipeline(steps=steps)
pipe.fit(train, train['label'])
print('Done.')

Done.
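
For readers without `sklearn-pandas` or a resume folder, the same idea can be tried on plain strings. This is a simplified sketch, not the notebook's pipeline: it drops the `DataFrameMapper` and the boost column, uses made-up one-line "resumes", and leaves out `multi_class` because newer scikit-learn releases deprecate that argument and the `lbfgs` solver already fits a multinomial model when there are more than two classes.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training texts, two per label.
texts = [
    'html css javascript react front end',
    'responsive web pages with css and javascript',
    'python java backend services and databases',
    'desktop applications in java and python',
    'firmware in c for microcontrollers',
    'embedded linux drivers and rtos',
    'retail sales and customer service',
    'accounting and payroll administration',
]
labels = ['web_dev', 'web_dev', 'dev', 'dev',
          'embed', 'embed', 'neither', 'neither']

toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=1000, class_weight='balanced')),
])
toy_pipe.fit(texts, labels)

# One probability per class, in the order given by toy_pipe.classes_.
proba = toy_pipe.predict_proba(['css and javascript for web pages'])
print(dict(zip(toy_pipe.classes_.tolist(), proba[0].round(2).tolist())))
```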


## Let's try it out

You could read a new doc into memory and classify it. I'm just going to use this string literal here:

In [6]:
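trial = pd.DataFrame([{'text': 'I love python and machine learning', 'label': ''}])
# predict_proba columns follow pipe.classes_ (labels in sorted order), not the order of y_labels
pd.DataFrame(pipe.predict_proba(trial), columns=pipe.classes_)

|   | dev | embed | neither | web_dev |
|---|-----|-------|---------|---------|
| 0 | 0.25 | 0.25 | 0.25 | 0.25 |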
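
When labelling probability columns, it helps to know that scikit-learn classifiers store their classes in sorted order, which is not the order of `y_labels`. A small sketch on made-up one-hot features shows the difference:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

y_labels = ('web_dev', 'dev', 'embed', 'neither')

# Made-up features: one one-hot row per label, just enough to fit a model.
X = np.eye(4)
clf = LogisticRegression(max_iter=1000).fit(X, list(y_labels))

# classes_ is sorted, so it differs from the order labels were listed in.
print(clf.classes_.tolist())

# Column j of predict_proba belongs to clf.classes_[j].
proba = clf.predict_proba(X[:1])
print(dict(zip(clf.classes_.tolist(), proba[0].round(2).tolist())))
```

Using `classes_` (available on a fitted `Pipeline` as well) for the column names keeps each probability attached to the right label.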


In [7]:
print('Prediction: ', pipe.predict(trial)[0])

Prediction:  dev


## Conclusion

As you can see, every document in my data set is a copy of the same resume, so the classifier has nothing to tell the labels apart and its output is pretty useless. If you collect a varied set of resumes and load them through this process, the model should be able to classify a resume as belonging to a web dev, dev, embedded dev, or another profession.

In [8]:
y_true, y_pred = test['label'], pipe.predict(test)
print(metrics.classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         dev       0.38      1.00      0.55         6
       embed       0.00      0.00      0.00         4
     neither       0.00      0.00      0.00         4
     web_dev       0.00      0.00      0.00         2

    accuracy                           0.38        16
   macro avg       0.09      0.25      0.14        16
weighted avg       0.14      0.38      0.20        16
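
The numbers above are consistent with a model that predicted `dev` for all 16 test documents (precision 6/16 ≈ 0.38 and recall 1.00 for `dev`, zeros elsewhere). The sketch below rebuilds that situation under this assumption; the label lists are reconstructed from the support column, not taken from the real test set.

```python
from sklearn import metrics

labels = ['dev', 'embed', 'neither', 'web_dev']

# Reconstructed from the support column, assuming every prediction was 'dev'.
y_true = ['dev'] * 6 + ['embed'] * 4 + ['neither'] * 4 + ['web_dev'] * 2
y_pred = ['dev'] * 16

# Rows are true labels, columns are predicted labels, both in `labels` order.
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# zero_division=0 states explicitly that precision is reported as 0
# for labels that were never predicted.
report = metrics.classification_report(y_true, y_pred, zero_division=0)
print(report)
```

A confusion matrix with a single non-zero column is a quick sign that the model has collapsed onto one class.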



