## Imports

* `os` for listing directories of resume docs
* `textract` for reading `.doc` and `.docx` files
* `pandas` for making DataFrames
* `sklearn` for vectorizing text, applying models, and analyzing results

In [1]:
import os
import textract
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn import metrics

## Read the docs

Assume we have 4 labels, each with its own folder under `./doc`:

* web_dev - for web developers
* dev - for other developers
* embed - for embedded developers
* neither - for resumes where none of the above apply

Let's construct a data set:

**Note**
The `./doc` folder is just full of copies of my resume, so our classification isn't going to be useful. You'll need to obtain a diverse set of resumes. The more, the better.

In [2]:
y_labels = ('web_dev', 'dev', 'embed', 'neither')

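def process_docs(path, label):
    """Extract text from every .doc/.docx file in `path` and tag it with `label`."""
    data = []
    for doc in os.listdir(path):
        # Check only the final extension, so names with no dot or several dots are handled.
        ext = os.path.splitext(doc)[1].lower()
        if not ext.startswith('.doc'):
            continue
        try:
            # textract returns UTF-8 encoded bytes by default; decode to str.
            text = textract.process(os.path.join(path, doc)).decode('utf-8')
            data.append({'label': label, 'text': text})
        except textract.exceptions.ShellError as e:
            # Raised when the underlying extraction command fails.
            print(f'Could not extract text from {doc} (the file may be too small):', e)
    return data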


data = []
for label in y_labels:
    data.extend(process_docs(f'./doc/{label}', label))
df = pd.DataFrame(data)
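
Before training, it can help to confirm how many documents were actually loaded for each label, since a label with few or no documents will not be learned well. A minimal sketch, assuming hypothetical records in the same `label`/`text` shape the loader produces (the names `records`, `sample_df`, and `label_counts` are illustrative and not part of the notebook):

```python
import pandas as pd

# Hypothetical records in the same shape the loader produces.
records = [
    {'label': 'web_dev', 'text': 'html css javascript'},
    {'label': 'web_dev', 'text': 'react node'},
    {'label': 'dev', 'text': 'java spring'},
    {'label': 'embed', 'text': 'c firmware rtos'},
]
sample_df = pd.DataFrame(records)

# Count documents per label; a label with no documents does not appear at all.
label_counts = sample_df['label'].value_counts()
print(label_counts)
```

On the real data the same check would be `df['label'].value_counts()`.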

## Training & Testing

Here we split the data set into a training set and a held-out test set, so the model fitted on the training set can be checked against documents it has not seen.

In [3]:
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
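
With only a handful of resumes per label, a purely random split can leave a label under-represented in one of the two sets. A possible refinement, sketched here on hypothetical toy data, is to pass `stratify` so both sets keep the same label proportions, and `random_state` so the split is repeatable. The variable names below are illustrative only:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents per label.
labels = ['web_dev', 'dev', 'embed', 'neither']
texts = [f'{label} resume {i}' for label in labels for i in range(10)]
targets = [label for label in labels for _ in range(10)]

# stratify keeps the label proportions equal in both splits;
# random_state makes the shuffle repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, targets, test_size=0.2, stratify=targets, random_state=0
)

print(Counter(y_tr))
print(Counter(y_te))
```

Stratifying requires at least two documents per label, so it only becomes usable once the data set has grown a little.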

## Vectorize text and apply logistic regression

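Since each resume gets exactly one of several labels (a multi-class problem), we use multinomial logistic regression. Setting `class_weight='balanced'` compensates for labels that have fewer documents: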

In [4]:
vec = TfidfVectorizer(ngram_range=(1, 1), max_features=1000)
lr = LogisticRegression(solver='lbfgs', max_iter=10000, multi_class='multinomial', class_weight='balanced')
steps = (
    ('text', vec), 
    ('lr', lr)
)
pipe = Pipeline(steps=steps)
pipe.fit(X_train, y_train)
print('Done.')

Done.
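
The held-out test set and the `metrics` import have not been used yet. One way to score a fitted pipeline, sketched on a hypothetical toy corpus rather than real resumes (all names and texts below are made up for illustration):

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for real resumes.
train_texts = [
    'html css javascript react',
    'javascript node express html',
    'c firmware microcontroller rtos',
    'embedded c uart spi drivers',
    'java spring backend services',
    'python backend services sql',
]
train_labels = ['web_dev', 'web_dev', 'embed', 'embed', 'dev', 'dev']
test_texts = ['react html css', 'firmware uart drivers', 'java sql backend']
test_labels = ['web_dev', 'embed', 'dev']

toy_pipe = Pipeline(steps=[
    ('text', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(train_texts, train_labels)

# Compare predictions on unseen texts with their true labels.
predicted = toy_pipe.predict(test_texts)
accuracy = metrics.accuracy_score(test_labels, predicted)
report = metrics.classification_report(test_labels, predicted, zero_division=0)
# Rows are true labels and columns are predicted labels, both in classes_ order.
cm = metrics.confusion_matrix(test_labels, predicted, labels=toy_pipe.classes_)

print(accuracy)
print(report)
print(cm)
```

For the pipeline in this notebook, the equivalent calls would use `pipe.predict(X_test)` and `y_test`.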


## Let's try it out

You could read a new doc into memory and attempt to classify it. Here I'm just going to use a string literal:

In [5]:
trial = pd.DataFrame([{'text': 'I love python and machine learning'}])
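# predict_proba orders its columns by pipe.classes_ (sorted labels), not by y_labels.
pd.DataFrame(pipe.predict_proba(trial['text']), columns=pipe.classes_)

| | dev | embed | neither | web_dev |
|---|---|---|---|---|
| 0 | 0.25 | 0.25 | 0.25 | 0.25 |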
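
One detail worth checking when labeling probability columns: scikit-learn stores the class labels in sorted order in `classes_`, and `predict_proba` returns its columns in that order, which can differ from the order in which the labels were listed. A minimal sketch on hypothetical toy data (the `demo_` names and texts are illustrative, not from the notebook):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical one-document-per-label corpus, listed in unsorted label order.
demo_texts = [
    'html css javascript',
    'java backend services',
    'c firmware drivers',
    'sales marketing retail',
]
demo_labels = ['web_dev', 'dev', 'embed', 'neither']

demo_pipe = Pipeline(steps=[
    ('text', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=1000)),
])
demo_pipe.fit(demo_texts, demo_labels)

# classes_ is sorted, so it differs from the order of demo_labels.
print(list(demo_pipe.classes_))

# Label the probability columns with classes_ so each value matches its class.
proba = pd.DataFrame(
    demo_pipe.predict_proba(['html and css']),
    columns=demo_pipe.classes_,
)
print(proba)
```

Using `classes_` for the column names keeps each probability attached to the right label even if the label tuple is later reordered.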


In [6]:
print('Prediction: ', pipe.predict(trial['text'])[0])

Prediction:  dev
