# DistilBERT Embedding and Model

In [72]:
import pandas as pd
import numpy as np

from transformers import (
    AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification, 
    Trainer, TrainingArguments, DataCollatorWithPadding, pipeline
)
from datasets import Dataset
import evaluate

from sklearn.model_selection import train_test_split

## Import Data

In [5]:
df = pd.read_csv('../data/clean/job_ads.csv')

In [24]:
df.columns

Index(['title', 'department', 'telecommuting', 'has_company_logo',
       'has_questions', 'employment_type', 'required_experience',
       'required_education', 'industry', 'function', 'fraudulent', 'job_ad',
       'country'],
      dtype='object')

In [25]:
X = df[['job_ad']]
y = df['fraudulent']

In [26]:
X = X.fillna('')

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state=1613)

In [58]:
X_train['label'] = y_train

In [59]:
X_train

| | job_ad | label |
|---|---|---|
| 1419 | Armor People Link is currently seeking a Staff... | 0 |
| 8922 | EUROPEAN DYNAMICS () and follow us on Twitter ... | 0 |
| 6079 | Farmigo, a well funded Startup , develops a p... | 0 |
| 16679 | VAM SYSTEMS is a Business Consulting, IT Solut... | 0 |
| 9864 | Our clients have bold dreams. We need your he... | 0 |
| ... | ... | ... |
| 683 | Our mission is to bring the world's best-loved... | 0 |
| 5976 | ValleySoft is a fast growing global IT Service... | 0 |
| 13155 | BADR is an established company that is stridin... | 0 |
| 4669 | Want to build a career in IT? Free training in... | 0 |


In [60]:
X_test['label'] = y_test

In [61]:
X_test

| | job_ad | label |
|---|---|---|
| 985 | En Adjust somos un DSP (Demand Side Platform) ... | 0 |
| 11534 | Outstanding Member Service Starts With Outstan... | 0 |
| 3831 | About HitFigure:Franchised car dealers who rep... | 0 |
| 4848 | compensation: Salary: $10-15 an hour based on ... | 0 |
| 3891 | Applied Memetics LLC is a professional service... | 0 |
| ... | ... | ... |
| 16911 | LEI Home Enhancements, is an Ohio based compan... | 0 |
| 14693 | Making Quality Metrics ActionableWe are revolu... | 0 |
| 9479 | ABC Supply Co. , Inc. is the nation’s largest... | 0 |
| 7084 | We Provide Full Time Permanent Positions for m... | 0 |


## DistilBERT Model

I opted for the [DistilBERT base model (uncased)](https://huggingface.co/distilbert/distilbert-base-uncased), first published in this [paper](https://arxiv.org/abs/1910.01108). It is smaller than BERT (67 million parameters instead of 110 million) yet retains most of BERT's accuracy (97% according to the paper), and it is faster and therefore more energy efficient. That makes it a good fit for the web application I want to build, so I opted for this version.

I also looked through the existing fine-tuned versions for one applicable to my situation, but found none, so I'm going to fine-tune the DistilBERT model for this application myself.

In [62]:
id2label = {0: 'real', 1: 'fraud'}
label2id = {'real': 0, 'fraud': 1}

In [63]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", 
    num_labels = 2, 
    id2label=id2label, 
    label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [64]:
train_dataset = Dataset.from_pandas(X_train)
test_dataset = Dataset.from_pandas(X_test)

In [65]:
tokenize_train = train_dataset.map(lambda ad: tokenizer(ad['job_ad'], truncation=True), batched =True)

Map:   0%|          | 0/13410 [00:00<?, ? examples/s]

In [66]:
tokenize_test = test_dataset.map(lambda ad: tokenizer(ad['job_ad'], truncation=True), batched =True)

Map:   0%|          | 0/4470 [00:00<?, ? examples/s]

In [108]:
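# 'roc' dropped: the results below report only these four metrics, and
# roc_auc would need prediction scores rather than predicted classes
metrics = evaluate.combine(['accuracy', 'precision', 'recall', 'f1'])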


In [109]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # convert the logits to their predicted class
    predictions = np.argmax(logits, axis=-1)
    return metrics.compute(predictions=predictions, references=labels)

In [110]:
# Prepare data collator for padding sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    eval_strategy="epoch",
    logging_strategy="epoch"
)

# Define Trainer object for training the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenize_train,
    eval_dataset=tokenize_test,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
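
`DataCollatorWithPadding` pads each batch to the length of its longest sequence instead of padding the whole dataset to one fixed length. The sketch below illustrates that idea in plain Python; `pad_batch` is a hypothetical helper, the token ids are made up, and a pad id of 0 is an assumption based on the BERT uncased vocabulary. It is not the library's implementation.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: [PAD] has id 0, as in the BERT uncased vocabulary


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two ads of different lengths
padded = pad_batch([[101, 2057, 2393, 102], [101, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Because the padding length is decided per batch, short ads batched together cost less compute than they would under a single dataset-wide length.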


In [111]:
# Train the model
trainer.train()

# Save the trained model
trainer.save_model('model')



| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2202 | 0.213994 | 0.951678 | 0.0 | 0.0 | 0.0 |
| 2 | 0.2198 | 0.220774 | 0.951678 | 0.0 | 0.0 | 0.0 |
| 3 | 0.217 | 0.224938 | 0.951678 | 0.0 | 0.0 | 0.0 |
| 4 | 0.2173 | 0.221765 | 0.951678 | 0.0 | 0.0 | 0.0 |
| 5 | 0.2185 | 0.221656 | 0.951678 | 0.0 | 0.0 | 0.0 |
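
An accuracy of 0.951678 alongside precision, recall, and F1 of 0 is what a classifier that always predicts "real" would score. The sketch below checks this with plain Python. The class counts are not printed anywhere in the notebook; they are inferred from the reported accuracy on the 4,470 test ads, so treat them as an assumption. `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Precision is undefined with no predicted positives; report 0.0 in that case
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision, "recall": recall, "f1": f1}


# Counts inferred from the reported accuracy on 4,470 test ads:
# 0.951678 * 4470 is about 4254 real ads, leaving 216 fraudulent ones.
n_real, n_fraud = 4254, 216

# A classifier that predicts "real" for every ad
always_real = binary_metrics(tp=0, fp=0, fn=n_fraud, tn=n_real)
print(always_real)
```

The accuracy matches the table to six decimal places, which is consistent with the model never predicting the fraud class in any epoch.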


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [113]:
f1 = evaluate.load('f1')

In [114]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # convert the logits to their predicted class
    predictions = np.argmax(logits, axis=-1)
    return f1.compute(predictions=predictions, references=labels)

In [115]:
# Prepare data collator for padding sequences
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results2",
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    eval_strategy="epoch",
    logging_strategy="epoch"
)

# Define Trainer object for training the model
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenize_train,
    eval_dataset=tokenize_test,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)


In [116]:
# Train the model
trainer.train()

# Save the trained model
trainer.save_model('model2')



| Epoch | Training Loss | Validation Loss | F1 |
|---|---|---|---|
| 1 | 0.2184 | 0.215071 | 0.0 |
| 2 | 0.2192 | 0.22125 | 0.0 |
| 3 | 0.2176 | 0.233627 | 0.0 |
| 4 | 0.2174 | 0.217775 | 0.0 |
| 5 | 0.218 | 0.210218 | 0.0 |
| 6 | 0.2155 | 0.226048 | 0.0 |
| 7 | 0.2151 | 0.225853 | 0.0 |
| 8 | 0.2165 | 0.232077 | 0.0 |
| 9 | 0.2147 | 0.216794 | 0.0 |
| 10 | 0.2125 | 0.220363 | 0.0 |




In [77]:
X_test.iloc[45]

job_ad    We help teachers get safe and secure jobs abro...
label                                                     0
Name: 7052, dtype: object

In [101]:
X_test.iloc[45]['job_ad']

'We help teachers get safe and secure jobs abroad :)Play with kids, get paid for it. Vacancies in Asia$1500 USD + monthly ($200 Cost of living)Housing providedAirfare providedExcellent for student loans/credit cardsGabriel Adkins (We are looking for friendly people.  If you do not plan to take part in a 3-5 minute interview, kindly do not waste your time applying :-)University degree required.  TEFL / TESOL / CELTA, and/or teaching experience preferredCanada/US passport holders onlySee job description'

In [86]:
tokenize_test[56]

{'job_ad': "Costa coffee was initially started in London by two Italian brothers named Sergio and Bruno Costa; and it has now become a multinational coffee chain.  Costa coffee is the world’s third largest coffee house chain with over 1700 stores in more than 28 countries across the globe.  Our stores can be found anywhere from airports to bookstores, Hotels, Pizza Hut branches, etc.  the largest store is located in Dubai that allows a sitting of 321 people at once.   We are planning to set up new centers at some universities and hospitals, where the coffee beans used will be of the same type. Our new development program is designed to enhance customer service experience by launching several new stores in the U. S.  by the end of 2014.  That's why we want to hire talented managers that will help us accomplish this goal, building the store from the ground up.  We are looking to set up stores in Florida.  In the majority of cases we will assign you to the area you are in, but if there's 

In [88]:
trainer.evaluate()

{'eval_loss': 0.21981267631053925,
 'eval_accuracy': 0.9516778523489933,
 'eval_runtime': 105.7805,
 'eval_samples_per_second': 42.257,
 'eval_steps_per_second': 5.285,
 'epoch': 5.0}

In [91]:
preds = trainer.predict(tokenize_test)



In [98]:
preds

PredictionOutput(predictions=array([[ 1.8830388, -2.3619325],
       [ 1.8830385, -2.3619323],
       [ 1.8830385, -2.3619328],
       ...,
       [ 1.8830386, -2.3619328],
       [ 1.8830388, -2.361933 ],
       [ 1.8830388, -2.3619325]], dtype=float32), label_ids=array([0, 0, 0, ..., 0, 0, 0]), metrics={'test_loss': 0.21981267631053925, 'test_accuracy': 0.9516778523489933, 'test_runtime': 106.1135, 'test_samples_per_second': 42.125, 'test_steps_per_second': 5.268})

In [99]:
y_test

985      0
11534    0
3831     0
4848     0
3891     0
        ..
16911    0
14693    0
9479     0
7084     0
2416     0
Name: fraudulent, Length: 4470, dtype: int64

## More Testing

In [100]:
my_classifier = pipeline('text-classification', model='model')

Device set to use mps:0


In [102]:
my_classifier(X_test.iloc[45]['job_ad'])

[{'label': 'real', 'score': 0.9858664870262146}]
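
The score returned by the pipeline can be reproduced from the logits that `trainer.predict` printed earlier. A minimal sketch in plain Python, assuming index 0 is real and index 1 is fraud as in `id2label`; `softmax` here is a hypothetical helper, not a `transformers` function:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of logits printed by trainer.predict (order: real, fraud)
probs = softmax([1.8830388, -2.3619325])
print(f"real={probs[0]:.4f}, fraud={probs[1]:.4f}")
```

The probability for real comes out at about 0.986, the same score the pipeline reports. Since every row of logits in the prediction output is nearly identical, the model appears to return this score whatever text it is given.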

The classifier labels this held-out ad as real, which matches its true label of 0.

In [103]:
my_classifier("""
About the job CareHarmony is seeking a talented and motivated Data Scientist to join our team. As a Data Scientist at CareHarmony, you will play a crucial role in leveraging data to make a positive impact on healthcare outcomes. 
In this role, you will work closely with our data science team to analyze complex healthcare data, build predictive models, and generate actionable insights. You will have the opportunity to work with a diverse dataset, including patient interactions, electronic medical records (EMR), and outcomes data. 
Key responsibilities include data cleaning and preprocessing, feature engineering, model development and evaluation, and presenting findings to both technical and non-technical stakeholders. You will collaborate with cross-functional teams to translate data insights into actionable strategies that improve patient care and coordination. 
If you are passionate about using data to drive meaningful change in healthcare, and you thrive in a collaborative and innovative environment, we would love to hear from you! 
Requirements Strong background in statistics, machine learning, and data analysis
Proficiency in programming languages such as Python or R
Experience working with large and complex datasets
Knowledge of statistical modeling techniques and experience building predictive models
Strong problem-solving and analytical skills

Ability to communicate complex findings to both technical and non-technical stakeholders
Experience with healthcare data and electronic medical records (EMR) is a plus


Benefits

Why Apply
 Opportunity to get in the early stage at a high-growth HealthTech company with extreme product market fit and exponential growth (we're deployed at 80+ health systems across >25 states!).
 You are trusted to take complete autonomy over data ingestion, data integration, and data-related API development.
 Actively transform healthcare outcomes by solving real-world problems for millions of patients
 """
)

[{'label': 'real', 'score': 0.9858664870262146}]

Now let's test the model with two fake job ads that ChatGPT wrote for me:

In [104]:
my_classifier(
    """
    Title:
Work From Home! Earn $5,000 a Week — No Experience Needed!

Description:
We are an international financial company seeking motivated individuals to work from home. You will process payments and forward funds to our clients. No prior experience required!

Perks:

Get paid $5,000–$7,500 weekly

Flexible hours — work whenever you want

Instant promotion opportunities

Receive a $500 “training fee” after your first task

Requirements:

Must be 18+

Have a bank account for “direct deposits”

Be willing to buy basic office supplies (will be reimbursed)

How to Apply:
Send your full name, home address, Social Security Number, and a copy of your ID to jobs@fastcashcompany-mail.com
 today. Spots are limited — act now!
 """
)

[{'label': 'real', 'score': 0.9858664870262146}]

In [105]:
my_classifier("""
Description:
Global Tech Analytics, Inc. is hiring for a remote data scientist to join our exciting team. No experience is necessary — we will provide “proprietary training.” Your main task will be helping us “process large data sets” and “analyze transactions” for our clients.

Benefits:

$8,000 per week guaranteed

Work from anywhere, no meetings

Get paid after your first project within 24 hours

Free laptop provided after you pay the refundable shipping fee

Requirements:

Must have a personal bank account for direct deposits

Willing to purchase software license key upfront (reimbursed with first paycheck)

Basic knowledge of Excel preferred (but not required)

How to Apply:
Send your résumé, copy of your passport or driver’s license, and bank routing information to hr@globaltech-analytics-careers.net
. Hurry — this position will close today!
""")

[{'label': 'real', 'score': 0.9858664870262146}]

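No good: the model labels every ad as real, including the two fabricated ones, and returns the identical score of 0.9859 each time. It has fallen into the trap of predicting the majority class, since about 95% of the ads are real. Unfortunately, I have run out of time to continue this work, so I'm going to have to stop here.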
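
One common response to this kind of imbalance is to oversample the minority class, on the training split only so that the test set keeps its real-world proportions. The sketch below shows random oversampling in plain Python on made-up rows; `oversample_minority` is a hypothetical helper and was not run as part of this notebook.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (job_ad text, label) with 1 = fraud


def oversample_minority(rows: List[Example], seed: int = 0) -> List[Example]:
    """Duplicate random minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    real = [r for r in rows if r[1] == 0]
    fraud = [r for r in rows if r[1] == 1]
    minority, majority = (fraud, real) if len(fraud) < len(real) else (real, fraud)
    if not minority:
        raise ValueError("need at least one example of each class")
    # Sample with replacement from the minority class to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = rows + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy training split: 19 real ads and 1 fraudulent ad
toy_train = [(f"real ad {i}", 0) for i in range(19)] + [("fraud ad", 1)]
balanced = oversample_minority(toy_train)
print(sum(label for _, label in balanced), "fraud of", len(balanced))
```

Duplicating rows adds no new information about fraud, so this can encourage overfitting to the few fraudulent ads; whether it helps here would need to be tested.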

## Summary

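Fine-tuning DistilBERT as configured here did not produce a usable fraud detector. After both training runs the model predicts real for every ad: accuracy sits at 95.2%, which is simply the share of real ads in the test set, while precision, recall, and F1 for the fraud class are all 0.

My next problem-solving step is to investigate how I've prepared the texts, and then to consider resampling or reweighting the smaller (fraud) class to improve the F1 score.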
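
Reweighting could be sketched as follows. The weights use the common "balanced" heuristic, n_samples / (n_classes * class_count), and the loss is a weighted cross-entropy written in plain Python. The labels and probabilities are hypothetical stand-ins for the roughly 95/5 split and the constant prediction seen above, and both function names are my own. In PyTorch, weights like these could be passed to `torch.nn.CrossEntropyLoss(weight=...)`; wiring that into the `Trainer` is left out here.

```python
import math
from collections import Counter
from typing import Dict, List


def balanced_class_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs: List[List[float]], labels: List[int],
                           weights: Dict[int, float]) -> float:
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / sum(weights[y] for y in labels)


# Hypothetical labels with the roughly 95/5 split seen in this dataset
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# A model that gives every ad p(real) = 0.986, like the constant prediction above
constant_probs = [[0.986, 0.014]] * len(labels)
plain = weighted_cross_entropy(constant_probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(constant_probs, labels, weights)
print(weights, round(plain, 3), round(weighted, 3))
```

With equal weights, always predicting real is cheap because the few fraudulent ads contribute little to the average. With balanced weights the same constant prediction is penalised far more heavily, which is the pressure reweighting is meant to apply.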