# garbage_classifier

This notebook classifies website text snippets as useful or not (i.e., garbage) using transfer learning, starting from an existing Hugging Face model:
* Get a model checkpoint for an encoder model
* Fine-tune the model on a new classification problem (EAGER website data) with limited new training data
* Apply the new model head to the full EAGER corpus to come up with mixes of models
* Register metrics and models through a combination of comet.ml and TensorBoard

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Fri Jul 22 19:13:06 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
results_file = '/content/drive/MyDrive/raaste-results/garbage-classifier-results.csv'
company_file_dir = '/content/EAGER/data/orgs/parsed_page_output'
garbage_out_file_dir = '/content/EAGER/data/orgs/garbage'

## Install and import libraries

In [3]:
COMET_PROJECT_NAME = "eager-garbage-classifier"

In [4]:
# check environment
import sys
IN_COLAB = 'google.colab' in sys.modules
print (IN_COLAB)

True


In [5]:
# colab file system setup 
if IN_COLAB: 
    !git clone https://github.com/euphonic/EAGER.git
    !pwd
    !mkdir /content/logs

fatal: destination path 'EAGER' already exists and is not an empty directory.
/content
mkdir: cannot create directory ‘/content/logs’: File exists


In [6]:
# mount google drive if in colab
drive_path = '/content/drive/'

if IN_COLAB:  
    from google.colab import drive
    drive.mount(drive_path, force_remount=True)

Mounted at /content/drive/


In [33]:
# install huggingface and other modules if in colab
if IN_COLAB: 
    !pip install transformers
    !pip install datasets
    !pip install python-dotenv
    !pip install comet_ml
    !pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting evaluate
  Downloading evaluate-0.1.2-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 1.9 MB/s 
Installing collected packages: evaluate
Successfully installed evaluate-0.1.2


In [7]:
from comet_ml import Experiment
from comet_ml.api import API
from dotenv import load_dotenv

# setup comet_ml experiment
if IN_COLAB: 
    # read env file from Google drive 
    env_file = drive_path + 'MyDrive/raaste-config/.env'
    comet_config_file = drive_path + 'MyDrive/raaste-config/.comet.config'
    load_dotenv(env_file)

In [34]:
# ml libraries
from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification, DataCollatorWithPadding, \
  AutoConfig, TFBertForSequenceClassification
from transformers.pipelines.pt_utils import KeyDataset
from datasets import Dataset
import datasets
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import PolynomialDecay
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from sklearn.model_selection import train_test_split
import pandas as pd

# other
import numpy as np
import gzip
import tarfile
import datetime
import os
from tqdm import tqdm 
import csv
import evaluate

In [27]:
if IN_COLAB: 
    !pip uninstall -y comet_ml
    !pip install comet_ml

Found existing installation: comet-ml 3.31.6
Uninstalling comet-ml-3.31.6:
  Successfully uninstalled comet-ml-3.31.6
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting comet_ml
  Using cached comet_ml-3.31.6-py2.py3-none-any.whl (372 kB)
Installing collected packages: comet-ml
Successfully installed comet-ml-3.31.6


In [9]:
# load tensorboard 
%load_ext tensorboard

## Model training
Labels: keep text == 1, discard == 0

In [10]:
# base bert
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [11]:
firm_file_location = '/content/EAGER/data/modeling/garbage/garbage_classifier_input.csv'
input_df = pd.read_csv(firm_file_location)
print(input_df.shape)

(5624, 2)


In [12]:
# inspect duplicates manually 
dup_df = input_df[input_df.duplicated('sample_text', keep=False)]
tmp_out_dir = '/content/tmp/'
os.makedirs(tmp_out_dir, exist_ok=True)  
dup_df.to_csv(os.path.join(tmp_out_dir, 'dup_df.csv'), sep=',')

In [13]:
# remove duplicates and nulls
non_dup_df = input_df[~input_df.duplicated('sample_text', keep="first")]
print (non_dup_df.shape) 
non_null_df = non_dup_df[~ non_dup_df['sample_text'].isnull() ]
print (non_null_df.shape)

(4740, 2)
(4740, 2)
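
The two shapes above say that deduplication dropped 884 of the 5624 rows and that no null `sample_text` rows remained afterwards. The two `duplicated` calls differ only in the `keep` argument. The sketch below is a plain-Python illustration of that difference, not the pandas implementation; the `duplicated` helper and the example strings are hypothetical.

```python
from collections import Counter

def duplicated(values, keep="first"):
    """Plain-Python illustration of the duplicate flags used above.

    keep=False   -> flag every member of a repeated group (manual inspection)
    keep="first" -> flag every occurrence except the first (deduplication)
    """
    counts = Counter(values)
    seen = set()
    flags = []
    for v in values:
        if keep is False:
            flags.append(counts[v] > 1)
        else:
            flags.append(v in seen)
        seen.add(v)
    return flags

# hypothetical snippets, not taken from the EAGER data
texts = ["Home", "About us", "Home", "Contact", "Home"]

all_dups = duplicated(texts, keep=False)
later_dups = duplicated(texts, keep="first")
deduped = [t for t, is_dup in zip(texts, later_dups) if not is_dup]

print(all_dups)
print(later_dups)
print(deduped)
```

Using `keep=False` for inspection shows every copy of a repeated snippet side by side, which makes conflicting labels easy to spot before the later copies are dropped.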


In [14]:
dataset = Dataset.from_pandas(non_null_df, split='train')
# cast_column returns a new Dataset, so keep the result
dataset = dataset.cast_column("of_interest", datasets.Value('int8'))
dataset

Casting the dataset:   0%|          | 0/1 [00:00<?, ?ba/s]

Dataset({
    features: ['sample_text', 'of_interest', '__index_level_0__'],
    num_rows: 4740
})

In [15]:
# 85% train, 15% held out for test + validation
train_test_dataset = dataset.train_test_split(test_size=0.15)
# split the 15% holdout into 30% test and 70% validation
test_valid_dataset = train_test_dataset['test'].train_test_split(test_size=0.3)
# gather the splits into a single DatasetDict
train_test_valid_dataset = datasets.DatasetDict({
    'train': train_test_dataset['train'],
    'test': test_valid_dataset['test'],
    'valid': test_valid_dataset['train']})

In [16]:
train_test_valid_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 4029
    })
    test: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 214
    })
    valid: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__'],
        num_rows: 497
    })
})
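
The split sizes above follow from the two `test_size` values. The sketch below reproduces the arithmetic. It assumes the test portion is rounded up, an assumption that happens to reproduce the reported row counts; the `split_sizes` helper is hypothetical.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), rounding the test portion up (assumed)."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# first split: 15% of the 4740 deduplicated rows is held out
n_train, n_holdout = split_sizes(4740, 0.15)

# second split: 30% of the holdout becomes 'test', the rest 'valid'
n_valid, n_test = split_sizes(n_holdout, 0.3)

print(n_train, n_valid, n_test)
```

Under that assumption the final test set is about 4.5% of the data and the validation set about 10.5%.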

In [17]:
def tokenize_function(x):
  return tokenizer(x["sample_text"], truncation=True, max_length=100)

In [18]:
tokenized_dataset = train_test_valid_dataset.map(tokenize_function, batched=True, batch_size=None)



  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [19]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4029
    })
    test: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 214
    })
    valid: Dataset({
        features: ['sample_text', 'of_interest', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 497
    })
})

In [20]:
samples = tokenized_dataset["train"].to_dict()
samples = {k: v for k, v in samples.items() if k not in ["__index_level_0__", "sample_text"]}
for k, v in samples.items(): 
  print (k, v[0:5])

of_interest [0, 0, 0, 1, 0]
input_ids [[101, 6529, 102], [101, 4553, 2062, 1028, 102], [101, 2055, 23713, 4817, 102], [101, 2057, 2191, 2009, 2147, 2005, 2017, 1012, 102], [101, 16031, 2072, 1011, 2381, 1013, 17987, 102]]
token_type_ids [[0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]


In [21]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=100, return_tensors="tf")

In [22]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'attention_mask': TensorShape([4029, 100]),
 'input_ids': TensorShape([4029, 100]),
 'of_interest': TensorShape([4029]),
 'token_type_ids': TensorShape([4029, 100])}
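
The shapes above show every sequence padded to 100 tokens. As a rough illustration of what `padding="max_length"` does, the sketch below pads token-id lists and builds the matching attention mask in plain Python. The `pad_batch` helper is hypothetical, and it assumes the pad token id is 0 (the `[PAD]` id for `bert-base-uncased`). It uses a shorter `max_length` so the result is easy to read.

```python
def pad_batch(input_ids, max_length=100, pad_id=0):
    """Right-pad each id list to max_length and build attention masks.

    The mask is 1 for real tokens and 0 for padding positions.
    """
    padded, masks = [], []
    for ids in input_ids:
        ids = ids[:max_length]            # truncate overly long inputs
        n_pad = max_length - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# first two tokenized samples printed earlier in the notebook
batch_ids = [[101, 6529, 102], [101, 4553, 2062, 1028, 102]]

padded, masks = pad_batch(batch_ids, max_length=8)
print(padded)
print(masks)
```

Padding everything to a fixed length keeps tensor shapes constant across batches, at the cost of extra computation on very short snippets.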

In [23]:
# config
config = AutoConfig.from_pretrained(checkpoint)
config.num_labels=1
config.hidden_dropout_prob = 0.2
config.attention_probs_dropout_prob = 0.2
config.hidden_size = 64
config.intermediate_size = 256
config.num_hidden_layers = 4
config.num_attention_heads = 4
print (type(config))

<class 'transformers.models.bert.configuration_bert.BertConfig'>


In [24]:
num_epochs = 50
batch_sizes = [32]
model_name = COMET_PROJECT_NAME.replace ('-', '_')

for bs in batch_sizes: 
  # create a Comet experiment (credentials are read from the environment / Comet config)
  experiment = Experiment(project_name=COMET_PROJECT_NAME)
  with experiment.train():
    experiment.log_parameter("batch_size", bs)

  # model
  model = TFAutoModelForSequenceClassification.from_config(config)
  print (type(model))

  print ('batch_size', bs)

  tf_train_dataset = tokenized_dataset["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=True,
    collate_fn=data_collator,
    batch_size=bs,
  )

  tf_validation_dataset = tokenized_dataset["valid"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "token_type_ids"],
    label_cols="of_interest",
    shuffle=False,
    collate_fn=data_collator,
    batch_size=bs,
  )

  # The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
  # by the total number of epochs
  num_train_steps = len(tf_train_dataset) * num_epochs
  lr_scheduler = PolynomialDecay(
      initial_learning_rate=5e-5, end_learning_rate=0, decay_steps=num_train_steps
  )

  opt = Adam(learning_rate=lr_scheduler, beta_1=0.9, beta_2=0.999)

  log_dir = "/content/logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
  tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)    

  early_stopping_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

  loss = tf.keras.losses.BinaryFocalCrossentropy(from_logits=True, gamma=2.0, label_smoothing=0.2) # gamma = 0 is equivalent to binary cross entropy
  model.compile(optimizer=opt, loss=loss, metrics=["accuracy"])

  model.fit(tf_train_dataset, validation_data=tf_validation_dataset, 
        epochs=num_epochs, callbacks=[tensorboard_callback, early_stopping_callback])
  
  # save model to disk and log it to Comet

  model_save_path = '/content/models/' + model_name + "_" + str(bs)
  print (model_save_path)
  model.save_pretrained(model_save_path)
  experiment.log_model(name=model_name, file_or_folder=model_save_path)
  
  experiment.end()

COMET ERROR: Failed to calculate active processors count. Fall back to default CPU count 1
COMET INFO: Experiment is live on comet.ml https://www.comet.com/euphonic/eager-garbage-classifier/5f24502e9d0e49a3851d9ff8275f7e72



<class 'transformers.models.bert.modeling_tf_bert.TFBertForSequenceClassification'>
batch_size 32


COMET INFO: Ignoring automatic log_parameter('verbose') because 'keras:verbose' is in COMET_LOGGING_PARAMETERS_IGNORE


Epoch 1/50

COMET INFO: ignoring tensorflow summary log of metrics because of keras; set `comet_ml.loggers.tensorboard_logger.LOG_METRICS = True` to override


Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
/content/models/eager_garbage_classifier_32


COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.com/euphonic/eager-garbage-classifier/5f24502e9d0e49a3851d9ff8275f7e72
COMET INFO:   Metrics [count] (min, max):
COMET INFO:     accuracy [15]                : (0.6061057448387146, 0.9823777675628662)
COMET INFO:     batch_accuracy [195]         : (0.5625, 1.0)
COMET INFO:     batch_loss [195]             : (0.011768973432481289, 0.17344476282596588)
COMET INFO:     epoch_duration [15]          : (4.527011622999908, 25.86587282099981)
COMET INFO:     loss [15]                    : (0.021229946985840797, 0.1695539355278015)
COMET INFO:     val_accuracy [15]            : (0.5875251293182373, 0.8832998275756836)
COMET INFO:     val_loss [15]                : (0.08461077511310577, 0.16738691926002502)
COMET INFO:     validate_batch_accuracy [30] : (0.596
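
Two pieces of the training cell are easier to reason about when written out: the learning-rate schedule and the loss. The sketch below is a simplified, single-example version of both in plain Python. It reflects my understanding of the Keras formulas (default `power=1.0` for `PolynomialDecay`, binary label smoothing toward 0.5) and omits details such as epsilon clipping and batch reduction, so treat it as an approximation rather than the library implementation. The helper names are hypothetical.

```python
import math

def linear_decay_lr(step, decay_steps, initial_lr=5e-5, end_lr=0.0, power=1.0):
    """Polynomial decay; with power=1.0 the rate falls linearly to end_lr."""
    step = min(step, decay_steps)
    return (initial_lr - end_lr) * (1 - step / decay_steps) ** power + end_lr

def binary_focal_loss(logit, label, gamma=2.0, label_smoothing=0.2):
    """Focal loss for one example with a single output logit."""
    p = 1.0 / (1.0 + math.exp(-logit))                      # sigmoid
    y = label * (1.0 - label_smoothing) + 0.5 * label_smoothing
    bce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    p_t = y * p + (1.0 - y) * (1.0 - p)                     # prob. of target
    return (1.0 - p_t) ** gamma * bce                       # down-weight easy cases

# hypothetical step counts, chosen only to show the shape of the schedule
for step in (0, 500, 1000):
    print(step, linear_decay_lr(step, decay_steps=1000))

# a confident correct prediction contributes far less than a confident wrong one
print(binary_focal_loss(3.0, 1), binary_focal_loss(-3.0, 1))
```

With `gamma=0` and no smoothing the loss reduces to ordinary binary cross-entropy, which matches the comment in the training cell.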

## Out of sample testing
Run on final eval dataset only

In [27]:
# create test dataset
tf_test_dataset = tokenized_dataset["test"].to_tf_dataset (
  columns=["attention_mask", "input_ids", "token_type_ids"],
  label_cols="of_interest",
  shuffle=False,
  collate_fn=data_collator,
  batch_size=bs,
)

In [66]:
# get list of strings and list of labels
eval_text = train_test_valid_dataset['test']['sample_text']
eval_labels = train_test_valid_dataset['test']['of_interest']

# define inference pipeline
pipe = pipeline ("text-classification", model=model, tokenizer=tokenizer, device=0, batch_size = 8,function_to_apply='sigmoid' )
tokenizer_kwargs = {'padding':'max_length','truncation':True,'max_length':100}

In [43]:
# run pipeline
out = pipe (eval_text, **tokenizer_kwargs)

In [67]:
# calculate class predictions from sigmoids 
scores = np.asarray([o['score'] for o in out])

print ('scores: ' + str(scores[0:5]))
threshold = 0.5
preds = np.where(scores > threshold, 1, 0)
print ('preds: ' + str(preds[0:5]))

print ('eval labels: ' + str(eval_labels[0:5]))

scores: [0.08487141 0.08529108 0.07992468 0.92187572 0.06950309]
preds: [0 0 0 1 0]
eval labels: [0, 0, 0, 1, 0]
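
Because the model has a single output (`num_labels=1`), the pipeline score is the sigmoid of one logit, and the class is decided by the threshold. A minimal sketch of that step, using hypothetical logits and helper names rather than real model output:

```python
import math

def sigmoid(logit):
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def classify(logits, threshold=0.5):
    """Return (scores, preds); a score strictly above the threshold is 'keep'."""
    scores = [sigmoid(z) for z in logits]
    preds = [1 if s > threshold else 0 for s in scores]
    return scores, preds

logits = [-2.4, 2.5, 0.0]  # hypothetical values
scores, preds = classify(logits)
print(scores)
print(preds)
```

A score of exactly 0.5 falls on the discard side here because the comparison is strict, as in the `np.where(scores > threshold, 1, 0)` call above.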


In [69]:
# run metrics
accuracy = evaluate.load("accuracy")
accuracy.compute(references=eval_labels, predictions=preds)

{'accuracy': 0.8504672897196262}
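
An accuracy of 0.8505 on 214 test rows corresponds to 182 correct predictions. Accuracy alone does not say whether the errors are wrongly kept garbage or wrongly discarded content. The sketch below computes confusion counts, precision, and recall in plain Python; the labels and predictions are hypothetical and the helper names are not part of the `evaluate` library.

```python
def confusion_counts(labels, preds):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    return tp, fp, fn, tn

def summarize(labels, preds):
    """Return accuracy, precision, and recall for the 'keep' class."""
    tp, fp, fn, tn = confusion_counts(labels, preds)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# hypothetical labels and predictions
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_preds = [1, 0, 1, 1, 0, 0, 0, 0]

counts = confusion_counts(toy_labels, toy_preds)
metrics = summarize(toy_labels, toy_preds)
print(counts)
print(metrics)
```

For this use case recall on the keep class matters when losing useful text is costly, while precision matters when garbage leaking through is the bigger concern.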

In [72]:
# which sample texts were incorrectly classified? 
for i in range (len(preds)):
  if (preds[i] != eval_labels[i]):
    print ('score: ' + str(scores[i]) + ' vs. label: ' + str(eval_labels[i]) )
    print ('\t' + eval_text[i])

score: 0.903399646282196 vs. label: 0
	Used by the content network, Cloudflare, to identify trusted web traffic.
score: 0.6870911717414856 vs. label: 0
	Operations & Maintenance
score: 0.33153608441352844 vs. label: 1
	Glycol
score: 0.48014360666275024 vs. label: 1
	We serve companies that:
score: 0.6628424525260925 vs. label: 0
	View on MobileHome
score: 0.4596140384674072 vs. label: 1
	Adaptive Interference Cancellation
score: 0.1039787009358406 vs. label: 1
	Tell us what you think
score: 0.8026952743530273 vs. label: 0
	702-639-4440
score: 0.8969711661338806 vs. label: 0
	Flip text color (from light to dark)
score: 0.8757806420326233 vs. label: 0
	Will Quirk, – AFFAIRS
score: 0.14474432170391083 vs. label: 1
	VgaArbiter : Routing instructions correctly.
score: 0.4717980623245239 vs. label: 1
	Dr. Clive Bosnyak
score: 0.8142120242118835 vs. label: 0
	11) 200×100 Needle Cleaner
score: 0.27364853024482727 vs. label: 1
	The interview season is Oct. 15, 2018 through Jan. 18, 2019.
s

## Register model

In [None]:
# save best run (confirm this through comet ml portal)
best_run = 'euphonic/' + COMET_PROJECT_NAME + '/daily_pilaster_8175'
version = "1.0.3"
api = API()
api_exp = api.get(best_run)
api_exp.register_model(model_name, version=version)

COMET INFO: Successfully registered 'eager-garbage-classifier', version '1.0.3' in workspace 'euphonic'


{'registryModelId': 'dyXSwgVHtuYcesXa63lRU7TUh',
 'registryModelItemId': 'h0YeXb5dKZA70vwTzd21qcp5K'}

## Make predictions
Write results out to file for manual DQ

In [None]:
# download latest model
best_run = 'euphonic/' + COMET_PROJECT_NAME + '/daily_pilaster_8175'
version = "1.0.3" # need to set manually for now
api = API()
api.download_registry_model("euphonic", COMET_PROJECT_NAME, version,
                            output_path="/content/registered_models/", expand=True)

COMET INFO: Downloading registry model 'eager-garbage-classifier', version '1.0.3', stage None from workspace 'euphonic'...
COMET INFO: Unzipping model to '/content/registered_models' ...
COMET INFO: done!


In [None]:
# optionally retrieve model from registry
local_model_dir = 'eager_garbage_classifier_32' # need to set manually for now
model = TFBertForSequenceClassification.from_pretrained ('/content/registered_models/' + local_model_dir)

Some layers from the model checkpoint at /content/registered_models/eager_garbage_classifier_32 were not used when initializing TFBertForSequenceClassification: ['dropout_83']
- This IS expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForSequenceClassification were initialized from the model checkpoint at /content/registered_models/eager_garbage_classifier_32.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [None]:
# open file dir, read files and run predictions. aggregate results in a list object
data_for_df = []

for filename in os.listdir(company_file_dir):
  if filename.endswith('.txt'):
    with open( os.path.join(company_file_dir, filename) ) as file_h:
      co_text = [line for line in file_h]
      co_id = np.repeat (filename, len(co_text))
      co_index = np.arange (0, len(co_text))
      L = list(zip(co_id, co_index, co_text))
      data_for_df.append (L)

In [None]:
# transfer results to dataframe
dq_check_df = pd.DataFrame([t for lst in data_for_df for t in lst], columns = ['firm_name_txt', 'index', 'sample_text'])
dq_check_df

Unnamed: 0,firm_name_txt,index,sample_text
0,The Procter & Gamble Company.txt,0,See our programs\n
1,The Procter & Gamble Company.txt,1,See our impact\n
2,The Procter & Gamble Company.txt,2,See our commitment\n
3,The Procter & Gamble Company.txt,3,See our Efforts\n
4,The Procter & Gamble Company.txt,4,See our Iconic Brands\n
...,...,...,...
2096117,RenovaCare Sciences Corp.txt,119,"© Copyright 2018 RenovaCare, Inc. Third party ..."
2096118,RenovaCare Sciences Corp.txt,120,products are under development and not approve...
2096119,RenovaCare Sciences Corp.txt,121,Results may vary from person to person.\n
2096120,RenovaCare Sciences Corp.txt,122,Cookie Notice\n


In [None]:
# lower case text 
dq_check_df['sample_text_lower'] = dq_check_df['sample_text'].str.lower()
dq_check_df

Unnamed: 0,firm_name_txt,index,sample_text,sample_text_lower
0,The Procter & Gamble Company.txt,0,See our programs\n,see our programs\n
1,The Procter & Gamble Company.txt,1,See our impact\n,see our impact\n
2,The Procter & Gamble Company.txt,2,See our commitment\n,see our commitment\n
3,The Procter & Gamble Company.txt,3,See our Efforts\n,see our efforts\n
4,The Procter & Gamble Company.txt,4,See our Iconic Brands\n,see our iconic brands\n
...,...,...,...,...
2096117,RenovaCare Sciences Corp.txt,119,"© Copyright 2018 RenovaCare, Inc. Third party ...","© copyright 2018 renovacare, inc. third party ..."
2096118,RenovaCare Sciences Corp.txt,120,products are under development and not approve...,products are under development and not approve...
2096119,RenovaCare Sciences Corp.txt,121,Results may vary from person to person.\n,results may vary from person to person.\n
2096120,RenovaCare Sciences Corp.txt,122,Cookie Notice\n,cookie notice\n


In [None]:
# remove duplicates
dq_check_no_dup_df = dq_check_df[~dq_check_df.duplicated('sample_text_lower', keep="first")]
dq_check_no_dup_df

Unnamed: 0,firm_name_txt,index,sample_text,sample_text_lower
0,The Procter & Gamble Company.txt,0,See our programs\n,see our programs\n
1,The Procter & Gamble Company.txt,1,See our impact\n,see our impact\n
2,The Procter & Gamble Company.txt,2,See our commitment\n,see our commitment\n
3,The Procter & Gamble Company.txt,3,See our Efforts\n,see our efforts\n
4,The Procter & Gamble Company.txt,4,See our Iconic Brands\n,see our iconic brands\n
...,...,...,...,...
2096104,RenovaCare Sciences Corp.txt,106,"© Copyright 2018 RenovaCare, Inc. Third party ...","© copyright 2018 renovacare, inc. third party ..."
2096108,RenovaCare Sciences Corp.txt,110,"We use cookies for analytics, advertising and ...","we use cookies for analytics, advertising and ..."
2096114,RenovaCare Sciences Corp.txt,116,"Scottsdale, 85260\n","scottsdale, 85260\n"
2096116,RenovaCare Sciences Corp.txt,118,"is housed with its corporate partner, StemCell...","is housed with its corporate partner, stemcell..."


In [None]:
# build inference pipeline
pipe = pipeline ("text-classification", model=model, tokenizer=tokenizer, device=0, batch_size = 8,function_to_apply='sigmoid' )
tokenizer_kwargs = {'padding':'max_length','truncation':True,'max_length':100}

In [None]:
# create huggingface dataset structure
smaller_df = dq_check_no_dup_df.drop(columns=['sample_text'])
dq_check_ds = Dataset.from_pandas(smaller_df)
dq_check_ds

Dataset({
    features: ['firm_name_txt', 'index', 'sample_text_lower', '__index_level_0__'],
    num_rows: 272599
})

In [None]:
# run inference and store preds to results list of dicts
results = []

itr = 0
for kds in tqdm(KeyDataset(dq_check_ds, "sample_text_lower")):
  out = pipe (kds, **tokenizer_kwargs)[0]
  out['lower_text'] = kds
  results.append(out)
  
  itr += 1
  if (itr % 10000 == 0):
    print (out)

  4%|▎         | 10002/272599 [08:18<3:32:16, 20.62it/s]

{'label': 'LABEL_0', 'score': 0.9270698428153992, 'lower_text': 'the navistar® thermocool® and steer® thermocool® catheters are approved for the treatment of drug refractory recurrent symptomatic paroxysmal atrial fibrillation, when used with compatible three-dimensional electroanatomic mapping systems.\n'}


  7%|▋         | 20003/272599 [16:39<3:25:54, 20.45it/s]

{'label': 'LABEL_0', 'score': 0.9099326133728027, 'lower_text': 'you may be given access to confidential information through the site or links from the site. you may not disclose confidential information to any third party without the written consent of dolby. you must protect confidential information with at least the same degree of care that is accorded to your confidential information, but in no event less than reasonable care. confidential information includes, but is not limited to, all nonpublic information regarding dolby, its intellectual property or its customers, products, quantity and prices of products purchased, design and development data, engineering details, drawings, sales and marketing plans, unannounced products, any information marked as "confidential" or "proprietary" or similarly marked, or any information that, if disclosed, might be competitively detrimental to dolby. you may have entered into separate nondisclosure agreements with governing specific disclosures

 11%|█         | 30003/272599 [24:54<3:27:41, 19.47it/s]

{'label': 'LABEL_0', 'score': 0.9318647384643555, 'lower_text': 'is recognized internally and externally for the diversity of our associates and our inclusive culture, driven by a d&strategy based upon concrete business, scientific, talent and reputational outcomes\n'}


 15%|█▍        | 40002/272599 [33:04<3:06:38, 20.77it/s]

{'label': 'LABEL_0', 'score': 0.9149348139762878, 'lower_text': 'san antonio medical center\n'}


 18%|█▊        | 50003/272599 [41:11<2:57:50, 20.86it/s]

{'label': 'LABEL_0', 'score': 0.8309967517852783, 'lower_text': 'apr. 1990joins yutaka miyanaga\n'}


 22%|██▏       | 60004/272599 [49:12<2:49:23, 20.92it/s]

{'label': 'LABEL_0', 'score': 0.12607915699481964, 'lower_text': 'dori ellis\n'}


 26%|██▌       | 70002/272599 [57:12<2:40:51, 20.99it/s]

{'label': 'LABEL_0', 'score': 0.7666402459144592, 'lower_text': 'heating, ventilation, climate\n'}


 29%|██▉       | 80004/272599 [1:05:11<2:37:00, 20.45it/s]

{'label': 'LABEL_0', 'score': 0.9157922863960266, 'lower_text': 'not all products, services or offers are approved or offered in every market and approved labelling and instructions may vary from one country to another. for country specific product information, see the appropriate country website.\n'}


 33%|███▎      | 90004/272599 [1:13:09<2:25:11, 20.96it/s]

{'label': 'LABEL_0', 'score': 0.8282966017723083, 'lower_text': '- - - single use components\n'}


 37%|███▋      | 100002/272599 [1:21:03<2:16:57, 21.00it/s]

{'label': 'LABEL_0', 'score': 0.9259302616119385, 'lower_text': 'semiconductor lasers\n'}


 40%|████      | 110002/272599 [1:28:58<2:08:11, 21.14it/s]

{'label': 'LABEL_0', 'score': 0.4862001836299896, 'lower_text': 'publications & abstracts\n'}


 44%|████▍     | 120004/272599 [1:36:59<2:02:32, 20.75it/s]

{'label': 'LABEL_0', 'score': 0.7903949022293091, 'lower_text': 'the way(現way)を制定\n'}


 48%|████▊     | 130002/272599 [1:45:01<1:53:35, 20.92it/s]

{'label': 'LABEL_0', 'score': 0.07483556866645813, 'lower_text': '+358 10 862 1051\n'}


 51%|█████▏    | 140004/272599 [1:53:06<1:46:46, 20.70it/s]

{'label': 'LABEL_0', 'score': 0.2724936902523041, 'lower_text': '15,000,000 grams\n'}


 55%|█████▌    | 150002/272599 [2:01:04<1:37:16, 21.00it/s]

{'label': 'LABEL_0', 'score': 0.9322437644004822, 'lower_text': '"at merck, we have a belief that was first expressed by our modern-day founder, who said, \'we try never to forget that medicine is for the people.\' we live by those words. but with every fiber of her being, julie embodies them. and because she does, our company – and our world – are a better places."\n'}


 59%|█████▊    | 160002/272599 [2:09:06<1:29:07, 21.06it/s]

{'label': 'LABEL_0', 'score': 0.9127178192138672, 'lower_text': 'implantology module\n'}


 62%|██████▏   | 170003/272599 [2:17:08<1:22:17, 20.78it/s]

{'label': 'LABEL_0', 'score': 0.07363671809434891, 'lower_text': '9046580\n'}


 66%|██████▌   | 180002/272599 [2:25:08<1:13:59, 20.86it/s]

{'label': 'LABEL_0', 'score': 0.06929577887058258, 'lower_text': 'jan 28 2019\n'}


 70%|██████▉   | 190004/272599 [2:33:09<1:06:29, 20.70it/s]

{'label': 'LABEL_0', 'score': 0.913806676864624, 'lower_text': '2012 partner of the year\n'}


 73%|███████▎  | 200003/272599 [2:41:09<58:15, 20.77it/s]

{'label': 'LABEL_0', 'score': 0.9093415141105652, 'lower_text': 'we are an energy company.\n'}


 77%|███████▋  | 210003/272599 [2:49:11<51:16, 20.35it/s]

{'label': 'LABEL_0', 'score': 0.9335169196128845, 'lower_text': '- executive vice president, strategy, portfolio & alternative energy\n'}


 81%|████████  | 220002/272599 [2:57:12<42:21, 20.69it/s]

{'label': 'LABEL_0', 'score': 0.06948334723711014, 'lower_text': '4,694\n'}


 84%|████████▍ | 230003/272599 [3:05:13<34:01, 20.87it/s]

{'label': 'LABEL_0', 'score': 0.9317600727081299, 'lower_text': 'to develop the best products we can and to use our company to push for environmental and social changenvironmental change\n'}


 88%|████████▊ | 240004/272599 [3:13:18<25:45, 21.08it/s]

{'label': 'LABEL_0', 'score': 0.9151761531829834, 'lower_text': 'international association of chiefs of police\n'}


 92%|█████████▏| 250002/272599 [3:21:23<17:48, 21.14it/s]

{'label': 'LABEL_0', 'score': 0.91788250207901, 'lower_text': 'equal opportunity magazine top 50 employer\n'}


 95%|█████████▌| 260004/272599 [3:29:28<10:14, 20.49it/s]

{'label': 'LABEL_0', 'score': 0.6159295439720154, 'lower_text': 'uw-oshkosh\n'}


 99%|█████████▉| 270001/272599 [3:37:33<02:06, 20.57it/s]

{'label': 'LABEL_0', 'score': 0.923089861869812, 'lower_text': "learn how and the nature conservancy are taking a science-based approach to preserving the environment in brazil, with praxair's greenway project.\n"}


100%|██████████| 272599/272599 [3:39:38<00:00, 20.68it/s]
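
The loop above sends one string at a time to the pipeline, which ran at roughly 20 items per second and took about 3 hours 40 minutes for 272,599 rows. One possible speed-up is to hand the scorer whole batches. The sketch below shows the batching logic only. `batched`, `score_all`, and `fake_score_batch` are hypothetical names, and the stand-in scorer exists so the example runs without a model. In the notebook, `score_batch` could be something like `lambda batch: pipe(batch, **tokenizer_kwargs)`, which is an untested assumption.

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # final partial batch
        yield batch

def score_all(texts, score_batch, batch_size=8):
    """Score texts batch by batch and attach the source text to each result."""
    results = []
    for batch in batched(texts, batch_size):
        for text, out in zip(batch, score_batch(batch)):
            results.append({**out, "lower_text": text})
    return results

def fake_score_batch(batch):
    """Stand-in scorer (hypothetical): longer text gets a higher score."""
    return [{"label": "LABEL_0", "score": min(len(t) / 100, 1.0)} for t in batch]

texts = ["cookie notice", "semiconductor lasers", "4,694"] * 7   # 21 toy rows
results = score_all(texts, fake_score_batch, batch_size=8)
print(len(results), results[0])
```

Whether batching helps in practice depends on the hardware and on sequence lengths, so it is worth timing on a small sample first.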


In [None]:
# write prediction results to file 
keys = results[0].keys()

with open(results_file, 'w') as res_f:
    dict_writer = csv.DictWriter(res_f, keys)
    dict_writer.writeheader()
    dict_writer.writerows(results)

## Impute predictions into full dataset

In [None]:
# read model prediction results file and store in dictionary
results_dict = {}

with open(results_file, mode='r') as res_f:
    reader = csv.reader(res_f)
    next(reader)
    results_dict = {rows[2]: float(rows[1]) for rows in reader}

for k in list(results_dict.keys())[:10]:
  print (k, results_dict[k])

see our programs 0.13878001272678375
see our impact 0.7193490266799927
see our commitment 0.734010636806488
see our efforts 0.085110142827034
see our iconic brands 0.11456183344125748
ingredients you can trust 0.9114071726799011
see our safety process 0.7538415193557739
force for good 0.07810583710670471
force for growth 0.18520288169384003
magnetics, magnetics, magnetics 0.9273517727851868
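
The lookup above is keyed on lowercased text without the trailing newline, so the same normalization has to be applied to every line before it is looked up. The sketch below isolates that filtering step with the first three scores printed above. `normalize` and `keep_lines` are hypothetical helper names, and the unscored line is made up for illustration.

```python
def normalize(line):
    """Build the lookup key: lowercase, trailing newline removed."""
    return line.lower().strip("\n")

def keep_lines(lines, scores, threshold=0.5):
    """Split lines into those kept (score above threshold) and those unscored."""
    kept, missing = [], []
    for line in lines:
        key = normalize(line)
        if key not in scores:
            missing.append(key)        # no prediction available
        elif scores[key] > threshold:
            kept.append(line)          # keep the original casing
    return kept, missing

scores = {
    "see our programs": 0.13878001272678375,
    "see our impact": 0.7193490266799927,
    "see our commitment": 0.734010636806488,
}
lines = ["See our programs\n", "See our impact\n",
         "See our commitment\n", "Unscored line\n"]

kept, missing = keep_lines(lines, scores)
print(kept)
print(missing)
```

Returning the unscored keys separately makes it easy to check how many lines fell outside the deduplicated prediction set.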


In [None]:
# cycle through original files and impute predictions 
os.makedirs(garbage_out_file_dir, exist_ok=True)

for filename in os.listdir(company_file_dir):
  if filename.endswith('.txt'):
    print ("Working on " + filename)
    co_keep_text = []
    with open( os.path.join(company_file_dir, filename) ) as file_h:
      for line in file_h:
        lc_line = line.lower().strip('\n')
        if (lc_line not in results_dict):
          print (lc_line)  # no prediction available for this line
        elif (results_dict[lc_line] > 0.5):
          co_keep_text.append(line)

    # write kept lines inside the .txt branch so non-.txt entries are skipped
    with open( os.path.join(garbage_out_file_dir, filename), 'w') as keep_f:
      keep_f.writelines(co_keep_text)

Working on Magnachip Semiconductor.txt
Working on Cool Planet Energy Systems.txt
Working on Two Blades Foundation.txt
Working on CoolEarth Solar.txt
Working on ASML Netherlands BV.txt
Working on Nikon.txt
Working on Nutech Ventures.txt
Working on SolarLego.txt
Working on Altex Technologies.txt
Working on QUALCOMM.txt
Working on Performance Plants.txt
Working on Veracode.txt
Working on Minebea Co.txt
Working on SolAero Technologies Corp.txt
Working on Aerogen.txt
Working on UT-Battelle.txt
Working on Rapamycin Holdings.txt
Working on Samsung Electronics.txt
Working on Michigan Biotechnology Institute.txt
Working on Dexerials.txt
Working on Teradata US.txt
Working on Arkema.txt
Working on Nanomix.txt
Working on Gracenote.txt
Working on Revera.txt
Working on Turf Group.txt
Working on Plant Sensory Systems.txt
Working on Nippon Shokubai Co.txt
Working on HARMAN INDUSTRIES.txt
Working on Midrex Technologies.txt
Working on KT.txt
Working on AC.txt
Working on Medicis Pharmaceutical.txt
Workin