## Data Cleaning: 

### Import Data: 

In [1]:
import pandas as pd

df_ads = pd.read_csv('final_testing_dataset.csv')

# Keep only the text and label columns (SQL equivalent: SELECT cc_text, ad FROM ads_nonads)
df_ads = df_ads[["cc_text", "ad"]]

In [2]:
print(df_ads.head())
print(df_ads.shape)
print(df_ads.info())  # info() prints its own report and returns None, hence the trailing "None"

                                             cc_text   ad
0  creating havoc in our supply chains, and raisi...  1.0
1  So he could charge rich tourists $12,500 for p...  1.0
2  rock band foreign I'm Elissa Slotkin, and I'm ...  1.0
3  In the meantime I think we have to provide the...  0.0
4  And right now, we'll even pay off your phone w...  1.0
(1009, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1009 entries, 0 to 1008
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   cc_text  1009 non-null   object 
 1   ad       1008 non-null   float64
dtypes: float64(1), object(1)
memory usage: 15.9+ KB
None


### Cleaning

In [3]:
# drop rows with any missing values
df_ads = df_ads.dropna()
print(df_ads.shape)
# drop duplicate rows
df_ads = df_ads.drop_duplicates()
print(df_ads.shape)
# drop rows where 'ad' is not 0 or 1
df_ads = df_ads[df_ads['ad'].isin([0, 1])]
print(df_ads.shape)

# Convert 'cc_text' column to string
df_ads['cc_text'] = df_ads['cc_text'].astype(str)

# Convert 'ad' column to integer
df_ads['ad'] = df_ads['ad'].astype(int)

(1008, 2)
(1007, 2)
(1007, 2)


In [4]:
# data check after cleaning
print(df_ads["ad"].value_counts())

ad
0    509
1    498
Name: count, dtype: int64


In [5]:
# print the first rows where ad is 1
print(df_ads[df_ads["ad"] == 1].head())

# print the first rows where ad is 0
print(df_ads[df_ads["ad"] == 0].head())

                                             cc_text  ad
0  creating havoc in our supply chains, and raisi...   1
1  So he could charge rich tourists $12,500 for p...   1
2  rock band foreign I'm Elissa Slotkin, and I'm ...   1
4  And right now, we'll even pay off your phone w...   1
5  If you think you might be pregnant, you want t...   1
                                              cc_text  ad
3   In the meantime I think we have to provide the...   0
14  them. anothern side, sam brown will be moving ...   0
16  i'm bernie rayno join bri guy and myself at 6 ...   0
25  apple is needed. just in this crew. on the sid...   0
28  i'm bernie rayno join me and ariella is back o...   0


#### The classes are roughly balanced (509 non-ads vs. 498 ads), so we move on to text cleaning

In [6]:
df_ads["cc_text"][1]

"So he could charge rich tourists $12,500 for prime elk hunting. No wonder Sheehy said he'd end protections for public lands. So if you're for hunting and for access to public lands... ?you can't be for Shady Sheehy. Montana Outdoor Values Action Fund is responsible for the content of this ad."

In [7]:
df_ads["cc_text"][4]

'And right now, we\'ll even pay off your phone when you switch! ? (vo) For over 50 years Purina Cat Chow has been helping cats feel at home. With trusted nutrition, no wonder it\'s the number one dry cat food in America. Come home to Cat Chow. ? ("Ladies\' Night By: Kool & the Gang) ? (?) (?) Get your grills out this summer with Pepsi, the official beverage of Grills Night Out.'

In [8]:
import re

def clean_text(text):
    # Remove HTML tags (anything between < and >)
    text = re.sub(r'<.*?>', '', text)
    # Lowercase
    text = text.lower()
    # Keep only lowercase letters, whitespace, commas, and periods
    # (this also strips digits, apostrophes, and symbols such as $ and ?)
    text = re.sub(r'[^a-z\s,.]', '', text)
    # Remove special characters at the beginning of the sentence
    # text = re.sub(r'^[^A-Za-z0-9\s]+', '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text


In [9]:
sample_1 = clean_text(df_ads["cc_text"][1])
print(sample_1)

so he could charge rich tourists , for prime elk hunting. no wonder sheehy said hed end protections for public lands. so if youre for hunting and for access to public lands... you cant be for shady sheehy. montana outdoor values action fund is responsible for the content of this ad.


In [10]:
sample_2 = clean_text(df_ads["cc_text"][4])
print(sample_2)

and right now, well even pay off your phone when you switch vo for over years purina cat chow has been helping cats feel at home. with trusted nutrition, no wonder its the number one dry cat food in america. come home to cat chow. ladies night by kool the gang get your grills out this summer with pepsi, the official beverage of grills night out.
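Both samples show a side effect of the character filter: digits are removed, so "$12,500" collapses to a stray comma and "over 50 years" becomes "over years". If numbers and prices carry signal for the classifier, one option is a variant that keeps digits and dollar signs. The function below is a hypothetical alternative (the name `clean_text_keep_digits` does not appear in the notebook) and is only a sketch of that idea:

```python
import re

def clean_text_keep_digits(text):
    # Hypothetical variant of clean_text, not part of the original notebook.
    text = re.sub(r'<.*?>', '', text)            # drop HTML-like tags
    text = text.lower()
    # Keep letters, digits, whitespace, commas, periods, and dollar signs
    text = re.sub(r'[^a-z0-9\s,.$]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()     # collapse repeated whitespace
    return text

sample = "So he could charge rich tourists $12,500 for prime elk hunting."
print(clean_text_keep_digits(sample))
# so he could charge rich tourists $12,500 for prime elk hunting.
```

Whether this helps depends on the model: the rest of the notebook was run with the original `clean_text`, so the reported results do not reflect this variant.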


In [11]:
# Apply clean_text function to cc_text column
df_ads["cc_text"] = df_ads["cc_text"].apply(clean_text)

In [12]:
df_ads["cc_text"].head()

0    creating havoc in our supply chains, and raisi...
1    so he could charge rich tourists , for prime e...
2    rock band foreign im elissa slotkin, and im ru...
3    in the meantime i think we have to provide the...
4    and right now, well even pay off your phone wh...
Name: cc_text, dtype: object

### Train-test split

In [13]:
from sklearn.model_selection import train_test_split

# Split the data into features and target
X = df_ads["cc_text"]
y = df_ads["ad"]

# Step 1: Split the data into training+validation and testing sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: Split the training+validation set into separate training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42, stratify=y_train_val) # 0.25 x 0.8 = 0.2

# Print the shapes of the resulting datasets to verify
print("Training set size:", X_train.shape, y_train.shape)
print("Validation set size:", X_val.shape, y_val.shape)
print("Testing set size:", X_test.shape, y_test.shape)

Training set size: (603,) (603,)
Validation set size: (202,) (202,)
Testing set size: (202,) (202,)
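The sizes above follow from applying the two fractions in sequence to the 1007 cleaned rows. The sketch below reproduces the arithmetic in plain Python, assuming the held-out count is rounded up at each split (which matches the printed sizes); the helper name `split_sizes` is made up for illustration:

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac_of_rest=0.25):
    # First split: hold out the test set (rounded up)
    n_test = math.ceil(test_frac * n_rows)
    n_train_val = n_rows - n_test
    # Second split: hold out the validation set from what remains (rounded up)
    n_val = math.ceil(val_frac_of_rest * n_train_val)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

print(split_sizes(1007))  # (603, 202, 202)
```

This gives roughly a 60/20/20 split, since 0.25 of the remaining 80% is 20% of the full dataset.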


### You can now use the training set to build your model, the validation set to monitor it, and the test set to measure final performance.

## BERT Processing: 

### Tokenizer

In [14]:
# Run this once and restart the kernel
#%pip install transformers[sentencepiece] 

In [15]:
#%pip install torch torchvision torchaudio

In [16]:

# from transformers import BertTokenizer

# checkpoint = "bert-base-cased"
# tokenizer = BertTokenizer.from_pretrained(checkpoint)
# from transformers import AutoModelForSequenceClassification

# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# num_labels = 2  # 2 labels: 0 for non-ads, 1 for ads

# model = (AutoModelForSequenceClassification
#          .from_pretrained(checkpoint, num_labels=num_labels)
#          .to(device))

In [14]:

from transformers import BertTokenizer

checkpoint = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(checkpoint)


In [15]:
# Tokenize the training, test, and validation sets
X_train_tokens = tokenizer(X_train.tolist(), padding=True, truncation=True)
X_test_tokens = tokenizer(X_test.tolist(), padding=True, truncation=True)
X_val_tokens = tokenizer(X_val.tolist(), padding=True, truncation=True)

In [16]:
# Inspect which fields the tokenizer returned
print(X_train_tokens.keys())
print(X_train_tokens.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
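Two of these fields are used later: `input_ids` and `attention_mask`. With `padding=True`, shorter sequences are padded to the longest one in the call, and the attention mask marks real tokens with 1 and padding with 0. The toy function below imitates that behaviour in plain Python; it is a stand-in for illustration, not the tokenizer itself, and the token ids and the pad id of 0 are assumptions made for the example:

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching attention masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative token ids only; real ids come from the tokenizer vocabulary.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because each split is tokenized in its own call, each split is padded to its own longest sequence, so the three tensors built from them can have different widths.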


In [17]:
y_train.value_counts()

ad
0    305
1    298
Name: count, dtype: int64

In [18]:
import torch

train_seq = torch.tensor(X_train_tokens["input_ids"])
train_mask = torch.tensor(X_train_tokens["attention_mask"])
train_y = torch.tensor(y_train.tolist())

val_seq = torch.tensor(X_val_tokens["input_ids"])
val_mask = torch.tensor(X_val_tokens["attention_mask"])
val_y = torch.tensor(y_val.tolist())

test_seq = torch.tensor(X_test_tokens["input_ids"])
test_mask = torch.tensor(X_test_tokens["attention_mask"])
test_y = torch.tensor(y_test.tolist())

In [25]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }

# Create instances of the CustomDataset
train_dataset = CustomDataset(train_seq, train_mask, train_y)
val_dataset = CustomDataset(val_seq, val_mask, val_y)
test_dataset = CustomDataset(test_seq, test_mask, test_y)

In [23]:
# Use DataParallel if multiple GPUs are available
# if torch.cuda.device_count() > 1:
#     print(torch.cuda.device_count())
#     model = torch.nn.DataParallel(model)

In [21]:
from transformers import AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_labels = 2  # 2 labels: 0 for non-ads, 1 for ads

model = (AutoModelForSequenceClassification
         .from_pretrained(checkpoint, num_labels=num_labels)
         .to(device))

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
batch_size = 16
logging_steps = (len(train_dataset) // batch_size) 
logging_steps

37

In [27]:
from transformers import Trainer, TrainingArguments

# Define the training arguments
model_name = f"{checkpoint}-adremoval_testingdata"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  eval_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error",
                                  optim='adamw_torch',
                                  )

In [28]:

from sklearn.metrics import accuracy_score

def get_accuracy(preds):
    # preds.predictions holds the logits; argmax picks the predicted class per sample
    accuracy = accuracy_score(preds.label_ids, preds.predictions.argmax(axis=-1))
    return {'accuracy': accuracy}
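To make the metric concrete, the sketch below performs the same two steps (argmax over the logits, then the fraction of matches) in plain Python on made-up values. The function name `accuracy_from_logits` and the toy logits are illustrative and not part of the notebook:

```python
def accuracy_from_logits(logits, labels):
    # Predicted class = index of the largest logit in each row
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Made-up logits for four samples, columns = [non-ad, ad]
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

Accuracy is a reasonable headline metric here because the two classes are close to balanced; with a skewed label distribution, precision and recall would be worth reporting as well.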

In [29]:
torch.cuda.empty_cache()

# Initialize the Trainer
trainer = Trainer(
    model=model,
    compute_metrics=get_accuracy,
    args=training_args,
    train_dataset= train_dataset,
    eval_dataset= val_dataset,
    tokenizer=tokenizer
)

In [30]:
# Start training
trainer.train()

  0%|          | 0/76 [00:00<?, ?it/s]

{'loss': 0.5518, 'grad_norm': 5.558544635772705, 'learning_rate': 1.0263157894736844e-05, 'epoch': 0.97}


  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 0.4463832676410675, 'eval_accuracy': 0.8267326732673267, 'eval_runtime': 38.4375, 'eval_samples_per_second': 5.255, 'eval_steps_per_second': 0.338, 'epoch': 1.0}
{'loss': 0.292, 'grad_norm': 5.64611291885376, 'learning_rate': 5.263157894736843e-07, 'epoch': 1.95}


  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 0.31147944927215576, 'eval_accuracy': 0.8712871287128713, 'eval_runtime': 39.1658, 'eval_samples_per_second': 5.158, 'eval_steps_per_second': 0.332, 'epoch': 2.0}
{'train_runtime': 834.2506, 'train_samples_per_second': 1.446, 'train_steps_per_second': 0.091, 'train_loss': 0.41747593879699707, 'epoch': 2.0}


TrainOutput(global_step=76, training_loss=0.41747593879699707, metrics={'train_runtime': 834.2506, 'train_samples_per_second': 1.446, 'train_steps_per_second': 0.091, 'total_flos': 210714955351200.0, 'train_loss': 0.41747593879699707, 'epoch': 2.0})
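The numbers in the log above can be reproduced by hand. The sketch below assumes a single device, no gradient accumulation, and a linear learning-rate decay with no warmup (assumptions that are consistent with the logged values); the helper `linear_lr` is hypothetical:

```python
import math

n_train, batch_size, num_epochs, base_lr = 603, 16, 2, 2e-5

steps_per_epoch = math.ceil(n_train / batch_size)  # 38 optimizer steps per epoch
total_steps = steps_per_epoch * num_epochs         # 76, as in the progress bar
logging_steps = n_train // batch_size              # 37, as computed earlier

def linear_lr(step):
    # Linear decay from base_lr at step 0 to 0 at the final step
    return base_lr * (total_steps - step) / total_steps

for step in range(logging_steps, total_steps + 1, logging_steps):
    # Prints the step, the fractional epoch, and the learning rate at that step
    print(step, round(step / steps_per_epoch, 2), linear_lr(step))
```

Logging every 37 steps with 38 steps per epoch is why the training loss is reported at epochs 0.97 and 1.95 rather than exactly at 1.0 and 2.0.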

In [84]:
trainer.evaluate(test_dataset)

  0%|          | 0/13 [00:00<?, ?it/s]

{'eval_loss': 0.2973465025424957,
 'eval_accuracy': 0.8861386138613861,
 'eval_runtime': 54.8417,
 'eval_samples_per_second': 3.683,
 'eval_steps_per_second': 0.237,
 'epoch': 2.0}
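Since the validation and test sets each hold 202 samples, the reported accuracies translate directly into counts of correct and incorrect predictions. A small sketch of that conversion (the helper name `correct_count` is made up for this example):

```python
def correct_count(accuracy, n_samples):
    # Accuracy is correct / n_samples, so rounding recovers the integer count
    return round(accuracy * n_samples)

n_eval = 202
reported = {
    "validation, epoch 1": 0.8267326732673267,
    "validation, epoch 2": 0.8712871287128713,
    "test": 0.8861386138613861,
}
for name, acc in reported.items():
    n_correct = correct_count(acc, n_eval)
    print(f"{name}: {n_correct} correct, {n_eval - n_correct} wrong")
# validation, epoch 1: 167 correct, 35 wrong
# validation, epoch 2: 176 correct, 26 wrong
# test: 179 correct, 23 wrong
```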

In [31]:
trainer.save_model()

In [32]:
model_name 

'bert-base-uncased-adremoval_testingdata'

## Sample from another Dataset 

In [33]:
# label 1
from transformers import pipeline
classifier = pipeline('text-classification', model=model_name)
classifier('so he could charge rich tourists $12,500 for prime elk hunting. no wonder sheehy said he end protections for public lands. so if you re for hunting and for access to public lands... you cant be for shady. montana outdoor values action fund is responsible for the content of this ad.')
     

[{'label': 'LABEL_1', 'score': 0.88851398229599}]

In [34]:
# label 0
classifier('in the meantime i think we have to provide the studies necessary no thanks he s palestine residents say the study has a good first step but they also need medical care to go with it a resident still do not have also caller just a companion bill in the senate is being cosponsored by republican J. D. Vance and democrat sherrod brown to also help push the I. R. S. to announce wednesday the twenty one million dollars norfolk southern says it has paid directly to residents will not be taxed the norfolk southern do that there was a community the I. R. S. did damage on top of that and we fix that those who have already reported the payments on their twenty twenty three taxes will need to amend the returns to get a refund last month norfolk southern also agreed to pay three hundred ten million dollars for cleanup and other fees bringing its total expected costs related to the derailment to one point seven billion dollars in washington cary leahy spectrum news you number released this week')

[{'label': 'LABEL_0', 'score': 0.8665247559547424}]

In [35]:
# label 1
classifier("Need to get your a1c down? You may pay as little as $10 per prescription. Why do vitamins and supplements cost so much more now? Other companies are charging you more and more for less and less, and we hate that. ")

[{'label': 'LABEL_1', 'score': 0.8129271864891052}]
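The pipeline returns generic names (`LABEL_0`, `LABEL_1`) because no label names were attached to the model. Following the convention used during training (0 for non-ads, 1 for ads), a small post-processing step can make the output easier to read. The mapping and the helper `readable` below are illustrative additions, applied here to the result shown above:

```python
# Assumed mapping, based on the training convention: 0 = non-ad, 1 = ad
LABEL_NAMES = {"LABEL_0": "non-ad", "LABEL_1": "ad"}

def readable(results):
    # Replace the generic label with a descriptive one, keep the score as is
    return [{"label": LABEL_NAMES[r["label"]], "score": r["score"]} for r in results]

raw = [{'label': 'LABEL_1', 'score': 0.8129271864891052}]
print(readable(raw))  # [{'label': 'ad', 'score': 0.8129271864891052}]
```

Note that the three samples above were passed to the classifier without going through `clean_text`, so they still contain characters (for example "$12,500" and capital letters) that the training text did not.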