<a href="https://colab.research.google.com/github/deeksha2107/temp/blob/main/docs/2notebook/Example_1_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## sklearn and TextAttack

The following code trains two different text classification models using sklearn. Both are logistic regression models; the difference is in the features.

We will load data using `datasets`, train the models, and attack them using TextAttack.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

[![View Source on GitHub](https://img.shields.io/badge/github-view%20source-black.svg)](https://github.com/QData/TextAttack/blob/master/docs/2notebook/Example_1_sklearn.ipynb)

In [5]:
!pip install datasets



Remember to run **pip3 install textattack[tensorflow]** in your notebook environment before running the code below.

### Training

This code trains two models: one on bag-of-words statistics (`bow_unstemmed`) and one on tf–idf statistics (`tfidf_unstemmed`). The dataset is the Rotten Tomatoes movie review dataset (`rotten_tomatoes`), which is what `load_data` below actually loads.
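
To make the difference between the two feature sets concrete, here is a minimal pure-Python sketch of bag-of-words counts versus tf–idf weights on a toy corpus. The corpus and variable names are made up for illustration. The idf formula is the smoothed variant, `ln((1 + n) / (1 + df)) + 1`; the sketch skips stop-word removal, n-grams, and the per-row normalization that a real vectorizer would typically apply.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset).
docs = [
    "a great movie",
    "a dull movie",
    "great acting great story",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Bag-of-words: raw term counts per document.
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# tf-idf: counts reweighted by idf (no row normalization in this sketch).
tfidf = [[count * idf[term] for count, term in zip(row, vocab)] for row in bow]

for doc, counts, weights in zip(docs, bow, tfidf):
    print(doc)
    print("  bow:  ", dict(zip(vocab, counts)))
    print("  tfidf:", {t: round(w, 2) for t, w in zip(vocab, weights)})
```

Both representations share the same vocabulary and sparsity pattern; tf–idf only rescales each column, so terms that appear in few documents (such as `dull`) count for more than terms that appear in many (such as `movie`).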

In [10]:
import nltk  # the Natural Language Toolkit

nltk.download("punkt")  # The NLTK tokenizer
nltk.download("stopwords")  # Stopwords

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [14]:
import datasets
import os
import pandas as pd
import re
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn import preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Nice to see additional metrics
from sklearn.metrics import classification_report


def load_data(dataset_split="train"):
    dataset = datasets.load_dataset("rotten_tomatoes")[dataset_split]
    # Copy the review text and labels into a DataFrame
    df = pd.DataFrame()
    df["Review"] = [review["text"] for review in dataset]
    df["Sentiment"] = [review["label"] for review in dataset]
    # Replace every non-alphabetic character (digits included) with a space
    df["Review"] = df["Review"].apply(lambda x: re.sub("[^a-zA-Z]", " ", str(x)))
    # Tokenize and stem the reviews
    df_tokenized = tokenize_review(df)
    return df_tokenized


def tokenize_review(df):
    # Tokenize Reviews in training
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create word stems
    stemmed_tokens = []
    porter = PorterStemmer()
    for i in range(len(tokened_reviews)):
        stems = [porter.stem(token) for token in tokened_reviews[i]]
        stems = " ".join(stems)
        stemmed_tokens.append(stems)
    df.insert(1, column="Stemmed", value=stemmed_tokens)
    return df


def transform_BOW(training, testing, column_name):
    vect = CountVectorizer(
        max_features=100, ngram_range=(1, 3), stop_words="english"
    )
    vectFit = vect.fit(training[column_name])
    BOW_training = vectFit.transform(training[column_name])
    BOW_training_df = pd.DataFrame(
        BOW_training.toarray(), columns=vect.get_feature_names_out()
    )
    BOW_testing = vectFit.transform(testing[column_name])
    BOW_testing_Df = pd.DataFrame(
        BOW_testing.toarray(), columns=vect.get_feature_names_out()
    )
    return vectFit, BOW_training_df, BOW_testing_Df


def transform_tfidf(training, testing, column_name):
    Tfidf = TfidfVectorizer(
        ngram_range=(1, 3), max_features=100, stop_words="english"
    )
    Tfidf_fit = Tfidf.fit(training[column_name])
    Tfidf_training = Tfidf_fit.transform(training[column_name])
    Tfidf_training_df = pd.DataFrame(
        Tfidf_training.toarray(), columns=Tfidf.get_feature_names_out()
    )
    Tfidf_testing = Tfidf_fit.transform(testing[column_name])
    Tfidf_testing_df = pd.DataFrame(
        Tfidf_testing.toarray(), columns=Tfidf.get_feature_names_out()
    )
    return Tfidf_fit, Tfidf_training_df, Tfidf_testing_df


def add_augmenting_features(df):
    tokened_reviews = [word_tokenize(rev) for rev in df["Review"]]
    # Create feature that measures length of reviews
    len_tokens = []
    for i in range(len(tokened_reviews)):
        len_tokens.append(len(tokened_reviews[i]))
    len_tokens = preprocessing.scale(len_tokens)
    df.insert(0, column="Lengths", value=len_tokens)

    # Average characters per word: string length (spaces included) divided by word count
    Average_Words = [len(x) / (len(x.split())) for x in df["Review"].tolist()]
    Average_Words = preprocessing.scale(Average_Words)
    df["averageWords"] = Average_Words
    return df


def build_model(X_train, y_train, X_test, y_test, name_of_test):
    log_reg = LogisticRegression(C=30, max_iter=200).fit(X_train, y_train)
    y_pred = log_reg.predict(X_test)
    print(
        "Training accuracy of " + name_of_test + ": ", log_reg.score(X_train, y_train)
    )
    print("Testing accuracy of " + name_of_test + ": ", log_reg.score(X_test, y_test))
    print(classification_report(y_test, y_pred))  # Evaluating prediction ability
    return log_reg


# Load training and test sets
# Loading reviews into DF
df_train = load_data("train")

print("...successfully loaded training data")
print("Total length of training data: ", len(df_train))
# Add augmenting features
df_train = add_augmenting_features(df_train)
print("...augmented data with len_tokens and average_words")

# Load test DF
df_test = load_data("test")

print("...successfully loaded testing data")
print("Total length of testing data: ", len(df_test))
df_test = add_augmenting_features(df_test)
print("...augmented data with len_tokens and average_words")

# Create unstemmed BOW features for training set
unstemmed_BOW_vect_fit, df_train_bow_unstem, df_test_bow_unstem = transform_BOW(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed BOW data")

# Create TfIdf features for training set
unstemmed_tfidf_vect_fit, df_train_tfidf_unstem, df_test_tfidf_unstem = transform_tfidf(
    df_train, df_test, "Review"
)
print("...successfully created the unstemmed TFIDF data")

# Running logistic regression on dataframes
bow_unstemmed = build_model(
    df_train_bow_unstem,
    df_train["Sentiment"],
    df_test_bow_unstem,
    df_test["Sentiment"],
    "BOW Unstemmed",
)

tfidf_unstemmed = build_model(
    df_train_tfidf_unstem,
    df_train["Sentiment"],
    df_test_tfidf_unstem,
    df_test["Sentiment"],
    "TFIDF Unstemmed",
)

...successfully loaded training data
Total length of training data:  8530
...augmented data with len_tokens and average_words
...successfully loaded testing data
Total length of testing data:  1066
...augmented data with len_tokens and average_words
...successfully created the unstemmed BOW data
...successfully created the unstemmed TFIDF data
Training accuracy of BOW Unstemmed:  0.6228604923798359
Testing accuracy of BOW Unstemmed:  0.6060037523452158
              precision    recall  f1-score   support

           0       0.59      0.69      0.64       533
           1       0.63      0.52      0.57       533

    accuracy                           0.61      1066
   macro avg       0.61      0.61      0.60      1066
weighted avg       0.61      0.61      0.60      1066

Training accuracy of TFIDF Unstemmed:  0.6227432590855803
Testing accuracy of TFIDF Unstemmed:  0.6078799249530957
              precision    recall  f1-score   support

           0       0.60      0.68      0.63   
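
The attack cells further down load a vectorizer and a model from `tfidf_vect_fit2.pkl` and `tfidf_lr_model2.pkl`, but no cell in this notebook writes those files. The sketch below shows one way such files could be produced and read back with `pickle`. The helper names and the stand-in object are hypothetical; in the notebook you would presumably pass a fitted vectorizer such as `unstemmed_tfidf_vect_fit` and a trained model such as `tfidf_unstemmed`, though whether the original files came from those objects is an assumption.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object; only do this for files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be a fitted vectorizer or model.
stand_in = {"vocabulary": ["good", "bad"], "coef": [1.5, -2.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in.pkl")  # hypothetical file name
    save_pickle(stand_in, path)
    restored = load_pickle(path)

print(restored == stand_in)
```

Pickled sklearn objects are tied to the library version that wrote them, so loading them under a different sklearn version can emit warnings or fail.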

### Attacking

TextAttack includes a built-in `SklearnModelWrapper` that can run attacks on most sklearn models. (If your tokenization strategy differs from the one above, you may need to subclass `SklearnModelWrapper` so that the model inputs and outputs come in the correct format.)
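
The contract a wrapper has to satisfy is small: take a list of raw strings and return one row of class scores per string. The following sketch illustrates that contract without importing TextAttack or sklearn. `ToyVectorizer`, `ToyModel`, and `ToyModelWrapper` are hypothetical stand-ins written for this example, not library classes.

```python
import math


class ToyVectorizer:
    """Stand-in for a fitted vectorizer (hypothetical)."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def transform(self, texts):
        # One row of term counts per input text.
        return [
            [text.lower().split().count(term) for term in self.vocabulary]
            for text in texts
        ]


class ToyModel:
    """Stand-in for a binary classifier exposing predict_proba (hypothetical)."""

    def __init__(self, weights):
        self.weights = weights

    def predict_proba(self, rows):
        probs = []
        for row in rows:
            z = sum(w * x for w, x in zip(self.weights, row))
            p_pos = 1.0 / (1.0 + math.exp(-z))
            probs.append([1.0 - p_pos, p_pos])
        return probs


class ToyModelWrapper:
    """Maps a list of strings to one row of class probabilities per string."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def __call__(self, text_input_list):
        features = self.tokenizer.transform(text_input_list)
        return self.model.predict_proba(features)


wrapper = ToyModelWrapper(ToyModel([2.0, -2.0]), ToyVectorizer(["good", "bad"]))
scores = wrapper(["a good film", "a bad film", "a film"])
for row in scores:
    print([round(p, 3) for p in row])
```

An attack only ever sees the wrapper's `__call__`, so any preprocessing the model needs (cleaning, stemming, vectorizing) has to happen inside it.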

Once we initialize the model wrapper, we load a few test samples (here from a local CSV file, `Dataset2_2Test.csv`) and run the `TextFoolerJin2019` attack on our model.
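
The attack summary printed further down lists `GreedyWordSwapWIR` with `wir_method: delete` as the search method. As a rough illustration of that idea, and not TextAttack's actual implementation, the sketch below ranks words by how much deleting each one lowers a toy model's positive score, then greedily swaps the most important words for candidates until the prediction moves to class 0. The word weights, the synonym table, and all function names are hypothetical, and the sketch omits the embedding, part-of-speech, and sentence-encoder constraints that the real recipe applies.

```python
import math

# Hypothetical word weights for a toy sentiment model.
WEIGHTS = {"great": 2.0, "good": 1.0, "fine": 0.2, "okay": 0.1, "boring": -1.0}

# Hypothetical synonym table standing in for embedding-based candidates.
SYNONYMS = {"great": ["fine", "okay"], "good": ["fine", "okay"], "movie": ["film"]}


def positive_score(words):
    """Toy model: probability of the positive class (class 1)."""
    z = sum(WEIGHTS.get(word, 0.0) for word in words)
    return 1.0 / (1.0 + math.exp(-z))


def rank_by_deletion(words, score_fn):
    """Order word indices by how much deleting each word lowers the score."""
    base = score_fn(words)
    drops = [
        (base - score_fn(words[:i] + words[i + 1:]), i) for i in range(len(words))
    ]
    return [i for _, i in sorted(drops, key=lambda pair: -pair[0])]


def greedy_word_swap(text, score_fn, synonyms, threshold=0.5):
    """Swap words, most important first, until the score falls below threshold."""
    words = text.split()
    for i in rank_by_deletion(words, score_fn):
        best_words, best_score = words, score_fn(words)
        for candidate in synonyms.get(words[i], []):
            trial = words[:i] + [candidate] + words[i + 1:]
            trial_score = score_fn(trial)
            if trial_score < best_score:
                best_words, best_score = trial, trial_score
        words = best_words
        if best_score < threshold:
            return " ".join(words), best_score, True
    return " ".join(words), score_fn(words), False


original = "great movie with boring scenes but good acting"
adversarial, score, success = greedy_word_swap(original, positive_score, SYNONYMS)
print(adversarial, round(score, 3), success)
```

In this toy run the importance ranking is computed once on the original text, and two swaps are needed: replacing `great` alone leaves the positive score just above 0.5, and replacing `good` as well pushes it below.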

In [18]:
!pip install textattack
import textattack


textattack: Updating TextAttack package dependencies.
textattack: Downloading NLTK required packages.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw to /root/nltk_data...
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

In [33]:
import pickle

from textattack.models.wrappers import SklearnModelWrapper


class MySklearnModelWrapper(SklearnModelWrapper):
    def __call__(self, text_input_list, batch_size=32):
        # batch_size is accepted for interface compatibility but is unused here
        # Vectorize the raw strings with the fitted vectorizer
        encoded_text_matrix = self.tokenizer.transform(text_input_list).toarray()
        # Rebuild a DataFrame with the feature names from the fitted vectorizer
        tokenized_text_df = pd.DataFrame(
            encoded_text_matrix, columns=self.tokenizer.get_feature_names_out()
        )
        # Return one row of class probabilities per input string
        return self.model.predict_proba(tokenized_text_df)


# Load the pickled vectorizer and model (only unpickle files you trust)
with open("tfidf_vect_fit2.pkl", "rb") as file:
    tokenizer = pickle.load(file)

with open("tfidf_lr_model2.pkl", "rb") as file:
    model = pickle.load(file)

model_wrapper = MySklearnModelWrapper(model, tokenizer)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [44]:
from textattack import Attacker
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import Dataset
from textattack.goal_functions import TargetedClassification

# Build (text, label) pairs from the test CSV
test_data = pd.read_csv("Dataset2_2Test.csv")
dataset = Dataset(
    [(row["final_cleaned_text"], row["label"]) for _, row in test_data.iterrows()]
)

# Build the TextFooler recipe, then replace its goal with a targeted one (class 0)
attack = TextFoolerJin2019.build(model_wrapper)
attack.goal_function = TargetedClassification(model_wrapper, target_class=0)

attacker = Attacker(attack, dataset)
attacker.attack_dataset()

textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.untargeted_classification.UntargetedClassification'>.
textattack: Unknown if model of class <class 'sklearn.linear_model._logistic.LogisticRegression'> compatible with goal function <class 'textattack.goal_functions.classification.targeted_classification.TargetedClassification'>.


Attack(
  (search_method): GreedyWordSwapWIR(
    (wir_method):  delete
  )
  (goal_function):  TargetedClassification(
    (target_class):  0
  )
  (transformation):  WordSwapEmbedding(
    (max_candidates):  50
    (embedding):  WordEmbedding
  )
  (constraints): 
    (0): WordEmbeddingDistance(
        (embedding):  WordEmbedding
        (min_cos_sim):  0.5
        (cased):  False
        (include_unknown_words):  True
        (compare_against_original):  True
      )
    (1): PartOfSpeech(
        (tagger_type):  nltk
        (tagset):  universal
        (allow_verb_noun_swap):  True
        (compare_against_original):  True
      )
    (2): UniversalSentenceEncoder(
        (metric):  angular
        (threshold):  0.840845057
        (window_size):  15
        (skip_text_shorter_than_window):  True
        (compare_against_original):  False
      )
    (3): RepeatModification
    (4): StopwordModification
    (5): InputColumnModification(
        (matching_column_labels):  ['premi


  0%|          | 0/10 [00:00<?, ?it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 1 / 1:  10%|█         | 1/10 [00:00<00:00, 58.77it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 2 / 2:  20%|██        | 2/10 [00:00<00:00, 61.98it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 3 / 3:  30%|███       | 3/10 [00:00<00:00, 63.56it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 4 / 4:  40%|████      | 4/10 [00:00<00:00, 62.03it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 5 / 5:  50%|█████     | 5/10 [00:00<00:00, 61.99it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 6 / 6:  60%|██████    | 6/10 [00:00<00:00, 64.63it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 6 / 6:  70%|███████   | 7/10 [00:00<00:00, 66.10it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 7 / 7:  70%|███████   | 7/10 [00:00<00:00, 63.35it/s]
[Succeeded / Failed / Skipped / Total] 0 / 0 / 8 / 8:  80%|████████  | 8/10 [00:00<00:00, 66.24it/s]
[Succeeded / Failed /

Encoded Text Matrix Shape: (1, 100)
Number of Features: 100
--------------------------------------------- Result 1 ---------------------------------------------

 verification code from alabama group mim eversion contents ty le tex ht ml chars e tut f content transfer encoding bit mc task id mci p group online templet id content length antivirus vast vs inbound message x antivirus status cleaner james blue you are confirming login please enter the following code please pay attention after verification


Encoded Text Matrix Shape: (1, 100)
Number of Features: 100
--------------------------------------------- Result 2 ---------------------------------------------

 wet pussy request come and make my holes derive never been faked properly can you give me favor and fulfill my dream nice ginger hair and big bubbly boobs do you think my books big enough bye cute


Encoded Text Matrix Shape: (1, 100)
Number of Features: 100
--------------------------------------------- Result 3 --------------




[<textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8130>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8040>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff9ae0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac24aece50>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8670>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8af0>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8970>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff8370>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cffa260>,
 <textattack.attack_results.skipped_attack_result.SkippedAttackResult at 0x7dac1cff9120>]
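
Every entry in the list above is a `SkippedAttackResult`, so no sample was actually attacked in this run. As far as I understand TextAttack's behavior, a sample is skipped when the goal is already met on the unmodified input (for a targeted goal, for example, when the model already predicts the target class); treat that explanation as an assumption and check it against your own predictions. A small helper like the one below can tally results by type. The three classes here are empty stand-ins that only borrow the names of TextAttack's result classes, and `summarize` is a hypothetical helper, not part of the library.

```python
from collections import Counter


# Empty stand-ins for illustration; real runs return TextAttack result objects.
class SuccessfulAttackResult:
    pass


class FailedAttackResult:
    pass


class SkippedAttackResult:
    pass


def summarize(results):
    """Count results by class name and compute the success rate over attempted samples."""
    counts = Counter(type(result).__name__ for result in results)
    attempted = counts["SuccessfulAttackResult"] + counts["FailedAttackResult"]
    success_rate = counts["SuccessfulAttackResult"] / attempted if attempted else None
    return counts, success_rate


# Mirror the run above: ten samples, all skipped.
results = [SkippedAttackResult() for _ in range(10)]
counts, success_rate = summarize(results)
print(dict(counts), success_rate)
```

A success rate of `None` here signals that nothing was attempted, which is different from an attack that was tried and failed on every sample.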

### Conclusion
We were able to train a model on the Rotten Tomatoes dataset using `sklearn` and use it in TextAttack by wrapping it with `SklearnModelWrapper`. It's that simple!