pip install gensim

In [1]:
import pandas as pd
import numpy as np
import os
import re
import spacy
from transformers import BertTokenizer, BertModel 
import torch 
from typing import  Tuple
from sklearn import pipeline, svm
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import  ConfusionMatrixDisplay , precision_score , recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt
from category_encoders.hashing import HashingEncoder
from gensim.test.utils import common_texts

### Objective
The task is to classify Tweets into True/False. True indicates that the tweet refers to an actual natural disaster. False can be anything else. 

We will assume that, due to the nature of the task, false negatives are more problematic than false positives. For this reason we will pay close attention to the recall score.
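
To make the trade-off concrete, here is a minimal sketch that computes recall and precision from confusion-matrix counts. The counts are invented purely for illustration and are not results from this dataset.

```python
# Hypothetical confusion-matrix counts, invented for illustration only
true_positive = 80   # disaster tweets correctly flagged
false_negative = 20  # disaster tweets that were missed
false_positive = 40  # harmless tweets flagged by mistake

# recall: share of real disaster tweets that the model caught
recall = true_positive / (true_positive + false_negative)
# precision: share of flagged tweets that really were disasters
precision = true_positive / (true_positive + false_positive)

print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Every missed disaster tweet (a false negative) lowers recall, which is why recall is the score to watch under the assumption above.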

This exercise is based on the Kaggle challenge: https://www.kaggle.com/competitions/nlp-getting-started

Sources, examples and documentation used by this notebook are referenced at the end. General reference material is used, but examples of prior completions of this Kaggle exercise are avoided.


# Data Exploration

In [2]:
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
train[0:10]

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [None]:
train.groupby('target').count()

The dataset is skewed towards the negative class (4342 vs 3271), which will need to be factored into the evaluation and the train/test split.
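
One way to factor the skew into the split is a stratified split, which keeps the class ratio roughly the same in both parts. The sketch below uses toy labels with the same class counts as above; the `rows` list and the `positive_share` helper are hypothetical stand-ins, not part of the notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# toy labels with the same class counts as the training set (4342 negative, 3271 positive)
labels = [0] * 4342 + [1] * 3271
rows = list(range(len(labels)))  # hypothetical stand-in for the tweet rows

# stratify=labels asks for the class ratio to be preserved in both parts
train_rows, test_rows, train_labels, test_labels = train_test_split(
    rows, labels, test_size=0.2, random_state=42, stratify=labels
)

def positive_share(y):
    """Fraction of samples labelled 1 (hypothetical helper for this sketch)."""
    return Counter(y)[1] / len(y)

print(round(positive_share(labels), 3),
      round(positive_share(train_labels), 3),
      round(positive_share(test_labels), 3))
```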

Below we will look through each of the three features to see if they need to be cleaned up and/or engineered to optimise them for the chosen model.

### The location field:

In [None]:
train.groupby("location").id.nunique().sort_values(ascending=True).head(50)

Looks like there are some non-location strings.

The locations don't follow a consistent format.

This feature will need to be cleaned to ensure there is a consistent naming convention and that each location is a real-world location. It may also be useful to split out the parts of the location so that we can test the model with different levels of geographic granularity.

In [None]:
train.groupby('target')['location'].nunique()

### Keyword


In [None]:
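# assumption: no_of_words is the number of whitespace-separated words in each non-null keyword
no_of_words = train['keyword'].dropna().str.split().str.len()
[index for index, value in enumerate(no_of_words) if value > 1]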

There is just one keyword value per field

In [None]:
train[(train['keyword'].isna() ==True)].groupby('target').count()

As there are only 61 rows missing the keyword, I'll just remove those rows from the training data.

In [None]:
# drop the rows with a missing keyword while keeping every column
train = train.dropna(subset=['keyword'])

### Text

We can see from the count below that there are no null values for 'text'. These string values contain the Twitter hashtags, which might be a useful feature to extract.

In general, this field is the primary source of information in the data set, so it should be the primary area of effort.

In order to retain the semantic meaning of the sentences we will use a transformer to extract sentence embeddings for each tweet. Using the principle of transfer learning, we will use a large general-purpose transformer like BERT. The embeddings produced will then be combined with the other features and used as the input for the final classification.
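
The shapes involved can be sketched without loading BERT at all. The block below uses random numbers as a stand-in for the model output (an assumption made only to keep the example light) and shows how one vector per tweet is taken from token position 0, the [CLS] token, and then placed side by side with other features. The 20 extra columns are a hypothetical placeholder for the engineered features.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for BERT output: (n_tweets, n_tokens, hidden_size); bert-base uses hidden_size 768
last_hidden_state = rng.normal(size=(4, 12, 768))

# keep the vector at token position 0 ([CLS]) for every tweet
cls_embeddings = last_hidden_state[:, 0, :]

# hypothetical stand-in for the other engineered features (e.g. hashed location columns)
other_features = rng.normal(size=(4, 20))

# one row per tweet, embedding columns followed by the other feature columns
combined = np.hstack([cls_embeddings, other_features])
print(cls_embeddings.shape, combined.shape)
```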

In [None]:
train[(train['text'].isna() ==True)].groupby('target').count()

# Task Steps

1. Pipeline choice: identify the best way of tying the different steps together.
2. Remove non-locations from the location field, ensure a consistent format and split the different geographic layers into independent features.
3. Transform the keyword feature into a single word embedding in a usable format as a categorical feature.
4. Test and select a transformer.
5. Combine the features
6. Train and test a classifier.
7. Make final solution and select model to be submitted.




# Feature Engineering

## Location

In [None]:
location = train['location'].astype('string')

### NER LOC & GPE Identification

In [None]:
nlp = spacy.load("en_core_web_sm")   

doc_lst = []

for l in location:
    if pd.isna(l):
        doc_lst.append(l)
    else:
        doc = nlp(l)
        doc_lst.append(doc)

In [None]:
# check results
for i in doc_lst[0:100]:
    if pd.isna(i):
        'do nothing'
    else:
        print([(X.text, X.label_) for X in i.ents])

In [None]:
# try with a different model
trf = spacy.load("en_core_web_lg") 

doc_lst_trf = []

for l in location:
    if pd.isna(l):
        doc_lst_trf.append(l)
    else:
        doc = trf(l)
        doc_lst_trf.append(doc)
        
for i in doc_lst_trf[0:100]:
    if pd.isna(i):
        'do nothing'
    else:
        # print each location string alongside the entities found in it
        print(i.text, [(X.text, X.label_) for X in i.ents])

A lot of locations are being identified as ORGs. The accuracy of this method isn't great. Perhaps some rule-based matching will work better.

Calculate the accuracy!

### Rules based country , city & state extraction

In [None]:
cities = pd.read_csv('/kaggle/input/world-cities/worldcities.csv')
cities.head()                      

In [None]:
def geo_like (source_lst ,geo_lst ):
    dest_lst = []
    
    compiled_regex = [re.compile(r'(?<![^\W\d_])' + re.escape(x) + r'(?![^\W\d_])', re.IGNORECASE) for x in geo_lst]
    
    for i in source_lst:
        if pd.isna(i):
            dest_lst.append(None)
        else:
            row_gp_lst = [x for x, regex in zip(geo_lst, compiled_regex) if regex.search(i)]
            if not row_gp_lst :
                dest_lst.append(None)
            else:
                dest_lst.append(row_gp_lst)

    return dest_lst 

In [None]:
def find_long (dest_lst):
    dest_lst_2 = []
    for i in dest_lst:
        if i == None:
            dest_lst_2.append(None)
        else:
            dest_lst_2.append(max(i , key=len))
    return dest_lst_2
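

The two helpers above can be illustrated on a single string using only the standard library. The sketch below reuses the same regular expression as `geo_like` (a name only matches when it is not glued to another letter on either side) and then keeps the longest match, as `find_long` does. The candidate names and the location string are hypothetical.

```python
import re

def letter_bounded(name):
    # same pattern as geo_like: the name must not touch another letter on either side
    return re.compile(r'(?<![^\W\d_])' + re.escape(name) + r'(?![^\W\d_])', re.IGNORECASE)

candidates = ["York", "New York", "Yorkshire"]  # hypothetical gazetteer entries
location = "new york, USA"                      # hypothetical location string

# every candidate found in the string as a whole name (case-insensitive)
matches = [name for name in candidates if letter_bounded(name).search(location)]

# the longest match is taken to be the most specific one
longest = max(matches, key=len)
print(matches, longest)
```

Because of the letter boundaries, "York" is not matched inside "Yorkshire", which is what stops short names from matching inside longer, unrelated ones.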

In [None]:
## country
# the list of countries from the cities dataset doesn't give variations on country names, e.g. United States, USA, etc.
# there are probably datasets available that would cover most of the common permutations.

country_lst = cities['country'].unique()

country = find_long(geo_like(location , country_lst))

In [None]:
##  city

city_lst = cities['city'].unique()
city = find_long(geo_like(location , city_lst))

In [None]:
## state 

states = pd.read_csv('/kaggle/input/startup-success-prediction-dataset/D3/states.csv')

states_name_lst = states['State'].unique()

states_abv_lst = states['Abbreviation'].unique()

state_name = find_long(geo_like(location , states_name_lst))

state_abv = find_long(geo_like(location , states_abv_lst ))

In [None]:
states.head()

In [None]:
## add to the training dataset

train['country'] = country
train['city'] = city
train['state'] = state_name
train['state_abv'] = state_abv

In [None]:
train[(train['location'].isna() ==False)].head()

In [None]:
# fill in blank countries where the city has been identified
singilton = cities.groupby('city')["country"].nunique().loc[lambda x: x==1].sort_values()

city_country = cities.merge(singilton , how = 'inner' , left_on ='city' , right_on = 'city')[["city" , "country_x"]].drop_duplicates()

train = train.merge(city_country , how ='left' , left_on = 'city', right_on = 'city'  )

train['country'] = train['country'].fillna(train['country_x'])


In [None]:
# fill in blank countries where the state has been identified, leaving all other rows untouched
has_state = train['state'].notna() | train['state_abv'].notna()
train.loc[has_state, 'country'] = train.loc[has_state, 'country'].fillna("United States")

# create one state column with the two letter code

In [None]:
train.loc[( train['location'].isna() == False)].head()

In [None]:
train[(train['city'].isna() ==True)].groupby('target').count()

# what to do about the null locations?

# Pipeline pre-processing steps

## Keyword preprocessing

Need to find an alternative to gensim's word2vec, as its sklearn API is unsupported.
It should be possible to find an sklearn or scipy text vectoriser optimised for a single word that still captures semantic meaning through the word's location in the vector space.

In [None]:
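class WordVectorTransformer(TransformerMixin,BaseEstimator):
    def __init__(self, variables=None, model="en_core_web_lg"):
        # variables: name of the column holding the words; None means X is already an iterable of words
        self.variables = variables
        self.model = model

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        nlp = spacy.load(self.model)
        words = X if self.variables is None else X[self.variables]
        # one row per word, one column per dimension of the spaCy vector
        return np.concatenate([nlp(word).vector.reshape(1,-1) for word in words])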

### Handling the skew

This is handled in the SVM estimator by the class_weight parameter (set to "balanced"); perhaps something similar can be done for the other estimators?
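
As a sketch of what "balanced" weighting amounts to, scikit-learn's `compute_class_weight` gives each class the weight n_samples / (n_classes * class_count). The block below applies it to toy labels with the class counts seen earlier. For estimators that have no class weight option, passing per-row weights as `sample_weight` to `fit` may be an alternative where the estimator supports it; that is an assumption to check per estimator.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels with the same class counts as the training set
y = np.array([0] * 4342 + [1] * 3271)

# "balanced" weight per class: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))

# per-row weights, for estimators whose fit() accepts sample_weight
sample_weight = np.where(y == 1, class_weight[1], class_weight[0])
print(class_weight)
```

The minority (positive) class ends up with the larger weight, so mistakes on disaster tweets cost more during training.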

## Text preprocessing

In [None]:
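class tokenizer(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.pre_trained = BertTokenizer.from_pretrained("bert-base-uncased")
        self.add_special_tokens = True
        # stored under the parameter's own name so that BaseEstimator.get_params() can find it
        self.variables = variables

    def _tokenize(self, text: str) -> Tuple[torch.Tensor, torch.Tensor]:
        tokenized = self.pre_trained.encode_plus(
            text,
            add_special_tokens=self.add_special_tokens,
            max_length=512,
            truncation=True,  # truncate explicitly to max_length
        )
        return (
            torch.tensor(tokenized["input_ids"]).unsqueeze(0),
            torch.tensor(tokenized["attention_mask"]).unsqueeze(0),
        )

    def transform(self, X):
        # accept either a column name or a one-element list such as ['text']
        col = self.variables if isinstance(self.variables, str) else self.variables[0]
        X = X.copy()  # avoid mutating the caller's DataFrame
        # each cell becomes an (input_ids, attention_mask) pair
        X[col] = [self._tokenize(string) for string in X[col].tolist()]
        return X

    def fit(self, X, y=None):
        return self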

In [None]:
class bertmodel(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.bert_model = BertModel.from_pretrained("bert-base-uncased")
        self.variables = variables

    def _berty(self, tokens, attention_mask):
        with torch.no_grad():
            embeddings = self.bert_model(tokens, attention_mask=attention_mask)
        last_hidden_state = embeddings[0]
        # the [CLS] token is at position 0 of every sequence
        get_cls = last_hidden_state[:, 0, :]
        return get_cls

    def transform(self, X):
        # accept either a column name or a one-element list such as ['text']
        col = self.variables if isinstance(self.variables, str) else self.variables[0]
        # Open question: how to ensure that each step gets the variables it needs, works only on
        # those, and that the final output contains all of the pre-processed features and not the
        # others (or maybe the others don't have to be removed?).
        # each cell of X[col] is expected to hold an (input_ids, attention_mask) pair
        cls_vectors = [self._berty(tokens, attention_mask) for tokens, attention_mask in X[col]]
        # each vector has shape (1, hidden_size); join them into (n_samples, hidden_size)
        return torch.cat(cls_vectors).numpy()

    def fit(self, X, y=None):
        return self


## Location preprocessing

In [None]:
class hashingcustom(BaseEstimator, TransformerMixin):
    def __init__(self, variables):
        self.variables = variables
        self.he = HashingEncoder(
            cols=variables,
            n_components=20 * len(variables),
        )

    # may need to try with y_train values: couldn't find an explanation of why the hashing encoder
    # would need this data, but the example & documentation seem to suggest it does.
    def fit(self, X, y=None):
        X_ = X.loc[:, self.variables]
        self.he.fit(X_)
        return self

    def transform(self, X):
        X_ = X.loc[:, self.variables]
        # assumption: HashingEncoder returns a DataFrame by default, so no .toarray() call is made
        X_transformed = self.he.transform(X_)
        # replace the raw columns with their hashed versions, working on a copy of X
        X = X.drop(columns=self.variables)
        X[list(X_transformed.columns)] = X_transformed.to_numpy()
        return X

# Define Pipeline GridSearch , Cross Validation and Scoring

In [None]:
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py

def print_dataframe(filtered_cv_results):
    """Pretty print for filtered dataframe"""
    for mean_precision, std_precision, mean_recall, std_recall, params in zip(
        filtered_cv_results["mean_test_precision"],
        filtered_cv_results["std_test_precision"],
        filtered_cv_results["mean_test_recall"],
        filtered_cv_results["std_test_recall"],
        filtered_cv_results["params"],
    ):
        print(
            f"precision: {mean_precision:0.3f} (±{std_precision:0.03f}),"
            f" recall: {mean_recall:0.3f} (±{std_recall:0.03f}),"
            f" for {params}"
        )
    print()


def refit_strategy(cv_results):
    # print the info about the grid-search for the different scores
    precision_threshold = 0.75

    cv_results_ = pd.DataFrame(cv_results)
    print("All grid-search results:")
    print_dataframe(cv_results_)

    # Filter-out all results below the threshold
    high_precision_cv_results = cv_results_[
        cv_results_["mean_test_precision"] > precision_threshold
    ]

    print(f"Models with a precision higher than {precision_threshold}:")
    print_dataframe(high_precision_cv_results)

    high_precision_cv_results = high_precision_cv_results[
        [
            "mean_score_time",
            "mean_test_recall",
            "mean_test_precision",
            "std_test_recall",
            "std_test_precision",
            "rank_test_recall",
            "rank_test_precision",
            "params",
        ]
    ]

    # Select the most performant models in terms of recall
    # (within 1 sigma from the best)
    best_recall_std = high_precision_cv_results["mean_test_recall"].std()
    best_recall = high_precision_cv_results["mean_test_recall"].max()
    best_recall_threshold = best_recall - best_recall_std

    high_recall_cv_results = high_precision_cv_results[
        high_precision_cv_results["mean_test_recall"] > best_recall_threshold
    ]
    print(
        "Out of the previously selected high precision models, we keep all the\n"
        "the models within one standard deviation of the highest recall model:"
    )
    print_dataframe(high_recall_cv_results)

    # From the best candidates, select the fastest model to predict
    fastest_top_recall_high_precision_index = high_recall_cv_results[
        "mean_score_time"
    ].idxmin()

    print(
        "\nThe selected final model is the fastest to predict out of the previously\n"
        "selected subset of best models based on precision and recall.\n"
        "Its scoring time is:\n\n"
        f"{high_recall_cv_results.loc[fastest_top_recall_high_precision_index]}"
    )

    # a callable passed as refit must return the integer index of the chosen candidate
    return fastest_top_recall_high_precision_index
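

When `refit` is a callable, `GridSearchCV` calls it with `cv_results_` and expects the integer index of the chosen candidate back. The sketch below shows that contract on synthetic data with a deliberately simplified strategy. `pick_best_recall`, the toy dataset and the small grid are all hypothetical, and the strategy is not the one defined above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# small synthetic binary problem, used only to exercise the API
X, y = make_classification(n_samples=120, n_features=5, random_state=0)

def pick_best_recall(cv_results):
    # hypothetical, simplified strategy: index of the candidate with the highest mean recall
    return int(np.argmax(cv_results["mean_test_recall"]))

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.1, 1, 10]},
    scoring=["precision", "recall"],  # multi-metric scoring, so refit must say how to choose
    refit=pick_best_recall,
    cv=3,
)
search.fit(X, y)
print(search.best_index_, search.best_params_)
```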

In [None]:
def report_scores(y_true, y_pred):
    """Plot the confusion matrix and print the headline classification metrics."""
    ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
    plt.show()
    print('Precision: %.3f' % precision_score(y_true, y_pred))
    print('Recall: %.3f' % recall_score(y_true, y_pred))
    print('F1: %.3f' % f1_score(y_true, y_pred))
    print('Accuracy: %.3f' % accuracy_score(y_true, y_pred))

# Create Pipeline & Find the Best Model & Parameters

In [None]:
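# keep every feature column in X (the pre-processing steps pick what they need) and stratify on the skewed target
train_x, test_x, train_y, test_y = train_test_split(train.drop(columns=['target']), train['target'], test_size=0.2, random_state=42, stratify=train['target'])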

Questions left:

If I just use the CLS token, does that capture multi-sentence tweets correctly?

How does the sklearn pipeline know what to pass as the output of one step to the inputs of the next step?

Can I use UDFs instead of classes for the pipeline steps?

How do you navigate through a tensor's structure / how does a tensor work?

Should I be using the attention mask, or is it being used by default?

Do I need to pre-initialise the estimators, or do it in the fit method of each step?

Should I be using the fit or fit_transform methods of the pipeline?

The classes work individually and together, outside the pipeline. It's the pipeline itself that makes stringing the steps together difficult. Perhaps this does suggest something to do with the initialisation.

https://medium.com/@benlc77/how-to-write-clean-and-scalable-code-with-custom-transformers-sklearn-pipelines-ecb8e53fe110
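
On the question of how data moves between steps, a minimal sketch with two toy transformers may help. `AddOne` and `Double` are hypothetical classes written only for this illustration. In a `Pipeline`, whatever one step's `transform` returns is passed as `X` to the next step, and `fit_transform` fits and transforms each step in order.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class AddOne(BaseEstimator, TransformerMixin):
    """Hypothetical toy step: adds 1 to every value."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X + 1

class Double(BaseEstimator, TransformerMixin):
    """Hypothetical toy step: doubles every value."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X * 2

pipe = Pipeline([("add_one", AddOne()), ("double", Double())])

X = np.array([[1.0], [2.0]])
out = pipe.fit_transform(X)  # Double receives the output of AddOne: (X + 1) * 2
print(out.ravel())
```

For plain functions rather than classes, scikit-learn also provides `sklearn.preprocessing.FunctionTransformer`, which wraps a function as a pipeline step; whether that suits the steps in this notebook is untested here.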


In [None]:
# configure the steps of the pre-processing sub-pipes
# (tokenizer, bertmodel, wordvec and hashing must already be instances; see the initialisation cell)
text_pre_process_pipe = pipeline.Pipeline(
    steps=[
        ('Tokenize' , tokenizer),
        ('embed' , bertmodel)
    ]
)

keyword_pre_process_pipe = pipeline.Pipeline(
    steps=[
        ('word2vec' , wordvec)
    ])

loc_pre_processing = pipeline.Pipeline(
    steps=[
        ('hash' , hashing)
    ])

combined_preprocessing = pipeline.FeatureUnion([
    ('text', text_pre_process_pipe),
    ('keyword', keyword_pre_process_pipe),
    ('geo', loc_pre_processing),
])
# need to think about surfacing the parameters up as inputs into the class for grid search parameter tuning
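

One possible way to route specific columns to specific steps is scikit-learn's `ColumnTransformer`, which is covered by one of the scikit-learn links in the references. The sketch below is a toy under stated assumptions: `CountVectorizer` and `OneHotEncoder` are simple stand-ins for the BERT and hashing steps, and the three-row DataFrame is invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# invented toy rows with the same column names as the notebook's features
toy = pd.DataFrame({
    "text": ["forest fire near town", "what a lovely day", "flood warning issued"],
    "keyword": ["fire", "day", "flood"],
    "country": ["Canada", "Canada", "India"],
})

routing = ColumnTransformer([
    ("text", CountVectorizer(), "text"),        # a string selects one column as 1-D input
    ("keyword", OneHotEncoder(), ["keyword"]),  # a list selects columns as 2-D input
    ("geo", OneHotEncoder(), ["country"]),
])

# each transformer sees only its own columns; the outputs are concatenated side by side
features = routing.fit_transform(toy)
print(features.shape)
```

Nested parameters are exposed under `<name>__<param>` keys (for example `text__max_features`), which is the naming a grid search parameter grid would use to tune them.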

In [None]:
# set up the gridsearchcv parameter grid and selection of models to test
svc = svm.SVC()  # named svc so that the sklearn.svm module is not shadowed

scores = ["precision" , "recall"]

param_svm = [
    # multi_class / crammer_singer are LinearSVC options rather than SVC ones, so they are left out here
    {"kernel" : ["linear"], "C": [1, 10, 100, 1000], "class_weight" : ["balanced"]},
    {"kernel": ["rbf"], "gamma": [1e-3, 1e-4], "C": [1, 10, 100, 1000]},
]

param_nb = []

param_gbt = []

param_rf = []

class_models = [ #should this be a dict of lists e.g {"model": [svm , nb , gbt , rf] , "param"}
    {"name" : "svm", "model": svc, "param": param_svm},
    # {"name" : "Naive Bayes", "model": nb , "param" : param_nb},
    #{"name": "gradient boosted trees" ,"model" : gbt , "param": param_gbt},
    #{"name": "random forest", "model": rf , "param": param_rf}
    #https://towardsdatascience.com/naive-bayes-classifier-explained-50f9723571ed,
    # gradient boosted trees
    # random forest
]

In [None]:
#Initialise the classes
bertmodel = bertmodel(variables = ['text'])
tokenizer = tokenizer(variables = ['text'])
hashing = hashingcustom(variables = ['country' , 'city' , 'state'] )
wordvec = WordVectorTransformer(variables = 'keyword' )

In [None]:
#iterate through the models being tested and tune their hyperparameters
for entry in class_models:
    name, model, param = entry["name"], entry["model"], entry["param"]

    complete_pipeline = pipeline.Pipeline([
        ('preprocessing', combined_preprocessing),
        ('Model Training', GridSearchCV(estimator=model, param_grid=param, scoring=scores, n_jobs=2, refit=refit_strategy, cv=6))
    ])

    # model fitting
    complete_pipeline.fit(train_x, train_y)

    # model scoring
    test_pred = complete_pipeline.predict(test_x)

    # review this given the refit strategy
    # Evaluate model performance
    disp = ConfusionMatrixDisplay.from_predictions(test_y, test_pred)
    plt.show()
    print('Precision: %.3f' % precision_score(test_y, test_pred))
    print('Recall: %.3f' % recall_score(test_y, test_pred))
    print('F1: %.3f' % f1_score(test_y, test_pred))
    print('Accuracy: %.3f' % accuracy_score(test_y, test_pred))
    

parameter tuning:
https://scikit-learn.org/stable/modules/grid_search.html

model selection
https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html

## TwHIN-BERT

In [None]:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('Twitter/twhin-bert-base')
model = AutoModel.from_pretrained('Twitter/twhin-bert-base')
inputs = tokenizer("I'm using TwHIN-BERT! #TwHIN-BERT #NLP", return_tensors="pt")
outputs = model(**inputs)

# Model Selection & Final Submission

### References

https://towardsdatascience.com/build-a-bert-sci-kit-transformer-59d60ddd54a5

https://medium.com/@khang.pham.exxact/text-classification-with-bert-7afaacc5e49b

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder

@article{zhang2022twhin,
  title={TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations},
  author={Zhang, Xinyang and Malkov, Yury and Florez, Omar and Park, Serim and McWilliams, Brian and Han, Jiawei and El-Kishky, Ahmed},
  journal={arXiv preprint arXiv:2209.07562},
  year={2022}
}

https://towardsdatascience.com/pre-processing-should-extract-context-specific-features-4d01f6669a7e

tokenization:
https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils.py
https://github.com/google-research/bert/blob/master/tokenization.py
None the wiser on how special-token handling treats #; I'm guessing it doesn't extract the semantic meaning.

https://towardsdatascience.com/the-ultimate-guide-to-training-bert-from-scratch-the-tokenizer-ddf30f124822

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

https://datasciencetoday.net/index.php/en-us/nlp/211-paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained

https://lifewithdata.com/2023/05/27/transformermixin-in-scikit-learn/

https://towardsdatascience.com/4-ways-to-encode-categorical-features-with-high-cardinality-1bc6d8fd7b13#b13b

https://stackoverflow.com/questions/43366561/use-sklearns-gridsearchcv-with-a-pipeline-preprocessing-just-once

https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html#sphx-glr-auto-examples-model-selection-plot-grid-search-digits-py

https://lvngd.com/blog/spacy-word-vectors-as-features-in-scikit-learn/
Still not sure how flattening the vector array doesn't balloon the number of features the model has to handle, or what issues this might cause.

In [None]:
from sklearn.model_selection import cross_val_score, GridSearchCV