## Summary:

Whenever you make a purchase with a credit or debit card, the transaction is recorded in your account. The merchant information is an unstructured string with a 'store number' embedded somewhere in it. In this project, I wrote two scripts, one non-learning-based and one learning-based, to extract the correct store number from each merchant descriptor. I removed some parts where merchant descriptors were visible because of privacy concerns.

**Method 1**: I built a non-learning model first because I expected it to be very effective for this particular assignment. I used Python's regular expression library (`re`) and a series of if statements to write a function that picks the store number out of the transaction descriptor string.

**Method 2**: I built a learning-based model with the spaCy library by training a transformer-based neural network for named entity recognition (NER). I don't have a powerful GPU myself, so I couldn't design my own architecture and experiment with PyTorch.





In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import re

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

import random
import spacy
from spacy.util import minibatch, compounding
from pathlib import Path
# from spacy.training.example import Example

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
df = pd.read_csv("Summer Internship - Homework Exercise.csv")

In [5]:
# setting up the dataframe for train, validation, and test set in pandas
train = df.loc[df['dataset'] == 'train']
validation = df.loc[df['dataset'] == 'validation']
test = df.loc[df['dataset'] == 'test']

# **Non-Learning Based Model with Regular Expression**
---



In [9]:
states = [ 'AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
           'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
           'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
           'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
           'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY']

In [10]:
# does string have number in it?
def has_numbers(inputString):
    return any(char.isdigit() for char in inputString)
# return only the digits of the string, with leading zeros stripped
def digitonly(inputString):
    return (''.join(filter(str.isdigit, inputString))).lstrip('0')
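
A quick check of the two helpers above, as a minimal sketch. The sample tokens are made up for illustration and are not taken from the dataset; the functions are repeated so the snippet runs on its own.

```python
def has_numbers(inputString):
    """Return True if the string contains at least one digit."""
    return any(char.isdigit() for char in inputString)


def digitonly(inputString):
    """Keep only the digits and drop leading zeros."""
    return ''.join(filter(str.isdigit, inputString)).lstrip('0')


# hypothetical sample tokens, not from the dataset
print(has_numbers("ST"))         # False
print(has_numbers("5998869CK"))  # True
print(digitonly("5998869CK"))    # 5998869
print(digitonly("00123"))        # 123
print(repr(digitonly("000")))    # '' (a token made only of zeros becomes empty)
```

Note the last case: because of `lstrip('0')`, a token consisting only of zeros comes back as an empty string.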

In [11]:
def store_number_extractor(string):
    tokenized = word_tokenize(string)
    store_number = ''
    
    # When store number is after '#'
    if "#" in string:
        for i in range(len(tokenized)):
            if tokenized[i] == "#" and i + 1 < len(tokenized):
                # filter only digits
                # .lstrip to remove leading zero
                return digitonly(tokenized[i+1])
    
    # when there are two numbers
    for i in range(len(tokenized)):
        if (i == len(tokenized)-1):
            break
        # if X##### (one letter) or CA#### (state) in tokenized
        # then, return the token
        if (has_numbers(tokenized[i])):
            split = re.split(r'(\d+)', tokenized[i])
            if ((split[0] in states) or ((len(split[0]) == 1) and (not (split[0].isnumeric())))):
                return split[0] + split[1]
        if has_numbers(tokenized[i]) and has_numbers(tokenized[i+1]):
            # return whichever of the two numbers is longer (the first one on a tie)
            front = digitonly(tokenized[i])
            back = digitonly(tokenized[i+1])
            if(len(front) < len(back)):
                return back
            else:
                return front
            
        if tokenized[i].isnumeric() and tokenized[i+1].isnumeric():
            return tokenized[i].lstrip('0')
        
        
    # if X##### (one letter) or CA#### (state) in tokenized
    #    then, return the token
    for i in range(len(tokenized)):
        if (has_numbers(tokenized[i])):
            split = re.split(r'(\d+)', tokenized[i])
            if ((split[0] in states) or ((len(split[0]) == 1) and (not (split[0].isnumeric())))):
                return split[0] + split[1]
            
        # if the store number has state code in the front
        #if (re.split('(\d+)',tokenized[i])[0] in states)
        
    
        
    # the rest, when store number is the only number    
    for i in range(len(tokenized)):   
        if tokenized[i].isnumeric():
            return tokenized[i].lstrip('0')
    return (''.join(filter(str.isdigit, TreebankWordDetokenizer().detokenize(tokenized)))).lstrip('0')
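
The first rule of the function (store number right after `#`) can also be written with a single regular expression and no tokenizer. This is only a sketch of that one rule: `HASH_PATTERN` and `store_number_after_hash` are hypothetical names, not part of the notebook. It is also not an exact replacement, since it keeps only the first run of digits after `#`, whereas `digitonly` joins every digit found in the next token.

```python
import re

# hypothetical helper, not part of the original notebook
HASH_PATTERN = re.compile(r"#\s*(\d+)")


def store_number_after_hash(descriptor):
    """Return the digits that follow '#', without leading zeros, or None."""
    match = HASH_PATTERN.search(descriptor)
    if match is None:
        return None
    return match.group(1).lstrip("0")


print(store_number_after_hash("BP#5998869CK ST"))  # 5998869
```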

In [14]:
# testing the function
string = "BP#5998869CK ST"
word_tokenize(string)

['BP', '#', '5998869CK', 'ST']

In [15]:
store_number_extractor(string)

'5998869'

In [16]:
# accuracy score for the training set
sum((train['transaction_descriptor'].apply(store_number_extractor) == train['store_number']))/len(train)

0.96

In [18]:
# testing it on validation set:
sum((validation['transaction_descriptor'].apply(store_number_extractor) == validation['store_number']))/len(validation)

0.96

In [19]:
# and on test set:
sum((test['transaction_descriptor'].apply(store_number_extractor) == test['store_number']))/len(test)

0.95

Respectable performance for a non-learning-based 'filter' that I built just by working out the patterns myself.
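
To see which patterns the rules still miss, it can help to list the rows where the prediction differs from the label. The sketch below assumes nothing about the dataset: `mismatches` is a hypothetical helper, and `first_number` and `sample_rows` are made-up stand-ins so the snippet runs on its own.

```python
def mismatches(extractor, rows):
    """Return (descriptor, expected, predicted) for every wrong prediction."""
    errors = []
    for descriptor, expected in rows:
        predicted = extractor(descriptor)
        if predicted != expected:
            errors.append((descriptor, expected, predicted))
    return errors


# hypothetical stand-in extractor: first all-digit token, leading zeros stripped
def first_number(descriptor):
    for token in descriptor.split():
        if token.isdigit():
            return token.lstrip("0")
    return ""


# made-up rows, not from the dataset
sample_rows = [("SHOP 0042 TOWN", "42"), ("CAFE 7 12", "12")]
for descriptor, expected, predicted in mismatches(first_number, sample_rows):
    print(descriptor, "| expected:", expected, "| predicted:", predicted)
```

With the notebook's data, the same helper could presumably be called as `mismatches(store_number_extractor, zip(train['transaction_descriptor'], train['store_number']))`.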


Now for the learning-based approach:

# **Learning-Based Entity Extraction Model** (using spaCy)

In [None]:
!pip install -U spacy

In [None]:
TRAIN_DATA = []
# setting up the data (json structure)
for index, row in train.iterrows():
    substringstart = row["transaction_descriptor"].find(row["store_number"])
    substringend = substringstart + len(row["store_number"])
    TRAIN_DATA.append((row["transaction_descriptor"], {'entities': [(substringstart, substringend, row["store_number"])]}))
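
Two details of the cell above are worth a second look. The third element of each entity tuple is the store number itself, so every distinct store number is later passed to `char_span` as its own entity label. Also, `str.find` returns -1 when the store number does not appear verbatim in the descriptor. The following is a minimal sketch of an alternative, not the code that produced the results in this notebook: the label name `STORE_NUMBER` and the helper `build_examples` are assumptions, and the sample rows are made up.

```python
LABEL = "STORE_NUMBER"  # assumed label name, not from the original notebook


def build_examples(rows, label=LABEL):
    """Turn (descriptor, store_number) pairs into spaCy-style training tuples."""
    examples = []
    for descriptor, store_number in rows:
        # skip rows with an empty store number
        if not store_number:
            continue
        start = descriptor.find(store_number)
        # skip rows where the store number is not a substring of the descriptor
        if start == -1:
            continue
        end = start + len(store_number)
        examples.append((descriptor, {"entities": [(start, end, label)]}))
    return examples


# made-up rows, not from the dataset
sample_rows = [("BP#5998869CK ST", "5998869"), ("SHOP 0042 TOWN", "99")]
print(build_examples(sample_rows))
# [('BP#5998869CK ST', {'entities': [(3, 10, 'STORE_NUMBER')]})]
```

A single shared label lets the NER component learn one entity type from all rows instead of one type per store number.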

In [None]:
# convert the data to the binary DocBin format used by spaCy v3
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

In [None]:
# download the transformer-based English pipeline from spaCy
!python -m spacy download en_core_web_trf

In [10]:
import torch
# call the function; without the parentheses the cell only displays the function object
torch.cuda.is_available()

In [11]:
!pip install torch



In [None]:
# install the Hugging Face transformers integration for spaCy
!pip install -U spacy[cuda92,transformers]

In [None]:
# !pip install cupy
!curl https://colab.chainer.org/install | sh -

In [14]:
# "!export" runs in a throwaway subshell (and fails with spaces around "="),
# so set the variables from Python instead
import os
os.environ["CUDA_PATH"] = "/usr/local/cuda-9.2"

In [15]:
import os
cuda_path = os.environ.get("CUDA_PATH", "/usr/local/cuda-9.2")
os.environ["LD_LIBRARY_PATH"] = cuda_path + "/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")

In [16]:
# initialize the config file for our data
!python -m spacy init fill-config base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [17]:
# training the spaCy NER model based on a transformer architecture
!python -m spacy train --gpu-id 0 config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./validation.spacy

ℹ Saving to output directory: output
ℹ Using GPU: 0
[2022-05-08 20:58:20,750] [INFO] Set up nlp object from config
[2022-05-08 20:58:20,759] [INFO] Pipeline: ['transformer', 'ner']
[2022-05-08 20:58:20,763] [INFO] Created vocabulary
[2022-05-08 20:58:20,764] [INFO] Finished initializing nlp object
Downloading: 100% 481/481 [00:00<00:00, 544kB/s]
Downloading: 100% 878k/878k [00:00<00:00, 30.0MB/s]
Downloading: 100% 446k/446k [00:00<00:00, 17.7MB/s]
Downloading: 100% 1.29M/1.29M [00:00<00:00, 27.4MB/s]
Downloading: 100% 478M/478M [00:06<00:00, 73.7MB/s]
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertFo

In [6]:
import spacy
nlp = spacy.load("output/model-best")

In [8]:
# Accuracy of the trained model on the test set
accuracy = []
for index, row in test.iterrows():
    doc = nlp(row['transaction_descriptor'])
    for ent in doc.ents:
        predicted = ent.text
    accuracy.append((predicted == row["store_number"]))
accuracy = sum(accuracy)/len(accuracy)
accuracy

0.64
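
One caveat about the loop above: `predicted` is not reset for each row, so a descriptor for which the model finds no entity reuses the previous row's prediction (or raises a `NameError` if that happens on the first row). The sketch below shows a version that resets the prediction per row. It is not the code that produced the 0.64 figure; `exact_match_accuracy` is a hypothetical helper, and `fake_entities` and `sample_rows` are made-up stand-ins so the snippet runs without a trained model.

```python
def exact_match_accuracy(predict_entities, rows):
    """Fraction of rows whose last predicted entity text equals the expected value."""
    hits = 0
    total = 0
    for descriptor, expected in rows:
        entities = predict_entities(descriptor)
        # reset for every row: no entity found means no prediction
        predicted = entities[-1] if entities else None
        hits += predicted == expected
        total += 1
    return hits / total if total else 0.0


# hypothetical stand-in for the trained model: every all-digit token is an "entity"
def fake_entities(descriptor):
    return [token for token in descriptor.split() if token.isdigit()]


# made-up rows, not from the dataset
sample_rows = [("SHOP 42 TOWN", "42"), ("CAFE MAIN ST", "7")]
print(exact_match_accuracy(fake_entities, sample_rows))  # 0.5
```

With the trained pipeline, `predict_entities` could be something like `lambda text: [ent.text for ent in nlp(text).ents]`, with rows taken from `zip(test['transaction_descriptor'], test['store_number'])`.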