### NLP Augmentation Hands-On


Augmentation is an important technique in Computer Vision and has proved to be effective.
In NLP, augmentation has also been tried and has shown improvements in quite a few cases.

In this part, we will first understand the following:

- What Data Augmentation is and why it works

- Why it works so well in Computer Vision

- Benefits of Augmentation

- Types of NLP Augmentation

Then we will jump into one of the types of NLP Augmentation and do a hands-on exercise.

#### What is augmentation and why does it work?

Data Augmentation is a technique to synthetically generate new data points such that the generated data have the same semantics
as the original data. In other words, Data Augmentation is a semantically invariant transformation.

Data Augmentation helps for these primary reasons:

- Data Scarcity


- Improves generalization capabilities (reduces overfitting)


- Test Time Augmentation (Confident Prediction)
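
Test-time augmentation can be illustrated with a small sketch: the model predicts on the original input and on a few augmented variants, and the class probabilities are averaged. Everything here is an assumption made for illustration: `tta_predict`, the toy `toy_predict_proba` model, and the `swap_good` augmenter are hypothetical names and not part of the notebook.

```python
from typing import Callable, Dict, Sequence


def tta_predict(text: str,
                predict_proba: Callable[[str], Dict[str, float]],
                augmenters: Sequence[Callable[[str], str]]) -> Dict[str, float]:
    """Average class probabilities over the original text and its augmented variants."""
    variants = [text] + [aug(text) for aug in augmenters]
    totals: Dict[str, float] = {}
    for variant in variants:
        for label, prob in predict_proba(variant).items():
            totals[label] = totals.get(label, 0.0) + prob
    return {label: total / len(variants) for label, total in totals.items()}


# Hypothetical toy model: leans "positive" when the text contains "good" or "great".
def toy_predict_proba(text: str) -> Dict[str, float]:
    p = 0.9 if ("good" in text or "great" in text) else 0.2
    return {"positive": p, "negative": 1.0 - p}


# Hypothetical augmenter: replaces one word with a synonym.
def swap_good(text: str) -> str:
    return text.replace("good", "great")


averaged = tta_predict("a good movie", toy_predict_proba, [swap_good])
```

If the averaged prediction stays close to the prediction on the original text, the model is behaving consistently across semantically equivalent inputs.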


#### Why does it work so well in Computer Vision?

In Computer Vision, Deep Learning algorithms in particular are data hungry, which means more data is always welcome.

Some researchers, however, question whether the volume of data matters more than its quality. If you want to understand more about it, please
go through this: https://www.slideshare.net/xamat/10-lessons-learned-from-building-machine-learning-systems


Transformations applied to an image during augmentation still preserve its meaning, hence they are semantically invariant transformations. (reference - https://medium.com/secure-and-private-ai-writing-challenge/data-augmentation-increases-accuracy-of-your-model-but-how-aa1913468722)

!["Image Augmentation"](image_aug_3.png)
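
As a minimal sketch of such a transformation, the snippet below mirrors a tiny image left to right and keeps its label. It assumes the image is a plain list of pixel rows; `horizontal_flip` and `augment_image` are hypothetical helper names. A flip preserves the label for many object classes, but this is an assumption that has to be checked per task (it would not hold for digits or text in images, for example).

```python
def horizontal_flip(image):
    """Mirror an image left to right; `image` is a list of pixel rows."""
    return [list(reversed(row)) for row in image]


def augment_image(image, label):
    """Return the original and the flipped image, both with the unchanged label."""
    return [(image, label), (horizontal_flip(image), label)]


tiny_image = [[0, 1, 2],
              [3, 4, 5]]
pairs = augment_image(tiny_image, "cat")
```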

#### Rules of Data Augmentation 

1. The augmented data must follow a statistical distribution similar to that of the original data.


2. A human being should not be able to distinguish between the augmented data and the original data.


3. Data augmentation involves semantically invariant transformations.


4. In supervised learning, the transformations allowed for data augmentation are those that do not modify the class label of the new data generated.


5. In order to respect the semantic invariance, the number of successive or combined transformations must be limited, empirically to two (2).


Reference for the above rules: [Text Data Augmentation Made Simple](https://arxiv.org/abs/1812.04718)
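
Rules 4 and 5 can be sketched in code: the label is passed through untouched, and at most two transformations are chained. This is only an illustration under stated assumptions. `augment_example`, `expand_contraction`, and `americanize` are hypothetical names, and the two toy transformations are stand-ins for real ones.

```python
import random
from typing import Callable, Sequence, Tuple

MAX_SUCCESSIVE_TRANSFORMS = 2  # empirical limit quoted in rule 5


def augment_example(text: str,
                    label: int,
                    transforms: Sequence[Callable[[str], str]],
                    rng: random.Random) -> Tuple[str, int]:
    """Apply one or two randomly chosen transforms and keep the label unchanged."""
    if not transforms:
        return text, label
    k = rng.randint(1, min(MAX_SUCCESSIVE_TRANSFORMS, len(transforms)))
    for transform in rng.sample(list(transforms), k):
        text = transform(text)
    return text, label  # rule 4: the class label is never modified


# Hypothetical toy transformations that should not change the meaning.
def expand_contraction(text: str) -> str:
    return text.replace("isn't", "is not")


def americanize(text: str) -> str:
    return text.replace("colour", "color")


rng = random.Random(0)
new_text, new_label = augment_example(
    "the colour isn't bright", 1, [expand_contraction, americanize], rng)
```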

#### Benefits of Data Augmentation

The benefits of augmentation are widely documented in Computer Vision research.

- Implicit regularization


- Semi-Supervised applications, insufficient data.


- Cost-effective way of gathering and labeling data. Automated synthetic data generation helps to alleviate the tedious data collection process.

Now that we have some understanding of Data Augmentation, we will shift our attention to text augmentation. Text augmentation and NLP augmentation can be treated as synonyms.


**NLP augmentation** can be classified into these major categories, with each category having a bunch of techniques.


#### Categories of NLP Augmentation

- Lexical Substitution


- Back Translation


- Text Surface Transformation


- Random Noise Injection


- Instance Crossover Augmentation


- Syntax-tree Manipulation
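
To make two of these categories concrete, here is a minimal sketch of one Text Surface Transformation (expanding contractions) and two Random Noise Injection operations (random swap and random deletion), as these terms are commonly used. The function names and the tiny `CONTRACTIONS` table are hypothetical and only meant for illustration.

```python
import random
import re
from typing import List

# Hypothetical, deliberately tiny contraction table for illustration.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def expand_contractions(text: str) -> str:
    """Text surface transformation: rewrite contractions in their expanded form."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS))
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def random_swap(words: List[str], rng: random.Random) -> List[str]:
    """Random noise injection: swap two randomly chosen word positions."""
    words = list(words)
    if len(words) < 2:
        return words
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Random noise injection: drop each word with probability p, keeping at least one."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)
expanded = expand_contractions("it's fine, don't worry")
swapped = random_swap("the quick brown fox".split(), rng)
shortened = random_deletion("the quick brown fox".split(), 0.3, rng)
```

Noise operations like these can change the meaning when applied too aggressively, so the deletion probability `p` is best kept small.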




In this part we do a hands-on exercise for Lexical Substitution.

In [122]:
from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import re
import unidecode
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  remove=('headers', 'footers', 'quotes'),
                                  random_state=42)
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [123]:
data = twenty_train.data
target = twenty_train.target

In [124]:
def clean_data(text):
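    # Name the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(text, "html.parser")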
    html_pattern = re.compile('<.*?>')
    text = html_pattern.sub(r' ', soup.text)
    text = unidecode.unidecode(text)
    text = re.sub('[^A-Za-z0-9.]+', ' ', text)
    text = text.lower()
    
    return text

In [125]:
data = [clean_data(txt) for txt in data]

In [126]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
X_train_counts.shape

(2257, 28179)

In [127]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score


mnb = MultinomialNB()

In [128]:
print("Mean Accuracy: {:.2}".format(cross_val_score(mnb, X_train_counts, target, cv=5).mean()))

Mean Accuracy: 0.84




#### Now we will experiment with Lexical Substitution using NLTK WordNet

In [57]:
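import nltk
from nltk.tag import pos_tag
from nltk import sent_tokenize

# One-time setup if the NLTK resources are missing (resource names can differ across NLTK versions):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('wordnet')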

In [160]:
from nltk.corpus import wordnet

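def get_synonym_for_word(word):
    """Returns the first WordNet synonym of the given word if one is found, otherwise returns the word itself."""
    synonyms = []
    for syn in wordnet.synsets(word):
        # Collect the lemma names of every synset the word belongs to
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
    # Drop the word itself so that we never "replace" it with the same token
    synonyms = [syn for syn in synonyms if syn != word]
    if len(synonyms) == 0:
        return word
    else:
        return synonyms[0]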

def augment_sentence_wordnet(sentence, filters=('NN', 'JJ')):
    """Replaces words whose POS tag is in `filters` with a WordNet synonym."""
    
    pos_sent = pos_tag(sentence.split())
    new_sent = []
    for word,tag in pos_sent:
        if tag in filters:
            new_sent.append(get_synonym_for_word(word))
        else:
            new_sent.append(word)
            
    return " ".join(new_sent)

def augment_data(data, target):
    """Creates augmented data using WordNet synonym substitution.

    Each original row is kept and followed by its augmented copy with the same label.
    """

    aug_data = []
    aug_target = []
    for row, t in zip(data, target):
        # Augment sentence by sentence so that POS tagging sees one sentence at a time
        aug_sents = [augment_sentence_wordnet(line) for line in sent_tokenize(row)]
        aug_row = " ".join(aug_sents)

        # Keep the original row and add its augmented copy; both share the label
        aug_data.append(row)
        aug_data.append(aug_row)
        aug_target.append(t)
        aug_target.append(t)
    return aug_data, aug_target
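
The lookup above takes the first synonym from any synset, regardless of part of speech, and the exact-match filter on `'NN'` and `'JJ'` skips tags such as `'NNS'` or `'JJR'`. A possible refinement, sketched here under assumptions, is to map the Penn Treebank tag to a WordNet part of speech and pass it to `wordnet.synsets(word, pos=...)`. The names `PENN_PREFIX_TO_WORDNET`, `penn_to_wordnet`, and `get_synonym_for_tagged_word` are hypothetical and not part of the notebook.

```python
from typing import Optional

# Penn Treebank tag prefixes mapped to WordNet part-of-speech codes.
# These letters equal nltk's wordnet.NOUN, wordnet.ADJ, wordnet.VERB and wordnet.ADV.
PENN_PREFIX_TO_WORDNET = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}


def penn_to_wordnet(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag (e.g. 'NNS', 'JJR') to a WordNet POS code, or None."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:2])


def get_synonym_for_tagged_word(word: str, tag: str) -> str:
    """Return the first WordNet synonym with a matching part of speech, else the word."""
    # Imported lazily; calling this function requires the WordNet corpus to be downloaded.
    from nltk.corpus import wordnet

    wn_pos = penn_to_wordnet(tag)
    if wn_pos is None:
        return word
    for synset in wordnet.synsets(word, pos=wn_pos):
        for lemma in synset.lemmas():
            # Multi-word lemma names use underscores, e.g. 'ice_cream'
            candidate = lemma.name().replace("_", " ")
            if candidate.lower() != word.lower():
                return candidate
    return word
```

Note that this mapping also covers proper nouns (`'NNP'`), which one may prefer to leave unchanged.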

        

In [161]:
aug_data, aug_target = augment_data(data, target)

In [164]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_aug_counts = count_vect.fit_transform(aug_data)


mnb_aug = MultinomialNB()

In [165]:
print("Mean Accuracy: {:.2}".format(cross_val_score(mnb_aug, X_train_aug_counts, aug_target, cv=5).mean()))

Mean Accuracy: 0.85
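
One caveat about this comparison: `aug_data` holds every original row next to its augmented copy, so `cross_val_score` may put a row in the validation fold while its near-identical copy sits in the training fold. That can make the score look better than it would be on unseen text. A hedged way to check this is to split first and augment only the training part of each fold. The sketch below is library-free and uses hypothetical names (`kfold_indices`, `cross_validate_with_augmentation`, and the `augment` and `fit_and_score` callables); it also assumes the data is already shuffled, since the folds are contiguous and not stratified.

```python
from typing import Callable, List, Sequence


def kfold_indices(n_samples: int, n_splits: int) -> List[List[int]]:
    """Split range(n_samples) into n_splits contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(n_splits):
        size = n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate_with_augmentation(
        texts: Sequence[str],
        labels: Sequence[int],
        augment: Callable[[str], str],
        fit_and_score: Callable[[List[str], List[int], List[str], List[int]], float],
        n_splits: int = 5) -> List[float]:
    """Augment only the training part of each fold; score on untouched validation rows."""
    scores = []
    for val_idx in kfold_indices(len(texts), n_splits):
        val_set = set(val_idx)
        train_texts, train_labels = [], []
        for i in range(len(texts)):
            if i in val_set:
                continue
            # Original row plus its augmented copy, both with the same label
            train_texts += [texts[i], augment(texts[i])]
            train_labels += [labels[i], labels[i]]
        val_texts = [texts[i] for i in val_idx]
        val_labels = [labels[i] for i in val_idx]
        scores.append(fit_and_score(train_texts, train_labels, val_texts, val_labels))
    return scores
```

In this notebook, `fit_and_score` could fit `CountVectorizer` and `MultinomialNB` on the training texts and return the accuracy on the validation texts, and `StratifiedKFold` from scikit-learn could replace `kfold_indices`.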


In [2]:
import re
import unidecode
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from bs4 import BeautifulSoup
import nltk
from nltk.tag import pos_tag
from nltk import sent_tokenize
from nltk.corpus import wordnet

def get_synonym_for_word(word):
    """Returns the first WordNet synonym of the given word if one is found, otherwise returns the word itself."""

    synonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())
    synonyms = [syn for syn in synonyms if syn != word]
    if len(synonyms) == 0:
        return word
    else:
        return synonyms[0]

def augment_sentence_wordnet(sentence, filters=('NN', 'JJ')):
    """Replaces words whose POS tag is in `filters` with a WordNet synonym."""
    
    pos_sent = pos_tag(sentence.split())
    new_sent = []
    for word,tag in pos_sent:
        if tag in filters:
            new_sent.append(get_synonym_for_word(word))
        else:
            new_sent.append(word)
            
    return " ".join(new_sent)

def augment_data(data, target):
    """Creates augmented data using WordNet synonym substitution."""
    
    aug_data = []
    aug_target = []
    for row, t in zip(data, target):
        aug_row = []
        row_sents = sent_tokenize(row)
        #print("row_sents", row_sents)
        for line in row_sents:
            line = augment_sentence_wordnet(line)
            aug_row.append(line)
        row_sents = " ".join(aug_row)
        
        #print(row_sents)
        aug_data.append(row)
        aug_data.append(row_sents)
        aug_target.append(t)
        aug_target.append(t)
        #print(len(aug_data))
    return aug_data, aug_target

def clean_data(text):
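    # Name the parser explicitly to avoid the "no parser was explicitly specified" warning
    soup = BeautifulSoup(text, "html.parser")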
    html_pattern = re.compile('<.*?>')
    text = html_pattern.sub(r' ', soup.text)
    text = unidecode.unidecode(text)
    text = re.sub('[^A-Za-z0-9.]+', ' ', text)
    text = text.lower()
    
    return text
    
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
                                  categories=categories,
                                  shuffle=True,
                                  remove=('headers', 'footers', 'quotes'),
                                  random_state=42)
print(twenty_train.target_names)

data = twenty_train.data
data = [clean_data(txt) for txt in data]
target = twenty_train.target

aug_data, aug_target = augment_data(data, target)

count_vect = CountVectorizer()
X_train_aug_counts = count_vect.fit_transform(aug_data)
print(X_train_aug_counts.shape)

mnb_aug = MultinomialNB()
print("Mean Accuracy: {:.2}".format(cross_val_score(mnb_aug, X_train_aug_counts, aug_target, cv=5).mean()))
# In my experiment this gave about 85% accuracy.

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
(4514, 32279)
Mean Accuracy: 0.85
