<a href="https://colab.research.google.com/github/anajikadam/NLP/blob/main/5_PythonLibraries__NLP_projects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based on this [Towards Data Science article](https://towardsdatascience.com/5-lesser-known-python-libraries-for-your-next-nlp-project-ff13fc652553).

## 1) Contractions
You could write a long list of regular expressions to expand the contractions in your text data.

For example: don’t = do not; can’t = cannot; haven’t = have not.

But there is a Python library that saves us this effort.

Contractions is an easy-to-use library that will expand both common English contractions and slang. 
It is fast, efficient, and handles most edge cases, such as missing apostrophes.

In [1]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.0.58-py2.py3-none-any.whl (8.0 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.2.tar.gz (321 kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.2-cp37-cp37m-linux_x86_64.whl size=85443 sha256=970b124de9c695ed5975bb073a49b51f00f8b0f6cbd1e3afce298d60308aeb56
  Stored in directory: /root/.cache/pip/wheels/25/19/a6/8f363d9939162782bb8439d886469756271abc01f76fbd790f
Successfully built pyahocorasick
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully install

In [2]:
import contractions


s = "ive gotta go! i'll see yall later."

text = contractions.fix(s, slang=True)
print("ORIGINAL:  ", s)
print()
print("OUTPUT:  ", text)

ORIGINAL:   ive gotta go! i'll see yall later.

OUTPUT:   I have got to go! I will see you all later.


In [3]:
s = "She celebrated her birthday for an entire month. She's so extra."

text = contractions.fix(s, slang=True)
print("ORIGINAL:  ", s)
print()
print("OUTPUT:  ", text)

ORIGINAL:   She celebrated her birthday for an entire month. She's so extra.

OUTPUT:   She celebrated her birthday for an entire month. she is so extra.


In [4]:
s = "I'll never go something ahead"

text = contractions.fix(s, slang=True)
print("ORIGINAL:  ", s)
print()
print("OUTPUT:  ", text)

ORIGINAL:   I'll never go something ahead

OUTPUT:   I will never go something ahead


### Use Case
An important part of text preprocessing is creating uniformity and whittling down the list of unique words without losing too much meaning. For instance, bag-of-words models and TF-IDF create large sparse matrices, in which each column is a distinct vocabulary word in the corpus. Expanding contractions can further reduce dimensionality, or even help filter out stopwords.
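
The dimensionality point can be illustrated with a minimal sketch. It uses a tiny hand-written mapping (`EXPANSIONS`, a hypothetical stand-in, not the library's own table) and only the standard library, so it shows the idea rather than how `contractions` is implemented.

```python
import re

# Toy mapping for illustration only; the contractions library covers far more cases.
EXPANSIONS = {"don't": "do not", "can't": "cannot", "haven't": "have not"}
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in EXPANSIONS) + r")\b", re.IGNORECASE
)

def expand(text):
    """Replace each known contraction with its expanded form."""
    return PATTERN.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)

def vocabulary(docs):
    """Collect the set of distinct lowercase tokens across all documents."""
    return {tok for doc in docs for tok in re.findall(r"[a-z']+", doc.lower())}

docs = ["I don't know", "I do not know", "We can't go", "We cannot go"]
before = vocabulary(docs)
after = vocabulary([expand(d) for d in docs])
print(len(before), len(after))  # the vocabulary shrinks after expansion
```

After expansion, "don't" and "do not" map to the same tokens, so a bag-of-words matrix built on these documents needs fewer columns.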

## 2) Distilbert-Punctuator
Restoring missing punctuation to plain English text… sounds easy, right? Well, it’s definitely a lot trickier for a computer to do this.
Distilbert-punctuator is the only working Python library I could find that performs this task. And it’s super accurate as well! That’s because it uses a slimmed-down variation of BERT, which is a state-of-the-art, pretrained language model by Google. It was further fine-tuned on a combination of over 20,000 news articles and 4,000 TED Talk transcripts to detect sentence boundaries. When inserting sentence-ending punctuation, such as periods, the model will also appropriately capitalize the next starting letter.
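
To make the last point concrete, here is a rough sketch of how per-word tags could be turned back into punctuated text. The tag names and the `TAG_MAP` table are hypothetical and chosen for illustration; they are not taken from the library, and the tags themselves are assumed to come from some upstream model.

```python
# Hypothetical tag map: tag -> (punctuation to append, capitalize the next word?)
TAG_MAP = {
    "O": ("", False),
    "COMMA": (",", False),
    "PERIOD": (".", True),
    "QUESTION": ("?", True),
}

def apply_tags(tokens, tags):
    """Append the punctuation each tag calls for and capitalize sentence starts."""
    out = []
    capitalize_next = True  # the first word always starts a sentence
    for token, tag in zip(tokens, tags):
        mark, ends_sentence = TAG_MAP[tag]
        word = token[:1].upper() + token[1:] if capitalize_next else token
        out.append(word + mark)
        capitalize_next = ends_sentence
    return " ".join(out)

tokens = "happy birthday john will you come".split()
tags = ["O", "O", "PERIOD", "O", "O", "QUESTION"]
print(apply_tags(tokens, tags))
```

The hard part, predicting the tags, is what the fine-tuned model does; the reconstruction step is simple by comparison.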

In [5]:
!pip install distilbert-punctuator

Collecting distilbert-punctuator
  Downloading distilbert_punctuator-0.2.0-py3-none-any.whl (27 kB)
Collecting pydantic==1.8.2
  Downloading pydantic-1.8.2-cp37-cp37m-manylinux2014_x86_64.whl (10.1 MB)
Collecting torch==1.7.1
  Downloading torch-1.7.1-cp37-cp37m-manylinux1_x86_64.whl (776.8 MB)
Collecting plane>=0.2.0
  Downloading plane-0.2.1-py3-none-any.whl (11 kB)
Collecting typer==0.3.2
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting transformers>=4.12.0
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.ma

In [6]:
from dbpunctuator.inference import Inference, InferenceArguments
from dbpunctuator.utils import DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP
args = InferenceArguments(
        model_name_or_path="Qishuai/distilbert_punctuator_en",
        tokenizer_name="Qishuai/distilbert_punctuator_en",
        tag2punctuator=DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP
    )
punctuator_model = Inference(inference_args=args, 
                             verbose=False)
text = [
    """
however when I am elected I vow to protect our American workforce
unlike my opponent I have faith in our perseverance our sense of trust and our democratic principles will you support me
    """
]
print(punctuator_model.punctuation(text)[0])

2021-12-29 08:41:14,721 - INFO - inference_interface.py:75 - inference_interface._produce_server - 72 - set up punctuator


2021-12-29 08:41:29,821 - INFO - inference_interface.py:88 - inference_interface._produce_server - 72 - start running punctuator
2021-12-29 08:41:31,862 - INFO - inference_interface.py:91 - inference_interface._produce_server - 72 - start client


['However, when I am elected, I vow to protect our American workforce. Unlike my opponent, I have faith in our perseverance, our sense of trust and our democratic principles. Will you support me?']


In [65]:
text = [
    """
She ate 3 flavors milk coffee and taro When are you going to come home"""
]
print(punctuator_model.punctuation(text)[0])

['She ate 3 flavors milk, coffee and taro. When are you going to come home?']


In [66]:

text = [
    """
Happy birthday John"""
]
print(punctuator_model.punctuation(text)[0])

['Happy birthday John.']


### Use Case
Sometimes, you simply want your text data to be more grammatically correct and presentable. Whether the task is fixing messy Twitter posts or chatbot messages, this library is very useful.

## 3) Textstat
Textstat is an easy-to-use, lightweight library that provides various metrics on your text data, such as reading level, reading time, and word count.

In [7]:
!pip install textstat

Collecting textstat
  Downloading textstat-0.7.2-py3-none-any.whl (101 kB)
Collecting pyphen
  Downloading pyphen-0.12.0-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.12.0 textstat-0.7.2


In [8]:
import textstat
text = """
Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. 
"""
# Flesch reading ease score
print(textstat.flesch_reading_ease(text))
  # 90-100 | Very Easy
  # 80-89  | Easy
  # 70-79  | Fairly Easy
  # 60-69  | Standard
  # 50-59  | Fairly Difficult
  # 30-49  | Difficult
  # <30    | Very Confusing
# Reading time (output in seconds)
# Assuming 70 milliseconds/character
print(textstat.reading_time(text, ms_per_char=70))
# Word count 
print(textstat.lexicon_count(text, removepunct=True))

74.87
7.98
30


In [58]:
text = """
Ever posted something online for free only to discover how many people want to argue with you about a free gift? “Can you deliver that free couch to me?” “Can you throw in the TV in the picture too?” It comes from the old adage of “beggars can’t be choosers” and shows you how they very much will still try. The worst part is choosey beggars often get insulting fast. “You’re a terrible person and you deserve to die—the only thing my child dying of cancer wanted for his birthday was your TV! And your Xbox! For free!”
"""
# Flesch reading ease score
print("reading ease score : ", textstat.flesch_reading_ease(text))
  # 90-100 | Very Easy
  # 80-89  | Easy
  # 70-79  | Fairly Easy
  # 60-69  | Standard
  # 50-59  | Fairly Difficult
  # 30-49  | Difficult
  # <30    | Very Confusing
# Reading time (output in seconds)
# Assuming 70 milliseconds/character
print("Reading Time (Assuming 70 milliseconds/character) : ", textstat.reading_time(text, ms_per_char=70))
# Word count 
print("Word count : ", textstat.lexicon_count(text, removepunct=True))

reading ease score :  63.66
Reading Time (Assuming 70 milliseconds/character) :  29.54
Word count :  98


### Use Case
These metrics add another layer of analysis. Say, for example, you are looking into a dataset of celebrity news articles from a gossip magazine. Using textstat, you may find that written pieces that are quicker and easier to read tend to be more popular and have higher retention rates.
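
For intuition about what these numbers mean, the sketch below computes the standard Flesch reading ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words), with a crude vowel-group syllable counter. The counter and the tokenization are simplifying assumptions, so scores will not match textstat exactly. The reading-time helper assumes whitespace is not counted as a character.

```python
import re

def count_syllables(word):
    """Crude heuristic: one syllable per run of vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )

def reading_time(text, ms_per_char=70):
    """Estimated reading time in seconds, ignoring whitespace (an assumption)."""
    chars = sum(1 for ch in text if not ch.isspace())
    return chars * ms_per_char / 1000

sample = "The cat sat on the mat."
print(flesch_reading_ease(sample))
print(reading_time(sample))
```

Short sentences made of one-syllable words score high (easy), while long sentences with many polysyllabic words push the score down.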

## 4) Gibberish-Detector
The primary purpose of this low-code library is to detect gibberish (or unintelligible words). It uses a model that is trained on a large corpus of English words.
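
One common way to build such a detector is a character-level Markov chain: learn how often each letter follows another in real English, then score new strings by how likely their letter transitions are. The sketch below is a standard-library illustration of that idea under my own assumptions (bigram model, add-one smoothing, a three-line training text); it is not the library's code and its internals may differ.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Tiny training text for illustration; a real model needs a large corpus such as big.txt.
CORPUS = (
    "The Project Gutenberg EBook of The Adventures of Sherlock Holmes "
    "by Sir Arthur Conan Doyle. "
    "Copyright laws are changing all over the world. Be sure to check the"
)

def normalize(text):
    """Lowercase and keep only letters and spaces."""
    return "".join(ch for ch in text.lower() if ch in ALPHABET)

def train(corpus):
    """Return log-probabilities of each character following another."""
    # Start every count at 1 (add-one smoothing) so unseen pairs are not impossible.
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    text = normalize(corpus)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return {
        a: {b: math.log(c / sum(row.values())) for b, c in row.items()}
        for a, row in counts.items()
    }

def avg_log_prob(model, text):
    """Average transition log-probability; higher means more English-like."""
    text = normalize(text)
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return 0.0
    return sum(model[a][b] for a, b in pairs) / len(pairs)

model = train(CORPUS)
print(avg_log_prob(model, "adventures"))
print(avg_log_prob(model, "xdnfklskasqd"))
```

A detector built this way would flag a string as gibberish when its score falls below a threshold chosen from known good and bad examples.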

In [12]:
!pip install gibberish-detector

Collecting gibberish-detector
  Downloading gibberish_detector-0.1.1-py3-none-any.whl (10 kB)
Installing collected packages: gibberish-detector
Successfully installed gibberish-detector-0.1.1


Open your CLI and `cd` to the directory in which big.txt is located.

- Run the following: `gibberish-detector train big.txt > gibberish-detector.model`


A file called gibberish-detector.model will be created in your current directory.

[big.txt](https://raw.githubusercontent.com/rrenaud/Gibberish-Detector/master/big.txt)

In [17]:
path = '/content/big.txt'

with open(path) as f:
    lines = f.readlines()

lines[:5]

['The Project Gutenberg EBook of The Adventures of Sherlock Holmes\n',
 'by Sir Arthur Conan Doyle\n',
 '(#15 in our series by Sir Arthur Conan Doyle)\n',
 '\n',
 'Copyright laws are changing all over the world. Be sure to check the\n']

In [40]:
!gibberish-detector train big.txt > big.model

In [None]:
# !gibberish-detector train content/big.txt

In [43]:
from gibberish_detector import detector
# load the gibberish detection model

model_path = '/content/big.model'
Detector = detector.create_from_model(model_path)
text1 = "xdnfklskasqd"   # xdnfklskasqd (this is gibberish)
print(Detector.is_gibberish(text1))
text2 = "apples"  # apples (this is not)
print(Detector.is_gibberish(text2))

True
False


In [51]:
text3 = "Adventures"  # not gibberish
print(Detector.is_gibberish(text3))

False


In [56]:
text3 = "ersjya"  # gibberish
print(Detector.is_gibberish(text3))

True


### Use Case
I’ve used gibberish-detector in the past to help me remove bad observations from datasets.
It can also be implemented for error handling on user inputs. For instance, you may want to return an error message if a user enters meaningless, gibberish text on your web app.

## 5) NLPAug
I’ve saved the best for last. This versatile library is truly a hidden gem.
First off, what is data augmentation? It is any technique that expands the size of a training set by adding slightly modified copies of the existing data. Data augmentation is commonly used when the existing data is either limited in diversity or imbalanced. For computer vision problems, augmentation is used to create new samples by cropping, rotating, and changing the brightness of images. With numerical data, synthesized instances can be created by using clustering techniques.

This is where NLPAug comes in. The library can augment text by either substituting or inserting words that are semantically associated.
It does this by employing pretrained language models like BERT, which is a powerful approach because it takes into account the context of words. Based on the parameters you set, the top n most similar words will be used to modify the text.
Pretrained word embeddings, such as Word2Vec and GloVe, can also be used to replace words with synonyms.
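
The mechanics of substitution can be sketched without any model. In the toy below, a hand-written table (`CANDIDATES`, hypothetical) stands in for the ranked suggestions a language model or embedding lookup would produce; `aug_p` controls what share of words is replaced and `top_k` limits how far down the ranking we draw. How NLPAug itself decides the number of words to change is not shown here, so the rounding rule is an assumption.

```python
import random

# Hypothetical ranked suggestions; a real augmenter would get these from a model.
CANDIDATES = {
    "town": ["city", "village", "downtown"],
    "today": ["tomorrow", "tonight", "now"],
    "buy": ["get", "purchase", "find"],
    "food": ["groceries", "lunch", "bread"],
}

def substitute(text, aug_p=0.4, top_k=2, seed=None):
    """Replace roughly aug_p of the words with one of their top_k candidates."""
    rng = random.Random(seed)
    words = text.split()
    # Only words with known candidates can be replaced in this toy version.
    eligible = [i for i, w in enumerate(words) if w.lower() in CANDIDATES]
    n_aug = min(len(eligible), max(1, round(aug_p * len(words))))
    for i in rng.sample(eligible, n_aug):
        words[i] = rng.choice(CANDIDATES[words[i].lower()][:top_k])
    return " ".join(words)

source = "Come into town with me today to buy food"
for seed in range(3):
    print(substitute(source, seed=seed))
```

A contextual model improves on this by ranking candidates for each position given the whole sentence, rather than using one fixed list per word.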

https://www.analyticsvidhya.com/blog/2021/08/nlpaug-a-python-library-to-augment-your-text-data/

In [9]:
!pip install nlpaug

Collecting nlpaug
  Downloading nlpaug-1.1.10-py3-none-any.whl (410 kB)

In [10]:
import nlpaug.augmenter.word as naw
# main parameters to adjust
ACTION = 'substitute' # or use 'insert'
TOP_K = 15 # randomly draw from top 15 suggested words
AUG_P = 0.40 # augment 40% of words within text
aug_bert = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', 
    action=ACTION, 
    top_k=TOP_K,
    aug_p=AUG_P
    )
text = """
Come into town with me today to buy food!
"""
augmented_text = aug_bert.augment(text, n=3) # n: num. of outputs
print(augmented_text)

['come out school behind me tonight to buy food!', 'come stay here with me tomorrow to have food!', 'come into business with this... to buy it!']


In [11]:
text = """
Can you solve my issues today
"""
augmented_text = aug_bert.augment(text, n=3) # n: num. of outputs
print(augmented_text)

['if we solve my issues...', 'will it solve my question today', 'can i repeat my equation today']


In [59]:
text = """
What time does school begin?
"""
augmented_text = aug_bert.augment(text, n=3) # n: num. of outputs
print(augmented_text)

['what class should school finish?', 'how time does someone break?', 'where time does summer start?']


In [60]:
t1 = """
The school looks like a prison.
"""
augmented_text = aug_bert.augment(t1, n=3) # n: num. of outputs
print(augmented_text)

['law school was like my prison.', 'the place looks like an museum.', 'the story ended like a dream.']


### Use Case
Let’s say you are training a supervised classification model on a dataset that has 15k positive reviews, and only 4k negative reviews. A heavily imbalanced dataset such as this will create model bias towards the majority class (positive reviews) during training.
Simply duplicating examples of the minority class (negative reviews) will not add any new information to the model. Instead, use the text augmentation features of NLPAug to add varied examples to the minority class. This technique has been shown to improve AUC and F1-score.
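
As a rough sketch of the bookkeeping involved, the helper below works out how many augmented examples are needed to match the majority class and generates them with any augmentation function you pass in. The function names are hypothetical, and `augment_fn` is a placeholder for whatever augmenter you choose (for example, a wrapper around the `aug_bert` object above).

```python
import math

def augmentation_plan(n_majority, n_minority):
    """Return (total new examples needed, max augmentations per minority example)."""
    deficit = max(0, n_majority - n_minority)
    per_example = math.ceil(deficit / n_minority) if n_minority else 0
    return deficit, per_example

def balance(minority_texts, n_majority, augment_fn):
    """Generate just enough augmented minority examples to match the majority."""
    deficit, per_example = augmentation_plan(n_majority, len(minority_texts))
    new_texts = []
    for text in minority_texts:
        for _ in range(per_example):
            if len(new_texts) == deficit:
                return new_texts
            new_texts.append(augment_fn(text))
    return new_texts

# With 15k positive and 4k negative reviews, 11k new negatives are needed,
# which is at most 3 augmented copies per original negative review.
print(augmentation_plan(15_000, 4_000))
```

Balancing fully is a design choice; a partial rebalance may be enough in practice.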