# **Basic Text Preprocessing**

This notebook covers:

**1.** Converting to lowercase

**2.** Removing stop words and punctuation

**3.** Finding POS tags

**4.** Lemmatization

## **Install packages if not yet installed**

In [1]:
import sys

!{sys.executable} -m pip install azure-storage-blob # Azure Blob Storage
!{sys.executable} -m pip install nltk # NLTK
!{sys.executable} -m pip install spacy # spaCy
!{sys.executable} -m pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz # spaCy language model

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz (13.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.7/13.7 MB 60.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done


## **Create connection string to storage account**

In [2]:
import os

# Read the connection string from the environment instead of hardcoding the account key.
# The variable name AZURE_STORAGE_CONNECTION_STRING is an assumption; set it before running.
connectionString=os.environ["AZURE_STORAGE_CONNECTION_STRING"]
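
A storage connection string is a semicolon-separated list of `Key=Value` pairs. The following is a minimal sketch of how such a string could be checked before use, assuming an account-key style string like the one in this notebook. The helper `parse_connection_string`, the `REQUIRED_KEYS` set, and the placeholder account name and key are all hypothetical and not part of the Azure SDK.

```python
# Keys this sketch expects to find (assumption: account-key authentication).
REQUIRED_KEYS = {"AccountName", "AccountKey"}

def parse_connection_string(conn_str):
    """Split a 'Key=Value;Key=Value' string into a dict and check required keys."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only: base64 account keys usually end in "=" padding.
        key, _, value = segment.partition("=")
        parts[key] = value
    missing = REQUIRED_KEYS - parts.keys()
    if missing:
        raise ValueError(f"Connection string is missing: {sorted(missing)}")
    return parts

# Placeholder values only; never commit a real account key.
example = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=PLACEHOLDER==;EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(example)
print(parts["AccountName"])
```

Failing early on a malformed string gives a clearer error than one raised later during the upload.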

## **Download stopwords**

In [3]:
import nltk, re
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.corpus import stopwords
# Stored as a set so that membership tests are fast.
STOP_WORDS_LIST=set(stopwords.words('english'))

In [5]:
# Define a function to remove stopwords and punctuation.
def removeStopwordsAndPunctuation(text):
    # Replace anything other than word characters (letters, digits, underscore) and spaces,
    # then collapse repeated whitespace and convert to lowercase.
    text=re.sub(r"\s+", " ", re.sub(r"[^\w ]", " ", text)).lower()
    # Removing stopwords
    text=[word for word in text.split() if word not in STOP_WORDS_LIST]
    return ' '.join(text)
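
To make the effect of this step concrete, here is a self-contained sketch that applies the same two regex substitutions to one sentence. So that it runs without downloading anything, it assumes a tiny hand-picked stop-word set in place of NLTK's English list. The names `SAMPLE_STOP_WORDS` and `remove_stopwords_and_punctuation` are hypothetical.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (a tiny subset).
SAMPLE_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def remove_stopwords_and_punctuation(text, stop_words=SAMPLE_STOP_WORDS):
    # Replace every character that is not a word character or a space with a space.
    text = re.sub(r"[^\w ]", " ", text)
    # Collapse runs of whitespace into single spaces and lowercase the result.
    text = re.sub(r"\s+", " ", text).lower()
    # Drop the stop words and rejoin the remaining tokens.
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = remove_stopwords_and_punctuation("The C++ compiler is fast, and easy to use!")
print(cleaned)  # c compiler fast easy use
```

Note that symbols are stripped, so `C++` collapses to `c`, while underscores survive because `\w` matches them.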

## **Load the language model**

The `en_core_web_sm` pipeline consists of the following components: `['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']`

`ner` (Named-Entity Recognition) is disabled to speed up processing, since entities are not needed here.

In [6]:
import spacy

In [7]:
nlp=spacy.load("en_core_web_sm", disable=["ner"])
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer']

## **Preprocess Text**

**1.** Remove stop words and punctuation.

**2.** Pass the article texts to spaCy as a list of tuples, in the form shown below. The text of each article comes first in each tuple, followed by a dictionary holding the article's `id`. This makes it easy to recover the ID of an article after spaCy has processed its text. When passing data in this format to the `nlp.pipe()` method, set `as_tuples=True`.

```python
[
    ("text ...", {"ID" : <ID-Value>}),
    ("text ...", {"ID" : <ID-Value>}),
    ...
]
```

**3.** Replace the tokens in the article text with their lemmatized form.
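
The tuple format described in step 2 can be sketched without spaCy. With `as_tuples=True`, `nlp.pipe()` yields `(doc, context)` pairs in input order, so the context dictionary travels alongside each text. In the sketch below, `fake_pipe` is a hypothetical stand-in for that call (it only splits on whitespace), and the sample records are made up for illustration.

```python
# Made-up sample records standing in for rows of the articles table.
records = [
    {"id": 7, "text": "sql structured query language"},
    {"id": 9, "text": "foundation css open source"},
]

# Build the (text, context) tuples: text first, then a dict carrying the ID.
data = [(record["text"], {"ID": record["id"]}) for record in records]

def fake_pipe(pairs):
    """Hypothetical stand-in for nlp.pipe(pairs, as_tuples=True).

    Yields (processed, context) pairs in input order; the "processing"
    here is just whitespace tokenization, not real lemmatization.
    """
    for text, context in pairs:
        yield text.split(), context

# Use the context to map each processed text back to its article ID.
processed = {context["ID"]: " ".join(tokens) for tokens, context in fake_pipe(data)}
print(processed[9])
```

Collecting results in a dictionary keyed by ID is one way to write them back in a single step afterwards.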

In [8]:
# Define a function to preprocess text (modifies `articles` in place).
def textPreprocess(articles):
    # Remove stop words and punctuation
    articles.loc[:, "text"]=articles.text.apply(removeStopwordsAndPunctuation)
    # Pair each text with a context dict carrying its ID, as expected by as_tuples=True.
    data=[(text, {"ID": id_}) for text, id_ in zip(articles.text, articles.id)]
    # Replace each text with the space-joined lemmas of its tokens.
    for doc, attr in nlp.pipe(data, as_tuples=True, n_process=-1, batch_size=32):
        articles.loc[articles.id==attr["ID"], "text"]=" ".join(token.lemma_.strip() for token in doc)

## **Read the dataset files**

**1.** Read each dataset file.

**2.** Preprocess text.

**3.** Save the result to a new blob.
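
As a rough sketch of the naming scheme these steps rely on, the helper below derives the source URL and target blob name for each file index. The function `blob_paths` is hypothetical. The account URL, container, and file-name patterns are taken from the cells in this notebook.

```python
ACCOUNT_URL = "https://shivmldatasets.blob.core.windows.net"
CONTAINER = "ml-datasets"

def blob_paths(i):
    """Return (source_url, target_blob_name) for dataset file number i."""
    source_url = f"{ACCOUNT_URL}/{CONTAINER}/gfg-articles-scrapped-{i}.csv"
    target_blob = f"preprocessed/gfg-articles-preprocessed-{i}.csv"
    return source_url, target_blob

# range(1, 35) covers file numbers 1 through 34 inclusive.
paths = [blob_paths(i) for i in range(1, 35)]
print(len(paths), paths[0][1])
```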

In [9]:
import pandas as pd
from azure.storage.blob import BlobClient

In [10]:
%%time
for i in range(1, 35):
    # Read the blob file.
    articles=pd.read_csv(f"https://shivmldatasets.blob.core.windows.net/ml-datasets/gfg-articles-scrapped-{i}.csv")
    articles.drop(columns=["Unnamed: 0"], inplace=True)
    articles.text=articles.text.astype("str")
    # Preprocess text.
    textPreprocess(articles)
    # Save to blob.
    blob=BlobClient.from_connection_string(conn_str=connectionString, container_name="ml-datasets", blob_name=f"preprocessed/gfg-articles-preprocessed-{i}.csv")
    blob.upload_blob(articles[["id", "text"]].to_csv(), overwrite=True)

CPU times: user 3min 28s, sys: 9.56 s, total: 3min 37s
Wall time: 5min 7s


In [14]:
# Show the first 5 rows.
articles=pd.read_csv("https://shivmldatasets.blob.core.windows.net/ml-datasets/preprocessed/gfg-articles-preprocessed-1.csv")
articles.head()

Unnamed: 0.1,Unnamed: 0,id,text
0,0,0,sql structured query language allow we select ...
1,1,1,foundation css open source responsive front en...
2,2,2,although many we already aware microsoft excel...
3,3,3,servlet simple java program run server capable...
4,4,4,suffix sum arraygiven array arr size n task co...
