# Simple Non-Vectorization Text Features



Text cannot be used directly by downstream machine learning or deep learning models: these models are mathematical functions at heart, so we first need to transform text into numeric or vectorized formats.

In this tutorial, we will extract various features from text so that they can be used for different NLP tasks. Vectorization methods are covered in the next tutorial.

Here we focus on simple non-vectorized methods that derive features from various properties of the text content. Because these features are numeric, downstream machine learning models can use them for the tasks you will learn about in the next few modules.

In this notebook, we will cover:
- Count based hand-crafted features
- Parts of Speech based features
- Text Legibility features
- Sentiment-based Features

## Load Libraries

In [43]:
import string
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

## Load Dataset

``sklearn`` provides a number of datasets for understanding and building NLP pipelines. In this notebook we will use the **20-newsgroups** dataset, which consists of a number of news articles classified under various categories.

For this notebook we will focus only on the article _text_, and only for the space science category (`sci.space`).

In [2]:
cats = ['sci.space']
news_group_data = fetch_20newsgroups(
    subset='train',
    categories=cats,
    remove=('headers', 'footers', 'quotes'),
).data

In [5]:
df = pd.DataFrame(data={
    'article':news_group_data
})
df.shape

(593, 1)

In [6]:
df.head()

| | article |
|---|---|
| 0 | \nAny lunar satellite needs fuel to do regular... |
| 1 | \nGlad to see Griffin is spending his time on ... |
| 2 | \n\n\nIn spite of my great respect for the peo... |
| 3 | \n\n\n\n\nDidn't one of the early jet fighters... |
| 4 | I just got out of the Army. Go signal corps or... |


What the first article looks like:

In [7]:
df['article'].values[0]

"\nAny lunar satellite needs fuel to do regular orbit corrections, and when\nits fuel runs out it will crash within months.  The orbits of the Apollo\nmotherships changed noticeably during lunar missions lasting only a few\ndays.  It is *possible* that there are stable orbits here and there --\nthe Moon's gravitational field is poorly mapped -- but we know of none.\n\nPerturbations from Sun and Earth are relatively minor issues at low\naltitudes.  The big problem is that the Moon's own gravitational field\nis quite lumpy due to the irregular distribution of mass within the Moon."

## Count Based Features

Counting the presence or absence of certain words or characters is a simple proxy for the information contained in a sentence or corpus. In this section, we will prepare several count-based hand-crafted features:

- **Word Count**: total number of words in each document
- **Character Count**: total number of characters in each document
- **Average Word Density**: average length of the words used in each document, approximated here as the character count divided by the word count plus one
- **Punctuation Count**: total number of punctuation marks in each document
- **Upper Case Count**: total number of upper case words in each document
- **Title Word Count**: total number of proper case (title) words in each document

In [8]:
feature_col = 'article'

In [9]:
df['char_count'] = df[feature_col].apply(len)

In [10]:
df['word_count'] = df[feature_col].apply(lambda x: len(x.split()))

In [11]:
df['word_density'] = df['char_count'] / (df['word_count']+1)

In [12]:
df['punctuation_count'] = df[feature_col].apply(lambda x: len("".join(a for a in x if a in string.punctuation)))

In [13]:
df['title_word_count'] = df[feature_col].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))

In [14]:
df['upper_case_word_count'] = df[feature_col].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [15]:
df.head()

| | article | char_count | word_count | word_density | punctuation_count | title_word_count | upper_case_word_count |
|---|---|---|---|---|---|---|---|
| 0 | \nAny lunar satellite needs fuel to do regular... | 575 | 97 | 5.867347 | 14 | 9 | 0 |
| 1 | \nGlad to see Griffin is spending his time on ... | 184 | 32 | 5.575758 | 2 | 3 | 0 |
| 2 | \n\n\nIn spite of my great respect for the peo... | 666 | 122 | 5.414634 | 15 | 13 | 8 |
| 3 | \n\n\n\n\nDidn't one of the early jet fighters... | 262 | 46 | 5.574468 | 11 | 5 | 4 |
| 4 | I just got out of the Army. Go signal corps or... | 281 | 50 | 5.509804 | 4 | 7 | 2 |
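
To make the individual definitions easier to check, here is a minimal sketch that computes the same six features for a single string. The helper name `count_features` is hypothetical and not part of the notebook; it only mirrors the expressions used in the cells above.

```python
import string


def count_features(text):
    """Compute the count-based features for one document (illustrative helper)."""
    words = text.split()
    char_count = len(text)
    word_count = len(words)
    return {
        "char_count": char_count,
        "word_count": word_count,
        # +1 in the denominator keeps the ratio defined for empty documents
        "word_density": char_count / (word_count + 1),
        "punctuation_count": sum(ch in string.punctuation for ch in text),
        "title_word_count": sum(w.istitle() for w in words),
        "upper_case_word_count": sum(w.isupper() for w in words),
    }


features = count_features("Hello World, this is NASA!")
print(features)
```

Note that a single-letter capital such as `I` satisfies both `str.istitle()` and `str.isupper()`, so it is counted in both the title and the upper case columns.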


## Parts of Speech Features

Count-based features are easy to create and understand, yet they do not make use of any linguistic constructs or contextual information. In week 1 we studied _Parts of Speech Tagging_. POS tagging helps us capture the different constructs of a sentence, such as nouns, verbs, etc.

In this section, we will prepare features based on POS tags, such as:

- Noun Count
- Verb Count
- Adjective Count
- Adverb Count
- Pronoun Count

[Reference](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/)

In [16]:
import nltk

In [17]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /Users/laurent/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/laurent/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [20]:
import textblob
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' :  ['RB','RBR','RBS','WRB']
}

In [21]:
# function to count the words in a given text whose part of speech tag
# belongs to the requested family (e.g. 'noun', 'verb')
# note this may take some time to execute on larger corpora

def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for _, tag in wiki.tags:
            if tag in pos_family[flag]:
                cnt += 1
    except Exception:
        # if tagging fails for a document, fall back to the count so far
        pass
    return cnt
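
Since `check_pos_tag` tags the full text once per family, each article ends up being tagged five times. As a rough sketch of an alternative, the counting step can be separated from the tagging step so that all families are counted in one pass. The function name `count_pos_families` is hypothetical, and the example assumes its input is a list of `(token, tag)` pairs such as the one produced by `TextBlob(x).tags`; the tagged sentence below is written by hand for illustration.

```python
POS_FAMILY = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS'],
    'pron': ['PRP', 'PRP$', 'WP', 'WP$'],
    'verb': ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'],
    'adj': ['JJ', 'JJR', 'JJS'],
    'adv': ['RB', 'RBR', 'RBS', 'WRB'],
}

# Reverse lookup: tag -> family, built once
TAG_TO_FAMILY = {tag: fam for fam, tags in POS_FAMILY.items() for tag in tags}


def count_pos_families(tagged_tokens):
    """Count tokens per POS family from (token, tag) pairs in a single pass."""
    counts = {fam: 0 for fam in POS_FAMILY}
    for _, tag in tagged_tokens:
        fam = TAG_TO_FAMILY.get(tag)
        if fam is not None:  # tags outside the five families are ignored
            counts[fam] += 1
    return counts


# Hand-tagged example; in the notebook these pairs would come from a tagger
tagged = [('The', 'DT'), ('rocket', 'NN'), ('quickly', 'RB'),
          ('reached', 'VBD'), ('its', 'PRP$'), ('stable', 'JJ'), ('orbit', 'NN')]
pos_counts = count_pos_families(tagged)
print(pos_counts)
```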

In [22]:
feature_col = 'article'

In [23]:
df['noun_count'] = df[feature_col].apply(lambda x: check_pos_tag(x, 'noun'))

In [24]:
df['verb_count'] = df[feature_col].apply(lambda x: check_pos_tag(x, 'verb'))

In [25]:
df['adj_count'] = df[feature_col].apply(lambda x: check_pos_tag(x, 'adj'))

In [26]:
df['adv_count'] = df[feature_col].apply(lambda x: check_pos_tag(x, 'adv'))

In [27]:
df['pron_count'] = df[feature_col].apply(lambda x: check_pos_tag(x, 'pron'))

In [28]:
df[['article','noun_count','verb_count','adj_count','adv_count','pron_count']].head()

| | article | noun_count | verb_count | adj_count | adv_count | pron_count |
|---|---|---|---|---|---|---|
| 0 | \nAny lunar satellite needs fuel to do regular... | 28 | 13 | 16 | 8 | 4 |
| 1 | \nGlad to see Griffin is spending his time on ... | 9 | 5 | 2 | 2 | 2 |
| 2 | \n\n\nIn spite of my great respect for the peo... | 26 | 21 | 13 | 7 | 14 |
| 3 | \n\n\n\n\nDidn't one of the early jet fighters... | 13 | 8 | 2 | 6 | 2 |
| 4 | I just got out of the Army. Go signal corps or... | 14 | 7 | 6 | 4 | 3 |


## Text Legibility Features

There are a wide variety of text legibility tests which can be leveraged to extract various statistics from text data which can be used as a measure of ease of understanding, readability and complexity.

The [`textstat`](https://pypi.org/project/textstat/) package provides a wide variety of such tests which can be leveraged to extract text legibility features.

We will cover a few of the essential ones here including:

- Syllable Counts
- Sentence Counts
- Flesch Reading Ease Score
- Automated Readability Index

In [29]:
!pip install textstat



In [30]:
import textstat

## Syllable Count

`textstat.syllable_count` returns the number of syllables present in the given text document.

In [31]:
df['syllable_count'] = [textstat.syllable_count(doc)
                          for doc in df[feature_col].values]


## Sentence Count

`textstat.sentence_count` returns the number of sentences present in the given text document.

In [32]:
df['sentence_count'] = [textstat.sentence_count(doc)
                          for doc in df[feature_col].values]

## Flesch Reading Ease Score

In the Flesch reading-ease test, higher scores indicate material that is easier to read; lower scores mark passages that are more difficult to read. The formula for the Flesch reading-ease score (FRES) and the interpretation of the scores are shown below, based on [Wikipedia](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease).

![](https://i.imgur.com/YxgbUpv.png)

While the maximum score is 121.22, there is no limit on how low the score can be. A negative score is valid.
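
As a minimal sketch of the standard Flesch reading-ease formula, the score can be computed directly from word, sentence, and syllable totals. The function name `flesch_reading_ease_from_counts` is hypothetical. `textstat` does its own tokenization, syllable counting, and rounding, so its results will not necessarily match this calculation on the same text.

```python
def flesch_reading_ease_from_counts(total_words, total_sentences, total_syllables):
    """Standard Flesch reading-ease formula applied to precomputed counts."""
    if total_words == 0 or total_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# The easiest possible text: one sentence made of a single one-syllable word
max_score = flesch_reading_ease_from_counts(1, 1, 1)
print(round(max_score, 2))
```

The one-word, one-syllable case reproduces the maximum of 121.22 quoted above, and long sentences with many-syllable words push the score below zero.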

In [33]:
df['flesch_reading_ease_score'] = [textstat.flesch_reading_ease(doc)
                                      for doc in df[feature_col].values]

## Automated Readability Index

The automated readability index (ARI) is a readability test for English texts, designed to gauge the understandability of a text.

Like the Flesch–Kincaid grade level, Gunning fog index etc., it produces an approximate representation of the US grade level needed to comprehend the text.

The complete formula for computing the index, together with its interpretation, is shown below, courtesy of [Wikipedia](https://en.wikipedia.org/wiki/Automated_readability_index).

![](https://i.imgur.com/2ohzUok.png)
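
The standard ARI formula can likewise be sketched from raw counts, where characters means letters and digits. The function name `ari_from_counts` is hypothetical, and `textstat` may count characters and sentences differently, so treat this as an illustration of the formula rather than a reimplementation of the library.

```python
def ari_from_counts(total_characters, total_words, total_sentences):
    """Standard automated readability index applied to precomputed counts."""
    if total_words == 0 or total_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    chars_per_word = total_characters / total_words
    words_per_sentence = total_words / total_sentences
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


# 100 words averaging 4 characters each, spread over 5 sentences
ari = ari_from_counts(total_characters=400, total_words=100, total_sentences=5)
print(round(ari, 2))
```

Unlike the Flesch score, ARI relies on characters per word instead of syllables per word, so it needs no syllable counting at all.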

In [34]:
df['automated_readability_index'] = [textstat.automated_readability_index(doc)
                                      for doc in df[feature_col].values]

In [35]:
df[['article', 'syllable_count', 'sentence_count', 'flesch_reading_ease_score',
    'automated_readability_index']].head()

| | article | syllable_count | sentence_count | flesch_reading_ease_score | automated_readability_index |
|---|---|---|---|---|---|
| 0 | \nAny lunar satellite needs fuel to do regular... | 147 | 5 | 60.65 | 11.6 |
| 1 | \nGlad to see Griffin is spending his time on ... | 47 | 2 | 63.7 | 8.8 |
| 2 | \n\n\nIn spite of my great respect for the peo... | 163 | 7 | 79.19 | 8.0 |
| 3 | \n\n\n\n\nDidn't one of the early jet fighters... | 60 | 5 | 87.52 | 4.3 |
| 4 | I just got out of the Army. Go signal corps or... | 68 | 3 | 71.44 | 8.1 |


## Sentiment Based Features

If you are dealing with subjective, opinionated text in which people often express strong emotions and feelings, the documents are good candidates for extracting sentiment as a feature.

TextBlob is an excellent open-source library for performing NLP tasks with ease, including sentiment analysis. It also has a sentiment lexicon (in the form of an XML file) which it leverages to give both polarity and subjectivity scores.

- The polarity score is a float within the range [-1.0, 1.0].
- The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

Perhaps this could be used for getting some new features? Let's look at some basic examples.

In [36]:
textblob.TextBlob('This is an AMAZING pair of Jeans!').sentiment

Sentiment(polarity=0.7500000000000001, subjectivity=0.9)

In [37]:
textblob.TextBlob('I really hated this UGLY T-shirt!!').sentiment

Sentiment(polarity=-0.95, subjectivity=0.85)

Remember this is unsupervised, lexicon-based sentiment analysis where we don't have any pre-labeled data saying which article might have a positive or negative sentiment. We use the lexicon to determine this.
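
To illustrate what "lexicon-based" means, here is a deliberately simplified toy sketch. The lexicon entries, their scores, and the function name `toy_polarity` are all made up for illustration. This is not TextBlob's lexicon or algorithm, which also handles intensifiers, negation, and subjectivity.

```python
import string

# Hypothetical toy lexicon: word -> polarity in [-1.0, 1.0] (made-up values)
TOY_LEXICON = {
    'amazing': 0.8,
    'good': 0.6,
    'bad': -0.6,
    'ugly': -0.7,
}


def toy_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of the words found in the lexicon (0.0 if none match)."""
    tokens = [tok.strip(string.punctuation).lower() for tok in text.split()]
    scores = [lexicon[tok] for tok in tokens if tok in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity('This is an AMAZING pair of Jeans!'))
print(toy_polarity('An ugly, bad design.'))
print(toy_polarity('The orbit is stable.'))
```

No labeled training data is involved: the score depends only on which words appear in the lexicon, which is why text with few opinion words tends to score close to zero.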

In [38]:
df_snt_obj = df[feature_col].apply(lambda row: textblob.TextBlob(row).sentiment)
df['Polarity'] = [obj.polarity for obj in df_snt_obj.values]
df['Subjectivity'] = [obj.subjectivity for obj in df_snt_obj.values]

In [39]:
df[['article', 'Polarity', 'Subjectivity']].head()

| | article | Polarity | Subjectivity |
|---|---|---|---|
| 0 | \nAny lunar satellite needs fuel to do regular... | -0.015909 | 0.431993 |
| 1 | \nGlad to see Griffin is spending his time on ... | 0.2 | 0.6 |
| 2 | \n\n\nIn spite of my great respect for the peo... | -0.008532 | 0.535714 |
| 3 | \n\n\n\n\nDidn't one of the early jet fighters... | 0.130208 | 0.244444 |
| 4 | I just got out of the Army. Go signal corps or... | 0.21 | 0.58 |


Let's look at the article with the most negative sentiment, along with its other features.

In [40]:
df['Polarity'].idxmin()

226

In [41]:
df.iloc[226]

article                        \nI assume, then, that someone at Thiokol put ...
char_count                                                                   225
word_count                                                                    42
word_density                                                            5.232558
punctuation_count                                                              7
title_word_count                                                               2
upper_case_word_count                                                          2
noun_count                                                                     9
verb_count                                                                     9
adj_count                                                                      1
adv_count                                                                      3
pron_count                                                                     3
syllable_count              

In [42]:
df.iloc[226]['article']

'\nI assume, then, that someone at Thiokol put on their "manager\'s hat" and said\nthat pissing off the customer by delaying shipment of the SRB to look inside\nit was a bad idea, regardless of where that tool might have ended up.'