# Feature Engineering in NLP
* Notebook by Adam Lang
* Date: 7/19/2024

# Overview
* In this notebook we will go over important techniques to consider when performing feature engineering in NLP.



## Feature Engineering Techniques
1. Special Characters and Numbers
  * chars: `@twitter`, `#datascience`
  * nums: `9897xxxxxx`, phone numbers, etc.
2. Word Counts
  * Sentiment analysis
     * Do positive and negative texts differ in word count?

3. Average word length
   * Document analysis
   * Example: social media messages vs. legal documents or speeches

4. Stopwords
   * Stopword counts are informative in some cases but not all.
   * Useful: for example, when analyzing political speeches
   * Not useful: text classification, named entity recognition

5. POS tags and NER
   * Count the POS tags and named-entity (NER) labels in each document.
   * Document analysis example: a text may have
       * NER = 4
       * Person = 1
       * Noun = 5
       * Verb = 3
   * These counts can be useful in text analytics and as inputs to machine learning models.

6. Other features
   * The right choice depends on the documents and text you are working with.

# Implementation of Feature Engineering
* We will demo the following:
1. Special characters and numbers
2. Word counts
3. Number of characters
4. Average word length
5. Stop words
6. POS tags
7. NER

## Load and preprocess data

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
## imports
import pandas as pd
import re

In [3]:
# spacy imports
import spacy
# stopwords spacy
from spacy.lang.en.stop_words import STOP_WORDS

# spacy english language model
nlp = spacy.load('en_core_web_sm')


In [4]:
## data path
data_path = '/content/drive/MyDrive/Colab Notebooks/Classical NLP/tweets-.csv'

In [5]:
# load data -- only the first 1000 rows
df = pd.read_csv(data_path, nrows=1000)

# df head
df.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False
1,RT @Hemant_80: Did you vote on #Demonetization...,False,0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66,True,False
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False
3,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338,True,False
4,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False


In [6]:
# df info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   text           1000 non-null   object 
 1   favorited      1000 non-null   bool   
 2   favoriteCount  1000 non-null   int64  
 3   replyToSN      104 non-null    object 
 4   created        1000 non-null   object 
 5   truncated      1000 non-null   bool   
 6   replyToSID     73 non-null     float64
 7   id             1000 non-null   float64
 8   replyToUID     104 non-null    float64
 9   statusSource   1000 non-null   object 
 10  screenName     1000 non-null   object 
 11  retweetCount   1000 non-null   int64  
 12  isRetweet      1000 non-null   bool   
 13  retweeted      1000 non-null   bool   
dtypes: bool(4), float64(3), int64(2), object(5)
memory usage: 82.2+ KB


In [7]:
#we only need the text column
df.drop(df.columns[1:], axis=1, inplace=True)

In [8]:
## new df
df.head()

Unnamed: 0,text
0,RT @rssurjewala: Critical question: Was PayTM ...
1,RT @Hemant_80: Did you vote on #Demonetization...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...
4,RT @satishacharya: Reddy Wedding! @mail_today ...


In [12]:
## sample some tweets
df[['text']].sample(5)

Unnamed: 0,text
388,#Demonetization  How does it impact the commo...
82,RT ZeeNewsSports: #Demonetization: harbhajan_s...
137,The only thing I have gained from Modi's #demo...
348,Ts is exactly what Pappu &amp; opposition has ...
65,RT @CMOMaharashtra: Watch what @sonunigam thin...


In [13]:
## example tweet
df.loc[0,'text']

"RT @rssurjewala: Critical question: Was PayTM informed about #Demonetization edict by PM? It's clearly fishy and requires full disclosure &amp;\x85"

In [15]:
## another tweet
df.loc[512, 'text']

'RT @smita_muk: BREAKING NEWS\r\nPMapps result amnounced!\r\n90% Indians support #demonetization\r\n<ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><ed><U+00A0><U+00BD><ed><U+00B1><U+008F><U+270C><U+270C><U+270C><U+270C><U+270C><U+270C><ed><U+00A0><U+00BD><ed><U+00B1><U+0086><ed><U+00A0><U+00BD><ed><U+00B1><U+0086><ed><U+00A0><U+00BD><ed><U+00B1><U+0086><ed><U+00A0><U+00BD><ed><U+00B1><U+0086><ed><U+00A0><U+00BD><ed><U+00B1><U+0086><ed><U+00A0><U+00BD><ed><U+00B1><U+0086>\r\n@narendramodi Zindabad!'

Summary:
* The tweets contain escaped Unicode code points written out as literal text (for example `<U+270C>` and `<ed>`), plus `\r\n` line breaks. We will remove these.

In [16]:
## preprocess tweets
def preprocess(text):

  # remove escaped Unicode placeholders such as <U+270C> and <ed>
  text = re.sub(r'<U\+[A-Z0-9]+>|<ed>', "", text)
  # remove newline (\n) and carriage-return (\r) characters
  text = re.sub(r'\n|\r', "", text)

  return text

In [17]:
## applying function
df['text'] = df['text'].apply(preprocess)

In [18]:
## print result
df.head()

Unnamed: 0,text
0,RT @rssurjewala: Critical question: Was PayTM ...
1,RT @Hemant_80: Did you vote on #Demonetization...
2,"RT @roshankar: Former FinSec, RBI Dy Governor,..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...
4,RT @satishacharya: Reddy Wedding! @mail_today ...


In [19]:
## let's look at row 512 again
df.loc[512, 'text']

'RT @smita_muk: BREAKING NEWSPMapps result amnounced!90% Indians support #demonetization@narendramodi Zindabad!'

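Summary:
* The escaped Unicode placeholders and line breaks are gone from this tweet and the others.
* Because line breaks were replaced with an empty string, neighbouring words are now joined (`NEWSPMapps`, `amnounced!90%`).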
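
Removing `\r\n` outright glues neighbouring words together, as `NEWSPMapps` in the output above shows, and that in turn affects the word-based features computed later. A minimal sketch of one alternative, assuming it is acceptable to collapse every run of whitespace into a single space. The function name `preprocess_keep_spaces` is hypothetical and not part of the notebook:

```python
import re

# Hypothetical variant of preprocess(); the name is an assumption.
def preprocess_keep_spaces(text: str) -> str:
    # drop escaped code-point placeholders such as <U+270C> and <ed>
    text = re.sub(r'<U\+[A-Z0-9]+>|<ed>', '', text)
    # collapse each run of whitespace (including \r\n) into one space
    text = re.sub(r'\s+', ' ', text)
    # trim any space left at the start or end
    return text.strip()

sample = 'BREAKING NEWS\r\nPMapps result<U+270C><ed>\r\n@narendramodi Zindabad!'
cleaned = preprocess_keep_spaces(sample)
print(cleaned)
```

With this variant the words on either side of a line break stay separate, so `split()`-based counts would treat them as two words.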

In [20]:
## shape of df
df.shape

(1000, 1)

# Extract Text Based Features

### 1. Special Characters
* We can use special characters from the tweets to create features.

### 1.1 Number of mentions in Tweets


In [21]:
## function to count mentions
def mentions(text):

  # find all @mentions (raw string so \w is not read as a string escape)
  found = re.findall(r'@\w+', text)

  # return mentions count
  return len(found)

In [22]:
# applying function
df['mentions_count'] = df['text'].apply(mentions)

In [23]:
# view features
df.head(10)

Unnamed: 0,text,mentions_count
0,RT @rssurjewala: Critical question: Was PayTM ...,1
1,RT @Hemant_80: Did you vote on #Demonetization...,1
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2
5,@DerekScissors1: Indias #demonetization: #Bla...,2
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1
7,RT @Joydeep_911: Calling all Nationalists to j...,1
8,RT @sumitbhati2002: Many opposition leaders ar...,2
9,National reform now destroyed even the essence...,0


In [24]:
df.describe()

Unnamed: 0,mentions_count
count,1000.0
mean,0.866
std,0.988444
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,8.0


Summary:
* The maximum number of mentions in a single tweet is 8; the mean is about 0.87.

### 1.2 Number of hashtags used in Tweets

In [33]:
## function to count hashtags
def hashtags(text):

  # find all #hashtags (raw string so \w is not read as a string escape)
  found = re.findall(r'#\w+', text)

  # return count of hashtags
  return len(found)

In [34]:
df.columns

Index(['text', 'mentions_count'], dtype='object')

In [35]:
# apply function
df['hashtags_count'] = df['text'].apply(hashtags)

In [36]:
## result
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1
9,National reform now destroyed even the essence...,0,1


In [38]:
## stats on hashtags
df['hashtags_count'].describe()

count    1000.000000
mean        1.688000
std         1.272114
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        10.000000
Name: hashtags_count, dtype: float64

Summary:
* The maximum number of hashtags in a single tweet is 10; the mean is about 1.7.
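
Patterns of the form `@\w+` and `#\w+` match anywhere in the text, including inside strings such as e-mail addresses. A sketch of a stricter variant, assuming a mention or hashtag should not be preceded by a word character. The names `MENTION_RE`, `HASHTAG_RE`, and `count_special` are hypothetical:

```python
import re

# (?<!\w) is a negative lookbehind: the symbol must not follow a letter,
# digit, or underscore. These names are illustrative assumptions.
MENTION_RE = re.compile(r'(?<!\w)@\w+')
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')

def count_special(text: str) -> dict:
    """Count mentions and hashtags in one tweet."""
    return {
        'mentions_count': len(MENTION_RE.findall(text)),
        'hashtags_count': len(HASHTAG_RE.findall(text)),
    }

sample = 'RT @ANI_news: write to help@example.com about #Demonetization #RBI'
print(count_special(sample))
# the unanchored pattern also picks up the e-mail address
print(len(re.findall(r'@\w+', sample)))
```

The trade-off is that a mention glued directly to the previous word is skipped as well, so which variant is preferable depends on how the tweets were cleaned.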

### 1.3 Number of name titles in Tweet

In [39]:
# function to count name titles in Tweet
def title(text):
  # \b keeps 'Miss' from matching inside words such as 'Missing'
  titles = re.findall(r'\b(?:Mr|Mrs|Dr)\.|\bMiss\b', text)
  return len(titles)

In [40]:
# apply
df['text'].apply(title)

0      0
1      0
2      0
3      0
4      0
      ..
995    0
996    0
997    0
998    0
999    0
Name: text, Length: 1000, dtype: int64

Summary:
* Name titles (Mr., Mrs., Dr., Miss) do not occur in any of the rows shown, so they appear to be rare in these tweets.

### 2.1 Word Count

In [41]:
# list comp to count num of words in tweet
df['word_count'] = [len(i.split()) for i in df['text']]


In [42]:
## features
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17
9,National reform now destroyed even the essence...,0,1,18


In [43]:
## describe stats
df['word_count'].describe()

count    1000.000000
mean       16.685000
std         4.566468
min         3.000000
25%        14.000000
50%        17.000000
75%        20.000000
max        28.000000
Name: word_count, dtype: float64

Summary:
* The maximum number of words in a tweet is 28.
* The mean number of words per tweet is about 16.7 (median 17).

### 2.2 Number of Chars

In [44]:
## list comprehension count num of chars
df['char_count'] = [len(i) for i in df['text']]

In [45]:
# print features
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count,char_count
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20,144
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11,66
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21,138
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16,140
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9,107
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12,121
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22,143
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18,139
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17,139
9,National reform now destroyed even the essence...,0,1,18,140


In [46]:
# describe
df['char_count'].describe()

count    1000.000000
mean      124.673000
std        21.427861
min        34.000000
25%       117.000000
50%       135.000000
75%       139.000000
max       148.000000
Name: char_count, dtype: float64

Summary: The maximum number of characters in a tweet is 148, with a mean of about 125. Note that this count includes spaces.
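
If spaces should not count toward the character feature, one option is to sum the lengths of the whitespace-separated pieces instead. A small sketch, with the hypothetical helper name `char_count_no_spaces`:

```python
# Hypothetical helper: character count that ignores all whitespace.
def char_count_no_spaces(text: str) -> int:
    # split() drops every run of whitespace, so only visible chars remain
    return sum(len(piece) for piece in text.split())

sample = 'RT @Hemant_80: Did you vote'
total = len(sample)                      # counts the 4 spaces as well
visible = char_count_no_spaces(sample)   # spaces excluded
print(total, visible)
```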

### 3. Average Word Length

In [53]:
# function to calculate avg word length of tweet
def avg_word_len(text):

  # split the tweet into whitespace-separated words
  words = text.split()

  # guard against empty tweets to avoid division by zero
  if not words:
    return 0.0

  # total number of characters across all words
  total_len = sum(len(word) for word in words)

  # return avg len of words in tweet
  return round(total_len / len(words), 3)

In [54]:
# apply function
df['avg_word_len'] = df['text'].apply(avg_word_len)

In [55]:
# print features
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count,char_count,avg_word_len
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20,144,6.2
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11,66,5.091
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21,138,5.571
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16,140,7.75
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9,107,11.0
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12,121,9.167
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22,143,5.5
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18,139,6.722
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17,139,7.176
9,National reform now destroyed even the essence...,0,1,18,140,6.778


In [56]:
# describe
df['avg_word_len'].describe()

count    1000.00000
mean        6.86535
std         1.69860
min         3.89300
25%         5.65000
50%         6.57900
75%         7.64900
max        16.66700
Name: avg_word_len, dtype: float64

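Summary: The maximum average word length is about 16.7 and the mean is about 6.9. Mentions, hashtags, and URLs each count as a single word here, which likely pushes these averages up.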
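
Because `split()` treats mentions, hashtags, and links as ordinary words, long handles and URLs are included in the average. A sketch of a variant that skips them, assuming such tokens can be recognised by their prefix (`@`, `#`, `http`). The function name `avg_plain_word_len` is hypothetical:

```python
# Hypothetical variant: average word length over "plain" words only.
def avg_plain_word_len(text: str) -> float:
    # keep words that do not look like a mention, hashtag, or link
    words = [w for w in text.split()
             if not w.startswith(('@', '#', 'http'))]
    # nothing left to average (e.g. a tweet made only of hashtags)
    if not words:
        return 0.0
    return round(sum(len(w) for w in words) / len(words), 3)

sample = 'RT @abc: hello world https://t.co/x'
plain_avg = avg_plain_word_len(sample)
print(plain_avg)
```

The prefix test is a rough heuristic: punctuation attached to a word still counts toward its length, just as in the original function.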

### 4. Stopwords

In [57]:
# function to count num of stopwords in tweets
def stopwords(text):

  # create spacy object
  doc = nlp(text)

  # count tokens that spaCy flags as stopwords
  count = 0
  for token in doc:
    if token.is_stop:
      count += 1
  return count

In [58]:
# applying function
df['stopwords'] = df['text'].apply(stopwords)

In [59]:
# describe
df['stopwords'].describe()

count    1000.000000
mean        6.038000
std         3.285019
min         0.000000
25%         4.000000
50%         6.000000
75%         8.000000
max        19.000000
Name: stopwords, dtype: float64

Summary:
* The maximum number of stopwords in a tweet is 19; the mean is about 6.

### 5. POS tagging

In [61]:
## creating function using spacy for POS
def pos(text):

  # spacy object
  doc = nlp(text)

  # var store count of POS tags
  count = 0

  # iterate over tokens
  for token in doc:
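    # count nouns, adpositions (prepositions), and adjectives;
    # 'VBG' is a fine-grained tag (token.tag_), so it never matches
    # token.pos_ and verbs are not counted here ('VERB' would count them)
    if token.pos_ in ['NOUN', 'ADP', 'ADJ', 'VBG']: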
      count += 1

  # return the count
  return count

In [62]:
# apply func
df['POS'] = df['text'].apply(pos)

In [63]:
# head of df
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count,char_count,avg_word_len,stopwords,POS
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20,144,6.2,7,9
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11,66,5.091,4,4
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21,138,5.571,5,6
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16,140,7.75,2,10
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9,107,11.0,0,6
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12,121,9.167,4,6
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22,143,5.5,11,11
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18,139,6.722,8,7
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17,139,7.176,8,10
9,National reform now destroyed even the essence...,0,1,18,140,6.778,7,10


In [64]:
# describe result
df['POS'].describe()

count    1000.000000
mean        7.657000
std         3.096601
min         0.000000
25%         5.000000
50%         8.000000
75%        10.000000
max        19.000000
Name: POS, dtype: float64
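
The `stopwords` and `pos` functions each call `nlp(text)`, so every tweet is parsed more than once, and the single `POS` column hides which tags were found. A sketch of a one-pass alternative that returns a separate count per tag. To keep it runnable without a spaCy model, it uses a stand-in `FakeToken` class with hand-assigned tags; `FakeToken` and `token_features` are hypothetical names, and with spaCy the argument would be the `doc` returned by `nlp(text)`:

```python
from collections import Counter
from typing import Iterable, NamedTuple

class FakeToken(NamedTuple):
    """Stand-in for a spaCy token: only the attributes used below."""
    text: str
    pos_: str      # coarse part-of-speech tag, e.g. 'NOUN'
    is_stop: bool  # True if the token is a stopword

def token_features(doc: Iterable) -> dict:
    """Count stopwords and selected POS tags in a single pass."""
    pos_counts = Counter()
    stop_count = 0
    for token in doc:
        pos_counts[token.pos_] += 1
        if token.is_stop:
            stop_count += 1
    # a Counter returns 0 for tags that never occurred
    return {
        'stopwords': stop_count,
        'NOUN': pos_counts['NOUN'],
        'ADP': pos_counts['ADP'],
        'ADJ': pos_counts['ADJ'],
        'VERB': pos_counts['VERB'],
    }

# Hand-tagged example; the tags are illustrative, not model output.
doc = [
    FakeToken('Did', 'AUX', True),
    FakeToken('you', 'PRON', True),
    FakeToken('vote', 'VERB', False),
    FakeToken('on', 'ADP', True),
    FakeToken('reform', 'NOUN', False),
]
features = token_features(doc)
print(features)
```

Each key of the returned dictionary could become its own DataFrame column, which keeps noun, verb, and adjective counts available as separate features.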

### 6. NER


In [67]:
# function for NER count
def ner(text) -> int:

  # spacy obj
  doc = nlp(text)

  # doc.ents holds the named entities spaCy found; every entity
  # has a label, so the count is just the number of entities
  return len(doc.ents)

In [68]:
# apply function
df['NER'] = df['text'].apply(ner)

In [69]:
# result
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count,char_count,avg_word_len,stopwords,POS,NER
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20,144,6.2,7,9,3
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11,66,5.091,4,4,1
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21,138,5.571,5,6,4
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16,140,7.75,2,10,1
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9,107,11.0,0,6,2
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12,121,9.167,4,6,2
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22,143,5.5,11,11,3
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18,139,6.722,8,7,2
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17,139,7.176,8,10,1
9,National reform now destroyed even the essence...,0,1,18,140,6.778,7,10,3


In [70]:
# describe NERs
df['NER'].describe()

count    1000.000000
mean        2.269000
std         1.464508
min         0.000000
25%         1.000000
50%         2.000000
75%         3.000000
max         8.000000
Name: NER, dtype: float64

In [78]:
##  getting the NER labels
def ner_label(text_):

  # spacy obj
  doc = nlp(text_)

  # map each entity's text to its label; an entity text that occurs
  # more than once in a tweet is stored only once (dict keys are unique)
  return {ent.text: ent.label_ for ent in doc.ents}


In [79]:
# apply func
df['ner_label'] = df['text'].apply(ner_label)

In [80]:
df.head(10)

Unnamed: 0,text,mentions_count,hashtags_count,word_count,char_count,avg_word_len,stopwords,POS,NER,ner_label
0,RT @rssurjewala: Critical question: Was PayTM ...,1,1,20,144,6.2,7,9,3,"{'RT': 'GPE', 'PayTM': 'PERSON', 'about #Demon..."
1,RT @Hemant_80: Did you vote on #Demonetization...,1,1,11,66,5.091,4,4,1,{'#Demonetization': 'MONEY'}
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",1,1,21,138,5.571,5,6,4,"{'FinSec': 'ORG', 'RBI': 'ORG', 'CBDT Chair + ..."
3,RT @ANI_news: Gurugram (Haryana): Post office ...,1,1,16,140,7.75,2,10,1,{'Gurugram': 'PERSON'}
4,RT @satishacharya: Reddy Wedding! @mail_today ...,2,2,9,107,11.0,0,6,2,"{'@mail_today': 'DATE', '#': 'CARDINAL'}"
5,@DerekScissors1: Indias #demonetization: #Bla...,2,2,12,121,9.167,4,6,2,"{'#': 'CARDINAL', 'Blackmoney': 'PERSON'}"
6,RT @gauravcsawant: Rs 40 lakh looted from a ba...,1,1,22,143,5.5,11,11,3,"{'Rs 40': 'PRODUCT', 'Kishtwar': 'GPE', 'Third..."
7,RT @Joydeep_911: Calling all Nationalists to j...,1,2,18,139,6.722,8,7,2,"{'RT @Joydeep_911': 'ORG', 'Nationalists': 'NO..."
8,RT @sumitbhati2002: Many opposition leaders ar...,2,1,17,139,7.176,8,10,1,{'RT @sumitbhati2002': 'ORG'}
9,National reform now destroyed even the essence...,0,1,18,140,6.778,7,10,3,"{'#': 'CARDINAL', 'second': 'ORDINAL', 'https:..."
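
Storing entities as a dictionary keyed by entity text keeps one entry per distinct text, so repeated entities within a tweet are counted once. If the goal is a per-label count (for example, how many `ORG` entities a tweet has), a `Counter` over the labels may be a better fit. A sketch on hand-written `(text, label)` pairs; `label_counts` is a hypothetical helper, and with spaCy the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`:

```python
from collections import Counter

# Hypothetical helper: count how often each entity label occurs.
def label_counts(entities) -> dict:
    """entities: iterable of (entity_text, label) pairs."""
    return dict(Counter(label for _, label in entities))

# Illustrative pairs, not model output; 'RBI' appears twice.
pairs = [('RBI', 'ORG'), ('FinSec', 'ORG'), ('Kerala', 'GPE'), ('RBI', 'ORG')]

counts = label_counts(pairs)
print(counts)
# a dict keyed by entity text keeps only the distinct texts
print(len(dict(pairs)))
```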


In [84]:
# iterate through column 'ner_label' and count entity texts (keys) and labels (values)

from collections import defaultdict

key_counts = defaultdict(int)
value_counts = defaultdict(int)

for labels in df['ner_label']:
  for key, value in labels.items():
    key_counts[key] += 1
    value_counts[value] += 1

print("Key counts:", key_counts)
print("Value counts:", value_counts)


Key counts: defaultdict(<class 'int'>, {'RT': 5, 'PayTM': 2, 'about #Demonetization': 2, '#Demonetization': 77, 'FinSec': 2, 'RBI': 10, 'CBDT Chair + Harvard': 2, 'Aam Aadmi': 4, 'Gurugram': 1, '@mail_today': 1, '#': 288, 'Blackmoney': 1, 'Rs 40': 1, 'Kishtwar': 1, 'Third': 1, 'RT @Joydeep_911': 1, 'Nationalists': 2, 'RT @sumitbhati2002': 1, 'second': 1, 'https://t.co/eyySIREiUq': 1, 'Narendra Modi App': 1, '@Jaggesh2 Bharat': 1, 'RT @sona2905': 1, 'RT @Dipankar_cpiml: The Modi app': 1, 'Gandhi': 3, 'Kerala': 2, 'RT @AAPVind:': 1, '@naam_pk': 2, "about #NarendraModi's": 5, '100%': 4, '#survey@DonMu': 1, 'first': 11, 'RT @mayankjain100': 1, 'Communal and Black Money is Secular': 2, 'LOL': 3, '#Demonetization Effect': 1, 'https://t.co/aA5lfy9ALG': 1, 'RT roshankar': 4, 'Harvard': 6, 'Larry Summers': 6, '@arunjaitley': 4, 'Lrnd': 1, '2sales': 1, 'BAU Operations #Demonetization': 1, 'RT @kanimozhi:': 4, 'Pappu &': 2, '12.5': 1, '#DeMonetization': 7, 'Card': 1, 'Amex': 1, 'RT @harshkkapoor: