# Analysis of NSF Abstracts

Here we show some practical text analysis, with an emphasis on feature engineering, on the collection of NSF award abstracts. The abstracts are generously made available in the **UCI Machine Learning Repository** by **Michael J. Pazzani** (hope my citation is right!).

Each document consists of a research abstract along with metadata, including the name of the NSF department to which the abstract belongs. Here is an example of what these documents look like:


<img src="doc_ex.png" width="800">

As seen above, the funded research is in Mathematics and the department is the Division of Mathematical Sciences, with tag **DMS**.

For the sake of practice, we exclude all metadata and use only the abstract text to predict the department.

In [None]:
# Imports
import pandas as pd
from IPython.display import display
import os
import itertools
import re
import numpy as np
import matplotlib.pyplot as plt 
import pickle 
import time
import string
from collections import Counter
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.naive_bayes import MultinomialNB
import warnings
from sklearn.decomposition import NMF, LatentDirichletAllocation
import nltk
from nltk.tokenize import RegexpTokenizer
warnings.simplefilter("ignore")

# Reading Raw Files

For the sake of simplicity, I processed the data beforehand and simply load it here; the code that reads the raw files is in the **read_data** function for those who want to check.

In this showcase we use only a fraction of the text documents. The complete dataset covers 1990 to 2003 and contains almost 129,000 documents.

**PS:** *Later I decided to limit the scope of analysis to only 3 tags (classes) to ease running the notebook for everyone. You may skip this limitation and explore a broader analysis.*
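
To make the two extraction steps inside **read_data** concrete (a regular expression for the tag and a split on the word `Abstract`), here is a minimal sketch on a single record. The record is made up and only mimics the field layout that the function expects; it is not copied from the dataset.

```python
import re

# Hypothetical record that only mimics the layout of an NSF award file
# after its lines have been joined with spaces.
sample = (
    "Title       : A Made-Up Award Title "
    "NSF Org     : DMS "
    "Latest Amendment Date : January 1, 1999 "
    "Abstract    : This project studies made-up problems in analysis."
)

# Tag: whatever sits between "NSF Org" and "Latest", minus colons and spaces.
tag = re.search(r"NSF Org(.*?)Latest", sample).group(1).strip(": ")

# Abstract: the piece that follows the first "Abstract" marker.
abstract = sample.split("Abstract")[1].strip(": ")

print(tag)
print(abstract)
```

One thing to keep in mind: `split("Abstract")[1]` keeps only the text up to the next occurrence of the word `Abstract`, so an abstract that contains that word again would be cut short. `split("Abstract", 1)[1]` would keep everything after the first marker.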

In [None]:
def read_data(nsf_data_dir, pfile=False, wrt=False):
    if pfile:
        # Load the preprocessed DataFrame instead of walking the raw files
        with open("data.p", "rb") as fh:
            return pickle.load(fh)
    d = {}

    for root, dirs, files in os.walk(nsf_data_dir):
        print(root)
        for f in files:
            if 'txt' in f:
                d[f] = []
                fle = open(os.path.join(root, f), "r")
                content = ' '.join([ii.strip() for ii in fle.readlines()])
                d[f].append(re.search(r'NSF Org(.*?)Latest', content).group(1).strip(': '))
                d[f].append(content.split('Abstract')[1].strip(':  '))
                fle.close()

    
    data = pd.DataFrame.from_dict(d, orient='index')
    data.columns = ['Tag','Text']
    data['Text'] = data['Text'].str.replace('\n',' ').str.lower() # Removing new lines / Upper-Case to Lower-Case
    data['Text'] = data['Text'].str.replace('\t',' ') # Removing tabs
    data['Text'] = data['Text'].str.replace('_',' ') 
    data['Text'] = data['Text'].str.replace(r'[^\w\s]', ' ', regex=True) # Removing punctuation (regex=True is needed in pandas >= 2.0)
    data['Text'] = data['Text'].apply(lambda x: x.translate(str.maketrans('', '', string.digits))) # Removing digits
    if wrt:
        pickle.dump( data, open( "data_1995_2003.p", "wb" ) )
    
    return data

The indices are the file names, and the two columns Text and Tag hold the research abstract and the department respectively. As you will see below, some of the abstracts are not available! The description of the data reveals even more duplicate texts.

Let's go through them:

In [None]:
# I initially chose only 3 tags to keep the whole tutorial light.
# You can start (and even finish!) with a larger fraction of the data.
data = read_data('NSF',pfile=True)
data = data[data['Tag'].isin(['OCE','DMS','CHE'])]

# Have a look at how data is organized
display(data.head())

# Get a summary of data
display(data.describe())

In [None]:
# Search duplicates and keep only one from each
for text,count in Counter(data['Text']).items():
    if count>1:
        data.drop(list(data.index[data['Text']==text])[1::],axis=0,inplace=True)

# Get a summary of data again
display(data.describe())

So the problem of duplicates is solved (in a blind way!) but is every abstract informative?
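
As a side note, the same "keep the first copy of every text" rule can be sketched with pandas' built-in `drop_duplicates`. The small frame below is invented and only stands in for `data`.

```python
import pandas as pd

# Toy frame standing in for `data`; the texts are invented.
toy = pd.DataFrame(
    {
        "Tag": ["DMS", "DMS", "OCE", "CHE"],
        "Text": ["not available", "not available", "ocean currents", "not available"],
    },
    index=["a.txt", "b.txt", "c.txt", "d.txt"],
)

# Keep the first row of every distinct text, regardless of its tag.
deduped = toy.drop_duplicates(subset="Text", keep="first")
print(deduped)
```

Like the loop above, this is blind to the tag: if the same text appears under two departments, only the first occurrence survives.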

Let's explore short abstracts:

In [None]:
# Search short (probably not informative) abstracts
c = Counter(data['Text'])
for text, count in c.items():
    if len(text) < 150: # try with 1000 as well and see how it affects
        data.drop(data.index[data['Text']==text],axis=0,inplace=True)
        
# Get a summary of data again
display(data.describe())

Before starting anything, we need to know how the classes (here *Tags*) are distributed, as a general property of the data. A histogram is the usual way to do this, but here I would like to show more detail, so I simply sort the class populations and plot them.

**PS:** *This makes much more sense on the whole dataset. Now we are limited to only 3 classes. Please remove this limitation and run the code again to get a better insight into what an imbalanced class distribution means.*
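
For a quick numeric view of the class distribution, a minimal sketch with `value_counts` could look like this. The labels are invented; in the notebook the series would be `data['Tag']`.

```python
import pandas as pd

# Invented labels standing in for data['Tag'].
tags_series = pd.Series(["DMS", "DMS", "DMS", "OCE", "OCE", "CHE"], name="Tag")

counts = tags_series.value_counts()                # class sizes, largest first
shares = tags_series.value_counts(normalize=True)  # same, as fractions of all samples

print(counts)
print(shares.round(2))

# Ratio between the largest and the smallest class as a one-number summary.
imbalance_ratio = counts.max() / counts.min()
print("largest / smallest class:", imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced; on the full dataset this number would be expected to be much larger.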

In [None]:
# print('Class population in ascending order:\n',sorted(list(Counter(data['Tag'].tolist()).values())))
# plt.figure(figsize=(15,5))
# plt.plot(sorted(list(Counter(data['Tag'].tolist()).values())),'-*')
# plt.ylabel('Population of Classes')
# plt.xlabel('Class Index')
# plt.show()

### Now we can also check the histogram but we choose bins to put close class populations together.

print('Class populations\n',Counter(data['Tag'].tolist()).items())#('First 20 largest classes:\n',Counter(data['Tag'].tolist()).most_common(20))
# plt.figure(figsize=(15,5))
# plt.hist(list(Counter(data['Tag'].tolist()).values()),bins=40)
# plt.ylabel('# of Classes')
# plt.xlabel('Class Population')
# plt.show()

Let's reduce the problem to a few classes with a similar number of samples according to the histogram above (**1000** to **2000**).

**PS:** *This does not make sense on the reduced version of the problem. Try it on the whole dataset.*

In [None]:
tags = ['OCE','DMS','CHE'] 
# We already chose these tags at the beginning. It was supposed to happen here.
# From now on the original task equals the reduced version.

# To start our journey, we first need to tokenize the data
data = data[data.Tag.isin(tags)]
data['Tokens'] = data.apply(lambda row: nltk.word_tokenize(row['Text'].strip()), axis=1)
print(len(data),'samples from',len(tags),'classes')

Let's check the abstract length (in words and characters) as well. It is a pretty naive feature but let's explore it:
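
Before plotting, the same length features can be summarized per tag in one small table. This is only a sketch on invented texts, and it uses a plain whitespace split as a stand-in for `nltk.word_tokenize`, so the token counts will differ slightly from the real ones.

```python
import pandas as pd

# Invented texts standing in for data[['Tag', 'Text']].
toy = pd.DataFrame(
    {
        "Tag": ["DMS", "DMS", "OCE"],
        "Text": ["a b c", "a b", "a b c d e"],
    }
)

toy["n_chars"] = toy["Text"].str.len()                 # length in characters
toy["n_tokens"] = toy["Text"].str.split().str.len()    # length in whitespace tokens

# One row per tag, with min / median / max of both length features.
summary = toy.groupby("Tag")[["n_chars", "n_tokens"]].agg(["min", "median", "max"])
print(summary)
```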

In [None]:
# Lengths of documents in character all together
lens = [len(ii) for ii in data.Text]
print('minimum text length (in char) is:',np.min(lens))
print('maximum text length (in char) is:',np.max(lens))
print('mean and median text lengths (in char) are:',np.mean(lens),np.median(lens))
plt.figure(figsize=(20,10))
plt.hist(lens,bins=80)
plt.xlabel('Text Length (in char)')
plt.ylabel('# of Documents')
plt.title('Lengths of documents in character all together')
plt.show()

# Lengths of documents in character separately
lens = {tag:[len(ii) for ii in data.loc[data['Tag']==tag].Text] for tag in tags}
plt.figure(figsize=(20,10))
for tag in tags:
    plt.hist(lens[tag],bins=80,alpha=.5,label=tag)
    plt.xlabel('Text Length (in char)')
    plt.ylabel('# of Documents')
    plt.title('Lengths of documents in character separately')
plt.legend()
plt.show()

# Lengths of documents in token all together
lens = [len(ii) for ii in data.Tokens]
print('minimum text length (in token) is:',np.min(lens))
print('maximum text length (in token) is:',np.max(lens))
print('mean and median text lengths (in token) are:',np.mean(lens),np.median(lens))
plt.figure(figsize=(20,10))
plt.hist(lens,bins=80)
plt.xlabel('Text Length (in token)')
plt.ylabel('# of Documents')
plt.title('Lengths of documents in token all together')
plt.show()

# Lengths of documents in token separately
lens = {tag:[len(ii) for ii in data.loc[data['Tag']==tag].Tokens] for tag in tags}
plt.figure(figsize=(20,10))
for tag in tags:
    plt.hist(lens[tag],bins=80,alpha=.5,label=tag)
    plt.xlabel('Text Length (in token)')
    plt.ylabel('# of Documents')
    plt.title('Lengths of documents in token separately')
plt.legend()
plt.show()

In [None]:
def len_of_tag_char(tag):
    return [len(ii) for ii in data[data.Tag == tag]['Text']]
def len_of_tag_tok(tag):
    return [len(ii) for ii in data[data.Tag == tag]['Tokens']]

lengths = [len_of_tag_char(tag) for tag in tags]
plt.figure(figsize=(15,5))
plt.boxplot(lengths)
plt.ylabel('Text Length (character)')
plt.xticks([ii for ii in range(1,len(tags)+1)], tags)
plt.show()


lengths = [len_of_tag_tok(tag) for tag in tags]
plt.figure(figsize=(15,5))
plt.boxplot(lengths)
plt.ylabel('Text Length (token)')
plt.xticks([ii for ii in range(1,len(tags)+1)], tags)
plt.show()


Let's continue with finding StopWords and important words. StopWords are non-informative words inside the corpus.

In [None]:
c_total = Counter()
for ind in data.index:
    c_total.update(Counter(data.loc[ind].Tokens))

In this container, we saved the number of occurrences of **each word** in **all documents**. Let's have a look at it:

In [None]:
print(c_total.most_common(50))

Now we check the number of **different documents** in which each word appeared **(Document Frequency)**, i.e. if a word appeared several times within a document, we count it only once.

This reveals a part of our **corpus-based StopWord** list:
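
The difference between the two counts is easy to see on a tiny invented corpus: updating a `Counter` with the token list gives total occurrences, while updating it with the *set* of tokens gives the document frequency.

```python
from collections import Counter

# Three invented, already tokenized documents.
docs = [
    ["the", "ocean", "the", "wave"],
    ["the", "proof", "of", "the", "theorem"],
    ["ocean", "chemistry"],
]

term_freq = Counter()  # total occurrences over the corpus
doc_freq = Counter()   # number of documents containing the word

for tokens in docs:
    term_freq.update(tokens)
    doc_freq.update(set(tokens))  # a set counts each word once per document

print("the:  ", term_freq["the"], "occurrences in", doc_freq["the"], "documents")
print("ocean:", term_freq["ocean"], "occurrences in", doc_freq["ocean"], "documents")
```

Words whose document frequency is close to the number of documents are the natural stop-word candidates.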

In [None]:
c_unique = Counter()
for ind in data.index:
    c_unique.update(Counter(set(data.loc[ind].Tokens)))

Let's have a look at them. Are they really StopWords??!!!

In [None]:
print('First 15 common words:\n')
for word in c_unique.most_common(15):
    print(word[0],'-->', 'appeared in',word[1],'documents out of',len(data),'documents i.e.',np.round(100*word[1]/len(data),2),'%')

and what about the total number of words?

In [None]:
print('There are',len(c_unique),'unique words i.e. the complete size of our vocab.\
 This is the intrinsic dimension of any BoW representation.')

print(np.sum(list(c_total.values())), 'words in total (with repetitions) i.e. sum of BoW matrix elements.')

Let's check how specific each word is to each class by counting its document frequency per tag:

In [None]:
tag_dict = {tag:Counter() for tag in data.Tag.unique()}
for ind in data.index:
    tag = data.loc[ind]['Tag']
    tag_dict[tag].update(Counter(set(data.loc[ind].Tokens)))

Now the fun starts with a basic example. Let's see how words are assigned to tags and what we can infer from this.

In [None]:
tag_specific_words = pd.DataFrame(columns=['OCE_Words','OCE_%','CHE_Words','CHE_%','DMS_Words','DMS_%'])
for tag in ['OCE','CHE', 'DMS']:
    len_tag = len(data[data['Tag']==tag])
    words = []
    percent = []
    for word in tag_dict[tag].most_common(15):
        words.append(word[0])
        percent.append(np.round(100*word[1]/len_tag,2))
    tag_specific_words[tag+'_Words'] = words
    tag_specific_words[tag+'_%'] = percent
display(tag_specific_words)

Now I repeat the same, but this time after removing some of the very frequent StopWords. This takes us from the least specific word-tag relations to the most specific ones. Recall the concept of underfitting versus overfitting. (Please note that we have not found all StopWords yet.)

So let's go this way: first we find the most frequent words of each tag while ignoring the 50 most common words we found. Then we do the same ignoring the 500 most common words, then 5000, and look at the results.

In [None]:
for n_stop_words in [50,500,5000]:
    tag_specific_words = pd.DataFrame(columns=['OCE_Words','OCE_%','CHE_Words','CHE_%','DMS_Words','DMS_%'])
    print('Ignoring first',n_stop_words,'stop-words ##############')
    StopWords = {ii[0] for ii in c_unique.most_common(n_stop_words)}  # a set makes the membership test below fast
    for tag in ['OCE','CHE', 'DMS']:
        len_tag = len(data[data['Tag']==tag])
        jj = 0
        words = []
        percent = []
        for word in tag_dict[tag].most_common(2*n_stop_words):
            if word[0] not in StopWords:
                words.append(word[0])
                percent.append(np.round(100*word[1]/len_tag,2))
                jj += 1
            if jj == 20:
                break
        tag_specific_words[tag+'_Words'] = words
        tag_specific_words[tag+'_%'] = percent
    display(tag_specific_words)
    print('\n\n')

Interesting ...! We are clearly seeing what *Feature* means in terms of *Text Data*.

Let's go on by doing the same, but this time calculating an information-theoretic score for each word to decide whether to select it as a feature (feature importance, ranking, selection, and so on). Later we can compare the performance of our models on these words with their performance on other feature sets such as BoW variations.

The score is calculated based on the intuitive idea below:

### How class-informative is a word? i.e. how confidently can you predict the class of a text if you see this word in it?

To do this we calculate an information-theory-inspired score as follows:

$$\large S(w_{i}) = \frac{1}{N_{c}-1}\times\frac{N_{c}-N_{w_{i},c}}{N_{w_{i},c}} $$

where $N_{w_{i},c}$ is the number of classes in which the word $w_{i}$ appeared, and $N_{c}$ is the total number of classes. The score is bounded between $0$ (the word appears in every class) and $1$ (the word appears in exactly one class). (I confess that I complicated a simple concept too much, but at least it became beautiful :P )
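
A minimal sketch of this score in code, assuming a mapping from tag to a `Counter` of per-tag document frequencies shaped like `tag_dict` above. The function name `class_informativeness` and the toy counts are made up for illustration.

```python
from collections import Counter

def class_informativeness(tag_counts):
    """Return S(w) = (N_c - N_wc) / (N_wc * (N_c - 1)) for every word.

    tag_counts maps each tag to a Counter of words seen in that tag.
    Requires at least two classes, otherwise the score is undefined.
    """
    n_classes = len(tag_counts)
    classes_per_word = Counter()
    for counts in tag_counts.values():
        # Count each word once per class in which it appears.
        classes_per_word.update(counts.keys())
    return {
        word: (n_classes - n_wc) / (n_wc * (n_classes - 1))
        for word, n_wc in classes_per_word.items()
    }

# Invented per-tag document frequencies.
toy = {
    "OCE": Counter({"the": 3, "ocean": 2, "model": 1}),
    "CHE": Counter({"the": 2, "molecule": 2, "model": 1}),
    "DMS": Counter({"the": 4, "theorem": 1}),
}

scores = class_informativeness(toy)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(word, score)
```

With three classes, a word seen in one class scores $1$, a word seen in two classes scores $0.25$, and a word seen in all three scores $0$.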

## Advantage
* Captures the most distinguishing words.
* Does not need normalization by class sizes, as it looks at word appearance in a binary way.

## Disadvantage
* Pretty naive idea! Of course the total number of appearances in each class is more informative than *whether the word $w$ ever appeared in class $c$ or not*.
 * Just consider the case where a word appears only once in each of $N_{c}-1$ classes and $10k$ times in the remaining class.
* Danger zone! The features captured here are so specific that they may not occur in an unseen document at all (overfitting problem).
 * **Solution**: randomly add not-so-special words to this dictionary to improve generalization (doesn't it look like regularization?!).
* It **CANNOT** consider all classes, i.e. the score **DOES NOT** tell to which class a word belongs!
 * A slight modification of the formulation can solve this problem. In this case we calculate the top $n$ best words for each class and concatenate them to construct the feature vector **(it becomes a supervised version of what we know as TF-IDF :) )**
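
The first disadvantage can be checked with a few lines of arithmetic. The counts below are invented to match the extreme case described in the list. The per-class share computed at the end is only one hypothetical count-aware alternative for illustration; it is not the modification the text refers to.

```python
# Invented occurrence counts of a single word in each class:
# once in two classes and 10k times in the third.
counts = {"OCE": 10_000, "CHE": 1, "DMS": 1}

n_classes = len(counts)
n_wc = sum(1 for c in counts.values() if c > 0)  # classes containing the word

# Binary score from the formula above: the word appears in every class.
binary_score = (n_classes - n_wc) / (n_wc * (n_classes - 1))
print("binary score:", binary_score)

# Hypothetical count-aware view: share of the word's occurrences per class.
total = sum(counts.values())
class_share = {tag: c / total for tag, c in counts.items()}
best_tag = max(class_share, key=class_share.get)
print("most likely class:", best_tag, "with share", round(class_share[best_tag], 4))
```

The binary score is $0$ even though almost all occurrences fall in one class. A share based on raw counts does tell which class the word points to, but unlike the binary score it would need normalization by class size before classes of different sizes can be compared.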