# Predicting Computer Science Domain From Abstracts and Titles of Research Articles
#### Using Machine Learning and Custom Keywords

> **Important:** The original dataset does not contain any labels. Conference titles have been used to label the dataset, and that labeled version is used here to measure prediction accuracy. The code cells that rely on the labels can be omitted when running on an unlabeled dataset.
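
As a rough illustration of the labeling step, the sketch below maps a venue name to a category. The `VENUE_TO_CATEGORY` table and the `label_from_venue` helper are hypothetical and are not part of this notebook. Only the AAAI entry matches rows previewed in this notebook; the WWW entry is an assumption made for the example.

```python
# Hypothetical venue-to-category table; the real mapping is not shown in this notebook.
VENUE_TO_CATEGORY = {
    "AAAI": "Artificial intelligence",         # matches the rows previewed in this notebook
    "WWW": "The Web & information retrieval",  # assumed for illustration only
}

def label_from_venue(venue_name, default=None):
    """Return the category for a venue, or `default` when the venue is unknown."""
    return VENUE_TO_CATEGORY.get(venue_name.strip().upper(), default)

print(label_from_venue("AAAI"))
print(label_from_venue("some unknown venue"))
```

Returning a default for unknown venues keeps unlabeled rows visible instead of silently assigning them a category.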




# Setting Up Dependencies

In [214]:
# Mounting google drive location for correct directory path
# Might need google account authorization code
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [215]:
# Necessary imports
import numpy as np
import pandas as pd
import tensorflow as tf
import nltk
import re

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from scipy.sparse import coo_matrix

# Necessary downloads
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Loading CSV Resource Files

In [216]:
# Loading the provided dataset, added column with labels from conference name  
dataset = pd.read_csv('/content/drive/MyDrive/Research/BCS2 Lab/Final/LabeledDataSetAllPapers.csv')

In [217]:
# Loading the custom keyword set
raw_keywords = pd.read_csv('/content/drive/MyDrive/Research/BCS2 Lab/Final/keywords.csv')

In [218]:
print("Dimensions of the input dataset: " + str(dataset.shape))
dataset.head()

Dimensions of the input dataset: (195830, 7)


Unnamed: 0,EID,venue_name,Year,Title,Abstract,Conference Name,Category
0,2-s2.0-36348977467,AAAI,2007,Photometric and geometric restoration of docum...,The popularity of current hand-held digital im...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence
1,2-s2.0-36349036042,AAAI,2007,"ESP: A logic of only-knowing, noisy sensing an...",When reasoning about actions and sensors in re...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence
2,2-s2.0-36349024731,AAAI,2007,Impromptu teams of heterogeneous mobile robots,As robots become more involved in assisting us...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence
3,2-s2.0-36349006415,AAAI,2007,The Virtual Solar-terrestrial Observatory: A d...,The Virtual Solar-Terrestrial Observatory is a...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence
4,2-s2.0-36349000637,AAAI,2007,Reasoning about attribute authenticity in a we...,The reliable authentication of user attributes...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence


In [219]:
print("Dimensions of the custom keyword set: " + str(raw_keywords.shape))
raw_keywords.head()

Dimensions of the custom keyword set: (1339, 3)


Unnamed: 0,Broader Category,Subcategory,Keywords
0,AI,Artificial intelligence,Autonomous
1,AI,Artificial intelligence,Algorithm
2,AI,Artificial intelligence,Object Detection Models
3,AI,Artificial intelligence,Neural Network
4,AI,Artificial intelligence,Monte Carlo Tree Search


In [220]:
# initializing lemmatizer and stemmer 
lem = WordNetLemmatizer()
stem = PorterStemmer()

In [221]:
# extracting the hand labeled keywords
category_keyword = {}
subcat = []
keywords = []
for row_idx in range(raw_keywords.shape[0]):
  row = raw_keywords.iloc[row_idx, :].tolist()
  if row[1] not in subcat:
    # a new subcategory starts: store the keywords collected for the previous one
    subcat.append(row[1])
    if len(keywords) > 0:
      category_keyword.update({subcat[-2]: list(set(keywords))})
      keywords = []
  value = row[2].split()  # splitting by space

  for word in value:
    values = re.split('-|,|;', word)  # splitting further on hyphens, commas and semicolons
    for j in values:
      j = j.lower()
      j = lem.lemmatize(j)
      # j = stem.stem(j)
      # j = re.sub("</?.*?>", " <> ", j)  # tag removal, disabled: it never affected the stored keyword
      # j = re.sub("(\\d|\\W)+", " ", j)
      keywords.append(j)

category_keyword.update({subcat[-1]: list(set(keywords))})  # extracting the unique ones
print('Custom hand labeled keywords by category: \n')
for i in category_keyword.keys():
  print(str(i) + ": " + str(len(category_keyword[i])))
  print(str(i) + ": " + str(category_keyword[i]))
  # print(str(category_keyword[i]))

Custom hand labeled keywords by category: 

Artificial intelligence: 69
Artificial intelligence: ['object', 'programming', 'compression', 'translation', 'dynamic', 'information', 'query', 'scheduling', 'multiagent', 'neural', 'linear', 'tree', 'reasoning', 'augmentation', 'syntactic', 'semantic', 'filtering', 'environment', 'carlo', 'database', 'planning', 'resolution', 'mereological', 'stochastic', 'modeling', 'hybrid', 'learning', 'application', 'network', 'logic', 'temporal', 'robotic', 'relation', 'search', 'predictive', 'matching', 'prototyping', 'data', 'auction', 'algorithm', 'image', 'lexico', 'analysis', 'machine', 'deep', 'heuristic', 'monte', 'detection', 'rapid', 'representation', 'structure', 'system', 'permutation', 'pattern', 'pathfinding', 'space', 'design', 'autonomous', 'processing', 'exploration', 'simulation', 'and', 'model', 'summation', 'statistical', 'entity', 'calculus', 'combinatorial', 'web']
Computer Vision: 68
Computer Vision: ['object', 'kinematic', 'resona

In [222]:
print(category_keyword.keys())
print(category_keyword.values())

dict_keys(['Artificial intelligence', 'Computer Vision', 'Machine learning & data mining', 'Natural language processing', 'The Web & information retrieval', 'Computer Architecture', 'Compuer Networks', 'Computer Security', 'Databases', 'Design automation', 'Embedded & real-time systems', 'High-performance computing', 'Mobile Computing', 'Measurement & perf. analysis', 'Operating Systems', 'Programming languages', 'Software engineering', 'Algorithms & complexity', 'Cryptography', 'Logic & Verification', 'Visualization', 'Robotics', 'Economics and Computation', 'Human Computer Interaction', 'Computational Biology and Bioinformatics', 'Computer Graphics'])
dict_values([['object', 'programming', 'compression', 'translation', 'dynamic', 'information', 'query', 'scheduling', 'multiagent', 'neural', 'linear', 'tree', 'reasoning', 'augmentation', 'syntactic', 'semantic', 'filtering', 'environment', 'carlo', 'database', 'planning', 'resolution', 'mereological', 'stochastic', 'modeling', 'hybrid

In [223]:
count = 0
for i in (category_keyword.values()):
  for j in i:
    # print(j)
    count = count + 1
print("Total keywords: " + str(count))

Total keywords: 2062
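
The splitting applied to each keyword phrase can be illustrated without NLTK. This is a simplified sketch: `split_keyword_phrase` is a hypothetical helper, and lemmatization is deliberately left out so that the example has no external dependencies.

```python
import re

def split_keyword_phrase(phrase):
    """Split a phrase on whitespace, then on '-', ',' and ';', and lowercase each piece."""
    pieces = []
    for word in phrase.split():
        for part in re.split('-|,|;', word):
            pieces.append(part.lower())
    return pieces

print(split_keyword_phrase("Monte Carlo Tree Search"))
print(split_keyword_phrase("Lexico-Syntactic Pattern"))
print(split_keyword_phrase("Search, Planning"))
```

Note that a trailing separator (as in `"Search,"`) leaves an empty string in the result, so filtering out empty pieces before counting keywords may be worthwhile.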


# Inspecting The Columns (Optional)
This dataset contains some entries that do not have both a title and an abstract. These entries are inspected here and dropped in the next section. The final input dataset might not have this problem, in which case the notebook cells under this optional section can be commented out. Otherwise, this section needs to be re-inspected and rewritten by the checker.

In [224]:
print("Glimpse of Titles: ")
all_titles = dataset['Title']
print(all_titles.head())
print("\nGlimpse of Abstracts: ")
all_abstracts = dataset['Abstract']
print(all_abstracts.head())

Glimpse of Titles: 
0    Photometric and geometric restoration of docum...
1    ESP: A logic of only-knowing, noisy sensing an...
2       Impromptu teams of heterogeneous mobile robots
3    The Virtual Solar-terrestrial Observatory: A d...
4    Reasoning about attribute authenticity in a we...
Name: Title, dtype: object

Glimpse of Abstracts: 
0    The popularity of current hand-held digital im...
1    When reasoning about actions and sensors in re...
2    As robots become more involved in assisting us...
3    The Virtual Solar-Terrestrial Observatory is a...
4    The reliable authentication of user attributes...
Name: Abstract, dtype: object


In [225]:
all_abstracts = dataset['Abstract']
valid_abstracts = []
invalid_index = []
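for idx in range(len(all_abstracts)):
    # pd.isna is a value-based missing check; it avoids the identity comparison with NaN
    # and the np.NaN alias, which newer NumPy versions no longer provide
    if not pd.isna(all_abstracts[idx]):
        valid_abstracts.append(all_abstracts[idx])
    else:
        invalid_index.append(idx)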
# valid_abstracts = np.array(valid_abstracts)
print("Shape of given abstracts: " + str(all_abstracts.shape))
# print("Shape of valid abstracts: " + str(valid_abstracts.shape))
print("Number of invalid abstracts: " + str(len(invalid_index)))
print("Invalid index range: " + "From " + str(invalid_index[0]) + " To " + str(invalid_index[-1])
      + ", Total: " + str(invalid_index[-1] - invalid_index[0] + 1) + "\n")
# all_titles

Shape of given abstracts: (195830,)
Number of invalid abstracts: 2988
Invalid index range: From 192842 To 195829, Total: 2988
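
The printed range spans exactly as many positions as there are invalid abstracts (2988), which means the missing rows form one unbroken block at the end of the dataset. That is what makes it safe to drop them by slicing on an index range. A minimal standard-library sketch of this check, using hypothetical helpers `missing_indices` and `is_contiguous` on made-up data:

```python
import math

def missing_indices(values):
    """Return the positions whose value is None or a float NaN."""
    return [i for i, v in enumerate(values)
            if v is None or (isinstance(v, float) and math.isnan(v))]

def is_contiguous(indices):
    """True when the sorted, unique indices form one unbroken range."""
    return bool(indices) and indices[-1] - indices[0] + 1 == len(indices)

abstracts = ["first abstract", "second abstract", float("nan"), float("nan")]
bad = missing_indices(abstracts)
print(bad, is_contiguous(bad))
```

If the check returned `False`, the missing rows would be scattered and a row-wise filter would be needed instead of a slice.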



# Statistical Exploration

In [226]:
# Fetching word count in the abstracts
dataset = dataset.loc[0:192841].copy()  # rows 0..192841 are the valid ones; copy() avoids a SettingWithCopyWarning below
dataset['word_count'] = dataset['Abstract'].apply(lambda x: len(str(x).split(" ")))
print("Words Per Abstract:")
dataset[['Abstract', 'word_count']]

Words Per Abstract:


Unnamed: 0,Abstract,word_count
0,The popularity of current hand-held digital im...,182
1,When reasoning about actions and sensors in re...,118
2,As robots become more involved in assisting us...,134
3,The Virtual Solar-Terrestrial Observatory is a...,124
4,The reliable authentication of user attributes...,132
...,...,...
192837,Digital Illness can be conceptualized as a sta...,148
192838,This paper elaborates on a process to streamli...,139
192839,Social networks play an important role in the ...,138
192840,The current COVID-19 experience seems to expos...,135


In [227]:
print('Descriptive Statistics: \n')
dataset.word_count.describe()

Descriptive Statistics: 



count    192842.00000
mean        153.06259
std          56.48467
min           3.00000
25%         117.00000
50%         150.00000
75%         185.00000
max        1635.00000
Name: word_count, dtype: float64
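
One caveat on the counts above: splitting on a single space with `split(" ")` produces an empty string for every extra space, whereas `split()` collapses runs of whitespace. A small sketch of the difference on a made-up string:

```python
text = "Deep  learning for vision"  # note the double space after "Deep"

by_single_space = text.split(" ")  # ['Deep', '', 'learning', 'for', 'vision']
by_whitespace = text.split()       # ['Deep', 'learning', 'for', 'vision']

print(len(by_single_space), len(by_whitespace))
```

For abstracts with irregular spacing, the single-space variant can therefore report slightly more words than are actually present.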

In [228]:
dataset = dataset.drop('word_count', axis = 1) # dropping the intermediate word_count column, after inspection done

In [229]:
# print('Most common entries: \n') 
# most_freq = pd.Series(''.
#                 join(dataset['Abstract']).split()).value_counts()
# most_freq[:1000]

In [230]:
# print('Least common entries: \n') 
# least_freq = pd.Series(''.
#                 join(dataset['Abstract']).split()).value_counts()
# least_freq[-100:]

# Data Preprocessing

### Handling Stop Words

In [231]:
# stop words
stop_words = set(stopwords.words("english"))
print('Stop words from library: ' + str(len(stop_words)))

Stop words from library: 179


In [232]:
# custom stop words
custom_stop_size = 300
most_freq = pd.Series(''.
                join(dataset['Abstract']).split()).value_counts().index.tolist()[:custom_stop_size]
# most_freq
new_words = []
for i in most_freq:
    if i not in stop_words:
        new_words.append(i)
print('Custom stop words: ' + str(len(new_words)))

Custom stop words: 230


In [233]:
stop_words = stop_words.union(new_words)
print('Total stop words: ' + str(len(stop_words)))

Total stop words: 409
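
The frequency-based stop word idea can be sketched with the standard library alone. `frequent_non_stop_words` is a hypothetical helper and the tiny document list is made up. Unlike the cell above, which joins all abstracts with an empty separator, this sketch counts tokens per document, so the last word of one abstract is never glued to the first word of the next.

```python
from collections import Counter

def frequent_non_stop_words(documents, base_stop_words, top_k):
    """Return the top_k most frequent tokens that are not already stop words."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    most_frequent = [token for token, _ in counts.most_common(top_k)]
    return [token for token in most_frequent if token not in base_stop_words]

docs = ["we propose a model", "we evaluate a model", "a new model"]
print(frequent_non_stop_words(docs, base_stop_words={"a", "we"}, top_k=3))
```

As in the notebook, the number of words actually added is smaller than `top_k`, because some of the most frequent tokens are already in the base stop word list.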


### Handling Punctuations, Upper Cases, Special Characters

In [234]:
# using regex, splitting, lemmatizer and stemmer to clean data
corpus = []
for i in range(dataset['Abstract'].shape[0]):  
    # removing punctuations
    text = re.sub('[^a-zA-Z]', ' ', dataset['Abstract'][i])
    text2 = re.sub('[^a-zA-Z]', ' ', dataset['Title'][i])
    # lowercasing
    text = text.lower()
    text2 = text2.lower()
    # removing tags (<>)
    text = re.sub("</?.*?>", " <> ", text)
    text2 = re.sub("</?.*?>", " <> ", text2)
    # removing special characters and digits
    text = re.sub("(\\d|\\W)+", " ", text)
    text2 = re.sub("(\\d|\\W)+", " ", text2)
    # string to list of words
    text = text.split()
    text2 = text2.split()
    
    # stemming
    # text = [stem.stem(word) for word in text if not word in stop_words]
    # text = " ".join(text)
    
    # lemmatizing (stemming is left disabled below)
    text = [lem.lemmatize(word) for word in text if word not in stop_words]
    text2 = [lem.lemmatize(word) for word in text2 if word not in stop_words]
    # text = [stem.stem(word) for word in text]
    # text2 = [stem.stem(word) for word in text2]
    # print(text)
    # print(text2)
    text = text + text2
    text = " ".join(text)
    corpus.append(text)
print('Total number of documents in corpus: ' + str(len(corpus)))

Total number of documents in corpus: 192842
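
The effect of the cleaning loop above on a single string can be seen in isolation. This is a standard-library sketch: `clean_text` is a hypothetical helper, the sample string and stop word set are made up, and lemmatization is skipped to avoid the NLTK dependency.

```python
import re

def clean_text(text, stop_words):
    """Keep letters only, lowercase, split on whitespace, and drop stop words."""
    text = re.sub('[^a-zA-Z]', ' ', text)  # digits and punctuation become spaces
    tokens = text.lower().split()
    return [t for t in tokens if t not in stop_words]

sample = "Deep-learning models, 2nd edition: a survey!"
print(clean_text(sample, stop_words={"a"}))
```

Replacing digits with spaces leaves fragments such as `nd` (from `2nd`) in the token list, which is one reason a frequency-based stop word list or a minimum token length can help.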


# Tf-idf Algorithm

> Computes the Term Frequency and Inverse Document Frequency (tf-idf) weights over the whole corpus, so that the most informative terms of each entry can be extracted as features.

> Parameters: `max_df` is the document-frequency threshold above which a term is ignored as too common, and `max_features` is the maximum number of terms kept in the vocabulary.
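
To make the weighting concrete, here is a from-scratch sketch of the arithmetic, assuming scikit-learn's smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row (the `TfidfTransformer` defaults with `smooth_idf=True`). The helper names and the numbers are made up for illustration.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight raw counts by smoothed idf, then scale the row to unit L2 norm."""
    weighted = {t: c * smooth_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {t: w / norm for t, w in weighted.items()}

# Made-up numbers: 4 documents; "model" appears in all of them, "stroke" in only one.
row = tfidf_row({"model": 1, "stroke": 1}, {"model": 4, "stroke": 1}, n_docs=4)
print(row)
```

With equal raw counts, the rarer term (`stroke`) ends up with the larger weight, which is why tf-idf surfaces document-specific terms rather than corpus-wide common ones.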





In [235]:
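# running a count vectorizer method
# (stop_words is passed as a list, which recent scikit-learn versions require)
cv = CountVectorizer(max_df = 0.7, stop_words=list(stop_words),
                    max_features = 25000)
X = cv.fit_transform(corpus)
# list(cv.vocabulary_.keys())[:100]

tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(X)

# getting features name
# (get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use cv.get_feature_names())
feature_names = cv.get_feature_names_out()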



# Predicting Probable Category (A Single Row)

In [236]:
def predict_category(token):
  # despite its name, this helper only returns the tf-idf vector of one document;
  # the category vote happens in a later cell
  doc = corpus[token]  # fetch document for which keywords need to be extracted

  # generate tf-idf for the given document
  vector = tfidf_transformer.transform(cv.transform([doc]))
  return vector

def sort_coo(coo):
  # pair each column index with its tf-idf score, highest score first
  tuples = zip(coo.col, coo.data)
  return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn):
  # keep the topn (index, score) pairs and map feature name -> rounded score
  sorted_items = sorted_items[:topn]
  results = {}
  for index, score in sorted_items:
    results[feature_names[index]] = round(score, 5)
  return results

In [237]:
token = 5

tf_idf_vector = predict_category(token)

sorted_items = sort_coo(tf_idf_vector.tocoo())

probable_keywords = extract_topn_from_vector(feature_names, sorted_items, 500)

print("\nTitle + Abstract:\n")
print(dataset['Title'][token])
print(dataset['Abstract'][token])
print("\nProbable Keywords:\n")
for k in probable_keywords:
  print(k, probable_keywords[k])


Title + Abstract:

Posterior probability profiles for the automated assessment of the recovery of stroke patients
Assessing recovery from stroke has been so far a time consuming procedure in which highly trained clinicians are required. This paper proposes a mechatronic platform which measures low forces and torques exerted by subjects. Class posterior probabilities are used as a quantitative and statistically sound tool to assess motor recovery from these force and torque measurements. The performance of the patients is expressed in terms of the posterior probability to belong to the class of normal subjects. The mechatronic platform together with the class posterior probabilities enables to automate motor recovery assessment without the need for highly trained clinicians. It is shown that the class posterior probability profiles are highly correlated, r ≈ 0.8, with the well-established Fugl-Meyer scale assessment in motor recovery. These results have been obtained through careful 

In [238]:
output_map = {}
for i in category_keyword.keys():
  output_map.update({i:0})
# print(output_map)

keyword_list = list(category_keyword.values())
category_list = list(category_keyword.keys())

# print(keyword_list)
# print(category_list)

for k in probable_keywords.keys():
  for v in keyword_list:
    if k in v:
      subcat = category_list[keyword_list.index(v)]
      output_map[subcat] = int(output_map[subcat]) + 1
print(output_map)

occurence = list(output_map.values())
output = list(output_map.keys())
print(occurence)
print(output)
print(output[occurence.index(max(output_map.values()))])

{'Artificial intelligence': 0, 'Computer Vision': 0, 'Machine learning & data mining': 2, 'Natural language processing': 1, 'The Web & information retrieval': 1, 'Computer Architecture': 0, 'Compuer Networks': 0, 'Computer Security': 1, 'Databases': 0, 'Design automation': 0, 'Embedded & real-time systems': 0, 'High-performance computing': 2, 'Mobile Computing': 1, 'Measurement & perf. analysis': 2, 'Operating Systems': 2, 'Programming languages': 1, 'Software engineering': 2, 'Algorithms & complexity': 0, 'Cryptography': 2, 'Logic & Verification': 0, 'Visualization': 0, 'Robotics': 2, 'Economics and Computation': 0, 'Human Computer Interaction': 1, 'Computational Biology and Bioinformatics': 2, 'Computer Graphics': 1}
[0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 0, 2, 1, 2, 2, 1, 2, 0, 2, 0, 0, 2, 0, 1, 2, 1]
['Artificial intelligence', 'Computer Vision', 'Machine learning & data mining', 'Natural language processing', 'The Web & information retrieval', 'Computer Architecture', 'Compuer Networks', 
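
The voting step above can be written more compactly. The following is a sketch of the same idea rather than the notebook's exact code: `vote_category` is a hypothetical helper, and the two-category keyword table is made up.

```python
def vote_category(probable_keywords, category_keyword):
    """Count, per category, how many extracted keywords appear in its keyword list."""
    votes = {category: 0 for category in category_keyword}
    for keyword in probable_keywords:
        for category, words in category_keyword.items():
            if keyword in words:
                votes[category] += 1
    # max() returns the first category with the highest count, so ties go to
    # whichever category was inserted first
    best = max(votes, key=votes.get)
    return best, votes

toy_categories = {"Robotics": ["robot", "motor"], "Databases": ["query", "index"]}
toy_keywords = {"motor": 0.4, "robot": 0.3, "query": 0.1}
print(vote_category(toy_keywords, toy_categories))
```

Because ties are broken by insertion order, a document whose top count is shared by several categories (as in the output above, where many categories score 2) is always assigned to the earliest of them. Weighting each vote by the tf-idf score is one possible way to reduce such ties.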

# Predicting For Whole Dataset and Measuring Accuracy

In [239]:
count = 0
result = []
for token in range((dataset.shape[0])):    
    tf_idf_vector = predict_category(token)
    sorted_items = sort_coo(tf_idf_vector.tocoo())
    probable_keywords = extract_topn_from_vector(feature_names, sorted_items, 500)
    
    output_map = {}
    for i in category_keyword.keys():
      output_map.update({i:0})
    keyword_list = list(category_keyword.values())
    category_list = list(category_keyword.keys())
    for k in probable_keywords.keys():
      for v in keyword_list:
        if k in v:
          subcat = category_list[keyword_list.index(v)]
          output_map[subcat] = int(output_map[subcat]) + 1
    occurence = list(output_map.values())
    output = list(output_map.keys())
    # print(output[occurence.index(max(output_map.values()))])
    result.append(output[occurence.index(max(output_map.values()))])
    if result[token] == dataset['Category'][token]:
        count = count + 1

**Parameters and Hyperparameter List For Below Accuracy:**
1. Custom Stop Words: 300 (230 Effective)
2. Lemmatizing: On
3. Stemming: Off
4. CountVectorizer max_df: 0.7
5. CountVectorizer max_features: 25000 (Top 500 Chosen)
6. Total Custom Keywords After Processing: 2062 




In [241]:
print("Accuracy: " + str(round(((count/dataset.shape[0])*100),3)) + "%")

Accuracy: 25.24%
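
To put an accuracy figure in context, it can help to compare it with a trivial baseline such as always predicting the most frequent label. The sketch below uses made-up labels; `accuracy` and `majority_baseline` are hypothetical helpers and were not run on this dataset.

```python
from collections import Counter

def accuracy(predictions, labels):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

labels = ["AI", "AI", "Vision", "Robotics"]
preds = ["AI", "Vision", "Vision", "AI"]
print(accuracy(preds, labels), majority_baseline(labels))
```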


# Writing The Output To New CSV File 

In [243]:
filename = "/content/drive/MyDrive/Research/BCS2 Lab/Final/output.csv"
# adding newest column from list
dataset = dataset.drop('Category', axis = 1)
dataset['Prediction'] = result
dataset

Unnamed: 0,EID,venue_name,Year,Title,Abstract,Conference Name,Prediction
0,2-s2.0-36348977467,AAAI,2007,Photometric and geometric restoration of docum...,The popularity of current hand-held digital im...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Mobile Computing
1,2-s2.0-36349036042,AAAI,2007,"ESP: A logic of only-knowing, noisy sensing an...",When reasoning about actions and sensors in re...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Programming languages
2,2-s2.0-36349024731,AAAI,2007,Impromptu teams of heterogeneous mobile robots,As robots become more involved in assisting us...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Computer Architecture
3,2-s2.0-36349006415,AAAI,2007,The Virtual Solar-terrestrial Observatory: A d...,The Virtual Solar-Terrestrial Observatory is a...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Human Computer Interaction
4,2-s2.0-36349000637,AAAI,2007,Reasoning about attribute authenticity in a we...,The reliable authentication of user attributes...,AAAI-07/IAAI-07 Proceedings: 22nd AAAI Confere...,Artificial intelligence
...,...,...,...,...,...,...,...
192837,2-s2.0-85099545860,WWW,2020,Digital health and illness: Balancing it,Digital Illness can be conceptualized as a sta...,19th Ibero-American International Conference o...,Design automation
192838,2-s2.0-85099544945,WWW,2020,Towards a concept for streamlining game design...,This paper elaborates on a process to streamli...,19th Ibero-American International Conference o...,Computer Security
192839,2-s2.0-85099543353,WWW,2020,The main role of video ads' structure on socia...,Social networks play an important role in the ...,19th Ibero-American International Conference o...,Natural language processing
192840,2-s2.0-85099536986,WWW,2020,A new perspective on the issue of privacy: Cov...,The current COVID-19 experience seems to expos...,19th Ibero-American International Conference o...,Mobile Computing


In [244]:
dataset.to_csv(filename)