# Exploring Natural Language Processing  
**Filename:** exploring_nlp.ipynb  
**Path:** TAMIDS/Code/Scholars@TAMU Data/exploring_nlp.ipynb  
**Created Date:** 05 April 2022, 01:53 

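An exploratory notebook for learning NLP: it scores Scholars@TAMU publication abstracts against a reference text using TF-IDF similarity (gensim for the model, jieba for tokenization).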

In [3]:
from IPython.display import Markdown, display, HTML
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from gensim import corpora, models, similarities
import jieba

pd.options.display.float_format = '{:,.3f}'.format
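# 'seaborn-darkgrid' was renamed 'seaborn-v0_8-darkgrid' in matplotlib 3.6; use whichever is available
plt.style.use('seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid')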

# General Markdown Formatting Functions

def printmd(string, level=1):
    header_level = '#'*level + ' '
    display(Markdown(header_level + string))

## Loading the Data

In [17]:
people = pd.read_pickle('../../Data/Scholars@TAMU/my_api_calls/people_df.pickle')
publications = pd.read_pickle('../../Data/Scholars@TAMU/my_api_calls/publications_df.pickle')

In [18]:
publications.sample(n=50)

publication_api_id,author_ids,author_uins,year,publication_type,publication_title,keyword,un_sustainable_development_goals,author_organization,author_city,author_country,abstract
n32475SE,[na2a37577],[702002692],2001.0,Journal Article,Phase-matching condition between acoustic and ...,"[Optics, Spectroscopy]",,"[Texas A and M University, IBM Almaden Researc...","[College Station, San Jose]","[United States, United States]",A study was performed on the phase-matching co...
n479824SE,[nfc1740f1],[930004302],2001.0,Conference,Fiber-Based Electro-Optic Field Imaging System,,,,,,
n416669SE,[n40dbead6],[823002420],2005.0,Journal Article,Vicksburg is the Key: The Struggle for the Mis...,,,,,,
n346165SE,[nfa1a6351],[302002850],2015.0,Journal Article,Effects of A. nodosum seaweed extracts on spin...,[Agriculture],,[Texas A and M University],[College Station],[United States],© 2014 . Seaweed extracts (SWE) are biodegrad...
n44068SE,[n279be03a],[601003904],2006.0,Journal Article,Entanglement conditions for two-mode states: A...,"[Optics, Physics]",,"[Hunter College, Texas A and M University]","[New York, College Station]","[United States, United States]",We examine the implications of several recentl...
n171776SE,,,,,,,,"[Oregon State University, Texas A and M Univer...","[Corvallis, College Station]","[United States, United States]",
n47847SE,[n67571474],[103000260],2009.0,Journal Article,Prenatal lead exposure enhances methamphetamin...,"[Behavioral Sciences, Neurosciences & Neurolog...",,[Texas A and M University],[College Station],[United States],Adult female rats were exposed to lead-free so...
n418013SE,[n2b854905],[723009901],2012.0,Journal Article,Superior activity of MnOx-CeO2/TiO2 catalyst f...,"[Chemistry, Engineering]",,[Huazhong University of Science and Technology],[Wuhan],[China],
n341979SE,[n1d2223c8],[302002950],2010.0,Journal Article,Study on retrograde extrapolation of blood alc...,,,"[Ministry of Justice, China]",[Shanghai],[China],Objective: To make the study on retrograde ext...
n468054SE,"[n6ba4cec9, n042bccf8]","[226005058, 917005862]",2020.0,Conference,Microbiomes of a corallivore (Hermodice carunc...,"[Zoology, Zoology]",,,,,


In [19]:
compsci = 'n230467SE'
phys = 'n127269SE'
systems = 'n165882SE'
agbuis = 'n188661SE'
histor = 'n51137SE'

## NLP

In [47]:
# Abstracts for the five sample publications; a missing abstract becomes an empty string
texts = [publications['abstract'][key] or '' for key in [compsci, phys, systems, agbuis, histor]]

keyword = r'Computer science is the study of computation, automation, and information.[1] Computer science spans theoretical disciplines (such as algorithms, theory of computation, and information theory) to practical disciplines (including the design and implementation of hardware and software).[2][3] Computer science is generally considered an area of academic research and distinct from computer programming.[4] Algorithms and data structures are central to computer science.[5] The theory of computation concerns abstract models of computation and general classes of problems that can be solved using them. The fields of cryptography and computer security involve studying the means for secure communication and for preventing security vulnerabilities. Computer graphics and computational geometry address the generation of images. Programming language theory considers approaches to the description of computational processes, and database theory concerns the management of repositories of data. Human–computer interaction investigates the interfaces through which humans and computers interact, and software engineering focuses on the design and principles behind developing software. Areas such as operating systems, networks and embedded systems investigate the principles and design behind complex systems. Computer architecture describes the construction of computer components and computer-operated equipment. Artificial intelligence and machine learning aim to synthesize goal-orientated processes such as problem-solving, decision-making, environmental adaptation, planning and learning found in humans and animals. Within artificial intelligence, computer vision aims to understand and process image and video data, while natural-language processing aims to understand and process textual and linguistic data.'

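# Tokenize each abstract
cut_texts = [jieba.lcut(text) for text in texts]

# Map every unique token to an integer id
dictionary = corpora.Dictionary(cut_texts)
feature_cnt = len(dictionary.token2id)

# Bag-of-words vector (token id, count) for each abstract
corpus = [dictionary.doc2bow(text) for text in cut_texts]

# Fit TF-IDF weights on the abstracts
tfidf = models.TfidfModel(corpus)

# Vectorize the reference text with the same dictionary; tokens not seen in the abstracts are ignored
kw_vector = dictionary.doc2bow(jieba.lcut(keyword))

# Cosine-similarity index over the TF-IDF vectors of the abstracts
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)

sim = index[tfidf[kw_vector]]

for i, score in enumerate(sim, start=1):
    print('keyword is similar to text%d: %.5f' % (i, score))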

keyword is similar to text1: 0.95150
keyword is similar to text2: 0.94754
keyword is similar to text3: 0.91311
keyword is similar to text4: 0.00000
keyword is similar to text5: 0.00000
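
To make the scores above easier to interpret, here is a minimal pure-Python sketch of the calculation this pipeline appears to perform: term counts weighted by a base-2 inverse document frequency, scaled to unit length, then compared by dot product (cosine similarity). It assumes gensim's default `TfidfModel` settings. The helper names `tfidf_vectors`, `weigh`, and `cosine` and the toy documents are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def weigh(tokens, idf):
    """Weight raw term counts by idf, then scale the vector to unit length."""
    vec = {tok: tf * idf[tok] for tok, tf in Counter(tokens).items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}

def tfidf_vectors(docs):
    """Return (idf table, unit-length tf-idf vectors) for tokenized documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log2(n_docs / count) for tok, count in df.items()}
    return idf, [weigh(doc, idf) for doc in docs]

def cosine(a, b):
    """Dot product of two unit-length sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())

# Toy corpus (assumption, not data from the notebook)
docs = [
    "graph algorithms for computer networks".split(),
    "laser optics and spectroscopy".split(),
    [],  # an empty document, standing in for a missing abstract
]
idf, vectors = tfidf_vectors(docs)

# Query tokens that never occur in the corpus would be dropped by weigh()
query = weigh("computer algorithms".split(), idf)
scores = [cosine(query, vec) for vec in vectors]
print(scores)  # scores[0] is 2 / sqrt(10), about 0.632; the other two are 0
```

The empty document scores 0 against any query, which is consistent with the 0.00000 scores for the two publications whose abstracts were replaced by empty strings.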


In [52]:
def run_gensim_similarities(text_dict: dict, keyword: str) -> dict:
    """
    Score each text against a keyword text using TF-IDF cosine similarity.

    text_dict: dict[pub_api_id: text] - bodies of text to compare against the keyword
    keyword: str - reference text that each body is compared against

    returns: dict[pub_api_id: similarity] - one score per key in text_dict
    """

    keys, texts = text_dict.keys(), text_dict.values()
    cut_texts = [jieba.lcut(text) for text in texts]

    # Build the vocabulary and the TF-IDF model from the texts only
    dictionary = corpora.Dictionary(cut_texts)
    feature_cnt = len(dictionary.token2id)
    corpus = [dictionary.doc2bow(text) for text in cut_texts]
    tfidf = models.TfidfModel(corpus)

    # Keyword tokens that do not occur in any text are ignored by doc2bow
    kw_vector = dictionary.doc2bow(jieba.lcut(keyword))
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
    sim = index[tfidf[kw_vector]]
    return dict(zip(keys, sim))
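
`run_gensim_similarities` tokenizes with `jieba.lcut`, a segmenter designed for Chinese text. On English input it may also return whitespace and punctuation as tokens (worth checking by printing `jieba.lcut(keyword)[:20]`), and such tokens could inflate the similarity scores. As a point of comparison, here is a sketch of a plain regex tokenizer for English; `simple_tokenize` and `TOKEN_RE` are hypothetical names and are not used elsewhere in this notebook.

```python
import re

# Lower-case alphanumeric words, optionally joined by hyphens or apostrophes
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")

def simple_tokenize(text):
    """Lower-case the text and return word tokens, skipping spaces and punctuation."""
    return TOKEN_RE.findall((text or "").lower())

tokens = simple_tokenize("Natural-language processing, and computer vision.[1]")
print(tokens)
# ['natural-language', 'processing', 'and', 'computer', 'vision', '1']
```

Swapping this in for `jieba.lcut` would change the dictionary and therefore the scores, so results from the two tokenizers are not directly comparable.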

In [54]:
text_dict = {key: publications['abstract'][key] or '' for key in [compsci, phys, systems, agbuis, histor]}

run_gensim_similarities(text_dict, keyword)

{'n230467SE': 0.95149904,
 'n127269SE': 0.94753855,
 'n165882SE': 0.91310966,
 'n188661SE': 0.0,
 'n51137SE': 0.0}
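
The returned dictionary follows the order of the input keys. A small helper can sort it from most to least similar and drop zero scores, which here correspond to the publications with empty abstracts. `rank_similarities` and its `min_score` parameter are hypothetical and not part of the notebook; the scores are copied from the output above.

```python
def rank_similarities(scores, min_score=0.0):
    """Return (key, score) pairs sorted by descending score, keeping scores above min_score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(key, score) for key, score in ranked if score > min_score]

# Scores copied from the run_gensim_similarities output
scores = {
    'n230467SE': 0.95149904,
    'n127269SE': 0.94753855,
    'n165882SE': 0.91310966,
    'n188661SE': 0.0,
    'n51137SE': 0.0,
}

ranked = rank_similarities(scores)
for key, score in ranked:
    print('%s: %.5f' % (key, score))
```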