In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [29]:
df = pd.read_csv('papers.csv')

In [30]:
df.head(5)

Unnamed: 0,id,year,title,event_type,pdf_name,abstract,paper_text
0,1,1987,Self-Organization of Associative Database and ...,,1-self-organization-of-associative-database-an...,Abstract Missing,767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABA...
1,10,1987,A Mean Field Theory of Layer IV of Visual Cort...,,10-a-mean-field-theory-of-layer-iv-of-visual-c...,Abstract Missing,683\n\nA MEAN FIELD THEORY OF LAYER IV OF VISU...
2,100,1988,Storing Covariance by the Associative Long-Ter...,,100-storing-covariance-by-the-associative-long...,Abstract Missing,394\n\nSTORING COVARIANCE BY THE ASSOCIATIVE\n...
3,1000,1994,Bayesian Query Construction for Neural Network...,,1000-bayesian-query-construction-for-neural-ne...,Abstract Missing,Bayesian Query Construction for Neural\nNetwor...
4,1001,1994,"Neural Network Ensembles, Cross Validation, an...",,1001-neural-network-ensembles-cross-validation...,Abstract Missing,"Neural Network Ensembles, Cross\nValidation, a..."


In [31]:
df.shape

(7241, 7)

In [32]:
df = df.iloc[:5000,:]

In [33]:
df.shape

(5000, 7)

# Preprocessing the data

In [34]:
df['paper_text'][0]

'767\n\nSELF-ORGANIZATION OF ASSOCIATIVE DATABASE\nAND ITS APPLICATIONS\nHisashi Suzuki and Suguru Arimoto\nOsaka University, Toyonaka, Osaka 560, Japan\nABSTRACT\nAn efficient method of self-organizing associative databases is proposed together with\napplications to robot eyesight systems. The proposed databases can associate any input\nwith some output. In the first half part of discussion, an algorithm of self-organization is\nproposed. From an aspect of hardware, it produces a new style of neural network. In the\nlatter half part, an applicability to handwritten letter recognition and that to an autonomous\nmobile robot system are demonstrated.\n\nINTRODUCTION\nLet a mapping f : X -+ Y be given. Here, X is a finite or infinite set, and Y is another\nfinite or infinite set. A learning machine observes any set of pairs (x, y) sampled randomly\nfrom X x Y. (X x Y means the Cartesian product of X and Y.) And, it computes some\nestimate j : X -+ Y of f to make small, the estimation erro

# steps to do
1. lower case
2. remove HTML tags
3. remove special characters and digits
4. tokenization
5. remove stop words
6. remove words of three letters or fewer
7. lemmatize
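
The steps above can be sketched with the standard library alone. This is a simplified illustration, not the notebook's NLTK-based function: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in, tokenization is a plain whitespace split, and step 7 (lemmatization) is left out because it needs WordNet.

```python
import re

# Tiny stand-in stop-word set (an assumption for illustration only; the
# notebook uses NLTK's English list plus a few extra words).
TOY_STOP_WORDS = {"this", "that", "with", "from", "have"}

def toy_preprocess(text):
    """Apply steps 1-6 with the standard library; step 7 (lemmatize) is skipped."""
    text = text.lower()                       # 1. lower case
    text = re.sub(r"<.*?>", " ", text)        # 2. remove HTML tags
    text = re.sub(r"[^a-z]", " ", text)       # 3. remove special characters and digits
    tokens = text.split()                     # 4. tokenize on whitespace
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]  # 5. remove stop words
    tokens = [t for t in tokens if len(t) > 3]               # 6. drop words of 3 letters or fewer
    return " ".join(tokens)

print(toy_preprocess("<p>This MODEL has 3 layers, with 128 units</p>"))
# -> model layers units
```

Note that "has" disappears at step 6 because of its length, not because it is in the toy stop-word set.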

In [35]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [36]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [37]:


stops_words = set(stopwords.words('english'))

new_words = [
    'fig', 'figure', 'image', 'sample', 'using', 'show', 'result',
    'two', 'three', 'four', 'five', 'seven', 'eight', 'nine'
]

stops_words = stops_words.union(new_words)  # keep a set so membership tests stay fast


In [38]:
def preprocess(txt):
    txt = txt.lower()  # lower case
    txt = re.sub(r'<.*?>', ' ', txt)  # remove HTML tags
    txt = re.sub(r'http\S+', ' ', txt)  # remove URLs (before punctuation is stripped, or they no longer match)
    txt = re.sub(r'[^a-zA-Z]', ' ', txt)  # remove special characters and digits
    txt = re.sub(r'\b(html|page|bold|parent|false)\b', ' ', txt)  # remove these specific words
    txt = nltk.word_tokenize(txt)  # tokenize
    txt = [word for word in txt if word not in stops_words]  # remove stop words
    txt = [word for word in txt if len(word) > 3]  # drop words of three letters or fewer
    lemmatizer = WordNetLemmatizer()
    txt = [lemmatizer.lemmatize(word) for word in txt]  # lemmatize

    return ' '.join(txt)

In [39]:
preprocess('This 6576575 hop moving loving *&^%^&% is PYTHON <h1> <p> hello world </p> </h1>')

'moving loving python hello world'

In [48]:
docs = (
    df['title'].fillna('') + ' ' + df['abstract'].fillna('')
).apply(preprocess)


In [49]:
docs[0]

'self organization associative database application abstract missing'

In [47]:
docs[5]

'sing neural instantiate deformable model christopher williams michael revowand geoffrey hinton department computer science university toronto toronto ontario canada abstract deformable model attractive approach recognizing nonrigid object considerable within class variability however severe search problem associated fitting model data neural network provide better starting point search time significantly reduced method demonstrated character recognition task previous work developed approach handwritten character recognition based deformable model hinton williams revow revow williams hinton obtained good performance method major problem search procedure fitting model computationally intensive efficient algorithm like dynamic programming task paper demonstrate possible compile knowledge gained fitting model data obtain better starting point significantly reduce search time deformable model digit recognition basic idea deformable model digit recognition digit model test classified findin

In [50]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(
    stop_words='english',
    ngram_range=(1,2),
    max_df=0.9,
    min_df=3
)
X = cv.fit_transform(docs)
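
To make the `ngram_range`, `min_df` and `max_df` arguments concrete, here is a plain-Python sketch of the filtering idea. It is not scikit-learn's implementation (which also applies its own token pattern and stop-word removal); `ngrams`, `filter_vocabulary` and `toy_docs` are hypothetical names, and `min_df=2` is used instead of the notebook's 3 only because the toy corpus has four documents.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def filter_vocabulary(docs, min_df=2, max_df=0.9):
    """Keep terms found in at least min_df documents (a count) and in at most
    max_df * len(docs) documents (a proportion)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc.split())))  # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

toy_docs = [
    "neural network model",
    "neural network model training",
    "bayesian model",
    "neural model selection",
]
print(filter_vocabulary(toy_docs))
# -> ['network', 'network model', 'neural', 'neural network']
```

Here "model" is dropped because it occurs in all four documents (above the `max_df` limit of 3.6), while terms such as "training" are dropped because they occur in only one.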

# TFIDF Transformer

In [51]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
tfidf_transformer = tfidf_transformer.fit(X)

feature_names = cv.get_feature_names_out()
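
With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfTransformer` weights each count by idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales every row to unit length. The sketch below reproduces that arithmetic in plain Python on made-up numbers; `smoothed_idf` and `tfidf_row` are hypothetical helper names, not part of scikit-learn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, the smoothed default form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weights = {t: c * smoothed_idf(n_docs, doc_freqs[t]) for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Toy numbers (made up): 4 documents; "neural" occurs in 3 of them, "spectrogram" in 1.
row = tfidf_row(
    term_counts={"neural": 2, "spectrogram": 1},
    doc_freqs={"neural": 3, "spectrogram": 1},
    n_docs=4,
)
print({t: round(w, 3) for t, w in row.items()})
```

A term that appears in every document gets an idf of exactly 1, so it is down-weighted relative to rarer terms but never zeroed out.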

# Extracting keywords

In [52]:
def get_keywords(idx, docs, topN=10):
    # TF-IDF scores for the chosen document (a 1 x n_features sparse row)
    doc_vector = tfidf_transformer.transform(cv.transform([docs[idx]]))

    # sort the non-zero entries by score, highest first (ties broken by column index)
    doc_vector = doc_vector.tocoo()
    tuples = zip(doc_vector.col, doc_vector.data)
    sorted_items = sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

    # keep the topN keywords and map each feature name to its rounded score
    results = {}
    for col, score in sorted_items[:topN]:
        results[feature_names[col]] = round(score, 3)
    return results

def print_keywords(idx,keywords,df):
    print('\n---title----')
    print(df['title'][idx])
    print('\n---abstract---')
    print(df['abstract'][idx])
    print('\n---keyword--')
    for k in keywords:
        print(k,keywords[k])

idx = 4995
keywords = get_keywords(idx,docs)
print_keywords(idx,keywords,df)


---title----
Low-Rank Time-Frequency Synthesis

---abstract---
Many single-channel signal decomposition techniques rely on a low-rank factorization of a time-frequency transform. In particular, nonnegative matrix factorization (NMF) of the spectrogram -- the (power) magnitude of the short-time Fourier transform (STFT) -- has been considered in many audio applications. In this setting, NMF with the Itakura-Saito divergence was shown to underly a generative Gaussian composite model (GCM) of the STFT, a step forward from more empirical approaches based on ad-hoc transform and divergence specifications. Still, the GCM is not yet a generative model of the raw signal itself, but only of its STFT. The work presented in this paper fills in this ultimate gap by proposing a novel signal synthesis model with low-rank time-frequency structure. In particular, our new approach opens doors to multi-resolution representations, that were not possible in the traditional NMF setting. We describe two exp

In [53]:
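import pickle
# context managers make sure each file is flushed and closed
with open('count_vector.pkl', 'wb') as f:
    pickle.dump(cv, f)
with open('tfidf_transformer.pkl', 'wb') as f:
    pickle.dump(tfidf_transformer, f)
with open('feature_names.pkl', 'wb') as f:
    pickle.dump(feature_names, f)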
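
The saved objects can later be reloaded to score text that was not in the original DataFrame. The following is a minimal sketch under that assumption; `load_pickle` and `keywords_for_text` are hypothetical helpers, not part of the notebook. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle

def load_pickle(path):
    """Load one pickled object and close the file afterwards."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

def keywords_for_text(text, cv, tfidf_transformer, feature_names, top_n=10):
    """Return {term: score} for the top_n highest TF-IDF terms in `text`."""
    row = tfidf_transformer.transform(cv.transform([text])).tocoo()
    ranked = sorted(zip(row.col, row.data), key=lambda pair: pair[1], reverse=True)
    return {feature_names[col]: round(float(score), 3) for col, score in ranked[:top_n]}

# Usage, assuming the three files written above are in the working directory
# and that `preprocess` from earlier in the notebook is available:
# cv = load_pickle('count_vector.pkl')
# tfidf_transformer = load_pickle('tfidf_transformer.pkl')
# feature_names = load_pickle('feature_names.pkl')
# keywords_for_text(preprocess('some new abstract'), cv, tfidf_transformer, feature_names)
```

The new text should go through the same `preprocess` function as the training documents so that its tokens line up with the fitted vocabulary.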