In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [2]:
df = pd.read_csv('single.csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,submitter,authors,title,comments,journal-ref,doi,abstract,report-no,categories,versions,parent categories,clean categories,abstract length
0,0,0,704.0001,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,A fully differential calculation in perturba...,ANL-HEP-PR-07-12,['hep-ph'],"['v1', 'v2']",['Physics'],Physics,983
1,1,2,704.0003,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,The evolution of Earth-Moon system is descri...,,['physics.gen-ph'],"['v1', 'v2', 'v3']",['Physics'],Physics,880
2,4,5,704.0006,Yue Hin Pong,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,"6 pages, 4 figures, accepted by PRA",,10.1103/PhysRevA.75.043613,We study the two-particle wave function of p...,,['cond-mat.mes-hall'],['v1'],['Physics'],Physics,918
3,5,6,704.0007,Alejandro Corichi,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,"16 pages, no figures. Typos corrected to match...","Phys.Rev.D76:044016,2007",10.1103/PhysRevD.76.044016,A rather non-standard quantum representation...,IGPG-07/03-2,['gr-qc'],"['v1', 'v2']",['Physics'],Physics,1036
4,6,7,704.0008,Damian Swift,Damian C. Swift,Numerical solution of shock and ramp compressi...,Minor corrections,"Journal of Applied Physics, vol 104, 073536 (2...",10.1063/1.2975338,A general formulation was developed to repre...,"LA-UR-07-2051, LLNL-JRNL-410358",['cond-mat.mtrl-sci'],"['v1', 'v2', 'v3']",['Physics'],Physics,949


In [3]:
df.loc[1]['abstract']

"  The evolution of Earth-Moon system is described by the dark matter field\nfluid model proposed in the Meeting of Division of Particle and Field 2004,\nAmerican Physical Society. The current behavior of the Earth-Moon system agrees\nwith this model very well and the general pattern of the evolution of the\nMoon-Earth system described by this model agrees with geological and fossil\nevidence. The closest distance of the Moon to Earth was about 259000 km at 4.5\nbillion years ago, which is far beyond the Roche's limit. The result suggests\nthat the tidal friction may not be the primary cause for the evolution of the\nEarth-Moon system. The average dark matter field fluid constant derived from\nEarth-Moon system data is 4.39 x 10^(-22) s^(-1)m^(-1). This model predicts\nthat the Mars's rotation is also slowing with the angular acceleration rate\nabout -4.38 x 10^(-22) rad s^(-2).\n"

# Text Cleaning and Preparation

## Special Character Cleaning 

We can see the following special characters:

- \r
- \n
- \ before the apostrophe of a possessive (government's = government\'s)
- \ before the apostrophe of a possessive ending in s (Yukos' = Yukos\')
- " when quoting text
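The next cell replaces each \r and \n with a space and then collapses runs of four spaces, which still leaves the leading double space visible in its output. As a minimal sketch of an alternative, a single regular expression can collapse every run of whitespace at once. The helper name `normalize_whitespace` is hypothetical and not part of the notebook.

```python
import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (spaces, tabs, carriage returns,
    # newlines) into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "  The evolution of Earth-Moon system is described by the dark matter field\nfluid model"
print(normalize_whitespace(sample))
```

On a DataFrame column the same idea can be written as `df['abstract'].str.replace(r"\s+", " ", regex=True).str.strip()`.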

In [4]:
df['clean abstract'] = df['abstract'].str.replace("\r", " ")
df['clean abstract'] = df['clean abstract'].str.replace("\n", " ")
df['clean abstract'] = df['clean abstract'].str.replace("    ", " ")

In [5]:
df.loc[1]['clean abstract']

"  The evolution of Earth-Moon system is described by the dark matter field fluid model proposed in the Meeting of Division of Particle and Field 2004, American Physical Society. The current behavior of the Earth-Moon system agrees with this model very well and the general pattern of the evolution of the Moon-Earth system described by this model agrees with geological and fossil evidence. The closest distance of the Moon to Earth was about 259000 km at 4.5 billion years ago, which is far beyond the Roche's limit. The result suggests that the tidal friction may not be the primary cause for the evolution of the Earth-Moon system. The average dark matter field fluid constant derived from Earth-Moon system data is 4.39 x 10^(-22) s^(-1)m^(-1). This model predicts that the Mars's rotation is also slowing with the angular acceleration rate about -4.38 x 10^(-22) rad s^(-2). "

In [6]:
# " when quoting text
df['clean abstract'] = df['clean abstract'].str.replace('"', '')

## Upcase/Downcase 

In [7]:
# Lowercasing the text
df['clean abstract'] = df['clean abstract'].str.lower()

In [8]:
df.loc[1]['clean abstract']

"  the evolution of earth-moon system is described by the dark matter field fluid model proposed in the meeting of division of particle and field 2004, american physical society. the current behavior of the earth-moon system agrees with this model very well and the general pattern of the evolution of the moon-earth system described by this model agrees with geological and fossil evidence. the closest distance of the moon to earth was about 259000 km at 4.5 billion years ago, which is far beyond the roche's limit. the result suggests that the tidal friction may not be the primary cause for the evolution of the earth-moon system. the average dark matter field fluid constant derived from earth-moon system data is 4.39 x 10^(-22) s^(-1)m^(-1). this model predicts that the mars's rotation is also slowing with the angular acceleration rate about -4.38 x 10^(-22) rad s^(-2). "

## Punctuation Signs 

In [9]:
punctuation_signs = list("?:!.,;")

# regex=False treats each sign as a literal character;
# read as regular expressions, "." and "?" are metacharacters
for punct_sign in punctuation_signs:
    df['clean abstract'] = df['clean abstract'].str.replace(punct_sign, '', regex=False)
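Removing every "." also removes decimal points: in the output of the next cell, 4.5 becomes 45 and 4.39 becomes 439. If the numbers matter for the task, one option is to strip a punctuation sign only when it is not surrounded by digits. This is a minimal sketch of that idea, not what the notebook does; the names `PUNCT_NOT_IN_NUMBER` and `strip_punctuation` are hypothetical.

```python
import re

# Match a punctuation sign unless it sits between two digits
# (for example the "." in "4.5" or the "," in "1,000").
PUNCT_NOT_IN_NUMBER = re.compile(r"(?<!\d)[?:!.,;]|[?:!.,;](?!\d)")

def strip_punctuation(text):
    return PUNCT_NOT_IN_NUMBER.sub("", text)

print(strip_punctuation("about 259000 km at 4.5 billion years ago, which is far beyond"))
```

The trade-off is that thousands separators such as "1,000" are kept as well, which may or may not be desirable.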

In [10]:
df.loc[1]['clean abstract']

"  the evolution of earth-moon system is described by the dark matter field fluid model proposed in the meeting of division of particle and field 2004 american physical society the current behavior of the earth-moon system agrees with this model very well and the general pattern of the evolution of the moon-earth system described by this model agrees with geological and fossil evidence the closest distance of the moon to earth was about 259000 km at 45 billion years ago which is far beyond the roche's limit the result suggests that the tidal friction may not be the primary cause for the evolution of the earth-moon system the average dark matter field fluid constant derived from earth-moon system data is 439 x 10^(-22) s^(-1)m^(-1) this model predicts that the mars's rotation is also slowing with the angular acceleration rate about -438 x 10^(-22) rad s^(-2) "

## Possessive Pronouns

In [11]:
df['clean abstract'] = df['clean abstract'].str.replace("'s", "")

In [12]:
df.loc[1]['clean abstract']

'  the evolution of earth-moon system is described by the dark matter field fluid model proposed in the meeting of division of particle and field 2004 american physical society the current behavior of the earth-moon system agrees with this model very well and the general pattern of the evolution of the moon-earth system described by this model agrees with geological and fossil evidence the closest distance of the moon to earth was about 259000 km at 45 billion years ago which is far beyond the roche limit the result suggests that the tidal friction may not be the primary cause for the evolution of the earth-moon system the average dark matter field fluid constant derived from earth-moon system data is 439 x 10^(-22) s^(-1)m^(-1) this model predicts that the mars rotation is also slowing with the angular acceleration rate about -438 x 10^(-22) rad s^(-2) '
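The literal replacement above handles the "'s" form only, while the list of special characters also mentions possessives that end in a bare apostrophe (Yukos'). A minimal sketch that covers both forms with one regular expression follows; the names `POSSESSIVE` and `strip_possessives` are hypothetical.

```python
import re

# "'s" at the end of a word, or a bare apostrophe that follows an "s"
# and is not itself followed by a word character.
POSSESSIVE = re.compile(r"'s\b|(?<=s)'(?!\w)")

def strip_possessives(text):
    return POSSESSIVE.sub("", text)

print(strip_possessives("the roche's limit and yukos' shares"))
```

Because of the `\b`, an apostrophe followed by a longer word that merely starts with "s" is left alone, which the plain string replacement would not guarantee.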

## Stemming and Lemmatization

In [13]:
# Downloading punkt and wordnet from NLTK
nltk.download('punkt')
print("------------------------------------------------------------")
nltk.download('wordnet')

------------------------------------------------------------


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/carlosgrivera/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/carlosgrivera/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [14]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In [15]:
df = df.reset_index(drop=True)
df

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,id,submitter,authors,title,comments,journal-ref,doi,abstract,report-no,categories,versions,parent categories,clean categories,abstract length,clean abstract
0,0,0,704.00010,Pavel Nadolsky,"C. Bal\'azs, E. L. Berger, P. M. Nadolsky, C.-...",Calculation of prompt diphoton production cros...,"37 pages, 15 figures; published version","Phys.Rev.D76:013009,2007",10.1103/PhysRevD.76.013009,A fully differential calculation in perturba...,ANL-HEP-PR-07-12,['hep-ph'],"['v1', 'v2']",['Physics'],Physics,983,a fully differential calculation in perturba...
1,1,2,704.00030,Hongjun Pan,Hongjun Pan,The evolution of the Earth-Moon system based o...,"23 pages, 3 figures",,,The evolution of Earth-Moon system is descri...,,['physics.gen-ph'],"['v1', 'v2', 'v3']",['Physics'],Physics,880,the evolution of earth-moon system is descri...
2,4,5,704.00060,Yue Hin Pong,Y. H. Pong and C. K. Law,Bosonic characters of atomic Cooper pairs acro...,"6 pages, 4 figures, accepted by PRA",,10.1103/PhysRevA.75.043613,We study the two-particle wave function of p...,,['cond-mat.mes-hall'],['v1'],['Physics'],Physics,918,we study the two-particle wave function of p...
3,5,6,704.00070,Alejandro Corichi,"Alejandro Corichi, Tatjana Vukasinac and Jose ...",Polymer Quantum Mechanics and its Continuum Limit,"16 pages, no figures. Typos corrected to match...","Phys.Rev.D76:044016,2007",10.1103/PhysRevD.76.044016,A rather non-standard quantum representation...,IGPG-07/03-2,['gr-qc'],"['v1', 'v2']",['Physics'],Physics,1036,a rather non-standard quantum representation...
4,6,7,704.00080,Damian Swift,Damian C. Swift,Numerical solution of shock and ramp compressi...,Minor corrections,"Journal of Applied Physics, vol 104, 073536 (2...",10.1063/1.2975338,A general formulation was developed to repre...,"LA-UR-07-2051, LLNL-JRNL-410358",['cond-mat.mtrl-sci'],"['v1', 'v2', 'v3']",['Physics'],Physics,949,a general formulation was developed to repre...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8227,880470,1019990,1808.10844,Theerawit Wilaiprasitporn,"Nannapas Banluesombatkul, Thanawin Rakthanmano...",Single Channel ECG for Obstructive Sleep Apnea...,,,,Obstructive sleep apnea (OSA) is a common sl...,,['eess.SP'],['v1'],['Electrical Engineering and Systems Science'],Electrical Engineering and Systems Science,1793,obstructive sleep apnea (osa) is a common sl...
8228,880600,1020148,1809.00132,Yinghao Ge Mr.,"Yinghao Ge, Weile Zhang, Feifei Gao, and Hlain...",Angle-Domain Approach for Parameter Estimation...,"Single columns, 32 pages, 12 figures, transact...",,,"In this paper, we consider a downlink orthog...",,['eess.SP'],['v1'],['Electrical Engineering and Systems Science'],Electrical Engineering and Systems Science,1372,in this paper we consider a downlink orthogo...
8229,880605,1020153,1809.00137,Yinghao Ge Mr.,"Yinghao Ge, Weile Zhang, Feifei Gao, Shun Zhan...",Beamforming Network Optimization for Reducing ...,"Double columns, 13 pages, 10 figures, transact...",,,Communications in high-mobility environments...,,['eess.SP'],"['v1', 'v2', 'v3']",['Electrical Engineering and Systems Science'],Electrical Engineering and Systems Science,1694,communications in high-mobility environments...
8230,880752,1020321,1809.00305,Kenta Iida,Kenta Iida and Hitoshi Kiya,Robust Image Identification for Double-Compres...,This paper will be presented at APSIPA Annual ...,,,In the case that images are shared via socia...,,['eess.IV'],['v1'],['Electrical Engineering and Systems Science'],Electrical Engineering and Systems Science,980,in the case that images are shared via socia...


In [16]:
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list to hold the lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    # (.loc[row, column] reads one cell without building the whole row first)
    text = df.loc[row, 'clean abstract']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)

In [17]:
df['clean abstract'] = lemmatized_text_list
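The row-by-row loop above can also be expressed as a small function applied to each text. The sketch below is one way to do that under stated assumptions: `lemmatize_text` and the toy lookup table are hypothetical, and the lookup stands in for a real lemmatizer so that the example runs without NLTK data.

```python
def lemmatize_text(text, lemmatize):
    # Split on whitespace, lemmatize each token, re-join with single spaces.
    return " ".join(lemmatize(word) for word in text.split())

# Toy stand-in for a real lemmatizer (hypothetical lookup, illustration only).
toy_lemmas = {"described": "describe", "agrees": "agree", "proposed": "propose"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

print(lemmatize_text("the model described here agrees with evidence", toy_lemmatize))
```

With the notebook's objects this could be used as `df['clean abstract'].apply(lemmatize_text, lemmatize=lambda w: wordnet_lemmatizer.lemmatize(w, pos="v"))`. Note that `text.split()` collapses repeated spaces, whereas `text.split(" ")` in the loop above preserves them.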

## Stop Words

In [18]:
# Downloading the stop words list
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/carlosgrivera/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [20]:
example = "me eating a meal"
word = "me"

# The regular expression is:
regex = r"\b" + word + r"\b"  # we need to build it like that to work properly

re.sub(regex, "StopWord", example)

'StopWord eating a meal'

In [21]:
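for stop_word in stop_words:

    # regex=True is required for the \b word boundaries to be interpreted
    # (the default is regex=False from pandas 2.0 onwards)
    regex_stopword = r"\b" + re.escape(stop_word) + r"\b"
    df['clean abstract'] = df['clean abstract'].str.replace(regex_stopword, '', regex=True)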
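The loop above makes one pass over the whole column per stop word. As a minimal sketch of an alternative, all stop words can be joined into a single alternation so that each text is scanned once. The helper name `remove_stop_words` is hypothetical, and the two-word list is only an example rather than the NLTK list.

```python
import re

def remove_stop_words(text, stop_words):
    # One pattern for all stop words, each matched as a whole word.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b")
    without_stop_words = pattern.sub("", text)
    # Removing words leaves double spaces behind; collapse them.
    return " ".join(without_stop_words.split())

print(remove_stop_words("me eating a meal", ["me", "a"]))
```

The "a" inside "eating" and the "me" inside "meal" are untouched because `\b` only matches at word boundaries.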

In [22]:
df.loc[1]['clean abstract']
df.to_csv('single_clean.csv', index=False)  # index=False avoids writing an extra "Unnamed: 0" column

# Label Encoding

In [23]:
category_codes = {
    'Physics': 0,
    'Mathematics': 1,
    'Quantitative Biology': 2,
    'Quantitative Finance': 3,
    'Statistics': 4,
    'Electrical Engineering and Systems Science': 5,
    'Economics': 6,
    'Computer Science': 7
}

In [24]:
# Category mapping
df['category code'] = df['clean categories']
df = df.replace({'category code':category_codes})
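Once a model predicts integer codes, the mapping has to be reversed to report category names. A minimal sketch of that inverse lookup follows; `code_to_category` and `predicted_codes` are hypothetical names, and the predicted values are made up for illustration.

```python
category_codes = {
    'Physics': 0,
    'Mathematics': 1,
    'Quantitative Biology': 2,
    'Quantitative Finance': 3,
    'Statistics': 4,
    'Electrical Engineering and Systems Science': 5,
    'Economics': 6,
    'Computer Science': 7
}

# Invert the dictionary: integer code -> category name.
code_to_category = {code: name for name, code in category_codes.items()}

predicted_codes = [0, 5, 7]  # hypothetical model output
print([code_to_category[code] for code in predicted_codes])
```

The inversion is only safe because every code in `category_codes` is unique.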

In [25]:
df[['clean abstract','category code']]

Unnamed: 0,clean abstract,category code
0,fully differential calculation perturbativ...,0
1,evolution earth-moon system describe da...,0
2,study two-particle wave function pair ato...,0
3,rather non-standard quantum representation ...,0
4,general formulation develop represent mat...,0
...,...,...
8227,obstructive sleep apnea (osa) common sleep...,5
8228,paper consider downlink orthogonal frequ...,5
8229,communications high-mobility environments ...,5
8230,case image share via social network serv...,5


# Train - test split 

In [26]:
X_train, X_test, y_train, y_test = train_test_split(df['clean abstract'], 
                                                    df['category code'], 
                                                    test_size=0.15, 
                                                    random_state=8)
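As a quick sanity check on the split sizes, the arithmetic can be worked by hand. The sketch assumes the DataFrame has 8232 rows (the sum of the two row counts printed further down) and that the test size is rounded up with the remainder going to training, which is consistent with those counts.

```python
import math

n_samples = 8232   # assumption: 6997 + 1235, the row counts printed further down
test_size = 0.15

n_test = math.ceil(test_size * n_samples)   # 0.15 * 8232 = 1234.8, rounded up
n_train = n_samples - n_test                # everything else is used for training
print(n_train, n_test)
```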

# Text representation

We have various options:

- Count Vectors as features
- TF-IDF Vectors as features
- Word Embeddings as features
- Text / NLP based features
- Topic Models as features

We'll use TF-IDF Vectors as features.

We have to define the different parameters:

- ngram_range: We want to consider both unigrams and bigrams.
- max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold.
- min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold.
- max_features: If not None, build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.

See TfidfVectorizer? for further detail.

Note that the norm argument implicitly scales our data: with norm='l2', each document's TF-IDF vector is normalized to unit length.
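To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF with a sublinear term frequency and L2 normalization. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and tf = 1 + ln(count), which is how scikit-learn documents `smooth_idf=True` and `sublinear_tf=True`; the function name `tfidf_rows` and the three toy documents are hypothetical, and the sketch ignores n-grams, min_df, max_df and max_features.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear tf: 1 + ln(count) instead of the raw count.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2 normalization: divide by the Euclidean length of the row.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return sorted(doc_freq), rows

docs = ["quantum field energy", "quantum graph algorithm", "quantum protein sequence"]
vocab, rows = tfidf_rows(docs)
print(rows[0])
```

In the first document, "quantum" appears in every document and therefore receives a lower weight than "field", which appears in only one.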

In [27]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

In [28]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(6997, 300)
(1235, 300)


Please note that we have fitted and then transformed the training set, but we have only transformed the test set.

We can use the chi-squared test to see which unigrams and bigrams are most correlated with each category:

In [29]:
from sklearn.feature_selection import chi2
import numpy as np

for category, category_id in sorted(category_codes.items()):
    # Score every feature against a one-vs-rest indicator for this category
    features_chi2 = chi2(features_train, labels_train == category_id)
    # Sort feature indices by ascending chi-squared score
    indices = np.argsort(features_chi2[0])
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")

# 'Computer Science' category:
  . Most correlated unigrams:
. algorithm
. problem
. algorithms
. graph
. design
  . Most correlated bigrams:
. monte carlo

# 'Economics' category:
  . Most correlated unigrams:
. economic
. equilibrium
. inference
. estimators
. estimator
  . Most correlated bigrams:
. monte carlo

# 'Electrical Engineering and Systems Science' category:
  . Most correlated unigrams:
. performance
. image
. communication
. channel
. signal
  . Most correlated bigrams:
. monte carlo

# 'Mathematics' category:
  . Most correlated unigrams:
. dimension
. finite
. space
. prove
. group
  . Most correlated bigrams:
. monte carlo

# 'Physics' category:
  . Most correlated unigrams:
. phase
. field
. energy
. quantum
. mass
  . Most correlated bigrams:
. monte carlo

# 'Quantitative Biology' category:
  . Most correlated unigrams:
. sequence
. population
. dynamics
. evolution
. protein
  . Most correlated bigrams:
. monte carlo

# 'Quantitative Finance' category:
  . Most co

As we can see, the unigrams correspond well to their category. However, bigrams do not. If we get the bigrams in our features:

In [30]:
bigrams

['monte carlo']

We can see there is only one. Since max_features keeps only the 300 terms with the highest term frequency across the corpus, and unigrams occur far more often than bigrams, almost no bigrams make it into the vocabulary. This is also why the same bigram is listed for every category above.
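For intuition about the score used above, the statistic for a single feature can be worked by hand. This is a minimal sketch, assuming the score compares the observed sum of a feature's values in each class against the sum expected if the feature were independent of the class, which is our understanding of what scikit-learn's chi2 does for non-negative features. The function name `chi2_score` and the toy values are hypothetical.

```python
def chi2_score(feature_values, in_category):
    n_samples = len(feature_values)
    feature_total = sum(feature_values)
    score = 0.0
    # Two classes: documents in the category and documents outside it.
    for flag in (True, False):
        observed = sum(v for v, c in zip(feature_values, in_category) if c == flag)
        class_prob = sum(1 for c in in_category if c == flag) / n_samples
        expected = class_prob * feature_total
        score += (observed - expected) ** 2 / expected
    return score

labels = [True, True, False, False]
print(chi2_score([1, 1, 0, 0], labels))  # feature present only inside the category
print(chi2_score([1, 1, 1, 1], labels))  # feature spread evenly over both classes
```

A feature concentrated in one class scores high, while a feature spread evenly scores zero. The sketch assumes both classes are non-empty and the feature is not all zeros; otherwise the expected value would be zero.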

In [31]:
# X_train
with open('Pickles/X_train.pickle', 'wb') as output:
    pickle.dump(X_train, output)
    
# X_test    
with open('Pickles/X_test.pickle', 'wb') as output:
    pickle.dump(X_test, output)
    
# y_train
with open('Pickles/y_train.pickle', 'wb') as output:
    pickle.dump(y_train, output)
    
# y_test
with open('Pickles/y_test.pickle', 'wb') as output:
    pickle.dump(y_test, output)
    
# df
with open('Pickles/df.pickle', 'wb') as output:
    pickle.dump(df, output)
    
# features_train
with open('Pickles/features_train.pickle', 'wb') as output:
    pickle.dump(features_train, output)

# labels_train
with open('Pickles/labels_train.pickle', 'wb') as output:
    pickle.dump(labels_train, output)

# features_test
with open('Pickles/features_test.pickle', 'wb') as output:
    pickle.dump(features_test, output)

# labels_test
with open('Pickles/labels_test.pickle', 'wb') as output:
    pickle.dump(labels_test, output)
    
# TF-IDF object
with open('Pickles/tfidf.pickle', 'wb') as output:
    pickle.dump(tfidf, output)
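The cell above assumes that a Pickles directory already exists. A minimal sketch of a pair of helpers that create the directory on demand and read an object back follows; `save_pickle` and `load_pickle` are hypothetical names, and the example writes to a temporary directory rather than to the notebook's folder.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    # Create the parent directory if it does not exist yet.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as output:
        pickle.dump(obj, output)

def load_pickle(path):
    with open(path, "rb") as data:
        return pickle.load(data)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Pickles", "category_codes.pickle")
    save_pickle({"Physics": 0, "Mathematics": 1}, path)
    restored = load_pickle(path)

print(restored)
```

Pickle files should only be loaded from sources we trust, since unpickling can execute arbitrary code.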