In [1]:
import numpy as np
import pandas as pd
import sys

def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    gl = sys._getframe(1).f_globals
    sizes = {}
    for k, v in list(gl.items()):
        if hasattr(v, 'memory_usage'):
            # pandas objects: deep=True also counts the contents of object columns
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
                mem = mem.sum()
        else:
            mem = sys.getsizeof(v)
        # key on id(v) so an object bound to several names is counted once
        sizes.setdefault(id(v), [mem]).append(k)
    total = 0
    for value, *names in sizes.values():
        if value > 1e6:
            print(names, "%.3fMB" % (value / 1e6))
        total += value
    print("%.3fMB" % (total / 1e6))

**Helpful Links: Where the Data Lives**

Open Academic Society: [Project Page](https://www.openacademic.ai/oag/)

Microsoft Research: [MS Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)

In [2]:
import pandas as pd

In [3]:
model_df = pd.read_json('data/mag_papers_0/mag_subset.txt', lines=True)

In [4]:
model_df.shape

(10000, 19)

In [5]:
model_df.columns

Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')

In [6]:
# filter out non-English articles

model_df = model_df[model_df.lang == 'en']

model_df.shape

(5167, 19)

In [7]:
# keep abstract, authors, fos, keywords, year, title
model_df = model_df.drop(['doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation', 'page_end', 
                            'page_start', 'publisher', 'references', 'url', 'venue', 'volume'], axis=1)

model_df.shape

(5167, 6)

# (1) raw data > algorithm w/ XKCD comic

## Content Based Recommendation using Jaccard Similarity

How to go about building a recommender system? 

Let's start simple with a few fields. We'll calculate the Jaccard Similarity between two items, then rank the results to choose a "most similar" paper for each input.
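
As a minimal sketch of the measure itself: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The helper name `jaccard_similarity` and the two toy papers below are illustrative and not part of the dataset.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b| for two iterables of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets are not similar.
        return 0.0
    return len(set_a & set_b) / len(union)

# Toy fields of study for two hypothetical papers
paper_a = ["Biology", "Virology", "Immunology"]
paper_b = ["Biology", "Immunology", "Microbiology", "Genetics"]

print(jaccard_similarity(paper_a, paper_b))  # 2 shared / 5 distinct = 0.4
```

A score of 1 means the two papers list exactly the same fields, and 0 means they share none.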

In [8]:
model_df.head(2)

Unnamed: 0,abstract,authors,fos,keywords,title,year
0,A system and method for maskless direct write ...,,"[Electronic engineering, Computer hardware, En...",,System and Method for Maskless Direct Write Li...,2015
1,,[{'name': 'Ahmed M. Alluwaimi'}],"[Biology, Virology, Immunology, Microbiology]","[paratuberculosis, of, subspecies, proceedings...",The dilemma of the Mycobacterium avium subspec...,2016


We can already see that this dataset will need some wrangling. Lists and dictionaries are good for data storage, but not [tidy](http://vita.had.co.nz/papers/tidy-data.html) or well-suited for machine learning without some unpacking.
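
To make "unpacking" concrete, here is a small sketch that turns a list-valued field into one (row, value) pair per element, which is the long format that tidy data expects. The records and the helper `unpack_field` are hypothetical stand-ins for rows of our data.

```python
# Two toy records shaped like rows of the dataset (values are illustrative)
records = [
    {"title": "Paper A", "fos": ["Biology", "Virology"]},
    {"title": "Paper B", "fos": None},  # missing field of study
]

def unpack_field(rows, key):
    """Yield one (row_number, value) pair per list element, skipping missing lists."""
    for row_number, row in enumerate(rows):
        for value in row.get(key) or []:
            yield row_number, value

long_format = list(unpack_field(records, "fos"))
print(long_format)  # [(0, 'Biology'), (0, 'Virology')]
```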

In [9]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

unique_year = sorted(model_df['year'].astype('str').unique())

paper_features = unique_fos + unique_year

In [10]:
def feature_array(x, var, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

We will start with a simple example of building a recommender with just a few fields, building sparse arrays of the available features to calculate the Jaccard similarity between papers. We will see if reasonably similar papers can be found in a timely manner.

In [11]:
%time fos_features = feature_array(model_df['fos'], 'fos', unique_fos)

from sys import getsizeof
print('Size of fos feature array: ', getsizeof(fos_features))

CPU times: user 8min 45s, sys: 8.38 s, total: 8min 53s
Wall time: 9min 7s
Size of fos feature array:  779231832


In [12]:
%time year_features = feature_array(model_df['year'], 'year', unique_year)

print('Size of year feature array: ', getsizeof(year_features))

CPU times: user 15.6 s, sys: 106 ms, total: 15.7 s
Wall time: 16 s
Size of year feature array:  22714156


In [13]:
year_features.shape[1] + fos_features.shape[1]

4849

In [14]:
# now looking at 5167 x  4849 array for our feature space

%time first_features = fos_features.join(year_features).T

first_size = getsizeof(first_features)

print('Size of first feature array: ', first_size)

CPU times: user 3.89 s, sys: 386 ms, total: 4.28 s
Wall time: 4.32 s
Size of first feature array:  802239497


Not bad, but waiting more than 8 minutes to encode one feature for roughly 5K observations seems a little slow. We are also using over 802 MB with only two variables from our original data set.
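
Much of that time goes to scanning the whole vocabulary for every paper. One possible alternative, sketched here under the assumption that a plain numpy indicator matrix is acceptable, visits only the labels each paper actually has. The function name `multi_hot` and the toy rows are illustrative.

```python
import numpy as np

def multi_hot(rows, vocabulary):
    """Encode lists of labels as an (n_rows, n_labels) uint8 indicator matrix."""
    column_of = {label: j for j, label in enumerate(vocabulary)}
    encoded = np.zeros((len(rows), len(vocabulary)), dtype=np.uint8)
    for i, labels in enumerate(rows):
        if not isinstance(labels, list):
            continue  # missing value: leave the row as all zeros
        for label in labels:
            encoded[i, column_of[label]] = 1
    return encoded

rows = [["Biology", "Virology"], None, ["Biology"]]
vocabulary = sorted({label for labels in rows
                     if isinstance(labels, list) for label in labels})
encoded = multi_hot(rows, vocabulary)
print(vocabulary)        # ['Biology', 'Virology']
print(encoded.tolist())  # [[1, 1], [0, 0], [1, 0]]
```

The work is proportional to the number of labels present rather than to rows times vocabulary size, and each cell takes one byte instead of a Python string.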

Let's see how our current features perform at giving us a good recommendation. We'll define a "good" recommendation as a paper that looks similar to the input.

In [15]:
first_features.shape

(4849, 5167)

In [16]:
first_features.head()

Unnamed: 0,0,1,2,5,7,8,9,10,11,12,...,9979,9980,9981,9984,9986,9988,9994,9997,9998,9999
fos_0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fos_1/N expansion,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fos_10G-PON,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fos_3D radar,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
fos_3D single-object recognition,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
from scipy.spatial.distance import cosine

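def item_collab_filter(features_df):
    # the features were stored as '0'/'1' strings, so convert them before doing arithmetic
    features_df = features_df.astype(float)
    item_similarities = pd.DataFrame(index = features_df.columns, columns = features_df.columns)
    
    for i in features_df.columns:
        for j in features_df.columns:
            # .loc[i, j] writes into the frame itself; chained .loc[i][j] may write to a copy
            item_similarities.loc[i, j] = 1 - cosine(features_df[i], features_df[j])
    
    return item_similarities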

In [None]:
%time foo = item_collab_filter(first_features.loc[:, 0:1000])

In [None]:
#%time foo3 = item_collab_filter(first_features)

[General Note] Why is this slow? We are computing the similarity between every pair of columns of a 4849 x 1000 matrix using a nested for loop, so the number of comparisons grows with the square of the number of papers. While the 4849 x 10 case took milliseconds, the time climbs quickly as we add observations to the model. Remember, this is a subset of the total available dataset, filtered for English-only papers. As we move closer to a "good" result, we would need to go back and test on the larger set for our best results.
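
For comparison, the same pairwise cosine similarities can be written as one matrix product. This is a sketch with a toy matrix, not a drop-in replacement for the function above: the name `pairwise_cosine` is ours, and unlike scipy's `cosine`, an all-zero column is scored 0 here instead of NaN.

```python
import numpy as np

def pairwise_cosine(features):
    """Cosine similarity between every pair of columns of a 2-D array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=0)
    norms[norms == 0] = 1.0  # all-zero columns get similarity 0 instead of NaN
    unit = features / norms
    return unit.T @ unit

# Three toy papers (columns) described by four binary features (rows)
toy = np.array([[1, 1, 0],
                [1, 0, 0],
                [0, 1, 0],
                [0, 0, 1]])
similarities = pairwise_cosine(toy)
print(np.round(similarities, 3))
```

The first two toy papers share one of their two features, so their similarity is 0.5; the third shares nothing with either and scores 0.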

How can we make this faster? Since we only need one result at a time, we can change our function to compute the similarities for a single item and return only the top results we want. We'll do this later as we continue through our experiment. For now, it is useful to see the full computation once so that we understand the full feature space.

We need to get a better idea of how these features will translate to us getting a good recommendation. Do we have enough observations to move forward? Let's plot a heatmap to see if we have any papers that are similar to each other.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
%matplotlib notebook

In [None]:
sns.set()
ax = sns.heatmap(foo.fillna(0), 
                 vmin=0, vmax=1, 
                 cmap="YlGnBu", 
                 xticklabels=250, yticklabels=250)
ax.tick_params(labelsize=12)

That looks promising. While we have a lot of empty space, which shows that our data set is fairly diverse, we can see that our cosine measure is accurately predicting that each paper is most similar to itself. We also have some other high score candidates. These may or may not be good recommendations qualitatively, but at least we can see that our methods are not so mad.

In [None]:
def paper_recommender(paper_index, items_df):
    print('Based on the paper: \nindex = ', paper_index)
    # items_df is labelled with model_df's index, so look papers up by label
    print(model_df.loc[paper_index])
    # the top match is normally the paper itself, so take four and skip the first
    top_results = items_df.loc[paper_index].sort_values(ascending=False).head(4)
    print('\nTop three results: ') 
    order = 1
    for i in top_results.index.tolist()[-3:]:
        print(order,'. Paper index = ', i)
        print('Similarity score: ', top_results[i])
        print(model_df.loc[i], '\n')
        if order < 5: order += 1

In [None]:
paper_recommender(2, foo)

Yikes. That's not that great. Can we just push more data through? Well, yes, if you don't need speedy results. 

Even on this small data set, the time for training our model is too slow for quick, iterative engineering when we calculate all the results at once. 

Let's try some of our new feature engineering tricks to see if we can speed up computation time, find better features and a better way to search for results that is not so time consuming.

# (2) engineering our current features, pipe, outcome

Okay, we remember that numerical features broadly distributed across a dataset can unnecessarily increase the size of our feature space. Let's wrangle this in first.

In [None]:
model_df['year'].tail()

In [None]:
print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))

In [None]:
# plot years to see the distribution
fig, ax = plt.subplots()
model_df['year'].hist(ax=ax, bins= model_df['year'].max() - model_df['year'].min())
ax.tick_params(labelsize=12)
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Occurrence', fontsize=12)

We can see from the uneven distribution that this is an excellent candidate for binning and dummy coding. Lucky for us, pandas can do all these things using built-in functions. Our results will be easy to interpret.

## years: binning + dummy coding

In [None]:
# we'll base our bins on the range of the variable, rather than the unique number of features
model_df['year'].max() - model_df['year'].min()

In [None]:
# insert binning here (by 10 years)
bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))

temp_df = pd.DataFrame(index = model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision = 0)

In [None]:
# now we only have as many bins as we created(grouping together by 10 years)
print('We have reduced from', len(model_df['year'].unique()),
      'to', len(temp_df['yearBinned'].values.unique()), 'features representing the year.')

In [None]:
binned_yrs = pd.get_dummies(temp_df['yearBinned'])
binned_yrs.head()

In [None]:
binned_yrs.columns.categories

In [None]:
# let's look at the new distribution
fig, ax = plt.subplots()
binned_yrs.sum().plot.bar(ax = ax)
ax.tick_params(labelsize=8)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)

We have preserved the underlying distribution of the original variable through binning by decades. If we desired to use a method that would benefit from a different distribution, we could alter our binning choices to change how this variable presents itself to the model. Since we are using a cosine similarity, this is fine.
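
As one example of a different binning choice, quantile-based bins put roughly the same number of papers in each bin, which flattens the distribution instead of preserving it. The sketch below uses a toy array of years, not our data.

```python
import numpy as np

years = np.array([1950, 1987, 1999, 2003, 2008, 2011, 2014, 2016])  # toy values

# Interior quantile edges: the quartiles split the data into four bins
edges = np.quantile(years, [0.25, 0.5, 0.75])

# np.digitize returns, for each year, the index of the bin it falls into
bin_index = np.digitize(years, edges)

# One indicator column per bin, the same shape of output as dummy coding
dummies = np.eye(len(edges) + 1, dtype=np.uint8)[bin_index]
print(dummies.sum(axis=0))  # [2 2 2 2]
```

Each bin now holds two papers even though the first bin spans decades and the last only a few years.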

Let's move on to the next feature we originally included in our model.

## [TODO] fields of study: One-Hot Encoding > check this...

This feature contributed significantly to the original model's size and processing time. We will aim to reduce both.
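
One inexpensive lever on size, independent of which encoder we end up using, is the dtype of the indicator matrix. A quick sketch with toy dimensions (far smaller than our real feature space):

```python
import numpy as np

n_papers, n_features = 1000, 50  # toy dimensions chosen for illustration

as_float = np.zeros((n_papers, n_features), dtype=np.float64)
as_uint8 = np.zeros((n_papers, n_features), dtype=np.uint8)

print(as_float.nbytes)  # 400000 bytes: 8 bytes per cell
print(as_uint8.nbytes)  # 50000 bytes: 1 byte per cell
```

Since every cell is either 0 or 1, the smaller dtype loses no information.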

In [None]:
# need to fill in "NaN" for sklearn
fos_df = model_df.fillna('None')

In [None]:
fos_df['fos'].head()

Let's leverage work we have already done. We already have a sparse array of parsed fields of study. The names for the feature space take up the most room. We'll pare this down by taking advantage of sklearn's One-Hot Encoder.

In [None]:
fos_features.head(2)

In [None]:
m = len(unique_fos)
m

In [None]:
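from sklearn.preprocessing import OneHotEncoder
# n_values is gone from newer scikit-learn releases; categories='auto' infers the values per column
enc = OneHotEncoder(categories='auto')
f = enc.fit(fos_features)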

In [None]:
fos_1he = f.transform(fos_features).toarray()

In [None]:
show_mem_usage()

In [None]:
fos_1he.shape

In [None]:
sum(fos_1he[1])

In [None]:
# We can see how this will make a difference in the future by looking at the size of each
from sys import getsizeof

print('Our pandas DataFrame, in bytes: ', getsizeof(fos_features))
print('Our one-hot encoded numpy array, in bytes: ', getsizeof(fos_1he))

Putting it back together, we'll pipe our features together and re-run our recommender to see if we have improved results. Since we are starting to use sklearn, we'll use its `linear_kernel` function, which computes dot products and matches cosine similarity when each row has unit length. We also cut computation time by scoring only one item at a time.
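
A raw dot product only equals cosine similarity when every row has unit length. Here is a minimal numpy sketch of the one-item-at-a-time idea with the normalization made explicit; the name `top_n_similar` and the toy matrix are illustrative.

```python
import numpy as np

def top_n_similar(features, index, top_n):
    """Return (row, cosine similarity) pairs for the rows most similar to `index`."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid dividing all-zero rows by zero
    unit = features / norms
    # One matrix-vector product scores the query row against every row
    scores = unit @ unit[index]
    ranked = [i for i in np.argsort(scores)[::-1] if i != index]
    return [(int(i), float(scores[i])) for i in ranked[:top_n]]

# Three toy papers (rows) described by four binary features (columns)
toy = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 0, 0, 1]])
results = top_n_similar(toy, 0, 2)
print(results)
```

Scoring one query costs a single pass over the matrix, so we never build the full paper-by-paper similarity table.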

In [None]:
binned_yrs.shape[1] + fos_1he.shape[1]

In [None]:
# now looking at 5167 x  9442 array for our feature space

%time second_features = np.append(fos_1he, binned_yrs.values, axis = 1)

second_size = getsizeof(second_features)

print('Size of second feature array, in bytes: ', second_size)

In [None]:
print("The power of feature engineering saves us, in bytes: ", first_size - second_size)

In [None]:
from sklearn.metrics.pairwise import linear_kernel

def piped_collab_filter(features_matrix, index, top_n):
                
    item_similarities = linear_kernel(features_matrix[index:index+1], features_matrix).flatten() 
    # rank from most to least similar, leaving out the query paper itself
    related_indices = [i for i in item_similarities.argsort()[::-1] if i != index]

    return [(i, item_similarities[i]) for i in related_indices][0:top_n]

def paper_recommender(items_df, paper_index, top_n):
    print('Based on the paper: \nindex = ', paper_index)
    print(model_df.iloc[paper_index])
    top_results = piped_collab_filter(items_df, paper_index, top_n)
    print('\nTop', top_n, 'results: ') 
    order = 1
    for result_index, score in top_results:
        print(order,'. Paper index = ', result_index)
        print('Similarity score: ', score)
        # look up the recommended paper by its position, not by the loop counter
        print(model_df.iloc[result_index], '\n')
        order += 1

In [None]:
paper_recommender(second_features, 2, 3)

# (3) a few more features, pipe, outcome

## abstract: stopwords, frequency based filtering (tf-idf?)

In [None]:
# need to fill in NaN for sklearn
abstract_df = model_df.fillna('None')

# abstract: stopwords, frequency based filtering (tf-idf?)
abstract_df['abstract'].head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
X_abstract = vectorizer.fit_transform(abstract_df['abstract'])

X_abstract

In [None]:
print("n_samples: %d, n_features: %d" % X_abstract.shape)
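
To see what the vectorizer does with `sublinear_tf=True`, here is a small numpy sketch of the weighting on a toy count matrix. It follows our understanding of scikit-learn's defaults (smoothed idf, L2-normalized rows) and leaves out tokenization, stop words and the `max_df` filter, so treat it as an illustration of the formula only.

```python
import numpy as np

# Toy term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [3, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Sublinear term frequency: 1 + ln(tf) for non-zero counts
tf = np.zeros_like(counts)
tf[counts > 0] = 1.0 + np.log(counts[counts > 0])

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1.0

tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row
print(np.round(tfidf, 3))
```

The first term appears in every document, so its idf is the minimum value of 1, while the rarer terms are weighted up.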

## authors: One-Hot Encoding using sklearn DictVectorizer()

In [None]:
authors_df = pd.DataFrame(model_df.authors)
authors_df.head()

In [None]:
import json

In [None]:
type(authors_df.authors[5][0])

In [None]:
authors_list = []

for row in authors_df.itertuples():
    # only rows that hold a list of author dictionaries can be encoded
    if type(row.authors) is list:
        # flag each value of the first author's dictionary with 1, as a one-hot feature
        y = dict.fromkeys(row.authors[0].values(), 1)
        authors_list.append(y)

In [None]:
authors_list[0:5]

In [None]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = authors_list
X = v.fit_transform(D)

In [None]:
X[0:5]

In [None]:
len(authors_list)

Let's combine these new features with our last engineered features to see if we are on the right track.

In [None]:
binned_yrs.shape[1] + fos_1he.shape[1]

# now looking at 5167 x  9442 array for our feature space

%time second_features = np.append(fos_1he, binned_yrs.values, axis = 1)

second_size = getsizeof(second_features)

print('Size of second feature array, in bytes: ', second_size)

print("The power of feature engineering saves us, in bytes: ", first_size - second_size)

paper_recommender(binned_yrs, 2, 3)

# (4) a few more...does that help? results? performance?
### no? okay. return to best case. all about experimentation.

## titles: noun phrases + chunking

In [None]:
model_df['title'].head()

## keywords: stemming?

In [None]:
model_df['keywords'].head()

## summary ##

As you can see, building models for machine learning is easy. Building *good* models that produce useful outcomes takes time and work. We hiked through the messy process of examining a collection of possible variables and experimenting with different feature engineering methods to achieve better results. We define better here as not just good outcomes from our training and testing, but also a smaller model and less time to iterate over different experiments.

**Citations**

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp.990-998. [PDF](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-ArnetMiner.pdf) [Slides](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-Arnetminer.ppt) [System](http://aminer.org/) [API](http://aminer.org/RESTful_service)

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. [PDF](https://www.microsoft.com/en-us/research/publication/an-overview-of-microsoft-academic-service-mas-and-applications-2/) [System](https://academic.microsoft.com/) [API](https://docs.microsoft.com/en-us/azure/cognitive-services/academic-knowledge/home)

http://www.markhneedham.com/blog/2016/07/27/scitkit-learn-tfidf-and-cosine-similarity-for-computer-science-papers/