**Helpful Links: Where the Data Lives**

Open Academic Society: [Project Page](https://www.openacademic.ai/oag/)

Microsoft Research: [MS Academic Graph](https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/)

In [1]:
import pandas as pd

In [2]:
data = pd.read_json('data/mag_papers_0/mag_subset.txt', lines=True)

In [3]:
data.shape

(10000, 19)

In [4]:
data.columns

Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')

In [5]:
# filter out non-English articles

model_df = data[data.lang == 'en']

model_df.shape

(5167, 19)

In [6]:
# keep abstract, authors, fos, keywords, year, title
model_df = model_df.drop(['doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation', 'page_end', 
                            'page_start', 'publisher', 'references', 'url', 'venue', 'volume'], axis=1)

model_df.shape

(5167, 6)

# (1) raw data > algorithm w/ XKCD comic

## Content Based Recommendation using Jaccard Similarity(abstract, authors, fos, keywords, year, titles)

How to go about building a recommender system? 

Let's start simple + [steal like an artist](https://austinkleon.com/steal/). 

See the original, excellent examples from Joel Grus [Data Science from Scratch](https://github.com/joelgrus/data-science-from-scratch/blob/master/code-python3/recommender_systems.py).

In [7]:
import math, random
from collections import defaultdict, Counter

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

In [8]:
model_df.head(2)

Unnamed: 0,abstract,authors,fos,keywords,title,year
0,A system and method for maskless direct write ...,,"[Electronic engineering, Computer hardware, En...",,System and Method for Maskless Direct Write Li...,2015
1,,[{'name': 'Ahmed M. Alluwaimi'}],"[Biology, Virology, Immunology, Microbiology]","[paratuberculosis, of, subspecies, proceedings...",The dilemma of the Mycobacterium avium subspec...,2016


We can already see that this dataset will need some wrangling. Lists and dictionaries are good for data storage, but not [tidy](http://vita.had.co.nz/papers/tidy-data.html) or well-suited for machine learning without some unpacking.

In [9]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

unique_keywords = sorted(list({ feature
                              for paper_row in model_df.keywords.fillna('0')
                              for feature in paper_row }))

unique_year = sorted(model_df['year'].astype('str').unique())

paper_features = unique_fos + unique_keywords + unique_year

In [10]:
def feature_array(x, var, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({var + '_' + unique_array[j]: 1})
                else:
                    var_dict.update({var + '_' + unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

We will start with a simple example of building a recommder with just a few fields, building sparse arrays of available features to calculate for the Jaccard similary between papers. We will see if reasonably similar papers can be found in a timely manner.

In [None]:
%time
fos_features = feature_array(model_df['fos'], 'fos', unique_fos)
year_features = feature_array(model_df['year'], 'year', unique_year)
keywords_features = feature_array(model_df['keywords'], 'keywords', unique_keyword)

first_features = fos_features.join(year_features).join(keywords_features)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs


In [None]:
from sys import getsizeof
print('Size of feature array: ', getsizeof(first_features))

# (2) a few features, pipe, outcome

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib notebook

In [None]:
model_df['year'].tail()

In [None]:
print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))

In [None]:
# plot years to see the distribution
sns.set_style('whitegrid')
fig, ax = plt.subplots()
model_df['year'].hist(ax=ax, bins=100)
ax.set_yscale('log')
ax.tick_params(labelsize=14)
ax.set_xlabel('Year Count', fontsize=14)
ax.set_ylabel('Occurrence', fontsize=14)

## years: binning + dummy encoding

In [None]:
# insert binning here (by 10 years)

In [None]:
model_df['year'].head()

In [None]:
X_year = pd.get_dummies(model_df['year'])
X_year.head()

## abstract: stopwords, frequency based filtering (tf-idf?)

In [None]:
# need to fill in NaN for sklearn
# is this needed > YES!
abstract_df = model_df.fillna('None')

In [None]:
# abstract: stopwords, frequency based filtering (tf-idf?)
abstract_df['abstract'].head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
X_abstract = vectorizer.fit_transform(abstract_df['abstract'])

In [None]:
X_abstract

In [None]:
print("n_samples: %d, n_features: %d" % X_abstract.shape)

# (3) a few more features, pipe, outcome

## authors: One-Hot Encoding using sklearn DictVectorizer()

In [None]:
authors_df = pd.DataFrame(model_df.authors)
authors_df.head()

In [None]:
import json

In [None]:
type(authors_df.authors[5][0])

In [None]:
authors_list = []

for row in authors_df.itertuples():
    # create a dictionary from each Series index
    if type(row.authors) is list:
        # add these keys + values to our running dictionary    
        y = dict.fromkeys(row.authors[0].values(), row.Index)
        authors_list.append(y)

In [None]:
authors_list[0:5]

In [None]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = authors_list
X = v.fit_transform(D)

In [None]:
X[0:5]

## fields of study: Feature Hashing?

In [None]:
model_df['fos'].head()

In [None]:
len(authors_dict)

In [None]:
pd.get_dummies(model_df['fos'][1])

## titles: noun phrases + chunking

In [None]:
model_df['title'].head()

# (4) a few more...does that help? results? performance?
### no? okay. return to best case. all about experimentation.

## keywords: stemming?

In [None]:
model_df['keywords'].head()

**Citations**

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’2008). pp.990-998. [PDF](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-ArnetMiner.pdf) [Slides](http://keg.cs.tsinghua.edu.cn/jietang/publications/KDD08-Tang-et-al-Arnetminer.ppt) [System](http://aminer.org/) [API](http://aminer.org/RESTful_service)

Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243-246. [PDF](https://www.microsoft.com/en-us/research/publication/an-overview-of-microsoft-academic-service-mas-and-applications-2/) [System](https://academic.microsoft.com/) [API](https://docs.microsoft.com/en-us/azure/cognitive-services/academic-knowledge/home)

In [None]:
fos_counts = Counter(feature
                     for paper_row in model_df.fos.fillna('0')
                     for feature in paper_row).most_common()

keyword_counts = Counter(feature
                     for paper_row in model_df.keywords.fillna('0')
                     for feature in paper_row).most_common()

year_counts = Counter(feature for feature in model_df.year.astype('str')).most_common()

popular_paper_features = fos_counts + keyword_counts + year_counts