[原版（英文）图书地址](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/)


**代码修改和整理**：[黄海广](https://github.com/fengdu78)，原文修改成jupyter notebook格式，并增加和修改了部分代码，测试全部通过，所有数据集已经放在[百度云](data/README.md)下载。

**备注**：本章还在更新中，请等待，代码还未测试，先放上英文原文和代码。

# 九、回到特征：将它们放到一起

When the path from data to results was first introduced in Figure 1-1, it may not have been clear how there would ever be a way forward. Throughout this book, we have focused on introducing basic principles of feature engineering using toy models and clean, simple datasets. These examples were intended to be illustrative and enlightening. 

Machine learning examples generally show the best-case scenario and results. This masks the path we have described thus far in the book. Now that the foundation is set, we are leaving the world of simple, toy data and diving into the process of feature engineering with a real-world, structured dataset. As we move through each step, we will be examining the raw data forming each feature, what the transformed feature becomes, and what trade-offs we make along the way.

To be clear, our goal for this example is not to build the best model for this dataset. Rather, it is to demonstrate the practical application of a handful of our techniques, as well as how to more deeply examine and understand whether each technique is providing value to the model one is building.

Item-Based Collaborative Filtering
Our task will be to build a recommender for academic papers using a subsample of the Microsoft Academic Graph dataset. This should come in extremely handy for all of you who are searching for citations but have not yet discovered Google Scholar. Here are some relevant statistics about the dataset:



### Microsoft Academic Graph Dataset
It contains 166,192,182 unique papers, available via Open Academic Graph. 
- It is intended to be used for research purposes only.	
- The total size of the dataset is 104 GB.	
- Each observation has 18 variables to identify each paper, including the paper’s title, abstract, authors, keywords, and fields of study.

The dataset is designed to be easy to store and access in a database. It is not tidy for machine learning models out of the box, but requires some initial wrangling. Some teachers like to spare you this step, boosting your ego by getting directly to the models and results. None of that here. We are starting together from the very beginning.

Our initial approach will be to wrangle a few variables into the right shape to push through an item-based collaborative filter. We will see if reasonably similar papers can be found in a timely and efficient manner.

### THE ORIGINS OF ITEM-BASED COLLABORATIVE FILTERING
This approach was first developed at Amazon as an improvement to user-based algorithms for recommending products. Sarawar et al. (2001) walk through the challenges and benefits of switching the perspective in recommenders from the user to the item.

Item-based collaborative filtering provides recommendations based on the similarity between items. This works in two stages: first finding the similarity scores between items, then ranking all scores to find the top-N similar item recommendations.

### BUILDING AN ITEM-BASED RECOMMENDER
An item-based recommender performs three tasks:

1. Generalize information about a “thing” or item.

2. Score all other items to find ones “like” this one.

3. Return ranked scores + items.

### First Pass: Data Import, Cleaning, and Feature Parsing
ike all good science experiments, we will start off with a hypothesis. In this case, we assume that papers published at about the same time and in similar fields of study will be the most useful to users. We will take a naive approach of parsing out these fields from a subsample of the overall dataset. After generating simple sparse arrays, we’ll run the entire item array through an item-based collaborative filter to see if we get good results.

The item-based collaborative filter depends on a similarity score to compare items. In this case, the cosine similarity provides a reasonable comparison between two non-zero vectors. The following example actually uses the cosine distance, which is the complement of the cosine similarity in the positive space, or:

$$D_C(A,B)=1-S_C(A,B)$$
where $D_C$ is the cosine distance and $S_C$ is the cosine similarity.

#### Academic Paper Recommender: Naive Approach
The first step in our journey is to import and examine the dataset. In Example 9-1, we scope our experiment by limiting the fields available after the initial import. These fields are still rich in possibility, as shown in Figure 9-1.

### Example 9-1. Import + filter data

In [4]:
import pandas as pd

In [5]:
model_df = pd.read_json('data/mag_subset20K.txt', lines=True)

model_df.shape

(19996, 13)

In [6]:
model_df.columns

Index(['authors', 'doc_type', 'doi', 'id', 'issue', 'n_citation', 'page_end',
       'page_start', 'publisher', 'title', 'venue', 'volume', 'year'],
      dtype='object')

In [7]:
# filter out non-English articles
# keep abstract, authors, fos, keywords, year, title
model_df = model_df[model_df.lang == 'en'].drop_duplicates(
    subset='title', keep='first').drop([
        'doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation', 'page_end',
        'page_start', 'publisher', 'references', 'url', 'venue', 'volume'
    ],
                                       axis=1)

model_df.shape

AttributeError: 'DataFrame' object has no attribute 'lang'

In [8]:
model_df.head(2)

Unnamed: 0,authors,doc_type,doi,id,issue,n_citation,page_end,page_start,publisher,title,venue,volume,year
0,"[{'name': 'Ronald P. Mason', 'id': '2105522006...",Journal,10.1007/978-1-4684-5568-7_3,100000002,,7,27,21,"Springer, Boston, MA",Electron Spin Resonance Investigations of Oxyg...,"{'raw': 'Basic life sciences', 'id': '27556866...",49,1988
1,"[{'name': '侯晓亮', 'id': '2400277081'}]",,,1000000047,6.0,0,143,143,,建筑物地基沉降的灰色模型GM（1，1）预测法,{'raw': '安徽建筑'},13,2006


Figure 9-1. First two rows of the Microsoft Academic Graph dataset

Table 9-1 summarizes best how further wrangling is needed to get the raw data into a better shape for a model. Lists and dictionaries are good for data storage, but are not tidy or well suited for machine learning without some unpacking (Wickham, 2014).

Table 9-1. Data schema for model_df

|Field name|Description|Field type|# NaN|
|:-:|:-:|:-:|:-:|
|abstract|paper abstract|string	|4393|
|authors|author names and affiliations|list of dict, keys = name, org|1|
|fos|fields of study|list of strings|1733|
|keywords|keywords|list of strings|4294|
|title|paper title|string|0|
|year|published year|int|0|

We focus first on two fields in Example 9-2, transforming them from lists and integers into a feature array, as shown in Figure 9-2.

### Example 9-2. Collaborative filtering stage 1: Build item feature matrix

In [None]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

unique_year = sorted(model_df['year'].astype('str').unique())

len(unique_fos + unique_year)

In [None]:
model_df.shape[0] - pd.isnull(model_df['fos']).sum()

In [None]:
len(unique_fos)

In [None]:
import random
[unique_fos[i] for i in sorted(random.sample(range(len(unique_fos)), 15)) ]

In [None]:
def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

In [None]:
%time year_features = feature_array(model_df['year'], unique_year)

In [None]:
%time fos_features = feature_array(model_df['fos'], unique_fos)

from sys import getsizeof
print('Size of fos feature array: ', getsizeof(fos_features))

In [None]:
year_features.shape[1] + fos_features.shape[1]

In [None]:
# now looking at 10399 x  7760 array for our feature space

%time first_features = fos_features.join(year_features).T

first_size = getsizeof(first_features)

print('Size of first feature array: ', first_size)

Let's see how our current features perform at giving us a good recommendation. We'll define a "good" recommendation as a paper that looks similar to the input.

We will start with a simple example of building a recommender with just a few fields, building sparse arrays of available features to calculate for the cosine similary between papers. We will see if reasonably similar papers can be found in a timely manner.

In [None]:
first_features.shape

In [None]:
first_features.head()

Figure 9-2. Head of first_features—observations’ (papers') indices from the original data set are columns, features are rows

We have now successfully turned a relatively small dataset, ~10K rows of raw data, into 2.5 GB of features. But this path is too sluggish for quick, iterative exploration. We need methods that will be faster and result in features that will consume less computational resources and experimentation time.

For now, though, let’s see how our current features perform at giving us a good recommendation in the next stage (Example 9-3). We’ll define a “good” recommendation as a paper that looks similar to the input.

### Example 9-3. Collaborative filtering stage 2: Search for similar items

In [None]:
from scipy.spatial.distance import cosine

def item_collab_filter(features_df):
    item_similarities = pd.DataFrame(index = features_df.columns, columns = features_df.columns)
    
    for i in features_df.columns:
        for j in features_df.columns:
            item_similarities.loc[i][j] = 1 - cosine(features_df[i], features_df[j])
    
    return item_similarities

In [None]:
%time first_items = item_collab_filter(first_features.loc[:, 0:1000])

Why does it take so long for us to calculate the item similarities using only two features? We are taking the dot product of a 10,399 × 1,000 matrix using a nested for loop. The time per loop increases as we increase the number of observations we add to the model. Remember, this is a subset of the total available dataset, filtered for English-only papers. As we move closer to a “good” result, we’ll need to go back and test on the larger set for our best results.

How can we make this faster? Since we only need one result at a time, we can change our function so that we only calculate one item at a time, specifying the number of top results we want. We’ll do this later, as we continue to move through our experiment. For now, it is useful to see the full feature space to get an understanding of the impact of iterative work on brute-forcing our way through a real-world dataset.

We need to get a better idea of how these features will translate to us getting a good recommendation. Do we have enough observations to move forward? Let's plot a heatmap to see if we have any papers that are similar to each other.

### Example 9-4. Heatmap of paper recommendations

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [None]:
sns.set()
ax = sns.heatmap(
    first_items.fillna(0),
    vmin=0,
    vmax=1,
    cmap="YlGnBu",
    xticklabels=250,
    yticklabels=250)
ax.tick_params(labelsize=12)

Figure 9-3. Heatmap of similar papers based on two raw features: year and fields of study

Darker pixels signal items that are similar to one another. The dark diagonal line shows that the cosine similarity is correctly indicating that each paper is most similar to itself. However, because there are a lot of NaNs for one of our features, the line is broken along the diagonal. We can see that while most of the items are not similar to one another—i.e., our dataset is fairly diverse—there are some other high-scoring candidates. These may or may not be good recommendations qualitatively, but at least we can see that our methods are not so mad.

Example 9-5 shows how to translate these item similarities into a recommendation. The good news is that we have a wide variety of features still available, with lots of room for improvement.

### Example 9-5. Item-based collaborative filtering recommendations

In [None]:
def paper_recommender(paper_index, items_df):
    print('Based on the paper: \nindex = ', paper_index)
    print(model_df.iloc[paper_index])
    top_results = items_df.loc[paper_index].sort_values(
        ascending=False).head(4)
    print('\nTop three results: ')
    order = 1
    for i in top_results.index.tolist()[-3:]:
        print(order, '. Paper index = ', i)
        print('Similarity score: ', top_results[i])
        print(model_df.iloc[i], '\n')
        if order < 5: order += 1

In [None]:
paper_recommender(2, first_items)

Yikes. The good news is that the most similar paper returned is the one we are looking for. The bad news is that the next two papers don’t seem to be very close to our initial search, even for the features we have chosen.

“Yes, yes,” you may say, “but this is the era of Big Data! That will solve our problems! Can’t we just push more data through for better results?” Potentially. But even Big Data cannot compensate for poor data and engineering choices.

![](images/chapter9/9-4.png)
Figure 9-4. Machine learning (https://xkcd.com/1838/)

Our current brute-force methods are too slow for smart, iterative engineering. Let’s try some of our new feature engineering tricks to see if we can speed up the computation time and find better features and a better way to search for results.

## Second Pass: More Engineering and a Smarter Model

The initial approach of creating a large, sparse array and shoving it through a filter can be improved in many ways. The next steps will focus specifically on applying better techniques to the two initial features and altering the item-based collaborative filter method for faster iteration.

First, it is time to try out some of those great feature engineering tricks for the two variables in our hypothesis. Looking deeper into the features already developed, we can choose techniques that will address each type of variable and convert it to a “better” feature for our recommendation system.

### Academic Paper Recommender: Take 2
Let’s focus on the year first. In “Quantization or Binning”, we reviewed how using raw counts for features can be problematic for methods  using similarity metrics. Example 9-6 (and Figure 9-5) will examine how we can transform 'year' to better fit the model we have selected.

### Example 9-6. Fixed-width binning + dummy coding (part 1)

In [None]:
print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))

In [None]:
# plot years to see the distribution
fig, ax = plt.subplots()
model_df['year'].hist(ax=ax, bins= model_df['year'].max() - model_df['year'].min())
ax.tick_params(labelsize=12)
ax.set_xlabel('Year Count', fontsize=12)
ax.set_ylabel('Occurrence', fontsize=12)

We can see from the skewed distribution (Figure 9-5) that this is an excellent candidate for binning.

Figure 9-5. Raw year distribution for 10K+ academic papers in dataset

The bins will be based on ranges within the variable, rather than the unique number of features. To further reduce the feature space, we will dummy-code the resultant bins (see Example 9-7). Pandas can do both using built-in functions. These methods will make our results easy to interpret, so we can do a quick check of the transformed features before moving on (see Figure 9-6).

### Example 9-7. Fixed-width binning + dummy coding (part 2)

In [None]:
# we'll base our bins on the range of the variable, rather than the unique number of features
model_df['year'].max() - model_df['year'].min()

In [None]:
# binning here (by 10 years)
bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))

temp_df = pd.DataFrame(index=model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)

In [None]:
# now we only have as many bins as we created(grouping together by 10 years)
print('We have reduced from', len(model_df['year'].unique()), 'to',
      len(temp_df['yearBinned'].values.unique()),
      'features representing the year.')

In [None]:
X_yrs = pd.get_dummies(temp_df['yearBinned'])
X_yrs.head()

In [None]:
X_yrs.columns.categories

In [None]:
# let's look at the new distribution
fig, ax = plt.subplots()
X_yrs.sum().plot.bar(ax=ax)
ax.tick_params(labelsize=8)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)

Figure 9-6. Distribution of new binned X_yrs feature

We have preserved the underlying distribution of the original variable through binning by decades. If we desired to use a method that would benefit from a different distribution, we could alter our binning choices to change how this variable presents itself to the model. Since we are using cosine similarity, this is fine. Let’s move on to the next feature we originally included in our model.

The fields-of-study feature space contributed significantly to the original model’s size and processing time.

### Example 9-8. Converting bag-of-phrases pd.Series to NumPy sparse array

In [None]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

In [None]:
def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

In [None]:
%time fos_features = feature_array(model_df['fos'], unique_fos)

In [None]:
fos_features.head(2)

In [None]:
X_fos = fos_features.values

In [None]:
# We can see how this will make a difference in the future by looking at the size of each
print('Our pandas Series, in bytes: ', getsizeof(fos_features))
print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))

Much better! Putting it back together, we’ll pipe our features together (Example 9-9) and rerun our recommender (Example 9-10) to see if we have improved results, taking advantage of scikit-learn’s cosine similarity function. We will also reduce the computational time by only focusing on one item at a time.

### Example 9-9. Collaborative filtering stages 1 + 2: Build item feature matrix, search for similar items

In [None]:
X_yrs.shape[1] + X_fos.shape[1]

In [None]:
# now looking at 10399 x  7623 array for our feature space

%time second_features = np.append(X_fos, X_yrs, axis = 1)

second_size = getsizeof(second_features)

print('Size of second feature array, in bytes: ', second_size)

In [None]:
print("The power of feature engineering saves us, in bytes: ", 802239497 - second_size)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


def piped_collab_filter(features_matrix, index, top_n):

    item_similarities = 1 - cosine_similarity(features_matrix[index:index + 1],
                                              features_matrix).flatten()
    related_indices = [
        i for i in item_similarities.argsort()[::-1] if i != index
    ]

    return [(index, item_similarities[index])
            for index in related_indices][0:top_n]

### Example 9-10. Item-based collaborative filtering recommendations: Take 2

In [None]:
def paper_recommender(items_df, paper_index, top_n):
    if paper_index in model_df.index:

        print('Based on the paper:')
        print('Paper index = ', model_df.loc[paper_index].name)
        print('Title :', model_df.loc[paper_index]['title'])
        print('FOS :', model_df.loc[paper_index]['fos'])
        print('Year :', model_df.loc[paper_index]['year'])
        print('Abstract :', model_df.loc[paper_index]['abstract'])
        print('Authors :', model_df.loc[paper_index]['authors'], '\n')

        # define the location index for the DataFrame index requested
        array_ix = model_df.index.get_loc(paper_index)

        top_results = piped_collab_filter(items_df, array_ix, top_n)

        print('\nTop', top_n, 'results: ')

        order = 1
        for i in range(len(top_results)):
            print(order, '. Paper index = ',
                  model_df.iloc[top_results[i][0]].name)
            print('Similarity score: ', top_results[i][1])
            print('Title :', model_df.iloc[top_results[i][0]]['title'])
            print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
            print('Year :', model_df.iloc[top_results[i][0]]['year'])
            print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
            print('Authors :', model_df.iloc[top_results[i][0]]['authors'],
                  '\n')
            if order < top_n: order += 1

    else:
        print('Whoops! Choose another paper. Try something from here: \n',
              model_df.index[100:200])

In [None]:
paper_recommender(second_features, 2, 3)

To be honest, I don’t think our feature selection is working out too well. There is a lot of missing data in these fields. Let’s keep going to see if we can choose richer features with more information.

### FINDING YOUR PLACE
Converting between Pandas DataFrames and NumPy matrices can make indices tricky—we have the same size index, but the index assignments are not the same. Pandas assists with this using .iloc, .loc, and .get_loc, as we show in Example 9-11:

- .loc returns the index based on the original Pandas DataFrame, allowing us to reference specific papers.

- .iloc uses the integer location, which is the same index as our NumPy array.

- .get_loc helps us find the integer location when we know the DataFrame index.

### Example 9-11. Maintaining index assignment during conversions

In [None]:
model_df.loc[21]

In [None]:
model_df.iloc[21]

In [None]:
model_df.index.get_loc(30)

## Third Pass: More Features = More Information
Our experiment thus far is not supporting the original hypothesis that year and fields-of-study would be sufficient to recommend a similar paper. At this point, we have a few options:

- Upload more of the original dataset to see if we get better results.
- Spend more time exploring the data to examine if we have a sufficiently dense set to provide good recommendations.
- Iterate on the current model by adding more features.

The first option makes the assumption that the problem is in our sampling of the data. This might be the case, but is similar to Figure 9-4’s analogy of stirring the data pile for better results.

The second option would give a better idea of the underlying raw data. This should be continually revisited based on how your decisions for features and model selection change during the exploration process. The initial subsample chosen here reflects this step. Since we have more variables available in the dataset, we will not go back here yet.

This leaves the third option, moving forward on our current model by adding more features. Providing more information about each item can improve the similarity scores and result in better recommendations.

Based on our initial exploration, the next steps will focus on the fields with the most information, abstract and authors.

### Academic Paper Recommender: Take 3

Looking back at Chapter 4, we can see that abstract is a good candidate for tf-idf to filter through the noise and find the salient associative words. We do this in Example 9-12.

### Example 9-12. Stopwords + tf-idf

In [None]:
# need to fill in NaN for sklearn
filled_df = model_df.fillna('None')

In [None]:
# abstract: stopwords, frequency based filtering (tf-idf?)
filled_df['abstract'].head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    sublinear_tf=True, max_df=0.5, stop_words='english')
X_abstract = vectorizer.fit_transform(filled_df['abstract'])

X_abstract

In [None]:
print("n_samples: %d, n_features: %d" % X_abstract.shape)

In [None]:
third_features = np.append(second_features, X_abstract.toarray(), axis = 1)

In [None]:
paper_recommender(third_features, 2, 3)

We can reduce the computational load of the messy and uneven authors by wrangling into a dictionary and then running it through a one-hot encoder, as shown in Example 9-13.

### Example 9-13. One-hot encoding using scikit-learn’s DictVectorizer

In [None]:
authors_df = pd.DataFrame(filled_df.authors)
authors_df.head()

In [None]:
import json

In [None]:
authors_list = []

for row in authors_df.itertuples():
    # create a dictionary from each Series index
    if type(row.authors) is str:
        y = {'None': row.Index}
    if type(row.authors) is list:
        # add these keys + values to our running dictionary    
        y = dict.fromkeys(row.authors[0].values(), row.Index)
    authors_list.append(y)

In [None]:
authors_list[0:5]

In [None]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = authors_list
X_authors = v.fit_transform(D)

X_authors

In [None]:
print("n_samples: %d, n_features: %d" % X_authors.shape)

In [None]:
# now looking at 5167 x  38070 array for our feature space

fourth_features = np.append(third_features, X_authors, axis = 1)

Time to check in with the recommender to see how these new features are working out. Example 9-14 shows the results.

Example 9-14. Item-based collaborative filtering recommendations: Take 3

In [None]:
paper_recommender(fourth_features, 2, 3)

Even accounting for missing data in certain fields, the top three results from the last round of feature engineering are directing us to other papers in the medical field.

The range of papers represented in this dataset is broad; for example, a random sample of papers exposed fields of study such as “Coupling constant,” “Evapotranspiration,” “Hash function,” “IVMS,” “Meditation,” “Pareto analysis,” “Second-generation wavelet transform,” “Slip,” and “Spiral galaxy.” Given that there are 7,604 unique fields of study listed for 10K+ papers, these last results seem to be moving in the right direction. We can be confident that our work is progressing toward a useful model.

Continued iteration on more text variables, such as the finding the noun phrases of the paper titles or stemming the keywords, could bring us even closer to a “best” recommendation.

It should be noted here that this definition of “best” is the Holy Grail of all recommenders and search engines alike. We are searching for what a user will find most helpful, which may or may not be directly represented by the data. Feature engineering allows us to abstract salient features into representations such that algorithms can expose both the explicit and implicit information contained therein.

## Summary
As you can see, building models for machine learning is easy. Building good models for useful results takes time and work. We hiked through the messy processes here of examining a collection of possible variables and experimenting with different feature engineering methods to achieve better results. We define “better” here not just in terms of good outcomes from our training and testing, but also reducing the size of the model and the time it takes us to iterate over different experiments.

We started this book by talking about how mastery of a subject comes from deeply learning the principles at work, in order to gain intuition to effectively put your knowledge to work. We hope that our work has given you the necessary tools to become more efficient and effective, as well as enriched your mathematical and computational understanding of how feature engineering is an essential skill to develop useful machine learning models.

## Bibliography
Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. “Item-Based Collaborative Filtering Recommendation Algorithms.” Proceedings of the 10th International Conference on the World Wide Web (2001) 285–295.

Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. “An Overview of Microsoft Academic Service (MAS) and Applications.” Proceedings of the 24th International Conference on the World Wide Web (2015): 243–246.

Tang, Jie, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. “ArnetMiner: Extraction and Mining of Academic Social Networks.” Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008): 990–998.

Wickham, Hadley. “Tidy Data.” The Journal of Statistical Software 59 (2014).