[Original (English) book page](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/)


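**Code edits and cleanup**: [黄海广](https://github.com/fengdu78). The original text was converted to Jupyter notebook format, with some code added or modified. All tests pass, and all datasets are available for download from [Baidu Cloud](data/README.md).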

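**Note**: This chapter is still being updated, so please bear with us. The code has not been tested yet; the original English text and code are posted for now.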

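# 9. Back to the Feature: Putting It All Together

When you first saw the path from data to results in Figure 1-1, it may well have seemed overwhelming. Throughout this book, we have focused on the basic principles of feature engineering, using toy models and clean, simple datasets. These examples were deliberately designed to be illustrative and instructive.

Common machine learning examples show the best-case scenario and the best results, which hides the hard work along the path described in this book. Now that the foundation is in place, we leave the simple world of simulated data and dive into feature engineering on a real, structured dataset. At each stage along the way, we will examine how to generate features from raw data, how to transform them, and what trade-offs feature engineering requires.

To be clear, the goal of this end-to-end example is not to build the best model for the dataset. It is to demonstrate several of the techniques from this book in practice, and to look more closely at whether each technique adds value to the modeling process.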

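## Item-Based Collaborative Filtering

Our task is to build a recommender for academic papers using a subsample of the Microsoft Academic Graph dataset. This should be handy for anyone who searches for citations without using Google Scholar. Here are some relevant statistics about the dataset: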

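### Microsoft Academic Graph Dataset
This dataset contains 166,192,182 papers and is available through the Open Academic Graph.
- It may be used for research purposes only.
- The full dataset is 104 GB.
- Each observation has 18 variables that identify the paper, including title, abstract, author names, keywords, and field of study.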

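**Note**: This dataset has already been downloaded and preprocessed. If you only want to run the code in this chapter, there is no need to download it again; the data used here is available on [Baidu Cloud](data/).

This dataset was designed to be easy to store in and read from a database. For a machine learning model it may not be tidy enough, and it needs some basic data wrangling. Some instructors like to skip this step and let students go straight to modeling and results, but we will not do that here. We start everything from scratch.

The first step is to wrangle a few variables into the right form and build an item-based collaborative filter, to see whether we can quickly and efficiently find papers that are very similar to one another.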

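### The Origins of Item-Based Collaborative Filtering
This method was originally developed by Amazon as an improvement on user-based algorithms for product recommendations. Sarwar et al. (2001) describe in detail the difficulties and gains of switching a recommender from a user-based to an item-based approach.

Item-based collaborative filtering makes recommendations based on how similar items are to one another. The work happens in two stages: first compute a similarity score between items, then rank the scores to find the top $N$ most similar items to recommend.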

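### Building an Item-Based Recommender
An item-based recommender performs the following three tasks.

1. Generate information about the “things,” or items.
2. Score all items to find the other items that are “similar” to a given item.
3. Return the ranked scores + items.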
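
The three tasks above can be sketched on toy data. This is a minimal illustration, not the chapter's implementation: the item names and feature values are invented, and `top_n_similar` is a hypothetical helper.

```python
import numpy as np

# Task 1: information about the items (hypothetical binary features, one row per item).
item_names = ["paper_a", "paper_b", "paper_c", "paper_d"]
features = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
], dtype=float)


def top_n_similar(features, query, n):
    """Return (item index, cosine similarity) pairs for the n items most similar to `query`."""
    norms = np.linalg.norm(features, axis=1)
    # Task 2: score every item against the query item.
    scores = features @ features[query] / (norms * norms[query])
    # Task 3: rank the scores, skip the query itself, and return the top n.
    ranked = [i for i in np.argsort(-scores, kind="stable") if i != query]
    return [(int(i), float(scores[i])) for i in ranked[:n]]


for idx, score in top_n_similar(features, query=0, n=2):
    print(item_names[idx], round(score, 3))
```

The query item is excluded from its own results because it would otherwise always rank first.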

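## First Pass: Data Import, Cleaning, and Feature Parsing
Like all good science experiments, we start with a hypothesis. In this case, we assume that papers published at around the same time and in the same field of study will be most useful to the user. We use a simple method to parse these fields out of a subsample of the full dataset. After generating simple sparse arrays, we run the item-based collaborative filter over the entire item array to see whether we get satisfactory results.

Item-based collaborative filters use a similarity score to compare items. In this example, cosine similarity provides a reasonable comparison between two nonzero vectors. The examples below use cosine distance, which is the complement of cosine similarity in positive space:

$$D_C(A,B)=1-S_C(A,B)$$
where $D_C$ is the cosine distance and $S_C$ is the cosine similarity.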
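
As a small worked instance of this relationship, the sketch below computes both quantities for two invented binary vectors. The function names are illustrative and are not taken from the chapter's code.

```python
import numpy as np


def cosine_similarity(a, b):
    """S_C(A, B): dot product divided by the product of the vector norms."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_distance(a, b):
    """D_C(A, B) = 1 - S_C(A, B)."""
    return 1.0 - cosine_similarity(a, b)


a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
print(cosine_similarity(a, b))  # each vector has two active features and they share one
print(cosine_distance(a, b))
```

SciPy's `scipy.spatial.distance.cosine` returns the distance, which is why Example 9-3 subtracts it from 1 to recover a similarity.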

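### Academic Paper Recommender: Naive Approach
The first step is to import and inspect the dataset. In Example 9-1, we import the data and select a handful of usable fields to start the experiment. The fields we keep still contain a wealth of information, as shown in Figure 9-1.

#### Example 9-1. Import + filter data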

In [9]:
import pandas as pd

In [10]:
model_df = pd.read_json('data/mag_subset20K.txt', lines=True)

model_df.shape

(20000, 19)

In [11]:
model_df.columns

Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')

In [12]:
# filter out non-English articles
# keep abstract, authors, fos, keywords, year, title
# model_df = model_df[model_df.lang == 'en'].drop_duplicates(
#     subset='title', keep='first').drop([
#         'doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation', 'page_end',
#         'page_start', 'publisher', 'references', 'url', 'venue', 'volume'
#     ],
#                                        axis=1)
model_df = model_df.drop_duplicates(
    subset='title', keep='first').drop([
        'doc_type', 'doi', 'id', 'issue', 'n_citation', 'page_end',
        'page_start', 'publisher', 'venue', 'volume'
    ],
                                       axis=1)
model_df.shape

(20000, 9)

In [13]:
model_df.head(2)

| |abstract|authors|fos|keywords|lang|references|title|url|year|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|0|A system and method for maskless direct write ...| |[Electronic engineering, Computer hardware, En...| |en|[354c172f-d877-4e60-a7eb-c1b1cf03ce4d, 76cf106...|System and Method for Maskless Direct Write Li...|[http://www.freepatentsonline.com/y2016/021111...|2015|
|1| |[{'name': 'Ahmed M. Alluwaimi'}]|[Biology, Virology, Immunology, Microbiology]|[paratuberculosis, of, subspecies, proceedings...|en| |The dilemma of the Mycobacterium avium subspec...|[http://www.omicsonline.org/proceedings/the-di...|2016|


<center>Figure 9-1. First two rows of the Microsoft Academic Graph dataset</center>

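Table 9-1 makes it very clear how much data wrangling is needed to turn the raw data into a form better suited to modeling. Lists and dictionaries are convenient for data storage, but without some unpacking they are not tidy and are not well suited to machine learning (Wickham, 2014).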

<center>Table 9-1. Data overview of model_df</center>

|Field name|Description|Field type|# NaN|
|:-:|:-:|:-:|:-:|
|abstract|paper abstract|string	|4393|
|authors|author names and affiliations|list of dict, keys = name, org|1|
|fos|fields of study|list of strings|1733|
|keywords|keywords|list of strings|4294|
|title|paper title|string|0|
|year|published year|int|0|
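
To make the unpacking step concrete, here is a minimal sketch that flattens list-valued and list-of-dictionary fields like the ones in the table. It uses `DataFrame.explode`, which the chapter itself does not use, and `toy_df` with all its values is invented for illustration.

```python
import pandas as pd

# Toy stand-in for model_df: list-valued and list-of-dict-valued columns (values invented).
toy_df = pd.DataFrame({
    "title": ["Paper A", "Paper B"],
    "fos": [["Biology", "Virology"], ["Physics"]],
    "authors": [[{"name": "Author One", "org": "Org X"}], [{"name": "Author Two"}]],
})

# One row per (paper, field of study) pair.
fos_long = toy_df[["title", "fos"]].explode("fos").reset_index(drop=True)

# One row per (paper, author) pair, with the dictionary keys spread into columns.
authors_long = toy_df[["title", "authors"]].explode("authors").reset_index(drop=True)
author_cols = pd.DataFrame(authors_long["authors"].tolist())
authors_tidy = pd.concat([authors_long[["title"]], author_cols], axis=1)

print(fos_long)
print(authors_tidy)
```

Keys that are missing from a dictionary, such as `org` for the second author, show up as NaN after unpacking.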

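In Example 9-2, we focus on two fields first, converting them from lists and integers into feature arrays, as shown in Figure 9-2.

#### Example 9-2. Collaborative filtering stage 1: Build item feature matrix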

In [14]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

unique_year = sorted(model_df['year'].astype('str').unique())

len(unique_fos + unique_year)

9325

In [15]:
model_df.shape[0] - pd.isnull(model_df['fos']).sum()

13251

In [16]:
len(unique_fos)

9150

In [17]:
import random
[unique_fos[i] for i in sorted(random.sample(range(len(unique_fos)), 15)) ]

['Cograph',
 'Definition',
 'Dynamics',
 'Interleukin-6 receptor',
 'Kernel principal component analysis',
 'Metropolitan area network',
 'Moving target indication',
 'Organizational commitment',
 'Pure mathematics',
 'Quasar',
 'Self-assembled monolayer',
 'Specific gravity',
 'Spin magnetic moment',
 'Tax basis',
 'Transistor']

In [18]:
def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

In [None]:
%time year_features = feature_array(model_df['year'], unique_year)

Wall time: 55.5 s


In [None]:
%time fos_features = feature_array(model_df['fos'], unique_fos)

from sys import getsizeof
print('Size of fos feature array: ', getsizeof(fos_features))

In [None]:
year_features.shape[1] + fos_features.shape[1]

In [None]:
# now looking at 10399 x  7760 array for our feature space

%time first_features = fos_features.join(year_features).T

first_size = getsizeof(first_features)

print('Size of first feature array: ', first_size)

We will start with a simple example of building a recommender with just a few fields, building sparse arrays of available features to calculate the cosine similarity between papers. We will see if reasonably similar papers can be found in a timely manner.

In [None]:
first_features.shape

In [None]:
first_features.head()

<center>Figure 9-2. The head of first_features: indices of observations (papers) from the original dataset are the columns, features are the rows</center>

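We have successfully converted a relatively small dataset (roughly 10K rows of raw data) into 2.5 GB of features. But this method is far too cumbersome for the fast iteration that data exploration requires. We need faster methods that yield features costing less in computing resources and experiment time.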
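
One possible way to build the same kind of indicator features with far less memory is sketched below. It swaps the nested-loop `feature_array` for scikit-learn's `MultiLabelBinarizer`, which is not what the chapter uses. The field lists are invented, and the sketch assumes missing `fos` values have already been replaced by empty lists.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for model_df['fos'] after missing values are replaced by empty lists.
fos_lists = [
    ["Biology", "Virology"],
    [],
    ["Biology", "Immunology"],
]

# sparse_output=True returns a SciPy sparse matrix that stores only the nonzero entries.
mlb = MultiLabelBinarizer(sparse_output=True)
fos_sparse = mlb.fit_transform(fos_lists)

print(mlb.classes_)          # column labels, in sorted order
print(fos_sparse.shape)      # (number of papers, number of unique fields)
print(fos_sparse.toarray())  # dense view, only sensible for toy data
```

Here papers are rows and features are columns, the transpose of the layout shown in Figure 9-2.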

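Before we get ahead of ourselves, let's see how good a recommendation the current features can give us in the next stage (see Example 9-3). We define a “good” recommendation as a paper that is similar to the input.


#### Example 9-3. Collaborative filtering stage 2: Search for similar items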

In [None]:
from scipy.spatial.distance import cosine


def item_collab_filter(features_df):
    item_similarities = pd.DataFrame(
        index=features_df.columns, columns=features_df.columns)

    for i in features_df.columns:
        for j in features_df.columns:
            # a single .loc[row, col] call avoids chained indexing, which may assign to a copy
            item_similarities.loc[i, j] = 1 - cosine(features_df[i],
                                                     features_df[j])

    return item_similarities

In [None]:
%time first_items = item_collab_filter(first_features.loc[:, 0:1000])
# This step is very slow; it takes roughly an hour.

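Why does it take so long to calculate the item similarities using only two features? We are using nested for loops to take the dot product of a 10,399 × 1,000 matrix. Each loop gets slower as we add more observations to the model. Remember, we filtered for English-language papers only, so this is just a subset of the full dataset available. Once we have a reasonably “good” result, we will need to go back and test on the larger dataset to see whether it really is the best result.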
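
For comparison, the same all-pairs scores can be computed without Python loops by normalizing each item vector and taking a single matrix product. This is a sketch, not the chapter's method. It assumes numeric features with items as rows (the transpose of `first_features`) and no all-zero rows, and `pairwise_cosine_similarity` is a hypothetical helper name.

```python
import numpy as np


def pairwise_cosine_similarity(features):
    """All-pairs cosine similarity for a (n_items, n_features) array, without Python loops."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Assumption: no row is all zeros; an all-zero row would divide by zero here.
    unit_rows = features / norms
    return unit_rows @ unit_rows.T


toy = np.array([
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])
sims = pairwise_cosine_similarity(toy)
print(np.round(sims, 3))
```

The result is still an n × n matrix, so memory grows quadratically with the number of items even though the loops are gone.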

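How can we do this faster? Since we only need one result at a time, we can alter the function to take the number of top results we want and compute the similarities for a single item at a time. We will do this later, since the experiment needs continuous refinement. For now, we stick with the full feature space to understand the impact that iterating with a brute-force method has on a real dataset.

To get good recommendations, we need a better approach to feature transformation. Do we have enough observations to improve on? Let's plot a heatmap (see Example 9-4) to see whether there are papers that are similar to one another. The result is shown in Figure 9-3.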


### Example 9-4. Heatmap of paper recommendations

In [19]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline

In [20]:
sns.set()
ax = sns.heatmap(
    first_items.fillna(0),
    vmin=0,
    vmax=1,
    cmap="YlGnBu",
    xticklabels=250,
    yticklabels=250)
ax.tick_params(labelsize=12)


Figure 9-3. Heatmap of similar papers based on two raw features: year and fields of study

Darker pixels signal items that are similar to one another. The dark diagonal line shows that the cosine similarity is correctly indicating that each paper is most similar to itself. However, because there are a lot of NaNs for one of our features, the line is broken along the diagonal. We can see that while most of the items are not similar to one another—i.e., our dataset is fairly diverse—there are some other high-scoring candidates. These may or may not be good recommendations qualitatively, but at least we can see that our methods are not so mad.

Example 9-5 shows how to translate these item similarities into a recommendation. The good news is that we have a wide variety of features still available, with lots of room for improvement.

### Example 9-5. Item-based collaborative filtering recommendations

In [None]:
def paper_recommender(paper_index, items_df):
    print('Based on the paper: \nindex = ', paper_index)
    print(model_df.iloc[paper_index])
    top_results = items_df.loc[paper_index].sort_values(
        ascending=False).head(4)
    print('\nTop three results: ')
    order = 1
    for i in top_results.index.tolist()[-3:]:
        print(order, '. Paper index = ', i)
        print('Similarity score: ', top_results[i])
        print(model_df.iloc[i], '\n')
        if order < 5: order += 1

In [None]:
paper_recommender(2, first_items)

Yikes. The good news is that the most similar paper returned is the one we are looking for. The bad news is that the next two papers don’t seem to be very close to our initial search, even for the features we have chosen.

“Yes, yes,” you may say, “but this is the era of Big Data! That will solve our problems! Can’t we just push more data through for better results?” Potentially. But even Big Data cannot compensate for poor data and engineering choices.

![](images/chapter9/9-4.png)
Figure 9-4. Machine learning (https://xkcd.com/1838/)

Our current brute-force methods are too slow for smart, iterative engineering. Let’s try some of our new feature engineering tricks to see if we can speed up the computation time and find better features and a better way to search for results.

## Second Pass: More Engineering and a Smarter Model

The initial approach of creating a large, sparse array and shoving it through a filter can be improved in many ways. The next steps will focus specifically on applying better techniques to the two initial features and altering the item-based collaborative filter method for faster iteration.

First, it is time to try out some of those great feature engineering tricks for the two variables in our hypothesis. Looking deeper into the features already developed, we can choose techniques that will address each type of variable and convert it to a “better” feature for our recommendation system.

### Academic Paper Recommender: Take 2
Let’s focus on the year first. In “Quantization or Binning”, we reviewed how using raw counts for features can be problematic for methods using similarity metrics. Example 9-6 (and Figure 9-5) will examine how we can transform 'year' to better fit the model we have selected.

### Example 9-6. Fixed-width binning + dummy coding (part 1)

In [None]:
print("Year spread: ", model_df['year'].min()," - ", model_df['year'].max())
print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))

In [None]:
# plot years to see the distribution
fig, ax = plt.subplots()
model_df['year'].hist(ax=ax, bins= model_df['year'].max() - model_df['year'].min())
ax.tick_params(labelsize=12)
ax.set_xlabel('Year Count', fontsize=12)
ax.set_ylabel('Occurrence', fontsize=12)

We can see from the skewed distribution (Figure 9-5) that this is an excellent candidate for binning.

Figure 9-5. Raw year distribution for 10K+ academic papers in dataset

The bins will be based on ranges within the variable, rather than the unique number of features. To further reduce the feature space, we will dummy-code the resultant bins (see Example 9-7). Pandas can do both using built-in functions. These methods will make our results easy to interpret, so we can do a quick check of the transformed features before moving on (see Figure 9-6).
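
As a rough picture of what fixed-width binning followed by dummy coding produces, the sketch below applies both steps to a handful of invented years using plain NumPy. It bins by calendar decade with integer division, which differs slightly from `pd.cut`, where the observed range is split into equal-width intervals.

```python
import numpy as np

years = np.array([1987, 1994, 1999, 2003, 2015, 2016])  # invented publication years

# Fixed-width binning: integer division maps each year to the start of its decade.
decades = years // 10 * 10

# Dummy coding: one indicator column per decade that actually occurs.
bin_labels, bin_index = np.unique(decades, return_inverse=True)
one_hot = np.eye(len(bin_labels), dtype=int)[bin_index]

print(bin_labels)
print(one_hot)
```

Each row of `one_hot` has exactly one nonzero entry, so six distinct years collapse into four indicator columns.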

### Example 9-7. Fixed-width binning + dummy coding (part 2)

In [None]:
# we'll base our bins on the range of the variable, rather than the unique number of features
model_df['year'].max() - model_df['year'].min()

In [None]:
# binning here (by 10 years)
bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))

temp_df = pd.DataFrame(index=model_df.index)
temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)

In [None]:
# now we only have as many bins as we created(grouping together by 10 years)
print('We have reduced from', len(model_df['year'].unique()), 'to',
      len(temp_df['yearBinned'].values.unique()),
      'features representing the year.')

In [None]:
X_yrs = pd.get_dummies(temp_df['yearBinned'])
X_yrs.head()

In [None]:
X_yrs.columns.categories

In [None]:
# let's look at the new distribution
fig, ax = plt.subplots()
X_yrs.sum().plot.bar(ax=ax)
ax.tick_params(labelsize=8)
ax.set_xlabel('Binned Years', fontsize=12)
ax.set_ylabel('Counts', fontsize=12)

Figure 9-6. Distribution of new binned X_yrs feature

We have preserved the underlying distribution of the original variable through binning by decades. If we desired to use a method that would benefit from a different distribution, we could alter our binning choices to change how this variable presents itself to the model. Since we are using cosine similarity, this is fine. Let’s move on to the next feature we originally included in our model.
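
To illustrate how a different binning choice changes the distribution the model sees, the sketch below compares fixed-width bins with quantile bins on invented, skewed years. Quantile binning with `pd.qcut` is an alternative shown only for contrast; the chapter keeps the fixed-width bins.

```python
import pandas as pd

# Invented years, skewed toward recent publications.
years = pd.Series([1950, 1975, 1990, 2000, 2005, 2010, 2013, 2016])

fixed_width = pd.cut(years, bins=4)  # equal-width ranges: bin counts follow the skew
quantiles = pd.qcut(years, q=4)      # equal-frequency ranges: bin counts are balanced

fixed_counts = fixed_width.value_counts().sort_index().tolist()
quantile_counts = quantiles.value_counts().sort_index().tolist()

print("fixed-width counts:", fixed_counts)
print("quantile counts:   ", quantile_counts)
```

Fixed-width bins preserve the shape of the original distribution, while quantile bins flatten it.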

The fields-of-study feature space contributed significantly to the original model’s size and processing time.
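
To see why a mostly-zero indicator matrix is costly to store densely, the sketch below compares the bytes used by a dense NumPy array with those used by a SciPy CSR matrix holding the same values. The matrix size and density are invented for illustration and are not measurements from this dataset.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Toy indicator matrix: 1,000 items x 2,000 features, about 0.2% of entries set to 1.
dense = (rng.random((1000, 2000)) < 0.002).astype(np.int8)

# CSR stores only the nonzero values plus their column indices and row pointers.
csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The saving depends on sparsity: the fewer nonzero entries per row, the larger the gap between the two figures.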

### Example 9-8. Converting bag-of-phrases pd.Series to NumPy sparse array

In [None]:
unique_fos = sorted(list({ feature
                          for paper_row in model_df.fos.fillna('0')
                          for feature in paper_row }))

In [None]:
def feature_array(x, unique_array):
    row_dict = {}
    for i in x.index:
        var_dict = {}
        
        for j in range(len(unique_array)):
            if type(x[i]) is list:
                if unique_array[j] in x[i]:
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
            else:    
                if unique_array[j] == str(x[i]):
                    var_dict.update({unique_array[j]: 1})
                else:
                    var_dict.update({unique_array[j]: 0})
        
        row_dict.update({i : var_dict})
    
    feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
    
    return feature_df

In [None]:
%time fos_features = feature_array(model_df['fos'], unique_fos)

In [None]:
fos_features.head(2)

In [None]:
X_fos = fos_features.values

In [None]:
# We can see how this will make a difference in the future by looking at the size of each
print('Our pandas Series, in bytes: ', getsizeof(fos_features))
print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))

Much better! Putting it back together, we’ll pipe our features together (Example 9-9) and rerun our recommender (Example 9-10) to see if we have improved results, taking advantage of scikit-learn’s cosine similarity function. We will also reduce the computational time by only focusing on one item at a time.

### Example 9-9. Collaborative filtering stages 1 + 2: Build item feature matrix, search for similar items

In [None]:
X_yrs.shape[1] + X_fos.shape[1]

In [None]:
# now looking at 10399 x  7623 array for our feature space

%time second_features = np.append(X_fos, X_yrs, axis = 1)

second_size = getsizeof(second_features)

print('Size of second feature array, in bytes: ', second_size)

In [None]:
print("The power of feature engineering saves us, in bytes: ", 802239497 - second_size)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity


def piped_collab_filter(features_matrix, index, top_n):

    # cosine similarity of the query item against every item (higher = more similar)
    item_similarities = cosine_similarity(features_matrix[index:index + 1],
                                          features_matrix).flatten()
    # rank from most to least similar, skipping the query item itself
    related_indices = [
        i for i in item_similarities.argsort()[::-1] if i != index
    ]

    return [(i, item_similarities[i]) for i in related_indices][0:top_n]

### Example 9-10. Item-based collaborative filtering recommendations: Take 2

In [None]:
def paper_recommender(items_df, paper_index, top_n):
    if paper_index in model_df.index:

        print('Based on the paper:')
        print('Paper index = ', model_df.loc[paper_index].name)
        print('Title :', model_df.loc[paper_index]['title'])
        print('FOS :', model_df.loc[paper_index]['fos'])
        print('Year :', model_df.loc[paper_index]['year'])
        print('Abstract :', model_df.loc[paper_index]['abstract'])
        print('Authors :', model_df.loc[paper_index]['authors'], '\n')

        # define the location index for the DataFrame index requested
        array_ix = model_df.index.get_loc(paper_index)

        top_results = piped_collab_filter(items_df, array_ix, top_n)

        print('\nTop', top_n, 'results: ')

        order = 1
        for i in range(len(top_results)):
            print(order, '. Paper index = ',
                  model_df.iloc[top_results[i][0]].name)
            print('Similarity score: ', top_results[i][1])
            print('Title :', model_df.iloc[top_results[i][0]]['title'])
            print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
            print('Year :', model_df.iloc[top_results[i][0]]['year'])
            print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
            print('Authors :', model_df.iloc[top_results[i][0]]['authors'],
                  '\n')
            if order < top_n: order += 1

    else:
        print('Whoops! Choose another paper. Try something from here: \n',
              model_df.index[100:200])

In [None]:
paper_recommender(second_features, 2, 3)

To be honest, I don’t think our feature selection is working out too well. There is a lot of missing data in these fields. Let’s keep going to see if we can choose richer features with more information.

### FINDING YOUR PLACE
Converting between Pandas DataFrames and NumPy matrices can make indices tricky—we have the same size index, but the index assignments are not the same. Pandas assists with this using .iloc, .loc, and .get_loc, as we show in Example 9-11:

- .loc returns the index based on the original Pandas DataFrame, allowing us to reference specific papers.

- .iloc uses the integer location, which is the same index as our NumPy array.

- .get_loc helps us find the integer location when we know the DataFrame index.
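
The difference between the three accessors is easiest to see on a small frame whose index labels have gaps. `toy_df` below is invented for illustration.

```python
import pandas as pd

# Toy frame whose index labels have gaps, as happens after rows are dropped.
toy_df = pd.DataFrame({"title": ["A", "B", "C"]}, index=[0, 2, 5])

print(toy_df.loc[2, "title"])    # row with label 2
print(toy_df.iloc[2]["title"])   # third row by position, which carries label 5
print(toy_df.index.get_loc(5))   # integer position of label 5
```

After conversion to a NumPy array only the integer positions survive, so `get_loc` is the bridge from a DataFrame label to the matching array row.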

### Example 9-11. Maintaining index assignment during conversions

In [None]:
model_df.loc[21]

In [None]:
model_df.iloc[21]

In [None]:
model_df.index.get_loc(30)

## Third Pass: More Features = More Information
Our experiment thus far is not supporting the original hypothesis that year and fields-of-study would be sufficient to recommend a similar paper. At this point, we have a few options:

- Upload more of the original dataset to see if we get better results.
- Spend more time exploring the data to examine if we have a sufficiently dense set to provide good recommendations.
- Iterate on the current model by adding more features.

The first option makes the assumption that the problem is in our sampling of the data. This might be the case, but is similar to Figure 9-4’s analogy of stirring the data pile for better results.

The second option would give a better idea of the underlying raw data. This should be continually revisited based on how your decisions for features and model selection change during the exploration process. The initial subsample chosen here reflects this step. Since we have more variables available in the dataset, we will not go back here yet.

This leaves the third option, moving forward on our current model by adding more features. Providing more information about each item can improve the similarity scores and result in better recommendations.

Based on our initial exploration, the next steps will focus on the fields with the most information, abstract and authors.

### Academic Paper Recommender: Take 3

Looking back at Chapter 4, we can see that abstract is a good candidate for tf-idf to filter through the noise and find the salient associative words. We do this in Example 9-12.
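
As a reminder of how tf-idf filters noise, the sketch below computes plain tf-idf weights by hand for three invented token lists, using term count times log(N / document frequency). scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three invented "abstracts", already lowercased and tokenized.
docs = [
    "the virus infects the host cell".split(),
    "the transistor switches the current".split(),
    "the host immune response".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))


def tf_idf(doc):
    """Plain tf-idf: term count times log(N / document frequency)."""
    counts = Counter(doc)
    return {word: count * math.log(n_docs / doc_freq[word]) for word, count in counts.items()}


weights = tf_idf(docs[0])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

A word that appears in every document, such as "the", gets a weight of zero, while words unique to one document get the largest weights.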

### Example 9-12. Stopwords + tf-idf

In [None]:
# need to fill in NaN for sklearn
filled_df = model_df.fillna('None')

In [None]:
# abstract: stopwords, frequency based filtering (tf-idf?)
filled_df['abstract'].head()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    sublinear_tf=True, max_df=0.5, stop_words='english')
X_abstract = vectorizer.fit_transform(filled_df['abstract'])

X_abstract

In [None]:
print("n_samples: %d, n_features: %d" % X_abstract.shape)

In [None]:
third_features = np.append(second_features, X_abstract.toarray(), axis = 1)

In [None]:
paper_recommender(third_features, 2, 3)
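
The `np.append` call above needs `X_abstract.toarray()`, which expands the tf-idf matrix into a dense array. A possible alternative, sketched here on invented toy blocks, keeps everything sparse with `scipy.sparse.hstack`. This is an assumption about how the pipeline could be adapted, not the chapter's code.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins: a dense binary block (like second_features) and a sparse
# tf-idf-like block (like X_abstract). All values are invented.
dense_block = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
sparse_block = sparse.csr_matrix(np.array([[0.5, 0.0, 0.2],
                                           [0.4, 0.0, 0.0],
                                           [0.0, 0.9, 0.0]]))

# hstack keeps the combined matrix sparse; no .toarray() is needed.
combined = sparse.hstack([sparse.csr_matrix(dense_block), sparse_block], format="csr")

# cosine_similarity accepts sparse input; compare row 0 against every row.
scores = cosine_similarity(combined[0], combined).ravel()
print(combined.shape)
print(np.round(scores, 3))
```

Because CSR matrices support row slicing, a combined matrix like this could in principle be passed to a one-row-at-a-time filter in place of a dense array.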

We can reduce the computational load of the messy and uneven authors by wrangling into a dictionary and then running it through a one-hot encoder, as shown in Example 9-13.
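
To show what `DictVectorizer` does with such dictionaries, here is a toy sketch with invented author names. Each key becomes one column, and the numeric value stored under the key becomes the entry in that column.

```python
from sklearn.feature_extraction import DictVectorizer

# Each dictionary maps an invented author name to 1, marking that author as present.
author_dicts = [
    {"Author One": 1},
    {"Author Two": 1},
    {"Author One": 1, "Author Three": 1},
    {"None": 1},  # placeholder for a paper with no author information
]

vectorizer = DictVectorizer(sparse=False)
X_toy = vectorizer.fit_transform(author_dicts)

print(vectorizer.feature_names_)  # one column per distinct key, sorted
print(X_toy)
```

Numeric values are copied into the matrix as-is, so a value other than 1 would scale that column rather than mark simple presence.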

### Example 9-13. One-hot encoding using scikit-learn’s DictVectorizer

In [None]:
authors_df = pd.DataFrame(filled_df.authors)
authors_df.head()

In [None]:
import json

In [None]:
authors_list = []

for row in authors_df.itertuples():
    # build one {key: 1} dictionary per paper so DictVectorizer yields 0/1 indicators
    if type(row.authors) is list:
        # use the values (name, and org if present) of the first listed author as keys
        y = dict.fromkeys(row.authors[0].values(), 1)
    else:
        # missing authors were filled with the string 'None'
        y = {'None': 1}
    authors_list.append(y)

In [None]:
authors_list[0:5]

In [None]:
from sklearn.feature_extraction import DictVectorizer
v = DictVectorizer(sparse=False)
D = authors_list
X_authors = v.fit_transform(D)

X_authors

In [None]:
print("n_samples: %d, n_features: %d" % X_authors.shape)

In [None]:
# now looking at 5167 x  38070 array for our feature space

fourth_features = np.append(third_features, X_authors, axis = 1)

Time to check in with the recommender to see how these new features are working out. Example 9-14 shows the results.

### Example 9-14. Item-based collaborative filtering recommendations: Take 3

In [None]:
paper_recommender(fourth_features, 2, 3)

Even accounting for missing data in certain fields, the top three results from the last round of feature engineering are directing us to other papers in the medical field.

The range of papers represented in this dataset is broad; for example, a random sample of papers exposed fields of study such as “Coupling constant,” “Evapotranspiration,” “Hash function,” “IVMS,” “Meditation,” “Pareto analysis,” “Second-generation wavelet transform,” “Slip,” and “Spiral galaxy.” Given that there are 7,604 unique fields of study listed for 10K+ papers, these last results seem to be moving in the right direction. We can be confident that our work is progressing toward a useful model.

Continued iteration on more text variables, such as finding the noun phrases of the paper titles or stemming the keywords, could bring us even closer to a “best” recommendation.

It should be noted here that this definition of “best” is the Holy Grail of all recommenders and search engines alike. We are searching for what a user will find most helpful, which may or may not be directly represented by the data. Feature engineering allows us to abstract salient features into representations such that algorithms can expose both the explicit and implicit information contained therein.

## Summary
As you can see, building models for machine learning is easy. Building good models for useful results takes time and work. We hiked through the messy processes here of examining a collection of possible variables and experimenting with different feature engineering methods to achieve better results. We define “better” here not just in terms of good outcomes from our training and testing, but also reducing the size of the model and the time it takes us to iterate over different experiments.

We started this book by talking about how mastery of a subject comes from deeply learning the principles at work, in order to gain intuition to effectively put your knowledge to work. We hope that our work has given you the necessary tools to become more efficient and effective, as well as enriched your mathematical and computational understanding of how feature engineering is an essential skill to develop useful machine learning models.

## Bibliography
Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. “Item-Based Collaborative Filtering Recommendation Algorithms.” Proceedings of the 10th International Conference on the World Wide Web (2001): 285–295.

Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. “An Overview of Microsoft Academic Service (MAS) and Applications.” Proceedings of the 24th International Conference on the World Wide Web (2015): 243–246.

Tang, Jie, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. “ArnetMiner: Extraction and Mining of Academic Social Networks.” Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008): 990–998.

Wickham, Hadley. “Tidy Data.” The Journal of Statistical Software 59 (2014).