### Putting It All Together

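As you may have guessed from the previous notebook, using all of the features can cause the model to overfit. An R-squared value computed on the training set tells you how well the model fits the training data, but it says nothing about how the model will perform on the test set.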
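
To make the gap between training and test R-squared concrete, here is a minimal sketch on synthetic data (not the survey data). The arrays `X_demo` and `y_demo` and all sizes are made up for illustration; the point is only that with more features than training rows, a linear model can fit the training set almost perfectly while doing much worse on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 60 rows, 50 features, but only the first feature matters
rng = np.random.default_rng(0)
n_rows, n_features = 60, 50
X_demo = rng.normal(size=(n_rows, n_features))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n_rows)

# 42 training rows versus 50 features: the model can interpolate the training set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

demo_model = LinearRegression().fit(X_tr, y_tr)
train_r2 = r2_score(y_tr, demo_model.predict(X_tr))
test_r2 = r2_score(y_te, demo_model.predict(X_te))

print('train R-squared:', round(train_r2, 3))
print('test R-squared:', round(test_r2, 3))
```

The training score alone would suggest a near-perfect model; only the held-out score reveals that it does not generalize.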

In this notebook, we will continue the earlier exercise. First, read in the dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import AllTogether as t
import seaborn as sns
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')
df.head()

#### Question 1

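**1.** Place the letter variables below into the correct positions inside `format()`. Note that each **{}** in the string corresponds to the value of one variable. The completed passage summarizes what we learned earlier.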
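
As a quick reminder of how positional placeholders work, the small sketch below fills two `{}` slots in order. The variable names and the sentence are made up for illustration and are unrelated to the answer.

```python
subject = 'model'
data = 'training data'

# Each {} is filled, left to right, by the arguments passed to format()
sentence = 'We fit the {} on the {}.'.format(subject, data)
print(sentence)
```

Swapping the order of the arguments swaps which value lands in which slot, which is why the order matters in the exercise.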

In [None]:
a = 'test_score'
b = 'train_score'
c = 'linear model (lm_model)'
d = 'X_train and y_train'
e = 'X_test'
f = 'y_test'
g = 'train and test data sets'
h = 'overfitting'

q1_piat = '''In order to understand how well our {} fit the dataset, 
            we first needed to split our data into {}.  
            Then we were able to fit our {} on the {}.  
            We could then predict using our {}  by providing 
            the linear model the {} for it to make predictions.  
            These predictions were for {}. 

            By looking at the {}, it looked like we were doing awesome because 
            it was 1!  However, looking at the {} suggested our model was not 
            extending well.  The purpose of this notebook will be to see how 
            well we can get our model to extend to new data.
            
            This problem where our model fits the training data well, but does
            not perform well on test data is commonly known as 
            {}.'''.format(a, a, a, a, a, a, a, a, a, a) #replace a with the correct variable

print(q1_piat)

In [None]:
# Print the solution order of the letters in the format
t.q1_piat_answer()

#### Question 2

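**2.** Now we want to improve the model. Decide whether each statement in the dictionary below describes a valid way to improve it. **Treat each statement independently.** As you decide, also consider which of them would be a useful **next step**.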

In [None]:
a = 'yes'
b = 'no'

q2_piat = {'add interactions, quadratics, cubics, and other higher order terms': #letter here, 
           'fit the model many times with different rows, then average the responses': #letter here,
           'subset the features used for fitting the model each time': #letter here,
           'this model is hopeless, we should start over': #letter here}

In [None]:
#Check your solution
t.q2_piat_check(q2_piat)

#### Question 3

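**3.** Before moving on, follow the steps listed in the function's docstring to create the X and y needed for the model. If your answer is correct, you should see a plot similar to the one in the video.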

In [None]:
def clean_data(df):
    '''
    INPUT
    df - pandas dataframe 
    
    OUTPUT
    X - A matrix holding all of the variables you want to consider when predicting the response
    y - the corresponding response vector
    
    Cleans df using the following steps to produce the correct X and y objects:
    1. Drop all the rows with no salaries
    2. Create X as all the columns that are not the Salary column
    3. Create y as the Salary column
    4. Drop the Salary, Respondent, and the ExpectedSalary columns from X
    5. For each numeric variable in X, fill missing values with the mean value of the column.
    6. Create dummy columns for all the categorical variables in X, drop the original columns
    '''
    
    return X, y
    
#Use the function to create X and y
X, y = clean_data(df)    
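
If you get stuck, the sketch below shows one possible way to carry out the six docstring steps with pandas, run on a tiny made-up DataFrame rather than the survey data. The function name `clean_data_sketch`, the toy values, and the `CareerSatisfaction` column are assumptions for illustration; only the `Salary`, `Respondent`, and `ExpectedSalary` column names come from the docstring. It is not necessarily the solution the checker expects.

```python
import numpy as np
import pandas as pd

def clean_data_sketch(df):
    # 1. Drop all the rows with no salary
    df = df.dropna(subset=['Salary'])
    # 3. The response is the Salary column
    y = df['Salary']
    # 2 and 4. Predictors are everything except Salary, Respondent, ExpectedSalary
    X = df.drop(columns=['Salary', 'Respondent', 'ExpectedSalary']).copy()
    # 5. Fill missing values in numeric columns with the column mean
    num_cols = X.select_dtypes(include=['number']).columns
    X[num_cols] = X[num_cols].fillna(X[num_cols].mean())
    # 6. Replace each non-numeric column with dummy columns
    cat_cols = X.select_dtypes(exclude=['number']).columns
    X = pd.get_dummies(X, columns=list(cat_cols))
    return X, y

# Made-up data, for illustration only
toy = pd.DataFrame({
    'Respondent': [1, 2, 3, 4],
    'Salary': [50000.0, np.nan, 70000.0, 60000.0],
    'ExpectedSalary': [np.nan, 40000.0, np.nan, np.nan],
    'CareerSatisfaction': [7.0, 8.0, np.nan, 9.0],
    'Country': ['A', 'B', 'A', 'B'],
})

X_toy, y_toy = clean_data_sketch(toy)
print(X_toy)
```

Note that the mean used in step 5 is computed after the rows without a salary have been dropped, so the order of the steps affects the filled values.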

### Run the cell below and use the results to answer Question 4

In [None]:
# The cutoffs here pertain to the number of missing values allowed in the columns used.
# Therefore, lower cutoff values provide more predictors in the model.
cutoffs = [5000, 3500, 2500, 1000, 100, 50, 30, 25]

#Run this cell to pass your X and y to the model for testing
r2_scores_test, r2_scores_train, lm_model, X_train, X_test, y_train, y_test = t.find_optimal_lm_mod(X, y, cutoffs)
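
The internals of `t.find_optimal_lm_mod` are not shown in this notebook. The sketch below is a hypothetical illustration of the general idea of a cutoff sweep, not that helper's implementation: for each cutoff it keeps only the columns whose sum exceeds the cutoff (an assumed selection rule), fits a linear model, and records the train and test R-squared. The function `sweep_cutoffs` and the synthetic `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic 0/1 columns with varying density, standing in for dummy variables
rng = np.random.default_rng(42)
n_rows = 300
densities = np.linspace(0.02, 0.5, 30)
X_demo = pd.DataFrame(
    {f'feat_{i}': rng.binomial(1, p, n_rows) for i, p in enumerate(densities)}
)
y_demo = 3.0 * X_demo['feat_29'] + 2.0 * X_demo['feat_20'] + rng.normal(0, 1, n_rows)

def sweep_cutoffs(X, y, cutoffs, test_size=0.3, random_state=42):
    '''Return {cutoff: (n_features, train_r2, test_r2)} for each usable cutoff.'''
    results = {}
    for cutoff in cutoffs:
        # Assumed rule: keep columns whose sum exceeds the cutoff
        kept = X.loc[:, X.sum() > cutoff]
        if kept.shape[1] == 0:
            continue  # nothing left to fit on
        X_tr, X_te, y_tr, y_te = train_test_split(
            kept, y, test_size=test_size, random_state=random_state
        )
        model = LinearRegression().fit(X_tr, y_tr)
        results[cutoff] = (
            kept.shape[1],
            r2_score(y_tr, model.predict(X_tr)),
            r2_score(y_te, model.predict(X_te)),
        )
    return results

demo_cutoffs = [100, 50, 20, 5]
results = sweep_cutoffs(X_demo, y_demo, demo_cutoffs)

# Pick the cutoff with the best score on held-out data, not on training data
best_cutoff = max(results, key=lambda c: results[c][2])
for cutoff, (n_feats, r2_tr, r2_te) in results.items():
    print(cutoff, n_feats, round(r2_tr, 3), round(r2_te, 3))
print('best cutoff by test R-squared:', best_cutoff)
```

Selecting on the test score rather than the training score is the design choice that matters here: the training score generally keeps rising as more features are added, so it cannot identify where to stop.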

#### Question 4

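**4.** Using the results and plot above, match the letter variables below to the statements in the **q4_piat** dictionary. Note that only the results for the optimal model are returned above, stored in the variables **lm_model**, **X_train**, **X_test**, **y_train**, and **y_test**. If more than one letter applies to a statement, put the letters in alphabetical order in a tuple, for example `(a,b)`.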

In [None]:
# Cell for your computations to answer the next question

In [None]:
a = 'we would likely have a better rsquared for the test data.'
b = 1000
c = 872
d = 0.69
e = 0.82
f = 0.88
g = 0.72
h = 'we would likely have a better rsquared for the training data.'

q4_piat = {'The optimal number of features based on the results is': #letter here, 
               'The model we should implement in practice has a train rsquared of': #letter here, 
               'The model we should implement in practice has a test rsquared of': #letter here,
               'If we were to allow the number of features to continue to increase': #letter here
}

In [None]:
#Check against your solution
t.q4_piat_check(q4_piat)

#### Question 5

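**5.** Note that sklearn's `LinearRegression` fits ordinary least squares and applies no penalty to the coefficients by default; ridge regression (also known as L2 regularization) is available through the separate `Ridge` estimator. Because all of the variables here have been normalized, we can still look at the size of the coefficients in the model to get a sense of how much each feature influences the response. The larger a feature's coefficient in absolute value, the larger its influence on the response.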
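
To see what an L2 penalty does to coefficients, the sketch below fits sklearn's `LinearRegression` and `Ridge` on the same synthetic data and compares the overall size of the two coefficient vectors. The data, the true coefficients, and the choice `alpha=50.0` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with known coefficients
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 5))
true_coefs = np.array([4.0, -3.0, 0.0, 1.0, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(size=100)

# Ordinary least squares: no penalty on the coefficients
ols = LinearRegression().fit(X_demo, y_demo)
# Ridge: adds an L2 penalty whose strength is set by alpha
ridge = Ridge(alpha=50.0).fit(X_demo, y_demo)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print('OLS coefficients:  ', np.round(ols.coef_, 2))
print('Ridge coefficients:', np.round(ridge.coef_, 2))
print('coefficient norms: ', round(ols_norm, 2), round(ridge_norm, 2))
```

The ridge coefficients are pulled toward zero relative to the unpenalized fit, and a larger `alpha` pulls them further.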

Use the cell below to look at the coefficients, then mark each statement in the **q5_piat** dictionary further down as **True** or **False** based on the results.

#### Run the cell below

In [None]:
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model 
    X_train - the training data, so the column names can be used
    OUTPUT:
    coefs_df - a dataframe holding the coefficient, estimate, and abs(estimate)
    
    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the 
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    # Use the coefficients passed in rather than the global lm_model
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df

#Use the function
coef_df = coef_weights(lm_model.coef_, X_train)

#A quick look at the top results
coef_df.head(20)

In [None]:
a = True
b = False

#According to the data...
q5_piat = {'Country appears to be one of the top indicators for salary': #letter here,
               'Gender appears to be one of the indicators for salary': #letter here, 
               'How long an individual has been programming appears to be one of the top indicators for salary': #letter here,
               'The longer an individual has been programming the more they are likely to earn': #letter here}

In [None]:
t.q5_piat_check(q5_piat)

#### Congratulations

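Hopefully this exercise helped you review earlier material and get comfortable pulling the steps of an analysis together. In the next lesson, you will learn how to communicate your results to others so that they can act on them.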