### Removing Missing Values - Part II

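You have now seen how to fit a model by dropping the rows that contain missing values. This is useful because sklearn no longer raises an error about missing values. However, it also means we cannot make predictions for any row that contains missing values.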
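
To make the point above concrete, here is a minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical and not part of the survey dataset). It shows `LinearRegression` rejecting input that contains `NaN`, and the same fit succeeding once the incomplete row is dropped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: one feature, with one missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

try:
    LinearRegression().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # LinearRegression does not accept NaN in its input
    fit_failed = True
    print("Fit failed:", err)

# Keep only the rows where every feature is present, then fit again
keep = ~np.isnan(X_toy).any(axis=1)
model = LinearRegression().fit(X_toy[keep], y_toy[keep])
```

The cost is visible in `keep`: the dropped row takes no part in the fit, and the model cannot score it later either.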

In this notebook, we will answer a few questions from the previous video and then take some further steps.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import RemovingData as t
%matplotlib inline

df = pd.read_csv('./survey_results_public.csv')

#Subset to only quantitative vars
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]


num_vars.head()

#### Question 1

**1.** What proportion of individuals in the dataset reported a salary?
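
As a hint, one possible way to measure the share of non-missing values in a column is sketched below on a small made-up DataFrame (`toy` and `prop_reported` are illustrative names, not part of the exercise).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'Salary': [50000.0, np.nan, 72000.0, np.nan],
                    'JobSatisfaction': [7.0, 8.0, np.nan, 5.0]})

# notnull() yields True/False per row; the mean of booleans is the share of True
prop_reported = toy['Salary'].notnull().mean()
print(prop_reported)
```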

In [None]:
prop_sals = # Proportion of individuals in the dataset with salary reported

prop_sals

In [None]:
t.prop_sals_test(prop_sals) #test

#### Question 2

**2.** Remove every row of the **num_vars** dataset that has a missing value in the Salary column. Save the resulting data in the **sal_rm** variable.
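
As a hint, `DataFrame.dropna` accepts a `subset` argument that limits the check to specific columns. The sketch below uses a made-up DataFrame (`toy` and `salary_known` are illustrative names) and is only one possible approach.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'Salary': [50000.0, np.nan, 72000.0, 61000.0],
                    'JobSatisfaction': [7.0, 8.0, np.nan, 5.0]})

# Drop a row only when Salary itself is missing
salary_known = toy.dropna(subset=['Salary'])

# Count the missing values that remain in each column
print(salary_known.isnull().sum())
```

Because only the listed column is checked, it is worth inspecting the per-column counts afterwards.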

In [None]:
sal_rm = # dataframe with rows for nan Salaries removed

sal_rm.head()

In [None]:
t.sal_rm_test(sal_rm) #test

#### Question 3

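**3.** Using **sal_rm**, create a DataFrame `X` (a matrix) from the explanatory numeric variables, and store the response variable to predict (Salary) in `y`. Once both are defined, run the code below, which splits the data and tries to fit a model. Based on the result, set **question3_solution** to the letter of the statement that matches.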

In [None]:
X = #Create X using explanatory variables from sal_rm
y = #Create y using the response variable of Salary

# Split data into training and test data, and fit a linear model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=42)
lm_model = LinearRegression()

# If the fit succeeds, nothing is printed. Otherwise, the message below lets us know.
try:
    lm_model.fit(X_train, y_train)
except Exception:
    print("Oh no! It doesn't work!!!")


In [None]:
a = 'Python just likes to break sometimes for no reason at all.' 
b = 'It worked, because Python is magic.'
c = 'It broke because we still have missing values in X'

question3_solution = #Letter here

#test
t.question3_check(question3_solution)

#### Question 4

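**4.** Remove every row of **num_vars** that contains a missing value in any column (as covered in the earlier video). Store the resulting data in the **all_rm** variable.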
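
As a hint, calling `dropna` with `how='any'` drops a row as soon as one of its columns is missing. The sketch below runs on a made-up DataFrame (`toy` and `complete_rows` are illustrative names) and shows how quickly rows can disappear.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'Salary': [50000.0, np.nan, 72000.0, 61000.0, 58000.0],
                    'JobSatisfaction': [7.0, 8.0, np.nan, 5.0, 6.0],
                    'HoursPerWeek': [40.0, 45.0, 38.0, np.nan, 42.0]})

# Keep only the rows in which every column has a value
complete_rows = toy.dropna(axis=0, how='any')

print(len(toy), "rows before,", len(complete_rows), "rows after")
```

Each column here is missing only one value, yet three of the five rows are removed because the gaps fall in different rows.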

In [None]:
all_rm = # dataframe with rows for any nan column removed

all_rm.head()

In [None]:
t.all_rm_test(all_rm) #test

#### Question 5

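**5.** From **all_rm**, store the explanatory numeric variables in **X_2** and the Salary values to predict in **y_2**. Once both are defined, run the code below, which splits the data and tries to fit a model. Based on the result, set **question5_solution** to the letter of the statement that matches.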

In [None]:
X_2 = #Create X using explanatory variables from all_rm
y_2 = #Create y using Salary from all_rm

# Split data into training and test data, and fit a linear model
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2, test_size=.30, random_state=42)
lm_2_model = LinearRegression()

# If the fit succeeds, nothing is printed. Otherwise, the message below lets us know.
try:
    lm_2_model.fit(X_2_train, y_2_train)
except Exception:
    print("Oh no! It doesn't work!!!")

In [None]:
a = 'Python just likes to break sometimes for no reason at all.' 
b = 'It worked, because Python is magic.'
c = 'It broke because we still have missing values in X'

question5_solution = #Letter here

#test
t.question5_check(question5_solution)

#### Question 6

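**6.** Now use **lm_2_model** to predict the response for the test data, then compute the R-squared value against **y_2_test** to evaluate how well the model predicts.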
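
As a hint, the predict-then-score pattern is sketched below on synthetic data (every name in the block is illustrative, and the data is randomly generated rather than taken from the survey). Note that `r2_score` takes the true values first and the predictions second.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: y depends linearly on two features plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 1.0, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)

# Predict on the held-out features, then compare with the held-out targets
preds = model.predict(X_te)
r2 = r2_score(y_te, preds)
print(r2)
```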

In [None]:
y_test_preds = # Predictions here using X_2_test and lm_2_model
r2_test = # R-squared here comparing y_2_test with the predictions from lm_2_model

# Print r2 to see result
r2_test

In [None]:
t.r2_test_check(r2_test)

#### Question 7

**7.** Using what you have learned so far, match each statement below with the appropriate letter.

In [None]:
a = 5009
b = 'Other'
c = 645
d = 'We still want to predict their salary'
e = 'We do not care to predict their salary'
f = False
g = True

question7_solution = {'The number of reported salaries in the original dataset': #Letter here,
                       'The number of test salaries predicted using our model': #Letter here,
                       'If an individual does not rate stackoverflow, but has a salary': #Letter here,
                       'If an individual does not have a a job satisfaction, but has a salary': #Letter here,
                       'Our model predicts salaries for the two individuals described above.': #Letter here}
                      
                      
#Check your answers against the solution - you should be told you were right if your answers are correct!                     
t.question7_check(question7_solution)

In [None]:
#Cell for work

In [None]:
#Cell for work