# Assignment 1

**The future of employment** 
How susceptible are jobs to computerisation?

We examine how susceptible jobs are to computerisation. To assess this, we begin by implementing a novel methodology to estimate the probability of computerisation for 702 detailed occupations, using a Gaussian process classifier. Based on these estimates, we examine expected impacts of future computerisation on **US labour market** outcomes, with the primary objective of analysing the number of jobs at risk and the relationship between an occupation's probability of computerisation, wages and educational attainment.

- C. Frey, M. Osborne. The future of employment: How susceptible are jobs to computerisation? Technological Forecasting & Social Change 114 (2017) 254–280.

First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not.

Second, we use objective **O*NET** variables corresponding to the defined bottlenecks to computerisation. We are interested in variables describing the level of perception and manipulation, creativity, and social intelligence required to perform an occupation. We identified **nine variables** of O*NET that describe these attributes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# https://github.com/SocratesAcademy/ccbook/tree/master/data/jobdata.csv
df = pd.read_csv('jobdata.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,soc,Element Name,id,label,Data Value,computerization
0,0,11-1011,Assisting and Caring for Others,70,0,2.205,0.015
1,1,11-1011,"Cramped Work Space, Awkward Positions",70,0,1.415,0.015
2,2,11-1011,Fine Arts,70,0,0.915,0.015
3,3,11-1011,Finger Dexterity,70,0,2.0,0.015
4,4,11-1011,Manual Dexterity,70,0,0.0,0.015


In [3]:
len(df)

585

In [4]:
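# Each occupation occupies 9 consecutive rows (one per O*NET variable),
# so group the 'Data Value' column into rows of 9 features.
data_list = list(df['Data Value'])
X = []
for i in range(0, len(df), 9):
    X.append(data_list[i:i+9])
X = np.array(X)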

len(X)

65

In [5]:
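# Take one label per occupation: the first row of each block of 9.
data_list1 = list(df['label'])
Y = []
for i in range(0, len(df), 9):
    Y.append(data_list1[i])
Y = np.array(Y)
Y = Y[:, np.newaxis]  # column vector, one row per occupation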
Y[:3]

array([[0],
       [0],
       [0]])
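Slicing in steps of 9 relies on every occupation having exactly nine rows in a fixed order. As a hedged alternative, the same feature matrix can be built by pivoting on the occupation code, which does not depend on row order. The sketch below uses a tiny made-up table that only borrows the column names of `jobdata.csv`; the values and the names `toy`, `wide`, `X_toy` and `y_toy` are illustrative assumptions.

```python
import pandas as pd

# Tiny made-up long-format table with the same column names as jobdata.csv
# (values are illustrative, not taken from the real data).
toy = pd.DataFrame({
    "soc": ["11-1011"] * 3 + ["13-2011"] * 3,
    "Element Name": ["Fine Arts", "Finger Dexterity", "Manual Dexterity"] * 2,
    "Data Value": [0.9, 2.0, 0.0, 0.1, 2.5, 1.0],
    "label": [0] * 3 + [1] * 3,
})

# One row per occupation, one column per O*NET variable.
wide = toy.pivot(index="soc", columns="Element Name", values="Data Value")
X_toy = wide.to_numpy()

# One label per occupation, aligned with the rows of `wide`.
y_toy = toy.groupby("soc")["label"].first().loc[wide.index].to_numpy()

print(wide.shape, y_toy)
```

Note that `pivot` sorts the columns alphabetically by element name, so the feature order may differ from the order of the rows in the file.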

In [6]:
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score, roc_auc_score, accuracy_score

X1, X2, y1, y2 = train_test_split(X, Y, random_state=0,
                                  train_size=0.6, test_size = 0.4)

In [7]:
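from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
bayes.fit(X1, y1.flatten())   # fit expects a 1-D label array
y2_model = bayes.predict(X2)  # hard 0/1 predictions on the held-out set
# Note: the ROC AUC here is computed from hard labels, not predicted probabilities
accuracy_score(y2, y2_model), roc_auc_score(y2, y2_model)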

(0.8846153846153846, 0.8846153846153847)

**Task**: Train a RandomForestClassifier and compute accuracy_score and roc_auc_score


**Task**: Train an SVC and compute accuracy_score and roc_auc_score

Accuracy score =  0.8846153846153846
ROC_AUC score = 0.8846153846153846
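One possible shape for the two tasks above is sketched here. To stay self-contained it runs on a synthetic stand-in (65 rows, 9 features, labels generated by a made-up rule) rather than on `jobdata.csv`, so its scores are not comparable to the values printed above. The hyperparameters (`n_estimators=100`, a linear kernel with `C=1.0`) and the names `X_demo`, `y_demo` and `scores` are assumptions, not settings taken from the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for the real data: 65 "occupations" x 9 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(65, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # made-up labelling rule

# Same 60/40 split as in the notebook; stratify keeps both classes in the test set.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, random_state=0, train_size=0.6, test_size=0.4, stratify=y_demo)

models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(kernel="linear", C=1.0),
}

scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    y_pred = model.predict(Xte)  # hard 0/1 predictions, as in the GaussianNB cell
    scores[name] = (accuracy_score(yte, y_pred), roc_auc_score(yte, y_pred))
    print(name, scores[name])
```

To run this on the real data, replace `Xtr, Xte, ytr, yte` with `X1, X2, y1.flatten(), y2.flatten()` from the cells above.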


In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cross_validation(model):
    roc_auc= cross_val_score(model, X, Y.flatten(), scoring="roc_auc", cv = 5)
    return roc_auc

svc_linear = SVC(kernel='linear', C=1E10)
np.mean(cross_validation(svc_linear))

0.9142857142857143

**Task**: Repeat the cross-validation with the rbf kernel


0.9142857142857143
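A minimal sketch of the rbf variant follows, again on a synthetic stand-in so that it runs on its own; its mean AUC is therefore unrelated to the value shown above. `C=1.0` and `gamma="scale"` are assumed defaults rather than the settings used in the assignment, and `X_demo`, `y_demo` and `rbf_auc` are hypothetical names.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data: 65 rows x 9 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(65, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # made-up labelling rule

# Same 5-fold ROC AUC cross-validation as the linear case, with an RBF kernel.
svc_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")
rbf_auc = cross_val_score(svc_rbf, X_demo, y_demo, scoring="roc_auc", cv=5)

print(rbf_auc.mean())
```

With the notebook's own helper, the equivalent call would be `np.mean(cross_validation(SVC(kernel='rbf')))`.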

### GPy
The Gaussian processes framework in Python. https://github.com/SheffieldML/GPy

In [2]:
!pip install --upgrade GPy


Collecting GPy
  Using cached GPy-1.9.9-cp37-cp37m-macosx_10_9_x86_64.whl (1.5 MB)
Processing /Users/datalab/Library/Caches/pip/wheels/c8/4a/0e/6e0dc85541825f991c431619e25b870d4b812c911214690cf8/paramz-0.9.5-cp37-none-any.whl
Installing collected packages: paramz, GPy
Successfully installed GPy-1.9.9 paramz-0.9.5


In [27]:
import GPy

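# RBF kernel over the 9 O*NET features
kernel = GPy.kern.RBF(input_dim=9, variance=1., lengthscale=1.)
# GP regression fitted to the 0/1 labels (a regression model, not a GP classifier)
m = GPy.models.GPRegression(X,Y,kernel)
m.optimize(messages=False)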

print(m)



Name : GP regression
Objective : 28.130643010540453
Number of Parameters : 3
Number of Optimization Parameters : 3
Updates : True
Parameters:
  GP_regression.           |               value  |  constraints  |  priors
  rbf.variance             |  0.3113242734729479  |      +ve      |        
  rbf.lengthscale          |   3.933616340596464  |      +ve      |        
  Gaussian_noise.variance  |  0.0964434513555219  |      +ve      |        


In [26]:
X1, X2, y1, y2 = train_test_split(X, Y, random_state=0,
                                  train_size=0.6, test_size = 0.4)
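m = GPy.models.GPRegression(X1,y1,kernel)#, normalizer = True)
m.optimize(messages=False)
y2_model = m.predict(X2)[0]  # predict returns (mean, variance); keep the posterior mean
# Threshold the posterior mean at 0.5 to get hard 0/1 predictions
y2_hat = [1 if i > 0.5 else 0 for i in y2_model.flatten()]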
print('Accuracy score = ', accuracy_score(y2, y2_hat))
print('ROC_AUC score =', roc_auc_score(y2, y2_hat))

Accuracy score =  0.9230769230769231
ROC_AUC score = 0.9230769230769231
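The cell above thresholds a GP regression, whereas the paper describes a Gaussian process classifier. As a hedged illustration of that alternative, the sketch below swaps in scikit-learn's `GaussianProcessClassifier` (not GPy) with an RBF kernel, and computes the ROC AUC from predicted probabilities instead of hard labels. It runs on a synthetic stand-in, so its scores say nothing about the real data; the kernel settings and the names `X_demo`, `y_demo`, `gpc` and `gpc_scores` are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in for the real data: 65 rows x 9 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(65, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)  # made-up labelling rule

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, random_state=0, train_size=0.6, test_size=0.4, stratify=y_demo)

# GP classifier with a scaled RBF kernel; hyperparameters are tuned during fit.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0)
gpc.fit(Xtr, ytr)

proba = gpc.predict_proba(Xte)[:, 1]  # estimated probability of class 1
y_hat = (proba > 0.5).astype(int)     # hard predictions for the accuracy score

gpc_scores = (accuracy_score(yte, y_hat), roc_auc_score(yte, proba))
print(gpc_scores)
```

Passing probabilities to `roc_auc_score` ranks the test cases by predicted risk, which usually gives a more informative AUC than one computed from thresholded labels.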
