## Introduction:

The regression algorithms compared in this notebook are K-nearest neighbors (KNN) and a neural network (Keras).
The data are standardized before being passed to each regressor.


**Evaluation metric:** The evaluation metric is the `PRE` (proportion of reduction in error), also called the score, which is defined as:
$\frac{SSE_{baseline}-SSE_{regression}}{SSE_{baseline}}$

where $SSE_{baseline}$ is the sum of squared errors when predicting with the sample mean and $SSE_{regression}$ is the sum of squared errors of the regressor (i.e. NN or KNN). The PRE is the proportion by which the regressor reduces the error. It usually lies between 0 and 1, but it is negative when the model performs worse than the sample mean.
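The definition above can be sketched in a few lines of plain Python. The function name `pre_score` and the toy values are assumptions made for illustration. The baseline mean is taken over the same targets that are being scored, which is also how `r2_score` from scikit-learn (the value the notebook prints as the score) is defined.

```python
def pre_score(y_true, y_pred):
    """Proportion of reduction in error relative to predicting the sample mean."""
    mean_y = sum(y_true) / len(y_true)
    sse_baseline = sum((y - mean_y) ** 2 for y in y_true)
    sse_regression = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return (sse_baseline - sse_regression) / sse_baseline


y_true = [1.0, 2.0, 3.0, 4.0]
print(pre_score(y_true, [1.0, 2.0, 3.0, 4.0]))  # perfect predictions
print(pre_score(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean
print(pre_score(y_true, [4.0, 3.0, 2.0, 1.0]))  # worse than the mean
```

The three calls cover the cases described above: a perfect fit, a fit no better than the mean, and a fit worse than the mean (negative score).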

**Missing values:** Impute with the column mean (numeric columns) or the column mode (categorical columns).
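A minimal sketch of this imputation rule, written with the standard library only so that it runs without the notebook's dependencies. The helper `impute_column` and the column values are hypothetical; the notebook itself relies on scikit-learn and pandas for this step.

```python
from statistics import mean, mode


def impute_column(values, kind):
    """Fill None entries with the column mean (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "numeric" else mode(observed)
    return [fill if v is None else v for v in values]


horsepower = [100.0, None, 140.0]        # hypothetical numeric column
doors = ["four", "two", None, "four"]    # hypothetical categorical column
print(impute_column(horsepower, "numeric"))
print(impute_column(doors, "categorical"))
```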

## Findings:
### KNN: 
* KNN does not fit a model and has no "training" step (unlike a neural network). It computes the distances between data points and chooses the neighbors with the smallest distances. It works relatively well on small to medium datasets, in both computation time and reduction in error. The main drawback is that when the sample size is very large (such as the `song` dataset), the distance matrix cannot be computed in reasonable time with the available computational resources (e.g. memory). The algorithm also suffers from the "curse of dimensionality": on the datasets with relatively high dimension (crime, mercedes, etc.), KNN performs less well.
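The idea of predicting directly from distances, without a fitted model, can be sketched as a brute-force KNN regressor. This is an illustrative sketch under assumptions (Euclidean distance, uniform weights, a hypothetical helper `knn_predict` and toy data), not the scikit-learn implementation used in the notebook.

```python
import math


def knn_predict(X_train, y_train, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(X_train, y_train)
    ]
    distances.sort(key=lambda pair: pair[0])   # closest points first
    nearest = distances[:k]
    return sum(target for _, target in nearest) / k


X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
y_train = [1.0, 2.0, 3.0, 10.0]
print(knn_predict(X_train, y_train, [0.1, 0.1], k=3))
```

Every prediction scans all training rows, which is why the cost grows with both the number of training points and the number of queries.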

### Keras Neural network:

*  The main advantage of a neural network is that it can handle datasets with a large sample size and high dimension, and its performance relative to KNN appears to improve as the sample size and dimensionality increase. The main disadvantage is that the training process can take a very long time. In addition, because the weights of the neurons are initialized randomly, separate runs may end up in different local minima (which might explain the negative score in the last trial of the `song` dataset).
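As a toy illustration of how the starting point decides which local minimum gradient descent reaches, the sketch below minimizes a hand-picked one-dimensional function that has two minima. The function, learning rate and seeds are assumptions chosen for illustration; this is not the Keras model used in the notebook.

```python
import random


def loss(w):
    # Non-convex toy loss with minima near w = -1.30 and w = 1.13
    return w**4 - 3 * w**2 + w


def grad(w):
    return 4 * w**3 - 6 * w + 1


def descend(w0, lr=0.01, steps=2000):
    """Plain gradient descent starting from w0."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w


for seed in range(4):
    w0 = random.Random(seed).uniform(-2.0, 2.0)   # random initialization
    w = descend(w0)
    print(f"seed={seed} start={w0:+.3f} end={w:+.3f} loss={loss(w):+.3f}")
```

Starts to the left of the small bump near `w = 0.17` settle in the deeper minimum, while starts to the right settle in the shallower one, so two runs that differ only in their seed can finish with different losses.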



## 1.Graduate admission rate: 
## KNN 

In [5]:
#Import libraries
import time
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.metrics import r2_score

In [414]:
#use 10 neighbors
def Knn_estimator(x,y,n_neigh=10,weights='uniform',algo='auto',p=2):
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)

    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    
    neigh=KNeighborsRegressor(n_neighbors=n_neigh,weights=weights,algorithm=algo,p=p)
    neigh.fit(X_train,y_train)

    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    
    train_mse=mean_squared_error(list(y_train),neigh.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),neigh.predict(X_test))
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,neigh.predict(X_test))
    train_score=r2_score(y_train,neigh.predict(X_train))
    
    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",
          round(test_score,4),"Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [415]:
#import data
np.random.seed(0)
admission_rate=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/Admission_Predict.csv')
y=admission_rate['Chance of Admit ']
x=admission_rate.iloc[:,1:8]

t0=time.time()
overall=sum([Knn_estimator(x,y) for i in range(3)])/3
print("Overall score(PRE) over 3 trials:",overall)
t1=time.time()
print("Average time of running:",(t1-t0)/3,"sec.")

Baseline MSE: 0.0172 Testing MSE: 0.005 Test Score(PRE): 0.7115 Training MSE: 0.0038 Training Score: 0.8227
Baseline MSE: 0.0199 Testing MSE: 0.0051 Test Score(PRE): 0.7445 Training MSE: 0.004 Training Score: 0.8051
Baseline MSE: 0.0203 Testing MSE: 0.0049 Test Score(PRE): 0.759 Training MSE: 0.0042 Training Score: 0.7919
Overall score(PRE) over 3 trials: 0.7383400794062083
Average time of running: 0.6186666488647461 sec.
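The repeat-and-average pattern used in the cell above recurs for every dataset in this notebook. One way to keep the averaged score and the timing tied to the same run is a small helper such as the hypothetical `run_trials` below. It is a sketch under assumptions, with a stand-in estimator so that it runs on its own.

```python
import time


def run_trials(estimator, n_trials=3):
    """Call `estimator` n_trials times; return (mean score, mean seconds per trial)."""
    start = time.time()
    scores = [estimator() for _ in range(n_trials)]
    elapsed = time.time() - start
    return sum(scores) / n_trials, elapsed / n_trials


# Stand-in estimator; in the notebook this would be something like
# `lambda: Knn_estimator(x, y)` or `keras_model`.
fake_scores = iter([0.70, 0.75, 0.80])
mean_score, mean_seconds = run_trials(lambda: next(fake_scores))
print("Overall score(PRE) over 3 trials:", mean_score)
print("Average time of running:", mean_seconds, "sec.")
```

Returning both values from one call avoids printing an average left over from an earlier cell.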


## Neural Network(Keras)

In [32]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from keras.optimizers import Adam

Using TensorFlow backend.


In [416]:
#use one hidden layer of 7 units since it is a small dataset, 30 epochs.
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(7,input_dim=7,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1,kernel_initializer='normal'))
    model.compile(loss='mean_squared_error',optimizer='adam')
    history=model.fit(X_train,y_train,epochs=30,batch_size=20,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score



In [417]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall Score(PRE) over 3 trials:",overall)
t1=time.time()
print("Average time of running:",(t1-t0)/3,"sec.")

Baseline MSE: 0.0191 Testing MSE: 0.0076 Test Score(PRE): 0.6018 Training MSE: 0.008 Training Score: 0.6139
Baseline MSE: 0.0211 Testing MSE: 0.0097 Test Score(PRE): 0.5433 Training MSE: 0.0078 Training Score: 0.6057




Baseline MSE: 0.0206 Testing MSE: 0.0087 Test Score(PRE): 0.5768 Training MSE: 0.008 Training Score: 0.6051
Overall Score(PRE) over 3 trials: 0.5739685595143577
Average time of running: 55.556661446889244 sec.


**Summary for `Admission`:**
* Dataset characteristics: small-scale, low dimension, numeric attributes
* Result: KNN (74%) outperforms the neural network (57%) in both reduction in error and computation time.


## 2. Crime incidence in community
## KNN

In [418]:
crime=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/Violent_Crime_pred.csv')
#subseting the data
x=crime.iloc[:,2:-1]
y=crime['target']

#check the missing values in each column
#preliminary decision: remove the columns with 1675 (over 80%) missing values, to be discussed
x.loc[:,x.isna().sum()==0] #columns without missing values

x=x.dropna(axis='columns')

t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=10) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average time of running:",(t1-t0)/3,"sec.")

Baseline MSE: 348761.8031 Testing MSE: 125179.2976 Test Score(PRE): 0.6411 Training MSE: 137352.8568 Training Score: 0.648
Baseline MSE: 378456.671 Testing MSE: 164338.6886 Test Score(PRE): 0.5658 Training MSE: 129653.5018 Training Score: 0.656
Baseline MSE: 465043.2642 Testing MSE: 196202.7616 Test Score(PRE): 0.5781 Training MSE: 122521.9942 Training Score: 0.6386
Overall PRE over 3 trials: 0.5949796993508171
Average time of running: 2.5710953871409097 sec.


## Neural Network

In [419]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(101,input_dim=101,activation='relu',kernel_initializer='normal'))
    model.add(Dense(30,input_dim=101,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))
    
    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=100,epochs=100,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [420]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average time of running:",(t1-t0)/3,"sec.")

Baseline MSE: 365370.0568 Testing MSE: 186452.6919 Test Score(PRE): 0.4897 Training MSE: 139215.1094 Training Score: 0.6364
Baseline MSE: 413276.6482 Testing MSE: 177959.9205 Test Score(PRE): 0.5694 Training MSE: 122520.8335 Training Score: 0.662
Baseline MSE: 380727.5548 Testing MSE: 135213.0401 Test Score(PRE): 0.6449 Training MSE: 133096.8933 Training Score: 0.6465
Overall PRE over 3 trials: 0.5679790050541115
Average time of running: 29.540505806605022 sec.


**Summary for `Crime`:**
* Dataset characteristics: medium-scale, relatively high dimension with irrelevant features, numeric attributes

**Result:**
*  KNN(59%): faster run time
*  Neural network(57%): takes longer to run

## 3.Automobile
## KNN

In [421]:
#Cleaning data,process missing values
import numpy as np
from sklearn.preprocessing import Imputer
cars=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/automobile.csv')
cars=cars.replace('?',np.nan).iloc[:,3:].copy() #replace ? with NaN; copy so later assignments modify this frame
cars.loc[cars['X6'].isna(),'X6']='four' #mode is four, assign four to the missing value
impute=Imputer(strategy='mean') #use column mean to impute missing value
cars[['X19','X20','X22','X23']]=impute.fit_transform(cars[['X19','X20','X22','X23']])
pd.isna(cars).sum() #now no missing values
x=cars.iloc[:,0:-1]
y=cars.iloc[:,-1]

  


In [422]:
#get categorical column
cate_col=list(x.columns[x.dtypes=='object'])
x=pd.get_dummies(x,columns=cate_col) #create dummy variables for the categorical var

t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=1) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average time of running:",(t1-t0)/3,"sec.")

Baseline MSE: 63748647.5082 Testing MSE: 6843711.6393 Test Score(PRE): 0.8926 Training MSE: 175952.9571 Training Score: 0.9972
Baseline MSE: 58117276.738 Testing MSE: 6470144.8361 Test Score(PRE): 0.8887 Training MSE: 183827.9571 Training Score: 0.9972
Baseline MSE: 72659437.792 Testing MSE: 9885418.1967 Test Score(PRE): 0.8639 Training MSE: 122113.6714 Training Score: 0.9979
Overall PRE over 3 trials: 0.8817549570219541
Average time of running: 0.10473521550496419 sec.


## Neural Network


In [423]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(72,input_dim=72,activation='relu',kernel_initializer='normal'))
    model.add(Dense(30,input_dim=72,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.01),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=10,epochs=100,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [424]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 38007745.0035 Testing MSE: 3376383.6556 Test Score(PRE): 0.9112 Training MSE: 3678298.2118 Training Score: 0.9494
Baseline MSE: 61561988.3042 Testing MSE: 7040286.6944 Test Score(PRE): 0.8856 Training MSE: 3074646.8421 Training Score: 0.9515
Baseline MSE: 47916909.9629 Testing MSE: 8310035.1433 Test Score(PRE): 0.8266 Training MSE: 3354976.3282 Training Score: 0.9516
Overall PRE over 3 trials: 0.8744596756869102
Average running time: 22.733724117279053 sec.


**Summary for `Automobile`:**
* Dataset characteristics: small-scale, low dimension, numeric and categorical attributes

**Result:**
*  KNN(88%)
*  Neural network(87%)

## 4.Mercedes
## KNN

In [425]:
import pandas as pd
import numpy as np
pass_test=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/pass_testing.csv')
# pass_test.shape #4209 by 378
# pd.isna(pass_test).sum().sum() #no missing value

x=pass_test.iloc[:,2:]
y=pass_test['y']
cate_col=list(x.columns[x.dtypes=='object'])
x=pd.get_dummies(x,columns=cate_col) #change categorical variables to dummy var
x.shape

(4209, 563)

In [426]:
t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=20) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 163.5831 Testing MSE: 90.7399 Test Score(PRE): 0.4453 Training MSE: 81.8885 Training Score: 0.4866
Baseline MSE: 154.1669 Testing MSE: 84.1935 Test Score(PRE): 0.4539 Training MSE: 84.5218 Training Score: 0.4832
Baseline MSE: 178.0325 Testing MSE: 106.2395 Test Score(PRE): 0.4033 Training MSE: 74.9296 Training Score: 0.5112
Overall PRE over 3 trials: 0.4341454254883413
Average running time: 50.60247008005778 sec.


## Neural Network

In [427]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(563,input_dim=563,activation='relu',kernel_initializer='normal'))
    model.add(Dense(50,input_dim=100,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.0001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=60,epochs=40,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [428]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 179.161 Testing MSE: 96.7144 Test Score(PRE): 0.4602 Training MSE: 51.8892 Training Score: 0.6604
Baseline MSE: 149.5856 Testing MSE: 70.0858 Test Score(PRE): 0.5315 Training MSE: 62.0418 Training Score: 0.6249
Baseline MSE: 147.5254 Testing MSE: 64.8653 Test Score(PRE): 0.5603 Training MSE: 64.8466 Training Score: 0.6103
Overall PRE over 3 trials: 0.5173196335544749
Average running time: 88.41292119026184 sec.


**Summary for `Mercedes`:**
* Dataset characteristics: medium scale, high dimension, sparse attributes.

**Result:**
*  KNN(43%)
*  Neural network(52%)
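One common explanation for KNN weakening in high dimension is that distances concentrate: the nearest and the farthest point end up almost equally far away, so "nearest" carries little information. The sketch below measures this on uniformly random points. The helper `relative_contrast` and the random data are assumptions for illustration and do not use the Mercedes data; 563 is the number of columns after dummy encoding.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 20, 563):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast shrinks as the dimension grows, which is the effect referred to as the curse of dimensionality.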

## 5. Parkinson dataset
## KNN

In [429]:
parkinson=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/parkinsons_updrs.data')

In [4]:
parkinson.shape

In [430]:
y=parkinson['total_UPDRS']
x=parkinson.drop(['subject#','total_UPDRS','motor_UPDRS'],axis=1) #drop correlated variable and subjectID 

t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=4) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 112.7311 Testing MSE: 41.7886 Test Score(PRE): 0.6293 Training MSE: 23.7594 Training Score: 0.7938
Baseline MSE: 110.4615 Testing MSE: 37.3857 Test Score(PRE): 0.6615 Training MSE: 24.8456 Training Score: 0.7862
Baseline MSE: 109.6746 Testing MSE: 43.1917 Test Score(PRE): 0.6062 Training MSE: 24.2849 Training Score: 0.7916
Overall PRE over 3 trials: 0.6323467262809421
Average running time: 2.6851421197255454 sec.


## Neural Network

In [431]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(19,input_dim=19,activation='sigmoid',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.01),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=20,epochs=100,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [432]:
#code works but takes time to run
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 113.9803 Testing MSE: 58.9867 Test Score(PRE): 0.4825 Training MSE: 57.6232 Training Score: 0.4975
Baseline MSE: 115.2662 Testing MSE: 64.7751 Test Score(PRE): 0.438 Training MSE: 58.9313 Training Score: 0.4837
Baseline MSE: 118.7042 Testing MSE: 66.2672 Test Score(PRE): 0.4417 Training MSE: 62.3622 Training Score: 0.4462
Overall PRE over 3 trials: 0.4540892186819332
Average running time: 133.49051674207053 sec.


**Summary for `Parkinson`:**
* Dataset characteristics: medium scale, low dimension.

**Result:**
*  KNN(63%): faster run time
*  Neural network(45%)

## 6.Song prediction:

## KNN: Not appropriate for this dataset
*  With a sample size this large, KNN is not appropriate, since it calculates the distance between every point in the test data and every point in the training data. The naive KNN algorithm would generate a distance matrix of approximately 500k * 500k, which is computationally very expensive (it did not finish running in a day), and the other built-in KNN algorithms (`'brute'`, `'ball_tree'`, `'kd_tree'`) led to crashes.
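A back-of-the-envelope sketch of why a matrix of this size is a problem. It assumes the full matrix is held in memory at once as 8-byte floats, which is an upper bound and not necessarily what a given library does internally. The helper name `distance_matrix_gib` is hypothetical.

```python
def distance_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense n_rows x n_cols matrix of distances, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Roughly the size quoted above for the song dataset
print(round(distance_matrix_gib(500_000, 500_000), 1), "GiB")
# For comparison: the 4209-row Mercedes dataset
print(round(distance_matrix_gib(4_209, 4_209), 3), "GiB")
```

The memory grows with the product of the two row counts, so a dataset about a hundred times larger needs on the order of ten thousand times more memory.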



In [400]:
song=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/YearPredictionMSD.csv',header=None)

In [401]:
#test=song.sample(frac=0.01)
y=song.iloc[:,0]
x=song.iloc[:,1:]

## Neural Network

In [409]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(90,input_dim=90,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=100,epochs=15,validation_split=0.1,verbose=2)


    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score


In [410]:
t0=time.time()
keras_model()
t1=time.time()

Train on 324666 samples, validate on 36075 samples
Epoch 1/15
 - 52s - loss: 568490.2849 - mean_squared_error: 568490.2849 - val_loss: 23190.1720 - val_mean_squared_error: 23190.1720
Epoch 2/15
 - 22s - loss: 16192.3731 - mean_squared_error: 16192.3731 - val_loss: 12803.8490 - val_mean_squared_error: 12803.8490
Epoch 3/15
 - 22s - loss: 10970.7369 - mean_squared_error: 10970.7369 - val_loss: 8944.8871 - val_mean_squared_error: 8944.8871
Epoch 4/15
 - 22s - loss: 7182.0691 - mean_squared_error: 7182.0691 - val_loss: 5223.9855 - val_mean_squared_error: 5223.9855
Epoch 5/15
 - 21s - loss: 3660.1010 - mean_squared_error: 3660.1010 - val_loss: 2221.9598 - val_mean_squared_error: 2221.9598
Epoch 6/15
 - 20s - loss: 1286.7954 - mean_squared_error: 1286.7954 - val_loss: 588.6577 - val_mean_squared_error: 588.6577
Epoch 7/15
 - 22s - loss: 322.1551 - mean_squared_error: 322.1551 - val_loss: 160.9455 - val_mean_squared_error: 160.9455
Epoch 8/15
 - 25s - loss: 134.0348 - mean_squared_error: 134.

In [412]:
t2=time.time()
keras_model()
t3=time.time()

Train on 324666 samples, validate on 36075 samples
Epoch 1/15
 - 106s - loss: 528008.2532 - mean_squared_error: 528008.2532 - val_loss: 22935.4314 - val_mean_squared_error: 22935.4314
Epoch 2/15
 - 25s - loss: 16252.9355 - mean_squared_error: 16252.9355 - val_loss: 13076.4931 - val_mean_squared_error: 13076.4931
Epoch 3/15
 - 21s - loss: 11031.8466 - mean_squared_error: 11031.8466 - val_loss: 9293.5219 - val_mean_squared_error: 9293.5219
Epoch 4/15
 - 22s - loss: 7436.1244 - mean_squared_error: 7436.1244 - val_loss: 5885.7071 - val_mean_squared_error: 5885.7071
Epoch 5/15
 - 21s - loss: 4212.7726 - mean_squared_error: 4212.7726 - val_loss: 2880.0413 - val_mean_squared_error: 2880.0413
Epoch 6/15
 - 20s - loss: 1671.6122 - mean_squared_error: 1671.6122 - val_loss: 888.0241 - val_mean_squared_error: 888.0241
Epoch 7/15
 - 19s - loss: 395.8432 - mean_squared_error: 395.8432 - val_loss: 288.9082 - val_mean_squared_error: 288.9082
Epoch 8/15
 - 19s - loss: 130.1714 - mean_squared_error: 130

In [413]:
t4=time.time()
keras_model()
t5=time.time()


Train on 324666 samples, validate on 36075 samples
Epoch 1/15
 - 150s - loss: 522094.9351 - mean_squared_error: 522094.9351 - val_loss: 22686.8333 - val_mean_squared_error: 22686.8333
Epoch 2/15
 - 22s - loss: 15899.6893 - mean_squared_error: 15899.6893 - val_loss: 12872.1615 - val_mean_squared_error: 12872.1615
Epoch 3/15
 - 22s - loss: 10851.2425 - mean_squared_error: 10851.2425 - val_loss: 8986.1074 - val_mean_squared_error: 8986.1074
Epoch 4/15
 - 22s - loss: 7184.8285 - mean_squared_error: 7184.8285 - val_loss: 5485.5700 - val_mean_squared_error: 5485.5700
Epoch 5/15
 - 23s - loss: 3895.4770 - mean_squared_error: 3895.4770 - val_loss: 2463.4525 - val_mean_squared_error: 2463.4525
Epoch 6/15
 - 22s - loss: 1418.9598 - mean_squared_error: 1418.9598 - val_loss: 682.0889 - val_mean_squared_error: 682.0889
Epoch 7/15
 - 22s - loss: 352.7137 - mean_squared_error: 352.7137 - val_loss: 179.8624 - val_mean_squared_error: 179.8624
Epoch 8/15
 - 23s - loss: 134.7682 - mean_squared_error: 134

In [433]:
print("Averaged time used to run the code:",(t1-t0+t3-t2+t5-t4)/3,"sec.")

Averaged time used to run the code: 631.2542164325714 sec.


## 7.Solar Flares
## KNN

In [434]:
solar=pd.read_csv("C:/Users/zhenguo/Desktop/flare.data2",sep=" ")
x=solar.iloc[:,1:6]
y=solar.iloc[:,7]

In [435]:
t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=50) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 0.11 Testing MSE: 0.108 Test Score(PRE): 0.0181 Training MSE: 0.0789 Training Score: 0.0576
Baseline MSE: 0.1113 Testing MSE: 0.1079 Test Score(PRE): 0.0303 Training MSE: 0.0797 Training Score: 0.039
Baseline MSE: 0.0941 Testing MSE: 0.0898 Test Score(PRE): 0.0456 Training MSE: 0.0881 Training Score: 0.0273
Overall PRE over 3 trials: 0.031334725032545495
Average running time: 0.4180034001668294 sec.


## Neural Network
*  performs worse than sample mean

In [436]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(5,input_dim=5,activation='sigmoid',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=50,epochs=100,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score



In [437]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print("Average running time:",(t1-t0)/3,"sec.")

Baseline MSE: 0.0515 Testing MSE: 0.0537 Test Score(PRE): -0.0437 Training MSE: 0.1032 Training Score: 0.0519
Baseline MSE: 0.0878 Testing MSE: 0.0864 Test Score(PRE): 0.0162 Training MSE: 0.09 Training Score: 0.0345
Baseline MSE: 0.0273 Testing MSE: 0.0326 Test Score(PRE): -0.1909 Training MSE: 0.1087 Training Score: 0.0866
Overall PRE over 3 trials: -0.07277034424293709
Average running time: 25.82631508509318 sec.


## 8.Forest Fire

In [438]:
fire=pd.read_csv("C:/Users/zhenguo/Desktop/forestfires.csv")
y=fire['area']
x=fire.drop(['area'],axis=1)

cate_col=list(x.columns[x.dtypes=='object'])
x=pd.get_dummies(x,columns=cate_col) #change categorical variables to dummy var

## KNN
* performs worse than sample mean

In [440]:
import time
t0=time.time()
overall=sum([Knn_estimator(x,y,n_neigh=10) for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print('Average running time:',(t1-t0)/3,'sec.')

Baseline MSE: 8204.8828 Testing MSE: 8341.5585 Test Score(PRE): -0.0167 Training MSE: 2018.6228 Training Score: 0.0958
Baseline MSE: 1222.9926 Testing MSE: 1713.6306 Test Score(PRE): -0.4012 Training MSE: 4893.7961 Training Score: 0.0702
Baseline MSE: 8522.269 Testing MSE: 8548.0612 Test Score(PRE): -0.003 Training MSE: 1903.3954 Training Score: 0.0885
Overall PRE over 3 trials: -0.1403 (mean of the three test scores above)
Average running time: 0.08633104960123698 sec.


## Neural Network 
- performs worse than estimating with sample mean

In [398]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(29,input_dim=29,activation='sigmoid',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=50,epochs=50,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    train_mse=mean_squared_error(list(y_train),model.predict(X_train))
    pred_mse=mean_squared_error(list(y_test),pred)
    #r_square=(baseline-pred_mse)/baseline
    test_score=r2_score(y_test,model.predict(X_test))
    train_score=r2_score(y_train,model.predict(X_train))
    
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)

    print("Baseline MSE:",round(baseline,4),"Testing MSE:",round(pred_mse,4),"Test Score(PRE):",round(test_score,4),
          "Training MSE:",round(train_mse,4),"Training Score:",round(train_score,4))
    return test_score

In [399]:
t0=time.time()
overall=sum([keras_model() for i in range(3)])/3
print("Overall PRE over 3 trials:",overall)
t1=time.time()
print('Average running time:',(t1-t0)/3,'sec.')

Baseline MSE: 1374.1884 Testing MSE: 1430.0818 Test Score(PRE): -0.0407 Training MSE: 5267.608 Training Score: -0.0134
Baseline MSE: 3788.3997 Testing MSE: 3821.4437 Test Score(PRE): -0.0087 Training MSE: 4211.6425 Training Score: -0.0139
Baseline MSE: 1017.9727 Testing MSE: 1026.884 Test Score(PRE): -0.0088 Training MSE: 5382.2354 Training Score: -0.0061
Overall PRE over 3 trials: -0.01938338808991606
Average running time: 16.736035108566284 sec.
