## Introduction:

*  The regression algorithms compared in this notebook are K-nearest neighbors (KNN) and a neural network (Keras).
*  The original data is standardized (scaled to unit variance with `StandardScaler(with_mean=False)`, without centering) before being passed to the regressors.
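For comparison, the sketch below scales the features using statistics taken from the training rows only, which keeps test-set information out of the scaling step. It is a plain NumPy illustration and not the procedure used in this notebook, where the scaler is fitted on the full data before splitting; the helper name `scale_by_train_std` and the toy arrays are hypothetical.

```python
import numpy as np

def scale_by_train_std(X_train, X_test):
    """Divide each column by its training-set standard deviation (no centering)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    std = X_train.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0          # leave constant columns unchanged
    return X_train / std, X_test / std

# Toy data: the first column has std 2 in the training rows, the second is constant.
X_tr = np.array([[1.0, 10.0], [5.0, 10.0]])
X_te = np.array([[2.0, 5.0]])
X_tr_s, X_te_s = scale_by_train_std(X_tr, X_te)
print(X_tr_s)
print(X_te_s)
```

Whether this changes the reported PRE values noticeably has not been checked here; it is only meant to show where the split would go relative to the scaling.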


**Evaluation metric**: The evaluation metric is `PRE` (proportion of reduction in error), which is defined as:
$\frac{MSE_{baseline}-MSE_{regression}}{MSE_{baseline}}$

where $MSE_{baseline}$ is the mean squared error of predicting with the sample mean (here, the mean of the test-set targets) and $MSE_{regression}$ is the mean squared error of predicting with the regressor (i.e. NN or KNN). PRE is the fraction by which the regressor reduces the error relative to the baseline, and it mostly ranges from 0 to 1. Note that this value can be negative if the model performs worse than the sample mean.
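As a small worked example of the definition above, the sketch below computes PRE in plain Python for one prediction that beats the sample mean and one that does worse. The function name `pre` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
def pre(y_true, y_pred):
    """Proportion of reduction in error relative to predicting the sample mean."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    n = len(y_true)
    mean_y = sum(y_true) / n
    mse_baseline = sum((y - mean_y) ** 2 for y in y_true) / n
    mse_regression = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    return (mse_baseline - mse_regression) / mse_baseline

y_true = [1.0, 2.0, 3.0, 4.0]          # baseline MSE is 1.25
close_pred = [1.0, 2.0, 3.0, 5.0]      # MSE 0.25, better than the mean
reversed_pred = [4.0, 3.0, 2.0, 1.0]   # MSE 5.0, worse than the mean

pre_close = pre(y_true, close_pred)
pre_reversed = pre(y_true, reversed_pred)
print(pre_close)      # 0.8
print(pre_reversed)   # -3.0
```

The second case shows why PRE has no lower bound: a model that is worse than the mean produces a negative value.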

**Missing values**: Impute with the column mean (for numeric columns) or the column mode (for categorical columns).
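A minimal pandas sketch of this imputation rule is shown below. The helper name `impute_mean_mode` and the toy frame are hypothetical; the notebook itself imputes specific columns by hand in the Automobile section.

```python
import numpy as np
import pandas as pd

def impute_mean_mode(df):
    """Return a copy: numeric NaNs become the column mean, other NaNs the column mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores missing values; take the first mode if there are ties
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

toy = pd.DataFrame({
    "doors": ["four", "two", "four", None],   # categorical, mode is "four"
    "bore": [3.0, np.nan, 4.0, 5.0],          # numeric, mean of observed values is 4.0
})
filled = impute_mean_mode(toy)
print(filled)
```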



## Graduate admission rate: 
## KNN:75% reduction in error 

In [1]:
#Import libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [2]:
#KNN regression on standardized features; returns the PRE on a random 30% test split (10 neighbors by default)
def Knn_estimator(x,y,n_neigh=10,weights='uniform',algo='auto',p=2):
    #scale each feature to unit variance (with_mean=False: no centering)
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)

    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    
    neigh=KNeighborsRegressor(n_neighbors=n_neigh,weights=weights,algorithm=algo,p=p)
    neigh.fit(X_train,y_train)
    
    #baseline: predict the mean of the test targets for every test point
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    
    pred_mse=mean_squared_error(list(y_test),neigh.predict(X_test))
    r_square=(baseline-pred_mse)/baseline #this is the PRE defined in the introduction
    
    print("Baseline MSE:",round(baseline,4),"KNN MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square


In [2]:
#import data
admission_rate=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/Admission_Predict.csv')
y=admission_rate['Chance of Admit ']
x=admission_rate.iloc[:,1:8]

overall=sum([Knn_estimator(x,y) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 0.0196 KNN MSE: 0.0046 PRE: 0.7651
Baseline MSE: 0.0239 KNN MSE: 0.0057 PRE: 0.7638
Baseline MSE: 0.0213 KNN MSE: 0.0054 PRE: 0.7467
Baseline MSE: 0.025 KNN MSE: 0.0061 PRE: 0.7546
Baseline MSE: 0.0194 KNN MSE: 0.0044 PRE: 0.7717
Baseline MSE: 0.02 KNN MSE: 0.0064 PRE: 0.68
Baseline MSE: 0.0204 KNN MSE: 0.0042 PRE: 0.7942
Baseline MSE: 0.0191 KNN MSE: 0.0043 PRE: 0.7739
Baseline MSE: 0.0196 KNN MSE: 0.0059 PRE: 0.6991
Baseline MSE: 0.0164 KNN MSE: 0.0041 PRE: 0.75
Overall PRE over 10 trials: 0.749932411216001
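
The ten PRE values printed above vary from split to split, so a measure of spread alongside the mean may be informative. The sketch below summarises the rounded values copied from that output; the variable name `pre_trials` is introduced here for illustration, and reporting a standard deviation is a suggestion rather than part of the original analysis.

```python
from statistics import mean, stdev

# Rounded PRE values copied from the ten KNN trials printed above.
pre_trials = [0.7651, 0.7638, 0.7467, 0.7546, 0.7717,
              0.68, 0.7942, 0.7739, 0.6991, 0.75]

pre_mean = mean(pre_trials)
pre_sd = stdev(pre_trials)          # sample standard deviation across trials
print("mean PRE:", round(pre_mean, 4))
print("std of PRE:", round(pre_sd, 4))
print("range:", min(pre_trials), "to", max(pre_trials))
```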


## Regression with Neural Network(Keras): 54%

In [68]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from keras.optimizers import Adam

Using TensorFlow backend.


In [133]:
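#use a single small hidden layer since it is a small dataset, 30 epochs.
def keras_model():
    #standardize inside the function, since x_std is not defined at notebook level
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)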
    model=Sequential()
    model.add(Dense(7,input_dim=7,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1,kernel_initializer='normal'))
    model.compile(loss='mean_squared_error',optimizer='adam')
    history=model.fit(X_train,y_train,epochs=30,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square



In [134]:
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 0.0214 Keras MSE: 0.0105 PRE: 0.5086
Baseline MSE: 0.02 Keras MSE: 0.009 PRE: 0.5482
Baseline MSE: 0.017 Keras MSE: 0.0072 PRE: 0.5765
Baseline MSE: 0.0193 Keras MSE: 0.0083 PRE: 0.5709
Baseline MSE: 0.0186 Keras MSE: 0.0081 PRE: 0.5677
Baseline MSE: 0.0182 Keras MSE: 0.0082 PRE: 0.5523
Baseline MSE: 0.0218 Keras MSE: 0.0099 PRE: 0.5477
Baseline MSE: 0.0196 Keras MSE: 0.01 PRE: 0.4901
Baseline MSE: 0.0221 Keras MSE: 0.011 PRE: 0.4997
Baseline MSE: 0.0192 Keras MSE: 0.0079 PRE: 0.5918
Overall PRE over 10 trials: 0.5453546316152135


**Summary for `Admission`:**
* Dataset characteristics: small-scale, low dimension, numeric attributes
* Result: KNN (75%) outperforms the neural network (54%) in both reduction in error and computational time.


## Crime incidence in community
### KNN:58% reduction in error

In [106]:
crime=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/Violent_Crime_pred.csv')
#subseting the data
x=crime.iloc[:,2:-1]
y=crime['target']

#check the missing values in each column
#preliminary decision: remove the columns with 1675(over 80%) missing values,to be discussed
x.loc[:,x.isna().sum()==0] #preview the columns that have no missing values

x=x.dropna(axis='columns')


overall=sum([Knn_estimator(x,y,n_neigh=10) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 405784.2056 KNN MSE: 188459.6628 PRE: 0.5356
Baseline MSE: 354282.6461 KNN MSE: 150688.9022 PRE: 0.5747
Baseline MSE: 386027.6508 KNN MSE: 158331.6898 PRE: 0.5898
Baseline MSE: 384972.0335 KNN MSE: 153695.831 PRE: 0.6008
Baseline MSE: 454711.4389 KNN MSE: 205837.6454 PRE: 0.5473
Baseline MSE: 376035.4971 KNN MSE: 156647.2317 PRE: 0.5834
Baseline MSE: 390505.3232 KNN MSE: 152794.9659 PRE: 0.6087
Baseline MSE: 340393.6825 KNN MSE: 152998.8462 PRE: 0.5505
Baseline MSE: 305672.3447 KNN MSE: 128184.1639 PRE: 0.5806
Baseline MSE: 470380.1817 KNN MSE: 208468.2079 PRE: 0.5568
Overall PRE over 10 trials: 0.572828950213284


### Neural Network: 60% reduction in error

In [141]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)
    model=Sequential()
    model.add(Dense(101,input_dim=101,activation='relu',kernel_initializer='normal'))
    model.add(Dense(30,input_dim=101,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))
    
    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=50,epochs=100,validation_split=0.1,verbose=0)
    pred=model.predict(X_test)

    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square



In [143]:
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 399407.9197 Keras MSE: 174416.6583 PRE: 0.5633
Baseline MSE: 392564.8185 Keras MSE: 159461.854 PRE: 0.5938
Baseline MSE: 374590.9379 Keras MSE: 149265.7801 PRE: 0.6015
Baseline MSE: 411878.5184 Keras MSE: 168782.0356 PRE: 0.5902
Baseline MSE: 333814.1456 Keras MSE: 126122.4371 PRE: 0.6222
Baseline MSE: 373999.8708 Keras MSE: 134377.3122 PRE: 0.6407
Baseline MSE: 400469.2474 Keras MSE: 139923.6825 PRE: 0.6506
Baseline MSE: 355732.0766 Keras MSE: 137451.6639 PRE: 0.6136
Baseline MSE: 315501.8113 Keras MSE: 129092.1408 PRE: 0.5908
Baseline MSE: 328360.6527 Keras MSE: 144484.4955 PRE: 0.56
Overall PRE over 10 trials: 0.6026751535581809


**Summary for `Crime`:**
* Dataset characteristics: medium-scale, relatively high dimension with irrelevant features, numeric attributes

**Result:**
*  KNN(58%): faster run time
*  Neural network(60%): takes longer to run

## Automobile: categorical and numerical variables
## KNN: 80% reduction in error

In [109]:
#Clean the data and process missing values
import numpy as np
from sklearn.preprocessing import Imputer
cars=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/automobile.csv')
cars=cars.replace('?',np.nan).iloc[:,3:].copy() #replace ? with NaN; copy so later assignments modify this frame
cars.loc[cars['X6'].isna(),'X6']='four' #mode is four, assign four to the missing value
impute=Imputer(strategy='mean') #use column mean to impute missing value
cars[['X19','X20','X22','X23']]=impute.fit_transform(cars[['X19','X20','X22','X23']])
pd.isna(cars).sum() #now no missing values
x=cars.iloc[:,0:-1]
y=cars.iloc[:,-1]


In [112]:
#get categorical column
cate_col=list(x.columns[x.dtypes=='object'])
x=pd.get_dummies(x,columns=cate_col) #create dummy variables for the categorical var

overall=sum([Knn_estimator(x,y,n_neigh=1) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 69981239.011 KNN MSE: 11444136.4426 PRE: 0.8365
Baseline MSE: 81672005.7834 KNN MSE: 14215479.2951 PRE: 0.8259
Baseline MSE: 51954682.5477 KNN MSE: 11094325.9344 PRE: 0.7865
Baseline MSE: 47282803.7974 KNN MSE: 6081208.7049 PRE: 0.8714
Baseline MSE: 51406837.663 KNN MSE: 9430858.1639 PRE: 0.8165
Baseline MSE: 75303550.7025 KNN MSE: 5121470.0 PRE: 0.932
Baseline MSE: 63408579.3496 KNN MSE: 38669133.8525 PRE: 0.3902
Baseline MSE: 71592368.7073 KNN MSE: 10376146.8525 PRE: 0.8551
Baseline MSE: 62816203.9307 KNN MSE: 11952412.541 PRE: 0.8097
Baseline MSE: 78273393.2959 KNN MSE: 9037491.3934 PRE: 0.8845
Overall PRE over 10 trials: 0.8008283461053187


## Neural Network
## Reduction in error:85%

In [121]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(72,input_dim=72,activation='relu',kernel_initializer='normal'))
    model.add(Dense(30,input_dim=72,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.01),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=10,epochs=100,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square

In [122]:
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 51809612.8729 Keras MSE: 12573943.5237 PRE: 0.7573
Baseline MSE: 54178137.5243 Keras MSE: 8323390.5802 PRE: 0.8464
Baseline MSE: 59519766.9782 Keras MSE: 6269259.7456 PRE: 0.8947
Baseline MSE: 55601864.5402 Keras MSE: 4921577.871 PRE: 0.9115
Baseline MSE: 87577031.1508 Keras MSE: 11983410.8109 PRE: 0.8632
Baseline MSE: 42201287.0314 Keras MSE: 6483283.4757 PRE: 0.8464
Baseline MSE: 78607754.738 Keras MSE: 20062081.2037 PRE: 0.7448
Baseline MSE: 35341461.3255 Keras MSE: 4332851.3296 PRE: 0.8774
Baseline MSE: 63590487.4394 Keras MSE: 7154329.5737 PRE: 0.8875
Baseline MSE: 65787990.5751 Keras MSE: 4981221.9195 PRE: 0.9243
Overall PRE over 10 trials: 0.8553329174253268


**Summary for `Automobile`:**
* Dataset characteristics: small-scale, low dimension, numeric and categorical attributes

**Result:**
*  KNN(80%)
*  Neural network(85%)

## Mercedes: sparse data with binary variables and categorical variables
## KNN:42%

In [72]:
import pandas as pd
import numpy as np
pass_test=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/pass_testing.csv')
# pass_test.shape #4209 by 378
# pd.isna(pass_test).sum().sum() #no missing value

x=pass_test.iloc[:,2:]
y=pass_test['y']
cate_col=list(x.columns[x.dtypes=='object'])
x=pd.get_dummies(x,columns=cate_col) #change categorical variables to dummy var
x.shape

(4209, 563)

In [55]:
overall=sum([Knn_estimator(x,y,n_neigh=20) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 166.8287 KNN MSE: 99.6932 PRE: 0.4024
Baseline MSE: 155.9393 KNN MSE: 86.065 PRE: 0.4481
Baseline MSE: 172.868 KNN MSE: 110.6822 PRE: 0.3597
Baseline MSE: 178.119 KNN MSE: 108.3273 PRE: 0.3918
Baseline MSE: 166.2521 KNN MSE: 92.9012 PRE: 0.4412
Baseline MSE: 149.669 KNN MSE: 83.1095 PRE: 0.4447
Baseline MSE: 167.5927 KNN MSE: 93.5079 PRE: 0.4421
Baseline MSE: 153.6807 KNN MSE: 87.8013 PRE: 0.4287
Baseline MSE: 163.8447 KNN MSE: 89.3852 PRE: 0.4545
Baseline MSE: 151.655 KNN MSE: 80.0287 PRE: 0.4723
Overall PRE over 10 trials: 0.4285458044504349


## Neural Network:54%

In [69]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(563,input_dim=563,activation='relu',kernel_initializer='normal'))
    model.add(Dense(50,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.0001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=60,epochs=40,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square


In [70]:
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 154.4703 Keras MSE: 74.9101 PRE: 0.5151
Baseline MSE: 157.3397 Keras MSE: 69.7739 PRE: 0.5565
Baseline MSE: 162.5573 Keras MSE: 74.1338 PRE: 0.544
Baseline MSE: 155.1553 Keras MSE: 73.516 PRE: 0.5262
Baseline MSE: 170.2172 Keras MSE: 88.6884 PRE: 0.479
Baseline MSE: 159.3436 Keras MSE: 69.6261 PRE: 0.563
Baseline MSE: 169.3383 Keras MSE: 83.6642 PRE: 0.5059
Baseline MSE: 156.3171 Keras MSE: 70.6824 PRE: 0.5478
Baseline MSE: 147.8653 Keras MSE: 65.0646 PRE: 0.56
Baseline MSE: 152.6489 Keras MSE: 66.336 PRE: 0.5654
Overall PRE over 10 trials: 0.536290524387286


**Summary for `Mercedes`:**
* Dataset characteristics: medium scale, high dimension, sparse attributes.

**Result:**
*  KNN(42%)
*  Neural network(54%)

## Text data(extra dataset)

In [3]:
price=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/price_predict_reduced.csv',encoding='ISO-8859-1')
price=price.dropna()
import re
import nltk
#nltk.download()
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer 
stemmer =FrenchStemmer() 
analyzer= CountVectorizer().build_analyzer() 
def stemmed_words(doc): 
    return (stemmer.stem(w) for w in analyzer(doc))




In [4]:
text=price['text']
text[10]
def clean_text(review):
    letter_only=re.sub("[^a-zA-Z]"," ",review)
    letter_only=letter_only.lower()
    words=letter_only.split()
    sw=stopwords.words('english')
    words=[w for w in words if w not in sw]
    clean=" ".join(words)
    return clean

cleaned=[clean_text(i) for i in text]

vecterizor=TfidfVectorizer(token_pattern='[a-z]{3,15}')
matrix_TF=vecterizor.fit_transform(cleaned)

In [18]:
x=matrix_TF
y=price['price']
X_new = SelectKBest(chi2, k=1000).fit_transform(x, y)
X_new.shape


(9998, 1000)

In [66]:
overall=sum([Knn_estimator(X_new,y,n_neigh=9) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 992.7657 KNN MSE: 906.465 PRE: 0.0869
Baseline MSE: 1423.8537 KNN MSE: 1320.8562 PRE: 0.0723
Baseline MSE: 1869.8584 KNN MSE: 1766.2627 PRE: 0.0554
Baseline MSE: 1129.1042 KNN MSE: 1029.4719 PRE: 0.0882
Baseline MSE: 2040.7538 KNN MSE: 1907.9898 PRE: 0.0651
Baseline MSE: 1641.2034 KNN MSE: 1538.19 PRE: 0.0628
Baseline MSE: 2327.3217 KNN MSE: 2225.4716 PRE: 0.0438
Baseline MSE: 1819.8309 KNN MSE: 1647.5951 PRE: 0.0946
Baseline MSE: 1364.0498 KNN MSE: 1281.5501 PRE: 0.0605
Baseline MSE: 1579.2404 KNN MSE: 1505.451 PRE: 0.0467
Overall PRE over 10 trials: 0.06763458034836624


In [130]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(X_new)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(1000,input_dim=1000,activation='relu',kernel_initializer='normal'))
    model.add(Dense(300,input_dim=1000,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=200,epochs=50,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square

In [132]:
#code works, take time to run
#overall=sum([keras_model() for i in range(10)])/10
#print("Overall PRE over 10 trials:",overall)

## Parkinson dataset
## KNN: 63%

In [233]:
parkinson=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/parkinsons_updrs.data')

In [234]:
y=parkinson['total_UPDRS']
x=parkinson.drop(['subject#','total_UPDRS','motor_UPDRS'],axis=1) #drop correlated variable and subjectID 

overall=sum([Knn_estimator(x,y,n_neigh=4) for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 116.8036 KNN MSE: 41.6373 PRE: 0.6435
Baseline MSE: 116.7677 KNN MSE: 40.7814 PRE: 0.6507
Baseline MSE: 113.2802 KNN MSE: 42.775 PRE: 0.6224
Baseline MSE: 112.6663 KNN MSE: 41.4371 PRE: 0.6322
Baseline MSE: 115.572 KNN MSE: 43.4973 PRE: 0.6236
Baseline MSE: 117.2201 KNN MSE: 41.3375 PRE: 0.6474
Baseline MSE: 112.6452 KNN MSE: 43.6306 PRE: 0.6127
Baseline MSE: 115.0374 KNN MSE: 44.4435 PRE: 0.6137
Baseline MSE: 119.3836 KNN MSE: 43.9502 PRE: 0.6319
Baseline MSE: 112.2652 KNN MSE: 42.4233 PRE: 0.6221
Overall PRE over 10 trials: 0.6300176987782858


In [235]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(19,input_dim=19,activation='sigmoid',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.01),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=20,epochs=100,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square

In [236]:
#code works, but takes time to run; PRE is about 42%
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 120.0527 Keras MSE: 66.045 PRE: 0.4499
Baseline MSE: 114.5761 Keras MSE: 65.3685 PRE: 0.4295
Baseline MSE: 115.9442 Keras MSE: 64.333 PRE: 0.4451
Baseline MSE: 108.0987 Keras MSE: 60.6223 PRE: 0.4392
Baseline MSE: 113.7759 Keras MSE: 62.7369 PRE: 0.4486
Baseline MSE: 113.4425 Keras MSE: 64.5762 PRE: 0.4308
Baseline MSE: 113.5674 Keras MSE: 73.3118 PRE: 0.3545
Baseline MSE: 111.404 Keras MSE: 67.4098 PRE: 0.3949
Baseline MSE: 115.5393 Keras MSE: 66.7562 PRE: 0.4222
Baseline MSE: 117.3396 Keras MSE: 63.0667 PRE: 0.4625
Overall PRE over 10 trials: 0.4277145193457866


**Summary for `Parkinson`:**
* Dataset characteristics: medium scale, low dimension.

**Result:**
*  KNN(63%) faster run time
*  Neural network(42%)

## Song prediction:

*  With a sample this large, KNN does not look appropriate, since it computes the distance between every test point and every training point. Naive KNN would generate a distance matrix of approximately 500k * 500k, which is computationally very expensive and takes too long to run, and the other built-in KNN algorithms ('brute', 'ball_tree', 'kd_tree') lead to errors.

*  What I tried below is to randomly sample 5% of the original dataset (with replacement), run KNN on that subset, repeat this process 10 times, and compute the average reduction in error.
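
To give a rough sense of scale, the arithmetic below estimates how much memory a dense distance matrix would need if it were materialised all at once, first for the approximate 500k * 500k case mentioned above and then for the test-versus-train matrix of a 5% subsample. It assumes 8-byte floats and the 70/30 split used by `Knn_estimator`; the helper name `dense_matrix_gb` and the rounded row count are assumptions made for illustration.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense n_rows x n_cols matrix of 8-byte floats."""
    return n_rows * n_cols * bytes_per_value / 1e9

n_total = 500_000                  # approximate row count quoted in the text
n_sub = n_total // 20              # a 5% subsample
n_test_sub = n_sub * 3 // 10       # 30% of the subsample held out for testing
n_train_sub = n_sub - n_test_sub   # remaining 70% used for training

full_gb = dense_matrix_gb(n_total, n_total)
sub_gb = dense_matrix_gb(n_test_sub, n_train_sub)
print("500k x 500k matrix:", full_gb, "GB")
print("5% subsample, test x train:", sub_gb, "GB")
```

Under these assumptions the full matrix is on the order of terabytes, while the subsampled one is roughly a gigabyte, which is consistent with the subsampling approach being feasible on a single machine.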

In [173]:
song=pd.read_csv('C:/Users/zhenguo/Desktop/STA141C/YearPredictionMSD.csv',header=None)

In [206]:
y=song.iloc[:,0]
x=song.iloc[:,1:]

In [230]:
def one_sample():
    rand=song.sample(frac=0.05,replace=True)
    y=rand.iloc[:,0]
    x=rand.iloc[:,1:]
    return Knn_estimator(x,y,algo='ball_tree') #return the PRE so that the trials can be averaged
    

In [231]:
overall=sum([one_sample() for i in range(10)])/10
print("Average PRE over 10 random sampling:",overall)
#mean of the 10 PRE values printed below: 0.16632

Baseline MSE: 119.9981 KNN MSE: 99.0966 PRE: 0.1742
Baseline MSE: 120.8378 KNN MSE: 103.838 PRE: 0.1407
Baseline MSE: 116.4953 KNN MSE: 98.3594 PRE: 0.1557
Baseline MSE: 125.5185 KNN MSE: 105.2531 PRE: 0.1615
Baseline MSE: 122.2611 KNN MSE: 102.0804 PRE: 0.1651
Baseline MSE: 118.0886 KNN MSE: 97.1058 PRE: 0.1777
Baseline MSE: 120.4048 KNN MSE: 100.2495 PRE: 0.1674
Baseline MSE: 126.6178 KNN MSE: 104.6845 PRE: 0.1732
Baseline MSE: 121.5631 KNN MSE: 97.55 PRE: 0.1975
Baseline MSE: 119.7318 KNN MSE: 101.7491 PRE: 0.1502

In [187]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(90,input_dim=90,activation='relu',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=100,epochs=200,validation_split=0.1,verbose=2)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square



In [191]:
import time
t0=time.time()
keras_model()
t1=time.time()
print("Time used to run the code once:",t1-t0,"sec.")

Train on 324666 samples, validate on 36075 samples
Epoch 1/200
 - 81s - loss: 509861.8214 - mean_squared_error: 509861.8214 - val_loss: 22750.9029 - val_mean_squared_error: 22750.9029
Epoch 2/200
 - 20s - loss: 16102.1918 - mean_squared_error: 16102.1918 - val_loss: 12957.1316 - val_mean_squared_error: 12957.1316
Epoch 3/200
 - 19s - loss: 11045.2747 - mean_squared_error: 11045.2747 - val_loss: 9044.3250 - val_mean_squared_error: 9044.3250
Epoch 4/200
 - 19s - loss: 7268.4593 - mean_squared_error: 7268.4593 - val_loss: 5498.2535 - val_mean_squared_error: 5498.2535
Epoch 5/200
 - 19s - loss: 4069.3904 - mean_squared_error: 4069.3904 - val_loss: 2701.7978 - val_mean_squared_error: 2701.7978
Epoch 6/200
 - 20s - loss: 1722.1742 - mean_squared_error: 1722.1742 - val_loss: 869.2433 - val_mean_squared_error: 869.2433
Epoch 7/200
 - 19s - loss: 495.3708 - mean_squared_error: 495.3708 - val_loss: 208.1331 - val_mean_squared_error: 208.1331
Epoch 8/200
 - 19s - loss: 151.3083 - mean_squared_err

Epoch 67/200
 - 18s - loss: 101.0300 - mean_squared_error: 101.0300 - val_loss: 98.4014 - val_mean_squared_error: 98.4014
Epoch 68/200
 - 18s - loss: 101.9435 - mean_squared_error: 101.9435 - val_loss: 95.2120 - val_mean_squared_error: 95.2120
Epoch 69/200
 - 19s - loss: 101.4305 - mean_squared_error: 101.4305 - val_loss: 96.9483 - val_mean_squared_error: 96.9483
Epoch 70/200
 - 18s - loss: 101.4315 - mean_squared_error: 101.4315 - val_loss: 96.4636 - val_mean_squared_error: 96.4636
Epoch 71/200
 - 18s - loss: 101.3227 - mean_squared_error: 101.3227 - val_loss: 96.6604 - val_mean_squared_error: 96.6604
Epoch 72/200
 - 18s - loss: 101.1861 - mean_squared_error: 101.1861 - val_loss: 98.1527 - val_mean_squared_error: 98.1527
Epoch 73/200
 - 19s - loss: 102.0549 - mean_squared_error: 102.0549 - val_loss: 95.9026 - val_mean_squared_error: 95.9026
Epoch 74/200
 - 18s - loss: 101.7005 - mean_squared_error: 101.7005 - val_loss: 94.4309 - val_mean_squared_error: 94.4309
Epoch 75/200
 - 18s - lo

Epoch 134/200
 - 19s - loss: 99.7846 - mean_squared_error: 99.7846 - val_loss: 104.1752 - val_mean_squared_error: 104.1752
Epoch 135/200
 - 18s - loss: 99.6131 - mean_squared_error: 99.6131 - val_loss: 97.0436 - val_mean_squared_error: 97.0436
Epoch 136/200
 - 19s - loss: 99.9967 - mean_squared_error: 99.9967 - val_loss: 96.1783 - val_mean_squared_error: 96.1783
Epoch 137/200
 - 19s - loss: 99.9835 - mean_squared_error: 99.9835 - val_loss: 127.0987 - val_mean_squared_error: 127.0987
Epoch 138/200
 - 19s - loss: 100.0722 - mean_squared_error: 100.0722 - val_loss: 96.2400 - val_mean_squared_error: 96.2400
Epoch 139/200
 - 19s - loss: 99.7273 - mean_squared_error: 99.7273 - val_loss: 102.8657 - val_mean_squared_error: 102.8657
Epoch 140/200
 - 19s - loss: 99.1650 - mean_squared_error: 99.1650 - val_loss: 106.8927 - val_mean_squared_error: 106.8927
Epoch 141/200
 - 19s - loss: 100.3583 - mean_squared_error: 100.3583 - val_loss: 94.6413 - val_mean_squared_error: 94.6413
Epoch 142/200
 - 18s

Time used to run the code once: 3878.1280975341797 sec.


## Solar Flares
## KNN: ~1% reduction in error

In [51]:
solar=pd.read_csv("C:/Users/zhenguo/Desktop/flare.data2",sep=" ")
x=solar.iloc[:,1:6]
y=solar.iloc[:,7]

Baseline MSE: 0.0611 KNN MSE: 0.0521 PRE: 0.1468
Baseline MSE: 0.1218 KNN MSE: 0.1175 PRE: 0.0356
Baseline MSE: 0.1613 KNN MSE: 0.1615 PRE: -0.0011
Baseline MSE: 0.2095 KNN MSE: 0.199 PRE: 0.0502
Baseline MSE: 0.1253 KNN MSE: 0.135 PRE: -0.0774
Baseline MSE: 0.0577 KNN MSE: 0.0655 PRE: -0.1347
Baseline MSE: 0.0548 KNN MSE: 0.0605 PRE: -0.1031
Baseline MSE: 0.1406 KNN MSE: 0.1311 PRE: 0.0673
Baseline MSE: 0.0461 KNN MSE: 0.1029 PRE: -1.2323
Baseline MSE: 0.0975 KNN MSE: 0.0972 PRE: 0.0026
Overall PRE over 10 trials: -0.12462172487951421


In [139]:
overall=sum([Knn_estimator(x,y,n_neigh=50) for i in range(50)])/50
print("Overall PRE over 50 trials:",overall)

Baseline MSE: 0.0572 KNN MSE: 0.0547 PRE: 0.0432
Baseline MSE: 0.1343 KNN MSE: 0.1371 PRE: -0.0209
Baseline MSE: 0.1253 KNN MSE: 0.1209 PRE: 0.0353
Baseline MSE: 0.114 KNN MSE: 0.1096 PRE: 0.0386
Baseline MSE: 0.1586 KNN MSE: 0.1565 PRE: 0.0132
Baseline MSE: 0.1246 KNN MSE: 0.1233 PRE: 0.0105
Baseline MSE: 0.0394 KNN MSE: 0.0384 PRE: 0.0257
Baseline MSE: 0.0543 KNN MSE: 0.0514 PRE: 0.0548
Baseline MSE: 0.0273 KNN MSE: 0.0286 PRE: -0.0449
Baseline MSE: 0.0822 KNN MSE: 0.0856 PRE: -0.042
Baseline MSE: 0.0611 KNN MSE: 0.0627 PRE: -0.0259
Baseline MSE: 0.0702 KNN MSE: 0.0687 PRE: 0.0215
Baseline MSE: 0.1503 KNN MSE: 0.1549 PRE: -0.0304
Baseline MSE: 0.0878 KNN MSE: 0.0842 PRE: 0.041
Baseline MSE: 0.0884 KNN MSE: 0.0921 PRE: -0.041
Baseline MSE: 0.1086 KNN MSE: 0.1053 PRE: 0.0301
Baseline MSE: 0.1453 KNN MSE: 0.1395 PRE: 0.0396
Baseline MSE: 0.0765 KNN MSE: 0.0793 PRE: -0.0369
Baseline MSE: 0.0184 KNN MSE: 0.0162 PRE: 0.1175
Baseline MSE: 0.0273 KNN MSE: 0.0273 PRE: 0.0008
Baseline MSE: 0.0

In [145]:
def keras_model():
    scaler = StandardScaler(with_mean=False)
    x_std=scaler.fit_transform(x)
    X_train, X_test, y_train, y_test = train_test_split(x_std, y, test_size=0.3)

    model=Sequential()
    model.add(Dense(5,input_dim=5,activation='sigmoid',kernel_initializer='normal'))
    model.add(Dense(1))

    model.compile(Adam(lr=0.001),loss='mean_squared_error',metrics=['mse'])
    history=model.fit(X_train,y_train,batch_size=50,epochs=100,validation_split=0.1,verbose=0)

    pred=model.predict(X_test)
    pred_with_mean=[sum(y_test)/len(y_test)]*len(y_test)
    baseline=mean_squared_error(y_test,pred_with_mean)
    pred_mse=mean_squared_error(list(y_test),pred)    
    r_square=(baseline-pred_mse)/baseline
    print("Baseline MSE:",round(baseline,4),"Keras MSE:",round(pred_mse,4),"PRE:",round(r_square,4))
    return r_square



## Keras ~1.7%

In [146]:
overall=sum([keras_model() for i in range(10)])/10
print("Overall PRE over 10 trials:",overall)

Baseline MSE: 0.1003 Keras MSE: 0.0982 PRE: 0.0205
Baseline MSE: 0.1836 Keras MSE: 0.1747 PRE: 0.0484
Baseline MSE: 0.0306 Keras MSE: 0.0337 PRE: -0.0992
Baseline MSE: 0.0481 Keras MSE: 0.0424 PRE: 0.119
Baseline MSE: 0.1378 Keras MSE: 0.1441 PRE: -0.0457
Baseline MSE: 0.1128 Keras MSE: 0.1055 PRE: 0.0644
Baseline MSE: 0.0519 Keras MSE: 0.0491 PRE: 0.0546
Baseline MSE: 0.1621 Keras MSE: 0.1531 PRE: 0.0553
Baseline MSE: 0.0244 Keras MSE: 0.026 PRE: -0.0683
Baseline MSE: 0.1031 Keras MSE: 0.1006 PRE: 0.0238
Overall PRE over 10 trials: 0.017286863728540023


In [150]:
bc=pd.read_csv("C:/Users/zhenguo/Desktop/breast-cancer-wisconsin.data",header=None)
bc.shape

(699, 11)