# 1. Featuretools

### Featuretools synthesizes new features from the relationships between tables.

##### Featuretools is typically used when we have several datasets (e.g. several .csv files or several pandas DataFrames) that are related to one another, much like the tables of a relational database.

##### Each DataFrame is loaded into an EntitySet as its own table, and we define an index for each one (similar to a primary key in a database); foreign keys then link the tables together.

##### Once the relationships between the datasets are defined, Featuretools can run Deep Feature Synthesis (DFS) across them.
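To make the idea concrete, here is a minimal sketch of the kind of feature that relies on a relationship between two tables. It uses plain pandas rather than the Featuretools API, and the `customers` and `orders` tables and their column names are hypothetical, invented only for illustration. Deep Feature Synthesis automates aggregations of this kind across many columns and tables.

```python
import pandas as pd

# Hypothetical parent table: one row per customer, keyed by customer_id.
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})

# Hypothetical child table: many rows per customer, linked by customer_id.
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [20.0, 30.0, 5.0],
})

# Aggregate the child rows for each parent key.
order_stats = (
    orders.groupby("customer_id")["amount"]
    .agg(["sum", "mean", "count"])
    .add_prefix("orders_amount_")
    .reset_index()
)

# Attach the aggregates to the parent table as new features.
customer_features = customers.merge(order_stats, on="customer_id", how="left")
print(customer_features)
```

With a single table there is no child table to aggregate over, which is why this style of feature synthesis needs at least two related datasets.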

`Here we have only one dataset, so Featuretools is not applicable to this assignment.`

# 2. TPOT (Tree-based Pipeline Optimization Tool)

### Installation:
1. `pip install deap update_checker tqdm stopit`

2. `pip install pywin32` (on Windows, if Python was not installed via Anaconda)

3. `pip install tpot`
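After installing, a quick check like the following can confirm that the packages are importable. This is only a sketch: the list of names mirrors the pip commands above, and it assumes each package's import name matches its pip name.

```python
import importlib.util

# Package names taken from the pip commands above (assumed to match import names).
required = ["deap", "update_checker", "tqdm", "stopit", "tpot"]

# find_spec returns None when a top-level package cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All TPOT dependencies are importable.")
```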

In [12]:
from tpot import TPOTRegressor
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.feature_extraction import DictVectorizer

In [10]:
def read_and_get_data():
    # Load both splits and expand the timestamp into numeric columns.
    train_data = split_datetime(pd.read_csv('./training.csv'))
    test_data = split_datetime(pd.read_csv('./testing.csv'))

    # Fit the encoder on the training features and reuse it on the test
    # features so that both matrices share the same columns.
    x_train, vec = categorical_to_numerical(train_data.drop(['Appliances'], axis=1))
    y_train = train_data['Appliances'].values

    x_test, _ = categorical_to_numerical(test_data.drop(['Appliances'], axis=1), vec)
    y_test = test_data['Appliances'].values
    return x_train, y_train, x_test, y_test

def categorical_to_numerical(df, vec=None):
    # One-hot encode string columns; pass a fitted `vec` to reuse its columns.
    categorical_columns = df.select_dtypes(include=['object']).columns
    for col in categorical_columns:
        df[col] = df[col].astype('category')
    records = df.to_dict('records')
    if vec is None:
        # Note: dtype=int stores every feature value as an integer,
        # so fractional values are truncated.
        vec = DictVectorizer(sparse=False, dtype=int)
        return vec.fit_transform(records), vec
    return vec.transform(records), vec

def split_datetime(df):
    # Replace the 'date' column with numeric year/month/day/hour columns.
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['hour'] = df['date'].dt.hour
    return df.drop(['date'], axis=1)

def mean_absolute_percentage_error(y_true, y_pred):
    # MAPE in percent; undefined when y_true contains zeros.
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def cal_errors(y_test, y_pred):
    mae = mean_absolute_error(y_test, y_pred)
    # mean_squared_error returns the MSE, not its square root, so the value
    # printed under the 'RMS' label is a mean squared error.
    rms = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mape = mean_absolute_percentage_error(y_test, y_pred)
    print('MAE = {}, RMS = {}, R2 = {}, MAPE = {}'.format(mae, rms, r2, mape))
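The `DictVectorizer` step used above can be easier to follow on a tiny example. The sketch below uses hypothetical records (the field names `temp` and `weekday` are not from the assignment data) to show how string fields become one-hot columns, how a fitted vectorizer handles a category it has never seen, and what `dtype=int` does to fractional values.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one numeric field and one string field.
train_records = [
    {"temp": 21.5, "weekday": "Mon"},
    {"temp": 19.0, "weekday": "Tue"},
]
test_records = [{"temp": 20.0, "weekday": "Wed"}]  # "Wed" is absent from training

vec = DictVectorizer(sparse=False)
x_train = vec.fit_transform(train_records)  # learns the output columns
x_test = vec.transform(test_records)        # reuses them; unseen "Wed" is dropped

print(vec.feature_names_)
print(x_train)
print(x_test)

# With dtype=int, numeric values are cast to integers before being stored.
x_int = DictVectorizer(sparse=False, dtype=int).fit_transform(train_records)
print(x_int)
```

Fitting once on the training records and calling `transform` on the test records keeps the column layout identical between the two matrices, which the model requires at prediction time.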


In [13]:
x_train, y_train, x_test, y_test = read_and_get_data()
tpot = TPOTRegressor(generations = 5, population_size=50, verbosity=2)
tpot.fit(x_train,y_train)
y_pred = tpot.predict(x_test)
cal_errors(y_test,y_pred)


Generation 1 - Current best internal CV score: -8799.849473934231
Generation 2 - Current best internal CV score: -8771.209432220396
Generation 3 - Current best internal CV score: -8701.563001635453
Generation 4 - Current best internal CV score: -8699.117897982436
Generation 5 - Current best internal CV score: -8593.739592487294
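The scores in this log are negative because, as far as I understand the TPOT documentation, `TPOTRegressor` defaults to scikit-learn's `neg_mean_squared_error` scoring, where larger (closer to zero) is better. The sketch below reproduces that convention with scikit-learn alone. The synthetic data and the `ExtraTreesRegressor` settings are assumptions chosen for illustration, not the pipeline TPOT evaluated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to demonstrate the scoring convention.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = ExtraTreesRegressor(n_estimators=50, random_state=0)

# Each fold's score is the negated mean squared error, so it is never positive.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())

# Negate and take the square root to express the score in the target's units.
rmse = (-scores.mean()) ** 0.5
print(rmse)
```

Read this way, the generation 5 score of about -8593.7 corresponds to a cross-validated RMSE of roughly 92.7 in the units of `Appliances`.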

Best pipeline: ExtraTreesRegressor(LassoLarsCV(FastICA(input_matrix, tol=0.25), normalize=False), bootstrap=False, max_features=0.35000000000000003, min_samples_leaf=18, min_samples_split=15, n_estimators=100)
MAE = 42.244488857464084, RMS = 6980.216082289274, R2 = 0.32384072634803185, MAPE = 44.79586610281174
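The value printed as `RMS` above comes straight from `mean_squared_error`, so it is a mean squared error rather than a root mean squared error. The sketch below shows the difference on a few hypothetical numbers and then takes the square root of the reported value.

```python
import numpy as np

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 50.0, 200.0])
y_pred = np.array([110.0, 40.0, 180.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)  # mean squared error, in squared units
rmse = np.sqrt(mse)         # root mean squared error, in the target's units
print(mse, rmse)

# Applying the same square root to the value reported above.
reported_mse = 6980.216082289274
reported_rmse = np.sqrt(reported_mse)
print(reported_rmse)
```

The reported figure therefore corresponds to an RMSE of roughly 83.5, which is on the same scale as the MAE of about 42.2.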
