**Part 3B**

**Instructions:**

See the instructions in the pdf file provided in **Instructions and details for all parts**

- Use this notebook to 
> - show the work you carried out to produce a prediction model and comma delimited file with your predictions of DSHARES in the test dataset.
> - show the work you carried out to produce your estimate of M, the mean absolute error in predicting log(1+DSHARES)
> - provide your estimate of M in the cell provided
- You need to upload the comma delimited file for part 3A 
- You need to upload this notebook or part 3B.

Use any number of cells you need to for your work. 
Make sure you assign a value to M and print it in the last (print) cell provided.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Load the data
train=pd.read_csv("train029.csv")
test = pd.read_csv("test029.csv")
print(train.shape)
print(test.shape)

(3424, 4)
(500, 3)


In [3]:
# Check for missing elements
missing_values_train = train.isnull().sum()
missing_values_test = test.isnull().sum()
missing_values_train, missing_values_test
# Nothing missing

(TRANS_DATE             0
 ASHARES                0
 TRANS_PRICEPERSHARE    0
 DSHARES                0
 dtype: int64,
 TRANS_DATE             0
 ASHARES                0
 TRANS_PRICEPERSHARE    0
 dtype: int64)

In [4]:
train["date_time"]=pd.to_datetime(train["TRANS_DATE"])
test['date_time'] = pd.to_datetime(test['TRANS_DATE'])
train['YEAR'] = train['date_time'].dt.year
train['MONTH'] = train['date_time'].dt.month
train['DAY'] = train['date_time'].dt.day
test['YEAR'] = test['date_time'].dt.year
test['MONTH'] = test['date_time'].dt.month
test['DAY'] = test['date_time'].dt.day
train['WEEKDAY'] = train['date_time'].dt.weekday
train['QUARTER'] = train['date_time'].dt.quarter
test['WEEKDAY'] = test['date_time'].dt.weekday
test['QUARTER'] = test['date_time'].dt.quarter

In [5]:
# Splitting the training data for internal training and testing
X = train[['YEAR', 'MONTH', 'DAY', 'WEEKDAY', 'QUARTER', 'ASHARES', 'TRANS_PRICEPERSHARE']]
y = train['DSHARES']

In [6]:
# Preprocessing
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV

def log_mae(estimator, X, y):
    y_pred = estimator.predict(X)
    y_pred = np.clip(y_pred, 0, None)
    return -np.mean(np.abs(np.log1p(y) - np.log1p(y_pred)))

errors = []

def cross_val_scoring(model, X, y, model_name):
    scores = cross_val_score(model, X, y, cv=5, scoring=log_mae)
    avg_score = -scores.mean()
    print(f"{model_name} - Log Mean Absolute Error (Cross-Validation): {avg_score:.5f}")
    errors.append(avg_score)

# Linear Regression
lr_model = LinearRegression()
cross_val_scoring(lr_model, X, y, 'Linear Regression')

# Decision Tree
dt_model = DecisionTreeRegressor()
cross_val_scoring(dt_model, X, y, 'Decision Tree Regressor')

# Random Forest
rf_params = {
    'n_estimators': [50, 100, 125, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
rf_grid_search = GridSearchCV(RandomForestRegressor(), rf_params, cv=10, scoring=log_mae)
rf_grid_search.fit(X, y)
print("Best parameters for Random Forest:", rf_grid_search.best_params_)
print("Best cross-validation score (Log MAE):", -rf_grid_search.best_score_)
rf_best_model = RandomForestRegressor(**rf_grid_search.best_params_)
cross_val_scoring(rf_best_model, X, y, 'Random Forest Regressor (Tuned)')

# Support Vector Regression
svr_model = SVR()
cross_val_scoring(svr_model, X, y, 'Support Vector Regression')

# k Nearest Neighbors
knn_params = {'n_neighbors': range(1, 31)}
knn_grid_search = GridSearchCV(KNeighborsRegressor(), knn_params, cv=10, scoring=log_mae)
knn_grid_search.fit(X, y)
print("Best parameters for KNN Regressor:", knn_grid_search.best_params_)
print("Best cross-validation score (Log MAE):", -knn_grid_search.best_score_)
knn_best_model = KNeighborsRegressor(**knn_grid_search.best_params_)
cross_val_scoring(knn_best_model, X, y, 'KNN Regressor (Tuned)')

# Gradient Boosting
gbr_model = GradientBoostingRegressor()
cross_val_scoring(gbr_model, X, y, 'Gradient Boosting Regressor')

Linear Regression - Log Mean Absolute Error (Cross-Validation): 6.55971
Decision Tree Regressor - Log Mean Absolute Error (Cross-Validation): 2.14664
Best parameters for Random Forest: {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 50}
Best cross-validation score (Log MAE): 2.291001026455093
Random Forest Regressor (Tuned) - Log Mean Absolute Error (Cross-Validation): 2.76042
Support Vector Regression - Log Mean Absolute Error (Cross-Validation): 5.13166
Best parameters for KNN Regressor: {'n_neighbors': 2}
Best cross-validation score (Log MAE): 2.226673007091265
KNN Regressor (Tuned) - Log Mean Absolute Error (Cross-Validation): 2.46154
Gradient Boosting Regressor - Log Mean Absolute Error (Cross-Validation): 5.62207


In [8]:
# KNN regressor was used because decision tree (originally rank first place) failed to produce meaningful predictions
# instead, it predicts repeating outputs.
X_test = test[['YEAR', 'MONTH', 'DAY', 'WEEKDAY', 'QUARTER', 'ASHARES', 'TRANS_PRICEPERSHARE']]
X_test = StandardScaler().fit_transform(X_test)
knn_best_model.fit(X, y)
y_test_pred = knn_best_model.predict(X_test)
test['DSHARES'] = y_test_pred
test

Unnamed: 0,TRANS_DATE,ASHARES,TRANS_PRICEPERSHARE,date_time,YEAR,MONTH,DAY,WEEKDAY,QUARTER,DSHARES
0,2013-01-05,529282.00,15.449495,2013-01-05,2013,1,5,5,1,3484.0
1,2013-01-09,12618167.77,17.557271,2013-01-09,2013,1,9,2,1,17369599.5
2,2013-01-25,63190087.04,51.926085,2013-01-25,2013,1,25,4,1,102643438.5
3,2013-01-30,61029702.77,197924.854822,2013-01-30,2013,1,30,2,1,15432622.0
4,2013-02-12,63318385.44,113.767843,2013-02-12,2013,2,12,1,1,4013735.0
...,...,...,...,...,...,...,...,...,...,...
495,2023-08-12,1732008.00,80.567123,2023-08-12,2023,8,12,5,3,1.5
496,2023-08-16,65048517.24,63.490153,2023-08-16,2023,8,16,2,3,10730.0
497,2023-08-20,659387.00,16.964496,2023-08-20,2023,8,20,6,3,1.0
498,2023-08-21,23403723.99,74.239318,2023-08-21,2023,8,21,0,3,18372.0


In [9]:
selected_columns = test[['TRANS_DATE', 'DSHARES']]
selected_columns.to_csv('Part 3A.csv', index=False)

In [10]:
M = sorted(errors)[1]
# Choose the corresponding error for random forest
errors

[6.55971065259573,
 2.146636713486078,
 2.760415721846041,
 5.131663263295438,
 2.4615374881479837,
 5.622073040046853]

In the following cell assign a value to the variable **M**.

In [11]:
# Print cell for M - do not modify or delete this line
# Do execute it
print(M)

2.4615374881479837


**Make sure you successfully print the value of M in the cell above**

**Make sure you save your notebook before submitting it**