Instructions:
1: Finish major preprocessing. This includes scaling and/or transforming your data, imputing your data, encoding your data, and feature expansion (for example, generating new features from existing ones via polynomial terms, log transforms, or products of features).

2: Train your first model

3: Evaluate your model and compare training vs. test error

4: Answer the questions: Where does your model fit in the fitting graph? and What are the next models you are thinking of and why?

5: Update your README.md to include your new work and updates you have all added. Make sure to upload all code and notebooks. Provide links in your README.md

6: Conclusion section: What is the conclusion of your 1st model? What can be done to possibly improve it?

In [17]:
# Importing required packages/tools
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, PolynomialFeatures
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

df_original = pd.read_csv('/content/infolimpioavanzadoTarget.csv')

In [32]:
# Filtering the dataset
# We only want numerical features for our model
numerical_features = df_original.select_dtypes(include=['float64', 'int64']).columns

# Calculate the correlation between the 'close' feature and every other (now numerical) feature
correlation_with_close = df_original[numerical_features].corr()['close'].drop('close')

# Keep features whose absolute correlation with 'close' exceeds 0.3 (that is, above 0.3 or below -0.3)
high_correlation_features = correlation_with_close[correlation_with_close.abs() > 0.3]

# Include 'date' and 'ticker' explicitly in the final DataFrame
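# .copy() makes df an independent DataFrame, so the column assignments in later cells do not trigger SettingWithCopyWarning
df = df_original[['date', 'ticker', 'close'] + high_correlation_features.index.tolist()].copy()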
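
The effect of the correlation threshold can be checked on a small synthetic frame. The sketch below is illustrative only: `select_correlated`, the toy columns `near_copy` and `noise`, and the 0.9 cutoff are hypothetical names and values, not part of the project data.

```python
import numpy as np
import pandas as pd

def select_correlated(df, target, threshold):
    """Return numeric columns whose absolute Pearson correlation with `target` exceeds `threshold`."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

rng = np.random.default_rng(0)
toy = pd.DataFrame({"close": np.arange(100, dtype=float)})
toy["near_copy"] = toy["close"] + rng.normal(0, 1, 100)  # almost identical to the target
toy["noise"] = rng.normal(0, 1, 100)                      # unrelated to the target

selected = select_correlated(toy, "close", 0.9)
print(selected)
```

A column that is nearly a copy of the target passes any threshold, so a very high correlation is worth inspecting for leakage before treating it as predictive.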

In [None]:
# Encoding: expand 'date' into calendar features; 'ticker' is one-hot encoded below
df['date'] = pd.to_datetime(df['date'])  # convert the 'date' column to datetime once
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['week_of_year'] = df['date'].dt.isocalendar().week  # ISO week number
df = df.drop('date', axis=1)

df = pd.get_dummies(df, columns=['ticker'], drop_first=True)

In [34]:
# Data Splitting
X = df.drop('close', axis = 1)
y = df['close']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [35]:
# Handling Missing Data
median_imputer = SimpleImputer(strategy='median')
num_cols = X.select_dtypes(include=['float64', 'int64']).columns

X_train[num_cols] = median_imputer.fit_transform(X_train[num_cols])
X_test[num_cols] = median_imputer.transform(X_test[num_cols])
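
The imputer is fitted on the training split only, so the test split is filled with the training median rather than its own. A minimal sketch on made-up values (the column `x` and its numbers are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"x": [1.0, 2.0, np.nan, 100.0]})
test = pd.DataFrame({"x": [np.nan, 5.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median of the observed training values
test_filled = imputer.transform(test)        # reuses that median; nothing is learned from the test split

print(imputer.statistics_)  # the per-column medians learned from the training data
print(test_filled)
```

The median is also robust to the outlier `100.0`, which is one reason to prefer it over the mean for skewed price data.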

In [37]:
# Feature Scaling: StandardScaler for roughly Gaussian features, MinMaxScaler for features that should stay bounded
stdScaler = StandardScaler()
minMaxScaler = MinMaxScaler()

# Each column appears in only one list: min-max scaling a standardized column
# gives the same values as min-max scaling the raw column, so 'open' and 'velaF' are listed under min_max only
standard = ['MACDsig-adjclose-50', 'MACDdif-adjclose-50-1', 'vwapadjclosevolume', 'atr5', 'atr10', 'atr15', 'atr20']
min_max = ['open', 'high', 'low', 'velaE', 'velaF', 'low-5', 'high-5', 'low-10', 'high-10', 'low-15', 'high-15']

X_train[standard] = stdScaler.fit_transform(X_train[standard])
X_test[standard] = stdScaler.transform(X_test[standard])

X_train[min_max] = minMaxScaler.fit_transform(X_train[min_max])
X_test[min_max] = minMaxScaler.transform(X_test[min_max])
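
A column should go through only one of the two scalers. Standardization is an affine map, and min-max scaling removes any affine map applied before it, so running both on the same column is equivalent to min-max scaling alone. A small sketch on synthetic data (the array `x` is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=(200, 1))  # hypothetical feature column

# StandardScaler followed by MinMaxScaler on the same column
both = MinMaxScaler().fit_transform(StandardScaler().fit_transform(x))
# MinMaxScaler applied directly to the raw column
only_minmax = MinMaxScaler().fit_transform(x)

same = np.allclose(both, only_minmax)
print(same)
```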

In [38]:
# Feature Expansion: degree-2 polynomial terms (squares and pairwise products) of every column in X_train; no log transform is applied to the features in this cell
polynomial = PolynomialFeatures(degree=2, include_bias=False)
X_train_polynomial = polynomial.fit_transform(X_train)
X_test_polynomial = polynomial.transform(X_test)

In [46]:
# Dimensionality Reduction: L1 regularization (Lasso) shrinks the coefficients of unimportant features to exactly zero; PCA (retaining 95% of variance) remains an option if the features prove redundant
lasso = Lasso(alpha=0.01, random_state=42, max_iter=5000)

lasso.fit(X_train_polynomial, y_train)

feature_names = polynomial.get_feature_names_out(input_features=X_train.columns)
coefficients = lasso.coef_

important_features = [feature for feature, coef in zip(feature_names, coefficients) if coef != 0]
print("Important Features:", important_features)

Important Features: ['adjclose', 'open adjclose', 'open year', 'high adjclose', 'high year', 'low adjclose', 'low year', 'adjclose^2', 'adjclose MACDsig-adjclose-50', 'adjclose MACDdif-adjclose-50-0', 'adjclose MACDdif-adjclose-50-1', 'adjclose vwapadjclosevolume', 'adjclose atr5', 'adjclose atr10', 'adjclose atr15', 'adjclose atr20', 'adjclose velaE', 'adjclose velaF', 'adjclose low-5', 'adjclose high-5', 'adjclose low-10', 'adjclose high-10', 'adjclose low-15', 'adjclose year', 'adjclose month', 'adjclose day', 'adjclose day_of_week', 'adjclose week_of_year', 'adjclose ticker_ASML', 'adjclose ticker_ASTE', 'adjclose ticker_ATLC', 'MACDsig-adjclose-50 year', 'MACDsig-adjclose-50 week_of_year', 'MACDdif-adjclose-50-0^2', 'MACDdif-adjclose-50-0 atr15', 'MACDdif-adjclose-50-0 atr20', 'MACDdif-adjclose-50-0 year', 'MACDdif-adjclose-50-0 day', 'MACDdif-adjclose-50-0 week_of_year', 'MACDdif-adjclose-50-1 year', 'MACDdif-adjclose-50-1 week_of_year', 'vwapadjclosevolume year', 'atr5 year', 'a
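
The notebook imports `SelectFromModel` but lists the non-zero coefficients by hand. As a rough sketch of how the same selection could be turned into a reduced feature matrix, the example below uses synthetic data; `X_toy`, `y_toy`, and `alpha=0.1` are assumptions chosen for illustration, not values from the project.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
# Only the first two columns influence the target; the other three are pure noise
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

selector = SelectFromModel(Lasso(alpha=0.1)).fit(X_toy, y_toy)
mask = selector.get_support()           # boolean mask of the columns that were kept
X_selected = selector.transform(X_toy)  # matrix restricted to the kept columns

print(mask)
print(X_selected.shape)
```

Fitting the selector on the training split and calling `transform` on both splits keeps the train and test matrices aligned on the same columns.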

In [39]:
# Target (close) Transformation: log1p for better stability and reduced influence of outliers
y_train_log = np.log1p(y_train)
y_test_log = np.log1p(y_test)
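
A model trained on `y_train_log` predicts in log space, so its predictions need to be mapped back with `np.expm1` before errors are reported in price units. A minimal round-trip check on made-up values:

```python
import numpy as np

y = np.array([0.0, 9.5, 250.0])  # hypothetical closing prices
y_log = np.log1p(y)               # log(1 + y), defined at y = 0
y_back = np.expm1(y_log)          # exp(y_log) - 1, the exact inverse of log1p

print(np.allclose(y_back, y))
```

Errors computed directly on the log-scaled values are not comparable with errors computed on raw prices, so the two should not be mixed when comparing models.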

In [None]:
# Training First Model


In [None]:
# Evaluating model and comparing training vs. test error
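
The training and evaluation cells are still empty. As a hedged sketch of the train-versus-test comparison they call for, the example below fits a model on synthetic data and reports both errors. `LinearRegression`, the toy arrays, and the variable names are assumptions for illustration; they are not the model or data chosen in this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, model.predict(X_tr))
test_mse = mean_squared_error(y_te, model.predict(X_te))

print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
```

A test error far above the training error points toward overfitting, while two similarly high errors point toward underfitting; that comparison is what places a model on the fitting graph.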


**Q: Where does your model fit in the fitting graph? and What are the next models you are thinking of and why?**

A:

**Q: What is the conclusion of your 1st model? What can be done to possibly improve it?**

A: