# Train

Algorithms:

**K-Nearest Neighbors (KNN)**:
Pros: Simple to implement, no costly training phase.
Cons: Performance can degrade on large datasets. Sensitive to feature scale.

**Decision Trees**:
Pros: Easy to interpret, automatically selects the relevant features.
Cons: Prone to overfitting. Can be unstable under small changes in the data.

**Linear Regression**:
Pros: Simple and interpretable, suited to linear relationships between features and target.
Cons: Sensitive to outliers. Does not handle complex relationships well.

**Logistic Regression**:
Pros: Well suited to binary classification problems.
Cons: Assumes a linear relationship between the features and the log-odds. Does not handle complex relationships well.

**Support Vector Machines (SVM)**:
Pros: Good performance in high-dimensional feature spaces.
Cons: Requires careful parameter tuning. Not always effective on very large datasets.

**Random Forest**:
Pros: Handles complex problems well and limits overfitting. Can provide feature importances.
Cons: Less interpretable than a single tree. Training can be time-consuming.

**Ensemble Methods**:
Pros: Combine several models to improve overall performance.
Cons: Complexity and interpretability can be a problem.

**Neural Networks**:
Pros: Excellent for complex, non-linear problems. Can be trained on large amounts of data.
Cons: Require a lot of data and compute resources. Choosing the architecture is complex.

**Convolutional Neural Networks (CNN)**:
Pros: Excellent for grid-structured data such as images. Also applicable to sequential data.
Cons: Require large amounts of labeled data.

We therefore exclude Linear Regression, Decision Trees, Support Vector Machines (SVM) and Convolutional Neural Networks (CNN).

Considerations:

We can use the remaining algorithms to split the problem into 4 categories: Unsafe, Target 1 day, Target 5 days, Target 30 days.

**KNN**:
The choice of K matters, and results can change considerably even between two nearby values of K.
For this reason we decided to rely on other types of algorithms for this kind of problem.
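
One way to make the choice of K less arbitrary is to compare cross-validated accuracy over several values, standardizing the features first since KNN is sensitive to feature scale. The sketch below uses synthetic data from `make_classification` as a stand-in for the real dataset; the helper name `knn_cv_scores` and the grid of K values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def knn_cv_scores(X, y, k_values, cv=5):
    """Return {k: mean cross-validated accuracy} for a scaled KNN classifier."""
    scores = {}
    for k in k_values:
        # Scaling lives inside the pipeline so it is refit on each training fold.
        model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
        scores[k] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# Synthetic stand-in data (assumption: the real features would replace X_demo).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
cv_scores = knn_cv_scores(X_demo, y_demo, k_values=[1, 3, 5, 7, 9])
best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, "best K:", best_k)
```

Averaging over folds smooths out part of the jumpiness between neighboring values of K that a single train/test split tends to show.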
 
**Logistic Regression**:
We can use logistic regression to split the problem into 3 sub-problems: classify whether each prediction belongs to the label Target 1 day, Target 5 days or Target 30 days. If all three are 0, the choice falls on Unsafe.
This means training the algorithm 3 times, predicting one of the three labels at a time.
(To be tested)
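
The combination rule described above could be sketched as follows. Only the all-zero case (Unsafe) is specified; how to break ties when more than one classifier predicts 1 is not stated, so letting the shortest horizon win is purely an assumption here, as are the function name `combine_predictions` and the exact label strings.

```python
import numpy as np


def combine_predictions(pred_1day, pred_5days, pred_30days):
    """Merge three binary prediction arrays into one of four category labels."""
    pred_1day = np.asarray(pred_1day).astype(bool)
    pred_5days = np.asarray(pred_5days).astype(bool)
    pred_30days = np.asarray(pred_30days).astype(bool)

    # Assumption: when several classifiers fire, the shortest horizon wins.
    # np.select picks the first condition that is true for each row.
    return np.select(
        [pred_1day, pred_5days, pred_30days],
        ["Target 1 day", "Target 5 days", "Target 30 days"],
        default="Unsafe",  # all three predictions are 0
    )


labels = combine_predictions([1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0])
print(labels)
```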

**Random Forests**:
Random forests rely on bagging to reduce the variance that is typically high for individual decision trees, which should yield higher accuracy.
(To be tested)
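
As a rough illustration of the variance-reduction idea, the sketch below compares a single decision tree with a bagged ensemble of trees and with a random forest. The synthetic dataset from `make_classification` and all hyperparameter values are assumptions for the demo, not the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: each tree is fit on a bootstrap sample, predictions are aggregated.
    "bagged trees": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
    # Random forest: bagging plus a random feature subset at each split.
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    # The spread across folds gives a crude view of how stable each model is.
    results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name}: mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```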

**Ensemble Methods**:
(Which methods should we use?)
(To be tested)

**Neural Networks**:
Neural networks are harder to use than the other algorithms because they are more difficult to build and come in many variants.
The hard part is choosing the number of hidden layers and the number of nodes in each. This choice can be refined through repeated tests until we converge on a single configuration.
(To be tested)
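
The repeated tests over hidden layers could be organized as a small grid search. This is a minimal sketch on synthetic data; the candidate layer sizes, `max_iter=500` and `cv=3` are assumptions chosen to keep the demo fast, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))

# Each tuple lists the number of nodes per hidden layer, e.g. (16, 8) = two layers.
param_grid = {"mlpclassifier__hidden_layer_sizes": [(8,), (16,), (16, 8)]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

best_layers = search.best_params_["mlpclassifier__hidden_layer_sizes"]
print("Best hidden layers:", best_layers, "CV accuracy:", round(search.best_score_, 3))
```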

# Import Data

In [73]:
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_blobs
import pandas as pd
from sklearn.model_selection import cross_val_score

In [74]:
df = pd.ExcelFile('final_dataset.xlsx').parse('Sheet1')
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,Daily_Return,Target_1day,...,MA_30,MA_50,RSI,MACD,Signal_Line,Bollinger_Mid_Band,Bollinger_Upper_Band,Bollinger_Lower_Band,Volatility,Ticker
0,2020-06-30,90534.565179,90717.278731,89255.570312,89255.570312,988987,0.0,0.0,0.000000,1,...,92851.977604,89900.544844,29.090917,-314.945626,517.337196,94577.098437,104488.275493,84665.921382,0.030295,005380KS
1,2020-07-01,89986.421883,90899.989618,89529.638016,89712.351562,640540,0.0,0.0,0.005118,1,...,92879.384896,89858.520781,33.114757,-434.587634,326.952230,94106.611328,104003.310201,84209.912455,0.018542,005380KS
2,2020-07-02,89529.635417,91082.700521,89529.635417,90443.203125,730963,0.0,0.0,0.008147,1,...,92934.198698,89884.100625,41.444863,-465.070135,168.547757,93672.666797,103409.739151,83935.594443,0.012799,005380KS
3,2020-07-03,91265.419308,91813.559964,90077.781218,90625.921875,569575,0.0,0.0,0.002020,1,...,93077.324479,89988.247500,55.500008,-469.076637,41.022878,93133.661719,102267.155198,84000.168239,0.012387,005380KS
4,2020-07-06,91082.695405,93640.684845,90260.484514,92727.117188,1189877,0.0,0.0,0.023185,0,...,93244.811979,90150.862500,50.000000,-299.253312,-27.032360,92608.360156,100464.250846,84752.469466,0.009195,005380KS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26675,2023-10-20,112.919998,113.320000,110.790001,111.080002,22439800,0.0,0.0,-0.017165,0,...,113.866000,112.431415,41.095898,-0.672795,-0.697816,112.580000,120.644831,104.515168,0.012627,XOM
26676,2023-10-23,110.629997,110.959999,108.680000,109.449997,18185000,0.0,0.0,-0.014674,0,...,113.709000,112.402000,38.176416,-0.775562,-0.713365,112.240999,120.232491,104.249507,0.014787,XOM
26677,2023-10-24,109.699997,109.820000,108.120003,108.389999,16786100,0.0,0.0,-0.009685,1,...,113.405666,112.349600,43.441582,-0.931798,-0.757052,111.839999,119.758208,103.921790,0.012802,XOM
26678,2023-10-25,108.519997,109.500000,108.129997,108.589996,22047300,0.0,0.0,0.001845,0,...,113.143999,112.358200,49.065416,-1.027632,-0.811168,111.259499,118.283746,104.235252,0.008695,XOM


# Knn

We need to convert all features to float in order to use the algorithm.

In [75]:
# Convert dates to float
df['Date'] = pd.to_datetime(df['Date']).astype(int) / 10**9
df['Ticker'] = pd.factorize(df.Ticker)[0]
df['Volume'] = df['Volume'].astype(float)
df['Target_1day'] = df['Target_1day'].astype(float)
df['Target_5days'] = df['Target_5days'].astype(float)
df['Target_30days'] = df['Target_30days'].astype(float)
df['Net Income'] = df['Net Income'].astype(float)
df['Total Revenue'] = df['Total Revenue'].astype(float)
df['Normalized EBITDA'] = df['Normalized EBITDA'].astype(float)
df['Total Unusual Items'] = df['Total Unusual Items'].astype(float)
df['Total Unusual Items Excluding Goodwill'] = df['Total Unusual Items Excluding Goodwill'].astype(float)
df['Operating Cash Flow'] = df['Operating Cash Flow'].astype(float)
df['Capital Expenditure'] = df['Capital Expenditure'].astype(float)
df['Free Cash Flow'] = df['Free Cash Flow'].astype(float)
df['Cash Flow From Continuing Operating Activities'] = df['Cash Flow From Continuing Operating Activities'].astype(float)
df['Cash Flow From Continuing Investing Activities'] = df['Cash Flow From Continuing Investing Activities'].astype(float)
df['Cash Flow From Continuing Financing Activities'] = df['Cash Flow From Continuing Financing Activities'].astype(float)
df['Ticker'] = df['Ticker'].astype(float)

# Split into train and test sets
X = df.drop(['Target_1day', 'Target_5days', 'Target_30days'], axis=1)
Y = df[['Target_1day', 'Target_5days', 'Target_30days']]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
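
The column-by-column conversion above could be written more compactly by casting the whole frame in one call once the non-numeric columns are handled. The sketch below uses a tiny hypothetical DataFrame that reuses a few of the column names; the values are made up for illustration.

```python
import pandas as pd

# Tiny made-up frame reusing a few of the column names (values are illustrative).
demo = pd.DataFrame(
    {
        "Date": ["2020-06-30", "2020-07-01"],
        "Ticker": ["005380KS", "XOM"],
        "Volume": [988987, 640540],
        "Target_1day": [1, 0],
    }
)

# Dates become seconds since the Unix epoch, computed from a timedelta.
epoch = pd.Timestamp("1970-01-01")
demo["Date"] = (pd.to_datetime(demo["Date"]) - epoch) / pd.Timedelta(seconds=1)

# factorize maps each distinct ticker to an integer code (0, 1, ...).
demo["Ticker"] = pd.factorize(demo["Ticker"])[0]

# Cast every remaining column to float in a single call.
demo = demo.astype(float)
print(demo.dtypes)
```

Note that the integer ticker codes impose an arbitrary ordering, which distance-based models such as KNN will treat as meaningful.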

In [76]:
# Train
for i in [1,3,5,7,9,11,13,15,17,19,21]:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    print("Accuracy for KNN with K = " + str(i) + " is " + str(accuracy_score(Y_test, Y_pred)))


Accuracy for KNN with K = 1 is 0.3779985007496252
Accuracy for KNN with K = 3 is 0.35344827586206895
Accuracy for KNN with K = 5 is 0.30603448275862066
Accuracy for KNN with K = 7 is 0.3154047976011994
Accuracy for KNN with K = 9 is 0.30359820089955025
Accuracy for KNN with K = 11 is 0.29685157421289354
Accuracy for KNN with K = 13 is 0.2946026986506747
Accuracy for KNN with K = 15 is 0.2912293853073463
Accuracy for KNN with K = 17 is 0.2901049475262369
Accuracy for KNN with K = 19 is 0.28523238380809596
Accuracy for KNN with K = 21 is 0.2848575712143928
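
Because `Y` has three columns here, `accuracy_score` computes subset accuracy: a sample counts as correct only when all three labels match. This may partly explain why the values above look low compared with a per-label score. A small made-up example illustrates the difference:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up multilabel targets: columns stand for the 1-day, 5-day and 30-day labels.
y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]])

# Subset accuracy: a row is correct only if every label matches (rows 0 and 3).
subset_acc = accuracy_score(y_true, y_pred)

# Per-label accuracy: fraction of matching entries in each column.
per_label_acc = (y_true == y_pred).mean(axis=0)

print(subset_acc)      # 0.5
print(per_label_acc)   # [1.   0.75 0.75]
```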


# Logistic Regression

We split our classification problem into 3 sub-problems:

- X_1 and Y_1: Target_1day
- X_2 and Y_2: Target_5days
- X_3 and Y_3: Target_30days

This way, for each sub-problem we can train a logistic regression model to classify whether a prediction belongs to the label or not.

In [77]:
from sklearn.linear_model import LogisticRegression

X_1 = df.drop('Target_1day', axis=1)
Y_1 = df['Target_1day']

X_train_1_80, X_test_1, Y_train_1_80, Y_test_1 = train_test_split(X_1, Y_1, test_size=0.2)
X_train_1, X_valid_1, Y_train_1, Y_valid_1 = train_test_split(X_train_1_80, Y_train_1_80, test_size=0.20)

X_2 = df.drop('Target_5days', axis=1)
Y_2 = df['Target_5days']

X_train_2_80, X_test_2, Y_train_2_80, Y_test_2 = train_test_split(X_2, Y_2, test_size=0.2)
X_train_2, X_valid_2, Y_train_2, Y_valid_2 = train_test_split(X_train_2_80, Y_train_2_80, test_size=0.20)

X_3 = df.drop('Target_30days', axis=1)
Y_3 = df['Target_30days']

X_train_3_80, X_test_3, Y_train_3_80, Y_test_3 = train_test_split(X_3, Y_3, test_size=0.2)
X_train_3, X_valid_3, Y_train_3, Y_valid_3 = train_test_split(X_train_3_80, Y_train_3_80, test_size=0.20)
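
The repeated two-step split could be wrapped in a small helper, which also keeps the order of the returned sets consistent across the three sub-problems. `split_train_valid_test` is a hypothetical name, and the fixed `random_state` is an assumption added for reproducibility; the demo data is synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_train_valid_test(X, y, test_size=0.2, valid_size=0.2, random_state=0):
    """Split into train/validation/test; valid_size is a fraction of the non-test part."""
    X_train_80, X_test, y_train_80, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    # train_test_split returns (train, held-out) pairs, in that order.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_train_80, y_train_80, test_size=valid_size, random_state=random_state
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2
X_tr, X_va, X_te, y_tr, y_va, y_te = split_train_valid_test(X_demo, y_demo)
print(len(X_tr), len(X_va), len(X_te))  # 64 16 20
```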

In [78]:
logreg = LogisticRegression()
logreg.fit(X_train_1, Y_train_1)
train_acc = accuracy_score(y_true= Y_train_1, y_pred= logreg.predict(X_train_1))
scores = cross_val_score(logreg, X_train_1_80, Y_train_1_80, 
                         cv=5, scoring='accuracy', 
                         verbose = 0)
print("Train set 1: {:.2f}".format(train_acc))
print('Validation set 1: {:.2f}'.format(scores.mean()))

logreg = LogisticRegression()
logreg.fit(X_train_1_80, Y_train_1_80)
test_acc = accuracy_score(y_true= Y_test_1, y_pred= logreg.predict(X_test_1))

print('Test set 1: {:.2f}'.format(test_acc))

Train set 1: 0.51
Validation set 1: 0.51
Test set 1: 0.51


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
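
The warning itself points at two remedies: scaling the data or raising `max_iter`. A minimal sketch of both, on synthetic data with deliberately mismatched feature scales (the data, the scale factors and `max_iter=1000` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
# Blow up some columns to mimic features such as Volume next to small ratios.
X_demo = X_demo * np.array([1.0, 1e3, 1e6, 1.0, 1e-3, 1e4])

# The scaler is fit inside each CV fold, so no validation data leaks into it.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_scores = cross_val_score(scaled_logreg, X_demo, y_demo, cv=5, scoring="accuracy")
print("CV accuracy with scaling: {:.2f}".format(scaled_scores.mean()))
```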


In [79]:
logreg = LogisticRegression()
logreg.fit(X_train_2, Y_train_2)
train_acc = accuracy_score(y_true= Y_train_2, y_pred= logreg.predict(X_train_2))
scores = cross_val_score(logreg, X_train_2_80, Y_train_2_80, 
                         cv=5, scoring='accuracy', 
                         verbose = 0)
print("Train set 2: {:.2f}".format(train_acc))
print('Validation set 2: {:.2f}'.format(scores.mean()))

logreg = LogisticRegression()
logreg.fit(X_train_2_80, Y_train_2_80)
test_acc = accuracy_score(y_true= Y_test_2, y_pred= logreg.predict(X_test_2))

print('Test set 2: {:.2f}'.format(test_acc))

Train set 2: 0.52
Validation set 2: 0.51
Test set 2: 0.51


In [80]:
logreg = LogisticRegression()
logreg.fit(X_train_3, Y_train_3)
train_acc = accuracy_score(y_true= Y_train_3, y_pred= logreg.predict(X_train_3))
scores = cross_val_score(logreg, X_train_3_80, Y_train_3_80, 
                         cv=5, scoring='accuracy', 
                         verbose = 0)
print("Train set 3: {:.2f}".format(train_acc))
print('Validation set 3: {:.2f}'.format(scores.mean()))

logreg = LogisticRegression()
logreg.fit(X_train_3_80, Y_train_3_80)
test_acc = accuracy_score(y_true= Y_test_3, y_pred= logreg.predict(X_test_3))

print('Test set 3: {:.2f}'.format(test_acc))

Train set 3: 0.55
Validation set 3: 0.52
Test set 3: 0.51


# Random Forest

In [81]:
from sklearn.ensemble import RandomForestClassifier

best_i = []
for i in [2,10,30,50,70,100,500,1000]:
    print('Max depth: ' + str(i) + '\n')
    
    # Target 1 day
    rm_1 = RandomForestClassifier(max_depth=i)
    rm_1.fit(X_train_1, Y_train_1)
    train_acc_1 = accuracy_score(y_true= Y_train_1, y_pred= rm_1.predict(X_train_1))
    scores_1 = cross_val_score(rm_1, X_train_1_80, Y_train_1_80, 
                             cv=5, scoring='accuracy', 
                             verbose = 0)
    print("Train set 1: {:.2f}".format(train_acc_1))
    print('Validation set 1: {:.2f}'.format(scores_1.mean()))
    print('\n')
    # Target 5 days
    rm_2 = RandomForestClassifier(max_depth=i)
    rm_2.fit(X_train_2, Y_train_2)
    train_acc_2 = accuracy_score(y_true= Y_train_2, y_pred= rm_2.predict(X_train_2))
    scores_2 = cross_val_score(rm_2, X_train_2_80, Y_train_2_80, 
                             cv=5, scoring='accuracy', 
                             verbose = 0)
    print("Train set 2: {:.2f}".format(train_acc_2))
    print('Validation set 2: {:.2f}'.format(scores_2.mean()))
    print('\n')
    # Target 30 days
    rm_3 = RandomForestClassifier(max_depth=i)
    rm_3.fit(X_train_3, Y_train_3)
    train_acc_3 = accuracy_score(y_true= Y_train_3, y_pred= rm_3.predict(X_train_3))
    scores_3 = cross_val_score(rm_3, X_train_3_80, Y_train_3_80, 
                             cv=5, scoring='accuracy', 
                             verbose = 0)
    print("Train set 3: {:.2f}".format(train_acc_3))
    print('Validation set 3: {:.2f}'.format(scores_3.mean()))
    print('\n')
    
    best_i.append([i, scores_1.mean() + scores_2.mean() + scores_3.mean()])
    
    
i = max(best_i, key=lambda x:x[1])[0]
print('Best max depth: ' + str(i) + '\n')
# Target 1 day
rm_1 = RandomForestClassifier(max_depth=i)
rm_1.fit(X_train_1_80, Y_train_1_80)
test_acc_1 = accuracy_score(y_true= Y_test_1, y_pred= rm_1.predict(X_test_1))
print('Test set 1: {:.2f}'.format(test_acc_1))

# Target 5 days
rm_2 = RandomForestClassifier(max_depth=i)
rm_2.fit(X_train_2_80, Y_train_2_80)
test_acc_2 = accuracy_score(y_true= Y_test_2, y_pred= rm_2.predict(X_test_2))
print('Test set 2: {:.2f}'.format(test_acc_2))

# Target 30 days
rm_3 = RandomForestClassifier(max_depth=i)
rm_3.fit(X_train_3_80, Y_train_3_80)
test_acc_3 = accuracy_score(y_true= Y_test_3, y_pred= rm_3.predict(X_test_3))
print('Test set 3: {:.2f}'.format(test_acc_3))

Max depth: 2


KeyboardInterrupt: 
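
The run above was interrupted during the first depth value. One possible way to make the search cheaper is to let `GridSearchCV` handle the loop with fewer trees and a shorter depth grid. The sketch below runs on synthetic data; the grid values, `n_estimators=50` and `cv=3` are assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Depth limits larger than what the trees actually reach change nothing,
# so a short grid plus None (unlimited depth) covers the useful range.
param_grid = {"max_depth": [2, 5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train_demo, y_train_demo)  # refits the best model on all training data

best_depth = search.best_params_["max_depth"]
test_acc_demo = search.score(X_test_demo, y_test_demo)
print("Best max depth:", best_depth, "test accuracy: {:.2f}".format(test_acc_demo))
```

Passing `n_jobs=-1` to `GridSearchCV` would additionally spread the fits across the available CPU cores.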

# Neural Networks

In [None]:
from sklearn.neural_network import MLPClassifier

In [None]:

# Train 1

clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train_1, Y_train_1)
Y_pred_1 = clf.predict(X_test_1)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(clf.score(X_test_1, Y_test_1)))


In [None]:

# Train 2

clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train_2, Y_train_2)
Y_pred_2 = clf.predict(X_test_2)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(clf.score(X_test_2, Y_test_2)))


In [None]:

# Train 3

clf = MLPClassifier(random_state=1, max_iter=300).fit(X_train_3, Y_train_3)
Y_pred_3 = clf.predict(X_test_3)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(clf.score(X_test_3, Y_test_3)))


In [None]:
best_i = []
for i in [100,500,1000]:
    print('Max iter: ' + str(i) + '\n')
    for j in [1,7,10,20]:
        print('Random state: ' + str(j) + '\n')
        # Target 1 day
        ann = MLPClassifier(random_state=j, max_iter=i)
        ann.fit(X_train_1, Y_train_1)
        train_acc_1 = accuracy_score(y_true= Y_train_1, y_pred= ann.predict(X_train_1))
        scores_1 = cross_val_score(ann, X_train_1_80, Y_train_1_80, 
                                 cv=5, scoring='accuracy', 
                                 verbose = 0)
        print("Train set 1: {:.2f}".format(train_acc_1))
        print('Validation set 1: {:.2f}'.format(scores_1.mean()))
        print('\n')
        # Target 5 days
        ann = MLPClassifier(random_state=j, max_iter=i)
        ann.fit(X_train_2, Y_train_2)
        train_acc_2 = accuracy_score(y_true= Y_train_2, y_pred= ann.predict(X_train_2))
        scores_2 = cross_val_score(ann, X_train_2_80, Y_train_2_80, 
                                 cv=5, scoring='accuracy', 
                                 verbose = 0)
        print("Train set 2: {:.2f}".format(train_acc_2))
        print('Validation set 2: {:.2f}'.format(scores_2.mean()))
        print('\n')
        # Target 30 days
        ann = MLPClassifier(random_state=j, max_iter=i)
        ann.fit(X_train_3, Y_train_3)
        train_acc_3 = accuracy_score(y_true= Y_train_3, y_pred= ann.predict(X_train_3))
        scores_3 = cross_val_score(ann, X_train_3_80, Y_train_3_80, 
                                 cv=5, scoring='accuracy', 
                                 verbose = 0)
        print("Train set 3: {:.2f}".format(train_acc_3))
        print('Validation set 3: {:.2f}'.format(scores_3.mean()))
        print('\n')
        best_i.append([i, j, scores_1.mean() + scores_2.mean() + scores_3.mean()])
        
# Unpack the best (max_iter, random_state) pair by summed validation accuracy
i, j, _ = max(best_i, key=lambda x: x[2])
print('Best max iter: ' + str(i))
print('Best random state: ' + str(j))

# Target 1 day
ann = MLPClassifier(random_state=j, max_iter=i)
ann.fit(X_train_1_80, Y_train_1_80)
test_acc_1 = accuracy_score(y_true= Y_test_1, y_pred= ann.predict(X_test_1))
print('Test set 1: {:.2f}'.format(test_acc_1))

# Target 5 days
ann = MLPClassifier(random_state=j, max_iter=i)
ann.fit(X_train_2_80, Y_train_2_80)
test_acc_2 = accuracy_score(y_true= Y_test_2, y_pred= ann.predict(X_test_2))
print('Test set 2: {:.2f}'.format(test_acc_2))

# Target 30 days
ann = MLPClassifier(random_state=j, max_iter=i)
ann.fit(X_train_3_80, Y_train_3_80)
test_acc_3 = accuracy_score(y_true= Y_test_3, y_pred= ann.predict(X_test_3))
print('Test set 3: {:.2f}'.format(test_acc_3))