# Data Preprocessing

In [27]:
import pandas as pd
import numpy as np
import os

## Sheet names

Collect the names of the spreadsheet sheets that we will need to build the final dataset.

In [36]:
sheet_names = [
    'Info',
    'Historical',
    'Income Statement',
    'Quarterly Income Statement',
    'Cashflow',
    'Institutional Holders',
    'Mutual Fund Holders',
    'Major Holders'
]

# Fill `stocks` with the paths of all ticker files, so each file can be loaded into df_stock and all the data processed
directory = "./data"
stocks = [os.path.join(directory, file) for file in os.listdir(directory)]

## Financial data integration
Added columns:
- **Daily_Return**: daily return (percentage change of the closing price).

- **Target_1day**: indicates whether the closing price of the next trading day is higher (1) or not (0) than the closing price of the current day.

- **Target_5days**: indicates whether the closing price 5 trading days ahead is higher (1) or not (0) than the closing price of the current day.

- **Target_30days**: indicates whether the closing price 30 trading days ahead is higher (1) or not (0) than the closing price of the current day.

- **Net income** (Income statement): This metric measures a company's profit after all expenses and taxes have been paid. It is one of the metrics investors watch most closely, as it represents the company's bottom line.

- **Diluted EPS** (Income statement): This metric measures a company's profit per share of common stock. It is a good measure of a company's profitability per share.

- **Total Revenue** (Income statement): This metric measures the total amount of sales that a company generates. It is a good measure of a company's top line growth.

- **Cost of revenue** (Income statement): This metric measures the cost of the goods that a company sells. It is an important metric for assessing a company's profitability.

- **Operating revenue** (Income statement): This metric measures the revenue that a company earns from its core business operations. It helps separate recurring business income from non-operating items when assessing a company's profitability and cash flow generation.

- **Cash flow from continuing operating activities** (Cashflow): This metric measures the amount of cash that a company generates from its core business operations.

- **Cash flow from continuing investing activities** (Cashflow): This metric measures the amount of cash that a company generates from its investments, such as the sale of property, plant, and equipment.

- **Cash flow from continuing financing activities** (Cashflow): This metric measures the amount of cash that a company generates from its financing activities, such as the issuance of debt or equity.
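
The return and target columns described above can be sketched on a toy series. This is a minimal illustration, not the notebook's exact code: the prices, the helper name `make_target` and the 2-row horizon are assumptions made for the example. One detail worth noting is that comparing against a shifted series evaluates to False on the final rows, where no future close exists, so the sketch masks those rows as NaN instead of labelling them 0.

```python
import numpy as np
import pandas as pd

# Toy closing prices (made-up values, for illustration only)
close = pd.Series(
    [10.0, 11.0, 10.5, 10.5, 12.0],
    index=pd.date_range("2021-01-04", periods=5, freq="B"),
    name="Close",
)

# Daily return: relative change with respect to the previous row
daily_return = close.pct_change()  # the first row is NaN by construction


def make_target(close: pd.Series, horizon: int) -> pd.Series:
    """1.0 if the close `horizon` rows ahead is higher, 0.0 otherwise,
    NaN where the future close is not available (hypothetical helper)."""
    future = close.shift(-horizon)
    return (future > close).astype(float).where(future.notna())


target_1day = make_target(close, 1)
target_2days = make_target(close, 2)
print(pd.DataFrame({"Close": close, "T1": target_1day, "T2": target_2days}))
```

The horizon counts rows, so with daily data it corresponds to trading days rather than calendar days.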

Merged the "Income Statement" and "Cashflow" sheets into a single Excel file. Note: because these sheets contain annual or quarterly financial data, a common approach is to carry the last known value forward for each day until a new value becomes available. For some financial years the values will be NaN because we do not have them.
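
The forward-fill step can be illustrated as follows. This is a sketch on made-up numbers: the frames `daily` and `annual` and their values are assumptions for the example. It shows the left join followed by a forward fill, and, as a possible alternative that is not taken from the notebook, `reindex(..., method="ffill")`, which also picks up report dates that are not themselves present in the daily index.

```python
import pandas as pd

# Toy daily prices and toy annual figures (made-up values)
daily = pd.DataFrame(
    {"Close": [10.0, 10.2, 10.1, 10.4, 10.3]},
    index=pd.to_datetime(
        ["2020-12-30", "2020-12-31", "2021-01-04", "2021-01-05", "2021-01-06"]
    ),
)
annual = pd.DataFrame(
    {"Net Income": [100.0, 120.0]},
    index=pd.to_datetime(["2019-12-31", "2020-12-31"]),
)

# Left join: only exact date matches receive a value, then carry it forward
joined = daily.join(annual, how="left")
joined["Net Income"] = joined["Net Income"].ffill()

# Alternative (an assumption, not the notebook's method): align the sorted
# annual data on the daily index, taking the last report at or before each day
aligned = annual.sort_index().reindex(daily.index, method="ffill")

print(joined)
print(aligned)
```

With the join, days before the first matching report date stay NaN, and a report dated on a non-trading day never matches any row. The `reindex` variant avoids both effects, at the cost of requiring a sorted index.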

The dataframes for each stock were cut so that they start at 30/06/2020, and each one was also exported individually so that it can be reused later if needed. Finally, we aggregated all the stocks into a single final dataset, in which each stock is identified by the "Ticker" feature that we added.
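
The cut-and-aggregate step could look roughly like this. It is a sketch under assumptions: the tickers `AAA` and `BBB`, their prices and the helper `tag_and_cut` are hypothetical and only serve the example.

```python
import pandas as pd


def tag_and_cut(frame: pd.DataFrame, ticker: str, start: str) -> pd.DataFrame:
    """Keep rows dated on or after `start` and label them with `ticker`
    (hypothetical helper, not part of the notebook)."""
    out = frame.loc[frame.index >= pd.Timestamp(start)].copy()
    out["Ticker"] = ticker
    return out


# Two toy per-stock frames sharing the same dates (made-up values)
idx = pd.to_datetime(["2020-06-29", "2020-06-30", "2020-07-01"])
frames = {
    "AAA": pd.DataFrame({"Close": [1.0, 1.1, 1.2]}, index=idx),
    "BBB": pd.DataFrame({"Close": [5.0, 5.1, 5.2]}, index=idx),
}

# Collect the processed frames in a list and concatenate once at the end
combined = pd.concat(
    [tag_and_cut(frame, ticker, "2020-06-30") for ticker, frame in frames.items()]
)
print(combined)
```

Building a list and calling `pd.concat` once avoids repeatedly copying a growing dataframe inside the loop.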

Some companies did not report all the financial data that we chose to analyze. We therefore agreed to keep these companies anyway and to fill the features that appear empty with neutral values (0). This lets us keep those rows while still benefiting from the financial data of the companies for which it is available.


## Feature Engineering
- **Moving averages**: We compute short- and long-term moving averages of the closing price, which are common in algorithmic trading: 5-, 10-, 30- and 50-day moving averages.
- **RSI (Relative Strength Index)**: A momentum indicator that can help identify whether a stock is in an "overbought" or "oversold" condition.
- **MACD (Moving Average Convergence Divergence)**: Another momentum indicator, computed as the difference between the 12- and 26-period exponential moving averages of the closing price, together with a 9-period signal line.
- **Bollinger Bands**: These are based on moving averages and can help identify whether a price is relatively high or low.
- **Volatility**: We compute volatility as the standard deviation of the daily returns over a specific time window.
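
As a compact reference, two of the indicators listed above can be written as small functions. This is a sketch, not a definitive implementation: the function names `rsi_sma` and `bollinger` and the toy prices are assumptions. The RSI here uses simple rolling means of gains and losses, whereas Wilder's original formulation uses a smoothed average. The band multiplier `k` is left as a parameter, since 2 is the conventional default.

```python
import pandas as pd


def rsi_sma(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI computed from simple rolling means of gains and losses."""
    delta = close.diff()
    gain = delta.clip(lower=0).fillna(0)     # upward moves, 0 elsewhere
    loss = (-delta.clip(upper=0)).fillna(0)  # downward moves as positive numbers
    rs = gain.rolling(window).mean() / loss.rolling(window).mean()
    return 100 - 100 / (1 + rs)


def bollinger(close: pd.Series, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    """Middle band (rolling mean) plus/minus k rolling standard deviations."""
    mid = close.rolling(window).mean()
    width = k * close.rolling(window).std()
    return pd.DataFrame({"mid": mid, "upper": mid + width, "lower": mid - width})


# Toy prices alternating between two levels (made-up values)
close = pd.Series([1.0, 2.0, 1.0, 2.0, 1.0, 2.0])
rsi = rsi_sma(close, window=2)
bands = bollinger(close, window=3, k=1.96)
print(rsi)
print(bands)
```

When a window contains no losses the ratio `rs` is infinite and the RSI evaluates to 100. When it contains neither gains nor losses the result is NaN, which may need explicit handling before modelling.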

In [41]:
counter = 0
df = pd.DataFrame()
for file in stocks:
    #if file.split("/")[2][:-5] != "1398.HK":
        df_stock = pd.ExcelFile(file)
        
        # Prevent false positive warnings, reference: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
        pd.options.mode.chained_assignment = None # default='warn'
        
        # Loading the 'Historical' data stock
        historical_data = df_stock.parse('Historical')
        
        # Renaming and setting the Date column
        historical_data.rename(columns={'Unnamed: 0': 'Date'}, inplace=True)
        historical_data['Date'] = pd.to_datetime(historical_data['Date'])
        historical_data.set_index('Date', inplace=True)
        # Calculate daily return
        historical_data['Daily_Return'] = historical_data['Close'].pct_change()
        
        # Create target variables for the next day, next 5 days and next 30 days.
        # Comparing against a shifted series yields False (not NaN) on the last rows,
        # so mask those rows explicitly: their future close is unknown.
        close = historical_data['Close']
        target_columns = {'Target_1day': 1, 'Target_5days': 5, 'Target_30days': 30}
        for name, horizon in target_columns.items():
            future_close = close.shift(-horizon)
            historical_data[name] = (future_close > close).astype(float).where(future_close.notna())
        
        # Drop rows with NaN values (first Daily_Return row, last rows without a future close)
        historical_data = historical_data.dropna()
        # The targets are complete now, so store them as integers again
        historical_data = historical_data.astype({name: int for name in target_columns})
        
        # Loading the 'Income Statement' data for the current stock
        income_statement = df_stock.parse('Income Statement')
        
        # Transposing the data for easier integration
        income_statement = income_statement.set_index('Unnamed: 0').transpose()
        income_statement.index = pd.to_datetime(income_statement.index)
        
        
        # Selecting some of the key financial metrics (you can add or remove based on relevance)
        selected_metrics = [
            'Net Income',
            'Diluted EPS',
            'Total Revenue',
            'Normalized EBITDA',
            'Total Unusual Items',
            'Total Unusual Items Excluding Goodwill'
        ]
        
        # Check that each selected column exists; create the missing ones filled with NaN
        for metric in selected_metrics:
            if metric not in income_statement.columns:
                income_statement[metric] = np.nan
                
        
        income_statement = income_statement[selected_metrics]
        
        # Merging the income statement data with the historical data
        merged_data = historical_data.join(income_statement, how='left')
        
        # Forward filling the NaN values
        merged_data[selected_metrics] = merged_data[selected_metrics].ffill()
        
        # Loading the 'Cashflow' data for the current stock
        cashflow = df_stock.parse('Cashflow')
        
        # Transposing the data for easier integration
        cashflow = cashflow.set_index('Unnamed: 0').transpose()
        cashflow.index = pd.to_datetime(cashflow.index)
        
        # Selecting some of the key cashflow metrics (you can add or remove based on relevance)
        selected_cashflow_metrics = [
            'Operating Cash Flow',
            'Capital Expenditure',
            'Free Cash Flow',
            'Cash Flow From Continuing Operating Activities',
            'Cash Flow From Continuing Investing Activities',
            'Cash Flow From Continuing Financing Activities'
        ]
        
        for metric in selected_cashflow_metrics:
            if metric not in cashflow.columns:
                cashflow[metric] = np.nan
        
        cashflow = cashflow[selected_cashflow_metrics]
        
        # Merging the cashflow data with the existing dataframe
        merged_data = merged_data.join(cashflow, how='left', rsuffix='_cashflow')
        
        # Forward filling the NaN values
        merged_data[selected_cashflow_metrics] = merged_data[selected_cashflow_metrics].ffill()
        
        

        # Moving Averages
        merged_data['MA_5'] = merged_data['Close'].rolling(window=5).mean()
        merged_data['MA_10'] = merged_data['Close'].rolling(window=10).mean()
        merged_data['MA_30'] = merged_data['Close'].rolling(window=30).mean()
        merged_data['MA_50'] = merged_data['Close'].rolling(window=50).mean()
        
        # RSI
        delta = merged_data['Close'].diff()
        gain = (delta.where(delta > 0, 0)).fillna(0)
        loss = (-delta.where(delta < 0, 0)).fillna(0)
        avg_gain = gain.rolling(window=14).mean()
        avg_loss = loss.rolling(window=14).mean()
        rs = avg_gain / avg_loss
        merged_data['RSI'] = 100 - (100 / (1 + rs))
        
        # MACD
        merged_data['MACD'] = merged_data['Close'].ewm(span=12, adjust=False).mean() - merged_data['Close'].ewm(span=26, adjust=False).mean()
        merged_data['Signal_Line'] = merged_data['MACD'].ewm(span=9, adjust=False).mean()
        
        # Bollinger Bands
        merged_data['Bollinger_Mid_Band'] = merged_data['Close'].rolling(window=20).mean()
        merged_data['Bollinger_Upper_Band']  = merged_data['Bollinger_Mid_Band'] + 1.96*merged_data['Close'].rolling(window=20).std()
        merged_data['Bollinger_Lower_Band']  = merged_data['Bollinger_Mid_Band'] - 1.96*merged_data['Close'].rolling(window=20).std()
        
        # Volatility
        merged_data['Volatility'] = merged_data['Daily_Return'].rolling(window=5).std()
        
        to_drop_na = ['MA_5', 'MA_10', 'MA_30', 'MA_50', 'RSI', 'Volatility', 'Cash Flow From Continuing Operating Activities', 'Cash Flow From Continuing Investing Activities', 'Cash Flow From Continuing Financing Activities', 'Net Income', 'Diluted EPS', 'Total Revenue','Normalized EBITDA', 'Total Unusual Items', 'Total Unusual Items Excluding Goodwill', 'Operating Cash Flow', 'Capital Expenditure','Free Cash Flow',]
    
        # Label every row with the ticker derived from the file name
        # (e.g. "./data/1398.HK.xlsx" -> "1398HK")
        merged_data['Ticker'] = file.split("/")[2].replace(".", "")[:-4]
        
        for column in to_drop_na:
            merged_data[column] = merged_data[column].fillna(0)

        merged_data = merged_data[merged_data.index >= '2020-06-30']
        #indices_to_drop = merged_data.index[merged_data.isna().sum(axis=1) > 3].tolist()
        
        #merged_data.drop(indices_to_drop, inplace=True)
        
        # Export in Excel company data
        os.makedirs('./Processed', exist_ok=True)
        with pd.ExcelWriter(f'./Processed/{file.split("/")[2][:-5]}.xlsx', mode = "w", engine = "openpyxl") as writer:
            merged_data.to_excel(writer, sheet_name="Sheet1")
        # Append to one single dataframe
        df = pd.concat([df, merged_data])


## Exporting the final dataset

In [42]:
output_filepath = "final_dataset.xlsx"
df.to_excel(output_filepath)

In [None]:
# TODO: Polish the markdown for the code block that does everything, and for the other two.
# TODO: Proceed with training if the dataset is considered ready.
# TODO: Replicate with the cryptocurrencies.