# Data Acquisition and Processing Systems (DaPS) (ELEC0136)    
### Final Assignment
---

In [1]:
# Importing package libraries.
import pandas as pd
import numpy as np
import os
from datetime import datetime
import time
import matplotlib.pyplot as plt
import csv
import sys
import requests
import json
import urllib
import re
import yfinance as yf
import seaborn as sns
import io
from pandas import json_normalize
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.request import Request
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from meteostat import Point
from meteostat import Daily
import dataframe_image as dfi
from matplotlib.ticker import MultipleLocator
import scipy.stats
from scipy.stats import pearsonr
from scipy.stats import spearmanr
from math import *
from scipy.spatial.distance import minkowski
from calendar import day_abbr, month_abbr, mdays
from scipy.stats import chi2_contingency
from scipy.stats import chi2
import matplotlib.image as mpimg
from fbprophet import Prophet
import holidays


<div class="alert alert-heading alert-info">

#### Task 1: Data Acquisition

You will first have to acquire the necessary data for conducting your study. One essential type of
data that you will need, are the stock prices for each company from April 2017 to April 2021 as
described in Section 1. Since these companies are public, the data is made available online. The
first task is for you to search and collect this data, finding the best way to access and download
it. A good place to look is on platforms that provide free data relating to the stock market such as
Google Finance or Yahoo! Finance.

[Optional] Providing more than one method to acquire the very same or different data, e.g. from
a downloaded comma-separated-value file and a web API, will result in a higher score.

There are many valuable sources of information for analysing the stock market. In addition to time
series depicting the evolution of stock prices, acquire auxiliary data that is likely to be useful for
the forecast, such as:

- Social Media, e.g., Twitter: This can be used to uncover the public’s sentimental
response to the stock market
- Financial reports: This can help explain what kind of factors are likely to affect the stock
market the most
- News: This can be used to draw links between current affairs and the stock market
- Climate data: Sometimes weather data is directly correlated to some companies’ stock
prices and should therefore be taken into account in financial analysis
- Others: anything that can justifiably support your analysis.

Remember, you are looking for historical data, not live data.
   
    
</div>

### Extracting stock data for Apple Inc.

I have chosen Apple because **(TO INCLUDE ONLY IN REPORT)**:
- I am not familiar enough with the business of **American Airlines**: are its flights mostly domestic or international, what was the effect of the pandemic in the US, and was there an actual quarantine?
- **Zoom** only started trading on the stock market in April 2019, and I didn't think two years was a long enough period to perform detailed analysis on.
- That left me with either Apple or **Microsoft**; I felt that Apple offers a larger variety of products and is a bigger company.


In [2]:
def pandas_settings():
    # Here I have set pandas preferences so that all data is displayed appropriately in the dataframes.
    pd.set_option("display.max_columns", None)
    pd.set_option("display.max_rows", 10)
    pd.set_option("display.float_format", '{:,.3f}'.format)
    return


#### 1. Acquire data using the Yahoo Finance API.

In [3]:
def acquire_stock_data():
    """
    Acquiring Stock Data
    """
    stock_data = yf.download("AAPL", # Define Apple (AAPL) as our stock option.
                             start = "2017-04-01", # Define start and end dates covering April 2017 to April 2021.
                             end = "2021-04-30",
                             interval = "1d", # Set interval to output daily results of the stock.
                             progress = False) # If True, it would print the download progress bar as the data is downloaded.
    stock_data.to_csv("Apple_Stock_2017_2021_API.csv") # Store the data locally as a .csv file.
    
    """
    Alternative.
    """
    #web.DataReader("AAPL", data_source = "yahoo", start = "2017-04-01", end = "2021-04-30")
    
    return


`acquire_stock_data()` downloads the Apple stock data and stores it locally as a `.csv` file.
This data is often referred to as **OHLC Chart Data**.

### TO INCLUDE IN REPORT
- Open: Price at which the stock started trading at the beginning of the day.
- Close: Price at which the stock trades at the end of the day.
- High: Highest price of the stock throughout the whole day (intraday high).
- Low: Lowest price of the stock throughout the whole day (intraday low).
- Volume: Total number of shares traded throughout the whole day.
- **Adjusted close: This is the close value incorporating changes occurring from corporate actions like dividend payments, stock splits or new share issuance.**


#### Or, downloading manually from the web in Yahoo Finance.

In this case we manually export a csv file from the [Yahoo Finance website](https://uk.finance.yahoo.com/quote/AAPL/history?period1=1491004800&period2=1619740800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true) by:
1. Selecting the "Historical Data" tab.
2. Selecting the appropriate time period (from April 2017 to April 2021).
3. Selecting the "Frequency" as "Daily" to get daily data.
4. Downloading the data and naming it **Apple_Stock_2017_2021_Web.csv**.


<img src="./Images Code/Alternative_Data_Acquisition_Web.png" style="width:1000px;height:420px"/>

As we can see by comparing the two dataframes, these two methods output the same data.

##### However, **I will use the API method as it will automate the whole code.**

For this reason, I have inserted an image of the code that reads the manually downloaded `.csv` file instead of including the actual code, as running it would raise an error unless the `.csv` file has first been downloaded by hand.
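
As a rough illustration of how the two acquisition routes can be compared, the sketch below reads two price files and checks that they hold the same values. The in-memory CSV text, the column layout and the helper name `load_prices` are assumptions made for the example; the real files may have a different header layout depending on the library version.

```python
import io

import pandas as pd


def load_prices(source):
    """Read a price CSV (path or file-like object) and index it by date."""
    return pd.read_csv(source, parse_dates=["Date"], index_col="Date")


# Tiny in-memory stand-ins for the API and web files; the numbers are made up.
api_csv = io.StringIO("Date,Open,Close\n2017-04-03,35.9,35.9\n2017-04-04,35.8,36.2\n")
web_csv = io.StringIO("Date,Open,Close\n2017-04-03,35.9,35.9\n2017-04-04,35.8,36.2\n")

api_df = load_prices(api_csv)
web_df = load_prices(web_csv)

# Round before comparing so that tiny floating point differences do not matter.
frames_match = api_df.round(4).equals(web_df.round(4))
print("Same data from both methods:", frames_match)
```

With the real files, the two `io.StringIO` objects would be replaced by the paths `Apple_Stock_2017_2021_API.csv` and `Apple_Stock_2017_2021_Web.csv`.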

### Extracting auxiliary data.

#### 1. Social Media.

I tried and failed with:
- `twint` only outputs data from April 30th 2021.
- `tweepy` does not grant access to social media data with the regular API access. My application for the premium API was rejected by `Twitter`.
- `financial modelling prep API`: social media response data is only available from November 2021. For gathering other data this API was useful, as shown later on.
- `iex cloud` only offers social media response data to premium users. This costs about 100 GBP.

##### As a consequence, I have used `Kaggle`. Here, I found a dataset containing Twitter responses to Apple stock. However, the data only runs from 2016 to September 2019. It does not cover the whole period we want to analyse, but it offers more information than any of the other attempted methods.

##### Twint

<img src="./Images Code/Twint.png" style="width:1000px;height:300px"/>

##### Tweepy

<img src="./Images Code/Tweepy.png" style="width:1000px;height:300px"/>

##### Financial Modelling Prep

In [4]:
def alternative_fmp():
    """
    Alternative acquisition.
    """
    # This shows all historical social sentiment to Apple stocks.
    sm = requests.get("https://financialmodelingprep.com/api/v4/historical/social-sentiment?symbol=AAPL&limit=100000000&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    smr = sm.json()
    social_media_response = pd.DataFrame.from_dict(smr)
    
    return social_media_response


#### Kaggle Dataset

In [5]:
def acquire_twitter_sentiment_data():
    """
    Acquiring Twitter response data.
    """
    # Defining my details in order to access the Kaggle data.
    os.environ['KAGGLE_USERNAME'] = "borjamartina"  
    os.environ['KAGGLE_KEY'] = "6a5fcfa43e992444bcaa83b7a32009b1"
    
    !kaggle datasets download -d nadun94/twitter-sentiments-aapl-stock
    
    twitter_data = pd.read_csv("twitter-sentiments-aapl-stock.zip", compression = "zip")
    twitter_data.to_csv("Twitter_Sentiment_Apple_Stock.csv", index = False)
    
    return


#### 2. Financial Reports.

This includes financial statistics of the company. It can be accessed, again, by using the **Yahoo Finance API**:
- `yf.Ticker("AAPL").actions`: dividends and stock splits. **++++++++++++++++**
- `si.get_stats("AAPL")`: P/E ratio.
- `si.get_quote_table("AAPL", dict_result=False)`: 52 week range, forward dividend & yield, market cap, PE ratio.
- `si.get_stats("AAPL")`: shares short, short ratio, forward/trailing annual dividend, revenue, revenue per share, return on assets & equity, gross profit, earnings growth, total cash and per share, total debt, operating cash flow.
- `yf.Ticker("AAPL").quarterly_financials`: research development, income pretax, net income, ebit, total revenue, cost of revenue... **++++++++++++++++**
- `yf.Ticker("AAPL").quarterly_balance_sheet`: total stock holder equity, total assets, treasury stock, net tangible assets, short/long term investments, long term debt... **++++++++++++++++**
- `yf.Ticker("AAPL").quarterly_cashflow`: operating/investing/financing cash flow, capital expenditure, dividends paid, net income... **++++++++++++++++**
- `yf.Ticker("AAPL").major_holders`: % of shares held by institutions/all insiders... **++++++++++++++++**
- `yf.Ticker("AAPL").institutional_holders`: major shareholders. **++++++++++++++++**
- `yf.Ticker("AAPL").sustainability`: are they environmentally friendly?, social score, environmental score... **++++++++++++++++**

##### Or by using `si.get_`


Unfortunately, the Yahoo Finance API doesn't offer the historical financial data we want. Therefore, I decided to use the [Financial Modelling Prep](https://site.financialmodelingprep.com/developer/docs#FMP-Articles) API.

In [6]:
def acquire_balance_sheet():
    """
    Acquiring Apple Balance Sheet
    """
    b = requests.get("https://financialmodelingprep.com/api/v3/balance-sheet-statement-as-reported/AAPL?period=quarter&limit=20&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    bs = b.json()
    balance_sheet = pd.DataFrame.from_dict(bs)
    balance_sheet.to_csv("Apple_Balance_Sheet_2017_2021.csv", index = False) # Store the data locally as a .csv file.
    
    return


In [7]:
def acquire_cash_flow_statement():
    """
    Acquiring Apple Cash Flow Statement
    """
    cf = requests.get("https://financialmodelingprep.com/api/v3/cash-flow-statement-as-reported/AAPL?period=quarter&limit=20&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    cfs = cf.json()
    cash_flow = pd.DataFrame.from_dict(cfs)
    cash_flow.to_csv("Apple_Cash_Flow_Statement_2017_2021.csv", index = False) # Store the data locally as a .csv file.
    
    return


In [8]:
def acquire_financial_ratios():
    """
    Acquiring Apple Financial Ratios
    """
    fr = requests.get("https://financialmodelingprep.com/api/v3/ratios/AAPL?period=quarter&limit=20&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    frs = fr.json()
    financial_ratios = pd.DataFrame.from_dict(frs)
    financial_ratios.to_csv("Apple_Financial_Ratios_2017_2021.csv", index = False) # Store the data locally as a .csv file.
    
    return


In [9]:
def acquire_income_statement():
    """
    Acquiring Apple Income Statement
    """
    f = requests.get("https://financialmodelingprep.com/api/v3/income-statement/AAPL?period=quarter&limit=20&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    fs = f.json()
    income_statement = pd.DataFrame.from_dict(fs)
    income_statement.to_csv("Apple_Income_Statement_2017_2021.csv", index = False) # Store the data locally as a .csv file.
    
    return


### Function to acquire all financial data.

In [10]:
def acquire_financial_data():
    """
    Acquiring financial data to help with the stock analysis.
    """
    # Balance sheets
    acquire_balance_sheet()
    
    # Cash flow statements
    acquire_cash_flow_statement()
    
    # Financial ratios
    acquire_financial_ratios()
    
    # Income statements
    acquire_income_statement()
    
    return 
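
The four acquisition functions above differ only in the endpoint name and the output file. A minimal sketch of how the request URL could be assembled in one place is shown below, with the API key read from an environment variable rather than written into the notebook. The helper `build_fmp_url` and the variable name `FMP_API_KEY` are hypothetical and only serve the illustration.

```python
import os
from urllib.parse import urlencode

FMP_BASE = "https://financialmodelingprep.com/api/v3"


def build_fmp_url(endpoint, symbol, api_key, **params):
    """Assemble a Financial Modelling Prep URL from its parts (hypothetical helper)."""
    query = urlencode({**params, "apikey": api_key})
    return f"{FMP_BASE}/{endpoint}/{symbol}?{query}"


# FMP_API_KEY is an assumed environment variable name, not something the API requires.
api_key = os.environ.get("FMP_API_KEY", "")

for endpoint in ["balance-sheet-statement-as-reported", "ratios", "income-statement"]:
    url = build_fmp_url(endpoint, "AAPL", api_key, period="quarter", limit=20)
    print(url.split("?")[0])  # Print the URL without the query string (and key).
```

Each URL can then be passed to `requests.get` exactly as in the functions above.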


#### 3. News reports.

- `finviz` only displays recent reports, within the last month.
- `Yahoo Finance` only displays the last few recent news reports.
- `Financial Modelling Prep` only offers data from December 2020.

Finally, I used `Seeking Alpha`. This API is the best fit, as it offers full coverage of the time period in question.

##### Finviz

In [11]:
def alternative_finviz():
    """
    Alternative.
    """
    finviz_url = "https://finviz.com/quote.ashx?t="
    ticker = "AAPL"
    
    url = finviz_url + ticker
    req = Request(url = url, headers = {"user-agent": "my-app"})
    response = urlopen(req)
    
    html = BeautifulSoup(response, features = "html.parser")
    news_table = html.find(id = "news-table")
    
    return news_table


##### Yahoo Finance

In [12]:
def alternative_yf():
    """
    Alternative.
    """
    json_yf = yf.Ticker("AAPL").news
    stock_news_yf = pd.DataFrame.from_dict(json_yf)
    
    return stock_news_yf


##### Financial Modelling Prep

In [13]:
def alternative_fmp_news():
    """
    Alternative.
    """
    sn = requests.get("https://financialmodelingprep.com/api/v3/stock_news?tickers=AAPL&page=144&apikey=9e891fadc99cdfa0f9cd27949a4ceb4d")
    sns = sn.json()
    
    stock_news = pd.DataFrame.from_dict(sns)
    
    return stock_news


#### Seeking Alpha

In [14]:
def acquire_news_data():
    """
    Acquiring News Data
    """
    headers = {
    "x-rapidapi-host": "seeking-alpha.p.rapidapi.com",
    "x-rapidapi-key": "f090a7b7fcmshb87ac67ed51063ep1324e2jsnf3480088a86a"
    }
    
    limit = 40
    news = []
    
    for page_num in range(1,59):
        page = "%s" % page_num
        url = "https://seeking-alpha.p.rapidapi.com/news/v2/list-by-symbol"
        querystring = {"id":"aapl","until":"1619740800","since":"1491004800","size":limit,"number":page}
        response = requests.request("GET", url, headers=headers, params=querystring)
        response_json = response.json()
        news.extend(response_json["data"])
        
    newss = json_normalize(news)
    apple_news = newss[["attributes.publishOn", "attributes.title"]]
    apple_news = apple_news.rename(columns={"attributes.publishOn": "DateTime", "attributes.title": "Title"})
    
    """
    Sentiment Analysis
    """
    nltk.download("vader_lexicon", quiet = True)
    
    def sentiment_analysis(dataset):
        results = []
        analyzer = SIA() # Create the analyzer once instead of once per headline.
        
        # Use the NLTK VADER analyzer to convert headline titles to sentiment scores.
        for headline in dataset["Title"]:
            pol_score = analyzer.polarity_scores(headline) # Run analyzer.
            pol_score["headline"] = headline
            results.append(pol_score) # Collect the polarity scores.
        
        # Keep only "compound", a single sentiment score per headline, and add it as a new column in our dataframe.
        dataset["Score"] = pd.DataFrame(results, index = dataset.index)["compound"]
        
        # Now that we have the sentiment score, we can delete the "Title" column.
        dataset.drop("Title", axis = 1, inplace = True)
        
        return dataset
    
    sentiment_analysis(apple_news)
    
    apple_news.to_csv("Apple_News_Scores.csv")
    
    return
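
The `since` and `until` values in the query string above are Unix timestamps in seconds. As a small sketch, they can be derived from calendar dates instead of being typed in by hand; the helper name `to_epoch` is made up for this example, and midnight UTC is assumed.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


since = to_epoch(2017, 4, 1)
until = to_epoch(2021, 4, 30)
print(since, until)
```

These are the same two numbers that appear as `period1` and `period2` in the Yahoo Finance download link.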


#### 4. Weather data.

In [15]:
def acquire_weather_data():
    """
    Acquiring London Weather Data
    """
    # Set time period.
    start = datetime(2017, 4, 1)
    end = datetime(2021, 4, 30)
    
    # Create Point by setting the coordinates for London.
    location = Point(51.5085, -0.1257, 25)
    
    data = Daily(location, start, end)
    london_weather = data.fetch()
    
    london_weather.to_csv("London_Weather_2017_2021.csv")
    
    return


# Acquire all data

In [16]:
def acquire_all():
    """
    Stock data.
    """
    acquire_stock_data()
    
    """
    Twitter sentiment data.
    """
    acquire_twitter_sentiment_data()
    
    """
    Financial data.
    """
    acquire_financial_data()
    
    """
    News sentiment data.
    """
    acquire_news_data()
    
    """
    Weather data.
    """
    acquire_weather_data()
    
    return


<div class="alert alert-heading alert-info">
    
## Task 2: Data Storage

Once you have found a way to acquire the relevant data, you need to decide on how to store it.
You should choose a format that allows an efficient read access to allow training a parametric
model. Also, the data corpus should be such that it can be easily inspected. Data can be stored
locally, on your computer.
    
</div>

#### I have stored my data in `Oracle Apex` because I had previously acquired all my data in a structured, tabular form that suits a SQL database.

I stored the data manually through the web interface due to compatibility issues between my Mac and the Oracle database client. The following steps outline how the data was stored.

1. Created a workspace named `ucl_daps_final_assignment`.
2. Then, to upload the first file: SQL Workshop -> Utilities -> Data Workshop -> Load Data.

<div class="alert alert-heading alert-warning">

[Optional] Create a simple API to allow Al retrieving the compound of data you collected. It is enough to provide a single access point to retrieve all the data, and not implement query mechanism. The API must be accessible from the web. If you engage in this task data must be stored online.  
    
</div>

In [17]:
def retrieve_stored_data(dataset):
    """
    Retrieve the data stored in Oracle Apex through the REST API.
    """
    base_url = "https://apex.oracle.com/pls/apex/ucl_daps_final_assignment/"
    limit = "?limit=2340" # This limit is set high enough for all rows of every data file to be returned.
    r = requests.get(base_url + dataset + limit).json()
    df = json_normalize(r["items"]).sort_values("id") # Only output the data itself ("items").
    
    return df
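
To make the `r["items"]` step more concrete, the sketch below flattens a hand-written payload shaped like the JSON the function expects. The field names and values are invented for illustration and are not taken from the real database.

```python
import pandas as pd

# Hypothetical response body: the rows sit under "items", next to paging metadata.
payload = {
    "items": [
        {"id": 2, "date_": "2017-04-04T00:00:00Z", "adj_close": 34.1},
        {"id": 1, "date_": "2017-04-03T00:00:00Z", "adj_close": 33.9},
    ],
    "hasMore": False,
}

# Flatten only the rows and restore the original row order through "id".
df = pd.json_normalize(payload["items"]).sort_values("id")
print(df)
```

Only the contents of `"items"` end up in the dataframe; any metadata next to it is ignored.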


## Retrieve all data.

In [18]:
def retrieve_all():
    """
    Retrieve each dataset from the Oracle SQL database.
    """
    dfBalanceSheet = retrieve_stored_data(dataset = "balance_sheet")
    dfCashFlow = retrieve_stored_data(dataset = "cash_flow")
    dfFinancialRatios = retrieve_stored_data(dataset = "financial_ratios")
    dfIncomeStatement = retrieve_stored_data(dataset = "income_statement")
    dfStocks = retrieve_stored_data(dataset = "stocks")
    dfNews = retrieve_stored_data(dataset = "news")
    dfWeather = retrieve_stored_data(dataset = "weather")
    dfTwitter = retrieve_stored_data(dataset = "twitter")
    
    return (dfBalanceSheet, dfCashFlow, dfFinancialRatios, dfIncomeStatement, dfStocks,
            dfNews, dfWeather, dfTwitter)


##### Alternative code to retrieve the data from the Oracle Apex database. In this case the code performs the same retrieval with a for loop.

In [19]:
def alternative_retrieval():
    """
    Alternative method.
    """
    # Define the url and parameter queries.
    base_url = "https://apex.oracle.com/pls/apex/ucl_daps_final_assignment/"
    limit = "?limit=2340"
    
    # State all the files we want to retrieve.
    files = ["balance_sheet", "cash_flow", "financial_ratios", "income_statement", "stocks", 
             "news", "weather", "twitter"]
    
    data = {}
    
    # Perform the URL request for each of the above file retrievals.
    for file in files:
        r = requests.get(base_url + str(file) + limit).json()
        data[file] = (json_normalize(r["items"])).sort_values("id")
    
    return data


<div class="alert alert-heading alert-info">

## Task 3: Data Preprocessing

Now that you have the data stored, you can start preprocessing it. Think about what features to
keep, which ones to transform, combine or discard. Make sure your data is clean and consistent
(e.g., are there many outliers? any missing values?). You are expected to:

1. Clean the data from missing values and outliers, if any.
2. Provide useful visualisation of the data. Plots should be saved on disk, and not printed on
the juptyer notebook.
3. Transform your data (e.g., using normalization, dimensionality reduction, etc.) to improve
the forecasting performance.

</div>

## 1. Initial Data Cleaning

#### First, let's set up our workspace appropriately and define functions to save images, etc.

Now that I will begin to explore the data, let's create a new directory to store the figures.

In [20]:
def create_new_dir():
    """
    Create directory.
    """
    new_dir_path = "Saved Figures" # New folder.
    os.makedirs(new_dir_path, exist_ok = True)
    
    return


Now that we have all the data structured properly, let's save a figure of two of the dataframes for reference.

In [21]:
def save_df_normal(df, title):
    """
    Save dataframe as an image normally.
    """
    try:
        if len(df) > 10:
            # Keep only the first and last 5 rows of the respective dataframe, as some have far too many rows to display.
            df = pd.concat([df.head(5), df.tail(5)])
    except TypeError:
        pass
    
    # Export as image.
    dfi.export(df, "./Saved Figures/" + str(title) + ".png")
    
    return


In [22]:
def save_df_gradient(df, title):
    """
    Save dataframe in a colourway defined by the gradient of each item.
    """
    if len(df) > 10:
        # Keep only the first and last 5 rows of the respective dataframe, as some have far too many rows to display.
        df = pd.concat([df.head(5), df.tail(5)])
    
    # Export as image.
    df_styled = df.style.background_gradient()# Gradient coloring style.
    dfi.export(df_styled, "./Saved Figures/" + str(title) + ".png")
    
    return


In [23]:
def save_df_max(df, title):
    """
    Save dataframe and highlight the max value from each column.
    """
    if len(df) > 10:
        # Keep only the first and last 5 rows of the respective dataframe, as some have far too many rows to display.
        df = pd.concat([df.head(5), df.tail(5)])
    
    # Export as image.
    df_styled = df.style.highlight_max()
    dfi.export(df_styled, "./Saved Figures/" + str(title) + ".png")
    
    return


#### Now, let's define a few functions in order to organise and normalize the dataframes.

In [24]:
# Function to drop a list of columns from a dataframe.
def drop_columns(columns, df):
    # Select each item from the list of columns to drop.
    for item in columns:
        if item in df.columns:
            df.drop(item, axis = 1, inplace = True)
    return


In [25]:
# Function to rename a list of columns from a dataframe.
def rename_columns(columns, df):
    # Select each item from the list.
    for heading in columns:
        df.rename(columns = heading, inplace = True)# Rename.
    return


**This function will perform the data cleaning steps that are required in all dataframes.**

In [26]:
def intial_column_cleaning():
    """
    This function will perform the initial data cleaning common for all dataframes.
    """
    # Define the dataframes we want to clean.
    dfs = [dfBalanceSheet, dfCashFlow, dfFinancialRatios, dfIncomeStatement,
           dfStocks, dfNews, dfWeather, dfTwitter]
    
    # List of columns to delete in the first section of loop.
    del_cols = ["id", "links", "link", "finallink", "column_", "symbol", "reportedcurrency", "cik", "period",
                "calendaryear", "fillingdate", "accepteddate"]
    
    # Select dataframes within the "dfs" list.
    for df in dfs:
        """
        Drop the current index.
        """
        df.reset_index(drop = True, inplace = True)
        
        
        """
        Drop useless columns generated through storage and retrieval of data.
        """
        drop_columns(columns = del_cols, df = df)
        
        
        """
        Rename date columns to a common name "DateTime".
        """
        # List of column names for "Date" in each dataframe.
        time_formats = ["date_", "datetime", "time"]
        
        # Select column names within "time_formats"
        for title in time_formats:
            if title in df.columns:# If column named like that, rename to common "DateTime".
                df.rename(columns = {title: "DateTime"}, inplace=True)
        
        
        """
        Change DateTime format to a simple YYYY-MM-DD format.
        """
        new_date_list = []
        
        # Select all row values in the "DateTime" column.
        for rows in df["DateTime"]:
            date_format = datetime.strptime(rows, "%Y-%m-%dT%H:%M:%SZ")# Original format.
            new_dates = date_format.strftime("%Y-%m-%d")# New format.
            new_date_list.append(new_dates)# Apply.
        
        # Create new column with the new date format. 
        df["Date"] = new_date_list
        
        # Delete old datetime column. 
        del df["DateTime"]
        
        """
        Drop columns with more than 20% missing values.
        Except for the Weather dataframe, as "snow" obviously has many missing values, but it's still valuable.
        """
        # Select for all dataframes except for weather (thus "snow").
        if "snow" not in df.columns:
            thresh = len(df) * .8# Keep only columns with at least 80% non-missing values.
            df.dropna(thresh = thresh, axis = 1, inplace = True)# Drop
            
    return
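
Two steps of the cleaning function can be illustrated on a toy dataframe: converting the timestamp strings to `YYYY-MM-DD` and dropping sparse columns with `thresh`. This is a sketch with invented column names and values, and it uses a vectorised conversion rather than the row-by-row loop above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "DateTime": ["2017-04-03T00:00:00Z", "2017-04-04T00:00:00Z", "2017-04-05T00:00:00Z",
                 "2017-04-06T00:00:00Z", "2017-04-07T00:00:00Z"],
    "mostly_present": [1.0, 2.0, np.nan, 4.0, 5.0],     # 20% missing -> kept
    "mostly_missing": [np.nan, np.nan, 3.0, 4.0, 5.0],  # 40% missing -> dropped
})

# Parse the whole column at once, then format it as YYYY-MM-DD strings.
parsed = pd.to_datetime(df["DateTime"], format="%Y-%m-%dT%H:%M:%SZ")
df["Date"] = parsed.dt.strftime("%Y-%m-%d")
df = df.drop(columns="DateTime")

# thresh is the minimum number of NON-missing values a column needs to survive.
min_non_missing = int(np.ceil(0.8 * len(df)))
cleaned = df.dropna(axis=1, thresh=min_non_missing)
print(cleaned.columns.tolist())
```

Note that `thresh` counts values that are present, so a threshold of 80% of the rows removes the columns with more than 20% missing.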


Define our updated dataframes. For the financial data we shift the whole index forward by 2 days, because all of the statements are dated on Saturdays and the stock market is only open on weekdays. The dates then fall on Mondays, so when we later combine these dataframes with the stock dataframe, the values have matching dates.

In [27]:
# Define a function to use for the other dataframes later.
def sort_datetime(df):
    df_period = df.loc[(df["Date"] >= "2017-04-01") & (df["Date"] <= "2021-04-30")]# Select time period.
    sorted_df = df_period.sort_values(by = "Date", ascending = True)# Sort values in ascending order.
    sorted_df.set_index("Date", inplace = True)# Convert the "Date" column to index.
    sorted_df.index = pd.to_datetime(sorted_df.index, format = "%Y-%m-%d")
    sort = sorted_df.astype(float)
    
    return sort


In [28]:
def sort_dfs_datetime():
    """
    Apply sort_datetime function to all dfs.
    """
    BalanceSheet = (sort_datetime(df = dfBalanceSheet)).shift(2, freq = "D")
    CashFlow = (sort_datetime(df = dfCashFlow)).shift(2, freq = "D")
    FinancialRatios = (sort_datetime(df = dfFinancialRatios)).shift(2, freq = "D")
    IncomeStatement = (sort_datetime(df = dfIncomeStatement)).shift(2, freq = "D")
    Stocks = sort_datetime(df = dfStocks)
    News = sort_datetime(df = dfNews)
    Weather = sort_datetime(df = dfWeather)
    Twitter = sort_datetime(df = dfTwitter)
    
    return (BalanceSheet, CashFlow, FinancialRatios, IncomeStatement,
            Stocks, News, Weather, Twitter)
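
As a quick check of what `shift(2, freq = "D")` does, the sketch below moves two Saturday quarter-end dates forward to the following Monday. The column name and values are placeholders.

```python
import pandas as pd

# Two quarter-end dates that fall on a Saturday.
quarter_ends = pd.to_datetime(["2017-04-01", "2017-07-01"])
report = pd.DataFrame({"total_assets": [1.0, 2.0]}, index=quarter_ends)

# With freq set, shift moves the index itself and leaves the values untouched.
shifted = report.shift(2, freq="D")

print(list(report.index.dayofweek))   # 5 = Saturday
print(list(shifted.index.dayofweek))  # 0 = Monday
```

Because only the index changes, no rows are lost and no missing values are introduced by the shift.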


## 2. Data Reduction

As I was only able to acquire Twitter data from April 2017 to September 2019, there are going to be many missing values in `twitter_volume`.

In [29]:
def group_sentimental_scores(News, Twitter):
    """
    Groupby to get daily data.
    """
    News.rename(columns = {"score": "Score"}, inplace = True)# Rename.
    Twitter.rename(columns = {"twitter_volume": "Twitter Volume", "ts_polarity": "Twitter Score"}, inplace = True)# Rename.
    
    # Groupby in order to only have one score per day.
    # Mean makes more sense than the sum due to the fact that it gives an average value for the sentiment.
    News = News.groupby(News.index).mean()
    Twitter = Twitter.groupby(Twitter.index).mean()
    
    return News, Twitter
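
The effect of grouping by the index can be seen on a small example with two headlines on the same day. The scores below are made up.

```python
import pandas as pd

idx = pd.to_datetime(["2017-04-03", "2017-04-03", "2017-04-04"])
news = pd.DataFrame({"Score": [0.5, -0.1, 0.3]}, index=idx)

# One row per calendar day, holding the mean of that day's scores.
daily = news.groupby(news.index).mean()
print(daily)
```

The two scores on April 3rd collapse into a single average, so every date appears only once before merging with the stock data.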


Currently, our data is far too large, so we have to perform some form of data reduction, especially on the financial dataframes.

There are two main ways to approach this:
- PCA.
- Correlation analysis of variables.

Perhaps the most effective approach is PCA; however, in this particular project I will use correlation analysis. The main reason is that throughout this project I also want to evaluate the significance of every financial variable and interpret it in context. With PCA this is not possible, as each component is extracted as a mixture of different variables. I will therefore reduce the size of the data by calculating the Spearman and Pearson correlation coefficients with respect to the adjusted closing stock price, and only keep the variables which have a significant level of correlation with stock prices.

To give an idea of how PCA is applied, the figure below shows a snippet of the code used to perform this type of data reduction.

<img src="./Images Code/PCA.png" style="width:1000px;height:300px"/>
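
For readers without access to the figure, here is a minimal, self-contained sketch of PCA computed with a singular value decomposition in NumPy. It is not the code shown in the image; the function name `pca` and the random data are assumptions for the example.

```python
import numpy as np


def pca(X, n_components):
    """Project standardised data onto its first principal components."""
    X = np.asarray(X, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)         # Standardise each column.
    _, S, Vt = np.linalg.svd(Xs, full_matrices=False)
    explained = S**2 / np.sum(S**2)                   # Share of variance per component.
    scores = Xs @ Vt[:n_components].T                 # Coordinates in component space.
    return scores, explained[:n_components]


# Synthetic data: the second column is almost a multiple of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
X = np.hstack([base,
               2 * base + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 1))])

scores, ratio = pca(X, n_components=2)
print(scores.shape, ratio)
```

Each component is a weighted mixture of all three input columns, which is exactly why the individual variables can no longer be interpreted after the projection.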

### Correlation Analysis

1. Clean missing values so that the correlation calculation isn't affected.
2. Merge each financial and weather dataframe with the stock data in order to directly analyse each variable's impact on stock prices.
3. Build correlations.
4. Filter columns through correlation values.

#### 1. Cleaning missing values.

First, we have to evaluate `Weather`. The `snow` and `prcp` variables have many zero and missing values; this is totally normal and therefore not a concern. For `snow` we replace any missing values with zeros, because this is an accurate representation of the data, while missing `prcp` values are filled with the column mean like the other weather variables. However, the `wpgt` variable does not have any non-zero values until September 28th 2018, which shows that a large part of it is missing. We therefore create a new dataframe with just the `wpgt` variable for the period in which it has actual values; this avoids distorting the correlation values or the time series plots. We could also have imputed the values, but the period with missing values is so long that the imputation would not be accurate enough.

In [30]:
def prepare_weather_data(Weather):
    """
    Split weather data corresponding to datetime periods available.
    """
    WindData = Weather["wpgt"].iloc[544:]# Select the wind data from the point where actual values are recorded.
    Wind = pd.DataFrame(WindData)# Define the new Wind dataframe.
    Weather.drop("wpgt", axis = 1, inplace = True)# Drop the wind value from the original Weather.
    
    return Wind, Weather
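
The row position `544` is specific to this dataset. As a sketch of an alternative, the first usable row can be looked up instead of hard-coded, assuming the missing gust readings are stored as `NaN` rather than as zeros. The small dataframe below is invented for the example.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-09-25", periods=6, freq="D")
weather = pd.DataFrame({
    "tavg": [15.0, 14.0, 13.5, 14.2, 13.0, 12.8],
    "wpgt": [np.nan, np.nan, np.nan, 31.0, 28.5, 40.1],
}, index=idx)

# Index label of the first row that holds an actual wind gust value.
first_gust = weather["wpgt"].first_valid_index()

wind = weather.loc[first_gust:, ["wpgt"]]  # Wind-only dataframe from that date on.
weather = weather.drop(columns="wpgt")     # Remaining weather variables.
print(first_gust.date(), len(wind))
```

If the missing readings were stored as zeros instead, they would first have to be converted to `NaN` for this lookup to work.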


In [31]:
def clean_missing_values():
    """
    Replace missing values.
    """
    dfs = [BalanceSheet, CashFlow, FinancialRatios, IncomeStatement, Stocks,
           News, Weather, Wind, Twitter]
    
    for df in dfs:
        # For all dataframes except Weather.
        if "snow" not in df.columns:
            df.fillna(df.mean(), inplace = True)
        
        # Now the Weather.
        if "snow" in df.columns:
            df.drop("tsun", axis = 1, inplace = True)# Drop as it has all null values.
            
            # Replace these variables' missing values with their mean.
            normal_cols = ["tavg", "tmin", "tmax", "prcp", "wdir", "wspd", "pres"]
            
            for col in normal_cols:
                df[col] = df[col].fillna(df[col].mean())
            
            # For "snow", replace with 0 because snow is rare in London.
            df["snow"] = df["snow"].fillna(0)


In [32]:
def merge_close(df):
    """
    Merge the adjusted closing stock price with the given dataframe.
    """
    merged = pd.merge(Stocks["adj_close"], df, left_index = True, right_index = True, how = "outer")
    merged = merged.dropna()# Drop all rows with missing values.
    
    return merged


In [33]:
def apply_merge(BalanceSheet, CashFlow, FinancialRatios,
                IncomeStatement, News, Weather, Wind, Twitter):
    """
    Apply merge functions above.
    """
    balance_stock = merge_close(df = BalanceSheet)
    cash_stock = merge_close(df = CashFlow)
    ratios_stock = merge_close(df = FinancialRatios)
    income_stock = merge_close(df = IncomeStatement)
    news_stock = merge_close(df = News)
    weather_stock = merge_close(df = Weather)
    wind_stock = merge_close(df = Wind)
    twitter_stock = merge_close(df = Twitter)
    
    return (balance_stock, cash_stock, ratios_stock, income_stock,
            news_stock, weather_stock, wind_stock, twitter_stock)


#### 2. Finding variables with the best correlation results.

In [34]:
def correlation_rest(cor, data):
    """
    Define correlation function for either correlation coefficient.
    """
    x = data["adj_close"]
    
    results = []
    
    for col in data.columns:
        y = data[col]
        corr, _ = cor(x, y)# Get the coefficient.
        results.append([col, corr])# Append the results to evaluate all at once.
    
    return results


In [35]:
def filter_columns_rest(df):
    """
    Filter columns relative to the correlation coefficients.
    """
    # Both correlation coefficients.
    pear = correlation_rest(cor = pearsonr, data = df)
    spear = correlation_rest(cor = spearmanr, data = df)
    
    # Create dataframe columns.
    pear_df = pd.DataFrame(pear, columns = ["Column", "Pearson Correlation"])
    spear_df = pd.DataFrame(spear, columns = ["Column", "Spearman Correlation"])
    
    # Compose single dataframe with both correlation coefficients.
    pear_df["Spearman Correlation"] = spear_df["Spearman Correlation"]
    pear_df = pear_df[1:]# Leave the "adj_close" column out of the dataframe, as the value would obviously be 1.
    
    # Define the threshold for a useful variable: |Pearson| > 0.9 or |Spearman| > 0.9.
    useful_pear = pear_df.loc[(pear_df['Pearson Correlation'] > 0.9) | (pear_df['Pearson Correlation'] < -0.9) | 
                          (pear_df['Spearman Correlation'] < -0.9) | (pear_df['Spearman Correlation'] > 0.9)]
    
    return pear_df, useful_pear


In [36]:
def check_filtering(df_original, df_filtered):
    """
    Get the change in number of variables.
    """
    original = len(df_original)
    filtered = len(df_filtered)
    result = print("Original number of variables: " + str(original) + 
                  ". Filtered number of variables: " + str(filtered))
    
    return result


As shown above, the function has reduced the number of variables in each dataframe. This is useful because it reduces the computational cost; however, it has filtered out every column from both the `CashFlow` and `Weather` dataframes. Although the correlations indicate that these variables do not have a significant impact on stock prices, some of them may still be valuable to the investigation. Thus, in the next functions I will find the largest correlation values for these two dataframes in order to take a closer look.
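
The filtering idea can be illustrated on synthetic data. This is only a minimal sketch: the helper name `strong_correlations`, the toy columns and the use of absolute values are my own assumptions for illustration, not the functions used in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def strong_correlations(df, target="adj_close", threshold=0.9):
    """Hypothetical helper: keep columns whose |Pearson| or |Spearman|
    correlation with `target` exceeds `threshold`."""
    rows = []
    for col in df.columns.drop(target):
        pear, _ = pearsonr(df[target], df[col])
        spear, _ = spearmanr(df[target], df[col])
        rows.append({"Column": col,
                     "Pearson Correlation": pear,
                     "Spearman Correlation": spear})
    table = pd.DataFrame(rows)
    keep = ((table["Pearson Correlation"].abs() > threshold)
            | (table["Spearman Correlation"].abs() > threshold))
    return table, table[keep]

# Synthetic toy data (not the assignment's data).
rng = np.random.default_rng(0)
price = np.linspace(100, 200, 50)
toy = pd.DataFrame({
    "adj_close": price,
    "tracks_price": 2 * price + rng.normal(0, 1, 50),  # almost linear in price
    "noise": rng.normal(0, 1, 50),                     # unrelated to price
})

table, useful = strong_correlations(toy)
print(useful["Column"].tolist())
```

With a threshold as strict as 0.9, only columns that move almost in lockstep with the price survive, which is consistent with whole dataframes being filtered out.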

In [37]:
def find_best_correlation(df):
    """
    Find the variable within a given dataframe with the largest coefficient.
    """
    # Spearman Correlation.
    large_spear = df["Spearman Correlation"].max()# Get the largest value of the Spearman correlation.
    small_spear = df["Spearman Correlation"].min()# Now the smallest.
    
    # Pearson Correlation.
    large_pear = df["Pearson Correlation"].max()
    small_pear = df["Pearson Correlation"].min()
    
    # Finding the largest positive correlation between the Spearman and Pearson values.
    if large_spear > large_pear:
        largest = large_spear
    else:
        largest = large_pear
    
    # Finding the largest negative correlation between the Spearman and Pearson values.
    if small_spear > small_pear:
        smallest = small_pear
    else:
        smallest = small_spear
    
    result = print("The largest positive and negative correlation values are: " + str(largest) + " & " +
                   str(smallest))
    
    return result


After a simple overview of the data, some variables do show a degree of correlation (the largest correlations for `CashFlow` and `Weather` are 0.59 and 0.63 respectively). However, this level of correlation is not strong enough to help in the stock price prediction. Nonetheless, I will manually select the variables to keep from the `Weather` dataframe so that I can perform further analysis on some weather data. As for `CashFlow`, I will drop all variables because I already have a lot of financial data from the other three dataframes (`BalanceSheet`, `FinancialRatios` and `IncomeStatement`).

#### 3. Filtering dataframe variables.

`BalanceSheet`, `FinancialRatios` and `IncomeStatement`: Through correlation filtering functions.

In [38]:
def new_filtered_financial(dc, df):
    """
    Define the high correlated variables as the ones to keep from the original dataframe.
    """
    variables = list(dc["Column"])
    new_df = df[variables]
    
    return new_df


In [39]:
def apply_correlation_filtering(balance_stock, ratios_stock,
                                income_stock, BalanceSheet,
                                FinancialRatios, IncomeStatement):
    """
    Apply all functions shown above to filter dfs.
    """
    bal_pre, bal_filt = filter_columns_rest(df = balance_stock)
    rat_pre, rat_filt = filter_columns_rest(df = ratios_stock)
    inc_pre, inc_filt = filter_columns_rest(df = income_stock)
    
    BalanceSheet = new_filtered_financial(dc = bal_filt, df = BalanceSheet)
    FinancialRatios = new_filtered_financial(dc = rat_filt, df = FinancialRatios)
    IncomeStatement = new_filtered_financial(dc = inc_filt, df = IncomeStatement)
    
    return (BalanceSheet, FinancialRatios, IncomeStatement)


In [40]:
def balance_add_commercial(BalanceSheet, balance_stock):
    """
    Keep Commercial Paper in the Balance Sheet dataframe as it has interesting results.
    """
    BalanceSheet[["commercialpaper"]] = balance_stock[["commercialpaper"]]
    BalanceSheet.ffill(inplace=True)# Replace missing values with the previous value.
    
    return BalanceSheet


`Weather`: I will identify the variables with highest correlation coefficient and then examine the variables' possible contextual significance on the stock market.

Looking first at the correlation values for the `Weather` dataframe, we can see that the three temperature parameters and the `prcp` variable have the strongest correlation coefficients.

Thus, I will keep `tavg`, as it represents the average temperature and so takes into account both the `tmin` and `tmax` variables. I will also keep both the `prcp` and `snow` variables; although their correlation coefficients are very small, rain and snow are the two weather phenomena that affect human thinking the most, apart from natural disasters of course.

In [41]:
def final_weather(Weather):
    """
    Define the variables which are going to be kept in the df.
    """
    Weather = Weather[["tavg", "prcp", "snow"]]
    
    return Weather


So now, the dataframes we have are:
- `BalanceSheet`
- `FinancialRatios`
- `IncomeStatement`
- `Stocks`
- `Weather`
- `Wind`
- `MediaSentiment`
- `TwitterVolume`

## 3. Data Normalisation

This function simply cleans up the column names in each dataframe, because the names were bundled together (lowercase, without spaces) when the data was stored and retrieved.

In [42]:
def data_normalisation(BalanceSheet, FinancialRatios, IncomeStatement,
                       Stocks, Weather, Wind):
    """
    Normalize.
    """
    # Balance Sheet
    balance_changes = [{"retainedearningsaccumulateddeficit": "Retained Earnings Accumulated Deficit"},
                       {"commercialpaper": "Commercial Paper"},
                       {"stockholdersequity": "Stock Holders Equity"}]
    
    rename_columns(columns = balance_changes, df = BalanceSheet)
    
    
    # Financial Ratios
    ratios_changes = [{"debtratio": "Debt Ratio"}, {"debtequityratio": "Debt Equity Ratio"},
                      {"longtermdebttocapitalization": "Long Term Debt to Capitalization"},
                      {"totaldebttocapitalization": "Total Debt to Capitalization"},
                      {"companyequitymultiplier": "Company Equity Multiplier"},
                      {"pricebookvalueratio": "Price Book Value Ratio"}, {"pricetobookratio": "Price to Book Ratio"},
                      {"dividendyield": "Dividend Yield"}, {"pricefairvalue": "Price Fair Value"}]
    
    rename_columns(columns = ratios_changes, df = FinancialRatios)
    
    
    # Income Statement 
    income_changes = [{"interestincome": "Interest Income"},
                      {"weightedaverageshsout": "Weighted Average Shares Outstanding"},
                      {"weightedaverageshsoutdil": "Weighted Average Diluted Shares Outstanding"}]
    
    rename_columns(columns = income_changes, df = IncomeStatement)
    
    
    # Stock
    stock_changes = [{"open": "Open"}, {"high": "High"}, {"low": "Low"}, {"close": "Close"},
                     {"adj_close": "Adjusted Close"}, {"volume": "Stock Volume"}]
    
    rename_columns(columns = stock_changes, df = Stocks)
    
    
    # Weather
    weather_changes = [{"tavg": "Average Temp"}, {"prcp": "Precipitation"}, {"snow": "Snow"}]
    
    rename_columns(columns = weather_changes, df = Weather)
    
    
    # Wind
    wind_changes = [{"wpgt": "Peak Wind Gust"}]
    
    rename_columns(columns = wind_changes, df = Wind)
    
    
    return (BalanceSheet, FinancialRatios, IncomeStatement,Stocks,
            Weather, Wind)


Save clean dataframe to compare with the original.

## 4. Outliers

Now that we have performed the appropriate data reduction and normalisation, let's look for any outliers.

We already fixed missing values when the `merge()` functions were performed, so theoretically we only have to look for outliers.

### 1. Descriptive statistics, to understand the data characteristics and see if there are any evident outliers.

In [43]:
def display_descriptive_tables():
    """
    Display all descriptive tables side-by-side.
    """
    describe_list = [BalanceSheet.describe(), FinancialRatios.describe(),
                     IncomeStatement.describe(),Stocks.describe(), Weather.describe(),
                     Wind.describe(), News.describe(), Twitter.describe()]
    
    for i in describe_list:
        show = display(i)
    
    return show


**The `describe()` function only works as a preliminary overview of the data. From a quick look at the results, the following values are suspected outliers:**

Financial Ratios:
- `Price Book Value Ratio` and `Price to Book Ratio` max value.

Stocks:
- `Stock Volume` max value.

Weather:
- `Precipitation` and `Snow` max values.
 
Wind:
- `Peak Wind Gust` max value.

Tweets Volume:
- `Tweets Volume` min value.

However, these are only simple observations, so let's have a closer look.

### 2. Time series plots to look for outliers.

Function to save a figure/plot.

In [44]:
def save_plot(name, plot):
    """
    Function which saves a given image to the corresponding folder.
    """
    plt.tight_layout()# Display appropriately.
    file = os.path.join("./Saved Figures", (str(name) + str(plot) + ".png"))
    plt.savefig(file, bbox_inches = "tight", dpi = 200, pad_inches = 0.2, edgecolor = "white", facecolor = "white")
    plt.close()
    
    return


Individual variable time series plot.

In [45]:
def individual_time_series(df, variable):
    """
    Individual time series plot.
    """
    plt.subplots(figsize=(4, 3))
    
    df[variable].plot()
    plt.title(str(variable) + " Time Series")
    plt.ylabel(variable)
    
    # Save plot.
    save_plot(name = variable, plot = " Time Series")
    
    return


Time series subplot for all variables in a dataframe.

In [46]:
def all_time_series(df, title):
    """
    Function to plot time series plots for all parameters in a dataframe.
    """
    plt.figure(figsize=(8, 10))
    
    # Define columns in the dataframe in a numeric fashion.
    for i, col in enumerate(df.columns):
        ax = plt.subplot(5, 2, i+1) # Create subplots.
        df[col].plot(label=col)
        ax.set_title(col, fontsize = 6.5) # Define the title of each individual subplot.
        ax.get_yaxis().get_offset_text().set_position((-0.3, 0)) # Move axis labelling to allow space for title.
        
        plt.suptitle(str(title))
        
        if "Adjusted Close" in df.columns:
            ax.xaxis.set_major_locator(MultipleLocator(375))
    
    
    # Save plot.
    save_plot(name = title, plot = "")
    
    return


Now let's save the time series plots, and evaluate them to see if we can identify any outliers.

In [47]:
def all_time_series_saved(BalanceSheet, FinancialRatios, IncomeStatement,
                          Stocks, News, Weather, Wind, Twitter):
    """
    Save all time series plots.
    """
    # Balance sheet.
    all_time_series(df = BalanceSheet,
                    title = "Balance Sheet Time Series")
    
    # Financial ratios.
    all_time_series(df = FinancialRatios,
                    title = "Financial Ratios Time Series")
    
    # Income statement.
    all_time_series(df = IncomeStatement,
                    title = "Income Statement Time Series")
    
    # Stocks.
    all_time_series(df = Stocks, title = "Stocks Time Series")
    
    # News scores.
    individual_time_series(df = News, variable = "Score")
    
    # Weather
    all_time_series(df = Weather, title = "Weather Time Series")
    
    # Wind.
    individual_time_series(df = Wind, variable = "Peak Wind Gust")
    
    #Twitter.
    all_time_series(df = Twitter, title = "Twitter Time Series")
    
    # Twitter volume (individual plot used in report).
    individual_time_series(df = Twitter, variable = "Twitter Volume")
    
    return


The `BalanceSheet` dataset looks totally clean from outliers.

The `FinancialRatios` dataset looks clean from outliers.

The `IncomeStatement` data looks clean.

The `Stocks` data looks clean.

The minimum value in early 2021 in `News` looks like an outlier.

The `Weather` dataframe looks clean.

The `Wind` dataframe looks clean.

The `Twitter Volume` maximum value in late 2018/early 2019 looks suspicious.

**Takeaways:**
- The `FinancialRatios["Price Book Value Ratio"]` and `FinancialRatios["Price to Book Ratio"]` max values have been proven to definitely not be outliers.
- In the `Stock` dataframe there are definitely some spikes in the `Volume` time series plot, however none are significant enough to classify as an outlier.
- The `MediaSentiment["Score"]` minimum value is a possible outlier.
- In the `Weather` dataframe, the maximum `Snow` value is a suspected outlier.
- Similarly, the `Wind["Peak Wind Gust"]` maximum value is visible, however it is not large enough to be considered an outlier.
- The `TwitterVolume["Tweets Volume"]` max value is still considered as a possible outlier.

##### Thus, from the first two outlier detecting functions, we have filtered the possible outliers to:

- `MediaSentiment["Score"]` minimum value.
- `Weather["Snow"]` maximum value.
- `TwitterVolume["Tweets Volume"]` maximum value.

In order to be certain that these values are indeed outliers, we will analyse further.

### 3. Checking weather news for snow values.

In [48]:
def check_snow_for_outliers(Weather):
    """
    Show 5 largest datapoints.
    """
    snows = Weather["Snow"].nlargest(5)
    
    return snows


According to our data there was a large snow storm in late January 2021. News reports confirm this large snowfall, thus we can discard the `Snow` values from being outliers.

Therefore, our only outlier suspects are the `MediaSentiment["Score"]` minimum value and the `TwitterVolume["Tweets Volume"]` maximum value.

### 4. Box Plots.

In [49]:
def box_plot(News, Twitter):
    """
    Box plot to display the overall distribution of the dataset, and observe if there are any outliers.
    """
    # Plot the two box plots.
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey = True, figsize=(8, 3))
    
    # Media Sentiment.
    attribute = np.array(News["Score"])# Convert to numpy array.
    sns.boxplot(x = attribute, ax = ax1)# Box plot attribute.
    ax1.set_xlabel("Score")# The x-axis shows the values, not dates.
    ax1.set_title("News Score", fontsize = 10)
    
    # Tweets Volume.
    attribute = np.array(Twitter["Twitter Volume"])
    sns.boxplot(x = attribute, ax = ax2)
    ax2.set_xlabel("Twitter Volume")
    ax2.set_title("Tweets Volume", fontsize = 10)
    
    plt.suptitle("Box Plots")
    
    # Alternative method.
    """
    parameters = list([News["Score"], TwitterVolume["Tweets Volume"]])
    plt.figure(figsize=(10, 6))
    
    for i, name in enumerate(parameters):
        ax = plt.subplot(1, 2, i + 1)
        attribute = np.array(name)
        sns.boxplot(x = attribute);
        ax.axes.set_title(i)
    """
    
    # Save figure.
    save_plot(name = "Full", plot = " Box Plot")
    
    return


- **`News["Score"]`:** There are various values below the 25th percentile, so the minimum does not look like an outlier. However, we cannot be 100% sure, so we will analyse this value further.
-  **`Twitter["Twitter Volume"]`:** The maximum does look like an outlier, so let's analyse it further.
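
For reference, the points a box plot draws beyond its whiskers can also be listed numerically. The sketch below assumes the common 1.5 × IQR whisker rule (the default in seaborn and matplotlib); the helper `iqr_outliers` and the toy values are hypothetical and are not part of the pipeline above.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Hypothetical helper: return the values outside
    [Q1 - whis * IQR, Q3 + whis * IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Toy sentiment scores with one clearly low value (synthetic data).
toy_scores = [0.10, 0.12, 0.15, 0.11, 0.13, 0.14, -0.90]
print(iqr_outliers(toy_scores))
```

A value flagged by this rule is only a candidate; as in this section, context is still needed before calling it an outlier.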

### 5. Z-Score.

In [50]:
def z_scores_media(News):
    """
    Plot the Z-Scores for the Media Sentiment dataframe.
    """    
    # Plot.
    fig, ax = plt.subplots(1, 1, sharey = True, figsize=(5, 3))
    
    # Media Sentiment.
    data_media = np.array(News["Score"])# Convert to numpy array.
    year_media = np.array(News.index)# Date data.
    z_media = np.abs((scipy.stats).zscore(data_media)) # Get Z-Score values of data.
    ax.plot(year_media, z_media)
    ax.xaxis.set_major_locator(MultipleLocator(465)) # Spacing of axis tickers.
    ax.set_xlabel("Date")
    ax.set_ylabel("Z-Score")
    ax.set_ylim([-0.2, 6])
    ax.set_title("Media Sentiment Score Z-Scores", fontsize = 10)
    ax.grid()
    
    
    # Save figure.
    save_plot(name = "News", plot = " Z-Scores")
    
    return


In [51]:
def z_scores_twitter(Twitter):
    """
    Plot the Z-Scores for the Twitter Volume dataframe.
    """
    # Plot.
    fig, ax = plt.subplots(1, 1, sharey = True, figsize=(5, 3))
    
    # Tweets Volume.
    data_twitter = np.array(Twitter["Twitter Volume"])
    year_twitter = np.array(Twitter.index)# Date data.
    z_twitter = np.abs((scipy.stats).zscore(data_twitter)) # Get Z-Score values of data.
    ax.plot(year_twitter, z_twitter)
    ax.xaxis.set_major_locator(MultipleLocator(287)) # Spacing of axis tickers.
    ax.set_xlabel("Date")
    ax.set_ylabel("Z-Score")
    ax.set_ylim([-0.2, 13.5])
    ax.set_title("Twitter Volume Z-Scores", fontsize = 10)
    ax.grid()
    
    # Save figure.
    save_plot(name = "Twitter Volume", plot = " Z-Scores")
    
    return


In [52]:
def check_small_large_outlier():
    """
    Check the smallest and largest values to see whether they are outliers.
    """
    news_s = News["Score"].nsmallest(1)
    
    twitter_l = Twitter["Twitter Volume"].nlargest(5)
    
    return news_s, twitter_l


- **`News`:** The Z-Scores are really high; perhaps 25% of the data points are above the usual Z-Score threshold (3) that defines an outlier. However, this obviously does not mean that 25% of the data points are outliers. It simply means that the data does not follow a normal distribution; in other words, the data is extremely volatile. As we cannot decide whether to classify this value as an outlier with the standard Z-Score threshold, we have to make the decision by observing the data. As we can see above, this possible outlier occurred on April 23rd 2021. After researching, I found no negative news at all on Apple during that time. There was an Apple event in April 2021; however, it received great feedback, especially with the announcement of the AirTags. Considering this knowledge of that specific date together with both the box and Z-Score plots, it is safe to say that **this value can be considered an outlier**, setting the Z-Score threshold for this variable at 5.

- **`Twitter Volume`:** Again, the Z-Scores are extremely high for the Twitter Volume data. In this case they are even higher than for the Media Sentiment, peaking at an enormous 12.5. Again, to get some context, I researched Apple at the time of the huge spike (January 3rd 2019, as shown above). At that time, Apple suffered its biggest drop in stock price in 6 years due to its revenue guidance cut. Given this clear event, we cannot label the value as an outlier.

**Thus, the only outlier is the minimum value of the `News["Score"]`.**
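
As a reminder of what the Z-Score plots measure, each value is expressed as its distance from the mean in units of standard deviation, $z_i = (x_i - \mu) / \sigma$. The sketch below applies this to a synthetic series; the helper `flag_by_zscore` and the toy data are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

def flag_by_zscore(values, threshold=3.0):
    """Hypothetical helper: return the positions whose absolute
    Z-Score exceeds `threshold`, together with all absolute Z-Scores."""
    values = np.asarray(values, dtype=float)
    z = np.abs(stats.zscore(values))
    return np.where(z > threshold)[0], z

# Nineteen identical values and one spike (synthetic data).
toy = np.array([0.0] * 19 + [10.0])
flagged, z = flag_by_zscore(toy)
print(flagged, z[flagged])
```

The threshold is a judgment call: a stricter value flags fewer points, which is why the choice is combined with context about the dates involved.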

### 6. Fixing the outliers.

We could either:
- Drop the outlier.
- Cap the outlier at a certain value (in this case, we would cap it at the minimum value, excluding the outlier).
- Replace it with a new value, either through regression or the mean value of the variable.

I am going to cap the outlier at the next minimum value because:
- Dropping the value is not an option because this variable does not have many data points, so we cannot afford to reduce their number.
- I think that capping the value is more realistic than setting it to the variable's mean because these financial variables seem to fluctuate a lot.
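
Capping at the next minimum can be sketched as follows. This is a simplified illustration on synthetic values: the helper `cap_low_outliers` is hypothetical, and the outlier mask is given by hand rather than derived from Z-Scores.

```python
import numpy as np

def cap_low_outliers(values, is_outlier):
    """Hypothetical helper: replace the flagged values with the
    smallest value that is NOT flagged (the "next minimum")."""
    values = np.asarray(values, dtype=float)
    is_outlier = np.asarray(is_outlier, dtype=bool)
    capped = values.copy()
    capped[is_outlier] = values[~is_outlier].min()
    return capped

# Toy scores where -0.9 has been judged to be an outlier (synthetic data).
scores = np.array([0.2, -0.1, 0.3, -0.9, 0.1])
mask = np.array([False, False, False, True, False])
print(cap_low_outliers(scores, mask))
```

Note that the minimum has to be taken over the non-outlier values only; taking it over the whole array would return the outlier itself and leave the data unchanged.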

In [53]:
def cap_the_outlier(News):
    """
    Cap the minimum value of Media Sentiment "Score".
    """
    # Again defining the parameters within this def function.
    outlier_dataset = np.array(News["Score"])
    outlier_year = np.array(News.index)
    
    # Again defining the outlier within this separate def function.
    z = np.abs((scipy.stats).zscore(outlier_dataset))
    threshold = 3.85
    outlier_loc = np.where(z > threshold) # Defining the outliers.
    outlier_by_Z_Score = outlier_dataset[outlier_loc]
    
    
    # Change the outlier to the next minimum value, i.e. the smallest value not flagged as an outlier.
    capped_outlier_dataset = np.copy(outlier_dataset)
    capped_outlier_dataset[outlier_loc] = np.min(outlier_dataset[z <= threshold])
    
    
    # Plot and compare, before and after the outlier is capped.
    fig, (ax1, ax2) = plt.subplots(1, 2, sharey = True, figsize=(8, 3))
    ax1.set_title("Before Cap")
    ax1.scatter(outlier_year, outlier_dataset)
    ax1.scatter(outlier_year[outlier_loc], outlier_dataset[outlier_loc], c = 'r')
    ax1.set_xlabel("Date")
    ax1.set_ylabel("News Score")
    ax1.xaxis.set_major_locator(MultipleLocator(465)) # Spacing of axis tickers.
    
    ax2.set_title("After Cap")
    ax2.scatter(outlier_year, capped_outlier_dataset)
    ax2.scatter(outlier_year[outlier_loc], capped_outlier_dataset[outlier_loc], c = 'r')
    ax2.set_xlabel("Date")
    ax2.xaxis.set_major_locator(MultipleLocator(465)) # Spacing of axis tickers.
    
    # Save figure.
    save_plot(name = "Outlier", plot = " Capping")
    
    """
    Replace the dataframe column with the new capped outlier.
    """
    
    np_series = pd.Series(capped_outlier_dataset)
    new_df = pd.DataFrame(np_series)
    new_df.index = News.index
    
    News["Score"] = new_df[0]
    
    return


To double check, let's plot the `Media Sentiment Score` time series to view how the data has changed.

In [54]:
def score_time_series_fixed(News):
    """
    Plot news score time series after the outlier fix.
    """
    News["Score"].plot()
    plt.title("News Score Time Series Plot After Outlier Fix")
    plt.ylabel("Score")
    
    save_plot(name = "Fixed Outlier News Score", plot = " Plot")
    
    return


## 5. Double checking for missing values.

We already performed missing data cleaning, but let's make sure.
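
The same check can be written more compactly with chained `sum()` calls. This is just a sketch on toy dataframes; `total_missing` is a hypothetical helper, not the function used below.

```python
import numpy as np
import pandas as pd

def total_missing(frames):
    """Hypothetical helper: total number of missing cells
    across a list of dataframes."""
    return int(sum(df.isnull().sum().sum() for df in frames))

# Toy dataframes (synthetic data): three missing cells in total.
a = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, np.nan]})
b = pd.DataFrame({"z": [1, 2]})
print(total_missing([a, b]))
```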

In [55]:
def check_missing_values():
    """
    Check for any missing values.
    """
    dfs = [BalanceSheet, FinancialRatios, IncomeStatement, Stocks, News,
           Weather, Wind, Twitter]
    
    missing_list = []
    
    for df in dfs:
        N_missing_val = (df.loc[:]).isnull().sum()# Add missing values for each variable.
        missing_list.extend(N_missing_val)# Extend to list all variables.
    
    list_all = list(missing_list)# Create a list of missing values.
    n_missing_values = sum(list_all)# Get total number of missing values by adding the list.
    result = print("Total number of missing values is " + str(n_missing_values))
    
    return result


### Totally clean data.

In [56]:
def display_processed_data(BalanceSheet, FinancialRatios, IncomeStatement,
                           Stocks, News, Weather, Wind, Twitter):
    """
    Display all to check.
    """
    df_list = [BalanceSheet, FinancialRatios, IncomeStatement,
               Stocks, News, Weather, Wind, Twitter]
    
    for i in df_list:
        show = display(i)
    
    return show


## Function to preprocess all data.

In [57]:
def process_all():
    """
    Preprocess all data.
    """
    # Declare global variables to carry the retrieved data into the processing section.
    global dfBalanceSheet, dfCashFlow, dfFinancialRatios, dfIncomeStatement
    global dfStocks, dfNews, dfWeather, dfTwitter
    
    """
    Retrieve all data from the Oracle SQL database.
    """
    (dfBalanceSheet, dfCashFlow, dfFinancialRatios,
     dfIncomeStatement, dfStocks, dfNews, dfWeather, dfTwitter) = retrieve_all()
    
    """
    Initial data preprocessing.
    """
    # Setting up local workspace to store figures.
    create_new_dir()
    
    # Save original df format for reference.
    save_df_normal(df = dfStocks, title = "Original Stocks")
    
    # Initial data column cleaning.
    intial_column_cleaning()
    
    # Same as before, to prevent error.
    global BalanceSheet, CashFlow, FinancialRatios
    global IncomeStatement, Stocks, News
    global Weather, Wind, Twitter
    
    # Sort datetime.
    (BalanceSheet, CashFlow, FinancialRatios, IncomeStatement,
     Stocks, News, Weather, Twitter) = sort_dfs_datetime()
    
    """
    Data reduction.
    """    
    # Group daily sentiment scores.
    News, Twitter = group_sentimental_scores(News, Twitter)
    
    # Prepare weather data by splitting off the wind data.
    Wind, Weather = prepare_weather_data(Weather)
    
    # Clean data.
    clean_missing_values()
    
    # Merge auxiliary to stock closing price for correlation analysis.
    (balance_stock, cash_stock, ratios_stock,
     income_stock, news_stock, weather_stock,
     wind_stock, twitter_stock) = apply_merge(BalanceSheet, CashFlow, FinancialRatios,
                                              IncomeStatement, News, Weather, Wind, Twitter)
    
    # Apply correlation analysis data reduction.
    (BalanceSheet, FinancialRatios,
     IncomeStatement) = apply_correlation_filtering(balance_stock, ratios_stock, income_stock,
                                                    BalanceSheet, FinancialRatios, IncomeStatement)
    
    # Add commercial paper as it has interesting results.
    BalanceSheet = balance_add_commercial(BalanceSheet, balance_stock)
    
    # Prepare weather data.
    Weather = final_weather(Weather)
    
    """
    Data normalization.
    """
    # Fix data.
    data_normalisation(BalanceSheet, FinancialRatios, IncomeStatement,
                       Stocks, Weather, Wind)
    
    # Save clean dataframe.
    save_df_normal(df = Stocks, title = "Clean Stocks")
    
    """
    Outlier detection.
    """
    # Save financial ratios descriptive table.
    save_df_normal(df = FinancialRatios.iloc[:, -4:].describe(),
                   title = "Financial Ratios Properties")
    
    # Save all time series plots.
    all_time_series_saved(BalanceSheet, FinancialRatios, IncomeStatement,
                          Stocks, News, Weather, Wind, Twitter)
    
    # Save box plot.
    box_plot(News, Twitter)
    
    # Save z-score plots.
    z_scores_media(News)
    z_scores_twitter(Twitter)
    
    # Cap the outlier.
    cap_the_outlier(News)
    
    # News score time series after outlier fix.
    score_time_series_fixed(News)
    
    return (BalanceSheet, FinancialRatios, IncomeStatement,
            Stocks, News, Weather, Wind, Twitter)


<div class="alert alert-heading alert-info">
    
## Data Description

As we have now acquired and preprocessed all of the data, this mini section will build a dataframe table with a detailed description of each dataset. 

</div>

In [58]:
def data_description():
    """
    Data description table.
    """
    data = {"Dataset": ["Financial Data", "Apple Stocks", "News Response",
                    "Weather", "Twitter Response"],
            
            "Variables": ["182", "6", "2", "10", "2"],
            
            "Start Date": ["2017-04-01", "2017-04-03", "2017-04-02",
                     "2017-04-01", "2017-04-01"],
            
            "End Date": ["2021-12-25", "2021-04-29", "2017-04-29",
                   "2021-04-30", "2019-09-02"],
            
            "Data Points": ["20", "1027", "2305", "1491", "1341"],
            
            "Data Frequency": ["Quarterly", "Daily", "Not Fixed", "Daily", "Daily"]}
    
    DataDescription = pd.DataFrame(data, columns = ["Dataset", "Variables", "Start Date",
                                                "End Date", "Data Points",
                                                "Data Frequency"])
    
    DataDescription.set_index("Dataset", inplace = True)
    
    # Save table.
    save_df_normal(df = DataDescription, title = "Initial Data Description")
    
    return


<div class="alert alert-heading alert-info">
    
## Task 4: Data Exploration

After ensuring that the data is well preprocessed, it is time to start exploring the data to carry out
hypotheses and intuition about possible patterns that might be inferred. Depending on the data,
different EDA (exploratory data analysis) techniques can be applied, and a large amount of
information can be extracted.
For example, you could do the following analysis:

    
- Time series data is normally a combination of several components:
  - Trend represents the overall tendency of the data to increase or decrease over time.
  - Seasonality is related to the presence of recurrent patterns that appear after regular
intervals (like seasons).
  - Random noise is often hard to explain and represents all those changes in the data
that seem unexpected. Sometimes sudden changes are related to fixed or predictable
events (i.e., public holidays).
- Features correlation provides additional insight into the data structure. Scatter plots and
boxplots are useful tools to spot relevant information.
- Explain unusual behaviour.
- Explore the correlation between stock price data and other external data that you can
collect (as listed in Sec 2.1)
- Use hypothesis testing to better understand the composition of your dataset and its
representativeness.

    
At the end of this step, provide key insights on the data. This data exploration procedure should
inform the subsequent data analysis/inference procedure, allowing one to establish a predictive
relationship between variables.

</div>

## 1. Initial Stock Time Series Analysis

In this first section we are going to explore the Stock data.

### 1. High/Low price plot.

In [59]:
def multi_entire_series(df, v1, v2, label):
    """
    Plot entire time series for two variables.
    """
    fig, ax = plt.subplots(figsize=(5, 3))# Size of figure.
    
    # Define the plots.
    df[v1].plot(label = v1, color = "green")
    df[v2].plot(label = v2, color = "orange")
    plt.title(str(v1) + " " + str(v2) + " " + str(label))
    plt.ylabel(label)
    plt.legend()# Legend to indicate what represents each line.
    
    save_plot(name = str(v1) + " " + str(v2), plot = " Full Time Series")
    
    return


Let's focus on a specific area in order to view the data more closely.

Let's also find the **Minkowski distance** to see their similarity.
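
For two series $x$ and $y$ of equal length, the Minkowski distance of order $p$ is $d_p(x, y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$. The sketch below checks `scipy.spatial.distance.minkowski` against this formula on made-up numbers; the arrays are synthetic and only illustrate the calculation.

```python
import numpy as np
from scipy.spatial.distance import minkowski

# Toy "high" and "low" prices (synthetic data).
high = np.array([10.0, 11.0, 12.0])
low = np.array([9.0, 9.0, 12.0])

# Manual Minkowski distance of order 3, following the formula above.
manual = (np.abs(high - low) ** 3).sum() ** (1 / 3)

print(minkowski(high, low, p=1))  # Manhattan distance
print(minkowski(high, low, p=2))  # Euclidean distance
print(minkowski(high, low, p=3), manual)
```

Because the distance is a sum over all data points, it grows with the length of the series, so it is most informative when comparing pairs of series over the same period.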

In [60]:
def multi_part_series_dist(df, v1, v2, start, end, ylabel, title):
    """
    Find the Minkowski distance.
    """
    # p = 3 gives the Minkowski distance of order 3; p = 1 is Manhattan and p = 2 is Euclidean.
    mink_dist = minkowski(df[v1], df[v2], p = 3)
    dist = round(mink_dist, 3)# Display to 3 decimal places.
    
    
    # Could have also used the Euclidean distance.
    #def euclidean_distance(x,y):
    #    return sqrt (sum(pow (a-b,2) for a, b in zip (x,y)))
    
    # Plot part of the time series for the two variables.
    fig, ax = plt.subplots(figsize=(5, 3))# Size of figure.
    
    # Use "loc" to select a specific section of the time series by date.
    df[v1].loc[start:end].plot(label = v1, color = "green")
    df[v2].loc[start:end].plot(label = v2, color = "orange")
    plt.text(0.4, 0.9, r"Minkowski distance: " + str(dist), fontsize = "small", transform=ax.transAxes)
    plt.title(str(v1) + " " + str(v2) + " " + str(ylabel))
    plt.ylabel(ylabel)
    plt.legend(prop={'size': 6})
    
    save_plot(name = str(v1) + " " + str(v2), plot = title)
    
    return


As the graph shows, the correlation between these two variables is almost perfect. The Minkowski distance is 26.598.
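For reference, the Minkowski distance of order `p` is the `p`-th root of the summed `p`-th powers of the absolute differences. It measures the absolute gap between two series and grows with their length and scale, so it is not a correlation coefficient. A minimal sketch on made-up values (the arrays `high` and `low` below are hypothetical, not the Apple data):

```python
import numpy as np
from scipy.spatial.distance import minkowski

# Two toy price series (hypothetical values, not the Apple data).
high = np.array([10.0, 11.0, 12.0])
low = np.array([9.0, 10.5, 10.0])

manhattan = minkowski(high, low, p=1)  # sum of absolute differences
euclidean = minkowski(high, low, p=2)  # square root of the summed squared differences
order_3 = minkowski(high, low, p=3)    # cube root of the summed cubed differences

print(manhattan, euclidean, order_3)
```

For these values the distance shrinks as `p` increases, because higher orders are dominated by the single largest gap.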

In [61]:
def high_low(Stocks):
    """
    Investigate high and low price correlation.
    """
    # Entire time series.
    multi_entire_series(df = Stocks, v1 = "High", v2 = "Low",
                        label = " Full Stock Prices")
    
    # Part time series 2018.
    multi_part_series_dist(df = Stocks, v1 = "High", v2 = "Low",
                           start = "2018-02-20", end = "2018-05-10",
                           ylabel = " Prices", title = " Part Plot")
    
    # Zoomed in time series for big variance.
    multi_part_series_dist(df = Stocks, v1 = "High", v2 = "Low", start = "2018-03-23",
                           end = "2018-04-05", ylabel = " Prices", title = " Big Variance Plot")
    
    return


These two variables consistently take very similar values (they usually differ by about 25 cents). Nonetheless, at some points the gap between them is larger, as in this area of the time series.

The biggest difference appears to occur on `2018-03-27`.

In [62]:
def big_diff_high_low(Stocks):
    """
    Check when the biggest variance occurs.
    """
    big = (Stocks["High"].iloc[245:254] - Stocks["Low"].iloc[245:254]).idxmax()
    
    return big


So our prediction was correct.

The reason for this bigger difference is that on this day Apple held an [event](https://www.apple.com/uk/newsroom/2018/03/apple-introduces-new-9-7-inch-ipad-with-apple-pencil-support/) where they announced various new products.

In [63]:
def check_stock_march_2018(Stocks):
    """
    Check all stock data the day of the event.
    """
    check = Stocks.loc["2018-03-27"]
    
    return check


The difference between `High` and `Low` here is 2 USD instead of the usual 25 cents.

### 2. Open/Close price plots.

In [64]:
def open_close(Stocks):
    """
    Open and close prices.
    """
    multi_part_series_dist(df = Stocks, v1 = "Open", v2 = "Close", start = "2018-02-20",
                           end = "2018-05-10", ylabel = " Prices", title = " Part Plot")
    
    return


In this case the Minkowski distance is 16.147.

### 3. Close/Adjusted Close price plots.

In [65]:
def close_adjusted(Stocks):
    """
    Close and adjusted close prices.
    """
    multi_part_series_dist(df = Stocks, v1 = "Close", v2 = "Adjusted Close", start = "2018-02-20",
                           end = "2018-05-10", ylabel = " Prices", title = " Part Plot")
    
    return


In this case the Minkowski distance is 15.148.

In [66]:
def mean_diff_close_adjusted(Stocks):
    """
    Find mean difference between the two.
    """
    mean = (Stocks["Close"] - Stocks["Adjusted Close"]).mean()
    
    return mean


Their shape is identical; the only difference is that `Close` is slightly higher (by 1.37 USD on average). This is because `Adjusted Close` takes into account factors like dividends, stock splits, etc.

### 4. Total money traded.

This is achieved by computing: $${T}_{traded} = {P}_{open} \cdot {V}$$
This acts in a similar way to a market cap. Looking only at the volume, it may appear that a company is performing phenomenally in the stock market, but if the share price is very low, the return on the stock may not be that good. By using the total money traded, we are able to get much more information on how profitable Apple really is.

Now let's visualize the `Volume` and `Total Traded` plots. We will do this in two separate plots as they have completely different scales.
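The formula above can be sketched in pandas as follows. The column names `Open`, `Stock Volume` and `Total Traded` follow the ones used elsewhere in this notebook; the numbers are invented for illustration only.

```python
import pandas as pd

# Hypothetical prices and volumes for three trading days.
stocks = pd.DataFrame(
    {"Open": [10.0, 20.0, 5.0], "Stock Volume": [100, 50, 400]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# Total money traded = opening price times number of shares traded.
stocks["Total Traded"] = stocks["Open"] * stocks["Stock Volume"]
print(stocks)
```

In this toy example the third day has by far the largest volume but only twice the money traded of the other days, which is the effect described above.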

In [67]:
def totaltraded_volume(Stocks):
    """
    Compare stock volume and total money traded over the entire time series.
    """
    individual_time_series(df = Stocks, variable = "Stock Volume")
    individual_time_series(df = Stocks, variable = "Total Traded")
    
    return


Now we will plot the whole of 2020, as there seems to be some interesting data to compare across these variables.

In [68]:
def individual_part_series(df, v, start, end, title):
    """
    Plot part time series.
    """
    fig, ax = plt.subplots(figsize=(5, 3))# Size of figure.
    
    # Use "loc" to select a specific section of the time series by date.
    df[v].loc[start:end].plot(label = v, color = "green")
    plt.title(str(v) + str(title))
    ax.get_yaxis().get_offset_text().set_position((-0.15, 0)) # Move axis labelling to allow space for title.
    plt.ylabel(v)
    
    
    save_plot(name = v, plot = title)
    
    return


In [69]:
def stock_analysis_2020(Stocks):
    """
    Analyse 2020 stock data.
    """
    # Stock volume.
    individual_part_series(df = Stocks, v = "Stock Volume", start = "2020-01-01", end = "2020-12-31",
                           title = " Part Time Series")
    
    # Total traded.
    individual_part_series(df = Stocks, v = "Total Traded", start = "2020-01-01", end = "2020-12-31",
                           title = " Part Time Series")
    
    # Adjusted Close.
    individual_part_series(df = Stocks, v = "Adjusted Close", start = "2020-01-01", end = "2020-12-31",
                           title = " Part Time Series")
    
    return


### 5. Returns.

#### Get Percentage Daily Returns.

$${R}_{daily} = \displaystyle \Bigg[\frac{{P}_{t}}{{P}_{t-1}} - 1\Bigg] \times 100\%$$

#### Get Cumulative Returns.

$${C}_{t} = \displaystyle \Bigg[\left(\frac{{P}_{t}}{{P}_{t-1}} - 1 \right) + 1\Bigg] {C}_{t-1}$$

Here ${C}_{t-1}$ is the cumulative return up to the previous day.


I will compute both the plain `Daily Returns` and `Daily Returns (%)`. I will keep `Daily Returns (%)` for the analysis, and I will use `Daily Returns` to obtain the cumulative returns.
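A minimal sketch of both calculations, assuming the returns are computed from a single price column (the series `prices` below is hypothetical; which price column the notebook actually uses is not shown in this section):

```python
import pandas as pd

# Hypothetical closing prices for three consecutive trading days.
prices = pd.Series([100.0, 110.0, 99.0], name="Adjusted Close")

daily_returns = prices.pct_change()                # P_t / P_{t-1} - 1
daily_returns_pct = daily_returns * 100            # the same quantity in percent
cumulative_return = (1 + daily_returns).cumprod()  # growth of one unit held since the first day

print(daily_returns_pct)
print(cumulative_return)
```

The first value of each series is `NaN`, because the first day has no previous price to compare against.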

**1. Plot Returns.**

In [70]:
def daily_cumulative_returns(Stocks):
    """
    Daily returns and cumulative returns plots.
    """
    # Daily.
    individual_time_series(df = Stocks, variable = "Daily Returns (%)")
    
    # Cumulative.
    individual_time_series(df = Stocks, variable = "Cumulative Return")
    
    return


## 2. In Depth Stock Data Analysis

### 1. Probability Distribution of Percent Daily Returns.

In [71]:
def returns_distribution(Stocks):
    """
    Daily returns probability distribution.
    """
    plt.subplots(figsize=(5, 3))
    
    # Histogram plot.
    Stocks["Daily Returns (%)"].hist(bins = 100)
    plt.xlabel("Daily Returns (%)")
    plt.ylabel("Frequency")
    plt.title("Daily Percentage Return Distribution")
    
    save_plot(name = "Daily Percentage Change", plot = "Probability Distribution")
    
    return


### 2. Rolling Averages.

#### Original & Moving Average, together and individually.

In [72]:
def rolling_original_full(Stocks):
    """
    Comparison for entire time series period. 
    """
    # Define the three plots.
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey = True, figsize=(10, 3))
    
    fig.suptitle("Difference Between Original Price Plot & Rolling Average")
    
    # Multi plot of both.
    Stocks["Adjusted Close"].plot(label = "Original", ax = ax1)
    Stocks["Moving Average 30"] = Stocks["Adjusted Close"].rolling(30).mean()# New column: 30-day moving average.
    Stocks["Moving Average 30"].plot(label = "Moving Average", ax = ax1, color = "green")
    ax1.set_ylabel("Adjusted Close Price")
    ax1.set_title("Both Original and Rolling Average")
    ax1.legend()
    
    # Just the original Adjusted Close price.
    Stocks["Adjusted Close"].plot(label = "Adjusted Close", ax = ax2)
    ax2.set_ylabel("Adjusted Close Price")
    ax2.set_title("Original")
    
    # Just the moving average.
    Stocks["Moving Average 30"].plot(label = "Moving Average", ax = ax3, color = "green")
    ax3.set_ylabel("Adjusted Close Price")
    ax3.set_title("Rolling Average")
    
    save_plot(name = "All Three", plot = " Rolling Average")
    
    return


In [73]:
def rolling_original_2020(Stocks):
    """
    Comparison for 2020. 
    """
    # Moving Average and Original for 2020.
    plt.subplots(figsize=(5, 3))
    
    (Stocks["Adjusted Close"].loc["2020-01-01":"2020-12-31"]).plot(label = "Original")
    (Stocks["Moving Average 30"].loc["2020-01-01":"2020-12-31"]).plot(label = "Moving Average")
    plt.ylabel("Adjusted Close Price")
    plt.title("Original and Rolling Average Part Plots")
    plt.legend()
    
    save_plot(name = "Original & Moving Average", plot = " Part Plot")
    
    return
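
The rolling mean used above leaves the first `window - 1` values undefined, which is why a moving-average line starts later than the original series. A small sketch on a made-up series (a window of 3 is used here only to keep the example short; the notebook uses 30):

```python
import pandas as pd

# Hypothetical prices; a short window keeps the output easy to read.
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

strict = prices.rolling(3).mean()                 # NaN until a full window is available
lenient = prices.rolling(3, min_periods=1).mean() # averages whatever is available so far

print(strict.tolist())
print(lenient.tolist())
```

Using `min_periods=1` fills the start of the series, at the cost of averaging over fewer points there.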


#### Analyse with holidays.

**Let's see the effect of Christmas.**

In [74]:
def christmas_totaltraded(Stocks):
    """
    Analysis of total traded values during the Christmas holidays.
    """
    # Select solely Christmas time periods (copies, so they can be modified safely below).
    s_17 = Stocks.loc["2017-12-10": "2018-01-07"].copy()
    s_18 = Stocks.loc["2018-12-10": "2019-01-07"].copy()
    s_19 = Stocks.loc["2019-12-10": "2020-01-07"].copy()
    s_20 = Stocks.loc["2020-12-10": "2021-01-07"].copy()
    
    # Define years list.
    s_years = [s_17, s_18, s_19, s_20]
    
    # For loop to change datetime format to exclude the year.
    for df in s_years:
        df.reset_index(drop = False, inplace = True)
        df["Date"] = df["Date"].dt.strftime("%m-%d")
        df.set_index("Date", inplace = True)
    
    # Plot for each year on a single day-month x-axis.
    fig, ax = plt.subplots(figsize = (5, 3))
    
    s_17["Total Traded"].plot(label = "2017", color = "green", alpha = 0.7)
    s_18["Total Traded"].plot(label = "2018", color = "blue", alpha = 0.7)
    s_19["Total Traded"].plot(label = "2019", color = "red", alpha = 0.7)
    s_20["Total Traded"].plot(label = "2020", color = "black", alpha = 0.7)
    
    plt.title("Christmas Time Investigation")
    ax.get_yaxis().get_offset_text().set_position((-0.15, 0)) # Move axis labelling to allow space for title.
    plt.ylabel("Total Traded")
    
    plt.legend(prop={'size': 7.5})
    
    save_plot(name = "Christmas", plot = " Total Traded All")
    
    return


**2018 looks like a really interesting year, so let's plot from mid 2018 to early 2019 for the other important parameters.**

In [75]:
def late_2018_stock(Stocks):
    """
    Stock variable plots for particularly interesting time period.
    """
    # Total Traded.
    individual_part_series(df = Stocks, v = "Total Traded",
                           start = "2018-08-01", end = "2019-03-01",
                           title = " Late 2018 Part Time Series")
    
    # Daily Returns.
    individual_part_series(df = Stocks, v = "Daily Returns (%)",
                           start = "2018-08-01", end = "2019-03-01",
                           title = " Late 2018 Part Time Series")
    
    # Cumulative Return.
    individual_part_series(df = Stocks, v = "Cumulative Return",
                           start = "2018-08-01", end = "2019-03-01",
                           title = " Late 2018 Part Time Series")
    
    return


### 3. Year and Month Dependency.

With daily returns.

In [76]:
def monthly_dependency(Stocks):
    """
    Investigate daily returns monthly dependency.
    """
    import seaborn as sns
    
    # Extract datetime sections.
    month_year = Stocks[["Daily Returns (%)"]].copy()# Copy only Daily Returns.
    month_year.loc[:, "year"] = month_year.index.year# Find years.
    month_year.loc[:, "month"] = month_year.index.month# Find months.
    month_year = month_year.groupby(["year", "month"]).mean().unstack()# Groupby years and months.
    month_year.columns = month_year.columns.droplevel(0)
    
    
    # Begin plot.
    f, ax = plt.subplots(figsize = (12, 6))
    
    sns.heatmap(month_year, ax = ax, vmin = -1, vmax = 1, cmap = "mako");
    
    # Colour bar settings.
    cbax = f.axes[1]
    [l.set_fontsize(13) for l in cbax.yaxis.get_ticklabels()]# Set y ticks.
    cbax.set_ylabel("Stock Daily Returns (%)", fontsize = 13)
    
    ax.set_title("Stock Daily Returns per Year and Month", fontsize = 16)
    
    # Define fontsize of axis ticks.
    [l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
    [l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]
    
    ax.set_xlabel("Month", fontsize = 15)
    ax.set_ylabel("Year", fontsize = 15)
    
    save_plot(name = "Daily Returns Heatmap", plot = " Monthly Dependency")
    
    return


The blank squares at the beginning and end of the heatmap indicate that there is no data for those months.
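The year-by-month table behind the heatmap can be sketched on a few made-up rows. The column name `Daily Returns (%)` follows the notebook; the dates and values are hypothetical. Months without observations come out as `NaN`, which is what the heatmap draws as blank squares.

```python
import pandas as pd

# Hypothetical daily returns on four dates.
idx = pd.to_datetime(["2017-04-03", "2017-04-04", "2017-05-01", "2018-04-02"])
returns = pd.DataFrame({"Daily Returns (%)": [1.0, 3.0, -2.0, 0.5]}, index=idx)

returns["year"] = returns.index.year
returns["month"] = returns.index.month

# Mean return per (year, month), with years as rows and months as columns.
table = returns.groupby(["year", "month"])["Daily Returns (%)"].mean().unstack()
print(table)
```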

### 4. Day of the Week Dependency.

In [77]:
def daily_dependency(Stocks):
    """
    Investigate the day-of-the-week dependency of daily returns.
    """
    import seaborn as sns
    
    # Dependency on day of the week and month.
    month_day = Stocks[["Daily Returns (%)"]].copy()
    month_day.loc[:, "Day of the Week"] = month_day.index.dayofweek# Get by days.
    month_day.loc[:, "Month"] = month_day.index.month# Get by months.
    month_day = month_day.groupby(["Day of the Week", "Month"]).mean().unstack()# Groupby days.
    month_day.columns = month_day.columns.droplevel(0)
    
    # Begin plot.
    f, ax = plt.subplots(figsize=(12,6))
    
    sns.heatmap(month_day, ax = ax, vmin = -1, vmax = 1, cmap = "mako")
    
    # Define dependency spectrum settings.
    cbax = f.axes[1]
    [l.set_fontsize(13) for l in cbax.yaxis.get_ticklabels()]
    cbax.set_ylabel("Stock Daily Returns (%)", fontsize=13)
    
    ax.set_title("Stock Daily Returns per Day of the Week and Month", fontsize=16)
    
    # Axis ticks settings.
    [l.set_fontsize(13) for l in ax.xaxis.get_ticklabels()]
    [l.set_fontsize(13) for l in ax.yaxis.get_ticklabels()]
    
    ax.set_xlabel("Month", fontsize=15)
    ax.set_ylabel("Day of the Week", fontsize=15)
    ax.set_yticklabels(["Mon", "Tue", "Wed", "Thu", "Fri"]);
    
    save_plot(name = "Daily Returns Heatmap", plot = " Daily Dependency")
    
    return


## 3. Stock Dependency on Auxiliary Data

#### 1. Update the Stocks dataframe in order to keep only the variables we still want to analyse.

In [78]:
def stocks_update(Stocks):
    """
    Stocks variables to drop as we do not need them anymore.
    """
    columns = ["Cumulative Return", "Moving Average 30"]
    
    drop_columns(columns, df = Stocks)
    
    return Stocks


#### 2. Merge auxiliary data with stock data for analysis.

In [79]:
def merge_to_evaluate(BalanceSheet, FinancialRatios, IncomeStatement,
                      Stocks, News, Weather, Wind, Twitter):
    """
    Merge each auxiliary dataset with the corresponding stock data.
    """
    # Balance Sheet
    evaluate_balance = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], BalanceSheet,
                                left_index = True, right_index = True, how = "outer")
    evaluate_balance = evaluate_balance.dropna()# Drop all rows with missing values.
    
    # Financial Ratios
    evaluate_ratios = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], FinancialRatios,
                               left_index = True, right_index = True, how = "outer")
    evaluate_ratios = evaluate_ratios.dropna()# Drop all rows with missing values.
    
    # Income Statement
    evaluate_income = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], IncomeStatement,
                               left_index = True, right_index = True, how = "outer")
    evaluate_income = evaluate_income.dropna()# Drop all rows with missing values.
    
    # News
    evaluate_media = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], News,
                              left_index = True, right_index = True, how = "outer")
    evaluate_media["Score"] = evaluate_media["Score"].shift(1)# Shift down to match the daily returns.
    evaluate_media["Total Traded"] = evaluate_media["Total Traded"].shift(1)
    evaluate_media = evaluate_media.dropna()# Drop all rows with missing values.
    
    # Weather
    evaluate_weather = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], Weather,
                                left_index = True, right_index = True, how = "outer")
    evaluate_weather["Average Temp"] = evaluate_weather["Average Temp"].shift(1)# Shift down to match the daily returns.
    evaluate_weather["Precipitation"] = evaluate_weather["Precipitation"].shift(1)
    evaluate_weather["Snow"] = evaluate_weather["Snow"].shift(1)
    evaluate_weather["Total Traded"] = evaluate_weather["Total Traded"].shift(1)
    evaluate_weather = evaluate_weather.dropna()# Drop all rows with missing values.
    
    # Wind
    evaluate_wind = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], Wind,
                             left_index = True, right_index = True, how = "outer")
    evaluate_wind["Peak Wind Gust"] = evaluate_wind["Peak Wind Gust"].shift(1)# Shift down to match the daily returns.
    evaluate_wind["Total Traded"] = evaluate_wind["Total Traded"].shift(1)
    evaluate_wind = evaluate_wind.dropna()# Drop all rows with missing values.
    
    # Twitter
    evaluate_twitter = pd.merge(Stocks[["Total Traded", "Daily Returns (%)"]], Twitter,
                                left_index = True, right_index = True, how = "outer")
    evaluate_twitter["Twitter Volume"] = evaluate_twitter["Twitter Volume"].shift(1)# Shift down to match the daily returns.
    evaluate_twitter["Twitter Score"] = evaluate_twitter["Twitter Score"].shift(1)
    evaluate_twitter["Total Traded"] = evaluate_twitter["Total Traded"].shift(1)
    evaluate_twitter = evaluate_twitter.dropna()# Drop all rows with missing values.
    
    return (evaluate_balance, evaluate_ratios, evaluate_income, evaluate_media,
            evaluate_weather, evaluate_wind, evaluate_twitter)
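
The merge-then-shift pattern used above can be illustrated on a tiny example. The frames `stocks` and `news` below are hypothetical stand-ins with invented values; only the column names `Daily Returns (%)` and `Score` come from the notebook.

```python
import pandas as pd

# Three consecutive trading days (Monday to Wednesday), invented values.
days = pd.to_datetime(["2020-01-06", "2020-01-07", "2020-01-08"])
stocks = pd.DataFrame({"Daily Returns (%)": [0.5, -1.0, 2.0]}, index=days)
news = pd.DataFrame({"Score": [0.8, -0.4, 0.1]}, index=days)

merged = pd.merge(stocks, news, left_index=True, right_index=True, how="outer")
merged["Score"] = merged["Score"].shift(1)  # yesterday's score now sits on today's row
merged = merged.dropna()                    # the first row has no previous score

print(merged)
```

Note that `shift(1)` moves values down by one row, not by one trading day, so the alignment depends on which dates are present after the outer merge.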


#### 3. Correlation matrix to view most closely correlated variables.

In [80]:
def correlation_matrixes(evaluate_balance, evaluate_ratios, evaluate_income,
                         evaluate_media, evaluate_weather, evaluate_wind,
                         evaluate_twitter):
    """
    Get correlation matrices from the merges just formed.
    """
    # Balance Sheet correlations.
    corr1 = evaluate_balance.corr()# Get correlation matrix.
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    bal_cor = (corr1.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = bal_cor, title = "Balance Correlation Matrix")
    
    # Financial Ratios correlations.
    corr2 = evaluate_ratios.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    rat_cor = (corr2.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = rat_cor, title = "Ratios Correlation Matrix")
    
    # Income Statement correlations.
    corr3 = evaluate_income.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    inc_cor = (corr3.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = inc_cor, title = "Income Correlation Matrix")
    
    # Media Sentiment correlations.
    corr4 = evaluate_media.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    med_cor = (corr4.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = med_cor, title = "Sentiment Correlation Matrix")
    
    # Weather correlations.
    corr5 = evaluate_weather.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    wea_cor = (corr5.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = wea_cor, title = "Weather Correlation Matrix")
    
    # Wind correlations.
    corr6 = evaluate_wind.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    wind_cor = (corr6.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = wind_cor, title = "Wind Correlation Matrix")
    
    # Twitter correlations.
    corr7 = evaluate_twitter.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    twi_cor = (corr7.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = twi_cor, title = "Twitter Correlation Matrix")
    
    return



From the correlation matrices, these are the most closely correlated variables:
- **Balance Sheet:** `Retained Earnings Accumulated Deficit` and `Total Traded` (-0.878).
- **Balance Sheet:** `Common Stocks Including Additional Paid in Capital` and `Total Traded` (0.862).
- **Balance Sheet:** `Stock Holders Equity` and `Total Traded` (0.878).
- **Financial Ratios:** `Debt Equity Ratio` and `Total Traded` (0.931).
- **Financial Ratios:** `Company Equity Multiplier` and `Total Traded` (0.931).
- **Financial Ratios:** `Price Book Value Ratio` and `Total Traded` (0.902).
- **Financial Ratios:** `Dividend Yield` and `Total Traded` (0.886).
- **Income Statement:** `Operating Expenses` and `Total Traded` (0.880).
- **Income Statement:** `Interest Income` and `Total Traded` (-0.861).
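
The slicing used to build these matrices, `corr.iloc[2:, :2]`, keeps only the auxiliary variables as rows and the two stock variables as columns. A sketch on invented data (the columns `Aux A` and `Aux B` are hypothetical placeholders for auxiliary variables):

```python
import pandas as pd

# The first two columns play the role of the stock variables.
df = pd.DataFrame({
    "Total Traded": [1.0, 2.0, 3.0, 4.0],
    "Daily Returns (%)": [0.5, -0.2, 0.1, 0.3],
    "Aux A": [2.0, 4.0, 6.0, 8.0],  # rises exactly with Total Traded
    "Aux B": [4.0, 3.0, 2.0, 1.0],  # falls exactly as Total Traded rises
})

corr = df.corr()                 # Pearson correlation by default
aux_vs_stock = corr.iloc[2:, :2]  # rows: auxiliary variables, columns: stock variables
print(aux_vs_stock)
```

This relies on the two stock columns coming first in the merged dataframe, which is the case when they are the left side of the merge.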

#### 4. Scatter plots to represent the correlations.

In [81]:
def scatter_plot(df, v1, v2, title):
    """
    Scatter plot to represent the relationship between two variables.
    """
    
    fig, ax = plt.subplots(figsize=(4, 3))
    
    # Scatter plot function. 
    ax.scatter(df[v1], df[v2])
    
    ax.set_title(str(v1) + " vs " + str(v2) + " Scatter Plot", fontsize = 8) # Title.
    ax.set_xlabel(str(v1))
    ax.set_ylabel(str(v2))
    
    save_plot(name = str(v2) + " vs " + str(v1) + str(title), plot = " Scatter Plot")
    
    return


In [82]:
def scatter_stocks_vs(evaluate_balance, evaluate_ratios, evaluate_media,
                      evaluate_weather, evaluate_wind, evaluate_twitter):
    """
    Various scatter plots to view the influence of different parameters on stock data.
    """
    # Commercial paper.
    scatter_plot(df = evaluate_balance, v1 = "Total Traded",
                 v2 = "Commercial Paper", title = " Original")
    
    # Company equity multiplier.
    scatter_plot(df = evaluate_ratios, v1 = "Total Traded",
                 v2 = "Company Equity Multiplier", title = " Original")
    
    # Sentimental score
    
    # Against total traded.
    scatter_plot(df = evaluate_media, v1 = "Total Traded",
                 v2 = "Score", title = " Original")
    # Against daily returns.
    scatter_plot(df = evaluate_media, v1 = "Daily Returns (%)",
                 v2 = "Score", title = " Original")
    
    # Average temp.
    scatter_plot(df = evaluate_weather, v1 = "Daily Returns (%)",
                 v2 = "Average Temp", title = " Original")
    
    # Wind
    
    # Against total traded.
    scatter_plot(df = evaluate_wind, v1 = "Total Traded",
                 v2 = "Peak Wind Gust", title = " Original")
    # Against daily returns
    scatter_plot(df = evaluate_wind, v1 = "Daily Returns (%)",
                 v2 = "Peak Wind Gust", title = " Original")
    
    # Twitter
    
    # Volume.
    scatter_plot(df = evaluate_twitter, v1 = "Total Traded",
                 v2 = "Twitter Volume", title = " Original")
    # Score.
    scatter_plot(df = evaluate_twitter, v1 = "Total Traded",
                 v2 = "Twitter Score", title = " Original")
    
    return


#### 5. Twitter Volume and Total traded.

Let's plot a period with spikes that is outside COVID, since we have already analysed that one. We will evaluate late 2018. There is already a plot for this from a previous analysis.

In [83]:
def analyse_twiter_2018_2019(evaluate_twitter, Stocks, evaluate_media):
    """
    Evaluate twitter data in late 2018, early 2019.
    """
    # Twitter volume.
    individual_part_series(df = evaluate_twitter, v = "Twitter Volume", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019  Part Plot")
    
    # Twitter score.
    individual_part_series(df = evaluate_twitter, v = "Twitter Score", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019  Part Plot")
    
    # Total money traded.
    individual_part_series(df = Stocks, v = "Total Traded", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019 Part Plot")
    
    # Adjusted closing price.
    individual_part_series(df = Stocks, v = "Adjusted Close", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019 Part Plot")
    
    # Daily returns.
    individual_part_series(df = Stocks, v = "Daily Returns (%)", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019 Part Plot")
    
    # Corresponding news score data to check whether they agree.
    individual_part_series(df = evaluate_media, v = "Score", start = "2018-09-01",
                           end = "2019-01-31", title = " Late 2018, Early 2019  Part Plot")
    
    return


In [84]:
def find_largest_twitter_volume(evaluate_twitter, News):
    """
    Max value of tweets volume and equivalent news score result.
    """
    twi_large = evaluate_twitter["Twitter Volume"].nlargest(1)
    
    news_equi = News["Score"].loc["2019-01-01":"2019-01-10"]
    
    return twi_large, news_equi


Essentially, the media sentiment results are useless in this investigation, as they do not reflect the actual response of the market.

### 4. Deep Analysing Financial & Sentimental Data

#### 1. Financial Data

Investigating commercial paper.

In [85]:
def analyse_commercial_paper(evaluate_balance, Stocks):
    """
    Analyse commercial paper in 2020.
    """
    # Commercial Paper.
    individual_part_series(df = evaluate_balance, v = "Commercial Paper", start = "2019-11-01",
                           end = "2020-11-30", title = " 2020 Part Plot")
    
    # Corresponding Total Traded.
    individual_part_series(df = Stocks, v = "Total Traded", start = "2020-01-01",
                           end = "2020-10-01", title = " 2020 Part Plot")
    
    return


#### 2. Media Sentiment Data

In [86]:
def analyse_news_sentiment(evaluate_media, Stocks):
    """
    Analyse media sentiment scores in 2018.
    """
    # News score.
    individual_part_series(df = evaluate_media, v = "Score", start = "2018-10-01",
                           end = "2018-12-01", title = " 2018 Part Plot")
    
    # Daily returns.
    individual_part_series(df = Stocks, v = "Daily Returns (%)", start = "2018-10-01",
                       end = "2018-12-01", title = " 2018 Part Plot")
    
    # Total money traded.
    individual_part_series(df = Stocks, v = "Total Traded", start = "2018-10-01",
                       end = "2018-12-01", title = " 2018 Part Plot")
    
    # Adjusted closing price.
    individual_part_series(df = Stocks, v = "Adjusted Close", start = "2018-10-01",
                           end = "2018-12-01", title = " 2018 Part Plot")
    
    return


#### Apply hypothesis testing.

As we can see, at the point where the stock data spiked, the sentimental response is extremely neutral. Therefore let's apply the **hypothesis that a sentimental score between -0.35 and 0.35 doesn't have a predictive value on tomorrow's stock results.** Thus we will drop all rows whose score falls inside this range.

In [87]:
def news_hypothesis_testing(evaluate_media):
    """
    Apply hypothesis testing that a sentimental score between -0.35 and 0.35
    does not have a predictive value on tomorrow's stock data.
    """
    # Apply hypothesis.
    evaluate_media2 = evaluate_media[(evaluate_media["Score"] > 0.35) |
                                     (evaluate_media["Score"] < -0.35)]
    
    # Scatter plots of new data.
    scatter_plot(df = evaluate_media2, v1 = "Total Traded",
                 v2 = "Score", title = " Hypothesis")
    scatter_plot(df = evaluate_media2, v1 = "Daily Returns (%)",
                 v2 = "Score", title = " Hypothesis")
    
    # Media sentiment correlations after hypothesis is applied.
    corr = evaluate_media2.corr()
    # Exclude first two rows and last columns to avoid correlations between the two stock variables.
    med2_cor = (corr.iloc[2:, : 2]).style.background_gradient(cmap='coolwarm')
    # Save.
    save_df_normal(df = med2_cor, title = "Hypothesis Sentiment Correlation Matrix")
    
    return


The hypothesis has definitely helped to clarify the correlation, as the correlation coefficients have gone up. However, they are still not high enough (0.19) to be reliable when it comes to evaluating stock prices.
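Whether a coefficient of this size is distinguishable from zero can be checked with a significance test. The sketch below uses `scipy.stats.pearsonr` on synthetic data (the arrays `score` and `returns` are generated for illustration and are not the notebook's data). It applies the same ±0.35 filter and reports the correlation together with its p-value.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
score = rng.uniform(-1, 1, size=200)                # synthetic sentiment scores
returns = 0.5 * score + rng.normal(0, 1, size=200)  # synthetic next-day returns

strong = np.abs(score) > 0.35                       # keep only the non-neutral scores
r, p_value = pearsonr(score[strong], returns[strong])

print(f"rows kept: {strong.sum()}, r = {r:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests the correlation is unlikely to be zero, but it says nothing about whether the relationship is strong enough to be useful for prediction.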

**Let's also evaluate the sentimental score with categorical analysis:**

In [88]:
def news_categorical_data(evaluate_media):
    """
    Perform categorical analysis on news sentiment data using the stated hypothesis.
    """
    # Get number of days for each data.
    
    # Sentimental scores.
    postive_days = len(evaluate_media[(evaluate_media["Score"] > 0.3)])
    negative_days = len(evaluate_media[(evaluate_media["Score"] < -0.3)])
    
    # Daily returns.
    increase_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] > 0)])
    decrease_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] < 0)])
    
    # Returns and positive news scores.
    inc_pos_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] > 0) & 
                                      (evaluate_media["Score"] > 0.3)])
    dec_pos_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] < 0) & 
                                      (evaluate_media["Score"] > 0.3)])
    
    # Returns and negative news scores.
    inc_neg_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] > 0) & 
                                      (evaluate_media["Score"] < -0.3)])
    dec_neg_days = len(evaluate_media[(evaluate_media["Daily Returns (%)"] < 0) & 
                                      (evaluate_media["Score"] < -0.3)])
    
    
    # Define appropriate data points.
    Positive = [inc_pos_days, dec_pos_days, postive_days]
    Negative = [inc_neg_days, dec_neg_days, negative_days]
    Total = [increase_days, decrease_days, (increase_days + decrease_days)]
    
    # Define dataframe columns.
    columns = ["Increase Days", "Decrease Days", "Total"]
    index = ["Increase Days", "Decrease Days", "Total"]
    
    # Define data points to corresponding columns.
    d = {"Positive Days": Positive, "Negative Days": Negative, "Total": Total}
    
    # Create dataframe table.
    Categorical_Score = pd.DataFrame(data = d, index = index, columns = None)
    
    # Save dataframe.
    save_df_normal(df = Categorical_Score, title = "Sentimental Score Categorical Data Table")
    
    return
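
One possible follow-up to such a table is a chi-squared test of independence between the news category and the direction of the price move, using `scipy.stats.chi2_contingency`. This is a sketch only: the counts below are hypothetical, not the notebook's results, and the test is applied to the inner 2x2 counts without the totals.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = positive / negative news days,
# columns = days the price increased / decreased.
observed = np.array([[30, 20],
                     [15, 25]])

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# At the 5% level, a p-value above 0.05 means independence cannot be rejected.
cannot_reject_independence = p_value > 0.05
print(chi2_stat, p_value, dof)
print(expected)
```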


### 5. Deep Analysing Weather & Wind Data with Categorical Analysis

We will determine whether it is "snowing", "heavy raining", "clear", "hot" or "cold".
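
A minimal sketch of how such labels could be assigned with `numpy.select`. The column names follow the weather dataframe used in this notebook, but the data are invented and the temperature thresholds (20 and 5) are assumptions chosen for illustration; only the precipitation cut-off of 5 appears later in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical weather observations for five days.
weather = pd.DataFrame({
    "Average Temp": [25.0, 2.0, 12.0, -1.0, 12.0],
    "Precipitation": [0.0, 0.0, 8.0, 1.0, 0.0],
    "Snow": [0.0, 0.0, 0.0, 30.0, 0.0],
})

# Conditions are checked in order; the first one that matches wins.
conditions = [
    weather["Snow"] > 0,
    weather["Precipitation"] > 5,
    weather["Average Temp"] >= 20,  # assumed threshold for "hot"
    weather["Average Temp"] <= 5,   # assumed threshold for "cold"
]
labels = ["snowing", "heavy raining", "hot", "cold"]
weather["Condition"] = np.select(conditions, labels, default="clear")

print(weather)
```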

#### 1. Evaluate Snow

In [89]:
def get_snow_dates(evaluate_weather):
    """
    Get dates of snowstorms over the four-year data span.
    """
    evaluate_snow = evaluate_weather[evaluate_weather["Snow"] != 0]
    snowstorms = list(evaluate_snow.index.values)
    
    return snowstorms


As we can see, there have been two big snow storms: Late February 2018 and late January 2021.

In [90]:
def snow_2018(evaluate_weather, evaluate_media, evaluate_twitter):
    """
    February 2018 snowstorm analysis.
    """
    # Total traded.
    individual_part_series(df = evaluate_weather, v = "Total Traded", start = "2018-02-25",
                           end = "2018-03-06", title = " Snow 2018")
    
    # Daily returns. 
    individual_part_series(df = evaluate_weather, v = "Daily Returns (%)", start = "2018-02-25",
                           end = "2018-03-06", title = " Snow 2018")
    
    # Sentimental score.
    individual_part_series(df = evaluate_media, v = "Score", start = "2018-02-25",
                           end = "2018-03-06", title = " Snow 2018")
    
    # Tweets volume.
    individual_part_series(df = evaluate_twitter, v = "Twitter Volume", start = "2018-02-25",
                           end = "2018-03-06", title = " Snow 2018")
    
    return


In [91]:
def snow_2021(evaluate_weather, evaluate_media):
    """
    January 2021 snowstorm analysis.
    """
    # Total traded.
    individual_part_series(df = evaluate_weather, v = "Total Traded", start = "2021-01-22",
                           end = "2021-01-31", title = " Snow 2021")
    
    # Daily returns. 
    individual_part_series(df = evaluate_weather, v = "Daily Returns (%)", start = "2021-01-22",
                           end = "2021-01-31", title = " Snow 2021")
    
    # Sentimental score.
    individual_part_series(df = evaluate_media, v = "Score", start = "2021-01-22",
                           end = "2021-01-31", title = " Snow 2021")
    
    # Tweets volume data is not available for this period.
    
    return


#### 2. Rain Evaluation

Again, we test the hypothesis that a little drizzle/precipitation will not have an effect on tomorrow's stock price, so only significant rain is considered.

In [92]:
def precipitation_analysis(evaluate_weather):
    """
    Precipitation analysis before and after applying hypothesis testing.
    """
    # Original
    scatter_plot(df = evaluate_weather, v1 = "Total Traded",
                 v2 = "Precipitation", title = " Original")
    
    # New hypothesis of only considering significant rain.
    evaluate_rain = evaluate_weather[(evaluate_weather["Precipitation"] > 5)]
    scatter_plot(df = evaluate_rain, v1 = "Total Traded",
                 v2 = "Precipitation", title = " Hypothesis")
    
    # Check effect of hypothesis testing on the correlation coefficients.
    ctt = evaluate_rain["Total Traded"].corr(evaluate_rain["Precipitation"])
    cdr = evaluate_rain["Daily Returns (%)"].corr(evaluate_rain["Precipitation"])
    print("New and improved precipitation correlation with: \nTotal Traded: " +
          str(ctt) + "\nDaily Returns (%): " + str(cdr))
    
    return


As with the snow hypothesis, the correlation coefficient improves; however, it is still too low.
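
One caveat when comparing coefficients before and after filtering is that the filtered subset is much smaller, so a larger coefficient can arise by chance. The sketch below, on hypothetical numbers, reports the sample size next to each Pearson coefficient. The `pearson` helper and the data are illustrative assumptions, not part of the analysis above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (precipitation, total traded) pairs.
rows = [(0.0, 10.0), (1.0, 14.0), (2.0, 9.0), (3.0, 13.0),
        (6.0, 11.0), (8.0, 12.0), (10.0, 14.0)]

all_r = pearson([p for p, _ in rows], [t for _, t in rows])

# Keep only days above the rain threshold used in precipitation_analysis.
rainy = [(p, t) for p, t in rows if p > 5]
rainy_r = pearson([p for p, _ in rainy], [t for _, t in rainy])

print("all days:   n = {}, r = {:+.2f}".format(len(rows), all_r))
print("rainy days: n = {}, r = {:+.2f}".format(len(rainy), rainy_r))
```

In this toy data the filtered coefficient is far higher, but it rests on only three points, which is why the sample size is worth printing alongside it.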

#### 3. Temperature Evaluation

We will again run a hypothesis that only "extreme" temperatures will affect the stock price.

In [93]:
def temp_hypothesis(evaluate_weather):
    """
    Apply hypothesis testing that only temp below 2 and above 20 have an effect on stocks.
    """
    # Apply hypothesis.
    evaluate_temp = evaluate_weather.loc[(evaluate_weather["Average Temp"] > 20) |
                                         (evaluate_weather["Average Temp"] < 2)]
    
    # Check effect on the correlation coefficients.
    ctt = evaluate_temp["Total Traded"].corr(evaluate_temp["Average Temp"])
    cdr = evaluate_temp["Daily Returns (%)"].corr(evaluate_temp["Average Temp"])
    
    print("New and improved temperature correlation with: \nTotal Traded: " +
          str(ctt) + "\nDaily Returns (%): " + str(cdr))
    
    return


#### 4. Data Categorisation

In [94]:
def weather_categorical_data(evaluate_weather):
    """
    Perform categorical analysis on weather data using the stated hypothesis.
    """
    # Precipitation
    rainy_days = len(evaluate_weather[(evaluate_weather["Precipitation"] >= 7)])
    
    # Temperature
    hot_days = len(evaluate_weather[(evaluate_weather["Average Temp"] >= 20)])
    cold_days = len(evaluate_weather[(evaluate_weather["Average Temp"]<= 2)])
    
    # Snow
    snow_days = len(evaluate_weather[(evaluate_weather["Snow"] > 0)])
    
    # Daily Returns 
    increase_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] > 0.3)])
    decrease_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] < -0.3)])
    
    # Returns and Rainy.
    inc_rain_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] > 0.3) & 
                                         (evaluate_weather["Precipitation"] >= 7)])
    dec_rain_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] < -0.3) & 
                                         (evaluate_weather["Precipitation"] >= 7)])
    
    # Returns and Snow.
    inc_snow_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] > 0.3) & 
                                         (evaluate_weather["Snow"] > 0)])
    dec_snow_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] < -0.3) &
                                         (evaluate_weather["Snow"] > 0)])
    
    # Returns and Hot.
    inc_hot_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] > 0.3) & 
                                        (evaluate_weather["Average Temp"] >= 20)])
    dec_hot_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] < -0.3) & 
                                        (evaluate_weather["Average Temp"] >= 20)])
    
    # Returns and Cold.
    inc_cold_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] > 0.3) & 
                                         (evaluate_weather["Average Temp"] <= 2)])
    dec_cold_days = len(evaluate_weather[(evaluate_weather["Daily Returns (%)"] < -0.3) & 
                                         (evaluate_weather["Average Temp"] <= 2)])
    
    
    # Creating dataframe table.
    Hot = [inc_hot_days, dec_hot_days, hot_days]
    Cold = [inc_cold_days, dec_cold_days, cold_days]
    Rainy = [inc_rain_days, dec_rain_days, rainy_days]
    Snowy = [inc_snow_days, dec_snow_days, snow_days]
    Total = [increase_days, decrease_days, (increase_days + decrease_days)]
    
    columns = ["Increase Days", "Decrease Days", "Total"]
    index = ["Increase Days", "Decrease Days", "Total"]
    d = {"Hot Days": Hot, "Cold Days": Cold, "Rainy Days": Rainy, "Snowy Days": Snowy,
     "Total": Total}
    
    Categorical_Weather = pd.DataFrame(data = d, index = index, columns = None)
    
    # Save dataframe.
    save_df_normal(df = Categorical_Weather, title = "Weather Categorical Data Table")
    
    return


#### 5. Wind Evaluation

In [95]:
def wind_categorical_data(evaluate_wind):
    """
    Perform categorical analysis on wind data using the stated hypothesis.
    """
    # Wind
    windy_days = len(evaluate_wind[(evaluate_wind["Peak Wind Gust"] > 55)])
    
    # Daily Returns 
    increase_days = len(evaluate_wind[(evaluate_wind["Daily Returns (%)"] > 0.3)])
    decrease_days = len(evaluate_wind[(evaluate_wind["Daily Returns (%)"] < -0.3)])
    
    # Returns and Wind.
    inc_windy_days = len(evaluate_wind[(evaluate_wind["Daily Returns (%)"] > 0.3) & 
                                       (evaluate_wind["Peak Wind Gust"] > 55)])
    dec_windy_days = len(evaluate_wind[(evaluate_wind["Daily Returns (%)"] < -0.3) & 
                                       (evaluate_wind["Peak Wind Gust"] > 55)])
    
    
    # Creating dataframe table.
    Windy = [inc_windy_days, dec_windy_days, windy_days]
    Total = [increase_days, decrease_days, (increase_days + decrease_days)]
    
    columns = ["Increase Days", "Decrease Days", "Total"]
    index = ["Increase Days", "Decrease Days", "Total"]
    d = {"Windy Days": Windy, "Total": Total}
    
    Categorical_Wind = pd.DataFrame(data = d, index = index, columns = None)
    
    # Save dataframe.
    save_df_normal(df = Categorical_Wind, title = "Wind Categorical Data Table")
    
    def chi_test(Categorical_Wind):
        """
        Check dependency using the chi-squared contingency test.
        """
        stat, p, dof, expected = chi2_contingency(Categorical_Wind)
        
        # Null hypothesis is that the variables are independent.
        prob = 0.90
        critical = chi2.ppf(prob, dof)
        if abs(stat) >= critical:
            print('Dependent (Reject Null Hypothesis)')
        else:
            print('Independent (Fail to Reject Null Hypothesis)')
        
        return
    
    return
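
A chi-squared test of independence is normally run on the observed counts only, without "Total" rows or columns. The sketch below computes the statistic by hand for a 2x2 table of hypothetical counts (increase/decrease days against windy/non-windy days) and compares it with the critical value at the 0.90 level used above. The counts, the helper name and the non-windy column are assumptions for illustration. No continuity correction is applied here, whereas `chi2_contingency` applies Yates' correction to 2x2 tables by default, so its statistic would differ slightly.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed_count - expected) ** 2 / expected
    return stat


# Hypothetical counts. Rows: increase days, decrease days.
# Columns: windy days, non-windy days. No "Total" row or column.
observed = [[12, 380],
            [18, 350]]

stat = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Critical value of the chi-square distribution for dof = 1 at the 0.90 level
# (the rounded value of chi2.ppf(0.90, 1)).
critical = 2.706

if stat >= critical:
    print("Dependent (Reject Null Hypothesis)")
else:
    print("Independent (Fail to Reject Null Hypothesis)")
```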


## Function to explore all data.

In [96]:
def explore_all(BalanceSheet, FinancialRatios, IncomeStatement,
                Stocks, News, Weather, Wind, Twitter):
    """
    Exploration of all datasets.
    """
    
    """
    Stock data analysis.
    """
    # High and low prices.
    high_low(Stocks)
    
    # Open and close prices.
    open_close(Stocks)
    
    # Close and adjusted close prices.
    close_adjusted(Stocks)
    
    # Total money traded.
    
    # Define it.
    Stocks["Total Traded"] = Stocks["Open"] * Stocks["Stock Volume"]
    # Plot time series to compare with stock volume.
    totaltraded_volume(Stocks)
    
    # Analyse 2020 stock volume, total traded and adjusted closing price metrics.
    stock_analysis_2020(Stocks)
    
    # Daily and cumulative returns.
    
    # Define daily returns.
    Stocks["Daily Returns (%)"] = ((Stocks["Adjusted Close"]/
                                    Stocks["Adjusted Close"].shift(1)) - 1) * 100
    # Define cumulative returns.
    returns = ((Stocks["Adjusted Close"]/
                Stocks["Adjusted Close"].shift(1)) - 1)
    Stocks["Cumulative Return"] = (1 + returns).cumprod()
    # Plot them.
    daily_cumulative_returns(Stocks)
    
    # Daily returns probability distribution.
    returns_distribution(Stocks)
    
    # Rolling averages.
    
    # Full time period comparison to original prices.
    rolling_original_full(Stocks)
    # 2020 comparison plot.
    rolling_original_2020(Stocks)
    
    # Christmas effect.
    christmas_totaltraded(Stocks)
    
    # Interesting 2018 stock data subplots.
    late_2018_stock(Stocks)
    
    # Dependencies.
    
    # Monthly.
    monthly_dependency(Stocks)
    # Daily.
    daily_dependency(Stocks)
    
    
    """
    Stock dependency on auxiliary data.
    """
    ### PREPARING DATA FOR ANALYSIS.
    
    # Update stocks df by cleaning it a bit.
    stocks_update(Stocks)
    
    # Merge each auxiliary df to stock in order to evaluate.
    (evaluate_balance, evaluate_ratios,
     evaluate_income, evaluate_media,
     evaluate_weather, evaluate_wind,
     evaluate_twitter) = merge_to_evaluate(BalanceSheet, FinancialRatios,
                                           IncomeStatement,Stocks, News,
                                           Weather, Wind, Twitter)
    # Save correlation matrices.
    correlation_matrixes(evaluate_balance, evaluate_ratios, evaluate_income,
                         evaluate_media, evaluate_weather, evaluate_wind,
                         evaluate_twitter)
    
    ### ANALYSIS.
    
    # Various scatter plots between stock variables and auxiliary.
    scatter_stocks_vs(evaluate_balance, evaluate_ratios, evaluate_media,
                      evaluate_weather, evaluate_wind, evaluate_twitter)
    
    # Evaluate twitter.
    analyse_twiter_2018_2019(evaluate_twitter, Stocks, evaluate_media)
    
    # Evaluate commercial paper.
    analyse_commercial_paper(evaluate_balance, Stocks)
    
    
    # Evaluate news sentiment scores.
    
    # Initial analysis.
    analyse_news_sentiment(evaluate_media, Stocks)
    # Apply hypothesis testing.
    news_hypothesis_testing(evaluate_media)
    # Apply categorical analysis.
    news_categorical_data(evaluate_media)
    
    
    # Evaluate weather data.
    
    # Evaluate snow.
    
    # Get dates of snowstorms over the years.
    snowstorms = get_snow_dates(evaluate_weather)
    # Evaluate February 2018 storm.
    snow_2018(evaluate_weather, evaluate_media, evaluate_twitter)
    # Evaluate January 2021 storm.
    snow_2021(evaluate_weather, evaluate_media)
    
    # Evaluate precipitation.
    
    # Initial and hypothesis comparison.
    precipitation_analysis(evaluate_weather)
    # Weather categorisation.
    weather_categorical_data(evaluate_weather)
    
    
    # Wind categorisation.
    wind_categorical_data(evaluate_wind)
    
    return Stocks


`Stocks` is returned at the end because, throughout this section, only the stocks dataframe has been altered, by adding the `Total Traded`, `Daily Returns (%)` and `Cumulative Return` variables.
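
The return columns follow the usual definitions: the daily return is r_t = P_t / P_(t-1) - 1 and the cumulative return is the running product of (1 + r_t), which equals P_t / P_0. A minimal sketch on hypothetical prices (the series values are made up for illustration):

```python
import pandas as pd

# Hypothetical adjusted closing prices on four consecutive trading days.
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

# Daily return: r_t = P_t / P_(t-1) - 1 (the first day has no previous price).
daily = prices / prices.shift(1) - 1

# Cumulative return: running product of (1 + r_t), i.e. P_t / P_0.
cumulative = (1 + daily).cumprod()

print(daily.round(4).tolist())
print(cumulative.round(4).tolist())
```

The first row is `NaN` in both columns because there is no previous price to compare against.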

<div class="alert alert-heading alert-info">

## Task 5: Inference

Train a model to predict the closing stock price on each day for the data you have already
collected, stored, preprocessed and explored from previous steps. The data must be spanning
from April 2017 to April 2021.
You should develop two separate models:


1. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using only time series of stock prices.
2. A model for predicting the closing stock price on each day for a 1-month time window (until
    end of May 2021), using the time series of stock prices and the auxiliary data you collected.
Which model is performing better? How do you measure performance and why? How could you
further improve the performance? Are the models capable of predicting the closing stock prices
far into the future?

[IMPORTANT NOTE] For these tasks, you are not expected to compare model architectures, but
examine and analyse the differences when training the same model with multiple data attributes
and information from sources. Therefore, you should decide a single model suitable for time series
data to solve the tasks described above. Please see the lecture slides for tips on model selection
and feel free to experiment before selecting one.

The following would help you evaluate your approach and highlight potential weaknesses in your
process:

1. Evaluate the performance of your model using different metrics, e.g. mean squared error,
    mean absolute error or R-squared.
2. Use ARIMA and Facebook Prophet to explore the uncertainty on your model’s predicted
    values by employing confidence bands.
3. Result visualization: create joint plots showing marginal distributions to understand the
    correlation between actual and predicted values.
4. Finding the mean, median and skewness of the residual distribution might provide
    additional insight into the predictive capability of the model.
</div>

## 1. Simple Prediction Solely on Stock Prices

In [97]:
def is_pandemic_affected(ds):
    """
    Add regressor to indicate if the data was affected by the pandemic.
    """
    date = pd.to_datetime(ds)
    
    return (date.year == 2020)


In [98]:
def prepare_data(Stocks, data, target_feature, name):
    """
    Prepare data to be inserted into the Prophet algorithm.
    """
    # Copy the data.
    new_data = data.copy()
    
    # Prepare the data.
    new_data.reset_index(inplace = True)
    new_data = new_data.rename({"Date": "ds", "{}".format(target_feature): name}, axis = 1)
    new_data = new_data.dropna()
    
    # Include pandemic effects.
    def is_pandemic_affected(ds):
        """
        Add regressor to indicate if the data was affected by the pandemic.
        """
        date = pd.to_datetime(ds)
        
        return date.year == 2020
    
    new_data["pandemic_affected"] = new_data["ds"].apply(is_pandemic_affected)
    new_data["not_pandemic_affected"] = ~new_data["ds"].apply(is_pandemic_affected)
    
    return new_data
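
Prophet's conditional seasonalities expect one boolean column per `condition_name`, present in both the training frame and the future frame. As a sketch, the same two columns can be built without a row-by-row `apply`, using the `.dt` accessor. The frame below is hypothetical and only illustrates the expected layout.

```python
import pandas as pd

# Hypothetical frame already in Prophet's format: "ds" dates and "y" target.
frame = pd.DataFrame({
    "ds": pd.to_datetime(["2019-12-31", "2020-01-02", "2020-12-31", "2021-01-04"]),
    "y": [100.0, 101.0, 130.0, 129.0],
})

# Vectorised equivalent of applying is_pandemic_affected to every row.
frame["pandemic_affected"] = frame["ds"].dt.year == 2020
frame["not_pandemic_affected"] = ~frame["pandemic_affected"]

print(frame)
```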


Prepare holidays data.

In [99]:
# Create dataframe.
holidays_new = pd.DataFrame([], columns = ["ds", "holiday"])
ldates = []
lnames = []
# For loop to get holidays for all the years defined.
for date, name in sorted(holidays.England(years=np.arange(2017, 2021 + 1)).items()):
    ldates.append(date)
    lnames.append(name)

# Get data points and define them within the dataframe.
ldates = np.array(ldates)
lnames = np.array(lnames)
holidays_new.loc[:, "ds"] = ldates
holidays_new.loc[:, "holiday"] = lnames
holidays_new.holiday.unique()

# Remove the " (Observed)" suffix so observed holidays share the name of the original holiday.
holidays_new.loc[:, "holiday"] = holidays_new.loc[:, "holiday"].apply(lambda x : x.replace(" (Observed)", ""))


#### First, I test the predictive model on the last month of data (April 2021) to evaluate its accuracy. Then I predict the month of May, for reference when comparing with the other predictive models I will design later on.

In [100]:
# Training data from April 2017 to March 2021; the test data is the remaining month (April 2021).
def train_test_split(data):
    # Select "ds" date column as the index.
    # Define train and test periods.
    train = data.set_index("ds").loc[: "2021-03-31", :].reset_index()
    test = data.set_index("ds").loc["2021-04-01":, :].reset_index()
    train = train.dropna()
    test = test.dropna()
    
    return train, test


simple_train, simple_test = train_test_split(data = stocks_new)

### Testing Model & Corresponding Inference

In [101]:
def standard_m_prophet():
    
    # The seasonality flags select which periodic components Prophet fits.
    m = Prophet(holidays = holidays_new,
                seasonality_mode = "multiplicative",
                yearly_seasonality = True,# Fit a yearly cycle.
                weekly_seasonality = True,# Fit a weekly cycle.
                daily_seasonality = False)# Disabled: the data is daily, so there is no within-day pattern to fit.
    
    return m 


In [102]:
def train_model(data_train, model):
    """
    Fit the model on the training data and forecast over the validation period.
    """
    m = standard_m_prophet()
    
    # Add conditional seasonalities for the pandemic and non-pandemic periods.
    m.add_seasonality(name = "pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "pandemic_affected")
    m.add_seasonality(name = "not_pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "not_pandemic_affected")
    
    # Fit all of the seasonalities above into our Prophet model.
    prophet_train_test = m.fit(data_train)
    
    # Extend the dates 62 days past the end of the training data (April and May).
    future_test = m.make_future_dataframe(periods = 62, freq = "1D")
    
    # Add the pandemic condition columns to the future dataframe.
    future_test["pandemic_affected"] = future_test["ds"].apply(is_pandemic_affected)
    future_test["not_pandemic_affected"] = ~future_test["ds"].apply(is_pandemic_affected)
    
    # Only making the future dataframe on weekdays as the stock market is closed on weekends.
    future_test = future_test[future_test["ds"].dt.dayofweek < 5]
    
    # Produces a detailed dataframe comprising of many modelled time series components.
    forecast_test = m.predict(future_test)
    
    comp = m.plot_components(forecast_test, figsize = (7, 10))
    
    # Save plots.
    save_plot(name = "Prophet " + str(model), plot = " Components")
    
    return forecast_test
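
Each `add_seasonality` call above asks for a periodic component described by a truncated Fourier series: with `fourier_order = 3` there are three sine/cosine pairs, so six coefficients are fitted per component. The sketch below builds such features for a single time point. It illustrates the general technique under that assumption and is not Prophet's internal code; the function name is hypothetical.

```python
from math import cos, pi, sin


def fourier_features(t_days, period, order):
    """Return [sin, cos] pairs for harmonics 1..order at time t_days."""
    features = []
    for n in range(1, order + 1):
        angle = 2 * pi * n * t_days / period
        features.extend([sin(angle), cos(angle)])
    return features


# A quarter of the way through a 365-day period, with three harmonics.
row = fourier_features(t_days=365 / 4, period=365, order=3)
print([round(value, 6) for value in row])
```

A higher order lets the seasonal curve change more quickly, at the cost of a greater risk of fitting noise, which is one reason to keep the order low for a component that is only active during a single year.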


In [103]:
def predict_model(data, period):
    """
    Fit the model on the full data period and infer data for the month of May.
    """
    m = standard_m_prophet()
    
    # Add conditional seasonalities for the pandemic and non-pandemic periods.
    m.add_seasonality(name = "pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "pandemic_affected")
    m.add_seasonality(name = "not_pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "not_pandemic_affected")
    
    # This time apply the fitting and prediction process to the entire Stock data period.
    prophet_train_predict = m.fit(data)
    
    # Extend the data to the month of May in order to predict.
    future_predict = m.make_future_dataframe(periods = period, freq = "1D")
    
    # Add the pandemic condition columns to the future dataframe.
    future_predict["pandemic_affected"] = future_predict["ds"].apply(is_pandemic_affected)
    future_predict["not_pandemic_affected"] = ~future_predict["ds"].apply(is_pandemic_affected)
    
    # Only weekdays.
    future_predict = future_predict[future_predict["ds"].dt.dayofweek < 5]
    
    # Produces a detailed dataframe comprising of many modelled time series components.
    forecast_predict = m.predict(future_predict)
    
    return forecast_predict

#### 1. Plots for Testing the Model

In [104]:
def predictions_test(forecast, data_train, data_test, model):
    
    def make_predictions_df(): 
        """
        Function to convert the output Prophet dataframe to a datetime index and append the actual target values at the end
        """
        
        forecast.index = pd.to_datetime(forecast.ds)
        data_train.index = pd.to_datetime(data_train.ds)
        data_test.index = pd.to_datetime(data_test.ds)
        data = pd.concat([data_train, data_test], axis=0)
        forecast.loc[:, "y"] = data.loc[:, "y"] # y is the actual output of the datasets. 
        
        return forecast
    
    result = make_predictions_df()
    
    """
    Clip predictions so that all the values are positive as expected.
    """
    result.loc[:, "yhat"] = result.yhat.clip(lower=0)
    result.loc[:, "yhat_lower"] = result.yhat_lower.clip(lower=0)
    result.loc[:, "yhat_upper"] = result.yhat_upper.clip(lower=0)
    
    def plot_test_predictions():
        """
        Function to plot the predictions 
        """
        
        fig, ax = plt.subplots(figsize = (10, 6))
        
        train = result.loc["2019-10-01": "2021-03-31", :]
        ax.plot(train.index, train.y, "bo", markersize = 3, label = "Real Training Data")
        ax.plot(train.index, train.yhat, color = "steelblue", lw = 0.5, label = "Predicted Training Data")
        ax.fill_between(train.index, train.yhat_lower, train.yhat_upper, color = "steelblue", alpha = 0.3)
        
        test = result.loc["2021-04-01":, :]
        ax.plot(test.index, test.y, "ro", markersize = 3, label = "Real Testing Data")
        ax.plot(test.index, test.yhat, color = "coral", lw = 0.5, label = "Predicted Testing Data")
        ax.fill_between(test.index, test.yhat_lower, test.yhat_upper, color = "coral", alpha = 0.3)
        ax.axvline(result.loc["2021-04-01", "ds"], color = "k", ls = "--", alpha = 0.7)
        
        plt.grid(ls = ":", lw = 0.5)
        plt.xlabel("Date", fontsize = 14)
        plt.ylabel("Adjusted Close Price", fontsize = 14)
        plt.title("Validating Inference Model With " + str(model) + " Data", fontsize = 18)
        plt.legend(loc = "upper left", prop={'size': 17})
        
        # Save the plot
        save_plot(name = "Prophet " + str(model) + " Validation", plot = " Time Series")
        
        return
    
    plot_test_predictions()
    
    return result
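
One simple check on the shaded uncertainty band is its empirical coverage: the share of actual test values that fall between `yhat_lower` and `yhat_upper`. Prophet's default `interval_width` is 0.8 and `standard_m_prophet` does not override it, so a well-calibrated band would contain roughly 80% of the points. The sketch below uses hypothetical values in place of the real columns, and the helper name is an assumption.

```python
def interval_coverage(actual, lower, upper):
    """Fraction of actual values lying inside [lower, upper]."""
    inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
    return sum(inside) / len(inside)


# Hypothetical test-period values standing in for y, yhat_lower and yhat_upper.
y = [120.0, 122.0, 125.0, 131.0]
yhat_lower = [115.0, 116.0, 117.0, 118.0]
yhat_upper = [125.0, 126.0, 127.0, 128.0]

coverage = interval_coverage(y, yhat_lower, yhat_upper)
print("Coverage: {:.0%}".format(coverage))
```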


#### 2. Plots for Inference for the Month of May

In [105]:
def predictions_inference(forecast, data, model):
    
    def make_predictions_df(): 
        """
        Function to convert the output Prophet dataframe to a datetime index and append the actual target values at the end
        """
        
        forecast.index = pd.to_datetime(forecast.ds)
        data.index = pd.to_datetime(data.ds)
        forecast.loc[:, "y"] = data.loc[:, "y"] # y is the actual output of the datasets. 
        
        return forecast
    
    result = make_predictions_df()
    
    """
    Clip predictions so that all the values are positive as expected.
    """
    
    result.loc[:, "yhat"] = result.yhat.clip(lower=0)
    result.loc[:, "yhat_lower"] = result.yhat_lower.clip(lower=0)
    result.loc[:, "yhat_upper"] = result.yhat_upper.clip(lower=0)
    
    def plot_inference_predictions():
        """
        Function to plot the predictions. 
        """
        
        fig, ax = plt.subplots(figsize = (10, 6))
        
        train = result.loc["2019-10-01": "2021-04-30", :]
        ax.plot(train.index, train.y, "bo", markersize = 3, label = "Real Training Data")
        ax.plot(train.index, train.yhat, color = "steelblue", lw = 0.5, label = "Predicted Training Data")
        ax.fill_between(train.index, train.yhat_lower, train.yhat_upper, color = "steelblue", alpha = 0.3)
        
        # From the 3rd of May because the 1st and 2nd fall on the weekend.
        infer = result.loc["2021-05-03":, :]
        ax.plot(infer.index, infer.yhat, color = "coral", lw = 0.5, label = "Inferred Data")
        ax.fill_between(infer.index, infer.yhat_lower, infer.yhat_upper, color = "coral", alpha = 0.3)
        ax.axvline(result.loc["2021-05-03", "ds"], color = "k", ls = "--", alpha = 0.7)
        
        plt.grid(ls = ":", lw = 0.5)
        plt.xlabel("Date", fontsize = 14)
        plt.ylabel("Adjusted Close Price", fontsize = 14)
        plt.title("Testing Inference Model With " + str(model) + " Data", fontsize = 18)
        plt.legend(loc = "upper left", prop={'size': 17})
        # Save the plot
        save_plot(name = "Prophet " + str(model) + " Test", plot = " Time Series")
        
        return
    
    plot_inference_predictions()
    
    return


#### 3. Analyse the Performance of the Model

Create a joint plot to evaluate the accuracy of the inference model.

In [106]:
def joint_plots(model_results, model):
    """
    Joint plot function to quantify the performance of each model.
    """
    
    # Define the training and testing data periods.
    data0 = model_results.loc[:"2021-03-31", :]
    data1 = model_results.loc["2021-04-01":, :]
    
    # First joint plot for the training data.
    g0 = sns.jointplot(x = "yhat", y = "y", data = data0,
              kind = "reg", color = "b")
    
    # Define the axis within the plot to define the title.
    ax0 = g0.fig.axes[1]
    ax0.set_title("Train Data", fontsize = 17)
    
    # Define the axis within the plot to define axis and plot settings.
    ax1 = g0.fig.axes[0]
    ax1.text(40, 120, "R = {:+4.2f}".format(data0.loc[:,["y", "yhat"]].corr().iloc[0, 1]),
             fontsize = 24)# Find correlation between the real and predicted values.
    ax1.set_xlabel("Predictions", fontsize = 14)
    ax1.set_ylabel("Observations", fontsize = 14)
    ax1.set_xlim([20, 160])
    ax1.set_ylim([20, 160])
    ax1.grid(ls = ":")
    
    # Second joint plot for the validation data (from April 2021).
    g1 = sns.jointplot(x = "yhat", y = "y", data = data1,
              kind = "reg", color = "b")
    
    # Define the axis within the plot to define the title.
    ax0 = g1.fig.axes[1]
    ax0.set_title("Validation Data", fontsize = 17)
    
    # Define the axis within the plot to define axis and plot settings.
    ax1 = g1.fig.axes[0]
    ax1.text(40, 120, "R = {:+4.2f}".format(data1.loc[:,["y", "yhat"]].corr().iloc[0, 1]), fontsize = 24)
    ax1.set_xlabel("Predictions", fontsize = 14)
    ax1.set_ylabel("Observations", fontsize = 14)
    ax1.set_xlim([20, 160])
    ax1.set_ylim([20, 160])
    ax1.grid(ls = ":")
    
    """
    Seaborn's jointplot doesn't support subplots directly.
    So I will save the plots to temporary image files, load them and subplot.
    """
    
    # Save plots temporarily to disk.
    g0.savefig("g0.png")
    plt.close(g0.fig)# Close so that plot is not shown.
    
    g1.savefig("g1.png")
    plt.close(g1.fig)
    
    # Create subplot from the temporary image files.
    f, axarr = plt.subplots(1, 2, figsize=(10, 5))
    
    axarr[0].imshow(mpimg.imread("g0.png"))
    axarr[1].imshow(mpimg.imread("g1.png"))
    
    # Turn off x and y axis.
    [ax.set_axis_off() for ax in axarr.ravel()]
    
    # Save plot in local drive.
    save_plot(name = "Joint Plot ", plot = str(model))
    
    return
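
Alongside the correlation coefficient shown on the joint plots, the error metrics and residual statistics named in the task can be computed from the `y` and `yhat` columns of the validation period. The sketch below is a plain-Python illustration on hypothetical values; the function names are assumptions, and the skewness is the population (biased) estimate.

```python
from statistics import mean, median


def regression_metrics(actual, predicted):
    """Return MSE, MAE and R-squared for paired observations."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    actual_mean = mean(actual)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((a - actual_mean) ** 2 for a in actual)
    return {"mse": ss_res / len(residuals),
            "mae": mean([abs(r) for r in residuals]),
            "r2": 1 - ss_res / ss_tot}


def residual_summary(actual, predicted):
    """Return mean, median and population skewness of the residuals."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    m = mean(residuals)
    second = mean([(r - m) ** 2 for r in residuals])
    third = mean([(r - m) ** 3 for r in residuals])
    return {"mean": m, "median": median(residuals),
            "skewness": third / second ** 1.5}


# Hypothetical observed and predicted prices for four validation days.
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 101.0, 105.0, 105.0]

metrics = regression_metrics(actual, predicted)
summary = residual_summary(actual, predicted)
print(metrics)
print(summary)
```

MSE penalises large misses more heavily than MAE, and a residual mean or skewness far from zero would suggest the model is systematically over- or under-predicting.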


### Function for entire basic inference model.

In [107]:
def basic_model(Stocks):
    """
    Basic model with only stock data and pandemic regressor.
    """
    
    """
    Prepare data for basic inference model with only stock data.
    """
    # Only the stocks adjusted closing price.
    StocksProphet = Stocks[["Adjusted Close"]]
    
    # Get stocks data in wanted format. 
    stocks_new = prepare_data(Stocks, data = StocksProphet,
                              target_feature = "Adjusted Close", name = "y")
    
    # Save dataframe format.
    save_df_normal(df = stocks_new.head(5), title = "Stocks Prophet Format")
    
    """
    Split.
    """
    # Define train and test datasets.
    simple_train, simple_test = train_test_split(data = stocks_new)
    
    """
    Train the model.
    """
    # Validating model.
    forecast_test = train_model(simple_train, model = "Basic Model")
    simple_model_results = predictions_test(forecast_test, simple_train,
                                            simple_test, model = "Basic")
    
    """
    Plot results of inference.
    """
    # Test inference model
    forecast_predict = predict_model(data = stocks_new, period = 31)
    predictions_inference(forecast_predict, data = stocks_new,
                          model = "Basic")
    
    """
    Joint plot to see effectiveness.
    """
    # Joint plot to see the effectiveness of the model.
    joint_plots(model_results = simple_model_results,
                model = "Only Stock Data")
    
    return


## 2. Prediction model with the auxiliary data.

There is a big issue with most of our auxiliary datasets: they have many missing values relative to the period we are investigating in the stock dataset. For example, the financial datasets only have a data point every quarter, so we have to impute the missing data points in order to perform the inference with the Facebook Prophet module.

### Predict financial data.

In [108]:
def interprolate_financial(BalanceSheet, FinancialRatios, IncomeStatement):
    """
    Interpolate financial data in order to get all data points.
    """
    
    """
    Balance Sheet.
    """
    # Upsample in order to have data for every day.
    balance_res = BalanceSheet.resample("1D")
    # Interpolate through time
    balance_res = balance_res.interpolate(method = "time")
    
    """
    Financial Ratios.
    """
    ratios_res = FinancialRatios.resample("1D")
    ratios_res = ratios_res.interpolate(method = "time")
    
    """
    Income Statement.
    """
    income_res = IncomeStatement.resample("1D")
    income_res = income_res.interpolate(method = "time")
    
    return balance_res, ratios_res, income_res
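
To make the upsampling step concrete, the sketch below applies the same `resample("1D")` and `interpolate(method = "time")` calls to a tiny hypothetical series with two reports ten days apart. The column name and values are made up for illustration.

```python
import pandas as pd

# Hypothetical sparse series with two reports ten days apart.
sparse = pd.DataFrame(
    {"Total Assets": [100.0, 200.0]},
    index=pd.to_datetime(["2021-01-01", "2021-01-11"]),
)

# Upsample to one row per day, then fill the gaps linearly in time.
daily = sparse.resample("1D").interpolate(method="time")

midpoint = daily.loc[pd.Timestamp("2021-01-06"), "Total Assets"]
print(len(daily), midpoint)
```

Interpolating between two reports uses the later report to fill the earlier days, so the imputed values contain information that would not have been known on those days. That is worth keeping in mind when judging the model that uses the auxiliary data.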


There is only financial data until March 29th. So we first forecast the financial data until the end of May, and from there forecast the stock data.

In [109]:
def make_predictions_df(forecast, data): 
    """
    Function to convert the output Prophet dataframe to a datetime index and append the actual target values at the end.
    """
    forecast.index = pd.to_datetime(forecast.ds)
    data.index = pd.to_datetime(data.ds)
    forecast.loc[:, "y"] = data.loc[:, "y"] # y is the actual output of the datasets. 
    
    return forecast


In [110]:
def predict_financial_data(balance_res, ratios_res, income_res):
    """
    Model to predict financial data for May.
    """
    # Define lists and arrays to be used throughout the function.
    dfs = [balance_res, ratios_res, income_res]
    prepare = {}
    fit = {}
    future = {}
    predict = {}
    results = {}
    new = {}
    
    for df in dfs:
        cols = list(df.columns)
        # Perform for all columns in all financial dfs.
        for col in cols:
            # Prepare data by defining a df for each variable.
            prepare[col] = prepare_data(Stocks, data = df, target_feature = col, name = "y")
            prepare[col] = prepare[col][["ds", "y"]]
            
            # Apply pandemic analysis.
            def is_pandemic_affected(ds):
                """
                Add regressor to indicate if the data was affected by the pandemic.
                """
                date = pd.to_datetime(ds)
                
                return date.year == 2020
            
            # Apply pandemic analysis.
            prepare[col]["pandemic_affected"] = prepare[col]["ds"].apply(is_pandemic_affected)
            prepare[col]["not_pandemic_affected"] = ~prepare[col]["ds"].apply(is_pandemic_affected)
            # Only for week days.
            prepare[col] = prepare[col][prepare[col]["ds"].dt.dayofweek < 5]
            prepare[col] = prepare[col].dropna()
            
            # Run prophet model.
            m = standard_m_prophet()
            m.add_seasonality(name = "pandemic_affected", period = 365, fourier_order = 3,
                              mode = "multiplicative", condition_name = "pandemic_affected")
            m.add_seasonality(name = "not_pandemic_affected", period = 365, fourier_order = 3,
                              mode = "multiplicative", condition_name = "not_pandemic_affected")
            
            # Fit the model.
            fit[col] = m.fit(prepare[col])
            
            # Extend the data to the month of May in order to predict.
            future[col] = m.make_future_dataframe(periods = 62, freq = "1D")
            
            # Add pandemic regressors into the model df.
            future[col]["pandemic_affected"] = future[col]["ds"].apply(is_pandemic_affected)
            future[col]["not_pandemic_affected"] = ~future[col]["ds"].apply(is_pandemic_affected)
            
            # Only weekdays again to make sure.
            future[col] = future[col][future[col]["ds"].dt.dayofweek < 5]
            
            # Predict May values.
            predict[col] = m.predict(future[col])
            
            # Results dfs.
            results[col] = make_predictions_df(predict[col], data = prepare[col])
            # Define new variable values as "yhat", the predicted values.
            results[col].loc[:, "yhat"] = results[col].yhat.clip(lower = 0)
            # Only extract inferred values.
            results[col] = results[col].loc["2021-03-30":, :]
            results[col] = (results[col])[["yhat"]]
            # Change column name to see each variable.
            results[col].rename(columns = {"yhat": col}, inplace = True)
            
    # Extract all newly generated variable datasets from each df.
    return results


In [111]:
def group_financial_data(balance_res, ratios_res, income_res, results):
    """
    Group inferred and old real data, and group into a single df.
    """
    
    """
    Concat balance.
    """
    balance_full = pd.concat([results["Commercial Paper"], results["Retained Earnings Accumulated Deficit"],
                              results["Stock Holders Equity"]], join = "outer", axis = 1)
    balance_full = pd.concat([balance_res, balance_full])
    
    """
    Concat ratios.
    """
    ratios_full = pd.concat([results["Debt Ratio"], results["Debt Equity Ratio"],
                             results["Long Term Debt to Capitalization"], results["Total Debt to Capitalization"],
                             results["Company Equity Multiplier"], results["Price Book Value Ratio"],
                             results["Price to Book Ratio"], results["Dividend Yield"], results["Price Fair Value"]],
                            join = "outer", axis = 1)
    ratios_full = pd.concat([ratios_res, ratios_full])
    
    """
    Concat income.
    """
    income_full = pd.concat([results["Interest Income"], results["Weighted Average Shares Outstanding"],
                             results["Weighted Average Diluted Shares Outstanding"]],
                            join = "outer", axis = 1)
    income_full = pd.concat([income_res, income_full])
    
    """
    Merge all into full financial df.
    """
    financial_full = pd.concat([balance_full, ratios_full, income_full], join = "outer", axis = 1)
    financial_full = financial_full.dropna()
    
    return financial_full

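The grouping pattern above can be sketched on toy data: the per-variable predictions are first joined side by side (`axis = 1`) and then stacked underneath the historical rows. The two column names are taken from the ratios above, while the values, dates and variable names such as `historical` and `predicted` are invented for illustration.

```python
import pandas as pd

hist_idx = pd.date_range("2021-03-25", periods=3, freq="D")
future_idx = pd.date_range("2021-03-30", periods=2, freq="D")

# Historical data with two variables (hypothetical values).
historical = pd.DataFrame(
    {"Debt Ratio": [0.30, 0.31, 0.32], "Dividend Yield": [0.010, 0.011, 0.012]},
    index=hist_idx,
)
# One single-column df per predicted variable, as returned in `results`.
pred_debt = pd.DataFrame({"Debt Ratio": [0.33, 0.34]}, index=future_idx)
pred_yield = pd.DataFrame({"Dividend Yield": [0.013, 0.014]}, index=future_idx)

# Join the predictions column-wise, then append them below the history.
predicted = pd.concat([pred_debt, pred_yield], join="outer", axis=1)
full = pd.concat([historical, predicted])

print(full)
```

Because the predicted columns carry the same names as the historical ones, the row-wise concat lines them up without introducing missing values.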

### Full inference model.

Prepare data.

In [112]:
def prepare_full_data(Stocks, financial_full):
    """
    Prepare data to merge.
    """
    # Set this as global to use it later to define the pandemic.
    global ds
    
    # Prepare stock data.
    stocks_full_new = Stocks[["Adjusted Close"]]
    stocks_full_new.reset_index(inplace = True)
    stocks_full_new = stocks_full_new.rename({"Date": "ds", "{}".format("Adjusted Close"): "y"}, axis = 1)
    stocks_full_new = stocks_full_new.dropna()
    stocks_full_new.set_index("ds", inplace = True)
    
    # Prepare financial data.
    finance = financial_full.loc[:, list(financial_full.columns)]
    finance.reset_index(inplace = True)
    finance.rename(columns = {"index": "ds"}, inplace = True)
    finance = finance[finance["ds"].dt.dayofweek < 5]
    finance.set_index("ds", inplace = True)
    
    """
    Merge.
    """
    prophet_full = pd.merge(finance, stocks_full_new, left_index = True, right_index = True, how = "outer")
    prophet_full.reset_index(drop = False, inplace = True)
    
    # Include pandemic.
    def is_pandemic_affected(ds):
        """
        Add regressor to indicate if the data was affected by the pandemic.
        """
        date = pd.to_datetime(ds)
        
        return date.year == 2020
    
    prophet_full["pandemic_affected"] = prophet_full["ds"].apply(is_pandemic_affected)
    prophet_full["not_pandemic_affected"] = ~prophet_full["ds"].apply(is_pandemic_affected)
    
    return prophet_full, finance

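A minimal sketch of the merge step, on made-up data: an outer merge on the date index keeps dates that exist only in the financial data, leaving `y` empty there. The pandemic flags are built here with a vectorised year comparison, which is an alternative to the `apply` call above rather than the notebook's own code.

```python
import pandas as pd

# Toy financial data that extends one day beyond the stock data.
finance = pd.DataFrame(
    {"Debt Ratio": [0.30, 0.31, 0.32]},
    index=pd.to_datetime(["2021-03-26", "2021-03-29", "2021-03-30"]),
)
finance.index.name = "ds"
# Toy stock data (hypothetical prices).
stocks = pd.DataFrame(
    {"y": [120.0, 121.5]},
    index=pd.to_datetime(["2021-03-26", "2021-03-29"]),
)
stocks.index.name = "ds"

# Outer merge keeps every date from both frames.
merged = pd.merge(finance, stocks, left_index=True, right_index=True, how="outer")
merged = merged.reset_index()

# Complementary boolean flags for the conditional seasonalities.
merged["pandemic_affected"] = merged["ds"].dt.year == 2020
merged["not_pandemic_affected"] = ~merged["pandemic_affected"]

print(merged)
```

The last row has a financial value but no `y`, which is the shape needed when the regressors are known further ahead than the target.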

#### Validation.

In [113]:
def train_full(finance, data_train, model):
    """
    Build function to test the algorithm.
    """
    m = standard_m_prophet()
    
    # Add pandemic regressors.
    m.add_seasonality(name = "pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "pandemic_affected")
    m.add_seasonality(name = "not_pandemic_affected", period = 365, fourier_order = 3,
                      mode = "multiplicative", condition_name = "not_pandemic_affected")
    # Add financial regressors.
    for regressor in list(finance.columns):
        m.add_regressor(regressor, mode = "multiplicative")
    
    # Fit all of the seasonalities above into our Prophet model.
    prophet_train_test = m.fit(data_train)
    
    global ds
    
    # Extend the training data into our test data period (two months).
    future_test = m.make_future_dataframe(periods = 61, freq = "1D")
    
    # Add pandemic regressors into the model df.
    future_test["pandemic_affected"] = future_test["ds"].apply(is_pandemic_affected)
    future_test["not_pandemic_affected"] = ~future_test["ds"].apply(is_pandemic_affected)
    
    # Add financial data.
    future_test.set_index("ds", inplace = True)
    future_full = pd.merge(finance, future_test, left_index = True,
                           right_index = True, how = "outer")
    future_full.reset_index(drop = False, inplace = True)
    
    # Only making the future dataframe on weekdays as the stock market is closed on weekends.
    future_full = future_full[future_full["ds"].dt.dayofweek < 5]
    future_full = future_full.dropna()
    
    # Produces a detailed dataframe comprising of many modelled time series components.
    forecast_full = m.predict(future_full)
    
    return forecast_full


### Function for entire full inference model.

In [114]:
def full_model(Stocks, BalanceSheet, FinancialRatios, IncomeStatement):
    """
    Full model with auxiliary data.
    """
    
    """
    Predict auxiliary data.
    """
    # Interpolate financial data.
    balance_res, ratios_res, income_res = interprolate_financial(BalanceSheet, FinancialRatios,
                                                                 IncomeStatement)
    
    # Predict data for April and May of each individual financial variable.
    results = predict_financial_data(balance_res, ratios_res, income_res)
    
    # Generate a single financial df with data all the way until May 2021.
    financial_full = group_financial_data(balance_res, ratios_res,
                                          income_res, results)
    
    """
    Stock inference model.
    """
    # Prepare data for prophet model.
    prophet_full, finance = prepare_full_data(Stocks, financial_full)
    
    # Split.
    full_train, full_test = train_test_split(data = prophet_full)
    
    # Validation.
    
    # Train.
    full_predict = train_full(finance, full_train, model = "Full")
    # Predict.
    full_model_results = predictions_test(full_predict, full_train,
                                          full_test, model = "Full")
    
    # Joint plot.
    joint_plots(full_model_results, model = "Full")
    
    return


<div class="alert alert-heading alert-danger">

## Autorun

</div>

#### Running all functions in sequence to produce the work automatically.

In [115]:
def main():
    """
    Pandas settings.
    """
    pandas_settings()
    
    """
    Data acquisition.
    """
    acquire_all()
    
    """
    Data preprocessing.
    """
    # Data is retrieved from the Oracle SQL database in this function as well.
    (BalanceSheet, FinancialRatios, IncomeStatement,
     Stocks, News, Weather, Wind, Twitter) = process_all()
    
    """
    Data description.
    """
    data_description()
    
    """
    Data exploration.
    """
    Stocks = explore_all(BalanceSheet, FinancialRatios, IncomeStatement,
                         Stocks, News, Weather, Wind, Twitter)
    
    """
    Data Inference.
    """
    # Only stock data.
    basic_model(Stocks)
    
    # Full data.
    full_model(Stocks, BalanceSheet, FinancialRatios, IncomeStatement)
    
    return


In [116]:
main()

twitter-sentiments-aapl-stock.zip: Skipping, found more recently modified local copy (use --force to force download)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/panda

New and improved precipitation correlation with: 
Total Traded: -0.09186158551421496
Daily Returns (%): 0.08056232748093907




Initial log joint probability = -13.0441
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3275.63    0.00165117       771.453           1           1      120   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       3394.16     0.0369734       3037.97           1           1      230   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       3456.53    0.00416766       2277.03      0.3235           1      344   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       3486.48    0.00136794       1941.63     0.06075           1      456   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       3536.08    0.00534196       638.178      0.6316      0.6316      578   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -17.2023
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3330.57     0.0235048       558.795           1           1      118   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       3477.96    0.00420516       4669.15     0.04697           1      230   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       3534.32     0.0044198       832.106           1           1      349   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       3568.67    0.00227123       557.694      0.9114      0.9114      469   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       3592.97    0.00385711       1383.05           1           1      589   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -5.23439
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       4120.08    0.00317396       2507.31           1           1      114   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4520.58     0.0121678       11677.9           1           1      222   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4712.28     0.0191395       8539.57           1           1      333   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       4790.07    0.00200081       4282.21      0.4206      0.4206      456   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       4877.18   0.000757729       2014.77      0.8077      0.8077      568   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -3.45718
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       4475.39      0.013256       3885.33           1           1      125   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199        4764.6     0.0127541       4191.67           1           1      233   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4934.65     0.0156103       8737.38           1           1      345   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5053.86     0.0151861       8399.75           1           1      455   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5154.02   0.000780856       4230.06           1           1      565   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -28.9388
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       2991.59     0.0138958       2470.94           1           1      116   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       3856.36    0.00147923       4439.27           1           1      237   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4184.48    0.00889923       5746.17           1           1      355   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       4310.69     0.0115583       15128.9           1           1      471   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       4409.74    0.00229117       2599.56           1           1      586   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -2.19455
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       5433.36    0.00408235       11418.3           1           1      131   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       5736.25    0.00178109          8698           1           1      242   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5954.38   0.000486379       13167.5      0.1252           1      350   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       6226.11    0.00293487       19502.1           1           1      461   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499        6268.1    0.00567191       9015.19           1           1      575   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -4.13235
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       4367.67    0.00291826       3131.65      0.4662      0.4662      121   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4761.87     0.0052439       7411.49       0.435       0.435      230   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5186.12    0.00674362       35597.1      0.5794      0.5794      341   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5352.12    0.00210633       22157.6      0.7205      0.7205      459   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5547.85    0.00324828       6614.12           1           1      569   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -2.64784
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       5492.15   0.000283289       3078.96           1           1      123   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       5681.22     0.0105186       35434.7      0.4819      0.4819      235   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5913.05   0.000543178       3831.78           1           1      349   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       6089.38   0.000129275       4115.71           1           1      456   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       6155.29    0.00196371       22257.1           1           1      563   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -2.21935
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       5498.35    0.00346123       4294.16           1           1      121   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       5736.16    0.00115196       7381.86           1           1      229   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5819.38    0.00067702       7838.72      0.6753      0.6753      336   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       6070.68     0.0121337       20563.8      0.1827           1      448   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       6246.93    0.00070697        6301.8           1           1      560   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -3.41205
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       4420.08    0.00562518       8408.18           1           1      121   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4746.83      0.015372       9546.39           1           1      231   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5031.85    0.00208069       3254.13           1           1      345   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5174.16    0.00287839       2200.02           1           1      455   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5360.62    0.00126919       4304.31           1           1      569   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -19.2461
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3802.44    0.00347648       3282.28      0.5829      0.5829      122   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4229.49     0.0194709        4364.6      0.7931      0.7931      234   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4606.67    0.00597804       4098.94           1           1      343   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5010.73    0.00777845       20247.8      0.1476           1      464   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5122.24   0.000305635       4819.37      0.3312      0.3312      580   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -19.2461
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3802.44    0.00347648       3282.28      0.5829      0.5829      122   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4229.49     0.0194709        4364.6      0.7931      0.7931      234   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4606.67    0.00597804       4098.94           1           1      343   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5010.73    0.00777845       20247.8      0.1476           1      464   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5122.24   0.000305635       4819.37      0.3312      0.3312      580   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -13.8659
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3032.52     0.0511343       1829.49      0.5131           1      121   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       3443.02    0.00737429       6427.16        0.27        0.27      240   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       3590.32     0.0100286       3088.86      0.6201      0.6201      356   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       3730.02      0.010764       1865.55           1           1      472   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       3775.04    0.00462935       1481.52           1           1      587   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -19.2461
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3802.44    0.00347648       3282.28      0.5829      0.5829      122   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       4229.49     0.0194709        4364.6      0.7931      0.7931      234   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       4606.67    0.00597804       4098.94           1           1      343   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5010.73    0.00777845       20247.8      0.1476           1      464   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       5122.24   0.000305635       4819.37      0.3312      0.3312      580   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -12.4645
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       4556.87     0.0993996       35210.6           1           1      120   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       5101.93     0.0150281       14267.1           1           1      228   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       5448.71    0.00491085       4916.33           1           1      337   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       5649.23     0.0021154       7059.44           1           1      446   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       6188.64      0.039699       60362.3           1           1      552   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -2.08493
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       6111.61   0.000728907       7783.68           1           1      127   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       6238.33    0.00427511       12530.6           1           1      237   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       6848.14   0.000801741        8883.1           1           1      346   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       6912.87   5.30628e-05       8713.25      0.5181      0.5181      456   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       7091.87   0.000301015       50552.5           1           1      566   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -2.09002
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       6643.38   0.000194927       14905.7      0.9977      0.9977      127   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       6932.61    0.00041064       19782.7       1.418      0.1418      240   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       7053.21   9.27545e-05       34101.2      0.4608      0.4608      346   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       7116.22   0.000104622         42031      0.5848      0.5848      454   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       7147.77    0.00037236       27294.2      0.3063           1      562   
    Iter      log prob        ||dx||      ||grad||       alpha  



Initial log joint probability = -13.0441
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      99       3647.73    0.00546813       5454.29           1           1      125   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     199       3714.35    0.00272869       604.606           1           1      237   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     299       3725.23   0.000752032       977.766           1           1      352   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     399       3732.91   0.000288651       972.587      0.3639           1      473   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
     499       3735.86   0.000599807       696.152           1           1      591   
    Iter      log prob        ||dx||      ||grad||       alpha  



<div class="alert alert-heading alert-info">

## Uploading code into GitHub
    
Pip install `requirements.txt` before running this code.
    
**Q:** Can I remove or edit the build.yml, the workflow or the .github folder?
**A:** No, you cannot. It is our responsibility to maintain these automations. If you think you have found a bug, do not try to fix it yourself; please report it to us and we will try to fix it, if possible.

</div>