# stock_data.py

## Import for StockData.py


First, we import all the packages used in StockData.py. We use the `import` statement and give each package an alias with the `as` keyword.
We import `numpy` because we use some of its built-in features, such as the `np.nan` constant.
The `pandas` package lets us read and overwrite our CSV data files.
The `matplotlib.pyplot` package is used to plot the stock data as graphs.


```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

> *Learning point: The programmer can create aliases for imported packages so that their functions are shorter to type and easier to recognize.*

## Class for StockData

We create a class named `StockData`, which will contain the attributes and functions.

```class StockData():```

> *Learning point: Classes are often created because they allow us to bundle data and functionality together.*

## Constructor

We create a constructor using `__init__`, which requires one string parameter: `filepath`.
The attribute `filepath` stores the parameter `filepath`.
The attribute `data` stores the pandas dataframe read from the CSV file at `filepath`, with the `Date` column set as its index.
Constructing a `StockData` object also calls and runs the `check_data()` function.


def __init__(self, filepath):
    self.filepath = filepath
    self.data = pd.read_csv(filepath).set_index('Date')
    self.check_data()

> *Learning point: If you do not define `__init__`, the class inherits a default constructor from `object` that takes no extra arguments and does nothing.*
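
As a minimal illustration of the learning point above, the sketch below compares a class with no `__init__` to one that defines it. The class names `NoInit` and `WithInit` and the file name are hypothetical and are not part of `StockData`.

```python
class NoInit:
    """No __init__ defined: the default one inherited from object is used."""
    pass


class WithInit:
    """Defines __init__, so each object stores the path it was given."""

    def __init__(self, filepath):
        self.filepath = filepath


plain = NoInit()                      # works, but the object has no attributes of its own
configured = WithInit("prices.csv")   # hypothetical file name, never opened here

print(vars(plain))        # {}
print(vars(configured))   # {'filepath': 'prices.csv'}
```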

## Check data

We start by checking for missing data and filling in any missing values in the CSV data by interpolation. We use the `interpolate()` function to fill in the estimated values. By default it performs linear interpolation: a single missing value becomes the average of the values before and after it, and a run of consecutive missing values is spaced evenly between its two known neighbours. We do this step first so that our dataset is clean and free of gaps.
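
To make the effect of linear interpolation concrete, here is a small sketch on made-up prices (not taken from the real stock file). It calls `interpolate()` with its default settings on a series that has one single gap and one double gap.

```python
import numpy as np
import pandas as pd

# Illustrative closing prices with one single gap and one double gap.
prices = pd.Series([10.0, np.nan, 14.0, np.nan, np.nan, 20.0])

filled = prices.interpolate()  # the default method is linear

print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```

The single gap becomes the average of its neighbours (12.0), while the two consecutive gaps are spaced evenly between 14.0 and 20.0.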

We define a function named `check_data()`. This function checks for and handles missing data by filling in missing values by interpolation. The `overwrite` parameter (default `True`) takes a boolean value that controls whether the original source stock data .csv file is overwritten.


In [None]:
def check_data(self, overwrite=True):
    self.data = self.data.interpolate()

> *Learning point: When creating a function, we need to make sure the body is properly indented after the colon. All the code in the function must be indented at least as far as its first line.*

The next part is to overwrite the original stock data .csv file. We use the pandas built-in function `to_csv()` with `self.filepath` as the file path. The file is written only when `overwrite` is `True`, and `index=True` keeps the `Date` index in the file so that the constructor can read it back.

In [None]:
if overwrite:
    self.data.to_csv(self.filepath, index=True)

We then use `return` to send the `StockData` object itself (`self`) back to any code that calls this function.


In [None]:
return self

> *Learning point: The `return` statement is often used at the end of a function to return the result (value) of an expression to the caller. Statements after the return statement are not executed. If the return statement has no expression, the value returned is `None`.*
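
A practical consequence of returning `self` is that calls can be chained. The sketch below uses a hypothetical `Counter` class as a stand-in for `StockData`; it only illustrates the behaviour of `return self` versus a function with no `return`.

```python
class Counter:
    """Hypothetical stand-in for StockData, used only to show `return self`."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount
        return self  # hand the same object back to the caller

    def report(self):
        print("total:", self.total)
        # no return statement, so the caller receives None


counter = Counter().add(2).add(3)   # chaining works because add() returns self
result = counter.report()           # prints "total: 5"
print(result)                       # None
```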

## Get data

The `get_data` function returns a subset of the stock data from `start_date` to `end_date` inclusive. The parameters `start_date` and `end_date` are of type `str` and give the start and end dates of the stock data range; they must be in the format YYYY-MM-DD.


In [None]:
def get_data(self, start_date, end_date):

The attribute `self.selected_data` stores a dataframe sliced from the specified start date to the end date, inclusive.

In [None]:
self.selected_data = self.data[str(start_date):str(end_date)]

We then use `return` to send `selected_data`, which covers the start to end dates, to any code that calls this function.

In [None]:
return self.selected_data
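
The slicing step can be tried in isolation. The following sketch builds a tiny dataframe with made-up rows (the column name `Adj Close` and all values are assumptions for illustration) and shows that label slicing on the `Date` index includes both end points.

```python
import pandas as pd

# Made-up rows standing in for the stock CSV; 'Date' is a string index.
data = pd.DataFrame(
    {
        "Date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06"],
        "Adj Close": [100.0, 101.5, 99.8, 102.3],
    }
).set_index("Date")

selected = data["2020-01-02":"2020-01-03"]  # both end points are included

print(selected.index.tolist())  # ['2020-01-02', '2020-01-03']
```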

## Get period

The `get_period` function is used to obtain the earliest and latest dates in the `data` dataframe. Since `data` is indexed by date, we can obtain a list of dates with `list(self.data.index)`. From this list, we take the first and last elements and return them in a tuple.

In [None]:
def get_period(self):
    index = list(self.data.index)
    (first, last) = (index[0], index[-1])
    return (first, last)

> *Learning point: If you want to return more than one variable, you can return them in a heterogeneous container such as a tuple or a list.*
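
On the calling side, a returned tuple can be unpacked straight into separate names. This is a small sketch with a hypothetical helper `first_and_last` and made-up dates, mirroring what `get_period` returns.

```python
def first_and_last(dates):
    """Return the first and last items of a non-empty list as a tuple."""
    return (dates[0], dates[-1])


dates = ["2020-01-01", "2020-01-02", "2020-01-03"]  # made-up dates
start, end = first_and_last(dates)                   # tuple unpacking

print(start)  # 2020-01-01
print(end)    # 2020-01-03
```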

## calculate_SMA

In the `calculate_SMA` function, we take in one parameter: `n`, which is the number of days used to calculate the simple moving average (SMA).

With `n`, we create a column label made of `SMA` followed by the value of `n`.


In [None]:
col_head = 'SMA' + str(n)  # col_head will be SMA15 if n is 15

Because the dataframe `self.data` is indexed by date, we use `reset_index()` to undo the index and turn the date back into an ordinary column.

In [None]:
df = self.data.reset_index()

> *Learning point: To speed up development, we should use the built-in functions provided by packages whenever they fulfil the requirements.*

Then we check whether the column name `col_head` is missing from `df` by using the following code:


In [None]:
if col_head not in df.columns:

> *Learning point: `not` is a logical operator commonly used with conditional statements such as `if`/`else` or `while`. Combined with `in`, as in `not in`, it tests that a value is absent from a container.*

If it is found in `df`, we will `return self` and leave the dataframe untouched as the SMA of `n` number of days has already been calculated. Otherwise, we will begin the calculation.

We begin by retrieving the list of dates found in `self.data` (a portion of the full data) and creating `returnList`, which will later store the calculated SMA values, by using the following code:


In [None]:
dateList = self.data.index.values.tolist() 
returnList = []

With this list of dates, we run a for loop over each date and find the index of that date in the full dataset. We then use this `dateIndex` to see whether there is enough data to calculate the SMA. For example, we need 15 data points prior to the current day in order to calculate the 15-day SMA. If there is not enough data prior to the current date, we append NaN to `returnList` to show that there is no SMA for that date.

In [None]:
for date in dateList:
    dateIndex = df[df["Date"]==date].index.values[0]
    if dateIndex < n:
        returnList.append(np.nan)

If there is enough data, we run a for loop with `n` iterations to sum the adjusted close values over `n` days, then divide the sum by `n` to obtain the SMA value. At the end of the loop, we append the SMA to `returnList`.
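
The calculation described above can be sketched with plain lists. This is not the original implementation, which is not shown in full: the function name `simple_moving_average` is hypothetical, and the window is assumed to be the `n` values before the current day, which is one reading of the `dateIndex < n` check.

```python
import math


def simple_moving_average(values, n):
    """SMA over the n values before each position; NaN where fewer than n exist."""
    result = []
    for i in range(len(values)):
        if i < n:
            # Not enough earlier data points for this position.
            result.append(math.nan)
        else:
            window_sum = 0.0
            for offset in range(1, n + 1):
                window_sum += values[i - offset]
            result.append(window_sum / n)  # mean of the window, not just the sum
    return result


sma = simple_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], 2)
print(sma)  # [nan, nan, 1.5, 2.5, 3.5]
```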

After calculating the SMA for every date in `self.data`, we insert `returnList`, which contains all the SMA values, as a new column named by `col_head`. At the end of the function, we save the dataframe with the SMA into a CSV file.


In [None]:
self.data[col_head] = returnList
self.data.to_csv(self.filepath, index=True)

## calculate_crossover

We first start by creating and defining the shell of the calculate_crossover function:

In [None]:
def calculate_crossover(self, SMAa, SMAb):
    
    return self

This function takes in the names of two SMA columns previously calculated by the `calculate_SMA` function and uses them to calculate the crossover locations.

Next, we write the code inside the function. We first define the columns we plan to add to the .csv file and extract all the data loaded from the .csv file:


In [None]:
col_head3 = "Buy"
col_head4 = "sell"
df = self.data

We convert the date index into a list, which we use as a reference to ensure our subsequent calculations have the correct number of elements:

In [None]:
SMAlist = self.data.index.values.tolist()

We then use an if, elif, and else statement to assign the lower SMA to `SMA1` and the higher SMA to `SMA2`. This is useful later in the calculations to ensure that buy and sell signals are correctly identified.

In [None]:
if SMAa < SMAb: 
    SMA1 = df[SMAa].tolist()
    SMA2 = df[SMAb].tolist()

elif SMAa > SMAb:
    SMA1 = df[SMAb].tolist()
    SMA2 = df[SMAa].tolist()
    
else:
    raise ValueError(f"Given {SMAa} & {SMAb} are the same. Must be different SMA")


> *learning point: if, elif, and else statements*

> *elif is used here because there are several distinct possibilities for how SMAa and SMAb are related. It is common to list the expected possibilities first in the if and elif statements, and else would normally be reserved for unexpected outcomes or errors*

`df[SMAa].tolist()` extracts the column `SMAa` from the dataframe `df` and converts it to a list. Likewise for `df[SMAb].tolist()`. If the two SMA arguments are equal, the code raises a `ValueError` with the error message.
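
One point worth checking when ordering the two SMAs: if `SMAa` and `SMAb` are column-name strings such as `'SMA15'` (as the `df[SMAa]` lookup suggests), then `<` compares them character by character rather than by window length. The sketch below shows the difference; the helper `sma_window` is hypothetical and assumes the `'SMA' + str(n)` naming built in `calculate_SMA`.

```python
def sma_window(column_name):
    """Hypothetical helper: extract the window length, e.g. 'SMA15' -> 15."""
    return int(column_name[len("SMA"):])


# String comparison looks at characters: '1' sorts before '5'.
print("SMA100" < "SMA50")                          # True

# Numeric comparison of the window lengths gives the intended order.
print(sma_window("SMA100") < sma_window("SMA50"))  # False
```

For pairs with the same number of digits, such as `'SMA15'` and `'SMA50'`, both comparisons agree.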

We create empty lists for the relative position of the two SMAs (`stockPosition`), the combined list of crossover signals (`stockSignal`), and finally separate lists for the buy and sell signals (`buySignal`, `sellSignal`). These lists will be referenced and used in the next few lines of code.

In [None]:
stockPosition = []
stockSignal = []
buySignal = []  
sellSignal = [] 

To create a list of relative SMA positions, we use a for loop:

In [None]:
for i in range(len(SMAlist)): 
    if SMA1[i] > SMA2[i]: stockPosition.append(1)  
    elif SMA1[i] < SMA2[i]: stockPosition.append(0)  
    elif SMA1[i] == SMA2[i]: stockPosition.append(stockPosition[i-1])
    else: stockPosition.append(np.nan)

By setting the range of the for loop to be the length of `SMAlist`, we ensure that the loop iterates over every row of the dataframe.

Any day on which `SMA1` (the lower SMA) is higher than `SMA2` adds a `1` to the `stockPosition` list.
Days where `SMA2` is higher than `SMA1` add a `0` to the `stockPosition` list. The end result is a list of 1s and 0s showing which SMA is higher on any given day.

In the unlikely case that the two SMA values are equal on a given day, the number added will be a repeat of the previous day, as no crossover has occurred yet.

On days where either SMA is missing data, such as in the first few days when there is not enough data to compute the SMA, we will add `np.nan` to the list as a filler.
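
The four branches can be exercised on short made-up lists. The sketch below is a standalone version of the loop with a hypothetical function name; unlike the loop above, it also guards the case where the very first pair is equal, which is an added assumption.

```python
import math


def relative_positions(sma1, sma2):
    """Return 1 where sma1 is above sma2, 0 where below, NaN where unknown."""
    positions = []
    for a, b in zip(sma1, sma2):
        if a > b:
            positions.append(1)
        elif a < b:
            positions.append(0)
        elif a == b:
            # Repeat the previous position; NaN if there is no previous day.
            positions.append(positions[-1] if positions else math.nan)
        else:
            # Reached only when a or b is NaN: every comparison with NaN is False.
            positions.append(math.nan)
    return positions


short_sma = [math.nan, 1.0, 3.0, 3.0, 2.0]  # made-up values
long_sma = [math.nan, 2.0, 2.0, 3.0, 4.0]
positions = relative_positions(short_sma, long_sma)
print(positions)  # [nan, 0, 1, 1, 0]
```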

After getting the full `stockPosition` list, we need to identify the days where crossover occurs. For this, another for loop is used:

In [None]:
for j in range(len(stockPosition)):
    if j == 0: stockSignal.append(np.nan)
    else: stockSignal.append(stockPosition[j] - stockPosition[j-1])

Again we set the range for the loop to be the length of `stockPosition` to ensure the code iterates over every element. 

The `stockSignal` list 'lags' behind the stockPosition list by one day, hence we add a `np.nan` as the very first value in the list to align the `stockSignal` list with the `stockPosition` list and ensure that both lists have the same number of elements. 

Following that, we take the difference between the `stockPosition` that day and the `stockPosition` the previous day to identify the locations of crossovers. Crossovers show up in the list as `1` for a buy signal and `-1` for a sell signal. `0` indicates that there has been no crossover that day.
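
As a quick check of the differencing step, here is a standalone sketch with a hypothetical function name, applied to a made-up position list. It uses `math.nan` in place of `np.nan`, which behaves the same way in subtraction.

```python
import math


def crossover_signals(positions):
    """Day-over-day change in position: 1 = buy, -1 = sell, 0 = no crossover."""
    signals = []
    for j in range(len(positions)):
        if j == 0:
            signals.append(math.nan)  # no previous day to compare with
        else:
            signals.append(positions[j] - positions[j - 1])
    return signals


signals = crossover_signals([math.nan, 0, 1, 1, 0])
print(signals)  # [nan, nan, 1, 0, -1]
```

The second entry is also NaN because subtracting NaN from a number yields NaN.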

> *learning point: indexing*

> *remember that in python, sequences start with 0, not 1! Hence,* `j == 0` *just refers to the first element in the range*

> *learning point: np.nan*

> *remember that any arithmetic operation on `NaN` will result in `NaN`. This allows us to append the list with null values without generating a value error*

The next step would be to filter out the buy and sell signals, which will be processed separately by the application:

In [None]:
for k in range(len(stockSignal)):
    if stockSignal[k] == 1:
        value = self.data[SMAa].tolist()[k]
        buySignal.append(value)
    else: buySignal.append(np.nan)
        
for k in range(len(stockSignal)): 
    if stockSignal[k] == -1:
        value = self.data[SMAa].tolist()[k]
        sellSignal.append(value)
    else: sellSignal.append(np.nan)

Using yet another set of for loops, we identify the crossover locations in the `stockSignal` list. At the crossover locations, we append the value of the `SMAa` column on that particular day to the appropriate buy or sell list. This value will then be used as the y-axis value that the application uses to plot the crossover signals on the graph.

The else condition appends `np.nan` to the list on days that do not contain the respective crossover signals, and ensures that the signals are correctly aligned to the dates where the crossover occurred. 
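
The filtering step can be sketched in a single pass over made-up data. The function name `split_signals` is hypothetical; it produces the same two aligned lists as the pair of loops above, with NaN on every day that has no matching signal.

```python
import math


def split_signals(signals, values):
    """Build aligned buy and sell lists holding a value only on signal days."""
    buy, sell = [], []
    for signal, value in zip(signals, values):
        # NaN never equals 1 or -1, so NaN signal days fall through to NaN.
        buy.append(value if signal == 1 else math.nan)
        sell.append(value if signal == -1 else math.nan)
    return buy, sell


signals = [math.nan, math.nan, 1, 0, -1]      # made-up crossover signals
sma_values = [math.nan, 1.0, 3.0, 3.0, 2.0]   # made-up SMA column values
buy, sell = split_signals(signals, sma_values)

print(buy)   # [nan, nan, 3.0, nan, nan]
print(sell)  # [nan, nan, nan, nan, 2.0]
```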

Finally, with the locations of buy and sell crossover signals, the function will append the buy and sell signals to the .csv file as new columns while also printing the results in the application:

In [None]:
self.data[col_head3] = buySignal
self.data[col_head4] = sellSignal

print(self.data)
self.data.to_csv(self.filepath, index=True)

> *learning point: testing*

> *the reason why the function prints the results is so we can independently test whether the function works even before the rest of the app is completed. Splitting work up in such a complex application is crucial so you can identify exactly which part of the app is causing errors!* 