# Moving Big Data Predict 
© Explore Data Science Academy

## Data Processing Guide:  Historical Data Processing

### Overview 

This notebook is provided to help you complete the data processing component of your data pipeline, built as part of the Moving Big Data predict. 

Several completed helper functions are initially provided, with the expectation that you will implement the main `data_processing` function. 

At the conclusion of testing, you are expected to convert this notebook into a `.py` script file that can be run whenever your pipeline is invoked.  

## Imports

We use only the `pandas` library to perform the data processing. Ensure that the AMI, and the EC2 instance launched from it to run your data processing, has this dependency installed.  

In [13]:
import pandas as pd

### Variables

We initially define three path variables that are used within the data processing: 
1. Source data path: The path to the company `.csv` files used within the data processing. 
2. Saved data path: The path to which the resulting aggregated `historical_stock_data.csv` should be saved. 
3. Path of index file: Path to the index file that defines companies whose data needs to be aggregated. Within the predict, we make use of the `top_companies.txt` file as this index. 

We declare these variables in two sets. The first, local set of variables is defined to enable local testing of the functions formed. These paths can be adapted to suit your testing needs. 

The second set of variables represents the directory structure within the EC2 instance where data processing will occur during the pipeline's operation. These paths should not be altered if at all possible, as doing so will require changes to other components of the predict. 

In [30]:
# Local file directories
# These paths can be updated to meet your testing environment needs.
source_path = "C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/"
save_path = "C:/Users/HP/Downloads/Moving Big Data/moving-big-data-predict-main/moving-big-data-predict-main/code/Output/"
index_file_path = 'C:/Users/HP/Downloads/Moving Big Data/moving-big-data-predict-main/moving-big-data-predict-main/data/top_companies.txt'

# # EC2 file directories
# Do not change these paths
# source_path = "/home/ec2-user/s3-drive/Stocks/"
# save_path = "/home/ec2-user/s3-drive/Output/"
# index_file_path = "/home/ec2-user/s3-drive/CompanyNames/top_companies.txt"
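
Rather than commenting and uncommenting the two sets of paths, one option is to select a set at run time. The sketch below is only illustrative: the `PIPELINE_ENV` environment variable, the `PATH_SETS` dictionary, the `select_paths` function, and the relative local paths are assumptions introduced here and are not part of the predict. The EC2 values are copied from the cell above.

```python
import os

# Hypothetical lookup of the two path sets. The EC2 values are the ones
# listed above; the local values are placeholders to adapt.
PATH_SETS = {
    "local": {
        "source_path": "./data/Stocks/",
        "save_path": "./data/Output/",
        "index_file_path": "./data/top_companies.txt",
    },
    "ec2": {
        "source_path": "/home/ec2-user/s3-drive/Stocks/",
        "save_path": "/home/ec2-user/s3-drive/Output/",
        "index_file_path": "/home/ec2-user/s3-drive/CompanyNames/top_companies.txt",
    },
}


def select_paths(environment=None):
    """Return the path set for `environment`, defaulting to PIPELINE_ENV or 'local'."""
    if environment is None:
        environment = os.environ.get("PIPELINE_ENV", "local")
    if environment not in PATH_SETS:
        raise ValueError(f"Unknown environment: {environment!r}")
    return PATH_SETS[environment]


paths = select_paths("ec2")
source_path = paths["source_path"]
save_path = paths["save_path"]
index_file_path = paths["index_file_path"]
print(source_path)
```

With this arrangement the same script could run unchanged locally and on the EC2 instance, provided the environment variable is set on the instance.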

## Helper Functions

Create a function that returns a list of all the companies (represented by their `csv` files) selected for inclusion within the data processing pipeline. 

In [15]:
def extract_companies_from_index(index_file_path):
    """Generate a list of company files that need to be processed. 

    Args:
        index_file_path (str): path to index file

    Returns:
        list: Names of the company `.csv` files. 
    """
    # Use a context manager so the file is closed even if reading fails.
    with open(index_file_path, "r") as company_file:
        contents = company_file.read()
    # Strip the single quotes around each name, then split on commas.
    contents = contents.replace("'", "")
    contents_list = contents.split(",")
    cleaned_contents_list = [item.strip() for item in contents_list]
    return cleaned_contents_list
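
To make the parsing concrete, the following sketch runs a condensed restatement of the helper above against a small temporary index file. The file contents and company file names are invented for illustration. The format (single-quoted names separated by commas) is inferred from the parsing logic, not from the real `top_companies.txt`.

```python
import os
import tempfile


def extract_companies_from_index(index_file_path):
    """Condensed restatement of the helper above, so this example runs on its own."""
    with open(index_file_path, "r") as company_file:
        contents = company_file.read()
    return [item.strip() for item in contents.replace("'", "").split(",")]


# Hypothetical index contents: quoted names, comma-separated, spanning two lines.
sample_index = "'aapl.us.csv', 'msft.us.csv',\n'goog.us.csv'"

with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "top_companies.txt")
    with open(index_path, "w") as index_file:
        index_file.write(sample_index)
    companies = extract_companies_from_index(index_path)

print(companies)  # ['aapl.us.csv', 'msft.us.csv', 'goog.us.csv']
```

Note that a trailing comma in the index file would yield an empty string as the last entry, so it may be worth filtering out empty names before building paths.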

Create a function that attaches the source directory to each company csv file selected for processing.

In [16]:
def get_path_to_company_data(list_of_companies, source_data_path):
    """Creates a list of the paths to the company data
       that will be processed

    Args:
        list_of_companies (list): Extracted `.csv` file names of companies whose data needs to be processed.
        source_data_path (str): Path to where the company `.csv` files are stored. 

    Returns:
        list: Full paths to the company `.csv` files to be processed.
    """
    path_to_company_data = []
    for file_name in list_of_companies:
        path_to_company_data.append(source_data_path + file_name)
    return path_to_company_data
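
Because the helper above concatenates strings, it relies on `source_data_path` ending with a `/`. As a hedged alternative, the sketch below uses `posixpath.join`, which inserts the separator only when it is missing and always uses `/`. The function name `join_company_paths` and the file name `aapl.us.csv` are hypothetical.

```python
import posixpath


def join_company_paths(list_of_companies, source_data_path):
    """Hypothetical variant of the helper above that tolerates a missing trailing slash."""
    # posixpath.join always uses '/', matching the separator used in this notebook's paths.
    return [posixpath.join(source_data_path, file_name) for file_name in list_of_companies]


with_slash = join_company_paths(["aapl.us.csv"], "/home/ec2-user/s3-drive/Stocks/")
without_slash = join_company_paths(["aapl.us.csv"], "/home/ec2-user/s3-drive/Stocks")
print(with_slash == without_slash)  # True
```

Keeping `/` as the separator matters here because the processing function later splits each path on `/` to recover the company name.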

Create a function that saves a pandas dataframe in CSV format.

In [17]:
def save_table(dataframe, output_path, file_name, header):
    """Saves an input pandas dataframe as a CSV file according to input parameters.

    Args:
        dataframe (pandas.DataFrame): Input dataframe.
        output_path (str): Path to which the resulting `.csv` file should be saved. 
        file_name (str): The name of the output `.csv` file, without the extension. 
        header (bool): Whether to include column headings in the output file.
    """
    print(f"Path = {output_path}, file = {file_name}")
    dataframe.to_csv(output_path + file_name + ".csv", index=False, header=header)

## Data Processing Function

You are now expected to write a function that takes as input a list of paths to company `.csv` files (derived from the `top_companies.txt` file) and outputs a single `.csv` file. This file contains the combined data from the input files as rows, along with two additional summary columns that provide extra context for each entry.

### Instructions

- The target output file should be named `historical_stock_data.csv`.
- The output `csv` file must have the following schema:

    |Column Name|DataType|
    |---|---|
    |stock_date|datetime|
    |open_value|float64|
    |high_value|float64|
    |low_value|float64|
    |close_value|float64|
    |volume_traded|int64|
    |daily_percent_change|float64|
    |value_change|float64|
    |company_name|string|


- The output file should not contain any column headers.
- Some input `csv` files may be corrupted. As such, be careful of exceptions when forming the processing function.
- The `daily_percent_change` and `value_change` columns must be calculated from the opening and closing values (`open_value` and `close_value` in the schema, `Open` and `Close` in the source files):
  * `daily_percent_change` = ((`Close` - `Open`)/`Open`) * 100
  * `value_change` = `Close` - `Open`
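
As a quick check of the two formulas above, the sketch below applies them to a pair of invented prices using plain Python floats. The function name `summarise_day` and the prices are hypothetical.

```python
def summarise_day(open_value, close_value):
    """Return (daily_percent_change, value_change) for one trading day."""
    value_change = close_value - open_value
    daily_percent_change = (value_change / open_value) * 100
    return daily_percent_change, value_change


# Hypothetical rising day: opens at 80.0, closes at 100.0.
rise_percent, rise_change = summarise_day(80.0, 100.0)
# Hypothetical falling day: opens at 100.0, closes at 75.0.
fall_percent, fall_change = summarise_day(100.0, 75.0)

print(rise_percent, rise_change)  # 25.0 20.0
print(fall_percent, fall_change)  # -25.0 -25.0
```

An opening value of zero would raise `ZeroDivisionError` in this plain Python version. The equivalent pandas column arithmetic does not raise and instead produces `inf` or `NaN`, which is worth keeping in mind when inspecting the output.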
 

In [18]:
def data_processing(file_paths, output_path):
    """Process and collate company csv file data for use within the data processing component of the formed data pipeline.  

    Args:
        file_paths (list[str]): A list of paths to the company csv files that need to be processed. 
        output_path (str): The path to save the resulting csv file to.  
    """
    frames = []
    for file_path in file_paths:
        # The company name is the file name without its `.csv` extension.
        file_csv = file_path.split('/')[-1]
        company_name = file_csv.replace('.csv', '', 1)
        try:
            file = pd.read_csv(file_path)
            # `OpenInt` is not part of the output schema.
            del file['OpenInt']
            file['daily_percent_change'] = ((file['Close'] - file['Open'])/file['Open'])*100
            file['value_change'] = file['Close'] - file['Open']
            file['company_name'] = company_name
            frames.append(file)
        except Exception:
            # Empty or corrupted files raise here; skip them and carry on.
            print("Could Not Parse. Possible Empty File " + file_path)
    # Concatenate once at the end rather than once per file.
    file_combined = pd.concat(frames, ignore_index=True, sort=False) if frames else pd.DataFrame()
    save_table(file_combined, output_path, 'historical_stock_data', header=False)
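
Since the output file is written without column headers, the schema names have to be supplied again whenever the file is read back for checking. The sketch below shows one way to do this with `pd.read_csv`. It uses an invented two-row sample in place of the real `historical_stock_data.csv`, and it assumes the source files list their columns in the order Date, Open, High, Low, Close, Volume, so that the output column order matches the schema table.

```python
import io

import pandas as pd

# Column names taken from the schema table above.
SCHEMA_COLUMNS = [
    "stock_date", "open_value", "high_value", "low_value", "close_value",
    "volume_traded", "daily_percent_change", "value_change", "company_name",
]

# Hypothetical two-row sample standing in for `historical_stock_data.csv`.
sample_output = io.StringIO(
    "2017-11-09,80.0,101.0,79.5,100.0,1000,25.0,20.0,abc.us\n"
    "2017-11-10,100.0,100.5,74.0,75.0,1200,-25.0,-25.0,abc.us\n"
)

# header=None tells pandas the first row is data; names supplies the schema.
frame = pd.read_csv(
    sample_output,
    header=None,
    names=SCHEMA_COLUMNS,
    parse_dates=["stock_date"],
)
print(frame.dtypes)
```

Comparing the resulting dtypes against the schema table is a cheap sanity check before the file is handed to the next stage of the pipeline.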
    



In [31]:
list_of_companies = extract_companies_from_index(index_file_path)

In [32]:
file_pathway = get_path_to_company_data(list_of_companies, source_path)

In [33]:
dataframe = pd.DataFrame(file_pathway)

In [34]:
save_table(dataframe, save_path, "file_name", False)

Path = C:/Users/HP/Downloads/Moving Big Data/moving-big-data-predict-main/moving-big-data-predict-main/code/Output/, file = file_name


In [35]:
data_processing(file_pathway, save_path)

Could Not Parse. Possible Empty File C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/znwaa.us.csv
Could Not Parse. Possible Empty File C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/sail.us.csv
Could Not Parse. Possible Empty File C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/jt.us.csv
Could Not Parse. Possible Empty File C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/stnl.us.csv
Could Not Parse. Possible Empty File C:/Users/HP/Downloads/Moving Big Data/Stocks/Stocks/bxg.us.csv
Path = C:/Users/HP/Downloads/Moving Big Data/moving-big-data-predict-main/moving-big-data-predict-main/code/Output/, file = historical_stock_data
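
The messages above do not say why each file was skipped. If that distinction is useful while testing, the specific pandas exceptions can be caught separately. The sketch below is a minimal illustration using in-memory text in place of files. The helper name `try_read` and the labels passed to it are hypothetical.

```python
import io

import pandas as pd


def try_read(buffer, label):
    """Hypothetical helper: read CSV text and report why it failed, if it did."""
    try:
        return pd.read_csv(buffer)
    except pd.errors.EmptyDataError:
        # Raised when there is nothing to parse, as with a zero-byte file.
        print(f"{label}: file is empty")
    except pd.errors.ParserError as error:
        print(f"{label}: malformed CSV ({error})")
    return None


empty = try_read(io.StringIO(""), "empty.csv")
valid = try_read(io.StringIO("Open,Close\n80.0,100.0\n"), "valid.csv")
print(empty is None, len(valid))
```

A file that parses but lacks an expected column would instead surface as a `KeyError` when that column is accessed, so a complete handler would likely need to cover that case as well.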


## Main Program Flow

With the data processing function completed, you need to assemble the provided functions together to form a script (standalone `.py` file) that can be run via a bash command called during your data pipeline's execution.  

The following scaffolding is provided to help in this regard: 

In [9]:
if __name__ == "__main__":

    # Get the file names, within the source data directory, of the companies whose data needs to be processed.
    # This information is specified within the `top_companies.txt` file. 
    file_names = extract_companies_from_index(index_file_path)

    # Update the company file names to include path information. 
    path_to_company_data = get_path_to_company_data(file_names, source_path)

    # Process company data and create full data output
    data_processing(path_to_company_data, save_path)

Could Not Parse. Possible Empty File ../data/Stocks/tmusp.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/mcv.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/fllv.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/abr_c.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/gs_i-cl.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/bfin.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/mbot.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/tlk.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/sgb.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/mpaa.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/wex.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/jrvr.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/ayx.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/arkw.us.csv
Could Not Parse. Possible Empty File ../data/Stocks/pso.us.csv
Could Not Parse. Possible Empty File ../d

FileNotFoundError: [Errno 2] No such file or directory: '../data/Output/historical_stock_data.csv'
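
The `FileNotFoundError` above is consistent with the output directory not existing when the file is written, although other causes are possible. One hedged remedy is to create the directory before saving. The helper name `ensure_output_directory` is an assumption introduced here, and the example runs inside a temporary directory so that it does not touch real project paths.

```python
import os
import tempfile


def ensure_output_directory(output_path):
    """Create `output_path` (and any missing parents) if it does not already exist."""
    os.makedirs(output_path, exist_ok=True)
    return output_path


# Demonstrated in a temporary directory so nothing is written to real project paths.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "data", "Output")
    ensure_output_directory(target)
    # Calling it again is harmless because of exist_ok=True.
    ensure_output_directory(target)
    created = os.path.isdir(target)

print(created)  # True
```

In the script, such a call could be made on `save_path` just before `data_processing` is invoked.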