# Working with parquet files

## Objective

+ In this assignment, we will use the data downloaded with the module `data_manager` to create features.

(11 pts total)

## Prerequisites

+ This notebook assumes that the path to the price data is stored in the environment variable `PRICE_DATA`. If it is not, run the notebook `production_2_data_engineering.ipynb` first to create this data set.


+ Load the environment variables using dotenv. (1 pt)

In [7]:
# Write your code below.

%load_ext dotenv
%dotenv ../src/.env


The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv


In [30]:
import dask
dask.config.set({'dataframe.query-planning': True})
import dask.dataframe as dd

+ Load the environment variable `PRICE_DATA`.
+ Use [glob](https://docs.python.org/3/library/glob.html) to find the path of all parquet files in the directory `PRICE_DATA`.

(1 pt)

In [31]:
    
import os
from glob import glob

# Write your code below.
# Directory that holds the price data, read from the environment.
PRICE_DATA = os.getenv("PRICE_DATA")

# The pattern "*/*.parquet" matches Parquet files one directory level below PRICE_DATA.
parquet_files = glob(os.path.join(PRICE_DATA, "*/*.parquet"))

# Check if parquet_files is empty
if not parquet_files:
    print("No Parquet files found in the specified directory.")
else:
    # Read Parquet files into a Dask DataFrame indexed by ticker
    dd_px = dd.read_parquet(parquet_files).set_index("ticker")
    print(dd_px.head())


                            Date       Open       High        Low      Close  \
ticker                                                                         
A      2013-12-02 00:00:00-05:00  35.052676  35.183785  34.816674  34.882229   
A      2013-12-03 00:00:00-05:00  34.692137  34.856029  34.456136  34.698692   
A      2013-12-04 00:00:00-05:00  34.639663  35.288666  34.560998  35.124775   
A      2013-12-05 00:00:00-05:00  34.974005  35.269005  34.823227  35.072338   
A      2013-12-06 00:00:00-05:00  35.242792  36.009791  35.242792  35.944237   

         Volume  Dividends  Stock Splits  year  
ticker                                          
A       2039962        0.0           0.0  2013  
A       3462706        0.0           0.0  2013  
A       3377288        0.0           0.0  2013  
A       2530939        0.0           0.0  2013  
A       4268513        0.0           0.0  2013  


For each ticker and using Dask, do the following:

+ Add lags for variables Close and Adj_Close.
+ Add returns based on Adjusted Close:
    
    - `returns`: (Adj Close / Adj Close_lag) - 1

+ Add the following range: 

    - `hi_lo_range`: this is the day's High minus Low.

+ Assign the result to `dd_feat`.

(4 pt)

In [33]:
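# Write your code below.
# import numpy as np

# The loaded data has no adjusted-close column (see the columns printed above),
# so Close stands in for Adj Close in the lag and return calculations.
dd_rets = (dd_px.groupby('ticker', group_keys=False).apply(
    # Lag Close by one row within each ticker.
    lambda x: x.assign(Close_lag_1 = x['Close'].shift(1))
).assign(
    # Simple return: Close / previous Close - 1.
    returns = lambda x: x['Close']/x['Close_lag_1'] - 1
).assign(
    # 1 if the return is positive, 0 otherwise.
    positive_return = lambda x: (x['returns'] > 0)*1
).assign(
    # Day's High minus Low.
    hi_lo_range = lambda x: x['High'] - x['Low']
))

# The task asks for the result under the name dd_feat.
dd_feat = dd_rets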
# def process_data(file):
#     # Read parquet file into Dask DataFrame
#     df = dd.read_parquet(file)
    
#     # Add lags for Close and Adj_Close
#     df['Close_lag'] = df['Close'].shift(1)
#     df['Adj_Close_lag'] = df['Adj_Close'].shift(1)
    
#     # Add returns
#     df['returns'] = (df['Adj_Close'] / df['Adj_Close_lag']) - 1
    
#     # Add hi_lo_range
#     df['hi_lo_range'] = df['High'] - df['Low']
    
#     return df



  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  dd_rets = (dd_px.groupby('ticker', group_keys=False).apply(


+ Convert the Dask data frame to a pandas data frame. 
+ Add a rolling average return calculation with a window of 10 days.
+ *Tip*: Consider using `.rolling(10).mean()`.

(3 pt)

In [34]:
# Write your code below.
import pandas as pd

# Convert the Dask DataFrame to a Pandas DataFrame
dd_rets_pd = dd_rets.compute()

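# Add a rolling average return calculation with a window of 10 days,
# computed within each ticker so that no window spans two tickers
dd_rets_pd['rolling_avg_return'] = dd_rets_pd.groupby('ticker')['returns'].transform(
    lambda s: s.rolling(window=10).mean()
)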

# Print the resulting DataFrame
print(dd_rets_pd.head())



                            Date       Open       High        Low      Close  \
ticker                                                                         
A      2013-12-02 00:00:00-05:00  35.052676  35.183785  34.816674  34.882229   
A      2013-12-03 00:00:00-05:00  34.692137  34.856029  34.456136  34.698692   
A      2013-12-04 00:00:00-05:00  34.639663  35.288666  34.560998  35.124775   
A      2013-12-05 00:00:00-05:00  34.974005  35.269005  34.823227  35.072338   
A      2013-12-06 00:00:00-05:00  35.242792  36.009791  35.242792  35.944237   

         Volume  Dividends  Stock Splits  year  Close_lag_1   returns  \
ticker                                                                  
A       2039962        0.0           0.0  2013          NaN       NaN   
A       3462706        0.0           0.0  2013    34.882229 -0.005262   
A       3377288        0.0           0.0  2013    34.698692  0.012280   
A       2530939        0.0           0.0  2013    35.124775 -0.001493   
A
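
Because the frame stacks all tickers on top of each other, a plain `.rolling(...)` lets the first windows of one ticker include the last rows of the previous ticker. The sketch below illustrates the difference on toy data; the return values are made up and a window of 2 is used only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Toy returns for two hypothetical tickers, stacked like the pandas frame above.
df = pd.DataFrame(
    {"returns": [0.10, 0.20, 0.30, 1.00, 2.00, 3.00]},
    index=pd.Index(["A", "A", "A", "B", "B", "B"], name="ticker"),
)

# Ungrouped: the first window for B averages A's last return with B's first.
df["rolling_all"] = df["returns"].rolling(2).mean()

# Grouped: each ticker gets its own warm-up period of NaN values.
df["rolling_by_ticker"] = df.groupby("ticker")["returns"].transform(
    lambda s: s.rolling(2).mean()
)

print(df)
```

In the ungrouped column the first row of ticker B holds the average of 0.30 and 1.00, while the grouped column correctly shows NaN there.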

Please comment:

+ Was it necessary to convert to pandas to calculate the moving average return?
+ Would it have been better to do it in Dask? Why?

(1 pt)


No, converting to pandas was not strictly necessary. Dask DataFrames provide their own `rolling` method, and a per-ticker rolling mean can also be computed inside the same `groupby('ticker').apply(...)` step that builds the lags, so the work stays lazy and runs in parallel across partitions.

Doing it in Dask is generally the better choice when the data set is large: calling `.compute()` first materializes every row in the memory of a single process, whereas Dask can spread the computation over multiple cores or a cluster and can handle data that does not fit into memory. For data that fits comfortably in memory, pandas is simpler and the difference is likely to be small.

