# Working with parquet files

## Objective

+ In this assignment, we will use the data downloaded with the module `data_manager` to create features.

(11 pts total)

## Prerequisites

+ This notebook assumes that price data is available to you in the environment variable `PRICE_DATA`. If you have not done so, then execute the notebook `01_materials/labs/2_data_engineering.ipynb` to create this data set.


+ Load the environment variables using dotenv. (1 pt)

In [2]:
%load_ext dotenv
%dotenv 

In [3]:
import dask.dataframe as dd



+ Load the environment variable `PRICE_DATA`.
+ Use [glob](https://docs.python.org/3/library/glob.html) to find the path of all parquet files in the directory `PRICE_DATA`.

(1 pt)

In [4]:
import os
from glob import glob

# Load the PRICE_DATA environment variable
price_data_dr = os.getenv('PRICE_DATA')
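
Note that `os.getenv` returns `None` when the variable is not set, and a later `os.path.join(None, ...)` then fails with a `TypeError`. A minimal sketch of an explicit check is shown below; the helper name `require_env` is hypothetical and not part of the course material.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to your own setup
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "check that the .env file was loaded."
        )
    return value
```

With such a helper, the assignment above could be written as `price_data_dr = require_env('PRICE_DATA')`, so a missing `.env` entry is reported immediately.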


In [5]:
# Use glob to find all parquet files in the directory PRICE_DATA
parquet_files = glob(os.path.join(price_data_dr, "**/*.parquet"), recursive=True)
parquet_files

['../../05_src/data/prices\\A\\A_2000\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2001\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2002\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2003\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2004\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2005\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2006\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2007\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2008\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2009\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2010\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2011\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2012\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2013\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2014\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2015\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2016\\part.0.parquet',
 '../../05_src/data/prices\\A\\A_2017\\part.0.pa
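
With `recursive=True`, the `**` segment of the pattern matches any number of nested directories, which is why files stored under per-ticker and per-year folders are found. The sketch below illustrates this on a throwaway directory tree; the folder and file names are made up for the illustration, and `sorted` is added only to make the order deterministic.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as root:
    # Build a small tree that mimics <ticker>/<ticker>_<year>/part.0.parquet
    for ticker, year in [("A", 2000), ("A", 2001), ("IT", 2000)]:
        folder = os.path.join(root, ticker, f"{ticker}_{year}")
        os.makedirs(folder)
        open(os.path.join(folder, "part.0.parquet"), "w").close()

    # '**' matches zero or more directory levels when recursive=True
    found = sorted(glob(os.path.join(root, "**", "*.parquet"), recursive=True))
    relative = [os.path.relpath(path, root) for path in found]

print(relative)
```

Without `recursive=True`, `**` behaves like a single `*`, so the nesting depth of the pattern would have to match the directory layout exactly.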

For each ticker and using Dask, do the following:

+ Add lags for variables Close and Adj_Close.
+ Add returns based on Close:
    
    - `returns`: (Close / Close_lag_1) - 1

+ Add the following range: 

    - `hi_lo_range`: this is the day's High minus Low.

+ Assign the result to `dd_feat`.

(4 pts)

In [6]:
import dask.dataframe as dd
import pandas as pd

In [None]:
# Load your Parquet files into a Dask DataFrame
df = dd.read_parquet(parquet_files)

# Repartition into fewer, larger partitions; with the original partitioning the code took too long to run on my computer
df = df.repartition(npartitions=10)

# Sort by date before computing the lags
df = df.sort_values(by=['Date'])

# Add one-day lags for 'Close' and 'Adj Close'
# Note: shift(1) takes the previous row of the date-sorted frame, which can belong to another ticker
df['Close_lag_1'] = df['Close'].shift(1)
df['Adj_Close_lag_1'] = df['Adj Close'].shift(1)

# Calculate returns for 'Close' (returns = (Close / Close_lag_1) - 1)
df['returns'] = (df['Close'] / df['Close_lag_1']) - 1

# Calculate hi_lo_range (High - Low)
df['hi_lo_range'] = df['High'] - df['Low']

# Assign the result to dd_feat
dd_feat = df

# Show a sample of the result (use .compute() if you want to load the data into memory)
print(dd_feat.head())

Price        Date  Adj Close      Close       High        Low       Open  \
Ticker                                                                     
A      2000-01-03  43.382843  51.502148  56.464592  48.193848  56.330471   
IT     2000-01-03  16.625000  16.625000  16.625000  15.062500  15.500000   
ICE    2000-01-03        NaN        NaN        NaN        NaN        NaN   
IEX    2000-01-03   9.214503  13.416667  13.555556  13.333333  13.555556   
J      2000-01-03   6.203315   6.670322   6.800856   6.670322   6.800856   

Price      Volume  Year  Close_lag_1  Adj_Close_lag_1  returns  hi_lo_range  
Ticker                                                                       
A       4674353.0  2000          NaN              NaN      NaN     8.270744  
IT      1006700.0  2000          NaN              NaN      NaN     1.562500  
ICE           NaN  2000          NaN              NaN      NaN          NaN  
IEX      148725.0  2000          NaN              NaN      NaN     0.222223  
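
Because the task asks for features "for each ticker", the lag should come from the same ticker's previous trading day. The sketch below shows one way to get that with a pandas `groupby`. It uses a small made-up frame with `Ticker` as a regular column rather than the index, so the numbers are illustrative only. In Dask, a comparable per-group function could be passed to `groupby(...).apply(...)`, which is not shown here.

```python
import pandas as pd

# Made-up prices for two tickers (illustrative values only)
px = pd.DataFrame({
    "Ticker": ["A", "B", "A", "B", "A", "B"],
    "Date": pd.to_datetime(["2000-01-03", "2000-01-03", "2000-01-04",
                            "2000-01-04", "2000-01-05", "2000-01-05"]),
    "Close": [10.0, 100.0, 11.0, 90.0, 12.1, 99.0],
    "Adj Close": [9.0, 95.0, 9.9, 85.5, 10.89, 94.05],
    "High": [10.5, 101.0, 11.2, 95.0, 12.5, 99.5],
    "Low": [9.5, 98.0, 10.8, 89.0, 11.9, 90.0],
})

# Sort by ticker and date so shift(1) inside each group is the previous day
feat = px.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# groupby(...).shift(1) lags within each ticker; the first row per ticker is NaN
feat["Close_lag_1"] = feat.groupby("Ticker")["Close"].shift(1)
feat["Adj_Close_lag_1"] = feat.groupby("Ticker")["Adj Close"].shift(1)

# Same feature definitions as in the assignment text
feat["returns"] = feat["Close"] / feat["Close_lag_1"] - 1
feat["hi_lo_range"] = feat["High"] - feat["Low"]

print(feat[["Ticker", "Date", "Close", "Close_lag_1", "returns", "hi_lo_range"]])
```

In this layout the first row of every ticker has a missing lag and return, and no return mixes prices from two different tickers.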

+ Convert the Dask data frame to a pandas data frame. 
+ Add a new feature containing the moving average of `returns` using a window of 10 days. There are several ways to solve this task; a simple one uses `.rolling(10).mean()`.

(3 pts)

In [9]:
# Convert Dask data frame to Pandas data frame
dd_feat_pandas = dd_feat.compute()

# Calculate the 10-day moving average of the 'returns' column
dd_feat_pandas['returns_ma_10'] = dd_feat_pandas['returns'].rolling(window=10).mean()

print(dd_feat_pandas[['Date', 'returns', 'returns_ma_10']].head())


Price        Date    returns  returns_ma_10
Ticker                                     
TYL    2000-01-03        NaN            NaN
BRO    2000-01-03  -0.597074            NaN
TER    2000-01-03  25.112206            NaN
BRK.B  2000-01-03        NaN            NaN
TFC    2000-01-03        NaN            NaN
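
Applied to the whole column, `.rolling(10).mean()` averages over the previous ten rows regardless of which ticker they belong to. If the moving average is meant per ticker, one option is to group first. The sketch below uses made-up returns and a window of 3 to keep the example short (the assignment uses 10); it assumes the rows are already in date order within each ticker.

```python
import pandas as pd

# Made-up returns for two tickers, already in date order within each ticker
rets = pd.DataFrame({
    "Ticker": ["A"] * 4 + ["B"] * 4,
    "returns": [0.01, 0.02, 0.03, 0.04, 0.10, 0.20, 0.30, 0.40],
})

window = 3  # the assignment uses 10

# transform keeps the original row order and index,
# while the rolling mean is computed separately for each ticker
rets["returns_ma"] = (
    rets.groupby("Ticker")["returns"]
    .transform(lambda s: s.rolling(window).mean())
)

print(rets)
```

The first `window - 1` rows of each ticker are `NaN`, because a full window is not yet available for them.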


Please comment:

+ Was it necessary to convert to pandas to calculate the moving average return?
+ Would it have been better to do it in Dask? Why?

(1 pt)

No, converting to pandas was not necessary to calculate the moving average return. Dask data frames also support rolling operations, so the feature could have been computed before calling `.compute()`. For a large data set like the one above, staying in Dask is preferable: Dask evaluates lazily and processes the data one partition at a time, so the moving average can be calculated without loading the entire data set into memory. Pandas, in contrast, holds the whole DataFrame in memory, which can lead to memory issues at this scale. The benefit of Dask is therefore mainly better memory management and parallelism on large data; for data that fits comfortably in memory, pandas is usually simpler to work with.
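
To illustrate why a rolling mean can be computed chunk by chunk: each chunk only needs the last `window - 1` rows of the previous chunk as context. The pandas-only sketch below mimics that idea by hand on random data. It illustrates the principle and is not Dask's actual implementation; the chunk size is arbitrary. In Dask itself, the comparable call would be `dd_feat['returns'].rolling(10).mean()`, which stays lazy until `.compute()` is called.

```python
import numpy as np
import pandas as pd

# Random numbers standing in for a returns column
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 0.01, size=1_000))

window = 10
chunk_size = 250  # arbitrary; stands in for one partition

pieces = []
for start in range(0, len(returns), chunk_size):
    # Include the previous window - 1 rows so that the first rows
    # of the chunk have a full window available
    lo = max(0, start - (window - 1))
    block = returns.iloc[lo:start + chunk_size]
    ma = block.rolling(window).mean()
    # Keep only the rows that belong to this chunk (label-based, inclusive)
    pieces.append(ma.loc[start:start + chunk_size - 1])

chunked = pd.concat(pieces)
full = returns.rolling(window).mean()

# Compare the chunked result with the one computed on the full series
print(np.allclose(chunked, full, equal_nan=True))
```

Only one chunk plus a few overlap rows has to be in memory at a time, which is the memory advantage described in the answer above.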

## Criteria

The [rubric](./assignment_1_rubric_clean.xlsx) contains the criteria for grading.

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-1`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.