# Working with parquet files

## Objective

+ In this assignment, we will use the data downloaded with the module `data_manager` to create features.

(11 pts total)

## Prerequisites

+ This notebook assumes that price data is available to you in the environment variable `PRICE_DATA`. If you have not done so, then execute the notebook `01_materials/labs/2_data_engineering.ipynb` to create this data set.


+ Load the environment variables using dotenv. (1 pt)

In [36]:
! pip install yfinance pandas dask pyarrow python-dotenv



In [17]:
SRC_DIR="/Users/williamhopkins/production/05_src"
PRICE_DATA="/Users/williamhopkins/production/05_src/data/prices"
FEATURES_DATA="/Users/williamhopkins/production/05_src/data/features"
TICKERS="/Users/williamhopkins/production/05_src/data/sp500_2024.csv"



In [18]:
from dotenv import load_dotenv
import os


load_dotenv()


price_data = os.getenv("PRICE_DATA")

if price_data is None:
    raise ValueError("ERROR: The environment variable 'PRICE_DATA' is not set. Please check your .env file.")

print("PRICE_DATA is set to:", price_data)

PRICE_DATA is set to: ../../05_src/data/prices/


In [19]:
import dask.dataframe as dd

+ Load the environment variable `PRICE_DATA`.
+ Use [glob](https://docs.python.org/3/library/glob.html) to find the path of all parquet files in the directory `PRICE_DATA`.

(1pt)

In [24]:
import os
import sys
sys.path.append(os.getenv('SRC_DIR'))

from data_manager import DataManager
dm = DataManager()
dm.download_all()  # This downloads the stock data
dm.featurize()     # This processes and saves as parquet

2025-02-14 18:59:03,650, data_manager.py, 42, INFO, Getting price data for all tickers.
2025-02-14 18:59:03,651, data_manager.py, 51, INFO, Getting tickers from ../../05_src/data/tickers/sp500_wiki.csv
2025-02-14 18:59:03,665, data_manager.py, 57, INFO, Processing all tickers
2025-02-14 18:59:03,666, data_manager.py, 70, INFO, Processing ticker ['MMM', 'AOS', 'ABT', 'ABBV', 'ACN', 'ADBE', 'AMD', 'AES', 'AFL', 'A', 'APD', 'ABNB', 'AKAM', 'ALB', 'ARE', 'ALGN', 'ALLE', 'LNT', 'ALL', 'GOOGL', 'GOOG', 'MO', 'AMZN', 'AMCR', 'AEE', 'AAL', 'AEP', 'AXP', 'AIG', 'AMT', 'AWK', 'AMP', 'AME', 'AMGN', 'APH', 'ADI', 'ANSS', 'AON', 'APA', 'AAPL', 'AMAT', 'APTV', 'ACGL', 'ADM', 'ANET', 'AJG', 'AIZ', 'T', 'ATO', 'ADSK', 'ADP', 'AZO', 'AVB', 'AVY', 'AXON', 'BKR', 'BALL', 'BAC', 'BK', 'BBWI', 'BAX', 'BDX', 'BRK.B', 'BBY', 'BIO', 'TECH', 'BIIB', 'BLK', 'BX', 'BA', 'BKNG', 'BWA', 'BXP', 'BSX', 'BMY', 'AVGO', 'BR', 'BRO', 'BF.B', 'BLDR', 'BG', 'CDNS', 'CZR', 'CPT', 'CPB', 'COF', 'CAH', 'KMX', 'CCL', 'CARR', 

In [29]:
import os
import sys
print("Current working directory:", os.getcwd())
print("Contents of 05_src:", os.listdir("../05_src"))

Current working directory: /Users/williamhopkins/production/02_activities/assignments


FileNotFoundError: [Errno 2] No such file or directory: '../05_src'

In [30]:
import os
from glob import glob
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get the PRICE_DATA directory from environment variables
price_data_dir = os.getenv("PRICE_DATA")
if price_data_dir is None:
    raise ValueError("ERROR: The environment variable 'PRICE_DATA' is not set. Please check your .env file.")

# Print the directory to confirm it is loaded
print("PRICE_DATA directory:", price_data_dir)

# Use glob to find all parquet files in the directory
# The ** means search recursively through subdirectories
# *.parquet matches any file ending in .parquet
parquet_files = glob(os.path.join(price_data_dir, "**/*.parquet"), recursive=True)

# Print the list of parquet files found
print("Parquet files found:", parquet_files)

PRICE_DATA directory: ../../05_src/data/prices/
Parquet files found: ['../../05_src/data/prices/CTAS/CTAS_2002/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2005/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2004/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2003/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2010/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2017/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2021/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2019/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2018/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2020/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2016/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2011/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2008/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2006/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2001/part.0.parquet', '../../05_src/data/prices/CTAS/CTAS_2000/part.0.parquet', '.

For each ticker and using Dask, do the following:

+ Add lags for variables Close and Adj_Close.
+ Add returns based on Close:
    
    - `returns`: (Close / Close_lag_1) - 1

+ Add the following range: 

    - `hi_lo_range`: this is the day's High minus Low.

+ Assign the result to `dd_feat`.

(4 pt)

In [32]:
import dask.dataframe as dd

# Get parquet files path and read data
price_data_dir = os.getenv("PRICE_DATA")
parquet_files = glob(os.path.join(price_data_dir, "**/*.parquet"), recursive=True)
dd_px = dd.read_parquet(parquet_files)

# Define the transformations
def add_features(group):
    # Add lags
    group = group.assign(
        Close_lag_1=group['Close'].shift(1),
        Adj_Close_lag_1=group['Adj Close'].shift(1)
    )
    # Add returns
    group = group.assign(
        returns=(group['Close'] / group['Close_lag_1'] - 1)
    )
    # Add high-low range
    group = group.assign(
        hi_lo_range=group['High'] - group['Low']
    )
    return group

# Apply transformations
dd_feat = dd_px.groupby('Ticker', group_keys=False).apply(add_features)

  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  dd_feat = dd_px.groupby('Ticker', group_keys=False).apply(add_features)


+ Convert the Dask data frame to a pandas data frame. 
+ Add a new feature containing the moving average of `returns` using a window of 10 days. There are several ways to solve this task, a simple one uses `.rolling(10).mean()`.

(3 pt)

In [35]:
# Convert Dask DataFrame to pandas DataFrame
print("\nConverting Dask DataFrame to pandas DataFrame...")
df_feat = dd_feat.compute()
print("Conversion complete.")

# Print info about the DataFrame before adding moving average
print("\nInitial DataFrame information:")
print("Shape:", df_feat.shape)
print("Columns:", df_feat.columns.tolist())
print("Index name:", df_feat.index.name)

# Add 10-day moving average of returns
print("\nCalculating 10-day moving average of returns...")
df_feat = df_feat.assign(
    returns_ma_10=df_feat.groupby(level=0)['returns'].transform(lambda x: x.rolling(10).mean())
)
print("Calculation complete.")

# Verify the new column was added and check some values
print("\nVerification of calculations:")
print("Updated columns:", df_feat.columns.tolist())

# Show sample of data for one ticker to verify moving average calculation
sample_ticker = df_feat.index[0]
print(f"\nSample data for ticker {sample_ticker}:")
sample_data = df_feat.loc[sample_ticker][['Date', 'returns', 'returns_ma_10']].head(15)
print(sample_data)

# Check for any NaN values in the moving average (first 9 days should be NaN)
print("\nNumber of NaN values in returns_ma_10:", df_feat['returns_ma_10'].isna().sum())



Converting Dask DataFrame to pandas DataFrame...
Conversion complete.

Initial DataFrame information:
Shape: (3177954, 12)
Columns: ['Date', 'Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume', 'Year', 'Close_lag_1', 'Adj_Close_lag_1', 'returns', 'hi_lo_range']
Index name: Ticker

Calculating 10-day moving average of returns...
Calculation complete.

Verification of calculations:
Updated columns: ['Date', 'Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume', 'Year', 'Close_lag_1', 'Adj_Close_lag_1', 'returns', 'hi_lo_range', 'returns_ma_10']

Sample data for ticker DOV:
Price        Date   returns  returns_ma_10
Ticker                                    
DOV    2005-01-03       NaN            NaN
DOV    2005-01-04 -0.004614            NaN
DOV    2005-01-05 -0.019517            NaN
DOV    2005-01-06 -0.009455            NaN
DOV    2005-01-07 -0.014569            NaN
DOV    2005-01-10 -0.005863            NaN
DOV    2005-01-11 -0.007692            NaN
DOV    2005-01-12  0.021705     

Please comment:

+ Was it necessary to convert to pandas to calculate the moving average return?
+ Would it have been better to do it in Dask? Why?

ANSWER: It wasen't necessary. Dask DataFrames supports rolling operations. It would not have been better to do it in Dask because for a 
dataset of this size, pandas is more straightforward. 
(1 pt)

## Criteria

The [rubric](./assignment_1_rubric_clean.xlsx) contains the criteria for grading.

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-1`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_1.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/production/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [X] Created a branch with the correct naming convention.
- [X] Ensured that the repository is public.
- [X] Reviewed the PR description guidelines and adhered to them.
- [X] Verify that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.