# Part 2: Data Processing

This notebook handles:
- Loading raw data from the previous notebook
- Handling missing values
- Calculating daily returns
- Saving processed data


## 2.1 Setup


In [None]:
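import warnings
warnings.filterwarnings('ignore')  # note: this also hides pandas deprecation warnings

import pandas as pd
import numpy as np
from pathlib import Path

# Show every column and format floats to four decimal places
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)

print("✓ Libraries imported successfully!")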


✓ Libraries imported successfully!


## 2.2 Load Data from Previous Notebook


In [None]:
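# Restore variables stored by the previous notebook (IPython %store magic)
%store -r prices_df
%store -r tech_stocks
%store -r finance_stocks
%store -r all_tickers

# Fall back to the raw CSV if prices_df was not restored
if 'prices_df' not in locals():
    data_path = Path.cwd().parent / 'data' / 'raw' / 'stock_prices.csv'
    prices_df = pd.read_csv(data_path, index_col=0, parse_dates=True)
    print("✓ Data loaded from CSV")
else:
    print("✓ Data loaded from previous notebook")

print(f"✓ Shape: {prices_df.shape}")
print(f"✓ Columns: {list(prices_df.columns)}")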

✓ Data loaded from previous notebook
✓ Shape: (1483, 6)
✓ Columns: ['AAPL', 'MSFT', 'GOOGL', 'JPM', 'BAC', 'GS']


## 2.3 Check for Missing Values


In [None]:
print("Missing values per stock:")
missing_counts = prices_df.isnull().sum()
print(missing_counts)
print(f"\nTotal missing values: {missing_counts.sum()}")

Missing values per stock:
AAPL     0
MSFT     0
GOOGL    0
JPM      0
BAC      0
GS       0
dtype: int64

Total missing values: 0
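The prices here have no gaps, so the check above prints only zeros. To show what the same check reports when gaps exist, here is a minimal sketch on a hypothetical toy frame. The tickers `AAA` and `BBB` and their values are made up for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices with deliberate gaps (not project data)
toy = pd.DataFrame(
    {"AAA": [1.0, np.nan, 3.0, 4.0], "BBB": [np.nan, np.nan, 3.0, 4.0]},
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
)

# Count and share of missing values per column
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Dates on which at least one column is missing
gap_mask = toy.isnull().any(axis=1).to_numpy()
rows_with_gaps = toy.index[gap_mask]

print(missing_counts)
print(missing_share)
print(rows_with_gaps)
```

Listing the affected dates helps show whether gaps cluster at the start of the series or are scattered through it, which matters for the choice of fill strategy.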


## 2.4 Handle Missing Values


In [None]:
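# fillna(method=...) is deprecated in pandas 2.x; forward-fill, then back-fill leading gaps
prices_clean = prices_df.ffill().bfill()
print(f"✓ Missing values after cleaning: {prices_clean.isnull().sum().sum()}")
print(f"✓ Clean data shape: {prices_clean.shape}")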

✓ Missing values after cleaning: 0
✓ Clean data shape: (1483, 6)
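Because the real data has no gaps, the fill step is a no-op here. The sketch below uses a hypothetical toy series (values invented for illustration) to show why the fill runs in two passes: forward-fill cannot reach a gap at the very start, so back-fill handles it afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with gaps at the start, in the middle, and at the end
toy = pd.Series(
    [np.nan, 100.0, np.nan, np.nan, 103.0, np.nan],
    index=pd.date_range("2020-01-01", periods=6, freq="D"),
)

forward = toy.ffill()     # carries the last known price forward
filled = forward.bfill()  # fills only the leading gap that ffill could not reach

print(forward.tolist())
print(filled.tolist())
```

Running forward-fill first means that interior gaps are filled only with past prices. Back-fill then copies a later price into the leading gap, so it is best kept to the start of the series.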


## 2.5 Calculate Daily Returns


In [None]:
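# Simple daily return: r_t = P_t / P_(t-1) - 1; the first row has no prior price and is dropped
returns_df = prices_clean.pct_change().dropna()

print("✓ Returns calculated successfully!")
print(f"✓ Returns shape: {returns_df.shape}")
print(f"✓ Date range: {returns_df.index.min().date()} to {returns_df.index.max().date()}")

print("\nFirst 5 rows of returns:")
display(returns_df.head())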


✓ Returns calculated successfully!
✓ Returns shape: (1482, 6)
✓ Date range: 2020-01-03 to 2025-11-24

First 5 rows of returns:


| Date | AAPL | MSFT | GOOGL | JPM | BAC | GS |
|---|---|---|---|---|---|---|
| 2020-01-03 | -0.0097 | -0.0125 | -0.0052 | -0.0132 | -0.0208 | -0.0117 |
| 2020-01-06 | 0.008 | 0.0026 | 0.0267 | -0.0008 | -0.0014 | 0.0102 |
| 2020-01-07 | -0.0047 | -0.0091 | -0.0019 | -0.017 | -0.0066 | 0.0066 |
| 2020-01-08 | 0.0161 | 0.0159 | 0.0071 | 0.0078 | 0.0101 | 0.0096 |
| 2020-01-09 | 0.0212 | 0.0125 | 0.0105 | 0.0037 | 0.0017 | 0.0204 |
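To make the return calculation concrete, here is a small sketch on a hypothetical three-day price series (the `TOY` name and prices are invented). It checks that `pct_change()` matches the formula r_t = P_t / P_(t-1) - 1, and that the first row is `NaN`, which is why the returns table has one row fewer than the prices (1482 versus 1483).

```python
import pandas as pd

# Hypothetical toy prices: up 2%, then down 2%
toy_prices = pd.Series([100.0, 102.0, 99.96], name="TOY")

raw = toy_prices.pct_change()               # first entry is NaN (no prior price)
simple = raw.dropna()
manual = (toy_prices / toy_prices.shift(1) - 1).dropna()

# Compounding the daily returns recovers the total return over the period
compounded = (1 + simple).prod() - 1
total = toy_prices.iloc[-1] / toy_prices.iloc[0] - 1

print(simple.tolist())
print(compounded, total)
```

Note that a 2% gain followed by a 2% loss does not return to the starting price, because simple returns compound multiplicatively, not additively.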


## 2.6 Save Processed Data


In [None]:
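# Create the processed-data folder if it does not exist yet
processed_dir = Path.cwd().parent / 'data' / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)

returns_path = processed_dir / 'daily_returns.csv'
returns_df.to_csv(returns_path)
print(f"✓ Returns saved to {returns_path}")

clean_prices_path = processed_dir / 'clean_prices.csv'
prices_clean.to_csv(clean_prices_path)
print(f"✓ Clean prices saved to {clean_prices_path}")

# Make both DataFrames available to the next notebook
%store prices_clean
%store returns_df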


✓ Returns saved to /Users/mkgp3/WebstormProjects/Market Pulse Python Project/data/processed/daily_returns.csv
✓ Clean prices saved to /Users/mkgp3/WebstormProjects/Market Pulse Python Project/data/processed/clean_prices.csv
Stored 'prices_clean' (DataFrame)
Stored 'returns_df' (DataFrame)
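A later notebook that cannot use `%store -r` will need to read these CSV files back. The sketch below shows one way the round trip might look, using a hypothetical toy frame and a temporary directory in place of the project's `data/processed` folder. Passing `index_col=0` and `parse_dates=True` restores the date index instead of leaving it as a plain string column.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy returns (tickers and values are illustrative only)
toy_returns = pd.DataFrame(
    {"AAA": [0.01, -0.02], "BBB": [0.0, 0.03]},
    index=pd.to_datetime(["2020-01-03", "2020-01-06"]),
)
toy_returns.index.name = "Date"

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "daily_returns.csv"
    toy_returns.to_csv(path)
    reloaded = pd.read_csv(path, index_col=0, parse_dates=True)

print(reloaded)
print(type(reloaded.index).__name__)
```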


## 2.7 Quick Summary Statistics


In [None]:
print("Summary statistics of daily returns:")
display(returns_df.describe())

print("\nReturn statistics by sector:")
print(f"\nTechnology stocks average return: {returns_df[tech_stocks].mean().mean():.6f}")
print(f"Financial stocks average return: {returns_df[finance_stocks].mean().mean():.6f}")


Summary statistics of daily returns:


|  | AAPL | MSFT | GOOGL | JPM | BAC | GS |
|---|---|---|---|---|---|---|
| count | 1482.0 | 1482.0 | 1482.0 | 1482.0 | 1482.0 | 1482.0 |
| mean | 0.0011 | 0.0009 | 0.0013 | 0.0008 | 0.0006 | 0.0011 |
| std | 0.0202 | 0.0187 | 0.0206 | 0.0198 | 0.0218 | 0.0208 |
| min | -0.1286 | -0.1474 | -0.1163 | -0.1496 | -0.154 | -0.1271 |
| 25% | -0.0083 | -0.0079 | -0.0094 | -0.008 | -0.0098 | -0.0093 |
| 50% | 0.0011 | 0.001 | 0.0018 | 0.001 | 0.0004 | 0.0007 |
| 75% | 0.0116 | 0.0105 | 0.0116 | 0.0099 | 0.0108 | 0.0115 |
| max | 0.1533 | 0.1422 | 0.1022 | 0.1801 | 0.178 | 0.1758 |



Return statistics by sector:

Technology stocks average return: 0.001100
Financial stocks average return: 0.000843
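The sector figures come from `.mean().mean()`: the first call averages each stock over time, and the second averages those per-stock means across the sector. The sketch below, on a hypothetical toy frame (the tickers `T1`, `T2`, `F1` and the `sectors` mapping are invented), suggests one way to read that number. When no values are missing, it equals the mean daily return of an equal-weighted basket of the sector's stocks.

```python
import pandas as pd

# Hypothetical toy returns for two "tech" names and one "finance" name
toy = pd.DataFrame(
    {
        "T1": [0.01, -0.02, 0.03],
        "T2": [0.00, 0.01, 0.02],
        "F1": [0.02, 0.00, -0.01],
    }
)
sectors = {"tech": ["T1", "T2"], "finance": ["F1"]}

# Mean of per-stock means, as in the cell above
sector_avg = {name: toy[cols].mean().mean() for name, cols in sectors.items()}

# Mean daily return of an equal-weighted basket of the same stocks
equal_weight = {name: toy[cols].mean(axis=1).mean() for name, cols in sectors.items()}

for name in sectors:
    print(f"{name}: {sector_avg[name]:.6f} vs {equal_weight[name]:.6f}")
```

The two calculations can diverge if some stocks have missing days, since each then averages over a different number of observations.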
