# Data Collection and Cleaning

First, add the parent directory to Python's module search path so that the `src` package can be imported from this notebook. Then import the functions used for data collection and cleaning.

In [None]:
# Data Collection and Cleaning
import os
import sys

# Add the parent directory (relative to the current working directory) to the Python path
module_path = os.path.abspath('..')
if module_path not in sys.path:
    sys.path.append(module_path)

from src.data_processing import fetch_stock_data, preprocess_data
import pandas as pd

- `fetch_stock_data`: Function to collect raw stock data.
- `preprocess_data`: Function to clean and preprocess the collected data.
- `pandas`: Library for data manipulation and analysis.

# Fetch Sample Data

Select a list of stock tickers from different sectors to create a diverse dataset for analysis. Then, use the `fetch_stock_data` function to download daily historical price data for each ticker over the specified date range.


In [None]:
# Fetch sample data
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 
           'JPM', 'GS', 'WMT', 'NVDA', 'NFLX']
# Start and end dates (YYYY-MM-DD) of the range to fetch
stock_data = fetch_stock_data(tickers, '2023-01-01', '2025-06-07')

- The `tickers` list includes technology, automotive, finance, retail, and entertainment companies.
- The requested date range runs from January 1, 2023 to June 7, 2025, roughly two and a half years of daily prices, which is enough to compare trends across recent market events.
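
After fetching, it can help to confirm that every requested ticker actually came back and to see which dates each one covers. The sketch below is a minimal illustration, not part of `src.data_processing`: the helper `summarize_coverage` is hypothetical, and it assumes the fetched data is in long format with `Date` and `Ticker` columns. The toy prices are made up.

```python
import pandas as pd

def summarize_coverage(df, requested_tickers):
    """Hypothetical helper: per-ticker row count and date span, plus tickers with no rows."""
    dates = pd.to_datetime(df["Date"])
    summary = (
        df.assign(Date=dates)
          .groupby("Ticker")["Date"]
          .agg(rows="count", first_date="min", last_date="max")
    )
    # Tickers that were requested but do not appear in the data at all
    missing = sorted(set(requested_tickers) - set(summary.index))
    return summary, missing

# Toy stand-in for the fetched data (assumed long format, made-up values)
sample = pd.DataFrame({
    "Date": ["2023-01-03", "2023-01-04", "2023-01-03"],
    "Ticker": ["AAPL", "AAPL", "MSFT"],
    "Close": [100.0, 101.0, 200.0],
})

summary, missing = summarize_coverage(sample, ["AAPL", "MSFT", "TSLA"])
print(summary)
print("Tickers with no rows:", missing)
```

With real data, the same call would be `summarize_coverage(stock_data, tickers)`, provided the column names match the assumption above.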

# Data Preparation Insights

Before analysis, it is important to understand the structure of the dataset. By checking the columns in `stock_data`, we ensure that all necessary fields (such as 'Date', 'Ticker', 'Close', etc.) are present for further processing.

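Next, we calculate the daily return for each ticker: the fractional change in closing price from one trading day to the next, so a value of 0.01 means a 1% gain. Grouping by ticker keeps one company's prices from being compared with another's, and the first row of each ticker has no previous close, so its return is `NaN`. This metric is essential for evaluating stock performance and volatility across different companies.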

Finally, we preprocess the data to handle any inconsistencies or missing values. This step ensures the dataset is clean and reliable, providing a solid foundation for subsequent exploratory analysis and visualization.

In [None]:
# Check columns before preprocessing
print("Columns in stock_data:", stock_data.columns)

# Calculate Daily_Return for each ticker
stock_data['Daily_Return'] = stock_data.groupby('Ticker')['Close'].pct_change()

# Preprocess data
cleaned_data = preprocess_data(stock_data)
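
To see what the grouped `pct_change` call produces, here is a small worked example on made-up prices. It assumes the same long-format layout (`Date`, `Ticker`, `Close`) and sorts by ticker and date first, since `pct_change` compares each row with the one before it within its group.

```python
import pandas as pd

# Made-up closing prices for two tickers, in long format
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-03", "2023-01-04", "2023-01-05",
                            "2023-01-03", "2023-01-04"]),
    "Ticker": ["AAPL", "AAPL", "AAPL", "MSFT", "MSFT"],
    "Close": [100.0, 110.0, 99.0, 200.0, 190.0],
})

# pct_change depends on row order, so sort within each ticker by date
prices = prices.sort_values(["Ticker", "Date"]).reset_index(drop=True)

# (Close_t - Close_{t-1}) / Close_{t-1}, computed separately per ticker
prices["Daily_Return"] = prices.groupby("Ticker")["Close"].pct_change()
print(prices)
```

The first row of each ticker comes out as `NaN`, the move from 100 to 110 gives about 0.10, and the move from 200 to 190 gives about -0.05. How those leading `NaN` values are treated depends on what `preprocess_data` does, which is not shown here.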

# Saving Cleaned Data

After preprocessing, save the cleaned stock data to a CSV file for future use and reproducibility:

In [None]:
# Save the cleaned data (the output of preprocess_data, not the raw frame)
cleaned_data.to_csv('data/stock_data.csv', index=False)

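This exports the DataFrame to `data/stock_data.csv` without the row index, making it easy to reload the processed data for further analysis or sharing. The path is relative to the current working directory, and the `data` folder must already exist or `to_csv` will raise an error.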
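
A save and reload round trip can be sketched as follows. The helper `save_and_reload` is hypothetical and only for illustration; it creates the target folder if needed and assumes a `Date` column that should be parsed back into datetimes. A temporary directory and made-up rows are used so the example does not touch the project's `data` folder.

```python
import os
import tempfile

import pandas as pd

def save_and_reload(df, path):
    """Hypothetical helper: write df to CSV (creating the folder if needed) and read it back."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    df.to_csv(path, index=False)
    # CSV stores dates as text, so ask pandas to parse the Date column on load
    return pd.read_csv(path, parse_dates=["Date"])

# Made-up rows standing in for the cleaned data
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-03", "2023-01-04"]),
    "Ticker": ["AAPL", "AAPL"],
    "Close": [100.0, 101.0],
})

with tempfile.TemporaryDirectory() as tmp:
    reloaded = save_and_reload(demo, os.path.join(tmp, "data", "stock_data.csv"))

print(reloaded.dtypes)
```

Without `parse_dates`, the `Date` column would come back as plain strings, which tends to cause trouble in later time-based grouping or plotting.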