## 🧹 Data Cleaning and Preparation

### Overview
In this step, we clean and prepare the raw datasets (`gdp.csv` and `sp500.csv`) for analysis.  
The goal is to standardize column names and types, drop invalid rows, and put both datasets on a shared `date` key so they can be merged.


In [20]:
import pandas as pd
import os

raw_dir = "data/raw"
clean_dir = "data/clean"
os.makedirs(clean_dir, exist_ok=True)

# -----------------------------
# GDP CLEANING
# -----------------------------
gdp = pd.read_csv(os.path.join(raw_dir, "gdp.csv"))

# Rename columns
gdp.columns = ["date", "gdp"]

# Coerce types; unparseable dates become NaT and unparseable values become NaN
gdp["date"] = pd.to_datetime(gdp["date"], errors="coerce")
gdp["gdp"] = pd.to_numeric(gdp["gdp"], errors="coerce")

# Drop rows with a missing or invalid date/value, then sort chronologically
gdp = gdp.dropna().sort_values("date")


### 1. GDP Data Cleaning (`gdp.csv`)
**Actions Performed:**
- Loaded raw GDP data from `data/raw/gdp.csv`.
- Renamed the two columns for consistency (e.g., `DATE` → `date`, `GDP` → `gdp`).
- Converted `date` to datetime and `gdp` to numeric, coercing unparseable entries to missing values.
- Dropped rows with missing or invalid values (no forward fill is applied).
- Sorted data chronologically.
- Saved the cleaned dataset to `data/clean/gdp_clean.csv` (written in the final save cell, after trimming to the overlapping date range).

**Result:**  
A clean GDP dataset with standardized column names and consistent time-series structure.
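
The GDP steps above can be wrapped in a small function and exercised on a toy input. This is a minimal sketch, not the notebook's own code: `clean_gdp` is a hypothetical helper, and the inline CSV (including the `.` placeholder and the bad date) is an assumed stand-in for `gdp.csv`.

```python
import io

import pandas as pd


def clean_gdp(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns, coerce types, drop invalid rows, and sort by date."""
    out = raw.copy()
    out.columns = ["date", "gdp"]
    out["date"] = pd.to_datetime(out["date"], errors="coerce")
    out["gdp"] = pd.to_numeric(out["gdp"], errors="coerce")
    return out.dropna().sort_values("date").reset_index(drop=True)


# Toy input (assumed layout): one non-numeric value and one bad date
raw_text = (
    "DATE,GDP\n"
    "2015-07-01,18401.626\n"
    "2015-04-01,18279.784\n"
    "2015-10-01,.\n"
    "not-a-date,100\n"
)
gdp_demo = clean_gdp(pd.read_csv(io.StringIO(raw_text)))
print(gdp_demo)
```

Both invalid rows are dropped rather than filled, which matches the `dropna()` behaviour of the cell above.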

In [21]:
# -----------------------------
# S&P 500 CLEANING
# -----------------------------
sp500_raw = pd.read_csv(os.path.join(raw_dir, "sp500.csv"))

# Drop metadata rows (first 2)
sp500 = sp500_raw.iloc[2:].copy()

# Rename and fix columns
sp500.columns = ["date", "close", "high", "low", "open", "volume"]

# Convert types
sp500["date"] = pd.to_datetime(sp500["date"], errors="coerce")
for col in ["close", "high", "low", "open", "volume"]:
    sp500[col] = pd.to_numeric(sp500[col], errors="coerce")

# Drop missing and sort
sp500 = sp500.dropna(subset=["date", "close"]).sort_values("date")

### 2. S&P 500 Data Cleaning (`sp500.csv`)
**Actions Performed:**
- Loaded raw S&P 500 data from `data/raw/sp500.csv`.
- Dropped the first two rows, which hold metadata rather than prices.
- Renamed the columns to `date`, `close`, `high`, `low`, `open`, and `volume`.
- Converted `date` to datetime and the price and volume columns to numeric.
- Dropped rows with a missing `date` or `close`.
- Sorted data chronologically.
- Saved the cleaned dataset to `data/clean/sp500_clean.csv` (written in the final save cell, after trimming to the overlapping date range).

**Result:**  
A structured S&P 500 dataset that uses the same `date` column convention as the GDP data, ready to be aligned and merged.
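
The cleaning cell does not remove duplicate dates or check that prices are internally consistent. Below is a hedged sketch of how such checks might be added. The raw layout (two metadata rows under the header) and the duplicated row are assumptions made up for the demo, not taken from `sp500.csv`.

```python
import io

import pandas as pd

# Assumed raw layout: header line, then two metadata rows, then prices
raw_text = (
    "Price,Close,High,Low,Open,Volume\n"
    "Ticker,^GSPC,^GSPC,^GSPC,^GSPC,^GSPC\n"
    "Date,,,,,\n"
    "2015-04-01,2059.689941,2067.629883,2048.379883,2067.629883,3543270000\n"
    "2015-04-01,2059.689941,2067.629883,2048.379883,2067.629883,3543270000\n"
    "2015-07-01,2077.419922,2082.780029,2067.0,2067.0,3727260000\n"
)
sp500_raw = pd.read_csv(io.StringIO(raw_text))

# Same steps as the cleaning cell
sp500_demo = sp500_raw.iloc[2:].copy()
sp500_demo.columns = ["date", "close", "high", "low", "open", "volume"]
sp500_demo["date"] = pd.to_datetime(sp500_demo["date"], errors="coerce")
for col in ["close", "high", "low", "open", "volume"]:
    sp500_demo[col] = pd.to_numeric(sp500_demo[col], errors="coerce")
sp500_demo = sp500_demo.dropna(subset=["date", "close"]).sort_values("date")

# Extra step (not in the notebook): keep one row per date
sp500_demo = sp500_demo.drop_duplicates(subset="date", keep="first")

# Extra check (not in the notebook): low <= open/close <= high on every row
body = sp500_demo[["open", "close"]]
ohlc_ok = (
    (sp500_demo["low"] <= body.min(axis=1))
    & (sp500_demo["high"] >= body.max(axis=1))
).all()
print(len(sp500_demo), "rows; OHLC consistent:", ohlc_ok)
```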


### 3. Output Summary
| Dataset     | Input Path              | Output Path                | Cleaned Rows | Notes |
|--------------|------------------------|-----------------------------|---------------|-------|
| GDP          | `data/raw/gdp.csv`     | `data/clean/gdp_clean.csv`  | (insert #)    | No missing values after cleaning |
| S&P 500      | `data/raw/sp500.csv`   | `data/clean/sp500_clean.csv`| (insert #)    | Rows missing `date` or `close` dropped; dates trimmed to the GDP overlap |
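
The row counts in the table can be computed from the cleaned frames rather than typed in by hand. A minimal sketch follows. `summarize` is a hypothetical helper, and the two small frames are placeholders standing in for the notebook's `gdp` and `sp500`.

```python
import pandas as pd


def summarize(name: str, df: pd.DataFrame) -> dict:
    """Collect row count, date range, and total missing cells for one frame."""
    return {
        "dataset": name,
        "rows": len(df),
        "start": df["date"].min(),
        "end": df["date"].max(),
        "missing": int(df.isna().sum().sum()),
    }


# Placeholder frames; in the notebook, pass `gdp` and `sp500` instead
gdp_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2015-04-01", "2015-07-01"]),
     "gdp": [18279.784, 18401.626]}
)
sp500_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2015-04-01", "2015-04-02", "2015-07-01"]),
     "close": [2059.689941, None, 2077.419922]}
)

summary = pd.DataFrame(
    [summarize("GDP", gdp_demo), summarize("S&P 500", sp500_demo)]
)
print(summary.to_string(index=False))
```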


### 4. Next Steps
- Merge both cleaned datasets on the `date` column (inner join over the overlapping date range).
- Conduct exploratory data analysis (EDA) to explore correlations and trends between GDP and S&P 500 values.
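
As a first EDA step, one might compare the correlation of levels with the correlation of period-over-period changes, since two trending series can correlate strongly in levels for reasons unrelated to each other. The sketch below uses only the five rows shown in the merged preview, so the numbers it prints are illustrative, not a finding. Note that the preview rows are not evenly spaced, so `pct_change` here is not strictly quarterly.

```python
import pandas as pd

# The five rows from the merged preview (close and gdp only)
merged_demo = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2015-04-01", "2015-07-01", "2015-10-01", "2016-04-01", "2016-07-01"]
        ),
        "close": [2059.689941, 2077.419922, 1923.819946, 2072.780029, 2102.949951],
        "gdp": [18279.784, 18401.626, 18435.137, 18711.702, 18892.639],
    }
)

# Correlation of raw levels
level_corr = merged_demo["close"].corr(merged_demo["gdp"])

# Correlation of row-to-row percentage changes (first row has no change)
changes = merged_demo[["close", "gdp"]].pct_change().dropna()
change_corr = changes["close"].corr(changes["gdp"])

print(f"levels: {level_corr:.3f}, changes: {change_corr:.3f}")
```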

In [22]:
# -----------------------------
# ALIGN DATES + MERGE
# -----------------------------
start_date = max(sp500["date"].min(), gdp["date"].min())
end_date = min(sp500["date"].max(), gdp["date"].max())

sp500 = sp500[(sp500["date"] >= start_date) & (sp500["date"] <= end_date)]
gdp = gdp[(gdp["date"] >= start_date) & (gdp["date"] <= end_date)]

merged_df = pd.merge(sp500, gdp, on="date", how="inner")

In [26]:
# -----------------------------
# SAVE CLEANED FILES
# -----------------------------
gdp.to_csv(os.path.join(clean_dir, "gdp_clean.csv"), index=False)
sp500.to_csv(os.path.join(clean_dir, "sp500_clean.csv"), index=False)
merged_df.to_csv(os.path.join(clean_dir, "merged_clean.csv"), index=False)

print("✅ Cleaned files saved to data/clean/")
merged_df.head()

✅ Cleaned files saved to data/clean/


|   | date       | close       | high        | low         | open        | volume     | gdp       |
|---|------------|-------------|-------------|-------------|-------------|------------|-----------|
| 0 | 2015-04-01 | 2059.689941 | 2067.629883 | 2048.379883 | 2067.629883 | 3543270000 | 18279.784 |
| 1 | 2015-07-01 | 2077.419922 | 2082.780029 | 2067.0      | 2067.0      | 3727260000 | 18401.626 |
| 2 | 2015-10-01 | 1923.819946 | 1927.209961 | 1900.699951 | 1919.650024 | 3983600000 | 18435.137 |
| 3 | 2016-04-01 | 2072.780029 | 2075.070068 | 2043.97998  | 2056.620117 | 3749990000 | 18711.702 |
| 4 | 2016-07-01 | 2102.949951 | 2108.709961 | 2097.899902 | 2099.340088 | 3458890000 | 18892.639 |
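
The preview jumps from 2015-10-01 to 2016-04-01. An inner join on exact dates keeps a GDP observation only if the S&P 500 data has a row for that same day, so a quarter that starts on a non-trading day (presumably 2016-01-01 here) drops out. One possible alternative, not used in this notebook, is `pd.merge_asof`, which matches each GDP date to the next available trading day. The values below are toy numbers chosen for the demo, and `trade_date` is a hypothetical helper column.

```python
import pandas as pd

# Toy frames: GDP dated at quarter starts, prices only on trading days
gdp_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2015-10-01", "2016-01-01", "2016-04-01"]),
     "gdp": [100.0, 101.0, 102.0]}  # toy values
)
sp500_demo = pd.DataFrame(
    {"date": pd.to_datetime(["2015-10-01", "2016-01-04", "2016-04-01"]),
     "close": [1900.0, 2000.0, 2070.0]}  # toy values
)
# Keep the actual trading date visible after the merge
sp500_demo["trade_date"] = sp500_demo["date"]

# Exact-date inner join: the 2016-01-01 quarter is lost
inner = pd.merge(sp500_demo, gdp_demo, on="date", how="inner")

# As-of join: take the first trading day on or after each GDP date,
# but only if it falls within 7 days (both frames must be sorted by date)
asof = pd.merge_asof(
    gdp_demo.sort_values("date"),
    sp500_demo.sort_values("date"),
    on="date",
    direction="forward",
    tolerance=pd.Timedelta(days=7),
)

print(len(inner), "rows with inner join;", len(asof), "rows with merge_asof")
```

The `tolerance` guards against pairing a GDP date with a price from weeks later; the 7-day window is an arbitrary choice for this sketch.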
