# Data Collection Notebook
## Objectives:

* Load the 3 input data tables from the inputs folder
* Perform full outer joins between the tables
* Conduct a quick data check
* Save the resulting dataset in the outputs folder

---

### Set the Working Directory and Import Packages

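The notebook runs from the jupyter_notebooks subfolder, so we change the working directory to its parent folder (the project root). Relative paths such as 'inputs/...' and 'outputs/...' then resolve from the project root.
* We access the current directory with os.getcwd()
* We move to the parent folder with os.chdir(os.path.dirname(current_dir))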

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Retail-Sales-Prediction/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Retail-Sales-Prediction'
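
One caveat with the cells above: running the os.chdir cell a second time moves up one more level. A minimal sketch of a guard is shown below, assuming the notebooks folder is named jupyter_notebooks as in the output above. The helper name project_root is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent folder only when current_dir is the notebooks folder."""
    current = Path(current_dir)
    # If we are already at the project root, leave the path unchanged,
    # so re-running the cell does not climb one level further.
    if current.name == notebooks_folder:
        return current.parent
    return current


root_from_notebooks = project_root("/workspace/Retail-Sales-Prediction/jupyter_notebooks")
root_from_root = project_root("/workspace/Retail-Sales-Prediction")
print(root_from_notebooks.name, root_from_root.name)
```

In the notebook this could be used as os.chdir(project_root(os.getcwd())), which should be safe to run more than once.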

In [4]:
import numpy as np
import pandas as pd

---

### Step 1: Load the 3 input data tables
  


In [6]:
sales_data = pd.read_csv('inputs/sales data-set.csv')
features_data = pd.read_csv('inputs/Features data set.csv')
stores_data = pd.read_csv('inputs/stores data-set.csv')
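
If one of the three files is missing, pd.read_csv raises a FileNotFoundError for the first missing path only. The sketch below is one possible way to list every missing input up front. The helper missing_inputs is a hypothetical name, and the demonstration uses a temporary folder rather than the real inputs folder.

```python
import tempfile
from pathlib import Path


def missing_inputs(folder, filenames):
    """Return the file names that are not present in the folder."""
    return [name for name in filenames if not (Path(folder) / name).is_file()]


# File names taken from the read_csv calls above.
expected = ["sales data-set.csv", "Features data set.csv", "stores data-set.csv"]

# Demonstrate on a temporary folder that holds only one of the three files.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "sales data-set.csv").write_text("Store,Dept\n1,1\n")
    absent = missing_inputs(tmp_dir, expected)

print(absent)
```

In the notebook the call would be missing_inputs('inputs', expected), and an empty list would mean all three tables are available.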

---

### Step 2: Perform full outer join of Sales and Features on 'Store' and 'Date'
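Drop the 'IsHoliday' column from the 'features_data' table before merging. The 'sales_data' table already has this column, and keeping both would produce duplicate 'IsHoliday_x' and 'IsHoliday_y' columns in the merged table.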

In [7]:
if 'IsHoliday' in features_data.columns:
    features_data = features_data.drop(columns=['IsHoliday'])
sales_features_merged = pd.merge(sales_data, features_data, on=['Store', 'Date'], how='outer')
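
A full outer join keeps rows that have no match on the other side, so it can help to see how many rows each table contributed. The sketch below uses the indicator parameter of pd.merge on two small made-up tables. The values are illustrative only and are not taken from the real data.

```python
import pandas as pd

# Toy stand-ins for sales_data and features_data (values are made up).
sales = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["05/02/2010", "12/02/2010"],
    "Weekly_Sales": [100.0, 120.0],
})
features = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": ["05/02/2010", "12/02/2010", "19/02/2010"],
    "Temperature": [42.3, 38.5, 39.9],
})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
merged = pd.merge(sales, features, on=["Store", "Date"], how="outer", indicator=True)

matched = int((merged["_merge"] == "both").sum())
features_only = int((merged["_merge"] == "right_only").sum())
sales_only = int((merged["_merge"] == "left_only").sum())
print(matched, features_only, sales_only)
```

Rows marked right_only have no sales record, so their sales columns are filled with NaN after the join.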

### Step 3: Perform full outer join with Stores on 'Store'

In [8]:
full_data = pd.merge(sales_features_merged, stores_data, on='Store', how='outer')
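
This join assumes that each store appears once in the stores table. If a store were duplicated there, the merge would silently multiply the matching rows. The sketch below shows one way to guard against that with the validate parameter of pd.merge, using small made-up tables.

```python
import pandas as pd

# Toy stand-ins for sales_features_merged and stores_data (values are made up).
sales_features = pd.DataFrame({"Store": [1, 1, 2], "Weekly_Sales": [100.0, 120.0, 90.0]})
stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [1000, 2000]})

# validate="many_to_one" checks that the join key is unique in the right table.
joined = pd.merge(sales_features, stores, on="Store", how="outer", validate="many_to_one")

# A stores table with store 1 listed twice should be rejected.
bad_stores = pd.concat([stores, stores.iloc[[0]]], ignore_index=True)
try:
    pd.merge(sales_features, bad_stores, on="Store", how="outer", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(len(joined), duplicate_detected)
```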

### Step 4: Run a quick data check

In [9]:
print("Quick Data Check:")
print(full_data.info())
print(full_data.head())

Quick Data Check:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423325 entries, 0 to 423324
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         423325 non-null  int64  
 1   Dept          421570 non-null  float64
 2   Date          423325 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  object 
 5   Temperature   423325 non-null  float64
 6   Fuel_Price    423325 non-null  float64
 7   MarkDown1     152433 non-null  float64
 8   MarkDown2     112532 non-null  float64
 9   MarkDown3     138658 non-null  float64
 10  MarkDown4     136466 non-null  float64
 11  MarkDown5     153187 non-null  float64
 12  CPI           422740 non-null  float64
 13  Unemployment  422740 non-null  float64
 14  Type          423325 non-null  object 
 15  Size          423325 non-null  int64  
dtypes: float64(11), int64(2), object(3)
memory usage: 51.7+ MB
None
   Store  
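
The info() output above shows 423325 rows but only 421570 non-null values for Dept, Weekly_Sales and IsHoliday, so 1755 rows have no sales record. The MarkDown columns are mostly empty, and CPI and Unemployment are each missing in 585 rows. A per-column summary makes such gaps easier to read. The sketch below computes it on a small made-up table. In the notebook the same lines would be applied to full_data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for full_data (values are made up).
toy = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1.0, 2.0, np.nan, 1.0],
    "MarkDown1": [np.nan, np.nan, np.nan, 5.0],
})

# Count and percentage of missing values per column.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})
print(summary)
```

The same gaps explain why Dept is stored as float64 rather than int64: an integer column that gains NaN values in a join is converted to float.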

### Step 5: Save the merged file to the outputs folder

In [10]:
output_path = "outputs/merged_data.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
full_data.to_csv(output_path, index=False)

print(f"Merged data saved at: {output_path}")

Merged data saved at: outputs/merged_data.csv
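
As an optional sanity check, the saved file can be read back and compared with the table in memory. The sketch below does this for a small made-up table in a temporary folder. In the notebook the comparison would use full_data and outputs/merged_data.csv instead.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for full_data (values are made up).
toy = pd.DataFrame({"Store": [1, 2], "Weekly_Sales": [100.5, 90.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = os.path.join(tmp_dir, "outputs", "merged_data.csv")
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    toy.to_csv(output_path, index=False)
    # Read the file back while the temporary folder still exists.
    reloaded = pd.read_csv(output_path)

same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```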
