# **Data Collection Heritage-Housing-Project**

## Objectives

* Fetch the Heritage Housing dataset from the corresponding Kaggle repository  
* Store the dataset in a designated local directory for further processing  
* Inspect the data and save it under outputs/datasets/collection
* Provide a structured step-by-step guide to load the dataset for further use  

## Inputs

* Kaggle API credentials provided via a `kaggle.json` file  
* Kaggle repository URL: https://www.kaggle.com/codeinstitute/housing-prices-data

## Outputs

* Raw dataset files downloaded from Kaggle  
* Files saved to the appropriate folder structure

## Additional Comments

* Ensure that the Kaggle API token is correctly set up before running the notebook  
* Use version control to track changes to data download scripts  

---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5\\heritage-housing-ml-pgz\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\PabloGalindo\\Coding-Institute\\PMS5\\heritage-housing-ml-pgz'
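Running the `os.chdir` cell a second time moves the working directory up one more level. A minimal sketch of an idempotent alternative, assuming the notebooks live in a folder named `jupyter_notebooks` as in the path printed above; the helper name `move_to_project_root` is hypothetical:

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while inside the notebook folder.

    `notebook_folder` is an assumption taken from the path printed above.
    Returns the working directory after the (possible) change.
    """
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_folder:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


print(move_to_project_root())
```

Because the function checks the folder name first, re-running the cell leaves the working directory at the project root.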

# Fetch Data from Kaggle

## Intro

To retrieve the Heritage Housing dataset from Kaggle, follow the structured steps below. This ensures a reproducible and reliable data acquisition process.

Before proceeding, make sure the `kaggle.json` file containing your API credentials is available in the **project root**, which is the working directory set in the previous section. If the file is not present, download it manually from your Kaggle account first.

### Step-by-Step Instructions

1. **Install Required Packages**  
   Ensure the `kaggle` Python package is installed in your environment. This is required to authenticate and download data via the Kaggle API.

2. **Load `kaggle.json`**  
   Locate the `kaggle.json` file and point the `KAGGLE_CONFIG_DIR` environment variable at the folder that contains it, so the Kaggle API can authenticate for the dataset download. Ensure the file is located in the working directory before proceeding.


3. **Download the Dataset**  
   Use the Kaggle API to fetch the dataset from the following URL:  
   [https://www.kaggle.com/codeinstitute/housing-prices-data](https://www.kaggle.com/codeinstitute/housing-prices-data)  
   The data should be extracted and saved in a predefined location, such as `inputs/datasets/raw`.





Step 1 – Install Required Packages

In [None]:
%pip install kaggle==1.5.12

Step 2 – Load kaggle.json

In [None]:
import os

kaggle_json_path = os.path.abspath("kaggle.json")

if os.path.exists(kaggle_json_path):
    os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
    print("kaggle.json found and environment variable set.")
else:
    print("kaggle.json not found. Please download it from your Kaggle account.")

kaggle.json found and environment variable set.
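Step 3 could be sketched as follows, using the `kaggle` package installed in Step 1. The dataset slug `codeinstitute/housing-prices-data` is taken from the URL in the instructions, and `inputs/datasets/raw` is the destination named there. The helper names `build_api` and `download_dataset` are hypothetical, and the import is deferred because importing the `kaggle` package typically triggers authentication and fails when no credentials are found:

```python
import os

DATASET_SLUG = "codeinstitute/housing-prices-data"  # owner/dataset part of the Kaggle URL
RAW_DIR = os.path.join("inputs", "datasets", "raw")  # destination named in Step 3


def build_api():
    """Return an authenticated Kaggle client (reads kaggle.json via KAGGLE_CONFIG_DIR)."""
    # Imported lazily so this cell can be defined before credentials are in place.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    return api


def download_dataset(api, dataset_slug=DATASET_SLUG, dest=RAW_DIR):
    """Download and unzip the dataset into `dest`; return the sorted entries found there."""
    os.makedirs(dest, exist_ok=True)
    api.dataset_download_files(dataset_slug, path=dest, unzip=True)
    return sorted(os.listdir(dest))


# In the notebook, once Step 2 reports that kaggle.json was found:
# print(download_dataset(build_api()))
```

Passing the client in as an argument keeps the download step separate from authentication, so a failure in either one is easier to diagnose.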


---
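
The Objectives also call for inspecting the data and saving it under `outputs/datasets/collection`. A minimal sketch using only the standard library, assuming the raw files already sit somewhere below `inputs/datasets/raw`. The file names are not known here, so the sketch searches for any CSV, and the helper name `collect_csv_files` is hypothetical:

```python
import csv
import glob
import os
import shutil


def collect_csv_files(raw_dir="inputs/datasets/raw",
                      out_dir="outputs/datasets/collection"):
    """Copy every CSV found under `raw_dir` into `out_dir`.

    Returns a dict mapping each copied file name to (column names, data row count).
    Files sharing a base name in different subfolders would overwrite each other.
    """
    os.makedirs(out_dir, exist_ok=True)
    summary = {}
    pattern = os.path.join(raw_dir, "**", "*.csv")
    for src in sorted(glob.glob(pattern, recursive=True)):
        with open(src, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            header = next(reader, [])       # first line holds the column names
            n_rows = sum(1 for _ in reader)  # remaining lines are data rows
        shutil.copy(src, out_dir)
        summary[os.path.basename(src)] = (header, n_rows)
    return summary


for name, (columns, n_rows) in collect_csv_files().items():
    print(f"{name}: {n_rows} rows, {len(columns)} columns")
```

For a richer look at the values, `pandas.read_csv` followed by `DataFrame.head()` is the usual choice once pandas is installed.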


# Push files to Repo


In [None]:
import os

try:
    # Create the output folder named in the Objectives; adjust the path if yours differs
    os.makedirs(name='outputs/datasets/collection', exist_ok=True)
except Exception as e:
    print(e)
