# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* The raw dataset files, downloaded and unzipped into `inputs/datasets/raw`.



---

# Change working directory

We need to change the working directory from the notebook's folder to its parent folder, the project root.
* We access the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Arthur\\OneDrive\\Documentos\\Code Institute\\PP5\\PP5-heritage-housing-issues-ml\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Arthur\\OneDrive\\Documentos\\Code Institute\\PP5\\PP5-heritage-housing-issues-ml'
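
Running cell `In [2]` a second time moves one more level up, out of the project. The sketch below shows one way to guard against that. It assumes the notebooks live in a folder named `jupyter_notebooks`, as the path above suggests; the helper name `move_to_project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only when the current folder is `notebook_folder`.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Same effect as os.chdir(os.path.dirname(os.getcwd()))
        os.chdir(cwd.parent)
    # Return the directory we ended up in so the caller can confirm it
    return Path.cwd()
```

Calling `move_to_project_root()` repeatedly is harmless: once the working directory is the project root, further calls leave it unchanged.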

# Fetch Data from Kaggle

Install the Kaggle package to fetch the data.

In [4]:
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting python-slugify (from kaggle==1.5.12)
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Using cached text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.4 text-unidecode-1.3
Note: you may need to restart the kernel to use updated packages.


A **JSON file (authentication token)** is required to authenticate with Kaggle and download data in this session. The token is stored in the `kaggle.json` file, and the following cell makes it available to this session.
An `if` statement skips the permission change on Windows so that the cell runs on both Windows and Linux.

In [5]:
import os
import platform

# Set Kaggle config path
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()

# Restrict the token file to owner read/write; only needed on Unix-based systems
if platform.system() != 'Windows':
    os.chmod('kaggle.json', 0o600)
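
If `kaggle.json` is missing or malformed, the download step further down fails with a less obvious error. As a minimal sketch, the token can be checked before the environment variable is set. The helper name `prepare_kaggle_token` is hypothetical, and the check assumes the token file holds a JSON object with `username` and `key` entries, which is the layout Kaggle uses for downloaded tokens.

```python
import json
import os
import platform
import stat


def prepare_kaggle_token(config_dir: str) -> bool:
    """Validate kaggle.json in `config_dir` and point the Kaggle client at it.

    Returns False (and changes nothing) when the file is missing or malformed.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False

    try:
        with open(token_path, encoding="utf-8") as token_file:
            token = json.load(token_file)
    except json.JSONDecodeError:
        return False

    # Assumed token layout: {"username": "...", "key": "..."}
    if not isinstance(token, dict) or not {"username", "key"}.issubset(token):
        return False

    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    if platform.system() != "Windows":
        # Equivalent to `chmod 600 kaggle.json`
        os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

In this notebook the call would be `prepare_kaggle_token(os.getcwd())`, with a `False` result signalling that the token needs to be added before continuing.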

We are using the following [Kaggle URL](https://www.kaggle.com/codeinstitute/housing-prices-data)

![kaggle dataset page screenshot](..\docs\screenshot-kaggle-dataset.png)

Get the dataset path from the Kaggle URL.
* When you are viewing the dataset on Kaggle, the dataset path is the part of the URL that comes after `https://www.kaggle.com/`.
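
The same extraction can be done in code. The sketch below uses the standard library's `urllib.parse`; the function name `kaggle_dataset_path` is hypothetical. It also strips a leading `datasets/` segment, on the assumption that some Kaggle dataset URLs include one.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url: str) -> str:
    """Return the part of a Kaggle dataset URL that follows the domain."""
    path = urlparse(url).path.strip("/")
    # Assumption: some dataset URLs look like /datasets/<owner>/<name>
    prefix = "datasets/"
    if path.startswith(prefix):
        path = path[len(prefix):]
    return path


print(kaggle_dataset_path("https://www.kaggle.com/codeinstitute/housing-prices-data"))
# prints: codeinstitute/housing-prices-data
```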

Define the Kaggle dataset path and the destination folder, then download the dataset.

In [6]:
# Define Kaggle dataset path and local destination folder
KaggleDatasetPath = "codeinstitute/housing-prices-data"
DestinationFolder = "inputs/datasets/raw"

# Download dataset from Kaggle
!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Downloading housing-prices-data.zip to inputs/datasets/raw




  0%|          | 0.00/49.6k [00:00<?, ?B/s]
100%|██████████| 49.6k/49.6k [00:00<00:00, 427kB/s]


Unzip the downloaded file, then delete the zip file and the `kaggle.json` file.

In [7]:
import zipfile
import glob

# Unzip any zip files in the destination folder
for zip_file in glob.glob(f"{DestinationFolder}/*.zip"):
    with zipfile.ZipFile(zip_file, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_file)

# Delete the kaggle.json file
if os.path.exists("kaggle.json"):
    os.remove("kaggle.json")
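
The Objectives mention inspecting the data and saving it under `outputs/datasets/collection`. As a minimal sketch of that step, using only the standard library, the helper below copies every CSV file found in the raw folder into a collection folder and returns each file's header row for a quick look at the columns. The function name `collect_csv_files` is hypothetical, and the sketch assumes the unzipped dataset contains UTF-8 CSV files with unique file names.

```python
import csv
import shutil
from pathlib import Path


def collect_csv_files(raw_dir: str, collection_dir: str) -> dict:
    """Copy every CSV under `raw_dir` into `collection_dir`.

    Returns {file name: list of column names from the first row}.
    """
    destination = Path(collection_dir)
    destination.mkdir(parents=True, exist_ok=True)

    headers = {}
    for csv_path in sorted(Path(raw_dir).rglob("*.csv")):
        # Read only the first row to see which columns the file has
        with open(csv_path, newline="", encoding="utf-8") as csv_file:
            headers[csv_path.name] = next(csv.reader(csv_file), [])
        # Files sharing a name in different subfolders would overwrite each other
        shutil.copy2(csv_path, destination / csv_path.name)
    return headers
```

In this notebook the call would be `collect_csv_files(DestinationFolder, "outputs/datasets/collection")`.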



# Push files to Repo

* Create the folder for the collected dataset before committing and pushing the files to the repo.

In [None]:
import os

try:
    # Create the output folder named in the Objectives; adjust the path if your layout differs
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    # For example, FileExistsError when the folder already exists
    print(e)
