# Data Download
We begin the project by setting up the data objects we need and downloading the dataset. [Amazon's Product Reviews](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) are publicly available and easy to download. Each row in the dataset corresponds to a review written by a user and includes other data points, such as star ratings, which we will explore later.

**Set Up**

This notebook was run on Databricks Runtime 13.2 ML.

#### Initial Setup

Setting up the necessary data-holding objects, such as Catalogs, Databases, or Volumes, is a great way to start a project on Databricks. They help us organise our assets with ease.

Given this, we will use the next few cells of code to create a Catalog, a Database (Schema) within that catalog which will hold our tables, and also a Volume which will hold our files.

_If Unity Catalog is not yet enabled on your workspace, skip the steps marked as optional in the code comments. Unity Catalog is not required for this project._

In [0]:
%sql

-- Creating a Catalog (Optional, skip if no-UC)
CREATE CATALOG IF NOT EXISTS mas;

-- Select the Catalog as Default for this Notebook
-- If you would like to use your own Catalog, you can replace the name
-- (Optional, skip if no-UC)
USE CATALOG mas;

-- Create a Database
CREATE DATABASE IF NOT EXISTS review_summarisation;

-- Select the Database as Default
USE SCHEMA review_summarisation;

-- Create a Volume (Optional, skip if no-UC)
CREATE VOLUME IF NOT EXISTS data_store;

#### Setting Up Paths

We will now set up our paths, which we will use for downloading and storing the data. This code will give you the option to select a `dbfs` path or any other path you might want to use for storing the raw files.

In [0]:
# Import the os module to declare an environment variable
import os

# Setting up the storage path (please edit this if you would like to store the data somewhere else)
main_storage_path = "/Volumes/mas/review_summarisation/data_store"

# Declaring as an Environment Variable 
os.environ["MAIN_STORAGE_PATH"] = main_storage_path
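For workspaces without Unity Catalog, the path choice described above can be sketched as a small helper. This is only an illustration: the function names `volume_path` and `choose_storage_path`, and the fallback location `/dbfs/tmp/review_summarisation`, are assumptions and not part of the original notebook. The Volume path follows the `/Volumes/<catalog>/<schema>/<volume>` layout used in the cell above.

```python
import os


def volume_path(catalog: str, schema: str, volume: str, root: str = "/Volumes") -> str:
    """Build a Volume file path of the form <root>/<catalog>/<schema>/<volume>."""
    return f"{root}/{catalog}/{schema}/{volume}"


def choose_storage_path(preferred: str, fallback: str) -> str:
    """Return `preferred` if it exists as a directory, otherwise `fallback`."""
    return preferred if os.path.isdir(preferred) else fallback


# Names taken from the SQL cell above
preferred = volume_path("mas", "review_summarisation", "data_store")

# Hypothetical fallback for workspaces without Unity Catalog (edit as needed)
fallback = "/dbfs/tmp/review_summarisation"

main_storage_path = choose_storage_path(preferred, fallback)
os.environ["MAIN_STORAGE_PATH"] = main_storage_path
print(main_storage_path)
```

If the fallback is selected, the directory still has to exist before the download step can `cd` into it.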

In [0]:
%sh
# Confirming the variable made it through
echo $MAIN_STORAGE_PATH

#### Downloading the Data
Now we can download the data from the public registry. Amazon provides many datasets, grouped by category such as Books or Cameras. For this use case, we will focus on the Books dataset, since we may come across reviews of books we have read.

These datasets are stored as compressed TSV (Tab-Separated Values) files, a close cousin of CSV (Comma-Separated Values). Our first task is to download and unzip the data into the main location we defined earlier. We will do this in a shell script, using the `wget` utility for the download.
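For readers who would rather stay in Python, the same download-and-unzip step could be sketched with the standard library alone. This is an alternative to the shell approach used in this notebook, not the notebook's own method; the helper names `download` and `gunzip` are made up for illustration.

```python
import gzip
import os
import shutil
import urllib.request


def download(url: str, dest_dir: str) -> str:
    """Download `url` into `dest_dir` and return the local file path."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, target)
    return target


def gunzip(src_path: str) -> str:
    """Decompress a .gz file next to itself and delete the archive,
    mirroring the default behaviour of the `gunzip` command."""
    if not src_path.endswith(".gz"):
        raise ValueError(f"expected a .gz file, got {src_path!r}")
    dest_path = src_path[: -len(".gz")]
    # Stream in chunks so a large file never has to fit in memory
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    os.remove(src_path)
    return dest_path


# Example usage (not executed here; this would download the full dataset):
# archive = download(
#     "https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz",
#     os.path.join(os.environ["MAIN_STORAGE_PATH"], "amazon_reviews"),
# )
# tsv_file = gunzip(archive)
```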

In [0]:
%sh
# Stop at the first failing command
set -e
# Move into our main directory
cd "$MAIN_STORAGE_PATH"
# Create a new folder
mkdir -p amazon_reviews
# Move into the folder
cd amazon_reviews
# Download the data
wget https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Books_v1_00.tsv.gz
# Unzip
gunzip amazon_reviews_us_Books_v1_00.tsv.gz
# Display what's there
du -ah .
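Once the file is unzipped, a quick way to confirm that it is a valid TSV is to read its first few rows. The sketch below uses Python's built-in `csv` module with a tab delimiter; the helper name `preview_tsv` is hypothetical. Quoting is disabled on the assumption that quote characters inside free-text reviews should be treated as ordinary text rather than field delimiters.

```python
import csv
import itertools
import os


def preview_tsv(path: str, n_rows: int = 5) -> list[dict]:
    """Return the first `n_rows` records of a tab-separated file as dicts
    keyed by the column names in the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(itertools.islice(reader, n_rows))


# Path assembled from the storage location and file name used above
tsv_path = os.path.join(
    os.environ.get("MAIN_STORAGE_PATH", "."),
    "amazon_reviews",
    "amazon_reviews_us_Books_v1_00.tsv",
)

# Only preview if the download step has already been run
if os.path.exists(tsv_path):
    for row in preview_tsv(tsv_path, n_rows=3):
        print(row)
```

Because the reader is consumed lazily, only the first few lines are read from disk, which keeps the check fast even for a large file.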