# Amazon Reviews Data Downloader

This notebook downloads the Amazon Reviews dataset from Kaggle and stores it locally in the `data/amazon_reviews/` folder.

**Dataset URL**: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews/data

## Prerequisites
- Kaggle account and API key
- Kaggle API key file (`kaggle.json`) placed in `~/.kaggle/` directory
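
A quick pre-flight check can catch a missing or overly permissive key file before any download is attempted. The sketch below assumes the default `~/.kaggle/kaggle.json` location named above; the helper name `check_kaggle_credentials` is illustrative, and the permission check only applies on POSIX systems.

```python
import os
import stat
from pathlib import Path


def check_kaggle_credentials(path=None):
    """Return a list of problems found with the Kaggle key file (empty if none).

    `path` defaults to ~/.kaggle/kaggle.json, the location this notebook expects.
    """
    key_file = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    problems = []

    if not key_file.is_file():
        problems.append(f"{key_file} not found")
        return problems

    # On POSIX systems, flag a key file that group or other users can access.
    if os.name == "posix":
        mode = stat.S_IMODE(key_file.stat().st_mode)
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(
                f"{key_file} has mode {oct(mode)}; consider `chmod 600 {key_file}`"
            )

    return problems


if __name__ == "__main__":
    issues = check_kaggle_credentials()
    print("Credentials look fine." if not issues else "\n".join(issues))
```

An empty list means the file exists and, on POSIX systems, is readable only by its owner.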

## 1. Install Required Libraries

Install the Kaggle API and other necessary libraries for downloading and handling data.

In [1]:
# Install required packages into the active kernel's environment
%pip install kaggle pandas numpy matplotlib seaborn


## 2. Import Libraries and Setup

Import required libraries including kaggle, os, zipfile, and pandas for data handling.

In [4]:
import os
import zipfile
import pandas as pd
import numpy as np
from pathlib import Path
import shutil
from kaggle.api.kaggle_api_extended import KaggleApi

print("Libraries imported successfully!")

Libraries imported successfully!


## 3. Configure Kaggle API

Set up Kaggle API credentials and configure authentication to access datasets.

**Note**: Make sure you have your `kaggle.json` file in the `~/.kaggle/` directory. You can download it from your Kaggle account settings.

In [6]:
# Initialize Kaggle API
api = KaggleApi()
api.authenticate()

print("Kaggle API authenticated successfully!")

# Dataset identifier
dataset_name = "kritanjalijain/amazon-reviews"
print(f"Target dataset: {dataset_name}")

Kaggle API authenticated successfully!
Target dataset: kritanjalijain/amazon-reviews
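
The Kaggle client can also read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which is convenient on machines where placing a file in `~/.kaggle/` is awkward. The sketch below copies the `username` and `key` fields of a `kaggle.json` file into those variables; the helper name `export_kaggle_credentials` and its `environ` parameter are illustrative. Because importing the `kaggle` package may itself trigger authentication, it is safest to set the variables before the import cell runs.

```python
import json
import os
from pathlib import Path


def export_kaggle_credentials(json_path, environ=os.environ):
    """Copy `username` and `key` from a kaggle.json file into environment variables."""
    creds = json.loads(Path(json_path).read_text(encoding="utf-8"))
    environ["KAGGLE_USERNAME"] = creds["username"]
    environ["KAGGLE_KEY"] = creds["key"]
    # Return only the username so the secret key is never echoed in notebook output.
    return creds["username"]
```

Passing a plain dictionary as `environ` makes the helper easy to try out without touching the real process environment.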


## 4. Create Data Directory

Create a local `data/amazon_reviews` directory to hold the downloaded files, if it does not already exist.

In [7]:
# Define data directory path
data_dir = Path("data")
amazon_data_dir = data_dir / "amazon_reviews"

# Create the directories (including any missing parents) if they don't exist
amazon_data_dir.mkdir(parents=True, exist_ok=True)

print(f"Data directory created: {data_dir.absolute()}")
print(f"Amazon reviews directory: {amazon_data_dir.absolute()}")

# List existing contents
print("\nCurrent contents of data directory:")
for item in data_dir.iterdir():
    print(f"  - {item.name}")

Data directory created: /Users/ducqhle/Documents/UIT_workspace/Project_Python_For_AI/data
Amazon reviews directory: /Users/ducqhle/Documents/UIT_workspace/Project_Python_For_AI/data/amazon_reviews

Current contents of data directory:
  - amazon_reviews


## 5. Download Amazon Reviews Dataset

Use the Kaggle API to download the dataset identified by `dataset_name` into the local data folder. Passing `unzip=True` extracts the archive as part of the download.

In [8]:
# Download the dataset
print(f"Downloading dataset: {dataset_name}")
print(f"Destination: {amazon_data_dir.absolute()}")

try:
    api.dataset_download_files(
        dataset_name, 
        path=str(amazon_data_dir), 
        unzip=True
    )
    print("✅ Dataset downloaded and extracted successfully!")
except Exception as e:
    print(f"❌ Error downloading dataset: {e}")
    print("Please check your Kaggle API credentials and internet connection.")

Downloading dataset: kritanjalijain/amazon-reviews
Destination: /Users/ducqhle/Documents/UIT_workspace/Project_Python_For_AI/data/amazon_reviews
Dataset URL: https://www.kaggle.com/datasets/kritanjalijain/amazon-reviews
✅ Dataset downloaded and extracted successfully!
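
Re-running the download cell fetches the full archive again, which is slow for a dataset of this size. A minimal guard, assuming the extracted files are named `train.csv` and `test.csv` as in the file listing produced later in this notebook, is to check for them first. The helper name `missing_files` is illustrative.

```python
from pathlib import Path


def missing_files(target_dir, expected=("train.csv", "test.csv")):
    """Return the expected file names that are absent (or empty) in target_dir."""
    target = Path(target_dir)
    return [
        name for name in expected
        if not (target / name).is_file() or (target / name).stat().st_size == 0
    ]


# Usage in the notebook (assumes `api`, `dataset_name`, and `amazon_data_dir`
# from the cells above):
#
# if missing_files(amazon_data_dir):
#     api.dataset_download_files(dataset_name, path=str(amazon_data_dir), unzip=True)
# else:
#     print("Dataset already present; skipping download.")
```

Treating zero-byte files as missing guards against a previous download that was interrupted before any data was written.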


## 6. Extract and Organize Data

Because the download step already passes `unzip=True`, this step only catches zip archives that remain in the directory: it extracts each one in place and then deletes it.

In [9]:
# Check for any remaining zip files and extract them
zip_files = list(amazon_data_dir.glob("*.zip"))

if zip_files:
    print(f"Found {len(zip_files)} zip file(s) to extract:")
    for zip_file in zip_files:
        print(f"  - {zip_file.name}")
        
        # Extract zip file
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            zip_ref.extractall(amazon_data_dir)
        
        # Remove the zip file after extraction
        zip_file.unlink()
        print(f"    ✅ Extracted and removed {zip_file.name}")
else:
    print("No additional zip files found to extract.")

print("\n📁 Data organization complete!")

No additional zip files found to extract.

📁 Data organization complete!


## 7. Verify Downloaded Data

Check the downloaded files, display file sizes, and preview the data structure to confirm successful download.

In [10]:
# List all files in the amazon reviews directory
print("📋 Downloaded files:")
print("=" * 50)


def format_size(num_bytes):
    """Convert a byte count to a human-readable string (B, KB, MB, GB)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024**2:
        return f"{num_bytes / 1024:.1f} KB"
    if num_bytes < 1024**3:
        return f"{num_bytes / 1024**2:.1f} MB"
    return f"{num_bytes / 1024**3:.1f} GB"


total_size = 0
file_count = 0

for file_path in amazon_data_dir.rglob("*"):
    if file_path.is_file():
        file_size = file_path.stat().st_size
        total_size += file_size
        file_count += 1

        rel_path = file_path.relative_to(amazon_data_dir)
        print(f"  📄 {rel_path} ({format_size(file_size)})")

# Report the total with the same formatter so small totals are not shown as 0.0 MB
print("\n" + "=" * 50)
print(f"📊 Summary: {file_count} files, Total size: {format_size(total_size)}")

📋 Downloaded files:
  📄 test.csv (167.9 MB)
  📄 amazon_review_polarity_csv.tgz (656.5 MB)
  📄 train.csv (1.5 GB)

📊 Summary: 3 files, Total size: 2.3 GB
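
The listing includes `amazon_review_polarity_csv.tgz`, which the zip-only extraction step leaves untouched. Before extracting or deleting it, it can help to see what it contains, since it may simply duplicate the CSV files already on disk (an assumption worth verifying rather than relying on). The sketch below uses the standard-library `tarfile` module to list members without extracting anything; the helper name `list_tar_members` is illustrative.

```python
import tarfile


def list_tar_members(archive_path):
    """Return (name, size_in_bytes) for every regular file in a tar archive."""
    # Mode "r:*" lets tarfile detect the compression (gzip here) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


# Usage in the notebook (assumes `amazon_data_dir` from the cells above):
#
# for name, size in list_tar_members(amazon_data_dir / "amazon_review_polarity_csv.tgz"):
#     print(f"{name}: {size} bytes")
```

Listing a gzip-compressed archive still requires scanning the compressed stream, so this can take a while on a 656.5 MB file.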


## 8. Preview Data Structure

Let's take a quick look at the structure of the downloaded data files.

In [11]:
# Find CSV files to preview
csv_files = list(amazon_data_dir.glob("*.csv"))

if csv_files:
    print(f"Found {len(csv_files)} CSV file(s):")
    
    for csv_file in csv_files[:3]:  # Preview first 3 CSV files
        print(f"\n📊 Preview of {csv_file.name}:")
        print("-" * 40)
        
        try:
            # Read a sample of the data
            df = pd.read_csv(csv_file, nrows=5)
            
            print(f"Shape: {df.shape}")
            print(f"Columns: {list(df.columns)}")
            print("\nFirst few rows:")
            display(df.head())
            
            # Show data types
            print("\nData types:")
            print(df.dtypes)
            
        except Exception as e:
            print(f"Error reading {csv_file.name}: {e}")
else:
    print("No CSV files found. Let's check for other data formats...")
    
    # Check for other common data formats
    other_formats = {
        "JSON": list(amazon_data_dir.glob("*.json")),
        "Parquet": list(amazon_data_dir.glob("*.parquet")),
        "TSV": list(amazon_data_dir.glob("*.tsv")),
        "TXT": list(amazon_data_dir.glob("*.txt"))
    }
    
    for format_name, files in other_formats.items():
        if files:
            print(f"\n{format_name} files found:")
            for file in files[:3]:  # Show first 3 files of each type
                print(f"  - {file.name}")

Found 2 CSV file(s):

📊 Preview of test.csv:
----------------------------------------
Shape: (5, 3)
Columns: ['2', 'Great CD', 'My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I\'m in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life\'s hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"']

First few rows:


| | 2 | Great CD | My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?" |
|---|---|---|---|
| 0 | 2 | One of the best game music soundtracks - for a... | Despite the fact that I have only played a sma... |
| 1 | 1 | Batteries died within a year ... | I bought this charger in Jul 2003 and it worke... |
| 2 | 2 | works fine, but Maha Energy is better | Check out Maha Energy's website. Their Powerex... |
| 3 | 2 | Great for the non-audiophile | Reviewed quite a bit of the combo players and ... |
| 4 | 1 | DVD Player crapped out after one year | I also began having the incorrect disc problem... |



Data types:
2    int64
Great CD

| | 2 | Stuning even for the non-gamer | This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^ |
|---|---|---|---|
| 0 | 2 | The best soundtrack ever to anything. | I'm reading a lot of reviews saying that this ... |
| 1 | 2 | Amazing! | This soundtrack is my favorite music of all ti... |
| 2 | 2 | Excellent Soundtrack | I truly like this soundtrack and I enjoy video... |
| 3 | 2 | Remember, Pull Your Jaw Off The Floor After He... | If you've played the game, you know how divine... |
| 4 | 2 | an absolute masterpiece | I am quite sure any of you actually taking the... |



Data types:
2                                 int64
Stuning even for the non-gamer    object
This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but ou
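
The preview above shows the first review being used as the column names, which suggests the CSV files have no header row. A minimal sketch of reading them with explicit names follows; `header=None` and `names` are standard `pandas.read_csv` parameters, while the column labels `polarity`, `title`, and `text` are assumptions inferred from the preview, not names defined by the dataset files.

```python
import pandas as pd

# Assumed labels: the files themselves carry no header row.
COLUMN_NAMES = ["polarity", "title", "text"]


def preview_reviews(csv_path, nrows=5):
    """Read the first `nrows` reviews, treating every line as data."""
    return pd.read_csv(csv_path, header=None, names=COLUMN_NAMES, nrows=nrows)


# Usage in the notebook (assumes `amazon_data_dir` from the cells above):
#
# preview_reviews(amazon_data_dir / "test.csv")
```

With this change the first review stays in row 0 instead of being consumed as a header, and the `dtypes` output becomes three short lines.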

## 🎉 Download Complete!

The Amazon Reviews dataset has been successfully downloaded to the `data/amazon_reviews/` directory.

### Next Steps:
1. **Explore the data**: Use the files in the `data/amazon_reviews/` directory for your analysis
2. **Data preprocessing**: Clean and prepare the data for machine learning
3. **Analysis**: Perform sentiment analysis, rating prediction, or other ML tasks
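
Since `train.csv` is about 1.5 GB, loading it whole may strain memory on smaller machines. One option for a first exploration step is to stream the file in chunks. The sketch below counts how many reviews carry each label; it assumes the files have no header row and that the first column holds the label, as the preview suggests. The column labels and the helper name `count_labels` are illustrative.

```python
from collections import Counter

import pandas as pd


def count_labels(csv_path, chunksize=100_000):
    """Count rows per label by streaming the CSV in fixed-size chunks."""
    counts = Counter()
    reader = pd.read_csv(
        csv_path,
        header=None,                          # assumption: no header row
        names=["polarity", "title", "text"],  # assumed column labels
        usecols=["polarity"],                 # keep only the label column
        chunksize=chunksize,
    )
    for chunk in reader:
        counts.update(chunk["polarity"].value_counts().to_dict())
    return dict(counts)


# Usage in the notebook (assumes `amazon_data_dir` from the cells above):
#
# print(count_labels(amazon_data_dir / "train.csv"))
```

Only one chunk of the label column is held in memory at a time, so the peak footprint depends on `chunksize` rather than on the file size.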

### Important Notes:
- The `data/` folder is ignored by Git (configured in `.gitignore`) due to large file sizes
- Re-run this notebook whenever you need to refresh the dataset
- Keep your Kaggle API credentials secure and never commit them to version control
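
To confirm that the `data/` folder really is ignored before committing, a quick check of `.gitignore` can help. The sketch below is deliberately naive: it only looks for a literal `data` or `data/` entry (optionally with a leading slash) and does not implement Git's full pattern rules, for which `git check-ignore` is the authoritative tool. The helper name `ignores_data_dir` is illustrative.

```python
from pathlib import Path


def ignores_data_dir(gitignore_path=".gitignore", folder="data"):
    """Return True if .gitignore contains a literal entry for `folder`."""
    path = Path(gitignore_path)
    if not path.is_file():
        return False
    accepted = {folder, f"{folder}/", f"/{folder}", f"/{folder}/"}
    for line in path.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        # Skip blank lines and comments; match only exact literal entries.
        if entry and not entry.startswith("#") and entry in accepted:
            return True
    return False
```

A `False` result does not prove the folder is tracked, since a broader pattern could still cover it; it only means no literal entry was found.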