# An Analyst's Guide to Azure Data Lake Storage Gen 2 
Author: [elizabethotoole](https://github.com/elizabethotoole)

This beginner-friendly notebook serves as a step-by-step guide for connecting to Azure Data Lake Storage (ADLS) Gen 2 using Python 🐍.

Join me as I walk you through the essentials, so you can focus on analysing data, not battling with storage setups. 🚀


## 1. A Quick Overview of Fundamental Concepts

#### What is ADLS Gen 2?
Azure Data Lake Storage Gen 2 (ADLS Gen 2) is a cloud storage service from Microsoft Azure designed to store large amounts of data.

#### How is it structured?
- **Storage account** - The top level of the hierarchy, where all your data is stored and accessed. <br>
    > e.g. "https://youraccount.blob.core.windows.net" <br>
    >
    Replace the youraccount part of the URL with your actual Azure Storage account name. This URL is the account_url used later to authenticate and connect to your specific Azure Storage account. <br>

- **Containers** - Inside a storage account, data is organised into containers. Think of containers as "buckets" that hold your data. <br>
    > e.g. "your-container-name"
    >
    Replace "your-container-name" with the actual name of the container where your files are stored.

- **Folders and Subfolders**: ADLS Gen 2 allows you to organise data using folders. This differs from traditional blob storage, where files typically sit in flat containers.
    > e.g. "your-container-name/raw_data/" <br>
    >
    You can have folders like raw_data/, processed_data/, or analytics/, and within these, additional subfolders like logs/, reports/, etc. <br>

- **Blobs**: These are the files themselves which are stored inside the containers, folders, or subfolders. 
    > e.g.  "raw_data/logs/jan_log.csv"
    >
    Blobs (aka files) can be in various formats like CSV, Parquet, JSON, and even images and videos. <br>
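
To make the hierarchy concrete, here is a minimal sketch that assembles the pieces above into a full blob address and splits a blob name back into its folder and file parts. The helper names `blob_url` and `split_blob_name` are my own for illustration (they are not part of any Azure SDK), and the account, container, and path values are the placeholders used above.

```python
from posixpath import basename, dirname


def blob_url(account_name, container, blob_name):
    """Join storage account, container, and blob name into one https URL."""
    return f"https://{account_name}.blob.core.windows.net/{container}/{blob_name.lstrip('/')}"


def split_blob_name(blob_name):
    """Split a blob name into (folder path, file name)."""
    return dirname(blob_name), basename(blob_name)


# Placeholder values: replace with your own account, container, and blob
example_url = blob_url("youraccount", "your-container-name", "raw_data/logs/jan_log.csv")
folder, file_name = split_blob_name("raw_data/logs/jan_log.csv")

print(example_url)
print(folder, file_name)
```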


## 2. Downloading data from ADLS Gen 2 ⬇️
Now that you know the essentials, let's get into it! 🚀

### Step 1: Import our packages
In this step, we'll import the libraries we need to interact with Azure Blob Storage and process our data.

In [None]:
# Import the necessary libraries
from azure.identity import InteractiveBrowserCredential  # Used for authenticating with Azure using the interactive browser method
from azure.storage.blob import BlobServiceClient  # Used for connecting to the Azure Blob Storage service

import pyarrow as pa 
import pyarrow.csv as pv_csv  # for csv files
import pyarrow.parquet as pq  # for parquet files
import pandas as pd  # Pandas library used for data manipulation and analysis


### Step 2: Connect to Azure Blob Storage

In [None]:
# Connect to Azure Blob Storage
account_url = "https://youraccount.blob.core.windows.net" # Replace youraccount with the name of your Azure Blob Storage account.
container_name = "your-container-name" # Replace your-container-name with the name of your container in Azure Blob Storage.
default_credential = InteractiveBrowserCredential()  # Authenticate interactively: you'll be prompted to log in via your web browser.

blob_service_client = BlobServiceClient(account_url=account_url, credential=default_credential)
container_client = blob_service_client.get_container_client(container_name)

In [None]:
# Option 1: List all files in the container
file_names = [blob.name for blob in container_client.list_blobs()]
print(file_names)

In [None]:
# Option 2: List files from a certain folder in the container by adding a prefix
prefix = "folder1/"

file_names = [blob.name for blob in container_client.list_blobs(name_starts_with=prefix)]
print(file_names)

In [None]:
# Option 3: List files with a specific file name (this matches any blob whose name starts with the value below)
specific_file_name = "folder1/my_file.parquet"  # Replace with the name of your file

file_names = [blob.name for blob in container_client.list_blobs(name_starts_with=specific_file_name)]
print(file_names)
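
The loading loops in Step 3 read every name in `file_names` with a single reader, so a folder that mixes file types would cause errors. One option is to filter the listing first. This is a minimal sketch: `filter_by_extension` is a hypothetical helper, and the sample names are made up so the example runs without a connection.

```python
def filter_by_extension(names, extension):
    """Keep only the blob names ending with the given extension (case-insensitive)."""
    suffix = extension.lower()
    return [name for name in names if name.lower().endswith(suffix)]


# Made-up listing that mixes file types
sample_names = ["folder1/a.csv", "folder1/b.parquet", "folder1/notes.txt", "folder1/C.CSV"]

csv_names = filter_by_extension(sample_names, ".csv")
print(csv_names)
```

With a real listing, you could apply it as `file_names = filter_by_extension(file_names, ".parquet")` before running the Parquet loop.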

### Step 3: Load Data from the Blob Storage
Now that we've listed the files in Azure Blob Storage, it's important to consider the size of your dataset, as this will influence how we load the data into a dataframe for analysis.
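
One way to check the size before loading is to add up the `size` attribute (in bytes) that each item returned by `list_blobs()` carries. The sketch below works on any iterable of objects with a `size` attribute, so stand-in objects are used here instead of a live connection. The helper name `total_size_mb` is my own, not part of the Azure SDK.

```python
from types import SimpleNamespace


def total_size_mb(blobs):
    """Add up the size (in bytes) of each listed blob and return the total in megabytes."""
    total_bytes = sum(blob.size for blob in blobs)
    return total_bytes / (1024 * 1024)


# Stand-in objects that mimic the name and size attributes of listed blobs
fake_blobs = [
    SimpleNamespace(name="folder1/a.parquet", size=5 * 1024 * 1024),
    SimpleNamespace(name="folder1/b.parquet", size=15 * 1024 * 1024),
]

print(f"Total size: {total_size_mb(fake_blobs):.1f} MB")
```

With a live connection, the same function can be called as `total_size_mb(container_client.list_blobs(name_starts_with="folder1/"))`. Treat the result as a rough guide only: compressed formats such as Parquet usually take up more space in memory than on disk.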

#### Option 1: Using Pandas for smaller datasets
Pandas is a popular open-source Python library used for data manipulation and analysis. <br>

If the dataset fits into memory, using pandas is simple and effective, allowing the data to be stored in a dataframe using rows and columns (like an Excel spreadsheet or SQL table).

In [None]:
# Reading CSV Files

# List to hold dataframes
dfs = []

# Download each blob, read as CSV, and append to the dfs list
for file_name in file_names:
    # Download the blob using the blob client
    blob_client = container_client.get_blob_client(file_name)
    blob_data = blob_client.download_blob()  # Retrieves the blob data in its raw binary format
    
    # Convert the blob data into an Arrow Table directly
    buffer = pa.BufferReader(blob_data.readall()) # reads the blob data into buffer
    table = pv_csv.read_csv(buffer)  # Read CSV directly from buffer
    
    # Convert the Arrow Table to a pandas DataFrame
    df = table.to_pandas()
    
    # Append the DataFrame to the list
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
imported_data = pd.concat(dfs, ignore_index=True)

# Now `imported_data` holds the combined DataFrame
print(imported_data.head())

In [None]:
# Reading parquet files

# List to hold dataframes
dfs = []

# Download each blob, read as Parquet, and append to the dfs list
for file_name in file_names:
    # Download the blob using the blob client
    blob_client = container_client.get_blob_client(file_name)
    blob_data = blob_client.download_blob()  # Retrieves the blob data in its raw binary format
    
    # Read the Parquet data directly using pyarrow
    buffer = pa.BufferReader(blob_data.readall())  # reads the blob data into buffer
    table = pq.read_table(buffer)  # Read the Parquet data into an Arrow Table
    
    # Convert the Arrow Table to a pandas DataFrame
    df = table.to_pandas()
    
    # Append the DataFrame to the list
    dfs.append(df)

# Concatenate all DataFrames into a single DataFrame
imported_data = pd.concat(dfs, ignore_index=True)

# Now `imported_data` holds the combined DataFrame
print(imported_data.head())


#### Option 2: Using Spark for larger datasets
PySpark is the Python API for Apache Spark, an open-source, distributed computing system that is widely used for big data processing and analytics. Apache Spark processes large amounts of data in parallel across a cluster of machines, which makes it highly scalable and lets us handle datasets that don't fit into a single machine's memory.

For datasets that exceed your system's memory, loading them with pandas may cause memory errors. In such cases, you can use PySpark instead. <br>

As an analyst, you're more likely to access very large datasets through dedicated environments such as Databricks, where your fellow data engineers have done a lot of the heavy lifting with Spark configurations. Below I have provided code that uses PySpark to connect to ADLS Gen 2 storage within Databricks.

Note - Explaining Spark configurations is beyond the scope of this beginner's guide, but some useful resources are listed below. <br>
> If you're interested in learning more, take a look at this useful article: <br>
> https://subhamkharwal.medium.com/pyspark-connect-azure-adls-gen-2-c4efa5bf016b  <br>
> The code is also available via the author's GitHub: <br>
> https://github.com/subhamkharwal/ease-with-apache-spark/blob/master/30_connect_adls_gen2.ipynb  <br>

**Loading data from ADLS Gen 2 within the Databricks environment or similar** <br>


In [None]:
# For use within the Databricks environment
# A function to load data from Azure Data Lake Storage Gen 2 into a Spark DataFrame
# This function supports reading Parquet and CSV files from ADLS Gen 2 but could be expanded to support other file formats.

def load_lake_data(file_path, file_format):
    """
    Description:
    Loads data stored in Azure Data Lake Storage Gen 2 (ADLS Gen 2) into a Spark DataFrame.

    This function supports reading Parquet and CSV files from ADLS Gen 2 and applies the necessary 
    configurations like treating the first row as headers for CSV files, inferring schema, and 
    recursively looking for files.

    Args:
        file_path (str): The full path to the file or directory in ADLS storage. This should be 
                            in the format `abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>`.
        file_format (str): The format of the file to read. Valid options are "parquet" or "csv".

    Returns:
        pyspark.sql.DataFrame: A Spark DataFrame containing the data read from the specified file format.
        
    Raises:
        ValueError: If an unsupported file format is provided.

    Example usage:
        file_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-file>"
        df = load_lake_data(file_path, "parquet")
        df = load_lake_data(file_path, "csv")
    """

    from databricks.sdk.runtime import spark

    if file_format == "parquet":
        output = (
            spark.read
            .option("recursiveFileLookup", "true")  # Look for files recursively
            .parquet(file_path)  # Parquet files store column names and types, so no header option is needed
        )

    elif file_format == "csv":
        output = (
            spark.read
            .option("header", "true")  # Treat the first row as column names
            .option("inferSchema", "true")  # Infer column types from the data
            .option("recursiveFileLookup", "true")  # Look for files recursively
            .csv(file_path)  # Read CSV data from the specified path
        )

    else:
        raise ValueError(f"Unsupported file format: {file_format}")

    return output     
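
The `file_path` argument has to follow the `abfss://` pattern described in the docstring. A small helper can assemble it from its parts so the pattern isn't retyped by hand. This is only a sketch: `abfss_uri` is a hypothetical helper name, and the container, account, and folder values are placeholders.

```python
def abfss_uri(container, account_name, path=""):
    """Build an abfss:// URI from a container name, storage account name, and path."""
    return f"abfss://{container}@{account_name}.dfs.core.windows.net/{path.lstrip('/')}"


# Placeholder values: replace with your own container, account, and folder
file_path = abfss_uri("your-container-name", "youraccount", "raw_data/logs/")
print(file_path)
```

Inside Databricks, the result can be passed straight to the function above, e.g. `df = load_lake_data(file_path, "csv")`.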


### Step 4: Congrats, now you can use your dataframe for analysis!

In [None]:
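# Take a quick look at your pandas dataframe:
print(imported_data.head())

# In Databricks, use display() on the Spark DataFrame returned by load_lake_data instead
# (spark_df is a placeholder name for that DataFrame; pandas dataframes have no .limit method):
# display(spark_df.limit(5))  # Display the first 5 rows interactively in Databricks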

## 3. Uploading Data to Azure Blob Storage ⬆️

The code below uploads a file from our local machine to Azure Blob Storage, so that we can then use it in analytical projects.

In [None]:
# Provide lake details for where you want to upload the file 
"""
The variables below were defined earlier in this notebook and are repeated here for easy reference
account_url = "https://youraccount.blob.core.windows.net" 
container_name = "your-container-name" 

default_credential = InteractiveBrowserCredential()
blob_service_client = BlobServiceClient(account_url=account_url, credential=default_credential) 
container_client = blob_service_client.get_container_client(container_name) 
"""

# Provide the local file path and desired destination blob name (include the file name and file type)
local_file_path = "Documents/file_1.csv"  # Path to the local CSV file you want to upload
blob_name = "uploads/file1.csv"  # Path in the container

# Now upload the file to Azure Blob Storage
try:
    # Open the file in binary mode
    with open(local_file_path, mode="rb") as data:
        # Upload the file to Azure Blob Storage
        container_client.upload_blob(blob_name, data, overwrite=True)  # overwrite=True will replace the file if it already exists
        print(f"File {local_file_path} uploaded successfully to {container_name}/{blob_name}")
except Exception as e:
    print(f"Error uploading file: {e}")
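
If you have several files to upload, the same `upload_blob` call can be wrapped in a loop. The sketch below assumes a `container_client` like the one created earlier. `upload_folder` is a hypothetical helper, and a stand-in client (`FakeContainerClient`) is used here so the example runs without an Azure connection.

```python
import tempfile
from pathlib import Path


def upload_folder(container_client, local_dir, destination_prefix, pattern="*.csv"):
    """Upload every file in local_dir that matches pattern; return the blob names written."""
    uploaded = []
    for local_file in sorted(Path(local_dir).glob(pattern)):
        blob_name = f"{destination_prefix.rstrip('/')}/{local_file.name}"
        with open(local_file, mode="rb") as data:
            container_client.upload_blob(blob_name, data, overwrite=True)
        uploaded.append(blob_name)
    return uploaded


class FakeContainerClient:
    """Stand-in that records uploads in a dict instead of sending them to Azure."""

    def __init__(self):
        self.blobs = {}

    def upload_blob(self, name, data, overwrite=False):
        self.blobs[name] = data.read()


# Create a few temporary local files, then "upload" only the CSV files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "file_1.csv").write_bytes(b"id,value\n1,a\n")
    Path(tmp, "file_2.csv").write_bytes(b"id,value\n2,b\n")
    Path(tmp, "notes.txt").write_bytes(b"not uploaded")

    fake_client = FakeContainerClient()
    uploaded_names = upload_folder(fake_client, tmp, "uploads/")

print(uploaded_names)
```

To use it for real, pass the actual `container_client` and a local folder, e.g. `upload_folder(container_client, "Documents", "uploads")`.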