# Downloading and using tables from the Internet in a notebook

A notebook can read any file present on the host that runs it, just like a regular Python script.
This notebook shows how to load the data used in project 1, downloading the file automatically if it is not yet present on the host.

First we import the relevant modules:

In [1]:
# file system manipulation
import os
# web interaction
import requests
# gzip decompression
import gzip
# csv files/tables manipulation
import pandas as pd

Now we set the web and local paths of the dataset:

In [2]:
# web location of the dataset table
DATASET_URL = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
    '00492/Metro_Interstate_Traffic_Volume.csv.gz')
# location of the local dataset table
DATASET_LOCAL_PATH = 'data/dataset.csv'

We use the following functions to download, unzip and open the dataset .csv table:

In [3]:
def download(url, dst_path):
    '''Downloads a file from given url to destination path.'''
    resp = requests.get(url)
    # fail early on HTTP errors instead of writing an error page to disk
    resp.raise_for_status()
    with open(dst_path, 'wb') as f:
        f.write(resp.content)
        
        
def gunzip(path, dst_path):
    '''Unzips a gzip file from given path to destination path.'''
    with gzip.open(path, 'rb') as f:
        with open(dst_path, 'wb') as dst_f:
            dst_f.write(f.read())


def download_and_unzip_dataset():
    '''Downloads and unzips dataset to be used in project.'''
    # making sure the directory that will hold our file exists
    os.makedirs(os.path.dirname(DATASET_LOCAL_PATH), exist_ok=True)
    # downloading file from internet
    download(DATASET_URL, f'{DATASET_LOCAL_PATH}.gz')
    # unzipping file to final destination
    gunzip(f'{DATASET_LOCAL_PATH}.gz', DATASET_LOCAL_PATH)


def get_dataset():
    '''Gets the dataset to be used in project as a pandas table.'''
    # if the file still does not exist locally, get it from the internet
    if not os.path.isfile(DATASET_LOCAL_PATH):
        print('downloading dataset from '
            f'{DATASET_URL} to {DATASET_LOCAL_PATH}...', end=' ', flush=True)
        download_and_unzip_dataset()
        print('done.\n')

    # load the csv table
    df = pd.read_csv(DATASET_LOCAL_PATH)
    return df
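
The `gunzip` function above reads the whole decompressed file into memory, and `get_dataset` treats any existing file at the local path as a complete dataset. A minimal sketch of an alternative follows. It copies the data in chunks and only moves the result to its final name once it has been fully written. The function name `gunzip_atomic`, the `.part` suffix and the `chunk_size` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import gzip
import os
import shutil


def gunzip_atomic(path, dst_path, chunk_size=1024 * 1024):
    '''Unzips a gzip file to dst_path without loading it all into memory.'''
    # hypothetical temporary name, placed next to the final destination
    tmp_path = f'{dst_path}.part'
    with gzip.open(path, 'rb') as src, open(tmp_path, 'wb') as dst:
        # copy in fixed-size chunks instead of a single src.read()
        shutil.copyfileobj(src, dst, chunk_size)
    # the final file only appears once it has been completely written
    os.replace(tmp_path, dst_path)
```

With this variant, an interrupted run leaves at most a `.part` file behind, so the `os.path.isfile` check in `get_dataset` would not mistake a partial file for the full table.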

We can now get our dataset:

In [4]:
df = get_dataset()

downloading dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz to data/dataset.csv... done.



And we can finally use our dataset as we wish!

In [5]:
print('first rows of dataset:')
print(df.head())
print()

n_rows, n_cols = df.shape[:2]
print(f'the dataset contains {n_rows} rows and {n_cols} columns.')
print(f'dataset columns are: {", ".join(df.columns)}')

first rows of dataset:
  holiday    temp  rain_1h  snow_1h  clouds_all weather_main  \
0    None  288.28      0.0      0.0          40       Clouds   
1    None  289.36      0.0      0.0          75       Clouds   
2    None  289.58      0.0      0.0          90       Clouds   
3    None  290.13      0.0      0.0          90       Clouds   
4    None  291.14      0.0      0.0          75       Clouds   

  weather_description            date_time  traffic_volume  
0    scattered clouds  2012-10-02 09:00:00            5545  
1       broken clouds  2012-10-02 10:00:00            4516  
2     overcast clouds  2012-10-02 11:00:00            4767  
3     overcast clouds  2012-10-02 12:00:00            5026  
4       broken clouds  2012-10-02 13:00:00            4918  

the dataset contains 48204 rows and 9 columns.
dataset columns are: holiday, temp, rain_1h, snow_1h, clouds_all, weather_main, weather_description, date_time, traffic_volume
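
As one example of using the table, the `date_time` column can be parsed into real timestamps when the file is read, since `pd.read_csv` leaves such columns as text unless told otherwise. The sketch below is only an illustration: it inlines two columns from the first rows printed above so that it runs without the downloaded file, and the names `SAMPLE_CSV`, `sample_df` and `hours` are made up for this example.

```python
import io

import pandas as pd

# two columns of the first rows shown above, inlined so the sketch runs offline
SAMPLE_CSV = '''date_time,traffic_volume
2012-10-02 09:00:00,5545
2012-10-02 10:00:00,4516
2012-10-02 11:00:00,4767
'''

# parse_dates converts the listed columns to datetime64 while reading
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV), parse_dates=['date_time'])
# the .dt accessor exposes datetime components such as the hour of day
hours = sample_df['date_time'].dt.hour
print(sample_df.dtypes)
print(hours.tolist())
```

The same `parse_dates` argument could be passed to the `pd.read_csv` call inside `get_dataset` if timestamps are needed for the whole dataset.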
