# Analyzing data with Python


In [1]:
import pandas as pd
#import numpy as np
from tqdm import tqdm

## Downloading Dataset

Frequently used datasets are hosted by university research centers, companies, data science platforms, and similar organizations. Most of these datasets are publicly accessible. Python offers several ways to download files available on the internet.

### UC Irvine Dataset Archive

A large collection of educational datasets is hosted by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/datasets). The repository provides a PyPI package, `ucimlrepo`, for pulling the datasets it hosts.

In [2]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
You should consider upgrading via the '/data/user/0/ru.iiec.pydroid3/files/aarch64-linux-android/bin/python3.9 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [5]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
automobile = fetch_ucirepo(id=10) 
  
# data (as pandas dataframes) 
dfX = automobile.data.features 
dfy = automobile.data.targets 
  
# metadata 
print(automobile.metadata) 
  
# variable information 
print(automobile.variables) 

{'uci_id': 10, 'name': 'Automobile', 'repository_url': 'https://archive.ics.uci.edu/dataset/10/automobile', 'data_url': 'https://archive.ics.uci.edu/static/public/10/data.csv', 'abstract': "From 1985 Ward's Automotive Yearbook", 'area': 'Other', 'tasks': ['Regression'], 'characteristics': ['Multivariate'], 'num_instances': 205, 'num_features': 25, 'feature_types': ['Categorical', 'Integer', 'Real'], 'demographics': [], 'target_col': ['symboling'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1985, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5B01C', 'creators': ['Jeffrey Schlimmer'], 'intro_paper': None, 'additional_info': {'summary': 'This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, (c) its normalized losses in use as compared to other cars.  The second rating corresponds to the degree to which th

In [6]:
dfX.head(5)

Unnamed: 0,price,highway-mpg,city-mpg,peak-rpm,horsepower,compression-ratio,stroke,bore,fuel-system,engine-size,...,length,wheel-base,engine-location,drive-wheels,body-style,num-of-doors,aspiration,fuel-type,make,normalized-losses
0,13495.0,27,21,5000.0,111.0,9.0,2.68,3.47,mpfi,130,...,168.8,88.6,front,rwd,convertible,2.0,std,gas,alfa-romero,
1,16500.0,27,21,5000.0,111.0,9.0,2.68,3.47,mpfi,130,...,168.8,88.6,front,rwd,convertible,2.0,std,gas,alfa-romero,
2,16500.0,26,19,5000.0,154.0,9.0,3.47,2.68,mpfi,152,...,171.2,94.5,front,rwd,hatchback,2.0,std,gas,alfa-romero,
3,13950.0,30,24,5500.0,102.0,10.0,3.4,3.19,mpfi,109,...,176.6,99.8,front,fwd,sedan,4.0,std,gas,audi,164.0
4,17450.0,22,18,5500.0,115.0,8.0,3.4,3.19,mpfi,136,...,176.6,99.4,front,4wd,sedan,4.0,std,gas,audi,164.0


### Fetching Directly with pandas

Alternatively, pandas can load data itself through its family of reader functions, several of which accept a direct URL to the file:
- `read_csv()`
- `read_json()`
- `read_html()`
- `read_sql()`
- `read_pickle()`

Similarly, a dataframe can be written to these formats with:
- `to_csv()`
- `to_json()`
- `to_html()`
- `to_pickle()`
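
As a minimal sketch of how the reader and writer functions pair up, the example below round-trips a small dataframe through CSV and JSON using in-memory buffers instead of files. The sample values are hypothetical and chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950.0, 16430.0]})

# Write to an in-memory CSV buffer instead of a file on disk
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)  # index=False omits the row index column

# Read it back; read_csv accepts a path, a URL, or a file-like object
csv_buffer.seek(0)
df_from_csv = pd.read_csv(csv_buffer)

# The same round trip through JSON, one record per row
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_from_csv)
print(df_from_json)
```

Passing `index=False` when writing avoids an extra unnamed index column appearing when the file is read back.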


In [None]:
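# imports-85.names is a free-text description, not a CSV header row, so the
# column names (taken from that description) are listed explicitly here.
url_data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors',
           'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width',
           'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system',
           'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
           'highway-mpg', 'price']
df_cars = pd.read_csv(url_data, header=None, names=columns, na_values='?')  # '?' marks missing values
df_cars.to_csv('autos_85.csv', index=False)  # skip the row index when saving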

### Using urllib

We can use `urllib` from the standard library to download the CSV file and save it locally for future use:

In [2]:
url = 'https://archive.ics.uci.edu/static/public/10/data.csv'
filename = 'automobiles.csv'

In [9]:
from urllib.request import urlretrieve
path, headers = urlretrieve(url, filename)

('autos_85.csv', <http.client.HTTPMessage at 0x731198ffd0>)

In [None]:
# headers is an http.client.HTTPMessage; items() yields (field, value) pairs
for field, value in headers.items():
    print(f'{field}: {value}')

In [3]:
%ls *.csv


autos_85.csv


### Using Third-party packages

There are quite a few third-party packages for reliably fetching remote files, both synchronously and asynchronously.

`requests` is a well-known package for data extraction. It provides a uniform interface for fetching data synchronously over HTTP.
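
For a small file, a synchronous fetch with `requests` might look like the sketch below. The helper name `fetch_csv` and the 30-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import io

import pandas as pd
import requests


def fetch_csv(url, timeout=30):
    """Download a CSV file over HTTP and return it as a DataFrame.

    `fetch_csv` is a hypothetical helper name used for illustration.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    # response.text holds the decoded body; wrap it so pandas can read it like a file
    return pd.read_csv(io.StringIO(response.text))
```

Assuming network access, calling `fetch_csv(url)` with the UCI `data.csv` link defined earlier would return the automobile data as a dataframe. This approach holds the whole response in memory, so it suits small files only.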

Several asynchronous HTTP request packages are also available.

**Downloading a Large File in a Streaming Fashion**

If your project requires downloading a larger file, you may run into issues trying to load the entire file into memory. To avoid this, we can download large files in a streaming fashion with `requests.Response.iter_content`:

In [3]:
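import requests
def stream_fetch(filename, path):
    """Stream the file at URL `path` into local file `filename`, showing a progress bar."""
    # stream=True defers downloading the body until iter_content() is consumed
    with requests.get(path, stream=True, timeout=30) as response:
        response.raise_for_status()  # fail early on HTTP errors (4xx/5xx)
        total_size_in_bytes = int(response.headers.get('content-length', 0))
        block_size = 1024  # 1 kibibyte
        progress_bar = tqdm(total=total_size_in_bytes, unit='iB', unit_scale=True)
        with open(filename, mode="wb") as file:
            for chunk in response.iter_content(chunk_size=block_size):
                progress_bar.update(len(chunk))
                file.write(chunk)
        progress_bar.close()
    # The size check is only possible when the server sent a Content-Length header
    if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
        print("ERROR, something went wrong")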

In [4]:
stream_fetch(filename, url)

24.1kiB [00:00, 98.3kiB/s]


In [5]:
%ls -lh

total 176K
-rw-rw----    1 u0_a201  media_rw  101.1K Oct 19 18:27 DA0101EN-Review-Introduction.jupyterlite.ipynb
-rw-rw----    1 u0_a201  media_rw   23.5K Oct 23 15:30 automobiles.csv
-rw-rw----    1 u0_a201  media_rw   23.5K Oct 19 17:18 autos_85.csv
-rw-rw----    1 u0_a201  media_rw   23.1K Oct 23 15:30 ibm-analyze-data.ipynb
