# Read the most commonly used file formats

## Table of Contents
1. [Read CSV file: VDOT Traffic Volume dataset](#Read-CSV-file)
2. [Read Excel file: Virginia Pesticide Civil Penalty dataset](#Read-Excel-file)
3. [Read tgz file: California Housing dataset](#Read-tgz-file)

## Read CSV file
- File type: CSV
- Dataset: VDOT Traffic Volume
- Average Daily Traffic (ADT) volumes with vehicle classification data for the most recent years on Interstate, Arterial, and Primary routes. It also lists each Interstate and Primary highway segment with the estimated Annual Average Week Day Traffic (AAWDT) for that segment.

In [1]:
import pandas as pd
# Relative file path
vdot_csv_path = './datasets/VDOT_Traffic_Volume.csv'

# load file as a Pandas DataFrame
# low_memory=False: parse the whole file in one pass instead of in chunks, so each
# column's dtype is inferred from all of its values (avoids mixed-type columns, uses more memory)
vdotDf = pd.read_csv(vdot_csv_path, low_memory=False)

# display first 3 rows
print(vdotDf.head(3))

   OBJECTID                 DATA_DATE                 ROUTE_COMMON_NAME  \
0         1  2011-08-03T00:00:00.000Z        SC-2901N (Accomack County)   
1         2  2013-05-15T00:00:00.000Z  SC-1383N (Prince William County)   
2         3  2014-08-05T00:00:00.000Z         SC-2352N (Hanover County)   

           START_LABEL            END_LABEL    ADT ADT_QUALITY  \
0            Bus US 13             Dead End  100.0           R   
1           Cul-de-Sac  76-1279 Longview Dr   60.0           M   
2  42-1685 Daffodil Rd  42-2351 Sydnor Lane  100.0           R   

   PERCENT_4_TIRE  PERCENT_BUS  PERCENT_TRUCK_2_AXLE      ...       \
0             NaN          NaN                   NaN      ...        
1             NaN          NaN                   NaN      ...        
2             NaN          NaN                   NaN      ...        

   CLASS_QUALITY_CODE  AAWDT  AAWDT_QUALITY_CODE      FROM_JURISDICTION  \
0                   X    NaN                   X        Accomack County   
1  

In [2]:
# display information about each column
vdotDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121811 entries, 0 to 121810
Data columns (total 23 columns):
OBJECTID                 121811 non-null int64
DATA_DATE                121811 non-null object
ROUTE_COMMON_NAME        121811 non-null object
START_LABEL              121811 non-null object
END_LABEL                121811 non-null object
ADT                      121803 non-null float64
ADT_QUALITY              121811 non-null object
PERCENT_4_TIRE           25457 non-null float64
PERCENT_BUS              25457 non-null float64
PERCENT_TRUCK_2_AXLE     25457 non-null float64
PERCENT_TRUCK_3_AXLE     25457 non-null float64
PERCENT_TRUCK_1_TRAIL    25457 non-null float64
PERCENT_TRUCK_2_TRAIL    25457 non-null float64
CLASS_QUALITY_CODE       121811 non-null object
AAWDT                    27685 non-null float64
AAWDT_QUALITY_CODE       121796 non-null object
FROM_JURISDICTION        121714 non-null object
TO_JURISDICTION          121646 non-null object
ROUTE_NAME               
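
The `info()` output shows `DATA_DATE` loaded as `object` (plain strings). As a minimal sketch, assuming the column holds ISO 8601 timestamps like the ones printed above, `read_csv` can be told the column types up front through its `dtype` and `parse_dates` parameters. The in-memory sample and the name `sampleDf` are illustrative only and not part of the original notebook; the choice of `category` for `ADT_QUALITY` is likewise an assumption.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV file (illustrative only);
# the values are copied from the first three rows printed above.
sample_csv = io.StringIO(
    'OBJECTID,DATA_DATE,ADT,ADT_QUALITY\n'
    '1,2011-08-03T00:00:00.000Z,100.0,R\n'
    '2,2013-05-15T00:00:00.000Z,60.0,M\n'
    '3,2014-08-05T00:00:00.000Z,100.0,R\n'
)

# Declare the dtypes explicitly instead of letting pandas infer them,
# and ask pandas to parse DATA_DATE into timestamps.
sampleDf = pd.read_csv(
    sample_csv,
    dtype={'OBJECTID': 'int64', 'ADT': 'float64', 'ADT_QUALITY': 'category'},
    parse_dates=['DATA_DATE'],
)

print(sampleDf.dtypes)
```

Declaring types this way is an alternative to `low_memory=False` for avoiding mixed-type columns, and it usually lowers memory use for columns with few distinct values.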

## Read Excel file
- File type: Excel
- Dataset: Virginia Pesticide Civil Penalty
- Information about previous compliance actions against a pesticide business or an individual pesticide applicator.

In [3]:
import pandas as pd
# Relative file path
pesticides_penalty_xls_path = './datasets/pesticides.xls'

# load file as a Pandas DataFrame
pestiDf = pd.read_excel(pesticides_penalty_xls_path)

# display first 3 rows
print(pestiDf.head(3))

                        BUSINESS NAME TRADING AS NAME            ADDRESS  \
0       ACP LANDSCAPING AND LAWN CARE             NaN    58O4 NAVAJO CIR   
1       ACP LANDSCAPING AND LAWN CARE             NaN    58O4 NAVAJO CIR   
2  ADC LAWN CARE & BOBCAT SERVICE INC             NaN  8357 LEESVILLE RD   

         CITY STATE    ZIP APPLICATOR NAME                   VIOLATION  \
0   LYNCHBURG    VA  24502   PUTNEY, ADAM          NO BUSINESS LICENSE   
1   LYNCHBURG    VA  24502   PUTNEY, ADAM                NOT CERTIFIED   
2  HUDDLESTON    VA  24104  HARMAN, ADAM M  SERVICE CONTAINER LABELING   

  VIOLATION DATE STATUS  PENALTY AMT  
0     2016-07-25   PAID          280  
1     2016-07-25   PAID          280  
2     2016-09-19   PAID          280  


In [4]:
# display the shape as (number of rows, number of columns)
print(pestiDf.shape)

(123, 11)
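
`pd.read_excel` relies on a separate reader package: typically `xlrd` for legacy `.xls` files and `openpyxl` for `.xlsx` files. The sketch below, which assumes those two packages are the ones installed, makes the choice explicit through the `engine` parameter. The helpers `excel_engine_for` and `read_first_sheet` and the `EXCEL_ENGINES` mapping are hypothetical names introduced for illustration, not part of the original notebook or of pandas.

```python
import os

import pandas as pd

# Hypothetical mapping (illustration only): file extension -> read_excel engine name.
EXCEL_ENGINES = {'.xls': 'xlrd', '.xlsx': 'openpyxl'}


def excel_engine_for(path):
    """Return the engine name for a path, or None to let pandas pick one."""
    extension = os.path.splitext(path)[1].lower()
    return EXCEL_ENGINES.get(extension)


def read_first_sheet(path):
    """Read the first worksheet of an Excel file into a DataFrame."""
    return pd.read_excel(path, sheet_name=0, engine=excel_engine_for(path))


print(excel_engine_for('./datasets/pesticides.xls'))
```

With the file from this section, the call would be `read_first_sheet('./datasets/pesticides.xls')`. If the matching reader package is missing, pandas raises an `ImportError` that names the package to install.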


## Read tgz file
- File type: tgz
- Dataset: California Housing
- Modified version of the California Housing dataset which was built using the 1990 California census data

In [5]:
import os
import tarfile

import pandas as pd

DATASET_ROOT = './datasets'
housing_tgz_path = os.path.join(DATASET_ROOT, 'housing.tgz')

# open the archive; the with-block closes it automatically, even if extraction fails
with tarfile.open(housing_tgz_path) as housing_tgz:
    # extract the contents of the tgz file into the dataset directory
    housing_tgz.extractall(path=DATASET_ROOT)

# the extracted content is a CSV file; load it as a Pandas DataFrame
housing_csv_path = os.path.join(DATASET_ROOT, 'housing.csv')
housingDf = pd.read_csv(housing_csv_path)

# display first 3 rows
print(housingDf.head(3))

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
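
Extracting the archive to disk is not strictly required. As a rough sketch, a CSV member can be read straight from the archive with `TarFile.extractfile`, which returns a file-like object that `pd.read_csv` accepts. The helper `read_csv_from_tgz` and the tiny `sample.tgz` archive built here are hypothetical, created only so the example runs on its own; the two sample rows reuse the longitude and latitude values printed above.

```python
import io
import os
import tarfile
import tempfile

import pandas as pd


def read_csv_from_tgz(tgz_path, member_name):
    """Load one CSV member of a gzip-compressed tar archive into a DataFrame."""
    with tarfile.open(tgz_path, 'r:gz') as tar:
        # extractfile returns a file-like object, or None for non-file members
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f'{member_name} is not a regular file in the archive')
        return pd.read_csv(member)


# Build a small throwaway archive so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_tgz_path = os.path.join(tmp_dir, 'sample.tgz')
    payload = b'longitude,latitude\n-122.23,37.88\n-122.22,37.86\n'

    info = tarfile.TarInfo(name='sample.csv')
    info.size = len(payload)
    with tarfile.open(sample_tgz_path, 'w:gz') as tar:
        tar.addfile(info, io.BytesIO(payload))

    # Read the CSV member without extracting it to disk.
    sampleDf = read_csv_from_tgz(sample_tgz_path, 'sample.csv')

print(sampleDf.shape)
```

For the dataset in this section the equivalent call would be `read_csv_from_tgz(housing_tgz_path, 'housing.csv')`, assuming the member is stored under that name at the top level of the archive.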
