## Importing libraries & packages
Import statements typically appear at the top of the file.
* `import <package_name>` is the most basic form.
* A package can be imported under an alias to keep code concise. Common packages often have a conventional alias.
<blockquote>

```python
import pandas
pandas.read_csv(path)

# Versus importing with an alias

import pandas as pd
pd.read_csv(path)
```
</blockquote>
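
The import forms used in this notebook can be compared side by side. The sketch below sticks to standard-library modules so it runs without any extra installs; the alias `stats` is only an illustrative choice for this example, not an established convention.

```python
# Three common import forms, shown with standard-library modules only
import math                   # plain import: access names as math.<name>
import statistics as stats    # aliased import ('stats' is an arbitrary alias for this sketch)
from pathlib import Path      # import a single name directly from a package

values = [1.0, 2.0, 3.0, 6.0]

root = math.sqrt(16)                            # 4.0
average = stats.mean(values)                    # 3.0
suffix = Path('data/raw/datafile.csv').suffix   # '.csv'

print(root, average, suffix)
```

The `from ... import ...` form is the one used for `Path` in the cell below.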


In [8]:
import os  # Import standard library operating system package that deals with system directory interfaces
from pathlib import Path  # Import filesystem path package, for easier pathing to files and outputs
import pandas as pd  # Import pandas library as an alias of 'pd'
import matplotlib.pyplot as plt  # Import the sub-package pyplot from the matplotlib library as an alias of 'plt'

# Magic command for jupyter notebook to generate figures within the notebook
%matplotlib inline

## Data
Today we will be using Providence, RI air quality data to demonstrate exploratory data analysis techniques.

The Rhode Island Department of Environmental Management (RIDEM) and Rhode Island Department of Health (RIDOH) collect air quality data at several sites across Rhode Island. We will be examining data from the Community College of Rhode Island (CCRI) Liston Campus site.

* The CCRI site is part of the EPA's *State or Local Air Monitoring Stations* (SLAMS) and *National Air Toxics Trends Sites* (NATTS) networks.
* A variety of air pollutants (particulate matter (PM), volatile organic compounds (VOCs), polycyclic aromatic hydrocarbons (PAHs), carbonyls, black carbon) have been monitored at this site since 2005.
* The data was obtained from the Environmental Protection Agency (EPA) [Air Quality Data website](https://www.epa.gov/outdoor-air-quality-data).
<div>
<img src="images/aq-site-info.png" width="400"/>
</div>

We will use a subset of this data in the demonstrations below and give you a chance to work with a larger dataset during the hands-on lab.


*Links*
* [EPA Air Quality Data Interactive Map](https://www.epa.gov/outdoor-air-quality-data/interactive-map-air-quality-monitors) - Data source
* [RIDEM 2022 Annual Monitoring Report](https://dem.ri.gov/sites/g/files/xkgbur861/files/2023-01/airnet22.pdf) - More information about the site and other monitoring locations across the state.



## Importing Tabular Data
The pandas package reads tabular data into a data structure called a `DataFrame`. Some examples of read functions are below:

* [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) - Comma-delimited or other delimited files
* [`pd.read_fwf`](https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html#) - Fixed width files
* [`pd.read_excel`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) - Microsoft Excel files
* [`pd.read_sql`](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) - SQL query or database table
* See [pandas I/O documentation](https://pandas.pydata.org/docs/reference/io.html#input-output) for more examples

Today we will be working with `pd.read_csv()`. While this function reads comma-delimited files by default, it can be used on any delimited text file if the separator is provided as a keyword argument. Let's take a look at the online documentation for this function: [`pd.read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

At the top is the function call signature:
> `pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', ...)`
* This demonstrates how to use the function in code with all the available arguments.
* There are two types of arguments: *Positional* and *Keyword*
    1. **Positional arguments** are listed first. They are required and need to be specified in order. In this example there is only one, `filepath_or_buffer`.
    2. **Keyword arguments** are listed after positional arguments and are optional. They have an `=` after the name to denote default values.

    Positional arguments do not need to be specified by name, while keyword arguments must be specified by name. In this signature, the bare `*` marks that every argument after it can only be passed by name.
    ```python
    # Both of these calls are acceptable
    pd.read_csv('data/raw/datafile.csv', sep=',')
    pd.read_csv(filepath_or_buffer='data/raw/datafile.csv', sep=',')
    ```
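
To try both calling styles without a file on disk, the sketch below reads a few lines of invented semicolon-delimited text from memory. The text, the column names, and the use of `io.StringIO` as a stand-in for a file are assumptions made for illustration only.

```python
import io
import pandas as pd

# Invented semicolon-delimited text standing in for a file on disk
raw_text = (
    "site;parameter;value\n"
    "CCRI;Outdoor Temperature;48.9\n"
    "CCRI;Relative Humidity;67.4\n"
)

# First argument by position, separator by keyword
df_by_position = pd.read_csv(io.StringIO(raw_text), sep=';')

# First argument by name; the result is the same
df_by_name = pd.read_csv(filepath_or_buffer=io.StringIO(raw_text), sep=';')

# Leaving sep at its default (a comma) reads each line as a single column
df_default = pd.read_csv(io.StringIO(raw_text))

print(df_by_position.equals(df_by_name))
print(df_by_position.shape, df_default.shape)
```

Because this text contains no commas, the default call still succeeds but yields one wide column, which is a common sign that the wrong separator was used.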

Let's read in a data file. First we will set the path to the data.

In [21]:
# Setting up the path to the data directory
## We get our current working directory
cwd = os.getcwd()
print(f'The current working directory is where this notebook is located: {cwd}')

## We use the Path library to move up one level to the top level of the project
project_top = Path(cwd).parents[0]
print(f'This is the top level of the project: {project_top}')

## We extend the path to the monthly data directory
path_to_monthly_data = project_top.joinpath('data','raw','monthly')
### Alternative syntax for extending path...
path_to_monthly_data = project_top / 'data' / 'raw' / 'monthly'
print(f'This is the monthly data directory: {path_to_monthly_data}')

The current working directory is where this notebook is located: /Users/gdang2/repos/ccv-bootcamp-python-2023/notebooks
This is the top level of the project: /Users/gdang2/repos/ccv-bootcamp-python-2023
This is the monthly data directory: /Users/gdang2/repos/ccv-bootcamp-python-2023/data/raw/monthly
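
pathlib can also supply the working directory itself, which avoids the round trip through `os.getcwd()`. The following is a small sketch of equivalent spellings; the `data/raw/monthly` layout is taken from the cell above, and nothing here checks that the directory actually exists on your machine.

```python
import os
from pathlib import Path

# Two ways to get the current working directory as a Path object
cwd_from_os = Path(os.getcwd())
cwd_from_pathlib = Path.cwd()

# .parent moves up one level, like .parents[0] in the cell above
project_top = cwd_from_pathlib.parent

# joinpath() and the / operator build the same path
monthly_a = project_top.joinpath('data', 'raw', 'monthly')
monthly_b = project_top / 'data' / 'raw' / 'monthly'

print(cwd_from_os == cwd_from_pathlib)
print(monthly_a == monthly_b)
```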


<div class="alert alert-block alert-info">
Note: Using pathlib and its `Path` object is highly recommended because it standardizes path handling across operating systems. Path separators differ between Unix (Mac & Linux) and Windows operating systems: Unix uses `/` whereas Windows uses `\`. Avoiding hardcoded paths makes your code more portable.
</div>
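
The separator difference can be seen on any machine with pathlib's "pure" path classes, which format paths for a given operating system without touching the filesystem. This is a minimal sketch; the path components are just the directory names used in this notebook.

```python
from pathlib import PurePosixPath, PureWindowsPath

parts = ('data', 'raw', 'monthly')

# The same components rendered with each operating system's separator
posix_path = PurePosixPath(*parts)
windows_path = PureWindowsPath(*parts)

print(posix_path)    # data/raw/monthly
print(windows_path)  # data\raw\monthly
```

A plain `Path` picks the right flavor for the machine the code runs on, which is why the notebook never writes a separator by hand.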

Let's read in one file. This will be data from the first month of 2022.

In [34]:
# Save the DataFrame object to a variable
df_2022_01 = pd.read_csv(path_to_monthly_data / 'daily_44_007_0022_2022_01.csv')
df_2022_01.head(10) # show first 10 records

Unnamed: 0,State Code,County Code,Site Number,Parameter Code,POC,Latitude,Longitude,Datum,Parameter Name,Duration Description,...,AQI,Daily Criteria Indicator,Tribe Name,State Name,County Name,City Name,Local Site Name,Address,MSA or CBSA Name,Data Source
0,44,7,22,87101,1,41.807469,-71.412968,NAD83,"Particle Number, Total Count",1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
1,44,7,22,61107,1,41.807469,-71.412968,NAD83,Std Dev Vt Wind Direction,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
2,44,7,22,62101,1,41.807469,-71.412968,NAD83,Outdoor Temperature,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
3,44,7,22,61104,1,41.807469,-71.412968,NAD83,Wind Direction - Resultant,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
4,44,7,22,84313,1,41.807469,-71.412968,NAD83,Black carbon PM2.5 STP,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
5,44,7,22,88101,3,41.807469,-71.412968,NAD83,PM2.5 - Local Conditions,24-HR BLK AVG,...,30,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
6,44,7,22,88101,3,41.807469,-71.412968,NAD83,PM2.5 - Local Conditions,24-HR BLK AVG,...,30,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
7,44,7,22,88101,3,41.807469,-71.412968,NAD83,PM2.5 - Local Conditions,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
8,44,7,22,62201,1,41.807469,-71.412968,NAD83,Relative Humidity,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
9,44,7,22,84313,1,41.807469,-71.412968,NAD83,Black carbon PM2.5 STP,1 HOUR,...,.,Y,,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart


## Inspecting DataFrames
Here are some useful attributes of a DataFrame:
* `.shape`:  Table dimensions
* `.columns`:  Sequence of columns
* `.index`:  Sequence of row indexes/labels
* `.dtypes`: Data types by column

Here are a few useful methods to inspect a dataframe:
* `.head()`: Shows the first 5 rows; pass an integer to change the number.
* `.tail()`: Shows the last 5 rows; pass an integer to change the number.
* `.info()`: Combines several DataFrame attributes into one report.
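
The attributes and methods listed above can be tried on a tiny table before applying them to the real data. The sketch below builds a made-up three-row DataFrame; the column names mirror the air quality file, but the values are invented.

```python
import pandas as pd

# Small made-up table; column names mirror the air quality data, values are invented
df = pd.DataFrame({
    'Parameter Name': ['Outdoor Temperature', 'Relative Humidity', 'Wind Speed - Resultant'],
    'Observation Count': [24, 24, 24],
    'Arithmetic Mean': [48.9, 67.4, 2.1],
})

print(df.shape)          # (rows, columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(df.dtypes)         # one dtype per column
print(df.head(2))        # first two rows
print(df.tail(1))        # last row
df.info()                # summary report printed to the screen
```

Note that attributes such as `.shape` are written without parentheses, while methods such as `.head()` are called with them.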


In [44]:
df_2022_01.shape

(743, 34)

In [46]:
df_2022_01.columns

Index(['State Code', 'County Code', 'Site Number', 'Parameter Code', 'POC',
       'Latitude', 'Longitude', 'Datum', 'Parameter Name',
       'Duration Description', 'Pollutant Standard', 'Date (Local)', 'Year',
       'Day In Year (Local)', 'Units of Measure', 'Exceptional Data Type',
       'Nonreg Observation Count', 'Observation Count', 'Observation Percent',
       'Nonreg Arithmetic Mean', 'Arithmetic Mean',
       'Nonreg First Maximum Value', 'First Maximum Value',
       'First Maximum Hour', 'AQI', 'Daily Criteria Indicator', 'Tribe Name',
       'State Name', 'County Name', 'City Name', 'Local Site Name', 'Address',
       'MSA or CBSA Name', 'Data Source'],
      dtype='object')

In [36]:
df_2022_01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 743 entries, 0 to 742
Data columns (total 34 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   State Code                  743 non-null    int64  
 1   County Code                 743 non-null    int64  
 2   Site Number                 743 non-null    int64  
 3   Parameter Code              743 non-null    int64  
 4   POC                         743 non-null    int64  
 5   Latitude                    743 non-null    float64
 6   Longitude                   743 non-null    float64
 7   Datum                       743 non-null    object 
 8   Parameter Name              743 non-null    object 
 9   Duration Description        743 non-null    object 
 10  Pollutant Standard          72 non-null     object 
 11  Date (Local)                743 non-null    object 
 12  Year                        743 non-null    int64  
 13  Day In Year (Local)         743 non

<div class="alert alert-block alert-warning">
You might be asking: What is an "object" dtype?

Short Answer: It is a column of string or mixed data types (e.g. string, ints, floats, etc).

Long Answer: Pandas is built on top of the NumPy package. A NumPy array stores values that are each encoded in the same number of bytes. Because strings can be of variable length, they do not conform to this fixed-size requirement. Instead, pandas creates an object array of pointers to the strings, and the pointers are all the same size. The same applies to columns with a mixture of data types.
</div>
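
The fixed-size point can be checked directly in NumPy. The sketch below is illustrative only and reuses two strings from the table above. Depending on the pandas version, a column containing only strings may be reported with a dedicated string dtype rather than `object`, so the pandas part of the example uses a mixed column, which is an assumption chosen to keep the result consistent.

```python
import numpy as np
import pandas as pd

# A NumPy string array pads every element to the length of the longest string
fixed_width = np.array(['NAD83', 'Providence-Warwick, RI-MA'])
print(fixed_width.dtype)           # a fixed-width unicode dtype
print(fixed_width.dtype.itemsize)  # bytes reserved per element, even for 'NAD83'

# An object array instead stores references to ordinary Python objects
as_objects = np.array(['NAD83', 'Providence-Warwick, RI-MA'], dtype=object)
print(as_objects.dtype)            # object

# A pandas column mixing strings, ints, and floats is stored as object
mixed = pd.Series(['.', 30, 7062.2])
print(mixed.dtype)                 # object
```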

In [47]:
# Inspect Numerical Fields
df_2022_01.select_dtypes(include=['int','float'])

Unnamed: 0,State Code,County Code,Site Number,Parameter Code,POC,Latitude,Longitude,Year,Day In Year (Local),Exceptional Data Type,Nonreg Observation Count,Observation Count,Observation Percent,Nonreg Arithmetic Mean,Arithmetic Mean,Nonreg First Maximum Value,First Maximum Value,First Maximum Hour,Tribe Name
0,44,7,22,87101,1,41.807469,-71.412968,2022,1,,,24,100.0,,7062.208333,,14300.00,17,
1,44,7,22,61107,1,41.807469,-71.412968,2022,1,,,24,100.0,,17.166667,,25.00,7,
2,44,7,22,62101,1,41.807469,-71.412968,2022,1,,,24,100.0,,48.958333,,54.00,15,
3,44,7,22,61104,1,41.807469,-71.412968,2022,1,,,24,100.0,,140.791667,,195.00,15,
4,44,7,22,84313,1,41.807469,-71.412968,2022,1,,,24,100.0,,0.458333,,1.25,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
738,44,7,22,62201,1,41.807469,-71.412968,2022,31,,,24,100.0,,67.416667,,95.00,5,
739,44,7,22,84313,1,41.807469,-71.412968,2022,31,,,24,100.0,,1.300833,,3.76,22,
740,44,7,22,61107,1,41.807469,-71.412968,2022,31,,,24,100.0,,16.583333,,32.00,13,
741,44,7,22,61103,1,41.807469,-71.412968,2022,31,,,24,100.0,,2.141667,,4.30,14,


In [50]:
# Inspect Object fields
df_2022_01.select_dtypes(include='object')

Unnamed: 0,Datum,Parameter Name,Duration Description,Pollutant Standard,Date (Local),Units of Measure,AQI,Daily Criteria Indicator,State Name,County Name,City Name,Local Site Name,Address,MSA or CBSA Name,Data Source
0,NAD83,"Particle Number, Total Count",1 HOUR,,2022-01-01,Count per cm^3,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
1,NAD83,Std Dev Vt Wind Direction,1 HOUR,,2022-01-01,Degrees Compass,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
2,NAD83,Outdoor Temperature,1 HOUR,,2022-01-01,Degrees Fahrenheit,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
3,NAD83,Wind Direction - Resultant,1 HOUR,,2022-01-01,Degrees Compass,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
4,NAD83,Black carbon PM2.5 STP,1 HOUR,,2022-01-01,Micrograms/cubic meter (25 C),.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
738,NAD83,Relative Humidity,1 HOUR,,2022-01-31,Percent relative humidity,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
739,NAD83,Black carbon PM2.5 STP,1 HOUR,,2022-01-31,Micrograms/cubic meter (25 C),.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
740,NAD83,Std Dev Vt Wind Direction,1 HOUR,,2022-01-31,Degrees Compass,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
741,NAD83,Wind Speed - Resultant,1 HOUR,,2022-01-31,Knots,.,Y,Rhode Island,Providence,Providence,CCRI Liston Campus ROOFTOP,"1 Hilton St, PROVIDENCE RI","Providence-Warwick, RI-MA",AQS Data Mart
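
In the output above, `AQI` is listed with the object fields because some rows hold a `.` placeholder rather than a number. The sketch below uses an invented three-row table to show how such a column is left out of a numeric selection, and one possible way to convert it. Selecting with `include='number'` and converting with `pd.to_numeric(..., errors='coerce')` are choices made for this illustration, not steps taken from the notebook.

```python
import pandas as pd

# Invented rows; 'AQI' mixes a '.' placeholder with a numeric string
df = pd.DataFrame({
    'Site Number': [22, 22, 22],
    'Arithmetic Mean': [7062.2, 17.2, 9.5],
    'AQI': ['.', '.', '30'],
})

# 'number' selects every integer and float column in one go
numeric_fields = df.select_dtypes(include='number')
print(list(numeric_fields.columns))  # ['Site Number', 'Arithmetic Mean']

# Values that cannot be parsed as numbers become NaN
aqi_numeric = pd.to_numeric(df['AQI'], errors='coerce')
print(aqi_numeric.tolist())
```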
