In [None]:
import sys
import string
import numpy as np 
from datascience import *
from datetime import date 

## Exploratory Data Analysis 2 


Now that you have a general idea of the key variables in this table, we will use some of the table and visualization methods from the [Python Reference](https://www.data8.org/sp23/reference/) to analyze any underlying trends in the data. 

The `exploratorium_emissions` table displays information about CO and CO2 emissions at the Exploratorium on various days.
It consists of the following columns, in order:

- `node_id`: Each node is assigned an identification number (the Exploratorium has id 48!)
- `local_timestamp`: Pacific Time
- `datetime`: Coordinated Universal Time (UTC)
- `CO2_ppm`: CO2 emissions in parts per million (ppm) adjusted for standard temperature and pressure
- `CO2_QC_level`: quality control level of the CO2 record
- `PM_ug/m3`: RH-corrected PM2.5 concentrations from the Plantower instrument
- `PM_QC_level`: quality control level of the PM record
- `CO_ppm`: CO emissions in parts per million (ppm)
- `CO_ppm_QC_level`: quality control level of the CO record

In [None]:
exploratorium_emissions = Table.read_table('exploratorium_bays.csv')

In [None]:
exploratorium_emissions

Looking at the `co2_ppm` column, the CO2 readings at the Exploratorium do not change much. The `co_ppm` column, which records CO emissions, shows a much wider spread! Let's focus on trends in CO emissions and on the times and dates at which this data was collected at the Exploratorium Bay. You can ignore the code below; it just cleans up the table above.

In [None]:
co_emissions = exploratorium_emissions.drop(2, 3, 4, 6, 7, 8, 9, 10).sort('co_ppm', descending=True)

#Splitting dates and times for local timestamp
date_time = [x.split(' ') for x in co_emissions.column("local_timestamp")]
dates = np.array([item[0] for item in date_time])
times = np.array([item[1] for item in date_time])
table1 = Table().with_columns("local_timestamp", co_emissions.column("local_timestamp"),'Date', dates,  'Time', times)

#Splitting dates and times for UTC timestamp
date_time1 = [x.split(' ') for x in co_emissions.column("datetime")]
dates1 = np.array([item[0] for item in date_time1])
times1 = np.array([item[1] for item in date_time1])
#'Date1' reuses the local dates so that it matches 'Date' in the join below;
#'Time1' holds the UTC times (times1) rather than repeating the Pacific times
table2 = Table().with_columns("datetime", co_emissions.column("datetime"), 'Date1', dates, 'Time1', times1)

#joining two tables by date, since columns are the exact same
tables_merge = table1.join('Date', table2, 'Date1')

#Creating new emissions table, with separated dates and times 
co_emissions = co_emissions.join('local_timestamp', tables_merge).drop('local_timestamp', 'datetime_2', 'datetime')
emissions = co_emissions.column('co_ppm')
co_emissions = Table().with_columns('Date', co_emissions.column(1), 'Pacific Time', co_emissions.column(2), 'Universal Time', co_emissions.column(3), 'CO_Emissions', emissions)

In [None]:
co_emissions.show()

Ignore this code below. It cleans the data frame a little more, and makes it more understandable.

We now have a table showing the carbon monoxide emissions at different times and dates at the Exploratorium Bay. Here are our key features:

- `local_timestamp`: Pacific Time
- `datetime`: Coordinated Universal Time (UTC)
- `CO_ppm`: CO emissions in parts per million (ppm)
- `CO2_ppm`: CO2 emissions in parts per million (ppm)
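
The cleaning cell earlier splits each timestamp string into a date part and a time part. Here is a minimal sketch of that step in plain Python, using made-up timestamps in an assumed `YYYY-MM-DD HH:MM:SS` format (the actual values in `exploratorium_bays.csv` may be formatted differently, and `split_timestamp` is a hypothetical helper, not part of the `datascience` library):

```python
# Hypothetical timestamps in an assumed "YYYY-MM-DD HH:MM:SS" format;
# the real values in exploratorium_bays.csv may look different.
sample_timestamps = [
    "2022-03-01 08:15:00",
    "2022-03-01 17:40:00",
    "2022-03-02 09:05:00",
]

def split_timestamp(stamp):
    """Split a 'date time' string on its first space and return (date, time)."""
    date_part, time_part = stamp.split(" ", 1)
    return date_part, time_part

# Build one list of dates and one list of times, in the original row order.
pairs = [split_timestamp(s) for s in sample_timestamps]
dates = [d for d, _ in pairs]
times = [t for _, t in pairs]

print(dates)
print(times)
```

Keeping the two lists in the same order as the original rows is what lets them be attached back to the table as new columns.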

We have our table, so now let's get to visualizing! In this section, we will use two key visualizations:
1. Scatter plot - we will use the [`.scatter()`](https://www.data8.org/sp23/reference/) method
2. Histogram - we will use the [`.hist()`](https://www.data8.org/sp23/reference/) method

<div class="alert alert-info">
You may be wondering how to choose between visualizations when representing your data. For the two visualizations mentioned above, follow these guidelines: 

   1. Use **scatter plots** to visualize two numerical variables that are not sequential, especially when you are looking for an association between them
   <br>
   2. Use **histograms** to visualize the distribution of a **single *numerical variable***
   <br>
    
   3. Optionally, although we won't be using it in this exercise, you can use a **bar chart** (using [`.barh()`](https://www.data8.org/sp23/reference/)) to visualize the distribution of a **single *categorical variable***
   <br>
</div>

create scatter plots 

<div class="alert alert-info">
When reading the scatter plot, take note of the following:

   1. Which variable is on the x-axis? Which is on the y-axis?
   2. Are the variables scaled differently?
   3. Is there any association between the two variables? (positive, negative, no association)
</div>
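
One way to back up a visual impression of association is to compute the correlation coefficient `r`. The sketch below is only an illustration: the hour and CO values are made up, `describe_association` and its `0.1` cutoff are arbitrary choices for this example, and `r` only measures *linear* association, so it does not replace looking at the plot.

```python
import math

def standard_units(values):
    """Convert values to standard units: (value - mean) / SD."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Correlation coefficient r: the mean of the products in standard units."""
    xs_su = standard_units(xs)
    ys_su = standard_units(ys)
    return sum(x * y for x, y in zip(xs_su, ys_su)) / len(xs)

def describe_association(r, threshold=0.1):
    """Label r as positive, negative, or none (threshold is an arbitrary cutoff)."""
    if r > threshold:
        return "positive"
    if r < -threshold:
        return "negative"
    return "no association"

# Made-up data for illustration only: hours since midnight and CO in ppm.
hours = [0, 1, 2, 3, 4]
co_ppm = [0.2, 0.3, 0.5, 0.6, 0.9]

r = correlation(hours, co_ppm)
print(round(r, 3), describe_association(r))
```

A value of `r` close to 1 or -1 suggests the points cluster tightly around a line, while a value near 0 suggests little or no linear association.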

create histograms 

<div class="alert alert-info">
When reading the histogram, take note of the following:
    
   1. The horizontal axis will **always** be numerical: drawn to scale, no gaps
   2. The area of each bar is proportional to the percent of individuals in the entire sample that fall in that bin
</div>
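
The area principle above can be checked by hand: if a bar's area is the percent of values in its bin, then its height is that percent divided by the bin width ("percent per unit"). The sketch below uses made-up CO values and arbitrary bin edges, and `histogram_heights` is a hypothetical helper written for this example rather than a `datascience` method.

```python
def histogram_heights(values, bin_edges):
    """Return the percent-per-unit height of each bin [left, right)."""
    n = len(values)
    heights = []
    for left, right in zip(bin_edges[:-1], bin_edges[1:]):
        count = sum(1 for v in values if left <= v < right)
        percent = 100 * count / n                # area of the bar
        heights.append(percent / (right - left)) # height = area / width
    return heights

# Made-up CO readings (ppm) and unequal bins, for illustration only.
co_values = [0.1, 0.2, 0.25, 0.4, 0.6, 0.7, 0.9, 1.5]
bins = [0, 0.5, 1.0, 2.0]

heights = histogram_heights(co_values, bins)
widths = [right - left for left, right in zip(bins[:-1], bins[1:])]
areas = [h * w for h, w in zip(heights, widths)]

print(heights)     # height of each bar, in percent per ppm
print(sum(areas))  # the bar areas together account for the whole sample
```

Note that the widest bin ends up with the shortest bar even though it is not empty: with unequal bins, compare areas rather than heights.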

further analysis, answer questions about trends 