In [49]:
from datascience import *

## Exploratory Data Analysis 2 


Now that you have a general idea of the key variables in this table, we will use some of the table and visualization methods from the [Python Reference](https://www.data8.org/sp23/reference/) to analyze any underlying trends in the data. 

The `exploratorium_emissions` table contains CO and CO2 measurements recorded at the Exploratorium on various days.
It includes the following columns:

- `local_timestamp`: Pacific Time
- `datetime`: Coordinated Universal Time (UTC)
- `node_id`: Each node is assigned an identification number (the Exploratorium has id 48!)
- `co_ppm`: CO emissions in parts per million (ppm)
- `co_ppm_QC_level`: quality control level of the CO record
- `co2_ppm`: CO2 emissions in parts per million (ppm), adjusted for standard temperature and pressure
- `co2_ppm_QC_level`: quality control level of the CO2 record
- `PM_2.5_ug/m3`: RH-corrected PM_2.5 concentrations from the Plantower instrument
- `PM_2.5_ug/m3_QC_level`: quality control level of the PM record

The table also has two additional time columns, `epoch` and `julian_day`, which we will not use.

In [50]:
exploratorium_emissions = Table.read_table('exploratorium_bays.csv')

In [51]:
exploratorium_emissions.show()

local_timestamp,datetime,node_id,epoch,julian_day,co_ppm,co_ppm_QC_level,co2_ppm,co2_ppm_QC_level,PM_2.5_ug/m3,PM_2.5_ug/m3_QC_level
2022-01-01 23:00:00,2022-01-02 07:00:00,48,1641110000.0,2.29167,0.24,2,-999.0,2,-999.0,-999
2022-01-02 01:00:00,2022-01-02 09:00:00,48,1641110000.0,2.375,0.21,2,-999.0,2,-999.0,-999
2022-01-02 02:00:00,2022-01-02 10:00:00,48,1641120000.0,2.41667,0.2,2,-999.0,2,-999.0,-999
2022-01-02 03:00:00,2022-01-02 11:00:00,48,1641120000.0,2.45833,0.18,2,-999.0,2,-999.0,-999
2022-01-02 04:00:00,2022-01-02 12:00:00,48,1641120000.0,2.5,0.17,2,-999.0,2,-999.0,-999
2022-01-02 05:00:00,2022-01-02 13:00:00,48,1641130000.0,2.54167,0.17,2,-999.0,2,-999.0,-999
2022-01-02 06:00:00,2022-01-02 14:00:00,48,1641130000.0,2.58333,0.17,2,-999.0,2,-999.0,-999
2022-01-02 07:00:00,2022-01-02 15:00:00,48,1641140000.0,2.625,0.19,2,-999.0,2,-999.0,-999
2022-01-02 08:00:00,2022-01-02 16:00:00,48,1641140000.0,2.66667,0.2,2,-999.0,2,-999.0,-999
2022-01-02 09:00:00,2022-01-02 17:00:00,48,1641140000.0,2.70833,0.2,2,-999.0,2,-999.0,-999


Looking at the `co2_ppm` column, there isn't much change in the CO2 emissions at the Exploratorium: every value in the rows shown is -999.0. Now take a look at the `co_ppm` column for information about CO emissions. There is a good spread in this data! Let's focus on trends in CO emissions and the time and date at which each record was collected from the Exploratorium Bay. We will use the `.drop()` method to remove the columns we do not need for the rest of our analysis.

In [52]:
# Drop every column except local_timestamp (0), datetime (1), and co_ppm (5)
co_emissions = exploratorium_emissions.drop(2, 3, 4, 6, 7, 8, 9, 10)
co_emissions

local_timestamp,datetime,co_ppm
2022-01-01 23:00:00,2022-01-02 07:00:00,0.24
2022-01-02 01:00:00,2022-01-02 09:00:00,0.21
2022-01-02 02:00:00,2022-01-02 10:00:00,0.2
2022-01-02 03:00:00,2022-01-02 11:00:00,0.18
2022-01-02 04:00:00,2022-01-02 12:00:00,0.17
2022-01-02 05:00:00,2022-01-02 13:00:00,0.17
2022-01-02 06:00:00,2022-01-02 14:00:00,0.17
2022-01-02 07:00:00,2022-01-02 15:00:00,0.19
2022-01-02 08:00:00,2022-01-02 16:00:00,0.2
2022-01-02 09:00:00,2022-01-02 17:00:00,0.2


You can ignore the code below. It cleans the table a little more and makes it easier to read.

We now have a table showing the carbon monoxide emissions at different times and dates at the Exploratorium Bay. Here are our key features:

- `local_timestamp`: Pacific Time
- `datetime`: Coordinated Universal Time (UTC)
- `co_ppm`: CO emissions in parts per million (ppm)

We have our table, so now let's get to visualizing! In this section, we will use two key visualizations:
1. Scatterplot - we will use the [`.scatter()`](https://www.data8.org/sp23/reference/) method
2. Histogram - we use the [`.hist()`](https://www.data8.org/sp23/reference/) method

<div class="alert alert-info">
You may be wondering how to choose between visualizations when representing your data. For the two visualizations mentioned above, follow these guidelines:

   1. Use **scatter plots** to visualize non-sequential numerical data when you are looking for associations between two variables
   <br>
   2. Use **histograms** to visualize the distribution of a **single *numerical variable***
   <br>
   3. Optionally, although we won't be using it in this exercise, you can use a **bar chart** (using [`.barh()`](https://www.data8.org/sp23/reference/)) to visualize the distribution of a **single *categorical variable***
   <br>
</div>

create scatter plots 

<div class="alert alert-info">
When reading the scatter plot, take note of the following:

   1. Which variable is on the x-axis? Which is on the y-axis?
   2. Are the variables scaled differently?
   3. Is there any association between the two variables? (positive, negative, no association)
</div>

create histograms 

<div class="alert alert-info">
When reading the histogram, take note of the following:

   1. The horizontal axis will **always** be numerical: drawn to scale, no gaps
   2. The area of each bar is proportional to the percent of individuals in the sample that fall in that bin
</div>

further analysis, answer questions about trends 