In [3]:
from datascience import * 

# Introduction to EDA (Exploratory Data Analysis)

In this notebook, you will be introduced to data management methods used in data science (e.g. manipulating tables, naming columns, taking a subset of the data) using a dataset of emissions spanning one year. The goal is to understand more about the atmosphere while answering questions such as: **Are emissions higher in the daytime?**

## Exploring Emissions and the Atmosphere

**Emissions** are gases and particles that are released into the air. They can come from a variety of sources, including cars, factories, power plants, and even natural sources like volcanoes.
When we talk about emissions, we're often talking about greenhouse gas emissions. These are gases like carbon dioxide, methane, and nitrous oxide that trap heat in the Earth's atmosphere and contribute to global warming and climate change. Given current concerns about global warming, we need to acknowledge that every year around 10 gigatons of carbon (that's 10 × 10<sup>9</sup> tons of carbon per year, written as GtC/yr) are emitted by the combustion of fossil fuels, which has devastating environmental effects.

The **Atmosphere** is a layer of gas that surrounds the Earth. It's made up of different gases, including nitrogen, oxygen, and a small amount of other gases like carbon dioxide and methane.
Understanding its impact is important because the atmosphere protects us from harmful radiation from the sun and helps regulate the temperature of the Earth. It's kind of like a blanket that keeps the Earth warm and cozy.

**Data Analytics** plays an important role in answering questions like: Where does all this carbon come from? Where does it all go? How does it get there?


<div class=" alert alert-info">

## Reflection Question:
How can we as individuals reduce our personal emissions, and what are some practical steps we can take to make a positive impact on the environment?

</div> 


## Using Data Analytics to understand emissions
#### Introduction to the BEACO2N Dataset:
BEACO2N (Berkeley Environmental Air-quality & CO2 Network) is a new strategy for understanding greenhouse gases (GHGs) and air quality at street level in near real time, giving pedestrians, companies, and policy-makers unique insight into their GHG emissions and air quality experiences.
Through their technology, BEACO2N is able to create a **highly detailed map** of CO2 and pollutants in our air. The data provides a clear route to evaluating the effectiveness of local and regional efforts to reduce GHG emissions, improve air quality, improve environmental equity, and reduce the detrimental effects of emissions on public health.

Check out the website here for some cool visualizations and data: http://beacon.berkeley.edu/metadata/

<img src="b.png" align="center"/>

#### Who collected the data?
BEACO2N data is collected by a network of sensors, also known as "nodes", that are deployed in various locations. The nodes are part of a collaborative effort between different organizations, including academic institutions, government agencies, and non-profit organizations. The data collected by the nodes is made available to researchers and the public for analysis and use in developing solutions for climate change mitigation and adaptation.

#### How was the data collected?
BEACO2N blankets interesting locations with a network of sensors, called **"nodes"**, placed approximately 1 mile (2 km) apart from each other to measure greenhouse gases and air quality. Although the individual nodes are less precise than highly sensitive traditional sensors, when working as part of a network the nodes create a highly detailed map of CO2 and pollutants in our air. The nodes sample the air for 6 gases and also aerosol in the same locations, every minute of the day.

**CO2**, or carbon dioxide, is typically measured in parts per million (ppm) or parts per billion (ppb). This is because CO2 is a trace gas in the Earth's atmosphere, meaning it makes up only a small fraction of the gases in the air. To measure the concentration of CO2 in the atmosphere, scientists use instruments like infrared gas analyzers that can detect and quantify the amount of CO2 in a sample of air. 
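
To get a feel for how small these units are, here is a minimal sketch that converts a ppm value to a percentage and to ppb. The helper names `ppm_to_percent` and `ppm_to_ppb` are made up for this illustration, and the input is roughly the average CO2 reading computed later in this notebook.

```python
def ppm_to_percent(ppm):
    """Convert parts per million to percent (1% equals 10,000 ppm)."""
    return ppm / 10_000


def ppm_to_ppb(ppm):
    """Convert parts per million to parts per billion (1 ppm equals 1,000 ppb)."""
    return ppm * 1_000


co2_ppm = 431.3  # approximate mean CO2 reading from this notebook's dataset

print(ppm_to_percent(co2_ppm), "percent of the air sample")
print(ppm_to_ppb(co2_ppm), "ppb")
```

Even at several hundred ppm, CO2 is well under a tenth of a percent of the air, which is why it is called a trace gas.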

#### What is represented in the data?
The locations BEACO2N tracks have nodes that contain sensors for **CO2, NO, NO2, O3, CO, and aerosol** in addition to sensors for **temperature, pressure, and relative humidity**. Data from these sensors are collected once every five seconds onto a miniature computer which then sends the data to a centralized server. When combined with data from other nodes, it can be used to produce concentration maps, track pollution plumes, and to constrain calculations of emissions, to name a few possibilities.

## Understanding the data

For the purposes of this notebook, we will be using a dataset of CO2 emissions from the beginning of 2022 to the beginning of 2023.

`Timestamp: 2022-01-01 23:00:00 to 2023-01-01 23:00:00` <br>
`Node: Exploratorium Bay`

#### The Dataset
Based on the screenshot above, you should be able to download the metadata. You'll notice the file is saved as a '.csv'. A CSV (Comma Separated Values) file is a simple text file format used to store tabular data, which is commonly used in data science. In a CSV file, each row represents a single data record, and each column represents a specific attribute or feature of that record.

The values in a CSV file are separated by commas, which means that each comma separates a different column of data. The first row of a CSV file typically contains column headings, which describe the data in each column.

CSV files can be easily read and written by many software tools, including spreadsheet applications like Microsoft Excel, Google Sheets, or Python libraries like pandas. They are used in data science because they are a lightweight, easy-to-use format for storing and sharing large amounts of data, and they can be easily processed and analyzed by many programming languages and tools.
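
To make the row and column structure concrete, here is a minimal sketch that reads a tiny CSV using only Python's built-in `csv` module. The two data rows are copied from this notebook's dataset and trimmed to three columns; the variable names are just for illustration.

```python
import csv
import io

# Two rows from this notebook's data, trimmed to three columns.
raw_text = """local_timestamp,node_id,co_ppm
2022-01-01 23:00:00,48,0.24
2022-01-02 01:00:00,48,0.21
"""

reader = csv.reader(io.StringIO(raw_text))
header = next(reader)   # first row: the column headings
records = list(reader)  # remaining rows: one record per row

print(header)
print(len(records), "records,", len(header), "columns")

# Every value in a CSV is read as text, so convert before doing arithmetic.
co_values = [float(row[2]) for row in records]
print(co_values)
```

Libraries that read CSV files into tables typically handle the header row and the type conversion for you.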


<img src="csv.png" align="center"/>

<div class=" alert alert-info">
Advanced: What are examples of other types of files?
Working with different file types is an essential part of data science as it involves importing, exporting, and manipulating data from various sources. Here are some commonly used file types in data science and their explanations:

1. CSV (Comma Separated Values) - CSV files are commonly used for storing tabular data where each row represents an observation and each column represents a variable. CSV files can be easily imported into various data analysis tools like Excel, R, or Python.
2. Excel - Excel files (.xlsx) are commonly used for storing data in tabular form. Excel files can be easily imported into R or Python using dedicated libraries like readxl in R or openpyxl in Python.
3. JSON (JavaScript Object Notation) - JSON files are used for storing structured data that can be easily understood by both humans and machines. JSON files are commonly used to store data that is transmitted between web applications.
4. XML (Extensible Markup Language) - XML files are used for storing structured data that can be easily understood by machines. XML files are commonly used to store data that is transmitted between web applications.
5. SQL (Structured Query Language) - SQL is used for managing relational databases. SQL files can be used to store data and query data from databases.
6. TXT (Text) - TXT files are used for storing unstructured text data. TXT files can be easily imported into R or Python using the readr package in R or the open() function in Python.


    
</div>
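
As a small illustration of one of the other file types listed above, the sketch below stores a single reading as JSON and loads it with Python's built-in `json` module. The field names mirror this notebook's columns, but the JSON layout itself is an assumption made for this example and is not a format BEACO2N is known to publish.

```python
import json

# One reading written as a JSON object (layout assumed for illustration).
json_text = '{"local_timestamp": "2022-01-01 23:00:00", "node_id": 48, "co_ppm": 0.24}'

record = json.loads(json_text)

# Unlike CSV cells, JSON values keep their types: numbers stay numbers.
print(type(record["node_id"]))
print(type(record["co_ppm"]))

# Converting back to text gives a string that can be saved or sent elsewhere.
round_trip = json.dumps(record)
print(round_trip)
```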


#### What makes a good dataset?
Data faithfulness refers to the degree to which data accurately represents the underlying phenomenon it purports to describe. The following terms are important to consider when discussing data faithfulness:

**Structure**: This refers to the organization of the data, including the format, schema, and relationships between different elements. Structured data is typically easier to analyze and interpret than unstructured data.

**Granularity**: This refers to the level of detail in the data. Higher granularity means more detailed data, which may be useful in certain contexts but may also increase the risk of identifying individuals or sensitive information.

**Scope**: This refers to the range of data being considered. For example, data may be limited to a specific geographic area or time period, which can impact its representativeness.

**Temporality**: This refers to the time dimension of the data, including when it was collected, how frequently it is updated, and whether it captures changes over time. Temporal consistency is important for longitudinal analyses.

**Faithfulness**: This refers to the accuracy and reliability of the data, including how it was collected, whether there were biases in the sampling or measurement processes, and whether the data accurately reflects the phenomenon being studied. High data faithfulness means that the data accurately represents the underlying reality, while low data faithfulness may lead to incorrect or biased conclusions.
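
Some of these properties can be checked with a few lines of code. The sketch below looks at granularity and temporality by searching for gaps between consecutive timestamps. It assumes one reading per hour is expected, which is a guess based on the sample rows in this notebook and not a documented guarantee. The four timestamps are the first four `local_timestamp` values in the table.

```python
from datetime import datetime, timedelta

# First four local timestamps from this notebook's table.
stamps = [
    "2022-01-01 23:00:00",
    "2022-01-02 01:00:00",
    "2022-01-02 02:00:00",
    "2022-01-02 03:00:00",
]
times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]

expected_step = timedelta(hours=1)  # assumed granularity: one reading per hour

# Keep each pair of neighbouring timestamps that are further apart than expected.
gaps = [
    (earlier, later)
    for earlier, later in zip(times, times[1:])
    if later - earlier > expected_step
]

for earlier, later in gaps:
    print("Missing readings between", earlier, "and", later)
```

A gap like this is worth noting before drawing conclusions, because hours with missing readings are under-represented in any summary.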

### Reading in the Data

In this section, we introduce the data. The sections that follow work toward answering these questions: Why are emissions higher in the daytime? Why are concentrations higher at night and lower in the daytime?

The `Table.read_table` command creates a new table by reading the file at the given filename.



In [4]:
emission_levels = Table.read_table('data.csv')
emission_levels

local_timestamp,datetime,node_id,epoch,julian_day,co_ppm,co_ppm_QC_level,co2_ppm,co2_ppm_QC_level,o3_ppm,o3_ppm_QC_level,PM_2.5_ug/m3,PM_2.5_ug/m3_QC_level
2022-01-01 23:00:00,2022-01-02 07:00:00,48,1641110000.0,2.29167,0.24,2,-999,2,0.019,,-999,-999
2022-01-02 01:00:00,2022-01-02 09:00:00,48,1641110000.0,2.375,0.21,2,-999,2,0.0201,,-999,-999
2022-01-02 02:00:00,2022-01-02 10:00:00,48,1641120000.0,2.41667,0.19,2,-999,2,0.0184,,-999,-999
2022-01-02 03:00:00,2022-01-02 11:00:00,48,1641120000.0,2.45833,0.18,2,-999,2,0.0198,,-999,-999
2022-01-02 04:00:00,2022-01-02 12:00:00,48,1641120000.0,2.5,0.17,2,-999,2,0.0185,,-999,-999
2022-01-02 05:00:00,2022-01-02 13:00:00,48,1641130000.0,2.54167,0.17,2,-999,2,0.0207,,-999,-999
2022-01-02 06:00:00,2022-01-02 14:00:00,48,1641130000.0,2.58333,0.18,2,-999,2,0.0203,,-999,-999
2022-01-02 07:00:00,2022-01-02 15:00:00,48,1641140000.0,2.625,0.19,2,-999,2,0.0208,,-999,-999
2022-01-02 08:00:00,2022-01-02 16:00:00,48,1641140000.0,2.66667,0.21,2,-999,2,0.0202,,-999,-999
2022-01-02 09:00:00,2022-01-02 17:00:00,48,1641140000.0,2.70833,0.2,2,-999,2,0.0223,,-999,-999


## Let's build our EDA Toolkit!

The next step after reading our data is to use methods to understand its structure.

In [5]:
number_rows = emission_levels.num_rows
number_rows

7740

In [6]:
number_columns = emission_levels.num_columns
number_columns

13

#### We can also modify our table by dropping, renaming, and selecting specific columns to suit our needs.

In [7]:
#We use the function '.drop()' to return a copy of the original table without the specified columns 
emission_levels.drop('datetime')


local_timestamp,node_id,epoch,julian_day,co_ppm,co_ppm_QC_level,co2_ppm,co2_ppm_QC_level,o3_ppm,o3_ppm_QC_level,PM_2.5_ug/m3,PM_2.5_ug/m3_QC_level
2022-01-01 23:00:00,48,1641110000.0,2.29167,0.24,2,-999,2,0.019,,-999,-999
2022-01-02 01:00:00,48,1641110000.0,2.375,0.21,2,-999,2,0.0201,,-999,-999
2022-01-02 02:00:00,48,1641120000.0,2.41667,0.19,2,-999,2,0.0184,,-999,-999
2022-01-02 03:00:00,48,1641120000.0,2.45833,0.18,2,-999,2,0.0198,,-999,-999
2022-01-02 04:00:00,48,1641120000.0,2.5,0.17,2,-999,2,0.0185,,-999,-999
2022-01-02 05:00:00,48,1641130000.0,2.54167,0.17,2,-999,2,0.0207,,-999,-999
2022-01-02 06:00:00,48,1641130000.0,2.58333,0.18,2,-999,2,0.0203,,-999,-999
2022-01-02 07:00:00,48,1641140000.0,2.625,0.19,2,-999,2,0.0208,,-999,-999
2022-01-02 08:00:00,48,1641140000.0,2.66667,0.21,2,-999,2,0.0202,,-999,-999
2022-01-02 09:00:00,48,1641140000.0,2.70833,0.2,2,-999,2,0.0223,,-999,-999


In [8]:
#We use the function '.relabeled()' to return a copy of the table with a column renamed
emission_levels.relabeled('co2_ppm', 'CO2')

local_timestamp,datetime,node_id,epoch,julian_day,co_ppm,co_ppm_QC_level,CO2,co2_ppm_QC_level,o3_ppm,o3_ppm_QC_level,PM_2.5_ug/m3,PM_2.5_ug/m3_QC_level
2022-01-01 23:00:00,2022-01-02 07:00:00,48,1641110000.0,2.29167,0.24,2,-999,2,0.019,,-999,-999
2022-01-02 01:00:00,2022-01-02 09:00:00,48,1641110000.0,2.375,0.21,2,-999,2,0.0201,,-999,-999
2022-01-02 02:00:00,2022-01-02 10:00:00,48,1641120000.0,2.41667,0.19,2,-999,2,0.0184,,-999,-999
2022-01-02 03:00:00,2022-01-02 11:00:00,48,1641120000.0,2.45833,0.18,2,-999,2,0.0198,,-999,-999
2022-01-02 04:00:00,2022-01-02 12:00:00,48,1641120000.0,2.5,0.17,2,-999,2,0.0185,,-999,-999
2022-01-02 05:00:00,2022-01-02 13:00:00,48,1641130000.0,2.54167,0.17,2,-999,2,0.0207,,-999,-999
2022-01-02 06:00:00,2022-01-02 14:00:00,48,1641130000.0,2.58333,0.18,2,-999,2,0.0203,,-999,-999
2022-01-02 07:00:00,2022-01-02 15:00:00,48,1641140000.0,2.625,0.19,2,-999,2,0.0208,,-999,-999
2022-01-02 08:00:00,2022-01-02 16:00:00,48,1641140000.0,2.66667,0.21,2,-999,2,0.0202,,-999,-999
2022-01-02 09:00:00,2022-01-02 17:00:00,48,1641140000.0,2.70833,0.2,2,-999,2,0.0223,,-999,-999


In [9]:
#We use the function '.select()' to select the columns we want to show 
emission_levels.select('local_timestamp', 'node_id')


local_timestamp,node_id
2022-01-01 23:00:00,48
2022-01-02 01:00:00,48
2022-01-02 02:00:00,48
2022-01-02 03:00:00,48
2022-01-02 04:00:00,48
2022-01-02 05:00:00,48
2022-01-02 06:00:00,48
2022-01-02 07:00:00,48
2022-01-02 08:00:00,48
2022-01-02 09:00:00,48


In [10]:
#We can use the function '.sort()' to sort our table by a column, in ascending order by default or in descending order with descending = True
emission_levels.sort('co2_ppm', descending = True)

local_timestamp,datetime,node_id,epoch,julian_day,co_ppm,co_ppm_QC_level,co2_ppm,co2_ppm_QC_level,o3_ppm,o3_ppm_QC_level,PM_2.5_ug/m3,PM_2.5_ug/m3_QC_level
2022-12-26 22:00:00,2022-12-27 06:00:00,48,1672120000.0,361.25,0.24,2,512.1,2,0.0014,2,29.0,2
2022-11-16 01:00:00,2022-11-16 09:00:00,48,1668590000.0,320.375,0.32,2,507.0,2,-999.0,2,13.6,2
2022-12-26 21:00:00,2022-12-27 05:00:00,48,1672120000.0,361.208,0.23,2,505.1,2,0.0022,2,30.6,2
2022-12-26 20:00:00,2022-12-27 04:00:00,48,1672110000.0,361.167,0.22,2,502.4,2,0.0028,2,29.9,2
2022-12-21 11:00:00,2022-12-21 19:00:00,48,1671650000.0,355.792,0.19,2,501.1,2,0.0124,2,43.7,2
2022-11-27 02:00:00,2022-11-27 10:00:00,48,1669540000.0,331.417,0.27,2,500.8,2,0.0042,2,25.9,2
2022-11-16 02:00:00,2022-11-16 10:00:00,48,1668590000.0,320.417,0.29,2,499.8,2,-999.0,2,12.5,2
2022-11-22 22:00:00,2022-11-23 06:00:00,48,1669180000.0,327.25,0.32,2,499.3,2,-999.0,2,18.1,2
2022-11-27 01:00:00,2022-11-27 09:00:00,48,1669540000.0,331.375,0.28,2,496.7,2,0.0057,2,30.0,2
2022-12-21 10:00:00,2022-12-21 18:00:00,48,1671650000.0,355.75,0.24,2,496.5,2,0.0072,2,46.0,2
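
Each of the four methods above returns a new table and leaves `emission_levels` unchanged, so a result only sticks around if you assign it to a name. The sketch below imitates that behaviour with plain Python lists of dictionaries. The helper functions are toy stand-ins written for this illustration, not the datascience library's implementation, and the rows are trimmed copies of rows from this notebook.

```python
# Three rows from this notebook's table, trimmed to three columns.
rows = [
    {"local_timestamp": "2022-01-01 23:00:00", "node_id": 48, "co_ppm": 0.24},
    {"local_timestamp": "2022-01-02 01:00:00", "node_id": 48, "co_ppm": 0.21},
    {"local_timestamp": "2022-01-02 02:00:00", "node_id": 48, "co_ppm": 0.19},
]


def drop(table, column):
    """Return new rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in table]


def relabeled(table, old, new):
    """Return new rows with one column renamed."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in table]


def select(table, *columns):
    """Return new rows containing only the named columns."""
    return [{c: row[c] for c in columns} for row in table]


def sort(table, column, descending=False):
    """Return new rows ordered by the given column."""
    return sorted(table, key=lambda row: row[column], reverse=descending)


# Chain the steps and keep the result under a new name.
renamed = relabeled(rows, "co_ppm", "CO")
result = sort(select(renamed, "local_timestamp", "CO"), "CO")

print(result[0])
print(list(rows[0]))  # the original rows still have their original columns
```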


### Try it yourself! 
Given the functions in your toolkit, try to manipulate the `co_ppm` variable based on the prompts below. <br>

1. Sort your table by `co_ppm` in descending order and assign the result to a new table called `co_ppm_levels`.
2. Use `relabeled` to change the name of the `local_timestamp` column.
3. Select the columns you will need to analyze `co_ppm`.

Your resultant table should look like this: 
<img src="table.png" align="center"/>

In [11]:
#sort the data
co_ppm_levels = ...


#relabel 
co_ppm_levels = ...


#select the required columns
...

Ellipsis

### Further analysis using the numpy library
Now that you have a table that contains the specific columns you want to work with, we can use built-in functions to understand the features better.

**Null Values**
A table often contains 'null' values, which are missing data points. They are often denoted as 'NaN', 'nan', or a placeholder number such as '-999'.
This table uses the '-999' notation. Let's take a look at how to remove those rows from our data.
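
To see why the placeholder has to go before computing any statistics, here is a minimal sketch using plain Python. The three valid readings are CO2 values that appear in this notebook; mixing them with a single -999 in one short list is a made-up arrangement for illustration.

```python
from statistics import mean

MISSING = -999  # placeholder this dataset uses for "no reading"

# Three valid CO2 readings plus one missing-value marker (illustrative mix).
readings = [415.0, 412.0, MISSING, 411.2]

# Keep only the real measurements.
valid = [value for value in readings if value != MISSING]

print("Mean with placeholder:", mean(readings))
print("Mean without placeholder:", mean(valid))
```

A single placeholder drags the average far below any concentration that was actually measured, so the filtered list is the one to analyze.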

In [12]:
emission_levels = emission_levels.where('co2_ppm',lambda x: x != -999)

Similar to the datascience library, the numpy library provides a range of functions. NumPy is a Python library for numerical computing that provides powerful data structures and functions for working with multi-dimensional arrays and matrices. It is widely used in scientific computing, data analysis, and machine learning applications. NumPy provides efficient implementation of mathematical operations and enables vectorized computations that can significantly speed up data processing. 

In [13]:
#Importing numpy 
import numpy as np

We can use the 'mean' function from the NumPy package to take the average of a column. We write the name of our table and, in square brackets, the name of the column.

In [14]:
np.mean(emission_levels['co2_ppm'])

431.32958462842606

We can use the 'min' function to find the smallest value in a column. We write the name of our table and, in square brackets, the name of the column.

In [15]:
min(emission_levels['co2_ppm'])

404.3

We can use the 'max' function to find the largest value in a column. We write the name of our table and, in square brackets, the name of the column.

In [16]:
max(emission_levels['co2_ppm'])

512.1

## Using the Data to understand patterns!

In [17]:
#This is a new modified dataset created to work with the date and time of each node
data_vis = Table.read_table('v_data.csv')

In [18]:
#Remove the null values using the method in your toolkit
data_vis = data_vis.where('co2_ppm',lambda x: x != -999)
data_vis

Unnamed: 0,co_ppm,co2_ppm,date_col,time_col
4157,0.21,415.0,2022-08-05,18:00:00
4158,0.09,412.0,2022-08-05,19:00:00
4159,0.08,411.2,2022-08-05,20:00:00
4160,0.08,411.1,2022-08-05,21:00:00
4161,0.08,410.4,2022-08-05,22:00:00
4162,0.07,409.6,2022-08-05,23:00:00
4163,0.07,409.9,2022-08-06,00:00:00
4164,0.07,410.0,2022-08-06,01:00:00
4165,0.06,410.3,2022-08-06,02:00:00
4166,0.08,410.5,2022-08-06,03:00:00


## Our Question: Is there a difference in emission during the day vs night?

Using the group function from the datascience library, we can add up the readings recorded at each time of day over the past year!
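
Grouping collects all rows that share a value and then combines each group into one number. The sketch below does this by hand with a dictionary, using made-up readings in which one hour has fewer rows than the other. It is only an illustration of the idea, not how the datascience library implements grouping.

```python
from collections import defaultdict

# Hypothetical (time, co2_ppm) readings; 01:00 has one fewer reading than 00:00.
readings = [
    ("00:00:00", 410.0),
    ("00:00:00", 412.0),
    ("00:00:00", 411.0),
    ("01:00:00", 415.0),
    ("01:00:00", 416.0),
]

# Collect the values belonging to each time of day.
groups = defaultdict(list)
for time_of_day, co2 in readings:
    groups[time_of_day].append(co2)

# Combine each group into a sum and a mean.
for time_of_day, values in sorted(groups.items()):
    total = sum(values)
    average = total / len(values)
    print(time_of_day, "sum:", total, "mean:", average)
```

In this toy data the hour with the larger sum has the smaller mean, because a sum also depends on how many readings fall in each group. If some hours lost more rows when the -999 values were removed, comparing means may be a fairer way to compare times of day.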

In [19]:
grouped_data = data_vis.group('time_col', sum)
grouped_data = grouped_data.drop('Unnamed: 0 sum')
grouped_data = grouped_data.drop('date_col sum')
grouped_data 

time_col,co_ppm sum,co2_ppm sum
00:00:00,15.6,63261.3
01:00:00,16.4,63498.8
02:00:00,17.76,63285.4
03:00:00,19.37,63472.4
04:00:00,19.96,63575.9
05:00:00,19.22,63902.0
06:00:00,18.59,63881.8
07:00:00,18.07,63845.8
08:00:00,17.24,63315.8
09:00:00,16.74,62538.9


In [23]:
import matplotlib.pyplot as plt

#This line results in the co_ppm and co2_ppm being plotted against time. 
grouped_data.plot('time_col', ['co_ppm sum', 'co2_ppm sum'])

# set axis labels
plt.xlabel('Time')
plt.ylabel('Concentration (ppm)')

# show the plot
plt.show()

**Use the space below to describe the patterns you see in the plot above. What trends can you infer from the data? When are the emissions usually the highest?**

## Try it yourself! 

Now, using the `data_vis` table, try grouping the data according to the dates. Answer the following questions: <br>
1. Which day has the highest `co2_ppm` and `co_ppm` emissions based on the table?
2. Do the summer months have higher emissions than the winter months, or vice versa?

In [24]:
#Your code here. Feel free to refer to the previous notebooks for more visualization methods
data_vis






Unnamed: 0,co_ppm,co2_ppm,date_col,time_col
4157,0.21,415.0,2022-08-05,18:00:00
4158,0.09,412.0,2022-08-05,19:00:00
4159,0.08,411.2,2022-08-05,20:00:00
4160,0.08,411.1,2022-08-05,21:00:00
4161,0.08,410.4,2022-08-05,22:00:00
4162,0.07,409.6,2022-08-05,23:00:00
4163,0.07,409.9,2022-08-06,00:00:00
4164,0.07,410.0,2022-08-06,01:00:00
4165,0.06,410.3,2022-08-06,02:00:00
4166,0.08,410.5,2022-08-06,03:00:00


## Congratulations! You finished the notebook. 
You can now try to find patterns in complex datasets on your own. Here are some extra resources:

1. [Function Visualizer](https://www.data8.org/interactive_table_functions/)
2. [Detailed Table Functions](https://drive.google.com/file/d/1j2hjhweJdGWW0EdvmjGHsXFUatXIZax4/view)
