## Introduction
Emissions from electric power generation contribute significantly to reduced air quality in industrialized areas and also have a detrimental effect on the environment globally. The aim of this project is to perform an exploratory data analysis of the air pollution associated with electric power generation.

The dataset used for this analysis contains the characteristics and emission details of electric power plants in the US for 2010. The notebook examines the plants’ performance and emissions in order to identify areas where future emission reductions can be achieved.

## Dataset

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [8]:
# Load dataset and show info
df = pd.read_csv('plants.csv', index_col=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5393 entries, 0 to 5392
Data columns (total 38 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   ID              5393 non-null   int64  
 1   State           5393 non-null   object 
 2   Name            5393 non-null   object 
 3   County          5393 non-null   object 
 4   Lat             5393 non-null   float64
 5   Lon             5393 non-null   float64
 6   Combust         5393 non-null   float64
 7   Fuel            5364 non-null   object 
 8   FuelCat         5393 non-null   object 
 9   Capacity        5393 non-null   float64
 10  HeatInput       5393 non-null   float64
 11  NetGen          5393 non-null   float64
 12  NOX             5393 non-null   float64
 13  SO2             5393 non-null   float64
 14  CO2             5393 non-null   float64
 15  CoalGen         5393 non-null   float64
 16  OilGen          5393 non-null   float64
 17  GasGen          5393 non-null   f

In [9]:
# List of all variables
df.columns

Index(['ID', 'State', 'Name', 'County', 'Lat', 'Lon', 'Combust', 'Fuel',
       'FuelCat', 'Capacity', 'HeatInput', 'NetGen', 'NOX', 'SO2', 'CO2',
       'CoalGen', 'OilGen', 'GasGen', 'NuclearGen', 'HydroGen', 'BiomassGen',
       'WindGen', 'SolarGen', 'GeoGen', 'OtherFossilGen', 'OtherGen',
       'NonRenewGen', 'RenewGen', 'CombGen', 'NonCombGen', 'CoalPortion',
       'CapFac', 'SO2OutRate', 'CO2OutRate', 'NOXOutRate', 'NOXInRate',
       'CO2InRate', 'SO2InRate'],
      dtype='object')

- The dataset contains 5393 entries, one for each plant with non-zero generation and/or heat input.
- Each entry contains a unique identification number, the name and geographical location of the plant, and information about the plant's performance.
- The key performance variables fall into three categories:
  - Energy resources,
  - Energy generation, 
  - Emissions.
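
The selection criterion in the first bullet can be checked directly. The sketch below assumes that `NetGen` and `HeatInput` are the columns the criterion refers to, and it uses a small made-up frame (`toy`) in place of `plants.csv`; the helper name `active_mask` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real dataset; all values are made up for illustration.
toy = pd.DataFrame({
    "NetGen":    [1200.0, 0.0, 350.0, 0.0],
    "HeatInput": [9800.0, 410.0, 0.0, 0.0],
    "Fuel":      ["NG", "NG", None, "NG"],
})

def active_mask(frame: pd.DataFrame) -> pd.Series:
    """True where a plant reports non-zero generation and/or heat input."""
    return (frame["NetGen"] != 0) | (frame["HeatInput"] != 0)

mask = active_mask(toy)
n_active = int(mask.sum())                       # rows meeting the criterion
n_missing_fuel = int(toy["Fuel"].isna().sum())   # rows without a primary fuel type
print(n_active, n_missing_fuel)
```

On the real data, `active_mask(df).all()` would be expected to hold if the description above is accurate, and `df['Fuel'].isna().sum()` should report the 29 missing fuel types implied by the `df.info()` output (5393 entries, 5364 non-null).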

**_Energy resources:_**

| Variable name | Description |
| :-- | :-- |
| Plant Combustion Status (`Combust`) | Takes the value of: <br> 1.0 for full combustion plants, <br> 0.5 for partial combustion plants (combustion plants that also contain non-combustion generators), and <br> 0.0 for non-combustion plants. |
| Plant Primary Fuel Type (`Fuel`) | Identifies the plant’s primary fuel type, based on the maximum heat input, as one of 43 fuel types. |
| Plant Primary Fuel Category (`FuelCat`) | Groups the Plant Primary Fuel Type variable into Coal, Oil, Gas, Nuclear, Hydro, Biomass, <br> Wind, Solar, Geothermal, Other Fossil, and Other Unknown/Purchased/Waste (referred to as Other). |

In [16]:
df[['Combust', 'Fuel', 'FuelCat']].astype("object").describe()

| | Combust | Fuel | FuelCat |
| :-- | :-- | :-- | :-- |
| count | 5393.0 | 5364 | 5393 |
| unique | 3.0 | 34 | 11 |
| top | 1.0 | NG | GAS |
| freq | 3233.0 | 1416 | 1408 |
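
The numeric combustion codes are easier to read in summaries once they are mapped to labels. This is a minimal sketch, assuming the `Combust` column holds exactly the three values listed in the table above; the label strings, the `CombustLabel` column, and the `toy` frame are illustrative and not part of the dataset.

```python
import pandas as pd

# Code-to-label mapping taken from the variable description table.
# The label strings themselves are an illustrative choice.
COMBUST_LABELS = {
    1.0: "full combustion",
    0.5: "partial combustion",
    0.0: "non-combustion",
}

# Made-up values standing in for df['Combust'].
toy = pd.DataFrame({"Combust": [1.0, 0.5, 0.0, 1.0]})

# Series.map returns NaN for any code missing from the dictionary,
# which makes unexpected values easy to spot.
toy["CombustLabel"] = toy["Combust"].map(COMBUST_LABELS)
counts = toy["CombustLabel"].value_counts()
print(counts)
```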


**_Energy generation:_**
- Primary variables include nameplate capacity (MW), annual heat input (MMBtu), and annual net generation (MWh).
- Note: an additional set of variables gives the net generation by each fuel type; together these make up a plant's annual net generation.

In [5]:
df[['Capacity', 'HeatInput', 'NetGen']].describe()

| | Capacity | HeatInput | NetGen |
| :-- | :-- | :-- | :-- |
| count | 5393.0 | 5393.0 | 5393.0 |
| mean | 207.852809 | 5254938.0 | 766323.7 |
| std | 456.751711 | 18582120.0 | 2398173.0 |
| min | 1.0 | 0.0 | 1.0 |
| 25% | 5.5 | 0.0 | 6468.0 |
| 50% | 28.8 | 7210.0 | 45014.0 |
| 75% | 145.2 | 1161249.0 | 302175.0 |
| max | 6809.0 | 244965000.0 | 31199940.0 |
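
The note about per-fuel generation can be turned into a consistency check. The sketch below assumes that the eleven per-fuel columns listed in `df.columns` (and not the aggregate columns such as `RenewGen` or `CombGen`) are the ones that add up to `NetGen`; the helper `generation_gap` and the `toy` frame are hypothetical.

```python
import pandas as pd

# Per-fuel generation columns as they appear in df.columns.
# Assumption: these are the components that sum to NetGen.
FUEL_GEN_COLS = [
    "CoalGen", "OilGen", "GasGen", "NuclearGen", "HydroGen", "BiomassGen",
    "WindGen", "SolarGen", "GeoGen", "OtherFossilGen", "OtherGen",
]

def generation_gap(frame: pd.DataFrame, cols=FUEL_GEN_COLS) -> pd.Series:
    """Reported NetGen minus the sum of the per-fuel columns present in frame."""
    present = [c for c in cols if c in frame.columns]
    return frame["NetGen"] - frame[present].sum(axis=1)

# Made-up rows: the first is consistent, the second is off by 5 MWh.
toy = pd.DataFrame({
    "NetGen":  [100.0, 50.0],
    "CoalGen": [60.0, 0.0],
    "GasGen":  [40.0, 45.0],
})
gap = generation_gap(toy)
print(gap)
```

Applied to the real data, rows where `generation_gap(df).abs()` exceeds a small tolerance would be worth inspecting before relying on the per-fuel breakdown.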



**_Emissions:_**
- Primary variables include annual CO2-equivalent, NOx, and SO2 emissions, all in short tons.

In [14]:
df[['CO2', 'NOX', 'SO2']].describe()

| | CO2 | NOX | SO2 |
| :-- | :-- | :-- | :-- |
| count | 5393.0 | 5393.0 | 5393.0 |
| mean | 473750.5 | 427.892412 | 1011.099905 |
| std | 1880969.0 | 1853.812913 | 5434.895068 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 77.6 | 0.58 | 0.02 |
| 75% | 49621.2 | 59.18 | 2.77 |
| max | 25271860.0 | 38837.06 | 112951.18 |
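
Because absolute emissions scale with plant size, it can help to normalise them by net generation. The sketch below converts short tons to pounds (1 short ton = 2000 lb) and divides by MWh. Whether the dataset's own `CO2OutRate`, `SO2OutRate`, and `NOXOutRate` columns use this definition and these units is an assumption that would need checking against the data; the helper name and the `toy` frame are hypothetical.

```python
import pandas as pd

LB_PER_SHORT_TON = 2000  # 1 short ton = 2000 lb

def output_rate_lb_per_mwh(emission_tons: pd.Series, net_gen_mwh: pd.Series) -> pd.Series:
    """Emissions per unit of net generation in lb/MWh.

    Rows with non-positive generation yield NaN instead of a division error
    or an infinite rate.
    """
    safe_gen = net_gen_mwh.where(net_gen_mwh > 0)
    return emission_tons * LB_PER_SHORT_TON / safe_gen

# Made-up values for illustration.
toy = pd.DataFrame({
    "CO2":    [1000.0, 0.0, 5.0],
    "NetGen": [2000.0, 500.0, 0.0],
})
toy["CO2Rate"] = output_rate_lb_per_mwh(toy["CO2"], toy["NetGen"])
print(toy)
```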
