# Analyze Global Power Plant Data - Python Data Science Portfolio Project

Objectives:
- Use pandas to clean and prepare data for exploration and analysis
- Apply aggregation, merges, and other data science techniques to answer data questions
- Report key findings

### Questions:
- Which countries have the most and the fewest power plants?
- What is the most popular fuel type?
- Looking at commissioning year data, how has the fuel type changed?

## Load and Inspect the data
Load the data into a pandas DataFrame, then inspect its structure and preview the first rows.

In [154]:
import pandas as pd
power = pd.read_csv('global_power_plant_database.csv')
power.info()
power.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34936 entries, 0 to 34935
Data columns (total 36 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   country                         34936 non-null  object 
 1   country_long                    34936 non-null  object 
 2   name                            34936 non-null  object 
 3   gppd_idnr                       34936 non-null  object 
 4   capacity_mw                     34936 non-null  float64
 5   latitude                        34936 non-null  float64
 6   longitude                       34936 non-null  float64
 7   primary_fuel                    34936 non-null  object 
 8   other_fuel1                     1944 non-null   object 
 9   other_fuel2                     276 non-null    object 
 10  other_fuel3                     92 non-null     object 
 11  commissioning_year              17447 non-null  float64
 12  owner                           


Unnamed: 0,country,country_long,name,gppd_idnr,capacity_mw,latitude,longitude,primary_fuel,other_fuel1,other_fuel2,...,estimated_generation_gwh_2013,estimated_generation_gwh_2014,estimated_generation_gwh_2015,estimated_generation_gwh_2016,estimated_generation_gwh_2017,estimated_generation_note_2013,estimated_generation_note_2014,estimated_generation_note_2015,estimated_generation_note_2016,estimated_generation_note_2017
0,AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,GEODB0040538,33.0,32.322,65.119,Hydro,,,...,123.77,162.9,97.39,137.76,119.5,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1
1,AFG,Afghanistan,Kandahar DOG,WKS0070144,10.0,31.67,65.795,Solar,,,...,18.43,17.48,18.25,17.7,18.29,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE
2,AFG,Afghanistan,Kandahar JOL,WKS0071196,10.0,31.623,65.792,Solar,,,...,18.64,17.58,19.1,17.62,18.72,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE,SOLAR-V1-NO-AGE
3,AFG,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,GEODB0040541,66.0,34.556,69.4787,Hydro,,,...,225.06,203.55,146.9,230.18,174.91,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1
4,AFG,Afghanistan,Naghlu Dam Hydroelectric Power Plant Afghanistan,GEODB0040534,100.0,34.641,69.717,Hydro,,,...,406.16,357.22,270.99,395.38,350.8,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1,HYDRO-V1


### Observations:
- There are 34,936 entries across 36 columns.
- Many columns have missing values; for example, commissioning_year is populated for only 17,447 rows.
- The data types shown appear correct.
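
To put a number on the missing data, one option is to compute the share of missing values per column. The sketch below uses a tiny made-up `sample` frame (its values, including the owner name, are hypothetical) rather than the real file; the same expression should work on `power`.

```python
import pandas as pd

# Tiny made-up sample standing in for the real database (illustrative values only).
sample = pd.DataFrame({
    "country_long": ["Afghanistan", "Afghanistan", "China", "China"],
    "primary_fuel": ["Hydro", "Solar", "Coal", "Solar"],
    "commissioning_year": [None, None, 2005.0, 2017.0],
    "owner": [None, None, None, "Hypothetical Owner"],
})

# Percentage of missing values per column, largest first.
missing_pct = sample.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Columns near the top of this listing are the ones where any conclusion rests on a subset of the plants only.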

## Trim down our data set

This is a wide dataset (36 columns). To make it slightly more manageable, we are going to drop some columns. Columns to be dropped are:
- country (we have the country_long value to go off of)
- url, geolocation_source, gppd_idnr, latitude, longitude, the estimated_generation_note columns, and wepp_id
- other_fuel1 through other_fuel3 (we are looking at the primary fuel for these questions)

In [155]:
drop_columns = ['country', 'url', 'geolocation_source', 'wepp_id', 'other_fuel1', 'other_fuel2', 'other_fuel3', 'gppd_idnr', 'latitude', 'longitude', 'estimated_generation_note_2013', 'estimated_generation_note_2014', 'estimated_generation_note_2015', 'estimated_generation_note_2016', 'estimated_generation_note_2017']
power = power.drop(labels=drop_columns, axis=1)
power.info()
power.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34936 entries, 0 to 34935
Data columns (total 21 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   country_long                   34936 non-null  object 
 1   name                           34936 non-null  object 
 2   capacity_mw                    34936 non-null  float64
 3   primary_fuel                   34936 non-null  object 
 4   commissioning_year             17447 non-null  float64
 5   owner                          20868 non-null  object 
 6   source                         34921 non-null  object 
 7   year_of_capacity_data          14887 non-null  float64
 8   generation_gwh_2013            6417 non-null   float64
 9   generation_gwh_2014            7226 non-null   float64
 10  generation_gwh_2015            8203 non-null   float64
 11  generation_gwh_2016            9144 non-null   float64
 12  generation_gwh_2017            9500 non-null  

Unnamed: 0,country_long,name,capacity_mw,primary_fuel,commissioning_year,owner,source,year_of_capacity_data,generation_gwh_2013,generation_gwh_2014,...,generation_gwh_2016,generation_gwh_2017,generation_gwh_2018,generation_gwh_2019,generation_data_source,estimated_generation_gwh_2013,estimated_generation_gwh_2014,estimated_generation_gwh_2015,estimated_generation_gwh_2016,estimated_generation_gwh_2017
0,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,33.0,Hydro,,,GEODB,2017.0,,,...,,,,,,123.77,162.9,97.39,137.76,119.5
1,Afghanistan,Kandahar DOG,10.0,Solar,,,Wiki-Solar,,,,...,,,,,,18.43,17.48,18.25,17.7,18.29
2,Afghanistan,Kandahar JOL,10.0,Solar,,,Wiki-Solar,,,,...,,,,,,18.64,17.58,19.1,17.62,18.72
3,Afghanistan,Mahipar Hydroelectric Power Plant Afghanistan,66.0,Hydro,,,GEODB,2017.0,,,...,,,,,,225.06,203.55,146.9,230.18,174.91
4,Afghanistan,Naghlu Dam Hydroelectric Power Plant Afghanistan,100.0,Hydro,,,GEODB,2017.0,,,...,,,,,,406.16,357.22,270.99,395.38,350.8
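
As an alternative to typing every column name, the drop list can be assembled from name prefixes. This is a minimal sketch on an empty `sample` frame that carries only a handful of the real column names; it is not the cell that produced the output above.

```python
import pandas as pd

# Empty frame with a subset of the real column names, for illustration only.
sample = pd.DataFrame(columns=[
    "country", "country_long", "name",
    "other_fuel1", "other_fuel2", "other_fuel3",
    "estimated_generation_gwh_2013",
    "estimated_generation_note_2013", "estimated_generation_note_2014",
])

# Collect columns by prefix instead of listing each one by hand.
pattern_cols = [
    col for col in sample.columns
    if col.startswith(("other_fuel", "estimated_generation_note_"))
]
drop_columns = ["country"] + pattern_cols

trimmed = sample.drop(columns=drop_columns)
print(list(trimmed.columns))
```

Matching on the full `estimated_generation_note_` prefix keeps the numeric `estimated_generation_gwh_*` columns in place.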


## Which countries have the most and the fewest power plants?

In [156]:
power.value_counts('country_long')

country_long
United States of America    9833
China                       4235
United Kingdom              2751
Brazil                      2360
France                      2155
                            ... 
Lesotho                        1
Western Sahara                 1
Suriname                       1
Palestine                      1
Guinea-Bissau                  1
Name: count, Length: 167, dtype: int64

In [157]:
power['country_long'].describe()

count                        34936
unique                         167
top       United States of America
freq                          9833
Name: country_long, dtype: object

#### Here, the United States has the most power plants (9,833), while several countries, including Palestine and Guinea-Bissau, have only one. This pattern is consistent with larger, wealthier economies operating more plants, although the dataset contains no GDP figures, so the link is a hypothesis rather than a measured correlation.
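
Because several countries tie at a single plant, "the country with the fewest plants" is really a group. A short sketch of how the full set of ties could be listed, using hypothetical country labels in place of the real data:

```python
import pandas as pd

# Hypothetical country labels; the real column is power['country_long'].
sample = pd.DataFrame({
    "country_long": ["Country A", "Country A", "Country A",
                     "Country B", "Country B", "Country C", "Country D"],
})

counts = sample["country_long"].value_counts()

# Keep every country that ties for the maximum or the minimum count.
most = counts[counts == counts.max()]
least = counts[counts == counts.min()]

print("Most plants:", most.to_dict())
print("Fewest plants:", sorted(least.index))
```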

## What is the most popular fuel type?

We will analyze the most popular fuel type globally. We will also look at the top three countries and see what observations can be made.

In [158]:
power.value_counts('primary_fuel')

primary_fuel
Solar             10665
Hydro              7156
Wind               5344
Gas                3998
Coal               2330
Oil                2320
Biomass            1430
Waste              1068
Nuclear             195
Geothermal          189
Storage             135
Other                43
Cogeneration         41
Petcoke              12
Wave and Tidal       10
Name: count, dtype: int64

In [159]:
usa = power['country_long'] == "United States of America"
power[usa].value_counts('primary_fuel')

primary_fuel
Solar           3283
Gas             1818
Hydro           1449
Wind            1139
Oil              876
Waste            541
Coal             286
Biomass          153
Storage          104
Geothermal        65
Nuclear           58
Cogeneration      34
Other             16
Petcoke           11
Name: count, dtype: int64

In [160]:
china = power['country_long'] == "China"
power[china].value_counts('primary_fuel')

primary_fuel
Solar         1318
Hydro          947
Coal           946
Wind           835
Gas            170
Nuclear         12
Oil              5
Geothermal       2
Name: count, dtype: int64

In [161]:
uk = power['country_long'] == "United Kingdom"
power[uk].value_counts('primary_fuel')

primary_fuel
Solar             1170
Wind               780
Waste              329
Biomass            226
Hydro              119
Gas                 55
Storage             31
Oil                 11
Coal                 8
Nuclear              8
Cogeneration         7
Wave and Tidal       7
Name: count, dtype: int64

#### Overall, solar is the most common primary_fuel by plant count. However, among the top three countries, China depends on coal as a primary fuel far more than the United States or the United Kingdom.
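
Raw counts are hard to compare across countries with very different totals, so it can help to express coal as a share of each country's plants. The numbers below are copied from the outputs above; the dictionary names are introduced here for illustration.

```python
# Coal plants and total plants per country, copied from the outputs above.
coal = {"United States of America": 286, "China": 946, "United Kingdom": 8}
total = {"United States of America": 9833, "China": 4235, "United Kingdom": 2751}

# Coal's share of each country's plants, in percent.
coal_share = {country: round(coal[country] / total[country] * 100, 1) for country in coal}

for country, share in coal_share.items():
    print(f"{country}: {share}% of plants are coal")
# United States of America: 2.9% of plants are coal
# China: 22.3% of plants are coal
# United Kingdom: 0.3% of plants are coal
```

These are shares of plant counts, not of installed capacity, so they say nothing about how much electricity each fuel actually supplies.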

## Looking at commissioning year data, how has the fuel type changed over the years?

In [162]:
power[power['commissioning_year'] == 2000.000000].value_counts('primary_fuel')

primary_fuel
Gas           74
Coal          42
Oil           38
Wind          30
Hydro         25
Biomass       10
Waste          7
Geothermal     5
Nuclear        2
Other          1
Name: count, dtype: int64

Contrasted with primary fuel in the year 2018

In [163]:
power[power['commissioning_year'] == 2018.000000].value_counts('primary_fuel')

primary_fuel
Solar         474
Gas            49
Wind           40
Coal           14
Storage        10
Oil             7
Hydro           4
Waste           4
Biomass         3
Geothermal      1
Other           1
Name: count, dtype: int64
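
Since far more plants in the dataset were commissioned in 2018 than in 2000, comparing shares rather than raw counts makes the shift easier to read. The sketch below rebuilds the two outputs above as Series (the variable names are assumptions) and converts them to percentages.

```python
import pandas as pd

# Counts copied from the two outputs above.
fuel_2000 = pd.Series({"Gas": 74, "Coal": 42, "Oil": 38, "Wind": 30, "Hydro": 25,
                       "Biomass": 10, "Waste": 7, "Geothermal": 5, "Nuclear": 2, "Other": 1})
fuel_2018 = pd.Series({"Solar": 474, "Gas": 49, "Wind": 40, "Coal": 14, "Storage": 10,
                       "Oil": 7, "Hydro": 4, "Waste": 4, "Biomass": 3, "Geothermal": 1, "Other": 1})

# Align the two years on fuel type; fuels absent in a year count as 0.
counts = pd.DataFrame({2000: fuel_2000, 2018: fuel_2018}).fillna(0)

# Each fuel's share of the plants commissioned that year, in percent.
shares = counts.div(counts.sum()).mul(100).round(1)
print(shares.sort_values(2018, ascending=False))
```

Both columns cover only plants with a recorded commissioning_year, which is roughly half of the dataset.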

In [164]:
# Boolean masks for solar plants in the United States
solar = power['primary_fuel'] == 'Solar'
cntry = power['country_long'] == 'United States of America'

# Count US solar plants by commissioning year
power[solar & cntry].value_counts('commissioning_year')

commissioning_year
2017.000000    507
2018.000000    470
2019.000000    458
2016.000000    424
2014.000000    307
              ... 
2013.500000      1
2013.514286      1
2013.520396      1
2013.608696      1
2013.363636      1
Name: count, Length: 85, dtype: int64

In [165]:
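# Note: the baseline of 1 is the count for a single fractional year value
# (e.g. 2013.5), not the total number of plants commissioned in 2013,
# so this figure overstates the real growth.
solar_percent_change = (507 - 1) / 1 * 100
solar_percent_change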

50600.0
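
The counts above include fractional years such as 2013.5, each of which forms its own one-plant group. A more robust comparison would first bucket plants into whole years. The sketch below uses made-up year values and assumes that flooring (rather than rounding) is the right way to assign a fractional year; `percent_change` is a helper introduced here, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical commissioning years, including fractional values and a missing one.
years = pd.Series([2013.0, 2013.5, 2013.36, 2013.0,
                   2017.0, 2017.0, 2017.0, 2017.0, 2017.0, 2017.0, np.nan])

# Assumption: assign each plant to a whole year by flooring.
whole_years = np.floor(years.dropna()).astype(int)
per_year = whole_years.value_counts().sort_index()


def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


growth = percent_change(per_year.loc[2013], per_year.loc[2017])
print(per_year.to_dict())
print(f"{growth:.1f}% change from 2013 to 2017")
```

With the baseline taken from the whole-year bucket, the growth figure reflects all plants commissioned in the starting year instead of a single outlier row.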

# Conclusions:

### Developed nations versus developing nations:
As we might expect, developed nations tend to have the most power plants, while poorer, developing nations have the fewest.


### The most popular fuel type:
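Surprisingly, the most popular fuel type globally, measured by number of plants, is solar. Solar is also the most common fuel in each of the top three countries by number of power plants. This is encouraging for the future of renewable energy and the environment. In the United States alone, solar commissionings peak at 507 plants in 2017. The 50,600% growth figure computed above should be read with caution: its baseline of 1 is the count for a single fractional commissioning_year value (such as 2013.5), not the total for 2013, so it overstates the real growth.

Despite these encouraging numbers, China's continued reliance on coal stands out: 946 of its 4,235 plants list coal as the primary fuel.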


### Changes over time:
Among plants with a recorded commissioning year, quite a lot has changed since 2000. For plants commissioned in 2000, the top three fuel types were:
- Gas (74)
- Coal (42)
- Oil (38)

Contrast that with plants commissioned in 2018, where the top three fuel types were:
- Solar (474)
- Gas (49)
- Wind (40)