# Phase 2

Link to our GitHub repository: https://github.com/akkij26/info_2950

### Disclaimer

*Our research focus changed drastically during Phase 2. We did some initial cleaning and analysis on our preliminary data, but eventually decided to refocus our attention. We had some data on renewable energy generation in the US that we planned to use simply to contextualize our initial question, which dealt primarily with placing windmills so as to generate the most energy while remaining considerate of other social, environmental, and economic factors. As our research progressed, however, we decided that refocusing our question would require more data on energy usage and production (and on the sources involved), so we went looking for better datasets. We did eventually find better datasets, and we understand that this is simply part of the process, but we wanted to acknowledge the work that went into Phase 2 and that will, more likely than not, be scrapped in the final phase of this project. As such, we keep that preliminary work in this notebook, and we note explicitly which datasets we will probably not use in the future.*

<a id="introduction"></a>
## Introduction

The 2022 global energy crisis began in 2021, shortly after the Covid-19 pandemic, and has affected quite literally everyone. We observe it every day: increased gas prices, headlines about oil exports, and so on. This crisis is not limited to the energy industry; it is bound to have a domino effect on other sectors such as food and labor, and on the economy as a whole. We are staring at a global shortage which must be addressed. Moreover, the use of fossil fuels will have to decline in the coming years, not only because of the limited remaining supply but also because of its negative and cyclical effects on the environment. We became interested in the energy sector mainly because of the dire effects of leaving its ever-increasing consumption demands unaddressed, and because of the deleterious consequences of making no changes to current energy consumption and production. We are therefore interested in understanding how well equipped we are to deal with this predicted global shortage. We also wish to propose alternatives in terms of renewable energy, and in doing so we plan to analyze different energy sources to find the best switch that remains economically sound.

Throughout Phase II we continuously iterated on our research question to find something that was both interesting and unique. Through the research questions listed below, we hope to estimate and forecast the emerging gap between energy production and consumption and to find the most efficient transition to renewable sources to fill this gap.



### Research Question(s):

Can we predict US energy consumption and production by type (i.e., renewable, fossil fuels) over the next 30 years? And if so, what do these predictions tell us about energy needs and the US's ability to meet those demands?

Given a host of social, economic, and environmental factors (including variables such as wind speed and distance to an urban area), what are the best locations for windmill placement in the US?

Can we predict energy generation from the selected set of windmills? What percentage of the demand identified in our analysis of energy consumption and production would that generation cover?

What other recommendations can we make for future energy considerations in the US?


<a id="section_1"></a>
## Dataset Description 

**Dataset 1 and Dataset 2 (Consumption and Production)**

The datasets consumption_df and production_df are taken from the EIA Annual Energy Review and detail US energy production and consumption (https://ourworldindata.org/fossil-fuels) from 1973 to June 2022. Values are given in units of quadrillion BTU. The dataset itself draws on several sources (BP Statistical Review of World Energy, The Shift Project (TSP), IEA – International Energy Agency, Energy Information Administration, World Development Indicators – World Bank). We looked into these sources and found that some are US government agencies and some are cooperatives of different shareholders. Most of the data were collected to provide general information to the public rather than to support a particular project or result, which we believe reduces the risk of a biased dataset.

We are using this dataset to assess the ability of US energy production sources to meet demand, in order to anticipate production and technology challenges. We plan to do a comparative study of the growth rates of the two to predict future shortages or trends, depending on what we find in our analysis. Right now there are two separate datasets, but we combine them later for easier analysis. At a high level, we can see that renewable resources have increased over the years, but that when new methods of fossil fuel production emerged, fossil fuel usage increased as well. Our goal is to further analyze the difference in these growth rates.

**Dataset 3 (Windmill Dataset)**

The dataset ref_df is taken from the Geospatial Data Science datasets presented by NREL (National Renewable Energy Laboratory). The link can be found here: https://www.nrel.gov/gis/wind-supply-curves.html. NREL falls under the U.S. Department of Energy and works as a government agency. The purpose of this data collection was to provide general information to the public and to create visualizations for understanding geospatial data and how it varies from place to place. Since the data were not collected for any particular study, we believe the risk of a biased dataset is reduced.

*Reason why we are using wind energy:* After graphing total renewable energy production, we notice an upward trend in most of the renewable energy sources except for Hydroelectric Power Production. Looking closely, we notice that by the end of 2022 the wind energy production curve had the steepest slope and one of the largest contributions. Therefore, we decided to focus on harnessing this energy resource to find an optimal solution for future energy requirements.

The dataset provides us with a list of potential areas for windmill placement. The columns include the coordinates of each place (longitude and latitude), the area of the location (area_sq_km), energy generation capabilities (capacity_mw, generation_mwh, capacity_factor), wind speed (wind_speed_120meters), and some information about delivering the generated energy to the grid (distance_to_transmission_km). We did some initial mapping and found that while the dataset gives us a lot of locations (for context, it initially contained 300,000 rows), not all of them are practically usable. We plan to consider other factors, such as biodiversity considerations, size restrictions for economically sound wind farms, organizational measures, etc., to narrow down this list. Our goal is to find the energy demand gap using consumption_df and production_df, and then to find the best locations for windmill placement to fill this gap.

**Dataset 4 (renewable energy)** 

We do not plan to use this dataset in the final phase. Its information is subsumed by Datasets 1 and 2, so please see the description above.

## Data Limitations

**Geographic limitations:** Our datasets are focused on energy production and consumption inside the United States. The conclusions may not be generalizable across other regions. 

**Time period limitations:** Our consumption and production data cover 1973 through June 2022.

**Factor consideration limitations:** We consider only a limited set of factors which, according to our research, seem to have high priority. This set may not be comprehensive, since we do not include monetary restrictions, inflation, other economic considerations, climate, natural-disaster-prone areas, etc.

**Social factor consideration limitations:** We are entering an era of climate change activism, which is causing social changes. These changes may not be measurable with the tools we are currently familiar with and are therefore excluded from our model.

**Assumption limitation:** We are considering consumption to be equal to demand in our analysis which might not always be the case.



## Data Cleaning Description

Here, we clean the datasets mentioned in [Dataset Description](#section_1) above, and we also explain this process in detail.  

### Loading Data

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from scipy.special import expit as logistic
from scipy.optimize import curve_fit
from scipy import stats
import statsmodels.api as sm
from matplotlib.pyplot import figure

import duckdb, sqlalchemy

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn import metrics
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error 

%load_ext sql

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

%sql duckdb:///:memory:


In [None]:
#please see above for description of these datasets
ref_df = pd.read_csv("reference.csv")
consumption_df = pd.read_csv("consumption.csv")
production_df = pd.read_csv("production.csv")

#this dataset will most likely not be used in final
us_total_df = pd.read_csv("total_renewable.csv")

### Cleaning Data

**ref_df:** First, ref_df arrived in a format that was easy to parse. It contains appropriate column titles and all the criteria that we plan to use in identifying ideal windmill placement. We first check the dtypes to ensure that all columns contain float values that we can easily analyze. We also note that the column "Unnamed: 0" is simply an identifier that the data collectors used for the separate locations; it serves no purpose in our analysis, so we drop it. We also rename the column wind_speed_120meters to simply wind_speed.

Finally, we show the first 5 rows of the cleaned ref_df.

In [None]:
#check ref_df dtypes
print(ref_df.dtypes)

#drop unneeded first column
ref_df = ref_df.drop('Unnamed: 0', axis = 1)

#rename wind_speed_120meters column
ref_df = ref_df.rename(columns={"wind_speed_120meters": "wind_speed"})

#show first 5 rows of cleaned dataset
ref_df.head()

**consumption_df and production_df:** For consumption_df and production_df, we did a SQL INNER JOIN so that we only have to reference one dataset instead of two. The column 'Month' contains the same information in consumption_df and production_df, so we join on that key and delete the duplicate 'Month' column that results. The 'Month' column is renamed 'date' since it contains more information than just the month. We then convert the 'date' column to datetime to make analysis easier later on. We additionally add month and year columns to our new dataframe since, later on, we may want to focus on certain years and/or months.

We also check the dtypes attribute of this new dataframe, named *total_df* so as to continue cleaning. 

In [None]:
#renaming column 'Month'
production_df = production_df.rename(columns={"Month": "date"})

#inner join 
%sql total_df << SELECT * FROM production_df INNER JOIN consumption_df ON production_df.date = consumption_df.Month  

#delete duplicate
total_df = total_df.drop('Month', axis = 1)

#change 'date' to datetime
total_df['date'] = pd.to_datetime(total_df['date'], format = "%Y %B")

#add month and year columns
total_df['year'] = pd.DatetimeIndex(total_df['date']).year
total_df['month'] = pd.DatetimeIndex(total_df['date']).month

#check dataframe dtypes attribute
print(total_df.dtypes)
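
The same inner join can also be expressed directly in pandas. The following is only a minimal sketch on made-up values, not the code we ran: the toy frames `production` and `consumption` are hypothetical stand-ins for production_df and consumption_df, and we assume both share a 'Month' key formatted like "1973 January".

```python
import pandas as pd

# Toy stand-ins for production_df and consumption_df (values are made up).
production = pd.DataFrame({
    "Month": ["1973 January", "1973 February"],
    "Total Renewable Energy Production": [0.40, 0.36],
})
consumption = pd.DataFrame({
    "Month": ["1973 January", "1973 February"],
    "Total Renewable Energy Consumption": [0.40, 0.36],
})

# Inner join on the shared key; merging on a common column name keeps one copy of it,
# so there is no duplicate 'Month' column to drop afterwards.
merged = production.merge(consumption, on="Month", how="inner")
merged = merged.rename(columns={"Month": "date"})

# Parse "YYYY MonthName" strings and derive the year and month columns.
merged["date"] = pd.to_datetime(merged["date"], format="%Y %B")
merged["year"] = merged["date"].dt.year
merged["month"] = merged["date"].dt.month
```

Merging on a column that has the same name in both frames avoids the duplicate-key cleanup step, which is the main reason one might prefer it over `SELECT *` here.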

When checking the dtypes of the dataframe, we note that several columns have dtype object. This is because some cells hold the string 'Not Available'. To prevent difficulty when analyzing this data, we convert all columns but the date column to float64 and assign NaN (not a number) to those cells which contain string values, as follows:

In [None]:
#making column values float64 where applicable 
cols = total_df.columns.values  
for i in range(1,26):
    name = cols[i]
    total_df[name] = pd.to_numeric(total_df[name], errors = 'coerce')
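
To illustrate what `errors = 'coerce'` does, here is a small sketch on a made-up column (the values are hypothetical). It also counts how many cells were turned into NaN, which can be a useful sanity check after a conversion like the one above.

```python
import pandas as pd

# A made-up column mixing numeric strings with the 'Not Available' marker.
raw = pd.Series(["1.25", "Not Available", "3.5"])

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

# Number of cells that were coerced to NaN.
n_missing = int(clean.isna().sum())
```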

Finally, we show the first 5 rows of the cleaned total_df.

In [None]:
total_df.head()

**us_total_df:**
*We will most likely not be using this dataset, but since we cleaned it and made some plots of the data, we show that work here.*

This dataframe is very messy at first glance. The first 5 rows are shown below for reference. There are unnecessary columns, and the titles are contained within a row of the dataframe. There is also a row which does not contain any values.

In [None]:
us_total_df.head()

As such, we reset all the indices, take the first row and make it the column titles. We drop the row with no values in it, we drop the first column which does not seem to serve any purpose, and we transpose so that we can have years in one column so as to make analysis/plotting easier later on. We make sure every value within the table is a float, and we change the year column into datetime. 

Finally, the first 5 rows are displayed. 

In [None]:
us_total_df = us_total_df.reset_index(drop = False).drop(labels='level_0', axis=1)
index = us_total_df.iloc[0,:].squeeze()[:].reset_index(drop=True)

us_total_df.columns.values[::] = index
us_total_df = us_total_df.drop(0)
us_total_df = us_total_df.transpose().reset_index(drop = False)

index_2 = us_total_df.iloc[0,:].squeeze()[:].reset_index(drop=True)
us_total_df.columns.values[::] = index_2

us_total_df = us_total_df.drop(0)
us_total_df = us_total_df.rename(columns={'Technology': 'Year'}).drop(labels= 'Bagasse', axis = 1)

cols = us_total_df.columns
us_total_df[cols[1:]] = us_total_df[cols[1:]].astype(float)

us_total_df['Year'] = pd.to_datetime(us_total_df['Year'], format = "%Y")
us_total_df['Year'] = pd.DatetimeIndex(us_total_df['Year']).year

us_total_df.head()

## Preliminary Analysis

<a id="total_df"></a>

### Concerning total_df

We wanted to build some context at the beginning of our work. As we note in our [Introduction](#introduction), while renewable energy production/consumption are on the rise, we wanted to see these trends over time. Here, we initially plot total fossil fuels production/consumption and total renewable energy production/consumption in the US. Note that the units are quadrillion BTU. (Note: we may write a function later to make plotting easier if we want to see a particular source and its trends over time.)

In [None]:
#plotting total fossil fuels production/consumption and total renewable energy production/consumption
sns.lineplot(data = total_df, x = 'date', y = 'Total Fossil Fuels Production', label = 'Total Fossil Fuels Production')
sns.lineplot(data = total_df, x = 'date', y = 'Total Fossil Fuels Consumption', label = 'Total Fossil Fuels Consumption')
sns.lineplot(data = total_df, x = 'date', y = 'Total Renewable Energy Production', label = 'Total Renewable Energy Production')
sns.lineplot(data = total_df, x = 'date', y = 'Total Renewable Energy Consumption', label = 'Total Renewable Energy Consumption')

plt.title("Fossil Fuel and Renewable Energy Production and Consumption Over Time")
plt.xlabel("time")
plt.ylabel("Quadrillion BTU")
 
plt.legend(bbox_to_anchor=(1.0, 1.0))
plt.show()

Here, we see that total renewable energy production and consumption are about equal. We note that fossil fuel consumption is usually higher than production. We also observe that while renewable energy production has been on the rise, fossil fuel production has not necessarily been in decline. The next two plots make this clearer. We plot total renewable energy production over time, along with the production curves for the different renewable sources, to give an initial visualization of how much of that renewable energy production comes from each type. We see that wind has had the steepest curve as of late. We also plot total fossil fuels production on its own over the years.

In [None]:
#plotting total renewable energy production as well as wind energy production curve
sns.lineplot(data = total_df, x = 'date', y = 'Total Renewable Energy Production', label = 'Total Renewable Energy Production')
sns.lineplot(data = total_df, x = 'date', y = 'Wind Energy Production', label = 'Wind Energy Production')
sns.lineplot(data = total_df, x = 'date', y = 'Hydroelectric Power Production', label = 'Hydroelectric Power Production')
sns.lineplot(data = total_df, x = 'date', y = 'Geothermal Energy Production', label = 'Geothermal Energy Production')
sns.lineplot(data = total_df, x = 'date', y = 'Solar Energy Production', label = 'Solar Energy Production')
sns.lineplot(data = total_df, x = 'date', y = 'Biomass Energy Production', label = 'Biomass Energy Production')


plt.title("Renewable Energy Production Over Time")
plt.xlabel("time")
plt.ylabel("Quadrillion BTU")
plt.legend(bbox_to_anchor=(1.0, 1.0))
plt.show()

#plotting total fossil fuels production alone over the years
sns.lineplot(data = total_df, x = 'date', y = 'Total Fossil Fuels Production')
plt.title("Fossil Fuels Production Over Time")
plt.xlabel("time")
plt.ylabel("Quadrillion BTU")
plt.show()

To make this heavy dependence on fossil fuels clearer still, we make a pie chart of the most recent year's energy production by type. First, we create a new dataframe containing only data from 2022, and we take the mean of each individual energy type.

In [None]:
# make a pie chart for 2022 so far 
%sql energy_2022 << SELECT * FROM total_df WHERE year = 2022
col = energy_2022.columns
avg_production = pd.DataFrame(energy_2022[col[1:12]].mean(axis=0))
avg_production = avg_production.rename(columns={ 0 : "Amount"}).drop('Total Fossil Fuels Production')
                                                                    
avg_production.plot.pie(y='Amount', figsize=(8, 8))
plt.legend(bbox_to_anchor=(1.7, 1.0))
plt.title("2022 US Energy Production by Energy Type")
plt.show()
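
Alongside the pie chart, the same averages can be expressed as percentage shares. The sketch below uses hypothetical values and source names purely for illustration; in the notebook, the input would be the per-source means computed above.

```python
import pandas as pd

# Hypothetical average monthly production by source, in quadrillion BTU.
avg_by_source = pd.Series({"Coal": 1.0, "Natural Gas": 3.0, "Wind": 0.5, "Solar": 0.5})

# Each source's share of the total, as a percentage, largest first.
share_pct = (avg_by_source / avg_by_source.sum() * 100).round(1).sort_values(ascending=False)
```

Printing `share_pct` gives the numbers behind each slice, which is easier to quote in text than reading values off the chart.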

We note that these curves look logistic over time. For this preliminary analysis, we simply look at total renewable energy production over the years (taking the yearly average of the monthly values to get annual renewable energy production). We plot these points on a scatterplot and overlay first a linear fit and then an exponential curve. Visually, the exponential curve matches the data better.

In [None]:
annual_df = total_df[['year', 'Total Renewable Energy Production']].groupby('year').mean().reset_index()
annual_df

x = annual_df['year']
y = annual_df['Total Renewable Energy Production']

sns.lmplot(data = annual_df, x = 'year', y = 'Total Renewable Energy Production')
plt.show()

curve = np.polyfit(x, np.log(y), 1)
x_new = np.linspace(np.min(x), np.max(x))
y_new = np.exp(curve[1]) * np.exp(curve[0] * x_new)

sns.scatterplot(data=annual_df, x= x, y= y)
plt.plot(x_new, y_new)
plt.show()
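
The visual comparison between the linear and exponential fits can be backed by a number. The following is a minimal sketch on synthetic data (a series we made up that grows 3% per year), not our actual results; the helper `rmse` and the shift of the years to start at zero are our own assumptions for the illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Synthetic series that grows 3% per year (illustrative only).
years = np.arange(1973, 2023)
t = years - years[0]          # shift so the first year is 0; keeps the fit well conditioned
y = 0.3 * np.exp(0.03 * t)

# Linear fit: y ~ a*t + b
a, b = np.polyfit(t, y, 1)
rmse_linear = rmse(y, a * t + b)

# Exponential fit via a straight line in log space: log(y) ~ r*t + log(y0)
r, log_y0 = np.polyfit(t, np.log(y), 1)
rmse_exp = rmse(y, np.exp(log_y0) * np.exp(r * t))
```

Comparing `rmse_linear` with `rmse_exp` on the real annual_df would give a quantitative version of the statement that the exponential curve matches better.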

As we continue, we will want to fit curves such as this one to predict both production and consumption based on past trends. In this way, we would like to anticipate the ability to meet demand and forecast production and consumption some number of years into the future. Questions arise, such as: how will production and consumption compare? More questions will arise as we consider ref_df.

<a id="ref_df"></a>
### Concerning ref_df

When it comes to ref_df, we want to come up with an equation or some other way to rank the criteria listed in the dataset. If we attach a weight to each factor, we can then suggest where best to place windmills. Note that the factors contained in ref_df range from distance to an urban area to wind speed to whether land is protected, and so on. Eventually, we would like to predict the energy generation as well as the economic output of a given set of suggested windmill placements, and what percentage of the demand identified from analyzing total_df that generation would cover.
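
One possible shape for such a weighted ranking is sketched below. This is only an illustration under our own assumptions: the function `rank_sites`, the min-max normalization, the weights, and the toy values are all hypothetical, and we have not yet settled on an actual equation.

```python
import pandas as pd

def rank_sites(df, weights):
    """Score rows by a weighted sum of min-max normalized columns.

    A positive weight rewards larger values; a negative weight penalizes them.
    """
    score = pd.Series(0.0, index=df.index)
    for col, w in weights.items():
        span = df[col].max() - df[col].min()
        # Constant columns carry no ranking information, so they contribute 0.
        normalized = (df[col] - df[col].min()) / span if span > 0 else 0.0
        score += w * normalized
    return df.assign(score=score).sort_values("score", ascending=False)

# Toy sites with made-up values.
sites = pd.DataFrame({
    "capacity_factor": [0.30, 0.45, 0.40],
    "wind_speed": [6.0, 8.0, 7.5],
    "distance_to_transmission_km": [5.0, 40.0, 10.0],
})

# Hypothetical weights: reward capacity factor and wind speed, penalize distance.
weights = {"capacity_factor": 0.5, "wind_speed": 0.3, "distance_to_transmission_km": -0.2}
ranked = rank_sites(sites, weights)
```

Normalizing each column to the range 0 to 1 before weighting keeps a factor measured in large units (such as kilometers) from dominating the score simply because of its scale.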

Since we have yet to come up with our equation/ranking system, there is not much preliminary analysis we can do here that would serve much purpose. Instead, we plot the longitudes and latitudes, colored by capacity factor in the first map and by available area (in sq km) in the second. The first map gives us a good idea of where the most energy generation would take place, given that capacity_factor is the electrical energy generated per unit of capacity. The second map, however, begins to show some of the clash between our different criteria, in that the places with the largest available areas are not necessarily located at the same longitudes and latitudes as the places with the highest capacity factors.

We may plan on adding an actual map of the US underneath this scattering, but for now, it is easy to see the outline of the US without it. 

In [None]:
ref_df.plot(x="longitude", y="latitude", kind="scatter", c="capacity_factor", colormap="YlGn", figsize = (12,6))
plt.show()

In [None]:
ref_df.plot(x="longitude", y="latitude", kind="scatter", c="area_sq_km", colormap="PuBuGn", figsize = (12,6))
plt.show()

copy_ref_df = ref_df.copy()

## Acknowledgements

*Here, we list other sites that helped us but are not included in the links to our datasets:*

for color scheme identifiers on maps: 
https://matplotlib.org/stable/tutorials/colors/colormaps.html

for plotting maps:
https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

for some questions on datetime:
https://dataindependent.com/pandas/pandas-to-datetime-string-to-date-pd-to_datetime/

for how to make a pie plot:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.pie.html

Thanks for reading!

# Phase 3

## Building a Forecasting Model for Renewable Energy Production

**MODEL 1**

First, let's try some naive fitting. Using PolynomialFeatures, we transform our x-values and run a regression on the transformed values. We do a train/test split, taking care because this is time-series data. Finally, we predict the next 30 years of renewable energy production.

In [None]:
x = pd.array(annual_df['year'])
y = pd.array(annual_df['Total Renewable Energy Production'])

poly = PolynomialFeatures(degree=3, include_bias=False)
x1 = poly.fit_transform(x.reshape(-1,1))

test_size = 10

#since this is time-series data, split chronologically instead of randomly
x_train = x1[:-test_size]
y_train = annual_df['Total Renewable Energy Production'][:-test_size]
x_test = x1[-test_size:]
y_test = annual_df['Total Renewable Energy Production'][-test_size:]

model_1 = LinearRegression().fit(x_train, y_train)
predictions_train = model_1.predict(x_train)
predictions_test = model_1.predict(x_test)

plt.scatter(x, y, label = "data")
plt.plot(x[:-test_size], predictions_train, c="red", label = "train fit")
plt.plot(x[-test_size:], predictions_test, c = "green", label = "test fit")
plt.legend(loc='best')
plt.show()

mse_train = mean_squared_error(y_train, predictions_train)
mse_test = mean_squared_error(y_test, predictions_test)
mae_train = mean_absolute_error(y_train, predictions_train)
mae_test = mean_absolute_error(y_test, predictions_test)
    
print('\ncoef = ', model_1.coef_, '\nintercept = ', model_1.intercept_, '\ntrain RMSE = ', np.sqrt(mse_train),\
        '\ntest RMSE = ', np.sqrt(mse_test), '\ntrain MAE = ', mae_train,'\ntest MAE = ', mae_test)

model = LinearRegression().fit(x1, y)
y_predicted = model.predict(x1)

years = np.linspace(2022, 2052, 31)
years_res = poly.transform(years.reshape(-1,1))
new_pred = model.predict(years_res)

plt.plot(x, y_predicted, c="red", label = "fit")
plt.plot(years, new_pred, c = "green", label = "future predictions")
plt.legend(loc='best')
plt.show()

future_df = pd.DataFrame([])
future_df['year'] = years
future_df['Total Renewable Energy Production'] = new_pred
total_df = pd.concat([annual_df, future_df])

x_up = x1.tolist()
y_up = y.tolist()
x_up = sm.add_constant(x_up)
 
result = sm.OLS(y_up, x_up).fit()
print(result.summary())
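
The chronological split used above can be written as a small reusable helper. The sketch below is a simplified illustration on synthetic data, not the model we fit: the helper `chronological_split`, the degree-2 polynomial, and the centering of the years on the training mean are our own assumptions. Centering is shown because powers of raw year values (for example 2000 cubed) get very large, which can make a polynomial fit numerically fragile.

```python
import numpy as np

def chronological_split(x, y, test_size):
    """Hold out the last `test_size` observations; no shuffling."""
    return x[:-test_size], x[-test_size:], y[:-test_size], y[-test_size:]

# Synthetic, exactly quadratic series (illustrative only).
years = np.arange(1973, 2023, dtype=float)
production = 0.002 * (years - 1973) ** 2 + 0.3

x_train, x_test, y_train, y_test = chronological_split(years, production, test_size=10)

# Center on the training mean so that powers of the year stay small.
center = x_train.mean()
coefs = np.polyfit(x_train - center, y_train, deg=2)

# Evaluate on the held-out, most recent years.
pred_test = np.polyval(coefs, x_test - center)
rmse_test = float(np.sqrt(np.mean((y_test - pred_test) ** 2)))
```

A large gap between train and test error from a split like this is the usual sign of over-fitting, since the test years are the ones the model never saw.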

Our RMSE scores suggest that we may be over-fitting the data. Instead, let us try to fit the data with a logistic transformation. We will see right away, visually, that this does not work well either.

Now, we try to fit to a simple s-curve and make more predictions.

In [None]:
x = pd.array(total_df['year'])
y = pd.array(total_df['Total Renewable Energy Production'])

def fourPL(x, A, B, C, D):
    return ((A-D)/(1.0+((x/C)**(B))) + D)

def PL4_pred (x):
    y_pred = ((A-D)/(1.0+((x/C)**(B))) + D)
    return y_pred

guess = [min(y), np.median(x), 2030 , max(y)] 
params, params_covariance = curve_fit(fourPL, x, y, guess)

A = params[0]
B = params[1]
C = params[2]
D = params[3]

upy = np.linspace(2052, 2080, 29)
new_y_vals_3 = np.zeros(29)
for i in range(29):
    out = PL4_pred(upy[i])
    new_y_vals_3[i] = out

y_nnn = fourPL(x, *params)
#plt.plot(x, y, 'o', label='data')
plt.plot(x,y_nnn, label='fit')
plt.plot(upy, new_y_vals_3, label = 'predictions')
plt.legend(loc='best')
plt.show()
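
For comparison, a plain three-parameter logistic (rather than the four-parameter form used above) is sketched below on synthetic, noise-free data. This is an alternative we have not applied to our dataset; the function `logistic`, the starting guesses in `p0`, and the data are assumptions made for the illustration. Subtracting the midpoint year inside the exponent keeps the numbers small, which tends to make `curve_fit` easier to start.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Three-parameter logistic: ceiling L, growth rate k, midpoint year t0."""
    return L / (1.0 + np.exp(-k * (t - t0)))

# Synthetic, noise-free data generated from known parameters.
years = np.arange(1973, 2023, dtype=float)
y = logistic(years, 10.0, 0.15, 2010.0)

# Rough starting guesses: a ceiling above the observed maximum, a modest growth
# rate, and a midpoint in the middle of the observed years.
p0 = [1.5 * y.max(), 0.1, np.median(years)]
params, _ = curve_fit(logistic, years, y, p0=p0, maxfev=10000)

# Forecast the following 30 years with the fitted parameters.
forecast = logistic(np.arange(2023, 2053, dtype=float), *params)
```

With real, noisy data the fitted ceiling `L` is often poorly determined when the series has not yet passed its midpoint, so any long-range forecast from such a fit should be read with caution.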

In [None]:
# evaluation of significance 
#coef_df = np.zeros(1000)
#for i in range(1000):
#    sample_df = new_tot.sample(n = new_tot.shape[0], replace = True)
#    model = LinearRegression().fit(sample_df[['year']], sample_df[['energy']])
#    coef_df[i] = model.coef_[0][0]

#plt.hist(coef_df) 
#plt.show()


coef_df = np.zeros(1000)
for i in range(1000):
    sample_df = annual_df.sample(n = annual_df.shape[0], replace = True)
    model = LinearRegression().fit(sample_df[['year']], sample_df[['Total Renewable Energy Production']])
    coef_df[i] = model.coef_[0][0]

plt.hist(coef_df) 
plt.show()
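
A histogram of bootstrapped slopes like the one above can be summarized with a percentile confidence interval. The sketch below uses synthetic data and NumPy only; the series, the seed, and the 95% level are assumptions for the illustration. Note that resampling rows independently ignores any autocorrelation in a time series, so the interval should be treated as a rough guide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic annual series with a true slope of 0.05 per year plus noise.
years = np.arange(1973, 2023, dtype=float)
y = 0.05 * (years - 1973) + 0.3 + rng.normal(0, 0.1, size=years.size)

n_boot = 1000
slopes = np.empty(n_boot)
for i in range(n_boot):
    # Resample row indices with replacement, then refit a straight line.
    idx = rng.integers(0, years.size, size=years.size)
    slopes[i] = np.polyfit(years[idx] - 1973, y[idx], 1)[0]

# 95% percentile interval for the slope.
ci_low, ci_high = np.percentile(slopes, [2.5, 97.5])
slope_is_positive = ci_low > 0
```

If the interval excludes zero, as `slope_is_positive` checks, the upward trend is unlikely to be an artifact of which years happened to be sampled.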

In [None]:
#new_tot = pd.DataFrame([])
#year = np.zeros(150)
#energy = np.zeros(150)

#j = 0
#i = 0
#for i in range(50):
 #   year[i] = x1[j][0]
 #   year[i+1] = x1[j][1]
 #   year[i+2] = x1[j][2]
 #   energy[i] = annual_df['Total Renewable Energy Production'][j]
 #   energy[i+1] = annual_df['Total Renewable Energy Production'][j]
 #   energy[i+2] = annual_df['Total Renewable Energy Production'][j]
 #   j+=1
  #  i+=3
  #  if j%50 == 0:
  #      j = 0
  
#new_tot['year'] = year
#new_tot['energy'] = energy
#new_tot
#coef_1[0:30]
#coef_1  = new_tot.iloc[::3, :]
#coef_1.head()

# Finding Optimal Positions for Windmill Placement

In this next part of our project, we will be focusing on predicting the best positions for windmill placement to cover the energy gap predicted by our models in the next 30 years. 
<br> 

Below is the dataset we started working with: 

In [None]:
ref_df

We have divided the parameters used to judge the likelihood of a certain area being a potentially good location for a windmill farm into 3 broad categories: 
 
1. Efficiency Based Parameters 
2. Location Based Parameters
3. Monetary Based Parameters 

### Efficiency Based Parameters
<br>
One of our main considerations is the expected capacity factor of a wind turbine placed at a certain location.
The capacity factor of a wind turbine is its average power output divided by its maximum power capability. In essence, it is a measure of predicted performance. For a wind turbine, the maximum possible annual output is the capacity x 8760 hr (there are 8760 hours in a year).
<br> 
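
As a worked example of the definition above, the sketch below computes a capacity factor from annual generation and rated capacity. The site figures are hypothetical and chosen only to make the arithmetic easy to follow.

```python
HOURS_PER_YEAR = 8760

def capacity_factor(generation_mwh, capacity_mw):
    """Annual energy actually generated divided by the maximum possible at full output."""
    return generation_mwh / (capacity_mw * HOURS_PER_YEAR)

# Hypothetical 100 MW site generating 315,360 MWh in a year.
# Maximum possible output: 100 MW x 8760 h = 876,000 MWh.
cf = capacity_factor(315_360, 100)
```

Here `cf` comes out to 0.36, meaning the site produces 36% of what it would if it ran at full rated power all year.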

The capacity factor of land-based wind in the U.S. ranges from 24% to 56% and averages 36%. (Statistics are taken from the University of Michigan's Center for Sustainable Systems.) The histogram below shows the distribution of the capacity factor in our dataset.

In [None]:
ref_df.hist(column = 'capacity_factor', bins = 7)
plt.show()

In [None]:
#Akanksha_to_elida: not sure what this part is doing? did you want to show a correlation between these two? 
plt.scatter(data = ref_df, x = 'capacity_factor', y = 'distance_to_transmission_km')
plt.title("Capacity Factor Over Distance to Transmission")
plt.xlabel("Capacity Factor")
plt.ylabel("Distance to Transmission (km) ")
plt.show()

According to our research, a wind turbine seems to be economically viable when its capacity factor lies somewhere between 30% and 60%. This range covers the average and helps ensure that we are not wasting resources and that we get a practical return on investment. After applying this restriction, we have done some of our initial filtering, with our potential options coming down to _____ from ________. 

Potential idea to explore: we could probably formulate this as a linear programming problem.

In [None]:
ref_df.drop(ref_df[ref_df['capacity_factor'] < 0.3].index, inplace = True)
ref_df.shape
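
The before-and-after counts for a filter like this can be computed directly. The sketch below runs on toy values and, unlike the cell above (which only applies the 0.3 lower bound), applies both ends of the 30% to 60% range; treating the upper bound as a hard cutoff is our assumption here.

```python
import pandas as pd

# Toy capacity factors (made up); in the notebook this would be ref_df.
sites = pd.DataFrame({"capacity_factor": [0.22, 0.31, 0.45, 0.58, 0.65]})

n_before = len(sites)

# Keep rows whose capacity factor lies in [0.30, 0.60], inclusive on both ends.
in_range = sites["capacity_factor"].between(0.30, 0.60)
filtered = sites[in_range]

n_after = len(filtered)
```

Recording `n_before` and `n_after` at each filtering step makes it easy to report how many candidate locations each restriction removes.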

### Location Based Parameters

#### Annual Average Wind
For a windmill site, we are generally looking for an area with an annual average wind speed of at least 9 miles per hour (mph), or 4 meters per second (m/s), for small wind turbines and 13 mph (5.8 m/s) for utility-scale turbines. This is the floor for consideration as a potential area. 
<br> 
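
Because the floors above are quoted in both mph and m/s, it helps to keep the conversion explicit when applying them to a dataset. The sketch below shows the arithmetic; which unit the wind_speed column actually uses is something to confirm against the dataset documentation, and we make no assumption about it here.

```python
MPS_PER_MPH = 0.44704   # exact: 1 mile = 1609.344 m, 1 hour = 3600 s

def mph_to_mps(mph):
    """Convert a speed from miles per hour to meters per second."""
    return mph * MPS_PER_MPH

small_turbine_floor = mph_to_mps(9)     # about 4.0 m/s
utility_floor = mph_to_mps(13)          # about 5.8 m/s
```

Using named constants like `small_turbine_floor` in a filter makes it clear which unit a threshold is expressed in.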

#### Distance from obstacle 
Furthermore, we want to look at the geographical area around the windmill. The optimal physical location would be flat terrain without any "obstacles" nearby (this means that the area should not be closely surrounded by mountains or located at the edge of a valley). We approximate this by looking at the area, in square kilometers, available to us for the wind farm. 
<br>

Our research shows that the industry rule of thumb is as follows: the distance between a turbine and the nearest obstacle should be at least twice the turbine height, unless the turbine is more than twice the height of the obstacle, in which case the distance can be less (we do not consider the second part because of the added complexity). We have taken our wind turbine height to be 120 m, which means we want at least 0.5 sq km of available area. 

#### Distance to Urban Setting 
For this parameter, we are considering some of the social constraints of 

In [None]:
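#NOTE: 9 is the mph floor quoted above (about 4 m/s); confirm the units of wind_speed before relying on this cutoff
ref_df.drop(ref_df[ref_df['wind_speed'] < 9].index, inplace = True)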
ref_df.drop(ref_df[ref_df['area_sq_km'] < 0.5].index, inplace = True)
ref_df.drop(ref_df[ref_df['distance_to_transmission_km'] < 0.015].index, inplace = True)
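
The three drops above can also be expressed as a single boolean mask that returns a new dataframe and leaves the original untouched. This is a sketch on toy values, not a replacement we have tested on ref_df; the function name `filter_sites` and its parameter names are hypothetical. One behavioral difference to be aware of: rows with missing values fail a `>=` comparison and are removed here, whereas dropping on `<` keeps them.

```python
import pandas as pd

def filter_sites(df, min_wind_speed, min_area_sq_km, min_transmission_km):
    """Keep rows meeting every threshold; returns a new DataFrame and leaves `df` untouched."""
    mask = (
        (df["wind_speed"] >= min_wind_speed)
        & (df["area_sq_km"] >= min_area_sq_km)
        & (df["distance_to_transmission_km"] >= min_transmission_km)
    )
    return df[mask].reset_index(drop=True)

# Toy sites with made-up values.
sites = pd.DataFrame({
    "wind_speed": [8.0, 9.5, 10.0],
    "area_sq_km": [1.0, 0.2, 2.0],
    "distance_to_transmission_km": [0.5, 0.5, 0.5],
})

# Same thresholds as the cell above.
kept = filter_sites(sites, min_wind_speed=9, min_area_sq_km=0.5, min_transmission_km=0.015)
```

Keeping the thresholds as arguments makes it straightforward to rerun the filter with different cutoffs and compare how many locations survive.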

After our second phase of filtering, we are working with the following list of potential locations: 

In [None]:
print(ref_df.head()) 
print(ref_df.shape)

### Monetary Based Parameters

Lastly, to make our predictions more applicable, we came up with a formula that can help calculate the approximate cost of building a wind farm in a certain location. We have not made assumptions about what kind of budget is available to us, which is why we have left this part as future work for making our model more applicable rather than incorporating it into our windmill location ranking system. 
<br>

Our formula consists of the following parameters: 
<br>

###### FCR = Fixed Charge Rate 
- The fraction of the Total Installed Cost that must be set aside each year to retire capital costs to cover the interest on debt, return on equity, etc.
- For our purposes, we can assume this to be 7% or 0.07.
<br> 

###### CE = Capital Expenditure 
- The initial upfront capital required for setup (basically the initial cost) 
<br>

###### AEP = Net Annual Energy Production
- Refers to the net energy generated per year (in kWh), so that dividing a yearly cost by AEP gives a cost per kWh. 
- Derived from the capacity factor 
<br> 

###### RC = Replacement Cost 
- RC = Cost of turbine / Expected life + other costs 
- Refers to all the sinking cost related to replacement and overhauls of the mechanical or other technological aspects of the windmill. 
<br> 


Based on the parameters above, we can develop the following equation for C, the approximate cost per kWh of energy generated: 

C = [(FCR x CE) / AEP] + [(RC) / AEP ]
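
To show how the equation above would be evaluated, here is a minimal sketch. The function name `cost_per_kwh` and all input values are hypothetical and chosen only so the arithmetic is easy to check; they are not estimates for any real wind farm.

```python
def cost_per_kwh(fcr, capital_expenditure, replacement_cost, annual_energy_kwh):
    """C = (FCR x CE) / AEP + RC / AEP, in currency units per kWh."""
    if annual_energy_kwh <= 0:
        raise ValueError("annual_energy_kwh must be positive")
    return (fcr * capital_expenditure + replacement_cost) / annual_energy_kwh

# Hypothetical inputs, for illustration only:
# 7% fixed charge rate, 3,000,000 capital expenditure,
# 40,000 yearly replacement cost, 5,000,000 kWh generated per year.
c = cost_per_kwh(
    fcr=0.07,
    capital_expenditure=3_000_000,
    replacement_cost=40_000,
    annual_energy_kwh=5_000_000,
)
```

With these inputs, the yearly cost is 0.07 x 3,000,000 + 40,000 = 250,000, and dividing by 5,000,000 kWh gives `c` of about 0.05 per kWh.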

In [None]:
#150 meters away from a nearby obstruction
#half a square km 
ref_df['generation_mwh'].sum()

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(copy_ref_df['longitude'], copy_ref_df['latitude'], s=8, c='tan')
ax1.scatter(ref_df['longitude'], ref_df['latitude'], s=8, c=ref_df['capacity_factor'])
plt.show()

#copy_ref_df.plot(x="longitude", y="latitude", kind = 'scatter', figsize = (12,6))
#ref_df.plot(x="longitude", y="latitude", kind="scatter", c="capacity_factor", colormap="YlGn", figsize = (12,6))
#plt.show()