# Power Production - a Machine Learning project playground

>Author: **Andrzej Kocielski**  

This is a playground notebook for testing only. The actual project notebook is [Powerproduction_ML.ipynb](https://github.com/andkoc001/Machine-Learning-and-Statistics-Project.git/Powerproduction_ML.ipynb).

For more information see [README.md](https://github.com/andkoc001/Machine-Learning-and-Statistics-Project.git/README.MD).


___

## Loose notes and ideas

### Ideas

1. A number of observations in the data set have zero power output regardless of the wind speed. These data points should be removed from the analysis.
2. ...
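
The first idea can be sketched on a tiny made-up sample. The column names `speed` and `power` match the data set used below, but the values here are invented purely for illustration and are not taken from `powerproduction.txt`.

```python
import pandas as pd

# Made-up sample standing in for the real data (illustrative values only).
sample = pd.DataFrame({
    "speed": [0.0, 3.5, 10.2, 15.0, 24.4],
    "power": [0.0, 4.1, 0.0, 92.3, 0.0],
})

# Boolean mask marking the observations with no power output.
zero_mask = sample["power"] == 0
print(f"{zero_mask.sum()} of {len(sample)} rows have zero power")

# Keep only the rows where some power was produced.
cleaned = sample[~zero_mask].reset_index(drop=True)
print(cleaned)
```

Counting the zero rows before dropping them gives a rough idea of how much data the cleaning step would discard.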

### ML Techniques

- Unsupervised
    - Clustering
    - Dimensionality reduction
- Supervised
    - Regression
    - Classification
- Reinforcement learning

![image](https://miro.medium.com/max/700/1*AqNYz4M_GgfUN2ROb798yg.jpeg)

Image source: [Medium.com](https://miro.medium.com/max/700/1*AqNYz4M_GgfUN2ROb798yg.jpeg)

### ML algorithms

- Linear regression
- Logarithmic regression
- Decision Tree
- Decision Forest
- Random Forest
- t-Test
- k nearest neighbour (kNN)
- k-means
- ANOVA (analysis of variance)
- Support Vector Machine (SVM)
- Principal Component Analysis (PCA)
- Naive Bayes
- Dimensionality Reduction Algorithms

![image](https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.svg)

Image source: [Microsoft.com](https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.svg)

![image](https://miro.medium.com/max/1920/1*Lejtm0oGlOC5U0-J0JmGhg.png)

Image source: [Medium.com](https://miro.medium.com/max/1920/1*Lejtm0oGlOC5U0-J0JmGhg.png)

### To Do

1. Exploratory data analysis
2. Data cleaning
3. Data modeling (add / combine / infer additional data)
4. Select ML techniques to be used (explain why)
5. Do ML - analyse prediction accuracy etc. for various boundary conditions and parameters
6. Draw a conclusion

___

## Importing libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import requests

## Data set

In [None]:
# Data set is loaded from the file `powerproduction.txt`.
df = pd.read_csv(r"powerproduction.txt")

In [None]:
df

## Plotting with Seaborn

In [None]:
# Scatter plot of power against wind speed (no `palette`, as there is no `hue` variable to colour by)
sns.relplot(data=df, x="speed", y="power", s=20, height=6, aspect=2)

In [None]:

# Difference in wind speed between consecutive observations.
# Series.diff() leaves NaN in the first row (it has no predecessor), so that row is dropped.
df_sd = df['speed'].diff().dropna().to_frame(name='speed diff')
print(df_sd)

In [None]:
df_sd.describe()

In [None]:
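# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(df_sd['speed diff'], kde=True)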

In [None]:
# Which wind speeds dominate? They appear to be more or less uniformly distributed
plt.figure(figsize=(20,2))
sns.histplot(df.speed, bins=100)

In [None]:
plt.figure(figsize=(20,4))
sns.histplot(df.power, bins=100)

In [None]:
# Line plot of power against wind speed
# (`size_order` and `palette` dropped: there is no `size` or `hue` variable for them to act on)
sns.relplot(
    data=df,
    x="speed", y="power",
    kind="line",
    height=6, aspect=3, facet_kws=dict(sharex=False)
)

In [None]:
# Linear regression underfits this data
plt.figure(figsize=(18,6))
sns.regplot(data=df, x="speed", y="power", scatter_kws={'s':1})

In [None]:
# clean the dataset by removing all observations where the power output is zero

df_clean = df[df['power'] != 0]
df_clean

In [None]:
# Polynomial regression for cleaned dataset

a_plot = sns.lmplot(data=df_clean, x="speed", y="power", order=9, height=6, aspect=2, scatter_kws={'s':1})

a_plot.set(xlim=(0, 25))
a_plot.set(ylim=(0, 120))

plt.show()

The above polynomial appears to closely follow the pattern of the data points in the domain (wind speed in range 0-25).

Let's now apply the NumPy function `polyfit()` to get the values of the coefficients that minimise the squared error.

In [None]:
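# Fit on the cleaned data so the coefficients correspond to the polynomial plotted above
coeff = np.polyfit(df_clean['speed'], df_clean['power'], 9)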
#coeff

Testing the above - attempt to reproduce the plot of the polynomial with the above coefficients.

In [None]:
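# Fit on the cleaned data so the curve corresponds to the lmplot above
coeff = np.polyfit(df_clean['speed'], df_clean['power'], 9)
#coeff

#with warnings.catch_warnings():
#    warnings.simplefilter('ignore', np.RankWarning)    
#    y = np.poly1d(coeff)

# Callable polynomial built from the fitted coefficients
yp = np.poly1d(coeff)

x = np.linspace(0, 24.5, 101)

# Set the figure size before plotting; changing rcParams afterwards does not resize an existing figure
plt.figure(figsize=(15, 6))
plt.plot(x, yp(x))

plt.xlim(0, 25)
plt.ylim(0, 120)
plt.show()

print("y = ")
print(yp)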
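
To put a number on how closely a fitted polynomial follows the data, one option is to compare the residual sum of squares (the quantity `polyfit()` minimises) across polynomial orders. The sketch below is only an illustration under stated assumptions: the S-shaped curve and the noise are synthetic stand-ins, not the project data, and `rss` is a hypothetical helper introduced here.

```python
import numpy as np

# Synthetic S-shaped data with noise (an assumption, not the real data set).
rng = np.random.default_rng(0)
x = np.linspace(0, 25, 200)
y = 100 / (1 + np.exp(-(x - 12) / 2)) + rng.normal(0, 2, x.size)


def rss(degree):
    """Residual sum of squares of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals ** 2))


# Compare a straight line with two higher-order polynomials.
rss_by_degree = {degree: rss(degree) for degree in (1, 3, 9)}
for degree, value in rss_by_degree.items():
    print(f"degree {degree}: RSS = {value:.1f}")
```

A lower value on the training data does not by itself mean a better model, since a high-order polynomial can also fit the noise; a held-out test set would be needed to check that.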

In [None]:
from scipy.stats import linregress

#help(linregress)
# Pass the two columns explicitly: x is wind speed, y is power output
linregress(df['speed'], df['power'])

Where (from the function help file):

- `slope` - Slope of the regression line.
- `intercept` - Intercept of the regression line.
- `rvalue` - Correlation coefficient.
- `pvalue` - Two-sided p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic.
- `stderr` - Standard error of the estimated gradient.
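
For intuition, the first three values can be reproduced by hand with the textbook simple-regression formulas. This is a minimal sketch on five made-up points (not the project data) and is not a claim about how `linregress` is implemented internally.

```python
import numpy as np

# Five made-up points lying roughly on a straight line (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sums of squares and cross-products about the means.
x_mean, y_mean = x.mean(), y.mean()
sxx = np.sum((x - x_mean) ** 2)
syy = np.sum((y - y_mean) ** 2)
sxy = np.sum((x - x_mean) * (y - y_mean))

slope = sxy / sxx                      # change in y per unit change in x
intercept = y_mean - slope * x_mean    # value of the line at x = 0
rvalue = sxy / np.sqrt(sxx * syy)      # Pearson correlation coefficient

print(f"slope={slope:.3f}, intercept={intercept:.3f}, rvalue={rvalue:.4f}")
```

A first-order `np.polyfit(x, y, 1)` should return the same slope and intercept, which makes for a quick cross-check.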

https://towardsdatascience.com/polynomial-regression-bbe8b9d97491

---

## Playing with Requests

The `requests` library has now been added to the imports at the top of the notebook.

In [None]:
url = "https://www.gmit.ie"

# also check https://www.httpbin.org

In [None]:
res = requests.get(url)
res

In [None]:
print(dir(res))
help(res)  # help() prints its output itself and returns None, so wrapping it in print() would also print "None"

In [None]:
res.status_code
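
The number returned by `res.status_code` can be turned into its standard meaning with the standard library's `http.HTTPStatus`. The sketch below runs without a network connection; `describe_status` is a hypothetical helper named here for illustration only.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code followed by its standard reason phrase, if it has one."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code} (non-standard status code)"


print(describe_status(200))
print(describe_status(404))
```

With a live response, `describe_status(res.status_code)` would be the natural call; `res.ok` and `res.raise_for_status()` are the `requests` shortcuts for checking success.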

In [None]:
print(res.headers)

In [None]:
print(res.text) # print() is used for better text formatting