<center>
# How to get some useful statistics using the brightwind library
</center>
***

In [1]:
import datetime
print('Last updated: {}'.format(datetime.date.today().strftime('%d %B, %Y')))

Last updated: 26 June, 2019


***
## Outline:

This guide will demonstrate how to get some useful statistics from a sample dataset using the following steps:

- Import the brightwind library and some sample data
- Find time continuity gaps within the sample data
- Get some basic statistics on each of the columns from the sample dataset
- Find the monthly coverage of the dataset, or the coverage over any other time period
- Return the mean of monthly means for an anemometer or for a range of anemometers

***

In [2]:
import brightwind as bw

In [3]:
# specify location of existing sample dataset
filepath = r'C:\...\brightwind\datasets\demo\demo_data.csv'
# load data as dataframe
data = bw.load_csv(filepath) 
# show first few rows of dataframe
data.head(5)

Timestamp,Spd80mN,Spd80mS,Spd60mN,Spd60mS,Spd40mN,Spd40mS,Spd80mNStd,Spd80mSStd,Spd60mNStd,Spd60mSStd,...,Dir78mSStd,Dir58mS,Dir58mSStd,Dir38mS,Dir38mSStd,T2m,RH2m,P2m,PrcpTot,BattMin
2016-01-09 15:30:00,8.37,7.911,8.16,7.849,7.857,7.626,1.24,1.075,1.06,0.947,...,6.1,110.1,6.009,112.2,5.724,0.711,100.0,935.0,0.0,12.94
2016-01-09 15:40:00,8.25,7.961,8.1,7.884,7.952,7.84,0.897,0.875,0.9,0.855,...,5.114,110.9,4.702,109.8,5.628,0.63,100.0,935.0,0.0,12.95
2016-01-09 17:00:00,7.652,7.545,7.671,7.551,7.531,7.457,0.756,0.703,0.797,0.749,...,4.172,113.1,3.447,111.8,4.016,1.126,100.0,934.0,0.0,12.75
2016-01-09 17:10:00,7.382,7.325,6.818,6.689,6.252,6.174,0.844,0.81,0.897,0.875,...,4.68,118.8,5.107,115.6,5.189,0.954,100.0,934.0,0.0,12.71
2016-01-09 17:20:00,7.977,7.791,8.11,7.915,8.14,7.974,0.556,0.528,0.562,0.524,...,3.123,115.9,2.96,113.6,3.54,0.863,100.0,934.0,0.0,12.69


### Time Continuity

First we want to see if there are any gaps in the data. We can use the time_continuity_gaps function to identify periods where the interval between consecutive timestamps is not consistent with the typical interval seen in the file(s). The function returns a pandas DataFrame showing the timestamp at the start of each missing period and the timestamp at the end of it. An additional column shows how many days were lost in each missing period.

In [4]:
bw.time_continuity_gaps(data)

,Date From,Date To,Days Lost
1,2016-01-09 15:40:00,2016-01-09 17:00:00,0.055556
17750,2016-05-11 23:00:00,2016-05-31 15:20:00,19.680556
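
To make the idea concrete, here is a minimal pandas sketch of gap detection. It is not brightwind's implementation: it simply treats the most common interval between consecutive timestamps as the typical step and reports every place where the step differs. The function name `find_time_gaps` is hypothetical, and the five timestamps are the first five rows of the sample data shown earlier.

```python
import pandas as pd


def find_time_gaps(index):
    """Report places where the step between timestamps differs from the typical step."""
    stamps = pd.Series(index).reset_index(drop=True)
    steps = stamps.diff()                      # interval to the previous timestamp
    typical = steps.mode().iloc[0]             # most common interval, e.g. 10 minutes
    mask = steps.notna() & (steps != typical)  # skip the first row, which has no predecessor
    return pd.DataFrame({
        'Date From': stamps.shift(1)[mask],
        'Date To': stamps[mask],
        'Days Lost': steps[mask] / pd.Timedelta(days=1),
    })


# First five timestamps of the sample dataset
timestamps = pd.to_datetime([
    '2016-01-09 15:30:00',
    '2016-01-09 15:40:00',
    '2016-01-09 17:00:00',
    '2016-01-09 17:10:00',
    '2016-01-09 17:20:00',
])
gaps = find_time_gaps(timestamps)
print(gaps)  # one gap: 15:40 to 17:00, i.e. 80 minutes or about 0.055556 days
```

The single gap found here corresponds to the first row of the table above.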


### Basic Statistics

Next we may want to get some basic statistics for each of the columns in the wind data file. The basic_stats function returns the count, mean, standard deviation, minimum and maximum of each column. This is useful for a variety of checks. For example, we can confirm that calibrations have been applied to the anemometers by checking whether the minimum value for each anemometer matches the corresponding calibration offset.

In [5]:
bw.basic_stats(data)

,count,mean,std,min,max
Spd80mN,95629.0,7.498665,3.998231,0.215,29.0
Spd80mS,95629.0,6.474298,4.457503,0.0,29.27
Spd60mN,95629.0,7.033594,3.809893,0.214,28.22
Spd60mS,95629.0,7.113664,3.905644,0.08,29.03
Spd40mN,95629.0,6.742682,3.73894,0.228,27.38
Spd40mS,95629.0,6.800116,3.816079,0.092,28.45
Spd80mNStd,95629.0,1.005663,0.540208,0.0,5.056
Spd80mSStd,95629.0,0.820888,0.596739,0.0,5.151
Spd60mNStd,95629.0,1.015741,0.536483,0.0,5.043
Spd60mSStd,95629.0,0.94206,0.535222,0.0,5.185
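
The same five statistics can be reproduced with plain pandas, which also makes the calibration check easy to illustrate. The sketch below is an assumption-laden illustration, not brightwind's code: `basic_stats_sketch` is a hypothetical name, and the slope, offset and frequency values are made up. When an anemometer records a calm period (zero frequency), the calibrated speed equals the offset, so the column minimum should match it.

```python
import pandas as pd


def basic_stats_sketch(df):
    """Count, mean, standard deviation, minimum and maximum of each column."""
    return df.agg(['count', 'mean', 'std', 'min', 'max']).T


# Hypothetical linear calibration: speed = slope * frequency + offset
slope, offset = 0.765, 0.35
frequency_hz = pd.Series([0.0, 4.2, 9.8, 12.5, 0.0, 7.1])  # made-up values, includes calm

sample = pd.DataFrame({
    'Spd_calibrated': slope * frequency_hz + offset,
    'Frequency_raw': frequency_hz,
})
stats = basic_stats_sketch(sample)

# True where the column minimum matches the calibration offset
matches_offset = (stats['min'] - offset).abs() < 1e-9
print(stats)
print(matches_offset)
```

Only the calibrated column has a minimum equal to the offset; the raw column bottoms out at zero.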


### Data Coverage

Next we can check the coverage of each column in the dataset. By default, the coverage function returns the monthly coverage.

In [6]:
bw.coverage(data)

Timestamp,Spd80mN_Coverage,Spd80mS_Coverage,Spd60mN_Coverage,Spd60mS_Coverage,Spd40mN_Coverage,Spd40mS_Coverage,Spd80mNStd_Coverage,Spd80mSStd_Coverage,Spd60mNStd_Coverage,Spd60mSStd_Coverage,...,Dir78mSStd_Coverage,Dir58mS_Coverage,Dir58mSStd_Coverage,Dir38mS_Coverage,Dir38mSStd_Coverage,T2m_Coverage,RH2m_Coverage,P2m_Coverage,PrcpTot_Coverage,BattMin_Coverage
2016-01-01,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,...,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534
2016-02-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-03-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-04-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-05-01,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,...,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367
2016-06-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-07-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-08-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-09-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2016-10-01,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Returning the coverage of every column is more information than we need in this case. Instead, we can assign the anemometers to a list by specifying the column headings that correspond to the 10-minute average values from the anemometers, and then pass those columns to the coverage function.

In [7]:
anemometers = ['Spd80mN', 'Spd80mS', 'Spd60mN', 'Spd60mS', 'Spd40mN', 'Spd40mS']
bw.coverage(data[anemometers])

Timestamp,Spd80mN_Coverage,Spd80mS_Coverage,Spd60mN_Coverage,Spd60mS_Coverage,Spd40mN_Coverage,Spd40mS_Coverage
2016-01-01,0.719534,0.719534,0.719534,0.719534,0.719534,0.719534
2016-02-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-03-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-04-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-05-01,0.365367,0.365367,0.365367,0.365367,0.365367,0.365367
2016-06-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-07-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-08-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-09-01,1.0,1.0,1.0,1.0,1.0,1.0
2016-10-01,1.0,1.0,1.0,1.0,1.0,1.0
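
The monthly figures can be understood as the number of valid records in a month divided by the number of records a complete month would contain. The following pandas sketch illustrates that reading under the assumption of a fixed 10-minute resolution. It is not brightwind's implementation, and `monthly_coverage_sketch` is a hypothetical name. The synthetic index starts at the same timestamp as the sample data and omits the same 15:40 to 17:00 gap.

```python
import pandas as pd


def monthly_coverage_sketch(df, resolution='10min'):
    """Fraction of expected records present in each calendar month."""
    counts = df.resample('MS').count()  # non-null records per month
    records_per_day = pd.Timedelta(days=1) / pd.Timedelta(resolution)
    expected = counts.index.days_in_month.to_numpy() * records_per_day
    return counts.div(expected, axis=0)


# Synthetic 10-minute series mimicking the start of the sample dataset
index = pd.date_range('2016-01-09 15:30', '2016-02-29 23:50', freq='10min')
gap = (index > pd.Timestamp('2016-01-09 15:40')) & (index < pd.Timestamp('2016-01-09 17:00'))
sample = pd.DataFrame({'Spd80mN': 1.0}, index=index[~gap])

monthly = monthly_coverage_sketch(sample)
print(monthly)  # January: 3212 / 4464, about 0.719534; February: 1.0
```

January has 4464 possible 10-minute records, of which 3212 are present, which reproduces the 0.719534 shown in the table above.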


But what if we don't want monthly coverage? We can use the period argument to return the coverage for whatever time period we want, whether that is 10-minute (period='10min'), hourly (period='1H'), daily (period='1D'), weekly (period='1W') or yearly (period='1AS'). Here we have opted to return the yearly coverage.

In [8]:
bw.coverage(data[anemometers], period='1AS')

Timestamp,Spd80mN_Coverage,Spd80mS_Coverage,Spd60mN_Coverage,Spd60mS_Coverage,Spd40mN_Coverage,Spd40mS_Coverage
2016-01-01,0.922492,0.922492,0.922492,0.922492,0.922492,0.922492
2017-01-01,0.894406,0.894406,0.894406,0.894406,0.894406,0.894406
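
Changing the period only changes how the records are grouped and how many records a full period should hold. As a rough illustration in plain pandas (an assumption about the underlying idea, not brightwind's code), the expected count can be taken as the length of each period divided by the data resolution. The name `coverage_sketch` is hypothetical, and the sketch only handles periods that pandas labels by their start, such as daily ('1D') or month-start ('MS').

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def coverage_sketch(series, period='1D', resolution='10min'):
    """Fraction of expected records present in each period (left-labelled periods only)."""
    counts = series.resample(period).count()
    starts = counts.index
    ends = starts + to_offset(period)                    # start of the following period
    expected = (ends - starts) / pd.Timedelta(resolution)
    return counts / expected.to_numpy()


# Synthetic 10-minute series with one gap between 15:40 and 17:00 on the first day
index = pd.date_range('2016-01-09 15:30', '2016-02-29 23:50', freq='10min')
gap = (index > pd.Timestamp('2016-01-09 15:40')) & (index < pd.Timestamp('2016-01-09 17:00'))
speed = pd.Series(1.0, index=index[~gap], name='Spd80mN')

daily = coverage_sketch(speed, period='1D')
monthly = coverage_sketch(speed, period='MS')
print(daily.head(3))
print(monthly)
```

On the first day only 44 of the 144 possible records exist, so its daily coverage is 44/144, while every later day is fully covered.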


### Mean of monthly means

The mean of monthly means is a method of adjusting the average to account for seasonal bias. For example, it removes the upward bias in a 1.5-year dataset that covers two windier winter periods but only one calmer summer period. We can call the function in two ways: by passing a single column from the dataset, which returns a single value, or by passing a list of column names (in this case anemometers), which returns the mean of monthly means for each column as a DataFrame.

In [9]:
bw.momm(data.Spd80mN)

7.556588194559553

In [10]:
bw.momm(data[anemometers])

,MOMM
Spd80mN,7.556588
Spd80mS,6.587765
Spd60mN,7.081094
Spd60mS,7.163933
Spd40mN,6.785035
Spd40mS,6.844676
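
To see why this adjustment matters, the sketch below applies one common formulation of the mean of monthly means: average all records that fall in each calendar month (January to December, pooled across years), then average those twelve values. This is an illustrative assumption and may differ from what brightwind does internally. `momm_sketch` is a hypothetical name, and the data are synthetic: 9 m/s in winter (October to March) and 5 m/s in summer, over 1.5 years containing two winters and one summer.

```python
import numpy as np
import pandas as pd


def momm_sketch(series):
    """Mean of the twelve calendar-month means (unweighted)."""
    monthly_means = series.groupby(series.index.month).mean()
    return monthly_means.mean()


# Synthetic daily wind speeds: two winters and one summer
index = pd.date_range('2015-10-01', '2017-03-31', freq='D')
is_winter = index.month.isin([10, 11, 12, 1, 2, 3])
speed = pd.Series(np.where(is_winter, 9.0, 5.0), index=index)

plain_mean = speed.mean()         # biased upward by the extra winter
momm_value = momm_sketch(speed)   # each calendar month counts once
print(plain_mean, momm_value)
```

The plain mean is pulled towards the winter value because winter days are counted twice, whereas the mean of monthly means gives each calendar month equal weight and returns 7.0.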
