
* Computing Platforms: Set up the Workspace for Machine Learning Projects. https://ms.pubpub.org/pub/computing
* Machine Learning for Predictions. https://ms.pubpub.org/pub/ml-prediction
* Machine Learning Packages: https://scikit-learn.org/stable/


# Part I: Import and Inspect Data

In [3]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/Rising-Stars-by-Sunshine/stats201-tutorial-prediction/main/data/Queried_Data/queried_data.csv',index_col="Unnamed: 0")
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit
100,14650515,2022-04-25 00:00:04,0,30000000
101,14650516,2022-04-25 00:00:07,3067277,29970705
102,14650517,2022-04-25 00:00:09,29927116,29941438
103,14650518,2022-04-25 00:00:35,29951281,29970676
104,14650519,2022-04-25 00:00:38,15598681,29999943


# Part II: Prepare the Y Variable for Regression

## 2.1. Write Functions to Calculate the Y Variable for Regression

*(Skip this step if the Y variable already exists.)*

In [5]:
df['theta'] = df['gas_used']/df['gas_limit']
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta
100,14650515,2022-04-25 00:00:04,0,30000000,0.0
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957


## 2.2. Make Sure that the Data Type of Y is "numeric"

In [6]:
df.dtypes

number         int64
timestamp     object
gas_used       int64
gas_limit      int64
theta        float64
dtype: object

In [7]:
df['theta'] = pd.to_numeric(df['theta'])
df.dtypes

number         int64
timestamp     object
gas_used       int64
gas_limit      int64
theta        float64
dtype: object
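
In this dataset `theta` is already `float64`, so `pd.to_numeric` leaves it unchanged. The call matters when a column was read as text. The sketch below uses a toy column with hypothetical values (not taken from the dataset) and assumes that unparseable entries should become `NaN` rather than raise an error.

```python
import pandas as pd

# Toy column (hypothetical values) read as text, with one unparseable entry.
raw = pd.Series(["0.10", "0.99", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
theta_numeric = pd.to_numeric(raw, errors="coerce")

print(theta_numeric.dtype)         # dtype after conversion
print(theta_numeric.isna().sum())  # number of entries that could not be parsed
```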

# Part III: Prepare the Y variable for Classification

reference:

https://datatofish.com/if-condition-in-pandas-dataframe/

In [9]:
#@title Define the Congestion Threshold
cut = 0.95 #@param {type:"number"}


## 3.1. Method 1: If function

In [10]:
df['congested'] = df['theta'] >= cut
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,30000000,0.0,False
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343,False
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522,True
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353,True
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957,False


In [11]:
df.loc[df['theta'] >= cut, 'congested'] = 1
df.loc[df['theta'] < cut, 'congested'] = 0
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957,0


## 3.2. Method 2: Lambda function

Note: this is the method I recommend.

In [12]:
df['congested'] = df['theta'].apply(lambda x: 1 if x >= cut else 0)
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957,0
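
The lambda above applies the rule row by row. As a hedged alternative, the same 0/1 label can be produced with vectorized comparisons. The sketch below uses a toy frame (the first values come from the output above; the `0.95` row is added here to exercise the boundary) and checks that the three variants agree.

```python
import numpy as np
import pandas as pd

cut = 0.95
toy = pd.DataFrame({"theta": [0.0, 0.102343, 0.999522, 0.95, 0.519957]})

# Row-by-row rule, as in Method 2.
via_apply = toy["theta"].apply(lambda x: 1 if x >= cut else 0)

# Vectorized: compare the whole column, then cast True/False to 1/0.
via_astype = (toy["theta"] >= cut).astype(int)

# Vectorized with numpy.where(condition, value_if_true, value_if_false).
via_where = pd.Series(np.where(toy["theta"] >= cut, 1, 0), index=toy.index)

print(via_astype.tolist())  # [0, 0, 1, 1, 0]
```

All three treat `theta == cut` as congested, because the comparison is `>=`.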


## 3.3. Method 3: Cut function

reference: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

Note: I do not recommend this method if you are new to data science.

In [13]:
df.head()

Unnamed: 0,number,timestamp,gas_used,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,29999943,0.519957,0


In [14]:
# With bins=[0, 0.95, 1], pd.cut builds the right-closed intervals (0, 0.95] and (0.95, 1],
# so theta == 0 falls outside every bin and becomes NaN.
congested = pd.cut(df['theta'], bins=[0,0.95,1], labels=[0,1])
df.insert(3, 'congested2',congested)
df.head()

Unnamed: 0,number,timestamp,gas_used,congested2,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,0.0,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,1.0,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,1.0,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,0.0,29999943,0.519957,0


In [15]:
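# Widen the outer bin edges so that theta == 0 is covered. The intervals stay right-closed,
# so theta == 0.95 is still labeled 0, unlike the `>= cut` rule used in Methods 1 and 2.
congested = pd.cut(df['theta'], bins=[-1,0.95,2], labels=[0,1])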
df.insert(3, 'congested3',congested)
df.head()

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested
100,14650515,2022-04-25 00:00:04,0,0,,30000000,0.0,0
101,14650516,2022-04-25 00:00:07,3067277,0,0.0,29970705,0.102343,0
102,14650517,2022-04-25 00:00:09,29927116,1,1.0,29941438,0.999522,1
103,14650518,2022-04-25 00:00:35,29951281,1,1.0,29970676,0.999353,1
104,14650519,2022-04-25 00:00:38,15598681,0,0.0,29999943,0.519957,0
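
The boundary behavior of `pd.cut` is easier to see on a handful of values. The sketch below uses a toy series (values chosen here to hit the edges) and compares the two bin choices above with a third variant, `right=False`, which is an addition of this sketch and not part of the original notebook. With `right=False` the intervals are closed on the left, so the result matches the `theta >= cut` rule.

```python
import pandas as pd

cut = 0.95
theta = pd.Series([0.0, 0.5, 0.95, 1.0])

# Intervals (0, 0.95] and (0.95, 1]: 0.0 lies outside both and becomes NaN.
narrow = pd.cut(theta, bins=[0, cut, 1], labels=[0, 1])

# Intervals (-1, 0.95] and (0.95, 2]: 0.0 is covered, but 0.95 is labeled 0.
wide = pd.cut(theta, bins=[-1, cut, 2], labels=[0, 1])

# Intervals [-1, 0.95) and [0.95, 2): 0.95 is labeled 1, like theta >= cut.
left_closed = pd.cut(theta, bins=[-1, cut, 2], labels=[0, 1], right=False)

print(pd.DataFrame({"theta": theta, "narrow": narrow,
                    "wide": wide, "left_closed": left_closed}))
```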


# Part IV: Create the X variables

## 4.1. Shift the Y to get past values

reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html

In [16]:
# Create a new variable that holds the previous observation (lag 1) of the regression Y variable
df['theta_past'] = df['theta'].shift(1)
df.head()

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested,theta_past
100,14650515,2022-04-25 00:00:04,0,0,,30000000,0.0,0,
101,14650516,2022-04-25 00:00:07,3067277,0,0.0,29970705,0.102343,0,0.0
102,14650517,2022-04-25 00:00:09,29927116,1,1.0,29941438,0.999522,1,0.102343
103,14650518,2022-04-25 00:00:35,29951281,1,1.0,29970676,0.999353,1,0.999522
104,14650519,2022-04-25 00:00:38,15598681,0,0.0,29999943,0.519957,0,0.999353
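
`shift(k)` moves every value down by `k` rows, so row `t` holds the value observed at `t - k` and the first `k` rows have no past value. A minimal sketch using the five `theta` values shown above; the `theta_lag2` column is a hypothetical extra lag added for illustration only.

```python
import pandas as pd

# The first five theta values from the output above.
theta = pd.Series([0.0, 0.102343, 0.999522, 0.999353, 0.519957])

lags = pd.DataFrame({
    "theta": theta,
    "theta_lag1": theta.shift(1),  # value one block earlier
    "theta_lag2": theta.shift(2),  # value two blocks earlier (illustrative)
})

# The first k rows of a k-step lag are NaN because no earlier value exists.
print(lags)
```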


## 4.2. Calculate the Moving Averages

references: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html

https://towardsdatascience.com/moving-averages-in-python-16170e20f6c

In [17]:
#@title Define the Window
window = 10 #@param {type:"number"}


In [18]:
df['theta_past_ma10'] = df['theta_past'].rolling(window=window, min_periods=1).mean()
df.head(20)

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested,theta_past,theta_past_ma10
100,14650515,2022-04-25 00:00:04,0,0,,30000000,0.0,0,,
101,14650516,2022-04-25 00:00:07,3067277,0,0.0,29970705,0.102343,0,0.0,0.0
102,14650517,2022-04-25 00:00:09,29927116,1,1.0,29941438,0.999522,1,0.102343,0.051171
103,14650518,2022-04-25 00:00:35,29951281,1,1.0,29970676,0.999353,1,0.999522,0.367288
104,14650519,2022-04-25 00:00:38,15598681,0,0.0,29999943,0.519957,0,0.999353,0.525304
105,14650520,2022-04-25 00:00:47,10844553,0,0.0,30000000,0.361485,0,0.519957,0.524235
106,14650521,2022-04-25 00:00:52,7476517,0,0.0,30000000,0.249217,0,0.361485,0.49711
107,14650522,2022-04-25 00:00:56,0,0,,30000000,0.0,0,0.249217,0.461697
108,14650523,2022-04-25 00:00:57,18525539,0,0.0,30000000,0.617518,0,0.0,0.403985
109,14650524,2022-04-25 00:01:01,10934632,0,0.0,30000000,0.364488,0,0.617518,0.42771
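
With `min_periods=1`, the rolling mean is computed as soon as one non-missing value is inside the window, so the early rows average fewer than 10 values. The sketch below reproduces the first rows of `theta_past_ma10` from the `theta_past` values shown above and contrasts them with a stricter window. The `strict` variant with `window=3` is an illustration added here, not part of the notebook.

```python
import pandas as pd

# First five theta_past values from the output above (the first one is missing).
theta_past = pd.Series([float("nan"), 0.0, 0.102343, 0.999522, 0.999353])

# Same settings as the notebook: window of 10, at least 1 valid value required.
ma = theta_past.rolling(window=10, min_periods=1).mean()

# Stricter variant: min_periods defaults to the window size, so a row is NaN
# until 3 valid values are available.
strict = theta_past.rolling(window=3).mean()

print(pd.DataFrame({"theta_past": theta_past, "ma": ma, "strict": strict}))
```

The trade-off is between keeping early rows (averages over short histories) and dropping them (every average uses a full window).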


# Part V: Train and Test Split

*reference*:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

In [31]:
from sklearn.model_selection import TimeSeriesSplit
tss = TimeSeriesSplit()
print(tss)

TimeSeriesSplit(gap=0, max_train_size=None, n_splits=5, test_size=None)


In [33]:
# Change the train and test split parameters (here: 2 splits instead of the default 5)
tss = TimeSeriesSplit(gap=0, max_train_size=None, n_splits=2, test_size=None)

In [34]:
for train_idx, test_idx in tss.split(df):
    print("TRAIN:", train_idx, "TEST:", test_idx)

TRAIN: [    0     1     2 ... 21147 21148 21149] TEST: [21150 21151 21152 ... 42297 42298 42299]
TRAIN: [    0     1     2 ... 42297 42298 42299] TEST: [42300 42301 42302 ... 63447 63448 63449]


In [40]:
train_idx

array([    0,     1,     2, ..., 42297, 42298, 42299])

In [41]:
test_idx

array([42300, 42301, 42302, ..., 63447, 63448, 63449])
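
After the loop finishes, `train_idx` and `test_idx` hold the indices of the last split only. How `TimeSeriesSplit` builds its folds is easier to follow on a small example. The sketch below uses a toy array of 6 samples (an assumption for illustration): each test block follows its training block in time, and the training set grows from one split to the next.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 6 time-ordered samples, one feature.
X = np.arange(6).reshape(-1, 1)

tss = TimeSeriesSplit(n_splits=2)
splits = [(train.tolist(), test.tolist()) for train, test in tss.split(X)]

for train, test in splits:
    print("TRAIN:", train, "TEST:", test)
# TRAIN: [0, 1] TEST: [2, 3]
# TRAIN: [0, 1, 2, 3] TEST: [4, 5]
```

The same pattern appears in the notebook output: with 63,450 rows and 2 splits, each test block has 21,150 rows.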

In [42]:
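# Note: filter(items=..., axis=0) selects rows whose index *label* is in items.
# TimeSeriesSplit returns row *positions*; df.iloc[train_idx] would select by position.
train_df = df.filter(items=train_idx, axis=0)
test_df = df.filter(items=test_idx, axis=0)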
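
`TimeSeriesSplit` yields integer row positions, while `DataFrame.filter(items=..., axis=0)` matches index labels. The two coincide only when the index is exactly 0, 1, 2, ... in row order. In this notebook `df.head()` starts at label 100, whereas `train_df.head()` starts at label 0, so it is worth checking which rows were selected. A minimal sketch on a toy frame (index labels are hypothetical) showing the difference, with `iloc` as the positional alternative:

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels do not match the row positions.
toy = pd.DataFrame({"theta": [0.1, 0.2, 0.3, 0.4]}, index=[100, 101, 0, 1])

positions = np.array([0, 1])  # what a splitter would return for "first two rows"

# Label-based: picks the rows labeled 0 and 1 (the last two rows here).
by_label = toy.filter(items=positions, axis=0)

# Position-based: picks the first two rows regardless of their labels.
by_position = toy.iloc[positions]

print(by_label)
print(by_position)
```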

In [50]:
train_df.head()

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested,theta_past,theta_past_ma10
0,14650615,2022-04-25 00:22:41,3788742,0,0,29970705,0.126415,0,0.991364,0.557929
1,14650616,2022-04-25 00:22:43,29979945,1,1,29999972,0.999332,1,0.126415,0.562116
2,14650617,2022-04-25 00:23:35,29962455,1,1,29970677,0.999726,1,0.999332,0.56206
3,14650618,2022-04-25 00:23:42,29979756,1,1,29999944,0.999327,1,0.999726,0.597107
4,14650619,2022-04-25 00:23:56,23281823,0,0,30000000,0.776061,0,0.999327,0.675015


In [49]:
test_df.head()

Unnamed: 0,number,timestamp,gas_used,congested3,congested2,gas_limit,theta,congested,theta_past,theta_past_ma10
42300,14692815,2022-05-01 15:40:05,11340559,0,0,29970705,0.378388,0,0.999309,0.543044
42301,14692816,2022-05-01 15:40:30,29995034,1,1,29999972,0.999835,1,0.378388,0.552995
42302,14692817,2022-05-01 15:40:49,3958965,0,0,30029267,0.131837,0,0.999835,0.627107
42303,14692818,2022-05-01 15:40:51,3535970,0,0,29999943,0.117866,0,0.131837,0.621367
42304,14692819,2022-05-01 15:40:55,4848119,0,0,30000000,0.161604,0,0.117866,0.605332


# Part VI: Prepare the Train and Test Data for Classification and Regression

## 6.1. Classification

### 6.1.1. Define the columns (Y, X) for Classification

In [51]:
cols_C = ['congested','theta_past_ma10']

### 6.1.2. Define the Data Frame of Train and Test Data for Classification

In [52]:
df_C_train = train_df[cols_C]
df_C_test = test_df[cols_C]

### 6.1.3. Export the Train and Test Data for Classification

In [53]:
df_C_train.head()

Unnamed: 0,congested,theta_past_ma10
0,0,0.557929
1,1,0.562116
2,1,0.56206
3,1,0.597107
4,0,0.675015


In [54]:
df_C_train.to_csv('Classification_Train.csv')

In [55]:
df_C_test.head()

Unnamed: 0,congested,theta_past_ma10
42300,0,0.543044
42301,1,0.552995
42302,0,0.627107
42303,0,0.621367
42304,0,0.605332


In [56]:
df_C_test.to_csv('Classification_Test.csv')

## 6.2. Regression

### 6.2.1. Define the columns (Y, X) for Regression

In [57]:
cols_R = ['theta','theta_past_ma10']

### 6.2.2. Define the Data Frame of Train and Test Data for Regression

In [58]:
df_R_train = train_df[cols_R]
df_R_test = test_df[cols_R]

### 6.2.3. Export the Train and Test Data for Regression

In [59]:
df_R_train.head()

Unnamed: 0,theta,theta_past_ma10
0,0.126415,0.557929
1,0.999332,0.562116
2,0.999726,0.56206
3,0.999327,0.597107
4,0.776061,0.675015


In [60]:
df_R_train.to_csv('Regression_Train.csv')

In [62]:
df_R_test.head()

Unnamed: 0,theta,theta_past_ma10
42300,0.378388,0.543044
42301,0.999835,0.552995
42302,0.131837,0.627107
42303,0.117866,0.621367
42304,0.161604,0.605332


In [61]:
df_R_test.to_csv('Regression_Test.csv')
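
The exported files store the index as the first, unnamed column, so reading them back needs `index_col=0`. The sketch below round-trips a toy frame built from the first rows of the regression training output above, then separates Y from X. It writes to a temporary directory rather than the notebook's working directory, and the `dropna()` step is an assumption of this sketch: it removes rows whose moving average is still missing.

```python
import os
import tempfile

import pandas as pd

# First rows of the regression training data shown above.
train = pd.DataFrame({
    "theta": [0.126415, 0.999332, 0.999726],
    "theta_past_ma10": [0.557929, 0.562116, 0.56206],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Regression_Train.csv")
    train.to_csv(path)  # the index is written as the first column
    # index_col=0 restores that column as the index instead of "Unnamed: 0".
    loaded = pd.read_csv(path, index_col=0).dropna()

y_train = loaded["theta"]              # target (Y)
X_train = loaded[["theta_past_ma10"]]  # features (X), kept as a DataFrame

print(X_train.shape, y_train.shape)
```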