## ⛽ UK Fuel Sale Year Prediction

Given *data about fuel sales in the UK*, let's try to predict if a given sale was **made in the last nine years**.

We will use a variety of different models to make our predictions.

Data source: https://www.kaggle.com/datasets/benten867/uk-fuel-price-weekly-statistics20032020

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

In [21]:
data = pd.read_csv('archive/fuel price.csv')
data

Unnamed: 0.1,Unnamed: 0,Date,Pump price in pence/litre (ULSP),Pump price in pence/litre (ULSD),Duty rate in pence/litre (ULSP),Duty rate in pence/litre (ULSD),VAT percentage rate (ULSP),VAT percentage rate (ULSD)
0,2,09/06/2003,74.59,76.77,45.82,45.82,17.5,17.5
1,3,16/06/2003,74.47,76.69,45.82,45.82,17.5,17.5
2,4,23/06/2003,74.42,76.62,45.82,45.82,17.5,17.5
3,5,30/06/2003,74.35,76.51,45.82,45.82,17.5,17.5
4,6,07/07/2003,74.28,76.46,45.82,45.82,17.5,17.5
...,...,...,...,...,...,...,...,...
904,906,05/10/2020,113.26,118.11,57.95,57.95,20.0,20.0
905,907,12/10/2020,113.19,118.05,57.95,57.95,20.0,20.0
906,908,19/10/2020,113.18,118.08,57.95,57.95,20.0,20.0
907,909,26/10/2020,113.14,118.08,57.95,57.95,20.0,20.0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 909 entries, 0 to 908
Data columns (total 8 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        909 non-null    int64  
 1   Date                              909 non-null    object 
 2   Pump price in pence/litre (ULSP)  909 non-null    float64
 3   Pump price in pence/litre (ULSD)  909 non-null    float64
 4   Duty rate in pence/litre (ULSP)   909 non-null    float64
 5   Duty rate in pence/litre (ULSD)   909 non-null    float64
 6   VAT percentage rate (ULSP)        909 non-null    float64
 7   VAT percentage rate (ULSD)        909 non-null    float64
dtypes: float64(6), int64(1), object(1)
memory usage: 56.9+ KB


### Preprocessing

In [4]:
df = data.copy()

In [5]:
df = df.drop('Unnamed: 0', axis=1) # Drop index column
df

Unnamed: 0,Date,Pump price in pence/litre (ULSP),Pump price in pence/litre (ULSD),Duty rate in pence/litre (ULSP),Duty rate in pence/litre (ULSD),VAT percentage rate (ULSP),VAT percentage rate (ULSD)
0,09/06/2003,74.59,76.77,45.82,45.82,17.5,17.5
1,16/06/2003,74.47,76.69,45.82,45.82,17.5,17.5
2,23/06/2003,74.42,76.62,45.82,45.82,17.5,17.5
3,30/06/2003,74.35,76.51,45.82,45.82,17.5,17.5
4,07/07/2003,74.28,76.46,45.82,45.82,17.5,17.5
...,...,...,...,...,...,...,...
904,05/10/2020,113.26,118.11,57.95,57.95,20.0,20.0
905,12/10/2020,113.19,118.05,57.95,57.95,20.0,20.0
906,19/10/2020,113.18,118.08,57.95,57.95,20.0,20.0
907,26/10/2020,113.14,118.08,57.95,57.95,20.0,20.0


In [8]:
# Generate Date column
df['Date'] = pd.to_datetime(df['Date'], format='mixed')

In [10]:
df['Year'] = df['Date'].apply(lambda x: x.year)
df['Month'] = df['Date'].apply(lambda x: x.month)
df['Day'] = df['Date'].apply(lambda x: x.day)
df = df.drop('Date', axis=1)

In [11]:
df

Unnamed: 0,Pump price in pence/litre (ULSP),Pump price in pence/litre (ULSD),Duty rate in pence/litre (ULSP),Duty rate in pence/litre (ULSD),VAT percentage rate (ULSP),VAT percentage rate (ULSD),Year,Month,Day
0,74.59,76.77,45.82,45.82,17.5,17.5,2003,9,6
1,74.47,76.69,45.82,45.82,17.5,17.5,2003,6,16
2,74.42,76.62,45.82,45.82,17.5,17.5,2003,6,23
3,74.35,76.51,45.82,45.82,17.5,17.5,2003,6,30
4,74.28,76.46,45.82,45.82,17.5,17.5,2003,7,7
...,...,...,...,...,...,...,...,...,...
904,113.26,118.11,57.95,57.95,20.0,20.0,2020,5,10
905,113.19,118.05,57.95,57.95,20.0,20.0,2020,12,10
906,113.18,118.08,57.95,57.95,20.0,20.0,2020,10,19
907,113.14,118.08,57.95,57.95,20.0,20.0,2020,10,26


In [12]:
# Split df into X and y
y = df['Year'].copy()
X = df.drop('Year', axis=1).copy()

In [17]:
# Create labels from the Year column
y = y.apply(lambda x: 1 if x >= 2012 else 0)

In [18]:
y.value_counts()

Year
1    462
0    447
Name: count, dtype: int64

In [19]:
# Scale X with a standard scaler
scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X

Unnamed: 0,Pump price in pence/litre (ULSP),Pump price in pence/litre (ULSD),Duty rate in pence/litre (ULSP),Duty rate in pence/litre (ULSD),VAT percentage rate (ULSP),VAT percentage rate (ULSD),Month,Day
0,-1.994782,-2.002815,-1.866717,-1.866717,-0.818744,-0.818744,0.716806,-1.113329
1,-2.001340,-2.006938,-1.866717,-1.866717,-0.818744,-0.818744,-0.158185,0.025944
2,-2.004073,-2.010545,-1.866717,-1.866717,-0.818744,-0.818744,-0.158185,0.823435
3,-2.007898,-2.016215,-1.866717,-1.866717,-0.818744,-0.818744,-0.158185,1.620926
4,-2.011724,-2.018791,-1.866717,-1.866717,-0.818744,-0.818744,0.133479,-0.999402
...,...,...,...,...,...,...,...,...
904,0.118560,0.127729,0.735037,0.735037,0.813358,0.813358,-0.449848,-0.657620
905,0.114735,0.124636,0.735037,0.735037,0.813358,0.813358,1.591796,-0.657620
906,0.114188,0.126183,0.735037,0.735037,0.813358,0.813358,1.008469,0.367726
907,0.112002,0.126183,0.735037,0.735037,0.813358,0.813358,1.008469,1.165217


### Training

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=123)

In [22]:
models = {
    '   Logistic Regression': LogisticRegression(),
    'Support Vector Machine': SVC(),
    '         Decision Tree': DecisionTreeClassifier(),
    '        Neural Network': MLPClassifier(),
    '   K-Nearest Neighbors': KNeighborsClassifier(),
    '     Gradient Boosting': GradientBoostingClassifier(),
    '         Random Forest': RandomForestClassifier(),
    '              AdaBoost': AdaBoostClassifier()
}

In [24]:
for model in models.values():
    model.fit(X_train, y_train)



### Results

In [27]:
print("Model Accuracies: \n----------------------")

for name, model in models.items():
    print(name + ": {:.2f}%".format(model.score(X_test, y_test)*100))

Model Accuracies: 
----------------------
   Logistic Regression: 97.07%
Support Vector Machine: 97.07%
         Decision Tree: 98.17%
        Neural Network: 97.07%
   K-Nearest Neighbors: 94.51%
     Gradient Boosting: 95.60%
         Random Forest: 97.07%
              AdaBoost: 97.44%
