### First, load the dataset:

In [1]:
import pandas as pd
df = pd.read_csv('Harley_Dataset.csv')

In [2]:
df.head()

| | Order Number | Product Name | Quantity | Price | Payment Mode | Store Name | City | Country | Year | Month | Order Date |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101071 | Harley-Davidson Street 750 | 30 | 7000 | NetBanking | RIDGES HARLEY-DAVIDSON | NYC | USA | 2001 | 2 | 24-Feb-2001 |
| 1 | 101072 | Harley-Davidson Street 750 | 34 | 7000 | Cash | SEVEN ISLANDS HARLEY-DAVIDSON | Reims | France | 2001 | 5 | 7-May-2001 |
| 2 | 101073 | Harley-Davidson Street 750 | 41 | 7000 | Credit card | BANJARA HARLEY-DAVIDSON | Paris | France | 2001 | 7 | 1-Jul-2001 |
| 3 | 101074 | Harley-Davidson Street 750 | 45 | 7000 | NetBanking | TUSKER HARLEY-DAVIDSON LAVELLE ROAD | Pasadena | USA | 2001 | 8 | 25-Aug-2001 |
| 4 | 101075 | Harley-Davidson Street 750 | 49 | 7000 | NetBanking | CAPITAL HARLEY-DAVIDSON | San Francisco | USA | 2001 | 10 | 10-Oct-2001 |


#### I then dropped the columns deemed unnecessary and converted the remaining categorical columns into dummy variables for the subsequent analyses:

In [3]:
df.drop(['Order Number', 'Store Name', 'Order Date'], axis=1, inplace=True)


In [4]:
df = pd.get_dummies(df, columns=['Product Name', 'Payment Mode', 'City', 'Country'], drop_first=False, dtype=int)
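
To make the encoding step concrete, here is a minimal sketch of what `pd.get_dummies` does to a single categorical column. The toy frame and its values are illustrative assumptions, not rows loaded from `Harley_Dataset.csv`.

```python
import pandas as pd

# Toy frame with illustrative values (an assumption, not the real dataset)
toy = pd.DataFrame({
    "Quantity": [30, 34, 41],
    "Country": ["USA", "France", "France"],
})

# Each distinct category becomes its own 0/1 column.
# drop_first=False keeps one column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Country"], drop_first=False, dtype=int)
print(encoded)
```

Keeping every level (`drop_first=False`) means the dummy columns of one feature always sum to 1, which is collinear with the intercept. This does not stop `LinearRegression` from fitting, but it can make individual coefficients harder to interpret.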

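### The final column of this dataset, Order_Profit, is defined as Quantity × Price:

Since none of the columns shown carries cost information, this value is, strictly speaking, the order's revenue.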

In [5]:
Order_Profit = df["Quantity"] * df["Price"]
df["Order_Profit"] = Order_Profit

In [6]:
df.head()

| | Quantity | Price | Year | Month | Product Name_Harley-Davidson 1200 Custom | Product Name_Harley-Davidson CVO Limited | Product Name_Harley-Davidson Fat Bob | Product Name_Harley-Davidson Fat Boy | Product Name_Harley-Davidson Forty-Eight | Product Name_Harley-Davidson Heritage Softail Classic | ... | Country_Japan | Country_Norway | Country_Philippines | Country_Singapore | Country_Spain | Country_Sweden | Country_Switzerland | Country_UK | Country_USA | Order_Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | 7000 | 2001 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 210000 |
| 1 | 34 | 7000 | 2001 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 238000 |
| 2 | 41 | 7000 | 2001 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 287000 |
| 3 | 45 | 7000 | 2001 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 315000 |
| 4 | 49 | 7000 | 2001 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 343000 |


In [7]:
df.isna().any().sum()

0
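
Note that `df.isna().any().sum()` counts the columns that contain at least one missing value, not the individual missing cells. The sketch below shows the difference on a small made-up frame; the column names `a`, `b`, and `c` are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing cells (placeholder columns, not the real dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

cols_with_missing = toy.isna().any().sum()  # number of columns with at least one NaN
total_missing = toy.isna().sum().sum()      # number of individual missing cells
per_column = toy.isna().sum()               # missing cells broken down by column

print(cols_with_missing, total_missing)
print(per_column)
```

A result of 0 from either count means the frame has no missing values, so the check in the cell above is sufficient for this purpose.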

## The quantity per shipment, price per motorcycle, year of shipment, month of shipment, series of motorcycle, country of shipment, city of shipment, and payment method will be used to predict the profitability of the shipment.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Columns: 116 entries, Quantity to Order_Profit
dtypes: int32(111), int64(5)
memory usage: 1.3 MB


#### After confirming that there are no null values and that all data types in the dataset are integer-based, the dataset is ready for the train/test split:

# Test/Train Split
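A scaler standardizes each feature to zero mean and unit variance, which helps models that are sensitive to feature scale. Note that the scaler below is only fitted, so the models that follow are trained on the unscaled data.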

In [9]:
from sklearn.preprocessing import StandardScaler

# Initialise the scaler
scaler = StandardScaler()

# Learn the per-column mean and standard deviation (fit() does not modify df)
scaler.fit(df)

StandardScaler()
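
If the scaling were to be applied, one common pattern is to place the scaler in a pipeline so that it is fitted on the training features only and then reused on the test features. The sketch below is one possible arrangement, not the notebook's method; it uses synthetic stand-ins for Quantity and Price, and the value ranges and `random_state` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for Quantity and Price (assumed ranges, not the real data)
quantity = rng.integers(20, 60, size=200)
price = rng.choice([7000, 10000, 13000], size=200)
X = np.column_stack([quantity, price]).astype(float)
y = quantity * price  # same definition as Order_Profit

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_train only, then applies the same
# transformation to whatever is passed to predict() or score().
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Inspect the scaled training features: each column has mean ~0 and std ~1
scaled_train = model.named_steps["standardscaler"].transform(X_train)
print(scaled_train.mean(axis=0).round(6), scaled_train.std(axis=0).round(6))
print(model.score(X_test, y_test))
```

For ordinary least squares with an intercept, standardizing the features does not change the R² score; the step matters mainly for regularized or distance-based models.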

y is the dependent variable, which in this case is the last column of the dataset, Order_Profit. The goal of this project is to predict the profitability per order.

x holds the independent variables, which are all the columns of this dataset other than Order_Profit.

In [10]:
from sklearn.model_selection import train_test_split

y = df.Order_Profit
x = df.drop('Order_Profit', axis = 1)

In [11]:
# Hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
x_train.head()

| | Quantity | Price | Year | Month | Product Name_Harley-Davidson 1200 Custom | Product Name_Harley-Davidson CVO Limited | Product Name_Harley-Davidson Fat Bob | Product Name_Harley-Davidson Fat Boy | Product Name_Harley-Davidson Forty-Eight | Product Name_Harley-Davidson Heritage Softail Classic | ... | Country_Italy | Country_Japan | Country_Norway | Country_Philippines | Country_Singapore | Country_Spain | Country_Sweden | Country_Switzerland | Country_UK | Country_USA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1491 | 35 | 10000 | 2011 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2762 | 50 | 10000 | 2004 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 565 | 41 | 13000 | 2006 | 5 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 668 | 23 | 10000 | 2007 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2509 | 29 | 10000 | 2014 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [12]:
y_train.head()


1491    350000
2762    500000
565     533000
668     230000
2509    290000
Name: Order_Profit, dtype: int64
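
Because the split is random, rerunning it selects different rows each time, and the score reported later will vary slightly. A minimal sketch of a reproducible split follows; the toy frame and the seed value 42 are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset (assumed values)
toy = pd.DataFrame({
    "Quantity": range(10),
    "Order_Profit": [q * 7000 for q in range(10)],
})
x = toy.drop("Order_Profit", axis=1)
y = toy["Order_Profit"]

# Fixing random_state makes the shuffle deterministic; 42 is an arbitrary seed
x_tr1, x_te1, y_tr1, y_te1 = train_test_split(x, y, test_size=0.2, random_state=42)
x_tr2, x_te2, y_tr2, y_te2 = train_test_split(x, y, test_size=0.2, random_state=42)

# Both calls select exactly the same rows
same = x_tr1.index.equals(x_tr2.index)
print(same, len(x_tr1), len(x_te1))
```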

## The goal of this project is to predict the profitability per order. Because the target is a continuous value rather than a binary output (1 for yes, 0 for no), the next part of this project uses regression models and regression metrics instead of classification ones.

## Algorithmic Execution:


In [16]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()

In [17]:
clf.fit(x_train,y_train)

LinearRegression()

In [18]:
clf.score(x_test,y_test)


0.9334302480133214
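
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot, not an accuracy. The sketch below computes it by hand and alongside two error metrics in the target's own units. The true values are the first five Order_Profit figures shown earlier, while the predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# First five Order_Profit values from the table above
y_true = np.array([210000, 238000, 287000, 315000, 343000], dtype=float)
# Invented predictions, for illustration only
y_pred = np.array([220000, 230000, 280000, 320000, 350000], dtype=float)

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2 = r2_score(y_true, y_pred)                       # same quantity via scikit-learn
mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root mean squared error

print(r2_manual, r2, mae, rmse)
```

MAE and RMSE are expressed in the same units as Order_Profit, so they complement R² by showing how large a typical prediction error is.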

# Analyzing the Results
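With an R² of 0.9334 on the test set, the linear regression model explains roughly 93% of the variance in Order_Profit on the held-out data, which indicates a strong fit. Note that this score is a coefficient of determination, not a success rate.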