# Regression Model

## Introduction
Laptops have become an integral part of daily life, and with computer components constantly evolving, it can be difficult to find a laptop that fits an individual's specific needs. In this project, we estimate laptop prices from the hardware specifications provided in our dataset. For instance, variables such as CPU speed and RAM size could be used to predict the price of a laptop. Ultimately, our goal is to assist users and companies by providing a price estimate for their ideal laptop, reducing the time needed for research. Our predictive question is therefore **"What will be the price of a laptop based on its specifications?"** The dataset we use is an open-source file from Kaggle. Link to the original dataset: https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset/data

## Preparing Libraries and Setup

Python does not natively support everything we need to build this model, and it would be inefficient to reinvent the wheel. We therefore use the following libraries in our code:
#### **Pandas**
##### We use the DataFrame object from pandas to store and manipulate our data

#### **Altair**
##### Visualization library that allows us to make charts to help see the data

#### **NumPy**
##### Math library. We use it to set seeds and convert number types. It is also used by pandas in many operations

#### **scikit-learn**
##### Machine learning library. This will be the backbone of our regression model. We use it to build and train the model on our data, and to test the results.

In [1]:
### Uncomment cell below whenever Altair stops working to reinstall latest version

## For some reason, whenever the jupyter server restarts, it
## sends you back to the old version of altair (4.2.2)

In [2]:
#pip install -U altair      #<---- UNCOMMENT

In [3]:
## If the text below says anything below version 5.0.0,
## run the code above
import altair as alt; alt.__version__

'5.2.0'

In [4]:
### Run this cell before continuing.

import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error


# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")
    
np.random.seed(1137110237) #Randomly picked seed

### Importing and Wrangling Data

We use a dataset from Kaggle that contains laptop data (model, price, operating system, build, etc.). However, the dataset is not clean enough for our needs, so we will wrangle it first.
<br> Link for the original dataset: [Sadiq, E. (2023, February 15). Uncleaned Laptop Price Dataset. Retrieved November 2023.](https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset/data)
<br> (Loading the data directly from the original link proved unreliable, so we copied the raw data to our GitHub repository. No alterations have been made.)

To begin, we use `pd.read_csv()` to read the dataframe from the copy stored in our GitHub repository.

In [5]:
# Loading csv file data as a pandas dataframe
laptop_raw = pd.read_csv("https://raw.githubusercontent.com/fyip3/ds_project/main/data/laptopData.csv")
laptop_raw.head(5)

Unnamed: 0.1,Unnamed: 0,Company,TypeName,Inches,ScreenResolution,Cpu,Ram,Memory,Gpu,OpSys,Weight,Price
0,0.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8GB,128GB SSD,Intel Iris Plus Graphics 640,macOS,1.37kg,71378.6832
1,1.0,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8GB,128GB Flash Storage,Intel HD Graphics 6000,macOS,1.34kg,47895.5232
2,2.0,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8GB,256GB SSD,Intel HD Graphics 620,No OS,1.86kg,30636.0
3,3.0,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16GB,512GB SSD,AMD Radeon Pro 455,macOS,1.83kg,135195.336
4,4.0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8GB,256GB SSD,Intel Iris Plus Graphics 650,macOS,1.37kg,96095.808


Table 1

As we can see, the data is of little use to us as it is. There is an extra index column, most columns mix letters and numbers, the column names are less than ideal, and (although it is not visible here) there are many blank entries.

#### Fixing it
To fix this, we begin by dropping the extra index column and then the blank entries. This removes rows that our program would otherwise spend time parsing without gaining any information.

Next, we keep only the leading run of digits in the columns that are supposed to be numerical (`Ram` and `Memory`) using `str.extract()`. This leaves just a number, stored as a string, that we can later turn into a numeric value. From the `Weight` column, we remove the `kg` suffix using `str.removesuffix()`. We then multiply the price by the INR -> CAD conversion rate to get the correct currency for our analysis, and rename the columns to more descriptive names.

The last thing we do is convert all the numbers stored as strings into numeric values using `pd.to_numeric()` with `errors='coerce'` and drop any rows that failed to convert.
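As a small illustration of the extract-then-coerce pattern described above, the sketch below uses a made-up Series (the values are hypothetical, not taken from the dataset). It shows how `str.extract` keeps the first run of digits and how `errors='coerce'` turns unparseable entries into `NaN` so that `dropna()` can discard them.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the pattern
ram = pd.Series(["8GB", "16GB", "?", None])

# Keep the first run of digits; entries without digits become NaN
digits = ram.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything unparseable is coerced to NaN
ram_gb = pd.to_numeric(digits, errors="coerce")

# Drop the entries that could not be converted
clean = ram_gb.dropna()
print(clean.tolist())
```

Only the two entries that contain digits survive; the placeholder `"?"` and the missing value are removed.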

In [6]:
# Cleaning data
laptop_data = laptop_raw.drop(columns=["Unnamed: 0"])           # Drop the extra index column
laptop_data = laptop_data.dropna()                              # Drop rows with blank entries
# Keep only the leading digits of Ram and Memory, and strip the "kg" suffix from Weight
laptop_data['Ram'] = laptop_data['Ram'].str.extract(r'(\d+)', expand=False)
laptop_data['Weight'] = laptop_data['Weight'].str.removesuffix("kg")
laptop_data['Memory'] = laptop_data['Memory'].str.extract(r'(\d+)', expand=False)
laptop_data["Price"] = laptop_data["Price"] * 0.017                         # Convert Price from INR to CAD
laptop_data = laptop_data.rename(columns={"Inches": "ScreenSize_Inches", "Ram": "Memory_GB", "Memory" : "Storage", "Weight" : "Weight_Kg", "Price" : "Price_CAD"})
# Convert columns from strings to int/float
laptop_data["Memory_GB"] = pd.to_numeric(laptop_data.Memory_GB, errors='coerce')
laptop_data["Weight_Kg"] = pd.to_numeric(laptop_data.Weight_Kg, errors='coerce')
laptop_data["ScreenSize_Inches"] = pd.to_numeric(laptop_data.ScreenSize_Inches, errors='coerce')
laptop_data["Storage"] = pd.to_numeric(laptop_data.Storage, errors='coerce')
laptop_data = laptop_data.dropna()                              # Drop rows that failed to convert

In [7]:
laptop_data

Unnamed: 0,Company,TypeName,ScreenSize_Inches,ScreenResolution,Cpu,Memory_GB,Storage,Gpu,OpSys,Weight_Kg,Price_CAD
0,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 2.3GHz,8,128.0,Intel Iris Plus Graphics 640,macOS,1.37,1213.437614
1,Apple,Ultrabook,13.3,1440x900,Intel Core i5 1.8GHz,8,128.0,Intel HD Graphics 6000,macOS,1.34,814.223894
2,HP,Notebook,15.6,Full HD 1920x1080,Intel Core i5 7200U 2.5GHz,8,256.0,Intel HD Graphics 620,No OS,1.86,520.812000
3,Apple,Ultrabook,15.4,IPS Panel Retina Display 2880x1800,Intel Core i7 2.7GHz,16,512.0,AMD Radeon Pro 455,macOS,1.83,2298.320712
4,Apple,Ultrabook,13.3,IPS Panel Retina Display 2560x1600,Intel Core i5 3.1GHz,8,256.0,Intel Iris Plus Graphics 650,macOS,1.37,1633.628736
...,...,...,...,...,...,...,...,...,...,...,...
1298,Lenovo,2 in 1 Convertible,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4,128.0,Intel HD Graphics 520,Windows 10,1.80,577.874880
1299,Lenovo,2 in 1 Convertible,13.3,IPS Panel Quad HD+ / Touchscreen 3200x1800,Intel Core i7 6500U 2.5GHz,16,512.0,Intel HD Graphics 520,Windows 10,1.30,1357.734240
1300,Lenovo,Notebook,14.0,1366x768,Intel Celeron Dual Core N3050 1.6GHz,2,64.0,Intel HD Graphics,Windows 10,1.50,207.419040
1301,HP,Notebook,15.6,1366x768,Intel Core i7 6500U 2.5GHz,6,1.0,AMD Radeon R5 M330,Windows 10,2.19,692.000640


Table 2
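Row 1301 in Table 2 has a `Storage` value of 1.0. This is what keeping only the leading digits would produce for a capacity written in terabytes. That is an assumption about the raw string, since the row is not shown in Table 1. A possible refinement, sketched below with a hypothetical helper named `storage_to_gb`, reads the unit as well and converts everything to GB. The multiplier of 1000 GB per TB is also an assumption.

```python
import re

# Assumed unit multipliers; 1 TB is treated as 1000 GB here
UNIT_TO_GB = {"GB": 1, "TB": 1000}

def storage_to_gb(text):
    """Return the first capacity found in `text`, expressed in GB, or None."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(GB|TB)", text)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * UNIT_TO_GB[unit]

# Hypothetical examples in the style of the Memory column
print(storage_to_gb("128GB SSD"))   # 128.0
print(storage_to_gb("1TB HDD"))     # 1000.0
print(storage_to_gb("unknown"))     # None
```

The sketch only reads the first capacity in the string, so a laptop listing two drives would still be represented by the first one. It has not been applied to the results reported in this notebook.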

## Preliminary Data Analysis and Visualization

In [8]:
count = laptop_data.nunique()  # number of unique values for each variable
count

Company               19
TypeName               6
ScreenSize_Inches     24
ScreenResolution      40
Cpu                  118
Memory_GB             10
Storage               13
Gpu                  110
OpSys                  9
Weight_Kg            180
Price_CAD            775
dtype: int64

We will now create some visualizations to understand how the different variables (specifications of the laptop) relate to the price of each observation (laptop).
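The charts that follow each plot a per-group average price. The sketch below shows that aggregation on made-up data. The toy values are hypothetical and only the column names mirror the cleaned dataframe. It selects the price column explicitly so that only that column is averaged.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the cleaned dataframe
toy = pd.DataFrame({
    "TypeName": ["Notebook", "Notebook", "Gaming", "Gaming"],
    "Price_CAD": [500.0, 700.0, 1500.0, 1700.0],
})

# Select the price column explicitly, then average within each group
avg_price = (
    toy.groupby("TypeName")["Price_CAD"]
       .mean()
       .reset_index()
       .rename(columns={"Price_CAD": "Average Price"})
)
print(avg_price)
```

The result has one row per group and a single `Average Price` column, which is the shape the Altair charts expect.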

In [9]:
laptop_resolution_avg_price = (
    laptop_data.groupby(["ScreenResolution"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_resolution_plot = alt.Chart(laptop_resolution_avg_price, title = 'Figure 1').mark_bar().encode(
    x=alt.X("ScreenResolution")
        .title("Screen Resolution"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("ScreenResolution")
            .scale(scheme="category20b")
).configure_axisX(labelAngle=-45)
laptop_resolution_plot

In [10]:
laptop_ram_avg_price = (
    laptop_data.groupby(["Memory_GB"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_ram_plot = alt.Chart(laptop_ram_avg_price, title = 'Figure 2').mark_line().encode(
    x=alt.X("Memory_GB")
        .title("Installed Memory"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    # color=alt.Color("Memory_GB")
    #         .scale(scheme="category20b")
)
laptop_ram_plot

In [11]:
laptop_storage_avg_price = (
    laptop_data.groupby(["Storage"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_storage_plot = alt.Chart(laptop_storage_avg_price, title = 'Figure 3').mark_point().encode(
    x=alt.X("Storage")
        .title("Storage Capacity"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("Storage")
            .scale(scheme="category20b")
)
laptop_storage_plot

In [12]:
laptop_screen_size_plot_not_avg = alt.Chart(laptop_data, title = 'Figure 4').mark_point(opacity = 0.3).encode(
    x=alt.X("ScreenSize_Inches")
        .title("Screen Size in inches").scale(zero=False),
    y=alt.Y("Price_CAD")
        .title("Price of Laptops"),
   # color=alt.Color("ScreenSize_Inches")
).facet('TypeName')
laptop_screen_size_plot_not_avg

Figure 4

In [13]:
laptop_screen_size_avg_price = (
    laptop_data.groupby(["ScreenSize_Inches"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_screen_size_plot = alt.Chart(laptop_screen_size_avg_price, title = 'Figure 5').mark_point().encode(
    x=alt.X("ScreenSize_Inches")
        .title("Screen Size in inches").scale(zero=False),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("ScreenSize_Inches")
            .scale(scheme="category20b")
)
laptop_screen_size_plot

Beyond the numerical variables, let's also explore how the categorical variables relate to price by visualizing the mean price of each category.

In [14]:
laptop_brand_avg_price = (
    laptop_data.groupby(["Company"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_brand_plot = alt.Chart(laptop_brand_avg_price, title = 'Figure 6').mark_bar().encode(
    x=alt.X("Company")
        .title("Laptop Brand"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("Company")
            .scale(scheme="category20b")
).configure_axisX(labelAngle=-45)
laptop_brand_plot

In [15]:
laptop_type_avg_price = (
    laptop_data.groupby(["TypeName"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_type_plot = alt.Chart(laptop_type_avg_price, title = 'Figure 7').mark_bar().encode(
    x=alt.X("TypeName")
        .title("Laptop type"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("TypeName")
            .scale(scheme="category20b")
).configure_axisX(labelAngle=-45)
laptop_type_plot

In [16]:
laptop_resolution_avg_price = (
    laptop_data.groupby(["ScreenResolution"])
        .mean(numeric_only=True)
        .reset_index()
        .rename(columns = {"Price_CAD" : "Average Price"})
)

laptop_resolution_plot = alt.Chart(laptop_resolution_avg_price, title = 'Figure 8').mark_bar().encode(
    x=alt.X("ScreenResolution")
        .title("Screen Resolution"),
    y=alt.Y("Average Price")
        .title("Average Price of Laptops"),
    color=alt.Color("ScreenResolution")
            .scale(scheme="category20b")
).configure_axisX(labelAngle=-45)
laptop_resolution_plot

After conducting our preliminary data analysis through visualizations, we can observe several relationships with price. Since some categorical variables have a large number of unique values (for example, 118 for `Cpu` and 110 for `Gpu`), we use only the numeric columns in our dataset to build a K-nearest neighbours (KNN) regression model with scikit-learn. Namely: "ScreenSize_Inches" (screen size of the laptop in inches), "Memory_GB" (RAM in GB), "Storage" (storage space in GB) and "Weight_Kg" (weight of the laptop in kg).

## Separation Into Train and Test Data

We separate the data into train data (which will be used to develop the model) and test data (which will be used to gauge the accuracy of the model). The test data will not be seen by the model until it is time to test how well it performs.

In [17]:
laptop_train, laptop_test = train_test_split(
    laptop_data,
    test_size=.25,   # Test data will be a quarter of the full data set, train the rest
)
X_train = laptop_train[["ScreenSize_Inches", "Memory_GB", "Storage", "Weight_Kg"]]
y_train = laptop_train["Price_CAD"]

X_test = laptop_test[["ScreenSize_Inches", "Memory_GB", "Storage", "Weight_Kg"]]
y_test = laptop_test["Price_CAD"]


In [18]:
laptop_train.head(5)

Unnamed: 0,Company,TypeName,ScreenSize_Inches,ScreenResolution,Cpu,Memory_GB,Storage,Gpu,OpSys,Weight_Kg,Price_CAD
241,Asus,Notebook,17.3,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8,128.0,Nvidia GeForce 150MX,Windows 10,2.1,1037.0952
1080,Lenovo,Ultrabook,12.5,IPS Panel Touchscreen 2560x1440,Intel Core M 6Y75 1.2GHz,8,512.0,Intel HD Graphics 515,Windows 10,0.99,1267.15824
147,Asus,Notebook,15.6,Full HD 1920x1080,Intel Celeron Dual Core N3350 1.1GHz,4,1.0,Intel HD Graphics 500,Windows 10,2.0,311.58144
746,Samsung,Ultrabook,13.3,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,16,256.0,Intel HD Graphics 620,Windows 10,0.81,1493.59824
1270,Lenovo,2 in 1 Convertible,14.0,IPS Panel Full HD / Touchscreen 1920x1080,Intel Core i7 6500U 2.5GHz,4,128.0,Intel HD Graphics 520,Windows 10,1.8,577.87488


Table 3
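The split above does not pass a `random_state`, so it depends on the global NumPy seed set in the setup cell. As an alternative sketch on a hypothetical toy frame, a fixed `random_state` can be passed directly to `train_test_split`, which makes the split repeatable regardless of what else has consumed random numbers. The value 0 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for laptop_data
toy = pd.DataFrame({"Memory_GB": range(8), "Price_CAD": range(100, 900, 100)})

# random_state=0 is arbitrary; any fixed integer gives a repeatable split
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=0)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=0)

print(len(train_a), len(test_a))             # 6 2
print(train_a.index.equals(train_b.index))   # True
```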

## Building the Model

To select the optimal value for k in k-nearest neighbors (k-NN) regression, we will use cross-validation on our training data. We will employ the Root Mean Squared Prediction Error (RMSPE) as the scoring metric. A smaller RMSPE indicates that the predicted values closely align with the true values, while a larger RMSPE suggests a larger deviation.
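For reference, with $n$ observations, true prices $y_i$ and predicted prices $\hat{y}_i$, the quantity that scikit-learn reports as root mean squared error can be written as:

$$\text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

It is expressed in the same unit as the target, here Canadian dollars. Because scikit-learn treats larger scores as better, the `neg_root_mean_squared_error` scorer returns this value with its sign flipped.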

First, we will create a pipeline for k-NN regression. The pipeline will consist of the `StandardScaler`, which standardizes the numerical features, followed by the k-NN algorithm. We will store this pipeline in an object called `laptop_pipe`. Next, we will perform cross-validation using the `cross_validate` function with 10 folds, specifying the negative RMSPE (`"neg_root_mean_squared_error"`) as the scoring metric.
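To make the two pipeline steps concrete, here is a minimal from-scratch sketch of standardization followed by a k-NN prediction. It is not scikit-learn's implementation. The helper names `standardize` and `knn_predict` and the four toy laptops are hypothetical. The population standard deviation is used for scaling, as `StandardScaler` does.

```python
import math

def standardize(rows):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, stds)]

    return [scale(r) for r in rows], scale

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    by_distance = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    return sum(y for _, y in by_distance[:k]) / k

# Hypothetical laptops: (Memory_GB, Storage) and price in CAD
X = [(4, 128), (8, 256), (8, 512), (16, 512)]
y = [400.0, 800.0, 1000.0, 1600.0]

X_scaled, scale = standardize(X)
prediction = knn_predict(X_scaled, y, scale((8, 256)), k=2)
print(prediction)  # 600.0
```

Without scaling, a feature measured in hundreds (storage) would dominate the distance over one measured in single digits (weight or memory), which is why the scaler comes first in the pipeline.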

In [19]:
knn=KNeighborsRegressor()

laptop_pipe = make_pipeline(StandardScaler(),knn)

laptop_cv=pd.DataFrame(
    cross_validate(
        estimator=laptop_pipe,
        cv=10,
        X=X_train,
        y=y_train,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
)
laptop_cv

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.009825,0.004261,-305.234685,-319.329553
1,0.006683,0.004759,-424.038323,-307.334961
2,0.007021,0.004993,-297.605291,-324.645483
3,0.006414,0.004186,-316.23479,-324.40321
4,0.007294,0.010403,-371.836831,-312.372239
5,0.007062,0.0042,-389.959378,-319.459551
6,0.006581,0.004322,-359.90226,-315.207774
7,0.006973,0.004577,-478.81334,-304.423813
8,0.006866,0.004442,-486.351961,-302.368197
9,0.009189,0.005796,-412.285067,-316.294989


Table 4
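The scores in Table 4 are negative because of the scorer's sign convention. As a quick check, the sketch below copies the `test_score` column from Table 4, flips the sign, and averages across folds to obtain a single cross-validated error for the default `KNeighborsRegressor` (which uses 5 neighbours).

```python
# test_score values copied from Table 4 (negative RMSE per fold)
fold_scores = [
    -305.234685, -424.038323, -297.605291, -316.234790, -371.836831,
    -389.959378, -359.902260, -478.813340, -486.351961, -412.285067,
]

# Flip the sign to recover the error, then average across the 10 folds
mean_rmse = sum(-s for s in fold_scores) / len(fold_scores)
print(round(mean_rmse, 2))  # 384.23
```

This comes to about 384 dollars, which agrees with the `mean_test_score` reported for `n_neighbors = 5` in Table 5.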

We will now use a `param_grid` to test 50 values of `k` (1 through 50) in order to find the most effective one for our model. We will employ `GridSearchCV`, passing `laptop_pipe`, `param_grid`, `cv=10`, `n_jobs=-1` and once again `scoring="neg_root_mean_squared_error"`.

In [20]:
param_grid = {"kneighborsregressor__n_neighbors": range(1, 51, 1)}
laptop_tuned = GridSearchCV(laptop_pipe, param_grid, cv=10, n_jobs=-1, scoring="neg_root_mean_squared_error")
laptop_results = pd.DataFrame(laptop_tuned.fit(X_train,y_train).cv_results_).rename(columns={"param_kneighborsregressor__n_neighbors":"n_neighbors"})
laptop_results.head(15)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.008493,0.00441,0.0046,0.000514,1,{'kneighborsregressor__n_neighbors': 1},-366.354006,-566.207155,-507.91299,-352.20789,-439.688967,-450.104452,-525.761779,-513.396457,-425.16623,-441.336595,-458.813652,65.725352,50
1,0.006635,0.000304,0.004473,0.0006,2,{'kneighborsregressor__n_neighbors': 2},-350.356481,-479.366096,-385.325331,-306.255345,-379.655262,-440.209775,-438.557007,-483.435669,-438.677215,-416.819978,-411.865816,53.494498,49
2,0.006625,0.000252,0.004533,0.000599,3,{'kneighborsregressor__n_neighbors': 3},-324.246478,-432.288571,-344.998681,-320.434125,-381.694261,-398.284843,-396.126447,-496.651245,-427.545658,-428.209854,-395.048016,52.096899,31
3,0.006848,0.000726,0.004253,0.000164,4,{'kneighborsregressor__n_neighbors': 4},-303.991217,-432.753207,-317.692089,-318.204644,-376.327705,-391.261737,-366.69303,-472.87758,-468.583928,-409.82371,-385.820885,57.967706,21
4,0.006415,0.000335,0.004297,0.000161,5,{'kneighborsregressor__n_neighbors': 5},-305.234685,-424.038323,-297.605291,-316.23479,-371.836831,-389.959378,-359.90226,-478.81334,-486.351961,-412.285067,-384.226193,63.947163,19
5,0.007342,0.001841,0.004837,0.000861,6,{'kneighborsregressor__n_neighbors': 6},-313.599158,-419.904778,-299.33462,-320.632225,-350.518108,-398.67267,-360.986422,-485.423877,-481.124434,-406.950279,-383.714657,62.881589,16
6,0.007066,0.001802,0.004528,0.000587,7,{'kneighborsregressor__n_neighbors': 7},-326.291884,-411.644053,-294.802556,-327.908064,-330.82236,-390.959364,-377.502115,-483.368347,-483.296812,-412.362262,-383.895782,62.172068,18
7,0.006437,0.000248,0.004625,0.00063,8,{'kneighborsregressor__n_neighbors': 8},-327.326027,-414.515904,-298.256016,-329.882356,-314.207579,-386.118276,-369.524473,-484.238148,-478.63221,-411.073335,-381.377432,62.590773,10
8,0.006766,0.000956,0.004849,0.001097,9,{'kneighborsregressor__n_neighbors': 9},-329.482424,-404.970733,-303.79101,-327.483097,-306.270683,-391.192422,-369.160614,-461.882067,-479.863484,-421.984501,-379.608103,59.882154,5
9,0.006322,0.000278,0.004385,0.000108,10,{'kneighborsregressor__n_neighbors': 10},-329.889776,-406.339656,-295.584728,-338.189085,-299.369216,-401.408586,-367.829698,-461.09934,-480.370862,-403.021654,-378.31026,60.282275,3


Table 5

We can visualize this to get an idea of how the RMSE changes with the number of neighbors.

In [21]:
knn_graph = alt.Chart(laptop_results, title="Neg Mean RMSE vs K from Grid Search Algorithm | Figure 9").mark_line().encode(
    x=alt.X("n_neighbors").title("Number of neighbors"),
    y=alt.Y("mean_test_score")
        .title("Negative mean RMSE")
        .scale(zero=False)
)
knn_graph

As this is negative RMSE, the highest value on the y-axis is the most desirable. The negative RMSE appears to rise sharply up until around `K=10` (that is, the error drops) before decreasing at a roughly constant rate. So now, let's use Python to determine the best value of `K` as well as the lowest RMSE produced.

In [22]:
laptop_param = laptop_tuned.best_params_
laptop_RMSE = -laptop_tuned.best_score_

laptop_param

{'kneighborsregressor__n_neighbors': 11}

In [23]:
laptop_RMSE

376.8411474729222

## Putting the Model to the Test

To evaluate how well our model predicts unseen data, we will calculate the Root Mean Squared Prediction Error (RMSPE) on the test data. We will use the `predict` method of `laptop_tuned` (which automatically uses the best `k` parameter) to generate predictions on the test data and store the results in a variable called `laptop_prediction`. Then, we will compute the RMSPE on the test data by taking the square root of the output of the `mean_squared_error` function.

In [24]:
laptop_prediction = laptop_tuned.predict(X_test)

laptop_prediction_vs_real = laptop_test.assign(Predicted_CAD = laptop_prediction)
laptop_prediction_vs_real.head(5)

Unnamed: 0,Company,TypeName,ScreenSize_Inches,ScreenResolution,Cpu,Memory_GB,Storage,Gpu,OpSys,Weight_Kg,Price_CAD,Predicted_CAD
355,Dell,Notebook,15.6,Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8,128.0,AMD Radeon 530,Windows 10,2.02,879.402384,751.183822
69,Asus,Gaming,17.3,Full HD 1920x1080,Intel Core i7 7700HQ 2.8GHz,12,1.0,Nvidia GeForce GTX 1050 Ti,Linux,3.0,859.56624,949.648189
202,Acer,Notebook,15.6,Full HD 1920x1080,Intel Core i7 7500U 2.7GHz,8,1.0,Nvidia GeForce 940MX,Windows 10,2.23,672.07392,618.799587
678,LG,Ultrabook,15.6,IPS Panel Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,8,512.0,Intel HD Graphics 620,Windows 10,1.09,2082.34224,1508.544103
71,Dell,Ultrabook,13.3,IPS Panel Full HD 1920x1080,Intel Core i7 8550U 1.8GHz,64,256.0,AMD Radeon 530,Windows 10,1.4,865.0008,2952.929109


Table 6

Now we find the RMSPE of this prediction.

In [25]:
summary = mean_squared_error(y_test, laptop_prediction)**(1/2)  # arguments are (y_true, y_pred)

summary

386.2866224507617

## Evaluating Quality of Model

As we can see from the RMSPE of about $386, the model is decent at predicting prices based on the specifications of a laptop. Because this error is an absolute amount, it is proportionally more significant at the lower end of prices, so we wouldn't recommend using this model if you're thinking about work laptops. At the higher end, though, it provides a good approximation of the price you can expect to pay for a high-end laptop, whether it be a gaming laptop or some other powerhouse.

In [26]:
laptop_data['Price_CAD'].agg(['mean', 'median'])

mean      1018.063235
median     884.927520
Name: Price_CAD, dtype: float64
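To put the error in context, the short calculation below expresses the test RMSPE as a fraction of the mean and median prices. The three numbers are copied from the outputs above and the variable names are illustrative only.

```python
# Values copied from the outputs above
rmspe = 386.2866224507617
mean_price = 1018.063235
median_price = 884.927520

# Express the typical error as a fraction of a typical price
print(f"{rmspe / mean_price:.0%} of the mean price")      # 38% of the mean price
print(f"{rmspe / median_price:.0%} of the median price")  # 44% of the median price
```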

The mean and median laptop prices in our dataset are about 1000 and 900 dollars respectively, which is quite representative of the actual market. Against these figures our RMSPE might seem high, but this could have a lot more to do with the variation in laptop pricing itself: some laptops may be unusually cheap because of the quality of their components, while others may be expensive because of "gimmicks" such as OLED screens or mechanical keyboards. As an example, take a look at the following entry:

In [27]:
laptop_prediction_vs_real.iloc[4]

Company                                     Dell
TypeName                               Ultrabook
ScreenSize_Inches                           13.3
ScreenResolution     IPS Panel Full HD 1920x1080
Cpu                   Intel Core i7 8550U 1.8GHz
Memory_GB                                     64
Storage                                    256.0
Gpu                               AMD Radeon 530
OpSys                                 Windows 10
Weight_Kg                                    1.4
Price_CAD                               865.0008
Predicted_CAD                        2952.929109
Name: 71, dtype: object

At a glance, the Ultrabook form factor, i7 CPU, and 64 GB of RAM really stand out. You'd likely expect this laptop to be quite expensive (well above 1000 dollars). However, the actual price sits at a surprising 865 dollars. The model takes a look at these specs and decides that the laptop might be priced around 3000 dollars, perhaps only a bit more than what you yourself might expect.

Thus, we can conclude that the model meets our goal of serving as a simple tool that uses hardware specifications such as RAM and storage size to predict the price of a laptop, assisting users and companies by providing a price estimate for their ideal laptop and thereby reducing the time needed for research.


In the future, more accurate models could be investigated, along with whether including factors such as company name or the type of RAM/storage in the model improves or reduces its accuracy.

## Works Cited

**Through this project, we have used code and/or ideas that are similar to those presented in the following sources:**

  Timbers, T., Lee, M., Heagy, L., & Ostblom, J. (2022). Data Science: A First Introduction (Python Edition). https://python.datasciencebook.ca/ 
 
  NumFOCUS, Inc. (2023, December). Pandas documentation. pandas documentation - pandas 2.1.4 documentation. https://pandas.pydata.org/docs/ 
  
  Mattjin, (2023, Aug). TypeError: 'UndefinedType' object is not callable. https://stackoverflow.com/a/75450069


**Dataset Source:**

  Sadiq, E. (2023, February 15). Uncleaned Laptop Price Dataset. Retrieved November 2023, https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset. 