<a href="https://colab.research.google.com/github/ayushs0911/Laptop-Price-Predictions/blob/main/Without%20output.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import Libraries

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sn
import matplotlib.pyplot as plt 
%matplotlib inline 

In [None]:
from matplotlib import colors
cmap = colors.ListedColormap(["#682F2F", "#9E726F", "#D6B2B1", "#B9C0C9", "#9F8A78", "#F3AB60"])
palette= ["#682F2F","#F3AB60"]
pal = ["#682F2F","#B9C0C9", "#9F8A78","#F3AB60"]


# Data Handling 

In [None]:
df = pd.read_csv("/content/laptop_data.csv")

In [None]:
df.head()

In [None]:
df.columns 

**Dropping the Unnamed column.**

In [None]:
df = df[['Company', 'TypeName', 'Inches', 'ScreenResolution',
       'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']]
df.head()

Checking null values

In [None]:
df.isnull().sum()

Duplicated rows. 

In [None]:
df.duplicated().sum()

In [None]:
df.info()

In [None]:
categorical = df.select_dtypes(include = ['object']).columns 
numerical = df.select_dtypes(include = ['int32', 'int64', 'float32', 'float64']).columns 

categorical, numerical

In [None]:
def uniquevals(col):
  print(f"Details of the particular col {col} is : {df[col].unique()}")

def valuecounts(col):
  print(f"Valuecounts of the particular col {col} is : {df[col].value_counts()}")

In [None]:
for col in df.columns:
  uniquevals(col)
  print('-'*75)

**If we strip the `GB` suffix from `Ram` and the `kg` suffix from `Weight`, both columns can be stored as numbers. `Memory` needs more work and is handled later.**

In [None]:
df['Ram'] = df['Ram'].str.replace('GB', '')
df["Weight"] = df['Weight'].str.replace('kg', '')

#converting columns from string to Int 
df['Ram'] = df['Ram'].astype('int32')

#converting to float 
df['Weight'] = df['Weight'].astype('float32')

df.head()

In [None]:
df.info()

# Exploratory Data Analysis 

In [None]:
#viewing the distribution of the price column
sn.distplot(df.Price, color = 'red')

The price distribution is slightly right-skewed: most laptops cluster at lower prices, with a tail toward expensive models.

In [None]:
# plotting the countplots for categorical variables 

def drawplot(col):
  plt.figure(figsize = (10,7))
  sn.countplot(data = df, x = col, palette= pal)
  plt.xticks(rotation = 'vertical')

In [None]:
view = ['Company', 'TypeName', 'Ram', 'OpSys']
for col in view:
  drawplot(col)

In [None]:
# average price for each laptop brand
# this shows how the price of a laptop varies by company

plt.figure(figsize = (10,7))
sn.barplot(x = df['Company'], y = df['Price'], palette= pal)
plt.xticks(rotation = 'vertical')
plt.show()

In [None]:
# various types of laptops 

sn.countplot(data = df, x = 'TypeName', palette= pal)
plt.xticks(rotation = 'vertical')

In [None]:
# laptop type and variation about the price 

sn.barplot(x = df['TypeName'],y = df['Price'], palette= pal)
plt.xticks(rotation = 'vertical')


Notebook, the most common type in the dataset (a rough proxy for the best-selling type), also sits near the bottom of the price range. Affordability is a likely reason for its popularity.

In [None]:
# variation of price with screen size (inches)

sn.scatterplot(x = df['Inches'], y = df['Price'])

**The `ScreenResolution` column contains many distinct values. Besides the resolution itself, each entry tells us whether the display is a `Touchscreen`, an `IPS Panel`, or a normal screen, so we can split that information into separate features.**

In [None]:
df['ScreenResolution'].value_counts()

In [None]:
df['TouchScreen'] = df['ScreenResolution'].apply(lambda element: 1 
                                                 if 'Touchscreen' in element 
                                                 else 0)

In [None]:
sn.countplot(df, x = 'TouchScreen', palette= palette)

In [None]:
# touch screen on comparison with price of laptop 

sn.barplot(x = df.TouchScreen, y = df.Price, palette= palette)
plt.xticks(rotation = 'vertical')

In [None]:
# creating a new column named IPS: does the laptop have an IPS panel or not
df ['IPS'] = df['ScreenResolution'].apply(
    lambda element : 1 if 'IPS' in element else 0
)

In [None]:
df.sample()

In [None]:
sn.countplot(df, x = 'IPS', palette= palette)

In [None]:
sn.barplot(x = df.IPS, y = df.Price, palette= palette)
plt.xticks(rotation = 'vertical')

**Extracting the X resolution and Y Resolution**

In [None]:
#we will split the text at the "x" letter and separate the 2 parts 

splitdf = df['ScreenResolution'].str.split('x', n =1, expand = True)
splitdf.head()

In [None]:
df['X_res'] = splitdf[0]
df['Y_res'] = splitdf[1]
df.head()

`X_res` still contains descriptive text, so we need to extract just the digits from it.

We use a regular expression to pick out the number we are looking for.
- First replace every "," with "" and then search for numbers.
- In `\d+\.?\d+`, the first `\d+` matches one or more digits, `\.?` matches an optional decimal point, and the final `\d+` matches one or more further digits. A match therefore always has at least two digits, so a lone digit such as the `4` in `4K` is skipped.
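As a small illustration of how this pattern behaves, the sketch below applies it to two made-up strings shaped like the left half of a split `ScreenResolution` value. The sample strings and the helper name `extract_x_res` are assumptions for illustration, not values taken from the dataset.

```python
import re


def extract_x_res(text: str) -> int:
    """Return the first run of two or more digits found in `text`."""
    matches = re.findall(r'(\d+\.?\d+)', text.replace(',', ''))
    return int(matches[0])


# hypothetical left-hand parts of a resolution string after splitting at "x"
print(extract_x_res("IPS Panel Full HD 1920"))
print(extract_x_res("4K Ultra HD 3840"))  # the lone "4" in "4K" is not matched
```

Taking `matches[0]` mirrors the `.apply(lambda x: x[0])` step used in the notebook cell.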

In [None]:
df['X_res'] = df['X_res'].str.replace(',', '').str.findall(r'(\d+\.?\d+)').apply(lambda x:x[0])
df.head()

In [None]:
df['X_res'] = df['X_res'].astype('int')
df['Y_res'] = df['Y_res'].astype('int')

In [None]:
plt.figure(figsize = (10,7))
# numeric_only=True skips the text columns (required on pandas >= 2.0)
sn.heatmap(df.corr(numeric_only=True), annot = True, cmap = cmap)

In [None]:
df.corr(numeric_only=True)['Price']

The results above show that `X_res` and `Y_res` are positively correlated with `Price`, while `Inches` on its own shows a much weaker relationship. We can combine all three into a single feature. <br>
We can create a new column named `PPI` (pixels per inch).

$$
\text{PPI (pixels per inch)} = \frac{\sqrt{X_{\text{res}}^2 + Y_{\text{res}}^2}}{\text{Inches}}
$$
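To make the formula concrete, here is a worked example for a single hypothetical laptop with a 1920x1080 display on a 15.6 inch screen. The numbers and the function name `pixels_per_inch` are chosen for illustration only.

```python
import math


def pixels_per_inch(x_res: int, y_res: int, inches: float) -> float:
    """Diagonal resolution in pixels divided by the diagonal size in inches."""
    return math.sqrt(x_res ** 2 + y_res ** 2) / inches


ppi = pixels_per_inch(1920, 1080, 15.6)
print(round(ppi, 2))  # about 141.21
```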

In [None]:
df['PPI'] = (((df['X_res']**2+df['Y_res']**2))**0.5/df['Inches']).astype('float')
df.head()

In [None]:
df.corr(numeric_only=True)['Price']

The correlation data shows that `PPI` correlates well with `Price`. Because it combines three features (`X_res`, `Y_res`, and `Inches`) into one, we will keep `PPI` and drop `Inches`, `X_res`, and `Y_res`, along with the original `ScreenResolution` column.

In [None]:
df.drop(columns = ['ScreenResolution', 'Inches', 'X_res','Y_res'], inplace = True)

In [None]:
df.head()

**Processing `Cpu` column.**

In [None]:
df['Cpu'].value_counts()

The most common processors are made by Intel, so we will group them into categories: `i3`, `i5`, `i7`, and `other`. Here `other` means Intel processors that are not i3, i5, or i7; they are quite different, which is why I group them together. The remaining category is `AMD`, which is a different family altogether.

In [None]:
text = "Intel Core i5 7200U 2.5GHz"
' '.join(text.split()[:3])

In [None]:
df['CPU_name'] = df['Cpu'].apply(lambda text : " ".join(text.split()[:3]))
df.head()

If the processor is an Intel `i3`, `i5`, or `i7`, we return the name as it is. <br>
For any other processor, we first check whether it is an Intel variant. <br>
If yes, we tag it as 'Other Intel Processor'; otherwise we tag it as 'AMD Processor'.

In [None]:
def processortype(text):
  if text =='Intel Core i7' or text == 'Intel Core i5' or text == 'Intel Core i3':
    return text 
  
  else :
    if text.split()[0] == 'Intel':
      return "Other Intel Processor"
    else: 
      return 'AMD Processor'

In [None]:
df['CPU_name'] = df['CPU_name'].apply(lambda text:processortype(text))

In [None]:
df['CPU_name'].value_counts()

In [None]:
#price vs processor variation 
sn.barplot(x = df['CPU_name'], y = df['Price'], palette = palette)
plt.xticks(rotation = 'vertical')

In [None]:
##dropping the CPU column 
df.drop(columns = ['Cpu'], inplace = True)
df.head(1)

**Analysis on the RAM Column**

In [None]:
sn.countplot(df, x = 'Ram', palette = palette)

In [None]:
# price and RAM relation 

sn.barplot(x = df.Ram, y = df.Price, palette = palette)

**Memory Column**<br>


In [None]:
df['Memory'].value_counts()

The 4 most common variants observed are HDD, SSD, Flash, and Hybrid (SSD + HDD).
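The cells that follow break each `Memory` string into per-type capacities step by step. As a compact sketch of the same idea applied to a single string, the hypothetical helper `parse_memory` below splits at "+", reads the size, and converts TB to GB by multiplying by 1000 (matching the notebook's `TB` to `000` replacement). The sample strings are assumptions about how entries look, not rows copied from the dataset.

```python
import re

# label as it may appear in the text -> output key (names are illustrative)
STORAGE_TYPES = {
    "HDD": "HDD",
    "SSD": "SSD",
    "Hybrid": "Hybrid",
    "Flash Storage": "Flash_Storage",
}


def parse_memory(memory: str) -> dict:
    """Return capacity in GB per storage type for one Memory string."""
    totals = {column: 0 for column in STORAGE_TYPES.values()}
    for part in memory.split("+"):
        part = part.strip()
        match = re.match(r"(\d+(?:\.\d+)?)\s*(GB|TB)", part)
        if not match:
            continue  # skip anything that does not start with a size
        size = float(match.group(1))
        if match.group(2) == "TB":
            size *= 1000  # same convention as replacing 'TB' with '000'
        for label, column in STORAGE_TYPES.items():
            if label in part:
                totals[column] += int(size)
    return totals


print(parse_memory("128GB SSD +  1TB HDD"))
print(parse_memory("1.0TB Hybrid"))
```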

In [None]:
#removing the decimal part --> 1.0TB becomes 1TB
#some entries have these floats
df['Memory'] = df['Memory'].astype(str).replace(r'\.0', '', regex = True)

In [None]:
#replacing the word GB with an empty string
df['Memory'] = df['Memory'].str.replace('GB', '')

In [None]:
#replace TB word with 000
df['Memory'] = df['Memory'].str.replace('TB', '000')

In [None]:
#split the word across '+' character 
newdf = df['Memory'].str.split('+', n = 1, expand = True)

In [None]:
newdf

In [None]:
df['first'] = newdf[0]
df['first'] = df['first'].str.strip()
df.head(1)

In [None]:
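def applychanges(value):
    # spaces are removed before matching so that an entry written as
    # "Flash Storage" still matches the 'FlashStorage' key
    df['Layer1'+value] = df['first'].apply(lambda x:1 if value in x.replace(' ', '') else 0)


listtoapply = ['HDD','SSD','Hybrid','FlashStorage']
for value in listtoapply:
    applychanges(value)


df.head()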

In [None]:
# remove all the characters just keep the numbers
# regex=True is needed because pandas >= 2.0 treats the pattern literally by default

df['first'] = df['first'].str.replace(r'\D','', regex=True)
df['first'].value_counts()

In [None]:
df["Second"] = newdf[1]

In [None]:
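def applychanges1(value):
    # spaces are removed before matching so that an entry written as
    # "Flash Storage" still matches the 'FlashStorage' key
    df['Layer2'+value] = df['Second'].apply(lambda x:1 if value in x.replace(' ', '') else 0)


listtoapply1 = ['HDD','SSD','Hybrid','FlashStorage']
df['Second'] = df['Second'].fillna("0")
for value in listtoapply1:
    applychanges1(value)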
    

# remove all the characters just keep the numbers
# regex=True is needed because pandas >= 2.0 treats the pattern literally by default

df['Second'] = df['Second'].str.replace(r'\D','', regex=True)
df['Second'].value_counts()

In [None]:
df['first'] = df['first'].astype('int')
df['Second'] = df['Second'].astype('int')
df.head()

In [None]:
# multiplying the elements and storing the result in subsequent columns


df["HDD"]=(df["first"]*df["Layer1HDD"]+df["Second"]*df["Layer2HDD"])
df["SSD"]=(df["first"]*df["Layer1SSD"]+df["Second"]*df["Layer2SSD"])
df["Hybrid"]=(df["first"]*df["Layer1Hybrid"]+df["Second"]*df["Layer2Hybrid"])
df["Flash_Storage"]=(df["first"]*df["Layer1FlashStorage"]+df["Second"]*df["Layer2FlashStorage"])


## dropping the unnecessary columns

df.drop(columns=['first', 'Second', 'Layer1HDD', 'Layer1SSD', 'Layer1Hybrid',
       'Layer1FlashStorage', 'Layer2HDD', 'Layer2SSD', 'Layer2Hybrid',
       'Layer2FlashStorage'],inplace=True)

In [None]:
df.sample(5)

In [None]:
df.drop(columns = ["Memory"], inplace = True)

In [None]:
df.corr(numeric_only=True)['Price']

Based on the correlations, `Hybrid` and `Flash_Storage` have an almost negligible relationship with `Price`, so we can simply drop them. <br>
`HDD` and `SSD` show a useful correlation; `HDD` has a negative relation with `Price`.

In [None]:
df.columns 

In [None]:
df.drop(columns = ['Hybrid', 'Flash_Storage'], inplace = True)

**Analysis on GPU**

In [None]:
df['Gpu'].value_counts()

In [None]:
df['Gpu brand'] = df['Gpu'].apply(lambda x:x.split()[0])

In [None]:
sn.countplot(df, x = 'Gpu brand',palette=palette)

In [None]:
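#removing the rows whose GPU brand is 'ARM'
#.copy() avoids pandas SettingWithCopy warnings when columns are assigned later

df = df[df['Gpu brand']!= 'ARM'].copy()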

In [None]:
sn.countplot(df, x = 'Gpu brand',palette=palette)

In [None]:
sn.barplot(x = df['Gpu brand'], y = df.Price, estimator = np.median, palette = palette)

In [None]:
df = df.drop(columns = ['Gpu'])
df.head(1)

**Operating System Analysis**

In [None]:
df['OpSys'].value_counts()

In [None]:
sn.barplot( x = df['OpSys'], y = df['Price'], palette = palette)
plt.xticks(rotation = 'vertical')
plt.show()

- Club Windows 10, Windows 7, Windows 10 S --> Windows
- Club macOS, Mac OS X --> Mac
- Everything else --> Other

In [None]:
def setcategory(text):
    if text=='Windows 10' or text=='Windows 7' or text=='Windows 10 S':
        return 'Windows'
    elif text=='Mac OS X' or text=='macOS':
        return 'Mac'
    else:
        return 'Other'
    
    
df['OpSys'] = df['OpSys'].apply(lambda x:setcategory(x))
df.head()

In [None]:
sn.countplot(df, x = 'OpSys', palette=palette)


In [None]:
sn.barplot(x = df['OpSys'],y = df['Price'], palette = palette)
plt.xticks(rotation = 'vertical')

**Weight Analysis**

In [None]:
sn.distplot(df['Weight'])

In [None]:
sn.scatterplot(df, x= 'Weight' , y = 'Price')

**Price Analysis**

In [None]:
sn.distplot(df['Price'])

The distribution is right-skewed, so we can try applying a log transform to bring it closer to a normal (Gaussian) distribution.
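The effect of a log transform on skew can be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for prices (the parameters 10.8 and 0.6 and the helper `skewness` are assumptions made for illustration, not fitted to the laptop data) and compares the sample skewness before and after taking the log.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)
# synthetic, strictly positive "prices" with a long right tail
prices = [rng.lognormvariate(10.8, 0.6) for _ in range(2000)]
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A skewness near zero after the transform indicates a roughly symmetric distribution. Predictions made on the log scale need `exp` applied to return to the original price scale.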

In [None]:
sn.distplot(np.log(df['Price']))

In [None]:
df.corr(numeric_only=True)['Price']

In [None]:
plt.figure(figsize=(10,5))
sn.heatmap(df.corr(numeric_only=True),annot=True,cmap=cmap)

# Model Building

In [None]:
# note: `test` holds the log-transformed target and `train` holds the features
test = np.log(df['Price'])
train = df.drop(['Price'], axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn import metrics
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn import tree

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train,test,
                                                   test_size=0.15,random_state=2)

X_train.shape,X_test.shape

`ColumnTransformer` is used to build the model pipelines. For this we need the index numbers of the columns that hold categorical variables.

In [None]:
mapper = {i:value for i, value in enumerate(X_train.columns)}
mapper

## Linear Regression

We will apply one-hot encoding to the columns at these indices --> [0,1,3,8,11]. The remainder is kept as passthrough, i.e., no column is affected except the ones undergoing the transformation.

This section uses scikit-learn's `ColumnTransformer`, `LinearRegression`, and `Pipeline` classes to build an ML pipeline for data preprocessing and regression.

Step 1: ColumnTransformer
The `ColumnTransformer` class applies different preprocessing steps to different columns of the input data.
- The `transformers` argument takes a list of tuples, where each tuple contains a name for the transformation ('col_tnf'), the transformer to apply (`OneHotEncoder`), and the indices of the columns it applies to.
- The `sparse_output=False` argument (called `sparse` before scikit-learn 1.2) specifies that the encoded output should be returned as a dense array rather than a sparse matrix.
- The `drop='first'` argument instructs the `OneHotEncoder` to drop the first category in each encoded feature, which avoids multicollinearity issues.

The `remainder='passthrough'` argument specifies that any columns not explicitly specified in the transformers list should be passed through without any transformations.


Pipeline:
The `Pipeline` class is used to chain multiple steps together into a single object that can be used as a single estimator. It sequentially applies a list of transformers and ends with an estimator.  

- The first step in the pipeline is `step1`, which represents the `ColumnTransformer` defined earlier.
- The second step is `step2`, which represents the `LinearRegression` model.

In [None]:
step1 = ColumnTransformer(transformers=[
    # `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,3,8,11])
],remainder='passthrough')

step2 = LinearRegression()

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',metrics.r2_score(y_test,y_pred))
print('MAE',metrics.mean_absolute_error(y_test,y_pred)) 

## Ridge Regression
Ridge regression is a regularization technique used in linear regression to mitigate the problem of multicollinearity and overfitting. It adds a penalty term to the ordinary least squares (OLS) objective function to control the complexity of the model and reduce the impact of correlated features.

In ordinary least squares, the goal is to minimize the sum of squared differences between the predicted values and the actual target values. However, when the input features are highly correlated, the coefficient estimates can become sensitive to small changes in the input data. This leads to instability and overfitting.

Ridge regression addresses this issue by introducing a regularization term that penalizes large coefficient values.  
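For reference, the objective that scikit-learn's `Ridge` minimizes can be written as follows, where $w$ is the coefficient vector, $X$ the feature matrix, $y$ the target, and $\alpha \ge 0$ the penalty strength (the notation is chosen here for illustration):

$$
\min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
$$

Setting $\alpha = 0$ recovers ordinary least squares.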

The addition of the regularization term encourages the model to find coefficient values closer to zero, effectively reducing their impact on the predictions. The larger the alpha value, the stronger the penalty and the more the coefficients are shrunk.

By shrinking the coefficients, ridge regression helps to mitigate multicollinearity by reducing the impact of highly correlated features. This leads to a more stable and robust model, with less sensitivity to minor changes in the input data. However, it is important to note that ridge regression does not perform feature selection or eliminate irrelevant features entirely. Instead, it reduces their impact.

To apply ridge regression, the alpha value needs to be chosen. This is typically done through cross-validation, where different alpha values are tested, and the one that provides the best performance on unseen data is selected.
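One way to carry out this selection is scikit-learn's `RidgeCV`, which cross-validates a list of candidate alphas. The sketch below is not part of the original workflow: it runs on synthetic data with two deliberately correlated features, and the candidate grid is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# make feature 1 almost a copy of feature 0 to mimic multicollinearity
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=200)
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

candidate_alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
model = RidgeCV(alphas=candidate_alphas, cv=5).fit(X, y)

print("selected alpha:", model.alpha_)
print("coefficients:", model.coef_)
```

In the notebook's setting, `RidgeCV` could take the place of `Ridge(alpha=10)` as the final pipeline step, so that alpha is chosen from the training folds rather than fixed by hand.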

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,3,8,11])
],remainder='passthrough')

step2 = Ridge(alpha=10)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',metrics.r2_score(y_test,y_pred))
print('MAE',metrics.mean_absolute_error(y_test,y_pred))

## Lasso 

The key difference from ridge is that the lasso penalty term (an L1 penalty) can force some of the model coefficients to become exactly 0, removing those features from the model entirely (given a large enough penalty strength λ, called `alpha` in scikit-learn). This gives lasso an additional use: **feature selection**. Ridge regression cannot do this, because it only shrinks coefficients toward zero.


By promoting sparsity, lasso can effectively identify and select the most relevant features, removing irrelevant or redundant features from the model. This can lead to improved interpretability and more efficient models.

As with ridge regression, the alpha value needs to be chosen to balance the level of regularization. Cross-validation is commonly used to select the optimal alpha value by evaluating the model's performance on unseen data.

In scikit-learn, you can use the Lasso class to perform lasso regression. It provides methods for fitting the model, making predictions, and accessing the coefficient values. Additionally, scikit-learn also provides the ElasticNet class, which combines the penalties of both ridge regression and lasso, allowing for a mix of both regularization techniques.
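The sparsity difference between the two penalties is easy to observe on synthetic data. In the sketch below only the first three of ten features influence the target; the data, the coefficient values, and `alpha=0.5` are all assumptions chosen to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# only features 0, 1 and 2 carry signal; the other seven are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("lasso coefficients:", np.round(lasso.coef_, 2))
print("ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros -> lasso:", n_zero_lasso, "ridge:", n_zero_ridge)
```

Lasso sets coefficients of uninformative features to exactly zero, while ridge leaves them small but non-zero.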

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,3,8,11])
],remainder='passthrough')

step2 = Lasso(alpha=0.001)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',metrics.r2_score(y_test,y_pred))
print('MAE',metrics.mean_absolute_error(y_test,y_pred))

## Decision Tree

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,3,8,11])
],remainder='passthrough')

step2 = DecisionTreeRegressor(max_depth=8)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',metrics.r2_score(y_test,y_pred))
print('MAE',metrics.mean_absolute_error(y_test,y_pred))

## Random Forest

Random Forest regression is an ensemble learning method that combines multiple decision trees to make predictions. It is a powerful and popular technique for regression tasks that provides improved accuracy and robustness compared to individual decision trees.

Here's how Random Forest regression works:

1. Data Sampling:
   Random Forest uses a technique called Bootstrap Aggregating, or "bagging," to create multiple subsets of the original training data. Each subset, called a "bootstrap sample," is created by randomly selecting data points from the original training set with replacement. These subsets are used to train individual decision trees.

2. Building Decision Trees:
   For each bootstrap sample, a decision tree is constructed using the following steps:
   - Feature Selection: At each node of the decision tree, a random subset of features is considered for splitting. This random subset of features helps introduce diversity among the trees and reduce correlation.
   - Splitting: The decision tree is built by recursively selecting the best feature and split point based on a chosen criterion (for regression, typically the reduction in squared error; Gini impurity and information gain are the classification counterparts).
   - Tree Growth: The tree continues to grow until a specified stopping criterion is met, such as reaching a maximum depth or minimum number of samples in a leaf node.

3. Aggregating Predictions:
   Once all the individual decision trees are built, predictions are made by aggregating the predictions of each tree. For regression, the predictions from each tree are averaged to obtain the final prediction.

The main advantages of Random Forest regression include:
- Reduction of overfitting: Random Forest helps mitigate overfitting by using multiple trees with different subsets of data and features. This helps to capture a more generalized pattern from the data.
- Robustness: Random Forest is less sensitive to outliers and noisy data compared to a single decision tree.
- Feature Importance: Random Forest provides a measure of feature importance based on how much the mean squared error (MSE) is reduced by each feature across all the trees. This information can be useful for feature selection.

In scikit-learn, you can use the `RandomForestRegressor` class to implement Random Forest regression. It provides methods for fitting the model, making predictions, and accessing feature importances.
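The three steps described above can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how `RandomForestRegressor` is implemented internally; the data, the number of trees, and `max_features=0.75` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

trees = []
for seed in range(25):
    # 1. data sampling: draw a bootstrap sample (with replacement)
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # 2. build a tree that considers a random 75% of features at each split
    tree_model = DecisionTreeRegressor(max_features=0.75, random_state=seed)
    tree_model.fit(X_tr[idx], y_tr[idx])
    trees.append(tree_model)

# 3. aggregate: average the per-tree predictions
per_tree = np.stack([t.predict(X_te) for t in trees])
forest_pred = per_tree.mean(axis=0)

forest_mse = float(np.mean((forest_pred - y_te) ** 2))
avg_tree_mse = float(np.mean((per_tree - y_te) ** 2))
print("average single-tree MSE:", round(avg_tree_mse, 3))
print("averaged-ensemble MSE:  ", round(forest_mse, 3))
```

Because squared error is convex, the error of the averaged prediction can never exceed the average error of the individual trees, which is one way to see why aggregation helps.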

In [None]:
step1 = ColumnTransformer(transformers=[
    ('col_tnf',OneHotEncoder(sparse_output=False,drop='first'),[0,1,3,8,11])
],remainder='passthrough')

step2 = RandomForestRegressor(n_estimators=100,
                              random_state=3,
                              max_samples=0.5,
                              max_features=0.75,
                              max_depth=15)

pipe = Pipeline([
    ('step1',step1),
    ('step2',step2)
])

pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)

print('R2 score',metrics.r2_score(y_test,y_pred))
print('MAE',metrics.mean_absolute_error(y_test,y_pred))

In [None]:
import pickle
# context managers make sure the files are closed after writing
with open('df.pkl', 'wb') as f:
    pickle.dump(df, f)
with open('pipe.pkl', 'wb') as f:
    pickle.dump(pipe, f)

In [None]:
train.to_csv('traineddata.csv', index = None)

In [None]:
indexlist = [0,1,3,8,11]
transformlist = []
for key, value in mapper.items():
  if key in indexlist:
    transformlist.append(value)

In [None]:
transformlist 

In [None]:
train = pd.get_dummies(train, columns = transformlist, drop_first = True)
train.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train,test,
                                                   test_size=0.15,random_state=2)

X_train.shape,X_test.shape

In [None]:
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X_train,y_train)
plt.figure(figsize=(16,9))
tree.plot_tree(reg,filled=True,feature_names=list(train.columns))

### We have to optimize it. 

In [None]:
path = reg.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas 

In [None]:
alphalist = []
for alpha in ccp_alphas:
  reg = DecisionTreeRegressor(random_state = 0, ccp_alpha = alpha)
  reg.fit(X_train, y_train)
  alphalist.append(reg)

In [None]:
train_score = [reg.score(X_train, y_train) for reg in alphalist]
test_score = [reg.score(X_test, y_test) for reg in alphalist]

plt.xlabel('ccp alpha')
plt.ylabel('R2 score')

plt.plot(ccp_alphas, train_score, marker = 'o', 
         label = 'training', color = 'magenta')
plt.plot(ccp_alphas, test_score, marker = '+', 
         label = 'testing', color = 'red', drawstyle = 'steps-post')
plt.legend()
plt.show()

**Judging from the plot, plausible values of `ccp_alpha` lie roughly in the range `[0.0025, 0.0075]`; the next cell uses `0.0085`, slightly above that range.**
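Instead of reading the value off the plot, the pruning strength can also be picked by cross-validation. The sketch below is an alternative under stated assumptions: it uses synthetic data, thins the candidate list to every fifth alpha to keep it fast, and scores each candidate with 5-fold R2.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(150, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=150)

path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
# drop the last alpha (it prunes the tree to a single node), clip any
# tiny negative values caused by floating point, and thin the grid
candidates = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))[::5]

mean_scores = [
    cross_val_score(
        DecisionTreeRegressor(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    for alpha in candidates
]
best_alpha = candidates[int(np.argmax(mean_scores))]
print("best ccp_alpha by 5-fold CV:", best_alpha)
```

This keeps the held-out test split out of the tuning loop, so the test score remains an honest estimate.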

In [None]:
reg = DecisionTreeRegressor(random_state=0,ccp_alpha=0.0085)
reg.fit(X_train,y_train)
plt.figure(figsize=(16,9))
tree.plot_tree(reg,filled=True,feature_names=list(train.columns))

- The pruned tree grows only to a moderate depth.
- The result is a short and simple tree.
- The MSE also decreases.

# Hyperparameter Tuning 

---



In [None]:
params=  {
    
    'RandomForest':{
        'model' : RandomForestRegressor(),
        'params':{
            'n_estimators':[int(x) for x in np.linspace(100,1200,10)],
            'criterion':['squared_error'],
            'max_depth':[int(x) for x in np.linspace(1,30,5)],
            'max_features':[1.0,'sqrt','log2'],  # 1.0 (all features) replaces 'auto', removed in scikit-learn 1.3
            'ccp_alpha':[x for x in np.linspace(0.0025,0.0125,5)],
            'min_samples_split':[2,5,10,14],
            'min_samples_leaf':[2,5,10,14],
        }
    },
    'Decision Tree':{
        'model':DecisionTreeRegressor(),
        'params':{
            'criterion':['squared_error'],
            'max_depth':[int(x) for x in np.linspace(1,30,5)],
            'max_features':[1.0,'sqrt','log2'],  # 1.0 (all features) replaces 'auto', removed in scikit-learn 1.3
            'ccp_alpha':[x for x in np.linspace(0.0025,0.0125,5)],
            'min_samples_split':[2,5,10,14],
            'min_samples_leaf':[2,5,10,14],
        }
    }
}

In [None]:
scores = []

for modelname, mp in params.items():
  clf = RandomizedSearchCV(mp['model'],
                           param_distributions = mp['params'], cv = 5,
                           n_iter = 10, scoring = 'neg_mean_squared_error', verbose =2)
  clf.fit(X_train, y_train)
  scores.append({
      'model_name' : modelname,
      'best_score' : clf.best_score_,
      'best_estimator' : clf.best_estimator_,
  })


In [None]:
scores_df = pd.DataFrame(scores, columns = ['model_name', 'best_score', 'best_estimator'])

In [None]:
scores_df

In [None]:
scores

In [None]:
rf = RandomForestRegressor(ccp_alpha=0.0025, max_depth=22, min_samples_leaf=14,
                        min_samples_split=5, n_estimators=1200)

rf.fit(X_train,y_train)
# score the tuned forest's own predictions (not the stale y_pred from an earlier cell)
y_pred = rf.predict(X_test)
print(metrics.r2_score(y_test,y_pred))

# Prediction on whole dataset 

In [None]:
predicted = []
testtrain = np.array(train)
for i in range(len(testtrain)):
    predicted.append(rf.predict([testtrain[i]]))
    
predicted

In [None]:
# as we transformed our price variable with np.log,
# we have to invert it with np.exp in order to get the price back

ans = [np.exp(predicted[i][0]) for i in range(len(predicted))]

In [None]:
df['Predicted Price'] = np.array(ans)
df

In [None]:
sn.distplot(df['Price'], hist = False, color = 'orange', label = 'Actual')
sn.distplot(df['Predicted Price'], hist = False, color = 'blue', label = 'Prediction')
plt.legend()
plt.show()

## This is not a good result. 

# Random Forest Regressor version_2

In [None]:
rf1 = RandomForestRegressor(n_estimators=100,
                              random_state=3,
                              max_samples=0.5,
                              max_features=0.75,
                              max_depth=15)

rf1.fit(X_train,y_train)
print(f'R2 score : {metrics.r2_score(y_test,rf1.predict(X_test))}')

In [None]:
predicted = []
testtrain = np.array(train)
for i in range(len(testtrain)):
    predicted.append(rf1.predict([testtrain[i]]))
    
predicted

In [None]:
ans = [np.exp(predicted[i][0]) for i in range(len(predicted))]

In [None]:
data = df.copy()
data['Predicted Price'] = np.array(ans)
data

In [None]:
sn.distplot(data['Price'],hist=False,color='orange',label='Actual')
sn.distplot(data['Predicted Price'],hist=False,color='blue',label='Predicted')
plt.legend()
plt.show()

### Better results. 

In [None]:
import pickle
with open('laptoppricepredictor.pkl','wb') as file:
    pickle.dump(rf1,file)