<a href="https://colab.research.google.com/github/dejokz/ML-Competitions/blob/baseline-improve/Predict_book_prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://i.imgur.com/hkSNxIO.png)

Machine Hack [Predict the Price of Books](https://https://machinehack.com/hackathons/predict_the_price_of_books/overview)

Important Details:

Size of training set: 6237 recordsl; Size of test set: 1560 records
FEATURES

* Title: The title of the book
* Author: The author(s) of the book
* Edition: The edition of the book eg (Paperback,– Import, 26 Apr 2018)
* Reviews: The customer reviews about the book
* Ratings: The customer ratings of the book
* Synopsis: The synopsis of the
* Genre: The genre the book belongs to
* BookCategory: The department the book is usually available at
* Price: The price of the book (Target variable)

![](https://imgur.com/a/EVErQra)

# Importing and loading data and libraries

*Notebook is being built on Google Colab,hence mounting drive and importing the [data](https://https://machinehack.com/hackathons/predict_the_price_of_books/data)*

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls /content/drive/MyDrive/dataset/predict_book

Data_Test.xlsx	Data_Train.xlsx  Sample_Submission.xlsx


In [2]:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn.metrics import make_scorer, mean_squared_log_error
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, KFold, train_test_split
import plotly.express as px
import xgboost as xgb

In [3]:
try:
  df = pd.read_excel('/content/drive/MyDrive/dataset/predict_book/Data_Train.xlsx')
  test_df = pd.read_excel('/content/drive/MyDrive/dataset/predict_book/Data_Test.xlsx')
except FileNotFoundError:
  print("File not found. Please check the path and try again")
except Exception as e:
  print(f"Error loading data: {e}")

In [4]:
display(df)

Unnamed: 0,Title,Author,Edition,Reviews,Ratings,Synopsis,Genre,BookCategory,Price
0,The Prisoner's Gold (The Hunters 3),Chris Kuzneski,"Paperback,– 10 Mar 2016",4.0 out of 5 stars,8 customer reviews,THE HUNTERS return in their third brilliant no...,Action & Adventure (Books),Action & Adventure,220.00
1,Guru Dutt: A Tragedy in Three Acts,Arun Khopkar,"Paperback,– 7 Nov 2012",3.9 out of 5 stars,14 customer reviews,A layered portrait of a troubled genius for wh...,Cinema & Broadcast (Books),"Biographies, Diaries & True Accounts",202.93
2,Leviathan (Penguin Classics),Thomas Hobbes,"Paperback,– 25 Feb 1982",4.8 out of 5 stars,6 customer reviews,"""During the time men live without a common Pow...",International Relations,Humour,299.00
3,A Pocket Full of Rye (Miss Marple),Agatha Christie,"Paperback,– 5 Oct 2017",4.1 out of 5 stars,13 customer reviews,A handful of grain is found in the pocket of a...,Contemporary Fiction (Books),"Crime, Thriller & Mystery",180.00
4,LIFE 70 Years of Extraordinary Photography,Editors of Life,"Hardcover,– 10 Oct 2006",5.0 out of 5 stars,1 customer review,"For seven decades, ""Life"" has been thrilling t...",Photography Textbooks,"Arts, Film & Photography",965.62
...,...,...,...,...,...,...,...,...,...
6232,Humans: A Brief History of How We F*cked It Al...,Tom Phillips,"Paperback,– 8 Aug 2018",5.0 out of 5 stars,2 customer reviews,'F*cking brilliant' Sarah Knight\n'Very funny'...,Anthropology (Books),Humour,322.00
6233,The Chemist,Stephenie Meyer,"Paperback,– 21 Nov 2016",3.3 out of 5 stars,9 customer reviews,"In this gripping page-turner, an ex-agent on t...",Contemporary Fiction (Books),"Crime, Thriller & Mystery",421.00
6234,The Duke And I: Number 1 in series (Bridgerton...,Julia Quinn,"Paperback,– 8 Jun 2006",3.8 out of 5 stars,3 customer reviews,'The most refreshing and radiant love story yo...,Romance (Books),Romance,399.00
6235,Frostfire (Kanin Chronicles),Amanda Hocking,"Paperback,– 15 Jan 2015",3.5 out of 5 stars,4 customer reviews,Frostfire by Amanda Hocking is the stunning fi...,Action & Adventure (Books),Action & Adventure,319.00


In [None]:
print(f"Training data set shape: {df.shape}")
print(f"Testing data set shape: {test_df.shape}")

Training data set shape: (6237, 9)
Testing data set shape: (1560, 8)


# EDA

## Histogram of Target variable

In [None]:
fig = px.histogram(df, x='Price', nbins=30, title='Distribution of Prices')
fig.update_traces(hovertemplate='Count: %{y}')
fig.show()

## Box plot of target variable(check for outliers)

In [None]:
fig = px.box(df, y='Price')
fig.show()

`Price` has many outliers.  
The number of ways to handle outlier  
**Removal**: If you’re certain that the outliers in your dataset are due to errors in data collection or entry, one approach could be to remove these outliers. However, you should be careful with this approach as it involves the loss of data.

**Imputation**: Another method to handle outliers is to replace the outlier value with some central value like the mean or median of the data. This method is useful when you don’t want to lose any data.

**Capping**: In this method, the outliers are capped at a certain value. For example, any value above the 95th percentile can be set to the value at the 95th percentile. Similarly, any value below the 5th percentile can be set to the value at the 5th percentile.

**Binning**: The data is divided into bins or intervals, and each bin gets a score assigned to it. The raw, outlier-containing data is then transformed into bin-related scores.

**Log Transformation**: Applying a logarithmic transformation can reduce the impact of outliers. This is particularly useful for dealing with positive skewness in the data.

**Use Robust Models**: Some machine learning models are less sensitive to outliers than others. For example, tree-based models are not affected by outliers since they try to partition the space into different regions.

Since the evaluation metric is **Root mean squared Logorithmic Error(RMSLE)**, we'll be log transformating the target variable `Price`

## ad-hoc analysis

In [None]:
# Unique number of authors
num_unique_authors = df['Author'].nunique()
print(f"Number of unique authors: {num_unique_authors}")

# Unique number of genres
num_unique_genres = df['Genre'].nunique()
print(f"Number of unique genres: {num_unique_genres}")

num_unique_category = df['BookCategory'].nunique()
print(f"Number of unique category: {num_unique_category}")


Number of unique authors: 3679
Number of unique genres: 345
Number of unique category: 11


In [None]:
# Get counts of each unique author
author_counts = df['Author'].value_counts().reset_index()

# Rename the columns
author_counts.columns = ['Author', 'Count']

# Filter to include only authors that appear more than 10
author_counts_filtered = author_counts[author_counts['Count'] > 10]

# Creating a bar plot
fig = px.bar(author_counts_filtered, x='Author', y='Count', title='Counts of Each Unique Author')
fig.show()

In [None]:
# Creating a bar plot for 'Genre'
fig_genre = px.histogram(df, x='Genre', title='Counts of Unique Genres')
fig_genre.show()

# Creating a bar plot for 'BookCategory'
fig_book_category = px.histogram(df, x='BookCategory', title='Counts of Unique Book Categories')
fig_book_category.show()

`BookCategories` carries the necessary information contained in `Genre`. Hence going to drop it.

In [None]:
booktype_counts = df['BookType'].value_counts()

fig = px.bar(booktype_counts, x=booktype_counts.index, y=booktype_counts.values, labels={'x':'Book Type', 'y':'Count'})
fig.show()

# Data preparation

In [5]:
def rmsle(y_true, y_pred):
  return sqrt(mean_squared_log_error(y_true, y_pred))

rmsle_scorer = make_scorer(rmsle, greater_is_better=False)

Without the use of sklearn Pipeline

In [6]:
# Function to extract float from 'Reviews'
def extract_reviews(X):
    return X['Reviews'].str.extract('(\d+\.\d+)').astype(float)

# Function to extract int from 'Ratings'
def extract_ratings(X):
    return X['Ratings'].str.extract('(\d+)').astype(int).rename(columns={0: 'numOfRatings'})

# Function to extract year from 'Edition'
def extract_year(X):
    # Extract the last 4 characters, assuming they represent the year
    extracted_years = X['Edition'].str[-4:]

    # Convert to numeric, setting errors='coerce' to handle non-numeric strings by converting them to NaN
    extracted_years = pd.to_numeric(extracted_years, errors='coerce')

    # Return the result as a DataFrame to comply with ColumnTransformer expectations
    return extracted_years.to_frame(name='year')


In [7]:
# Custom transformer for frequency encoding
class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.freq_enc = X.value_counts(normalize=True).to_dict()
        return self

    def transform(self, X, y=None):
        return X.map(self.freq_enc).fillna(0).to_frame()

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            raise RuntimeError("input_features must be defined")
        return input_features

# Custom transformer to extract cover type from edition
class CoverTypeExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        if hasattr(X, 'columns'):
          self.feature_names_in_ = X.columns.tolist()
        else:
          self.feature_names_in = None
        return self

    def transform(self, X, y=None):
        return X.apply(lambda x: 'Hardcover' if 'hardcover' in x.lower()
                       else 'Paperback' if 'paperback' in x.lower()
                       else 'Other').to_frame()

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            raise ValueError("input_features must be provided")
        return np.array(['hardcover', 'paperback', 'other'])

In [8]:
# Preprocessing steps for features
try:
    preprocessor = ColumnTransformer(
        transformers=[
            ('freq_enc', FrequencyEncoder(), 'Author'),
            ('cover_type', Pipeline([
                ('extractor', CoverTypeExtractor()),
                ('one_hot', OneHotEncoder())
            ]), 'Edition'),
            ('one_hot_category', OneHotEncoder(), ['BookCategory']),
            ('reviews', FunctionTransformer(extract_reviews), ['Reviews']),
            ('num_of_ratings', FunctionTransformer(extract_ratings), ['Ratings']),

            ('year', FunctionTransformer(extract_year), ['Edition'])
        ],
        remainder='drop'
    )

    # Define the pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', xgb.XGBRegressor(objective='reg:squarederror'))
    ])

    # Preparing data for model training
    X = df.drop(['Price'], axis=1)
    y = df['Price'].apply(np.log1p)

    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
    # y_train = df['Price'].apply(np.log1p)  # Log transformation for the target variable
    # X_train = df
    print(f"Preprocessing completed successfully!!")
except Exception as e:
    print(f"Error in data preprocessing: {e}")

Preprocessing completed successfully!!


#Training

## setting up KFold cross validation in gridSearch/RandomizedSearch

In [9]:
# param_grid = {
#     'model__max_depth': [3, 4, 5],
#     'model__n_estimators': [150, 175, 180],
#     'model__learning_rate': [0.01, 0.007, 0.009],
#     'model__min_child_weight': [0.001, 10],
#     'model__gamma': [0, 5],
#     'model__subsample': [0.5,1.0],
#     'model__colsample_bytree': [0.3, 1.0],
#     'model__colsample_bylevel': [0.3, 1.0],
#     'model__colsample_bynode' : [0.3, 1.0],
#     'model__reg_alpha': [0,1],
#     'model__reg_lambda': [1,5],
#     'model__scale_pos_weight': [1, 10]
# }

param_distributions = {
    'model__max_depth': [3, 4, 5],
    'model__n_estimators': [150, 175, 180],
    'model__learning_rate': [0.01, 0.007, 0.009],
    'model__min_child_weight': [0.001, 10],
    'model__gamma': [0, 5],
    'model__subsample': [0.5,1.0],
    'model__colsample_bytree': [0.3, 1.0],
    'model__colsample_bylevel': [0.3, 1.0],
    'model__colsample_bynode' : [0.3, 1.0],
    'model__reg_alpha': [0,1],
    'model__reg_lambda': [1,5],
    'model__scale_pos_weight': [1, 10]
}

kf = KFold(n_splits=5, shuffle=True, random_state=42)
# GridSearchCV setup
# grid_search = GridSearchCV(
#     pipeline,
#     param_grid,
#     cv=kf,
#     scoring=rmsle_scorer #Custom RMSLE scorer
# )

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,  # Rename param_grid to param_distributions
    cv=kf,
    scoring=rmsle_scorer,
    n_iter=100,  # Example: Specify the number of parameter settings to try
    random_state=42
)


In [10]:
try:
    # grid_search.fit(X_train, y_train)
    random_search.fit(X_train, y_train)
    print("Best parameters found: ", random_search.best_params_)
    print("Best CV score: ", -random_search.best_score_)

 # Predict and evaluate using the best model
    y_pred_best = random_search.predict(X_valid)
    y_pred_best = np.expm1(y_pred_best)  # Revert log transformation
    test_mse = rmsle(np.expm1(y_valid), y_pred_best)
    print("Validation RMSLE: ", test_mse)
except Exception as e:
    print(f"Error in hyperparameter tuning: {e}")

Best parameters found:  {'model__subsample': 1.0, 'model__scale_pos_weight': 1, 'model__reg_lambda': 5, 'model__reg_alpha': 0, 'model__n_estimators': 175, 'model__min_child_weight': 0.001, 'model__max_depth': 5, 'model__learning_rate': 0.01, 'model__gamma': 0, 'model__colsample_bytree': 1.0, 'model__colsample_bynode': 1.0, 'model__colsample_bylevel': 1.0}
Best CV score:  0.09251806986780911
Validation RMSLE:  0.6301893108263518


## Training with best parameters on entire training data

In [13]:
best_params = {key.replace('model__', ''): value for key, value in random_search.best_params_.items()}


In [None]:
# best_params = random_search.best_params_
best_params = {key.replace('model__', ''): value for key, value in random_search.best_params_.items()}
final_model = xgb.XGBRegressor(**best_params)
X_transformed = preprocessor.fit_transform(X)
final_model.fit(X_transformed, y)

## generating submission file

In [None]:
# Load test data and predict
try:
    test_transformed = preprocessor.fit_transform(test_df)
    y_pred = final_model.predict(test_transformed)
    y_pred = np.expm1(y_pred)  # Revert log transformation

    # Create submission file
    submission = pd.DataFrame({'Price': y_pred})
    submission.to_excel('prices.xlsx', index=False)
    print("Submission file created successfully.")
except Exception as e:
    print(f"Error in prediction or file creation: {e}")


Submission file created successfully.


Reaches public score of 0.71.
to do:
* Include all features
* hyper-parameter tuning with Hypot
* Ensemble learning