<a href="https://www.kaggle.com/code/yunasheng/amazon-book-viz?scriptVersionId=165242638" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Import needed libraries**

`pandas`: Used for data manipulation and analysis. It provides data structures like DataFrame and Series, which allow you to easily handle structured data, perform operations like filtering, grouping, and joining, and read/write data from/to various file formats such as CSV, Excel, and SQL databases.

`numpy`: Stands for Numerical Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays efficiently. Numpy is widely used for numerical computations in fields like machine learning, scientific computing, and data analysis.

`matplotlib.pyplot`: Matplotlib's MATLAB-like plotting interface. Matplotlib creates static, interactive, and animated visualizations in Python, including line plots, scatter plots, bar plots, and histograms, which help you visualize data and explore relationships between variables.

`seaborn`: Built on top of matplotlib, seaborn provides a high-level interface for creating attractive and informative statistical graphics. It simplifies the process of creating complex visualizations like categorical plots, distribution plots, regression plots, and heatmaps. Seaborn also integrates well with pandas DataFrame objects.

`warnings`: Python's built-in module for issuing and filtering warnings about potential problems or deprecated features. In this notebook, `warnings.filterwarnings("ignore")` suppresses all warning messages from that point on. This keeps the output uncluttered, but it also hides deprecation notices that may matter when library versions change.
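
A blanket `warnings.filterwarnings("ignore")` silences everything. As a minimal sketch of a narrower alternative, the standard-library `warnings.catch_warnings()` context manager limits suppression to a single block. The function `noisy_step` is a hypothetical stand-in for any call that emits a warning; it is not part of the notebook.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress warnings only inside this block; the previous filters are
# restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_step()

# Outside that block the warning is raised as usual; record it to show that.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_step()

print(result, len(caught))
```

Scoping the filter this way keeps unrelated warnings visible elsewhere in the notebook.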

`wordcloud`: A library for creating word clouds from text data. Word clouds visually represent the frequency of words in a text corpus, where the size of each word corresponds to its frequency. It's often used in text analysis and visualization to identify prominent words or topics within a body of text.

`sklearn.preprocessing`: Part of scikit-learn (sklearn), this submodule provides tools for preprocessing data before feeding it into machine learning models, such as scaling features and encoding categorical variables. The `StandardScaler` class imported here standardizes features to zero mean and unit variance. (Imputation of missing values lives in the separate `sklearn.impute` submodule.)
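
For intuition, standardization as performed by `StandardScaler` with its default settings subtracts each column's mean and divides by that column's population standard deviation. The following is a minimal NumPy sketch of that formula, not the scikit-learn implementation; the array `X` holds a handful of illustrative price and rating values.

```python
import numpy as np

# Rows are books; columns are price and rating (illustrative values).
X = np.array([
    [169.00, 4.4],
    [99.00, 4.5],
    [175.75, 4.6],
    [389.00, 4.6],
])

# z = (x - mean) / std, computed per column with the population
# standard deviation (ddof=0), which is NumPy's default for std().
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0).round(6))  # each column is centered near 0
print(X_scaled.std(axis=0).round(6))   # each column has unit spread
```

After scaling, price and rating are on comparable scales, which matters for distance-based methods such as K-means.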

`sklearn.cluster`: Another submodule of scikit-learn, it provides implementations of various clustering algorithms for unsupervised learning tasks. The KMeans class imported from this submodule is used to perform K-means clustering, a popular method for partitioning data into clusters based on similarity.
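
As a rough illustration of how `KMeans` is typically called, the sketch below clusters six synthetic 2-D points. The points, the choice of `n_clusters=2`, and the `random_state` value are assumptions made for the example and do not come from the books dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 2-D points (not from the dataset).
points = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# n_clusters is chosen by hand here; random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)

print(labels)                  # one cluster id per point
print(model.cluster_centers_)  # one centroid per cluster
```

The cluster ids themselves are arbitrary; only the grouping of points is meaningful.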

Each of these libraries serves a specific purpose in data analysis, visualization, and machine learning workflows, and they are often used together to perform end-to-end data analysis tasks.

<div style="text-align: center"><img src="https://lh5.googleusercontent.com/proxy/TNKDH3dV1GAMVP6aMeWC7HFjhAYdkiFCEafVtn2qcE2RSpMd7vO3eY75rPDgWGSQ4bRfNKAQL-H9Y7H85aptikS4uQIPLBqZQBWL-pyGeUwvsJDXZpPaxP68WazXFs97iU9t8m2Ib6xXDnsa20D8sXwWlj0" width="100%" height="100%" alt="Retrieve&Re-Rank pipeline"></div>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from wordcloud import WordCloud, STOPWORDS
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
# Load the dataset
file_path = '/kaggle/input/amazon-books-dataset/Amazon_Books_Scraping/Books_df.csv'
books_df = pd.read_csv(file_path)

# Display the first 5 rows of the dataframe
books_df.head()

,Unnamed: 0,Title,Author,Main Genre,Sub Genre,Type,Price,Rating,No. of People rated,URLs
0,0,The Complete Novel of Sherlock Holmes,Arthur Conan Doyle,"Arts, Film & Photography",Cinema & Broadcast,Paperback,₹169.00,4.4,19923.0,https://www.amazon.in/Complete-Novels-Sherlock...
1,1,Black Holes (L) : The Reith Lectures [Paperbac...,Stephen Hawking,"Arts, Film & Photography",Cinema & Broadcast,Paperback,₹99.00,4.5,7686.0,https://www.amazon.in/Black-Holes-Lectures-Ste...
2,2,The Kite Runner,Khaled Hosseini,"Arts, Film & Photography",Cinema & Broadcast,Kindle Edition,₹175.75,4.6,50016.0,https://www.amazon.in/Kite-Runner-Khaled-Hosse...
3,3,Greenlights: Raucous stories and outlaw wisdom...,Matthew McConaughey,"Arts, Film & Photography",Cinema & Broadcast,Paperback,₹389.00,4.6,32040.0,https://www.amazon.in/Greenlights-Raucous-stor...
4,4,The Science of Storytelling: Why Stories Make ...,Will Storr,"Arts, Film & Photography",Cinema & Broadcast,Paperback,₹348.16,4.5,1707.0,https://www.amazon.in/Science-Storytelling-Wil...
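
An `Unnamed: 0` column like the one in this preview usually appears when a CSV was saved together with its row index. As a hedged sketch, passing `index_col=0` to `pd.read_csv` reads that first field as the index instead, in which case a later `drop` of the column is no longer needed. The in-memory `csv_text` below is a tiny stand-in for the real file, with three of its columns.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for Books_df.csv; the leading empty header
# field is what pandas names 'Unnamed: 0' when it is read as data.
csv_text = (
    ",Title,Author,Price\n"
    "0,The Complete Novel of Sherlock Holmes,Arthur Conan Doyle,₹169.00\n"
    "1,The Kite Runner,Khaled Hosseini,₹175.75\n"
)

as_column = pd.read_csv(io.StringIO(csv_text))
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(as_column.columns))  # includes 'Unnamed: 0'
print(list(as_index.columns))   # only the real columns
```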


In [3]:
# Remove the 'Unnamed: 0' column
books_df.drop(columns=['Unnamed: 0'], inplace=True)

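# Remove the currency symbol and thousands separators from 'Price', then convert to float
# (regex=False treats both patterns as literal text)
books_df['Price'] = books_df['Price'].str.replace('₹', '', regex=False).str.replace(',', '', regex=False).astype(float)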

# Check for missing values
missing_values = books_df.isnull().sum()

# Check datatypes of all columns after initial cleanup
dtypes_after_cleanup = books_df.dtypes

missing_values, dtypes_after_cleanup

(Title                   0
 Author                 21
 Main Genre              0
 Sub Genre               0
 Type                    0
 Price                   0
 Rating                  0
 No. of People rated     0
 URLs                    0
 dtype: int64,
 Title                   object
 Author                  object
 Main Genre              object
 Sub Genre               object
 Type                    object
 Price                  float64
 Rating                 float64
 No. of People rated    float64
 URLs                    object
 dtype: object)
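
The `astype(float)` conversion above raises an error if any price string is not numeric after cleaning. A more defensive variant, sketched here on sample strings, uses `pd.to_numeric` with `errors="coerce"` so that unparseable values become `NaN` and can be counted. The first two strings follow the format seen in the preview; the last two are hypothetical edge cases.

```python
import pandas as pd

# Sample strings: the first two follow the format in the dataset preview;
# the last two are hypothetical edge cases.
raw = pd.Series(["₹169.00", "₹348.16", "₹1,299.00", "N/A"])

# Strip the rupee sign and thousands separators as literal text,
# then let to_numeric turn anything unparseable into NaN.
cleaned = raw.str.replace("₹", "", regex=False).str.replace(",", "", regex=False)
prices = pd.to_numeric(cleaned, errors="coerce")

print(prices.tolist())
print(int(prices.isna().sum()))  # number of values that failed to parse
```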

In [4]:
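# Fill missing values in the 'Author' column with "Unknown"
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
books_df['Author'] = books_df['Author'].fillna('Unknown')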

# Brief analysis of numerical columns
numerical_analysis = books_df[['Price', 'Rating', 'No. of People rated']].describe()

numerical_analysis

,Price,Rating,No. of People rated
count,7928.0,7928.0,7928.0
mean,492.733737,4.260797,6479.312941
std,945.900146,0.910659,22082.884343
min,0.01,0.0,0.0
25%,194.0,4.3,63.0
50%,317.65,4.5,499.0
75%,464.34,4.6,2905.25
max,35829.0,5.0,500119.0
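
The summary suggests that `Price` is right-skewed: the mean (about 492.73) sits well above the median (317.65), and the maximum (35829.0) is far above the third quartile. One common way to put a number on "far above" is Tukey's 1.5 × IQR rule of thumb, which is an assumption introduced here and not something the notebook applies. The sketch reuses the quartiles printed by `describe()`.

```python
# Quartiles and maximum for 'Price', copied from the describe() output above.
q1, q3 = 194.0, 464.34
price_max = 35829.0

# Interquartile range and the conventional upper fence (Q3 + 1.5 * IQR).
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(round(iqr, 2), round(upper_fence, 2))
print(price_max > upper_fence)  # True means the maximum lies beyond the fence
```

Under this rule, any price above roughly 870 would be flagged as a potential outlier worth inspecting before clustering.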
