# KDD Methodology Steps

In [None]:
!pip install pycaret



**Step 1: Data Selection**

In this phase, we'll:

1. Load the dataset.
2. Provide a basic overview by displaying the first few rows.
3. Check the data types of each column.
4. Identify any null values.

In [None]:
import pandas as pd

# Load the dataset
data = pd.read_csv('Amazon_popular_books_dataset.csv')

data.head()

Unnamed: 0,asin,ISBN10,answered_questions,availability,brand,currency,date_first_available,delivery,department,description,...,upc,url,video,video_count,categories,best_sellers_rank,buybox_seller,image,number_of_sellers,colors
0,7350813,7350813,0,In Stock.,Emily Brontë,USD,,"[""FREE delivery Tuesday, December 28 if you sp...",,,...,,https://www.amazon.com/dp/0007350813,,0,"[""Books"",""Literature & Fiction"",""Genre Fiction""]","[{""category"":""Books / Literature & Fiction / H...",,,,
1,7513763,9780007513765,0,In Stock.,Drew Daywalt,USD,,"[""FREE delivery Tuesday, December 28 if you sp...",,,...,,https://www.amazon.com/dp/0007513763,,0,"[""Books"",""Children's Books"",""Literature & Fict...","[{""category"":""Books / Children's Books / Liter...",VMG Books & Media,,,
2,8183988,8183988,0,,Bernard Cornwell,USD,,"[""FREE delivery January 4 - 10 if you spend $2...",,,...,,https://www.amazon.com/dp/0008183988,,0,"[""Books"",""Literature & Fiction"",""Genre Fiction""]","[{""category"":""Books / Literature & Fiction / H...",Reuseaworld,,,
3,8305838,8305838,0,In Stock.,David Walliams,USD,,"[""FREE delivery Tuesday, December 28 if you sp...",,,...,,https://www.amazon.com/dp/0008305838,,0,"[""Books"",""Children's Books"",""Literature & Fict...","[{""category"":""Books / Children's Books / Liter...",Bahamut Media,,,
4,8375526,8375526,0,In Stock.,Caroline Hirons,USD,,"[""FREE delivery Tuesday, December 28"",""Or fast...",,,...,,https://www.amazon.com/dp/0008375526,,0,"[""Books"",""Crafts, Hobbies & Home"",""Home Improv...","[{""category"":""Books / Health, Fitness & Dietin...",KathrynAshleyGallery,,,
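The cell above covers loading and previewing the data. The remaining two items in the list, checking data types and identifying null values, can be sketched as follows. The small DataFrame is a hypothetical stand-in that borrows a few column names from the preview so the snippet runs without the CSV; on the real data the same methods would be called on `data`.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: three columns named after those in
# the preview above, with illustrative rows so the snippet runs without the CSV.
sample = pd.DataFrame({
    "asin": ["0007350813", "0007513763", "0008183988"],
    "availability": ["In Stock.", "In Stock.", None],
    "video_count": [0, 0, 0],
})

# Data type of each column
dtypes = sample.dtypes
print(dtypes)

# Number of null values per column, largest first
null_counts = sample.isnull().sum().sort_values(ascending=False)
print(null_counts)
```

Sorting the null counts puts the most incomplete columns first, which helps when a dataset has dozens of columns.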


**Step 2: Data Preprocessing**

In this phase, we'll:

1. Handle missing values.
2. Identify outliers.
3. Explore basic statistics.

In [None]:
from pycaret.clustering import *

# Clustering is unsupervised, so setup() takes no target column
setup = setup(data=data, session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Original data shape,"(2269, 40)"
2,Transformed data shape,"(2269, 21058)"
3,Ordinal features,1
4,Numeric features,13
5,Categorical features,27
6,Rows with missing values,100.0%
7,Preprocess,True
8,Imputation type,simple
9,Numeric imputation,mean
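
PyCaret's setup handles imputation internally; the summary above reports simple imputation with the mean for numeric features. The three steps listed for this phase can also be inspected by hand with pandas. Below is a minimal sketch on a hypothetical price column. The 1.5 × IQR rule used for outliers is an assumption chosen for illustration, not something the notebook or PyCaret applies here.

```python
import pandas as pd

# Hypothetical numeric column standing in for e.g. final_price; values are made up.
prices = pd.Series([9.99, 12.50, None, 11.25, 10.75, 250.00], name="final_price")

# 1. Handle missing values: fill with the mean, mirroring "Numeric imputation: mean"
filled = prices.fillna(prices.mean())

# 2. Identify outliers with the 1.5 * IQR rule (an assumption, not PyCaret's method)
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = filled[(filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)]

# 3. Explore basic statistics
print(filled.describe())
print(outliers)
```

Note that mean imputation is itself pulled upward by the extreme value, which is one reason to look for outliers before choosing an imputation strategy.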


**Step 4: Data Transformation**

PyCaret's setup function also provides options for transformation, normalization, and feature scaling. For example, to normalize the features, pass normalize=True to setup.
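
To make the effect of normalization concrete, here is a sketch of z-score scaling in plain pandas. To my knowledge z-score is the method PyCaret applies by default when normalize=True, but treat that as an assumption and check the setup documentation for your version. The column values are made up.

```python
import pandas as pd

# Hypothetical numeric features; values are illustrative only.
features = pd.DataFrame({
    "reviews_count": [10.0, 200.0, 3500.0, 45.0],
    "final_price": [9.99, 12.50, 11.25, 10.75],
})

# Z-score scaling: subtract each column's mean and divide by its standard deviation
# (ddof=0 uses the population standard deviation, as scikit-learn's StandardScaler does).
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so a distance-based method such as k-means is not dominated by the column with the largest raw range.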

**Step 5: Data Mining**

PyCaret offers modules for classification, regression, clustering, and other tasks. For supervised tasks you can compare several models at once; because this notebook uses the clustering module, we create one specific model, k-means, and inspect its metrics.

In [None]:
# Create a specific model, e.g., k-means, on the experiment initialized by setup()
kmeans_model = create_model('kmeans')

Unnamed: 0,Silhouette,Calinski-Harabasz,Davies-Bouldin,Homogeneity,Rand Index,Completeness
0,0.8285,5050.3378,0.52,0,0,0



**Step 6: Evaluation**

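For clustering models, PyCaret's plot_model provides diagnostic plots such as a 2D cluster plot, an elbow plot, and a silhouette plot. SHAP values and feature importance plots apply to supervised models and are not used here.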

In [None]:
# Plot the clusters (with no plot argument, plot_model draws its default cluster plot)
plot_model(kmeans_model)
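
The Silhouette and Davies-Bouldin figures in the metrics table can also be computed directly. The sketch below uses scikit-learn on synthetic blobs rather than the book data, so the dataset, the number of clusters, and the seed are all assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic, well-separated data standing in for the real features (an assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=123)

# Fit k-means with 4 clusters and get a cluster label for every row
kmeans = KMeans(n_clusters=4, n_init=10, random_state=123)
labels = kmeans.fit_predict(X)

# Silhouette ranges from -1 to 1 (higher is better);
# Davies-Bouldin is >= 0 (lower is better).
sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X, labels)
print(f"Silhouette: {sil:.4f}, Davies-Bouldin: {dbi:.4f}")
```

Both metrics need only the features and the predicted labels. Homogeneity, Rand Index, and Completeness compare predictions against known labels, which is consistent with those columns showing 0 in the table above, where no ground truth was supplied.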

**Step 7: Knowledge**

Once satisfied with the model's performance, you can save it, together with its preprocessing pipeline, for later reuse or deployment.

In [None]:
# Save the preprocessing pipeline and the k-means model to disk
save_model(kmeans_model, 'final_kmeans_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['answered_questions', 'department',
                                              'discount', 'final_price',
                                              'images_count', 'initial_price',
                                              'plus_content', 'reviews_count',
                                              'root_bs_rank', 'upc', 'video',
                                              'video_count',
                                              'number_of_sellers'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fi...
                                                                     'title',
                                             