## **Final Task** - Category prediction based on 'products.csv' data

##### **Author**: Danilo Jelovac
---
##### >>. **Goal**:
- Our goal here is to choose the best-performing model and train it to predict
a product's category from its title with high precision.
This notebook uses preprocessed data to train and evaluate several ML models.
Once we find the best one, we will train it to predict categories.

##### >>. **Requirements** (if not on colab/jupyter lab):
- pandas
- scikit-learn
- jupyter
- ipykernel


##### Step#1 - Importing libraries
---

In [5]:
# ------------------------------------------
# Importing libraries required for this task:
# ------------------------------------------

import pandas as pd
# --
from sklearn.feature_extraction.text import TfidfVectorizer
# --
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# --
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
# --
from sklearn.metrics import classification_report

# --Confirmation message:
print(">. If you see this message - the libraries were imported successfully!\n")

>. If you see this message - the libraries were imported successfully!



##### Step#2 - Loading dataset
---
- This is the dataset we cleaned in 'products_data_analysis.ipynb'; we now use it to train
our models. We removed noisy data that could affect the model, reducing its chance of error.

In [6]:
# ----------------------------------------------
# Loading dataset, printing out samples and data:
# ----------------------------------------------


# ------------------------
FOLDER_NAME = "ml_data"
FILE_NAME = "products_cleaned.csv"
# ------------------------


# --Loading the dataframe:

df = pd.read_csv(f"../{FOLDER_NAME}/{FILE_NAME}")

# --Checking the df...

print("\n>. Dataframe:\n------")
display(df.head())


print("\n-----\n")


>. Dataframe:
------


| Unnamed: 0.1 | Unnamed: 0 | product_title | category_label |
|---|---|---|---|
| 0 | 0 | apple iphone 8 plus 64gb silver | Mobile Phones |
| 1 | 1 | apple iphone 8 plus 64 gb spacegrau | Mobile Phones |
| 2 | 2 | apple mq8n2b/a iphone 8 plus 64gb 5.5 12mp sim... | Mobile Phones |
| 3 | 3 | apple iphone 8 plus 64gb space grey | Mobile Phones |
| 4 | 4 | apple iphone 8 plus gold 5.5 64gb 4g unlocked ... | Mobile Phones |



-----
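
The `Unnamed: 0.1` and `Unnamed: 0` columns in the preview look like index columns left over from earlier `to_csv` calls. That is an assumption, since the cleaning notebook is not shown here. The training code below only reads `product_title` and `category_label`, so the extra columns are harmless, but a minimal sketch of dropping them could look like this (`sample_df` is an illustrative stand-in for the real dataframe):

```python
import pandas as pd

# Illustrative stand-in for the loaded dataframe (not the real file contents).
sample_df = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "product_title": [
        "apple iphone 8 plus 64gb silver",
        "apple iphone 8 plus 64gb space grey",
    ],
    "category_label": ["Mobile Phones", "Mobile Phones"],
})

# Collect every column whose name starts with "Unnamed" and drop it.
leftover_cols = [col for col in sample_df.columns if col.startswith("Unnamed")]
clean_df = sample_df.drop(columns=leftover_cols)

print(clean_df.columns.tolist())  # ['product_title', 'category_label']
```

Saving a dataframe with `to_csv(..., index=False)` avoids creating such columns in the first place.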



##### Step#3 - Testing the data
---
- The next step is to test the data by feeding it into five different models.
This way we can compare them and choose the most accurate one to proceed with
our task.

In [7]:
# ------------------
# Choosing the model:
# ------------------


# --Preparing data:

x_input = df['product_title']    # --data we're giving to model
y_output = df['category_label']    # --output we're expecting

X_train, X_test, y_train, y_test = train_test_split(
    x_input, y_output, test_size=0.2, random_state=42, stratify=y_output
    )    # --we're dividing the samples by 80:20 (train vs test)

# --Creating a vectorizer:

vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# --Preparing models:

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": LinearSVC()
}

# --Testing our models!

for model_name, model in models.items():
    model.fit(X_train_vect, y_train)
    y_pred = model.predict(X_test_vect)
    print(f"\n>>. MODEL -> ['{model_name}'] - Classification Report:")
    print(classification_report(y_test, y_pred))
    

print("\n-----\n")


>>. MODEL -> ['Logistic Regression'] - Classification Report:
                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00       759
 Digital Cameras       1.00      0.99      1.00       532
     Dishwashers       0.97      0.92      0.95       675
 Fridge Freezers       0.94      0.99      0.96      2226
      Microwaves       1.00      0.94      0.97       461
   Mobile Phones       0.99      0.99      0.99       805
             TVs       0.99      0.97      0.98       700
Washing Machines       0.99      0.92      0.96       794

        accuracy                           0.97      6952
       macro avg       0.98      0.97      0.97      6952
    weighted avg       0.97      0.97      0.97      6952


>>. MODEL -> ['Naive Bayes'] - Classification Report:
                  precision    recall  f1-score   support

            CPUs       1.00      1.00      1.00       759
 Digital Cameras       1.00      1.00      1.00       532
     Dis
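
Full classification reports for several models are hard to compare at a glance. One possible way to condense them is to collect a couple of headline metrics per model into a single table. The sketch below is only an illustration: it uses made-up toy titles instead of `products_cleaned.csv`, includes three of the five models to stay short, and the names `candidate_models` and `summary` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data: 10 phone titles and 10 TV titles (illustrative only).
titles = (
    [f"apple iphone {n} 64gb" for n in range(6, 16)]
    + [f"samsung {size} inch smart tv" for size in range(40, 50)]
)
labels = ["Mobile Phones"] * 10 + ["TVs"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the vectorizer on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

candidate_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": LinearSVC(),
}

# One row per model with its accuracy and macro-averaged F1.
rows = []
for model_name, model in candidate_models.items():
    model.fit(X_train_vect, y_train)
    y_pred = model.predict(X_test_vect)
    rows.append({
        "model": model_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro"),
    })

summary = (
    pd.DataFrame(rows)
    .sort_values("macro_f1", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Sorting by macro F1 rather than accuracy is a design choice here: it gives every category equal weight, which matters when one category (such as `Fridge Freezers` in the report above) is much larger than the others.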

##### General report:
---
- Each model has shown very good results!
- We're looking at `precision` - of the items the model assigned to a category, the share that truly belong to it; `recall` - of the items that truly belong to a category, the share the model found;
`f1-score` - the harmonic mean of precision and recall; and `support` - the number of test rows in the given category.
- Each report also includes `accuracy` - the share of all test rows classified correctly; `macro avg` - the unweighted mean of the per-category scores; and `weighted avg` - the mean of the per-category scores weighted by support, so bigger categories have a bigger impact.
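
To make these definitions concrete, here is a small worked example on six hand-made labels (not taken from the dataset). It computes the per-category scores and the two averages with scikit-learn so they can be checked by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_fscore_support,
)

# Six toy items: four true CPUs and two true TVs.
y_true = ["CPUs"] * 4 + ["TVs"] * 2
# The model labels one CPU as a TV and gets everything else right.
y_pred = ["CPUs"] * 3 + ["TVs"] * 3

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["CPUs", "TVs"]
)

# CPUs: 3 predicted, 3 correct -> precision 3/3 = 1.00; 3 of 4 found -> recall 0.75
# TVs:  3 predicted, 2 correct -> precision 2/3 ~ 0.67; 2 of 2 found -> recall 1.00
for name, p, r, f, s in zip(["CPUs", "TVs"], precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")

accuracy = accuracy_score(y_true, y_pred)                    # 5 of 6 correct
macro_f1 = f1_score(y_true, y_pred, average="macro")         # plain mean of the two F1 scores
weighted_f1 = f1_score(y_true, y_pred, average="weighted")   # mean weighted by support (4 and 2)

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

Because `CPUs` has twice the support of `TVs`, its F1 score counts twice as much in the weighted average, which is why the weighted value sits closer to the `CPUs` score than the macro value does.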
---
- Seeing these results, we're choosing `Naive Bayes`, since it has the highest accuracy score (up to 98%). The other models are also very good, but we are opting for the best one. `SVM` (`LinearSVC`) is also near the top, with no big difference in `f1-score` or `accuracy`, and it would also be a safe choice.

- Next, we train and test the `Naive Bayes` model in separate .py scripts, located in `'src_train/model_test.py'` and `'src_train/models_train.py'`.
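
The contents of those scripts are not shown in this notebook. As a rough sketch of what training the chosen model might involve, the vectorizer and classifier can be bundled into a single scikit-learn `Pipeline`, so raw titles go in and categories come out. The toy titles and the name `category_pipeline` are assumptions made for illustration, not the scripts' actual code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only, not rows from the real dataset).
train_titles = (
    [f"apple iphone {n} 64gb" for n in range(6, 16)]
    + [f"samsung {size} inch smart tv" for size in range(40, 50)]
)
train_labels = ["Mobile Phones"] * 10 + ["TVs"] * 10

# The pipeline applies TF-IDF first, then feeds the vectors to Naive Bayes.
category_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
category_pipeline.fit(train_titles, train_labels)

# Predict categories for unseen raw titles; no manual vectorizing step is needed.
new_titles = ["apple iphone 11 128gb", "samsung 55 inch smart tv"]
predictions = category_pipeline.predict(new_titles)
for title, category in zip(new_titles, predictions):
    print(f"{title!r} -> {category}")
```

Keeping both steps in one object means the same fitted vocabulary is always used at prediction time, which removes the risk of pairing a model with a mismatched vectorizer.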