# Search Keywords Classification

**This is a task I worked on during my internship.** I used a `text-classification` technique for it but did not get
the work done (it was my first time working with BERT).

Now that the internship is over, I want to work on this again with a different, more efficient approach.
Previously, I fine-tuned a BERT model to classify keywords into the labels BRAND, LOCATION, COMPETITION and COMPARISON,
but that does not meet our needs: we also have to assign a score to each keyword so the keywords can be grouped into a
better-to-good hierarchy.

I think BERT is a heavy way to achieve this. Instead, we can use a **hybrid approach of rule-based and machine learning
techniques** to classify the keywords and assign a score to each of them.


---


### Hybrid Approach: Rule-Based + Machine Learning

> 🤖 Suggestion from ChatGPT

You can combine **rule-based methods** with **machine learning** for keyword classification. This can work especially
well in tasks like **location recognition**, where places can often be identified by specific keywords or patterns.

1. **Rule-Based Methods**

Use regular expressions (e.g., patterns for brands like `"Apple"`, `"Samsung"`) or dictionaries of location names to
automatically tag certain entities. Combine such rule-based classification with machine learning methods to reduce the
complexity of handling rare or ambiguous entities.

2. **Conditional Random Fields (CRF)**[^1]

For token-level classification, CRF models are another technique used in NER tasks. They can take into account the
context of neighboring tokens to improve classification. You can use a **CRF layer** on top of BERT or other
transformers.

[^1]: I don't understand this technique yet.
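
To make "context of neighboring tokens" concrete, here is a minimal sketch of the feature-extraction step that usually comes before a CRF. It does not train a model. The function names and feature names (`token_features`, `prev_lower`, ...) are hypothetical, and the one-dict-per-token layout is only assumed to be what a CRF library such as sklearn-crfsuite would expect, so check its documentation before relying on it.

```python
def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Describe token i together with its immediate neighbours."""
    token = tokens[i]
    features: dict[str, object] = {
        "lower": token.lower(),
        "is_digit": token.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighbouring tokens are the "context" a CRF can learn from.
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


def keyword_features(keyword: str) -> list[dict[str, object]]:
    """Split a keyword on whitespace and build one feature dict per token."""
    tokens = keyword.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


features = keyword_features("iphone 13 in patna")
print(features[3])  # features of "patna", including the previous token "in"
```

With features like these, a model could learn that a token preceded by `"in"` is often a location, even when that token is missing from the location dictionary.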


---


_Currently I don't know how to integrate ML here,_ but I am sure this approach will work and be faster than the previous
one because:

1. I will ask the user to input the **brand, location, competition** keywords they are interested in.

   ```json
   {
     "brand": ["apple", "iphone", "macbook", "airpod"],
     "location": ["patna", "bihar"],
     "competition": ["samsung", "redmi", "vivo", "galaxy 23", "samsung galaxy"]
   }
   ```

   **Using these I can parse the keywords and assign the labels.**

   ```json
   {
     "buy iphone 13": ["BRAND"],
     "iphone 13 in patna": ["BRAND", "LOCATION"],
     "price of samsung galaxy 21 FE": ["COMPETITION"]
   }
   ```

2. Using the same method, I can also tag each keyword's **(PURCHASE, NAVIGATIONAL, COMPARISON, INFORMATIONAL)** intent.
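
The labelling idea in step 1 can be sketched in plain Python before moving to a dataframe. This is only an illustration: `assign_labels` is a hypothetical helper, and it uses case-insensitive substring matching, which is an assumption about how the user's terms should be compared with the keyword.

```python
input_kwds = {
    "brand": ["apple", "iphone", "macbook", "airpod"],
    "location": ["patna", "bihar"],
    "competition": ["samsung", "redmi", "vivo", "galaxy 23", "samsung galaxy"],
}


def assign_labels(keyword: str, terms_by_label: dict[str, list[str]]) -> list[str]:
    """Return the upper-cased labels whose terms occur in the keyword."""
    text = keyword.lower()
    return [
        label.upper()
        for label, terms in terms_by_label.items()
        if any(term in text for term in terms)
    ]


for kw in ["buy iphone 13", "iphone 13 in patna", "price of samsung galaxy 21 FE"]:
    print(kw, "->", assign_labels(kw, input_kwds))
```

Substring matching is simple, but a short term can also match inside a longer, unrelated word, so word-boundary matching may be worth considering for short terms.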


In [1]:
import polars as pl
from polars import selectors as cs

In [2]:
pl.Config.set_fmt_str_lengths(50)

polars.config.Config

In [3]:
__fav_kwds = ["iphone", "redmi", "samsung", "galaxy", "vivo"]
kwds = (
    pl.scan_csv("data/samsung_galaxy.csv")
    .filter(
        pl.col("keyword").str.contains_any(__fav_kwds, ascii_case_insensitive=True),
    )
    .filter(pl.col("keyword").is_first_distinct())
)
kwds.head().collect()

keyword
str
"""samsung m & a series"""
"""samsung smartplaza bharath electronics & appliance…"
"""vivo camera & music"""
"""best samsung's series"""
"""iphone 13 0"""


### Add category based columns


In [4]:
input_kwds = {
    "brand": ["apple", "iphone", "macbook", "airpod"],
    "location": ["patna", "bihar"],
    "competition": ["samsung", "redmi", "vivo", "galaxy 23", "samsung galaxy"],
}

In [5]:
kwds = kwds.with_columns(
    *[
        pl.col("keyword")
        .str.contains_any(v, ascii_case_insensitive=True)
        .alias(f"cat_{k}")
        for k, v in input_kwds.items()
    ],
)
kwds.filter(
    pl.col("cat_brand"),
    pl.col("cat_competition"),
).collect()

keyword,cat_brand,cat_location,cat_competition
str,bool,bool,bool
"""samsung with 3 camera like iphone""",true,false,true
"""samsung galaxy watch 4 compatible with iphone""",true,false,true
"""samsung galaxy watch 4 iphone""",true,false,true
"""samsung watch 4 iphone""",true,false,true
"""samsung airpods""",true,false,true
…,…,…,…
"""samsung smart switch iphone""",true,false,true
"""samsung to iphone""",true,false,true
"""iphone to samsung tv""",true,false,true
"""samsung watch iphone""",true,false,true


### Add intent based columns


In [6]:
# these are some keywords which are common in respective intent
intent_kwds = {
    "transactional": ["price", "buy", "under", "flipkart", "amazon"],
    "comparison": [" vs ", "&", "versus", " and "],
    "navigational": ["near"],
    "informational": ["antutu", "camera", "best"],
}

In [7]:
kwds = kwds.with_columns(
    *[
        pl.col("keyword")
        .str.contains_any(v, ascii_case_insensitive=True)
        .alias(f"int_{k}")
        for k, v in intent_kwds.items()
    ],
).with_columns(
    # Represent keywords which have no intent
    pl.any_horizontal(cs.starts_with("int_")).not_().alias("no_intent"),
)

kwds.filter(pl.col("no_intent")).collect().sample(20)

keyword,cat_brand,cat_location,cat_competition,int_transactional,int_comparison,int_navigational,int_informational,no_intent
str,bool,bool,bool,bool,bool,bool,bool,bool
"""samsung galaxy a05""",false,false,true,false,false,false,false,true
"""galaxy s7""",false,false,false,false,false,false,false,true
"""samsung galaxy 5g mobile""",false,false,true,false,false,false,false,true
"""samsung 22""",false,false,true,false,false,false,false,true
"""iphone 12mini""",true,false,false,false,false,false,false,true
…,…,…,…,…,…,…,…,…
"""vivo 8 pro""",false,false,true,false,false,false,false,true
"""samsung c7 pro""",false,false,true,false,false,false,false,true
"""samsung sabse mehnga phone""",false,false,true,false,false,false,false,true
"""samsung s 5g""",false,false,true,false,false,false,false,true


### Assign score _(manual approach)_

**Score Types**

1. **Counter Score:** Add 1 point for each category or intent column of the keyword that is `true`.
2. **Proportional Score:** Use a pre-defined points dictionary to add a column's points to the score whenever that
   category or intent column is `true`.
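
As a worked example, both score types can be computed by hand for a single keyword. The flags below are the ones `"vivo camera & music"` gets from the category and intent terms used in this notebook, and the point values are the ones from the notebook's points dictionary. The helper names `counter_score` and `proportional_score` are hypothetical.

```python
# Flags for "vivo camera & music": "vivo" is a competition term,
# "&" is a comparison term and "camera" is an informational term.
flags = {
    "cat_brand": False,
    "cat_location": False,
    "cat_competition": True,
    "int_transactional": False,
    "int_comparison": True,
    "int_navigational": False,
    "int_informational": True,
}

points = {
    "cat_brand": 7,
    "cat_location": 6,
    "cat_competition": 5,
    "int_transactional": 4,
    "int_navigational": 3,
    "int_comparison": 2,
    "int_informational": 1,
}


def counter_score(flags: dict[str, bool], prefix: str) -> int:
    """Count the true flags whose column name starts with the prefix."""
    return sum(1 for name, value in flags.items() if value and name.startswith(prefix))


def proportional_score(flags: dict[str, bool], points: dict[str, int], prefix: str) -> int:
    """Sum the points of the true flags whose column name starts with the prefix."""
    return sum(
        points[name] for name, value in flags.items() if value and name.startswith(prefix)
    )


print(counter_score(flags, "cat_"), counter_score(flags, "int_"))
print(proportional_score(flags, points, "cat_"), proportional_score(flags, points, "int_"))
```

The counter score treats every match equally, while the proportional score lets a competition match (5 points) weigh more than an informational match (1 point).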


In [8]:
# Assign Counter Score
counter_scores = kwds.select(
    "keyword",
    cat_score_counter=pl.sum_horizontal(cs.starts_with("cat_")),
    int_score_counter=pl.sum_horizontal(cs.starts_with("int_")),
)
counter_scores.head().collect()

keyword,cat_score_counter,int_score_counter
str,u32,u32
"""samsung m & a series""",1,1
"""samsung smartplaza bharath electronics & appliance…",1,1
"""vivo camera & music""",1,2
"""best samsung's series""",1,1
"""iphone 13 0""",1,0


In [9]:
print(counter_scores.collect().get_column("cat_score_counter").value_counts())
print(counter_scores.collect().get_column("int_score_counter").value_counts())

shape: (3, 2)
┌───────────────────┬───────┐
│ cat_score_counter ┆ count │
│ ---               ┆ ---   │
│ u32               ┆ u32   │
╞═══════════════════╪═══════╡
│ 0                 ┆ 886   │
│ 1                 ┆ 10972 │
│ 2                 ┆ 15    │
└───────────────────┴───────┘
shape: (4, 2)
┌───────────────────┬───────┐
│ int_score_counter ┆ count │
│ ---               ┆ ---   │
│ u32               ┆ u32   │
╞═══════════════════╪═══════╡
│ 0                 ┆ 8019  │
│ 1                 ┆ 3709  │
│ 2                 ┆ 144   │
│ 3                 ┆ 1     │
└───────────────────┴───────┘


In [10]:
# Proportional Score dictionary
proportional_score_dict = {
    "cat_brand": 7,
    "cat_location": 6,
    "cat_competition": 5,
    "int_transactional": 4,
    "int_navigational": 3,
    "int_comparison": 2,
    "int_informational": 1,
}

In [11]:
proportional_scores = kwds.with_columns(
    *[
        pl.col(k).replace_strict(True, v, default=0, return_dtype=pl.UInt8)
        for k, v in proportional_score_dict.items()
    ],
).select(
    "keyword",
    int_score_proportional=pl.sum_horizontal(cs.starts_with("int_")),
    cat_score_proportional=pl.sum_horizontal(cs.starts_with("cat_")),
)
proportional_scores.head().collect()

keyword,int_score_proportional,cat_score_proportional
str,u8,u8
"""samsung m & a series""",2,5
"""samsung smartplaza bharath electronics & appliance…",2,5
"""vivo camera & music""",3,5
"""best samsung's series""",1,5
"""iphone 13 0""",0,7


In [12]:
# Merge both scores dataframe
scores = counter_scores.join(proportional_scores, on="keyword")
scores.head().collect()

keyword,cat_score_counter,int_score_counter,int_score_proportional,cat_score_proportional
str,u32,u32,u8,u8
"""samsung m & a series""",1,1,2,5
"""samsung smartplaza bharath electronics & appliance…",1,1,2,5
"""vivo camera & music""",1,2,3,5
"""best samsung's series""",1,1,1,5
"""iphone 13 0""",1,0,0,7


## Assign a final score to each keyword

1. Sum the min-max scaled scores, so that score metrics with different ranges are put on a common 0 to 1 scale first.
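
Min-max scaling maps a value to `(value - min) / (max - min)`. The sketch below checks the arithmetic for one keyword, `"samsung m & a series"`. The raw scores come from the notebook outputs; the column maxima (2, 3, 7 and 12) are assumptions inferred from those outputs and are not computed here. `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(value: float, minimum: float, maximum: float) -> float:
    """Scale a value to the range 0..1; return 0.0 for a constant column."""
    if maximum == minimum:
        return 0.0
    return (value - minimum) / (maximum - minimum)


# Raw scores of "samsung m & a series".
raw = {
    "cat_score_counter": 1,
    "int_score_counter": 1,
    "int_score_proportional": 2,
    "cat_score_proportional": 5,
}

# Assumed column maxima (inferred from the notebook outputs); every minimum is 0.
col_max = {
    "cat_score_counter": 2,
    "int_score_counter": 3,
    "int_score_proportional": 7,
    "cat_score_proportional": 12,
}

scaled = {name: min_max_scale(raw[name], 0, col_max[name]) for name in raw}
final_score = sum(scaled.values())
print(round(final_score, 6))  # 1/2 + 1/3 + 2/7 + 5/12
```

Because every scaled score lies between 0 and 1, the final score of a keyword can be at most 4 with these four metrics.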


In [13]:
def min_max_normalization(expr: pl.Expr, col: str):
    # The minimum is 0 for every score column, so this reduces to value / max
    return expr.sub(0).truediv(pl.max(col).sub(0))


scores = scores.with_columns(
    [
        pl.col(col).pipe(min_max_normalization, col)
        for col in scores.select(cs.matches("*_score_*")).collect_schema().names()
    ],
).with_columns(final_score=pl.sum_horizontal(cs.matches("*_score_*")))
scores.head().collect()

keyword,cat_score_counter,int_score_counter,int_score_proportional,cat_score_proportional,final_score
str,f64,f64,f64,f64,f64
"""samsung m & a series""",0.5,0.333333,0.285714,0.416667,1.535714
"""samsung smartplaza bharath electronics & appliance…",0.5,0.333333,0.285714,0.416667,1.535714
"""vivo camera & music""",0.5,0.666667,0.428571,0.416667,2.011905
"""best samsung's series""",0.5,0.333333,0.142857,0.416667,1.392857
"""iphone 13 0""",0.5,0.0,0.0,0.583333,1.083333


In [14]:
scores.select("final_score").describe()

statistic,final_score
str,f64
"""count""",11873.0
"""null_count""",0.0
"""mean""",1.13796
"""std""",0.498753
"""min""",0.0
"""25%""",0.916667
"""50%""",0.916667
"""75%""",1.821429
"""max""",2.916667


In [15]:
scores.filter(pl.col("final_score").eq(0)).select(
    pl.col("keyword").implode(),
).collect().head(30)["keyword"].item().to_list()

['galaxy a 03',
 'galaxy 03',
 'galaxy 03s',
 'galaxy fold 1',
 'galaxy z fold 1',
 'galaxy note 1',
 'galaxy watch 1',
 'galaxy note 10 5g',
 'galaxy note 10 plus 5g',
 'galaxy s 10 5g',
 'galaxy a 10 s',
 'galaxy s 10 e',
 'galaxy 10',
 'galaxy j 10',
 'galaxy note 10 lite',
 'galaxy note 10',
 'galaxy note 10 plus',
 'galaxy note 10 pro',
 'galaxy note 10 ultra',
 'galaxy s 10 plus',
 'galaxy tab a 10.1',
 'galaxy note 10.1',
 'galaxy 11',
 'galaxy note 11',
 'galaxy a 12',
 'galaxy 12',
 'galaxy j 12',
 'galaxy note 12',
 'galaxy s21 fe 5g 128 gb',
 'galaxy s20 fe 5g 128gb',
 'galaxy s22 5g 128gb',
 'galaxy a12 128gb',
 'galaxy a23 128gb',
 'galaxy a53 128gb',
 'galaxy s22 plus 128gb',
 'galaxy s22 128gb',
 'galaxy s22 ultra 128gb',
 'galaxy 13',
 'galaxy 14 phone',
 'galaxy book 2 360',
 'galaxy book 2 pro 360',
 'galaxy ace 2',
 'galaxy active 2',
 'galaxy watch active 2',
 'galaxy beam 2',
 'galaxy book 2',
 'galaxy book 2 pro',
 'galaxy buds 2',
 'galaxy buds 2 pro',
 'galaxy c

## Additional filtering on keywords

We can apply additional filtering while deciding good keywords like:

1. If a keyword has only `cat_brand=true` and `no_intent=true`, it looks like `"iphone 12"`/`"galaxy 21 FE"`: there is
   no intent at all, the user is just searching for a product on the internet. _It seems we would not lose much by
   ignoring these keywords._
2. The same is **true for competition-only keywords**.

This filtering can lead to the _removal of more than 30% of keywords_, which is why we should ask **a Domain Expert**
before applying it.
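
To see how such a filter decides, and how the removed share could be measured, here is a small sketch in plain Python on four toy rows, so it runs without the dataset. The helper `only_category_no_intent` is hypothetical, the flags for `"samsung to iphone"` are partly inferred from the term lists, and the share printed for these toy rows says nothing about the real data.

```python
def only_category_no_intent(row: dict[str, bool], category: str) -> bool:
    """True when `category` is the only category set and no intent matched."""
    other_categories = [
        name for name in row if name.startswith("cat_") and name != category
    ]
    return (
        row[category]
        and row["no_intent"]
        and not any(row[name] for name in other_categories)
    )


# Toy rows for illustration only.
rows = {
    "iphone 13 0": {"cat_brand": True, "cat_location": False, "cat_competition": False, "no_intent": True},
    "samsung galaxy a05": {"cat_brand": False, "cat_location": False, "cat_competition": True, "no_intent": True},
    "vivo camera & music": {"cat_brand": False, "cat_location": False, "cat_competition": True, "no_intent": False},
    "samsung to iphone": {"cat_brand": True, "cat_location": False, "cat_competition": True, "no_intent": True},
}

removed = [
    keyword
    for keyword, row in rows.items()
    if only_category_no_intent(row, "cat_brand")
    or only_category_no_intent(row, "cat_competition")
]
share_removed = len(removed) / len(rows)
print(removed, f"{share_removed:.0%}")
```

Note that `"samsung to iphone"` survives both filters because it matches two categories, which is the kind of case worth raising with the domain expert.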


In [None]:
# Keywords that match only the brand category and have no intent
kwds.filter(
    pl.all_horizontal("cat_brand", "no_intent"),
    pl.any_horizontal(cs.starts_with("cat_").exclude("cat_brand")).not_(),
).collect()

keyword,cat_brand,cat_location,cat_competition,int_transactional,int_comparison,int_navigational,int_informational,no_intent
str,bool,bool,bool,bool,bool,bool,bool,bool
"""iphone 13 0""",true,false,false,false,false,false,false,true
"""iphone 14 0""",true,false,false,false,false,false,false,true
"""iphone 0""",true,false,false,false,false,false,false,true
"""002 iphone""",true,false,false,false,false,false,false,true
"""new iphone 1""",true,false,false,false,false,false,false,true
…,…,…,…,…,…,…,…,…
"""iphone5""",true,false,false,false,false,false,false,true
"""iphone6""",true,false,false,false,false,false,false,true
"""iphone7""",true,false,false,false,false,false,false,true
"""iphone8 plus""",true,false,false,false,false,false,false,true


In [None]:
# Keywords that match only the competition category and have no intent
kwds.filter(
    pl.all_horizontal("cat_competition", "no_intent"),
    pl.any_horizontal(cs.starts_with("cat_").exclude("cat_competition")).not_(),
).collect()

keyword,cat_brand,cat_location,cat_competition,int_transactional,int_comparison,int_navigational,int_informational,no_intent
str,bool,bool,bool,bool,bool,bool,bool,bool
"""samsung galaxy a 01""",false,false,true,false,false,false,false,true
"""samsung a 01""",false,false,true,false,false,false,false,true
"""vivo y 01""",false,false,true,false,false,false,false,true
"""samsung galaxy a 02""",false,false,true,false,false,false,false,true
"""samsung a 02""",false,false,true,false,false,false,false,true
…,…,…,…,…,…,…,…,…
"""vivoy33s""",false,false,true,false,false,false,false,true
"""vivoy33t""",false,false,true,false,false,false,false,true
"""vivoy53s""",false,false,true,false,false,false,false,true
"""vivoy73""",false,false,true,false,false,false,false,true


In [None]:
# Keywords that match brand or competition (but no other category) and have no intent
kwds.filter(
    pl.any_horizontal("cat_brand", "cat_competition"),
    pl.col("no_intent"),
    pl.any_horizontal(
        cs.starts_with("cat_").exclude("cat_brand", "cat_competition"),
    ).not_(),
).collect()

keyword,cat_brand,cat_location,cat_competition,int_transactional,int_comparison,int_navigational,int_informational,no_intent
str,bool,bool,bool,bool,bool,bool,bool,bool
"""iphone 13 0""",true,false,false,false,false,false,false,true
"""iphone 14 0""",true,false,false,false,false,false,false,true
"""iphone 0""",true,false,false,false,false,false,false,true
"""002 iphone""",true,false,false,false,false,false,false,true
"""samsung galaxy a 01""",false,false,true,false,false,false,false,true
…,…,…,…,…,…,…,…,…
"""vivoy33s""",false,false,true,false,false,false,false,true
"""vivoy33t""",false,false,true,false,false,false,false,true
"""vivoy53s""",false,false,true,false,false,false,false,true
"""vivoy73""",false,false,true,false,false,false,false,true
