<a href="https://colab.research.google.com/github/edwardLum/work-related/blob/main/clustering-search-terms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Imports**

Libraries used:

* **pandas**: pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool,
built on top of the Python programming language. User guide [here](https://pandas.pydata.org/docs/user_guide/index.html)

* **gensim**: Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Documentation [here](https://github.com/RaRe-Technologies/gensim/#documentation)

* **sklearn**: scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. User guide [here](https://scikit-learn.org/stable/user_guide.html)



In [None]:
import pandas as pd
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import chardet

**Detect encoding**

Use chardet's `detect` function to detect the encoding of the provided file.

In [None]:
def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

file_path = '/content/Search terms report - 2023-10-13T182616.564.csv'
original_encoding = detect_encoding(file_path)

print(f"Detected encoding: {original_encoding}")

Detected encoding: UTF-16
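
As a cross-check on the detected encoding, the first bytes of a file can be compared against known byte-order marks (BOMs). The sketch below uses only the standard library; `encoding_from_bom` and `BOMS` are hypothetical names introduced for illustration, and it assumes the file starts with a BOM, which not every file does.

```python
import codecs

# Byte-order marks mapped to the codec name Python uses to decode them.
# The UTF-32 marks come first because the UTF-32 LE mark begins with
# the same two bytes as the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def encoding_from_bom(raw_bytes):
    """Return a codec name if raw_bytes starts with a known BOM, else None."""
    for bom, name in BOMS:
        if raw_bytes.startswith(bom):
            return name
    return None

# Python prepends a BOM when encoding with the generic 'utf-16' codec.
sample = "Search term\tCost\n".encode("utf-16")
print(encoding_from_bom(sample))
```

For a real file, passing the first four bytes is enough, for example `encoding_from_bom(open(file_path, 'rb').read(4))`. A result of `None` only means no BOM was found, so chardet's guess remains the fallback.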


**Read data**

Create a DataFrame from the provided CSV file. Specify:

* the separator the csv uses
* the encoding of the file
* how many rows have to be skipped (if any)

Remove summary rows if any.
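
The number of rows to skip can also be found programmatically rather than by opening the file. The following is a minimal sketch, assuming the header row is the first line whose first field is the column name `Search term`; the function name `count_rows_before_header` and the sample lines are hypothetical and do not come from the actual report.

```python
def count_rows_before_header(lines, header_name="Search term", sep="\t"):
    """Return the index of the first line whose first field equals header_name."""
    for index, line in enumerate(lines):
        first_field = line.rstrip("\r\n").split(sep)[0]
        if first_field == header_name:
            return index
    raise ValueError(f"No line starts with the column name {header_name!r}")

# Hypothetical layout: two title lines, the header, one data row, one summary row.
sample_lines = [
    "Report title line\n",
    "Date range line\n",
    "Search term\tMatch type\tCost\n",
    "free bet\tBroad match\t115.98\n",
    "Total: Search terms\t\t115.98\n",
]

rows_to_skip = count_rows_before_header(sample_lines)
print(rows_to_skip)  # 2
```

The function accepts any iterable of lines, so an open text file (opened with the detected encoding) can be passed directly, and the result can be used as the `skiprows` argument of `pd.read_csv`.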

In [None]:
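# Load the tab-separated report; skiprows=2 skips the two lines before the header row
all_terms = pd.read_csv(file_path, sep='\t', encoding='utf-16', skiprows=2)

# Drop summary rows, whose 'Search term' value starts with 'Total: '
# (na=False treats missing search terms as non-matching)
all_terms = all_terms[~all_terms['Search term'].str.startswith('Total: ', na=False)]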

all_terms.dtypes


Search term          object
Match type           object
Added/Excluded       object
Campaign             object
Ad group             object
Keyword              object
Currency code        object
Cost                float64
Impr.                object
Interactions         object
Interaction rate     object
Avg. cost           float64
Conversions          object
Cost / conv.        float64
Conv. rate           object
dtype: object
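
Rate columns such as `Interaction rate` and `Conv. rate` are read as `object` because their values are strings like `33.33%`. If numeric values are needed later, one possible conversion is sketched below. `percent_to_float` is a hypothetical helper, and the handling of thousands separators is an assumption about how large values might be formatted.

```python
def percent_to_float(text):
    """Convert a percentage string such as '33.33%' to the float 33.33."""
    cleaned = text.strip().rstrip("%")
    # Assumption: a comma, if present, is a thousands separator.
    cleaned = cleaned.replace(",", "")
    return float(cleaned)

print(percent_to_float("33.33%"))
print(percent_to_float("1,200.00%"))
```

Applied to a DataFrame column, this could look like `all_terms['Interaction rate'].map(percent_to_float)`, provided every value in the column is a string in this format; missing or placeholder values would need to be handled first.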

In [None]:
all_terms


Unnamed: 0,Search term,Match type,Added/Excluded,Campaign,Ad group,Keyword,Currency code,Cost,Impr.,Interactions,Interaction rate,Avg. cost,Conversions,Cost / conv.,Conv. rate
0,www bet24 betting,Phrase match (close variant),,ZA | Search | Generic sports betting,Test Ad Grou,"""online sports bet""",ZAR,7.82,1,1,100.00%,7.82,2.00,3.91,200.00%
1,which betting sites have aviator game,Phrase match (close variant),,ZA | Search | Generic sports betting,Generic sports betting,bet on sports,ZAR,40.94,3,1,33.33%,40.94,3.00,13.65,300.00%
2,bet games south africa,Phrase match (close variant),,ZA | Search | Generic sports betting,Test Ad Grou,sports gambling,ZAR,60.84,21,3,14.29%,20.28,2.00,30.42,66.67%
3,bet,Broad match,,ZA | Search | Generic sports betting,Generic sports betting,online betting,ZAR,844.53,1257,41,3.26%,20.60,7.00,120.65,17.07%
4,free bet,Broad match,,ZA | Search | Sports,Soccer,bet Soccer,ZAR,115.98,132,6,4.55%,19.33,2.00,57.99,33.33%
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1720,exchange bet,Broad match,,ZA | Search | Generic sports betting,Generic sports betting,online sports bet,ZAR,13.10,42,1,2.38%,13.10,2.00,6.55,200.00%
1721,www betway co za login,Phrase match (close variant),,ZA | Search | Generic sports betting,Generic sports betting,online sports bet,ZAR,98.18,249,25,10.04%,3.93,2.00,49.09,8.00%
1722,sportpesa join,Phrase match,,ZA | Search | Branded,Branded,"""Sportpesa""",ZAR,21.82,1,1,100.00%,21.82,2.00,10.91,200.00%
1723,sportpesa south africa,Phrase match,Added,ZA | Search | Branded,Branded,"""Sportpesa""",ZAR,35.87,167,50,29.94%,0.72,8.00,4.48,16.00%
