
**Imports**

Libraries used:

* **pandas**: pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool, built on top of the Python programming language. User guide [here](https://pandas.pydata.org/docs/user_guide/index.html)

* **gensim**: Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Documentation [here](https://github.com/RaRe-Technologies/gensim/#documentation)

* **sklearn**: scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. User guide [here](https://scikit-learn.org/stable/user_guide.html)

* **chardet**: chardet is a Python library that detects the character encoding of a byte sequence. It is used here to find the encoding of the uploaded file before reading it.



In [9]:
import pandas as pd
import chardet

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

from google.colab import files

**Files upload**

Choose the files to upload. Their names are collected in a list.

In [10]:
uploaded = files.upload()

filenames = []

# Report each uploaded file and record its name
for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  filenames.append(fn)

Saving Search terms report - 2023-10-17T184921.753.csv to Search terms report - 2023-10-17T184921.753 (1).csv
User uploaded file "Search terms report - 2023-10-17T184921.753 (1).csv" with length 222554 bytes


**Detect encoding**

Use chardet's detect function to detect the encoding of the uploaded file.

In [11]:
def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding']

file_path = f"/content/{filenames[0]}"
original_encoding = detect_encoding(file_path)

print(f"Detected encoding: {original_encoding}")

Detected encoding: UTF-16
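
A UTF-16 file often starts with a byte-order mark (BOM), and that is one signal a detector can rely on. The sketch below is a standard-library illustration only, not a description of how chardet works internally. The helper name `sniff_utf16_bom` and the sample text are hypothetical.

```python
import codecs


def sniff_utf16_bom(raw: bytes):
    """Hypothetical helper: name the UTF-16 byte-order mark at the start of raw, if any."""
    # A UTF-32 little-endian BOM begins with the same two bytes as a UTF-16 one,
    # so rule UTF-32 out first.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return None
    if raw.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16 little-endian"
    if raw.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16 big-endian"
    return None


# Python's "utf-16" codec prepends a BOM when encoding and consumes it when decoding.
sample = "Search term\tClicks\n".encode("utf-16")
bom_label = sniff_utf16_bom(sample)
decoded = sample.decode("utf-16")
```

Passing the detected encoding on to the reader, as the next step does, is what lets the BOM and the two-byte characters be decoded correctly.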


**Read data**

Create a dataframe from the uploaded CSV file. Specify:

* the separator the CSV uses
* the encoding of the file
* how many rows have to be skipped (if any)
* the thousands separator

Summary rows, if any, are removed in the next step.

In [12]:
# Load data
all_terms = pd.read_csv(file_path, sep='\t',
                        encoding=original_encoding,
                        skiprows=2,
                        thousands=',')
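
To see what skiprows and thousands do, here is a minimal sketch on a made-up report held in memory. The text in `raw_report` is invented for illustration and only assumes the same general layout: two preamble lines followed by a tab-separated table.

```python
import io

import pandas as pd

# Made-up report text: two preamble lines, then a tab-separated table.
raw_report = (
    "Search terms report\n"
    "All time\n"
    "Search term\tClicks\tImpr.\n"
    "example term\t387\t13,899\n"
    "another term\t3\t12\n"
)

# skiprows=2 drops the preamble so the third line becomes the header;
# thousands=',' lets "13,899" be parsed as the integer 13899.
toy_terms = pd.read_csv(io.StringIO(raw_report), sep="\t", skiprows=2, thousands=",")
print(toy_terms)
```

Without the thousands argument, the Impr. column would be read as text because of the embedded comma.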

**Process data**

Actions:
* Remove summary rows
* Remove unnecessary columns
* Deduplicate and aggregate metrics


In [13]:
# Remove summary rows (na=False treats missing search terms as non-matches)
all_terms = all_terms[~all_terms['Search term'].str.startswith('Total: ', na=False)]

# Remove unnecessary columns
columns_to_drop = ['Conv. rate', 'CTR', 'Cost / conv.', 'Avg. CPC']
all_terms_required_columns = all_terms.drop(columns=columns_to_drop)

# Deduplicate search terms by summing their metrics
aggregated_terms = all_terms_required_columns.groupby('Search term').agg({
    'Clicks': 'sum',
    'Cost': 'sum',
    'Impr.': 'sum',
    'Conversions': 'sum',
}).reset_index()

data = aggregated_terms

| | Search term | Clicks | Cost | Impr. | Conversions |
|---|---|---|---|---|---|
| 0 | 100 sign up bonus sports betting | 28 | 619.30 | 363 | 9.0 |
| 1 | 10bet | 387 | 4252.92 | 13899 | 76.0 |
| 2 | 10bet africa | 3 | 30.06 | 12 | 3.0 |
| 3 | 10bet login | 39 | 320.78 | 1319 | 12.0 |
| 4 | 10bet register | 4 | 133.52 | 123 | 4.0 |
| ... | ... | ... | ... | ... | ... |
| 422 | xbet login | 9 | 93.34 | 174 | 4.0 |
| 423 | xbet register | 23 | 556.24 | 239 | 16.0 |
| 424 | xbets | 3 | 38.74 | 9 | 3.0 |
| 425 | yankee bet | 1 | 21.80 | 4 | 5.0 |
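
The filtering and aggregation above can be checked on a toy frame. The rows below are invented for illustration; only the column names and the 'Total: ' prefix come from the report.

```python
import pandas as pd

# Invented rows: one duplicated search term and one summary row.
toy = pd.DataFrame({
    "Search term": ["alpha bet", "alpha bet", "beta bet", "Total: Account"],
    "Clicks": [2, 3, 1, 6],
    "Cost": [1.5, 2.5, 4.0, 8.0],
})

# Drop the summary row, then sum the metrics of duplicated search terms.
toy = toy[~toy["Search term"].str.startswith("Total: ")]
toy_agg = toy.groupby("Search term").agg({"Clicks": "sum", "Cost": "sum"}).reset_index()
print(toy_agg)
```

The two "alpha bet" rows collapse into one row with their clicks and cost summed, and the summary row no longer inflates the totals.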


**TF-IDF vectorization**

TF-IDF (Term Frequency-Inverse Document Frequency) vectorization quantifies the importance of words in a document relative to a corpus. It weights each term by its frequency within a document and penalizes it by its frequency across all documents. This emphasizes terms that are specific to a particular document and downplays terms that are common across the corpus. Here, each search term is treated as one document.
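
As a rough illustration of the weighting, the sketch below computes a textbook TF-IDF variant (term frequency times log(N / document frequency)) in plain Python on three invented search terms. This is an assumption made for clarity: TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

# Invented mini-corpus: each search term is one document.
docs = ["bet login", "bet register", "bet bonus login"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, w)
```

In this variant "bet" gets a weight of zero because it occurs in every document, while "register", which occurs in only one, gets the highest weight.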

Two parameters need to be provided:

**max_df**: This parameter removes terms that appear too frequently in the corpus. It can be either:

* an integer (e.g., 5), the maximum number of documents a term can appear in and still be included as a feature, or
* a float (e.g., 0.85), the maximum proportion of documents in the corpus.

A term that appears in more documents than this threshold is discarded.
The idea behind max_df is that words appearing in a very large proportion of documents are likely to be common words (e.g., stopwords) that carry little specific information about the content of a document.
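
The max_df rule with a float value can be sketched in plain Python. The corpus below is invented, and the code only mimics the thresholding rule; it is not the vectorizer's implementation.

```python
from collections import Counter

# Invented mini-corpus: "bet" appears in every document.
docs = ["bet login", "bet register", "bet bonus", "bet app login"]
max_df = 0.85

# Document frequency of each term.
doc_freq = Counter(term for d in docs for term in set(d.split()))

# Keep a term only if it appears in at most max_df * n_docs documents (here 3.4).
threshold = max_df * len(docs)
kept = sorted(term for term, count in doc_freq.items() if count <= threshold)
print(kept)
```

Here "bet" appears in 4 of 4 documents, which exceeds the threshold of 3.4, so it is dropped while the more specific terms are kept.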

**max_features**: This parameter limits the number of features (words or tokens) the vectorizer learns from the corpus.
If set (e.g., to 10000), the vectorizer keeps only the top max_features terms, ordered by term frequency across the corpus.

This can be useful to limit the dimensionality of the output, especially when dealing with very large datasets where memory or computational resources are a concern.
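
The effect of max_features can be sketched the same way: count each term across the whole corpus and keep only the most frequent ones. The corpus and the value of 2 are invented for illustration, and how ties at the cutoff are resolved is left aside.

```python
from collections import Counter

# Invented mini-corpus.
docs = ["bet login", "bet register", "bet login bonus"]
max_features = 2

# Total occurrences of each term across the corpus.
term_counts = Counter(term for d in docs for term in d.split())

# Keep the max_features most frequent terms as the vocabulary.
vocabulary = sorted(term for term, _ in term_counts.most_common(max_features))
print(vocabulary)
```

With a limit of 2, only "bet" and "login" survive, so every document would be represented by a two-column vector.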

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.85, max_features=10000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(data['Search term'])