## Search Query Anomaly Detection: A Process We Can Follow

##### Search query anomaly detection is a technique for identifying unusual or unexpected patterns in search query data. It can be used to spot potential issues with the search engine or to surface new trends in user behavior.

#### The process is as follows:
#### 1.- Gather historical search query data from the source, such as a search engine or a website's search functionality.
#### 2.- Conduct an initial analysis to understand the distribution of search queries, their frequency, and any noticeable patterns or trends.
#### 3.- Create relevant features or attributes from the search query data that can aid in anomaly detection.
#### 4.- Choose an appropriate anomaly detection algorithm. Common methods include statistical approaches like Z-score analysis and machine learning algorithms like Isolation Forests or One-Class SVM.
#### 5.- Train the selected model on the prepared data.
#### 6.- Apply the trained model to the search query data to identify anomalies or outliers.
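As a rough illustration of steps 4 to 6, here is a minimal sketch of the Z-score approach named in step 4, using only the standard library. The function name `zscore_outliers`, the threshold of 3.0, and the `sample_clicks` values are assumptions made up for this example; they are not part of the dataset or of the notebook's own method.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the indices of values whose absolute Z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # All values are identical, so nothing can be flagged as an outlier
        return []
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]

# Hypothetical click counts: 19 ordinary queries and one unusually popular one
sample_clicks = [100] * 19 + [1000]
print(zscore_outliers(sample_clicks))  # prints [19]
```

A Z-score needs no training step, which makes it a convenient baseline. It assumes a roughly bell-shaped distribution, so on heavily skewed click data a model such as an Isolation Forest may be a better fit.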

In [1]:
# Start by importing the necessary Python libraries and loading the dataset
import pandas as pd
from collections import Counter
import re
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

# Change according to your path
queries_df = pd.read_csv(
  '/home/xamanek/PythonProjects/TransformersML/Datasets/20240202a_Queries.csv'
)
print(queries_df.head())

                                 Top queries  Clicks  Impressions     CTR  \
0                number guessing game python    5223        14578  35.83%   
1                        thecleverprogrammer    2809         3456  81.28%   
2           python projects with source code    2077        73380   2.83%   
3  classification report in machine learning    2012         4959  40.57%   
4                      the clever programmer    1931         2528  76.38%   

   Position  
0      1.61  
1      1.02  
2      5.94  
3      1.28  
4      1.09  


In [2]:
# Let's take a look at the column info before moving forward
print(queries_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB
None


In [3]:
# Clean the CTR column: convert it from a percentage string (e.g. '35.83%')
# to a float fraction (e.g. 0.3583)
queries_df['CTR'] = queries_df['CTR'].str.rstrip('%').astype('float') / 100.0

In [4]:
# Analyze the most common words across all search queries
# Function to clean and split the queries into words
def clean_and_split( query ):
  words = re.findall( r'\b[a-zA-Z]+\b', query.lower() )
  return words

# Split each query into words and count the frequency of each word
word_counts = Counter()
for query in queries_df['Top queries']:
  word_counts.update( clean_and_split( query ) )

word_freq_df = pd.DataFrame( word_counts.most_common( 20 ), columns = [ 'Word', 'Frequency' ] )

# Plot the word frequencies
fig = px.bar( word_freq_df, x = 'Word', y = 'Frequency', title = 'Top 20 Most Common Words in Search Queries' )
fig.show()
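To make the word-counting step concrete, the sketch below applies the same regular expression and `Counter` logic to the five queries shown in the `head()` output above. The names `sample_queries` and `sample_counts` are introduced here only for illustration.

```python
import re
from collections import Counter

# The five queries printed by queries_df.head() earlier in the notebook
sample_queries = [
    "number guessing game python",
    "thecleverprogrammer",
    "python projects with source code",
    "classification report in machine learning",
    "the clever programmer",
]

def clean_and_split(query):
    # Lowercase the query and keep only purely alphabetic words
    return re.findall(r'\b[a-zA-Z]+\b', query.lower())

sample_counts = Counter()
for query in sample_queries:
    sample_counts.update(clean_and_split(query))

print(sample_counts.most_common(1))  # prints [('python', 2)]
```

Note that `thecleverprogrammer` is counted as a single word, separate from `the`, `clever`, and `programmer`, because the pattern splits only on non-alphabetic characters. Tokens that contain digits are skipped altogether, since `\b[a-zA-Z]+\b` matches only runs of letters bounded by word boundaries.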