# **Search Queries Anomaly Detection using Python**

**Introduction**

Search Queries Anomaly Detection involves identifying outliers in search query data based on performance metrics. This process helps businesses uncover potential issues or opportunities, such as unexpectedly high or low Click-Through Rates (CTR).

**Process Overview**
The process for Search Queries Anomaly Detection can be broken down into several key steps:

1. **Data Collection**: Gather historical search query data from sources like search engines or website search functionality.
   
2. **Initial Analysis**: Perform an initial analysis to understand the distribution of search queries, their frequency, and any noticeable patterns or trends.
   
3. **Feature Engineering**: Create relevant features or attributes from the search query data that can aid in anomaly detection.
   
4. **Model Selection**: Choose an appropriate anomaly detection algorithm. Common methods include statistical approaches (e.g., Z-score analysis) and machine learning algorithms (e.g., Isolation Forests, One-Class SVM).
   
5. **Model Training**: Train the selected model on the prepared data.
   
6. **Anomaly Detection**: Apply the trained model to the search query data to identify anomalies or outliers.
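
As a rough illustration of the statistical approach named in step 4, here is a minimal Z-score sketch. The click counts are made-up toy values (only 5223 echoes a figure from the dataset), and the threshold of 3 is a common convention assumed here, not something the analysis below uses.

```python
import pandas as pd

# Hypothetical toy data: 15 typical click counts plus one extreme value
clicks = pd.Series(
    [48, 52, 55, 60, 61, 64, 66, 70, 72, 75, 80, 85, 90, 94, 100, 5223],
    name="Clicks",
)

# Z-score: distance from the mean, measured in standard deviations
z_scores = (clicks - clicks.mean()) / clicks.std()

# Flag values whose absolute Z-score exceeds the assumed threshold
THRESHOLD = 3
is_outlier = z_scores.abs() > THRESHOLD

print(clicks[is_outlier])
```

A Z-score looks at one metric at a time and is itself distorted by extreme values, which is one reason a multivariate method such as Isolation Forest may suit data with several interacting metrics.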

## Exploratory Data Analysis (EDA)

**Loading and Inspecting the Dataset**

In [66]:
import pandas as pd
from collections import Counter
import re
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"


- pandas is used for data manipulation and analysis
- Counter is used to count the frequency of elements in an iterable
- re is used for regular expression operations
- plotly.express is a high-level interface for Plotly, a graphing library
- plotly.io is used to configure Plotly's behavior
- pio.templates.default = "plotly_white" sets the default template for all Plotly graphs to "plotly_white"


In [67]:
# Reading the dataset
queries_df = pd.read_csv(r"C:\Users\Alpana\Desktop\project\Search Queries Anomaly Detection\dataset Queries.csv") # reads the CSV file into a pandas DataFrame


In [68]:
# displays the first five rows of the DataFrame
queries_df.head()

|   | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 0 | number guessing game python | 5223 | 14578 | 35.83% | 1.61 |
| 1 | thecleverprogrammer | 2809 | 3456 | 81.28% | 1.02 |
| 2 | python projects with source code | 2077 | 73380 | 2.83% | 5.94 |
| 3 | classification report in machine learning | 2012 | 4959 | 40.57% | 1.28 |
| 4 | the clever programmer | 1931 | 2528 | 76.38% | 1.09 |


- **Top queries:** The actual search terms used by users.
- **Clicks:** The number of times users clicked through to the website after using the query.
- **Impressions:** The number of times the website appeared in search results for the query.
- **CTR (Click-Through Rate):** The ratio of clicks to impressions, indicating how effectively the query leads users to the website.
- **Position:** The average ranking of the website in search results for the query.
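
To make the CTR definition concrete, the sketch below recomputes CTR from Clicks and Impressions for the five rows shown above and places it next to the reported value. The DataFrame is rebuilt by hand from that output, and the `CTR recomputed (%)` column name is a hypothetical label used only for this illustration.

```python
import pandas as pd

# The five sample rows from queries_df.head(), re-entered by hand
sample = pd.DataFrame({
    "Top queries": [
        "number guessing game python",
        "thecleverprogrammer",
        "python projects with source code",
        "classification report in machine learning",
        "the clever programmer",
    ],
    "Clicks": [5223, 2809, 2077, 2012, 1931],
    "Impressions": [14578, 3456, 73380, 4959, 2528],
    "CTR": ["35.83%", "81.28%", "2.83%", "40.57%", "76.38%"],
})

# CTR as a percentage: clicks divided by impressions, times 100
sample["CTR recomputed (%)"] = (
    sample["Clicks"] / sample["Impressions"] * 100
).round(2)

print(sample[["Top queries", "CTR", "CTR recomputed (%)"]])
```

For these rows the recomputed percentages agree with the reported CTR to two decimal places.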

## Cleaning and Preparing Data

In [69]:
queries_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB


In [70]:
queries_df.describe()

|   | Clicks | Impressions | Position |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 172.275 | 1939.466 | 3.98593 |
| std | 281.0221 | 4856.702605 | 2.841842 |
| min | 48.0 | 62.0 | 1.0 |
| 25% | 64.0 | 311.0 | 2.01 |
| 50% | 94.0 | 590.5 | 3.12 |
| 75% | 169.0 | 1582.75 | 5.3425 |
| max | 5223.0 | 73380.0 | 28.52 |


In [71]:
queries_df.isna().sum()

Top queries    0
Clicks         0
Impressions    0
CTR            0
Position       0
dtype: int64

**Cleaning CTR column**

In [72]:
# removes the '%' sign from the CTR column, converts it to float, and divides by 100 to get the CTR as a decimal
queries_df['CTR'] = queries_df['CTR'].str.rstrip('%').astype('float') / 100


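- The str accessor applies string methods to each element in the 'CTR' column.
- The rstrip('%') method removes any trailing '%' characters from each string; for example, "35.83%" becomes "35.83".
- astype('float') converts the resulting strings to floating-point numbers, so "35.83" becomes 35.83.
- Dividing by 100 expresses CTR as a fraction between 0 and 1, so 35.83 becomes 0.3583.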
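
The dataset here has no missing or malformed CTR values, so the one-line conversion is enough. For messier exports, a slightly more defensive variant might look like the sketch below. The input values, including the stray whitespace and the "n/a" entry, are invented for illustration.

```python
import pandas as pd

# Hypothetical raw values: one clean, one padded with spaces, one malformed
ctr_raw = pd.Series(["35.83%", " 81.28% ", "n/a"], name="CTR")

# Trim whitespace first, then drop the trailing percent sign
stripped = ctr_raw.str.strip().str.rstrip("%")

# errors="coerce" turns unparseable entries into NaN instead of raising
ctr_clean = pd.to_numeric(stripped, errors="coerce") / 100

print(ctr_clean)
```

With `astype('float')`, the "n/a" entry would raise a `ValueError`. Coercing to NaN lets the problem rows be counted with `isna().sum()` and handled explicitly.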

## Analyzing Common Words in Queries

In [73]:
# Function to clean and split the queries into words
def clean_and_split(query):
    words = re.findall(r'\b[a-zA-Z]+\b', query.lower()) # uses a regular expression to find all words in the query (ignoring case)
    return words



In [74]:
# Split each query into words and count the frequency of each word
word_counts = Counter() # initializes a Counter object to count word frequencies
for query in queries_df['Top queries']:
    word_counts.update(clean_and_split(query)) # updates the word counts with the words from each query



In [75]:
word_freq_df = pd.DataFrame(word_counts.most_common(20), columns=['Word', 'Frequency']) # creates a DataFrame with the 20 most common words and their frequencies
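
To see what the tokenizer and counter do on a small input, here is a self-contained sketch run on three queries taken from the dataset. The names `sample_queries` and `demo_counts` are introduced only for this example.

```python
import re
from collections import Counter

def clean_and_split(query):
    # Lowercase the query and keep only purely alphabetic tokens
    return re.findall(r'\b[a-zA-Z]+\b', query.lower())

sample_queries = [
    "number guessing game python",
    "rock paper scissors python",
    "r2 score",
]

demo_counts = Counter()
for query in sample_queries:
    demo_counts.update(clean_and_split(query))

print(demo_counts.most_common(1))   # [('python', 2)]
print(clean_and_split("r2 score"))  # ['score']
```

Note that the pattern drops any token containing a digit, so "r2" disappears from the counts entirely. If such terms matter, a pattern like `\w+` could be considered instead, at the cost of also keeping pure numbers.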


## Visualizing Word Frequencies

In [76]:
# Plotting the word frequencies
fig = px.bar(word_freq_df, x='Word', y='Frequency', title='Top 20 Most Common Words in Search Queries') # creates a bar plot of the top 20 most common words
fig.show() # displays the plot


## Analyzing Key Metrics

**Top Queries by Clicks and Impressions**

In [77]:
# Top queries by Clicks and Impressions
top_queries_clicks_vis = queries_df.nlargest(10, 'Clicks')[['Top queries', 'Clicks']] # selects the top 10 queries by Clicks
top_queries_impressions_vis = queries_df.nlargest(10, 'Impressions')[['Top queries', 'Impressions']] # selects the top 10 queries by Impressions

# Plotting
fig_clicks = px.bar(top_queries_clicks_vis, x='Top queries', y='Clicks', title='Top Queries by Clicks') # creates a bar plot of the top queries by Clicks
fig_impressions = px.bar(top_queries_impressions_vis, x='Top queries', y='Impressions', title='Top Queries by Impressions') # creates a bar plot of the top queries by Impressions
fig_clicks.show() # displays the Clicks plot
fig_impressions.show() # displays the Impressions plot


**Queries with Highest and Lowest CTR**

In [78]:
# Queries with highest and lowest CTR
top_ctr_vis = queries_df.nlargest(10, 'CTR')[['Top queries', 'CTR']] # selects the top 10 queries by CTR
bottom_ctr_vis = queries_df.nsmallest(10, 'CTR')[['Top queries', 'CTR']] # selects the bottom 10 queries by CTR

# Plotting
fig_top_ctr = px.bar(top_ctr_vis, x='Top queries', y='CTR', title='Top Queries by CTR') # creates a bar plot of the top queries by CTR
fig_bottom_ctr = px.bar(bottom_ctr_vis, x='Top queries', y='CTR', title='Bottom Queries by CTR') # creates a bar plot of the bottom queries by CTR
fig_top_ctr.show() # displays the top CTR plot
fig_bottom_ctr.show() # displays the bottom CTR plot


**Correlation Matrix**

In [79]:
# Correlation matrix visualization
correlation_matrix = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']].corr() # computes the correlation matrix for the specified columns
fig_corr = px.imshow(correlation_matrix, text_auto=True, title='Correlation Matrix') # creates a heatmap of the correlation matrix
fig_corr.show() # displays the heatmap
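
By default, `DataFrame.corr()` computes the Pearson correlation coefficient, which is the covariance of two columns divided by the product of their standard deviations. The sketch below checks this by hand for one pair of columns, using only the five sample rows shown earlier, so the resulting value is illustrative and not the correlation for the full dataset.

```python
import pandas as pd

# Five sample rows re-entered by hand from the head() output
sample = pd.DataFrame({
    "Clicks": [5223, 2809, 2077, 2012, 1931],
    "Impressions": [14578, 3456, 73380, 4959, 2528],
    "Position": [1.61, 1.02, 5.94, 1.28, 1.09],
})

x, y = sample["Impressions"], sample["Position"]

# Pearson r = cov(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())

# The same entry taken from the full correlation matrix
r_pandas = sample.corr().loc["Impressions", "Position"]

print(r_manual, r_pandas)
```

Pearson correlation only captures linear relationships. For skewed metrics such as Clicks and Impressions, `corr(method="spearman")` is an alternative worth comparing.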


## **Anomaly Detection**

**Using Isolation Forest Algorithm**

In [80]:
# Detecting Anomalies in Search Queries
from sklearn.ensemble import IsolationForest # imports the Isolation Forest algorithm from scikit-learn

# Selecting relevant features
features = queries_df[['Clicks', 'Impressions', 'CTR', 'Position']] # selects the relevant features for anomaly detection

# Initializing Isolation Forest
iso_forest = IsolationForest(n_estimators=100, contamination=0.01) # initializes the Isolation Forest with 100 trees and a contamination rate of 1%

# Fitting the model
iso_forest.fit(features) # fits the Isolation Forest model to the selected features

# Predicting anomalies
queries_df['anomaly'] = iso_forest.predict(features) # adds an 'anomaly' column: -1 marks an anomaly, 1 marks a normal row

# Selecting the anomalies
anomalies = queries_df[queries_df['anomaly'] == -1] # keeps only the rows classified as anomalies


In [81]:
# Analyzing the detected anomalies
anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']]


|   | Top queries | Clicks | Impressions | CTR | Position |
|---|---|---|---|---|---|
| 0 | number guessing game python | 5223 | 14578 | 0.3583 | 1.61 |
| 1 | thecleverprogrammer | 2809 | 3456 | 0.8128 | 1.02 |
| 2 | python projects with source code | 2077 | 73380 | 0.0283 | 5.94 |
| 4 | the clever programmer | 1931 | 2528 | 0.7638 | 1.09 |
| 15 | rock paper scissors python | 1111 | 35824 | 0.031 | 7.19 |
| 21 | classification report | 933 | 39896 | 0.0234 | 7.53 |
| 34 | machine learning roadmap | 708 | 42715 | 0.0166 | 8.97 |
| 82 | r2 score | 367 | 56322 | 0.0065 | 9.33 |
| 167 | text to handwriting | 222 | 11283 | 0.0197 | 28.52 |
| 232 | standardscaler | 177 | 39267 | 0.0045 | 10.23 |

**Conclusion**

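Search query anomaly detection helps businesses identify and respond to unusual patterns in search query performance. In this dataset, the Isolation Forest flagged 10 of the 1,000 queries, in line with the 1% contamination rate. They fall into two broad groups: queries with unusually high clicks or CTR, such as "thecleverprogrammer" (81.28% CTR) and "the clever programmer" (76.38% CTR), and queries with many impressions but very low CTR or a poor average position, such as "r2 score" (56,322 impressions, 0.65% CTR) and "text to handwriting" (position 28.52). By leveraging machine learning techniques like Isolation Forests, businesses can turn such findings into actionable insights to optimize their search strategies and improve user engagement.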