**This is a notebook for the AICP Internship Task for Search Queries Anomaly Detection**

# Dataset attributes
Top Queries: The actual search terms used by users.  
Clicks: The number of times users clicked on the website after using the query.  
Impressions: The number of times the website appeared in search results for the query.  
CTR (Click Through Rate): The ratio of clicks to impressions, indicating the effectiveness of
the query in leading users to the website.  
Position: The average ranking of the website in search results for the query.   
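
As a quick sanity check of the CTR definition above, the sketch below recomputes CTR from clicks and impressions using the values of the first row in the dataset preview (5223 clicks, 14578 impressions). The helper name `click_through_rate` is hypothetical and not part of the original notebook.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks / impressions as a fraction (0.0 when there are no impressions)."""
    if impressions == 0:
        return 0.0
    return clicks / impressions

# Values taken from the first row of the dataset preview
ctr = click_through_rate(5223, 14578)
print(f"{ctr:.2%}")  # prints 35.83%
```

The result agrees with the `35.83%` reported in the CTR column for that query.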


In [64]:
# Importing Libraries
import pandas as pd
from collections import Counter
import re
import plotly.express as px
from sklearn.ensemble import IsolationForest 

**Q.1: Import data and check null values, check column info, and descriptive statistics of the data.**

In [11]:
# Read the dataset
data = pd.read_csv('Queries.csv')

In [12]:
data.head(5)

                                 Top queries  Clicks  Impressions     CTR  Position
0                number guessing game python    5223        14578  35.83%      1.61
1                        thecleverprogrammer    2809         3456  81.28%      1.02
2           python projects with source code    2077        73380   2.83%      5.94
3  classification report in machine learning    2012         4959  40.57%      1.28
4                      the clever programmer    1931         2528  76.38%      1.09


In [13]:
# Check for null values
print(data.isnull().sum())

Top queries    0
Clicks         0
Impressions    0
CTR            0
Position       0
dtype: int64


In [14]:
# Get column information (info() prints its report directly and returns None,
# so it does not need to be wrapped in print())
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Top queries  1000 non-null   object 
 1   Clicks       1000 non-null   int64  
 2   Impressions  1000 non-null   int64  
 3   CTR          1000 non-null   object 
 4   Position     1000 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 39.2+ KB


In [15]:
# Get descriptive statistics
print(data.describe())

          Clicks   Impressions     Position
count  1000.0000   1000.000000  1000.000000
mean    172.2750   1939.466000     3.985930
std     281.0221   4856.702605     2.841842
min      48.0000     62.000000     1.000000
25%      64.0000    311.000000     2.010000
50%      94.0000    590.500000     3.120000
75%     169.0000   1582.750000     5.342500
max    5223.0000  73380.000000    28.520000


**Q.2: Convert the CTR column from a percentage string to a float.**

In [17]:
data['CTR'] = data['CTR'].str.rstrip('%').astype('float') / 100

In [18]:
data.head(5)

                                 Top queries  Clicks  Impressions     CTR  Position
0                number guessing game python    5223        14578  0.3583      1.61
1                        thecleverprogrammer    2809         3456  0.8128      1.02
2           python projects with source code    2077        73380  0.0283      5.94
3  classification report in machine learning    2012         4959  0.4057      1.28
4                      the clever programmer    1931         2528  0.7638      1.09
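
The one-line conversion above assumes every CTR entry is a clean string such as `35.83%`. As a minimal sketch of a more defensive variant, the helper below (the name `percent_to_fraction` is hypothetical) also tolerates surrounding whitespace and turns unparseable entries into `NaN` instead of raising. The sample values are illustrative; the last one is deliberately invalid.

```python
import pandas as pd

def percent_to_fraction(series: pd.Series) -> pd.Series:
    """Convert strings such as '35.83%' to floats such as 0.3583."""
    cleaned = series.astype(str).str.strip().str.rstrip('%')
    # errors='coerce' turns unparseable entries into NaN instead of raising
    return pd.to_numeric(cleaned, errors='coerce') / 100

# Two well-formed values (one with stray whitespace) and one invalid entry
sample = pd.Series(['35.83%', ' 81.28% ', 'n/a'])
converted = percent_to_fraction(sample)
print(converted)
```

For this dataset the simpler `str.rstrip('%').astype('float') / 100` is sufficient, since the null check found no missing values in the CTR column.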


**Q.3: Analyze common words in each search query.**

In [20]:
# Function to clean and split queries into words
def clean_and_split(query):
    words = re.findall(r'\w+', query.lower())
    return words

# Split each query into words and count frequency
word_counts = Counter()
for query in data['Top queries']:
    words = clean_and_split(query)
    word_counts.update(words)


In [21]:
# Plot word frequencies
word_freq_df = pd.DataFrame.from_dict(word_counts, orient='index', columns=['Frequency'])
word_freq_df = word_freq_df.reset_index().rename(columns={'index': 'Word'})
word_freq_df = word_freq_df.sort_values('Frequency', ascending=False)

fig = px.bar(word_freq_df.head(20), x='Word', y='Frequency', title='Most Common Words in Search Queries')
fig.show()
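
To make the tokenizing and counting step concrete, here is a small self-contained sketch that applies the same `clean_and_split` logic to three queries from the dataset preview and uses `Counter.most_common` to read off the top word directly. It is an illustration on a tiny sample, not a replacement for the full analysis.

```python
import re
from collections import Counter

def clean_and_split(query):
    # Lowercase the query and extract runs of word characters
    return re.findall(r'\w+', query.lower())

# Three queries taken from the dataset preview
queries = [
    'number guessing game python',
    'python projects with source code',
    'the clever programmer',
]

word_counts = Counter()
for query in queries:
    word_counts.update(clean_and_split(query))

# most_common returns (word, count) pairs sorted by descending count
print(word_counts.most_common(1))  # prints [('python', 2)]
```

Note that `\w` also matches underscores, so a query such as `classification_report` is kept as a single token, distinct from the two tokens of `classification report`.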

**Q.4: Analyze the top queries by clicks and impressions.**

In [24]:
# Plot the top queries by clicks
top_queries_by_clicks = data.sort_values('Clicks', ascending=False).head(10)
fig = px.bar(top_queries_by_clicks, x='Top queries', y='Clicks', title='Top Queries by Clicks')
fig.show()

# Print the top queries and their corresponding clicks
print(top_queries_by_clicks[['Top queries', 'Clicks']])

                                 Top queries  Clicks
0                number guessing game python    5223
1                        thecleverprogrammer    2809
2           python projects with source code    2077
3  classification report in machine learning    2012
4                      the clever programmer    1931
5        standard scaler in machine learning    1559
6                               aman kharwal    1490
7                python turtle graphics code    1455
8      python game projects with source code    1421
9        82 python projects with source code    1343


In [28]:
# Plot the top queries by Impressions
top_queries_by_Impressions = data.sort_values('Impressions', ascending=False).head(10)
fig = px.bar(top_queries_by_Impressions, x='Top queries', y='Impressions', title='Top Queries by Impressions')
fig.show()

# Print the top queries and their corresponding Impressions
print(top_queries_by_Impressions[['Top queries', 'Impressions']])

                          Top queries  Impressions
2    python projects with source code        73380
82                           r2 score        56322
34           machine learning roadmap        42715
21              classification report        39896
232                    standardscaler        39267
91     facebook programming languages        36055
15         rock paper scissors python        35824
36                  pandas datareader        26663
180             classification_report        24917
54                  pandas_datareader        24689
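
The sort-then-head pattern used above can also be written with `DataFrame.nlargest`, which selects the top rows by a column in one call. The sketch below demonstrates it on five rows copied from the dataset preview; on the full dataset one would pass `10` instead of `3`.

```python
import pandas as pd

# Five rows copied from the dataset preview
data = pd.DataFrame({
    'Top queries': [
        'number guessing game python',
        'thecleverprogrammer',
        'python projects with source code',
        'classification report in machine learning',
        'the clever programmer',
    ],
    'Clicks': [5223, 2809, 2077, 2012, 1931],
    'Impressions': [14578, 3456, 73380, 4959, 2528],
})

# Rows with the three largest Impressions values, in descending order
top_by_impressions = data.nlargest(3, 'Impressions')
print(top_by_impressions[['Top queries', 'Impressions']])
```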


**Q.5: Analyze the queries with the highest and lowest CTRs.**

In [31]:
# Sort the data by CTR in descending order and select the top 10 rows
highest_ctr_queries = data.sort_values('CTR', ascending=False).head(10)

# Create a bar graph
fig = px.bar(highest_ctr_queries, x='Top queries', y='CTR', title='Top Queries by CTR')
fig.update_layout(width=800, height=600)
fig.show()

# Print the top queries and their corresponding CTR
print("Queries with the highest CTR:")
print(highest_ctr_queries[['Top queries', 'CTR']])

Queries with the highest CTR:
                                           Top queries     CTR
928                           the cleverprogrammer.com  0.8548
927                          the clever programmer.com  0.8281
1                                  thecleverprogrammer  0.8128
732               the clever programmer python project  0.7857
307    the clever programmer machine learning projects  0.7735
4                                the clever programmer  0.7638
964               python program to send otp to mobile  0.7083
95                        the card game code in python  0.6699
771  write a python program that calculates number ...  0.6632
137  python program to calculate number of seconds ...  0.6585


In [33]:
lowest_ctr_queries = data.sort_values('CTR').head(10)

# Create a bar graph
fig = px.bar(lowest_ctr_queries, x='Top queries', y='CTR', title='Bottom Queries by CTR')
fig.update_layout(width=800, height=600)
fig.show()
# Print the Bottom queries and their corresponding CTR
print("Queries with the lowest CTR:")
print(lowest_ctr_queries[['Top queries', 'CTR']])


Queries with the lowest CTR:
                        Top queries     CTR
929                   python turtle  0.0029
232                  standardscaler  0.0045
423   classification report sklearn  0.0047
544                 standard scaler  0.0048
981                r2 score sklearn  0.0062
82                         r2 score  0.0065
536              python source code  0.0067
684                 turtle graphics  0.0070
664  online payment fraud detection  0.0070
858          water quality analysis  0.0076


**Q.6: Check the correlation between different metrics.**

In [42]:
# Calculate the correlation matrix
correlation_matrix = data[['Clicks', 'Impressions', 'CTR', 'Position']].corr()

# Create the heatmap
fig = px.imshow(correlation_matrix,
                labels=dict(x='Metrics', y='Metrics', color='Correlation'),
                x=correlation_matrix.index,
                y=correlation_matrix.columns,
                color_continuous_scale='Viridis',
                text_auto=True,      
                )
# Set the title
fig.update_layout(title='Correlation Matrix Heatmap')

# Show the heatmap
fig.show()



# Observations from the Correlation Matrix

# CTR (Click-Through Rate) vs. Position:
The most notable finding is the strong negative correlation between CTR and position (-0.73). Queries for which the website ranks higher in search results (a lower position number) tend to have a markedly higher click-through rate, so aiming for the top spots is the clearest way to raise CTR.
# Clicks vs. Impressions:
The moderate positive correlation between clicks and impressions (0.38) indicates that queries with more visibility tend to receive more clicks.
# Impressions vs. CTR (Click-Through Rate):
The moderate negative correlation between impressions and CTR (-0.33) implies that although more impressions may bring more clicks in absolute terms, the share of impressions that turn into clicks tends to fall, possibly due to factors such as irrelevance of the result to the query.
# Clicks vs. CTR (Click-Through Rate):
The weak positive correlation between clicks and CTR (0.11) suggests that queries with more clicks have only a marginally higher click-through rate.
# Impressions vs. Position:
The moderate positive correlation between impressions and position (0.36) indicates that queries for which the website ranks lower (a higher position number) tend to have more impressions. Despite the lower rank, the website still gains significant visibility for these queries.
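
The coefficients discussed above are Pearson correlations, the default of `DataFrame.corr`. Because the click and impression counts are heavily skewed (the maximum is far above the median in the descriptive statistics), a rank-based Spearman correlation may be a useful cross-check. The sketch below shows the call on five rows copied from the dataset preview; five rows are far too few to draw conclusions, so treat it only as an illustration of the API.

```python
import pandas as pd

# Five rows copied from the dataset preview (CTR already converted to a fraction)
sample = pd.DataFrame({
    'Clicks': [5223, 2809, 2077, 2012, 1931],
    'Impressions': [14578, 3456, 73380, 4959, 2528],
    'CTR': [0.3583, 0.8128, 0.0283, 0.4057, 0.7638],
    'Position': [1.61, 1.02, 5.94, 1.28, 1.09],
})

# Spearman correlates the ranks of the values, so it is less sensitive to outliers
spearman = sample.corr(method='spearman')
print(spearman.round(2))
```

On the full dataset, replacing `sample` with `data[['Clicks', 'Impressions', 'CTR', 'Position']]` would give a matrix that can be plotted with the same heatmap code.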

**Q.7: Detect anomalies in search queries using the Isolation Forest algorithm**

In [62]:
# Select the features for anomaly detection (e.g., Clicks, Impressions, CTR, or Position)
features = ['Clicks', 'Impressions', 'CTR', 'Position']
X = data[features]

# Train the Isolation Forest model
model = IsolationForest(contamination=0.01)  # Adjust the contamination parameter as needed
model.fit(X)

# Predict anomalies
anomaly_labels = model.predict(X)
anomalies = data[anomaly_labels == -1]

anomalies_subset = anomalies[['Top queries', 'Clicks', 'Impressions', 'CTR', 'Position']]


In [63]:
anomalies_subset.head(10)

                          Top queries  Clicks  Impressions     CTR  Position
0         number guessing game python    5223        14578  0.3583      1.61
1                 thecleverprogrammer    2809         3456  0.8128      1.02
2    python projects with source code    2077        73380  0.0283      5.94
4               the clever programmer    1931         2528  0.7638      1.09
11                  clever programmer    1243        21566  0.0576      4.82
15         rock paper scissors python    1111        35824  0.0310      7.19
21              classification report     933        39896  0.0234      7.53
34           machine learning roadmap     708        42715  0.0166      8.97
82                           r2 score     367        56322  0.0065      9.33
929                     python turtle      52        18228  0.0029     18.75
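
Isolation Forest builds its trees from random splits, so without a fixed seed the flagged rows can differ slightly between runs. The sketch below shows one way to make the result reproducible with `random_state` and to rank rows by anomaly score with `decision_function`. It runs on synthetic data (200 typical points plus one deliberately extreme point), which is an assumption made purely for illustration and is not the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for the four metrics: 200 typical points plus one extreme point
X = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
X = np.vstack([X, [50.0, 50.0, 50.0, 50.0]])

# random_state fixes the random splits so repeated runs flag the same rows
model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal

# Lower decision_function values mean more anomalous
scores = model.decision_function(X)
most_anomalous = int(np.argmin(scores))
print(most_anomalous, labels[most_anomalous])
```

In the notebook, the same scores could be attached to `data` as a new column and sorted, which gives a ranking of queries rather than only a binary anomaly label.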
