# Analysis for Search Traffic Breakdown

[T301902](https://phabricator.wikimedia.org/T301902)

The Search and Structured Data teams plan to improve the special:search experience on emerging language wikis, which generally have less content and fewer articles than the bigger Wikipedias.
In this analysis, we want to understand the breakdown of user search traffic on emerging language Wikipedias, so that we can estimate the scale of impact of the features planned for the special:search experiments.

Questions we want to answer:
 - Total search volume per wiki: what is the total number of searches in the go bar?
 - Autocomplete only: how many go bar searches never reach special:search?
 - Go bar-to-special:search volume per wiki:
   - What number/percentage of searches initiated in the go bar end up on the special page?
   - What number/percentage of queries that get redirected to special:search had no autocomplete suggestions?
   - What number/percentage of queries that have no autocomplete suggestions also have zero full-text search results (i.e. 0 autocomplete suggestions and 0 special:search results)? Inversely: what number/percentage of queries with no autocomplete suggestions do have results in special:search?
 - Click-through rates for autocomplete searches and special:search searches

For the search experiments, we are interested in the following emerging languages:

Priority 1: Arabic, Bengali*, Spanish, Portuguese*, Russian

Priority 2: French*, Korean*, Indonesian, Ukrainian, Thai*, Malaysian (?), Hindi, Tagalog, Afrikaans, Cantonese, Malayalam, Telugu

In another notebook, we pulled a reduced version of the search event data from the `searchsatisfaction` table and stored it in a new table, `cchen_search.search_events`.

In [1]:
import datetime as dt
import pandas as pd
import numpy as np

from wmfdata import hive, spark


In [2]:
start_date = dt.date(2022, 4, 3)
end_date = dt.date(2022, 4, 10)

## Number of searches in the Go bar

Searches in the go bar have an `input_location` in the header navigation (rather than in the page content) and start with autocomplete searches. We count every distinct `searchSessionId` + `pageViewId` combination (the `session_id` and `pageview_id` columns in `cchen_search.search_events`). A search session can consist of multiple autocomplete searches as the user types out their query, and counting distinct combinations collapses them into a single unit.
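
As an illustration of this counting rule, here is a minimal pandas sketch. The event rows are hypothetical toy data, not values from `cchen_search.search_events`; only the column names are taken from the queries in this notebook.

```python
import pandas as pd

# Hypothetical toy events: one row per autocomplete request sent while typing.
events = pd.DataFrame(
    {
        "wiki": ["afwiki", "afwiki", "afwiki", "afwiki"],
        "session_id": ["s1", "s1", "s1", "s2"],
        "pageview_id": ["p1", "p1", "p2", "p9"],
        "query": ["k", "ka", "kaap", "son"],
    }
)

# Collapse keystroke-level events into one search per (session_id, pageview_id).
searches = events.drop_duplicates(["session_id", "pageview_id"])

# Count the collapsed searches per wiki.
n_searches = searches.groupby("wiki").size()
print(n_searches)
```

The two events on page view `p1` count as one search, so the four rows yield three searches.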


In [39]:
total_search_query = '''

SELECT wiki,
       TO_DATE(dt) AS log_date,
       COUNT(DISTINCT session_id, pageview_id) AS n_searches
FROM cchen_search.search_events
WHERE user_is_bot = false
AND action = "searchResultPage"
AND source = "autocomplete"
AND input_location LIKE "%header%"
GROUP BY wiki, TO_DATE(dt)  

'''

In [40]:
go_bar_searches_daily =  spark.run(total_search_query)


In [41]:
go_bar_searches_daily = go_bar_searches_daily.loc[(go_bar_searches_daily['log_date'] >= start_date) &  
                                                  (go_bar_searches_daily['log_date'] <= end_date)]

In [42]:
# Sum only the count column so the date column is not aggregated.
go_bar_searches = go_bar_searches_daily.groupby('wiki', as_index=False)['n_searches'].sum()

In [43]:
go_bar_searches

| | wiki | n_searches |
|---|---|---|
| 0 | afwiki | 1437 |
| 1 | arwiki | 57338 |
| 2 | bnwiki | 7076 |
| 3 | eswiki | 831582 |
| 4 | frwiki | 1398610 |
| 5 | hiwiki | 2957 |
| 6 | idwiki | 64386 |
| 7 | kowiki | 277048 |
| 8 | mlwiki | 1298 |
| 9 | mswiki | 6106 |


## Number of Go bar-to-special:search

If users do not find the result they are looking for, or there is no exact article match in the go bar, they click and are redirected to the special:search page.
In this case, one Go bar-to-special:search consists of the following events:
1. A series of autocomplete searches that start in the go bar, with the input location in the header navigation;
2. A click action with the same session ID and page view ID as the preceding autocomplete searches;
3. A fulltext search event with the same session ID and search query as the last autocomplete search preceding (or temporally near) the fulltext search. These two searches have different page view IDs.

We find the searches that have these events and count every distinct `searchSessionId` + `pageViewId` combination.
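
The three conditions above can be sketched in pandas on a handful of events. The rows below are hypothetical toy data, and the sketch only mirrors the join logic (same session and query, different page view, plus a click on the autocomplete page view). It is not the query that produced the numbers in this notebook.

```python
import pandas as pd

# Hypothetical toy events using column names from cchen_search.search_events.
events = pd.DataFrame(
    [
        # Session s1: types in the go bar, clicks, lands on special:search.
        ("s1", "p1", "searchResultPage", "autocomplete", "kaap"),
        ("s1", "p1", "click", "autocomplete", None),
        ("s1", "p2", "searchResultPage", "fulltext", "kaap"),
        # Session s2: types in the go bar and never reaches special:search.
        ("s2", "p5", "searchResultPage", "autocomplete", "son"),
    ],
    columns=["session_id", "pageview_id", "action", "source", "query"],
)

is_serp = events["action"] == "searchResultPage"
auto = events[is_serp & (events["source"] == "autocomplete")]
full = events[is_serp & (events["source"] == "fulltext")]
click = events[(events["action"] == "click") & (events["source"] == "autocomplete")]

# Condition 3: same session and query, but a different page view.
matched = auto.merge(full, on=["session_id", "query"], suffixes=("_auto", "_full"))
matched = matched[matched["pageview_id_auto"] != matched["pageview_id_full"]]

# Condition 2: a click on the same page view as the autocomplete search.
click_keys = click[["session_id", "pageview_id"]].rename(
    columns={"pageview_id": "pageview_id_auto"}
)
matched = matched.merge(click_keys, on=["session_id", "pageview_id_auto"])

# Count distinct (session_id, pageview_id) pairs on the autocomplete side.
n_special = len(matched[["session_id", "pageview_id_auto"]].drop_duplicates())
print(n_special)
```

Only session `s1` satisfies all three conditions, so it is the only one counted.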

In [44]:
special_search_query = '''

WITH full_text AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query, results_returned
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "fulltext"
), auto AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query, results_returned
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "autocomplete"
    AND input_location LIKE "%header%"
), click AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "click"
    AND source = "autocomplete"
)

SELECT a.log_date, 
       a.wiki,
       COUNT(DISTINCT a.session_id, a.pageview_id) AS n_special_searches,
       COUNT(DISTINCT(CASE WHEN a.results_returned = 0 THEN (a.session_id, a.pageview_id) END)) AS zero_auto_searches,
       COUNT(DISTINCT(CASE WHEN a.results_returned = 0 AND f.results_returned IS NULL THEN (a.session_id, a.pageview_id) END)) AS zero_auto_special_searches
FROM full_text f 
  INNER JOIN auto a ON (f.session_id = a.session_id AND f.query = a.query AND f.log_date = a.log_date AND f.wiki = a.wiki)
  INNER JOIN click c ON (a.session_id = c.session_id AND a.pageview_id = c.pageview_id AND a.log_date = c.log_date AND a.wiki = c.wiki)
WHERE f.pageview_id != a.pageview_id
GROUP BY a.log_date, a.wiki

'''

special_searches_daily = spark.run(special_search_query)

In [45]:
special_searches_daily = special_searches_daily.loc[(special_searches_daily['log_date'] >= start_date) &
                                                    (special_searches_daily['log_date'] <= end_date)]

# Sum only the count columns so the date column is not aggregated.
count_cols = ['n_special_searches', 'zero_auto_searches', 'zero_auto_special_searches']
special_searches = special_searches_daily.groupby('wiki', as_index=False)[count_cols].sum()
search_metrics = go_bar_searches.merge(special_searches, on='wiki')
search_metrics['special_pct'] = (search_metrics['n_special_searches'] / search_metrics['n_searches']) * 100

In [46]:
search_metrics.loc[:,['wiki','n_special_searches','special_pct']]

| | wiki | n_special_searches | special_pct |
|---|---|---|---|
| 0 | afwiki | 565 | 39.318024 |
| 1 | arwiki | 20471 | 35.702327 |
| 2 | bnwiki | 2692 | 38.044093 |
| 3 | eswiki | 147765 | 17.769144 |
| 4 | frwiki | 267560 | 19.130422 |
| 5 | hiwiki | 1707 | 57.727426 |
| 6 | idwiki | 21506 | 33.401671 |
| 7 | kowiki | 86336 | 31.162831 |
| 8 | mlwiki | 440 | 33.898305 |
| 9 | mswiki | 1918 | 31.411726 |


**Number of searches redirected to special:search that had no autocomplete suggestions, i.e. whose autocomplete search has `results_returned` = 0.**

In [47]:
search_metrics['zero_auto_pct'] = (search_metrics['zero_auto_searches'] / search_metrics['n_special_searches']) * 100

In [48]:
search_metrics.loc[:,['wiki','zero_auto_searches','zero_auto_pct']]

| | wiki | zero_auto_searches | zero_auto_pct |
|---|---|---|---|
| 0 | afwiki | 410 | 72.566372 |
| 1 | arwiki | 15615 | 76.278638 |
| 2 | bnwiki | 2196 | 81.575037 |
| 3 | eswiki | 107394 | 72.678916 |
| 4 | frwiki | 199072 | 74.402751 |
| 5 | hiwiki | 1486 | 87.05331 |
| 6 | idwiki | 17102 | 79.521994 |
| 7 | kowiki | 54911 | 63.60151 |
| 8 | mlwiki | 336 | 76.363636 |
| 9 | mswiki | 1410 | 73.514077 |


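**Number of searches redirected to special:search that had no autocomplete suggestions and also had zero full-text search results.** The percentage is computed over all Go bar-to-special:search searches (`n_special_searches`), not only over those with zero autocomplete suggestions.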

In [49]:
search_metrics['zero_auto_special_pct'] = (search_metrics['zero_auto_special_searches'] / search_metrics['n_special_searches']) * 100

In [50]:
search_metrics.loc[:,['wiki','zero_auto_special_searches','zero_auto_special_pct']]

| | wiki | zero_auto_special_searches | zero_auto_special_pct |
|---|---|---|---|
| 0 | afwiki | 134 | 23.716814 |
| 1 | arwiki | 2153 | 10.517317 |
| 2 | bnwiki | 1014 | 37.667162 |
| 3 | eswiki | 10869 | 7.355598 |
| 4 | frwiki | 17170 | 6.417252 |
| 5 | hiwiki | 711 | 41.652021 |
| 6 | idwiki | 2470 | 11.485167 |
| 7 | kowiki | 8622 | 9.986564 |
| 8 | mlwiki | 170 | 38.636364 |
| 9 | mswiki | 196 | 10.218978 |
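
The choice of denominator changes how this percentage reads. As a small worked check, the sketch below uses the afwiki counts from the tables above. The first ratio reproduces `zero_auto_special_pct`. The second ratio, the share among zero-autocomplete searches only, is not computed in this notebook and is shown only as an alternative reading of the original question.

```python
# afwiki counts copied from the tables above.
n_special_searches = 565
zero_auto_searches = 410
zero_auto_special_searches = 134

# Share of all Go bar-to-special:search searches (this is zero_auto_special_pct).
share_of_special = zero_auto_special_searches / n_special_searches * 100

# Alternative, not computed in this notebook: share among searches
# that had zero autocomplete suggestions.
share_of_zero_auto = zero_auto_special_searches / zero_auto_searches * 100

print(round(share_of_special, 2), round(share_of_zero_auto, 2))
```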


## Number of autocomplete only searches in the Go bar

In [51]:
search_metrics['n_auto_searches'] = (search_metrics['n_searches'] -  search_metrics['n_special_searches']) 

In [52]:
search_metrics['auto_pct'] = (search_metrics['n_auto_searches'] / search_metrics['n_searches']) * 100

In [53]:
search_metrics.loc[:,['wiki','n_auto_searches','auto_pct']]

| | wiki | n_auto_searches | auto_pct |
|---|---|---|---|
| 0 | afwiki | 872 | 60.681976 |
| 1 | arwiki | 36867 | 64.297673 |
| 2 | bnwiki | 4384 | 61.955907 |
| 3 | eswiki | 683817 | 82.230856 |
| 4 | frwiki | 1131050 | 80.869578 |
| 5 | hiwiki | 1250 | 42.272574 |
| 6 | idwiki | 42880 | 66.598329 |
| 7 | kowiki | 190712 | 68.837169 |
| 8 | mlwiki | 858 | 66.101695 |
| 9 | mswiki | 4188 | 68.588274 |


## Click Through Rates

For autocomplete searches in the go bar, clicks have the same page view ID as the searches. In addition, the click position is greater than or equal to 0 (`click_position = -1` when the click is on the button that leads to the special:search page).
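
A minimal sketch of this click-through rate calculation follows. The click rows and the search count are hypothetical toy values, chosen only to show how the `click_position` filter and the distinct (`session_id`, `pageview_id`) count combine.

```python
import pandas as pd

# Hypothetical click events attached to go bar page views.
clicks = pd.DataFrame(
    {
        "session_id": ["s1", "s2", "s3"],
        "pageview_id": ["p1", "p2", "p3"],
        "click_position": [0, -1, 2],
    }
)
n_searches = 4  # hypothetical number of go bar searches in the same period

# Keep clicks on a suggestion; click_position = -1 leads to special:search.
suggestion_clicks = clicks[clicks["click_position"] >= 0]

# Count each (session_id, pageview_id) at most once, as in the SQL query.
n_clicks = len(suggestion_clicks.drop_duplicates(["session_id", "pageview_id"]))
auto_ctr = n_clicks / n_searches * 100
print(n_clicks, auto_ctr)
```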

In [54]:
auto_click_query = ''' 

WITH gobar_search AS (
  SELECT wiki, TO_DATE(dt) AS log_date, session_id, pageview_id
  FROM cchen_search.search_events
  WHERE user_is_bot = false
  AND action = "searchResultPage"
  AND source = "autocomplete"
  AND input_location LIKE "%header%"
)

SELECT 
    gs.log_date,
    s.wiki,
    COUNT(DISTINCT s.session_id, s.pageview_id) AS n_clicks
FROM cchen_search.search_events s
INNER JOIN gobar_search gs 
ON (s.wiki = gs.wiki AND gs.log_date = TO_DATE(s.dt) AND s.session_id = gs.session_id AND s.pageview_id = gs.pageview_id)
WHERE s.action = "click"
AND s.source = "autocomplete"
AND s.click_position >= 0 
GROUP BY gs.log_date, s.wiki

'''

In [55]:
auto_click_daily = spark.run(auto_click_query)


In [56]:
auto_click_daily = auto_click_daily.loc[(auto_click_daily['log_date'] >= start_date) &
                                        (auto_click_daily['log_date'] <= end_date)]

# Sum only the count column so the date column is not aggregated.
auto_clicks = auto_click_daily.groupby('wiki', as_index=False)['n_clicks'].sum()
go_bar_metrics = go_bar_searches.merge(auto_clicks, on='wiki')
go_bar_metrics['auto_ctr'] = (go_bar_metrics['n_clicks'] / go_bar_metrics['n_searches']) * 100

**Number of clicks and click-through rate for searches that start in the go bar.**

In [57]:
go_bar_metrics.loc[:,['wiki','n_clicks','auto_ctr']]

| | wiki | n_clicks | auto_ctr |
|---|---|---|---|
| 0 | afwiki | 553 | 38.482951 |
| 1 | arwiki | 27048 | 47.172905 |
| 2 | bnwiki | 3035 | 42.891464 |
| 3 | eswiki | 429048 | 51.59419 |
| 4 | frwiki | 799744 | 57.181344 |
| 5 | hiwiki | 576 | 19.479202 |
| 6 | idwiki | 28720 | 44.60597 |
| 7 | kowiki | 65627 | 23.687953 |
| 8 | mlwiki | 635 | 48.921418 |
| 9 | mswiki | 2107 | 34.507042 |


In [58]:
special_click_query = '''

WITH full_text AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "fulltext"
), auto AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "autocomplete"
    AND input_location LIKE "%header%"
), click AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "click"
    AND source = "autocomplete"
    AND click_position = -1
), visit AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, query
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "visitPage"
    AND source = "fulltext"
    AND click_position >= 0
)

SELECT
  a.log_date, 
  a.wiki,
  COUNT(DISTINCT a.session_id, a.pageview_id) AS n_special_clicks
FROM full_text f 
  INNER JOIN auto a ON (f.session_id = a.session_id AND f.query = a.query AND f.log_date = a.log_date AND f.wiki = a.wiki)
  INNER JOIN click c ON (a.session_id = c.session_id AND a.pageview_id = c.pageview_id AND a.log_date = c.log_date AND a.wiki = c.wiki)
  INNER JOIN visit v ON (f.session_id = v.session_id AND f.query = v.query AND f.log_date = v.log_date AND f.wiki = v.wiki)
WHERE f.pageview_id != a.pageview_id
GROUP BY a.log_date, a.wiki


'''

In [None]:
special_click_daily = spark.run(special_click_query)

In [36]:
special_click_daily = special_click_daily.loc[(special_click_daily['log_date'] >= start_date) &
                                              (special_click_daily['log_date'] <= end_date)]

# Sum only the count column so the date column is not aggregated.
special_clicks = special_click_daily.groupby('wiki', as_index=False)['n_special_clicks'].sum()
search_metrics = search_metrics.merge(special_clicks, on='wiki')
search_metrics['special_ctr'] = (search_metrics['n_special_clicks'] / search_metrics['n_special_searches']) * 100

**Number of clicks and click-through rate for Go bar-to-special:search searches.**

In [66]:
search_metrics.loc[:,['wiki','n_special_clicks','special_ctr']]

| | wiki | n_special_clicks | special_ctr |
|---|---|---|---|
| 0 | afwiki | 97 | 17.168142 |
| 1 | arwiki | 4521 | 22.084901 |
| 2 | bnwiki | 321 | 11.92422 |
| 3 | eswiki | 37388 | 25.302338 |
| 4 | frwiki | 77005 | 28.78046 |
| 5 | hiwiki | 300 | 17.574692 |
| 6 | idwiki | 4781 | 22.231005 |
| 7 | kowiki | 19935 | 23.09002 |
| 8 | mlwiki | 46 | 10.454545 |
| 9 | mswiki | 325 | 16.944734 |
