# Analysis for Search Traffic Breakdown

[T301902](https://phabricator.wikimedia.org/T301902)

The Search and Structured Data teams plan to improve the Special:Search experience on emerging-language wikis, which generally have less content (fewer articles) than the larger Wikipedias.
In this analysis, we want to understand the breakdown of user search traffic on emerging-language Wikipedias so that we can estimate the scale of impact of the features planned as part of the Special:Search experiments.

Questions we want to answer:
 - Total search volume per wiki: What is the total number of searches in the Go bar?
 - Autocomplete-only volume per wiki: How many Go bar searches never reach Special:Search?
 - Go bar-to-Special:Search volume per wiki:
   - What number/percentage of searches initiated in the Go bar end up on the Special:Search page?
   - What number/percentage of queries redirected to Special:Search had no autocomplete suggestions?
   - What number/percentage of queries with no autocomplete suggestions also have zero full-text search results (i.e. 0 autocomplete suggestions and 0 Special:Search results)? Conversely, what number/percentage of queries with no autocomplete suggestions do have results in Special:Search?

In this analysis, we are interested in the following emerging languages for the search experiments:

Priority 1: Arabic, Bengali*, Spanish, Portuguese*, Russian

Priority 2: French*, Korean*, Indonesian, Ukrainian, Thai*, Malaysian (?), Hindi, Tagalog, Afrikaans, Cantonese, Malayalam, Telugu

We pulled a reduced version of the search event data from the `searchsatisfaction` table in another notebook and stored it in a new table, `cchen_search.search_events`.

In [1]:
import datetime as dt
import pandas as pd
import numpy as np

from wmfdata import hive, spark


In [7]:
start_date = dt.date(2022, 4, 3)
end_date = dt.date(2022, 4, 10)

## Number of searches in the Go bar

Searches in the Go bar have an `input_location` in the header navigation (rather than in the page content) and start with autocomplete searches. We count every distinct `searchSessionId` + `pageViewId` combination (the `session_id` and `pageview_id` columns in the query). A search session can consist of multiple searches as the user types out their query, and this collapses them into a single unit.
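
As a minimal sketch of this counting rule, the toy example below collapses several autocomplete events into one search per distinct session and page view pair. The column names mirror those used in the query, but the event rows are invented for illustration and are not taken from `cchen_search.search_events`.

```python
import pandas as pd

# Toy autocomplete events (made-up values, for illustration only).
events = pd.DataFrame({
    "wiki": ["arwiki"] * 5,
    "session_id": ["s1", "s1", "s1", "s1", "s2"],
    "pageview_id": ["p1", "p1", "p1", "p2", "p9"],
    "query": ["c", "ca", "cat", "dog", "tree"],
})

# One search = one distinct (session_id, pageview_id) pair, no matter how
# many autocomplete events were logged while the user typed.
n_searches = (
    events.drop_duplicates(["wiki", "session_id", "pageview_id"])
          .groupby("wiki")
          .size()
          .rename("n_searches")
          .reset_index()
)
print(n_searches)
```

Here the three keystroke events on page view `p1` count as a single search, so the five events yield three searches.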


In [3]:
total_search_query = '''

SELECT wiki,
       TO_DATE(dt) AS log_date,
       COUNT(DISTINCT session_id, pageview_id) AS n_searches
FROM cchen_search.search_events
WHERE user_is_bot = false
AND action = "searchResultPage"
AND source = "autocomplete"
AND input_location LIKE "%header%"
GROUP BY wiki, TO_DATE(dt)  

'''

In [12]:
go_bar_searches_daily =  spark.run(total_search_query)


In [13]:
go_bar_searches_daily = go_bar_searches_daily.loc[(go_bar_searches_daily['log_date'] >= start_date) &  
                                                  (go_bar_searches_daily['log_date'] <= end_date)]

In [15]:
# Sum only the count column; log_date holds date objects and cannot be summed
go_bar_searches = go_bar_searches_daily.groupby('wiki', as_index=False)['n_searches'].sum()

In [16]:
go_bar_searches

| | wiki | n_searches |
|---|---|---|
| 0 | arwiki | 57338 |
| 1 | bnwiki | 7076 |
| 2 | eswiki | 831582 |
| 3 | ptwiki | 323991 |
| 4 | ruwiki | 1336483 |


## Number of Go bar-to-Special:Search searches

If users do not find the result they are looking for, or there is no exact article match in the Go bar, they click through and are redirected to the Special:Search page.
In this case, one Go bar-to-Special:Search search consists of the following events:
1. A series of autocomplete searches that start in the Go bar, with the input location in the header navigation;
2. A click action with the same `session_id` and `pageview_id` as the preceding autocomplete searches;
3. A full-text search event with the same session ID and search query as the last autocomplete search preceding it (or occurring close to it in time). These two searches have different page view IDs.

We find the searches that have these events and count every distinct `searchSessionId` + `pageViewId` combination.
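
The three conditions above can be sketched in pandas on toy data. This is an illustrative approximation of the matching logic, not the query used in the analysis: the rows are invented, and like the SQL it matches on session and query text rather than on event order.

```python
import pandas as pd

# Toy data (made-up values). Column names follow cchen_search.search_events.
auto = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s3"],
    "pageview_id": ["p1", "p1", "p5", "p7"],
    "query": ["ca", "cat", "tree", "moon"],
    "results_returned": [3, 0, 5, 0],
})
click = pd.DataFrame({
    "session_id": ["s1", "s2"],
    "pageview_id": ["p1", "p5"],
})
full_text = pd.DataFrame({
    "session_id": ["s1", "s3"],
    "pageview_id": ["p2", "p8"],
    "query": ["cat", "moon"],
})

# Condition 3: a full-text search sharing session and query with an
# autocomplete search, but logged on a different page view.
matched = full_text.merge(auto, on=["session_id", "query"], suffixes=("_f", "_a"))
matched = matched[matched["pageview_id_f"] != matched["pageview_id_a"]]

# Condition 2: a click on the same session and page view as the autocomplete search.
clicks = click.drop_duplicates().rename(columns={"pageview_id": "pageview_id_a"})
matched = matched.merge(clicks, on=["session_id", "pageview_id_a"])

# Count distinct (session, autocomplete page view) pairs.
pairs = matched.drop_duplicates(["session_id", "pageview_id_a"])
n_special = len(pairs)
n_zero_auto = int((pairs["results_returned"] == 0).sum())
print(n_special, n_zero_auto)
```

In this toy data only session `s1` qualifies: `s2` has a click but no full-text search, and `s3` has a full-text search but no click.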

In [24]:
special_search_query = '''

WITH full_text AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query, results_returned
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "fulltext"
), auto AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, query, results_returned
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "searchResultPage"
    AND source = "autocomplete"
    AND input_location LIKE "%header%"
), click AS (
    SELECT TO_DATE(dt) AS log_date, wiki, session_id, pageview_id, unique_id
    FROM cchen_search.search_events
    WHERE user_is_bot = false
    AND action = "click"
    AND source = "autocomplete"
)

SELECT a.log_date, 
       a.wiki,
       COUNT(DISTINCT a.session_id, a.pageview_id) AS n_special_searches,
       COUNT(DISTINCT(CASE WHEN a.results_returned = 0 THEN (a.session_id, a.pageview_id) END)) AS zero_auto_searches,
       COUNT(DISTINCT(CASE WHEN a.results_returned = 0 AND f.results_returned IS NULL THEN (a.session_id, a.pageview_id) END)) AS zero_auto_special_searches
FROM full_text f 
  LEFT JOIN auto a ON (f.session_id = a.session_id AND f.query = a.query AND f.log_date = a.log_date AND f.wiki = a.wiki)
  LEFT JOIN click c ON (a.session_id = c.session_id AND a.pageview_id = c.pageview_id AND a.log_date = c.log_date AND a.wiki = c.wiki)
WHERE c.unique_id IS NOT NULL
AND f.pageview_id != a.pageview_id
GROUP BY a.log_date, a.wiki

'''

In [25]:
spaecial_searches_daily =  spark.run(special_search_query)


In [26]:
spaecial_searches_daily = spaecial_searches_daily.loc[(spaecial_searches_daily['log_date'] >= start_date) &  
                                                  (spaecial_searches_daily['log_date'] <= end_date)]

In [31]:
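# numeric_only=True skips log_date, which holds date objects and cannot be summed
spaecial_searches = spaecial_searches_daily.groupby(['wiki']).sum(numeric_only=True).reset_index()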

In [33]:
search_metrics = go_bar_searches.merge(spaecial_searches, on= 'wiki')

In [37]:
search_metrics['special_pct'] = (search_metrics['n_special_searches'] / search_metrics['n_searches']) * 100

In [39]:
search_metrics.loc[:,['wiki','n_special_searches','special_pct']]

| | wiki | n_special_searches | special_pct |
|---|---|---|---|
| 0 | arwiki | 20471 | 35.702327 |
| 1 | bnwiki | 2692 | 38.044093 |
| 2 | eswiki | 147765 | 17.769144 |
| 3 | ptwiki | 78914 | 24.356849 |
| 4 | ruwiki | 148760 | 11.130706 |


Searches redirected to Special:Search that had no autocomplete suggestions, i.e. whose autocomplete search has `results_returned` = 0, as a percentage of all Go bar-to-Special:Search searches:

In [40]:
search_metrics['zero_auto_pct'] = (search_metrics['zero_auto_searches'] / search_metrics['n_special_searches']) * 100

In [41]:
search_metrics.loc[:,['wiki','zero_auto_searches','zero_auto_pct']]

| | wiki | zero_auto_searches | zero_auto_pct |
|---|---|---|---|
| 0 | arwiki | 15615 | 76.278638 |
| 1 | bnwiki | 2196 | 81.575037 |
| 2 | eswiki | 107394 | 72.678916 |
| 3 | ptwiki | 61797 | 78.309299 |
| 4 | ruwiki | 103010 | 69.245765 |


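Searches redirected to Special:Search that had no autocomplete suggestions and also had zero full-text search results (identified in the query by `f.results_returned IS NULL`), as a percentage of all Go bar-to-Special:Search searches: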

In [43]:
search_metrics['zero_auto_special_pct'] = (search_metrics['zero_auto_special_searches'] / search_metrics['n_special_searches']) * 100

In [44]:
search_metrics.loc[:,['wiki','zero_auto_special_searches','zero_auto_special_pct']]

| | wiki | zero_auto_special_searches | zero_auto_special_pct |
|---|---|---|---|
| 0 | arwiki | 2153 | 10.517317 |
| 1 | bnwiki | 1014 | 37.667162 |
| 2 | eswiki | 10869 | 7.355598 |
| 3 | ptwiki | 7190 | 9.111184 |
| 4 | ruwiki | 15757 | 10.592229 |
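
The percentages in the table above use `n_special_searches` as the denominator. The original question can also be read as a conditional share: of the searches with no autocomplete suggestions, how many also returned no full-text results, and how many did return results? The sketch below computes that alternative reading from the counts already reported in this notebook. The two new column names are hypothetical, and the result assumes that `zero_auto_special_searches` correctly identifies zero-result full-text searches.

```python
import pandas as pd

# Counts copied from the tables above.
counts = pd.DataFrame({
    "wiki": ["arwiki", "bnwiki", "eswiki", "ptwiki", "ruwiki"],
    "zero_auto_searches": [15615, 2196, 107394, 61797, 103010],
    "zero_auto_special_searches": [2153, 1014, 10869, 7190, 15757],
})

# Share of zero-autocomplete searches that also had no full-text results
# (hypothetical column name).
counts["zero_special_given_zero_auto_pct"] = (
    counts["zero_auto_special_searches"] / counts["zero_auto_searches"] * 100
)

# Inverse: share of zero-autocomplete searches that did get full-text results
# (hypothetical column name).
counts["has_special_given_zero_auto_pct"] = (
    100 - counts["zero_special_given_zero_auto_pct"]
)

print(counts.round(2))
```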


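## Number of autocomplete-only searches in the Go bar

Autocomplete-only searches are the Go bar searches that did not continue to Special:Search, so they are computed as the difference between the total Go bar searches and the Go bar-to-Special:Search searches.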

In [45]:
search_metrics['n_auto_searches'] = (search_metrics['n_searches'] -  search_metrics['n_special_searches']) 

In [46]:
search_metrics['auto_pct'] = (search_metrics['n_auto_searches'] / search_metrics['n_searches']) * 100

In [48]:
search_metrics.loc[:,['wiki','n_auto_searches','auto_pct']]

| | wiki | n_auto_searches | auto_pct |
|---|---|---|---|
| 0 | arwiki | 36867 | 64.297673 |
| 1 | bnwiki | 4384 | 61.955907 |
| 2 | eswiki | 683817 | 82.230856 |
| 3 | ptwiki | 245077 | 75.643151 |
| 4 | ruwiki | 1187723 | 88.869294 |
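
Because `n_auto_searches` is derived by subtraction, `auto_pct` and `special_pct` sum to 100 for every wiki by construction. The short check below rebuilds the two columns from the counts reported in this notebook and confirms that. It is a sanity check on the arithmetic, not an independent measurement of autocomplete-only traffic.

```python
import pandas as pd

# Counts copied from the tables earlier in this notebook.
metrics = pd.DataFrame({
    "wiki": ["arwiki", "bnwiki", "eswiki", "ptwiki", "ruwiki"],
    "n_searches": [57338, 7076, 831582, 323991, 1336483],
    "n_special_searches": [20471, 2692, 147765, 78914, 148760],
})

metrics["n_auto_searches"] = metrics["n_searches"] - metrics["n_special_searches"]
metrics["auto_pct"] = metrics["n_auto_searches"] / metrics["n_searches"] * 100
metrics["special_pct"] = metrics["n_special_searches"] / metrics["n_searches"] * 100

# The two shares partition the Go bar searches, so they must add up to 100.
assert ((metrics["auto_pct"] + metrics["special_pct"] - 100).abs() < 1e-9).all()
print(metrics[["wiki", "n_auto_searches", "auto_pct", "special_pct"]])
```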
