# Google natural questions (scraped)
This notebook aims to answer the following research question: **Which natural questions do users ask when making dictionary-based queries?** We will use the Google NLQ dataset for this purpose.

In particular, given the one-shot nature of Google NLQ, we will try to find such questions by searching for "useful" linguistic keywords.

Unfortunately, the simplified dataset is gated behind a Google login page, so we have to download the full dataset.

In [1]:
!pip3 install gsutil==4.51



In [2]:
from tools.dumps import download_to, get_filename_path
import fnmatch
import os

In [3]:
%%capture
get_filename_path("GoogleNLQ/dummy")

In [3]:
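# Be wary! This will download the full dataset (about 41 GB).
# Target directory; it resolves to data/GoogleNLQ, the path the later cells read from.
googlenlq_dir = get_filename_path("GoogleNLQ")
!gsutil -m cp -R gs://natural_questions/v1.0 $googlenlq_dir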

Copying gs://natural_questions/v1.0/LICENSE.txt...
Copying gs://natural_questions/v1.0/README.txt...                               
Copying gs://natural_questions/v1.0/dev/nq-dev-00.jsonl.gz...                   
Copying gs://natural_questions/v1.0/dev/nq-dev-01.jsonl.gz...                   
Copying gs://natural_questions/v1.0/dev/nq-dev-02.jsonl.gz...                   
Copying gs://natural_questions/v1.0/dev/nq-dev-03.jsonl.gz...                   
Copying gs://natural_questions/v1.0/dev/nq-dev-04.jsonl.gz...                   
Copying gs://natural_questions/v1.0/sample/nq-dev-sample.jsonl.gz...            
Copying gs://natural_questions/v1.0/sample/nq-train-sample.jsonl.gz...          
Copying gs://natural_questions/v1.0/train/nq-train-00.jsonl.gz...               
Copying gs://natural_questions/v1.0/train/nq-train-01.jsonl.gz...               
Copying gs://natural_questions/v1.0/train/nq-train-02.jsonl.gz...
Copying gs://natural_questions/v1.0/train/nq-train-04.jsonl.gz...
Copying
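Only the first ten training shards are read later in this notebook, so copying everything may be unnecessary. The following is a minimal sketch, not part of the original workflow, that builds the URIs of the first few shards from the naming scheme visible in the output above. The helper name `shard_uris` is hypothetical.

```python
def shard_uris(split, num_shards, bucket="gs://natural_questions/v1.0"):
    """Return the URIs of the first `num_shards` shards of a split ("train" or "dev")."""
    return [f"{bucket}/{split}/nq-{split}-{i:02d}.jsonl.gz" for i in range(num_shards)]

uris = shard_uris("train", 10)
print(uris[0])
# The joined string could be passed to `gsutil -m cp` as its source arguments.
print(" ".join(uris))
```

If this route is taken, the destination directory would have to mirror the `v1.0/train/` layout expected by the cells that read the data.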

## Pyserini Analysis

To search the questions with Pyserini, we first need to build an index over them. For that purpose, we keep only the `example_id` and `question_text` columns, which contain the actual ids and questions, respectively.
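Anserini's `JsonCollection` reads documents that carry an `id` and a `contents` field, which is why the two columns are renamed before export. As a rough illustration of the target layout, here is a plain-Python sketch with a made-up example. The helper `to_pyserini_record` and the cast of the id to a string are assumptions for illustration, not what the Spark cell below does.

```python
import json

def to_pyserini_record(example):
    """Map one Natural Questions example to the {"id", "contents"} layout."""
    return {"id": str(example["example_id"]), "contents": example["question_text"]}

# Hypothetical example; only the two relevant fields are shown.
examples = [{"example_id": 1, "question_text": "how do you say bless you in italian"}]

# One JSON object per line (JSON Lines), keeping non-ASCII characters readable.
lines = [json.dumps(to_pyserini_record(e), ensure_ascii=False) for e in examples]
print("\n".join(lines))
```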

In [4]:
import pyspark
from pyspark.sql import SparkSession

# TODO: check that Arrow is properly set up.
sc = pyspark.SparkContext()
spark = SparkSession(sc)

In [5]:
from pyspark.sql.functions import col

In [59]:
num_slices = 10
# Paths of the first `num_slices` training shards (nq-train-00 ... nq-train-09).
googlenlq_filename = [f"data/GoogleNLQ/v1.0/train/nq-train-{i:02d}.jsonl.gz" for i in range(num_slices)]

nlq_df = spark.read.json(googlenlq_filename)

In [60]:
nlq_df.printSchema()

root
 |-- annotations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotation_id: decimal(20,0) (nullable = true)
 |    |    |-- long_answer: struct (nullable = true)
 |    |    |    |-- candidate_index: long (nullable = true)
 |    |    |    |-- end_byte: long (nullable = true)
 |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |-- start_byte: long (nullable = true)
 |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- short_answers: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- end_byte: long (nullable = true)
 |    |    |    |    |-- end_token: long (nullable = true)
 |    |    |    |    |-- start_byte: long (nullable = true)
 |    |    |    |    |-- start_token: long (nullable = true)
 |    |    |-- yes_no_answer: string (nullable = true)
 |-- document_html: string (nullable = true)
 |-- document_title: string (nullable = true)
 |-- document_to

In [61]:
overall_example_count = nlq_df.count()
single_slice_count = overall_example_count // num_slices

print(f"Overall we extracted {overall_example_count} examples. We estimate a single slice contains {single_slice_count} examples.")

Overall we extracted 61477 examples. We estimate a single slice contains 6147 examples.
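The per-slice figure comes from integer division, so it is only an average and the remainder is dropped. A quick check with the count reported above:

```python
overall_example_count = 61477  # count reported by the cell above
num_slices = 10

# divmod returns the quotient and the remainder of the integer division.
per_slice, remainder = divmod(overall_example_count, num_slices)
print(per_slice, remainder)
```

The individual shards need not all have exactly this size.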


In [64]:
simplified_path = get_filename_path("GoogleNLQ/simplified_questions.jsonl")

nlq_df.select([col('example_id').alias("id"),
                  col("question_text").alias("contents")])\
        .toPandas().to_json(simplified_path, orient='records', force_ascii=False, lines=True)

In [24]:
# Just some cruft to make Pyjnius happy, see https://github.com/kivy/pyjnius/issues/304
# In order not to break pyspark, this has to be done AFTER a spark session is
# instantiated.

java_path = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["JAVA_HOME"] = java_path
os.environ["JDK_HOME"] = java_path
# PyJnius internally invokes javac to extract the JNI symbols
# via a shared object in the JRE.
os.environ["PATH"] =  f"{java_path}/bin:" + os.environ["PATH"]

In [65]:
index_path = get_filename_path("GoogleNLQ/lucene-index.googlenlq-simplified_questions.pos+docvectors+raw")
simplified_path_folder = get_filename_path("GoogleNLQ")

!./3rdparty/anserini/target/appassembler/bin/IndexCollection \
    -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
    -threads 10 -input $simplified_path_folder -index $index_path -storePositions \
    -storeDocvectors -storeRaw

2020-08-19 19:26:49,160 INFO  [main] index.IndexCollection (IndexCollection.java:636) - Setting log level to INFO
2020-08-19 19:26:49,161 INFO  [main] index.IndexCollection (IndexCollection.java:639) - Starting indexer...
2020-08-19 19:26:49,162 INFO  [main] index.IndexCollection (IndexCollection.java:641) - DocumentCollection path: data/GoogleNLQ
2020-08-19 19:26:49,162 INFO  [main] index.IndexCollection (IndexCollection.java:642) - CollectionClass: JsonCollection
2020-08-19 19:26:49,162 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Generator: DefaultLuceneDocumentGenerator
2020-08-19 19:26:49,162 INFO  [main] index.IndexCollection (IndexCollection.java:644) - Threads: 10
2020-08-19 19:26:49,162 INFO  [main] index.IndexCollection (IndexCollection.java:645) - Stemmer: porter
2020-08-19 19:26:49,163 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Keep stopwords? false
2020-08-19 19:26:49,163 INFO  [main] index.IndexCollection (IndexCollection.java:647) 

Now we get a list of the 30 most widely spoken languages in the world according to Wikidata.

In [32]:
from tools.sparql_wrapper import wikidata_sparql

language_list = wikidata_sparql.run_query("""
SELECT DISTINCT ?lang

WHERE
{
    ?langEntity wdt:P31 wd:Q34770;
                  wdt:P1098 ?num.
    ?langEntity rdfs:label ?lang.
    FILTER(LANG(?lang) = "en").
}

ORDER BY DESC(?num)
LIMIT 30
""", keep_namespaces=True)

In [33]:
language_list

               lang
0           Chinese
1  Mandarin Chinese
2  Standard Chinese
3           English
4           Spanish
5    Standard Hindi
6             Hindi
7            Arabic
8        Portuguese
9           Bengali


In [35]:
from functools import reduce

most_common_languages = set(language_list["lang"].str.lower().array)

how_do_you_say = {"how do you say", "how does one say",
                      "what is the translation of",
                      "how does one translate"}

# "in" and "into" are stopwords and may be dropped at indexing time, but let us see if it works...
in_lang = {"in " + lang for lang in most_common_languages }.union({'into ' + lang for lang in most_common_languages})

# These keywords are arguably domain-specific.
keywords = {'translate', 'definition', 'mean', 'meaning',
              'singular', 'plural', 'conjugation', 'conjugate', 'declinate',
              'noun', 'verb', 'adjective', 'pronoun', 'comparative',
              'superlative', 'irregular', 'definition', 'synonyms', 'language',
              'linguistic'}

#search_terms = reduce(lambda a, b: a.union(b), [most_common_languages, how_do_you_say,
#                                                         in_lang, keywords])

#search_terms_regex = "|".join(search_terms)
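The commented-out lines above join every search term into one regex alternation. A plain alternation matches substrings, so a term such as "german" would also match "germany". The sketch below, using Python's `re` module and a hypothetical subset of the terms, compares that with a variant that escapes each term and requires word boundaries. The function name `build_pattern` is an assumption for illustration.

```python
import re

def build_pattern(terms, whole_words=True):
    """Join terms into one alternation; optionally require word boundaries."""
    # Longest terms first, so multi-word phrases are tried before shorter ones.
    alternation = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    return re.compile(rf"\b(?:{alternation})\b" if whole_words else alternation)

# Hypothetical subset of the search terms, for illustration only.
terms = {"german", "mean", "how do you say"}

loose = build_pattern(terms, whole_words=False)   # substring matching
strict = build_pattern(terms)                     # whole-word matching

question = "when did the us start fighting germany in ww2"
print(bool(loose.search(question)), bool(strict.search(question)))
```

Whole-word matching trades recall for precision: it would also stop "mean" from matching "meaning", which may or may not be desirable here.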

In [66]:
from pyserini.search import SimpleSearcher

searcher = SimpleSearcher(index_path)

In [67]:
def predict_print(queries):
    for query, hits in searcher.batch_search(list(queries), list(queries)).items():
        print(f"for query: {query}")
        for hit in hits:
            print(hit.score, hit.raw)
        print("=======")

print("Retrieving basic how-to queries")
predict_print(how_do_you_say)
    
print("\n\nRetrieving domain-specific linguistic keyword-based queries")
predict_print(keywords)

print("\n\nRetrieving language keyword-based queries")
predict_print(in_lang)

Retrieving basic how-to queries
for query: what is the translation of
5.124899864196777 what is the grail translation of the psalms
5.124898910522461 what is the niv translation of the bible
5.1248979568481445 what is the niv translation of the bible
4.968299865722656 what is the first english translation of the bible
4.96829891204834 during the process of translation what is produced
4.968297958374023 what is the most reliable translation of the bible
4.821000099182129 what is the most common bible translation in english
4.8209991455078125 what is the greek translation of philadelphia located in pennsylvania
4.551199913024902 translation of antibody proteins in eukaryotic cells is associated with what organelle
4.005000114440918 lucretius on the nature of things english translation
for query: how do you say
8.14109992980957 how do you say bless you in italian
8.141098976135254 how do you say bless you in italian
7.827099800109863 how do you say the capital of iowa
7.6016998291015625 h
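The output above lists some questions more than once with almost identical scores. One possible way to collapse such repeats, sketched here on `(score, text)` pairs copied from that output, is to keep only the first occurrence of each text. The helper `unique_hits` and the tuple layout are assumptions; with Pyserini the pairs could be built from `hit.score` and `hit.raw`.

```python
def unique_hits(hits):
    """Keep the first (highest-ranked) occurrence of each question text."""
    seen = set()
    unique = []
    for score, text in hits:
        if text not in seen:
            seen.add(text)
            unique.append((score, text))
    return unique

# (score, text) pairs copied from the output above.
hits = [
    (8.14109992980957, "how do you say bless you in italian"),
    (8.141098976135254, "how do you say bless you in italian"),
    (7.827099800109863, "how do you say the capital of iowa"),
]

for score, text in unique_hits(hits):
    print(score, text)
```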

In [None]:
for query in in_lang:
    

In [73]:
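from pyspark.sql.functions import lower

# Rebuild the search terms, whose definition is commented out in the earlier cell.
search_terms = most_common_languages | how_do_you_say | in_lang | keywords
search_terms_regex = "|".join(search_terms)

def contains(question_text):
    # Plain-Python equivalent of the regex filter below, for a single question string.
    return any(expression in question_text for expression in search_terms)

# Keep up to 100 questions that contain at least one search term (substring match).
returned_rows = nlq_df.select("question_text") \
      .where(lower(nlq_df.question_text).rlike(search_terms_regex)) \
      .limit(100) \
      .collect()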


In [74]:
returned_rows

[Row(question_text='when did the us start fighting germany in ww2'),
 Row(question_text='who created the dothraki language on game of thrones'),
 Row(question_text='the man who set up the first spanish colony in the new world was'),
 Row(question_text='see no evil hear no evil speak no evil skulls meaning'),
 Row(question_text='the word theatre comes from greek and literally means seeing place'),
 Row(question_text='what is the meaning of kinetic molecular theory'),
 Row(question_text='who won the english football cup in 1949'),
 Row(question_text='who is going to get eliminated in bigg boss telugu'),
 Row(question_text='english colonies in north america established a form of blank based on elections'),
 Row(question_text='what is the longest english word in which no letter is repeated'),
 Row(question_text='what is the meaning of love me like you do in hindi'),
 Row(question_text='what is the meaning of llc in a company'),
 Row(question_text='took a pill in ibiza meaning of song'),
 R