# Challenge definition

**Wikipedia Search (30-60 min)**

Wikipedia has a [free open search API](https://www.mediawiki.org/wiki/API:Search).

In the language and framework of your choice, write a piece of code that **queries** the pages containing the term 絵文字 in both English and Japanese pages, limiting the search to the first 100 pages for each language.

From the combined queried pages, **output** the page title, page size and page language ranked by page size. Only output the first 20 results.

---

# Solution

The linked documentation covers everything this task needs.

The request can be configured through the API's query parameters:
- `srsearch` is the search string; the API matches it against page titles and content.
- `srlimit` caps the number of results returned per request; setting it to 100 meets the per-language limit.
- `srprop` selects which properties are returned for each result. Setting it to `size` returns each page's size in bytes, which serves as the ranking metric.
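
To make the parameter choices concrete, the following sketch builds the request URL without sending it. The helper name `build_search_url` and the example term `emoji` are assumptions made for illustration; only the parameter names come from the API documentation.

```python
from urllib.parse import urlencode


def build_search_url(api_url: str, search_term: str, limit: int = 100) -> str:
    """Build (but do not send) a MediaWiki search request URL."""
    params = {
        "action": "query",        # use the query module
        "list": "search",         # run a full-text search
        "srsearch": search_term,  # term to look for
        "srlimit": limit,         # maximum number of results
        "srprop": "size",         # include each page's size in bytes
        "format": "json",         # ask for a JSON response
    }
    return f"{api_url}?{urlencode(params)}"


# Hypothetical example term; a non-ASCII term would be percent-encoded by urlencode
url = build_search_url("https://en.wikipedia.org/w/api.php", "emoji", limit=100)
print(url)
```

Printing the URL is a quick way to check the parameters in a browser before wiring them into code.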

With the parameters set, we call the API once for English Wikipedia and once for Japanese Wikipedia. From each response we extract `title` and `size`, tag each record with its `language`, and store the records as a list of dictionaries. We then combine the two lists, sort them by page size in descending order, and output the top 20 results.
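
The combine, sort and truncate step can be tried offline before calling the API. The sketch below uses made-up records rather than real API output, and the helper name `top_pages_by_size` is hypothetical.

```python
def top_pages_by_size(*result_lists, n=20):
    """Merge several result lists and return the n largest pages."""
    combined = [page for results in result_lists for page in results]
    # Largest page first, then keep only the first n entries
    return sorted(combined, key=lambda page: page["size"], reverse=True)[:n]


# Made-up sample records for illustration only
english_sample = [
    {"title": "Page A", "size": 300, "language": "English"},
    {"title": "Page B", "size": 100, "language": "English"},
]
japanese_sample = [
    {"title": "Page C", "size": 200, "language": "Japanese"},
]

top = top_pages_by_size(english_sample, japanese_sample, n=2)
for rank, page in enumerate(top, start=1):
    print(f"{rank}. {page['title']} ({page['language']}) - {page['size']} bytes")
```

Because the lists are merged before sorting, the final ranking can contain any mix of the two languages.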

An implementation of this approach in Python follows:

### Install required libraries

In [1]:
!pip install -r requirements.txt



### Import required libraries

In [2]:
import requests

In [3]:
def query_wikipedia(api_url: str, language: str, search_term: str, limit: int) -> list:
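    """Query a Wikipedia search API and return the title, size and language of each matching page.

    Args:
        api_url (str): URL of the Wikipedia API endpoint to query.
        language (str): Language label attached to every returned record.
        search_term (str): Term to search for in page titles and content.
        limit (int): Maximum number of pages to return.

    Returns:
        list: One dict per page with keys "title", "size" and "language";
            empty if the request fails.
    """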
    params = {
        "action": "query",
        "list": "search",
        "srsearch": search_term,
        "srlimit": limit,
        "format": "json",
        "srprop": "size",
    }

    # A timeout keeps the call from hanging indefinitely on a stalled connection
    response = requests.get(api_url, params=params, timeout=10)
    if response.status_code == 200:
        data = response.json()
        return [
            {"title": page["title"], "size": page["size"], "language": language}
            for page in data["query"]["search"]
        ]
    else:
        print(f"Failed to fetch data from {language} Wikipedia")
        return []
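
The list comprehension in `query_wikipedia` assumes that every 200 response contains `data["query"]["search"]`. A body that lacks those keys, such as one carrying an `error` object instead, would raise a `KeyError`. A more defensive variant of the extraction step might look like the sketch below; `extract_pages` and both sample payloads are hypothetical and are not part of the original solution.

```python
def extract_pages(data: dict, language: str) -> list:
    """Return title/size/language records from a decoded search response."""
    # Fall back to empty containers when "query" or "search" is absent
    hits = data.get("query", {}).get("search", [])
    return [
        {"title": hit["title"], "size": hit["size"], "language": language}
        for hit in hits
    ]


# Made-up payloads for illustration, not real API output
ok_payload = {"query": {"search": [{"title": "Example page", "size": 1234}]}}
error_payload = {"error": {"code": "example-code", "info": "example message"}}

pages = extract_pages(ok_payload, "English")
no_pages = extract_pages(error_payload, "English")
print(pages)
print(no_pages)
```

Keeping the extraction separate from the HTTP call also makes it testable without network access.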

### Set parameters

In [4]:
# Wikipedia API URLs
EN_WIKI_API = "https://en.wikipedia.org/w/api.php"
JA_WIKI_API = "https://ja.wikipedia.org/w/api.php"

# Search term
SEARCH_TERM = "絵文字"
LIMIT = 100

In [5]:
# Query English and Japanese Wikipedia
english_results = query_wikipedia(EN_WIKI_API, "English", SEARCH_TERM, LIMIT)
japanese_results = query_wikipedia(JA_WIKI_API, "Japanese", SEARCH_TERM, LIMIT)

# Combine results from both languages
combined_results = english_results + japanese_results

# Sort results by page size in descending order
sorted_results = sorted(combined_results, key=lambda x: x["size"], reverse=True)

# Output the first 20 results
top_20_results = sorted_results[:20]

# Display the results
for idx, result in enumerate(top_20_results, start=1):
    print(f"{idx}. {result['title']} ({result['language']}) - {result['size']} bytes")

1. Unicode6.0の携帯電話の絵文字の一覧 (Japanese) - 152640 bytes
2. Tomokazu Sugita (English) - 144111 bytes
3. UnicodeのEmojiの一覧 (Japanese) - 125658 bytes
4. Shin-ichiro Miki (English) - 120314 bytes
5. Takahiro Sakurai (English) - 111199 bytes
6. Emoji (English) - 107096 bytes
7. 囲み文字 (Japanese) - 102499 bytes
8. Au (携帯電話) (Japanese) - 95109 bytes
9. List of Japanese loanwords in Indonesian (English) - 86042 bytes
10. 文字 (Japanese) - 85694 bytes
11. Takehito Koyasu (English) - 81287 bytes
12. ロンゴロンゴ (Japanese) - 81024 bytes
13. Jouji Nakata (English) - 74380 bytes
14. 国旗の一覧 (Japanese) - 67466 bytes
15. 地上天気図 (Japanese) - 55690 bytes
16. アナトリア象形文字 (Japanese) - 52699 bytes
17. Moai (English) - 49249 bytes
18. 7月17日 (Japanese) - 45715 bytes
19. Kōji Yusa (English) - 43175 bytes
20. 携帯電話の絵文字 (Japanese) - 38611 bytes
