# Wikipedia Data Scraping: A Guide to Using the Wikipedia API

In this Colab notebook, we scrape data from Wikipedia using its MediaWiki API through the Python wrapper library [wikipedia](https://pypi.org/project/wikipedia/). This demo notebook gives students a basic idea of the Wikipedia API and of how to extract relevant data from its responses.

Please note that you are not required to use this particular API for your data scraping task. Feel free to use any other library as long as it serves the purpose.

### Installing and Importing Necessary Libraries


In [None]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=8f99af6c7425ccf66d6a0ddc3c1379b6623ba48a37143628941d9c72c63c8b9f
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [None]:
import wikipedia

Run `help()` whenever you get stuck. For example, if you don't know how to use the API, you can read its documentation with `help(wikipedia)`.

In [None]:
help(wikipedia)

Help on package wikipedia:

NAME
    wikipedia

PACKAGE CONTENTS
    exceptions
    util
    wikipedia

DATA
    API_URL = 'http://en.wikipedia.org/w/api.php'
    ODD_ERROR_MESSAGE = "This shouldn't happen. Please report on GitHub: g...
    RATE_LIMIT = False
    RATE_LIMIT_LAST_CALL = None
    RATE_LIMIT_MIN_WAIT = None
    USER_AGENT = 'wikipedia (https://github.com/goldsmith/Wikipedia/)'
    geosearch = <wikipedia.util.cache object>
        Do a wikipedia geo search for `latitude` and `longitude`
        using HTTP API described in http://www.mediawiki.org/wiki/Extension:GeoData
        
        Arguments:
        
        * latitude (float or decimal.Decimal)
        * longitude (float or decimal.Decimal)
        
        Keyword arguments:
        
        * title - The title of an article to search for
        * results - the maximum number of results returned
        * radius - Search radius in meters. The value must be between 10 and 10000
    
    languages = <wikipedia.util.c

### Searching for a Specific Keyword

`wikipedia.search` runs a Wikipedia search for `query`. By default, it returns 10 relevant pages (documents) for the query; you can change the `results` parameter to adjust the number of retrieved results.

In [None]:
search_results = wikipedia.search("Information Retrieval", results=10)
search_results

['Information retrieval',
 'Precision and recall',
 'Evaluation measures (information retrieval)',
 'Relevance (information retrieval)',
 'Music information retrieval',
 'Thesaurus (information retrieval)',
 'Ranking (information retrieval)',
 'Private information retrieval',
 'Legal information retrieval',
 'Cross-language information retrieval']
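
As a rough sketch of how the `results` parameter might be used when building a larger collection, the helper below merges the titles returned for several queries and drops duplicates while keeping their first-seen order. The names `collect_titles` and `search_fn` are hypothetical, introduced here only for illustration; in this notebook you would pass `wikipedia.search` as `search_fn`.

```python
def collect_titles(queries, search_fn, results=10):
    """Merge search results for several queries, keeping first-seen order.

    `search_fn` is assumed to behave like `wikipedia.search`: it accepts a
    query string and a `results` keyword and returns a list of page titles.
    """
    seen = set()
    titles = []
    for query in queries:
        for title in search_fn(query, results=results):
            if title not in seen:  # skip titles already returned by an earlier query
                seen.add(title)
                titles.append(title)
    return titles


# Usage in the notebook (requires network access):
#   import wikipedia
#   collect_titles(["Information Retrieval", "Search engine"], wikipedia.search, results=5)
```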

Now we look at what each page contains. Here we retrieve the contents of the page titled 'Information retrieval'. We also set the `auto_suggest` flag to `False`, which prevents the library from replacing the title 'Information retrieval' with a suggested alternative.

In [None]:
content = wikipedia.page(search_results[0], auto_suggest=False)
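
Page lookups can fail, for example when a title is ambiguous or does not exist. The sketch below tries a list of candidate titles and returns the first page that loads. `first_loadable_page`, `page_fn`, and `skip_errors` are hypothetical names for this illustration; with the `wikipedia` package, one would typically pass `wikipedia.page` together with the exception classes `wikipedia.exceptions.DisambiguationError` and `wikipedia.exceptions.PageError`.

```python
def first_loadable_page(titles, page_fn, skip_errors):
    """Return (title, page) for the first title that `page_fn` can load.

    `page_fn` is assumed to behave like `wikipedia.page`. Titles whose lookup
    raises one of the exception classes in `skip_errors` (a tuple) are skipped;
    any other exception propagates. Returns (None, None) if nothing loads.
    """
    for title in titles:
        try:
            return title, page_fn(title, auto_suggest=False)
        except skip_errors:
            continue  # ambiguous or missing title: try the next candidate
    return None, None


# Usage in the notebook (requires network access):
#   errors = (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError)
#   title, content = first_loadable_page(search_results, wikipedia.page, errors)
```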

We can list the properties and methods available on the `content` object to retrieve the relevant information:

In [None]:
help(content)

Help on WikipediaPage in module wikipedia.wikipedia object:

class WikipediaPage(builtins.object)
 |  WikipediaPage(title=None, pageid=None, redirect=True, preload=False, original_title='')
 |  
 |  Contains data from a Wikipedia page.
 |  Uses property methods to filter data from the raw HTML.
 |  
 |  Methods defined here:
 |  
 |  __eq__(self, other)
 |      Return self==value.
 |  
 |  __init__(self, title=None, pageid=None, redirect=True, preload=False, original_title='')
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  html(self)
 |      Get full page HTML.
 |      
 |  
 |  section(self, section_title)
 |      Get the plain text content of a section from `self.sections`.
 |      Returns None if `section_title` isn't found, otherwise returns a whitespace stripped string.
 |      
 |      This is a convenience method that wraps self.content.
 |      
 |             the full text of all of the subsect

We will specifically use the following properties:

1. title
2. revision_id
3. summary
4. url

In [None]:
content.title

'Information retrieval'

In [None]:
content.revision_id

1170129916

In [None]:
content.summary

'Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources.  Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.\nAutomated information retrieval systems are used to reduce what has been called information overload. An IR system is a software system that provides access to books, journals and other documents; it also stores and manages those documents. Web search engines are the most visible IR applications.\n\n'

In [None]:
content.url

'https://en.wikipedia.org/wiki/Information_retrieval'
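
The four properties above can be gathered into one record per page, which is convenient when saving scraped documents. This is only a sketch: `page_to_record` and `records_to_json` are hypothetical helpers, and they assume the object passed in exposes `title`, `revision_id`, `summary`, and `url` attributes, as the `content` object does.

```python
import json


def page_to_record(page):
    """Copy the four fields used in this notebook into a plain dict."""
    return {
        "title": page.title,
        "revision_id": page.revision_id,
        "summary": page.summary.strip(),  # drop surrounding whitespace such as trailing newlines
        "url": page.url,
    }


def records_to_json(pages):
    """Serialize a list of page-like objects to a JSON string."""
    records = [page_to_record(p) for p in pages]
    return json.dumps(records, ensure_ascii=False, indent=2)


# Usage in the notebook: print(records_to_json([content]))
```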

Some other useful properties.

Suggestion: You may need the `links` property to retrieve 500 relevant pages (documents). Traverse the links of the pages you fetch until you have collected enough unique documents.

In [None]:
content.links

['1890 US Census',
 '3D retrieval',
 'Adversarial information retrieval',
 'Allen Kent',
 'Alvin Weinberg',
 'As We May Think',
 'Association for Computing Machinery',
 'Atlantic Monthly',
 'Automatic summarization',
 "Bayes' theorem",
 'Bibliometrics',
 'Bill Maron',
 'Binary Independence Model',
 'C. J. van Rijsbergen',
 'CERN',
 'Calvin Mooers',
 'Case Western Reserve University',
 'Categorization',
 'Censorship',
 'Citation index',
 'CiteSeerX (identifier)',
 'Classification of the sciences (Peirce)',
 'Co-occurrence',
 'Collaborative information seeking',
 'Communications of the ACM',
 'Compound term processing',
 'Computational linguistics',
 'Computer data storage',
 'Computer memory',
 'Computing',
 'Conference on Information and Knowledge Management',
 'Controlled vocabulary',
 'Cornelis J. van Rijsbergen',
 'Cross-language information retrieval',
 'Cultural studies',
 'Cyril W. Cleverdon',
 'Data mining',
 'Data modeling',
 'Data retrieval',
 'Database',
 'Desk Set',
 'Deskto
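
One way to follow the suggestion about traversing links is a breadth-first traversal that stops once enough unique titles have been collected. The sketch below is not the only way to do this; `crawl_titles`, `get_links`, and `limit` are hypothetical names, and `get_links` is assumed to return the list of linked titles for a given title (for example, a small function wrapping `wikipedia.page(title, auto_suggest=False).links`).

```python
from collections import deque


def crawl_titles(seed, get_links, limit=500):
    """Collect up to `limit` unique page titles, breadth-first from `seed`."""
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue and len(order) < limit:
        current = queue.popleft()
        try:
            links = get_links(current)
        except Exception:
            # Broad catch for the sketch: skip pages that fail to load
            # (for example, missing or ambiguous titles).
            continue
        for title in links:
            if title not in seen:
                seen.add(title)
                order.append(title)
                queue.append(title)
                if len(order) >= limit:
                    break
    return order


# Usage in the notebook (requires network access, one request per visited page):
#   def get_links(title):
#       return wikipedia.page(title, auto_suggest=False).links
#   titles = crawl_titles("Information retrieval", get_links, limit=500)
```

The seed title counts toward `limit`, and every collected title is unique because each one is checked against `seen` before it is added.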
