# Scraping Wikipedia data 

The first step in this project is to extract the required data from the Wikipedia website. We use the BeautifulSoup library to scrape the pages. The extracted data is then processed and used for further analysis.

The whole notebook is divided into different parts:
* Importing Libraries.
* Defining different variables.
* Defining the different functions needed.
* Defining the main scraping function.

### Importing Libraries:

Importing the required libraries:
* requests for making HTTP requests.
* BeautifulSoup for extracting data from the HTML returned by each request.
* random for shuffling the links found on a page.
* time for pausing between requests.
* pandas for storing data in DataFrames.
* csv for writing the results into a file.

In [1]:
import requests
from bs4 import BeautifulSoup
import random
import pandas as pd
import time
import csv
import re

### Defining the different variables:

* count is the number of pages scraped so far.
* totalPages is the total number of pages from which data should be extracted.
* csvName is the name of the file in which we save the extracted raw data.

In [2]:
count = 0
totalPages = 99
csvName = "finalDataFile"

### Defining the different functions:

Each function is given a separate task to perform. All these functions are then used by the scrape function to extract the data. The different functions used are:
* initResp - initializes the web crawling process.
* titleExtractor - extracts the title of the topic.
* contextExtractor - extracts the data written about the topic.
* writeCSV - stores the data into a csv file.
* nextLink - jumps to the next Wikipedia link.

initResp is the function that starts the whole process. It calls requests.get, which sends an HTTP GET request to retrieve the page at the specified URL, and returns the response.

In [3]:
def initResp(url):
    response = requests.get(url)
    return response
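
A bare GET call can wait indefinitely if the server never answers, and it returns error pages as if they were normal responses. The sketch below shows one possible way to harden the request. It is not part of the notebook: `fetch_page`, its `session` parameter, the 10-second timeout, and the `User-Agent` string are all illustrative assumptions.

```python
def fetch_page(session, url, timeout=10.0):
    """Fetch url with a timeout and fail loudly on HTTP errors.

    `session` is assumed to be a requests.Session (or any object with a
    compatible .get method). The header value below is a placeholder.
    """
    headers = {"User-Agent": "wiki-scraping-notebook/0.1 (educational example)"}
    response = session.get(url, timeout=timeout, headers=headers)
    # raise_for_status() raises an exception for 4xx/5xx responses.
    response.raise_for_status()
    return response
```

With the notebook's imports, this could be called as `fetch_page(requests.Session(), "https://en.wikipedia.org/wiki/Web_scraping")`.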

We need functions to extract the title and the content of each page. titleExtractor returns the text of the page's title tag, and contextExtractor concatenates the text of every paragraph (p) element on the page.

In [4]:
def titleExtractor(soup):
    title = soup.find('title')
    return title.string

def contextExtractor(soup):
    context = " "
    for i in soup.select('p'):
        context = context + i.getText()
    return context
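
To see what the two extractors produce without touching the network, they can be tried on a small inline HTML string. The sketch below uses hypothetical variants, `title_extractor` and `context_extractor`, that differ from the notebook's functions in two assumed ways: a missing title yields an empty string, and paragraphs are joined with a space so that words from adjacent paragraphs cannot run together. `SAMPLE_HTML` is made-up test data.

```python
from bs4 import BeautifulSoup

# Made-up page used only for this demonstration.
SAMPLE_HTML = """
<html><head><title>Web scraping - Wikipedia</title></head>
<body><div id="bodyContent">
<p>Web scraping is data scraping.</p>
<p>It extracts data from websites.</p>
</div></body></html>
"""

def title_extractor(soup):
    # Return the title text, or an empty string if the page has no title.
    title = soup.find("title")
    return str(title.string) if title is not None and title.string else ""

def context_extractor(soup):
    # Join the stripped text of every <p> element with a single space.
    return " ".join(p.get_text(strip=True) for p in soup.select("p"))

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_title = title_extractor(soup)
sample_context = context_extractor(soup)
print(sample_title)
print(sample_context)
```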

Now that the contents are extracted, we need to store the results in a file. writeCSV appends the data to a CSV file as one row with two columns: the title of the page and the contents of the page.

In [5]:
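def writeCSV(title, para, fname):
    # Append one (title, content) row to the CSV file.
    with open(fname, 'a', newline='', encoding='utf-8') as file:
        writer = csv.writer(file, delimiter=',')
        # str() accepts both plain strings and BeautifulSoup NavigableString objects.
        writer.writerow([str(title), para])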
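
Article text contains commas and line breaks, so it is worth checking that a written row can be read back unchanged. The sketch below is a round trip through the standard csv module using a temporary file. `write_row` and `read_rows` are hypothetical helpers, and the header row (`title`, `content`) is an assumption that the notebook's writeCSV does not make.

```python
import csv
import os
import tempfile

def write_row(title, para, fname):
    # Write a header first if the file is new or empty, then append the row.
    write_header = not os.path.exists(fname) or os.path.getsize(fname) == 0
    with open(fname, "a", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        if write_header:
            writer.writerow(["title", "content"])
        writer.writerow([title, para])

def read_rows(fname):
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(fname, newline="", encoding="utf-8") as file:
        return list(csv.reader(file))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    write_row("Web scraping - Wikipedia", "Line one,\nline two.", path)
    write_row("Python - Wikipedia", "Another article.", path)
    demo_rows = read_rows(path)

print(demo_rows)
```

The csv writer quotes any field that contains the delimiter or a line break, which is why the first row survives the round trip intact.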

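Now we need a function that helps us jump from the current page to another Wikipedia page. nextLink collects all the links inside the page's bodyContent element, shuffles them, and returns the first one whose href contains "/wiki/" but not ".org". Because of the shuffle, the result is a random link to another Wikipedia page. Note that it returns the link tag itself, so the caller reads the href from it.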

In [6]:
def nextLink(soup):
    # Collect every anchor with an href inside the article body and shuffle them.
    allLinks = soup.find(id="bodyContent").find_all("a", href=True)
    random.shuffle(allLinks)
    linkToScrape = None  # stays None if no suitable link is found

    for link in allLinks:
        # We are only interested in other wiki articles
        if link['href'].find("/wiki/") == -1:
            continue

        # Skip absolute links to other sites (their href contains ".org")
        if link['href'].find(".org") == -1:
            linkToScrape = link
            break

    print("Link returned : ", linkToScrape)
    return linkToScrape
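
The two substring checks in nextLink also accept links such as `/wiki/File:Example.png`, which contain "/wiki/" and no ".org" but do not lead to articles. One possible stricter filter is sketched below on plain href strings. `is_article_href` and `pick_article_href` are hypothetical names, and treating any colon after `/wiki/` as a non-article namespace is an assumption that would also skip the few real articles with a colon in their title.

```python
import random

def is_article_href(href):
    # Keep only site-relative links of the form /wiki/<Title>.
    if not href.startswith("/wiki/"):
        return False
    # Assumption: a colon marks a namespace such as File:, Help:, or Special:.
    return ":" not in href[len("/wiki/"):]

def pick_article_href(hrefs, rng=random):
    # Return a random article href, or None when nothing qualifies.
    candidates = [h for h in hrefs if is_article_href(h)]
    return rng.choice(candidates) if candidates else None

sample_hrefs = [
    "/wiki/File:Example.png",
    "https://www.mediawiki.org/wiki/Help",
    "#cite_note-1",
    "/wiki/Data_scraping",
]
print(pick_article_href(sample_hrefs))
```

Returning None when no link qualifies gives the caller an explicit value to test before building the next URL.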

### Defining the scrape function:

scrape is the function that ties all the other functions together: it processes one page and then calls itself on the next link. A 7-second sleep is included between requests because hitting one Wikipedia page after another without a pause may get the program throttled or blocked. scrape is a recursive function that calls itself until count reaches totalPages. Each page adds one stack frame, so totalPages should stay well below Python's recursion limit (1000 by default).

In [7]:
def scrape(url):
    # count is reassigned here, so it must be declared global;
    # totalPages and csvName are only read.
    global count
    count = count + 1

    response = initResp(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    title = titleExtractor(soup)
    context = contextExtractor(soup)
    writeCSV(title, context, csvName)

    linkToScrape = nextLink(soup)
    print("Current Link count: ", count)

    # Stop when the page limit is reached or no usable link was found.
    if count < totalPages and linkToScrape:
        time.sleep(7)
        scrape("https://en.wikipedia.org" + str(linkToScrape['href']))

#scrape("https://en.wikipedia.org/wiki/Web_scraping")
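
The recursive control flow can also be written as a loop, which avoids the recursion limit and the global counter. The sketch below shows only that control flow. `crawl` is a hypothetical function, and `fetch_page` is an assumed callable that scrapes one URL (for example by combining the functions above) and returns the next URL, or None when there is nowhere to go.

```python
import time

def crawl(start_url, fetch_page, max_pages, delay_seconds=7.0, sleep=time.sleep):
    """Visit up to max_pages pages, pausing between requests.

    fetch_page(url) is assumed to process one page and return the next
    URL to visit, or None to stop early.
    """
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        next_url = fetch_page(url)
        visited.append(url)
        # Sleep only when another request will actually follow.
        if next_url is not None and len(visited) < max_pages:
            sleep(delay_seconds)
        url = next_url
    return visited

# Tiny fake "site": each key maps to the next page to visit.
fake_site = {"a": "b", "b": "c", "c": None}
print(crawl("a", fake_site.get, max_pages=10, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter is a design choice made here so that the loop can be tried instantly on fake data. In real use the default `time.sleep` would apply.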
