## How to Scrape Word List *Page Links* from Wiktionary

You probably wanted to scrape Wiktionary word lists with BeautifulSoup, rvest, or a similar tool. You wanted to write a loop, but it turns out that every single "next page" has a different link. Bummer? Of course not! In just a few steps you can collect the links of the Wiktionary list pages, and then parse the words from them with rvest.

**This is the first step of scraping the words. Please see the *scrape-wkt-words.Rmd* file in the repository after this.**

- In this tutorial, you will learn how to scrape Wiktionary word list page links.
- Wiktionary's list pages do not follow a predictable link pattern, so the links have to be collected one page at a time.
- Please read the `# comments` in the code carefully to fully understand how this process works.

## Before starting...
- We'll use Selenium, which opens a browser window and does the scraping work for us automatically.
- You'll need to install ChromeDriver (or GeckoDriver if you're using Firefox). I'll use Chrome here.
- Here's a tutorial video on how to install these drivers: https://www.youtube.com/watch?v=dz59GsdvUF8
- This is where you download ChromeDriver: https://chromedriver.chromium.org/downloads
- Make sure you match the ChromeDriver and Chrome Version. To check it, write this in the address bar: chrome://version
- You **must** install ChromeDriver (or Geckodriver) before starting.

In [15]:
# Necessary Libraries

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

## Important Aspect in Wiktionary

Every page lists 200 words, so the number of pages to scrape follows from the total word count.

E.g. if it says 'The following 200 pages are in this category, out of 8,979 total.'

Divide 8979 by 200: 8979 / 200 = 44.895

Rounded up, that is 45 pages. The first page is the starting URL itself, so there are 44 "next page" links to follow. That number goes into the `range()` of the loop in the code.
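The arithmetic above can also be done in code. This is only a small sketch: the helper name `next_page_clicks` is made up for this tutorial, and it assumes every page except the last one holds exactly 200 entries.

```python
import math

def next_page_clicks(total_entries, per_page=200):
    """Return how many "next page" links to follow after the first page."""
    # Total number of list pages, counting a partly filled last page
    pages = math.ceil(total_entries / per_page)
    # The first page is the starting URL, so it needs no click
    return max(pages - 1, 0)

print(next_page_clicks(8979))  # 45 pages in total, so 44 clicks
```

Rounding up and then subtracting one also covers the case where the total is an exact multiple of 200, e.g. 8,800 entries fill 44 pages and need 43 clicks.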

**Beware:** When you run this code, a Google Chrome window will pop up. *Don't freak out*, that's what Selenium does: it drives a browser and collects the links for you automatically. No need to worry. Everything is cool. If the browser opens and immediately closes when you run this cell, your ChromeDriver most likely does not match your Chrome version. Download the matching version.

In [None]:
# For this tutorial, we'll use the Turkish nouns list.
url = 'https://en.wiktionary.org/wiki/Category:Turkish_nouns'
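# Selenium 4 takes the driver path through a Service object
from selenium.webdriver.chrome.service import Service
PATH = "C:/Windows/chromedriver.exe"
driver = webdriver.Chrome(service=Service(PATH))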



# First of all, create a new list to store the page links
page_list = []



# Now here's one thing to remember. The code will parse pages starting with the first "next page". 
# So the content in the very first page will not be parsed. We need to add that!
page_list.append(url)



# This function finds the "next page" link on the current page, prints it, stores it in the list, then clicks on it.
def scrapeandclick():
    nextpg = driver.find_element(By.XPATH, '//*[@id="mw-pages"]/a[2]')
    pagelink = nextpg.get_attribute('href')
    print(pagelink)
    page_list.append(pagelink)
    nextpg.click()

    
    
# Time to Run Selenium
driver.get(url)



# As described in the Markdown above, the Turkish nouns list has 8979 words in total
# Remember, each page has 200 words.
# Let's divide that: 8979 / 200 = 44.895, so 45 pages
# The first page is already in the list, so our range is 44

for _ in range(44):
    scrapeandclick()
    

driver.quit()

In [None]:
# Check how cool your list is
page_list
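Before saving anything, a quick sanity check on the list can help. The sketch below uses a hypothetical helper, `check_page_list`, that is not part of the original notebook. It compares the number of collected links with the page count expected from the total, and counts repeated links, which could point to a loop that ran more often than there are "next page" links.

```python
import math

def check_page_list(page_list, total_entries, per_page=200):
    """Summarise how the collected links compare with the expected page count."""
    return {
        # Pages expected from the category total (last page may be partly filled)
        "expected_pages": math.ceil(total_entries / per_page),
        # Links actually collected, including the starting URL
        "collected_links": len(page_list),
        # Links that appear more than once
        "duplicates": len(page_list) - len(set(page_list)),
    }
```

For the Turkish nouns example, `check_page_list(page_list, 8979)` should report 45 expected pages. If everything went well, `collected_links` matches that number and `duplicates` is 0.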

In [None]:
# Create a Data Frame with the list

df = pd.DataFrame(page_list, columns=['Wiktionary Page Link'])

df

In [None]:
# Save your output to a CSV file for future use

df.to_csv("wiktionary_page_link.csv")