## Task 1.4. Accessing Web Data with Data Scraping

### Scraping from a webpage

In [1]:
import sys
print(sys.executable)


C:\Users\Asus\anaconda3\envs\venv_alice\python.exe


In [2]:
# Import libraries

import pandas as pd
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import matplotlib.pyplot as plt 
import os
import logging

In [30]:
#installing the driver
service = Service(ChromeDriverManager().install())

# starts Chrome using Service
driver = webdriver.Chrome(service=service)

In [4]:
#Get the page's contents
page_url = "https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland"
driver.get(page_url)

#### Extracting the elements of interest

In [5]:
# Create a collection of the characters
characters_elems = driver.find_elements(by = By.CLASS_NAME, value = 'div-col')

This cell uses Selenium to **search the webpage** for every HTML element that has the CSS class `div-col`.

Breaking it down:

- `driver.find_elements(...)`: tells Selenium to look through the page and collect **all matching elements** (note the plural `find_elements`, which returns a list)
- `by = By.CLASS_NAME`: specifies a search **by CSS class name**
- `value = 'div-col'`: the class name to look for

In other words: *"go through the Wikipedia page, find every element whose class includes `div-col`, and store the results in `characters_elems`."*

That is the `<div class="div-col">` containing the characters list, the same one found when inspecting the page.
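To see what "find every element with class `div-col`" means without starting a browser, here is a minimal stand-in built on the standard library's `html.parser`. It is not Selenium and not how Selenium works internally; `ClassTextCollector` and `SAMPLE_HTML` are hypothetical names made up for this illustration, and the sample markup is a simplified guess at the page structure.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element that carries a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.matches = []   # one text string per matching element
        self._tag = None    # tag name of the element currently being collected
        self._depth = 0     # > 0 while the parser is inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth == 0 and self.class_name in classes:
            self._tag = tag
            self._depth = 1
            self.matches.append("")
        elif self._depth > 0 and tag == self._tag:
            self._depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self._depth > 0 and tag == self._tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.matches[-1] += data


# Simplified, made-up markup; the real Wikipedia page is far larger.
SAMPLE_HTML = """
<div class="div-col columns">
<ul>
<li>Alice</li>
<li>The White Rabbit</li>
</ul>
</div>
<div class="other">Not a match</div>
"""

collector = ClassTextCollector("div-col")
collector.feed(SAMPLE_HTML)

# Like find_elements, the result is a list, even when only one element matches.
names = [line for line in collector.matches[0].splitlines() if line.strip()]
print(len(collector.matches), names)
```

The first `div` matches even though it has two classes, because a class-name search looks for the name among the element's classes rather than comparing the whole attribute.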

In [11]:
list_char = characters_elems[0].text.split("\n")

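The index `[0]` accesses the **first item** in the list `characters_elems`. `.text` then returns that element's visible text as one string, and `.split("\n")` breaks it into one list entry per line.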

In Python, lists are **zero-indexed**, meaning counting starts at 0 instead of 1:

- `[0]` → first item
- `[1]` → second item
- `[2]` → third item
- ...and so on

Since `characters_elems` has only 1 element (see the `len()` check in this notebook), `characters_elems[0]` is the only valid non-negative index. `characters_elems[1]` would raise an `IndexError` because there is no second item.
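The indexing and splitting described above can be tried on plain strings. This is a small sketch in which `element_texts` is a hypothetical stand-in for the `.text` values of the Selenium elements, shortened to three names.

```python
# Stand-in for the text of each matched element: one string per element,
# with one character name per line.
element_texts = ["Alice\nThe White Rabbit\nThe Mouse"]

# Index 0 is the first (and here the only) item; split("\n") turns the
# single string into a list with one entry per line.
list_char = element_texts[0].split("\n")
print(list_char)

# Asking for a second item fails because the list has length 1.
try:
    element_texts[1]
except IndexError as exc:
    message = str(exc)
    print("IndexError:", message)
```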

In [12]:
list_char

['Alice',
 'The White Rabbit',
 'The Mouse',
 'The Dodo',
 'The Lory',
 'The Eaglet',
 'The Duck',
 'Pat',
 'Bill the Lizard',
 'Puppy',
 'The Caterpillar',
 'The Duchess',
 'The Cheshire Cat',
 'The Hatter',
 'The March Hare',
 'The Dormouse',
 'The Queen of Hearts',
 'The King of Hearts',
 'The Knave of Hearts',
 'The Gryphon',
 'The Mock Turtle']

In [10]:
# Check how many elements matched the selector
print(len(characters_elems))

1


The output `1` means that `driver.find_elements(by=By.CLASS_NAME, value='div-col')` found exactly one element on the page with the class `div-col`. That is good news: the selector points to the one `<div class="div-col">` that contains the characters list inspected earlier.
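Because the rest of the notebook relies on there being exactly one match, that expectation can be made explicit. Here is a minimal sketch; `single_match` is a hypothetical helper, not part of Selenium, and the list passed to it below is a stand-in for `characters_elems`.

```python
def single_match(elements, description):
    """Return the only item in `elements`, or raise if the count is not 1."""
    if len(elements) != 1:
        raise ValueError(
            f"expected exactly one {description}, found {len(elements)}"
        )
    return elements[0]


# Stand-in for the list returned by find_elements.
matched = ["<the div-col element>"]
element = single_match(matched, "div-col element")
print(element)
```

If the page layout changes and the selector matches zero or several elements, this fails with a clear message instead of silently picking the wrong element.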

#### Storing the data in a suitable format

In [13]:
# Put the characters into a dataframe
df = pd.DataFrame(list_char, columns = ["character"])

In [14]:
df

| | character |
|---|---|
| 0 | Alice |
| 1 | The White Rabbit |
| 2 | The Mouse |
| 3 | The Dodo |
| 4 | The Lory |
| 5 | The Eaglet |
| 6 | The Duck |
| 7 | Pat |
| 8 | Bill the Lizard |
| 9 | Puppy |


### Requests and BeautifulSoup

In [15]:
#import libraries
from bs4 import BeautifulSoup
import requests 

ModuleNotFoundError: No module named 'bs4'

This error occurs because the `bs4` module (BeautifulSoup) is not installed in the active Python environment. BeautifulSoup is a popular library for web scraping, but it is not part of the Python standard library, so it has to be installed before it can be imported.

In [16]:
# First, install the required package into the notebook's own environment
%pip install beautifulsoup4

# Then import libraries
from bs4 import BeautifulSoup
import requests

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.14.3-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>=1.6.1 (from beautifulsoup4)
  Downloading soupsieve-2.8.3-py3-none-any.whl.metadata (4.6 kB)
Downloading beautifulsoup4-4.14.3-py3-none-any.whl (107 kB)
Downloading soupsieve-2.8.3-py3-none-any.whl (37 kB)
Installing collected packages: soupsieve, beautifulsoup4

Successfully installed beautifulsoup4-4.14.3 soupsieve-2.8.3


In [24]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'}

In [25]:
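# Fetch the page; the variable holds a Response object, not a URL
response = requests.get("https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland", headers=headers, timeout=30)
response.raise_for_status()  # stop early on an HTTP error status

In [26]:
# Parse the HTML and print the page title as a quick check
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)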

<title>Alice's Adventures in Wonderland - Wikipedia</title>


I had to use Claude AI to help me solve this. The fix was to add a `headers` argument with a browser-like `User-Agent`, because Wikipedia did not allow the request without one. After that, it worked.
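To show where the `User-Agent` ends up, here is a small sketch that builds a request and reads the header back without sending anything. It uses the standard library's `urllib.request.Request` instead of `requests`, so it runs with no network access. `build_request` is a hypothetical helper and the `USER_AGENT` value is a made-up placeholder, not a string Wikipedia requires.

```python
from urllib.request import Request

PAGE_URL = "https://en.wikipedia.org/wiki/Alice%27s_Adventures_in_Wonderland"
# Hypothetical identifier; replace it with your own project name and contact.
USER_AGENT = "alice-notebook-example/0.1 (contact: you@example.com)"


def build_request(url, user_agent):
    """Return a request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request(PAGE_URL, USER_AGENT)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

The same kind of dictionary is what the notebook passes to `requests.get(..., headers=headers)`.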

In [27]:
text = soup.get_text()

In [28]:
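# Encode the text as UTF-8 bytes so it can be written in binary mode
text = text.encode('utf-8')

In [29]:
# Write the encoded bytes to a file in the working directory
with open('Alice_article_Wiki.txt', 'wb') as f:
    f.write(text)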
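The notebook encodes the text by hand and writes bytes. An alternative is to open the file in text mode and let `open()` do the encoding. The sketch below checks that both approaches produce the same bytes; the sample text and file names are made up, and a temporary directory is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path

# Made-up sample text with one non-ASCII character.
sample_text = "Alice's Adventures in Wonderland\nCaf\u00e9 scene\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    binary_path = Path(tmp_dir) / "binary_mode.txt"
    text_path = Path(tmp_dir) / "text_mode.txt"

    # Approach from the notebook: encode first, then write bytes.
    with open(binary_path, "wb") as f:
        f.write(sample_text.encode("utf-8"))

    # Alternative: text mode with an explicit encoding.
    # newline="" writes "\n" unchanged instead of translating it on Windows.
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write(sample_text)

    same_bytes = binary_path.read_bytes() == text_path.read_bytes()

print(same_bytes)
```

Without `newline=""`, text mode on Windows would write `\r\n` line endings, so the two files would differ there.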

Done! Thank goodness there's AI. 