## WEB SCRAPING

Web scraping is the process of using a computer program to gather information from the internet. The modules needed for web scraping are:
1. requests: downloads files and web pages from the internet.
2. bs4 (BeautifulSoup): parses HTML.

## THE REQUESTS MODULE
The requests module allows us to send HTTP requests using Python. To download a file or a web page, we use the get() function of the requests module. This function returns a Response object that gives us access to information about the result of our GET request, such as the status code and the content.

In [43]:
import requests
page = requests.get('https://www.jumia.com.ng/mobile-phones/apple/')
print(type(page))
#page

<class 'requests.models.Response'>


## STATUS CODES
A status code informs you of the status of the request. For example, a status code of 200 OK tells you that your request was successful, whereas a status code of 404 NOT FOUND tells you that the page was not found. To access the status code of the response object, we use the *status_code* attribute.

In [44]:
page.status_code

200

In [45]:
page.raise_for_status()
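
The `raise_for_status()` method does nothing when the request succeeded and raises `requests.HTTPError` for 4xx and 5xx responses, which is why the cell above produces no output. Below is a minimal sketch of wrapping a download in error handling. The helper name `fetch_html` and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx/5xx status codes; does nothing on success.
        response.raise_for_status()
    except requests.RequestException as error:
        # RequestException is the base class of HTTPError, ConnectionError, Timeout, etc.
        print(f'Request failed: {error}')
        return None
    return response.text
```

With this helper, `fetch_html('https://www.jumia.com.ng/mobile-phones/apple/')` should give the same string as `page.text` when the download succeeds, and `None` otherwise.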

## PAYLOAD

A response object carries its data, known as the payload, in the message body. We can access the payload in different formats using the attributes of the response object. The most commonly used attributes are *content* and *text*.
The content attribute returns bytes, while the text attribute returns a string.
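
The relationship between the two attributes can be sketched without a network call: `text` is essentially `content` decoded with the response's character encoding. The byte string below is a made-up stand-in for `page.content`, and UTF-8 is assumed because the page declares `charset="utf-8"`.

```python
# Stand-in for page.content: raw bytes as they arrive over the network.
raw = '<div class="prc">₦ 2,250,000</div>'.encode('utf-8')

# Stand-in for page.text: the same payload decoded into a str.
decoded = raw.decode('utf-8')

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
# The naira sign takes 3 bytes in UTF-8, so the byte count exceeds the character count.
print(len(raw), len(decoded))
```

As a rule of thumb, `content` suits binary downloads such as images, while `text` is the convenient choice when the HTML is going to be parsed.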

In [47]:
print(page.content)

b'<!DOCTYPE html><html lang="en" dir="ltr"><head><meta charset="utf-8"/><title>Apple iPhones | Buy iPhones Online | Jumia Nigeria</title><meta property="og:type" content="product"/><meta property="og:site_name" content="Jumia Nigeria"/><meta property="og:title" content="Apple iPhones | Buy iPhones Online | Jumia Nigeria"/><meta property="og:description" content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; more | Order now!"/><meta property="og:url" content="/mobile-phones/apple/"/><meta property="og:image" content="https://ng.jumia.is/cms/jumialogonew.png"/><meta property="og:locale" content="en_NG"/><meta name="title" content="Apple iPhones | Buy iPhones Online | Jumia Nigeria"/><meta name="robots" content="index,follow"/><meta name="description" content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; m

In [48]:
print(page.text)

<!DOCTYPE html><html lang="en" dir="ltr"><head><meta charset="utf-8"/><title>Apple iPhones | Buy iPhones Online | Jumia Nigeria</title><meta property="og:type" content="product"/><meta property="og:site_name" content="Jumia Nigeria"/><meta property="og:title" content="Apple iPhones | Buy iPhones Online | Jumia Nigeria"/><meta property="og:description" content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; more | Order now!"/><meta property="og:url" content="/mobile-phones/apple/"/><meta property="og:image" content="https://ng.jumia.is/cms/jumialogonew.png"/><meta property="og:locale" content="en_NG"/><meta name="title" content="Apple iPhones | Buy iPhones Online | Jumia Nigeria"/><meta name="robots" content="index,follow"/><meta name="description" content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; mor

## INSPECTING A WEB PAGE
Web pages are written in HTML, which stands for HyperText Markup Language. An HTML file is a plain text file with the .html extension. It contains tags which tell the browser how to format the web page. Most HTML elements have an opening tag such as `<div>` and a matching closing tag such as `</div>`.
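
To make the idea of opening and closing tags concrete, here is a small sketch that uses Python's built-in `html.parser` module to list the tags in a document. The HTML string and the `TagLogger` class are made up for illustration.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML document used only for illustration.
SAMPLE_HTML = '<html><body><h3 class="name">Phone</h3><div class="prc">₦ 100</div></body></html>'


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('open', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('close', tag))

    def handle_data(self, data):
        self.events.append(('text', data))


logger = TagLogger()
logger.feed(SAMPLE_HTML)
for event in logger.events:
    print(event)
```

Each `open` event is eventually matched by a `close` event for the same tag name, and the text sits between the two.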

We can use the browser's web developer tools to inspect any web page. This helps us understand the structure of the web page we want to scrape. To open the developer tools, click on the three dots at the top right corner of your browser, select More tools and then Developer tools. Below is an image of the web developer tools.

![image.png](attachment:image.png)

Once we have the web developer tools open, we can locate the HTML tags for any part of a web page by moving our cursor over the part of the page we are interested in.

## THE BEAUTIFUL SOUP MODULE
This module allows us to interact with a web page in code, much as we do with the web developer tools, and to extract information from it. To create a BeautifulSoup object, we pass two arguments to the BeautifulSoup() constructor. The first argument is the HTML to parse, given as a string or an open file, and the second argument is the name of the parser used to analyse the HTML.

In [49]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'html.parser')
type(soup)

bs4.BeautifulSoup

We can also use the `lxml` parser, which is usually faster than `html.parser` but has to be installed separately. With a BeautifulSoup object, we can use its methods to locate specific parts of an HTML document.
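
Because `lxml` is a separate package, code that asks for it fails on machines where it is missing. One possible way to handle this is sketched below: try `lxml` first and fall back to the built-in parser. The helper name `make_soup` and the sample HTML are assumptions made for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not available.
        return BeautifulSoup(html, 'html.parser')


soup_demo = make_soup('<p class="name">Apple iPhone</p>')
print(soup_demo.p.text)
```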

In [50]:
soup = BeautifulSoup(page.text, 'lxml')
print(soup)

<!DOCTYPE html>
<html dir="ltr" lang="en"><head><meta charset="utf-8"/><title>Apple iPhones | Buy iPhones Online | Jumia Nigeria</title><meta content="product" property="og:type"/><meta content="Jumia Nigeria" property="og:site_name"/><meta content="Apple iPhones | Buy iPhones Online | Jumia Nigeria" property="og:title"/><meta content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; more | Order now!" property="og:description"/><meta content="/mobile-phones/apple/" property="og:url"/><meta content="https://ng.jumia.is/cms/jumialogonew.png" property="og:image"/><meta content="en_NG" property="og:locale"/><meta content="Apple iPhones | Buy iPhones Online | Jumia Nigeria" name="title"/><meta content="index,follow" name="robots"/><meta content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; more | Order now!" na

To print the beautiful soup object with the tags properly nested, we use the prettify method. 

In [51]:
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Apple iPhones | Buy iPhones Online | Jumia Nigeria
  </title>
  <meta content="product" property="og:type"/>
  <meta content="Jumia Nigeria" property="og:site_name"/>
  <meta content="Apple iPhones | Buy iPhones Online | Jumia Nigeria" property="og:title"/>
  <meta content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iphone 12, iphone X, &amp; more | Order now!" property="og:description"/>
  <meta content="/mobile-phones/apple/" property="og:url"/>
  <meta content="https://ng.jumia.is/cms/jumialogonew.png" property="og:image"/>
  <meta content="en_NG" property="og:locale"/>
  <meta content="Apple iPhones | Buy iPhones Online | Jumia Nigeria" name="title"/>
  <meta content="index,follow" name="robots"/>
  <meta content="Buy Apple iPhones online at Jumia Nigeria | Large selection of iPhones at best prices - iPhone 13, 13 pro max, iph

## FINDING ELEMENTS IN AN HTML DOCUMENT
We can access the elements of an HTML document using various methods such as select, find and find_all. The select and find_all methods return a list of tags, while the find method returns a single tag, the first match. The select method takes a CSS selector, so its syntax differs from that of the find and find_all methods.
To find an element by id, we can use either of the approaches below. The select method uses the `#` symbol to indicate an id and the `.` symbol to indicate a class.
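
Since the output of the real page is very long, the differences between the three methods are easier to see on a small document. The sketch below uses a made-up HTML string, not the Jumia page.

```python
from bs4 import BeautifulSoup

# A small, made-up document used only for illustration.
html = '''
<div id="jm">
  <h3 class="name">Phone A</h3>
  <h3 class="name">Phone B</h3>
</div>
'''
demo = BeautifulSoup(html, 'html.parser')

first = demo.find('h3')              # a single Tag: the first match
all_h3 = demo.find_all('h3')         # a list-like ResultSet of every match
by_id = demo.select('#jm')           # CSS selector: '#' means id
by_class = demo.select('.name')      # CSS selector: '.' means class
missing = demo.find(id='not-there')  # find returns None when nothing matches

print(first.text, len(all_h3), len(by_id), len(by_class), missing)
```

Because find returns `None` when nothing matches, calling `.text` on a failed lookup raises an `AttributeError`, so it is worth checking the result before using it.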

In [52]:
print(soup.find(id = 'jm').prettify())

<div id="jm">
 <div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;">
  <div class="row _no-go -phs">
   <a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/">
    
   </a>
  </div>
 </div>
 <div class="vb row -i-ctr -j-ctr _head -bg-gy05">
  <div class="col3 -df -j-start">
   <a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" 

In [53]:
soup.select('#jm')

[<div id="jm"><div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;"><div class="row _no-go -phs"><a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/"></a></div></div><div class="vb row -i-ctr -j-ctr _head -bg-gy05"><div class="col3 -df -j-start"><a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" rel="noopener" target="_blank">

Passing the name of a tag as an argument to the select, find and find_all methods returns the matching tag element(s). We can also retrieve the first matching tag by using the tag name as an attribute of the soup object. Suppose we want to select the div element(s) in the web page; we can do any of the following.

In [54]:
soup.div #This will return the first match

<div id="jm"><div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;"><div class="row _no-go -phs"><a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/"></a></div></div><div class="vb row -i-ctr -j-ctr _head -bg-gy05"><div class="col3 -df -j-start"><a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" rel="noopener" target="_blank"><

In [55]:
soup.find('div')

<div id="jm"><div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;"><div class="row _no-go -phs"><a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/"></a></div></div><div class="vb row -i-ctr -j-ctr _head -bg-gy05"><div class="col3 -df -j-start"><a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" rel="noopener" target="_blank"><

In [56]:
soup.select('div')

[<div id="jm"><div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;"><div class="row _no-go -phs"><a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/"></a></div></div><div class="vb row -i-ctr -j-ctr _head -bg-gy05"><div class="col3 -df -j-start"><a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" rel="noopener" target="_blank">

In [57]:
soup.find_all('div')

[<div id="jm"><div class="banner" data-bnrid="57" data-end="2026-12-31T23:59:00+01:00" style="background:#0D7A5A;"><div class="row _no-go -phs"><a class="col16 ar _1168-56" data-creative="https://ng.jumia.is/cms/0-2-Shopping-Festival/2024/Brand-days/Infinix/infinix_jumia_brand_day-smart_8_plus-1170x60.png" data-id="catalog_category_DS_CP_JSF_ADS_INF_HP" data-name="DS_CP_JSF_ADS_INF_HP" data-position="banner_top" data-track-onclick="eecPromo" data-track-onview="eecPromo" href="https://www.jumia.com.ng/mlp-infinix-store/"></a></div></div><div class="vb row -i-ctr -j-ctr _head -bg-gy05"><div class="col3 -df -j-start"><a class="_link -df -i-ctr -or5 -m -fs12" href="/marketplace-vendor/" rel="noopener" target="_blank">

In [69]:
soup.find_all('div', class_ = 'info')

[<div class="info"><div class="bdg _mall _xs">Official Store</div><h3 class="name">Apple IPhone 13 6.1" 4GB RAM/256GB ROM IOS 15 - Blue</h3><div class="prc">₦ 1,040,999</div><div class="s-prc-w"><div class="old">₦ 2,999,999</div><div class="bdg _dsct _sm">65%</div></div><div class="rev"><div class="stars _s">5 out of 5<div class="in" style="width:100%"></div></div>(2)</div></div>,
 <div class="info"><h3 class="name">Apple IPhone 14 Pro Max 6.7" 128GB ROM, 6GB RAM - IOS 16 - Silver</h3><div class="prc">₦ 1,480,000</div></div>,
 <div class="info"><h3 class="name">Apple IPhone 12 - 6.1" - 4GB RAM, 64GB ROM, 5G, 2815mAh - Green</h3><div class="prc">₦ 760,000</div><p class="mpg" data-prc="₦ 760,000" data-tot="5">offers from</p></div>,
 <div class="info"><h3 class="name">Apple IPhone 14 Pro Max 6.7" (128GB ROM + 6GB RAM) Nano Sim- Black</h3><div class="prc">₦ 1,480,000</div></div>,
 <div class="info"><h3 class="name">Apple IPhone 13 6.1" , (4GB RAM + 128GB ROM), IOS 15 - Midnight</h3><div cl

At times we do not want to retrieve every tag of a given type, only those that belong to a specific class. We can do this by specifying the class_ parameter of the find/find_all methods or by using the `.` symbol with the select method. Suppose we want to retrieve all article tags that belong to the class `prd`.

In [59]:
soup.select('article.prd._fb.col.c-prd')

[<article class="prd _fb col c-prd"><a class="btn _i _rnd -mas -fsh0 -me-start _wslt _sec" data-pop-open="addToWishlist" data-pop-trig="atw" data-simplesku="AP044MP3SD3N6NAFAMZ-401390922" data-sku="AP044MP3SD3N6NAFAMZ" data-track-onclick="wishlist" href="/customer/account/login/?tkWl=AP044MP3SD3N6NAFAMZ-401390922&amp;return=%2Fmobile-phones%2Fapple%2F" rel="nofollow"><svg aria-label="Add to wishlist" class="ic -f-or5" height="16" viewbox="0 0 24 24" width="16"><use xlink:href="https://www.jumia.com.ng/assets_he/images/i-icons.995b8ca3.svg#saved-items"></use></svg></a><a class="core" data-ga4-discount="648.32" data-ga4-index="1" data-ga4-item_brand="Apple" data-ga4-item_category="Phones &amp; Tablets" data-ga4-item_category2="Mobile Phones" data-ga4-item_category3="Smartphones" data-ga4-item_category4="iOS Phones" data-ga4-item_id="AP044MP3SD3N6NAFAMZ" data-ga4-item_name="IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium" data-ga4-list="" data-ga4-price="1389.25" data-gtm-brand="App

In [60]:
soup.find_all('article', class_='prd _fb col c-prd')

[<article class="prd _fb col c-prd"><a class="btn _i _rnd -mas -fsh0 -me-start _wslt _sec" data-pop-open="addToWishlist" data-pop-trig="atw" data-simplesku="AP044MP3SD3N6NAFAMZ-401390922" data-sku="AP044MP3SD3N6NAFAMZ" data-track-onclick="wishlist" href="/customer/account/login/?tkWl=AP044MP3SD3N6NAFAMZ-401390922&amp;return=%2Fmobile-phones%2Fapple%2F" rel="nofollow"><svg aria-label="Add to wishlist" class="ic -f-or5" height="16" viewbox="0 0 24 24" width="16"><use xlink:href="https://www.jumia.com.ng/assets_he/images/i-icons.995b8ca3.svg#saved-items"></use></svg></a><a class="core" data-ga4-discount="648.32" data-ga4-index="1" data-ga4-item_brand="Apple" data-ga4-item_category="Phones &amp; Tablets" data-ga4-item_category2="Mobile Phones" data-ga4-item_category3="Smartphones" data-ga4-item_category4="iOS Phones" data-ga4-item_id="AP044MP3SD3N6NAFAMZ" data-ga4-item_name="IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium" data-ga4-list="" data-ga4-price="1389.25" data-gtm-brand="App

Passing the class values as a list returns all elements that have any of the specified values as a class.
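
The different ways of matching on class are easy to confuse, so here is a small comparison on made-up markup. It contrasts a single class name, a multi-class string, a list of classes, and a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up markup: two cards share the class 'prd', but list their classes differently.
html = '''
<article class="prd _fb col c-prd">first</article>
<article class="c-prd col prd">second</article>
<article class="banner">third</article>
'''
demo = BeautifulSoup(html, 'html.parser')

# One class name: matches every tag that carries that class.
single = demo.find_all('article', class_='prd')

# Several names in one string: matches only that exact attribute value.
exact = demo.find_all('article', class_='prd _fb col c-prd')

# A list: matches tags that carry at least one of the listed classes.
any_of = demo.find_all('article', class_=['prd', 'banner'])

# CSS selector: matches tags that carry all of the listed classes, in any order.
all_of = demo.select('article.prd.c-prd')

print(len(single), len(exact), len(any_of), len(all_of))
```

The multi-class string is sensitive to the order of the classes in the HTML, so the CSS selector form is usually the more robust choice when a site reorders its class names.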

In [61]:
soup.find_all('article', class_=['prd', '_fb', 'col', 'c-prd'])

[<article class="prd _fb col c-prd"><a class="btn _i _rnd -mas -fsh0 -me-start _wslt _sec" data-pop-open="addToWishlist" data-pop-trig="atw" data-simplesku="AP044MP3SD3N6NAFAMZ-401390922" data-sku="AP044MP3SD3N6NAFAMZ" data-track-onclick="wishlist" href="/customer/account/login/?tkWl=AP044MP3SD3N6NAFAMZ-401390922&amp;return=%2Fmobile-phones%2Fapple%2F" rel="nofollow"><svg aria-label="Add to wishlist" class="ic -f-or5" height="16" viewbox="0 0 24 24" width="16"><use xlink:href="https://www.jumia.com.ng/assets_he/images/i-icons.995b8ca3.svg#saved-items"></use></svg></a><a class="core" data-ga4-discount="648.32" data-ga4-index="1" data-ga4-item_brand="Apple" data-ga4-item_category="Phones &amp; Tablets" data-ga4-item_category2="Mobile Phones" data-ga4-item_category3="Smartphones" data-ga4-item_category4="iOS Phones" data-ga4-item_id="AP044MP3SD3N6NAFAMZ" data-ga4-item_name="IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium" data-ga4-list="" data-ga4-price="1389.25" data-gtm-brand="App

We can get various pieces of information from an element, such as its text and the values of its HTML attributes, using the attributes of the tag object that Beautiful Soup returns.

In [62]:
articles = soup.find_all('article', class_='prd _fb col c-prd')
print(articles[0].prettify())

<article class="prd _fb col c-prd">
 <a class="btn _i _rnd -mas -fsh0 -me-start _wslt _sec" data-pop-open="addToWishlist" data-pop-trig="atw" data-simplesku="AP044MP3SD3N6NAFAMZ-401390922" data-sku="AP044MP3SD3N6NAFAMZ" data-track-onclick="wishlist" href="/customer/account/login/?tkWl=AP044MP3SD3N6NAFAMZ-401390922&amp;return=%2Fmobile-phones%2Fapple%2F" rel="nofollow">
  <svg aria-label="Add to wishlist" class="ic -f-or5" height="16" viewbox="0 0 24 24" width="16">
   <use xlink:href="https://www.jumia.com.ng/assets_he/images/i-icons.995b8ca3.svg#saved-items">
   </use>
  </svg>
 </a>
 <a class="core" data-ga4-discount="648.32" data-ga4-index="1" data-ga4-item_brand="Apple" data-ga4-item_category="Phones &amp; Tablets" data-ga4-item_category2="Mobile Phones" data-ga4-item_category3="Smartphones" data-ga4-item_category4="iOS Phones" data-ga4-item_id="AP044MP3SD3N6NAFAMZ" data-ga4-item_name="IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium" data-ga4-list="" data-ga4-price="1389.25" 

In [63]:
phone_name = articles[0].find('h3').text
print(phone_name)

Apple IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium


In [64]:
img_src = articles[0].img['data-src']
print(img_src)

https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/29/0390262/1.jpg?0127


In [65]:
price = articles[0].find('div', class_='prc').text
print(price)

₦ 2,250,000
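
The price comes back as text, so it has to be converted before any arithmetic or sorting. A minimal sketch of one way to do that is shown below. The helper name `parse_price` is made up, and the approach assumes whole-number prices with no decimal part and no price ranges.

```python
import re


def parse_price(price_text):
    """Convert a price string such as '₦ 2,250,000' into an integer."""
    # Keep only the digits; the currency sign, spaces, and thousands separators are dropped.
    digits = re.sub(r'\D', '', price_text)
    if not digits:
        raise ValueError(f'no digits found in {price_text!r}')
    return int(digits)


print(parse_price('₦ 2,250,000'))
```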


In [66]:
for article in articles:
    phone_name = article.find('h3', class_='name').text
    img_src = article.img['data-src']
    price = article.find('div', class_='prc').text
    print(f'Name: {phone_name}\nImage link: {img_src}\nPrice: {price}')

Name: Apple IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/29/0390262/1.jpg?0127
Price: ₦ 2,250,000
Name: Apple IPhone 15 Pro Max 512gb - Nano Sim - Blue Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/58/0390262/1.jpg?9778
Price: ₦ 3,300,000
Name: Apple IPhone 13 Pro Max 6.7" Super Retina XDR Display With ProMotion, (6GB RAM + 512GB ROM), IOS 15, 5G, FaceTime - Gold
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/55/122749/1.jpg?5260
Price: ₦ 1,700,000
Name: Apple IPhone 15 Pro Max 256gb - Nano Sim - Blue Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/81/0390262/1.jpg?9046
Price: ₦ 2,150,000
Name: Apple IPhone 12 Pro Max - 6.7-Inch - 128GB ROM, 6GB RAM - 2815mAh
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/52/3501422/1.jpg?0218
Price: ₦ 780,0
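
The loop above assumes every product card has a name, an image with a `data-src` attribute, and a price. If one of them is missing, `.text` raises an `AttributeError` or the subscript raises a `KeyError`. Below is a hedged sketch of a more defensive version. The helper name `extract_phone` and the sample card are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_phone(article):
    """Return a dict of fields for one product card, using None for missing parts."""
    name_tag = article.find('h3', class_='name')
    price_tag = article.find('div', class_='prc')
    img_tag = article.find('img')
    return {
        'name': name_tag.get_text(strip=True) if name_tag is not None else None,
        'price': price_tag.get_text(strip=True) if price_tag is not None else None,
        # Tag.get returns None instead of raising KeyError when the attribute is absent.
        'image': img_tag.get('data-src') if img_tag is not None else None,
    }


# Made-up card that has a name but no image and no price.
card = BeautifulSoup('<article><h3 class="name"> Phone A </h3></article>', 'html.parser').article
print(extract_phone(card))
```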

The goal of web scraping is to get data from the internet. Most times, we want to save this data to a file for further processing. Let's see how we can do this.

In [70]:
with open('phones.txt', 'w', encoding='utf-8') as f:
    for article in articles:
        phone_name = article.find('h3', class_='name').text
        img_src = article.img['data-src']
        price = article.find('div', class_='prc').text
        f.write(f'Name: {phone_name}\nImage link: {img_src}\nPrice: {price}')
        f.write('\n')

<div class="info"><h3 class="name">Apple Iphone Xs Max 64gb/4gb 6.5inch Silver, Case &amp; Screen Guide</h3><div class="prc">₦ 305,000</div></div>

In [26]:
phones_info = soup.find_all('div', class_ = 'info')
for phone in phones_info:
    phone_name = phone.find('h3').text
    print(phone_name)
    

Apple IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium
Apple IPhone 15 Pro Max 512gb - Nano Sim - Blue Titanium
Apple IPhone 13 Pro Max 6.7" Super Retina XDR Display With ProMotion, (6GB RAM + 512GB ROM), IOS 15, 5G, FaceTime - Gold
Apple IPhone 15 Pro Max 256gb - Nano Sim - Blue Titanium
Apple IPhone 12 Pro Max - 6.7-Inch - 128GB ROM, 6GB RAM - 2815mAh
Apple IPhone XR - 6.1" - 64GB ROM, 3GB RAM, 2942mAh - Blue
Apple IPhone 13 - 6.1" - 128GB ROM, 4GB RAM, 3240mAh - White
Apple IPhone 14 Pro Max 6.7" 256GB Nano Sim - Deep Purple
Apple IPhone 13 Pro Max 6.7" Super Retina XDR Display With ProMotion, (6GB RAM + 512GB ROM) IOS 15, 5G, FaceTime - Sierra Blue
Apple IPhone 11 Pro Max - 6.5" - 4GB RAM, 64GB ROM - Gold
Apple IPhone 12 Pro Max - 6.7" - 256GB ROM, 6GB RAM - IOS 14.
Apple Iphone 12 Pro Max 256GB - Gold
Apple IPhone 15 Pro 128gb - Nano Sim - Natural Titanium
Apple IPHONE 12 PRO MAX  6.7" (12PM+12PM+12PM) 128+6GB RAM 3687mAh
Apple IPhone 13 Pro 6.1 Inch (6GB RAM + 256GB ROM)- Bl

You can also save the scraped data as a CSV (comma-separated values) file. This is a very common file format in data science and machine learning.

In [27]:
import csv

with open('phones.csv', 'w', encoding='utf-8', newline='') as f:
    # csv.writer quotes fields that contain commas, such as the phone names and prices
    writer = csv.writer(f)
    writer.writerow(['phone name', 'image source', 'price'])
    for article in articles:
        phone_name = article.find('h3', class_='name').text
        img_src = article.img['data-src']
        price = article.find('div', class_='prc').text
        writer.writerow([phone_name, img_src, price])
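
Phone names and prices on this page contain commas, so joining the fields with a plain `', '` would give rows with a varying number of columns. Python's built-in `csv` module quotes such fields automatically. The sketch below writes two rows copied from the output above to an in-memory buffer and reads them back; the image links are left out for brevity.

```python
import csv
import io

# Two rows copied from the output above; both the name and the price contain commas.
rows = [
    {'phone name': 'Apple IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium',
     'price': '₦ 2,250,000'},
    {'phone name': 'Apple IPhone 13 Pro Max 6.7" Super Retina XDR Display With ProMotion, '
                   '(6GB RAM + 512GB ROM), IOS 15, 5G, FaceTime - Gold',
     'price': '₦ 1,700,000'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['phone name', 'price'])
writer.writeheader()
writer.writerows(rows)

# Reading the text back yields exactly two fields per row, commas and quotes intact.
buffer.seek(0)
for record in csv.DictReader(buffer):
    print(record['phone name'], '|', record['price'])
```

When writing to a real file instead of a buffer, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.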

## SCRAPING DATA ACROSS MULTIPLE PAGES

To scrape across multiple pages, we need to observe the website URL and how it changes as we navigate across pages. You will notice that part of the URL remains the same as we navigate, while another part changes following a pattern.

For example, the URL of the Jumia web page we are currently working on is `https://www.jumia.com.ng/mobile-phones/apple/`, but when we click on the next page, the URL changes to `https://www.jumia.com.ng/mobile-phones/apple/?page=2`. As we navigate through the pages, the page number increases by 1 each time. Also, if we set page=1 in the URL (`https://www.jumia.com.ng/mobile-phones/apple/?page=1`), it returns the same page as `https://www.jumia.com.ng/mobile-phones/apple/`. With this information, we can scrape data across all the pages; we just need to know the last page number.
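
The URL pattern described above can be sketched as a small generator. The names `BASE_URL`, `page_urls`, and `last_page` are made up for illustration, and the last page number still has to be found by looking at the site.

```python
BASE_URL = 'https://www.jumia.com.ng/mobile-phones/apple/'


def page_urls(last_page, base_url=BASE_URL):
    """Yield the URL of every listing page from 1 up to and including last_page."""
    for number in range(1, last_page + 1):
        yield f'{base_url}?page={number}'


for url in page_urls(3):
    print(url)
```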

In [28]:
with open('Iphones.txt', 'w', encoding ='utf-8') as f:
    for index in range (1, 42):
        page = requests.get(f'https://www.jumia.com.ng/mobile-phones/apple/?page={index}')
        soup = BeautifulSoup(page.text, 'lxml')
        articles = soup.find_all('article', class_='prd _fb col c-prd')
        for article in articles:
            phone_name = article.find('h3', class_='name').text
            img_src = article.img['data-src']
            price = article.find('div', class_='prc').text
            f.write(f'Name: {phone_name}\nImage link: {img_src}\nPrice: {price}')
            f.write('\n')

In [29]:
page = requests.get(f'https://www.jumia.com.ng/mobile-phones/apple/')
soup = BeautifulSoup(page.text, 'lxml')
page_num = len(soup.find_all('a', class_ = ["pg"]))

In [30]:
          
for index in range(1, page_num + 2):
    page = requests.get(f'https://www.jumia.com.ng/mobile-phones/apple/?page={index}#catalog-listing')
    soup = BeautifulSoup(page.text, 'lxml')
    articles = soup.find_all('article', class_='prd _fb col c-prd')
    for article in articles:
        phone_name = article.find('h3', class_='name').text
        img_src = article.img['data-src']
        price = article.find('div', class_='prc').text
        print(f'Name: {phone_name}\nImage link: {img_src}\nPrice: {price}')

Name: Apple IPhone 15 Pro Max 512gb - Nano Sim - Natural Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/29/0390262/1.jpg?0127
Price: ₦ 2,250,000
Name: Apple IPhone 15 Pro Max 512gb - Nano Sim - Blue Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/58/0390262/1.jpg?9778
Price: ₦ 3,300,000
Name: Apple IPhone 13 Pro Max 6.7" Super Retina XDR Display With ProMotion, (6GB RAM + 512GB ROM), IOS 15, 5G, FaceTime - Gold
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/55/122749/1.jpg?5260
Price: ₦ 1,700,000
Name: Apple IPhone 15 Pro Max 256gb - Nano Sim - Blue Titanium
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/81/0390262/1.jpg?9046
Price: ₦ 2,150,000
Name: Apple IPhone 12 Pro Max - 6.7-Inch - 128GB ROM, 6GB RAM - 2815mAh
Image link: https://ng.jumia.is/unsafe/fit-in/300x300/filters:fill(white)/product/52/3501422/1.jpg?0218
Price: ₦ 780,0
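
An alternative to working out the last page number in advance is to keep requesting pages until one comes back without any products, pausing briefly between requests. The sketch below illustrates the idea offline with made-up pages. The function name `scrape_until_empty`, the `fetch` callback, and the assumption that pages past the end contain no product cards are all hypothetical and would need checking against the real site.

```python
import time

from bs4 import BeautifulSoup


def scrape_until_empty(fetch, max_pages=50, delay=1.0):
    """Collect phone names page by page until a page has no product cards.

    fetch is any function that takes a page number and returns that page's HTML.
    """
    names = []
    for number in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch(number), 'html.parser')
        cards = soup.select('article.prd')
        if not cards:
            break  # an empty listing is treated as "past the last page"
        for card in cards:
            title = card.find('h3', class_='name')
            if title is not None:
                names.append(title.get_text(strip=True))
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return names


# Offline stand-in for the website: two pages with products, then an empty one.
fake_pages = {
    1: '<article class="prd"><h3 class="name">Phone A</h3></article>',
    2: '<article class="prd"><h3 class="name">Phone B</h3></article>',
}
print(scrape_until_empty(lambda n: fake_pages.get(n, ''), delay=0))
```

Against the real site, `fetch` could be a function that calls `requests.get` with the page number in the URL and returns the response's `text`.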

In [68]:
import pandas as pd

# list to hold the scraped data
data1 = []
for index in range(1, 8):
    page = requests.get(f'https://www.jumia.com.ng/mobile-phones/apple/?page={index}')
    soup = BeautifulSoup(page.text, 'html.parser')
    articles = soup.find_all('article', class_='prd _fb col c-prd')

    # extract the fields from each article
    for z in articles:
        phone_name = z.find('h3', class_='name').text
        img_src = z.img['data-src']
        price = z.find('div', class_='prc').text
        # append the extracted data as a dictionary to the list
        data1.append({'Name': phone_name, 'Image Link': img_src, 'Price': price})

# create a DataFrame from the list of dictionaries
df = pd.DataFrame(data1)

# save the DataFrame to CSV
df.to_csv('newphones.csv', index=False, encoding='utf-8')

print('data has been successfully saved to newphones.csv')




data has been successfully saved to newphones.csv


In [67]:
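import csv

with open('newphones1.csv', 'w', encoding='utf-8', newline='') as f:
    # csv.writer quotes fields that contain commas, such as the phone names and prices
    writer = csv.writer(f)
    for index in range(1, 42):
        page = requests.get(f'https://www.jumia.com.ng/mobile-phones/apple/?page={index}')
        soup = BeautifulSoup(page.text, 'lxml')
        # select the product cards again on every page; otherwise the loop reuses old results
        articles = soup.find_all('article', class_='prd _fb col c-prd')
        for article in articles:
            phone_name = article.find('h3', class_='name').text
            img_src = article.img['data-src']
            price = article.find('div', class_='prc').text
            writer.writerow([phone_name, img_src, price])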