## Web scraping with BeautifulSoup

A tutorial on web scraping with the Requests and BeautifulSoup libraries.

In [116]:
import requests # To fetch the web pages we want to scrape.
from bs4 import BeautifulSoup
import csv
import pandas as pd
from IPython.display import Image, HTML # To be able to display images within a DataFrame

In [2]:
url='http://ensias.um5.ac.ma/page/ing%C3%A9nieurs' # We will extract the names of alumni of my own school, ENSIAS, for the class of 2015.

In [3]:
req=requests.get(url) 
status=req.status_code # HTTP status code of the response; 200 means the request succeeded.
encoding=req.encoding
text=BeautifulSoup(req.text,'html.parser')
print(status)

200


In [5]:
data=text.findAll('tr') # Find all <tr> (table row) tags
for item in data:
    pass
item # After the loop, item holds the last row; display it to see its structure.

<tr height="20"><td align="right" height="20" style="height:20px;">183</td>
<td align="left">ZRAIBI</td>
<td align="left">SALMA</td>
</tr>

In [14]:
a=item.contents # .contents returns the children of the tag as a list
a

[<td align="right" height="20" style="height:20px;">183</td>,
 '\n',
 <td align="left">ZRAIBI</td>,
 '\n',
 <td align="left">SALMA</td>,
 '\n']

In [23]:
print("List of students, class of 2015: \n")
print(f"{'Last name':<20}{'First name':<15}\n") # The second cell holds the last name, the third the first name
for item in data:
    a=item.contents
    print(f"{a[2].text:<20}{a[4].text:<15}") # The .text property returns the contents of the tag as a string

List of students, class of 2015: 

Last name           First name     

ABI                 YASSIR         
ABIDALLAH           YOUSSEF        
ACHAOUD             HANANE         
ADDAD               ABDELHALIM     
AFIF                MOHAMMED       
AIT EL HAJ          KHADIJA        
AIT EL HARRAJ       AMINE          
AIT MANSOUR         YOUSSEF        
AITOUNA             MOHAMED        
AL MAACH            WAHIBA         
ALAOUI MRANI        SOUKAINA       
ALLAOU              MARIAME        
ALMAMOUN            ZAKARIAE       
AMARA               HALA           
AMCHICH             IMANE          
AMENCHAR            NABIL          
AMMARI              KHALID         
AMNAS               ASMAE          
AOULADLAHCEN        MOHAMMED       
AOUNI               HAMZA          
AOURAGH             YOUNESSE       
AOUTIL              AHMED          
AQQA                MILOUD         
BAHADOU             HIND           
BARBARE             MOHAMMED AMINE 
BELAKHAL            HAMZA       
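
Indexing `.contents` at positions 2 and 4 only works because of the newline strings that sit between the cells. A minimal sketch of an alternative, assuming every student row has exactly three `<td>` cells (number, last name, first name): collect the cells with `find_all('td')` so whitespace nodes are ignored. `SAMPLE_HTML` and `parse_rows` are illustrative names that do not appear in the notebook, and the header row in the sample is hypothetical.

```python
from bs4 import BeautifulSoup

# One row copied from the output above, plus a hypothetical header row.
SAMPLE_HTML = """
<table>
<tr><th>No</th><th>Last name</th><th>First name</th></tr>
<tr height="20"><td align="right" height="20" style="height:20px;">183</td>
<td align="left">ZRAIBI</td>
<td align="left">SALMA</td>
</tr>
</table>
"""

def parse_rows(html):
    """Return (number, last_name, first_name) tuples for rows with three <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        # find_all('td') skips the '\n' strings that .contents would include.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # ignore header rows or malformed rows
            rows.append(tuple(cells))
    return rows

students = parse_rows(SAMPLE_HTML)
print(students)
```

With this approach a change in the whitespace of the page does not shift the positions of the cells.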

For the next task, we will also retrieve images. We will use the Jumia online store for this.

In [59]:
url='https://www.jumia.com.ng/laptops/'

In [60]:
# As before, send the request and parse the response.
req=requests.get(url)
status=req.status_code
encoding=req.encoding
text=BeautifulSoup(req.text,'html.parser')
print(status)

200


In [87]:
# By inspecting the HTML source of the page, we can find the tags that hold the data we want to scrape.
# Here, each <a> tag with the class 'link' contains one product: its image, name, brand, etc.
# We look these tags up with findAll and display the last one to see its structure.
data=text.findAll('a',{'class':'link'})
for item in data:
    pass
item

<a class="link" href="https://www.jumia.com.ng/dell-alienware-corei7.3-16gb-ram-2tb-hdd256gb-ssd-8gb-nvidia-geforce-gtx-windows10-home-bag-32gb-flash-wireless-mouse-36472029.html"> <div class="top"> </div> <div class="image-wrapper default-state"><noscript><img class="image" height="220" src="https://ng.jumia.is/Ql8rdB0tB68gwOLSpn9a1LZ3Xn0=/fit-in/220x220/filters:fill(white):sharpen(1,0,false):quality(100)/product/92/027463/1.jpg?8665" width="220"/></noscript></div> <h2 class="title"

In [88]:
brand=item.find_all('span',{'class':'brand'})[0].text.strip() # strip() removes surrounding whitespace
name=item.find_all('span',{'class':'name'})[0].text.strip()
price=item.find_all('span',{'class':'price'})[0].text.strip()
print('Brand: ',brand,'\nName: ',name,'\nPrice: ',price)

Brand:  DELL 
Name:  Alienware Corei7.3-(16GB RAM, 2TB HDD+256GB SSD) 8GB NVIDIA GeForce GTX ,Windows10 Home+ BAG, 32GB FLASH, WIRELESS MOUSE 
Price:  ₦ 430,000
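
The scraped price is a string such as `₦ 430,000`, which cannot be sorted or compared numerically. A minimal sketch of a conversion, assuming prices are whole amounts with no decimal part (`parse_price` is a hypothetical helper, not part of the notebook):

```python
import re

def parse_price(text):
    """Return the digits of a price string as an int, or None if it has no digits."""
    digits = re.sub(r"\D", "", text)  # drop the currency sign, spaces and commas
    return int(digits) if digits else None

price_value = parse_price("₦ 430,000")
print(price_value)
```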


In [110]:
img=item.find_all('img',{'class':'lazy image'})
img
img_src=img[0].get('data-src')
img_src

'https://ng.jumia.is/AJQFFwlEvzNiA0hhxrmAPn4hNU0=/fit-in/220x220/filters:fill(white):sharpen(1,0,false):quality(100)/product/43/662352/1.jpg?9611'

In [91]:
def get_img(url_img,img_id):
    '''Download the image at url_img and save it as <img_id>.jpg.'''
    r=requests.get(url_img)
    if r.status_code==200:
        # Output folder; adjust this path to your own machine.
        img_container="E:/git/Web-Scraping/"+str(img_id)+'.jpg'
        with open(img_container ,'wb') as f: # 'wb' because the image is raw bytes
            f.write(r.content)
        print('Image saved!')
    else:
        print('Connection trouble!')

In [92]:
get_img(img_src,2)

Image saved!


![](2.jpg)

In [150]:
def path_to_image_html(path):
    '''
     Convert an image URL into an HTML tag of the form
     <img src='path'/> so that the image renders inside the DataFrame.
     Attributes that control the height, width, etc. can be added
     to the tag here.
    '''
    return ("<img src="+"'"+path+"'"+"/>")
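
The docstring mentions formatting adjustments without showing one. A minimal sketch of a variant that sets the display width, assuming a fixed pixel width is acceptable (`path_to_image_html_sized` and its `width` parameter are hypothetical names introduced here):

```python
from html import escape

def path_to_image_html_sized(path, width=100):
    """Build an <img> tag with an explicit width; the URL is escaped for use in an attribute."""
    return f'<img src="{escape(path, quote=True)}" width="{int(width)}"/>'

tag = path_to_image_html_sized("https://example.com/1.jpg", 80)
print(tag)
```

Because `width` has a default value, the function can still be passed as a one-argument formatter, for example `formatters=dict(Images=path_to_image_html_sized)`.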

In [153]:
# Let's build a DataFrame from the Jumia laptop offers page that lists every laptop with its
# brand, name (specs), price and image.

Jdf=pd.DataFrame(columns=['Brand','Specs','Price'])
i=0
images=[]
for item in data:
    brand=item.find_all('span',{'class':'brand'})[0].text.strip() 
    name=item.find_all('span',{'class':'name'})[0].text.strip()
    price=item.find_all('span',{'class':'price'})[0].text.strip()
    img=item.find_all('img',{'class':'lazy image'})
    if (len(img) > 1):
        img_src=img[1].get('data-src')
    else:
        img_src=img[0].get('data-src')
    Jdf.loc[i]=[brand,name,price]
    images.append(img_src)
    i+=1
Jdf['Images']=images
print(len(Jdf))
pd.set_option('display.max_colwidth', None) # None disables truncation; passing -1 is deprecated since pandas 1.0
HTML(Jdf.to_html(escape=False ,formatters=dict(Images=path_to_image_html)))

40


| | Brand | Specs | Price | Images |
|---|---|---|---|---|
| 0 | Hp | HP Pavilion 15 8th Gen Intel Core I5 1TB HDD 12GB RAM Win10 | ₦ 185,900 | |
| 1 | Hp | 250 G7, Intel Core I3, 1TB Hdd, 4GB Ram, Bluetooth, Webcam, Win 10, Plus Bag. | ₦ 131,500 | |
| 2 | Hp | Stream 11-Intel Celeron (4GB/32GB)Windows10+32Gb Flash Drive | ₦ 59,800 | |
| 3 | Hp | Pavilion 15 8th Gen Intel Core I5 1TB HDD 12GB RAM Win10 | ₦ 186,000 | |
| 4 | Lenovo | Ideapad Intel Celeron N4000 8thGen 4GB RAM 500GB HDD Wins10 + 32GB Flash Drive | ₦ 78,800 | |
| 5 | Hp | Stream 11 Intel Celeron Mini Laptop(32GB HDD/2GB Ram- 32GB Flash - USB LIght)Wins 10 White | ₦ 57,000 | |
| 6 | Hp | 15 AMD Dual Core E2 9000e 500GB HDD, 4GB RAM, WLAN, Webcam, Windows 10 | ₦ 77,400 | |
| 7 | Hp | 15, AMD DUAL CORE, 500gb Hdd, 4gb RaM 15.6" Win10 +16GB And Wireless Mouse | ₦ 77,500 | |
| 8 | Hp | 15 Intel Celeron (4GB RAM, 500GB HDD+ Free 32GB Flash- USB Light) Windows 10 | ₦ 83,500 | |
| 9 | Lenovo | Ideapad 4GBRAM/500 HDD Intel Celeron Dual Core Windows 10 +free Mouse | ₦ 78,990 | |
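
The loop above indexes `find_all(...)[0]` directly, which raises an `IndexError` as soon as one product card lacks a brand, price or image. A minimal sketch of a more tolerant extraction, assuming the same class names as on the page above; the sample HTML, `text_or_none` and `parse_card` are illustrative and not part of the notebook:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second card has no price and no image.
SAMPLE_HTML = """
<a class="link" href="#first">
  <span class="brand">DELL</span> <span class="name">Alienware</span>
  <span class="price">₦ 430,000</span>
  <img class="lazy image" data-src="https://example.com/1.jpg"/>
</a>
<a class="link" href="#second">
  <span class="brand">Hp</span> <span class="name">Stream 11</span>
</a>
"""

def text_or_none(card, css_class):
    """Return the stripped text of the first matching <span>, or None if it is absent."""
    span = card.find("span", {"class": css_class})
    return span.get_text(strip=True) if span else None

def parse_card(card):
    """Turn one product card into a dict, leaving missing fields as None."""
    img = card.find("img", {"class": "lazy image"})
    return {
        "Brand": text_or_none(card, "brand"),
        "Specs": text_or_none(card, "name"),
        "Price": text_or_none(card, "price"),
        "Images": img.get("data-src") if img else None,
    }

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_card(card) for card in soup.find_all("a", {"class": "link"})]
print(records)
```

A list of dicts like `records` can be passed to `pd.DataFrame(records)` in one call, which avoids growing the DataFrame row by row with `.loc`.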


As a final exercise, we will extract from the main page of Wikipedia the available languages and the number of articles for each language.
![](Wiki.png)

In [154]:
DataFrame=pd.DataFrame(columns=['Lang', 'Num Artic'])

In [155]:
url='https://www.wikipedia.org/'

In [156]:
req=requests.get(url)
status=req.status_code
encoding=req.encoding
text=BeautifulSoup(req.text,'html.parser')
print(status)

200


In [157]:
data=text.findAll('div',{'class':'central-featured'})
for item in data:
    pass
item

<div class="central-featured" data-el-section="primary links">
<!-- Rankings from http://stats.wikimedia.org/EN/Sitemap.htm -->
<!-- Article counts from http://meta.wikimedia.org/wiki/List_of_Wikipedias/Table -->
<!-- #1. en.wikipedia.org - 1 750 870 000 views/day -->
<div class="central-featured-lang lang1" dir="ltr" lang="en">
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English â Wikipedia â The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">5 935 000+</bdi> <span>articles</span></small>
</a>
</div>
<!-- #2. es.wikipedia.org - 258 167 000 views/day -->
<div class="central-featured-lang lang2" dir="ltr" lang="es">
<a class="link-box" data-slogan="La enciclopedia libre" href="//es.wikipedia.org/" id="js-link-box-es" title="EspaÃ±ol â Wikipedia â La enciclopedia libre">
<strong>EspaÃ±ol</strong>
<small><bdi dir="ltr">1 546 000+</bdi> <span>artÃ­culos</span></small>
</a>
</div>
<!-- #3. ja.wikip

In [158]:
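# item.contents interleaves newline strings, comments and <div> tags:
# the ten language <div>s sit at indices 7, 11, ..., 43.
i=7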
j=0
while i<44:
    l=[item.contents[i].findAll('strong')[0].text.strip(),item.contents[i].findAll('bdi')[0].text.strip()]
    DataFrame.loc[j]=l
    i+=4
    j+=1

In [159]:
DataFrame

| | Lang | Num Artic |
|---|---|---|
| 0 | English | 5 935 000+ |
| 1 | EspaÃ±ol | 1 546 000+ |
| 2 | æ¥æ¬èª | 1 169 000+ |
| 3 | Deutsch | 2 345 000+ |
| 4 | Ð ÑÑÑÐºÐ¸Ð¹ | 1 569 000+ |
| 5 | FranÃ§ais | 2 141 000+ |
| 6 | Italiano | 1 554 000+ |
| 7 | ä¸­æ | 1 074 000+ |
| 8 | PortuguÃªs | 1 014 000+ |
| 9 | Polski | 1 360 000+ |
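
The loop above relies on fixed positions (7, 11, ..., 43) inside `item.contents`, which break if Wikipedia adds or removes a comment or a newline. A minimal sketch of an alternative, assuming each language block keeps the `central-featured-lang` class seen in the output above; the sample HTML is a shortened copy of that output with correctly decoded names:

```python
from bs4 import BeautifulSoup

# Shortened copy of the structure shown above (two languages only).
SAMPLE_HTML = """
<div class="central-featured">
<!-- #1. en.wikipedia.org -->
<div class="central-featured-lang lang1" dir="ltr" lang="en">
<a class="link-box" href="//en.wikipedia.org/">
<strong>English</strong>
<small><bdi dir="ltr">5 935 000+</bdi> <span>articles</span></small>
</a>
</div>
<!-- #2. es.wikipedia.org -->
<div class="central-featured-lang lang2" dir="ltr" lang="es">
<a class="link-box" href="//es.wikipedia.org/">
<strong>Español</strong>
<small><bdi dir="ltr">1 546 000+</bdi> <span>artículos</span></small>
</a>
</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
# class_ matches any tag whose class list contains 'central-featured-lang'.
for block in soup.find_all("div", class_="central-featured-lang"):
    lang = block.find("strong").get_text(strip=True)
    count = block.find("bdi").get_text(strip=True)
    rows.append((lang, count))
print(rows)
```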


The language names are garbled. This is an encoding problem: the page is UTF-8 but its text was decoded with a different character set. The article counts are unaffected, so we can still verify that the results are correct. Fixing the encoding before parsing, or correcting the language names manually, is straightforward.
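
A minimal sketch of what happens to the names, assuming the UTF-8 bytes of the page were decoded as ISO-8859-1 (a likely cause, but not confirmed from the notebook itself). It reproduces the garbled form of one name and then reverses it:

```python
original = "Español"

# The server sends UTF-8 bytes ...
page_bytes = original.encode("utf-8")

# ... and decoding them as ISO-8859-1 yields the garbled text seen above.
garbled = page_bytes.decode("iso-8859-1")

# Re-encoding with the wrong charset and decoding as UTF-8 undoes the damage.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired)

# With Requests, the usual fix is applied before parsing: either set
# req.encoding = 'utf-8' before reading req.text, or pass the raw bytes
# (req.content) to BeautifulSoup and let it detect the encoding.
```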