# Welcome to the Live Coding (Web Scraping) ***Data Science CI***
<div align='center'>

    ------------------------
</div>
<div align='center'>

<img src="https://fiverr-res.cloudinary.com/images/q_auto,f_auto/gigs/134606170/original/fb4be771c30d6cb17fa9caee0322a7f6aeb843d0/do-data-mining-web-scraping-from-website-or-webpage-to-excel.png" alt="logo web scraping" >

</div>
Source : Google Images

___

## Web Scraping
**Description:** Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

**Scraping package:** beautifulsoup4

___

# Live Coding

## Install and Import needed packages

In [1]:
!pip install --quiet beautifulsoup4 # useful to manipulate web pages source code
!pip install --quiet pandas # useful to structure collected data
!pip install --quiet requests python-dateutil # useful to request web pages and parse dates

In [2]:
from bs4 import BeautifulSoup as bs
from dateutil.parser import parse
import pandas as pd
import requests # useful to request web pages and get their source code

## Set url and request it

In [3]:
# BBC Africa Sport page link
page_url = 'https://www.bbc.com/afrique/topics/c404v54yrqyt' 

# We request the page to get its full source code
response = requests.get(page_url)

In [4]:
# Check the status code to see whether the request was successful
response.status_code

200

<div align='center'>
<p style='text-align:center'><b>Summary of HTTP request status codes</b></p>
<img src="https://www.researchgate.net/profile/Vaibhav_Hemant_Dixit/publication/328327565/figure/tbl1/AS:682534392303619@1539740288388/HTTP-response-codes-and-inference.png" alt="Table of HTTP request status codes" >

</div>

Source : AIM-SDN: Attacking Information Mismanagement in SDN-datastores _found on Research Gate_
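
As a rough companion to the table above, the sketch below uses Python's standard `http.HTTPStatus` enum to turn a status code into a readable label. The helper name `describe_status` and the family labels are illustrative choices, not part of the original notebook.

```python
from http import HTTPStatus

# Families of HTTP status codes, keyed by the first digit of the code
STATUS_FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def describe_status(code):
    """Return a label such as '200 OK (success)' for an HTTP status code."""
    family = STATUS_FAMILIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # the code is not a registered HTTP status
        phrase = "Unknown"
    return f"{code} {phrase} ({family})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

With a live `requests` response, `describe_status(response.status_code)` would label the result, and `response.raise_for_status()` is a common way to stop early on 4xx or 5xx answers.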





## Show the response text and parse it with BeautifulSoup

In [5]:
# print(response.text[:788])

In [6]:
# Parse the response with BeautifulSoup; naming the parser explicitly keeps results consistent across machines
soup = bs(response.text, "html.parser")

In [7]:
type(soup)

bs4.BeautifulSoup

In [8]:
# To get help on the soup variable (a bs4.BeautifulSoup object), uncomment the line below

# help(soup)

## Use the BeautifulSoup object to gather the information we want

> Here we want to get all the information about the publications on the target web page.
We use the browser inspector to do that.

> We found that each publication is in an `<li>` tag, all of them inside an `<ol>` tag with the attribute `class="gs-u-m0 gs-u-p0 lx-stream__feed qa-stream"`. We'll use the `BeautifulSoup` methods `find` and `find_all` to retrieve the tags that contain useful data.


<div align='center'>

<img src="https://drive.google.com/uc?export=view&id=1EDKWf9Nt7nQ53T1JwupAOYqJ2CjrJFAe" alt="BBC Africa Sport inspecting in Google Chrome" >

</div>
Source : Emmanuel KOUPOH

> After inspection, we found that there are two main types of publications: `articles (containing image and text)` and `video reportages`. We'll follow these steps to do our scraping:

1.   Isolate the `<ol>` tag which contains all publications ;
1.   Gather each `<li>` which represents a publication ; 
1.   Make a prototype for each of the two main types of publications ; 
1.   Run a loop over all `<li>` tags of the isolated `<ol>`.

### Isolate the tag `<ol>` which contains all publications

In [9]:
publications_container = soup.find("ol", {"class" : "gs-u-m0 gs-u-p0 lx-stream__feed qa-stream"} )

Just a note: the `find` method returns a single tag (or `None` if nothing matches), which we can handle like a BeautifulSoup object.

In [10]:
print(f"{publications_container.prettify()[:788]}...")

<ol class="gs-u-m0 gs-u-p0 lx-stream__feed qa-stream" data-reactid=".wu12hvz5v6.1.0.1.1">
 <noscript data-reactid=".wu12hvz5v6.1.0.1.1.0">
 </noscript>
 <li class="lx-stream__post-container placeholder-animation-finished" data-reactid=".wu12hvz5v6.1.0.1.1.1:$post-53279666">
  <article aria-labelledby="title_53279666" class="qa-post gs-u-pb-alt+ lx-stream-post gs-u-pt-alt+ gs-u-align-left" data-reactid=".wu12hvz5v6.1.0.1.1.1:$post-53279666.0" id="post_53279666">
   <div class="gs-u-mb-" data-reactid=".wu12hvz5v6.1.0.1.1.1:$post-53279666.0.0">
    <time class="lx-stream-post__meta-time gs-u-align-middle gs-u-display-inline-block gs-u-mr0@m gs-u-mr gel-long-primer" data-reactid=".wu12hvz5v6.1.0.1.1.1:$post-53279666.0.0.0">
     <span class="gs-u-vh qa-visually-hidden-meta" data-re...


### Gather each `<li>` which represents a publication

In [11]:
publications_list = publications_container.find_all("li", {"class" : "lx-stream__post-container placeholder-animation-finished"} )

Just a note: the `find_all` method returns a list of tags, and each tag can be handled like a BeautifulSoup object.

In [12]:
print(f"How many publications do we have in our container of publications : {len(publications_list)}")

How many publications do we have in our container of publications : 10
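
One point worth knowing when matching on several classes at once: a multi-class string passed to `find` or `find_all` is compared with the exact value of the `class` attribute, so a different class order no longer matches. The sketch below illustrates this on made-up HTML (the class names are hypothetical, not the real BBC markup) and shows a CSS selector via `select` as one possible alternative.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure of the listing (not the real BBC markup)
html = """
<ol class="feed qa-stream">
  <li class="post placeholder-done">First</li>
  <li class="placeholder-done post">Second</li>
  <li class="advert">Ad</li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns a single Tag (or None when nothing matches)
container = soup.find("ol", {"class": "feed qa-stream"})

# A multi-class string is compared with the exact value of the class attribute,
# so the second <li>, whose classes come in another order, is missed
exact = container.find_all("li", {"class": "post placeholder-done"})

# A CSS selector tests each class independently, whatever their order
by_selector = container.select("li.post.placeholder-done")

print(len(exact), len(by_selector))  # 1 2
```

Whether the selector form is preferable depends on how stable the page's class names are; both approaches break if the site renames its classes.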



> We see that the first and second publications are respectively a `video reportage` and a `readable article (containing image and text)`. We'll use them to build our prototypes.

<div align='center'>

<img src="https://drive.google.com/uc?export=view&id=10TxXXWYc8okDPpcprVPcvZQQpaneu0fw" alt="First and Second publications on BBC Africa Sport inspecting in Google Chrome" >

</div>
Source : Emmanuel KOUPOH

____

### Make a prototype for each of two main types of publications

**Note:** Both types share two pieces of information: the date and the title block.

#### Video reportage 

In [13]:
video = publications_list[0]

##### Get Date

In [14]:
datetime_cell = video.find("span", {"class": "qa-post-auto-meta"})
datetime_cell

<span aria-hidden="true" class="qa-post-auto-meta" data-reactid=".wu12hvz5v6.1.0.1.1.1:$post-53279666.0.0.0.1">18:59 3 juillet 2020</span>

We get the date as a French string ('18:59 3 juillet 2020'), so we convert the month name to English and parse it into a datetime.


In [15]:
date = datetime_cell.text.strip().lower().replace('janvier', 'january').replace('février', 'february').replace('mars', 'march').replace('avril', 'april').replace('mai', 'may').replace('juin', 'june').replace('juillet', 'july').replace('août', 'august').replace('septembre', 'september').replace('octobre', 'october').replace('novembre', 'november').replace('décembre', 'december')
date = parse(date, fuzzy_with_tokens=True)[0]
date

datetime.datetime(2020, 7, 3, 18, 59)
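
As an alternative to the chain of `replace` calls, here is a minimal sketch that parses the French string directly with a month lookup table and a regular expression, using only the standard library instead of `dateutil`. It assumes every date follows the `HH:MM D month YYYY` layout seen above; the names `FRENCH_MONTHS`, `DATE_PATTERN` and `parse_french_datetime` are illustrative.

```python
import re
from datetime import datetime

# French month names mapped to month numbers
FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}

# Assumed layout: "HH:MM D month YYYY", e.g. "18:59 3 juillet 2020"
DATE_PATTERN = re.compile(r"(\d{1,2}):(\d{2})\s+(\d{1,2})\s+(\w+)\s+(\d{4})")

def parse_french_datetime(text):
    """Parse a string like '18:59 3 juillet 2020' into a datetime."""
    match = DATE_PATTERN.search(text.strip().lower())
    if match is None:
        raise ValueError(f"Unrecognised date format: {text!r}")
    hour, minute, day, month_name, year = match.groups()
    if month_name not in FRENCH_MONTHS:
        raise ValueError(f"Unknown month name: {month_name!r}")
    return datetime(int(year), FRENCH_MONTHS[month_name],
                    int(day), int(hour), int(minute))

print(parse_french_datetime("18:59 3 juillet 2020"))  # 2020-07-03 18:59:00
```

Unlike fuzzy parsing, this version raises an error on an unexpected layout, which makes a change in the page format visible early.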

##### Get the publication's Title and Link

In [16]:
header = video.find("a", {"class": "qa-heading-link lx-stream-post__header-link"})

title = header.text.strip()
link = "https://www.bbc.com" + header['href']

print(f"Title : {title}\nLink : {link}")

Title : Rencontre avec l'enfant boxeur le plus rapide du monde
Link : https://www.bbc.com/afrique/media-53279666
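
The link above is built by string concatenation, which works as long as every `href` is site-relative. A small sketch with the standard library's `urllib.parse.urljoin` handles both relative and absolute values; `BASE_URL` is just an illustrative constant.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.bbc.com"

# A site-relative href, as found in the listing, is resolved against the base
relative = urljoin(BASE_URL, "/afrique/media-53279666")
print(relative)  # https://www.bbc.com/afrique/media-53279666

# An href that is already absolute is returned unchanged
absolute = urljoin(BASE_URL, "https://www.bbc.com/afrique/sports-53191371")
print(absolute)  # https://www.bbc.com/afrique/sports-53191371
```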


##### Get Video link


In [17]:
video_response = requests.get(link)
video_soup = bs(video_response.text, "html.parser")

video_cell = video_soup.find("figure", {"class": "Figure-sc-6a3dhy-0 gJUCFc"})


In [18]:
video_cell

<figure class="Figure-sc-6a3dhy-0 gJUCFc"><div class="StyledVideoContainer-sc-13p1a4d-0 bpuDKK"><iframe allow="autoplay; fullscreen" allowfullscreen="" class="StyledIframe-fuo2ed-0 dIyDYU" scrolling="no" src="https://polling.bbc.co.uk/ws/av-embeds/cps/afrique/media-53279666/p08jsrj3/fr" title="Lecteur média"></iframe><noscript><div class="StyledWrapper-sc-1pnftlt-0 jywlaE"><img alt="" aria-hidden="true" class="StyledImg-sc-7vx2mr-0 hIxkbt" src="https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg" srcset="https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg 240w, https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg 320w, https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg 480w, https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg 624w, https://ichef.bbci.co.uk/images/ic/1024x576/p08jssj5.jpg 800w"/><div class="MessageWrapper-sc-1pnftlt-1 bHbRwS"><strong class="StyledMessage-sc-1pnftlt-2 iRlUPR">Pour regarder ce contenu, veuillez activer JavaScript ou essayer un aut

In [19]:
video_cover_img = video_cell.img['src']

In [20]:
video_link = video_cell.iframe['src']

#### Article

In [21]:
image = publications_list[1]

##### Get Image of article

In [22]:
div_image_cell = image.find("div", {"class": "lx-stream-related-story--index-image-wrapper qa-story-image-wrapper"})

image_cell = div_image_cell.img
img_srcs_list = image_cell['srcset'].split()
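# Note: range(0, len(img_srcs_list) // 2, 2) stops halfway, so only the first half of the srcset entries is kept
img_srcs_dict = { img_srcs_list[i+1][:-1] : img_srcs_list[i]  for i in range( 0, len(img_srcs_list) // 2, 2) }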

img_srcs_dict

{'240w': 'https://ichef.bbci.co.uk/live-experience/cps/240/cpsprodpb/CE90/production/_113108825_whatsubject.jpg',
 '320w': 'https://ichef.bbci.co.uk/live-experience/cps/320/cpsprodpb/CE90/production/_113108825_whatsubject.jpg',
 '480w': 'https://ichef.bbci.co.uk/live-experience/cps/480/cpsprodpb/CE90/production/_113108825_whatsubject.jpg'}
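
A sketch of a `srcset` parser that splits on commas first, so that every candidate is kept and no width descriptor is truncated. The function name `parse_srcset` and the example URLs are hypothetical, and the sketch assumes that the URLs themselves contain no commas.

```python
def parse_srcset(srcset):
    """Map each width descriptor (e.g. '240w') to its URL."""
    sources = {}
    for candidate in srcset.split(","):
        parts = candidate.split()
        if len(parts) == 2:  # skip empty or descriptor-less candidates
            url, descriptor = parts
            sources[descriptor] = url
    return sources

# Hypothetical srcset value, shortened for readability
example = (
    "https://example.com/240/a.jpg 240w, "
    "https://example.com/320/a.jpg 320w, "
    "https://example.com/480/a.jpg 480w"
)
sources = parse_srcset(example)
print(sources)
```

In the notebook this could be called as `parse_srcset(image_cell['srcset'])`, which would return one entry per size listed by the page.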

##### Get article Primer

In [23]:
primer_cell = image.find("p", {"class": "lx-stream-related-story--summary qa-story-summary"})

primer = primer_cell.text 
primer

'Comment le premier titre de première division de Liverpool en Angleterre depuis 30 ans a été alimenté par des joueurs africains.'

#### Prototypes

***Note***: We are looking for a condition here to differentiate **video reportages** from **articles**.

In [24]:
len(video.find_all("img"))

0

In [25]:
len(image.find_all("img"))

1

***Note***: Building function prototypes for data gathering: `video` for **video reportages** and `article` for **articles**. Defining the function `video` rebinds the name that held the first publication tag above.

In [26]:
def video(publication_tag):
    video = publication_tag
    
    #Get date
    datetime_cell = video.find("span", {"class": "qa-post-auto-meta"})
    date = datetime_cell.text.strip().lower().replace('janvier', 'january').replace('février', 'february').replace('mars', 'march').replace('avril', 'april').replace('mai', 'may').replace('juin', 'june').replace('juillet', 'july').replace('août', 'august').replace('septembre', 'september').replace('octobre', 'october').replace('novembre', 'november').replace('décembre', 'december')
    date = parse(date, fuzzy_with_tokens=True)[0]
    
    #Get publication's title and link
    header = video.find("a", {"class": "qa-heading-link lx-stream-post__header-link"})
    title = header.text.strip()
    link = "https://www.bbc.com" + header['href']

    #Get video link
    video_response = requests.get(link)
    video_soup = bs(video_response.text, "html.parser")
    video_cell = video_soup.find("figure", {"class": "Figure-sc-6a3dhy-0 gJUCFc"})
    video_cover_img = video_cell.img['src']
    video_link = video_cell.iframe['src']
    
    # Save and structure the gathered information
    data =  {
        "type" : 'video',
        "publication_link" : link,
        "date" : date, 
        "title" : title, 
        "video_cover" : video_cover_img, 
        "video_link" : video_link, 
    }
    return data


def article(publication_tag):
    image = publication_tag
    
    #Get date
    datetime_cell = image.find("span", {"class": "qa-post-auto-meta"})
    date = datetime_cell.text.strip().lower().replace('janvier', 'january').replace('février', 'february').replace('mars', 'march').replace('avril', 'april').replace('mai', 'may').replace('juin', 'june').replace('juillet', 'july').replace('août', 'august').replace('septembre', 'september').replace('octobre', 'october').replace('novembre', 'november').replace('décembre', 'december')
    date = parse(date, fuzzy_with_tokens=True)[0]
    
    #Get title and link
    header = image.find("a", {"class": "qa-heading-link lx-stream-post__header-link"})
    title = header.text.strip()
    link = "https://www.bbc.com" + header['href']
    
    # Get the publication cover source set: {'320w': link, '720w': link}: several sizes for responsive display
    div_image_cell = image.find("div", {"class": "lx-stream-related-story--index-image-wrapper qa-story-image-wrapper"})
    image_cell = div_image_cell.img
    img_srcs_list = image_cell['srcset'].split()
    img_srcs_dict = { img_srcs_list[i+1][:-1] : img_srcs_list[i]  for i in range( 0, len(img_srcs_list) // 2, 2) }
    
    #Get article primer 
    primer_cell = image.find("p", {"class": "lx-stream-related-story--summary qa-story-summary"})
    primer = primer_cell.text 

    # Save and structure the gathered information
    data =  {
        "type" : 'article',
        "publication_link" : link,
        "date" : date, 
        "title" : title, 
        "img_srcs_dict" : img_srcs_dict, 
        "primer" : primer,
    }
    return data
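
Both prototypes assume that every `find` call succeeds. Since `find` returns `None` when nothing matches, a publication with a missing block would raise an `AttributeError`. The sketch below shows one possible guard on made-up HTML; the helper name `text_or_none` is hypothetical.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, class_value):
    """Return the stripped text of the first matching child tag, or None if absent."""
    tag = parent.find(name, {"class": class_value})
    return tag.get_text(strip=True) if tag is not None else None

# Made-up publication without a summary paragraph (not real BBC markup)
html = '<li><a class="qa-heading-link" href="/afrique/x">  A title </a></li>'
item = BeautifulSoup(html, "html.parser").li

print(text_or_none(item, "a", "qa-heading-link"))   # A title
print(text_or_none(item, "p", "qa-story-summary"))  # None
```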

### Run a loop over all `<li>` tags of the isolated `<ol>`

In [27]:
store = []
for publication in publications_list:
    # In the listing, an article carries exactly one <img>; a video carries none
    if len(publication.find_all("img")) == 1:
        store.append(article(publication))
    else:
        store.append(video(publication))

print(f"We got '{len(store)}' publication(s) on this page")

We got '10' publication(s) on this page


In [28]:
df_store = pd.DataFrame(store)
df_store.head()

| | type | publication_link | date | title | video_cover | video_link | img_srcs_dict | primer |
|---|---|---|---|---|---|---|---|---|
| 0 | video | https://www.bbc.com/afrique/media-53279666 | 2020-07-03 18:59:00 | Rencontre avec l'enfant boxeur le plus rapide ... | https://ichef.bbci.co.uk/images/ic/1024x576/p0... | https://polling.bbc.co.uk/ws/av-embeds/cps/afr... | | |
| 1 | article | https://www.bbc.com/afrique/sports-53191371 | 2020-06-26 14:19:00 | Liverpool : les Africains qui ont aidé à rempo... | | | {'240w': 'https://ichef.bbci.co.uk/live-experi... | Comment le premier titre de première division ... |
| 2 | article | https://www.bbc.com/afrique/sports-53183843 | 2020-06-25 18:00:00 | Liverpool à deux points du titre de Premier Le... | | | {'240w': 'https://ichef.bbci.co.uk/live-experi... | Mohamed Salah estime que le moment est venu po... |
| 3 | article | https://www.bbc.com/afrique/sports-53092788 | 2020-06-18 15:47:00 | Marcus Rashford: la victoire d’un footballeur ... | | | {'240w': 'https://ichef.bbci.co.uk/live-experi... | Le gouvernement britannique a annoncé qu'il of... |
| 4 | article | https://www.bbc.com/afrique/sports-53066460 | 2020-06-17 10:36:00 | 8ème titre consécutif du Bayern de Munich | | | {'240w': 'https://ichef.bbci.co.uk/live-experi... | Une passe décisive de Robert Lewandowski à Jer... |
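
The empty cells in this table come from the two record types having different keys. A minimal sketch with made-up records shows how `pd.DataFrame` takes the union of the keys as columns and leaves the missing values as `NaN`.

```python
import pandas as pd

# Two made-up records with partly different keys, like the video and article dicts
records = [
    {"type": "video", "title": "A video", "video_link": "https://example.com/v"},
    {"type": "article", "title": "An article", "primer": "A short summary"},
]
df = pd.DataFrame(records)

# Columns are the union of the keys, in order of first appearance
print(df.columns.tolist())  # ['type', 'title', 'video_link', 'primer']

# Count the missing values per column
missing = {column: int(count) for column, count in df.isna().sum().items()}
print(missing)  # {'type': 0, 'title': 0, 'video_link': 1, 'primer': 1}
```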


# 💻<span style='color:green'> Authors & Contributors </span>

<div align='center'>
    <table>
        <thead>
            <tr>
                <th>Name</th>
                <th>Zindi ID</th>
                <th>Github ID</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>Muhamed TUO</td>
                <td><a href="https://zindi.africa/users/Muhamed_Tuo" target="_blank" rel="nofollow">@Nazario😁</a></td>
                <td><a href="https://github.com/NazarioR9" target="_blank" rel="nofollow">@NazarioR9</a></td>
            </tr>
            <tr>
                <td>Cédric MANOUAN</td>
                <td><a href="https://zindi.africa/users/I_am_Zeus_AI" target="_blank" rel="nofollow">@I_am_Zeus_AI😆</a></td>
                <td><a href="https://github.com/dric2018" target="_blank" rel="nofollow">@dric2018</a></td>
            </tr>
            <tr>
                <td>Emmanuel KOUPOH</td>
                <td><a href="https://zindi.africa/users/eaedk" target="_blank" rel="nofollow">@eaedk😂</a></td>
                <td><a href="https://github.com/eaedk" target="_blank" rel="nofollow">@eaedk</a></td>
            </tr>
            <tr>
                <td></td>
                <td></td>
                <td></td>
            </tr>
        </tbody>
    </table>
</div>
