In this notebook I am going to scrape the content of the website www.ndombolo.co. The site has almost 30 songs in Lingala with their French translations.

I will scrape each page and save the content to a pandas DataFrame with two columns, one for Lingala and one for French.

We will keep the translations aligned paragraph by paragraph.



Let us load the important modules.

In [2]:
!pip install beautifulsoup4 pandas

Collecting beautifulsoup4
  Using cached beautifulsoup4-4.9.0-py3-none-any.whl (109 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.0-py2.py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.0 soupsieve-2.0
You should consider upgrading via the '/Users/es.py/Projects/Personal/speed-rw/.venv/bin/python -m pip install --upgrade pip' command.


In [1]:
from pathlib import Path
import pandas as pd
from bs4 import BeautifulSoup
from requests import get
from requests.exceptions import RequestException
from contextlib import closing



We are not scraping the website directly; we have already downloaded the content.

In [6]:
TRADUCTION_URL = "http://www.ndombolo.co/chansons/traductions/"

Let us add our utility functions.

In [9]:
def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the response is a 200 with an HTML content type, return the raw
    response body (bytes); otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

In [10]:
def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    # .get() avoids a KeyError when the server sends no Content-Type header.
    content_type = resp.headers.get('Content-Type', '').lower()
    return resp.status_code == 200 and 'html' in content_type
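
A quick way to see how a content-type check of this kind behaves is to run it on stand-in response objects. This is only a sketch: `looks_like_html` and the `SimpleNamespace` stubs are hypothetical names for illustration, not part of the notebook or of `requests`, and the stubs use a plain dict for the headers, whereas real `requests` headers are case-insensitive.

```python
from types import SimpleNamespace


def looks_like_html(resp):
    """Return True when the response is a 200 whose Content-Type mentions HTML."""
    # .get() with a default avoids a KeyError when the header is missing.
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type


# Hypothetical stand-ins for responses; they only carry the two attributes used.
html_page = SimpleNamespace(status_code=200, headers={"Content-Type": "text/html; charset=UTF-8"})
image = SimpleNamespace(status_code=200, headers={"Content-Type": "image/png"})
no_header = SimpleNamespace(status_code=200, headers={})
not_found = SimpleNamespace(status_code=404, headers={"Content-Type": "text/html"})

results = [looks_like_html(r) for r in (html_page, image, no_header, not_found)]
print(results)
```

Only the first stub passes: the others fail on the content type, the missing header, or the status code.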

In [11]:
def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)

In [13]:
path = Path('*')

In [14]:
path.glob('r')

<generator object Path.glob at 0x1169398e0>
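
The output above shows that `Path.glob` returns a lazy generator, so nothing is listed until it is consumed. Here is a small sketch of the usual pattern, where the directory goes in `Path` and the wildcard goes in the `glob` pattern. The temporary folder and the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Create a few placeholder files (hypothetical names).
    for name in ("b.csv", "a.csv", "notes.txt"):
        (folder / name).write_text("placeholder")

    matches = folder.glob("*.csv")               # lazy generator, nothing read yet
    csv_names = sorted(p.name for p in matches)  # consuming it yields the paths

print(csv_names)
```

Sorting the result makes the order predictable, since `glob` does not guarantee one.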

In [80]:
def list_files(path):
    """
    Given the URL of an index page, yield the href of each linked file
    that ends with `.php`.
    """
    content = simple_get(path)
    content_html = BeautifulSoup(content, 'html.parser')
    tables = content_html.findChildren('table')
    table = tables[0]
    rows = table.findChildren(['tr'])
    for row in rows[4:-1]:
        cells = row.findChildren(['td', 'th'])
        name_cell = cells[1]
        link = name_cell.a
        text = link.get('href')
        if text.endswith('.php'):
            yield text

In [84]:
names = list_files(TRADUCTION_URL)

In [85]:
names = list(names)

In [87]:
print(len(names))

64


So we have 64 song pages. Let us now see how we can scrape them.

In [111]:
def parse_song_lyrics(path):
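    """
    Parse a song web page and return its lyrics as a DataFrame, together
    with the song title, artist, and album.
    Raise ValueError if the page has no French and Lingala sections.
    """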
    content = simple_get(path)
    content_html = BeautifulSoup(content, 'html.parser')
    french_section = content_html.find_all("section", class_="francais")
    lingala_section = content_html.find_all("section", class_="lingala")
    if french_section and lingala_section:
        french_section = french_section[0]
        lingala_section = lingala_section[0]
        song_title = french_section.h3.get_text()
        song_artist = french_section.h5.em.get_text()
        song_album = french_section.h5.get_text()
        lyrics = list()
        for french_text, lingala_text in zip(french_section.find_all("p"), lingala_section.find_all("p")):
            lyric = {"french":french_text.get_text(), "lingala": lingala_text.get_text()}
            lyrics.append(lyric)
        return pd.DataFrame.from_records(lyrics), song_title, song_artist, song_album
    else:
        raise ValueError("This pages does not have lyrics")
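
One thing to keep in mind with the `zip` pairing above: if the two sections do not contain the same number of paragraphs, `zip` stops at the shorter one without any warning. The sketch below shows one possible way to notice this. `pair_paragraphs` is a hypothetical helper and the strings are placeholders, not scraped lyrics.

```python
from itertools import zip_longest


def pair_paragraphs(french, lingala):
    """Pair paragraphs by position and count those without a counterpart."""
    records = []
    unmatched = 0
    for fr, ln in zip_longest(french, lingala):
        if fr is None or ln is None:
            unmatched += 1  # plain zip() would have dropped this silently
            continue
        records.append({"french": fr, "lingala": ln})
    return records, unmatched


# Placeholder paragraphs: three on one side, two on the other.
french = ["fr 1", "fr 2", "fr 3"]
lingala = ["ln 1", "ln 2"]
records, unmatched = pair_paragraphs(french, lingala)
print(len(records), unmatched)
```

A non-zero `unmatched` count is a hint that the page should be checked by hand, because a missing paragraph in the middle would shift every pair after it.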

In [112]:
f'{TRADUCTION_URL}{names[0]}'

'http://www.ndombolo.co/chansons/traductions/amen.php'
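
The f-string works here because `TRADUCTION_URL` ends with a slash. As a sketch of a slightly more defensive alternative, `urllib.parse.urljoin` can build the same URL. `song_url` and `BASE` are hypothetical names used only for this example.

```python
from urllib.parse import urljoin


def song_url(base, name):
    """Join a base folder URL and a page name, whether or not base ends with '/'."""
    # urljoin replaces the last path segment when the base lacks a trailing
    # slash, so the base is normalized to end with exactly one slash first.
    return urljoin(base.rstrip("/") + "/", name)


BASE = "http://www.ndombolo.co/chansons/traductions/"
with_slash = song_url(BASE, "amen.php")
without_slash = song_url(BASE.rstrip("/"), "amen.php")
print(with_slash)
```

Both calls give the same URL, so a missing trailing slash in the base no longer matters.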

In [113]:
def parse_all_songs(names):
    """
    Parse the lyrics of every page in `names` and return them as a list of dicts.
    """
    song_data = list()
    for name in names:
        full_url = f'{TRADUCTION_URL}{name}'
        print(full_url)
        try:
            lyrics, song_title, song_artist, song_album = parse_song_lyrics(full_url)
            song_data.append({"lyrics": lyrics, 'title':song_title,  'artist': song_artist, 'album': song_album})
        except ValueError as exc:
            log_error(exc)
    return song_data

In [114]:
song_data = parse_all_songs(names)

http://www.ndombolo.co/chansons/traductions/amen.php
http://www.ndombolo.co/chansons/traductions/aminatasylla.php
http://www.ndombolo.co/chansons/traductions/ausecour.php
http://www.ndombolo.co/chansons/traductions/azalakiawa.php
http://www.ndombolo.co/chansons/traductions/barrev.php
This pages does not have lyrics
http://www.ndombolo.co/chansons/traductions/biberon.php
http://www.ndombolo.co/chansons/traductions/blandine.php
http://www.ndombolo.co/chansons/traductions/blessuredamour.php
http://www.ndombolo.co/chansons/traductions/calvaire.php
http://www.ndombolo.co/chansons/traductions/choc.php
http://www.ndombolo.co/chansons/traductions/consolation.php
http://www.ndombolo.co/chansons/traductions/coucou.php
http://www.ndombolo.co/chansons/traductions/coup2foudre.php
http://www.ndombolo.co/chansons/traductions/dalhia.php
http://www.ndombolo.co/chansons/traductions/dieuleternel.php
http://www.ndombolo.co/chansons/traductions/djino.php
http://www.ndombolo.co/chansons/traductions/eaubenit

In [117]:
songs_data = song_data

In [118]:
# Saving everything to csv

In [125]:
# Use a distinct loop variable so that `song_data` is not overwritten.
for song in songs_data:
    song['lyrics'].to_csv('../data/{}.csv'.format(song['title']), sep='|')
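
Song titles are used directly as file names here, which can fail if a title contains a character such as `/`. A minimal sketch of a clean-up step follows. `safe_filename` is a hypothetical helper and the titles are invented examples, not titles taken from the site.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Turn a song title into a string that is safe to use as a file name."""
    # Replace everything except word characters, hyphens, and spaces.
    cleaned = re.sub(r"[^\w\- ]", "_", title).strip()
    # Collapse runs of whitespace into single underscores.
    cleaned = re.sub(r"\s+", "_", cleaned)
    return cleaned or fallback


print(safe_filename("Blessure d'amour"))
print(safe_filename("A/B: C?"))
print(safe_filename("   "))
```

Note that two different titles can map to the same cleaned name, so this is not a guarantee against overwriting a file.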

## Reading all files

In [66]:
all_data = pd.concat(pd.read_csv(file, sep='|') for file in Path.cwd().parent.joinpath('data').glob('*.csv'))
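
The `Unnamed: 0` column that shows up after reading the files comes from the index that `to_csv` writes by default. Here is a sketch of one way to avoid the extra column and the later index clean-up, using in-memory buffers and placeholder data instead of the real files.

```python
import io

import pandas as pd

# Two tiny stand-in lyric tables (placeholder data, not scraped content).
first = pd.DataFrame({"french": ["a", "b"], "lingala": ["x", "y"]})
second = pd.DataFrame({"french": ["c"], "lingala": ["z"]})

buffers = []
for frame in (first, second):
    buffer = io.StringIO()
    frame.to_csv(buffer, sep="|")  # the index is written as an unnamed first column
    buffer.seek(0)
    buffers.append(buffer)

# index_col=0 reads that first column back as the index instead of "Unnamed: 0";
# ignore_index=True renumbers the rows of the combined frame from 0.
combined = pd.concat(
    (pd.read_csv(buffer, sep="|", index_col=0) for buffer in buffers),
    ignore_index=True,
)
print(combined)
```

With real files, the same two arguments would apply to the `read_csv` and `concat` calls. It may also be worth excluding any previously saved combined file from the glob, since it lives in the same folder.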

In [67]:
all_data.french = all_data.french.str.replace('\n', ' ')
all_data.lingala = all_data.lingala.str.replace('\n', ' ')

In [68]:
all_data.head()

Unnamed: 0.1,Unnamed: 0,french,lingala
0,0.0,"Comme le soleil, tu m’as séduit ...",Lekola moyi yo seduire nga haa ...
1,1.0,Je viens te faire voir mon cœur frappé par le ...,Nayé ko talisa yo motema ezui coup de foudre n...
2,2.0,"L’amour ma pénétré dans les os, et m’oblige à ...",Bolingo ekoteli nga aah na mikuwa ekotinda na ...
3,3.0,"Comme le soleil fait mal aux yeux, tu me donne...",Lekola moyi esuaka na miso yo pesa nga fievre ...
4,4.0,Comme le soleil la source du feu ...,Lekola moyi source ya moto ...


In [71]:
all_data = all_data.drop('Unnamed: 0', axis=1)

In [73]:
all_data = all_data.reset_index()

In [74]:
all_data.tail()

Unnamed: 0,index,french,lingala
1921,4,Maman attend moi je vais venir ...,Mama zela nga na koya he ...
1922,5,Maman attend moi je vais venir ...,Ho ho ho Mama zela nga na koya he ...
1923,6,Qu’est ce que t’as pas fait pour que je sois d...,Eloko nini oyo osali te pote ngai na mona nzel...
1924,7,Maman qu’est ce que tu ne fais pas ...,Mama eloko nini yo salaka te ...
1925,8,Maman qu’est ce que tu ne fais pas ...,Oh oh oh mama eloko nini yo salaka te eh ...


In [75]:
all_data = all_data.drop('index', axis=1)

In [76]:
all_data.tail()

Unnamed: 0,french,lingala
1921,Maman attend moi je vais venir ...,Mama zela nga na koya he ...
1922,Maman attend moi je vais venir ...,Ho ho ho Mama zela nga na koya he ...
1923,Qu’est ce que t’as pas fait pour que je sois d...,Eloko nini oyo osali te pote ngai na mona nzel...
1924,Maman qu’est ce que tu ne fais pas ...,Mama eloko nini yo salaka te ...
1925,Maman qu’est ce que tu ne fais pas ...,Oh oh oh mama eloko nini yo salaka te eh ...


In [77]:
all_data.to_csv('../data/{}.csv'.format('all_data'), index=False, sep='|')
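
A pipe separator is a reasonable choice for lyrics, presumably because the lines often contain commas. The sketch below uses the standard `csv` module rather than pandas to show a round trip with `|` as the delimiter. The rows are placeholders, and one of them contains a pipe on purpose to show that quoting still protects it.

```python
import csv
import io

# Placeholder rows: one with commas, one with the delimiter itself.
rows = [
    {"french": "premier vers, avec une virgule", "lingala": "texte, un"},
    {"french": "vers avec | une barre", "lingala": "texte deux"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["french", "lingala"], delimiter="|")
writer.writeheader()
writer.writerows(rows)

raw_text = buffer.getvalue()
buffer.seek(0)
restored = [dict(row) for row in csv.DictReader(buffer, delimiter="|")]
print(raw_text)
```

The fields with commas are written without quotes, while the field containing `|` is quoted automatically, and both rows come back unchanged.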

In [78]:
all_data.loc[102, 'french']

"Ma vie d'amour s’arrête avec toi Viens me chercher"

In [79]:
all_data.loc[102, 'lingala']

"Vie d'amour na nga esuki epayi na yo   yaka k'ozua n'o nga"

In [80]:
all_data.loc[59, 'lingala']

'FALLY\xa0: Oh bazua ba bolingo bateka na zando ya ngabela EKATSHAKA'

In [81]:
all_data.loc[59, 'french']

"FALLY\xa0: Ils ont pris l'amour et l'ont vendu au grand marché"
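
The last outputs still contain a non-breaking space (`\xa0`) and, in one Lingala line, a run of several spaces. A small sketch of a normalization step, assuming it is acceptable to turn every kind of whitespace into a single regular space. `normalize_spaces` is a hypothetical helper; the two sample strings are the outputs shown above.

```python
def normalize_spaces(text):
    """Collapse any run of whitespace, including non-breaking spaces, into one space."""
    # str.split() with no argument splits on all Unicode whitespace, "\xa0" included.
    return " ".join(text.split())


samples = [
    "FALLY\xa0: Oh bazua ba bolingo bateka na zando ya ngabela EKATSHAKA",
    "Vie d'amour na nga esuki epayi na yo   yaka k'ozua n'o nga",
]
cleaned = [normalize_spaces(s) for s in samples]
for line in cleaned:
    print(line)
```

On the DataFrame, the same function could be applied to each column with `Series.map`.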