### Scraping Data on French Stocks

**Objective:**    
Retrieve the data once and store it in a CSV file so that other scripts run faster.

**Extracted Data:**    
- Company name    
- Company's yfinance ticker     
- Company sector     
- Trading volume     

**How?**      
Using BeautifulSoup to extract data from Boursorama.

**Why is trading volume important?**      
Although the scraped volume does not fully capture a stock's liquidity, it is a quick way to separate liquid stocks from illiquid ones.
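As a rough illustration of that use, the sketch below splits a few (name, volume) pairs, taken from the dataset built later in this notebook, into liquid and illiquid groups with a single volume cutoff. The `MIN_VOLUME` threshold of 10,000 shares and the `split_by_liquidity` helper are hypothetical choices for this example, not values used by the notebook.

```python
# Hypothetical cutoff: stocks trading fewer shares than this are treated as illiquid.
MIN_VOLUME = 10_000


def split_by_liquidity(rows, min_volume=MIN_VOLUME):
    """Split (name, volume) pairs into two lists of names: liquid and illiquid."""
    liquid, illiquid = [], []
    for name, volume in rows:
        (liquid if volume >= min_volume else illiquid).append(name)
    return liquid, illiquid


# A few rows from the scraped dataset.
sample = [
    ("AGRIPOWER", 12715),
    ("CHARWOOD ENERGY", 160),
    ("HF COMPANY", 50),
    ("ORANGE", 4779105),
]

liquid, illiquid = split_by_liquidity(sample)
print("Liquid:", liquid)
print("Illiquid:", illiquid)
```

A sensible cutoff depends on the strategy and on the share price, so in practice the threshold would need tuning rather than being fixed at one value.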

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import yfinance as yf
from tqdm import tqdm

In [2]:
# Boursorama identifies each sector by a number, which we use to build the URLs of the pages to scrape.
dico = {
    'Pétrole et Gaz': 0,
    'Matériaux de base': 1,
    'Santé': 4,
    'Biens de consommation': 3,
    'Industries': 2,
    'Services aux Collectivités': 7,
    'Services aux consommateurs': 5,
    'Sociétés financières': 8,
    'Technologies': 9,
    'Télécommunications': 6,
}

list_sector = list(dico.keys())
list_name = []
list_ticker = []
list_secteur = []
list_volume = []

# We will go through each page listing companies from each sector
for sector in tqdm(list_sector) :
    # We create a list containing all the URLs we will scrape
    liste_url = ['https://www.boursorama.com/bourse/actions/cotations/secteur/?filter%5Bindustry%5D=' + str(dico[sector]) + '&filter%5BsubmitButton%5D=']
    for i in range(2, 10) :
        url = 'https://www.boursorama.com/bourse/actions/cotations/secteur/page-' + str(i) +'?filter%5Bindustry%5D=' + str(dico[sector]) + '&filter%5BsubmitButton%5D='
        liste_url.append(url)
        
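    # Then we scrape our data by going through each URL
    for url in liste_url :
        response = requests.get(url)
        # A different final URL means the request was redirected, which we take
        # to mean that this page number does not exist for the sector, so we stop
        if response.url != url :
            break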
        page = response.content
        soup = BeautifulSoup(page, "html.parser")
        
        # We retrieve the trading volume of each company listed on the page
        volumes = soup.find_all(class_="c-instrument c-instrument--totalvolume")
        list_volume_transition = []
        for volume in volumes :
            list_volume_transition.append(int(volume.text.replace(" ","")))
        
        # We retrieve the name and ticker of each stock and match it with its volume
        entreprises = soup.find_all(class_="o-pack__item u-color-cerulean u-ellipsis")
        for counter, entreprise in enumerate(entreprises) :
            name = entreprise.find("a").text
            ref = entreprise.find("a").get('href').replace("-OTC", "")
            pos = ref.find("P")
            ticker = ref[pos+1:-1] + ".PA"
            try :
                # We verify that the ticker is valid by trying to retrieve the latest price with yfinance
                # (.iloc[-1] raises an IndexError when yfinance returns no data)
                yf.Ticker(ticker).history(period="1d").iloc[-1]
                list_name.append(name)
                list_ticker.append(ticker)
                list_secteur.append(sector)
                list_volume.append(list_volume_transition[counter])
            except Exception :
                # Unknown or delisted ticker: we skip this company
                pass


SOLB.PA: No data found, symbol may be delisted
SOLB.PA: No data found, symbol may be delisted
CYAD.PA: No data found, symbol may be delisted
ONWD.PA: No data found, symbol may be delisted
ALALAGR.PA: No data found, symbol may be delisted
BIDS.PA: No data found, symbol may be delisted
MLPLC.PA: Period '1d' is invalid, must be one of ['1mo', '3mo', '6mo', 'ytd', '1y', '2y', '5y', '10y', 'max']
TITC.PA: No data found, symbol may be delisted
ABO.PA: Period '1d' is invalid, must be one of ['1mo', '3mo', '6mo', 'ytd', '1y', '2y', '5y', '10y', 'max']
DAR.PA: Period '1d' is invalid, must be one of ['1mo', '3mo', '6mo', 'ytd', '1y', '2y', '5y', '10y', 'max']
MONT.PA: Period '1d' is invalid, must be one of ['1mo', '3mo', '6mo', 'ytd', '1y', '2y', '5y', '10y', 'max']

100%|██████████| 10/10 [02:18<00:00, 13.82s/it]
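
The scraping loop above mixes network calls with three pieces of string handling: building the page URLs, deriving the ticker from a link, and parsing the volume text. The sketch below isolates those three steps as small pure functions so they can be checked without hitting Boursorama. The function names are hypothetical, the `/cours/1rPALAGP/` link format is an assumption about what the listing pages contain, and the ticker logic simply mirrors the notebook's approach (take everything after the first `P`), which would misfire on a link containing an earlier `P`.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.boursorama.com/bourse/actions/cotations/secteur/"


def sector_page_url(sector_id, page=1):
    """Build the listing URL for a sector; pages after the first use a 'page-N' segment."""
    query = urlencode({"filter[industry]": sector_id, "filter[submitButton]": ""})
    page_part = "" if page == 1 else f"page-{page}"
    return f"{BASE_URL}{page_part}?{query}"


def ticker_from_href(href):
    """Mirror the notebook's logic: drop '-OTC', keep what follows the first 'P', add '.PA'."""
    ref = href.replace("-OTC", "")
    pos = ref.find("P")
    return ref[pos + 1:-1] + ".PA"


def parse_volume(text):
    """Convert a volume such as '4 779 105' to an int, ignoring any kind of whitespace."""
    # str.split() with no argument also splits on non-breaking and narrow spaces.
    return int("".join(text.split()))


print(sector_page_url(0))
print(sector_page_url(4, page=3))
print(ticker_from_href("/cours/1rPALAGP/"))  # assumed link format
print(parse_volume("4 779 105"))
```

Compared with `replace(" ", "")`, splitting on all whitespace avoids a `ValueError` if the page uses non-breaking spaces as thousands separators, which is a possibility rather than something the notebook's output confirms.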


In [None]:
# We create our DataFrame
dic_data = {"Name": list_name, "Ticker": list_ticker, "Sector": list_secteur, "Volume": list_volume}
df = pd.DataFrame(dic_data)

# We save the DataFrame to a csv file
df.to_csv("Dataset French Companies.csv", index=False)

df

| | Name | Ticker | Sector | Volume |
|---|---|---|---|---|
| 0 | AGRIPOWER | ALAGP.PA | Pétrole et Gaz | 12715 |
| 1 | CHARWOOD ENERGY | ALCWE.PA | Pétrole et Gaz | 160 |
| 2 | ECOSLOPS | ALESA.PA | Pétrole et Gaz | 4191 |
| 3 | ENOGIA | ALENO.PA | Pétrole et Gaz | 3184 |
| 4 | ENTECH | ALESE.PA | Pétrole et Gaz | 9191 |
| ... | ... | ... | ... | ... |
| 581 | EUTELSAT COMMUNIC. | ETL.PA | Télécommunications | 146677 |
| 582 | HF COMPANY | ALHF.PA | Télécommunications | 50 |
| 583 | MAROC TELECOM | IAM.PA | Télécommunications | 2020 |
| 584 | ORANGE | ORA.PA | Télécommunications | 4779105 |
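
Since the point of the CSV is to let other scripts skip the scraping step, here is a minimal sketch of how such a script might read the file back and group tickers by sector. It uses only the standard library and an in-memory sample copied from the rows above so that it runs on its own; the `tickers_by_sector` helper is hypothetical, and `pd.read_csv` would be the natural choice in a pandas workflow.

```python
import csv
import io
from collections import defaultdict

# Sample rows copied from the output above. In practice, replace the StringIO with
# open("Dataset French Companies.csv", newline="", encoding="utf-8").
sample_csv = io.StringIO(
    "Name,Ticker,Sector,Volume\n"
    "AGRIPOWER,ALAGP.PA,Pétrole et Gaz,12715\n"
    "ECOSLOPS,ALESA.PA,Pétrole et Gaz,4191\n"
    "ORANGE,ORA.PA,Télécommunications,4779105\n"
)


def tickers_by_sector(csv_file):
    """Return a dict mapping each sector to the list of tickers it contains."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Sector"]].append(row["Ticker"])
    return dict(groups)


groups = tickers_by_sector(sample_csv)
for sector, tickers in groups.items():
    print(sector, tickers)
```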
