### Datascraping French Stocks

**Objective:**    
Retrieve the data once and store it in a CSV file, so that other scripts can load it directly instead of scraping it again.

**Extracted Data:**    
- Company name    
- Company's yfinance ticker     
- Company sector     
- Trading volume     

**How?**      
By downloading Boursorama's sector listing pages with `requests`, parsing them with BeautifulSoup, and checking each ticker with yfinance.
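
The listing pages follow a regular URL pattern: a sector number in the query string and, from the second page onwards, a `page-N` segment in the path. A minimal sketch of that pattern, using only the standard library, is shown here. The helper name `sector_page_url` is hypothetical; the URL shape is the one used in the scraping cell of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.boursorama.com/bourse/actions/cotations/secteur/"

def sector_page_url(sector_id: int, page: int = 1) -> str:
    """Build the listing URL for one sector; page 1 has no 'page-N' segment."""
    path = "" if page == 1 else f"page-{page}"
    # urlencode percent-encodes the square brackets as %5B and %5D
    query = urlencode({"filter[industry]": sector_id, "filter[submitButton]": ""})
    return f"{BASE_URL}{path}?{query}"

print(sector_page_url(0))          # first page of sector 0
print(sector_page_url(0, page=2))  # second page of sector 0
```

Building the query with `urlencode` avoids writing the percent-encoded brackets by hand.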

**Why is trading volume important?**      
Although the scraped volume does not fully capture a stock's liquidity, it is a quick indicator for separating liquid stocks from illiquid ones.
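
As a rough illustration, the sketch below labels a few stocks as liquid or illiquid using a volume cutoff. The cutoff of 100,000 and the helper `is_liquid` are assumptions made for this example, since the notebook does not define a threshold. The sample volumes are taken from the dataset produced at the end of this notebook.

```python
# Hypothetical cutoff: the notebook does not specify one.
MIN_VOLUME = 100_000

def is_liquid(volume: int, min_volume: int = MIN_VOLUME) -> bool:
    """Return True when the scraped volume reaches the cutoff."""
    return volume >= min_volume

# A few (name, volume) pairs from the scraped dataset
sample = {
    "AGRIPOWER": 12112,
    "BOOSTHEAT": 18724615,
    "CGG": 8418679,
    "CHARWOOD ENERGY": 97,
}

liquid = [name for name, volume in sample.items() if is_liquid(volume)]
print(liquid)
```

A single cutoff is crude, but it is enough to drop stocks that barely trade before running heavier analyses.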

In [30]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import yfinance as yf

In [65]:
# Boursorama identifies each sector by a number, which we use to build the URLs of the pages to scrape.
dico = {'Pétrole et Gaz' : 0, 'Matériaux de base' : 1, 'Santé' : 4, 'Biens de consommation' : 3, 'Industries' : 2, 'Services aux Collectivités' : 7, 'Services aux consommateurs' : 5, 'Sociétés financières' : 8, 'Technologies' : 9, 'Télécommunications' : 6}

list_sector = list(dico.keys())
list_name = []
list_ticker = []
list_secteur = []
list_volume = []

# We will go through each page listing companies from each sector
for sector in list_sector :
    # We create a list containing all the URLs we will scrape
    liste_url = ['https://www.boursorama.com/bourse/actions/cotations/secteur/?filter%5Bindustry%5D=' + str(dico[sector]) + '&filter%5BsubmitButton%5D=']
    for i in range(2, 10) :
        url = 'https://www.boursorama.com/bourse/actions/cotations/secteur/page-' + str(i) +'?filter%5Bindustry%5D=' + str(dico[sector]) + '&filter%5BsubmitButton%5D='
        liste_url.append(url)
        
    # Then we scrape our data by going through each URL
    for url in liste_url :
        response = requests.get(url)
        # A redirect means the requested page does not exist, so we move on to the next sector
        if response.url != url :
            break
        page = response.content
        soup = BeautifulSoup(page, "html.parser")
        
        # We retrieve the trading volume of each company listed on the page
        volumes = soup.find_all(class_="c-instrument c-instrument--totalvolume")
        list_volume_transition = []
        for volume in volumes :
            list_volume_transition.append(int(volume.text.replace(" ","")))
        
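        # We retrieve the name and ticker of each stock
        entreprises = soup.find_all(class_="o-pack__item u-color-cerulean u-ellipsis")
        for counter, entreprise in enumerate(entreprises) :
            name = entreprise.find("a").text
            print(name) # Used to check that the code runs well and at what speed
            # The ticker is read from the link: everything after the first "P", minus the trailing "/"
            ref = entreprise.find("a").get('href').replace("-OTC", "")
            pos = ref.find("P")
            ticker = ref[pos+1:-1] + ".PA"
            try :
                # We verify that we have the correct ticker by trying to retrieve the stock price using yfinance
                # (.iloc[-1] raises an IndexError when yfinance returns no data)
                yf.Ticker(ticker).history(period="1d").iloc[-1]
                # We look up the volume before appending anything, so that the four lists keep the same length
                volume = list_volume_transition[counter]
            except Exception :
                # We skip stocks that yfinance does not recognise or whose volume is missing
                continue
            list_name.append(name)
            list_ticker.append(ticker)
            list_secteur.append(sector)
            list_volume.append(volume)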


AGRIPOWER
BOOSTHEAT
CGG
CHARWOOD ENERGY
DOLFINES
ECOSLOPS
ENERTIME
ENOGIA
ENTECH
EO2
ESSO
GLOBAL BIOENERGIES
GTT (GAZTRANSPORT ET TEC.)
HAFFNER ENERGY
HRS (HYDROGEN REFUELING SOL.)
LHYFE
MAUREL & PROM
MCPHY ENERGY
NHOA
SEQUA PETROLEUM
SLB
TECHNIP ENERGIES
TOTALENERG GAB
TOTALENERGIES
VALLOUREC
VERGNET
WAGA ENERGY
AFYREN
AIR LIQUIDE
AMOEBA
ARKEMA
AUPLATA MNG GRP
BAIKOWSKI
CARBIOS
COGRA
COIL
ENCRES DUBUIT
ERAMET
EXACOMPTA
EXPL & PROD CHIM
FERMENTALG
FLORENTAISE
GOLD BY GOLD
GROUPE BERKEM
IMERYS
JACQUET METALS
METABOLIC EXPLORER
MOULINVEST
ORAPI
ROBERTET
ROBERTET CI E87
ROUGIER S.A.
SOLVAY
SOLB.PA: No data found, symbol may be delisted
ZAMBIA CONS.CAT.B
AB SCIENCE
ABIONYX PHARMA
ABIVAX
ACTICOR BIOTECH
ADOCIA
ADVICENNE
AELIS FARMA
AFFLUENT MEDICAL
AMPLITUDE SURG.
BASTIDE LE CONFORT MED.
BIOMERIEUX
BIOPHYTIS
BIOSENIC
BIOS.PA: No data found, symbol may be delisted
BIOSYNEX
BLUELINEA
BOIRON
BONYF
CARMAT
CELLECTIS
CELYAD ONCO
CYAD.PA: No data found, symbol may be delisted
CLARIANE
CLARIANE
CRO

CARMILA
CBO TERRITORIA
COFACE
COVIVIO
COVIVIO HTLS
CRCAM ALPES PROVENCE.CCI
CRCAM BRIE PIC2CCI
CRCAM ILLE CCI
CRCAM LANGUEDOC
CRCAM LOIRE HAUTE LOIRE
CRCAM MORBIHAN CCI
CRCAM NOR.SE.CCI
CRCAM NORD FRANCE
CRCAM PARIS ET IDF
CRCAM SRA CI
CRCAM TOURAINE CCI
CREDIT AGRICOLE SA
EDUFORM'ACTION
EEM
EURAZEO
EURONEXT
FIDUCIAL REAL ESTATE
FINANCIERE MONCEY
FONCIERE INEA
FONCIERE LYONNAIS
FOREST EQUATORIAL
FREY
GALIMMO
GECINA
Gold Bullion Securities
HAMILTON GLOBAL OPP
ICADE
IDI
IDSUD
IMMOB DASSAULT
Invesco EuroMTS Cash 3 Months UCITS ETF
EU.PA: No data found, symbol may be delisted
KLEPIERRE
LEBON
LOIRE ATL.VEND.CCI
Lyxor Dow Jones Industrial Average UCITS ETF - Dist
/COURS/1RTDJE.PA: No data found for this date range, symbol may be delisted
MAISON ANTOINE BAUD
MERCIALYS
MONTEA
MONT.PA: No data found, symbol may be delisted
NEXITY
PAREF
PATRIMOINE COM.
PEUGEOT INVEST
PHOTONIKE
REALITES
SCBSM
SCOR
SELECTIRENTE
SMALTO
SOCIETE GENERALE
STRADIM ESPAC.FIN.
TIKEHAU CAPITAL
TOUR EIFFEL
TRAMWAYS DE ROUE
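
The output above shows that slicing after the first "P" can yield invalid tickers such as `/COURS/1RTDJE.PA` when the link contains no "P" at all. A stricter variant can be sketched with a regular expression. The link shape `.../1rP<SYMBOL>/` is an assumption inferred from the code and output above, and the helper name `href_to_ticker` is hypothetical.

```python
import re
from typing import Optional

# Assumed link shape: ".../1rP<SYMBOL>/", optionally with an "-OTC" suffix
_HREF_PATTERN = re.compile(r"/1rP([A-Z0-9]+)(?:-OTC)?/?$")

def href_to_ticker(href: str) -> Optional[str]:
    """Return the yfinance-style Paris ticker, or None if the link does not match."""
    match = _HREF_PATTERN.search(href)
    if match is None:
        return None
    return match.group(1) + ".PA"

print(href_to_ticker("/cours/1rPALAGP/"))  # matches the assumed shape
print(href_to_ticker("/cours/1rTDJE/"))    # no "1rP" prefix, so no ticker
```

Returning `None` for links that do not match lets the loop skip them before calling yfinance, which saves a network request per rejected link.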

In [72]:
# We create our DataFrame
dic_data = {"Name": list_name, "Ticker": list_ticker, "Sector": list_secteur, "Volume": list_volume}
df = pd.DataFrame(dic_data)

# We save the DataFrame to a csv file
df.to_csv("Dataset French Companies.csv", index=False)

df

| | Name | Ticker | Sector | Volume |
|---|---|---|---|---|
| 0 | AGRIPOWER | ALAGP.PA | Pétrole et Gaz | 12112 |
| 1 | BOOSTHEAT | ALBOO.PA | Pétrole et Gaz | 18724615 |
| 2 | CGG | CGG.PA | Pétrole et Gaz | 8418679 |
| 3 | CHARWOOD ENERGY | ALCWE.PA | Pétrole et Gaz | 97 |
| 4 | DOLFINES | ALDOL.PA | Pétrole et Gaz | 10877659 |
| ... | ... | ... | ... | ... |
| 618 | HF COMPANY | ALHF.PA | Télécommunications | 2828 |
| 619 | MAROC TELECOM | IAM.PA | Télécommunications | 21 |
| 620 | ORANGE | ORA.PA | Télécommunications | 5058001 |
| 621 | OSMOZIS | ALOSM.PA | Télécommunications | 236 |
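
Since the point of the CSV is to be reused by other scripts, here is a minimal sketch of how a downstream script might load it and keep only the most traded stocks. It uses the standard `csv` module to stay dependency-free (`pd.read_csv` would serve the same purpose). The helper `load_companies` and the `min_volume` value are assumptions for illustration, and an in-memory sample stands in for the real file.

```python
import csv
import io

def load_companies(csv_file, min_volume=0):
    """Read rows with Name, Ticker, Sector, Volume columns; keep those at or above min_volume."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["Volume"] = int(row["Volume"])  # csv yields strings, so convert before comparing
        if row["Volume"] >= min_volume:
            rows.append(row)
    return rows

# Stand-in for: open("Dataset French Companies.csv", newline="", encoding="utf-8")
sample_csv = io.StringIO(
    "Name,Ticker,Sector,Volume\n"
    "AGRIPOWER,ALAGP.PA,Pétrole et Gaz,12112\n"
    "ORANGE,ORA.PA,Télécommunications,5058001\n"
)

companies = load_companies(sample_csv, min_volume=100_000)
print([company["Ticker"] for company in companies])
```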
