# Scraping the information 

How?

- initialize a session (this gives us the cookies and the link to call later)
- initialize the search with the right parameters
- read the response to the POST request

After watching this video, I realized that its author uses a requests [Session](http://docs.python-requests.org/en/v1.0.4/user/advanced/), which keeps cookies from one request to the next. Let's try this.

In [1]:
import requests

In [2]:
s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

{
  "cookies": {
    "sessioncookie": "123456789"
  }
}



What if we do this with ameli?

In [12]:
s = requests.Session()

r = s.get('http://ameli-direct.ameli.fr')

Now let's extract the link we are interested in, the `action` URL of the search form. First, on a sample of the page:

In [14]:
test_string = """<div id="centresite">
        <form action="/recherche-ceca2094e7dec344ca69beca17f092d2.html" method="post">
            <div class="choix-ps-es">
	<h2>Je recherche :</h2>"""

In [13]:
import re

In [15]:
p = re.compile(r'<form action="([\w\d/.-]+)" method="post">')

In [16]:
p.findall(test_string)

['/recherche-ceca2094e7dec344ca69beca17f092d2.html']

In [17]:
suburl = p.findall(r.text)[0]
suburl

'/recherche-3a4f89dc9bd653a34e287e7c17e7b2ff.html'

Good, we have an entry point into the system: the URL that the search form posts to.
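Concatenating the host and the path works here because the extracted path starts with a slash. As a slightly more defensive sketch, the standard library's `urllib.parse.urljoin` can resolve the form's `action` against the page URL. The helper name `find_form_action` is hypothetical, and the sample HTML is the one used above.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://ameli-direct.ameli.fr"

# Equivalent to the pattern above (\d is already covered by \w).
FORM_ACTION = re.compile(r'<form action="([\w/.-]+)" method="post">')

def find_form_action(html, base_url=BASE_URL):
    """Return the absolute URL of the first POST form, or None if absent."""
    match = FORM_ACTION.search(html)
    if match is None:
        return None
    return urljoin(base_url, match.group(1))

sample = '''<div id="centresite">
        <form action="/recherche-ceca2094e7dec344ca69beca17f092d2.html" method="post">'''

search_url = find_form_action(sample)
print(search_url)
```

Returning `None` when no form is found makes a layout change on the site easier to detect than an `IndexError` on `findall(...)[0]`.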

It also looks like we have cookies!

In [18]:
r.cookies

<RequestsCookieJar[Cookie(version=0, name='AmeliDirectPersist', value='376496439.20480.0000', port=None, port_specified=False, domain='ameli-direct.ameli.fr', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='TS01b76c1f', value='0139dce0d2bec8d54d79aaa58ea7790b16c9c15691efa8e160cf039b22e711887f55e473b7f92c0140a8380921774568848cd87c83db97e20a37f9ab0e3b8d6f91527e9707', port=None, port_specified=False, domain='ameli-direct.ameli.fr', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='infosoins', value='vst70vcj195p9jic21l133jfk6', port=None, port_specified=False, domain='ameli-direct.ameli.fr', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False,

Now, which request do we have to send to complete the search? Mainly, we need a payload describing what we want to ask the site for:

In [19]:
payload = {"type":"ps",
    "ps_profession":"ophtalmologiste",
    "ps_localisation":"92120"}

r = s.post("http://ameli-direct.ameli.fr" + suburl, params=payload)

In [20]:
r

<Response [200]>
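One detail worth checking: the form declares `method="post"`. In requests, `params=` puts the fields in the URL query string, whereas `data=` sends them form-encoded in the request body, which is what a browser does when it submits such a form. The call above returned a 200, so the site seems to accept query-string fields; if it ever stops doing so, `data=payload` would be the variant to try. The sketch below only uses the standard library to show the encoded string involved in both cases; the variable names are illustrative.

```python
from urllib.parse import urlencode, urlsplit

payload = {"type": "ps",
           "ps_profession": "ophtalmologiste",
           "ps_localisation": "92120"}

url = "http://ameli-direct.ameli.fr/recherche-3a4f89dc9bd653a34e287e7c17e7b2ff.html"

# The form-encoded representation of the payload.
encoded = urlencode(payload)

# params=payload: the fields travel in the query string, the body stays empty.
url_with_params = url + "?" + encoded

# data=payload: the URL is unchanged and `encoded` becomes the request body.
body_with_data = encoded

print(urlsplit(url_with_params).query)
print(body_with_data)
```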

In [21]:
r.text

'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr">\n<head>\n\t<title>Annuaire santé d\'ameli.fr : trouver un médecin, un hôpital...</title>\n\t<meta name="description" content="L’annuaire santé de l’Assurance Maladie pour trouver un médecin, un kiné, un hôpital… Tarifs – Horaires – Spécialités - Localisation" />\n\t<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />\n<meta name="robots" content="noindex, nofollow" /><link href="/resources_ver/20150826115145/css/print_new.css" media="print" rel="stylesheet" type="text/css" />\n<link href="/resources_ver/20150826115145/css/jquery.qtip.css" media="screen" rel="stylesheet" type="text/css" />\n<link href="/resources_ver/20150826115145/css/styles_new.css" media="screen" rel="stylesheet" type="text/css" />\n<link href="/resources_ver/20150826115

Good, the first part of our work is done!
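The three steps can be gathered in one function. This is a minimal sketch, not a hardened client: `run_search` is a hypothetical name, and the function takes the session as an argument so that it works with a `requests.Session()` or with any object exposing the same `get` and `post` methods.

```python
import re
from urllib.parse import urljoin

BASE_URL = "http://ameli-direct.ameli.fr"
FORM_ACTION = re.compile(r'<form action="([\w/.-]+)" method="post">')

def run_search(session, payload, base_url=BASE_URL):
    """Open the home page, locate the search form, and submit the payload.

    Returns the HTML of the result page.
    """
    # Step 1: the first GET sets the cookies and exposes the form URL.
    home = session.get(base_url)
    match = FORM_ACTION.search(home.text)
    if match is None:
        raise ValueError("search form not found on the home page")

    # Steps 2 and 3: post the search on the same session and read the answer.
    # params= mirrors the call used above.
    response = session.post(urljoin(base_url, match.group(1)), params=payload)
    return response.text
```

With requests installed, `run_search(requests.Session(), payload)` should return the same page as `r.text` above, as long as the site still behaves as it did here.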

# The parsing 

Let's use Beautiful Soup to parse the document structure. Its documentation is excellent: [http://www.crummy.com/software/BeautifulSoup/bs4/doc/](http://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [54]:
import bs4

In [55]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

Beautiful Soup gives us easy access to most elements of the document structure.

In [56]:
soup.title

<title>Annuaire santé d'ameli.fr : trouver un médecin, un hôpital...</title>

We can find tags easily.

In [58]:
soup.find_all('strong')

[<strong>DEBBASCH</strong>, <strong>KHAYAT</strong>, <strong>KOHANE</strong>]

All doctors can be found like this:

In [59]:
doctors = soup.find_all('div', attrs={"class":"item-professionnel"})

A single doctor looks like this.

In [51]:
doctors[0]

<div class="item-professionnel"><div class="item-professionnel-inner"><span class="num">1</span><div class="nom_pictos"><h2><a href="/professionnels-de-sante/recherche-1/fiche-detaillee-CbA1kzI4MzG0-3a4f89dc9bd653a34e287e7c17e7b2ff.html"><strong>DEBBASCH</strong> JEAN MARC</a></h2><div class="pictos"><img alt="Accepte la carte Vitale" class="infobulle" src="/resources_ver/20150826115145/images/picto_cartevitale.png"/></div><div class="clear"></div></div><div class="clear"></div><div class="elements"><div class="item left"></div><div class="item right type_honoraires">Honoraires libres</div><div class="clear"></div><div class="item left tel">01 46 56 90 90</div><div class="item right convention"><a alt="Les médecins fixent librement leurs tarifs et peuvent donc pratiquer des dépassements d’honoraires avec tact et mesure. L’Assurance Maladie rembourse les consultations et actes réalisés par ces médecins sur la base des tarifs fixés dans la convention (tarifs applicables au médecin de sec

In [52]:
len(doctors)

4

Let's write a function that allows us to extract just the things we want.

In [61]:
block = doctors[0]

In [63]:
print(block.prettify())

<div class="item-professionnel">
 <div class="item-professionnel-inner">
  <span class="num">
   1
  </span>
  <div class="nom_pictos">
   <h2>
    <a href="/professionnels-de-sante/recherche-1/fiche-detaillee-CbA1kzI4MzG0-3a4f89dc9bd653a34e287e7c17e7b2ff.html">
     <strong>
      DEBBASCH
     </strong>
     JEAN MARC
    </a>
   </h2>
   <div class="pictos">
    <img alt="Accepte la carte Vitale" class="infobulle" src="/resources_ver/20150826115145/images/picto_cartevitale.png"/>
   </div>
   <div class="clear">
   </div>
  </div>
  <div class="clear">
  </div>
  <div class="elements">
   <div class="item left">
   </div>
   <div class="item right type_honoraires">
    Honoraires libres
   </div>
   <div class="clear">
   </div>
   <div class="item left tel">
    01 46 56 90 90
   </div>
   <div class="item right convention">
    <a alt="Les médecins fixent librement leurs tarifs et peuvent donc pratiquer des dépassements d’honoraires avec tact et mesure. L’Assurance Maladie rembour

In [68]:
block.find('h2').text

'DEBBASCH JEAN MARC'

In [73]:
block.find("div", attrs={'class':"item left adresse"}).text

'31 AVENUE VERDIER92120 MONTROUGE'
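The street and the postcode come out glued together (`VERDIER92120`), presumably because the two parts sit on separate lines in the HTML and `.text` joins them without a separator. Assuming French five-digit postcodes, a small regular expression can put the space back; `split_postcode` is a hypothetical helper name.

```python
import re

# A position preceded by a non-digit, non-space character and followed by
# exactly five digits ending on a word boundary: the start of a fused postcode.
FUSED_POSTCODE = re.compile(r'(?<=[^\d\s])(?=\d{5}\b)')

def split_postcode(address):
    """Insert the missing space before a fused five-digit postcode."""
    return FUSED_POSTCODE.sub(' ', address)

print(split_postcode('31 AVENUE VERDIER92120 MONTROUGE'))
# 31 AVENUE VERDIER 92120 MONTROUGE
```

Beautiful Soup's `get_text()` also accepts a `separator` argument, which may avoid the problem at the source if the two parts are indeed separate text nodes.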

In [76]:
block.find("div", attrs={'class':"item left tel"}).text

'01\xa046\xa056\xa090\xa090'
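The `\xa0` characters are non-breaking spaces (U+00A0) placed between the digit pairs. A minimal way to normalize them, assuming plain spaces are wanted in the final data (`clean_phone` is a hypothetical name):

```python
def clean_phone(raw):
    """Replace non-breaking spaces with regular ones and trim the ends."""
    return raw.replace('\xa0', ' ').strip()

print(clean_phone('01\xa046\xa056\xa090\xa090'))
# 01 46 56 90 90
```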

In [78]:
block.find("div", attrs={'class':"item right type_honoraires"}).text

'Honoraires libres'

In [79]:
block.find("div", attrs={'class':"item right convention"}).text

'Conventionné secteur 2'

In [85]:
def extract_information(block):
    """Return the text of each field found in a doctor block, skipping missing ones."""
    name = block.find('h2')
    address = block.find("div", attrs={'class': "item left adresse"})
    phone = block.find("div", attrs={'class': "item left tel"})
    prices = block.find("div", attrs={'class': "item right type_honoraires"})
    convention = block.find("div", attrs={'class': "item right convention"})

    # Some entries (e.g. health centres) lack fields; find() returns None for those.
    fields = [name, address, phone, prices, convention]
    return [item.text for item in fields if item is not None]

In [87]:
for doc in doctors:
    print(extract_information(doc))

['DEBBASCH JEAN MARC', '31 AVENUE VERDIER92120 MONTROUGE', '01\xa046\xa056\xa090\xa090', 'Honoraires libres', 'Conventionné secteur 2']
['KHAYAT NADINE', '10 RUE V. HUGO92120 MONTROUGE', '01\xa082\xa000\xa015\xa016', 'Honoraires sans dépassement', 'Conventionné secteur 1']
['CENTRE DE SANTE MUNICIPAL', 'CENTRE DE SANTE MUNICIPAL5 RUE AMAURY DUVAL92120 MONTROUGE']
['KOHANE BERNARD', '31 AVENUE VERDIER92120 MONTROUGE', '01\xa046\xa056\xa090\xa090', 'Honoraires libres', 'Conventionné secteur 2']
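
Note what happens for the third entry: missing fields are dropped, so the list is shorter, and an entry with a missing phone number but a fee type would have every later value shifted one position to the left. Here is a sketch of a variant that keeps one named slot per field and uses `None` for absent ones. `FIELDS` and `extract_record` are hypothetical names; the selectors are the ones used above.

```python
# Field name -> (tag, class attribute), as used in the cells above.
FIELDS = {
    "name": ("h2", None),
    "address": ("div", "item left adresse"),
    "phone": ("div", "item left tel"),
    "prices": ("div", "item right type_honoraires"),
    "convention": ("div", "item right convention"),
}

def extract_record(block):
    """Return a dict with one key per field; absent fields map to None."""
    record = {}
    for field, (tag, css_class) in FIELDS.items():
        attrs = {"class": css_class} if css_class else {}
        node = block.find(tag, attrs=attrs)
        record[field] = node.text if node is not None else None
    return record
```

`[extract_record(doc) for doc in doctors]` is then a list of dicts, which `pd.DataFrame` turns into a table with named columns.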


In [88]:
import pandas as pd

In [89]:
pd.DataFrame([extract_information(doc) for doc in doctors])

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | DEBBASCH JEAN MARC | 31 AVENUE VERDIER92120 MONTROUGE | 01 46 56 90 90 | Honoraires libres | Conventionné secteur 2 |
| 1 | KHAYAT NADINE | 10 RUE V. HUGO92120 MONTROUGE | 01 82 00 15 16 | Honoraires sans dépassement | Conventionné secteur 1 |
| 2 | CENTRE DE SANTE MUNICIPAL | CENTRE DE SANTE MUNICIPAL5 RUE AMAURY DUVAL921... |  |  |  |
| 3 | KOHANE BERNARD | 31 AVENUE VERDIER92120 MONTROUGE | 01 46 56 90 90 | Honoraires libres | Conventionné secteur 2 |
