# Troubleshooting

Testing the accuracy of the article URL extraction process.

---

In [1]:
import os
import re
import email
import quopri
import pprint
import imaplib
import getpass

pp = pprint.PrettyPrinter(indent=4)

M = imaplib.IMAP4_SSL('imap.gmail.com')

In [2]:
# Retrieve credentials
user = 'amoya@bluecap.com' #input/raw_input('User: ')
passwd = getpass.getpass()

# Login 
M.login(user, passwd)
M.select()

········


('OK', ['4577'])

---

# Case 1

#### Email from text file:

In [3]:
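# Read the saved email and decode its quoted-printable content.
# A context manager closes the file and avoids shadowing the built-in `file`.
with open("prueba306.txt", "r") as f:
    msg = quopri.decodestring(f.read())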

In [4]:
# Instead of extracting the URLs from the raw stringified body of the email,
# we look at the raw HTML that follows the stringified body and
# collect the value of the href attribute of each <a> tag.

# The last match is the link in the email footer, which we do not need.
urls_raw = re.findall(r'<a href="(\S+)"', msg)[:-1]
pp.pprint(urls_raw)
print len(urls_raw)

[   'http://www.expansion.com/empresas/banca/2017/05/18/591d3b1ce2704e2f648b456b.html',
    'http://cincodias.elpais.com/cincodias/2017/05/17/companias/1495050961_042814.html',
    'http://www.expansion.com/empresas/banca/2017/05/17/591c5021268e3eb7308b45f4.html',
    'http://www.elconfidencial.com/empresas/2017-05-18/banco-popular-bonos-deuda-subodinada-coco-saracho-rescate_1384172/',
    'http://www.expansion.com/empresas/banca/2017/05/17/591be5af268e3e672f8b456b.html',
    'http://www.expansion.com/empresas/banca/2017/05/16/591ac44ae2704e355c8b4649.html',
    'http://www.expansion.com/empresas/banca/2017/05/16/591afa0ce2704e9b3e8b4582.html',
    'http://www.elconfidencial.com/mercados/2017-05-18/bankia-bbva-santander-compra-banco-popular-favorito-mercado_1384370/',
    'http://cincodias.elpais.com/cincodias/2017/05/17/companias/1495051532_770914.html',
    'http://cincodias.elpais.com/cincodias/2017/05/17/companias/1495050192_256490.html',
    'http://www.expansion.com/empresas/banc
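
The pattern used above only matches anchors whose `href` is the first attribute and is wrapped in double quotes. As a possible alternative, here is a minimal sketch based on the standard library HTML parser. It is written for Python 3 (the notebook itself runs on Python 2), and the names `HrefCollector` and `extract_hrefs`, as well as the sample markup, are illustrative assumptions rather than part of the original pipeline.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return all anchor hrefs found in html_text (hypothetical helper)."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Made-up sample: the second anchor uses single quotes and a leading attribute
sample = (
    '<div><a href="http://example.com/a.html" target="_blank">one</a>'
    "<a class='x' href='http://example.com/b.html'>two</a></div>"
)
print(extract_hrefs(sample))
```

Note that this sketch keeps every link, so the footer link would still have to be dropped afterwards, as the `[:-1]` slice does in the cell above.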

---

# Case 2

#### Santander - Popular:

* Santander hires Deutsche Bank's head of real estate to take charge of part of its property front. The hire responds to an express request from Javier García de Carranza, Deputy General Manager and head of the Restructuring, Recoveries, Real Estate, Investees and Venture Capital area. He wanted Carlos Manzano, until now the German bank's top real estate executive and manager of Trajano, the SOCIMI owned by DB's high-net-worth clients, to join the team that, among other challenges, will have to clean up the bank's property-linked portfolio (€37,000m). Carlos Manzano will head the investees Merlin and Testa, relaunch Metrovacesa and run Altamira and Aliseda. The article highlights another move: Jaime Rodríguez Andrades, formerly of Morgan Stanley and a trusted associate of Carranza, takes charge of the Non Performing Loans division.

http://www.elconfidencial.com/empresas/2017-0
7-18/santander-ficha-jefe-ladrillo-deutsche-bank-venta-inmobiliarias_1416629/

* Brussels puts under scrutiny Popular's drive to win business, which rewards customers who bring in money from other banks, as long as it does not come from Santander

## Problem

The URL is split across two lines, so the extraction captures only the part before the line break.
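
To make the failure concrete, the following Python 3 sketch applies a whitespace-delimited URL pattern to the two lines quoted above. The variable names `body` and `found` are illustrative.

```python
import re

# The URL exactly as it appears in the newsletter text, split by a line break
body = (
    "http://www.elconfidencial.com/empresas/2017-0\n"
    "7-18/santander-ficha-jefe-ladrillo-deutsche-bank-venta-inmobiliarias_1416629/\n"
)

# \S+ stops at the first whitespace character, including the newline
found = re.findall(r"https?://\S+", body)
print(found)  # ['http://www.elconfidencial.com/empresas/2017-0']
```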

In [5]:
typ, data = M.search(None, '(FROM "lizquierdo@bluecap.com")')

n_last = data[0].split()[-113]

typ, data = M.fetch(n_last, '(RFC822)') # RFC822: Standard for ARPA Internet Text Messages
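
A fetched RFC822 message is a MIME document, and the `email` package can decode each part's transfer encoding, which removes the need to strip quoted-printable soft line breaks by hand. The following is a minimal Python 3 sketch under that assumption (the notebook itself runs on Python 2). The helper name `decoded_text_parts` and the hand-written message `RAW` are hypothetical stand-ins for the bytes returned by `M.fetch`.

```python
import email

# Made-up multipart message whose URLs are wrapped by quoted-printable
# soft line breaks ("=" followed by CRLF)
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b'Content-Type: text/plain; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"http://example.com/a-long-path-that-quoted-printable-wraps-with-a-soft-=\r\n"
    b"break/article.html\r\n"
    b"--b1\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<a href=3D"http://example.com/a-long-path-that-quoted-printable-wraps-wi=\r\n'
    b'th-a-soft-break/article.html">link</a>\r\n'
    b"--b1--\r\n"
)


def decoded_text_parts(raw_bytes):
    """Map each text/* content type to its decoded text (hypothetical helper)."""
    msg = email.message_from_bytes(raw_bytes)
    parts = {}
    for part in msg.walk():
        # Skip multipart containers and non-text attachments
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable or base64 and returns bytes
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "utf-8"
        parts[part.get_content_type()] = payload.decode(charset, errors="replace")
    return parts


parts = decoded_text_parts(RAW)
print(parts["text/plain"].strip())
print(parts["text/html"].strip())
```

This only repairs breaks introduced by the transfer encoding. A URL that is already split inside the HTML markup itself stays split after decoding.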

So far the links have been extracted from the stringified body of the emails (see below).

In [6]:
email_obj = {}

for response_part in data:

    if isinstance(response_part, tuple):
        
        msg = email.message_from_string(response_part[1].decode('utf-8'))

        email_obj['from'] = msg['from']
        email_obj['to'] = msg['to']
        email_obj['subject'] = msg['subject']
        email_obj['date'] = msg['date']

        if msg.is_multipart():
            raw_body = msg.get_payload()[0].get_payload()
        else:
            raw_body = msg.get_payload()
                
        # email_obj['body'] = raw_body

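        # Extract the links to the articles. First remove quoted-printable
        # soft line breaks ("=\r\n") and stray carriage returns.
        try:
            body = raw_body.replace("=\r\n", "")
            body = body.replace("\r", "")
        except AttributeError:
            # raw_body is a list of sub-parts (nested multipart): use the first one
            raw_body = raw_body[0].get_payload()
            body = raw_body.replace("=\r\n", "")
            body = body.replace("\r", "")

        urls_raw = re.findall(r"(?P<url>https?://[^\s]+)", body)[:-1]
        
        # Sanity check: drop anything captured after a closing ">"
        urls = [url.split(">")[0] for url in urls_raw]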
        
        if urls:
            email_obj['urls'] = urls

print "\nThere are", len(email_obj['urls']), "urls extracted.\n"
pp.pprint(email_obj['urls'])


There are 18 urls extracted.

[   'http://www.elconfidencial.com/empresas/2017-0',
    'http://www.expansion.com/empresas/banca/2017/07/17/596ca4f8468aebea418b4582.html',
    'http://www.expansion.com/empresas/banca/2017/07/17/596ca409ca4741a2118b4646.html',
    'http://www.expansion.com/empresas/banca/2017/07/17/596cf10946163fc1518b45ef.html',
    'https://cincodias.elpais.com/cincodias/2017/07/17/companias/1500318223_607267.html',
    'http://www.expansion.com/mercados/2017/07/18/596d1c95e2704e6b1c8b4617.html',
    'http://www.expansion.com/mercados/2017/07/17/596c9369468aeb11158b456c.html',
    'http://www.expansion.com/economia/2017/07/17/596cbbf346163fca058b45ab.html',
    'https://cincodias.elpais.com/cincodias/2017/07/17/midinero/1500294143_065299.html',
    'http://www.elconfidencial.com/vivienda/2017-07-18/bankia-hipotecas-gastos-hipotecarios-actos-juridicos-documentados-ajd_1416586/',
    'http://www.expansion.com/empresas/banca/2017/07/17/596cd52e46163fd0618b45b5.html',
   

We will now extract the same URLs from the raw HTML text of the email.

In [7]:
# So far we have looked at the stringified part of the response (the body).
# Now let's extract the URLs from the raw HTML text of the email.

email_obj = {}

for response_part in data:

    if isinstance(response_part, tuple):
        
        msg = email.message_from_string(response_part[1].decode('utf-8'))

        email_obj['from'] = msg['from']
        email_obj['to'] = msg['to']
        email_obj['subject'] = msg['subject']
        email_obj['date'] = msg['date']

        raw_body = quopri.decodestring(response_part[1])
        urls_raw = re.findall(r'<a href="(\S+)"', raw_body)[:-1]
        if urls_raw:
            email_obj['urls'] = urls_raw

print "\nThere are", len(email_obj['urls']), "urls extracted.\n"
pp.pprint(email_obj['urls'])


There are 18 urls extracted.

[   'http://www.elconfidencial.com/empresas/2017-0',
    'http://www.expansion.com/empresas/banca/2017/07/17/596ca4f8468aebea418b4582.html',
    'http://www.expansion.com/empresas/banca/2017/07/17/596ca409ca4741a2118b4646.html',
    'http://www.expansion.com/empresas/banca/2017/07/17/596cf10946163fc1518b45ef.html',
    'https://cincodias.elpais.com/cincodias/2017/07/17/companias/1500318223_607267.html',
    'http://www.expansion.com/mercados/2017/07/18/596d1c95e2704e6b1c8b4617.html',
    'http://www.expansion.com/mercados/2017/07/17/596c9369468aeb11158b456c.html',
    'http://www.expansion.com/economia/2017/07/17/596cbbf346163fca058b45ab.html',
    'https://cincodias.elpais.com/cincodias/2017/07/17/midinero/1500294143_065299.html',
    'http://www.elconfidencial.com/vivienda/2017-07-18/bankia-hipotecas-gastos-hipotecarios-actos-juridicos-documentados-ajd_1416586/',
    'http://www.expansion.com/empresas/banca/2017/07/17/596cd52e46163fd0618b45b5.html',
   

In [8]:
# Let's see why the first URL cannot be extracted correctly ...

re.findall(r'<a href="(.+)>2\.', raw_body)[0]

'http://www.elconfidencial.com/empresas/2017-0">http://www.elconfidencial.com/empresas/2017-0</a></font><font color="#000000" face="arial, helvetica, sans-serif">7-18/santander-ficha-jefe-ladrillo-deutsche-bank-venta-inmobiliarias_1416629/<br></font></span></div><div><font color="#000000" face="arial, helvetica, sans-serif" style="background-color:rgb(255,255,255)"><br></font></div><div><font color="#000000" face="arial, helvetica, sans-serif" style="background-color:rgb(255,255,255)"'

* ...will relaunch Metrovacesa and run Altamira and Aliseda. The article highlights another move: Jaime Rodríguez Andrades, formerly of Morgan Stanley and a trusted associate of Carranza, takes charge of the Non Performing Loans division.

http://www.elconfidencial.com/empresas/2017-0
7-18/santander-ficha-jefe-ladrillo-deutsche-bank-venta-inmobiliarias_1416629/

* Brussels puts under ...

Reconstructing the URL from the HTML above is complicated because the href attribute is incomplete. We would have to rebuild the URL by concatenating the text of the child and following nodes (`<font>...</font>`). In any case, the link to the article is also broken in the newsletter email itself.
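
One way the concatenation idea could be sketched, in Python 3 and with the standard library HTML parser, is shown below. The class name `SplitLinkRepairer` and the choice of `br`, `div`, `p` and `li` as boundary tags are assumptions made for illustration. The sketch only recovers the URL from the visible text and falls back to the `href` value whenever that text does not start with it.

```python
from html.parser import HTMLParser


class SplitLinkRepairer(HTMLParser):
    """Rebuild URLs whose visible text continues after the closing </a>."""

    # Assumed boundaries: tags that end the line a link was written on
    BOUNDARY_TAGS = {"br", "div", "p", "li"}

    def __init__(self):
        super().__init__()
        self.urls = []
        self._href = None   # href of the anchor currently being tracked
        self._parts = []    # text seen since that anchor opened

    def _flush(self):
        if self._href is None:
            return
        # Join the collected text and drop all whitespace
        text = "".join("".join(self._parts).split())
        # Trust the visible text only when it extends the (truncated) href
        self.urls.append(text if text.startswith(self._href) else self._href)
        self._href, self._parts = None, []

    def handle_starttag(self, tag, attrs):
        if tag in self.BOUNDARY_TAGS:
            self._flush()
        elif tag == "a":
            self._flush()
            self._href = dict(attrs).get("href") or None

    def handle_endtag(self, tag):
        if tag in self.BOUNDARY_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


# Fragment taken from the output of the previous cell
fragment = (
    '<a href="http://www.elconfidencial.com/empresas/2017-0">'
    "http://www.elconfidencial.com/empresas/2017-0</a></font>"
    '<font color="#000000" face="arial, helvetica, sans-serif">'
    "7-18/santander-ficha-jefe-ladrillo-deutsche-bank-venta-inmobiliarias_1416629/"
    "<br></font></span></div>"
)

repairer = SplitLinkRepairer()
repairer.feed(fragment)
repairer.close()
print(repairer.urls)
```

For anchors whose visible text is a label rather than the URL, the sketch returns the `href` unchanged, so well-formed links are not affected.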

In [9]:
M.close()

M.logout()

('BYE', ['LOGOUT Requested'])