# Articles for training - Emails (target)

Checking the newspaper output for every URL extracted from the emails sent.

* __3.884__ articles were downloaded from the URLs found in the emails.
* __56__ articles have no text.
* __38__ articles have useless text: 34 bernsteinresearch contact pages and 4 bbvagmr login pages.

That leaves roughly __3.790__ target articles (3.884 - 56 - 38) that can currently be used.

In [66]:
import os
import re
import json
import pprint
import datetime
import pandas as pd

pp = pprint.PrettyPrinter(indent=4)

In [67]:
data = []
with open('./articles_email.json') as input_file:
    for line in input_file:
        data.append(json.loads(line))

In [69]:
df = pd.DataFrame(data)
df.shape

(3884, 8)

In [70]:
df.head(5)

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url
0,[],"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...
1,[],"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...
2,[],"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...
3,[],"[situación, en, bancos, que, por, los, la, se,...",2016-03-01 21:27:34+00:00,"Primero la situación económica de China, despu...",Desde que empezó el año hemos tenido ya 3 fact...,Nota sobre la situación de los bancos,http://www.robust.fondos.gvcgaesco.es/wp-conte...,http://www.robustglobal.com/nota-sobre-la-situ...
4,[],"[search, scheduled, meeting, regularly, fomc, ...",,"Meeting calendars, statements, and minutes (20...","Meeting calendars, statements, and minutes (20...",Meeting calendars and information,,http://www.federalreserve.gov/monetarypolicy/f...


In [71]:
# Data Wrangling

# flat list with the keywords of all articles
keywords = [keyw for row_keywords in df['keywords'] for keyw in row_keywords]

# join the list of authors into a single string
df['authors'] = df['authors'].apply(' / '.join)

# source: first domain label after '//' (ignoring 'www.'), or '' when the url does not match
def get_source(url):
    match = re.search(r'(?://www\.|//)(\w+)\.\w', url)
    return match.group(1) if match else ''

df['source'] = df['url'].apply(get_source)
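
To make the behaviour of the regular expression behind the `source` column easier to see, here is a minimal standalone sketch written for Python 3. The function name `extract_source` and the example URLs are hypothetical and only serve as illustration; the pattern is the one from the cell above, written as a raw string.

```python
import re

# Same pattern as in the wrangling cell, as a raw string
SOURCE_RE = re.compile(r'(?://www\.|//)(\w+)\.\w')

def extract_source(url):
    """Return the first domain label after '//' (ignoring 'www.'), or ''."""
    match = SOURCE_RE.search(url)
    return match.group(1) if match else ''

# Hypothetical URLs, for illustration only
examples = [
    'http://www.example.com/some/article.html',
    'https://blogs.example.com/post',
    'not a url',
]
sources = [extract_source(u) for u in examples]
print(sources)  # ['example', 'blogs', '']
```

Note that the pattern captures the first label of the host, so a subdomain wins over the main domain. This is presumably why values such as `blogs` appear as a source, and why `cincodias.elpais.com` is reported as `cincodias`.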

In [74]:
df.head(5)

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url,source
0,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
1,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
2,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
3,,"[situación, en, bancos, que, por, los, la, se,...",2016-03-01 21:27:34+00:00,"Primero la situación económica de China, despu...",Desde que empezó el año hemos tenido ya 3 fact...,Nota sobre la situación de los bancos,http://www.robust.fondos.gvcgaesco.es/wp-conte...,http://www.robustglobal.com/nota-sobre-la-situ...,robustglobal
4,,"[search, scheduled, meeting, regularly, fomc, ...",,"Meeting calendars, statements, and minutes (20...","Meeting calendars, statements, and minutes (20...",Meeting calendars and information,,http://www.federalreserve.gov/monetarypolicy/f...,federalreserve


### Source

In [75]:
# number of articles per source, most frequent first
df_source = df.groupby('source').size().sort_values(ascending=False).reset_index().rename(columns={0:"count"})
df_source.head(10)

Unnamed: 0,source,count
0,expansion,2031
1,cincodias,1015
2,elconfidencial,655
3,bernsteinresearch,72
4,blogs,30
5,federalreserve,13
6,retina,12
7,economia,6
8,bbvagmr,4
9,euribor,3


### Authors

In [76]:
df.groupby(['authors']).size().sort_values(ascending=False).reset_index().rename(columns={0: "count"}).head(15)

Unnamed: 0,authors,count
0,,2184
1,Ángeles Gonzalo Alconada,148
2,Cinco Días,99
3,Juande Portillo,59
4,Ángeles Gonzalo Alconada / Pablo Monge,33
5,Eduardo Segovia / Contacta Al Autor,30
6,Nuria Salobral,23
7,Pablo Martín Simón,20
8,Juande Portillo / Pablo Monge,19
9,Bernardo De Miguel,17


### Text

In [77]:
# articles without text (#56)

df[df['text']=='']['text'].count()

56

In [78]:
df[df['text']==''].groupby('source').size().sort_values(ascending=False).reset_index().rename(columns={0:"count"})

Unnamed: 0,source,count
0,bernsteinresearch,38
1,elconfidencial,6
2,imf,2
3,federalreserve,2
4,bbvaresearch,2
5,prensa,1
6,nobelprize,1
7,hugin,1
8,grupobancopopular,1
9,cnmv,1


In [79]:
# articles without a title have no text either (#10, already counted in the query above)

df[df['title']==""]

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url,source
250,,[],,,,,,https://www.bbvaresearch.com/wp-content/upload...,bbvaresearch
251,,[],,,,,,https://www.bbvaresearch.com/wp-content/upload...,bbvaresearch
330,,[],,,,,,http://cnmv.es/portal/HR/verDoc.axd?t=3D%7bee5...,cnmv
421,,[],,,,,,https://www.federalreserve.gov/monetarypolicy/...,federalreserve
732,,[],,,,,,http://hugin.info/134323/R/2043980/763188.pdf,hugin
817,,[],,,,,,http://www.grupobancopopular.com/ES/Accionista...,grupobancopopular
829,,[],,,,,,http://www.imf.org/external/pubs/ft/weo/2016/0...,imf
842,,[],,,,,,https://www.nobelprize.org/nobel_prizes/econom...,nobelprize
845,,[],,,,,,http://www.imf.org/external/pubs/ft/weo/2016/0...,imf
901,,[],,,,,,http://www.federalreserve.gov/monetarypolicy/b...,federalreserve


In [80]:
# most repeated texts (duplicates and placeholder pages)
df.groupby('text').size().sort_values(ascending=False).reset_index().rename(columns={0:"count"}).head(10)

Unnamed: 0,text,count
0,,56
1,How Can We Help?\n\nIf you'd like to learn mor...,34
2,Consulte las citas más relevantes de la jornad...,4
3,Don't have an account yet?\n\nThe content of t...,4
4,El ministro de Economía aseguró que la resoluc...,3
5,La salida a Bolsa de Unicaja prevista para est...,2
6,"Emilio Saracho, presidente de Banco Popular\n\...",2
7,Tras el sobresalto del miércoles por la consid...,2
8,La EBA aprueba unas guías para calcular las pé...,2
9,Los consistorios de las dos principales capita...,2


In [81]:
idx = df[df['text'].str.contains('How Can We Help')].index[0]
print("\nBernsteinresearch text:\n\n{}\n".format(df.loc[idx, 'text']))


Bernsteinresearch text:

How Can We Help?

If you'd like to learn more about Bernstein's insights and execution or how they can help advance your business, please contact us. 



In [82]:
# bernsteinresearch articles that could not be scraped (#34)

df[df['text'].str.contains('How Can We Help')].groupby('source').size().sort_values(ascending=False).reset_index().rename(columns={0:"count"})

Unnamed: 0,source,count
0,bernsteinresearch,34


In [83]:
idx = df[df['title'].str.contains('custom login')].index[0]
print("\nbbvagmr text:\n\n{}\n".format(df.loc[idx, 'text']))


bbvagmr text:

Don't have an account yet?

The content of this website is for the exclusive access of BBVA Corporate & Investment Banking authorized clients.

If you need a username and password or require further information about our services, please contact your Corporate & Investment Banking representative. Alternatively, please use the link below:

Contact us now 



In [84]:
# bbvagmr cannot be scraped since it requests login credentials (#4)

df[df['title'].str.contains('custom login')]

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url,source
0,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
1,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
2,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
107,,"[yetthe, website, services, login, custom, rep...",,Don't have an account yet?\nThe content of thi...,Don't have an account yet?\n\nThe content of t...,» custom login,https://www.bbvagmr.com/wp-content/themes/bbva...,http://www.bbvagmr.com/wp-content/plugins/misc...,bbvagmr
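
The three checks above (empty text, the bernsteinresearch contact page and the bbvagmr login page) can be combined into a single filter. The following is a minimal sketch, assuming each article is a dict-like record with `text` and `title` fields as in the JSON file; the function name `is_usable` is hypothetical.

```python
def is_usable(article):
    """True if the article is none of the unusable cases found above."""
    text = article.get('text') or ''
    title = article.get('title') or ''
    if text == '':                  # nothing was extracted
        return False
    if 'How Can We Help' in text:   # bernsteinresearch contact page
        return False
    if 'custom login' in title:     # bbvagmr login page
        return False
    return True

# Counts reported in this section
total, no_text, useless = 3884, 56, 38
remaining = total - no_text - useless
print(remaining)  # 3790
```

The same function could be applied to the records in `data` before building the DataFrame, for example with `[a for a in data if is_usable(a)]`.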


---

# Articles for training - Historical data

Checking the newspaper output for all URLs scraped from the different sources to build the training dataset.

__Considerations__

* Articles with __no text__.
* Note that some of these __articles may be included in the emails__ sent over these years.
* Currently working with articles __published on Mondays__ regardless of the source.
* Errors/exceptions raised by the newspaper download method still need to be handled.
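
As an illustration of the last point, a download loop can record failures instead of stopping at the first one. This is only a sketch under assumptions: `download_all` and `fake_fetch` are hypothetical names, and `fetch` stands for whatever callable wraps the newspaper download and parse steps; it is not the code used to build the files below.

```python
def download_all(urls, fetch):
    """Fetch every URL, collecting failures instead of aborting the run."""
    articles, failed = [], []
    for url in urls:
        try:
            articles.append(fetch(url))
        except Exception as exc:  # broad on purpose: one bad URL should not stop the loop
            failed.append({'url': url, 'error': str(exc)})
    return articles, failed

# Stand-in fetcher, for illustration only
def fake_fetch(url):
    if 'broken' in url:
        raise IOError('download failed')
    return {'url': url, 'text': 'some text'}

articles, failed = download_all(
    ['http://example.com/ok', 'http://example.com/broken'], fake_fetch)
print(len(articles), len(failed))  # 1 1
```

Keeping the failed URLs together with the error message makes it possible to retry them later or to report how many articles could not be downloaded per source.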

---

## Expansion

* Run for every Monday between 2017/04/10 and 2018/04/10 (5.871 URLs)
* There are 32 URLs without text (5.839 useful articles)
* It may be necessary to focus on specific sections or article types and to clean the dataset (and perhaps to increase the frequency of the Scrapy crawl to download more articles)

In [50]:
# raw urls extracted with Scrapy
urls = []
with open('../scrapy_projects/expansion_hemeroteca/urls_expansion.json') as input_file:
    for line in input_file:
        urls.append(json.loads(line))

# article content downloaded and parsed with newspaper3k
data = []
with open('./articles_expansion_hemeroteca.json') as input_file:
    for line in input_file:
        data.append(json.loads(line))
        

In [32]:
print("# of raw urls scraped from Expansion: {}".format(len(urls)))

# of raw urls scraped from Expansion: 5871


In [33]:
df = pd.DataFrame(data)

In [34]:
df.shape

(5871, 8)

In [35]:
df.head(2)

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url
0,[],"[dato, el, resumen, completar, declaraciones, ...",2017-05-15 00:00:00,Renta 2016: Cómo completar un dato ya incluido...,Renta 2016: Cómo completar un dato ya incluido...,Renta 2016: Cómo completar un dato ya incluido...,http://v.uecdn.es/p/111/thumbnail/entry_id/0_y...,http://www.expansion.com/economia/declaracion-...
1,[],"[decisiones, el, recursos, personas, talento, ...",2017-05-15 00:00:00,El cambio tecnológico provoca una gran transfo...,El cambio tecnológico provoca una gran transfo...,Valores y personas se colocan en el foco de la...,http://estaticos.expansion.com/assets/multimed...,http://www.expansion.com/pais-vasco/2017/05/15...


In [36]:
# create seccion field from url
df['seccion'] = df['url'].apply(lambda x: x.split('/')[3])
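
Indexing `split('/')[3]` raises an `IndexError` for a URL without a path. A slightly more defensive variant, sketched here for Python 3 with the standard library, parses the URL first. The function name `section_from_url` and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

def section_from_url(url):
    """Return the first path segment of the URL, or '' when there is none."""
    segments = [s for s in urlparse(url).path.split('/') if s]
    return segments[0] if segments else ''

# Hypothetical URLs, for illustration only
print(section_from_url('http://www.example.com/economia/2017/05/15/story.html'))  # economia
print(section_from_url('http://www.example.com'))  # prints an empty line
```

For URLs that do have a path, the result matches the `split('/')[3]` approach, so the section counts should not change.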

In [37]:
# number of articles per section, most frequent first
df.groupby('seccion').size().sort_values(ascending=False).reset_index().rename(columns={0:'count'})

Unnamed: 0,seccion,count
0,empresas,1249
1,economia,1039
2,aragon,546
3,mercados,516
4,extremadura,439
5,fueradeserie,407
6,juridico,233
7,sociedad,220
8,directivos,188
9,latinoamerica,173


In [40]:
df.groupby('publish_date').size().reset_index().rename(columns={0:'count'}).head(10)

Unnamed: 0,publish_date,count
0,2017-04-10 00:00:00,141
1,2017-04-17 00:00:00,92
2,2017-04-24 00:00:00,144
3,2017-05-01 00:00:00,39
4,2017-05-08 00:00:00,120
5,2017-05-15 00:00:00,55
6,2017-05-22 00:00:00,147
7,2017-05-29 00:00:00,115
8,2017-06-05 00:00:00,132
9,2017-06-12 00:00:00,109
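
Since the dataset is meant to contain only articles published on Mondays, the dates can be checked directly. A minimal sketch, assuming `publish_date` is stored as a string in the `YYYY-MM-DD HH:MM:SS` form shown above (the helper name `is_monday` is hypothetical):

```python
from datetime import datetime

def is_monday(publish_date):
    """publish_date is a string such as '2017-04-10 00:00:00'."""
    parsed = datetime.strptime(publish_date, '%Y-%m-%d %H:%M:%S')
    return parsed.weekday() == 0  # Monday is 0

dates = ['2017-04-10 00:00:00', '2017-04-17 00:00:00', '2017-06-18 00:00:00']
flags = [is_monday(d) for d in dates]
print(flags)  # [True, True, False]
```

The last date is a Sunday; a value like it appears in the El Confidencial sample further down, so a check of this kind may flag a few articles whose parsed date does not match the crawled day.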


In [38]:
# articles without text (#32)

df[df['text']=='']['text'].count()

32

In [39]:
# articles without title (#2, one of them does not have text either)

df[df['title']=='']

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url,seccion
714,[],"[está, esta, ella, y, se, puede, cerrada, vota...",2017-05-22 00:00:00,,Esta encuesta está cerrada y no se puede votar...,,http://estaticos.expansion.com/assets/desktop/...,http://www.expansion.com/economia/politica/deb...,economia
5788,[],[],2018-02-19 00:00:00,,,,,http://www.expansion.com/juridico/premios/2018...,juridico


---

## Cincodias

* Run every two weeks on Mondays between 2017/04/10 and 2018/04/10 (1.390 URLs)
* There are 0 URLs without text (1.390 useful articles)
* It may be necessary to focus on specific sections or article types and to clean the dataset (and perhaps to increase the frequency of the Scrapy crawl to download more articles)

In [55]:
# raw urls extracted with Scrapy
urls = []
with open('../scrapy_projects/cincodias/urls_cincodias.json') as input_file:
    for line in input_file:
        urls.append(json.loads(line))
        
# article content downloaded and parsed with newspaper3k
data = []        
with open('./articles_cincodias.json') as input_file:
    for line in input_file:
        data.append(json.loads(line))

In [56]:
print("# of raw urls scraped from Cincodias: {}".format(len(urls)))

# of raw urls scraped from Cincodias: 1390


In [45]:
df = pd.DataFrame(data)
df.shape

(1390, 8)

In [46]:
df.head(5)

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url
0,[Marimar Jiménez],"[en, tercio, millones, por, ventas, el, las, i...",2017-04-10 00:00:00,La industria del videojuego continúa creciendo...,La industria del videojuego continúa creciendo...,Un tercio de las ventas de videojuegos en Espa...,https://d500.epimg.net/cincodias/imagenes/2017...,https://cincodias.elpais.com/cincodias/2017/04...
1,[Ediciones Cinco Días],"[partners, en, ritzcarlton, del, abama, el, hi...",2017-04-10 00:00:00,"Tropical Hoteles, propiedad del Grupo Timón, a...","Tropical Hoteles, propiedad del Grupo Timón, a...","Tropical Hoteles vende a HI Partners el 49,9% ...",https://d500.epimg.net/cincodias/imagenes/2017...,https://cincodias.elpais.com/cincodias/2017/04...
2,"[Pablo Martín Simón, Alfonso Simón Ruiz]","[aprueba, en, del, el, ofrece, la, como, inmob...",2017-05-08 00:00:00,"Es la mayor plataforma del sector, con más de ...","Housers se presenta como ""la plataforma líder ...",La CNMV aprueba a Housers como plataforma inmo...,https://d500.epimg.net/cincodias/imagenes/2017...,https://cincodias.elpais.com/cincodias/2017/05...
3,"[Javier García Ropero, Mike Segar]","[niño, sergio, al, en, del, el, dar, garcía, l...",2017-04-10 00:00:00,"En abril de 1999, un joven de 19 años de Borri...","En abril de 1999, un joven de 19 años de Borri...","Sergio García, 'El Niño' que vuelve a dar aire...",https://d500.epimg.net/cincodias/imagenes/2017...,https://cincodias.elpais.com/cincodias/2017/04...
4,[Carlos Molina],"[frente, barata, en, al, ciudad, ir, el, barce...",2017-04-10 00:00:00,Desplazarse desde el centro de la ciudad hasta...,Desplazarse desde el centro de la ciudad hasta...,"Barcelona, la ciudad más barata para ir en tax...",https://d500.epimg.net/cincodias/imagenes/2017...,https://cincodias.elpais.com/cincodias/2017/04...


In [48]:
# articles without text (#0)

df[df['text']=='']['text'].count()

0

---

## El Confidencial

* Run every two weeks on Mondays between 2017/04/10 and 2018/04/10 (1.800 URLs)
* There are 2 URLs without text and 4 articles that could not be downloaded (1.794 useful articles)
* It may be necessary to focus on specific sections or article types and to clean the dataset (and perhaps to increase the frequency of the Scrapy crawl to download more articles)

In [59]:
# raw urls extracted with Scrapy
urls = []
with open('../scrapy_projects/elconfidencial/urls_elconfidencial.json') as input_file:
    for line in input_file:
        urls.append(json.loads(line))

# article content downloaded and parsed with newspaper3k
data = []
with open('./articles_elconfidencial.json') as input_file:
    for line in input_file:
        data.append(json.loads(line))

In [60]:
print("# of raw urls scraped from El Confidencial: {}".format(len(urls)))

# of raw urls scraped from El Confidencial: 1800


In [61]:
df = pd.DataFrame(data)
df.shape

(1796, 8)
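
The DataFrame has 1796 rows for 1800 scraped URLs, which accounts for the 4 articles that could not be downloaded. One way to identify them is to compare both files by URL. This is a sketch under an assumption that the source does not confirm: that every record in the Scrapy output is a dict with a `url` key. The function name `missing_urls` and the sample records are hypothetical.

```python
def missing_urls(scraped, downloaded):
    """URLs present in the Scrapy output but absent from the parsed articles."""
    downloaded_urls = {article['url'] for article in downloaded}
    return [item['url'] for item in scraped if item['url'] not in downloaded_urls]

# Toy records, for illustration only
scraped = [{'url': 'http://example.com/a'},
           {'url': 'http://example.com/b'},
           {'url': 'http://example.com/c'}]
downloaded = [{'url': 'http://example.com/a', 'text': '...'},
              {'url': 'http://example.com/c', 'text': ''}]
missing = missing_urls(scraped, downloaded)
print(missing)  # ['http://example.com/b']
```

If the stored article URL differs from the scraped one (for example after a redirect), an exact comparison like this would over-report missing articles, so the result is best treated as a list of candidates to inspect.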

In [62]:
df.head(5)

Unnamed: 0,authors,keywords,publish_date,summary,text,title,top_image,url
0,"[Rafael, Fotografía, Enrique Villarino, Eduard...","[la, en, noticias, supremo, el, que, una, paga...",2017-06-19 00:00:00,Sostienen que el mismo dinero que les sirvió p...,"Jaime Botín, patrón de Bankinter y procesado p...",Lista Falciani: Jaime Botín e hijos reclaman e...,https://www.ecestaticos.com/imagestatic/clippi...,https://www.elconfidencial.com/empresas/2017-0...
1,"[Eduardo Segovia, Agustín Marco, E. Segovia, C...","[del, vender, el, reprocha, que, una, se, plan...",2017-06-05 00:00:00,"Uno de sus componentes, molesto con la actual ...","""Enfado monumental"". Así califican varios miem...",Noticias del Banco Popular: El consejo del Ban...,https://www.ecestaticos.com/imagestatic/clippi...,https://www.elconfidencial.com/empresas/2017-0...
2,"[Isabel Morillo, Juanma Romero, Contacta Al Au...","[del, el, que, segunda, va, sánchez, su, por, ...",2017-06-19 00:00:00,El día después de la batalla no habrá tregua e...,El día después de la batalla no habrá tregua e...,39° Congreso Federal del PSOE: El PSOE afronta...,https://www.ecestaticos.com/imagestatic/clippi...,https://www.elconfidencial.com/espana/2017-06-...
3,"[Isabel Morillo, Juanma Romero, Contacta Al Au...","[las, del, federal, reabre, el, que, se, más, ...",2017-06-18 00:00:00,Fue la última en entrevistarse con Pedro Sánch...,La jornada acabó mal. Pasadas las once de la n...,39° Congreso Federal del PSOE: El pulso de Ped...,https://www.ecestaticos.com/imagestatic/clippi...,https://www.elconfidencial.com/espana/2017-06-...
4,"[Juanma Romero, Isabel Morillo, Contacta Al Au...","[pronto, del, las, mandato, el, que, una, misi...",2017-06-19 00:00:00,"Que hiciera el equipo que quisiera, que sacara...",El reinado de Pedro Sánchez ya ha comenzado. E...,39° Congreso Federal del PSOE: Sánchez arranca...,https://www.ecestaticos.com/imagestatic/clippi...,https://www.elconfidencial.com/espana/2017-06-...


In [64]:
# articles without text (#2)

df[df['text']=='']['text'].count()

2