# 1.- Create a database with all cities and towns from Catalonia

This project builds a database of news from different newspapers in Catalonia, so the first step is a database containing every town and city of Catalonia. This information is extracted from a dataset provided by the "Institut Cartogràfic de Catalunya".

In [15]:
# Connect to the local MongoDB server
from pymongo import MongoClient

def get_db_ciutats():
    # Return a handle to the "ciutats" database
    client = MongoClient('localhost:27017')
    db = client.ciutats
    return db

def add_city_db(db, name, tipus, lat, lon):
    # Insert one place (name, type and coordinates) into the "ciutats" collection
    db.ciutats.insert({"name":name, "tipus":tipus,"lat":lat, "lon":lon})

In [18]:
import xlrd
from collections import OrderedDict
import simplejson as json
import utm

def ini_Catalan_cities():
    print "Loading file..."
    # Open the workbook and select the first worksheet
    wb = xlrd.open_workbook('CartoCat.xlsx')
    sh = wb.sheet_by_index(0)
    print "File loaded correctly."
    db = get_db_ciutats()
    
    print "Conversio iniciada"
    # Iterate through each row in worksheet and fetch values into dict
    for rownum in range(1, sh.nrows):
        item = OrderedDict()
        row_values = sh.row_values(rownum)
        item['nom'] = row_values[0]
        item['tipus'] = row_values[1]
        #item['municipi'] = row_values[2]
        #item['comarca'] = row_values[6]
        item['utmX'] = row_values[15]
        item['utmY'] = row_values[16]
        #if item['tipus']!= 'cap':
        #    continue
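        # Skip rows that have no coordinates
        if item['utmX']==0.0 or item['utmY']==0.0:
            continue
        #if db.ciutats.find({'name':item['nom']}).count()>0:
            #continue
        # Convert UTM coordinates (zone 31T, which covers Catalonia) to latitude/longitude
        u = utm.to_latlon(item['utmX'],item['utmY'], 31, 'T')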
        item['lat'] = u[0]
        item['lon'] = u[1]
        add_city_db(db,item['nom'],item['tipus'],item['lat'],item['lon'])
    print "Conversio finalitzada"
    print "Elements in db.ciutats: " , db.ciutats.count()

In [19]:
# Create the first database, the one containing all Catalan cities.
# This code only needs to run once and takes a long time, so the call is
# commented out to prevent accidental execution:

#ini_Catalan_cities()

Loading file...
File loaded correctly.
Conversio iniciada
Conversio finalitzada
Elements in db.ciutats:  52698
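
Since every record is converted from UTM to latitude and longitude, a cheap sanity check is to confirm that the converted points fall inside a rough bounding box around Catalonia. The sketch below is only an illustration: the function name `in_catalonia_bbox` and the box limits are assumptions (approximate values chosen by hand, not taken from the source data), and the Calaf coordinates are the ones returned by the query example further down in this notebook.

```python
# Approximate bounding box around Catalonia, in decimal degrees.
# These limits are an assumption made for illustration only.
LAT_MIN, LAT_MAX = 40.5, 42.9
LON_MIN, LON_MAX = 0.1, 3.4

def in_catalonia_bbox(lat, lon):
    """Return True if (lat, lon) lies inside the rough bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX

# Calaf, as stored in db.ciutats
print(in_catalonia_bbox(41.734805925564594, 1.5137381450247307))
# A point well to the west of the box
print(in_catalonia_bbox(41.7, -4.5))
```

A check like this could be run over the whole collection after loading; points outside the box would deserve a manual look rather than automatic deletion, because the limits are only approximate.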


## Query example
Example of how a MongoDB collection can be queried to find an item by one of its keys.

In [22]:
# Get the information related to my town: Calaf
db = get_db_ciutats()
poble = db.ciutats.find({'name':'Calaf'})
for a in poble:
    print a

{u'lat': 41.734805925564594, u'_id': ObjectId('554f9478366044344065476a'), u'lon': 1.5137381450247307, u'name': u'Calaf', u'tipus': u'cap'}
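
The filter passed to `find` is a dictionary of field/value pairs; with plain values like the one above, a document matches when each listed field equals the given value. The following pure-Python sketch mimics that behaviour on a list of dictionaries, so it can be run without a MongoDB server. The helper `match_equal` is hypothetical and is not part of `pymongo`; the sample documents are reduced to two fields, and query operators such as `$gt` are not covered.

```python
def match_equal(documents, query):
    """Return the documents whose fields equal every value in `query`."""
    return [doc for doc in documents
            if all(doc.get(key) == value for key, value in query.items())]

# Two documents reduced to the fields needed for the example
sample = [
    {"name": "Calaf", "tipus": "cap"},
    {"name": "Barcelona", "tipus": "cap"},
]

# Same filter as in the cell above: one match
print(match_equal(sample, {"name": "Calaf"}))
# Filtering on the type matches both documents
print(len(match_equal(sample, {"tipus": "cap"})))
```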


In [20]:
# Get the full list of towns and cities in Catalonia. There are currently 947 towns in Catalonia,
# so we expect the list to be exactly that long.
db = get_db_ciutats()
ciut = db.ciutats.find({'tipus':'cap'})
print ciut.count()
for a in ciut:
    print a['name']

947
Abella, l'
Abella de la Conca
Abrera
Àger
Agramunt
Aguilar de Segarra
Agullana
Aiguafreda
Aiguaviva
Aitona
Alamús, els
Alàs
Albagés, l'
Albanyà
Albatàrrec
Albesa
Albi, l'
Albinyana
Albiol, l'
Albons
Alcanar
Alcanó
Alcarràs
Alcoletge
Alcover
Aldea, l'
Aldover
Aleixar, l'
Alella
Alfara de Carles
Alfarràs
Alfés
Alforja
Algerri
Alguaire
Alins
Alió
All
Almacelles
Almatret
Almenar
Almoster
Alòs de Balaguer
Alp
Alpens
Alpicat
Altafulla
Amer
Ametlla del Vallès, l'
Ametlla de Mar, l'
Ampolla, l'
Amposta
Anglès
Anglesola
Anserall
Ansovell
Arbeca
Arboç, l'
Arbolí
Arbúcies
Arenys de Mar
Arenys de Munt
Argelaguer
Argençola
Argentera, l'
Argentona
Armentera, l'
Arnes
Arres de Jos
Arsèguel
Artés
Artesa de Lleida
Artesa de Segre
Ascó
Aspa
Avellanes, les
Avià
Avinyó
Avinyonet de Puigventós
Avinyó Nou
Badalona
Badia del Vallès
Bagà
Balaguer
Balsareny
Banyeres del Penedès
Banyoles
Barbens
Barberà de la Conca
Barberà del Vallès
Barcelona
Barruera
Bàscara
Bassella
Batea
Bausen
Begues
Begur
Belianes
Bel

## Types of elements in the database

The most important element types found in the database, among other less relevant ones, are the following:

Abbreviation | Meaning
---    |   ---
'cap'  | Municipal seat (cap de municipi)
'barri'| Neighbourhood or urban sector (over 50,000 inhabitants)
'nucli'| Population centre (village, hamlet...)
'diss.'| Scattered settlement
'e.m.d.'| Decentralised municipal entity
'mun.' | Name of the municipality when it differs from that of its seat
'edif.' | Isolated building
'edif. hist.' | Historic building (hermitage, church, castle...)
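
To see how the records are distributed across these types, the `tipus` values can be tallied. The sketch below does this in plain Python with `collections.Counter` over a short hypothetical list of values; with the real collection, the same tally could be fed from the documents returned by `db.ciutats.find()`. The list contents are made up for illustration and do not reflect the actual counts in the database.

```python
from collections import Counter

# Hypothetical 'tipus' values, standing in for the documents of db.ciutats
tipus_values = ["cap", "nucli", "cap", "edif.", "nucli", "nucli", "diss."]

counts = Counter(tipus_values)

# List the types from most to least frequent
for tipus, n in counts.most_common():
    print("%s: %d" % (tipus, n))
```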


In [23]:
db.ciutats.count()

52698

# 2.- Add to the database some important international cities

To achieve this, I will use SPARQL, a query language that can retrieve information from sources such as DBpedia, a database containing information extracted from Wikipedia.

In [33]:
import json
from SPARQLWrapper import SPARQLWrapper, JSON
 
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
 
sparql.setQuery("""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
 
SELECT ?title ?geolat ?geolong
    WHERE {
        #?place rdf:type <http://dbpedia.org/ontology/Place> .
        #?place dbpedia-owl:country <http://dbpedia.org/resource/Spain> .
        ?place foaf:name ?title .
        ?place geo:lat ?geolat .
        ?place geo:long ?geolong .
        #FILTER ((?geolong > 0.5 && ?geolong < 2.7) && (?geolat < 42.5 && ?geolat > 40.5))
        #FILTER (LANG(?title)='ca')
    }
""")
results = sparql.query().convert()
print results

{u'head': {u'link': [], u'vars': [u'title', u'geolat', u'geolong']}, u'results': {u'distinct': False, u'bindings': [], u'ordered': True}}
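
The dictionary returned by `convert()` follows the SPARQL JSON results layout: each entry of `results.bindings` maps a variable name to a small dictionary holding its `value`. In the run above the `bindings` list is empty, so there is nothing to extract. The sketch below shows how rows could be pulled out once a query does return bindings. The helper name `bindings_to_rows` and the sample response are hypothetical: the sample is hand-written, reusing rounded Calaf coordinates, and is not actual DBpedia output.

```python
def bindings_to_rows(results, variables):
    """Turn a SPARQL JSON result into a list of tuples of plain values."""
    rows = []
    for binding in results["results"]["bindings"]:
        # Variables that are unbound in a row are simply absent from it
        rows.append(tuple(binding[var]["value"] if var in binding else None
                          for var in variables))
    return rows

# Hand-written sample in the same layout, not an actual DBpedia response
sample = {
    "head": {"vars": ["title", "geolat", "geolong"]},
    "results": {"bindings": [
        {"title": {"type": "literal", "value": "Calaf"},
         "geolat": {"type": "literal", "value": "41.7348"},
         "geolong": {"type": "literal", "value": "1.5137"}},
    ]},
}

for title, lat, lon in bindings_to_rows(sample, sample["head"]["vars"]):
    # Values arrive as strings, so coordinates must be converted explicitly
    print("%s %.4f %.4f" % (title, float(lat), float(lon)))
```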


# 3.- Create a database of news

To create a database of news from a few Catalan digital news sites, I fetch the information from their RSS feeds. For each news item I store the <strong>title</strong>, the <strong>description</strong>, the <strong>date of publication</strong> and the <strong>name of the news site</strong>.

In [34]:
import feedparser
from bs4 import BeautifulSoup
from time import mktime
import time
from datetime import datetime

def get_db_news():
    # Connect to the local MongoDB server and return the "noticies" database
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client.noticies
    return db

def add_news(db, diari, data, titol, desc):
    # Store one news item: news site, publication date, title and description
    db.noticies.insert({"diari":diari, "data":data, "titol":titol, "descripcio":desc})

def get_news(diari, web, today=False):
    # Fetch the RSS feed at `web` and store the entries that are not yet in the database.
    # If `today` is True, entries that were not published today are skipped.
    db = get_db_news()
    feed = feedparser.parse(web)
    print len(feed["entries"])
    vals = []
    for entry in feed["entries"]:
        # Titles and descriptions may contain HTML, so keep only their text
        title = BeautifulSoup(entry["title"], "html.parser").get_text()
        date = datetime.fromtimestamp(mktime(entry["published_parsed"]))
        date_formatted = date.strftime("%d/%m/%Y")
        if today and date_formatted != time.strftime("%d/%m/%Y"):
            print date_formatted
            continue
        # Skip news items that are already stored (matched by title)
        if db.noticies.find({"titol":title}).count() > 0:
            continue
        desc_formatted = BeautifulSoup(entry["description"], "html.parser").get_text()
        add_news(db, diari, date, title, desc_formatted)
        vals.append([date_formatted, title, desc_formatted])
    return vals

import os

def get_news_job():
    print "*******Running process:", os.getpid()
    get_news('ara','http://www.ara.cat/rss/')
    get_news('regio7','http://www.regio7.cat/elementosInt/rss/1')
    get_news('vilaweb','http://www.vilaweb.cat/rss/')
    db = get_db_news()
    print db.noticies.find().count()
    print "*******Process ended!"



In [35]:
get_news_job()

*******Running process: 13376
110
13
39
1137
*******Process ended!
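
To make the data flow of `get_news` easier to follow without network access, the sketch below parses a small hand-written RSS 2.0 snippet using only the standard library and builds the same four fields that are stored in `db.noticies`. It is not the code used in this project, which relies on `feedparser` and `BeautifulSoup`: the function `parse_items`, the feed content and the site name `example` are hypothetical, and the regular expression is only a crude stand-in for `get_text()`.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta
from email.utils import parsedate_tz, mktime_tz

# Hand-written feed used only for this example
RSS_SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example headline</title>
      <description>&lt;p&gt;Example &lt;b&gt;summary&lt;/b&gt;.&lt;/p&gt;</description>
      <pubDate>Mon, 11 May 2015 10:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

def parse_items(rss_text, diari):
    """Return one dictionary per <item>, using the field names of db.noticies."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Crude removal of HTML tags (assumption: no '<' or '>' in the text itself)
        description = re.sub(r"<[^>]+>", "", description)
        # RFC 822 date -> seconds since the epoch (UTC) -> naive UTC datetime
        seconds = mktime_tz(parsedate_tz(item.findtext("pubDate")))
        published = datetime(1970, 1, 1) + timedelta(seconds=seconds)
        items.append({"diari": diari, "data": published,
                      "titol": title, "descripcio": description})
    return items

for news in parse_items(RSS_SAMPLE, "example"):
    print("%s | %s | %s" % (news["data"].strftime("%d/%m/%Y"),
                            news["titol"], news["descripcio"]))
```

Real feeds are far less regular than this snippet (missing dates, other date formats, Atom instead of RSS), which is a good reason to keep using `feedparser` for the actual job.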


# 4.- Identify cities in a text

First, I import the corpora and "nltk", which will help us classify the words of a text.

In [None]:
text = 'Tots els gats son de Sant Cugat del Valles.'
words = text.replace('.','').split(' ')

# Map each city name to the list of words it is made of
cities = {"Sant Cugat del Valles":["Sant","Cugat","del","Valles"]}

for city, city_words in cities.items():
    n = len(city_words)
    # Slide a window of n consecutive words over the text
    for i in range(len(words) - n + 1):
        if words[i:i + n] == city_words:
            print city    # Print the city if it is found

503853

[('Tots', 'NNS'),
 ('els', 'NNS'),
 ('gats', 'NNS'),
 ('son', 'VBP'),
 ('de', 'IN'),
 ('sant', 'JJ'),
 ('cugat', 'NN'),
 ('del', 'NN'),
 ('valles', 'NNS')]
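
A dictionary of word lists works for a single city, but `db.ciutats` holds tens of thousands of names, some of which share words or carry accents that a text may omit. One way to scale the idea, sketched below under several assumptions, is to normalise both the text and the names (lower case, accents stripped) and then look for the longest run of consecutive words that forms a known name. The function names `normalise`, `tokens` and `find_cities` and the short name list are hypothetical; in the project the names would come from `db.ciutats`.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

def normalise(text):
    """Lower-case the text and strip accents, e.g. u'Vallès' -> u'valles'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return u"".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def tokens(text):
    """Split normalised text into words, dropping punctuation."""
    return re.findall(r"\w+", normalise(text), re.UNICODE)

def find_cities(text, city_names):
    """Return known names in order of appearance, preferring the longest match."""
    # Index the names by their normalised, space-joined form
    known = dict((u" ".join(tokens(name)), name) for name in city_names)
    if not known:
        return []
    max_words = max(len(key.split()) for key in known)
    words = tokens(text)
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate first so multi-word names win over shorter ones
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = u" ".join(words[i:i + n])
            if candidate in known:
                found.append(known[candidate])
                i += n
                break
        else:
            i += 1
    return found

names = [u"Sant Cugat del Vallès", u"Calaf", u"Barcelona"]
print(find_cities(u"Tots els gats son de Sant Cugat del Valles.", names))
```

Names stored with a trailing article, such as "Abella, l'", would still need to be reordered before indexing, and common words that happen to be place names would produce false positives; neither case is handled in this sketch.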

