# Introduction

The largest city in Turkey that I am familiar with is İstanbul, so I decided to look at its map data and see whether I could find anything of significance.

In [1]:
import xml.etree.cElementTree as ET
import pandas as pd
import pprint
import bz2file
import operator
import numpy as np
import os

In [2]:
DATA_FILE = "istanbul_turkey.osm"

In [3]:
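# Decompress the archive so that iterparse can read plain XML; both files are closed afterwards.
with bz2file.BZ2File(DATA_FILE+".bz2") as src, open(DATA_FILE, 'wb') as dst:
    dst.write(src.read())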

I first want to check what hierarchy my data has. The function below extracts the tag types in the dataset, along with the attributes and nested tags of each tag type in the XML file.

In [4]:
# Collect, for each tag type, its count, its attribute names, and the tags nested beneath it.
def define_file(filename):
    tags = {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tags:
            tags[elem.tag] = {"count":0,"attribs":set(),"childs":set()}
        tags[elem.tag]["count"] += 1
        for attr in elem.attrib:
            tags[elem.tag]["attribs"].add(attr)
        # iter() walks all descendants (and elem itself), not only direct children.
        for elem_ in elem.iter():
            if elem_ != elem:
                tags[elem.tag]["childs"].add(elem_.tag)
    return tags

In [5]:
file_tag_stats = define_file(DATA_FILE)
df = pd.DataFrame(file_tag_stats)
df

Unnamed: 0,bounds,member,nd,node,osm,relation,tag,way
attribs,"{minlat, maxlon, minlon, maxlat}","{role, ref, type}",{ref},"{changeset, uid, timestamp, lon, version, user...","{timestamp, version, generator}","{changeset, uid, timestamp, version, user, id}","{k, v}","{changeset, uid, timestamp, version, user, id}"
childs,{},{},{},{tag},"{node, nd, bounds, member, tag, relation, way}","{member, tag}",{},"{tag, nd}"
count,1,8034,1498568,1164528,1,694,370703,192562


From the table above we can see that the main tag types are **tag**, **way**, **node**, and **nd**. I am going to ignore **nd** elements since they only hold references to nodes.

Historically, Istanbul has been a very diverse city. Until recent decades it was home to large numbers of followers of different religions, particularly Orthodox Christians. Therefore, I would like to explore the religious places in Istanbul and see how they are currently distributed. This data might provide insight into the proportions of Istanbulites belonging to different religious groups.

I do not know what kinds of places are tagged as religious, so I will first look at the nodes and ways that carry a religion key.

The functions below explore the XML to collect stats and sample entries for **element tag** - **tag key** relationships.

In [6]:
def isTypeOf(tag_,key):
    return (key == tag_.attrib['k'])

In [7]:
def add_type(result,value):
    if value in result:
        result[value] += 1
    else:
        result[value] = 1

In [8]:
def get_elem_tags(elem):
    result = {}
    for tag_ in elem.iter("tag"):
        result[tag_.attrib["k"]] = tag_.attrib["v"]
    return result

In [9]:
def get_stats_with_tag_key(filename,tag,key):
    result = {}
    for event, elem in ET.iterparse(filename, events=("start",)):
        if elem.tag == tag:
            for tag_ in elem.iter("tag"):
                if(isTypeOf(tag_,key)):
                    add_type(result,tag_.attrib['v'])
    return result
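`iterparse` only guarantees that an element's attributes are available on a `start` event; its children may or may not have been parsed yet, so a tally gathered that way can miss tags. Below is a minimal sketch of the same tally on `end` events, run on a small in-memory sample rather than the Istanbul extract. The function name `count_tag_values` and the sample XML are illustrative assumptions, and the code targets Python 3.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny stand-in for an OSM extract; not taken from the Istanbul data.
SAMPLE = b"""<osm>
  <node id="1"><tag k="religion" v="muslim"/><tag k="name" v="A Camii"/></node>
  <node id="2"><tag k="religion" v="christian"/></node>
  <way id="3"><nd ref="1"/><tag k="religion" v="muslim"/></way>
  <node id="4"><tag k="religion" v="muslim"/></node>
</osm>"""

def count_tag_values(source, element_tag, key):
    """Count the values of `key` across all <element_tag> elements."""
    counts = Counter()
    # "end" fires once the element and all of its children are complete.
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == element_tag:
            for tag in elem.iter("tag"):
                if tag.get("k") == key:
                    counts[tag.get("v")] += 1
            elem.clear()  # drop the finished element's children to save memory
    return counts

node_counts = count_tag_values(io.BytesIO(SAMPLE), "node", "religion")
way_counts = count_tag_values(io.BytesIO(SAMPLE), "way", "religion")
print(dict(node_counts), dict(way_counts))
```

Clearing each matched element after it has been counted also keeps memory use flat on a file of this size.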

I expect religious places to be stored as *node* elements. Therefore, let's first look at the nodes tagged with *religion*:

In [10]:
get_stats_with_tag_key(DATA_FILE,"node","religion")

{'christian': 15, 'jewish': 4, 'muslim': 148, 'pastafarian': 1}

There are Christian, Jewish, Muslim and Pastafarian places listed. Unsurprisingly, most of the nodes are Muslim.

Let's also look into ways so that we don't miss anything of value. 

In [11]:
get_stats_with_tag_key(DATA_FILE,"way","religion")

{'christian': 95, 'jewish': 9, 'muslim': 1923}

Surprisingly, there are not only entries under **way-tag**, but far more of them than under **node-tag**.

I think it is reasonable to merge the results from both element types.

# Problems Encountered

* Not all nodes have an amenity tag. Therefore, we will rely on the last word of each name, which indicates the type of place.
* The type word at the end of a name is spelled in varying ways. For example, mosques have **Camii, Camisi etc.** and synagogues have **Sinagogu, Sinagonu etc.** Therefore, I will check the last word of the place names for each religion and audit them into a few known types.
* There are Turkish characters in almost all names. I will replace these characters with their English counterparts to avoid display problems with character codes.

In [12]:
turkishCharMap = {
    "ç":"c",
    "Ç":"C",
    "ğ":"g",
    "Ğ":"G",
    "ı":"i",
    "İ":"I",
    "ö":"o",
    "Ö":"O",
    "ş":"s",
    "Ş":"S",
    "ü":"u",
    "Ü":"U"
}
def serializeTurkishText(text):
    # Work on UTF-8 bytes so that the Python 2 byte-string keys above match.
    text = text.encode("utf-8")
    for k,v in turkishCharMap.iteritems():
        text = text.replace(k,v)
    return text

In [13]:
def get_last_word(name):
    return serializeTurkishText(name.split(" ")[-1]).lower()
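The helpers above depend on Python 2 byte strings. As a hedged sketch, the same folding can be written for Python 3 with `str.maketrans` and `str.translate`. The name `last_word_key` is an illustrative assumption, and unlike the original it splits on any whitespace.

```python
# Same twelve character pairs as turkishCharMap, as a Python 3 translation table.
TURKISH_TO_ASCII = str.maketrans("çÇğĞıİöÖşŞüÜ", "cCgGiIoOsSuU")

def last_word_key(name):
    """Return the ASCII-folded, lowercased last word of a place name."""
    words = name.split()
    if not words:
        return ""  # unnamed places end up under the empty key
    # Fold before lowercasing so that "İ" becomes a plain "i".
    return words[-1].translate(TURKISH_TO_ASCII).lower()

print(last_word_key("Sultanahmet Camii"))
print(last_word_key("Şişli Mezarlığı"))
```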

In [14]:
def get_religious_place_types(filename,religion):
    types = {}
    place_list = {}
    for event, elem in ET.iterparse(filename, events=("start",)):
        elem_tags = get_elem_tags(elem)
        if("religion" in elem_tags and elem_tags["religion"]==religion):
            name = ""
            if "name" in elem_tags:
                name = elem_tags["name"]
            type_ = get_last_word(name)
            if type_ not in types:
                types[type_] = 1
                place_list[type_] = [elem_tags]
            else:
                types[type_] += 1
                place_list[type_].append(elem_tags)
    return (types, place_list)

In [15]:
def tabulate_dict(dict_):
    sorted_list = sorted(list(dict_.items()), key=operator.itemgetter(1), reverse=True)
    return pd.DataFrame([i[1] for i in sorted_list],index=[i[0] for i in sorted_list]).transpose()

In [16]:
muslim_types, muslim_place_list = get_religious_place_types(DATA_FILE,"muslim")
tabulate_dict(muslim_types)

Unnamed: 0,Unnamed: 1,cami,camii,mescidi,mezarligi,mezarlik,camisi,cemevi,ada),serifi,...,aksemsettin,muftulugu,camii),(insaat),camii3,kulliyesi,kursu,cmi,asamasinda),hatun
0,3124,740,267,19,18,9,8,8,6,3,...,1,1,1,1,1,1,1,1,1,1


Many places (3124) have an empty type. These places seem to have no names at all, so let's look at a sample of these entries:

In [17]:
np.random.choice(muslim_place_list[""],10)

array([{'religion': 'muslim'}, {'religion': 'muslim'},
       {'religion': 'muslim'},
       {'religion': 'muslim', 'amenity': 'place_of_worship'},
       {'building': 'yes', 'religion': 'muslim', 'amenity': 'place_of_worship'},
       {'building': 'yes', 'religion': 'muslim', 'amenity': 'place_of_worship'},
       {'religion': 'muslim'},
       {'building': 'yes', 'religion': 'muslim', 'amenity': 'place_of_worship'},
       {'religion': 'muslim'}, {'religion': 'muslim'}], dtype=object)


## New Problems
* Some entries seem to have names only in English. I will add these entries under their English names.
* Apart from this, most of these places seem to hold no information other than the religion tag itself. Since we cannot make any guesses about them, I will simply ignore these entries.
* Lastly, some entries have other attributes that could help to extract their types.

In [19]:
attributes_to_check = ["amenity","name:en","building","source"]
def getProminentAttributes(list_):
    result = {}
    for elem in list_:
        for k,v in elem.iteritems():
            if k in attributes_to_check:
                if k not in result:
                    result[k] = set(v)
                else:
                    result[k].add(v)
    return result

In [20]:
tabulate_dict(getProminentAttributes(muslim_place_list[""]))

Unnamed: 0,building,source,amenity,name:en
0,y,a,a,a
1,mosque,bing,place_of_worship,
2,s,h,c,C
3,e,local_knowledge; Bing,e,e
4,yes,Yahoo,f,Murat Reia Camii
5,,o,i,i
6,,Y,h,m
7,,Bing,l,M
8,,,o,s
9,,,p,r


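Apart from the "name:en" attribute, none of the values above provides enough information to extract the type of a place, so we will ignore the rest. The single-character values in the table are an artifact: `set(v)` builds the set from the characters of the first string seen for each key.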
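A minimal sketch of a variant that keeps every value whole, so that a table built from it would list only full attribute values. The names `prominent_attributes` and `sample` are illustrative, the sample records are made up, and the code targets Python 3.

```python
ATTRIBUTES_TO_CHECK = ("amenity", "name:en", "building", "source")

def prominent_attributes(places):
    """Map each attribute of interest to the set of whole values seen for it."""
    result = {}
    for tags in places:
        for key, value in tags.items():
            if key in ATTRIBUTES_TO_CHECK:
                # setdefault creates an empty set first, so `value` is added
                # as one item instead of being split into characters.
                result.setdefault(key, set()).add(value)
    return result

sample = [
    {"religion": "muslim", "building": "yes", "amenity": "place_of_worship"},
    {"religion": "muslim", "building": "mosque"},
]
print(prominent_attributes(sample))
print(set("yes"))  # for comparison: a set built directly from a string
```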

Now let's also look into other types that don't make much sense:

In [21]:
muslim_place_list["(insaat)"]

[{'amenity': 'place_of_worship',
  'building': 'yes',
  'name': u'Yeni Zeynebiye Camii (\u0130n\u015faat)',
  'religion': 'muslim'}]

This is a mosque with a note stating that it is under construction. Therefore, let's remove the descriptions in parentheses and try again to see if things improve.

In [22]:
import re
def remove_paranthesis(text):
    return re.sub(r'\(.*\)', '', text).strip()

In [23]:
#Override the function to handle parentheses:
def get_last_word(name):
    return serializeTurkishText(remove_paranthesis(name).split(" ")[-1]).lower()
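One caveat: the pattern `\(.*\)` is greedy, so on a name with two parenthesised notes it removes everything from the first `(` to the last `)`. A hedged sketch of a variant that strips each note on its own follows. The name `strip_parenthetical` is an assumption, and the second sample name is made up for illustration.

```python
import re

def strip_parenthetical(text):
    """Remove each '(...)' group separately and tidy the leftover spaces."""
    without_notes = re.sub(r"\([^)]*\)", "", text)
    return " ".join(without_notes.split())

print(strip_parenthetical("Yeni Zeynebiye Camii (İnşaat)"))
print(strip_parenthetical("Eski (kapalı) Cami (restorasyon)"))
```

On names with a single trailing note, such as the mosque above, both patterns give the same result.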

In [24]:
muslim_types, muslim_place_list = get_religious_place_types(DATA_FILE,"muslim")
tabulate_dict(muslim_types)

Unnamed: 0,Unnamed: 1,cami,camii,mezarligi,mescidi,mezarlik,camisi,cemevi,serifi,mescit,...,germe,tekkesi,aksemsettin,muftulugu,hamami,camii3,kulliyesi,kursu,cmi,hatun
0,3124,741,269,24,19,9,8,8,3,3,...,1,1,1,1,1,1,1,1,1,1


The types listed above provide enough insight into the kinds of places, so we will now define mappings for types and typo fixes.

I decided on the types below:

* Mosque
* Islamic School
* Graveyard
* Other

In [25]:
mapping = {
    'Graveyard': ['mezarligi', 'mezarlik'],
    'Islamic School': ['kursu', 'tekkesi', 'medresesi', 'kulliyesi'],
    'Mosque': ['camil', 'camii','camisi','namazgah','serifi','mescit','cemevi','camii3','mescid','mescidi','cmi','cami'],
    'Other': ['pasa','germe','aksemsettin','muftulugu','hamami','turbesi','hatun']
}
typo_fixes = {
    "muslim":{
        'camii': "cami",
        'camii3': "cami",
        'camil': "cami",
        'camisi': "cami",
        'cmi': "cami"
    }
}
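The mapping is keyed by type, while the lookup that uses it later goes from a last word to a type. One possible refinement, sketched here under assumptions, is to invert it once into a last-word index, which also exposes a word listed under two different types. `build_suffix_index` is a hypothetical helper, the mapping shown is a trimmed copy for illustration, and the code targets Python 3.

```python
# Trimmed copy of the mapping above, for illustration only.
mapping = {
    "Graveyard": ["mezarligi", "mezarlik"],
    "Islamic School": ["kursu", "tekkesi", "medresesi", "kulliyesi"],
    "Mosque": ["camii", "camisi", "mescit", "cami"],
}

def build_suffix_index(type_to_suffixes):
    """Invert {type: [suffix, ...]} into {suffix: type}, rejecting conflicts."""
    index = {}
    for place_type, suffixes in type_to_suffixes.items():
        for suffix in suffixes:
            if index.get(suffix, place_type) != place_type:
                raise ValueError("suffix %r maps to two types" % suffix)
            index[suffix] = place_type
    return index

suffix_index = build_suffix_index(mapping)
print(suffix_index.get("camii", "Unknown"))
print(suffix_index.get("kilisesi", "Unknown"))  # not in the trimmed copy
```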

In [26]:
christian_types, christian_place_list = get_religious_place_types(DATA_FILE,"christian")
tabulate_dict(christian_types)

Unnamed: 0,Unnamed: 1,kilisesi,mezarligi,manastiri,church,ayazmasi,kilise,nikola,phokas,katedrali,vakfi,patrikhanesi,mongols,metropolitligi,kabristani,Стефан“,kilesi,kiliesi
0,127,62,8,6,4,3,2,1,1,1,1,1,1,1,1,1,1,1


Place types seem to be alright. But let's dive into the places with no names:

In [27]:
tabulate_dict(getProminentAttributes(christian_place_list[""])).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
building,y,s,e,yes,,,,,,,,,,
amenity,a,place_of_worship,c,e,f,i,h,l,o,p,s,r,w,_


It seems that the places with no names are again entries with almost no information. Therefore, I will ignore these as well.

Let's also add the type mapping and the typo fixes. Our types will be as follows:

* Church
* Graveyard
* Monastery
* Other

In [28]:
mapping['Church'] = ['kilesi', 'katedrali', 'kilise', 'kilisesi', 'church', 'kiliesi']
mapping['Monastery'] = ['manastiri']
mapping['Graveyard'] += ['kabristani']  # 'mezarligi' is already in the Graveyard list
mapping['Other'] += ['nikola','phokas','ayazmasi','metropolitligi','patrikhanesi','mongols','Стефан“','vakfi']

typo_fixes["christian"] = {
    'kilesi': "Kilisesi",
    'kiliesi': "Kilisesi"
}

In [29]:
jewish_types, jewish_place_list = get_religious_place_types(DATA_FILE,"jewish")
tabulate_dict(jewish_types)

Unnamed: 0,Unnamed: 1,sinagogu,mezarligi,sinagonu,sinagog,neve-shalom
0,15,6,3,1,1,1


In [30]:
tabulate_dict(getProminentAttributes(jewish_place_list[""])).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
building,y,s,e,,,,,,,,,,
amenity,a,c,e,f,i,h,l,o,p,s,r,w,_


There seems to be nothing to extract from the places with no name, so we will ignore them.

Let's add our type and typo-fix mappings. The list of types is provided below:

* Synagogue
* Graveyard


In [31]:
mapping['Synagogue'] = ['neve-shalom', 'sinagogu', 'sinagonu', 'sinagog']
typo_fixes["jewish"] = {
    'sinagonu': "Sinagogu"
}

In [32]:
pastafarian_types, pastafarian_place_list = get_religious_place_types(DATA_FILE,"pastafarian")
pastafarian_types

{'': 1, 'tapinagi': 1}

In [33]:
pastafarian_place_list[""]

[{'religion': 'pastafarian'}]

This is an empty place, so we will ignore it. I will tag the only **Tapınak** as a temple.

In [34]:
mapping["Temple"] = ["tapinagi"]
typo_fixes["pastafarian"] = {}

# Data Extraction

Since none of the attributes of the religious places is multi-valued, a single SQL table is enough.

In [35]:
import sqlite3
sqlite_file = 'istanbul_osm.sqlite'
table = "religious_places"
columns = ["id","name","lat","lon","religion","type"]

conn = sqlite3.connect(sqlite_file)
c = conn.cursor()

**Drop the table if it already exists, then create it:**

In [36]:
# IF EXISTS keeps the first run from failing when the table is not there yet.
c.execute('DROP TABLE IF EXISTS {tn}'\
        .format(tn=table))
c.execute('CREATE TABLE {tn} ({columns})'\
        .format(tn=table, columns=",".join(columns)))

<sqlite3.Cursor at 0x5d314ab0>

In [37]:
religions = ['christian', 'jewish', 'muslim', 'pastafarian']

In [38]:
def get_name(elem_tags):
    if "name" in elem_tags:
        return elem_tags["name"]
    elif "name:en" in elem_tags:
        return elem_tags["name:en"]
    else:
        return ""

In [39]:
def get_fixed_name(elem_tags):
    name = get_name(elem_tags)
    name_words = name.split(" ")[0:-1]
    type_identifier = get_last_word(name)
    if type_identifier in typo_fixes[elem_tags["religion"]]:
        name_words.append(typo_fixes[elem_tags["religion"]][type_identifier])
        return " ".join(name_words)
    return name

In [40]:
def get_type(elem_tags,name):
    type_identifier = get_last_word(name)
    # Return the first type whose identifier list contains the last word.
    for k,v in mapping.iteritems():
        if type_identifier in v:
            return k
    # No list matched: report the entry and fall back to "Unknown".
    pprint.pprint(["unknown type",elem_tags,type_identifier])
    return "Unknown"

In [41]:
def get_religious_places(filename):
    places = []
    for event, elem in ET.iterparse(filename, events=("start",)):
        elem_tags = get_elem_tags(elem)
        if "religion" in elem_tags and elem_tags["religion"] in religions:
            name = get_fixed_name(elem_tags)
            if(len(name)>0):
                type_ = get_type(elem_tags,name)
                elem_tags["name"] = name
                elem_tags["type"] = type_
                for column in columns:
                    if column in elem.attrib:
                        elem_tags[column] = elem.attrib[column]
                places.append(elem_tags)
    return places

**The code below extracts all places for all religions:**

In [42]:
religious_places = get_religious_places(DATA_FILE)

In [43]:
pd.DataFrame(religious_places,columns=columns).head()

Unnamed: 0,id,name,lat,lon,religion,type
0,269497288,Barbaros Hayrettin Türbesi,41.0419227,29.0068359,muslim,Other
1,269706604,Ertuğrul Tekke cami,41.0456489,29.0085216,muslim,Mosque
2,269707397,Beşiktaş Panayia Rum Ortodoks Kilisesi Vakfı,41.0436679,29.0050595,christian,Other
3,278092559,Murat Reis cami,41.0184464,29.0275916,muslim,Mosque
4,278102132,Murat Reia cami,41.0231166,29.0237092,muslim,Mosque


**Let's check whether any of the places has missing fields:**

In [44]:
def check_data_integrity(places):
    missing = []
    for place in places:
        for column in columns:
            if column not in place:
                missing.append(place)
                break
    return missing

In [45]:
missing = check_data_integrity(religious_places)

In [46]:
print str(len(missing))+" out of "+str(len(religious_places))+" places will have NULL values in database."

1122 out of 1217 places will have NULL values in database.
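The tag table near the top shows that `way` elements carry no `lat` or `lon` attributes, which would explain most of these gaps. As a hedged sketch, the tally below counts which columns are absent, run on made-up records. `missing_columns` and `sample_places` are illustrative names, and the code targets Python 3.

```python
from collections import Counter

COLUMNS = ["id", "name", "lat", "lon", "religion", "type"]

def missing_columns(places, columns=COLUMNS):
    """Count, per column, how many records lack that column."""
    tally = Counter()
    for place in places:
        tally.update(col for col in columns if col not in place)
    return tally

sample_places = [
    # node-like record: has coordinates
    {"id": "1", "name": "A Cami", "lat": "41.0", "lon": "29.0",
     "religion": "muslim", "type": "Mosque"},
    # way-like records: no coordinates of their own
    {"id": "2", "name": "B Cami", "religion": "muslim", "type": "Mosque"},
    {"id": "3", "name": "C Kilisesi", "religion": "christian", "type": "Church"},
]
gaps = missing_columns(sample_places)
print(dict(gaps))
```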


**The function below processes all results and inserts them into the SQL table.**

In [47]:
def insert_values(cursor,table,columns,values):
    sql_values = ""
    for value in values:
        values_str = ""
        for column in columns:
            if len(values_str)>0:
                values_str += ","
            if column in value:
                # escape single quotes by doubling them, as SQL string literals require
                values_str += "'"+value[column].replace("'","''")+"'"
            else:
                values_str += "NULL"
        if len(sql_values)>0:
            sql_values += ","
        sql_values += "("+values_str+")"
    # use the cursor passed in rather than the global one
    cursor.execute('INSERT INTO {tn} ({columns}) values {values}'
              .format(tn=table, columns=",".join(columns), values=sql_values.encode("utf-8")))

In [48]:
insert_values(c,table,columns,religious_places)
conn.commit()
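An alternative to building the statement by hand is to let `sqlite3` bind the values through `?` placeholders and `executemany`, which removes the manual quote escaping. The following is a minimal sketch against an in-memory database with made-up rows, not the notebook's actual insert path, and it targets Python 3.

```python
import sqlite3

columns = ["id", "name", "lat", "lon", "religion", "type"]
places = [
    {"id": "1", "name": "Murat Reis cami", "lat": "41.01", "lon": "29.02",
     "religion": "muslim", "type": "Mosque"},
    # A way-like record without coordinates; the gaps become NULL.
    {"id": "2", "name": "Sample O'Neil Kilisesi", "religion": "christian",
     "type": "Church"},
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE religious_places ({})".format(",".join(columns)))

placeholders = ",".join("?" for _ in columns)
# dict.get returns None for a missing column, which sqlite3 stores as NULL.
rows = [tuple(place.get(col) for col in columns) for place in places]
# The driver binds each value, so quotes need no manual escaping.
cur.executemany(
    "INSERT INTO religious_places ({}) VALUES ({})".format(
        ",".join(columns), placeholders),
    rows,
)
conn.commit()

cur.execute("SELECT name, lat FROM religious_places ORDER BY id")
stored = cur.fetchall()
print(stored)
```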

# Data Overview and Additional Ideas

## Data File Size

In [49]:
print DATA_FILE+"........"+str(os.stat(DATA_FILE).st_size/(1024*1024))+"MB"
print sqlite_file+"........"+str(os.stat(sqlite_file).st_size/(1024))+"kB"

istanbul_turkey.osm........242MB
istanbul_osm.sqlite........130kB


## Sample SQL Queries

### Top Religious Place Types

In [50]:
c.execute("select type, count(type) as count from religious_places group by type order by count desc")
pd.DataFrame(c.fetchall(),columns=["Type","Count"]).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7
Type,Mosque,Church,Graveyard,Other,Synagogue,Monastery,Islamic School,Temple
Count,1060,71,45,20,9,6,5,1


### Top Religious Place Names

In [51]:
c.execute("select name, count(type) as count from religious_places group by name order by count desc")
pd.DataFrame(c.fetchall(),columns=["Name","Count"]).head(10).transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Name,Cami,Mezarlık,Mevlana Cami,Fatih cami,Yunus Emre Cami,Akşemsettin Cami,Berat Cami,Huzur Cami,Hz. Ali Cami,Hz. Ebubekir cami
Count,219,8,6,4,4,3,3,3,3,3


### Religion Rank by # of Places

In [52]:
c.execute("select religion, count(type) as count from religious_places group by religion order by count desc")
pd.DataFrame(c.fetchall(),columns=["Religion","Count"]).transpose()

Unnamed: 0,0,1,2,3
Religion,muslim,christian,jewish,pastafarian
Count,1108,96,12,1


# Additional Ideas

### Places with few Attributes

There are lots of places with a single attribute that provides no insight into the place itself. Most of these places were ignored in our data exploration stages. These entries are either incomplete or outright garbage.

I think some initiatives could be taken to clean or complete these entries. Gamification and AutoBots could be two possible solutions.

||Pros|Cons|
|-|-|-|
|Gamification|Users will be motivated to contribute more|The abundance of similar types of missing entries could annoy users|
|AutoBots|Very fast resolution|Contributions may add little value to the completeness of the data|

A hybrid solution could potentially yield better overall results.


# Conclusion

The data set seems to be filled with incomplete and garbage entries. This limits our ability to extract all the information, as well as our ability to distinguish incomplete data from abundant data.

A quick check on Wikipedia gives us the [List of Churches](https://tr.wikipedia.org/wiki/%C4%B0stanbul%27daki_kiliseler_listesi) and the [List of Mosques](https://tr.wikipedia.org/wiki/%C4%B0stanbul%27daki_camiler_listesi) in Istanbul. This shows that there are around 110-120 churches in Istanbul, whereas our data set has 71 with complete data. In addition, the second list provides the names of around 3k mosques in Istanbul, of which we were able to extract data for only 1060. This shows that our dataset is incomplete to a great extent.
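To put these figures side by side, the short calculation below turns them into coverage ratios. It assumes the Wikipedia counts are taken at face value (110 to 120 churches, roughly 3000 mosques). `coverage` is an illustrative helper.

```python
def coverage(found, expected):
    """Share of the expected places that made it into the extracted table."""
    return found / float(expected)

# Figures quoted above: 71 churches against roughly 110-120 listed,
# and 1060 mosques against roughly 3000 listed.
church_low = coverage(71, 120)
church_high = coverage(71, 110)
mosque = coverage(1060, 3000)

print("churches: {:.0%} to {:.0%}".format(church_low, church_high))
print("mosques:  {:.0%}".format(mosque))
```

Under those assumptions the table covers roughly 59% to 65% of the listed churches and about 35% of the listed mosques.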