# OpenStreetMap Project: DC Metro Area

### Map Area

###### Washington DC Metro Area, United States

![alt tags](https://c1.staticflickr.com/5/4381/36224012332_9822f411aa.jpg)


I have spent over three years in this area and met amazing people here, so I am especially interested in what querying the database reveals about it.



## Problems Encountered in the Map

After downloading the OSM XML file, extracting a sample from it, and processing that sample, I noticed two main problems with the data, which I discuss in the following order:

- Inconsistent state names *("Virginia", "dc")*
- Overabbreviated street names *('H St NE', 'Minnesota Avenue NE')*



### Inconsistent State Names

After pulling out the sample, I looked at all the values that occur for each key (`k`) and spotted that `addr:state` can hold values such as "Virginia" or "dc" instead of the two-letter codes VA, DC, and MD. I normalize these values with the following function:

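```python
mapping_states = {'Virginia': 'VA', 'va': 'VA', 'dc': 'DC'}
correct_ones = ['VA', 'DC', 'MD']

def update_states_name(name):
    """Return the two-letter state code for a raw addr:state value."""
    if name not in correct_ones:
        # Values missing from the mapping are returned unchanged
        # instead of raising a KeyError.
        name = mapping_states.get(name, name)
    return name
```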
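The audit that surfaces these values is not shown above. A minimal sketch of how it could look, assuming the sample is standard OSM XML with `<tag k="..." v="..."/>` children; the `SAMPLE_OSM` data and the `audit_key` helper are hypothetical stand-ins, not the project's actual code:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

# Tiny hypothetical stand-in for the OSM sample file.
SAMPLE_OSM = b"""<osm>
  <node id="1"><tag k="addr:state" v="Virginia"/></node>
  <node id="2"><tag k="addr:state" v="dc"/></node>
  <node id="3"><tag k="addr:state" v="DC"/></node>
  <way id="4"><tag k="addr:state" v="MD"/></way>
</osm>"""

def audit_key(osm_file, key):
    """Count every distinct value recorded for one tag key."""
    counts = Counter()
    # iterparse streams the file, so large extracts need not fit in memory.
    for _, elem in ET.iterparse(osm_file, events=("end",)):
        if elem.tag == "tag" and elem.get("k") == key:
            counts[elem.get("v")] += 1
    return counts

state_counts = audit_key(BytesIO(SAMPLE_OSM), "addr:state")
print(state_counts)
```

Any value that is not one of the expected codes then becomes a candidate for the mapping dictionary.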

### Overabbreviated Street Names

The situation here is the same as in the tutorial videos: some street names end in abbreviations such as 'Ln' or 'NE'. I updated the street names that are not in the expected form using a mapping dictionary (built from the sample):

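```python
import re

mapping_street = {
    'Hwy': 'Highway',
    'Ln': 'Lane',
    'NE': 'Northeast',
    'NW': 'Northwest',
    'SE': 'Southeast',
    'SW': 'Southwest',
    'St': 'Street',
    'Ave': 'Avenue',
}

# street_type_re (a compiled regex matching the street-name ending) and
# expected_street_end (endings that need no change) are defined elsewhere
# in the project code and are not shown here.
def update_street_name(name):
    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        # Only rewrite endings that are unexpected and have a known expansion.
        if street_type not in expected_street_end and street_type in mapping_street:
            name = re.sub(street_type_re, mapping_street[street_type], name)
    return name
```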

This expands the abbreviated ending in problematic address strings, so that “Jefferson Davis Hwy” becomes “Jefferson Davis Highway”.
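Names such as 'H St NE' contain more than one abbreviation. As an alternative to matching only the ending, here is a minimal sketch that expands every token found in the mapping. The `expand_all_abbreviations` helper is hypothetical and is not the function used in this project; note that it would also turn a "St" meaning "Saint" into "Street", so it needs review before use on real data.

```python
# Same abbreviations as the mapping above, repeated so the sketch is self-contained.
MAPPING = {
    'Hwy': 'Highway',
    'Ln': 'Lane',
    'NE': 'Northeast',
    'NW': 'Northwest',
    'SE': 'Southeast',
    'SW': 'Southwest',
    'St': 'Street',
    'Ave': 'Avenue',
}

def expand_all_abbreviations(name):
    """Expand every whitespace-separated token that is a known abbreviation."""
    tokens = name.split()
    # A trailing period ("St.") is ignored when looking the token up;
    # unknown tokens are kept exactly as written.
    return " ".join(MAPPING.get(tok.rstrip("."), tok) for tok in tokens)

print(expand_all_abbreviations("H St NE"))
print(expand_all_abbreviations("Minnesota Avenue NE"))
```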

### Importing CSV Files into a SQLite Database

For example, the `nodes` table:

```python
import csv

# Python 2: csv yields byte strings, so every field is decoded to unicode.
# `created_db` is the open sqlite3 connection and `cursor` is its cursor.
fields = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']

with open('nodes.csv', 'rb') as fin:
    dr = csv.DictReader(fin)  # comma is the default delimiter
    to_db = [tuple(i[f].decode('utf-8') for f in fields) for i in dr]

cursor.executemany("INSERT INTO nodes(id, lat, lon, user, uid, version, changeset, timestamp) VALUES (?, ?, ?, ?, ?, ?, ?, ?);", to_db)
created_db.commit()
```

The other tables are imported the same way.
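For reference, a self-contained Python 3 sketch of the same import pattern, run against an in-memory database. The two CSV rows and the column types in the `CREATE TABLE` statement are assumptions made for illustration; the project's real schema may differ.

```python
import csv
import io
import sqlite3

# Hypothetical two-row stand-in for nodes.csv.
NODES_CSV = """id,lat,lon,user,uid,version,changeset,timestamp
1,38.9,-77.0,alice,10,1,100,2017-01-01T00:00:00Z
2,38.8,-77.1,bob,11,2,101,2017-01-02T00:00:00Z
"""

COLUMNS = ["id", "lat", "lon", "user", "uid", "version", "changeset", "timestamp"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, lat REAL, lon REAL, "
    "user TEXT, uid INTEGER, version TEXT, changeset INTEGER, timestamp TEXT)"
)

# In Python 3 the csv module already yields str, so no decode step is needed.
reader = csv.DictReader(io.StringIO(NODES_CSV))
rows = [tuple(row[col] for col in COLUMNS) for row in reader]

placeholders = ", ".join("?" for _ in COLUMNS)
conn.executemany(
    "INSERT INTO nodes ({}) VALUES ({})".format(", ".join(COLUMNS), placeholders),
    rows,
)
conn.commit()

node_count = conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
print(node_count)
```

With a real file, `io.StringIO(NODES_CSV)` would be replaced by `open('nodes.csv', newline='', encoding='utf-8')`.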

### Dataset Overview

#### Number of nodes

```sql
select count(*) from nodes;
```

2369337

#### Number of ways

```sql
select count(*) from ways;
```

297267

#### Number of unique users

```sql
SELECT COUNT(DISTINCT(e.uid))          
FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
```

2142

#### Top 10 contributors

```sql
SELECT e.user, COUNT(*) as num
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
GROUP BY e.user
ORDER BY num DESC
LIMIT 10;
```

```sql
aude|741680
DavidYJackson_import|449254
wonderchook|171058
kriscarle|125532
woodpeck_fixbot|123280
emacsen|83850
RoadGeek_MD99|72254
Sawan Shariar|59815
ingalls|52627
sejohnson|52540
```

#### Top 10 amenities

```sql
restaurant|1364
place_of_worship|811
school|772
fast_food|543
bench|527
cafe|469
bank|338
bicycle_rental|271
drinking_water|267
bicycle_parking|252
```
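The query behind the amenity counts is not shown. A minimal sketch of a query of this shape, run on made-up rows; it assumes a `nodes_tags(id, key, value, type)` table and counts node tags only, which may not match how the figures above were produced:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
INSERT INTO nodes_tags VALUES
  (1, 'amenity', 'restaurant', 'regular'),
  (2, 'amenity', 'restaurant', 'regular'),
  (3, 'amenity', 'cafe', 'regular'),
  (3, 'name', 'Some Cafe', 'regular');
""")

# Count how often each amenity value occurs and keep the ten most frequent.
top_amenities = conn.execute("""
    SELECT value, COUNT(*) AS num
    FROM nodes_tags
    WHERE key = 'amenity'
    GROUP BY value
    ORDER BY num DESC
    LIMIT 10
""").fetchall()
print(top_amenities)
```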

#### Number of restaurants

```sql
SELECT count(*)
FROM (SELECT * FROM nodes_tags  UNION ALL 
      SELECT * FROM ways_tags) tags
WHERE tags.key = 'cuisine';
```

1441

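There seems to be no clear line between restaurant, cafe, and fast_food, and the categories may overlap in how places are recorded: the number of `cuisine` tags (1441) is smaller than the combined count of the three amenities (1364 + 543 + 469 = 2376).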
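One way to look at this more closely is to count, per amenity, how many places also carry a `cuisine` tag. A minimal sketch on made-up rows, assuming the same `nodes_tags(id, key, value, type)` layout; it is an illustration, not a query that was run on the real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
INSERT INTO nodes_tags VALUES
  (1, 'amenity', 'restaurant', 'regular'), (1, 'cuisine', 'thai', 'regular'),
  (2, 'amenity', 'restaurant', 'regular'),
  (3, 'amenity', 'cafe', 'regular'),       (3, 'cuisine', 'coffee_shop', 'regular'),
  (4, 'amenity', 'fast_food', 'regular'),  (4, 'cuisine', 'burger', 'regular');
""")

# LEFT JOIN keeps places without a cuisine tag; COUNT(c.id) skips those NULLs.
rows = conn.execute("""
    SELECT a.value AS amenity,
           COUNT(*) AS places,
           COUNT(c.id) AS with_cuisine
    FROM nodes_tags a
    LEFT JOIN nodes_tags c ON c.id = a.id AND c.key = 'cuisine'
    WHERE a.key = 'amenity'
      AND a.value IN ('restaurant', 'cafe', 'fast_food')
    GROUP BY a.value
    ORDER BY a.value
""").fetchall()
print(rows)
```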

#### Top 10 restaurant cuisines in the DC Metro Area

```sql
SELECT tags.value,
       COUNT(id) AS count,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*)
                                 FROM (SELECT * FROM nodes_tags UNION ALL
                                       SELECT * FROM ways_tags) t
                                 WHERE t.key = 'cuisine'), 2) AS percentage
FROM (SELECT * FROM nodes_tags UNION ALL
      SELECT * FROM ways_tags) tags
WHERE tags.key = 'cuisine'
GROUP BY tags.value
ORDER BY count DESC
LIMIT 10;
```

```sql
pizza|122|8.47
american|118|8.19
sandwich|111|7.7
burger|110|7.63
coffee_shop|96|6.66
mexican|95|6.59
italian|73|5.07
chinese|70|4.86
thai|57|3.96
indian|35|2.43
```

Pizza appears to be the most common cuisine, followed by American and sandwich.

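#### Opening hours of restaurants by cuisine

The opening-hours values are formatted very inconsistently. Note also that the query below groups by cuisine only, so the `hour` shown for each cuisine is a single example value picked by SQLite, and `num` counts the places of that cuisine that have any opening hours recorded.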


```sql
SELECT tags.value, hours.hour, COUNT(DISTINCT tags.id) AS num
FROM (SELECT * FROM nodes_tags UNION ALL
      SELECT * FROM ways_tags) tags
JOIN (SELECT * FROM node_opening_hours UNION ALL
      SELECT * FROM way_opening_hours) hours
  ON tags.id = hours.id
WHERE tags.key = 'cuisine'
GROUP BY tags.value
ORDER BY num DESC
LIMIT 10;
```

```sql
sandwich|Sa 8:00 - 2:00, Su 8:00 - 21:00, Mo 8:30 - 21:00, Tu-Fr: 8:30 - 20:00|25
american|Mo-Su 10:30-24:00|21
mexican|11:00-22:00|17
burger| Fr-Sa 10:00-03:30|13
pizza| Fr,Sa 00:00-01:00,11:00-24:00|13
chinese| su 12:00 - 20:00|10
italian|10:30 AM - 10:00 PM|9
thai| Su 12:00-15:00,16:30-21:00|9
coffee_shop| Su 07:00-22:00|6
mediterranean|su - th 11:00 - 21:00, fr-sa 11:00 - 22:00|5
```
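To find the opening-hours string that really is the most frequent for each cuisine, the values can be tallied per cuisine. A minimal sketch on made-up rows, assuming `node_opening_hours(id, hour)` and `nodes_tags(id, key, value, type)` tables; both layouts are inferred from the queries in this document rather than from a published schema:

```python
import sqlite3
from collections import Counter, defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT, type TEXT);
CREATE TABLE node_opening_hours (id INTEGER, hour TEXT);
INSERT INTO nodes_tags VALUES
  (1, 'cuisine', 'pizza', 'regular'),
  (2, 'cuisine', 'pizza', 'regular'),
  (3, 'cuisine', 'pizza', 'regular'),
  (4, 'cuisine', 'thai', 'regular');
INSERT INTO node_opening_hours VALUES
  (1, 'Mo-Su 11:00-22:00'),
  (2, 'Mo-Su 11:00-22:00'),
  (3, 'Fr-Sa 10:00-03:30'),
  (4, 'Su 12:00-15:00');
""")

# Tally each opening-hours string separately for every cuisine.
hours_by_cuisine = defaultdict(Counter)
for cuisine, hour in conn.execute("""
    SELECT t.value, h.hour
    FROM nodes_tags t JOIN node_opening_hours h ON t.id = h.id
    WHERE t.key = 'cuisine'
"""):
    hours_by_cuisine[cuisine][hour.strip()] += 1

# Keep only the most frequent (string, count) pair per cuisine.
most_common = {c: counts.most_common(1)[0] for c, counts in hours_by_cuisine.items()}
print(most_common)
```

Because the real values mix formats such as "Mo-Su 10:30-24:00" and "10:30 AM - 10:00 PM", identical schedules written differently would still be counted separately.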

##### Suppose I would like to know which Chinese restaurants are open on Sunday, and where they are

###### Names of the restaurants

```sql
SELECT tag.value
FROM (SELECT * FROM nodes_tags UNION ALL
      SELECT * FROM ways_tags) tag
WHERE tag.id IN (SELECT tags.id
                 FROM (SELECT * FROM nodes_tags UNION ALL
                       SELECT * FROM ways_tags) tags
                 JOIN (SELECT * FROM node_opening_hours UNION ALL
                       SELECT * FROM way_opening_hours) hours
                   ON tags.id = hours.id
                 WHERE tags.key = 'cuisine'
                   AND tags.value = 'chinese'
                   AND hours.hour LIKE '%su%'
                 LIMIT 10)
  AND tag.key LIKE '%name%';
```

```sql
Mayflower
Great Wall Szechuan House
Nagomi Izakaya
Eastern Carryout
Sammy Carry-Out
George's Carry Out
China House
Ho's Chinese Carry Out
```

###### Addresses of these places

```sql
SELECT tag.id, tag.value
FROM (SELECT * FROM nodes_tags UNION ALL
      SELECT * FROM ways_tags) tag
WHERE tag.id IN (SELECT tags.id
                 FROM (SELECT * FROM nodes_tags UNION ALL
                       SELECT * FROM ways_tags) tags
                 JOIN (SELECT * FROM node_opening_hours UNION ALL
                       SELECT * FROM way_opening_hours) hours
                   ON tags.id = hours.id
                 WHERE tags.key = 'cuisine'
                   AND tags.value = 'chinese'
                   AND hours.hour LIKE '%su%'
                 LIMIT 10)
  AND tag.key LIKE '%addr%';
```

```sql
490254362|Washington
490254362|DC
490254362|Mount Pleasant Street Northwest
490254362|20009
490254362|3066
807592195|Washington
807592195|1527
807592195|20005
807592195|DC
807592195|14th Street Northwest
837122883|M Street Northwest
296507739|Washington
296507739|DC
296507739|12th Street Northeast
296507739|US
296507739|20017
296507739|2801
297371630|Washington
297371630|DC
297371630|Georgia Avenue Northwest
297371630|US
297371630|20011
297371630|4910
371641428|Arlington
371641428|VA
371641428|B
371641428|South Shirlington Road
371641428|22206
371641428|2249
417073480|Alexandria
417073480|VA
417073480|Centre Plaza
417073480|US
417073480|22302
417073480|1707
```

## Conclusion
The OpenStreetMap data for the DC metro area is not complete: many restaurants I know of are missing from the results.
The dataset also still needs cleaning on more keys.

## Additional Suggestions

### About the contributors

Who are the major contributors?

```sql
SELECT e.user,
       COUNT(*) AS num,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(t.user)
                                 FROM (SELECT user FROM nodes UNION ALL
                                       SELECT user FROM ways) t), 2) AS percentage
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
GROUP BY e.user
ORDER BY num DESC
LIMIT 10;
```

```sql
aude|741680|27.81
DavidYJackson_import|449254|16.85
wonderchook|171058|6.41
kriscarle|125532|4.71
woodpeck_fixbot|123280|4.62
emacsen|83850|3.14
RoadGeek_MD99|72254|2.71
Sawan Shariar|59815|2.24
ingalls|52627|1.97
sejohnson|52540|1.97
```

We can see that the user aude alone contributed over a quarter (27.81%) of all nodes and ways.

Save this query as a view for future reference:

```sql
CREATE VIEW v_contributer AS
SELECT e.user,
       COUNT(*) AS num,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(t.user)
                                 FROM (SELECT user FROM nodes UNION ALL
                                       SELECT user FROM ways) t), 2) AS percentage
FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
GROUP BY e.user
ORDER BY num DESC
LIMIT 10;
```

Next, let's look for possible inconsistencies between the top two contributors. (Here I used the uncorrected street names.)

```sql
SELECT key, value, type, e.user
FROM (SELECT * FROM nodes_tags UNION ALL
      SELECT * FROM ways_tags) tag
JOIN (SELECT user, id FROM nodes UNION ALL
      SELECT user, id FROM ways) e
  ON tag.id = e.id
WHERE e.user IN (SELECT user FROM v_contributer LIMIT 2)
  AND tag.key LIKE '%addr%'
  AND tag.type LIKE '%street%';
```

This returns records such as:

    addr|16th Street Northeast|street|DavidYJackson_import
    addr|Girard Street NW|street|aude
Here one contributor spells the direction out ("Northeast") while the other abbreviates it ("NW"), so the format is not consistent across users.

My suggestion is therefore that whenever a contributor adds information for a key, they should follow the format used by the first contributor for that key.


###  Benefits and Concerns

##### Benefits

1. All the data will follow a consistent format, which makes data cleaning easier
2. It will be easier to combine the dataset with other datasets that follow the same convention

##### Concerns

1. The first contributor to create a key might not represent the information in a universally accepted way

2. It will be difficult for people to follow this rule, because no alert is raised when a contribution does not follow the pattern
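As an illustration of the kind of alert the second concern refers to, here is a minimal sketch of a check that flags abbreviated street names before they are saved. The rule (street types and directions spelled out) and the `check_contribution` helper are assumptions for illustration only; this is not an existing OpenStreetMap feature.

```python
import re

# Hypothetical rule: the agreed format spells street types and directions out.
ABBREVIATION_RE = re.compile(r"\b(Hwy|Ln|NE|NW|SE|SW|St|Ave)\b\.?")

def find_abbreviations(street_name):
    """Return the abbreviated tokens found in a street name, if any."""
    return ABBREVIATION_RE.findall(street_name)

def check_contribution(street_name):
    """Return a warning message for a non-conforming name, else None."""
    found = find_abbreviations(street_name)
    if found:
        return "'{}' uses abbreviations: {}".format(street_name, ", ".join(found))
    return None

print(check_contribution("Girard Street NW"))
print(check_contribution("16th Street Northeast"))
```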