## Capstone Project: The Battle of Neighborhoods (Week 1)

### Introduction/Business Problem

Greece attracts many tourists for hiking because of the abundance and beauty of its natural landscapes.

The aim of this project is to recommend city destinations for mountain lovers in Greece.

The first task is to segment the cities by their geographic distance (in km) from every mountain and cluster them on that basis. As a second step, I will make another segmentation based on the similarity of the cities.
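
The distance feature described above can be sketched as follows. This is a minimal sketch, assuming the haversine great-circle formula and a mean Earth radius of 6371 km; the function names `haversine_km` and `distance_row` are hypothetical, and the source does not say which distance formula is actually used.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def distance_row(city, mountains):
    """Distances from one (lat, lon) city to every (lat, lon) mountain.

    One such row per city gives a feature matrix that a clustering
    algorithm could work on.
    """
    return [haversine_km(city[0], city[1], m[0], m[1]) for m in mountains]
```

Calling `distance_row` once per city yields one row of distances per city, which is the kind of input a clustering step would need.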

After that, I will merge the two segmentations to keep the cities that are both similar to each other and close to the same mountains.
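
One possible reading of this merge step is an inner join on the city name, followed by grouping on the pair of cluster labels. The sketch below assumes each segmentation is available as a mapping from city name to cluster label; the function name `merge_clusters` and the labels are made up for illustration.

```python
from collections import defaultdict


def merge_clusters(distance_labels, venue_labels):
    """Group cities that share both a distance cluster and a venue cluster.

    Both arguments map city name -> cluster label (hypothetical inputs).
    Cities missing from either mapping are dropped, like an inner join.
    """
    groups = defaultdict(list)
    for city in sorted(distance_labels.keys() & venue_labels.keys()):
        key = (distance_labels[city], venue_labels[city])
        groups[key].append(city)
    return dict(groups)


# Made-up labels, for illustration only
distance_labels = {'Larissa': 0, 'Katerini': 0, 'Chania': 1, 'Heraklion': 1}
venue_labels = {'Larissa': 2, 'Katerini': 2, 'Chania': 0, 'Heraklion': 0, 'Athens': 1}

merged = merge_clusters(distance_labels, venue_labels)
```

Each key of `merged` is a (distance cluster, venue cluster) pair, so every group contains cities that agree on both segmentations.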

Finally, for every city in the final merged cluster, I will find the 5 closest mountains by distance in km and group them into 3 categories by height.
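
A minimal sketch of this last step, assuming the distance from the city to each mountain has already been computed. The cut-offs of 1000 m and 2000 m are hypothetical, since the three height categories are not defined here, and the function names are illustrative only.

```python
from heapq import nsmallest

# Hypothetical cut-offs in metres; the actual bins are not stated in the text
LOW_MAX_M = 1000
MEDIUM_MAX_M = 2000


def height_category(height_m):
    """Assign one of three height categories using the assumed cut-offs."""
    if height_m < LOW_MAX_M:
        return 'low'
    if height_m < MEDIUM_MAX_M:
        return 'medium'
    return 'high'


def closest_mountains(mountains, k=5):
    """Return the k nearest mountains, each tagged with a height category.

    `mountains` is a list of dicts with 'peak', 'height_m' and 'distance_km'
    keys, where 'distance_km' is the distance from the city of interest.
    """
    nearest = nsmallest(k, mountains, key=lambda m: m['distance_km'])
    return [dict(m, category=height_category(m['height_m'])) for m in nearest]
```

`nsmallest` returns the results ordered from nearest to farthest, so the output can be shown to the user as a ranked list.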

I will use data science tools to analyse the data, focusing on the relationship between cities and mountains in Greece.

### Data

We will need data from two sources, Wikipedia and Foursquare:

* Wikipedia data for mountain information such as height, regional unit and coordinates.
* Wikipedia data for city information such as name and coordinates.
* Foursquare data about venues in every city in Greece.

I scraped the Wikipedia data from https://en.wikipedia.org/wiki/List_of_mountains_in_Greece.

I used **Scrapy**, an open source and collaborative framework for extracting data from websites. Here is my code.

```python
import scrapy


def convert_coor(old_value):
    """Convert a DMS string such as 40°05′08″N to decimal degrees (dd)."""
    degrees, _, rest = old_value.partition('°')
    minutes, _, rest = rest.partition('′')
    seconds = rest.partition('″')[0]
    new_value = float(degrees) + float(minutes) / 60
    # Some pages give no seconds; what is left is then only the hemisphere letter
    try:
        new_value += float(seconds) / 3600
    except ValueError:
        pass
    # Greece lies in the northern and eastern hemispheres, so no sign flip is needed
    return new_value


class MountainsSpider(scrapy.Spider):
    name = 'mountains'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['http://en.wikipedia.org/wiki/List_of_mountains_in_Greece']

    def parse(self, response):
        table = response.xpath('//table[contains(@class,"wikitable sortable")]')
        # Skip the leading rows that precede the data
        trs = table.xpath('.//tr')[2:]

        for tr in trs:
            # default='' avoids an AttributeError when a cell has no text
            peak = tr.xpath('.//td[1]//text()').extract_first(default='').strip()
            height = tr.xpath('.//td[2]//text()').extract_first(default='').strip()
            mountain_range = tr.xpath('.//td[4]//text()').extract_first(default='').strip()
            regional_unit = tr.xpath('.//td[5]//text()').extract_first(default='').strip()

            # The first link in the row leads to the mountain's own page
            next_page_mountain_url = tr.xpath('.//td/a/@href').extract_first()

            if next_page_mountain_url:
                yield scrapy.Request(
                    response.urljoin(next_page_mountain_url),
                    callback=self.parse_info,
                    meta={'peak': peak,
                          'height': height,
                          'mountain_range': mountain_range,
                          'regional_unit': regional_unit})

    def parse_info(self, response):
        info = response.xpath('//table[contains(@class,"infobox")]')
        latitude = info.xpath('.//span[@class="latitude"]/text()').extract_first()
        longitude = info.xpath('.//span[@class="longitude"]/text()').extract_first()

        # Pages without coordinates yield nothing instead of raising an error
        if not latitude or not longitude:
            return

        yield {
            'peak': response.meta['peak'],
            'height': response.meta['height'],
            'mountain_range': response.meta['mountain_range'],
            'regional_unit': response.meta['regional_unit'],
            'mountain_latitude': convert_coor(latitude),
            'mountain_longitude': convert_coor(longitude),
        }
```
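
The DMS-to-decimal conversion done by `convert_coor` can also be checked in isolation. The block below is a standalone variant, not the spider's own helper: it assumes the coordinate strings use the `°`, `′` and `″` marks shown in the spider, accepts decimal seconds, and applies a negative sign for the S and W hemispheres. The name `dms_to_decimal` is hypothetical.

```python
import re

# Matches e.g. 40°05′08″N, 22°21′E or 39°4′30.5″N (minutes and seconds optional)
DMS_PATTERN = re.compile(
    r"(?P<deg>\d+)°\s*(?:(?P<min>\d+)′)?\s*(?:(?P<sec>\d+(?:\.\d+)?)″)?\s*(?P<hem>[NSEW])?"
)


def dms_to_decimal(value):
    """Convert a DMS coordinate string to signed decimal degrees."""
    match = DMS_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"Unrecognised coordinate: {value!r}")
    decimal = (
        int(match["deg"])
        + int(match["min"] or 0) / 60
        + float(match["sec"] or 0) / 3600
    )
    # South and west are negative by convention
    return -decimal if match["hem"] in ("S", "W") else decimal
```

The formula is the usual one: degrees + minutes / 60 + seconds / 3600.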

I ended up with a dataset that contains the **peak**, the **height**, the **mountain range**, the **coordinates** and the **regional unit** each mountain belongs to.

I followed a similar approach for the **regions** / **cities**. I collected the **name** and the **coordinates** of each city, which makes it possible to combine this dataset with the previous one.

```python
import scrapy


def convert_coor(old_value):
    """Convert a DMS string such as 40°05′08″N to decimal degrees (dd)."""
    degrees, _, rest = old_value.partition('°')
    minutes, _, rest = rest.partition('′')
    seconds = rest.partition('″')[0]
    new_value = float(degrees) + float(minutes) / 60
    # Some pages give no seconds; what is left is then only the hemisphere letter
    try:
        new_value += float(seconds) / 3600
    except ValueError:
        pass
    return new_value


class RegionsSpider(scrapy.Spider):
    name = 'regions'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['http://en.wikipedia.org/wiki/List_of_mountains_in_Greece']

    def parse(self, response):
        table = response.xpath('//table[contains(@class,"wikitable sortable")]')
        # Skip the leading rows that precede the data
        trs = table.xpath('.//tr')[2:]

        for tr in trs:
            # default='' avoids an AttributeError when a cell has no text
            regional_unit = tr.xpath('.//td[5]//text()').extract_first(default='').strip()

            # The link in the fifth column leads to the regional unit's page
            next_page_region_url = tr.xpath('.//td[5]/a/@href').extract_first()

            if next_page_region_url:
                yield scrapy.Request(
                    response.urljoin(next_page_region_url),
                    callback=self.parse_info,
                    meta={'regional_unit': regional_unit})

    def parse_info(self, response):
        info = response.xpath('//table[contains(@class,"infobox")]')
        latitude = info.xpath('.//span[@class="latitude"]/text()').extract_first()
        longitude = info.xpath('.//span[@class="longitude"]/text()').extract_first()

        # Pages without coordinates yield nothing instead of raising an error
        if not latitude or not longitude:
            return

        yield {
            'regional_unit': response.meta['regional_unit'],
            'regional_latitude': convert_coor(latitude),
            'regional_longitude': convert_coor(longitude),
        }
```

Finally, I will use Foursquare data to find venues within a given radius of the cities near each mountain, in order to tell the cities apart. I will take the top 10 venues of every city. We will see at the end that cities with specific similarities, such as islands, cluster together.
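
The "top 10 venues" step can be sketched as below, assuming it means the 10 most frequent venue categories per city and that the venue data has already been fetched and reduced to (city, category) pairs. The function name `top_venue_categories` and the input shape are hypothetical; no Foursquare call is shown.

```python
from collections import Counter, defaultdict


def top_venue_categories(venues, n=10):
    """Return the n most common venue categories for each city.

    `venues` is an iterable of (city, category) pairs, for example one
    pair per venue found within the search radius of that city.
    """
    counts = defaultdict(Counter)
    for city, category in venues:
        counts[city][category] += 1
    return {
        city: [category for category, _ in counter.most_common(n)]
        for city, counter in counts.items()
    }
```

The resulting per-city category lists are one way to build the similarity features on which the cities are clustered.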