# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might have been manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within Python. Nothing external: no manual input, no Excel, no SQL, no Java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you installed a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which part you import and what that function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within Jupyter, I am going to assume you found the solution on the internet and cut and pasted it without understanding what that piece of code did.

# Question 1

Istanbul Municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. The data contains, among other things, the geographical locations of the sea stations ('Iskele') of Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Apart from `xmltodict`, do not use any external libraries other than numpy and pandas. Use `xmltodict` to convert the data into a dictionary, then extract the necessary parts.

<hr>

1. This dataset contains the locations of the wharves in Istanbul that are managed by Şehir Hatları. There are 5 groups of wharves, and each group contains stations. Each station is shown on the map at its own coordinates (latitude and longitude). We want to get the list of coordinates of the stations.

2. First, `urlopen` was imported to reach the page that holds the data. Next, `parse` from `xmltodict` was imported, since the data is stored in KML format, which is based on XML. Finally, pandas was imported to hold the extracted data in a dataframe.

In [136]:
from urllib.request import urlopen
from xmltodict import parse
import pandas as pd

The raw data is retrieved from the URL. Its format is actually KML.

In [137]:
with urlopen("https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml") as url:
    raw = url.read().decode('utf8')

Using the imported function `parse`, the raw data is converted from XML into a dictionary.

In [138]:
data = parse(raw)

We navigate to the essential part of the dictionary, which holds the information on all stations.

In [139]:
wharfdata = data['kml']['Document']['Folder']['Folder']
wharfdata

[OrderedDict([('name', 'MERKEZ İSKELELER'),
              ('open', '1'),
              ('Placemark',
               [OrderedDict([('name', 'MALTEPE'),
                             ('LookAt',
                              OrderedDict([('gx:TimeStamp',
                                            OrderedDict([('when',
                                                          '2020-06-16')])),
                                           ('gx:ViewerOptions',
                                            OrderedDict([('gx:option',
                                                          [OrderedDict([('@name',
                                                                         'historicalimagery')]),
                                                           OrderedDict([('@enabled',
                                                                         '0'),
                                                                        ('@name',
                                               

We find that there are 5 groups of wharves in Istanbul.

In [140]:
wharfgroups = [wharfgroup['name'] for wharfgroup in wharfdata]
wharfgroups

['MERKEZ İSKELELER',
 'HALİÇ İSKELELERİ',
 'ADA İSKELELERİ',
 'BOĞAZ İSKELELERİ',
 'ARABALI İSKELELER']

Here is how to pull the 'longitude' value of a sample station.

In [141]:
wharfdata[0]['Placemark'][0]['LookAt']['longitude']

'29.13060758098593'

Going through each wharf group, we pull the latitude and longitude values of its stations and append them to a list called res. At the end, res is converted to a DataFrame with the pandas function DataFrame().

Note: The coordinates are stored under the key 'LookAt' or 'Camera'. If reading them from 'LookAt' throws an exception, 'Camera' is tried instead.

In [142]:
res = []
skipped = 0
for wharfgroup in wharfdata:
    groupname = wharfgroup['name']
    for station in wharfgroup['Placemark']:
        for key in ('LookAt', 'Camera'):
            try:
                res.append({"name": station["name"],
                            "groupname": groupname,
                            "lat": float(station[key]['latitude']),
                            "long": float(station[key]['longitude'])})
                break
            except (KeyError, TypeError, ValueError):
                continue
        else:
            skipped = skipped + 1

stations = pd.DataFrame(res)
stations

| | name | groupname | lat | long |
|---|---|---|---|---|
| 0 | MALTEPE | MERKEZ İSKELELER | 40.91681 | 29.130608 |
| 1 | AHIRKAPI | MERKEZ İSKELELER | 41.003145 | 28.982897 |
| 2 | BEŞİKTAŞ-1 | MERKEZ İSKELELER | 41.041162 | 29.007788 |
| 3 | BEŞİKTAŞ-2 | MERKEZ İSKELELER | 41.040654 | 29.005505 |
| 4 | BOSTANCI | MERKEZ İSKELELER | 40.951734 | 29.094257 |
| 5 | EMİNÖNÜ-1 | MERKEZ İSKELELER | 41.01496 | 28.976219 |
| 6 | EMİNÖNÜ-2 | MERKEZ İSKELELER | 41.01496 | 28.976219 |
| 7 | EMİNÖNÜ-3 | MERKEZ İSKELELER | 41.014886 | 28.97496 |
| 8 | EMİNÖNÜ-4 | MERKEZ İSKELELER | 41.014886 | 28.97496 |
| 9 | HAYDARPAŞA | MERKEZ İSKELELER | 40.995774 | 29.018102 |


Next, the station names are set as the index of the dataframe.

In [143]:
stations.index = stations['name']
del stations['name']
stations

| name | groupname | lat | long |
|---|---|---|---|
| MALTEPE | MERKEZ İSKELELER | 40.91681 | 29.130608 |
| AHIRKAPI | MERKEZ İSKELELER | 41.003145 | 28.982897 |
| BEŞİKTAŞ-1 | MERKEZ İSKELELER | 41.041162 | 29.007788 |
| BEŞİKTAŞ-2 | MERKEZ İSKELELER | 41.040654 | 29.005505 |
| BOSTANCI | MERKEZ İSKELELER | 40.951734 | 29.094257 |
| EMİNÖNÜ-1 | MERKEZ İSKELELER | 41.01496 | 28.976219 |
| EMİNÖNÜ-2 | MERKEZ İSKELELER | 41.01496 | 28.976219 |
| EMİNÖNÜ-3 | MERKEZ İSKELELER | 41.014886 | 28.97496 |
| EMİNÖNÜ-4 | MERKEZ İSKELELER | 41.014886 | 28.97496 |
| HAYDARPAŞA | MERKEZ İSKELELER | 40.995774 | 29.018102 |
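
The question asks for a dataframe with exactly two columns (latitudes and longitudes) indexed by station name, while the frame above still carries `groupname`. A minimal sketch of that final shaping step, assuming a list of records shaped like `res`; the two records below are copied from the displayed output purely for illustration:

```python
import pandas as pd

# Illustrative records in the same shape as `res` (values copied from the output above)
res = [
    {"name": "MALTEPE", "groupname": "MERKEZ İSKELELER", "lat": 40.91681, "long": 29.130608},
    {"name": "AHIRKAPI", "groupname": "MERKEZ İSKELELER", "lat": 41.003145, "long": 28.982897},
]

# set_index moves the names into the index; the column selection keeps only the coordinates
coords = pd.DataFrame(res).set_index("name")[["lat", "long"]]
print(coords)
```

`set_index` returns a new frame, so it can replace the separate index assignment and `del` steps in one expression.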


# Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). The data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

<hr>

1. The data records the number of trips on each route in 2020 and in 2021. Each row contains the year, the route, and the total number of trips.

2. numpy is imported for array operations such as sum() and argmax(); in the cells below these are called as pandas Series methods, which are built on numpy. pandas and urlopen are imported for the same purposes as in Question 1.

In [144]:
from urllib.request import urlopen
import pandas as pd
import numpy as np

In the following two steps, we get the documents from their URLs. Since the documents are in CSV format, we read the data with the function read_csv(). Its parameter sep is set to ';' because the fields are separated by semicolons.

In [145]:
url20="https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv"

with urlopen(url20) as url:
    data20 = pd.read_csv(url, encoding= 'unicode_escape', sep=";")
data20

| | YIL | GÜZERGAH | TOPLAM SEFER ADETÝ |
|---|---|---|---|
| 0 | 2020 | BEÞÝKTAÞ - KADIKÖY | 26.879 |
| 1 | 2020 | KADIKÖY - KARAKÖY - BEÞÝKTAÞ | 13.0 |
| 2 | 2020 | EMÝNÖNÜ - ÜSKÜDAR | 28.441 |
| 3 | 2020 | ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ | 8.737 |
| 4 | 2020 | KADIKÖY - EMÝNÖNÜ | 18.408 |
| 5 | 2020 | KADIKÖY - KARAKÖY | 25.658 |
| 6 | 2020 | KABATAÞ - KADIKÖY - ADALAR - BOSTANCI | 5.879 |
| 7 | 2020 | ÝSTANBUL - ADALAR | 4.542 |
| 8 | 2020 | KADIKÖY - KARAKÖY - EMÝNÖNÜ | 11.156 |
| 9 | 2020 | BOÐAZ GÝDÝÞ GELÝÞ (EMÝNÖNÜ - BEÞÝKTAÞ -KUZGUN... | 523.0 |
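
The route names above look garbled (for example `BEÞÝKTAÞ` where one would expect `BEŞİKTAŞ`). This is the pattern Turkish text encoded as Windows-1254 produces when its bytes are read as Latin-1, which is effectively what `unicode_escape` does for bytes that are not escape sequences. That the file is Windows-1254 is an assumption inferred from the garbling. A small sketch on a made-up in-memory sample shows the effect and a possible fix:

```python
import io
import pandas as pd

# Made-up one-row CSV, encoded the way the file is assumed to be encoded (cp1254)
raw = "YIL;GÜZERGAH\n2020;BEŞİKTAŞ - KADIKÖY\n".encode("cp1254")

# Decoding with unicode_escape reproduces the garbled names seen above
garbled = raw.decode("unicode_escape")

# Passing the assumed encoding to read_csv keeps the Turkish letters intact
fixed = pd.read_csv(io.BytesIO(raw), sep=";", encoding="cp1254")
print(garbled)
print(fixed)
```

If the assumption holds for the real file, the column names would also come out with their proper Turkish letters, so any later cell that refers to a garbled column name would need the corrected spelling.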


In [146]:
url21="https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv"

with urlopen(url21) as url:
    data21 = pd.read_csv(url, encoding= 'unicode_escape', sep=";")
data21

| | Yil | Guzergah | Toplam Sefer Adeti |
|---|---|---|---|
| 0 | 2021.0 | BEÞÝKTAÞ-KADIKÖY | 23.658 |
| 1 | 2021.0 | EMÝNÖNÜ-ÜSKÜDAR | 23.854 |
| 2 | 2021.0 | EMÝNÖNÜ-KADIKÖY | 18.298 |
| 3 | 2021.0 | EMÝNÖNÜ-BEÞÝKTAÞ-KUZGUNCUK-BEYLERBEYÝ-ÇENGELKÖ... | 497.000 |
| 4 | 2021.0 | EMÝNÖNÜ-BEÞÝKTAÞ-ORTAKÖY-EMÝRGAN-PAÞABAHÇE-BEY... | 545.000 |
| ... | ... | ... | ... |
| 70 | | | |
| 71 | | | |
| 72 | | | |
| 73 | | | |


3. sum() is applied to the column 'TOPLAM SEFER ADETÝ' of data20 to find the total number of trips in 2020.
4. sum() is applied to the column 'Toplam Sefer Adeti' of data21 to find the total number of trips in 2021.


In [147]:
sumdata20 = data20['TOPLAM SEFER ADETÝ'].sum()
sumdata21 = data21['Toplam Sefer Adeti'].sum()
print('Total trips in 2020 is '+str(sumdata20)+ '\nTotal trips in 2021 is '+ str(sumdata21) )

Total trips in 2020 is 5851.006
Total trips in 2021 is 8956.095
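
The totals above are suspiciously small because values such as `26.879` were read as decimals, whereas in Turkish number formatting the dot is a thousands separator (26,879 trips). Assuming that is how the file is formatted, `read_csv` can be told so through its `thousands` and `decimal` parameters. A sketch on a made-up in-memory sample (the column names `year`, `route`, and `trips` are hypothetical):

```python
import io
import pandas as pd

# Made-up rows; the trip counts use '.' as a thousands separator
sample = io.StringIO(
    "year;route;trips\n"
    "2020;A - B;26.879\n"
    "2020;B - C;523\n"
    "2020;A - C;8.737\n"
)

# thousands='.' strips the separator; decimal=',' moves the decimal mark out of the way
trips = pd.read_csv(sample, sep=";", thousands=".", decimal=",")
total = trips["trips"].sum()
print(trips)
print(total)
```

Read this way, the counts are whole numbers, and both the sum and the maximum are computed on the intended values.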


5. With argmax, I find the position of the maximum value and retrieve that row with iloc, which takes a positional index as its parameter. This returns the row with the most trips for each of 2020 and 2021. Note that each row is a route, not a single station.

In [148]:
maxdata20 = data20.iloc[data20['TOPLAM SEFER ADETÝ'].argmax()]
maxdata21 = data21.iloc[data21['Toplam Sefer Adeti'].argmax()]
print('The Busiest Station in 2020:\n'+str(maxdata20['GÜZERGAH'])+ '\n The Busiest Station in 2021:\n'+ str(maxdata21['Guzergah']) )

The Busiest Station in 2020:
SARIYER - MUHTELÝF BOÐAZ - EMÝNÖNÜ (SARIYER - ÝSTÝNYE - EMÝRGAN - BEYLERBEYÝ - KUZGUNCUK - BEÞÝKTAÞ  - EMÝNÖNÜ)
 The Busiest Station in 2021:
BEBEK-ANADOLU HÝSARI-KANLICA-EMÝRGAN
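
Each row in these tables is a route, so the maximum above identifies the busiest route rather than the busiest station. One possible way to get a per-station figure is to split each route on `-`, credit every station on the route with the route's trip count, and sum per station. The sketch below assumes that station names never contain a hyphen themselves; the rows and the column names `route` and `trips` are made up for illustration:

```python
import pandas as pd

# Made-up routes; the last one has no spaces around '-', like the 2021 file
routes = pd.DataFrame({
    "route": [
        "BEŞİKTAŞ - KADIKÖY",
        "KADIKÖY - KARAKÖY - BEŞİKTAŞ",
        "EMİNÖNÜ - ÜSKÜDAR",
        "KADIKÖY-EMİNÖNÜ",
    ],
    "trips": [26879, 13, 28441, 18408],
})

# One row per (route, station) pair, each carrying the trip count of its route
stations = routes.assign(station=routes["route"].str.split("-")).explode("station", ignore_index=True)
stations["station"] = stations["station"].str.strip()

# Total trips touching each station, and the station with the largest total
per_station = stations.groupby("station")["trips"].sum()
busiest = per_station.idxmax()
print(per_station.sort_values(ascending=False))
print(busiest)
```

In this toy sample no single route is dominated by EMİNÖNÜ alone, yet it comes out as the busiest station because it lies on two busy routes, which is exactly the difference between the two readings of the question.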


# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.



<hr>

1. The data records the number of passengers passing through the wharves in Istanbul. Each row contains the year, month, authority name, station name, and passenger count.

In [149]:
with urlopen("https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv") as url:
    data = pd.read_csv(url, encoding= 'unicode_escape', sep=";")
data

| | Yil | Ay | Otorite Adi | Istasyon Adi | Yolcu Sayisi |
|---|---|---|---|---|---|
| 0 | 2021 | 3 | Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ... | BEYKOZ | 5076 |
| 1 | 2021 | 3 | Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ... | YENIKOY | 5347 |
| 2 | 2021 | 3 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 106334 |
| 3 | 2021 | 3 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | KABATAS | 24 |
| 4 | 2021 | 3 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | USKUDAR | 94200 |
| ... | ... | ... | ... | ... | ... |
| 656 | 2021 | 11 | TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA... | Eminönü | 55387 |
| 657 | 2021 | 11 | TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA... | Kadýköy Balon | 40680 |
| 658 | 2021 | 11 | TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA... | Kadýköy Çayýrbaþý | 69443 |
| 659 | 2021 | 11 | TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA... | Karaköy | 55098 |


2. Filtering the data for 2020 returns no rows: the dataset has no records for 2020.

In [150]:
data[(data["Yil"]==2020)]

| | Yil | Ay | Otorite Adi | Istasyon Adi | Yolcu Sayisi |
|---|---|---|---|---|---|


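2. For 2021, I use idxmax() on the column 'Yolcu Sayisi' of the rows where 'Yil' is 2021 to get the index label of the maximum value. Then I use loc on the data to retrieve that row. Because idxmax() returns a label rather than a position within the filtered rows, the lookup stays correct even when the filter drops rows.

In [151]:
data.loc[data.loc[data["Yil"] == 2021, 'Yolcu Sayisi'].idxmax()]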

Yil                                                          2021
Ay                                                             10
Otorite Adi     DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...
Istasyon Adi                                             BESIKTAS
Yolcu Sayisi                                               757374
Name: 516, dtype: object

3. The data is grouped by 'Ay' and 'Yil'. Then the pandas function idxmax() is used to get the index label of the maximum 'Yolcu Sayisi' in each group. In the end, we find the busiest station for each month.

In [152]:
groupdata = data.loc[data.groupby(["Ay", "Yil"])["Yolcu Sayisi"].idxmax()]

groupdata.index = groupdata['Ay']
del groupdata['Ay']
groupdata

| Ay | Yil | Otorite Adi | Istasyon Adi | Yolcu Sayisi |
|---|---|---|---|---|
| 3 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 106334 |
| 4 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 274984 |
| 5 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 205662 |
| 6 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 488048 |
| 7 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 590792 |
| 8 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 601089 |
| 9 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 633355 |
| 10 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 757374 |
| 11 | 2021 | DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC... | BESIKTAS | 223286 |
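
A station may appear more than once in a month if several operators ('Otorite Adi') serve it, in which case the row with the largest 'Yolcu Sayisi' is the busiest station-operator pair rather than the busiest station. A sketch of an alternative that first sums passengers per station and then picks the maximum, using made-up rows in the same column layout (the operator names `A` and `B` are placeholders):

```python
import pandas as pd

# Made-up rows in the same column layout as the passenger table
sample = pd.DataFrame({
    "Yil": [2021, 2021, 2021, 2021, 2021, 2021],
    "Ay": [3, 3, 3, 4, 4, 4],
    "Otorite Adi": ["A", "B", "A", "A", "B", "B"],
    "Istasyon Adi": ["BESIKTAS", "BESIKTAS", "USKUDAR", "BESIKTAS", "USKUDAR", "USKUDAR"],
    "Yolcu Sayisi": [100, 50, 120, 80, 60, 70],
})

# Total passengers per station and month, summed over operators
monthly = sample.groupby(["Yil", "Ay", "Istasyon Adi"], as_index=False)["Yolcu Sayisi"].sum()
busiest_per_month = monthly.loc[monthly.groupby(["Yil", "Ay"])["Yolcu Sayisi"].idxmax()]

# Total passengers per station over the whole sample
yearly = sample.groupby("Istasyon Adi")["Yolcu Sayisi"].sum()
busiest_of_year = yearly.idxmax()

print(busiest_per_month)
print(busiest_of_year)
```

On the real data the station names are spelled inconsistently (for example upper-case `BESIKTAS` next to mixed-case `Eminönü`), so the names would likely need to be normalized before grouping.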


4. I cannot compare the results for 2020 because there is no data for that year.
<br> For 2021, on the other hand, it is clear that the station 'Beşiktaş' was the busiest one in every month in Question 3. Beşiktaş should also be on the busiest route of 2021 in Question 2, so the two results agree. (They should match, but unfortunately the output of Question 2 does not show it, because the trip counts there were read with the wrong decimal interpretation.)