# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might have been manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within Python. Nothing external: no manual input, no Excel, no SQL, no Java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you installed a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which parts you import and what each function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within Jupyter, I am going to assume you found the solution on the internet and copy-pasted it without understanding what that piece of code does.
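
As a minimal sketch of the first rule, the data can be read straight from the response object that `urlopen` returns, without ever writing a file. The `data:` URL and the `read_rows` helper below are stand-ins made up for illustration so that the example runs without a network connection; in an assignment, the URL given in the question would take their place.

```python
import csv
import io
from urllib.request import urlopen

# Hypothetical stand-in for a real data URL: a tiny semicolon-separated
# table embedded in a data: URL (%0A is an encoded newline).
SAMPLE_URL = "data:text/csv;charset=utf-8,Yil;Guzergah%0A2021;KADIKOY-KARAKOY"

def read_rows(url, encoding="utf-8"):
    """Fetch a CSV from a URL and return its rows, entirely in memory."""
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=";"))

rows = read_rows(SAMPLE_URL)
print(rows)
```

Nothing is saved to disk here: the bytes go from the response into a string and then into the parser.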

# Question 1

Istanbul municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. The data contains, among other things, the geographical locations of the sea stations ('Iskele') of the Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Do not use any external libraries other than numpy, pandas and `xmltodict`. Use `xmltodict` to convert the data into a dictionary, then extract the necessary parts.

In [1]:
from xmltodict import parse
from urllib.request import urlopen
import pandas as pd

### What is the data
The data, in XML format, stores the name and coordinate information of the stations. This information can be reached through the nested keys 'kml', 'Document', 'Folder', 'Folder'.

`urlopen` fetches the data from the link, then the `parse` function converts the XML into a Python dictionary so that the data can be manipulated.

In [2]:
with urlopen("https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml") as data:
    raw = parse(data,encoding='utf-8')
raw

OrderedDict([('kml',
              OrderedDict([('@xmlns', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:gx', 'http://www.google.com/kml/ext/2.2'),
                           ('@xmlns:kml', 'http://www.opengis.net/kml/2.2'),
                           ('@xmlns:atom', 'http://www.w3.org/2005/Atom'),
                           ('Document',
                            OrderedDict([('name', 'SHI İSKELELER.kml'),
                                         ('StyleMap',
                                          [OrderedDict([('@id',
                                                         'msn_marina23'),
                                                        ('Pair',
                                                         [OrderedDict([('key',
                                                                        'normal'),
                                                                       ('styleUrl',
                                                            

The list of pier groups is reached through the dictionary keys.

In [3]:
st_groups=raw['kml']['Document']['Folder']['Folder']
st_groups

[OrderedDict([('name', 'MERKEZ İSKELELER'),
              ('open', '1'),
              ('Placemark',
               [OrderedDict([('name', 'MALTEPE'),
                             ('LookAt',
                              OrderedDict([('gx:TimeStamp',
                                            OrderedDict([('when',
                                                          '2020-06-16')])),
                                           ('gx:ViewerOptions',
                                            OrderedDict([('gx:option',
                                                          [OrderedDict([('@name',
                                                                         'historicalimagery')]),
                                                           OrderedDict([('@enabled',
                                                                         '0'),
                                                                        ('@name',
                                               

### Explanation of loop
The for loop walks through st_groups by index and reads the Placemark key of each entry. st_groups is a list of station groups such as MERKEZ İSKELELER.

station_list contains every station of a group together with its information. The list called name collects the station names through the 'name' key.

Since the key that holds the coordinate information differs between stations, I collect the keys of each station in a list named key.

With this list of keys, the coordinates of the station can be accessed: the second key (key[1]) is the one that holds them.
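
The key lookup described here can be sketched on a hand-made dictionary. The `sample_groups` structure and the `LookAt`/`Camera` key names below are assumptions chosen to mimic the shape of the parsed data, not a copy of it. Instead of relying on the position of the key, the sketch searches each station for the child dictionary that carries the coordinates.

```python
# Hypothetical miniature of the parsed structure: two stations whose
# coordinates sit under differently named keys.
sample_groups = [
    {"name": "GROUP A",
     "Placemark": [
         {"name": "S1", "LookAt": {"longitude": "29.13", "latitude": "40.91"}},
         {"name": "S2", "Camera": {"longitude": "28.98", "latitude": "41.00"}},
     ]},
]

def station_coordinates(groups):
    """Return {station name: (latitude, longitude)} with float values."""
    result = {}
    for group in groups:
        for station in group["Placemark"]:
            for value in station.values():
                # The coordinate container is the first child dictionary
                # that has both a longitude and a latitude entry.
                if (isinstance(value, dict)
                        and "longitude" in value and "latitude" in value):
                    result[station["name"]] = (float(value["latitude"]),
                                               float(value["longitude"]))
                    break
    return result

coords = station_coordinates(sample_groups)
print(coords)
```

Searching by content rather than by position keeps working if the order of the keys inside a station ever changes.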

In [4]:
name=[]

longi=[]
lati=[]


for i in range(len(st_groups)):
    station_list=st_groups[i]['Placemark']
    for j in range(len(station_list)):
        name.append(station_list[j]['name'])
        key=list(st_groups[i]['Placemark'][j].keys())
        longi.append(st_groups[i]['Placemark'][j][key[1]]['longitude'])
        lati.append(st_groups[i]['Placemark'][j][key[1]]['latitude'])

A dataframe is created from the lists name, lati and longi, and its index is set to STATION_NAME.
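
An alternative construction, shown only as a sketch on made-up values, builds the dataframe in one step and converts the coordinates to floats, assuming they arrive as strings from the parsed XML. The three short lists are hypothetical stand-ins for `name`, `lati` and `longi`.

```python
import pandas as pd

# Hypothetical stand-ins for the lists filled by the loop.
name = ["S1", "S2"]
lati = ["40.91", "41.00"]
longi = ["29.13", "28.98"]

stations = pd.DataFrame(
    {"latitude": lati, "longitude": longi},
    index=pd.Index(name, name="STATION_NAME"),
).astype(float)  # convert the string coordinates to numeric columns

print(stations)
```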

In [6]:
df=pd.DataFrame(name,columns=['STATION_NAME'])

df['latitude']=lati
df['longitude']=longi
df.index = df['STATION_NAME']
del df['STATION_NAME']

df

STATION_NAME,latitude,longitude
MALTEPE,40.91681013544846,29.13060758098593
AHIRKAPI,41.00314456999032,28.98289668101853
BEŞİKTAŞ-1,41.04116198628195,29.00778819900819
BEŞİKTAŞ-2,41.04065414312002,29.0055048939288
BOSTANCI,40.95173395654253,29.09425745312653
EMİNÖNÜ-1,41.01495987953694,28.97621869809887
EMİNÖNÜ-2,41.01495987953694,28.97621869809887
EMİNÖNÜ-3,41.01488637107048,28.97495985342729
EMİNÖNÜ-4,41.01488637107048,28.97495985342729
HAYDARPAŞA,40.99577360085738,29.01810215560077


# Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). The data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

### What is the data
The data records the total number of trips on each route (Guzergah) for a given year.

### Convert to dataframe
`urlopen` receives the CSV file.

`pd.read_csv` converts the CSV data into a dataframe.

Rows with a non-null year are kept using the `notna()` function.

In [7]:
with urlopen("https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv") as url2_1:
    q2_21=pd.read_csv(url2_1,delimiter = ";",encoding='iso-8859-1')

q2_21=q2_21[q2_21['Yil'].notna()]
q2_21

Unnamed: 0,Yil,Guzergah,Toplam Sefer Adeti
0,2021.0,BEÞÝKTAÞ-KADIKÖY,23.658
1,2021.0,EMÝNÖNÜ-ÜSKÜDAR,23.854
2,2021.0,EMÝNÖNÜ-KADIKÖY,18.298
3,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-KUZGUNCUK-BEYLERBEYÝ-ÇENGELKÖ...,497.0
4,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-ORTAKÖY-EMÝRGAN-PAÞABAHÇE-BEY...,545.0
5,2021.0,ÇENGELKÖY-BEÞÝKTAÞ-EMÝNÖNÜ,433.0
6,2021.0,KADIKÖY-KARAKÖY,6.168
7,2021.0,KADIKÖY-KARAKÖY-EMÝNÖNÜ,18.304
8,2021.0,KABATAÞ-KADIKÖY-ADALAR,7.046
9,2021.0,BOSTANCI- BÜYÜKADA-HEYBELÝADA,940.0
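
The station names in this output look mis-decoded: `BEÞÝKTAÞ` is what the bytes of `BEŞİKTAŞ` become when Turkish ISO-8859-9 text is read as ISO-8859-1. Assuming that is what happened here (the actual encoding of the file is not stated), the round trip below reproduces the effect and shows that decoding with `iso-8859-9` restores the name.

```python
# Bytes of a Turkish station name in ISO-8859-9 (Latin-5).
raw_bytes = "BEŞİKTAŞ".encode("iso-8859-9")

garbled = raw_bytes.decode("iso-8859-1")   # wrong codec, as in the output
restored = raw_bytes.decode("iso-8859-9")  # codec that matches the bytes

print(garbled, restored)
```

If the assumption holds, passing `encoding='iso-8859-9'` to `pd.read_csv` would give readable names. Any column name typed by hand in a later step would then need the corrected spelling as well.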


The columns of q2_20 are renamed with the rename() function so that they match the columns of q2_21.

In [8]:
with urlopen("https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv") as url2_2:
    q2_20 = pd.read_csv(url2_2,delimiter = ";",encoding='iso-8859-1')
q2_20=q2_20.rename(columns={'YIL':'Yil', 'GÜZERGAH':'Guzergah', 'TOPLAM SEFER ADETÝ':'Toplam Sefer Adeti'})
q2_20

Unnamed: 0,Yil,Guzergah,Toplam Sefer Adeti
0,2020,BEÞÝKTAÞ - KADIKÖY,26.879
1,2020,KADIKÖY - KARAKÖY - BEÞÝKTAÞ,13.0
2,2020,EMÝNÖNÜ - ÜSKÜDAR,28.441
3,2020,ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ,8.737
4,2020,KADIKÖY - EMÝNÖNÜ,18.408
5,2020,KADIKÖY - KARAKÖY,25.658
6,2020,KABATAÞ - KADIKÖY - ADALAR - BOSTANCI,5.879
7,2020,ÝSTANBUL - ADALAR,4.542
8,2020,KADIKÖY - KARAKÖY - EMÝNÖNÜ,11.156
9,2020,BOÐAZ GÝDÝÞ GELÝÞ (EMÝNÖNÜ - BEÞÝKTAÞ -KUZGUN...,523.0


The concat() function combines the two separate dataframes into one.

In [9]:
son=pd.concat([q2_20,q2_21],ignore_index=True)
son

Unnamed: 0,Yil,Guzergah,Toplam Sefer Adeti
0,2020.0,BEÞÝKTAÞ - KADIKÖY,26.879
1,2020.0,KADIKÖY - KARAKÖY - BEÞÝKTAÞ,13.000
2,2020.0,EMÝNÖNÜ - ÜSKÜDAR,28.441
3,2020.0,ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ,8.737
4,2020.0,KADIKÖY - EMÝNÖNÜ,18.408
...,...,...,...
68,2021.0,ADALAR-BEÞÝKTAÞ,3.262
69,2021.0,BÜYÜKADA-SEDEFADASI,389.000
70,2021.0,EMÝNÖNÜ- BEYLERBEYÝ,249.000
71,2021.0,BEYLERBEYÝ- KUZGUNCUK- BEÞÝKTAÞ- EMÝNÖNÜ,248.000


The function takes the dataframe and a year as input.

It sums the number of trips over the rows of the dataframe whose year equals the given year.
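
Yearly totals of this kind can also be obtained for every year at once with `groupby`. This is a sketch on a made-up dataframe that reuses the column names, offered as an alternative rather than as the method used in this notebook.

```python
import pandas as pd

# Made-up rows using the column names of the combined dataframe.
son_sample = pd.DataFrame({
    "Yil": [2020, 2020, 2021],
    "Toplam Sefer Adeti": [10, 5, 7],
})

# One sum per year instead of one function call per year.
totals_by_year = son_sample.groupby("Yil")["Toplam Sefer Adeti"].sum()
print(totals_by_year)
```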

In [10]:
def toplam(son,yil):
    print(son.loc[son['Yil']==yil]['Toplam Sefer Adeti'].sum())

In [11]:
toplam(son,2020)
toplam(son,2021)

5851.006
8956.094999999998
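
The fractional totals are worth a second look. Values such as 23.658 sitting next to 497.0 in the same column suggest that the dot in the source file may be a thousands separator rather than a decimal point. This is an assumption about the file, not something confirmed here. Under that assumption, a sketch on a made-up two-row sample would pass `thousands` and `decimal` explicitly to `read_csv`:

```python
import io
import pandas as pd

# Made-up sample in the same layout as the trip files.
sample = io.StringIO(
    "Yil;Guzergah;Toplam Sefer Adeti\n"
    "2021;A-B;23.658\n"
    "2021;C-D;497\n"
)

# Treat "." as a thousands separator and "," as the decimal mark.
trips = pd.read_csv(sample, delimiter=";", thousands=".", decimal=",")
total = trips["Toplam Sefer Adeti"].sum()
print(total)
```

With these options, 23.658 is read as twenty-three thousand six hundred fifty-eight trips, so the total becomes a whole number.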


The function takes a year and the dataframe as input.

It splits the route names in the 'Guzergah' column of the rows for the given year into station names with the split() function, using two delimiters ('-' and '(').

The dataframe index is used to access the list of station names of each route.

The same station must always be written with the same characters. Hence, spaces are removed before a name is used as a dictionary key. The 'Toplam Sefer Adeti' (total number of trips) value at the current index of the dataframe is added to this key.

The max() function finds the key with the maximum value.
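
The steps described here can also be expressed with pandas operations instead of a dictionary. The sketch below is an alternative on made-up routes, with `busiest_station` as a hypothetical helper name; it splits only on '-' to stay short.

```python
import pandas as pd

# Made-up routes using the column names of the combined dataframe.
trips = pd.DataFrame({
    "Yil": [2020, 2020, 2020],
    "Guzergah": ["A - B", "B - C", "A-C"],
    "Toplam Sefer Adeti": [10, 5, 3],
})

def busiest_station(df, year):
    """Return (station, trips) for the station with the most trips."""
    rows = df.loc[df["Yil"] == year]
    # One row per (route, station) pair: split the route, then explode.
    exploded = (
        rows.assign(station=rows["Guzergah"].str.split("-"))
        .explode("station")
        .reset_index(drop=True)
    )
    # Remove surrounding spaces so "A " and "A" count as one station.
    exploded["station"] = exploded["station"].str.strip()
    totals = exploded.groupby("station")["Toplam Sefer Adeti"].sum()
    return totals.idxmax(), totals.max()

print(busiest_station(trips, 2020))
```

Each route's trip count is credited to every station on the route, which is the same counting rule as in the dictionary version.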

In [12]:
yil_20=2020
yil_21=2021

def dene(yil,son):
    rows=son.loc[son['Yil']==yil]
    stations=rows['Guzergah'].str.split('[-(]')

    dic={}
    for i in rows.index:
        for station in stations[i]:
            station=station.replace(" ","")
            dic[station]=dic.get(station,0)+son.loc[i]['Toplam Sefer Adeti']
    key=max(dic, key=dic.get)
    value=dic[key]
    print("{} {} ".format(key,value))

In [13]:
dene(yil_20,son)

EMÝNÖNÜ 3790.7679999999996 


In [14]:
dene(yil_21,son)

BEÞÝKTAÞ 5263.192999999999 


# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


### What is the data
The data records the monthly number of passengers at each station, together with the year and the operating authority (Otorite Adi).

In [15]:
with urlopen("https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv") as url3:
    q3=pd.read_csv(url3,delimiter = ";",encoding='iso-8859-1')
q3

Unnamed: 0,Yil,Ay,Otorite Adi,Istasyon Adi,Yolcu Sayisi
0,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,BEYKOZ,5076
1,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,YENIKOY,5347
2,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,BESIKTAS,106334
3,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,KABATAS,24
4,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,USKUDAR,94200
...,...,...,...,...,...
656,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Eminönü,55387
657,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Balon,40680
658,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Çayýrbaþý,69443
659,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Karaköy,55098


The function takes the dataframe and a year as input.

It groups the rows of the given year by 'Istasyon Adi' and sums the number of passengers in the 'Yolcu Sayisi' column for each station. The idxmax() function then returns the name of the station with the largest total.


In [17]:
def find_max(q3,yil):
    print(q3.loc[q3['Yil']==yil].groupby('Istasyon Adi')['Yolcu Sayisi'].sum().idxmax())

The dataset contains no rows for 2020, so the grouped sum for that year is empty.

In [18]:
q3.loc[q3['Yil']==2020].groupby('Istasyon Adi')['Yolcu Sayisi'].sum()

Series([], Name: Yolcu Sayisi, dtype: int64)

In [19]:
find_max(q3,2021)

USKUDAR


The for loop finds the busiest station for each month from 3 to 11.
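
The per-month search can also be written as a single grouped computation that keeps the year in the grouping, so that months from different years would not be mixed. The sketch below runs on made-up numbers with the same column names.

```python
import pandas as pd

# Made-up passenger counts using the column names of the dataset.
passengers = pd.DataFrame({
    "Yil": [2021, 2021, 2021, 2021, 2021, 2021],
    "Ay": [3, 3, 3, 4, 4, 4],
    "Istasyon Adi": ["USKUDAR", "BESIKTAS", "USKUDAR",
                     "USKUDAR", "BESIKTAS", "BESIKTAS"],
    "Yolcu Sayisi": [100, 150, 80, 90, 60, 70],
})

# Total passengers per station within each (year, month).
totals = passengers.groupby(
    ["Yil", "Ay", "Istasyon Adi"], as_index=False
)["Yolcu Sayisi"].sum()

# Row label of the largest total inside each (year, month) group.
top_rows = totals.groupby(["Yil", "Ay"])["Yolcu Sayisi"].idxmax()
busiest_by_month = totals.loc[top_rows]
print(busiest_by_month)
```

The result keeps the passenger total next to each winning station, which makes the monthly comparison easier to inspect than a bare list of names.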

In [20]:
maks=[]
for i in range(3,12):
    maks.append(q3[q3['Ay']==i].groupby('Istasyon Adi')["Yolcu Sayisi"].sum().idxmax())

maks 

['USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR',
 'USKUDAR']

### Is it different
The busiest stations found in Questions 2 and 3 are different. The data in Question 2 counts the trips on each route, whereas the data in Question 3 counts passengers. The number of passengers and the number of trips are different measurements. The vessels used on the trips may differ in size, or the trips may not fall in the busiest hours, so a station with many trips need not be the one with the most passengers.