# Some ground rules for the assignments:

For all assignments (this one and any future assignment including the final project): 

* Do not download and save the data locally unless your data is very big (~TBs). I **do not** want to see you opening a local file for the data I gave you as a URL. Anything local is suspect: local files can't be trusted (they might be manipulated, changed, modified, or tampered with). Refer to my lecture notes on how to pull data from a URL using `urlopen`.

* All computations must be done locally within python. Nothing external: no manual input, no excel, no SQL, no java, etc.

* All code has to be explained. Explain your reasoning and your choices. If you use a third-party library (including `numpy`, `scipy`, `pandas`, etc.), explain which part you import and what that function does.

* Explain your code using a markdown cell. **Do not** use code comments starting with `#` to do your explanations.

* Do not use `if __name__ == "__main__"`. EVER! If you are using that within jupyter, I am going to assume you found the solution on the internet and cut and pasted it without understanding what that piece of code does.

# Question 1

Istanbul municipality has an open data service that provides detailed information about its services. For this question, use the data given at [this link](https://data.ibb.gov.tr/dataset/istanbul-sehir-hatlari-iskeleleri).

1. Understand what the data is for. Explain what it is, what it records, what pieces it has in broad strokes. 

2. The data contains, among other things, the geographical locations of the sea stations ('Iskele') of Istanbul Deniz Isletmeleri boats operating in Istanbul. Extract the locations of these stations as a pandas dataframe with 2 columns: latitudes and longitudes. The index of the dataframe has to be the station names.

The data is in XML format. Apart from numpy, pandas, and `xmltodict`, do not use any external libraries. Use `xmltodict` to convert the data into a dictionary, then extract the necessary parts.

# Homework-1
### Solution.1

### 1.1)
#### Data

- This data set contains the names of the piers in Istanbul operated by Şehir Hatları, together with their geographical coordinates (longitude and latitude).

#### Code

* For our first question, we first import the libraries we will use.

In [1]:
import pandas as pd
from urllib.request import urlopen
from xmltodict import parse

- We fetch the XML data with `urlopen`, decode it as UTF-8, and convert it into a nested dictionary with `parse` so that we can navigate it with ordinary Python indexing.

In [2]:
link="https://data.ibb.gov.tr/dataset/b47b5391-bcca-4bb3-a575-8ece68901d5d/resource/bcbeff5d-14d7-4ec0-a211-4c4d3e96cfba/download/istanbul-ehir-hatlar-iskeleleri.kml"
A=urlopen(link) 
data=parse(A.read().decode('utf8'))

- To reach the nested fields requested in the question (station name and coordinates), we index into the dictionary with square brackets.
- We create 3 empty lists, one for each field: station names, longitudes, and latitudes.
- Some levels of the structure are lists rather than dictionaries. Indexing a list with a string key raises "list indices must be integers or slices, not str", so we walk through those lists element by element with for loops.
- For each element, we check with `if`/`elif` whether the coordinates are stored under `LookAt` or `Camera`, and `append()` the name and coordinates to the lists we created at the beginning.

In [3]:
station_names=[]
longitude=[]
latitude=[]
A=data['kml']['Document']['Folder']['Folder']
for x in A:
    for k in x['Placemark']:
        if k.get('LookAt'): 
            station_names.append(k['name'])
            longitude.append(k['LookAt']['longitude'])
            latitude.append(k['LookAt']['latitude'])
        elif k.get('Camera'):
            station_names.append(k['name'])
            longitude.append(k['Camera']['longitude'])
            latitude.append(k['Camera']['latitude'])
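
One detail the loop above relies on is that every `Placemark` entry is a list. `xmltodict` returns a list only when a tag repeats; a tag that appears once comes back as a single dictionary. The following is a minimal sketch of how the same traversal could guard against that. The helper names `as_list` and `extract_stations` and the toy input are hypothetical, not part of the original solution.

```python
def as_list(node):
    """Wrap a lone dict in a list so iteration always yields dicts."""
    if node is None:
        return []
    return node if isinstance(node, list) else [node]


def extract_stations(folders):
    """Collect (name, latitude, longitude) from nested Folder/Placemark dicts."""
    rows = []
    for folder in as_list(folders):
        for placemark in as_list(folder.get("Placemark")):
            # The coordinates sit under either 'LookAt' or 'Camera'.
            view = placemark.get("LookAt") or placemark.get("Camera")
            if view:
                rows.append((placemark["name"], view["latitude"], view["longitude"]))
    return rows


# Toy input shaped like the parsed KML; the values are made up for illustration.
toy_folders = [
    {"Placemark": [
        {"name": "A", "LookAt": {"longitude": "29.1", "latitude": "40.9"}},
        {"name": "B", "Camera": {"longitude": "28.9", "latitude": "41.0"}},
    ]},
    # A folder with a single Placemark: a dict, not a list.
    {"Placemark": {"name": "C", "LookAt": {"longitude": "29.0", "latitude": "41.1"}}},
]
stations = extract_stations(toy_folders)
```

With the real data, `extract_stations(data['kml']['Document']['Folder']['Folder'])` would play the role of the nested loops.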

### 1.2)
- In the dataframe we create, the station names form the index and the coordinates form the columns.
- We use the `pd.DataFrame()` function of the pandas library to create the dataframe.
- We build the dictionary named `result` beforehand so that its keys define the column names from the start.

In [4]:
result={'longitude':longitude,'latitude':latitude}
pd.DataFrame(result,index=station_names)

Unnamed: 0,longitude,latitude
MALTEPE,29.13060758098593,40.91681013544846
AHIRKAPI,28.98289668101853,41.00314456999032
BEŞİKTAŞ-1,29.00778819900819,41.04116198628195
BEŞİKTAŞ-2,29.0055048939288,41.04065414312002
BOSTANCI,29.09425745312653,40.95173395654253
EMİNÖNÜ-1,28.97621869809887,41.01495987953694
EMİNÖNÜ-2,28.97621869809887,41.01495987953694
EMİNÖNÜ-3,28.97495985342729,41.01488637107048
EMİNÖNÜ-4,28.97495985342729,41.01488637107048
HAYDARPAŞA,29.01810215560077,40.99577360085738
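
Because `xmltodict` returns every value as a string, the two columns above hold text rather than numbers. A minimal sketch of converting them to floats, with the columns ordered as latitude then longitude as the question words it, could look like this. The two rows are copied from the output above; the name `stations_df` is an assumption for illustration.

```python
import pandas as pd

# First two rows of the output above, as the strings xmltodict would return.
station_names = ["MALTEPE", "AHIRKAPI"]
longitude = ["29.13060758098593", "28.98289668101853"]
latitude = ["40.91681013544846", "41.00314456999032"]

# Build the frame, then convert both coordinate columns from str to float.
stations_df = pd.DataFrame(
    {"latitude": latitude, "longitude": longitude},
    index=station_names,
).astype(float)
```

Numeric columns make later steps such as distance calculations or plotting possible without further conversion.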


# Question 2

For this question we are going to use Istanbul Municipality data at [this link](https://data.ibb.gov.tr/dataset/sehir-hatlari-sefer-sayilari). Data contains Istanbul Deniz Isletmeleri route information.

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Extract the data about the number of trips between stations.
3. Calculate the total number of trips in 2020.
4. Calculate the total number of trips in 2021.
5. Which is the busiest station in 2020 and 2021?

For these questions you must extract the specific data you need from the raw data. Let me be very clear: I am not interested in the numerical answer, I'd like to see your data extraction and calculation explicitly. I need to see your code with which you extract the data, see the data frame where you record the extracted data, and the code where you group and calculate the required results.

### Solution.2

### 2.1)
#### Data 

- This data set lists the Şehir Hatları routes for 2020 and 2021 and, for each route, the total number of trips made in that year.

#### Code

- Importing our libraries:

In [5]:
import pandas as pd
from urllib.request import urlopen

### 2.2)
- We open the URL of our data with the `urlopen()` function.
- We read the CSV data directly from that connection with the `pd.read_csv()` function of the pandas library, using `;` as the separator.
- `pd.read_csv()` returns a dataframe. We set the column names with `.columns` right away, because we will combine the 2020 and 2021 dataframes later and they need matching column names.

In [6]:
link="https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/f1f95d5d-fa2f-479d-9d50-85ca1d604c1e/download/2020-yl-ehir-hatlar-sefer-saylar.csv"
url=urlopen(link) 
df_2020=pd.read_csv(url,encoding= 'unicode_escape',sep=";")
df_2020.columns=['YIL','GÜZERGAH','TOPLAM SEFER ADETİ']
df_2020

Unnamed: 0,YIL,GÜZERGAH,TOPLAM SEFER ADETİ
0,2020,BEÞÝKTAÞ - KADIKÖY,26.879
1,2020,KADIKÖY - KARAKÖY - BEÞÝKTAÞ,13.0
2,2020,EMÝNÖNÜ - ÜSKÜDAR,28.441
3,2020,ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ,8.737
4,2020,KADIKÖY - EMÝNÖNÜ,18.408
5,2020,KADIKÖY - KARAKÖY,25.658
6,2020,KABATAÞ - KADIKÖY - ADALAR - BOSTANCI,5.879
7,2020,ÝSTANBUL - ADALAR,4.542
8,2020,KADIKÖY - KARAKÖY - EMÝNÖNÜ,11.156
9,2020,BOÐAZ GÝDÝÞ GELÝÞ (EMÝNÖNÜ - BEÞÝKTAÞ -KUZGUN...,523.0


### 2.3)
- The `groupby()` function lets us aggregate a column within each group. Grouping by `YIL` and calling `sum()` gives the total number of trips in 2020.

In [7]:
Sum_2020=df_2020.groupby("YIL").sum()
Sum_2020

Unnamed: 0_level_0,TOPLAM SEFER ADETİ
YIL,Unnamed: 1_level_1
2020,5851.006
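
The trip counts include fractional values such as 26.879, which cannot be a number of trips. A likely explanation, stated here as an assumption rather than a confirmed fact about the file, is that `.` is used as a thousands separator, so 26.879 stands for 26879 trips and the total above is too small. A minimal sketch on a made-up sample with the same layout shows how `pd.read_csv()` can be told about the separators:

```python
import io
import pandas as pd

# Made-up sample in the same layout as the 2020 file: ';' separated columns,
# '.' inside the counts. Column names are simplified for illustration.
sample = io.StringIO(
    "YIL;GUZERGAH;SEFER\n"
    "2020;A - B;26.879\n"
    "2020;B - C;523\n"
)

# thousands='.' strips the dots; decimal=',' follows the Turkish convention.
trips = pd.read_csv(sample, sep=";", thousands=".", decimal=",")
total = trips["SEFER"].sum()
```

If the assumption holds for the real files, the same two arguments would apply to both the 2020 and the 2021 downloads.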


### 2.4)
- We took the data for 2020 above, now we are pulling the data for 2021.

In [8]:
link="https://data.ibb.gov.tr/dataset/4be0d5f6-62de-4a24-a6a5-038009cde39f/resource/d2c7e4c3-fd09-4952-8a8e-776e3accf91d/download/2021-yl-ehir-hatlar-sefer-saylar.csv"
url_2=urlopen(link) 
df_2021=pd.read_csv(url_2,encoding= 'unicode_escape',sep=";")
df_2021.columns=['YIL','GÜZERGAH','TOPLAM SEFER ADETİ']
df_2021

Unnamed: 0,YIL,GÜZERGAH,TOPLAM SEFER ADETİ
0,2021.0,BEÞÝKTAÞ-KADIKÖY,23.658
1,2021.0,EMÝNÖNÜ-ÜSKÜDAR,23.854
2,2021.0,EMÝNÖNÜ-KADIKÖY,18.298
3,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-KUZGUNCUK-BEYLERBEYÝ-ÇENGELKÖ...,497.000
4,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-ORTAKÖY-EMÝRGAN-PAÞABAHÇE-BEY...,545.000
...,...,...,...
70,,,
71,,,
72,,,
73,,,


- The table ends with rows that contain only missing values. We count the missing values in each column with `isnull().sum()`.

In [9]:
df_2021.isnull().sum()

YIL                   37
GÜZERGAH              37
TOPLAM SEFER ADETİ    37
dtype: int64

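- All three columns have the same number of missing values, which fits the fully empty rows seen at the end of the table. These rows carry no information, so we delete them with `dropna(how="all")`, which removes only rows in which every value is missing.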

In [10]:
df_new_2021=df_2021.dropna(how="all")
df_new_2021

Unnamed: 0,YIL,GÜZERGAH,TOPLAM SEFER ADETİ
0,2021.0,BEÞÝKTAÞ-KADIKÖY,23.658
1,2021.0,EMÝNÖNÜ-ÜSKÜDAR,23.854
2,2021.0,EMÝNÖNÜ-KADIKÖY,18.298
3,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-KUZGUNCUK-BEYLERBEYÝ-ÇENGELKÖ...,497.0
4,2021.0,EMÝNÖNÜ-BEÞÝKTAÞ-ORTAKÖY-EMÝRGAN-PAÞABAHÇE-BEY...,545.0
5,2021.0,ÇENGELKÖY-BEÞÝKTAÞ-EMÝNÖNÜ,433.0
6,2021.0,KADIKÖY-KARAKÖY,6.168
7,2021.0,KADIKÖY-KARAKÖY-EMÝNÖNÜ,18.304
8,2021.0,KABATAÞ-KADIKÖY-ADALAR,7.046
9,2021.0,BOSTANCI- BÜYÜKADA-HEYBELÝADA,940.0


- After removing the empty rows, we compute the total number of trips in 2021 in the same way as for 2020.

In [11]:
Sum_2021=df_new_2021.groupby("YIL").sum()
Sum_2021

Unnamed: 0_level_0,TOPLAM SEFER ADETİ
YIL,Unnamed: 1_level_1
2021.0,8956.095


### 2.5)
- To process 2020 and 2021 together, we combine the two dataframes with the `pd.concat()` function from the pandas library. The `ignore_index=True` argument discards the original indexes and numbers the rows sequentially from 0.

In [12]:
Final_Data=pd.concat([df_2020,df_new_2021],ignore_index=True)
Final_Data

Unnamed: 0,YIL,GÜZERGAH,TOPLAM SEFER ADETİ
0,2020.0,BEÞÝKTAÞ - KADIKÖY,26.879
1,2020.0,KADIKÖY - KARAKÖY - BEÞÝKTAÞ,13.000
2,2020.0,EMÝNÖNÜ - ÜSKÜDAR,28.441
3,2020.0,ÜSKÜDAR - KARAKÖY - EMÝNÖNÜ,8.737
4,2020.0,KADIKÖY - EMÝNÖNÜ,18.408
...,...,...,...
68,2021.0,ADALAR-BEÞÝKTAÞ,3.262
69,2021.0,BÜYÜKADA-SEDEFADASI,389.000
70,2021.0,EMÝNÖNÜ- BEYLERBEYÝ,249.000
71,2021.0,BEYLERBEYÝ- KUZGUNCUK- BEÞÝKTAÞ- EMÝNÖNÜ,248.000


- We do not use `groupby()` to find the busiest station in `Final_Data`, because we want to see the whole row and not just an aggregated number. Instead, we filter the dataframe with a condition inside square brackets that keeps only the rows whose trip count equals the maximum.
- The `==` comparison produces True or False for every row, and only the rows where it is True are returned.
- The result is the route with the most trips across 2020 and 2021 combined, which we read as the busiest station(s).

In [13]:
Final_Data[Final_Data['TOPLAM SEFER ADETİ']==Final_Data['TOPLAM SEFER ADETİ'].max()]

Unnamed: 0,YIL,GÜZERGAH,TOPLAM SEFER ADETİ
57,2021.0,BEBEK-ANADOLU HÝSARI-KANLICA-EMÝRGAN,969.0
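
The filter above returns a route, not a single station, and one row for both years together. One possible way to get a per-station answer for each year is to split every route into its stations and add up the trips. The sketch below uses made-up data and simplified column names, and it makes two assumptions: station names themselves contain no `-`, and every trip on a route counts once for each station on that route.

```python
import pandas as pd

# Made-up routes; separators vary in spacing as they do in the real data.
routes = pd.DataFrame({
    "YIL": [2020, 2020, 2021, 2021],
    "GUZERGAH": ["A - B", "B - C", "A-C", "C- D"],
    "SEFER": [10, 5, 7, 20],
})

# One row per (route, station): split on '-', explode, strip the spaces.
exploded = (
    routes.assign(ISTASYON=routes["GUZERGAH"].str.split("-"))
    .explode("ISTASYON")
    .reset_index(drop=True)
)
exploded["ISTASYON"] = exploded["ISTASYON"].str.strip()

# Total trips per station and year, then the top station of each year.
per_station = exploded.groupby(["YIL", "ISTASYON"], as_index=False)["SEFER"].sum()
busiest = per_station.loc[per_station.groupby("YIL")["SEFER"].idxmax()]
```

`idxmax()` returns only the first maximum, so ties between stations would need a separate check.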


# Question 3

Using a different [dataset](https://data.ibb.gov.tr/dataset/istanbul-deniz-iskeleleri-yolcu-sayilari) again from Istanbul Municipality on Istanbul Deniz Isletmeleri: 

1. Understand what the data is for, what it records, what it contains. Explain it in broad strokes.
2. Find out the busiest station in the years 2020 and 2021,
3. Repeat the same calculation monthly: find the busiest stations for each month.
4. Does your calculation of busiest stations agree with the calculation you made in Question #2? Explain.


### Solution.3

### 3.1)
#### Data

- This dataset includes the number of passengers passing through the sea piers in Istanbul. 
- Since our file is of csv type, we obtain our data using urlopen() and pd.read_csv() as in the 2nd question.

#### Code

- Importing our libraries

In [14]:
import pandas as pd
from urllib.request import urlopen

- In our data, the year, month, operators of the vehicles, station name and number of passengers are available in order.

In [15]:
link="https://data.ibb.gov.tr/dataset/20f33ff0-1ab3-4378-9998-486e28242f48/resource/6fbdd928-8c37-43a4-8e6a-ba0fa7f767fb/download/istanbul-deniz-iskeleleri-yolcu-saylar.csv"
url=urlopen(link) 
iskele_bilgisi=pd.read_csv(url,encoding= 'unicode_escape',sep=";")
iskele_bilgisi

Unnamed: 0,Yil,Ay,Otorite Adi,Istasyon Adi,Yolcu Sayisi
0,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,BEYKOZ,5076
1,2021,3,Beyden Deniz Ulaþým Hizmetleri Turizm ve Tic. ...,YENIKOY,5347
2,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,BESIKTAS,106334
3,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,KABATAS,24
4,2021,3,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,USKUDAR,94200
...,...,...,...,...,...
656,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Eminönü,55387
657,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Balon,40680
658,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Kadýköy Çayýrbaþý,69443
659,2021,11,TURYOL - S.S. TURÝZM VE YOLCU DENÝZ TAÞIYICILA...,Karaköy,55098


### 3.2)
- We do not use `groupby()` to find the busiest station in `iskele_bilgisi`, because we want to see the whole row. Instead, we filter the dataframe with a condition inside square brackets that keeps only the rows whose passenger count equals the maximum.
- The `==` comparison produces True or False for every row, and only the rows where it is True are returned.
- This returns the station, operator, and month with the highest single passenger count in 2021. (We cannot answer for 2020 because it is not available in our dataset.)
- BESIKTAS is the busiest station in 2021.

In [16]:
iskele_bilgisi[iskele_bilgisi['Yolcu Sayisi']==iskele_bilgisi['Yolcu Sayisi'].max()]

Unnamed: 0,Yil,Ay,Otorite Adi,Istasyon Adi,Yolcu Sayisi
516,2021,10,DENTUR - AVRASYA DENÝZ TAÞ.TUR.HÝZ.ÝNÞ.SAN.TÝC...,BESIKTAS,757374
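
The row above is the largest single monthly figure, which is not necessarily the station with the most passengers over the whole year. A minimal sketch of the yearly version, on made-up data, sums the passengers per station first and then picks the largest total. It assumes that each station is spelled the same way in every row, which may not hold in the real data.

```python
import pandas as pd

# Made-up data: Y has the largest single month, X the largest yearly total.
passengers = pd.DataFrame({
    "Yil": [2021, 2021, 2021, 2021],
    "Ay": [3, 3, 4, 4],
    "Istasyon Adi": ["X", "Y", "X", "Y"],
    "Yolcu Sayisi": [300, 100, 300, 450],
})

# Total passengers per station and year.
yearly = passengers.groupby(["Yil", "Istasyon Adi"], as_index=False)["Yolcu Sayisi"].sum()

# The station with the largest total in each year.
busiest_per_year = yearly.loc[yearly.groupby("Yil")["Yolcu Sayisi"].idxmax()]
```

In this toy example the single largest row belongs to Y, while the yearly total favours X, so the two approaches can disagree.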


### 3.3)
- We use the `groupby()` function to find, for each month, the highest passenger count recorded for a single station. Grouping by `Ay` collapses the repeated month values so that each month yields exactly one result.

In [17]:
res=iskele_bilgisi.groupby(['Ay'])['Yolcu Sayisi'].max()
pd.DataFrame(res)

Unnamed: 0_level_0,Yolcu Sayisi
Ay,Unnamed: 1_level_1
3,106334
4,274984
5,205662
6,488048
7,590792
8,601089
9,633355
10,757374
11,223286
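
The table above gives the maximum count for each month but not the station it belongs to. One way to keep the station name is to ask `groupby()` for the row label of each maximum with `idxmax()` and then select those rows. This is a sketch on made-up data, not the original solution.

```python
import pandas as pd

# Made-up monthly passenger counts for two stations.
passengers = pd.DataFrame({
    "Ay": [3, 3, 4, 4],
    "Istasyon Adi": ["X", "Y", "X", "Y"],
    "Yolcu Sayisi": [100, 250, 400, 80],
})

# idxmax() gives the index label of the largest count within each month;
# .loc then returns the full rows, including the station names.
top_rows = passengers.loc[passengers.groupby("Ay")["Yolcu Sayisi"].idxmax()]
```

Applied to `iskele_bilgisi`, the same expression would return one full row per month.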


### 3.4)
- The busiest station in 2021 differs between the 2nd and 3rd questions. In the 2nd question, the "BEBEK-ANADOLU HİSARI-KANLICA-EMİRGAN" route had the most trips. In the 3rd question, "BESIKTAS" was the busiest station.
- The conclusion I draw from this is that a high total number of trips does not by itself make a station the busiest: Question 2 counts trips per route, while Question 3 counts passengers per station.