# NYC-Flights

## 01 - Problem (case study)

### Abstract

NYC-Flights

**Objectives:**
+ Connect Python to a MySQL server
+ Use SQL queries to answer the following questions



**Questions:**

1. How many flights were there from NYC airports to Seattle in 1993?
2. How many airlines fly from NYC to Seattle?
3. How many unique airplanes have flown from NYC to Seattle?
4. What is the average arrival delay for flights from NYC to Seattle?
5. What proportion of flights to Seattle come from each NYC airport?
6. Which date has the largest average departure delay? 
7. Which date has the largest average arrival delay?
8. What was the worst day to fly out of NYC in 1997 if you dislike delayed flights?
9. Are there any seasonal patterns in departure delays for flights from NYC?
10. On average, how do departure delays vary over the course of a year?
11. Which flight departing NYC in the last decade flew the fastest?
12. Which flights (i.e. carrier + flight + dest) happen every day? Where do they fly to?
13. Which carriers have been the top and the bottom performers in 1999?

## 02 - Getting Data

In [None]:
import pandas as pd

In [None]:
data=pd.read_csv('data-nyc/1990.csv')

data.head()

In [None]:
data.info(memory_usage='deep')

In [None]:
data=data.fillna('null')  # replace missing values (NaN) with the string 'null'

In [None]:
data.head()

**from pandas to SQL**

### Connection Python-SQL

In [None]:
import mysql.connector  # connection with sql

In [None]:
db_name='Flights_NYC'           # database name

In [None]:
# create connection
create_db=mysql.connector.connect(host='localhost', user='root', passwd='password')

cursor=create_db.cursor()   

In [None]:
# drop database (if exists) and create empty database

cursor.execute('drop database if exists {}'.format(db_name))
cursor.execute('create database {}'.format(db_name))

In [None]:
# check, show databases 

cursor.execute('show databases')
for x in cursor:
  print(x)

### Load Data into SQL

In [None]:
# connection to database

db=mysql.connector.connect(host='localhost', user='root', passwd='password', database=db_name)

cursor=db.cursor()

In [None]:
# create table in the database

import re     # regex

table_name='_1990'     

cursor.execute('use {};'.format(db_name))

cursor.execute('drop table if exists {};'.format(table_name)) 

# column names and SQL types taken from the dataframe:
# int64 -> int, float64 -> float, object -> text
names_dtypes=[' '.join(f) for f in zip(data.columns,
                                       [re.findall('[a-t]+',str(e))[0] if e!='object' else 'text' for e in data.dtypes.tolist()])]

table='create table {}({});'.format(table_name, ', '.join(names_dtypes))

table
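The list comprehension above derives each column's SQL type by keeping the leading run of letters a to t from the pandas dtype name (so `int64` becomes `int` and `float64` becomes `float`) and by mapping `object` to `text`. A minimal sketch of the same mapping with an explicit lookup table follows. The names `DTYPE_TO_SQL`, `sql_type` and `create_table_query` are hypothetical, not part of the notebook, and the fallback to `text` for any other dtype is an assumption.

```python
# Hypothetical explicit mapping from pandas dtype names to MySQL column types.
DTYPE_TO_SQL = {'int64': 'int', 'float64': 'float', 'object': 'text'}


def sql_type(dtype_name):
    """Return the SQL type for a dtype name; unknown dtypes fall back to text (assumption)."""
    return DTYPE_TO_SQL.get(str(dtype_name), 'text')


def create_table_query(table_name, columns, dtype_names):
    """Build a 'create table' statement from column names and dtype names."""
    cols = ', '.join('{} {}'.format(c, sql_type(d))
                     for c, d in zip(columns, dtype_names))
    return 'create table {}({});'.format(table_name, cols)


# Column names as used in the queries of this notebook; dtypes chosen for illustration.
example = create_table_query('_1990',
                             ['Year', 'DepDelay', 'TailNum'],
                             ['int64', 'float64', 'object'])
print(example)
```

With a dataframe at hand, the call would presumably be `create_table_query(table_name, data.columns, data.dtypes)`, since `sql_type` converts each dtype to its string name first.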

In [None]:
cursor.execute(table) # make query

In [None]:
# check tables

cursor.execute('show tables')
for x in cursor:
  print(x)

In [None]:
%%time
results=list(data.T.to_dict().values())  # convert the dataframe into a list of row dicts

In [None]:
%%time
for i in range(len(results)):  # insert query
    
    insert_query='insert into {} ({}) values {};'\
                  .format(table_name, ','.join(results[i].keys()), tuple(results[i].values()))
    cursor.execute(insert_query)
    
    
db.commit()  # save database

In [None]:
insert_query  # last insertion
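The insert loop above writes the values directly into the SQL text, which fails as soon as a value contains a quote and sends one statement per row. A common alternative is a parameterized statement run with `executemany`. The sketch below shows the idea under two loudly stated assumptions: it uses the standard-library `sqlite3` module as a stand-in so that it runs without a MySQL server, and the rows are made-up values. With `mysql.connector` the placeholder is `%s` instead of `?`. The helper `build_insert_query` is hypothetical.

```python
import sqlite3


def build_insert_query(table_name, columns, placeholder='%s'):
    """Build a parameterized insert statement; the values are passed separately."""
    marks = ', '.join([placeholder] * len(columns))
    return 'insert into {} ({}) values ({});'.format(
        table_name, ', '.join(columns), marks)


# Toy rows standing in for `results` (made-up values, not real flight data).
rows = [{'Year': 1990, 'Dest': 'SEA', 'TailNum': "N'1"},   # note the quote
        {'Year': 1990, 'Dest': 'ORD', 'TailNum': 'null'}]
columns = list(rows[0].keys())

conn = sqlite3.connect(':memory:')   # stand-in for the MySQL connection
cur = conn.cursor()
cur.execute('create table _1990 (Year int, Dest text, TailNum text)')

query = build_insert_query('_1990', columns, placeholder='?')  # sqlite3 uses ?
cur.executemany(query, [tuple(r[c] for c in columns) for r in rows])
conn.commit()

cur.execute('select count(*) from _1990')
n_rows = cur.fetchone()[0]
print(query, n_rows)
```

Because the driver handles the quoting, the value containing a single quote is stored without any manual escaping.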

In [None]:
# checking

cursor.execute('select * from {} limit 3'.format(table_name))
for x in cursor:
  print(x)

In [None]:
# all together

def to_sql(year):
    
    table_name='_199{}'.format(year)
    data=pd.read_csv('data-nyc/199{}.csv'.format(year)).fillna('null')
         
    cursor.execute('drop table if exists {};'.format(table_name)) 
    names_dtypes=[' '.join(f) for f in zip(data.columns,
                                           [re.findall('[a-t]+',str(e))[0] if e!='object' else 'text' for e in data.dtypes.tolist()])]

    table='create table {}({});'.format(table_name, ', '.join(names_dtypes))
    cursor.execute(table)

    
    results=list(data.T.to_dict().values())
    
    for i in range(len(results)):

        insert_query='insert into {} ({}) values {};'\
                      .format(table_name, ','.join(results[i].keys()), tuple(results[i].values()))
        cursor.execute(insert_query)


    db.commit()

In [None]:
%%time
from tqdm import tqdm

for i in tqdm(range(10), desc='Data to SQL-->'):
    to_sql(i)

In [None]:
# check tables

cursor.execute('show tables')
for x in cursor:
  print(x)

### Data from SQL

In [None]:
def from_sql(cursor, query):
    print ('Query:\n{}\n'.format(query))
    
    cursor.execute(query)
    data=cursor.fetchall()

    df=pd.DataFrame(data, columns=cursor.column_names)

    print ('Data read from MySQL.')

    return df
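`from_sql` reads the column names from `cursor.column_names`, an attribute provided by `mysql.connector`. A more portable variant can take the names from `cursor.description`, which the Python DB-API defines for every driver. The sketch below is only an illustration: `rows_as_dicts` is a hypothetical helper, `sqlite3` is used as a stand-in connection so the code runs on its own, and the inserted row is made up.

```python
import sqlite3


def rows_as_dicts(cursor, query):
    """Run a query and return its rows as dicts keyed by column name."""
    cursor.execute(query)
    # Item 0 of each cursor.description entry is the column name.
    names = [col[0] for col in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


conn = sqlite3.connect(':memory:')   # stand-in connection for illustration only
cur = conn.cursor()
cur.execute('create table _1990 (Origin text, Dest text)')
cur.execute("insert into _1990 values ('JFK', 'SEA')")   # made-up row

records = rows_as_dicts(cur, 'select Origin, Dest from _1990')
print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame` if a dataframe is needed.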

In [None]:
# reconnect to the database

db_name='Flights_NYC'

db=mysql.connector.connect(host='localhost', user='root', passwd='password', database=db_name)

cursor=db.cursor()

# Queries

### One Table

In [None]:
%%time

query='select * from _1990'

from_sql(cursor, query)

In [None]:
%%time

query='''select * 
          from 
          information_schema.columns 
          where table_name='_1999';'''


from_sql(cursor, query)

### All Data

In [None]:
%%time

query='''select * from
         information_schema.tables
            where table_schema='Flights_NYC';'''


from_sql(cursor, query)

In [None]:
%%time

query='''select * from _1990
         union all
         select * from _1992;'''


from_sql(cursor, query)

In [None]:
%%time

query='select * from _1990 '+\
      ' '.join(['union all select * from _199{}'.format(i+1) for i in range(3)])+';'


from_sql(cursor, query)
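The cells above assemble a `union all` query by string concatenation. A small helper makes the range of years explicit; this is a sketch, `union_all_query` is a hypothetical name, and it assumes that every table is named `_<year>` as in this notebook.

```python
def union_all_query(years):
    """Build 'select * from _Y1 union all select * from _Y2 ...' for the given years."""
    return ' union all '.join('select * from _{}'.format(y) for y in years)


query = union_all_query(range(1990, 1994))
print(query)
```

The result carries no trailing semicolon, so it can also be embedded as a subquery.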

### Cleaning

In [None]:
%%time

query='''
        select * from
        _1990;
'''


from_sql(cursor, query).shape

In [None]:
%%time

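# missing values were loaded as the string 'null', not as SQL NULL,
# so `is not null` keeps every row here
query='''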
        select * from
        _1990
        where DepTime is not null and
              ArrTime is not null;

'''


from_sql(cursor, query)

In [None]:
%%time

# filter on the 'null' string placeholder instead
query='''
        select * from
        _1990
        where (DepTime!='null') and 
              (ArrTime!='null') ;

'''


from_sql(cursor, query)
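The two cleaning queries differ because a text value `'null'` is an ordinary string, while SQL `NULL` means "no value". The sketch below reproduces the difference on a three-row toy table. It uses `sqlite3` as a stand-in database so that it runs without a server, and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # stand-in database for illustration only
cur = conn.cursor()
cur.execute('create table demo (DepTime text)')
# Made-up rows: a real value, the string placeholder 'null', and a true SQL NULL.
cur.executemany('insert into demo values (?)', [('1530',), ('null',), (None,)])

# The placeholder string is a regular value, so it passes this filter.
cur.execute('select count(*) from demo where DepTime is not null')
kept_by_is_not_null = cur.fetchone()[0]

# Comparing against the string removes the placeholder; the true NULL is
# dropped too, because a comparison with NULL is never true.
cur.execute("select count(*) from demo where DepTime != 'null'")
kept_by_string_compare = cur.fetchone()[0]

print(kept_by_is_not_null, kept_by_string_compare)
```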

In [None]:
%%time

all_time_query='select * from _1990 '+\
               ' '.join(['union all select * from _199{}'.format(i+1) for i in range(9)])


clean_query='''
            select * from
            ({}) t
            where (t.DepTime!='null') and 
                  (t.ArrTime!='null') and
                  (t.TailNum!='null')

    '''.format(all_time_query)



from_sql(cursor, clean_query)

## 03 - Questions

### 1. How many flights were there from NYC airports to Seattle in 1993?

In [None]:
%%time

query='''
        select count(*) as Flights from
        _1993
        where (Dest='SEA');

'''


from_sql(cursor, query)

### 2. How many airlines fly from NYC to Seattle?

In [None]:
%%time

query='''
        select UniqueCarrier as Airline, count(UniqueCarrier) as Flights
        from _1999
        where (Dest='SEA')
        group by UniqueCarrier;

'''

from_sql(cursor, query)

### 3. How many unique airplanes have flown from NYC to Seattle?

In [None]:
%%time

query='''
        select c.TailNum , count(c.TailNum) as Flights
        from
        ({}) c 
        where (c.Dest='SEA')
        group by c.TailNum;
        
'''.format(clean_query)


from_sql(cursor, query)

### 4. What is the average arrival delay for flights from NYC to Seattle?

In [None]:
%%time

query='''
        select avg(a.ArrDelay) as AvgDelay
        from
        ({}) a
        where(a.Dest='SEA');

'''.format(all_time_query)


from_sql(cursor, query)

### 5. What proportion of flights to Seattle come from each NYC airport?

In [None]:
%%time

query='''
        select Origin , 
        (count(Origin)*100/(select count(*) from _1999 where (Dest='SEA'))) as PropFlights
        from 
        _1999
        where (Dest='SEA')
        group by Origin;

'''


from_sql(cursor, query)

In [None]:
%%time

query='''
        select a.Origin , 
        (count(a.Origin)*100/(select count(*) from ({}) as p where (Dest='SEA'))) as PropFlights
        from 
        ({}) a
        where (a.Dest='SEA')
        group by a.Origin;

'''.format(all_time_query, all_time_query)


from_sql(cursor, query)

### 6. Which date has the largest average departure delay? 

In [None]:
%%time

query='''
        select a.DayofMonth as Day, a.Month, a.Year, avg(a.DepDelay) as DepDelay
        from ({}) a
        group by a.DayofMonth, a.Month, a.Year
        order by DepDelay desc 
        limit 10;

'''.format(clean_query)


from_sql(cursor, query)

### 7. Which date has the largest average arrival delay?

In [None]:
%%time

query='''
        select a.DayofMonth as Day, a.Month, a.Year, avg(a.ArrDelay) as ArrDelay
        from ({}) a
        group by a.DayofMonth, a.Month, a.Year
        order by ArrDelay desc 
        limit 10;

'''.format(clean_query)


from_sql(cursor, query)

### 8. What was the worst day to fly out of NYC in 1997 if you dislike delayed flights?

In [None]:
%%time

query='''
        select DayofMonth as Day, Month, Year, avg(DepDelay) as DepDelay
        from _1997
        group by DayofMonth, Month, Year
        order by DepDelay desc 
        limit 10;

'''


from_sql(cursor, query)

### 9. Are there any seasonal patterns in departure delays for flights from NYC?

In [None]:
%%time

query='''
        select Month, Year, avg(DepDelay) as DepDelay
        from _1999
        group by Month, Year;

'''


from_sql(cursor, query)

In [None]:
%%time
%matplotlib inline

query='''
        select Month, Year, avg(DepDelay) as DepDelay
        from _1999
        group by Month, Year;

'''


from_sql(cursor, query).plot(x='Month', y='DepDelay');

In [None]:
%%time

query='''
        select a.Month, a.Year, avg(a.DepDelay) as DepDelay
        from
        ({}) a
        group by a.Month, a.Year;

'''.format(all_time_query)


from_sql(cursor, query)
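To read a seasonal pattern out of the monthly averages, the months can be grouped into seasons. The sketch below is one possible way to do that. The season boundaries (December to February as winter, and so on) are an assumption, the helper `average_by_season` is hypothetical, and the input values are made up. It averages the monthly averages without weighting by the number of flights, so it is only a rough summary.

```python
# Assumed mapping of month number to season (northern hemisphere).
SEASONS = {12: 'winter', 1: 'winter', 2: 'winter',
           3: 'spring', 4: 'spring', 5: 'spring',
           6: 'summer', 7: 'summer', 8: 'summer',
           9: 'autumn', 10: 'autumn', 11: 'autumn'}


def average_by_season(monthly):
    """Average (month, delay) pairs per season; unweighted mean of monthly means."""
    delays = {}
    for month, delay in monthly:
        delays.setdefault(SEASONS[month], []).append(delay)
    return {season: sum(values) / len(values) for season, values in delays.items()}


# Made-up monthly averages, not results from the database.
monthly = [(1, 10.0), (2, 14.0), (7, 20.0), (8, 22.0)]
by_season = average_by_season(monthly)
print(by_season)
```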

### 10. On average, how do departure delays vary over the course of a year?

In [None]:
%%time

query='''
        select Month, avg(DepDelay) as AvgDepDelay
        from _1999
        group by Month;

'''

from_sql(cursor, query)

### 11. Which flight departing NYC in the last decade flew the fastest?

In [None]:
%%time

query='''
        select a.DayofMonth, a.Month, a.Year, a.Distance/a.AirTime as Speed, a.UniqueCarrier as Airline, a.TailNum
        from
        ({}) a
        where (a.Distance!='null') and (a.AirTime!='null')
        order by Speed desc
        limit 5;

'''.format(all_time_query)


from_sql(cursor, query)

### 12. Which flights (i.e. carrier + flight + dest) happen every day? Where do they fly to?

In [None]:
%%time

query='''
        select UniqueCarrier as Airline, FlightNum, Dest, count(Dest) as Flights
        from
        _1999
        group by UniqueCarrier, FlightNum, Dest
        order by Flights desc
        limit 10;

'''


from_sql(cursor, query)
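Ranking by flight count only approximates "every day": a flight with several departures on some days can outrank one that flies exactly once a day. A stricter check is to compare the set of dates on which each flight appears with the set of all dates in the data. The sketch below does this in plain Python on made-up records. The helper `daily_flights` is hypothetical, and "every day" here means every date present in the records, not every calendar day.

```python
from collections import defaultdict


def daily_flights(records):
    """Return the (carrier, flight, dest) keys that appear on every date in the records."""
    all_dates = set()
    dates_by_flight = defaultdict(set)
    for carrier, flight_num, dest, month, day in records:
        all_dates.add((month, day))
        dates_by_flight[(carrier, flight_num, dest)].add((month, day))
    return sorted(key for key, dates in dates_by_flight.items()
                  if dates == all_dates)


# Made-up records covering three dates: (carrier, flight number, dest, month, day).
records = [('AA', 1, 'SEA', 1, 1), ('AA', 1, 'SEA', 1, 2), ('AA', 1, 'SEA', 1, 3),
           ('UA', 7, 'ORD', 1, 1), ('UA', 7, 'ORD', 1, 3),
           ('AA', 1, 'SEA', 1, 3)]   # a second AA 1 departure on the same date
every_day = daily_flights(records)
print(every_day)
```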

### 13. Which carriers have been the top and the bottom performers in 1999?

In [None]:
%%time

query='''
        select UniqueCarrier as Airline, avg(DepDelay) as AvgDepDelay
        from
        _1999
        group by Airline
        order by AvgDepDelay asc;

'''


from_sql(cursor, query)

In [None]:
%%time

query='''
        select UniqueCarrier as Airline, avg(ArrDelay) as AvgArrDelay
        from
        _1999
        group by Airline
        order by AvgArrDelay asc;

'''


from_sql(cursor, query)

In [None]:
%%time

query='''
        select a.UniqueCarrier as Airline, avg(a.DepDelay) as AvgDepDelay
        from
        ({}) a
        group by Airline
        order by AvgDepDelay asc;

'''.format(all_time_query)


from_sql(cursor, query)