# Flights to SQLite database

Using https://jhub.rc.ufl.edu we can only access /home, so make a symbolic link to the flights file on the command line:
```bash
ln -s /ufrc/zoo6927/share/Class_Files/data/flights.May2017-Apr2018.csv .
```

In [1]:
# quick check that this worked

flights=open("/ufrc/zoo6927/share/Class_Files/data/flights.May2017-Apr2018.csv")

count=0
for Line in flights:
    count+=1
print(count)

flights.close()

6156045
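
The same check can be written with a `with` block, which closes the file automatically even if something goes wrong while reading. This is only a minimal sketch; `count_lines` is a hypothetical helper name, not something from the class files.

```python
def count_lines(path):
    """Return the number of lines in a text file."""
    # The with-block closes the file even if reading raises an error.
    with open(path) as handle:
        return sum(1 for _ in handle)
```

Calling `count_lines("flights.May2017-Apr2018.csv")` in the directory that holds the symbolic link should give the same count as the loop above.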


Here's the metadata from the flights file: https://github.com/CompTools/Class_Files/blob/master/data/flights_metadata.md


## Picking up from Wednesday...
I've modified the code a bit from where we left off.

I changed the code to load the tables if they exist and to create them if they don't.

I also added all the columns to the Flights table.

In [47]:
from sqlalchemy import create_engine
from sqlalchemy import Table, Column, Integer, String, MetaData, ForeignKey
from sqlalchemy import DateTime, Boolean
from sqlalchemy import exists
from sqlalchemy import sql, select, join, desc

# Create a sqlite database 
engine = create_engine('sqlite:///zoo6927/flights.sqlite')

metadata=MetaData(engine)

# Try to load Airports info from database, if not there, create it.
try:
    Airports=Table('Airports', metadata, autoload=True)
except:
    Airports = Table ('Airports', metadata,
                Column('ID', Integer, autoincrement=True),
                Column('Code', String, primary_key=True),
                Column('City', String),
                Column('State', String),
                Column('Name', String),
               )

# Same for Flights table.
try:
    Flights=Table('Flights', metadata, autoload=True)
except:
    Flights = Table ('Flights', metadata,
                 Column('Fl_date', DateTime),
                 Column('Airline_ID', String),
                 Column('Origin', String, ForeignKey("Airports.Code")),
                 Column('Destination', String, ForeignKey("Airports.Code")),
                 Column('Dep_Time', String),
                 Column('Dep_Delay_New', Integer),
                 Column('Arr_Time', String),
                 Column('Arr_Delay_New', Integer),
                 Column('Cancelled', Boolean),
                 Column('Cancellation_Code', String),
                 Column('Diverted', Boolean),
                 Column('Air_Time', String),
                 Column('Flights', Integer),
                 Column('Distance', Integer),
                 Column('Carrier_Delay', Integer),
                 Column('Weather_Delay', Integer),
                 Column('NAS_Delay', Integer),
                 Column('Security_Delay', Integer),
                 Column('Late_Aircraft_Delay', Integer)
                )
                 
metadata.create_all(engine)

## Cheating??

While playing with this and thinking about it, I realized that the Airports table should contain each airport only once. I tried to find ways to do this in SQLAlchemy, but they were slow and not entirely clear, so I am cheating a bit...
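
For comparison, here is one way the database itself could skip duplicates. It is not the approach used in this notebook. The sketch uses the standard-library `sqlite3` module and SQLite's `INSERT OR IGNORE`, with an in-memory database and a few hand-typed sample rows (both are assumptions for illustration), so it does not touch `flights.sqlite`.

```python
import sqlite3

# In-memory database so the sketch does not touch flights.sqlite.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE Airports (Code TEXT PRIMARY KEY, City TEXT, State TEXT)"
)

# Hypothetical sample rows; LAX appears twice on purpose.
rows = [
    ("LAX", "Los Angeles, CA", "CA"),
    ("SFO", "San Francisco, CA", "CA"),
    ("LAX", "Los Angeles, CA", "CA"),
]

# OR IGNORE skips any row whose primary key (Code) already exists.
demo.executemany("INSERT OR IGNORE INTO Airports VALUES (?, ?, ?)", rows)
demo.commit()

airport_count = demo.execute("SELECT COUNT(*) FROM Airports").fetchone()[0]
print(airport_count)  # 2: the duplicate LAX row was skipped
```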

Read through the file once and build a dictionary with the information we want. The `if Line['ORIGIN'] not in Airport_dict:` construct ensures that only unique airports are added.

In [49]:
import csv
flights=open("/ufrc/zoo6927/share/Class_Files/data/flights.1K.csv")


reader = csv.DictReader(flights)

Airport_dict={}

# Read through the file and make a dictionary for airport codes.
# This gets a unique list of airport codes.
for Line in reader:
    if Line['ORIGIN'] not in Airport_dict:
        Airport_dict[Line['ORIGIN']]=[Line['ORIGIN_CITY_NAME'], Line['ORIGIN_STATE_ABR']]

    if Line['DEST'] not in Airport_dict:
        Airport_dict[Line['DEST']]=[Line['DEST_CITY_NAME'], Line['DEST_STATE_ABR']]

print(Airport_dict)


{'LAX': ['Los Angeles, CA', 'CA'], 'IAD': ['Washington, DC', 'VA'], 'SAN': ['San Diego, CA', 'CA'], 'SFO': ['San Francisco, CA', 'CA'], 'EWR': ['Newark, NJ', 'NJ'], 'JFK': ['New York, NY', 'NY'], 'OGG': ['Kahului, HI', 'HI'], 'SEA': ['Seattle, WA', 'WA'], 'DCA': ['Washington, DC', 'VA'], 'ORD': ['Chicago, IL', 'IL'], 'AUS': ['Austin, TX', 'TX'], 'PDX': ['Portland, OR', 'OR'], 'LAS': ['Las Vegas, NV', 'NV'], 'MCO': ['Orlando, FL', 'FL'], 'FLL': ['Fort Lauderdale, FL', 'FL'], 'BOS': ['Boston, MA', 'MA'], 'HNL': ['Honolulu, HI', 'HI'], 'PSP': ['Palm Springs, CA', 'CA'], 'DAL': ['Dallas, TX', 'TX'], 'LGA': ['New York, NY', 'NY'], 'DEN': ['Denver, CO', 'CO'], 'SNA': ['Santa Ana, CA', 'CA'], 'IAH': ['Houston, TX', 'TX'], 'MEM': ['Memphis, TN', 'TN'], 'CLE': ['Cleveland, OH', 'OH'], 'AVL': ['Asheville, NC', 'NC'], 'ONT': ['Ontario, CA', 'CA'], 'MSP': ['Minneapolis, MN', 'MN'], 'MIA': ['Miami, FL', 'FL'], 'SAV': ['Savannah, GA', 'GA'], 'ATL': ['Atlanta, GA', 'GA'], 'TPA': ['Tampa, FL', 'FL'], 

In [50]:
# Add the Airport_dict codes to the Airports table

conn = engine.connect()

def insert_airport(code,city,state):
    ins=Airports.insert().values(Code=code,
                                 City=city,
                                 State=state)
    result = conn.execute(ins)

for key, value in Airport_dict.items(): 
    insert_airport(key, value[0], value[1])

In [51]:
# Close the file
flights.close()

# Re-open to get flight data

flights=open("/ufrc/zoo6927/share/Class_Files/data/flights.1K.csv")


reader = csv.DictReader(flights)
for Line in reader:

    ins=Flights.insert().values(Fl_date=Line['FL_DATE'],
                                Airline_ID = Line['AIRLINE_ID'],
                                Origin = Line['ORIGIN'],
                                Destination = Line['DEST'],
                                Dep_Time = Line['DEP_TIME'],
                                Dep_Delay_New = Line['DEP_DELAY_NEW'],
                                Arr_Time = Line['ARR_TIME'],
                                Arr_Delay_New = Line['ARR_DELAY_NEW'],
                                Cancelled = Line['CANCELLED'],
                                Cancellation_Code = Line['CANCELLATION_CODE'],
                                Diverted = Line['DIVERTED'],
                                Air_Time = Line['AIR_TIME'],
                                Flights = Line['FLIGHTS'],
                                Distance = Line['DISTANCE'],
                                Carrier_Delay = Line['CARRIER_DELAY'],
                                Weather_Delay = Line['WEATHER_DELAY'],
                                NAS_Delay = Line['NAS_DELAY'],
                                Security_Delay = Line['SECURITY_DELAY'],
                                Late_Aircraft_Delay = Line['LATE_AIRCRAFT_DELAY']
                                          )
                                          

    result = conn.execute(ins)


StatementError: (builtins.TypeError) SQLite DateTime type only accepts Python datetime and date objects as input. [SQL: 'INSERT INTO "Flights" ("Fl_date", "Airline_ID", "Origin", "Destination", "Dep_Time", "Dep_Delay_New", "Arr_Time", "Arr_Delay_New", "Cancelled", "Cancellation_Code", "Diverted", "Air_Time", "Flights", "Distance", "Carrier_Delay", "Weather_Delay", "NAS_Delay", "Security_Delay", "Late_Aircraft_Delay") VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)']

## Fixing the DateTime error above

The error above states: `SQLite DateTime type only accepts Python datetime and date objects as input.`

That's because we defined the `Fl_date` column with the DateTime data type. We could take the easy way out and store the date as a string, but that's not the best...

So, to convert the string '2017-05-01' to a date, let's use this StackOverflow post: https://stackoverflow.com/questions/31326834/faster-csv-loading-with-datetime-index-pandas/36800960#36800960
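
The conversion can also be done with the standard library alone. This is a minimal sketch that assumes every `FL_DATE` value follows the same YYYY-MM-DD pattern as '2017-05-01'; `parse_fl_date` is a hypothetical helper name.

```python
from datetime import datetime


def parse_fl_date(text):
    """Convert a 'YYYY-MM-DD' string into a datetime object."""
    # strptime raises ValueError if the text does not match the format.
    return datetime.strptime(text, "%Y-%m-%d")


print(parse_fl_date("2017-05-01"))  # 2017-05-01 00:00:00
```

The result is a plain Python `datetime`, which is the kind of object the error message asks for.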

## Also fixed the Booleans:

We defined `Cancelled` and `Diverted` as Booleans. Typically these are 0s or 1s (sometimes True and False). But either in the original data source, or in some processing I did of the file, these were changed to 0.00 and 1.00.

Not a huge deal, but if you try to do `int((Line['CANCELLED']))` you get the error:

```python
ValueError: invalid literal for int() with base 10: '0.00'
```

That's because the 0.00 is a string (notice the quotes around it) and you can't get an int from the string representation of a float...

But you *can* turn a string into a float... `float((Line['CANCELLED']))` and you *can* get an int of a float. So, it's a two-step process, which can be combined into one with: `Cancelled = int(float((Line['CANCELLED'])))`
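
Here is the two-step conversion as a small standalone sketch. `flag_to_int` is a hypothetical helper name, and the sketch assumes the column is never blank, since `float('')` raises a `ValueError`.

```python
def flag_to_int(text):
    """Convert a string such as '0.00' or '1.00' to the integer 0 or 1."""
    # float() parses the decimal string, then int() drops the fraction.
    return int(float(text))


print(flag_to_int("0.00"), flag_to_int("1.00"))  # 0 1
```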

In [None]:
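import pandas as pd

# Convert a date string (or a pandas Series of date strings) to pandas datetimes.
# pd.Timestamp is a subclass of datetime.datetime, so SQLAlchemy accepts it.
# With lookup=True, each unique value in a Series is parsed only once.
def to_date(dates, lookup=False, **args):
    if lookup:
        return dates.map({v: pd.to_datetime(v, **args) for v in dates.unique()})
    return pd.to_datetime(dates, **args)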


# Close the file
flights.close()

# Re-open to get flight data

flights=open("/ufrc/zoo6927/share/Class_Files/data/flights.1K.csv")
reader = csv.DictReader(flights)
for Line in reader:

    ins=Flights.insert().values(Fl_date=to_date(Line['FL_DATE']),
                                Airline_ID = Line['AIRLINE_ID'],
                                Origin = Line['ORIGIN'],
                                Destination = Line['DEST'],
                                Dep_Time = Line['DEP_TIME'],
                                Dep_Delay_New = Line['DEP_DELAY_NEW'],
                                Arr_Time = Line['ARR_TIME'],
                                Arr_Delay_New = Line['ARR_DELAY_NEW'],
                                Cancelled = int(float((Line['CANCELLED']))),
                                Cancellation_Code = Line['CANCELLATION_CODE'],
                                Diverted = int(float((Line['DIVERTED']))),
                                Air_Time = Line['AIR_TIME'],
                                Flights = Line['FLIGHTS'],
                                Distance = Line['DISTANCE'],
                                Carrier_Delay = Line['CARRIER_DELAY'],
                                Weather_Delay = Line['WEATHER_DELAY'],
                                NAS_Delay = Line['NAS_DELAY'],
                                Security_Delay = Line['SECURITY_DELAY'],
                                Late_Aircraft_Delay = Line['LATE_AIRCRAFT_DELAY']
                                          )
    result = conn.execute(ins)

# Close the file once all rows have been inserted
flights.close()