# Project Title
### Data Engineering Capstone Project

#### Project Summary

The project follows these steps:
* [Step 1: Scope the Project and Gather Data](#Step-1:-Scope-the-Project-and-Gather-Data)
* [Step 2: Explore and Assess the Data](#Step-2:-Explore-and-Assess-the-Data)
    * [I94 Description Labels](#I94-Description-Labels)
    * [Immigration data](#Immigration-data)
    * [Global Land Temperature Data](#Global-Land-Temperature-Data)
    * [Global Airports Data](#Global-Airports-Data)
    * [Airports Data](#Airports-Data)
    
* [Step 3: Define the Data Model](#Step-3:-Define-the-Data-Model)
* [Step 4: Run ETL to Model the Data](#Step-4:-Run-ETL-to-Model-the-Data)
* [Step 5: Complete Project Write Up](#Step-5:-Complete-Project-Write-Up)



Multidimensional databases treat each attribute of a data item as a separate dimension. The software can then locate the intersections of those dimensions and display them, which makes it possible to analyze and compare the data in different ways. Attributes can also be split into several sub-attributes. Multidimensional databases contrast with relational databases, which are two-dimensional.
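
As a small illustration of the "intersection of dimensions" idea, the sketch below pivots a toy table along two dimensions. The column names and values are invented for the example and are not taken from the project data.

```python
import pandas as pd

# Toy records (hypothetical values): each row is one arrival
arrivals = pd.DataFrame({
    "visa": ["Student", "Student", "Business", "Pleasure", "Student"],
    "state": ["CA", "NY", "CA", "NY", "CA"],
    "count": [1, 1, 1, 1, 1],
})

# Each cell of the pivot is the intersection of two dimensions (visa x state)
cube = arrivals.pivot_table(index="visa", columns="state",
                            values="count", aggfunc="sum", fill_value=0)

# Number of student arrivals in California
students_ca = int(cube.loc["Student", "CA"])
```

Adding a third attribute (for example the month) to `index` or `columns` would add a third dimension to the same table.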

[Finish here yesterday](#workflow-1)

In [1]:
# Do all imports and installs here

import os
import sys
import boto3

import datetime
import numpy as np
import pandas as pd

In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()

pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 2)


In [3]:
# Read in the data here
!ls -1FSash ./dataset
#!ls -tRFh ./dataset/
path = './dataset/'

total 1,3G
510M airports_us.csv
509M GlobalLandTemperaturesByCity.csv
205M WDIData.csv
5,8M airport-codes_csv.csv
1,5M airports-extended.csv
248K us-cities-demographics.csv
144K immigration_data_sample.csv
 36K I94_SAS_Labels_Descriptions.SAS
 12K i94port.csv
8,0K i94cit_i94res.csv
4,0K ./
4,0K ../
4,0K 20-years-us-university-dataset/
4,0K airline-delay-and-cancellation-data-2009-2018/
4,0K education-statistics/
4,0K sas_data/
4,0K i94addr.csv
4,0K i94mode.csv
4,0K i94visa.csv


# Step 1: Scope the Project and Gather Data

#### Scope TODO
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?
> The main dataset contains data on immigration to the United States and is complemented by other datasets. In this notebook, the data is cleaned and transformed to answer the following questions:  
> How many students arrived in the US in April?  
> Which airline brought the most students in April?  
> Which cities are the top destinations for foreign students?  
> What are the student profiles (age, country of birth, country indicators)?  
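
The first two questions could be answered along the lines of the sketch below. It uses the I94 column names described in this notebook on a handful of invented rows. Note that `airline` is a column of the raw file, so it would have to be kept during cleaning to answer the second question.

```python
import pandas as pd

# Hypothetical sample using the I94 column names described in this notebook
sample = pd.DataFrame({
    "i94mon": [4.0, 4.0, 4.0, 3.0],
    "i94visa": [3.0, 2.0, 3.0, 3.0],   # 3 = Student
    "airline": ["AA", "DL", "AA", "UA"],
})

# Students who arrived in April
april_students = sample[(sample["i94mon"] == 4) & (sample["i94visa"] == 3)]
n_students = len(april_students)

# Airline that carried the most students in April
top_airline = april_students["airline"].value_counts().idxmax()
```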


### Describe and Gather Data

The [data dictionary](2_data_dictionnary.ipynb) describes the datasets and tables used.


**TODO: replace _immigration_data_sample.csv_ with _data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat_**

#### I94 Immigration Data Description
Each line of immigration_data_sample.csv corresponds to one I-94 form record collected by U.S. immigration officers. It provides arrival and departure information about foreign visitors. For more details, see the [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp).

Dataset information: there is one file per month for 2016, stored in sas7bdat format. Each record is described by 28 variables.   
A small description is provided [here](2_data_dictionnary.ipynb)  
I keep these variables for this project:
    
Column Name | Description | Example | Type
-|-|-|-|
**cicid**|     Unique ID of each record in the dataset | 4.08e+06 | float64
**i94yr**|     4-digit year  | 2016.0 | float64
**i94mon**|    Numeric month |  4.0 | float64
**i94cit**|     3-digit code of the source country for immigration (country of birth) | 209.0 | float64
**i94res**|    3-digit code of the source country for immigration | |
**i94port**|   Port admitted through | HHW | object
**arrdate**|   Arrival date in the USA | 20566.0 | float64
**i94mode**|   Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) | 1.0 | float
**i94addr**|   State of arrival | HI | object
**i94bir**|    Age in years | 61.0 | float
**i94visa**|   Visa code (1 = Business; 2 = Pleasure; 3 = Student) |2.0 | float
**dtadfile**|  Date field in the I94 files |20160422| int64
**admnum**|    Admission number, should be unique and not null |5.66e+10| float
**gender**|    Gender|M| object
**visatype**|  Class of admission legally admitting the non-immigrant to temporarily stay in U.S.|WT|object
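
The `arrdate` example (20566.0) is not directly readable as a date. A minimal sketch of a conversion follows, assuming the value is a SAS date, that is, a number of days counted from 1960-01-01. This assumption comes from the sas7bdat origin of the data and is not stated in the dataset description. The helper name `sas_days_to_date` is made up for the example.

```python
from datetime import date, timedelta

# Assumption: arrdate is a SAS date, i.e. a number of days since 1960-01-01
SAS_EPOCH = date(1960, 1, 1)

def sas_days_to_date(days):
    """Convert a SAS day count (possibly stored as a float) to a date."""
    return SAS_EPOCH + timedelta(days=int(days))

arrival = sas_days_to_date(20566.0)  # example value from the table above
```

Under this assumption the example value maps to 22 April 2016, which agrees with the `dtadfile` example (20160422) in the same table. Missing values would need to be handled before calling the helper, since `int()` fails on NaN.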


Additional files are provided to describe this dataset in more detail.


#### I94 Description Labels Description
The I94_SAS_Labels_Descriptions.SAS file is provided to explain the codes used in _data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat_. 
I parse this file and save the result in 5 .csv files:

* i94visa Data
* i94country and i94residence Data
* i94port Data
* i94mode Data
* i94addr Data

A small description is provided [here](2_data_dictionnary.ipynb)
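
To give an idea of the parsing, here is a minimal sketch that extracts `code = 'label'` pairs with a regular expression. The excerpt is written in the style of the labels file (the header comment is invented, the four modes are those listed in the table above) and the pattern is only one possible way to do it.

```python
import re

# Hypothetical excerpt in the style of the SAS labels file
sample = """
/* I94MODE - mode of transportation */
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""

# A code (quoted or not), an equals sign, then a quoted label
pattern = re.compile(r"^\s*'?([^'=]+?)'?\s*=\s*'([^']*)'")

pairs = []
for line in sample.splitlines():
    match = pattern.match(line)
    if match:
        pairs.append((match.group(1).strip(), match.group(2).strip()))
```

Each `(code, label)` pair can then become one row of the corresponding .csv file.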

#### Global Land Temperature Data Description
The Berkeley Earth Surface Temperature Study provides climate information. Each line corresponds to one temperature record for a given date and city, for cities around the world. 
Dataset information: GlobalLandTemperaturesByCity.csv has 7 variables.    
A small description is provided [here](2_data_dictionnary.ipynb)  
I keep these variables for this project:

Column Name | Description | Example | Type
-|-|-|-|
**dt**|Date format YYYY-MM-DD| 1743-11-01| object
**AverageTemperature**|Average temperature of the city on the date dt|6.07|float64
**City**| City name| Århus| object
**Country**| Country name | Denmark | object

#### Global Airports Data
This is a database of airports, train stations, and ferry terminals around the world. Some of the data comes from public sources and some from OpenFlights.org user contributions.  
Dataset information:  
A small description is provided [here](2_data_dictionnary.ipynb)

Column Name | Description | Example | Type
-|-|-|-|


#### Airports Data Description
The airport code refers to the IATA airport code, a 3-letter code that uniquely identifies an airport worldwide. It is also used in passenger reservations, ticketing and baggage handling. 
Dataset information: airport-codes_csv.csv provides information about airports and has 12 variables.    
A small description is provided [here](2_data_dictionnary.ipynb)  
I keep these variables for this project:

Column Name | Description | Example | Type
-|-|-|-|
**ident**| Unique airport identifier | 00AK| object 
**type**| Type of airport | small_airport |object
**name**| Name of the airport | Lowell Field | object
**continent**| Continent | | object
**iso_country**| ISO code of the airport's country |US| object
**iso_region**| ISO code of the airport's region | US-KS|object
**municipality**| City name where the airport is located | Anchor Point|object
**iata_code**| IATA code of the airport| | object
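
Since an IATA code is made of three letters, a simple shape check can separate plausible IATA codes from other identifiers such as `ident`. This is a sketch only: the helper name `looks_like_iata` is made up for the example, and the check says nothing about whether the code is actually assigned to an airport.

```python
import re

# Shape of an IATA airport code: exactly three uppercase letters
IATA_PATTERN = re.compile(r"[A-Z]{3}")

def looks_like_iata(code):
    """Return True when the value has the shape of an IATA airport code."""
    return isinstance(code, str) and bool(IATA_PATTERN.fullmatch(code))

# "00AK" is the ident example from the table above, None stands for a missing code
checks = {code: looks_like_iata(code) for code in ["LAE", "00AK", "", None]}
```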
 

# Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [4]:
from IPython.display import display

def check_df(df, description):
    """
    Print a quality summary of a DataFrame (shape, column types, null values,
    unique values, duplicated rows) and return the DataFrame unchanged.
    """
    nRow, nCol = df.shape
    print("There are {} rows and {} columns in **** {}. ****".format(nRow, nCol, description))

    print()
    print("---------   Check null values")
    null_count = df.isnull().sum()
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    tab_info = pd.concat([
        pd.DataFrame(df.dtypes).T.rename(index={0: 'column type'}),
        pd.DataFrame(null_count).T.rename(index={0: 'null values (nb)'}),
        pd.DataFrame(null_count / nRow * 100).T.rename(index={0: 'null values (%)'}),
    ])
    display(tab_info)

    print()
    print("---------   Check unique values")
    display(pd.DataFrame(df.nunique()).T.rename(index={0: 'Unique values in columns'}))

    print()
    print("---------   Check duplicated value")
    # keep=False flags every row that has at least one exact copy
    df_dup = df[df.duplicated(keep=False)]
    display(pd.DataFrame(df_dup.count()).T.rename(index={0: 'Duplicate values in columns'}))
    return df



    

In [5]:
# Parse I94_SAS_Labels_Descriptions.SAS into one lookup table per label group
import re

# Column names of the lookup table built for each label group
LABEL_COLUMNS = {
    "i94port": ["Port_id", "Port_city", "State_id"],
    "i94cit_i94res": ["Country_id", "Country"],
    "i94mode": ["Mode_id", "Mode"],
    "i94addr": ["State_id", "State"],
    "i94visa": ["Code_visa", "Visa"],
}

def parse_file(path_file, key):
    """
    Parse the SAS labels file, save the label group `key` as
    ./dataset/{key}.csv and return it as a DataFrame.
    """
    # Fall back to the default location when path_file is not an existing file
    file_parse = path_file if os.path.isfile(path_file) else path + 'I94_SAS_Labels_Descriptions.SAS'
    with open(file_parse, 'r') as f:
        file = f.read()
    sas_dict = {}
    key_name = ''

    for line in file.split("\n"):
        line = re.sub(r"\s+", " ", line)
        if '/* I94' in line:
            # A comment starting with "/* I94..." opens a new label group
            line = line.strip('/* ')
            key_name = line.split('-')[0].replace("&", "_").replace(" ", "").lower()
            sas_dict[key_name] = []
        elif '=' in line and key_name != '':
            # "code = 'label'" becomes [code, label]
            sas_dict[key_name].append([item.strip(' ').strip(" ';") for item in line.split('=')])

    if key == "i94port":
        # Split "CITY, ST" into city and state; ports without a state are skipped
        ports = []
        for x in sas_dict.get(key, []):
            if "," in x[1]:
                city, state = x[1].rsplit(",", 1)
                ports.append([x[0], city, state.strip()])
        sas_dict[key] = ports

    rows = sas_dict.get(key, [])
    df = pd.DataFrame(rows, columns=LABEL_COLUMNS.get(key))
    if rows:
        df.to_csv(f"./dataset/{key}.csv", index=False)
    return df



### I94 Description Labels

In [None]:
path_file = path + 'I94_SAS_Labels_Descriptions.SAS'
key = "i94port"
i94_port = parse_file(path_file, key) 
i94_port.head()

In [None]:
path_file = path + 'I94_SAS_Labels_Descriptions.SAS'
key = "i94cit_i94res"
df_i94 = parse_file(path_file, key) 
df_i94.head()

In [None]:
path_file = path + 'I94_SAS_Labels_Descriptions.SAS'
key = "i94addr"
df_i94 = parse_file(path_file, key) 
df_i94.head()

In [None]:
path_file = path + 'I94_SAS_Labels_Descriptions.SAS'
key = "i94mode"
df_i94 = parse_file(path_file, key) 
df_i94.head()

In [None]:
path_file = path + 'I94_SAS_Labels_Descriptions.SAS'
key = "i94visa"
df_i94 = parse_file(path_file, key) 
df_i94.head()

### Immigration data

TODO
* immigration_data_sample
    * Check where the file comes from: look for data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat in the Udacity workspace
    (pd.read_sas(immigration_fname, 'sas7bdat', encoding="ISO-8859-1")
    ==> the sample provided by Udacity has 1000 rows; a single month contains many more
    * DONE: build a dictionary, retrieve the columns.
* DONE: look for missing values and duplicated values, data cleaning
    * change the formats

In [None]:
nRowsRead = None # change and set to None for the whole data
description = "immigration dataset provided by Udacity"
name = "df_immigration"
file = "immigration_data_sample.csv"

df_raw = pd.read_csv(path+file, nrows = nRowsRead)

df_drop = df_raw.drop(["count", "visapost", "occup", "entdepa", "depdate", "entdepd", "entdepu", "biryear", \
                       "dtaddto", "matflag", "insnum", "airline", "fltno"], axis=1)
df_immigration = check_df(df_drop, description).sort_values(by = ['cicid', 'admnum'])
df_immigration.head()



##### Missing and Duplicate
* some values are missing in the 1000 rows
    * i94addr (US state of arrival): 59 null values and 51 unique values
        * fill by mapping with i94port.csv
    * gender: 141 null values and 3 unique values
        * 'M', nan, 'F', 'X'
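
A minimal sketch of the mapping idea, on invented rows: a lookup table built from the port file fills `i94addr` only where it is missing. The port and state values below are placeholders, and this variant keeps the recorded state whenever one exists.

```python
import pandas as pd

# Hypothetical rows; real values come from the immigration sample and i94port.csv
immigration = pd.DataFrame({
    "i94port": ["HHW", "NYC", "XXX"],
    "i94addr": ["HI", None, None],
})
ports = pd.DataFrame({"Port_id": ["HHW", "NYC"], "State_id": ["HI", "NY"]})

# Lookup table: port code -> state code
port_to_state = dict(zip(ports["Port_id"], ports["State_id"]))

# Fill only the missing states; rows with an unknown port stay missing
immigration["i94addr"] = immigration["i94addr"].fillna(
    immigration["i94port"].map(port_to_state)
)
filled = immigration["i94addr"].tolist()
```

Rows that are still missing after this step can then be dropped or kept, depending on the question being asked.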


In [None]:
# Take the state from the port of entry and fall back to the recorded i94addr
port_to_state = dict(zip(i94_port["Port_id"], i94_port["State_id"]))
df_immigration['i94addr'] = df_immigration["i94port"].map(port_to_state).fillna(df_immigration.i94addr)
# Drop the rows that still have missing values, then check the cleaned data
df_immigration = df_immigration.dropna()
df_immigration = check_df(df_immigration, description).sort_values(by = ['cicid', 'admnum'])
df_immigration.head()

**Conclusion: each record has an ID; `admnum` and `i94addr` must not be null to answer the analytic questions**

### Global Land Temperature Data

In [None]:
# download from kaggle the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY
nRowsRead = None # change and set to None for the whole data
description = "download from kaggle the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY"
name = "df_temperature"
file = "GlobalLandTemperaturesByCity.csv"

df_raw = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
df_raw = df_raw.drop(["AverageTemperatureUncertainty"], axis=1)


df_temp = check_df(df_raw, description).sort_values(by = ["dt", "Country","City"], ascending=True)


print("The date of the first record is {}.".format(df_temp["dt"].min()))
print("The date of the last record is {}.".format(df_temp["dt"].max()))

df_temp.head()

##### Missing and Duplicate
* no duplicates in the whole Temperature dataset
* 364130 missing values for AverageTemperature, so these rows are removed.
* records begin in 1743 and end in 2013, so the data is aggregated by city.

In [None]:
global_temp = df_temp.groupby(['Country', 'City']) \
                            .agg({"AverageTemperature": "mean", 
                                  "Latitude": "first", 
                                  "Longitude": "first"}).reset_index()
                            
global_temp.sort_values(["AverageTemperature"], ascending=True, inplace=True)
df_drop = global_temp.drop(["Latitude", "Longitude"], axis=1)

df_temperature = check_df(df_drop, description).sort_values(by = ["Country","City"], ascending=True)
df_temperature.head()

**Conclusion: temperatures from 1743 to 2013 are useful if we want to look for reasons for immigration. Long-lasting bad weather conditions may push people to leave their country**.

### Global Airports Data

In [10]:
# download from kaggle the airports-extended.csv KAGGLE
nRowsRead = None # change and set to None for the whole data
description = "download from kaggle the airports-extended.csv"
name = "df_global_airports"
file = "airports-extended.csv"

df_raw = (pd.read_csv(path+file, 
                     nrows = nRowsRead,
                     names=['id', 'name', 'city', 'country', 'iata', 'icao', 'latitude', 'longitude', 'altitude', 
                            'timezone', 'dst', 'tz_timezone', 'type', 'data_source'],
                     na_values=['\\N', '-', 'NAN', 'unknown'])
           .set_index("id")[lambda df: df.type == 'airport']
           .reset_index(drop=True)
           .drop(columns=['type', 'timezone', 'tz_timezone', 'data_source', 'dst', 'latitude', 'longitude', 'altitude'])
           .rename(columns=lambda col:'airport_'+ col)
        )
#df_raw = df_raw.drop(["AverageTemperatureUncertainty"], axis=1)

df_global_airports = check_df(df_raw, description)
#.sort_values(by = ["dt", "Country","City"], ascending=True)


df_global_airports.head()

There are 7750 rows and 5 columns in **** download from kaggle the airports-extended.csv. ****

---------   Check null values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata,airport_icao
column type,object,object,object,object,object
null values (nb),0,44,0,1665,478
null values (%),0,0.57,0,21,6.2



---------   Check unique values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata,airport_icao
Unique values in columns,7664,6953,237,6085,7272



---------   Check duplicated value


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata,airport_icao
Duplicate values in columns,13,13,13,0,0


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata,airport_icao
0,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA
1,Madang Airport,Madang,Papua New Guinea,MAG,AYMD
2,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH
3,Nadzab Airport,Nadzab,Papua New Guinea,LAE,AYNZ
4,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM,AYPY


### Airports Data

In [None]:

%who_ls DataFrame


In [None]:
%xdel global_temp

##### workflow-1

In [None]:
# airport-codes_csv UDACITY
nRowsRead = None # change and set to None for the whole data
description = "airport-codes_csv provided by UDACITY"
name = "df_airport_world"
file = "airport-codes_csv.csv"

df_raw = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
df_raw = df_raw.drop(["elevation_ft", "continent","gps_code", "local_code", "coordinates"], axis=1)

df_airports = check_df(df_raw, description).sort_values(by = ["iata_code"], ascending=True)
df_airports.head()

##### Missing and Duplicate
* no duplicates in the whole Airports dataset
* column `ident` has no missing values and is unique.
* 45886 missing values in iata_code
* Port types include 'small_airport', 'medium_airport', 'large_airport', 'closed', 'seaplane_base', 'balloonport' and 'heliport'; I decided to drop balloonport, heliport and closed

In [None]:
unique = df_airports["type"].unique()
#print(unique)
# Drop heliports, balloonports and closed airports
indexNames = df_airports[df_airports['type'].isin(['heliport', 'closed', 'balloonport'])].index
df_airports.drop(indexNames , inplace=True)


In [None]:
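# Airports that have an IATA code
df = df_airports.loc[df_airports["iata_code"].notnull(), ["type", "iata_code"]]
# The US state is the only field shared with the immigration data, so derive it
# from iso_region (e.g. "US-KS" -> "KS") for the US airports
is_us = df_airports["iso_country"] == "US"
df_airports.loc[is_us, "i94addr"] = df_airports.loc[is_us, "iso_region"].str.split("-").str[1]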

**Conclusion: apart from the US regions, this dataset seems to have no data in common with the first dataset. The column `ident` contains a unique value per airport, made of digits and letters, sometimes with 1 or 2 leading zeros**

### US Cities Demographic Data

* from the US Census Bureau
* demographics of all US cities with more than 65,000 inhabitants


In [None]:
# us-cities-demographics UDACITY
nRowsRead = None # change and set to None for the whole data
description   = "us-cities-demographics provided by UDACITY"
name          = "df_demograph"
file          = "us-cities-demographics.csv"

n_df          = pd.read_csv(path+file, sep=";", nrows = nRowsRead)
nRow, nCol    = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_demograph  = n_df

In [None]:
print(n_df.info())
n_df.head(10)
# dic_5 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}


##### missing and duplicate

In [None]:
df_dup = n_df[n_df.duplicated(keep=False)].sort_values("City")
df_dup.count()

In [None]:
cols = [col for col in n_df.columns if n_df[col].isnull().any()]
df_miss = n_df[cols]
df_miss.head()
df_miss[pd.isnull(df_miss).any(axis=1)]

In [None]:
duplicateRowsDF = n_df[n_df.duplicated()]
duplicateRowsDF.count()

----

----

In [None]:
# airports_us.csv KAGGLE
nRowsRead = 1000 # change and set to None for the whole data
description = "Dataset from KAGGLE, useful to follow where foreign visitors go"
name = "df_airport_us"
file = "airports_us.csv"
n_df = pd.read_csv(path+file, nrows = nRowsRead)
nRow, nCol = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, file))
df_airport_us = n_df
df_airport_us.head()
#print(n_df.head(1))
#print(n_df.info())


In [None]:
n_df.sort_values(["Fly_date"], ascending=True, inplace=True)
print(n_df["Fly_date"].min())
print(n_df["Fly_date"].max())

In [None]:
# WDIData.csv development indicators KAGGLE
description      = "WDIData.csv country development indicators KAGGLE"
name             = "df_indicator_dev"
file             = "WDIData.csv"

n_df             = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
nRow, nCol       = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_indicator_dev = n_df
n_df.head()
#dic_3 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_3

In [None]:
# ./dataset/airline-delay-and-cancellation-data-2009-2018/2016.csv
nRowsRead = 1000 # change and set to None for the whole data
description   = "Data about US flights in 2016"
name          = "df_airline_delay"
file          = "airline-delay-and-cancellation-data-2009-2018/2016.csv"

n_df          = pd.read_csv(path+file, nrows = nRowsRead)
nRow, nCol    = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_airline_delay       = n_df
n_df.head()

#dic_10 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_10
#type(dic_10)

In [None]:
# /dataset/education-statistics 

description           = "Data from education-statistics"
name                  = "df_Educ_country_series"
file                  = "education-statistics/EdStatsCountry-Series.csv"
n_df                  = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
nRow, nCol            = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_Educ_country_series = n_df
dic_11 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_11

description   = "Data from education-statistics"
name          = "df_Educ_country"
file          = "education-statistics/EdStatsCountry.csv"
n_df          = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
nRow, nCol    = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_Educ_country = n_df
dic_12 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_12

description   = "Data from education-statistics"
name          = "df_Educ_data"
file          = "education-statistics/EdStatsData.csv"
n_df          = pd.read_csv(path+file, nrows = nRowsRead)
nRow, nCol    = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_Educ_data  = n_df
dic_13 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_13

description       = "Data from education-statistics"
name              = "df_Educ_foot_note"
file              = "education-statistics/EdStatsFootNote.csv"
n_df              = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
nRow, nCol        = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_Educ_foot_note = n_df
dic_14 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_14

description    = "Data from education-statistics"
name           = "df_Educ_series"
file           = "education-statistics/EdStatsSeries.csv"
n_df           = pd.read_csv(path+file, sep=",", nrows = nRowsRead)
nRow, nCol     = n_df.shape
print("There are {} rows and {} columns in DataFrame {}.".format(nRow, nCol, name))
df_Educ_series = n_df
dic_15 = {'name': name, 'path': (path+file), 'nLines': nRow, 'nColomn': nCol, 'Description': description}
#dic_15


In [None]:
#%xdel n_df
%who_ls dict
%who_ls DataFrame


----

In [None]:
#%whos DataFrame
all_df = %who_ls DataFrame
all_df



### df_sas

### df_immigration

### df_indicator_dev

### df_temperature

### df_country

### df_country_series

### df_series

### df_data

### df_foot_note

### df_demograph

### df_airport_world

### df_airport_us

### df_airline_delay

In [None]:
df_airline_delay.head(2)


In [None]:
df_airline_delay.info()

In [None]:
#len(df_airline_delay)

In [None]:
df_airline_delay.columns

In [None]:
#df_airline_delay.dtypes

In [None]:
df_airline_delay.describe()

In [None]:
#df_airline_delay.values

In [None]:
df_airline_delay.describe(include=object)

# Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

# Step 4: Run ETL to Model the Data
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
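
The checks listed above could be sketched as follows. The function name `run_quality_checks`, its parameters and the sample table are assumptions made for the illustration and are not part of the project code.

```python
import pandas as pd

def run_quality_checks(df, key, not_null, min_rows=1):
    """Return a list of readable failures; an empty list means all checks passed."""
    failures = []
    # Count check: the table must not be empty
    if len(df) < min_rows:
        failures.append(f"expected at least {min_rows} rows, found {len(df)}")
    # Integrity check: the key column must be unique
    if df[key].duplicated().any():
        failures.append(f"column {key!r} contains duplicate values")
    # Integrity check: mandatory columns must not contain nulls
    for col in not_null:
        n_missing = int(df[col].isnull().sum())
        if n_missing:
            failures.append(f"column {col!r} has {n_missing} null values")
    return failures

# Hypothetical table with one duplicated key and one missing state
table = pd.DataFrame({"cicid": [1, 2, 2], "i94addr": ["HI", None, "NY"]})
problems = run_quality_checks(table, key="cicid", not_null=["cicid", "i94addr"])
```

Returning the failures as a list, rather than raising on the first one, makes it possible to report every problem of a table in a single run.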
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

# Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.