###### Data Engineering Capstone Project

# US Student Immigration
> The purpose of this project is to study the foreign students. The goal is to offer Data teams Analysts a selection of data concerning immigration to the United States.

#### Project Summary

The project follows the follow steps:
* [Step 1: Scope the Project and Gather Data](#Step-1:-Scope-the-Project-and-Gather-Data)
* [Step 2: Explore and Assess the Data](#Step-2:-Explore-and-Assess-the-Data)
    * [I94 Description Labels](#I94-Descrition-Labels)
    * [Immigration data](#Immigration-data)
    * [Global Land Temperature Data](#Global-Land-Temperature-Data)
    * [Global Airports Data](#Global-Airports-Data)
    * [Airports Data](#Airports-Data)
    
* [Step 3: Define the Data Model](#Step-3:-Define-the-Data-Model)
* [Step 4: Run ETL to Model the Data](#Step-4:-Run-ETL-to-Model-the-Data)
* [Step 5: Complete Project Write Up](#Step-5:-Complete-Project-Write-Up)



[Finish here yesterday](#workflow-1)

In [1]:
import os
import sys
import datetime
import numpy as np
import pandas as pd

In [2]:
# Read in the data here
!ls -1FSash ../../data
#!ls -tRFh ./dataset/
path = '../../data/'

total 1.3G
510M airports_us.csv*
509M GlobalLandTemperaturesByCity.csv*
205M WDIData.csv*
5.8M airport-codes_csv.csv*
1.5M airports-extended.csv*
248K us-cities-demographics.csv*
144K immigration_data_sample.csv*
 36K I94_SAS_Labels_Descriptions.SAS*
 12K i94port.csv*
 12K wikipedia-iso-country-codes.csv*
8.0K i94cit_i94res.csv*
4.0K ./
4.0K ../
4.0K 18-83510-I94-Data-2016/
4.0K education-statistics/
4.0K postgres/
4.0K i94addr.csv*
4.0K i94mode.csv*
4.0K i94visa.csv*


In [3]:
list_filename= ['I94_SAS_Labels_Descriptions.SAS', 'immigration_data_sample.csv', 
                "GlobalLandTemperaturesByCity.csv", 'wikipedia-iso-country-codes.csv',
                "airports-extended.csv","airport-codes_csv.csv", 'us-cities-demographics.csv',
               'WDIData.csv', "education-statistics/EdStatsData.csv"]

def sum1forline(list_filename):
    my_sum = 0
    path = '../../data/'
    for filename in list_filename:
        print(filename)
        with open(path+filename) as f:
            a = sum(1 for line in f)
            print(a)
            my_sum+=a
            
    return("Sum and Names of the files : {:,}".format(my_sum)) 
    
sum1forline(list_filename)

I94_SAS_Labels_Descriptions.SAS
1100
immigration_data_sample.csv
1001
GlobalLandTemperaturesByCity.csv
8599213
wikipedia-iso-country-codes.csv
247
airports-extended.csv
10668
airport-codes_csv.csv
55076
us-cities-demographics.csv
2892
WDIData.csv
422137
education-statistics/EdStatsData.csv
886931


'Sum and Names of the files : 9,979,265'

# Step 1: Scope the Project and Gather Data

#### Scope TODO
Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc>
> The main dataset includes data on immigration to the United State, and other datasets. In this work book, the data is transforming and cleasning.  
> How many students arrived in US in April?  
> Which Airline bring the most student in April?  
> What are the top city for alien studies?    
> what are the student profils (age, country born, country indicators)?  
> Where are from? 
Try to find some explanations about their country. 
> In which cities do they arrive in the USA?

### Describe and Gather Data

[Data dictionnary](2_data_dictionnary.ipynb) provide informations abou dataset and tables used.

#### Data Source

Data |File |Data Source
-|-|-|
I94 Immigration | immigration_data_sample.csv| [US National Tourism and Trade Office](https://travel.trade.gov/research/programs/i94/description.asp)
I94 Description Labels  Description|I94_SAS_Labels_Descriptions.SAS |US National Tourism and Trade Office
Global Land Temperature|GlobalLandTemperaturesByCity.csv| [Berkeley Earth](http://berkeleyearth.org/)
Global Airports|airports-extended.csv| [OpenFlights.org and user contributions](https://www.kaggle.com/open-flights/airports-train-stations-and-ferry-terminals)
Airports codes |airport-codes_csv.csv| provide by Udacity
Iso country | wikipedia-iso-country-codes.csv|[Kaggle](https://www.kaggle.com/juanumusic/countries-iso-codes)
US Cities Demographic| us-cities-demographics.csv|provide by Udacity
Indicators developpment| WDIData.csv| [Kaggle](https://www.kaggle.com/xavier14/wdidata)
Education-statistics| EdStatsData.csv|provide by Kaggle [World Bank](https://www.kaggle.com/kostya23/worldbankedstatsunarchived)


**change name _immigration_data_sample.csv_ for  _data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat_**

#### I94 Immigration data  Description: 
Each line of immigration_data_sample.csv correspond to a record of I-94 Form from the U.S. immigration officers. It's provide information about Arrival/Departure to foreign visitors. Some explanation about the [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp).  

Dataset information: There is a file per month for 2016, storage format is sas7bdat. These records are described according to 28 variables.   
A small description is provided [here](2_data_dictionnary.ipynb)  
I keep this variables for this project( _df_immigration_ ):
    
Column Name | Description | Example | Type
-|-|-|-|
**cicid**|     ID uniq per record in the dataset | 4.08e+06 | float64
**i94yr**|     4 digit year  | 2016.0 | float64
**i94mon**|    Numeric month |  4.0 | float64      
**i94cit**|     3 digit code of source city for immigration (Born country) | 209.0 | float64
**i94res**|    3 digit code of source country for immigration |209.0 | float64
**i94port**|   Port addmitted through | HHW | object
**arrdate**|   Arrival date in the USA | 20566.0 | float64
**i94mode**|   Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) | 1.0 | float
**i94addr**|   State of arrival | HI | object
**i94bir**|    Age in years | 61.0 | float
**i94visa**|   Visa Code - 1 = Business / 2 = Pleasure / 3 = Student |2.0 | float
**dtadfile**|  Date Field in I94 files |20160422| int 64
**gender**|    Gender|M| object
**visatype**|  Class of admission legally admitting the non-immigrant to temporarily stay in U.S.|WT|object
**airline**|Airline used to arrive in U.S.|MU|Object



df_immigration   
Additional files of this dataset are provide to give more desciption about this dataset


#### I94 Description Labels  Description
The I94_SAS_Labels_Description.SAS file is provide to add explanations  about code used in _data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat._ 
I parse this file, save the result in 5 .csv files. 
    * i94visa Data
    * i94country and i94residence Data
    * i94port Data
    * i94mode Data
    * i94addr
A small description is provided [here](2_data_dictionnary.ipynb)

####  Global Land Temperature Data  Description
The Berkeley Earth Surface Temperature Study provide climate information. Each line correspond to a record of temperature per day from city around the world.     
Dataset information: the GlobalLandTemperaturesByCity.csv has 7 variables. A small description is provided [here](2_data_dictionnary.ipynb). I keep this variables for this project ( _df_temperature_ ):

Column Name | Description | Example | Type
-|-|-|-|
**dt**|Date format YYYY-MM-DD| 1743-11-01| object
**AverageTemperature**|Average Temperature for the city to th date dt|6.07|float64
**City**| City name| Århus| object
**Country**| Country name | Denmark | object

#### Global Airports Data
This is a database of airports, train stations, and ferry terminals around the world. Some of the data come from public sources and some of it comes from OpenFlights.org user contributions.      
Dataset information: A small description is provided [here](2_data_dictionnary.ipynb). I give name and keep this variables ( _df_global_airports_ ):

Column Name | Description | Example | Type
-|-|-|-|
**airport_ID**|Id in the table|1| Int
**airport_name**|Name of airport|Nadzab Airport|Object
**airport_city**|Main city served by airport|Nadzab|Object
**airport_country**|Country or territory where airport is located|Papua New Guinea|Object
**airport_iata**|3-letter IATA code|LAE|Object


#### Airports Data Description
The airport code refers to the IATA airport code, 3 letters code unique for all airports in the world. It's a code used in passenger reservation, ticket and baggage-handling too.     
Dataset information: The airport-codes_csv.csv provides informations about aiports and have 12 variables. A small description is provided [here](2_data_dictionnary.ipynb). I keep this variables for this project ( _df_airport_code_ ):

Column Name | Description | Example | Type
-|-|-|-|
**ident**| Unique identifier Airport code| 00AK| object 
**type**| Type of airport | small_airport |object
**name**| Name of the airport | Lowell Field | object
**iso_country**| ISO code of airport country |US| object
**iso_region**| ISO code of the region airport | US-KS|object
**municipality**| City name where the airport is located | Anchor Point|object
**iata_code**| IATA code of the airport| | object

#### Iso country
This is a database about the different code useful to identify country.        
Datasset information: A small description is provided [here](2_data_dictionnary.ipynb). This table gives us informations about Country codes used to identify each country and contains 4 variables. I keep this variables for this project ( _df_iso_country_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country_name**|Country Name in English|Wallis and Futuna|Object
**Alpha2_code**|code 2 letter code for the country|WF|Object
**Alpha3_code**|code 3 letter code for the country|WLF|Object
**Numeric_code**|ISO 3166-2 code|876|Int

#### US cities Demographics
This dataset contains information about the demographics of all US cities and come from the US Census Bureau.     
Dataset information: A small description is provided [here](2_data_dictionnary.ipynb). 
This dataset contains 12 variables and provides simple informations about us state population. 
I keep this variables for this project ( _df_demograph_ ):

Column Name | Description | Example | Type
-|-|-|-|
**City**|Name of the city|Silver Spring|Object
**State**|US state of the city|Maryland|Object
**Median Age**|The median of the age of the population|33.8|Float64
**Male Population**|Number of the male population|40601.0|Float64
**Female Population**|Number of the female population|41862.0|Float64
**Total Population**|Number of the total population|82463 	|Float64
**Foreign-born**|Number of residents of the city that were not born in the city|30908.0|Float64
**State Code**|Code of the state of the city|MD|Object|
**Race**|Race class|Hispanic or Latino|Object
**Count**|Number of individual of each race|25924|Int64

#### World Development Indicators
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.   
Dataset information: This dataset contains 64 variables with economics context , most of which are variables per year(1960 to 2018).
A small description is provided [here](2_data_dictionnary.ipynb).
I keep this variables for this project ( _df_indicator_dev_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country Name**|Name of the country|Arab World|Object|
**Country Code**|3 letters code of country|ARB|Object
**Indicator Name**|indicators of economic development|2005 PPP conversion factor, GDP (LCU per inter...|Object
**Indicator Code**|letters indicator code|PA.NUS.PPP.05|Object
**1960 ...2018**|one column per year since 1960|2018|Float64

#### Education statistics Data
The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.    
Dataset information: This dataset contains 64 variables witheducation context , most of which are variables per year(1970 to 2100).
A small description is provided [here](2_data_dictionnary.ipynb).
I keep this variables for this project ( _df_Educ_data_ ):

Column Name | Description | Example | Type
-|-|-|-|
**Country Name**|Name of the country|Arab World|Object|
**Country Code**|3 letters code of country|ARB|Object
**Indicator Name**|indicators of education development|Adjusted net enrolment rate, lower secondary, ...|Object
**Indicator Code**|letters indicator code|UIS.NERA.2|Object
**1970 ...2100**|one column per year since 1970|2018|Float64

# Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [4]:
def check_df(df, description):
    nRow, nCol = df.shape
    print("There are {} rows and {} columns in **** {}. ****".format(nRow, nCol, description))
    #print(df.head(3))
    #print(n_df.info())
    #print(df.nunique)
    # check null value
    print(       )
    print("---------   Check null values")
    tab_info=pd.DataFrame(df.dtypes).T.rename(index={0:'column type'})
    tab_info=tab_info.append(pd.DataFrame(df.isnull().sum()).T.rename(index={0:'null values (nb)'}))
    tab_info=tab_info.append(pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T.rename(index={0:'null values (%)'}))
    display(tab_info)
    print(       )
    print("---------   Check unique values")
    cols = [col for col in df.columns if df[col].isnull().any()]
    df_miss = df[cols]    
    display(pd.DataFrame(df.nunique()).T.rename(index={0:'Unique values in columns'}))
    #print(df_miss.head())
    print(       )
    print("---------   Check duplicated value")
    n_df = df.iloc[:, 2:]
    df_dup = df[df.duplicated(keep=False)]
    display(pd.DataFrame(df_dup.count()).T.rename(index={0:'Duplicate values in columns'}))
    return(df)



    

In [5]:
def check_null(df):
    print("---------   Check null values")
    tab_info=pd.DataFrame(df.dtypes).T.rename(index={0:'column type'})
    tab_info=tab_info.append(pd.DataFrame(df.isnull().sum()).T.rename(index={0:'null values (nb)'}))
    tab_info=tab_info.append(pd.DataFrame(df.isnull().sum()/df.shape[0]*100).T.rename(index={0:'null values (%)'}))
    display(tab_info)
    print(       )

In [6]:
#I94_SAS_Labels_Description.SAS
#def SAS_parser(file_parse, item, columns):
import re
import io

def parse_file(path_file, key):
    """
    fonction to parse file and create csv file
    return dataframe
    """
    
    file_parse = path+'I94_SAS_Labels_Descriptions.SAS'
    with open(file_parse, 'r') as f:
        file = f.read()
    sas_dict={}
    key_name = ''

    for line in file.split("\n"):
        line = re.sub(r"\s+", " ", line)
        if '/* I94' in line :         
            line = line.strip('/* ')
            key_name = line.split('-')[0].replace("&", "_").replace(" ", "").strip(" ").lower() 
            sas_dict[key_name] = []
        elif '=' in line and key_name != '' :
            #line_trans = re.sub("([A-Z]*?),(\s*?[A-Z]{2}\s)","\\1=\\2", line)
            #print(line_trans)
            sas_dict[key_name].append([item.strip(' ').strip(" ';").title() for item in line.split('=')])
        

    if key is "i94port":
        columns = ["Port_id", "Port_city", "State_id"]
        swap = sas_dict[key]
        sas_dict[key] = []
        for x in swap:
            if "," in x[1]:
                mylist=[]
                a = x[1].rsplit(",", 1)
                b = a[0]
                c = a[1].strip()
                mylist.extend([x[0], b, c])
                sas_dict[key].append(item for item in mylist)      
                
                
    if key is "i94cit_i94res":
        columns = ["Country_id", "Country"]
        swap = sas_dict[key]
        for x in swap:
            if "Mexico" in x[1]:
                x[1] = "Mexico"
        
        
        
    if key is "i94mode":
        columns = ["Mode_id", "Mode"]
    if key is "i94addr":
        columns = ["State_id", "State"]
    if key is "i94visa":
            columns = ["Code_visa", "Visa"]
    df = ""           
            

    if key in sas_dict.keys():
        if len(sas_dict[key]) > 0:
            df = pd.DataFrame(sas_dict[key], columns = columns)
        with io.open(f"../../data/{key}.csv", "w") as f:
            df.to_csv(f, index=False) 
           
    return(df)



### I94 Description Labels

In [7]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94port"
i94_port = parse_file(path_file, key) 
i94_port.head()

Unnamed: 0,Port_id,Port_city,State_id
0,Alc,Alcan,Ak
1,Anc,Anchorage,Ak
2,Bar,Baker Aaf - Baker Island,Ak
3,Dac,Daltons Cache,Ak
4,Piz,Dew Station Pt Lay Dew,Ak


In [8]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94cit_i94res"
i94_city = parse_file(path_file, key) 
i94_city.head()

Unnamed: 0,Country_id,Country
0,582,Mexico
1,236,Afghanistan
2,101,Albania
3,316,Algeria
4,102,Andorra


In [9]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94addr"
i94_state = parse_file(path_file, key) 
i94_state.head()

Unnamed: 0,State_id,State
0,Al,Alabama
1,Ak,Alaska
2,Az,Arizona
3,Ar,Arkansas
4,Ca,California


In [10]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94mode"
i94_mode = parse_file(path_file, key) 
i94_mode.head()

Unnamed: 0,Mode_id,Mode
0,1,Air
1,2,Sea
2,3,Land
3,9,Not Reported


In [11]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94visa"
i94_visa = parse_file(path_file, key) 
i94_visa.head()

Unnamed: 0,Code_visa,Visa
0,1,Business
1,2,Pleasure
2,3,Student


In [12]:
path_file = "path+'I94_SAS_Labels_Descriptions.SAS'"
key = "i94addr"
i94_addr = parse_file(path_file, key) 
i94_addr.head()

Unnamed: 0,State_id,State
0,Al,Alabama
1,Ak,Alaska
2,Az,Arizona
3,Ar,Arkansas
4,Ca,California


### Immigration data
* immigration_data_sample.csv
* There are 1000 rows and 29 columns in *immigration_data_sample.csv*.
* df_immigration

TODO
* immigration_data_sample
    * revoir l'orignine du fichier. chercher data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat dans udacity workspace
    (pd.read_sas(immigration_fname, 'sas7bdat', encoding="ISO-8859-1")
    ==> 1000 rows fourni par udac, beaucoup plus dans 1 seul mois
    * DONE: faire un dictionnaire, recuperer les colonnes. 
* DONE chercher les valeurs manquantes, les valeurs dupliquees, data cleanning
    * chamger les formats

In [13]:
nRowsRead = None # change and set to None for the whole data
description = "dataset immigration provide by Udacity"
name = "df_immigration"
file = "immigration_data_sample.csv"

df_raw = (pd.read_csv(path+file, 
                      nrows = nRowsRead,
                      na_values=['\\N', '-', 'NAN', 'unknown'])
            .drop(columns=["Unnamed: 0","count", "visapost", "occup", "entdepa", "depdate", "entdepd", "entdepu", "biryear", \
                       "dtaddto", "matflag", "insnum", "airline", "fltno"])
            .reset_index(drop=True)
                     )

df_immigration = check_df(df_raw, description).sort_values(by = ['cicid', 'admnum'])
df_immigration.head()



There are 1000 rows and 15 columns in **** dataset immigration provide by Udacity. ****

---------   Check null values


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
column type,float64,float64,float64,float64,float64,object,float64,float64,object,float64,float64,int64,object,float64,object
null values (nb),0,0,0,0,0,0,0,0,59,0,0,0,141,0,0
null values (%),0,0,0,0,0,0,0,0,5.9,0,0,0,14.1,0,0



---------   Check unique values


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
Unique values in columns,1000,1,1,88,91,70,30,4,51,85,3,39,3,1000,10



---------   Check duplicated value


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
Duplicate values in columns,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
112,13208.0,2016.0,4.0,116.0,116.0,LOS,20545.0,1.0,CA,29.0,2.0,20160401,M,55442240000.0,WT
9,13213.0,2016.0,4.0,116.0,116.0,LOS,20545.0,1.0,CA,35.0,2.0,20160401,,55449790000.0,WT
148,13826.0,2016.0,4.0,117.0,117.0,ATL,20545.0,1.0,SC,44.0,1.0,20160401,M,55459080000.0,WB
163,17786.0,2016.0,4.0,123.0,123.0,NYC,20545.0,1.0,NE,31.0,1.0,20160401,,55455180000.0,WB
867,18310.0,2016.0,4.0,123.0,123.0,SEA,20545.0,1.0,CA,45.0,2.0,20160401,M,55421540000.0,WT


###### Missing and Duplicate
* somme value missing in 1000 rows
    * i94addr # US State of arrival,  59 null values and 51 unique values
        * map with i94port.csv
    * gender  # Gender, 141 null values and 3 unique values 
        * 'M', nan, 'F', 'X'


In [14]:
%xdel df_raw
df_immigration['i94addr'] = df_immigration["i94port"].map(dict(zip(i94_port["Port_id"], i94_port["State_id"]))).fillna(df_immigration.i94addr)
df_immigration.gender.fillna("X", inplace=True)
#df_immigration.dropna(inplace = True)
df_immigration[['cicid','i94yr','i94mon','i94cit','i94res','arrdate','i94mode','i94bir','i94visa','admnum', 'dtadfile']] = df_immigration[['cicid','i94yr','i94mon','i94cit','i94res','arrdate','i94mode','i94bir','i94visa','admnum','dtadfile']].astype(int)
df_immigration = check_df(df_immigration, description).sort_values(by = ['cicid', 'admnum'])
df_immigration.head()


There are 1000 rows and 15 columns in **** dataset immigration provide by Udacity. ****

---------   Check null values


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
column type,int64,int64,int64,int64,int64,object,int64,int64,object,int64,int64,int64,object,int64,object
null values (nb),0,0,0,0,0,0,0,0,59,0,0,0,0,0,0
null values (%),0,0,0,0,0,0,0,0,5.9,0,0,0,0,0,0



---------   Check unique values


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
Unique values in columns,1000,1,1,88,91,70,30,4,51,85,3,39,3,1000,10



---------   Check duplicated value


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
Duplicate values in columns,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,i94bir,i94visa,dtadfile,gender,admnum,visatype
112,13208,2016,4,116,116,LOS,20545,1,CA,29,2,20160401,M,55442244033,WT
9,13213,2016,4,116,116,LOS,20545,1,CA,35,2,20160401,X,55449792933,WT
148,13826,2016,4,117,117,ATL,20545,1,SC,44,1,20160401,M,55459078733,WB
163,17786,2016,4,123,123,NYC,20545,1,NE,31,1,20160401,X,55455177333,WB
867,18310,2016,4,123,123,SEA,20545,1,CA,45,2,20160401,M,55421541133,WT


In [15]:
%who_ls DataFrame

['df_immigration',
 'i94_addr',
 'i94_city',
 'i94_mode',
 'i94_port',
 'i94_state',
 'i94_visa']

**c/c: we have an ID for each record, `admnum` should be none null as `i94addr` for the analytic questions**

### Global Land Temperature Data
* dataset/GlobalLandTemperaturesByCity.csv
* There are 8599212 rows and 7 columns in  *GlobalLandTemperaturesByCity.csv*. 
* df_temperature


In [16]:
# download from kaggle the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY
nRowsRead = None # change and set to None for the whole data
description = "the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY"
name = "df_temperature"
file = "GlobalLandTemperaturesByCity.csv"

df_raw = (pd.read_csv(path+file, 
                     sep=",", 
                     nrows = nRowsRead)
           .reset_index(drop=True)
           .drop(columns=["AverageTemperatureUncertainty"])
           .groupby(['Country', 'City']) \
           .agg({"AverageTemperature": "mean", 
                  "Latitude": "first", 
                  "Longitude": "first"}).reset_index())          

df_temp = check_df(df_raw, description).sort_values(by = ["Country","City"], ascending=True)


#print("The date of the first record is {}.".format(df_temp["dt"].min()))
#print("The date of the first record is {}.".format(df_temp["dt"].max()))

df_temp.head()

There are 3490 rows and 5 columns in **** the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY. ****

---------   Check null values


Unnamed: 0,Country,City,AverageTemperature,Latitude,Longitude
column type,object,object,float64,object,object
null values (nb),0,0,0,0,0
null values (%),0,0,0,0,0



---------   Check unique values


Unnamed: 0,Country,City,AverageTemperature,Latitude,Longitude
Unique values in columns,159,3448,1381,73,1226



---------   Check duplicated value


Unnamed: 0,Country,City,AverageTemperature,Latitude,Longitude
Duplicate values in columns,0,0,0,0,0


Unnamed: 0,Country,City,AverageTemperature,Latitude,Longitude
0,Afghanistan,Baglan,10.790278,36.17N,69.61E
1,Afghanistan,Gardez,17.27424,32.95N,69.89E
2,Afghanistan,Gazni,10.311996,32.95N,67.98E
3,Afghanistan,Herat,14.213004,34.56N,62.27E
4,Afghanistan,Jalalabad,14.342919,34.56N,70.05E


##### Missing and Duplicate
* no duplicate for the whole Temperature dataset
* 364130 none value for AverageTemperature, so remove this rows.
* begin in 1743 and finis in 2013 so agregation by City.

In [17]:
                            
#df_temp.sort_values(["AverageTemperature"], ascending=True, inplace=True)
df_temp.drop(["Latitude", "Longitude"], axis=1, inplace=True)
df_temperature = check_df(df_temp, description).sort_values(by = ["Country","City"], ascending=True)
%xdel df_temp
df_temperature.head()


There are 3490 rows and 3 columns in **** the GlobalLandTemperaturesByCity.csv KAGGLE/UDACITY. ****

---------   Check null values


Unnamed: 0,Country,City,AverageTemperature
column type,object,object,float64
null values (nb),0,0,0
null values (%),0,0,0



---------   Check unique values


Unnamed: 0,Country,City,AverageTemperature
Unique values in columns,159,3448,1381



---------   Check duplicated value


Unnamed: 0,Country,City,AverageTemperature
Duplicate values in columns,0,0,0


Unnamed: 0,Country,City,AverageTemperature
0,Afghanistan,Baglan,10.790278
1,Afghanistan,Gardez,17.27424
2,Afghanistan,Gazni,10.311996
3,Afghanistan,Herat,14.213004
4,Afghanistan,Jalalabad,14.342919


In [18]:
%who_ls DataFrame

['df_immigration',
 'df_raw',
 'df_temperature',
 'i94_addr',
 'i94_city',
 'i94_mode',
 'i94_port',
 'i94_state',
 'i94_visa']

**c/c: Temperature from 1743 to 2013, useful if we want look for raison of immigration. People have needed bad weather conditions for a long time to leave their country**.

### Iso Country Data
* dataset/wikipedia-iso-country-codes.csv
* There are 246 rows and 4 columns in  *wikipedia-iso-country-codes.csv*.
* df_iso_country

In [19]:
# dowload from Kaggle wikipedia-iso-country-codes.csv provide by wikipedia
nRowsRead = None # change and set to None for the whole data
description = "download from kaggle the wikipedia-iso-country-codes.csv"
name = "df_iso_country"
file = "wikipedia-iso-country-codes.csv"

df_raw = (pd.read_csv(path+file,
                      nrows = nRowsRead,
                      names=['Country_name', 'Alpha2_code', 'Alpha3_code', 'Numeric_code', 'ISO 3166-2'])
            .drop(columns=['ISO 3166-2'])
            .iloc[1:])
#df_iso_country[df_iso_country.isna().any(axis=1)]
df_iso_country = check_df(df_raw, description)
df_iso_country["Alpha2_code"].fillna("NA", inplace = True)
df_iso_country["Numeric_code"] = df_iso_country["Numeric_code"].astype(int)
check_df(df_iso_country, description)
df_iso_country.head()

There are 246 rows and 4 columns in **** download from kaggle the wikipedia-iso-country-codes.csv. ****

---------   Check null values


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
column type,object,object,object,object
null values (nb),0,1,0,0
null values (%),0,0.406504,0,0



---------   Check unique values


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
Unique values in columns,246,245,246,246



---------   Check duplicated value


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
Duplicate values in columns,0,0,0,0


There are 246 rows and 4 columns in **** download from kaggle the wikipedia-iso-country-codes.csv. ****

---------   Check null values


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
column type,object,object,object,int64
null values (nb),0,0,0,0
null values (%),0,0,0,0



---------   Check unique values


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
Unique values in columns,246,246,246,246



---------   Check duplicated value


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
Duplicate values in columns,0,0,0,0


Unnamed: 0,Country_name,Alpha2_code,Alpha3_code,Numeric_code
1,Zimbabwe,ZW,ZWE,716
2,Zambia,ZM,ZMB,894
3,Yemen,YE,YEM,887
4,Western Sahara,EH,ESH,732
5,Wallis and Futuna,WF,WLF,876


In [20]:
#df_iso_country[df_iso_country["Alpha2_code"] == "ZW"]

### Global Airports Data
* dataset/airports-extended.csv
* There are 10668 rows and 13 columns in  *airports-extended.csv*. 
* df_global_airports

In [21]:
# download from kaggle the airports-extended.csv KAGGLE
nRowsRead = None # change and set to None for the whole data
description = "download from kaggle the airports-extended.csv"
name = "df_global_airports"
file = "airports-extended.csv"

df_raw = (pd.read_csv(path+file, 
                     nrows = nRowsRead,
                     names=['id', 'name', 'city', 'country', 'iata', 'icao', 'latitude', 'longitude', 'altitude', 
                            'timezone', 'dst', 'tz_timezone', 'type', 'data_source'],
                     na_values=['\\N', '-', 'NAN', 'unknown'])
           .set_index("id")[lambda df: df.type == 'airport']
           .reset_index(drop=True)
           .drop(columns=['icao','type', 'timezone', 'tz_timezone', 'data_source', 'dst', 'latitude', 'longitude', 'altitude'])
           .rename(columns=lambda col:'airport_'+ col)
        )
#df_raw = df_raw.drop(["AverageTemperatureUncertainty"], axis=1)

df_global_airports = check_df(df_raw, description)
#.sort_values(by = ["dt", "Country","City"], ascending=True)


df_global_airports.head()

There are 7750 rows and 4 columns in **** download from kaggle the airports-extended.csv. ****

---------   Check null values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
column type,object,object,object,object
null values (nb),0,44,0,1665
null values (%),0,0.567742,0,21.4839



---------   Check unique values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
Unique values in columns,7664,6953,237,6085



---------   Check duplicated value


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
Duplicate values in columns,14,14,14,0


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
0,Goroka Airport,Goroka,Papua New Guinea,GKA
1,Madang Airport,Madang,Papua New Guinea,MAG
2,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU
3,Nadzab Airport,Nadzab,Papua New Guinea,LAE
4,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM


In [22]:
#n_df = df_global_airports
#df_dup = n_df[n_df.duplicated(keep=False)]
#df_dup
df_global_airports.drop_duplicates(subset=['airport_name','airport_city','airport_country'], inplace=True)

In [23]:
%who_ls DataFrame

['df_global_airports',
 'df_immigration',
 'df_iso_country',
 'df_raw',
 'df_temperature',
 'i94_addr',
 'i94_city',
 'i94_mode',
 'i94_port',
 'i94_state',
 'i94_visa']

In [24]:
check_df(df_global_airports, description)

There are 7721 rows and 4 columns in **** download from kaggle the airports-extended.csv. ****

---------   Check null values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
column type,object,object,object,object
null values (nb),0,44,0,1649
null values (%),0,0.569874,0,21.3573



---------   Check unique values


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
Unique values in columns,7664,6953,237,6072



---------   Check duplicated value


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
Duplicate values in columns,0,0,0,0


Unnamed: 0,airport_name,airport_city,airport_country,airport_iata
0,Goroka Airport,Goroka,Papua New Guinea,GKA
1,Madang Airport,Madang,Papua New Guinea,MAG
2,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU
3,Nadzab Airport,Nadzab,Papua New Guinea,LAE
4,Port Moresby Jacksons International Airport,Port Moresby,Papua New Guinea,POM
...,...,...,...,...
7745,Black Rock City Airport,Gerlach,United States,
7746,Rajiv Gandhi International Airport,Hyderabad,India,HYD
7747,Vancouver International Water Airport,Vancouver,Canada,
7748,Port Washington Water Aerodrome,Port Washington,Canada,


### Airports Data
* dataset/airport-codes_csv.csv
* There are 55075 rows and 12 columns in  *airport-codes_csv.csv*.
* df_airport_id

In [29]:
# airport-codes_csv UDACITY
nRowsRead = None # change and set to None for the whole data
description = "airport-codes_csv provide by UDACITY"
name = "df_airport_id"
file = "airport-codes_csv.csv"

df_raw = (pd.read_csv(path+file, 
                     sep=",", 
                     nrows = nRowsRead)
            .reset_index(drop=True)
            .drop(["elevation_ft", "continent","gps_code", "local_code", "coordinates"], axis=1)
            
         )
df_airports_id = check_df(df_raw, description)
df_airports_id.head()

There are 55075 rows and 7 columns in **** airport-codes_csv provide by UDACITY. ****

---------   Check null values


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
column type,object,object,object,object,object,object,object
null values (nb),0,0,0,247,0,5676,45886
null values (%),0,0,0,0.448479,0,10.3059,83.3155



---------   Check unique values


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
Unique values in columns,55075,7,52144,243,2810,27133,9042



---------   Check duplicated value


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
Duplicate values in columns,0,0,0,0,0,0,0


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
0,00A,heliport,Total Rf Heliport,US,US-PA,Bensalem,
1,00AA,small_airport,Aero B Ranch Airport,US,US-KS,Leoti,
2,00AK,small_airport,Lowell Field,US,US-AK,Anchor Point,
3,00AL,small_airport,Epps Airpark,US,US-AL,Harvest,
4,00AR,closed,Newport Hospital & Clinic Heliport,US,US-AR,Newport,


##### Missing and Duplicate
* no duplicate for the whole Airports dataset
* column `ident` has no missing value and unique.
* none value 45886 in iata_code
* Type port are 'small_airport' 'medium_airport' 'large_airport' 'closed' 'seaplane_base' 'balloonport' do decide too drop ballonport, heliport and closed

In [30]:
unique = df_airport_id["type"].unique()
#print(unique)
index_types = df_airports_id[df_airports_id['type'].str.contains(r'\bheliport\b' or 'closed' or 'ballonport')].index
df_airports_id.drop(index_types , inplace=True)
index_names = df_airports_id[df_airports_id['name'].str.contains(r'\s*(?i)delete\s*')].index
df_airports_id.drop(index_names , inplace=True)
check_df(df_airports_id, description)
df_airports_id

There are 43779 rows and 7 columns in **** airport-codes_csv provide by UDACITY. ****

---------   Check null values


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
column type,object,object,object,object,object,object,object
null values (nb),0,0,0,247,0,5309,34658
null values (%),0,0,0,0.564197,0,12.1268,79.1658



---------   Check unique values


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
Unique values in columns,43779,6,41424,239,2780,24402,8977



---------   Check duplicated value


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
Duplicate values in columns,0,0,0,0,0,0,0


Unnamed: 0,ident,type,name,iso_country,iso_region,municipality,iata_code
1,00AA,small_airport,Aero B Ranch Airport,US,US-KS,Leoti,
2,00AK,small_airport,Lowell Field,US,US-AK,Anchor Point,
3,00AL,small_airport,Epps Airpark,US,US-AL,Harvest,
4,00AR,closed,Newport Hospital & Clinic Heliport,US,US-AR,Newport,
5,00AS,small_airport,Fulton Airport,US,US-OK,Alex,
...,...,...,...,...,...,...,...
55069,ZYYJ,medium_airport,Yanji Chaoyangchuan Airport,CN,CN-22,Yanji,YNJ
55070,ZYYK,medium_airport,Yingkou Lanqi Airport,CN,CN-21,Yingkou,YKH
55071,ZYYY,medium_airport,Shenyang Dongta Airport,CN,CN-21,Shenyang,
55073,ZZ-0002,small_airport,Glorioso Islands Airstrip,TF,TF-U-A,Grande Glorieuse,


In [31]:
df_airports_id['iso_region'] = df_airports_id['iso_region'].astype(str).str.split('-', 1).str.get(0)
df_airports_id.loc[df_airports_id['iso_country'].isnull(),'iso_country'] = df_airports_id['iso_region']
#df_airports['country'] = df_airports['iso_country'].map(dict(zip(df_iso_country['Alpha2_code'], df_iso_country['Country_name'])))
df_airports_id.drop(["iso_region", "ident", "type"], axis=1, inplace=True)

In [32]:
#df_airports[df_airports['country'].isnull()]

In [34]:
check_df(df_airports_id, description)

There are 43779 rows and 4 columns in **** airport-codes_csv provide by UDACITY. ****

---------   Check null values


Unnamed: 0,name,iso_country,municipality,iata_code
column type,object,object,object,object
null values (nb),0,0,5309,34658
null values (%),0,0,12.1268,79.1658



---------   Check unique values


Unnamed: 0,name,iso_country,municipality,iata_code
Unique values in columns,41424,240,24402,8977



---------   Check duplicated value


Unnamed: 0,name,iso_country,municipality,iata_code
Duplicate values in columns,294,294,242,16


Unnamed: 0,name,iso_country,municipality,iata_code
1,Aero B Ranch Airport,US,Leoti,
2,Lowell Field,US,Anchor Point,
3,Epps Airpark,US,Harvest,
4,Newport Hospital & Clinic Heliport,US,Newport,
5,Fulton Airport,US,Alex,
...,...,...,...,...
55069,Yanji Chaoyangchuan Airport,CN,Yanji,YNJ
55070,Yingkou Lanqi Airport,CN,Yingkou,YKH
55071,Shenyang Dongta Airport,CN,Shenyang,
55073,Glorioso Islands Airstrip,TF,Grande Glorieuse,


In [None]:
# df2['B']=df2['A'].map(dict(zip(df1['A'],df1['B']))).fillna(df2.B)
#df['country'] = df['municipality'].map(dict(zip(df_temperature['City'], df_temperature['Country'])))


In [None]:
#df['country'] = df['municipality'].map(dict(zip(df['municipality'], df['country'])))


**c/c : It seems to have no data in common apart from the regions of the united states with the 1st dataset.The columns `ident` contains unique value for airport, digit letter with zero before, sometimes 1 or 2**

### US Cities Demographic Data
* dataset/us-cities-demographics.csv
* There are 2891 rows and 12 columns in  *us-cities-demographics.csv*. 
* df_demograph

* from the US census bureau's
* demographics of all US cities > 65 000


In [35]:
# us-cities-demographics USACITY
nRowsRead = None # change and set to None for the whole data
description = "us-cities-demographics provide by UDACITY"
name = "df_demograph"
file = "us-cities-demographics.csv"

df_raw = (pd.read_csv(path+file,
                     sep=';',
                     nrows = nRowsRead)
          .reset_index(drop=True)
          .drop(["Number of Veterans", "Average Household Size"], axis=1)
          )

df_demograph = check_df(df_raw, description)
df_demograph.head()

There are 2891 rows and 10 columns in **** us-cities-demographics provide by UDACITY. ****

---------   Check null values


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Foreign-born,State Code,Race,Count
column type,object,object,float64,float64,float64,int64,float64,object,object,int64
null values (nb),0,0,0,3,3,0,13,0,0,0
null values (%),0,0,0,0.10377,0.10377,0,0.449671,0,0,0



---------   Check unique values


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Foreign-born,State Code,Race,Count
Unique values in columns,567,49,180,593,594,594,587,49,5,2785



---------   Check duplicated value


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Foreign-born,State Code,Race,Count
Duplicate values in columns,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Foreign-born,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,30908.0,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,32935.0,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,8229.0,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,33878.0,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,86253.0,NJ,White,76402


----

----

### World Development Indicators
* dataset/WDIData.csv
* There are 422136 rows and 64 columns in  *WDIData.csv*.
* df_indicator_dev

In [36]:
# WDIData.csv Indicators developpement KAGGLE
nRowsRead = None # change and set to None for the whole data
description = "WDIData.csv country Indicators developpment KAGGLE"
name = "df_indicator_dev"
file = "WDIData.csv"
cols_to_use = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code','2013', '2014', '2015', '2016']
df_raw = (pd.read_csv(path+file,
                     sep=',',
                     nrows = nRowsRead,
                     usecols=cols_to_use)
          .reset_index(drop=True)
          .fillna(0)
          )

df_indicator_dev = check_df(df_raw, description)
df_indicator_dev.head()



There are 422136 rows and 8 columns in **** WDIData.csv country Indicators developpment KAGGLE. ****

---------   Check null values


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,2013,2014,2015,2016
column type,object,object,object,object,float64,float64,float64,float64
null values (nb),0,0,0,0,0,0,0,0
null values (%),0,0,0,0,0,0,0,0



---------   Check unique values


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,2013,2014,2015,2016
Unique values in columns,264,264,1599,1599,176291,180850,172864,166439



---------   Check duplicated value


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,2013,2014,2015,2016
Duplicate values in columns,0,0,0,0,0,0,0,0


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,2013,2014,2015,2016
0,Arab World,ARB,"2005 PPP conversion factor, GDP (LCU per inter...",PA.NUS.PPP.05,0.0,0.0,0.0,0.0
1,Arab World,ARB,"2005 PPP conversion factor, private consumptio...",PA.NUS.PRVT.PP.05,0.0,0.0,0.0,0.0
2,Arab World,ARB,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,83.533457,83.897596,84.171599,84.510171
3,Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,88.176836,87.342739,89.130121,89.678685
4,Arab World,ARB,"Access to electricity, rural (% of rural popul...",EG.ELC.ACCS.RU.ZS,77.162305,75.538976,78.741152,79.665635


### Education statistics
* dataset/education-statistics/EdStatsData.csv
* There are 886930 rows and 70 columns in  *education-statistics/EdStatsData.csv*. 
* df_Educ_data

In [44]:
description = "Data from education-statistics"
name = "df_Educ_data"
file = "education-statistics/EdStatsData.csv"

cols_to_use = ['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code','2013', '2014', '2015', '2016']
df_raw = (pd.read_csv(path+file,
                     #sep=',',
                     nrows = nRowsRead,
                     #usecols = cols_to_use
                     )
          .reset_index(drop=True)
          .fillna(0)
          )
df_Educ_data = check_df(df_raw, description)
df_Educ_data.head()


There are 886930 rows and 70 columns in **** Data from education-statistics. ****

---------   Check null values


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
column type,object,object,object,object,float64,float64,float64,float64,float64,float64,...,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
null values (nb),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
null values (%),0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0



---------   Check unique values


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
Unique values in columns,242,242,3665,3665,24595,30892,30982,30988,31139,37838,...,7914,7800,7700,7562,7466,7335,7150,7044,6914,1



---------   Check duplicated value


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
Duplicate values in columns,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1970,1971,1972,1973,1974,1975,...,2060,2065,2070,2075,2080,2085,2090,2095,2100,Unnamed: 69
0,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.F,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.GPI,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Arab World,ARB,"Adjusted net enrolment rate, lower secondary, ...",UIS.NERA.2.M,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Arab World,ARB,"Adjusted net enrolment rate, primary, both sex...",SE.PRM.TENR,54.822121,54.894138,56.209438,57.267109,57.991138,59.36554,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
#%xdel n_df
%who_ls dict
%who_ls DataFrame


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.