In [1]:
import pandas as pd
import numpy as np

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

# Scope

We will be constructing an immigration tracking database. This database has the potential for different uses such as to determine staffing at USCIS offices and the immigration office at airports, predicting future demographics, and tracking overall immigration trends. 

#### Our database consists of the following datasets:
- i94 immigration data from the [US National Tourism and Trade Office](https://travel.trade.gov/research/reports/i94/historical/2016.html). (millions of rows x 29 columns)
- [global airports](https://datahub.io/core/airport-codes#data). (>55000 rows x 12 columns)
- US cities and demographics provided by [OpenDataSoft](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/). (>2800 rows x 12 columns)
- Temperature data by cities for the past 150+ years from [Kaggle](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data?select=GlobalLandTemperaturesByCity.csv). (>8 million rows x 7 columns)

End solution:
- TBD

Tools and technologies used:
- TBD

# Let's gather and describe each dataset
- Airports
- Cities and Demographics
- Immigration
- Temperature

### Functions to use later

In [2]:
#function to describe dataframe
def stats_on_df(df, name):
    print("\nThere are {} rows and {} columns of data in the {} file".format(len(df), len(df.columns), name))
    print("columns are {}".format(df.columns))

In [3]:
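# function to compare two series and print the values they share
def matcher(series1, series2):
    # build a set of values first: `i in series2` on a pandas Series tests the index, not the values
    lookup = set(series2)
    matches = [i for i in series1 if i in lookup]
    print(matches)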

## Airports
Let's start with the **airport** data set found in airport-codes_csv.csv 

In [4]:
airport_df = pd.read_csv("data_raw/airport-codes_csv.csv")
stats_on_df(airport_df, "airports")


There are 55075 rows and 12 columns of data in the airports file
columns are Index(['ident', 'type', 'name', 'elevation_ft', 'continent', 'iso_country',
       'iso_region', 'municipality', 'gps_code', 'iata_code', 'local_code',
       'coordinates'],
      dtype='object')


The airport file has 12 columns and 55,075 rows. Let's explore it further.

In [5]:
#let's give the columns more informative names for later combining with other data sets
airport_df = airport_df.rename(columns = {'ident':'airport_identifier', 'type':'airport_size', 'name':'airport_name', 'iso_country':'country'})
airport_df.columns

Index(['airport_identifier', 'airport_size', 'airport_name', 'elevation_ft',
       'continent', 'country', 'iso_region', 'municipality', 'gps_code',
       'iata_code', 'local_code', 'coordinates'],
      dtype='object')

The cleaned airport dataset has the following 12 fields:
- airport_identifier: identifying airport code
- airport_size: airport type, e.g. heliport, small_airport, medium_airport, large_airport, or closed
- airport_name: name of airport (string)
- elevation_ft: elevation of the airport above sea level in feet (float)
- continent: 2 character continent abbreviation
- country: 2 character country abbreviation
- iso_region: iso regional identifier of airport location
- municipality: locale of airport
- gps_code: 4 char identification gps_code
- iata_code: 3 char iata identifying code
- local_code: 4 char local identifying code
- coordinates: gps coordinates of the airport as longitude and latitude separated by a comma, e.g. -101.473911, 38.704022
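The combined coordinates string is awkward to join or filter on, so a later step may want it split into two numeric columns. The following is a minimal sketch on two rows copied from the airport file. The column names `longitude` and `latitude` are hypothetical, and the longitude-first order is an assumption based on the sample values (Leoti, KS sits near -101 longitude, 38 latitude).

```python
import pandas as pd

# Two rows copied from the airport file; the real frame would be airport_df.
sample = pd.DataFrame({
    "airport_identifier": ["00AA", "00AK"],
    "coordinates": ["-101.473911, 38.704022", "-151.695999146, 59.94919968"],
})

# Split on the comma and cast both halves to float.
# Assumption: the first value is longitude and the second is latitude.
parts = sample["coordinates"].str.split(",", expand=True)
sample["longitude"] = parts[0].str.strip().astype(float)
sample["latitude"] = parts[1].str.strip().astype(float)

print(sample[["airport_identifier", "longitude", "latitude"]])
```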

In [6]:
airport_df.head()

Unnamed: 0,airport_identifier,airport_size,airport_name,elevation_ft,continent,country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [7]:
airport_df.to_csv('data_first_cleaning/airports.csv', index=False)  # save for use in step 2

## Cities and Demographics Data
Now let's check the **US cities data** using us-cities-demographics.csv

In [8]:
city_df = pd.read_csv("data_raw/us-cities-demographics.csv")
stats_on_df(city_df, "cities")


There are 2891 rows and 1 columns of data in the cities file
columns are Index(['City;State;Median Age;Male Population;Female Population;Total Population;Number of Veterans;Foreign-born;Average Household Size;State Code;Race;Count'], dtype='object')


Immediately we see that the file was read as a single column because its fields are separated by semicolons, so it needs to be reloaded with `sep=';'`.

In [9]:
city_df = pd.read_csv("data_raw/us-cities-demographics.csv", sep=';')
stats_on_df(city_df, "cities")


There are 2891 rows and 12 columns of data in the cities file
columns are Index(['City', 'State', 'Median Age', 'Male Population', 'Female Population',
       'Total Population', 'Number of Veterans', 'Foreign-born',
       'Average Household Size', 'State Code', 'Race', 'Count'],
      dtype='object')
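A separator mismatch like this one can also be caught before loading. The sketch below uses the standard library's `csv.Sniffer` on a small in-memory stand-in for the cities file (only a subset of the columns, typed by hand), so it illustrates the idea rather than the notebook's actual loading step.

```python
import csv
import io
import pandas as pd

# Stand-in for the first lines of us-cities-demographics.csv (subset of columns).
raw = (
    "City;State;Median Age;State Code;Race;Count\n"
    "Detroit;Michigan;34.8;MI;White;104260\n"
    "Detroit;Michigan;34.8;MI;Asian;10804\n"
)

# Let csv.Sniffer guess the delimiter from a restricted set of candidates.
dialect = csv.Sniffer().sniff(raw, delimiters=";,\t|")
print(repr(dialect.delimiter))

# Load with the detected separator instead of the default comma.
df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)
print(df.shape)
```

For a file on disk, the same check could be run on the first few kilobytes read from the file before calling `pd.read_csv`.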


In [10]:
city_df[city_df.City=="Detroit"]

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
1026,Detroit,Michigan,34.8,319265.0,357859.0,677124,29511.0,39861.0,2.6,MI,Black or African-American,545988
1126,Detroit,Michigan,34.8,319265.0,357859.0,677124,29511.0,39861.0,2.6,MI,Asian,10804
1127,Detroit,Michigan,34.8,319265.0,357859.0,677124,29511.0,39861.0,2.6,MI,American Indian and Alaska Native,6007
1713,Detroit,Michigan,34.8,319265.0,357859.0,677124,29511.0,39861.0,2.6,MI,White,104260
2471,Detroit,Michigan,34.8,319265.0,357859.0,677124,29511.0,39861.0,2.6,MI,Hispanic or Latino,53980


The data is highly redundant, as we can see above. Detroit has 5 rows in which 10 columns are identical and only the Race and Count columns differ. We will need to restructure the data later for our purposes.
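One possible restructuring, shown here only as a sketch, is to pivot the race rows into columns so that each city occupies a single row. The example uses three Detroit rows reduced to four columns; the names `long_df` and `wide_df` are hypothetical, and the real step may choose a different layout.

```python
import pandas as pd

# Three Detroit rows from the output above, reduced to four columns.
long_df = pd.DataFrame({
    "City": ["Detroit"] * 3,
    "State Code": ["MI"] * 3,
    "Race": ["White", "Asian", "Hispanic or Latino"],
    "Count": [104260, 10804, 53980],
})

# One row per city, one column per race.
wide_df = (
    long_df.pivot_table(index=["City", "State Code"], columns="Race",
                        values="Count", aggfunc="sum")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "Race" axis label
print(wide_df)
```

The city-level columns (median age, total population, and so on) would be kept once per city, for example by dropping duplicates on City and State Code and merging the result with the pivoted counts.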

In [11]:
city_df = city_df.rename(columns = {'Count':'Race Population'})
city_df.columns

Index(['City', 'State', 'Median Age', 'Male Population', 'Female Population',
       'Total Population', 'Number of Veterans', 'Foreign-born',
       'Average Household Size', 'State Code', 'Race', 'Race Population'],
      dtype='object')

The cleaned cities dataset has the following 12 fields:
- City: city (Full name)
- State: state (Full name)
- Median Age: Median age of city population
- Male Population: Number of males in city
- Female Population: Number of females in city
- Total Population: Total population of city, equal to male population + female population
- Number of Veterans: Number of veterans in total population of city
- Foreign-born: Number of foreign-born population in city
- Average Household Size: total population / number of households
- State Code: 2 character abbreviation of State
- Race: one specific race within the city's population (different rows for different races in same city)
- Race Population: Population of the given race within the city

In [12]:
city_df.to_csv('data_first_cleaning/cities.csv', index=False)  # save for use in step 2

## Immigration Data
Now let's check the **immigration data** using immigration_data_sample.csv, which is a sample of the larger immigration SAS data.

Since we are looking only at a sample, there is no reason to change or clean the dataset until a later step.

In [13]:
imm_df = pd.read_csv("data_raw/immigration_data_sample.csv")
stats_on_df(imm_df, "immigration sample")
print(imm_df.iloc[:4,:8])
print(imm_df.iloc[:4,8:16])
print(imm_df.iloc[:4,16:24])
print(imm_df.iloc[:4,24:])


There are 1000 rows and 29 columns of data in the immigration sample file
columns are Index(['Unnamed: 0', 'cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port',
       'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa',
       'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')
   Unnamed: 0      cicid   i94yr  i94mon  i94cit  i94res i94port  arrdate
0     2027561  4084316.0  2016.0     4.0   209.0   209.0     HHW  20566.0
1     2171295  4422636.0  2016.0     4.0   582.0   582.0     MCA  20567.0
2      589494  1195600.0  2016.0     4.0   148.0   112.0     OGG  20551.0
3     2631158  5291768.0  2016.0     4.0   297.0   297.0     LOS  20572.0
   i94mode i94addr  depdate  i94bir  i94visa  count  dtadfile visapost
0      1.0      HI  20573.0    61.0      2.0    1.0  20160422      NaN
1      1.0      TX  20568.0    26.0

This data has a significant number of fields: 29 columns in total. We can already see some ways in which it can be joined with the cities and airports datasets. For example, the i94port column appears to reference an airport, and i94addr matches the state code in the cities dataset. The immigration dataset also provides date information, which can be used to partition the data and view trends over time.  
  
Additional opportunities exist in the data, such as comparing i94bir in the immigration dataset with the median age in the cities dataset.  
  
A deeper dive into the individual columns will be necessary in step 2 to see what information can be gained by including it in our overall model.
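Before relying on a join key, it helps to measure how many values actually match. The sketch below does this for i94addr against the state codes, using small hand-made stand-ins for `imm_df["i94addr"]` and `city_df["State Code"]`; the values `None` and `"XX"` are invented to show how unmatched entries surface.

```python
import pandas as pd

# Toy stand-ins for imm_df["i94addr"] and city_df["State Code"].
i94addr = pd.Series(["HI", "TX", "FL", None, "XX"])
state_codes = pd.Series(["HI", "TX", "FL", "MI"])

# Series.isin compares values; `x in series` would test the index instead.
known = set(state_codes.dropna())
is_match = i94addr.isin(known)

print("matched share:", is_match.mean())
print("unmatched values:", sorted(i94addr[~is_match].dropna().unique()))
```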

The immigration dataset has the following fields:
- cicid: unique identifier
- i94yr: 4 digit year
- i94mon: numeric month
- i94cit: country of citizenship code (3 numbers)
- i94res: country of residence code (3 numbers)
- i94port: code of the port of arrival to US (3 letters)
- arrdate: Arrival date to US, stored as a number (e.g. 20566.0) rather than a formatted date
- i94mode: Mode of transportation on arrival (air, sea, land as 1, 2, and 3 respectively)
- i94addr: 2 letter state code of destination
- depdate: Departure date from US
- i94bir: Age of individual in years
- i94visa: visa reason codes (1, 2, 3 for business, pleasure or student respectively)
- count: contains value of 1 for summarizing
- dtadfile: date field yyyymmdd format
- visapost: dept of state branch issuing visa
- occup: occupation to perform in US
- entdepa: arrival flag  (admitted or paroled)
- entdepd: departure flag (departed, lost or deceased)
- entdepu: update flag (apprehended, overstayed, adjusted to permanent residence)
- matflag: flag indicating whether the arrival and departure records match
- biryear: 4 digit birth year
- dtaddto: date allowed to stay in US until
- gender: gender code (1 letter abbreviation)
- insnum: INS number
- airline: airline flown to arrive in US
- admnum: admission number
- fltno: flight number of airline flown to arrive in US
- visatype: class of admission under which the non-immigrant is allowed to stay temporarily in the US
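The arrdate and depdate values in the sample (such as 20566.0) are numeric rather than formatted dates. Assuming they follow the SAS convention of counting days from 1960-01-01, they can be converted as sketched below. The assumption looks plausible because the first sample row then maps to 2016-04-22, which agrees with that row's dtadfile value of 20160422. The `_dt` and `stay_days` column names are hypothetical.

```python
import pandas as pd

# arrdate and depdate values from the first two sample rows.
dates = pd.DataFrame({"arrdate": [20566.0, 20567.0], "depdate": [20573.0, 20568.0]})

# Assumption: the values count days since the SAS epoch, 1960-01-01.
for col in ["arrdate", "depdate"]:
    dates[col + "_dt"] = pd.to_datetime(dates[col], unit="D", origin="1960-01-01")

# Length of stay in whole days.
dates["stay_days"] = (dates["depdate_dt"] - dates["arrdate_dt"]).dt.days
print(dates[["arrdate_dt", "depdate_dt", "stay_days"]])
```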

In [14]:
imm_df.to_csv('data_first_cleaning/immigration_sample.csv', index=False)  # save for use in step 2

## Temperature Data
Now let's check the **temperature data** using GlobalLandTemperaturesByMajorCity.csv, which is a sample of the larger temperature data.

Since we are looking only at a sample, there is no reason to change or clean the dataset until a later step.

In [15]:
temp_df = pd.read_csv("data_raw/GlobalLandTemperaturesByMajorCity.csv")
stats_on_df(temp_df, "temperature sample")
temp_df.head(5)


There are 239177 rows and 7 columns of data in the temperature sample file
columns are Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1849-01-01,26.704,1.435,Abidjan,Côte D'Ivoire,5.63N,3.23W
1,1849-02-01,27.434,1.362,Abidjan,Côte D'Ivoire,5.63N,3.23W
2,1849-03-01,28.101,1.612,Abidjan,Côte D'Ivoire,5.63N,3.23W
3,1849-04-01,26.14,1.387,Abidjan,Côte D'Ivoire,5.63N,3.23W
4,1849-05-01,25.427,1.2,Abidjan,Côte D'Ivoire,5.63N,3.23W


In [16]:
temp_df.tail(5)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
239172,2013-05-01,18.979,0.807,Xian,China,34.56N,108.97E
239173,2013-06-01,23.522,0.647,Xian,China,34.56N,108.97E
239174,2013-07-01,25.251,1.042,Xian,China,34.56N,108.97E
239175,2013-08-01,24.528,0.84,Xian,China,34.56N,108.97E
239176,2013-09-01,,,Xian,China,34.56N,108.97E


In [17]:
temp_df_full = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv') 

In [18]:
stats_on_df(temp_df_full, 'temperatures all cities')
temp_df_full.head()


There are 8599212 rows and 7 columns of data in the temperatures all cities file
columns are Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [19]:
temp_df_full.tail()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
8599207,2013-05-01,11.464,0.236,Zwolle,Netherlands,52.24N,5.26E
8599208,2013-06-01,15.043,0.261,Zwolle,Netherlands,52.24N,5.26E
8599209,2013-07-01,18.775,0.193,Zwolle,Netherlands,52.24N,5.26E
8599210,2013-08-01,18.025,0.298,Zwolle,Netherlands,52.24N,5.26E
8599211,2013-09-01,,,Zwolle,Netherlands,52.24N,5.26E


The temperature dataset has the following fields:
- dt: date in yyyy-mm-dd format (the day does not appear to be relevant, as there is one entry per month)
- AverageTemperature: Celsius reading as a float with up to 3 decimals; some values are NaN
- AverageTemperatureUncertainty: 95% confidence interval for Average Temperature
- City: full name
- Country: full name
- Latitude: to 2 decimal places with a hemisphere letter, e.g. 34.56N
- Longitude: to 2 decimal places with a hemisphere letter, e.g. 108.97E
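Because Latitude and Longitude are stored as strings with a hemisphere suffix, they cannot be compared numerically with other coordinate data as they stand. A minimal sketch of a conversion to signed decimal degrees follows, using values from the outputs above. The helper `to_signed_degrees` and the `lat`/`lon` column names are hypothetical, and the sketch assumes every value ends in one of N, S, E, or W.

```python
import pandas as pd

def to_signed_degrees(coord: str) -> float:
    """Convert a string such as '3.23W' to a signed float (-3.23)."""
    value, hemisphere = float(coord[:-1]), coord[-1].upper()
    # South and West are negative by convention.
    return -value if hemisphere in ("S", "W") else value

sample = pd.DataFrame({
    "City": ["Abidjan", "Xian"],
    "Latitude": ["5.63N", "34.56N"],
    "Longitude": ["3.23W", "108.97E"],
})
sample["lat"] = sample["Latitude"].map(to_signed_degrees)
sample["lon"] = sample["Longitude"].map(to_signed_degrees)
print(sample[["City", "lat", "lon"]])
```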

In [20]:
temp_df.to_csv('data_first_cleaning/temperature_sample.csv', index=False)  # save for use in step 2