
## Belal Elseraty

This notebook explores immigration to Canada from different continents and countries.
After cleaning and preparing the data, it visualizes the data with a range of plots and charts.

## Table of Contents

1. Downloading and Prepping Data
2. Pie Charts
3. Box Plots
4. Bubble Plots
5. Area Plots
6. Histograms
7. Bar Charts
8. Scatter Plots

# Downloading and Prepping Data <a id="2"></a>

In [1]:
import numpy as np  
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
!wget https://www.un.org/en/development/desa/population/migration/data/empirical2/data/UN_MigFlow_A_to_E.zip 

In [None]:
!unzip UN_MigFlow_A_to_E.zip

Read the dataset into a *pandas* dataframe.

In [None]:
df_can = pd.read_excel('Canada.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20),
                       skipfooter=2
                      )

print('Data downloaded and read into a dataframe!')

Let's take a look at the first five rows of our dataset.

In [None]:
df_can.head()

In [None]:
print(df_can.shape)

In [None]:
df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent','RegName':'Region'}, inplace=True)

# for sake of consistency, let's also make all column labels of type string
df_can.columns = list(map(str, df_can.columns))

# set the country name as index - useful for quickly looking up countries using .loc method
df_can.set_index('Country', inplace=True)

# add total column (sum only the numeric year columns, not the text columns)
df_can['Total'] = df_can.sum(axis=1, numeric_only=True)

# years that we will be using in this lesson - useful for plotting later on
years = list(map(str, range(1980, 2014)))
print('data dimensions:', df_can.shape)
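
The same preparation steps can be tried on a tiny stand-in table before running them on the full sheet. This is only a sketch: the two countries and their numbers are invented for illustration, and only the column names `OdName` and `AreaName` come from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the raw sheet: two text columns plus integer year columns.
raw = pd.DataFrame({
    "OdName": ["Country A", "Country B"],
    "AreaName": ["Europe", "Asia"],
    1980: [10, 5],
    1981: [20, 15],
})

toy = raw.rename(columns={"OdName": "Country", "AreaName": "Continent"})
toy.columns = list(map(str, toy.columns))   # year labels become '1980', '1981'
toy = toy.set_index("Country")

# numeric_only=True keeps the text column 'Continent' out of the row sums
toy["Total"] = toy.sum(axis=1, numeric_only=True)

print(toy)
```

Converting the year labels to strings up front means a column can later be selected with `toy['1980']` without worrying about whether the label is an integer or a string.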

# Pie Charts <a id="6"></a>

We can use a pie chart to explore the proportion (percentage) of new immigrants grouped by continents for the entire time period from 1980 to 2013. 

In [None]:
# group countries by continent and sum the numeric columns
df_continents = df_can.groupby('Continent').sum(numeric_only=True)
print(type(df_can.groupby('Continent')))

df_continents.head()

In [None]:
df_can

In [None]:
df_continents.sort_values(by= 'Total', ascending=False)
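
The percentages that a pie chart prints on its wedges are simply each group's share of the grand total. A minimal sketch, using made-up continent totals rather than the real figures:

```python
import pandas as pd

# Hypothetical totals per continent (illustrative values only).
totals = pd.Series({"Asia": 600, "Europe": 300, "Africa": 100}, name="Total")

# Share of each continent in percent, rounded to one decimal like autopct='%1.1f%%'
share_pct = (totals / totals.sum() * 100).round(1)

print(share_pct)
```

Applied to the notebook's data, the same expression on `df_continents['Total']` should reproduce the labels drawn by the pie chart.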

In [None]:
explode_list = [0.15, 0, 0, 0, 0.15, 0.15] # ratio for each continent with which to offset each wedge.

df_continents['Total'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,         
                            pctdistance=1.12,     
                            explode=explode_list
                            )

# scale the title up by 12% to match pctdistance
plt.title('Immigration to Canada by Continent [1980 - 2013]', y=1.12) 

plt.axis('equal') 

# add legend
plt.legend(labels=df_continents.index, loc='upper left') 

plt.show()

In [None]:
explode_list = [0.15, 0, 0, 0, 0.1, 0.2] # ratio for each continent with which to offset each wedge.

df_continents['2013'].plot(kind='pie',
                            figsize=(15, 6),
                            autopct='%1.1f%%', 
                            startangle=90,    
                            shadow=True,       
                            labels=None,         
                            pctdistance=1.12,     
                            explode=explode_list
                            )

# raise the title by 10% so it clears the percentage labels
plt.title('Immigration to Canada by Continent [2013]', y=1.1) 

plt.axis('equal') 

# add legend
plt.legend(labels=df_continents.index, loc='upper left') 

plt.show()

# Box Plots <a id="8"></a>

A box plot summarizes the distribution of a dataset through five values:

- **Minimum:** Smallest number in the dataset.
- **First quartile:** Middle number between the `minimum` and the `median`.
- **Second quartile (Median):** Middle number of the (sorted) dataset.
- **Third quartile:** Middle number between the `median` and the `maximum`.
- **Maximum:** Highest number in the dataset.
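
These five values can be computed directly with NumPy. The sketch below uses a small invented sample and `np.percentile` with its default linear interpolation, which matches the "middle number" description for this sample but can give interpolated values for other sample sizes.

```python
import numpy as np

# Small illustrative sample, already sorted for readability.
sample = np.array([2, 4, 6, 8, 10])

minimum, q1, median, q3, maximum = np.percentile(sample, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```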

In [None]:
# to get a dataframe, place extra square brackets around 'Japan'.
df_japan = df_can.loc[['Japan'], years].transpose()
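
The effect of the extra square brackets is easy to check on a toy frame (the labels `A` and `B` are placeholders, not rows from the dataset): a single label returns a Series, while a one-element list returns a DataFrame that can be transposed into a single column.

```python
import pandas as pd

toy = pd.DataFrame({"1980": [1, 2], "1981": [3, 4]}, index=["A", "B"])
cols = ["1980", "1981"]

as_series = toy.loc["A", cols]     # single label -> Series
as_frame = toy.loc[["A"], cols]    # list of labels -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
print(as_frame.transpose().shape)  # one row per year, one column for 'A'
```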

In [None]:
df_japan.plot(kind='box', figsize=(8, 6))

plt.title('Box plot of Japanese Immigrants from 1980 - 2013')
plt.ylabel('Number of Immigrants')

plt.show()

In [None]:
df_japan.describe()


Next, we compare the distribution of the number of new immigrants from India and China for the period 1980 - 2013.

In [None]:
df_CI=df_can.loc[['China', 'India'], years].transpose()

In [None]:
df_japan

In [None]:
df_CI

In [None]:
df_CI.describe()

In [None]:
df_CI.plot(kind = 'box', figsize=(10,6))
plt.xlabel('Country')
plt.ylabel('No. Immigrants')
plt.title('Immigrants from China and India to Canada [1980-2013]')

Observe that, while both countries have around the same median immigrant population (~20,000), China's immigrant population range is more spread out than India's. The maximum population from India for any year (36,210) is around 15% lower than the maximum population from China (42,584).


In [None]:
# horizontal box plots
df_CI.plot(kind='box', figsize=(10, 7), color='blue', vert=False)

plt.title('Box plots of Immigrants from China and India (1980 - 2013)')
plt.xlabel('Number of Immigrants')

In [None]:
fig = plt.figure()
ax0 = fig.add_subplot(1, 2, 1) 
ax1 = fig.add_subplot(1, 2, 2)
df_CI.plot(kind='box', color='blue', vert=False, figsize=(20, 6), ax=ax0)
ax0.set_title('Box Plots of Immigrants from China and India (1980 - 2013)')
ax0.set_xlabel('Number of Immigrants')
ax0.set_ylabel('Countries')
df_CI.plot(kind='line', figsize=(20, 6), ax=ax1)
ax1.set_title ('Line Plots of Immigrants from China and India (1980 - 2013)')
ax1.set_ylabel('Number of Immigrants')
ax1.set_xlabel('Years')


We can create a box plot to visualize the distribution of the top 15 countries (based on total immigration) grouped by the *decades* `1980s`, `1990s`, and `2000s`.

In [None]:
topCountries = df_can.sort_values(by='Total', ascending=False).head(15).index
df_top15 = df_can.loc[topCountries]
df_top15.drop(['Continent', 'Region', 'DevName'], axis = 1, inplace = True)
df_top15.transpose()

In [None]:
eighties = df_top15.loc[:,years[0:10]].sum(axis=1)
nineties = df_top15.loc[:,years[10:20]].sum(axis=1)
newMill = df_top15.loc[:,years[20:30]].sum(axis=1)

In [None]:
newDF = pd.DataFrame({'80s' : eighties, '90s':nineties, '00s':newMill})

In [None]:
newDF

In [None]:
newDF.describe()

In [None]:
newDF.plot(kind='box',figsize=(10,6))
plt.xlabel('Decades')
plt.ylabel('Immigrants')
plt.title('Immigration to Canada through the 3 last decades')

In [None]:
# let's check how many entries fall above the outlier threshold 
newDF[newDF['00s']> 209611.5]
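
The notebook does not show where the hard-coded threshold comes from. A common convention, assumed here, is the upper fence `Q3 + 1.5 * IQR`, which is also the default whisker rule in matplotlib box plots. A minimal sketch on invented values:

```python
import pandas as pd

# Illustrative values with one obvious outlier.
values = pd.Series([10, 20, 30, 40, 200])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                    # interquartile range
upper_fence = q3 + 1.5 * iqr     # points above this are flagged as outliers

outliers = values[values > upper_fence]
print(upper_fence)
print(outliers.tolist())
```

Computing the fence from `newDF['00s']` in the same way avoids copying a number by hand from the `describe()` output.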

# Bubble Plots <a id="12"></a>

Argentina suffered a great depression from 1998 - 2002, which caused widespread unemployment, riots, the fall of the government, and a default on the country's foreign debt. In terms of income, over 50% of Argentines were poor, and seven out of ten Argentine children were poor at the depth of the crisis in 2002. 

Let's analyze the effect of this crisis, and compare Argentina's immigration to that of its neighbour Brazil. Let's do that using a `bubble plot` of immigration from Brazil and Argentina for the years 1980 - 2013. We will set the weights for the bubble as the *normalized* value of the population for each year.

In [None]:
df_can_t = df_can[years].transpose()

df_can_t.index = map(int, df_can_t.index)
df_can_t.index.name = 'Year'

# reset index to bring the Year in as a column
df_can_t.reset_index(inplace=True)

# view the changes
df_can_t.head()

In [None]:
# normalize Brazil data
norm_brazil = (df_can_t['Brazil'] - df_can_t['Brazil'].min()) / (df_can_t['Brazil'].max() - df_can_t['Brazil'].min())

# normalize Argentina data
norm_argentina = (df_can_t['Argentina'] - df_can_t['Argentina'].min()) / (df_can_t['Argentina'].max() - df_can_t['Argentina'].min())
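
The normalization used above is min-max scaling, which maps the smallest value to 0 and the largest to 1. Since the same expression is repeated for every country, it could be wrapped in a small helper. The function name `min_max` and the sample counts below are illustrative, not part of the notebook.

```python
import pandas as pd

def min_max(series):
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Illustrative yearly counts.
counts = pd.Series([100, 150, 200])

norm = min_max(counts)
sizes = norm * 2000 + 10   # same scaling as the bubble sizes: never smaller than 10

print(norm.tolist())
print(sizes.tolist())
```

The `+ 10` offset keeps the year with the minimum value visible, since a marker of size 0 would not be drawn.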

In [None]:
# Brazil
ax0 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Brazil',
                    figsize=(14, 8),
                    alpha=0.5,                  
                    color='green',
                    s=norm_brazil * 2000 + 10,  
                    xlim=(1975, 2015)
                   )

# Argentina
ax1 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='Argentina',
                    alpha=0.5,
                    color="blue",
                    s=norm_argentina * 2000 + 10,
                    ax = ax0
                   )

ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980 - 2013')
ax0.legend(['Brazil', 'Argentina'], loc='upper left', fontsize='x-large')

In [None]:
chinaNormalized = (df_can_t['China']-df_can_t['China'].min())/(df_can_t['China'].max()-df_can_t['China'].min())
indiaNormalized = (df_can_t['India']-df_can_t['India'].min())/(df_can_t['India'].max()-df_can_t['India'].min())

In [None]:
ax0 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='China',
                    figsize=(14, 8),
                    alpha=0.5,                  
                    color='green',
                    s=chinaNormalized * 2000 + 10, 
                    xlim=(1975, 2015)
                   )
ax1 = df_can_t.plot(kind='scatter',
                    x='Year',
                    y='India',
                    alpha=0.5,
                    color="blue",
                    s=indiaNormalized * 2000 + 10,
                    ax = ax0
                   )

ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from China and India from 1980 - 2013')
ax0.legend(['China', 'India'], loc='upper left', fontsize='x-large')

# Area Plots

In [None]:
df_can.sort_values(['Total'], ascending=False, axis=0, inplace=True)

# get the top 5 entries
df_top5 = df_can.head()

# transpose the dataframe
df_top5 = df_top5[years].transpose() 

df_top5.head()

Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values (any NaN values will default to 0). To produce an unstacked plot, pass `stacked=False`.
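
To see what stacking means numerically, the upper edge of each band in a stacked area plot is the running sum of the columns drawn so far. The sketch below reproduces that arithmetic on an invented frame with one missing value. It illustrates the idea only and is not a description of how pandas implements the plot internally.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a missing value in column 'A'.
toy = pd.DataFrame(
    {"A": [1.0, 2.0, np.nan], "B": [3.0, 1.0, 2.0]},
    index=["1980", "1981", "1982"],
)

# Missing values count as 0; each column is then stacked on the previous ones.
band_tops = toy.fillna(0).cumsum(axis=1)

print(band_tops)
```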

In [None]:
df_top5.plot(kind='area', 
             alpha=0.75, # 0-1, default value a= 0.5
             stacked=False,
             figsize=(20, 10),
            )

plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# Histograms <a id="8"></a>

A histogram is a way of representing the *frequency* distribution of a numeric dataset. It partitions the x-axis into *bins*, assigns each data point in the dataset to a bin, and then counts the number of data points that have been assigned to each bin.

Before we proceed with creating the histogram plot, let's first examine the data split into intervals. To do this, we will use **NumPy**'s `histogram` function to get the bin ranges and frequency counts as follows:

In [None]:
df_can['2013'].head()

In [None]:
count, bin_edges = np.histogram(df_can['2013'])
print(count) # frequency count
print(bin_edges) # bin ranges, default = 10 bins
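
How `np.histogram` assigns values to bins is easiest to follow on a handful of numbers. In this small sketch with invented data, four equal-width bins span the range from the minimum to the maximum. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 9])

count, bin_edges = np.histogram(data, bins=4)

print(bin_edges)  # edges 1, 3, 5, 7, 9
print(count)      # 1, 2, 2 fall in [1, 3); 3 falls in [3, 5); 9 falls in [7, 9]
```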

In [None]:
df_can['2013'].plot(kind='hist', figsize=(8, 5))

plt.title('Histogram of Immigration from 195 Countries in 2013') # add a title to the histogram
plt.ylabel('Number of Countries') # add y-label
plt.xlabel('Number of Immigrants') # add x-label

Next, let's look at Scandinavian immigration to Canada (Denmark, Norway, and Sweden).

In [None]:
df_can.loc[['Denmark', 'Norway', 'Sweden'], years]

In [None]:
# transpose dataframe
df_t = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()
df_t.head()

In [None]:
count, bin_edges = np.histogram(df_t, 15)
xmin = bin_edges[0] - 10   #  first bin value is 31.0, adding buffer of 10 for aesthetic purposes 
xmax = bin_edges[-1] + 10  #  last bin value is 308.0, adding buffer of 10 for aesthetic purposes

# stacked Histogram
df_t.plot(kind='hist',
          figsize=(10, 6), 
          bins=15,
          xticks=bin_edges,
          color=['coral', 'darkslateblue', 'mediumseagreen'],
          stacked=True,
          xlim=(xmin, xmax)
         )

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden from 1980 - 2013')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants') 

Display the immigration distribution for Greece, Albania, and Bulgaria for the years 1980 - 2013.

In [None]:
dfEuro = df_can.loc[['Greece', 'Albania', 'Bulgaria'], years].transpose()
dfEuro.plot(kind = 'hist',
            bins = 15,
            stacked=False,
            figsize = (10,6),
            title="Histogram of Immigration from Greece, Albania, and Bulgaria from 1980 - 2013"
           )

# Bar Charts (Dataframe) <a id="10"></a>

A bar plot is a way of representing data where the *length* of the bars represents the magnitude/size of the feature/variable. Bar graphs usually represent numerical and categorical variables grouped in intervals. 

In [None]:
df_iceland = df_can.loc['Iceland', years]
df_iceland.head()

In [None]:
df_iceland.plot(kind='bar', figsize=(10, 6))

plt.xlabel('Year')
plt.ylabel('Number of immigrants') 
plt.title('Icelandic immigrants to Canada from 1980 to 2013')

The bar plot above shows the number of immigrants for each year. We can clearly see the impact of Iceland's 2008 financial crisis: the number of immigrants to Canada started increasing rapidly after 2008.

# Scatter Plots <a id="10"></a>

Using a `scatter plot`, we can visualize the trend of total immigration to Canada (all countries combined) for the years 1980 - 2013.

In [None]:
df_tot = pd.DataFrame(df_can[years].sum(axis=0))
df_tot.index = map(int, df_tot.index)
df_tot.reset_index(inplace = True)
# rename columns
df_tot.columns = ['year', 'total']
# view the final dataframe
df_tot.head()

In [None]:
df_tot.plot(kind='scatter', x='year', y='total', figsize=(18, 6), color='darkblue')

plt.title('Total Immigration to Canada from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()

Notice how the scatter plot does not connect the datapoints together. We can clearly observe an upward trend in the data: as the years go by, the total number of immigrants increases. We can mathematically analyze this upward trend using a regression line (line of best fit). 

In [None]:
x = df_tot['year']
y = df_tot['total']
fit = np.polyfit(x, y, deg=1)

fit

The output is an array with the polynomial coefficients, highest powers first. Since we are fitting a straight line `y = a*x + b`, the output has 2 elements `[5.56709228e+03, -1.09261952e+07]`, with the slope in position 0 and the intercept in position 1.
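
The coefficient order can be verified on synthetic points that lie exactly on a known line. In this sketch the data follow `y = 2*x + 1`, so the fit should recover a slope of about 2 and an intercept of about 1. `np.polyval` evaluates the fitted line using the same highest-power-first ordering.

```python
import numpy as np

# Points lying exactly on y = 2*x + 1 (synthetic, for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at a new point, x = 4
prediction = np.polyval([slope, intercept], 4.0)

print(slope, intercept, prediction)
```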

In [None]:
df_tot.plot(kind='scatter', x='year', y='total', figsize=(10, 6), color='darkblue')

plt.title('Total Immigration to Canada from 1980 - 2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.plot(x, fit[0] * x + fit[1], color='red') # recall that x is the Years
plt.annotate('y={0:.2f} x + {1:.2f}'.format(fit[0], fit[1]), xy=(2000, 150000))
'No. Immigrants = {0:.0f} * Year + {1:.0f}'.format(fit[0], fit[1]) 

Using the equation of line of best fit, we can estimate the number of immigrants in 2015:
```python
No. Immigrants = 5567 * Year - 10926195
No. Immigrants = 5567 * 2015 - 10926195
No. Immigrants = 291,310
```
When compared to the actuals from Citizenship and Immigration Canada's (CIC) [2016 Annual Report](http://www.cic.gc.ca/english/resources/publications/annual-report-2016/index.asp), we see that Canada accepted 271,845 immigrants in 2015. Our estimated value of 291,310 is about 7% above the actual number, which is pretty good considering our original data came from the United Nations (and might differ slightly from CIC data).
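
The estimate and its deviation from the reported figure can be reproduced with a few lines of arithmetic, using the rounded coefficients quoted above and the 2015 figure from the CIC report:

```python
# Rounded coefficients of the fitted line, as quoted in the text.
slope = 5567
intercept = -10926195

estimate = slope * 2015 + intercept   # predicted number of immigrants in 2015
actual = 271845                       # figure reported by CIC for 2015

relative_error = (estimate - actual) / actual

print(estimate)
print(round(relative_error * 100, 1))  # overestimate, in percent
```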

Next, create a scatter plot of the total immigration from Denmark, Norway, and Sweden to Canada from 1980 to 2013.

In [None]:
dfCountries = df_can.loc[['Denmark', 'Norway','Sweden'],years]
dfCountries.head()

In [None]:
dfTotal = pd.DataFrame(dfCountries.sum(axis=0))
dfTotal.index = map(int, dfTotal.index)
dfTotal.reset_index(inplace = True)
dfTotal.columns = ['year', 'total']
dfTotal.head()

In [None]:
dfTotal.plot(kind='scatter', x='year', y='total', figsize=(10,6))
plt.title('Total Immigrants Each Year from Denmark, Norway, and Sweden, to Canada')