<h1>Regression Modeling in Practice</h1>
<h3>WEEK 1 : Introduction to Regression</h3>

I chose the Gapminder dataset. From this dataset I am interested in three variables: incomeperperson, armedforcesrate, and polityscore. Detailed descriptions of what the variables represent are given below.

<img src="CoursebookSnippet.png">

<font color = '#009900'><h4>DATA</h4></font>

The data is provided by the non-profit organisation Gapminder. Their website is linked <a href = "https://www.gapminder.org/">here</a>.

<font color = '#009900'><h4>SAMPLE</h4></font>

The sample is a collection of data on 213 countries. Gapminder collects the data through donations from external sources, online polls, and questions put to live audiences. More information can be found at the links below:
<br><br><span style = "margin-left : 50px"><a href = "https://www.gapminder.org/about-gapminder/">About Gapminder</a></span>
<br><span style = "margin-left : 50px"><a href = "https://www.gapminder.org/data/">Data in the Gapminder World</a></span>
<br><br>
The main sources of the data are indicated in the codebook snippet above:
<br><span style = "margin-left : 50px"><i>incomeperperson</i> : World Bank</span>
<br><span style = "margin-left : 50px"><i>armedforcesrate</i> : World Development Indicators (also World Bank)</span>
<br><span style = "margin-left : 50px"><i>polityscore</i>     : Polity IV Project</span>

<font color = '#009900'><h4>PROCEDURE</h4></font>

The data is all observational.

By searching the <i>Data in the Gapminder World</i> link above for the individual variables, we find that the information is <i>gathered/calculated/collected</i> as follows:
<br><span style = "margin-left : 50px"><u>incomeperperson</u></span>
<br><span style = "margin-left : 100px">The values are calculated as the weighted average of values from account data available to the World Bank and The Organisation for Economic Co-operation and Development (OECD).</span>
<br><span style = "margin-left : 50px"><u>armedforcesrate</u></span>
<br><span style = "margin-left : 100px">The data is received from the International Institute for Strategic Studies, The Military Balance. Not many countries are willing to give actual numbers of military personnel. As such, the counts of active servicemen and women are sometimes extrapolated or gathered through intelligence. This means that the numbers may be inaccurate in some cases but are still reasonable estimates.</span>
<br><span style = "margin-left : 50px"><u>polityscore</u></span>
<br><span style = "margin-left : 100px">The values are gathered by monitoring the political behaviour of these countries. Decisions on indicators are then made in the context of the country under observation.</span>

As the data is observational, we can only ever indicate whether our variables are associated. Causality <b>CANNOT</b> be deduced from our analysis. Because countries are such complex networks of influences, there is potential for a very large number of confounding variables.
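
To illustrate why association in observational data does not imply causation, here is a minimal simulation sketch. It is not part of the Gapminder analysis: the variables <i>x</i>, <i>y</i> and the confounder <i>z</i> are hypothetical, generated with NumPy purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical confounder z drives both x and y; x has no direct effect on y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# x and y are correlated even though neither causes the other.
raw_corr = np.corrcoef(x, y)[0, 1]

# Subtract the linear fit on z from each variable, then correlate the residuals.
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
adjusted_corr = np.corrcoef(x_resid, y_resid)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after adjusting for z: {adjusted_corr:.2f}")
```

In real country-level data the confounders are usually unobserved, so this kind of adjustment is often not available, which is why only association is claimed here.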

<font color = '#009900'><h4>MEASURES</h4></font>

By searching the <i>Data in the Gapminder World</i> link above for the individual variables we find:
<br><span style = "margin-left : 50px"><u>incomeperperson</u></span>
<br><span style = "margin-left : 100px"><i>Description</i> : GNI per capita is gross national income divided by midyear population. GNI (formerly GNP) is the sum of value added by all resident producers plus any product taxes (less subsidies) not included in the valuation of output plus net receipts of primary income (compensation of employees and property income) from abroad. Data are in constant 2000 U.S. dollars.</span>
<br><span style = "margin-left : 100px"><i>My Data</i> : Coerced into a number, no rounding or binning to be used, any countries with nulls are removed</span>
<br><span style = "margin-left : 100px"><i>Link to complete reference</i> : <a href = "http://data.worldbank.org/indicator/NY.GNP.PCAP.KD"> Link</a></span>
<br><span style = "margin-left : 50px"><u>armedforcesrate</u></span>
<br><span style = "margin-left : 100px"><i>Description</i> : Armed forces personnel are active duty military personnel, including paramilitary forces if the training, organization, equipment, and control suggest they may be used to support or replace regular military forces. Labor force comprises all people who meet the International Labour Organization's definition of the economically active population. Note: Data for some countries are based on partial or uncertain data or rough estimates.</span>
<br><span style = "margin-left : 100px"><i>My Data</i> : Coerced into a number, no rounding or binning to be used, any countries with nulls are removed</span>
<br><span style = "margin-left : 100px"><i>Link to complete reference</i> : <a href = "http://data.worldbank.org/indicator/MS.MIL.TOTL.TF.ZS"> Link</a></span>
<br><span style = "margin-left : 50px"><u>polityscore</u></span>
<br><span style = "margin-left : 100px"><i>Description</i> : Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.</span>
<br><span style = "margin-left : 100px"><i>My Data</i> : Categorical, no binning to be implemented unless an initial analysis to narrow down areas of interest is conducted, any countries with nulls are removed</span>
<br><span style = "margin-left : 100px"><i>Link to complete reference</i> : <a href = "http://www.systemicpeace.org/polity/polity4.htm"> Link</a></span>
<br><span style = "margin-left : 100px"><i>Sample size</i> : 167 countries are part of the study</span>
<br><br>Further information is available at the complete reference links.
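<br><br>The polity score arithmetic described above can be sketched as follows. The function name <i>polity_score</i> is hypothetical, and the 0 to 10 range of each component score is an assumption that is not stated in the codebook description itself.

```python
def polity_score(democracy: int, autocracy: int) -> int:
    """Combined score: democracy score minus autocracy score.

    Each component is assumed to lie between 0 and 10 (an assumption,
    not taken from the codebook snippet), which gives a range of -10 to 10.
    """
    if not (0 <= democracy <= 10 and 0 <= autocracy <= 10):
        raise ValueError("component scores are assumed to lie between 0 and 10")
    return democracy - autocracy


print(polity_score(10, 0))   # most democratic end of the scale
print(polity_score(0, 10))   # most autocratic end of the scale
print(polity_score(6, 4))    # a mixed regime
```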

<b>NB</b> : For the current analysis, all countries which have a null value for any of our 3 variables are removed.

<font color = '#009900'><h3>A QUICK LOOK AT THE DATA SET</h3></font>

Set up

In [1]:
import pandas as pd
pd.options.display.max_rows = 6

Reading in then converting data to numeric

In [82]:
# Load only the three variables of interest, indexed by country.
variables = ['incomeperperson', 'armedforcesrate', 'polityscore']
gap_data = pd.read_csv('gapminder.csv', usecols=['country'] + variables,
                       index_col='country')
# Blank entries cannot be parsed as numbers; errors='coerce' turns them into NaN.
for col in variables:
    gap_data[col] = pd.to_numeric(gap_data[col], errors='coerce')
gap_data

country,incomeperperson,armedforcesrate,polityscore
Afghanistan,NaN,0.569653,0.0
Albania,1914.996551,1.024736,9.0
Algeria,2231.993335,2.306817,2.0
...,...,...,...
"Yemen, Rep.",610.357367,2.316235,-2.0
Zambia,432.226337,0.341335,7.0
Zimbabwe,320.771890,1.032785,1.0
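
As a small illustration of what converting to numeric does with blank entries, the sketch below uses a made-up Series with placeholder labels rather than the real file. Passing errors='coerce' to pd.to_numeric is one way to get this behaviour and is shown here as an assumption, not necessarily the exact approach used in the cell above.

```python
import pandas as pd

# Toy column of strings: two parseable numbers and one blank entry.
raw = pd.Series(['1914.996551', ' ', '2231.993335'], index=['A', 'B', 'C'])

# Values that cannot be parsed as numbers become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric)
print("nulls:", numeric.isnull().sum())
```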


Check for nulls in the data set

In [60]:
gap_data.isnull().sum()

incomeperperson    23
armedforcesrate    49
polityscore        52
dtype: int64

Check the size of the data set when any rows that contain null values are removed.

In [78]:
# dropna() keeps only the rows with no null in any column.
gap_data.dropna().shape

(149, 3)
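
The per-column null counts (23, 49 and 52) sum to 124, yet far fewer rows than that are lost if the file holds the 213 countries mentioned earlier. This is because a single country is often missing more than one variable. The toy frame below, with made-up values and placeholder labels, sketches that effect.

```python
import numpy as np
import pandas as pd

# Made-up data: row B is missing everything, C and D are missing one or two values.
toy = pd.DataFrame(
    {
        'incomeperperson': [1000.0, np.nan, 300.0, np.nan],
        'armedforcesrate': [1.2, np.nan, 0.5, 2.0],
        'polityscore': [9.0, np.nan, np.nan, 2.0],
    },
    index=['A', 'B', 'C', 'D'],
)

null_counts = toy.isnull().sum()                 # nulls per column
rows_with_null = toy.isnull().any(axis=1).sum()  # rows with at least one null
complete = toy.dropna()                          # rows with no nulls at all

print("total null cells:", null_counts.sum())
print("rows with any null:", rows_with_null)
print("complete rows:", complete.shape)
```

Here the total number of null cells is larger than the number of rows dropped, which mirrors the pattern in the Gapminder counts.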