In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1>Data Management</h1>

I used the GapMinder data set to investigate the three variables incomeperperson, armedforcesrate, and polityscore.

<h4>SET UP</h4>

<i>Import the packages needed in this programme</i>

In [2]:
import pandas as pd
import numpy as np

<i>Set some options</i>

In [3]:
pd.set_option('display.max_rows', 220)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.float_format',lambda x:'%f'%x)

<i>Read in the whole data set, using the country column as the index</i>

In [4]:
data = pd.read_csv('../gapminder.csv', low_memory=False, index_col='country')

In [5]:
# Strip stray spaces so that each column can be parsed as numeric
for each in data.columns:
    data[each] = pd.to_numeric(data[each].str.replace(' ',''))

<i>Keep only the three columns of interest and drop rows where either main variable is null</i>

In [10]:
data = data[['incomeperperson','armedforcesrate','polityscore']].dropna(subset=['incomeperperson','armedforcesrate'])

<i>Look at how many rows and columns the data set now has</i>

In [13]:
data.shape

(159, 3)

<h4>MANAGEMENT</h4>

<i>Look at the columns of interest as a whole</i>

In [14]:
data[['incomeperperson','armedforcesrate','polityscore']].head()

country,incomeperperson,armedforcesrate,polityscore
Albania,1914.996551,1.024736,9.0
Algeria,2231.993335,2.306817,2.0
Angola,1381.004268,1.461329,-2.0
Argentina,10749.419238,0.560987,8.0
Armenia,1326.741757,2.618438,5.0


In [15]:
data[['incomeperperson','armedforcesrate','polityscore']].tail()

country,incomeperperson,armedforcesrate,polityscore
Venezuela,5528.363114,0.904025,-3.0
Vietnam,722.807559,1.085367,-7.0
"Yemen, Rep.",610.357367,2.316235,-2.0
Zambia,432.226337,0.341335,7.0
Zimbabwe,320.77189,1.032785,1.0


<i>Reduce the precision of each column</i>

In [21]:
data.loc[:,'incomeperperson'] = data.loc[:,'incomeperperson'].astype('int')

In [28]:
data.loc[:,'armedforcesrate'] = data.loc[:,'armedforcesrate'].round(4)

In [32]:
data.loc[:,'polityscore'] = data.loc[:,'polityscore'].fillna(50).astype('int')

<i>Check it out</i>

In [33]:
data.describe()

,incomeperperson,armedforcesrate,polityscore
count,159.0,159.0,159.0
mean,7353.666667,1.359445,6.779874
std,10555.773438,1.528641,12.733215
min,103.0,0.0,-10.0
25%,602.0,0.46825,-1.0
50%,2481.0,0.904,7.0
75%,8880.0,1.54405,10.0
max,52301.0,9.8201,50.0


In [34]:
data.shape

(159, 3)

<h4>ANALYSIS</h4>

I started by removing the variables that I was not going to use.

I then saw that both of my main variable columns contained null values, so I filtered the data set to exclude any row with a null in either column.
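The null check and filter described here can be sketched on a small, hypothetical frame. The values and the single-letter country labels below are made up for illustration; only the column names come from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the three GapMinder columns
toy = pd.DataFrame(
    {
        "incomeperperson": [1914.99, np.nan, 1381.00, 10749.41],
        "armedforcesrate": [1.02, 2.31, np.nan, 0.56],
        "polityscore": [9.0, 2.0, -2.0, np.nan],
    },
    index=pd.Index(["A", "B", "C", "D"], name="country"),
)

# Count the nulls in each column before filtering
null_counts = toy.isnull().sum()

# Keep only the rows where both main variables are present;
# a null in polityscore alone does not remove a row
filtered = toy.dropna(subset=["incomeperperson", "armedforcesrate"])

print(null_counts)
print(filtered.shape)
```

Row D survives the filter even though its polityscore is null, which mirrors the decision not to trim the data set on the further-interest variable.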

Looking at the whole data set, I thought there were too many decimal places, so I started with the main variable, incomeperperson, and tried different methods of rounding to see whether they behaved as I expected; one method did not. To check, I created a new table holding the original column alongside the new columns so that I could compare them side by side.
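A minimal sketch of such a side-by-side comparison follows. The three methods shown (integer cast, `round`, and `np.floor`) are assumptions, since the text does not say which methods were tried. The first three values are taken from the output above and the two `.5` values are hypothetical, added to show how ties are handled.

```python
import numpy as np
import pandas as pd

# Three real values from the output above plus two hypothetical tie cases
income = pd.Series([1914.996551, 2231.993335, 722.807559, 320.5, 321.5])

comparison = pd.DataFrame(
    {
        "original": income,
        "as_int": income.astype(int),   # drops the fractional part
        "rounded": income.round(0),     # nearest whole number, ties go to the even neighbour
        "floored": np.floor(income),    # always rounds down
    }
)

print(comparison)
```

The integer cast truncates rather than rounds, so 1914.996551 becomes 1914, while `round` sends 320.5 down to 320 and 321.5 up to 322. Either behaviour could be the surprise mentioned in the text.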

I then moved on to the armed forces variable. I did not feel that rounding it heavily would add clarity, so I left it largely as it was, trimming it only to four decimal places.

Finally, my further-interest variable, polityscore, holds only integer values, but the nulls force the column to be stored as floats, so every value displays with decimal places. To combat this I set the nulls to 50, a value well outside the legitimate scale of -10 to 10, and converted the whole column to integer. I did not want to trim my data set any further, especially as this variable is a further interest and not part of my main set.
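One consequence of the placeholder is that it takes part in any summary of the column, as the maximum of 50 in the `describe()` output shows. The sketch below uses hypothetical scores, and the name `SENTINEL` is introduced only for illustration. It shows one possible way to mask the placeholder when a statistic should reflect real scores only.

```python
import numpy as np
import pandas as pd

# Hypothetical polity scores with two missing values
polity = pd.Series([9.0, 2.0, np.nan, -7.0, np.nan], name="polityscore")

SENTINEL = 50  # placeholder value from the text, outside the -10 to 10 scale
coded = polity.fillna(SENTINEL).astype(int)

# A summary over the full column includes the placeholder
mean_with_sentinel = coded.mean()

# Masking the placeholder restricts the summary to real scores
mean_real = coded[coded != SENTINEL].mean()

print(coded.tolist())
print(mean_with_sentinel, mean_real)
```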

Looking at my new frequency tables, I saw that some incomeperperson values were no longer unique, although none had a high frequency. I also noticed that the value counts for incomeperperson and armedforcesrate had decreased, which I expected because I removed every row with a null in either of those columns, regardless of whether the other column held legitimate data.
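The frequency tables are not shown here, so the following is only a sketch of how such a check might look with `value_counts`. The income values are hypothetical integers of the kind produced by the conversion above.

```python
import pandas as pd

# Hypothetical integer incomes, some of which now collide
income = pd.Series([1914, 2231, 722, 722, 320, 1914, 722], name="incomeperperson")

# Frequency table ordered by value rather than by count
freq = income.value_counts().sort_index()

# Values that are no longer unique after the conversion
duplicates = freq[freq > 1]

print(freq)
print(duplicates)
```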