# Lecture 1: Introduction and Data Science Basics


(Based on the Jupyter notebook by Marek Rei)

This session covers how to load data in common formats into Python, plot it, and calculate basic statistics over it.

## Python Syntax Refresher

Here is a short Python program as a reminder of the syntax.

This short code snippet imports a library called *random*, creates a list with three elements, then goes through the list and prints each element along with a random number.

In [None]:
import random

my_list = ["camel", "elephant", "crocodile"]
for word in my_list:
    print(word + " " + str(random.random()))
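
Because `random.random()` returns a different value on every run, the output of the cell above changes each time. As a minimal sketch (the helper name `labelled_randoms` and the seed value are illustrative choices, not part of the original notebook), a seeded generator makes the numbers repeatable, and an f-string replaces the manual `str()` concatenation:

```python
import random

my_list = ["camel", "elephant", "crocodile"]

def labelled_randoms(words, seed):
    # A separate generator with a fixed seed yields the same sequence every time.
    rng = random.Random(seed)
    return [f"{word} {rng.random()}" for word in words]

first = labelled_randoms(my_list, seed=42)
second = labelled_randoms(my_list, seed=42)

for line in first:
    print(line)

# Both calls used the same seed, so the two lists are identical.
print(first == second)
```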

If you need help with getting started with Python, there are a couple of useful tutorials online, e.g.:

- https://www.tutorialspoint.com/python/index.htm
- https://www.learnpython.org
- https://www.codecademy.com/learn/learn-python-3

## Dataset

We will use the file `data/country-stats.csv`, which contains demographic information for 161 countries, collected by The World Bank (the repository also provides the data in a number of other formats). Each line includes the following values:

* Country Name
* GDP per Capita (PPP USD)
* Population Density (persons per sq km)
* Population Growth Rate (%)
* Urban Population (%)
* Life Expectancy at Birth (avg years)
* Fertility Rate (births per woman)
* Infant Mortality (deaths per 1000 births)
* Enrolment Rate, Tertiary (%)
* Unemployment, Total (%)
* Estimated Control of Corruption (scale -2.5 to 2.5)
* Estimated Government Effectiveness (scale -2.5 to 2.5)
* Internet Users (%)

A sample of the CSV (comma-separated values) format can be seen below. Values are separated by commas; the first row contains column headers.

~~~~
Country Name,GDP per Capita (PPP USD),Population Density (persons per sq km),Population Growth Rate (%),Urban Population (%),Life Expectancy at Birth (avg years),Fertility Rate (births per woman),Infant Mortality (deaths per 1000 births),"Enrolment Rate, Tertiary (%)","Unemployment, Total (%)",Estimated Control of Corruption (scale -2.5 to 2.5),Estimated Government Effectiveness (scale -2.5 to 2.5),Internet Users (%)
Afghanistan,1560.67,44.62,2.44,23.86,60.07,5.39,71,3.33,8.5,-1.41,-1.4,5.45
Albania,9403.43,115.11,0.26,54.45,77.16,1.75,15,54.85,14.2,-0.72,-0.28,54.66
Algeria,8515.35,15.86,1.89,73.71,70.75,2.83,25.6,31.46,10,-0.54,-0.55,15.23
Antigua and Barbuda,19640.35,200.35,1.03,29.87,75.5,2.12,9.2,14.37,8.4,1.29,0.48,83.79
Argentina,12016.2,14.88,0.88,92.64,75.84,2.2,12.7,74.83,7.2,-0.49,-0.25,55.8
~~~~
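
Two of the headers in the sample, `"Enrolment Rate, Tertiary (%)"` and `"Unemployment, Total (%)"`, contain commas themselves and are therefore wrapped in double quotes. As a rough illustration of why this matters, the sketch below parses a shortened copy of the sample (three columns only, picked here for brevity) with Python's built-in `csv` module and compares the result with a naive split on commas.

```python
import csv
import io

# Shortened sample: three columns taken from the file shown above.
sample = (
    'Country Name,"Unemployment, Total (%)",Internet Users (%)\n'
    'Afghanistan,8.5,5.45\n'
    'Albania,14.2,54.66\n'
)

# A naive split cuts the quoted header into two pieces.
naive_header = sample.splitlines()[0].split(",")
print(len(naive_header))  # 4 fields instead of 3

# csv.reader respects the quotes, so the header has the expected 3 fields.
rows = list(csv.reader(io.StringIO(sample)))
header = rows[0]
print(header)
```

This is the reason to prefer a proper CSV parser over splitting lines by hand.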

## Reading Data into Python

We can use the Python library `pandas` to load CSV files easily. The *data* variable will be a `pandas`-specific object (a DataFrame) containing the whole dataset, and *data.head()* shows the first few rows.

In [None]:
import pandas as pd

data = pd.read_csv('data/country-stats.csv')
data.head()
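
After loading a file it is usually worth checking that the table has the expected size and column types. The sketch below does this on a small in-memory copy of three rows from the sample (so that it runs without the data file); the variable names `csv_text` and `sample` are illustrative only.

```python
import io
import pandas as pd

# Three rows and three columns copied from the sample shown earlier.
csv_text = (
    'Country Name,GDP per Capita (PPP USD),"Unemployment, Total (%)"\n'
    'Afghanistan,1560.67,8.5\n'
    'Albania,9403.43,14.2\n'
    'Algeria,8515.35,10\n'
)

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header row
print(sample.dtypes)         # one data type per column
```

The same three calls can be applied to the full *data* object loaded above.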

## Using *Pandas* to Analyze Data

Now that we have loaded the data, we can analyze it.  

To start, we'll focus on one variable in this dataset: GDP per Capita (PPP USD).

It is common to describe a variable by finding its average value (the mean), so let's do that first.

In [None]:
data["GDP per Capita (PPP USD)"].mean()

Now we know that the average GDP per capita across these countries is $15616. But on its own that doesn't tell us much. As data scientists, we want to find interesting connections and patterns.

What if we look at how the average GDP differs between countries with low and high unemployment? We can use `pandas` to first select the countries whose unemployment rate is below a chosen threshold (e.g., 7%) and then calculate the mean over that group. The remaining countries, at or above the threshold, form the second group.

In [None]:
low_unemployment_countries = data[data["Unemployment, Total (%)"] < 7]
low_unemployment_countries["GDP per Capita (PPP USD)"].mean()

In [None]:
high_unemployment_countries = data[data["Unemployment, Total (%)"] >= 7]
high_unemployment_countries["GDP per Capita (PPP USD)"].mean()
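
The filtering expression used in the two cells above works in two steps: the comparison produces one True/False value per row (a boolean mask), and indexing the DataFrame with that mask keeps only the rows marked True. The sketch below shows both steps on the five sample rows from earlier. Since all five have unemployment of at least 7%, a threshold of 9 is assumed here purely so that both groups are non-empty.

```python
import pandas as pd

# The five sample rows shown earlier (three columns only).
sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania", "Algeria",
                     "Antigua and Barbuda", "Argentina"],
    "GDP per Capita (PPP USD)": [1560.67, 9403.43, 8515.35, 19640.35, 12016.2],
    "Unemployment, Total (%)": [8.5, 14.2, 10.0, 8.4, 7.2],
})

# Assumed threshold for this tiny sample only; the lecture uses 7.
threshold = 9

# Step 1: the comparison yields one boolean per row.
mask = sample["Unemployment, Total (%)"] < threshold
print(mask.tolist())  # [True, False, False, True, True]

# Step 2: indexing with the mask keeps the rows where it is True.
low = sample[mask]
# ~mask inverts the mask. Unlike ">= threshold", it would also select
# rows with a missing unemployment value (there are none here).
high = sample[~mask]

print(low["GDP per Capita (PPP USD)"].mean())
print(high["GDP per Capita (PPP USD)"].mean())
```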

Now we're getting somewhere! There's a difference in average GDP between these two groups: countries with a higher unemployment rate have a lower average GDP per capita.

Let's plot this finding using another helpful library, `matplotlib`.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

bar_width = 0.6

x1 = [0.0]
x2 = [bar_width]

y1 = [data[data["Unemployment, Total (%)"] < 7]["GDP per Capita (PPP USD)"].mean()]
y2 = [data[data["Unemployment, Total (%)"] >= 7]["GDP per Capita (PPP USD)"].mean()]

fig, ax = plt.subplots()
bars1 = ax.bar(x1, y1, bar_width, alpha=0.4, color='b', label='Low unemployment')
bars2 = ax.bar(x2, y2, bar_width, alpha=0.4, color='r', label='High unemployment')

ax.set_ylabel('GDP')
ax.set_title('Average GDP by unemployment')
ax.set_xticks([])
ax.set_xlim([-2,2.5])
ax.set_ylim([14000,17000])
ax.legend()

plt.show()

This clearly looks like a big difference in GDP between the two groups, right?

Well, there are actually a couple of problems with this plot. First of all, a lot depends on how the information is presented. If we adjust the Y axis to show the whole range of values, the same difference will not look as substantial any more.

In [None]:
bar_width = 0.6

x1 = [0.0]
x2 = [bar_width]

y1 = [data[data["Unemployment, Total (%)"] < 7]["GDP per Capita (PPP USD)"].mean()]
y2 = [data[data["Unemployment, Total (%)"] >= 7]["GDP per Capita (PPP USD)"].mean()]

fig, ax = plt.subplots()
bars1 = ax.bar(x1, y1, bar_width, alpha=0.4, color='b', label='Low unempl.')
bars2 = ax.bar(x2, y2, bar_width, alpha=0.4, color='r', label='High unempl.')

ax.set_ylabel('GDP')
ax.set_title('Average GDP by unemployment')
ax.set_xticks([])
ax.set_xlim([-2,2.5])
ax.set_ylim([0, 17000])
ax.legend()

plt.show()

That doesn't look like a very big difference any more. The way we present data is a powerful tool, and we have to be careful not to fool ourselves (or others) with it.
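
The effect of the axis range can be put into numbers. When the Y axis starts at some value above zero, only the part of each bar above that value is drawn, so the ratio of the visible bar heights is no longer the ratio of the underlying values. The sketch below uses two hypothetical group means (chosen for round arithmetic, not taken from the dataset) and a hypothetical helper `visible_ratio`.

```python
def visible_ratio(taller, shorter, axis_min):
    """Ratio of the drawn bar heights when the Y axis starts at axis_min."""
    return (taller - axis_min) / (shorter - axis_min)

# Hypothetical group means, for illustration only.
mean_a = 16500.0
mean_b = 15000.0

# Y axis starting at 14000, as in the first plot.
truncated = visible_ratio(mean_a, mean_b, axis_min=14000)
# Y axis starting at 0, as in the second plot.
full = visible_ratio(mean_a, mean_b, axis_min=0)

print(truncated)  # 2.5: one bar looks two and a half times as tall
print(full)       # 1.1: the actual ratio of the two values
```

With these example numbers, a 10% difference is drawn as if it were a 150% difference.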

The second problem is that taking the average of some data throws away a lot of important information. Let's calculate the standard deviation of these groups as well.

In [None]:
low_unemployment_countries = data[data["Unemployment, Total (%)"] < 7]
low_unemployment_countries["GDP per Capita (PPP USD)"].std()

In [None]:
high_unemployment_countries = data[data["Unemployment, Total (%)"] >= 7]
high_unemployment_countries["GDP per Capita (PPP USD)"].std()

For both subgroups, the standard deviation is almost as high as the average GDP. In other words, the spread within each group is large compared with the gap between the group means, so the difference in averages tells us very little.

We can plot the data to take a better look at the relationship between these two variables.
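
One way to make "the standard deviation is about as large as the mean" concrete is to divide one by the other (the coefficient of variation). The sketch below does this for the GDP values of the five sample rows shown earlier. It also shows that `pandas` computes the sample standard deviation by default (`ddof=1`), which matches `statistics.stdev` from the standard library.

```python
import statistics
import pandas as pd

# GDP per capita of the five sample rows shown earlier.
gdp = pd.Series([1560.67, 9403.43, 8515.35, 19640.35, 12016.2])

mean = gdp.mean()
sample_std = gdp.std()            # pandas default: ddof=1 (sample formula)
population_std = gdp.std(ddof=0)  # divide by n instead of n - 1

print(mean)
print(sample_std, statistics.stdev(gdp.tolist()))
print(population_std)

# Spread relative to the mean; values near 1 mean the spread is
# about as large as the average itself.
print(sample_std / mean)
```

Five rows are far too few to say anything about the full dataset; the point is only how the quantities are computed.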

In [None]:
plt.scatter(data["Unemployment, Total (%)"], data["GDP per Capita (PPP USD)"])
plt.xlabel("Unemployment, Total (%)")
plt.ylabel("GDP per Capita (PPP USD)")
# Label two example points, selected by their row index in the dataset.
for i in [37, 84]:
    plt.annotate(data["Country Name"][i],
                 (data["Unemployment, Total (%)"][i], data["GDP per Capita (PPP USD)"][i]))

# Save the figure to a file before showing it.
plt.savefig('graph3.png', dpi=400)

plt.show()

We can see that there are some countries with very low unemployment and very high GDP, and some countries with very high unemployment and very low GDP (following our original intuition about the data). But there are also many countries with low unemployment and low GDP, so our original assumption is not actually supported by the data.

## Doing More with Pandas

Here are some other useful things you can do with the `pandas` DataFrame object. If you call *.mean()* on the whole dataset without any filters, it calculates the mean of every numeric column.

In [None]:
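# numeric_only=True skips the text column "Country Name";
# recent pandas versions raise an error without it.
data.mean(numeric_only=True)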

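Similarly, you can calculate statistics such as the median, minimum and maximum. A notebook cell only displays its last expression, so each result is printed explicitly here:

In [None]:
print(data.median(numeric_only=True))
print(data.min(numeric_only=True))
print(data.max(numeric_only=True))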

You can also index the data by row and column numbers, which are counted from zero. Here is row number 2:

In [None]:
data.iloc[2,:]

Column number 4:

In [None]:
data.iloc[:,4].head()

Element in row 2 column 4:

In [None]:
data.iloc[2,4]
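
`.iloc` selects by position, while the related `.loc` selects by label. On a freshly loaded dataset the row labels are 0, 1, 2, ... and the two agree, but after filtering they can diverge. The sketch below illustrates this on three of the sample rows; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania", "Algeria"],
    "GDP per Capita (PPP USD)": [1560.67, 9403.43, 8515.35],
    "Unemployment, Total (%)": [8.5, 14.2, 10.0],
})

# Position-based: row 2, column 1 (both counted from zero).
by_position = sample.iloc[2, 1]

# Label-based: row label 2, column selected by name.
by_label = sample.loc[2, "GDP per Capita (PPP USD)"]

print(by_position, by_label)  # the same cell: 8515.35

# After filtering, the original row labels are kept,
# so labels and positions no longer coincide.
filtered = sample[sample["Unemployment, Total (%)"] >= 10]
print(filtered.index.tolist())           # [1, 2]
print(filtered.iloc[0]["Country Name"])  # Albania (first remaining row)
```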

The *.describe()* method reports several summary statistics at once (count, mean, standard deviation, minimum, quartiles and maximum) for every numeric column:

In [None]:
data.describe()

Also, the *.corr()* method calculates the correlation between every pair of numeric columns:

In [None]:
# numeric_only=True leaves out the text column "Country Name".
data.corr(numeric_only=True)
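
By default the correlation computed by `pandas` is the Pearson coefficient, a value between -1 and 1. As a minimal sketch of what that number is, the snippet below computes it by hand for two columns of the five sample rows and compares the result with `Series.corr`. The helper name `pearson` is an illustrative choice.

```python
import math
import pandas as pd

# Two columns from the five sample rows shown earlier.
unemployment = pd.Series([8.5, 14.2, 10.0, 8.4, 7.2])
gdp = pd.Series([1560.67, 9403.43, 8515.35, 19640.35, 12016.2])

def pearson(x, y):
    # Deviations of each value from its column mean.
    dx = x - x.mean()
    dy = y - y.mean()
    # Sum of products of deviations, scaled by the size of both spreads.
    return (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual = pearson(unemployment, gdp)
builtin = unemployment.corr(gdp)  # Pearson is the default method

print(manual, builtin)
```

With only five rows the value itself is not meaningful; the comparison just confirms that the two calculations agree.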