<a href="https://colab.research.google.com/github/mnikolop/PythonTutorial/blob/master/KarolinskaTutorial-Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python 

Python is an interpreted, high-level, general-purpose programming language.    
It can be used for almost anything.   
We are using it for data analysis.    
The main library used in data analysis is pandas.   
There used to be two versions in use: Python 2 (or simply python) and Python 3. Python 2 reached end of life on January 1st, 2020, so we are only using Python 3.

## Main Resources
- [python](https://www.python.org/)
- [stackoverflow](https://stackoverflow.com/)
- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [towards data science](https://towardsdatascience.com/)

## Editors

Editors can be IDEs, text editors or notebooks.
- [Anaconda]: deployment environment
  - [Jupyter]: notebook
- [VS Code]: text editor
- [PyCharm]: IDE
- [Google Colab]: notebook






# Libraries
Libraries can contain datasets, tools, graphics, processes, functions and so on.
They are basically what makes a language powerful.

A lot of libraries come pre-installed in the environments, so we can just load them. If they are not, we have to install them like below.


Note: it is possible, but not common, for a library to have a (slightly) different name on Anaconda, to exist only for Anaconda, or not to exist for it at all.

---

It is possible to run command-line commands from a notebook.
Some notebooks also allow other languages to be mixed in, as long as they are in separate cells, you specify the language, and (of course) the language is supported.

In [1]:
# Install libraries
!pip install gapminder
!pip install -U -q PyDrive

Collecting gapminder
  Downloading https://files.pythonhosted.org/packages/85/83/57293b277ac2990ea1d3d0439183da8a3466be58174f822c69b02e584863/gapminder-0.1-py3-none-any.whl
Installing collected packages: gapminder
Successfully installed gapminder-0.1


In [0]:
#Import libraries
import numpy as np #basic scientific computation
import pandas as pd #for data science
from gapminder import gapminder #dataset
import matplotlib.pyplot as plt #plotting
import seaborn as sns; sns.set_style("darkgrid") #plotting
my_dpi=96 #dots per inch. Used in the size of the plots


#Libraries for reading from the drive
from pydrive.auth import GoogleAuth #Google authentication library
from pydrive.drive import GoogleDrive #Google drive library
from google.colab import auth #Google colab auth library
from google.colab import files #Google colab filesystem library
import gspread #google sheets library
from oauth2client.client import GoogleCredentials #Google credentials library
import io #Python standard library module for in-memory file objects



In [0]:
# Authenticate and create the PyDrive client.
auth.authenticate_user() #create the authenticated user
gauth = GoogleAuth() #create the authentication
gauth.credentials = GoogleCredentials.get_application_default() #assign credentials
drive = GoogleDrive(gauth) #set authentication

# Datasets
Data frames are sets of data in a "table" format. 
They come with a multitude of operations and functions that can be performed on them.

## Loading datasets
- Library
- Sample dataset
- File
- Online
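
As a rough illustration of the "File" and "Online" options, the sketch below reads CSV text with `pd.read_csv`. The text is a toy stand-in for a real file (the two rows are copied from the gapminder preview in this notebook); `read_csv` also accepts a file path or a URL in place of the in-memory buffer used here.

```python
import io

import pandas as pd

# Toy CSV text standing in for a file on disk or a file behind a URL.
csv_text = """country,continent,year,lifeExp
Afghanistan,Asia,1952,28.801
Afghanistan,Asia,1957,30.332
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred while reading
```
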


In [4]:
print(type(gapminder))

<class 'pandas.core.frame.DataFrame'>


In [5]:
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [6]:
print("the first 5 lines of the dataframe are \n" , gapminder.head(), "\n")
print("the basic statistics of the dataframe are \n" , gapminder.describe(), "\n")
print("the data types of the dataframe are \n" , gapminder.dtypes, "\n")
# Note: gapminder.ftypes (data types plus sparsity) is not available in pandas 1.0 and later, so it is not printed here.

the first 5 lines of the dataframe are 
        country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106 

the basic statistics of the dataframe are 
              year      lifeExp           pop      gdpPercap
count  1704.00000  1704.000000  1.704000e+03    1704.000000
mean   1979.50000    59.474439  2.960121e+07    7215.327081
std      17.26533    12.917107  1.061579e+08    9857.454543
min    1952.00000    23.599000  6.001100e+04     241.165877
25%    1965.75000    48.198000  2.793664e+06    1202.060309
50%    1979.50000    60.712500  7.023596e+06    3531.846989
75%    1993.25000    70.845500  1.958522e+07    9325.462346
max    2007.00000    82.603000  1.318683e+09  113523.132900 




## Python and errors
Python is very verbose about problems.   
This means you get loads of text (a traceback) when there is an error or a warning.   
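
A small sketch of how to read that text: the helper `parse_year` is hypothetical and exists only to trigger an error. In most tracebacks the last line is the most useful one, since it names the exception type and the message.

```python
import traceback


def parse_year(text):
    """Convert a year written as text to an integer (hypothetical helper)."""
    return int(text)


try:
    parse_year("nineteen fifty-two")
except ValueError:
    # format_exc returns the same text Python would print for the error.
    report = traceback.format_exc()

lines = report.strip().splitlines()
print(report)

# The final line holds the exception type and its message.
last_line = lines[-1]
print("Read this line first:", last_line)
```
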


In [0]:
gapminder['continent'] = pd.Categorical(gapminder['continent']) #change continent to categorical
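
To show what the conversion gives us, here is a minimal sketch on a toy Series (the labels are illustrative, not the full dataset). Each distinct label becomes a category, and every value is stored as an integer code pointing at its category; the plotting cell uses these codes to pick colors.

```python
import pandas as pd

continents = pd.Series(["Asia", "Europe", "Africa", "Asia"])
as_category = pd.Categorical(continents)

# Categories are the sorted unique labels.
print(list(as_category.categories))

# Codes are the position of each value within the categories.
print(list(as_category.codes))
```
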


In [0]:
# graph data
for i in gapminder.year.unique(): #for every year in the data make the following graph
  fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi) #create a figure
  tmp = gapminder[gapminder.year == i] #separate the data to be graphed
  plt.scatter(tmp['lifeExp'], tmp['gdpPercap'], s=tmp['pop']/200000,
              c=tmp['continent'].cat.codes, cmap="Accent", alpha=0.6, edgecolors="white", linewidth=2) #plot and colors
  plt.yscale('log') #set the scale of the y axis to logarithmic
  plt.xlabel("Life Expectancy") #set the label of the x axis
  plt.ylabel("GDP per Capita") #set the label of the y axis
  plt.title("Year: "+str(i)) #set the title
  plt.ylim(100, 100000) #set the range of the y axis (the lower limit must be positive on a log scale)
  plt.xlim(30, 90) #set the range of the x axis


# Uploading data from file

Here we will try loading data:
- From the local system
- From a Google Sheet


In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default()) #access the google sheets

worksheet = gc.open('life_expectancy_years').sheet1 #get access to the specific sheet

rows = worksheet.get_all_values()# get_all_values gives a list of rows.

df = pd.DataFrame.from_records(rows) #turn the data into a dataframe

# Data Wrangling

Data are rarely in the correct format or type when loaded. 
Common corrections that usually need to be done are:
- Assigning a row as headers
- Correcting the index
- Correcting data types
- Dealing with NAs or empty cells
- ...

Note: It is good practice to first copy the original dataset to a new variable (with a more appropriate name), so that if something goes wrong the original data are unchanged. 

In [0]:
lifeExpectancy = df.copy() #work on a copy so the original df stays unchanged
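
One detail worth spelling out: plain assignment only gives the same DataFrame a second name, while `DataFrame.copy()` creates an independent object. The sketch below uses a toy frame (the names `original`, `alias` and `independent` are made up for the illustration).

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})

alias = original               # second name for the same object
independent = original.copy()  # separate object with its own data

alias.loc[0, "value"] = 100       # also changes `original`
independent.loc[1, "value"] = -1  # leaves `original` alone

print(alias is original)
print(original["value"].tolist())
print(independent["value"].tolist())
```
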

In [0]:
lifeExpectancy = lifeExpectancy.reset_index(drop=True) #reset the index
lifeExpectancy.columns = lifeExpectancy.iloc[0] #use the first row as column names
lifeExpectancy = lifeExpectancy.drop(lifeExpectancy.index[0]) #delete the first row (the one used for names)
lifeExpectancy = lifeExpectancy.reset_index(drop=True) #reset the index

In [0]:
lifeExpectancy

In [0]:
lifeExpectancy['country'] = pd.Categorical(lifeExpectancy['country']) #change country to categorical
lifeExpectancy = lifeExpectancy.set_index('country') #set country as index
lifeExpectancy = lifeExpectancy.unstack() #move the year columns into the index: a Series indexed by (year, country)
lifeExpectancy = lifeExpectancy.to_frame().stack(level=0) #back to a Series, with an extra innermost index level holding the column label
lifeExpectancy = lifeExpectancy.to_frame().swaplevel() #swap the two innermost index levels: (year, extra level, country)
lifeExpectancy.index.names = ['year', 'drop', 'country'] #name the index levels
lifeExpectancy = lifeExpectancy.droplevel('drop') #remove the extra level
lifeExpectancy.rename(columns={ lifeExpectancy.columns[0]: "expectancy" }, inplace = True) #name the value column
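
The cell above turns a wide table (one column per year) into a long one (one row per year and country). As an alternative sketch, not the method used in this notebook, `DataFrame.melt` can do the same reshaping in one step. The country names and values below are made up for the illustration.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1952": [30.0, 40.0],
    "1957": [31.0, 41.0],
})

# melt keeps `country` and turns every other column into (year, value) rows.
long_form = wide.melt(id_vars="country", var_name="year", value_name="expectancy")

# Use (year, country) as the index, as in the cell above.
long_form = long_form.set_index(["year", "country"]).sort_index()
print(long_form)
```
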

In [0]:
lifeExpectancy

In [0]:
# uploaded = files.upload() # create a field for uploading a data file and assign the result to the uploaded variable (PLEASE only select the file indicated below!)
# immunization = pd.read_csv(io.BytesIO(uploaded['dtp3_immunized_percent_of_one_year_olds.csv']))

In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default()) #access the google sheets

worksheet = gc.open('dtp3_immunized_percent_of_one_year_olds').sheet1 #get access to the specific sheet

rows = worksheet.get_all_values()# get_all_values gives a list of rows.

immunization = pd.DataFrame.from_records(rows) #turn the data into a dataframe


In [0]:
immunization

In [0]:
immunization = immunization.reset_index(drop=True) #reset the index
immunization.columns = immunization.iloc[0] #use the first row as column names
immunization = immunization.drop(immunization.index[0]) #delete the first row (the one used for names)
immunization = immunization.reset_index(drop=True) #reset the index

In [0]:
immunization

In [0]:
immunization['country'] = pd.Categorical(immunization['country']) #change country to categorical
immunization = immunization.set_index('country') #set country as index
immunization = immunization.unstack() #move the year columns into the index: a Series indexed by (year, country)
immunization = immunization.to_frame().stack(level=0) #back to a Series, with an extra innermost index level holding the column label
immunization = immunization.to_frame().swaplevel() #swap the two innermost index levels: (year, extra level, country)
immunization.index.names = ['year', 'drop', 'country'] #name the index levels
immunization = immunization.droplevel('drop') #remove the extra level
immunization.rename(columns={ immunization.columns[0]: "dtp3" }, inplace = True) #name the value column

In [0]:
immunization

In [0]:
immunization = immunization.reset_index(drop=False) #turn the year and country index levels back into columns

In [0]:
immunization

In [0]:
demographics = gapminder[['continent', 'country']]

In [0]:
data = lifeExpectancy.merge(immunization, how='inner', on=['year', 'country'])
data = data.merge(demographics, how='outer', on='country').dropna()
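
To make the `how` argument concrete, here is a minimal sketch on two toy tables (the countries and values are made up). An inner merge keeps only keys present in both tables; an outer merge keeps every key and fills the gaps with NaN, which `dropna()` then removes.

```python
import pandas as pd

left = pd.DataFrame({"country": ["A", "B"], "expectancy": [70.0, 80.0]})
right = pd.DataFrame({"country": ["B", "C"], "dtp3": [95, 60]})

inner = left.merge(right, how="inner", on="country")  # only B is in both
outer = left.merge(right, how="outer", on="country")  # A, B and C, with NaN gaps
complete = outer.dropna()                             # rows without any NaN

print(inner)
print(outer)
print(complete)
```
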

In [0]:
data.dtp3 = pd.to_numeric(data.dtp3)
data.expectancy = pd.to_numeric(data.expectancy)
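
Values read from a spreadsheet arrive as text, so some cells may not be valid numbers. A minimal sketch, assuming a toy column with one unparseable entry (`"n/a"` is made up for the illustration): passing `errors="coerce"` to `pd.to_numeric` turns such entries into NaN instead of raising an error.

```python
import pandas as pd

raw = pd.Series(["77", "81.5", "n/a"])

# By default to_numeric raises on "n/a"; coerce replaces it with NaN.
numbers = pd.to_numeric(raw, errors="coerce")

print(numbers.dtype)
print(int(numbers.isna().sum()), "value(s) could not be converted")
```
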

In [26]:
print(data.groupby(['continent']).max()) 
print(data.groupby(['continent']).expectancy.max() )
print(data.groupby(['continent']).dtp3.max() )

           year         country expectancy dtp3
continent                                      
Africa     2011        Zimbabwe         77   99
Americas   2011       Venezuela       81.5   99
Asia       2011         Vietnam       82.9   99
Europe     2011  United Kingdom       82.5   99
Oceania    2011     New Zealand       82.2   95
continent
Africa        77
Americas    81.5
Asia        82.9
Europe      82.5
Oceania     82.2
Name: expectancy, dtype: object
continent
Africa      99
Americas    99
Asia        99
Europe      99
Oceania     95
Name: dtp3, dtype: object


In [30]:
data.dtypes

year            object
country         object
expectancy      object
dtp3            object
continent     category
dtype: object

In [45]:
for continent, temp in data.groupby('continent'):
  print(continent, ':')
  print(temp.groupby(['country'])['dtp3'].max().rename('max').sort_values(ascending = False).head(2).reset_index(drop=False)) #the two highest maximum dtp3 values among the countries of this continent
  print(temp.groupby(['country'])['expectancy'].max().rename('max').sort_values(ascending = True).head(3).reset_index(drop=False)) #the three lowest maximum expectancy values among the countries of this continent
  print("max values:\n", temp.max(), "\n")
  print("correlation of life expectancy and immunization to dtp3 is:\n", temp['expectancy'].corr(temp['dtp3']), "\n")

Africa :
...
max values:
 year              2011
country       Zimbabwe
expectancy          77
dtp3                99
continent       Africa
dtype: object 

correlation of life expectancy and immunization to dtp3 is:
 0.44534950306583765 

Americas :
...
max values:
 year               2011
country       Venezuela
expectancy         81.5
dtp3                 99
continent      Americas
dtype: object 

correlation of life expectancy and immunization to dtp3 is:
 0.7016344223538111 

Asia :
...

In [0]:
#TODO Advanced: predict the next year

# References
- [Gapminder Data - Expectancy](http://gapm.io/ilex)
- [Gapminder Data - Immunization](https://data.unicef.org/child-health/immunization.html)
- [Gapminder data animation tutorial](https://python-graph-gallery.com/341-python-gapminder-animation/)
