<a href="https://colab.research.google.com/github/mnikolop/PythonTutorial/blob/master/KarolinskaTutorial-Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python 

Python is an interpreted, high-level, general-purpose programming language.    
It can be used for almost anything.   
We are using it for data analysis.    
The main library used for data analysis is pandas.   
There used to be two versions in use, Python 2 (often just called python) and Python 3. Python 2 reached end of life on January 1st, 2020, so we are only using Python 3.

## Main Resources
- [python](https://www.python.org/)
- [stackoverflow](https://stackoverflow.com/)
- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [toward data science](https://towardsdatascience.com/)

## Editors

Editors can be IDEs, text editors or notebooks.
- [Anaconda](https://www.anaconda.com/): deployment environment
  - [Jupyter](https://jupyter.org/): notebook
- [VS Code](https://code.visualstudio.com/): text editor
- [PyCharm](https://www.jetbrains.com/pycharm/): IDE
- [Google Colab](https://colab.research.google.com/): notebook






# Libraries
Libraries can contain datasets, tools, graphics, processes, functions and so on.
They are basically what makes a language powerful.


**Note**: it is possible, though not common, for a library to have a (slightly) different name in Anaconda, to exist only for Anaconda, or not to exist for it at all.


**Other languages in python notebooks**   
It is possible to run command-line commands from a notebook.
Some notebooks also allow other languages to be mixed in, as long as they are in separate cells, you specify the language, and the language is supported.   
For example, Jupyter notebooks can support Python, Java, R, Julia, Matlab, Octave, Scheme, Processing and Scala, given the corresponding kernels.   

---

*Good Practice*   
Different languages should be in their own cells.   
It is common for kernels/runtime environments to not allow the mixing of languages in a cell.

In [0]:
# Install libraries
!pip install gapminder
!pip install -U -q PyDrive

Many libraries come pre-installed in these environments, so we can just load them.

In [0]:
#Import libraries
import numpy as np #basic scientific computation 
import pandas as pd #for data science
from gapminder import gapminder #dataset
import matplotlib.pyplot as plt #plotting
import seaborn as sns; sns.set_style("darkgrid") #plotting
my_dpi=96 #dots per inch. Used in the size of the plots


#Libraries for reading from the drive
from pydrive.auth import GoogleAuth #Google authentication library
from pydrive.drive import GoogleDrive #Google drive library
from google.colab import auth #Google colab auth library
from google.colab import files #Google colab filesystem library
import gspread #google sheets library 
from oauth2client.client import GoogleCredentials #Google credentials library
import io #Python standard library for in-memory streams (used to read the uploaded file)



In [0]:
# Authenticate and create the PyDrive client.
auth.authenticate_user() #authenticate the current Colab user
gauth = GoogleAuth() #create the authentication
gauth.credentials = GoogleCredentials.get_application_default() #assign credentials
drive = GoogleDrive(gauth) #set authentication

# Datasets
Datasets are mostly used in the form of a dataframe in Python.
They can also be found in the form of
- lists
- dictionaries

In this workshop we will look into dataframes, specifically pandas dataframes.
Dataframes are sets of data in a "table" format.
They come with a multitude of operations and functions that can be performed on them.
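
As a minimal sketch of how the three forms relate, the example below builds a list, a dictionary and a pandas dataframe from the same values. The values are the first three gapminder rows for Afghanistan; the variable names (`countries`, `records`, `toy`) are made up for this illustration.

```python
import pandas as pd

# A list holds values in order.
countries = ["Afghanistan", "Afghanistan", "Afghanistan"]
years = [1952, 1957, 1962]

# A dictionary maps names (here: column names) to values (here: lists).
records = {"country": countries, "year": years, "lifeExp": [28.801, 30.332, 31.997]}

# A dataframe arranges the same data as a table with named columns and a row index.
toy = pd.DataFrame(records)

print(toy)
print(toy.shape)               # (number of rows, number of columns)
print(toy["lifeExp"].mean())   # operations work on whole columns at once
```

The dataframe adds column-wise operations and labelled access on top of what the plain list and dictionary offer.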

## Loading datasets
- Library
- Sample dataset
- File
- Online


### Loading from a library

#### Overview and quick statistics

It is important to overview dataframes to
- make sure we are looking at the correct data
- identify potential errors and points to correct
- know the datatypes and column names
- know the ranges and basic statistics 

In [0]:
print(type(gapminder))

<class 'pandas.core.frame.DataFrame'>


In [0]:
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [0]:
print("the first 5 lines of the dataframe are \n" , gapminder.head(), "\n")
print("the basic statistics of the dataframe are \n" , gapminder.describe(), "\n")
print("the data types of the dataframe are \n" , gapminder.dtypes, "\n")
gapminder.info() #column data types and non-null counts (DataFrame.ftypes was removed in pandas 1.0)

the first 5 lines of the dataframe are 
        country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106 

the basic statistics of the dataframe are 
              year      lifeExp           pop      gdpPercap
count  1704.00000  1704.000000  1.704000e+03    1704.000000
mean   1979.50000    59.474439  2.960121e+07    7215.327081
std      17.26533    12.917107  1.061579e+08    9857.454543
min    1952.00000    23.599000  6.001100e+04     241.165877
25%    1965.75000    48.198000  2.793664e+06    1202.060309
50%    1979.50000    60.712500  7.023596e+06    3531.846989
75%    1993.25000    70.845500  1.958522e+07    9325.462346
max    2007.00000    82.603000  1.318683e+09  113523.132900 




# Python and errors
Python is very expressive.   
This means you get loads of text when there is an error or a warning.  
The only exceptions (that I have encountered so far) are failed lookups, which are reported tersely as a `KeyError`. That usually means the position or label being accessed doesn't exist or has a different index.
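
A minimal sketch of how a `KeyError` arises and how it can be caught is shown below. The two-row dataframe uses 1980 values from the immunization table, and the helper name `lookup` is hypothetical, introduced only for this example.

```python
import pandas as pd

# Tiny dataframe indexed by country name.
toy = pd.DataFrame({"country": ["Albania", "Yemen"], "dtp3": [94.0, 1.0]})
toy = toy.set_index("country")

def lookup(frame, label):
    """Return the row for `label`, or None if the label is not in the index."""
    try:
        return frame.loc[label]
    except KeyError:
        # pandas raises KeyError when the requested label does not exist
        print(f"KeyError: {label!r} is not in the index")
        return None

found = lookup(toy, "Albania")     # label exists
missing = lookup(toy, "Atlantis")  # label does not exist
```

When a `KeyError` appears, printing `frame.index` and `frame.columns` is usually the quickest way to see which labels are actually available.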


# Data Wrangling

In [0]:
gapminder['continent'] = pd.Categorical(gapminder['continent']) #change continent to categorical
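
To see what the categorical conversion provides, here is a small sketch on a made-up four-row column (the dataframe `toy` is an assumption for illustration, not part of the gapminder data). Each distinct label becomes a category, and every row stores an integer code that points to its category.

```python
import pandas as pd

toy = pd.DataFrame({"continent": ["Asia", "Europe", "Asia", "Africa"]})
toy["continent"] = pd.Categorical(toy["continent"])

categories = toy["continent"].cat.categories  # the distinct labels, sorted
codes = toy["continent"].cat.codes            # one integer per row

print(list(categories))
print(list(codes))
```

These integer codes are what make it possible to map each continent to a color in a plot.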

# Data Visualisation
Many of Python's popular graphing libraries are modelled on tools from other languages (R, Matlab); Matplotlib's `pyplot` interface, for example, mimics Matlab's plotting commands.

In [0]:
# graph data
for i in gapminder.year.unique(): #for every year in the data make the following graph
  fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi) #create a figure
  tmp = gapminder[gapminder.year == i] #separate the data to be graphed
  plt.scatter(tmp['lifeExp'], tmp['gdpPercap'], s=tmp['pop']/200000,
              c=tmp['continent'].cat.codes, cmap="Accent", alpha=0.6, edgecolors="white", linewidth=2) #bubble size from population, color from continent
  plt.yscale('log') #set the scale of the y axis to logarithmic
  plt.xlabel("Life Expectancy") #set the label of the x axis
  plt.ylabel("GDP per Capita") #set the label of the y axis
  plt.title("Year: "+str(i)) #set the title
  plt.ylim(100, 100000) #set the limits of the y axis (a log axis needs a positive lower limit)
  plt.xlim(30, 90) #set the limits of the x axis


# Uploading data from file

Here we will try loading data
- from the local system
- from a Google sheet


In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default()) #access the google sheets

worksheet = gc.open('life_expectancy_years').sheet1 #get access to the specific sheet

rows = worksheet.get_all_values() #get_all_values gives a list of rows

df = pd.DataFrame.from_records(rows) #turn the data into a dataframe

# Data Wrangling (2)

Data are rarely in the correct format or type when loaded.
Common corrections that usually need to be made are:
- Assigning a row as headers
- Correcting the index
- Correcting data types
- Dealing with NAs or empty cells
- ...

**Note**: A good practice is to first copy the original dataset to a new one (with a more appropriate name), so that if something goes wrong the original data are unchanged.
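
A plain assignment only creates a second name for the same dataframe, so changes made through the new name also reach the original. The sketch below, on a made-up three-row dataframe, contrasts assignment with `DataFrame.copy()`; the names `original`, `alias` and `backup` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"value": [1, 2, 3]})

alias = original           # a second name for the same object
backup = original.copy()   # an independent copy of the data

alias.loc[0, "value"] = 100   # this modifies `original` as well

print(original.loc[0, "value"])  # changed through the alias
print(backup.loc[0, "value"])    # unaffected by the change
```

Using `.copy()` is therefore the safer way to keep the loaded data intact while wrangling.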

In [0]:
lifeExpectancy = df.copy() #work on a copy so the original df stays unchanged

In [0]:
lifeExpectancy = lifeExpectancy.reset_index(drop=True) #reset the index
lifeExpectancy.columns = lifeExpectancy.iloc[0] #use the first row as column names
lifeExpectancy = lifeExpectancy.drop(lifeExpectancy.index[0]) #delete the first row (the one used for names)
lifeExpectancy = lifeExpectancy.reset_index(drop=True) #reset the index

In [0]:
lifeExpectancy

In [0]:
lifeExpectancy['country'] = pd.Categorical(lifeExpectancy['country']) #change country to categorical
lifeExpectancy = lifeExpectancy.set_index('country') #set country as index
lifeExpectancy = lifeExpectancy.unstack() #turn the year columns into an index level, giving a Series indexed by (year, country)
lifeExpectancy = lifeExpectancy.to_frame().stack(level=0) #move the single column label into the index as a third level
lifeExpectancy = lifeExpectancy.to_frame().swaplevel() #swap the last two index levels so country comes last
lifeExpectancy.index.names = ['year', 'drop', 'country'] #rename the index levels
lifeExpectancy = lifeExpectancy.droplevel('drop') #delete the 2nd level of the index
lifeExpectancy.rename(columns={ lifeExpectancy.columns[0]: "expectancy" }, inplace = True) #rename the value column in place

In [0]:
lifeExpectancy

In [56]:
uploaded = files.upload() #create a field for uploading a data file and assign the result to the uploaded variable (PLEASE only select the file indicated below!)
immunization = pd.read_csv(io.BytesIO(uploaded['dtp3_immunized_percent_of_one_year_olds.csv'])) #read the uploaded file to a df

Saving dtp3_immunized_percent_of_one_year_olds.csv to dtp3_immunized_percent_of_one_year_olds (1).csv


In [57]:
immunization

Unnamed: 0,country,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011
0,Afghanistan,4.0,3.0,5.0,5.0,16.0,15.0,11.0,25.0,35.0,33.0,25.0,23.0,21.0,18.0,12.0,20.0,31.0,28.0,27.0,27.0,24.0,33.0,36.0,41.0,50.0,58.0,58.0,63.0,64.0,63.0,66.0,66
1,Albania,94.0,94.0,95.0,95.0,95.0,96.0,96.0,96.0,96.0,94.0,94.0,78.0,94.0,96.0,99.0,97.0,98.0,99.0,96.0,97.0,97.0,97.0,98.0,97.0,97.0,98.0,97.0,98.0,99.0,98.0,99.0,99
2,Algeria,,,,,,69.0,73.0,79.0,85.0,87.0,89.0,89.0,88.0,88.0,87.0,88.0,88.0,89.0,85.0,83.0,86.0,89.0,86.0,87.0,86.0,88.0,95.0,95.0,93.0,95.0,95.0,95
3,Andorra,,,,,,,,,,,,,,,,,,90.0,89.0,90.0,98.0,96.0,97.0,99.0,99.0,98.0,93.0,96.0,99.0,99.0,99.0,99
4,Angola,,,,6.0,8.0,8.0,10.0,10.0,12.0,18.0,24.0,26.0,21.0,30.0,27.0,24.0,28.0,41.0,47.0,21.0,31.0,42.0,47.0,46.0,47.0,47.0,44.0,83.0,81.0,73.0,91.0,86
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188,Venezuela,56.0,54.0,53.0,58.0,33.0,49.0,58.0,57.0,56.0,57.0,63.0,64.0,67.0,67.0,63.0,68.0,57.0,60.0,39.0,76.0,77.0,70.0,65.0,68.0,86.0,87.0,71.0,62.0,50.0,84.0,78.0,78
189,Vietnam,,,,4.0,5.0,42.0,43.0,50.0,70.0,88.0,88.0,88.0,88.0,91.0,94.0,93.0,94.0,95.0,94.0,93.0,96.0,96.0,75.0,99.0,96.0,95.0,94.0,92.0,93.0,96.0,93.0,95
190,Yemen,1.0,3.0,4.0,5.0,7.0,12.0,13.0,18.0,36.0,56.0,84.0,47.0,50.0,54.0,33.0,44.0,47.0,40.0,68.0,72.0,76.0,76.0,69.0,66.0,78.0,85.0,85.0,87.0,87.0,86.0,87.0,81
191,Zambia,,,,49.0,59.0,66.0,66.0,83.0,83.0,83.0,91.0,79.0,83.0,86.0,90.0,86.0,84.0,83.0,84.0,84.0,85.0,85.0,84.0,83.0,83.0,82.0,81.0,80.0,87.0,94.0,83.0,81


In [0]:
immunization['country'] = pd.Categorical(immunization['country']) #change country to categorical
immunization = immunization.set_index('country') #set country as index
immunization = immunization.unstack() #turn the year columns into an index level, giving a Series indexed by (year, country)
immunization = immunization.to_frame().stack(level=0) #move the single column label into the index as a third level
immunization = immunization.to_frame().swaplevel() #swap the last two index levels so country comes last
immunization.index.names = ['year', 'drop', 'country'] #rename the index levels
immunization = immunization.droplevel('drop') #delete the 2nd level of the index
immunization.rename(columns={ immunization.columns[0]: "dtp3" }, inplace = True) #rename the value column in place

In [59]:
immunization

year,country,dtp3
1980,Afghanistan,4.0
1980,Albania,94.0
1980,Antigua and Barbuda,54.0
1980,Argentina,44.0
1980,Australia,33.0
...,...,...
2011,Venezuela,78.0
2011,Vietnam,95.0
2011,Yemen,81.0
2011,Zambia,81.0
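
The wide-to-long reshaping above can also be sketched with `DataFrame.melt`. This is an alternative shown under assumptions, not the method used in this notebook: the toy table `wide` holds only the 1980 and 1981 values of three countries from the immunization table, and the result is left as ordinary columns rather than a `(year, country)` index.

```python
import pandas as pd

# Toy wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria"],
    "1980": [4.0, 94.0, None],
    "1981": [3.0, 94.0, None],
})

# melt turns the year columns into rows: one (country, year, value) triple per row.
long = wide.melt(id_vars="country", var_name="year", value_name="dtp3")

# Drop the combinations that have no measurement, then renumber the rows.
long = long.dropna(subset=["dtp3"]).reset_index(drop=True)

print(long)
```

As in the output above, countries with no value for a year (here Algeria) do not appear in the long table.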


In [0]:
immunization = immunization.reset_index(drop=False) #reset the index

In [61]:
immunization

Unnamed: 0,year,country,dtp3
0,1980,Afghanistan,4.0
1,1980,Albania,94.0
2,1980,Antigua and Barbuda,54.0
3,1980,Argentina,44.0
4,1980,Australia,33.0
...,...,...,...
5546,2011,Venezuela,78.0
5547,2011,Vietnam,95.0
5548,2011,Yemen,81.0
5549,2011,Zambia,81.0


In [0]:
demographics = gapminder[['continent', 'country']] #copy the columns continent and country from the gapminder data to a new df

In [0]:
data = lifeExpectancy.merge(immunization, how='inner', on=['year', 'country']) #merge the life expectancy and immunization data based on the year and country columns
data = data.merge(demographics, how='outer', on='country').dropna() #merge the previous df with the demographics column 

In [65]:
data

Unnamed: 0,year,country,expectancy,dtp3,continent
0,1980,Afghanistan,43.3,4.0,Asia
1,1980,Afghanistan,43.3,4.0,Asia
2,1980,Afghanistan,43.3,4.0,Asia
3,1980,Afghanistan,43.3,4.0,Asia
4,1980,Afghanistan,43.3,4.0,Asia
...,...,...,...,...,...
49287,2011,Montenegro,76.4,95.0,Europe
49288,2011,Montenegro,76.4,95.0,Europe
49289,2011,Montenegro,76.4,95.0,Europe
49290,2011,Montenegro,76.4,95.0,Europe
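
The repeated rows in the output above appear because `demographics` keeps one row per country and year from gapminder, so each country name occurs several times and the merge pairs every occurrence with every matching row. The sketch below reproduces the effect on toy stand-ins (`values` and `lookup` are made-up names) and shows one possible remedy, dropping duplicate rows before merging.

```python
import pandas as pd

# Two measurements, and a lookup table that repeats each country.
values = pd.DataFrame({"country": ["Afghanistan", "Montenegro"], "dtp3": [4.0, 95.0]})
lookup = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Montenegro", "Montenegro"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
})

# Merging against the repeated keys multiplies the rows.
duplicated = values.merge(lookup, how="outer", on="country")

# Keeping one row per (country, continent) pair avoids the repetition.
deduplicated = values.merge(lookup.drop_duplicates(), how="outer", on="country")

print(len(duplicated), "rows with repeated keys")
print(len(deduplicated), "rows after drop_duplicates")
```

Applied to this notebook, that would mean calling `drop_duplicates()` on `demographics` before the second merge.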


In [68]:
data.dtypes

year            object
country         object
expectancy     float64
dtp3           float64
continent     category
dtype: object

In [0]:
data.dtp3 = pd.to_numeric(data.dtp3)
data.expectancy = pd.to_numeric(data.expectancy)

# Statistics

Grouping data is a very useful function, as it provides an easier way of handling data and getting statistics out of the groups.

In [69]:
print(data.groupby(['continent']).max()) #find the max of each column for each continent
print(data.groupby(['continent']).expectancy.max()) #find the max of the life expectancy for each continent
print(data.groupby(['continent']).dtp3.min()) #find the min of the immunization data for each continent 

           year         country  expectancy  dtp3
continent                                        
Africa     2011        Zimbabwe        77.0  99.0
Americas   2011       Venezuela        81.5  99.0
Asia       2011         Vietnam        82.9  99.0
Europe     2011  United Kingdom        82.5  99.0
Oceania    2011     New Zealand        82.2  95.0
continent
Africa      77.0
Americas    81.5
Asia        82.9
Europe      82.5
Oceania     82.2
Name: expectancy, dtype: float64
continent
Africa      99.0
Americas    99.0
Asia        99.0
Europe      99.0
Oceania     95.0
Name: dtp3, dtype: float64


In [0]:
for continent, temp in data.groupby('continent'): #for each continent put that continent's rows in a temp df
  print(continent, ':')
  print(temp.groupby(['country'])['dtp3'].max().rename('max').sort_values(ascending = False).head(2).reset_index(drop=False)) #find each country's max dtp3 immunization and print the 2 highest
  print(temp.groupby(['country'])['expectancy'].max().rename('max').sort_values(ascending = True).head(3).reset_index(drop=False)) #find each country's max life expectancy and print the 3 lowest
  print("max values:\n", temp.drop(columns='continent').max(), "\n") #continent is the same for the whole group, so leave it out
  print("correlation of life expectancy and dtp3 immunization is:\n", temp['expectancy'].corr(temp['dtp3']), "\n") #find the correlation of the life expectancy and immunization columns

# Predicting data

As with other data science languages, we can run predictive algorithms in Python as well.
Here we will try to predict the next year in the data.
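
One simple way to do this, sketched here as an assumption rather than as the method intended for the workshop, is to fit a straight line to past values with `numpy.polyfit` and extend it one step. The data are Afghanistan's first five life expectancy values from the gapminder table shown earlier; a linear trend is a strong simplification.

```python
import numpy as np

# Afghanistan's life expectancy from the first gapminder rows.
years = np.array([1952, 1957, 1962, 1967, 1972])
life_exp = np.array([28.801, 30.332, 31.997, 34.020, 36.088])

# Least-squares fit of a straight line: life_exp ~ slope * year + intercept.
slope, intercept = np.polyfit(years, life_exp, deg=1)

# Extend the line one step ahead; these rows are five years apart.
next_year = 1977
predicted = slope * next_year + intercept

print(f"slope: {slope:.3f} years of life expectancy per calendar year")
print(f"predicted life expectancy in {next_year}: {predicted:.1f}")
```

The same idea carries over to the merged `data` dataframe by fitting one line per country, although a model that accounts for immunization as well would need more than a single-variable fit.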

# References
- [Gapminder Data - Expectancy](http://gapm.io/ilex)
- [Gapminder Data - Immunization](https://data.unicef.org/child-health/immunization.html)
- [Gapminder data animation tutorial](https://python-graph-gallery.com/341-python-gapminder-animation/)
