<a href="https://colab.research.google.com/github/mnikolop/PythonTutorial/blob/master/KarolinskaTutorial_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python 

Python is an interpreted, high-level, general-purpose programming language.    
It can be used for almost anything.   
We are using it for data analysis.    
The main library used for data analysis is pandas.   
There used to be two versions in use, Python 2 (or just python) and Python 3. Python 2 reached end of life on January 1st, 2020, so we are only using Python 3.

## Main Resources
- [python](https://www.python.org/)
- [stackoverflow](https://stackoverflow.com/)
- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)
- [Towards Data Science](https://towardsdatascience.com/)

## Editors

Editors can be IDEs, text editors or notebooks.
- [Anaconda]: deployment environment
  - [Jupyter]: notebook
- [VS Code]: text editor
- [PyCharm]: IDE
- [Google Colab]: notebook






# Libraries
Libraries can contain datasets, tools, graphics, processes, functions and so on.
They are basically what makes a language powerful.

A lot of libraries come preinstalled in the environments, so we can just load them. If they are not, we have to install them as shown below.


Note: it is possible, but not common, for a library to have a (slightly) different name on Anaconda, to exist only on Anaconda, or not to exist on it at all.
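
Before installing, it can help to check whether a library is already available in the current environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util

def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None

# `json` ships with Python, so it should always be found.
print(is_installed("json"))
# A made-up name that should not exist anywhere.
print(is_installed("surely_not_a_real_library_123"))
```

Note that the import name of a library is not always the same as the name used to install it, so this check takes the import name.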

---

It is possible to run command-line commands from a notebook by prefixing them with `!`.
Some notebooks also allow other languages to be mixed in, as long as they are in separate cells, you specify the language, and the language is supported.

In [0]:
# Install libraries
!pip install gapminder
!pip install -U -q PyDrive



In [0]:
#Import libraries
import numpy as np #basic scientific computation
import pandas as pd #for data science
from gapminder import gapminder #dataset
import matplotlib.pyplot as plt #plotting
import seaborn as sns; sns.set_style("darkgrid") #plotting
my_dpi=96 #dots per inch. Used in the size of the plots


#Libraries for reading from the drive
from pydrive.auth import GoogleAuth #Google authentication library
from pydrive.drive import GoogleDrive #Google drive library
from google.colab import auth #Google colab auth library
from google.colab import files #Google colab filesystem library
import gspread #google sheets library
from oauth2client.client import GoogleCredentials #Google credentials library
import io #Python standard library for in-memory streams

# Authenticate and create the PyDrive client.
auth.authenticate_user() #create the authenticated user
gauth = GoogleAuth() #create the authentication
gauth.credentials = GoogleCredentials.get_application_default() #assign credentials
drive = GoogleDrive(gauth) #set authentication

# Datasets
Data frames are sets of data in a "table" format. 
They come with a multitude of operations and functions that can be performed on them.
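
To make the "table" idea concrete, here is a small sketch with a made-up data frame (the name `toy` and its values are invented for illustration) and a few of the operations that come with it.

```python
import pandas as pd

# A tiny made-up table: one row per country, one column per variable.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "pop": [10, 20, 30],
    "lifeExp": [70.0, 75.0, 80.0],
})

print(toy.shape)                  # (number of rows, number of columns)
print(toy["pop"].sum())           # an operation on a single column
print(toy[toy["lifeExp"] > 72])   # keep only the rows matching a condition
```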

## Loading datasets
- Library
- Sample dataset
- File
- Online
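
As a rough sketch of the "File" and "Online" cases, `pd.read_csv` accepts a file path, a URL, or a file-like object. The CSV text below is made up and stands in for a real file so that the example runs anywhere.

```python
import io
import pandas as pd

# Made-up CSV content; in practice this would be a file path or a URL.
csv_text = "country,year,lifeExp\nA,2000,70.5\nB,2000,65.1\n"

# Wrap the text in a file-like object and let pandas parse it.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample)
print(sample.dtypes)  # pandas infers a type for each column
```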

---

Of the libraries imported above, gapminder is itself a dataset, and seaborn comes with a few sample datasets.

---

Since our dataset is a pandas data frame, we can get simple statistics on its contents.


In [0]:
print(type(gapminder))

In [0]:
gapminder

In [0]:
print("the first 5 lines of the dataframe are \n" , gapminder.head(), "\n")
print("the basic statistics of the dataframe are \n" , gapminder.describe(), "\n")
print("the data types of the dataframe are \n" , gapminder.dtypes, "\n")
gapminder.info() #column types and non-null counts (DataFrame.ftypes was removed in pandas 1.0)

In [0]:
gapminder['continent'] = pd.Categorical(gapminder['continent']) #change continent to categorical
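
A categorical column stores each distinct label once and represents every row by an integer code. Those codes are what the plotting cell uses (through `.cat.codes`) to pick a color per continent. A small sketch with made-up labels:

```python
import pandas as pd

# Made-up continent labels, stored as plain strings.
continents = pd.Series(["Asia", "Europe", "Asia", "Africa"])

# Convert to a categorical: each distinct label gets an integer code.
as_cat = pd.Categorical(continents)
print(as_cat.categories)  # the distinct labels, in sorted order
print(as_cat.codes)       # the integer code of each row
```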


In [0]:
# graph data
for i in gapminder.year.unique(): #for every year in the data make the following graph
  fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi) #create a figure
  tmp=gapminder[ gapminder.year == i ] #separate the data to be graphed
  plt.scatter(tmp['lifeExp'], tmp['gdpPercap'] , s=tmp['pop']/200000 , 
              c=tmp['continent'].cat.codes, cmap="Accent", alpha=0.6, edgecolors="white", linewidth=2) #plot and colors
  plt.yscale('log') #set the scale of the y axis to logarithmic
  plt.xlabel("Life Expectancy") #set the label of the x axis
  plt.ylabel("GDP per Capita") #set the label of the y axis
  plt.title("Year: "+str(i) ) #set the title
  plt.ylim(100,100000) #set the limits of the y axis (the lower limit must be positive on a log scale)
  plt.xlim(30, 90) #set the limits of the x axis


# Uploading data from file

Here we will try two sources:
- the local system
- a Google Sheet


In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default()) #access the google sheets

worksheet = gc.open('life_expectancy_years').sheet1 #get access to the specific sheet

rows = worksheet.get_all_values()# get_all_values gives a list of rows.

df = pd.DataFrame.from_records(rows) #turn the data into a dataframe

# Data Wrangling
We can see that our dates have been assigned the wrong data type, which might cause some issues, so we need to fix that.   
Note: First we pass the data to a clean dataframe so we can revert to the original.

In [0]:
df = df.reset_index(drop=True) #reset the index
df.columns = df.iloc[0] #use the first row as column names
df = df.drop(df.index[0]) #delete the first row (the one used for names)
df = df.reset_index(drop=True) #reset the index
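
The header promotion above can also be done in one step when building the data frame. This is a sketch under the assumption that the sheet arrives as a list of rows with the header first; the rows below are made up.

```python
import pandas as pd

# Made-up rows: a list of lists of text, with the header as the first row.
rows = [
    ["country", "2000", "2001"],
    ["A", "70.1", "70.4"],
    ["B", "65.0", "65.3"],
]

# Use the first row as column names and the remaining rows as data.
wide = pd.DataFrame(rows[1:], columns=rows[0])
print(wide)
```

Because the header row never enters the data, there is nothing to drop afterwards and the index already starts at 0.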

In [0]:
df['country'] = pd.Categorical(df['country']) #change country to categorical
df = df.set_index('country') #set country as index
df = df.unstack() #move the year columns into the index: one value per (year, country)
df = df.to_frame().stack(level=0) #stack the single column, which adds an extra index level
df = df.to_frame().swaplevel() #swap the last two index levels so country comes last
df.index.names = ['year', 'drop', 'country']
df = df.droplevel('drop')
df.rename(columns={ df.columns[0]: "expectancy" }, inplace = True)
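
The goal of the cell above is to go from a wide table (one column per year) to a long one (one row per year and country). As an alternative sketch, not the method used above, `DataFrame.melt` can produce a similar long layout in fewer steps. The table below is made up.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2000": ["70.1", "65.0"],
    "2001": ["70.4", "65.3"],
})

# melt turns the year columns into (year, value) pairs, one per row.
long = wide.melt(id_vars="country", var_name="year", value_name="expectancy")

# Index by year and country, as in the cell above.
long = long.set_index(["year", "country"]).sort_index()
print(long)
```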

In [0]:
df

In [0]:
# uploaded = files.upload()
# immunization = pd.read_csv(io.BytesIO(uploaded['dtp3_immunized_percent_of_one_year_olds.csv']))

In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default()) #access the google sheets

worksheet = gc.open('dtp3_immunized_percent_of_one_year_olds').sheet1 #get access to the specific sheet

rows = worksheet.get_all_values()# get_all_values gives a list of rows.

immunization = pd.DataFrame.from_records(rows) #turn the data into a dataframe

immunization = immunization.reset_index(drop=True) #reset the index
immunization.columns = immunization.iloc[0] #use the first row as column names
immunization = immunization.drop(immunization.index[0]) #delete the first row (the one used for names)
immunization = immunization.reset_index(drop=True) #reset the index

In [0]:
immunization['country'] = pd.Categorical(immunization['country']) #change country to categorical
immunization = immunization.set_index('country') #set country as index
immunization = immunization.unstack() #move the year columns into the index: one value per (year, country)
immunization = immunization.to_frame().stack(level=0) #stack the single column, which adds an extra index level
immunization = immunization.to_frame().swaplevel() #swap the last two index levels so country comes last
immunization.index.names = ['year', 'drop', 'country']
immunization = immunization.droplevel('drop')
immunization.rename(columns={ immunization.columns[0]: "dtp3" }, inplace = True)

In [0]:
immunization = immunization.reset_index(drop=False) #reset the index

In [0]:
immunization

In [0]:
lifeExpectancy = df

In [0]:
data = lifeExpectancy.merge(immunization, how='inner', on=['year', 'country'])
demographics = gapminder[['continent', 'country']].drop_duplicates() #one row per country, so the merge does not repeat rows
data = data.merge(demographics, how='outer', on='country').dropna()
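
One thing to watch when merging in a lookup table: if the lookup repeats a key, every matching row is repeated as well. gapminder has one row per country and year, so selecting only `continent` and `country` from it leaves many identical rows. The sketch below uses made-up tables to show the effect and how `drop_duplicates` avoids it.

```python
import pandas as pd

# Made-up measurements: one row per country and year.
values = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": ["2000", "2001", "2000"],
    "dtp3": [90, 91, 80],
})

# Made-up lookup in which every country appears twice.
lookup = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
})

# Merging with the repeated lookup multiplies the rows.
duplicated = values.merge(lookup, how="inner", on="country")
# Merging with a de-duplicated lookup keeps one row per measurement.
deduplicated = values.merge(lookup.drop_duplicates(), how="inner", on="country")

print(len(values), len(duplicated), len(deduplicated))
```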

In [0]:
data.groupby(['continent']).max() #Find the maximum of each column per continent


In [0]:
print(data.groupby(['continent']).expectancy.max() )
print(data.groupby(['continent']).dtp3.max() )
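
Values read from a spreadsheet may arrive as text, and on text `max()` compares character by character rather than by size. The sketch below assumes that is the case here and uses made-up values to show the difference; `pd.to_numeric` converts the column to numbers first.

```python
import pandas as pd

# Made-up values stored as text, the way a spreadsheet export often arrives.
data_text = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe"],
    "dtp3": ["9", "85", ""],
})

# On text, "9" sorts after "85", so it is reported as the maximum.
text_max = data_text["dtp3"].max()
print(text_max)

# Convert to numbers; entries that cannot be parsed end up as NaN.
data_text["dtp3"] = pd.to_numeric(data_text["dtp3"], errors="coerce")
per_continent = data_text.groupby("continent")["dtp3"].max()
print(per_continent)
```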

In [0]:
#TODO Advanced: predict the next year

# References
- [Gapminder Data - Expectancy](http://gapm.io/ilex)
- [Gapminder Data - Immunization](https://data.unicef.org/child-health/immunization.html)
- [Gapminder data animation tutorial](https://python-graph-gallery.com/341-python-gapminder-animation/)
