<a href="https://colab.research.google.com/github/mnikolop/PythonTutorial/blob/master/KarolinskaTutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python 

Python is an interpreted, high-level, general-purpose programming language.    
It can be used for almost anything.   
We are using it for data analysis.    
The main library used in data analysis is pandas.   
There used to be two versions in use: Python 2 (often just called python) and Python 3. Python 2 has been unsupported since January 1st, 2020, so we only use Python 3.

## Main Resources
- [python](https://www.python.org/)
- [stackoverflow](https://stackoverflow.com/)
- [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)

## Editors

Editors can be IDEs, text editors or notebooks.
- [Anaconda]: deployment environment
  - [Jupyter]: notebook
- [VS Code]: text editor
- [PyCharm]: IDE
- [Google Colab]: notebook






# Libraries
Libraries can contain datasets, tools, graphics, processes, functions and so on.
They are basically what makes a language powerful.

A lot of the libraries come preinstalled in the environments, so we can just load them. If they are not, we have to install them first.


Note: it is possible, but not common, for a library to have a (slightly) different name on Anaconda, to exist only on Anaconda, or not to exist there at all.

---

It is possible to run command-line commands from a notebook.
Some notebooks also allow other languages to be mixed in, as long as they are in separate cells, you specify the language, and the language is supported.

In [1]:
# Install libraries
!pip install gapminder
!pip install -U -q PyDrive

Collecting gapminder
  Downloading https://files.pythonhosted.org/packages/85/83/57293b277ac2990ea1d3d0439183da8a3466be58174f822c69b02e584863/gapminder-0.1-py3-none-any.whl
Installing collected packages: gapminder
Successfully installed gapminder-0.1


In [0]:
# Import libraries
import numpy as np  # basic scientific computation
import pandas as pd  # for data science
from gapminder import gapminder  # dataset
import matplotlib.pyplot as plt  # plotting
import seaborn as sns; sns.set_style("darkgrid")  # plotting
my_dpi = 96  # dots per inch. Used in the size of the plots

# Datasets
Data frames are sets of data in a "table" format. 
They come with a multitude of operations and functions that can be performed on them.
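
As a rough illustration of the kind of operations a data frame offers, here is a minimal sketch on a toy table. The column names and values are invented for this example and are not part of the gapminder data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2005, 2000, 2005],
    "value": [1.0, 2.0, 3.0, 5.0],
})

print(toy.shape)                # (number of rows, number of columns)
print(toy["value"].mean())      # a statistic over one column

# Aggregate the values per country
by_country = toy.groupby("country")["value"].sum()
print(by_country)

# Keep only the rows that satisfy a condition
recent = toy[toy["year"] == 2005]
print(recent)
```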

## Loading datasets
- Library
- Sample dataset
- File
- Online
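
The four routes above might look roughly like the sketch below. Only the file route is actually run, with an in-memory string standing in for a CSV file on disk; the CSV contents and the URL in the comments are hypothetical.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are made up for illustration
csv_text = "country,year,value\nA,2000,1.5\nB,2000,2.5\n"

# pd.read_csv accepts a path, a URL, or any file-like object
from_file = pd.read_csv(io.StringIO(csv_text))
print(from_file)

# The other routes, not run here because they need extra packages or network access:
# from gapminder import gapminder                    # a dataset shipped as a library
# import seaborn as sns; sns.load_dataset("iris")    # a sample dataset offered by a library
# pd.read_csv("https://example.com/data.csv")        # hypothetical URL
```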

---

One of the libraries above (gapminder) is itself a dataset, and another (seaborn) comes with a few test datasets.

---

Since our dataset is already a pandas DataFrame, we can get simple statistics on its contents.


In [3]:
print(type(gapminder))

<class 'pandas.core.frame.DataFrame'>


In [173]:
print("the first 5 lines of the dataframe are \n" , gapminder.head(), "\n")
print("the basic statistics of the dataframe are \n" , gapminder.describe(), "\n")
print("the data types of the dataframe are \n" , gapminder.dtypes, "\n")
# DataFrame.ftypes (data types plus sparsity) was deprecated in pandas 0.25 and removed in 1.0, so it is not called here

the first 5 lines of the dataframe are 
        country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106 

the basic statistics of the dataframe are 
              year      lifeExp           pop      gdpPercap
count  1704.00000  1704.000000  1.704000e+03    1704.000000
mean   1979.50000    59.474439  2.960121e+07    7215.327081
std      17.26533    12.917107  1.061579e+08    9857.454543
min    1952.00000    23.599000  6.001100e+04     241.165877
25%    1965.75000    48.198000  2.793664e+06    1202.060309
50%    1979.50000    60.712500  7.023596e+06    3531.846989
75%    1993.25000    70.845500  1.958522e+07    9325.462346
max    2007.00000    82.603000  1.318683e+09  113523.132900 




# Data Wrangling
We can see that our dates have been assigned the wrong data type, which might cause issues, so we need to fix that.   
- First we copy the data into a separate dataframe so we can revert to the original without having to load it from scratch again.   
- Then we fix the date type.
- And last we convert the continents to categorical.
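
The three steps above can be sketched on a toy frame. This is only an illustration: the data is invented, and the years are assumed to be stored as text so that the type fix is visible.

```python
import pandas as pd

# Toy data, invented for illustration; years are deliberately stored as text
original = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia"],
    "year": ["1952", "1957", "1962"],
})

# Step 1: work on an independent copy (plain assignment would only create an alias)
clean = original.copy()

# Step 2: fix the type of the year column
clean["year"] = clean["year"].astype(int)

# Step 3: store the continents as a categorical column
clean["continent"] = pd.Categorical(clean["continent"])

print(original.dtypes)   # unchanged by the steps above
print(clean.dtypes)
print(clean["continent"].cat.codes.tolist())  # integer code of each category
```

Because `clean` is a copy, `original` keeps its text years and can be used to start over.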

In [205]:
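df = gapminder.copy()  # copy, so that changes to df do not alter gapminder
df.year = pd.to_datetime(df.year, format='%Y').dt.year  # parse as a date, keep only the year
df['continent'] = pd.Categorical(df['continent'])  # continents as categories
df.head()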

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


In [0]:
# Graph the data
for i in df.year.unique():  # for every year in the data make the following graph
  fig = plt.figure(figsize=(680/my_dpi, 480/my_dpi), dpi=my_dpi)  # create a figure
  tmp = df[df.year == i]  # separate the data to be graphed
  plt.scatter(tmp['lifeExp'], tmp['gdpPercap'], s=tmp['pop']/200000,
              c=tmp['continent'].cat.codes, cmap="Accent", alpha=0.6, edgecolors="white", linewidth=2)  # plot and colors
  plt.yscale('log')  # set the scale of the y axis to logarithmic
  plt.xlabel("Life Expectancy")  # set the label of the x axis
  plt.ylabel("GDP per Capita")  # set the label of the y axis
  plt.title("Year: " + str(i))  # set the title
  plt.ylim(100, 100000)  # set the range of the y axis (a log axis needs a positive lower limit)
  plt.xlim(30, 90)  # set the range of the x axis


# Uploading data from file

Here we will try two sources:
- From the local system
- From a Google Sheet


In [0]:
#Libraries for reading from the drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from google.colab import files
import gspread
from oauth2client.client import GoogleCredentials
import io

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [9]:
uploaded = files.upload()
immunization = pd.read_csv(io.BytesIO(uploaded['dtp3_immunized_percent_of_one_year_olds.csv']))

Saving dtp3_immunized_percent_of_one_year_olds.csv to dtp3_immunized_percent_of_one_year_olds.csv


In [0]:
gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open('dtp3_immunized_percent_of_one_year_olds').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

immunization = pd.DataFrame.from_records(rows)

immunization.columns = immunization.iloc[0]
immunization = immunization.drop(immunization.index[0])

In [0]:
immunization

In [0]:
immunization = immunization.set_index('country')
df = df.set_index(['year', 'country'])

In [0]:
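immunization = immunization.unstack()  # wide table -> Series indexed by (year, country)
immunization = immunization.to_frame().stack(level=0)  # adds a third index level holding the column label
immunization = immunization.to_frame().swaplevel()  # swap the last two index levels
immunization = immunization.droplevel(level=1)  # drop the leftover level, leaving (year, country)
immunization.index.names = ['year', 'country']
immunization.rename(columns={immunization.columns[0]: "dtp3"}, inplace=True)  # name the value column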
# immunization
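
The reshaping in this cell turns a wide table (one row per country, one column per year) into a long one indexed by year and country. A shorter way to get a similar result could be `melt`. This is a sketch on invented data, not the cell's own method.

```python
import pandas as pd

# Toy wide table, invented for illustration: one row per country, one column per year
wide = pd.DataFrame(
    {"1980": [5, 90], "1981": [10, 92]},
    index=pd.Index(["A", "B"], name="country"),
)

# melt turns the year columns into rows; the result is then indexed by (year, country)
long = (
    wide.reset_index()
        .melt(id_vars="country", var_name="year", value_name="dtp3")
        .set_index(["year", "country"])
        .sort_index()
)
print(long)
```

The year labels come from column names, so they are still text at this point.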

In [0]:
joined = df.join(immunization, how='outer')

In [239]:
joined.loc[['1982']]

Unnamed: 0_level_0,Unnamed: 1_level_0,continent,lifeExp,pop,gdpPercap,dtp3
year,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1982,Afghanistan,,,,,5
1982,Albania,,,,,95
1982,Algeria,,,,,
1982,Andorra,,,,,
1982,Angola,,,,,
1982,...,...,...,...,...,...
1982,Venezuela,,,,,53
1982,Vietnam,,,,,
1982,Yemen,,,,,4
1982,Zambia,,,,,


In [235]:
joined.index.levels

FrozenList([[1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011'], ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Central African Republic', 'Chad', 'Chile', 'China', 'Colombia', 'Comoros', 'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', 'Cote d'Ivoire', 'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', '
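
The index levels above mix integer years (from gapminder) with text years such as '1982' (from the immunization sheet). That is why the 1982 rows have immunization values but empty gapminder columns: the keys never match. One possible fix, sketched on invented data, is to convert the year level to integers before joining. The frame names here are hypothetical.

```python
import pandas as pd

# Toy frames, invented for illustration
left = pd.DataFrame(
    {"lifeExp": [40.0, 70.0]},
    index=pd.MultiIndex.from_tuples([(1982, "A"), (1982, "B")], names=["year", "country"]),
)
right = pd.DataFrame(
    {"dtp3": ["5", "95"]},  # values read from a sheet arrive as text
    index=pd.MultiIndex.from_tuples([("1982", "A"), ("1982", "B")], names=["year", "country"]),
)

# The keys do not overlap, because 1982 (int) and "1982" (str) are different values
common_keys = set(left.index) & set(right.index)
print(common_keys)

# Convert the year level to int and the values to numbers, then rebuild the index
right_fixed = right.reset_index()
right_fixed["year"] = right_fixed["year"].astype(int)
right_fixed["dtp3"] = pd.to_numeric(right_fixed["dtp3"], errors="coerce")  # unparseable cells become NaN
right_fixed = right_fixed.set_index(["year", "country"])

aligned = left.join(right_fixed, how="outer")
print(aligned)
```

With matching key types, each (year, country) pair ends up on a single row with both sets of columns filled in.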

In [0]:
# TODO max of immunization per capita

# References
- [Gapminder]
- https://python-graph-gallery.com/341-python-gapminder-animation/
