# 01: Introduction and Visualisation

## Importing fundamental packages

### NumPy 
  * Package for fast "scientific" computing (especially linear algebra and random number generation).
  * Mostly an interface to fast C/C++/Fortran libraries.
  * http://www.numpy.org/

### pandas
  * Popular data analysis toolkit.
  * Makes it easy to manipulate **tabular** data.
  * http://pandas.pydata.org/

### scikit-learn (sklearn)
  * Data science tools and methods in Python.
  * Built on NumPy, [SciPy](https://www.scipy.org/), and matplotlib.
  * http://scikit-learn.org/stable/
  
### matplotlib
  * Fundamental 2D plotting library.
  * https://matplotlib.org/

### seaborn
  * Data visualisation tool based on matplotlib.
  * https://seaborn.pydata.org/

In [None]:
import numpy as np
import pandas as pd
import sklearn as skit
import matplotlib.pyplot as plt
import seaborn as sns

## Basic data manipulations using pandas

  - Load data1.csv and data2.csv using pandas.
  - Find out types of data in all columns (do they contain strings, numbers, ...? And how variable are these?)

### Loading data

  - Load the data from the CSV files into pandas DataFrames.

In [None]:
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv',sep=';')
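
The `sep` argument matters because the two files use different delimiters. The following is a minimal sketch of that effect; the in-memory strings are hypothetical stand-ins for `data1.csv` and `data2.csv`, not their real contents.

```python
import io

import pandas as pd

# Hypothetical miniature files standing in for data1.csv and data2.csv.
comma_csv = io.StringIO("PassengerId,Age\n1,22\n2,38\n")
semicolon_csv = io.StringIO("PassengerId;BirthYear\n3;1886\n4;1877\n")

a = pd.read_csv(comma_csv)               # the default separator is ','
b = pd.read_csv(semicolon_csv, sep=';')  # the ';' separator has to be given explicitly

# Reading the semicolon-separated file with the default separator
# squeezes every row into a single column.
semicolon_csv.seek(0)
wrong = pd.read_csv(semicolon_csv)

print(a.shape, b.shape, wrong.shape)
```

If a freshly loaded data frame has only one column, a wrong separator is a likely cause.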

### Using pandas functions to get a basic overview of a dataset

In [None]:
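df = data2
# Uncomment one line at a time to explore the data frame:
#df.head() # first five rows
#df.info() # column dtypes and non-null counts
#df.describe() # summary statistics of the numeric columns
#df.isnull().sum() # number of missing values per column
#df.notnull().sum() # number of present values per column
#display(df.head()) # explicit rendering, useful when this is not the last expression in the cell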
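
As a rough illustration of what these overview methods report, here is a sketch on a tiny, made-up data frame (the column names mirror the exercise data, the values are invented).

```python
import numpy as np
import pandas as pd

# Invented toy data with one numeric and one string column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Sex": ["male", "female", "female"],
})

dtypes = toy.dtypes           # type of each column
missing = toy.isnull().sum()  # number of missing values per column
summary = toy.describe()      # statistics; by default only numeric columns

print(dtypes, missing, summary, sep="\n\n")
```

Note that `describe()` skips the string column here; `toy.describe(include="all")` would cover it as well.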

### Basics of data selection 

In [None]:
#data1['Age'] # the column with pandas.Series.name = 'Age'
#data1.Age # ditto
#data1['Age'][:10] # get the first 10 entries of the column Age
#data1['Age'][:3][[True, False, True]] # boolean mask applied to the first 3 entries
#data1['Age'] > 30 # applying a condition to all entries of the series -> series containing the results (True/False)
#data1[data1['Age'] > 30] # data frame containing only people older than 30
#data1[['Age', 'Survived']].head() # selecting only given columns
#data1_tmp = data1.copy() # make a deep copy of the dataframe
#data1_tmp.columns = range(12) # renaming columns
#display(data1.head())
#data1_tmp.head()
#data1[1:2] # getting the second row (slicing is zero-based; data1[0:1] is the first row)
#data1.loc[1,['Age', 'Sex']] # indexer (see .loc? and .iloc?)
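
The difference between boolean masks, label-based `.loc` and position-based `.iloc` is easier to see when the index labels do not coincide with the row positions. A small sketch on invented data:

```python
import pandas as pd

# Invented toy data; the index labels deliberately differ from the positions.
toy = pd.DataFrame(
    {"Age": [22, 38, 26, 35], "Sex": ["male", "female", "female", "male"]},
    index=[10, 11, 12, 13],
)

mask = toy["Age"] > 30                   # boolean Series, one entry per row
older = toy[mask]                        # DataFrame with the rows where mask is True
by_label = toy.loc[11, ["Age", "Sex"]]   # row whose index label is 11
by_position = toy.iloc[1]                # second row, regardless of its label
first_row = toy.iloc[0:1]                # one-row DataFrame holding the first row

print(older, by_label, by_position, first_row, sep="\n\n")
```

Here `toy.loc[11]` and `toy.iloc[1]` refer to the same row, while `toy.loc[1]` would raise a `KeyError` because no row carries the label 1.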

## Task 01: Concatenating data

  - Append data2.csv to data1.csv:
      - Data (columns) not present in data1.csv are omitted from data2.csv.
      - Calculate the Age using the BirthYear column in data2.csv and append the result to the Age column of data1.
      - PassengerId must be unique in the resulting data frame.
      - Use the `pandas.concat` function.
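
One possible shape of the solution, sketched on invented toy frames rather than the real files. The `Nationality` column, the reference year and the renumbering of `PassengerId` are assumptions made for illustration only.

```python
import pandas as pd

# Invented stand-ins for data1 and data2; the real files have more columns.
d1 = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Age": [22.0, 38.0]})
d2 = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [1, 0],
    "BirthYear": [1886, 1877],
    "Nationality": ["UK", "US"],  # hypothetical column that d1 does not have
})

# Assumption: use whichever year the Age column of data1 refers to.
REFERENCE_YEAR = 1912

d2 = d2.assign(Age=REFERENCE_YEAR - d2["BirthYear"])
d2 = d2[d1.columns]  # keep only the columns of d1, in the same order

combined = pd.concat([d1, d2], ignore_index=True)
# One simple way to make the ids unique: renumber all rows.
combined["PassengerId"] = range(1, len(combined) + 1)

print(combined)
```

Renumbering discards the original ids; offsetting the ids of the second frame by the maximum id of the first is an alternative that keeps the first frame untouched.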

## Plotting with pandas and seaborn

In [None]:
import matplotlib.pyplot as plt # standard convention for importing the plotting tool
import matplotlib
%matplotlib inline
matplotlib.style.use('ggplot')

### Influence of Pclass, Age and Sex on passengers' chances of survival

In [None]:
# `data` is assumed to be the combined data frame produced in Task 01
#data.plot() # default behaviour of the plot() method
# look especially at what kinds of plots are available:
#data.plot?

# get data frames for survived and not-survived passengers
survived = data[data['Survived'] == 1]
not_survived = data[data['Survived'] == 0]

ax = survived.plot.scatter(x='Age', y='Pclass', color='Green', label='Survived')
not_survived.plot(x='Age', y='Pclass', kind='scatter', color='Black', label='Not Survived')
# plot the graphs into one figure:
# not_survived.plot.scatter(x='Age', y='Pclass', color='Black', label='Not Survived', ax = ax)
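
Passing the `ax` returned by the first call to the second one draws both groups into a single figure. A self-contained sketch on invented data; the `Agg` backend line is only there so the example also runs outside a notebook and should be left out when `%matplotlib inline` is active.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Invented toy data with the same column names as in the exercise.
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Pclass": [3, 1, 3, 1],
    "Survived": [0, 1, 1, 0],
})
survived = toy[toy["Survived"] == 1]
not_survived = toy[toy["Survived"] == 0]

# The first call creates the axes, the second call reuses them via ax=ax.
ax = survived.plot.scatter(x="Age", y="Pclass", color="green", label="Survived")
not_survived.plot.scatter(x="Age", y="Pclass", color="black",
                          label="Not Survived", ax=ax)

n_collections = len(ax.collections)  # one scatter collection per group
print(n_collections)
```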

In [None]:
plt.figure(figsize=(9,12)) # figsize in inches
plt.subplot(321) # three rows and two columns, put the following plot into the first slot
survived['Age'].plot.hist(color='Green')
plt.subplot(322)
not_survived['Age'].plot.hist(color='Black')
plt.subplot(323)
survived['Pclass'].plot.hist(color='Green')
plt.subplot(324)
not_survived['Pclass'].plot.hist(color='Black')
plt.subplot(325)
survived['Sex'].apply(lambda x: 1 if x == 'female' else 0).plot.hist(color='Green')
plt.subplot(326)
not_survived['Sex'].apply(lambda x: 1 if x == 'female' else 0).plot.hist(color='Black')

## Seaborn: investigating relations between features

In [None]:
plt.figure(figsize=(14,12))
data['Sex'] = data['Sex'].apply(lambda x: 1 if x == 'female' else 0)
cor_matrix = data.drop('PassengerId', axis=1).corr(numeric_only=True) # numeric_only (pandas >= 1.5) skips any remaining text columns
print(cor_matrix)
sns.heatmap(cor_matrix, annot=True)
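
To make the encoding step and the resulting matrix concrete, here is a sketch on invented data. It uses `Series.map` instead of `apply` with a lambda, which is only a stylistic alternative, and it assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Invented toy data; "Name" stands for any text column that cannot be correlated.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
    "Name": ["A", "B", "C", "D"],
})

# Encode Sex as 1 (female) / 0 (male) so that it enters the correlation matrix.
toy["Sex"] = toy["Sex"].map({"female": 1, "male": 0})

# Pairwise Pearson correlations between the numeric columns only.
cor = toy.corr(numeric_only=True)
print(cor)
```

In this toy example Sex and Survived coincide exactly, so their correlation is 1; real data will give values strictly between -1 and 1.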

## Task 02: scatter plots for all pairs of features

  - Use the `sns.pairplot` function to get plots analogous to the one below for all (reasonable) pairs of features.

In [None]:
plt.figure(figsize=(12,4))
sns.stripplot(x="Pclass", y="Age", hue="Survived", data=data, palette= ['black','green']) # add jitter=True

## Task 03: Age and education level of 2017 Czech parliament election candidates

  - Choose at least three parties that are going to participate in the election (e.g. ODS, ČSSD, KSČM, etc.).
  - Scrape the web pages at https://volby.cz/ to get a list of all candidates for the chosen parties.
     - Good place to start: https://volby.cz/pls/ps2017/ps11?xjazyk=CZ&xv=1&xt=1
  - Use data visualisation to depict the age distribution of candidates:
      - How many candidates are young/old/middle aged?
      - Which party has older candidates compared to the others?
      - ...
  - Use the titles of the candidates to get an idea of their education levels.
      - E.g. *Barteček Ivo prof. PhDr. CSc.* has three titles "prof.", "PhDr." and "CSc."
      - Try to distinguish at least three education levels corresponding to:
          - No title,
          - Bc.
          - Ing., Mgr. and analogous and higher.
      - Your visualisation should answer at least these questions:
          - How frequent are the education levels of candidates for each of the chosen parties?
          - How "educated" is each party compared to the others?
          - How frequent is each education level within all candidates?
          - ...
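
The three education levels above could be derived from a candidate string roughly as follows. This is a deliberately crude sketch: the helper `education_level`, the set `BACHELOR_TITLES` and the rule "every token ending in a period is a title" are assumptions, not something the election pages prescribe.

```python
# Assumption: the titles that count as bachelor level.
BACHELOR_TITLES = {"Bc.", "BcA."}


def education_level(candidate: str) -> str:
    """Classify a candidate string such as 'Barteček Ivo prof. PhDr. CSc.'."""
    # Assumption: every whitespace-separated token ending in '.' is a title.
    titles = [token for token in candidate.split() if token.endswith(".")]
    if not titles:
        return "no title"
    if all(title in BACHELOR_TITLES for title in titles):
        return "Bc."
    return "Ing./Mgr. or higher"


# The first name comes from the task text, the other two are made up.
for name in ["Barteček Ivo prof. PhDr. CSc.", "Novák Jan", "Nováková Jana Bc."]:
    print(name, "->", education_level(name))
```

Titles written without a trailing period (for example MBA) and titles below bachelor level would be misclassified by this rule, so the real data is worth inspecting before relying on it.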
          
### Hints:
   - Use `import requests` to get the HTML source of a given `url`: 
      - `r = requests.get(url)`
      - `html = r.text`
   - Use `pandas.read_html` to load the content of every `<table>` as a pandas DataFrame (recent pandas versions expect a file-like object rather than a raw string, hence `io.StringIO`):
      - `list_of_data_frames = pd.read_html(io.StringIO(html), flavor='html5lib')`
  
  