<img src="py-logo.png" width="100pt"/>

# PYTHON FOR DATA SCIENCE 
# II – WORKING WITH DATA
*Lasse Ruokolainen*

*Seasoned Data Master, BILOT Consulting Oy* 

***

## (1) Functions and methods
Python has two ways of operating on data: **function**s and **method**s. Functions take data as input (and potentially additional arguments), whereas methods perform a predefined operation on the object they belong to. Syntax: `function(input, *arguments)`, `object.method()`. Objects also have **attribute**s, such as the `shape` of arrays and dataframes, which are accessed with the same dot syntax as **method**s, except that no brackets are used, because an attribute holds a value rather than performing an operation.

### (a) *Functions*

In [None]:
from numpy.random import uniform as runif

# generate example data, using a function:
x = runif(size=10)
print(x) 

In [None]:
# built-in functions:
print(max(x))
print(min(x))
print(sum(x))

In [None]:
# define a custom function:
def my_mean(data):
    """
    This function returns the mean value of 
    input numeric data, using inbuilt functions
    sum() and len().
    """
    return sum(data)/len(data)

# use the function:
print(my_mean(x))
print('%.3f' %my_mean(x))

import numpy as np
print(np.mean(x))

### (b) *Methods*

In [None]:
import pandas as pd

# make x a pandas Series
x2 = pd.Series(x)
print(x2)

In [None]:
# use method for Series:
print(x2.mean())
print(x2.max())
print(x2.prod())

In [None]:
# make a dataframe:
df = pd.DataFrame({
    'x' : x,
    'x_squared' : x**2
})

# use methods on dataframe:
print(df.mean(),'\n')
print(df.apply(my_mean))

How does one then know whether one is dealing with a **function**, **method**, or **attribute**?
Here the function `type()` comes in handy:

In [None]:
print(type(df.shape))
print(type(df.sum))
print(type(df.index))
print(type(sum))
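
A complementary check, as a minimal sketch: the built-in function `callable()` tells you whether a name needs brackets. The small dataframe `demo` below is made up for illustration and is not part of the course data.

```python
import pandas as pd

# hypothetical example data, for illustration only
demo = pd.DataFrame({'x': [1.0, 2.0, 3.0]})

# methods and functions are callable, plain attributes are not:
is_callable = {
    'demo.sum': callable(demo.sum),      # method
    'sum': callable(sum),                # built-in function
    'demo.shape': callable(demo.shape),  # attribute (a tuple)
}
print(is_callable)
```

Anything that reports `True` here is used with brackets, anything that reports `False` without.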

***
## (2) Data operations and manipulations
It is typical that at least 80% of the time in a project goes to data handling and manipulation. Thus, it is a good idea to know how that is done.

### (a) *Read and inspect data*
In order to do something useful in Python, one often needs to bring in data. If the data resides in a flat file, a convenient way is to use the `read_csv()` function in `pandas`.

In [None]:
# read in a data set:
df = pd.read_csv('Datasets/tips.csv',index_col = 0)

df.head() # note: use of .method()
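
To see what `index_col = 0` does, here is a small sketch that reads CSV text from memory instead of a file. The contents of `csv_text` are invented for illustration and only mimic the layout of a file whose first column holds row labels.

```python
import io
import pandas as pd

# hypothetical file contents; the first, unnamed column holds row labels
csv_text = ",total_bill,tip\n1,16.99,1.01\n2,10.34,1.66\n"

# index_col=0 uses the first column as the row index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# without index_col, that column is read in as an ordinary data column
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.shape)
print(without_index.shape)
```

With `index_col=0` the row labels stay out of the data columns, so the dataframe has one column less.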

Now that we have the data, below we'll go through a couple of handy methods. 

In [None]:
# query variable types:
print(df.dtypes)

In [None]:
# change data type:
df.sex = df.sex.astype('category')

In [None]:
# descriptive statistics:
df.describe(include = 'all') # note: use of .method()

In [None]:
# tabulate the number of records by day and sex
# (counts the non-missing 'smoker' entries, smokers and non-smokers alike):
tab = df.pivot_table(
    index = 'day',
    columns = 'sex',
    values = 'smoker',
    aggfunc = 'count'
)
print(tab)

# convert to proportions:
print('\n',tab.apply(lambda x: x/sum(x),axis='rows').round(2))
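
As an alternative sketch, `pd.crosstab()` can produce both the counts and the column-wise proportions directly. The dataframe `demo` is a made-up stand-in for the tips data, so the numbers below are not real results.

```python
import pandas as pd

# hypothetical stand-in for the tips data
demo = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat', 'Sun'],
    'sex': ['Female', 'Male', 'Male', 'Male', 'Female', 'Female'],
})

# number of rows for each day/sex combination:
counts = pd.crosstab(demo['day'], demo['sex'])

# proportions within each column (each column sums to 1):
props = pd.crosstab(demo['day'], demo['sex'], normalize='columns')

print(counts)
print(props.round(2))
```

Setting `normalize='index'` instead would give proportions within each row.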

### (b) *New variables and aggregation*
What if you need to add new variables to the data or perform aggregations?

In [None]:
# calculate new variable:
df['relative_tip'] = df.tip/df.total_bill
print(df.relative_tip.mean().round(3))

In [None]:
import numpy as np
# average the numeric columns within categories by using groupby
# (numeric_only=True skips text and categorical columns such as 'sex'):
df2 = df.groupby(['day','time','smoker']).mean(numeric_only=True)

In [None]:
# index the grouped dataframe, notice the multi-level index (one level per grouping variable)!
df2.loc['Fri','size']
df2
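
Indexing a multi-level index can be confusing at first, so here is a minimal sketch with two grouping variables. The dataframe `demo` is invented for illustration and only borrows column names from the tips data.

```python
import pandas as pd

# hypothetical stand-in for the tips data
demo = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Fri', 'Sat', 'Sat'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Dinner', 'Dinner'],
    'tip': [2.0, 3.0, 5.0, 1.0, 3.0],
    'size': [2, 2, 4, 3, 1],
})

# group by two keys; the result has a two-level row index
means = demo.groupby(['day', 'time']).mean()

# all groups under the first-level label 'Fri':
print(means.loc['Fri'])

# one specific group, addressed with a tuple of labels:
fri_dinner_tip = means.loc[('Fri', 'Dinner'), 'tip']
print(fri_dinner_tip)
```

Passing the row labels as a tuple keeps it unambiguous which labels refer to index levels and which one refers to a column.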

### (c) *Missing values*


In [None]:
mpg = pd.read_csv('Datasets/mpg.csv',index_col=0)
mpg.head()

Another good method for data inspection is `.info()`, which shows not only the data dimensions and variable types, but also the number of missing values:

In [None]:
mpg.info()

OK, more precisely, you get the number of non-missing values. Still, we can easily calculate how many are missing:

In [None]:
mpg.isnull().sum()
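
The same numbers can also be derived from the non-missing counts, as a small sketch shows. The dataframe `demo` is made up for illustration; its column names merely resemble those of a fuel-economy data set.

```python
import numpy as np
import pandas as pd

# hypothetical data with gaps
demo = pd.DataFrame({
    'mpg': [18.0, np.nan, 16.0, 17.0],
    'horsepower': [130.0, 165.0, np.nan, np.nan],
})

# number of rows minus the non-missing count gives the missing count:
n_missing = len(demo) - demo.count()
print(n_missing)
```

Here `.count()` returns the number of non-missing values per column, which is the figure `.info()` reports.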

In order to work with such data with missing values, we need to either remove those rows with incomplete records, or fill the blanks with something.

In [None]:
# drop missing values:
mpg.dropna().info()

Dropping the rows with missing data obviously leaves us with, in this case, 6 fewer rows. Note that `.dropna()` returns a new dataframe: to keep the result, either store it in a new variable or set `inplace=True` in the `.dropna()` method.
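
A minimal sketch of storing the result, using a made-up dataframe `demo` rather than the course data. It also shows the `subset` argument of `.dropna()`, which limits the check to chosen columns.

```python
import numpy as np
import pandas as pd

# hypothetical data with gaps
demo = pd.DataFrame({
    'mpg': [18.0, np.nan, 16.0, 17.0],
    'horsepower': [130.0, 165.0, np.nan, 150.0],
})

# .dropna() returns a new dataframe; the original is left untouched
complete = demo.dropna()
print(len(demo), len(complete))

# drop rows only when a chosen column is missing:
has_mpg = demo.dropna(subset=['mpg'])
print(len(has_mpg))
```

Storing the result under a new name keeps the original data available for comparison.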

In [None]:
# fill na's with a value:
mpg.fillna(0).info()

Here the missing values are replaced with the value `0`. Again, to keep the result, either store it in a new dataframe or set `inplace=True` in the `.fillna()` method.

In [None]:
# fill na's with the column means
# (numeric_only=True skips any text columns when computing the means):
mpg.fillna(mpg.mean(numeric_only=True)).describe().round()

In this case the missing values in each variable are replaced by that variable's mean. Isn't that clever?
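
To see what happens to non-numeric columns, here is a small sketch with a made-up dataframe `demo` that has one numeric and one text column. The means are computed for numeric columns only, so gaps in the text column are left as they are.

```python
import numpy as np
import pandas as pd

# hypothetical data: one numeric and one text column, both with a gap
demo = pd.DataFrame({
    'mpg': [10.0, np.nan, 20.0],
    'name': ['a', None, 'c'],
})

# means of the numeric columns only
col_means = demo.mean(numeric_only=True)

# each numeric column is filled with its own mean
filled = demo.fillna(col_means)
print(filled)
```

Text columns need a separate decision, for example filling with a placeholder label or dropping those rows.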