<img src="py-logo.png" width="100pt"/>



# INTRODUCTION TO PYTHON 
# I – DATA STRUCTURES
*Lasse Ruokolainen*

*Seasoned Data Master, BILOT Consulting Oy* 

***

## (1) Data types
In Python, values come in several **types** that are treated differently. The basic types include **int**, **float**, **str** (string), and **bool** (boolean). The numeric types **int** and **float** support the usual mathematical operations, whereas text operations apply to strings. Booleans are convenient in control structures as well as in filtering and indexing data tables. The type of an **object** can be queried with the **function** `type()`.

### (a) *Numeric*
Numeric objects are either integers or decimal numbers.
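
As a minimal sketch of how the two numeric types interact (the variable names `n`, `m` and `r` are illustrative and not part of the original notebook): mixing an `int` with a `float` gives a `float`, true division `/` always returns a `float`, and floor division `//` between two integers returns an `int`.

```python
# hypothetical example values, chosen only for illustration
n = 7      # int
m = 2      # int
r = 0.5    # float

mixed = n + r        # int + float -> float
true_div = n / m     # true division always returns a float
floor_div = n // m   # floor division of two ints returns an int

print(type(mixed), mixed)
print(type(true_div), true_div)
print(type(floor_div), floor_div)
```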

In [None]:
# assign a value to an object:
a = 6
# query the type of a and print it to the console:
print(type(a))

In [None]:
b = 2.3
print(type(b))

In [None]:
print(a + b)

In [None]:
# assign the result from an operation to an object:
c = a * b - b/a
# controlled printing:
print('%.2f' %c)
# or:
print('Result: {0:.2f}'.format(c))

### (b) *String*
String objects can be either single words or longer sentences.
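
One point worth keeping in mind for the examples in this section: strings are immutable, so string methods return a new string rather than changing the original. A minimal sketch, using a hypothetical variable `word`:

```python
word = 'python'            # hypothetical example string
shouted = word.upper()     # returns a new string

print(word)                # the original is unchanged
print(shouted)

# to keep the result of a method, assign it to a name
word_title = word.capitalize()
print(word_title)
```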

In [None]:
animal = 'cat'
print(type(animal))

In [None]:
print(animal.capitalize())

In [None]:
# number of characters:
print(len(animal))

In [None]:
# concatenate strings:
fact = animal.capitalize() + ' is a feline'
print(fact)

In [None]:
opinion = 'I hate ' + animal.upper() + 'S!'
print(opinion)

In [None]:
# repeat a string:
print(animal * 3)

In [None]:
# find the position of a character/pattern:
print(opinion.index('e'))

In [None]:
# replace a part of a string:
print(opinion.replace('hate','love').replace('I','You'))

In [None]:
# join strings:
animals = ['cat', 'dog', 'horse', 'python']

', '.join(animals)

In [None]:
# list the content of the current workspace:
%who

### (c) *Boolean*
Boolean objects contain the values `True` or `False`, which can be interpreted as numeric 1/0.
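
The numeric interpretation can be sketched as follows (the list `flags` is a hypothetical example): booleans take part in arithmetic as 1 and 0, so summing a collection of booleans counts how many are `True`.

```python
# booleans behave like the integers 1 and 0 in arithmetic
print(True + True)
print(True * 10)

# summing a list of booleans counts the True values
flags = [3 > 1, 2 > 5, 'a' in 'cat']   # hypothetical example tests
n_true = sum(flags)
print(flags)
print(n_true)
```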

In [None]:
# logical test:
test = a > b
print(test)

In [None]:
print(type(test))

In [None]:
# logical matching:
print('cat' in fact)

In [None]:
# compare values with ==; `is` tests object identity, not equality:
print(animal == 'cat')

***
## (2) Data structures
Python contains a rich set of different data structures that are convenient for different purposes. 

### (a) *List*
Lists in Python store collections of heterogeneous items. A list is written with square brackets (`[` and `]`) around elements separated by commas (`,`), and the elements can be of any **type**. Lists can also be nested, meaning that a list can contain a list (or any other data structure). Lists can be grown or shrunk with the methods `.append()` and `.remove()` (note that these operations modify the list in place, without the need to reassign the object). To make an empty list, type: `[]`.

In [None]:
# generate a list with heterogeneous items:
x = [1,'a',1,2,'a','b',4,'b']
print(x)

In [None]:
print(type(x))

In [None]:
# count number of instances:
print(x.count('b'))

In [None]:
# concatenate two lists:
print(x + [3,'c'])

In [None]:
# query the length of a list:
print(len(x))

In [None]:
# make a nested list:
y = [[1,2,3,4], ['a','b','c','d']]
print(y)

In [None]:
print(len(y))

In [None]:
# append a list:
x.append(4.75)
print(x)

In [None]:
# remove elements of a list:
x.remove(4.75)
print(x)

In [None]:
# extend a list:
x.extend([0,5,1,'stop'])
print(x)

What is the difference between `.append` and `.extend`? The former adds its input as a single new element of the list, while the latter extends the list with each item of the input:

`x = [1,2,3]` 

`x.append([4, 5, 6]) -> [1,2,3,[4,5,6]]`  

`x.extend([4, 5, 6]) -> [1,2,3,4,5,6]`
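
The comparison above can be run as a small sketch. The names `base`, `appended` and `extended` are illustrative; copies are used so that each method starts from the same list.

```python
base = [1, 2, 3]

appended = base.copy()          # copy so that base itself stays untouched
appended.append([4, 5, 6])      # adds the whole list as ONE new element
print(appended, len(appended))

extended = base.copy()
extended.extend([4, 5, 6])      # adds each item as a separate element
print(extended, len(extended))

# both methods modify the list in place and return None
result = base.copy().append(0)
print(result)
```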

### (b) *Numpy array*
Lists do not support element-wise arithmetic, because they may contain data of any type; for lists, `+` concatenates and `*` repeats. NumPy arrays are like lists but can be used in calculations, because all values in an array share a single type. To use NumPy arrays one needs to import the `numpy` package with the `import` statement: `import numpy`. This loads the library into memory under the name `numpy`. By convention `numpy` is imported as `np`, but you are free to choose another name.
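
To illustrate the contrast between lists and arrays, here is a minimal sketch with hypothetical example data (`values` and `arr` are not part of the original notebook). The same operators behave differently on the two structures.

```python
import numpy as np

values = [1, 2, 3]          # hypothetical example data
arr = np.array(values)

print(values * 2)           # list: repetition
print(values + values)      # list: concatenation
print(arr * 2)              # array: element-wise multiplication
print(arr + arr)            # array: element-wise addition
```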

In [None]:
# import numpy:
import numpy as np

In [None]:
# generate a numeric array:
nar = np.array(range(0,5))
print(nar)
print(type(nar))

In [None]:
# calculate sum over the array:
print(sum(nar))

In [None]:
# mathematical operation on array:
print(nar/2)

In [None]:
# bind arrays together:
matrix = np.column_stack((nar,nar))
print(matrix)

In [None]:
# query the shape of an array (note: this returns a tuple):
print(matrix.shape)

Even though **arrays** can contain only a single data type, that type need not be numeric; string arrays and boolean arrays are also possible.
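
A short sketch of what the single-type rule means in practice, using hypothetical inputs: when the values passed to `np.array()` have mixed types, NumPy converts them all to one common type, which can be inspected through the `.dtype` attribute. The exact dtype names printed may vary between platforms.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_num = np.array([1, 2.5, 3])      # int and float -> all float
mixed_str = np.array([1, 'a', 2.5])    # numbers and a string -> all strings

print(ints.dtype)
print(mixed_num.dtype)
print(mixed_str.dtype)
print(mixed_str)
```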

In [None]:
# generate string array:
sar = np.array(['anaconda','worm','viper','snake','python','boa'])
print(len(sar))

In [None]:
print('snake' in sar)

In [None]:
# logical matching generates a boolean array:
print(sar == 'python')

### (c) *Tuple*
Tuples are sequences, just like lists. The differences are that tuples cannot be changed after creation, unlike lists, and that tuples are written with round brackets (`(` and `)`), whereas lists use square brackets. An empty tuple is created by: `()`.
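
As a minimal sketch of what immutability means (the tuple `snakes` is a hypothetical example): assigning to an element raises a `TypeError`, so a changed version has to be built as a new tuple.

```python
snakes = ('anaconda', 'viper', 'boa')   # hypothetical example tuple

try:
    snakes[1] = 'python'                # item assignment is not allowed
except TypeError as err:
    print('TypeError:', err)

# build a new tuple instead of modifying the existing one
updated = snakes[:1] + ('python',) + snakes[2:]
print(updated)
print(snakes)                           # the original tuple is unchanged
```

Note the trailing comma in `('python',)`: it is what makes a one-element tuple.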

In [None]:
tu = matrix.shape
print(type(tu))

In [None]:
tu2 = (1,2,3,4)
print(tu2)

In [None]:
tu3 = 'anaconda','worm','viper','snake','python','boa'
print(tu3)

In [None]:
print(tu3[3])

In [None]:
# tuples are immutable, so this assignment raises a TypeError:
tu3[3] = 'serpent'

In [None]:
# unpacking of a tuple:
a,b,c,d = tu2

But what if one wants only the first element of the tuple as a separate object?

In [None]:
a,*rest = tu2
print(a)
print(rest)
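
The starred unpacking shown above can be sketched a little further (all names here are illustrative): the starred name always collects the remaining items into a list, and it can appear at any position on the left-hand side.

```python
numbers = (1, 2, 3, 4)           # hypothetical example tuple

first, *others = numbers         # the starred name collects the remainder as a list
print(first, others)

head, *middle, last = numbers    # the star can also sit in the middle
print(head, middle, last)

only_first, *_ = numbers         # '_' is a conventional name for values to ignore
print(only_first)
```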

### (d) *Dictionary*
Dictionaries are handy data structures that store key-value pairs. Each key is separated from its value by a colon (`:`), the items are separated by commas, and the whole thing is enclosed in curly braces (`{` and `}`). An empty dictionary without any items is written with just two curly braces, like this: `{}`. Dictionary keys can be accessed with the `.keys()` method, while the values can be accessed with the `.values()` method. Like lists, dictionaries can be nested.
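
A minimal sketch of two points from this description, using hypothetical data (`person` and `people` are illustrative names): keys and values can be walked through together with the `.items()` method, and a dictionary can hold other dictionaries as values.

```python
person = {'name': 'Anne', 'pet': 'cat', 'age': 25}   # hypothetical example data

# iterate over key-value pairs with .items()
for key, value in person.items():
    print(key, '->', value)

# test whether a key exists
print('pet' in person)

# a nested dictionary: values can themselves be dictionaries
people = {'anne': person, 'mike': {'name': 'Mike', 'pet': 'python', 'age': 37}}
print(people['mike']['pet'])
```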

In [None]:
# create a dictionary:
mydict = {'name':'Anne','pet':'cat','age':25}

In [None]:
# get keys:
print(mydict.keys())

In [None]:
# get values:
print(mydict.values())

In [None]:
# add new data:
mydict['job'] = 'manager'
print(mydict)

In [None]:
# remove data:
del mydict['job']
print(mydict)

### (e) *DataFrame*
A DataFrame is a two-dimensional table whose columns can hold different data types. DataFrames are provided by `pandas`, a powerful Python library for data analysis. By convention, pandas is imported with `import pandas as pd`. In practice, one usually encounters a dataframe when reading data into Python via the `pd.read_csv()` function.

In [None]:
# import pandas
import pandas as pd

# read .csv file to dataframe:
df = pd.read_csv('Datasets/iris.csv')

In [None]:
# inspect the dataframe shape:
print(df.shape)

In [None]:
# print the dataframe head:
df.head()

In [None]:
# query variable types:
print(df.dtypes)

In [None]:
# descriptive statistics for numeric variables:
df.describe()

In [None]:
# access the dataframe row index:
print(df.index)

In [None]:
# access the dataframe column index (variable names):
print(df.columns)

In [None]:
# get useful information about the dataframe:
df.info()

***
## (3) Indexing

### (a) *Indexing a list*
One can access the content of a **list** using square brackets and specifying the desired index location. Alternatively, one can filter the content of a list with a logical expression inside a list comprehension. Note that slicing uses half-open ranges, such that the element at the end index is not included.
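
The range behaviour can be sketched with a small hypothetical list (`letters` is an illustrative name): the start index is included, the end index is excluded, and negative indices count from the end.

```python
letters = ['a', 'b', 'c', 'd', 'e']   # hypothetical example list

print(letters[1:3])    # indices 1 and 2; index 3 is excluded
print(letters[:2])     # the first two items
print(letters[-2:])    # the last two items
print(letters[::2])    # every second item

# filtering by a condition is done with a list comprehension
vowels = [s for s in letters if s in 'aeiou']
print(vowels)
```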

In [None]:
# access the first entry (at index location 0!) of x:
print(x[0])
# access the last entry of x:
print(x[-1])
print(x)

In [None]:
# selecting a range:
print(x[2:]) # from 3rd to last
print(x[:2]) # from 1st to 2nd

In [None]:
# access the second list within y:
print(y[1])

In [None]:
# hierarchical indexing:
print(y[0][1:])

In [None]:
# index a string:
print(opinion[0:6])

In [None]:
# use list comprehensions to select only the numeric values, or only the strings, in a list:
print([i for i in x if isinstance(i, (int, float))])
print([i for i in x if isinstance(i, str)])

### (b) *Indexing an array*
Arrays are indexed in a similar way to lists, using square brackets.
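
For arrays with more than one dimension, the indices for each dimension go inside a single pair of square brackets, separated by a comma. A minimal sketch with a hypothetical 2 x 3 array called `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])    # hypothetical 2 x 3 example array

print(grid[1, 2])       # row 1, column 2
print(grid[0])          # the first row
print(grid[:, 1])       # the second column
print(grid[grid > 3])   # a boolean mask returns the matching values as a 1-D array
```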

In [None]:
# indexing with location:
print(nar[:3])

In [None]:
# indexing using a boolean:
print(nar[nar<3])
#print(nar<3)

### (c) *Indexing a dictionary*
Dictionaries are indexed by **key**. When the value stored under a key is a list, an element of that list can then be selected by index location.
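
A short sketch of key-based lookup with hypothetical data (`pets` is an illustrative name): square brackets raise a `KeyError` when the key is missing, whereas the `.get()` method returns a default value instead.

```python
pets = {'name': ['Anne', 'Mike'], 'pet': ['cat', 'python']}   # hypothetical example data

print(pets['name'][0])          # key first, then position in the list

# a missing key raises KeyError with square brackets ...
try:
    pets['age']
except KeyError as err:
    print('missing key:', err)

# ... while .get() returns a default value instead
ages = pets.get('age', [])
print(ages)
```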

In [None]:
# example:
mydict = {'name':['Anne','Mike'],'pet':['cat','python'],'age':[25,37]}
print(mydict['pet'][1])

### (d) *Indexing a DataFrame*
Dataframe indexing is somewhat more complicated than for other data structures. Indexing a dataframe with a slice (e.g. `df[:3]`) returns a slice of rows, while indexing with variable names returns the corresponding columns. If one needs to index across both dimensions, the indexers `.loc` or `.iloc` need to be used: `.loc` selects by the labels in the dataframe index and columns, whereas `.iloc` selects by integer position.
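
The difference between the two indexers is easiest to see when the row index is not simply 0, 1, 2, and so on. The following is a minimal sketch under that assumption; the table `demo` and its values are hypothetical and are not read from the iris file.

```python
import pandas as pd

# hypothetical small table with a non-default row index
demo = pd.DataFrame(
    {'species': ['setosa', 'versicolor', 'virginica'],
     'sepal_length': [5.1, 7.0, 6.3]},
    index=[10, 20, 30],
)

by_label = demo.loc[20, 'species']    # row LABEL 20, column label 'species'
by_position = demo.iloc[1, 0]         # second row, first column (positions)
print(by_label, by_position)

# label slices in .loc include the end label; position slices in .iloc exclude the end
print(demo.loc[10:20, 'sepal_length'])
print(demo.iloc[0:2, 1])
```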

In [None]:
# row slice:
df[:3]

In [None]:
# column slice:
df[['species','sepal_length']].head()

In [None]:
# filtering:
df.loc[df.species=='setosa'].head()

In [None]:
# 2D slicing:
df.iloc[:3,[0,3]]

In [None]:
# more advanced filtering (note the use of function np.logical_and()):
df.loc[
    np.logical_and(
        df.species=='versicolor',
        df.sepal_length<5.5
    )
]
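
As an alternative to `np.logical_and()`, conditions on dataframe columns can also be combined with the `&` operator, provided each condition is wrapped in parentheses. A minimal sketch, using a hypothetical stand-in table `flowers` rather than the iris file:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the iris data used above
flowers = pd.DataFrame({
    'species': ['setosa', 'versicolor', 'versicolor', 'virginica'],
    'sepal_length': [5.1, 5.0, 6.4, 6.3],
})

mask_func = np.logical_and(flowers.species == 'versicolor', flowers.sepal_length < 5.5)
mask_op = (flowers.species == 'versicolor') & (flowers.sepal_length < 5.5)

print(list(mask_func) == list(mask_op))   # both approaches build the same boolean mask
print(flowers.loc[mask_op])
```

The parentheses matter because `&` binds more tightly than the comparison operators `==` and `<`.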