<img src="py-logo.png" width="100pt"/>



# INTRODUCTION TO PYTHON 
# I – DATA STRUCTURES
*Lasse Ruokolainen*

*Seasoned Data Master, BILOT Consulting Oy* 

***

## (1) Data types
In Python, data entries come in several **types** that are treated differently. The basic types include **int**, **float**, **str** (string), and **bool** (boolean). The numeric types **int** and **float** support the usual mathematical operations, whereas text operations apply to strings. Booleans are convenient in control structures as well as in filtering and indexing data. The type of an **object** can be queried with the **function** `type()`.

### (a) *Numeric*
Numeric objects are either integers or decimal numbers.

In [4]:
# assign a value to an object:
a = 6
# query the type of a and print it to the console:
print(type(a))

<class 'int'>


In [5]:
b = 2.3
print(type(b))

<class 'float'>


In [6]:
print(a + b)

8.3


In [17]:
# assign the result from an operation to an object:
c = a * b - b/a
# controlled printing:
print('%.2f' %c)
# or:
print('Result: {0:.2f}'.format(c))

13.42
Result: 13.42
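
Since Python 3.6, formatted string literals (f-strings) offer a third way to control printing. The following is a minimal sketch that repeats the calculation above so that it runs on its own; the name `formatted` is only illustrative.

```python
# same values as in the cells above
a = 6
b = 2.3
c = a * b - b / a

# f-string: the expression and its format specification go inside the braces
formatted = f'Result: {c:.2f}'
print(formatted)
```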


### (b) *String*
String objects can be either single words or longer sentences.

In [18]:
animal = 'cat'
print(type(animal))

<class 'str'>


In [19]:
print(animal.capitalize())

Cat


In [20]:
# number of characters:
print(len(animal))

3


In [21]:
# concatenate strings:
fact = animal.capitalize() + ' is a feline'
print(fact)

Cat is a feline


In [22]:
opinion = 'I hate ' + animal.upper() + 'S!'
print(opinion)

I hate CATS!


In [23]:
# repeat a string:
print(animal * 3)

catcatcat


In [24]:
# find the position of a character/pattern:
print(opinion.index('e'))

5
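
Note that `.index()` reports only the first match and fails when the pattern is absent. The sketch below redefines `opinion` so that it runs on its own and contrasts `.index()` with the related string method `.find()`; the variable `position` is only illustrative.

```python
opinion = 'I hate CATS!'

# .index() raises a ValueError when the pattern is missing ...
try:
    position = opinion.index('dog')
except ValueError:
    position = None
print(position)

# ... whereas .find() returns -1 instead of raising an error
print(opinion.find('dog'))
print(opinion.find('hate'))
```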


In [25]:
# replace a part of a string:
print(opinion.replace('hate','love').replace('I','You'))

You love CATS!


In [26]:
# join strings:
animals = ['cat', 'dog', 'horse', 'python']

', '.join(animals)

'cat, dog, horse, python'

In [28]:
# list the content of the current workspace:
%who

a	 animal	 animals	 b	 c	 fact	 opinion	 


### (c) *Boolean*
Boolean objects contain the values `True` or `False`, which can be interpreted as numeric 1/0.
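
The numeric interpretation makes it easy to count how many values pass a test. A small sketch follows; the list `ages` and the threshold of 18 are made up for illustration.

```python
# True and False behave like 1 and 0 in arithmetic
print(True + True)

# count how many values pass a test
ages = [12, 25, 37, 8]
is_adult = [age >= 18 for age in ages]
print(is_adult)

n_adults = sum(is_adult)
print(n_adults)
```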

In [29]:
# logical test:
test = a > b
print(test)

True


In [30]:
print(type(test))

<class 'bool'>


In [31]:
# logical matching (case-sensitive, so 'cat' does not match the 'Cat' in fact):
print('cat' in fact)

False


In [32]:
# compare values with == (the operator `is` tests object identity instead):
print(animal == 'cat')

True


***
## (2) Data structures
Python contains a rich set of different data structures that are convenient for different purposes. 

### (a) *List*
Lists in Python are used to store collections of heterogeneous items. A list is created with square brackets (`[` and `]`) that hold elements separated by commas (`,`). That is, a **list** is a collection of entries that can be of any **type**. Lists can also be nested, meaning that a list can contain another list (or any other data structure). Lists can be grown or shrunk with the methods `.append()` and `.remove()` (note that these operations modify the list in place, without the need to reassign the object). To make an empty list, type `[]`.

In [33]:
# generate a list with heterogeneous items:
x = [1,'a',1,2,'a','b',4,'b']
print(x)

[1, 'a', 1, 2, 'a', 'b', 4, 'b']


In [34]:
print(type(x))

<class 'list'>


In [35]:
# count number of instances:
print(x.count('b'))

2


In [36]:
# concatenate two lists:
print(x + [3,'c'])

[1, 'a', 1, 2, 'a', 'b', 4, 'b', 3, 'c']


In [37]:
# query the length of a list:
print(len(x))

8


In [38]:
# make a nested list:
y = [[1,2,3,4], ['a','b','c','d']]
print(y)

[[1, 2, 3, 4], ['a', 'b', 'c', 'd']]


In [39]:
print(len(y))

2


In [40]:
# append an element to a list:
x.append(4.75)
print(x)

[1, 'a', 1, 2, 'a', 'b', 4, 'b', 4.75]


In [41]:
# remove an element from a list (only the first match is removed):
x.remove(4.75)
print(x)

[1, 'a', 1, 2, 'a', 'b', 4, 'b']


In [43]:
# extend a list:
x.extend([0,5,1,'stop'])
print(x)

[1, 'a', 1, 2, 'a', 'b', 4, 'b', 0, 5, 1, 'stop']


What is the difference between `.append` and `.extend`? The former adds its input as a single new element of the list, while the latter extends the list with each item of the input:

`x = [1,2,3]` 

`x.append([4, 5, 6]) -> [1,2,3,[4,5,6]]`  

`x.extend([4, 5, 6]) -> [1,2,3,4,5,6]`
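
The comparison can be run as follows. This sketch uses the illustrative names `x1` and `x2` so that the list `x` from the cells above stays unchanged.

```python
x1 = [1, 2, 3]
x1.append([4, 5, 6])   # the whole list becomes one new element
print(x1, len(x1))

x2 = [1, 2, 3]
x2.extend([4, 5, 6])   # each item is added separately
print(x2, len(x2))

# both methods modify the list in place and return None
result = x2.append(7)
print(result)
```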

### (b) *NumPy array*
Lists are not suited to numerical calculations, because they may contain data of any type and arithmetic operators do not act on their elements one by one. NumPy arrays are like lists, but they can be used in calculations because all their values share a single type. To use NumPy arrays, one needs to import the `numpy` package with the `import` command: `import numpy`. This loads the library into working memory under the name `numpy`. By convention, `numpy` is imported as `np`, but you are free to choose another name.

In [44]:
# import numpy:
import numpy as np

In [45]:
# generate a numeric array:
nar = np.array(range(0,5))
print(nar)
print(type(nar))

[0 1 2 3 4]
<class 'numpy.ndarray'>


In [46]:
# calculate sum over the array:
print(sum(nar))

10


In [47]:
# mathematical operation on array:
print(nar/2)

[0.  0.5 1.  1.5 2. ]
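
For comparison, the sketch below applies the same operator to a list and to an array, and shows what happens when an array is built from mixed types. The names `values`, `arr` and `mixed` are only illustrative.

```python
import numpy as np

values = [1, 2, 3]

# multiplying a list repeats it ...
print(values * 2)

# ... whereas multiplying an array works element by element
arr = np.array(values)
print(arr * 2)

# mixing types forces NumPy to pick one common type (here: text)
mixed = np.array([1, 'a'])
print(mixed.dtype.kind)
```

The dtype kind `'U'` stands for Unicode text: the number 1 has been stored as the string `'1'`.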


In [48]:
# bind arrays together:
matrix = np.column_stack((nar,nar))
print(matrix)

[[0 0]
 [1 1]
 [2 2]
 [3 3]
 [4 4]]


In [49]:
# query the shape of an array (note: this returns a tuple):
print(matrix.shape)

(5, 2)


Even though **arrays** can contain only a single data type, that type does not need to be numeric; string arrays and boolean arrays are also possible.

In [50]:
# generate string array:
sar = np.array(['anaconda','worm','viper','snake','python','boa'])
print(len(sar))

6


In [51]:
print('snake' in sar)

True


In [52]:
# logical matching generates a boolean array:
print(sar == 'python')

[False False False False  True False]


### (c) *Tuple*
Tuples are sequences, just like lists. There are two differences: tuples cannot be changed after creation, whereas lists can, and tuples are written with round brackets (`(` and `)`), whereas lists use square brackets. An empty tuple is created with `()`.

In [53]:
tu = matrix.shape
print(type(tu))

<class 'tuple'>


In [54]:
tu2 = (1,2,3,4)
print(tu2)

(1, 2, 3, 4)


In [55]:
tu3 = 'anaconda','worm','viper','snake','python','boa'
print(tu3)

('anaconda', 'worm', 'viper', 'snake', 'python', 'boa')


In [56]:
print(tu3[3])

snake


In [57]:
tu3[3] = 'serpent'

TypeError: 'tuple' object does not support item assignment
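
Because item assignment is not allowed, a changed version has to be built as a new tuple. A minimal sketch follows; the names `tu4` and `single` are only illustrative.

```python
tu3 = ('anaconda', 'worm', 'viper', 'snake', 'python', 'boa')

# build a new tuple from slices instead of assigning to an item
tu4 = tu3[:3] + ('serpent',) + tu3[4:]
print(tu4)

# the original tuple is unchanged
print(tu3[3])

# a one-element tuple needs a trailing comma
single = ('serpent',)
print(type(single), len(single))
```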

In [70]:
# unpacking of a tuple (note: this overwrites the earlier objects a, b and c):
a,b,c,d = tu2

But what if one wants only the first element of the tuple as a separate object? The remaining elements can be collected into a list with a starred name:

In [71]:
a,*rest = tu2
print(a)
print(rest)

1
[2, 3, 4]
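
The starred name can appear anywhere in the unpacking. The sketch below, with illustrative variable names, picks out the first and last elements and shows the common convention of using an underscore for values that are not needed.

```python
tu2 = (1, 2, 3, 4)

# first and last element, with everything in between collected into a list
first, *middle, last = tu2
print(first, middle, last)

# an underscore is a common name for values that are not needed
head, *_ = tu2
print(head)
```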


### (d) *Dictionary*
Dictionaries are handy data structures that store key-value pairs. Each key is separated from its value by a colon (`:`), the items are separated by commas, and the whole thing is enclosed in curly braces (`{` and `}`). An empty dictionary is written with just two curly braces: `{}`. The keys can be accessed with the `.keys()` method and the values with the `.values()` method. Like lists, dictionaries can be nested.

In [72]:
# create a dictionary:
mydict = {'name':'Anne','pet':'cat','age':25}

In [73]:
# get keys:
print(mydict.keys())

dict_keys(['name', 'pet', 'age'])


In [74]:
# get values:
print(mydict.values())

dict_values(['Anne', 'cat', 25])


In [75]:
# add new data:
mydict['job'] = 'manager'
print(mydict)

{'name': 'Anne', 'pet': 'cat', 'age': 25, 'job': 'manager'}


In [76]:
# remove data:
del mydict['job']
print(mydict)

{'name': 'Anne', 'pet': 'cat', 'age': 25}
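
Keys and values can also be visited together with the `.items()` method, and a nested dictionary is simply a dictionary stored as a value. A short sketch follows; the `person` dictionary and its address are made up for illustration.

```python
mydict = {'name': 'Anne', 'pet': 'cat', 'age': 25}

# iterate over key-value pairs
for key, value in mydict.items():
    print(key, '->', value)

# test whether a key exists
print('job' in mydict)

# a nested dictionary: the value of 'address' is itself a dictionary
person = {'name': 'Anne', 'address': {'city': 'Helsinki', 'country': 'Finland'}}
print(person['address']['city'])
```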


### (e) *DataFrame*
A DataFrame is a two-dimensional table whose columns can hold different data types. DataFrames are provided by a powerful Python library called `pandas`. By convention, pandas is imported with `import pandas as pd`. In practice, one usually encounters a dataframe when reading data into Python with the `pd.read_csv()` function.

In [77]:
# import pandas
import pandas as pd

# read .csv file to dataframe:
df = pd.read_csv('Datasets/iris.csv')

In [78]:
# inspect the dataframe shape:
print(df.shape)

(150, 5)


In [79]:
# print the dataframe head:
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [80]:
# query variable types:
print(df.dtypes)

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object


In [81]:
# descriptive statistics for numeric variables:
df.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [82]:
# access the dataframe row index:
print(df.index)

RangeIndex(start=0, stop=150, step=1)


In [83]:
# access the dataframe column index (variable names):
print(df.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


In [84]:
# get useful information about the dataframe:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
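
When no data file is at hand, a small dataframe can also be built directly from a dictionary of lists. The sketch below uses made-up toy data, and the name `people` is only illustrative.

```python
import pandas as pd

# hypothetical toy data; each key becomes a column
people = pd.DataFrame({
    'name': ['Anne', 'Mike'],
    'pet': ['cat', 'python'],
    'age': [25, 37],
})

# the same inspection tools apply as for data read from a file
print(people.shape)
print(people.dtypes)
```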


***
## (3) Indexing

### (a) *Indexing a list*
One can access the content of a **list** using square brackets and specifying the desired index location; indexing starts at 0. A list can also be filtered with a logical expression inside a list comprehension. Note that slicing uses a half-open range: the start index is included, but the stop index is not.

In [85]:
# access the first entry (at index location 0!) of x:
print(x[0])
# access the last entry of x:
print(x[-1])
print(x)

1
stop
[1, 'a', 1, 2, 'a', 'b', 4, 'b', 0, 5, 1, 'stop']


In [86]:
# selecting a range:
print(x[2:]) # from 3rd to last
print(x[:2]) # from 1st to 2nd

[1, 2, 'a', 'b', 4, 'b', 0, 5, 1, 'stop']
[1, 'a']
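
The half-open rule can be checked with a short sketch, which redefines `x` so that it runs on its own; the name `part` is only illustrative.

```python
x = [1, 'a', 1, 2, 'a', 'b', 4, 'b', 0, 5, 1, 'stop']

# x[start:stop] includes index start but excludes index stop,
# so the slice holds stop - start elements
part = x[2:5]
print(part, len(part))

# two adjacent slices split the list without overlap
print(x[:5] + x[5:] == x)

# an optional third value sets the step
print(x[::4])
```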


In [87]:
# access the second list within y:
print(y[1])

['a', 'b', 'c', 'd']


In [88]:
# hierarchical indexing:
print(y[0][1:])

[2, 3, 4]


In [89]:
# index a string:
print(opinion[0:6])

I hate


In [90]:
# use list comprehensions to separate the numeric values from the strings in the list:
print([i for i in x if isinstance(i, (int, float))])
print([i for i in x if isinstance(i, str)])

[1, 1, 2, 4, 0, 5, 1]
['a', 'a', 'b', 'b', 'stop']


### (b) *Indexing an array*
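Arrays are indexed in a similar way to lists, using square brackets. In addition, an array can be indexed with a boolean array of the same length, which keeps only the entries marked `True`.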

In [91]:
# indexing with location:
print(nar[:3])

[0 1 2]


In [92]:
# indexing using a boolean:
print(nar[nar<3])
#print(nar<3)

[0 1 2]
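
Several conditions can be combined into one boolean index. The sketch below rebuilds `nar` so that it runs on its own; the name `mask` is only illustrative.

```python
import numpy as np

nar = np.array(range(0, 5))

# the comparison itself gives a boolean array (a "mask")
mask = nar < 3
print(mask)

# masks are combined with & (and), | (or) and ~ (not);
# the parentheses around each comparison are required
print(nar[(nar > 0) & (nar < 4)])
print(nar[~mask])
```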


### (c) *Indexing a dictionary*
Dictionaries are indexed by **key**. When the value stored under a key is a list, the result can then be indexed by location.

In [93]:
# example:
mydict = {'name':['Anne','Mike'],'pet':['cat','python'],'age':[25,37]}
print(mydict['pet'][1])

python


### (d) *Indexing a DataFrame*
Dataframe indexing is somewhat more complicated than for other data structures. Indexing a dataframe with a numeric slice returns a slice of rows, while indexing with variable names returns the corresponding columns. To index across both dimensions at once, the indexers `.loc` or `.iloc` need to be used: `.loc` selects by the labels in the dataframe index and columns, whereas `.iloc` selects by integer position.

In [94]:
# row slice:
df[:3]

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [95]:
# column slice:
df[['species','sepal_length']].head()

,species,sepal_length
0,setosa,5.1
1,setosa,4.9
2,setosa,4.7
3,setosa,4.6
4,setosa,5.0


In [96]:
# filtering:
df.loc[df.species=='setosa'].head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [97]:
# 2D slicing:
df.iloc[:3,[0,3]]

,sepal_length,petal_width
0,5.1,0.2
1,4.9,0.2
2,4.7,0.2
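
The difference between the two indexers is easiest to see when the row index holds text labels rather than the default integers. The sketch below uses a made-up dataframe named `pets`; neither the name nor the data comes from the iris dataset.

```python
import pandas as pd

# hypothetical toy data with text labels as the row index
pets = pd.DataFrame(
    {'pet': ['cat', 'python', 'dog'], 'age': [3, 7, 5]},
    index=['Anne', 'Mike', 'Lisa'],
)

# .loc selects by label (row label, column label)
print(pets.loc['Mike', 'age'])

# .iloc selects by integer position (row 1, column 1)
print(pets.iloc[1, 1])

# label slices include the end point, position slices do not
print(len(pets.loc['Anne':'Mike']), len(pets.iloc[0:1]))
```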


In [98]:
# more advanced filtering (note the use of function np.logical_and()):
df.loc[
    np.logical_and(
        df.species=='versicolor',
        df.sepal_length<5.5
    )
]

,sepal_length,sepal_width,petal_length,petal_width,species
57,4.9,2.4,3.3,1.0,versicolor
59,5.2,2.7,3.9,1.4,versicolor
60,5.0,2.0,3.5,1.0,versicolor
84,5.4,3.0,4.5,1.5,versicolor
93,5.0,2.3,3.3,1.0,versicolor
98,5.1,2.5,3.0,1.1,versicolor
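
Conditions on a dataframe can also be combined with the `&` operator instead of `np.logical_and()`. The sketch below uses a small made-up dataframe named `flowers` as a stand-in for the iris data, so that it runs without the data file.

```python
import pandas as pd

# hypothetical toy data standing in for the iris dataframe
flowers = pd.DataFrame({
    'species': ['setosa', 'versicolor', 'versicolor', 'virginica'],
    'sepal_length': [5.1, 4.9, 6.0, 6.3],
})

# & combines the two conditions row by row; each condition needs parentheses
subset = flowers[(flowers.species == 'versicolor') & (flowers.sepal_length < 5.5)]
print(subset)
```

The parentheses matter because `&` binds more tightly than `==` and `<`.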
