# Data & Web Mining
## Python crash course

#### Prof. Claudio Lucchese


## Books

 - Learning Python. O'Reilly. Mark Lutz.

 - advanced notes: https://github.com/satwikkansal/wtfpython

## How can I run my Python code?

In this course we use **Jupyter notebooks** as provided by
 - **Anaconda for Python**
 - **Google Colab**

Jupyter notebooks allow you
 - to write slides like this one
 - to write complex documents interleaving text with programs
 - to run code in what is basically an interactive interpreter accessed via a browser


Additional tools:
 - PyCharm by JetBrains https://www.jetbrains.com/pycharm/

## Your best friends in learning Python

1. The Python website:
    - plenty of links to books and tutorials!
        - e.g., https://docs.python.org/3/tutorial/
0. The official Python documentation:
    - https://docs.python.org/3/library/index.html
0. Google & StackOverflow:
    - try googling for `TypeError: can't multiply sequence by non-int of type 'float'`
0. Python Tutor
    - visualizes the execution of Python code
    - http://pythontutor.com/

## Who uses Python

 - The popular *YouTube* video sharing service is largely written in Python
 - The *Dropbox* storage service codes both its server and desktop client software primarily in Python
 - The widespread *BitTorrent* peer-to-peer file sharing system began its life as a Python program
 - *Netflix* and *Yelp* have both documented the role of Python in their software infrastructures
 - *JPMorgan Chase, UBS, Getco, and Citadel* apply Python to financial market forecasting
 - *NASA, Los Alamos, Fermilab, JPL*, and others use Python for scientific programming tasks

 - In "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (1998), the Google founders describe the Google architecture
    - the crawlers were written in Python!

# Python types

Python provides the following types:

| Object type | Examples |
|:-:|:-:|
| Numbers | `1234`, `3.1415`, `3+4j`, ... |
| Strings | `'spam'`, `"Bob's"`, ... |
| Lists   | `[1, [2, 'three'], 4.5]`, `list(range(10))`, ... |
| Dictionaries | `{'food': 'spam', 'taste': 'yum'}`, `dict(hours=10)`, ... |
| Tuples |  `(1, 'spam', 4, 'U')`, `tuple('spam')`, ...|
| Files |   `open('eggs.txt')`, `open(r'C:\ham.bin', 'wb')`, ... |
| Sets  | `set('abc')`, `{'a', 'b', 'c'}`, ... |
| Other core types | `Booleans`, `None`, ... |

 - The type of a variable is inferred from the expression.
 - You can use the function `type` to ask Python which type is being used
 - The type determines the set of valid operators
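
As a minimal sketch of the last point, the same operator can behave differently, or fail, depending on the types involved. The variable names below are illustrative only.

```python
# One sample value for several of the core types listed in the table.
values = [1234, 3.1415, 3 + 4j, "spam", [1, 2], (1, 2), {"a": 1}, {1, 2}, True, None]

# Ask Python for the type name of each value.
type_names = [type(v).__name__ for v in values]
print(type_names)

# The + operator means addition for numbers and concatenation for strings.
int_sum = 1 + 2
str_sum = "1" + "2"
print(int_sum, str_sum)

# Mixing incompatible types raises a TypeError instead of guessing.
try:
    "1" + 2
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```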

In [2]:
a = 2.0
print (type(a))
a = 7.
print (type(a))
a = "Hello!"
print (type(a))

<class 'float'>
<class 'float'>
<class 'str'>


# Numbers

Check integer vs. floating point division. The type of the result is determined by the operator: `/` always returns a float, while `//` performs floor division.

In [3]:
print ("What is the output of 11/2 ?   ", 11/2)
print ("What is the output of 11%2 ?   ", 11%2)
print ("What is the output of  2**10 ? ", 2**10)

What is the output of 11/2 ?    5.5
What is the output of 11%2 ?    1
What is the output of  2**10 ?  1024


In [4]:
print ("What is the output of 11//2 ? ", 11//2)

What is the output of 11//2 ?  5
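
A short sketch of how `//` and `%` fit together, including with negative numbers. `divmod` is a built-in that returns both results at once; the variable names are illustrative.

```python
# divmod returns the floor quotient and the remainder in one call.
q, r = divmod(11, 2)
print(q, r)

# The two results always reconstruct the original number.
print(q * 2 + r == 11)

# Floor division rounds towards minus infinity, not towards zero.
neg_q = -11 // 2
neg_r = -11 % 2
print(neg_q, neg_r)

# / returns a float even when the division is exact.
exact = 10 / 5
print(exact, type(exact))
```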


# Strings

Check the `*` operation.

In [5]:
print ("What is the output of 'a'+'b' ?  ", 'a'+'b'   )
print ("What is the output of 'a'=='b' ? ", 'a'=='b'  )
print ("What is the output of 'a'<='b' ? ", 'a'<='b'  )
print ("What is the output of 'a'<='A' ? ", 'a'<='A'  )

What is the output of 'a'+'b' ?   ab
What is the output of 'a'=='b' ?  False
What is the output of 'a'<='b' ?  True
What is the output of 'a'<='A' ?  False


In [6]:
print ("What is the output of 'a'*5 ?    ",    'a'*5 )
print ("What is the output of 'a'/5 ? ", 'a'/5 )

What is the output of 'a'*5 ?     aaaaa


TypeError: unsupported operand type(s) for /: 'str' and 'int'

In [7]:
print    (      "What is the output of int('10')/5 ? ", int('10')/5)

What is the output of int('10')/5 ?  2.0


In [9]:
print ( str(9) *4)

9999
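
A common source of surprises is text that looks like a number. The sketch below, with made-up variable names, shows the difference between repeating a string and converting it before doing arithmetic.

```python
price_text = "2.5"   # a number stored as text, e.g. read from a file
quantity = 4

# * on a string repeats it.
repeated = price_text * quantity
print(repeated)

# Convert to float first to get real arithmetic.
total = float(price_text) * quantity
print(total)

# Convert back to str to concatenate with other text.
label = "total: " + str(total)
print(label)
```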


# Conditional Statements

Indentation is used to identify the body of `if`-`else` and other constructs such as `for`, `while`, and function definitions.

Check if a variable x is within the interval $[0,10]$.

In [10]:
x = 33
if x<=10 and x>=0 :
    print ("x is in the interval [0,10]")
    pass # pass does nothing
    pass
    print ("I'm here !")
else :
    print ("x is not in the interval [0,10]")
    pass
    pass


x is not in the interval [0,10]


In [11]:
x = 33
# This is a special compact form
if 0<=x<=10 :
    print ("x is in the interval [0,10]")
else :
    print ("x is not in the interval [0,10]")

x is not in the interval [0,10]


In [12]:
x = 33
if 0<=x<=10 : print ("x is in the interval [0,10]")
else : print ("x is not in the interval [0,10]")

x is not in the interval [0,10]


In [13]:
if x>=0:
    pass
elif x<=10:
    pass
else:
    pass
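
In an `if`/`elif`/`else` chain only the first branch whose condition is true gets executed. A minimal sketch, using a hypothetical helper `classify`:

```python
def classify(x):
    """Say where x lies with respect to the interval [0, 10]."""
    if x < 0:
        return "below"
    elif x <= 10:
        # Reached only when x >= 0, because the first test failed.
        return "inside"
    else:
        return "above"

for value in (-1, 0, 10, 33):
    print(value, classify(value))
```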



# While loops

Nothing new: `while`, `break`, and `continue`.


In [14]:
i = 0
while i<10:
    pass
    if i==8: break
    pass
    print ("This is Iteration N.", i)
    pass
    i += 1

    if i==5: continue
    pass
    pass

print ("I'm out of the loop")

This is Iteration N. 0
This is Iteration N. 1
This is Iteration N. 2
This is Iteration N. 3
This is Iteration N. 4
This is Iteration N. 5
This is Iteration N. 6
This is Iteration N. 7
I'm out of the loop
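
`continue` jumps straight back to the loop condition, so the counter must be updated before it or the loop never ends. A small sketch (the list `printed` is only there to record what happens):

```python
i = 0
printed = []
while i < 10:
    i += 1              # update the counter first
    if i == 8:
        break           # leave the loop entirely
    if i % 2 == 0:
        continue        # skip the rest of the body for even numbers
    printed.append(i)
    print("This is Iteration N.", i)

print("Out of the loop with i =", i)
```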


# For Loops

A `range` is a special tool to create sequences of numbers, given start, end, and step parameters.

In [15]:
for i in range(5):
    print ("This is Iteration N.", i)

This is Iteration N. 0
This is Iteration N. 1
This is Iteration N. 2
This is Iteration N. 3
This is Iteration N. 4


In [16]:
for i in range(0,10,2):
    print ("This is Iteration N.", i)

This is Iteration N. 0
This is Iteration N. 2
This is Iteration N. 4
This is Iteration N. 6
This is Iteration N. 8


In [17]:
for i in range(10,0,-2):
    print ("This is Iteration N.", i)

This is Iteration N. 10
This is Iteration N. 8
This is Iteration N. 6
This is Iteration N. 4
This is Iteration N. 2


In [18]:
print ( range(5) )

range(0, 5)


This is called an **iterable**: it produces its values one at a time when you iterate through it, instead of storing them all in a list.
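
When an actual list is needed, a `range` can be passed to `list`, as in the `list(range(10))` example of the types table. A brief sketch with illustrative names:

```python
# Materialize ranges into lists.
first_five = list(range(5))
countdown = list(range(10, 0, -2))
print(first_five)
print(countdown)

# A range whose start is already past its end is simply empty.
empty = list(range(5, 0))
print(empty)

# Built-ins that accept an iterable can consume a range directly.
total = sum(range(1, 11))
print(total)
```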

# Lists

Lists are used very frequently, and they can be dynamically modified.

In [19]:
for i in [0,1,2,3,4]:
    print ("This is Iteration N.", i)

This is Iteration N. 0
This is Iteration N. 1
This is Iteration N. 2
This is Iteration N. 3
This is Iteration N. 4


In [20]:
my_list = [1,2,3] + [4,5]
print (my_list)

[1, 2, 3, 4, 5]


In [21]:
my_list = [1,2,3]
my_list += [4,5]
print (my_list)

[1, 2, 3, 4, 5]


In [22]:
my_list = [1,2,3] + ["donald duck", 42.0]
print (my_list)

[1, 2, 3, 'donald duck', 42.0]


In [23]:
my_list = [1,2,3] + \
          ["donald duck", ["this", "is", 1, "nested", "list"] ]
print (my_list)

[1, 2, 3, 'donald duck', ['this', 'is', 1, 'nested', 'list']]


In [24]:
print ( len([1,2,3,4,5]) )

5


In [25]:
my_list = [1,2,3,4,5]
print ( my_list[0])
print ( my_list[4])
print ( my_list[5])

1
5


IndexError: list index out of range

In [26]:
my_list = [1,2,3,4,5]
print ( my_list[-1])
print ( my_list[-2])
print ( my_list[-100])

5
4


IndexError: list index out of range

In [27]:
my_list = [1,2,3,4,5,4,3,2,1]

print ( 3 in my_list)

print ( my_list.count(3))

print ( my_list.index(1))
# print ( my_list.index(33)) # this raises an error

True
2
0
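
Besides `+` and `+=`, lists offer methods that modify them in place. The sketch below uses the standard `append`, `extend`, `insert`, `pop`, and `remove` methods on an illustrative list.

```python
colors = ["red", "green"]

colors.append("blue")                 # add one element at the end
colors.extend(["indigo", "violet"])   # add several elements at the end
colors.insert(1, "orange")            # insert before position 1
print(colors)

last = colors.pop()                   # remove and return the last element
colors.remove("green")                # remove the first occurrence of a value
print(last, colors)

# index raises a ValueError for missing values, so test membership first.
position = colors.index("blue") if "blue" in colors else -1
print(position)
```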


# Slicing

Slicing lets you access a sublist.

In [28]:
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo',
           'violet']

print ( my_list[1:3] )

['orange', 'yellow']


In [29]:
print ( my_list[3:-1] )

['green', 'blue', 'indigo']


In [30]:
print ( my_list[3:] )

['green', 'blue', 'indigo', 'violet']


In [31]:
print ( my_list[0:7:2] )

['red', 'yellow', 'blue', 'violet']


In [32]:
print ( my_list[0::2] )

['red', 'yellow', 'blue', 'violet']


In [33]:
print ( my_list[::2] )

['red', 'yellow', 'blue', 'violet']


In [34]:
print ( my_list[::-1] )

['violet', 'indigo', 'blue', 'green', 'yellow', 'orange', 'red']
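
Unlike indexing, a slice that runs past the end of the list does not raise an error; it just returns what is available. The sketch below relies on that to split a list into pieces. The helper name `chunks` is hypothetical.

```python
def chunks(items, size):
    """Split items into consecutive sublists of at most `size` elements."""
    result = []
    for start in range(0, len(items), size):
        # The last slice may be shorter; slicing never goes out of range.
        result.append(items[start:start + size])
    return result

rainbow = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']

tail = rainbow[5:100]
print(tail)

groups = chunks(rainbow, 3)
print(groups)
```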


# Lists are mutable

Elements of a list can be replaced. Sublists can be replaced with other sublists.

In [35]:
# original list
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
print (my_list)

# modify one element
my_list[-2] = 'ultramarine'

# the new list
print (my_list)

['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
['red', 'orange', 'yellow', 'green', 'blue', 'ultramarine', 'violet']


In [36]:
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
my_list[4] = ['light blue', 'blue', 'dark blue']
print (my_list)

['red', 'orange', 'yellow', 'green', ['light blue', 'blue', 'dark blue'], 'indigo', 'violet']


In [37]:
# here we replace one slice with another slice
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
my_list[4:5] = ['light blue', 'blue', 'dark blue']
print (my_list)

['red', 'orange', 'yellow', 'green', 'light blue', 'blue', 'dark blue', 'indigo', 'violet']


In [38]:
# A special case of replacement when start and end index are the same
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
my_list[5:5] = ['dark blue', 'darker blue']
print (my_list)

['red', 'orange', 'yellow', 'green', 'blue', 'dark blue', 'darker blue', 'indigo', 'violet']


In [39]:
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
my_list[2] = []
print (my_list)

['red', 'orange', [], 'green', 'blue', 'indigo', 'violet']


In [40]:
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']
my_list[2:3] = []
print (my_list)

['red', 'orange', 'green', 'blue', 'indigo', 'violet']


In [41]:
my_list = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet']

print ("Is orange in the rainbow?", 'orange' in my_list )

print ("Is brown in the rainbow?", 'brown' in my_list )

print ("Is it true that cobalt is not in the rainbow?", 'cobalt' not in my_list )

Is orange in the rainbow? True
Is brown in the rainbow? False
Is it true that cobalt is not in the rainbow? True


# Tuple

Like lists, but **immutable**.

In [42]:
my_tuple = (1,2,3,4, "five")

print (my_tuple)
print (my_tuple[2])

(1, 2, 3, 4, 'five')
3


In [43]:
my_tuple = (1,2,3) + (4, "five")

print (my_tuple)
print (my_tuple[2])

(1, 2, 3, 4, 'five')
3


In [44]:
my_tuple[2] = 3

TypeError: 'tuple' object does not support item assignment
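
Since a tuple cannot be modified, a "change" means building a new tuple. Immutability also makes tuples usable as dictionary keys, which lists are not. A small sketch with made-up data:

```python
point = (3, 4)

# Build a new tuple instead of assigning to an element.
moved = point[:1] + (10,)
print(point, moved)

# A one-element tuple needs the trailing comma.
single = (42,)
not_a_tuple = (42)
print(type(single), type(not_a_tuple))

# Tuples can be dictionary keys.
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[point])

# A list used as a key raises a TypeError.
try:
    {[3, 4]: 5.0}
    list_key_ok = True
except TypeError:
    list_key_ok = False
print(list_key_ok)
```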

# Unpacking

Multiple assignment, typical of functions returning multiple values.

In [45]:
my_tuple = (1,2,3)
a,b,c = my_tuple
print (a,b,c)

1 2 3


In [46]:
my_list = [1,2,3]
a,b,c = my_list
print (a,b,c)

1 2 3
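
A sketch of the typical use case: a function that returns a tuple which the caller unpacks immediately. `min_max` is a hypothetical helper; the starred form and the swap are standard Python.

```python
def min_max(values):
    """Return the smallest and the largest element as a tuple."""
    return min(values), max(values)

lo, hi = min_max([4, 1, 7])
print(lo, hi)

# A starred name collects all the remaining elements into a list.
first, *rest = [1, 2, 3, 4]
print(first, rest)

# Unpacking also gives a compact way to swap two variables.
a, b = 1, 2
a, b = b, a
print(a, b)
```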


# Sorting

In-place (`list.sort()`) vs. returning a new list (`sorted()`).

In [47]:
my_list = [2,3,1]

my_list.sort()

print (my_list)

[1, 2, 3]


In [48]:
my_list = [2,3,1]

new_list = sorted( my_list )

print (my_list)
print (new_list)

[2, 3, 1]
[1, 2, 3]
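
Both `sort` and `sorted` accept the optional `key` and `reverse` arguments. Note also that `sort` returns `None`, so its result should not be assigned. A brief sketch on an illustrative list:

```python
words = ["pear", "fig", "banana"]

by_length = sorted(words, key=len)        # compare by length
descending = sorted(words, reverse=True)  # reverse alphabetical order
print(by_length)
print(descending)

# sort() works in place and returns None.
result = words.sort()
print(result, words)
```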


# Careful !

**Check** in Python Tutor: http://pythontutor.com/ !

In [49]:
a = 11
b = a
a = 22
print (a,b)

22 11


In [50]:
a = [11]
b = a
a[0] = 22
print (a,b)

[22] [22]


In [51]:
my_list = [1,2,3]
new_list = my_list
new_list[1] = 77

print ( new_list + my_list)

[1, 77, 3, 1, 77, 3]


In [52]:
my_list = [1,2,3] *2
print (my_list)

[1, 2, 3, 1, 2, 3]


In [53]:
my_list = [ [1,2,3] ]*2
print (my_list)

[[1, 2, 3], [1, 2, 3]]


In [54]:
my_list[0]+= [4]
print ( my_list )

[[1, 2, 3, 4], [1, 2, 3, 4]]


In [55]:
my_tuple = (1,2,3)
new_tuple = my_tuple
my_tuple += tuple([77])

print ( new_tuple + my_tuple)

(1, 2, 3, 1, 2, 3, 77)


In [56]:
# if you want to actually copy a list
a = [11]
b = a.copy()
a[0] = 22
print (a,b)

[22] [11]


In [57]:
a = [11]
b = list(a)
a[0] = 22
print (a,b)

[22] [11]


In [58]:
a = [11]
b = a[:]
a[0] = 22
print (a,b)

[22] [11]
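
The three copies above are shallow: they create a new outer list, but nested lists are still shared. The sketch below contrasts an alias, a shallow copy, and `copy.deepcopy` from the standard library, and shows one way to build independent rows. Variable names are illustrative.

```python
import copy

nested = [[1, 2], [3, 4]]
alias = nested                 # same object, just another name
shallow = nested.copy()        # new outer list, shared inner lists
deep = copy.deepcopy(nested)   # everything is duplicated

nested[0][0] = 99              # modify an inner list
nested.append([5, 6])          # modify the outer list

print(alias is nested)         # the alias sees every change
print(shallow)                 # sees the inner change only
print(deep)                    # sees no change at all

# A comprehension creates a separate inner list at each step,
# unlike [[1, 2, 3]] * 2 which repeats the same one.
rows = [[1, 2, 3] for _ in range(2)]
rows[0] += [4]
print(rows)
```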


# Iterating through lists

Or through multiple lists.

In [59]:
my_list = [2,3,1]
for x in my_list:
    print (x)

2
3
1


In [60]:
my_list = [2,3,1]
for i,x in enumerate(my_list):
    print (i,x)

0 2
1 3
2 1


In [61]:
my_list = [2,3,1]
for z in enumerate(my_list):
    print (z, type(z))

(0, 2) <class 'tuple'>
(1, 3) <class 'tuple'>
(2, 1) <class 'tuple'>


In [62]:
A = [2,3,1]
B = ["two", "three", "one"]
for a,b in zip(A,B):
    print (a,b)

2 two
3 three
1 one
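
Two details worth knowing, sketched with illustrative data: `enumerate` accepts a `start` value, and `zip` stops at the end of the shortest input.

```python
names = ["two", "three", "one"]

# Number the elements starting from 1 instead of 0.
numbered = list(enumerate(names, start=1))
print(numbered)

# zip produces tuples; list() collects them.
pairs = list(zip([2, 3, 1], names))
print(pairs)

# With inputs of different length, the extra elements are ignored.
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)
```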


# More about strings

Strings are like lists of characters, but they are immutable.

In [63]:
msg = "I like programming with python!"

In [64]:
print (msg[2])

l


In [65]:
print (msg[2:6])

like


In [66]:
msg[3] = "x"

TypeError: 'str' object does not support item assignment

In [67]:
for c in msg:
    print (c)

I
 
l
i
k
e
 
p
r
o
g
r
a
m
m
i
n
g
 
w
i
t
h
 
p
y
t
h
o
n
!


In [68]:
print (msg.split())

['I', 'like', 'programming', 'with', 'python!']


In [69]:
print (msg.split("i"))

['I l', 'ke programm', 'ng w', 'th python!']


In [70]:
# Remove leading and trailing whitespace

my_string = "     A Bit of Python \n"

print ( "---", my_string, "---", sep="" )
print ( "---", my_string.strip(), "---", sep="" )

---     A Bit of Python 
---
---A Bit of Python---


In [71]:
# Remove leading and trailing characters of choice

my_string = "###!#!#!##!#A Bit of Python?!!???##"

print ( "---", my_string.strip("#"), "---", sep="" )
print ( "---", my_string.strip("#?"), "---", sep="" )
print ( "---", my_string.strip("!?#"), "---", sep="" )

---!#!#!##!#A Bit of Python?!!???---
---!#!#!##!#A Bit of Python?!!---
---A Bit of Python---
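
Because strings are immutable, string methods return new strings and leave the original untouched. A short sketch combining `split`, `join`, `strip`, `lower`, and `replace` (all standard `str` methods):

```python
msg = "I like programming with python!"

words = msg.split()
rejoined = " ".join(words)     # join is the inverse of split
print(rejoined == msg)

# Clean each word: drop punctuation at the ends and lowercase it.
cleaned = [w.strip("!?.,").lower() for w in words]
print(cleaned)

# replace returns a modified copy; msg itself does not change.
replaced = msg.replace("python", "Python")
print(replaced)
print(msg)
```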


# Sets

The mathematical notion of set.

In [72]:
my_set = set([1,2,3,4,5,4,3,2,1])

print (my_set)

{1, 2, 3, 4, 5}


In [73]:
A = set([1,2,3])
B = set([4,5])
C = A | B

print (C)

{1, 2, 3, 4, 5}


In [74]:
A = set([1,2,3])
B = set([3,4,5])
C = A & B

print (C)

{3}


In [75]:
A = set([1,2,3])
B = set([3,4,5])
C = A - B

print (C)

{1, 2}


In [76]:
A = set([1,2,3])
B = set([3,4,5])

print (1 in A)
print (7 not in A)

True
True
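
A few more set operations, sketched on the same kind of data: symmetric difference, subset test, adding elements, and removing duplicates from a list.

```python
A = {1, 2, 3}
B = {3, 4, 5}

# Elements that belong to exactly one of the two sets.
sym_diff = A ^ B
print(sym_diff)

# Subset test.
is_subset = {1, 2} <= A
print(is_subset)

# Adding an element twice has no further effect.
A.add(4)
A.add(4)
print(A)

# Sets have no order; use sorted() to get a deduplicated, ordered list.
unique_sorted = sorted(set([3, 1, 3, 2, 1]))
print(unique_sorted)
```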


# Dictionaries

A dictionary is a map from keys to values.

In [77]:
my_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
           7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}

print (my_dict[0])

KeyError: 0

In [78]:
my_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
           7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}

print (my_dict[1])
print (my_dict[12])

Jan
Dec


In [79]:
my_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
           7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}

my_dict[1] = 777
del my_dict[12]
print (my_dict)


{1: 777, 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov'}


In [80]:
my_dict[8474] = "claudio"
print (my_dict)

{1: 777, 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 8474: 'claudio'}


In [81]:
print (my_dict.keys())

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 8474])


In [82]:
print (my_dict.values())

dict_values([777, 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'claudio'])


In [83]:
my_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
           7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}

for k in my_dict:
    print (k)

1
2
3
4
5
6
7
8
9
10
11
12


In [84]:
my_dict = {1:"Jan", 2:"Feb", 3:"Mar", 4:"Apr", 5:"May", 6:"Jun",
           7:"Jul", 8:"Aug", 9:"Sep", 10:"Oct", 11:"Nov", 12:"Dec"}

for k,v in my_dict.items():
    print (k,v)

1 Jan
2 Feb
3 Mar
4 Apr
5 May
6 Jun
7 Jul
8 Aug
9 Sep
10 Oct
11 Nov
12 Dec
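
To avoid a `KeyError` on missing keys, test membership with `in` or use `get` with a default value. A typical use is counting, sketched here on a made-up sentence:

```python
words = "the quick brown fox jumps over the lazy dog the end".split()

counts = {}
for w in words:
    # get returns 0 the first time a word is seen.
    counts[w] = counts.get(w, 0) + 1

print(counts["the"])
print("cat" in counts)
print(counts.get("cat", 0))

# The key with the largest value.
most_common = max(counts, key=counts.get)
print(most_common)
```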


# Comprehensions

Creating lists by iterating through other lists.

In [85]:
my_list = [ x**2 for x in range(10) ]
print (my_list)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]


In [86]:
my_list = [ x**2 for x in range(10) if x%2==0 ]
print (my_list)

[0, 4, 16, 36, 64]


In [87]:
my_dict = { x:x**2 for x in range(10) if x%2==0 }
print (my_dict)

{0: 0, 2: 4, 4: 16, 6: 36, 8: 64}
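
The same syntax also builds sets, supports more than one `for`, and can use a conditional expression to choose each value. A brief sketch with illustrative data:

```python
# Set comprehension: duplicates collapse automatically.
lengths = {len(w) for w in ["spam", "eggs", "ham"]}
print(lengths)

# Two for clauses behave like nested loops, the leftmost being the outer one.
pairs = [(x, y) for x in range(3) for y in range(x)]
print(pairs)

# A conditional expression chooses the value; it does not filter.
labels = ["even" if x % 2 == 0 else "odd" for x in range(4)]
print(labels)
```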


# Functions

Do not write code outside functions!

Careful when passing lists as parameters.

You can return lists, tuples, sets, dictionaries.

Get used to moving *stable* functions into a separate module, e.g., `module.py`.

In [88]:
def square(x):
    return x**2

print ( square(3) )

9


In [89]:
def powers(x,n):
    return [ x**i for i in range(n) ]

print ( powers(2,5) )

[1, 2, 4, 8, 16]


In [90]:
copy_f = powers

print ( copy_f(2,5) )

[1, 2, 4, 8, 16]


In [91]:
powers_3 = lambda x : powers(x,3)

print (powers_3(5))

[1, 5, 25]


In [92]:
a = [-6, 1,-2,3,-4,5]

print (sorted(a))

print (sorted(a, key=lambda x: abs(x) ))

[-6, -4, -2, 1, 3, 5]
[1, -2, 3, -4, 5, -6]


In [93]:
def add1(x):
    x+=1
    return x

y = 10
z = add1(y)
print( y,z )

10 11


In [94]:
def add1(x):
    for i in range(len(x)):
        x[i] = x[i]+1
    return x


y = [1,2,3,4,5]
z = add1(y)
print( y,z )

[2, 3, 4, 5, 6] [2, 3, 4, 5, 6]


In [95]:
def myfun (a, b=3, c = 77):
    print (a,b,c)

myfun(10)
myfun(10,20)
myfun(10, c=99)

10 3 77
10 20 77
10 3 99
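
Two ways to stay safe with lists and functions, sketched with hypothetical helper names: return a new list instead of modifying the argument, and use `None` rather than a list as a default value, because default values are created only once.

```python
def add1_copy(values):
    """Return a new list; the caller's list is left untouched."""
    return [v + 1 for v in values]

y = [1, 2, 3]
z = add1_copy(y)
print(y, z)

def append_item(item, bucket=None):
    """Append item to bucket, creating a fresh list when none is given."""
    if bucket is None:
        bucket = []        # a new list at every call
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)
print(first, second)
```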


# JSON

JavaScript Object Notation, very popular in Web APIs.

In [96]:
import json

a  = {"key": 10}

# to string
s = json.dumps(a)

print (type(s))
print (s)


<class 'str'>
{"key": 10}


In [97]:
a = json.loads('{"key": 10}')

print (type(a))
print (a['key'])

<class 'dict'>
10
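
JSON has fewer types than Python, so a round trip through `json.dumps` and `json.loads` does not always give back the same object: tuples become lists and non-string keys become strings. A small sketch on a made-up record:

```python
import json

record = {"name": "spam", "tags": ("a", "b"), 1: None, "ok": True}

text = json.dumps(record)
print(text)

back = json.loads(text)
print(back)

# The tuple came back as a list and the integer key as a string.
print(back["tags"], "1" in back)
print(back == record)
```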


# Files

Check how to iterate through a file, and how to run shell commands in Jupyter.

In [98]:
out_file = open("test.txt", "w")
out_file.write("line 1\n")
out_file.write("line 2\n")
out_file.close()

In [99]:
!head test.txt

line 1
line 2


In [100]:
out_file = open("test.txt", "w")
print ("line 11", file=out_file)
print ("line 22", file=out_file)
out_file.close()

In [101]:
in_file = open("test.txt", "r")
line = in_file.readline()
print ( line )
line = in_file.readline()
print ( line )

in_file.close()

line 11

line 22



In [102]:
with open("test.txt", "r") as in_file:
    line = in_file.readline()
    print ( line )

line 11



In [103]:
with open("test.txt", "r") as in_file:
    for line in in_file:
        print ( line )

line 11

line 22



In [104]:
with open("test.txt", "r") as in_file:
    for line in in_file:
        print ( "**" + line + "** ")

**line 11
** 
**line 22
** 


In [105]:
with open("test.txt", "r") as in_file:
    for line in in_file:
        print (line, end="")

line 11
line 22
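
Each line read from a file still ends with its newline character, which is why `print` produced blank lines above. The sketch below strips it while reading, and also shows append mode (`"a"`) and `read()`. It writes into a temporary directory, which is an assumption made here to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")

    with open(path, "w") as out_file:
        out_file.write("line 11\n")
        out_file.write("line 22\n")

    # Read all lines, removing the trailing newline of each one.
    with open(path, "r") as in_file:
        lines = [line.rstrip("\n") for line in in_file]

    # "a" appends to the file instead of overwriting it.
    with open(path, "a") as out_file:
        out_file.write("line 33\n")

    # read() returns the whole content as a single string.
    with open(path, "r") as in_file:
        content = in_file.read()

print(lines)
print(content, end="")
```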


# JSON

In [106]:
import json

a  = {"key": 10}

# to file
with open("test.txt", "w") as out_file:
    json.dump(a,out_file)

!cat test.txt

{"key": 10}

In [107]:
with open("test.txt", "r") as in_file:
    b = json.load(in_file)
    print (type(b))
    print (b)

<class 'dict'>
{'key': 10}
