### Data Science Overview
Reference: Doing Data Science, by Cathy O'Neil and Rachel Schutt (Chapter 1)
http://proquest.safaribooksonline.com.proxy.lib.odu.edu/book/databases/9781449363871

Two factors are driving the emergence of data science as a discipline:

-  Massive amounts of data being generated: Internet (shopping, browsing, etc.), finance, medical, pharmaceuticals, bioinformatics, social welfare, government, education, etc.

-  Availability of cheap computing power

This is resulting in tools and products such as Amazon's recommendation system (built by analysing online shopping data), friend recommendations on Facebook, and film and music recommendations. In finance, it means credit ratings, trading algorithms, and models.


#### Data Science is an emerging field that draws on topics from different disciplines:

-  Computer science
-  Math and Statistics
-  Machine learning
-  Application Domain
-  Data visualization, including communication and presentation


#### Data Science Process
![image.png](attachment:image.png)

### Introduction to Notebook and Python Basics 

### References 
<a href ="http://jupyter-notebook.readthedocs.io/en/latest/notebook.html"> Notebook Basics </a><br>
<a href="https://docs.python.org/3/tutorial/"> Python Tutorial </a><br>
<a href="http://www.python-course.eu/python3_lambda.php"> Python Tutorial Lambda </a><br>

<h3> Cell Types</h3> 
When you open a new notebook, you first select the environment you will be using. In our case it is Python 3.6. The notebook has two main types of cells:
Markdown cells and code cells.
In a Markdown cell you can type plain text, HTML, and LaTeX.

HTML Example:
<ul> 
<li> first  item</li>
<li> second item</li>
</ul>


LaTeX Example:

$e^x=\sum_{i=0}^\infty \frac{1}{i!}x^i$

In a code cell you can type Python code and execute it by pressing Shift+Enter. A simple use of Python is as a calculator.

In [1]:
2+3

5

You can assign values to variables and do operations.

In [3]:
a = 3
b = 5
c = a + b
print("value of c is: ", c)

value of c is:  8


Exercise 0 (no submission required): Complete the user interface tour under Help and review the keyboard shortcuts.

In [5]:
3+2

5

### List
A list is a sequence of comma-separated values (items) between square brackets. The items of a list are ordered and can be accessed by index.

In [7]:
squares = [1, 4, 9, 16, 25]
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
print(letters[2:5])

['c', 'd', 'e']


In [8]:
csquares = squares
print(csquares)

[1, 4, 9, 16, 25]


In [9]:
squares[0] = 9999
print(csquares)

[9999, 4, 9, 16, 25]


In [10]:
print(squares)

[9999, 4, 9, 16, 25]


In [20]:
squares = [1, 4, 9, 16, 25]
csquares = list(squares) # make a copy
csquares[0] = 9999
print(csquares)
print(squares)

[9999, 4, 9, 16, 25]
[1, 4, 9, 16, 25]
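
The cells above show that `csquares = squares` only binds a second name to the same list, while `list(squares)` builds a new list. As a minimal sketch, the `is` operator can be used to check which case applies (the names `original`, `alias`, and `copied` are illustrative and not part of the notebook):

```python
original = [1, 4, 9, 16, 25]
alias = original        # a second name for the same list object
copied = original[:]    # slicing builds a new (shallow) copy, like list(original)

print(alias is original)    # True: both names refer to one object
print(copied is original)   # False: a separate list object
print(copied == original)   # True: the contents are equal

original[0] = 9999
print(alias[0], copied[0])  # 9999 1
```

Both `list(...)` and slicing make shallow copies: the outer list is new, but any nested lists inside it would still be shared.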


It is possible to nest lists (create lists containing other lists), for example:

In [22]:
x = [['a', 'b', 'c'], [1, 2, 3]]
x[1]

[1, 2, 3]
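
Elements of a nested list are reached by chaining indices: the first index selects the inner list and the second selects an item inside it. A short sketch using the same `x` as above (the name `flat` is an illustrative addition):

```python
x = [['a', 'b', 'c'], [1, 2, 3]]

print(x[0][1])    # b  (second item of the first inner list)
print(len(x))     # 2  (x holds two inner lists)
print(len(x[0]))  # 3  (the first inner list holds three items)

# Flatten the nested list into a single list with two loops
flat = []
for inner in x:
    for item in inner:
        flat.append(item)
print(flat)       # ['a', 'b', 'c', 1, 2, 3]
```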

#### List methods

In [23]:
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
fruits.count('apple')

2

In [25]:
fruits.count('tangerine')

0

In [26]:
fruits.index('banana')

3

In [28]:
fruits.index('banana', 4)  # Find next banana starting at position 4

6

In [29]:
fruits.reverse()
fruits

['banana', 'apple', 'kiwi', 'banana', 'pear', 'apple', 'orange']

In [30]:
fruits.append('grape')
fruits

['banana', 'apple', 'kiwi', 'banana', 'pear', 'apple', 'orange', 'grape']

In [31]:
fruits.sort()
fruits

['apple', 'apple', 'banana', 'banana', 'grape', 'kiwi', 'orange', 'pear']

In [32]:
sorted(fruits)

['apple', 'apple', 'banana', 'banana', 'grape', 'kiwi', 'orange', 'pear']

In [33]:
sorted(fruits, reverse=True)

['pear', 'orange', 'kiwi', 'grape', 'banana', 'banana', 'apple', 'apple']

In [34]:
sorted(fruits, key=len)

['kiwi', 'pear', 'apple', 'apple', 'grape', 'banana', 'banana', 'orange']

In [35]:
def lastc(s): return s[-1] # simple function, more about functions later

sorted(fruits,key=lastc)

['banana', 'banana', 'apple', 'apple', 'grape', 'orange', 'kiwi', 'pear']
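
The cells above use both `fruits.sort()` and `sorted(fruits)`. The difference is that the `sort()` method reorders the list in place and returns `None`, while `sorted()` leaves its argument untouched and returns a new list. A small sketch with a shortened list (the names `new_list` and `result` are illustrative):

```python
fruits = ['orange', 'apple', 'pear']

new_list = sorted(fruits)   # returns a new sorted list
print(fruits)               # ['orange', 'apple', 'pear']  (unchanged)
print(new_list)             # ['apple', 'orange', 'pear']

result = fruits.sort()      # sorts in place and returns None
print(result)               # None
print(fruits)               # ['apple', 'orange', 'pear']
```

A common mistake is writing `fruits = fruits.sort()`, which replaces the list with `None`.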

Exercise 1: Write a Python program to get a list of tuples sorted in increasing order by the last element 
in each tuple

Sample List : [(2, 5), (1, 2), (4, 4), (2, 3), (2, 1)]

Expected output : [(2, 1), (1, 2), (2, 3), (4, 4), (2, 5)]

In [38]:
list1 = [(2, 5), (1, 2), (4, 4), (2, 3), (2, 1)]
#list1.sort()
sorted(list1,key=lastc)

[(2, 1), (1, 2), (2, 3), (4, 4), (2, 5)]

In [39]:
a=(2,3)
a[-1]

3

#### Using Lists as Stacks
The list methods make it very easy to use a list as a stack, where the last element added is the first element retrieved (“last-in, first-out”). To add an item to the top of the stack, use append(). To retrieve an item from the top of the stack, use pop() without an explicit index. 

In [40]:
stack = [3, 4, 5]
stack.append(6)
stack.append(7)
stack

[3, 4, 5, 6, 7]

In [41]:
stack.pop()

7

In [42]:
stack

[3, 4, 5, 6]

In [43]:
stack.pop()

6

In [44]:
stack.pop()

5

In [45]:
stack

[3, 4]

#### Using Lists as Queues

To implement a queue, use collections.deque which was designed to have fast appends and pops from both ends. 

In [11]:
from collections import deque
queue = deque(["Eric", "John", "Michael"])
queue.append("Terry")           # Terry arrives
queue.append("Graham")          # Graham arrives
queue

deque(['Eric', 'John', 'Michael', 'Terry', 'Graham'])

In [47]:
queue.popleft()                 # The first to arrive now leaves

'Eric'

In [48]:
queue.popleft()                 # The second to arrive now leaves

'John'

In [49]:
queue                           # Remaining queue in order of arrival

deque(['Michael', 'Terry', 'Graham'])
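
Because a deque is efficient at both ends, it also offers `appendleft()` and `pop()` in addition to the `append()` and `popleft()` calls used above. The sketch below shows these, and contrasts them with `list.pop(0)`, which works but has to shift every remaining element (the names `last`, `plain`, and `first` are illustrative):

```python
from collections import deque

queue = deque(["Michael", "Terry", "Graham"])
queue.appendleft("Eric")   # add to the front of the queue
last = queue.pop()         # remove from the back of the queue
print(last)                # Graham
print(queue)               # deque(['Eric', 'Michael', 'Terry'])

# A plain list can also serve from the front, but pop(0) is slow for
# long lists because all remaining items move one position to the left.
plain = ["Eric", "John", "Michael"]
first = plain.pop(0)
print(first, plain)        # Eric ['John', 'Michael']
```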

#### For Statement

In [50]:
words = ['cat', 'window', 'defenestrate']
for w in words:
    print(w, len(w))  # note the indentation

cat 3
window 6
defenestrate 12


In [51]:
for i in range(5):
    print(i)

0
1
2
3
4


In [52]:
for i in range(4,9,2):
    print(i)

4
6
8


In [55]:
a,b =3,4
b

4

<h3>Functions</h3>

In [186]:
def fib(n):
    """Return a list containing the Fibonacci series up to n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)    
        a, b = b, a+b
    return result

fib(100)

[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]

Exercise 2: Define a python function filter_list(A, t) that takes a list of integers A, and scalar t as input and returns a list with integers in A greater than t. Test your function using A = [7, 9, 3, 11, 5, 19, 34, 21], and t = 10

In [37]:
def filter_list(A, t):
    result = []
    for k in A:
        if k > t:
            result.append(k)
    return result
    
A = [7,9,3,11,5,19,34,21]
t = 10
filter_list(A, t)

[11, 19, 34, 21]

#### Creating list

In [85]:
squares = []
for x in range(10):
    squares.append(x**2)

squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [83]:
squares = [x**2 for x in range(10)]
squares

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [38]:
[(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]

[(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
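
A comprehension with two `for` clauses and an `if` reads left to right like nested loops. The sketch below spells out the loop form of the comprehension above (the name `pairs` is an illustrative addition):

```python
pairs = []
for x in [1, 2, 3]:          # first for clause: outer loop
    for y in [3, 1, 4]:      # second for clause: inner loop
        if x != y:           # trailing if: filter
            pairs.append((x, y))

print(pairs)
# [(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
```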

In [91]:
vec = [-4, -2, 0, 2, 4]
# create a new list with the values doubled
[x*2 for x in vec]

[-8, -4, 0, 4, 8]

In [92]:
# filter the list to exclude negative numbers
[x for x in vec if x >= 0]

[0, 2, 4]

In [93]:
# apply a function (abs, the absolute value) to all the elements
[abs(x) for x in vec]

[4, 2, 0, 2, 4]

In [95]:
# call a method on each element
freshfruit = ['  banana', '  loganberry ', 'passion fruit  ']
[weapon.strip() for weapon in freshfruit]

['banana', 'loganberry', 'passion fruit']

In [94]:
# create a list of 2-tuples like (number, square)
[(x, x**2) for x in range(6)]

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]

In [103]:
x = [1, 2, 3]
y = [4, 5, 6]
zipped = zip(x, y)
print(zipped)
print(list(zipped))

<zip object at 0x110cf4348>
[(1, 4), (2, 5), (3, 6)]
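
As the `<zip object ...>` output suggests, `zip()` in Python 3 returns an iterator rather than a list. An iterator can only be consumed once, which the following sketch illustrates (the names `first_pass` and `second_pass` are illustrative):

```python
x = [1, 2, 3]
y = [4, 5, 6]
zipped = zip(x, y)

first_pass = list(zipped)    # consumes the iterator
second_pass = list(zipped)   # nothing left to iterate over

print(first_pass)    # [(1, 4), (2, 5), (3, 6)]
print(second_pass)   # []
```

If the pairs are needed more than once, store `list(zip(x, y))` in a variable and reuse that list.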


In [97]:
# the tuple must be parenthesized; [x, x**2 for x in range(6)] raises a SyntaxError
[(x, x**2) for x in range(6)]

[(0, 0), (1, 1), (2, 4), (3, 9), (4, 16), (5, 25)]

Exercise 3: Write a function front_x(words) that takes a list of strings as input and returns a list with the strings in sorted order, except that all the strings beginning with 'x' are grouped first.

For example:  ['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] yields  ['xanadu', 'xyz', 'aardvark', 'apple', 'mix']
Hint: this can be done by making 2 lists and sorting each of them  before combining them.

In [44]:
def front_x(words):
    x_list=[]
    other_list=[]
    for w in words:
        if w.startswith('x'):
            x_list.append(w)
        else:
            other_list.append(w)
    return sorted(x_list)+sorted(other_list)
    
words=['mix', 'xyz', 'apple', 'xanadu', 'aardvark'] 
front_x(words)
               





['xanadu', 'xyz', 'aardvark', 'apple', 'mix']

### Sets
Python also includes a data type for sets. A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.
Curly braces or the set() function can be used to create sets. Note: to create an empty set you have to use set(), not {}; the latter creates an empty dictionary.
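
A quick way to see why `set()` is required for an empty set is to check the types: empty braces always produce a dictionary. A minimal sketch (the names `empty_set` and `empty_dict` are illustrative):

```python
empty_set = set()
empty_dict = {}

print(type(empty_set))    # <class 'set'>
print(type(empty_dict))   # <class 'dict'>

empty_set.add('apple')    # sets grow with add()
empty_set.add('apple')    # adding a duplicate has no effect
print(empty_set)          # {'apple'}
```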

In [104]:
basket = {'apple', 'orange', 'apple', 'pear', 'orange', 'banana'}
print(basket)                      # show that duplicates have been removed

{'apple', 'pear', 'orange', 'banana'}


In [105]:
'orange' in basket                 # fast membership testing

True

In [106]:
'crabgrass' in basket

False

In [107]:
# Demonstrate set operations on unique letters from two words
s1 = set('abracadabra')
s2 = set('alacazam')
print(s1)                                  # unique letters in s1
print(s2)                                  # unique letters in s2

{'r', 'd', 'b', 'a', 'c'}
{'z', 'l', 'm', 'a', 'c'}


In [108]:
s1 - s2                              # letters in s1 but not in s2

{'b', 'd', 'r'}

In [109]:
s1 | s2                              # letters in s1 or s2 or both

{'a', 'b', 'c', 'd', 'l', 'm', 'r', 'z'}

In [110]:
s1 & s2                              # letters in both s1 and s2

{'a', 'c'}

In [111]:
s1 ^ s2                              # letters in s1 or s2 but not both

{'b', 'd', 'l', 'm', 'r', 'z'}

Exercise 4: Consider the following list
nlist = [1,2,3,2,1,5,6,4,8,5,4]
Write python code to create a new list that does not contain duplicates.


In [115]:
nlist = [1,2,3,2,1,5,6,4,8,5,4]
print(nlist)
s = sorted(set(nlist))   # set() drops the duplicates; sorted() returns a list
print(s)

[1, 2, 3, 2, 1, 5, 6, 4, 8, 5, 4]
[1, 2, 3, 4, 5, 6, 8]
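
A set does not remember the order in which items first appeared. If that order matters for Exercise 4, one possible alternative is `dict.fromkeys()`, sketched below. This relies on dictionaries preserving insertion order, which is guaranteed by the language from Python 3.7 onward (in Python 3.6 it is an implementation detail of CPython). The name `unique_in_order` is illustrative:

```python
nlist = [1, 2, 3, 2, 1, 5, 6, 4, 8, 5, 4]

# Each list item becomes a dictionary key; repeated keys are ignored,
# and the keys keep the order in which they were first seen.
unique_in_order = list(dict.fromkeys(nlist))

print(unique_in_order)   # [1, 2, 3, 5, 6, 4, 8]
```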


### Dictionary
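A dictionary is a collection of key: value pairs, with the requirement that the keys are unique. Values are looked up by key rather than by position. (From Python 3.7 onward, dictionaries also preserve insertion order.)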

In [116]:
tel = {'sape': 4139, 'guido': 4127, 'jack': 4098}
print (tel['guido'])

4127


In [117]:
print (tel.keys())

dict_keys(['sape', 'guido', 'jack'])


In [118]:
print (tel.values())

dict_values([4139, 4127, 4098])


In [119]:
print (tel.items())

dict_items([('sape', 4139), ('guido', 4127), ('jack', 4098)])


In [120]:
tel = {'sape': 4139, 'guido': 4127, 'jack': 4098}
for k, v in tel.items():
    print (k,v)


sape 4139
guido 4127
jack 4098


In [121]:
for key in tel:
    print( key, tel[key])

sape 4139
guido 4127
jack 4098


In [122]:
# print keys, values in sorted order (by key) 
for key in sorted(tel.keys()):
    print(key, tel[key])

guido 4127
jack 4098
sape 4139


#### Deletion 

In [123]:
# Deleting an item
print(tel)
del tel['guido']
print(tel)

{'sape': 4139, 'guido': 4127, 'jack': 4098}
{'sape': 4139, 'jack': 4098}
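
Besides `del`, a few other everyday dictionary operations are worth knowing: assigning to a key adds or updates an entry, `in` tests for a key, `get()` looks up a key without raising an error, and `pop()` removes a key and returns its value. A short sketch (the key `'irv'` and the new numbers are made up for illustration):

```python
tel = {'sape': 4139, 'jack': 4098}

tel['irv'] = 4127             # add a new key: value pair
tel['jack'] = 4099            # assigning to an existing key updates it

print('guido' in tel)         # False
print(tel.get('guido'))       # None (tel['guido'] would raise a KeyError)
print(tel.get('guido', 0))    # 0, the default supplied as second argument

removed = tel.pop('sape')     # remove the key and return its value
print(removed)                # 4139
print(sorted(tel.items()))    # [('irv', 4127), ('jack', 4099)]
```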


Exercise 5: Write a program to create a dictionary from two lists: 
tel0 = ['sape', 'guido', 'jack']
tel1 = [4139, 4127, 4098]
The dictionary should look like this:
{'sape': 4139, 'guido': 4127, 'jack': 4098}

In [125]:
tel0 = ['sape', 'guido', 'jack']
tel1 = [4139, 4127, 4098]
dict_tel=dict (zip(tel0, tel1))

print(dict_tel)


{'sape': 4139, 'guido': 4127, 'jack': 4098}


#### Functions: Default Argument Values

In [126]:
def ask_ok(prompt, retries=4, reminder='Please try again!'):
    while True:
        ok = input(prompt)
        if ok in ('y', 'ye', 'yes'):
            return True
        if ok in ('n', 'no', 'nop', 'nope'):
            return False
        retries = retries - 1
        if retries < 0:
            raise ValueError('invalid user response')
        print(reminder)

In [130]:
ask_ok('Do you really want to quit?')

Do you really want to quit?n


False

In [128]:
ask_ok('OK to overwrite the file?', 2)

OK to overwrite the file?2
Please try again!
OK to overwrite the file?y


True

In [132]:
ask_ok('OK to overwrite the file?', 2, 'Come on, only yes or no!')

OK to overwrite the file?y


True

#### Keyword Arguments
Functions can also be called using keyword arguments of the form kwarg=value. For instance, the following function:

In [4]:
def parrot(voltage, state='a stiff', action='voom', type='Norwegian Blue'):
    print("-- This parrot wouldn't", action, end=' ')
    print("if you put", voltage, "volts through it.")
    print("-- Lovely plumage, the", type)
    print("-- It's", state, "!")

accepts one required argument (voltage) and three optional arguments (state, action, and type). This function can be called in any of the following ways:

In [5]:
parrot(1000)                                          # 1 positional argument
parrot(voltage=1000)                                  # 1 keyword argument
parrot(voltage=1000000, action='VOOOOOM')             # 2 keyword arguments
parrot(action='VOOOOOM', voltage=1000000)             # 2 keyword arguments
parrot('a million', 'bereft of life', 'jump')         # 3 positional arguments
parrot('a thousand', state='pushing up the daisies')  # 1 positional, 1 keyword

-- This parrot wouldn't voom if you put 1000 volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's a stiff !
-- This parrot wouldn't voom if you put 1000 volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's a stiff !
-- This parrot wouldn't VOOOOOM if you put 1000000 volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's a stiff !
-- This parrot wouldn't VOOOOOM if you put 1000000 volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's a stiff !
-- This parrot wouldn't jump if you put a million volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's bereft of life !
-- This parrot wouldn't voom if you put a thousand volts through it.
-- Lovely plumage, the Norwegian Blue
-- It's pushing up the daisies !


In [135]:
parrot()                     # required argument missing

TypeError: parrot() missing 1 required positional argument: 'voltage'

In [136]:
parrot(voltage=5.0, 'dead')  # non-keyword argument after a keyword argument

SyntaxError: positional argument follows keyword argument (<ipython-input-136-a03c61c948f4>, line 1)

In [137]:
parrot(110, voltage=220)     # duplicate value for the same argument

TypeError: parrot() got multiple values for argument 'voltage'

In [2]:
parrot(actor='John Cleese')  # unknown keyword argument

TypeError: parrot() got an unexpected keyword argument 'actor'

#### Packing/Unpacking Argument Lists

In [1]:
def cheeseshop1(kind, *arguments):
    print("-- Do you have any", kind, "?")
    print("-- I'm sorry, we're all out of", kind)
    for arg in arguments:
        print(arg)

In [140]:
cheeseshop1("Limburger", "It's very runny, sir.",
           "It's really very, VERY runny, sir.")

-- Do you have any Limburger ?
-- I'm sorry, we're all out of Limburger
It's very runny, sir.
It's really very, VERY runny, sir.


In [142]:
def cheeseshop2(kind, **keywords):
    print("-- Do you have any", kind, "?")
    print("-- I'm sorry, we're all out of", kind)
    for kw in keywords:
        print(kw, ":", keywords[kw])

In [143]:
cheeseshop2("Limburger", 
           shopkeeper="Michael Palin",
           client="John Cleese",
           sketch="Cheese Shop Sketch")

-- Do you have any Limburger ?
-- I'm sorry, we're all out of Limburger
shopkeeper : Michael Palin
client : John Cleese
sketch : Cheese Shop Sketch


In [146]:
def cheeseshop(kind, *arguments, **keywords):
    print("-- Do you have any", kind, "?")
    print("-- I'm sorry, we're all out of", kind)
    for arg in arguments:
        print(arg)
    print("-" * 40)
    for kw in keywords:
        print(kw, ":", keywords[kw])

In [147]:
cheeseshop("Limburger", "It's very runny, sir.",
           "It's really very, VERY runny, sir.",
           shopkeeper="Michael Palin",
           client="John Cleese",
           sketch="Cheese Shop Sketch")

-- Do you have any Limburger ?
-- I'm sorry, we're all out of Limburger
It's very runny, sir.
It's really very, VERY runny, sir.
----------------------------------------
shopkeeper : Michael Palin
client : John Cleese
sketch : Cheese Shop Sketch


Exercise 6: Write a function flexible_sum(*args), which can be called with a different number of arguments each time:
flexible_sum(1,5,7,9) that returns 22, and flexible_sum(100,400) that returns 500

In [148]:
def flexible_sum(*args):
    total = 0   # avoid the name sum, which would shadow the built-in sum()
    for arg in args:
        total = total + arg
    return total
print(flexible_sum(1,5,7,9))
print(flexible_sum(100,400))
    

22
500


#### More on Unpacking Argument Lists

In [73]:
def pfun(x,y,z):
    print(x,y,z)

In [76]:
t1 = [10, 20, 30]
pfun(t1)

TypeError: pfun() missing 2 required positional arguments: 'y' and 'z'

In [153]:
pfun(*t1)  # *t1 is unpacking from the list t1

10 20 30


In [154]:
print(t1)  # print list
print(*t1)  # print unpacked list

[10, 20, 30]
10 20 30


In [155]:
t1 = list(range(3, 6))  # normal call with separate arguments
print(t1)

[3, 4, 5]


In [156]:
args = [3, 6]
list(range(*args))

[3, 4, 5]

In [157]:
def zprint(*args):
    for x,y in args:
        print(x, y)
    

In [158]:
a1 = [0, 1, 2, 3]
a2 = [4, 5, 6, 7]
zprint(*zip(a1,a2))

0 4
1 5
2 6
3 7


In [159]:
def zprintn(args):
    for x in args:
        print(x[0],x[1])

In [160]:
print(*zip(a1,a2))
print(list(zip(a1,a2)))
zprintn(list(zip(a1,a2)))

(0, 4) (1, 5) (2, 6) (3, 7)
[(0, 4), (1, 5), (2, 6), (3, 7)]
0 4
1 5
2 6
3 7
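
The examples above unpack lists with `*`. In the same way, a dictionary can be unpacked into keyword arguments with `**`. The sketch below assumes a small hypothetical function `describe` and dictionary `people`, neither of which is part of the notebook:

```python
def describe(kind, shopkeeper="unknown", client="unknown"):
    # Hypothetical helper used only to illustrate ** unpacking
    return "{}: {} serves {}".format(kind, shopkeeper, client)

people = {"shopkeeper": "Michael Palin", "client": "John Cleese"}

# **people expands to shopkeeper="Michael Palin", client="John Cleese"
message = describe("Limburger", **people)
print(message)   # Limburger: Michael Palin serves John Cleese
```

The dictionary keys must match the parameter names of the function; an unknown key raises a TypeError, just like an unknown keyword argument.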


<h3>Lambda Functions</h3>
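Lambda functions are small anonymous functions (they are not given a name with def). They are limited to a single expression and are convenient for short, one-off operations.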

In [47]:
# traditional function
def f(x):
    return x**2
print (f(4))

16


In [48]:
# anonymous function using lambda construct
g = lambda x: x**2  # a single expression: for input x, compute x**2
g(4)

16

In [49]:
# another example of anonymous function
g = lambda x,y: x**2 + y**2
g(4,3)

25

Standard functions filter(), map(), and reduce()

In [50]:
list_a = [1, 2, 3, 4, 5, 6, 7, 8]
# the built-in function filter takes two arguments: a function and an iterable
# the lambda function is called for each element of the list, and filter keeps
# those elements for which the lambda returns True (here, the even numbers)
result = list(filter(lambda x: x%2 == 0, list_a )) # filter returns an iterator, so wrap it in list()
print (result)
list_a

[2, 4, 6, 8]


[1, 2, 3, 4, 5, 6, 7, 8]

In [51]:
result = list(map(lambda x: x**2, list_a )) 
print (result)
list_a

[1, 4, 9, 16, 25, 36, 49, 64]


[1, 2, 3, 4, 5, 6, 7, 8]

reduce(func, list) repeatedly applies the function to the elements of the list, combining them into a single scalar value.
reduce() is no longer a built-in function in Python 3; it is available as functools.reduce.

![image.png](attachment:image.png)

In [68]:
import functools
functools.reduce(lambda x,y: x+y, list_a)

36
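
To make the behaviour of `reduce` more concrete, the sketch below writes out the equivalent loop: the running result is combined with each element in turn. It also shows the optional third argument, an initial value. The names `total`, `acc`, and `product` are illustrative:

```python
import functools

list_a = [1, 2, 3, 4, 5, 6, 7, 8]

total = functools.reduce(lambda x, y: x + y, list_a)

# The same computation as an explicit loop
acc = list_a[0]
for value in list_a[1:]:
    acc = acc + value        # plays the role of the lambda above

print(total, acc)            # 36 36

# An initial value (here 1) is used as the starting result
product = functools.reduce(lambda x, y: x * y, list_a, 1)
print(product)               # 40320
```

For plain addition the built-in `sum(list_a)` is the simpler choice; `reduce` is useful when the combining step is something else.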

Exercise 7: Use anonymous function lambda in the python built-in function filter() to filter a list of 
integers greater than t

In [77]:
t=1
result=list(filter(lambda x: x>t, list_a))
print(result)

[2, 3, 4, 5, 6, 7, 8]


Exercise 8: Sort the dictionary by value using the function 'sorted()' and a lambda function for selecting the key. Test your function on the dictionary: tel = {'sape': 4139, 'guido': 4127, 'jack': 4098}

In [91]:
tel = {'sape': 4139, 'guido': 4127, 'jack': 4098}
result = sorted(tel.items(), key=lambda item: item[1])  # sorted() already returns a list
print(result)

[('jack', 4098), ('guido', 4127), ('sape', 4139)]


#### File I/O

In [178]:
f = open('hello.txt', 'w')
f.write('Hello-1, world!\n')
f.write('Hello-2, world!\n')
f.write('Hello-3, world!\n')
f.close()

In [172]:
f = open('hello.txt', 'r')
for line in f:   
    print(line) 
f.close()

Hello-1, world!

Hello-2, world!

Hello-3, world!
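
The blank lines in the output above appear because each line read from the file still ends with `'\n'`, and `print` adds another newline. A common alternative way to write this, sketched below, is the `with` statement, which closes the file automatically even if an error occurs. The sketch writes to a temporary directory (an assumption made here so that it does not touch existing files) and strips the trailing newline from each line:

```python
import os
import tempfile

# Temporary location chosen for this sketch only
path = os.path.join(tempfile.mkdtemp(), 'hello.txt')

with open(path, 'w') as f:            # file is closed when the block ends
    for i in range(1, 4):
        f.write('Hello-{}, world!\n'.format(i))

with open(path, 'r') as f:
    lines = [line.rstrip('\n') for line in f]   # drop the trailing newline

for line in lines:
    print(line)
```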



#### Operating system utilities

In [92]:
import os

In [95]:
dir = 'resources'
filenames = os.listdir(dir)   # listdir takes a single argument: the directory path
for filename in filenames:
    print(filename)                                      # bare file name
    print(os.path.join(dir, filename))                   # path relative to the working directory
    print(os.path.abspath(os.path.join(dir, filename)))  # absolute path

In [174]:
N = 3 # print the first N lines of a file
dir = 'resources'
filename = os.path.join(dir, 'test_house.csv')
print(filename)
with open(filename) as f:   # the with block closes the file automatically
    for i, line in enumerate(f):
        if i >= N:          # stop once N lines have been printed
            break
        print(line)

resources/test_house.csv


FileNotFoundError: [Errno 2] No such file or directory: 'resources/test_house.csv'

#### Handling exceptions

In [177]:
import sys
dir = 'resources'
filename = os.path.join(dir, 'ttest.csv')
try:
    f = open(filename, 'r')
    text = f.read()
    f.close()
except IOError:
    sys.stderr.write('problem reading:' + filename)

problem reading:resources/ttest.csv

In [None]:
dir = 'resources'
filename = os.path.join(dir, 'ttest.csv')
try:
    f = open(filename, 'r')
    text = f.read()
    f.close()
except IOError:
    sys.stderr.write('problem reading:' + filename)
print('After handling exception')
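
The `try` statement also accepts an `else` clause, which runs only when no exception was raised. The sketch below uses it to separate the step that can fail (opening the file) from the step that uses the file. `read_text` is a hypothetical helper, and the file name is chosen on the assumption that no such file exists. In Python 3, `IOError` is an alias of `OSError`, so catching `OSError` covers the same cases:

```python
import os
import sys

def read_text(filename):
    """Return the contents of filename, or None if it cannot be opened."""
    try:
        f = open(filename, 'r')
    except OSError as err:            # includes FileNotFoundError
        sys.stderr.write('problem reading: {} ({})\n'.format(filename, err))
        return None
    else:                             # runs only if open() succeeded
        with f:
            return f.read()

# Assumed not to exist, so the except branch is taken
missing = os.path.join('resources', 'no_such_file_for_demo.csv')
text = read_text(missing)
print(text)
print('After handling exception')
```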

### Homework (no submission required)
Go over sections 3, 4, 5, and 7 of the <a href="https://docs.python.org/3/tutorial/"> Python Tutorial </a><br>