# Python Data Science Cookbook: Chapter 1

In this chapter, we will cover the following recipes:

- Using dictionary objects
- Working with a dictionary of dictionaries
- Working with tuples
- Using sets
- Writing a list
- Creating a list from another list - list comprehension
- Using iterators
- Generating an iterator and a generator
- Using iterables
- Passing a function as a variable
- Embedding functions in another function
- Passing a function as a parameter
- Returning a function
- Altering the function behavior with decorators
- Creating anonymous functions with lambda
- Using the map function
- Working with filters
- Using zip and izip
- Processing arrays from the tabular data
- Preprocessing the columns
- Sorting lists
- Sorting with a key
- Working with itertools

### Using dictionary objects

This recipe shows how to work with dictionaries in Python. We will count the words in a
simple sentence, starting with a plain dictionary and then moving on to the specialized
dictionary classes in the collections module:

In [1]:
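# 1. Load a variable with sentences
# Note: the first continuation line has no space before the backslash, so 'pickled' and
# 'peppers' are joined into the single token 'pickledpeppers' in the output below.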
sentence = "Peter Piper picked a peck of pickled peppers A peck of pickled\
peppers Peter Piper picked If Peter Piper picked a peck of pickled \
peppers Wheres the peck of pickled peppers Peter Piper picked"

# 2. Initialize a dictionary object
word_dict = {}

# 3. Perform the word count
for word in sentence.split():
    if word not in word_dict:
        word_dict[word] =1
    else:
        word_dict[word] += 1
# 4. Print the output
print (word_dict)

{'a': 2, 'A': 1, 'Peter': 4, 'of': 4, 'Piper': 4, 'pickled': 3, 'pickledpeppers': 1, 'picked': 4, 'peppers': 3, 'the': 1, 'peck': 4, 'Wheres': 1, 'If': 1}


In [2]:
# Step 3 revisited setting default value for each key
word_dict ={}
for word in sentence.split():
    word_dict.setdefault(word,0)
    word_dict[word]+=1

# Print the output
print (word_dict)    

{'a': 2, 'A': 1, 'Peter': 4, 'of': 4, 'Piper': 4, 'pickled': 3, 'pickledpeppers': 1, 'picked': 4, 'peppers': 3, 'the': 1, 'peck': 4, 'Wheres': 1, 'If': 1}


In [3]:
# Using defaultdict class
from collections import defaultdict
sentence = "Peter Piper picked a peck of pickled peppers A peck of pickled\
peppers Peter Piper picked If Peter Piper picked a peck of pickled \
peppers Wheres the peck of pickled peppers Peter Piper picked"

word_dict = defaultdict(int)

for word in sentence.split():
    word_dict[word]+=1
# Note that the output here is a defaultdict object
print word_dict

defaultdict(<type 'int'>, {'a': 2, 'A': 1, 'Peter': 4, 'of': 4, 'Piper': 4, 'pickled': 3, 'pickledpeppers': 1, 'picked': 4, 'peppers': 3, 'the': 1, 'peck': 4, 'Wheres': 1, 'If': 1})


In [5]:
# Now we look at Counter, a dictionary subclass for counting hashable objects.
from collections import Counter
sentence = "Peter Piper picked a peck of pickled peppers A peck of pickled\
peppers Peter Piper picked If Peter Piper picked a peck of pickled \
peppers Wheres the peck of pickled peppers Peter Piper picked"

words = sentence.split()
# It automatically counts the occurrences of each item in the list
word_count = Counter(words)

print word_count['Peter']
print word_count

4
Counter({'Peter': 4, 'of': 4, 'Piper': 4, 'picked': 4, 'peck': 4, 'pickled': 3, 'peppers': 3, 'a': 2, 'A': 1, 'pickledpeppers': 1, 'the': 1, 'Wheres': 1, 'If': 1})


In [9]:
# Using OrderedDict class
from collections import OrderedDict
sentence = "Peter Piper picked a peck of pickled peppers A peck of pickled\
peppers Peter Piper picked If Peter Piper picked a peck of pickled \
peppers Wheres the peck of pickled peppers Peter Piper picked"

word_dict = {}

for word in sentence.split():
    word_dict.setdefault(word,0)
    word_dict[word]+=1
    
# dictionary sorted by key
print OrderedDict(sorted(word_dict.items(), key=lambda t: t[0]))

# dictionary sorted by value
print OrderedDict(sorted(word_dict.items(), key=lambda t: t[1]))

# dictionary sorted by length of the key string
print OrderedDict(sorted(word_dict.items(), key=lambda t: len(t[0])))

# Note: The new sorted dictionaries maintain their sort order when entries are deleted. 
# But when new keys are added, the keys are appended to the end and the sort is not maintained.

OrderedDict([('A', 1), ('If', 1), ('Peter', 4), ('Piper', 4), ('Wheres', 1), ('a', 2), ('of', 4), ('peck', 4), ('peppers', 3), ('picked', 4), ('pickled', 3), ('pickledpeppers', 1), ('the', 1)])
OrderedDict([('A', 1), ('pickledpeppers', 1), ('the', 1), ('Wheres', 1), ('If', 1), ('a', 2), ('pickled', 3), ('peppers', 3), ('Peter', 4), ('of', 4), ('Piper', 4), ('picked', 4), ('peck', 4)])
OrderedDict([('a', 2), ('A', 1), ('of', 4), ('If', 1), ('the', 1), ('peck', 4), ('Peter', 4), ('Piper', 4), ('picked', 4), ('Wheres', 1), ('pickled', 3), ('peppers', 3), ('pickledpeppers', 1)])
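
The note about sort order at the end of the cell above can be checked with a short sketch. It assumes Python 3 syntax (the cells in this chapter use Python 2), and the counts are illustrative values rather than the tongue-twister data.

```python
from collections import OrderedDict

# Illustrative counts, not the tongue-twister data used above.
counts = {"peck": 4, "a": 2, "the": 1}

# Build an OrderedDict sorted by key.
by_key = OrderedDict(sorted(counts.items(), key=lambda t: t[0]))
print(list(by_key))

# Deleting an entry keeps the remaining keys in sorted order.
del by_key["peck"]
print(list(by_key))

# A new key is appended at the end, so the sort is no longer maintained.
by_key["b"] = 7
print(list(by_key))
```

If the sorted order matters after an insertion, the OrderedDict has to be rebuilt from a fresh call to sorted().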


### Working with a dictionary of dictionaries

Let’s look at an example to understand how to use dictionaries in a dictionary.
We will create the user_movie_rating dictionary using an anonymous function to
demonstrate the concept of a dictionary of dictionaries.

In [10]:
from collections import defaultdict
user_movie_rating = defaultdict(lambda :defaultdict(int))

# Initialize ratings for Alice
user_movie_rating["Alice"]["LOR1"] = 4
user_movie_rating["Alice"]["LOR2"] = 5
user_movie_rating["Alice"]["LOR3"] = 3
user_movie_rating["Alice"]["SW1"] = 5
user_movie_rating["Alice"]["SW2"] = 3

# Initialize ratings for Huntsman
user_movie_rating["Huntsman"]["LOR1"] = 1
user_movie_rating["Huntsman"]["LOR2"] = 2
user_movie_rating["Huntsman"]["LOR3"] = 1
user_movie_rating["Huntsman"]["SW1"] = 4
user_movie_rating["Huntsman"]["SW2"] = 4

# Initialize ratings for Snipe
user_movie_rating["Snipe"]["LOR1"] = 3
user_movie_rating["Snipe"]["LOR2"] = 4
user_movie_rating["Snipe"]["LOR3"] = 4
user_movie_rating["Snipe"]["SW1"] = 2
user_movie_rating["Snipe"]["SW2"] = 1

print user_movie_rating

defaultdict(<function <lambda> at 0x03083670>, {'Huntsman': defaultdict(<type 'int'>, {'SW1': 4, 'SW2': 4, 'LOR1': 1, 'LOR3': 1, 'LOR2': 2}), 'Snipe': defaultdict(<type 'int'>, {'SW1': 2, 'SW2': 1, 'LOR1': 3, 'LOR3': 4, 'LOR2': 4}), 'Alice': defaultdict(<type 'int'>, {'SW1': 5, 'SW2': 3, 'LOR1': 4, 'LOR3': 3, 'LOR2': 5})})
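
One side effect of nested defaultdict objects is worth seeing in isolation: simply reading a missing key creates it. The following is a minimal sketch, assuming Python 3 syntax; the names "Bob" and "Carol" are hypothetical users added for illustration.

```python
from collections import defaultdict

user_movie_rating = defaultdict(lambda: defaultdict(int))
user_movie_rating["Alice"]["LOR1"] = 4
user_movie_rating["Alice"]["SW1"] = 5

# Reading a missing movie returns the int default, 0, and also creates the key.
missing = user_movie_rating["Alice"]["SW2"]
print(missing)
print(sorted(user_movie_rating["Alice"]))

# Reading a missing user creates an empty inner dictionary as a side effect.
print(len(user_movie_rating["Bob"]))
print(sorted(user_movie_rating))

# The "in" operator checks for a key without inserting anything.
has_carol = "Carol" in user_movie_rating
print(has_carol)
```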


### Working with tuples

Tuples are immutable and can have a heterogeneous sequence of elements separated by a comma and
enclosed in parentheses. They support the following operations:
- in and not in
- Comparison, concatenation, slicing, and indexing
- min() and max()

In [13]:
# 1. Ways of creating a tuple
a_tuple = (1,2,'a')
b_tuple = 1,2,'c'

print a_tuple
print b_tuple

# 2. Accessing elements of a tuple through index
print b_tuple[0]
print b_tuple[-1]

# 3. It is not possible to change the value of an item in a tuple;
# for example, the next statement will result in an error.
try:
    b_tuple[0] = 20
except TypeError:
    print "Cannot change value of tuple by index"

# 4. Though tuples are immutable,
# the elements of a tuple can be mutable objects,
# for instance a list, as in the following line of code
c_tuple =(1,2,[10,20,30])
c_tuple[2][0] = 100
print c_tuple
# 5.Tuples once created cannot be extended like list,
# however two tuples can be concatenated.
print a_tuple + b_tuple

# 6.Slicing of tuples
a =(1,2,3,4,5,6,7,8,9,10)
print a[1:]
print a[1:3]
print a[1:6:2]
print a[:-1]
print a[::-1]

# 7. Tuple min max
print min(a),max(a)

# 8. in and not in
if 1 in a:
    print "Element 1 is available in tuple a"
else:
    print "Element 1 is not available in tuple a"


(1, 2, 'a')
(1, 2, 'c')
1
c
Cannot change value of tuple by index
(1, 2, [100, 20, 30])
(1, 2, 'a', 1, 2, 'c')
(2, 3, 4, 5, 6, 7, 8, 9, 10)
(2, 3)
(2, 4, 6)
(1, 2, 3, 4, 5, 6, 7, 8, 9)
(10, 9, 8, 7, 6, 5, 4, 3, 2, 1)
1 10
Element 1 is available in tuple a


**Tip** :
While building programs for machine learning, in particular during the feature generation
from raw data, creating feature tuples ensures that values cannot be changed by
downstream programs.
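
As a sketch of the tip above, a namedtuple from the collections module gives a feature tuple with readable field names while keeping it immutable. The Feature type and its field names are hypothetical, chosen only for this example, and Python 3 syntax is assumed.

```python
from collections import namedtuple

# Hypothetical feature record; the field names are made up for illustration.
Feature = namedtuple("Feature", ["word_count", "has_digits"])

feat = Feature(word_count=12, has_digits=False)
print(feat.word_count, feat[1])

# Attempting to overwrite a field fails, just as with a plain tuple.
try:
    feat.word_count = 99
    changed = True
except AttributeError:
    changed = False
print(changed)

# _replace returns a new tuple and leaves the original untouched.
updated = feat._replace(word_count=99)
print(feat, updated)
```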

### Using sets

Sets are similar to lists except that they do not allow duplicates. A set is an
unordered collection of unique, hashable elements. Typically, sets are used to remove
duplicate elements from a list. Sets also support operations such as intersection,
union, difference, and symmetric difference, which are handy in a lot of
use cases.

In [15]:
# In our example, we will calculate a similarity score between two sentences using Jaccard’s coefficient

# 1.Initialize two sentences.
st_1 = "dogs chase cats"
st_2 = "dogs hate cats"

# 2.Create set of words from strings
st_1_wrds = set(st_1.split())
st_2_wrds = set(st_2.split())

# 3.Find out the number of unique words in each set, vocabulary size.
no_wrds_st_1 = len(st_1_wrds)
no_wrds_st_2 = len(st_2_wrds)

# 4.Find out the list of common words between the two sets.
# Also find out the count of common words.
cmn_wrds = st_1_wrds.intersection(st_2_wrds)
no_cmn_wrds = len(st_1_wrds.intersection(st_2_wrds))

# 5.Get a list of unique words between the two sets.
# Also find out the count of unique words.
unq_wrds = st_1_wrds.union(st_2_wrds)
no_unq_wrds = len(st_1_wrds.union(st_2_wrds))

# 6.Calculate Jaccard similarity
similarity = no_cmn_wrds / (1.0 * no_unq_wrds)

# 7.Let us now print to grasp our output.
print "No words in sent_1 = %d"%(no_wrds_st_1)
print "Sentence 1 words =", st_1_wrds
print "No words in sent_2 = %d"%(no_wrds_st_2)
print "Sentence 2 words =", st_2_wrds
print "No words in common = %d"%(no_cmn_wrds)
print "Common words =", cmn_wrds
print "Total unique words = %d"%(no_unq_wrds)
print "Unique words=",unq_wrds
print "Similarity = No words in common/No unique words, %d/%d = %.2f"%(no_cmn_wrds,no_unq_wrds,similarity)


No words in sent_1 = 3
Sentence 1 words = set(['cats', 'dogs', 'chase'])
No words in sent_2 = 3
Sentence 2 words = set(['cats', 'hate', 'dogs'])
No words in common = 2
Common words = set(['cats', 'dogs'])
Total unique words = 4
Unique words= set(['cats', 'hate', 'dogs', 'chase'])
Similarity = No words in common/No unique words, 2/4 = 0.50


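We can also use built-in functions from libraries such as scikit-learn. Note that jaccard_similarity_score, used here, was deprecated and later removed from scikit-learn; newer releases provide jaccard_score instead.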

In [18]:
# Load libraries
from sklearn.metrics import jaccard_similarity_score

# 1.Initialize two sentences.
st_1 = "dogs chase cats"
st_2 = "dogs hate cats"

# 2.Create set of words from strings
st_1_wrds = set(st_1.split())
st_2_wrds = set(st_2.split())

unq_wrds = st_1_wrds.union(st_2_wrds)

a =[ 1 if w in st_1_wrds else 0 for w in unq_wrds ]
b =[ 1 if w in st_2_wrds else 0 for w in unq_wrds]

print a
print b
print jaccard_similarity_score(a,b)


[1, 0, 1, 1]
[1, 1, 1, 0]
0.5
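
The same calculation can be written more compactly with the set operators. The sketch below assumes Python 3 syntax; the jaccard helper is a hypothetical name, and returning 0 for two empty sentences is a choice made here, not something the recipe specifies.

```python
def jaccard(sentence_a, sentence_b):
    """Jaccard coefficient of the word sets of two sentences."""
    words_a = set(sentence_a.split())
    words_b = set(sentence_b.split())
    union = words_a | words_b
    if not union:
        return 0.0            # assumption: two empty sentences score 0
    return len(words_a & words_b) / float(len(union))

score = jaccard("dogs chase cats", "dogs hate cats")
print(score)

# The other set operations mentioned above, on the same data.
s1 = set("dogs chase cats".split())
s2 = set("dogs hate cats".split())
print(sorted(s1 - s2))   # difference
print(sorted(s1 ^ s2))   # symmetric difference
```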


### Writing a list

Lists are similar to tuples except that they are mutable. Like tuples, they can hold heterogeneous elements. Unlike tuples, lists are expandable; you can add elements to a list
using the append function after its creation. Lists can also be used as either a stack or a queue.

In [22]:
# 1.Let us look at a quick example of list creation.
a = range(1,10)
print a
b = ["a","b","c"]
print b

# 2.List can be accessed through indexing. Indexing starts at 0.
print a[0]
# 3. With negative indexing, the elements of a list are accessed from the end.
a[-1]
# 4.Slicing is accessing a subset of list by providing two indices.
print a[1:3] # prints [2, 3]
print a[1:] # prints [2, 3, 4, 5, 6, 7, 8, 9]
print a[-1:] # prints [9]
print a[:-1] # prints [1, 2, 3, 4, 5, 6, 7, 8]

#5.List concatenation
a = [1,2]
b = [3,4]
print a + b # prints [1, 2, 3, 4]
# 6. List min max
print min(a),max(a)
# 7. in and not in
if 1 in a:
    print "Element 1 is available in list a"
else:
    print "Element 1 is not available in list a"

# 8. Appending and extending list
a = range(1,10)
print a
a.append(10)
print a
# 9.List as a stack
a_stack = []
a_stack.append(1)
a_stack.append(2)
a_stack.append(3)
print a_stack.pop()
print a_stack.pop()
print a_stack.pop()

# 10.List as queue
a_queue = []
a_queue.append(1)
a_queue.append(2)
a_queue.append(3)
print a_queue.pop(0)
print a_queue.pop(0)
print a_queue.pop(0)

# 11. List sort and reverse
from random import shuffle
a = range(1,20)
shuffle(a)
print a
a.sort()
print a

a.reverse()
print a

[1, 2, 3, 4, 5, 6, 7, 8, 9]
['a', 'b', 'c']
1
[2, 3]
[2, 3, 4, 5, 6, 7, 8, 9]
[9]
[1, 2, 3, 4, 5, 6, 7, 8]
[1, 2, 3, 4]
1 2
Element 1 is available in list a
[1, 2, 3, 4, 5, 6, 7, 8, 9]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
3
2
1
1
2
3
[18, 17, 5, 16, 7, 8, 13, 15, 2, 14, 19, 6, 11, 12, 10, 3, 4, 9, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
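
Using a list as a queue works, but pop(0) has to shift every remaining element. For longer queues, collections.deque is designed for fast appends and pops at both ends. A minimal sketch, assuming Python 3 syntax:

```python
from collections import deque

a_queue = deque()
a_queue.append(1)
a_queue.append(2)
a_queue.append(3)

# popleft() removes from the front without shifting the remaining items.
first = a_queue.popleft()
second = a_queue.popleft()
print(first, second, list(a_queue))
```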


### Creating a list from another list - list comprehension

Comprehension is a way to create a sequence from another sequence. 
Typically, a list comprehension involves the following features:
- A sequence, say a list whose elements we are interested in
- A variable representing the elements of the sequence
- An output expression that is responsible for producing the output sequence using the elements of the input sequence
- An optional predicate expression

In [23]:
# 1. Let us define a simple list with some positive and negative numbers.
a = [1,2,-1,-2,3,4,-3,-4]

# 2. Now let us write our list comprehension.
# pow() a power function takes two input and
# its output is the first variable raised to the power of the second.
b = [pow(x,2) for x in a if x < 0]

# 3. Finally let us see the output, i.e. the newly created list b.
print b

[1, 4, 9, 16]


The comprehension syntax for dictionaries is very similar. A simple example
illustrates this:

In [24]:
a = {'a':1,'b':2,'c':3}
b = {x:pow(y,2) for x,y in a.items()}
print b

{'a': 1, 'c': 9, 'b': 4}


We can build a tuple with comprehension-like syntax by passing a generator expression to tuple(). See the following example:

In [25]:
def process(x):
    if isinstance(x,str):
        return x.lower()
    elif isinstance(x,int):
        return x*x
    else:
        return -9
a = (1,2,-1,-2,'D',3,4,-3,'A')
b = tuple(process(x) for x in a )
print b

(1, 4, 1, 4, 'd', 9, 16, 9, 'a')
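
To see why tuple() is needed, the sketch below (assuming Python 3 syntax) compares what parentheses, tuple(), and braces each produce from the same expression.

```python
a = (1, 2, -1, -2, 3)

# Parentheses alone give a generator object, not a tuple.
gen = (x * x for x in a)
print(type(gen).__name__)

# Wrapping the generator expression in tuple() materializes it.
squares = tuple(x * x for x in a)
print(squares)

# Braces without key:value pairs give a set comprehension (duplicates collapse).
unique_squares = {x * x for x in a}
print(sorted(unique_squares))
```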


### Using iterators

To consume data in chunks and iterate through it, we can use iterators. An iterator in Python implements the iterator pattern. It allows us to go over a sequence one element at a time without materializing the whole sequence.

Let’s create a simple iterator called SimpleCounter, followed by some code that shows
how to use it:

In [37]:
import traceback
# 1. Let us write a simple iterator.
class SimpleCounter(object):
    def __init__(self, start, end):
        self.current = start
        self.end = end
    
    def __iter__(self):
        'Returns itself as an iterator object'
        return self

    def next(self):
        'Returns the next value as long as current does not exceed end'
        if self.current > self.end:
            raise StopIteration
        else:
            self.current += 1
            return self.current - 1

c = SimpleCounter(1,3)
print c

# 2. Now let us try to access the iterator
print "try next"
try:
    print c.next()
    print c.next()
    print c.next()
    print c.next()
except:
    traceback.print_exc()


# 3. Another way to access
print "loop c"
for entry in iter(c):
    print entry



<__main__.SimpleCounter object at 0x05A6E1F0>
try next
1
2
3
loop c


Traceback (most recent call last):
  File "<ipython-input-37-d9d4294da432>", line 29, in <module>
    print c.next()
  File "<ipython-input-37-d9d4294da432>", line 15, in next
    raise StopIteration
StopIteration
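
In the run above, the final for loop prints nothing because the four next() calls had already exhausted the iterator. The sketch below shows the same class adapted for Python 3, where the method is named __next__; the next alias is an addition made here so that Python 2 style calls keep working.

```python
class SimpleCounter(object):
    """Counts from start to end, inclusive."""

    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):          # Python 3 name for the iterator step
        if self.current > self.end:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

    next = __next__              # alias so Python 2 style c.next() still works

c = SimpleCounter(1, 3)
first = next(c)                  # the built-in next() works on both versions
rest = list(c)                   # consumes what is left
again = list(c)                  # the iterator is now exhausted
print(first, rest, again)
```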


### Generating an iterator and a generator

Generators provide a clean syntax to loop through a sequence of values eliminating the
need to have the two functions, __iter__ and next(). We don’t have to write a class. A
point to note is that both generators and iterables produce an iterator.

In [32]:
SimpleCounter = (x**2 for x in range(1,10))
tot = 0
print SimpleCounter
for val in SimpleCounter:
    tot+=val
print tot

<generator object <genexpr> at 0x05A5FA08>
285


In [33]:
def my_gen(low,high):
    for x in range(low,high):
        yield x**2
tot = 0
for val in my_gen(1,10):
    tot+=val
print tot

285


In the previous section, we mentioned that both a generator and iterables produce an
iterator. Let’s validate this by trying to call the generator using the iter function:

In [34]:
gen = (x**2 for x in range(1,10))
for val in iter(gen):
    print val

1
4
9
16
25
36
49
64
81
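
A generator is its own iterator, which also means it can be consumed only once. A short sketch, assuming Python 3 syntax:

```python
gen = (x**2 for x in range(1, 10))

# iter() on a generator returns the very same object.
same_object = iter(gen) is gen
print(same_object)

first_pass = sum(gen)
second_pass = sum(gen)    # nothing left to consume
print(first_pass, second_pass)
```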


### Using iterables

Iterables are similar to generators except for a key difference: we can go on and on with an
iterable. That is, once we have exhausted all the elements in a sequence, we can start
accessing it again from the beginning, unlike a generator.
They are object-based generators that do not hold any iteration state. Any class with an __iter__
method that yields data can be used as a stateless object generator.

In [38]:
# 1. Let us define a simple class with __iter__ method.
class SimpleIterable(object):
    def __init__(self, start, end):
        self.start = start
        self.end = end
    def __iter__(self):
        for x in range(self.start,self.end):
            yield x**2

# Now let us invoke this class and iterate over its values two times.
c = SimpleIterable(1,10)

# First iteration
tot = 0
for val in iter(c):
    tot+=val
print tot
# Second iteration
tot =0
for val in iter(c):
    tot+=val
print tot

285
285


### Passing a function as a variable

Functions are first-class citizens in Python. They have attributes and they can be referenced and assigned to a variable.
Let’s look at the paradigm of passing a function as a variable in Python in this section.

In [39]:
# 1.Let us define a simple function.
def square_input(x):
    return x*x
# We will follow it by assigning that function to a variable
square_me = square_input
# And finally invoke the variable
print square_me(5)

25
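
The following sketch, assuming Python 3 syntax, shows the attributes mentioned above and stores functions in a dictionary like any other value. The dictionary keys are hypothetical labels chosen for this example.

```python
def square_input(x):
    """Return the square of x."""
    return x * x

square_me = square_input

# Both names refer to the same function object.
print(square_me is square_input)
print(square_me.__name__, square_me.__doc__)

# Functions can be stored in containers like any other value.
operations = {"square": square_input, "absolute": abs}
results = {name: func(-4) for name, func in operations.items()}
print(results)
```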


### Embedding functions in another function
This recipe explains yet another functional programming construct: defining a function
inside another function.

In [41]:
# 1. Let us define a function of function to find the sum of squares of the given input
def sum_square(x):
    def square_input(x):
        return x*x
    return sum([square_input(x1) for x1 in x])

# Print the output to check for correctness
print sum_square([2,4,5])

45


### Passing a function as a parameter

Python supports higher order functions, that is, functions that can accept other functions as
arguments.

In [42]:
from math import log
def square_input(x):
    return x*x
# 1. Define a generic function, which will take another function as input
# and will apply it on the given input sequence.
def apply_func(func_x,input_x):
    return map(func_x,input_x)

# Let us try to use the apply_func() and verify the results
a = [2,3,4]
print apply_func(square_input,a)
print apply_func(log,a)

[4, 9, 16]
[0.6931471805599453, 1.0986122886681098, 1.3862943611198906]


### Returning a function
In this section, let’s look at the functions that will return another function.

In [43]:
# 1. Let us define a function which will explain our
# concept of function returning a function.
def cylinder_vol(r):
    pi = 3.141
    def get_vol(h):
        return pi * r**2 * h
    return get_vol

# 2. Let us define a radius and get a volume function,
# which can now find the volume for the given radius and any height.
radius = 10
find_volume = cylinder_vol(radius)

# 3. Let us try to find out the volume for different heights
height = 10
print "Volume of cylinder of radius %d and height %d = %.2f cubic units" \
%(radius,height,find_volume(height))

height = 20
print "Volume of cylinder of radius %d and height %d = %.2f cubic units" \
%(radius,height,find_volume(height))


Volume of cylinder of radius 10 and height 10 = 3141.00 cubic units
Volume of cylinder of radius 10 and height 20 = 6282.00 cubic units


Functools is a module for higher order functions:
https://docs.python.org/2/library/functools.html
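
As one example from that module, functools.partial fixes some arguments of an existing function and returns a new function, much like cylinder_vol does by hand. The sketch below assumes Python 3 syntax; cylinder_volume is a hypothetical two-argument variant written for this example.

```python
from functools import partial

def cylinder_volume(r, h, pi=3.141):
    # Same formula and value of pi as cylinder_vol above.
    return pi * r**2 * h

# Fix the radius; the result is a function of the height only.
find_volume = partial(cylinder_volume, 10)

vol_10 = find_volume(10)
vol_20 = find_volume(20)
print("%.2f %.2f" % (vol_10, vol_20))
```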

### Altering the function behavior with decorators

Decorators wrap a function and alter its behavior.

In [45]:
from string import punctuation

def pipeline_wrapper(func):
    def to_lower(x):
        return x.lower()

    def remove_punc(x):
        for p in punctuation:
            x = x.replace(p,'')
        return x

    def wrapper(*args,**kwargs):
        x = to_lower(*args,**kwargs)
        x = remove_punc(x)
        return func(x)
    
    return wrapper

@pipeline_wrapper
def tokenize_whitespace(inText):
    return inText.split()

s = "string. With. Punctuation?"
print tokenize_whitespace(s)

['string', 'with', 'punctuation']
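
A decorated function normally takes on the name and docstring of the wrapper. The sketch below is a condensed variant of the decorator above, not a replacement for it, and adds functools.wraps to preserve that metadata. Python 3 syntax is assumed.

```python
from functools import wraps
from string import punctuation

def pipeline_wrapper(func):
    @wraps(func)                      # copy __name__ and __doc__ to the wrapper
    def wrapper(text):
        text = text.lower()
        for p in punctuation:
            text = text.replace(p, "")
        return func(text)
    return wrapper

@pipeline_wrapper
def tokenize_whitespace(in_text):
    """Split text on whitespace."""
    return in_text.split()

tokens = tokenize_whitespace("string. With. Punctuation?")
print(tokens)
print(tokenize_whitespace.__name__)
```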


### Creating anonymous functions with lambda

Anonymous functions are created using the lambda keyword in Python. Functions that
are not bound to a name are called anonymous functions.

In [56]:
# 1. Create a simple list and a function similar to the
# one in functions as parameter section.
a =[10,20,30]

def do_list(a_list,func):
    total = 0
    for element in a_list:
        total+=func(element)
    return total

print do_list(a,lambda x:x**2)
print do_list(a,lambda x:x**3)
b =[lambda x: x%3 ==0 for x in a ]
for f in b:
    print f(3)


1400
36000
True
True
True


### Using the map function

Map is a built-in Python function. It takes a function and an iterable as arguments:
map(aFunction, iterable)
The function is applied to all the elements of the iterable and the results are returned as a list.
Because a function is passed to map, lambda is commonly used along with it.

In [57]:
#First let us declare a list.
a =[10,20,30]

# Let us now call the map function in our Print statement.
print map(lambda x:x**2,a)

[100, 400, 900]


In [58]:
print sum(map(lambda x:x**2,a))
print sum(map(lambda x:x**3,a))

1400
36000


In [59]:
# Map expects an N-argument function if we have N-sequences. Let’s see an example to understand this:
a =[10,20,30]
b = [1,2,3]

print map(pow,a,b)

[10, 400, 27000]
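
The cells above rely on Python 2, where map returns a list. As a sketch of the same call under Python 3 (an assumption about the reader's environment), the result has to be materialized explicitly; a list comprehension with zip gives the same values.

```python
a = [10, 20, 30]
b = [1, 2, 3]

# In Python 3, map returns a lazy iterator; list() materializes it.
powers = list(map(pow, a, b))
print(powers)

# The same result with zip and a list comprehension.
powers_lc = [x ** y for x, y in zip(a, b)]
print(powers_lc)
```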


### Working with filters

Filter selects elements from a sequence based on the given function.
Filter is a built-in Python function. It takes a function and an iterable as arguments:
filter(aFunction, iterable)
The function passed as an argument returns a Boolean value based on a test.
It is applied to each element of the iterable, and the elements for which it
returns true are returned as a list. An
anonymous function, lambda, is commonly used along with filter.


In [60]:
# Let us declare a list.
a = [10,20,30,40,50]
# Let us apply Filter function on all the elements of the list.
print filter(lambda x:x>10,a)

[20, 30, 40, 50]
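
A few variations on the same idea, sketched under the assumption of Python 3, where filter returns an iterator rather than a list:

```python
a = [10, 20, 30, 40, 50]

# list() is needed in Python 3, where filter returns an iterator.
big = list(filter(lambda x: x > 10, a))

# Equivalent list comprehension.
big_lc = [x for x in a if x > 10]

# Passing None as the function keeps only the truthy items.
truthy = list(filter(None, [0, 1, "", "a", None, [], [2]]))
print(big, big_lc, truthy)
```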


### Using zip and izip

Zip takes two or more collections and merges them together element-wise into tuples, stopping at the
shortest collection. Zip is a built-in Python function.

In [61]:
print zip(range(1,5),range(1,5))

[(1, 1), (2, 2), (3, 3), (4, 4)]


In [66]:
# Let’s see what the * operator does. The * operator unpacks a collection into positional arguments:
a =(2,3)
print pow(*a)

8


In [63]:
# The ** operator unpacks a dictionary as a set of named arguments. In this case, we get
# an output of 0 when we apply the ** operator to the dictionary, since every coordinate is 10.

def dist(x,y,z,x1,y1,z1):
    return abs((x-x1)+(y-y1)+(z-z1))

a_dict = {"x":10,"y":10,"z":10,"x1":10,"y1":10,"z1":10}

print dist(**a_dict)

0


In [64]:
def any_sum(*args):
    tot = 0
    for arg in args:
        tot+=arg
    return tot
print any_sum(1,2)
print any_sum(1,2,3)

3
6
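
In Python 2, itertools.izip is the lazy counterpart of zip; in Python 3, the built-in zip is itself lazy. The sketch below uses Python 3 syntax with a fallback import, and the names and values lists are illustrative. It shows how zip behaves with unequal lengths and how the * operator can reverse a zip.

```python
try:
    from itertools import zip_longest                       # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest       # Python 2

names = ["a", "b", "c"]
values = [1, 2]

# zip stops at the shortest input.
pairs = list(zip(names, values))
print(pairs)

# zip_longest pads the shorter input instead (fill value chosen here as 0).
padded = list(zip_longest(names, values, fillvalue=0))
print(padded)

# zip combined with * "unzips" a list of pairs back into columns.
unzipped_names, unzipped_values = zip(*pairs)
print(unzipped_names, unzipped_values)
```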


### Processing arrays from the tabular data

Typically, data is available as a text file, separated by either a comma or tab. Python’s
built-in file object can be used in this case. A file object
implements the __iter__() and next() methods.

This allows us to work on very large files, which do not fit into memory, by reading only a small chunk of the file at a time. Python machine learning libraries such as scikit-learn work on top of NumPy. In
this section, we will see ways of efficiently reading external data and converting it to NumPy arrays for downstream data processing.

In [70]:
# 1. Let us simulate a small tabular input using StringIO
import numpy as np
from StringIO import StringIO
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")

# 2. Read the input using NumPy genfromtxt to create a NumPy array.
data = np.genfromtxt(in_data,dtype=int,delimiter=",")
print data

# cases where we may not need to use some columns.
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data,dtype=int,delimiter=",",usecols=(0,1))
print data

# providing column names
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data,dtype=int,delimiter=",",names="a,b,c")
print data

# using column names from data
in_data = StringIO("a,b,c\n10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data,dtype=int,delimiter=",",names=True)
print data

array([(10, 20, 30), (56, 89, 90), (33, 46, 89)], 
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])

Another simple method from NumPy to create NumPy arrays from text input is
loadtxt:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
It is less sophisticated than genfromtxt.

### Preprocessing the columns

Often the data that we get is not in a format we can consume directly. A series of processing steps,
called data preprocessing in machine learning terminology, has to be applied first.
The genfromtxt function provides some functionality to perform this data transformation while reading from the source.


In [72]:
import numpy as np
from StringIO import StringIO
# Define a data set
in_data = StringIO("30kg,inr2000,31.11,56.33,1\n52kg,inr8000.35,12,16.7,2")

# 1. Let us define two data preprocessing functions using lambda,
strip_func_1 = lambda x : float(x.rstrip("kg"))
strip_func_2 = lambda x : float(x.lstrip("inr"))

# 2.Let us now create a dictionary of these functions,
convert_funcs = {0:strip_func_1,1:strip_func_2}

# 3.Now provide this dictionary of functions to genfromtxt.
data = np.genfromtxt(in_data,delimiter=",", converters=convert_funcs)

print data

# Using a lambda function to replace missing values with -999
in_data = StringIO("10,20,30\n56,,90\n33,46,89")
mss_func = lambda x : float(x.strip() or -999)
data = np.genfromtxt(in_data,delimiter=",", converters={1:mss_func})

print data

[[  3.00000000e+01   2.00000000e+03   3.11100000e+01   5.63300000e+01
    1.00000000e+00]
 [  5.20000000e+01   8.00035000e+03   1.20000000e+01   1.67000000e+01
    2.00000000e+00]]
[[  10.   20.   30.]
 [  56. -999.   90.]
 [  33.   46.   89.]]


Refer to the NumPy documentation given here for more details:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
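
One caution about the converters used in this recipe: rstrip and lstrip treat their argument as a set of characters, not as a fixed suffix or prefix. That is harmless for the numeric fields above, but it can surprise with other data. The sketch below assumes Python 3 syntax; strip_prefix and strip_suffix are hypothetical helpers written for illustration.

```python
def strip_prefix(text, prefix):
    """Remove prefix once from the start of text, if present."""
    if prefix and text.startswith(prefix):
        return text[len(prefix):]
    return text

def strip_suffix(text, suffix):
    """Remove suffix once from the end of text, if present."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

# lstrip removes any leading characters found in the set {"i", "n", "r"}.
surprise = "rinse".lstrip("inr")
print(surprise)

# The helpers only remove the exact prefix or suffix.
print(strip_prefix("rinse", "inr"))
weight = float(strip_suffix("30kg", "kg"))
price = float(strip_prefix("inr8000.35", "inr"))
print(weight, price)
```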

### Sorting lists

We will start with sorting a list and then move on to sorting other iterables.

In [75]:
# Let us look at a very small code snippet, which does sorting of a given list.
a = [8, 0, 3, 4, 5, 2, 9, 6, 7, 1]
b = [8, 0, 3, 4, 5, 2, 9, 6, 7, 1]
print a
a.sort()

print a
print b

b_s = sorted(b)
print b_s

a.sort(reverse=True)
print a

a = (8, 0, 3, 4, 5, 2, 9, 6, 7, 1)
sorted(a)

[8, 0, 3, 4, 5, 2, 9, 6, 7, 1]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[8, 0, 3, 4, 5, 2, 9, 6, 7, 1]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

### Sorting with a key

Let’s now see how to sort using keys. We will work through
our examples using a list of tuples; the same applies to other sequence objects.

In [80]:
#1.The first step is to create a list of tuples, which we will use to test our sorting.
employee_records = [ ('joe',1,53),('beck',2,26), \
('ele',6,32),('neo',3,45), \
('christ',5,33),('trinity',4,29), \
]

# 2.Let us now sort it by employee name
print sorted(employee_records,key=lambda emp : emp[0])

# 3.Let us now sort it by employee id
print sorted(employee_records,key=lambda emp : emp[1])

# 4.Finally we sort it with employee age
print sorted(employee_records,key=lambda emp : emp[2])

from operator import itemgetter

print sorted(employee_records,key=itemgetter(0))
print sorted(employee_records,key=itemgetter(1))
print sorted(employee_records,key=itemgetter(2))

print sorted(employee_records,key=itemgetter(0,1))

[('beck', 2, 26), ('christ', 5, 33), ('ele', 6, 32), ('joe', 1, 53), ('neo', 3, 45), ('trinity', 4, 29)]
[('joe', 1, 53), ('beck', 2, 26), ('neo', 3, 45), ('trinity', 4, 29), ('christ', 5, 33), ('ele', 6, 32)]
[('beck', 2, 26), ('trinity', 4, 29), ('ele', 6, 32), ('christ', 5, 33), ('neo', 3, 45), ('joe', 1, 53)]
[('beck', 2, 26), ('christ', 5, 33), ('ele', 6, 32), ('joe', 1, 53), ('neo', 3, 45), ('trinity', 4, 29)]
[('joe', 1, 53), ('beck', 2, 26), ('neo', 3, 45), ('trinity', 4, 29), ('christ', 5, 33), ('ele', 6, 32)]
[('beck', 2, 26), ('trinity', 4, 29), ('ele', 6, 32), ('christ', 5, 33), ('neo', 3, 45), ('joe', 1, 53)]
[('beck', 2, 26), ('christ', 5, 33), ('ele', 6, 32), ('joe', 1, 53), ('neo', 3, 45), ('trinity', 4, 29)]
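
When the keys need different directions, for example age descending and name ascending, two approaches are common. The sketch below assumes Python 3 syntax and adds a made-up record, 'amy', so that two employees share an age. It relies on the fact that Python's sort is stable, including when reverse=True is used.

```python
from operator import itemgetter

employee_records = [('joe', 1, 53), ('beck', 2, 26), ('ele', 6, 32),
                    ('neo', 3, 45), ('amy', 7, 26)]   # 'amy' is a made-up extra record

# Approach 1: sort on the secondary key first, then on the primary key.
# Stability keeps the name order among records with equal ages.
by_name = sorted(employee_records, key=itemgetter(0))
by_age_desc = sorted(by_name, key=itemgetter(2), reverse=True)
print(by_age_desc)

# Approach 2: a single composite key, negating the numeric field.
by_composite = sorted(employee_records, key=lambda emp: (-emp[2], emp[0]))
print(by_composite)
```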


The attrgetter and methodcaller functions come in handy when the elements of our iterable are
class instances. Look at the following example:

In [82]:
# Let us now enclose the employee records as class objects,
class employee(object):
    def __init__(self,name,id,age):
        self.name = name
        self.id = id
        self.age = age
    def pretty_print(self):
        print self.name,self.id,self.age
    def random_method(self):
        return self.age / self.id

# Now let us populate a list with these class objects.
employee_records = []
emp1 = employee('joe',1,53)
emp2 = employee('beck',2,26)
emp3 = employee('ele',6,32)

employee_records.append(emp1)
employee_records.append(emp2)
employee_records.append(emp3)

# Print the records
print "print records"
for emp in employee_records:
    emp.pretty_print()

    
print "print attrgetter sorted records"

from operator import attrgetter
employee_records_sorted = sorted(employee_records,key=attrgetter('age'))
# Now let us print the sorted list,
for emp in employee_records_sorted:
    emp.pretty_print()

print "print methodcaller sorted records"

from operator import methodcaller
employee_records_sorted = sorted(employee_records,key=methodcaller('random_method'))
for emp in employee_records_sorted:
    emp.pretty_print()


print records
joe 1 53
beck 2 26
ele 6 32
print attrgetter sorted records
beck 2 26
ele 6 32
joe 1 53
print methodcaller sorted records
ele 6 32
beck 2 26
joe 1 53


### Working with itertools
Itertools includes functions to work with iterables. 
Let’s proceed to see a set of Python scripts used to demonstrate the usage of itertools:


In [84]:
# Load libraries
from itertools import chain,compress,combinations,count,izip,islice

# 1.Chain example, where different iterables can be combined together.
a = [1,2,3]
b = ['a','b','c']
print list(chain(a,b)) # prints [1, 2, 3, 'a', 'b', 'c']

# 2.Compress example, a data selector, where the data in the first iterator
# is selected based on the second iterator.
a = [1,2,3]
b = [1,0,1]
print list(compress(a,b)) # prints [1, 3]

# 3.From a given list, return n length sub sequences.
a = [1,2,3,4]
print list(combinations(a,2)) # prints [(1, 2), (1, 3), (1, 4), (2, 3), (2,4), (3, 4)]

# 4. A counter which produces an infinite sequence of consecutive integers, given a start integer.
a = range(5)
b = izip(count(1),a)
for element in b:
    print element

# 5. Extract an iterator from another iterator.
# Let us say we want an iterator which returns only every
# alternate element from the input iterator.
a = range(100)
b = islice(a,0,100,2)
print list(b)


[1, 2, 3, 'a', 'b', 'c']
[1, 3]
[(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
(1, 0)
(2, 1)
(3, 2)
(4, 3)
(5, 4)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94, 96, 98]
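
The cell above imports izip, which exists only in Python 2. As a sketch of the same ideas under Python 3 (an assumption about the reader's environment), the built-in zip and enumerate cover the pairing, and islice can safely take a finite prefix of an infinite counter.

```python
from itertools import count, islice

a = range(5)

# Python 3 has no izip; the built-in zip is already lazy.
pairs = list(zip(count(1), a))
print(pairs)

# enumerate with a start value gives the same pairing.
same = list(enumerate(a, 1))
print(same)

# islice takes a finite prefix of an infinite iterator.
evens = list(islice(count(0, 2), 5))
print(evens)
```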
