Text Analytics with Python

# Using Anonymous Functions: Lambdas
The lambda keyword is used to define inline function objects that can be used just like regular
functions, with a few differences. The general syntax for a lambda function is shown in the
following code snippet:
> lambda arg, arg2,... arg_n : <inline expression using args>
    


In [8]:
# simple lambda function to square a number
lambda_square = lambda n: n*n
lambda_square(5)

25

In [9]:
# in Python 3, reduce lives in the functools module
from functools import reduce

In [10]:
# lambda function that adds two numbers, used with reduce to sum a list
lambda_sum = lambda x, y: x + y
reduce(lambda_sum, [1, 2, 3, 4, 5])

15

In [38]:
# map function to square numbers using lambda
map(lambda_square, [1, 2, 3, 4, 5])

<map at 0x218ef2fd248>
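In Python 3, map() returns a lazy map object rather than a list, which is why the cell above prints only an object reference. A minimal sketch of how to see the computed values, with lambda_square redefined here so that the snippet runs on its own:

```python
# redefined from the earlier cell so this snippet is self-contained
lambda_square = lambda n: n * n

# map() is lazy in Python 3; wrapping it in list() forces evaluation
squares = list(map(lambda_square, [1, 2, 3, 4, 5]))
print(squares)  # [1, 4, 9, 16, 25]
```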

The join() method is a string method that returns a new string in which the elements of a sequence of strings are joined together, using the string it is called on as the separator.

In [15]:
list1 = ['1','2','3','4'] 
s = "-"
# joins elements of list1 by '-'
# and stores the result in string s
s = s.join(list1)
s

'1-2-3-4'

In [16]:
# lambda function to make a sentence from word tokens with reduce function
lambda_sentence_maker = lambda word1, word2: ' '.join([word1,
word2])

In [17]:
reduce(lambda_sentence_maker, ['I', 'am', 'making', 'a',
'sentence', 'from', 'words!'])

'I am making a sentence from words!'

# Iterators
Iterators are constructs used to iterate through iterables. Iterables are objects that can
return their items one at a time, such as lists, strings, and other sequences of objects and
data. A good example is a for loop, which iterates through a list or any other iterable.
Iterators are objects that traverse an iterable using the next() function, which returns the next
item from the iterable at each call. Once the entire iterable has been traversed, next() raises
a StopIteration exception. We have seen how a for loop works in general; behind
the abstraction, however, the for loop calls the iter() function on the iterable to create an
iterator object and then traverses it using the next() function.
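
The iter() and next() mechanics described above can be sketched as follows. The variable names are illustrative only and are not taken from the original notebook.

```python
numbers = [1, 2, 3]

# iter() asks the iterable for an iterator object
numbers_iterator = iter(numbers)

# each next() call returns the following item
first = next(numbers_iterator)
second = next(numbers_iterator)
third = next(numbers_iterator)

# once the iterator is exhausted, next() raises StopIteration;
# a for loop catches this exception internally to know when to stop
try:
    next(numbers_iterator)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, third, exhausted)  # 1 2 3 True
```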

# Comprehensions
Comprehensions are interesting constructs that are similar to for loops but are usually more
concise and often more efficient. They fit well with the functional programming paradigm and
follow set-builder notation.



typical comprehension syntax
[ expression for item in iterable ]


equivalent for loop statement
for item in iterable:
    expression


complex and nested iterations
[ expression for item1 in iterable1 if condition1
    for item2 in iterable2 if condition2 ...
    for itemN in iterableN if conditionN ]
    
    
equivalent for loop statement
for item1 in iterable1:
    if condition1:
        for item2 in iterable2:
            if condition2:
                ...
                    for itemN in iterableN:
                        if conditionN:
                            expression
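
As a rough illustration of the nested syntax with conditions, the sketch below builds the same list twice: once with a comprehension and once with the equivalent nested for loops. The ranges and conditions are arbitrary choices made for this example.

```python
# pairs (x, y) where x is even and y is odd, via a nested comprehension
pairs = [(x, y) for x in range(4) if x % 2 == 0
         for y in range(4) if y % 2 == 1]

# the same logic written as explicit nested for loops
pairs_loop = []
for x in range(4):
    if x % 2 == 0:
        for y in range(4):
            if y % 2 == 1:
                pairs_loop.append((x, y))

print(pairs)                # [(0, 1), (0, 3), (2, 1), (2, 3)]
print(pairs == pairs_loop)  # True
```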

In [20]:
# simple list comprehension to compute squares
numbers = range(6)
[num*num for num in numbers]

[0, 1, 4, 9, 16, 25]

In [21]:
# list comprehension computing each number's remainder modulo 2 (0 means divisible by 2)
[num%2 for num in numbers]

[0, 1, 0, 1, 0, 1]

In [22]:
# set comprehension returns distinct values of the above operation
{num%2 for num in numbers}

{0, 1}

In [23]:
# dictionary comprehension where key:value is number: square(number)
{num: num*num for num in numbers}

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}

In [28]:
# a more complex comprehension showcasing above operations in a single comprehension
[{'number': num,
'square': num*num,
'type': 'even' if num%2 == 0 else 'odd'} for num in numbers]

[{'number': 0, 'square': 0, 'type': 'even'},
 {'number': 1, 'square': 1, 'type': 'odd'},
 {'number': 2, 'square': 4, 'type': 'even'},
 {'number': 3, 'square': 9, 'type': 'odd'},
 {'number': 4, 'square': 16, 'type': 'even'},
 {'number': 5, 'square': 25, 'type': 'odd'}]

In [29]:
# nested list comprehension - flattening a list of lists
list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
list_of_lists[1]

[5, 6, 7, 8]

In [30]:
[item for each_list in list_of_lists for item in each_list]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

In [31]:
# the same comprehension written with a shorter loop variable name
[item for x in list_of_lists for item in x]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

## Generators
Generators exist in two variants: functions and expressions. Generators work on a
concept known as lazy evaluation; hence, they are more memory efficient and perform
better in most cases because they do not require the entire object to be evaluated
and loaded in one go, as in list comprehensions.
Generator functions are implemented as regular functions using the def statement.
However, they use the concept of lazy evaluation and return one object at a time using
the yield statement. A regular function ends the execution of its code block as soon as
its return statement is executed. A generator function instead uses the
yield statement, which suspends execution after producing each value and resumes it
from the same point when the next value is requested. To be more precise, generator functions yield values
at each step rather than returning them. This ensures that the current state, including
information about the local code block scope, is retained and enables the generator to
resume from where it left off.

In [32]:
numbers = [1, 2, 3, 4, 5]
def generate_squares(numbers):
    for number in numbers:
        yield number*number   


In [41]:
gen_obj = generate_squares(numbers)
gen_obj

<generator object generate_squares at 0x00000218EF49BD48>

In [42]:
for item in gen_obj:
    print (item)

1
4
9
16
25
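
The suspend-and-resume behavior can also be observed by stepping through a generator manually with next(). The sketch below repeats the generate_squares function from the cell above so that it runs on its own; note that a generator object can be consumed only once.

```python
def generate_squares(numbers):
    # same generator function as in the earlier cell
    for number in numbers:
        yield number * number

gen_obj = generate_squares([1, 2, 3])

# each next() call resumes the function until the following yield
first = next(gen_obj)
second = next(gen_obj)

# the remaining values can still be consumed by a loop or by list()
remaining = list(gen_obj)

# the generator is now exhausted, so consuming it again yields nothing
leftover = list(gen_obj)

print(first, second, remaining, leftover)  # 1 4 [9] []
```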


The advantages of generators are memory efficiency and, often, shorter execution time, especially when iterables and objects are large and occupy substantial memory. You do not need to load whole objects into main memory to perform operations on them. Generators often work very well on streaming data, where you cannot keep all the data in memory at once. The same applies to generator expressions, which are very similar to comprehensions except that they are enclosed in parentheses.
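
A rough way to see the memory difference is to compare the size of a list comprehension with that of an equivalent generator expression. This is only a sketch: sys.getsizeof reports the size of the container object itself, and the exact numbers depend on the Python version and platform.

```python
import sys

# a list comprehension builds every element up front
squares_list = [n * n for n in range(100000)]

# a generator expression stores only what it needs to produce the next item
squares_gen = (n * n for n in range(100000))

# the list object is far larger than the generator object
list_is_larger = sys.getsizeof(squares_list) > sys.getsizeof(squares_gen)
print(list_is_larger)  # True

# both produce the same values when consumed
same_total = sum(squares_gen) == sum(squares_list)
print(same_total)  # True
```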

In [43]:
csv_string = 'The,fox,jumps,over,the,dog'

In [44]:
# making a sentence using list comprehension
list_cmp_obj = [item for item in csv_string.split(',')]
list_cmp_obj

['The', 'fox', 'jumps', 'over', 'the', 'dog']

In [45]:
' '.join(list_cmp_obj)

'The fox jumps over the dog'

In [46]:
# making a sentence using generator expression
gen_obj = (item for item in csv_string.split(','))
gen_obj
' '.join(gen_obj)


'The fox jumps over the dog'

## The itertools and functools Modules
Python's standard library includes many modules. Some of the popular ones include collections, itertools, and functools, which have various constructs and functions that can be used to boost productivity and reduce time spent writing extra code to solve problems. The itertools module is a complete module dedicated to building and operating on iterators. It has various functions that support different operations, including slicing, chaining, grouping, and splitting iterators.
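
A brief sketch of a few of these helpers follows. The sample data and variable names are made up for illustration; chain, islice, groupby, partial, and reduce are standard library functions.

```python
from functools import partial, reduce
from itertools import chain, groupby, islice

# chaining: treat several iterables as one continuous stream
chained = list(chain([1, 2], [3, 4], [5]))

# slicing: take items at positions 1..3 of an iterator without indexing it
sliced = list(islice(iter(range(10)), 1, 4))

# grouping: groupby groups consecutive items sharing a key,
# so the input is sorted by that key first
words = sorted(['apple', 'avocado', 'banana', 'blueberry', 'cherry'])
grouped = {letter: list(group)
           for letter, group in groupby(words, key=lambda w: w[0])}

# functools.partial fixes some arguments of a function ahead of time
int_from_binary = partial(int, base=2)
five = int_from_binary('101')

# functools.reduce folds a sequence into a single value
total = reduce(lambda x, y: x + y, chained)

print(chained)      # [1, 2, 3, 4, 5]
print(sliced)       # [1, 2, 3]
print(grouped)
print(five, total)  # 5 15
```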

# String Operations and Methods
- Basic operations
- Indexing and slicing
- Methods
- Formatting
- Regular expressions

In [50]:
# Different ways of String concatenation
'Hello' + ' and welcome ' + 'to Python!'


'Hello and welcome to Python!'

In [51]:
'Hello' ' and welcome ' 'to Python!'

'Hello and welcome to Python!'

In [52]:
# concatenation of variables and literals
s1 = 'Python!'
'Hello ' + s1

'Hello Python!'

In [53]:
# checking for substrings in a string
s3 = ('This '
      'is another way '
      'to concatenate '
      'several strings!')

'way' in s3

True

In [54]:
# computing total length of the string
len(s3)

51

## Indexing and Slicing
As mentioned, strings are iterables, that is, ordered sequences of characters. Hence they can
be indexed, sliced, and iterated through similarly to other iterables such as lists. Each
character has a specific position in the string, called its index. Using indexes, we can
access specific parts of the string. Accessing a single character using a specific position
or index in the string is called indexing, and accessing a part of a string, for example,
a substring using a start and end index, is called slicing.

To access any particular character in the string, you need to use the corresponding
index, and slices can be extracted using the syntax var[start:stop], which extracts all
characters in the string var from index start up to, but not including, the character at the
stop index.

In [56]:
# creating a string
s = 'PYTHON'

In [57]:
# depicting string indexes
for index, character in enumerate(s):
    print ('Character', character+':', 'has index:', index)

Character P: has index: 0
Character Y: has index: 1
Character T: has index: 2
Character H: has index: 3
Character O: has index: 4
Character N: has index: 5


In [58]:
# string indexing
s[0], s[1], s[2], s[3], s[4], s[5]

('P', 'Y', 'T', 'H', 'O', 'N')
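
The cells above cover indexing; slicing with var[start:stop] can be sketched in the same way. The negative index and the step value in the last two lines go slightly beyond the syntax described in the text and are included only as a pointer to standard Python slicing behavior.

```python
s = 'PYTHON'

first_three = s[0:3]   # characters at indexes 0, 1, 2; index 3 is excluded
from_three = s[3:]     # omitting stop runs to the end of the string
up_to_three = s[:3]    # omitting start begins at index 0
last_char = s[-1]      # negative indexes count from the end
every_second = s[::2]  # an optional third value sets the step

print(first_three, from_three, up_to_three, last_char, every_second)
# PYT HON PYT N PTO
```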

### Methods

In [59]:
# case conversions
s = 'python is great'
s.upper()

'PYTHON IS GREAT'

In [60]:
# string replace
s.replace('python', 'analytics')

'analytics is great'

In [61]:
# string splitting and joining
s = 'I,am,a,comma,separated,string'
s.split(',')

['I', 'am', 'a', 'comma', 'separated', 'string']

In [62]:
' '.join(s.split(','))

'I am a comma separated string'

In [63]:
# stripping whitespace characters
s = '  I am surrounded by spaces  '
s.strip()

'I am surrounded by spaces'

In [65]:
# converting to title case
s = 'this is in lower case'
s.title()

'This Is In Lower Case'

### Formatting
String formatting is used to substitute specific data objects and types in a string. This
is mostly used when displaying text to the user. There are mainly two different types of
formatting used for strings:
- Formatting expressions: These expressions are typically of the
syntax '...%s...%s...' % (values), where each %s denotes a
placeholder for substituting a string from the tuple of values given
in values. This is quite similar to the C-style printf model and has
been in Python since the beginning. You can substitute values
of other types by using the respective conversion character after the % symbol,
like %d for integers and %f for floating point numbers.
- Formatting methods: These strings take the form of
'...{}...{}...'.format(values), which makes use of the braces {}
as placeholders that the format method fills with the items in values.
These have been present in Python since version 2.6.

In [66]:
# simple string formatting expressions
'Hello %s' %('Python!')

'Hello Python!'

In [67]:
# formatting expressions with different data types
'We have %d %s containing %.3f gallons of %s' %(2, 'bottles', 2.5, 'milk')

'We have 2 bottles containing 2.500 gallons of milk'

In [68]:
# formatting using the format method
'Hello {} {}, it is a great {} to meet you'.format('Mr.', 'Jones',
'pleasure')

'Hello Mr. Jones, it is a great pleasure to meet you'

In [69]:
# alternative ways of using format
'I have a {food_item} and a {drink_item} with me'.format(drink_item='soda', food_item='sandwich')

'I have a sandwich and a soda with me'

In [70]:
'The {animal} has the following attributes: {attributes}'.format(animal='dog', attributes=['lazy', 'loyal'])

"The dog has the following attributes: ['lazy', 'loyal']"

### Regular Expressions (Regexes)
Regular expressions, also called regexes, allow you to create string patterns and use them
for searching and substituting specific pattern matches in textual data. Python offers a
rich module named re for creating and using regular expressions. Entire books have been
written on this topic because it is easy to use but difficult to master. Discussing every
aspect of regular expressions would not be possible in these pages, but I will cover the
main areas with sufficient examples.

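The re module also provides flags that change how a pattern is matched. Two commonly
used flags are the following:

- re.I or re.IGNORECASE is used to match patterns ignoring case
sensitivity.
- re.S or re.DOTALL causes the period ( . ) character to match any
character including new lines.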

For pattern matching, various rules are used in regexes. Some popular ones include
the following:

- . for matching any single character
- ^ for matching the start of the string
- $ for matching the end of the string
- * for matching zero or more occurrences of the regex that
precedes the * symbol in the pattern
- ? for matching zero or one occurrence of the regex that
precedes the ? symbol in the pattern
- [...] for matching any one of the set of characters inside the
square brackets
- [^...] for matching a character not present in the square
brackets after the ^ symbol
- | denotes the OR operator for matching either the preceding or
the next regex
- + for matching one or more occurrences of the regex that
precedes the + symbol in the pattern
- \d for matching decimal digits, also depicted as [0-9]
- \D for matching non-digits, also depicted as [^0-9]
- \s for matching whitespace characters
- \S for matching non-whitespace characters
- \w for matching alphanumeric characters and the underscore, also depicted as
[a-zA-Z0-9_]
- \W for matching any character that is not alphanumeric or an underscore, also depicted as
[^a-zA-Z0-9_]
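
A few of these rules can be tried out with re.findall() and re.search(). The sample strings below are invented for illustration.

```python
import re

# \d+ : one or more digits
digits = re.findall(r'\d+', 'Order 66 shipped to bob_42')

# ? : the preceding 'u' is optional
spellings = re.findall(r'colou?r', 'color and colour')

# [...] and [^...] : any listed character, or any character not listed
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

# | : either alternative
animals = re.findall(r'cat|dog', 'cat, dog, bird')

# \w+ : runs of letters, digits, and underscores
words = re.findall(r'\w+', 'bob_42 rocks!')

# ^ and $ : anchor the pattern to the start and end of the string
starts_with_order = re.search(r'^Order', 'Order 66') is not None
ends_with_digit = re.search(r'\d$', 'Order 66') is not None

print(digits, spellings, vowels, non_vowels, animals, words)
print(starts_with_order, ends_with_digit)  # True True
```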

Regular expressions can be compiled into pattern objects and then used with a
variety of methods for pattern search and substitution in strings. The main methods
offered by the re module for performing these operations are as follows:
- re.compile(): This method compiles a specified regular
expression pattern into a regular expression object that can be
used for matching and searching. It takes a pattern and optional
flags, discussed earlier, as input.
- re.match(): This method is used to match patterns at the
beginning of strings.
- re.search(): This method is used to match patterns occurring at
any position in the string.
- re.findall(): This method returns all non-overlapping matches
of the specified regex pattern in the string.
- re.finditer(): This method returns all matched instances in the
form of an iterator for a specific pattern in a string when scanned
from left to right.
- re.sub(): This method is used to substitute a specified regex
pattern in a string with a replacement string. By default it replaces
every non-overlapping occurrence of the pattern, scanning from left to
right; the optional count argument limits the number of replacements.
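
The methods listed above can be sketched together on one sample string. The string matches s2 from the cells that follow; the replacement word 'Java' is an arbitrary choice for this example.

```python
import re

s2 = 'I love the Python language. I also use Python to build applications at work!'

# compile once and reuse; re.IGNORECASE lets 'python' match 'Python'
pattern = re.compile('python', flags=re.IGNORECASE)

at_start = pattern.match(s2)     # None, since s2 does not start with the pattern
first_hit = pattern.search(s2)   # first match anywhere in the string
all_hits = pattern.findall(s2)   # list of all matched substrings
spans = [m.span() for m in pattern.finditer(s2)]  # positions of each match

# sub replaces every match unless count is supplied
replaced_all = pattern.sub('Java', s2)
replaced_one = pattern.sub('Java', s2, count=1)

print(at_start)          # None
print(first_hit.span())  # (11, 17)
print(all_hits)          # ['Python', 'Python']
print(spans)             # [(11, 17), (39, 45)]
print(replaced_all)
print(replaced_one)
```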

In [71]:
import re

In [72]:
# setting up a pattern we want to use as a regex
# also creating two sample strings

In [73]:
pattern = 'python'
s1 = 'Python is an excellent language'
s2 = 'I love the Python language. I also use Python to build applications at work!'

In [74]:
# match only returns a match if regex match is found at the beginning of the string
re.match(pattern, s1)

In [76]:
# pattern is in lower case hence ignore case flag helps
re.match(pattern, s1, flags=re.IGNORECASE)

<re.Match object; span=(0, 6), match='Python'>