Note: This notebook was created alongside DataCamp's course of the same name

# Regular Expressions in Python
As a data scientist, you will encounter many situations where you will need to extract key information from huge corpora of text, clean messy data containing strings, or detect and match patterns to find useful words. All of these situations are part of text mining and are an important step before applying machine learning algorithms. This course will take you through understanding compelling concepts about string manipulation and regular expressions. You will learn how to split strings, join them back together, interpolate them, as well as detect, extract, replace, and match strings using regular expressions. On the journey to master these skills, you will work with datasets containing movie reviews or streamed tweets that can be used to determine opinion, as well as with raw text scraped from the web.

**Instructor:** Maria Eugenia Inzaugarat, PhD in Data Science

# $\star$ Chapter 1: Basic Concepts of String Manipulation
Start your journey into the regular expression world! This chapter covers slicing and concatenating, adjusting case, removing whitespace, and finding and replacing strings. You will learn the basic operations for string manipulation using a movie review dataset.

### Introduction to string manipulation

#### Why regex is important
* Clean a dataset to prepare it for text mining or sentiment analysis
* Process email content to feed a machine learning algorithm that decides whether an email is spam
* Parse and extract specific data from a website to build a database
* Learning to manipulate strings and master regular expressions will allow you to perform these tasks faster and more efficiently

#### String functions
* **`str()`** returns the string representation of an object as seen in the code
* **`len()`** returns the number of characters in a string
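* As a rough sketch of both functions (the sample values and variable names are assumptions, not taken from the course):

```python
# Illustrative values; these names are hypothetical, not from the course
number = 42
number_as_text = str(number)   # convert an int to its string form
review = "Awesome day"

print(number_as_text)          # 42
print(len(number_as_text))     # 2
print(len(review))             # 11 (the space counts as a character)
```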

#### Concatenation
* Concatenate: `+` operator
* Applying the plus operator to "sum up" both strings (also specifying the space in between) generates the output seen below:

In [1]:
my_string1 = "Awesome day"
my_string2 = "for biking"
print(my_string1+" "+my_string2)

Awesome day for biking


#### Indexing
* Individual characters of a string can be accessed directly using an index
* **String slicing also accepts a third index: STRIDE**
* `string[beginning:end:stride]`
    * Stride is the step between retrieved characters: a stride of `2` takes every second character, skipping one in between
    
<img src='data/string1.png' width="300" height="150" align="center"/>

In [3]:
my_string = 'MY STRING'
print(my_string[0:-1:2])

M TI


In [4]:
my_string = 'MY STRING'
print(my_string[0:-1:3])

MSI


* Interestingly, **omitting the first and second indices, and designating a `-1` step for stride, returns a reversed string** as shown below:

In [5]:
print(my_string[::-1])

GNIRTS YM


In [6]:
print(my_string[::-2])

GIT M


### String operations
* **`.lower()`** : converts all alphabetic characters to lowercase
* **`.upper()`** : converts all alphabetic characters to uppercase
* **`.capitalize()`** : returns a copy of the string with the first letter capitalized (while keeping all other characters in lowercase)
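* A minimal illustration of the three methods, using a made-up sample string rather than the course dataset:

```python
# Hypothetical sample text, not taken from the movie review dataset
review = "tHIS movie IS great"

print(review.lower())       # this movie is great
print(review.upper())       # THIS MOVIE IS GREAT
print(review.capitalize())  # This movie is great
```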

### Splitting
* `my_string = "This string will be split"`
* Splitting a string into a list of substrings:
    * Both methods take a separator (`sep`) on which to split and a `maxsplit` argument that sets the maximum number of splits to perform (so the result has at most `maxsplit + 1` elements)
    * The difference between the two methods is that `split` starts splitting from the left, while `rsplit` starts splitting from the right
    * If `maxsplit` is not specified, both methods behave in the same way
    * By default, `sep` is any run of whitespace, so you don't need to pass the argument to split on spaces
* **`.split()`**
    * `my_string.split(sep=" ", maxsplit=2)`

* **`.rsplit()`**
    * `my_string.rsplit(sep=" ", maxsplit=2)`

In [7]:
my_string = "This string will be split"

In [8]:
my_string.split(sep=" ", maxsplit=2)

['This', 'string', 'will be split']

In [9]:
my_string.rsplit(sep=" ", maxsplit=2)

['This string will', 'be', 'split']
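
* The default separator is not quite the same as passing `" "` explicitly. A small sketch, assuming a hypothetical string with a double space, shows the difference:

```python
# Hypothetical string with irregular spacing (two spaces after "This")
spaced = "This  string will be split"

# No arguments: consecutive whitespace counts as a single separator
print(spaced.split())
# ['This', 'string', 'will', 'be', 'split']

# Explicit single-space separator: the double space yields an empty string
print(spaced.split(" "))
# ['This', '', 'string', 'will', 'be', 'split']

# Without maxsplit, split and rsplit return the same list
print(spaced.split() == spaced.rsplit())  # True
```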

### Escape sequences
* Some **escape sequences**, such as `\n` or `\r`, indicate a line boundary

<img src='data/escape_sequences.png' width="200" height="100" align="center"/>

In [11]:
my_string_n = "This string will be split\nin two"
print(my_string_n)

This string will be split
in two


In [12]:
my_string_r = "This string will be split\rin two"
print(my_string_r)

This string will be splitin two


* Python method **`splitlines()`** for **breaking at line boundaries**:

In [14]:
my_string_n.splitlines()

['This string will be split', 'in two']

* As you can see, the string is split at the `\n` sequence, returning a list of two elements
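* `splitlines()` also recognizes other line boundaries, such as `\r` and `\r\n`. A short sketch with made-up strings:

```python
# Sample strings using three different line-ending conventions
unix_text = "first line\nsecond line"
old_mac_text = "first line\rsecond line"
windows_text = "first line\r\nsecond line"

for text in (unix_text, old_mac_text, windows_text):
    print(text.splitlines())   # ['first line', 'second line'] each time

# keepends=True keeps the line-ending characters in the result
print(windows_text.splitlines(keepends=True))
# ['first line\r\n', 'second line']
```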

#### Joining
* The `.join()` method concatenates the strings in a list or other iterable into a single string
    * `sep.join(iterable)`
    * **Syntax:**
        * The method is called on the separating element (`" "` or `"_"`, for example)
        * Inside the call, we specify the list or other iterable
        * The result is a single string containing all the elements of the iterable, separated by `sep`

In [15]:
my_list = ["this", "would", "be", "a", "string"]
print(" ".join(my_list))

this would be a string
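
* The separator can be any string, including an empty one. A brief sketch (the `numbers` list is a hypothetical addition):

```python
my_list = ["this", "would", "be", "a", "string"]

print("_".join(my_list))    # this_would_be_a_string
print("".join(my_list))     # thiswouldbeastring

# join expects strings, so other types must be converted first
numbers = [1, 2, 3]         # hypothetical data
print(", ".join(str(n) for n in numbers))   # 1, 2, 3
```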


### Stripping characters
* Methods that trim characters from the ends of a string
* `.strip()` removes both leading and trailing characters
    * Inside the call, we can specify the set of characters to be stripped
    * Default is whitespace (including escape sequences such as `\n`)

In [16]:
my_string = " This string will be stripped "

In [17]:
my_string.strip()

'This string will be stripped'

In [18]:
my_string2 = " This string will be stripped\n"
my_string2.strip()

'This string will be stripped'

* **Notice that both leading and trailing whitespace, as well as the trailing escape sequence, were removed.**
* We can also apply `.rstrip()`, which returns a string with only the trailing whitespace and/or trailing escape sequence removed

In [23]:
my_string.rstrip()

' This string will be stripped'

In [20]:
my_string2.rstrip()

' This string will be stripped'

* **If we apply the `.lstrip()` method, we'll get a string with the *leading* whitespace eliminated**
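* A quick sketch of `.lstrip()` and of passing characters to strip; the `tagged` string is a made-up example:

```python
my_string = " This string will be stripped "
print(repr(my_string.lstrip()))   # 'This string will be stripped '

# Hypothetical example: the argument is a set of characters, not a prefix
tagged = "$$!!Great movie!!$$"
print(tagged.strip("$!"))         # Great movie
print(tagged.lstrip("$"))         # !!Great movie!!$$
```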

### Finding and replacing
* Python has several built-in methods that will help you search a target string for a specified substring:

#### `.find()`
* The `find()` method returns the lowest index in the string where it can find the substring (and then stops "searching")
* If the substring is **not** found, `-1` is returned

<img src='data/string2.png' width="600" height="300" align="center"/>

In [24]:
my_string = "Where's Waldo?"
my_string.find("Waldo")

8

In [25]:
my_string.find("Wenda")

-1

In [26]:
my_string.find("Waldo", 0, 6)

-1

* Note that the substring to find must exist *completely* within the start and end indices. As shown below, it is not enough to have just *part* of the substring within the indices.

In [27]:
my_string.find("Waldo", 6, 11)

-1

#### The `.index()` method
* Similar to `.find()`, searches target string for a specified substring
* *However*, if we search for a substring that does not exist (within the indices provided), instead of returning `-1`, a `ValueError` will be raised

In [28]:
my_string.index("Waldo")

8

<img src='data/string3.png' width="600" height="300" align="center"/>

* We can handle this error using a `try` / `except` block

In [31]:
try:
    my_string.index("Wenda")
except ValueError:
    print("Not found")    

Not found


In [32]:
try:
    my_string.index("Wenda")
except:
    print("Not found")    

Not found


#### Counting occurrences
* The **`.count()`** method searches for a specified substring in the target string
* It **returns the number of non-overlapping occurrences**; in other words, how many times the substring is present in the string

<img src='data/string4.png' width="600" height="300" align="center"/>

In [34]:
my_string = "How many fruits do you have in your fruit basket?"

In [35]:
my_string.count("fruit")

2

In [36]:
my_string.count("fruit", 0, 16)

1
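
* To see what "non-overlapping" means in practice, here is a small sketch using made-up strings:

```python
# Hypothetical strings chosen to make the overlap visible
print("aaaa".count("aa"))      # 2, not 3: matches start at index 0 and 2 only
print("banana".count("ana"))   # 1: the second "ana" overlaps the first
```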

#### Replacing substrings
* The `.replace()` method replaces occurrences of a substring with a new substring
* **Note:** It returns a *copy* of the string; the original is unchanged unless the result is assigned back to the variable

<img src='data/string5.png' width="600" height="300" align="center"/>
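
* A minimal sketch of `.replace()`, assuming an illustrative sentence; the optional third argument limits how many occurrences are replaced:

```python
# Illustrative sentence, assumed for this sketch
my_string = "The red house is between the blue house and the old house"

replaced_all = my_string.replace("house", "car")
print(replaced_all)
# The red car is between the blue car and the old car

# Optional third argument: replace only the first two occurrences
replaced_two = my_string.replace("house", "car", 2)
print(replaced_two)
# The red car is between the blue car and the old house

# The original string is unchanged because strings are immutable
print(my_string)
```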

#### Exercises: Finding a substring

```
for movie in movies:
    # If actor is not found between character 37 and 41 inclusive
    # Print word not found
    if movie.find("actor", 37, 42) == -1:
        print("Word not found")
    # Count occurrences and replace two with one
    elif movie.count("actor") == 2:  
        print(movie.replace("actor actor", "actor"))
    else:
        # Replace three occurrences with one
        print(movie.replace("actor actor actor", "actor"))
```

#### Exercises: Where's the word?

```
for movie in movies:
  # Find the first occurrence of word
  print(movie.find("money", 12, 51))
```

```
for movie in movies:
  try:
    # Find the first occurrence of word
    print(movie.index("money", 12, 51))
  except ValueError:
    print("substring not found")
```

#### Exercises: Replacing negations

```
# Replace negations 
movies_no_negation = movies.replace("isn't", "is")

# Replace important
movies_antonym = movies_no_negation.replace("important", "insignificant")

# Print out
print(movies_antonym)
```

# $\star$ Chapter 2: Formatting strings
Following your journey, you will learn the main approaches that can be used to format or interpolate strings in Python using a dataset containing information scraped from the web. You will explore the advantages and disadvantages of using positional formatting, embedding expressions inside string constants, and using the Template class.

### Positional formatting
#### What is string formatting?
* **String formatting** is also called **string interpolation**
* String formatting is **the process of inserting a custom string in a predefined text.**

In [38]:
custom_string = "String formatting"
print(f"{custom_string} is a powerful technique")

String formatting is a powerful technique


#### Usage:
* Insert a title in a graph
* Show message or error
* Pass a statement to a function
* Print model outputs in sentence format

## Methods for formatting
* Modern versions of Python have three main approaches to string formatting
    * **Positional formatting**
    * **Formatted string literals**
    * **Template method**
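* As a side-by-side sketch of the three approaches (the `tool` value and sentence are assumptions chosen for illustration; `Template` comes from the standard `string` module):

```python
from string import Template

tool = "Python"   # illustrative value

# 1. Positional formatting with str.format()
positional = "{} is great for text mining".format(tool)

# 2. Formatted string literal (f-string, Python 3.6+)
literal = f"{tool} is great for text mining"

# 3. Template strings from the standard library
templated = Template("$tool is great for text mining").substitute(tool=tool)

print(positional)
print(literal)
print(templated)
# Each line prints: Python is great for text mining
```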
    
### Positional formatting
* We put placeholders, defined by a pair of curly brackets, in a text
* We call the string `.format()` method
* Pass the desired value into the method
* The method replaces the placeholders using the values in order of appearance

<img src='data/string6.png' width="600" height="300" align="center"/>

In [39]:
print("Machine learning provides {} with the ability to learn {}".format("systems", "automatically"))

Machine learning provides systems with the ability to learn automatically


* We can also use variables for both the initial string and the values passed into the method:

<img src='data/string7.png' width="600" height="300" align="center"/>

In [40]:
my_string = "{} rely on {} datasets"
method = "Supervised algorithms"
condition = "labeled"

In [41]:
print(my_string.format(method, condition))

Supervised algorithms rely on labeled datasets


### Reordering values
* We can add index numbers inside the curly braces to reorder values.
* This changes the order in which the method replaces the placeholders.
* In the examples so far, we left the placeholders empty, so the method replaced them with the values in the given order.

In [42]:
print("{} has a friend called {} and a sister called {}".format("Betty", "Linda", "Daisy"))

Betty has a friend called Linda and a sister called Daisy


* If we add the index numbers, the replacement order changes accordingly:

In [43]:
print("{2} has a friend called {0} and a sister called {1}".format("Betty", "Linda", "Daisy"))

Daisy has a friend called Betty and a sister called Linda


<img src='data/string8.png' width="600" height="300" align="center"/>

### Named placeholders
* Specify a name for the placeholders

<img src='data/string9.png' width="600" height="300" align="center"/>

In [44]:
tool = "Unsupervised algorithms"
goal = "patterns"

In [45]:
print("{title} try to find {aim} in the dataset".format(title=tool, aim=goal))

Unsupervised algorithms try to find patterns in the dataset


* **We can also use dictionaries for named placeholders:**

<img src='data/string10.png' width="600" height="300" align="center"/>

In [46]:
my_methods = {"tool": "Unsupervised algorithms", "goal": "patterns"}

In [47]:
print('{data[tool]} try to find {data[goal]} in the dataset.'.format(data=my_methods))

Unsupervised algorithms try to find patterns in the dataset.


* **Note above that `my_methods` is passed to the `.format()` call as the keyword argument `data`.**
* Also **note** that the dictionary key inside the placeholder must be written without quotes.
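* A short sketch of what happens when the key is quoted, plus one alternative (unpacking the dictionary with `**`, which is not shown in the course material above):

```python
my_methods = {"tool": "Unsupervised algorithms", "goal": "patterns"}

# Unquoted key: works
print("{data[tool]} find {data[goal]}".format(data=my_methods))
# Unsupervised algorithms find patterns

# Quoted key: .format() looks up the literal key '"tool"' and fails
try:
    print('{data["tool"]}'.format(data=my_methods))
except KeyError as err:
    print("KeyError:", err)

# Alternative: unpack the dictionary into keyword arguments
print("{tool} find {goal}".format(**my_methods))
# Unsupervised algorithms find patterns
```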

### Format specifiers
* Specify how a value should be formatted: `{index:specifier}`
* This defines how individual values are presented
* **One of the most common format specifiers is *float*, represented by the letter *f*.**
* In the code below, we specify that the value passed with index 0 will be formatted as a float

<img src='data/string11.png' width="600" height="300" align="center"/>

* **We could also add `.2f` to indicate that we want the float to have 2 decimals**

In [None]:
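# The ending of the sentence and the values passed to .format() are illustrative;
# the original cell stopped before the .format() call
print("Only {0:f}% of the {1} produced worldwide is {2}!".format(0.5155675, "data", "analyzed"))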
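
* A minimal sketch of a few common specifiers, using a hypothetical value (none of these numbers come from the course):

```python
value = 0.5155677   # hypothetical measurement

default_float = "{0:f}".format(value)     # f defaults to 6 decimal places
two_decimals = "{0:.2f}".format(value)    # round to 2 decimal places
as_percent = "{0:.1%}".format(value)      # multiply by 100 and add a % sign
as_integer = "{0:d} reviews".format(1250) # d formats an integer

print(default_float)   # 0.515568
print(two_decimals)    # 0.52
print(as_percent)      # 51.6%
print(as_integer)      # 1250 reviews
```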

### Formatting datetime

In [48]:
from datetime import datetime

In [49]:
print(datetime.now())

2022-01-31 12:53:43.011728


* The default output shown above includes the full date and the time down to microseconds
* We can use **format specifiers** to control which parts are shown and how

<img src='data/string12.png' width="600" height="300" align="center"/>

In [50]:
print("Today's date is {:%Y-%m-%d %H:%M}".format(datetime.now()))

Today's date is 2022-01-31 12:55
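
* Because `datetime.now()` changes on every run, a reproducible sketch can use a fixed timestamp instead (the value below is hypothetical):

```python
from datetime import datetime

# Fixed, hypothetical timestamp so the output is reproducible
moment = datetime(2022, 1, 31, 12, 55, 43)

iso_like = "{:%Y-%m-%d %H:%M}".format(moment)
day_first = "{:%d/%m/%Y}".format(moment)
time_only = "{when:%H:%M:%S}".format(when=moment)

print(iso_like)    # 2022-01-31 12:55
print(day_first)   # 31/01/2022
print(time_only)   # 12:55:43
```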


#### Exercises: Put it in order!

```
# Assign the substrings to the variables
first_pos = wikipedia_article[3:19].lower()
second_pos = wikipedia_article[21:44].lower()

# Define string with placeholders 
my_list.append("The tool {} is used in {}")

# Define string with rearranged placeholders
my_list.append("The tool {1} is used in {0}")

# Use format to print strings
for my_string in my_list:
    print(my_string.format(first_pos, second_pos))
```

#### Exercises: Calling by its name

```
# Create a dictionary
plan = {
    "field": courses[0],
    "tool": courses[1]
}

# Complete the placeholders accessing elements of field and tool keys in the data dictionary
my_message = "If you are interested in {data[field]}, you can take the course related to {data[tool]}"

# Use the plan dictionary to replace placeholders
print(my_message.format(data=plan))
```

#### Exercises: What day is today?

```
# Import datetime 
from datetime import datetime

# Assign date to get_date
get_date = datetime.now()

# Add named placeholders with format specifiers
message = "Good morning. Today is {today:%B %d, %Y}. It's {today:%H:%M} ... time to work!"

# Use the format method replacing the placeholder with get_date
print(message.format(today=get_date))
```

<img src='data/string.png' width="600" height="300" align="center"/>