Note: This notebook was created alongside DataCamp's course of the same name

# Regular Expressions in Python
As a data scientist, you will encounter many situations where you will need to extract key information from huge corpora of text, clean messy data containing strings, or detect and match patterns to find useful words. All of these situations are part of text mining and are an important step before applying machine learning algorithms. This course will take you through understanding compelling concepts about string manipulation and regular expressions. You will learn how to split strings, join them back together, interpolate them, as well as detect, extract, replace, and match strings using regular expressions. On the journey to master these skills, you will work with datasets containing movie reviews or streamed tweets that can be used to determine opinion, as well as with raw text scraped from the web.

**Instructor:** Maria Eugenia Inzaugarat, PhD in Data Science

# $\star$ Chapter 1: Basic Concepts of String Manipulation
Start your journey into the regular expression world! From slicing and concatenating, adjusting the case, and removing spaces, to finding and replacing strings, you will learn how to master basic operations for string manipulation using a movie review dataset.

### Introduction to string manipulation

#### Why regex is important
* Clean a dataset to prepare it for text mining or sentiment analysis
* Process email content to feed a machine learning algorithm that decides whether an email is spam
* Parse and extract specific data from a website to build a database
* Learning to manipulate strings and master regular expressions will allow you to perform these tasks faster and more efficiently

#### String functions
* **`str()`** returns the string representation of an object
* **`len()`** returns the number of characters in a string
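
The two functions above can be illustrated with a minimal sketch. The variable names and values here are hypothetical, chosen only for illustration:

```python
# Hypothetical value, used only for illustration
my_number = 123

# str() converts an object to its string representation
number_as_string = str(my_number)
print(number_as_string)        # prints 123 (now a str, not an int)
print(type(number_as_string))  # prints <class 'str'>

# len() returns the number of characters in a string (spaces included)
my_string = "Awesome day"
print(len(my_string))          # prints 11
```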

#### Concatenation
* Concatenate: `+` operator
* Applying the `+` operator to both strings (with an explicit space in between) generates the output seen below:

In [1]:
my_string1 = "Awesome day"
my_string2 = "for biking"
print(my_string1+" "+my_string2)

Awesome day for biking


#### Indexing
* Individual characters of a string can be accessed directly using an index
* **String slicing also accepts a third index: STRIDE**
* `string[beginning:end:stride]`
    * Stride specifies the step between retrieved characters: a stride of 2 takes every second character, a stride of 3 every third, and so on
    
<img src='data/string1.png' width="300" height="150" align="center"/>

In [3]:
my_string = 'MY STRING'
print(my_string[0:-1:2])

M TI


In [4]:
my_string = 'MY STRING'
print(my_string[0:-1:3])

MSI


* Interestingly, **omitting the first and second indices, and designating a `-1` step for stride, returns a reversed string** as shown below:

In [5]:
print(my_string[::-1])

GNIRTS YM


In [6]:
print(my_string[::-2])

GIT M
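
For reference, plain indexing and slicing on the same string can be sketched as follows. The variable names are hypothetical; the comments show the value each expression evaluates to:

```python
my_string = "MY STRING"

# Indexing starts at 0; negative indices count from the end
first_char = my_string[0]       # 'M'
last_char = my_string[-1]       # 'G'

# Slices include the start index and exclude the end index
first_word = my_string[0:2]     # 'MY'
second_word = my_string[3:]     # 'STRING'

# With a stride of 2 and no end index, the last character is included
every_second = my_string[::2]   # 'M TIG'

print(first_char, last_char, first_word, second_word, every_second)
```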


### String operations
* **`.lower()`** : converts all alphabetic characters to lowercase
* **`.upper()`** : converts all alphabetic characters to uppercase
* **`.capitalize()`** : returns a copy of the string with the first letter capitalized (while keeping all other characters in lowercase)
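
A short sketch of the three case methods, using a hypothetical mixed-case string. Each method returns a new string and leaves the original unchanged:

```python
# Hypothetical example string with inconsistent casing
my_string = "tHis Is a niCe StriNg"

lowered = my_string.lower()
uppered = my_string.upper()
capitalized = my_string.capitalize()

print(lowered)      # prints this is a nice string
print(uppered)      # prints THIS IS A NICE STRING
print(capitalized)  # prints This is a nice string
```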

### Splitting
* `my_string = "This string will be split"`
* Splitting a string into a list of substrings with `.split()` and `.rsplit()`:
    * Both methods take a separator (`sep`) on which to split the string and a `maxsplit` that sets the maximum number of splits to perform, so the result has at most `maxsplit + 1` elements
    * The difference between the two methods is that `split` starts splitting from the left, while `rsplit` starts splitting from the right
    * If `maxsplit` is not specified, both methods return the same result
    * If `sep` is not specified, the string is split on runs of whitespace, so you don't need to pass the argument; this differs slightly from `sep=" "`, which splits on every single space
* **`.split()`**
    * `my_string.split(sep=" ", maxsplit=2)`

* **`.rsplit()`**
    * `my_string.rsplit(sep=" ", maxsplit=2)`

In [7]:
my_string = "This string will be split"

In [8]:
my_string.split(sep=" ", maxsplit=2)

['This', 'string', 'will be split']

In [9]:
my_string.rsplit(sep=" ", maxsplit=2)

['This string will', 'be', 'split']
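
The difference between the default separator and an explicit `" "` only shows up when the text contains repeated or mixed whitespace. A minimal sketch, using a hypothetical string with a double space and a tab:

```python
# Hypothetical string containing two consecutive spaces and a tab
spaced = "This  string\twill be split"

# Default separator: any run of whitespace counts as one separator
default_split = spaced.split()
print(default_split)   # ['This', 'string', 'will', 'be', 'split']

# Explicit " ": every single space is a separator, tabs are not
space_split = spaced.split(" ")
print(space_split)     # ['This', '', 'string\twill', 'be', 'split']

# maxsplit=1 performs one split, giving two elements
my_string = "This string will be split"
left_once = my_string.split(maxsplit=1)
right_once = my_string.rsplit(maxsplit=1)
print(left_once)       # ['This', 'string will be split']
print(right_once)      # ['This string will be', 'split']
```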

### Escape sequences
* There are some **escape sequences**, such as `\n` or `\r`, that indicate a line boundary

<img src='data/escape_sequences.png' width="200" height="100" align="center"/>

In [11]:
my_string_n = "This string will be split\nin two"
print(my_string_n)

This string will be split
in two


In [12]:
my_string_r = "This string will be split\rin two"
print(my_string_r)

This string will be splitin two


* Python method **`splitlines()`** for **breaking at line boundaries**:

In [14]:
my_string_n.splitlines()

['This string will be split', 'in two']

* As you can see, the string is split at the `\n` sequence, returning a list of two elements
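
`splitlines()` also treats `\r` and `\r\n` as line boundaries, which a plain `split("\n")` does not. A small sketch with a hypothetical string that mixes all three:

```python
# Hypothetical string mixing \n, \r and \r\n line endings
mixed = "first line\nsecond line\rthird line\r\nfourth line"

# splitlines() breaks at every kind of line boundary
lines = mixed.splitlines()
print(lines)
# ['first line', 'second line', 'third line', 'fourth line']

# split("\n") only breaks at \n and leaves the \r characters in place
naive = mixed.split("\n")
print(naive)
# ['first line', 'second line\rthird line\r', 'fourth line']
```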

### Joining
* Some methods can paste or concatenate together the strings in a list or other iterable
    * `sep.join(iterable)`
    * **Syntax:**
        * The separating element comes first (`" "` or `"_"`, for example)
        * Inside the call, we pass the list or other iterable
        * The result is a single string containing all the elements of the iterable, separated by the chosen separator

In [15]:
my_list = ["this", "would", "be", "a", "string"]
print(" ".join(my_list))

this would be a string
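
The separator can be any string, including an empty one. The sketch below also shows one possible way to join non-string items, which must be converted first because `join` only accepts strings. The `numbers` list is a hypothetical example:

```python
my_list = ["this", "would", "be", "a", "string"]

# Any string can act as the separator
underscored = "_".join(my_list)
print(underscored)   # prints this_would_be_a_string

# An empty separator glues the elements together directly
glued = "".join(my_list)
print(glued)         # prints thiswouldbeastring

# Non-string items must be converted with str() before joining
numbers = [1, 2, 3]  # hypothetical data
joined_numbers = ", ".join(str(n) for n in numbers)
print(joined_numbers)  # prints 1, 2, 3
```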


### Stripping characters
* Methods that will trim characters from a string
* `.strip()` removes both leading and trailing characters
    * Inside the call, we can specify the set of characters to be stripped
    * Default is whitespace

In [16]:
my_string = " This string will be stripped "

In [17]:
my_string.strip()

'This string will be stripped'

In [18]:
my_string2 = " This string will be stripped\n"
my_string2.strip()

'This string will be stripped'

* **Notice that both leading and trailing whitespace, as well as the trailing escape sequence, were removed.**
* We can also apply `.rstrip()`, which returns a string with only the trailing whitespace and/or trailing escape sequences removed

In [23]:
my_string.rstrip()

' This string will be stripped'

In [20]:
my_string2.rstrip()

' This string will be stripped'

* **If we apply the `.lstrip()` method, we'll get a string with the leading whitespace eliminated**
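
A minimal sketch of `.lstrip()` and of passing characters to `.strip()`. The `tagged` and `padded` strings are hypothetical examples; note that the argument is treated as a set of characters, not as a literal prefix or suffix:

```python
my_string = " This string will be stripped "

# lstrip() removes leading whitespace only; the trailing space remains
left_stripped = my_string.lstrip()
print(repr(left_stripped))   # 'This string will be stripped '

# Passing a character strips that character from both ends
tagged = "$$$This string will be stripped$$"
untagged = tagged.strip("$")
print(untagged)              # prints This string will be stripped

# With several characters, any of them is removed from either end
padded = "xyxHello worldyx"
unpadded = padded.strip("xy")
print(unpadded)              # prints Hello world
```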

<img src='data/string.png' width="600" height="300" align="center"/>