## __Manipulating Text with Regular Expressions__


In this lesson, we read a text file, clean its contents with regular expressions, and write the result to a new file.


## Step 1: Import the RE Library


- Import the regular expression library:



In [None]:
import re

## Step 2: Read the File
- Open and read the contents of the **Sample.txt** file:


In [None]:
# Open the file in read mode; the with block closes it automatically
with open('Sample.txt', 'r') as f:
    string = f.read()

In [None]:
print(string)

   Captain $Amrica is played by Chris_Evans.
I love Chris Evans.
Captain Amrica was     released in 2011  ~2  as a part of ^72 Marvel Franchise.
WHILE THERE WERE MANY PARTS OF THE MOVIE RELEASED   I LOVED THE FIRST ONE. -- !!!!


__Observation__


* These are the contents of the file. It contains several anomalies: spelling mistakes, unwanted characters, extra whitespace, and words written entirely in uppercase.
We'll handle each of these to produce a clean text file.


* Let's fix the places where America is misspelled. Before replacing anything, let's check that our
pattern matches. One occurrence is preceded by a dollar sign. Because the dollar sign is a special
character in regular expressions, we need to escape it with a backslash.





In [None]:
re.findall(r'\$*Amrica', string)

['$Amrica', 'Amrica']
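
To see why the backslash matters, here is a minimal sketch that compares the escaped and unescaped patterns. The `sample` string is made up for illustration and is not the contents of Sample.txt.

```python
import re

# Hypothetical sample text, not the contents of Sample.txt
sample = "Captain $Amrica and Captain Amrica"

# Unescaped, `$` is an end-of-string anchor, so nothing can follow it
unescaped = re.findall(r'$Amrica', sample)

# Escaped, `\$` is a literal dollar sign; `*` allows zero or more of them
escaped = re.findall(r'\$*Amrica', sample)

# re.escape produces the escaped form of a literal string
built = re.findall(re.escape('$') + '*Amrica', sample)

print(unescaped)  # []
print(escaped)    # ['$Amrica', 'Amrica']
print(built)      # ['$Amrica', 'Amrica']
```

Writing the pattern as a raw string (the `r` prefix) keeps Python from interpreting the backslash before the regular expression engine sees it.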

## Step 3: Find and Replace the Wrong Spellings of America


Let's replace the misspelled words, which can be done using the re.sub function.

In [None]:
string = re.sub(r'\$*Amrica', 'America', string)
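
If it helps to confirm how many replacements were made, `re.subn` works like `re.sub` but also returns a count. The sketch below uses a made-up `sample` string rather than the file contents.

```python
import re

# Hypothetical sample text, not the contents of Sample.txt
sample = "Captain $Amrica is played by Chris_Evans.\nCaptain Amrica was released in 2011."

# re.subn returns a tuple: (new string, number of substitutions)
fixed, count = re.subn(r'\$*Amrica', 'America', sample)

print(count)  # 2
print(fixed)
```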

## Step 4: Replace Special Characters and Unwanted Whitespaces


The stray tokens ~2 and ^72 each consist of a tilde or a caret followed by digits. We can match both with one regular expression and remove them as follows.

In [None]:
string = re.sub(r'[~^]\d+', '', string)
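
As a small illustration of this pattern, the sketch below applies it to a made-up fragment. Inside a character set, a caret that is not the first character is a literal caret, so `[~^]` means "a tilde or a caret".

```python
import re

# Hypothetical fragment modelled on the third line of the file
fragment = "in 2011  ~2  as a part of ^72 Marvel"

# Remove a tilde or caret together with the digits that follow it
stripped = re.sub(r'[~^]\d+', '', fragment)

# 2011 survives because no tilde or caret precedes it
print(repr(stripped))  # 'in 2011    as a part of  Marvel'
```

The surrounding spaces are left behind, which is why a later step collapses repeated whitespace.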

Next, let’s use a negated character set. Every character that is not a lowercase letter (a to z), an uppercase letter (A to Z), a digit (0–9), whitespace, or a period is replaced with a space.


In [None]:
string = re.sub(r'[^a-zA-Z0-9\s.]', ' ', string)
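
The effect of the negated set is easier to see on a short fragment. This is a sketch on a made-up string. A caret as the first character inside the brackets inverts the set, and a period inside brackets is a literal period.

```python
import re

# Hypothetical fragment containing an underscore, dashes and exclamation marks
fragment = "Chris_Evans. -- !!!!"

# Replace every character outside the allowed set with one space
cleaned = re.sub(r'[^a-zA-Z0-9\s.]', ' ', fragment)

print(repr(cleaned))  # 'Chris Evans.        '
```

Each unwanted character becomes its own space, so the result can contain long runs of spaces.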

Finally, there are unwanted whitespaces. We replace every run of two or more whitespace characters with a single space.

In [None]:
string = re.sub(r'\s\s+', ' ', string)
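
One detail worth knowing is that `\s` also matches newlines and tabs, so a run such as space, newline, space is collapsed into one space and the line break is lost. The sketch below contrasts this with a pattern that only targets spaces and tabs. The `[ \t]{2,}` variant is an alternative shown for comparison, not part of the lesson's code.

```python
import re

# Hypothetical text with a run of spaces and a newline surrounded by spaces
text = "2011    as a part. \n I love"

# \s\s+ treats the newline as whitespace and merges the two lines
collapse_all = re.sub(r'\s\s+', ' ', text)

# [ \t]{2,} only collapses runs of spaces and tabs, keeping the newline
collapse_spaces = re.sub(r'[ \t]{2,}', ' ', text)

print(repr(collapse_all))     # '2011 as a part. I love'
print(repr(collapse_spaces))  # '2011 as a part. \n I love'
```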

__Observation__

* After the previous step, one space remains at the beginning of the string.
* We can remove it using the command below.

In [None]:
string = re.sub(r'^\s', '', string)
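
Note that `^\s` removes exactly one leading whitespace character, which is enough here because the earlier step already collapsed the run. As a sketch of the alternatives, assuming a made-up `padded` string:

```python
import re

# Hypothetical string with whitespace at both ends
padded = "  \tCaptain America  \n"

one_char = re.sub(r'^\s', '', padded)      # removes only the first whitespace character
all_leading = re.sub(r'^\s+', '', padded)  # removes the whole leading run
both_ends = padded.strip()                 # built-in method, trims both ends

print(repr(one_char))     # ' \tCaptain America  \n'
print(repr(all_leading))  # 'Captain America  \n'
print(repr(both_ends))    # 'Captain America'
```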

In [None]:
print(string)

Captain America is played by Chris Evans.
I love Chris Evans.
Captain America was released in 2011 as a part of Marvel Franchise.
WHILE THERE WERE MANY PARTS OF THE MOVIE RELEASED I LOVED THE FIRST ONE. 


__Observation__

Thus, the output no longer has spelling mistakes, stray characters, or repeated spaces.


## Step 5: Find All Uppercased Words

The last thing we need to do is replace all the uppercased words with lowercased words. Let's first collect all the uppercased words. *findall* returns
every run of two or more consecutive uppercase letters, so a single capital letter such as I is not matched.

In [None]:
upper_words = re.findall('[A-Z][A-Z]+',string)

In [None]:
upper_words

['WHILE',
 'THERE',
 'WERE',
 'MANY',
 'PARTS',
 'OF',
 'THE',
 'MOVIE',
 'RELEASED',
 'LOVED',
 'THE',
 'FIRST',
 'ONE']

__Observation__

Here, we have all the uppercased words.
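
To illustrate what the pattern does and does not match, here is a short sketch on a made-up sentence. The `{2,}` quantifier is an equivalent way to write "two or more".

```python
import re

# Hypothetical sentence mixing all-caps words, a lone capital and capitalised names
line = "WHILE I LOVED THE FIRST ONE, Captain America was OK."

two_plus = re.findall(r'[A-Z][A-Z]+', line)
same = re.findall(r'[A-Z]{2,}', line)

# 'I', 'Captain' and 'America' are skipped: none has two capitals in a row
print(two_plus)  # ['WHILE', 'LOVED', 'THE', 'FIRST', 'ONE', 'OK']
print(same == two_plus)  # True
```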

## Step 6: Convert the Uppercased Words to Lowercase

Let's convert all of these words into lowercase using a list comprehension.

In [None]:
lower_words = [x.lower() for x in upper_words]

In [None]:
lower_words

['while',
 'there',
 'were',
 'many',
 'parts',
 'of',
 'the',
 'movie',
 'released',
 'loved',
 'the',
 'first',
 'one']

## Step 7: Replace the Uppercased Words with the Lowercase Words

Let's replace each uppercased word in the string with its lowercase counterpart and print the string.

In [None]:
for u, l in zip(upper_words, lower_words):
    string = re.sub(u, l, string)

In [None]:
print(string)

Captain America is played by Chris Evans.
I love Chris Evans.
Captain America was released in 2011 as a part of Marvel Franchise.
while there were many parts of the movie released I loved the first one. 


__Observation__

Thus, we have a clean file now.
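
As an aside, Steps 5 to 7 can also be combined into a single substitution, because `re.sub` accepts a function as the replacement. The sketch below is an alternative to the loop, not the lesson's method. The `lowercase_match` helper and the `text` string are hypothetical names used only for illustration.

```python
import re

# Hypothetical text, not the contents of Sample.txt
text = "Captain America was released in 2011.\nWHILE THERE WERE MANY PARTS I LOVED THE FIRST ONE."

def lowercase_match(match):
    """Return the matched text converted to lowercase."""
    return match.group(0).lower()

# re.sub calls lowercase_match once for every run of two or more capitals
lowered = re.sub(r'[A-Z]{2,}', lowercase_match, text)

print(lowered)
```

This avoids building the intermediate lists and does not depend on the order in which words are replaced.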


## Step 8: Write the Modified Content to a New File

Finally, let's open a new file in write mode, write the cleaned string to it, and close it using the code below.

In [None]:
# Write it to a file
f = open('Sample_modified.txt','w')
f.write(string)
f.close()
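
A quick way to check the result is to read the file back and compare it with the string that was written. The sketch below is self-contained: it writes to a temporary directory, and the `cleaned` string is a made-up stand-in for the lesson's `string` variable. The `with` statement closes each file automatically, so no explicit call to close is needed.

```python
import os
import tempfile

# Hypothetical stand-in for the cleaned text produced in the steps above
cleaned = "Captain America is played by Chris Evans.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'Sample_modified.txt')

    # Write the cleaned text
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(cleaned)

    # Read it back to confirm the round trip
    with open(path, 'r', encoding='utf-8') as in_file:
        read_back = in_file.read()

print(read_back == cleaned)  # True
```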