# Text Preprocessing Using Regular Expressions

# Challenge

# Problem Statement

Disaster Tweets Analysis

AM_CARES is one of the top disaster relief organizations. It provides emergency assistance including mass and mobile feeding, temporary shelter, counseling, search and rescue, medical assistance, and resource distribution. Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they are observing in real time. As a result, more agencies, such as disaster relief organizations and news agencies, are interested in programmatically monitoring Twitter.

However, it's not always clear whether a person's words are announcing a disaster. Look at the data present in
"DS3 C2 S1 Tweets Data Practice.csv". AM_CARES wishes to use this file so that it can provide better services at the time of a disaster.

# Objective

Review Text Preprocessing Using Regular Expressions

In [1]:
# importing libraries
import pandas as pd    # importing pandas library to read csv file
import numpy as np     # working with array
import re              # for regular expressions

In [2]:
# importing the csv file into a Pandas dataframe
df = pd.read_csv(r"C:\Users\Admin\Desktop\Level -1\C2\Repository\DS3_C2_S1_Tweets_Data_Challenge.csv")

# viewing the shape of the data (the number of rows and columns)
df.shape

(11370, 5)

In [3]:
# viewing columns in the data
df.columns

Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')

# Task 1

The disaster relief organization AM_CARES wishes to know the tweets that contain disaster words such as 'crash', 'quarantine', and 'brush fires'. This will enable them to plan their activities at a particular location. How can you accomplish this task?

In [4]:
text_string = " ".join(x for x in df["text"]) # joining every value of the 'text' column into a single string, separated by single spaces,
                                              # so that the regex operations below can run over all tweets at once
text_string[0:200]                            # viewing the first 200 characters of the string to check this worked as expected

'Communal violence in Bhainsa, Telangana. "Stones were pelted on Muslims\' houses and some houses and vehicles were set ablaze… Telangana: Section 144 has been imposed in Bhainsa from January 13 to 15, '

In [5]:
crash = re.findall("crash", text_string)           # re.findall() returns a list of all non-overlapping matches of the pattern in the string
print(len(crash))                                  # number of times 'crash' occurs in the tweets (substring match, so 'crashed' also counts)

88


In [6]:
quarantine = re.findall("quarantine", text_string)      # re.findall() returns a list of all non-overlapping matches of the pattern in the string
print(len(quarantine))                                  # number of times 'quarantine' occurs in the tweets

70


In [7]:
brush_fires = re.findall("brush fires", text_string)     # re.findall() returns a list of all non-overlapping matches of the pattern in the string
print(len(brush_fires))                                  # number of times the phrase 'brush fires' occurs in the tweets

1
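
The counts above come from substring matches over one joined string, so "crash" also counts "crashed", and the result is a number of occurrences rather than a list of tweets. The following is a minimal sketch of whole-word, case-insensitive matching per tweet. It assumes a small made-up DataFrame (`sample_df`, hypothetical) in place of the real file, so the numbers it prints say nothing about the actual data.

```python
import re
import pandas as pd

# Hypothetical stand-in for df; the real data comes from the CSV file.
sample_df = pd.DataFrame({
    "text": [
        "Plane crash reported near the airport",
        "My laptop crashed again",
        "City under quarantine after outbreak",
        "Brush fires spread across the hills",
    ],
    "location": ["Texas", None, "Wuhan", "NSW"],
})

disaster_words = ["crash", "quarantine", "brush fires"]

# \b anchors the match at word boundaries, so "crashed" does not count as "crash".
pattern = r"\b(?:" + "|".join(re.escape(w) for w in disaster_words) + r")\b"

# Boolean mask: True for tweets that contain at least one of the words.
mask = sample_df["text"].str.contains(pattern, case=False, regex=True)
matching_tweets = sample_df.loc[mask, ["text", "location"]]

# Number of tweets (not occurrences) that mention each word.
tweet_counts = {
    w: int(sample_df["text"]
           .str.contains(r"\b" + re.escape(w) + r"\b", case=False, regex=True)
           .sum())
    for w in disaster_words
}

print(matching_tweets)
print(tweet_counts)
```

Keeping the match at row level also keeps the `location` column attached to each tweet, which is what the task asks for.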


# Task 2

After finding the disaster words in the tweets, AM_CARES wishes to know the tweets that denote a real disaster along with their locations. Help the disaster relief organization accomplish this task.

In [8]:
disaster = df.loc[df['text'].str.contains('disaster')]  # fetching tweets whose text contains the word 'disaster'
disaster.head()                                         # viewing the first five tweets that contain the word 'disaster'

Unnamed: 0,id,keyword,location,text,target
687,687,avalanche,,#SONDAKİKA #breakingsound Avalanche disaster i...,1
817,817,bioterrorism,MX-TX-TN-CA-MI-IN-GA,"Similarly, there are lots of different disaste...",0
1037,1037,blew%20up,Pilipinz,"here in the ph, we experienced the taal disast...",0
1852,1852,burned,au,Australia’s government called bushfire damage ...,0
1925,1925,bush%20fires,United States,Linking the bushfire disaster in NSW to climat...,0


In [9]:
# viewing the data with the "text" column widened to 800px so that the full tweet is displayed, and hiding the index column
# (Styler.hide_index() is deprecated since pandas 1.4; hide(axis="index") is its replacement)
disaster[['text','location']].style.set_properties(subset=['text'], **{'width': '800px'}).hide(axis="index")


text,location
#SONDAKİKA #breakingsound Avalanche disaster in Pakistan Kashmir: 57 dead https://t.co/LJwe842mMN,
"Similarly, there are lots of different disasters including epidemics, natural disasters, wars, bioterrorism, and we… https://t.co/NRP53sUN5C",MX-TX-TN-CA-MI-IN-GA
"here in the ph, we experienced the taal disaster then my dad sent us a msg saying their ship's engine blew up 😭 tha… https://t.co/8cB1UzUCNm",Pilipinz
"Australia’s government called bushfire damage to wildlife an ""ecological disaster."" The military is clearing up scores of dead…",au
"Linking the bushfire disaster in NSW to climate change is ""an absolute nonsense"" and reducing fuel loads in the Australi…",United States
"In pairs, our workshop participants designed & coded a disaster detector system! 🚨 This pair designed a rural fire detect…","Brisbane, Queensland"
"In pairs, our workshop participants designed & coded a disaster detector system! 🚨 This pair designed a rural fire… https://t.co/tQFhDH1fys","Brisbane, Australia"
The sweet relief of rain after bushfires threaten disaster for our rivers Fire debris flowing into the Murray-Darling Basin w…,"Georgia, USA"
Train cars scattered all over the derailment area in Mississauga. #disasters #railroad #raildisaster #derailment… https://t.co/YUVB1wWzCd,United States
"As we grapple w. mass extinction of species, destruction of cities & displacement of millions due to natural disasters, th…",

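The cells above select tweets by the word 'disaster' only; they do not yet use the `target` column. Below is a minimal sketch of combining both conditions. It assumes that `target == 1` marks a tweet about a real disaster, which the task wording suggests but the notebook does not state, and it uses a hypothetical `sample_df` rather than the real file.

```python
import pandas as pd

# Hypothetical rows that mimic some of the columns of df.
sample_df = pd.DataFrame({
    "text": [
        "Avalanche disaster in Kashmir: 57 dead",
        "My exam was a disaster lol",
        "Flood warning issued for the coast",
    ],
    "location": [None, "au", "Texas"],
    "target": [1, 0, 1],
})

# Assumption: target == 1 labels a tweet about a real disaster.
has_word = sample_df["text"].str.contains("disaster", case=False, na=False)
is_real = sample_df["target"] == 1

# Keep only rows that satisfy both conditions, with their locations.
real_disasters = sample_df.loc[has_word & is_real, ["text", "location"]]
print(real_disasters)
```
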

# Task 3

To ensure the data present in the tweet (text) is consistent, correct, and usable, AM_CARES needs to perform data cleaning in the text of tweets. Help AM_CARES accomplish this task.

In [10]:
# re.sub() replaces every match of a regular expression. Its first argument is the pattern, the second is the replacement string, and the third is the string to process.

text_string = re.sub(r"\s+"," ",text_string)  # \s matches any whitespace character (space, tab, newline); '+' matches one or more of them, so each run of whitespace becomes a single space
text_string[0 : 200]                  # viewing the first 200 characters of the string to check this worked as expected

'Communal violence in Bhainsa, Telangana. "Stones were pelted on Muslims\' houses and some houses and vehicles were set ablaze… Telangana: Section 144 has been imposed in Bhainsa from January 13 to 15, '

In [11]:
text_string = re.sub(r"http\S+","_URL_", text_string)  # replacing every URL that starts with 'http' with the placeholder '_URL_' (raw string so that \S is passed to re unchanged)
                                        # \S matches any non-whitespace character; '+' matches one or more occurrences of the pattern to its left
text_string[0 : 200]                          # viewing the first 200 characters of the string to check this worked as expected

'Communal violence in Bhainsa, Telangana. "Stones were pelted on Muslims\' houses and some houses and vehicles were set ablaze… Telangana: Section 144 has been imposed in Bhainsa from January 13 to 15, '

In [12]:
text_string = re.sub(r"\W+"," ", text_string)          # replacing each run of special characters with a single space
                                        # \W matches any character that is not a letter, digit, or underscore; '+' matches one or more occurrences of the pattern to its left
text_string[0 : 200]                          # viewing the first 200 characters of the string to check this worked as expected

'Communal violence in Bhainsa Telangana Stones were pelted on Muslims houses and some houses and vehicles were set ablaze Telangana Section 144 has been imposed in Bhainsa from January 13 to 15 after c'

In [13]:
text_string = text_string.lower()         # converting to lowercase
text_string[0 : 200]                      # viewing the first 200 characters of the string to check this worked as expected

'communal violence in bhainsa telangana stones were pelted on muslims houses and some houses and vehicles were set ablaze telangana section 144 has been imposed in bhainsa from january 13 to 15 after c'
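
The four cleaning steps can also be wrapped in one function and applied to a single tweet. This is a minimal sketch rather than the notebook's own method: `clean_text` is a hypothetical helper, and it replaces URLs before collapsing whitespace, which is a slightly different order from the cells above.

```python
import re

def clean_text(text: str) -> str:
    """Apply the cleaning steps from this task to one string."""
    text = re.sub(r"http\S+", "_URL_", text)  # replace URLs first, before punctuation is stripped
    text = re.sub(r"\W+", " ", text)          # \W leaves letters, digits and underscores untouched
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace and trim both ends
    return text.lower()

tweet = "Puerto Rico hit by   another 5.9 magnitude aftershock - KYMA https://t.co/SF2wBVd5me"
cleaned = clean_text(tweet)
print(cleaned)
```

Note that stripping special characters turns "5.9" into "5 9", so any numeric extraction has to run on the original text, not on the cleaned string.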

# Task 4

During earthquake disasters, AM_CARES delivers lifesaving aid and vital assistance to help communities rebuild. To deliver these services efficiently, the company needs to know the tweets that contain the word 'magnitude' in the tweet text and the earthquake location. Can you help the company find all tweets containing 'magnitude' and identify the earthquake location? Also, find the average strength of the earthquake.

In [14]:
magnitude = df.loc[df['text'].str.contains('magnitude')]  # fetching tweets containing 'magnitude' word
len(magnitude)                                            # viewing the number of tweets that contain a 'magnitude' word

17

In [15]:
magnitude = magnitude[['text','location']]                                 # text and location of the tweets containing the word 'magnitude'
magnitude.style.set_properties(subset=['text'], **{'width':'800px'}).hide(axis="index")  # viewing the data with the "text" column widened to 800px so that the full tweet is displayed,
                                                                                         # and hiding the index column (hide_index() is deprecated since pandas 1.4)


text,location
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/SF2wBVd5me,Global
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/rrTNTMA8NR,Texas
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/wW3CdsvDV2,são paulo -sp
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/vRAdkSZaLn,"Flatbush, Brooklyn"
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/MMxXzwBpaF,
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/lRTLlbB9TV,ethers
"If two quakes have about the same magnitude, you cannot say one is the other's preshock or aftershock. Only when one is obv…",
Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/P9Th3vpTak,"Wilmington, NC- USA"
The magnitude 5.9 quake in #PuertoRico this morning caused our 7-day aftershock forecast to change. We now estimate an increased…,NYC
This timelapse of #Sentinel3 🇪🇺🛰 imagery shows the magnitude of the #AustralianBushfire crisis in terms of burned areas…,Japan


In [16]:
string = ' '.join(x for x in magnitude['text'])         # joining every value of the 'text' column into one string, separated by single spaces
string[0:500]                                           # viewing the first 500 characters of the string to check this worked as expected

'Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/SF2wBVd5me Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/rrTNTMA8NR Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/wW3CdsvDV2 Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/vRAdkSZaLn Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/MMxXzwBpaF Puerto Rico hit by another 5.9 magnitude aftershock - KYMA https://t.co/lRTLlbB9TV If'

In [17]:
num = re.findall(r'\d\.\d+',string)     # fetching decimal numbers from the string: '\d' matches one digit, '\.' a literal dot, '\d+' one or more digits
print(len(num))                         # count of numeric values found
print(type(num[0]))                     # checking the datatype of the matches

14
<class 'str'>


In [18]:
num = list(map(float, num))            # converting each match from string to float
print(type(num[0]))                   # checking datatype

<class 'float'>


In [19]:
avg = np.mean(num)                   # finding average
print(f"Average = {avg}")

Average = 5.528571428571429
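
The average above is taken over every decimal number in the joined string, whether or not it is a magnitude. A stricter variant can be sketched per tweet, capturing only a number that sits directly next to the word 'magnitude'. The DataFrame `sample` and its last row are made up for illustration, and `pattern` is just one possible expression.

```python
import pandas as pd

# Hypothetical sample; the first three rows are shortened from the output above.
sample = pd.DataFrame({
    "text": [
        "Puerto Rico hit by another 5.9 magnitude aftershock - KYMA",
        "The magnitude 5.9 quake in #PuertoRico this morning",
        "This timelapse shows the magnitude of the crisis",
        "A 4.5 magnitude quake followed 2.5 hours later",
    ]
})

# Capture a decimal number directly before or directly after the word "magnitude".
pattern = r"(\d+\.\d+)\s+magnitude|magnitude\s+(\d+\.\d+)"
extracted = sample["text"].str.extract(pattern)

# Each row fills at most one of the two groups, so take whichever is present.
sample["magnitude"] = pd.to_numeric(extracted[0].fillna(extracted[1]))

# mean() skips rows where no magnitude was found (NaN).
average_magnitude = sample["magnitude"].mean()
print(sample["magnitude"].tolist())
print(average_magnitude)
```

In this sketch the "2.5" in the last row is ignored because it is not adjacent to 'magnitude', whereas `re.findall` over the joined string would have included it.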


# Task 5

AM_CARES is interested in reading the tweet text of everyone who is talking about an accident. For this, the company needs to know the tweet text that contains the keyword 'accident' and also needs to determine the tweets that denote a real disaster so that it can read the tweets and plan activities. Help AM_CARES accomplish this task. Also, remove the special characters from the tweet text.

In [20]:
accident = df.loc[df['text'].str.contains('accident')]  # fetching tweets whose text contains the word 'accident'
len(accident)                                            # viewing the number of tweets that contain the word 'accident'

78

In [21]:
accident_str = ' '.join(accident["text"])      # joining every value of the 'text' column into one string, separated by single spaces
accident_str[0:500]                      # viewing the first 500 characters of the string to check this worked as expected

'#WATCH Former CM Akhilesh Yadav who went to meet injured of Kannauj accident, at a hospital in Chhibramau asks Emergency Med… 😁yeah! His new swag is on point 100%, since the accident! Like this is a totally transformed Bob… my back and neck are still fucked up from the accident 😡😡😭😭 RT! Prince Harry just confirmed that his mother’s (Princess Diana) death was not an accident! https://t.co/1ADe3uZ3eR Juwan Johnson/Oregon is one big dude. Looks like a tight end stuck in the receiver group by accide'

In [22]:
accident_str = re.sub(r"\W+"," ", accident_str)          # replacing each run of special characters with a single space
                                            # \W matches any character that is not a letter, digit, or underscore; '+' matches one or more occurrences of the pattern to its left
accident_str[0 : 500]                          # viewing the first 500 characters of the string to check this worked as expected

' WATCH Former CM Akhilesh Yadav who went to meet injured of Kannauj accident at a hospital in Chhibramau asks Emergency Med yeah His new swag is on point 100 since the accident Like this is a totally transformed Bob my back and neck are still fucked up from the accident RT Prince Harry just confirmed that his mother s Princess Diana death was not an accident https t co 1ADe3uZ3eR Juwan Johnson Oregon is one big dude Looks like a tight end stuck in the receiver group by accident My friend an army'
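
The cleaned string above still contains URL fragments such as "https t co 1ADe3uZ3eR", and the `target` column has not been used yet. The sketch below shows one way to address both per tweet. It assumes that `target == 1` marks a real disaster, and `sample_df` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df.
sample_df = pd.DataFrame({
    "text": [
        "Major accident on the highway, 3 injured! https://t.co/abc123",
        "I accidentally deleted my essay :(",
        "Train accident near the station...",
    ],
    "target": [1, 0, 1],
})

# \b keeps "accidentally" from matching; a plain substring search would include it.
is_accident = sample_df["text"].str.contains(r"\baccident\b", case=False, na=False)

# Assumption: target == 1 labels a tweet about a real disaster.
real_accidents = sample_df.loc[is_accident & (sample_df["target"] == 1)].copy()

# Clean each tweet separately so the rows stay intact.
real_accidents["clean_text"] = (
    real_accidents["text"]
    .str.replace(r"http\S+", " ", regex=True)  # drop URLs before stripping punctuation
    .str.replace(r"\W+", " ", regex=True)      # replace runs of special characters with a space
    .str.strip()                               # trim leading and trailing spaces
)
print(real_accidents["clean_text"].tolist())
```

Removing URLs before applying `\W+` avoids leaving their pieces behind as stray tokens.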