<a href="https://colab.research.google.com/github/arutraj/ML_Basics/blob/main/4_6_Using_RegEx_on_Real_World_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents
 1. About the Dataset
 2. Regex for Cleaning Text Data
 3. Regex for Text Data Extraction
 4. Regex Challenge


## 1. About the Dataset

In [2]:
import pandas as pd

#Loading the dataset
df = pd.read_csv("/content/tweets.csv", encoding = "ISO-8859-1")

# Printing first 5 rows
df.head()

Unnamed: 0,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,RT @rssurjewala: Critical question: Was PayTM ...,False,0.0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331.0,True,False
1,RT @Hemant_80: Did you vote on #Demonetization...,False,0.0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66.0,True,False
2,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0.0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12.0,True,False
3,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0.0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338.0,True,False
4,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0.0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120.0,True,False


In [4]:
# Looking at some Tweets
for index, tweet in enumerate(df["text"][10:18]):
    print(index+1,".",tweet)

1 . Many opposition leaders are with @narendramodi on the #Demonetization 
And respect their decision,but support opposition just b'coz of party
2 . RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
3 . @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders.
4 . RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
5 . RT @sona2905: When I explained #Demonetization to myself and tried to put it down in my words which are not laced with any heavy technicalÂ
6 . RT @Dipankar_cpiml: The Modi app on #DeMonetization proves once again that the govt is totally indifferent to the mounting misery and hardsÂ
7 . RT @Atheist_Krishna: BEFORE and AFTER Gandhi ji heard they are standing there against #Demonetization
. https://t.co/9NheK63TPg
8 . RT @pGurus1: #Demonetization The co-operative b

## 2. Regex for Cleaning Text Data

In [5]:
import re

### a. Removing `RT`

In [6]:
# Removing RT from a single Tweet
text = "RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
# \b keeps the pattern from matching inside words that merely end in "RT"
clean_text = re.sub(r'\bRT ', '', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
Text after:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r


In [7]:
# Tweets before removal
df['text'].head()

0    RT @rssurjewala: Critical question: Was PayTM ...
1    RT @Hemant_80: Did you vote on #Demonetization...
2    RT @roshankar: Former FinSec, RBI Dy Governor,...
3    RT @ANI_news: Gurugram (Haryana): Post office ...
4    RT @satishacharya: Reddy Wedding! @mail_today ...
Name: text, dtype: object

In [8]:
# Removing RT from all the tweets
# \b keeps the pattern from matching inside words that merely end in "RT"
df['text']=df['text'].apply(lambda x: re.sub(r'\bRT ', '', x))

In [9]:
# Tweets after removal
df['text'].head()

0    @rssurjewala: Critical question: Was PayTM inf...
1    @Hemant_80: Did you vote on #Demonetization on...
2    @roshankar: Former FinSec, RBI Dy Governor, CB...
3    @ANI_news: Gurugram (Haryana): Post office emp...
4    @satishacharya: Reddy Wedding! @mail_today car...
Name: text, dtype: object
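
A literal pattern such as `'RT '` matches those three characters anywhere in the text, including inside a word that happens to end in `RT`. The sketch below anchors the pattern to the start of the tweet instead. The sample strings are made up and `strip_retweet_marker` is a hypothetical helper, not part of the original notebook; whether mid-tweet markers should also be removed depends on the dataset.

```python
import re

# Hypothetical helper: drop a retweet marker only at the very start of a tweet.
RT_PREFIX = re.compile(r'^RT\s+')

def strip_retweet_marker(tweet):
    return RT_PREFIX.sub('', tweet)

# Made-up examples, not taken from the dataset
samples = [
    "RT @Joydas: Question in Narendra Modi App",
    "My SMART phone died in the ATM queue",
]

for s in samples:
    print(strip_retweet_marker(s))
```

The second sample is left untouched, whereas the unanchored literal pattern would remove the `RT ` at the end of `SMART `.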

### b. Removing `<U+..>` like symbols

In [10]:
# Removing <U+..> like symbols from a single tweet
text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
# Raw string so that \+ reaches the regex engine as an escaped plus sign
clean_text = re.sub(r'<U\+[A-Z0-9]+>', '', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders
Text after:
 @Jaggesh2 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


**Note** that although this removes most of the symbols, `<ed>` is still present. Removing it is left as an exercise for you to try.

In [11]:
# Removing <U+..> like symbols from all the tweets
df['text']=df['text'].apply(lambda x: re.sub(r'<U\+[A-Z0-9]+>', '', x))
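
One possible way to handle the leftover `<ed>` markers is to add them as an alternative in the same pattern. This is only a sketch: it assumes `<ed>` always appears literally in that lowercase form and that the code points are written in uppercase hexadecimal, as in the sample tweet. The name `DEBRIS` is introduced here for illustration.

```python
import re

# Matches either a <U+XXXX> code point marker or a literal <ed> marker
DEBRIS = re.compile(r'<U\+[0-9A-F]+>|<ed>')

text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
clean_text = DEBRIS.sub('', text)
print(clean_text)
```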

### c. Replacing `&amp;` with `&`

In [12]:
# Replacing &amp; with & in a single tweet
text = "RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney"
clean_text = re.sub('&amp;','&', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney
Text after:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response & Commitment in fight against Blackmoney


In [63]:
# Replacing &amp; with & in all the tweets
# The pattern includes the trailing semicolon so that no stray ";" is left behind
df['text']=df['text'].apply(lambda x: re.sub('&amp;', '&', x))
df['text']

0       @rssurjewala: Critical question: Was PayTM inf...
1       @Hemant_80: Did you vote on #Demonetization on...
2       @roshankar: Former FinSec, RBI Dy Governor, CB...
3       @ANI_news: Gurugram (Haryana): Post office emp...
4       @satishacharya: Reddy Wedding! @mail_today car...
                              ...                        
5152    @thehill To The Hill. Shame on you for your an...
5153    @saxenavishakha: Ghost of demonetization retur...
5154    N d modi fans-d true nationalists of the count...
5155    @Stupidosaur: @Vidyut B team of BJP. CIA baby....
5156    @Vidyut B team of BJP. CIA baby. CCTV, EVM but...
Name: text, Length: 5157, dtype: object
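
`&amp;` is one of several HTML entities that can show up in scraped text. As an alternative to writing one substitution per entity, the standard library's `html.unescape` decodes all of them in one call. The sample string below is made up for illustration and is not taken from the dataset.

```python
import html

# Made-up example containing two different HTML entities
text = "Amazing response &amp; Commitment, turnout &gt; 5Lacs"
clean_text = html.unescape(text)
print(clean_text)
```

This is not a regex technique, but it can be a useful first pass before regex-based cleaning.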

## 3. Regex for Text Data Extraction
### a. Extracting platform type of tweets

In [14]:
# Getting number of tweets per platform type
platform_count = df["statusSource"].value_counts()

In [15]:
platform_count

statusSource
<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>      1838
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                        1394
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>         534
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a>                      166
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>        139
                                                                                          ... 
<a href="http://www.agileminder.com" rel="nofollow">agileminderbot</a>                       1
<a href="http://novapress.net.ru/" rel="nofollow">NovaPress Publisher</a>                    1
<a href="http://www.quora.com/" rel="nofollow">Quora</a>                                     1
<a href="http://imploded-explode.com" rel="nofollow">IEHIAutoPost</a>                        1
<a href="https://twitter.com/download

In [48]:
# List platforms that have more than 50 tweets
top_platforms = platform_count.loc[platform_count>50]
top_platforms

statusSource
<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>    1838
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                      1394
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>       534
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a>                    166
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      139
<a href="http://www.google.com/" rel="nofollow">Google</a>                               101
<a href="http://onlywire.com/" rel="nofollow">OnlyWire / Official App</a>                 85
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                      82
<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>                           69
<a href="http://ifttt.com" rel="nofollow">IFTTT</a>                                       64
<a href="http://www.twitter.com" rel="nofollow">Twitter f

In [49]:
def platform_type(x):
    ser = re.search( r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire", x, re.IGNORECASE)
    if ser:
        return ser.group()
    else:
        return None

# Reset the index so that the source strings become the values of the Series
# (older pandas releases name this column 'index' instead of 'statusSource')
top_platforms = top_platforms.reset_index()['statusSource']

# Extract the platform type from each source string
top_platforms.apply(platform_type)

0       android
1           Web
2        iphone
3      facebook
4     tweetdeck
5        google
6      onlywire
7        mobile
8          None
9          None
10      Windows
Name: statusSource, dtype: object
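
Because the search is case-insensitive, the matched text keeps whatever casing the source string used, which is why the output mixes `Web` and `Windows` with lowercase names. A small variation, sketched here with a hypothetical function name `platform_type_normalized`, lowercases the match so that the labels can be grouped consistently.

```python
import re

PLATFORMS = re.compile(
    r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire",
    re.IGNORECASE,
)

def platform_type_normalized(source):
    # Return the matched platform keyword in lowercase, or None if nothing matches
    match = PLATFORMS.search(source)
    return match.group().lower() if match else None

print(platform_type_normalized('<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'))
print(platform_type_normalized('<a href="http://ifttt.com" rel="nofollow">IFTTT</a>'))
```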

In [43]:
#top_platforms = top_platforms.reset_index()['statusSource']
top_platforms

0    <a href="http://twitter.com/download/android" ...
1    <a href="http://twitter.com" rel="nofollow">Tw...
2    <a href="http://twitter.com/download/iphone" r...
3    <a href="http://www.facebook.com/twitter" rel=...
4    <a href="https://about.twitter.com/products/tw...
5    <a href="http://www.google.com/" rel="nofollow...
Name: statusSource, dtype: object

In [44]:
top_platforms.apply(lambda x: platform_type(x))


0      android
1          Web
2       iphone
3     facebook
4    tweetdeck
5       google
Name: statusSource, dtype: object
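
A keyword list returns `None` for any source it does not name, such as Hootsuite or IFTTT. Another option is to pull out the link text itself with a capture group. The sketch below assumes every `statusSource` value is a single `<a ...>label</a>` element, as in the rows shown; `source_name` is a hypothetical helper.

```python
import re

# Capture whatever sits between the opening <a ...> tag and the closing </a>
ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>')

def source_name(source):
    match = ANCHOR_TEXT.search(source)
    return match.group(1) if match else None

print(source_name('<a href="http://www.hootsuite.com" rel="nofollow">Hootsuite</a>'))
print(source_name('<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'))
```

The trade-off is that the labels are more specific ("Twitter for Android" rather than "android"), so they may need further grouping.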

### b. Extracting hashtags from the tweets

In [50]:
# Extract first hashtag from a tweet
text = "RT @Atheist_Krishna: The effect of #Demonetization !!\r\n. https://t.co/A8of7zh2f5"
# Raw string so that \w reaches the regex engine unchanged
hashtag = re.search(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtag.group())

Tweet:
 RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
Hashtag:
 #Demonetization


In [59]:
# Extract multiple hashtags from a tweet
text = """RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = re.findall(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtags)

Tweet:
 RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo
Hashtag:
 ['#Doltiwal', '#JaiChandKejriwal', '#Demonetization']


In [62]:
# Extract all hashtags from every tweet into a new column
df['hashtags']=df['text'].apply(lambda x: re.findall(r'#\w+', x))
df['hashtags']

0                      [#Demonetization]
1                      [#Demonetization]
2                      [#Demonetization]
3                      [#demonetization]
4       [#demonetization, #ReddyWedding]
                      ...               
5152                                  []
5153                                  []
5154                                  []
5155                                  []
5156                                  []
Name: hashtags, Length: 5157, dtype: object

In [61]:
df[['text','hashtags']].head()

Unnamed: 0,text,hashtags
0,@rssurjewala: Critical question: Was PayTM inf...,[#Demonetization]
1,@Hemant_80: Did you vote on #Demonetization on...,[#Demonetization]
2,"@roshankar: Former FinSec, RBI Dy Governor, CB...",[#Demonetization]
3,@ANI_news: Gurugram (Haryana): Post office emp...,[#demonetization]
4,@satishacharya: Reddy Wedding! @mail_today car...,"[#demonetization, #ReddyWedding]"
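
The extracted hashtags differ in casing (`#Demonetization` and `#demonetization` both appear), so a frequency count would treat them as separate tags. A minimal sketch of a case-insensitive count follows, using a few made-up tweets rather than the real dataset.

```python
import re
from collections import Counter

# Made-up tweets for illustration
tweets = [
    "Did you vote on #Demonetization today?",
    "Long queues again #demonetization #ReddyWedding",
    "#DeMonetization survey results are out",
]

# Lowercase each hashtag before counting so that casing variants are merged
counts = Counter(
    tag.lower()
    for tweet in tweets
    for tag in re.findall(r'#\w+', tweet)
)
print(counts.most_common(2))
```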


## 4. Regex Challenge

Now that you have learned the core regex concepts and seen them in action, it's time to use them to solve a challenge on your own. Here are the tasks:

### a. Removing URLs from tweets

**Difficulty - Easy**

A tweet's `text` can contain one or more URLs, and they don't necessarily provide useful information, so we can get rid of them. For example:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*


The URL says little about the tweet's content, so it can safely be removed.


In [65]:
# Your Code Here
text = "@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
# A URL is taken to be the scheme followed by a run of non-whitespace characters,
# so any text that follows the URL is kept
url_pattern = r'https?://\S+'
url = re.search(url_pattern, text)
clean_text = re.sub(url_pattern, '', text)

print("With URL:\n", text)
print("URL found:\n", url.group())
print("Without URL:\n", clean_text)


With URL:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
URL found:
 https://t.co/pYgK8Rmg7r
Without URL:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy 


### b. Extract Top 100 mentions

**Difficulty - Medium**

Many of the tweets mention people in the form *@username*. For example, see the following tweet:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*

Here *@Joydas* is a mention. You need to extract the mentions from all the tweets and find the 100 most frequently mentioned usernames.

In [None]:
# Your Code Here


### Solution - 1

In [66]:
# Removing URLs from a single tweet
text='@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r'
# The hyphen goes last in the character class so that it is matched literally
re.sub(r'https?://[A-Za-z0-9./-]+', '', text)

'@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy '

In [67]:
# Removing URLs from all the tweets
# The hyphen goes last in the character class so that it is matched literally
df['text']=df['text'].apply(lambda x: re.sub(r'https?://[A-Za-z0-9./-]+', '', x))
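
Inside a character class, a hyphen placed between two characters defines a range, so `[.-/]` covers only the characters from `.` to `/` and does not match a literal `-`. The sketch below compares three patterns on a made-up URL that contains hyphens; the variable names are chosen for illustration only.

```python
import re

text = "Read more at https://example.com/my-long-post today"  # made-up URL

# Hyphen between "." and "/" is read as a range, so matching stops at the first "-"
range_class = re.sub(r'https?://[A-Za-z0-9.-/]+', '', text)

# Hyphen placed last in the class is a literal hyphen
literal_hyphen = re.sub(r'https?://[A-Za-z0-9./-]+', '', text)

# Simplest option: consume everything up to the next whitespace character
non_space = re.sub(r'https?://\S+', '', text)

print(repr(range_class))
print(repr(literal_hyphen))
print(repr(non_space))
```

The `\S+` form also covers characters such as `?`, `=` and `_`, at the cost of swallowing any punctuation that directly follows the URL.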

### Solution - 2

In [68]:
# Function for extracting mentions from the tweet
def mention(x):
    found=re.findall(r'@\w+',x)
    if found:
        return found
    return None
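
The pattern `@\w+` also matches the part of an email address that follows the `@`. If that matters for the data at hand, a negative lookbehind can require that the `@` is not preceded by a word character. This is a sketch on a made-up string, and `MENTION` is a name introduced here for illustration.

```python
import re

# Made-up text mixing an email address with two mentions
text = "Mail me at user@example.com or ping @narendramodi and @PMOIndia"

plain = re.findall(r'@\w+', text)

# (?<!\w) rejects an "@" that directly follows a letter, digit or underscore
MENTION = re.compile(r'(?<!\w)@\w+')
strict = MENTION.findall(text)

print(plain)
print(strict)
```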

In [69]:
# Extract mentions from all the tweets
arr=df['text'].apply(mention)

In [70]:
arr

0                      [@rssurjewala]
1                        [@Hemant_80]
2                        [@roshankar]
3                         [@ANI_news]
4       [@satishacharya, @mail_today]
                    ...              
5152                       [@thehill]
5153                [@saxenavishakha]
5154                             None
5155          [@Stupidosaur, @Vidyut]
5156                        [@Vidyut]
Name: text, Length: 5157, dtype: object

In [71]:
# Combining all the mentions into a list
mentions_arr=[]

for x in arr:
    if x is not None:
        mentions_arr.extend(x)

In [72]:
mentions_arr[:10]

['@rssurjewala',
 '@Hemant_80',
 '@roshankar',
 '@ANI_news',
 '@satishacharya',
 '@mail_today',
 '@DerekScissors1',
 '@ambazaarmag',
 '@gauravcsawant',
 '@Joydeep_911']

In [74]:
# Getting top 100 mentions
mentions_count=pd.Series(mentions_arr).value_counts().head(100)

In [75]:
mentions_count

@narendramodi      326
@YouTube           143
@PMOIndia           99
@ArvindKejriwal     83
@arunjaitley        39
                  ... 
@UIDAI               6
@pGurus1             6
@aartic02            6
@WG_Burton           6
@s_navroop           6
Name: count, Length: 100, dtype: int64
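
The same ranking can be produced without pandas by using `collections.Counter` from the standard library. A minimal sketch follows; the list is a short made-up stand-in for `mentions_arr`, and the counts are not real results.

```python
from collections import Counter

# Made-up stand-in for the combined list of mentions
mentions = [
    "@narendramodi", "@PMOIndia", "@narendramodi",
    "@YouTube", "@narendramodi", "@PMOIndia",
]

# most_common(n) returns the n most frequent items with their counts
top = Counter(mentions).most_common(2)
print(top)
```

Passing `100` instead of `2` would give the top 100 mentions as a list of `(username, count)` pairs.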