# Data Processing notebook.
#### Here we process the data that we generated previously.
#### The work is split into 6 steps:
1. Data cleaning
2. Provide a list of the 15 most common words
3. Provide a list of the 2 pairs of words having the highest co-occurrence frequency
4. Build a graphical representation of the most frequent words with their polarity (pos/neg or anger/joy/fear/...)
5. Indicate the 3 most frequent representative words in each category
6. Compare the results of the two approaches

<font size=5 color="#2E1698"><B><u>PART I:</u>  Data cleaning</B></font>
<font color="#2E1698">**Here, we will clean the dataset in order to analyze it**</font>
<br>


<font color="#2E1698">**First things first, let's import the csv file**</font>

In [None]:
tweets = read.csv("data/debat_primaire_20000.csv", encoding="UTF-8")

In [None]:
dim(tweets)

<font color="#2E1698">**So our data frame contains 17 columns and 20 000 rows. Let's look at the first 10 rows**</font>


In [None]:
head(tweets, n = 10)

<font color="#2E1698">**Let's see if all the columns have multiple values, or if some are useless**</font>


In [None]:
head(unique(tweets$favorited))
head(unique(tweets$favoriteCount))
head(unique(tweets$replyToSN))
head(unique(tweets$replyToUID))
head(unique(tweets$id))
head(unique(tweets$isRetweet))
head(unique(tweets$longitude))
head(unique(tweets$latitude))

In [None]:
length(which(tweets$favorited == "TRUE"))
length(which(tweets$favorited == "FALSE"))

<font color="#2E1698">**We can see here that there is no TRUE value for favorited, only FALSE. The favorited column is therefore useless.**</font>
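
The column-by-column inspection above can also be done in one pass. Below is a minimal sketch on a toy data frame (`toy` and its values are invented for illustration and are not taken from the dataset): counting the distinct values per column flags constant columns such as `favorited`.

```r
# Toy data frame standing in for `tweets` (values are invented)
toy <- data.frame(
  favorited     = c(FALSE, FALSE, FALSE, FALSE),
  favoriteCount = c(0, 3, 0, 1),
  isRetweet     = c(TRUE, FALSE, TRUE, TRUE)
)

# Number of distinct values in each column
n_distinct <- sapply(toy, function(column) length(unique(column)))
print(n_distinct)

# Columns holding a single value carry no information
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```

On the real data, the same two lines could be applied to `tweets` instead of `toy`.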


In [None]:
length(which(tweets$favoriteCount == 0))
length(which(tweets$favoriteCount != 0))

In [None]:
(3919/20000)*100

<font color="#2E1698">**favoriteCount takes multiple values: about 20% of them are not 0, so we had better keep this column. It may be significant data.**</font>


In [None]:
# count the rows where the coordinate is present (not NA)
sum(!is.na(tweets$longitude))
sum(!is.na(tweets$latitude))

In [None]:
(9/20000)*100

<font color="#2E1698">**Only 9 tweets out of 20 000 contain a latitude and a longitude. This represents only 0.045% of the tweets, so this information can be considered useless and we can delete these two columns too.**</font>


In [None]:
# count the rows where each reply field is present (not NA)
sum(!is.na(tweets$replyToSN))
sum(!is.na(tweets$replyToUID))
sum(!is.na(tweets$replyToSID))

In [None]:
(698/20000)*100

<font color="#2E1698">**Only about 3.5% of the replyToSN and replyToUID values are not NA, so we can delete the reply columns, as they don't seem interesting to study.**</font>
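
The percentages above are computed from counts typed in by hand. As a sketch of how they could be obtained directly, the share of non-missing values per column can be computed with `is.na` and `colMeans`. The data frame `toy`, its values, and the 50% cut-off are assumptions made for illustration only.

```r
# Toy data frame standing in for `tweets` (values are invented)
toy <- data.frame(
  longitude    = c(NA, 2.35, NA, NA),
  replyToSN    = c(NA, "someone", "other", NA),
  retweetCount = c(0, 5, 1, 0),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix;
# colMeans() then gives the fraction of non-missing values per column
pct_filled <- colMeans(!is.na(toy)) * 100
print(pct_filled)

# Columns below an arbitrary, illustrative threshold of 50% filled
sparse_cols <- names(pct_filled)[pct_filled < 50]
print(sparse_cols)
```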


<font color="#2E1698"><B>  Let's delete these useless columns!</B></font>


In [None]:
tweets <- subset(tweets, select=-c(replyToSN,replyToUID, replyToSID, latitude, longitude, favorited))

In [None]:
head(tweets)

<font color="#2E1698"><B>  If we want to use the text, it has to be cleaned first</B></font>


In [None]:
clean_text = function(x)
{
    # convert the text to lowercase; return NA if tolower() fails
    try.error = function(z)
    {
        y = NA
        try_error = tryCatch(tolower(z), error=function(e) e)
        if (!inherits(try_error, "error"))
            y = tolower(z)
        return(y)
    }
    x = sapply(x, try.error)

    # remove all links starting with http
    x = gsub('http\\S+\\s*', '', x)

    # replace apostrophes with a space
    x = gsub("'", " ", x)

    # remove punctuation except @, #, _, -
    # (these four characters are swapped for uppercase placeholders, which
    # cannot collide with the lowercased text, and are restored afterwards)
    x = gsub("@", "AAAAAAAAAAA", x)
    x = gsub("#", "BBBBBBBBBBB", x)
    x = gsub("_", "CCCCCCCCCCC", x)
    x = gsub("-", "DDDDDDDDDDD", x)
    x = gsub("[[:punct:]]", " ", x)
    x = gsub("AAAAAAAAAAA", "@", x)
    x = gsub("BBBBBBBBBBB", "#", x)
    x = gsub("CCCCCCCCCCC", "_", x)
    x = gsub("DDDDDDDDDDD", "-", x)

    # remove the space left after a preserved @
    x = gsub("@ ", "@", x)

    # remove the space left after a preserved _
    x = gsub("_ ", "_", x)

    # remove the space left after a preserved -
    x = gsub("- ", "-", x)

    # remove digits
    x = gsub("[[:digit:]]", "", x)

    # collapse runs of two or more spaces/tabs into a single space
    x = gsub("[ \t]{2,}", " ", x)

    # remove a blank space at the beginning/end
    x = gsub("^ ", "", x)
    x = gsub(" $", "", x)

    # As we already have a column indicating whether the tweet is a retweet,
    # we can remove "rt @xxx" from the tweet header
    x = gsub("rt @\\w+ *", "", x)

    # remove words of 1 to 3 characters that are followed by whitespace
    x = gsub('\\b\\w{1,3}\\s','', x)

    # remove double spaces (applied twice to catch runs of up to four spaces)
    x = gsub("  ", " ", x)
    x = gsub("  ", " ", x)
    return(x)
}

In [None]:
tweets$text <- clean_text(tweets$text)
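
The placeholder swap in `clean_text` (replacing `@`, `#`, `_` and `-` with letter runs, stripping punctuation, then restoring them) can also be written as a single substitution. Below is a minimal sketch, assuming Perl-compatible regular expressions (`perl = TRUE`): a negative lookahead skips the four characters to keep. This is an alternative shown for illustration, not the method used in this notebook, and `sample_text` is an invented string.

```r
# Invented sample string containing both kept and removed punctuation
sample_text <- "hello, @user! #tag_one well-done."

# (?![@#_-]) refuses a match when the next character is one of @ # _ -,
# so [[:punct:]] only replaces the remaining punctuation characters
kept <- gsub("(?![@#_-])[[:punct:]]", " ", sample_text, perl = TRUE)
print(kept)
```

Each removed character becomes a space, so the follow-up steps that collapse repeated spaces would still be needed.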

In [None]:
head(tweets, n = 20)

<font color="#2E1698"><B> Let's see which @xxx are the most used and replace them with words. Afterwards, we will delete all the @xxx that have not been replaced</B></font>


In [None]:
col = tweets$text
head(col, n=5)

In [None]:
at.pattern = "@\\w+ *"
have.at = grep(x = col, pattern = at.pattern)
at.matches = gregexpr(pattern = at.pattern,
                        text = col[have.at])
extracted.at = regmatches(x = col[have.at], m = at.matches)

# most frequent @usernames (trailing spaces are stripped before counting)
mfw = gsub(" ", "", unlist(extracted.at))
d = sort(table(mfw), decreasing=TRUE)
head(d, n = 20)
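
To make the extraction step easier to follow, here is the same `gregexpr` / `regmatches` / `table` chain on a tiny example. The vector `toy_text` is invented for illustration. Elements without a match yield `character(0)`, which `unlist` drops, so this sketch skips the `grep` pre-filter.

```r
# Invented sample tweets
toy_text <- c("@alice bonjour @bob", "merci @alice", "no mention here")

# Locate every @username, then pull out the matched substrings
mention_pattern <- "@\\w+"
mentions <- unlist(regmatches(toy_text, gregexpr(mention_pattern, toy_text)))

# Count each username and sort from most to least frequent
mention_counts <- sort(table(mentions), decreasing = TRUE)
print(mention_counts)
```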

In [None]:
top15 = head(d, n = 15)
# barplot
mar.default <- c(12,4,4,4) + 0.1
par(mar = mar.default + c(0, 0, 0, 0))
barplot(top15, border=NA, las=2, main="Top 15 most frequent @twitter_username", cex.main=1)

<font color="#2E1698"><B> Let's see which are the most used words</B></font>

In [None]:
at.pattern = "[a-zA-Z]\\w+ *"
have.at = grep(x = col, pattern = at.pattern)
at.matches = gregexpr(pattern = at.pattern,
                        text = col[have.at])
extracted.at = regmatches(x = col[have.at], m = at.matches)

# most frequent words
mfw = gsub(" ", "", unlist(extracted.at))
w = sort(table(mfw), decreasing=TRUE)
head(w, n = 20)

In [None]:
top15 = head(w, n = 15)
top15d = sort(top15, decreasing=FALSE) 
# barplot
mar.default <- c(15,10,5,0) + 0.1
par(mar = mar.default + c(0, 0, 0, 0))
barplot(top15d, border=NA, las=2, main="Top 15 most frequent words", cex.main=1, horiz=TRUE)
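
Word frequencies can also be obtained by splitting the text on whitespace rather than matching a pattern. The sketch below uses an invented vector `toy_text`. It counts every token, including any `@username` or `#hashtag`, so it is not strictly equivalent to the pattern-based count above.

```r
# Invented sample of already cleaned tweets
toy_text <- c("debat primaire ce soir", "le debat commence", "primaire debat")

# Split each tweet on runs of whitespace and drop empty tokens
tokens <- unlist(strsplit(toy_text, "[[:space:]]+"))
tokens <- tokens[tokens != ""]

# Count the tokens and sort from most to least frequent
word_counts <- sort(table(tokens), decreasing = TRUE)
print(head(word_counts, n = 3))
```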