# Data Processing Notebook
#### Here we manipulate the data that we generated before.
#### The work is separated into 6 steps:
1. Data cleaning
2. Provide a list of the 15 most common words
3. Provide a list of the 2 pairs of words having the highest co-occurrence frequency
4. Build a graphical representation of the most frequent words with their polarity (pos/neg or anger/joy/fear/...)
5. Indicate the 3 most frequent representative words in each category
6. Compare the results of the two approaches

# 1. Data Cleaning
#### Here, we will clean the dataset in order to analyze it

First things first, let's import the CSV file.

In [None]:
tweets = read.csv("data/debat_primaire_20000.csv")

In [None]:
dim(tweets)

So our data frame contains 17 columns and 20 000 rows. Let's look at the first 10 rows.

In [None]:
head(tweets, n = 10)

Let's see if all the columns have multiple values, or if some are useless

In [None]:
head(unique(tweets$favorited))
head(unique(tweets$favoriteCount))
head(unique(tweets$replyToSN))
head(unique(tweets$replyToUID))
head(unique(tweets$id))
head(unique(tweets$isRetweet))
head(unique(tweets$longitude))
head(unique(tweets$latitude))

In [None]:
length(which(tweets$favorited == "TRUE"))
length(which(tweets$favorited == "FALSE"))

We can see here that there is no TRUE value for favorited, only FALSE. The favorited column is therefore useless.
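The same check can be run on every column at once. The following is a minimal sketch on a toy data frame; `toy`, its values, and the helper names `n_distinct` and `constant_cols` are assumptions made for illustration, not part of the dataset.

```r
# Toy data frame standing in for `tweets` (hypothetical values)
toy <- data.frame(
    favorited     = c(FALSE, FALSE, FALSE, FALSE),
    favoriteCount = c(0, 2, 0, 1),
    isRetweet     = c(TRUE, FALSE, TRUE, TRUE)
)

# Number of distinct values in each column
n_distinct <- sapply(toy, function(col) length(unique(col)))

# Columns with a single distinct value carry no information
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```

On the real data, replacing `toy` with `tweets` should list favorited among the constant columns, if the counts above hold.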

In [None]:
length(which(tweets$favoriteCount == 0))
length(which(tweets$favoriteCount != 0))

In [None]:
(3919/20000)*100

The favoriteCount column has multiple values: about 20% of them (3919 out of 20 000) are not 0, so we had better keep this column. It may be meaningful data.
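The percentage can also be computed without copying the count by hand, since the mean of a logical vector is the share of TRUE values. A small sketch, where `toy_counts` is a hypothetical stand-in for `tweets$favoriteCount`:

```r
# Hypothetical favourite counts standing in for tweets$favoriteCount
toy_counts <- c(0, 0, 3, 0, 1)

# mean() of a logical vector gives the proportion of TRUE values
share_nonzero <- mean(toy_counts != 0) * 100
print(share_nonzero)
```

This avoids having to update a hard-coded number if the dataset changes.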

In [None]:
# Count the non-missing values; is.na() is the idiomatic test for NA
sum(!is.na(tweets$longitude))
sum(!is.na(tweets$latitude))

In [None]:
(9/20000)*100

Only 9 tweets out of 20 000 contain a latitude and a longitude, which represents only 0.045% of the tweets. This information can be considered useless, and we can delete these two columns too.

In [None]:
# Count the non-missing values in each reply column
sum(!is.na(tweets$replyToSN))
sum(!is.na(tweets$replyToUID))
sum(!is.na(tweets$replyToSID))

In [None]:
(698/20000)*100

Only about 3.5% of the replyToSN and replyToUID values are not NA. We can delete these columns, together with the related replyToSID column, as they don't seem to be interesting to study.
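The share of non-missing values can be obtained for all columns in one call with `colMeans(!is.na(...))`. This is a sketch on a toy data frame; `toy`, its values, and the 30% cut-off are assumptions chosen for illustration only.

```r
# Toy data frame standing in for `tweets` (hypothetical values)
toy <- data.frame(
    longitude = c(NA, 2.35, NA, NA),
    replyToSN = c(NA, NA, "someone", "someone_else"),
    text      = c("a", "b", "c", "d"),
    stringsAsFactors = FALSE
)

# Percentage of non-missing values in each column
pct_present <- colMeans(!is.na(toy)) * 100
print(pct_present)

# Columns below an arbitrary, hypothetical threshold of 30%
sparse_cols <- names(pct_present)[pct_present < 30]
print(sparse_cols)
```

The threshold is a judgment call: the notebook drops columns that are filled in for roughly 3.5% of the tweets or fewer.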

### Let's delete these useless columns!

In [None]:
tweets <- subset(tweets, select=-c(replyToSN,replyToUID, replyToSID, latitude, longitude, favorited))

In [None]:
head(tweets)

### If we want to use the text, it has to be cleaned first

In [None]:
clean_text = function(x)
{
    # convert the text to lowercase; tolower() can raise an error on
    # invalidly encoded strings, in which case NA is returned
    try.error = function(z)
    {
        tryCatch(tolower(z), error = function(e) NA)
    }
    x = sapply(x, try.error)

    # remove all links starting with http
    x = gsub('http\\S+\\s*', '', x)

    # remove all words starting with #
    x = gsub("#\\w+ *", "", x)

    # replace punctuation with a space, except @, #, _, -
    # (these are kept, followed by a space)
    x = gsub("([@#_-])|[[:punct:]]", "\\1 ", x)

    # remove the space added after the kept @
    x = gsub("@ ", "@", x)

    # remove the space added after the kept _
    x = gsub("_ ", "_", x)

    # remove the space added after the kept -
    x = gsub("- ", "-", x)

    # remove digits
    x = gsub("[[:digit:]]", "", x)

    # collapse runs of spaces and tabs into a single space
    x = gsub("[ \t]{2,}", " ", x)

    # As we already have a column indicating whether the tweet is a retweet,
    # we can remove "rt @xxx" from the tweet header
    # (\\b keeps words ending in "rt", such as "start", intact)
    x = gsub("\\brt @\\w+ *", "", x)

    # remove words of 1 to 3 characters that are followed by whitespace
    x = gsub('\\b\\w{1,3}\\s', '', x)

    # remove blank spaces at the beginning/end
    x = gsub("^\\s+|\\s+$", "", x)
    return(x)
}

In [None]:
tweets$text <- clean_text(tweets$text)
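To get a feel for what the cleaning does, here is a sketch that applies a few substitutions of the kind clean_text performs to a single made-up tweet. The tweet, the user name and the URL are hypothetical, and only a subset of the steps is shown.

```r
# Made-up tweet (hypothetical user name and URL)
tweet <- "RT @someuser Watch the debate https://example.com/live #primaire"

step <- tolower(tweet)                      # lowercase
step <- gsub("http\\S+\\s*", "", step)      # drop links
step <- gsub("#\\w+ *", "", step)           # drop hashtags
step <- gsub("rt @\\w+ *", "", step)        # drop the retweet header
step <- gsub("\\b\\w{1,3}\\s", "", step)    # drop words of 1 to 3 characters
cleaned <- gsub("^\\s+|\\s+$", "", step)    # trim leading/trailing spaces
print(cleaned)
```

Only "watch debate" is left: the retweet header, the link, the hashtag and the short word "the" are gone.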

In [None]:
head(tweets, n = 20)