# Text Mining

Is about extracting information from textual data. It usually implies a combinated effort of machine learning and natural language processing (language written by humans) fields.

Many applications of text mining exists:

* **Sentiment analysis:** Allows to draw conclusion about the sentiments expressed inside the analyzed text. The sentiment can then be contextualized in order to obtain insights which could be useful for marketing.


* **Topic analysis:** Analyzing a large dataset with many textual sources, we can split them by topic by performing topic analysis.


* **Text categorization:** Dividing a pool of textes in different categories.


* **Text clustering:** Grouping text in an unsupervised way.


* **Entity extraction:** Being able to extract textual constructs (syntaxical, grammatical) from a text.


* **Summarization:** A technique of natural language generation which allows to build a natural language text to summarize another longer text.


First steps of creating text mining system:

* Define the nature of the input for the system. Is it long text, tweets, general, specific, whole text of sentence by sentence?

* Define the nature of the output of the system. Should we classify a text as good and bad, give more nuances to our evaluation, summarize it, categorize it?

* Define the form of learning data used by the system. Which is going to be the form of the dataframe that is going to be provided for the system.

* Define a way to assess the goodness of the system. How do we evaluate the system actually works? Can we trace correlation to other statistics which are accountable and measure the same thing, make a survey for the people, use a test set labeled by hand, measure economical impact of the system?

Text => Natural Language => Ambiguity. Data in those fields are  usually work than normal ML data.

We define a topic as a set of word. 


## Sentiment Analysis

* **Input**: A document


* **Output**:

    * A numeric value representing positivity (regression)
    * A categorical value in {Pos, Neg} (binary classification)
    * A categorical value in {Pos, Neutral, Neg} (multiclass classification)
    
    
* **Learning data**: A set of pairs {document, sentiment}

In order to be able to analyze text, we have to transform a document into a vector in space. There are many possibilities to do this:

* **Bag of words**: One dimension for each word in the vocabulary, the value of each variable for a text is the number of occurrencies of the said word inside the text. In order to reduce the size of the vocabulary used to build those vectors, we can **select only the most k frequent words** in the corpus or the ones we're most interested in. There is a trade-off, since rare but interesting words may be dropped off. This approach can be uneffective in case of documents with very different lengths, irrelevant terms with high frequencies and cases where order is important.

* **Term frequencies-Inverse Document Frequencies (tf-idf):** Gives a ratio between the frequency of the term (word) in the document (tf) with the overall frequency of the term in the whole set of documents (idf). It measures the relative importance of the term in the document, taking in account the overall rareness of the term.

* **N-grams**: Instead of considering single words, we can couple words into short sequences of n words (in text mining n can be up to 5), and then performing the bag of word approach on those. It is important to note that in this case the size of the vector increases by a power of n. There are lots of pre-trained tools for those tasks.

* **POS tagging**: A natural language processing technique which assign a label to each word in a sentence based on its role inside the said sequence. This could be used also to remove words of some families which are not important.



Possible preprocessing steps that could be applied:

* **Removing punctuation**, since it is often not useful for classification.

* **Converting the text to lowercase**.

* We can **remove stop words**, articles, conjunctions, prepositions, which are not meaningful for our purpose. They are language-dependent, so it is important to have a good domain knowledge about the language (or just use a library, ffs).

* We can perform **stemming**, which reduces each word to its word stem, the morphological root. Stemming is language dependent.

The application of preprocessing steps is very context dependent and requires some contextual knowledge to avoid errors.


# Laboratory: Categorize sport vs politics tweets

* **Input**: The system will receive the body of a english tweet in input.


* **Output**: Two values in the interval $[0,1]$ for the certainity about the tweet being about sports or about politics. If one of those values is above [to be defined], the tweet is classified accordingly. If both values are under a certain threshold, the tweet is classified as neither about sport nor politics.


* **Learning data**: The learning data which I will use are samples taken from the datasets about U.S. Congress Tweets, 2018 Olympics Tweets and Hurricane Irma Tweets in order to build a balanced training and test set, under the assumption that tweets contained in those datasets are respectively of Politics, Sports and Non-sport, non-politics topics.


* **Workflow**: 

    * Import the tweets based on their IDs using the twitter API wrapper provided by twitteR package.
    * Perform preprocessing on the tweets, removing all the punctuation and applying lowercase. Possibly use stemming. Also, remove the original hashtag to avoid overfitting.
    * Generate a training set taking a sample of [size] from all the three datasets, assigning to those tweets the labels from the original data (either Politics, Sport or General).
    * Perform an ngram on those tweets, followed by a bag of words approach in order to vectorize our tweets
    * Learning phase on the training set with associated labels.


* **Problem assessment:** Using the test set, we will evaluate the performance of the model we built and use the mean error squared to decide if our system has good enough performances.

In [12]:
library("twitteR")
library("tm")
library("SnowballC")

consumer_key <- "07fbY5UE2c7xFgIrDeVRjrh1x"
consumer_secret <- "qv2tO9OZpPv4gH88Ph2KjaMWw86EWudYrHQnT844l99LwOCpBs"
access_token <- "925913081032650752-zJFmwuv9eIYnteJGrX2Pik1HDpkzWxj"
access_secret <- "Y2DJyIOvigolC2ztWIvMLU0VwlQfmBFsP40xZ6VaLee7d"

setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)

[1] "Using direct authentication"


In [None]:
t_sport <- searchTwitter("#sport", n=10000, lang="en", resultType="mixed")

In [None]:
length(t_sport)

In [None]:
t_sport_text <- sapply(t_sport, function(x) x$getText())
t_sport_text_corpus <- Corpus(VectorSource(t_sport_text))

In [None]:
# Remove links
t_sport_text_corpus <- tm_map(t_sport_text_corpus, function(x)gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x))
# Remove punctuation
t_sport_text_corpus <- tm_map(t_sport_text_corpus, removePunctuation)
# Remove stopwords
t_sport_text_corpus <- tm_map(t_sport_text_corpus, function(x)removeWords(x,stopwords()))
# Tolower
t_sport_text_corpus <- tm_map(t_sport_text_corpus, content_transformer(tolower))
# Perform stemming
t_sport_text_corpus <- tm_map(t_sport_text_corpus, stemDocument)

# Convert to dataframe
df_sport <- as.data.frame(t(as.matrix(TermDocumentMatrix(t_sport_text_corpus))))

In [6]:
nrow(df_sport)
ncol(df_sport)

In [None]:
t_politics <- searchTwitter("#politics", n=10000, lang="en", resultType="mixed")

In [None]:
length(t_politics)

In [None]:
t_politics_text <- sapply(t_politics, function(x) x$getText())
t_politics_text_corpus <- Corpus(VectorSource(t_politics_text))

In [None]:
# Remove links
t_politics_text_corpus <- tm_map(t_politics_text_corpus, function(x)gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x))
# Remove punctuation
t_politics_text_corpus <- tm_map(t_politics_text_corpus, removePunctuation)
# Remove stopwords
t_politics_text_corpus <- tm_map(t_politics_text_corpus, function(x)removeWords(x,stopwords()))
# Tolower
t_politics_text_corpus <- tm_map(t_politics_text_corpus, content_transformer(tolower))
# Perform stemming
t_politics_text_corpus <- tm_map(t_politics_text_corpus, stemDocument)

# Convert to dataframe
df_politics <- as.data.frame(t(as.matrix(TermDocumentMatrix(t_politics_text_corpus))))

In [21]:
nrow(df_politics)
ncol(df_politics)

In [27]:
summary(df_politics)

     confid          conserv          diminish       dtkq0qcr0i   
 Min.   :0.0000   Min.   :0.0000   Min.   :0e+00   Min.   :0e+00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0e+00   1st Qu.:0e+00  
 Median :0.0000   Median :0.0000   Median :0e+00   Median :0e+00  
 Mean   :0.0018   Mean   :0.0061   Mean   :3e-04   Mean   :1e-04  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0e+00   3rd Qu.:0e+00  
      may             minut            split           surviv      
 Min.   :0.0000   Min.   :0.0000   Min.   :0e+00   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0e+00   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0e+00   Median :0.0000  
 Mean   :0.0448   Mean   :0.0024   Mean   :8e-04   Mean   :0.0015  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0e+00   3rd Qu.:0.0000  
    theresa        unionist…          vote          2sggle2eb2   
 Min.   :0.000   Min.   :0e+00   Min.   :0.0000   Min.   :0e+00  
 1st Qu.:0.000   1st Qu.:0e+00   1st Qu.:0.0000   1st Qu.:

In [34]:
t_other <- searchTwitter("#lol", n=10000, lang="en", resultType="mixed")

In [35]:
length(t_other)

In [36]:
t_other_text <- sapply(t_other, function(x) x$getText())
t_other_text_corpus <- Corpus(VectorSource(t_other_text))

In [37]:
# Remove links
t_other_text_corpus <- tm_map(t_other_text_corpus, function(x)gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", x))
# Remove punctuation
t_other_text_corpus <- tm_map(t_other_text_corpus, removePunctuation)
# Remove stopwords
t_other_text_corpus <- tm_map(t_other_text_corpus, function(x)removeWords(x,stopwords()))
# Tolower
t_other_text_corpus <- tm_map(t_other_text_corpus, content_transformer(tolower))
# Perform stemming
t_other_text_corpus <- tm_map(t_other_text_corpus, stemDocument)

# Convert to dataframe
df_other <- as.data.frame(t(as.matrix(TermDocumentMatrix(t_other_text_corpus))))

“transformation drops documents”

In [38]:
nrow(df_other)
ncol(df_other)

In [39]:
summary(df_other)

     actoz           coach            ggoong           head       
 Min.   :0e+00   Min.   :0.0000   Min.   :0e+00   Min.   :0.0000  
 1st Qu.:0e+00   1st Qu.:0.0000   1st Qu.:0e+00   1st Qu.:0.0000  
 Median :0e+00   Median :0.0000   Median :0e+00   Median :0.0000  
 Mean   :1e-04   Mean   :0.0014   Mean   :1e-04   Mean   :0.0033  
      hoon            lol              lubu         mightybear   
 Min.   :0e+00   Min.   :0.0000   Min.   :0e+00   Min.   :0e+00  
 1st Qu.:0e+00   1st Qu.:0.0000   1st Qu.:0e+00   1st Qu.:0e+00  
 Median :0e+00   Median :1.0000   Median :0e+00   Median :0e+00  
 Mean   :1e-04   Mean   :0.5936   Mean   :2e-04   Mean   :1e-04  
      pure             riri            star            verita     
 Min.   :0.0000   Min.   :0e+00   Min.   :0.0000   Min.   :0e+00  
 1st Qu.:0.0000   1st Qu.:0e+00   1st Qu.:0.0000   1st Qu.:0e+00  
 Median :0.0000   Median :0e+00   Median :0.0000   Median :0e+00  
 Mean   :0.0013   Mean   :1e-04   Mean   :0.0022   Mean   :1e-04  


In [28]:
df_politics["Topic"] <- "Politics"
df_sport["Topic"] <- "Sport"
df_general["Topic"] <- "Other"

In [63]:
t_politics_tdmatrix_clean <- removeSparseTerms(TermDocumentMatrix(t_politics_text_corpus, ), 0.95)

In [64]:
df_politics2 <- as.data.frame(t(as.matrix(t_politics_tdmatrix_clean)))

In [65]:
dim(df_politics2)

In [67]:
summary(df_politics2)

     brexit             …              polit             what       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.0849   Mean   :0.0792   Mean   :0.3896   Mean   :0.2297  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.0000  
 Max.   :3.0000   Max.   :4.0000   Max.   :3.0000   Max.   :2.0000  
      the             news              ♨               stay       
 Min.   :0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.216   Mean   :0.1129   Mean   :0.0724   Mean   :0.1073  
 3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
 Max.   :3.000   Max.   :3.0000   Max.   :1.0000   Max.   :2.0000  
      whi              101              d