# Data Analysis (Implementing RAKE Algorithm)

In this final section of the workshop, we will apply an algorithm called Rapid Automatic Keyword Extraction. __[RAKE](https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents)__ is an unsupervised, language-neutral, and domain-independent algorithm developed by Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley (2010). Since our dataset consists of untagged news articles, RAKE can give us the most important keywords emerging from the dataset, representing the articles in a condensed form.

First, we need to __import__ pandas to access the dataset as a dataframe.

In [4]:
import pandas as pd


In [5]:
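# Parse the 'date' column as datetimes (assumes the CSV has a 'date' column);
# date_parser expects a function, not a column name.
tw_news = pd.read_csv('tw_dataset.csv', parse_dates=['date'])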


We now need to create a function that concatenates the articles into a single string object.

In [6]:
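def sum_text(dataset):
    '''
    Takes a pandas DataFrame and returns the concatenated text
    of its 'fulltext' column as a single string.
    '''
    # Join with a space so the last word of one article does not
    # fuse with the first word of the next; skip missing values.
    return ' '.join(dataset['fulltext'].dropna().astype(str))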

raw_text = sum_text(tw_news)

To undertake the analysis, we need to __import__ the RAKE module. If it is not installed yet, you can install it first by typing the following into an empty cell:

`!pip install python-rake`

In [2]:
import RAKE


Now we can apply RAKE to our data. Before that, for further reference, please consider the following information about the algorithm.
- You can find the open-source code of the module __[here!](https://github.com/fabianvf/python-rake/blob/master/RAKE/RAKE.py)__

- RAKE calculates word scores by first building a word co-occurrence graph of the content and then computing, for each word:

$$
\frac{\text{degree}(word)}{\text{frequency}(word)}
$$

__The score of a keyword is the sum of the scores of its member words.__ For more detailed information, please refer to the paper.
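To make the scoring rule above more concrete, here is a minimal sketch in plain Python. It is not the code of the python-rake module; the stop list, the example sentence, and the helper names (`candidate_phrases`, `word_scores`, `phrase_scores`) are made up for illustration. It assumes that candidate phrases are split at stop words and punctuation, and that the degree of a word counts every word it co-occurs with inside a phrase, including itself.

```python
import re
from collections import defaultdict

# Hypothetical toy stop list, used only for this illustration.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "for"}

def candidate_phrases(text, stop_words):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stop_words:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def word_scores(phrases):
    """Score each word as degree(word) / frequency(word)."""
    frequency = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            # Degree: number of words in the phrase, the word itself included.
            degree[word] += len(phrase)
    return {word: degree[word] / frequency[word] for word in frequency}

def phrase_scores(phrases, scores):
    """Score each phrase as the sum of its member word scores."""
    return {" ".join(p): sum(scores[w] for w in p) for p in phrases}

text = ("The president of Taiwan met the opposition party. "
        "The opposition party criticized the president.")
phrases = candidate_phrases(text, STOP_WORDS)
scores = word_scores(phrases)
ranked = sorted(phrase_scores(phrases, scores).items(),
                key=lambda item: item[1], reverse=True)
print(ranked)
```

In this toy example, the word _opposition_ appears twice, once in a two-word phrase and once in a three-word phrase, so its score is 5 / 2 = 2.5. Longer phrases made of frequently co-occurring words therefore rise to the top of the ranking.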

Please be aware that the number of selected keywords and the parameter values passed to the RAKE object stem from subject/domain knowledge. After exploring our data in the previous sections and running several trials, I found that passing 3 for each of the _minimum number of characters, maximum number of words in the keyword, and minimum number of word appearances_ worked well. I also chose to use the built-in __FoxStopList__ and to retrieve the top __20__ candidates.
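To illustrate what these three thresholds do, here is a small sketch in plain Python. It is only one plausible reading of the parameters and not the module's actual code: the function `filter_candidates` and the candidate list are hypothetical, and the sketch assumes that the character threshold applies to the whole phrase and that the frequency threshold counts how often the phrase occurs among the candidates.

```python
from collections import Counter

def filter_candidates(phrases, min_chars=3, max_words=3, min_frequency=3):
    """Keep phrases that pass all three thresholds (illustrative assumption)."""
    counts = Counter(phrases)
    kept = []
    for phrase, count in counts.items():
        if len(phrase) < min_chars:
            continue  # too few characters
        if len(phrase.split()) > max_words:
            continue  # too many words in the keyword
        if count < min_frequency:
            continue  # appears too rarely
        kept.append(phrase)
    return kept

# Hypothetical candidate phrases with made-up frequencies.
candidates = (
    ["golden horse awards"] * 3
    + ["kmt"] * 4
    + ["us"] * 5
    + ["very long candidate keyword phrase"] * 3
    + ["han kuo-yu"] * 2
)
kept = filter_candidates(candidates)
print(kept)
```

Raising any of the thresholds makes the filter stricter, which is why the right values depend on how long and how repetitive the texts in your own dataset are.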

Now we can find the keywords and save them in a __pandas dataframe__.


In [14]:
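kwobjct = RAKE.Rake(stop_words=RAKE.FoxStopList())

# Arguments after the text: minimum characters, maximum words per keyword, minimum frequency
candidateKWords = kwobjct.run(raw_text, 3, 3, 3)

# Keep the top 20 candidates and index them from 1; the index length always matches the rows
topKWords = candidateKWords[:20]
keywordsData = pd.DataFrame(topKWords, columns=['Keyword', 'Score'], index=range(1, len(topKWords) + 1))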


Let's look at the results by displaying the first five rows of the new data frame!

In [22]:
keywordsData.head()

| | Keyword | Score |
|---|---|---|
| 1 | president tsai ing-wen | 13.060656 |
| 2 | hong kong-listed company | 9.800372 |
| 3 | china-friendly kuomintang party | 9.635668 |
| 4 | han kuo-yu | 8.05 |
| 5 | golden horse awards | 8.008333 |


It seems that we retrieved some useful keywords for our discussion. Let's save them in a CSV file and finish our small data analysis project.

In [23]:
keywordsData.to_csv('twnews_keywords.csv')

### Stepping stone for Advanced Knowledge Extraction

So far, our case study has used simple but powerful methods of Natural Language Processing. Before concluding this study, keep in mind that Python is capable of more advanced methods, which we will not be able to cover in this workshop. For instance, Python's __[spaCy](https://spacy.io)__ module provides extensive tools, pre-trained models, and interoperability with Python's other machine learning libraries, such as __[scikit-learn](https://scikit-learn.org)__. To give a sense of how comprehensive _spaCy_'s models are, one model called __'en_core_web_lg'__ is around 800 MB in size.

Just to demonstrate the power of _spaCy_, please see the graph below, which I created using a _spaCy_-based module that can be found __[here](https://github.com/BrambleXu/news-graph)__. After downloading the module, I customized it and ran it on the __raw_text__ variable holding the full text of the articles. The graph gave me information about the key actors, locations, and organizations, and about the relationships between them. I particularly wanted to see the dynamic relationship between the United States and other actors in our dataset. The simple code below provided the overview that I needed.

```Python
from news_graph import NewsMining
Miner = NewsMining()
Miner.main(raw_text)
```

<img src='images/Tw_elec_nlp.gif' style="width: 850px" align="middle" />

__As demonstrated above, Python can provide extremely useful methods for social scientists. I hope that this workshop has given you a first impression of Python and its possible uses for your research.__

- __[Previous: Data Exploration](2 - Data Exploration.ipynb)__