# Finding the Data

We need to install the [newsapi-python](https://github.com/mattlisiv/newsapi-python) package. In a notebook, we can do this by putting an exclamation mark (`!`) at the beginning of a cell, which passes the rest of the line to the system terminal. This is an easy way to install required packages, as well as to do other work such as finding the path of the working directory or of other files.
```
$ pip install newsapi-python
```
After installing the package, we can start sending queries to retrieve data. First we need to import NewsApiClient from the _newsapi_ module.

In [2]:
from newsapi import NewsApiClient

We need to use the key from the News API application that we created earlier. To avoid exposing your key and secret, it is good practice to save them in a separate Python file. We can then import them here and assign them to variables. I save mine in a file called __'nws_token.py'__. Using the code below, I __import__ the key string object __from__ the nws_token module that I created.

In Python there are various ways to __import__ a module. Here are some examples.

```Python
import module #method 1
from module import something #method 2
from module import * #method 3 imports all
```
If you use the first method, you later need to prefix the function or variable name with the module name:
```Python
x = module.function_name() #if you use the first method
```
Otherwise, you can call the function or variable from that module directly by its name. Here, we use the second method to import a variable from a module, since there will not be any other variables with the same name that might cause bugs.
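
As a small illustration of the difference between the first two methods, here is a sketch that uses the standard library `math` module (chosen only for this example; it is not used elsewhere in the workshop):

```python
import math                   # method 1: import the whole module
from math import sqrt         # method 2: import a single name

root_1 = math.sqrt(16)        # method 1 needs the module name as a prefix
root_2 = sqrt(16)             # method 2 lets us call the name directly

print(root_1, root_2)
```

Both calls run the same function; only the way we refer to it changes.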

After importing the key, we will create an instance of the NewsApiClient object by passing our individual key as a parameter.


In [3]:
from nws_token import key

In [4]:
api = NewsApiClient(api_key=key)
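
If you prefer not to keep a token file, one possible alternative is to read the key from an environment variable. The sketch below is only an illustration: the helper `load_key` and the variable name `NEWS_API_KEY` are assumptions made for this example, not something News API or the newsapi-python package requires.

```python
import os

def load_key(name='NEWS_API_KEY'):
    '''returns the value of an environment variable, or None if it is not set'''
    # NEWS_API_KEY is a hypothetical name chosen for this example
    return os.environ.get(name)

key = load_key()
if key is None:
    print('NEWS_API_KEY is not set')
```

The returned string could then be passed to `NewsApiClient(api_key=key)` in the same way as the imported one.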

Now that we have created an instance of _NewsApiClient_, we are ready to find the data we are looking for. It is always good practice to refer to the official documentation to find out what parameters we can pass and what kind of data we can retrieve. You can find the official documentation of News API [here](https://newsapi.org/docs). After reading through the documentation, we have a better understanding of the parameters we want to use.

Now, let's try to retrieve the 100 most recent news articles mentioning the 2020 Taiwan presidential elections and save them all into a __dictionary__ object called _articles_.


In [5]:
articles = {}
for i in range(1,6):
    articles.update({'page'+str(i): (api.get_everything(q='Taiwan AND Elections', 
                                                        language= 'en', page = i))})

All the information about the articles is now saved in our dictionary object called _articles_. It has a nested structure: the loop above saved 20 articles under each page key. As it stands, _articles_ is not of much use to us. It is a complex, hard-to-read data object with many fields for each article (e.g. date posted, author, source, abstract, content). If you want to take a look, just run this code in an empty cell:
```Python
print(articles)
```
Looks complex and hard to read! As an example, let's take a look at the data on one article.

In [6]:
print(articles['page1']['articles'][0])

{'urlToImage': 'https://static01.nyt.com/images/2019/12/02/opinion/02kassam/merlin_164544267_d1bde649-259b-4ea5-a9ad-161b056b48db-facebookJumbo.jpg', 'source': {'name': 'The New York Times', 'id': 'the-new-york-times'}, 'title': 'China Has Lost Taiwan, and It Knows It', 'url': 'https://www.nytimes.com/2019/12/01/opinion/china-taiwan-election.html', 'publishedAt': '2019-12-02T00:00:07Z', 'author': 'Natasha Kassam', 'content': 'The Sunflower Movement of 2014, a series of protests led by a coalition of students and civil-society activists, marked the rejection of close relations with China by Taiwans younger generations. So did the election of the pro-sovereignty Ms. Tsai in 2016.\r\nM… [+2354 chars]', 'description': 'So it is attacking democracy on the island from within.'}


__It is still complicated, but it gives a better view of the available data. Given that we have 100 such records, we need to manipulate and filter this information into a more useful form.__
News API does not provide the full content of the articles, so we will need web scraping to retrieve the full text of each article. For now, we can use a function to parse the results and save only the fields we need: title, source, publication date, description and URL.
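
To make the target concrete, here is a small sketch that picks those fields out of a single article by hand. The `sample` dictionary is a shortened copy of the article printed above, and the names `sample` and `selected` are used only for this illustration.

```python
# A shortened copy of the sample article printed above
sample = {
    'source': {'name': 'The New York Times', 'id': 'the-new-york-times'},
    'title': 'China Has Lost Taiwan, and It Knows It',
    'url': 'https://www.nytimes.com/2019/12/01/opinion/china-taiwan-election.html',
    'publishedAt': '2019-12-02T00:00:07Z',
    'author': 'Natasha Kassam',
    'description': 'So it is attacking democracy on the island from within.',
}

# Keep only the fields we need; note that the source name is nested one level deeper
selected = {
    'title': sample['title'],
    'source': sample['source']['name'],
    'date': sample['publishedAt'],
    'description': sample['description'],
    'URL': sample['url'],
}
print(selected)
```

Doing this by hand for 100 articles would be tedious, which is why we wrap the same idea in functions.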

#### Functions in Python

Functions are fundamental programming tools that let us wrap several statements together and produce the values we want. They make code easy to reuse. For this workshop, it is sufficient to grasp the basics of functions in Python. A function definition basically looks like this:
```Python
def func_name(args):
    statement
    return result
```
You can also use 'yield' instead of return if your function is a generator, but that is a more advanced technique that we will not use in this workshop. After you define the function, you call it by its name and, if required, bind the returned object to a variable.
```Python
func_name(args) ## calls the function
x = func_name(args) ## binds the returned object to a variable called x
```
Functions are a rich and powerful feature of Python, and I recommend reading more about them.

We will now use a function to grab the information we need from the articles.
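
Here is a minimal concrete example of the pattern above. The function `count_words` is a made-up helper for illustration only; it is not part of the workshop code.

```python
def count_words(text):
    '''takes a string and returns the number of words in it'''
    words = text.split()      # split on whitespace
    return len(words)

n = count_words('China Has Lost Taiwan, and It Knows It')   # bind the result to n
print(n)
```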

Let's __```import datetime```__ and __```dateutil.parser```__ modules to reformat the existing publication date into a more readable form.

In [78]:
from dateutil.parser import parse
from datetime import datetime

Let's first create a helper function to make the publication date more readable.

In [98]:
def reformat_date(date):
    '''takes a string and returns a reformatted string'''
    newdate = parse(date)
    return newdate.strftime("%d-%B-%Y")
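
If `dateutil` is not available, a similar result can be sketched with the standard library alone. This version is an alternative, not the function used in the rest of the workshop: the name `reformat_date_stdlib` is made up for this example, and it assumes the date always has the exact shape shown in the sample article (for example `2019-12-02T00:00:07Z`), whereas `dateutil` accepts many formats. The month name produced by `%B` depends on the system locale.

```python
from datetime import datetime

def reformat_date_stdlib(date):
    '''takes a string like 2019-12-02T00:00:07Z and returns a reformatted string'''
    # strptime needs the exact input format spelled out
    newdate = datetime.strptime(date, "%Y-%m-%dT%H:%M:%SZ")
    return newdate.strftime("%d-%B-%Y")

print(reformat_date_stdlib('2019-12-02T00:00:07Z'))
```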


Now, we will create another helper function to prevent duplicate articles appearing in our dataset.

In [115]:
def check_duplicate(dataset,title):
    '''
    takes a list of dictionaries and a title string
    to check for duplication of same articles;
    returns True if the title is already in the dataset, otherwise False
    '''
    for i in dataset:
        if i['title'] == title:
            return True
    return False
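
The helper above scans the whole list for every new title, which is fine for 100 articles. For larger datasets, one possible alternative is to remember the titles already seen in a set. The sketch below shows that idea on made-up sample data; `unique_by_title` is a hypothetical name and is not used in the rest of the workshop.

```python
def unique_by_title(items):
    '''takes a list of dictionaries and keeps the first one for each title'''
    seen_titles = set()
    unique = []
    for item in items:
        if item['title'] not in seen_titles:   # set lookup instead of a list scan
            seen_titles.add(item['title'])
            unique.append(item)
    return unique

sample = [{'title': 'A'}, {'title': 'B'}, {'title': 'A'}]
print(unique_by_title(sample))
```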
        
    

Since News API does not provide the full text of the articles, we need a web scraping function to retrieve the full text of each article. We need to import the __```requests```__ and __```BeautifulSoup```__ packages.

In [125]:
import requests
from bs4 import BeautifulSoup

Now we can write another helper function to retrieve the full text of the articles. Since we might face errors and exceptions while retrieving the full text from a website, it is important to catch the possible exceptions and handle them to prevent our application from breaking. We can do this using this syntax:
```Python
try:
    some_code()
except:
    some_exception_handling()
```
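
To see the pattern in action without touching the network, here is a small sketch that catches one specific exception type. The function `to_number` is a made-up example, not part of the workshop code.

```python
def to_number(text):
    '''takes a string and returns an int, or None if it cannot be converted'''
    try:
        return int(text)
    except ValueError as e:
        print(e)          # show what went wrong
        return None       # signal the failure instead of crashing

print(to_number('42'))
print(to_number('forty-two'))
```

The second call would normally stop the program with a `ValueError`; here the error message is printed and the function returns `None` instead.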
Below, we use the __"```Exception as e```"__ expression so that we can print the details of the error and handle it better next time.

In [146]:
def get_fulltext(url):
    '''
    Takes the URL and returns 
    article full text. 
    '''
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)'
                ' AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
    try:
        page = requests.get(url,headers = HEADERS)
        soup = BeautifulSoup(page.content, 'html.parser')
        texts = soup.find_all('p')
        article = ''
        for i in texts:
            article += str(i.get_text())
        return article
    except Exception as e:
        print(e)
        return None
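
One detail worth knowing: the loop above glues the paragraphs together with nothing in between, so the last word of one paragraph runs into the first word of the next. If that matters for your analysis, a possible variation is to join the pieces with a space. The sketch below uses two placeholder strings standing in for what `i.get_text()` would return.

```python
# Placeholder strings standing in for the text of two <p> tags
paragraphs = ['First paragraph.', 'Second paragraph.']

joined_plain = ''.join(paragraphs)     # same result as += in a loop
joined_spaced = ' '.join(paragraphs)   # a space between paragraphs

print(joined_plain)
print(joined_spaced)
```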
        


We can now create a function to extract the information we want in a readable way.

In [149]:
def article_extract(articles):
    '''
    takes a dictionary object returned from News API and
    returns a list of dictionaries with the required fields
    '''
    news_data = []
    # each value holds the API response for one page
    for page in articles.values():
        for item in page['articles']:
            if not check_duplicate(news_data, item['title']):
                news_data.append({'title': item['title'],
                                  'source': item['source']['name'],
                                  'URL': item['url'],
                                  'description': item['description'],
                                  'date': reformat_date(item['publishedAt']),
                                  'fulltext': get_fulltext(item['url'])})
    # return only after every page has been processed
    return news_data
    
    

__Now our function is ready for operation. Let's call it and look at the first item of the dataset it creates. It should be more readable, with only the required fields.__

In [153]:
data_set = article_extract(articles)
print(data_set[0])

{'source': 'Reuters', 'date': '25-November-2019', 'URL': 'https://www.reuters.com/article/us-taiwan-election-idUSKBN1XZ0AP', 'title': "Taiwan ruling party says China 'enemy of democracy' after meddling allegations", 'fulltext': '4 Min ReadTAIPEI (Reuters) - Taiwan President Tsai Ing-wen’s ruling party denounced China as an “enemy of democracy” on Monday following fresh claims of Chinese interference in the island’s politics ahead of presidential and legislative elections on Jan. 11. The allegations, reported by Australian media, were made by a Chinese asylum seeker in Australia who said he was a Chinese spy. China, which claims Taiwan as its sacred territory, to be brought under Beijing’s control by force if necessary, has branded the asylum seeker a fraud. The Chinese man, Wang Liqiang, also provided details of Chinese efforts to infiltrate universities and media in the Chinese territory of Hong Kong, which has been rocked by months of anti-government protests.  Cho Jung-tai, chairman

***

It seems from the results that we managed to create our data set. Now we can save it in a comma-separated values (CSV) file to start our analysis. For this, we need the __```csv```__ module.

In [158]:
import csv
with open("tw_dataset.csv", 'w', newline='', encoding='utf-8') as file:
    tw_dt = csv.DictWriter(file, data_set[0].keys())
    tw_dt.writeheader()
    tw_dt.writerows(data_set)
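
To check that a file like this can be read back, here is a self-contained sketch that writes a tiny made-up dataset to a temporary file and loads it again with `csv.DictReader`. The names `sample_set` and `sample_dataset.csv` are assumptions for this example only. Opening the file with `newline=''` follows the recommendation in the `csv` module documentation.

```python
import csv
import os
import tempfile

# A tiny stand-in for data_set so the sketch runs on its own
sample_set = [{'title': 'A', 'source': 'X'}, {'title': 'B', 'source': 'Y'}]

path = os.path.join(tempfile.gettempdir(), 'sample_dataset.csv')

# Write the rows, using the keys of the first item as the header
with open(path, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, list(sample_set[0].keys()))
    writer.writeheader()
    writer.writerows(sample_set)

# Read the rows back as a list of dictionaries
with open(path, newline='', encoding='utf-8') as file:
    rows = list(csv.DictReader(file))

print(len(rows), rows[0]['title'])
```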


__Our Data Set is saved in our working directory and now ready for exploration and analysis!__

<img src="images/dataset.gif" style="width: 650px" align="middle" /> 


- __[Previous: Setting the Scene](0 - Setting the Scene.ipynb)__