## Python Basics - Challenge



- The file `guardian_articles_corona.json` contains UTF-8 encoded articles for the search term *coronavirus* in the year 2020 from [The Guardian API](http://open-platform.theguardian.com/) (retrieved 13/05/2020)
- The objective is to simplify the data structure such that analyses can be run afterwards
- Make use of the exercises and notebooks we have discussed previously
- The challenge is much more comprehensive than the other tasks. It's OK if the solution takes more time. You might also want to tackle the challenge in your groups.

### 1.

Download the `JSON` file, read it into Python and familiarise yourself with the data structure. How many articles does the file contain?

### 2. 

Write a function to process the list of articles. Simplify the data structure according to the following input/output example:

**Input:**

```
{
    'id': 'world/2020/may/08/coronavirus-the-week-explained',
    'type': 'article',
    'sectionId': 'world',
    'sectionName': 'World news',
    'webPublicationDate': '2020-05-08T10:54:45Z',
    'webTitle': 'Coronavirus: the week explained',
    'webUrl': 'https://www.theguardian.com/world/2020/may/08/coronavirus-the-week-explained',
    'apiUrl': 'https://content.guardianapis.com/world/2020/may/08/coronavirus-the-week-explained',
    'fields': {
        'bodyText': 'Welcome to our weekly roundup of developments in the coronavirus pandemic, which continues ...',
        'charCount': '6139'},
    'tags': [
        {'id': 'world/coronavirus-outbreak',
         'type': 'keyword',
         'sectionId': 'world',
         'sectionName': 'World news',
         'webTitle': 'Coronavirus outbreak',
         'webUrl': 'https://www.theguardian.com/world/coronavirus-outbreak',
         'apiUrl': 'https://content.guardianapis.com/world/coronavirus-outbreak',
         'references': []},
        {'id': 'science/science',
         'type': 'keyword',
         'sectionId': 'science',
         'sectionName': 'Science',
         'webTitle': 'Science',
         'webUrl': 'https://www.theguardian.com/science/science',
         'apiUrl': 'https://content.guardianapis.com/science/science',
         'references': []}]
    ...

```

**Output:**

```
{'chars': 6139,
 'id': 'world/2020/may/08/coronavirus-the-week-explained',
 'section': 'World news',
 'tags': 'world/coronavirus-outbreak, science/science',
 'text': 'Welcome to our weekly roundup of developments in the coronavirus pandemic, which continues ...',
 'title': 'Coronavirus: the week explained',
 'url': 'https://www.theguardian.com/world/2020/may/08/coronavirus-the-week-explained',
 'month': 5}
```

### 3.
The variable `chars` in your processed articles contains the number of characters in the text. Check against a sample article whether this value is correct.
      
### 4.
Find out in which month most articles were published.

### 5.
Find the three most frequently used tags from all articles.

### 6.
Return the titles of the five longest articles (measured by number of characters).
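
The notebook below does not include a cell for this task, so here is a minimal sketch of one possible approach. It assumes processed articles shaped like the output example in task 2. The helper name `longest_titles` and the sample data are hypothetical. The count is converted with `int()` because the API delivers `charCount` as a string, and comparing strings would rank `'800'` above `'7000'`.

```python
def longest_titles(articles, n=5):
    """Return the titles of the n articles with the highest character count."""
    # int() makes the comparison numeric even if 'chars' is still a string
    ranked = sorted(articles, key=lambda a: int(a['chars']), reverse=True)
    return [a['title'] for a in ranked[:n]]


# Hypothetical stand-in data, not taken from the Guardian file
sample_articles = [
    {'title': 'A', 'chars': '120'},
    {'title': 'B', 'chars': '9000'},
    {'title': 'C', 'chars': 450},
    {'title': 'D', 'chars': '30'},
    {'title': 'E', 'chars': '7000'},
    {'title': 'F', 'chars': '800'},
]

top_titles = longest_titles(sample_articles)
print(top_titles)  # ['B', 'E', 'F', 'C', 'A']
```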

### 7.
Store the processed articles in a `JSON` file. Be careful to specify the text encoding as `utf-8`.
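
A minimal sketch of this step, assuming the processed articles are a list of dictionaries. The helper name `save_articles`, the output file name and the sample record are hypothetical. Passing `encoding='utf-8'` to `open` sets the file encoding, and `ensure_ascii=False` keeps non-ASCII characters readable instead of escaping them.

```python
import json
import os
import tempfile


def save_articles(articles, path):
    """Write the processed articles to `path` as UTF-8 encoded JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False writes characters such as 'é' as-is
        json.dump(articles, f, ensure_ascii=False, indent=2)


# Hypothetical stand-in record, not taken from the Guardian file
sample_articles = [
    {'id': 'sample/1', 'title': 'Café culture under lockdown', 'chars': 27},
]

# A temporary directory keeps the sketch from touching the working directory
out_path = os.path.join(tempfile.mkdtemp(), 'processed_articles_sample.json')
save_articles(sample_articles, out_path)

# Read the file back to confirm the round trip preserves the data
with open(out_path, encoding='utf-8') as f:
    restored = json.load(f)

print(restored == sample_articles)
```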

In [1]:
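# Code for Python challenge

# Preamble

import zipfile
import json
import os
import datetime
import pandas as pd
from tqdm.notebook import tqdm

# Read the JSON directly from the .zip archive to avoid extracting it to disk

# os.path.join builds the path with the separator of the current operating system
zip_file_path = os.path.join(os.getcwd(), 'guardian_articles_corona.zip')

with zipfile.ZipFile(zip_file_path, "r") as zf:

    # If the archive held several files, only the last one would remain in `d`
    for name in zf.namelist():

        with zf.open(name) as f:

            data = f.read()
            d = json.loads(data)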
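
Task 1 also asks how many articles the file contains. Assuming the parsed object is a list with one dictionary per article, its length answers that. The sketch below uses a small stand-in file so it runs on its own; the helper name `load_articles` and the sample file are hypothetical.

```python
import json
import os
import tempfile


def load_articles(path):
    """Read a UTF-8 encoded JSON file and return the parsed object."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Hypothetical stand-in for the Guardian file: a list with three articles
sample = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_articles.json')
with open(sample_path, 'w', encoding='utf-8') as f:
    json.dump(sample, f)

articles = load_articles(sample_path)

# For a list of article dictionaries, len() gives the number of articles
n_articles = len(articles)
print(type(articles).__name__, n_articles)  # list 3
```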

In [2]:
# Function to simplify each article in a list of parsed JSON objects

def parse_articles(json_object):

    article_list = []

    for article in tqdm(json_object):

        record = {}

        # The API delivers the character count as a string; store it as an integer
        record['chars'] = int(article['fields']['charCount'])
        record['id'] = article['id']
        record['section'] = article['sectionName']

        # Collapse the tag dictionaries into one comma-separated string of tag ids
        tags_list = [tag['id'] for tag in article['tags']]
        record['tags'] = ', '.join(tags_list)
        record['text'] = article['fields']['bodyText']
        record['title'] = article['webTitle']
        record['url'] = article['webUrl']
        # The first ten characters of the timestamp hold the date (YYYY-MM-DD)
        record['month'] = datetime.datetime.strptime(article['webPublicationDate'][:10], "%Y-%m-%d").month

        article_list.append(record)

    return article_list

In [3]:
data = parse_articles(d)

  0%|          | 0/10801 [00:00<?, ?it/s]

In [4]:
# Check whether the character count reported by the API matches the length of the article text we compute ourselves

sample_index = 0
data[sample_index]['chars'] == len(data[sample_index]['text'])

True
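
The check above covers a single article. A minimal sketch of extending it to every article follows, assuming records with `id`, `chars` and `text` keys as in the processed output. The helper name `find_char_mismatches` and the sample records are hypothetical. `int()` is applied so the comparison works whether `chars` is stored as a string or as an integer.

```python
def find_char_mismatches(articles):
    """Return the ids of articles whose stored count differs from len(text)."""
    return [a['id'] for a in articles if int(a['chars']) != len(a['text'])]


# Hypothetical stand-in records: 'a' is consistent, 'b' is not
sample_articles = [
    {'id': 'a', 'chars': '5', 'text': 'hello'},
    {'id': 'b', 'chars': 3, 'text': 'hello'},
]

mismatches = find_char_mismatches(sample_articles)
print(mismatches)  # ['b']
```

An empty list would mean that the stored counts agree with the computed lengths for all articles.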

In [6]:
# Converting the list of dictionaries into a pandas dataframe and counting the frequency of unique values

df = pd.DataFrame(data)
df.head()

| | chars | id | section | tags | text | title | url | month |
|---|---|---|---|---|---|---|---|---|
| 0 | 6139 | world/2020/may/08/coronavirus-the-week-explained | World news | world/coronavirus-outbreak, science/science, w... | Welcome to our weekly roundup of developments ... | Coronavirus: the week explained | https://www.theguardian.com/world/2020/may/08/... | 5 |
| 1 | 2196 | world/2020/apr/14/coronavirus-latest-at-a-glance | World news | world/world, world/coronavirus-outbreak | Key developments in the global coronavirus out... | Coronavirus: at a glance | https://www.theguardian.com/world/2020/apr/14/... | 4 |
| 2 | 2469 | world/2020/apr/17/coronavirus-contact-tracing-... | World news | world/coronavirus-outbreak, science/infectious... | What is contact tracing? This is one of the mo... | Coronavirus: contact tracing explained | https://www.theguardian.com/world/2020/apr/17/... | 4 |
| 3 | 3356 | world/2020/may/12/coronavirus-latest-at-a-glan... | World news | world/world, world/coronavirus-outbreak | Key developments in the global coronavirus out... | Coronavirus latest: at a glance | https://www.theguardian.com/world/2020/may/12/... | 5 |
| 4 | 3483 | world/2020/may/12/coronavirus-latest-at-a-glan... | World news | world/world, world/coronavirus-outbreak | Key developments in the global coronavirus out... | Coronavirus latest: at a glance | https://www.theguardian.com/world/2020/may/12/... | 5 |


In [7]:
df['month'].value_counts()

4    4307
3    4076
5    1479
2     744
1     195
Name: month, dtype: int64
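
According to the counts above, month 4 (April) has the most articles, with 4307. To obtain the month programmatically rather than reading it off the table, `idxmax()` on the counts returns the label of the largest value. A minimal sketch on a small stand-in frame (the sample values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the 'month' column of the processed articles
sample = pd.DataFrame({'month': [4, 4, 3, 5, 4, 3]})

counts = sample['month'].value_counts()

# idxmax() returns the index label (here: the month) with the highest count
top_month = counts.idxmax()
top_count = counts.max()
print(top_month, top_count)  # 4 3
```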

In [8]:
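# Collect the tag string of every article

tags_list = []

for i in range(len(df)):
    
    tags_list.append(df.iloc[i]['tags'])

# Caution: joining with ' ' fuses the last tag of one article with the first tag of the next
# (e.g. 'politics/health uk/uk' in the result below); joining with ', ' would keep them apart
one_string = ' '.join([s for s in tags_list])

tags = one_string.split(', ')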

In [9]:
empty_list = []

for tag in set(tags):
    
    empty_dict = {}
    
    empty_dict['tag'] = tag
    empty_dict['count'] = tags.count(tag)
    empty_list.append(empty_dict)   

In [10]:
pd.DataFrame(empty_list).sort_values(by=['count'], ascending=False)

| | tag | count |
|---|---|---|
| 6681 | world/coronavirus-outbreak | 3696 |
| 1908 | uk/uk | 2568 |
| 2207 | science/infectiousdiseases | 2200 |
| 3275 | world/world | 2003 |
| 3723 | business/business | 1196 |
| ... | ... | ... |
| 3485 | world/sri-lanka-attacks | 1 |
| 3482 | sport/us-sport books/history | 1 |
| 3481 | australia-news/scott-morrison business/economics | 1 |
| 3479 | politics/health uk/uk | 1 |
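
Entries such as `sport/us-sport books/history` in the result above suggest that two tags from neighbouring articles were fused into one, so the counts shown are likely too low for some tags. A minimal sketch of an alternative that splits each article's tag string separately and counts with `collections.Counter` follows. It assumes the `tags` field is a string of tag ids separated by `', '`, as in the processed output. The helper name `count_tags` and the sample records are hypothetical, and the counts it would produce on the real data are not shown here.

```python
from collections import Counter


def count_tags(articles):
    """Count tag ids across articles, splitting each article's tags on its own."""
    counter = Counter()
    for article in articles:
        # Splitting per article keeps tags of neighbouring articles apart;
        # the filter drops the empty string produced by articles without tags
        tags = [t for t in article['tags'].split(', ') if t]
        counter.update(tags)
    return counter


# Hypothetical stand-in records, not taken from the Guardian file
sample_articles = [
    {'tags': 'world/coronavirus-outbreak, science/science'},
    {'tags': 'world/world, world/coronavirus-outbreak'},
    {'tags': 'uk/uk, world/coronavirus-outbreak, world/world'},
    {'tags': ''},
]

tag_counts = count_tags(sample_articles)

# most_common(3) returns the three most frequent tags with their counts
top_three = tag_counts.most_common(3)
print(top_three)
```

Besides avoiding fused tags, this makes a single pass over the data instead of calling `tags.count(tag)` once per unique tag.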


<br>
<br>


___

                
**Contact: Gerome Wolf** (Email: wolfgerome@gmail.com)