In [14]:
import pandas as pd
import numpy as np
import glob
import datetime
from tqdm.notebook import tqdm_notebook
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

# Analyzing the tags on a given day
This notebook uses the tags assigned to each article to take a closer look at the articles published on a given day. We analyze the articles' tags on three dates: the announcement of the lockdown, Prince Charles testing positive for coronavirus, and Johnson's hospitalization. For each date we collect all the articles and then proceed in two steps:
first, a descriptive analysis of that day's articles, and second, the construction of a network of tags for all the articles on that date.
At the moment we only have a dataset that was queried specifically for coronavirus-related articles, so the counts reported here cover that subset only. We will address this later on.
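
The second step, the tag network, can be illustrated with a minimal sketch. It assumes that two tags are linked whenever they appear on the same article and that the edge weight is the number of articles they share; the notebook does not specify the construction, so this is one plausible reading. The function name `tag_cooccurrence` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tag_cooccurrence(pairs):
    """Count how often two tags appear on the same article.

    `pairs` is an iterable of (article_id, tag_id) tuples.
    Returns a Counter mapping (tag_a, tag_b), with tag_a < tag_b,
    to the number of articles carrying both tags.
    """
    # Group tags per article; a set ignores repeated assignments.
    tags_by_article = defaultdict(set)
    for article_id, tag_id in pairs:
        tags_by_article[article_id].add(tag_id)

    # Every unordered pair of tags on one article adds 1 to that edge.
    edges = Counter()
    for tags in tags_by_article.values():
        for tag_a, tag_b in combinations(sorted(tags), 2):
            edges[(tag_a, tag_b)] += 1
    return edges


# Made-up toy data, not taken from the dataset.
pairs = [
    ("a1", "world/coronavirus-outbreak"),
    ("a1", "uk/uk"),
    ("a1", "politics/politics"),
    ("a2", "world/coronavirus-outbreak"),
    ("a2", "uk/uk"),
    ("a3", "world/coronavirus-outbreak"),
]
edges = tag_cooccurrence(pairs)
for (tag_a, tag_b), weight in edges.most_common():
    print(tag_a, tag_b, weight)
```

With the columns used in this notebook, the input pairs could be built as `zip(tagsToArticleDay.article_id, tagsToArticleDay.id)`. The resulting weighted edge list can then be handed to a graph library of choice.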

The cell below imports the tagsToArticle dataset and corrects the types of certain columns.

In [49]:
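# Load the tag-to-article table: one row per tag assigned to an article.
tagsToArticle = pd.read_json('tags_articles.json')
# Publication timestamps are stored as epoch milliseconds; convert them to datetimes.
tagsToArticle['article_webPublicationDate'] = pd.to_datetime(
    tagsToArticle['article_webPublicationDate'], unit='ms')
# Keep the calendar day separately so articles can be filtered by date.
tagsToArticle['publicationDay'] = tagsToArticle.article_webPublicationDate.dt.date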
tagsToArticle.head()
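
As a small illustration of the type correction above, the sketch below converts epoch milliseconds to datetimes and extracts the calendar day. The two timestamps are made up for the example. Note the assumption that the values are Unix epoch milliseconds: the converted timestamps are then timezone-naive and correspond to UTC, so the calendar day may differ from UK local time for articles published around midnight during British Summer Time.

```python
import datetime

import pandas as pd

# Made-up epoch timestamps in milliseconds (not taken from the dataset):
# 2020-03-23 00:00:00 UTC and 2020-03-23 21:30:00 UTC.
raw = pd.Series([1584921600000, 1584999000000])

# unit="ms" tells pandas the integers count milliseconds since 1970-01-01 UTC.
stamps = pd.to_datetime(raw, unit="ms")

# .dt.date yields plain datetime.date objects, which compare directly
# against a datetime.date when filtering for a single day.
days = stamps.dt.date
same_day = days == datetime.date(2020, 3, 23)
print(stamps.tolist())
print(same_day.tolist())
```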

### 23rd of March (Lockdown announced by Boris Johnson)
The first cell below carries out the descriptive analysis for all the articles published on the 23rd of March and their respective tags. The dataset contains 128 Guardian articles for that day, carrying 1638 tag assignments in total, drawn from 481 distinct tags. The ten most used keyword tags are listed below, each with the percentage of that day's articles that carry it.

In [78]:
date = datetime.date(2020, 3, 23)
# Keep only the tag assignments of articles published on the chosen day.
tagsToArticleDay = tagsToArticle[tagsToArticle.publicationDay == date]
# Distinct articles and distinct tags for that day.
num_of_articles = len(tagsToArticleDay.article_id.unique())
num_of_tags = len(tagsToArticleDay.id.unique())
print(f"Number of articles published on {date.strftime('%d-%m-%Y')}: {num_of_articles}")
print(f"Number of different tags and total used tags: {num_of_tags} & {len(tagsToArticleDay)}")
# Share of the day's articles (in percent) that carry each keyword tag.
print('Top 10 used keywords:')
print(tagsToArticleDay[tagsToArticleDay.type =='keyword'].id.value_counts()[:10]/num_of_articles * 100)
# Same share for the content-type tags.
print('\nWhat kind of content was published?')
print(tagsToArticleDay[tagsToArticleDay.type == 'type'].id.value_counts()/num_of_articles * 100)

Number of articles published on 23-03-2020: 128
Number of different tags and total used tags: 481 & 1638
Top 10 used keywords:
world/coronavirus-outbreak    77.34375
uk/uk                         52.34375
world/world                   32.81250
science/infectiousdiseases    31.25000
politics/politics             17.96875
society/health                15.62500
society/society               15.62500
culture/culture               14.84375
science/science               14.84375
business/business             12.50000
Name: id, dtype: float64

What kind of content was published?
type/article    100.0
Name: id, dtype: float64
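
Since the same summary is computed for every date, it could be wrapped in a small helper. The sketch below is one possible way to do so, not the notebook's own code: the name `summarize_day` and the toy DataFrame are hypothetical, and the `drop_duplicates` step is an added guard that assumes a tag should count at most once per article.

```python
import datetime

import pandas as pd


def summarize_day(tags_to_article, date, top_n=10):
    """Summarize the tag assignments of all articles published on `date`."""
    day = tags_to_article[tags_to_article["publicationDay"] == date]
    n_articles = day["article_id"].nunique()

    # Percentage of the day's articles carrying each keyword tag.
    # Assumption: a tag counts once per article, hence drop_duplicates.
    keywords = day[day["type"] == "keyword"].drop_duplicates(["article_id", "id"])
    keyword_share = keywords["id"].value_counts().head(top_n) / n_articles * 100

    return {
        "articles": n_articles,
        "distinct_tags": day["id"].nunique(),
        "tag_assignments": len(day),
        "keyword_share": keyword_share,
    }


# Made-up toy data with the same column names as tagsToArticle.
d = datetime.date(2020, 3, 23)
other = datetime.date(2020, 3, 25)
toy = pd.DataFrame({
    "article_id": ["a1", "a1", "a2", "a2", "a3"],
    "id": ["uk/uk", "type/article", "uk/uk", "world/world", "uk/uk"],
    "type": ["keyword", "type", "keyword", "keyword", "keyword"],
    "publicationDay": [d, d, d, d, other],
})
summary = summarize_day(toy, d)
print(summary["articles"], summary["distinct_tags"], summary["tag_assignments"])
print(summary["keyword_share"])
```

Calling the helper once per date keeps the three analyses identical except for the date and avoids copy-and-paste slips between cells.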


### 25th of March (Prince Charles tests positive for coronavirus)
The first cell below carries out the descriptive analysis for all the articles published on the 25th of March and their respective tags. The dataset contains 141 Guardian articles for that day, carrying 1861 tag assignments in total, drawn from 566 distinct tags. The ten most used keyword tags are listed below, each with the percentage of that day's articles that carry it.

In [79]:
date = datetime.date(2020, 3, 25)
tagsToArticleDay = tagsToArticle[tagsToArticle.publicationDay == date]
num_of_articles = len(tagsToArticleDay.article_id.unique())
num_of_tags = len(tagsToArticleDay.id.unique())
print(f"Number of articles published on {date.strftime('%d-%m-%Y')}: {num_of_articles}")
print(f"Number of different tags and total used tags: {num_of_tags} & {len(tagsToArticleDay)}")
print('Top 10 used keywords:')
print(tagsToArticleDay[tagsToArticleDay.type =='keyword'].id.value_counts()[:10]/num_of_articles * 100)
print('\nWhat kind of content was published?')
print(tagsToArticleDay[tagsToArticleDay.type == 'type'].id.value_counts()/num_of_articles * 100)

Number of articles published on 25-03-2020: 141
Number of different tags and total used tags: 566 & 1861
Top 10 used keywords:
world/coronavirus-outbreak    82.269504
uk/uk                         46.099291
world/world                   31.205674
science/infectiousdiseases    22.695035
society/society               19.148936
business/business             17.021277
society/health                14.893617
culture/culture               14.184397
politics/politics             14.184397
society/nhs                   10.638298
Name: id, dtype: float64

What kind of content was published?
type/article    100.0
Name: id, dtype: float64


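### 6th of April (Boris Johnson admitted to hospital)
The first cell below carries out the descriptive analysis for a single day and its respective tags. Note that the cell as run sets the date to the 28th of March 2020, not the 6th of April, so the figures reported here describe the 28th of March: the dataset contains 46 Guardian articles for that day, carrying 574 tag assignments in total, drawn from 221 distinct tags. To obtain the figures for the 6th of April, set `date = datetime.date(2020, 4, 6)` and rerun the cell. The ten most used keyword tags can be seen below.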

In [80]:
date = datetime.date(2020, 3, 28)
tagsToArticleDay = tagsToArticle[tagsToArticle.publicationDay == date]
num_of_articles = len(tagsToArticleDay.article_id.unique())
num_of_tags = len(tagsToArticleDay.id.unique())
print(f"Number of articles published on {date.strftime('%d-%m-%Y')}: {num_of_articles}")
print(f"Number of different tags and total used tags: {num_of_tags} & {len(tagsToArticleDay)}")
print('Top 10 used keywords:')
print(tagsToArticleDay[tagsToArticleDay.type =='keyword'].id.value_counts()[:10]/num_of_articles * 100)
print('\nWhat kind of content was published?')
print(tagsToArticleDay[tagsToArticleDay.type == 'type'].id.value_counts()/num_of_articles * 100)

Number of articles published on 28-03-2020: 46
Number of different tags and total used tags: 221 & 574
Top 10 used keywords:
world/coronavirus-outbreak    91.304348
uk/uk                         50.000000
world/world                   43.478261
society/society               30.434783
science/infectiousdiseases    19.565217
society/health                17.391304
culture/culture               15.217391
world/italy                   13.043478
world/spain                   10.869565
world/europe-news             10.869565
Name: id, dtype: float64

What kind of content was published?
type/article    100.0
Name: id, dtype: float64
