# Manual topic classification with keywords 
- Training models on all the topics is proving hard due to insufficient labelled data 
- Some topics are actually pretty straightforward as the data is news article titles 
- Therefore it is possible to fairly accurately classify into topics via the right set of keywords for each topic
- Below is an attempt at that (for the underperforming models)

In [1]:
import pandas as pd 
import numpy as np
import os 

In [2]:
labelled_data_path = "../Data/labelling/news_and_twitter_labelling_1.xlsx"
# labelled data 
df_labelled = pd.read_excel(labelled_data_path, sheet_name="guardian", index_col=0)
# all the data with all the fields 
df_all = pd.read_csv("resources/df_guardian_lem_nov14.csv", index_col=0)

In [3]:
# note: masks should be a separate category 
categories = [
    'economy', 'case_reporting', 'treatments_vaccines',
    'education', 'travel_lockdown', 'healthcare',
    'other', 'politics', 'environment', 'social_issues'
]

In [4]:
df_all.head()

Unnamed: 0_level_0,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,fields,isHosted,pillarId,pillarName,trailText,full_text,title_subtitle,title_subtitle_bow,title_subtitle_lem
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
society/2020/jan/18/expert-questions-effectiveness-airport-screening-new-coronavirus,article,society,Society,2020-01-18T19:20:36Z,Expert questions effectiveness of coronavirus ...,https://www.theguardian.com/society/2020/jan/1...,https://content.guardianapis.com/society/2020/...,{'headline': 'Expert questions effectiveness o...,False,pillar/news,News,We don’t know enough yet to be sure of catchin...,Medical staff at airports trying to screen for...,Expert questions effectiveness of coronavirus ...,expert questions effectiveness of coronavirus ...,expert question effectiveness coronavirus airp...
world/2020/jan/18/no-airport-screening-for-new-sars-like-virus-yet-china,article,world,World news,2020-01-18T19:20:36Z,No screening for new Sars-like virus at UK air...,https://www.theguardian.com/world/2020/jan/18/...,https://content.guardianapis.com/world/2020/ja...,{'headline': 'No screening for new Sars-like v...,False,pillar/news,News,Experts decide no need for checks as member of...,Health officials have ruled out introducing sc...,No screening for new Sars-like virus at UK air...,no screening for new sars-like virus at uk air...,screening new sars - like virus uk airport exp...
world/2020/jan/18/coronavirus-what-airport-measures-are-in-place-to-detect-for-sick-passengers,article,world,World news,2020-01-18T09:21:23Z,Coronavirus: what airport measures are in plac...,https://www.theguardian.com/world/2020/jan/18/...,https://content.guardianapis.com/world/2020/ja...,{'headline': 'Coronavirus: what airport measur...,False,pillar/news,News,"Three US airports introduce screening, followi...",International airports are stepping up screeni...,Coronavirus: what airport measures are in plac...,coronavirus what airport measures are in place...,coronavirus airport measure place detect sick ...
business/2020/mar/25/banks-warned-against-profiteering-from-uk-coronavirus-crisis,article,business,Business,2020-03-25T23:01:04Z,Banks warned against profiteering from UK coro...,https://www.theguardian.com/business/2020/mar/...,https://content.guardianapis.com/business/2020...,{'headline': 'Banks warned against profiteerin...,False,pillar/news,News,Labour MP Chris Bryant accuses lenders of char...,"British companies, including major banks, have...",Banks warned against profiteering from UK coro...,banks warned against profiteering from uk coro...,bank warn profiteer uk coronavirus crisis labo...
world/2020/mar/25/astonishing-170000-people-sign-up-to-be-nhs-volunteers-in-15-hours-coronavirus,article,society,Society,2020-03-25T20:28:04Z,"More than 500,000 people sign up to be NHS vol...",https://www.theguardian.com/world/2020/mar/25/...,https://content.guardianapis.com/world/2020/ma...,"{'headline': 'More than 500,000 people sign up...",False,pillar/news,News,"NHS surpasses target of 250,000 to help vulne...","More than 500,000 volunteers have signed up to...","More than 500,000 people sign up to be NHS vol...",more than 500000 people sign up to be nhs volu...,500000 people sign nhs volunteer nhs surpass t...


# Topics by keywords

In [5]:
def keyword_present(string, keywords):
    res = [ele for ele in keywords if(ele in string)] 
    if len(res) > 0:
        return "yes"
    else: 
        return "no"

## Case reporting and testing

- number of cases, 
- deaths,
- number of tests, 
- clusters
- test and trace 
- testing


In [74]:
keywords = [
    "test", "infection", "death", "die", "victim", "cases", 
    "self-isolate", "infect", "symptom", "cluster",
    'first wave', 'second wave', 'third wave',
    "operation moonshot", 'contact-tracing', 'contact tracing'
]

In [75]:
df_all['case_reporting'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [76]:
len(df_all[df_all['case_reporting']=='yes']['title_subtitle'])

1815

## Treatments, vaccines 

In [9]:
# note 1: may also include the word 'oxford' as it relates to the oxford vaccine 
# note 2: make sure to use " cure " and not "cure" as it is contained in words like 'secure'
keywords = ["vaccine", "vaccination", "jab", "treatment", "drug", " cure ", "pfizer", "biontec", "moderna", "astrazeneca"]

In [10]:
df_all['treatments_vaccines'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [11]:
len(df_all[df_all['treatments_vaccines']=='yes']['title_subtitle'])

284

In [12]:
#df_labelled.dropna(subset=['education'])['title_subtitle'].tolist()

## Education 

In [13]:
# note 1: should maybe try to explicitly remove Oxford University because it's mostly about the vaccine 
keywords = ["school", "education", "teaching", "university", "universities", "pupil", "student", "classroom", "lesson"]

In [14]:
df_all['education'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [15]:
len(df_all[df_all['education']=='yes']['title_subtitle'])

507

## Travel, Lockdown
- domestic travel and restrictions
- national and regional lockdowns
- international travel
- border restrictions
- travel campaigns
- quarantine and self-isolation

In [29]:
keywords = [
    "visitors", " travel ", "lockdown", "closure", "quarantine",
    "isolation", "holiday", "rule of six", "circuit breaker",
    "firebreak", "social distance", "social distancing", 
    "physical distancing", "restrictions", "reopen", "tier"
]

In [32]:
df_all['travel_lockdown'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [33]:
len(df_all[df_all['travel_lockdown']=='yes']['title_subtitle'])

2107

## Healthcare 

In [61]:
keywords = [
    "patients", "health service", "healthcare", " nhs ", 
    "personal protective equipment", " ppe ", "care home", " hospitals ",
]

In [62]:
df_all['healthcare'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [63]:
len(df_all[df_all['healthcare']=='yes']['title_subtitle'])

729

In [64]:
#df_all[df_all['healthcare']=='yes']['title_subtitle'].tolist()

['More than 500,000 people sign up to be NHS volunteers.  NHS surpasses target of 250,000 to help vulnerable during coronavirus crisis ',
 'UK coronavirus home testing to be made available to millions. Test to be validated this week, then made available to healthcare workers and general public',
 "Why it's healthy to be afraid in a crisis. Letter:<strong> </strong><strong>Dr Lucy Johnstone</strong>, a<strong> </strong>clinical psychologist, says it is wrong to view our natural fears as mental health disorders",
 'Schools asked to donate science goggles for NHS to use as face shields. Teachers in England say they are getting requests for eyewear and other protective equipment',
 'Who is dying from coronavirus and in which NHS trusts?. Victims are getting younger and London hospitals are those under the most pressure',
 'Airbus and Dyson among firms expecting green light to make ventilators. The companies will start making up to 30,000 ventilators from next week to help the NHS fight Cov

### Environment 

In [69]:
keywords = [
    "environment", "climate crisis", "climate change", "climate emergency" 
    "climate action", "wildlife", "pollute", "pollution", "air traffic" 
    "clear skies", "green recovery", "green economy" 
]

In [70]:
df_all['environment'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [71]:
len(df_all[df_all['environment']=='yes']['title_subtitle'])

109

In [77]:
#df_all[df_all['environment']=='yes']['title_subtitle'].tolist()

## Politics 
Use NER to just extract names of politicians 

## Social issues

In [123]:
keywords = [
    "immigration", " migrant ", " immigrant ", "inequality", "unemployment", 
    "welfare", "safety net", "prison", "inmate", "racism", "racist",
    "domestic violence ", "abuse", "older people", "elderly",
    'minorities', 'disabled', 'disability benefit', 'welfare benefit',
    'poverty', 'food bank', 'bame', 'xenophobia', "overcrowded homes",
    "hostile environment"
]

In [124]:
df_labelled.dropna(subset=['social_issues'])['title_subtitle'].tolist()

['Britons in Pakistan accuse UK government of abandoning them. Anger after UK lays on rescue flights for Britons in China and Peru, but not Pakistan',
 'UK faith leaders urge chancellor to press G20 to cancel debts of poor countries. Letter to Rishi Sunak warns of risks that coronavirus pandemic poses to poverty reduction efforts',
 "Young people have plenty to protest about. No wonder they're raving again | Sheryl Garratt. Of course it’s dangerous to hold illegal parties during a pandemic, but the urge to find sense of belonging is strong, says journalist Sheryl Garratt",
 '‘I’ve not been to the city centre for months’: UK suburbs thrive as office staff stay home. Local shops are busy, and new ways of working could have huge long-term effects on the future of major British cities',
 'No time for goodbyes: tributes to six people lost to coronavirus. Campaigners call for pandemic bereavement support for relatives as UK death toll surpasses 60,000 ',
 "Noel Gallagher says he refuses to w

In [125]:
df_all['social_issues'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [126]:
len(df_all[df_all['social_issues']=='yes']['title_subtitle'])

457

In [121]:
df_all[df_all['social_issues']=='yes']['title_subtitle'].tolist()

['UK government fends off criticism with plan to pay self-employed. Rishi Sunak pressured into measures amid fears coronavirus will trigger unemployment crisis',
 'The Guardian view on social care and Covid-19: the neglected frontline. <strong>Editorial: </strong>Underpaid and overworked, care workers carry the responsibility of protecting the elderly from a disease to which they are uniquely vulnerable. They need more help',
 'UK towns lose local newspapers as impact of coronavirus deepens. Many self-isolating older people left without trusted news source as presses stop rolling',
 'Coronavirus puts vulnerable UK children at greater risk, campaigners warn.  School closures remove vital safety net. But Covid-19 means there will be even fewer foster carers to pick up the pieces',
 "'People are so thankful':\xa0how delivery drivers became the new emergency service. Just a month ago, they were deemed unskilled workers. Now they are essential in the fight to control coronavirus. But what i

## Masks 

In [87]:
keywords = ['mask ', 'masks ']

In [88]:
df_all['masks'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [89]:
len(df_all[df_all['masks']=='yes']['title_subtitle'])

145

## Homelessness

In [83]:
keywords = ['rough sleep', 'homeless', 'eviction']

In [84]:
df_all['homelessness'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [85]:
len(df_all[df_all['homelessness']=='yes']['title_subtitle'])

57

## Mental health 

In [91]:
keywords = [' mental health ', 'depression', 'suicide', 'anxiety']

In [92]:
df_all['mental_health'] = df_all["title_subtitle_bow"].apply(lambda x : keyword_present(x, keywords))

In [93]:
len(df_all[df_all['mental_health']=='yes']['title_subtitle'])

104