# 📊 Script Summary: Scrape Quality Evaluation

This script analyses the results of article scraping by loading the merged dataset (newsletter_full_articles_with_items.csv), counting successful vs. failed scrapes, inspecting failure reasons, analysing success rates by organisation and category, generating summary statistics, producing detailed breakdowns for the top 50 organisations, exporting a status report (status_by_org.xlsx), and examining cases where pages were fetched but no article text could be extracted.

- Overall, 741 articles were successfully scraped.
- 160 rows failed to scrape successfully.
- 153 of these rows have status `error`: the page could not be fetched (401, 403, 404, 406, timeout, connection error, or SSL error).
- 7 rows have status `empty`: the page was fetched successfully but no content was extracted.
- These failure reasons indicate pages that were blocked (403), missing (404), rejected the request (406), required login (401), had connection/SSL issues, timed out, or didn't contain extractable article text.

In [1]:
import pandas as pd

In [2]:
ARTICLES_CSV = "/workspaces/ERP_Newsletter/data/data04_full_articles_scraped/newsletter_full_articles_with_items.csv"

df = pd.read_csv(ARTICLES_CSV)

In [3]:
df.columns

Index(['id', 'newsletter_number', 'issue_date', 'theme', 'subtheme', 'title',
       'description', 'link', 'new_theme', 'domain_x', 'organisation',
       'org_broad_category', 'org_category', 'title_length',
       'description_length', 'title_word_count', 'description_word_count',
       'text', 'text_length_chars', 'text_length_words', 'link_canonical',
       'article_id', 'domain_y', 'article_title', 'article_text', 'status',
       'failure_reason'],
      dtype='object')

# Check Failed vs. Successful

In [6]:
#check number of failed vs. successful scraped articles 
df['status'].value_counts()

status
ok       741
error    153
empty      7
Name: count, dtype: int64

| Type                                   | What Happened                                               | Cause                                                                        |
| -------------------------------------- | ----------------------------------------------------------- | ---------------------------------------------------------------------------- |
| **Error rows** (status `error`)        | Page **could not be fetched**                               | 401, 403, 404, 406, timeout, connection, SSL                                 |
| **Empty rows** (status `empty`)        | Page **fetched successfully** but **content not extracted** | JS pages, PDF files, non-article links, weird HTML, cookie walls, short text |
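
One way to confirm that the two failure types line up with the `failure_reason` column is to cross-tabulate `status` against `failure_reason`. The sketch below uses a small made-up frame (`toy`, with illustrative values only) rather than the real dataset. The column names match the notebook, but the rows are assumptions.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only).
toy = pd.DataFrame({
    "status": ["ok", "error", "error", "empty"],
    "failure_reason": [
        None,
        "http_status_403",
        "timeout",
        "no_main_text_extracted_or_too_short",
    ],
})

# Cross-tabulate status against failure_reason.
# Rows without a reason (successful scrapes) are labelled "none".
check = pd.crosstab(toy["status"], toy["failure_reason"].fillna("none"))
print(check)
```

On the real data, the same call with `df` in place of `toy` should show every `empty` row under `no_main_text_extracted_or_too_short` and every `ok` row under `none`, if the scraper assigns the labels as described in the table.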


# Inspect Failures

In [7]:
#inspect failure reasons 
failed = df[df['status'] != 'ok']
len(failed)

160

In [8]:
# look at the failure reasons
failed['failure_reason'].value_counts(dropna=False)

failure_reason
http_status_403                        101
http_status_404                         23
http_status_406                         18
no_main_text_extracted_or_too_short      7
request_exception_ConnectionError        7
timeout                                  2
request_exception_SSLError               1
http_status_401                          1
Name: count, dtype: int64

| Failure Reason                            | Count | Explanation                                                                                   |
|-------------------------------------------|-------|-----------------------------------------------------------------------------------------------|
| http_status_403                           | 101   | The server blocked access. Often due to missing headers, bot detection, or rate limiting.     |
| http_status_404                           | 23    | Page not found. The URL no longer exists, is broken, or has been removed.                     |
| http_status_406                           | 18    | Server refused the request format. Usually missing/strict `Accept` headers.                   |
| no_main_text_extracted_or_too_short       | 7     | Page downloaded but the scraper found no usable article text (empty page, PDF, or JS site).   |
| request_exception_ConnectionError         | 7     | Could not reach the server due to network issues, DNS failure, or dropped connection.         |
| timeout                                   | 2     | Server took too long to respond; request expired.                                             |
| request_exception_SSLError                | 1     | SSL certificate or handshake error due to misconfigured HTTPS or outdated TLS.                |
| http_status_401                           | 1     | Page requires authentication or login (unauthorized).                                         |
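
The eight reasons in the table fall into three broader families: HTTP status errors, network-level errors, and extraction failures. The sketch below regroups the counts from the output above. The family names and the `reason_family` helper are hypothetical, introduced only for illustration.

```python
from collections import Counter

# Counts copied from the failure_reason output above.
reason_counts = {
    "http_status_403": 101,
    "http_status_404": 23,
    "http_status_406": 18,
    "no_main_text_extracted_or_too_short": 7,
    "request_exception_ConnectionError": 7,
    "timeout": 2,
    "request_exception_SSLError": 1,
    "http_status_401": 1,
}


def reason_family(reason: str) -> str:
    """Map a failure_reason label to a broader (hypothetical) family name."""
    if reason.startswith("http_status_"):
        return "http_error"
    if reason.startswith("request_exception_") or reason == "timeout":
        return "network_error"
    return "extraction_failure"


# Sum the counts within each family.
family_counts = Counter()
for reason, count in reason_counts.items():
    family_counts[reason_family(reason)] += count

print(dict(family_counts))
```

With these counts, HTTP status errors account for 143 of the 160 failures, network-level errors for 10, and extraction failures for 7.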


# Inspect successful vs. failed by organisation

In [9]:
df.head(0)

Unnamed: 0,id,newsletter_number,issue_date,theme,subtheme,title,description,link,new_theme,domain_x,...,text,text_length_chars,text_length_words,link_canonical,article_id,domain_y,article_title,article_text,status,failure_reason


In [12]:
#broad organisation type 
df.groupby(['org_broad_category', 'status']).size().unstack(fill_value=0)

status,empty,error,ok
org_broad_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
academic_sector,2,31,70
civil_society_nonprofit_sector,0,9,69
commercial_private_sector,0,2,9
digital_social_media_platforms,2,1,12
government_public_sector,2,72,137
knowledge_mobiliser_think_tank_sector,0,19,100
media_sector,1,8,268
other_miscellaneous,0,3,24
research_evidence_sector,0,8,52


In [13]:
# more detailed organisation 
df.groupby(['org_category', 'status']).size().unstack(fill_value=0)

status,empty,error,ok
org_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
academic_network,1,9,27
academic_publisher_platform,1,16,6
advocacy_organisation,0,2,10
charity_ngo,0,0,45
commentary_platform,0,4,23
consultancy,0,1,4
cultural_organisation,0,0,1
edtech_education_business,0,0,2
evidence_mobiliser,0,7,30
executive_non_departmental_public_body_ndpb,0,0,1


In [None]:
# did text length

In [15]:
summary = pd.DataFrame({
    'total_count': df.shape[0],
    'success_count': (df['status']=='ok').sum(),
    'fail_count': (df['status']!='ok').sum(),
    'fail_rate_%': round((df['status']!='ok').mean()*100, 2)
}, index=[0])

summary

Unnamed: 0,total_count,success_count,fail_count,fail_rate_%
0,901,741,160,17.76
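
The summary reports one combined failure rate. To see how that rate splits between `error` and `empty` rows, `value_counts(normalize=True)` can be applied to the status column. The sketch below rebuilds a status column from the counts reported earlier instead of loading the CSV; the `status` and `share_pct` names are illustrative.

```python
import pandas as pd

# Rebuild a status column from the reported counts (741 ok, 153 error, 7 empty).
status = pd.Series(["ok"] * 741 + ["error"] * 153 + ["empty"] * 7, name="status")

# normalize=True returns proportions instead of raw counts.
share_pct = (status.value_counts(normalize=True) * 100).round(2)
print(share_pct)
```

On the real data the equivalent call would be `df['status'].value_counts(normalize=True)`.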


# Success vs. Failure - Top 50 Organisations

In [16]:
top50_orgs = (
    df['organisation']
    .value_counts()
    .head(50)
)

top50_orgs

organisation
schools_week                     138
uk_government                     65
guardian                          25
nfer                              24
epi                               23
scottish_government               21
welsh_government                  21
uk_parliament                     21
bera                              20
conversation                      19
oecd                              18
ucl                               18
belfast_telegraph                 16
ni_government                     16
fft_ed_datalab                    13
university_of_birmingham          12
teacher_tapp                      11
upen                              11
nuffield                          11
tes                               10
bera_journals                     10
bbc                               10
british_academy                   10
fed                                9
childrens_commissioner             9
ifs                                9
ifg                      

In [17]:
# filter the dataset to those top 50 organisations
df_top50 = df[df['organisation'].isin(top50_orgs.index)]

In [18]:
status_by_org = (
    df_top50
    .groupby(['organisation', 'status'])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=['ok','error','empty'], fill_value=0)  # optional but safe
    .sort_values(by='ok', ascending=False)
)

status_by_org

status,ok,error,empty
organisation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
schools_week,138,0,0
uk_government,65,0,0
guardian,25,0,0
scottish_government,21,0,0
epi,21,2,0
nfer,20,4,0
bera,19,1,0
conversation,19,0,0
welsh_government,19,0,2
ucl,17,1,0


In [19]:
# add failure rate: error rows as a share of ok + error rows ('empty' rows are excluded)
status_by_org['failure_rate_%'] = (
    status_by_org['error'] /
    (status_by_org['ok'] + status_by_org['error'])
    * 100
).round(2)

status_by_org

status,ok,error,empty,failure_rate_%
organisation,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
schools_week,138,0,0,0.0
uk_government,65,0,0,0.0
guardian,25,0,0,0.0
scottish_government,21,0,0,0.0
epi,21,2,0,8.7
nfer,20,4,0,16.67
bera,19,1,0,5.0
conversation,19,0,0,0.0
welsh_government,19,0,2,0.0
ucl,17,1,0,5.56
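
Because `failure_rate_%` is computed from `error` and `ok` only, an organisation whose failures are all `empty` rows (such as `welsh_government`) shows 0.0. A variant that counts every non-`ok` row is sketched below on three rows copied from the output above. The `not_ok_rate_%` column name is an assumption, not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the status_by_org output above.
status_by_org = pd.DataFrame(
    {"ok": [21, 20, 19], "error": [2, 4, 0], "empty": [0, 0, 2]},
    index=pd.Index(["epi", "nfer", "welsh_government"], name="organisation"),
)

# Count both 'error' and 'empty' rows as failures, over all rows.
not_ok = status_by_org["error"] + status_by_org["empty"]
total = status_by_org[["ok", "error", "empty"]].sum(axis=1)
status_by_org["not_ok_rate_%"] = (not_ok / total * 100).round(2)

print(status_by_org)
```

Which definition is more useful depends on the question: the original rate isolates fetch failures, while this variant measures how many links yielded no usable text for any reason.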


In [20]:
# Export to Excel
status_by_org.to_excel("/workspaces/ERP_Newsletter/data/data04_full_articles_scraped/status_by_org.xlsx", index=True)

In [21]:
# success vs. failure by detailed organisation category (org_category), top 50 organisations only
by_broad = (
    df_top50
    .groupby(['org_category', 'status'])
    .size()
    .unstack(fill_value=0)
)

by_broad['failure_rate_%'] = (
    by_broad['error'] /
    (by_broad['error'] + by_broad['ok']) * 100
).round(2)

by_broad

status,empty,error,ok,failure_rate_%
org_category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
academic_network,1,8,21,27.59
academic_publisher_platform,0,10,0,100.0
charity_ngo,0,0,17,0.0
commentary_platform,0,0,19,0.0
evidence_mobiliser,0,7,28,20.0
government_legislature,2,47,116,28.83
international_organisation,0,16,2,88.89
labour_union,0,5,0,100.0
news_media,0,0,60,0.0
professional_network,0,1,9,10.0


# Inspect where status == 'empty'

In [22]:
empty_status_df = df[df['status'] == 'empty']
empty_status_df

Unnamed: 0,id,newsletter_number,issue_date,theme,subtheme,title,description,link,new_theme,domain_x,...,text,text_length_chars,text_length_words,link_canonical,article_id,domain_y,article_title,article_text,status,failure_reason
41,bc99414e-9a35-468d-a007-9cdc1d0e8a93,6,8 September 2023,EdTech,,The Future of AI in Education: 13 Things We Ca...,Working paper co-authored by Dylan Wiliam. Pot...,https://edarxiv.org/372vr,edtech,edarxiv.org,...,The Future of AI in Education: 13 Things We Ca...,333,50,https://edarxiv.org/372vr,c1a43e50-455d-40c6-94bd-88aca5f71f51,edarxiv.org,OSF,,empty,no_main_text_extracted_or_too_short
240,3c2e1ba3-b2a7-4a82-92d0-c1ae586313f1,32,19 April 2024,Four Nations Landscape,,Education Wales - Wales' Professional Learning...,"For the first time, Professional Learning reso...",http://educationwales.blog.gov.wales/?action=u...,four_nations,educationwales.blog.gov.wales,...,Education Wales - Wales' Professional Learning...,345,51,http://educationwales.blog.gov.wales/?action=u...,f14aba51-5412-42d5-9819-5dccfab38f4b,educationwales.blog.gov.wales,,,empty,no_main_text_extracted_or_too_short
346,8ecd35a7-b495-4f68-855a-50184b8012a0,42,5 July 2024,Research – Practice – Policy,,Podcast - The Commission on the Future of Orac...,"10 weeks since the Commission launched, Geoff ...",https://open.spotify.com/episode/3ZDdja9OUHbwD...,ppr,open.spotify.com,...,Podcast - The Commission on the Future of Orac...,310,50,https://open.spotify.com/episode/3ZDdja9OUHbwD...,5e500c2e-c7de-4c3e-9278-12b01bf10994,open.spotify.com,Spotify – Web Player,,empty,no_main_text_extracted_or_too_short
413,2dc0177e-7bbe-4c38-a7cb-6cfd3f97269b,49,11 October 2024,Political environment and key organisations,,We need to talk. Oracy Education Commission re...,The Commission on the Future of Oracy Educatio...,https://open.spotify.com/show/5UwEJKkQrUT5lFjD...,political_environment_key_organisations,open.spotify.com,...,We need to talk. Oracy Education Commission re...,384,62,https://open.spotify.com/show/5UwEJKkQrUT5lFjD...,1ce9631e-f846-42f3-8df7-0e4d769b11ca,open.spotify.com,Spotify – Web Player,,empty,no_main_text_extracted_or_too_short
464,66e85ed4-eb5c-4242-9f77-19e365b05de7,53,15 November 2024,Four Nations,,Welsh Government - Open Consultation: Curricul...,Deadline: 20 December,http://educationwales.blog.gov.wales/?action=u...,four_nations,educationwales.blog.gov.wales,...,Welsh Government - Open Consultation: Curricul...,164,23,http://educationwales.blog.gov.wales/?action=u...,0eab1420-3476-407e-88f1-9f51d32d8596,educationwales.blog.gov.wales,,,empty,no_main_text_extracted_or_too_short
585,bd41786f-059f-46fb-acd9-60e41b28b1a6,63,14 February 2025,Research – Practice – Policy,,The British Academy Early Career Researcher Ne...,The Academy aims for the network to be a resea...,https://thebritishacademyecrn.com/,ppr,thebritishacademyecrn.com,...,The British Academy Early Career Researcher Ne...,244,36,https://thebritishacademyecrn.com/,bd4a9137-f27b-491b-a533-e97faafd3074,thebritishacademyecrn.com,The British Academy Early Career Researcher Ne...,,empty,no_main_text_extracted_or_too_short
615,2e544c9e-09d6-45ea-8162-923b6160ca7a,66,7 March 2025,EdTech,,TLS article - Teacher's friend or enemy? AI co...,Robert Ades reviews 'Brave New Words' by Salma...,https://www.the-tls.co.uk/philosophy/contempor...,edtech,www.the-tls.co.uk,...,TLS article - Teacher's friend or enemy? AI co...,279,45,https://www.the-tls.co.uk/philosophy/contempor...,6174f168-0a59-4519-b675-cb156cebca1c,www.the-tls.co.uk,Brave New Words by Salman Khan | Book review |...,,empty,no_main_text_extracted_or_too_short
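
The seven `empty` rows come from only five domains, which is easier to see once they are tallied. The sketch below counts the `domain_y` values copied from the output above using the standard library. On the real frame, `empty_status_df['domain_y'].value_counts()` should give the same tally.

```python
from collections import Counter

# domain_y values of the seven 'empty' rows, copied from the output above.
empty_domains = [
    "edarxiv.org",
    "educationwales.blog.gov.wales",
    "open.spotify.com",
    "open.spotify.com",
    "educationwales.blog.gov.wales",
    "thebritishacademyecrn.com",
    "www.the-tls.co.uk",
]

# Tally the rows per domain, most frequent first.
domain_counts = Counter(empty_domains)
for domain, n in domain_counts.most_common():
    print(f"{domain}: {n}")
```

Two domains (`open.spotify.com` and `educationwales.blog.gov.wales`) account for four of the seven rows. The Spotify links are podcast player pages rather than articles, which is consistent with the "non-article links" cause listed earlier.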


# Save a file with only successfully scraped articles 

In [24]:
df_ok = df[df["status"] == "ok"].copy()

In [25]:
output_path = "/workspaces/ERP_Newsletter/data/data04_full_articles_scraped/successfully_scraped.csv"

In [27]:
df_ok.to_csv(output_path, index=False)
print(f"✅ Saved {len(df_ok)} successfully scraped articles to:")
print(output_path)

✅ Saved 741 successfully scraped articles to:
/workspaces/ERP_Newsletter/data/data04_full_articles_scraped/successfully_scraped.csv
