
Methods in Digital Humanities seminar project


cconzen/DH-Proj


Qatar World Cup UK Newspaper Analysis

Newspapers

| Newspaper | The Sun | Daily Mail | The Guardian | The Times |
|---|---|---|---|---|
| type | tabloid | tabloid | broadsheet | broadsheet |
| timespan | Sep 22 - Feb 23 | Sep 22 - Feb 23 | Sep 22 - Feb 23 | Sep 22 - Feb 23 |
| articles collected | 1270 | 2622 | 2199 | 805 |
| mean article length | 433 tokens | 678 tokens | 1304 tokens | 927 tokens |
| vocabulary | 17499 lemmas | 24781 lemmas | 39283 lemmas | 22586 lemmas |
| complete articles | yes | yes | yes | yes |
| tokens (complete) | 549898 | 1777469 | 2608640 | 746153 |
| without stopwords | 307902 | 998000 | 1473926 | 412010 |
| without symbols | 271541 | 897411 | 1273947 | 362093 |
| lemmas (final) | 225749 | 792397 | 1138170 | 301143 |

Crawlers

Crawlers are written to scrape articles published from September 2022 through February 2023 (inclusive).

The Sun, Daily Mail

Selenium to preload pages + Scrapy to crawl and scrape
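
A minimal sketch of this pattern (not the repository's actual spider; the URL and CSS selector below are placeholders): Selenium renders the listing page, and Scrapy's Selector extracts article links from the rendered HTML.

```python
# Sketch only: preload a page with Selenium, then parse the rendered HTML with Scrapy.
from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Chrome()
driver.get("https://www.thesun.co.uk/sport/worldcup2022/")  # placeholder listing URL
selector = Selector(text=driver.page_source)
# "a.teaser-anchor" is a hypothetical CSS selector, not taken from the real site
article_links = selector.css("a.teaser-anchor::attr(href)").getall()
driver.quit()
print(article_links)
```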

The Times

Selenium to preload pages and log in; Scrapy and Selenium to crawl and scrape

Start a crawler in the terminal with:

scrapy runspider [crawler.py] -o [articles.json or articles.csv]
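
For example, with a hypothetical spider file named sun_crawler.py, writing the output to JSON:

scrapy runspider sun_crawler.py -o sun_articles.json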

The Guardian

Guardian API + BeautifulSoup4 to clean the article body HTML

Run the script directly as a .py file.
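
A rough sketch of this approach (the query parameters and field handling below are illustrative, not the repository's exact script; a Guardian API key is required):

```python
# Sketch: fetch articles from the Guardian Content API and strip the body HTML with BeautifulSoup4.
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR-API-KEY"  # placeholder
params = {
    "q": "qatar world cup",    # illustrative query
    "from-date": "2022-09-01",
    "to-date": "2023-02-28",
    "show-fields": "body",
    "page-size": 50,
    "api-key": API_KEY,
}
resp = requests.get("https://content.guardianapis.com/search", params=params)
for result in resp.json()["response"]["results"]:
    body_html = result["fields"]["body"]
    text = BeautifulSoup(body_html, "html.parser").get_text(separator=" ")
```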

Functions

Most functions can be found in preprocessing.py and methods.py.

preprocessing.py

def preprocess(newspaper, csv=False, rare=False)

Preprocesses text data from JSON files for four different newspapers (The Times, The Sun, Daily Mail and The Guardian), including tokenisation; removal of stopwords, punctuation, rare tokens and player names; part-of-speech tagging; and lemmatisation. Depending on csv, it either returns a pandas DataFrame containing the preprocessed data or writes a CSV file; rare toggles whether rare tokens (tokens appearing fewer than ten times) are kept.

dataframe = preprocess("sun", csv=False, rare=True)

preprocess("guardian",csv=True)

preprocess("times",csv=True,rare=True)

methods.py

def df_to_dtm(df)

Converts a pandas DataFrame of preprocessed text data (obtained using preprocess()) into a Document-Term Matrix (DTM) using a CountVectorizer.

dtm_dataframe = df_to_dtm(dataframe)
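
Conceptually, the conversion corresponds to something like this sketch (assuming the DataFrame holds one space-joined lemma string per article in a column named "lemmas"; the real column name may differ):

```python
# Sketch: build a Document-Term Matrix from a column of preprocessed text.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def dtm_sketch(df, text_col="lemmas"):
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(df[text_col])
    return pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out())
```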

def df_to_tfidf(df)

Converts a pandas DataFrame of preprocessed text data (obtained using preprocess()) into a Term Frequency-Inverse Document Frequency (TF-IDF) matrix using a CountVectorizer and a TfidfTransformer.

df_tfidf = df_to_tfidf(dataframe)
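
The TF-IDF variant simply adds a TfidfTransformer on top of the count matrix (again a sketch, under the same assumptions as above):

```python
# Sketch: turn the raw counts into TF-IDF weights.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def tfidf_sketch(df, text_col="lemmas"):
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(df[text_col])
    tfidf = TfidfTransformer().fit_transform(counts)
    return pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
```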

def csv_to_tfidf(file_path)

Converts a CSV file of preprocessed text data (obtained using preprocess()) into a Term Frequency-Inverse Document Frequency (TF-IDF) matrix using a CountVectorizer and a TfidfTransformer.

df_tfidf_guardian = csv_to_tfidf("guardian.csv")

def get_term_position(df_tfidf, term)

Gets the column index of a given term in a TF-IDF matrix represented by a pandas DataFrame.

get_term_position(df_tfidf_guardian, "climate")

def compare_term_position(term)

Compares the positions of a given term in the TF-IDF matrices of four different CSV files and writes the results to a new CSV file.

compare_term_position("climate")

def read_csv_files()

Reads the CSV files for The Guardian, Daily Mail, The Times, and The Sun and returns them as dataframes.

def plot_tfidf(term, save=False)

Plots the development of the normalised TF-IDF score for a given term across four UK newspapers: The Times, Daily Mail, The Sun, and The Guardian, for the period between September 2022 and February 2023.

plot_tfidf("lgbt", save=True)

def get_vocab_from_csv(csv_file, lemma_col='lemmas')

Reads a CSV file containing lemmas and returns a set of unique lemmas (the vocabulary).

times_vocab = get_vocab_from_csv("times_rare.csv")

def get_avg_token_length(vocab)

Calculates the average length of tokens in a vocabulary (input vocabulary must be a set).

get_avg_token_length(times_vocab)
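
For reference, the computation amounts to the following one-liner (sketch):

```python
# Average token length of a vocabulary (set of unique lemmas).
def avg_token_length(vocab: set) -> float:
    return sum(len(token) for token in vocab) / len(vocab)
```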

Analysis

Type-token ratio

| Newspaper | The Sun | Daily Mail | The Guardian | The Times |
|---|---|---|---|---|
| types | 17499 | 24781 | 39283 | 22586 |
| tokens | 271541 | 754718 | 1273947 | 362093 |
| type-token ratio | 0.06444 | 0.03283 | 0.03083 | 0.06238 |
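
The type-token ratio is the number of types (unique lemmas) divided by the number of tokens, e.g. for The Sun: 17499 / 271541 ≈ 0.0644.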

Average token length

| Newspaper | The Sun | Daily Mail | The Guardian | The Times |
|---|---|---|---|---|
| average token length | 7.11898 | 7.32034 | 7.44423 | 7.43956 |

Topic modelling

Reads the CSV files and calculates the topics for each newspaper.

topics.R

Cooccurrence measures

Shows the terms that frequently occur together and the structure of the data.

cooccurrence.R

Time series

Aggregates the data and analyses it over time.

timeseries.R