
BBC News Text Analysis

Description

Text analysis was performed on a dataset of 2,225 documents from the BBC news website, corresponding to stories published in 2004-2005.

Analysis goals:

  • determine most and least frequently used words (excluding stopwords)
  • create a word cloud
  • lemmatize the text
  • generate n-grams and tag parts of speech
  • create a topic model of the text

Technologies Used

  • Python

Text Preparation

To prepare the data for analysis, text was converted to lowercase, and any digits, punctuation, stopwords, web links, and email addresses were removed.
Clean Text
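A minimal sketch of this cleaning step, assuming NLTK's English stopword list (the README names only Python, so the library choice here is an assumption):

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase the text and strip links, emails, digits, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # web links
    text = re.sub(r"\S+@\S+", " ", text)            # email addresses
    text = re.sub(r"\d+", " ", text)                # digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```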

Word Frequency and Lemmatization

Lemmatization changed both the contents and the order of the most and least frequently used words. For example, the word "year" originally ranked 9th among the ten most frequent words; after lemmatization it moved to 3rd place, since "year" and "years" reduce to the same lemma and their counts are combined.

The list of least frequently used words was altered more drastically. The original list contained words like "leukoencephalopathy" and "restating", but after lemmatization their frequency counts increased because they were grouped with related word forms. On the other hand, words such as "cassette" do not share a root word with other terms in the corpus, so their frequency count was not increased.

Original 10 Most Frequently Used Words | Lemmatized 10 Most Frequently Used Words
Original 10 Least Frequently Used Words | Lemmatized 10 Least Frequently Used Words
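A sketch of how these frequency lists can be produced, assuming NLTK's WordNetLemmatizer (the README does not name the lemmatizer) and collections.Counter:

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def lemma_frequencies(tokens, n=10):
    """Count lemmas so that forms like 'year' and 'years' share one entry."""
    counts = Counter(lemmatizer.lemmatize(tok) for tok in tokens)
    ranked = counts.most_common()
    return ranked[:n], ranked[-n:]   # n most and n least frequent lemmas
```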

Word Cloud

The generated word cloud reflects the most frequently occurring words in the news entries, so words such as "government", "minister", and "market" are visible. Since the BBC is a news reporting agency, the word "said" appears boldest in the image.

Word Cloud
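A sketch of how such an image can be generated with the wordcloud package (an assumption; the README does not name the tool). Here, cleaned_corpus is a placeholder for the cleaned article text joined into one string:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# cleaned_corpus: all cleaned article text as one string (hypothetical name)
cloud = WordCloud(width=800, height=400, background_color="white").generate(cleaned_corpus)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```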

Parts of Speech Tagging

The tags NN, RB, and JJ have the highest frequency counts and correspond to singular nouns, adverbs, and adjectives, respectively. This writing style is to be expected from a news reporting agency, which deals with factual events while also trying to make them sensational; hence the high number of adverbs and adjectives.

Speech Tagging
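A sketch of the tagging step using NLTK's Penn Treebank tagger (an assumption), with a bigram line covering the n-gram goal from the list above; article_text is a placeholder for one cleaned article:

```python
from collections import Counter

import nltk
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize(article_text)   # article_text: hypothetical input string
tagged = nltk.pos_tag(tokens)               # Penn Treebank tags, e.g. ('minister', 'NN')
bigrams = list(ngrams(tokens, 2))           # n-grams (here n=2) from the same tokens

tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(5))            # NN, RB, and JJ should rank near the top
```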

Topic Model of the Text

Using Python's Gensim module, a topic model of the text was created, which revealed that news stories tend to fall into one of 14 topics. For each topic, the ten most important words are moderately coherent, with the model receiving a coherence score of 0.559. One can therefore vaguely deduce the meaning of each topic from its top ten words; for example, topic 7 appears to concern a sporting event involving England, Wales, and France. One must examine each word in a topic to infer its overall meaning, since the top word alone does not suffice.
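A sketch of the Gensim pipeline such a model implies. The 14 topics and the 0.559 score come from the analysis above; the coherence measure (c_v), pass count, and the docs variable (lemmatized token lists, one per article) are illustrative assumptions:

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# docs: list of lemmatized token lists, one per article (hypothetical name)
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=14,
               passes=10, random_state=42)

coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"coherence: {coherence:.3f}")   # 0.559 reported for this analysis

# Top ten words per topic, used to infer each topic's meaning
for topic_id, words in lda.show_topics(num_topics=14, num_words=10, formatted=False):
    print(topic_id, [word for word, _ in words])
```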

TopicVis

Based on a visualization of the model, the 14 topics appear to fall into four groups: economics, sports, technology, and music/film. The majority of the topics fall into the first group, and it is also where the most overlap occurs. This overlap might be attributed to the topics in this group containing many of the same lemmatized words.
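Intertopic-distance views like the one above are commonly produced with pyLDAvis; a sketch under that assumption (the README does not name the visualization tool), reusing lda, bow_corpus, and dictionary from the modeling step:

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Project the 14 topics into a 2-D intertopic-distance map
vis = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "topic_vis.html")   # interactive HTML visualization
```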

Overall, the news articles have a high probability of being represented by topic 1 or 2, which seem to deal with a governmental issue in the UK and a sporting event, respectively.
