
BBC News Text Analysis

Description

Text analysis was performed on a dataset of 2,225 documents from the BBC news website, corresponding to stories published in 2004-2005.

Analysis goals:

  • determine most and least frequently used words (excluding stopwords)
  • create a word cloud
  • lemmatize the text
  • generate n-grams and tag parts of speech
  • create a topic model of the text

Technologies Used

  • Python

Text Preparation

To prepare the data for analysis, text was converted to lowercase, and any digits, punctuation, stopwords, web links, and email addresses were removed.
Clean Text
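A minimal sketch of this cleaning step, assuming NLTK's English stopword list (the README names only Python, so the library choice here is an assumption):

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase the text and strip links, emails, digits, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # web links
    text = re.sub(r"\S+@\S+", " ", text)            # email addresses
    text = re.sub(r"\d+", " ", text)                # digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)
```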

Word Frequency and Lemmatization

Lemmatization changed both the contents and the order of the most and least frequently used words. For example, the word "year" originally ranked 9th among the ten most frequent words; after lemmatization it moved to 3rd place, since "year" and "years" reduce to the same lemma and their counts are combined.

The list of least frequently used words was altered more drastically. The original list contained words like "leukoencephalopathy" and "restating", but after lemmatization their frequency counts increased because they were grouped with related word forms. On the other hand, words such as "cassette" do not share a root word with other terms in the corpus, so their frequency count was not increased.

Original 10 Most Frequently Used Words | Lemmatized 10 Most Frequently Used Words
Original 10 Least Frequently Used Words | Lemmatized 10 Least Frequently Used Words
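A sketch of how these frequency lists can be produced, assuming NLTK's WordNetLemmatizer (the README does not name the lemmatizer) and collections.Counter:

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def lemma_frequencies(tokens, n=10):
    """Count lemmas so that forms like 'year' and 'years' share one entry."""
    counts = Counter(lemmatizer.lemmatize(tok) for tok in tokens)
    ranked = counts.most_common()
    return ranked[:n], ranked[-n:]   # n most and n least frequent lemmas
```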

Word Cloud

The generated word cloud reflects the most frequently occurring words in the news entries, so words such as "government", "minister", and "market" are visible. Since the BBC is a news reporting agency, the word "said" appears boldest in the image.

Word Cloud
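A sketch of how such an image can be generated with the wordcloud package (an assumption; the README does not name the tool). Here, cleaned_corpus is a placeholder for the cleaned article text joined into one string:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# cleaned_corpus: all cleaned article text as one string (hypothetical name)
cloud = WordCloud(width=800, height=400, background_color="white").generate(cleaned_corpus)

plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```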

Parts of Speech Tagging

The tags NN, RB, and JJ have the highest frequency counts and correspond to singular nouns, adverbs, and adjectives, respectively. This writing style is to be expected from a news reporting agency, which deals with factual events while also trying to make them sensational; hence the high number of adverbs and adjectives.

Speech Tagging
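A sketch of the tagging step using NLTK's Penn Treebank tagger (an assumption), with a bigram line covering the n-gram goal from the list above; article_text is a placeholder for one cleaned article:

```python
from collections import Counter

import nltk
from nltk.util import ngrams

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize(article_text)   # article_text: hypothetical input string
tagged = nltk.pos_tag(tokens)               # Penn Treebank tags, e.g. ('minister', 'NN')
bigrams = list(ngrams(tokens, 2))           # n-grams (here n=2) from the same tokens

tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common(5))            # NN, RB, and JJ should rank near the top
```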

Topic Model of the Text

Using Python's Gensim module, a topic model of the text was created, which revealed that news stories tend to fall into one of 14 topics. For each topic, the ten most important words are moderately coherent, with the model receiving a coherence score of 0.559. One can therefore vaguely deduce the meaning of each topic from its top ten words; for example, topic 7 appears to concern a sporting event involving England, Wales, and France. One must examine each word in a topic to infer its overall meaning, since the top word alone does not suffice.
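A sketch of the Gensim pipeline such a model implies. The 14 topics and the 0.559 score come from the analysis above; the coherence measure (c_v), pass count, and the docs variable (lemmatized token lists, one per article) are illustrative assumptions:

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# docs: list of lemmatized token lists, one per article (hypothetical name)
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=14,
               passes=10, random_state=42)

coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"coherence: {coherence:.3f}")   # 0.559 reported for this analysis

# Top ten words per topic, used to infer each topic's meaning
for topic_id, words in lda.show_topics(num_topics=14, num_words=10, formatted=False):
    print(topic_id, [word for word, _ in words])
```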

TopicVis

Based on a visualization of the model, the 14 topics appear to fall into four groups: economics, sports, technology, and music/film. The majority of the topics fall into the first group, and it is also where the most overlap occurs. This overlap might be attributed to the topics in this group containing many of the same lemmatized words.
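Intertopic-distance views like the one above are commonly produced with pyLDAvis; a sketch under that assumption (the README does not name the visualization tool), reusing lda, bow_corpus, and dictionary from the modeling step:

```python
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

# Project the 14 topics into a 2-D intertopic-distance map
vis = gensimvis.prepare(lda, bow_corpus, dictionary)
pyLDAvis.save_html(vis, "topic_vis.html")   # interactive HTML visualization
```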

Overall, the news articles have a high probability of being represented by topic 1 or 2, which seem to deal with a governmental issue in the UK and a sporting event, respectively.
