Analysis goals:
- determine most and least frequently used words (excluding stopwords)
- create a word cloud
- lemmatize the text
- generate n-grams and tag parts of speech
- create a topic model of the text
- Python
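
As a rough illustration of the first and fourth goals, the sketch below counts non-stopword tokens and builds bigrams using only the standard library. The stopword set, the sample sentence, and the helper names (`tokenize`, `word_frequencies`, `ngrams`) are assumptions made for this example and are not taken from the project code.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set; a real analysis would use a
# fuller list (for example the one shipped with NLTK).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text):
    """Count every token that is not a stopword."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text standing in for a news entry.
sample = "The minister said the market is strong and the minister said more"

freqs = word_frequencies(sample)
most_common = freqs.most_common(2)        # head of the ranking
least_common = freqs.most_common()[-2:]   # tail of the ranking
bigrams = ngrams(tokenize(sample), 2)

print(most_common)
print(least_common)
print(bigrams[:3])
```

The same `Counter` ranking serves both ends of the list: the head gives the most frequent words and the tail gives the least frequent ones.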
Lemmatization changed both the contents and the order of the most and least frequently used words. For example, in the original list of the ten most frequently used words, "year" was in 9th place. After lemmatization it moved to 3rd place, because "year" and "years" share the same root word and their counts are combined.
The list of least frequently used words was altered drastically. The original list contained words like "leukoencephalopathy" and "restating", but after lemmatization their frequency counts increased, since they are now grouped with similar words. On the other hand, words such as "cassette" do not share a root word with any other word in the text, so their frequency counts did not increase.
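
The effect described above can be reproduced on a toy example. This is a minimal sketch in which a hand-written lookup table (`LEMMAS`, an assumption for this example) stands in for a real lemmatizer such as NLTK's `WordNetLemmatizer`; the token list is also made up.

```python
from collections import Counter

# Assumption: a hand-written lemma lookup standing in for a real lemmatizer.
LEMMAS = {"years": "year", "markets": "market", "ministers": "minister"}


def lemmatize(token):
    """Map a token to its lemma, or return it unchanged if it has no entry."""
    return LEMMAS.get(token, token)


# Hypothetical tokens; "cassette" has no inflected variant in the list.
tokens = ["year", "years", "market", "years", "cassette", "markets", "year"]

raw_counts = Counter(tokens)
lemma_counts = Counter(lemmatize(t) for t in tokens)

print(raw_counts.most_common())
print(lemma_counts.most_common())
```

Here "year" and "years" are counted separately before lemmatization and together afterwards, while "cassette" keeps a count of one in both rankings.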
| Original 10 Most Frequently Used Words | Lemmatized 10 Most Frequently Used Words |
|---|---|

| Original 10 Least Frequently Used Words | Lemmatized 10 Least Frequently Used Words |
|---|---|
The generated word cloud reflects the most frequently occurring words in the news entries, so words such as "government", "minister", and "market" are visible. Since the BBC is a news reporting agency and its articles routinely attribute statements to sources, the word "said" is the most prominent one in the image.
The tags NN, RB, and JJ have the highest frequency counts and correspond to singular nouns, adverbs, and adjectives. This writing style is to be expected from a news reporting agency, since it deals with factual events while also trying to make them sensational, hence the high number of adverbs and adjectives.
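
Tag frequencies of this kind can be tallied with a `Counter`. The sketch below assumes the text has already been tagged and uses a few hand-tagged `(token, tag)` pairs, in the shape returned by a tagger such as `nltk.pos_tag`; the sample pairs are invented for illustration.

```python
from collections import Counter

# Assumption: hand-tagged (token, tag) pairs with Penn Treebank tags,
# standing in for the output of a real part-of-speech tagger.
tagged = [
    ("the", "DT"),
    ("government", "NN"),
    ("quickly", "RB"),
    ("announced", "VBD"),
    ("new", "JJ"),
    ("market", "NN"),
    ("rules", "NNS"),
]

# Count how often each tag occurs, ignoring the tokens themselves.
tag_counts = Counter(tag for _, tag in tagged)
top_tag = tag_counts.most_common(1)

print(tag_counts.most_common())
```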
Using Python's Gensim module, a topic model of the text was created, which revealed that news stories tend to fall into one of 14 topics. For each topic, the ten most important words are moderately coherent, with the model receiving a coherence score of 0.559. One can therefore roughly deduce the meaning of each topic from its ten most important words. For example, topic 7 appears to concern a sporting event involving England, Wales, and France. Every word in a topic must be examined to infer its overall meaning, since the top word alone does not suffice.
Based on a visualization of the model, it appears that the 14 topics fall into four groups: economics, sports, technology, and music/film. The majority of the topics fall into the first of these groups, and it is also within this group that the most overlap occurs. This overlap might be attributed to the fact that the topics in this group contain the same lemmatized words.
Overall, the news articles have a high probability of being represented by topic 1 or 2, which seem to deal with a governmental issue in the UK, and a sporting event, respectively.

