In this notebook:

1. [Setting up and obtaining credentials for the Guardian API](#1)
2. [Searching the Guardian Explore interface](#2)
3. [Searching the Guardian API and collecting data](#3)
4. [Analysing Guardian news data](#4)

# Badge 08: News Data Analysis and Visualisation


## Objectives

* Setting up access and signing into the API with a valid key
* Defining search parameters
* Running a search and collecting data
* Analysing data

In this notebook, you will learn to search the Guardian API, collect all the data returned for a search and analyse it.  [The Guardian](https://en.wikipedia.org/wiki/The_Guardian) is a British daily newspaper which was founded in 1821. 

<a id="1"></a>
# 1. Setting up and obtaining credentials for the Guardian API

First, we will import all the libraries that are needed for this notebook.

In [None]:
# Import libraries
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import requests
import seaborn as sns
import string

Next, you need to set up your API key and specify the endpoint.

You can obtain your API key at the following website: https://open-platform.theguardian.com/access/.  Click on "Register developer key" under Developer and fill in the form. Select "Student project" for the reason for applying for the key.  You don't need to fill in any of the optional information fields and for product name you can enter "not applicable". You will receive an email which you need to verify before you will receive a second email with your key. Please enter the key in double quotes, where it says `ENTER_KEY_HERE` and make sure that there are no extra spaces.

The API endpoint is the base URL used to access the Guardian API.

In [None]:
# Set up API credentials and base URL
MY_API_KEY = "ENTER_KEY_HERE"
API_ENDPOINT = 'https://content.guardianapis.com/search'

The key is all you need, in terms of credentials, to search the Guardian API. Other APIs might also ask you to provide a token of some kind to obtain access but this isn't needed in this case.
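A common stumbling block is leaving the placeholder in place or pasting the key with stray spaces. The short sketch below shows one way to catch both before sending any request. The helper name `clean_api_key` is made up for this illustration and is not part of the Guardian API or of this notebook's later code.

```python
def clean_api_key(raw_key: str) -> str:
    """Hypothetical helper: strip stray whitespace and reject the unedited placeholder."""
    key = raw_key.strip()
    if not key or key == "ENTER_KEY_HERE":
        raise ValueError("Please replace ENTER_KEY_HERE with your own Guardian API key.")
    return key


# Example with a made-up key that has accidental spaces around it
print(clean_api_key("  my-example-key  "))
```

If you wanted to use it here, you could wrap your key as `MY_API_KEY = clean_api_key("...")`, but the rest of the notebook works without it.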

<a id="2"></a>
# 2. Searching the Guardian Explore interface

Before we collect data for a search, let us explore the Guardian API online using their Explore interface: https://open-platform.theguardian.com/explore/. This is a web interface to the API and easily allows you to see your search results in a browser which is handy for exploring data before collecting it for analysis. The screenshot below shows the output for a search for "Tori Amos" for all content published from the 1st of January 2023 until now.

What is returned are the results (6 items), including their URLs and other useful information such as the section each item was published in and its title.  Note that the number of search results will differ as time goes by, but this is just to give you an example of how it looks.  If you follow the link above you can try this and other searches out for yourself by specifying a search term and optionally selecting search parameters.

Note you can perform an exact search by placing a string in double quotes.  You can also use the AND, OR and NOT operators for more complex search queries. You can check the Guardian API documentation for some examples on how this is done: https://open-platform.theguardian.com/documentation/.
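To see what happens to such a query when it becomes part of a request URL, here is a small sketch using only Python's standard library. The helper `build_query_string` and the two Boolean example queries are made up for illustration; the exact-phrase query is the "Tori Amos" search discussed above.

```python
from urllib.parse import quote, urlencode


def build_query_string(q: str, from_date: str = "2023-01-01") -> str:
    """Hypothetical helper: percent-encode a search query and a start date."""
    # quote_via=quote encodes spaces as %20 (the default would use '+')
    return urlencode({"from-date": from_date, "q": q}, quote_via=quote)


# An exact phrase in double quotes, and two illustrative Boolean queries
for q in ['"Tori Amos"', "ChatGPT AND ethics", "ChatGPT AND NOT Microsoft"]:
    print(build_query_string(q))
```

The double quotes around an exact phrase turn into `%22` and each space into `%20`, which is why URLs for phrase searches look a little cryptic.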

![Explore Guardian API](./images/guardian-explore.png)

At the top of the list of results, there's also a URL link which (when you click on it) will show you the results in JSON format (see Figure above: https://content.guardianapis.com/search?from-date=2023-01-01&q=%22Tori%20Amos%22&api-key=a8e3192d-36ed-4ed0-8466-ce5b3e412efe).

When we collect data for further processing it comes in this JSON format, so we need to be able to read and process that.  In Python, JSON can be easily read and processed using a data frame.  We'll come back to this when we collect a dataset from the Guardian API for further analysis. But first let's explore the API using the web interface.
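As a small preview of how JSON ends up in a data frame, the sketch below flattens a hand-written dictionary that imitates the nesting of an API response. The headlines and body texts are invented for illustration and are not real Guardian data.

```python
import pandas as pd

# A made-up response with the same nesting as the real one: response -> results -> fields
sample = {
    "response": {
        "status": "ok",
        "pages": 1,
        "results": [
            {"webTitle": "First made-up headline", "sectionName": "Music",
             "fields": {"bodyText": "Made-up body text one."}},
            {"webTitle": "Second made-up headline", "sectionName": "Culture",
             "fields": {"bodyText": "Made-up body text two."}},
        ],
    }
}

# json_normalize turns nested keys into dotted column names such as 'fields.bodyText'
df_sample = pd.json_normalize(sample["response"]["results"])
print(df_sample.columns.tolist())
print(df_sample)
```

Each news item becomes one row, and nested entries become columns whose names are joined with a dot.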

### 🐛Mini task: Trying out your own Guardian Explore search 

- Choose a search term in the Guardian Explore interface.

- Change the search parameters, e.g. you can specify a different search period, search in different sections or order the results differently.

- Try out a few searches to see how many results you get. 

<a id="3"></a>
# 3. Searching the Guardian API and collecting data

Next, you specify the search query. 

In [None]:
# Define search query
query = "ChatGPT"

Now you have to select your search parameters.

You may have already seen in the Explore interface that you can specify different parameters, e.g. a search period (`from-date`, `to-date`), an order of the search results (`order-by`, which can be `newest`, `oldest` or `relevance`) and where the search should take place (`query-fields`). You can also specify the number of results per page to return (`page-size`; default is 10, max is 200) and you need to provide the query and your API key (stored in the `MY_API_KEY` variable) to make it run.

You can also choose a `section` of the Guardian to search in (e.g. technology, sport, books, news, opinion etc.) but we've left this commented out for this first search as we want to search the entire Guardian website.

You get a more detailed overview of how the search API works here: https://open-platform.theguardian.com/documentation/ but let's give this first search a go to see how it works.

In [None]:
# Define the parameters for the API call
my_params = {
    'from-date': '2023-01-01',  #this specifies the date you'd like to collect data from; if you don't specify a 'to-date', the search will check for data published until the current date
    'order-by': 'oldest',       #this means the results are ordered by oldest to newest
    'show-fields': 'all',       #this specifies to give all the information stored for each item in the results
    'query-fields': 'body',     #this specifies in which indexed field to search for the query. 
#    'section':'technology',     #this specifies the section in which to search
    'q': query,                 #this is your query
    'page-size': 200,           #this specifies the number of results returned per page
    'api-key': MY_API_KEY       #your API key
}

The following bit of code performs the search for every page available and collects all the data that is returned per page. The results are returned in JSON format as a series of news items and their URLs for each page.

The code goes through each page and extracts the news item and their information and appends them to a data frame (`df_all`).  This bit of code may look quite complicated but we've provided comments for each line to explain it in detail.

In [None]:
# Collect search results in data frame
current_page = 1           # we start at page 1
total_pages = float('inf') # set to infinity to ensure that the loop runs until there are no more pages left
df_all = pd.DataFrame()    # this is the data frame which will store all the results

while current_page <= total_pages:                    # a while loop which checks for results on each page
    print("... extracting results from page", current_page)                    # prints which page the code is on
    try:                                              # the while loop then tries to do the following:
        my_params['page'] = current_page                # adds an additional search parameter ('page') for the current page, which increases with each iteration
        r = requests.get(API_ENDPOINT, params=my_params) # the API request to collect the data for the specified page
        r.raise_for_status()                            # raises an error if the request failed (e.g. because of an invalid key)
        data = r.json()                                 # this parses the JSON response into a Python dictionary
        news_items = data['response']['results']          # this extracts the news items from the results for the page
        df_page = pd.json_normalize(news_items)           # this flattens the JSON structure into a flat data frame table
        df_all = pd.concat([df_all, df_page], ignore_index=True) # this appends the data frame for the page to df_all
    except Exception as e:                              # unless there is an error
        print(f"Error: {e}")                              # at that point the code prints an error message 
        break                                             # and the while loop is interrupted
    total_pages = data['response']['pages']             # the number of pages available for a search is extracted from the data
    current_page += 1                                   # the current page is incremented by 1 for the next iteration
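
If the way the loop knows when to stop is unclear, the following sketch runs the same logic without contacting the API. It uses a made-up stand-in function (`fake_fetch`) and three invented "pages" of results, so everything here except the loop structure is an assumption for illustration.

```python
import math

# Made-up stand-in for the API: three "pages" of results
FAKE_PAGES = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}


def fake_fetch(page: int) -> dict:
    """Hypothetical helper that mimics the shape of the parsed API response."""
    return {"response": {"pages": len(FAKE_PAGES), "results": FAKE_PAGES[page]}}


collected = []
current_page = 1
total_pages = math.inf          # unknown until the first response arrives

while current_page <= total_pages:
    data = fake_fetch(current_page)
    collected.extend(data["response"]["results"])
    total_pages = data["response"]["pages"]   # now we know how many pages exist
    current_page += 1

print(collected)
```

The loop starts with an infinite page count so that the first request always happens; after that, the real number of pages reported in the response decides when it ends.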

All the data received for the search is stored in the `df_all` data frame and you can proceed to explore it using the functions you've already learned previously.  For example, you can print out the first few rows of the data frame and see all the information that is provided for each news item.

In [None]:
# Print the first 10 rows of the data frame
df_all.head(10)

Or you can print the last few rows.

In [None]:
# Print the last 5 rows of the data frame
df_all.tail()

To inspect particular news articles as they appear online, you can select the column in the data frame containing all the URLs and list them on screen.

In [None]:
# Print all the URLs in the data frame
df_all['webUrl'].tolist()

To print the title and text of a particular article, you print the content of `webTitle` and `fields.bodyText` by using the `loc` accessor.

In [None]:
# Grabbing the title and article text for a particular news item
index=6
title=df_all.loc[index, 'webTitle']
article=df_all.loc[index, 'fields.bodyText']

# Printing the output
print(title)       
print()            # this prints an empty line
print(article)

### 🐛Mini task: Looking at some examples
Take a look at some example URLs with your partner.  You can paste the link into your browser to see an article.  Do you notice anything in particular about some example news pages with respect to your search?  What do you need to consider in any follow-on text analysis of such data?

<a id="4"></a>
# 4. Analysing Guardian news data

Once you have all the data stored in your data frame, you can analyse it in the same way we did for other datasets previously. For example, you can tokenise the text of all the news articles and lowercase it.

## 4.1 Tokenising and cleaning text

This next bit of code takes a while to run as it installs the punkt tokeniser first and may process a lot of text if your dataset is large.  If you want to speed it up, you could change line 5 of the next code cell to only look at the first 100 articles (`s = " ".join(articles[:100])`), but this will affect your visualisations below.

In [None]:
# Collecting all the articles from the data frame and tokenising the text
from nltk.tokenize import word_tokenize              # import word tokeniser
nltk.download('punkt')                               # download the punkt tokeniser (newer NLTK versions may ask for 'punkt_tab' instead)
articles=df_all['fields.bodyText'].dropna().tolist() # collect list of articles from data frame (skipping items without text)
s = " ".join(articles)                               # join the strings of all articles into one big text string
s_tokens = word_tokenize(s)                          # tokenise list of words in the text
lower_s_tokens = [word.lower() for word in s_tokens] # lowercase list of words
print(len(lower_s_tokens))                           # print number of tokens
print(lower_s_tokens[0:5000])                        # print the first 5000 tokens

As before, you can filter the text and remove stopwords and any words that are not of interest...

In [None]:
# Remove some stopwords (incl. a long list of uninteresting words,
# which can be adapted as necessary)
nltk.download('stopwords')
from nltk.corpus import stopwords
uninteresting_words=['’','‘','“','”','–',"0o", "0s", "3a", "3b", "3d", "6b", "6o", "a", "a1", "a2", "a3", "a4", "ab", "able", "about", "above", "abst", "ac", "accordance", "according", "accordingly", "across", "act", "actually", "ad", "added", "adj", "ae", "af", "affected", "affecting", "affects", "after", "afterwards", "ag", "again", "against", "ah", "ain", "ain't", "aj", "al", "all", "allow", "allows", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "announce", "another", "any", "anybody", "anyhow", "anymore", "anyone", "anything", "anyway", "anyways", "anywhere", "ao", "ap", "apart", "apparently", "appear", "appreciate", "appropriate", "approximately", "ar", "are", "aren", "arent", "aren't", "arise", "around", "as", "a's", "aside", "ask", "asking", "associated", "at", "au", "auth", "av", "available", "aw", "away", "awfully", "ax", "ay", "az", "b", "b1", "b2", "b3", "ba", "back", "bc", "bd", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "begin", "beginning", "beginnings", "begins", "behind", "being", "believe", "below", "beside", "besides", "best", "better", "between", "beyond", "bi", "bill", "biol", "bj", "bk", "bl", "bn", "both", "bottom", "bp", "br", "brief", "briefly", "bs", "bt", "bu", "but", "bx", "by", "c", "c1", "c2", "c3", "ca", "call", "came", "can", "cannot", "cant", "can't", "cause", "causes", "cc", "cd", "ce", "certain", "certainly", "cf", "cg", "ch", "changes", "ci", "cit", "cj", "cl", "clearly", "cm", "c'mon", "cn", "co", "com", "come", "comes", "con", "concerning", "consequently", "consider", "considering", "contain", "containing", "contains", "corresponding", "could", "couldn", "couldnt", "couldn't", "course", "cp", "cq", "cr", "cry", "cs", "c's", "ct", "cu", "currently", "cv", "cx", "cy", "cz", "d", "d2", "da", "date", "dc", "dd", "de", "definitely", "describe", "described", "despite", "detail", "df", "di", "did", 
"didn", "didn't", "different", "dj", "dk", "dl", "do", "does", "doesn", "doesn't", "doing", "don", "done", "don't", "down", "downwards", "dp", "dr", "ds", "dt", "du", "due", "during", "dx", "dy", "e", "e2", "e3", "ea", "each", "ec", "ed", "edu", "ee", "ef", "effect", "eg", "ei", "eight", "eighty", "either", "ej", "el", "eleven", "else", "elsewhere", "em", "empty", "en", "end", "ending", "enough", "entirely", "eo", "ep", "eq", "er", "es", "especially", "est", "et", "et-al", "etc", "eu", "ev", "even", "ever", "every", "everybody", "everyone", "everything", "everywhere", "ex", "exactly", "example", "except", "ey", "f", "f2", "fa", "far", "fc", "few", "ff", "fi", "fifteen", "fifth", "fify", "fill", "find", "fire", "first", "five", "fix", "fj", "fl", "fn", "fo", "followed", "following", "follows", "for", "former", "formerly", "forth", "forty", "found", "four", "fr", "from", "front", "fs", "ft", "fu", "full", "further", "furthermore", "fy", "g", "ga", "gave", "ge", "get", "gets", "getting", "gi", "give", "given", "gives", "giving", "gj", "gl", "go", "goes", "going", "gone", "got", "gotten", "gr", "greetings", "gs", "gy", "h", "h2", "h3", "had", "hadn", "hadn't", "happens", "hardly", "has", "hasn", "hasnt", "hasn't", "have", "haven", "haven't", "having", "he", "hed", "he'd", "he'll", "hello", "help", "hence", "her", "here", "hereafter", "hereby", "herein", "heres", "here's", "hereupon", "hers", "herself", "hes", "he's", "hh", "hi", "hid", "him", "himself", "his", "hither", "hj", "ho", "home", "hopefully", "how", "howbeit", "however", "how's", "hr", "hs", "http", "hu", "hundred", "hy", "i", "i2", "i3", "i4", "i6", "i7", "i8", "ia", "ib", "ibid", "ic", "id", "i'd", "ie", "if", "ig", "ignored", "ih", "ii", "ij", "il", "i'll", "im", "i'm", "immediate", "immediately", "importance", "important", "in", "inasmuch", "inc", "indeed", "index", "indicate", "indicated", "indicates", "information", "inner", "insofar", "instead", "interest", "into", "invention", "inward", "io", "ip", 
"iq", "ir", "is", "isn", "isn't", "it", "itd", "it'd", "it'll", "its", "it's", "itself", "iv", "i've", "ix", "iy", "iz", "j", "jj", "jr", "js", "jt", "ju", "just", "k", "ke", "keep", "keeps", "kept", "kg", "kj", "km", "know", "known", "knows", "ko", "l", "l2", "la", "largely", "last", "lately", "later", "latter", "latterly", "lb", "lc", "le", "least", "les", "less", "lest", "let", "lets", "let's", "lf", "like", "liked", "likely", "line", "little", "lj", "ll", "ll", "ln", "lo", "look", "looking", "looks", "los", "lr", "ls", "lt", "ltd", "m", "m2", "ma", "made", "mainly", "make", "makes", "many", "may", "maybe", "me", "mean", "means", "meantime", "meanwhile", "merely", "mg", "might", "mightn", "mightn't", "mill", "million", "mine", "miss", "ml", "mn", "mo", "more", "moreover", "most", "mostly", "move", "mr", "mrs", "ms", "mt", "mu", "much", "mug", "must", "mustn", "mustn't", "my", "myself", "n", "n2", "na", "name", "namely", "nay", "nc", "nd", "ne", "near", "nearly", "necessarily", "necessary", "need", "needn", "needn't", "needs", "neither", "never", "nevertheless", "new", "next", "ng", "ni", "nine", "ninety", "nj", "nl", "nn", "no", "nobody", "non", "none", "nonetheless", "noone", "nor", "normally", "nos", "not", "noted", "nothing", "novel", "now", "nowhere", "nr", "ns", "nt", "ny", "o", "oa", "ob", "obtain", "obtained", "obviously", "oc", "od", "of", "off", "often", "og", "oh", "oi", "oj", "ok", "okay", "ol", "old", "om", "omitted", "on", "once", "one", "ones", "only", "onto", "oo", "op", "oq", "or", "ord", "os", "ot", "other", "others", "otherwise", "ou", "ought", "our", "ours", "ourselves", "out", "outside", "over", "overall", "ow", "owing", "own", "ox", "oz", "p", "p1", "p2", "p3", "page", "pagecount", "pages", "par", "part", "particular", "particularly", "pas", "past", "pc", "pd", "pe", "per", "perhaps", "pf", "ph", "pi", "pj", "pk", "pl", "placed", "please", "plus", "pm", "pn", "po", "poorly", "possible", "possibly", "potentially", "pp", "pq", "pr", 
"predominantly", "present", "presumably", "previously", "primarily", "probably", "promptly", "proud", "provides", "ps", "pt", "pu", "put", "py", "q", "qj", "qu", "que", "quickly", "quite", "qv", "r", "r2", "ra", "ran", "rather", "rc", "rd", "re", "readily", "really", "reasonably", "recent", "recently", "ref", "refs", "regarding", "regardless", "regards", "related", "relatively", "research", "research-articl", "respectively", "resulted", "resulting", "results", "rf", "rh", "ri", "right", "rj", "rl", "rm", "rn", "ro", "rq", "rr", "rs", "rt", "ru", "run", "rv", "ry", "s", "s2", "sa", "said", "same", "saw", "say", "saying", "says", "sc", "sd", "se", "sec", "second", "secondly", "section", "see", "seeing", "seem", "seemed", "seeming", "seems", "seen", "self", "selves", "sensible", "sent", "serious", "seriously", "seven", "several", "sf", "shall", "shan", "shan't", "she", "shed", "she'd", "she'll", "shes", "she's", "should", "shouldn", "shouldn't", "should've", "show", "showed", "shown", "showns", "shows", "si", "side", "significant", "significantly", "similar", "similarly", "since", "sincere", "six", "sixty", "sj", "sl", "slightly", "sm", "sn", "so", "some", "somebody", "somehow", "someone", "somethan", "something", "sometime", "sometimes", "somewhat", "somewhere", "soon", "sorry", "sp", "specifically", "specified", "specify", "specifying", "sq", "sr", "ss", "st", "still", "stop", "strongly", "sub", "substantially", "successfully", "such", "sufficiently", "suggest", "sup", "sure", "sy", "system", "sz", "t", "t1", "t2", "t3", "take", "taken", "taking", "tb", "tc", "td", "te", "tell", "ten", "tends", "tf", "th", "than", "thank", "thanks", "thanx", "that", "that'll", "thats", "that's", "that've", "the", "their", "theirs", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "thered", "therefore", "therein", "there'll", "thereof", "therere", "theres", "there's", "thereto", "thereupon", "there've", "these", "they", "theyd", "they'd", "they'll", "theyre", 
"they're", "they've", "thickv", "thin", "think", "third", "this", "thorough", "thoroughly", "those", "thou", "though", "thoughh", "thousand", "three", "throug", "through", "throughout", "thru", "thus", "ti", "til", "tip", "tj", "tl", "tm", "tn", "to", "together", "too", "took", "top", "toward", "towards", "tp", "tq", "tr", "tried", "tries", "truly", "try", "trying", "ts", "t's", "tt", "tv", "twelve", "twenty", "twice", "two", "tx", "u", "u201d", "ue", "ui", "uj", "uk", "um", "un", "under", "unfortunately", "unless", "unlike", "unlikely", "until", "unto", "uo", "up", "upon", "ups", "ur", "us", "use", "used", "useful", "usefully", "usefulness", "uses", "using", "usually", "ut", "v", "va", "value", "various", "vd", "ve", "ve", "very", "via", "viz", "vj", "vo", "vol", "vols", "volumtype", "vq", "vs", "vt", "vu", "w", "wa", "want", "wants", "was", "wasn", "wasnt", "wasn't", "way", "we", "wed", "we'd", "welcome", "well", "we'll", "well-b", "went", "were", "we're", "weren", "werent", "weren't", "we've", "what", "whatever", "what'll", "whats", "what's", "when", "whence", "whenever", "when's", "where", "whereafter", "whereas", "whereby", "wherein", "wheres", "where's", "whereupon", "wherever", "whether", "which", "while", "whim", "whither", "who", "whod", "whoever", "whole", "who'll", "whom", "whomever", "whos", "who's", "whose", "why", "why's", "wi", "widely", "will", "willing", "wish", "with", "within", "without", "wo", "won", "wonder", "wont", "won't", "words", "world", "would", "wouldn", "wouldnt", "wouldn't", "www", "x", "x1", "x2", "x3", "xf", "xi", "xj", "xk", "xl", "xn", "xo", "xs", "xt", "xv", "xx", "y", "y2", "yes", "yet", "yj", "yl", "you", "youd", "you'd", "you'll", "your", "youre", "you're", "yours", "yourself", "yourselves", "you've", "yr", "ys", "yt", "z", "zero", "zi", "zz"]
remove_these = set(stopwords.words('english') + list(string.punctuation) + list(string.digits) + uninteresting_words)

filtered_text = [token 
                 for token in lower_s_tokens 
                 if token not in remove_these]
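
To see the effect of this kind of filtering on a small scale, here is a sketch with a handful of invented tokens and a deliberately tiny stopword set. Both are assumptions made for illustration; the cell above uses NLTK's full English list plus the long custom list.

```python
import string

# Invented, already lowercased tokens
sample_tokens = ["chatgpt", "is", "a", "chatbot", ",", "and", "it", "writes", "essays", "."]

# A tiny stand-in stopword set combined with punctuation marks
small_stopwords = {"is", "a", "and", "it"}
remove = small_stopwords | set(string.punctuation)

kept = [token for token in sample_tokens if token not in remove]
print(kept)
```

Only the content-bearing tokens survive, which is what makes the later frequency plots and word cloud readable.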

## 4.2 Creating a word cloud

Now, let's create a word cloud for this data.

In [None]:
# If you haven't already, install wordcloud
!pip install wordcloud

In [None]:
# Draw the word cloud
from collections import Counter
simple_frequencies_dict = Counter(filtered_text)

import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(width=1200, height=900,max_font_size=160,colormap="hsv").generate_from_frequencies(simple_frequencies_dict)
plt.figure(figsize=(16,12))

plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## 4.3 Plotting word frequency

To plot word frequency, we can start by visualising the frequency distribution of the most common word tokens.

In [None]:
from nltk.probability import FreqDist
frequencies = FreqDist(filtered_text)
frequencies.plot(25,title='Frequency distribution for the 25 most common tokens in the textual data')

You can also do the same thing using a bar plot by first converting the data into a pandas series and using a library called `seaborn` which we have imported at the start of this notebook.  Running the following code creates the bar plot of the most frequent 25 words.

In [None]:
# Creating a dictionary of the 25 most common words and printing it
dictionary = dict(frequencies.most_common(25))
print(dictionary)

# Converting the dictionary to a pandas series
pd_series = pd.Series(dictionary)

# Setting up the fig/ax variables
fig, ax = plt.subplots(figsize=(8,8))

# Using seaborn to plot
all_plot = sns.barplot(x=pd_series.index, y=pd_series.values, ax=ax).set(title='')

# Setting the figure title and axes labels
plt.title('Frequency of 25 most common tokens')
plt.xlabel('Word Tokens')
plt.ylabel('Count')

# Using xtick rotation for ease of viewing the x axis labels (most frequent words)
plt.xticks(rotation=60);

Next, you can search your data for a list of specified word tokens and only plot them.  This is done by creating a dictionary and entering the words and their frequencies first. Then the dictionary is used as in the previous example to do the plotting.

In [None]:
# This creates your dictionary containing a list of words and prints it
targets=["ai","ethics","danger","humanities","openai","microsoft"]
my_dict = {}
for word in targets:
    my_dict[word] = frequencies[word]
print(my_dict)

# This creates the pandas series for your dictionary, sets up the figure and plots the frequencies
new_pd_series = pd.Series(my_dict)

fig, ax = plt.subplots(figsize=(3,6))

all_plot = sns.barplot(x=new_pd_series.index, y=new_pd_series.values, ax=ax)

# This formats the plot as before
plt.title('Frequency of specified tokens')
plt.xlabel('Word Tokens')
plt.ylabel('Count')
plt.xticks(rotation=60);
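
One detail worth knowing when you pick your own target words: NLTK's `FreqDist` is a subclass of Python's `Counter`, so looking up a word that never occurs returns 0 rather than raising an error. The sketch below shows this behaviour with a plain `Counter` and a few invented tokens.

```python
from collections import Counter

# Invented tokens; 'danger' does not occur at all
counts = Counter(["ai", "ethics", "ai", "openai", "ai"])

targets_demo = ["ai", "ethics", "danger"]
demo_dict = {word: counts[word] for word in targets_demo}
print(demo_dict)
```

A target that is absent from your data therefore shows up in the bar plot as a bar of height zero.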

### 🐛Mini task: Plotting your own frequency plot

- Use the data we've already worked with.
- Adapt the previous code to plot frequencies of a different list of specified tokens.

In [None]:
# Write your answer here


<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    
    Change the contents of targets=["ai","ethics","danger","humanities","openai","microsoft"] above to include your chosen words and re-run that code cell.
         
    ### END SOLUTION

</details>

## 4.4 Lexical dispersion plot

Now, let's create a lexical dispersion plot of the data, which shows where terms appear over the course of the data (and therefore over time).  This kind of plot only makes sense for larger datasets and you need to make sure that the temporal order of the data is retained before you plot it.  E.g. when you query the Guardian API you need to have `order-by` set to `oldest`. This guarantees that the articles are collected in order from oldest to newest.
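Conceptually, a dispersion plot just marks the positions at which each target word occurs in the token list. The sketch below computes those positions by hand for a few invented tokens; it illustrates the idea only and is not how NLTK draws the plot internally.

```python
# Invented tokens standing in for a long, time-ordered list of words
tokens = ["ai", "ethics", "essay", "ai", "students", "ai", "ethics"]
targets_demo = ["ai", "ethics", "students"]

# For each target, record every position (word offset) where it appears
offsets = {
    target: [i for i, token in enumerate(tokens) if token == target]
    for target in targets_demo
}
print(offsets)
```

In the plot, each target gets its own row and each recorded position becomes one tick along the horizontal axis.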

You have seen the following bit of code before in Badge 05, so it should be straightforward to understand.

In [None]:
# Importing the dispersion plot function
from nltk.draw.dispersion import dispersion_plot

# Increasing the size of the plot using width and height specifications
plt.figure(figsize=(12, 9))

# Setting the words we wish to look for as targets
targets=['ai','ethics','danger','humanities','humanity','humans','government','governments','essays','students']

# Creating the plot using the filtered_text as input
dispersion_plot(filtered_text, targets, ignore_case=True, title='Lexical dispersion plot for the Guardian data on ChatGPT')

Note that we've looked separately for the words "humanities" and "humanity". We also looked for "essays", but the plot doesn't include occurrences of "essay". In some cases we may want to treat these different word forms (i.e. singular and plural word forms) the same and visualise their overall occurrence (i.e. just one row for all occurrences of "essay" and "essays"). To do that we can use lemmatisation.

## 4.5 Lemmatisation

Lemmatisation is the last type of processing we will introduce in this badge. It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatisation is similar to stemming, but instead of simply chopping off word endings it maps each word to its dictionary form (its lemma), so the result is always a real word. To perform lemmatisation of each word, we use the `WordNetLemmatizer` which comes with NLTK.

To run it you need to download the following two packages ... 

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')

... you need to import the lemmatiser and create it.

In [None]:
# Importing the WordNetLemmatizer class and initialising it
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

To try out the lemmatiser you can run the following command:

In [None]:
print("humanities :", lemmatizer.lemmatize("humanities"))
print("geese :", lemmatizer.lemmatize("geese"))
print("presents :", lemmatizer.lemmatize("presents"))
print("presented :", lemmatizer.lemmatize("presented"))
print("walks :", lemmatizer.lemmatize("walks"))
print("walked :", lemmatizer.lemmatize("walked"))
print("walking :", lemmatizer.lemmatize("walking"))
print("walker :", lemmatizer.lemmatize("walker"))
print("walkers :", lemmatizer.lemmatize("walkers"))

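You can see that it conflates singular and plural nouns, as well as the base form of verbs and their third person singular forms, into one and is able to deal with irregular plural forms (geese -> goose).  It still treats verbs ending in "-ed" or "-ing" separately though. This is because `lemmatize` treats every word as a noun unless told otherwise (its `pos` argument defaults to `"n"`); calling it with `pos="v"` makes it look the word up as a verb instead.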

### 🐛Mini task: Lemmatise a few more words

- Try the lemmatiser out using different words.
- Can you find any other weirdnesses or words where it does well?

In [None]:
# Write your answer here


<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    
    Replace the words in the following line of code and run the cell: print("walkers :", lemmatizer.lemmatize("walkers"))
         
    ### END SOLUTION

</details>

Lastly, let's visualise the lexical dispersion plot but using lemmatised words. We first need to run the lemmatiser over all the words in our `filtered_text` list to create a list of lemmatised word tokens.

In [None]:
# Creating a list of lemmatised words
lemmatised_words=[]
for w in filtered_text:
    lemmatised_words.append(lemmatizer.lemmatize(w))

# Printing the first 5000 lemmatised words
print(lemmatised_words[0:5000])  

In [None]:
# Importing the dispersion plot function
from nltk.draw.dispersion import dispersion_plot

# Increasing the size of the plot using width and height specifications
plt.figure(figsize=(12, 9))

# Setting the words we wish to look for as targets
targets=['ai','ethic','danger','humanity','human','robot','government','essay','student']

# Creating the plot using the lemmatised_words as input
dispersion_plot(lemmatised_words, targets, ignore_case=True, title='Lexical dispersion plot for the Guardian data on ChatGPT')

### 🦋Final task: Choosing your own data to carry out the analyses

- Choose a new search query and run it to collect the data.
- Create a frequency plot and a word cloud for this data.
- Think about the data and see if there are any further words or symbols you might want to filter out.
- Remember that you can use the Guardian Explore interface to see how many results you can obtain for different queries before you start collecting the data for analysis.


In [None]:
# Write your answer here
