# Importing Data to a DataFrame

## An important note on file organisation

Starting with this chapter I will be making use of existing data. This data is available on the webpage for this book, even though it is drawn from publicly available sources. You should store this data in a folder adjacent to the folder where you store your Jupyter notebooks, not in the same folder as the notebooks. This keeps your files and your data much more organised, and it is the arrangement the code below assumes. So if this file is in a folder called `'notebooks'` under a folder called `'book'`, then you should also have a folder called `'data'` under `'book'`, like so:

~~~
book
 |- notebooks
 |- data
 |- findings
 |- etc.
~~~

To access that folder we can use a _relative_ path. I presently use the wonderful `pathlib` library in Python, which simplifies a lot of file operations and navigation that were previously scattered across a few libraries. With this library, every path is its own special path object. If you print it, it will look like a standard path (such as `C:\Program Files\Anaconda` or `/users/ada/documents/book/`). However, the path object is more flexible and allows for a nice tidy syntax. A path object is created using `Path()`. You can place a path inside the parentheses or you can use another approach. Here I use `Path.cwd()` to get the current working directory (which should be the directory with your Jupyter notebooks). `.parent` means the directory above. Then we can use the `/` operator to join a folder name onto that path, which here points to a folder in the directory above. Below, I will create a path to the `data_dir`. Then I will check whether that path exists; if it does not, the program will create it. That folder is where you should place your data.

In [1]:
from pathlib import Path 

In [2]:
data_dir = Path.cwd().parent / "data"

try:
    if not data_dir.exists():
        data_dir.mkdir()
except OSError:  # e.g. missing parent folder or no write permission
    print(f"There was an issue creating the directory at {data_dir}")
else:
    print(f"The data directory can be found at: {data_dir}.")



~~~ 
The data directory can be found at: /Users/accountname/Documents/GitHub/fsstds/book/data.
~~~
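To get a feel for how path objects behave without touching your own disk, here is a minimal sketch. It uses `PurePosixPath`, a `pathlib` class that only manipulates path text, and the illustrative location `/users/ada/documents/book/notebooks`, which is an assumption standing in for wherever your notebooks live.

```python
from pathlib import PurePosixPath

# Hypothetical notebook folder, used only for illustration.
notebooks_dir = PurePosixPath("/users/ada/documents/book/notebooks")

# .parent steps up one level; "/" joins a folder name onto a path.
book_dir = notebooks_dir.parent
data_dir_example = book_dir / "data"

print(data_dir_example)
print(data_dir_example.name)

# Paths to individual files are built the same way.
csv_path = data_dir_example / "MuppetsTable_simple.csv"
print(csv_path.suffix)
```

With a real `Path` object (rather than the pure variant used here) the same syntax applies, and methods such as `.exists()` and `.mkdir()` then act on the file system.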

## Example data 

This chapter uses example data from places online where the data is freely available for secondary use. These will be available in the GitHub repository for this book. The path for the data folder is https://github.com/berniehogan/fsstds/tree/main/data . From here you can download the example files one by one. Alternatively, if you have a copy of GitHub Desktop (or know how to use Git from the command line), you can "clone" the entire repository, which will include the data as well as all the notebooks, with the paths in the correct place. As a reminder, these notebooks primarily contain the code. The book itself is available through Sage.

# Rectangular data: CSV 

To begin importing data into a DataFrame, we start with the humble CSV format which stands for 'comma-separated values'. We start here because this format most closely resembles the DataFrame. A DataFrame has rows and columns, and similarly CSV data is organised according to rows and columns. 

Reading in a CSV file should be pretty reliable. However, since CSV is not a strict format but more like a loose set of conventions, there are a few subtle considerations. For example, does the file have headers and indices? What character groups strings together? What character (usually a comma, of course) separates the values?

For CSV we can use a variety of approaches in Python, including the `csv` library itself. That library does not work directly with `pandas`. Instead it imports data into dictionaries or lists. Let's briefly see how it works in an example. 

## Using the `csv` library 

`csv.reader()` takes an open CSV file and returns a reader object that you can use, typically by looping over it. Each pass through the loop gives you one parsed line as a list of strings.

In [3]:
import csv

In [4]:
with open(data_dir / "MuppetsTable_simple.csv") as filein:
    file_reader = csv.reader(filein, delimiter=',', quotechar='"')
    for row in file_reader:
        print(row)

['Name', 'Gender', 'Species', 'Appearance']
['Fozzie', 'Male', 'Bear', '1976']
['Kermit', 'Male', 'Frog', '1955']
['Piggy', 'Female', 'Pig', '1974']
['Gonzo', 'Male', '', '1970']
['Rowlf', 'Male', 'Dog', '1962']
['Beaker', '', 'Muppet', '1977']
['Janice', 'Female', 'Muppet', '1975']
['Hilda', 'Female', 'Muppet', '1976']


### CSV and Quote characters 

The CSV format uses a delimiter character to separate out the data, typically a comma (`,`). But what if your CSV has a string which itself has a comma inside? Imagine having a location column that includes a city and a country, such as "Gander, CAN". We can avoid tripping up the parser by using a quote character, such as double quotes. That way the parser will not stop at the comma inside the quotes but consider it all one string. But there is a small gotcha that is worth mentioning: different programs (and different languages) use different quote characters! For example, Microsoft Word has a tendency to replace the symmetric double quote `"`, with asymmetric open and closing quotes, `“` and `”`. Then if you copy and paste these auto-completed quotes into your data it will not parse in the right place. We can observe this (and fix it) in the `MuppetsTable_broken.csv` file.
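Before turning to that file, the effect can be seen in a minimal in-memory sketch. The two rows here are made up for illustration (only the "Gander, CAN" value comes from the text above), and `io.StringIO` stands in for an open file.

```python
import csv
import io

# Two made-up rows: one uses straight quotes, the other curly quotes.
straight = 'Gander Airport,"Gander, CAN"\n'
curly = 'Gander Airport,“Gander, CAN”\n'

straight_row = next(csv.reader(io.StringIO(straight)))
curly_row = next(csv.reader(io.StringIO(curly)))

# The straight-quoted location stays in one cell;
# the curly-quoted one is split at the comma.
print(len(straight_row), straight_row)
print(len(curly_row), curly_row)
```

The parser only recognises the character given as `quotechar` (a straight double quote by default), so the curly quotes are treated as ordinary text and the comma between them splits the cell.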

In [5]:
with open(data_dir / "MuppetsTable_broken.csv") as filein:
    file_reader = csv.reader(filein)
    for row in file_reader:
        print(len(row),row)

5 ['Name', 'Gender', 'Species', 'Appearance', 'Notable Phrase']
5 ['Fozzie', 'Male', 'Bear', '1976', 'Wocka, Wokca!']
5 ['Kermit', 'Male', 'Frog', '1955', "It's not easy being green."]
6 ['Piggy', 'Female', 'Pig', '1974', '“I don’t care what you think of me', ' unless you think I’m awesome. In which case you are right.”']
5 ['Gonzo', 'Male', '', '1970', 'Weirdos have more fun.']
6 ['Rowlf', 'Male', 'Dog', '1962', '“Boy', ' is this piano outta tune! I love outta tune pianos.”']
5 ['Beaker', '', 'Muppet', '1977', 'Meep']
5 ['Janice', 'Female', 'Muppet', '1975', 'Groovy, man']
5 ['Hilda', 'Female', 'Muppet', '1976', "Gonzo, aren't you a little old to carry around a teddy bear?"]


Notice that two rows had 6 items. These rows had the `“` characters, which broke the parser. This reminds us that incoming data has to be consistent. It also reminds us that sometimes we have to clean a little of it ourselves. In general, any change we make to data really ought to be embedded in code. So, I will do that myself below:

In [6]:
with open(data_dir / "MuppetsTable_broken.csv") as filein:
    new_table = filein.read().replace('“','"').replace('”','"')

with open(data_dir / "MuppetsTable_fixed.csv",'w') as fileout:
    fileout.write(new_table)

We can cross check our work.

In [7]:
with open(data_dir / "MuppetsTable_fixed.csv") as filein:
    file_reader = csv.reader(filein)
    for row in file_reader:
        print(len(row), row)

5 ['Name', 'Gender', 'Species', 'Appearance', 'Notable Phrase']
5 ['Fozzie', 'Male', 'Bear', '1976', 'Wocka, Wokca!']
5 ['Kermit', 'Male', 'Frog', '1955', "It's not easy being green."]
5 ['Piggy', 'Female', 'Pig', '1974', 'I don’t care what you think of me, unless you think I’m awesome. In which case you are right.']
5 ['Gonzo', 'Male', '', '1970', 'Weirdos have more fun.']
5 ['Rowlf', 'Male', 'Dog', '1962', 'Boy, is this piano outta tune! I love outta tune pianos.']
5 ['Beaker', '', 'Muppet', '1977', 'Meep']
5 ['Janice', 'Female', 'Muppet', '1975', 'Groovy, man']
5 ['Hilda', 'Female', 'Muppet', '1976', "Gonzo, aren't you a little old to carry around a teddy bear?"]


This time we can see that each row has 5 items, making it a nice rectangular data set. Before going to pandas, I want to highlight one other nice thing about `csv`: the `DictReader`. This returns each row as a dictionary with the header names as the keys and the entries in that row as the values. If there's no header line, you can specify a list to be the keys using the `fieldnames` argument, such as `fieldnames = ["Name","Location","User"]`.

In [8]:
with open(data_dir / "MuppetsTable_fixed.csv") as filein:
    reader = csv.DictReader(filein)
    for row in reader:
        print(row['Name'], row['Appearance'])

Fozzie 1976
Kermit 1955
Piggy 1974
Gonzo 1970
Rowlf 1962
Beaker 1977
Janice 1975
Hilda 1976
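The `fieldnames` argument mentioned above can be tried out with a small sketch. The headerless rows below are an assumption made for illustration (they reuse two Muppets from the table) and are held in memory with `io.StringIO` rather than read from a file.

```python
import csv
import io

# Made-up rows with no header line, held in memory for illustration.
headerless = io.StringIO("Fozzie,Bear,1976\nKermit,Frog,1955\n")

# Because there is no header line, we supply the keys ourselves.
reader = csv.DictReader(headerless, fieldnames=["Name", "Species", "Appearance"])
rows = list(reader)

for row in rows:
    print(row["Name"], row["Appearance"])
```

Without `fieldnames`, `DictReader` would have treated the Fozzie line as the header and returned only the Kermit row.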


## Using the Pandas CSV reader: `read_csv()` 
To import into a DataFrame directly using pandas, you can use the `pd.read_csv()` function as below:

In [16]:
import pandas as pd 

In [None]:
df = pd.read_csv(data_dir / "MuppetsTable_fixed.csv") 
display(df.iloc[:,:4]) # Using iloc to get first four columns.

<!-- print(df.iloc[:,:4].style.to_latex(hrules=True)) -->

\begin{tabular}{llllr}
\toprule
 & Name & Gender & Species & Appearance \\
\midrule
0 & Fozzie & Male & Bear & 1976 \\
1 & Kermit & Male & Frog & 1955 \\
2 & Piggy & Female & Pig & 1974 \\
3 & Gonzo & Male & nan & 1970 \\
4 & Rowlf & Male & Dog & 1962 \\
5 & Beaker & nan & Muppet & 1977 \\
6 & Janice & Female & Muppet & 1975 \\
7 & Hilda & Female & Muppet & 1976 \\
\bottomrule
\end{tabular}

Just like `csv.reader()`, the pandas `pd.read_csv()` function has many arguments to help handle the variety of scenarios that you will encounter in the way data is formatted. A few parameters worth mentioning here:

- `sep` or `delimiter` (default=`','`): Although the data is often separated by comma, `csv` actually stands in for a variety of textual tabular data. Sometimes for example, you'll see a tab-separated file, which may be `data.csv`, `data.tsv`, `data.txt`, or just simply `data`. In that case you can set sep to be `'\t'`. You can set the separator for a variety of circumstances. Watch how sometimes people use `#` or `|` to separate columns.
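- `header` (default=`"infer"`): By default pandas treats the first line as the header (unless you supply your own column names with `names`). It is still useful to set it explicitly: `header=0` if the first line holds the column names, or `header=None` if the file has no header line. Note that pandas does not accept `True` or `False` here.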
- `quotechar` (default=`'"'`): So above we assumed that the quote character was `"`. This way, we could have something like `"Wocka, Wocka!"`, which has a comma in it, and still have it treated as a single cell of data. Sometimes people use a single quote, sometimes people forget to use a quote at all which can make wrangling the data really, really unpleasant. 
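As a minimal sketch of these three parameters working together, the example below reads a small made-up table held in memory. The pipe separator, the single-quoted strings, and the column names passed to `names` are all assumptions chosen for illustration.

```python
import io
import pandas as pd

# Made-up, pipe-separated text with no header row and single-quoted strings.
raw = "Fozzie|Bear|'Wocka, Wocka!'\nBeaker|Muppet|'Meep'\n"

df_example = pd.read_csv(
    io.StringIO(raw),
    sep="|",            # columns are separated by a pipe, not a comma
    header=None,        # the first line is data, not column names
    names=["Name", "Species", "Notable Phrase"],  # so we supply the names
    quotechar="'",      # quotes are stripped rather than kept as data
)
print(df_example)
```

If `header=None` were left out, pandas would use the Fozzie row as the column names and return a single row of data.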

# Rectangular rich data: Excel 

_Excel_ is the popular spreadsheet program from Microsoft. Files can be stored as either `.xls` or `.xlsx`. The original `.xls` is a proprietary file format, but the details can still be reasonably handled by `pandas`. The second one (`.xlsx`) was published as an open standard and is in fact a zipped collection of XML files.

In general, we simply want to import a sheet with `<sheet> = pd.read_excel(<file_path>)`.


If you are looking for tabular data from the web, it is not rare to see it in an Excel sheet rather than CSV. Here we can see an example sheet downloaded from the World Bank's Databank (https://databank.worldbank.org/). I took a few popular indicators (Population, Internet diffusion, $CO_2$ emissions, and GNI) and exported them to Excel using their online portal. There are a few other ways to collect this data, but this way shows an Excel sheet. There are some advantages to Excel when viewed in the software application itself including formatting and multiple tabs. One issue with Excel, however, is that a spreadsheet _sheet_ is not really a DataFrame. The sheet can have all kinds of writing and formatting that do not really work with the notion of cases in rows and variables in columns. This data is available on the book GitHub repository, but you can also find an equivalent export from the World Bank pretty easily and it is worth seeing what data sets they have available.

In [6]:
wb_df = pd.read_excel(data_dir / "World Bank Indicators 2012-2021.xlsx")

wb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1069 entries, 0 to 1068
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Country Name   1066 non-null   object
 1   Country Code   1064 non-null   object
 2   Series Name    1064 non-null   object
 3   Series Code    1064 non-null   object
 4   2011 [YR2011]  1064 non-null   object
 5   2012 [YR2012]  1064 non-null   object
 6   2013 [YR2013]  1064 non-null   object
 7   2014 [YR2014]  1064 non-null   object
 8   2015 [YR2015]  1064 non-null   object
 9   2016 [YR2016]  1064 non-null   object
 10  2017 [YR2017]  1064 non-null   object
 11  2018 [YR2018]  1064 non-null   object
 12  2019 [YR2019]  1064 non-null   object
 13  2020 [YR2020]  1064 non-null   object
dtypes: object(14)
memory usage: 117.0+ KB

In [None]:
display(wb_df.iloc[:,:3])

<!-- import numpy as np
print(wb_df.iloc[np.r_[0:5,-5:0],:3].style.to_latex(hrules=True))
 -->
 
\begin{tabular}{llll}
\toprule
 & Country Name & Country Code & Series Name \\
\midrule
0 & Afghanistan & AFG & Population, total \\
1 & Afghanistan & AFG & Total greenhouse gas emissions (kt of CO2 equivalent) \\
2 & Afghanistan & AFG & GNI, Atlas method (current US\$) \\
3 & Afghanistan & AFG & Individuals using the Internet (\% of population) \\
4 & Albania & ALB & Population, total \\
... & ... & ... & ... \\
1064 & nan & nan & nan \\
1065 & nan & nan & nan \\
1066 & nan & nan & nan \\
1067 & Data from database: World Development Indicators & nan & nan \\
1068 & Last Updated: 04/08/2022 & nan & nan \\
\bottomrule
\end{tabular}

At the bottom of the display it says the data is 1069 rows and 3 columns (since we only asked for the first three). Notice that it doesn't display all the rows? Instead these are truncated. After the fifth row (most likely) you'll see a row that looks like:

~~~
...	...	...	...	...	...	
~~~

This is to indicate that there's data unseen. Similarly, if there are too many columns (something pandas works out itself) then it will truncate the columns as well, but in this case we only asked for the first three columns, so nothing is truncated. You should also notice that at the bottom we do not have data, but a bunch of `NaN` cells. This is because pandas is trying to include all the text on the first sheet in its DataFrame, including that little bit at the bottom where it says

~~~
Data from database: World Development Indicators
Last Updated: 04/08/2022
~~~

There are a number of strategies to clean this up. You might be inclined at first to just open the file in Excel, delete the junk data and start again. Please don't! Remember from above with the CSV - any time you change the data, I strongly recommend you _do it with reproducible code_. So just like above where we had to create a cleaned CSV with the right quote characters, here, we will create a cleaned DataFrame. But how? 

I'm not going to shy away from using a spreadsheet program to view the data and get an intuition. It is fine to use Excel to read the data. And realistically a lot of temp work can be done in Excel. However, in the case of academic work, it is very bad practice to make lasting changes by hand rather than in a reproducible and documented way. 

When I open the data in a spreadsheet, I discover that it seems to be only the bottom five lines with missing data. We can check by using `df.tail()`. Recall that both `head()` and `tail()` take an argument for the number of lines. So let's cross check that: 

In [None]:
wb_df.iloc[:,:4].tail(7)

<!-- print(wb_df.iloc[:,:4].tail(7).style.to_latex(hrules=True)) -->

\begin{tabular}{lllll}
\toprule
 & Country Name & Country Code & Series Name & Series Code \\
\midrule
1062 & World & WLD & GNI, Atlas method (current US\$) & NY.GNP.ATLS.CD \\
1063 & World & WLD & Individuals using the Internet (\% of population) & IT.NET.USER.ZS \\
1064 & nan & nan & nan & nan \\
1065 & nan & nan & nan & nan \\
1066 & nan & nan & nan & nan \\
1067 & Data from database: World Development Indicators & nan & nan & nan \\
1068 & Last Updated: 04/08/2022 & nan & nan & nan \\
\bottomrule
\end{tabular}

Sure enough. This suggests we should be able to slice the data. I want to ensure that I don't have to do this every time but I still want to document this change. Therefore, I will write the sliced DataFrame to a new Excel file rather than merely deleting the rows in the original.

To note, the first `xlsx` file actually had two sheets: the one with the data and a second sheet called `Series - Metadata` with some important facts about how the data was collected. Here I am just writing the DataFrame from the first sheet to a new Excel sheet. 

In [15]:
# Remember the [:-5] is what slices off the last five rows
wb_df.iloc[:-5].to_excel(data_dir / "Cleaned_Popular_Indicators.xlsx",
                        index=False)

Feel free to reimport this file or view it in a spreadsheet program to see the difference. One thing you'll notice is that above I did not write the index to the data. That's because the index carried no specific meaning here. However, whether you want to keep the index or not will depend on the type of data you have and whether the index represents meaningful data (such as an ID number or timestamp).
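The slicing and the `index` argument can also be explored on a small made-up frame, sketched below. The frame is a stand-in for the World Bank sheet (two data rows followed by five footer rows), and `to_csv()` is used only because it needs no Excel engine installed; this sketch assumes `index` behaves the same way in `to_excel()`.

```python
import pandas as pd

# A made-up stand-in for the World Bank sheet: two data rows, then footer junk.
toy = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania", None, None, None,
                     "Data from database: World Development Indicators",
                     "Last Updated: 04/08/2022"],
    "Country Code": ["AFG", "ALB", None, None, None, None, None],
})

cleaned = toy.iloc[:-5]   # drop the last five rows, as in the cell above
print(len(toy), len(cleaned))

# Compare the text that is written with and without the index.
with_index = cleaned.to_csv(index=True)
without_index = cleaned.to_csv(index=False)
print(with_index)
print(without_index)
```

With `index=True` an extra, unnamed first column holding 0, 1, ... is written, which is why it is worth leaving out when the index carries no meaning.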

In addition to the built-in pandas reader, Python has extensive packages for reading and writing to Excel, as well as adding formatting. Perhaps the most widely used of these is XlsxWriter (https://xlsxwriter.readthedocs.io/), but there are many others, as compiled by Excel for Python (https://www.excelpython.org/).

# Nested data: `JSON` (JavaScript Object Notation)

Data are measurements about the world. DataFrames are analytical devices for making comparisons and examining the data at varying levels of scale. But the DataFrame's row-by-column arrangement doesn't always reflect the way that data about the world is organised. 

An important concept to consider when wrangling data is **nesting**. That is to say data structures can be nested inside other data structures. Twitter is a useful case here. If you collect raw tweet data (something we will be doing later in the book), you'll see that it looks kind of like a dictionary structure. But it is not simply a list of key:value pairs. This is a list of key:value pairs: 

~~~ Python 
tweet = {"tweet":"This is the tweet",
         "likes":123,
         "retweets":141,
         "time":"Saturday, March 14, 2015.",
         # ...
         }
~~~

Compared to this, a tweet object would have dictionaries nested inside other dictionaries. For example, one dictionary will include details about the account that sent the tweet. So it might look a little more like: 

~~~ json
{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "id_str": "850006245121695744",
  "text": "1\/ Today we\u2019re sharing our vision for the future of the Twitte
           r API platform!\nhttps:\/\/t.co\/XweGngmxlP",
  "user": {
    "id": 2244994945,
    "name": "Twitter Dev",
    "screen_name": "TwitterDev",
    "location": "Internet",
    "url": "https:\/\/dev.twitter.com\/",
    "description":  "Your official source for Twitter Platform news, updates & e
                    vents. Need technical help? Visit https:\/\/twittercommunit
                    y.com\/ \u2328\ufe0f #TapIntoTwitter"
  },
  "place": {   
  },
  "entities": {
    "hashtags": [      
    ],
    "urls": [
      {
        "url": "https:\/\/t.co\/XweGngmxlP",
        "unwound": {
          "url": "https:\/\/cards.twitter.com\/cards\/18ce53wgo4h\/3xo1c",
          "title": "Building the Future of the Twitter API Platform"
        }
      }
    ],
    "user_mentions": [     
    ]
  }
}
~~~
    
As we can see here, it is an unruly combination of dictionaries and lists nested within each other. Fortunately, Python has no trouble with objects nested inside other objects. The data above is termed JSON or JavaScript Object Notation. It is a combination of lists and dictionaries as they would be formatted for JavaScript (which means there are tiny differences compared to Python, but it is basically the same).  

## Loading JSON

To work with JSON in Python directly you can use the `json` library. It provides a means to parse a JSON string into Python data structures in memory (`json.loads(<THE_DATA>)`) and a means to take a data structure and transform it into a valid JSON string, for example for writing to disk (`json.dumps(<THE_DATA>)`).

The data shown above is in a JSON structure that a Twitter data parser can understand. That said, this is Twitter data from their prior v1.1 API. Twitter have recently released a new v2 API which focuses on data minimisation so you would get much less data by default but it might show a similar nesting. We see that API in Chapter \ref{ch:apis}.  In the tweet above, we can see a hierarchy, which I sketch part of below:

~~~
object
 --created_at
 ...
 --user
   --id
   --name
   --screen_name
 --entities
   --hashtags
 --place
 ...
~~~
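To make the hierarchy concrete, here is a minimal sketch of walking down it in Python. It uses a trimmed copy of the tweet shown earlier, kept as a JSON string; which fields to keep was my choice for illustration.

```python
import json

# A trimmed copy of the tweet shown above, kept as a JSON string.
tweet_json = """
{
  "created_at": "Thu Apr 06 15:24:15 +0000 2017",
  "user": {"id": 2244994945, "name": "Twitter Dev", "screen_name": "TwitterDev"},
  "entities": {
    "hashtags": [],
    "urls": [
      {"unwound": {"title": "Building the Future of the Twitter API Platform"}}
    ]
  }
}
"""

tweet_obj = json.loads(tweet_json)

# Dictionaries are indexed by key, lists by position.
print(tweet_obj["user"]["screen_name"])
print(tweet_obj["entities"]["urls"][0]["unwound"]["title"])
print(len(tweet_obj["entities"]["hashtags"]))
```

Each pair of square brackets moves one level down the hierarchy, so the depth of the nesting is visible in the length of the lookup.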

Below, we are going to work with a simpler data structure just to demonstrate JSON. 

The file `omdb_Muppet_search_page_1.json` is data that I downloaded from the site Open Movie Database (https://omdbapi.com/). To collect machine readable data it is common to use an API, or Application Programming Interface. When dealing with data from the web, this term usually means that a server has a series of web addresses that you request data from, but instead of an HTML page, you would get some data that you would have to parse.

The OMDB has a simple API (i.e. a pipeline for requesting specific data) which I used to search for movies or shows that include the word "Muppet" in the title. By requesting data using the right address, my program received some JSON. As you will see below, the JSON referred to the first 10 entries of movies with the word "Muppet" in the title. Any subsequent entries would have to be collected using a supplementary call.  

In [16]:
import json 

# Open the file in a with block so it is closed once the data is loaded
with open(data_dir / "omdb_Muppet_search_page_1.json") as filein:
    mdata = json.load(filein)
type(mdata)

dict

Below I will print the first 300 characters of the JSON as text. However, you will see that I use the parameter `indent` with the argument `2`. This means that it will indent each line 2 spaces in for each level of nesting. Every time the data has an open `{` or `[` it would imply another deeper level of nesting. Printing in this way is referred to as 'pretty printing'. Printed without the indent it would look much harder to comprehend. Remove `indent=2` to see for yourself. Also try removing `[:300]` to print _all_ the data (it's just ten entries).

In [17]:
print(json.dumps(mdata,indent=2)[:300])

{
  "Search": [
    {
      "Title": "The Muppet Christmas Carol",
      "Year": "1992",
      "imdbID": "tt0104940",
      "Type": "movie"
    },
    {
      "Title": "The Muppet Movie",
      "Year": "1979",
      "imdbID": "tt0079588",
      "Type": "movie"
    },
    {
      "Title": "The Muppet


Since the top level of the JSON is a dictionary we can explore the data structure by asking for the keys and then observing the values. In general, it is preferable to have some guide or schema for how the data is structured, but in my experience it is very common to need to explore it yourself to understand the structure. 

In [18]:
mdata.keys()

dict_keys(['Search', 'totalResults', 'Response'])

In this case, like in many cases, the top level keys segment out the JSON into a part that contains rows of data and a part that helps with managing the flow of data. Here, `Search` is a list of the results. We will focus on the value for `Search` later. `totalResults` gives us the number of total rows (of which `mdata` contains the first 10) and `Response` is a `True` or `False` value indicating whether the response contains data or an error.

There is nesting here because `Search` is a list of the first ten responses. Each response itself is also a dictionary: A dictionary in a list in a dictionary. 

In [19]:
print(mdata['totalResults'],
      mdata['Response'])

62 True
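Since each response holds ten entries, the number of calls needed to collect everything follows from `totalResults`. A minimal sketch of that arithmetic is below; the page size of 10 is taken from the description above, and the value 62 is typed in by hand so the snippet stands alone (with the loaded data you could use `int(mdata['totalResults'])` instead).

```python
import math

total_results = 62   # the totalResults value printed above
page_size = 10       # each response holds ten entries, as described earlier

# Round up, because a final partial page still needs its own request.
pages_needed = math.ceil(total_results / page_size)

print(pages_needed)        # requests in total
print(pages_needed - 1)    # supplementary calls after the first page
```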


In [20]:
print(mdata['Search'][0])

{'Title': 'The Muppet Christmas Carol', 'Year': '1992', 'imdbID': 'tt0104940', 'Type': 'movie'}


Although `json.dumps` is meant for turning data into a JSON string (for example, before writing it to disk), it is also useful as a way to nicely print a dictionary, just to help us examine its structure.

In [21]:
print(json.dumps(mdata['Search'][0],indent=2))

{
  "Title": "The Muppet Christmas Carol",
  "Year": "1992",
  "imdbID": "tt0104940",
  "Type": "movie"
}


While JSON is a nested structure, there is a convenience function in pandas to help in turning it into a DataFrame. This function, `json_normalize()`, takes a list of dictionary objects and then creates a table based on the keys. So given a dictionary like this:

In [20]:
results = { "Search":[
                 {"Title":"Ghosts of Hidden Valley",
                  "Year":2010},
                 {"Title":"The Perspex Event",
                  "Year":2018}]
          }

we can use `pd.json_normalize(results["Search"])` to transform it into a table with a row for each element of the `Search` list.

In [None]:
pd.json_normalize(results["Search"])

<!-- print(pd.json_normalize(results["Search"]).style.to_latex(hrules=True)) -->

\begin{tabular}{llr}
\toprule
 & Title & Year \\
\midrule
0 & Ghosts of Hidden Valley & 2010 \\
1 & The Perspex Event & 2018 \\
\bottomrule
\end{tabular}

Doing this with the JSON search data we loaded as `mdata` is a similar matter. 

In [None]:
mdf = pd.json_normalize(mdata["Search"])
display(mdf.iloc[:,:3])

<!-- print(mdf.iloc[:,:3].style.to_latex(hrules=True)) -->

\begin{tabular}{llll}
\toprule
 & Title & Year & imdbID \\
\midrule
0 & The Muppet Christmas Carol & 1992 & tt0104940 \\
1 & The Muppet Movie & 1979 & tt0079588 \\
2 & The Muppet Show & 1976–1981 & tt0074028 \\
3 & Muppet Treasure Island & 1996 & tt0117110 \\
4 & The Great Muppet Caper & 1981 & tt0082474 \\
5 & Muppet Babies & 1984–2020 & tt0086764 \\
6 & It's a Very Merry Muppet Christmas Movie & 2002 & tt0329737 \\
7 & A Muppet Family Christmas & 1987 & tt0251282 \\
8 & Muppet*vision 3-D & 1991 & tt0102481 \\
9 & Muppet Classic Theater & 1994 & tt0213096 \\
\bottomrule
\end{tabular}

Notice above that we ran `json_normalize(mdata["Search"])`. Below I will show what happens when we do this on `mdata` (the parent dictionary). 

In [None]:
pd.json_normalize(mdata)

<!-- print(pd.json_normalize(mdata).style.to_latex(hrules=True)) -->

\begin{tabular}{llll}
\toprule
 & Search & totalResults & Response \\
\midrule
0 & [\{'Title': 'The Muppet Christmas Carol', 'Year... & 62 & True \\
\bottomrule
\end{tabular}

This is not what we want. It shows that `json_normalize` treats each dictionary it is given as a row. So when we used `mdata` we had one entry with these three keys (`Search`, `totalResults`, `Response`). What we really wanted was ten entries, each with the keys `"Title"`, `"Year"`, `"imdbID"`, `"Type"`. Thus we pass `mdata['Search']` instead.

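As a hedged aside, `json_normalize()` also has `record_path` and `meta` arguments that let you pass the parent dictionary and say which list holds the rows. The sketch below uses a made-up response shaped like `mdata` but cut down to two entries; storing `totalResults` and `Response` as strings is an assumption for illustration.

```python
import pandas as pd

# A made-up response shaped like mdata, cut down to two entries.
response = {
    "Search": [
        {"Title": "The Muppet Christmas Carol", "Year": "1992"},
        {"Title": "The Muppet Movie", "Year": "1979"},
    ],
    "totalResults": "62",
    "Response": "True",
}

# record_path names the list to turn into rows;
# meta copies the listed top-level keys onto every row.
flat = pd.json_normalize(response, record_path="Search", meta=["totalResults"])
print(flat.columns.tolist())

# Dictionaries nested inside a record become columns with dotted names.
nested = pd.json_normalize([{"id": 1, "user": {"name": "Twitter Dev"}}])
print(nested.columns.tolist())
```

Passing `mdata['Search']` directly, as in the chapter, is simpler when you only need the rows; `record_path` and `meta` are mainly useful when you also want to keep some of the top-level values alongside them.
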
# Nested markup languages: HTML and XML 

## HTML: Hypertext Markup Language 

JSON is pretty similar to Python. You can recognise the dictionaries as having `{` and the lists as having `[` characters. Markup languages look a little different. They typically use tags to open and close levels of the hierarchy. Below I will show how to parse two of the more popular markup languages, HTML and XML. 

A markup language is a formal syntax that appends characters to either side of data in order to give the enclosed data some meaning. For example, you can enclose the words \*\***big deal**\*\* in asterisks like that to tell a program that it should be printed in bold. This book was written in a simple markup language called 'markdown'. Markdown is a light syntax used to encase certain words. It does not carry many semantics. However, HTML and XML use tags that carry a lot of meaning. By encasing values in _tags_, such as `<title>Here is a title</title>`, we can arrange data in a nested way and determine what the data represents. In that case, `Here is a title` was nested in the title tags, but we can nest tags in tags, hence the hierarchy. We can also put attributes in the tags themselves, like `<title font="Helvetica">Here is a title</title>` where `font` is an attribute of the `<title>` tag with a value of `"Helvetica"`. Notice that the closing tag is just the tag name with a `/` in front to denote that it is a closing tag. Some tags are self-closing, which is denoted `<tag />`. On its own a self-closing tag does not seem that interesting, but it often carries a lot of relevant data in the attributes.

HTML is the markup language used all over the web. Here is some really simple, but valid, HTML. 

~~~ HTML
<html>
    <head>
        <title> 
            This is the title! 
        </title>
    </head>
    <body>
        This is a webpage! <p/> 
        Learn more about the web through <a href="https://w3c.org">The W3C</a>
    </body>
</html>
~~~

If you were to copy that text, paste it into a plain text file with the extension `.html`, and open it in a browser, you would see a mostly blank page with the title bar saying "This is the title!" and a single line saying "This is a webpage!" in the plain, default format. Underneath should be a line saying "Learn more about the web through [The W3C](https://w3c.org)".

The tags give meaning and thus structure. They are also sources of data. For example, we can extract the body text ("This is a ... W3C") which is a _value_ encased in the \<body\> tags. We can also extract a link https://w3c.org/ which is an _attribute_ of the \<a\> tag. 
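The difference between a _value_ and an _attribute_ can be sketched with Python's built-in `html.parser` module, applied to a compact copy of the HTML above. The `LinkCollector` class is a hypothetical helper written for this illustration, not part of any library.

```python
from html.parser import HTMLParser

page = """<html>
    <head><title>This is the title!</title></head>
    <body>
        This is a webpage! <p/>
        Learn more about the web through <a href="https://w3c.org">The W3C</a>
    </body>
</html>"""


class LinkCollector(HTMLParser):
    """Collects the href attribute and the enclosed text of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []        # attributes: the href of each <a> tag
        self.link_text = []    # values: the text enclosed by <a>...</a>
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            self._in_link = True
            self.links.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        # Only keep text that sits between <a> and </a>
        if self._in_link:
            self.link_text.append(data)


collector = LinkCollector()
collector.feed(page)
print(collector.links)
print(collector.link_text)
```

This is deliberately low level: you have to track which tag you are inside yourself, which is the kind of bookkeeping that a dedicated parsing library takes care of.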

HTML data is pretty far from tabular data. It's nested and also relies on files that exist alongside the HTML, like a CSS (cascading style sheet) file or some imported JavaScript. Many of the tags (like the many `<div>` tags that proliferate in modern HTML) are not semantic, but are there to help with the layout of the page and the layout of the code that renders the page. Regardless, HTML data is still data that can be used. There is lots of work to be done that involves parsing HTML. For example:

- Comments and other textual data have HTML in them;
- Sometimes there is no API, but the data is really consistent (like Craigslist ads); 
- A page might have some tables on it;
- You're collecting data by crawling the web; 
- You might want to extract the links on a page and do some analytics on them.

In Chapter \ref{ch:web} we will use HTML directly in order to collect web data using Python's `requests` library. 

## Wikipedia as a data source 

In my courses and my research I lean a lot on data from Wikipedia. It is truly a marvel of the Internet age. The pages on Wikipedia are often highly accurate and the data that is available from the site is often staggering in its depth. In past work I have made use of Wikipedia pages, pages for authors, statistics for page views and edits, pages in different languages, and more. In research I like to suggest that Wikipedia is a great place to start but not a great place to end. This means an emphasis on critically engaging the content as well as checking out the sources.

One of the nice things about Wikipedia is that as a freely accessible encyclopedia, there's always content that can be used in teaching and research. In this chapter we will use a snapshot of a Wikipedia page that has been stored in the data folder. We will compare that snapshot as formatted HTML as well as unformatted XML with "wikitext" (i.e. text that uses the wiki syntax behind the scenes, see https://en.wikipedia.org/wiki/Help:Wikitext).

## Wikipedia as HTML 
On the web, Wikipedia is formatted as HTML. It has links that go both within Wikipedia as well as links that go to other sites. The page will have a consistent format regardless of the Wikipedia entry. You can see the underlying text that we are working with by opening `Canada - Wikipedia.html` in a text editor, or see it formatted by opening it in a web browser. The page should look similar to `https://en.wikipedia.org/wiki/Canada` although the live page will undoubtedly have at least a few tweaks to the content between when the book was published and when you look at the page. In fact, while the page has likely been edited between when this book was written and when you are reading it, you should still be able to see the exact version of this page on the site. How? This page will have a revision number referring to the specific revision of the page.

Below we will open the page as well as do a little parsing. 

In [27]:
with open(data_dir / "Canada - Wikipedia.html") as infile: 
    wiki_HTML = infile.read()

print(len(wiki_HTML))

1091805


At this point `wiki_HTML` is just raw text. Printing the length shows it is a long series of characters, so it is probably the page as expected. We can preview the text by printing a range of characters such as `print(wiki_HTML[:200])` for the first 200 characters. This gets as far as showing the title of the page is Canada. So far, so good. 

In [28]:
print(wiki_HTML[:200])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Canada - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakF


## Using BeautifulSoup for markup data

BeautifulSoup takes in a blob of markup text and parses it for use. When you use the library it is convention to call the parsed text a `soup`. We use a soup to help us find text that could be anywhere on the page. Since XML and HTML documents are hierarchical, if we did not have this ability we would have to navigate through the hierarchy. In the above example of HTML, getting the text from the title hierarchically would be `soup.html.head.title.text`. However, the soup knows that `title` is a tag, so you can just ask for `soup.title.text` and it will return `"This is the title!"`. This ability to just look for tags is especially useful for things like finding links (which all start with the `<a>` tag, as in `<a href="https://www.duckduckgo.com/">Search with DuckDuckGo</a>`).
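
As a minimal sketch of that shortcut, the snippet below parses a tiny, made-up HTML document and retrieves the title both ways. The markup and the names `toy_html` and `toy_soup` are illustrative assumptions, not part of the book's data.

```python
import bs4

# A small, hypothetical HTML document used only for illustration.
toy_html = """
<html>
  <head><title>This is the title!</title></head>
  <body>
    <p>Some text with a
       <a href="https://www.duckduckgo.com/">Search with DuckDuckGo</a> link.</p>
  </body>
</html>
"""

toy_soup = bs4.BeautifulSoup(toy_html, "html.parser")

# Walking the hierarchy explicitly...
long_way = toy_soup.html.head.title.text
# ...or letting the soup find the first <title> tag wherever it is.
short_way = toy_soup.title.text

print(long_way == short_way)

# The same idea finds every link without knowing where it sits in the tree.
toy_links = [a.get("href") for a in toy_soup.find_all("a")]
print(toy_links)
```

The shortcut returns the first matching tag, so it is most convenient when a tag name is unique on the page.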

To see BeautifulSoup in practice, let's first have a look at an HTML page from Wikipedia and then an XML data file. We are going to use Wikipedia in both cases since it renders on the web as HTML but is exported for analysis as XML, so conveniently we can compare the differences. As a reminder, both `Canada - Wikipedia.html` and `Canada - Wikipedia Special Export.xml` should be in the data folder on the course webpage. See below how we first parse the page, print the title text, and look for links.  

In [29]:
import bs4

In [30]:
soup = bs4.BeautifulSoup(wiki_HTML, 'html.parser')

print(soup.title.text)

# Query the soup for all 'a' tags. (Knowing that 'a' tags refer to links)
links = soup.find_all("a")
print(len(links))

Canada - Wikipedia
4381


This approach found 4381 `<a>` tags in the HTML page for Canada (not necessarily 4381 unique links, since the same URL can appear more than once). This is a considerable number. Even for a single page on Wikipedia, we are already approaching a scale that would be hard for a single human coder to work with. What if each of the two-hundred-plus countries has its own page in almost one hundred Wikipedia languages? Getting the URLs on each one would be a huge task! 

## Data Scepticism

It is healthy and useful to be sceptical of overly mechanistic approaches. This is especially important when copying other people's code or using 'black box' algorithms. Scepticism implies that we express some uncertainty about whether the data we have is the data we want. We can alleviate this scepticism through a number of approaches, though no single approach will be sufficient as we shall see throughout this book. Some tactics for checking data:

- Plotting distributions: Are there unexpected outliers? 
- Spot checking results: Do they look like what you expected?
- "Top and tails": Looking at the first and last results - are they appropriate?
- Tabulating results: Does a `value_counts()` give the sort of result as expected? 

In this case, we will be spot checking the results for now and tabulating later. These are often considered under 'preprocessing' tasks in a data pipeline, but that implies you know what to look for to clean up your data. Before we create a pipeline for many pages, it is useful to start with a single page and investigate for any errors that might end up being systematic errors.
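
To make two of these tactics concrete, here is a minimal sketch of "tabulating" and "top and tails" using only the standard library. `collections.Counter` plays the role that `value_counts()` plays in pandas. The list `sample_hrefs` and the helper `rough_kind` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical href values, standing in for what a page might contain.
sample_hrefs = [
    "/wiki/Toronto",
    "/wiki/Ottawa",
    "https://www.mediawiki.org/",
    "#cite_note-1",
    "/wiki/Toronto",
    None,  # an <a> tag with no href attribute
]

def rough_kind(href):
    """Assign a coarse, illustrative label to an href value."""
    if href is None:
        return "no href"
    if href.startswith("#"):
        return "same-page anchor"
    if "://" in href:
        return "external"
    return "internal"

# Tabulate: do the categories and their sizes look plausible?
kind_counts = Counter(rough_kind(h) for h in sample_hrefs)
for kind, n in kind_counts.most_common():
    print(kind, n)

# Top and tails: look at the first and last items directly.
print(sample_hrefs[0], sample_hrefs[-1])
```

A category that is unexpectedly large, or one you did not anticipate at all, is a prompt to go back and look at the raw tags.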

In [31]:
import random 

In [32]:
print("Head: ", links[0])
print("Tail: ", links[-1])

spot = random.choice(links)
print("Random: ", spot)

Head:  <a id="top"></a>
Tail:  <a href="https://www.mediawiki.org/"><img alt="Powered by MediaWiki" height="31" loading="lazy" src="/static/images/footer/poweredby_mediawiki_88x31.png" srcset="/static/images/footer/poweredby_mediawiki_132x47.png 1.5x, /static/images/footer/poweredby_mediawiki_176x62.png 2x" width="88"/></a>
Random:  <a href="/wiki/Toronto" title="Toronto">Toronto</a>


The first tag is clearly not a URL. We can determine this by looking in the tag attributes. Normally with an `a` tag there's a `href` attribute which points to the URL. An external URL would have a `://` included, such as `https://`, `sftp://`, or `http://`. However, the top tag has `id="top"` inside, which is just for internal navigation.

To check the attributes of a tag you can call `<tag>.attrs`. They will be returned as a dictionary of key-value pairs. Observe:

In [33]:
links[-1].attrs

{'href': 'https://www.mediawiki.org/'}

The following code snippet uses the `attrs` feature to check if `href` is an attribute of the `a` tag.

In [34]:
href_links = [x for x in soup.find_all('a') if 'href' in x.attrs]

print(f"There are {len(href_links)} 'href' links in this file.")

print(f"The first 'href' link:\n{href_links[0]}")

There are 4377 'href' links in this file.
The first 'href' link:
<a href="/wiki/Wikipedia:Featured_articles" title="This is a featured article. Click here for more information."><img alt="Featured article" data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>


This particular link was internal to Wikipedia. This means that if you click on it you will go to another Wikipedia page. We can determine this since it starts with `"/wiki/..."`. A simple way to check if it is an external link is to take all the links returned via `href` and check which ones include `://`. We can use `<tag>.get(<attr>)`, which will get the attribute value for whatever is the attribute key `<attr>`. For example, given the tag `<a href="https://www.eff.org">Electronic Frontier Foundation</a>` as `x`, calling `x.get('href')` will return `https://www.eff.org`.   

In [35]:
href_ext_links = [x for x in soup.find_all('a') if
                  'href' in x.attrs and # Do this if clause first
                  '://' in x.get('href')] # since this depends on first if

print(f"There are {len(href_ext_links)} 'href' & '://' links in this file.")
print(f"The first 'href' and '://' link:\n{href_ext_links[0].get('href')}")

There are 950 'href' & '://' links in this file.
The first 'href' and '://' link:
https://en.wikipedia.org/w/index.php?title=Canada&action=edit


In [36]:
href_int_links = [x for x in soup.find_all('a')
                  if 'href' in x.attrs and "://" not in x.get('href')]

print(f"There are {len(href_int_links)} 'href' internal links in this file.")
print(f"The first 'href' internal link:\n{href_int_links[0].get('href')}")

There are 3427 'href' internal links in this file.
The first 'href' internal link:
/wiki/Wikipedia:Featured_articles


If you sum together $950$ (the external `href` links), $3427$ (the internal `href` links) and $4$ (the `<a>` tags without an `href` attribute), then you get the total: $4381$, so all links are accounted for.  
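
The `'://'` test is a reasonable first pass, but it is a string check rather than a URL parse. As a hedged alternative (not what the cells above do), the standard library's `urllib.parse.urlsplit` separates out the host, which also catches protocol-relative links such as the `//upload.wikimedia.org/...` image sources seen earlier. The helper `is_external` and the sample values are assumptions made up for illustration.

```python
from urllib.parse import urlsplit

def is_external(href):
    """Return True if href names a host, i.e. it is not a relative path.

    This is an illustrative helper, not part of BeautifulSoup.
    """
    parts = urlsplit(href)
    return bool(parts.netloc)

sample_hrefs = [
    "/wiki/Toronto",                       # internal, relative path
    "#cite_note-1",                        # same-page anchor
    "https://www.eff.org",                 # external, has a scheme
    "//upload.wikimedia.org/wikipedia/x",  # protocol-relative, no '://'
]

flags = [is_external(href) for href in sample_hrefs]
for href, flag in zip(sample_hrefs, flags):
    print(f"{href!r:45} external={flag}")
```

Note that either approach treats an absolute URL that points back to Wikipedia (such as the first `'://'` link printed above) as "external", so a stricter definition would also need to compare the host name.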

We will do a little more with HTML in Chapter \ref{ch:cleaning} where we clean the HTML out of some comments. Before we leave, however, I wanted to note some of the limitations of HTML. For example, what is the revision ID of this page? Below we will see how to get the revision from structured XML really easily. But here it is not so straightforward. It is indeed embedded in the HTML...somewhere. A good exercise for you is to open the HTML in a browser, view the source, and look for "wgRevisionId". It will be a key buried inside some JavaScript. If you find it and the value associated with that key (it will be a 10-digit number), then you can find this exact version online at `https://en.wikipedia.org/w/index.php?title=Canada&oldid=<wgRevisionId>` by replacing `<wgRevisionId>` with the number.     
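
If you would rather search in code than by eye, a regular expression is one option. The sketch below runs on a short made-up string that imitates how the key might appear inside the page's JavaScript. The exact surrounding text in the real file may differ, so treat the pattern as an assumption to verify against `wiki_HTML`.

```python
import re

# A made-up fragment imitating JavaScript configuration embedded in HTML.
sample_js = 'RLCONF={"wgPageName":"Canada","wgRevisionId":1079679373};'

# Look for the key, optional whitespace around the colon, then a run of digits.
match = re.search(r'"wgRevisionId"\s*:\s*(\d+)', sample_js)

if match:
    revision_id = match.group(1)
    print(revision_id)
    print(f"https://en.wikipedia.org/w/index.php?title=Canada&oldid={revision_id}")
else:
    print("wgRevisionId not found; check how the key is written in the file.")
```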

# XML

XML stands for 'extensible mark-up language'. XML files can be generic or have a document type. For example, the popular GraphML format for social network analysis is really just XML with a specific schema that is used for network graph types. 

Like HTML, XML is a markup language that uses _less than_ ("`<`") and _greater than_ ("`>`") symbols to encase the element tags. 

~~~ XML 
<start> 
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
~~~

Elements have an "element tree". Above, `start` is the root node, `middle` is a child and `end1` is a child of middle. `end1` and `end2` are siblings. XML is a "self-documenting" style, which means that you can insert details about the elements into the document itself. For example, at the beginning of the `Canada - Wikipedia Special Export.xml` file is a tag that points to the specific XML schema used for Wikipedia data. 
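
To see those relationships in code, here is a minimal sketch that parses the toy document above with Python's built-in `xml.etree.ElementTree` module (chosen here only because it needs no installation; the variable names are illustrative).

```python
import xml.etree.ElementTree as ET

toy_xml = """
<start>
    <middle>
        <end1>   Here is an element! </end1>
        <end2>   Here is an element! </end2>
    </middle>
</start>
"""

root = ET.fromstring(toy_xml)
print(root.tag)                    # the root node

middle = root.find("middle")       # a child of the root
child_tags = [child.tag for child in middle]
print(child_tags)                  # children of <middle>, siblings of each other

leaf_text = middle.find("end1").text.strip()
print(leaf_text)                   # the text inside a node with no children
```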

In [37]:
with open(data_dir / "Canada - Wikipedia Special Export.xml") as infile: 
    wiki_XML = infile.read()

print(len(wiki_XML))
print(wiki_XML[:300])

277240
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>


Most of the time, we will not be so concerned with the top of an XML document. Rather, we usually just want to navigate the element tree to get to the element(s) that are of concern to us. Sometimes, parsers will already have been written that take the XML and load it into a data structure for us. This is the case with GraphML, the common format for social network data. We cover the use of GraphML in Chapter \ref{ch:nets} on networks.

Getting data from a webpage or XML into a DataFrame is often trickier than using JSON. There are some approaches that can help, but they will depend on the kind of XML that is being wrangled. The Wikipedia XML we saw above has all the important data as text between tags. By contrast, the Stack Exchange data that we will be using as the extended example in Chapter \ref{ch:cleaning} has all the data as attributes in self-closing XML tags.  
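
The difference between the two styles can be sketched with two one-line fragments. Both fragments, including their tag names, attribute names, and values, are hypothetical; the standard library's `xml.etree.ElementTree` is used here for brevity.

```python
import xml.etree.ElementTree as ET

# Style 1: data as text between tags (hypothetical fragment).
text_style = (
    "<revision><id>123</id>"
    "<timestamp>2020-01-01T00:00:00Z</timestamp></revision>"
)

# Style 2: data as attributes of a self-closing tag (hypothetical fragment).
attr_style = '<row Id="123" CreationDate="2020-01-01T00:00:00Z" />'

# Text style: each child tag becomes a key, its text becomes the value.
rev = ET.fromstring(text_style)
record_from_text = {child.tag: child.text for child in rev}

# Attribute style: the attributes are already a mapping of key to value.
row = ET.fromstring(attr_style)
record_from_attrs = dict(row.attrib)

print(record_from_text)
print(record_from_attrs)
```

Either way the result is a flat dictionary, and a list of such dictionaries is a convenient starting point for a DataFrame.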

## XML and character encoding

Computers store data as a stream of bytes, but we read it as a stream of text. Bytes and text are very similar but there are some slight differences that can throw off a program trying to read or parse one rather than the other. If you get an error below, it might be that, depending on how you downloaded the file (such as using an older browser), you have the XML as byte data rather than text. If that's the case, where it says `bs4.BeautifulSoup(wiki_XML, 'lxml')`, you would first _decode_ the bytes, as in `bs4.BeautifulSoup(wiki_XML.decode("utf8"), 'lxml')`.  
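
A minimal sketch of that check, using a hypothetical variable `raw` in place of the file contents, might look like this:

```python
# A hypothetical stand-in for file contents that arrived as bytes.
raw = "<page><title>Québec</title></page>".encode("utf8")

print(type(raw))   # bytes, not str

# Decode only if needed, so the same code works for either kind of input.
text = raw.decode("utf8") if isinstance(raw, bytes) else raw

print(type(text))
print(len(raw), len(text))  # byte count and character count can differ
```

The lengths differ here because the accented character takes more than one byte in UTF-8, which is one of the "slight differences" that can trip up a parser.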

In [38]:
import bs4

wiki_XML = wiki_XML.replace('<text xml:space=','<wikitext xml:space=')

soup = bs4.BeautifulSoup(wiki_XML, 'lxml')

print(soup.mediawiki.page.revision.id.text )

1079679373


Since the XML data includes important information inside structured tags, it is a lot easier to get the revision ID here simply by querying `.revision.id.text` rather than searching for it through JavaScript nested inside an HTML document.

## Navigating XML

Navigating XML involves moving up and down or sideways along an _element tree_, which is the term for how we think of the tags as being nested inside or adjacent to each other. In the case above I knew where to go for the text I wanted (`mediawiki.page.revision.id`). In general, however, navigating to the right element is a bit tedious. Some people prefer the use of Python's built-in ElementTree package. In either case, what you will be doing with your code is navigating a tree structure. Trees tend to use the following nomenclature, which borrows from both the natural tree and the notion of a family tree: 

- **Root**: The base or primary node is called the root node; 
- **Parent and child**: A parent is a node that has nodes nested within, like `id` nested within `revision` above. In that case, `revision` is the parent node and `id` is the child node; 
- **Sibling**: Two child nodes with the same parent, like how `sitename` and `dbname` are both children of `siteinfo`;
- **Leaf**: A term sometimes used to indicate a child node with no children of its own. 

Below I use BeautifulSoup to navigate through the tags so that I can get to the data I want. (I also use the walrus operator `:=`, available from Python 3.8, below. This assigns a value to a variable as part of an expression, so the value can be tested and reused in one step. I check if the tag has a name, in which case I print it.)

In [39]:
# for i in soup.children: print(i.name)
# for i in soup.html.children: print(i.name)
# for i in soup.html.body.children: print(i.name)
# for i in soup.html.body.mediawiki.children: print(i.name) 
# for i in soup.html.body.mediawiki.page: print(i.name)
for i in soup.html.body.mediawiki.page.revision: 
    if name := i.name: print(name)

id
parentid
timestamp
contributor
minor
comment
model
format
text
sha1
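
The walrus operator used in the loop above can be sketched in isolation. In this made-up example the list `lines` stands in for a series of values, some of which are empty once cleaned.

```python
# Hypothetical input: some entries are blank once whitespace is stripped.
lines = ["id", "   ", "timestamp", "", "text"]

kept = []
for line in lines:
    # Assign the stripped value to `cleaned` and test it in the same expression.
    # An empty string is falsy, so blank entries are skipped.
    if cleaned := line.strip():
        kept.append(cleaned)

print(kept)
```

Without `:=` you would either call `line.strip()` twice or add a separate assignment line before the `if`.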


In [40]:
print(soup.html.body.mediawiki.page.revision.id)
# Notice how we can shorten it (and get the text directly): 
print(soup.revision.id.text)

<id>1079679373</id>
1079679373


BeautifulSoup allowed us to shorten the text to `soup.revision.id`. What if we tried `soup.id`? 

In [41]:
print(soup.id, soup.id.parent.name, sep="\n")

<id>5042916</id>
page


It's a different number since there are multiple `id` tags and this selected the first one (which was `page.id`, as we can discover through `soup.id.parent`). `revision.id`, on the other hand, uniquely indicates which revision.
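
The same ambiguity can be explored with the built-in ElementTree package mentioned earlier. The sketch below uses a pared-down, hand-written fragment shaped like the export. One simplifying assumption: the real export declares an XML namespace, which ElementTree would require you to include in tag names, so it is left out here.

```python
import xml.etree.ElementTree as ET

# A pared-down fragment shaped like the export (namespace omitted).
toy_export = """
<mediawiki>
  <page>
    <title>Canada</title>
    <id>5042916</id>
    <revision>
      <id>1079679373</id>
    </revision>
  </page>
</mediawiki>
"""

root = ET.fromstring(toy_export)

# ElementTree elements do not store their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

# List every <id> tag alongside the name of the tag that contains it.
id_report = [(parent_of[tag].tag, tag.text) for tag in root.iter("id")]
for parent_name, value in id_report:
    print(parent_name, value)

# Being explicit about the path avoids the ambiguity.
revision_id = root.find("page/revision/id").text
print(revision_id)
```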

## Using `xmltodict`

One clever approach to sidestepping XML is to turn the entirety of the XML document into a nested dictionary (much like parsed JSON), so then you can work with it like a dictionary. This uses the external `xmltodict` module, which you will have to install yourself. I know a lot of students have preferred this method to navigating XML in the past. The only thing to note (that tripped me up) is that in the resulting `dict` object, if a key has `@` at the beginning (e.g., `@href`), that means it was an attribute of a tag, whereas a key without `@` at the beginning (e.g., `revision`) is the tag itself, with the value being whatever was encased between the opening and closing tags.

The advantage of BeautifulSoup is that you get to query by the tags directly and you can avoid some of the issues with a nesting structure (for example by iterating through all the `<a>` tags). The advantage of `xmltodict` is that you can then use `json_normalize` (there's no equivalent `xml_normalize` to my knowledge) or `pd.DataFrame.from_dict()` to pipe the XML data into a DataFrame. This works well when the XML is an export of many similar entries, like many revisions of a page or many rows of data. The ideal approach will depend on the structure of the data. I have used both BeautifulSoup and `xmltodict` in recent times for different tasks. 
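
As a rough sketch of that second route, the dictionary below is written by hand to mimic the shape `xmltodict` produces for a hypothetical export with two revisions: repeated tags become a list, attributes get an `@` prefix, and the text of a tag that also has attributes is stored under `#text`. All names and values are invented for illustration.

```python
import pandas as pd

# Hand-written to mimic xmltodict output for a hypothetical export.
doc_like = {
    "page": {
        "title": "Example",
        "revision": [
            {"id": "1", "contributor": {"username": "ada"},
             "text": {"@bytes": "5", "#text": "hello"}},
            {"id": "2", "contributor": {"username": "grace"},
             "text": {"@bytes": "11", "#text": "hello world"}},
        ],
    }
}

# Each revision becomes a row; nested keys are joined with a dot.
revisions_df = pd.json_normalize(doc_like["page"]["revision"])
print(revisions_df.columns.tolist())
print(revisions_df)
```

Note that every value arrives as a string, so numeric columns such as `id` would still need converting before analysis.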

In [42]:
# You may need to install xmltodict.
# This code is extra careful to check for the right version of Python for
# installation. In fairness, `pip install <library>` usually works  fine.

try: 
    import xmltodict
except ModuleNotFoundError:
    import sys
    !{sys.executable} -m pip install xmltodict
    import xmltodict

In [43]:
with open(data_dir / "Canada - Wikipedia Special Export.xml") as infile:
    doc = xmltodict.parse(infile.read())

In [44]:
print(doc.keys())
print(doc['mediawiki'].keys())
print(doc['mediawiki']['@xmlns'])

odict_keys(['mediawiki'])
odict_keys(['@xmlns', '@xmlns:xsi', '@xsi:schemaLocation', '@version', '@xml:lang', 'siteinfo', 'page'])
http://www.mediawiki.org/xml/export-0.10/


Finding the revision of the page through this approach is very tedious since we don't have any shortcuts like `revision.id.text`. Instead we must navigate through the whole nested dictionary. This does not seem particularly useful here, but in Chapter \ref{ch:cleaning} we will see how easy it is to transform an entire XML export into JSON-like data and then load it into a DataFrame. 

In [45]:
print(doc['mediawiki']['page']['revision']['id'])

1079679373


And now, you can navigate to (https://en.wikipedia.org/w/index.php?title=Canada&oldid=1079679373) and see exactly the version of this page that we used for parsing. 

# Serialization

Sometimes, you want to close a program and pick up right where you left off. This might mean ensuring that all the objects are in the state that you want them to be with no further processing. This process of creating a file that will represent the state of some values is called _serialization_ (with a 'z' for American spelling). In Python this process is called pickling.  

Pickling is useful when you are processing text on a server: you can pickle the current state of each object as you go. Then if the program goes sour (for example, it loses connection to an external server), it can pickle all the variables marking your progress when it shuts down, so you can pick up where you left off after addressing any issues with data collection. You can only serialise one object at a time, but that object can be a collection of other objects. 

Since these files are meant for the computer, they will be written and read as _bytestreams_. In this case, we have to let Python know we are working with bytes by passing the `rb` or `wb` mode to `open()`, instead of the classic `r` for read and `w` for write.

In [46]:
import pickle

data_example = {'RevisionID':1079679373, 'PageID':5042916}
data_for_pickle = [data_example,'Other Data',3.1415]

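# Using `with` ensures each file is closed once we are done with it
with open(data_dir / 'temp.pkl', 'wb') as outfile:
    pickle.dump(data_for_pickle, outfile)

# Check to see if the data comes back as we expected
with open(data_dir / 'temp.pkl', 'rb') as infile:
    data_from_pkl = pickle.load(infile)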
print(data_from_pkl)

[{'RevisionID': 1079679373, 'PageID': 5042916}, 'Other Data', 3.1415]
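
The idea of pickling your progress so that you can resume after a failure can be sketched as follows. Everything here is hypothetical: the function `process`, the `fail_on` argument that simulates a lost connection, and the upper-casing that stands in for real work. A temporary folder is used so the sketch does not touch your data directory.

```python
import pickle
import tempfile
from pathlib import Path

checkpoint = Path(tempfile.mkdtemp()) / "progress.pkl"

def process(items, checkpoint_path, fail_on=None):
    """Process items, resuming from a pickled checkpoint if one exists."""
    if checkpoint_path.exists():
        with open(checkpoint_path, "rb") as infile:
            state = pickle.load(infile)
    else:
        state = {"results": {}}

    try:
        for item in items:
            if item in state["results"]:
                continue  # already handled in an earlier run
            if item == fail_on:
                raise ConnectionError(f"simulated failure on {item!r}")
            state["results"][item] = item.upper()  # stand-in for real work
    finally:
        # Save progress whether the loop finished or was interrupted.
        with open(checkpoint_path, "wb") as outfile:
            pickle.dump(state, outfile)
    return state

pages = ["canada", "mexico", "brazil"]

try:
    process(pages, checkpoint, fail_on="mexico")  # first run stops part-way
except ConnectionError as err:
    print("Interrupted:", err)

state = process(pages, checkpoint)  # second run picks up where it left off
print(state["results"])
```

The `finally` clause is the design choice that matters: it writes the checkpoint on both the normal and the failing path, so the second call only has to do the remaining work.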


## Long term storage: Pickles and feather 

Because pickles are so tightly coupled to the specific version of Python (and the libraries installed even), they are really handy for short term storage but too fragile for long term storage. Instead, one should use one of the file formats discussed above, such as CSV / XML / JSON or even Excel which has extensive support and care with backwards compatibility. 

If you find yourself with really demanding storage needs, you will probably want to seek out extra resources on this. One example would be to look into the `feather` package. It was co-written by the creator of pandas, Wes McKinney, and it is very fast, compact, and scalable to very large data.   

With the information you have here, implementing work in feather shouldn't be a challenge, especially with numerous online tutorials. Regardless of file type you choose, remember to check both writing the file to disk and re-reading it again before you put it away for a while.  
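
Here is a minimal sketch of such a round-trip check, using JSON and a temporary folder. The data is made up, and it is chosen deliberately to show why the check is worth doing: JSON object keys are always strings, so the integer keys do not survive the trip.

```python
import json
import tempfile
from pathlib import Path

records = {1: "first", 2: "second"}  # note the integer keys

out_path = Path(tempfile.mkdtemp()) / "records.json"

# Write the data to disk...
with open(out_path, "w", encoding="utf8") as outfile:
    json.dump(records, outfile)

# ...then read it straight back in.
with open(out_path, encoding="utf8") as infile:
    reloaded = json.load(infile)

# Compare before trusting the file.
same = reloaded == records
print(same)
print(reloaded)
```

A mismatch like this is far easier to deal with on the day you write the file than months later when you reopen it.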

# Summary

File formats might not be the most exciting topic, and certainly one that is often considered far away from traditional social science, or so one might think. In practice, I certainly remember in graduate school the trials of getting data for SPSS and having to convert it to Stata or SAS. Prior to the massive rise in the use of Python and R, quantitative data analysis in social science was almost exclusively done using programs that were for pay, syntactically unorthodox, and often incompatible with each other. By contrast, what we see here is that Python is free, that the file formats are not software specific, and that it is assumed that Python should be able to open the data. This is great news not just for data science, but for science. In general, we want science to be as open as possible. Obviously, some data must be restricted for reasons of privacy, but the norm now is towards being less locked into a single product or version. 

Social science often dreams of that perfect world in a potentially dangerous way. An emphasis on survey research and qualitative coding of transcripts implies that claims are made with data of a specific shape and size. We might be inclined to call this the 'independent case model'. It has each row as a case and columns to represent variables. It looks a lot like a DataFrame, and for good reason. The tabulation of cases allows for all kinds of statistical routines that otherwise are not as accessible or tractable. 

What I am suggesting here is that the process of transforming social life into this table does not have to happen as a part of data collection. The data can be collected from a variety of sources, in a variety of ways. Granted, independent random sample data collection is still an excellent way to make a generalisable claim. However, an emphasis on generalisability to a population bounded by national borders sometimes unduly restricts our ability to make claims about a specific population or group. Having all of the comments from a message board or all of the pictures posted in a forum means we can make very extensive claims about that board as a social system. Stated differently, we do not start by looking for places where life imitates the DataFrame and try to come up with questions to ask. Instead, we start with the data in the shape we can get it and then work on transforming it into a DataFrame.   

As we step outside of what we can do with survey research we will find that there's all kinds of ways of creating and managing data for the purposes of making claims. In the next chapter we begin to embark on this process of reshaping data so that it can meet our needs. 

# Extending and reflecting 

- Did you know that you can get JSON from a Reddit site simply by replacing `www.reddit.com` with `api.reddit.com`? Go to a particular subreddit and make that change, save the data and get it into a DataFrame. Try some early exploration like learning what are the columns, how to filter to data that has been upvoted, or even just cross check the data in the DataFrame with what you would see through the interface. For example, there are a lot of columns. How will you navigate them? Will you export to Excel and view in a spreadsheet? Print columns? Scroll through Jupyter? There are many approaches to data reduction here. One example might be to first check if a column has all missing values and remove it. 
- You can extract multiple Wikipedia pages at the same time from Wikipedia's special export. They will come down as XML. Here is a command to get the first thousand edits to the article "Data" from the terminal using `curl`. This is a commonly used application to collect data from the web: 

~~~ bash
# The URL is kept on one line; backslashes continue the command.
curl -d "" \
     'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Data&offset=1&action=submit' \
     -o "wiki_data_batch1.xml"
~~~

- If you run that from the Terminal (or on Windows PowerShell replacing `curl` with `Invoke-RestMethod`) you will get a relatively large XML file (around 12 MB) with the first thousand revisions of the article for "Data" on Wikipedia. If you can wrangle that XML and get it into a DataFrame, you can start to look at how the page changes over time. For example, when did DIKW appear? Was it always there? The skills for assessing the change over time will partially depend on some of the techniques in Chapter \ref{ch:time}, but at least being able to ask `df["text"].map(lambda x: "DIKW" in x)` should already start to give you some ideas. Try using `xmltodict` and `pd.json_normalize` to start your exploration. Some example code related to this is available on the course GitHub page.