# Web Scraping the MS Word Version History Table with Python
Addie Olson

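First, import the necessary packages: BeautifulSoup, pandas, and Requests. Install them beforehand if they are not already available, along with the lxml parser that BeautifulSoup is asked to use below.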

In [2]:
from bs4 import BeautifulSoup

In [3]:
import requests
import pandas as pd

Next, pull the HTML from the webpage and identify the tables on the page that carry the `wikitable` class.

In [4]:
WIKI_URL = "https://en.wikipedia.org/wiki/Microsoft_Word"

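# Fetch the page and parse it with the lxml parser
req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
# Select every table whose class list includes "wikitable"
table_classes = {"class": ["wikitable"]}
wikitables = soup.find_all("table", table_classes)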


print(wikitables)

[<table class="templateVersion t wikitable" style="text-align: center;">
<tbody><tr>
<th class="hintergrundfarbe5 t" style="padding: 0.5ex 0.75em; text-align: left;">Legend:
</th>
<td style="background-color: #FDB3AB; padding: 0.5ex 0.75em;" title="Old version, no longer supported">Old version
</td>
<td style="background-color: #FEF8C6; padding: 0.5ex 0.75em;" title="Older version, yet still supported">Older version, still supported
</td>
<td style="background-color: #D4F4B4; padding: 0.5ex 0.75em;" title="Current stable version"><b>Current stable version</b>
</td>
<td style="background-color: #FED1A0; padding: 0.5ex 0.75em; display: none;" title="Latest preview version of a future release">Latest preview version
</td>
<td style="background-color: #C1E6F5; padding: 0.5ex 0.75em; display: none;" title="Future release">Future release
</td></tr></tbody></table>, <table class="wikitable sortable">
<caption>Microsoft Word for Windows release history
</caption>
<tbody><tr>
<th>Year Released
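
The next step picks the release-history table by position (`wikitables[1]`). One possible alternative is to match the table by its caption, which would keep working if tables are added to or removed from the page. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is a small inline stand-in for the live page, `find_table_by_caption` is a hypothetical helper name, and `html.parser` is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Small stand-in for the real page (assumption: same class names and
# caption text as in the printed output above).
SAMPLE_HTML = """
<table class="templateVersion t wikitable"><tbody><tr><th>Legend:</th></tr></tbody></table>
<table class="wikitable sortable">
<caption>Microsoft Word for Windows release history
</caption>
<tbody><tr><th>Year Released</th><th>Name</th></tr>
<tr><td>1989</td><td>Word for Windows 1.0</td></tr></tbody></table>
"""


def find_table_by_caption(soup, caption_text):
    """Return the first wikitable whose caption contains caption_text, else None."""
    for table in soup.find_all("table", class_="wikitable"):
        caption = table.find("caption")
        if caption is not None and caption_text in caption.get_text():
            return table
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
release_table = find_table_by_caption(soup, "Word for Windows release history")
print(len(release_table.find_all("tr")))  # number of rows in the matched table
```

The helper returns `None` when no caption matches, so a caller can fail with a clear message instead of silently scraping the wrong table.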


Build the DataFrame, taking only the text of each cell (which drops link markup) and trimming footnote markers.

In [5]:
cells = []
for row in wikitables[1].find_all("tr"):
    cells.append(row.find_all(["th","td"]))

In [6]:
x = []
for item in cells:
    x.append(item)

In [10]:
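main = []
for item in cells:
    row = []
    for cell in item:
        text = str(cell.text)
        # Rough footnote trim: if the cell contains '[', drop its last three
        # characters. Longer markers, or markers in the middle of the text,
        # are left partly behind.
        if '[' in text:
            text = text[:-3]
        text = text.strip('\n')
        row.append(text)
    main.append(row)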
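
The `text[:-3]` trim removes a fixed number of characters, so it leaves fragments such as `[citation neede` in the DataFrame output. A regular expression is one way to remove whole bracketed markers wherever they occur. This is a sketch rather than the notebook's own method; `FOOTNOTE_RE` and `clean_cell` are hypothetical names, and the pattern would also remove any legitimate bracketed text in a cell.

```python
import re

# Matches bracketed markers such as [88], [a], or [citation needed].
FOOTNOTE_RE = re.compile(r"\[[^\[\]]*\]")


def clean_cell(text):
    """Remove bracketed footnote markers, then strip surrounding whitespace."""
    return FOOTNOTE_RE.sub("", text).strip()


print(clean_cell("Code-named Opus[citation needed]\n"))
print(clean_cell("For Windows 3.0.[88] Code-named Bill the Cat\n"))
```

In the loop above, `clean_cell(cell.text)` could stand in for the `if '[' in text` branch and the `strip` call.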

In [11]:
df = pd.DataFrame(main)

In [12]:
df

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Year Released | Name | Version | Comments |
| 1 | 1989 | Word for Windows 1.0 | Old version, no longer supported: 1.0 | Code-named Opus[citation neede |
| 2 | 1990 | Word for Windows 1.1 | Old version, no longer supported: 1.1 | For Windows 3.0.[88] Code-named Bill the Cat[c... |
| 3 | 1990 | Word for Windows 1.1a | Old version, no longer supported: 1.1a | On March 25, 2014 Microsoft made the source co... |
| 4 | 1991 | Word for Windows 2.0 | Old version, no longer supported: 2.0 | Code-named Spaceman Spiff[citation needed]. In... |
| 5 | 1993 | Word for Windows 6.0 | Old version, no longer supported: 6.0 | Code-named T3[citation needed] (renumbered 6 t... |
| 6 | 1995 | Word for Windows 95 | Old version, no longer supported: 7.0 | Included in Office 95 |
| 7 | 1997 | Word 97 | Old version, no longer supported: 8.0 | Included in Office 97 |
| 8 | 1998 | Word 98 | Old version, no longer supported: 8.5 | Included in Office 97 Powered By Word 98, whic... |
| 9 | 1999 | Word 2000 | Old version, no longer supported: 9.0 | Included in Office 2000 |
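
In this output the table's header row ("Year Released", "Name", ...) ends up as data row 0, under the default integer column labels. One way to promote it to real column names is to split the first row off before constructing the DataFrame. The sketch below assumes a short hand-written `sample_main` list shaped like `main`; both `sample_main` and `sample_df` are illustrative names.

```python
import pandas as pd

# Sample rows shaped like `main`: the first row holds the header cells.
sample_main = [
    ["Year Released", "Name", "Version", "Comments"],
    ["1989", "Word for Windows 1.0", "Old version, no longer supported: 1.0", "Code-named Opus"],
    ["1995", "Word for Windows 95", "Old version, no longer supported: 7.0", "Included in Office 95"],
]

# Split off the header row and use it as the column labels.
header, *rows = sample_main
sample_df = pd.DataFrame(rows, columns=header)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

With named columns, later steps can refer to `sample_df["Version"]` instead of a positional label such as `2`.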


Export the DataFrame to a CSV file.

In [13]:
df.to_csv('word_versions.csv', sep=',')
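
By default `to_csv` also writes the row index as an unlabeled first column, which shows up as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` omits it. The following sketch assumes a small illustrative `sample_df` and writes to an in-memory buffer rather than to `word_versions.csv`, so it leaves no file on disk.

```python
import io

import pandas as pd

# Small illustrative frame (assumption: header row already used as column names).
sample_df = pd.DataFrame(
    [["1989", "Word for Windows 1.0"], ["1990", "Word for Windows 1.1"]],
    columns=["Year Released", "Name"],
)

# Write without the index column.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the same columns and no "Unnamed: 0".
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.columns.tolist())
```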