# Web Scraping the Adobe Photoshop Version History Table with Python
Addie Olson


First, import the necessary packages: BeautifulSoup, pandas, and Requests. They must already be installed, along with the `lxml` parser that BeautifulSoup uses below.

In [38]:
from bs4 import BeautifulSoup

In [39]:
import requests
import pandas as pd

Next, fetch the page's HTML and collect every table that carries the `wikitable` class.

In [40]:
WIKI_URL = "https://en.wikipedia.org/wiki/Adobe_Photoshop_version_history"

req = requests.get(WIKI_URL)
req.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(req.content, 'lxml')
table_classes = {"class": ["wikitable"]}
wikitables = soup.find_all("table", table_classes)
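
To see what the class filter selects, here is a minimal sketch that runs on an inline HTML snippet instead of the live page. The snippet and the `soup_demo` and `matches` names are made up for illustration, and it uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: two tables, only one with the "wikitable" class.
html = """
<table class="wikitable sortable"><tr><th>Version</th></tr><tr><td>1.0</td></tr></table>
<table class="infobox"><tr><td>Not a data table</td></tr></table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# The filter matches any table whose class list contains "wikitable",
# even when other classes (here "sortable") are present as well.
matches = soup_demo.find_all("table", {"class": ["wikitable"]})

print(len(matches))                # 1
print(matches[0].find("th").text)  # Version
```

Indexing the result, as in `wikitables[0]`, then picks the first matching table on the page.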




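Build the grid of cell values for the DataFrame, stripping footnote markers from the cell text and copying each cell that spans several rows or columns into every slot it covers.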

In [41]:
import re

rows = wikitables[0].find_all("tr")
# Width of the widest row, counting each cell's colspan.
ncols = max(sum(int(c.get("colspan", 1)) for c in r.find_all(["th", "td"])) for r in rows)
nrows = len(rows)

# Pre-fill an nrows x ncols grid of empty strings.
data = [[""] * ncols for _ in range(nrows)]

# Process the HTML row by row.
for i, row in enumerate(rows):
    cells = row.find_all(["td", "th"])
    for j, cell in enumerate(cells):
        # Work on the cell's text and strip footnote markers such as "[1]".
        text = re.sub(r"\[[^\]]*\]", "", cell.get_text()).strip("\n")

        # Many cells span several columns or rows, so fill every slot they cover.
        cspan = int(cell.get("colspan", 1))
        rspan = int(cell.get("rowspan", 1))
        l = 0
        for k in range(rspan):
            if i + k >= nrows:
                break  # the rowspan runs past the last row
            # Shift to the first empty cell of this row.
            while j + l < ncols and data[i + k][j + l]:
                l += 1
            for m in range(cspan):
                if j + l + m < ncols:
                    data[i + k][j + l + m] += text
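
The span handling is easier to follow on a toy input. The sketch below is not the notebook's code. It is a simplified stand-in, with a hypothetical `expand_spans` helper, that takes each row as a list of `(text, rowspan, colspan)` tuples instead of parsed HTML tags. It also tracks occupied slots explicitly, so an empty cell still holds its place in the grid. The sample values are illustrative.

```python
def expand_spans(rows):
    """Expand (text, rowspan, colspan) cells into a rectangular grid of strings."""
    grid = {}  # (row, col) -> text for every occupied slot
    for i, cells in enumerate(rows):
        col = 0
        for text, rspan, cspan in cells:
            # Skip slots already claimed by a rowspan from an earlier row.
            while (i, col) in grid:
                col += 1
            # Copy the text into every slot the cell covers.
            for k in range(rspan):
                for m in range(cspan):
                    grid[(i + k, col + m)] = text
            col += cspan
    ncols = max((c for _, c in grid), default=-1) + 1
    # Slots below the last input row are dropped; uncovered slots become "".
    return [[grid.get((r, c), "") for c in range(ncols)] for r in range(len(rows))]


toy_rows = [
    [("Version", 1, 1), ("Platform", 1, 1), ("Notes", 1, 1)],
    [("2.5", 2, 1), ("Macintosh", 1, 1), ("16-bit support", 2, 1)],
    [("Windows", 1, 1)],  # "2.5" and the note carry over from the row above
]

for line in expand_spans(toy_rows):
    print(line)
```

The third row comes out as `['2.5', 'Windows', '16-bit support']`, because the two row-spanning cells are repeated and the single explicit cell lands in the only free column.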



In [42]:
df = pd.DataFrame(data)

In [43]:
df

Unnamed: 0,0,1,2,3,4
0,Version,Platform,Codename,Release date,Notes and significant changes
1,0.07,Macintosh,Bond,January 1988,Not publicly released - This demo was the firs...
2,0.63,Macintosh,October 1988,,
3,0.87\n0.87,Macintosh,Seurat,March 1989,First version distributed commercially (by the...
4,1.0\n1.0,Macintosh,February 1990,Last release for System 6.0.3\nIn February 201...,
5,2.0\n2.0,Macintosh,Fast Eddy,June 1991,Paths\nCMYK Color\nEPS Rasterization\nLast rel...
6,2.5,Macintosh,Merlin,November 1992,"16 bit per channel support\n""Deluxe"" edition a..."
7,2.5,Windows,Brimstone,November 1992,"16 bit per channel support\n""Deluxe"" edition a..."
8,2.5,"IRIX, Solaris",November 1993,,"16 bit per channel support\n""Deluxe"" edition a..."
9,3.0\n3.0,Macintosh,Tiger Mountain,September 1994,Tabbed Palettes\nLayers\nLast release for Wind...
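
In the output above, the columns are labelled 0 to 4 and the table's real header sits in row 0. One possible follow-up, sketched here on a small hypothetical `data_demo` list rather than the scraped grid, is to promote that first row to column names when building the DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the scraped grid: header row first, then data rows.
data_demo = [
    ["Version", "Platform", "Codename", "Release date"],
    ["0.07", "Macintosh", "Bond", "January 1988"],
    ["0.87", "Macintosh", "Seurat", "March 1989"],
]

# Use the first row as the header and the remaining rows as the body.
df_demo = pd.DataFrame(data_demo[1:], columns=data_demo[0])

print(list(df_demo.columns))  # ['Version', 'Platform', 'Codename', 'Release date']
print(df_demo.shape)          # (2, 4)
```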


Export the DataFrame to a CSV file.

In [44]:
df.to_csv('photoshop_versions.csv', sep=',')
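
By default, `to_csv` also writes the row index as an extra first column with no header, which pandas labels `Unnamed: 0` when the file is read back. The sketch below shows the difference that `index=False` makes. It uses a tiny hypothetical `df_demo` and in-memory buffers instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical two-row frame with default integer column labels.
df_demo = pd.DataFrame([["0.07", "Macintosh"], ["0.87", "Macintosh"]])

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,0,1
print(without_index.getvalue().splitlines()[0])  # 0,1

# Reading the default output back shows how pandas names the index column.
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(reloaded.columns[0])                       # Unnamed: 0
```

Whether to keep the index depends on how the file will be used. Dropping it avoids the extra column when the CSV is loaded again.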