# Capstone Data Scrape
Ben Katz

benkatz21@gmail.com

BrainStation Data Science Boot Camp

4/1/2022

## Introduction

In this notebook I will be using *Beautiful Soup* to scrape the market values and demographic information from https://www.transfermarkt.us/. The information scraped will be used in my BrainStation Data Science boot camp capstone project. The goal of this project is to model the transfer market values of European soccer players based on various soccer statistics.

*Cristiano Ronaldo*, one of the two most famous soccer players in the world, was recently bought by *Manchester United* from the Italian club *Juventus* for **23 million euros**. This means that *Manchester United* bought the right for Cristiano Ronaldo to play for their side. This is an example of a **transfer**. The **transfer market** is the aggregate of all **transfers**: the players being transferred and the teams selling and buying them. https://www.transfermarkt.us/ has **transfer market valuations** for all players, which approximate how much a team should pay for a player. These values represent the target variable for my analysis.

## Important Information About the Data I am Scraping

### What I am Scraping

I will be scraping the following information for each player. Examples of the links I will be scraping from are listed under *Link Examples* below.

**Name**

**Position**

**Age**

**Club**

**League**

**Market Value**

### Where the Players are From

The information will include players from Europe's big five soccer leagues: 
* *English Premier League* 

* *German Bundesliga*

* *Italian Serie A*

* *Spanish La Liga* 

* *French Ligue 1*. 

These leagues feature the best players in the world and will therefore provide the best data. They also provide a good baseline against which players in other leagues across Europe can be compared.

### Breakdown of Players by Position

I will be scraping the following players for each league.

* The top 100 attackers by market values from each league (500 attackers in total)
* The top 25 *attacking midfielders* by market value from each league (125 *attacking midfielders* in total)
* The top 25 *left midfielders* by market values from each league (125 *left midfielders* in total)
* The top 25 *right midfielders* by market values from each league (125 *right midfielders* in total)
* The top 50 *center midfielders* by market value from each league (250 *center midfielders* in total)
* The top 25 *defensive midfielders* by market values from each league (125 *defensive midfielders* in total)
* The top 50 *center backs* by market value from each league (250 *center backs* in total)
* The top 25 *left backs* by market values from each league (125 *left backs* in total)
* The top 25 *right backs* by market values from each league (125 *right backs* in total)
* The top 25 *goalkeepers* by market value from each league (125 *goalkeepers* in total)

I am taking the top 100 attackers regardless of position because many attackers play across all attacking positions. Many *strikers* and *center forwards* can play on the wings, and many *left and right wingers* sometimes play striker. I expect the numbers for each attacking position to come out roughly even after the scrape.

I am pulling 50 *center backs* and 50 *center midfielders* per league, twice as many as for every other position, because every club fields at least two players in each of these positions at the same time, whereas the other positions have only one player on the field. Right backs and left backs play the same role on opposite sides of the field, so 125 of each is equivalent to the 250 center backs. The same can be said for the midfield positions.

### Link Examples

Below are examples of some of the web pages I will be scraping data from. 


Bundesliga Keepers - https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/plus//galerie/0?pos=&detailpos=1&altersklasse=alle

Bundesliga Center Backs - https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/plus//galerie/0?pos=&detailpos=3&altersklasse=alle

Ligue 1 Center Backs - https://www.transfermarkt.us/ligue-1/marktwerte/wettbewerb/FR1/plus//galerie/0?pos=&detailpos=3&altersklasse=alle

La Liga Left Wingers - https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/plus//galerie/0?pos=&detailpos=11&altersklasse=alle

**Differences on links above**

These differences are important to note so that I can use for loops to create the links for the pages I will be scraping. This will be easier and less time-consuming than copying and pasting all of the necessary links into a list.

* League codes after us/
    * e.g. bundesliga and ligue-1
* Country codes after wettbewerb/
    * e.g. FR1 and ES1
* Position codes at the end of the link (detailpos=)
    * Numbered in the order the positions are listed in the transfermarkt.us drop-down menu
* Page number
    * 25 players per page
        * 1 or 2 pages needed, depending on position
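The differences listed above can be sketched as a URL template in which only the league code, country code and position code vary. This is a minimal illustration built from the example links; the names `TEMPLATE` and `position_url` are hypothetical, and the page number is left out because the example links do not show how it is encoded for these pages.

```python
# Fixed parts of the position-page URL, copied from the example links above.
BASE = "https://www.transfermarkt.us"
TEMPLATE = (
    "{base}/{league}/marktwerte/wettbewerb/{country}"
    "/plus//galerie/0?pos=&detailpos={position}&altersklasse=alle"
)


def position_url(league, country, position):
    """Fill the three varying parts into the fixed URL skeleton (hypothetical helper)."""
    return TEMPLATE.format(base=BASE, league=league, country=country, position=position)


print(position_url("bundesliga", "L1", 1))  # Bundesliga keepers
print(position_url("ligue-1", "FR1", 3))    # Ligue 1 center backs
```

Keeping the fixed skeleton in one place means a change to the site's URL layout would only need to be corrected once.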
    
  
Top 100 Attackers (Wingers, Strikers and Center Forwards)

La Liga Strikers (Page 1)

Premier League Strikers (Page 2) - 
https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/2

* Country codes and league codes are the same as above
* Page number ranges from 1 to 4 (25 players per page, 100 attackers per league)

# Preliminary Scrape

This is the link for the top 25 Premier League attackers by market value.

https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/1

I am going to use this in my preliminary look into the code for the web pages I'll be scraping. I am also going to use this page as the test page for scraping.  

To do this I will need to:

1. Create a header dictionary to avoid potential access issues. This is not strictly necessary, but in my research I found that it is better to set one up in case it is needed.
2. Assign the link to a variable.
3. Create the page tree with requests.
4. Create a Beautiful Soup object to scrape from.

In [2]:
#importing necessary packages
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

#creating header dictionary to bypass potential access issues. 
headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

#assigning link to variable
prem_attackers_page = 'https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/1'

#creating page tree
prem_attackers_pagetree = requests.get(prem_attackers_page, headers = headers)

#creating Beautiful Soup
prem_attackers_soup = BeautifulSoup(prem_attackers_pagetree.content, 'html.parser')

In [4]:
#examining soup
prem_attackers_soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<script type="text/javascript">
    !function () { var e = function () { var e, t = "__tcfapiLocator", a = [], n = window; for (; n;) { try { if (n.frames[t]) { e = n; break } } catch (e) { } if (n === window.top) break; n = n.parent } e || (!function e() { var a = n.document, r = !!n.frames[t]; if (!r) if (a.body) { var i = a.createElement("iframe"); i.style.cssText = "display:none", i.name = t, a.body.appendChild(i) } else setTimeout(e, 5); return !r }(), n.__tcfapi = function () { for (var e, t = arguments.length, n = new Array(t), r = 0; r < t; r++)n[r] = arguments[r]; if (!n.length) return a; if ("setGdprApplies" === n[0]) n.length > 3 && 2 === parseInt(n[1], 10) && "boolean" == typeof n[3] && (e = n[3], "function" == typeof n[2] && n[2]("set", !0)); else if ("ping" === n[0]) { var i = { gdprApplies: e, cmpLoaded: !1, cmpStatus: "stub" }; "function" == typeof n[2] && n[2](i) } else a.push(n) }, n.addEventListener("message", (f

Again, the following information needs to be pulled from the Beautiful Soup above.
1) Names 

2) Ages

3) Nationality

4) Last Club

5) Market Value

6) Position

In Beautiful Soup, you can use the *find* or *find_all* methods to find HTML tags and classes. I will be using these to locate the tags and classes for the information above.

All of the tags and classes were found by using *Command F* to search through the soup and identify what to pass to the *find_all* method. These tags and classes are the same for all the players on the page, which I confirmed by scrolling through the soup above.

Once I pull the tags for each of these categories, I will use the string split method to remove everything but the desired information. The split method splits the string at the specified separator and returns a list of the pieces; when the separator occurs once, that is a list with 2 elements, one with everything before the separator and one with everything after it. I will have to use split twice: once to remove everything before the desired portion of the string, and once to remove everything after it.
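As a small illustration of the two-step split, here is a sketch on a single shortened tag. The string is modelled on the player image tags in the soup, with most attributes trimmed for brevity, and the variable names are only for this example.

```python
# A shortened tag of the kind returned by find_all (attributes trimmed for brevity).
tag_text = '<img alt="Marcus Rashford" class="bilderrahmen-fixed lazy lazy"/>'

# First split: keep everything before '" class='.
before_class = tag_text.split('" class=')[0]

# Second split: keep everything after 'alt="'.
name = before_class.split('alt="')[1]

print(name)
```

The first split discards the tail of the tag and the second discards its head, leaving only the text in between.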

First I will extract the player names.

##  Player names

In [16]:
#pulling player names
player_names = prem_attackers_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
player_names

[,
 ,
 ,
 <img alt="Marcus Rashford" class="bilderrahmen-fixed lazy lazy" data-src="https://img.a.transfermarkt.technology/portrait/medium/258923-1565603308.png?lm=1" src="data:image/gif;base64,R0lGODlhAQAB

Looking at the first element in the **player_names** list, I need to cut out everything before and after *Romelu Lukaku*. I can do this for every player with a for loop, iterating through each element in the list and using the split method as described above. I will append each player name to an empty list instantiated before the for loop.

In [19]:
#extracting all the player names
#list with full code lines featuring player names
player_names = prem_attackers_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})

#empty list to append names to 
player_list = []

#stripping everything but the names using split function
for i in range(0, len(player_names)):
    player_list.append(str(player_names[i]).split('" class=')[0].split('alt="')[1])
player_list

['Romelu Lukaku',
 'Mohamed Salah',
 'Harry Kane',
 'Marcus Rashford',
 'Jadon Sancho',
 'Raheem Sterling',
 'Heung-min Son',
 'Sadio Mané',
 'Jack Grealish',
 'Gabriel Jesus',
 'Diogo Jota',
 'Richarlison',
 'Mason Greenwood',
 'Timo Werner',
 'Dominic Calvert-Lewin',
 'Christian Pulisic',
 'Luis Díaz',
 'Wilfried Zaha',
 'Raphinha',
 'Riyad Mahrez',
 'Roberto Firmino',
 'Ollie Watkins',
 'Cristiano Ronaldo',
 'Jarrod Bowen',
 'Emiliano Buendía']

## Player Ages
Names have been successfully scraped. Now I can extract the player ages. I am going to keep a checklist for organizational purposes.

1) Names - __Check__ 

2) Ages

3) Nationality

4) Last Club

5) Market Value

6) Position

First I need to look through the soup to see which tags are needed to extract player ages. 

In [20]:
#looking at soup
prem_attackers_soup

Looking through the soup, I see that each player's club, nationality and age are all under the same tag and class. I can pull this tag and class using the find_all method, and then extract each one individually.

In [23]:
#scraping club/nation/age tags
prem_attackers_club_nation_age = prem_attackers_soup.find_all('td', {'class':'zentriert'})

#looking at list
prem_attackers_club_nation_age

[<td class="zentriert">1</td>,
 <td class="zentriert"><img alt="Belgium" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/19.png?lm=1520611569" title="Belgium"/><br/><img alt="DR Congo" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/193.png?lm=1520611569" title="DR Congo"/></td>,
 <td class="zentriert">28</td>,
 <td class="zentriert"><a href="/fc-chelsea/startseite/verein/631/saison_id/2021" title="Chelsea FC"><img alt="Chelsea FC" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/631.png?lm=1628160548" title=" "/></a></td>,
 <td class="zentriert">2</td>,
 <td class="zentriert"><img alt="Egypt" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/2.png?lm=1520611569" title="Egypt"/></td>,
 <td class="zentriert">29</td>,
 <td class="zentriert"><a href="/fc-liverpool/startseite/verein/31/saison_id/2021" title="Liverpool FC"><img alt="Liverpool FC" class="" src="https://tmssl.akamaized.

Now I can pull the ages. The process is similar to the one used to extract player names. One difference for the club, nationality and age scrapes is that I need to take only the elements of the list above that contain the relevant piece of information. For example, the ages start at the 3rd element of the list (index 2) and occupy every fourth element after that. The page also contains an extra table of updated player values, which uses the same tag and class but is not pertinent to this project, so I only need the first 25 ages in the list.
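The slicing idea can be sketched on a toy list before applying it to the real cells. The list below is a simplified stand-in for the `zentriert` cells (plain strings instead of tags, three players instead of 25), and the trailing entries stand in for the unrelated table.

```python
# Toy stand-in for the list of <td class="zentriert"> cells: each player
# contributes four cells in the order rank, nationality, age, club.
cells = [
    "1", "Belgium", "28", "Chelsea FC",
    "2", "Egypt", "29", "Liverpool FC",
    "3", "England", "28", "Tottenham Hotspur",
    "extra-1", "extra-2", "extra-3", "extra-4",  # stands in for the unrelated table
]

n_players = 3  # 25 on the real page

# start::4 picks every fourth cell from a given offset; [:n_players] drops the extras.
nationalities = cells[1::4][:n_players]
ages = cells[2::4][:n_players]
clubs = cells[3::4][:n_players]

print(ages)
```

The same three offsets (1, 2 and 3) are the ones used for nationality, age and club in the cells that follow.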

In [27]:
#pulling ages - starting with 3rd element and taking every 4th after that
attacker_ages = prem_attackers_club_nation_age[2::4]

#taking the first 25
attacker_ages = attacker_ages[:25]
#checking length
print(len(attacker_ages))

#instantiating empty list
player_ages = []

#looping through and extracting
for age_index in range(0, len(attacker_ages)):
    player_ages.append(str(attacker_ages[age_index]).split('</td>')[0].split('"zentriert">')[1])

#checking list
player_ages

25


['28',
 '29',
 '28',
 '24',
 '21',
 '27',
 '29',
 '29',
 '26',
 '24',
 '25',
 '24',
 '20',
 '25',
 '24',
 '23',
 '25',
 '29',
 '25',
 '31',
 '30',
 '26',
 '37',
 '25',
 '25']

Updating Checklist
1) Names - __Check__ 

2) Ages - __Check__

3) Nationality

4) Last Club

5) Market Value

6) Position

## Extracting Nationalities

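The process for extracting nationalities is very similar to that for player ages. The only difference is that nationalities start at the 2nd element of the list (index 1); they again appear every fourth element after that. Where a player has two nationalities listed, only the first is kept, because the split stops at the first flag image.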

In [28]:
#pulling up soup to find where to split.
prem_attackers_club_nation_age

In [33]:
#pulling nationalities
#separating nationalities from the rest of the list
attacker_nationalities = prem_attackers_club_nation_age[1::4]
#checking length
print(len(attacker_nationalities))
#pulling first 25
attacker_nationalities = attacker_nationalities[:25]
#instantiating empty list
player_nationalities = []
#scraping out player nationalities
for nation_index in range(0, len(attacker_nationalities)):
    nationality = str(attacker_nationalities[nation_index]).split('" class="flaggenrahmen"', 1)[0].split('alt="', 1)[1]
    player_nationalities.append(nationality)
    
print(len(player_nationalities))
player_nationalities

28
25


['Belgium',
 'Egypt',
 'England',
 'England',
 'England',
 'England',
 'Korea, South',
 'Senegal',
 'England',
 'Brazil',
 'Portugal',
 'Brazil',
 'England',
 'Germany',
 'England',
 'United States',
 'Colombia',
 "Cote d'Ivoire",
 'Brazil',
 'Algeria',
 'Brazil',
 'England',
 'Portugal',
 'England',
 'Argentina']

Premier League Striker Page 1 Checklist

1. Names - Check
2. Ages - Check
3. Nationality - Check
4. Last Club
5. Market Value
6. Position

## Extracting Last Club
This is the last piece of information from the club/nationality/age list. The process is exactly the same as in the previous two scrapes, except that the clubs start at the 4th element of the list (index 3).

In [34]:
#looking at soup to find where to scrape
prem_attackers_club_nation_age

In [41]:
#pulling last club
attacker_last_club = prem_attackers_club_nation_age[3::4]
#length is 27 but should be 25
#have to remove two because of the table on the website with similar tags at the end of the soup above
attacker_last_club = attacker_last_club[:25]
#checking length
print(len(attacker_last_club))

#instantiating empty list
player_last_clubs =[]

#scraping last clubs
for last_club_index in range(0, len(attacker_last_club)):
    last_club = str(attacker_last_club[last_club_index]).split('" class=""',1)[0].split('img alt="', 1)[1]
    player_last_clubs.append(last_club)


#checking length and clubs
print(len(player_last_clubs))
player_last_clubs

25
25


['Chelsea FC',
 'Liverpool FC',
 'Tottenham Hotspur',
 'Manchester United',
 'Manchester United',
 'Manchester City',
 'Tottenham Hotspur',
 'Liverpool FC',
 'Manchester City',
 'Manchester City',
 'Liverpool FC',
 'Everton FC',
 'Manchester United',
 'Chelsea FC',
 'Everton FC',
 'Chelsea FC',
 'Liverpool FC',
 'Crystal Palace',
 'Leeds United',
 'Manchester City',
 'Liverpool FC',
 'Aston Villa',
 'Manchester United',
 'West Ham United',
 'Aston Villa']

Premier League Attacker Page 1 Check List
1. Names - __Check__

2. Ages - __Check__

3. Nationality - __Check__

4. Last Club - __Check__

5. Market Value

6. Position

## Extracting Market Values

This process is similar to extracting the names. I will find the tags and classes to isolate the elements with market values, then use a for loop and the split method to extract the values themselves.

In [42]:
#looking at soup to find market value classes
prem_attackers_soup

In [52]:
#looking at market value tags
attacker_market_value = prem_attackers_soup.find_all('td', {'class':"rechts hauptlink"})
attacker_market_value

[<td class="rechts hauptlink"><a href="/romelu-lukaku/marktwertverlauf/spieler/96341">$110.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/mohamed-salah/marktwertverlauf/spieler/148455">$110.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/harry-kane/marktwertverlauf/spieler/132098">$110.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/marcus-rashford/marktwertverlauf/spieler/258923">$93.50m</a> </td>,
 <td class="rechts hauptlink"><a href="/jadon-sancho/marktwertverlauf/spieler/401173">$93.50m</a> </td>,
 <td class="rechts hauptlink"><a href="/raheem-sterling/marktwertverlauf/spieler/134425">$93.50m</a> </td>,
 <td class="rechts hauptlink"><a href="/heung-min-son/marktwertverlauf/spieler/91845">$88.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/sadio-mane/marktwertverlauf/spieler/200512">$88.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/jack-grealish/marktwertverlauf/spieler/203460">$88.00m</a> </td>,
 <td class="rechts hauptlink"><a href="/gabriel-

In [57]:
#creating list of element with values
attacker_market_value = prem_attackers_soup.find_all('td', {'class':"rechts hauptlink"})
#checking length
print(len(attacker_market_value)) #30 need only the first 25
#pulling first 25
attacker_market_value = attacker_market_value[:25]
#checking length - should be 25
print(len(attacker_market_value)) 

#initializing empty list
player_market_values = []

#extract market values
for market_value_index in range(0, len(attacker_market_value)):
    market_value = str(attacker_market_value[market_value_index]).split('$',1)[1].split('m',1)[0]
    player_market_values.append(market_value)

#check steps
print(len(player_market_values))
player_market_values

25


['110.00',
 '110.00',
 '110.00',
 '93.50',
 '93.50',
 '93.50',
 '88.00',
 '88.00',
 '88.00',
 '66.00',
 '66.00',
 '60.50',
 '55.00',
 '55.00',
 '49.50',
 '49.50',
 '44.00',
 '44.00',
 '44.00',
 '44.00',
 '41.80',
 '38.50',
 '38.50',
 '38.50',
 '38.50']
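The extracted values are still strings. For modelling they will eventually need to be numeric, so here is a minimal sketch of one way to convert the raw cell text to a float. It assumes every value is quoted in millions with a `$` prefix and an `m` suffix, as in the output above; the helper name `parse_market_value` is hypothetical, and values in any other format are rejected rather than guessed at.

```python
def parse_market_value(text):
    """Convert a string such as '$93.50m' to a float in millions (hypothetical helper)."""
    text = text.strip()
    # Assumption: values always look like '$<number>m'; anything else is rejected.
    if not (text.startswith("$") and text.endswith("m")):
        raise ValueError(f"unexpected market value format: {text!r}")
    return float(text[1:-1])


values = [parse_market_value(v) for v in ["$110.00m", "$93.50m", "$38.50m"]]
print(values)
```

Raising an error on an unexpected format makes it obvious if a lower-valued player is quoted in a different unit, instead of silently producing a wrong number.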

Premier League Attacker Page 1 Check List
1) Names - __Check__

2) Ages - __Check__

3) Nationality - __Check__

4) Last Club - __Check__

5) Market Value - __Check__

6) Position

## Extracting Positions

This is the final extraction of the test scrape. The process is described in more detail below, as it is a little more involved.

In [59]:
#now we need positions
prem_attackers_soup

The tag that position falls under contains many different elements, and only some of them contain positions. To separate out the elements that contain positional information, I will use the regular expression function re.search. I will iterate through all of the tags; if an element contains **Winger**, **Forward** or **Striker** anywhere, re.search returns a match object, otherwise it returns *None*. If all three searches return *None*, the element is skipped and the loop moves on to the next element. Otherwise the element contains information about a player's position, and it is appended to a list to be looped through and stripped later.

**Note:** The position appears in two elements per player, so I will only take every other element, starting at the 2nd element of the list.
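The filtering step can be sketched on a few toy strings. This version folds the three searches into one pattern using alternation, which should behave the same as running three separate searches; the sample strings are simplified stand-ins for `str(item)` on real `td` elements.

```python
import re

# One pattern with alternation matches any of the three attacking position words.
position_pattern = re.compile(r"Winger|Forward|Striker")

# Toy stand-ins for str(item) on a few <td> elements.
cells = [
    "<td>Left Winger</td>",
    '<td class="zentriert">28</td>',
    "<td>Centre-Forward</td>",
    "<td>Second Striker</td>",
]

# search returns a match object on success and None otherwise,
# so the result can be used directly as the filter condition.
position_cells = [cell for cell in cells if position_pattern.search(cell)]

print(position_cells)
```

Only the cell holding the age is dropped; the three cells that mention a position are kept.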

In [87]:
#importing regex module
import re

#pulling tags
td_list = prem_attackers_soup.find_all('td')

#instantiating empty list
attacker_position_list = []

#for loop - loops through to check if positional information is contained in each element
for item in td_list:
    m = re.search('Winger', str(item))
    n = re.search('Forward', str(item))
    p = re.search('Striker', str(item))
    #skipping through items without positional information
    if m is None and n is None and p is None:
        continue
    #appending those with positional information
    else:
        attacker_position_list.append(item)

#cut out first table listing all position options
attacker_position_list = attacker_position_list[2:]
attacker_position_list
#position listed twice for each player so we can split it in two
#have to cut it off at 50 to avoid tables below
attacker_position_list = attacker_position_list[1:50:2]


#extracting positions.
player_positions = []
for position_index in range(0, len(attacker_position_list)):
    position = str(attacker_position_list[position_index]).split('</td>', 1)[0].split('<td>',1)[1]
    player_positions.append(position)
print(len(player_positions))
player_positions

25


['Centre-Forward',
 'Right Winger',
 'Centre-Forward',
 'Left Winger',
 'Left Winger',
 'Left Winger',
 'Left Winger',
 'Left Winger',
 'Left Winger',
 'Centre-Forward',
 'Centre-Forward',
 'Centre-Forward',
 'Right Winger',
 'Centre-Forward',
 'Centre-Forward',
 'Right Winger',
 'Left Winger',
 'Left Winger',
 'Right Winger',
 'Right Winger',
 'Centre-Forward',
 'Centre-Forward',
 'Centre-Forward',
 'Right Winger',
 'Right Winger']

Prem Striker Page 1 Test
1) Names - __Check__


2) Ages - __Check__

3) Nationality - __Check__

4) Last Club - __Check__

5) Market Value - __Check__

6) Position - __Check__

Now we can test out the full thing on a different link. 
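Since the six lists are parallel (the same index refers to the same player in each), one way to check that they line up and to combine them into one record per player is sketched below. The data are the first three entries from the outputs above; the names `columns` and `rows` are only for this illustration.

```python
# First three entries of each list from the outputs above.
columns = {
    "name": ["Romelu Lukaku", "Mohamed Salah", "Harry Kane"],
    "age": ["28", "29", "28"],
    "nationality": ["Belgium", "Egypt", "England"],
    "club": ["Chelsea FC", "Liverpool FC", "Tottenham Hotspur"],
    "market_value": ["110.00", "110.00", "110.00"],
    "position": ["Centre-Forward", "Right Winger", "Centre-Forward"],
}

# All lists must be the same length, otherwise rows would be misaligned.
lengths = {key: len(values) for key, values in columns.items()}
assert len(set(lengths.values())) == 1, lengths

# zip walks the lists in parallel, producing one dict per player.
rows = [dict(zip(columns, row)) for row in zip(*columns.values())]

print(rows[0])
```

The length check plays the same role as the repeated `print(len(...))` calls in the cells above, but fails loudly if one list comes back short.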

# All Information Scrape - One Link

In this section I am going to combine all of the separate scrapes into one big scrape of a different link: the top 25 attackers by market value in the top Spanish league, *La Liga*. The process is exactly the same as in the extractions above, just happening all at once. First I need to recreate the page tree and Beautiful Soup to scrape (the header dictionary from earlier is reused). I am also going to check whether club, nationality and age are again in the same tags.

In [99]:
#instantiating link
test_page = 'https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/1'

#creating pagetree
test_pagetree = requests.get(test_page, headers = headers)
#creating soup
test_soup = BeautifulSoup(test_pagetree.content, 'html.parser')
test_soup

club_nation_age = test_soup.find_all('td', {'class':'zentriert'})
club_nation_age


[<td class="zentriert">1</td>,
 <td class="zentriert"><img alt="Brazil" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/26.png?lm=1520611569" title="Brazil"/></td>,
 <td class="zentriert">21</td>,
 <td class="zentriert"><a href="/real-madrid/startseite/verein/418/saison_id/2021" title="Real Madrid"><img alt="Real Madrid" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/418.png?lm=1580722449" title=" "/></a></td>,
 <td class="zentriert">2</td>,
 <td class="zentriert"><img alt="Spain" class="flaggenrahmen" src="https://tmssl.akamaized.net/images/flagge/verysmall/157.png?lm=1520611569" title="Spain"/></td>,
 <td class="zentriert">24</td>,
 <td class="zentriert"><a href="/real-sociedad-san-sebastian/startseite/verein/681/saison_id/2021" title="Real Sociedad"><img alt="Real Sociedad" class="" src="https://tmssl.akamaized.net/images/wappen/verysmall/681.png?lm=1614795530" title=" "/></a></td>,
 <td class="zentriert">3</td>,
 <td class="zentrier

They are all in the same tags, so I can use the same code as before. 

In [103]:
#reinstantiating just in case
test_page = 'https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/1'
test_pagetree = requests.get(test_page, headers = headers)
test_soup = BeautifulSoup(test_pagetree.content, 'html.parser')
test_soup


#player names
player_names = test_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
player_list = []
for i in range(0, len(player_names)):
    player_list.append(str(player_names[i]).split('" class=')[0].split('alt="')[1])
player_list

club_nation_age = test_soup.find_all('td', {'class':'zentriert'})
club_nation_age

#pulling ages
ages = club_nation_age[2::4]
ages = ages[:25]
print(len(ages))
player_ages = []
for age_index in range(0, len(ages)):
    player_ages.append(str(ages[age_index]).split('</td>')[0].split('"zentriert">')[1])
    
#pulling nationalities
nationalities = club_nation_age[1::4]
nationalities = nationalities[:25]
#resetting the output list (not the input list) before looping
player_nationalities = []
for nation_index in range(0, len(nationalities)):
    nationality = str(nationalities[nation_index]).split('" class="flaggenrahmen"', 1)[0].split('alt="', 1)[1]
    player_nationalities.append(nationality)
    
print(len(player_nationalities))
player_nationalities

#pulling last club
last_clubs = club_nation_age[3::4]
#length is 27 but should be 25
#have to remove two because of the table on the website with similar tags at the end of the soup above
last_clubs = last_clubs[:25]

player_last_clubs =[]
for last_club_index in range(0, len(last_clubs)):
    last_club = str(last_clubs[last_club_index]).split('" class=""',1)[0].split('img alt="', 1)[1]
    player_last_clubs.append(last_club)
print(len(player_last_clubs))
player_last_clubs

#market values
attacker_market_value = test_soup.find_all('td', {'class':"rechts hauptlink"})
len(attacker_market_value) #30 need only the first 25
#pulling only first 25 items in the list. these represent the actual market values
attacker_market_value = attacker_market_value[:25]
len(attacker_market_value) #len now 25
#initializing empty list
player_market_values = []
#extract market values
for market_value_index in range(0, len(attacker_market_value)):
    market_value = str(attacker_market_value[market_value_index]).split('$',1)[1].split('m',1)[0]
    player_market_values.append(market_value)
    
print(len(player_market_values))
player_market_values



#position list
td_list = test_soup.find_all('td')
attacker_position_list = []
#pulling out desired positions
for item in td_list:
    m = re.search('Winger', str(item))
    n = re.search('Forward', str(item))
    p = re.search('Striker', str(item))
    if m is None and n is None and p is None:
        continue
    else:
        attacker_position_list.append(item)

#cut out first table listing all position options
attacker_position_list = attacker_position_list[2:]
attacker_position_list
#position listed twice for each player so we can split it in two
#have to cut it off at 50 to avoid tables below
attacker_position_list = attacker_position_list[1:50:2]


player_positions = []
for position_index in range(0, len(attacker_position_list)):
    position = str(attacker_position_list[position_index]).split('</td>', 1)[0].split('<td>',1)[1]
    player_positions.append(position)
print(len(player_positions))
player_positions

25
28
25
25
25
25


['Left Winger',
 'Left Winger',
 'Left Winger',
 'Second Striker',
 'Second Striker',
 'Right Winger',
 'Centre-Forward',
 'Left Winger',
 'Centre-Forward',
 'Right Winger',
 'Centre-Forward',
 'Left Winger',
 'Right Winger',
 'Second Striker',
 'Centre-Forward',
 'Centre-Forward',
 'Right Winger',
 'Left Winger',
 'Left Winger',
 'Right Winger',
 'Centre-Forward',
 'Left Winger',
 'Right Winger',
 'Centre-Forward',
 'Centre-Forward']

It worked! Now I will try it on the top attackers across all leagues. 
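The three separate `re.search` calls in the position filter above could be collapsed into a single pattern. This is only a minimal sketch: the helper name `keep_attacker_cells` and the `cells` list are hypothetical sample data for illustration, not scraped output.

```python
import re

# One alternation pattern stands in for the three separate searches.
ATTACKER_PATTERN = re.compile(r'Winger|Forward|Striker')

def keep_attacker_cells(cells):
    """Return only the cells whose text mentions an attacking position."""
    return [cell for cell in cells if ATTACKER_PATTERN.search(str(cell))]

# Hypothetical sample cells, made up for this illustration.
cells = ['<td>Left Winger</td>', '<td>Goalkeeper</td>',
         '<td>Centre-Forward</td>', '<td>Second Striker</td>']
attacker_cells = keep_attacker_cells(cells)
print(attacker_cells)
```

The slicing that follows the filter in the notebook (`[2:]` and `[1:50:2]`) would still be needed, since the pattern alone cannot tell the page's position legend apart from the player rows.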

# All information Scrape - All Links
In this section I will be scraping the top 100 attackers from all 5 leagues. I know through testing that the scraping code I have created works. The only thing I need to do is create the links. This will be done through a for loop, adding the different parts of each link to the parts of the URL that stay the same. Here are the parts that change:

* Page Number - 1-4
* League Code - Each league has a unique code in the url
* Country Code - Each country has a unique code in the url

By creating a nested for loop, I can create all the necessary links to scrape through. 

In [7]:
#creating link list

#creating page code list
page_code_list = ['1', '2', '3', '4']


##league code list
league_codes = ['premier-league', 'laliga', 'bundesliga', 'serie-a', 'ligue-1']

#country code list
country_codes = ['GB1', 'ES1', 'L1', 'IT1', 'FR1']

### Link for Reference

https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/1

This is what all links look like, with the interchangeable codes listed above. I will build them with a nested for loop. The outer level loops through the country and league codes. Because the league codes and country codes need to match, I pull the same index from each of the respective lists to add to the link string. The inner level loops through the page numbers, since each country/league pair needs a link for every page number. I am scraping 4 pages for each league.

In [230]:
attacker_link_list = []

for i in range(0, len(league_codes)):
    for j in range(0, len(page_code_list)):
        #creating link
        link = "https://www.transfermarkt.us/" + league_codes[i] + "/marktwerte/wettbewerb/" + country_codes[i] + "/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/" + page_code_list[j]  
        
        #appending link to list
        attacker_link_list.append(link)
        
        print(f'{league_codes[i]} {page_code_list[j]} link created')

len(attacker_link_list)

premier-league 1 link created
premier-league 2 link created
premier-league 3 link created
premier-league 4 link created
laliga 1 link created
laliga 2 link created
laliga 3 link created
laliga 4 link created
bundesliga 1 link created
bundesliga 2 link created
bundesliga 3 link created
bundesliga 4 link created
serie-a 1 link created
serie-a 2 link created
serie-a 3 link created
serie-a 4 link created
ligue-1 1 link created
ligue-1 2 link created
ligue-1 3 link created
ligue-1 4 link created


20
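The index-matched pairing of league and country codes can also be sketched with `zip` and `itertools.product`, which avoids indexing by hand. This is an alternative illustration rather than the code used in this notebook; the name `attacker_links` is introduced here only for the example.

```python
from itertools import product

league_codes = ['premier-league', 'laliga', 'bundesliga', 'serie-a', 'ligue-1']
country_codes = ['GB1', 'ES1', 'L1', 'IT1', 'FR1']
page_code_list = ['1', '2', '3', '4']

# zip keeps each league paired with its own country code;
# product then crosses every pair with every page number.
attacker_links = [
    f'https://www.transfermarkt.us/{league}/marktwerte/wettbewerb/{country}'
    f'/pos/Sturm/detailpos//altersklasse/alle/plus//galerie/0/page/{page}'
    for (league, country), page in product(zip(league_codes, country_codes),
                                           page_code_list)
]
print(len(attacker_links))
```

The links come out in the same order as in the loop above: all four pages for one league, then the next league.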

Now that links are created, I can run them through the same for loop used in the previous section.

In [231]:
#creating empty lists to append to
player_list = []
player_ages = []
player_nationalities = []
player_last_clubs =[]
player_market_values = []
player_positions = []
player_leagues = []

#looping through links
for link in attacker_link_list:

    #creating soup
    test_page = link
    test_pagetree = requests.get(test_page, headers = headers)
    test_soup = BeautifulSoup(test_pagetree.content, 'html.parser')
    test_soup


    #leagues
    leagues = []
    league = link.split('.us/', 1)[1].split('/marktwerte', 1)[0]
    leagues.append(league)
    leagues = leagues * 25
    player_leagues = player_leagues + leagues
    print(f'There are {len(player_leagues)} names in goalie_leagues list')
    
    #player names
    player_names = test_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
    
    for i in range(0, len(player_names)):
        player_list.append(str(player_names[i]).split('" class=')[0].split('alt="')[1])
    print(f'There are {len(player_list)} names in player_list')

    club_nation_age = test_soup.find_all('td', {'class':'zentriert'})
    club_nation_age

    #pulling ages
    ages = club_nation_age[2::4]
    ages = ages[:25]
    print(len(ages))
    
    for age_index in range(0, len(ages)):
        player_ages.append(str(ages[age_index]).split('</td>')[0].split('"zentriert">')[1])
    
    print(f'There are {len(player_ages)} names in player_ages list')
    
    #pulling nationalities
    nationalities = club_nation_age[1::4]
    print(len(nationalities))
    nationalities = nationalities[:25]
    
    for nation_index in range(0, len(nationalities)):
        nationality = str(nationalities[nation_index]).split('" class="flaggenrahmen"', 1)[0].split('alt="', 1)[1]
        player_nationalities.append(nationality)

    print(f'There are {len(player_nationalities)} names in player_nationalities list')

    #pulling last club
    last_clubs = club_nation_age[3::4]
    #length of 27 should be 25
    #have to remove two because of table on website with similar tags at end of soup above
    last_clubs = last_clubs[:25]

    
    for last_club_index in range(0, len(last_clubs)):
        last_club = str(last_clubs[last_club_index]).split('" class=""',1)[0].split('img alt="', 1)[1]
        player_last_clubs.append(last_club)
    print(len(player_last_clubs))
    player_last_clubs
    print(f'There are {len(player_last_clubs)} names in player_last_clubs list')

    #market values
    attacker_market_value = test_soup.find_all('td', {'class':"rechts hauptlink"})
    #find_all returns 30 cells; only the first 25 are the actual market values
    attacker_market_value = attacker_market_value[:25]

    #extract market values
    for market_value_index in range(0, len(attacker_market_value)):
        market_value = str(attacker_market_value[market_value_index]).split('$',1)[1].split('m',1)[0]
        player_market_values.append(market_value)

    print(f'There are {len(player_market_values)} names in player_market_values list')

    #position list
    td_list = test_soup.find_all('td')
    attacker_position_list = []
    #pulling out desired positions
    for item in td_list:
        m = re.search('Winger', str(item))
        n = re.search('Forward', str(item))
        p = re.search('Striker', str(item))
        if m is None and n is None and p is None:
            continue
        else:
            attacker_position_list.append(item)

    #cut out first table listing all position options
    attacker_position_list = attacker_position_list[2:]
    attacker_position_list
    #position listed twice for each player so we can split it in two
    #have to cut it off at 50 to avoid tables below
    attacker_position_list = attacker_position_list[1:50:2]

    for position_index in range(0, len(attacker_position_list)):
        position = str(attacker_position_list[position_index]).split('</td>', 1)[0].split('<td>',1)[1]
        player_positions.append(position)
    print(f'There are {len(player_positions)} names in player_positions list')
    
print(len(player_ages))
print(len(player_list))
print(len(player_nationalities))
print(len(player_market_values))
print(len(player_positions))
print(len(player_last_clubs))

There are 25 names in goalie_leagues list
There are 25 names in player_list
25
There are 25 names in player_ages list
28
There are 25 names in player_nationalities list
25
There are 25 names in player_last_clubs list
There are 25 names in player_market_values list
There are 25 names in player_positions list
There are 50 names in goalie_leagues list
There are 50 names in player_list
25
There are 50 names in player_ages list
28
There are 50 names in player_nationalities list
50
There are 50 names in player_last_clubs list
There are 50 names in player_market_values list
There are 50 names in player_positions list
There are 75 names in goalie_leagues list
There are 75 names in player_list
25
There are 75 names in player_ages list
28
There are 75 names in player_nationalities list
75
There are 75 names in player_last_clubs list
There are 75 names in player_market_values list
There are 75 names in player_positions list
There are 100 names in goalie_leagues list
There are 100 names in player_

In [159]:
#checking league
print(player_leagues[:25])
print(player_leagues[25:50])

['premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league']
['premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league']


The scrape was a success! We can use the same code for keepers, which will be done in the next section.
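The age, nationality, and club slices used in the loop above rely on each player contributing four `zentriert` cells in a fixed order. The toy example below illustrates that stride slicing with made-up values. The `'x'` placeholder stands for the first cell of each group, which the scrape never reads.

```python
# Made-up flat list standing in for the 'zentriert' cells of three players.
# Each player contributes four cells; 'x' marks the first cell of each group,
# which the scrape does not use.
cells = ['x', 'Portugal', '24', 'Manchester City',
         'x', 'France', '28', 'Manchester United',
         'x', 'Netherlands', '30', 'Liverpool FC']

nationalities = cells[1::4]  # every 4th cell, starting at index 1
ages = cells[2::4]           # every 4th cell, starting at index 2
clubs = cells[3::4]          # every 4th cell, starting at index 3

# zip lines the three slices back up into one row per player
rows = list(zip(nationalities, ages, clubs))
print(rows)
```

If a page ever returned a different number of cells per player, every slice would drift out of alignment. That is why the length checks printed during the scrape are worth keeping.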

# Scraping Keepers

The same code can be used to scrape goalkeeper information. I know this through examining the source code of the top goalie pages on https://www.transfermarkt.us. The differences in this scrape are the following:

1. I will only be taking the top 25 keepers from each league, so I only need one link from each league to scrape from

2. I don't need a positions scrape, as everybody is a goalkeeper.

Now to create the links.

Link for reference
'https://www.transfermarkt.us/la_liga/marktwerte/wettbewerb/ES1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1'

In [129]:
#empty link list
goalie_link_list = []

for i in range(0, len(league_codes)):
    #creating link
    #don't need page codes since I am only taking the first page
    link = "https://www.transfermarkt.us/" + league_codes[i] + "/marktwerte/wettbewerb/" + country_codes[i] + "/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1"
        
    #appending link to list
    goalie_link_list.append(link)
        
    print(f'{league_codes[i]} goalie link created')
#should be five
len(goalie_link_list)

premier-league goalie link created
laliga goalie link created
bundesliga goalie link created
serie-a goalie link created
ligue-1 goalie link created


5

Now that the links are created I can run them through the same for loop I used above. I will create new lists to append to. I will add all corresponding lists together at the end, then create my dataframe.

In [133]:
goalie_link_list

['https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1',
 'https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1',
 'https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1',
 'https://www.transfermarkt.us/serie-a/marktwerte/wettbewerb/IT1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1',
 'https://www.transfermarkt.us/ligue-1/marktwerte/wettbewerb/FR1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1']

https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1

Instead of pulling the league from each player's entry on the page, it occurs to me that I can just take the league code from the link and create a list with this code 25 times, as I am pulling 25 keepers from each league. This will make life easier for the next scrape, as I won't have to search through the Beautiful Soup output to find the tags for each league.

I can also just append GK to the position list. Every player being scraped here is a goalkeeper, so I do not need to pull their positions.

In [154]:
#testing theory
test_link = 'https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1'
test_link.split('.us/', 1)[1].split('/marktwerte', 1)[0]

'premier-league'
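The same extraction could be sketched with the standard library's `urllib.parse`, which reads the league code as the first path segment instead of splitting on `'.us/'`. The helper name `league_from_link` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse

def league_from_link(link):
    """Return the first path segment of a link, which holds the league code."""
    path = urlparse(link).path            # '/premier-league/marktwerte/...'
    return path.strip('/').split('/')[0]

test_link = ('https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/'
             'GB1/pos/Torwart/detailpos//altersklasse/alle/plus//galerie/0/page/1')

league = league_from_link(test_link)
# One league label per keeper on the page (25 keepers per league)
leagues = [league] * 25
print(league, len(leagues))
```

This version does not depend on the `.us` domain appearing in the link, so it would keep working if the links pointed at another Transfermarkt domain.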

In [227]:
#instantiating empty lists
goalie_list = []
goalie_ages = []
goalie_nationalities = []
goalie_last_clubs =[]
goalie_market_values = []
goalie_positions = []
goalie_leagues =[]

for link in goalie_link_list:

    #creating soup
    goalies_page = link
    goalies_pagetree = requests.get(goalies_page, headers = headers)
    goalies_soup = BeautifulSoup(goalies_pagetree.content, 'html.parser')
    goalies_soup

    #leagues
    leagues = []
    league = link.split('.us/', 1)[1].split('/marktwerte', 1)[0]
    leagues.append(league)
    leagues = leagues * 25
    goalie_leagues = goalie_leagues + leagues
    print(f'There are {len(goalie_leagues)} names in goalie_leagues list')
    
    #player names
    player_names = goalies_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
    
    for i in range(0, len(player_names)):
        goalie_list.append(str(player_names[i]).split('" class=')[0].split('alt="')[1])
    print(f'There are {len(goalie_list)} names in player_list')

    club_nation_age = goalies_soup.find_all('td', {'class':'zentriert'})
    club_nation_age

    #pulling ages
    ages = club_nation_age[2::4]
    ages = ages[:25]
    print(len(ages))
    
    for age_index in range(0, len(ages)):
        goalie_ages.append(str(ages[age_index]).split('</td>')[0].split('"zentriert">')[1])
    
    print(f'There are {len(goalie_ages)} names in player_ages list')
    
    #pulling nationalities
    nationalities = club_nation_age[1::4]
    print(len(nationalities))
    nationalities = nationalities[:25]
    
    for nation_index in range(0, len(nationalities)):
        nationality = str(nationalities[nation_index]).split('" class="flaggenrahmen"', 1)[0].split('alt="', 1)[1]
        goalie_nationalities.append(nationality)

    print(f'There are {len(goalie_nationalities)} names in player_nationalities list')

    #pulling last club
    last_clubs = club_nation_age[3::4]
    #length of 27 should be 25
    #have to remove two because of table on website with similar tags at end of soup above
    last_clubs = last_clubs[:25]

    
    for last_club_index in range(0, len(last_clubs)):
        last_club = str(last_clubs[last_club_index]).split('" class=""',1)[0].split('img alt="', 1)[1]
        goalie_last_clubs.append(last_club)
    
    print(f'There are {len(goalie_last_clubs)} names in player_last_clubs list')

    #market values
    attacker_market_value = goalies_soup.find_all('td', {'class':"rechts hauptlink"})
    len(attacker_market_value) #30 need only the first 25
    #pulling only first 25 items in the list. these represent the actual market values
    attacker_market_value = attacker_market_value[:25]
    
    
    #extract market values
    for market_value_index in range(0, len(attacker_market_value)):
        market_value = str(attacker_market_value[market_value_index]).split('$',1)[1].split('m',1)[0]
        goalie_market_values.append(market_value)

    print(f'There are {len(goalie_market_values)} names in player_market_values list')

    #position list not needed

#creating goalie position list    
goalie_positions = ['GK']
goalie_positions = goalie_positions * len(goalie_ages)
    
print(len(goalie_ages))
print(len(goalie_list))
print(len(goalie_nationalities))
print(len(goalie_market_values))
print(len(goalie_positions))
print(len(goalie_last_clubs))
print(len(goalie_leagues))

There are 25 names in goalie_leagues list
There are 25 names in player_list
25
There are 25 names in player_ages list
28
There are 25 names in player_nationalities list
There are 25 names in player_last_clubs list
There are 25 names in player_market_values list
There are 50 names in goalie_leagues list
There are 50 names in player_list
25
There are 50 names in player_ages list
28
There are 50 names in player_nationalities list
There are 50 names in player_last_clubs list
There are 50 names in player_market_values list
There are 75 names in goalie_leagues list
There are 75 names in player_list
25
There are 75 names in player_ages list
28
There are 75 names in player_nationalities list
There are 75 names in player_last_clubs list
There are 75 names in player_market_values list
There are 100 names in goalie_leagues list
There are 100 names in player_list
25
There are 100 names in player_ages list
28
There are 100 names in player_nationalities list
There are 100 names in player_last_clubs 

In [157]:
print(goalie_leagues)

['premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'premier-league', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'laliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bundesliga', 'bund

Now we can move on to the other positions. As a reminder, we are pulling the following from each league:

__Top 50__
* Center Backs
* Center Midfielders



__Top 25__
* Left Backs
* Right Backs
* Center Defensive Midfielders
* Left Midfielders
* Right Midfielders
* Attacking Midfielders


# Final Scrape
In this section I will scrape the information for the above positions. First I will test out the scrape on the link below. 

In [3]:
#creating Beautiful Soup
scrape_page = 'https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/pos//detailpos/3/altersklasse/alle/plus//galerie/0/page/1'
scrape_pagetree = requests.get(scrape_page, headers = headers)
scrape_soup = BeautifulSoup(scrape_pagetree.content, 'html.parser')
scrape_soup

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<script type="text/javascript">
    !function () { var e = function () { var e, t = "__tcfapiLocator", a = [], n = window; for (; n;) { try { if (n.frames[t]) { e = n; break } } catch (e) { } if (n === window.top) break; n = n.parent } e || (!function e() { var a = n.document, r = !!n.frames[t]; if (!r) if (a.body) { var i = a.createElement("iframe"); i.style.cssText = "display:none", i.name = t, a.body.appendChild(i) } else setTimeout(e, 5); return !r }(), n.__tcfapi = function () { for (var e, t = arguments.length, n = new Array(t), r = 0; r < t; r++)n[r] = arguments[r]; if (!n.length) return a; if ("setGdprApplies" === n[0]) n.length > 3 && 2 === parseInt(n[1], 10) && "boolean" == typeof n[3] && (e = n[3], "function" == typeof n[2] && n[2]("set", !0)); else if ("ping" === n[0]) { var i = { gdprApplies: e, cmpLoaded: !1, cmpStatus: "stub" }; "function" == typeof n[2] && n[2](i) } else a.push(n) }, n.addEventListener("message", (f

In [166]:
#scraping player names

player_scrapes = scrape_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
scrape_player_list =[]    
for i in range(0, len(player_scrapes)):
    scrape_player_list.append(str(player_scrapes[i]).split('" class=')[0].split('alt="')[1])
print(f'There are {len(scrape_player_list)} names in scrape_player_list')

scrape_player_list

There are 25 names in scrape_player_list


['Dayot Upamecano',
 'Lucas Hernández',
 'Edmond Tapsoba',
 'Niklas Süle',
 'Manuel Akanji',
 'Evan Ndicka',
 'Jonathan Tah',
 'Maxence Lacroix',
 'Josko Gvardiol',
 'Matthias Ginter',
 'Nico Elvedi',
 'Odilon Kossounou',
 'Nico Schlotterbeck',
 'Mohamed Simakan',
 'Moussa Niakhaté',
 'Konstantinos Mavropanos',
 'Philipp Lienhart',
 'Jeremiah St. Juste',
 'Felix Uduokhai',
 'Dan-Axel Zagadou',
 'Willi Orbán',
 'Sebastiaan Bornauw',
 'Martin Hinteregger',
 'Piero Hincapié',
 'John Anthony Brooks']

The scrape appears to work for the position-specific pages as well. I need to create a list of links again and then run them through the scrape. Country codes, league codes and page codes will be the same, but we will need position codes for these pages as well. The positions are listed below along with their position codes in each link.

**2 Pages Scraped (50 total players)**
* Center Backs - 3
* Center Midfielders - 7

**1 Page Scraped (25 total players)**
* Left Backs - 4
* Right Backs - 5
* Defensive Midfielders - 6
* Right Midfielders - 8
* Left Midfielders - 9
* Attacking Midfielders - 10


Link for reference
https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/pos//detailpos/3/altersklasse/alle/plus//galerie/0/page/1

The dictionary below will be used to add the position information to the scrape. It will look at the corresponding link code for each position and create a list of 25 with the actual positions listed below to add to the final dataframe.

In [20]:
#creating a dictionary that maps code to positions
position_dict = {3: 'CB', 7: 'CM', 10: 'CAM', 4: 'LB', 5: 'RB', 6: 'CDM', 9: 'LM', 8: 'RM'}

In [32]:
#printing country and league code for reference
print(country_codes)
print(league_codes)

#creating page number codes
scrape_page_codes = [1, 2]

#creating position code list to match up with dictionary created above. 
position_code_list = [3, 7, 10, 4, 5, 6, 8, 9]



['GB1', 'ES1', 'L1', 'IT1', 'FR1']
['premier-league', 'laliga', 'bundesliga', 'serie-a', 'ligue-1']


Now the link list can be created. 

In [33]:
#instantiating link list
scrape_link_list = []
#nesting for loops to create links
for x in range(0, len(position_code_list)):
    for y in range (0, len(league_codes)):
        for z in range(0, len(scrape_page_codes)):
            #only taking the first page (top 25) for positions with codes 4-6 and 8-10
            if position_code_list[x] in range(4, 7) and scrape_page_codes[z] == 2:
                continue
            elif position_code_list[x] in range(8, 11) and scrape_page_codes[z] == 2:
                continue
            else:
                #using scrape_page_codes here so the page number comes from the list being looped over
                scrape_link = 'https://www.transfermarkt.us/' + league_codes[y] + '/marktwerte/wettbewerb/' + country_codes[y] + '/pos//detailpos/' + str(position_code_list[x]) + '/altersklasse/alle/plus//galerie/0/page/' + str(scrape_page_codes[z])
                scrape_link_list.append(scrape_link)



In [34]:
#for coder's reference
print(scrape_link_list[21:26])
print(len(scrape_link_list))

['https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/pos//detailpos/10/altersklasse/alle/plus//galerie/0/page/1', 'https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1/pos//detailpos/10/altersklasse/alle/plus//galerie/0/page/1', 'https://www.transfermarkt.us/serie-a/marktwerte/wettbewerb/IT1/pos//detailpos/10/altersklasse/alle/plus//galerie/0/page/1', 'https://www.transfermarkt.us/ligue-1/marktwerte/wettbewerb/FR1/pos//detailpos/10/altersklasse/alle/plus//galerie/0/page/1', 'https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos//detailpos/4/altersklasse/alle/plus//galerie/0/page/1']
50
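The page-count rule in the loop above (two pages for CB and CM, one page for everything else) can also be written as data instead of `if`/`elif` branches. The sketch below is an alternative illustration; the dictionary `pages_per_position` and the list name `links` are introduced only for this example.

```python
league_codes = ['premier-league', 'laliga', 'bundesliga', 'serie-a', 'ligue-1']
country_codes = ['GB1', 'ES1', 'L1', 'IT1', 'FR1']

# Pages to scrape per position code: two pages for CB (3) and CM (7),
# one page for the rest. Key order matches position_code_list.
pages_per_position = {3: 2, 7: 2, 10: 1, 4: 1, 5: 1, 6: 1, 8: 1, 9: 1}

links = []
for position_code, page_count in pages_per_position.items():
    for league, country in zip(league_codes, country_codes):
        for page in range(1, page_count + 1):
            links.append(
                f'https://www.transfermarkt.us/{league}/marktwerte/wettbewerb/'
                f'{country}/pos//detailpos/{position_code}'
                f'/altersklasse/alle/plus//galerie/0/page/{page}'
            )
print(len(links))
```

Changing how many pages a position needs then only requires editing one number in the dictionary.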


__NOTE: There are not 25 attacking midfielders in La Liga, or right or left midfielders in any of the leagues, listed on Transfermarkt. I'm going to enter these manually in Excel after the scrape is complete.__ 

I will need to remove the related links from the scraping pool.

In [35]:
#removing Spanish attacking midfielder link
scrape_link_list.remove('https://www.transfermarkt.us/laliga/marktwerte/wettbewerb/ES1/pos//detailpos/10/altersklasse/alle/plus//galerie/0/page/1')
len(scrape_link_list)

49

In [36]:
print(scrape_link_list[39])

https://www.transfermarkt.us/premier-league/marktwerte/wettbewerb/GB1/pos//detailpos/8/altersklasse/alle/plus//galerie/0/page/1


In [37]:
#removing bad links (left and right midfielders)
scrape_link_list_good = scrape_link_list[:39]
scrape_link_list_bad = scrape_link_list[39:]

In [38]:
two_pagers = position_code_list[:2]
one_pagers = position_code_list[2:]
#for reference
print(two_pagers)
print(one_pagers)

[3, 7]
[10, 4, 5, 6, 8, 9]


Now I need to create a list of 50 position codes, one for each of the 50 links I will be scraping. The CB and CM position codes are needed 10 times each (two links per league), and the rest of the codes are needed 5 times each (one link per league).

In [39]:
#creating a list of position codes to use to pull positions in scrape
#should be 50, since there are 50 links in the link list

#instantiating empty list

scrape_position_codes = []
for code in two_pagers:
    ten_codes = [code] * 10
    scrape_position_codes += ten_codes
    
for codes in one_pagers:
    five_codes = [codes] * 5
    scrape_position_codes += five_codes
    
print(scrape_position_codes)
    

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9]


**Note** The list above was created before the issues with Spanish attacking midfielders and left/right midfielders were discovered. So I am now going to remove those from the list, as the info will be manually entered. 

In [40]:
#removing a 10 from the position codes list as well
scrape_position_codes.remove(10)
len(scrape_position_codes)


49

In [41]:
#removing left and right mid
scrape_position_codes_good = scrape_position_codes[:39]
scrape_position_codes_bad = scrape_position_codes[39:]
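Because each position link already carries its code after `detailpos/`, the parallel list of position codes could be avoided by reading the code from the link itself. The sketch below is one way to do that; the helper name `position_from_link` is hypothetical.

```python
import re

position_dict = {3: 'CB', 7: 'CM', 10: 'CAM', 4: 'LB', 5: 'RB', 6: 'CDM', 9: 'LM', 8: 'RM'}

def position_from_link(link):
    """Look up the position label from the code that follows 'detailpos/' in a link."""
    match = re.search(r'/detailpos/(\d+)/', link)
    if match is None:
        raise ValueError(f'No position code found in link: {link}')
    return position_dict[int(match.group(1))]

example_link = ('https://www.transfermarkt.us/bundesliga/marktwerte/wettbewerb/L1'
                '/pos//detailpos/3/altersklasse/alle/plus//galerie/0/page/1')
position = position_from_link(example_link)
print(position)
```

With this approach, removing a link from the link list needs no matching removal from a second list, so the two cannot fall out of step.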

## Final Scrape

Now I am ready to scrape the 39 links left in the link list. Examining the different pages showed that their source code has the same structure, so the process is the same as in the scrapes above.

In [42]:
#empty lists 
scrape_player_list = []
scrape_ages = []
scrape_nationalities = []
scrape_last_clubs =[]
scrape_market_values = []
scrape_positions = []
scrape_leagues =[]


for link_index in range(0, len(scrape_link_list_good)):

    #creating soup
    scrape_page = scrape_link_list_good[link_index]
    scrape_pagetree = requests.get(scrape_page, headers = headers)
    scrape_soup = BeautifulSoup(scrape_pagetree.content, 'html.parser')
    scrape_soup

    #leagues
    leagues = []
    league = scrape_link_list_good[link_index].split('.us/', 1)[1].split('/marktwerte', 1)[0]
    leagues.append(league)
    leagues = leagues * 25
    scrape_leagues = scrape_leagues + leagues
    print(f'There are {len(scrape_leagues)} items in scrape_leagues list')
    
    #player names
    player_names = scrape_soup.find_all('img', {'class': "bilderrahmen-fixed lazy lazy"})
    
    for i in range(0, len(player_names)):
        scrape_player_list.append(str(player_names[i]).split('" class=')[0].split('alt="')[1])
    print(f'There are {len(scrape_player_list)} items in scrape_player_list')

    club_nation_age = scrape_soup.find_all('td', {'class':'zentriert'})
    club_nation_age

    #pulling ages
    ages = club_nation_age[2::4]
    ages = ages[:25]
    
    
    for age_index in range(0, len(ages)):
        scrape_ages.append(str(ages[age_index]).split('</td>')[0].split('"zentriert">')[1])
    
    print(f'There are {len(scrape_ages)} items in scrape_ages list')
    
    #pulling nationalities
    nationalities = club_nation_age[1::4]
    
    nationalities = nationalities[:25]
    
    for nation_index in range(0, len(nationalities)):
        nationality = str(nationalities[nation_index]).split('" class="flaggenrahmen"', 1)[0].split('alt="', 1)[1]
        scrape_nationalities.append(nationality)

    print(f'There are {len(scrape_nationalities)} items in scrape_nationalities list')

    #pulling last club
    last_clubs = club_nation_age[3::4]
    #length of 27 should be 25
    #have to remove two because of table on website with similar tags at end of soup above
    last_clubs = last_clubs[:25]

    
    for last_club_index in range(0, len(last_clubs)):
        last_club = str(last_clubs[last_club_index]).split('" class=""',1)[0].split('img alt="', 1)[1]
        scrape_last_clubs.append(last_club)
    
    print(f'There are {len(scrape_last_clubs)} items in scrape_last_clubs list')

    #market values
    market_values = scrape_soup.find_all('td', {'class':"rechts hauptlink"})
    #pulling only first 25 items in the list. these represent the actual market values
    market_values = market_values[:25]
    
    
    #extract market values
    for market_value_index in range(0, len(market_values)):
        market_value = str(market_values[market_value_index]).split('$',1)[1].split('m',1)[0]
        scrape_market_values.append(market_value)

    print(f'There are {len(scrape_market_values)} items in scrape_market_values list')

    #positions are not scraped; they come from the position code for this link
    
    position = [position_dict[scrape_position_codes_good[link_index]]]
    position = position*25
    scrape_positions += position
    print(f'There are {len(scrape_positions)} items in scrape_positions list')
    
print(len(scrape_ages))
print(len(scrape_player_list))
print(len(scrape_nationalities))
print(len(scrape_market_values))
print(len(scrape_positions))
print(len(scrape_last_clubs))
print(len(scrape_leagues))

There are 25 items in scrape_leagues list
There are 25 items in scrape_player_list
There are 25 items in scrape_ages list
There are 25 items in scrape_nationalities list
There are 25 items in scrape_last_clubs list
There are 25 items in scrape_market_values list
There are 25 items in scrape_positions list
There are 50 items in scrape_leagues list
There are 50 items in scrape_player_list
There are 50 items in scrape_ages list
There are 50 items in scrape_nationalities list
There are 50 items in scrape_last_clubs list
There are 50 items in scrape_market_values list
There are 50 items in scrape_positions list
There are 75 items in scrape_leagues list
There are 75 items in scrape_player_list
There are 75 items in scrape_ages list
There are 75 items in scrape_nationalities list
There are 75 items in scrape_last_clubs list
There are 75 items in scrape_market_values list
There are 75 items in scrape_positions list
There are 100 items in scrape_leagues list
There are 100 items in scrape_player

There are 700 items in scrape_leagues list
There are 700 items in scrape_player_list
There are 700 items in scrape_ages list
There are 700 items in scrape_nationalities list
There are 700 items in scrape_last_clubs list
There are 700 items in scrape_market_values list
There are 700 items in scrape_positions list
There are 725 items in scrape_leagues list
There are 725 items in scrape_player_list
There are 725 items in scrape_ages list
There are 725 items in scrape_nationalities list
There are 725 items in scrape_last_clubs list
There are 725 items in scrape_market_values list
There are 725 items in scrape_positions list
There are 750 items in scrape_leagues list
There are 750 items in scrape_player_list
There are 750 items in scrape_ages list
There are 750 items in scrape_nationalities list
There are 750 items in scrape_last_clubs list
There are 750 items in scrape_market_values list
There are 750 items in scrape_positions list
There are 775 items in scrape_leagues list
There are 775 i

Now I'm going to check whether this big scrape worked.


In [44]:
# creating dataframe
scrape_df = pd.DataFrame({'Name': scrape_player_list,
                          'League': scrape_leagues,
                          'Club': scrape_last_clubs,
                          'Position': scrape_positions,
                          'Age': scrape_ages,
                          'Nationality': scrape_nationalities,
                          'Market_Values': scrape_market_values})

In [45]:
# looking at the first rows to see that the data is in the data frame
scrape_df.head()

| | Name | League | Club | Position | Age | Nationality | Market_Values |
|---|---|---|---|---|---|---|---|
| 0 | Rúben Dias | premier-league | Manchester City | CB | 24 | Portugal | 82.5 |
| 1 | Raphaël Varane | premier-league | Manchester United | CB | 28 | France | 71.5 |
| 2 | Virgil van Dijk | premier-league | Liverpool FC | CB | 30 | Netherlands | 60.5 |
| 3 | Harry Maguire | premier-league | Manchester United | CB | 28 | England | 52.8 |
| 4 | Aymeric Laporte | premier-league | Manchester City | CB | 27 | Spain | 49.5 |


In [46]:
# checking positions
scrape_df['Position'].unique()

array(['CB', 'CM', 'CAM', 'LB', 'RB', 'CDM'], dtype=object)
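
The length checks printed during the scrape can also be bundled into one reusable helper. The following is a minimal sketch and not part of the original scrape: `check_equal_lengths` is a hypothetical name, and the short lists are placeholder data standing in for the real scraped lists.

```python
def check_equal_lengths(columns):
    """Return the shared length of all lists in `columns`, or raise ValueError.

    `columns` maps a column name to its list of scraped values.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty mapping has no columns, so report a length of 0.
    return next(iter(lengths.values()), 0)


# Placeholder data standing in for the scraped lists (not real scrape output).
example_columns = {
    "Name": ["Player A", "Player B", "Player C"],
    "Position": ["CB", "CM", "CB"],
    "Age": [24, 28, 30],
}

n_rows = check_equal_lengths(example_columns)
print(f"All columns have {n_rows} rows")
```

Raising an error instead of printing each length makes a mismatch hard to miss, which matters because `pd.DataFrame` cannot be built from lists of different lengths.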

# Combining The Scrapes

Now that all of the data has been scraped, I can combine the corresponding lists and create a data frame to export to an Excel document. From there I will manually enter any data that was not scraped.

In [47]:
# players
final_player_list = player_list + goalie_list + scrape_player_list
print(len(final_player_list))  # check step

# leagues
final_leagues_list = player_leagues + goalie_leagues + scrape_leagues
print(len(final_leagues_list))

# positions
final_position_list = player_positions + goalie_positions + scrape_positions
print(len(final_position_list))

# ages
final_age_list = player_ages + goalie_ages + scrape_ages
print(len(final_age_list))

# nationalities
final_nationality_list = player_nationalities + goalie_nationalities + scrape_nationalities
print(len(final_nationality_list))

# clubs
final_club_list = player_last_clubs + goalie_last_clubs + scrape_last_clubs
print(len(final_club_list))

# market values
final_market_value_list = player_market_values + goalie_market_values + scrape_market_values
print(len(final_market_value_list))


NameError: name 'player_list' is not defined
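
The seven concatenations above follow the same pattern, so they could be written as one loop over a dictionary. This is a minimal sketch under assumptions: the `sources` layout and the tiny placeholder lists are hypothetical and only stand in for the `player_*`, `goalie_*`, and `scrape_*` lists, all of which must already be defined in the running session before they are combined.

```python
# Placeholder data: each column maps to its three source lists, in the same
# order as the notebook (player_*, goalie_*, scrape_*). Values are made up.
sources = {
    "Name": (["Attacker A", "Attacker B"], ["Goalie A"], ["Defender A", "Defender B"]),
    "Age": ([28, 29], [31], [24, 27]),
}

# Concatenate the three source lists for every column.
final_columns = {}
for column, parts in sources.items():
    combined = []
    for part in parts:
        combined.extend(part)
    final_columns[column] = combined
    print(f"{column}: {len(combined)} items")

# Every combined column should have the same number of rows.
row_counts = {len(values) for values in final_columns.values()}
assert len(row_counts) == 1, f"Column lengths differ: {row_counts}"
```

A dictionary built this way can be passed straight to `pd.DataFrame`, since its keys become the column names.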

All the data has been gathered. The final step is to combine these lists into a data frame that I can then export to Excel. The data manipulation that will take place in Excel is described below.

In [245]:
# creating dataframe
market_value_df = pd.DataFrame({'Name': final_player_list,
                                'League': final_leagues_list,
                                'Club': final_club_list,
                                'Position': final_position_list,
                                'Age': final_age_list,
                                'Nationality': final_nationality_list,
                                'Market_Values': final_market_value_list})

In [246]:
#checking the dataframe
market_value_df.head()

| | Name | League | Club | Position | Age | Nationality | Market_Values |
|---|---|---|---|---|---|---|---|
| 0 | Romelu Lukaku | premier-league | Chelsea FC | Centre-Forward | 28 | Belgium | 110.0 |
| 1 | Mohamed Salah | premier-league | Liverpool FC | Right Winger | 29 | Egypt | 110.0 |
| 2 | Harry Kane | premier-league | Tottenham Hotspur | Centre-Forward | 28 | England | 110.0 |
| 3 | Marcus Rashford | premier-league | Manchester United | Left Winger | 24 | England | 93.5 |
| 4 | Jadon Sancho | premier-league | Manchester United | Left Winger | 21 | England | 93.5 |


Now that my data frame is created, I am going to export it to an Excel document and edit it there.

## Transporting to Excel Document

In [248]:
market_value_df.to_excel('/Users/benkatz/Desktop/Capstone Project/Complete Market Values.xlsx')

# Conclusion

All the data was scraped successfully: the Excel document contains all of the information. The statistics I will be using can be downloaded directly from https://fbref.com/en/ in Excel format. Once all the statistics are gathered, I will combine the 2020-2021 and 2021-2022 statistics (the latter as of February 27th, 2022) for each player. The statistics and the data scraped in this notebook will then be joined on player name. I will use the raw statistics to create per-90-minute statistics and success rates. Per-90-minute statistics are essentially per-game statistics (a soccer game lasts 90 minutes). Success rates measure the percentage of the time a player successfully completes an action. Both put players on the same scale: some players play more games than others, and raw statistics would favor them. The result will serve as the data set for my Capstone Project.
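
As a rough illustration of this plan, the sketch below sums two seasons of raw statistics per player, joins them to the market values by name, and derives one per-90-minute statistic and one success rate. Everything in it is an assumption: the column names (`Minutes`, `Goals`, `Passes_Attempted`, `Passes_Completed`) and all of the numbers are made up for illustration and are not the actual fbref export.

```python
import pandas as pd

# Made-up raw statistics for two seasons (hypothetical columns and values).
season_2021 = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Minutes": [1800, 900],
    "Goals": [10, 2],
    "Passes_Attempted": [500, 400],
    "Passes_Completed": [400, 300],
})
season_2022 = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Minutes": [900, 900],
    "Goals": [5, 4],
    "Passes_Attempted": [300, 400],
    "Passes_Completed": [240, 300],
})

# Combine the two seasons by summing each raw statistic per player.
stats = (
    pd.concat([season_2021, season_2022])
    .groupby("Name", as_index=False)
    .sum()
)

# Made-up market values standing in for the scraped data frame.
market_values = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Market_Values": [50.0, 20.0],
})

# Join the scraped data and the statistics on player name.
combined = market_values.merge(stats, on="Name", how="inner")

# Per-90-minute statistic: raw count scaled to one full game.
combined["Goals_Per_90"] = combined["Goals"] / combined["Minutes"] * 90

# Success rate: share of attempts that were completed.
combined["Pass_Success_Rate"] = (
    combined["Passes_Completed"] / combined["Passes_Attempted"]
)

print(combined[["Name", "Goals_Per_90", "Pass_Success_Rate"]])
```

Joining on names alone can mismatch players who share a name or whose names are spelled differently on the two sites, so comparing the row count before and after the merge is a cheap safeguard.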