In [3]:
import requests
from bs4 import BeautifulSoup
import csv
import time
import pandas as pd
import lxml
import re
import numpy as np


# Feature Extraction from HLTV 

This notebook contains only the feature-extraction step: it scrapes the HLTV player pages for the statistics that are later used to build a model.

## Unfiltered players

First, the player names for each year must be collected from HLTV's player statistics listing. Since HLTV did not publish advanced statistics until late 2015, only the years 2016 through 2019 are used.

Only LAN matches are counted, and no restriction is placed on the Player Filter. The **_oneYear(year)_** helper returns the date-range query string for the supplied year. Each year has its own HTML file, and all four HTML files are placed in the 'unfilteredplayers' folder.

In [None]:
def oneYear(year):
    return 'startDate=' + str(year) + '-01-01&endDate=' + str(year) + '-12-31'

# for year in np.arange(2016,2020):
#     url = 'https://www.hltv.org/stats/players?' + oneYear(year) + '&matchType=Lan'
#     temp = requests.get(url).text
#     with open('unfilteredplayers/players' + str(year) + '.html', 'w', encoding='utf-8') as f:
#         f.write(temp)
#     time.sleep(1)
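
As a quick illustration, the query string and listing URL for a single year can be checked offline, without issuing a request. The sketch below restates `oneYear` so that it runs on its own; `listing_url` is a name introduced here only for the example.

```python
def oneYear(year):
    # Same helper as above: date-range query parameters for one calendar year
    return 'startDate=' + str(year) + '-01-01&endDate=' + str(year) + '-12-31'

# Build the LAN-only listing URL for 2016 (no network access needed)
listing_url = 'https://www.hltv.org/stats/players?' + oneYear(2016) + '&matchType=Lan'
print(listing_url)
```

Printing the URL before running the download loop is a cheap way to confirm that the date range and match type are what was intended.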

These HTML files must now be read and turned into .csv files. The **_allplayersonLAN(year)_** function below takes in a year and reads that year's HTML page, which lists the players who have official statistics on LAN events only. It returns a dictionary that maps each player's name to that player's individual statistics page for the year.

In [None]:
def allplayersonLAN(year):
    with open('unfilteredplayers/players' + str(year) + '.html','r',encoding='utf-8') as players_html:
        # Parse the saved listing page directly from the open file handle
        soup = BeautifulSoup(players_html,'html.parser')
    temp = {}
    for player in soup.find_all('td',class_='playerCol'):
        temp[player.text] = 'https://hltv.org' + player.find('a')['href']
    return temp
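
The name-to-URL mapping can be tried out on a small inline snippet instead of a saved page. This is only a sketch: `sample_html` is a made-up stand-in whose structure is assumed from the selectors used above, and the real HLTV markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a saved listing page; the real markup may differ
sample_html = """
<table>
  <tr><td class="playerCol"><a href="/stats/players/1/alpha">alpha</a></td></tr>
  <tr><td class="playerCol"><a href="/stats/players/2/bravo">bravo</a></td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
players = {}
for cell in soup.find_all('td', class_='playerCol'):
    # The cell text is the player name; the link holds the relative stats URL
    players[cell.text] = 'https://hltv.org' + cell.find('a')['href']

print(players)
```

If the real cells contain extra whitespace around the name, stripping `cell.text` before using it as a key would keep the later name comparisons reliable.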

The **_htmlToCsvUnfiltered(year)_** function is the main function that transforms the HTML files into CSV files. It uses **_allplayersonLAN_** as a helper function to produce a dictionary, which is converted to a DataFrame and exported as a .csv. These .csv files are also placed in 'unfilteredplayers', alongside the .html files.

In [None]:
def htmlToCsvUnfiltered(year): 
    playerlist = allplayersonLAN(year)
    tab = pd.DataFrame.from_dict(playerlist,orient='index',columns=['Webpage'])
    tab.to_csv('unfilteredplayers/players' + str(year) + '.csv')
    print(str(year) + "-Number of unfiltered players: " + str(len(playerlist)))
    return

# for year in np.arange(2016,2020):
#     htmlToCsvUnfiltered(year)

## Did a player make HLTV's Top 20?

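**_aggregateTop20()_**, carried over from Section 2, produces a DataFrame of the Top 20 lists for the four years, with one column per year and one row per rank. It relies on **_getTop20(year, index)_**, which scrapes a single year's Top 20 introduction article.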

In [None]:
def getTop20(year, index):
    url = 'https://www.hltv.org/news/' + str(index) + '/top-20-players-of-' + str(year) + '-introduction'
    src = requests.get(url).text
    soup = BeautifulSoup(src,'html.parser')
    arr = np.array([])
    
    if year in [2016, 2017]:
        for i in soup.find_all('tr'):
            arr = np.concatenate((arr,[i.text.split()[2][1:-1]]))
    elif year in [2018, 2019]:
        top20 = soup.find_all('blockquote')[1].text.strip()
        # Raw string so that \. is passed to the regex engine as a literal dot
        top20list = re.compile(r"[0-9]+\.  ").split(top20)
        for player in top20list:
            if not len(player) <= 1:
                arr = np.concatenate((arr,[player.split('"')[1]]))
    else:
        print('Error')
    return arr


def aggregateTop20():  
    top20indices = {2016: 19558, 2017: 22348, 2018: 25735, 2019: 28749}

    top20 = np.array([])
    for year in top20indices:
        top20 = np.concatenate((top20,getTop20(year, top20indices[year]))) 
        time.sleep(1)
    df = pd.DataFrame(top20.reshape(4,20).swapaxes(0,1), columns=list(top20indices.keys()))
    df.index = np.arange(1,21)
    return df

top20DF = aggregateTop20()
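
The 2018/2019 branch of `getTop20` splits the list text on its rank markers and then keeps the nickname between double quotes. A minimal sketch of that step follows; the `sample` string and its names are made up, and the `rank.  First "nick" Last` layout is an assumption inferred from the pattern used above.

```python
import re

# Made-up text in the assumed 'rank.  First "nick" Last' layout
sample = '1.  Alice "alpha" Smith 2.  Bob "bravo" Jones'

# Splitting on the rank markers leaves one chunk per player, plus an empty leading chunk
chunks = re.split(r"[0-9]+\.  ", sample)

# Keep the nickname between the double quotes, skipping the empty chunk
nicknames = [chunk.split('"')[1] for chunk in chunks if len(chunk) > 1]
print(nicknames)
```

Because the pattern expects exactly two spaces after the dot, a change in the article's whitespace would leave the text unsplit, so it is worth checking that 20 names come back for each year.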

**_checkIfTop20(row,top20)_** is meant to be applied across a DataFrame of all the players. It takes a DataFrame row and a Series containing the Top 20 for the relevant year, and returns whether the row's player appears in that Series. An example is shown below with the players from 2019.

In [None]:
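def checkIfTop20(row,top20):
    # row.name is the player's name (the DataFrame index). Compare it literally
    # rather than as a regex, so prefixes and special characters cannot match.
    return bool(top20.eq(str(row.name)).any())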

df = pd.read_csv('unfilteredplayers/players2019.csv',index_col=0)
df.insert(df.shape[1],'HLTV Top 20',df.apply(checkIfTop20,axis=1,args=(top20DF.loc[:,2019],)))
df[df['HLTV Top 20']]
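
One thing worth keeping in mind when comparing names: pandas' `Series.str.match` treats its argument as a regular expression anchored only at the start of each string, so a short name can match a longer one. The sketch below contrasts it with an exact comparison and with a vectorized membership test; all nicknames are made up.

```python
import pandas as pd

# Made-up nicknames standing in for one year's Top 20 column
top20 = pd.Series(['alpha', 'bravo'])

# str.match anchors the pattern at the start only, so the shorter name 'al' matches 'alpha'
prefix_hit = bool(top20.str.match('al').any())

# An exact comparison does not match the shorter name
exact_hit = bool(top20.eq('al').any())

# Vectorized membership test over a whole index of player names
names = pd.Index(['alpha', 'al', 'charlie'])
flags = names.isin(top20)
print(prefix_hit, exact_hit, list(flags))
```

`Index.isin` is one possible alternative to a row-wise `apply` here; it gives the same flags in a single call when exact matches are wanted.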

## Features

Feature extraction is based on three tabs of a player's page: Overview, Individual, and Matches. A total of 24 statistics are used as features. The map counts are collected in addition, since they are needed to weight the ratings.

**_getRating(sp,a)_** is a helper function that reads a player's rating against top 5, 10, 20, 30, and 50 opponents and writes it into the stats array. It also records the map count against each of the five opponent tiers so that the ratings can be scaled. **_getFromOverview(url,arr)_** is the main function that fills in a stats array with the statistics that can be taken from the player's overview page.

Ratings and map counts (index 0 is filled by **_getFromOverview_**, indices 1 to 10 by **_getRating()_**):
- Rating Avg (0): Rating 2.0 averaged across all maps played in year
- Rating 5 (1): Rating 2.0 against top 5 opponents
- Rating 10 (2): Rating 2.0 against top 10 opponents
- Rating 20 (3): Rating 2.0 against top 20 opponents
- Rating 30 (4): Rating 2.0 against top 30 opponents
- Rating 50 (5): Rating 2.0 against top 50 opponents
- Maps 5 (6): Maps against top 5 opponents
- Maps 10 (7): Maps against top 10 opponents
- Maps 20 (8): Maps against top 20 opponents
- Maps 30 (9): Maps against top 30 opponents
- Maps 50 (10): Maps against top 50 opponents

The rest:
- ADR (11): Average Damage per round
- KPR (12): Kills per round
- DPR (13): Deaths per round
- APR (14): Assists per round
- IMPACT (15): Quantitatively measures multikills, clutches, first bloods
- KAST (16): % of rounds in which a player had a kill, assist, survived, or was traded
- Grenade dmg (17): Grenade damage per round
- HS% (18): % of kills that were headshots
- Rounds played (19)

In [1]:
def getRating(sp,a):
    ratingsTab = sp.find('div',class_='featured-ratings-container')
    scales = {'vs top 5 opponents': 1,'vs top 10 opponents':2,'vs top 20 opponents':3,
                       'vs top 30 opponents':4,'vs top 50 opponents': 5}

    for rating in ratingsTab.find_all('div',class_='rating-breakdown'):
        ratingtype = rating.find('div',class_='rating-description').text
        mapcount = int(rating.find('div',class_='rating-maps').text[1:-1].split()[0])
        if mapcount == 0:
            continue 
        
        temp = float(rating.find('div',class_='rating-value').text)
        a[scales[ratingtype]] = temp  
        a[scales[ratingtype] + 5] = mapcount
    return a


def getFromOverview(url, arr):
    src = requests.get(url).text
    soup = BeautifulSoup(src,'lxml')
    statsIndex = {'Rating': 0, 'Rating5': 1,'Rating10': 2,'Rating20': 3, 'Rating30': 4, 'Rating50': 5,              
                  'ADR': 11, 'KPR': 12, 'DPR': 13, 'Assists / round': 14, 'Impact': 15, 'KAST': 16, 'Grenade dmg / Round': 17,
                  'Headshot %': 18, 'Rounds played': 19}
    
    # Rating 2.0, ADR, KPR, DPR, KAST, Impact
    for i in soup.find_all('div',class_=re.compile('summaryStatBreakdown ')):
        statname = i.find('div',class_='summaryStatBreakdownSubHeader').text.split()[0]
        if statname in statsIndex:
            if statname == 'KAST':
                arr[statsIndex[statname]] = i.find('div',class_='summaryStatBreakdownDataValue').text[:-1]
            else:
                arr[statsIndex[statname]] = i.find('div',class_='summaryStatBreakdownDataValue').text
    
    # APR, Grenade dmg / round, HS%, Rounds played
    for i in soup.find_all('div',class_='stats-row'):
        if i.text.find('Grenade dmg / Round') > -1:
            arr[statsIndex['Grenade dmg / Round']] = i.find_all('span')[1].text
        elif i.text.find('Assists / round') > -1:
            arr[statsIndex['Assists / round']] = i.find_all('span')[1].text
        elif i.text.find('Headshot %') > -1:
            arr[statsIndex['Headshot %']] = i.find_all('span')[1].text[:-1]
        elif i.text.find('Rounds played') > -1:
            arr[statsIndex['Rounds played']] = i.find_all('span')[1].text
                
        
    # Rating Scale
    arr = getRating(soup, arr)
    return arr
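
The index arithmetic in `getRating` places each tier's rating in slots 1 to 5 and the matching map count five slots later. A small offline sketch of that layout is shown below; `parse_map_count` is a helper name introduced only for this example, and the `'(34 maps)'` text format is an assumption based on the slicing used above.

```python
# Slot of each opponent tier's rating; the matching map count sits 5 slots later
scales = {'vs top 5 opponents': 1, 'vs top 10 opponents': 2, 'vs top 20 opponents': 3,
          'vs top 30 opponents': 4, 'vs top 50 opponents': 5}

def parse_map_count(text):
    # Strip the surrounding parentheses and keep the leading integer
    return int(text[1:-1].split()[0])

stats = [0.0] * 11
# Hypothetical scraped values for one tier
tier, rating_text, maps_text = 'vs top 10 opponents', '1.12', '(34 maps)'
stats[scales[tier]] = float(rating_text)
stats[scales[tier] + 5] = parse_map_count(maps_text)
print(stats)
```

Tiers with zero maps are skipped in `getRating`, so their rating and map-count slots keep the initial value of 0.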


**_getFromIndividual(url,arr)_** is the main function that fills in a stats array with the statistics that can be taken from the player's Individual page. 

- Rounds with kills (20)
- k-d diff (21): Kills minus Deaths
- Opening kill ratio (22)
- Opening kill rating (23)
- Team win percent after 1st kill (24)
- First kill in won rounds (25)
- Kills (26)
- Deaths (27)


In [None]:
def getFromIndividual(url, arr):
    src = requests.get(url).text
    soup = BeautifulSoup(src,'lxml')
    statsIndex = {'Rounds with kills': 20, 'Kill - Death difference': 21, 'Opening kill ratio': 22, 'Opening kill rating': 23,
                  'Team win percent after first kill': 24, 'First kill in won rounds': 25, 'Kills': 26, 'Deaths': 27}
    for i in soup.find_all('div',class_='stats-row'):
        statname = i.span.text
        if statname in statsIndex:
            if statname == 'Kill - Death difference':
                arr[statsIndex[statname]] = i.span.next_sibling.next_sibling.text
            elif statname == 'Team win percent after first kill':
                arr[statsIndex[statname]] = i.span.next_sibling.text[:-1]
            elif statname == 'First kill in won rounds':
                arr[statsIndex[statname]] = i.span.next_sibling.text[:-1]
            else:
                arr[statsIndex[statname]] = i.span.next_sibling.text
                
    return arr

**_getFromMatches(url,arr)_** is the main function that fills in a stats array with the statistics that can be taken from the player's Matches page. 

- % of maps with 1+ rating (28)

In [None]:
def getFromMatches(url, arr):
    src = requests.get(url).text
    soup = BeautifulSoup(src,'lxml')
    statsIndex = {'Maps with 1+ rating': 28}
    for i in soup.find_all('div',class_='col'):
        desc = i.find('div',class_='description')
        # Skip columns without a description; calling .text on None would raise
        if desc is None:
            continue
        statname = desc.text
        if statname in statsIndex:
            arr[statsIndex[statname]] = i.find('div',class_='value').text[:-1]
            break
    return arr

**_getFeatures(row)_** is applied across an entire DataFrame and uses the three functions defined above. It takes in a row and returns a numpy array that contains the statistics. 

In [None]:
def getFeatures(row):
    statsarr = np.zeros((29,))
    
    print("Working on:" + str(row.name))
    
    overvUrl = row['Webpage']
    statsarr = getFromOverview(overvUrl,statsarr)
    time.sleep(7.5)
    # The first 31 characters are the fixed prefix 'https://hltv.org/stats/players/'
    indivUrl = overvUrl[:31] + 'individual/' + overvUrl[31:]
    statsarr = getFromIndividual(indivUrl,statsarr)
    time.sleep(7.5)
    matchUrl = overvUrl[:31] + 'matches/' + overvUrl[31:]
    statsarr = getFromMatches(matchUrl,statsarr)
    time.sleep(7.5)
    return statsarr
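
The Individual and Matches URLs are derived from the overview URL by inserting a path segment after the fixed prefix. The sketch below shows the slicing on a hypothetical overview URL (the player id, name, and query string are made up) and checks that the hard-coded index 31 corresponds to the prefix length.

```python
# Hypothetical overview URL in the form produced by allplayersonLAN
overview = 'https://hltv.org/stats/players/1/alpha?startDate=2019-01-01&endDate=2019-12-31'

prefix = 'https://hltv.org/stats/players/'
# The slice index used above equals the length of this fixed prefix
assert len(prefix) == 31

individual = overview[:len(prefix)] + 'individual/' + overview[len(prefix):]
matches = overview[:len(prefix)] + 'matches/' + overview[len(prefix):]
print(individual)
print(matches)
```

Deriving the index from the prefix, rather than hard-coding it, would keep the slicing correct if the base URL ever changed, for example to include `www.`.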

Since applying **_getFeatures_** returns a Series of 29-element numpy arrays, this nested structure needs to be flattened into a two-dimensional array. **_flattenFeatures(s,df)_** takes the Series produced by the applied function and combines it with the existing DataFrame from the end of section 3-1. The Series is first converted into a numpy array of arrays, which numpy's **stack** function turns into a single 2-D array with one row per player. The result is then concatenated with the original DataFrame, which holds the player names, HLTV URLs, and Top 20 flag.

In [None]:
def flattenFeatures(s,df):
    flat = np.stack(s.to_numpy())
   
    featureNames = np.array(['Rating','RatingV5','RatingV10','RatingV20','RatingV30','RatingV50',
                             'MapsV5','MapsV10','MapsV20','MapsV30','MapsV50',
                             'ADR','KPR','DPR','APR', 'Impact','KAST','NadeDPR','HS%','Rounds played',
                             'Rounds with kills', 'K-D Diff', 'Opening kill ratio', 'Opening kill rating',
                             'Team win % after 1st kill', 'First kill in won rounds','Kills','Deaths',
                             '% of maps with 1+ rating'])
    features = pd.DataFrame(data=flat, columns=featureNames)
    features.index = df.index
    
    final = pd.concat([df,features],axis=1)
    return final
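
The flattening step can be seen on a toy example. In this sketch the two players, the three-element arrays, and the column names `f0` to `f2` are all made up; the real arrays have 29 elements.

```python
import numpy as np
import pandas as pd

# Toy frame with two players and a Series holding one array per player
df = pd.DataFrame({'Webpage': ['u1', 'u2']}, index=['alpha', 'bravo'])
nested = pd.Series([np.arange(3.0), np.arange(3.0) + 10], index=df.index)

# Stack the per-player arrays into one 2-D array: one row per player
flat = np.stack(nested.to_numpy())

# Wrap the array in a DataFrame that shares the original index, then join column-wise
features = pd.DataFrame(flat, columns=['f0', 'f1', 'f2'], index=df.index)
combined = pd.concat([df, features], axis=1)
print(combined.shape)
```

Passing the shared index when building `features` is what lets `pd.concat` line the rows up by player rather than by position.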

**_extractFinalFeatures(year)_** is the main function for feature extraction. It takes in a year and rebuilds the top20DF from section 2 by calling **_aggregateTop20_**. It reads in the DataFrame saved in section 3-1 and adds a column indicating whether each player made that year's HLTV Top 20. It then applies **_getFeatures_** to produce the Series of numpy arrays that contains all the stats. That Series is passed along with the DataFrame to **_flattenFeatures_** to produce a combined DataFrame with all the features necessary for analysis, which is saved as a .csv in the 'features' folder.

In [None]:
def extractFinalFeatures(year):
    top20DF = aggregateTop20()
    # NOTE: .iloc[:5,:] limits the run to the first five players; drop it to process every player
    df = pd.read_csv('unfilteredplayers/players' + str(year) + '.csv',index_col=0).iloc[:5,:]
    df.insert(df.shape[1],'HLTV Top 20',df.apply(checkIfTop20,axis=1,args=(top20DF.loc[:,year],)))
    
    statsonly = df.apply(getFeatures,axis=1)
    
    finalTab = flattenFeatures(statsonly,df)
    
    finalTab.to_csv('features/meme'+str(year)+'.csv') 
    
    return

# for year in [2016,2017,2018,2019]:
#     extractFinalFeatures(year)