# Scraping to Sentiment

> The aim of this notebook is to demonstrate how match reports can be converted into database records that store the sentiment expressed about the football players mentioned in each report.

### Steps Involved

> 1. Collect links from landing URL
2. Get HTML from all match report links
2. Convert HTML to clean text
3. Remove Stopwords / Lemmatize 
4. Tokenize the sentence
5. Predict sentiment at sentence level
6. Identify subject of sentence
7. Identify other tags
8. Store record if there is a match with player names
9. Convert records into sentiment score
10. Store sentiment scores
11. Allow user to search for sentiment scores

### Notes

> - Removing stopwords and lemmatizing proved unproductive and skewed the results, so the code is still in the notebook but unused
- In future, aspect-level sentiment analysis should perform better than sentence-level analysis

### Modules Used

> - requests
- BeautifulSoup
- time
- re
- NLTK
- stopwords (custom module - available in zip folder)
- pickle
- fullplayernames (custom module - available in zip folder)
- spacy
- pandas

-----

## Step 1 - Collect Links from landing URL

> Collecting all match reports from page one of the Mirror match reports site
- https://www.mirror.co.uk/sport/football/match-reports/

In [67]:
# Declare the landing URL
url = "https://www.mirror.co.uk/sport/football/match-reports/"

In [68]:
# Import Requests
import requests

In [69]:
# Declare Headers - Will need to change for other users

the_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36; Andy Clarke/clarkeaj3@cardiff.ac.uk"}



#### Make the landing page request 

In [70]:
landing_page = requests.get(url, headers=the_headers, timeout=5)

In [71]:
landing_page.status_code

200

#### Make Beautiful Soup object from landing page

In [72]:
# Import Beautiful Soup 
from bs4 import BeautifulSoup

In [73]:
# Make BS Object
soup = BeautifulSoup(landing_page.text, 'html.parser')

#### Find all headline links from landing page

In [81]:
# Declare list of links
links = []

#### Loop through < a > elements with class of headline and add href to list

> Note: The `headline` class is specific to match reports from the Mirror and is unlikely to be the same for other sources

In [82]:
for a in soup.find_all('a', {'class': 'headline'}):
    if "match-reports" in str(a['href']):
        links.append(a['href'])
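The `href` values collected above are passed straight to `requests.get` in the next step, so a relative path or a repeated headline would cause a failed or duplicate request. A minimal sketch of one way to normalise them first. The helper name `normalise_links` and the example `href` values are hypothetical and not part of the notebook:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.mirror.co.uk/sport/football/match-reports/"

def normalise_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against the landing URL, keep match reports, drop duplicates."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    reports = [url for url in absolute if "match-reports" in url]
    # dict.fromkeys keeps the first occurrence of each URL and preserves order
    return list(dict.fromkeys(reports))

# Hypothetical hrefs, for illustration only
example_hrefs = [
    "https://www.mirror.co.uk/sport/football/match-reports/example-report-1",
    "/sport/football/match-reports/example-report-2",
    "https://www.mirror.co.uk/sport/football/match-reports/example-report-1",
    "/sport/football/news/example-news-story",
]
example_links = normalise_links(example_hrefs)
```

Here `example_links` holds two absolute URLs: the duplicate and the non-report link are dropped.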

-----

## Step 2 - Get HTML from all links

#### Note: 
> - The loop sends one request per link in the list
- A timer pauses for 5 seconds after each response before the next request is sent
- It takes approx. 2 mins to collect the HTML from all links

In [92]:
from time import sleep

In [103]:
# Declare list of html & counter
link_html = []
link_counter = 1

#### Loop through links and store HTML - _2 Mins!!_
> There are 24 links to retrieve

In [104]:
for link in links:
    
    # Make the request
    page = requests.get(link, headers=the_headers, timeout=5)
    
    # Check Status Code before continuing 
    if page.status_code == 200:
        
        # Make soup object from response and append to list
        link_html.append(BeautifulSoup(page.text, 'html.parser'))
        
    # If status code doesn't look good - stop the loop!
    else: 
        print("Error: " + str(page.status_code))
        break
        
    # Sleep for 5 secs before making the next request
    print("Link number " + str(link_counter) + " complete")
    link_counter += 1
    sleep(5) 

Link number 1 complete
Link number 2 complete
Link number 3 complete
Link number 4 complete
Link number 5 complete
Link number 6 complete
Link number 7 complete
Link number 8 complete
Link number 9 complete
Link number 10 complete
Link number 11 complete
Link number 12 complete
Link number 13 complete
Link number 14 complete
Link number 15 complete
Link number 16 complete
Link number 17 complete
Link number 18 complete
Link number 19 complete
Link number 20 complete
Link number 21 complete
Link number 22 complete
Link number 23 complete
Link number 24 complete
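The loop above stops at the first response that is not a 200, so every later link is lost. A sketch of an alternative that records the failure and carries on. The HTTP call is passed in as a function so the idea can be tried without network access. `fetch_pages`, `FakeResponse` and `fake_get` are hypothetical names used only for this illustration:

```python
from time import sleep

def fetch_pages(links, get, delay=5):
    """Fetch each link in turn, pausing between requests.

    `get` is any callable returning an object with `status_code` and `text`.
    Returns (pages, failures): the text of each successful response, and
    (link, status_code) pairs for the links that did not return 200.
    """
    pages, failures = [], []
    for number, link in enumerate(links, start=1):
        response = get(link)
        if response.status_code == 200:
            pages.append(response.text)
        else:
            failures.append((link, response.status_code))
        # Wait before the next request, but not after the last one
        if number < len(links):
            sleep(delay)
    return pages, failures


class FakeResponse:
    """Stand-in for a real HTTP response, for illustration only."""
    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text

def fake_get(link):
    # Pretend that any link containing "missing" returns a 404
    if "missing" in link:
        return FakeResponse(404)
    return FakeResponse(200, "<p>report for " + link + "</p>")

pages, failures = fetch_pages(["a", "missing", "b"], fake_get, delay=0)
```

In the notebook, `get` could be `lambda link: requests.get(link, headers=the_headers, timeout=5)`. The sketch does not handle timeouts or connection errors.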


-----

## Step 2 - Convert HTML to Clean Text

#### Define function to clean HTML

> Function will: 
- Remove tag characters "<...>"
- Remove links
- Return a string

In [1361]:
# Import Regular Expressions
import re

In [1362]:
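def clean_text(html):
    # Remove anything that looks like a tag, e.g. <p> or </a>
    result = re.sub(r"<.+?>", "", str(html))
    # Restore escaped apostrophes
    result = re.sub(r"&amp;apos", "'", result)
    # Remove the list comma that follows a full stop
    result = re.sub(r"\.,", ".", result)
    # Trim the surrounding list brackets
    result = result[1:-1]
    return result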
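To make the cleaning steps concrete, here is a small worked example. The input mimics the string form of a list of `<p>` elements, which is what the function receives later in this step. The sample text is made up, `clean_text_sketch` is a stand-alone copy for illustration, and the simplified tag pattern `<.+?>` is an assumption of this sketch:

```python
import re

def clean_text_sketch(html):
    result = re.sub(r"<.+?>", "", str(html))    # strip anything that looks like a tag
    result = re.sub(r"&amp;apos", "'", result)  # restore escaped apostrophes
    result = re.sub(r"\.,", ".", result)        # drop the list comma after a full stop
    return result[1:-1]                         # trim the list's surrounding brackets

# Made-up sample in the shape of str(page.find_all('p'))
sample = "[<p>Bayern won 1-0.</p>, <p>Kimmich scored.</p>]"
cleaned = clean_text_sketch(sample)
print(cleaned)  # Bayern won 1-0. Kimmich scored.
```

A paragraph that does not end in a full stop keeps its list comma, for example `"[<p>Bayern won</p>, <p>Kimmich scored.</p>]"` becomes `"Bayern won, Kimmich scored."`.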

#### Find < p > elements and clean the text

> Note: Taking the article text from `<p>` elements is specific to match reports from the Mirror and is unlikely to be the same for other sources

In [1363]:
# Declare clean text list
clean_text_list = []

In [1364]:
# Loop through HTML pages - Find p elements and clean text before appending to list
for page in link_html:
    p_elements = str(page.find_all('p'))
    clean_text_list.append(clean_text(p_elements))

-----

## Step 3 - Remove Stopwords and Lemmatize

> This step will run a loop to further prepare the text for analysis
>> Steps involved:
- Removing stopwords
- Lemmatizing words
- Replenishing the place of punctuation in the text

> _Note_: This step might be unproductive 
- If that is the case - comment out the code below

#### Imports to remove stopwords

In [1365]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# Import customised list of stopwords
import stopwords

#### Function to put '.' & ',' back in place

In [1366]:
# Function to replenish full stops and commas
def punctuation_replenish(text):
    r = re.sub(r" \.", ". ", str(text))
    r = re.sub(r" ,", ", ", r)
    return r

#### Declare new list to hold results & perform loop

In [1367]:
lemmatized_results = []

In [1368]:
# Loop through result text - remove stopwords - lemmatize and replenish punctuation
# The lemmatizer and stopword list are built once rather than once per word
lemmatizer = WordNetLemmatizer()
stop_words = stopwords.stopwords()
for article in clean_text_list:
    lemmatized_result = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(article) if word not in stop_words])
    lemmatized_results.append(punctuation_replenish(lemmatized_result))
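A stripped-down illustration of the stopword and punctuation steps, without NLTK. This is a sketch under assumptions: the stopword set is a hypothetical four-word stand-in for the custom `stopwords` module, the tokens are written by hand, lemmatization is left out, the comparison is lower-cased, and the punctuation is re-attached without adding an extra space, all of which differ from the notebook's own functions:

```python
import re

# Hypothetical stopword set; the notebook's custom `stopwords` module is not reproduced here
STOP_WORDS = {"the", "a", "was", "and"}

def remove_stopwords(tokens):
    # Lower-casing before the lookup is a choice of this sketch
    return [token for token in tokens if token.lower() not in STOP_WORDS]

def reattach_punctuation(text):
    # Joining tokens with spaces leaves " ." and " ," behind; pull them back onto the word
    text = re.sub(r" \.", ".", text)
    text = re.sub(r" ,", ",", text)
    return text

tokens = ["The", "goal", "was", "late", ",", "and", "Bayern", "won", "."]
result = reattach_punctuation(" ".join(remove_stopwords(tokens)))
print(result)  # goal late, Bayern won.
```

The output shows how much of the sentence's wording is lost, which may be part of why this step skewed the sentiment predictions.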

-----

### Step 4 - Sentence Tokenize

> Split the sentences in the text
- Remove commas from beginning of lines

#### Function to remove commas from sentences

In [1369]:
# Function to remove commas from sentences
def remove_commas(sentences):
    new_sentence_list = []
    for sentence in sentences:
        if sentence[0] == ",":
            new_sentence_list.append(sentence[2:])
        else: 
            new_sentence_list.append(sentence)
    
    return new_sentence_list

#### Create tokens from lemmatized results

In [1370]:
from nltk.tokenize import sent_tokenize

In [1371]:
# Declare list to hold lemmatized tokens
lemmatized_tokens = []

In [1372]:
# Loop through lemmatized sentences, tokenize and remove commas
for article in lemmatized_results:
    sentences = sent_tokenize(article)
    lemmatized_tokens.append(remove_commas(sentences))

#### Create tokens from original sentences

In [1373]:
# Declare list to hold original sentences
original_sentences = []

In [1374]:
# Loop through original (cleaned) articles, tokenize and remove commas
for article in clean_text_list:
    sentences = sent_tokenize(article)
    original_sentences.append(remove_commas(sentences))

-----

### Step 5 - Predict Sentiment

> _Note_: 
- At this stage the text will be put into a dataframe so that it can be associated with its sentiment
- Uses Original Sentences to predict sentiment rather than lemmatized tokens as this produced better results

#### Import Trained Classifier

> The trained classifier is unpickled and imported here


In [1375]:
import pickle

In [1376]:
# Specify Filename
filename = "sentiment_model.sav"

In [1377]:
# Read in model
infile = open(filename,'rb')

In [1378]:
# Save instance of model
model = pickle.load(infile)

In [1379]:
# Close file
infile.close()

#### Vectorize input text 

> ##### Represents the article's text as a bag of words
>> The vectoriser's features are the same as those used to train the model
> - Ngram Range = (1,3)
> - CountVectorizer preferred over TfidfVectorizer
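To show what an n-gram range of (1,3) means for a bag of words, here is a plain-Python sketch that counts every run of one, two and three consecutive words. It uses simple whitespace splitting, so it is only an approximation: scikit-learn's `CountVectorizer` tokenizes differently (by default it ignores single-character tokens and treats punctuation as a separator). `ngram_counts` is a hypothetical helper name:

```python
from collections import Counter

def ngram_counts(sentence, ngram_range=(1, 3)):
    """Count every word n-gram in the sentence, for n within ngram_range (inclusive)."""
    tokens = sentence.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

features = ngram_counts("he scored twice")
```

For this three-word sentence the features are three unigrams, two bigrams and one trigram, each with a count of 1.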

#### Import Vectorizer

In [1380]:
# Specify Filename
filename_vect = "vectorizer.sav"

In [1381]:
# Read in vectorizer
infile_vect = open(filename_vect,'rb')

In [1382]:
# Save instance of vectorizer
vect = pickle.load(infile_vect)

In [1383]:
# Close file
infile_vect.close()
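The open / load / close cells for the model and the vectorizer follow the same pattern, which could be folded into one helper that uses a `with` block so the file is closed even if unpickling fails. A sketch, where `load_pickle` is a hypothetical helper and the round trip uses a stand-in object because the real `.sav` files are not available here:

```python
import os
import pickle
import tempfile

def load_pickle(path):
    """Load one pickled object; the file is closed even if unpickling fails."""
    with open(path, "rb") as infile:
        return pickle.load(infile)

# Round trip with a stand-in object, for illustration only
with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, "example.sav")
    with open(example_path, "wb") as outfile:
        pickle.dump({"ngram_range": (1, 3)}, outfile)
    restored = load_pickle(example_path)
```

In the notebook this would read `model = load_pickle("sentiment_model.sav")` and `vect = load_pickle("vectorizer.sav")`.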

#### Import dictionary of players and teams 

In [1384]:
import fullplayernames

In [1385]:
players = fullplayernames.player_names

In [1386]:
# Declare list of teams
teams = list(players.keys())

#### Function to determine teams playing in a match from first line of article

In [1387]:
def match_identifier(string):
    
    # Define match list
    match = []
    
    # Iterate through teams
    for team in teams:
        # Iterate through sentence and pick out words that match with an item in list of teams
        for word in word_tokenize(string):
            if len(word) > 3 and word in team:
                match.append(team)
        
    
    return str(list(dict.fromkeys(match)))
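A small demonstration of how this matching behaves, including one limitation. It is a sketch rather than the notebook's function: it splits on whitespace instead of using `word_tokenize`, takes the team list as a parameter, returns a list rather than a string, and the team names are hypothetical stand-ins for the keys of the `players` dictionary:

```python
def match_identifier_sketch(text, teams):
    matches = []
    for team in teams:
        for word in text.split():
            # Words of three characters or fewer are ignored, as in the notebook
            if len(word) > 3 and word in team:
                matches.append(team)
    # dict.fromkeys removes repeats while keeping the order of first appearance
    return list(dict.fromkeys(matches))

# Hypothetical team names, for illustration only
example_teams = ["Bayern Munich", "Borussia Dortmund", "Chelsea"]
found = match_identifier_sketch("Bayern Munich beat Borussia Dortmund 1-0", example_teams)

# A shared word matches every team that contains it
west_teams = ["West Ham United", "West Bromwich Albion"]
ambiguous = match_identifier_sketch("West Ham win at home", west_teams)
```

`found` contains the two teams that played, but `ambiguous` contains both "West" teams, because "West" is a substring of each and "Ham" is too short to count.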

#### Function to make the prediction

In [1388]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

> #### Function will:
- Vectorize the sentences
- Apply sentiment function to sentences
- Loops through list of sentences and makes an individual dataframe for each of the articles
- Will add original sentences to dataframe alongside sentiment prediction
- Will identify what teams are involved in each match
- Return list of dataframes

In [1389]:
def sentiment_dataframe():
    
    # Declare list to store all dataframes
    df_list = []
    
    for index, list_of_sentences in enumerate(original_sentences):
    
        # Predictions - Vectorise the sentences and make a sentiment prediction
        predictions = pd.DataFrame(model.predict(vect.transform(list_of_sentences)))

        # Rename sentiment cols
        predictions = predictions.rename(columns={0: "sentiment"})

        # Add original sentences
        predictions['sentences'] = original_sentences[index]
        
        # Add match information based on teams in the first two sentences
        # (joined with a space so the words at the boundary do not run together)
        first_sentence = " ".join(list_of_sentences[:2])
        
        match = match_identifier(first_sentence)
        
        # Add matches to df 
        predictions['match'] = match
        
        # Append to df_list
        df_list.append(predictions)
    
    return df_list

#### Concatenate list of dataframes

In [1390]:
df = pd.concat(sentiment_dataframe())

------

### Step 6 - Identify Sentence Subjects

> spaCy provides a way to identify the subject of a sentence based on its relationship to the words around it
  - This will need to become much more sophisticated in order to recognise more detailed information about the sentence and the sentiment contained within it
  - _Note_: 
      - Find a way to perform aspect based sentiment analysis
      - Improve Information extraction 
  
> #### Advantages:
 - Simple
 - Effective 
 - Lightweight 
 
 > #### Disadvantages:
 - Not 100% accurate
 - Struggles to identify two sentence subjects ("Him and her ... ")

In [1391]:
import spacy

In [1392]:
# Declare instance of spacy language processor
# (the 'en' shortcut was removed in spaCy 3, so the model is loaded by its full name)
nlp = spacy.load('en_core_web_sm')

#### Loop to create a list of sentence subjects

> Loop will:
- Find the subject of a sentence 
- Add that list to corresponding row in dataframe

In [1393]:
# Create new column to store sentence subjects
df['sentence_subjects'] = ""

In [1394]:
# Loop - Adding sentence subjects to dataframe column
# (assigned as a whole column, because writing to the row returned by iterrows may not update df)
df['sentence_subjects'] = [
    str([tok for tok in nlp(sentence) if tok.dep_ == "nsubj"])
    for sentence in df['sentences']
]

-------


### Step 7 - Identify Other Tags

##### Note: _Consider changing to identifying just PLAYERS ..._

#### Loop to identify all proper nouns in sentence 
> ##### Loop will
- Find the proper nouns in a sentence 
- Add those proper nouns to corresponding row in df

In [1395]:
# Create new column to store proper nouns
df['proper_nouns'] = ""

In [1396]:
# Loop - Adding proper nouns to dataframe column
# (assigned as a whole column, because writing to the row returned by iterrows may not update df)
df['proper_nouns'] = [
    str([tok for tok in nlp(sentence) if tok.pos_ == "PROPN"])
    for sentence in df['sentences']
]

-----

### Step 8 - Store Record if Match with Player Name 

#### Extract Target Players from Dataframe

- _Note_: Should be possible to provide a much more sophisticated way of identifying target players

#### Re-index the dataframe

In [1397]:
df = df.reset_index(drop=True)

#### Convert 'sentence_subjects' & 'proper_nouns', 'match' columns back to lists

In [1398]:
def create_list(string):
    
    # Remove brackets
    clean_string = string[1:-1]
    
    # Return Split string 
    return clean_string.split(",")

#### Apply list function to columns 

In [1399]:
df['sentence_subjects'] = df['sentence_subjects'].apply(create_list)

In [1400]:
df['proper_nouns'] = df['proper_nouns'].apply(create_list)

In [1401]:
df['match'] = df['match'].apply(create_list)
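It helps to see exactly what `create_list` returns, because the later matching functions work on these raw items. The sketch below repeats the function on made-up strings in the shape the columns hold at this point. `create_clean_list` is a hypothetical variant, not part of the notebook, shown only to highlight the leftover spaces and quotes:

```python
def create_list(string):
    # Remove the surrounding brackets, then split on commas
    return string[1:-1].split(",")

subjects = create_list("[Kimmich, he]")
match = create_list("['Bayern Munich', 'Borussia Dortmund']")
empty = create_list("[]")

def create_clean_list(string):
    # Variant: strip spaces and quotes from each item and drop empty items
    items = (item.strip(" '") for item in string[1:-1].split(","))
    return [item for item in items if item]
```

Every item after the first keeps its leading space (`" he"`), team names keep their quotes, and an empty list becomes `[""]`. Swapping in the clean variant may change how the later matching functions behave, since they were written against the raw items.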

##### Create a new empty column

In [1402]:
df['targets'] = ""

#### Function to create list of players based on match

- Create a list of players belonging to both teams within the match 

In [1403]:
def create_players_list(row):
    
    # Players list 
    list_of_players = [] 
    
    # Add players from every team in the match to one flat list
    for team in row['match']:
        clean = team.strip("' ")
        if clean != "":
            list_of_players.extend(players[clean])
        
    return list_of_players

### _Adding target players_

#### Add Sentence Subjects

##### Function to add sentence subjects
> _Function will_:
- Add a matching player to the targets column of the corresponding dataframe row

In [1404]:
def sentence_subjects(row, list_of_players):
        
    # Sentence Subject List
    subjects = []
    
    # Loop through sentence subjects and match with players
    for item in row['sentence_subjects']:
        for player in list_of_players:
            if str(item) in player and str(item) != "" and str(item) not in stopwords.stopwords() and str(item) != "he" and str(item) != "we":
                subjects.append(player)
                
    return subjects

#### Loop through each row and apply the function

In [1405]:
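for index, row in df.iterrows():
    players_list = create_players_list(row)
    # Write through df.at, because assigning to the row returned by iterrows may not update df
    df.at[index, 'targets'] = str(sentence_subjects(row, players_list))[1:-1]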

#### If 'He' is the sentence subject then find player previously spoken about

##### Function to add target based on mention of 'he', 'his'
> _Function will_:
- Check if 'he' or 'his' is sentence subject
- Return the player this is referring to  

In [1406]:
def previous_subject(row, index, list_of_players):
    
    # Previous players list 
    previous_players = []
    
    # The first row has no previous sentence to refer back to
    if index == 0:
        return previous_players
    
    # Proper nouns from the previous sentence
    previous_nouns = df.iloc[index-1]['proper_nouns']
    
    # Loop through sentence subjects - identify mentions of 'his', 'he' and find matching player
    for item in row['sentence_subjects']:
        if 'he' in item or 'He' in item or 'his' in item or 'His' in item:
            if len(item) < 4:
                for player in list_of_players:
                    if previous_nouns[-1] in player and str(previous_nouns[-1]) != "":
                        previous_players.append(player)
                    elif len(previous_nouns) > 1:
                        if previous_nouns[-2] in player and str(previous_nouns[-2]) != "":
                            previous_players.append(player)
    
    return previous_players
    

#### Loop through each row and apply the function 

In [1407]:
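for index, row in df.iterrows():
    players_list = create_players_list(row)
    # Write through df.at, because assigning to the row returned by iterrows may not update df
    df.at[index, 'targets'] = row['targets'] + str(previous_subject(row, index, players_list))[1:-1]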
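`previous_subject` treats any subject shorter than four characters that contains 'he' as a pronoun, and looks only at the last two proper nouns of the previous sentence. A sketch of a stricter variant of the same idea, working on plain lists so it can be tried without the dataframe. This is not the notebook's function: `resolve_pronoun`, its exact-match pronoun test and the example names are assumptions made for illustration:

```python
def resolve_pronoun(subjects, previous_proper_nouns, players):
    """Return the players named by the most recent proper noun of the previous
    sentence, but only when a subject of the current sentence is 'he' or 'his'."""
    pronouns = {"he", "his"}
    if not any(subject.strip().lower() in pronouns for subject in subjects):
        return []
    # Walk backwards so the most recently mentioned name wins
    for noun in reversed(previous_proper_nouns):
        noun = noun.strip()
        matched = [player for player in players if noun and noun in player]
        if matched:
            return matched
    return []

# Hypothetical inputs in the shape produced by create_list
example_players = ["Joshua Kimmich", "Erling Haaland"]
resolved = resolve_pronoun([" he"], ["Kimmich", " Bayern"], example_players)
```

Here `resolved` is `["Joshua Kimmich"]`: "Bayern" matches no player, so the search falls back to "Kimmich". A subject such as "the" is not treated as a pronoun in this variant.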
    

#### If target still empty and player is only one spoken about in sentence - add target

##### Function to add target based on mention of only one player as a proper noun
> _Function will_:
- Check if there is only one player mentioned as a proper noun 
- Return the player this is referring to  

In [1408]:
def single_mention(row, list_of_players):
    
    # Single mention players list
    mentioned_players = []
    
    if row['targets'] == "":
        for player in list_of_players:
            for word in row['proper_nouns']:
                if word in player and word != "":
                    mentioned_players.append(player)
      
    # Remove duplicate values & check for nested lists
    if any(isinstance(i, list) for i in mentioned_players):
        mentioned_players = mentioned_players[0]
    
    mentioned_players = list(dict.fromkeys(mentioned_players))
   
    # If there is only one mentioned player - return them
    if len(mentioned_players) == 1:
        return mentioned_players
    else:
        return ""

    
    

#### Loop through each row and apply the function

In [1409]:
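for index, row in df.iterrows():
    players_list = create_players_list(row)
    # Write through df.at, because assigning to the row returned by iterrows may not update df
    df.at[index, 'targets'] = row['targets'] + str(single_mention(row, players_list))[1:-1]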

#### View final records Dataframe

In [1411]:
#df

------

### Step 9 - Convert Records to Sentiment Score

#### Calculating the Sentiment Score

 - Should develop a more sophisticated way of deciding the sentiment score of a player

> Variables currently available: 
- Number of positive reviews 
- Number of negative reviews
- Number of neutral reviews 
- Total number of reviews 

> ##### Simple Method 
- Positive reviews as a proportion of total positive & negative reviews

`N positive reviews / (N positive reviews + N negative reviews)`
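The simple method can be checked by hand with a few counts. A minimal sketch, where `simple_score` is a hypothetical stand-alone helper that mirrors the calculation used in this notebook and returns 0 when a player has neither positive nor negative reviews:

```python
def simple_score(positive, negative):
    """Share of positive sentences among positive and negative ones; 0 if there are neither."""
    decided = positive + negative
    return positive / decided if decided > 0 else 0

print(simple_score(2, 0))  # 1.0
print(simple_score(3, 1))  # 0.75
print(simple_score(0, 0))  # 0
```

Neutral reviews never enter the calculation, so a player with 2 positive, 0 negative and 7 neutral reviews scores 1.0, while a player with only neutral reviews scores 0, the same as a player with only negative ones.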

#### Function to calculate a player's sentiment score

In [1412]:
def sentiment_score(player):
    
    # Declare player dictionary - to be used as row in dataframe
    player_dictionary = {}
    
    # Begin positive, negative and neutral counts at 0
    positive = 0
    negative = 0
    neutral = 0
    
    
    # Loop through dataframe and total up positive | negative scores
    for index, row in df.iterrows():
        if player in row['targets']:
            if row['sentiment'] == "POSITIVE":
                positive += 1
            elif row['sentiment'] == "NEGATIVE":
                negative += 1
            elif row['sentiment'] == "NEUTRAL":
                neutral += 1
    
    # Calculate Player Score
    player_score = 0
    if (positive + negative) > 0:
        player_score += positive / (positive + negative)
    else: 
        player_score = positive
    
    # Make Dictionary Entries
    player_dictionary['Name'] = player
    player_dictionary['Score'] = player_score
    player_dictionary['N_positive_reviews'] = positive
    player_dictionary['N_negative_reviews'] = negative
    player_dictionary['N_neutral_reviews'] = neutral
    player_dictionary['Total_reviews'] = positive + negative + neutral
    
    
    return player_dictionary

#### Split target players into a list within the DF

In [1413]:
# Function to be applied to all rows[target] in df
def split_players(string):
    
    # Split the players
    split = string.split("'")
    
    # Final list
    clean_list = []
    
    # Only return entries with alphabetic characters in 
    for entry in split:
        if re.search(r'[a-zA-Z]+', entry):
            clean_list.append(entry)
            
    # Remove duplicates on return 
    return list(dict.fromkeys(clean_list))

In [1414]:
# Apply Function to targets column
df['targets'] = df['targets'].apply(split_players)

#### Make a list of all unique target players

In [1415]:
target_players = []

In [1416]:
for index, row in df.iterrows():
    for player in row['targets']:
        if player not in target_players:
            target_players.append(player)

#### Apply sentiment score to all target players

In [1417]:
list_of_sentiment_scores = []

In [1418]:
for player in target_players:
    list_of_sentiment_scores.append(sentiment_score(player))
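`sentiment_score` walks the whole dataframe once per player. With many players it may be quicker to tally every player in a single pass. A sketch of that idea on plain Python data. `tally_sentiment` and the example records are hypothetical, and the sentiment labels are the ones used by the classifier above:

```python
from collections import Counter, defaultdict

def tally_sentiment(records):
    """records: iterable of (sentiment, targets) pairs, one pair per sentence.
    Returns a dict mapping each player to a Counter of sentiment labels."""
    tallies = defaultdict(Counter)
    for sentiment, targets in records:
        for player in targets:
            tallies[player][sentiment] += 1
    return tallies

# Hypothetical records, for illustration only
example_records = [
    ("POSITIVE", ["Player A"]),
    ("NEUTRAL", ["Player A", "Player B"]),
    ("NEGATIVE", ["Player B"]),
    ("NEUTRAL", []),
]
tallies = tally_sentiment(example_records)
```

In the notebook the pairs could come from `zip(df['sentiment'], df['targets'])`, and each player's Counter supplies the positive, negative and neutral counts for the score.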

-------

### Step 10 - Store sentiment scores

#### Create a dataframe from all of the sentiment scores

In [1419]:
sentiment_scores = pd.DataFrame(list_of_sentiment_scores)

#### Reorder the cols 

In [1420]:
sentiment_scores = sentiment_scores[["Name", "Score", "Total_reviews", "N_positive_reviews", "N_negative_reviews", "N_neutral_reviews"]]

In [1421]:
sentiment_scores

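| | Name | Score | Total_reviews | N_positive_reviews | N_negative_reviews | N_neutral_reviews |
|---|---|---|---|---|---|---|
| 0 | Joshua Kimmich | 1.0 | 3 | 1 | 0 | 2 |
| 1 | Robert Lewandowski | 1.0 | 3 | 1 | 0 | 2 |
| 2 | Jadon Sancho | 0.0 | 2 | 0 | 0 | 2 |
| 3 | Alphonso Davies | 1.0 | 2 | 1 | 0 | 1 |
| 4 | Erling Haaland | 1.0 | 6 | 2 | 0 | 4 |
| 5 | Benjamin Pavard | 0.0 | 2 | 0 | 0 | 2 |
| 6 | Manuel Neuer | 1.0 | 3 | 2 | 0 | 1 |
| 7 | Ivan Perisic | 1.0 | 3 | 3 | 0 | 0 |
| 8 | Julian Brandt | 1.0 | 1 | 1 | 0 | 0 |
| 9 | Torgan Hazard | 1.0 | 2 | 1 | 0 | 1 |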


----

### Step 11 - Allow User to search for player

#### Function to allow df search

In [1425]:
def search(player_name):
    
    # Return the first player whose name contains the search term
    for index, row in sentiment_scores.iterrows():
        if player_name in row["Name"]:
            return pd.DataFrame(row)
    
    # No player matched
    return "Sorry ... There is no player in the dataset that matches your search"

#### Search for any player in the dataset using the 'search' function
> `search("Mo Salah")`

In [1426]:
search("Billy")

| | 71 |
|---|---|
| Name | Billy Gilmour |
| Score | 1 |
| Total_reviews | 9 |
| N_positive_reviews | 2 |
| N_negative_reviews | 0 |
| N_neutral_reviews | 7 |
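The search above is case-sensitive and returns only the first match. A sketch of a more forgiving variant that ignores case and returns every match, written against a plain list of dictionaries so it runs without pandas. `search_scores` is a hypothetical helper and the records are a small excerpt shaped like the rows of `sentiment_scores`:

```python
def search_scores(records, player_name):
    """Return every record whose name contains player_name, ignoring case."""
    needle = player_name.lower()
    return [record for record in records if needle in record["Name"].lower()]

# Small excerpt for illustration
example_scores = [
    {"Name": "Joshua Kimmich", "Score": 1.0, "Total_reviews": 3},
    {"Name": "Jadon Sancho", "Score": 0.0, "Total_reviews": 2},
]

one_match = search_scores(example_scores, "kimmich")
two_matches = search_scores(example_scores, "j")
no_match = search_scores(example_scores, "Salah")
```

With the dataframe itself, a similar filter could be written as `sentiment_scores[sentiment_scores["Name"].str.contains(player_name, case=False, regex=False)]`.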


-----

### Evaluation

> ##### Processes to improve:
- Information extraction: A better way to identify players as the subject of sentences 
- Sentiment analysis: More accurate results
- Tying together information extraction and sentiment analysis to provide more precise results