# Expanding the Process 

__Total time taken to run notebook__: 10 - 15 mins

## Notebook Aim:
> - Sketch out the process of moving from scraping to database entry
> - Implement a modular file structure

## Processes:

> #### 1. <a href="#Step1">Scrape Articles - Process Text</a>
> #### 2. <a href="#Step2">Identify Targets</a>
> #### 3. <a href="#Step3">Predict Sentiment</a>
> #### 4. <a href="#Step4">Calculate Sentiment Score</a>

---

 <section id="Step1"><h1>Scrape Articles - Process Text</h1></section>

The following steps are applied by functions within local modules. They are called together using the function `article_scraper.scrape()` which produces a dataframe:
>`media_source | url | date | original_sentence | phrase | lemmatized_no_stopwords_phrase | teams_described |`

### Steps Involved:
> 1. The spider will loop through the landing urls stored in the sources dictionary
> 2. It will extract the article urls that match the specific landing_characteristics described in the sources dictionary
> 3. It will then return a list of article dictionaries containing the new article urls found at each source url, skipping any article url that has already been accessed
> 4. Next each new article url is crawled
> 5. The text is gathered from each article - according to specific article_characteristics described in the sources dictionary
> 6. The text is then processed: cleaned, stopwords removed, teams identified, lemmatized and split into phrases
> 7. The output is a dataframe for each article with the structure outlined above
> 8. Before returning, the dataframes are concatenated
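
Step 3 (returning only article urls that have not already been accessed) can be sketched as below. This is a minimal illustration, not the code in `modules.article_scraper`: the function names, the `accessed_articles.pkl` file name and the `example.com` urls are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path


def load_accessed(path):
    """Return the set of article urls already scraped (empty on the first run)."""
    path = Path(path)
    if not path.exists():
        return set()
    with path.open("rb") as f:
        return set(pickle.load(f))


def filter_new_articles(candidates, path):
    """Keep only articles whose url has not been accessed, then record them."""
    accessed = load_accessed(path)
    new_articles = []
    for article in candidates:
        if article["url"] not in accessed:
            accessed.add(article["url"])
            new_articles.append(article)
    # Persist the updated list so the next run skips these urls
    with Path(path).open("wb") as f:
        pickle.dump(sorted(accessed), f)
    return new_articles


# Hypothetical usage: the second pass finds nothing new
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "accessed_articles.pkl"  # assumed file name
    found = [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}]
    first = filter_new_articles(found, store)
    second = filter_new_articles(found, store)
```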

##### Local Modules Used in `article_scraper.scrape()`:
> - __modules.article_scraper__ ------>  *contains scrapy spider, scrape function to trigger spider and functionality to store accessed articles in pickle list*
> - __modules.sources_dictionary__ ------>  *contains dictionary of sources and their characteristics*
> - __modules.clean_text__ ------>  *contains function to clean text of irregularities*
> - __modules.match_info__ ------>  *contains function to identify teams described in article*
> - __modules.phrases__ ------>  *contains function to return text as dataframe of phrases*
> - __modules.lemmatize_remove_stopwords__ ------>  *contains function to remove stopwords and lemmatize phrase*
> - __modules.stopwords__ ------>  *contains a customised list of stopwords - adapted from sklearn 'ENGLISH_STOP_WORDS' module*
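
As a rough sketch of what the `phrases` and `lemmatize_remove_stopwords` modules do, the example below splits a sentence into phrases and drops stopwords. It is an assumption-laden stand-in: the function names, the splitting rule and the tiny stopword set are invented for illustration, and lemmatization is left out because it needs an external NLP library.

```python
import re

# Tiny illustrative stopword set; the project uses a customised list
# adapted from sklearn's ENGLISH_STOP_WORDS.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "be"}


def split_into_phrases(sentence):
    """Split a sentence into phrases at commas, semicolons and spaced dashes."""
    parts = re.split(r"\s*[,;]\s*|\s+-\s+", sentence)
    return [p.strip() for p in parts if p.strip()]


def remove_stopwords(phrase):
    """Drop stopwords case-insensitively while keeping the original casing."""
    tokens = re.findall(r"[\w'-]+", phrase)
    return " ".join(t for t in tokens if t.lower() not in STOPWORDS)


phrases = split_into_phrases("Bernd Leno is out, as is Lucas Torreira")
cleaned = remove_stopwords("Southampton forward Shane Long will be available")
```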

#### Apply the article scraper function  
__Time taken__: Approx 46 secs


Save the result to `df`

In [1]:
# Import pandas to display all rows
import pandas as pd
pd.set_option('display.max_rows', None)

In [2]:
# Import the article scraper
import modules.article_scraper as article_scraper

In [3]:
from time import time

In [4]:
df = article_scraper.scrape()

2020-06-25 17:05:21 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
2020-06-25 17:05:21 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (default, Mar 27 2019, 16:54:48) - [Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1b  26 Feb 2019), cryptography 2.6.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-06-25 17:05:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-06-25 17:05:21 [scrapy.crawler] INFO: Overridden settings:
{'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) '
               'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 '
               'Safari/537.36; Andy Clarke/clarkeaj3@cardiff.ac.uk'}
2020-06-25 17:05:22 [scrapy.extensions.telnet] INFO: Telnet Password: 9ac8e24561591ab3
2020-06-25 17:05:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats'

In [5]:
# Check the number of rows in the dataframe at this stage
len(df)

5046

----
<section id="Step2"><h1>Identify Targets</h1></section>

The following steps are applied by functions within local modules. They are called together using the function `target_identifier.target_identifier(df)`, which adds target players to the dataframe and removes rows where targets are absent.

### Steps Involved:
> 1. Function will loop through the rows in the dataframe
> 2. It will then loop through the players in the teams listed 
> 3. Using the function `identify_nsubj_pobj` it will identify the players mentioned in the sentence from their player identifiers listed in `players_dictionary`
> 4. It will add a list of the target players to the dataframe's corresponding row
> 5. Finally it will return a dataframe in which rows without target players have been removed
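
The loop described above can be sketched as follows. This is a simplified stand-in rather than the project's implementation: it swaps the `identify_nsubj_pobj` natural-language checks for plain full-name matching, uses a list of dictionaries instead of a dataframe, and the abridged `PLAYERS` dictionary and function names are assumptions for the example.

```python
import re

# Hypothetical, heavily abridged stand-in for modules.players_dictionary
PLAYERS = {
    "Sheffield United": ["Luke Freeman", "Kieron Freeman", "David McGoldrick"],
    "Manchester United": ["Bruno Fernandes"],
}


def find_targets(phrase, teams):
    """Return players from the listed teams whose full name appears in the phrase."""
    targets = []
    for team in teams:
        for player in PLAYERS.get(team, []):
            if re.search(r"\b" + re.escape(player) + r"\b", phrase):
                targets.append(player)
    return targets


def add_targets(rows):
    """Attach a 'targets' list to each row and drop rows without any target."""
    kept = []
    for row in rows:
        targets = find_targets(row["phrase"], row["teams"])
        if targets:
            kept.append({**row, "targets": targets})
    return kept


rows = [
    {"phrase": "Luke Freeman and David McGoldrick are the two men coming on for the Blades",
     "teams": ["Sheffield United"]},
    {"phrase": "The match kicks off at 6pm", "teams": ["Sheffield United"]},
]
with_targets = add_targets(rows)
```

Matching on the full name keeps Luke Freeman and Kieron Freeman apart, which is the kind of shared-surname case visible in the function's log output.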

##### Local Modules Used in `target_identifier.target_identifier(df)`:
> - __modules.target_identifier__ ------>  *contains function to identify players in a phrase and return df with targets and remove rows without targets*
> - __modules.identify_nsubj_pobj__ ------>  *contains function to return players from a phrase when they meet certain natural language characteristics - called in target_identifier*
> - __modules.players_dictionary__ ------>  *contains dictionary of squads in Premier League*


#### Apply the target identifier function  
__Time taken__: Approx 6 mins 30 secs

Save the result to `df2`

In [10]:
import modules.players_dictionary as players_dictionary

In [14]:
# List the teams (the dictionary keys) available in the players dictionary
for team in players_dictionary.players_dictionary:
    print(team)

Arsenal
Aston Villa
AFC Bournemouth
Brighton and Hove Albion
Burnley
Chelsea
Crystal Palace
Everton
Leicester City
Liverpool
Manchester City
Manchester United
Newcastle United
Norwich City
Sheffield United
Southampton
Tottenham Hotspur
Watford
West Ham United
Wolverhampton Wanderers


In [6]:
import modules.target_identifier as target_identifier

In [7]:
df2 = target_identifier.target_identifier(df)

No Match Luke Freeman and David McGoldrick are the two men coming on for the Blades -----> Kieron Freeman
Luke Freeman and David McGoldrick are the two men coming on for the Blades
Luke ----> Luke Freeman
No Match Luke Freeman and David McGoldrick are the two men coming on for the Blades -----> Kieron Freeman
Luke Freeman and David McGoldrick are the two men coming on for the Blades
Luke ----> Luke Freeman
No Match Spurs looked to be heading to victory thanks to Steven Bergwijn's strike but Bruno Fernandes levelled with a late penalty to earn a share of the spoils in north London -----> Gedson Fernandes
Spurs looked to be heading to victory thanks to Steven Bergwijn's strike but Bruno Fernandes levelled with a late penalty to earn a share of the spoils in north London
Bruno ----> Bruno Fernandes
No Match Spurs looked to be heading to victory thanks to Steven Bergwijn's strike but Bruno Fernandes levelled with a late penalty to earn a share of the spoils in north London -----> Gedson Fe

In [8]:
# Check the number of rows remaining after rows without targets are removed
len(df2)

1466

In [9]:
df2

Unnamed: 0,date,lemmatized_no_stopwords_phrase,media_source,original_sentence,phrase,teams,time,url,targets
0,25/06/2020,Follow all latest live update from St Mary 's ...,The Mirror - Match Reports,Follow all the latest live updates from St Mar...,Follow all the latest live updates from St Mar...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:22,https://www.mirror.co.uk/sport/football/match-...,"[Southampton, Arsenal]"
2,25/06/2020,Southampton forward Shane Long will available ...,The Mirror - Match Reports,Southampton forward Shane Long will be availab...,Southampton forward Shane Long will be availab...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,"[Shane Long, Arsenal]"
5,25/06/2020,missed Saints ' winning return action Norwich ...,The Mirror - Match Reports,"The 33-year-old, who recently signed a contrac...",missed Saints' winning return to action at No...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Norwich City]
6,25/06/2020,Moussa Djenepo remains suspended for Southampt...,The Mirror - Match Reports,Moussa Djenepo remains suspended for Southampt...,Moussa Djenepo remains suspended for Southampt...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,"[Moussa Djenepo, Southampton, Newcastle United]"
7,25/06/2020,Arsenal will host first-team regular they head...,The Mirror - Match Reports,Arsenal will be without a host of first-team r...,Arsenal will be without a host of first-team r...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Southampton]
8,25/06/2020,David Luiz serf last two-game ban fellow defen...,The Mirror - Match Reports,David Luiz serves the last of a two-game ban w...,David Luiz serves the last of a two-game ban w...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[David Luiz]
9,25/06/2020,Calum Chambers ( knee ) Cedric Soares ( facial...,The Mirror - Match Reports,David Luiz serves the last of a two-game ban w...,Calum Chambers (knee) and Cedric Soares (faci...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Calum Chambers]
10,25/06/2020,Bernd Leno Gabriel Martinelli ( both knee ) ab...,The Mirror - Match Reports,Bernd Leno and Gabriel Martinelli (both knee) ...,Bernd Leno and Gabriel Martinelli (both knee) ...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Bernd Leno]
11,25/06/2020,Lucas Torreira ( ankle ),The Mirror - Match Reports,Bernd Leno and Gabriel Martinelli (both knee) ...,as is Lucas Torreira (ankle),"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Lucas Torreira]
12,25/06/2020,with Granit Xhaka ( ankle ) available for sele...,The Mirror - Match Reports,Bernd Leno and Gabriel Martinelli (both knee) ...,with Granit Xhaka (ankle) now available for s...,"[Southampton, Arsenal, Norwich City, Newcastle...",17:05:24,https://www.mirror.co.uk/sport/football/match-...,[Granit Xhaka]


----
<section id="Step3"><h1>Predict Sentiment</h1></section>

`predict.predict(df)` uses a trained classifier and vectorizer, both of which have been pickled and saved in the pickles directory of this project.

### Steps Involved:
> 1. Function will loop through the rows in the dataframe and extract a list of the `lemmatized_no_stopwords_phrase` values
> 2. It will then make a prediction based on a vectorized form of each of those phrases
> 3. It will return a list of predictions, one per phrase, each being 'POSITIVE', 'NEUTRAL' or 'NEGATIVE'

##### Local Pickles Used in `predict.predict(df)`:
> - __"./pickles/cv_1_1.sav"__ ------>  *contains Count Vectorizer trained on Ngram range (1, 1)  - used to vectorize phrase*
> - __"./pickles/logreg_cv_1_1.sav"__ ------>  *contains Logistic Regression classifier trained on above vectorizer  - used to predict sentiment of phrase*



#### Apply the predict sentiment function  
__Time taken__: Under 1 second

In [1]:
import modules.predict as predict

In [9]:
# Add the list of predictions to the dataframe
df2['sentiment'] = predict.predict(df2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until



function takes 0.650529


In [12]:
# Uncomment to view dataframe at this stage
#df2

----
<section id="Step4"><h1>Calculate Sentiment Score</h1></section>

`sentiment_score.sentiment_scores(df)` calculates the sentiment score for all of the players identified as targets. The outcome is a new dataframe:
>`date | player | club | field_position | nationality | sentiment_score | sample_sentences | n_positive | n_negative | n_neutral | total_reviews`


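Sentiment scores are calculated as follows: ---->   *`Positive reviews as a proportion (0 to 1) of total positive and negative reviews`*. Neutral reviews are excluded, and the score is left empty when a player has no positive or negative reviews.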
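
The calculation can be written as a small function. This is a sketch under the definition above, not the code in `modules.sentiment_score`; the function name and the choice to return `None` for players with no positive or negative reviews are assumptions.

```python
def sentiment_score(n_positive, n_negative):
    """Share of positive reviews among positive and negative ones.

    Returns None when there are no positive or negative reviews,
    since the ratio is undefined in that case.
    """
    decided = n_positive + n_negative
    if decided == 0:
        return None
    return n_positive / decided


# Counts taken from the final dataframe shown in this notebook
lloris = sentiment_score(3, 4)   # 3 positive, 4 negative
moura = sentiment_score(5, 3)    # 5 positive, 3 negative
no_reviews = sentiment_score(0, 0)
```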

### Steps Involved:
> 1. Function will loop through the rows in the dataframe and make a list of players that feature as targets
> 2. For each target player the `calculate_sentiment_score` function will produce a dictionary with the keys listed above
> 3. The dictionaries are joined together to produce a dataframe
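
The steps above can be sketched roughly as below, using plain dictionaries in place of the dataframe. The function name and the made-up input rows are assumptions for illustration, and only the count and score keys are produced (club, position, nationality and sample sentences are omitted).

```python
from collections import Counter


def summarise_players(rows):
    """Build one summary dict per target player from rows of targets and sentiment."""
    counts = {}
    for row in rows:
        for player in row["targets"]:
            counts.setdefault(player, Counter())[row["sentiment"]] += 1

    summaries = []
    for player, c in counts.items():
        decided = c["POSITIVE"] + c["NEGATIVE"]
        summaries.append({
            "player": player,
            # Undefined (None) when the player has only neutral reviews
            "sentiment_score": c["POSITIVE"] / decided if decided else None,
            "n_positive": c["POSITIVE"],
            "n_negative": c["NEGATIVE"],
            "n_neutral": c["NEUTRAL"],
            "total_reviews": sum(c.values()),
        })
    return summaries


# Made-up rows, not real scraped data
rows = [
    {"targets": ["Hugo Lloris"], "sentiment": "POSITIVE"},
    {"targets": ["Hugo Lloris", "Lucas Moura"], "sentiment": "NEGATIVE"},
    {"targets": ["Lucas Moura"], "sentiment": "NEUTRAL"},
]
summaries = summarise_players(rows)
```

A list of dictionaries in this shape can be passed straight to `pd.DataFrame(summaries)` to get one row per player.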

##### Local Modules Used in `sentiment_score.sentiment_scores(df)`:
> - __modules.players_dictionary__ ------>  *contains dictionary of squads in Premier League*
> - __modules.sentiment_score__ ------>  *contains two functions - one to produce individual player dictionary - one to produce dataframe of all players scores*


#### Apply the sentiment score function  
__Time taken__: Approx 3 mins 45 secs

In [10]:
import modules.sentiment_score as sentiment_score

In [11]:
t0 = time()
scores = sentiment_score.sentiment_scores(df2)
t1 = time()
print('function takes %f' %(t1-t0))

function takes 222.025772


In [12]:
# View the final dataframe
scores

Unnamed: 0,date,player,club,field_position,nationality,sentiment_score,sample_sentences,n_positive,n_negative,n_neutral,total_reviews
0,12/06/2020,Hugo Lloris,Tottenham Hotspur,GK,France,0.428571,"[{'sentiment': 'POSITIVE', 'url': 'https://www...",3,4,44,51
1,12/06/2020,Lucas Moura,Tottenham Hotspur,MD,Brazil,0.625,"[{'sentiment': 'POSITIVE', 'url': 'https://www...",5,3,48,56
2,12/06/2020,Moussa Sissoko,Tottenham Hotspur,MD,France,1.0,"[{'sentiment': 'POSITIVE', 'url': 'https://www...",1,0,15,16
3,12/06/2020,Jack Simpson,AFC Bournemouth,DF,England,0.0,"[{'sentiment': 'NEGATIVE', 'url': 'https://www...",0,2,7,9
4,12/06/2020,Bernardo Silva,Manchester City,MD,Portugal,,[],0,0,29,29
5,12/06/2020,David Silva,Manchester City,MD,Spain,,[],0,0,29,29
6,12/06/2020,Kelechi Iheanacho,Leicester City,FWD,Nigeria,0.4,"[{'sentiment': 'POSITIVE', 'url': 'https://epl...",2,3,36,41
7,12/06/2020,Wilfred Ndidi,Leicester City,MD,Nigeria,1.0,"[{'sentiment': 'POSITIVE', 'url': 'https://www...",2,0,21,23
8,12/06/2020,Youri Tielemans,Leicester City,MD,Belguim,1.0,"[{'sentiment': 'POSITIVE', 'url': 'https://epl...",3,0,12,15
9,12/06/2020,Lucas Digne,Everton,DF,France,,[],0,0,13,13


...