# Another tagging script
Having thought about it, I will start out with a basic `extractall` that pulls everything starting with a capital letter, and see where I get from there

`([A-ZÁÉÍÓÚÑ]+[A-záéíóúñÁÉÍÓÚÑ-]+(?: [A-ZÁÉÍÓÚÑ]+[A-záéíóúñÁÉÍÓÚÑ-]+)*)`

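Looks for one or more capital letters followed by one or more further letters or hyphens, then looks for the same pattern preceded by a space, 0 or more times. The second part was meant to be NON caps only, but the `A-z` range also covers the uppercase letters (plus the few ASCII punctuation characters that sit between `Z` and `a`), so all-caps words such as the FA in FA Cup are already matched. Note that because the apostrophe is not in the character class, a match stops there, so a trailing possessive 's is dropped as well.

The main issue is with the first words of headlines, which are always capitalised - but these could be filtered with stop words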
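A quick way to see what the pattern really captures is to run it over a few headlines. This is only a sketch: the headlines are made up for illustration, and the names `WORD` and `PATTERN` are not from the notebook; only the pattern itself is.

```python
import re

# Same pattern as above, built from its repeated "word" part for readability
WORD = r'[A-ZÁÉÍÓÚÑ]+[A-záéíóúñÁÉÍÓÚÑ-]+'
PATTERN = re.compile(r'(' + WORD + r'(?: ' + WORD + r')*)')

# Made-up headlines, not taken from the real data
headlines = [
    'Millwall FA Cup tie',
    'The striker joins Real Madrid',
    "Tottenham's Son scores",
    'England U21 squad',
]

for headline in headlines:
    print(headline, '->', PATTERN.findall(headline))
```

The first headline comes back as the single match `Millwall FA Cup`, the second picks up `The` as well as `Real Madrid`, the third drops the possessive, and `U21` is missed entirely in the last one.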

#### Improvements

* Numbers - U21 etc. should be captured, I guess
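One possible way to pick up things like U21 is to allow digits after the first capital letter. This is an untested idea rather than what the notebook does: `WORD_WITH_DIGITS` and `NUMERIC_PATTERN` are hypothetical names, and the only change from the original pattern is the added `0-9`.

```python
import re

# Assumption: digits are allowed only after the leading capital(s), so "U21"
# matches but a bare score or year such as "3-1" or "2019" cannot start a match
WORD_WITH_DIGITS = r'[A-ZÁÉÍÓÚÑ]+[A-z0-9áéíóúñÁÉÍÓÚÑ-]+'
NUMERIC_PATTERN = re.compile(
    r'(' + WORD_WITH_DIGITS + r'(?: ' + WORD_WITH_DIGITS + r')*)'
)

# Made-up headlines for illustration
print(NUMERIC_PATTERN.findall('England U21 squad'))
print(NUMERIC_PATTERN.findall('Spain U21 beat Italy 3-1'))
```

Note that `U21` gets glued onto the preceding word (`England U21`), which may or may not be what we want for tagging.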

#### Issues

* Millwall FA Cup tie
    * Not sure how to deal with this - but maybe it revolves around getting some "ground truth"-like labels that we later extract from the others
    * i.e. we could do something like penalise adding a new label that is equal to the counts of the labels we already have
        * round 1: count all labels
        * round 2: look at those with fewest and find in others - starting from the top
        * Could also maybe do this using spaCy to reduce words to their stem - e.g. English > England
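One possible reading of the two rounds above, as a rough sketch: count every label, treat the ones seen often enough as provisional "ground truth", and then try to break the rare labels into those known ones. The function name `split_rare_labels`, the `min_count` threshold and the sample labels are all assumptions made for illustration.

```python
from collections import Counter


def split_rare_labels(labels, min_count=2):
    """Break rarely seen labels into labels that have been seen more often."""
    counts = Counter(labels)  # round 1: count all labels
    known = {label for label, count in counts.items() if count >= min_count}

    resolved = {}
    for label in counts:
        if label in known:
            resolved[label] = [label]
            continue
        # round 2: greedily look for the longest known label at each position
        words, parts, start = label.split(), [], 0
        while start < len(words):
            for end in range(len(words), start, -1):
                candidate = ' '.join(words[start:end])
                if candidate in known:
                    parts.append(candidate)
                    start = end
                    break
            else:
                # No known label starts here, keep the single word
                parts.append(words[start])
                start += 1
        # If nothing known was found inside, leave the rare label whole
        if not any(part in known for part in parts):
            parts = [label]
        resolved[label] = parts
    return resolved


# Made-up label counts for illustration
labels = ['Millwall', 'Millwall', 'FA Cup', 'FA Cup', 'Millwall FA Cup', 'Old Firm']
print(split_rare_labels(labels))
```

With these labels `Millwall FA Cup` is split into `Millwall` and `FA Cup`, while `Old Firm` is left alone because nothing known sits inside it.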
        
#### Future

* Need to define what sort of connections they are (graph?)
* Need to also classify articles

In [1]:
import pandas as pd
from pathlib import Path
import spacy

data_loc = Path('~/Dropbox/Projects/Football/Data').expanduser()

In [2]:
disable = ['parser', 'ner', 'textcat', 'tokenizer', 'tagger']
nlp = spacy.load('en_core_web_lg', disable = disable)

In [3]:
story_data = pd.read_csv(data_loc / 'stories.csv')

In [4]:
regex_string = r'([A-ZÁÉÍÓÚÑ]+[A-záéíóúñÁÉÍÓÚÑ-]+(?: [A-ZÁÉÍÓÚÑ]+[A-záéíóúñÁÉÍÓÚÑ-]+)*)'

Note:
- If nothing is found for a title, nothing is returned for it - but I guess the original index is maintained for the rest??

In [5]:
# One list of matches per article, keyed by the original row index (level_0)
article_matches = pd.DataFrame(
    story_data.article_title
    .str.extractall(regex_string)
    .reset_index()
    .groupby('level_0')[0]
    .apply(lambda x: x.tolist())
)

In [47]:
double_reset = article_matches.reset_index().reset_index()
print((double_reset['index'] != double_reset.level_0).sum())

15696


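If there is no match then there is no row in the result - but the index is maintained! The non-zero count above shows that the positional index and `level_0` diverge, so `level_0` is still the original row label from `story_data` and can be used to line the matches back up.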
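A tiny example makes the index behaviour easier to see. This is a sketch on made-up titles with a simplified pattern (both are assumptions, not the real data or the real regex); the `reset_index` / `groupby('level_0')` steps are the same as in the cell above.

```python
import pandas as pd

# Made-up titles; the middle one has no capitalised words at all
titles = pd.Series(['Real Madrid win', 'no capitals here', 'Son scores for Spurs'])

# Simplified stand-in for the real pattern
pattern = r'([A-Z][a-z]+(?: [A-Z][a-z]+)*)'

matches = titles.str.extractall(pattern)
grouped = matches.reset_index().groupby('level_0')[0].apply(list)
print(grouped)

# Assigning back aligns on the original index, so the unmatched row gets NaN
frame = titles.to_frame('title')
frame['entities'] = grouped
print(frame)
```

Row 1 disappears from `grouped`, but rows 0 and 2 keep their labels, which is why assigning the result back onto the original frame lines up correctly.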

In [6]:
match_list = article_matches[0].tolist()

In [7]:
reduced_matches = [[entity for entity in story if entity.lower() not in nlp.Defaults.stop_words] for story in match_list]

In [9]:
example = set(['team', 'league'])
example.remove('league')

In [10]:
example

{'team'}

In [None]:
['1967 establishments in Texas',
    'All accuracy disputes',
    'American Basketball Association teams',
    'Basketball teams established in 1967',
    'CS1 maint: archived copy as title',
    'Dallas Chaparrals',
    'National Basketball Association teams',
    'San Antonio Spurs',
    'Spurs Sports & Entertainment',
    'Use mdy dates from January 2019',
    'Webarchive template wayback links'],

In [33]:
import wikipedia as wiki

def process_cateoriges(categories):
    """
    Function that takes in the categories and cleans them
    up to send back the desired result
    """
    IGNORE_WORDS = ['articles', 'wikipedia', 'wikidata']
    check_cat = lambda x: not any(word in x.lower() for word in IGNORE_WORDS)
    categories = [category for category in categories if check_cat(category)]
    clean_categories = []
    
    for category in categories:
        if 'players' in category:
            clean_categories.append('player')
        if 'Football clubs' in category:
            clean_categories.append('team')
        if 'managers' in category:
            clean_categories.append('manager')
        if 'leagues' in category or 'cups' in category or 'derbies' in category:
            clean_categories.append('competition')
        if 'countries' in category:
            clean_categories.append('national team')
        if 'people' in category:
            clean_categories.append('person')
    
    clean_categories = set(clean_categories)
    
    # Some logic to stop us having things like LEAGUE + MANAGER
    if 'team' in clean_categories:
        # discard() does nothing when the label is absent
        for remove in ['player', 'manager', 'competition', 'person']:
            clean_categories.discard(remove)
    # Logic to remove non people grouped with people
    elif any(check in clean_categories for check in ['player', 'manager', 'person']):
        # 'national team' is the label added above for 'countries' categories
        for remove in ['competition', 'national team']:
            clean_categories.discard(remove)
    
    return clean_categories, categories
        
def wikipedia_lookup(entity):
    """
    Get suggested titles for the entity
    Find which ones might be football related and search those - otherwise search all
    Return the first one that looks interesting, otherwise return the most relevant to the search
    """

    # First get the possible matches
    search_results = wiki.search(entity)

    # If we have something then move into it
    if len(search_results) > 0:
        # Then go through
        for i, search_result in enumerate(search_results):
            try:
                page = wiki.page(search_result)
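            except wiki.DisambiguationError:
                # Only record this as the fallback when it is the first result,
                # so a later ambiguous hit cannot overwrite a good first page
                if i == 0:
                    first_results = search_results[0], 'DisambiguationError', [], []
                continue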

            title = page.title
            categories = page.categories
            
            # Clean up the categories and see if it fits anywhere
            clean_categories, categories = process_cateoriges(categories)
            
            # For the first one we find, return the values
            if len(clean_categories) > 0:
                return title, 'Success', categories, clean_categories
            
            if i == 0:
                first_results = title, 'First', categories, clean_categories
            
        # If we didn't find anything then just return the first one
        return first_results
    else:
        return None, 'NotFoundError', [], []
    
def check_entity(entity, wiki_results):
    """
    Function that checks if a certain entity is valid for saving or not
    It will return what should be - don't need to return the wikiresult
    as we add entry and it should be kept
    
    Could also add in the nickname replacement here
    """
    # First check if it is a stop word
    if entity.lower() in nlp.Defaults.stop_words:
        return None

    # Only hit Wikipedia for entities we have not looked up before
    if entity not in wiki_results:
        wiki_results[entity] = wikipedia_lookup(entity)

    # The lookup result lives in wiki_results, so only the entity is returned
    return entity

If the first is a disambiguation error, maybe skip
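A rough sketch of what skipping could look like: walk the search results and take the first one that loads without raising. The helper `first_resolvable` is hypothetical and takes the loader and the exceptions to skip as arguments, so it can be tried without touching Wikipedia; the fake loader and titles below are made up.

```python
def first_resolvable(candidates, load_page, skip_errors):
    """Return (candidate, page) for the first candidate that loads cleanly."""
    for candidate in candidates:
        try:
            return candidate, load_page(candidate)
        except skip_errors:
            # e.g. an ambiguous title: move on to the next search result
            continue
    return None, None


# Offline stand-ins for wiki.page and wiki.DisambiguationError
class FakeAmbiguous(Exception):
    pass


def fake_page(title):
    if title == 'Mercury':
        raise FakeAmbiguous(title)
    return {'title': title}


print(first_resolvable(['Mercury', 'Mercury (planet)'], fake_page, (FakeAmbiguous,)))

# Intended use with the notebook's objects (needs network access):
# title, page = first_resolvable(wiki.search('Old Firm'), wiki.page,
#                                skip_errors=(wiki.DisambiguationError,))
```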

In [39]:
search_results = wiki.search('Old Firm')

In [42]:
wiki.page('Old Firm').categories

['1888 establishments in Scotland',
 'Articles with short description',
 'Association football terminology',
 'CS1: Julian–Gregorian uncertainty',
 'CS1 Italian-language sources (it)',
 'Celtic F.C.',
 'Christianity in Glasgow',
 'Commons category link is on Wikidata',
 'Politics and sports',
 'Politics of Glasgow',
 'Rangers F.C.',
 'Recurring sporting events established in 1888',
 'Scotland football derbies',
 'Sectarianism',
 'Use British English from February 2016',
 'Use dmy dates from December 2013']

In [40]:
search_results

['Old Firm',
 'Rangers F.C.',
 'Celtic F.C.',
 'Scottish Premier League',
 'Lars Frederiksen',
 'Celtic F.C. supporters',
 'List of sports rivalries in the United Kingdom',
 'List of Scottish football champions',
 'Rangers F.C. supporters',
 'Law firm']

In [35]:
results = []
for entities in reduced_matches:
    for entity in entities:
        results.append((entity, wikipedia_lookup(entity)))





KeyboardInterrupt: 

In [36]:
results

[('Son Hueng-min',
  ('Son Heung-min',
   'Success',
   ['1992 births',
    '2011 AFC Asian Cup players',
    '2014 FIFA World Cup players',
    '2015 AFC Asian Cup players',
    '2018 FIFA World Cup players',
    '2019 AFC Asian Cup players',
    'Asian Games gold medalists for South Korea',
    'Asian Games medalists in football',
    'Association football forwards',
    'Association football wingers',
    'Bayer 04 Leverkusen players',
    'Best Footballer in Asia',
    'Bundesliga players',
    'CS1 Chinese-language sources (zh)',
    'CS1 French-language sources (fr)',
    'CS1 German-language sources (de)',
    'CS1 Korean-language sources (ko)',
    'CS1 uses Korean-language script (ko)',
    'EngvarB from January 2019',
    'Expatriate footballers in England',
    'Expatriate footballers in Germany',
    'Footballers at the 2016 Summer Olympics',
    'Footballers at the 2018 Asian Games',
    'Hamburger SV II players',
    'Hamburger SV players',
    'Living people',
    'Medal

In [133]:
results = []
for entities in reduced_matches:
    for entity in entities:
        results.append((entity, wikipedia_lookup(entity)))





ConnectionError: HTTPSConnectionPool(host='en.wikipedia.org', port=443): Max retries exceeded with url: /w/api.php?list=search&srprop=&srlimit=10&limit=10&srsearch=Seven+Newcastle&format=json&action=query (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f56cd803c88>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

In [65]:
entity

'Son Hueng-min'

In [60]:
def check_entitiy(entity):
    pass  # Unfinished stub - see check_entity above


In [21]:
article_matches['article_entities'] = [[entity for entity in story if entity.lower() not in nlp.Defaults.stop_words] for story in match_list]

In [51]:
story_data['article_entities'] = article_matches.article_entities

In [53]:
key_info = story_data[['article_title', 'article_entities']]

In [56]:
key_info.to_csv(data_loc / 'articles-with-entities.csv', index=False)

Next thing to do:

* Overview of what we have pulled, then look at where it is doing well and where it is doing badly
* Start to build up relations
* Start to build up database

### DBpedia look up
http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?QueryString=XXX_XXX

Need to try this with a JSON accept header > `Accept: application/json`

This seems like a pretty simple way to get an idea of what the thing is - as it searches for something similar to the query

Some search terms like Son Heung-Min haven't worked and I'm not really sure why

In [22]:
import requests

In [54]:
headers = {'Accept' : 'application/json'}
url = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch?MaxHits=1&QueryString={}'
search_term = 'Football'
response = requests.get(url.format(search_term.replace(' ', '_')), headers=headers)

In [58]:
response.status_code

200

In [55]:
response.json()['results'][0]['classes']

[{'label': 'sport', 'uri': 'http://dbpedia.org/ontology/Sport'},
 {'label': 'owl#Thing', 'uri': 'http://www.w3.org/2002/07/owl#Thing'},
 {'label': 'activity', 'uri': 'http://dbpedia.org/ontology/Activity'}]
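One thing that might be worth trying for the terms that fail is to percent-encode the query instead of swapping spaces for underscores. This is only a sketch: whether the lookup service treats encoded spaces differently from underscores is an assumption, and `build_lookup_url` is a hypothetical helper. The endpoint is the one used above.

```python
from urllib.parse import urlencode

# Endpoint as used in the cell above
LOOKUP_ENDPOINT = 'http://lookup.dbpedia.org/api/search.asmx/KeywordSearch'


def build_lookup_url(search_term, max_hits=1):
    """Build the lookup URL with the query string percent-encoded."""
    query = urlencode({'MaxHits': max_hits, 'QueryString': search_term})
    return LOOKUP_ENDPOINT + '?' + query


print(build_lookup_url('Son Heung-min'))
```

The resulting URL can be passed to `requests.get(..., headers=headers)` exactly as before; accented names are encoded as UTF-8 bytes rather than sent raw.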

### Wikipedia
Basic look up on Wikipedia, which might be more flexible, and we could check the basic description

Wikipedia is also quite good for building up relations - maybe we could do that later and expand on the relations we have, or do all of it later and then tag on the information (i.e. have a full description of the things)

We could also potentially use this to get the normalised form
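As a sketch of the normalised-form idea: the page title that comes back from a lookup can stand in for whatever spelling appeared in the headline. `build_normalisation_map` is a hypothetical helper and assumes results shaped like `(entity, (title, ...))`, which is how the lookup loop in this notebook stores them.

```python
def build_normalisation_map(results):
    """Map each raw entity to the Wikipedia title it resolved to."""
    mapping = {}
    for entity, (title, *_rest) in results:
        # Skip lookups that found nothing at all
        if title is not None:
            mapping[entity] = title
    return mapping


# Shape taken from the lookup results; category lists shortened for the example
results = [
    ('Son Hueng-min', ('Son Heung-min', 'Success', ['1992 births'], {'player'})),
    ('Xyzzy', (None, 'NotFoundError', [], [])),
]
print(build_normalisation_map(results))
```

Here the misspelt `Son Hueng-min` maps to `Son Heung-min`, and the entity with no result is left out.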

I think what we should do is:
* For each entity
    * Check if we have previously looked it up in Wikipedia (in last X time?)
    * If not - then look it up
    * With the look up, take the first result and get its categories
    * Do a basic clean up; if the result falls in none of the basic categories that we have, ignore it (these categories are our base sets)
    * Then save the information somewhere
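The "have we looked this up in the last X time" step could be sketched as a small in-memory cache wrapped around whatever lookup function is used. `LookupCache`, `max_age_seconds` and the injectable `clock` are all assumptions for illustration; saving to disk or a database is left out.

```python
import time


class LookupCache:
    """Remember lookup results and reuse them until they are too old."""

    def __init__(self, lookup, max_age_seconds=7 * 24 * 3600, clock=time.time):
        self.lookup = lookup                  # e.g. wikipedia_lookup
        self.max_age_seconds = max_age_seconds
        self.clock = clock                    # injectable so it can be tested
        self._store = {}                      # entity -> (timestamp, result)

    def get(self, entity):
        now = self.clock()
        cached = self._store.get(entity)
        if cached is not None and now - cached[0] <= self.max_age_seconds:
            return cached[1]
        # Not seen before, or seen too long ago: look it up again
        result = self.lookup(entity)
        self._store[entity] = (now, result)
        return result


# Offline demonstration with a fake lookup and a fake clock
calls = []
fake_now = [0.0]


def fake_lookup(entity):
    calls.append(entity)
    return (entity.title(), [], set())


cache = LookupCache(fake_lookup, max_age_seconds=10, clock=lambda: fake_now[0])
cache.get('old firm')
cache.get('old firm')   # served from the cache
fake_now[0] = 11
cache.get('old firm')   # older than max_age_seconds, looked up again
print(calls)
```

Besides avoiding repeat requests, something like this would also mean an interrupted run does not have to start again from scratch, provided the store is kept around.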

In [66]:
import wikipedia as wiki

In [120]:
import wikipedia as wiki

def process_cateoriges(categories):
    """
    Function that takes in the categories and cleans them
    up to send back the desired result
    """
    clean_categories = []
    
    for category in categories:
        if 'players' in category:
            clean_categories.append('player')
        if 'Football clubs' in category or 'team' in category:
            clean_categories.append('team')
        if 'managers' in category:
            clean_categories.append('manager')
        if 'leagues' in category or 'cups' in category:
            clean_categories.append('competition')
        if 'countries' in category:
            clean_categories.append('national team')
    
    check_cat = lambda x: 'articles' not in x.lower() and 'wikipedia' not in x.lower()
    return set(clean_categories), [category for category in categories if check_cat(category)]
        
def wikipedia_lookup(entity):
    """
    Get the first title suggested for the entity and 
    get its categories, looking to see if it is one of the ones we want
    """
    search_results = wiki.search(entity)
    
    if len(search_results) > 0:
        page = wiki.page(search_results[0])
        title = page.title
        categories = page.categories
        clean_categories, categories = process_cateoriges(categories)
        
        return title, categories, clean_categories
    else:
        return None, [], []

In [124]:
result = wikipedia_lookup('football')

In [110]:
title = wiki.search('spain')

In [107]:
wiki.page(title[0]).title

'England'

In [111]:
wiki.page(title[0]).categories

['All Wikipedia articles needing clarification',
 'All articles containing potentially dated statements',
 'All articles with failed verification',
 'Articles containing Basque-language text',
 'Articles containing Catalan-language text',
 'Articles containing Galician-language text',
 'Articles containing Latin-language text',
 'Articles containing Occitan-language text',
 'Articles containing Spanish-language text',
 'Articles containing potentially dated statements from 2012',
 'Articles incorporating a citation from the 1913 Catholic Encyclopedia with Wikisource reference',
 'Articles with Curlie links',
 'Articles with failed verification from January 2016',
 'Articles with hAudio microformats',
 'Articles with short description',
 'CS1: Julian–Gregorian uncertainty',
 'CS1 Catalan-language sources (ca)',
 'CS1 Greek-language sources (el)',
 'CS1 Spanish-language sources (es)',
 'CS1 maint: BOT: original-url status unknown',
 'CS1 maint: archived copy as title',
 'CS1 maint: extra

In [99]:
wiki.page(title[0]).categories

['1871 establishments in England',
 'All articles with dead external links',
 'All articles with specifically marked weasel-worded phrases',
 'All articles with unsourced statements',
 'All pages needing factual verification',
 'Articles with dead external links from December 2016',
 'Articles with permanently dead external links',
 'Articles with specifically marked weasel-worded phrases from June 2011',
 'Articles with unsourced statements from November 2019',
 'CS1 Indonesian-language sources (id)',
 'Commons category link is on Wikidata',
 'EngvarB from October 2015',
 'FA Cup',
 'Football cup competitions in England',
 'National association football cups',
 'Recurring sporting events established in 1871',
 'Use dmy dates from February 2019',
 'Wikipedia articles needing factual verification from August 2019']

In [80]:
page = wiki.page(title)

In [82]:
page.categories

['1880 establishments in England',
 'Articles with hAudio microformats',
 'Articles with short description',
 'Association football clubs established in 1880',
 'Commons category link is on Wikidata',
 'EFL Cup winners',
 'FA Cup winners',
 'FIFA (video game series) teams',
 'Featured articles',
 'Football clubs in England',
 'Football clubs in Manchester',
 'Football team templates which use short name parameter',
 'Former English Football League clubs',
 'Manchester City F.C.',
 'Premier League clubs',
 'Spoken articles',
 'Sport in Manchester',
 'Use British English from August 2018',
 'Use dmy dates from October 2019',
 'Wikipedia articles with GND identifiers',
 'Wikipedia articles with LCCN identifiers',
 'Wikipedia articles with NDL identifiers',
 'Wikipedia articles with VIAF identifiers',
 'Wikipedia articles with WorldCat-VIAF identifiers',
 'Wikipedia indefinitely move-protected pages',
 'Wikipedia indefinitely semi-protected pages']

In [83]:
title = wiki.search('Son Hueng-min')[0]

In [89]:
wiki.page('Football').categories

['All articles with unsourced statements',
 'Articles containing French-language text',
 'Articles with French-language external links',
 'Articles with short description',
 'Articles with unsourced statements from January 2012',
 'Articles with unsourced statements from June 2009',
 'Ball games',
 'Broad-concept articles',
 'CS1 errors: missing periodical',
 'CS1 maint: archived copy as title',
 'CS1 maint: uses authors parameter',
 'Football',
 'Pages using multiple image with auto scaled images',
 'Use British English from September 2016',
 'Use dmy dates from November 2019',
 'Webarchive template wayback links',
 'Wikipedia indefinitely move-protected pages',
 'Wikipedia indefinitely semi-protected pages']

In [84]:
page = wiki.page(title)

In [86]:
page.summary

"Son Heung-min (Hangul: 손흥민; Hanja: 孫興慜; [son.hɯŋ.min]; born 8 July 1992) is a South Korean professional footballer who plays as a forward for Premier League club Tottenham Hotspur and captains the South Korea national team. Considered as one of the best wingers in the world, Son is often cited as an icon of South Korea.Born in Chuncheon, Son joined Hamburger SV at the age of 16 and made his debut in the German Bundesliga in 2010. In 2013, he moved to Bayer Leverkusen for a club record €10 million before signing for English side Tottenham for £22 million two years later, becoming the most expensive Asian player in history. While at Tottenham, Son became the top Asian goalscorer in Premier League history and surpassed Cha Bum-kun's record for most goals scored by a Korean player in European competition.A full international since 2010, Son has represented South Korea at the 2014 and 2018 FIFA World Cups and is South Korea's joint highest scorer at the World Cup alongside Park Ji-sung and

In [85]:
page.categories

['1992 births',
 '2011 AFC Asian Cup players',
 '2014 FIFA World Cup players',
 '2015 AFC Asian Cup players',
 '2018 FIFA World Cup players',
 '2019 AFC Asian Cup players',
 'All articles with dead external links',
 'Articles containing Korean-language text',
 'Articles needing Korean script or text',
 'Articles using Template:Medal with Runner-up',
 'Articles with dead external links from March 2018',
 'Articles with permanently dead external links',
 'Articles with short description',
 'Asian Games gold medalists for South Korea',
 'Asian Games medalists in football',
 'Association football forwards',
 'Association football wingers',
 'Bayer 04 Leverkusen players',
 'Best Footballer in Asia',
 'Bundesliga players',
 'CS1 Chinese-language sources (zh)',
 'CS1 French-language sources (fr)',
 'CS1 German-language sources (de)',
 'CS1 Korean-language sources (ko)',
 'CS1 uses Korean-language script (ko)',
 'Commons category link from Wikidata',
 'EngvarB from January 2019',
 'Expatriate 

In [75]:
wiki.summary('Manchester City', sentences=1)

'Manchester City Football Club is an English football club based in Manchester, that competes in the Premier League, the top flight of English football.'

In [77]:
wiki.summary('Football', sentences=1)

'Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'

In [78]:
wiki.summary('Son Hueng-min', sentences=1)

'Son Heung-min (Hangul: 손흥민; Hanja: 孫興慜; [son.hɯŋ.min]; born 8 July 1992) is a South Korean professional footballer who plays as a forward for Premier League club Tottenham Hotspur and captains the South Korea national team.'

In [76]:
wiki.suggest("Manchester City")


In [None]:
Son Hueng-min

In [71]:
#page.content