# Description
The purpose of this notebook is to explore the data obtained by scraping interest links and the corresponding Zefix results.

One issue that may arise is that the data isn't clean enough to yield usable results. In other words, we can run into two types of errors:

- False positives: for example, we search for a company linked to a politician and obtain a lot of different companies
- False negatives: the data isn't clean, so the real company can't be found in Zefix

This notebook tries to gain some insight into these two issues.
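
As a rough illustration of how a result count relates to these two cases, the sketch below labels a count. The helper name `classify_count` and its labels are hypothetical and are not used elsewhere in this notebook.

```python
def classify_count(count):
    """Label a Zefix result count (hypothetical helper, for illustration only)."""
    if count == 0:
        return "no result"      # possible false negative
    if count == 1:
        return "unique match"   # the outcome we are hoping for
    return "ambiguous"          # several hits, possible false positives

for n in (0, 1, 7):
    print(n, "->", classify_count(n))
```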

# Set-up

In [None]:
import os
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from zefix_scraper import zefix_search_raw

In [None]:
def count_findings(name):
    """
    Count how many results Zefix returns for `name`.
    """
    page = zefix_search_raw(name)
    
    if page is None:
        return 0

    content = BeautifulSoup(page, 'lxml')
    
    return len(content.body.find_all('p')) -1

def cached_call(generator, filename, as_series=False):
    """
    Simple function that tries to load data from the cache, or generates it (and caches it).
    
    `generator` must return a DataFrame, or a Series when `as_series` is True.
    """
    path = os.path.join('cache', "{}.json".format(filename))
    try:
        if as_series:
            ans = pd.read_json(path, typ='series', orient='records')
        else:
            ans = pd.read_json(path)
    except Exception as e:
        print("Loading data... ({})".format(e))
        ans = generator()
        ans.to_json(path)
        
    return ans
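
The load-or-generate pattern behind `cached_call` can be tried in isolation. The following is a minimal sketch using only the standard library, assuming plain JSON-serialisable data instead of pandas objects; `cached_json`, `expensive` and the temporary cache path are hypothetical names chosen for the illustration.

```python
import json
import os
import tempfile

def cached_json(generator, path):
    """Return the JSON content at `path`, or call `generator` and store its result."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        # Missing or unreadable cache file: regenerate the data and store it.
        data = generator()
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            json.dump(data, f)
        return data

calls = []

def expensive():
    calls.append(1)              # record that the generator actually ran
    return {"Example AG": 3}     # made-up data

cache_path = os.path.join(tempfile.mkdtemp(), "cache", "demo.json")
first = cached_json(expensive, cache_path)    # runs `expensive`
second = cached_json(expensive, cache_path)   # served from the cache file
```

Unlike `cached_call`, this sketch creates the cache directory when it is missing and only treats I/O and parsing errors as a cache miss.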

In [None]:
interests = pd.read_json('data/all_interests.json')
interests

# Naive lookup

In [None]:
import concurrent.futures

all_interests = interests.interest_name.unique()

def lookup_interests(all_interests):
    """
    Runs the Zefix lookups concurrently using a thread pool.
    
    The argument is simply a list, and the result is a DataFrame indexed by interest
    with a single column called `findings_count`.
    
    This function should be refactored before being used elsewhere because:
    - It is not clean to take a simple list as argument; it should take a pandas Series
    - It is not clean to return a DataFrame; it should return a Series which keeps the same index
    """
    all_interests = map(lambda x: x.strip(), all_interests)
    all_interest_count = {}

    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures_data = {executor.submit(count_findings, interest): interest for interest in all_interests}

        for future in concurrent.futures.as_completed(futures_data):
            interest = futures_data[future]
            try:
                count = future.result()
            except Exception as e:
                print("{} generated exception: {}".format(interest, e))
            else: 
                all_interest_count[interest] = count
            
    return pd.DataFrame(all_interest_count, index=["findings_count"]).transpose()

found_interests = cached_call(lambda : lookup_interests(all_interests), 'interests_counts')
found_interests = found_interests.findings_count
found_interests

In [None]:
found_interests.describe()

In [None]:
found_interests.sort_values(ascending=False)

In [None]:
def describe_ratios(serie):
    data = [
        ("equal_one", (serie == 1).mean()),
        ("null", (serie == 0).mean()),
        ("more_one", (serie > 1).mean()),
    ]
    
    for k,v in data:
        print("{}: {:0.1f}%".format(k,v * 100))

describe_ratios(found_interests)

As we can see, the vast majority of Zefix lookups return no result (roughly 75%), while 8% return more than one result.

Among the 75% of lookups without results, some are true negatives (i.e. the legal entity isn't registered in the Commercial Register, which is perfectly valid in some cases). However, such a proportion seems far too high.

# Resolve more lookups

Let's have a look at the distribution of terms in the interest list.

In [None]:
from wordcloud import WordCloud

In [None]:
def flatten(l):
    return [val for sublist in l for val in sublist]

def strip_flatten(l):
    return [e.strip() for e in flatten(l)]

all_words = pd.Series(strip_flatten([interest.split() for interest in all_interests]))
all_words 

In [None]:
len(all_words)

In [None]:
grouped_words = all_words.groupby(all_words).count()
grouped_words.sort_values(ascending=False)

In [None]:
describe_ratios(grouped_words)

In [None]:
def show_cloud(words):
    cloud = WordCloud().generate(' '.join(words))
    plt.imshow(cloud)
    plt.axis('off')
    plt.show()
    
show_cloud(all_words)

In [None]:
ordered = grouped_words.sort_values(ascending=False).reset_index(drop=True)

fig, axs = plt.subplots(1,2)
ordered.plot(ax=axs[0])
ordered.apply(np.log).plot(ax=axs[1])

axs[0].set_title("Linear scale")
axs[1].set_title("Log scale")

As we can see, there are _very few very frequent_ words and a lot of infrequent words. In this case, unlike in standard information retrieval, low-frequency words might be worth considering, since companies have precise names.
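
The same ranking can be reproduced on a toy example with `collections.Counter`. This is only a sketch: the interest names below are made up, and words are split on whitespace as in the cells above.

```python
from collections import Counter

# Made-up interest names, for illustration only.
names = ["Alpha Holding AG", "Beta Holding AG", "Gamma Stiftung", "Delta AG"]

counts = Counter(word for name in names for word in name.split())

# Words sorted from most to least frequent, as in the plots above.
ranked = counts.most_common()

# Words seen exactly once are the most specific ones.
rare = sorted(word for word, n in counts.items() if n == 1)

print(ranked[:2])
print(rare)
```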

## Tokenizer pipeline

Let's check for parentheses in the data (which might be a problem).

In [None]:
# Raw string: '\(' is an invalid escape sequence in a normal string literal
interests[interests.interest_name.str.contains(r'\(')]

In general, all of the content between the two parentheses is irrelevant for the Zefix search (being either a status or an acronym for the company). In both cases, this might result 

In [None]:
import re

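def remove_parenthesis(text):
    # Raw string avoids the invalid-escape warning. Note that the greedy `.*`
    # drops everything from the first "(" to the end of the string.
    return re.sub(r"\(.*\)?", '', text).strip()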

clean_interests = interests.interest_name.apply(remove_parenthesis)
clean_interests
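
It is worth checking what this pattern does when text follows the closing parenthesis. The sketch below compares the pattern used in `remove_parenthesis` with a narrower one; `strip_from_paren`, `strip_paren_group` and the sample string are hypothetical and only serve the comparison.

```python
import re

def strip_from_paren(text):
    # Same pattern as remove_parenthesis: drops everything from the first "(" onward.
    return re.sub(r"\(.*\)?", "", text).strip()

def strip_paren_group(text):
    # Hypothetical alternative: drops only the parenthesised part (closed or not),
    # then collapses the leftover whitespace.
    return " ".join(re.sub(r"\([^)]*\)?", "", text).split())

sample = "Foo (in Liquidation) AG"
print(strip_from_paren(sample))
print(strip_paren_group(sample))
```

Which behaviour is preferable depends on whether anything meaningful ever follows a parenthesis in the interest names.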

In [None]:
import nltk
from nltk.corpus import stopwords
#nltk.download()        # run once

def stringify_list(l):
    return ' '.join(l)

def tokenize_interest_pipeline(interest):
    def tokenize(interest):
        return nltk.word_tokenize(interest)
    
    def remove_stopwords(sentence):
        # A set makes the membership test below much faster than a list
        stop_words = set(stopwords.words('german')) | set(stopwords.words('french'))
        return [w for w in sentence if w.lower() not in stop_words]
    
    ans = remove_parenthesis(interest)
    ans = tokenize(ans)
    ans = remove_stopwords(ans)
    
    return ans

tokenized_interests = list(set(flatten([tokenize_interest_pipeline(i) for i in all_interests])))
sorted(tokenized_interests, key=len)

As we can see, there are tokens of size 1, which don't provide much information. Thus, we update the pipeline in order to remove them.

In [None]:
def more_one_letter_pipeline(interest):
    ans = tokenize_interest_pipeline(interest)
    
    return [token for token in ans if len(token) > 1]

tokenized_interests = flatten([more_one_letter_pipeline(i) for i in all_interests])
show_cloud(tokenized_interests)

As we can see, _AG_ is now the most frequent word in the list. This is an issue since it might produce an enormous number of false positives in the Zefix search engine (due to its small size). Therefore, AG and SA are dropped:

In [None]:
dark_words = ['AG','SA']

def remove_dark_words_pipeline(interest):
    return [token for token in more_one_letter_pipeline(interest) if token.upper() not in dark_words]

tokenized_interests = flatten([remove_dark_words_pipeline(i) for i in all_interests])
show_cloud(tokenized_interests)
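
To see the effect of the whole pipeline on a single name without installing NLTK, here is a rough standard-library approximation. It is a sketch under stated assumptions: `sanitize`, the tiny `STOP_WORDS` set and the regex tokenizer are stand-ins, whereas the notebook itself relies on `nltk.word_tokenize` and NLTK's German and French stop-word lists.

```python
import re

# Tiny hand-picked stop-word set (assumption); the notebook uses NLTK's lists.
STOP_WORDS = {"der", "die", "das", "und", "de", "la", "le", "et", "des", "du"}
LEGAL_FORMS = {"AG", "SA"}

def sanitize(name):
    """Hypothetical stand-in for the pipeline: returns the tokens kept for the search."""
    name = re.sub(r"\(.*\)?", "", name)     # drop everything from the first "("
    tokens = re.findall(r"\w+", name)        # crude replacement for nltk.word_tokenize
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    tokens = [t for t in tokens if len(t) > 1]
    return [t for t in tokens if t.upper() not in LEGAL_FORMS]

print(sanitize("Verein der Freunde X AG (Mitglied)"))
```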

## Sanitized search
Now that we have a pipeline for sanitizing interests, let's do the lookup on Zefix.

In [None]:
final_pipeline = remove_dark_words_pipeline
string_pipeline = lambda s: stringify_list(final_pipeline(s))

interests['sanitized_interest'] = interests.interest_name.apply(string_pipeline)
interests

In [None]:
all_sanitized_interests = interests.sanitized_interest.unique()
found_sanitized_interests = cached_call(lambda : lookup_interests(all_sanitized_interests), 'analyze_weak_lookup')

In [None]:
found_sanitized_interests.sort_values('findings_count', ascending=False)

In [None]:
describe_ratios(found_sanitized_interests.findings_count)

# Mixing all together
Now that we have two methods that work differently, we should take advantage of both to get better results from Zefix.

The idea is to first look for a unique result using an explicit lookup, and then to fall back to the sanitized name of the interest in the other cases.
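
For a single name, this two-stage idea can be sketched as follows. The fake register, `fake_lookup` and `fake_sanitize` are made up for the illustration and stand in for the Zefix query and the sanitizing pipeline.

```python
def two_stage_count(name, sanitize, lookup):
    """Try the full name first; fall back to the sanitized name unless exactly one hit."""
    count = lookup(name)
    if count == 1:
        return count
    return lookup(sanitize(name))

# Fake register, for illustration only.
fake_register = {"Example AG": 1, "Other": 1}

def fake_lookup(query):
    return fake_register.get(query, 0)

def fake_sanitize(name):
    return name.replace(" AG", "")

print(two_stage_count("Example AG", fake_sanitize, fake_lookup))  # resolved by the full name
print(two_stage_count("Other AG", fake_sanitize, fake_lookup))    # resolved by the fallback
print(two_stage_count("Unknown AG", fake_sanitize, fake_lookup))  # still unresolved
```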

First, we need a new function for querying Zefix.

In [None]:
def async_series_lookup(f, input_series, number_parallel_tasks=None):
    """
    Asynchronous lookup for pandas Series.
    
    Applies `f` to every value of `input_series` in a thread pool and returns a Series
    of the results, indexed by the keys of `input_series` (failed lookups are left out).
    
    f -- function to apply
    input_series -- Series to apply the function on
    number_parallel_tasks -- maximum number of worker threads (None: library default)
    """
    results = {}

    # max_workers=None lets ThreadPoolExecutor pick its default pool size
    with concurrent.futures.ThreadPoolExecutor(max_workers=number_parallel_tasks) as executor:
        # Series.items() replaces iteritems(), which was removed in pandas 2.0
        futures_data = {executor.submit(f, val): key for (key, val) in input_series.items()}

        for future in concurrent.futures.as_completed(futures_data):
            key = futures_data[future]
            try:
                ans = future.result()
            except Exception as e:
                print("{} generated exception: {}".format(key, e))
            else:
                # Collect in a dict: Series.set_value() was removed in pandas 1.0
                results[key] = ans

    return pd.Series(results)
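
The core of this helper, mapping each future back to the key it was submitted for, can be tried without pandas or network access. The sketch below is one way to do it with plain dicts; `parallel_apply` and `fake_count` are hypothetical names, and `fake_count` merely stands in for `count_findings`.

```python
import concurrent.futures

def parallel_apply(f, items, max_workers=None):
    """Apply `f` to every value of the dict `items`, returning {key: f(value)}."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the key it was submitted for.
        futures = {executor.submit(f, val): key for key, val in items.items()}
        for future in concurrent.futures.as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as e:
                # A failing task is reported and skipped; the others still complete.
                print("{} generated exception: {}".format(key, e))
    return results

def fake_count(name):
    # Stand-in for count_findings; fails on empty input to show the error handling.
    if not name:
        raise ValueError("empty name")
    return len(name.split())

out = parallel_apply(fake_count, {0: "Example Holding AG", 1: "", 2: "Foo"})
print(sorted(out.items()))
```

Because results are stored by key rather than by completion order, the output can be realigned with the original index afterwards.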

Now that we have this helper function, let's call it on our data.

In [None]:
resolved_interests = pd.DataFrame(all_interests, columns=['fullname'])
resolved_interests['sanitized_interest'] = resolved_interests.fullname.apply(string_pipeline)
resolved_interests

## Strong lookup
Look up the full name as is.

In [None]:
resolved_interests['strong_lookup_counts'] = cached_call(lambda : async_series_lookup(count_findings, resolved_interests.fullname, number_parallel_tasks=1), 'strong_lookup_resolved', True)
resolved_interests

## Weak lookup

In [None]:
def weak_lookup(resolved_interests):
    interests_to_look = resolved_interests[resolved_interests.strong_lookup_counts != 1].sanitized_interest
    ans = resolved_interests.strong_lookup_counts.copy()
    ans.update(async_series_lookup(count_findings, interests_to_look, number_parallel_tasks=1))
    
    return ans

resolved_interests['grouped_lookup_counts'] = cached_call(lambda: weak_lookup(resolved_interests), 'weak_lookup_resolved', True)
resolved_interests

## Check the results

In [None]:
describe_ratios(resolved_interests.grouped_lookup_counts)