## Question:
What makes a quote go viral?

## Terminology:
**VIRAL:** more than 100 occurrences on different sites.
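
As a minimal sketch of this definition, assuming each quote carries the list of URLs where it appeared and that "different sites" means distinct domains (both are assumptions; `is_viral` and `VIRAL_THRESHOLD` are hypothetical names, not part of the project code):

```python
from urllib.parse import urlparse

VIRAL_THRESHOLD = 100  # hypothetical constant mirroring the definition above


def is_viral(urls, threshold=VIRAL_THRESHOLD):
    """Return True if the quote appears on more than `threshold` distinct sites."""
    domains = set()
    for url in urls:
        # netloc is empty for URLs without a scheme; such URLs are skipped here.
        netloc = urlparse(url).netloc.lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]
        if netloc:
            domains.add(netloc)
    return len(domains) > threshold
```

Counting distinct domains rather than raw occurrences keeps a quote repeated many times on a single site from being labelled viral.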

## Application:
Provide insight into how politicians, influencers, and other public figures can gain wide visibility from a single quote.

Provide insight into which classes of people receive more media attention, so that a representative for a given cause can be chosen accordingly.

## Outcome Variables:
- Viral: yes / no
- How fast it went viral: the time a viral quote took to reach 2/3 of its occurrences.
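
The second outcome could be computed along these lines. This is only a sketch: it assumes a timestamp is available for every occurrence of a quote, which the features extracted later do not yet provide, and `time_to_fraction` is a hypothetical helper name.

```python
import math
from datetime import datetime
from fractions import Fraction


def time_to_fraction(occurrence_times, fraction=Fraction(2, 3)):
    """Return the time between the first occurrence and the moment the given
    fraction of all occurrences has been reached (None if there are none)."""
    if not occurrence_times:
        return None
    times = sorted(occurrence_times)
    # Number of occurrences needed to reach the fraction, rounded up so the
    # threshold is actually met. Fraction avoids floating-point rounding.
    needed = max(1, math.ceil(fraction * len(times)))
    return times[needed - 1] - times[0]


# Six occurrences: 2/3 of them (four) are reached on 10 January.
times = [datetime(2020, 1, day) for day in (1, 2, 4, 10, 20, 30)]
print(time_to_fraction(times).days)  # 9
```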

## Features:
- Indicator variables for 3 most common jobs
- Indicator variables for 3 most common genders
- Indicator variables for 3 most common ethnicities
- Age of speaker
- Date of quote (day + month + year) encoded as a single number
- Topic: detect the most common topics and add indicator variables for them
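
The indicator and date features listed above could be built roughly as follows. This is a sketch under assumptions: each speaker is taken to have a single value per attribute (the Wikidata attributes are actually lists), the date is encoded as a day ordinal (one possible choice of "single number"), and all function names are hypothetical.

```python
from collections import Counter
from datetime import date


def top_k_values(values, k=3):
    """Return the k most frequent non-missing values, most common first."""
    counts = Counter(v for v in values if v is not None)
    return [value for value, _ in counts.most_common(k)]


def indicator_features(value, top_values, prefix):
    """One 0/1 indicator per top value; all zeros if `value` is not among them."""
    return {f"{prefix}_{top}": int(value == top) for top in top_values}


def date_to_number(year, month, day):
    """Encode a calendar date as one integer (proleptic Gregorian ordinal)."""
    return date(year, month, day).toordinal()


jobs = ["Politician"] * 4 + ["Actor"] * 3 + ["Singer"] * 2 + ["Writer", None]
top_jobs = top_k_values(jobs)
print(top_jobs)
print(indicator_features("Actor", top_jobs, "job"))
```

An ordinal keeps the date on a linear scale, so a difference of one always means one day, which suits the linear models considered here.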

## Technique:
Linear regression / Logistic regression / SVM

## Data Pre-Processing
Remove quotes for which speaker information is not available, as well as quotes from speakers who are not contemporary.
Also resolve ambiguities in speaker attribution (a quote sometimes has several possible speakers).

In [1]:
import re
import bz2
import json
import os
import utils
import pandas as pd

In [2]:
DATA_DIR = "Data"
CACHE_DIR = "Cache"
SPEAKER_INFO_FILE_PATH = os.path.join(DATA_DIR, "speaker_attributes.parquet")
PREPROCESSED_DATASET_FILE_PATH = os.path.join(CACHE_DIR, "preprocessed_dataset.json.bz2")

In [3]:
@utils.cache_to_file_pickle("function-query_wikidata_for_linkcounts_and_labels")
def query_wikidata_for_linkcounts_and_labels(data_dir, speaker_info_file_path):    
    all_speakers = set()
    speakers_needing_linkcounts = set()
    
    for line in utils.all_quotes_generator(data_dir, 1000000):
        line_qids_set = set(line['qids'])
        
        if len(line['qids']) > 1:
            speakers_needing_linkcounts |= line_qids_set
            
        all_speakers |= line_qids_set    
        
    # Load part of data extracted from Wikidata dump about speakers.
    speaker_data = pd.read_parquet(speaker_info_file_path, columns = ['id', 'label', 'nationality', 'gender', 'ethnic_group', 'occupation', 'party', 'academic_degree', 'candidacy', 'religion'])
    
    # Immediately remove useless lines to save memory.
    speaker_data = speaker_data[speaker_data['id'].isin(all_speakers)]
        
    # Store id-labels pairs in another variable and remove them from original dataframe.
    speaker_qid_labels = speaker_data[['id', 'label']]
    speaker_data.drop(columns = ['id', 'label'], inplace = True)
        
    # Put all qids describing the speakers into one single set.
    qids_needing_labels = utils.ragged_nested_sequence_to_set(speaker_data.values)
    # Use discard() rather than remove() so that no KeyError is raised if no value is missing.
    qids_needing_labels.discard(None)
        
    # Sanity check.
    assert all(utils.str_is_qid(qid) for qid in qids_needing_labels)

    # Retrieve English labels for the attributes of all speakers.
    qid_labels = utils.get_labels_of_wikidata_ids(ids = qids_needing_labels)
    qid_labels = {k: v.title() for k, v in qid_labels.items()}

    # Add speakers' id-labels pairs to qid_labels.
    speaker_qid_labels = speaker_qid_labels[~speaker_qid_labels.isna().any(axis = 1)].set_index('id').to_dict('index')
    speaker_qid_labels = {k: v['label'].title() for k, v in speaker_qid_labels.items()}
    qid_labels.update(speaker_qid_labels)
    
    # Retrieve link counts for speakers for which we need it (used to decide which speaker is most likely being cited
    # amongst homonyms).
    linkcounts = utils.get_link_counts_of_wikidata_ids(ids = speakers_needing_linkcounts)
    linkcounts = {k: int(v) for k, v in linkcounts.items()}

    return qid_labels, linkcounts

In [4]:
qid_labels, linkcounts = query_wikidata_for_linkcounts_and_labels(data_dir = DATA_DIR, speaker_info_file_path = SPEAKER_INFO_FILE_PATH)

In [11]:
def solve_ambiguous_speakers(speakers_qids):   
        
    # Convert to set to avoid repeating action for same speaker multiple times.
    speakers_qids = set(speakers_qids)
        
    # If there is no ambiguity in the possible speaker qids, return the only possible value.
    if len(speakers_qids) == 1:
        return speakers_qids.pop()
            
    # Recover link counts of each speaker queried from Wikidata. If unavailable, fill with 0.
    speakers_linkcounts = {speaker_qid: linkcounts.get(speaker_qid, 0) for speaker_qid in speakers_qids} 
     
    # Return the qid corresponding to the speaker with the largest link count.
    return max(speakers_linkcounts, key = speakers_linkcounts.get)


def get_speaker_age(birth_date, quote_date, min_age = 5, max_age = 90):
    """Return the age of the speaker at the date of the quote.
    Return None if a date is missing or badly formatted, or if the age is outside [min_age, max_age]."""
    
    if birth_date is None or quote_date is None:
        return
        
    # TODO: find a clever way to pick the most probable date among ambiguous ones.
    # For now, take the first one.
    birth_date = birth_date[0]

    # Regular expression matching the year, month and day of date strings in the two formats used.
    date_matcher = re.compile(r"^[+]?(?P<year>-?\d{4})-(?P<month>\d{2})-(?P<day>\d{2})[T ]\d{2}:\d{2}:\d{2}Z?$")
    
    birth_date_match = date_matcher.match(birth_date)
    if birth_date_match is None:
        print("Badly formatted date:", birth_date)
        return
    
    quote_date_match = date_matcher.match(quote_date)
    if quote_date_match is None:
        print("Badly formatted date:", quote_date)
        return
        
    birth_year, birth_month, birth_day = (int(number) for number in birth_date_match.group('year', 'month', 'day'))
    quote_year, quote_month, quote_day = (int(number) for number in quote_date_match.group('year', 'month', 'day'))
    
    age = quote_year - birth_year
    if quote_month < birth_month or (quote_month == birth_month and quote_day < birth_day):
        age -= 1
    
    return age if min_age <= age <= max_age else None


def extract_features(line):    
    features = {}
    
    # Extract outcome variable.
    features['num_occurrences'] = line['numOccurrences']
    
    # Extract speaker information.
    
    
    
    # Extract topics of quote.    
    
    
    
    # Extract domains from news urls.
    domain_matcher = re.compile(r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?(?P<domain>[^:\/?\n]+)")
    get_domain_from_url = lambda url: domain_matcher.match(url).group('domain')
    features['domains'] = [get_domain_from_url(url) for url in line['urls']]    
    
    return features
    
    


def preprocess_dataset(data_dir, output_file_path, speaker_info_file_path, can_reuse_output = True):
    if os.path.isfile(output_file_path) and can_reuse_output:
        return
    
    # Load part of data extracted from Wikidata dump about speakers.
    speaker_data = pd.read_parquet(speaker_info_file_path, columns = ['id', 'date_of_birth']).set_index('id').to_dict('index')

    with bz2.open(output_file_path, "wb") as output_file:
        
        for line in utils.all_quotes_generator(data_dir):
            # Ignore lines for which speaker information is not available.
            if not line['qids']:
                continue

            # Convert list of speaker qids into a single value.
            # If several qids possible, choose the one with largest link count.
            line['qids'] = solve_ambiguous_speakers(line['qids'])

            # Try computing the age of the speaker, and ignore lines for which the speaker's birth date is not
            # available or the speaker was born too long ago to be our contemporary.
            speaker_birth_date = speaker_data.get(line['qids'], {}).get('date_of_birth', None)
            speaker_age = get_speaker_age(speaker_birth_date, line['date'])
            
            if speaker_age is None:
                continue
            
            # Extract features from line.
            features = extract_features(line)
            features['speaker_age'] = speaker_age
            
            # Store features of line.
            output_file.write((json.dumps(features) + '\n').encode('utf-8'))

In [12]:
preprocess_dataset(DATA_DIR,
                   PREPROCESSED_DATASET_FILE_PATH,
                   SPEAKER_INFO_FILE_PATH,
                   can_reuse_output = False)

Starting processing Data\quotes-2015.json.bz2
Processed 1000000 lines from Data\quotes-2015.json.bz2 in 0.670 minutes
Processed 2000000 lines from Data\quotes-2015.json.bz2 in 1.337 minutes
Processed 3000000 lines from Data\quotes-2015.json.bz2 in 1.985 minutes
Processed 4000000 lines from Data\quotes-2015.json.bz2 in 2.635 minutes
Processed 5000000 lines from Data\quotes-2015.json.bz2 in 3.273 minutes
Processed 6000000 lines from Data\quotes-2015.json.bz2 in 3.912 minutes
Processed 7000000 lines from Data\quotes-2015.json.bz2 in 4.561 minutes
Processed 8000000 lines from Data\quotes-2015.json.bz2 in 5.203 minutes
Processed 9000000 lines from Data\quotes-2015.json.bz2 in 5.846 minutes
Processed 10000000 lines from Data\quotes-2015.json.bz2 in 6.519 minutes
Processed 11000000 lines from Data\quotes-2015.json.bz2 in 7.182 minutes
Processed 12000000 lines from Data\quotes-2015.json.bz2 in 7.844 minutes
Processed 13000000 lines from Data\quotes-2015.json.bz2 in 8.510 minutes
Processed 1400

Processed 21000000 lines from Data\quotes-2019.json.bz2 in 13.838 minutes
Finished processing Data\quotes-2019.json.bz2 in 14.342 minutes
Starting processing Data\quotes-2020.json.bz2
Processed 1000000 lines from Data\quotes-2020.json.bz2 in 0.655 minutes
Processed 2000000 lines from Data\quotes-2020.json.bz2 in 1.305 minutes
Processed 3000000 lines from Data\quotes-2020.json.bz2 in 1.954 minutes
Processed 4000000 lines from Data\quotes-2020.json.bz2 in 2.602 minutes
Processed 5000000 lines from Data\quotes-2020.json.bz2 in 3.252 minutes
Finished processing Data\quotes-2020.json.bz2 in 3.410 minutes


68579656