## Question:
What makes a quote go viral?

## Terminology:
**VIRAL:** a quote with more than 100 occurrences across different sites.

## Application:
Providing insight into how politicians, influencers, etc. can gain high visibility from a single quote.

Providing insight into which classes of people receive more media attention, so that a representative (e.g. a spokesperson) can be chosen accordingly.

## Outcome Variables:
- Viral: yes / no
- How fast it went viral: how long a viral quote took to reach 2/3 of its occurrences.
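
The two outcome variables can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `is_viral` and `time_to_fraction` are hypothetical helper names, and the second one assumes that a timestamp is available for every occurrence of a quote, which the fields used in this notebook (`numOccurrences`, `date`) do not provide on their own.

```python
import math
from datetime import datetime, timedelta
from fractions import Fraction

# Threshold taken from the Terminology section: more than 100 occurrences.
VIRAL_THRESHOLD = 100


def is_viral(num_occurrences, threshold=VIRAL_THRESHOLD):
    """Return True when a quote has strictly more occurrences than the threshold."""
    return num_occurrences > threshold


def time_to_fraction(occurrence_times, fraction=Fraction(2, 3)):
    """Time between the first occurrence and the occurrence at which the given
    fraction of all occurrences is reached (assumes one timestamp per occurrence)."""
    times = sorted(occurrence_times)
    if not times:
        return None
    # 1-based rank of the occurrence that completes the requested fraction.
    # Fraction keeps the arithmetic exact, so no floating-point rounding issues.
    k = math.ceil(fraction * len(times))
    return times[k - 1] - times[0]


# Made-up timestamps, for illustration only.
example_times = [datetime(2015, 1, day) for day in (1, 2, 3, 11, 12, 13)]
elapsed = time_to_fraction(example_times)
```

With six occurrences, two thirds are reached at the fourth one, so `elapsed` is the gap between the first and the fourth timestamp.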

## Features:
- Indicator variables for 3 most common jobs
- Indicator variables for 3 most common genders
- Indicator variables for 3 most common ethnicities
- Age of speaker
- Date of quote (day + month + year) encoded as a single number
- Topic: detect the most common topics and add an indicator variable for each of them
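
Two of the encodings listed above can be sketched with the standard library. The helper names `date_to_number` and `top_k_indicators` are hypothetical, and the sketch assumes that the quote date is a string starting with `YYYY-MM-DD`. Using the day ordinal is only one possible way to turn a date into a single number.

```python
from collections import Counter
from datetime import datetime


def date_to_number(date_string):
    """Encode a date as one integer: its proleptic Gregorian day ordinal.

    Assumes the string starts with 'YYYY-MM-DD'; anything after that is ignored.
    """
    return datetime.strptime(date_string[:10], "%Y-%m-%d").date().toordinal()


def top_k_indicators(values, k=3):
    """Find the k most common categories and build one indicator dict per value."""
    top = [category for category, _ in Counter(values).most_common(k)]
    indicators = [{category: value == category for category in top} for value in values]
    return top, indicators


# Made-up category values, for illustration only.
jobs = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"]
top_jobs, job_indicators = top_k_indicators(jobs)
```

A value outside the top categories (here `"d"`) gets `False` for every indicator, which acts as an implicit "other" class.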

## Technique:
Linear regression / Logistic regression / SVM
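
To make the binary "viral: yes / no" setting concrete, here is a small logistic regression trained by batch gradient descent in pure Python. It is only a sketch on made-up toy data with a single hypothetical feature. In practice a library implementation would likely be used, and the learning rate and iteration count are arbitrary choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(row, weights, bias):
    """Predicted probability of the positive class for one feature row."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(n_iterations):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            # Gradient of the log-loss with respect to the linear score.
            error = predict_proba(row, weights, bias) - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Toy data, for illustration only: one feature, label 1 stands for "viral".
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
weights, bias = fit_logistic_regression(X_toy, y_toy)
```

After fitting, `predict_proba([3.0], weights, bias)` is above 0.5 and `predict_proba([0.0], weights, bias)` is below it.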

## Data Pre-Processing
Remove quotes for which speaker information is not available, as well as quotes from speakers who are not contemporary.
Also resolve ambiguous speakers (a quote is sometimes attributed to several candidate speakers).
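
The disambiguation rule used further down (keep the candidate with the largest link count) can be sketched as below. `pick_most_linked_speaker` is a hypothetical stand-in, not the actual `feature_extraction.solve_ambiguous_speakers`, and it assumes that `linkcounts` is a dictionary mapping a speaker QID to an integer count.

```python
def pick_most_linked_speaker(qids, linkcounts):
    """Reduce a list of candidate speaker QIDs to a single QID.

    Returns None when there is no candidate; otherwise returns the candidate
    with the largest link count (unknown QIDs are assumed to count as 0).
    """
    if not qids:
        return None
    return max(qids, key=lambda qid: linkcounts.get(qid, 0))


# Made-up QIDs and counts, for illustration only.
chosen = pick_most_linked_speaker(["Q1", "Q2"], {"Q1": 5, "Q2": 10})
```

When several candidates tie, `max` keeps the first one in the list, so the result then depends on the input order.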

In [1]:
import re
import bz2
import json
import os
import pandas as pd

import utils
import feature_extraction

In [2]:
DATA_DIR = "Data"
CACHE_DIR = "Cache"
SPEAKER_INFO_FILE_PATH = os.path.join(DATA_DIR, "speaker_attributes.parquet")
PREPROCESSED_DATASET_FILE_PATH = os.path.join(CACHE_DIR, "preprocessed_dataset.json.bz2")

In [3]:
qid_labels, linkcounts = utils.query_wikidata_for_linkcounts_and_labels(data_dir = DATA_DIR, speaker_info_file_path = SPEAKER_INFO_FILE_PATH)

In [4]:
def extract_speaker_features(line, speaker_data, min_age = 5, max_age = 95):
    features = {}
    
    # Try computing the age of the speaker at the time of the quote. Ignore lines for which the speaker's
    # birth date is not available, or whose speaker's age falls outside [min_age, max_age] (not a contemporary).
    speaker_birth_date = speaker_data.get(line['qids'], {}).get('date_of_birth', None)
    speaker_age = feature_extraction.get_speaker_age(speaker_birth_date, line['date'])
    
    if speaker_age is None or speaker_age < min_age or speaker_age > max_age:
        return
        
    # Extract gender of the speaker. Possible genders are summarized in 3 categories: "male", "female", "other".
    speaker_gender = speaker_data.get(line['qids'], {}).get('gender', None)
    
    if speaker_gender is None or len(speaker_gender) == 0:
        return
     
    features['speaker_gender'] = 'other'
    if len(speaker_gender) == 1:
        speaker_gender, = speaker_gender
        speaker_qid_label = qid_labels.get(speaker_gender, '').lower()        
        if speaker_qid_label in ['male', 'female']:
            features['speaker_gender'] = speaker_qid_label
    
    # Extract which of the most common occupations the speaker has.
    most_common_occupations = {'actor', 'american football player', 'association football player', 'baseball player',
                               'basketball player', 'businessperson', 'chief executive officer', 'composer',
                               'entrepreneur', 'film actor', 'film director', 'film producer', 'investor', 'journalist',
                               'lawyer', 'musician', 'non-fiction writer', 'politician', 'researcher', 'restaurateur',
                               'screenwriter', 'singer', 'television actor', 'television presenter', 'television producer',
                               'university teacher', 'writer'}
    
    speaker_occupations = speaker_data.get(line['qids'], {}).get('occupation', None)
    speaker_occupations = [] if speaker_occupations is None else speaker_occupations
    
    features['speaker_occupation'] = {occupation: False for occupation in most_common_occupations}
    for occupation in speaker_occupations:
        occupation = qid_labels.get(occupation, '').lower()
        if occupation in features['speaker_occupation']:
            features['speaker_occupation'][occupation] = True
            
    return features
    


def extract_features(line, speaker_data):    
    features = {}
    
    # Extract outcome variable.
    features['num_occurrences'] = line['numOccurrences']
    
    # Extract speaker information.
    speaker_features = extract_speaker_features(line, speaker_data)
    if speaker_features is None:
        return
    
    features.update(speaker_features)
    
    # Extract topics of quote (not implemented yet).
    

    # Extract domains from news urls.
    # features['domains'] = feature_extraction.domains_from_urls(line['urls'])
    
    return features
    
    


def preprocess_dataset(data_dir, output_file_path, speaker_info_file_path,
                       can_reuse_output = True):
    if os.path.isfile(output_file_path) and can_reuse_output:
        return
    
    # Load part of data extracted from Wikidata dump about speakers.
    speaker_data = pd.read_parquet(speaker_info_file_path, columns = ['id', 'date_of_birth', 'gender', 'occupation']).set_index('id').to_dict('index')

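    # Make sure the output directory exists before writing to it.
    os.makedirs(os.path.dirname(output_file_path) or '.', exist_ok = True)

    with bz2.open(output_file_path, "wb") as output_file: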
        
        for line in utils.json_lines_generator(data_dir):
            # Convert list of speaker qids into a single value.
            # If several qids possible, choose the one with largest link count.
            line['qids'] = feature_extraction.solve_ambiguous_speakers(line['qids'], linkcounts)
                 
            # Ignore lines for which speaker information is not available.
            if line['qids'] is None:
                continue
                
            # Extract features from line.
            features = extract_features(line, speaker_data)
            
            # Ignore lines for which feature extraction failed due to unavailability of information or
            # due to filtering of extreme values.
            if features is None:
                continue
            
            # Store features of line.
            output_file.write((json.dumps(features) + '\n').encode('utf-8'))

In [5]:
preprocess_dataset(DATA_DIR,
                   PREPROCESSED_DATASET_FILE_PATH,
                   SPEAKER_INFO_FILE_PATH,
                   can_reuse_output = False)

Starting processing Data\quotes-2015.json.bz2
Processed 1000000 lines from Data\quotes-2015.json.bz2 in 1.679 minutes
Processed 2000000 lines from Data\quotes-2015.json.bz2 in 3.397 minutes
Processed 3000000 lines from Data\quotes-2015.json.bz2 in 5.106 minutes


KeyboardInterrupt: 