# MLS pandas - Apply USUK scaler to full newspaper corpora & national representative corpora (2021-12-10)

_by A. Maurits van der Veen_  

_Modification history:_  
_2021-12-03 - Convert to csv lexica; use newest versions of lexica, as publicly available_  
_2021-12-10 - Clean up & streamline for GitHub repo_  
_2021-12-11 - Applied to representative corpora_  

__Note: This is for Tables 1 & 2 in the MLS method paper.__

The files used here are too large for GitHub. They can be found on Zenodo.

### 0. Set-up

Import necessary code modules; specify location of sentiment analysis lexica and associated files; specify corpus location.


In [7]:
projectfolder = '/Users/xxx/Replication/'

In [8]:
import sys
# import os

import csv
import numpy as np
import pandas as pd

# Print summary version info (for fuller info, simply print sys.version)
print('You are using python version {}.'.format(sys.version.split()[0]))

You are using python version 3.10.12.


In [9]:
repcorpusfolder = projectfolder + 'Mean valence by corpus (table 1 & 2)/'

calibratedsuffix = '_vals_cal.csv'


### 1. Get mean valence for 3-year periods for 4 US & UK papers

In [10]:
# Use 3-year combined corpora: 1993-1995, 2001-2003, 2008-2010, 2016-2018
yearranges = {'1993-1995': (1993, 1994, 1995),
              '2001-2003': (2001, 2002, 2003),
              '2008-2010': (2008, 2009, 2010),
              '2016-2018': (2016, 2017, 2018)}

corpusnames = list(yearranges)  # the year-range labels, in insertion order


In [11]:
# Get the mean valence & article count

for paper in ['Mail', 'Observer', 'Sunday Times', 'USA Today']:
    for yearrange in corpusnames:
        # One calibrated valence file per paper and 3-year period
        valencefile = repcorpusfolder + paper + '/' + yearrange + calibratedsuffix
        df = pd.read_csv(valencefile, index_col=False)
        print('Mean corpus valence (over {} articles) for {}, {}: {:5.3f}'.format(
            len(df), paper, yearrange, df['avg_valence'].mean()))

Mean corpus valence (over 113705 articles) for Mail, 1993-1995: -0.236
Mean corpus valence (over 159984 articles) for Mail, 2001-2003: -0.221
Mean corpus valence (over 202058 articles) for Mail, 2008-2010: -0.172
Mean corpus valence (over 126349 articles) for Mail, 2016-2018: -0.255
Mean corpus valence (over 37814 articles) for Observer, 1993-1995: -0.129
Mean corpus valence (over 54634 articles) for Observer, 2001-2003: -0.070
Mean corpus valence (over 86303 articles) for Observer, 2008-2010: 0.077
Mean corpus valence (over 21507 articles) for Observer, 2016-2018: 0.022
Mean corpus valence (over 47409 articles) for Sunday Times, 1993-1995: -0.071
Mean corpus valence (over 75459 articles) for Sunday Times, 2001-2003: 0.024
Mean corpus valence (over 139445 articles) for Sunday Times, 2008-2010: 0.050
Mean corpus valence (over 130349 articles) for Sunday Times, 2016-2018: 0.195
Mean corpus valence (over 93365 articles) for USA Today, 1993-1995: -0.079
Mean corpus valence (over 65672 arti
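
To compare papers and periods side by side, the per-file results can also be collected into one table rather than printed line by line. The following is only a sketch of that idea, not part of the original notebook: the helper `summarize_valence`, the paper labels, and the valence values are made up for illustration. Only the `avg_valence` column name comes from the code above.

```python
import pandas as pd


def summarize_valence(frames):
    """Build one summary row per (paper, period) pair.

    `frames` is a hypothetical dict mapping (paper, period) tuples to
    DataFrames that have an 'avg_valence' column.
    """
    rows = [
        {'paper': paper,
         'period': period,
         'articles': len(df),
         'mean_valence': df['avg_valence'].mean()}
        for (paper, period), df in frames.items()
    ]
    return pd.DataFrame(rows)


# Tiny made-up inputs standing in for the real valence files
demo = {
    ('Paper A', '1993-1995'): pd.DataFrame({'avg_valence': [-0.2, 0.0, 0.2]}),
    ('Paper A', '2001-2003'): pd.DataFrame({'avg_valence': [0.1, 0.3]}),
    ('Paper B', '1993-1995'): pd.DataFrame({'avg_valence': [-0.5, -0.1]}),
    ('Paper B', '2001-2003'): pd.DataFrame({'avg_valence': [0.4]}),
}

summary = summarize_valence(demo)

# Papers as rows, periods as columns, mean valence in the cells
table = summary.pivot(index='paper', columns='period', values='mean_valence')
print(table.round(3))
```

With the real data, `frames` would be filled inside the loop above by storing each `df` under its `(paper, yearrange)` key.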

### 2. Get mean valence for representative corpora

The search and collection criteria for these corpora are described in the supplementary information for the paper.
The MLS sentiment analysis and calibration steps have already been applied to each corpus. Here we simply calculate the
average calibrated valence for each corpus.
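
As a minimal sketch of this calculation, the snippet below reads a small in-memory CSV and reports the row count and the mean of `avg_valence`. The helper name `corpus_mean_valence` and the demo data are hypothetical. The sketch also illustrates one point worth checking on real files: `len(df)` counts every row, while pandas' `mean()` skips missing values by default, so the two can refer to different sets of articles if any valence is missing.

```python
import io

import pandas as pd


def corpus_mean_valence(source, column='avg_valence'):
    """Return (rows, rows with a valence value, mean valence) for one corpus.

    `source` can be a file path or any file-like object accepted by
    pandas.read_csv.
    """
    df = pd.read_csv(source, index_col=False)
    n_rows = len(df)               # every row, including rows with missing valence
    n_valid = df[column].count()   # only rows where the valence is present
    return n_rows, n_valid, df[column].mean()  # mean() ignores NaN by default


# Made-up corpus: four articles, the last one with no valence value
demo_csv = io.StringIO('id,avg_valence\n1,0.5\n2,-0.25\n3,0.25\n4,\n')

n_rows, n_valid, mean_valence = corpus_mean_valence(demo_csv)
print('Mean corpus valence (over {} articles, {} with a value): {:5.3f}'.format(
    n_rows, n_valid, mean_valence))
```

If `n_rows` and `n_valid` differ for a real corpus, the article count reported next to the mean overstates the number of articles that actually entered the average.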


In [12]:

for repcorpus in ['US', 'UK', 'CA', 'AU', 'NZ']:
    valencefile = repcorpusfolder + repcorpus + calibratedsuffix
    df = pd.read_csv(valencefile, index_col=False)
    print('Mean corpus valence (over {} articles) for {}: {:5.3f}'.format(len(df), repcorpus, df['avg_valence'].mean()))

    

Mean corpus valence (over 48283 articles) for US: 0.018
Mean corpus valence (over 59404 articles) for UK: -0.014
Mean corpus valence (over 22860 articles) for CA: 0.014
Mean corpus valence (over 24114 articles) for AU: 0.132
Mean corpus valence (over 7455 articles) for NZ: 0.136


### Done!