# Introduction

(1) C. Cochrane, L. Rheault, T. Whyte, M.-C. Wong, J.-F. Godbout, T. Whyte

(2) Script v.2019-09-24

(3) Title: Comparing Human and Machine Classification of Written and Video Records of Political Speech

(4) Abstract: The volume of machine-readable transcripts of legislative and other speeches has increased exponentially over the past three decades, leading to the widespread application of existing tools for the automated analyses of emotion in text.  Unlike in writing, however, expressing emotion in speech is not confined to word-choice and syntax, and instead relies heavily on intonation, facial expressions, and body language, which go undetected in analyses of political text.  This raises the question of whether tools developed for analyses of writing can detect emotion in transcripts of political speeches. Drawing on a new corpus of text and video data from the Canadian House of Commons, this paper does three things.  First, we examine whether transcripts capture the emotional content of speeches by comparing human judgments of video clips to human judgments of the corresponding transcripts. We find that transcripts capture the sentiment, but not the emotional intensity, of political speeches. Second, we compare strategies for the automated analysis of sentiment in text, and test their outputs against human-coded sentiment analysis of speech transcripts. We find that leading dictionary and supervised approaches to sentiment detection performed reasonably well, but lexicons generated using word embeddings far surpassed these other approaches. Finally, we test the robustness of word embeddings to domain specificity, choice of seed words, chance, and corpus size.   We find that word embeddings can be transferred across domains and are reasonably robust to alternative specifications and conditions. We conclude by discussing the implications of these findings for the analysis of emotion in speech. 

# Script

## I. Initialization

In [None]:
import pandas as pd
import numpy as np

## II. Randomization Script

The first stage of the project involved selecting speech fragments from video record of question period maintained by the Canadian Parliamentary Access Channel (CPAC).  We recorded the full video of every third question period during the four years beginning November 26, 2013 (last session of 41st Parliament and Harper Government) to December 13, 2017 (first session of 42nd Parliament and Trudeau Government). The videos of the question periods were trimmed to run from the beginning of the first speech to the end of the last speech.  

The script below identifies 10 random time points within each video.  We manually created video clips of the sentences being spoken at each time point. If nobody was speaking at an indicated time, or if two selected time points captured the same sentence, we selected the sentence spoken immediately prior to the sentence being spoken at a given time point. There were three cases with an error in an identified clip (email dings). We chose the sentence prior to these clips as well. To identify the beginning and end of a sentence, we used the official record of debates in the language of the speaker. C. Cochrane trimmed the clips manually. The clips are available on YouTube (see hansardExtractedVideoTranscripts.csv, below, for links) or by request (16GB), and the full videos are available upon request (870GB).

The script in this section was originally run in R, but can be run in this notebook.  

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R -w 5 -h 5 --units in -r 200
## FILE PATHS
install.packages('lubridate')
library(lubridate)

## FILE PATHS
outpath <- "/Users/chriscochrane/Dropbox/Sentiment/emotionInParliament/clipSelections/" # (add your desination fold

## META DATA
year <- 2017 #Enter Year (YYYY)
month <- 12 #Enter Month (1-12)
day <- 13 #Enter Day (1-31)
videoLength = "47:54" #Enter Length of Video (MM:SS)

##EXTRACTION POINTS

nOfExtractionPoints = 10 #Enter number of sentences to extract

## DEFINE RANDOMIZER

runRandomizer <- function(nOfExtractionPoints, videoLength){
  #generates a set of n random numbers in the domain of 0 to k,
  #where n is the number of extraction points requested (line 9)
  #and k is the length of the video (line 7):

  timeStamps <- runif(nOfExtractionPoints, 0,as.period(ms(videoLength), unit="sec") )
  timeStamps <- round(seconds_to_period(timeStamps), digits=0) #rounding to nearest second
  timeStamps <- sort(timeStamps) #ascending order

  return(timeStamps)
}

##RUN RANDOMIZER
#Function accepts arguments for number of extraction points and
#lenth of the video. These are defined above.

extractionPoints <- runRandomizer(nOfExtractionPoints, videoLength)

## OUTPUT
subpath <- sprintf("%s-%s-%s%s", year, month, day,"/")

dir.create(file.path(subpath))
extractionPoints <- data.frame(extractionPoints) #creating dataframe
colnames(extractionPoints) <- c("timeStamp") #naming column
extractionPoints[c("Speaker", "French", "Party", "Length", 
                   "English Hansard", "Hansard Floor", 
                   "Google Translate")] <- NA #creating columns

write.csv(extractionPoints, file=sprintf("%s%s-%s-%s-%s%s", subpath,  year, month, day, "selections", ".csv"))
write.csv(extractionPoints, file=sprintf("%s%s-%s-%s-%s%s", subpath,  year, month, day, "final", ".csv"))

## III. Video Coding

Once the video clips were created, they were uploaded to youTube and piped into a Qualtrics survey instrument.  The transcripts, metadata, and YouTube links for these videos is in hansardExtractedVideoTranscripts.csv.  

In [None]:
hansardExtractedVideos = pd.read_csv('hansardExtractedVideoTranscripts.csv')
hansardExtractedVideos['Video'] = hansardExtractedVideos['ID_main']
hansardExtractedVideos.head(5)

The clips were shown at random---via a Qualtrics Survey Instrument---to three independent, bilingual coders.  The coders were asked to seperately adjudge the valence (sentiment) and activation (intensity/arousal) of each speech fragment.  The raw Qualtrics output is in the file: 'emotionInHansard_December+20%2C+2018_14.52.csv'  

In [None]:
rawQualtricsOutput = pd.read_csv('emotionInHansard_December+20%2C+2018_14.52.csv')
rawQualtricsOutput.tail(5)

### III.A Extracting Data from Qualtrics Output

Each coder saw each video more than once.  Qualtrics output is a dog's breakfast. The following script extracts from the Qualtrics output, and structures, the first two sets of scores assigned by each coder to each video.  

#### III.A.1 Empty Lists

In [None]:
videoList = []
coderList = []
sentimentList = []
activationList = []

#### III.A.2 A function for splitting strings based on a property.  

The qualtrics output puts the values of different variables into the same cell, separating the variables by some symbol or another, but typically by a ' | '. The qualtrics data is also broken into blocs.  This is because the survey creation tool loads a still image for every video linked anywhere in the instrument, thus crashing the browser.  Blocks provide a workaround.    

In [None]:
def findStringParts(string, symbol, n):
    string = str(string)
    parts = string.split(symbol, n+1)
    if len(parts)<=n+1:
        return parts
    return parts

#### III.A.3 Parse by row

In [None]:
for i in range(1,len(rawQualtricsOutput)):

    if rawQualtricsOutput['distributionChannel'][i] != "preview": #drop previews
        
        coder = rawQualtricsOutput['QID78_TEXT'][i]
        if coder == "js":
            coder = "JS"
        
        #-----------------
        # Bloc A
        #-----------------
        '''count n of bars, which separate variables'''
        numBars = rawQualtricsOutput['BL_9z3f3SVI5D1eOk5_DO'][i].count('|') #count seperators (|)       
        
        '''break strings into components, which are separated by bars'''
        blocA_DO = findStringParts(rawQualtricsOutput['BL_9z3f3SVI5D1eOk5_DO'][i], '|', numBars) 
        blocA_title_vid1 = blocA_DO[0] # The first video title
        blocA_title_vid2 = blocA_DO[1] # The second video title
        
        '''activation score for first video'''
        blocA_act_vid1 = rawQualtricsOutput['QID23_1'][i] 

        '''activation score for second video'''
        blocA_act_vid2 = rawQualtricsOutput['QID23_2'][i] 

        '''sentiment score for first video'''
        blocA_sent_vid1 = rawQualtricsOutput['QID26_1'][i] 

        '''sentiment score for second video'''
        blocA_sent_vid2 = rawQualtricsOutput['QID26_2'][i] 

        '''Add Bloc A data to running Tally'''

        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocA_title_vid1)
        sentimentList.append(blocA_sent_vid1)
        activationList.append(blocA_act_vid1)

        '''For Video 2'''
        coderList.append(coder)
        videoList.append(blocA_title_vid2)
        sentimentList.append(blocA_sent_vid2)
        activationList.append(blocA_act_vid2)

        #-----------------
        # Bloc B
        #-----------------
        '''count n of bars, which separate variables'''

        numBars = rawQualtricsOutput['BL_6Rb4ytpLTIRc2ot_DO'][i].count('|') #count seperators (|)       

        '''break strings into components, which are separated by bars'''

        blocB_DO = findStringParts(rawQualtricsOutput['BL_6Rb4ytpLTIRc2ot_DO'][i], '|', numBars) 
        blocB_title_vid1 = blocB_DO[0] # The first video title
        blocB_title_vid2 = blocB_DO[1] # The second video title

        # *** CC FIXED. For BLOC B, Sentiment and Activation Labels Reversed in Qualtrics Data***

        '''activation score for first video'''
        blocB_act_vid1 = rawQualtricsOutput['QID3450_1'][i] 

        '''activation score for second video'''
        blocB_act_vid2 = rawQualtricsOutput['QID3450_2'][i] 

        '''sentiment score for first video'''
        blocB_sent_vid1 = rawQualtricsOutput['QID3451_1'][i]

        '''sentiment score for second video'''
        blocB_sent_vid2 = rawQualtricsOutput['QID3451_2'][i]
        
        '''Add Bloc B data to running Tally'''

        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocB_title_vid1)
        sentimentList.append(blocB_sent_vid1)
        activationList.append(blocB_act_vid1)

        '''For Video 2'''
        coderList.append(coder)
        videoList.append(blocB_title_vid2)
        sentimentList.append(blocB_sent_vid2)
        activationList.append(blocB_act_vid2)
   

        #-----------------
        # Bloc C
        #-----------------

        '''count n of bars, which separate variables'''
        numBars = rawQualtricsOutput['BL_9YmtLZfF0ecSpuJ_DO'][i].count('|') #count seperators (|)       

        '''break strings into components, which are separated by bars'''
        blocC_DO = findStringParts(rawQualtricsOutput['BL_9YmtLZfF0ecSpuJ_DO'][i], '|', numBars) 
        blocC_title_vid1 = blocC_DO[0] # The first video title
        blocC_title_vid2 = blocC_DO[1] # The second video title

        '''arousal code for first video'''
        blocC_act_vid1 = rawQualtricsOutput['QID1055_1'][i]  

        '''arousal code for second video'''
        blocC_act_vid2 = rawQualtricsOutput['QID1055_2'][i]

        '''sentiment code for first video'''
        blocC_sent_vid1 = rawQualtricsOutput['QID1056_1'][i]

        '''sentiment code for second video'''
        blocC_sent_vid2 = rawQualtricsOutput['QID1056_2'][i]

        '''Add Bloc C data to running Tally'''

        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocC_title_vid1)
        sentimentList.append(blocC_sent_vid1)
        activationList.append(blocC_act_vid1)

        '''For Video 2'''
        coderList.append(coder)
        videoList.append(blocC_title_vid2)
        sentimentList.append(blocC_sent_vid2)
        activationList.append(blocC_act_vid2)
        
        #-----------------
        # BlocD
        #-----------------

        '''count n of bars, which separate variables'''
        numBars = rawQualtricsOutput['BL_6rskZu2BNcPiZfL_DO'][i].count('|') #count seperators (|)       

        '''break strings into components, which are separated by bars'''
        blocD_DO = findStringParts(rawQualtricsOutput['BL_6rskZu2BNcPiZfL_DO'][i], '|', numBars) 
        blocD_title_vid1 = blocD_DO[0] # The first video title
        blocD_title_vid2 = blocD_DO[1] # The second video title

        # *** CC FIXED! Sentiment and Activation Labels Reversed in Qualtrics ***

        '''arousal code for first video'''  
        blocD_act_vid1 = rawQualtricsOutput['QID3454_1'][i] #FIXED!
        
        '''arousal code for second video'''
        blocD_act_vid2 = rawQualtricsOutput['QID3454_2'][i] #FIXED!
        
        '''sentiment code for first video'''
        blocD_sent_vid1 = rawQualtricsOutput['QID3455_1'][i] #FIXED!
        
        '''sentiment code for second video'''
        blocD_sent_vid2 = rawQualtricsOutput['QID3455_2'][i] #FIXED!
        
        '''Add Bloc D data to running Tally'''
        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocD_title_vid1)
        sentimentList.append(blocD_sent_vid1)
        activationList.append(blocD_act_vid1)

        '''For Video 2'''
        coderList.append(coder)
        videoList.append(blocD_title_vid2)
        sentimentList.append(blocD_sent_vid2)
        activationList.append(blocD_act_vid2)

        #-----------------
        # BlocE
        #-----------------

        '''count n of bars, which separate variables'''
        numBars = rawQualtricsOutput['BL_3aS8nqvLjwudQl7_DO'][i].count('|') #count seperators (|)       

        '''break strings into components, which are separated by bars'''
        blocE_DO = findStringParts(rawQualtricsOutput['BL_3aS8nqvLjwudQl7_DO'][i], '|', numBars) 
        blocE_title_vid1 = blocE_DO[0] # The first video title
        blocE_title_vid2 = blocE_DO[1] # The second video title

        '''arousal code for first video'''
        blocE_act_vid1 = rawQualtricsOutput['QID1060_1'][i] 

        '''arousal code for second video'''
        blocE_act_vid2 = rawQualtricsOutput['QID1060_2'][i] 

        '''sentiment code for first video'''
        blocE_sent_vid1 = rawQualtricsOutput['QID1061_1'][i] 

        '''sentiment code for second video'''
        blocE_sent_vid2 = rawQualtricsOutput['QID1061_2'][i]    
                
        '''Add Bloc E data to running Tally'''
        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocE_title_vid1)
        sentimentList.append(blocE_sent_vid1)
        activationList.append(blocE_act_vid1)

        '''For Video 2'''

        coderList.append(coder)
        videoList.append(blocE_title_vid2)
        sentimentList.append(blocE_sent_vid2)
        activationList.append(blocE_act_vid2)
        
        #-----------------
        # BlocF
        #-----------------

        '''count n of bars, which separate variables'''
        numBars = rawQualtricsOutput['BL_cNiU4FegXk1GkCx_DO'][i].count('|') #count seperators (|)       

        '''break strings into components, which are separated by bars'''
        blocF_DO = findStringParts(rawQualtricsOutput['BL_cNiU4FegXk1GkCx_DO'][i], '|', numBars) 
 
        blocF_title_vid1 = blocF_DO[0] # The first video title
        blocF_title_vid2 = blocF_DO[1] # The second video title

        '''arousal score for first video'''
        blocF_act_vid1 = rawQualtricsOutput['QID1063_1'][i] 

        '''arousal code for second video'''
        blocF_act_vid2 = rawQualtricsOutput['QID1063_2'][i] 

        '''sentiment code for first video'''
        blocF_sent_vid1 = rawQualtricsOutput['QID1064_1'][i] 

        '''sentiment code for second video'''
        blocF_sent_vid2 = rawQualtricsOutput['QID1064_2'][i] 

        '''Add Bloc F data to running Tally'''

        '''For Video 1'''
        coderList.append(coder)
        videoList.append(blocF_title_vid1)
        sentimentList.append(blocF_sent_vid1)
        activationList.append(blocF_act_vid1)

        '''For Video 2'''
        coderList.append(coder)
        videoList.append(blocF_title_vid2)
        sentimentList.append(blocF_sent_vid2)
        activationList.append(blocF_act_vid2)


        #-----------------
        # BlocA1B1
        #-----------------

        '''count n of bars, which separate variables'''
        if pd.notnull(rawQualtricsOutput['BL_0uhk6r0UWlNJ3tb_DO'][i]):

            numBars = rawQualtricsOutput['BL_0uhk6r0UWlNJ3tb_DO'][i].count('|') #count seperators (|)       
            
            '''break strings into components, which are separated by bars'''
            blocA1B1_DO = findStringParts(rawQualtricsOutput['BL_0uhk6r0UWlNJ3tb_DO'][i], '|', numBars) 
            
            blocA1B1_title_vid1 = blocA1B1_DO[0] # The first video title
            
            blocA1B1_title_vid2 = blocA1B1_DO[1] # The second video title
            
            '''arousal code for first video'''
            blocA1B1_act_vid1 = rawQualtricsOutput['QID4911_1'][i] 

            '''arousal code for second video'''
            blocA1B1_act_vid2 = rawQualtricsOutput['QID4911_2'][i] 

            '''sentiment code for first video'''
            blocA1B1_sent_vid1 = rawQualtricsOutput['QID4910_1'][i] 

            '''sentiment code for second video'''
            blocA1B1_sent_vid2 = rawQualtricsOutput['QID4910_2'][i] 

            
            '''Add Bloc A1B1 data to running Tally'''

            '''For Video 1'''

            coderList.append(coder)
            videoList.append(blocA1B1_title_vid1)
            sentimentList.append(blocA1B1_sent_vid1)
            activationList.append(blocA1B1_act_vid1)

            '''For Video 2'''
            coderList.append(coder)
            videoList.append(blocA1B1_title_vid2)
            sentimentList.append(blocA1B1_sent_vid2)
            activationList.append(blocA1B1_act_vid2)
        
        
        #-----------------
        # BlocA1B2
        #-----------------

        '''count n of bars, which separate variables'''
        if pd.notnull(rawQualtricsOutput['BL_eCHbbimkb0dc5i5_DO'][i]):
            numBars = rawQualtricsOutput['BL_eCHbbimkb0dc5i5_DO'][i].count('|') #count seperators (|)

            '''count n of bars, which separate variables'''
            numBars = rawQualtricsOutput['BL_eCHbbimkb0dc5i5_DO'][i].count('|') #count seperators (|)       

            '''break strings into components, which are separated by bars'''
            blocA1B2_DO = findStringParts(rawQualtricsOutput['BL_eCHbbimkb0dc5i5_DO'][i], '|', numBars) 

            blocA1B2_title_vid1 = blocA1B2_DO[0] # The first video title

            blocA1B2_title_vid2 = blocA1B2_DO[1] # The second video title

            '''arousal code for first video'''
            blocA1B2_act_vid1 = rawQualtricsOutput['QID5054_1'][i] 

            '''arousal code for second video'''
            blocA1B2_act_vid2 = rawQualtricsOutput['QID5054_2'][i] 

            '''sentiment code for first video'''
            blocA1B2_sent_vid1 = rawQualtricsOutput['QID5052_1'][i] 

            '''sentiment code for second video'''
            blocA1B2_sent_vid2 = rawQualtricsOutput['QID5052_2'][i] 


            '''Add Bloc A1B2 data to running Tally'''
            '''For Video 1'''

            coderList.append(coder)
            videoList.append(blocA1B2_title_vid1)
            sentimentList.append(blocA1B2_sent_vid1)
            activationList.append(blocA1B2_act_vid1)

            '''For Video 2'''

            coderList.append(coder)
            videoList.append(blocA1B2_title_vid2)
            sentimentList.append(blocA1B2_sent_vid2)
            activationList.append(blocA1B2_act_vid2)


        #-----------------
        # BlocA1F1
        #-----------------
        
        '''count n of bars, which separate variables'''

        if pd.notnull(rawQualtricsOutput['BL_0kXaBSNouGm7OXH_DO'][i]):

            numBars = rawQualtricsOutput['BL_0kXaBSNouGm7OXH_DO'][i].count('|') #count seperators (|)

            '''count n of bars, which separate variables'''
            numBars = rawQualtricsOutput['BL_0kXaBSNouGm7OXH_DO'][i].count('|') #count seperators (|)       

            '''break strings into components, which are separated by bars'''
            blocA1F1_DO = findStringParts(rawQualtricsOutput['BL_0kXaBSNouGm7OXH_DO'][i], '|', numBars) 
            blocA1F1_title_vid1 = blocA1F1_DO[0] # The first video title
            blocA1F1_title_vid2 = blocA1F1_DO[1] # The second video title

            '''arousal code for first video'''
            blocA1F1_act_vid1 = rawQualtricsOutput['QID5342_1'][i] 

            '''arousal code for second video'''
            blocA1F1_act_vid2 = rawQualtricsOutput['QID5342_2'][i] 

            '''sentiment code for first video'''
            blocA1F1_sent_vid1 = rawQualtricsOutput['QID5341_1'][i] 

            '''sentiment code for second video'''
            blocA1F1_sent_vid2 = rawQualtricsOutput['QID5341_2'][i] 
        

            '''Add Bloc A1F1 data to running Tally'''

            '''For Video 1'''

            coderList.append(coder)
            videoList.append(blocA1F1_title_vid1)
            sentimentList.append(blocA1F1_sent_vid1)
            activationList.append(blocA1F1_act_vid1)

            '''For Video 2'''

            coderList.append(coder)
            videoList.append(blocA1F1_title_vid2)
            sentimentList.append(blocA1F1_sent_vid2)
            activationList.append(blocA1F1_act_vid2)


        #-----------------
        # BlocA1F2
        #-----------------

        '''count n of bars, which separate variables'''

        if pd.notnull(rawQualtricsOutput['BL_0kTwd0XKn0tOQ7z_DO'][i]):
            numBars = rawQualtricsOutput['BL_0kTwd0XKn0tOQ7z_DO'][i].count('|') #count seperators (|)     

            '''break strings into components, which are separated by bars'''
            blocA1F2_DO = findStringParts(rawQualtricsOutput['BL_0kTwd0XKn0tOQ7z_DO'][i], '|', numBars) 
            blocA1F2_title_vid1 = blocA1F2_DO[0] # The first video title
            blocA1F2_title_vid2 = blocA1F2_DO[1] # The second video title

            '''arousal code for first video'''
            blocA1F2_act_vid1 = rawQualtricsOutput['QID5520_1'][i] 

            '''arousal code for second video'''
            blocA1F2_act_vid2 = rawQualtricsOutput['QID5520_2'][i] 

            '''sentiment code for first video'''
            blocA1F2_sent_vid1 = rawQualtricsOutput['QID5522_1'][i] 

            '''sentiment code for second video'''
            blocA1F2_sent_vid2 = rawQualtricsOutput['QID5522_2'][i] 

            '''Add Bloc A1F2 data to running Tally'''
            '''For Video 1'''
            coderList.append(coder)
            videoList.append(blocA1F2_title_vid1)
            sentimentList.append(blocA1F2_sent_vid1)
            activationList.append(blocA1F2_act_vid1)

            '''For Video 2'''
            coderList.append(coder)
            videoList.append(blocA1F2_title_vid2)
            sentimentList.append(blocA1F2_sent_vid2)
            activationList.append(blocA1F2_act_vid2)


        #-----------------
        # BlocA2F1
        #-----------------

        '''count n of bars, which separate variables'''

        if pd.notnull(rawQualtricsOutput['BL_8expPTVNXuTVOUB_DO'][i]):
            numBars = rawQualtricsOutput['BL_8expPTVNXuTVOUB_DO'][i].count('|') #count seperators (|)       

            '''break strings into components, which are separated by bars'''
            blocA2F1_DO = findStringParts(rawQualtricsOutput['BL_8expPTVNXuTVOUB_DO'][i], '|', numBars) 
            blocA2F1_title_vid1 = blocA2F1_DO[0] # The first video title
            blocA2F1_title_vid2 = blocA2F1_DO[1] # The second video title

            '''arousal code for first video'''
            blocA2F1_act_vid1 = rawQualtricsOutput['QID5693_1'][i] 

            '''arousal code for second video'''
            blocA2F1_act_vid2 = rawQualtricsOutput['QID5693_2'][i] 

            '''sentiment code for first video'''
            blocA2F1_sent_vid1 = rawQualtricsOutput['QID5695_1'][i] 

            '''sentiment code for second video'''
            blocA2F1_sent_vid2 = rawQualtricsOutput['QID5695_2'][i] 
            
            
            '''Add Bloc A2F1 data to running Tally'''
            '''For Video 1'''
            coderList.append(coder)
            videoList.append(blocA2F1_title_vid1)
            sentimentList.append(blocA2F1_sent_vid1)
            activationList.append(blocA2F1_act_vid1)

            '''For Video 2'''
            coderList.append(coder)
            videoList.append(blocA2F1_title_vid2)
            sentimentList.append(blocA2F1_sent_vid2)
            activationList.append(blocA2F1_act_vid2)
            
        #-----------------
        # BlocA2B1
        #-----------------

        '''count n of bars, which separate variables'''
        if pd.notnull(rawQualtricsOutput['BL_6tBHFflywlCq6nb_DO'][i]):
            
            '''count n of bars, which separate variables'''
            numBars = rawQualtricsOutput['BL_6tBHFflywlCq6nb_DO'][i].count('|') #count seperators (|)       

            '''break strings into components, which are separated by bars'''
            blocA2B1_DO = findStringParts(rawQualtricsOutput['BL_6tBHFflywlCq6nb_DO'][i], '|', numBars) 
            blocA2B1_title_vid1 = blocA2B1_DO[0] # The first video title
            blocA2B1_title_vid2 = blocA2B1_DO[1] # The second video title

            '''arousal code for first video'''
            blocA2B1_act_vid1 = rawQualtricsOutput['QID5781_1'][i] 

            '''arousal code for second video'''
            blocA2B1_act_vid2 = rawQualtricsOutput['QID5781_2'][i] 

            '''sentiment code for first video'''
            blocA2B1_sent_vid1 = rawQualtricsOutput['QID5782_1'][i] 

            '''sentiment code for second video'''
            blocA2B1_sent_vid2 = rawQualtricsOutput['QID5782_2'][i] 


            '''Add Bloc A2B1 data to running Tally'''
            '''For Video 1'''
            coderList.append(coder)
            videoList.append(blocA2B1_title_vid1)
            sentimentList.append(blocA2B1_sent_vid1)
            activationList.append(blocA2B1_act_vid1)

            '''For Video 2'''
            coderList.append(coder)
            videoList.append(blocA2B1_title_vid2)
            sentimentList.append(blocA2B1_sent_vid2)
            activationList.append(blocA2B1_act_vid2)
            
        #-----------------
        # BlocA2B2
        #-----------------

        '''count n of bars, which separate variables'''

        if pd.notnull(rawQualtricsOutput['BL_42F6yra3ogqCtpP_DO'][i]):

            '''count n of bars, which separate variables'''
            numBars = rawQualtricsOutput['BL_42F6yra3ogqCtpP_DO'][i].count('|') #count seperators (|)
            
            '''break strings into components, which are separated by bars'''
            blocA2B2_DO = findStringParts(rawQualtricsOutput['BL_42F6yra3ogqCtpP_DO'][i], '|', numBars) 
            blocA2B2_title_vid1 = blocA2B2_DO[0] # The first video title
            blocA2B2_title_vid2 = blocA2B2_DO[1] # The second video title

            '''arousal code for first video'''
            blocA2B2_act_vid1 = rawQualtricsOutput['QID5954_1'][i] 

            '''arousal code for second video'''
            blocA2B2_act_vid2 = rawQualtricsOutput['QID5954_2'][i] 

            '''sentiment code for first video'''
            blocA2B2_sent_vid1 = rawQualtricsOutput['QID5955_1'][i] 

            '''sentiment code for second video'''
            blocA2B2_sent_vid2 = rawQualtricsOutput['QID5955_2'][i] 


            '''Add Bloc A2B2 data to running Tally'''
            '''For Video 1'''
            coderList.append(coder)
            videoList.append(blocA2B2_title_vid1)
            sentimentList.append(blocA2B2_sent_vid1)
            activationList.append(blocA2B2_act_vid1)

            '''For Video 2'''
            coderList.append(coder)
            videoList.append(blocA2B2_title_vid2)
            sentimentList.append(blocA2B2_sent_vid2)
            activationList.append(blocA2B2_act_vid2)



#### III.A.3 Save Structured Data

In [None]:
# Fix Video Labelling Error
videoList = [video.replace('2017 12 1 ','2017 12 01 ' ) for video in videoList]

'''Combine Running Tally Lists in Dictionary'''
qualtricsStructured = {'Coder': coderList,
                       'Video': videoList,
                       'Sentiment': sentimentList,
                       'Activation': activationList
                      }

'''Convert to Pandas DF'''
qualtricsStructured = pd.DataFrame(qualtricsStructured)

'''Assign Missing Values'''
qualtricsStructured = qualtricsStructured.replace('-99',pd.NaT)

'''Drop Missing Values, as there was no reason for coders to produce them
and only one coder did, as if he was interrupted in the middle
of coding a single video.  This could have been a connection issue.'''

qualtricsStructured = qualtricsStructured.dropna()

qualtricsStructured = qualtricsStructured[['Video', 'Coder', 'Sentiment', 
                                           'Activation']]

qualtricsStructured.index.name = 'Order'

'''Export to CSV'''
qualtricsStructured.to_csv('qualtricsStructured.csv', encoding='utf-8')

qualtricsStructured[qualtricsStructured['Video'] == "2017 10 05 1"].head(10)

#### III.A.3 Extract First Two Codes of Each Video

We extract the first two codes because we later compare the consistency of video coders to the consistency of text coders, and text coders coded each sentence twice.  If we keep all of the video codes, then video codes may become more reliable simply by virtue of averaging a greater number of observations for each coder.  We avoid the problem of comparing the reliability of video and text coders by contraining the analysis to the first two coding decisions of each coder for each video.   

In [None]:
import warnings
warnings.filterwarnings('ignore')

#Choose first two coding decisions
qualtricsStructured['Sequence'] = qualtricsStructured.groupby(['Coder', 'Video']).cumcount()
qualtricsStructured.tail(5)

In [None]:
qualtricsStructured = qualtricsStructured[qualtricsStructured['Sequence'] < 2]
qualtricsStructured.head(5)

In [None]:
qualtricsStructured['Mode'] = "Video"

Linking the qualtrics output to metadata about the extracted segment:

In [None]:
qualtricsStructured = qualtricsStructured.merge(hansardExtractedVideos, how='left', on='Video')

In [None]:
qualtricsStructured.tail(5)

In [None]:
"""
firstTwo = qualtricsStructured[qualtricsStructured['Sequence'] <= 1]
firstTwo['Sentiment'] = pd.to_numeric(firstTwo['Sentiment'])
firstTwo['Activation'] = pd.to_numeric(firstTwo['Activation'])

firstTwo = firstTwo.assign(sentimentFirst = pd.to_numeric(firstTwo['Sentiment'])) #create sentimentFirst assign Sentiment
firstTwo.loc[firstTwo['Sequence'] == 1, 'sentimentFirst'] = np.NaN #replace sentimentFirst with NaN if Sequence with vals (0,1) = 1
firstTwo = firstTwo.assign(sentimentSecond = pd.to_numeric(firstTwo['Sentiment'])) #create sentimentSecond assign Sentiment
firstTwo.loc[firstTwo['Sequence'] == 0, 'sentimentSecond'] = np.NaN #replace sentimentSecond with NaN if Sequence with vals (0,1) = 0

firstTwo = firstTwo.assign(activationFirst = pd.to_numeric(firstTwo['Activation'])) #create sentimentFirst assign Sentiment
firstTwo.loc[firstTwo['Sequence'] == 1, 'activationFirst'] = np.NaN #replace sentimentFirst with NaN if Sequence with vals (0,1) = 1
firstTwo = firstTwo.assign(activationSecond = pd.to_numeric(firstTwo['Activation'])) #create sentimentSecond assign Sentiment
firstTwo.loc[firstTwo['Sequence'] == 0, 'activationSecond'] = np.NaN #replace sentimentSecond with NaN if Sequence with vals (0,1) = 0

videoAverages = firstTwo.groupby(['Coder', 'Video']).agg({
    'Video': 'first',
    'Coder': 'first',
    'sentimentFirst': 'mean',
    'sentimentSecond': 'mean',
    'activationFirst': 'mean',
    'activationSecond': 'mean',
    'Sentiment': 'mean',
    'Activation': 'mean'})

videoAverages.head()
"""


## IV. Text Coding

We asked three independent coders to code the sentiment and activation of the transcripts of each video clip.  The order of the transcripts were randomized.  We then had the same coders return, three months later, to redo their coding. The following script extracts these codes and integrates them with the results of the video coding.    

### IV. A. Append together the text coding files for each coder

In [None]:
jv1 = pd.read_csv('jv_coding1.csv')
jv2 = pd.read_csv('jv_coding2.csv')
cm1 = pd.read_csv('cm_coding1.csv')
cm2 = pd.read_csv('cm_coding2.csv')
sf1 = pd.read_csv('sf_coding1.csv')
sf2 = pd.read_csv('sf_coding2.csv')

jv1['Sequence'] = 0
cm1['Sequence'] = 0
sf1['Sequence'] = 0
jv2['Sequence'] = 1
cm2['Sequence'] = 1
sf2['Sequence'] = 1

jv1['Coder'] = 'JV'
jv2['Coder'] = 'JV'
sf1['Coder'] = 'SF'
sf2['Coder'] = 'SF'
cm1['Coder'] = 'CM'
cm2['Coder'] = 'CM'

textScores = pd.DataFrame(columns=['Label', 'Coder', 'Sentiment', 'Activation', 'Sequence'])
textScores = textScores.append([jv1, jv2, sf1, sf2, cm1, cm2])
textScores['Mode'] = "Text"


In [None]:
textScores.head(5)

### IV. B. Linking speech data to extracted data about the speech segment

In [None]:
textScores = textScores.merge(hansardExtractedVideos, how="left", on = "Label")

### IV. C. Cleaning

In [None]:
list(textScores.columns)

In [None]:
textScores = textScores.drop(columns=['Video_x', 'Speaker_x', 'Party', 'Shuffle'])

In [None]:
textScores.rename(columns = {'Video_y': 'Video', 'Video_x': 'Video', 'Speaker_y': 'Speaker'}, inplace=True)

In [None]:
textScores = textScores.sort_values(by=['Video', 'Coder'])

In [None]:
textScores.head(5)

## V. Appending Text and Video Data

In [None]:
fullCodingData = pd.DataFrame(columns = ['Activation', 'Coder', 'Label', 'Sentiment', 'Sequence', 'englishHansard', 'Mode', 'Date', 'ID_main', 'youTube', 'timeStamp', 'Speaker', 'French', 'party', 'seconds', 'english', 'floor', 'Video'])

In [None]:
fullCodingData = fullCodingData.append([textScores, qualtricsStructured])

In [None]:
fullCodingData.head(1)

## VI. Aggregating Data to Video Level 

In [None]:
fullCodingDataVideo = fullCodingData

Generates for each video its average sentiment and activation scores from each coder.

In [None]:
fullCodingDataVideo[['Sentiment', 'Activation']] = fullCodingDataVideo[['Sentiment', 'Activation']].apply(pd.to_numeric)
fullCodingDataVideo['coderMeanSent'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['Sentiment'].transform('mean')
fullCodingDataVideo['coderMeanAct'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['Activation'].transform('mean')

Captures the first and second scores from each coder for each video

In [None]:
fullCodingDataVideo['sent1'] = np.where(fullCodingDataVideo['Sequence']==0, fullCodingDataVideo['Sentiment'], np.NaN)
fullCodingDataVideo['sent2'] = np.where(fullCodingDataVideo['Sequence']==1, fullCodingDataVideo['Sentiment'], np.NaN)
fullCodingDataVideo['act1'] = np.where(fullCodingDataVideo['Sequence']==0, fullCodingDataVideo['Activation'], np.NaN)
fullCodingDataVideo['act2'] = np.where(fullCodingDataVideo['Sequence']==1, fullCodingDataVideo['Activation'], np.NaN)
fullCodingDataVideo['sent1'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['sent1'].transform('mean')
fullCodingDataVideo['sent2'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['sent2'].transform('mean')
fullCodingDataVideo['act1'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['act1'].transform('mean')
fullCodingDataVideo['act2'] = fullCodingDataVideo.groupby(['Video', 'Coder'])['act2'].transform('mean')

In [None]:
fullCodingDataVideo.head(5)

#### Transform the coder scores into columns for arrangement at the video level

For both activation and sentiment, the following section will create seperate columns in the dataframe for the first, second, and average coding scores assigned to each video/snippet by each coder. E.g., each video will have columns for t1Sent1, t1Sent2, t1SentAvg, t1Act1, ...., v3ActAvg.

In [None]:
fullCodingDataVideo['Coder'].unique()

##### For Coder SF, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['t1Sent1'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['t1Act1'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['t1Sent2'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['t1Act2'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['t1SentAvg'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['t1ActAvg'] = np.where(fullCodingDataVideo['Coder']=="SF", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['t1Sent1'] = fullCodingDataVideo.groupby(['Video'])['t1Sent1'].transform('mean')
fullCodingDataVideo['t1Sent2'] = fullCodingDataVideo.groupby(['Video'])['t1Sent2'].transform('mean') 
fullCodingDataVideo['t1SentAvg'] = fullCodingDataVideo.groupby(['Video'])['t1SentAvg'].transform('mean') 
fullCodingDataVideo['t1Act1'] = fullCodingDataVideo.groupby(['Video'])['t1Act1'].transform('mean')
fullCodingDataVideo['t1Act2'] = fullCodingDataVideo.groupby(['Video'])['t1Act2'].transform('mean')
fullCodingDataVideo['t1ActAvg'] = fullCodingDataVideo.groupby(['Video'])['t1ActAvg'].transform('mean')

##### For Coder CM, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['t2Sent1'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['t2Act1'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['t2Sent2'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['t2Act2'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['t2SentAvg'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['t2ActAvg'] = np.where(fullCodingDataVideo['Coder']=="CM", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['t2Sent1'] = fullCodingDataVideo.groupby(['Video'])['t2Sent1'].transform('mean')
fullCodingDataVideo['t2Sent2'] = fullCodingDataVideo.groupby(['Video'])['t2Sent2'].transform('mean') 
fullCodingDataVideo['t2SentAvg'] = fullCodingDataVideo.groupby(['Video'])['t2SentAvg'].transform('mean') 
fullCodingDataVideo['t2Act1'] = fullCodingDataVideo.groupby(['Video'])['t2Act1'].transform('mean')
fullCodingDataVideo['t2Act2'] = fullCodingDataVideo.groupby(['Video'])['t2Act2'].transform('mean')
fullCodingDataVideo['t2ActAvg'] = fullCodingDataVideo.groupby(['Video'])['t2ActAvg'].transform('mean')

##### For Coder JV, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['t3Sent1'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['t3Act1'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['t3Sent2'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['t3Act2'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['t3SentAvg'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['t3ActAvg'] = np.where(fullCodingDataVideo['Coder']=="JV", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['t3Sent1'] = fullCodingDataVideo.groupby(['Video'])['t3Sent1'].transform('mean')
fullCodingDataVideo['t3Sent2'] = fullCodingDataVideo.groupby(['Video'])['t3Sent2'].transform('mean') 
fullCodingDataVideo['t3SentAvg'] = fullCodingDataVideo.groupby(['Video'])['t3SentAvg'].transform('mean') 
fullCodingDataVideo['t3Act1'] = fullCodingDataVideo.groupby(['Video'])['t3Act1'].transform('mean')
fullCodingDataVideo['t3Act2'] = fullCodingDataVideo.groupby(['Video'])['t3Act2'].transform('mean')
fullCodingDataVideo['t3ActAvg'] = fullCodingDataVideo.groupby(['Video'])['t3ActAvg'].transform('mean')

##### For Coder PO, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['v1Sent1'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['v1Act1'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['v1Sent2'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['v1Act2'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['v1SentAvg'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['v1ActAvg'] = np.where(fullCodingDataVideo['Coder']=="P-O R. B.", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['v1Sent1'] = fullCodingDataVideo.groupby(['Video'])['v1Sent1'].transform('mean')
fullCodingDataVideo['v1Sent2'] = fullCodingDataVideo.groupby(['Video'])['v1Sent2'].transform('mean') 
fullCodingDataVideo['v1SentAvg'] = fullCodingDataVideo.groupby(['Video'])['v1SentAvg'].transform('mean') 
fullCodingDataVideo['v1Act1'] = fullCodingDataVideo.groupby(['Video'])['v1Act1'].transform('mean')
fullCodingDataVideo['v1Act2'] = fullCodingDataVideo.groupby(['Video'])['v1Act2'].transform('mean')
fullCodingDataVideo['v1ActAvg'] = fullCodingDataVideo.groupby(['Video'])['v1ActAvg'].transform('mean')

##### For Coder MS, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['v2Sent1'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['v2Act1'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['v2Sent2'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['v2Act2'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['v2SentAvg'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['v2ActAvg'] = np.where(fullCodingDataVideo['Coder']=="MS", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['v2Sent1'] = fullCodingDataVideo.groupby(['Video'])['v2Sent1'].transform('mean')
fullCodingDataVideo['v2Sent2'] = fullCodingDataVideo.groupby(['Video'])['v2Sent2'].transform('mean') 
fullCodingDataVideo['v2SentAvg'] = fullCodingDataVideo.groupby(['Video'])['v2SentAvg'].transform('mean') 
fullCodingDataVideo['v2Act1'] = fullCodingDataVideo.groupby(['Video'])['v2Act1'].transform('mean')
fullCodingDataVideo['v2Act2'] = fullCodingDataVideo.groupby(['Video'])['v2Act2'].transform('mean')
fullCodingDataVideo['v2ActAvg'] = fullCodingDataVideo.groupby(['Video'])['v2ActAvg'].transform('mean')

##### For Coder JS, extract first and second code for each video, as well as the average of the two

In [None]:
fullCodingDataVideo['v3Sent1'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['sent1'], np.NaN) #the first sentiment score if coder == SF
fullCodingDataVideo['v3Act1'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['act1'], np.NaN)
fullCodingDataVideo['v3Sent2'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['sent2'], np.NaN) #the second sentiment score if coder == SF
fullCodingDataVideo['v3Act2'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['act2'], np.NaN)
fullCodingDataVideo['v3SentAvg'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['coderMeanSent'], np.NaN) #the average sentiment score if coder == SF
fullCodingDataVideo['v3ActAvg'] = np.where(fullCodingDataVideo['Coder']=="JS", fullCodingDataVideo['coderMeanAct'], np.NaN)

Generalize the columns extracted above to all rows

In [None]:
fullCodingDataVideo['v3Sent1'] = fullCodingDataVideo.groupby(['Video'])['v3Sent1'].transform('mean')
fullCodingDataVideo['v3Sent2'] = fullCodingDataVideo.groupby(['Video'])['v3Sent2'].transform('mean') 
fullCodingDataVideo['v3SentAvg'] = fullCodingDataVideo.groupby(['Video'])['v3SentAvg'].transform('mean') 
fullCodingDataVideo['v3Act1'] = fullCodingDataVideo.groupby(['Video'])['v3Act1'].transform('mean')
fullCodingDataVideo['v3Act2'] = fullCodingDataVideo.groupby(['Video'])['v3Act2'].transform('mean')
fullCodingDataVideo['v3ActAvg'] = fullCodingDataVideo.groupby(['Video'])['v3ActAvg'].transform('mean')

### Sample 1 for each video

In [None]:
fullCodingDataVideoOnePer = fullCodingDataVideo.groupby(['Label']).first()

##### Cleaning

In [None]:
list(fullCodingDataVideoOnePer.columns)

Dropping columns not measured at video level

In [None]:
fullCodingDataVideoOnePer = fullCodingDataVideoOnePer.drop(columns=['Activation', 
                                                                    'Coder', 
                                                                    'ID_main', 
                                                                    'Mode', 
                                                                    'Sentiment',
                                                                    'Sequence',
                                                                    'englishHansard',
                                                                    'coderMeanSent',
                                                                    'coderMeanAct',
                                                                    'sent1',
                                                                    'sent2',
                                                                    'act1',
                                                                    'act2'
                                                                   ])


In [None]:
fullCodingDataVideoOnePer.head(1)

In [None]:
fullCodingDataVideoOnePer.to_csv('fullCodingDataVideoOnePer.csv')

# VII. Coder Reliability

### use coderReliability.R in R-Studio.

# VIII. Hansard Parser

This script accesses the directory /Hansard/forParser, extracts all xml files, and then parses them into a dataframe format.  It requires the file authorityFile.p.

In [None]:
#XML parser for Canadian Record of Parliamentary Debates
#version: C.Cochrane, 2018-01-03

######################################################################################################################
# OVERVIEW
######################################################################################################################
#Notes:     (1) Python 3.6 Anaconda
#           (2) Encoding UTF-8
#           (3) Runs from parent directory of Hansard folder.

#
#Modules:   (1) xml.etree.ElementTree
#               see https://docs.python.org/3.3/library/xml.etree.elementtree.html
#           (2) os
#               see https://docs.python.org/2/library/os.html
#           (3) re
#               see
#           (4) Pandas
#               see
#
# Data:     (1) xml from www.ourcommons.ca
#               (e.g., http://www.ourcommons.ca/DocumentViewer/en/42-1/house/sitting-200/hansard)
######################################################################################################################
# INITIALIZE
######################################################################################################################
#Load Modules
import xml.etree.ElementTree as ET
import re
import glob
import pandas as pd
from datetime import datetime
import pickle
import spacy

#Load Authority File

authorityFile = pickle.load( open("authorityFile.p", "rb"))



##########################################################################################################
# DEFINING FUNCTIONS FOR NLP TOOLS
##########################################################################################################
#Stopwords
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

#Spacy Parser
nlp = spacy.load('en')



######################################################################################################################
# Hansard XML Function for extracting Speeches and Metadata about Speeches
######################################################################################################################

def hansardXmlParserSpeeches(directory="Hansard/forParser/"): #accepts path of folder containing xml files



    # Notes re Structure of XML Schema from ourcommons.ca.
    #            (1) "Children" are categories that begin within some
    #                 other category. (denoted by a "tag")
    #            (2) "Parent" is the category within which some other
    #                 category begins.
    #            (3) "Geneology" traces parent-child connections back
    #                 through earlier generations. Geneological terms
    #                 are used here as they are in everyday speech (.e.g,
    #                 grandparents, great-grandparents, uncles, siblings...)
    #            (4)  "Attributes" are traits passed on to later
    #                 generations (children, etc). They are always shared by siblings.

    # Rules:
    #           (1) The Parliament's xml schema is not obviously hierarchical/linear.
    #               The following rules are necessary to make sense
    #               of the schema:
    #                   a) All parents pass down their attributes to their
    #                      children (forward inheritance).
    #                   b) All parents possess the attributes
    #                      of their children, even when the attribute first
    #                      appears at the level of the child (backward
    #                      inheritance for one level).
    #                   c) All siblings possess the same
    #                      attributes, even when the attribute is only listed for
    #                      one sibling and not the other (horizontal
    #                      inheritance).
    #                   d) Taking a), b) and c) together, attributes that first
    #                      appear in cousins are passed up to the cousin's
    #                      parent (i.e., aunt) via backward inheritance,
    #                      across to the the child's parent via horizontal
    #                      inheritance, and down to the child via forward
    #                      inheritance.  In short, children inherit the
    #                      attributes of their cousins.
    #
    # ================================================================

    dfList = []

    fileList = glob.glob(str(directory)+'*.XML')  #extract list of xml files from the indicated directory
    #print("FileList", fileList)
    print("LengthOfFileList", len(fileList))
    #for file in fileList:
    #    print(str(file))
    #Declaring variables and default values

    topLine=None
    secondLine=None
    orderOfBusinessRubric=None
    subjectOfBusinessTitle=None
    subjectOfBusinessID=None
    subjectOfBusinessQualifier = None
    speechId = None
    interventionId = None
    date = None
    year = None
    month = None
    day = None
    weekday = None
    speakerName = None
    affiliationType = None
    affiliationDbId = None
    floorLanguage = None
    speech = None
    mentionedEntityName = None
    mentionedEntityType = None


    for file in fileList:

        print("Currently Processing:", file)


        ##################
        ##################
        ##################
        #Part 1
        ##################
        ##################
        ##################

        file = str(file)


        with open(file, 'r', encoding='utf-8', errors='ignore') as xml_file:
            tree = ET.parse(xml_file)
        xml_file.close()



        root = tree.getroot()

        # Declaring to capture speakerIDs associated with each party

        capturedConId = []
        capturedLibId = []
        capturedNDPId = []
        capturedGrnId = []
        capturedFDId = []
        capturedIndId = []
        capturedBQId = []

        for elementHansard in root:
            #=================================================================
            # elementHansard is a part of the second generation (level).
            # ElementHansard has five children (tags): (1) StartPageNumber
            #                                          (2) DocumentTitle (ignored)
            #                                          (3) ExtractedInformation
            #                                          (4) HansardBody
            # Geneology: elementHansard is the child of root.
            # ================================================================

            if elementHansard.tag ==  "StartPageNumber":
                # =================================================================
                # StartPageNumber is a part of the third generation (level), has
                # one new attribute, and no children of its own.
                # genes:        (1) sourceStartPageNumber
                #
                # Geneology: StartPageNumber is the child of elementHansard, and
                # the grandchild of root.
                # ================================================================
                sourceStartPageNumber = int(elementHansard.text) #The page number of the first page for
                                                                 #this day in the Official Record of Debates.
            if elementHansard.tag == "ExtractedInformation":

                for elementExtractedInformation in elementHansard:
                    #==========================================================
                    # elementExtractedInformation is also part of the third
                    # generation (level). It is a sibling of StartPageNumber
                    # and therefore carries the attribute sourceStartPageNumber.
                    # "ExtractedInformation" has no children and 28 new
                    # attributes.
                    #               (1)  {'Name': 'InstitutionDebate'}
                    #               (2)  {'Name': 'Volume'}
                    #               (3)  {'Name': 'Number'}
                    #               (4)  {'Name': 'Session'}
                    #               (5)  {'Name': 'Parliament'}
                    #               (6)  {'Name': 'Date'}
                    #               (7)  {'Name': 'SpeakerName'}
                    #               (8)  {'Name': 'Institution'}
                    #               (9)  {'Name': 'Country'}
                    #               (10)  {'Name': 'TOCNote'} (ignored)
                    #               (11) {'Name': 'HeaderTitle'}
                    #               (12) {'Name': 'HeaderDate'}
                    #               (13) {'Name': 'MetaDocumentCategory'} (ignored)
                    #               (14) {'Name': 'MetaTitle'}
                    #               (15) {'Name': 'MetaTitleEn'}
                    #               (16) {'Name': 'MetaTitleFr'}
                    #               (17) {'Name': 'MetaVolumeNumber'}
                    #               (18) {'Name': 'MetaNumberNumber'}
                    #               (19) {'Name': 'MetaDateNumDay'}
                    #               (20) {'Name': 'MetaDateNumMonth'}
                    #               (21) {'Name': 'MetaDateNumYear'}
                    #               (22) {'Name': 'MetaCreationTime'}
                    #               (23) {'Name': 'MetaInstitution'}
                    #               (24) {'Name': 'InstitutionDebateFr'}
                    #               (25) {'Name': 'InstitutionDebateEn'}
                    #               (26) {'Name': 'ParliamentNumber'}
                    #               (27) {'Name': 'SessionNumber'}
                    #               (28) {'Name': 'InCameraNote'}
                    # Geneology: ExtractedInformation is the child of elementHansard,
                    #            the grandchild of root, and has no children of its
                    #            own. StartPageNumber, DocumentTitle, and
                    #            ExtractedInformation have properties shared by
                    #            all descendants of HansardBody, but they have
                    #            no children of their own.
                    #=========================================================

                    if elementExtractedInformation.attrib=={'Name': 'MetaInstitution'}: #e.g., House of Commons
                        chamber = elementExtractedInformation.text

                    if elementExtractedInformation.attrib=={'Name': 'ParliamentNumber'}: #Which Parliament Number
                        parliamentNumber = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'SessionNumber'}: #Which Session Number
                        parliamentSession = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'HeaderTitle'}: #E.g., "Commons Debates"
                        sourceTitle = elementExtractedInformation.text

                    if elementExtractedInformation.attrib=={'Name': 'MetaVolumeNumber'}: #Hansard Volume
                        sourceVolume = elementExtractedInformation.text

                    if elementExtractedInformation.attrib=={'Name': 'MetaNumberNumber'}: #Hansard Number
                        sourceNumber = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'HeaderDate'}: #Date in Format (January 1, 2001)
                        date = elementExtractedInformation.text

                    if elementExtractedInformation.attrib=={'Name': 'MetaDateNumYear'}: #Year
                        year = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'MetaDateNumMonth'}: #Month Number
                        month = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'MetaDateNumDay'}: #Day Number
                        day = int(elementExtractedInformation.text)

                    if elementExtractedInformation.attrib=={'Name': 'Date'}:
                        weekday = elementExtractedInformation.text.split(',',1)[0] #Day of Week, e.g., "Monday"

            if elementHansard.tag == "HansardBody":
                for elementHansardBody in elementHansard:
                # ======================================================================
                # elementHansard Body is also part of the third generation.  It is a
                # sibling of StartPageNumber and elementExtractedInformation, and thus
                # carries the attributes sourceStartPageNumber from startPageNumber, and the
                # 28 attributes from elementExtractedInformation.  elementHansardBody has 10
                # new genes (attributes) and two children (tags).
                # attrib:       (1)  {} #Empty
                #               (2)  {'id': '9796727', 'Rubric': 'Other'}
                #               (3)  {'id': '9794282', 'Rubric': 'StatementsByMembers'}
                #               (4)  {'id': '9794342', 'Rubric': 'NewMP'}
                #               (5)  {'id': '9794349', 'Rubric': 'OralQuestionPeriod'}
                #               (6)  {'id': '9794620', 'Rubric': 'RoutineProceedings'}
                #               (7)  {'id': '9794803', 'Rubric': 'Other'}
                #               (8)  {'id': '9796436', 'Rubric': 'Other'}
                #               (9)  {'id': '9796555', 'Rubric': 'RoutineProceedings'}
                #               (10)  {'id': '9796639', 'Rubric': 'LateShow'}
                # children:     (1)   Intro
                #               (2)   OrderOfBusiness
                # Geneology: elementHansardBody is the child of elementHansard, and the
                #            grandchild of root.  It has two children.  It shares all of
                #            the genes of its siblings, elementExtractedInformation and
                #            StartPageNumber, and it passes down all of these attribs to
                #            its two children.  It also has 10 new attribs of its own, for a
                #            total of 39 attribs that it passes down.
                # ======================================================================
                    if elementHansardBody.tag=="Intro":
                        for elementIntro in elementHansardBody:
                            #====================================================================
                            # elementIntro is a part of the fourth generation.  It has two new
                            # attribs and no children (tags).
                            # attribs:       (1) Paratext
                            #                (2) Prayer
                            # Geneology: elementIntro is the child of elementHansardBody, the
                            # grandchild of elementHansard, and the great-grandchild of root.
                            # elementIntro has two new genes and two children.  Although
                            # elementIntro's line dies off in this generation, its two genes are
                            # recorded here because they are shared by its sibling,
                            # OrderOfBusiness, and passed down through that lineage.
                            #====================================================================

                            if elementIntro.tag=="ParaText": #Attribute of elementIntro.  Procedural text in this case.
                                topLine = elementIntro.text #top line for the day.

                            if elementIntro.tag=="Prayer": #Attribute of elementIntro
                                secondLine = elementIntro.text

                    if elementHansardBody.tag=="OrderOfBusiness":
                        for elementOrderOfBusiness in elementHansardBody:
                            # ======================================================================
                            # elementOrderOfBusiness is also a part of the fourth generation, and therefore
                            # shares the two attributes of its sibling, elementIntro.  elementOrderOfBusiness
                            # has two additional new attributes. It has one child.
                            # genes:        (1)   OrderOfBusinessTitle
                            #               (2)   Catchline
                            # children:     (1)   SubjectOfBusiness
                            #
                            # Note: OrderOfBusinessTitle repeats Catchline.  It is therefore ignored.
                            #
                            # Geneology: elementOrderOfBusiness is the child of elementHansardBody,
                            # the grandchild of elementHansard, and the great-grandchild of root. It is
                            # the sibling of elementIntro.
                            # ======================================================================
                                if elementOrderOfBusiness.tag== "SubjectOfBusiness":
                                    for elementSubjectOfBusiness in elementOrderOfBusiness:
                                        subjectOfBusinessID = elementOrderOfBusiness.attrib.get("id") #stores official ID
                                        #========================================================================
                                        # elementSubjectOfBusiness is the fifth generation, and the only child
                                        # of elementOrderOfBusiness.  elementSubjectOfBusiness has 4 genes and
                                        # 1 child.
                                        # genes:        (1) Timestamp
                                        #               (2) FloorLanguage
                                        #               (3) SubjectOfBusinessTitle
                                        #               (4) SubjectOfBusinessQualifier
                                        #
                                        # children:     (1) SubjectOfBusinessContent
                                        #
                                        # Geneology: elementSubjectOfBusiness is the only child of
                                        # elementOrderofBusiness, the grandchild of elementHansardBody,
                                        # the great-grandchild of elementHansard, and the great-great-
                                        # grandchild of root.
                                        #========================================================================
                                        if elementSubjectOfBusiness.tag=="Timestamp":
                                            timeStampHr = elementSubjectOfBusiness.attrib['Hr']
                                            timeStampMin = elementSubjectOfBusiness.attrib['Mn']


                                        orderOfBusinessRubric = elementHansardBody.attrib['Rubric']  # A secondary description,
                                        #                                                              an attribute inherited from
                                        #                                                              elementHansardBody.
                                        #                                                              Needs to go with
                                        #                                                              each speech.
                                        if elementSubjectOfBusiness.tag=="FloorLanguage":
                                            floorLanguage = elementSubjectOfBusiness.attrib['language']

                                        if elementSubjectOfBusiness.tag=='SubjectOfBusinessTitle':     #A general title, procedurally oriented

                                            subjectOfBusinessTitle = elementSubjectOfBusiness.text
                                            subjectOfBusinessQualifier = "" #Not always available. resets so it doesn't get the previous qualifier

                                            if subjectOfBusinessTitle == "":
                                                subjectOfBusinessTitle = "NA"
                                                subjectOfBusinessQualifier = ""  # Not always available. resets so it doesn't get the previous qualifier


                                        if elementSubjectOfBusiness.tag=="SubjectOfBusinessQualifier":   #A more specific (and substantive title). Not always available.
                                            subjectOfBusinessQualifier = elementSubjectOfBusiness.text


                                        for elementSubjectOfBusinessContent in elementSubjectOfBusiness:
                                            # ========================================================================
                                            # elementSubjectOfBusinessContent is the sixth generation, and the only child
                                            # of elementSubjectOfBusiness.  elementSubjectOfBusinessContent has 2 attribs and
                                            # 1 child.
                                            # attribs:      (1) Timestamp
                                            #               (2) FloorLanguage
                                            #
                                            # children:     (1) Intervention
                                            #
                                            #
                                            # Geneology: elementSubjectOfBusinessContent is the only child of
                                            # elementSubjectofBusiness, the grandchild of elementOrderOfBusiness,
                                            # the great-great child of elementHansardBody, the great-great-grandchild of
                                            # elementHansard and the great-great-great grandchild of root.
                                            if elementSubjectOfBusinessContent.tag == "Timestamp": #Time Stamped every five mintues, at different levels.
                                                timeStampHr = elementSubjectOfBusinessContent.attrib['Hr']
                                                timeStampMin = elementSubjectOfBusinessContent.attrib['Mn']

                                            if elementSubjectOfBusinessContent.tag == "FloorLanguage":
                                                floorLanguage = elementSubjectOfBusinessContent.attrib['language']

                                            if elementSubjectOfBusinessContent.tag == "ParaText":  #This would be procedural text (e.g., It being 5:30 p.m., the House will now proceed to the taking of the....)
                                                pass

                                            if elementSubjectOfBusinessContent.tag == "Intervention": #This is an intervention

                                                for elementSubjectOfBusinessContentIntervention in elementSubjectOfBusinessContent:
                                                    # ========================================================================
                                                    # elementSubjectOfBusinessContentIntervention is the seventh generation,
                                                    # and the only child of elementSubjectOfBusinessContent.
                                                    # elementSubjectOfBusinessContentIntervention has two children:
                                                    #
                                                    # children:     (1) Person Speaking
                                                    #               (2) Content

                                                    #print(elementSubjectOfBusinessContent.attrib)
                                                    interventionId = elementSubjectOfBusinessContent.attrib.get("id")

                                                    if elementSubjectOfBusinessContentIntervention.tag=="PersonSpeaking":

                                                        for elementSubjectOfBusinessContentInterventionPersonSpeaking in elementSubjectOfBusinessContentIntervention:
                                                            # ========================================================================
                                                            # elementSubjectOfBusinessContentInterventionPersonSpeaking is the eigth generation,
                                                            # and child of elementSubjectOfBusinessContentIntervention.
                                                            # elementSubjectOfBusinessContentInterventionPersonSpeaking has one attributes:
                                                            #               (1) Affiliation
                                                            #                   (A) Affiliation Type
                                                            #                   (B) Affiliation DataBase ID
                                                            #                   (C) Affiliation Name (i.e. Speaker's Name)
                                                            if elementSubjectOfBusinessContentInterventionPersonSpeaking.tag=="Affiliation":

                                                                try:
                                                                    affiliationType = elementSubjectOfBusinessContentInterventionPersonSpeaking.attrib['Type']
                                                                except:
                                                                    affiliationType = "NA"

                                                                affiliationDbId = elementSubjectOfBusinessContentInterventionPersonSpeaking.attrib['DbId']

                                                                if elementSubjectOfBusinessContentInterventionPersonSpeaking.text is not None: #a couple of speaker names are missing
                                                                    speakerName = elementSubjectOfBusinessContentInterventionPersonSpeaking.text
                                                                else:
                                                                    speakerName = "NA"

                                                                capturedConId.append(214648) #Ambrose has multiple Aliases.  She seems to start with an old one.
                                                                capturedConId.append(214568)

                                                                capturedLibId.append(213924) #Lamoureux given different IDs for first and subsequent speeches

                                                                if "Justin Trudeau" in speakerName :
                                                                    party = 'Lib'
                                                                    capturedLibId.append(affiliationDbId)

                                                                if "Rona Ambrose" in speakerName:
                                                                    party = 'Con'
                                                                    capturedLibId.append(affiliationDbId)

                                                                elif "Thomas Mulcair" in speakerName:
                                                                    party == 'NDP'
                                                                    capturedNDPId.append(affiliationDbId)

                                                                elif "Stephen Harper" in speakerName:
                                                                    party = 'Con'
                                                                    capturedConId.append(affiliationDbId)

                                                                elif "Andrew Scheer" in speakerName:
                                                                    party = 'Con'
                                                                    capturedConId.append(affiliationDbId)

                                                                #Party
                                                                elif speakerName[-3:] == 'PC)':
                                                                    party = "Con"
                                                                    capturedConId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'b.)':
                                                                    party = 'Lib'
                                                                    capturedLibId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'DP)':
                                                                    party = 'NDP'
                                                                    capturedNDPId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'FD)':
                                                                    party = 'FD'
                                                                    capturedFDId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'GP)':
                                                                    party = 'Grn'
                                                                    capturedGrnId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'BQ)':
                                                                    party = 'BQ'
                                                                    capturedBQId.append(affiliationDbId)

                                                                elif speakerName[-3:] == 'd.)':
                                                                    party = 'Ind'
                                                                    capturedIndId.append(affiliationDbId)

                                                                elif affiliationDbId in capturedConId:
                                                                    party = 'Con'

                                                                elif affiliationDbId in capturedLibId:
                                                                    party = 'Lib'

                                                                elif affiliationDbId in capturedNDPId:
                                                                    party = "NDP"

                                                                elif affiliationDbId in capturedBQId:
                                                                    party = 'BQ'

                                                                elif affiliationDbId in capturedFDId:
                                                                    party = 'FD'

                                                                elif affiliationDbId in capturedGrnId:
                                                                    party = "Grn"

                                                                else:
                                                                    party = None
                                     

                                                    if elementSubjectOfBusinessContentIntervention.tag=="Content":
                                                        speech = [] #an empty list, which will collect speech fragements below.
                                                        paraNumRange = [] #empty list, for paragraph number range

                                                        documentType = []  # empty list, for any documents (Type) mentioned by speaker. Type is unlabelled integer.  (To Look up)
                                                        documentID = []  # empty list, for any documents mentioned by speaker.  This is a database ID. (To Look up)
                                                        documentTitleList = []  # mpty list, for title of any document mentioned by Speaker.  This is readable text.

                                                        mentionedEntityType = []  # empty list, for any Members mentioned in the speech.  This is unlabelled integer.  (To Look up)
                                                        mentionedEntityDbId = []  # empty list, for any Members mentioned in the speech.  This is a database ID. (To look up)
                                                        mentionedEntityNameList = []  # empty list, for names of any members mentioned in the speech.

                                                        for elementSubjectOfBusinessContentInterventionContent in elementSubjectOfBusinessContentIntervention:
                                                            # ========================================================================
                                                            # elementSubjectOfBusinessContentInterventionContent is the eigth generation,
                                                            # and child of elementSubjectOfBusinessContentIntervention.
                                                            # elementSubjectOfBusinessContentIntervention has three attributes:
                                                            #               (1) FloorLanguage
                                                            #               (2) TimeStemp
                                                            #               (3) ParaText
                                                            #
                                                            if elementSubjectOfBusinessContentInterventionContent.tag == "FloorLanguage":
                                                                floorLanguage = elementSubjectOfBusinessContentInterventionContent.attrib['language']

                                                            if elementSubjectOfBusinessContentInterventionContent.tag == "Timestamp":
                                                                timeStampHr = elementSubjectOfBusinessContentInterventionContent.attrib['Hr']
                                                                timeStampMin = elementSubjectOfBusinessContentInterventionContent.attrib['Mn']

                                                            if elementSubjectOfBusinessContentInterventionContent.tag == "ParaText":
                                                                try:
                                                                    paraNumRange.append(int(elementSubjectOfBusinessContentInterventionContent.attrib["id"]))
                                                                except:
                                                                    pass
                                                                phrase = ET.tostring(elementSubjectOfBusinessContentInterventionContent, method="text")
                                                                phrase = phrase.decode("UTF-8")
                                                                phrase = phrase.replace('\n', '').replace('\t', '')

                                                                speech.append(phrase)
                                                                speech.append("\n\n")


                                                                for elementSubjectOfBusinessContentInterventionContentParaText in elementSubjectOfBusinessContentInterventionContent:
                                                                    #print(elementSubjectOfBusinessContentInterventionContentParaText.tag)
                                                                    # ========================================================================
                                                                    # elementSubjectOfBusinessContentInterventionContentParaText is the ninth generation,
                                                                    # and child of elementSubjectOfBusinessContentIntervention.
                                                                    # elementSubjectOfBusinessContentIntervention has three attributes:
                                                                    #               (1) Document (Any Bills Mentioned)
                                                                    #               (2) Affiliation (Any Members Mentioned)
                                                                    #               (B) ParaText
                                                                    #
                                                                    # print(elementSubjectOfBusinessContentInterventionContent.tag)
                                                                    if elementSubjectOfBusinessContentInterventionContentParaText.tag=="Document":
                                                                        try:
                                                                            documentType.append(elementSubjectOfBusinessContentInterventionContentParaText.attrib['Type'])
                                                                        except:
                                                                            documentType.append("NA")
                                                                        try:
                                                                            documentID.append(elementSubjectOfBusinessContentInterventionContentParaText.attrib['DbId'])
                                                                        except:
                                                                            documentID.append("NA")
                                                                        try:
                                                                            documentTitle = elementSubjectOfBusinessContentInterventionContentParaText.text
                                                                        except:
                                                                            documentTitle = "NA"
                                                                        documentTitleList.append(documentTitle)

                                                                    if elementSubjectOfBusinessContentInterventionContentParaText.tag=="Affiliation":
                                                                        try:
                                                                            mentionedEntityType.append(elementSubjectOfBusinessContentInterventionContentParaText.attrib['Type'])
                                                                        except:
                                                                            mentionedEntityType.append("NA")
                                                                        try:
                                                                            mentionedEntityDbId.append(elementSubjectOfBusinessContentInterventionContentParaText.attrib['DbId'])
                                                                        except:
                                                                            mentionedEntityDbId.append("NA")
                                                                        mentionedEntityName = elementSubjectOfBusinessContentInterventionContentParaText.text
                                                                        mentionedEntityNameList.append(mentionedEntityName)



                                                        #################################################################
                                                        #Cleanup
                                                        #################################################################

                                                        speech = ''.join(speech)
                                                        for openquote in ['&#8220;']:
                                                            if openquote in speech:
                                                                speech=speech.replace(openquote,"\"")
                                                        for endquote in ['&#8221;']:
                                                            if endquote in speech:
                                                                speech=speech.replace(endquote,"\"")

                                                        speech = re.sub(r' Minister(?! )', 'Minister ', speech)
                                                        speech = re.sub(r'PrimeMinister', 'Prime Minister', speech)
                                                        speech = re.sub(r'Minister of Finance(?! )', 'Minister of Finance ', speech)
                                                        speech = re.sub(r'finance minister(?! )', 'finance minister ',speech)
                                                        speech = re.sub(r'Minister of Environment and Climate Change(?! )', 'Minister of Environment and Climate Change ',speech)

                                                        speechId = str(year) + "-" + str(month) +"-" + str(day) +"-" + str(interventionId)

                                                        
                                                        

                                                        speech_filtered = ""

                                                        if speech != None:
                                                            
                                                            speech_nlp = nlp(speech) # convering speech into Spacy doc
                                                            
                                                            for token in speech_nlp:
                                                                pair = "_".join([token.text, token.tag_])
                                                                speech_filtered = " ".join([speech_filtered, pair])


                                                            '''
                                                            df = df.append({'parliamentNumber': int(parliamentNumber),
                                                                            'parliamentSession': int(parliamentSession),
                                                                            'orderOfBusinessRubric': orderOfBusinessRubric,
                                                                            'subjectOfBusinessTitle': subjectOfBusinessTitle,
                                                                            'subjectOfBusinessID': subjectOfBusinessID,
                                                                            'subjectOfBusinessQualifier': subjectOfBusinessQualifier,
                                                                            'speechId': speechId,
                                                                            'party': party,
                                                                            'interventionId': interventionId,
                                                                            'date': date,
                                                                            'year': int(year),
                                                                            'month': int(month),
                                                                            'day': int(day),
                                                                            'weekday': weekday,
                                                                            'timeStamp': timeStampHr+":"+timeStampMin,
                                                                            'speakerName': speakerName,
                                                                            'affiliationType': affiliationType,
                                                                            'affiliationDbId': affiliationDbId,
                                                                            'floorLanguage': floorLanguage,
                                                                            'speech': speech,
                                                                             'mentionedDocumentsTitle': documentTitleList,
                                                                            'mentionedDocumentsId': documentID,
                                                                            'mentionedDocumentsType': documentType,
                                                                            'mentionedEntityName': mentionedEntityNameList,
                                                                            'mentionedEntityId': mentionedEntityDbId,
                                                                            'mentionedEntityType': mentionedEntityType,
                                                                            }, ignore_index=True)
                                                            '''



                                                            dateYMD = datetime.strptime(str(year)+"-"+str(month)+"-"+str(day), '%Y-%m-%d')
                                                            
                                                            #Calling from dictionary
                                                            
                                                            id = int(affiliationDbId)
                                                            
                                                            try:
                                                                parlInfoId = authorityFile[id]['parlInfoId']
                                                                fullName = authorityFile[id]['fullName']
                                                                firstName = authorityFile[id]['firstName']
                                                                lastName = authorityFile[id]['lastName']
                                                                middleName = authorityFile[id]['middleName']
                                                                sex = authorityFile[id]['sex']
                                                                visibleMinority = authorityFile[id]['visibleMinority']
                                                                indigenous = authorityFile[id]['indigenous']
                                                                dateOfBirth = authorityFile[id]['dateOfBirth']
                                                                isEstimateDOB = authorityFile[id]['isEstimateDOB']
                                                                birthProvince = authorityFile[id]['birthProvince']
                                                                birthCountry = authorityFile[id]['birthCountry']
                                                                firstDay = authorityFile[id]['firstDay']
                                                                provOfRiding = authorityFile[id]['provOfRiding']
                                                                parlInfoPage = authorityFile[id]['parlInfoPage']
                                                                daysInOffice = (dateYMD-firstDay).days
                                                                
                                                            except: #if not dictionary entry for this dbID
                                                                parlInfoId = "NA"
                                                                parlInfoId = "NA"
                                                                fullName = "NA"
                                                                firstName = "NA"
                                                                lastName = "NA"
                                                                middleName = "NA"
                                                                sex = "NA"
                                                                visibleMinority = "NA"
                                                                indigenous = "NA"
                                                                dateOfBirth = "NA"
                                                                isEstimateDOB = "NA"
                                                                birthProvince = "NA"
                                                                birthCountry = "NA"
                                                                firstDay = "NA"
                                                                provOfRiding = "NA"
                                                                parlInfoPage = "NA"
                                                                daysInOffice = "NA"
                                                            
                                                                                                               
                                                            if dateOfBirth != "NA":
                                                                if (dateOfBirth.year == 9999):
                                                                    age = "NA"
                                                                else:
                                                                    age = (dateYMD-dateOfBirth).days/365.25
                                                            else:
                                                                age = "NA"
                                                            
                                                            dfList.append([int(parliamentNumber),
                                                                      int(parliamentSession),
                                                                       orderOfBusinessRubric,
                                                                       subjectOfBusinessTitle,
                                                                       subjectOfBusinessID,
                                                                       subjectOfBusinessQualifier,
                                                                       speechId,
                                                                       interventionId,
                                                                       date,
                                                                       dateYMD,
                                                                       int(year),
                                                                       int(month),
                                                                       int(day),
                                                                       weekday,
                                                                       timeStampHr+":"+timeStampMin,
                                                                       speakerName,
                                                                       party,
                                                                       parlInfoId,
                                                                       fullName,
                                                                       firstName,
                                                                       lastName,
                                                                       middleName,
                                                                       sex,
                                                                       age,
                                                                       daysInOffice,
                                                                       visibleMinority,
                                                                       indigenous,
                                                                       dateOfBirth,
                                                                       isEstimateDOB,
                                                                       birthProvince,
                                                                       birthCountry,
                                                                       firstDay,
                                                                       provOfRiding,
                                                                       parlInfoPage,
                                                                       affiliationType,
                                                                       affiliationDbId,
                                                                       floorLanguage,
                                                                       speech,
                                                                       speech_filtered,
                                                                       documentTitleList,
                                                                       documentID,
                                                                       documentType,
                                                                       mentionedEntityNameList,
                                                                       mentionedEntityDbId,
                                                                       mentionedEntityType,
                                                                       file])



    labels = ['parliamentNumber', 
              'parliamentSession', 
              'orderOfBusinessRubric',
              'subjectOfBusinessTitle',
              'subjectOfBusinessID', 
              'subjectOfBusinessQualifier', 
              'speechId', 
              'interventionId',
              'date', 
              'dateYMD', 
              'year', 
              'month', 
              'day', 
              'weekday', 
              'timeStamp',
              'speakerName', 
              'party', 
              'parlInfoId', 
              'fullName', 
              'firstName', 
              'lastName', 
              'middleName',
              'sex', 
              'age', 
              'daysInOffice', 
              'visibleMinority', 
              'indigenous', 
              'dateOfBirth', 
              'isEstimateDOB', 
              'birthProvince', 
              'birthCountry', 
              'firstDay', 
              'provOfRiding', 
              'parlInfoPage',
              'affiliationType', 
              'affiliationDbId', 
              'floorLanguage',
              'speech', 
              'speechFiltered',
              'mentionedDocumentsTitle', 
              'mentionedDocumentsId', 
              'mentionedDocumentsType',
              'mentionedEntityName', 
              'mentionedEntityId',
              'mentionedEntityType', 
              'filename'
              ]


    df = pd.DataFrame.from_records(dfList, columns=labels)

    df.to_csv("hansardExtractedSpeechesFull.csv", sep='\t', encoding='utf-8')



hansardXmlParserSpeeches()
;

# IX. Word2Vec

## IX. A. Train Model

In [1]:
"""
Created on Fri Jan  4 05:38:46 2019
@author: chriscochrane
"""

import re

import pandas as pd
import nltk
import os
import numpy as np
import sys
from nltk.corpus import stopwords
import time

import gensim

from gensim.models import Word2Vec
from gensim.models import word2vec
from gensim.models import Phrases
import logging


tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

stopwords = stopwords.words('english')


hansardSpeeches = pd.read_csv('hansardExtractedSpeechesFull.csv', sep="\t", encoding="utf-8", header=0) 



print(hansardSpeeches['mentionedEntityName'][1])

def sentence_to_wordlist(sentence, remove_stopwords=False):
    sentence_text = re.sub(r'[^\w\s]','', sentence)
    words = sentence_text.lower().split()

    for word in words: #Remove Stopwords (Cochrane)
        if word in stopwords:
            words.remove(word)

    return words

def hansard_to_sentences(hansard, tokenizer, remove_stopwords=False ):
    #print("currently processing: word tokenizer")
    start_time = time.time()
    try:
        # 1. Use the NLTK tokenizer to split the text into sentences
        raw_sentences = tokenizer.tokenize(hansard.strip())
        # 2. Loop over each sentence
        sentences = []
        for raw_sentence in raw_sentences:
            # If a sentence is empty, skip it
            if len(raw_sentence) > 0:
                # Otherwise, call sentence_to_wordlist to get a list of words
                sentences.append(sentence_to_wordlist(raw_sentence))
        # 3. Return the list of sentences (each sentence is a list of words, so this returns a list of lists)
        len(sentences)
        return sentences
    except:
        print('nope')

    end_time = time.time()-start_time

questions = hansardSpeeches['speech']

questions = pd.Series.tolist(questions)
sentences = []

for i in range(0,len(questions)):

    start_time = time.time()

    try:
        # Need to first change "./." to "." so that sentences parse correctly
        hansard = questions[i].replace("/.", '')
        # Now apply functions
        sentences += hansard_to_sentences(hansard, tokenizer)
    except:
        print('no!')

print("There are " + str(len(sentences)) + " sentences in our corpus of questions.")

print("currently processing: training model")
start_time = time.time()



logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count 
num_workers = 4       # Number of threads to run in parallel
context = 6           # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
iterations = 5        # Epochs, 5 is default

model = word2vec.Word2Vec(sentences, workers=num_workers, \
            size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling, iter=iterations)

model.init_sims(replace=True)

model_name = 'hansardQuestions'
model.save(model_name)
new_model = gensim.models.Word2Vec.load('hansardQuestions')

vocab = list(model.wv.vocab.keys())


print("Process complete--the first 25 words in the vocabulary are:")

print(vocab[:25])



[]
no!
no!
no!


2020-09-11 14:47:06,158 : INFO : collecting all words and their counts
2020-09-11 14:47:06,158 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-09-11 14:47:06,178 : INFO : PROGRESS: at sentence #10000, processed 122535 words, keeping 10496 word types
2020-09-11 14:47:06,199 : INFO : PROGRESS: at sentence #20000, processed 250225 words, keeping 14636 word types
2020-09-11 14:47:06,219 : INFO : PROGRESS: at sentence #30000, processed 372363 words, keeping 17917 word types
2020-09-11 14:47:06,245 : INFO : PROGRESS: at sentence #40000, processed 498552 words, keeping 20694 word types
2020-09-11 14:47:06,269 : INFO : PROGRESS: at sentence #50000, processed 618329 words, keeping 22787 word types
2020-09-11 14:47:06,293 : INFO : PROGRESS: at sentence #60000, processed 748477 words, keeping 24625 word types
2020-09-11 14:47:06,315 : INFO : PROGRESS: at sentence #70000, processed 878598 words, keeping 26542 word types
2020-09-11 14:47:06,335 : INFO : PROGRESS: at 

There are 4006269 sentences in our corpus of questions.
currently processing: training model


2020-09-11 14:47:06,360 : INFO : PROGRESS: at sentence #90000, processed 1127079 words, keeping 29797 word types
2020-09-11 14:47:06,383 : INFO : PROGRESS: at sentence #100000, processed 1251812 words, keeping 31259 word types
2020-09-11 14:47:06,406 : INFO : PROGRESS: at sentence #110000, processed 1380137 words, keeping 32674 word types
2020-09-11 14:47:06,427 : INFO : PROGRESS: at sentence #120000, processed 1504194 words, keeping 33971 word types
2020-09-11 14:47:06,448 : INFO : PROGRESS: at sentence #130000, processed 1627319 words, keeping 35224 word types
2020-09-11 14:47:06,472 : INFO : PROGRESS: at sentence #140000, processed 1750120 words, keeping 36459 word types
2020-09-11 14:47:06,495 : INFO : PROGRESS: at sentence #150000, processed 1880099 words, keeping 37892 word types
2020-09-11 14:47:06,517 : INFO : PROGRESS: at sentence #160000, processed 2004067 words, keeping 38902 word types
2020-09-11 14:47:06,538 : INFO : PROGRESS: at sentence #170000, processed 2124317 words, 

Process complete--the first 25 words in the vocabulary are:
['madam', 'speaker', 'have', 'honour', 'present', 'both', 'official', 'languages', 'second', 'report', 'standing', 'committee', 'justice', 'human', 'rights', 'relation', 'bill', 'c10', 'act', 'enact', 'victims', 'terrorism', 'to', 'amend', 'state']


In [1]:
# -*- coding: utf-8 -*-
"""
Created on Fri Jan  4 14:54:42 2019
@author: chris cochrane
NOTES: Requires Windows
"""

#-----------------------------------------------------------------------------
# Initialization
#-----------------------------------------------------------------------------


import gensim
import numpy as np
from operator import itemgetter
import pylab as pl
import scipy.stats as stats
import nltk
import time
import pandas as pd
import re
from nltk.corpus import stopwords

#-----------------------------------------------------------------------------
# Loading stored w2v Model
#-----------------------------------------------------------------------------


model = gensim.models.Word2Vec.load('hansardQuestions')

#-----------------------------------------------------------------------------
# Seed Words (Adapted from Turney and Littman)
#-----------------------------------------------------------------------------

good = model.wv['good'].astype('float64')
excellent = model.wv['excellent'].astype('float64')
correct = model.wv['correct'].astype('float64')
best = model.wv['best'].astype('float64')
happy = model.wv['happy'].astype('float64')
positive = model.wv['positive'].astype('float64')
fortunate = model.wv['fortunate'].astype('float64')


bad = model.wv['bad'].astype('float64')
terrible = model.wv['terrible'].astype('float64')
wrong = model.wv['wrong'].astype('float64')
worst = model.wv['worst'].astype('float64')
disappointed = model.wv['disappointed'].astype('float64')
negative = model.wv['negative'].astype('float64')
unfortunate = model.wv['unfortunate'].astype('float64')

vocab = list(model.wv.vocab.keys()) #the full vocabulary of Hansard


# an empty list for storing wights and an empty dictionary for linking
# weights and words

runningTally=[]
dictOfWeights = {}

#-----------------------------------------------------------------------------
# Model
#-----------------------------------------------------------------------------

'''for every word in the hansard, calculate its cosine similarity to the 
lists of positive words and negative words, then substract the sum of that
word's cosine simlarity to negative seed words from its cosine similarity to the
postive seed words.''' 

for word in vocab:

    word_model = model.wv[word].astype('float64')

    pos1 = (np.dot(word_model, good) / (np.linalg.norm(word_model) * np.linalg.norm(good))).astype('float64')
    pos2 = np.dot(word_model, excellent) / (np.linalg.norm(word_model) * np.linalg.norm(excellent))
    pos3 = np.dot(word_model, correct) / (np.linalg.norm(word_model) * np.linalg.norm(correct))
    pos4 = np.dot(word_model, best) / (np.linalg.norm(word_model) * np.linalg.norm(best))
    pos5 = np.dot(word_model, happy) / (np.linalg.norm(word_model) * np.linalg.norm(happy))
    pos6 = np.dot(word_model, positive) / (np.linalg.norm(word_model) * np.linalg.norm(positive))
    pos7 = np.dot(word_model, fortunate) / (np.linalg.norm(word_model) * np.linalg.norm(fortunate))

    neg1 = np.dot(word_model, bad) / (np.linalg.norm(word_model) * np.linalg.norm(bad))
    neg2 = np.dot(word_model, terrible) / (np.linalg.norm(word_model) * np.linalg.norm(terrible))
    neg3 = np.dot(word_model, wrong) / (np.linalg.norm(word_model) * np.linalg.norm(wrong))
    neg4 = np.dot(word_model, worst) / (np.linalg.norm(word_model) * np.linalg.norm(worst))
    neg5 = np.dot(word_model, disappointed) / (np.linalg.norm(word_model) * np.linalg.norm(disappointed))
    neg6 = np.dot(word_model, negative) / (np.linalg.norm(word_model) * np.linalg.norm(negative))
    neg7 = np.dot(word_model, unfortunate) / (np.linalg.norm(word_model) * np.linalg.norm(unfortunate))

    pos = sum([pos1, pos2, pos3, pos4,  pos5, pos6, pos7])/7
    neg = sum([neg1, neg2, neg3, neg4, neg5, neg6, neg7])/7
    posneg = pos-neg
    result = (word, posneg)
    runningTally.append(result)
    dictOfWeights[word] = result



#-----------------------------------------------------------------------------
# Results
#-----------------------------------------------------------------------------

'''The 100 most positive signed and most negative signed words'''

runningTally = sorted(runningTally, key=itemgetter(1), reverse=True)
print("Top Positive:", runningTally[:100])
print("Top Negative:", runningTally[len(runningTally)-100:])
print("Total Vocabulary Size:", len(vocab))


'''word counts for the seed words'''
vocab_obj_good = model.wv.vocab["good"]
print("good", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["excellent"]
print("excellent", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["correct"]
print("correct", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["best"]
print("best", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["happy"]
print("happy", vocab_obj_good.count)


vocab_obj_good = model.wv.vocab["positive"]
print("positive", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["fortunate"]
print("fortunate", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["bad"]
print("bad", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["terrible"]
print("terrible", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["wrong"]
print("wrong", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["worst"]
print("worst", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["disappointed"]
print("disappointed", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["negative"]
print("negative", vocab_obj_good.count)

vocab_obj_good = model.wv.vocab["unfortunate"]
print("unfortunate", vocab_obj_good.count)


#-----------------------------------------------------------------------------
# Apply Lexicon
#-----------------------------------------------------------------------------

'''Apply the Lexicon to score the transcripts of the video clips.'''

#remove stopwords.  They are not relevant to sentiment scoring.

stopwords = stopwords.words('english')

def sentence_to_wordlist(sentence, remove_stopwords=False):
    sentence_text = re.sub(r'[^\w\s]',' ', sentence)
    words = sentence_text.lower().split()

    for word in words: #Remove Stopwords (Cochrane)
        if word in stopwords:
            words.remove(word)

    return words

#Import Data re: transcripts of video snippets

gitHub = 'hansardExtractedVideoTranscripts.csv'
hansardVideos = pd.read_csv(gitHub, encoding='utf-8')

df = pd.DataFrame(columns=['Label', 'Date' 'ID_main', 'youTube',
                  'timeStamp', 'Speaker', 'French', 'Party',
                  'Seconds', 'English', 'Floor', 
                  'Sentiment', 'sentencePolarity',
                  'wordPolaritySummed', 'sentencePolaritySTD',
                  'countedWords'])


labelList = []
dateList = []
IDmainList = []
youTubeList = []
timeStampList = []
speakerList = []
frenchList = []
partyList = []
secondsList = []
englishList = []
floorList = []
sentenceList = []
sentimentList = []
wordPolaritySummedList = []
sentencePolarityList = []
countedWordsList = []
    

'''A loop for cycling through the list of rows in the hansardVideo
Transcripts.'''
    
i = 0
for x in range(0, len(hansardVideos)):
    labelList.append(hansardVideos["Label"][i])
    dateList.append(hansardVideos["Date"][i])
    IDmainList.append(hansardVideos["ID_main"][i])
    youTubeList.append(hansardVideos["youTube"][i])
    timeStampList.append(hansardVideos["timeStamp"][i])
    speakerList.append(hansardVideos["Speaker"][i])
    frenchList.append(hansardVideos["French"][i])
    partyList.append(hansardVideos["party"][i])
    secondsList.append(hansardVideos["seconds"][i])
    englishList.append(hansardVideos["english"][i])
    floorList.append(hansardVideos["floor"][i])

    sentenceList.append(hansardVideos["english"][i])
    sentence = hansardVideos["english"][i]
    '''break the sentence into words'''
    sentence_words = sentence_to_wordlist(sentence)
    '''initialize variables'''
    sentiment = 0 #positivity minus negativity
    sentencePolarity = 0 #absolute values of positivity minus negativity
    wordPolaritySummed = 0 #sum of absolute value of word polarities
    countedWords = 0
    
    
    '''for every word in the sentence, subtract the sum of its cosine 
    similarity to the negative seed words from the sum of its cosine
    similarity to the positive seed words, and then sum the difference
    across all words in the sentence.'''
    
    for word in sentence_words:

        try:

            word_model = model[word].astype('float64')

            pos1 = np.dot(word_model, good).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(good))
            pos2 = np.dot(word_model, excellent).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(excellent).astype('float64'))
            pos3 = np.dot(word_model, correct).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(correct).astype('float64'))
            pos4 = np.dot(word_model, best).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(best).astype('float64'))
            pos5 = np.dot(word_model, happy).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(happy).astype('float64'))
            pos6 = np.dot(word_model, positive).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(positive).astype('float64'))
            pos7 = np.dot(word_model, fortunate).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(fortunate).astype('float64'))

            neg1 = np.dot(word_model, bad).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(bad).astype('float64'))
            neg2 = np.dot(word_model, terrible).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(terrible).astype('float64'))
            neg3 = np.dot(word_model, wrong).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(wrong).astype('float64'))
            neg4 = np.dot(word_model, worst).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(worst).astype('float64'))
            neg5 = np.dot(word_model, disappointed).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(disappointed).astype('float64'))
            neg6 = np.dot(word_model, negative).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(negative).astype('float64'))
            neg7 = np.dot(word_model, unfortunate).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(unfortunate).astype('float64'))

            pos = sum([pos1, pos2, pos3, pos4, pos5, pos6, pos7]) / 7
            neg = sum([neg1, neg2, neg3, neg4, neg5, neg6, neg7]) / 7
            posneg = pos - neg
            sentiment += posneg
            wordPolaritySummed += abs(posneg)
            countedWords +=1
            

        except:
            
            #handful of garbage words -- did not meet minimum word threschold
            print("Warning! Word: ", word, " from speech: ", i, " not in w2v model!")

            continue

    sentimentList.append(sentiment)
    wordPolaritySummedList.append(wordPolaritySummed)   
    sentencePolarityList.append(abs(sentiment))
    countedWordsList.append(countedWords)

    
    i+=1


'''output to csv vile'''
    
w2vScores = pd.DataFrame({'label': labelList, 
                               'date': dateList, 
                               'IDMain': IDmainList, 
                               'youTube': youTubeList, 
                               'timeStamp': timeStampList, 
                               'speaker': speakerList,
                               'french': frenchList, 
                               'party': partyList,
                               'seconds': secondsList,
                               'english': englishList,
                               'floor': floorList,
                               'sentiment': sentimentList,
                               'sentencePolarity': sentencePolarityList,
                               'wordPolaritySummed': wordPolaritySummedList,
                               'countedWords': countedWordsList})
                               





w2vScores.to_csv("w2vScores.csv", sep=',', encoding='utf-8')

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL


Top Positive: [('excellent', 0.2646590058375939), ('outstanding', 0.2212536033509594), ('mentorship', 0.22074900753260668), ('invaluable', 0.21737490034471763), ('highquality', 0.21481580187637048), ('innovative', 0.21233921025935604), ('cooperative', 0.19852254101654596), ('tirelessly', 0.19441230001342802), ('welcome', 0.19195264057452213), ('develop', 0.18963906731900787), ('talents', 0.18880922264327804), ('collaboratively', 0.18879452521764434), ('midwives', 0.18672595078660845), ('excellence', 0.18635928122468895), ('collaboration', 0.18456441907168294), ('thoughtful', 0.18274887654800306), ('dedicated', 0.18254853884006073), ('diligent', 0.18120867219162248), ('diligently', 0.18069801331470703), ('productive', 0.18068655745303022), ('constructive', 0.18039648958676818), ('secure', 0.17893176213810552), ('best', 0.17813326050871692), ('cooperatively', 0.17805248108901445), ('happy', 0.1763290887509615), ('highimpact', 0.17622271639581888), ('strengthen', 0.1761013943410477), ('fl





# IX. Integrate Other Measures

### IX. A. Apply Leading Sentiment Lexicons

use applySentimentR_ccEd.R in R-Studio.

### IX. B. Apply Additional Sentiment Lexicons

This script trains Support Vector Machines and FastText on a training subset of the IMDB movie review (https://www.kaggle.com/utathya/imdb-review-dataset) and Stanford handcoded tweet (https://snap.stanford.edu/data/twitter7.html) databases, and applies these models to a testing subset of each corpora. The script applies the models to classify the sentiment of the Hansard transcripts of the video snippets (from Step 2). The script also scores the sentiment of these snippets using the Valence Aware Dictionary and Sentiment Reasoner (VADER - https://github.com/cjhutto/vaderSentiment) and LIWC (https://liwc.wpengine.com/) sentiment dictionaries. The training data and models are available at https://www.dropbox.com/sh/u91njzwcuvdu8oa/AABkl2vUJRUNEq4WEGSCBql_a?dl=0 (~1.5GB combined).

In [None]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Jan 10 06:06:37 2019
@author: ludovic rheault with cochrane """

import pandas as pd
import numpy as np
import pickle
import fastText 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC, SVC
from sklearn import metrics
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re



# Library used to preprocess the texts (already done)
import spacy
nlp = spacy.load('en')

def process(text):
    doc = nlp(text)
    text = ' '.join([w.lemma_ for w in doc if not w.is_punct and not w.is_stop])
    text = text.replace('-PRON-', '') #remocing Spacy lemma convention for pronouns
    return text


#=======================================================#
# Fitting the SVM model
# Imdb dataset
#=======================================================#

# To build tfidf matrix
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000)

# Fitting model on the training data
with open('training_imdb.csv') as f:
    training = f.read().splitlines()
training=[(0, t.replace('__label__Negative ','')) if t.startswith('__label__Negative') else (1, t.replace('__label__Positive ','')) for t in training]
y = [l for l,_ in training]
X = [t for _,t in training]
X = vectorizer.fit_transform(X)
kbest = SelectKBest(chi2, k=2000)
X = kbest.fit_transform(X, y)

clf = LinearSVC()
clf.fit(X, y)

# Evaluating the model
with open('testing_imdb.csv') as f:
    testing = f.read().splitlines()
testing=[(0, t.replace('__label__Negative ','')) if t.startswith('__label__Negative') else (1, t.replace('__label__Positive ','')) for t in testing]
y_test = [l for l,_ in testing]
X_test = [t for _,t in testing]
X_test = vectorizer.transform(X_test)
X_test = kbest.transform(X_test)

yhat = clf.predict(X_test)
acc = metrics.accuracy_score(y_test, yhat)
f1 = metrics.f1_score(y_test, yhat)
precision = metrics.precision_score(y_test, yhat)
recall = metrics.recall_score(y_test, yhat)
print(acc)
print(f1)
print(precision)
print(recall)

# Saving model
with open('svm_imdb.pkl', 'wb') as fout:
    pickle.dump((vectorizer, kbest, clf), fout)

#------------------------------------------
# Predicting sentiment of Hansard sentences
#------------------------------------------
    
# Read Data    

gitHub = 'hansardExtractedVideoTranscripts.csv'
df = pd.read_table(gitHub, sep=',', header=0)


# Applying Processing
preProcessed = []
for item in df['english']:
    preProcessed.append(process(item))

df['preprocessed'] = preProcessed



texts = df.preprocessed.tolist()
texts = vectorizer.transform(texts)
texts = kbest.transform(texts)
df['svm_imdb'] = clf.predict(texts)
df['svm_imdb'] = df.svm_imdb.replace({0:'Negative', 1:'Positive'})

#=======================================================#
# Predictions from fastText model
#=======================================================#

# Loading model
fasttext_model = fastText.load_model('fasttext_imdb.bin') 


# Adding predictions from fastText
texts = df.preprocessed.tolist() #de-vectorizing texts (see above)

fasttext_imdb = []
fasttext_imdbWeights = []

for p in texts:
    prediction = fasttext_model.predict(p)
    weight= prediction[1]
    verdict=str(prediction[0])   
    if '__label__Negative' in verdict:
        verdict = 'Negative'
        weight = weight*-1
    elif 'Positive' in verdict:
        verdict = 'Positive'
    else:
        print("Missing Estimate!!: ", p)
    fasttext_imdb.append(verdict)
    weight = str(weight)
    weight = weight.replace("[","")
    weight = weight.replace("]","")
    fasttext_imdbWeights.append(weight)
    

df['fasttext_imbd'] = list(item for item in fasttext_imdb)
df['fasttext_imbdWeight'] = list(item for item in fasttext_imdbWeights)
  
# Saving Hansard file with predictions
df.to_csv('hansardExtractedVideoTranscripts_SVMFastText.csv', index=False, sep=',')


# Evaluating fastText model accuracy

with open('testing_imdb.csv') as f:
    IMDBclassified = f.read().splitlines() #read classified imdb reviews
    

IMDBunclassified = []
IMDBclassifiedScores = [] #list holders


for n, i in enumerate(IMDBclassified): #remove classification labels form imdb 
    if i.startswith("__label__Negative "):
        #remove known classification for unclassified
        IMDBunclassified.append(IMDBclassified[n].replace('__label__Negative ', ''))
        IMDBclassifiedScores.append(0)
    elif i.startswith("__label__Positive "):
        IMDBunclassified.append(IMDBclassified[n].replace('__label__Positive ', ''))
        IMDBclassifiedScores.append(1)
    else:
        print("Flag! Line: ", n, " with text ", i, " Unclassified in IMDB Corpus")


IMDBFastTextScore = []
for review in IMDBunclassified:
    check = list(fasttext_model.predict(review)[0])   
    if check[0] == '__label__Positive':
        IMDBFastTextScore.append(1)
    elif check[0] == '__label__Negative':
        IMDBFastTextScore.append(0)
    else:
        print("Warning! No FastText Generated for: ", review)
    

acc = metrics.accuracy_score(IMDBclassifiedScores, IMDBFastTextScore)
f1 = metrics.f1_score(IMDBclassifiedScores, IMDBFastTextScore)
precision = metrics.precision_score(IMDBclassifiedScores, IMDBFastTextScore)
recall = metrics.recall_score(IMDBclassifiedScores, IMDBFastTextScore)

print("Fast Text IMDB Accuracy")
print(acc)
print(f1)
print(precision)
print(recall)

#=======================================================#
# Fitting the SVM model
# Stanford dataset
#=======================================================#

# To build tfidf matrix
vectorizer = TfidfVectorizer(ngram_range=(1,2), max_features=5000)

# Fitting model on the training data
with open('training_stanford.csv') as f:
    training = f.read().splitlines()
training=[(0, t.replace('__label__Negative ','')) if t.startswith('__label__Negative') else (1, t.replace('__label__Positive ','')) for t in training]
y = [l for l,_ in training]
X = [t for _,t in training]
X = vectorizer.fit_transform(X)
kbest = SelectKBest(chi2, k=2000)
X = kbest.fit_transform(X, y)

clf = LinearSVC(C=100)
clf.fit(X, y)

# Evaluating the model
with open('testing_stanford.csv') as f:
    testing = f.read().splitlines()
testing=[(0, t.replace('__label__Negative ','')) if t.startswith('__label__Negative') else (1, t.replace('__label__Positive ','')) for t in testing]
y_test = [l for l,_ in testing]
X_test = [t for _,t in testing]
X_test = vectorizer.transform(X_test)
X_test = kbest.transform(X_test)

yhat = clf.predict(X_test)
acc = metrics.accuracy_score(y_test, yhat)
f1 = metrics.f1_score(y_test, yhat)
precision = metrics.precision_score(y_test, yhat)
recall = metrics.recall_score(y_test, yhat)
print("IMDB Estimates Model Accuracy")
print(acc)
print(f1)
print(precision)
print(recall)

# Saving model
with open('svm_stanford.pkl', 'wb') as fout:
    pickle.dump((vectorizer, kbest, clf), fout)

# Predicting sentiment of Hansard sentences
texts = df.preprocessed.tolist()
texts = vectorizer.transform(texts)
texts = kbest.transform(texts)
df['svm_stanford'] = clf.predict(texts)
df['svm_stanford'] = df.svm_imdb.replace({0:'Negative', 1:'Positive'})

#=======================================================#
# Predictions from fastText model
#=======================================================#

# Loading model
fasttext_model = fastText.load_model('fasttext_stanford.bin') #Cochrane edited

# Adding predictions from fastText
texts = df.preprocessed.tolist() #de-vectorizing texts (see above)

fasttext_stanford = []
fasttext_stanfordWeights = []
for p in texts:
    prediction = fasttext_model.predict(p)
    weight = prediction[1]
    verdict=str(prediction[0]) 
    if '__label__Negative' in verdict:
        verdict = 'Negative' 
        weight = weight*-1
    elif '__label__Positive' in verdict:
        verdict = 'Positive'
    else:
        print('Missing Estimate!!: ', p)
    weight = str(weight)
    weight = weight.replace("[","")
    weight = weight.replace("]","")
    fasttext_stanford.append(verdict)
    fasttext_stanfordWeights.append(weight)
    

df['fasttext_stanford'] = list(item for item in fasttext_stanford)
df['fasttext_stanfordWeights'] = list(item for item in fasttext_stanfordWeights)
  


# Evaluating fastText model accuracy

with open('testing_stanford.csv') as f:
    StanfordClassified = f.read().splitlines() #read classified Stanford reviews
    

StanfordUnclassified = []
StanfordClassifiedScores = [] 
newStanfordClassified = [] # will exclude handful of unclassified tweets in Stanford corpus

for n, i in enumerate(StanfordClassified): #remove classification labels form imdb 
    if i.startswith("__label__Negative "):
        #remove known classification from unclassified
        StanfordUnclassified.append(StanfordClassified[n].replace('__label__Negative ', ''))
        StanfordClassifiedScores.append(0)
        newStanfordClassified.append(StanfordClassified[n])
    elif i.startswith("__label__Positive "):
        StanfordUnclassified.append(StanfordClassified[n].replace('__label__Positive ', ''))
        StanfordClassifiedScores.append(1)
        newStanfordClassified.append(StanfordClassified[n])
    else:
        print("Flag! Line: ", n, " with text ", i, " Unclassified in Stanford Corpus. Dropping")


StanfordFastTextScore = []
for review in StanfordUnclassified:
    check = list(fasttext_model.predict(review)[0])   
    if check[0] == '__label__Positive':
        StanfordFastTextScore.append(1)
    elif check[0] == '__label__Negative':
        StanfordFastTextScore.append(0)
    else:
        print("Warning! No FastText Generated for: ", review)
    

acc = metrics.accuracy_score(StanfordClassifiedScores, StanfordFastTextScore)
f1 = metrics.f1_score(StanfordClassifiedScores, StanfordFastTextScore)
precision = metrics.precision_score(StanfordClassifiedScores, StanfordFastTextScore)
recall = metrics.recall_score(StanfordClassifiedScores, StanfordFastTextScore)
print("Stanford Estimates Model Accuracy")
print(acc)
print(f1)
print(precision)
print(recall)



#######################################################
#Apply Vader and LIWC Dictionaries
#######################################################


#The popular VADER library for Python, which performs valence shifting for negation words.
# More adapted to social media.
vader = SentimentIntensityAnalyzer()

# Negation words that could be used to account for valence shifting.
negation = ["aint", "arent", "cannot", "cant", "couldnt", "darent", "didnt", "doesnt",
     "ain't", "aren't", "can't", "couldn't", "daren't", "didn't", "doesn't",
     "dont", "hadnt", "hasnt", "havent", "isnt", "mightnt", "mustnt", "neither",
     "don't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't",
     "neednt", "needn't", "never", "none", "nope", "nor", "not", "no", "nothing", "nowhere",
     "oughtnt", "shant", "shouldnt", "uhuh", "wasnt", "werent",
     "oughtn't", "shan't", "shouldn't", "uh-uh", "wasn't", "weren't",
     "without", "wont", "wouldnt", "won't", "wouldn't", "rarely", "seldom", "despite"]

# To compute scores from dictionaries with wild card symbols for inflexions. 
def sentiment_scores(text, lexicon):
    text = text.lower()
    count = 0
    for word in lexicon:
        if word.endswith('*'):
            count += len([t for t in text if t.startswith(word[:-1])])
        else:
            count += text.count(word) 
    score = (count)/len(text)
    return score

# Rescaling variable.
def rescale(x, newmin, newmax, oldmin=None, oldmax=None):
    if not oldmin:
        oldmin = min(x)
    if not oldmax:
        oldmax = max(x)
    return (((x - oldmin) * (newmax - newmin)) / (oldmax - oldmin)) + newmin


# The LIWC positive and negative emotion dictionaries.
# Note: these are a part of proprietary software that we don't have permission
# to share. To reproduce our analysis here, visit http://liwc.wpengine.com/
with open('liwc2015_positive.txt') as f:
    liwc_pos = f.read().splitlines()
with open('liwc2015_negative.txt') as f:
    liwc_neg = f.read().splitlines()

# Adding a LIWC sentiment score, and the VADER compound score.
df['liwc'] = df.preprocessed.apply(lambda x: sentiment_scores(x, liwc_pos) - sentiment_scores(x, liwc_neg))
df['vader'] = df.preprocessed.apply(lambda x: vader.polarity_scores(x.lower())['compound'])

# SVM model fitted with Platt probabilities on the IMDb dataset.
with open('svm_imdb_probs.pkl', 'rb') as f:
    vectorizer, kbest, svm = pickle.load(f)

# Predicting Platt probabilities on Hansard texts.
texts = df.preprocessed.tolist()
texts = vectorizer.transform(texts)
texts = kbest.transform(texts)
df['svm_imdb_probability'] = svm.predict_proba(texts)[:,1]

# Rescaling lexicons between 0 and 10.
df['liwc'] = rescale(df.liwc, 0, 10)
# Vader was already rescaled between -1 and 1.
df['vader'] = rescale(df.vader, newmin=0, newmax=10, oldmin=-1, oldmax=1)


# Saving Hansard file with predictions

#removing '-' in preprocessed, which excel reads as minus when first character


df.to_csv('hansardExtractedVideoTranscripts_SVMFastTextVader.csv', index=False, sep=',')

# X. Merge Data

In [141]:
import pandas as pd

In [142]:
coding = pd.read_csv('fullCodingDataVideoOnePer.csv')

In [143]:
coding = coding.rename(columns={'Video': 'ID_main'})

In [144]:
coding['sent_textCoders'] = coding[['t1SentAvg', 't2SentAvg', 't3SentAvg']].mean(axis=1)

In [145]:
coding.head(1)

Unnamed: 0,Label,Date,French,Speaker,ID_main,english,floor,party,seconds,timeStamp,...,v2Act2,v2SentAvg,v2ActAvg,v3Sent1,v3Act1,v3Sent2,v3Act2,v3SentAvg,v3ActAvg,sent_textCoders
0,1,2017-12-13,1,Alexandre Boulerice,2017 12 13 0,I thought we usually hired an investigator to ...,"Moi, je pensais qu'on embauchait d'habitude un...",NDP,6.45,9M 12S,...,8.0,2.0,8.0,3.0,7.0,2.0,8.0,2.5,7.5,2.333333


In [146]:
rSentiment = pd.read_csv('hansardVideoSentiment_lsdweightCCed.csv')

In [147]:
rSentiment = rSentiment.rename(columns={'IDMain': 'ID_main',
                                     'LexiTextScore': 'sent_lexicoder',
                                     'LexiTextContextScore': 'sent_lexicoderContext',
                                     'sentiment_jockers_rinker': 'sent_jockersRinker',
                                     'sentiment_huliu': 'sent_huLiu',
                                     'sentiment_sentiwordnet': 'sent_sentiwordnet'})

In [148]:
rSentiment.head(1)

Unnamed: 0.1,Unnamed: 0,Label,Date_text,Date,num,ID_main,youTube,timeStamp,Speaker,French,...,X.6,X.7,lexiText,lexiTextContext,sent_lexicoder,sent_lexicoderContext,LexiTextScoreRaw,sent_jockersRinker,sent_sentiwordnet,sent_huLiu
0,1,1,2017 12 13,2017-12-13,0,2017 12 13 0,https://youtu.be/6p2IWa2rfO4,9M 12S,Alexandre Boulerice,1,...,,,I thought we usually hired an investigator to ...,Mr Speaker the new Conflict of Interest and Et...,-1.0,-0.272727,-2.0,-0.136865,-0.105193,-0.218384


In [149]:
w2vScores = pd.read_csv('w2vScores.csv')

In [150]:
w2vScores = w2vScores.rename(columns={'IDMain': 'ID_main',
                                     'sentiment': 'sent_w2v'})

In [151]:
w2vScores.head(1)

Unnamed: 0.1,Unnamed: 0,ID_main,countedWords,date,english,floor,french,label,party,seconds,sentencePolarity,sent_w2v,speaker,timeStamp,wordPolaritySummed,youTube
0,0,2017 12 13 0,13,2017-12-13,I thought we usually hired an investigator to ...,"Moi, je pensais qu'on embauchait d'habitude un...",1,1,NDP,6.45,0.417042,-0.417042,Alexandre Boulerice,9M 12S,0.653231,https://youtu.be/6p2IWa2rfO4


In [173]:
SVMetc = pd.read_csv('hansardExtractedVideoTranscripts_SVMFastTextVader.csv')

In [174]:
SVMetc = SVMetc.rename(columns={'vader': 'sent_vader',
                               'svm_imdb_probability': 'sent_svmIMDBPlattProbs',
                               'fasttext_stanford': 'sent_fastTextStanfordClass',
                               'svm_imdb': 'sent_svmIMDBClass',
                               'svm_stanford': 'sent_svmStanfordClass'})

In [175]:
SVMetc['sent_svmStanfordClass'] = SVMetc['sent_svmStanfordClass'].map({'Negative': -1, 'Positive':1})
SVMetc['sent_svmIMDBClass'] = SVMetc['sent_svmIMDBClass'].map({'Negative': -1, 'Positive':1})

In [176]:
SVMetc.head(1)

Unnamed: 0,Label,Date,ID_main,youTube,timeStamp,Speaker,French,party,seconds,english,...,preprocessed,sent_svmIMDBClass,fasttext_imbd,fasttext_imbdWeight,sent_svmStanfordClass,sent_fastTextStanfordClass,fasttext_stanfordWeights,liwc,sent_vader,sent_svmIMDBPlattProbs
0,1,2017-12-13,2017 12 13 0,https://youtu.be/6p2IWa2rfO4,9M 12S,Alexandre Boulerice,1,NDP,6.45,I thought we usually hired an investigator to ...,...,think usually hire investigator crime cover o...,-1,Negative,-0.503811,-1,Positive,0.714237,3.181818,1.408,0.473646


In [177]:
temp1 = pd.merge(coding, rSentiment,
                               how = 'left',                         
                               left_on='ID_main',
                               right_on='ID_main')

temp2 = pd.merge(temp1, w2vScores, how='left', on='ID_main')

sentimentScores = pd.merge(temp2, SVMetc, how='left', on='ID_main')

In [178]:
#list(sentimentScores.columns)

In [179]:
sentimentScores = sentimentScores.rename(columns={
                               'French_x': 'French',
                               'english_x': 'english',
                               'party_x': 'party',
                                'seconds_x': 'seconds',
                                'youTube_x': 'youTube',
                                                 })

In [180]:
sentimentScores = sentimentScores.rename(columns={
                                't1Act1': 'act_txt1_1',
                                't1Act2': 'act_txt1_2',
                                't1ActAvg': 'act_txt1_avg',
                                't2Act1': 'act_txt2_1',
                                't2Act2': 'act_txt2_2',
                                't2ActAvg': 'act_txt2_avg',
                                't3Act1': 'act_txt3_1',
                                't3Act2': 'act_txt3_2',
                                't3ActAvg': 'act_txt3_avg',
                                'v1Act1': 'act_vid1_1',
                                'v1Act2': 'act_vid1_2',
                                'v1ActAvg': 'act_vid1_avg',
                                'v2Act1': 'act_vid2_1',
                                'v2Act2': 'act_vid2_2',
                                'v2ActAvg': 'act_vid2_avg',
                                'v3Act1': 'act_vid3_1',
                                'v3Act2': 'act_vid3_2',
                                'v3ActAvg': 'act_vid3_avg',
                                't1Sent1': 'sent_txt1_1',
                                't1Sent2': 'sent_txt1_2',
                                't1SentAvg': 'sent_txt1_avg',
                                't2Sent1': 'sent_txt2_1',
                                't2Sent2': 'sent_txt2_2',
                                't2SentAvg': 'sent_txt2_avg',
                                't3Sent1': 'sent_txt3_1',
                                't3Sent2': 'sent_txt3_2',
                                't3SentAvg': 'sent_txt3_avg',
                                'v1Sent1': 'sent_vid1_1',
                                'v1Sent2': 'sent_vid1_2',
                                'v1SentAvg': 'sent_vid1_avg',
                                'v2Sent1': 'sent_vid2_1',
                                'v2Sent2': 'sent_vid2_2',
                                'v2SentAvg': 'sent_vid2_avg',
                                'v3Sent1': 'sent_vid3_1',
                                'v3Sent2': 'sent_vid3_2',
                                'v3SentAvg': 'sent_vid3_avg',
                                'textSent': 'sent_textCoders',
                                'videoSent': 'sent_videoCoders',
                                'textAct': 'act_textCoders',
                                'videoAct': 'act_videoCoders',
})

In [181]:
sentimentScores = sentimentScores.drop(['Unnamed: 0_x',
                                        'Unnamed: 0.1',
                                        'Label_x',
                                        'Date_x',
                                        'youTube_x',
                                        'timeStamp_x',
                                        'Speaker_x',
                                        'French_x',
                                        'party_x',
                                        'seconds_x',
                                        'english_x',
                                        'floor_x',
                                        'Label_y',
                                        'Date_y',
                                        'youTube_y',
                                        'timeStamp_y',
                                        'Speaker_y',
                                        'French_y',
                                        'party_y',
                                        'seconds_y',
                                        'english_y',
                                        'floor_y',
                                        'Unnamed: 0_y',
                                        'X.1',
                                        'X.2',
                                        'X.3',
                                        'X.4',
                                        'X.5',
                                        'X.6',
                                        'X.7',
                                        'X',],                                        
                                        axis=1)


In [182]:
list(sentimentScores.columns)

['French',
 'ID_main',
 'english',
 'party',
 'seconds',
 'youTube',
 'sent_txt1_1',
 'act_txt1_1',
 'sent_txt1_2',
 'act_txt1_2',
 'sent_txt1_avg',
 'act_txt1_avg',
 'sent_txt2_1',
 'act_txt2_1',
 'sent_txt2_2',
 'act_txt2_2',
 'sent_txt2_avg',
 'act_txt2_avg',
 'sent_txt3_1',
 'act_txt3_1',
 'sent_txt3_2',
 'act_txt3_2',
 'sent_txt3_avg',
 'act_txt3_avg',
 'sent_vid1_1',
 'act_vid1_1',
 'sent_vid1_2',
 'act_vid1_2',
 'sent_vid1_avg',
 'act_vid1_avg',
 'sent_vid2_1',
 'act_vid2_1',
 'sent_vid2_2',
 'act_vid2_2',
 'sent_vid2_avg',
 'act_vid2_avg',
 'sent_vid3_1',
 'act_vid3_1',
 'sent_vid3_2',
 'act_vid3_2',
 'sent_vid3_avg',
 'act_vid3_avg',
 'sent_textCoders',
 'Date_text',
 'num',
 'FloorGoogleTrans.French.',
 'FloorGoogleTrans.English.',
 'French.Hansard.of.English.Floor',
 'EnglishContext',
 'FrenchContext',
 'lexiText',
 'lexiTextContext',
 'sent_lexicoder',
 'sent_lexicoderContext',
 'LexiTextScoreRaw',
 'sent_jockersRinker',
 'sent_sentiwordnet',
 'sent_huLiu',
 'countedWords',

In [183]:
sentimentScores.head(1)

Unnamed: 0,French,ID_main,english,party,seconds,youTube,sent_txt1_1,act_txt1_1,sent_txt1_2,act_txt1_2,...,preprocessed,sent_svmIMDBClass,fasttext_imbd,fasttext_imbdWeight,sent_svmStanfordClass,sent_fastTextStanfordClass,fasttext_stanfordWeights,liwc,sent_vader,sent_svmIMDBPlattProbs
0,1,2017 12 13 0,I thought we usually hired an investigator to ...,NDP,6.45,https://youtu.be/6p2IWa2rfO4,2.0,7.0,3.0,6.0,...,think usually hire investigator crime cover o...,-1,Negative,-0.503811,-1,Positive,0.714237,3.181818,1.408,0.473646


In [184]:
sentimentScores.to_csv('sentimentScores.csv', sep=',', encoding='utf-8')

In [None]:

#sentimentScores.rename(columns={'preprocessed': 'spacyPreProcessed',
#                                 'lexiProcessed': 'lexiPreProcessed',
#                                 'svm_imdb': 'sent_svmIMDB',
#                                 'svm_stanford': 'sent_svmStanford',
#                                 'fasttext_imbd': 'sent_fastTextIMDB',
#                                 'fasttext_imbdWeight': 'sent_fastTextIMDBweight',
#                                 'fasttext_stanford': 'sent_fastTextStanford',
#                                 'fasttext_stanfordWeights': 'sent_fastTextStanfordWeights',
#                                 'svm_imdb_probability': 'sent_svmIMDBPlattProbs',
#                                 'sentiment_lsd': 'sent_lexicoder',
#                                 'sentiment_jockers_rinker': 'sent_jockersRinker',
#                                 'sentiment_sentiwordnet': 'sent_sentiwordnet',
#                                 'sentiment_huliu': 'sent_huLiu',
#                                 'vader': 'sent_vader',
#                                 'liwc': 'sent_liwc',
#                                 'sentiment': 'sent_w2v',
#                                 'wordPolaritySummed': 'act_wordPolarity_w2v',
#                                 'sentencePolarity': 'act_sentencePolarity_w2v',
#                                 't1Act1': 'act_txt1_1',
#                                 't1Act2': 'act_txt1_2',
#                                 't1ActAvg': 'act_txt1_avg',
#                                 't2Act1': 'act_txt2_1',
#                                 't2Act2': 'act_txt2_2',
#                                 't2ActAvg': 'act_txt2_avg',
#                                 't3Act1': 'act_txt3_1',
#                                 't3Act2': 'act_txt3_2',
#                                 't3ActAvg': 'act_txt3_avg',
#                                 'v1Act1': 'act_vid1_1',
#                                 'v1Act2': 'act_vid1_2',
#                                 'v1ActAvg': 'act_vid1_avg',
#                                 'v2Act1': 'act_vid2_1',
#                                 'v2Act2': 'act_vid2_2',
#                                 'v2ActAvg': 'act_vid2_avg',
#                                 'v3Act1': 'act_vid3_1',
#                                 'v3Act2': 'act_vid3_2',
#                                 'v3ActAvg': 'act_vid3_avg',
#                                 't1Sent1': 'sent_txt1_1',
#                                 't1Sent2': 'sent_txt1_2',
#                                 't1SentAvg': 'sent_txt1_avg',
#                                 't2Sent1': 'sent_txt2_1',
#                                 't2Sent2': 'sent_txt2_2',
#                                 't2SentAvg': 'sent_txt2_avg',
#                                 't3Sent1': 'sent_txt3_1',
#                                 't3Sent2': 'sent_txt3_2',
#                                 't3SentAvg': 'sent_txt3_avg',
#                                 'v1Sent1': 'sent_vid1_1',
#                                 'v1Sent2': 'sent_vid1_2',
#                                 'v1SentAvg': 'sent_vid1_avg',
#                                 'v2Sent1': 'sent_vid2_1',
#                                 'v2Sent2': 'sent_vid2_2',
#                                 'v2SentAvg': 'sent_vid2_avg',
#                                 'v3Sent1': 'sent_vid3_1',
#                                 'v3Sent2': 'sent_vid3_2',
#                                 'v3SentAvg': 'sent_vid3_avg',
#                                 'textSent': 'sent_textCoders',
#                                 'videoSent': 'sent_videoCoders',
#                                 'textAct': 'act_textCoders',
#                                 'videoAct': 'act_videoCoders',
# })


# sentimentScores = sentimentScores.drop(['Unnamed: 0_x',
#                                         'Unnamed: 0.1',
#                                         'Label_x',
#                                         'Date_x',
#                                         'youTube_x',
#                                         'timeStamp_x',
#                                         'Speaker_x',
#                                         'French_x',
#                                         'party_x',
#                                         'seconds_x',
#                                         'english_x',
#                                         'floor_x',
#                                         'Label_y',
#                                         'Date_y',
#                                         'youTube_y',
#                                         'timeStamp_y',
#                                         'Speaker_y',
#                                         'French_y',
#                                         'party_y',
#                                         'seconds_y',
#                                         'english_y',
#                                         'floor_y',
#                                         'Unnamed: 0_y'],                                        
#                                       axis=1)


# sentimentScores = sentimentScores.reindex(sorted(sentimentScores.columns), axis=1)

# sentimentScores.to_csv('sentimentScores.csv', sep=',', encoding='utf-8')


# XI. Comparison of Tools

### Use predictTextCoderScores.R

# XI. Sensitivity Analysis

Samples -.01n% of original corpus, re-trains model, tests model, stores model fit, and repeats for n = 1...99.  

In [3]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jan  4 05:38:46 2019/modified on Wed May 22 2019

"""

import re

import pandas as pd
import nltk
import os
import numpy as np
import sys
from nltk.corpus import stopwords
import time
import random

import gensim

from gensim.models import Word2Vec
from gensim.models import word2vec
from gensim.models import Phrases
import logging

In [4]:
stopwords_ = stopwords.words('english')


hansardSpeeches = pd.read_csv('hansardExtractedSpeechesFull.csv', sep="\t", encoding="utf-8", header=0) 



print(hansardSpeeches['mentionedEntityName'][1])

[]


In [5]:
def sentence_to_wordlist(sentence, remove_stopwords=False):
    sentence_text = re.sub(r'[^\w\s]','', sentence)
    words = sentence_text.lower().split()

    for word in words: #Remove Stopwords (Cochrane)
        if word in stopwords_:
            words.remove(word)

    return words

def hansard_to_sentences(hansard, tokenizer, remove_stopwords=False ):
    #print("currently processing: word tokenizer")
    start_time = time.time()
    try:
        # 1. Use the NLTK tokenizer to split the text into sentences
        raw_sentences = tokenizer.tokenize(hansard.strip())
        #raw_sentences = [sentence_to_wordlist(raw_sentence) for raw_sentence in raw_sentences]
        #sentences = [sentence for sublist in raw_sentences for sentence in sublist]
        # 2. Loop over each sentence
        sentences = []
        for raw_sentence in raw_sentences:
            # If a sentence is empty, skip it
            if len(raw_sentence) > 0:
                # Otherwise, call sentence_to_wordlist to get a list of words
                sentences.append(sentence_to_wordlist(raw_sentence))
        # 3. Return the list of sentences (each sentence is a list of words, so this returns a list of lists)
        #print(len(sentences))
        return sentences
    except:
        print('nope')

    end_time = time.time()-start_time

In [6]:
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

def speech_tokenizer(hansard):
    sentences = []
    try:
        # Need to first change "./." to "." so that sentences parse correctly
        hansard = hansard.replace("/.","")
        sentences += hansard_to_sentences(hansard, tokenizer)
    except:
        print("no!")
    return sentences

print("Tokenizing ...")
hansardSpeeches["sentences_tokenized"] = hansardSpeeches["speech"].apply(speech_tokenizer)
print("Tokenization Complete")



Tokenizing ...
no!
no!
no!
Tokenization Complete


In [None]:
def day_removal(hansardSpeeches,days_removed,Seed):
    # Set random seed for day removal
    random.seed(Seed)
    
    unique_days_left = hansardSpeeches["date"].unique().tolist()
    to_be_removed = random.sample(unique_days_left,days_removed)
    #print(to_be_removed)
    
    ## Keep rows if date value is not in the to_be_removed list
    hansardSpeeches = hansardSpeeches[~hansardSpeeches["date"].isin(to_be_removed)]
    print("Number of days in corpus:",hansardSpeeches["date"].nunique())
    
    return hansardSpeeches


num_features = 300    # Word vector dimensionality
min_word_count = 10   # Minimum word count 
num_workers = 4       # Number of threads to run in parallel
context = 6           # Context window size
downsampling = 1e-3   # Downsample setting for frequent words
iterations = 5

fraction_removed = 0.05 # (Wong) Fraction of days that are removed at each training instance

total_days = hansardSpeeches["date"].nunique()
print("There are",total_days,"days in corpus")
days_removed = int((total_days)*(fraction_removed))
print("Amount of days to be discarded at each model iteration: ",days_removed)
print("")

## Iterate until there are insufficient days left
i = 0
while fraction_removed*i < 1:
    if i == 0:  
        
        ## sentences is now a list of sentences formatted correctly for word2vec
        sentences = [sentence for sublist in hansardSpeeches["sentences_tokenized"].tolist() for sentence in sublist]
        print("Current population size =",len(sentences))
        
        print("currently processing: training model")
        start_time = time.time()

        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
            level=logging.INFO)

        model = word2vec.Word2Vec(sentences, workers=num_workers, \
                    size=num_features, min_count = min_word_count, \
                    window = context, sample = downsampling, iter = iteration)

        model.init_sims(replace=True)

        model_name = 'hansardQuestions_removed_0.00.model'
        model.save(model_name)
        model = gensim.models.Word2Vec.load(model_name)

        vocab = list(model.wv.vocab.keys())


        print("Process complete--the first 25 words in the vocabulary are:")

        print(vocab[:25])
        print("")

        i += 1

    else:
        
        hansardSpeeches = day_removal(hansardSpeeches, days_removed, 42)
        
        ## sentences is now a list of sentences formatted correctly for word2vec
        sentences = [sentence for sublist in hansardSpeeches["sentences_tokenized"].tolist() for sentence in sublist]
        print("Current population size =",len(sentences))
        
        print("currently processing: training model, removing",
              "{0:.2f}".format(fraction_removed*i),
              "of samples")

        start_time = time.time()

        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
            level=logging.INFO)

        model = word2vec.Word2Vec(sentences, workers=num_workers, \
                    size=num_features, min_count = min_word_count, \
                    window = context, sample = downsampling)

        model.init_sims(replace=True)

        model_name = 'hansardQuestions_removed_'+str("{0:.2f}".format(fraction_removed*i)+'.model')
        model.save(model_name)
        model = gensim.models.Word2Vec.load(model_name)

        vocab = list(model.wv.vocab.keys())


        print("Process complete--the first 25 words in the vocabulary are:")

        print(vocab[:25])
        print("")

        i += 1

In [7]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Fri Jan  4 14:54:42 2019/modified on Wed May 23 2019
@author: chriscochrane/michaelwcwong
"""

#-----------------------------------------------------------------------------
# Initialization
#-----------------------------------------------------------------------------


import gensim
import numpy as np
from operator import itemgetter
import pylab as pl
import scipy.stats as stats
import nltk
import time
import pandas as pd
import re
from nltk.corpus import stopwords

In [None]:
#-----------------------------------------------------------------------------
# Loading stored w2v Model
#-----------------------------------------------------------------------------

model_dict = {}
for i in range(0,100,5):
    try:
        model_dict["w2v_model_removed_0."+str("{:02d}".format(i))] = gensim.models.Word2Vec.load("hansardQuestions_removed_0."+str("{:02d}".format(i))+".model")
    except:
        pass
print(model_dict)

In [8]:
#remove stopwords.  They are not relevant to sentiment scoring.

def sentence_to_wordlist(sentence, remove_stopwords=False):
    sentence_text = re.sub(r'[^\w\s]',' ', sentence)
    words = sentence_text.lower().split()

    for word in words: #Remove Stopwords (Cochrane)
        if word in stopwords_:
            words.remove(word)

    return words

In [None]:
stopwords_ = stopwords.words('english')

## Initialize lists
labelList = []
dateList = []
IDmainList = []
youTubeList = []
timeStampList = []
speakerList = []
frenchList = []
partyList = []
secondsList = []
englishList = []
floorList = []
sentenceList = []
sentiment_columnList = []
wordPolaritySummed_columnList = []
sentencePolarity_columnList = []
countedWords_columnList = []

#Import Data re: transcripts of video snippets

gitHub = 'hansardExtractedVideoTranscripts.csv'
hansardVideos = pd.read_csv(gitHub, encoding='utf-8')

'''A loop for cycling through the list of rows in the hansardVideo
Transcripts.'''

## Iterate through all word2vec models for each sentence
for model_name, model in model_dict.items():

    print("")
    print("/nCurrently processing "+model_name)

    #-----------------------------------------------------------------------------
    # Seed Words (Adapted from Turney and Littman)
    #-----------------------------------------------------------------------------

    good = model['good'].astype('float64')
    excellent = model['excellent'].astype('float64')
    correct = model['correct'].astype('float64')
    best = model['best'].astype('float64')
    happy = model['happy'].astype('float64')
    positive = model['positive'].astype('float64')
    fortunate = model['fortunate'].astype('float64')


    bad = model['bad'].astype('float64')
    terrible = model['terrible'].astype('float64')
    wrong = model['wrong'].astype('float64')
    worst = model['worst'].astype('float64')
    disappointed = model['disappointed'].astype('float64')
    negative = model['negative'].astype('float64')
    unfortunate = model['unfortunate'].astype('float64')

    vocab = list(model.wv.vocab.keys()) #the full vocabulary of Hansard


    # an empty list for storing wights and an empty dictionary for linking
    # weights and words

    runningTally=[]
    dictOfWeights = {}

    #-----------------------------------------------------------------------------
    # Model
    #-----------------------------------------------------------------------------

    '''for every word in the hansard, calculate its cosine similarity to the 
    lists of positive words and negative words, then substract the sum of that
    word's cosine simlarity to negative seed words from its cosine similarity to the
    postive seed words.''' 

    for word in vocab:

        word_model = model[word].astype('float64')

        pos1 = np.dot(word_model, good).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(good).astype('float64'))
        pos2 = np.dot(word_model, excellent).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(excellent).astype('float64'))
        pos3 = np.dot(word_model, correct).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(correct).astype('float64'))
        pos4 = np.dot(word_model, best).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(best).astype('float64'))
        pos5 = np.dot(word_model, happy).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(happy).astype('float64'))
        pos6 = np.dot(word_model, positive).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(positive).astype('float64'))
        pos7 = np.dot(word_model, fortunate).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(fortunate).astype('float64'))

        neg1 = np.dot(word_model, bad).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(bad))
        neg2 = np.dot(word_model, terrible).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(terrible))
        neg3 = np.dot(word_model, wrong).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(wrong))
        neg4 = np.dot(word_model, worst).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(worst))
        neg5 = np.dot(word_model, disappointed).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(disappointed))
        neg6 = np.dot(word_model, negative).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(negative))
        neg7 = np.dot(word_model, unfortunate).astype('float64') / (np.linalg.norm(word_model) * np.linalg.norm(unfortunate))

        pos = sum([pos1, pos2, pos3, pos4,  pos5, pos6, pos7])/7
        neg = sum([neg1, neg2, neg3, neg4, neg5, neg6, neg7])/7
        posneg = pos-neg
        result = (word, posneg)
        runningTally.append(result)
        dictOfWeights[word] = result


        #-----------------------------------------------------------------------------
        # Results
        #-----------------------------------------------------------------------------

        '''The 100 most positive signed and most negative signed words'''

        runningTally = sorted(runningTally, key=itemgetter(1), reverse=True)
        print("Top Positive:", runningTally[:100])
        print("Top Negative:", runningTally[len(runningTally)-100:])
        print("Total Vocabulary Size:", len(vocab))
        print("")

        '''word counts for the seed words'''
        vocab_obj_good = model.wv.vocab["good"]
        print("good", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["excellent"]
        print("excellent", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["correct"]
        print("correct", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["best"]
        print("best", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["happy"]
        print("happy", vocab_obj_good.count)


        vocab_obj_good = model.wv.vocab["positive"]
        print("positive", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["fortunate"]
        print("fortunate", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["bad"]
        print("bad", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["terrible"]
        print("terrible", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["wrong"]
        print("wrong", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["worst"]
        print("worst", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["disappointed"]
        print("disappointed", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["negative"]
        print("negative", vocab_obj_good.count)

        vocab_obj_good = model.wv.vocab["unfortunate"]
        print("unfortunate", vocab_obj_good.count)
  
    sentimentList = []
    wordPolaritySummedList = [] 
    sentencePolarityList = []
    countedWordsList = []
    
    for i, sentence in enumerate(hansardVideos["english"]):
        labelList.append(hansardVideos["Label"][i])
        dateList.append(hansardVideos["Date"][i])
        IDmainList.append(hansardVideos["ID_main"][i])
        youTubeList.append(hansardVideos["youTube"][i])
        timeStampList.append(hansardVideos["timeStamp"][i])
        speakerList.append(hansardVideos["Speaker"][i])
        frenchList.append(hansardVideos["French"][i])
        partyList.append(hansardVideos["party"][i])
        secondsList.append(hansardVideos["seconds"][i])
        englishList.append(hansardVideos["english"][i])
        floorList.append(hansardVideos["floor"][i])

        sentenceList.append(sentence)
        '''break the sentence into words'''
        sentence_words = sentence_to_wordlist(sentence)

        '''initialize variables'''
        sentiment = 0 #positivity minus negativity
        sentencePolarity = 0 #absolute values of positivity minus negativity
        wordPolaritySummed = 0 #sum of absolute value of word polarities
        countedWords = 0

        '''for every word in the sentence, subtract the sum of its cosine 
        similarity to the negative seed words from the sum of its cosine
        similarity to the positive seed words, and then sum the difference
        across all words in the sentence.'''

        #-----------------------------------------------------------------------------
        # Apply Lexicon
        #-----------------------------------------------------------------------------

        '''Apply the Lexicon to score the transcripts of the video clips.'''

        for word in sentence_words: 

            try:

                word_model = model[word]

                pos1 = np.dot(word_model, good) / (np.linalg.norm(word_model) * np.linalg.norm(good))
                pos2 = np.dot(word_model, excellent) / (np.linalg.norm(word_model) * np.linalg.norm(excellent))
                pos3 = np.dot(word_model, correct) / (np.linalg.norm(word_model) * np.linalg.norm(correct))
                pos4 = np.dot(word_model, best) / (np.linalg.norm(word_model) * np.linalg.norm(best))
                pos5 = np.dot(word_model, happy) / (np.linalg.norm(word_model) * np.linalg.norm(happy))
                pos6 = np.dot(word_model, positive) / (np.linalg.norm(word_model) * np.linalg.norm(positive))
                pos7 = np.dot(word_model, fortunate) / (np.linalg.norm(word_model) * np.linalg.norm(fortunate))

                neg1 = np.dot(word_model, bad) / (np.linalg.norm(word_model) * np.linalg.norm(bad))
                neg2 = np.dot(word_model, terrible) / (np.linalg.norm(word_model) * np.linalg.norm(terrible))
                neg3 = np.dot(word_model, wrong) / (np.linalg.norm(word_model) * np.linalg.norm(wrong))
                neg4 = np.dot(word_model, worst) / (np.linalg.norm(word_model) * np.linalg.norm(worst))
                neg5 = np.dot(word_model, disappointed) / (np.linalg.norm(word_model) * np.linalg.norm(disappointed))
                neg6 = np.dot(word_model, negative) / (np.linalg.norm(word_model) * np.linalg.norm(negative))
                neg7 = np.dot(word_model, unfortunate) / (np.linalg.norm(word_model) * np.linalg.norm(unfortunate))

                pos = sum([pos1, pos2, pos3, pos4, pos5, pos6, pos7]) / 7
                neg = sum([neg1, neg2, neg3, neg4, neg5, neg6, neg7]) / 7
                posneg = pos - neg
                sentiment += posneg
                wordPolaritySummed += abs(posneg)
                countedWords +=1


            except:

                #handful of garbage words -- did not meet minimum word threschold
                print("Warning! Word: ", word, " from speech: ", i, " not in w2v model!")

                continue

        ## Each value of these lists represent a row for the dataframe
        ## The length of these lists should be equal to the number of total sentences
        sentimentList.append(sentiment)
        wordPolaritySummedList.append(wordPolaritySummed)   
        sentencePolarityList.append(abs(sentiment))
        countedWordsList.append(countedWords)
        
    ## Each value (list) of these lists represents a single column
    sentiment_columnList.append(sentimentList)
    wordPolaritySummed_columnList.append(wordPolaritySummedList)
    sentencePolarity_columnList.append(sentencePolarityList)
    countedWords_columnList.append(countedWordsList)

In [None]:
'''output to csv file'''

## Create separate dataframes and then merge
baseline_df = pd.DataFrame({'label': labelList, 
                               'date': dateList, 
                               'IDMain': IDmainList, 
                               'youTube': youTubeList, 
                               'timeStamp': timeStampList, 
                               'speaker': speakerList,
                               'french': frenchList, 
                               'party': partyList,
                               'seconds': secondsList,
                               'english': englishList,
                               'floor': floorList})


def df_creator(list_of_lists,column_name,df_list)
    df = pd.Dataframe(list_of_lists,columns=[column_name+str("{:02d}".format(i)) for i in range(len(model_dict))])
    df_list.append(df)
    return df_list

df_list = []
df_list = df_creator(sentiment_columnList,"sentiment_removed_0.",df_list)
df_list = df_creator(sentencePolarity_columnList,"sentencePolarity_removed_0.",df_list)
df_list = df_creator(wordPolaritySummed_columnList,"wordPolaritySummed_removed_0.",df_list)
df_list = df_creator(countedWords_columnList,"countedWords_removed_0.",df_list)

w2vScores = pd.concat(df_list.insert(0,baseline_df),axis=1)


In [None]:
display(w2vScores.head())