# Data Analysis: Analysing for the presence of game frames in parliamentary debates in Denmark

This Jupyter Notebook presents my text analysis of parliamentary debates in Denmark. It consists of five parts:

1. Part 1 loads all the required packages.
2. Part 2 imports the data (see the notebook "Debate_scraping" for the code that created the dataset).
3. Part 3 creates a function that calculates the proportion of game frame words in a text.
4. Part 4 calculates the proportion of game frame words in all debates.
5. Part 5 stores the data in a table and exports it as a CSV file.

## Part 1: Loading packages

In this section, I load all the packages needed for this notebook.

In [161]:
import pickle
import nltk
from datascience import *
import numpy as np
from datetime import datetime

## Part 2: Import Data

In this section, I import all the data scraped in the previous Jupyter notebook. I use pickle to load the data.

In [2]:
# Importing debate data using pickle (the with-block closes the file after loading)
with open("debates.p", "rb") as debates_file:
    debates = pickle.load(debates_file)

In [90]:
# Checking how many observations we have to ensure that the import was successful
len(debates)

1172
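
As a minimal sketch of the pickle round trip behind this import, the cell below writes a small list to disk and reads it back. The toy records and the file name `toy_debates.p` are hypothetical; the three-field layout is only an assumption based on how `debate[1]` (date) and `debate[2]` (text) are used later in this notebook.

```python
import os
import pickle
import tempfile

# Hypothetical records; the real layout of debates.p is assumed, not known
toy_debates = [("id-1", "2007-10-02T12:00:00", "Første debat ..."),
               ("id-2", "2007-10-03T13:00:00", "Anden debat ...")]

path = os.path.join(tempfile.mkdtemp(), "toy_debates.p")

with open(path, "wb") as handle:  # write the list to disk
    pickle.dump(toy_debates, handle)

with open(path, "rb") as handle:  # read it back; the file is closed on exit
    loaded = pickle.load(handle)

print(len(loaded))  # 2
```

Checking the length after loading, as done above with `len(debates)`, is a quick way to confirm that the whole list came back.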

## Part 3: Creating a function to detect the presence of game frames in a text

In this section, I create a function that can calculate the proportion of game frame words in a parliamentary debate.

In [35]:
#Importing important packages for preprocessing
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

In [114]:
#Importing strings with punctuation and digits
from string import punctuation
from string import digits

In [45]:
# Creating list for stopwords
danish_stopwords = stopwords.words('danish')

In [121]:
# Creating a danish stemmer using NLTK
danish_stemmer = SnowballStemmer("danish")

The cell below creates a function that calculates the proportion of game frame words in a text (stop words excluded).

In [118]:
def game_frame_detect(text, list_of_game_frame_words):
    """ Takes a text and a list of words as input and returns the proportion of the non-stop words 
        in the text that are in the list of words provided to the function.
        
        The function is structured in three parts:
            1) Text preprocessing: The text is converted to lower case, punctuation, digits and stop words are removed, and the remaining words are stemmed.
            2) List preprocessing: The list of words is lower-cased and stemmed so that it matches the preprocessed text.
            3) Calculation: The proportion of the words in the text (stop words excluded) that are in the list is calculated.
        
    """
    ### Part 1: Preprocessing of the text ###
    text_lower = text.lower().strip() # converting the text to lower case
    characters_no_punctuation = [char for char in text_lower if char not in punctuation and char not in digits] # removing all punctuation and digits from text
    text_no_punctuation = "".join(characters_no_punctuation)
    text_tokenized = text_no_punctuation.split() # tokenizing the text (= splitting into a list of words)
    text_no_stop_words = [word for word in text_tokenized if word not in danish_stopwords] # removing stop-words
    text_stemmed = [danish_stemmer.stem(word) for word in text_no_stop_words] # stemming all the words

    ### Part 2: Preprocessing of the list of words
    game_frame_list_lower = [word.lower().strip() for word in list_of_game_frame_words] # converting the list of words to lower case
    game_frame_list_stemmed = [danish_stemmer.stem(word) for word in game_frame_list_lower] # stemming the list of words
    
    ### Part 3: Detecting the presence of game frame words (as a proportion of all non-stop words)
    number_of_non_stop_words = len(text_no_stop_words) # counting all non-stop words in the text
    if number_of_non_stop_words == 0: # a text without any non-stop words would otherwise cause a division by zero
        return 0.0
    game_frame_words = [word for word in text_stemmed if word in game_frame_list_stemmed] # selecting all game-frame words from list
    number_of_game_frame_words = len(game_frame_words) # counting all game frame words
    game_frame_proportion = number_of_game_frame_words / number_of_non_stop_words # calculating the proportion of game frame words in the text
    
    return (game_frame_proportion)
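
To make the calculation in the function above easier to follow, here is a simplified sketch that runs without NLTK. It is not the function used in this analysis: `toy_proportion` and `TOY_STOPWORDS` are hypothetical names, the stop-word set is a tiny stand-in for the NLTK list, and stemming is left out entirely.

```python
from string import digits, punctuation

# Hypothetical stand-in for the NLTK stop-word list (assumption, not the real list)
TOY_STOPWORDS = {"er", "om", "det", "til", "vi", "at"}

def toy_proportion(text, frame_words):
    """Share of non-stop-word tokens in `text` that appear in `frame_words`."""
    # Lower-case, drop punctuation and digits, then split on whitespace
    cleaned = "".join(ch for ch in text.lower()
                      if ch not in punctuation and ch not in digits)
    tokens = [tok for tok in cleaned.split() if tok not in TOY_STOPWORDS]
    if not tokens:  # nothing left to count, so avoid dividing by zero
        return 0.0
    frame = {word.lower().strip() for word in frame_words}
    hits = sum(1 for tok in tokens if tok in frame)
    return hits / len(tokens)

# Tokens after stop-word removal: valg, magt, og, magt, alt (3 of 5 match)
example = toy_proportion("Valg er magt, og magt er alt!", ["valg", "magt"])
print(example)  # 0.6
```

Because this sketch does no stemming, inflected forms such as "valget" would not match "valg" here; that is the gap the Snowball stemmer closes in the real function.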

I test the function in the following cell. It works as intended. 

In [119]:
# I want to test the function on the following text
text = """Valget er lige om hjørnet! Det bliver for fedt! Om lidt går vælgerne til stemmeurnerne. 
            Jeg håber bare, at vi vinder magten tilbage denne gang..."""

# I will use this list of game frame words
game_frame_words = ["Valg", "valgkamp", "valgdag", "vælg"
                   "stem", "stemmeboks", "stemmeurne",
                   "magt", "magtkamp", 
                   "flertal", "regeringsdan"]

# Testing the function
game_frame_detect(text, game_frame_words)

0.21428571428571427
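
One detail of the word list above is worth checking: there is no comma between "vælg" and "stem". Python joins adjacent string literals into one string, so the list holds the single entry "vælgstem" instead of two separate words. The short sketch below, using shortened hypothetical lists, shows the effect.

```python
# Adjacent string literals are concatenated when no comma separates them
with_missing_comma = ["valgdag", "vælg"
                      "stem", "stemmeboks"]
with_comma = ["valgdag", "vælg",
              "stem", "stemmeboks"]

print(with_missing_comma)  # ['valgdag', 'vælgstem', 'stemmeboks']
print(len(with_missing_comma), len(with_comma))  # 3 4
```

Printing the list or its length before running the analysis is a simple way to catch this kind of slip.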

## Part 4: Detecting the presence of game frames in debate transcripts

In this section, I first define a set of words that I take to indicate the presence of a game frame. I then use the function from the previous section to calculate the proportion of game frame words in each of the scraped debate transcripts.

In [123]:
# Creating a list of words used in Game Framing
# Note: no comma follows "vælg", so Python joins it with "stem" into the single entry "vælgstem"
game_frame_words = ["valg", "valgkamp", "valgdag", "vælg"
                   "stem", "stemmeboks", "stemmeurne",
                   "magt", "magtkamp", 
                   "flertal", "regeringsdannelse"]

In [128]:
# Calculating the game frame proportion for all debates (stored together with the date of each debate)
list_of_debate_date_and_game_frame_proportion = []

for debate in debates:
    date = debate[1] # select the date and time of the debate
    game_frame_proportion = game_frame_detect(debate[2], game_frame_words) # calculating the game frame proportion in that debate
    date_and_proportion = [date, game_frame_proportion] # creates a list with the date and the game frame proportion
    list_of_debate_date_and_game_frame_proportion.append(date_and_proportion) # appends the date and proportion to the overall list
    
    

In [175]:
# Checking if the loop above was successful
list_of_debate_date_and_game_frame_proportion[:5]

[['2007-10-02T12:00:00', 0.010309278350515464],
 ['2007-10-03T13:00:00', 0.00231099372730274],
 ['2007-10-04T10:00:00', 0.0024645717806531116],
 ['2007-10-05T10:00:00', 0.0032743942370661427],
 ['2007-10-09T13:00:00', 0.0010649627263045794]]
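
The dates in the output above are stored as plain strings. If they are needed as real dates later (for example, for sorting or plotting), they could be parsed with the `datetime` class imported in Part 1. This is only a sketch on two of the rows shown above, not a step of the analysis itself; the names `rows` and `parsed` are hypothetical.

```python
from datetime import datetime

# Two rows in the same [timestamp, proportion] shape as the output above
rows = [["2007-10-02T12:00:00", 0.010309278350515464],
        ["2007-10-03T13:00:00", 0.00231099372730274]]

# Parse each ISO 8601 string into a datetime object
parsed = [[datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), share]
          for stamp, share in rows]

print(parsed[0][0].year, parsed[0][0].hour)  # 2007 12
```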

Unfortunately, some of the scraped transcripts do not contain actual debates between politicians. They may, for example, only record a vote on a bill, which is of no interest to this study. Therefore, I identify the transcripts with fewer than 5000 characters, since such short transcripts indicate that no actual debating took place in the parliament, and remove them in Part 5.

In [188]:
# Here I find the indexes of all the transcripts with fewer than 5000 characters
index = 0 # set the first index to 0
index_of_short_transcripts = [] # creates an empty list

for debate in debates:
    num_characters = len(debate[2]) # calculates the number of characters in the debate
    if num_characters < 5000: # if num_characters is less than 5000, the index is saved in the list
        index_of_short_transcripts.append(index)
    index = index + 1
    
index_of_short_transcripts[:10] #show the result

[0, 9, 11, 200, 201, 320, 342, 351, 353, 438]

In [189]:
# Number of transcripts with less than 5000 characters
len(index_of_short_transcripts)

29
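
The same index search can be written more compactly with `enumerate`. The sketch below uses hypothetical toy records; the `(id, date, text)` layout is an assumption based on the `debate[1]` and `debate[2]` lookups used above, and `MIN_CHARACTERS` is a name introduced here for the 5000-character threshold.

```python
# Hypothetical toy records in the assumed (id, date, text) layout
toy_debates = [
    ("a", "2007-10-02T12:00:00", "kort"),
    ("b", "2007-10-03T13:00:00", "x" * 6000),
    ("c", "2007-10-04T10:00:00", "y" * 4999),
]
MIN_CHARACTERS = 5000  # threshold used in this notebook

# enumerate yields (index, item) pairs, so no manual counter is needed
short_indexes = [i for i, debate in enumerate(toy_debates)
                 if len(debate[2]) < MIN_CHARACTERS]

# Alternatively, keep only the long transcripts directly
kept = [debate for debate in toy_debates if len(debate[2]) >= MIN_CHARACTERS]

print(short_indexes)  # [0, 2]
```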

## Part 5: Save data as CSV-file 

In this section, I store the data in a table, delete the observations with fewer than 5000 characters and save the result as a CSV file.

In [193]:
# I create an empty table
t = Table().empty(make_array("Time", "Game frame proportion"))
t # The warning message is not important



Time,Game frame proportion


In [198]:
# I fill the table with data on the time of the debate and the proportion of game frame words.
game_frame_table = t.with_rows(list_of_debate_date_and_game_frame_proportion)
game_frame_table

Time,Game frame proportion
2007-10-02T12:00:00,0.0103093
2007-10-03T13:00:00,0.00231099
2007-10-04T10:00:00,0.00246457
2007-10-05T10:00:00,0.00327439
2007-10-09T13:00:00,0.00106496
2007-10-10T13:00:00,0.00152555
2007-10-11T10:00:00,0.00147189
2007-10-12T10:00:00,0.000577701
2007-10-23T13:00:00,0.00149031
2007-10-24T13:00:00,0.0


In [199]:
# Now I remove all the transcripts with less than 5000 characters
final_data = game_frame_table.exclude(index_of_short_transcripts)
final_data

Time,Game frame proportion
2007-10-03T13:00:00,0.00231099
2007-10-04T10:00:00,0.00246457
2007-10-05T10:00:00,0.00327439
2007-10-09T13:00:00,0.00106496
2007-10-10T13:00:00,0.00152555
2007-10-11T10:00:00,0.00147189
2007-10-12T10:00:00,0.000577701
2007-10-23T13:00:00,0.00149031
2007-11-27T12:00:00,0.00834403
2007-11-29T10:00:00,0.00737447


In [200]:
# Exporting the data as a CSV file
final_data.to_df().to_csv("debate_game_frame.csv")
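
Note that pandas writes the row index as an extra unnamed first column by default when `to_csv` is called this way. As a sketch of what the two data columns look like as CSV on their own, the cell below writes two of the rows shown above with the standard-library `csv` module. It writes to an in-memory buffer instead of a file, and the names `rows`, `buffer` and `csv_text` are hypothetical.

```python
import csv
import io

# Two rows taken from the final table shown above
rows = [["2007-10-03T13:00:00", 0.00231099372730274],
        ["2007-10-04T10:00:00", 0.0024645717806531116]]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Time", "Game frame proportion"])  # header row
writer.writerows(rows)                              # one line per debate

csv_text = buffer.getvalue()
print(csv_text)
```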