#  Data Analysis: Moral Foundations Theory
---
<img src="https://c1.staticflickr.com/7/6240/6261650491_0cd6c701bb_b.jpg" style="width: 500px; height: 275px;" />

### Professor Amy Tick

Moral Foundations Theory (MFT) hypothesizes that people's sensitivity to the five moral foundations (care, fairness, loyalty, authority, and sanctity) differs with their political ideology: liberals are more sensitive to care and fairness, while conservatives are equally sensitive to all five. Here, we'll explore whether we can find evidence for MFT in the campaign speeches of 2016 United States presidential candidates. For our main analysis, we'll go through the data science process we learned in Day 1 to recreate a simplified version of the analysis done by Jesse Graham, Jonathan Haidt, and Brian A. Nosek in their 2009 paper ["Liberals and Conservatives Rely on Different Sets of Moral Foundations"](http://projectimplicit.net/nosek/papers/GHN2009.pdf). In Part 3, we'll look at other NLP techniques that might be useful in applying this theory.

*Estimated Time: 50 minutes*

---

### Topics Covered
- Plotting data with Matplotlib
- Interpreting graphs
- Textual analysis methods

### Table of Contents


1 - [Data Set and Test Statistic](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [2016 Campaign Speeches](#subsection 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Moral Foundations Dictionary](#subsection 2) <br>

2 - [Exploratory Data Analysis](#section 2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Hypothesis](#subsection 3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Democrats](#subsection 4)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Republicans](#subsection 5) <br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 - [Democrats vs Republicans](#subsection 6) <br>

3 - [Further explorations](#section 3)<br>




**Dependencies:**

In [3]:
from datascience import *
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import json
import re  # regular expressions, used by clean_text below

### Before we get started...

Today is going to be a whirlwind tour through the data science process! There will probably be code you don't understand yet, and that's okay. Our goal for today is to show you the ways Data Science can be used in Rhetoric, not to immediately make you into a master programmer. If you have a question at any point (including "I have no idea what's going on"), don't hesitate to ask.

---
## Part 1: Data Set and Test Statistic  <a id='section 1'></a>

As data scientists starting a new analysis, we know we need to start with two things: some data and a question. In Part 1, we'll get familiar with our data set and determine a way to answer our question using the data.

### 2016 Campaign Speeches <a id='subsection 1'></a>

Our data set is the texts of speeches from the 2016 US presidential campaign. Run the cell below to load the data.

In [None]:
# Load the data from the CSV files in the 'csv' folder into one table.
# This may take a few minutes.
import os

campaign_data = Table()
for file in os.listdir('csv'):
    file_table = Table.read_table(os.path.join('csv', file))
    if campaign_data.num_rows == 0:
        # the first file becomes the starting table
        campaign_data = file_table
    else:
        # every later file has its rows appended in place
        campaign_data.append(file_table)

campaign_data
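
If the `Table` methods above are unfamiliar, the same idea can be sketched with only Python's standard library. This is an illustration rather than the notebook's actual loader: the helper name `load_rows`, the temporary folder, and the two tiny files are all made up so the example runs on its own.

```python
import csv
import os
import tempfile

def load_rows(folder):
    """Read every .csv file in FOLDER and return all rows as one list of dicts."""
    rows = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.csv'):
            continue  # skip anything that is not a CSV file
        path = os.path.join(folder, name)
        with open(path, newline='', encoding='utf-8') as f:
            rows.extend(csv.DictReader(f))  # one dict per row, keyed by column name
    return rows

# Build two tiny made-up files so the sketch is self-contained
demo_folder = tempfile.mkdtemp()
for name, party in [('a.csv', 'D'), ('b.csv', 'R')]:
    path = os.path.join(demo_folder, name)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Title', 'Party', 'Type'])
        writer.writerow(['Example speech', party, 'c'])

demo_rows = load_rows(demo_folder)
print(len(demo_rows), 'rows loaded')
```

Each file contributes its rows to one combined collection, which is what the loop over the `csv` folder does for our real data.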

Take a moment to look at this table. What information does it contain? What are the different columns? What does each row represent? How large is this table altogether? Hint: there are three different Types ('c' for campaign speech, 'p' for press release, and 's' for statement) and two different Parties ('R' for Republican and 'D' for Democrat).

In [None]:
# answer

In Day 1, we learned that the first step in the data science process is data cleaning. While this data set is mostly cleaned (how can we tell?), it does contain some information we don't care about: the press releases and statements. Run the next cell to create a table with only Type 'c' documents.

In [None]:
# create a new table containing only campaign speeches
speeches = campaign_data.where('Type', 'c')
speeches

Lastly, we can see that the text of each speech is contained in one long string. Run the following cell to add a column to our table called 'Words' that contains a list of the individual words in each speech in all lowercase, with no punctuation. This will make it much easier to run our analysis. 

Note: the code in the following cell is **very** technical, and you do not need to understand it. Just take a look at the table you get after you run it.

In [14]:
def clean_text(text):
    # remove punctuation
    p = re.compile(r'[^\w\s]')
    no_punc = p.sub(' ', text)
    # convert to lowercase
    no_punc_lower = no_punc.lower()
    # split into individual words
    clean = no_punc_lower.split()
    return clean
    
speeches = speeches.with_column('Words', [clean_text(speech) for speech in speeches['Text']])
speeches
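
To get a feel for what this cleaning step does, here is a small standalone sketch that applies the same three steps (strip punctuation, lowercase, split) to one made-up sentence. The sample sentence is an assumption for illustration and is not taken from the speeches.

```python
import re

def clean_text(text):
    # replace every character that is not a letter, digit, underscore, or whitespace
    no_punc = re.sub(r'[^\w\s]', ' ', text)
    # lowercase, then split on whitespace into a list of words
    return no_punc.lower().split()

sample = "We can't wait: it's time for change!"
cleaned = clean_text(sample)
print(cleaned)
# ['we', 'can', 't', 'wait', 'it', 's', 'time', 'for', 'change']
```

Notice that contractions such as "can't" are split into two pieces, because the apostrophe is treated like any other punctuation mark.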

### Moral Foundations Dictionary <a id='subsection 2'></a>

In ["Liberals and Conservatives Rely on Different Sets of Moral Foundations"](http://projectimplicit.net/nosek/papers/GHN2009.pdf), one of the methods Graham, Haidt, and Nosek use to measure people's use of Moral Foundations Theory is to count how often they use words related to each foundation. This will be our test statistic for today. To calculate it, we'll need a dictionary of words related to each moral foundation. Run the cell below to load the dictionary you created in the first module.

In [None]:
# Run this cell to load the dictionary into a variable
with open('foundations_dict.json') as json_data:
    mft_dict = json.load(json_data)

# Show the dictionary entry for the 'care' foundation
mft_dict['care']

Graham, Haidt, and Nosek also used a dictionary to calculate their test statistic, but their dictionary was created in a very different way. From the paper:
> Dictionary development had an expansive phase and a contractive phase, all occurring before reading the sermons. In the expansive phase Jesse Graham and five research assistants generated as many associations, synonyms, and antonyms for the base foundation words as possible, using thesauruses and conversations with colleagues. This included full words and word stems (for instance, nation covers national, nationalistic, etc.). The resulting lists included foundation-supporting words (e.g., kindness, equality, patriot, obey, wholesome), as well as foundation-violating words (e.g., hurt, prejudice, betray, disrespect, disgusting). In the contractive phase, Jesse Graham and Jonathan Haidt deleted words that seemed too distantly related to the five foundations and also words whose primary meanings were not moral (e.g., just more often means only than fair).
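
The word stems mentioned in the passage can be sketched in a few lines. This is a minimal illustration that assumes a stem matches any word beginning with it; the helper name `matches_stem` and the word list are made up for the example.

```python
def matches_stem(word, stems):
    """Return True if WORD begins with any of the STEMS."""
    return word.startswith(tuple(stems))

stems = ['nation', 'patriot']  # two stems taken from the examples in the quote
words = ['national', 'nationalistic', 'patriotic', 'notion', 'international']

matched = [w for w in words if matches_stem(w, stems)]
print(matched)
# ['national', 'nationalistic', 'patriotic']
```

A stem only catches words that start with it, so "international" is not counted here even though it contains "nation".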

How is their process similar to how you made your dictionary? How is it different? What are some pros and cons to each method?

In [None]:
# answer

---
## Part 2: Exploratory Data Analysis <a id='section 2'></a>

Now that we have our speech data and our dictionary, we can start our analysis. First, we'll formally state our hypothesis. Then, to visualize the data we'll perform 3 steps:
1. Count the occurrences of words from our dictionary in each speech
2. Calculate how often words from each category are used by each political party
3. Plot the proportions on a bar graph
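
Before working with the real data, the three steps above can be sketched on a single made-up "speech" in plain Python. Everything here is an assumption for illustration: the toy dictionary, the toy sentence, and the helper name `foundation_proportions`. The text-based bars stand in for a real bar graph.

```python
def foundation_proportions(words, dictionary):
    """Return {foundation: share of WORDS that start with one of its stems}."""
    total = len(words)
    proportions = {}
    for foundation, stems in dictionary.items():
        # Step 1: count the words that match this foundation
        count = sum(1 for w in words if w.startswith(tuple(stems)))
        # Step 2: turn the count into a proportion of all words
        proportions[foundation] = count / total if total else 0.0
    return proportions

toy_dict = {'care': ['kind', 'hurt'], 'fairness': ['equal', 'prejudic']}
toy_speech = 'kindness and equality for all we will not hurt anyone'.split()

props = foundation_proportions(toy_speech, toy_dict)

# Step 3: a rough text "bar graph", one '#' per 2% of words
for foundation, p in props.items():
    print(f'{foundation:<10} {"#" * round(p * 50)} {p:.2f}')
```

In the notebook itself, the same idea is applied to every speech in a table and then averaged within each political party.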

### Hypothesis <a id='subsection 3'></a>

An important part of data science is understanding the question you're trying to answer and formulating an appropriate hypothesis. The hypothesis must be testable given your data, and you must be able to say what kinds of results would support or refute your hypothesis _even before you've done any analysis_. 

Today, our question asks whether the word use of 2016 presidential candidates aligns with Moral Foundations Theory.

Think about what you know about Moral Foundations Theory. If this data is consistent with the theory, what should our analysis show for Republican candidates? What about for Democratic candidates? Try sketching a possible graph for each political party, assuming that candidates' speech aligns with the theory.

In [None]:
# answer

### Democrats <a id='subsection 4'></a>

Let's start by looking at Democratic candidates. First, we need to make a table that only contains Democrats. Run the cell below to do so.

In [None]:
# Filter out non-Democrat speeches
democrats = speeches.where('Party', 'D')
democrats

Our test statistic is the proportion of words in Democratic speeches that correspond to a Moral Foundation; in other words, what percentage of the words in their speeches are related to a specific foundation. Run the next cell to create a function to calculate this proportion. Don't worry too much about the code inside the function, but make sure you understand what arguments it takes and what it returns. If you understand the red docstring, you're in good shape!

(Bonus question: why don't we just use the **number** of Moral Foundation words instead of the **proportion** as our test statistic?)

In [None]:
# Function calculating the test statistic.
def mft_proportions(table, dictionary):
    """Given a TABLE of speeches and a MFT DICTIONARY, returns a Table
    with a column for the title of the speech, the total word count for
    each speech, and one column per foundation giving the proportion of
    foundation words in each speech."""
    texts = table['Words']
    # total number of words in each speech
    word_counts = np.array([len(words) for words in texts])
    result = (table.select('Title')
                   .with_column('Total_Word_Count', word_counts))
    for key in dictionary.keys():
        # words and word stems associated with this foundation
        stems = tuple(dictionary[key])
        # for each speech, count the words that start with any of the stems
        num_key_words = np.array([sum(wd.startswith(stems) for wd in words)
                                  for words in texts])
        proportions = num_key_words / word_counts
        result = result.with_column(key, proportions)
    return result

# Calculate the proportions for Democratic speeches
dem_stats = mft_proportions(democrats, mft_dict)
dem_stats

We have our proportions, but it's much easier to understand what's going on when the results are in graph form. Let's start by looking at the average proportions for Democrats as a group. Run the cell below to show a graph of the average proportions. Again, don't worry about the details of the code.

In [None]:
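# average each foundation's proportion across all Democratic speeches
foundations = list(mft_dict.keys())
avg_dem_stats = Table().with_columns('Moral Foundation', foundations,
                                    'Proportion', [np.mean(dem_stats[mf]) for mf in foundations])
avg_dem_stats.barh('Moral Foundation')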

Take a look at this graph. What does it show? Does it support our hypothesis?

### Republicans <a id='subsection 5'></a>

Now, let's repeat the process for Republicans. Replace the ellipses with the correct code to select only Republican speeches, then run the cell to create the table. 

(Hint: look back at how we made the 'democrats' table to see how to fill in the ellipses)

In [None]:
# Filter out non-Republican speeches
republicans = speeches.where('...', '...')
republicans

Next, we need to calculate our test statistic for Republicans. Fill in the ellipses in the cell below with the correct code to create a table with the statistics. Once again, look at how we made this table for Democrats, and think about how you need to change the code for Republicans.

In [1]:
# Calculate the proportions for Republican speeches
rep_stats = mft_proportions(..., ...)
rep_stats

Then, run the next cell to show a graph of the average Republican proportions.

In [None]:
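# average each foundation's proportion across all Republican speeches
foundations = list(mft_dict.keys())
avg_rep_stats = Table().with_columns('Moral Foundation', foundations,
                                    'Proportion', [np.mean(rep_stats[mf]) for mf in foundations])
avg_rep_stats.barh('Moral Foundation')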

Does this graph support our hypothesis? 

In [None]:
# answer

### Democrats vs Republicans <a id='subsection 6'></a>

Comparing two groups becomes much easier when we can look at them both at the same time. Run the cell below to get a graph for side-by-side comparison.

In [None]:
# put both parties' average proportions in one table so the bars sit side by side
foundations = list(mft_dict.keys())
all_avg_stats = Table().with_columns('Moral Foundation', foundations,
                                    'Democrats', [np.mean(dem_stats[mf]) for mf in foundations],
                                    'Republicans', [np.mean(rep_stats[mf]) for mf in foundations])
all_avg_stats.barh('Moral Foundation')

In what ways are Democrats and Republicans similar? In what ways are they different?
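
For anyone curious about the arithmetic behind a side-by-side chart, here is a minimal plain-Python sketch of averaging one foundation's proportion within each party. The helper name `average_by_party` is hypothetical, and the numbers are made-up placeholders, not results from the speeches.

```python
from statistics import mean

def average_by_party(rows, foundation):
    """Group ROWS by their 'Party' value and average the FOUNDATION proportion."""
    by_party = {}
    for row in rows:
        by_party.setdefault(row['Party'], []).append(row[foundation])
    return {party: mean(values) for party, values in by_party.items()}

# placeholder values for illustration only
toy_rows = [
    {'Party': 'D', 'care': 0.02},
    {'Party': 'D', 'care': 0.04},
    {'Party': 'R', 'care': 0.01},
]

averages = average_by_party(toy_rows, 'care')
for party, value in averages.items():
    print(party, round(value, 3))
```

Each bar in the comparison graph corresponds to one such average: one foundation, one party.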

---
## Part 3: Further explorations <a id='section 3'></a>

Intro to section 3 here.

In [None]:
# CODE

---

## Bibliography

* Election documents scraped from http://www.presidency.ucsb.edu/2016_election.php
* Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029. http://projectimplicit.net/nosek/papers/GHN2009.pdf. Accessed October 9, 2017.

---
Notebook developed by: Keeley Takimoto, Sean Seungwoo Son, Sujude Dalieh

Data Science Modules: http://data.berkeley.edu/education/modules
