#  Data Analysis: Moral Foundations Theory
---
<img src="https://c1.staticflickr.com/7/6240/6261650491_0cd6c701bb_b.jpg" style="width: 500px; height: 275px;" />

### Professor Amy Tick

Moral Foundations Theory (MFT) hypothesizes that people's sensitivity to five moral foundations (care, fairness, loyalty, authority, and sanctity) varies with their political ideology: liberals are more sensitive to care and fairness, while conservatives are equally sensitive to all five. Here, we'll explore whether we can find evidence for MFT in the campaign speeches of 2016 United States presidential candidates. For our main analysis, we'll follow the data science process we learned on Day 1 to recreate a simplified version of the analysis done by Jesse Graham, Jonathan Haidt, and Brian A. Nosek in their 2009 paper ["Liberals and Conservatives Rely on Different Sets of Moral Foundations"](http://projectimplicit.net/nosek/papers/GHN2009.pdf). In Part 3, we'll look at other NLP techniques that might be useful in applying this theory.

*Estimated Time: 50 minutes*

---

### Topics Covered
- Plotting data with Matplotlib
- Interpreting graphs
- Textual analysis methods

### Table of Contents


1 - [Data Set and Test Statistic](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [2016 Campaign Speeches](#subsection 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Moral Foundations Dictionary](#subsection 2) <br>

2 - [Data Analysis](#section 2)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Hypothesis](#subsection 3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Democrats](#subsection 4)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Republicans](#subsection 5) <br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 - [Democrats vs Republicans](#subsection 6) <br>

3 - [Assignment: Analyze With Your Dictionary](#section 3)<br>


**Dependencies:**

In [2]:
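import json
import os
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from datascience import Table  # provides Table(), used for the bar charts in Part 2
from nltk.stem.snowball import SnowballStemmer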

---
## Part 1: Speech Data and Foundations Dictionary  <a id='section 1'></a>

As data scientists starting a new analysis, we know we need to start with two things: some data and a question. In Part 1, we'll get familiar with our data set and determine a way to answer our question using the data.

### 2016 Campaign Speeches <a id='subsection 1'></a>

Our data set is the texts of speeches from the 2016 US presidential campaign. Run the cell below to load the data.

In [None]:
# load the data from csv files into a table. 

speeches = pd.DataFrame()
import os
for file in os.listdir(path='csv'):
    if file.endswith("c.csv"):
        if len(speeches) == 0:
            speeches = pd.read_csv('csv/' + file)
        else:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            speeches = pd.concat([speeches, pd.read_csv('csv/' + file)])


speeches['Speech'].iloc[80]

---
Take a moment to look at this table. What information does it contain? What are the different columns? What does each row represent? How large is this table altogether? 

### Moral Foundations Dictionary <a id='subsection 2'></a>

In ["Liberals and Conservatives Rely on Different Sets of Moral Foundations"](http://projectimplicit.net/nosek/papers/GHN2009.pdf), one of the methods Graham, Haidt, and Nosek use to measure people's use of Moral Foundations Theory is to count how often they use words related to each foundation. This will be our test statistic for today. To calculate it, we'll need a dictionary of words related to each moral foundation. 

The dictionary we'll use today comes from a database called [WordNet](https://wordnet.princeton.edu), in which "nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept." By querying WordNet for semantically related words, it was possible to build a dictionary automatically using a Python program.

Run the cell below to load the dictionary and assign it to the variable 'mft_dict'.

In [None]:
# Run this cell to load the dictionary into a variable
with open('foundations_dict.json') as json_data:
    mft_dict = json.load(json_data)


We can see the keys of the dictionary using the .keys() method:

In [None]:
mft_dict.keys()

And we can look up the entries associated with a key by putting the key in brackets:

In [None]:
mft_dict['authority/respect']
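
To get a quick overview of a dictionary shaped like this one, we can loop over its keys and values together. The sketch below uses a small made-up dictionary (`toy_dict` and its entries are assumptions for illustration, not the real contents of `mft_dict`) and counts how many entries are stored under each foundation.

```python
# Hypothetical two-foundation stand-in for mft_dict (the real entries differ)
toy_dict = {
    'care': ['protect', 'safe', 'compass'],
    'liberty': ['free', 'libert'],
}

# Number of stemmed entries stored under each foundation
entry_counts = {foundation: len(stems) for foundation, stems in toy_dict.items()}
print(entry_counts)
```

Swapping `toy_dict` for `mft_dict` should give the same kind of summary for the real dictionary.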

Try looking up the entries for the other keys by filling in for '...' in the cell below.

In [1]:
mft_dict[...]


There's something odd about some of the entries: they're not words! The entries in this dictionary have been **stemmed**, meaning they have been reduced to their smallest meaningful root. 

We can see why this is helpful with an example. Python can count the number of times a string can be found in another string using the string method 'count':

In [None]:
# Counts the number of times the second string appears in the first string
"Data science is the best major, says data scientist.".count('science')

It returns one match, for the second word. But, 'scientist' is very closely related to 'science', and many times we will want to match them both. A stem allows Python to find all words with a common root. Try running the count again with a stem that matches both 'science' and 'scientist'.

In [None]:
# Fill in the parentheses with a stem that will match both 'science' and 'scientist'
"Data science is the best major, says data scientist.".count('...')

Another thing you might have noticed is that all the entries in our dictionary are lowercase. This could be a problem when we do our text analysis. Try counting the number of times 'rhetoric' appears in the example sentence.

In [None]:
# Fill in the parentheses to count how often 'rhetoric' appears in the sentence
"Rhetoric major says back: NEVER argue with a rhetoric student.".count('...')

We can clearly see the word 'rhetoric' appears twice, but the count function only returns 1. That's because Python differentiates between capital and lowercase letters:

In [None]:
'r' == 'R'

To get around this, we can use the .lower() method, which changes all letters in the string to lowercase:

In [None]:
"Rhetoric major says back: NEVER argue with a rhetoric student.".lower()

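Chaining the two string methods gives a case-insensitive count. Here is a minimal sketch using the example sentence from this section; the variable names are only for illustration.

```python
sentence = "Rhetoric major says back: NEVER argue with a rhetoric student."

# Case-sensitive: only the lowercase occurrence matches
case_sensitive = sentence.count('rhetoric')

# Lowercase the sentence first so that both occurrences match
case_insensitive = sentence.lower().count('rhetoric')

print(case_sensitive, case_insensitive)  # prints: 1 2
```
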
In [None]:
my_dict = {'care': ['word1', 'word2'], 
           'loyalty': ['wlekf']}
my_dict

Let's add a column to our 'speeches' table that contains a cleaned copy of each speech, with punctuation removed and all letters converted to lowercase.

In [None]:
import re

def clean_text(text):
    # replace punctuation with spaces
    p = re.compile(r'[^\w\s]')
    no_punc = p.sub(' ', text)
    # convert to lowercase
    no_punc_lower = no_punc.lower()
    # return the cleaned text as a single string
    return no_punc_lower
    
speeches['clean_speech'] = [clean_text(s) for s in speeches['Speech']]

speeches.head()
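
To see what this cleaning step does to a single piece of text, here is a self-contained sketch that applies the same two operations (punctuation removal, then lowercasing) to a made-up sentence. The sample sentence is an assumption for illustration and does not come from the speech data.

```python
import re

def clean_text(text):
    # Replace every character that is not a letter, digit, underscore,
    # or whitespace with a space, then lowercase the result
    no_punc = re.sub(r'[^\w\s]', ' ', text)
    return no_punc.lower()

sample = "Liberty, and justice for ALL!"
cleaned = clean_text(sample)

# Splitting on whitespace gives the individual lowercase words
cleaned_words = cleaned.split()
print(cleaned_words)
```

Because punctuation is replaced with spaces rather than deleted, the cleaned string can contain runs of spaces. Splitting on whitespace, as the later counting code does, ignores them.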

---
## Part 2: Exploratory Data Analysis <a id='section 2'></a>

Now that we have our speech data and our dictionary, we can start our analysis. First, we'll formally state our hypothesis. Then, to visualize the data we'll perform 3 steps:
1. Count the occurrences of words from our dictionary in each speech
2. Calculate how often words from each category are used by each political party
3. Plot the percentages on a bar graph

### Hypothesis <a id='subsection 3'></a>

An important part of data science is understanding the question you're trying to answer and formulating an appropriate hypothesis. The hypothesis must be testable given your data, and you must be able to say what kinds of results would support or refute your hypothesis. 

Today, our question asks whether the word use of 2016 presidential candidates aligns with Moral Foundations Theory.

Think about what you know about Moral Foundations Theory. If this data is consistent with the theory, what should our analysis show for Republican candidates? What about for Democratic candidates? Try sketching a possible graph for each political party, assuming that candidates' speech aligns with the theory.

In [None]:
# answer

### Democrats <a id='subsection 4'></a>

Let's start by looking at Democratic candidates. First, we need to make a table that only contains Democrats. Run the cell below to do so.

In [None]:
# Filter out non-Democrat speeches (copy so that adding columns later doesn't trigger a pandas warning)
democrats = speeches[speeches['Party'] == 'D'].copy()
democrats.head()

Our test statistic is the percent of words in Democratic speeches that correspond to a Moral Foundation. In other words, it measures how often candidates use words related to a specific foundation.

(Bonus question: why don't we just use the **number** of Moral Foundation words instead of the **percent** as our test statistic?)

To calculate the percent, we'll first need the total number of words in each speech.

In [None]:
democrats['total_words'] = [len(speech.split()) for speech in democrats['Speech']]
democrats.head()

Next, we need to calculate the number of matches to entries in our dictionary for each speech and for each foundation.

In [None]:
for key in mft_dict.keys():
    num_key_words = np.zeros(len(democrats))
    synonyms = mft_dict[key]
    for synonym in synonyms:
        syn_count = np.array([sum([wd.startswith(synonym) for wd in speech.split()]) for speech in democrats['clean_speech']])
        num_key_words += syn_count
    democrats[key] = num_key_words / democrats['total_words'] * 100

democrats.head()
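
The counting logic in the loop can be traced by hand on a tiny example. The sketch below uses a made-up dictionary and a made-up twelve-word speech (both are assumptions for illustration) and applies the same idea: a word counts as a match if it starts with one of the stems, and the total is divided by the number of words in the speech.

```python
# Hypothetical mini-dictionary and speech, for illustration only
toy_dict = {'care': ['protect', 'safe'], 'liberty': ['free']}
toy_speech = "we will protect our freedom and keep every family safe and free"

words = toy_speech.split()

percents = {}
for foundation, stems in toy_dict.items():
    # Count every word that begins with any stem for this foundation
    matches = sum(word.startswith(stem) for stem in stems for word in words)
    percents[foundation] = matches / len(words) * 100

print(percents)
```

Here the stem 'free' matches both 'freedom' and 'free', which is the benefit of matching on stems rather than whole words.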

We have our proportions, but it's much easier to understand what's going on when the results are in graph form. Let's start by looking at the average proportions for Democrats as a group. Run the cell below to show a graph of the average proportions. Again, don't worry about the details of the code.

In [None]:
avg_dem_stats = Table().with_columns('Moral Foundation', mft_dict.keys(),
                                    'Proportion', [np.average(democrats[mf]) for mf in mft_dict.keys()])
avg_dem_stats.barh('Moral Foundation')

Take a look at this graph. What does it show? Does it support our hypothesis?

We can also look at how different candidates used different foundations:

In [None]:
dem_indivs = (democrats.loc[:, ['Candidate', 'authority/respect', 'care', 'loyalty/ingroup', 'fairness/proportionality',
                               'sanctity/purity', 'liberty']]
             .groupby(['Candidate'])
             .mean())
dem_indivs.plot.bar(figsize=(10, 8))
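
Both averaging steps in this subsection (the group-wide average and the per-candidate average) can also be written in plain pandas. The sketch below uses a small made-up table; the candidate names and percentages are assumptions for illustration, not values from the speech data.

```python
import pandas as pd

# Hypothetical stand-in for the `democrats` table: one row per speech,
# one column per foundation, values in percent of words
toy = pd.DataFrame({
    'Candidate': ['A', 'A', 'B'],
    'care': [1.0, 3.0, 2.0],
    'liberty': [0.5, 1.5, 4.0],
})
foundations = ['care', 'liberty']

# Average over all speeches, regardless of candidate
avg_stats = toy[foundations].mean()

# Average over each candidate's own speeches
per_candidate = toy.groupby('Candidate')[foundations].mean()

print(avg_stats)
print(per_candidate)
```

Calling `.plot.barh()` on `avg_stats` or `.plot.bar()` on `per_candidate` would then draw the corresponding bar chart.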

### Republicans <a id='subsection 5'></a>

Now, let's repeat the process for Republicans. Replace the ellipses with the correct code to select only Republican speeches, then run the cell to create the table. 

(Hint: look back at how we made the 'democrats' table to see how to fill in the ellipses)

In [None]:
# Filter out non-Republican speeches (copy so that adding columns later doesn't trigger a pandas warning)
republicans = speeches[speeches['Party'] == 'R'].copy()
republicans.head()

Next, we need to calculate our test statistic for Republicans. Fill in the ellipses in the cell below with the correct code to create a table with the statistics. Once again, look at how we made this table for Democrats, and think about how you need to change the code for Republicans.

In [None]:
# Count the total number of words in each Republican speech
republicans['total_words'] = [len(speech.split()) for speech in republicans['Speech']]
republicans.head()

Then, calculate foundation synonym percents:

In [None]:
for key in mft_dict.keys():
    num_key_words = np.zeros(len(republicans))
    synonyms = mft_dict[key]
    for synonym in synonyms:
        syn_count = np.array([sum([wd.startswith(synonym) for wd in speech.split()]) for speech in republicans['clean_speech']])
        num_key_words += syn_count
    republicans[key] = num_key_words / republicans['total_words'] * 100
    
republicans.head()

Then, run the next cell to show a graph of the average Republican percentages.

In [None]:
avg_rep_stats = Table().with_columns('Moral Foundation', mft_dict.keys(),
                                    'Proportion', [np.mean(republicans[mf]) for mf in mft_dict.keys()])
avg_rep_stats.barh('Moral Foundation')

Does this graph support our hypothesis? 

Finally, let's look at individual Republican candidate averages.

In [None]:
rep_indivs = (republicans.loc[:, ['Candidate', 'authority/respect', 'care', 'loyalty/ingroup', 'fairness/proportionality',
                               'sanctity/purity', 'liberty']]
             .groupby(['Candidate'])
             .mean())
rep_indivs.plot.bar(figsize=(15, 8))

### Democrats vs Republicans <a id='subsection 6'></a>

Comparing two groups becomes much easier when we can look at them both at the same time. Run the cell below to get a graph for side-by-side comparison.

In [None]:
all_avg_stats = Table().with_columns('Moral Foundation', mft_dict.keys(),
                                    'Democrats', [np.mean(democrats[mf]) for mf in mft_dict.keys()],
                                    'Republicans', [np.mean(republicans[mf]) for mf in mft_dict.keys()])
all_avg_stats.barh('Moral Foundation')

We can also compare the stats for the Democratic and Republican nominees.

In [None]:
all_avg_stats = Table().with_columns('Moral Foundation', mft_dict.keys(),
                                    'Hillary Clinton', dem_indivs.loc['Hillary Clinton ', :],
                                    'Donald Trump', rep_indivs.loc['Donald Trump', :])
all_avg_stats.barh('Moral Foundation')

---
## Assignment: Run Analysis With Your Dictionary  <a id='section 3'></a>

One of the advantages of coding is how easy it is to repeat one method of analysis with different parameters. Run the cell below to load the dictionary you compiled into the `mft_dict` variable.

(Note that Section 1 sets `mft_dict` to the WordNet dictionary. By running the next cell, you will overwrite it and set it to the dictionary you made. It's possible to reset it to the WordNet dictionary by re-running the cell in [Section 1.2](#subsection 2).)

After you reset `mft_dict`, return to [Section 2](#section 2) and run the code cells to regenerate the graphs using your dictionary. You should be able to answer the following questions:

* What does each graph show?
* How are these graphs different from the ones made using the WordNet dictionary?
* Do these graphs support Moral Foundations Theory?

In [3]:
# Tip: if you're working on this assignment after class, remember to import your 
# dependencies by running the very first code cell in this module

# Load your dictionary into the mft_dict variable
with open('my_dict.json') as json_data:
    mft_dict = json.load(json_data)

# Stem the words in your dictionary (this will help you get more matches)
stemmer = SnowballStemmer('english')

for foundation in mft_dict.keys():
    curr_words = mft_dict[foundation]
    stemmed_words = [stemmer.stem(word) for word in curr_words]
    mft_dict[foundation] = stemmed_words
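
The loading cell expects `my_dict.json` to map each foundation name to a list of words. If you still need to create that file, here is a minimal sketch of writing one with the `json` module and reading it back. The words are placeholders, and the sketch writes to a temporary folder so that it can't overwrite anything.

```python
import json
import os
import tempfile

# Placeholder entries; replace them with the words you compiled
my_words = {
    'care': ['protect', 'compassion'],
    'liberty': ['freedom', 'rights'],
}

# Written to a temporary folder here; in the notebook you would
# write to 'my_dict.json' next to the notebook instead
path = os.path.join(tempfile.mkdtemp(), 'my_dict.json')
with open(path, 'w') as json_file:
    json.dump(my_words, json_file, indent=2)

# Read the file back the same way the loading cell does
with open(path) as json_data:
    reloaded = json.load(json_data)

print(reloaded)
```
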


---

## Bibliography

* Election documents scraped from http://www.presidency.ucsb.edu/2016_election.php
* Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029. Retrieved October 9, 2017, from http://projectimplicit.net/nosek/papers/GHN2009.pdf

---
Notebook developed by: Keeley Takimoto, Sean Seungwoo Son, Sujude Dalieh

Data Science Modules: http://data.berkeley.edu/education/modules
