# Module 3: The Rhetoric of Data
---
<img src="https://pixel.nymag.com/imgs/daily/science/2014/10/16/16-trustingraphsnew.nocrop.w536.h2147483647.2x.gif" style="width: 400px; height: 400px;" />

### Professor Amy Tick

This module explores how data science can persuade or mislead through intentional or unintentional decisions at every step of the data science process. First, we'll see how human judgment still plays a part in seemingly unbiased, 'automated' programming processes by picking apart how Module 2's WordNet dictionary was compiled. Then, we'll discover how some common cognitive biases are exploited in charts and graphs to emphasize a particular message.

*Estimated Time: 50 minutes*

---

### Topics Covered
- Cognitive biases
- Data visualizations
- Natural Language Processing

### Table of Contents

[Introduction](#section 0)<br>

1 - [Deceitful Data: Three Ways to Make a Dictionary](#section 1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i - [By Hand](#subsection 1)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ii - [By Computer](#subsection 2)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; iii - [Both](#subsection 3)


2 - [Ambiguous Analysis](#section 2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; i - [Simple Word Counts](#subsection 4)

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ii - [Natural Language Processing (NLP)](#subsection 5)

3 - [Grifting Graphs](#section 3)<br>

4 - [What's Next?](#section 4)<br>




**Dependencies:**

In [16]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from nltk.corpus import wordnet as wn
import json


---
## Introduction <a id='section 0'></a>

> *“No study is less alluring or more dry and tedious than statistics, unless the mind and imagination are set to work.” - William Playfair, inventor of the line graph, bar graph, and pie chart.*

As demand for data science grows, it has emerged as a powerful rhetorical tool. Major news sources pair their stories with 'infographics,' while [studies](http://journals.sagepub.com/doi/abs/10.1177/0963662514549688) show that the average person finds data and data visualizations highly persuasive. After all, 'numbers don't lie'. Or do they?

Let's return to the data analysis we did in Module 2 and look, at each step of the process, at the many opportunities to make our numbers 'lie.' As a refresher, here's a map of the data science process we used.

<img src="https://upload.wikimedia.org/wikipedia/commons/b/ba/Data_visualization_process_v1.png" style="width: 550px; height: 400px;" />

---
## 1. Deceitful Data: Three Ways to Make a Dictionary <a id='section 1'></a>

Data science starts with a data set upon which all subsequent analysis is built. If that data is skewed, incomplete, or just plain wrong, it's impossible to draw accurate conclusions from it.

In Module 2, we relied on the set of Moral Foundations words and their synonyms, collected in a Python dictionary, to answer questions about candidate and party values. Let's look further into the ways such data sets are constructed and how they can lead you horribly astray.

### i. Method 1: Do it by hand*<a id='subsection 1'></a>
\**or by your grad students' hands*

The original MFT word count analysis was done on religious sermon texts by Graham, Haidt, and Nosek as detailed in [this paper](http://projectimplicit.net/nosek/papers/GHN2009.pdf). Their methodology for constructing their dictionary is below:

> Dictionary development had an expansive phase and a contractive phase, all occurring before reading the sermons. In the expansive phase Jesse Graham and five research assistants generated as many associations, synonyms, and antonyms for the base foundation words as possible, using thesauruses and conversations with colleagues. This included full words and word stems (for instance, nation covers national, nationalistic, etc.)... In the contractive phase, Jesse Graham and Jonathan Haidt deleted words that seemed too distantly related to the five foundations and also words whose primary meanings were not moral (e.g., just more often means only than fair).

The file `haidt_dict.json` contains the relevant portions of the dictionary Graham, Haidt, and Nosek used in their paper. Run the cell below to load the dictionary into the variable `haidt_dict`.

In [17]:
# Run this cell to load the dictionary into a variable
with open('haidt_dict.json') as json_data:
    haidt_dict = json.load(json_data)
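
To get a feel for how a word-stem entry behaves in practice, here is a minimal sketch. It uses a tiny made-up dictionary rather than the contents of `haidt_dict.json`, and it assumes a convention where a trailing `*` marks a stem. The names `toy_dict`, `entry_matches`, and `foundations_for` are hypothetical and exist only for this illustration.

```python
# Hypothetical miniature dictionary for illustration only.
# It is NOT the contents of haidt_dict.json.
# Assumed convention: a trailing '*' marks a word stem.
toy_dict = {
    "fairness": ["fair", "equal*", "justice"],
    "loyalty": ["nation*", "loyal*"],
}


def entry_matches(entry, token):
    """Return True if a dictionary entry covers the given token."""
    if entry.endswith("*"):
        # Stem entry: matches any token that starts with the stem
        return token.startswith(entry[:-1])
    # Full-word entry: must match exactly
    return token == entry


def foundations_for(token, dictionary):
    """List the foundations whose entries cover the token."""
    token = token.lower()
    return sorted(
        foundation
        for foundation, entries in dictionary.items()
        if any(entry_matches(entry, token) for entry in entries)
    )


print(foundations_for("nationalistic", toy_dict))
print(foundations_for("fairly", toy_dict))
```

Notice that the stem `nation*` covers "nationalistic", while the full word `fair` does not cover "fairly". Whether an entry is a stem or a full word is one more judgment call the dictionary's authors have to make.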


Compiling a dictionary this way is extremely time-consuming. Moreover, it involves many judgment calls by researchers, who, like all humans, are biased. Some biases that could come into play:

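* **availability bias**- the tendency to favor whatever comes to mind most easily; here, the words and associations the researchers happen to think of first
* **selection bias**- choosing what to include in a way that is influenced by the result you expect (mostly avoided here, since the dictionary was built before the sermons were read)
* **sampling bias**- drawing on a sample that does not represent the whole population; here, a handful of researchers and their colleagues standing in for everyone who uses the language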

### ii. Method 2: Write some code<a id='subsection 2'></a>


Now that we know how to code, we can potentially build a dictionary much faster using WordNet.

Outline:
- teach simple WordNet (`wn`) lookups
- run a naive implementation of the code
- identify problems: includes unrelated senses, unequal numbers of words, missing some intuitively related words...
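
A naive lookup along the lines of the outline above might look like the following sketch. It relies on NLTK's `wn.synsets` and `Synset.lemma_names`, and assumes the WordNet corpus has already been downloaded (for example with `nltk.download('wordnet')`). The helper names `collect_lemmas` and `naive_synonyms` are made up for this sketch and are not necessarily what Module 2 used.

```python
def collect_lemmas(synsets):
    """Gather every lemma name from an iterable of WordNet synsets."""
    names = set()
    for synset in synsets:
        for name in synset.lemma_names():
            # WordNet joins multi-word lemmas with underscores
            names.add(name.replace("_", " ").lower())
    return sorted(names)


def naive_synonyms(word):
    """Naively treat every lemma of every sense of `word` as a synonym."""
    # Imported here so the helper above can be used without NLTK installed
    from nltk.corpus import wordnet as wn

    return collect_lemmas(wn.synsets(word))
```

Because `naive_synonyms` pulls lemmas from every sense of a word, a call such as `naive_synonyms('fair')` is likely to return words from senses that have nothing to do with morality (for example, the noun sense of a fair as a public event). Nothing in the code decides which senses are relevant; that decision still needs a human.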

### iii. Method 3: Write some code and do some by hand <a id='subsection 3'></a>


Method 1 generates a dictionary that feels appropriate intuitively, but takes a lot of time and may be biased.

Method 2 is fast and eliminates or reduces some biases, but results in a pretty hit-or-miss dictionary.

Here's how we combined elements of each to create the dictionary we used in the end:

---
## 2. Ambiguous Analysis <a id='section 2'></a>

After the data is collected and processed, the next step is to do exploratory analysis, then model and estimate. Here, we'll evaluate the approach we used in Module 2 as well as some more advanced text-analysis methods.

### i. Simple word counts <a id='subsection 4'></a>


Outline:
* pros: simple to code, easy to understand
* cons: ambiguous words, small dictionary changes may have large effects on the results (have students graph), outlier effects (candidates with few or short speeches; the mean is sensitive to outliers)
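
The cons above can be made concrete with a small sketch. Everything in it is invented for illustration: the three 'speeches', the two word lists, and the helper `foundation_rate` are hypothetical and are not the data or code from Module 2.

```python
import re
from statistics import mean, median


def foundation_rate(text, words):
    """Share of tokens in `text` that appear in the word list `words`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    vocab = set(words)
    return sum(token in vocab for token in tokens) / len(tokens)


# Made-up 'speeches': the second uses "just" to mean "only",
# and the third is a one-word outlier.
speeches = [
    "We must be fair and just to every worker",
    "It is just a matter of time",
    "Fair",
]

narrow = ["fair"]          # hypothetical dictionary without "just"
broad = ["fair", "just"]   # the same dictionary with one word added

narrow_rates = [foundation_rate(s, narrow) for s in speeches]
broad_rates = [foundation_rate(s, broad) for s in speeches]

print("narrow dictionary rates:", narrow_rates)
print("broad dictionary rates: ", broad_rates)
print("mean vs. median (narrow):", mean(narrow_rates), median(narrow_rates))
```

Adding a single ambiguous word changes the second speech from no matches to one, even though "just" is not being used in a moral sense there. The one-word speech scores a perfect rate and pulls the mean well above the median.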

### ii. Natural Language Processing <a id='subsection 5'></a>

Outline:
* word vectors and dimensionality reduction
* pro: can capture unanticipated patterns
* con: difficulties in visualizing, interpreting, understanding
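
As a rough sketch of what dimensionality reduction does to word vectors, the example below projects a few vectors onto their first two principal components using NumPy's SVD. The 4-dimensional vectors are invented for illustration (real word embeddings are learned from text and usually have hundreds of dimensions), and `pca_project` is a hypothetical helper, not a library function.

```python
import numpy as np

# Invented 4-dimensional "word vectors" for illustration only
words = ["fair", "just", "equal", "harm", "hurt"]
vectors = np.array([
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.2, 0.1],
    [0.9, 0.7, 0.0, 0.1],
    [0.1, 0.0, 0.9, 0.8],
    [0.0, 0.1, 0.8, 0.9],
])


def pca_project(X, n_components=2):
    """Project the rows of X onto their top principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by variance explained
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


coords = pca_project(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word:>6}: ({x:+.2f}, {y:+.2f})")
```

In this toy data the first coordinate separates the fairness-like words from the harm-like words, but the axes themselves have no built-in meaning. Interpreting them is left to the analyst, which is exactly the difficulty noted in the con above.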

---
## 3. Grifting Graphs <a id='section 3'></a>

Once analysis is complete, a proud data scientist will want to communicate their results. Visualizations are a powerful tool to summarize and display data. 

Outline:
* graph types: when each is appropriate/inappropriate
* color, scale
* labels
* bad visualization examples
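
To illustrate the point about scale, the sketch below draws the same two bars twice: once with the y-axis starting at zero and once with a truncated axis. The numbers and labels are hypothetical, chosen only to show the effect.

```python
import matplotlib.pyplot as plt

# Hypothetical values for illustration only
labels = ["Group A", "Group B"]
values = [50, 52]

fig, (honest_ax, truncated_ax) = plt.subplots(1, 2, figsize=(8, 3))

# Left: the axis starts at zero, so bar heights are proportional to the values
honest_ax.bar(labels, values)
honest_ax.set_ylim(0, 60)
honest_ax.set_title("Axis starts at 0")

# Right: same data, but the truncated axis exaggerates the difference
truncated_ax.bar(labels, values)
truncated_ax.set_ylim(49, 53)
truncated_ax.set_title("Axis starts at 49")

for ax in (honest_ax, truncated_ax):
    ax.set_ylabel("Value (hypothetical units)")

fig.tight_layout()
```

Both panels plot identical data. On the left the bars look nearly equal; on the right, Group B's bar appears about three times as tall as Group A's, even though the underlying difference is only 2 units out of 50.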

In [8]:
# CODE


---
## 4. What's Next? <a id='section 4'></a>

(further resources for data science, NLP, CS, stats...)

---

## Bibliography

- Playfair, W. (1801). The Statistical Breviary: Shewing, on a Principle Entirely New, the Resources of Every State and Kingdom in Europe; Illustrated with Stained Copper-plate Charts the Physical Powers of Each Distinct Nation with Ease and Perspicuity: to which is Added, a Similar Exhibition of the Ruling Powers of Hindoostan. T. Bensley, Bolt Court, Fleet Street.

---
Notebook developed by: Keeley Takimoto, Sean Seungwoo Son, Sujude Dalieh

Data Science Modules: http://data.berkeley.edu/education/modules
