# XRHETOR-R1A Data Science Module 

## 01 - Intro to Data Science in Rhetoric

## Professor Amy Tick 

Data Science is a fast-growing discipline with applications in many fields. Over the course of the next week, these modules will explore the use of Data Science in Rhetoric. Module 01 introduces the Python programming language and the Pandas DataFrame table structure, and shows how to apply coding skills to exploratory analysis of text data. Module 02 walks through the data science process from start to finish to test Moral Foundations Theory. Finally, Module 03 examines data science as a rhetorical tool: how human biases, conscious or unconscious, affect how data is processed and perceived.

Estimated Time: 50 minutes

## __Topics Covered__

### Data Science Intro
- The field of Data Science
- Environment (Jupyter Notebook/DataHub)
- Basic Python
- Introduction to Pandas DataFrames

### Data
- Election 2016: http://www.presidency.ucsb.edu/2016_election.php

### Text processing
- String manipulation
- Word counts

### What you need to know
- data type
- expression
- names
- call expression
- attribute operator
- lists
- dict
- importing library
- table
- functions

**Dependencies:** datascience, nltk, pandas

## What is Data Science?
### Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data.

<img src="http://www.kiwidatascience.com/wp/wp-content/uploads/2016/01/data_scientist.png" style="width: 550px; height: 500px;" />

Statistics is a central component of data science because statistics studies how to draw robust conclusions from incomplete information. Computing is a central component because programming allows us to apply analysis techniques to the large and diverse data sets that arise in real-world applications. Domain knowledge is the most important of the three components: it guides the interpretation of insights, and without knowledge of the subject we can't decide what to analyze.

## The Jupyter Notebook

Note that this page is divided into what are called "cells". For example, the following cell is a "code cell" where you will write your code. You'll see an `In [ ]:` next to each code cell; once a cell has been run, the brackets show a counter for the cells you have run. You can navigate cells by clicking on them or by using the up and down arrows. Cells will be highlighted as you navigate them.

In [None]:
# this is a code cell

## Python

### Data Types

Text analysis uses two basic types of data: numbers and text. 

In Jupyter notebooks, Python numbers are shown in green. 

Python text is referred to as a **string** and is written inside of single or double quotation marks. Jupyter shows it in red.

In [None]:
# Numbers in Python
3
6.0
1837720.8787623

In [None]:
# Python text, aka strings
"a"
"word"
"The quick brown fox jumped over the lazy dog"

### Expressions
Programs are made up of expressions, which describe to the computer how to combine pieces of data. 

Running a code cell will output the result of the expression below the cell. Code cells are run by navigating to the cell and pressing `Shift` + `Enter`.

In [None]:
# number expression
3 * 4

In [None]:
24 * 7

### Names (Identifiers)
Names are given to values in Python using an assignment statement. In an assignment, a name is followed by =, which is followed by any expression. The value of the expression to the right of = is assigned to the name.

In [None]:
# Give a value a name if you want to save it for later use
hours_per_week = 24 * 7

In [None]:
hours_per_week * 60

In [None]:
seconds_per_year = 60 * 60 * 24 * 365

In [None]:
seconds_per_year

In [None]:
seconds_per_hour = 60 * 60
hours_per_year = 24 * 365
seconds_per_year = seconds_per_hour * hours_per_year
seconds_per_year

### Call Expressions

The most important kind of compound expression is a call expression, which applies a function to some arguments. Recall from algebra that the mathematical notion of a function is a mapping from some input arguments to an output value. The way in which Python expresses function application is the same as in conventional mathematics.

For instance, the abs function maps its single input to a single output, which is the absolute value of the input.

In [None]:
abs(-5)

Similarly, the max function maps its inputs to a single output, which is the largest of the inputs.

In [None]:
max(3, 4)

In [None]:
y = max(3, 4)

In [None]:
y

### Attribute Operator
A few functions are available by default, such as abs and max, but most functions that are built into the Python language are stored in a collection of functions called a module. An import statement is used to provide access to a module, such as math or operator. Operators and call expressions can be used together in an expression.

Put another way, specific types of data have specific functions that can be used with **dot notation**: write the data, then a `.`, then the function.

For example, the `upper()` function can be called on a string to make all of its characters uppercase.

In [None]:
message = "python is fun"

In [None]:
message.upper()
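
The paragraph above mentions the `math` and `operator` modules without showing them in use. The following is a minimal sketch of how dot notation reaches a function inside an imported module; the variable names `root`, `total`, and `shout` are chosen here for illustration only.

```python
import math
import operator

# A function stored in a module is reached with dot notation: module.function
root = math.sqrt(16)          # square root of 16
total = operator.add(3, 4)    # same result as the expression 3 + 4

# The same dot notation reaches the functions that belong to a string
shout = "python is fun".upper()

print(root, total, shout)
```

In both cases the name before the dot says where to look, and the name after the dot says which function to apply.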

### Lists

Python has a great built-in list type named "list". Items in a list are written within square brackets [ ] and separated by commas. Lists work similarly to strings -- use square brackets [ ] to access data, with the first element at index 0.

In [None]:
colors = ['red', 'blue', 'green']

In [None]:
colors[0]    ## red

In [None]:
colors[2]    ## green

In [None]:
len(colors)  ## 3
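
To illustrate the claim that lists work similarly to strings, here is a small sketch that indexes both side by side. The `word` variable and the extra color are assumptions added for this example.

```python
colors = ['red', 'blue', 'green']
word = "green"

first_color = colors[0]     # first item of the list
first_letter = word[0]      # first character of the string
last_color = colors[-1]     # negative indexes count from the end

# Unlike strings, lists can be changed after they are created
colors.append('yellow')
n_colors = len(colors)

print(first_color, first_letter, last_color, n_colors)
```

One difference worth noticing: a list can grow with `append`, while a string cannot be changed in place.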

### Dict (Python dictionary)

Python also has a **dictionary** structure called a "dict". The contents of a dict can be written as a series of key:value pairs within braces { }, e.g. dict = {key1:value1, key2:value2, ... }. The "empty dict" is just an empty pair of curly braces {}.

In [None]:
## Can build up a dict by starting with the empty dict {}
## and storing key/value pairs into the dict like this:
## dict[key] = value-for-that-key
dict = {}
dict['a'] = 'alpha'
dict['g'] = 'gamma'
dict['o'] = 'omega'

In [None]:
dict

In [None]:
dict['a']     ## Simple lookup, returns 'alpha'

In [None]:
dict['a'] = 6       ## 'a' already exists, so this replaces 'alpha' with 6

In [None]:
'a' in dict         ## True
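
A few more dictionary operations may be useful later on. This sketch uses a separate, hypothetical variable named `greek` so that it does not disturb the cells above.

```python
greek = {'a': 'alpha', 'g': 'gamma', 'o': 'omega'}

keys = list(greek.keys())        # all keys, in the order they were stored
values = list(greek.values())    # all values, in the same order

# .get() looks up a key but returns a fallback instead of raising an error
missing = greek.get('b', 'not found')

print(keys, values, missing)
```

Using `.get()` with a fallback is one way to handle keys that might not be present.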

### Errors

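The Python language has a very specific syntax. If code is not written in that syntax, or if it asks for something that does not exist (such as a key that was never stored in a dictionary), running a cell will result in an **error** and show an error message.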

Error messages can be confusing, but they can also give information about what is wrong.

In [None]:
# Un-comment the next line, then run the cell to create an error
# dict['b']
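
As a rough illustration of what an error message contains, the sketch below triggers the same kind of error on purpose and captures its two main parts: the error type and the detail. The `try`/`except` statement and the `letters` variable are not covered in this module and are used here only to keep the example from stopping the notebook.

```python
letters = {'a': 'alpha'}

try:
    letters['b']                     # 'b' was never stored, so this fails
except KeyError as err:
    error_name = type(err).__name__  # the kind of error
    error_detail = str(err)          # the key that could not be found

print(error_name, error_detail)
```

When reading a real error message, the last line usually carries these same two pieces of information.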

### Importing Library Functions

Python defines a very large number of functions, including the operator functions mentioned in the preceding section, but does not make all of their names available by default. Instead, it organizes the functions and other quantities that it knows about into modules, which together comprise the Python Library. To use these elements, one imports them.
The libraries below are the ones used in this demo.

In [None]:
import pandas as pd
import json
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
from nltk.stem.snowball import SnowballStemmer

## Data Source
http://www.presidency.ucsb.edu/2016_election.php 

In [None]:
# All the csv files have been prepared in advance.
# We are not going to cover web scraping, but you can look at the code later if you are interested.

# Load the csv data into the Jupyter notebook and save it as a table.

clinton_press = pd.read_csv("../mft_data/csv/Clinton_p.csv")

## Using DataFrame in pandas


A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or table.

Each row represents one **entry**: in this case, a speech, statement, or press release. Each column describes an **aspect** of entries, like the title or date of the speech.

The `head()` function shows us the first five rows of the table.

In [None]:
clinton_press.head()

The `shape` attribute shows the number of rows and columns.

In [None]:
# Each csv file holds the campaign speeches, statements, or press releases of one candidate.
# This example contains the press releases of candidate Hillary Clinton.
# The text of each press release is saved in the "Speech" column.
# How many press releases do we have?
clinton_press.shape

To select a single column, put the column name in square brackets next to the DataFrame (similar to looking up an entry in a Python dictionary).

In [None]:
# Selecting a single column
title_col = clinton_press['Title']
title_col.head()

Items in a column can be accessed like items in a Python list using square brackets.

In [None]:
title_col[0]

Select a row using `.loc[]` with the label of the row inside the brackets. In this table the rows are labeled with numbers starting at 0.

A range of rows can be selected by giving two numbers separated by a colon. The first number is the first row returned, and the last number is the last row returned.

In [None]:
# Locate the row labeled 1 (the second row, since labels start at 0)
clinton_press.loc[1]

In [None]:
# Locate the rows labeled 50 through 55 (both ends included)
clinton_press.loc[50:55]
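
Because the csv file is not needed to see how these tools behave, here is a minimal sketch on a tiny hand-made table. The table `toy` and all of its titles and dates are invented placeholders, not values from the election data.

```python
import pandas as pd

# Hypothetical toy table; the titles and dates are invented placeholders
toy = pd.DataFrame({
    'Title': ['Release A', 'Release B', 'Release C', 'Release D'],
    'Date': ['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04'],
})

n_rows, n_cols = toy.shape          # (rows, columns)
second_row = toy.loc[1]             # the row labeled 1, i.e. the second row
middle = toy.loc[1:2]               # .loc slices include BOTH endpoints
titles = list(middle['Title'])

# A plain Python list slice stops before the end index
plain = ['Release A', 'Release B', 'Release C', 'Release D'][1:2]

print(n_rows, n_cols, titles, plain)
```

Note the difference in the last two results: `.loc[1:2]` returns two rows, while the list slice `[1:2]` returns only one item.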

These filtering tools help us in text analysis. In the next section, we'll analyze the first speech in the DataFrame. Run the next cell to select the speech and save it to a variable.

In [None]:
a_speech = clinton_press["Speech"][0]
a_speech

## Using String and split

The `split()` function breaks a large string down into a list of smaller strings, cutting wherever the given separator appears.



In [None]:
sentence = "This is random text we’re going to split apart"
words = sentence.split(" ")
words

In [None]:
# Splitting speech by words
by_words = a_speech.split(' ')
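
Real speeches often contain line breaks and repeated spaces, which `split(' ')` does not treat specially. The sketch below, using a made-up `text` string, compares it with calling `split()` with no argument.

```python
text = "one  two\nthree"   # note the double space and the line break

by_space = text.split(' ')   # cuts only at single space characters
by_any = text.split()        # cuts at any run of whitespace

print(by_space)
print(by_any)
```

With `split(' ')` the double space produces an empty string and the line break stays inside a word; `split()` avoids both, which may matter when counting words.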

## Counting words
A Counter is a dict subclass for counting. `most_common()` returns all elements in the counter, ordered from the most common to the least; `most_common(n)` returns only the `n` most common.

In [None]:
# Count how many times each word shows up in a speech
count_words_freq = Counter(by_words)
count_words_freq

In [None]:
# Guess what are the most frequent words in it
# Is it what you expected?
# Why? / Why not?
count_words_freq.most_common()
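
To see how a Counter behaves on something small enough to check by hand, here is a sketch using a made-up sentence rather than a real speech.

```python
from collections import Counter

words = "the cat and the hat and the bat".split()
counts = Counter(words)

top_two = counts.most_common(2)   # the two most frequent words
the_count = counts['the']         # look up one word, like a dict
absent = counts['dog']            # missing words count as 0, no error

print(top_two, the_count, absent)
```

Unlike a plain dict, a Counter returns 0 for a word it has never seen instead of raising an error.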

## Data Processing

Some libraries are huge, and it takes time to retrieve data from them.
So we prepared the stopword list from `nltk.corpus.stopwords` ahead of time and saved it to a file.

A **stopword** is a word that is often filtered out during text processing. Stopwords are usually common function words, like 'the', 'a', or 'and'. These words are used frequently but aren't very informative. Filtering them out results in more informative and interesting analysis.

In [None]:
# stop_words = set(stopwords.words("english"))
with open('stopwords.json') as json_data:
    stop_words_json = json.load(json_data)
    
# with open('foundations_dict.json') as json_data:
#     mft_dict = json.load(json_data)
stop_words = stop_words_json['words']
stop_words

### Functions

Here are simple rules to define a function in Python. 

Function blocks begin with the keyword `def` followed by the function name and parentheses `( )`.

Any input parameters (also called arguments) are named inside these parentheses.

Function definitions consist of a def statement that indicates a `<name>` and a comma-separated list of named `<formal parameters>`, then a return statement, called the function body, that specifies the `<return expression>` of the function, which is an expression to be evaluated whenever the function is applied:



In [None]:
# def <name>(<formal parameters>):
#    return <return expression>

In [None]:
def three():
    return 3

three()
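
The function `three` takes no input. As a complement, here is a minimal sketch of a function with one formal parameter; the name `hours_to_minutes` is made up for this example.

```python
def hours_to_minutes(hours):
    """Return the number of minutes in the given number of hours."""
    return hours * 60

# The argument 24 * 7 is evaluated first, then passed in as `hours`
minutes_per_week = hours_to_minutes(24 * 7)

print(minutes_per_week)
```

Each call can supply a different argument, so one definition can be reused for many inputs.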

The second line must be indented — most programmers use four spaces to indent. The return expression is not evaluated right away; it is stored as part of the newly defined function and evaluated only when the function is eventually applied.

The next cell defines a function named `without_stopwords`, and the comments explain how it works. It is okay if you do not understand the details of the code inside the function. All you have to know is what the function does and how to use it.

In [None]:
def without_stopwords(a_speech):
    """ input(a_speech): a speech as a string
        output : list of words in the speech without stop words
        
        >>> exstr = "This is an example showing off stop word filter"
        >>> without_stopwords(exstr)
        ['This', 'example', 'showing', 'stop', 'word', 'filter']
    """
    a_speech = word_tokenize(a_speech)
    filtered = []
    # A for-loop repeats the indented lines below once for each word.
    for word in a_speech:
        # An if statement runs the indented code only when the condition is true.
        if word not in stop_words:
            filtered.append(word)
    return filtered

In [None]:
exstr = "This is an example showing off stop word filter"

In [None]:
without_stopwords(exstr)

In [None]:
a_speech

In [None]:
without_stopwords(a_speech)

In [None]:
list_of_words_c = without_stopwords(a_speech)

In [None]:
Counter(list_of_words_c).most_common()
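
In the docstring example above, 'This' survives the filter because the comparison is case-sensitive. The sketch below shows one possible variation that lowercases each word before checking it. To stay self-contained it makes two assumptions: it uses plain `split()` instead of `word_tokenize`, and a tiny hand-picked stop word set instead of the full list.

```python
# Hypothetical, hand-picked stop words for illustration only
tiny_stop_words = {'this', 'is', 'an', 'off'}

def without_stopwords_ci(text, stop_set):
    """Return the words of text whose lowercase form is not in stop_set."""
    kept = []
    for word in text.split():
        if word.lower() not in stop_set:
            kept.append(word)
    return kept

result = without_stopwords_ci(
    "This is an example showing off stop word filter", tiny_stop_words)

print(result)
```

Whether to ignore case is a judgment call: it removes more function words, but it also treats a capitalized name and a common word spelled the same way as identical.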

The function below filters out every word that appears in a given set. We will use it to remove special characters, such as commas, quotes, and colons, along with the stop words.

In [None]:
def without_words_set(a_speech, a_set):
    a_speech = word_tokenize(a_speech)
    filtered = []
    for word in a_speech:
        if word not in a_set:
            filtered.append(word)
    return filtered

In [None]:
stop_words_set = set(stop_words)

In [None]:
filter_set = {',', '.', '1', '(', ')','*',':','2', '\'', ';', '?', '=','-', '``', "''"}

In [None]:
custom_filter = stop_words_set | filter_set

In [None]:
custom_filter

In [None]:
filtered = without_words_set(a_speech, custom_filter)
filtered

In [None]:
Counter(filtered).most_common(15)

## Counting words 2
Build a dictionary of the words you want to count, then run the count.

In [None]:
def words_to_dictionary(words):
    """Make a dictionary from the given words.
    Each word becomes a key, and its value is set to an initial count of 0."""
    dic = {}
    for word in words:
        dic[word] = 0
    return dic

In [None]:
my_words = ["Clinton", "Hillary", "action", "her"]
my_dictionary = words_to_dictionary(my_words)
my_dictionary

In [None]:
def count_words(dictionary, speech):
    speech_words = speech.split(' ')
    for word in speech_words:
        if word in dictionary:
            dictionary[word] = dictionary[word] + 1
    return dictionary
count_words(my_dictionary, a_speech)
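
Note that `count_words` updates the dictionary it is given, so running the cell above a second time adds new counts on top of the old ones. The following sketch shows one possible alternative that builds a fresh dictionary on every call; the function name `count_words_fresh` and the sample sentence are made up for this example.

```python
def count_words_fresh(words_of_interest, speech):
    """Count how often each word of interest appears in speech."""
    counts = {}
    for word in words_of_interest:
        counts[word] = 0              # start every count at 0
    for word in speech.split(' '):
        if word in counts:
            counts[word] = counts[word] + 1
    return counts

sample = "her plan and her action"    # made-up sentence, not from the data
first = count_words_fresh(["her", "action"], sample)
second = count_words_fresh(["her", "action"], sample)

print(first, second)
```

Because nothing is carried over between calls, the two results are the same no matter how many times the cell is run.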

### Assignment: Code your dictionary into Python

First, make a list of the words you found corresponding to each foundation. Assign them to the variable names below by filling in the lists.

In [None]:
# Make 6 lists. 
# Tip: each word needs to go in quotation marks and each string must be separated by commas
care_words = []
authority_words = []
loyalty_words = []
sanctity_words = []
fairness_words = []
liberty_words = []


Next, make your dictionary by creating a new dictionary with the foundations as keys and your lists as the values.

In [None]:
# Replace the placeholder below with your own dictionary
my_dict = {'test': 1}
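
As a hint about the shape of such a dictionary, here is a sketch with made-up categories and words that are unrelated to the assignment. Only the structure, with names as keys and lists as values, carries over.

```python
# Made-up categories and words, unrelated to the assignment
warm_words = ['red', 'orange']
cool_words = ['blue', 'green']

palette = {'warm': warm_words, 'cool': cool_words}

cool_list = palette['cool']        # look up a whole list by its key
first_warm = palette['warm'][0]    # then index into that list

print(cool_list, first_warm)
```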

Finally, run the next cell to write your dictionary to a file.

In [None]:
# Run this cell to write dictionary to a file
import json
# write the dictionary to a file as JSON-formatted text
with open('../mft_data/my_dict.json', 'w') as fp:
    json.dump(my_dict, fp, sort_keys=True, indent=4)
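
To check that a file like this can be read back, the sketch below writes a small made-up dictionary to a throwaway file and loads it again. The temporary folder and the `example` dictionary are assumptions for illustration; the real file lives at the path used in the cell above.

```python
import json
import os
import tempfile

example = {'b': [2, 3], 'a': [1]}   # made-up contents

# Write to a throwaway file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'example.json')
with open(path, 'w') as fp:
    json.dump(example, fp, sort_keys=True, indent=4)

# Read the file back into a Python dictionary
with open(path) as fp:
    loaded = json.load(fp)

print(loaded == example)
```

`json.dump` writes to a file and `json.load` reads from one, so a dictionary saved in one notebook can be reopened in another.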

## What's next? 

### Module 02: Moral Foundations Analysis
### Module 03: The Rhetoric of Data

Notebook developed by: Seungwoo Sean Son, Keeley Takimoto, Sujude Dalieh

Data Science Modules: http://data.berkeley.edu/education/modules

Some materials are from http://composingprograms.com and https://developers.google.com/edu/python