# DawR Notebook #8: Clean, Clean, Clean

_Lesson Objectives_

1.   To clean and transform data using methods from previous notebooks
2.   To clean and transform data using dplyr methods

This notebook follows the textbook reading (Ch 13) but is adapted for Google Colab; you are expected to follow along with both, side by side.  You have editing privileges to this document.  Submit your completed notebook to the Google Classroom under 'Notebook #8'.

In [None]:
# for us, to use the data explored in Ch 13, we need to install the library
install.packages('learningr')

In [None]:
# let's see what's available
data(package='learningr')

Given the political climate right now, I'm more interested in the election data than the English monarchies covered in the text.  Feel free to check out what the author does with that dataset, because it involves string manipulation, which could be helpful to you and your final project.

In [None]:
data(obama_vs_mccain, package = 'learningr')
head(obama_vs_mccain)

Let's learn how the data was collected and/or what each column/row represents.  I do that with the code below.

In [None]:

?learningr::obama_vs_mccain

**<font color=#C7B8EA;>Summarize what you learn from the help function here</font>**:

Often, data comes in a form that is *not* conducive to the analyses we want to do.  For example, in the *obama_vs_mccain* data, we might want to understand the voting patterns of region IX rather than do a state-by-state analysis.

We *could* use logical operators like we did before (**<font color=#C7B8EA;>show me how below by filling in the blanks</font>**).  

In [None]:
# like before, we identify the appropriate rows

# insert the function we've used before to determine row indices
regIX_indx = _____(obama_vs_mccain$Region == 'IX')
# do we put regIX_indx on the LHS of the comma or RHS to grab rows?  Do it for me.
obama_vs_mccain[ , ]

But there are easier ways to do so!  We will learn to use the _dplyr_ package and its magical functions.

Let's say I'm curious about the percentage of people living in a non-urban area.  Perhaps I want to study a state or region further using this information.  However, the variables present do not include this information.

So, I must create my own variable, and it is convenient for me to keep it attached to all the other information in the data set so I only have to go to one place.

To do this, I could use the *cbind* function from a previous notebook.

In [None]:
# can find rural percentage using the overall population and the urbanization percentage provided
rural_pct = abs(obama_vs_mccain$Population - obama_vs_mccain$Population*(obama_vs_mccain$Urbanization/100))/obama_vs_mccain$Population

The rural and urban percentages should ideally add up to 100%, since they describe two disjoint parts of a whole.

**<font color=#C7B8EA;>Check!</font>**

In [None]:
#

**<font color=#C7B8EA;>Do you notice anything weird?  Is my math wrong?  What do you think is happening?</font>**  Answer these questions in a text box.

In [None]:
totals = rural_pct*100 + obama_vs_mccain$Urbanization

In [None]:
# we can add the new variable to the end of the dataset, and we call it Rural
new_DF = cbind(obama_vs_mccain, Rural = 100*rural_pct)
# run the next line; did our addition work??
colnames(new_DF)
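
As a quick refresher on *cbind*, here is a minimal sketch on a made-up data frame (*toy_df*, *pop_share*, and the column names are invented for illustration; they are not part of learningr).  Naming the new vector inside *cbind* sets the name of the new column.

```r
# toy_df is a made-up data frame, used only for illustration
toy_df = data.frame(state = c("A", "B", "C"), pop = c(30, 10, 20))

# a new vector with one value per row
pop_share = toy_df$pop / sum(toy_df$pop)

# attach it as a new last column named Share
bigger_df = cbind(toy_df, Share = pop_share)
colnames(bigger_df)
```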

# I) **dplyr**
**The** data manipulation library in R!  It is used to do things we've seen before:

*   subsetting data (we did this with crazy nesting of logical operators)
*   merging data (we did simple merging with rbind and cbind)

and you'll see the benefits to doing this with dplyr.  We will now also be able to

*   sort data sets
*   apply functions to data sets
*   perform basic statistics

Are you ready for it??


In [None]:
library(dplyr)

## a) **Sorting and Subsetting Data**

Right now it appears *obama_vs_mccain* is ordered such that the States are in alphabetical order.  But what if I wanted to reorder so that Region was in numerical order?

In [None]:
region_obama_vs_mccain = obama_vs_mccain %>% arrange(Region)

The %>% operator is called a "pipe".  Basically, it tells R to take the data on the left and plug it in as the first input of the function on the right.  The function that allows you to reorder is called \______ **<font color=#C7B8EA;>(fill in this blank with what you think the function is)</font>**.
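
To make the pipe concrete, here is a minimal sketch on a made-up data frame (*toy_df* and its columns are invented for illustration).  It uses *head* on purpose, so the blank above is still yours to fill in.

```r
library(dplyr)  # provides the %>% pipe

# toy_df is a made-up data frame, used only for illustration
toy_df = data.frame(state = c("A", "B", "C"), pop = c(30, 10, 20))

# these two lines do the same thing:
piped  = toy_df %>% head(2)   # data on the left becomes the first argument
direct = head(toy_df, 2)      # the usual way of calling the function

# TRUE when both results are exactly the same
identical(piped, direct)
```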

Also, observe how we only need to use the name of the column inside the function!  We do *not* need to use *name_of_data$name_of_variable*, which can be confusing, especially when nesting everything like we've done before.

Woot for ease and for readability.

Tell me what you learn about the different regions now that they are organized.  Use this space to do so.

In [None]:
regionI = obama_vs_mccain %>% filter(Region == 'I')
other_regionI = region_obama_vs_mccain %>% filter(Region == 'I')



**<font color=#C7B8EA;>In the space below, check to see if regionI and other_regionI are the same!</font>** You should refer back to the lesson on logical operators if you forgot.

**<font color=#C7B8EA;>Whether they are or are not, explain why in a text box below this code chunk.</font>**

In [None]:
#

*filter* is the function that allows us to subset the data set.  While we are still using logical operators, observe how we do not need the extra step of creating an index vector.  Woot.

ALSO:

In [None]:
poor_regionI = obama_vs_mccain %>% filter(Region == 'I', Income < 25000)
poor_regionI

😎
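
The comma inside *filter* acts like an "and".  A small sketch on a made-up data frame (*toy_df* and its columns are invented for illustration) suggests it picks the same rows we would get with base R logical indexing:

```r
library(dplyr)

# toy_df is a made-up data frame, used only for illustration
toy_df = data.frame(
  region = c("I", "I", "II", "I"),
  income = c(24000, 31000, 22000, 19000)
)

# dplyr: conditions separated by commas must all be TRUE
with_filter = toy_df %>% filter(region == "I", income < 25000)

# base R: the same rows, using & and the data$column notation
with_base = toy_df[toy_df$region == "I" & toy_df$income < 25000, ]

# compare the values column by column (the row names differ between the two)
all(with_filter$income == with_base$income)
```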

The *filter* function can also be paired with *arrange* to make our sorting and ordering more sophisticated.

**<font color=#C7B8EA;>Describe what the code below is doing.  Then check the output.  Does it match your prediction?</font>**

In [None]:
obama_vs_mccain %>% filter(Region == 'I', Income > 25000) %>% arrange(Population)

We aren't done with all the cool things!

Check it:

I *only* care about the voting behavior of people who identify as Latinx and Black within each region.  So, I will use the *select* function for these variables.

**<font color=#C7B8EA;>Why won't this lead to meaningful information?</font>**

In [None]:
bl_lat_sample = obama_vs_mccain %>% select(Region, Black, Latino)

Earlier, we sliced the data so that what remained were states in Region I whose average income was less than \$25,000 USD.  This is what we deemed a "poor" population (though, in 2008, we should really consider any yearly income less than \$40,000 USD as poor because it was still expensive af to live then).

Why not just tack on an extra variable that says "poverty", "middle class", and "upper class" by some arbitrary threshold?

In [None]:
new = obama_vs_mccain %>% mutate(wealth_status = case_when(
                                                    Income < 20000 ~ "poverty",
                                                    Income >= 20000 & Income < 50000 ~ "middle class",
                                                    Income >= 50000 ~ "upper class"
                                                    )
)
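
If *case_when* is new to you, here is a minimal sketch on a made-up data frame (*toy_df*, its columns, and the cutoffs are invented for illustration).  Each row gets the label of the first condition it satisfies, and a row that satisfies none of them gets NA.

```r
library(dplyr)

# toy_df is a made-up data frame, used only for illustration
toy_df = data.frame(name = c("a", "b", "c", "d"),
                    score = c(45, 72, 90, NA))

graded = toy_df %>% mutate(grade = case_when(
  score < 60 ~ "low",                 # checked first
  score >= 60 & score < 80 ~ "mid",   # checked second
  score >= 80 ~ "high"                # checked last
))

# the missing score matches no condition, so its grade is NA
graded$grade
```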

Finally, for today, let's group by those wealth statuses we just created.

In [None]:
new %>% filter(Region == 'IV') %>% group_by(wealth_status) %>% summarize(avg_pop = mean(Population))

**<font color=#C7B8EA;>Tell me what you understand about the code chunk above.</font>**

Please note, we have access to a dplyr cheatsheet in our [Resources folder](https://drive.google.com/drive/folders/1exgwAwYZVcrLnPgtjYSaLtqtEoib7DDr).  We'll explore dplyr more in the next lesson.