## Load data into R

Loading data into R refers to the process of importing external datasets into the R environment for analysis and manipulation. R provides various functions and packages to facilitate data import from a wide range of file formats, including CSV, Excel, text files, databases, and more.

Once the data is loaded into R, it becomes accessible as objects (e.g., data frames, matrices) that can be used for statistical analysis, visualization, modeling, and other data-related tasks. Properly loading data is a crucial initial step in any data analysis project, as it enables researchers, data scientists, and analysts to explore, clean, and transform the data to derive meaningful insights and draw conclusions from the information it contains.
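
As a minimal sketch of this step, the snippet below reads a tiny CSV from an inline string with base R's `read.csv()` and then drops the first row with a negative index. The data values and the names `csv_text` and `toyData` are made up for illustration; they are not the chocolate dataset used in this tutorial.

```r
# Hypothetical toy data: a tiny CSV held in a string so the example is self-contained.
csv_text <- "name,rating
placeholder,0
Bar A,3.75
Bar B,2.50"

# read.csv() can parse text directly through its `text` argument.
toyData <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# A negative row index drops that row; here, the first one.
toyData <- toyData[-1, ]

# Look at the first few rows to check the result.
head(toyData)
```

With a real file you would pass the file path instead of `text = ...`; the tidyverse equivalent, `read_csv()`, is used in much the same way.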

In [1]:
# Try!

# call the "tidyverse" library using the library() function

# read our data file into R and assign it to a variable called "chocolateData". 
# Remember that you can find out where the data is by expanding the "Input Files"
# box above by clicking the + sign in the left corner.

# remove the first row of the chocolateData data_frame using a negative index

# check the first few rows of your data using the head() function to make sure it
# looks alright

In [2]:
# let's get rid of the white space in the column names of this
# dataset. This will make it possible for us to refer to columns by their names, since
# any white space in a name will mess R up.

# replace each run of white space in the column names with a single underscore
names(chocolateData) <- gsub("[[:space:]]+", "_", names(chocolateData))
str(chocolateData)



## Data Cleaning

Data cleaning is a crucial step in the data analysis process that involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset. In R, data cleaning is performed using various functions, packages, and techniques to ensure that the data is accurate, reliable, and ready for analysis.

The process of data cleaning in R typically includes tasks such as:

1.  **Missing Value Treatment:** Handling missing values by either imputing them with appropriate values or removing them from the dataset, depending on the nature of the analysis.
    
2.  **Outlier Detection and Treatment:** Identifying and addressing outliers, which are extreme values that can significantly affect statistical analysis and model performance.
    
3.  **Data Type Conversion:** Ensuring that variables are assigned the correct data types to prevent errors during analysis and to optimize memory usage.
    
4.  **Consistency Checks:** Verifying the consistency of data across different variables and identifying and resolving discrepancies.
    
5.  **Duplicate Data:** Detecting and removing duplicate records to avoid skewed analysis results.
    
6.  **Standardization and Formatting:** Converting data into a consistent format, such as changing dates to a standard format or converting text to lowercase.
    
7.  **Handling Categorical Data:** Encoding categorical variables into numerical representations to enable analysis with machine learning algorithms.
    
8.  **Data Transformation:** Applying transformations to variables, such as log transformations or scaling, to meet the assumptions of statistical models.
    
9.  **Data Integration:** Merging and combining datasets from multiple sources to create a unified dataset for analysis.
    
10.  **Data Validation:** Checking for errors and discrepancies in the data to ensure data accuracy and reliability.
    

R provides a wide range of packages, such as `dplyr`, `tidyr`, and `data.table`, that offer powerful functions for performing these data cleaning tasks efficiently. Data cleaning in R is essential for producing high-quality and reliable insights from data analysis and is a fundamental step before proceeding with further data exploration and modeling.
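
To make a few of these tasks concrete, here is a small sketch in base R covering standardization, data type conversion, duplicate removal, and missing value treatment. The data frame `toy` and its values are invented for illustration, and dropping incomplete rows is only one of several possible ways to treat missing values.

```r
# Hypothetical toy data with typical problems: a duplicate row, a missing value,
# inconsistent capitalisation, and numbers stored as text.
toy <- data.frame(
  origin = c("Peru", "peru", "Ghana", "Ghana", "Ecuador"),
  rating = c("3.5", "3.0", "2.75", "2.75", NA),
  stringsAsFactors = FALSE
)

toy$origin <- tolower(toy$origin)      # standardise text to lowercase
toy$rating <- as.numeric(toy$rating)   # convert text to numbers
toy <- toy[!duplicated(toy), ]         # drop exact duplicate rows
toy <- toy[complete.cases(toy), ]      # drop rows with missing values

# Inspect the cleaned data frame.
str(toy)
```

The `dplyr` and `tidyr` packages offer their own functions for the same tasks, which tend to read more naturally inside a pipeline.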

In [3]:
# Try!

# Use the str() function to check the data type of the columns in chocolateData

This has a lot of output, but don't be scared! We'll work through it together. Your output should look something like this:

![](https://image.ibb.co/eKYCK5/Screenshot_from_2017_08_29_16_09_09.png)

The first row shows you the class of the object and its size. Like I mentioned in the last section, a data_frame is a special type of data.frame called a tibble, which is abbreviated here to tbl. You can see that this tibble is 1795 rows long and 9 columns wide.

In R, the dollar sign ($) has a special meaning. It means that whatever comes directly after it is a column in a data_frame. You can use this to look at specific columns in a data_frame, like so:

In [4]:
# Display the first few values of the "Rating" column in the chocolateData data_frame.
head(chocolateData$Rating)



In the rest of the str() output, each dollar sign ($) is followed by a column name. Counting them gives 9 columns, consistent with the information in the first row of the output.

After each column name comes a colon (:) followed by the column's data type. Here every column shows "chr", meaning that all of them are currently stored as character data.

Looking at the first few observations in each column, though, it is clear that several of them actually hold numbers. We would like columns such as REF, Review Date, and Rating to be numeric. We could convert each column by hand with the as.numeric() function, but the tidyverse offers a more convenient option. The type_convert() function, which comes from the readr package in the tidyverse, inspects the values in each character column, makes an educated guess about the appropriate data type, and converts the column accordingly. This saves us from converting the columns one at a time.
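
The sketch below contrasts the two routes on a tiny stand-in data frame. The data frame `toy` and its values are hypothetical; only the column names are borrowed from this tutorial.

```r
library(readr)  # type_convert() lives in readr, which library(tidyverse) also loads

# Hypothetical toy data: every column is stored as character.
toy <- data.frame(
  REF = c("1876", "1676"),
  Rating = c("3.75", "2.75"),
  Cocoa_Percent = c("63%", "70%"),
  stringsAsFactors = FALSE
)

# Manual route: convert one column at a time.
manual <- toy
manual$Rating <- as.numeric(manual$Rating)

# Automatic route: let readr guess a type for every character column.
converted <- type_convert(toy)

# Show the resulting class of each column.
sapply(converted, class)
```

Columns whose values contain non-numeric characters, such as a trailing percent sign, stay as character because they cannot be parsed as plain numbers.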

In [5]:
# automatically convert the data types of our data_frame
chocolateData <- type_convert(chocolateData)



In [6]:
# Try!

# After converting the data, utilize the str() function to inspect its structure. 
# Take note of any observations you make. Are all the columns in the expected data type as you anticipated?

It looks like there is still a problem with the "Cocoa Percent" column. It is currently stored as character data, but we want it to be numeric so that it represents percentages. The culprit is the percent symbol (%) in each value, which prevents the values from being recognized as numbers. To address this, let's remove all the percent symbols from this column and rerun the conversion.
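
A short base R sketch shows why the symbol matters. The vector `pct` holds made-up values in the style of this column.

```r
pct <- c("63%", "70%", "75%")  # hypothetical values in the style of the column

# With the percent sign present, conversion fails and yields NA (with a warning,
# suppressed here to keep the output tidy).
suppressWarnings(as.numeric(pct))

# Strip the literal "%" first, then convert.
pct_clean <- as.numeric(gsub("%", "", pct, fixed = TRUE))
pct_clean
```

Setting `fixed = TRUE` tells `gsub()` to treat the pattern as a literal string rather than a regular expression, which is all that is needed here.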

In [7]:
# Remove the percent signs from the Cocoa_Percent column (the fifth column).
# gsub() is vectorised, so it can be applied to the whole column at once.
chocolateData$Cocoa_Percent <- gsub("%", "", chocolateData$Cocoa_Percent, fixed = TRUE)

# Now run type_convert() once more:
chocolateData <- type_convert(chocolateData)

# Check the structure to confirm that Cocoa_Percent is now numeric:
str(chocolateData)



Great! Everything appears to be in order. Now, we can commence our analysis! The process we've undertaken up to this point, reaching the stage where analysis is possible, is commonly referred to as "data cleaning." Interestingly, data cleaning constitutes a significant portion of data analysis.

>As the joke goes: "80 percent of data science is preparing data, and the remaining 20 percent is complaining about preparing data."

## Summarizing data

Our data is now in R, and we have completed the data cleaning process. It's time to start summarizing the data, and R offers a couple of ways to do this.

R is known for its flexibility, offering multiple ways to accomplish the same task. Once you become familiar with the language, this flexibility becomes a valuable feature. However, I remember it being a bit frustrating when I was still learning.

Let's explore two functions for data summarization: `summary()` from base R and `summarise_all()` from dplyr, which is part of the tidyverse. We'll run both functions and compare their outputs.

If you want to learn more about any function, you can check its documentation. In a kernel, you can access the documentation by running a cell with a question mark followed by the function name, without parentheses. Remember, it's perfectly normal to look things up; professional programmers do it all the time. No one knows everything about every programming language, and seeking information is part of the learning process!

In [8]:
# Execute this cell to gain further insights into the summary() function!
?summary

In [None]:
# Execute this cell to gain a better understanding of the `summarise_all()` function.

?summarise_all

In [10]:
# Summary using base R:
# summary() gives an overview of every column in chocolateData.
summary(chocolateData)

# Summary using the tidyverse (dplyr):
# summarise_all() applies a function to every column. Here we calculate the
# average of each column with mean(). Non-numeric columns return NA with a warning.
summarise_all(chocolateData, mean)




In [11]:
# Try!

# To calculate the standard deviation of each numeric column, employ the summarise_all() function along with the sd() function.

## Summarizing a specific variable


In the previous sections, we explored functions that gave an overview of the entire dataset. Often, however, our focus is on specific variables. With the help of the `summarise()` function and pipes, this becomes a straightforward task. The pipe, written `%>%`, is a special operator that comes with the tidyverse, which we loaded at the beginning. Attempting to use a pipe without loading the tidyverse will result in an error.

The pipe operator `%>%` takes the output from its left side and passes it as input to the operation on its right side.
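
A tiny sketch of the idea, using a made-up vector called `ratings`: the value to the left of the pipe becomes the first argument of the function to its right.

```r
library(dplyr)  # makes %>% available; library(tidyverse) does the same

ratings <- c(5, 6, 25, 16)  # made-up numbers for illustration

# These two lines are equivalent: the pipe feeds `ratings` in as the first argument.
mean(ratings)
ratings %>% mean()

# Pipes can be chained, reading from left to right.
ratings %>% sqrt() %>% round(1)
```

Chaining is where pipes pay off: each step reads in the order it happens, instead of being nested inside out as in `round(sqrt(ratings), 1)`.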

Let's demonstrate this with our chocolate dataset. We'll use the pipe to pass the dataset to the `summarise()` function. The `summarise()` function will return a data_frame with columns containing specific types of information we requested and assigned names. In this example, we'll obtain two columns: "averageRating" with the average of the "Rating" column and "sdRating" with the standard deviation of the "Rating" column.

In [12]:
# Return a data_frame with the mean and standard deviation (SD) of the
# "Rating" column in chocolateData:

chocolateData %>%
    summarise(averageRating = mean(Rating),
              sdRating = sd(Rating))

# The resulting data_frame has the average rating in the "averageRating" column
# and the standard deviation in the "sdRating" column.




> ## Why are there line breaks after "%>%" and "mean(Rating)," in the code block above?

> 
> Up until now, all the functions we've encountered have been written on a single line. However, as we delve into more functions and combine them, some lines of code can become quite lengthy. To enhance code readability, it's beneficial to break up these long lines. Ensuring that your code is easy to read is crucial, especially since you might be the one revisiting it in the future when you've forgotten the details. Making your future self's life a little easier is always a good idea!
>
> You can't break a line up just anywhere, though. Lines of code are like lines of text in a book: you can't just wrap a line anywhere you want.
>
>> Th
>>
>>is is p
>>
>>retty hard t
>>
>> o read.
>
>When working with text, you can wrap lines between words or between syllables of words using a hyphen (-) to indicate continuation to the reader. In R, certain characters act as "hyphens" to inform the computer that the code continues on the next line. These characters include the comma (,), the pipe operator (%>%), and the plus sign (+), which we'll discuss later. When you split your code line directly after one of these characters, R knows to continue parsing the code on the next line.

>To enhance code readability, it is advisable to indent any wrapped lines after the first one. Though not mandatory, this practice makes your code easier to understand. You can achieve indentation by either hitting TAB once or using four spaces. Both methods work well, and there are varying opinions online about which is better, but the choice ultimately depends on personal preference.
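
One way to see this rule in action is to ask R how many separate statements it finds in a piece of text. The sketch below uses base R's `parse()`, which reads code without running it; the strings are made-up examples.

```r
# parse() turns text into R expressions without running them, so we can count
# how many separate statements R sees.

# Operator at the END of the line: R keeps reading, so this is ONE expression.
ends_with_plus <- parse(text = "1 +\n  2")
length(ends_with_plus)

# Operator at the START of the next line: the first line is already complete,
# so R sees TWO expressions (`1` and `+ 2`).
starts_with_plus <- parse(text = "1\n  + 2")
length(starts_with_plus)
```

This is why the continuation character goes at the end of the line being wrapped rather than at the start of the next one.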

In [13]:
# this is fine! :)
mean(c(5,6,25,16))

# this is fine! :)
mean(c(5,6,
       25,16))

# this won't break your code, but it's hard to read :(
mean(c(5,6,
25,16))

# this will break your code :'(
mean(c(5,6,2
      5,16))

ERROR: Error in parse(text = x, srcfile = src): <text>:14:7: unexpected numeric constant
13: mean(c(5,6,2
14:       5
          ^


In [14]:
# try!

# Could you apply the pipe (%>%) operator and the summarise() function to create a new dataframe containing
# the average and standard deviation of the Cocoa_Percent column?
# You have the flexibility to name the new columns as you prefer,
# but it's essential to use clear and descriptive names for better understanding.

## Summarize a specific variable by group

At first glance, performing calculations such as mean and standard deviation using this approach might appear somewhat trivial. However, the true power of this technique becomes evident when we examine the same variable across different groups.

To achieve this, we utilize a useful function called group_by(). By piping a dataset into the group_by() function and specifying a particular column's name, the function groups together all the rows with identical values in that column. Subsequently, when we pass this data into the summarise() function, it provides the requested values separately for each group. The process can be illustrated as follows:
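
Here is a self-contained sketch of that pattern on a tiny stand-in dataset. The data frame `toy`, its values, and the result name `byYear` are hypothetical; the column names mirror the ones used in this tutorial.

```r
library(dplyr)

# Hypothetical toy data: four ratings spread over two review years.
toy <- data.frame(
  Review_Date = c(2006, 2006, 2007, 2007),
  Rating = c(3.0, 3.5, 3.25, 3.75)
)

# Group the rows by year, then summarise each group separately.
byYear <- toy %>%
    group_by(Review_Date) %>%
    summarise(averageRating = mean(Rating),
              sdRating = sd(Rating))

byYear  # one row per year
```

Without the `group_by()` step, the same `summarise()` call would return a single row covering the whole dataset.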

In [15]:
# Return the mean and standard deviation of ratings for each year in which reviews were written

chocolateData %>%
    group_by(Review_Date) %>%
    summarise(averageRating = mean(Rating),
              sdRating = sd(Rating))



In [16]:
# Try!

# Could you provide a data_frame that includes the average and standard deviation of 
# Cocoa_Percent based on the year the reviews were written?

This is a very effective way to start getting to know your data. For instance, chocolate bar ratings seem to show a slight upward trend over the years: in the mid-2000s they hovered around 3.0, while more recently they are closer to 3.3.

To gain a more comprehensive insight, I am eager to visualize this data through graphs. By doing so, I can ascertain whether there has been a consistent change over time. Let's proceed to the final part of this tutorial: graphing!