# Data cleaning with R

Load the **here** package to build file paths relative to the project root

In [1]:
library(here)

here() starts at C:/Users/marwane/R Tuto
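As a minimal sketch of what the package is for: `here()` joins its arguments onto the project root reported above, so a script does not need a hard-coded absolute path. The folder `data` and the file `penguins_raw.csv` are hypothetical names used only for illustration; they do not appear in this notebook.

```r
library(here)

# Build a path relative to the project root.
# "data" and "penguins_raw.csv" are hypothetical names.
csv_path <- here("data", "penguins_raw.csv")
csv_path

# The path is only constructed, not created, so check before reading it
file.exists(csv_path)
```

Because the path is resolved from the project root, the same line works regardless of the current working directory.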



Load the **skimr** package for summary statistics

In [2]:
library(skimr)

Load the **janitor** package for data-cleaning functions

In [3]:
library(janitor)


Attaching package: 'janitor'


The following objects are masked from 'package:stats':

    chisq.test, fisher.test




Load the **dplyr** package for data manipulation (`select()`, `rename()`, `%>%`)

In [4]:
library(dplyr)


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




Load the **palmerpenguins** package, which provides the `penguins` dataset

In [5]:
library(palmerpenguins)

**skim_without_charts()**: gives a comprehensive summary of the dataset (column types, missing values, counts), without the inline histograms that `skim()` adds

In [7]:
skim_without_charts(penguins)

-- Data Summary ------------------------
                           Values  
Name                       penguins
Number of rows             344     
Number of columns          8       
_______________________            
Column type frequency:             
  factor                   3       
  numeric                  5       
________________________           
Group variables            None    

-- Variable type: factor -------------------------------------------------------
# A tibble: 3 x 6
  skim_variable n_missing complete_rate ordered n_unique top_counts                 
* <chr>             <int>         <dbl> <lgl>      <int> <chr>                      
1 species               0         1     FALSE          3 Ade: 152, Gen: 124, Chi: 68
2 island                0         1     FALSE          3 Bis: 168, Dre: 124, Tor: 52
3 sex                  11         0.968 FALSE          2 mal: 168, fem: 165         

-- Variable type: numeric -------------------------------------
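The `n_missing` and `complete_rate` columns of the summary can be cross-checked with base R. This is only a sketch of that check, not how skimr computes them internally; the variable names `missing_per_column` and `complete_rate` are chosen here for illustration.

```r
library(palmerpenguins)

# Count the NA values in every column of the dataset
missing_per_column <- colSums(is.na(penguins))
missing_per_column

# Share of non-missing values per column (skimr's complete_rate)
complete_rate <- 1 - missing_per_column / nrow(penguins)
round(complete_rate, 3)
```

For `sex`, this should agree with the 11 missing values and the 0.968 complete rate reported in the factor table.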

**glimpse()**: lists every column with its type and its first few values

In [8]:
glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~


**select()**: keeps or excludes columns by name

In [13]:
# keep only the species column
penguins %>%
    select(species)

species
<fct>
Adelie
Adelie
Adelie
Adelie
Adelie
Adelie
Adelie
Adelie
Adelie
Adelie


In [14]:
# keep every column except species
penguins %>%
    select(-species)

island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Torgersen,39.1,18.7,181,3750,male,2007
Torgersen,39.5,17.4,186,3800,female,2007
Torgersen,40.3,18.0,195,3250,female,2007
Torgersen,,,,,,2007
Torgersen,36.7,19.3,193,3450,female,2007
Torgersen,39.3,20.6,190,3650,male,2007
Torgersen,38.9,17.8,181,3625,female,2007
Torgersen,39.2,19.6,195,4675,male,2007
Torgersen,34.1,18.1,193,3475,,2007
Torgersen,42.0,20.2,190,4250,,2007
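`select()` also accepts several columns at once and dplyr's selection helpers such as `starts_with()`. The following sketch shows a few common variations on the `penguins` data; the result names (`location_and_mass`, `bill_columns`, `without_island_year`) are made up for this example.

```r
library(dplyr)
library(palmerpenguins)

# Keep several columns, in the order given
location_and_mass <- penguins %>%
    select(species, island, body_mass_g)

# Keep every column whose name starts with "bill"
bill_columns <- penguins %>%
    select(starts_with("bill"))
names(bill_columns)

# Drop more than one column at a time
without_island_year <- penguins %>%
    select(-c(island, year))
names(without_island_year)
```

None of these calls modify `penguins` itself; each returns a new tibble that has to be assigned to be kept.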


**rename()**: changes the name of a column, written as `new_name = old_name`

In [15]:
penguins %>%
    rename(island_new = island)

species,island_new,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007
Adelie,Torgersen,39.2,19.6,195,4675,male,2007
Adelie,Torgersen,34.1,18.1,193,3475,,2007
Adelie,Torgersen,42.0,20.2,190,4250,,2007
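Two details are easy to miss with `rename()`: it can rename several columns in one call, and it returns a new tibble rather than changing the original. A short sketch, where `mass_g` and `penguins_renamed` are names invented for this example:

```r
library(dplyr)
library(palmerpenguins)

# Rename two columns in one call and keep the result
penguins_renamed <- penguins %>%
    rename(island_new = island, mass_g = body_mass_g)
names(penguins_renamed)

# The original dataset still has its old column names
"island" %in% names(penguins)
```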


**rename_with()**: applies a function to the column names to make them consistent

In [17]:
# pass toupper (or tolower) as the renaming function
head(rename_with(penguins,toupper))

SPECIES,ISLAND,BILL_LENGTH_MM,BILL_DEPTH_MM,FLIPPER_LENGTH_MM,BODY_MASS_G,SEX,YEAR
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
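`rename_with()` is not limited to `toupper` and `tolower`. As a sketch, its `.cols` argument restricts the renaming to some columns, and any function that maps a character vector of names to new names can be used. The helper logic below (stripping a `_mm` or `_g` suffix) is an illustrative choice, not something taken from this notebook.

```r
library(dplyr)
library(palmerpenguins)

# Upper-case only the columns whose names start with "bill"
bill_upper <- penguins %>%
    rename_with(toupper, .cols = starts_with("bill"))
names(bill_upper)

# Use a custom function: remove the unit suffix from every name that has one
no_units <- penguins %>%
    rename_with(function(nm) sub("_(mm|g)$", "", nm))
names(no_units)
```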


**clean_names()**: makes sure the column names contain only letters, numbers, and underscores (lowercase snake_case by default)

In [18]:
clean_names(penguins)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,36.7,19.3,193,3450,female,2007
Adelie,Torgersen,39.3,20.6,190,3650,male,2007
Adelie,Torgersen,38.9,17.8,181,3625,female,2007
Adelie,Torgersen,39.2,19.6,195,4675,male,2007
Adelie,Torgersen,34.1,18.1,193,3475,,2007
Adelie,Torgersen,42.0,20.2,190,4250,,2007
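The `penguins` column names are already clean, so the output above is unchanged. To see the effect, here is a sketch on a small made-up data frame with spaces, parentheses, dots, and capital letters in its names; the data frame `messy` and its column names are hypothetical.

```r
library(janitor)

# A hypothetical data frame with inconsistent column names.
# check.names = FALSE stops data.frame() from altering them first.
messy <- data.frame(
    `Bill Length (mm)` = c(39.1, 39.5),
    `Body.Mass.G` = c(3750L, 3800L),
    check.names = FALSE
)
names(messy)

# clean_names() rewrites the names and leaves the values untouched
cleaned <- clean_names(messy)
names(cleaned)
```

After cleaning, the names should consist only of lowercase letters, digits, and underscores, which makes them safe to use unquoted in `select()` and `rename()`.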
