# Data Cleaning with R 

### We begin by determining the best way to import the file(s). Opening the actual txt file shows that it is delimited (separated by commas and tabs).

Import relevant libraries

In [24]:
library(tidyverse)
library(data.table)
library(stringr)
library(readr)
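# plyr is attached after dplyr here, so plyr::rename(), mutate() and summarise() mask the dplyr versions
library(plyr)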
library(tidyr)
library(dplyr)

In [25]:
# reading in the data 
df = read.delim('R/Data/Homog_monthly_max_temp/mx230N002.txt', skip = 0, header = FALSE, as.is=TRUE, dec=".", sep = ",", na.strings=c(" ", "",'NA'), strip.white = TRUE)

glimpse(df)

Observations: 63
Variables: 35
$ V1  <chr> "230N002", "230N002", "Year", "Annee", "1959", "1960", "1961", ...
$ V2  <chr> "LUPIN", "LUPIN", "Jan", "Janv", "-28.5", "-27.1", "-29.6", "-2...
$ V3  <chr> "NU", "NU", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ V4  <chr> "station joined", "station jointe", "Feb", "Fev", "-24.1", "-26...
$ V5  <chr> "Monthly mean of homogenized daily maximum temperature", "Moyen...
$ V6  <chr> "Deg Celcius", "Deg Celsius", "Mar", "Mars", "-22.3", "-25.5", ...
$ V7  <chr> "Updated to December 2017", "Mise a jour jusqu a decembre 2017"...
$ V8  <chr> NA, NA, "Apr", "Avr", "-13.3", "-11.5", "-16.5", "-13.8", "-8.5...
$ V9  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ V10 <chr> NA, NA, "May", "Mai", "-6.1", "-1.2", "-3.6", "-5.0", "-2.3", "...
$ V11 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
$ V12 <chr> NA, NA, "Jun", "Juin", "2.7", "13.1", "12.1", "10.9", "9.1", "6...
$ V13 <chr> NA, NA, N

In [26]:
head(df, n=10)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35
230N002,LUPIN,NU,station joined,Monthly mean of homogenized daily maximum temperature,Deg Celcius,Updated to December 2017,,,,...,,,,,,,,,,
230N002,LUPIN,NU,station jointe,Moyenne mensuelle des temperatures quotidiennes maximales homogeneisees,Deg Celsius,Mise a jour jusqu a decembre 2017,,,,...,,,,,,,,,,
Year,Jan,,Feb,,Mar,,Apr,,May,...,Annual,,Winter,,Spring,,Summer,,Autumn,
Annee,Janv,,Fev,,Mars,,Avr,,Mai,...,Annuel,,Hiver,,Printemp,,Ete,,Automne,
1959,-28.5,,-24.1,,-22.3,,-13.3,,-6.1,...,-9.4,,-9999.9,M,-13.9,,7.1,,-7.3,
1960,-27.1,,-26.2,,-25.5,,-11.5,,-1.2,...,-8.3,,-23.5,,-12.7,,12.9,,-8.2,
1961,-29.6,,-26.4,,-27.3,,-16.5,,-3.6,...,-9.5,,-26.1,,-15.8,,13.8,,-7.7,
1962,-26.5,,-32.8,,-23.8,,-13.8,,-5.0,...,-8.5,,-29.5,,-14.2,,12.9,,-5.5,
1963,-28.8,,-26.0,,-26.5,,-8.5,,-2.3,...,-7.7,,-25.6,,-12.4,,11.8,,-5.2,
1964,-31.1,,-26.9,,-28.3,,-15.8,,-0.9,...,-8.8,,-26.1,,-15.0,,12.4,,-4.6,


### The tidyverse package proves useful as we proceed

* glimpse provides a summary of the data
+ There are 63 observations (rows) and 35 variables (columns)
+ All data is stored as characters
* Looking at the first 10 rows of the data frame, we see:
+ There is some information we want to remove, specifically the first 2 rows of the data frame
+ Almost every other column is filled with NAs
+ The 3rd row will later be used as our column names, and V1 as our row names
+ The data consists mainly of positive and negative numeric values
+ Unusual data includes the default value -9999.9 and the letters 'E' and 'M', which mark whether a value was estimated or missing

## Constructing the desired data frame

Earlier we noticed that most of the NAs appear in a regular pattern: every second column, starting at V3, is almost entirely NA. Since the pattern is clear, I removed those columns from the data frame. Note that the letters 'M' and 'E', which sit in those columns, are removed along with them.
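The pattern can be checked before anything is dropped. The sketch below is only an illustration: `toy` is a small made-up data frame laid out like the raw file (a year column followed by value/flag pairs), not the station data itself.

```r
# Toy stand-in for the raw file: V1 = year, then value/flag column pairs.
toy <- data.frame(
  V1 = c("1959", "1960", "1961"),
  V2 = c("-28.5", "-27.1", "-29.6"),
  V3 = c(NA, NA, NA),
  V4 = c("-9999.9", "-26.2", "-26.4"),
  V5 = c("M", NA, NA),
  stringsAsFactors = FALSE
)

# Share of missing entries in each column.
na_share <- colMeans(is.na(toy))
na_share

# Columns at odd positions from 3 onward are the mostly empty flag columns.
flag_cols <- seq(from = 3, to = ncol(toy), by = 2)
values_only <- toy[, -flag_cols, drop = FALSE]
names(values_only)
```

A high NA share in every odd-numbered column supports removing them by position, which is what `seq(from = 3, to = 35, by = 2)` does for the real file.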

Extract the header and recombine it with the data.

It is important to check for whitespace when extracting names.
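One way to run such a check is sketched here on a hypothetical header vector (`hdr_toy`, invented for illustration). In the real import, `strip.white = TRUE` should already trim the fields, so this is a safeguard rather than a required step.

```r
# Hypothetical header row with stray whitespace in two cells.
hdr_toy <- c(V1 = "Year", V2 = " Jan", V4 = "Feb ")

# Flag any name that starts or ends with whitespace.
has_ws <- grepl("^\\s|\\s$", hdr_toy)
has_ws

# trimws() removes the padding, leaving clean column names.
clean_names <- trimws(hdr_toy)
unname(clean_names)
```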

In [27]:
df = read.delim('R/Data/Homog_monthly_max_temp/mx230N002.txt', skip = 0, header = FALSE, as.is=TRUE, dec=".", sep = ",", na.strings=c(" ", "",'NA'), strip.white = TRUE)

seq(from = 3, to = 35, by = 2)

df <- select(df, -seq(from = 3, to = 35, by = 2))

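# rows 1-2 are station metadata and rows 3-4 are the English and French headers, so the data starts at row 5
data <- slice(df, 5:n())
# row 3 holds the English column names
(hdr <- slice(df, 3))
# check for whitespace in the column names

# unlist((hdr))
# (c  = select(hdr, contains(" ")))
# (st  = select(hdr, starts_with(" ")))
# (ew = select(hdr, ends_with(" ")))

# confirm that no header cell is missing
is.na(hdr)

# apply the header row as the column names
head((df <- setNames(data, unlist(hdr))), n=5)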

V1,V2,V4,V6,V8,V10,V12,V14,V16,V18,V20,V22,V24,V26,V28,V30,V32,V34
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Annual,Winter,Spring,Summer,Autumn


V1,V2,V4,V6,V8,V10,V12,V14,V16,V18,V20,V22,V24,V26,V28,V30,V32,V34
False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Annual,Winter,Spring,Summer,Autumn
1959,-28.5,-24.1,-22.3,-13.3,-6.1,2.7,11.1,7.5,4.4,-8.5,-17.8,-17.3,-9.4,-9999.9,-13.9,7.1,-7.3
1960,-27.1,-26.2,-25.5,-11.5,-1.2,13.1,13.5,12.0,3.3,-6.6,-21.3,-22.2,-8.3,-23.5,-12.7,12.9,-8.2
1961,-29.6,-26.4,-27.3,-16.5,-3.6,12.1,17.6,11.6,1.3,-8.4,-15.9,-29.2,-9.5,-26.1,-15.8,13.8,-7.7
1962,-26.5,-32.8,-23.8,-13.8,-5.0,10.9,14.4,13.5,5.7,-3.4,-18.7,-22.1,-8.5,-29.5,-14.2,12.9,-5.5
1963,-28.8,-26.0,-26.5,-8.5,-2.3,9.1,14.1,12.3,3.7,-2.1,-17.3,-20.3,-7.7,-25.6,-12.4,11.8,-5.2


### Standard default values
#### In this case we have **-9999.9** 

In some situations it is suggested to replace NAs or default values with the mean of the column. I chose not to do this, since regression will smoothly 'fill in' these blanks.
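As a rough illustration of that reasoning, the sketch below fits a simple linear model to the first few Winter values shown earlier, with the 1959 default value already set to NA. Using `lm()` here is an assumption made for the example, not necessarily the regression used later.

```r
# Winter values from the first rows shown above; 1959 was -9999.9 and is now NA.
toy <- data.frame(
  Year   = 1959:1964,
  Winter = c(NA, -23.5, -26.1, -29.5, -25.6, -26.1)
)

# lm() drops incomplete rows by default (na.action = na.omit),
# so the 1959 row does not distort the fit.
fit <- lm(Winter ~ Year, data = toy)
nobs(fit)

# The fitted line still yields an estimate for the year with no observation.
pred <- predict(fit, newdata = data.frame(Year = 1959))
pred
```

Replacing the blank with the column mean would instead add a sixth point to the fit and pull the line toward that mean.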

In [29]:
# head(df, n=10)
# replace the sentinel -9999.9 with a real NA (an exact match, not a regex, and not the string "NA")
df[] <- lapply(df, function(x){
      replace(x, x %in% "-9999.9", NA)
    })
df


Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,Annual,Winter,Spring,Summer,Autumn
1959,-28.5,-24.1,-22.3,-13.3,-6.1,2.7,11.1,7.5,4.4,-8.5,-17.8,-17.3,-9.4,,-13.9,7.1,-7.3
1960,-27.1,-26.2,-25.5,-11.5,-1.2,13.1,13.5,12.0,3.3,-6.6,-21.3,-22.2,-8.3,-23.5,-12.7,12.9,-8.2
1961,-29.6,-26.4,-27.3,-16.5,-3.6,12.1,17.6,11.6,1.3,-8.4,-15.9,-29.2,-9.5,-26.1,-15.8,13.8,-7.7
1962,-26.5,-32.8,-23.8,-13.8,-5.0,10.9,14.4,13.5,5.7,-3.4,-18.7,-22.1,-8.5,-29.5,-14.2,12.9,-5.5
1963,-28.8,-26.0,-26.5,-8.5,-2.3,9.1,14.1,12.3,3.7,-2.1,-17.3,-20.3,-7.7,-25.6,-12.4,11.8,-5.2
1964,-31.1,-26.9,-28.3,-15.8,-0.9,6.5,16.4,14.3,4.8,-4.0,-14.5,-25.7,-8.8,-26.1,-15.0,12.4,-4.6
1965,-27.6,-35.7,-19.1,-10.2,-0.7,6.5,15.5,13.2,1.5,-5.0,-16.0,-24.0,-8.5,-29.7,-10.0,11.7,-6.5
1966,-35.9,-27.1,-21.0,-14.7,-0.2,13.5,16.4,14.8,7.1,-8.9,-22.2,-21.6,-8.3,-29.0,-12.0,14.9,-8.0
1967,-28.0,-31.0,-23.9,-12.8,-2.8,5.8,13.7,12.6,5.9,-2.9,-13.7,-20.8,-8.2,-26.9,-13.2,10.7,-3.6
1968,-30.1,-23.4,-17.0,-14.0,-2.1,8.9,11.7,11.2,5.3,-2.7,-13.8,-24.2,-7.5,-24.8,-11.0,10.6,-3.7


## Saving files under new names and in new directories
Extracting the station number, city and province for file name use. 

In [31]:
df = read.delim('R/Data/Homog_monthly_max_temp/mx230N002.txt', skip = 0, header = FALSE, as.is=TRUE, dec=".", sep = ",", na.strings=c(" ", "",'NA'), strip.white = TRUE)

(station_number <- select(df, V1)[1,1])
(city <- select(df, V2)[1,1])
(province <- select(df, V3)[1,1])


In [32]:
stationNum_city_prov <- paste(station_number, trimws(city), province, sep='_')
stationNum_city_prov
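A label like this can then serve as the file name for the cleaned output. The following is a minimal sketch under stated assumptions: `station_label`, `cleaned` and `out_dir` are illustrative names, the two-row data frame is a stand-in, and the file is written under `tempdir()` so no real data is touched.

```r
# Assumed inputs: a station label like the one built above and a cleaned data frame.
station_label <- "230N002_LUPIN_NU"
cleaned <- data.frame(Year = c(1959, 1960), Jan = c(-28.5, -27.1))

# Hypothetical output location inside the session's temporary directory.
out_dir <- file.path(tempdir(), "Homog_monthly_max_temp_cleaned")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Write the cleaned table as comma-separated text named after the station.
out_file <- file.path(out_dir, paste0(station_label, ".txt"))
write.table(cleaned, out_file, sep = ",", row.names = FALSE, quote = FALSE)

file.exists(out_file)
```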

## Result 

338 text files were cleaned per directory
* Directories: Homog_monthly_min_temp, Homog_monthly_max_temp, Homog_monthly_mean_temp

New text files and directories are created for the cleaned data
* New Directories: Homog_monthly_min_temp_cleaned, Homog_monthly_max_temp_cleaned, Homog_monthly_mean_temp_cleaned
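Cleaning every file in a directory amounts to applying one per-file function in a loop. The sketch below shows the shape of such a loop on two tiny made-up files. `clean_one` is a hypothetical, simplified cleaner (it only converts -9999.9 to NA) and stands in for the full steps described earlier.

```r
# Set up two tiny stand-in input files in fresh temporary directories.
in_dir  <- tempfile("Homog_monthly_max_temp_")
out_dir <- paste0(in_dir, "_cleaned")
dir.create(in_dir)
dir.create(out_dir)
writeLines(c("Year,Jan", "1959,-9999.9", "1960,-27.1"), file.path(in_dir, "mxA.txt"))
writeLines(c("Year,Jan", "1959,-28.5", "1960,-9999.9"), file.path(in_dir, "mxB.txt"))

# Hypothetical per-file cleaner: read, turn the sentinel into NA, write out.
clean_one <- function(path, out_dir) {
  df <- read.csv(path, na.strings = c("-9999.9", "", "NA"))
  out_path <- file.path(out_dir, basename(path))
  write.csv(df, out_path, row.names = FALSE)
  out_path
}

# Apply the cleaner to every .txt file in the input directory.
inputs  <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
written <- vapply(inputs, clean_one, character(1), out_dir = out_dir)
length(written)
```

Repeating the same call for the min, max and mean directories would produce the three `_cleaned` directories listed above.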

 

## Intended use 
All data is cleaned before the Shiny app is used. The cleaned data is reloaded and combined into a single data frame when the app loads.
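The reload step could look roughly like the sketch below. It is not the app's actual code: the directory, the two stand-in files (the second station label is invented) and the `station` column are assumptions made for illustration.

```r
# Stand-in cleaned files; in the app these would live in the *_cleaned directories.
clean_dir <- tempfile("cleaned_")
dir.create(clean_dir)
write.csv(data.frame(Year = 1959:1960, Jan = c(-28.5, -27.1)),
          file.path(clean_dir, "230N002_LUPIN_NU.txt"), row.names = FALSE)
write.csv(data.frame(Year = 1959:1960, Jan = c(-20.1, -19.4)),
          file.path(clean_dir, "999X999_DEMO_XX.txt"), row.names = FALSE)

# Read each file and tag its rows with the station label taken from the file name.
files <- list.files(clean_dir, pattern = "\\.txt$", full.names = TRUE)
pieces <- lapply(files, function(f) {
  d <- read.csv(f)
  d$station <- sub("\\.txt$", "", basename(f))
  d
})

# Stack everything into one data frame.
all_stations <- do.call(rbind, pieces)
dim(all_stations)
```

Keeping the station label as a column lets one data frame hold every station while still allowing per-station filtering.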


### Input data frame. 
Depending on the month and year the user selects, the data is trimmed before any statistical method is applied.


![input df](input_df.PNG)
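The trimming step can be sketched in base R as follows. The Jan and Feb values come from the rows shown earlier. `selected_month` and `selected_years` are hypothetical stand-ins for whatever the Shiny inputs provide.

```r
# Small slice of the cleaned data (Jan and Feb values from the rows shown earlier).
combined <- data.frame(
  Year = 1959:1963,
  Jan  = c(-28.5, -27.1, -29.6, -26.5, -28.8),
  Feb  = c(-24.1, -26.2, -26.4, -32.8, -26.0)
)

# Hypothetical user selections (in a Shiny app these would come from `input`).
selected_month <- "Feb"
selected_years <- c(1960, 1962)

# Keep only the chosen month column and the chosen range of years.
in_range <- combined$Year >= selected_years[1] & combined$Year <= selected_years[2]
trimmed  <- combined[in_range, c("Year", selected_month)]
trimmed
```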

### Output data frame. 
After the desired statistical analysis, we shift our focus to different variables and a new data frame.


![output df](output_df.PNG)

## In conclusion, the data does not present many challenges. The tidyverse package is easy to use and can be applied consistently, helping many people understand the data they are dealing with.