# Exploration

Use `str()`, `summary()`, `nrow()`, `ncol()`, `dim()`, `colnames()`, `rownames()`, `head()`, and `typeof()` to understand the structure of a data frame.

In [4]:
head(mtcars)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [1]:
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


In [2]:
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000  

In [3]:
nrow(mtcars)
ncol(mtcars)
dim(mtcars)

In [5]:
rownames(mtcars)

colnames(mtcars)
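
Of the functions listed at the top of this section, `typeof()` is the only one not run above. The sketch below shows what it and the dimension helpers report for the built-in `mtcars` data set; the variable names `n_rows` and `n_cols` are illustrative only.

```R
# A data frame is stored internally as a list of equal-length columns
typeof(mtcars)        # "list"
class(mtcars)         # "data.frame"

# nrow() and ncol() are the two components of dim()
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)
dim(mtcars)           # 32 11

# typeof() on a single column reports how that column is stored
typeof(mtcars$mpg)    # "double"
```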

# Introduction

In [11]:
setwd('C:/Users/dell/PycharmProjects/MachineLearning/Pandas/datasets')
getwd()

**Usage**

```R
data.frame(..., row.names = NULL, check.rows = FALSE,
           check.names = TRUE, fix.empty.names = TRUE,
           stringsAsFactors = default.stringsAsFactors())
```

**Arguments**


`...`	
these arguments are of either the form value or tag = value. Component names are created based on the tag (if present) or the deparsed argument itself.

`row.names`	
NULL or a single integer or character string specifying a column to be used as row names, or a character or integer vector giving the row names for the data frame.

`check.rows`	
if TRUE then the rows are checked for consistency of length and names.

`check.names`	
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.

`fix.empty.names`	
logical indicating if arguments which are “unnamed” (in the sense of not being formally called as someName = arg) get an automatically constructed name or rather name "". Needs to be set to FALSE even when check.names is false if "" names should be kept.

`stringsAsFactors`	
logical: should character vectors be converted to factors? In R versions before 4.0.0 the ‘factory-fresh’ default was TRUE, which could be changed by setting options(stringsAsFactors = FALSE). Since R 4.0.0 the default is FALSE.
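
As a rough illustration of two of the arguments described above, the sketch below builds tiny throwaway data frames (all object and column names here are made up for the example) and inspects the result of toggling `stringsAsFactors` and `check.names`.

```R
# stringsAsFactors controls whether character columns become factors
df_chr <- data.frame(name = c("a", "b"), stringsAsFactors = FALSE)
df_fct <- data.frame(name = c("a", "b"), stringsAsFactors = TRUE)
class(df_chr$name)   # "character"
class(df_fct$name)   # "factor"

# check.names = TRUE repairs names that are not syntactically valid
fixed <- data.frame(`my col` = 1:2, check.names = TRUE)
kept  <- data.frame(`my col` = 1:2, check.names = FALSE)
names(fixed)         # "my.col"
names(kept)          # "my col"
```

Passing `stringsAsFactors` explicitly, as the examples in this notebook do, keeps the behaviour the same across R versions.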

# Data Frame creation

In [1]:
emp.data <- data.frame(
    emp_id = c (1:5), 
    emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
    salary = c(623.3,515.2,611.0,729.0,843.25),
    start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
    stringsAsFactors = FALSE
)
emp.data

emp_id,emp_name,salary,start_date
1,Rick,623.3,2012-01-01
2,Dan,515.2,2013-09-23
3,Michelle,611.0,2014-11-15
4,Ryan,729.0,2014-05-11
5,Gary,843.25,2015-03-27


# View data

**`head()`**, **`tail()`**: Return the first or last parts of a vector, matrix, table, data frame or function. Because `head()` and `tail()` are generic functions, they may also have been extended to other classes.

In [18]:
#view first 3 rows
head(emp.data, n = 3L)

emp_id,emp_name,salary,start_date
1,Rick,623.3,2012-01-01
2,Dan,515.2,2013-09-23
3,Michelle,611.0,2014-11-15


In [22]:
tail(emp.data, n = 2L)

Unnamed: 0,emp_id,emp_name,salary,start_date
4,4,Ryan,729.0,2014-05-11
5,5,Gary,843.25,2015-03-27


In [25]:
#view every row except last 3 rows
head(emp.data, n = -3L)

emp_id,emp_name,salary,start_date
1,Rick,623.3,2012-01-01
2,Dan,515.2,2013-09-23


In [26]:
#view every row except first 2 rows
tail(emp.data, n = -2L)

Unnamed: 0,emp_id,emp_name,salary,start_date
3,3,Michelle,611.0,2014-11-15
4,4,Ryan,729.0,2014-05-11
5,5,Gary,843.25,2015-03-27


# Get the structure of Data Frame

Use the **`str()`** function to display the structure of a data frame compactly.

In [2]:
#str stands for: STRUCTURE
str(emp.data)

'data.frame':	5 obs. of  4 variables:
 $ emp_id    : int  1 2 3 4 5
 $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num  623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" ...


# Summary

In [25]:
summary(emp.data)

     emp_id    emp_name             salary        start_date        
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01  
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23  
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11  
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14  
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15  
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27  

# Getting and changing column names

**`names()`**: Functions to get or set the names of an object.

```R
names(x)
names(x) <- value
```

In [5]:
pokemon <- data.frame(
    name = c('Pikachu', 'Turtle'),
    type = c('Electric', 'Water')
)
pokemon

name,type
Pikachu,Electric
Turtle,Water


In [6]:
#getting column names
names(pokemon)

In [7]:
#changing column names
names(pokemon) <- c('Pokemon name', 'Pokemon Type')
pokemon

Pokemon name,Pokemon Type
Pikachu,Electric
Turtle,Water
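
Assigning to `names()` replaces every column name at once. To rename a single column, one option is to index into the names vector first. A small sketch, using a separate copy called `pokemon_demo` and a new name `element` that are both chosen only for illustration:

```R
pokemon_demo <- data.frame(
    name = c('Pikachu', 'Turtle'),
    type = c('Electric', 'Water')
)

# Rename only the column currently called "type"
names(pokemon_demo)[names(pokemon_demo) == 'type'] <- 'element'
names(pokemon_demo)   # "name" "element"
```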


# Row names and column names

Retrieve or set the row or column names of a matrix-like object.

```R
rownames(x, do.NULL = TRUE, prefix = "row")
rownames(x) <- value

colnames(x, do.NULL = TRUE, prefix = "col")
colnames(x) <- value
```

In [2]:
head(mtcars)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


In [3]:
#get row names
rownames(mtcars)

In [4]:
#get column names
colnames(mtcars)

In [5]:
#set column names
colnames(mtcars) <- as.character(seq(11))
head(mtcars)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
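
The usage block above also lists `rownames(x) <- value`, which is not run in this section. A minimal sketch on a made-up data frame (`scores` and its labels are hypothetical):

```R
scores <- data.frame(math = c(90, 75), art = c(60, 88))
rownames(scores)                      # "1" "2" (automatic row names)

# Assign meaningful row names, then use them for lookup
rownames(scores) <- c('ann', 'bob')
bob_art <- scores['bob', 'art']       # 88

# Assigning NULL restores the automatic row names
rownames(scores) <- NULL
rownames(scores)                      # "1" "2"
```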


# Extract data

In [26]:
emp.data$emp_name

In [27]:
emp.data$salary

In [28]:
data.frame(emp.data$emp_name, emp.data$salary)

emp.data.emp_name,emp.data.salary
Rick,623.3
Dan,515.2
Michelle,611.0
Ryan,729.0
Gary,843.25


In [29]:
#Extract the first two rows and then all columns
emp.data[1:2,]

emp_id,emp_name,salary,start_date
1,Rick,623.3,2012-01-01
2,Dan,515.2,2013-09-23


In [30]:
# Extract 3rd and 5th row with 2nd and 4th column

emp.data[c(3,5), c(2,4)]

Unnamed: 0,emp_name,start_date
3,Michelle,2014-11-15
5,Gary,2015-03-27
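
The extraction forms used above differ in what they return. The sketch below compares them on a cut-down copy of the employee data (`emp_demo` is a stand-in created here so the block runs on its own) and adds a row filter with a logical condition.

```R
emp_demo <- data.frame(
    emp_id = 1:3,
    emp_name = c("Rick", "Dan", "Michelle"),
    salary = c(623.3, 515.2, 611.0),
    stringsAsFactors = FALSE
)

by_dollar  <- emp_demo$salary                      # numeric vector
by_double  <- emp_demo[["salary"]]                 # the same vector
by_single  <- emp_demo["salary"]                   # one-column data frame
by_matrix  <- emp_demo[, "salary"]                 # vector, because drop = TRUE by default
kept_frame <- emp_demo[, "salary", drop = FALSE]   # stays a data frame

# Rows can also be chosen with a logical vector
high_paid <- emp_demo[emp_demo$salary > 600, c("emp_name", "salary")]
high_paid$emp_name                                 # "Rick" "Michelle"
```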


# Expanding Data Frame

### adding columns

In [31]:
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
emp.data

emp_id,emp_name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611.0,2014-11-15,IT
4,Ryan,729.0,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance


### adding rows

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the **`rbind()`** function.

In [32]:
emp.newdata <- data.frame(
   emp_id = 6:8,
   emp_name = c("Rasmi","Pranab","Tusar"),
   salary = c(578.0,722.5,632.8),
   start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
   dept = c("IT","Operations","Finance"),
   stringsAsFactors = FALSE
)

emp.newdata

emp_id,emp_name,salary,start_date,dept
6,Rasmi,578.0,2013-05-21,IT
7,Pranab,722.5,2013-07-30,Operations
8,Tusar,632.8,2014-06-17,Finance


In [33]:
emp.data

emp_id,emp_name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611.0,2014-11-15,IT
4,Ryan,729.0,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance


In [34]:
emp.final <- rbind(emp.data, emp.newdata)
emp.final

emp_id,emp_name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611.0,2014-11-15,IT
4,Ryan,729.0,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Rasmi,578.0,2013-05-21,IT
7,Pranab,722.5,2013-07-30,Operations
8,Tusar,632.8,2014-06-17,Finance
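
"The same structure" in practice means the same column names: for data frames, `rbind()` matches columns by name rather than by position and stops with an error when the names differ. A small sketch with made-up frames `a` and `b` (the `full_name` column is deliberately wrong to trigger the error):

```R
a <- data.frame(emp_id = 1:2, emp_name = c("Rick", "Dan"),
                stringsAsFactors = FALSE)

# Same columns in a different order: rbind() lines them up by name
b <- data.frame(emp_name = "Tusar", emp_id = 3L, stringsAsFactors = FALSE)
combined <- rbind(a, b)
combined$emp_id      # 1 2 3
combined$emp_name    # "Rick" "Dan" "Tusar"

# Different column names: rbind() raises an error, captured here as text
mismatch <- tryCatch(
    rbind(a, data.frame(emp_id = 4L, full_name = "Gary")),
    error = function(e) conditionMessage(e)
)
mismatch
```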


# Loading dataset

**`data()`**: Loads specified data sets, or lists the available data sets.

**Usage**

```R
data(..., list = character(), package = NULL, lib.loc = NULL,
     verbose = getOption("verbose"), envir = .GlobalEnv,
     overwrite = TRUE)
```

**Arguments**

`...`	
literal character strings or names.

`list`	
a character vector.

`package`	
a character vector giving the package(s) to look in for data sets, or NULL.

By default, all packages in the search path are used, then the ‘data’ subdirectory (if present) of the current working directory.

`lib.loc`	
a character vector of directory names of R libraries, or NULL. The default value of NULL corresponds to all libraries currently known.

`verbose`	
a logical. If TRUE, additional diagnostics are printed.

`envir`	
the environment where the data should be loaded.

`overwrite`	
logical: should existing objects of the same name in envir be replaced?

In [41]:
#list available datasets
data()
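
Beyond listing, the arguments described above can be combined as in the following sketch. It only uses data sets from the `datasets` package that ships with R; the environment name `sandbox` is an arbitrary choice for the example.

```R
# Load one data set by name into the global environment
data("iris")

# Load into a separate environment instead, leaving the workspace untouched
sandbox <- new.env()
data("mtcars", envir = sandbox)
ls(sandbox)                      # "mtcars"

# The same kind of request using the `list` argument
data(list = c("women", "cars"), envir = sandbox)

# Restrict the listing to a single package and inspect the item names
pkg_sets <- data(package = "datasets")
head(pkg_sets$results[, "Item"])
```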

# Indexing and Slicing

In [26]:
tips <- read.csv('./tips.csv')
head(tips, 3)

X,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [27]:
#get the values of column `sex`; returns a vector (a factor in R < 4.0.0, where read.csv converted strings to factors by default)
tips$sex

In [36]:
#single-bracket indexing with an integer returns a data frame
#get column sex (the 4th column)
tips[4]

sex
Female
Male
Male
Male
Female
Male
Male
Male
Male
Male


In [28]:
#get the values of column sex, smoker and tip
tips[c('sex', 'smoker', 'tip')]

sex,smoker,tip
Female,No,1.01
Male,No,1.66
Male,No,3.50
Male,No,3.31
Female,No,3.61
Male,No,4.71
Male,No,2.00
Male,No,3.12
Male,No,1.96
Male,No,3.23


In [39]:
#equivalent, using column positions
tips[c(4, 5, 3)]

sex,smoker,tip
Female,No,1.01
Male,No,1.66
Male,No,3.50
Male,No,3.31
Female,No,3.61
Male,No,4.71
Male,No,2.00
Male,No,3.12
Male,No,1.96
Male,No,3.23


In [42]:
tips['sex':'day']

"NAs introduced by coercion"

ERROR: Error in "sex":"day": NA/NaN argument
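
The error arises because `:` needs numbers and cannot build a sequence from two strings. Two ways to select a run of adjacent columns by name are sketched below. Since `tips.csv` may not be at hand, the block builds a two-row stand-in called `tips_demo` with the same column names (without the `X` index column).

```R
tips_demo <- data.frame(
    total_bill = c(16.99, 10.34), tip = c(1.01, 1.66),
    sex = c("Female", "Male"), smoker = c("No", "No"),
    day = c("Sun", "Sun"), time = c("Dinner", "Dinner"),
    size = c(2, 3), stringsAsFactors = FALSE
)

# Option 1: inside subset(), bare column names stand for their positions
range_a <- subset(tips_demo, select = sex:day)

# Option 2: look the positions up, then use an ordinary integer range
first <- match("sex", names(tips_demo))
last  <- match("day", names(tips_demo))
range_b <- tips_demo[first:last]

names(range_a)   # "sex" "smoker" "day"
```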


# Selecting and Filtering

**`subset()`**: Subsetting Vectors, Matrices and Data Frames

**Usage**
```R
subset(x, ...)

## Default S3 method:
subset(x, subset, ...)

## S3 method for class 'matrix'
subset(x, subset, select, drop = FALSE, ...)

## S3 method for class 'data.frame'
subset(x, subset, select, drop = FALSE, ...)
```

**Arguments**

`x`	
object to be subsetted.

`subset`	
logical expression indicating elements or rows to keep: missing values are taken as false.

`select`	
expression, indicating columns to select from a data frame.

`drop`	
passed on to the `[` indexing operator.

`...`	
further arguments to be passed to or from other methods.

<hr>

In [5]:
#For ordinary vectors, the result is simply x[subset & !is.na(subset)].
values <- c(1:12, NA, 20:23)
print(values)
#select the non-NA values that are divisible by 3
subset(values, values %% 3 == 0)

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 NA 20 21 22 23
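
The rule quoted in the comment above, `x[subset & !is.na(subset)]`, is what separates `subset()` from plain bracket indexing. A short sketch on the same vector (the names `keep`, `with_subset`, `with_bracket` and `manual` are illustrative):

```R
values <- c(1:12, NA, 20:23)
keep <- values %% 3 == 0                      # NA where the value is missing

with_subset  <- subset(values, keep)          # 3 6 9 12 21
with_bracket <- values[keep]                  # 3 6 9 12 NA 21
manual       <- values[keep & !is.na(keep)]   # same result as subset()
```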


For data frames, the `subset` argument works on the rows. Note that the `subset` expression is evaluated in the data frame, so columns can be referred to by name as variables in the expression.

In [12]:
titanic <- read.csv('./titanic.csv')
head(titanic)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [13]:
#select Females from Pclass 3
subset(titanic, Sex == 'female' & Pclass == 3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
9,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
11,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
15,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
19,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele)",female,31.0,1,0,345763,18.0000,,S
20,20,1,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C
23,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
25,25,0,3,"Palsson, Miss. Torborg Danira",female,8.0,3,1,349909,21.0750,,S
26,26,1,3,"Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson)",female,38.0,1,5,347077,31.3875,,S
29,29,1,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q


In [23]:
#get the name and gender of survivors
subset(titanic, Survived == 1, select = c(Name, Sex))

Unnamed: 0,Name,Sex
2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female
3,"Heikkinen, Miss. Laina",female
4,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
9,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female
10,"Nasser, Mrs. Nicholas (Adele Achem)",female
11,"Sandstrom, Miss. Marguerite Rut",female
12,"Bonnell, Miss. Elizabeth",female
16,"Hewlett, Mrs. (Mary D Kingcome)",female
18,"Williams, Mr. Charles Eugene",male
20,"Masselmani, Mrs. Fatima",female


In [24]:
pokemon <- read.csv('./pokemon.csv')
head(pokemon)

X.,Name,Type.1,Type.2,Total,HP,Attack,Defense,Sp..Atk,Sp..Def,Speed,Generation,Legendary
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False


In [22]:
#get the stats of legendary pokemon
subset(pokemon, as.logical(Legendary), select = HP:Speed) #use bare column names in select; quoted strings such as "HP":"Speed" do not work

Unnamed: 0,HP,Attack,Defense,Sp..Atk,Sp..Def,Speed
157,90,85,100,95,125,85
158,90,90,85,125,90,100
159,90,100,90,125,85,90
163,106,110,90,154,90,130
164,106,190,100,154,100,130
165,106,150,70,194,120,140
263,90,85,75,115,100,115
264,115,115,85,90,75,100
265,100,75,115,90,115,85
270,106,90,130,90,154,110


# Transformation

**`transform()`**

In [47]:
#transform() returns a new data frame; it does not modify `pokemon` in place
res <- transform(pokemon, Legendary = as.logical(Legendary), LogAttack = log(Attack))
head(res, 10)

X.,Name,Type.1,Type.2,Total,HP,Attack,Defense,Sp..Atk,Sp..Def,Speed,Generation,Legendary,LogAttack
1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,FALSE,3.891820
2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,FALSE,4.127134
3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,FALSE,4.406719
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,FALSE,4.605170
4,Charmander,Fire,,309,39,52,43,60,50,65,1,FALSE,3.951244
5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,FALSE,4.158883
6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,FALSE,4.430817
6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,FALSE,4.867534
6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,FALSE,4.644391
7,Squirtle,Water,,314,44,48,65,50,64,43,1,FALSE,3.871201
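
Because `transform()` returns a new data frame, the input keeps its original columns unless the result is assigned back. A minimal sketch on a made-up frame `stats` (the `Double` column is only there for illustration):

```R
stats <- data.frame(Attack = c(49, 62, 82))

# Several columns can be added in one call; `stats` itself is untouched
res_demo <- transform(stats, LogAttack = log(Attack), Double = Attack * 2)
names(stats)      # "Attack"
names(res_demo)   # "Attack" "LogAttack" "Double"

# To keep the change under the same name, assign the result back explicitly
stats <- transform(stats, LogAttack = log(Attack))
names(stats)      # "Attack" "LogAttack"
```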


# Ranking