-
Notifications
You must be signed in to change notification settings - Fork 0
/
Titanic_Notebook.Rmd
350 lines (284 loc) · 12.7 KB
/
Titanic_Notebook.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
title: "Titanic Dataset Analysis"
output:
html_document:
df_print: paged
pdf_document: default
---
```{r}
library(tufte)
```
# Data Analysis using Titanic Dataset
To summarize in bullet points, there are following merits.
- There is a lot of good literature for data visualization and analysis.
- With ggplot, we can create high quality plots without tuning.
- ggplot has a consistent API and is easy to use. The ggplot object and the geom object allow for a very good syntax and abstraction of the plots.
- With dplyr, it is easy to write complex data wranglings. The Pipe operator can be used to concisely describe the data wrangling process to be followed.
Using R, we can easily perform beautiful plots and analysis using a very well-developed API. We can also refer to the literature of good statistical users who are using R as their main tool.
```{r}
library(caret) # Data Splitting
library(lightgbm) # lightbgm model
library(knitr) # report generation
library(ggrepel) # label
library(OneR) # binning
library(tidyverse) # metapackage of all tidyverse packages
set.seed(123)
```
```{r}
search()
```
## Load data and see overview
```{r}
train_path <- "/Users/amanda/College Mini/titanic/train.csv"
test_path <- "/Users/amanda/College Mini/titanic/test.csv"
sub_path <- "/Users/amanda/College Mini/titanic/gender_submission.csv"
train <- read_csv(file = train_path)
test <- read_csv(file = test_path)
sub <- read_csv(file = sub_path)
```
read_csv returns tibble, not data frame. Tibbles are a modern take on data frames.
In particular, the section on Tibbles vs data frames contains examples of the differences between tibbles and data frames.
We can use tibble like pandas DataFrame. But there are some difference. For examle , when we take subset by [], tibble returns tibble. If we use pandas dataframe, it sometimes returns pandas Series, so we may use different API between dataframe and series. The R data frame may also return a vector as well.
```{r}
# Tibble returns tibble.
train[,c("Pclass")]
```
## Display tibble
We can chek tibble with some method like following,
- glimpse: Display in a horizontal tabular format. All columns are visible.
- View: Display like pandas dataframe.
- Head: Display the first few lines of the View.
- kable: Display simple table that is not HTML.
```{r}
glimpse(train)
```
```{r}
view(train)
```
```{r}
head(train)
```
```{r}
kable(train)
```
## Check statitstics
We can see statistics of dataframe by summary function. This is like describe method of pandas dataframe.
```{r}
summary(train)
```
```{r}
colSums(is.na(train))
```
## Visualize the plot using ggplot
If we want to create plots in R, it is a good idea to start with [ggplot2]. If I’m forced to say, ggplot2 is similar to seaborn in python.
- In step1, we select data to use and define aesthetic relationship of columns of the data. We can do this by ggplot() and returns ggplot object.
- In step2, we choose geom_ function as plot type, and combine it to ggplot object.
- In step3, by combining more options into the ggplot object, we can customize it. For example, the axis format, title, and many other things
```{r}
options(repr.plot.width=12, repr.plot.height=8)
```
We can plot histogram by geom_histogram().
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, fill = factor(Survived), color = factor(Survived)))
p + geom_histogram(alpha = 0.6) + ggtitle("Histogram by Age") +
theme(text=element_text(size = 16))
```
The ratio of deaths seems to vary with age. Are there any other similar differences? For example, we are curious about gender.
By facet_wrap(), we can depicted graphs separately for each conditions very easily.
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, fill = factor(Survived), color = factor(Survived)))
p + geom_histogram(alpha = 0.6) + ggtitle("Histogram by Age") +
theme(text=element_text(size = 16)) + facet_wrap( ~ Sex, ncol = 2)
```
We can see that this difference is more pronounced for male.
We stratificated the data, now let's try to make the effect of age easier to understand by using geom_density() to draw the density.
We can plot the kernel density estimate by geom_density().
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, fill = factor(Survived), color = factor(Survived)))
p + geom_density(alpha = 0.3) + facet_wrap( ~ Sex, ncol = 2) +
theme(text=element_text(size = 16)) + ggtitle("Density estimate by Age")
```
If we look at the blue distribution of male, we can see that children have a higher survival rate because another peek appear at the age of below 20.
Now, since whether the person is child or not should have a lot to do with survival rate, let's draw a graph divided by that condition.
With functions of dplyr, we can manipurate this.
The definition of is_child is,
Yes: 1
No: 0
```{r}
train <- train %>% mutate(is_child = ifelse(Age < 18, 1, 0))
test <- test %>% mutate(is_child = ifelse(Age < 18, 1, 0))
```
dplyr is a set of verbs that help you solve the most common data manipulation challenges. Data rangling can be done by connecting tibble and verv with "Pipe" (%>%).
About feature engineering like this, I'll try more in later.
We can also do groupby and count by dplyr. It might be hard to get used to seeing at first, but we can makes it easier to see the whole operation since we can connect the operations behind.
```{r}
Survived_count <- train %>%
group_by(Survived, is_child) %>%
summarize(count = n())
```
Draw a bar graph to see the impact of child or not more directly. The bar chart is geom_col().
If the condition is not category but continuous value such as real number (of course, if the number is distributed within a reasonable range of common sense), we can use factor().
We can also use coord_flip() if we want to make the plot horizontal. Anyway, if we combined coord_flip(), we can get the plot 90 degrees. In the case of Seaborn, the interface is little confusing. For barplot, the column names for the x and y arguments must be swapped, but for barplot, we have to pass 'h' to orient argment.
By plotting horizontally, we don't have to tilt the label names by 45 degrees, even if they are longer.
```{r}
p <- ggplot(data = Survived_count, mapping = aes(x = is_child, y = count, fill = factor(Survived), color = factor(Survived)))
#By position_dodge(preserve = "single"), we can maintain width of bar.
p + geom_col(position = position_dodge(preserve = "single")) +
coord_flip() + ggtitle("Count of Survived or not per is_Child")
```
If they are children, more than half of them survive, but if they are not children, the survival rate is much lower.
We'll try to find other information that might have an impact on survival. If the fare are high, person may be prioritized for help.
Let's draw a boxplot to see if there is a difference in fare depending on whether you survived or not. This is geom_boxplot().
```{r}
p <- ggplot(data = train, mapping = aes(x = factor(Survived), y = Fare))
p + geom_boxplot(alpha = 0.3) +
theme(text=element_text(size = 16)) +
coord_flip() + ggtitle("Boxplot of Fare per Survived")
```
As we can see from the boxplot, there is a difference in fare between groups which is whether we survived or not.
Let's calculate the statistics.
```{r}
summary_fare <- train %>%
group_by(Survived) %>%
summarize(mean = mean(Fare), std_dev = sd(Fare))
summary_fare
```
The averages are about twice as different.
Let's delve further. Let's draw a scatterplot of each data for each age. We can plot scatterplot by geom_point().
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, y = Fare, color = factor(Survived)))
p + geom_point(alpha = 0.5) + theme(text=element_text(size = 16)) +
ggtitle("Scatterplot of Age vs Fare per Survived")
```
In particular, in geom_point(), by specifying α, the color becomes deeper where the number of data is high, which is good for visual clarity.
We can see that the higher the Fare, the more green dots there are. Also, as we saw above, the green dots are concentrated where the age is low.
There is one more column that is related to Fare. It is the Ticket class.
- 1st: Upper
- 2nd: Middle
- 3rd: Lower
For points with Fare higher than 100, let's see which class they are by assigning labels of pclass. Also, let's draw a horizontal line at 100 to make the boundary easier to understand.
We can overlap labels to points by geom_text_repel() of ggrepel.
And we can overwrite hline by geom_hline().
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, y = Fare, color = factor(Survived)))
p + geom_point(alpha = 0.3) + theme(text=element_text(size = 16)) +
geom_hline(yintercept = 100, size = 1.4, color = "gray80") +
geom_text_repel(data = subset(train, Fare > 100), mapping = aes(label = Pclass))
```
As a matter of course, we can see that all data with a Fare greater than 100 have a pclass of 1. It also shows a high percentage of survivors.
If we look at the graphs separately by pclass, we may be able to see more features.
```{r}
p <- ggplot(data = train, mapping = aes(x = Age, y = Fare, color = factor(Survived)))
p + geom_point(alpha = 0.3) + theme(text=element_text(size = 16)) +
facet_wrap(~ Pclass)
```
We can see that the percentage of survivors is higher for smaller values of pclass. pclass 3 has more red dots. Children are mostly green.
So far, we have revealed the following features.
- Children seem to have a better chance of survival.
- Female seem to have a higher survival rate than male.
- The higher fare or ticket class, the higher the survival rate seems to be.
# Feature Engineering
To more accuracy, I'll try more feature engineering.
I referred A Data Science Framework: To Achieve 99% Accuracy for the features to be created.
## Concat data to process & select sub-set
```{r}
target_cols <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")
# Combine the train data and test data for processing.
train_and_test <- rbind(train[, target_cols], test[, target_cols])
# Extract target
train_label <- train$Survived
```
## Fill missing values with statistics
```{r}
# https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
# https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
train_and_test$Age[is.na(train_and_test$Age)]<-median(train_and_test$Age,na.rm=TRUE)
train_and_test$Embarked[is.na(train_and_test$Embarked)]<-mode(train_and_test$Embarked)
train_and_test$Fare[is.na(train_and_test$Fare)]<-median(train_and_test$Fare,na.rm=TRUE)
```
## Categorical encoding
```{r}
train_and_test$Sex <- as.numeric(factor(train_and_test$Sex))
train_and_test$Embarked <- as.numeric(factor(train_and_test$Embarked))
```
## Create new column from some columns
```{r}
train_and_test$FamilySize = train_and_test$SibSp + train$Parch + 1
```
```{r}
train_and_test = train_and_test %>% mutate(IsAlone = ifelse(FamilySize > 1, 1, 0))
```
## Binning & label encoding
```{r}
train_and_test$Age_bin <- bin(train_and_test$Age, nbins = 5, labels = c(0, 1, 2, 3, 4))
train_and_test <- train_and_test %>% mutate(Age_bin = as.integer(Age_bin))
train_and_test$Fare_bin <- bin(train_and_test$Fare, nbins = 5, labels = c(0, 1, 2, 3, 4))
train_and_test <- train_and_test %>% mutate(Fare_bin = as.integer(Fare_bin))
```
## Horizontal division of tibble
```{r}
train_idx <- 1:nrow(train)
test_idx <- -train_idx
train <- train_and_test[train_idx,]
test <- train_and_test[test_idx,]
```
Finally, we have our data set!
```{r}
input_cols <- append(target_cols, c( "FamilySize", "IsAlone", "Age_bin", "Fare_bin"))
train_data <- train[, input_cols]
test_data <- test[, input_cols]
```
# Modeling with LightGBM
As Python users, you probably want to try LightGBM first after EDA. This method is also available in R. Let's create a model and submit the estimation results.
## Create dataset
In order to pass the tibble to the LightGBM Dataset API, we have to made it into matrix using as.matrix().
```{r}
cat_cols <- c("Pclass", "Sex", "Embarked", "IsAlone", "Age_bin", "Fare_bin")
train_ds <- lgb.Dataset(as.matrix(train_data), label = as.matrix(train_label), categorical_feature = cat_cols)
```
## Cross validation
```{r}
lgb.cv(
objective = "binary"
, metric = "binary_logloss"
, data = train_ds
, nrounds = 1500L
, learning_rate = 0.002
, early_stopping_rounds = 10
, nfold = 5
, shuffle = TRUE
)
```
## Training
```{r}
model <- lightgbm(
objective = "binary"
, metric = "binary_logloss"
, data = train_ds
, nrounds = 1433
, learning_rate = 0.002
)
```
## Check feature importance
```{r}
tree_imp <- lgb.importance(model, percentage = TRUE)
```
```{r}
lgb.plot.importance(
tree_imp,
top_n = 10L,
measure = "Gain",
left_margin = 10L,
cex = NULL
)
```
## Inference & export submission file
```{r}
preds <- predict(model, as.matrix(test_data))
sub$Survived <- ifelse(preds > 0.45,1,0)
write.csv(sub, "titanic/submission.csv", row.names=FALSE)
```