---
title: "Ted Talks Analysis R Markdown"
author: "sri"
date: "01/05/2022"
output: html_document
editor_options: 
  chunk_output_type: console
---


# *EDA ON TED TALKS IN R*

### Loading the basic packages. I'll load all other packages as and when needed.

In [None]:
library(tidyverse)
library(dplyr)
library(ggplot2)

### Loading the data.

In [None]:
tedtalks<-read.csv("../input/ted-talks/data.csv")

## Overviewing the data.

In [None]:
head(tedtalks)

In [None]:
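#splitting "date" (a month name followed by a four-digit year) into month and year
tedtalks$month <- substr(tedtalks$date, 1, nchar(tedtalks$date)-5)
tedtalks$date <- as.numeric(substr(tedtalks$date, nchar(tedtalks$date)-4, nchar(tedtalks$date)))
#the "date" column now holds only the year, so rename it
names(tedtalks)[names(tedtalks) == "date"] <- "year"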
head(tedtalks,3)

In [None]:
#Filtering the data frame to the twenty-year period from 2002 to 2021
tedtalks = subset(tedtalks, year >= 2002 & year <= 2021) 

#Add new column "quarter" to the data frame
tedtalks = tedtalks %>%
  mutate(quarter = case_when(
    month == "January" | month == "February" | month == "March"  ~ "Q1",
    month == "April" | month == "May" | month == "June"  ~ "Q2",
    month == "July" | month == "August" | month == "September"  ~ "Q3",
    month == "October" | month == "November" | month == "December"  ~ "Q4"
  ))

head(tedtalks,3)

In [None]:
tedtalks$quarter = as.factor(tedtalks$quarter)

head(tedtalks,3)

levels(tedtalks$quarter)
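
The quarter can also be derived arithmetically from each month's position in R's built-in `month.name` vector. A minimal base-R sketch, using a small hypothetical vector of month names rather than the real data:

```r
# Hypothetical sample of month names (not taken from the dataset)
month_names <- c("January", "April", "September", "December", "June")

# Position of each month in the calendar (1-12), via the built-in month.name
month_number <- match(month_names, month.name)

# Integer division maps 1-3 to Q1, 4-6 to Q2, 7-9 to Q3, 10-12 to Q4
qtr <- factor(paste0("Q", (month_number - 1) %/% 3 + 1),
              levels = c("Q1", "Q2", "Q3", "Q4"))

print(data.frame(month_names, month_number, qtr))
```

Fixing the factor levels up front keeps all four quarters in order even if one of them is missing from the data.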

In [None]:
colnames(tedtalks)

### Getting dimensions of the data

In [None]:
glimpse(tedtalks)

This dataset contains **5,440** entries with **6** columns. Moving on....

### Basic statistical info of the dataset.

In [None]:
library(skimr)
skim(tedtalks)

As we can see, there are **5,440** unique topics and some repeated authors, since we have only **4444** unique authors. Moving on...

### Getting info about the missing data.

In [None]:
colSums(is.na(tedtalks))

The dataset is already clean, so it is ready for analysis. However, the date information is stored as a month name (**char**) and a numeric year, so we have to build a proper date column for time series analysis.

## Feature Engineering

### Changing date column to appropriate date datatype.

In [None]:
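#rebuilding a proper Date column (first day of each month) from the month and year columns,
#because the original "date" column was renamed to "year" earlier
tedtalks$date <- as.Date(paste('01', tedtalks$month, tedtalks$year), format='%d %B %Y')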

head(tedtalks,3)
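
Parsing month names with `as.Date()` depends on the system locale. A locale-independent sketch builds ISO date strings from a month name and a year instead; the `month` and `year` vectors below are hypothetical stand-ins for the columns created earlier:

```r
# Hypothetical month/year values shaped like the columns created earlier
month <- c("December", "March", "July")
year  <- c(2021, 2019, 2006)

# Build "YYYY-MM-01" strings; match() against the built-in month.name
# avoids relying on the locale's month names
iso <- sprintf("%d-%02d-01", as.integer(year), match(month, month.name))
dates <- as.Date(iso)

print(dates)
print(class(dates))
```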

## Setting objectives and starting Exploratory Data Analysis
With the available data, let's look for some good insights:
* Distribution of ted talks over the years.
* Most viewed ted talks of all time.
* Most liked ted talks of all time.
* Top speakers based on likes and views.
* Finding TED talks with the best view to like ratio.
* Month-wise Analysis of TED talk frequency.
* Create a word cloud of the titles and see whether it reveals any insight.

### Distribution of ted talks over the years.

In [None]:
tedtalks %>% 
  ggplot(aes(date))+
  geom_histogram(color = "#000000", fill = "#0099F8")

Here we can see that the ***popularity of TED talks took off in the early 2000s, peaked in 2019, and then began to decline. COVID-19 may be the cause, since events could not take place.*** Moving on...

#### Let's filter out and see the oldest TED talks.

In [None]:
tedtalks %>% 
  filter(date < as.Date("1980-01-01")) #compare against a Date, not the bare number 1980

We've got 3 results and the data seems legit. These are the **3 oldest TED talks that exist**. Their views are also high for that time; we'll have to wait and see whether any of them shows up among the most viewed.


#### Let's filter and see few more between 1980 and 2000.

In [None]:
tedtalks %>% 
  filter(date>="1980-01-01" & date<="2000-01-01")

Some of these TED talks got a lot of views as well. Moving on....

### Most viewed ted talks of all time.

In [None]:
tedtalks %>% 
  arrange(desc(views)) %>% 
  slice(1:10)

Bill Gates' talk came true after all!!!

### Most liked ted talks of all time.

In [None]:
tedtalks %>% 
  arrange(desc(likes)) %>% 
  slice(1:10)

Here we can see that ***some disparity remains between views and likes***. Some of the most viewed talks do not appear among the most liked.

#### Let's try to understand the relationship between views and likes now.

In [None]:
library(scales)
ggplot(tedtalks,aes(x=likes,y=views))+
  geom_jitter()+
  geom_smooth()+
  scale_y_continuous(labels = comma)

Not much variation, a little here and there, but that's fine. Let's quantify the relationship with a correlation coefficient.

In [None]:
cor(tedtalks$likes,tedtalks$views)

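That's a correlation of about 0.99, a very strong positive linear relationship. ***Talks with more views almost always have more likes***, although this does not mean the most viewed talk is the most liked 99% of the time. Moving on...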
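
A correlation coefficient measures how closely the points follow a straight line; it does not by itself say how often the rankings by views and by likes coincide. A small sketch with made-up numbers (not the TED data) compares the Pearson coefficient with the rank-based Spearman coefficient:

```r
# Made-up views and likes for six hypothetical talks
views <- c(1000, 2000, 3000, 4000, 5000, 60000)
likes <- c(  40,   30,   95,  120,  110,  1800)

pearson  <- cor(views, likes)                       # linear association
spearman <- cor(views, likes, method = "spearman")  # agreement of the rankings

print(c(pearson = pearson, spearman = spearman))

# Do the two rankings agree exactly?
same_ranking <- identical(order(views, decreasing = TRUE),
                          order(likes, decreasing = TRUE))
print(same_ranking)
```

In this toy example a single very popular talk pushes the Pearson coefficient close to 1 even though the two rankings differ, which is why it can be worth checking Spearman as well.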

### Top speakers based on likes and views.

In [None]:
#ranking talks by likes, then views, and listing their speakers
tedtalks %>% 
  arrange(desc(likes), desc(views)) %>% 
  select(author, likes, views) %>% 
  head(10)

These are the top speakers based on views and likes. But there are ***speakers who gave more than 1 ted talk***. So, we have to consider that too.
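
To account for speakers with several talks, likes and views can be summed per author before ranking. A base-R sketch on a small made-up data frame (the column names `author`, `likes`, and `views` follow the dataset; the values are hypothetical):

```r
# Hypothetical talks; two speakers appear more than once
talks <- data.frame(
  author = c("A", "B", "A", "C", "B", "A"),
  likes  = c(100, 500, 300, 50, 200, 150),
  views  = c(3000, 15000, 9000, 2000, 7000, 5000)
)

# Sum likes and views per author
totals <- aggregate(cbind(likes, views) ~ author, data = talks, FUN = sum)

# Add the number of talks per author
totals$talks <- as.vector(table(talks$author)[as.character(totals$author)])

# Rank speakers by total views, then by total likes
totals <- totals[order(-totals$views, -totals$likes), ]
print(totals)
```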

#### Top speakers based on no. of ted talks

In [None]:
Top.author.count<-tedtalks %>% 
  group_by(author) %>% 
  count(sort = T) %>% 
  head(10)

Top.author.count %>% 
  ggplot(aes(x=reorder(author,(+n)),y=n))+
  geom_col()+
  coord_flip()+
  labs(x="author", y="no. of tedtalks")

### Finding TED talks with the best view to like ratio.

In [None]:
#creating a column for the ratio.
tedtalks$vlratio<-tedtalks$views/tedtalks$likes

#finding the top talks with high ratio
tedtalks %>% 
  arrange(desc(vlratio)) %>% 
  slice(1:10)
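
Dividing by `likes` yields `Inf` for any talk with zero likes, and such rows would then sort to the top. Whether the dataset contains such rows is not checked here; the sketch below uses made-up numbers to show one way to guard against it by storing `NA` instead:

```r
# Made-up talks, one of them with zero likes
talks <- data.frame(
  title = c("t1", "t2", "t3"),
  views = c(1000, 5000, 200),
  likes = c(50, 100, 0)
)

# Views per like; undefined when likes == 0, so store NA instead of Inf
talks$vlratio <- ifelse(talks$likes > 0, talks$views / talks$likes, NA_real_)

# order() places NA last by default, so zero-like talks no longer rank first
ranked <- talks[order(talks$vlratio, decreasing = TRUE), ]
print(ranked)
```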

### Month-wise Analysis of TED talk frequency

In [None]:
#creating a month column from date column
library(lubridate)
tedtalks$month<-month(tedtalks$date)
tedtalks$month <- month.name[tedtalks$month]

#plotting the frequency counts
tedtalks %>% 
  group_by(month) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x = reorder(month,(+count)), y = count)) + 
  geom_bar(stat = 'identity')+
  coord_flip()+
  labs(x="month", y="no. of tedtalks")

### Year-wise Analysis of TED talk frequency

In [None]:
#creating a year column from date column
tedtalks$year<-year(tedtalks$date)

#plotting the frequency counts(year >2000, since before that only few tedtalks existed)
tedtalks %>% 
  filter(date>="2000-01-01") %>% 
  group_by(year) %>% 
  summarise(count = n()) %>% 
  ggplot(aes(x = reorder(year,(+count)), y = count)) + 
  geom_bar(stat = 'identity')+
  coord_flip()+
  labs(x="year", y="no. of tedtalks")

Here we can observe that the trend was ***increasing until 2019, but after that there was a huge drop in the number of TED talks, most likely due to the COVID-19 pandemic***. 

#### Let's highlight and compare the counts

In [None]:
#viewing the count
tedtalks %>% 
  filter(date>="2018-01-01" & date<="2022-01-01") %>% 
  group_by(year) %>% 
  summarise(count = n())


#creating some data to plot
highlight<-tedtalks %>% 
  group_by(year) %>% 
  summarise(count = n())
  
highlight <- highlight %>% mutate( ToHighlight = ifelse( year == 2021, "yes", "no" ) )

#plotting the highlighted graph
library(scales)
highlight %>% 
  filter(year>2000) %>% 
  ggplot(aes(x=year,y=count,fill = ToHighlight ) ) +
  geom_bar( stat = "identity" ) +
  scale_fill_manual( values = c( "yes"="tomato", "no"="gray" ), guide = "none" )+
  labs(title = "COVID PANDEMIC EFFECT ON TED TALKS",
       subtitle = "No. of ted talks decreased in 2021 after pandemic hit worldwide",
       caption = "~Affected year(2021) marked in orange",y="no. of tedtalks occurred")

*As said, here we can see the decrease in the number of TED talks that happened in* **2021**
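
The size of a drop is easier to judge as a year-over-year percentage change. A sketch using hypothetical yearly counts (not the actual figures from the dataset):

```r
# Hypothetical yearly counts, for illustration only
counts <- data.frame(year = 2018:2021, count = c(400, 500, 450, 300))

# Percentage change relative to the previous year (NA for the first year)
counts$pct_change <- c(NA, diff(counts$count) / head(counts$count, -1) * 100)

print(counts)
```

The same calculation could be applied to the per-year counts computed above once they are sorted by year.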

### Finally, wordcloud of most used words in the Title.

In [None]:
library(wordcloud)
library(RColorBrewer)

library(tm)
#Create a vector containing only the text
text <- tedtalks$title
# Create a corpus  
docs <- Corpus(VectorSource(text))

#clean the data
docs <- docs %>%
  tm_map(removeNumbers) %>%
  tm_map(removePunctuation) %>%
  tm_map(stripWhitespace)
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeWords, stopwords("english"))

#Create a term-document matrix
dtm <- TermDocumentMatrix(docs) 
matrix <- as.matrix(dtm) 
words <- sort(rowSums(matrix),decreasing=TRUE) 
df <- data.frame(word = names(words),freq=words)

#creating wordcloud
set.seed(1234) # for reproducibility 
wordcloud(words = df$word, 
          freq = df$freq, min.freq = 1,
          max.words=200, random.order=FALSE,
          rot.per=0.35,colors=brewer.pal(8, "Dark2"))
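
The word frequencies behind a word cloud can also be inspected directly as a table. A base-R sketch on a few made-up titles; the stop-word list here is a tiny hypothetical one, not the `tm` list used above:

```r
# Made-up titles, for illustration only
titles <- c("The power of data", "How data shapes the future", "The future of work")

# Lowercase, drop everything except letters and spaces, split on whitespace
clean_titles <- gsub("[^a-z ]", "", tolower(titles))
tokens <- unlist(strsplit(clean_titles, " +"))

# Tiny illustrative stop-word list (an assumption; stopwords("english") is much longer)
stop_words <- c("the", "of", "how")
tokens <- tokens[!(tokens %in% stop_words) & nzchar(tokens)]

# Count each word and sort, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
print(freq)
```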

# *THE END*