# Introduction to Pandas

In this tutorial, we will learn how to use Pandas by analyzing a real-world dataset.

The dataset we are going to analyze is the TED Talks dataset, which is available on Kaggle (https://www.kaggle.com/datasets/ahmadfatani/ted-talks-dataset). It covers all video recordings of TED Talks uploaded to the official TED.com website until April 18th, 2020. For each talk it includes, among other things, the number of views, the tags, the posting date, the speakers, and the title.

Note that you do not have to download the dataset from Kaggle, since the data is already included in the GitHub repository.

In [1]:
import pandas as pd

ted_df = pd.read_csv('../ted_talk_dataset/ted_main.csv')
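
Before analyzing anything, it usually helps to check that the file loaded as expected. The sketch below uses a tiny in-memory CSV as a stand-in for `ted_main.csv` so that it runs without the repository. The names `csv_text` and `sample_df` are illustrative, only three columns are included, and the third row is made up; the first two rows reuse the comment and view counts that appear in the outputs later in this tutorial.

```python
import io

import pandas as pd

# Stand-in for ted_main.csv: a few columns only, and the third row is made up.
csv_text = """title,comments,views
Militant atheism,6404,4374792
The case for same-sex marriage,649,292395
Hypothetical talk,100,1000000
"""

# read_csv accepts any file-like object, so StringIO works in place of a path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # column types inferred by read_csv
print(sample_df.head())  # first rows of the dataframe
```

The same three calls (`shape`, `dtypes`, `head()`) can be applied to `ted_df` to get a first impression of the real data.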

## Which talk provoked the most online discussion?

Next, we want to know which talk has provoked the most online discussion so far. One way to find out is to look at the number of comments on each talk: the talk with the most comments is the one we are interested in. We therefore sort the dataframe by the number of comments in descending order and take the first row.

In [2]:
ted_df.sort_values(by='comments', ascending=False).head(1)

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
96,6404,Richard Dawkins urges all atheists to openly s...,1750,TED2002,1012608000,42,Richard Dawkins,Richard Dawkins: Militant atheism,1,1176689220,"[{'id': 3, 'name': 'Courageous', 'count': 3236...","[{'id': 86, 'hero': 'https://pe.tedcdn.com/ima...",Evolutionary biologist,"['God', 'atheism', 'culture', 'religion', 'sci...",Militant atheism,https://www.ted.com/talks/richard_dawkins_on_m...,4374792


Alternatively, we can sort in ascending order (the default) and take the last row:

In [3]:
ted_df.sort_values(by='comments').tail(1)

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
96,6404,Richard Dawkins urges all atheists to openly s...,1750,TED2002,1012608000,42,Richard Dawkins,Richard Dawkins: Militant atheism,1,1176689220,"[{'id': 3, 'name': 'Courageous', 'count': 3236...","[{'id': 86, 'hero': 'https://pe.tedcdn.com/ima...",Evolutionary biologist,"['God', 'atheism', 'culture', 'religion', 'sci...",Militant atheism,https://www.ted.com/talks/richard_dawkins_on_m...,4374792


Apparently, with 6404 comments, the talk in row 96 titled "Militant atheism" has provoked the most online discussion so far.
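
Sorting the whole dataframe just to read off a single row works, but pandas also offers more direct routes. The following is a minimal sketch using `DataFrame.nlargest` and `Series.idxmax` on a small stand-in dataframe. The name `toy_df` and its third row are illustrative and not part of the dataset; the first two rows reuse numbers from the outputs in this tutorial.

```python
import pandas as pd

# Small stand-in for ted_df; the third row is made up for illustration.
toy_df = pd.DataFrame({
    'title': ['Militant atheism', 'The case for same-sex marriage', 'Hypothetical talk'],
    'comments': [6404, 649, 100],
    'views': [4374792, 292395, 1000000],
})

# nlargest returns the n rows with the largest values in the given column.
top_row = toy_df.nlargest(1, 'comments')

# idxmax returns the index label of the maximum; .loc then fetches that row.
top_label = toy_df['comments'].idxmax()
top_title = toy_df.loc[top_label, 'title']

print(top_row)
print(top_title)
```

On the full dataset, the same calls would be applied to `ted_df` instead of `toy_df`.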

Unfortunately, there is a problem with this analysis. Because the number of comments tends to increase over time, we can expect talks that have been online for a very long time to have more comments. This biases the comparison in favor of older talks.

**How can we correct this bias?** <br/>
For example, we could normalize the number of comments by the number of views, since views also accumulate the longer a talk has been online.

In [4]:
ted_df['comments_per_view'] = ted_df.comments / ted_df.views

In [5]:
ted_df.sort_values('comments_per_view').tail(1)

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,comments_per_view
744,649,Hours before New York lawmakers rejected a key...,453,New York State Senate,1259712000,0,Diane J. Savino,Diane J. Savino: The case for same-sex marriage,1,1282062180,"[{'id': 25, 'name': 'OK', 'count': 100}, {'id'...","[{'id': 217, 'hero': 'https://pe.tedcdn.com/im...",Senator,"['God', 'LGBT', 'culture', 'government', 'law'...",The case for same-sex marriage,https://www.ted.com/talks/diane_j_savino_the_c...,292395,0.00222


According to our new definition of what is most provocative, the talk in row 744, titled "The case for same-sex marriage", is the most provocative one. It has received about 0.00222 comments per view.
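
Dividing by views is only one way to address the time bias described above. Another option, sketched here under stated assumptions, is to divide the number of comments by the number of days a talk has been online. The sketch assumes that `published_date` holds Unix timestamps in seconds (the values shown in the outputs above are consistent with that) and uses the dataset cut-off date from the introduction as the reference date. The names `toy_df`, `reference_date`, and `comments_per_day` are illustrative.

```python
import pandas as pd

# Two rows taken from the outputs above; published_date is assumed to be
# a Unix timestamp in seconds.
toy_df = pd.DataFrame({
    'title': ['Militant atheism', 'The case for same-sex marriage'],
    'comments': [6404, 649],
    'published_date': [1176689220, 1282062180],
})

# Assumed reference date: the dataset cut-off mentioned in the introduction.
reference_date = pd.Timestamp('2020-04-18')

# Convert the timestamps to datetimes and count the full days online.
published = pd.to_datetime(toy_df['published_date'], unit='s')
days_online = (reference_date - published).dt.days

toy_df['comments_per_day'] = toy_df['comments'] / days_online

print(toy_df.sort_values('comments_per_day', ascending=False))
```

Which normalization is more appropriate depends on the question being asked: comments per view measures how likely a viewer is to comment, while comments per day measures how quickly discussion accumulates.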

**Views per comment**

We can make this result easier to interpret. Instead of normalizing the number of comments by the number of views, we normalize the number of views by the number of comments.

In [6]:
ted_df['views_per_comment'] = ted_df.views / ted_df.comments

In [7]:
ted_df.sort_values('views_per_comment').head(1)

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,comments_per_view,views_per_comment
744,649,Hours before New York lawmakers rejected a key...,453,New York State Senate,1259712000,0,Diane J. Savino,Diane J. Savino: The case for same-sex marriage,1,1282062180,"[{'id': 25, 'name': 'OK', 'count': 100}, {'id'...","[{'id': 217, 'hero': 'https://pe.tedcdn.com/im...",Senator,"['God', 'LGBT', 'culture', 'government', 'law'...",The case for same-sex marriage,https://www.ted.com/talks/diane_j_savino_the_c...,292395,0.00222,450.531587


We can conclude that the talk "The case for same-sex marriage" is the most provocative talk. On average, it received about one comment for every 450 views.
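
The two ratios are reciprocals of each other, so the talk with the highest `comments_per_view` is also the talk with the lowest `views_per_comment`. The sketch below checks this on a small stand-in dataframe and also shows what happens for a talk without comments: dividing by zero yields `inf` in pandas rather than raising an error. The name `toy_df` and the zero-comment row are made up for illustration.

```python
import pandas as pd

# Stand-in for ted_df; the zero-comment row is made up for illustration.
toy_df = pd.DataFrame({
    'title': ['Militant atheism', 'The case for same-sex marriage', 'Hypothetical silent talk'],
    'comments': [6404, 649, 0],
    'views': [4374792, 292395, 50000],
})

toy_df['comments_per_view'] = toy_df['comments'] / toy_df['views']
# 50000 / 0 evaluates to inf, so the silent talk ends up last when sorting ascending.
toy_df['views_per_comment'] = toy_df['views'] / toy_df['comments']

# Largest comments_per_view and smallest views_per_comment select the same row.
by_ratio = toy_df['comments_per_view'].idxmax()
by_inverse = toy_df['views_per_comment'].idxmin()

print(toy_df.loc[by_ratio, 'title'])
print(toy_df.loc[by_inverse, 'title'])
```

If talks without comments should be excluded entirely, one possibility is to filter with `toy_df[toy_df['comments'] > 0]` before computing the ratio.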