# Introduction to Pandas

In this tutorial, we will learn how to use Pandas by analyzing a real-world dataset.

The dataset that we are going to analyze is the TED talk dataset which is available on Kaggle (https://www.kaggle.com/datasets/ahmadfatani/ted-talks-dataset). The dataset contains information about all video recordings of TED Talks uploaded to the official TED.com website until April 18th, 2020. It contains information about all talks including the number of views, tags, posted-date, speakers and titles.

Note that you do not have to download the dataset from Kaggle since the data is already contained in the GitHub repository.

## Data Exploration

As a first step, we want to get a basic understanding of the data contained in the dataset. In particular, we want to know things such as:

- What data (e.g., columns) is available?
- How large is our dataset (how many rows are there)?
- What datatype does each column have?
- What do basic summary statistics of the data look like?

### Loading the CSV file

The first step is to load the data provided in the CSV file. The file that we want to load is `ted_main.csv`.

With the `pd.read_csv()` function, Pandas provides a convenient way to read CSV files.

In [1]:
import pandas as pd

In [2]:
# By default, pd.read_csv() expects the CSV file to use "," as the separator.
# This can be changed with the <sep> parameter (<delimiter> is an alias for <sep>).
ted_df = pd.read_csv('../ted_talk_dataset/ted_main.csv')
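
To illustrate the `sep` parameter, here is a minimal sketch that reads semicolon-separated text from an in-memory buffer instead of a file. The data and the name `demo_df` are made up for illustration and are not part of the TED dataset.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk.
raw = io.StringIO("title;views\nTalk A;100\nTalk B;250\n")

# sep tells read_csv which character separates the fields.
demo_df = pd.read_csv(raw, sep=";")

print(demo_df.columns.tolist())   # ['title', 'views']
print(demo_df["views"].sum())     # 350
```

Without `sep=";"`, each line would be read as a single field, so the dataframe would end up with one column instead of two.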

### Printing the first few rows

A good way to start inspecting a dataset is to look at the first few rows. This can be easily done using the `head()` method. <br/>
By default, `head()` returns the first 5 rows of a dataframe.

In [3]:
ted_df.head()

,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


We can also change the number of rows we want to print. Let's say we want to look at the first two rows only.

In [4]:
ted_df.head(2)

,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520


Alternatively, we can also inspect the end of the dataframe using `tail()`.

In [5]:
ted_df.tail(2)

,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
2548,32,In an unmissable talk about race and politics ...,1100,TEDxMileHigh,1499472000,1,Theo E.J. Wilson,Theo E.J. Wilson: A black man goes undercover ...,1,1506024042,"[{'id': 11, 'name': 'Longwinded', 'count': 3},...","[{'id': 2512, 'hero': 'https://pe.tedcdn.com/i...",Public intellectual,"['Internet', 'TEDx', 'United States', 'communi...",A black man goes undercover in the alt-right,https://www.ted.com/talks/theo_e_j_wilson_a_bl...,419309
2549,8,With more than half of the world population li...,519,TED2017,1492992000,1,Karoliina Korppoo,Karoliina Korppoo: How a video game might help...,1,1506092422,"[{'id': 21, 'name': 'Unconvincing', 'count': 2...","[{'id': 2682, 'hero': 'https://pe.tedcdn.com/i...",Game designer,"['cities', 'design', 'future', 'infrastructure...",How a video game might help us build better ci...,https://www.ted.com/talks/karoliina_korppoo_ho...,391721
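
The behaviour of `head()` and `tail()` is easier to see on a tiny dataframe. The sketch below uses a made-up dataframe `toy_df` with ten rows. It also shows that a negative argument means "everything except the last (or first) rows".

```python
import pandas as pd

# Toy dataframe with ten rows; the values are illustrative only.
toy_df = pd.DataFrame({"n": range(10)})

first_three = toy_df.head(3)   # rows 0, 1, 2
last_three = toy_df.tail(3)    # rows 7, 8, 9

# A negative n returns all rows except the last |n| rows.
all_but_last_two = toy_df.head(-2)   # rows 0 to 7

print(first_three["n"].tolist())   # [0, 1, 2]
print(last_three["n"].tolist())    # [7, 8, 9]
print(len(all_but_last_two))       # 8
```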


### Obtaining the shape of a dataframe

When we look at the first two rows, we can immediately see that the dataframe contains multiple columns. But how many columns and rows does it have?

We can inspect the size of a dataframe by writing ...

In [6]:
# shape is a tuple: (number of rows, number of columns)
print(ted_df.shape)

print('Num rows:', ted_df.shape[0])
print('Num cols:', ted_df.shape[1])

(2550, 17)
Num rows: 2550
Num cols: 17
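
Since `shape` is a plain tuple, it can also be unpacked directly. The following sketch uses a small made-up dataframe (`toy_df` is hypothetical) and compares `shape` with `len()` and `size`.

```python
import pandas as pd

# Toy dataframe with 3 rows and 2 columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Tuple unpacking: first entry is the row count, second the column count.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)             # 3 2

# len() of a dataframe is its number of rows; size is rows * columns.
print(len(toy_df), toy_df.size)   # 3 6
```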


### Inspecting each column's data type

Next, take a look at what Pandas thinks of the data types by checking the `dtypes` attribute.

In [7]:
ted_df.dtypes

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object

Ok, that's interesting. Apparently, Pandas reports many columns as data type `object` (not `str`). So, what's going on here? <br/>
Pandas builds on NumPy arrays, which store values in a contiguous memory block with a fixed size per element. An `int64` or `float64` value always takes 8 bytes, but the length of a string is not fixed. To work around this, pandas stores the strings in an `object` array where each element simply holds a reference to a Python string object.

A list of available dtypes can be found here: https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes
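
The difference between a fixed-size numeric dtype and `object` can be checked on two small series. This is a minimal sketch with made-up variable names (`names`, `views`). The `object` dtype is requested explicitly here so that the example does not depend on which default string dtype the installed pandas version chooses.

```python
import pandas as pd

# dtype=object is set explicitly; each element is a reference to a Python str.
names = pd.Series(["Al Gore", "Hans Rosling"], dtype=object)
views = pd.Series([3200520, 12005869], dtype="int64")

print(names.dtype)                 # object
print(type(names.iloc[0]))         # <class 'str'>
print(views.dtype.itemsize)        # 8 (bytes per int64 value)

# The strings themselves have different lengths.
print(names.str.len().tolist())    # [7, 12]
```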

In [8]:
print(ted_df.dtypes)

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object


Another option to obtain similar information from a dataframe with just one glance is the `info()` method. This method provides a concise summary of a dataframe.

In [9]:
ted_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   comments            2550 non-null   int64 
 1   description         2550 non-null   object
 2   duration            2550 non-null   int64 
 3   event               2550 non-null   object
 4   film_date           2550 non-null   int64 
 5   languages           2550 non-null   int64 
 6   main_speaker        2550 non-null   object
 7   name                2550 non-null   object
 8   num_speaker         2550 non-null   int64 
 9   published_date      2550 non-null   int64 
 10  ratings             2550 non-null   object
 11  related_talks       2550 non-null   object
 12  speaker_occupation  2544 non-null   object
 13  tags                2550 non-null   object
 14  title               2550 non-null   object
 15  url                 2550 non-null   object
 16  views               2550 non-null   int64 

As can be seen, the information provided by `info()` also tells us whether a column contains cells without any value. <br/>
Indeed, this seems to be the case for the column `speaker_occupation` where six rows do not contain any value.

### Filtering rows with missing values

Sometimes we might want to remove rows that contain missing values from a dataframe. The following lines illustrate how this can be achieved.

First, we look for cells that contain no values with the `isna()` method. The output is a boolean mask with the same shape as the dataframe, where `True` marks a missing value.

In [10]:
ted_df.isna()

,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2545,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2546,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2547,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2548,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
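
A mask like this can also be computed for a single column and used for boolean indexing, for example to look at the affected rows before removing them. The sketch below runs on made-up data: `toy_df` and its values are hypothetical, and only the column names mirror the TED dataset.

```python
import pandas as pd

# Toy data: the second row has no speaker occupation.
toy_df = pd.DataFrame({
    "main_speaker": ["A", "B", "C"],
    "speaker_occupation": ["Writer", None, "Designer"],
})

# Boolean mask that is True where speaker_occupation is missing.
missing_mask = toy_df["speaker_occupation"].isna()

# Boolean indexing keeps only the rows where the mask is True.
rows_missing = toy_df[missing_mask]

print(missing_mask.tolist())                   # [False, True, False]
print(rows_missing["main_speaker"].tolist())   # ['B']
```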


We can now easily count the number of `True` values in each column using `sum()`. <br/>
By default, `sum()` adds up the values of each column (`axis=0`), so we get one result per column. `True` is counted as 1 and `False` as 0.

In [11]:
ted_df.isna().sum()

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    6
tags                  0
title                 0
url                   0
views                 0
dtype: int64
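
The effect of the `axis` argument can be sketched on a small made-up dataframe (`toy_df` is hypothetical). Summing the mask with the default `axis=0` gives the number of missing values per column, while `axis=1` gives the number of missing values per row.

```python
import numpy as np
import pandas as pd

# Toy data with three missing cells in total.
toy_df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, None],
})

mask = toy_df.isna()

per_column = mask.sum()       # axis=0 (default): one count per column
per_row = mask.sum(axis=1)    # axis=1: one count per row

print(per_column)
print(per_row.tolist())       # [0, 2, 1]
```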

If we really want to remove the rows that contain missing values, we can do this by writing ...

In [12]:
# Remove all rows that contain at least one NaN value
ted_df.dropna(inplace=True)

In [13]:
# Now let's check again to see whether the NaNs are gone
print(ted_df.isna().sum())

comments              0
description           0
duration              0
event                 0
film_date             0
languages             0
main_speaker          0
name                  0
num_speaker           0
published_date        0
ratings               0
related_talks         0
speaker_occupation    0
tags                  0
title                 0
url                   0
views                 0
dtype: int64
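
`dropna()` does not have to modify the dataframe in place, and it can be restricted to selected columns. The following is a minimal sketch on made-up data (`toy_df`, `no_missing`, and `has_occupation` are hypothetical names) that shows both variants.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 lacks an occupation, row 2 lacks a view count.
toy_df = pd.DataFrame({
    "title": ["T1", "T2", "T3"],
    "speaker_occupation": ["Writer", None, "Designer"],
    "views": [100.0, 200.0, np.nan],
})

# Without inplace=True, dropna() returns a new dataframe
# and leaves toy_df unchanged.
no_missing = toy_df.dropna()

# subset restricts the check to the listed columns.
has_occupation = toy_df.dropna(subset=["speaker_occupation"])

print(len(toy_df), len(no_missing), len(has_occupation))   # 3 1 2
print(has_occupation.index.tolist())                       # [0, 2]
```

Note that the original index labels are kept, so the index of the result can contain gaps.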


### Obtaining some overall descriptive statistics

All pandas dataframes provide a `describe()` method that returns descriptive statistics for the data. It summarizes the central tendency, dispersion, and shape of each column's distribution, excluding `NaN` values.

By default, only numeric columns are included, so `object` columns are not analyzed.

In [14]:
ted_df.describe()

,comments,duration,film_date,languages,num_speaker,published_date,views
count,2544.0,2544.0,2544.0,2544.0,2544.0,2544.0,2544.0
mean,191.706761,827.316431,1321828000.0,27.319969,1.028302,1343456000.0,1699779.0
std,282.613719,373.828955,119845500.0,9.563529,0.207945,94718370.0,2501043.0
min,2.0,135.0,74649600.0,0.0,1.0,1151367000.0,50443.0
25%,63.0,578.75,1257466000.0,23.0,1.0,1268341000.0,756580.2
50%,118.0,848.5,1333238000.0,28.0,1.0,1340935000.0,1123870.0
75%,222.0,1047.0,1412921000.0,33.0,1.0,1423519000.0,1702149.0
max,6404.0,5256.0,1503792000.0,72.0,5.0,1506092000.0,47227110.0


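As can be seen, the median number of comments is 118, which means that half of all talks have 118 comments or fewer. The mean is noticeably higher at about 192 comments, which suggests that a few talks with very many comments (the maximum is 6404) pull the average up. The average duration of a TED talk is about 827 seconds (just under 14 minutes).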
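
Why the mean and the `50%` row (the median) can differ so much is easiest to see on a handful of numbers. The sketch below uses made-up comment counts, not values from the TED dataset, with one outlier that pulls the mean up while the median stays put.

```python
import pandas as pd

# Made-up comment counts with one very popular talk.
comments = pd.Series([10, 20, 30, 40, 900])

print(comments.mean())     # 200.0
print(comments.median())   # 30.0

# describe() reports the median under the "50%" label.
print(comments.describe()["50%"])   # 30.0
```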


### Obtaining information about the remaining columns

As mentioned previously, by default, `describe()` only considers columns with numeric values. However, we can also tell `describe()` to look at other data types.

Let's tell `describe()` to only look at objects (which are strings in our case).

In [15]:
ted_df.describe(include=object)

,description,event,main_speaker,name,ratings,related_talks,speaker_occupation,tags,title,url
count,2544,2544,2544,2544,2544,2544,2544,2544,2544,2544
unique,2544,355,2150,2544,2544,2544,1458,2524,2544,2544
top,Sir Ken Robinson makes an entertaining and pro...,TED2014,Hans Rosling,Ken Robinson: Do schools kill creativity?,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Writer,"['art', 'creativity']",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...
freq,1,84,9,1,1,1,45,3,1,1


**Hint:** `freq` denotes how often the most common value (`top`) occurs.
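
The meaning of the `unique`, `top`, and `freq` rows can be cross-checked with `value_counts()`. This is a minimal sketch on made-up data (`toy_df` is hypothetical). The column is created with `dtype=object` explicitly so that `include=object` selects it regardless of the default string dtype of the installed pandas version.

```python
import pandas as pd

# Toy data: "TED2014" occurs twice, "TED2006" once.
toy_df = pd.DataFrame({
    "event": pd.Series(["TED2014", "TED2014", "TED2006"], dtype=object),
    "views": [1, 2, 3],
})

summary = toy_df.describe(include=object)
print(summary.loc["unique", "event"])   # 2
print(summary.loc["top", "event"])      # TED2014
print(summary.loc["freq", "event"])     # 2

# value_counts() sorts by frequency, so its first entry matches top/freq.
counts = toy_df["event"].value_counts()
print(counts.index[0], counts.iloc[0])  # TED2014 2
```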

Alternatively, we can include both types of columns at the same time.

In [16]:
ted_df.describe(include='all')

,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
count,2544.0,2544,2544.0,2544,2544.0,2544.0,2544,2544,2544.0,2544.0,2544,2544,2544,2544,2544,2544,2544.0
unique,,2544,,355,,,2150,2544,,,2544,2544,1458,2524,2544,2544,
top,,Sir Ken Robinson makes an entertaining and pro...,,TED2014,,,Hans Rosling,Ken Robinson: Do schools kill creativity?,,,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Writer,"['art', 'creativity']",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,
freq,,1,,84,,,9,1,,,1,1,45,3,1,1,
mean,191.706761,,827.316431,,1321828000.0,27.319969,,,1.028302,1343456000.0,,,,,,,1699779.0
std,282.613719,,373.828955,,119845500.0,9.563529,,,0.207945,94718370.0,,,,,,,2501043.0
min,2.0,,135.0,,74649600.0,0.0,,,1.0,1151367000.0,,,,,,,50443.0
25%,63.0,,578.75,,1257466000.0,23.0,,,1.0,1268341000.0,,,,,,,756580.2
50%,118.0,,848.5,,1333238000.0,28.0,,,1.0,1340935000.0,,,,,,,1123870.0
75%,222.0,,1047.0,,1412921000.0,33.0,,,1.0,1423519000.0,,,,,,,1702149.0
