
# <font color=red>**TED Talks**</font>

[Kaggle Competition](https://www.kaggle.com/rounakbanik/ted-talks)

[GitHub - Carlos Scovino](https://github.com/cscovino/TED-Talks-Analysis)


###  <font color=blue> **Archivo**:</font>  ted_main.csv


| Columna | Tipo | Descripcion | Tratamiento |
| :--- | :--- | :--- | :--- |
|  <font color=blue>**comments**</font> | Numeric | The number of first level comments made on the talk | Estandarizar |
|  <font color=blue>**description**</font> | Text | A blurb of what the talk is about | ¿? |
|  <font color=blue>**duration**</font> | Numeric | The duration of the talk in seconds | Estandarizar |
|  <font color=blue>**event**</font> | Text | The TED/TEDx event where the talk took place | ¿drop? |
|  <font color=blue>**film_date**</font> | Timestamp | The Unix timestamp of the filming | Se convierte a datetime. |
|  <font color=blue>**languages**</font> | Numeric | The number of languages in which the talk is available | Estandarizar |
|  <font color=blue>**main_speaker**</font> | Text | The first named speaker of the talk | ¿? |
|  <font color=blue>**name**</font> | Text | The official name of the TED Talk. Includes the title and the speaker. | |
|  <font color=blue>**num_speaker**</font> | Numeric | The number of speakers in the talk | |
|  <font color=blue>**published_date**</font> | Timestamp | The Unix timestamp for the publication of the talk on TED.com | Se convierte a datetime. |
|  <font color=blue>**ratings**</font> | Text | A stringified dictionary of the various ratings given to the talk (inspiring, fascinating, jaw dropping, etc.) | Se generó 14 columnas dummy |
|  <font color=blue>**related_talks**</font> | Text | A list of dictionaries of recommended talks to watch next | |
|  <font color=blue>**speaker_occupation**</font> | Text | The occupation of the main speaker | |
|  <font color=blue>**tags**</font> | Text | The themes associated with the talk | Se generó columnas dummy por las 50 más usadas |
|  <font color=blue>**title**</font> | Text | The title of the talk | |
|  <font color=blue>**url**</font> | Text | The URL of the talk | |
|  <font color=blue>**views**</font> | Numeric | The number of views on the talk | |

###  <font color=blue> **Archivo**:</font>  transcripts.csv

| Columna | Tipo | Descripcion |
| :--- | :--- | :--- |
| <font color=blue>**transcript**</font> | Text | The official English transcript of the talk. |
| <font color=blue>**url**</font> | Text | The URL of the talk |

<div class="alert alert-block alert-info">

# Imports

</div>

In [1]:
import ast
import numpy as np
import pandas as pd

from datetime import datetime, date
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

<div class="alert alert-block alert-info">

# Dataframes

</div>

In [2]:
df_main = pd.read_csv('./dataset/ted_main.csv',sep=",",quotechar='"')
df_transcript = pd.read_csv('./dataset/transcripts.csv',sep=",",quotechar='"')

<div class="alert alert-block alert-info">

### Data: Main

</div>

In [3]:
df_main.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'ratings', 'related_talks', 'speaker_occupation', 'tags', 'title',
       'url', 'views'],
      dtype='object')

In [4]:
df_main.count()[0]

2550

In [5]:
df_main.dtypes

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object

In [6]:
df_main.describe()

Unnamed: 0,comments,duration,film_date,languages,num_speaker,published_date,views
count,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0
mean,191.562353,826.510196,1321928000.0,27.326275,1.028235,1343525000.0,1698297.0
std,282.315223,374.009138,119739100.0,9.563452,0.207705,94640090.0,2498479.0
min,2.0,135.0,74649600.0,0.0,1.0,1151367000.0,50443.0
25%,63.0,577.0,1257466000.0,23.0,1.0,1268463000.0,755792.8
50%,118.0,848.0,1333238000.0,28.0,1.0,1340935000.0,1124524.0
75%,221.75,1046.75,1412964000.0,33.0,1.0,1423432000.0,1700760.0
max,6404.0,5256.0,1503792000.0,72.0,5.0,1506092000.0,47227110.0


In [7]:
df_main.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


<div class="alert alert-block alert-success">

## Fechas: _film\_date, published\_date_

1.  **film_date:** The Unix timestamp of the filming.
2.  **published_date:** The Unix timestamp for the publication of the talk on TED.com

 > **Tratamiento:** 
 > * Se extrae mes en columna y se debe transformar a dummies
 > * **FALTARIA**: Variable por día de la semana

</div>

In [8]:
fun_ux2dttm = lambda x: datetime.fromtimestamp(int(x))
fun_ux2dt = lambda x: date.fromtimestamp(int(x))

df_main['film_date'] = df_main['film_date'].apply(fun_ux2dttm)
df_main['published_date'] = df_main['published_date'].apply(fun_ux2dttm)

df_main['film_date'] =  pd.to_datetime(df_main['film_date'])
df_main['published_date'] =  pd.to_datetime(df_main['published_date'])

In [9]:
df_main.dtypes

comments                       int64
description                   object
duration                       int64
event                         object
film_date             datetime64[ns]
languages                      int64
main_speaker                  object
name                          object
num_speaker                    int64
published_date        datetime64[ns]
ratings                       object
related_talks                 object
speaker_occupation            object
tags                          object
title                         object
url                           object
views                          int64
dtype: object

In [10]:
fun_dttm_month = lambda x: datetime.month

df_main['film_year'] = pd.DatetimeIndex(df_main['film_date']).year
df_main['film_month'] = pd.DatetimeIndex(df_main['film_date']).month

df_main['published_year'] = pd.DatetimeIndex(df_main['published_date']).year
df_main['published_month'] = pd.DatetimeIndex(df_main['published_date']).month

In [11]:
df_main.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,related_talks,speaker_occupation,tags,title,url,views,film_year,film_month,published_year,published_month
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,2006-02-24 21:00:00,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,2006-06-26 21:11:00,...,"[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,2006,2,2006,6
1,265,With the same humor and humanity he exuded in ...,977,TED2006,2006-02-24 21:00:00,43,Al Gore,Al Gore: Averting the climate crisis,1,2006-06-26 21:11:00,...,"[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2006,2,2006,6
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,2006-02-23 21:00:00,26,David Pogue,David Pogue: Simplicity sells,1,2006-06-26 21:11:00,...,"[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2006,2,2006,6
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,2006-02-25 21:00:00,35,Majora Carter,Majora Carter: Greening the ghetto,1,2006-06-26 21:11:00,...,"[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,2006,2,2006,6
4,593,You've never seen data presented like this. Wi...,1190,TED2006,2006-02-21 21:00:00,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,2006-06-27 17:38:00,...,"[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,2006,2,2006,6


<div class="alert alert-block alert-success">

## Campo: _comments_
The number of first level comments made on the talk.

 > **Tratamiento:** Se debe estandarizar el campo. Primero se debe realizar el 'train_test_split' y unicamente standarizar con los datos de train.

</div>

<div class="alert alert-block alert-success">

## Campo: _description_
A blurb of what the talk is about.

 > **Tratamiento:** A definir.
 >  * sklearn --> CountVectorizer (stop_words)
 >  * sklearn.feature_extraction.text.TfidfVectorizer

</div>

In [22]:
textos = ["texto 1",
          "texto 2",
          "texto 3" ]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()

s = vec.fit_transform(textos)

In [291]:
df_main['description'][0]

'Sir Ken Robinson makes an entertaining and profoundly moving case for creating an education system that nurtures (rather than undermines) creativity.'

In [15]:
df_main['title'][0]

'Do schools kill creativity?'

In [18]:
df_main['main_speaker'][0]

'Ken Robinson'

In [19]:
df_main['tags'][0]

"['children', 'creativity', 'culture', 'dance', 'education', 'parenting', 'teaching']"

<div class="alert alert-block alert-success">

## Campo: _duration_
The duration of the talk in seconds.

 > **Tratamiento:** Se debe estandarizar el campo. Primero se debe realizar el 'train_test_split' y unicamente standarizar con los datos de train.

</div>

<div class="alert alert-block alert-success">

## Campo: _event_
The TED/TEDx event where the talk took place.

 > **Tratamiento:** 
 > * El nombre del evento no tiene una estructura consistente.
 > * El nombre del evento no necesariamente indica que las charlas se realizaron un un mismo día. Ej.: Para TED2006 se realizaron 45 charlas y las mismas variaron entre 01-Feb-2006 hasta 02-Mar-20016.
 > * La información de Año de la charla se puede obtener del campo 'film_date'.
 > * **AGREGAR**: 3 columas (0/1) TED, TEDx, notTED

</div>

In [292]:
df_main['event'].unique()

array(['TED2006', 'TED2004', 'TED2005', 'TEDGlobal 2005', 'TEDSalon 2006',
       'TED2003', 'TED2007', 'TED2002', 'TEDGlobal 2007',
       'TEDSalon 2007 Hot Science', 'Skoll World Forum 2007', 'TED2008',
       'TED1984', 'TED1990', 'DLD 2007', 'EG 2007', 'TED1998',
       'LIFT 2007', 'TED Prize Wish', 'TEDSalon 2009 Compassion',
       'Chautauqua Institution', 'Serious Play 2008', 'Taste3 2008',
       'TED2001', 'TED in the Field', 'TED2009', 'EG 2008',
       'Elizabeth G. Anderson School', 'TEDxUSC', 'TED@State',
       'TEDGlobal 2009', 'TEDxKC', 'TEDIndia 2009',
       'TEDSalon London 2009', 'Justice with Michael Sandel',
       'Business Innovation Factory', 'TEDxTC',
       'Carnegie Mellon University', 'Stanford University',
       'AORN Congress', 'University of California', 'TEDMED 2009',
       'Royal Institution', 'Bowery Poetry Club', 'TEDxSMU',
       'Harvard University', 'TEDxBoston 2009', 'TEDxBerlin', 'TED2010',
       'TEDxAmsterdam', 'World Science Festival', 

In [293]:
len(df_main[ df_main['event'] == "TED2006" ])

45

In [294]:
df_main[ df_main['event'] == "TED2006" ][['event','comments','views','film_date']]

Unnamed: 0,event,comments,views,film_date
0,TED2006,4553,47227110,2006-02-24 21:00:00
1,TED2006,265,3200520,2006-02-24 21:00:00
2,TED2006,124,1636292,2006-02-23 21:00:00
3,TED2006,200,1697550,2006-02-25 21:00:00
4,TED2006,593,12005869,2006-02-21 21:00:00
5,TED2006,672,20685401,2006-02-01 21:00:00
6,TED2006,919,3769987,2006-02-23 21:00:00
7,TED2006,46,967741,2006-02-22 21:00:00
8,TED2006,852,2567958,2006-02-01 21:00:00
9,TED2006,900,3095993,2006-02-24 21:00:00


<div class="alert alert-block alert-success">

## Campo: _languages_
The number of languages in which the talk is available.

 > **Tratamiento:** Se debe estandarizar el campo. Primero se debe realizar el 'train_test_split' y unicamente standarizar con los datos de train.

</div>

<div class="alert alert-block alert-success">

## Campo: _main_speaker_
The first named speaker of the talk.

 > **Tratamiento:**
 > * Son más de 2156 speakers distintos. [1 talk, 1880][2 talks, 202][3 talks, 48][4 talks, 16][5 talks,6][6 talks,2][7 talks, 1][9 talks, 1]
 > * **Nuevos Feature:**
 1. previous_talks': Cantidad de charlas previas
 2. previous_talk_views: Cantidad de views en su última charla
 3. previous_views_sum: Suma de views de todas sus charlas previas
 4. previous_views_max: Máxima cantidad de views en charlas previas
 5. previous_views_min: Mínima cantidad de views en charlas previas
 

</div>

In [295]:
ser_main = df_main['main_speaker'].value_counts()

In [296]:
# Total de Speakers
len(ser_main.index)

2156

In [297]:
# Speakers con mas de N charlas
ser_main[ser_main>=3].index

Index(['Hans Rosling', 'Juan Enriquez', 'Rives', 'Marco Tempest',
       'Jacqueline Novogratz', 'Dan Ariely', 'Julian Treasure',
       'Nicholas Negroponte', 'Clay Shirky', 'Bill Gates', 'Eve Ensler',
       'Steven Johnson', 'Jonathan Drori', 'Kevin Kelly', 'Stewart Brand',
       'David Pogue', 'Ken Robinson', 'Barry Schwartz', 'Tom Wujec',
       'Lawrence Lessig', 'Dan Dennett', 'Stefan Sagmeister', 'Al Gore',
       'Chris Anderson', 'Robert Full', 'Jonathan Haidt', 'Rory Sutherland',
       'Sugata Mitra', 'Philip Zimbardo', 'Sebastian Wernicke', 'Ray Kurzweil',
       'Michael Green', 'Michael Sandel', 'Christopher Soghoian', 'Seth Godin',
       'Derek Sivers', 'Margaret Heffernan', 'Raghava KK', 'Brian Cox',
       'Ngozi Okonjo-Iweala', 'Adam Savage', 'Pico Iyer', 'Nathan Myhrvold',
       'Sarah Parcak', 'Arthur Benjamin', 'Tim Berners-Lee',
       'Malcolm Gladwell', 'Helen Fisher', 'Aimee Mullins', 'Dean Kamen',
       'Sarah Jones', 'Blaise Agüera y Arcas', 'Dan Gilbert

In [298]:
lst_previous_talks = []
lst_previous_talk_views = [] 
lst_previous_views_sum = [] 
lst_previous_views_max = []
lst_previous_views_min = []

df_main.sort_values(by=['film_date'], ascending=False, inplace=True)
df_main = df_main.reset_index(drop=True)

for ix, row in df_main.iterrows():
 
    df_where = df_main[ df_main['film_date'] < row['film_date'] ]
    df_where = df_where[ df_where['main_speaker'] == row['main_speaker'] ]
    df_where.reset_index(drop=True, inplace=True) 

    if len(df_where) == 0:
        lst_previous_talks.append(0)
        lst_previous_talk_views.append(0)
    else:
        lst_previous_talks.append(df_where.shape[0])
        lst_previous_talk_views.append(df_where.at[0,'views'])

    lst_previous_views_sum.append(df_where['views'].sum())
    lst_previous_views_max.append(df_where['views'].max())
    lst_previous_views_min.append(df_where['views'].min())
    

df_main['previous_talks'] = pd.Series(lst_previous_talks)
df_main['previous_talk_views'] = pd.Series(lst_previous_talk_views)
df_main['previous_views_sum'] = pd.Series(lst_previous_views_sum)
df_main['previous_views_max'] = pd.Series([ 0 if np.isnan(x) else x for x in lst_previous_views_max])
df_main['previous_views_min'] = pd.Series([ 0 if np.isnan(x) else x for x in lst_previous_views_min])

In [299]:
len(df_main[df_main.views==9999])

0

In [300]:
df_main[df_main['main_speaker']=="Hans Rosling"]

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,views,film_year,film_month,published_year,published_month,previous_talks,previous_talk_views,previous_views_sum,previous_views_max,previous_views_min
1258,491,Hans Rosling had a question: Do some religions...,800,TEDxSummit,2012-04-15 21:00:00,36,Hans Rosling,Hans Rosling: Religions and babies,1,2012-05-22 12:00:56,...,2138419,2012,4,2012,5,8,2391977,25428708,12005869,738895
1658,268,What was the greatest invention of the industr...,555,TEDWomen 2010,2010-12-03 21:00:00,46,Hans Rosling,Hans Rosling: The magic washing machine,1,2011-03-21 10:33:00,...,2391977,2010,12,2011,3,7,738895,23036731,12005869,738895
1709,342,Hans Rosling reframes 10 years of UN data with...,934,TEDxChange,2010-09-19 21:00:00,33,Hans Rosling,Hans Rosling: The good news of the decade? We'...,1,2010-10-07 06:12:00,...,738895,2010,9,2010,10,6,2934262,22297836,12005869,904813
1777,607,The world's population will grow to 9 billion ...,604,TED@Cannes,2010-06-20 21:00:00,46,Hans Rosling,"Hans Rosling: Global population growth, box by...",1,2010-07-09 05:15:00,...,2934262,2010,6,2010,7,5,1738069,19363574,12005869,904813
1933,276,Hans Rosling was a young guest student in Indi...,950,TEDIndia 2009,2009-11-03 21:00:00,36,Hans Rosling,Hans Rosling: Asia's rise -- how and when,1,2009-11-22 22:00:00,...,1738069,2009,11,2009,11,4,1471039,17625505,12005869,904813
2031,122,Talking at the US State Department this summer...,1196,TED@State,2009-06-03 21:00:00,33,Hans Rosling,Hans Rosling: Let my dataset change your mindset,1,2009-08-26 22:00:00,...,1471039,2009,6,2009,8,3,904813,16154466,12005869,904813
2068,125,Hans Rosling unveils data visuals that untangl...,602,TED2009,2009-02-05 21:00:00,40,Hans Rosling,"Hans Rosling: Insights on HIV, in stunning dat...",1,2009-05-13 03:50:00,...,904813,2009,2,2009,5,2,3243784,15249653,12005869,3243784
2314,261,Researcher Hans Rosling uses his cool data too...,1137,TED2007,2007-03-02 21:00:00,35,Hans Rosling,Hans Rosling: New insights on poverty,1,2007-06-25 06:12:00,...,3243784,2007,3,2007,6,1,12005869,12005869,12005869,12005869
2354,593,You've never seen data presented like this. Wi...,1190,TED2006,2006-02-21 21:00:00,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,2006-06-27 17:38:00,...,12005869,2006,2,2006,6,0,0,0,0,0


<div class="alert alert-block alert-success">

## Campo: _name_
The official name of the TED Talk. Includes the title and the speaker.

 > **Tratamiento:** El nombre del speaker ya se encuentra en el campo 'main_speaker'. El título de charla ya se encuentra en el campo 'title'. Se elimina el campo.

</div>

In [302]:
df_main[['name','title']].head()

Unnamed: 0,name,title
0,Olúfẹ́mi Táíwò: Why Africa must become a ...,Why Africa must become a center of knowledge a...
1,Sethembile Msezane: Living sculptures that sta...,Living sculptures that stand for history's truths
2,OluTimehin Adegbeye: Who belongs in a city?,Who belongs in a city?
3,Pierre Thiam: A forgotten ancient grain that c...,A forgotten ancient grain that could help Afri...
4,Augie Picado: The real reason manufacturing jo...,The real reason manufacturing jobs are disappe...


In [303]:
df_main.drop(columns=['name'], inplace=True)

<div class="alert alert-block alert-success">

## Campo: _ratings_
A stringified dictionary of the various ratings given to the talk (inspiring, fascinating, jaw dropping, etc.)

 > **Tratamiento:** 
 * El campo es Texto: "[{ 'id':<...>, 'name':<...>, 'count':<...> }]"
 * El campo 'name' tiene un total de 14 valores posibles. 
 ["Funny", "Beautiful", "Ingenious", "Courageous", "Longwinded", "Confusing", "Informative", "Fascinating", "Unconvincing", "Persuasive", "Jaw-dropping", "OK", "Obnoxious", "Inspiring"]
 * Se genera una columna por cada rating posible

</div>

###  <font color=blue> **Estructura**:</font>  df_ratings

| ix | rating_funny | rating_beautiful | rating_ingenious | ... | rating_<14> | 
| :--- | :--- | :--- | :--- | --- | --- |
| 0 | 19645 | 4573 | 6073 | ... | ... |
| 1 | 544 | 58 | 56 | ... | ... |
| 2 | 964 | 60 | 183 | ... | ... | 
| ... | ...| ... | ... | ... | ... |
| 2549 | ...| ... | ... | ... | ... |

In [304]:
lst_ratings = ["Funny", "Beautiful", "Ingenious",
               "Courageous", "Longwinded", "Confusing",
               "Informative", "Fascinating", "Unconvincing",
               "Persuasive", "Jaw-dropping", "OK",
               "Obnoxious", "Inspiring"]

In [305]:
df_main['ratings'][0]

"[{'id': 1, 'name': 'Beautiful', 'count': 9}, {'id': 8, 'name': 'Informative', 'count': 37}, {'id': 24, 'name': 'Persuasive', 'count': 30}, {'id': 11, 'name': 'Longwinded', 'count': 1}, {'id': 21, 'name': 'Unconvincing', 'count': 1}, {'id': 26, 'name': 'Obnoxious', 'count': 1}, {'id': 10, 'name': 'Inspiring', 'count': 31}, {'id': 3, 'name': 'Courageous', 'count': 8}, {'id': 9, 'name': 'Ingenious', 'count': 4}, {'id': 22, 'name': 'Fascinating', 'count': 15}, {'id': 25, 'name': 'OK', 'count': 6}, {'id': 2, 'name': 'Confusing', 'count': 0}, {'id': 7, 'name': 'Funny', 'count': 0}, {'id': 23, 'name': 'Jaw-dropping', 'count': 0}]"

In [306]:
all_ratings = []

df_ratings = df_main['ratings']

for ix, ratings in df_ratings.items():
    
    rec_rating = {}
    rec_rating['ix'] = ix
    
    # String to Python Data Type
    ratings_lst = ast.literal_eval(ratings)

    for rating in ratings_lst:   
        rec_rating["rating_"+rating['name'].lower()]=rating['count']

    all_ratings.append(rec_rating)

In [307]:
df_ratings =  pd.DataFrame(all_ratings,
                           index=[ dic['ix'] for dic in all_ratings],
                           columns=["rating_funny","rating_beautiful","rating_ingenious",
                                    "rating_courageous","rating_longwinded","rating_confusing",
                                    "rating_informative","rating_fascinating","rating_unconvincing",
                                    "rating_persuasive","rating_jaw-dropping","rating_ok",
                                    "rating_obnoxious","rating_inspiring"] )

In [308]:
len(df_ratings)

2550

In [309]:
df_ratings.shape

(2550, 14)

In [310]:
df_ratings.head()

Unnamed: 0,rating_funny,rating_beautiful,rating_ingenious,rating_courageous,rating_longwinded,rating_confusing,rating_informative,rating_fascinating,rating_unconvincing,rating_persuasive,rating_jaw-dropping,rating_ok,rating_obnoxious,rating_inspiring
0,0,9,4,8,1,0,37,15,1,30,0,6,1,31
1,6,41,7,43,3,1,15,14,5,2,3,1,5,35
2,19,54,12,77,6,1,71,34,2,63,16,3,1,72
3,2,7,19,6,1,0,78,28,3,36,3,0,0,73
4,4,15,3,2,11,4,157,35,38,62,7,9,18,18


<div class="alert alert-block alert-success">

## Campo: tags
The themes associated with the talk

 > **Tratamiento:** 
 * El campo es Texto: "['tag_1', 'tag_2', ... , 'tag']"
 * Existe un numero muy alto de posibles valores.
 * Se van solo a utilizar unicamente la siguiente lista (**26 tags**):
 ["technology", "science", "design", "business", "collaboration", "innovation", "social_change", "health", "nature", "environment", "future", "communication", "activism", "children", "personal_growth", "humanity", "society", "identity", "community", "culture", "global_issues", "entertainment", "art", "politics", "economics", "religion"]
 * Se genera una columna ('tag_<...>') por cada 'tag' en la lista y una columna adicional ('tag_other') donde se suman todas las restantes.

</div>

###  <font color=blue> **Estructura**:</font>  df_tags

| ix | tag_children | tag_creativity | tag_culture | ... | tag_other |
| :--- | :--- | :--- | :--- | --- | --- |
| 0 | 1 | 1 | 1 | ... | 1 |
| 1 | 0 | 0 | 0 | ... | 3 |
| 2 | 0 | 0 | 0 | ... | 2 |
| ... | ...| ... | ... | ... | ... |
| 2549 | ...| ... | ... | ... | ... |

In [311]:
df_main['tags'].head()

0    ['Africa', 'agriculture', 'history', 'leadersh...
1    ['Africa', 'activism', 'art', 'community', 'hi...
2    ['Africa', 'activism', 'cities', 'government',...
3    ['Africa', 'agriculture', 'farming', 'food', '...
4    ['business', 'capitalism', 'collaboration', 'e...
Name: tags, dtype: object

In [312]:
lst_tags = ["technology", "science", "design", "business", "collaboration", "innovation", "social_change",
            "health", "nature", "environment", "future", "communication", "activism", "children",
            "personal_growth", "humanity", "society", "identity", "community", "culture", "global_issues",
            "entertainment", "art", "politics", "economics", "religion"]

In [313]:
all_tags = []
df_tags = df_main['tags']

for ix, tags in df_tags.items():

    # --- Empty Dict ---
    rec_tag = {}
    rec_tag['ix'] = ix
    for tag in lst_tags:
        rec_tag["tag_"+tag.lower().replace(" ","_")]=0
    rec_tag['tag_other']=0
    
    # --- Update Dict ---
    tags = ast.literal_eval(tags)
    for tag in tags:
        
        if "tag_"+tag.lower() in rec_tag.keys():
            rec_tag["tag_"+tag.lower().replace(" ","_")]+=1
        else:
            rec_tag["tag_other"]+=1
            
    all_tags.append(rec_tag)

In [314]:
columns=[ "tag_"+tag.lower() for tag in lst_tags]
columns.append("tag_other")
print(columns)

['tag_technology', 'tag_science', 'tag_design', 'tag_business', 'tag_collaboration', 'tag_innovation', 'tag_social_change', 'tag_health', 'tag_nature', 'tag_environment', 'tag_future', 'tag_communication', 'tag_activism', 'tag_children', 'tag_personal_growth', 'tag_humanity', 'tag_society', 'tag_identity', 'tag_community', 'tag_culture', 'tag_global_issues', 'tag_entertainment', 'tag_art', 'tag_politics', 'tag_economics', 'tag_religion', 'tag_other']


In [315]:
df_tags =  pd.DataFrame(all_tags,
                        index=[ row['ix'] for row in all_tags ],
                        columns=columns)

In [316]:
df_tags.head()

Unnamed: 0,tag_technology,tag_science,tag_design,tag_business,tag_collaboration,tag_innovation,tag_social_change,tag_health,tag_nature,tag_environment,...,tag_identity,tag_community,tag_culture,tag_global_issues,tag_entertainment,tag_art,tag_politics,tag_economics,tag_religion,tag_other
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,9
1,0,0,0,0,0,0,0,0,0,0,...,1,1,0,0,0,1,0,0,0,5
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,4
4,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,6


In [317]:
len(df_tags)

2550

In [318]:
df_tags.shape

(2550, 27)

In [29]:
df_main.head(2)

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,related_talks,speaker_occupation,tags,title,url,views,film_year,film_month,published_year,published_month
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,2006-02-24 21:00:00,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,2006-06-26 21:11:00,...,"[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,2006,2,2006,6
1,265,With the same humor and humanity he exuded in ...,977,TED2006,2006-02-24 21:00:00,43,Al Gore,Al Gore: Averting the climate crisis,1,2006-06-26 21:11:00,...,"[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2006,2,2006,6


In [30]:
df_main['url'][1]

'https://www.ted.com/talks/al_gore_on_averting_climate_crisis\r\n'

In [32]:
df_main['related_talks'][1]

'[{\'id\': 243, \'hero\': \'https://pe.tedcdn.com/images/ted/566c14767bd62c5ff760e483c5b16cd2753328cd_2880x1620.jpg\', \'speaker\': \'Al Gore\', \'title\': \'New thinking on the climate crisis\', \'duration\': 1674, \'slug\': \'al_gore_s_new_thinking_on_the_climate_crisis\', \'viewed_count\': 1751408}, {\'id\': 547, \'hero\': \'https://pe.tedcdn.com/images/ted/89288_800x600.jpg\', \'speaker\': \'Ray Anderson\', \'title\': \'The business logic of sustainability\', \'duration\': 954, \'slug\': \'ray_anderson_on_the_business_logic_of_sustainability\', \'viewed_count\': 881833}, {\'id\': 2093, \'hero\': \'https://pe.tedcdn.com/images/ted/146d88845861cbf768bbf8bec8b2e41f8bfc7903_2400x1800.jpg\', \'speaker\': \'Lord Nicholas Stern\', \'title\': \'The state of the climate — and what we might do about it\', \'duration\': 993, \'slug\': \'lord_nicholas_stern_the_state_of_the_climate_and_what_we_might_do_about_it\', \'viewed_count\': 773779}, {\'id\': 2784, \'hero\': \'https://pe.tedcdn.com/imag

<div class="alert alert-block alert-success">

## Join: df_ratings, df_tags

</div>

In [67]:
df_main = df_main.merge(df_ratings, left_index=True, right_index=True)
df_main.drop(columns=['ratings'], inplace=True)

In [68]:
df_main = df_main.merge(df_tags, left_index=True, right_index=True)
df_main.drop(columns=['tags'], inplace=True)

In [69]:
df_main.describe()

Unnamed: 0,comments,duration,languages,num_speaker,views,rating_funny,rating_beautiful,rating_ingenious,rating_courageous,rating_longwinded,...,tag_identity,tag_community,tag_culture,tag_global_issues,tag_entertainment,tag_art,tag_politics,tag_economics,tag_religion,tag_other
count,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,...,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0,2550.0
mean,191.562353,826.510196,27.326275,1.028235,1698297.0,154.468627,192.293725,150.739608,164.723529,32.683922,...,0.044314,0.058039,0.190588,0.0,0.117255,0.086667,0.050196,0.064314,0.021961,5.281176
std,282.315223,374.009138,9.563452,0.207705,2498479.0,589.137728,477.375664,283.800437,433.805453,41.608618,...,0.205832,0.233863,0.392842,0.0,0.321787,0.281401,0.218392,0.245359,0.146584,3.36736
min,2.0,135.0,0.0,1.0,50443.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,63.0,577.0,23.0,1.0,755792.8,8.0,26.0,26.0,20.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
50%,118.0,848.0,28.0,1.0,1124524.0,21.0,68.0,69.0,51.0,19.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,221.75,1046.75,33.0,1.0,1700760.0,92.0,190.75,170.75,149.0,41.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
max,6404.0,5256.0,72.0,5.0,47227110.0,19645.0,9437.0,6073.0,8668.0,447.0,...,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,22.0


In [70]:
df_main.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,tag_identity,tag_community,tag_culture,tag_global_issues,tag_entertainment,tag_art,tag_politics,tag_economics,tag_religion,tag_other
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,24-02-2006,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,26-06-2006,...,0,0,1,0,0,0,0,0,0,5
1,265,With the same humor and humanity he exuded in ...,977,TED2006,24-02-2006,43,Al Gore,Al Gore: Averting the climate crisis,1,26-06-2006,...,0,0,1,0,0,0,0,0,0,5
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,23-02-2006,26,David Pogue,David Pogue: Simplicity sells,1,26-06-2006,...,0,0,0,0,1,0,0,0,0,7
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,25-02-2006,35,Majora Carter,Majora Carter: Greening the ghetto,1,26-06-2006,...,0,0,0,0,0,0,1,0,0,5
4,593,You've never seen data presented like this. Wi...,1190,TED2006,21-02-2006,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,27-06-2006,...,0,0,0,0,0,0,0,1,0,9


In [73]:
df_main.columns

Index(['comments', 'description', 'duration', 'event', 'film_date',
       'languages', 'main_speaker', 'name', 'num_speaker', 'published_date',
       'related_talks', 'speaker_occupation', 'title', 'url', 'views',
       'rating_funny', 'rating_beautiful', 'rating_ingenious',
       'rating_courageous', 'rating_longwinded', 'rating_confusing',
       'rating_informative', 'rating_fascinating', 'rating_unconvincing',
       'rating_persuasive', 'rating_jaw-dropping', 'rating_ok',
       'rating_obnoxious', 'rating_inspiring', 'tag_technology', 'tag_science',
       'tag_design', 'tag_business', 'tag_collaboration', 'tag_innovation',
       'tag_social_change', 'tag_health', 'tag_nature', 'tag_environment',
       'tag_future', 'tag_communication', 'tag_activism', 'tag_children',
       'tag_personal_growth', 'tag_humanity', 'tag_society', 'tag_identity',
       'tag_community', 'tag_culture', 'tag_global_issues',
       'tag_entertainment', 'tag_art', 'tag_politics', 'tag_economics',


### Data: Transcripts

In [24]:
df_transcript.count()[0]

2467

In [25]:
df_transcript.columns

Index(['transcript', 'url'], dtype='object')

In [26]:
df_transcript.head()

Unnamed: 0,transcript,url
0,Good morning. How are you?(Laughter)It's been ...,https://www.ted.com/talks/ken_robinson_says_sc...
1,"Thank you so much, Chris. And it's truly a gre...",https://www.ted.com/talks/al_gore_on_averting_...
2,"(Music: ""The Sound of Silence,"" Simon & Garfun...",https://www.ted.com/talks/david_pogue_says_sim...
3,If you're here today — and I'm very happy that...,https://www.ted.com/talks/majora_carter_s_tale...
4,"About 10 years ago, I took on the task to teac...",https://www.ted.com/talks/hans_rosling_shows_t...
