# Data Preparation

In this notebook I will do some data preparation. I will extract additional data from the dataset and prepare it so that it is easier to use in future experiments.

## Let's start by importing what we need

In [1]:
import pandas
import ast

## Let's load the data and have a look at it

In [2]:
data_frame = pandas.read_csv('data/ted_main.csv')
data_frame.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869


## Parse JSON data

Some of the columns hold lists and dictionaries that were serialized as strings so that they can be saved in a CSV file. In order to use them we need to parse them into Python objects. Note that the strings use Python literal syntax (single quotes) rather than strict JSON, which is why the parsing is done with `ast.literal_eval`. For now we only need the ratings column. However, since there are more columns that need parsing in order to use them (e.g. related_talks and tags), I will create a function for that. It parses the given columns of the DataFrame and saves the Python objects back in the same columns.

In [3]:
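def parseJsonInDataframe(df, columns):
    # Parse string-encoded Python literals in the given columns, in place.
    for column in columns:
        if column not in df:
            continue
        for i in range(df.shape[0]):
            # Only parse cells that are still strings (skips parsed or missing values).
            if isinstance(df.at[i, column], str):
                df.at[i, column] = ast.literal_eval(df.at[i, column])
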
print("Type of ratings before parsing:", type(data_frame.ratings[0]))
parseJsonInDataframe(data_frame, ["ratings"])
print("Type of ratings after parsing:", type(data_frame.ratings[0]))

Type of ratings before parsing: <class 'str'>
Type of ratings after parsing: <class 'list'>
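As a small illustration of why `ast.literal_eval` is used here instead of a JSON parser, the sketch below tries both on a shortened string in the style of the ratings column. The sample string is cut down from the first row shown above; the variable names are only for this example.

```python
import ast
import json

# Shortened sample in the style of the ratings column (single-quoted keys).
raw = "[{'id': 7, 'name': 'Funny', 'count': 19645}]"

# Strict JSON requires double quotes, so json.loads rejects this string.
try:
    json.loads(raw)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# ast.literal_eval accepts Python literal syntax and returns a list of dicts.
parsed = ast.literal_eval(raw)

print("Parsed by json.loads:", json_ok)
print("Parsed by ast.literal_eval:", parsed[0]["name"], parsed[0]["count"])
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts and so on), so it does not execute arbitrary code the way `eval` would.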


## Extract impressions

Since there are a lot of types of ratings in the dataset (14 to be exact), I will sum their counts into a column called "impressions". The reason to do this is to have a simpler metric for popularity.

However, not all of the impressions are positive. To take that into account, I will create two more columns: one with the total number of positive impressions and one with the total number of negative impressions.

In [4]:
def addImpressionsColumn(df):
    # The rating names are the same for every row, so define them once.
    positive_impression_types = ["Funny","Beautiful","Ingenious","Courageous","Informative","Fascinating","Persuasive","Jaw-dropping","OK","Inspiring"]
    negative_impression_types = ["Longwinded","Confusing","Unconvincing","Obnoxious"]

    for i in range(0, df.shape[0]):
        # Total count first, then the counts split into positive and negative.
        df.at[i, "impressions"] = sum(rating["count"] for rating in df.ratings[i])
        df.at[i, "positive_impressions"] = sum((rating["count"] if rating["name"] in positive_impression_types else 0) for rating in df.ratings[i])
        df.at[i, "negative_impressions"] = sum((rating["count"] if rating["name"] in negative_impression_types else 0) for rating in df.ratings[i])

addImpressionsColumn(data_frame)
data_frame.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,ratings,related_talks,speaker_occupation,tags,title,url,views,impressions,positive_impressions,negative_impressions
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 19645}, {...","[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,93850.0,92712.0,1138.0
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 544}, {'i...","[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2936.0,2372.0,564.0
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,"[{'id': 7, 'name': 'Funny', 'count': 964}, {'i...","[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2824.0,2473.0,351.0
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,"[{'id': 3, 'name': 'Courageous', 'count': 760}...","[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,3728.0,3572.0,156.0
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,"[{'id': 9, 'name': 'Ingenious', 'count': 3202}...","[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,25620.0,25310.0,310.0
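To make the three sums easier to check by hand, here is a minimal sketch that computes them for a single ratings list. The helper `summarize_ratings` and the sample list (including its ids and counts) are made up for this illustration; only the rating names come from the lists used above.

```python
POSITIVE = {"Funny", "Beautiful", "Ingenious", "Courageous", "Informative",
            "Fascinating", "Persuasive", "Jaw-dropping", "OK", "Inspiring"}
NEGATIVE = {"Longwinded", "Confusing", "Unconvincing", "Obnoxious"}


def summarize_ratings(ratings):
    """Return (total, positive, negative) counts for one parsed ratings list."""
    total = sum(r["count"] for r in ratings)
    positive = sum(r["count"] for r in ratings if r["name"] in POSITIVE)
    negative = sum(r["count"] for r in ratings if r["name"] in NEGATIVE)
    return total, positive, negative


# Hypothetical ratings for one talk, in the same shape as the parsed column.
sample = [
    {"id": 7, "name": "Funny", "count": 10},
    {"id": 1, "name": "Beautiful", "count": 5},
    {"id": 2, "name": "Confusing", "count": 3},
]

total, positive, negative = summarize_ratings(sample)
print(total, positive, negative)  # 18 15 3
```

Because every rating name is in exactly one of the two groups, the positive and negative counts add up to the total.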


## Popularity score

I will define a simple popularity score to use later on. Having a quick look at the data, there are several columns that we need to focus on when talking about popularity:
* comments
* views
* impressions

However, when you think about it, having a lot of views alone is not a very good metric for popularity. Instead, a TED talk should be considered more popular if more of the people who watched it expressed their reaction in some way (comments and impressions). Because of that, I will compute popularity as a weighted sum of the comments and impressions, with the negative impressions subtracted, divided by the weighted views.
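Written as a formula, the score computed by `calculatePopularity` is shown below. The symbols are my shorthand and do not appear in the code: $c$ is the number of comments, $p$ and $n$ are the positive and negative impressions, $v$ is the number of views, and the $w$ terms are the corresponding weights.

$$
\text{score} = \frac{w_c \, c + w_p \, p - w_n \, n}{w_v \, v}
$$

With the default weights used in this notebook ($w_c = 1$, $w_p = 1$, $w_n = 2$, $w_v = 0.01$), a negative impression costs twice as much as a positive impression gains, and the score becomes negative when the weighted negative impressions outweigh the comments and positive impressions.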


In [5]:
def calculatePopularity(row, comments_weight=1, positive_impressions_weight=1, negative_impressions_weight=2, views_weight=0.01):
    weighted_comments = comments_weight * row["comments"]
    weighted_pi = positive_impressions_weight * row["positive_impressions"]
    weighted_ni = negative_impressions_weight * row["negative_impressions"]
    weighted_views = views_weight * row["views"]
    
    return (weighted_comments + weighted_pi - weighted_ni)/weighted_views

scores = []

for i, row in data_frame.iterrows():
    scores.append(calculatePopularity(row))

minScore = min(scores)
maxScore = max(scores)
print("Min Score:", minScore)
print("Max Score:", maxScore)

Min Score: -0.5874016430423047
Max Score: 1.1981828098983751
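As a sanity check, the score of the first talk can be worked out by hand from the values in the table above (4553 comments, 92712 positive and 1138 negative impressions, 47227110 views). The sketch below repeats that arithmetic step by step with the default weights; the variable names are only for this example.

```python
# Values of the first row, taken from the table above.
comments = 4553
positive_impressions = 92712
negative_impressions = 1138
views = 47227110

# Default weights from calculatePopularity.
numerator = 1 * comments + 1 * positive_impressions - 2 * negative_impressions
denominator = 0.01 * views

score = numerator / denominator
print("Numerator:", numerator)      # 94989
print("Score:", round(score, 4))    # roughly 0.2011
```

This raw value sits between the reported minimum and maximum, as expected.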


## Normalize the popularity score

The min and max values are currently fairly arbitrary. It will be better if we rescale the scores to lie between 0 and 1 (min-max normalization).

In [6]:
normalized_scores = [(x-minScore)/(maxScore-minScore) for x in scores]

print("New min:", min(normalized_scores))
print("New max:", max(normalized_scores))

New min: 0.0
New max: 1.0
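The same rescaling can be wrapped in a small function. This is only a sketch: the name `min_max_normalize` is made up, and the guard for the case where all values are equal is an addition of mine that the cell above does not need, since its min and max differ.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# The reported min and max scores, plus an illustrative value in between.
sample_scores = [-0.5874016430423047, 0.2011, 1.1981828098983751]
normalized = min_max_normalize(sample_scores)

print(normalized[0], normalized[-1])  # 0.0 1.0
print(round(normalized[1], 3))
```

Note that min-max normalization preserves the ordering of the scores but depends on the extremes, so a single outlier talk changes every normalized value.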


## Add the new data to the DataFrame

In [7]:
popularity_column = pandas.DataFrame({'popularity': normalized_scores})

data_frame = data_frame.join(popularity_column)
data_frame.head()

Unnamed: 0,comments,description,duration,event,film_date,languages,main_speaker,name,num_speaker,published_date,...,related_talks,speaker_occupation,tags,title,url,views,impressions,positive_impressions,negative_impressions,popularity
0,4553,Sir Ken Robinson makes an entertaining and pro...,1164,TED2006,1140825600,60,Ken Robinson,Ken Robinson: Do schools kill creativity?,1,1151367060,...,"[{'id': 865, 'hero': 'https://pe.tedcdn.com/im...",Author/educator,"['children', 'creativity', 'culture', 'dance',...",Do schools kill creativity?,https://www.ted.com/talks/ken_robinson_says_sc...,47227110,93850.0,92712.0,1138.0,0.441611
1,265,With the same humor and humanity he exuded in ...,977,TED2006,1140825600,43,Al Gore,Al Gore: Averting the climate crisis,1,1151367060,...,"[{'id': 243, 'hero': 'https://pe.tedcdn.com/im...",Climate advocate,"['alternative energy', 'cars', 'climate change...",Averting the climate crisis,https://www.ted.com/talks/al_gore_on_averting_...,3200520,2936.0,2372.0,564.0,0.355374
2,124,New York Times columnist David Pogue takes aim...,1286,TED2006,1140739200,26,David Pogue,David Pogue: Simplicity sells,1,1151367060,...,"[{'id': 1725, 'hero': 'https://pe.tedcdn.com/i...",Technology columnist,"['computers', 'entertainment', 'interface desi...",Simplicity sells,https://www.ted.com/talks/david_pogue_says_sim...,1636292,2824.0,2473.0,351.0,0.393828
3,200,"In an emotionally charged talk, MacArthur-winn...",1116,TED2006,1140912000,35,Majora Carter,Majora Carter: Greening the ghetto,1,1151367060,...,"[{'id': 1041, 'hero': 'https://pe.tedcdn.com/i...",Activist for environmental justice,"['MacArthur grant', 'activism', 'business', 'c...",Greening the ghetto,https://www.ted.com/talks/majora_carter_s_tale...,1697550,3728.0,3572.0,156.0,0.443118
4,593,You've never seen data presented like this. Wi...,1190,TED2006,1140566400,48,Hans Rosling,Hans Rosling: The best stats you've ever seen,1,1151440680,...,"[{'id': 2056, 'hero': 'https://pe.tedcdn.com/i...",Global health expert; data visionary,"['Africa', 'Asia', 'Google', 'demo', 'economic...",The best stats you've ever seen,https://www.ted.com/talks/hans_rosling_shows_t...,12005869,25620.0,25310.0,310.0,0.446907


## Let's have a look at the shape of the data

In [8]:
print("Number of samples: ", data_frame.shape[0])
print("Number of attributes: ", data_frame.shape[1])

Number of samples:  2550
Number of attributes:  21


## Save the data

In [9]:
data_frame.to_csv('data/ted_updated.csv', index=False)
print("Saved!")

Saved!
