# Mining TED Talk Scripts

In [1]:
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contraction import CONTRACTION_MAP
import unicodedata

In [2]:
import en_core_web_sm
nlp = en_core_web_sm.load()

## Importing TED data and TED Scripts

In [3]:
transcript = pd.read_csv('transcripts.csv')
ted = pd.read_csv('ted_main.csv')
ted_new = ted[['main_speaker','related_talks','tags','title','url']]
data = pd.merge(transcript,ted_new,on='url')

## Wrangling before Mining

Data wrangling is an important step before applying machine learning algorithms. For text, the major wrangling steps are:

- Removing Accented characters
- Expanding Contractions
- Removing Special Characters
- Removing Stop Words
- Lemmatization
- Stemming
- Removing unnecessary White spaces

### Removing Accented Characters

Accented characters are characters that carry accents or other marks above them, such as á, à, â, é, è, ê, í, ì, î, ó. Replacing them with their plain counterparts is important before analysis.

In [4]:
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text
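
As a quick check of the function above, the sketch below repeats its definition and applies it to a made-up sample string (the sample text is an assumption, not taken from the TED data).

```python
import unicodedata

def remove_accented_chars(text):
    # NFKD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with errors='ignore' then drops the combining marks
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

sample = 'Sómě Áccěntěd těxt, café, naïve'  # hypothetical sample string
cleaned = remove_accented_chars(sample)
print(cleaned)
```

Note that characters with no decomposition into a base letter (for example ø or ß) are dropped entirely rather than replaced.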

### Expanding Contractions

Contractions such as aren't, isn't, they've and they're are common in English. For semantic analysis, expanding them helps to identify negation and negative sentiment in the text.

In [5]:
def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
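
To see the effect on a sentence, here is a condensed sketch of the same logic. `SMALL_MAP` is a hypothetical four-entry stand-in for `CONTRACTION_MAP`, and the sample sentence is made up.

```python
import re

# Hypothetical, tiny stand-in for CONTRACTION_MAP (the real map is much larger)
SMALL_MAP = {"aren't": "are not", "isn't": "is not",
             "they've": "they have", "they're": "they are"}

def expand_contractions(text, contraction_mapping=SMALL_MAP):
    pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                         flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        expanded = contraction_mapping.get(match) or contraction_mapping.get(match.lower())
        # keep the capitalisation of the first character of the original match
        return match[0] + expanded[1:]

    expanded_text = pattern.sub(expand_match, text)
    # drop any apostrophes that were not part of a known contraction
    return re.sub("'", "", expanded_text)

result = expand_contractions("They're sure it isn't broken, but the speaker's slides aren't here.")
print(result)
```

The possessive in "speaker's" is not in the map, so only its apostrophe is removed.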

### Removing Special Characters

TED scripts are transcribed from talks, so they are likely to contain special characters. Removing these characters is essential for further analysis.


In [6]:
def remove_special_characters(text, remove_digits=False):
    # keep letters, whitespace and (optionally) digits; everything else becomes a space
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, ' ', text)
    return text
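
A detail worth checking in patterns of this kind is the letter range: `A-Z` covers only uppercase letters, whereas `A-z` also spans the ASCII punctuation that sits between `Z` and `a`. The sketch below compares the two on a made-up string.

```python
import re

text = 'snake_case [note] ^caret `tick` 100%'  # hypothetical sample string

# A-z also covers [ \ ] ^ _ and `, so those characters survive
loose = re.sub(r'[^a-zA-z0-9\s]', ' ', text)
# A-Z restricts the class to letters, digits and whitespace
strict = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

print(repr(loose))
print(repr(strict))
```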

### Lemmatization

Lemmatization reduces the morphological forms of a word to a common root. For example, 'studies', 'studied' and 'studying' are all treated as the root word 'study'. This lets us count all these forms as a single word when computing frequencies, instead of treating each one as a separate term.

In [7]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

### Removing White Spaces

Typing errors often leave unwanted whitespace in text. Sometimes the same word with and without stray spaces is treated as two different words, so extra whitespace should be removed.

In [8]:
def remove_whitespace(x):
    try:
        # collapse runs of whitespace and strip leading/trailing whitespace
        x = " ".join(x.split())
    except AttributeError:
        # non-string values (e.g. NaN) are returned unchanged
        pass
    return x

#### Applying all the functions to the text

In [9]:
transcript['transcript_clean'] = transcript.transcript.apply(remove_accented_chars)
transcript['transcript_clean'] = transcript.transcript_clean.apply(expand_contractions)
transcript['transcript_clean'] = transcript.transcript_clean.apply(remove_special_characters)
transcript['transcript_clean'] = transcript.transcript_clean.apply(remove_whitespace)

### Removing Stopwords

Text contains stopwords such as 'the', 'an', 'he', 'is' and 'was'. These words are just fillers and have no impact when analysing sentiment, so they can be removed. For the current analysis, 'no' and 'not' are taken out of the stopword list (and therefore kept in the text), because they clearly signal negative sentiment.

In [10]:
stopword_list = nltk.corpus.stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')


## tf-idf Vectorization

Machine learning algorithms take numeric vectors as input, so the text has to be converted into numeric vectors. tf-idf stands for Term Frequency - Inverse Document Frequency. Term frequency measures how often a word occurs in a document: it is the ratio of the number of times the word appears in a document to the total number of words in that document. Inverse document frequency gives more weight to words that are rare across the documents in the corpus and less weight to words that appear in most of them.
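
The definition above can be worked through by hand. The sketch below uses a made-up three-document corpus and the textbook form idf = log(N / df); the helper names are hypothetical. scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalises each row by default, so its numbers will differ from this plain version.

```python
import math

# Hypothetical three-document toy corpus
docs = [
    'music brain music',
    'brain cells',
    'music song',
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf(term, tokens):
    # share of the document's words that are `term`
    return tokens.count(term) / len(tokens)

def idf(term):
    # log of (number of documents / number of documents containing the term)
    df = sum(term in tokens for tokens in tokenized)
    return math.log(n_docs / df)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

music_score = tf_idf('music', tokenized[0])  # frequent in the document, but also common in the corpus
cells_score = tf_idf('cells', tokenized[1])  # appears once, but only in this document
print(round(music_score, 3), round(cells_score, 3))
```

Even though 'music' occurs twice in its document, the rarer word 'cells' ends up with the higher weight.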

In [11]:
from sklearn.feature_extraction import text
Text=transcript['transcript_clean'].tolist()

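# `input` only accepts 'filename', 'file' or 'content' (the default); the documents themselves go to fit_transform
tfidf=text.TfidfVectorizer(stop_words=stopword_list)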

matrix=tfidf.fit_transform(Text)
print(matrix.shape)


(2467, 59850)


## Recommendations

In order to generate recommendations for each talk, we need to measure the similarity between talks. Several similarity measures are available; the most prominent are Jaccard similarity, cosine similarity, Euclidean distance and Manhattan distance.

The cosine similarity metric is the normalized dot product of two vectors. Computing it amounts to finding the cosine of the angle between the two vectors. The cosine of 0° is 1, and it is less than 1 for any other angle.

It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Because tf-idf vectors have no negative entries, the similarity between two talks always lies between 0 and 1.
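
The three cases described above can be checked with a few lines of plain Python. This is a minimal sketch on made-up vectors; the `cosine` helper is hypothetical and stands in for what `cosine_similarity` computes for each pair of rows.

```python
import math

def cosine(u, v):
    # dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])   # parallel, different magnitude
orthogonal = cosine([1, 0, 0], [0, 3, 0])       # 90 degrees apart
opposite = cosine([1, 2, 0], [-1, -2, 0])       # diametrically opposed
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1, regardless of the vectors' lengths.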

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
similar =cosine_similarity(matrix)
similar_df = pd.DataFrame(similar)

In [13]:
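def get_similar_articles(x):
    # argsort is ascending, so the last index is normally the talk itself (similarity 1);
    # [-5:-1] keeps the four next most similar talks, least similar first
    return ",  ".join(data['title'].loc[x.argsort()[-5:-1]])
data['similar_talks']=[get_similar_articles(x) for x in similar]

A new column, similar_talks, is generated containing four related talks for each talk.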
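
The slice `[-5:-1]` used above can be traced on a small made-up similarity row. The sketch uses a plain-Python equivalent of `argsort` so that it runs without NumPy; the row values are assumptions.

```python
# Hypothetical similarity row for talk 0 against six talks (itself included)
row = [1.0, 0.10, 0.80, 0.30, 0.60, 0.05]

# Plain-Python equivalent of argsort: indices ordered by ascending similarity
order = sorted(range(len(row)), key=lambda i: row[i])

picked = order[-5:-1]          # the slice used in the cell above
best_first = order[-2:-6:-1]   # the same talks, most similar first
print(order, picked, best_first)
```

The slice drops the last index (the talk itself) and returns four neighbours in ascending order of similarity; the reversed slice is an alternative if the best match should be listed first.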

In [14]:
#data['title','similar_talks'][12]

print ("The recommended talks for title: {} are \n\n {} ".format(data['title'][12],data['similar_talks'][12]))

The recommended talks for title: My wish: Help me stop pandemics are 

 HIV and flu -- the vaccine strategy,  Lessons from the 1918 flu,  How we'll stop polio for good,  The case for optimism 


In [15]:
print ("The recommended talks for title: {} are \n\n {} ".format(data['title'][1],data['similar_talks'][1]))

The recommended talks for title: Averting the climate crisis are 

 Design and discovery,  A one-man world summit,  A climate solution where all sides can win,  New thinking on the climate crisis 


## Topic Modelling

TED data already comes with tags and categories to group the talks. But what if there were no categories or search tags? Topic modelling provides methods to organize, understand and summarize large collections of text data.

### LDA
LDA, which stands for Latent Dirichlet Allocation, is the most widely used technique. It uses two probability distributions: P(word | topic) and P(topic | document).
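
To make the two distributions concrete, the sketch below shows how they combine: the probability of a word in a document is the sum, over topics, of P(word | topic) weighted by P(topic | document). All numbers and topic names are made up for illustration, and the sketch only shows this mixture, not how LDA estimates the distributions.

```python
# Hypothetical P(word | topic) for two topics over a three-word vocabulary
p_word_given_topic = {
    'science': {'brain': 0.6, 'cells': 0.3, 'music': 0.1},
    'arts':    {'brain': 0.1, 'cells': 0.1, 'music': 0.8},
}
# Hypothetical P(topic | document) for a single talk
p_topic_given_doc = {'science': 0.7, 'arts': 0.3}

def p_word_given_doc(word):
    # mix the per-topic word probabilities, weighted by the document's topic shares
    return sum(p_topic_given_doc[t] * p_word_given_topic[t][word]
               for t in p_topic_given_doc)

word_probs = {w: p_word_given_doc(w) for w in ['brain', 'cells', 'music']}
print(word_probs)
```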

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stopword_list,
                        use_idf=True,
                        ngram_range=(1,1), # considering only 1-grams
                        min_df = 0.05,     # cut words present in less than 5% of documents
                        max_df = 0.3)      # cut words present in more than 30% of documents 

tfidf = vectorizer.fit_transform(transcript['transcript_clean'])


In [17]:
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 10
lda = LatentDirichletAllocation(n_components=n_topics,random_state=0)

topics = lda.fit_transform(tfidf)
top_n_words = 5
t_words, word_strengths = {}, {}
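# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
feature_names = vectorizer.get_feature_names_out().tolist()
for t_id, t in enumerate(lda.components_):
    top_idx = t.argsort()[:-top_n_words - 1:-1]  # indices of the top_n_words largest weights
    t_words[t_id] = [feature_names[i] for i in top_idx]
    word_strengths[t_id] = t[top_idx]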
t_words



{0: ['women', 'brain', 'music', 'data', 'water'],
 1: ['god', 'book', 'building', 'creativity', 'writing'],
 2: ['ca', 'language', 'ok', 'community', 'audience'],
 3: ['universe', 'stars', 'earth', 'planet', 'space'],
 4: ['song', 'oh', 'music', 'film', 'yeah'],
 5: ['god', 'force', 'education', 'push', 'oh'],
 6: ['design', 'ok', 'designers', 'building', 'music'],
 7: ['happiness', 'fuel', 'happy', 'design', 'waste'],
 8: ['news', 'god', 'answers', 'google', 'dollars'],
 9: ['music', 'ends', 'starts', 'africa', 'black']}

### NMF
NMF stands for Non-negative Matrix Factorization. It factors the high-dimensional document-term matrix into two lower-dimensional non-negative matrices: document-topic weights and topic-term weights.
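
The shape of the factorization can be illustrated with two small hand-picked matrices. In scikit-learn's `NMF`, `fit_transform` returns the document-topic matrix and `components_` holds the topic-term matrix; the values below are made up, and the sketch shows only how the two factors reconstruct a document-term matrix, not how NMF finds them.

```python
# Hypothetical non-negative factors: 3 documents, 2 topics, 4 terms
W = [[1.0, 0.0],
     [0.0, 2.0],
     [0.5, 0.5]]               # document-topic weights (3 x 2)
H = [[0.2, 0.8, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]     # topic-term weights (2 x 4)

def matmul(a, b):
    # plain-Python matrix product
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

V = matmul(W, H)               # reconstructed document-term matrix (3 x 4)
for row in V:
    print(row)
```

The third document draws equally on both topics, so its row in `V` has weight on all four terms.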

In [18]:
from sklearn.decomposition import NMF

n_topics = 10
nmf = NMF(n_components=n_topics,random_state=0)

topics = nmf.fit_transform(tfidf)
top_n_words = 5
t_words, word_strengths = {}, {}
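# get_feature_names_out() needs scikit-learn >= 1.0; older releases used get_feature_names()
feature_names = vectorizer.get_feature_names_out().tolist()
for t_id, t in enumerate(nmf.components_):
    top_idx = t.argsort()[:-top_n_words - 1:-1]  # indices of the top_n_words largest weights
    t_words[t_id + 1] = [feature_names[i] for i in top_idx]  # topic keys are 1-based here
    word_strengths[t_id + 1] = t[top_idx]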
t_words

{1: ['god', 'book', 'stories', 'oh', 'art'],
 2: ['music', 'play', 'sound', 'song', 'ends'],
 3: ['women', 'men', 'girls', 'woman', 'sex'],
 4: ['brain', 'brains', 'cells', 'body', 'activity'],
 5: ['water', 'earth', 'planet', 'ocean', 'species'],
 6: ['countries', 'africa', 'government', 'global', 'dollars'],
 7: ['cancer', 'cells', 'patients', 'disease', 'cell'],
 8: ['kids', 'children', 'education', 'students', 'teachers'],
 9: ['city', 'design', 'cities', 'building', 'buildings'],
 10: ['data', 'information', 'computer', 'machine', 'internet']}

### Compare with an example

In [19]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('tfidf', vectorizer),
    ('nmf', nmf)
])

document_id = 8
t = pipe.transform([transcript['transcript'].iloc[document_id]]) 
print('Topic distribution for document #{}: \n'.format(document_id),t)
print('Relevant topics for document #{}: \n'.format(document_id),np.where(t>0.01)[1])
print('\nTranscript:\n',transcript['transcript'].iloc[document_id][:500],'...')

talk = ted[ted['url']==transcript['url'].iloc[document_id]]
print('\nTrue tags from ted_main.csv: \n',talk['tags'])

Topic distribution for document #8: 
 [[0.06924094 0.00939016 0.         0.0490575  0.02995617 0.00534906
  0.         0.03283779 0.01871856 0.01609445]]
Relevant topics for document #8: 
 [0 3 4 7 8 9]

Transcript:
 It's wonderful to be back. I love this wonderful gathering. And you must be wondering, "What on earth? Have they put up the wrong slide?" No, no. Look at this magnificent beast, and ask the question: Who designed it?This is TED; this is Technology, Entertainment, Design, and there's a dairy cow. It's a quite wonderfully designed animal. And I was thinking, how do I introduce this? And I thought, well, maybe that old doggerel by Joyce Kilmer, you know: "Poems are made by fools like me, but only G ...

True tags from ted_main.csv: 
 8    ['God', 'TED Brain Trust', 'atheism', 'brain',...
Name: tags, dtype: object


>  According to NMF, the relevant topics for the above talk are __0, 3, 4, 7, 8, 9__. These are 0-based indices, so index 0 is topic 1 in the NMF dictionary above and index 3 is topic 4. __Index 0 is 'god', 'book', 'stories', 'oh', 'art' and index 3 is 'brain', 'brains', 'cells', 'body', 'activity'__, which are relevant to the values in the __tags__ column: __'God', 'atheism', 'brain'__.

>  According to LDA (see the cell below), the only relevant topic for the above talk is __topic 0, that is 'women', 'brain', 'music', 'data', 'water', whereas the actual tags are 'God', 'atheism', 'brain'__.

>  So, for this talk, __NMF performs better than LDA__.

In [20]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('tfidf', vectorizer),
    ('lda', lda)
])

document_id = 8
t = pipe.transform([transcript['transcript'].iloc[document_id]]) 
print('Topic distribution for document #{}: \n'.format(document_id),t)
print('Relevant topics for document #{}: \n'.format(document_id),np.where(t>0.01)[1])
print('\nTranscript:\n',transcript['transcript'].iloc[document_id][:500],'...')

talk = ted[ted['url']==transcript['url'].iloc[document_id]]
print('\nTrue tags from ted_main.csv: \n',talk['tags'])

Topic distribution for document #8: 
 [[0.93190255 0.00756637 0.00756637 0.00756637 0.00756637 0.00756637
  0.00756637 0.00756637 0.00756637 0.00756652]]
Relevant topics for document #8: 
 [0]

Transcript:
 It's wonderful to be back. I love this wonderful gathering. And you must be wondering, "What on earth? Have they put up the wrong slide?" No, no. Look at this magnificent beast, and ask the question: Who designed it?This is TED; this is Technology, Entertainment, Design, and there's a dairy cow. It's a quite wonderfully designed animal. And I was thinking, how do I introduce this? And I thought, well, maybe that old doggerel by Joyce Kilmer, you know: "Poems are made by fools like me, but only G ...

True tags from ted_main.csv: 
 8    ['God', 'TED Brain Trust', 'atheism', 'brain',...
Name: tags, dtype: object


In [21]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
total_topics = 10

 > __pyLDAvis__ is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an __interactive web-based visualization.__

In [22]:
pyLDAvis.sklearn.prepare(nmf,tfidf,vectorizer, R=10,sort_topics=False)



> The above visualization shows the clusters of topics and how closely they are related. __Cluster 2 is related to terms like music, sound, song and videos, which is why it stands apart. Clusters 6 and 8 overlap, since cluster 6 covers terms such as global, countries, economy and social, while cluster 8 covers kids, children, education and food.__