In [1]:
# Find most relevant terms for each topic using LDA clustering

In [2]:
import pandas as pd
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import nltk
from nltk.corpus import stopwords

In [3]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

In [4]:
df_transcripts = pd.read_csv("transcripts_merged_10.csv")

In [5]:
df_transcripts.head(5)

Unnamed: 0.1,Unnamed: 0,text,MIN(start),MAX(start),SUM(duration),video_id
0,0,"- So today's agenda, we're gonna start with why am I talking China and Vietnam. We're then gonna talk about reform in China in the period leading up to the 4th of June, 1989 which is the Tiananmen Square massacre. We'll then talk about Tiananmen in the 1990s that will lead us into a discussion of what I'm calling the sequencing debate of political and economic reform. Once it was clear that communism was gonna be replaced by capitalism,",8.11,41.73,33.639,4eUS8trd_yI
1,1,"there was a huge debate about, well, is it better to have political reform first or economic reform first. We'll try to do them together and we'll talk about that sequencing debate which will lead us into a larger discussion of what since the 1950s has been known as modernization theory. The thesis that economic modernization will eventually produce demand for",45.73,69.72,26.45,4eUS8trd_yI
2,2,"and the establishment of democracy and that will then leave us to think about the future. So, China and Vietnam Today. - [Narrator] China so far has built the equivalent of Europe's entire housing stock in just 15 years. In November 2015, Beijing replaced the substantially larger 1300 ton Sanyuan Bridge in just 43 hours. Between 1996 and 2016, China has built 2.6 million miles of roads including 70,000 miles of highways",72.18,109.55,36.986,4eUS8trd_yI
3,3,"connecting 95% of the country's villages and overtaking the US as the country with the most extensive highway system by almost 50%. Over the past decade, China has constructed the world's longest high speed rail network. 12,000 miles of rail lines that carry passengers between cities, at speeds up to 180 miles per hour. China now has more high speed rail tracks",112.28,135.52,25.74,4eUS8trd_yI
4,4,"than the rest of the world combined. - So that's one of any number of video clips one could pick to just give a snapshot of the incredible transformation of the Chinese economy over the last couple of decades and indeed, over the last decade, I went to Beijing last year for the first time in about 12 years and 12 years ago, there were lots of potholes in the streets,",138.02,168.288,33.273,4eUS8trd_yI


In [6]:
df_transcripts['prepared_text'] = df_transcripts['text'].str.lower()

In [7]:
len(df_transcripts)

735

In [8]:
# remove stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')
df_transcripts['prepared_text'] = df_transcripts['prepared_text'].apply(lambda words: ' '.join(word.lower() for word in words.split() if word not in stop))

In [9]:
df_transcripts['prepared_text'] = df_transcripts['prepared_text'].str.replace("[^\w\d'\s]+", ' ')

In [10]:
#df_transcripts['text']

In [11]:
split_text = [word for word in df_transcripts['prepared_text'] if not word in stop and len(word) >2] 

In [12]:
split_text[:5]

["  today's agenda  we're gonna start talking china vietnam  we're gonna talk reform china period leading 4th june  1989 tiananmen square massacre  we'll talk tiananmen 1990s lead us discussion i'm calling sequencing debate political economic reform  clear communism gonna replaced capitalism ",
 "huge debate about  well  better political reform first economic reform first  we'll try together we'll talk sequencing debate lead us larger discussion since 1950s known modernization theory  thesis economic modernization eventually produce demand",
 "establishment democracy leave us think future  so  china vietnam today     narrator  china far built equivalent europe's entire housing stock 15 years  november 2015  beijing replaced substantially larger 1300 ton sanyuan bridge 43 hours  1996 2016  china built 2 6 million miles roads including 70 000 miles highways",
 "connecting 95  country's villages overtaking us country extensive highway system almost 50  past decade  china constructed world

In [13]:
tfv = TfidfVectorizer(stop_words = stop, ngram_range = (1,1))

In [14]:
vec_text = tfv.fit_transform(split_text)

In [15]:
words = tfv.get_feature_names()

In [16]:
words[:10]

['00', '000', '10', '1000', '101', '109', '11', '11th', '12', '125']

In [17]:
# now working through https://medium.com/@yanlinc/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6

In [18]:
lda_model = LatentDirichletAllocation(n_components=10)

#https://www.kaggle.com/rajmehra03/topic-modelling-using-lda-and-lsa-in-sklearn
lda_output = lda_model.fit_transform(vec_text)

In [19]:
print(lda_output)  # Model attributes

[[0.01610136 0.01610082 0.01609844 ... 0.27740534 0.01611342 0.01609881]
 [0.01761367 0.01761575 0.01761361 ... 0.01761425 0.01761443 0.01761417]
 [0.01423945 0.01424157 0.01423966 ... 0.01424211 0.01423978 0.01423956]
 ...
 [0.01568424 0.01567901 0.01568191 ... 0.01568102 0.01567904 0.01568056]
 [0.01734458 0.01734372 0.01734201 ... 0.01734781 0.01733985 0.01733866]
 [0.01954219 0.01954375 0.01954244 ... 0.01954272 0.0195446  0.01954417]]


In [20]:
df_documents = pd.DataFrame(lda_output)

In [21]:
len(df_documents)

735

In [22]:
words = tfv.get_feature_names()

for i, comp in enumerate(lda_model.components_):
    words_comp = dict(zip(words, comp))
    sorted_words = sorted(words_comp.items(), reverse=True, key=lambda item: item[1])
    print("Document", i)
    for w in sorted_words[:10]:
        print(w[0], w[1])
    print("\n")

Document 0
people 2.397345564562909
gonna 2.179512407361763
would 1.582874786129593
political 1.5294461597233258
nato 1.4451920410058396
said 1.4397287449954765
economy 1.3661680511117242
course 1.3614492571614536
like 1.35151843133902
war 1.3033961916203811


Document 1
people 2.7441351120994835
union 2.123135490205857
soviet 1.9721164639892805
going 1.9332310933834553
one 1.8983638558047427
talk 1.8308145050124098
gonna 1.777455180579476
say 1.6232300368209378
war 1.610417565470093
lot 1.576049973674321


Document 2
party 2.2648206218383176
gonna 1.7659115398045322
parties 1.7117004587554128
think 1.4415889323523101
right 1.4261333798286102
people 1.3842676733357067
might 1.2910339846313008
would 1.2471329340948025
one 1.1942470516051171
systems 1.191653662126941


Document 3
like 1.9469354832160555
nato 1.6341158409171637
people 1.594126063710462
male 1.4884593190230635
speaker 1.4635501218003635
actually 1.3866745120834103
democrats 1.142889119169288
think 1.1412703213296491
cheeri

In [23]:
# make a prediction# make a prediction
vec_text = ["people many think like working know getting rich",
           "giving devane lectures",
           "might saying china vietnam interested called reform"]
lda_model.fit_transform(tfv.transform(vec_text))

array([[0.02671357, 0.75957785, 0.02671357, 0.02671357, 0.02671357,
        0.02671357, 0.02671357, 0.02671357, 0.02671357, 0.02671357],
       [0.03673052, 0.03673052, 0.03673052, 0.03673052, 0.03673052,
        0.03673052, 0.66942536, 0.03673052, 0.03673052, 0.03673052],
       [0.02809101, 0.02809101, 0.02809101, 0.02809101, 0.02809101,
        0.02809101, 0.74718092, 0.02809101, 0.02809101, 0.02809101]])

In [24]:
# view top document matches for a particular category

In [25]:
df_all = pd.concat([df_documents, df_transcripts], axis=1)
df_all.sort_values(2, ascending=False).head(20)

Unnamed: 0.1,0,1,2,3,4,5,6,7,8,9,Unnamed: 0,text,MIN(start),MAX(start),SUM(duration),video_id,prepared_text
153,0.014033,0.014033,0.873689,0.014032,0.014043,0.014035,0.014032,0.014033,0.014037,0.014033,153,"even though a lot of their membership didn't want it. So terrified were they of the prospect of another election in which the far right would do even better. So we thought German politics was kind of settling down at this point but the following year, this is what you see happening. - [Presnter] German Chancellor, Angela Merkel, who's led Germany for 13 years has offered to step down as her party's leader and said she won't run for office again after her term ends in 2021.",868.56,898.06,31.126,BDqvzFY72mg,even though lot membership want it terrified prospect another election far right would even better thought german politics kind settling point following year see happening presnter german chancellor angela merkel who's led germany 13 years offered step party's leader said run office term ends 2021
327,0.014361,0.014364,0.870719,0.014366,0.014369,0.014363,0.014362,0.014365,0.014363,0.014367,327,"On the contrary, there may be a trade-off between creating employment for the long-term unemployed and protecting the wages of unionized workers. So it's not surprising that even in multi-party systems you might think that unionized workers as represented by traditional left of center parties Are going to be less well protected and less effective as instruments of solidarity Among others whose income is below the median voter. So, how might this play out in a multi-party system?",3433.91,3481.226,39.133,T3-VlQu3iRM,contrary may trade off creating employment long term unemployed protecting wages unionized workers surprising even multi party systems might think unionized workers represented traditional left center parties going less well protected less effective instruments solidarity among others whose income median voter so might play multi party system
58,0.014369,0.014371,0.870679,0.014363,0.014373,0.014371,0.014368,0.014369,0.014371,0.014366,58,"have come later after the economic take off, but also there're country. Russia is an extremely well educated population, Ethiopia has now got every child in school. They're one of the first African countries to have such high levels of K through 12 education but they have not been replicating this success. There was a less entrenched command communist system, the Moleskine London piece put a lot of credence in this idea",2060.3,2097.912,31.67,4eUS8trd_yI,come later economic take off also there're country russia extremely well educated population ethiopia got every child school they're one first african countries high levels k 12 education replicating success less entrenched command communist system moleskine london piece put lot credence idea
222,0.01438,0.014377,0.870657,0.014355,0.014374,0.014367,0.014376,0.014379,0.014364,0.014372,222,"- I started the lecture last Thursday by playing a clip of Michael Foot making fun of Sir Keith Joseph who had been the intellectual architect of Thatcherism. With his conjurer's trick, and said at the end of the day that the joke was on Michael Foot, because he thought Thatcher would be a flash in the pan, and she went on to be Prime Minister for 11 1/2 years and went on fundamentally to restructure the British economy and political landscape as we saw.",9.44,48.53,41.378,T3-VlQu3iRM,started lecture last thursday playing clip michael foot making fun sir keith joseph intellectual architect thatcherism conjurer's trick said end day joke michael foot thought thatcher would flash pan went prime minister 11 1 2 years went fundamentally restructure british economy political landscape saw
168,0.015074,0.015075,0.864327,0.015069,0.015083,0.015072,0.015081,0.015073,0.015073,0.015074,168,"And we will talk about some of the differences among them but now we have new data and whether modernizing economies will produce democracy. It is long been conventional wisdom that democratic systems are incompatible with state-run economies. If we look at what's happened since 1989, we've gone to market economies in some of the post-communist systems but others like China and Vietnam have become state capitalist systems of a certain kind while retaining non democratic politics.",1420.75,1456.72,35.491,BDqvzFY72mg,talk differences among new data whether modernizing economies produce democracy long conventional wisdom democratic systems incompatible state run economies look what's happened since 1989 we've gone market economies post communist systems others like china vietnam become state capitalist systems certain kind retaining non democratic politics
354,0.015113,0.015115,0.863961,0.015113,0.015117,0.015115,0.015116,0.015116,0.015117,0.015117,354,"people both at home in Russia, as the Soviet Union as it then was, and abroad. He seemed to be a different kind of politician, he was extremely charismatic. He was much younger. He could talk and behave like a Western politician and he quickly went on a kind of charm offensive around the world. He developed a strong rapport with Ronald Reagan which led to arms reduction talks in Reykjavik",173.8,204.39,30.936,f5nbT4xQqwI,people home russia soviet union was abroad seemed different kind politician extremely charismatic much younger could talk behave like western politician quickly went kind charm offensive around world developed strong rapport ronald reagan led arms reduction talks reykjavik
96,0.015161,0.015162,0.863506,0.015166,0.015167,0.015178,0.015165,0.015162,0.015165,0.015167,96,"they're flying to places where labor is cheaper. This is the Nike plant in Ho Chi Minh City, which I visited in March of 2001. It was run by a Vietnamese management company, with about 20,000 Vietnamese women working in it there. The whole thing consisted of, it looked like four giant football stadiums. This is about an hour's drive outside of Ho Chi Minh City, which used to be Saigon. It looks like four giant football fields",3306.54,3344.47,38.276,4eUS8trd_yI,they're flying places labor cheaper nike plant ho chi minh city visited march 2001 run vietnamese management company 20 000 vietnamese women working there whole thing consisted of looked like four giant football stadiums hour's drive outside ho chi minh city used saigon looks like four giant football fields
641,0.015482,0.015481,0.860675,0.015479,0.015481,0.01548,0.01548,0.015484,0.01548,0.01548,641,"is evidence of the path they will follow. If there is anything certain today, if there is anything inevitable in the future, it is the will of the people of the world for freedom and for peace. - So there was the creation of NATO, an alliance to face down what was seen as the Soviet threat. The previous month, just after the text of the proposed treaty had been released to the public, Secretary of State, Dean Acheson went on radio.",1129.64,1163.92,34.483,s48b9B5gd88,evidence path follow anything certain today anything inevitable future people world freedom peace creation nato alliance face seen soviet threat previous month text proposed treaty released public secretary state dean acheson went radio
485,0.015526,0.015526,0.860255,0.015527,0.01553,0.015527,0.015526,0.015525,0.015526,0.015529,485,"And here is Michael Foot who is by then the leader of the Labor Party in opposition commenting on Keith Joseph in the House of Commons just to give you a sense of how seriously they really took the threat from Thatcher. - [Michael] I wouldn't like to miss out the right honorable gentleman, the Secretary of State for Industry who's had such a tremendous effect upon the government and our politics altogether.",568.74,596.343,27.11,q53DF6ySOZg,michael foot leader labor party opposition commenting keith joseph house commons give sense seriously really took threat thatcher michael like miss right honorable gentleman secretary state industry who's tremendous effect upon government politics altogether
614,0.015538,0.015536,0.860157,0.015539,0.015539,0.015538,0.015537,0.015537,0.01554,0.015539,614,"it's an alliance, but it it has an institutional presence. So this second lens focuses on institutional arrangements which may or may not be consistent with the way people's interests line up, right? So for instance, George Kennan, who I mentioned to you last time, thought the United Nations was a waste of time because countries always behave in their interests and if the UN told them to do something that wasn't in their interest, they would ignore it.",251.53,276.97,28.47,s48b9B5gd88,alliance institutional presence second lens focuses institutional arrangements may may consistent way people's interests line up right instance george kennan mentioned last time thought united nations waste time countries always behave interests un told something interest would ignore it
