## Get list of PyCon 2014 videos

In [4]:
import feedparser
pycon_feed_url = "http://pyvideo.org/category/50/pycon-us-2014/rss"
pycon_feed = feedparser.parse(pycon_feed_url)

In [5]:
import numpy as np
import pandas as pd

videos = []
links = pycon_feed.entries[0]['links']

for entry in pycon_feed.entries:
    links = entry['links']
    try:
        video = {
            'title': entry['title'],
            'author': entry['author'],
            'link': entry['link'],
            'summary': entry['summary'],
            'youtube_url': ([link for link in links if link.type == 'video/flv'])[0].href
        }

        videos.append(video)
    except (IndexError, KeyError):
        print("No video or key error for: %s" % video['title'])
            
videos = pd.DataFrame(videos)
print("Got %s videos from RSS" % len(videos))

No video or key error for: Cheap Helicopters In My Living Room
No video or key error for: Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm
No video or key error for: Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm
No video or key error for: Know Thy Neighbor: Scikit and the K-Nearest Neighbor Algorithm
No video or key error for: Pushing Python: Building a High Throughput, Low Latency System
No video or key error for: Let's Learn Twisted Python
No video or key error for: How to formulate a (science) problem and analyze it using Python code
Got 133 videos from RSS


In [6]:
videos.head(2)

Unnamed: 0,author,link,summary,title,youtube_url
0,Niko Skrypnik,http://www.pyvideo.org/video/2676/2d3d-graphic...,"<p>Abstract</p>\nGames, application with rich ...",2D/3D graphics with Python on mobile platforms,http://www.youtube.com/watch?v=Y6pLr11yf-4
1,Julien Phalip,http://www.pyvideo.org/video/2660/advanced-tec...,<p>Abstract</p>\nThis talk will present multip...,Advanced techniques for Web functional testing,http://www.youtube.com/watch?v=St9cL47_1GI


## Download Subtitles from YouTube

In [10]:
import youtube_dl

### Download subtitles and add youtube_id column

In [82]:
# options: https://github.com/rg3/youtube-dl/blob/master/youtube_dl/YoutubeDL.py#L119
ydl_opts = {
    #'listsubtitles': True,
    'outtmpl': '/work/subtitles/%(id)s',
    'writesubtitles': True,
    'writeautomaticsub': True,
    'skip_download': True,
    'no_warnings': True,
    'quiet': True
}

In [84]:
# test getting youtube info
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    url = videos.ix[0].youtube_url
    youtube_vars = ydl.extract_info(url)
    print(youtube_vars['duration'])


1483


In [86]:
videos['youtube_id'] = None
videos['duration'] = None

for video_ix in videos.index:
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        url = videos.ix[video_ix].youtube_url
        video_info = ydl.extract_info(url)
        videos['youtube_id'][videos.index == video_ix] = video_info['id']
        videos['duration'][videos.index == video_ix] = video_info['duration']


In [87]:
# save so we don't have to re-download
videos_save_file = '../videos.csv'
videos.to_csv(videos_save_file, index=False)

In [88]:
# read from saved file
videos = pd.read_csv(videos_save_file)

In [89]:
videos.head(2)

Unnamed: 0,author,link,summary,title,youtube_url,subtitles,youtube_id,duration
0,Niko Skrypnik,http://www.pyvideo.org/video/2676/2d3d-graphic...,"<p>Abstract</p>\nGames, application with rich ...",2D/3D graphics with Python on mobile platforms,http://www.youtube.com/watch?v=Y6pLr11yf-4,thank you guys let's start first ball good aft...,Y6pLr11yf-4,1483
1,Julien Phalip,http://www.pyvideo.org/video/2660/advanced-tec...,<p>Abstract</p>\nThis talk will present multip...,Advanced techniques for Web functional testing,http://www.youtube.com/watch?v=St9cL47_1GI,good afternoon everyone now we have my friend...,St9cL47_1GI,1794


### Put subtitles in subtitles column

#### SRT Format
1. A number indicating which subtitle it is in the sequence.
2. The time that the subtitle should appear on the screen, and then disappear.
3. The subtitle itself.
4. A blank line indicating the start of a new subtitle.

<pre>
1
00:00:06,759 --> 00:00:09,559
thank you guys let's start first ball
good afternoon

2
00:00:09,559 --> 00:00:14,070
by from there's on my name is really
cause great nique
</pre>


In [92]:
videos['subtitles'] = None

for video_ix in videos.index:
    youtube_id = videos.ix[video_ix].youtube_id
    
    next_item = 'number'
    captions = []
    try:
        path = '/work/subtitles/%s.en.srt' % youtube_id
        with open(path, 'r') as f:
            for line in f.readlines():
                if next_item == 'number':
                    next_item = 'timecode'
                elif next_item == 'timecode':
                    next_item = 'content'
                elif next_item == 'content':
                    captions.append(line.strip())
                    
                if line == '\n':
                    next_item = 'number'
                    
    except FileNotFoundError:
        # print('Cannot find file: %s' % path)
        pass

    
    videos['subtitles'][videos.index == video_ix] = ' '.join(captions)


A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [93]:
videos.head(2)

Unnamed: 0,author,link,summary,title,youtube_url,subtitles,youtube_id,duration
0,Niko Skrypnik,http://www.pyvideo.org/video/2676/2d3d-graphic...,"<p>Abstract</p>\nGames, application with rich ...",2D/3D graphics with Python on mobile platforms,http://www.youtube.com/watch?v=Y6pLr11yf-4,thank you guys let's start first ball good aft...,Y6pLr11yf-4,1483
1,Julien Phalip,http://www.pyvideo.org/video/2660/advanced-tec...,<p>Abstract</p>\nThis talk will present multip...,Advanced techniques for Web functional testing,http://www.youtube.com/watch?v=St9cL47_1GI,good afternoon everyone now we have my friend...,St9cL47_1GI,1794


In [94]:
videos_with_subtitles = videos[(videos['subtitles'] != "")].copy()
print("Got %s videos left out of %s" % (len(videos_with_subtitles), len(videos)))

Got 79 videos left out of 133


## k-means clustering

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

In [51]:
num_of_clusters = 6
test_tokens = ['docker', 'learning', 'database', 'automation', 'async', 'robots', 'geographic']
vectorizer = CountVectorizer(stop_words='english', vocabulary=test_tokens) # CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(videos_with_subtitles['subtitles'])
print("Number of features (unique words) extracted: %s" % len(vectorizer.get_feature_names()))
# print(vectorizer.stop_words_)
# print(vectorizer.get_feature_names())

Number of features (unique words) extracted: 7


In [52]:
km = KMeans(n_clusters=num_of_clusters, init='k-means++', n_init=1, verbose=0)
km.fit(X)

KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=6, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [53]:
videos_with_subtitles['cluster'] = km.labels_
for cluster in range(num_of_clusters):
    this_cluster = videos_with_subtitles[videos_with_subtitles['cluster'] == cluster].sort(columns=['author'])

    print("Cluster: %s " % cluster)
    for video in this_cluster.iterrows():
        print("%34s %s" % (video[1]['author'], video[1]['title']))


Cluster: 0 
                   Craig Kerstiens Postgres Performance for Humans
                     Hannu Krosing PostgreSQL is Web Scale (Really :) )
                        Mike Bayer Building the App
Cluster: 1 
                    Allison Kaptur Import-ant Decisions
                  Amjith Ramanujam Introduction to Docker
       Asheesh Laroia,Karen Rustad Turn Your Computer Into a Server
                  Ashwini Oruganti Kneel And Disconnect: Getting The Fastest Connection Out Of A Hostname
   Augie Fackler,Nathaniel Manista Deliver Your Software In An Envelope
                 Benjamin Peterson Garbage Collection in Python
                   Blaise Laflamme Upgrade your Web Development Toolchain
                    Bradley Angell Smart Dumpster: Employing Python to Report Real-Time Resource
                    Brandon Rhodes The Day of the EXE Is Upon Us
                    Brandon Rhodes All Your Ducks In A Row: Data Structures in the Standard Library and Beyond
              

## Number Words

In [101]:
#sklearn.feature_extraction.text
videos_with_subtitles.ix[2]['subtitles'].split()

# videos_with_subtitles['word_count'] = videos_with_subtitles.
videos_with_subtitles['words_count'] = videos_with_subtitles['subtitles'].apply(lambda s: len(s.split()))
videos_with_subtitles['words_count'] = videos_with_subtitles['subtitles'].apply(lambda s: len(s.split()))
videos_with_subtitles['words_per_min'] = videos_with_subtitles['words_count'] / (videos_with_subtitles['duration'] / 60)

videos_with_subtitles.head(2)

Unnamed: 0,author,link,summary,title,youtube_url,subtitles,youtube_id,duration,words_count,words_per_min
0,Niko Skrypnik,http://www.pyvideo.org/video/2676/2d3d-graphic...,"<p>Abstract</p>\nGames, application with rich ...",2D/3D graphics with Python on mobile platforms,http://www.youtube.com/watch?v=Y6pLr11yf-4,thank you guys let's start first ball good aft...,Y6pLr11yf-4,1483,3589,145.205664
1,Julien Phalip,http://www.pyvideo.org/video/2660/advanced-tec...,<p>Abstract</p>\nThis talk will present multip...,Advanced techniques for Web functional testing,http://www.youtube.com/watch?v=St9cL47_1GI,good afternoon everyone now we have my friend...,St9cL47_1GI,1794,4575,153.010033
