## Summary
For this project, I used the Criterion Channel's collection for a web-scraping exercise. The Criterion Channel is a streaming service like Netflix, but for the classic films inducted into the Criterion Collection. I am interested in which directors, countries, and decades are most heavily represented in this collection.
## Scraping policy
As far as I could tell, the Criterion Channel has no scraping policy. I encountered no obstacles in scraping the entire contents of their collection multiple times.
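
One way to double-check for a machine-readable scraping policy is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set to show how a path could be tested before scraping. The rules, the `example.com` URLs, and the `allowed` helper are all hypothetical, and none of this reflects the Criterion Channel's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only;
# this is NOT the Criterion Channel's actual file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

def allowed(robots_text, url, user_agent='*'):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)

print(allowed(SAMPLE_ROBOTS, 'https://example.com/films'))
print(allowed(SAMPLE_ROBOTS, 'https://example.com/private/page'))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would fetch the real file instead of parsing an in-memory string.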

## Section 1.1: Scraping the main page
In this section, I ran for loops over the main page of the collection to scrape titles, URLs, directors, countries, and years of release. I filtered out multi-part films that do not contain this data, incorrectly formatted films, and films with broken links.
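
The scraping loops in this section all repeat the same tab and newline cleanup, so it could be factored into one helper. The sketch below is a possible refactoring rather than the code actually used: `clean_cell` and its `strip_trailing_comma` parameter are hypothetical names, the sample strings are made up, and unlike the original loops it also strips surrounding spaces and only removes a trailing comma if one is present.

```python
def clean_cell(text, strip_trailing_comma=False):
    """Remove tabs and newlines from scraped cell text.

    strip_trailing_comma drops one trailing comma, as the country cells need.
    """
    cleaned = text.replace('\t', '').replace('\n', '').strip()
    if strip_trailing_comma and cleaned.endswith(','):
        cleaned = cleaned[:-1]
    return cleaned

# Hypothetical raw strings, shaped like the text get_text() might return
print(clean_cell('\n\t\tJean-Luc Godard\n\t'))
print(clean_cell('\n\t\tFrance,\n\t', strip_trailing_comma=True))
```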

In [1]:
# Make soup
import requests
from bs4 import BeautifulSoup
request = requests.get('https://films.criterionchannel.com/')
soup = BeautifulSoup(request.content, 'html.parser')

In [2]:
# Scrape titles, get rid of tabs and new lines
titles = []
for title in soup.findAll(class_ = "criterion-channel__td criterion-channel__td--title"):
    nt = title.get_text()
    no_t = nt.replace('\t', '')
    no_nt = no_t.replace('\n', '')
    titles.append(no_nt)
print(len(titles))

2502


In [3]:
# Scrape urls
urls = []
for url in soup.findAll('a', href = True):
    urls.append(url.get('href'))
# Only keep urls that correspond to films
urls = urls[3:]
urls = urls[1:-21]
print(len(urls))

2502


In [4]:
# Scrape directors
directors = []
for director in soup.findAll(class_ = 'criterion-channel__td criterion-channel__td--director'):
    nt = director.get_text()
    no_t = nt.replace('\t', '')
    no_nt = no_t.replace('\n', '')
    directors.append(no_nt)
print(len(directors))

2502


In [5]:
# Scrape countries
countries = []
for country in soup.findAll(class_ = 'criterion-channel__td criterion-channel__td--country'):
    nt = country.get_text()
    no_t = nt.replace('\t', '')
    no_nt = no_t.replace('\n', '')
    no_comma = no_nt[:-1]
    countries.append(no_comma)
print(len(countries))

2502


In [6]:
# Scrape years
years = []
for year in soup.findAll(class_ = 'criterion-channel__td criterion-channel__td--year'):
    nt = year.get_text()
    no_t = nt.replace('\t', '')
    no_nt = no_t.replace('\n', '')
    years.append(no_nt)
print(len(years))

2502


In [7]:
# Create dataframe
import pandas as pd
data = pd.DataFrame({'Title': titles, 'Director': directors, 'Country': countries, 'Year': years, 'Url': urls})
# Remove rows without durations (parts > 1 of a film)
data = data[~data['Url'].str.contains('/videos/')]
# Remove two rows with urls that don't work
# ....
data = data.reset_index(drop = True)
print(len(data))

2356


In [8]:
# # Check for broken links, do not run this, it takes a long time
# fourohfour = []
# for url in data['Url']:
#     # 200 = working, 404 = broken
#     fourohfour.append(requests.get(url))
#     print(url)
# print(len(fourohfour))
# # Save as text file (Excel often incorrectly reformats csv files upon opening)
# with open('data/Fourohfour.txt', 'w') as file:
#     for line in fourohfour:
#         file.write("%s\n" % line)
# print(len(fourohfour))

In [9]:
# Open pre-scraped 404 file
with open('data/Fourohfour.txt') as file:
    fourohfour = file.read().splitlines()
# Insert 404 column
data.insert(5, '404', fourohfour)
# Make sure the saved response strings (e.g. '<Response [200]>') are of type str
data['404'] = data['404'].astype(str)
# Remove 404 rows from data
data = data[~data['404'].str.contains('404')]
print(len(data)) # Removed 52 broken links

2356


In [10]:
# Reset index after filtering out rows
data = data.reset_index(drop = True)

In [11]:
data.head()

| | Title | Director | Country | Year | Url | 404 |
|---|---|---|---|---|---|---|
| 0 | 2 or 3 Things I Know About Her | Jean-Luc Godard | France | 1967 | https://www.criterionchannel.com/2-or-3-things... | `<Response [200]>` |
| 1 | Les 3 boutons | Agnès Varda | France | 2015 | https://www.criterionchannel.com/les-3-boutons | `<Response [200]>` |
| 2 | 3 Faces | Jafar Panahi | Iran | 2018 | https://www.criterionchannel.com/3-faces | `<Response [200]>` |
| 3 | 4 Months, 3 Weeks and 2 Days | Cristian Mungiu | Romania | 2007 | https://www.criterionchannel.com/4-months-3-we... | `<Response [200]>` |
| 4 | 4 Quarters | Ashley McKenzie | Canada | 2015 | https://www.criterionchannel.com/4-quarters | `<Response [200]>` |


## Section 1.2: Scraping within each film's own page
To obtain the duration and description of each film, I needed a for loop that visits each film's page via its URL. Because this takes a long time (more than 1,000 films), I saved the results to a .txt or .csv file so I would not have to re-run the scraping each time I tested the code. I wanted to focus on feature-length films for my analysis, so I wrote some code to convert the HH:MM:SS string-type duration data into a "Total Hours" float, and then kept only films longer than 1 hour.
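
The duration conversion described here can be sketched in plain Python before it is applied to the dataframe. This is a minimal illustration, assuming every duration string has at least minutes and seconds (`'H:MM:SS'` or `'MM:SS'`). The function name `duration_to_hours` is hypothetical. As in the cells in this section, seconds are dropped rather than rounded.

```python
def duration_to_hours(duration):
    """Convert 'H:MM:SS' or 'MM:SS' to hours, ignoring seconds, rounded to 2 places."""
    parts = duration.strip().split(':')
    parts = parts[:-1]            # drop the seconds field
    if len(parts) == 1:           # only minutes left: film is shorter than one hour
        parts = ['0'] + parts
    hours, minutes = (int(p) for p in parts)
    return round(hours + minutes / 60, 2)

print(duration_to_hours('1:27:35'))   # 1.45
print(duration_to_hours('45:10'))     # 0.75
```

A feature-length filter then reduces to keeping the entries for which `duration_to_hours(d) > 1`.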

In [12]:
# # Scrape durations, do not run this, it takes a long time
# durations = []
# for url in data['Url']:
#     request = requests.get(url)
#     soup = BeautifulSoup(request.content, 'html.parser')
#     for duration in soup.findAll(class_ = 'duration-container')[:1]:
#         durations.append(duration.get_text())
#     print(url)
# # Save as text file
# with open('data/Durations.txt', 'w') as file:
#     for line in durations:
#         file.write("%s\n" % line)
# print(len(durations))

In [13]:
# Open pre-scraped duration file
with open('data/Durations.txt') as file:
    durations = file.read().splitlines()

In [14]:
# Clean durations
durations = durations[1:]
durations = durations[::3]
durations = [x.strip(' ') for x in durations]

In [15]:
# Insert duration column (skipped if this cell has already been run)
if 'Duration' not in data.columns:
    data.insert(4, 'Duration', durations)

In [16]:
# Remove seconds, keep only hours and minutes
data['Duration'] = data['Duration'].str[:-3]

In [17]:
# Append '0:' to beginning of duration to indicate 0 hours for all films < 1 hour
# that are not formatted consistently with the rest of the data
for i, duration in enumerate(data['Duration']):
    if ':' not in duration:
        data.loc[i, 'Duration'] = '0:' + duration

In [18]:
# Split duration by colon
hours_minutes = data['Duration'].str.split(':', expand = True)

In [19]:
# Insert hours and minutes columns
data.insert(5, 'Hours', hours_minutes[0])
data['Hours'] = data['Hours'].astype(int)
data.insert(6, 'Minutes', hours_minutes[1])
data['Minutes'] = data['Minutes'].astype(int)

In [20]:
# Calculate and insert total hours (vectorized over the whole column)
total_hours = (data['Hours'] + data['Minutes'] / 60).round(2)
if 'Total Hours' not in data.columns:
    data.insert(7, 'Total Hours', total_hours)
# Drop old columns (ignored if they were already dropped)
data = data.drop(columns = ['Minutes', 'Hours', '404'], errors = 'ignore')

In [21]:
# # Scrape descriptions, do not run this, it takes a long time
# descriptions = []
# for url in data['Url']:
#     request = requests.get(url)
#     soup = BeautifulSoup(request.content, 'html.parser')
#     paragraphs = soup.findAll('p')
#     # Select paragraph containing the description
#     paragraphs = paragraphs[1]
#     string = []
#     for x in paragraphs:
#         string.append(str(x))
#     descriptions.append(string[0])
#     print(url)
# # Save to csv (list is incorrectly loaded as text file)
# descriptions = pd.DataFrame({'Description': descriptions})
# descriptions.to_csv('data/Descriptions.csv', index = False)

In [22]:
# Open pre-scraped description file
descriptions = pd.read_csv('data/Descriptions.csv')

In [23]:
# Insert description column
data.insert(5, 'Description', descriptions['Description'])

In [26]:
# Remove films < 1 hour, as these are mostly shorts, not films
data = data[data['Total Hours'] > 1]

In [27]:
# Create decade column (e.g. 1967 -> '1960s')
import numpy as np
decade = (data['Year'].astype(int) // 10 * 10).astype(str) + 's'
# Skip the insert if this cell has already been run
if 'Decade' not in data.columns:
    data.insert(4, 'Decade', decade)



In [28]:
# Replace NaN with 'None'
data = data.fillna('None')

In [29]:
# Save to csv
data.to_csv('data/Criterion.csv', index = False)

In [30]:
# Read csv
data = pd.read_csv('data/Criterion.csv')

In [31]:
data.head(10)

| | Title | Director | Country | Year | Decade | Duration | Description | Total Hours | Url |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 or 3 Things I Know About Her | Jean-Luc Godard | France | 1967 | 1960s | 1:27 | In 2 OR 3 THINGS I KNOW ABOUT HER (2 OU 3 CHOS... | 1.45 | https://www.criterionchannel.com/2-or-3-things... |
| 1 | 3 Faces | Jafar Panahi | Iran | 2018 | 2010s | 1:40 | Iranian master Jafar Panahi’s fourth feature s... | 1.67 | https://www.criterionchannel.com/3-faces |
| 2 | 4 Months, 3 Weeks and 2 Days | Cristian Mungiu | Romania | 2007 | 2000s | 1:53 | Romanian filmmaker Cristian Mungiu shot to int... | 1.88 | https://www.criterionchannel.com/4-months-3-we... |
| 3 | The VI Olympic Winter Games, Oslo 1952 | Tankred Ibsen | Norway | 1952 | 1950s | 1:43 | Director Tancred Ibsen's penchant for depictin... | 1.72 | https://www.criterionchannel.com/the-vi-olympi... |
| 4 | 8½ | Federico Fellini | Italy | 1963 | 1960s | 2:19 | Marcello Mastroianni plays Guido Anselmi, a di... | 2.32 | https://www.criterionchannel.com/81-2 |
| 5 | The IX Olympiad at Amsterdam | | Netherlands | 1928 | 1920s | 4:11 | Made by Istituto Luce, the Italian film compan... | 4.18 | https://www.criterionchannel.com/the-ix-olympi... |
| 6 | IX Olympic Winter Games, Innsbruck 1964 | Theo Hörmann | Austria | 1964 | 1960s | 1:30 | Joy and good humor pervades Theo Hörmann's doc... | 1.5 | https://www.criterionchannel.com/ix-olympic-wi... |
| 7 | 13 Days in France | François Reichenbach… | France | 1968 | 1960s | 1:52 | 13 DAYS IN FRANCE, a personal project for Fren... | 1.87 | https://www.criterionchannel.com/13-days-in-fr... |
| 8 | XIVth Olympiad: The Glory of Sport | Castleton Knight | United Kingdom | 1948 | 1940s | 2:18 | The official film of the Games of the XIV Olym... | 2.3 | https://www.criterionchannel.com/xivth-olympia... |
| 9 | 16 Days of Glory | Bud Greenspan | United States | 1986 | 1980s | 4:44 | Director Bud Greenspan, whose career covering ... | 4.73 | https://www.criterionchannel.com/16-days-of-glory |


In [32]:
data['Description 2'] = data['Director'] + ' - ' + data['Country'] + ' - ' + data['Decade'] + ' - ' + data['Description']

### Topic modeling

In [33]:
documents = data['Description']

In [34]:
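import re
# Load the MySQL stopword list (the path is specific to the author's machine)
with open('C:/Users/HP/Documents/NLP/MySQL_stopwords.txt', 'r', encoding = 'utf-8') as f:
    stop_words = f.read()
# Store as a set for fast membership tests during stopword removal
stop_words = set(re.split(' \t|\n', stop_words))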

In [35]:
import nltk
from nltk import TweetTokenizer
# stop_words = nltk.corpus.stopwords.words('english')
tokenizer = TweetTokenizer()
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
def normalize_corpora(corpora):
    normalized_corpora = []
    for corpus in corpora:
        # Lowercase
        corpus = corpus.lower()
        # Replace slashes with spaces, normalize curly apostrophes, drop possessive 's
        corpus = corpus.replace("/", " ")
        corpus = corpus.replace("’", "'")
        corpus = corpus.replace("'s", "")
        # Replace everything except letters, digits, and apostrophes with a space
        corpus = re.sub(r"[^A-Za-z0-9']+", ' ', corpus)
        # Tokenize
        corpus_tokens = tokenizer.tokenize(corpus)
        # Remove stopwords
        corpus_tokens = [token for token in corpus_tokens if token not in stop_words]
        # Lemmatize, skipping purely numeric tokens
        corpus_tokens = [lemmatizer.lemmatize(token) for token in corpus_tokens if not token.isnumeric()]
        # Remove single characters
        corpus_tokens = [token for token in corpus_tokens if len(token) > 1]
        # Keep only documents that still contain tokens
        if corpus_tokens:
            normalized_corpora.append(corpus_tokens)
    return normalized_corpora
normalized_documents = normalize_corpora(documents)

In [36]:
import gensim
bigram = gensim.models.Phrases(normalized_documents, min_count = 5, threshold = 5, delimiter = b'_')
bigram_model = gensim.models.phrases.Phraser(bigram)

In [37]:
normalized_corpus_bigrams = [bigram_model[post] for post in normalized_documents]
# Create a dictionary representation of the documents.
dictionary = gensim.corpora.Dictionary(normalized_corpus_bigrams)
print('Total Vocabulary Size:', len(dictionary))

Total Vocabulary Size: 15467


In [38]:
# Filter out words that occur in fewer than 5 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below = 5, no_above = 0.5)
print('Total Vocabulary Size:', len(dictionary))

Total Vocabulary Size: 3277
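
To make the effect of `no_below` and `no_above` concrete, here is a rough pure-Python sketch of document-frequency filtering. It is not gensim's implementation (for instance, it ignores gensim's `keep_n` cap), the function name `filter_by_document_frequency` is hypothetical, and the toy corpus is made up with thresholds scaled down to fit four documents.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))      # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

# Toy corpus (made up): 'film' is too common, 'noir' is too rare
toy_docs = [['film', 'war', 'love'],
            ['film', 'war'],
            ['film', 'noir'],
            ['film', 'love']]
print(sorted(filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)))
```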


In [39]:
# Transforming corpus into bag of words vectors
bow_corpus = [dictionary.doc2bow(text) for text in normalized_corpus_bigrams]

In [40]:
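# Note: gensim.models.wrappers (including LdaMallet) was removed in gensim 4.0,
# so this cell requires gensim < 4.0 and a local MALLET installation
import os
from gensim.models.wrappers import LdaMallet
from tqdm import tqdm
MALLET_PATH = 'C:/mallet-2.0.8/bin/mallet'
os.environ['MALLET_HOME'] = 'C:/mallet-2.0.8'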

def topic_model_coherence_generator(corpus, texts, dictionary, 
                                    start_topic_count = 1, end_topic_count = 20, step = 1,
                                    cpus = 8):
    models = []
    coherence_scores = []
    for topic_nums in tqdm(range(start_topic_count, end_topic_count + 1, step)):
        mallet_lda_model = gensim.models.wrappers.LdaMallet(mallet_path = MALLET_PATH, corpus = corpus,
                                                            num_topics = topic_nums, id2word = dictionary,
                                                            iterations = 100, workers = cpus, random_seed = 20210224)
        cv_coherence_model_mallet_lda = gensim.models.CoherenceModel(model = mallet_lda_model, corpus = corpus, 
                                                                     texts = texts, dictionary = dictionary, 
                                                                     coherence = 'c_v')
        coherence_score = cv_coherence_model_mallet_lda.get_coherence()
        coherence_scores.append(coherence_score)
        models.append(mallet_lda_model)
    return models, coherence_scores

In [41]:
end_topic_count = 40
lda_models, coherence_scores = topic_model_coherence_generator(corpus = bow_corpus, texts = normalized_corpus_bigrams,
                                                               dictionary = dictionary, start_topic_count = 1,
                                                               end_topic_count = end_topic_count, step = 1, cpus = 8)

  2%|██                                                                                 | 1/40 [00:23<15:00, 23.10s/it]


KeyboardInterrupt: 

In [None]:
coherence_df = pd.DataFrame({'Number of Topics': range(1, end_topic_count + 1, 1),
                             'Coherence Score': np.round(coherence_scores, 4)})
coherence_df = coherence_df.sort_values(by = 'Coherence Score', ascending = False).head(10)

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

x_ax = range(1, end_topic_count + 1, 1)
y_ax = coherence_scores
# Set the figure background before creating the figure so that it takes effect
plt.rcParams['figure.facecolor'] = 'white'
plt.figure(figsize=(12, 6))
plt.plot(x_ax, y_ax, c = 'r')
xl = plt.xlabel('Number of Topics')
yl = plt.ylabel('Coherence Score')

In [None]:
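# Note: coherence_df is sorted by score, so index[1] selects the model with the
# second-highest coherence score; index[0] would select the top-scoring model
best_model_idx = coherence_df['Number of Topics'].index[1]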
best_lda_model = lda_models[best_model_idx]
best_lda_model.num_topics
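
The mapping between a position in the list of models and the number of topics is easy to get wrong, since position 0 holds the model with `start_topic_count` topics. The sketch below shows one way to pick the top-scoring model directly from a list of scores. `best_topic_count` is a hypothetical helper and the scores are made up.

```python
def best_topic_count(coherence_scores, start_topic_count=1, step=1):
    """Return (list index, number of topics) for the highest coherence score."""
    best_idx = max(range(len(coherence_scores)), key=coherence_scores.__getitem__)
    return best_idx, start_topic_count + best_idx * step

# Made-up scores for models with 1, 2, 3 and 4 topics
toy_scores = [0.31, 0.42, 0.47, 0.44]
idx, n_topics = best_topic_count(toy_scores)
print(idx, n_topics)
```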

In [None]:
topics = [[(term, round(wt, 3)) 
               for term, wt in best_lda_model.show_topic(n, topn=20)] 
                   for n in range(0, best_lda_model.num_topics)]
# for idx, topic in enumerate(topics):
#     print('Topic #'+str(idx+1)+':')
#     print([term for term, wt in topic])
#     print()

In [None]:
pd.set_option('display.max_colwidth', 200)
topics_df = pd.DataFrame([', '.join([term for term, wt in topic])  
                              for topic in topics],
                         columns = ['Topic Desc'],
                         index = range(1, best_lda_model.num_topics + 1)
                         )
topics_df.head()

In [None]:
tm_results = best_lda_model[bow_corpus]

In [None]:
corpus_topics = [sorted(topics, key = lambda record: -record[1])[0] for topics in tm_results]
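
Sorting each document's topic distribution and taking the first element is equivalent to taking the pair with the largest weight. A minimal sketch on a made-up distribution, assuming each document is a list of `(topic_id, weight)` pairs (`dominant_topic` is a hypothetical helper):

```python
def dominant_topic(doc_topics):
    """Return the (topic_id, weight) pair with the largest weight."""
    return max(doc_topics, key=lambda record: record[1])

# Made-up topic distribution for a single document
toy_doc = [(0, 0.10), (1, 0.65), (2, 0.25)]
topic_id, weight = dominant_topic(toy_doc)
# Topic ids are zero-based, so add 1 for a human-readable topic number
print(topic_id + 1, round(weight * 100, 2))
```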

In [None]:
corpus_topic_df = pd.DataFrame()
corpus_topic_df['Document'] = range(0, len(documents))
corpus_topic_df['Dominant Topic'] = [item[0] + 1 for item in corpus_topics]
corpus_topic_df['Contribution %'] = [round(item[1] * 100, 2) for item in corpus_topics]
corpus_topic_df['Topic Desc'] = [topics_df.iloc[t[0]]['Topic Desc'] for t in corpus_topics]
corpus_topic_df['Post'] = documents
corpus_topic_df.head()

In [None]:
topic_stats_df = corpus_topic_df.groupby('Dominant Topic').count()
topic_stats_df = topic_stats_df.drop(['Contribution %', 'Topic Desc', 'Post'], axis = 1)
topic_stats_df.columns = ['# of Docs']
topic_stats_df['% Total Docs'] = round(100 * topic_stats_df['# of Docs'] / sum(topic_stats_df['# of Docs']), 2)
topic_stats_df['Topic Desc'] = topics_df['Topic Desc']
topic_stats_df.sort_values('% Total Docs', ascending = False).head()

In [None]:
relevant_posts = corpus_topic_df.groupby('Dominant Topic') \
.apply(lambda topic_set: (topic_set.sort_values(by=['Contribution %'], ascending=False).iloc[0]))
relevant_posts.sort_values('Contribution %', ascending = False).head()