# NLP analysis of movie plots: scraping Wikipedia

This is a pet project to explore web scraping and natural language processing (NLP) techniques. The main research question is whether the plot descriptions of high-rated and low-rated movies show distinct patterns. We assume that some elements of a movie plot attract the audience while others do not.

In [1]:
#import essential libraries
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

We use ratings from the IMDb website: https://datasets.imdbws.com/. IMDb publishes data on movies and the average ratings given by its users. The data on the website is updated daily, so we note that the tables used here were downloaded on December 29, 2022.

The first table contains general information on movies and series, such as the title, genres, and year of release.

In [2]:
names=pd.read_csv('data/title.basics.tsv.gz', compression='gzip', header=0, sep="\t")
names.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


We keep only movies (removing series and other title types) to shrink the data and avoid complications when loading plots from Wikipedia.

In [3]:
names_movies=names[names['titleType']=='movie']
names_movies.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 630466 entries, 8 to 9470970
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          630466 non-null  object
 1   titleType       630466 non-null  object
 2   primaryTitle    630466 non-null  object
 3   originalTitle   630466 non-null  object
 4   isAdult         630466 non-null  object
 5   startYear       630466 non-null  object
 6   endYear         630466 non-null  object
 7   runtimeMinutes  630466 non-null  object
 8   genres          630466 non-null  object
dtypes: object(9)
memory usage: 48.1+ MB


Only three columns are necessary for scraping from Wikipedia: the unique IMDb identifier of the film, its title, and its year of release. The latter column is used to differentiate between movies with the same title.

In [4]:
names_movies=names_movies[['tconst','primaryTitle', 'startYear']]
names_movies=names_movies[names_movies['startYear']!="\\N"] #removing movies with unknown year of release
names_movies['startYear']=pd.to_numeric(names_movies['startYear']) #turning year string to integer
names_movies.head()

Unnamed: 0,tconst,primaryTitle,startYear
8,tt0000009,Miss Jerry,1894
144,tt0000147,The Corbett-Fitzsimmons Fight,1897
498,tt0000502,Bohemios,1905
570,tt0000574,The Story of the Kelly Gang,1906
587,tt0000591,The Prodigal Son,1907
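
The `\N` placeholder can also be handled while the file is being parsed. The following is a minimal sketch, assuming a tiny in-memory sample in place of the real file (the sample rows and the names `sample_tsv` and `sample` are illustrative, not part of the project): passing `na_values` to `pd.read_csv` turns `\N` into missing values, so rows with an unknown year can be dropped with `dropna`.

```python
import io
import pandas as pd

# Illustrative sample in the same tab-separated layout as title.basics.tsv.gz
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\n"
    "tt9999999\tmovie\tHypothetical Title\t\\N\n"
)

# Treat the IMDb placeholder \N as a missing value while parsing
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Drop rows without a release year and store the year as an integer
sample = sample.dropna(subset=["startYear"]).astype({"startYear": int})
print(sample[["tconst", "primaryTitle", "startYear"]])
```

With this approach the year column is numeric as soon as the missing rows are removed, so no string comparison against `"\\N"` is needed.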


The other IMDb file we need is the table with ratings.

In [5]:
ratings=pd.read_csv('data/title.ratings.tsv.gz', compression='gzip', header=0, sep="\t")
ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,1930
1,tt0000002,5.8,261
2,tt0000003,6.5,1746
3,tt0000004,5.6,176
4,tt0000005,6.2,2561


We can now merge the two dataframes on their shared `tconst` column.

In [6]:
ratings_complete=ratings.merge(names_movies, on='tconst') #inner join on the IMDb identifier
ratings_complete.head()

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,startYear
0,tt0000009,5.3,201,Miss Jerry,1894
1,tt0000147,5.2,460,The Corbett-Fitzsimmons Fight,1897
2,tt0000502,4.2,14,Bohemios,1905
3,tt0000574,6.0,797,The Story of the Kelly Gang,1906
4,tt0000591,4.4,19,The Prodigal Son,1907
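
To make the join behaviour explicit, here is a small sketch on toy data (the frames `toy_ratings` and `toy_movies` are illustrative stand-ins for the real tables). It assumes each `tconst` appears at most once per table, which `validate="one_to_one"` checks; titles without a match on both sides are dropped by the inner join.

```python
import pandas as pd

# Toy frames standing in for the ratings and names_movies tables
toy_ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000009", "tt0000147"],
    "averageRating": [5.7, 5.3, 5.2],
})
toy_movies = pd.DataFrame({
    "tconst": ["tt0000009", "tt0000147"],
    "primaryTitle": ["Miss Jerry", "The Corbett-Fitzsimmons Fight"],
})

# Inner join on the shared key; validate raises MergeError if tconst is duplicated on either side
toy_merged = toy_ratings.merge(toy_movies, on="tconst", how="inner", validate="one_to_one")
print(toy_merged)
```

The short film `tt0000001` has a rating but is absent from the movie table, so it does not appear in the result.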


Many films are rated by only a few people, so we filter the table by number of votes: a movie is kept for analysis only if at least one thousand people voted for it. We also remove movies released before 1980.

In [7]:
ratings_popular = ratings_complete[ratings_complete['numVotes']>=1000]
ratings_popular = ratings_popular[ratings_popular['startYear'] >= 1980].reset_index(drop = True)
ratings_popular.shape

(31739, 5)

31,739 movies is a good amount for our analysis. We expect this number to shrink once we start looking for plot descriptions on Wikipedia.

## Loading plots from Wikipedia

Uncomment the cell below if the library is not installed:

In [8]:
#pip install wikipedia

In [9]:
import wikipedia
import csv
import os

We write a function that searches for the film title through the `wikipedia` API library and returns the plot text if an article with a plot section is found.

In [10]:
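def plot_wiki(tconst, title, year):
    """Return the plot text of the Wikipedia article for a film, or None if it is not found."""
    plot_words = ['Plot ==\n', 'Synopsis ==\n', 'Content ==\n', 'Plot synopsis ==\n'] #possible plot section titles in wiki
    try:
        wiki_page = wikipedia.page(title+' film '+str(year))
        text = wiki_page.content
    except Exception: #no matching article, ambiguous title or network error
        return None
    for word in plot_words:
        start = text.find(word)
        if start > 0: #find returns -1 if the section title is absent
            start += len(word)-1
            end = text.find('\n\n\n', start) #the section ends at the next triple line break
            if end == -1: #the plot is the last section of the article
                end = len(text)
            return text[start:end].replace('\n', '').replace('\'','')
    return None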
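
The section lookup can be tried out without any network access. The sketch below is an alternative illustration, not the function used in this notebook: it assumes that top-level headings in the article text appear on their own line in the form `== Title ==`, and it uses a regular expression instead of `str.find`. The helper `extract_section` and the text in `sample_text` are hypothetical.

```python
import re

def extract_section(text, titles=("Plot", "Synopsis", "Content", "Plot synopsis")):
    """Return the body of the first matching top-level section, or None."""
    for title in titles:
        # Match a heading line of the exact form "== Title =="
        heading = re.search(r"^== " + re.escape(title) + r" ==[ \t]*$", text, flags=re.MULTILINE)
        if heading is None:
            continue
        body_start = heading.end()
        # The section ends at the next top-level heading, or at the end of the text
        next_heading = re.search(r"^== [^=\n]+ ==[ \t]*$", text[body_start:], flags=re.MULTILINE)
        body_end = body_start + next_heading.start() if next_heading else len(text)
        # Collapse line breaks and repeated whitespace into single spaces
        return " ".join(text[body_start:body_end].split())
    return None

# Hypothetical article text in the assumed layout
sample_text = (
    "Example Film is a hypothetical 1999 film.\n\n\n"
    "== Plot ==\nA detective arrives in town.\nShe solves the case.\n\n\n"
    "== Cast ==\nSomeone as the detective\n"
)
print(extract_section(sample_text))
```

Joining the lines with a single space keeps the last word of one paragraph from fusing with the first word of the next, which matters for later tokenization.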

Scraping took more than 24 hours, so we wrote the code so that it can be shut down at any time; when the notebook is run again, it picks up where it stopped. This avoids repeating work that has already been done.

In [11]:
#checking if the file exists and finding the index of the last scraped movie
if not os.path.exists('data/movie_plots.csv'):
    with open('data/movie_plots.csv', 'a', newline = "") as f_object:
        writer = csv.writer(f_object)
        writer.writerow(['index','tconst', 'primaryTitle', 'plot', 'rating'])
    last_i = -1
else:
    saved = pd.read_csv('data/movie_plots.csv', encoding = 'latin1', header=0, usecols=['index'])
    last_i = int(saved['index'].iloc[-1]) if len(saved) else -1 #-1 if only the header was written

#loop for scraping plots
with open('data/movie_plots.csv', 'a', newline = "") as f_object:
    writer = csv.writer(f_object)
    for i in tqdm(range(last_i+1,len(ratings_popular))):
        try:
            row = ratings_popular.iloc[i]
            content = plot_wiki(row.tconst, row.primaryTitle, row.startYear)
            if content:
                writer.writerow([i, row.tconst, row.primaryTitle, content, row.averageRating])
        except Exception: #KeyboardInterrupt is not caught, so the loop can still be stopped by hand
            continue


100%|███████████████████████████████████████████████████████████████████████████| 31739/31739 [01:56<00:00, 273.06it/s]
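
The resume logic can be illustrated in isolation. The following is a simplified sketch under stated assumptions, not the notebook's own loop: the helpers `last_saved_index` and `run_batch`, the stand-in `fake_fetch`, and the temporary file are all hypothetical. Unlike the loop above, it also writes a row when nothing was found, so a restart does not retry items that had no plot.

```python
import csv
import os
import tempfile

def last_saved_index(path):
    """Return the index stored in the last data row of the CSV, or -1 if there is none."""
    if not os.path.exists(path):
        return -1
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    # rows[0] is the header; anything after it is data
    return int(rows[-1][0]) if len(rows) > 1 else -1

def run_batch(path, items, fetch):
    """Append results for items not yet processed; fetch is a hypothetical callable."""
    start = last_saved_index(path) + 1
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["index", "item", "result"])
        for i in range(start, len(items)):
            result = fetch(items[i])
            # Write a row even when nothing was found, so a restart does not retry it
            writer.writerow([i, items[i], result or ""])
            f.flush()

# Usage with a stand-in fetch function and a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo_plots.csv")
calls = []

def fake_fetch(title):
    calls.append(title)
    return title.upper() if title != "b" else None

run_batch(demo_path, ["a", "b"], fake_fetch)       # first run processes a and b
run_batch(demo_path, ["a", "b", "c"], fake_fetch)  # second run processes only c
print(calls)
```

Reading the whole file to find the last index is acceptable for a sketch; for a large file, reading only the index column, as the notebook does with pandas, is the lighter option.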


All collected data is saved to a CSV file, which can be used for further analysis.