# NLP analysis of movie plots: scraping Wikipedia

This is a pet project for practising web scraping and natural language processing (NLP) techniques. The main research question is whether the plots of highly rated movies share common patterns.

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

We use ratings from the IMDb website. IMDb publishes data on movies and the ratings given by its users.

In [2]:
names=pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', compression='gzip', header=0, sep="\t")

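One detail worth knowing when loading these files: to my knowledge, IMDb marks missing values in its TSV datasets with the literal string `\N`. The sketch below shows, on two made-up rows, how passing `na_values` to `read_csv` turns that marker into a proper missing value. The sample rows and the `sample` name are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Two made-up rows in the layout of title.basics; "\N" marks a missing year.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tFirst Film\t1999\n"
    "tt0000002\tmovie\tSecond Film\t\\N\n"
)

# na_values tells pandas to treat the literal \N as missing data.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")
print(sample["startYear"].isna().tolist())
```
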
Next we keep only movies (dropping series and other title types) to shrink the data and avoid complications when loading plots from Wikipedia.

In [3]:
names_movies=names[names['titleType']=='movie']

Only two columns are needed: the film's unique IMDb identifier and its title.

In [4]:
names_movies=names_movies[['tconst','primaryTitle']]
names_movies.sample(5)

Unnamed: 0,tconst,primaryTitle
1000975,tt10059440,For The Voiceless
108318,tt0110808,Perdiamoci di vista
51059,tt0052008,Nochnoy gost
78275,tt0079968,Sunnyside
785543,tt0810934,Lei


Another IMDb file contains the average rating and the number of votes for each title.

In [5]:
ratings=pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', compression='gzip', header=0, sep="\t")

We can now merge the two dataframes to get a table with ratings, votes and titles.

In [6]:
ratings_complete=ratings.merge(names_movies)
ratings_complete.sample(5)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle
43695,tt0080848,7.3,50,The School We Went To
17739,tt0042316,6.0,147,Dear Caroline
10015,tt0030986,5.9,15,Women in Prison
239906,tt4307132,8.5,52,G Kutta Se
92379,tt0221335,3.6,54,The Godfather's Advisor

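Calling `merge` without arguments joins on every column the two frames share (here only `tconst`) and keeps only the rows present in both, i.e. an inner join. The sketch below spells those defaults out on made-up rows; the toy values and the `toy_*` names are illustrative and not taken from the IMDb files.

```python
import pandas as pd

# Toy stand-ins for the IMDb tables (values are made up for illustration).
toy_ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [5.7, 6.1, 7.4],
    "numVotes": [1900, 250, 12000],
})
toy_titles = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "primaryTitle": ["First Film", "Third Film"],
})

# Explicit key and join type; equivalent to toy_ratings.merge(toy_titles) here.
toy_merged = toy_ratings.merge(toy_titles, on="tconst", how="inner")
print(toy_merged)
```

The title without a match in `toy_titles` is dropped, which is how series and other non-movie titles disappear from the merged table.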

Many films have been rated by only a few people, so we filter the table by number of votes: a movie is kept for analysis only if more than 5 thousand people voted for it.

In [7]:
ratings_popular=ratings_complete[ratings_complete['numVotes']>5000]
ratings_popular.shape

(15495, 4)

15,495 movies is a good amount for our analysis. We expect this number to shrink once we start looking for plot descriptions on Wikipedia.

## Loading plots from Wikipedia

In [8]:
pip install wikipedia


In [9]:
import wikipedia

We write a function that looks up the film title on Wikipedia and returns the plot if a matching article with a plot section exists. For this we use the wikipedia package, a Python wrapper around the Wikipedia API.

In [10]:
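def plot_wiki(df, i):
    """Return (tconst, plot) for row i, or None if no plot could be found."""
    row = df.iloc[i]
    try:
        wiki_page = wikipedia.page(row['primaryTitle'])
    except Exception:
        # No matching article, an ambiguous title, or a network problem
        return None
    marker = 'Plot ==\n'
    pos = wiki_page.content.find(marker)
    if pos == -1:
        # The article has no "Plot" section
        return None
    start = pos + len(marker)
    end = wiki_page.content.find('\n\n\n', start)
    if end == -1:
        # "Plot" is the last section, so read to the end of the page
        end = len(wiki_page.content)
    content = wiki_page.content[start:end].replace('\n', '').replace('\'', '')
    return row['tconst'], content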

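The string slicing at the heart of this function can be tried without any network access. The sketch below isolates it in a hypothetical helper, `extract_plot`, and runs it on a made-up page. It assumes, as the function above does, that the heading ends in `Plot ==` followed by a newline and that a section ends at the first run of three newlines.

```python
from typing import Optional


def extract_plot(content: str) -> Optional[str]:
    """Return the text of the 'Plot' section, or None if there is none."""
    marker = "Plot ==\n"
    pos = content.find(marker)
    if pos == -1:  # no Plot heading on this page
        return None
    start = pos + len(marker)
    end = content.find("\n\n\n", start)
    if end == -1:  # Plot is the last section: read to the end
        end = len(content)
    # Same clean-up as above: drop newlines and apostrophes.
    return content[start:end].replace("\n", "").replace("'", "")


# Made-up page content in the plain-text layout assumed above.
sample = (
    "Intro text.\n\n\n"
    "== Plot ==\nA hero leaves home.\nShe returns.\n\n\n"
    "== Cast ==\nSomeone"
)
print(extract_plot(sample))
```
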
We then build the table of plot descriptions. Because there are many movies, loading all the data takes a while (about three hours in the run shown here).

In [11]:
rows = []
for i in tqdm(range(len(ratings_popular))):
    result = plot_wiki(ratings_popular, i)
    if result is None:
        # Skip movies without a usable Wikipedia plot
        continue
    tconst, content = result
    rows.append({'tconst': tconst, 'plot': content})

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
movie_plot = pd.DataFrame(rows, columns=['tconst', 'plot'])
movie_plot.head()


100%|██████████| 15495/15495 [3:08:08<00:00,  1.37it/s]


Unnamed: 0,plot,tconst
0,The film consists of two parts of similar leng...,tt0004972
1,=== Episode table ===,tt0006206
2,Cheng Huan leaves his native China because he...,tt0009968
3,"In what appears to be a park, Francis sits on ...",tt0010323
4,Anna (Lillian Gish) is a poor country girl who...,tt0011841


We merge the plots we found with the dataframe from IMDb.

In [12]:
data=ratings_complete.merge(movie_plot)
data.head(10)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot
0,tt0004972,6.2,25041,The Birth of a Nation,The film consists of two parts of similar leng...
1,tt0006206,7.3,5003,Les vampires,=== Episode table ===
2,tt0009968,7.3,10491,Broken Blossoms,Cheng Huan leaves his native China because he...
3,tt0010323,8.0,64866,The Cabinet of Dr. Caligari,"In what appears to be a park, Francis sits on ..."
4,tt0011841,7.4,5598,Way Down East,Anna (Lillian Gish) is a poor country girl who...
5,tt0012364,8.0,12858,The Phantom Carriage,"On New Years Eve, dying Salvation Army Sister ..."
6,tt0012532,7.3,5223,Orphans of the Storm,"Just before the French Revolution, Henriette t..."
7,tt0013086,7.8,8583,"Dr. Mabuse, the Gambler",=== Part I ===The Great Gambler: A Picture of ...
8,tt0013427,7.6,12575,Nanook of the North,"The documentary follows the lives of an Inuk, ..."
9,tt0013442,7.9,98694,Nosferatu,"In 1838, in the fictional German town of Wisbo..."


We keep only rows whose plot is present and longer than 100 characters, which removes stubs such as the bare section heading in row 1. Later, in order to run binary classification models, we add a column indicating whether a movie's rating is high; we treat ratings of 7 or more as high.

In [13]:
# Keep rows whose plot exists and is longer than 100 characters
indices = [x is not None and len(x) > 100 for x in data['plot']]
data = data[indices]
data.head()

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot
0,tt0004972,6.2,25041,The Birth of a Nation,The film consists of two parts of similar leng...
2,tt0009968,7.3,10491,Broken Blossoms,Cheng Huan leaves his native China because he...
3,tt0010323,8.0,64866,The Cabinet of Dr. Caligari,"In what appears to be a park, Francis sits on ..."
4,tt0011841,7.4,5598,Way Down East,Anna (Lillian Gish) is a poor country girl who...
5,tt0012364,8.0,12858,The Phantom Carriage,"On New Years Eve, dying Salvation Army Sister ..."

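The high/low rating label mentioned above can be sketched as follows. This is a minimal illustration on made-up rows: the column name `high_rating` and the constant `HIGH_RATING_THRESHOLD` are hypothetical, and only the threshold of 7 comes from the text.

```python
import pandas as pd

# Toy frame with the same rating column as `data` (values are made up).
toy = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "averageRating": [6.9, 7.0, 8.2],
})

# Hypothetical names; a rating of 7 or more counts as high.
HIGH_RATING_THRESHOLD = 7.0
toy["high_rating"] = (toy["averageRating"] >= HIGH_RATING_THRESHOLD).astype(int)
print(toy)
```

Storing the label as 0/1 integers keeps it directly usable as a target for most classifiers.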

Let's save our scraped data to a CSV file for further analysis.

In [15]:
data.to_csv('data.csv')
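
By default `to_csv` also writes the dataframe index as an extra first column without a header, which pandas names `Unnamed: 0` when the file is read back. If that column is not wanted, `index=False` leaves it out. The sketch below shows the difference on a made-up frame, using in-memory buffers instead of files; all names and values in it are illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"tconst": ["tt0000001", "tt0000002"], "plot": ["First plot.", "Second plot."]},
    index=[3, 7],  # a non-default index, like the one left after filtering
)

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()
print(cols_no_index)
```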