# NLP analysis of movie plots: scraping Wikipedia

This is a pet project to practice web scraping and natural language processing (NLP) techniques. The main research question is whether the plots of highly rated movies share any patterns.

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

We use ratings from the IMDb website. IMDb publishes data on movies and the ratings given by its users.

In [2]:
names=pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', compression='gzip', header=0, sep="\t")
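
The full file is large, so it can help to read only the columns that are needed. The sketch below runs on a tiny made-up stand-in for the real file (the rows and the `sample` name are illustrative only). It relies on the IMDb dumps marking missing values with `\N`, and uses the `usecols` and `na_values` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up rows that mimic the layout of title.basics.tsv (not real IMDb data).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt_a\tshort\tToy Short\t1990\n"
    "tt_b\tmovie\tToy Movie\t2001\n"
    "tt_c\tmovie\tToy Movie Without Year\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    usecols=["tconst", "titleType", "primaryTitle", "startYear"],  # skip unused columns
    na_values="\\N",  # treat the \N marker as a missing value
)

# Keep only feature films, as in the filtering step of this notebook.
sample_movies = sample[sample["titleType"] == "movie"]
print(sample_movies)
```

The same `usecols` and `na_values` arguments can be passed when reading from the URL above.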

Then we keep only movies (removing series and other title types) to shrink the data and avoid complications when loading plots from Wikipedia.

In [3]:
names_movies=names[names['titleType']=='movie']

Only two columns are needed: the film's unique IMDb identifier (tconst) and its title (primaryTitle).

In [4]:
names_movies=names_movies[['tconst','primaryTitle']]
names_movies.sample(5)

Unnamed: 0,tconst,primaryTitle
209175,tt0218183,Every Child Is Born a Poet: The Life and Work ...
85312,tt0087248,Het feest en de grote leugen
61576,tt0062825,"Run, Man, Run"
3953486,tt1545594,Chhoto Bou
300747,tt0314245,K Chyornomu moryu


Another IMDb file contains the average rating and the number of votes for each title.

In [5]:
ratings=pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', compression='gzip', header=0, sep="\t")

We can now merge the two dataframes to get a table with ratings, votes and titles.

In [6]:
ratings_complete=ratings.merge(names_movies)
ratings_complete.sample(5)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle
36121,tt0068789,6.7,216,Justin Morgan Had a Horse
152192,tt1091746,6.6,89,The Idol
126557,tt0404752,5.3,13,Al canto del cucù
174461,tt14043522,6.8,19,Tales of a Toy Horse
120663,tt0364530,5.4,185,Mr. Romeo
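
Called without arguments, `merge` joins on the columns the two tables share (here only `tconst`) and keeps only the rows present in both. The toy example below spells these defaults out and adds a vote threshold; the dataframes and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the ratings and titles tables (illustrative values only).
ratings_toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "averageRating": [7.5, 6.0, 8.1],
    "numVotes": [900, 40, 5000],
})
titles_toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_c", "tt_d"],
    "primaryTitle": ["Film A", "Film C", "Film D"],
})

# Inner join on the shared key; validate raises if tconst is duplicated on either side.
merged_toy = ratings_toy.merge(titles_toy, on="tconst", how="inner", validate="one_to_one")

# Keep only titles with more than 1000 votes.
popular_toy = merged_toy[merged_toy["numVotes"] > 1000]
print(popular_toy)
```

Titles that appear in only one table (`tt_b` and `tt_d` here) are dropped by the inner join.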


Many films have been rated by only a few people, so we filter the table by number of votes: a movie is kept for analysis only if more than 1,000 people voted for it.

In [7]:
ratings_popular=ratings_complete[ratings_complete['numVotes']>1000]
ratings_popular.shape

(38725, 4)

38,725 movies is a good amount for our analysis. We expect this number to shrink once we start looking for plot descriptions on Wikipedia.

## Loading plots from wikipedia

In [8]:
pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11695 sha256=464dd04183cc5cef8cdc5dbdb50341dd62327199d075032d88110b22cb5f9c27
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [9]:
import wikipedia

We write a function that looks up a film title on Wikipedia and returns the plot if the article has a plot section. For this purpose we use the wikipedia package, a wrapper around the Wikipedia API.

In [10]:
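def plot_wiki(df, i):
    """Return (tconst, plot text) for row i; the plot is None if none is found."""
    row = df.iloc[i]
    content = None
    try:
        # Look up the article by movie title; missing or ambiguous titles raise.
        page_text = wikipedia.page(row['primaryTitle']).content
        start = page_text.find('Plot ==\n')
        if start != -1:
            start += len('Plot ==\n')
            # The plot section is assumed to end at the next run of blank lines.
            end = page_text.find('\n\n\n', start)
            if end == -1:
                end = len(page_text)
            # Join lines with a space so words from adjacent paragraphs do not merge.
            content = page_text[start:end].replace('\n', ' ').replace('\'', '')
    except Exception:
        # Any lookup failure is treated as "no plot available".
        pass
    return row['tconst'], content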
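
The text-slicing step can be tried without any network access. The sketch below isolates it in a hypothetical helper, `extract_plot` (not part of the wikipedia package), and runs it on a made-up page text. It assumes, as the function above does, that the plot heading ends with `Plot ==` followed by a newline and that sections are separated by two blank lines.

```python
def extract_plot(page_text):
    """Return the text of the plot section, or None if there is no such section."""
    marker = "Plot ==\n"
    start = page_text.find(marker)
    if start == -1:
        return None
    start += len(marker)
    # The section is assumed to end at the next run of two blank lines.
    end = page_text.find("\n\n\n", start)
    if end == -1:
        end = len(page_text)
    return page_text[start:end].replace("\n", " ").strip()


# Made-up page text in the assumed layout.
sample_page = (
    "Intro paragraph.\n\n\n"
    "== Plot ==\n"
    "A hero leaves home.\nThe hero returns.\n\n\n"
    "== Cast ==\nSomeone"
)

print(extract_plot(sample_page))
print(extract_plot("An article with no plot section."))
```

Keeping the parsing separate from the network call makes it easier to check edge cases, such as a plot section that is the last one on the page.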

We build the table of plot descriptions. As there are many movies, loading all the data takes some time.

In [None]:
rows = []
for i in tqdm(range(len(ratings_popular))):
  tconst, content = plot_wiki(ratings_popular, i)
  rows.append({'tconst': tconst, 'plot': content})
# Build the dataframe once at the end; DataFrame.append was removed in pandas 2.0.
movie_plot = pd.DataFrame(rows, columns=['tconst', 'plot'])
movie_plot.head()


  1%|          | 267/38725 [01:00<2:58:36,  3.59it/s]

Merging the plots we found with the dataframe from IMDb

In [12]:
data=ratings_complete.merge(movie_plot)
data.head(10)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot
0,tt0002130,7.0,2984,Dante's Inferno,
1,tt0002844,6.9,2327,Fantômas: In the Shadow of the Guillotine,
2,tt0003014,7.0,1249,Ingeborg Holm,
3,tt0003037,6.9,1584,Fantomas: The Man in Black,
4,tt0003165,6.9,1250,Fantômas: The Dead Man Who Killed,=== Episode table ===
5,tt0003419,6.4,2166,The Student of Prague,
6,tt0003643,6.4,1309,The Avenging Conscience: or 'Thou Shalt Not Kill',A young man (Henry B. Walthall) interested in ...
7,tt0003740,7.1,3625,Cabiria,
8,tt0003772,6.1,1036,Cinderella,
9,tt0003930,6.8,1344,Fantomas: The Mysterious Finger Print,


In order to run binary classification models, we will add a new column indicating whether a movie's rating is high or not. We treat a rating of 7 or above as high.
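
A minimal sketch of that labelling step on made-up ratings follows. The column name `high_rating` and the constant `HIGH_RATING_THRESHOLD` are hypothetical names chosen for illustration; only the threshold of 7 comes from the text above.

```python
import pandas as pd

# Toy ratings (illustrative values only).
toy = pd.DataFrame({
    "tconst": ["tt_a", "tt_b", "tt_c"],
    "averageRating": [6.9, 7.0, 8.3],
})

HIGH_RATING_THRESHOLD = 7.0  # ratings at or above this value count as high

# 1 marks a highly rated movie, 0 marks everything else.
toy["high_rating"] = (toy["averageRating"] >= HIGH_RATING_THRESHOLD).astype(int)
print(toy)
```

The comparison is inclusive, so a movie rated exactly 7.0 is labelled as high.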

Let's save our scraped data to csv for further analysis.

In [13]:
data.to_csv('data.csv')