# NLP analysis of movie plots: scraping Wikipedia

This is a pet project for practicing web scraping and natural language processing (NLP) techniques. The main research question is whether the plots of highly rated movies share any patterns.

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

We use ratings from the IMDb website. IMDb publishes data on movies and the ratings given by its users.

In [2]:
names=pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', compression='gzip', header=0, sep="\t")
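
IMDb's dataset files mark missing values with `\N`. As a minimal sketch, assuming a tiny in-memory sample in place of the real download, passing `na_values` lets pandas treat that marker as missing. The sample rows and the name `sample_tsv` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the layout of title.basics.tsv
# (only a few of the real columns are included).
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tshort\tExample Short\t1894\n"
    "tt0000002\tmovie\tExample Movie\t\\N\n"
    "tt0000003\ttvSeries\tExample Series\t2001\n"
)

# na_values="\\N" turns IMDb's missing-value marker into NaN.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the rows where the year is missing.
print(sample["startYear"].isna().sum())
```

The same `na_values` argument could be added to the `read_csv` call on the real file.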

Then we keep only movies (removing series and other title types) to shrink the data and avoid complications when loading plots from Wikipedia.

In [3]:
names_movies=names[names['titleType']=='movie']

Only two columns are needed: the film's unique IMDb identifier and its title.

In [4]:
names_movies=names_movies[['tconst','primaryTitle']]
names_movies.sample(5)

Unnamed: 0,tconst,primaryTitle
294123,tt0307371,La regina di Sparta
7143209,tt4968982,Kingdom Point
3572055,tt14751200,Cavu
1063997,tt10173114,Gang Qin Meng
6115042,tt2600444,Krzyzacy


Another IMDb file contains the ratings and the number of votes for each title.

In [5]:
ratings=pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', compression='gzip', header=0, sep="\t")

We can now merge the two dataframes to get a table with ratings, votes and titles.

In [6]:
ratings_complete=ratings.merge(names_movies)
ratings_complete.sample(5)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle
265568,tt7109902,6.7,124,Tarzan's Testicles
234101,tt3886228,5.5,19,All About Eva
89543,tt0208176,6.8,294,Fallen Angels Paradise
260938,tt6495810,6.3,19,Bruce!!!!
207644,tt21855724,8.2,5,Kurdbûn - Essere curdo
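
Called without arguments, `merge` joins on the columns the two frames share and keeps only rows present in both. A minimal sketch with made-up rows, assuming `tconst` is the only shared column, spells out those defaults explicitly:

```python
import pandas as pd

# Made-up rows for illustration only.
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [7.1, 5.4, 8.0],
    "numVotes": [1200, 40, 90000],
})
toy_titles = pd.DataFrame({
    "tconst": ["tt1", "tt3", "tt4"],
    "primaryTitle": ["First", "Third", "Fourth"],
})

# Inner join on the shared key: tt2 (no title) and tt4 (no rating) drop out.
toy_merged = toy_ratings.merge(toy_titles, on="tconst", how="inner")
print(toy_merged)
```

Naming the key with `on="tconst"` guards against an accidental join on some other column that happens to share a name.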


Many films have been rated by only a few people, so we filter the table by number of votes: a movie is kept for analysis only if it received more than 1,000 votes.

In [7]:
ratings_popular=ratings_complete[ratings_complete['numVotes']>1000]
ratings_popular.shape

(38712, 4)

38,712 movies is a good amount for our analysis. We expect this number to shrink once we start looking up plot descriptions on Wikipedia.

## Loading plots from Wikipedia

In [8]:
pip install wikipedia

Note: you may need to restart the kernel to use updated packages.


In [9]:
import wikipedia

We write a function that looks up the film title on Wikipedia and returns the plot if an article with a plot section exists. For this purpose we use the wikipedia package, a wrapper around the Wikipedia API.

In [10]:
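def plot_wiki(df, i):
    """Return (tconst, plot) for row i; plot is None when no plot is found."""
    row = df.iloc[i]
    try:
        page_text = wikipedia.page(row['primaryTitle']).content
    except Exception:
        # Missing page, ambiguous title or network error
        return row['tconst'], None
    marker = 'Plot ==\n'
    start = page_text.find(marker)
    if start < 0:
        # The article has no plot section
        return row['tconst'], None
    start += len(marker)
    end = page_text.find('\n\n\n', start)
    if end < 0:
        # The plot is the last section of the article
        end = len(page_text)
    content = page_text[start:end].replace('\n', '').replace('\'', '')
    return row['tconst'], content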
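
The section extraction can be tried without any network access. The sketch below is a hypothetical helper (`extract_plot` is not part of the notebook or of the wikipedia package) and assumes top-level headings appear as `== Title ==` on their own line in the page text. Unlike `plot_wiki`, it joins lines with spaces and keeps apostrophes.

```python
import re

def extract_plot(content):
    """Return the text of the '== Plot ==' section, or None if absent.

    Hypothetical helper: assumes top-level section headings look like
    '== Title ==' on their own line.
    """
    # Capture everything after the Plot heading up to the next
    # top-level heading or the end of the text.
    match = re.search(r"^== Plot ==\n(.*?)(?=^== |\Z)", content,
                      flags=re.S | re.M)
    if match is None:
        return None
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(match.group(1).split())

# Made-up page text for illustration.
sample_content = (
    "Intro paragraph.\n\n\n"
    "== Plot ==\n"
    "A detective arrives.\nShe solves the case.\n\n\n"
    "== Cast ==\n"
    "Someone as the detective"
)
plot_text = extract_plot(sample_content)
print(plot_text)
```

Subsections such as `=== Part one ===` stay inside the captured text, so their headings would still need cleaning before analysis.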

We build the table of plot descriptions. As there are many movies, loading all the data takes some time.

In [58]:
rows = []
for i in tqdm(range(len(ratings_popular))):
    tconst, content = plot_wiki(ratings_popular, i)
    rows.append({'tconst': tconst, 'plot': content})
# Build the dataframe once; DataFrame.append was removed in pandas 2.0
movie_plot = pd.DataFrame(rows, columns=['plot', 'tconst'])
movie_plot.head()


100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:38<00:00,  1.95s/it]


Unnamed: 0,plot,tconst
0,,tt0002130
1,,tt0002844
2,,tt0003014
3,,tt0003037
4,=== Episode table ===,tt0003165


We merge the plots we found with the dataframe from IMDb.

In [59]:
data=ratings_complete.merge(movie_plot)
data.head(10)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot
0,tt0002130,7.0,2983,Dante's Inferno,
1,tt0002844,6.9,2325,Fantômas: In the Shadow of the Guillotine,
2,tt0003014,7.0,1248,Ingeborg Holm,
3,tt0003037,6.9,1584,Fantomas: The Man in Black,
4,tt0003165,6.9,1248,Fantômas: The Dead Man Who Killed,=== Episode table ===
5,tt0003419,6.4,2164,The Student of Prague,
6,tt0003643,6.4,1307,The Avenging Conscience: or 'Thou Shalt Not Kill',A young man (Henry B. Walthall) interested in ...
7,tt0003740,7.1,3623,Cabiria,
8,tt0003772,6.1,1036,Cinderella,
9,tt0003930,6.8,1343,Fantomas: The Mysterious Finger Print,


To run binary classification models later, we will need a column indicating whether a movie's rating is high; we treat ratings greater than or equal to 7 as high. The cell below does not add that column yet: it only keeps movies whose plot is present and longer than 100 characters.

In [60]:
# Keep rows whose plot is a string longer than 100 characters
indices = [isinstance(x, str) and len(x) > 100 for x in data['plot']]
data = data[indices]
data.head()

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot
6,tt0003643,6.4,1307,The Avenging Conscience: or 'Thou Shalt Not Kill',A young man (Henry B. Walthall) interested in ...
15,tt0004972,6.2,24965,The Birth of a Nation,The film consists of two parts of similar leng...
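
The binary target described above (a rating of 7 or higher counts as high) can be sketched as a boolean comparison. The column name `high_rating` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [6.4, 7.0, 8.3],
})

# 1 when the rating is greater than or equal to 7, otherwise 0.
toy["high_rating"] = (toy["averageRating"] >= 7).astype(int)
print(toy)
```

The same expression applied to `data['averageRating']` would label the scraped movies.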


Let's save our scraped data to a CSV file for further analysis.

In [14]:
data.to_csv('data.csv')
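
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A minimal sketch, using an in-memory buffer and made-up rows instead of `data.csv`, shows `index=False` avoiding that:

```python
import io
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "plot": ["First plot.", "Second plot."],
})

# Write without the index, then read the same text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Whether to drop the index is a design choice; keeping it preserves the original row positions from the merged table.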