# Word analysis of movie plots: scraping Wikipedia

This is a pet project for practicing web scraping and natural language processing (NLP) techniques. The main research question is whether the plots of highly rated movies share common patterns.

In [1]:
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

We use ratings from the IMDb website. IMDb publishes data on movies and the ratings given by its users; the basic title data is available at https://datasets.imdbws.com/title.basics.tsv.gz

In [2]:
names=pd.read_csv('https://datasets.imdbws.com/title.basics.tsv.gz', compression='gzip', header=0, sep="\t")
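
IMDb's dataset files mark missing fields with `\N`, and the title file holds more columns than this analysis needs. The block below is a minimal sketch of a more selective read, not the notebook's own loader: `read_imdb_tsv` is a hypothetical helper, and the two sample rows are made up so the example runs without downloading anything.

```python
import io
import pandas as pd

def read_imdb_tsv(source, columns):
    """Read an IMDb TSV file, keeping only `columns` and treating '\\N' as missing.

    `source` can be a URL, a local path or a file-like object.
    """
    return pd.read_csv(
        source,
        sep="\t",
        usecols=columns,
        na_values="\\N",   # IMDb's marker for a missing field
        dtype=str,         # read everything as text; convert types later if needed
    )

# Made-up rows standing in for title.basics.tsv.gz
sample = io.StringIO(
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tshort\tExample Short\t1894\n"
    "tt0000002\tmovie\tExample Movie\t\\N\n"
)
basics = read_imdb_tsv(sample, ["tconst", "titleType", "primaryTitle", "startYear"])
movies = basics[basics["titleType"] == "movie"]
```

Passing the real URL instead of `sample` should work the same way, since pandas infers gzip compression from the `.gz` suffix by default.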

Then we keep only movies, which shrinks the data and avoids complications when loading plots from Wikipedia.

In [3]:
names_movies=names[names['titleType']=='movie']

Only two columns are needed: the unique IMDb identifier of each film (tconst) and its title.

In [4]:
names_movies=names_movies[['tconst','primaryTitle']]
names_movies.sample(5)

Unnamed: 0,tconst,primaryTitle
5124999,tt19849810,Flight
265915,tt0277797,Lena
2324398,tt12445918,The Rivals of Amziah King
53161,tt0054194,Policejní hodina
416685,tt0434268,Papillon du vertige


Another IMDb file contains the average rating and the number of votes for each title: https://datasets.imdbws.com/title.ratings.tsv.gz

In [5]:
ratings=pd.read_csv('https://datasets.imdbws.com/title.ratings.tsv.gz', compression='gzip', header=0, sep="\t")

We can now merge the two dataframes to get a table with ratings, votes and titles.

In [6]:
ratings_complete=ratings.merge(names_movies, on='tconst')
ratings_complete.sample(5)

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle
196159,tt1834889,7.4,19,The Instant Messenger Mission
42862,tt0079570,6.8,111,No More Easy Life
157039,tt1156075,6.9,86,"Soeur Innocenta, priez pour nous!"
258403,tt6265620,6.1,142,I Miss You When I See You
103076,tt0270674,6.6,8,Street Love - Amor de la calle
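
The merge is an inner join on `tconst`, so titles that are missing from either table are dropped. As a hedged illustration on made-up frames (`ratings_demo` and `titles_demo` are not part of the notebook), the join type can be spelled out and pandas can be asked to check that each identifier appears at most once on both sides:

```python
import pandas as pd

# Made-up miniature versions of the ratings and titles tables
ratings_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "averageRating": [7.4, 6.8, 5.0],
    "numVotes": [19, 111, 8],
})
titles_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt4"],
    "primaryTitle": ["A", "B", "D"],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key is duplicated
merged = ratings_demo.merge(
    titles_demo, on="tconst", how="inner", validate="one_to_one"
)
```

Here `tt3` (no title) and `tt4` (no rating) do not appear in `merged`.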


Many films have only a handful of votes, so we filter the table by number of votes: a movie is kept for analysis only if more than 50 thousand people voted on it.

In [7]:
ratings_popular=ratings_complete[ratings_complete['numVotes']>50000]
ratings_popular.shape

(3825, 4)

3825 movies is a good amount for our analysis. We expect this number to drop once we start looking for plot descriptions on Wikipedia.

## Loading plots from wikipedia

In [8]:
pip install wikipedia

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11696 sha256=da4a70ed5183af0af3b39c9ebf8f275710bee45fe643d18f9ab8ebb3efd9c423
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [9]:
import wikipedia

We write a function that looks up the film title on Wikipedia and returns the plot if the article has a "Plot" section.

In [10]:
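def plot_wiki(df, i):
    """Return (tconst, plot) for row i; plot is None when no plot is found."""
    tconst = df.iloc[i]['tconst']
    marker = 'Plot ==\n'
    try:
        text = wikipedia.page(df.iloc[i]['primaryTitle']).content
    except Exception:
        # Page missing, ambiguous title or network error
        return tconst, None
    start = text.find(marker)
    if start == -1:
        # The article has no "Plot" section
        return tconst, None
    start += len(marker)
    end = text.find('\n\n\n', start)
    if end == -1:
        end = len(text)  # the plot is the last section
    return tconst, text[start:end].replace('\n', '').replace('\'', '')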
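
The text handling can also be separated from the network call, which makes it easy to try on a string. The following is a sketch under assumptions, not the function used to produce the results in this notebook: `extract_plot` is a hypothetical helper, it assumes the article text marks sections as `== Plot ==`, and unlike `plot_wiki` it joins lines with spaces and keeps apostrophes.

```python
import re

# A level-2 "== Plot ==" heading, then everything up to the next
# level-2 heading or the end of the text.
_PLOT_RE = re.compile(
    r"^==\s*Plot\s*==[ \t]*\n(.*?)(?=^==[^=]|\Z)",
    re.MULTILINE | re.DOTALL,
)

def extract_plot(content):
    """Return the text of the "Plot" section, or None if there is none."""
    match = _PLOT_RE.search(content)
    if match is None:
        return None
    # Collapse all whitespace so paragraphs are joined by single spaces
    text = " ".join(match.group(1).split())
    return text or None

# Made-up article text for illustration
article = (
    "Intro text.\n\n\n"
    "== Plot ==\nFirst paragraph.\nSecond paragraph.\n\n\n"
    "== Cast ==\nSomeone\n"
)
plot = extract_plot(article)
no_plot = extract_plot("Intro text.\n\n\n== Cast ==\nSomeone\n")
```

Subsection headings inside the plot (level 3, `=== ... ===`) are left in the returned text by this pattern.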

We build a dataframe of plot descriptions. As there are many movies, loading all the data takes some time.

In [11]:
rows=[]
for i in tqdm(range(len(ratings_popular))):
  tconst, content = plot_wiki(ratings_popular, i)
  rows.append({'tconst':tconst,'plot':content})
# Build the dataframe once at the end: DataFrame.append was removed in pandas 2.0
movie_plot=pd.DataFrame(rows, columns=['plot', 'tconst'])
movie_plot.head()


100%|██████████| 3825/3825 [48:37<00:00,  1.31it/s]


Unnamed: 0,plot,tconst
0,"In what appears to be a park, Francis sits on ...",tt0010323
1,,tt0012349
2,"In 1838, in the fictional German town of Wisbo...",tt0013442
3,Buster is a movie theater projectionist and ja...,tt0015324
4,The film is set in June 1905; the protagonists...,tt0015648
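
Some lookups return no plot, as row 1 above shows. A short, hedged sketch of how the share of movies with a usable plot could be measured, using a made-up frame (`plots_demo`) in place of `movie_plot`:

```python
import pandas as pd

# Made-up stand-in for movie_plot
plots_demo = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "plot": ["A man walks.", None, "", "A ship sails."],
})

# Treat empty or whitespace-only strings the same as missing values
has_plot = plots_demo["plot"].fillna("").str.strip().ne("")
coverage = has_plot.mean()                      # share of rows with a plot
plots_found = plots_demo[has_plot].reset_index(drop=True)
```

In this toy example half of the rows have a plot, and `plots_found` keeps only those rows.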


Next we merge the plots we found with the dataframe from IMDb.

In [12]:
data=ratings_complete.merge(movie_plot)

To prepare for logistic regression, we add a new column that indicates whether the rating of a movie is high. We label a rating as high if it is 7 or above.

In [13]:
data['rating']=[1 if x >= 7 else 0 for x in data['averageRating']]
data.head()

Unnamed: 0,tconst,averageRating,numVotes,primaryTitle,plot,rating
0,tt0010323,8.0,64516,The Cabinet of Dr. Caligari,"In what appears to be a park, Francis sits on ...",1
1,tt0012349,8.3,126789,The Kid,,1
2,tt0013442,7.9,98252,Nosferatu,"In 1838, in the fictional German town of Wisbo...",1
3,tt0015324,8.2,50390,Sherlock Jr.,Buster is a movie theater projectionist and ja...,1
4,tt0015648,7.9,58131,Battleship Potemkin,The film is set in June 1905; the protagonists...,1
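
Before fitting a classifier it can help to check how the two classes are balanced, since the threshold of 7 decides how many movies count as highly rated. A minimal sketch on made-up ratings (the `demo` frame is illustrative only and says nothing about the real class shares):

```python
import pandas as pd

# Made-up ratings for illustration
demo = pd.DataFrame({"averageRating": [8.0, 8.3, 6.9, 7.0, 5.5]})

# Same rule as above, written as a vectorized comparison
demo["rating"] = (demo["averageRating"] >= 7).astype(int)

# Share of each class (1 = high rating, 0 = not high)
balance = demo["rating"].value_counts(normalize=True)
```

Running the same `value_counts` call on `data['rating']` would give the class shares for the scraped movies.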


Let's save our scraped data to csv for further analysis.

In [14]:
data.to_csv('data.csv', index=False)
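
When a dataframe is written with its row index, reading the file back produces an extra `Unnamed: 0` column. The sketch below shows a round trip without the index, using an in-memory buffer and a made-up frame instead of `data.csv`:

```python
import io
import pandas as pd

# Made-up stand-in for the scraped data
demo = pd.DataFrame({"tconst": ["tt1", "tt2"], "plot": ["A man walks.", None]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # leave the row index out of the file
buffer.seek(0)
restored = pd.read_csv(buffer)     # missing plots come back as NaN
```

If a file was already saved with the index, `pd.read_csv('data.csv', index_col=0)` reads that first column back as the index instead.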