# NLP analysis of movie plots: preprocessing data

## Importing libraries

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split 

## Loading data

The next step in our analysis is preprocessing the data. Let's load the scraped data from the CSV file.

In [35]:
data = pd.read_csv('data/data.csv', index_col = 'tconst').iloc[:,1:]
data.head()

tconst,averageRating,numVotes,primaryTitle,plot,rating
tt0010323,8.0,64516,The Cabinet of Dr. Caligari,"In what appears to be a park, Francis sits on ...",1
tt0012349,8.3,126789,The Kid,NaN,1
tt0013442,7.9,98252,Nosferatu,"In 1838, in the fictional German town of Wisbo...",1
tt0015324,8.2,50390,Sherlock Jr.,Buster is a movie theater projectionist and ja...,1
tt0015648,7.9,58131,Battleship Potemkin,The film is set in June 1905; the protagonists...,1
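
The loading line does two things at once: `index_col='tconst'` makes the IMDb identifier the row index, and `.iloc[:, 1:]` discards the first remaining column. Below is a minimal sketch on a made-up two-row CSV. It assumes the dropped column is a leftover positional index written by an earlier `to_csv` call; the first column of the real file is not shown in this notebook, and the identifiers below are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/data.csv; the leading unnamed column is an assumption.
raw = io.StringIO(
    ",tconst,averageRating,rating\n"
    "0,tt0000001,8.0,1\n"
    "1,tt0000002,5.5,0\n"
)

# Same pattern as the notebook: index on tconst, then drop the first remaining column.
toy = pd.read_csv(raw, index_col="tconst").iloc[:, 1:]

print(toy.columns.tolist())
print(toy.index.tolist())
```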


First, we drop the rows for which no plot was scraped.

In [36]:
data = data[~data['plot'].isna()]
data.shape

(1854, 5)

Sadly, we did not find a plot description for half of the movies. We end up with high-dimensional, low-sample-size data: the plots will yield far more text features than the 1854 movies we have left, so we need to apply dimensionality reduction before fitting classical models.
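
To see why text data ends up with more features than samples, here is a minimal sketch on four made-up plot snippets. It uses scikit-learn's `TfidfVectorizer` and `TruncatedSVD` as one possible vectorizer and reduction technique; this is an assumption for illustration, not necessarily the method applied later in this analysis.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up plot snippets, used only to illustrate the shapes involved.
toy_plots = [
    "a detective hunts a killer in a rainy city",
    "a young wizard learns magic at a hidden school",
    "two friends drive across the country after a robbery",
    "a robot falls in love with a lonely engineer",
]

# Each distinct word becomes a column, so columns quickly outnumber rows.
tfidf = TfidfVectorizer().fit_transform(toy_plots)
n_docs, n_terms = tfidf.shape

# Project the sparse term matrix onto 2 latent components.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```

`TruncatedSVD` works directly on sparse matrices, which is why it is a common choice for term-document data.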

In [37]:
data['rating'].value_counts()

0    979
1    875
Name: rating, dtype: int64

Good news: our data is roughly balanced, with 979 low-rated (0) and 875 highly rated (1) movies.
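
The class shares can be checked directly from the counts printed above. In the notebook itself, `data['rating'].value_counts(normalize=True)` should give the same proportions; the sketch below rebuilds them from the two counts so that it runs on its own.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 979, 1: 875}, name="rating")

# Share of each class in the 1854 remaining movies.
shares = counts / counts.sum()

print(shares.round(3))
```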

## Splitting data into train and test samples

Before starting the preprocessing steps, we need to split the data into training and test samples. This lets us evaluate the model's performance on data that was not used to fit it.

In [47]:
X = data['plot']
y = data['rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1234, shuffle = True)

print(f'Train sample size: {len(y_train)}')
print(f'Test sample size: {len(y_test)}')

Train sample size: 1390
Test sample size: 464
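
With the default `test_size=0.25`, `train_test_split` holds out a quarter of the 1854 rows, which gives the 1390/464 split printed above. The call in this notebook does not stratify. A hedged variant, assuming you want both samples to keep the same class proportions, passes `stratify=y`. The labels and texts below are synthetic, built only to match the class counts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (979 zeros, 875 ones).
y = pd.Series([0] * 979 + [1] * 875)
X = pd.Series(["plot text"] * len(y))  # placeholder texts, not real plots

# stratify=y keeps the share of each class the same in both samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1234, shuffle=True, stratify=y
)

print(len(y_train), len(y_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```

Stratification matters more as classes get more imbalanced; with a split this close to even, a plain shuffled split will usually be fine too.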


## Building pipeline for data cleaning and preprocessing