## Data Collection

This notebook collects a dataset of movies with their genre labels and poster links, then downloads the poster images from those links. The data consists of a csv file containing movie titles, IMDB genre labels, and poster links, together with the original poster images. The csv file was obtained from Kaggle (https://www.kaggle.com/neha1703/movie-genre-from-its-poster), and the poster images were downloaded from the IMDB website. Both the csv file and the poster images are stored in a Google Cloud bucket.

In [1]:
# Import Packages
import pandas as pd
import urllib.error
import urllib.request
import os

In [3]:
# Read the csv file into a dataframe and drop the rows with missing values
df = pd.read_csv("MovieGenre.csv", encoding = 'latin')
df = df.dropna()
df = df.reset_index(drop = True)

In [4]:
# Destination folder for the downloaded poster images
data_root = "./Data"
os.makedirs(data_root, exist_ok = True)

In [5]:
# Download each poster image from its link into the destination folder
link_list = []
filename_list = []
for link in df['Poster']:
    file = os.path.join(data_root, link.split("/")[-1])
    try:
        urllib.request.urlretrieve(link, file)
    except urllib.error.URLError:
        # Skip links that cannot be retrieved (HTTPError is a subclass of URLError)
        continue
    link_list.append(link)
    filename_list.append(file)
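
The download step can also be wrapped in a small helper that skips files already on disk and cleans up after a failed request, which makes the cell safe to re-run. The sketch below is only an illustration, not the code used in this project: `download_poster` and its `fetch` parameter are hypothetical names introduced here, and `fetch` exists only so that the helper can be tried without network access.

```python
import os
import urllib.error
import urllib.request


def download_poster(link, dest_dir, fetch=urllib.request.urlretrieve):
    """Download one poster into dest_dir; return the local path, or None on failure.

    `fetch` is a hypothetical injection point (defaults to urlretrieve).
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, link.split("/")[-1])
    if os.path.exists(path):
        # Already downloaded in an earlier run; skip the request
        return path
    try:
        fetch(link, path)
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed links
        if os.path.exists(path):
            os.remove(path)  # remove a partially written file
        return None
    return path
```

With such a helper, the loop body reduces to calling `download_poster(link, data_root)` and recording the link only when the result is not `None`.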

In [6]:
# Replace each poster link with its image file name so it can be matched against the downloaded files
copy = [link.split("/")[-1] for link in df['Poster']]
df['Poster'] = copy
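
Taking the last `/`-separated piece of a link works as long as the link ends with the file name. If a link carried a query string or fragment (the source does not say whether any do), parsing the URL first would be more robust. A minimal sketch, where `poster_filename` is a hypothetical helper name:

```python
import posixpath
from urllib.parse import urlsplit


def poster_filename(link):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    return posixpath.basename(urlsplit(link).path)


# Example with a made-up link
print(poster_filename("https://example.com/images/M/abc.jpg?size=large"))
```

If such a helper were adopted, the same function would have to be used when naming the downloaded files, so that the names in the dataframe and on disk still match.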

In [7]:
# Keep only the rows whose image file exists in the destination folder
stored_image = set(os.listdir(data_root))
df = df[df['Poster'].isin(stored_image)].reset_index(drop = True)

In [10]:
# Save the final csv file locally
df.to_csv("collected_dataframe.csv", index = False)

After the above steps, the modified csv file and the poster images are uploaded to a Google Cloud bucket.