# Predicting Film Box Office Revenue

**Project for Programmazione di Applicazioni Data Intensive (Data-Intensive Application Programming)**  
Bachelor's degree in Ingegneria e Scienze Informatiche, academic year 2022/2023

DISI - Università di Bologna, Cesena

Giosuè Giocondo Mainardi  `giosue.mainardi@studio.unibo.it`

## Loading Libraries

First, we load the libraries used to operate on the data:

* NumPy to create and operate on N-dimensional arrays
* pandas to load and manipulate tabular data
* matplotlib to create plots

We import them under their conventional aliases and enable inline rendering of plots in the notebook.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

We will install the other required components as they are needed.

## Loading the Data

We define the following helper function to download the required data files.

In [2]:
import os
from urllib.request import urlretrieve

def download(file, url):
    # Download only if the file is not already present locally
    if not os.path.isfile(file):
        urlretrieve(url, file)

We download the two datasets.

In [26]:
download("film_train.csv", "https://raw.githubusercontent.com/g-mainardi/app-data-intensive/main/film_train.csv")
download("film_test.csv",  "https://raw.githubusercontent.com/g-mainardi/app-data-intensive/main/film_test.csv")

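### Extract

Several columns hold Python literals serialized as strings. We import the `ast` (Abstract Syntax Trees) module to parse them back into Python objects.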

In [27]:
import ast

# Columns whose values are Python literals stored as strings
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def text_to_dict(df):
    # Parse each listed column; missing values become an empty dict
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))
    return df
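
To illustrate what `text_to_dict` does to a single cell, here is a minimal sketch on a toy column. The sample string is invented for illustration and only mimics the general shape of a stringified value; it is not taken from the dataset.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical cell values: one stringified list of dicts, one missing value
raw = pd.Series(["[{'id': 18, 'name': 'Drama'}]", np.nan])

# Same rule as text_to_dict: NaN becomes {}, anything else is parsed as a literal
parsed = raw.apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))

print(parsed.iloc[0][0]["name"])
print(parsed.iloc[1])
```

`ast.literal_eval` only evaluates Python literals (strings, numbers, lists, dicts and so on), so unlike `eval` it does not execute arbitrary code found in the file.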

In [28]:
train_df = text_to_dict(pd.read_csv("film_train.csv", index_col="id", parse_dates=["release_date"]))
test_df = text_to_dict(pd.read_csv("film_test.csv", index_col="id", parse_dates=["release_date"]))

In [31]:
train_df.shape, test_df.shape

((3000, 22), (4398, 21))

### Test

In [29]:
train_df[train_df.columns[(train_df.isna().sum()!=0)]]

| id | homepage | overview | poster_path | runtime | tagline |
|---|---|---|---|---|---|
| 1 | NaN | When Lou, who has become the "father of the In... | /tQtWuwvMf0hCc2QR2tkolwl7c3c.jpg | 93.0 | The Laws of Space and Time are About to be Vio... |
| 2 | NaN | Mia Thermopolis is now a college graduate and ... | /w9Z7A0GHEhIp7etpj0vyKOeU1Wx.jpg | 113.0 | It can take a lifetime to find true love; she'... |
| 3 | http://sonyclassics.com/whiplash/ | Under the direction of a ruthless instructor, ... | /lIv1QinFqz4dlp5U4lQ6HaiskOZ.jpg | 105.0 | The road to greatness can take you to the edge. |
| 4 | http://kahaanithefilm.com/ | Vidya Bagchi (Vidya Balan) arrives in Kolkata ... | /aTXRaPrWSinhcmCrcfJK17urp3F.jpg | 122.0 | NaN |
| 5 | NaN | Marine Boy is the story of a former national s... | /m22s7zvkVFDU9ir56PiiqIEWFdT.jpg | 118.0 | NaN |
| ... | ... | ... | ... | ... | ... |
| 2996 | NaN | Military men Rock Reilly and Eddie Devane are ... | /j8Q7pQ27hvH54wpxJzIuQgQCdro.jpg | 102.0 | It was supposed to be a routine prisoner trans... |
| 2997 | NaN | Three girls in 1980s Stockholm decide to form ... | /sS01LSy6KDrCZAhtkO18UdnWFT1.jpg | 102.0 | NaN |
| 2998 | NaN | Samantha Caine, suburban homemaker, is the ide... | /4MENR8x6mYqnZvp2hGjSaPJz64J.jpg | 120.0 | What's forgotten is not always gone. |
| 2999 | http://www.alongcamepolly.com/ | Reuben Feffer is a guy who's spent his entire ... | /nIY4kvJTTnxoBR0wycrXng5MOYs.jpg | 90.0 | For the most cautious man on Earth, life is ab... |
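
The selection above shows which columns contain missing values, but not how many. A minimal sketch of a per-column count follows; the `toy` frame and its values are made up and merely stand in for `train_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are invented)
toy = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "runtime": [93.0, np.nan, 105.0],
        "tagline": [np.nan, np.nan, "Some tagline"],
    }
)

# Count missing values per column and keep only columns that have any
missing = toy.isna().sum()
missing = missing[missing != 0].sort_values(ascending=False)
print(missing)
```

Applied to `train_df`, the same two lines would give the number of missing entries for each of the columns listed above.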


In [None]:
train_df[train_df["runtime"].isna()]["original_title"]

id
1336          Королёв
2303    Happy Weekend
Name: original_title, dtype: object

In [None]:
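# `vuoto` ("empty"): the value of a row assumed to have no collection (the third row)
vuoto = train_df["belongs_to_collection"].iloc[2]
# Count the non-null values per column among films that belong to a collection
train_df[train_df["belongs_to_collection"]!=vuoto].count()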

belongs_to_collection    604
budget                   604
genres                   604
homepage                 224
imdb_id                  604
original_language        604
original_title           604
overview                 603
popularity               604
poster_path              604
production_companies     604
production_countries     604
release_date             604
runtime                  604
spoken_languages         604
status                   604
tagline                  538
title                    604
Keywords                 604
cast                     604
crew                     604
revenue                  604
dtype: int64
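
The comparison above depends on the third row happening to be empty. As a sketch of an alternative that does not rely on row position, the truthiness of each value can be used instead, since empty dicts and lists are falsy in Python. The toy values below are invented and only imitate the parsed column.

```python
import pandas as pd

# Toy column imitating belongs_to_collection after text_to_dict (invented values)
collection = pd.Series(
    [[{"id": 1, "name": "Some Collection"}], {}, {}],
    dtype=object,
)

# bool() is False for an empty dict or list, True for a non-empty one
in_collection = collection.apply(bool)
print(in_collection.sum())
```

On the real data, `train_df[train_df["belongs_to_collection"].apply(bool)]` would select the films that belong to a collection, assuming missing values were replaced by `{}` as in `text_to_dict`.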