# Cinetel Data Cleaning

The Cinetel data is scraped daily from the Cinetel online box office at this [link](https://www.cinetel.it/pages/boxoffice.php?edperiodo=aWVyaQ==).

## Reading In Files

In [36]:
import pandas as pd

In [37]:
df1 = pd.read_csv(r"E:\data_analysis_python\cinetel\cinetel_.csv")
pd.set_option('display.max_rows', 8)
df1

| Unnamed: 0 | Pos. | Titolo | Prima Progr. | Nazione | Distribuzione | Incasso | Presenze | Incasso al 16/11/2023 | Presenze al 16/11/2023 | 2023-11-17 15:31:50.351708 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | C'E' ANCORA DOMANI | '2023-10-26'26/10/2023 | ITA | VISION DISTRIBUTION | 449991.06 | 66700 | 15231636.85 | 2202169 | 2023-11-17 15:31:50.351708 |
| 1 | 2 | HUNGER GAMES - LA BALLATA DELL'USIGNOLO E DEL ... | '2023-11-15'15/11/2023 | USA | MEDUSA FILM S.P.A. | 238466.31 | 32469 | 584902.45 | 79311 | 2023-11-17 15:31:50.351708 |
| 2 | 3 | THE MARVELS | '2023-11-08'08/11/2023 | USA | WALT DISNEY S.M.P. ITALIA | 60914.35 | 8628 | 2260142.44 | 305291 | 2023-11-17 15:31:50.351708 |
| 3 | 4 | THANKSGIVING | '2023-11-16'16/11/2023 | USA | EAGLE PICTURES S.P.A. | 41232.57 | 5709 | 41232.57 | 5709 | 2023-11-17 15:31:50.351708 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50 | 7 | COMANDANTE | '2023-10-31'31/10/2023 | ITA | 01 DISTRIBUTION | 19637.37 | 3455 | 3252914.25 | 477655 | 2023-11-21 10:43:31.249268 |
| 51 | 8 | IO CAPITANO | '2023-09-07'07/09/2023 | COP | 01 DISTRIBUTION | 14251.70 | 3374 | 4189929.12 | 726793 | 2023-11-21 10:43:31.249268 |
| 52 | 9 | TROLLS 3 - TUTTI INSIEME (TROLLS BAND TOGETHER) | '2023-11-09'09/11/2023 | USA | UNIVERSAL S.R.L. | 13974.40 | 2376 | 1834876.28 | 270536 | 2023-11-21 10:43:31.250268 |
| 53 | 10 | DREAM SCENARIO - HAI MAI SOGNATO QUEST'UOMO? | '2023-11-16'16/11/2023 | USA | I WONDER PICTURES S.R.L. | 12969.19 | 2175 | 136938.61 | 19876 | 2023-11-21 10:43:31.250268 |


---

## Cleaning the dataframe

To clean the dataframe I decided to drop duplicate rows and the columns that aren't relevant to my analysis, namely `df1['Incasso al 16/11/2023']` and `df1['Presenze al 16/11/2023']`. I then renamed the remaining columns in snake_case to make the analysis easier.

In [38]:
df1 = df1.drop_duplicates()

In [39]:
df1 = df1.drop(columns= ['Incasso al 16/11/2023', 'Presenze al 16/11/2023'])

In [40]:
df1 = df1.rename(columns = {'Pos.':'daily_rank', 'Titolo' : 'title', 'Prima Progr.' : 'first_screening_date', 'Nazione' : 'nation', 'Distribuzione' : 'distribution', 'Incasso' : 'daily_takings', 'Presenze' : 'daily_attendance',  '2023-11-17 15:31:50.351708' : 'date_pulled'})

In [41]:
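# dropping the extra 'header row' that is imported every day when the .csv file gets updated
# regex=False matches the dot in 'Pos.' literally; na=False treats missing ranks as non-header rows
df1 = df1[~df1['daily_rank'].str.contains('Pos.', regex=False, na=False)]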
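
As a small illustration of the header-row filter, the sketch below builds a toy frame (made-up values, not Cinetel data) in which a header line is repeated among the data rows. It assumes the rank column is read in as strings, and it passes `regex=False` so that the dot in `'Pos.'` is matched literally rather than as a regex wildcard.

```python
import pandas as pd

# Toy data (hypothetical values): the second row is a repeated header line
toy = pd.DataFrame({
    "daily_rank": ["1", "Pos.", "2"],
    "title": ["Film A", "Titolo", "Film B"],
})

# Flag rows whose rank contains the literal header text 'Pos.'
is_header = toy["daily_rank"].str.contains("Pos.", regex=False, na=False)

# Keep everything that is not a header row and rebuild a clean index
cleaned = toy[~is_header].reset_index(drop=True)
print(cleaned)
```

Using `~` on the boolean mask is an idiomatic alternative to comparing with `== False`.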

In [42]:
# making the 'title' and 'distribution' entries consistent by applying title case
df1['title'] = df1['title'].str.title()
df1['distribution'] = df1['distribution'].str.title()

In [43]:
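# cleaning the 'first_screening_date' entries: dropping the 12-character quoted prefix
# (e.g. '2023-10-26') and keeping only the dd/mm/yyyy part
df1['first_screening_date'] = df1['first_screening_date'].str[12:]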
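
The fixed-width slice works as long as every entry starts with exactly 12 prefix characters. As a possible alternative, the sketch below pulls out the trailing `dd/mm/yyyy` part with a regular expression, so it does not depend on the prefix length. The two sample strings are copied from the raw output above; the variable names are only illustrative.

```python
import pandas as pd

# Two raw entries as they appear in the scraped 'Prima Progr.' column
raw = pd.Series(["'2023-10-26'26/10/2023", "'2023-11-15'15/11/2023"])

# Capture the dd/mm/yyyy date at the end of each string
extracted = raw.str.extract(r"(\d{2}/\d{2}/\d{4})$", expand=False)

# Parse the extracted text as day-first dates
parsed = pd.to_datetime(extracted, format="%d/%m/%Y")
print(parsed)
```

On these two samples the regex and the `.str[12:]` slice give the same strings; they would only differ if the prefix length changed.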

In [44]:
# cleaning the column showing the date the data was pulled from the site
df1['date_pulled'] = pd.to_datetime(df1['date_pulled']).dt.date

# earlier attempts, kept for reference:
#df1['date_pulled'] = df1['date_pulled'].str[:10]
#df1['date_pulled'] = df1['date_pulled'].str.replace('[^a-zA-Z0-9]', '', regex = True)
#df1['date_pulled'] = df1['date_pulled'].apply(lambda x: str(x))
#df1['date_pulled'] = df1['date_pulled'].apply(lambda x: x[0:2] + '/' + x[2:4] + '/' + x[4:8])

In [45]:
from datetime import timedelta, datetime

In [46]:
# parsing the dd/mm/yyyy strings and keeping only the date part
df1['first_screening_date'] = pd.to_datetime(df1['first_screening_date'], format='%d/%m/%Y').dt.date

---

## Data manipulation

I decided to create new columns to get information I may need later on during the analysis.

In [47]:
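# calculating the number of days since each movie's first screening, counted up to the day the notebook is run
df1['screening_days'] = datetime.today().date() - df1['first_screening_date']
# keeping only the 'N days' part of the timedelta string
df1['screening_days'] = df1['screening_days'].map(lambda x: str(x)[:-9])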
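
Because the cell above measures from `datetime.today()`, the values change every time the notebook is rerun. A possible alternative, sketched below on a toy frame, is to count the days up to the date each row was pulled and to store them as integers via `.dt.days`. This is a different definition from the one used in this notebook, shown only for comparison; the two sample rows reuse dates from the output above.

```python
import pandas as pd

# Toy frame with datetime64 columns (dates taken from two rows of the data above)
toy = pd.DataFrame({
    "first_screening_date": pd.to_datetime(["2023-10-26", "2023-10-31"]),
    "date_pulled": pd.to_datetime(["2023-11-17", "2023-11-21"]),
})

# Subtracting two datetime64 columns yields timedeltas; .dt.days gives plain integers
toy["screening_days"] = (toy["date_pulled"] - toy["first_screening_date"]).dt.days
print(toy)
```

Integer day counts can be sorted, averaged, and plotted directly, which is harder with strings such as `'26 days'`.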

In [48]:
# creating a column to show the date the 'daily_takings' and 'daily_attendance' refer to (the day before the pull)
df1['date'] = df1['date_pulled'] - timedelta(days=1)
df1['date'] = pd.to_datetime(df1['date'])

In [49]:
# creating a column showing the day of the week. it'll be useful for further analysis
df1['day_of_week'] = df1['date'].dt.day_name()
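
The last two steps can also be done without leaving the `datetime64` dtype. The sketch below is an illustration on two pull timestamps copied from the raw data: it truncates them to midnight with `dt.normalize()`, shifts them back one day with `pd.Timedelta`, and reads the weekday name. The variable names are hypothetical.

```python
import pandas as pd

# Two pull timestamps as they appear in the raw file
pulled = pd.to_datetime(pd.Series([
    "2023-11-17 15:31:50.351708",
    "2023-11-21 10:43:31.249268",
]))

# Drop the time of day, then step back one day to the date the figures refer to
reference_date = pulled.dt.normalize() - pd.Timedelta(days=1)

# Weekday name of the reference date
day_of_week = reference_date.dt.day_name()
print(pd.DataFrame({"date": reference_date, "day_of_week": day_of_week}))
```

Staying in `datetime64` avoids the round trip through Python `date` objects and a second `pd.to_datetime` call.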

In [50]:
df1

| Unnamed: 0 | daily_rank | title | first_screening_date | nation | distribution | daily_takings | daily_attendance | date_pulled | screening_days | date | day_of_week |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | C'E' Ancora Domani | 2023-10-26 | ITA | Vision Distribution | 449991.06 | 66700 | 2023-11-17 | 26 days | 2023-11-16 | Thursday |
| 1 | 2 | Hunger Games - La Ballata Dell'Usignolo E Del ... | 2023-11-15 | USA | Medusa Film S.P.A. | 238466.31 | 32469 | 2023-11-17 | 6 days | 2023-11-16 | Thursday |
| 2 | 3 | The Marvels | 2023-11-08 | USA | Walt Disney S.M.P. Italia | 60914.35 | 8628 | 2023-11-17 | 13 days | 2023-11-16 | Thursday |
| 3 | 4 | Thanksgiving | 2023-11-16 | USA | Eagle Pictures S.P.A. | 41232.57 | 5709 | 2023-11-17 | 5 days | 2023-11-16 | Thursday |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50 | 7 | Comandante | 2023-10-31 | ITA | 01 Distribution | 19637.37 | 3455 | 2023-11-21 | 21 days | 2023-11-20 | Monday |
| 51 | 8 | Io Capitano | 2023-09-07 | COP | 01 Distribution | 14251.70 | 3374 | 2023-11-21 | 75 days | 2023-11-20 | Monday |
| 52 | 9 | Trolls 3 - Tutti Insieme (Trolls Band Together) | 2023-11-09 | USA | Universal S.R.L. | 13974.40 | 2376 | 2023-11-21 | 12 days | 2023-11-20 | Monday |
| 53 | 10 | Dream Scenario - Hai Mai Sognato Quest'Uomo? | 2023-11-16 | USA | I Wonder Pictures S.R.L. | 12969.19 | 2175 | 2023-11-21 | 5 days | 2023-11-20 | Monday |


---

## Visualizations

In [51]:
# import numpy as np
# import matplotlib.pyplot as plt