# Cinetel Data Cleaning

Cinetel box-office data, scraped daily from the figures Cinetel publishes every day at this [link](https://www.cinetel.it/pages/boxoffice.php?edperiodo=aWVyaQ==).

## Reading In Files

In [51]:
import pandas as pd

In [52]:
df1 = pd.read_csv(r"E:\data_analysis_python\cinetel\cinetel_.csv")
df1.head()

| | Pos. | Titolo | Prima Progr. | Nazione | Distribuzione | Incasso | Presenze | Incasso al 16/11/2023 | Presenze al 16/11/2023 | 2023-11-17 15:31:50.351708 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | C'E' ANCORA DOMANI | '2023-10-26'26/10/2023 | ITA | VISION DISTRIBUTION | 449991.06 | 66700 | 15231636.85 | 2202169 | 2023-11-17 15:31:50.351708 |
| 1 | 2 | HUNGER GAMES - LA BALLATA DELL'USIGNOLO E DEL ... | '2023-11-15'15/11/2023 | USA | MEDUSA FILM S.P.A. | 238466.31 | 32469 | 584902.45 | 79311 | 2023-11-17 15:31:50.351708 |
| 2 | 3 | THE MARVELS | '2023-11-08'08/11/2023 | USA | WALT DISNEY S.M.P. ITALIA | 60914.35 | 8628 | 2260142.44 | 305291 | 2023-11-17 15:31:50.351708 |
| 3 | 4 | THANKSGIVING | '2023-11-16'16/11/2023 | USA | EAGLE PICTURES S.P.A. | 41232.57 | 5709 | 41232.57 | 5709 | 2023-11-17 15:31:50.351708 |
| 4 | 5 | THE OLD OAK | '2023-11-16'16/11/2023 | FRA | LUCKY RED DISTRIB. | 36605.89 | 5879 | 56540.79 | 8609 | 2023-11-17 15:31:50.351708 |


---

## Cleaning the dataframe

To clean the dataframe, I dropped duplicate rows and the columns that aren't relevant to my analysis, such as `df1['Incasso al 16/11/2023']` and `df1['Presenze al 16/11/2023']`. I then renamed the remaining columns in snake_case to make the analysis easier.

In [53]:
df1 = df1.drop_duplicates()

In [54]:
df1 = df1.drop(columns= ['Incasso al 16/11/2023', 'Presenze al 16/11/2023'])
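
The two dropped columns carry a date in their name, so hard-coded labels presumably only match a file whose running totals refer to that day. A minimal sketch of a pattern-based alternative, assuming every running-total column is named `<something> al dd/mm/yyyy`; the `sample` frame and the `cumulative_cols` and `trimmed` names are made up for illustration:

```python
import pandas as pd

# Toy frame mimicking the scraped layout (values copied from the first row above).
sample = pd.DataFrame({
    "Titolo": ["C'E' ANCORA DOMANI"],
    "Incasso": [449991.06],
    "Presenze": [66700],
    "Incasso al 16/11/2023": [15231636.85],
    "Presenze al 16/11/2023": [2202169],
})

# Select every column whose name ends in " al dd/mm/yyyy", whatever the date is.
cumulative_cols = sample.filter(regex=r" al \d{2}/\d{2}/\d{4}$").columns
trimmed = sample.drop(columns=cumulative_cols)

print(list(trimmed.columns))
```

`DataFrame.filter` only selects columns, so the matching labels are passed on to `drop`.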

In [55]:
df1 = df1.rename(columns = {'Pos.':'daily_rank', 'Titolo' : 'title', 'Prima Progr.' : 'first_screening_date', 'Nazione' : 'nation', 'Distribuzione' : 'distribution', 'Incasso' : 'daily_takings', 'Presenze' : 'daily_attendance',  '2023-11-17 15:31:50.351708' : 'date_pulled'})

In [56]:
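# dropping the extra 'header row' that is appended every day when the .csv file gets updated
# regex=False makes the '.' in 'Pos.' match a literal dot instead of any character
df1 = df1[~df1['daily_rank'].str.contains('Pos.', regex=False, na=False)]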
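
By default `str.contains` treats its pattern as a regular expression, so the dot in `'Pos.'` matches any character. The short sketch below shows the difference on a made-up Series; the value `"Post"` is hypothetical and does not occur in the scraped data:

```python
import pandas as pd

# "Post" is an invented value, used only to show the effect of the regex dot.
ranks = pd.Series(["1", "Pos.", "Post"])

as_regex = ranks.str.contains("Pos.")                 # "." matches any character
as_literal = ranks.str.contains("Pos.", regex=False)  # "." must be a literal dot

print(as_regex.tolist())    # [False, True, True]
print(as_literal.tolist())  # [False, True, False]
```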

In [57]:
# making the 'title' and 'distribution' entries consistent by applying title case
df1['title'] = df1['title'].str.title()
df1['distribution'] = df1['distribution'].str.title()

In [58]:
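# cleaning the 'first_screening_date' entries: the raw value looks like '2023-10-26'26/10/2023,
# so skipping the first 12 characters (the quoted ISO date) leaves 26/10/2023
df1['first_screening_date'] = df1['first_screening_date'].str[12:]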
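
The fixed slice relies on the prefix always being exactly 12 characters long. A sketch of an alternative that pulls out the trailing `dd/mm/yyyy` part by pattern, assuming every raw value ends with a date in that form; the `raw`, `extracted` and `parsed` names are illustrative only:

```python
import pandas as pd

# Two raw values copied from the first rows of the scraped file.
raw = pd.Series(["'2023-10-26'26/10/2023", "'2023-11-15'15/11/2023"])

# Capture the dd/mm/yyyy date at the end of each string.
extracted = raw.str.extract(r"(\d{2}/\d{2}/\d{4})$", expand=False)
parsed = pd.to_datetime(extracted, format="%d/%m/%Y")

print(extracted.tolist())
```

Values that do not end in a date come back as `NaN`, so they can be spotted before parsing.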

In [59]:
# cleaning the column showing the date the data was pulled from the site (keeping only the date part)
df1['date_pulled'] = pd.to_datetime(df1['date_pulled']).dt.date

#extra tries:
#df1['date_pulled'] = df1['date_pulled'].str[:10]
#df1['date_pulled'] = df1['date_pulled'].str.replace('[^a-zA-Z0-9]', '', regex = True)
#df1['date_pulled'] = df1['date_pulled'].apply(lambda x: str(x))
#df1['date_pulled'] = df1['date_pulled'].apply(lambda x: x[0:2] + '/' + x[2:4] + '/' + x[4:8])

In [60]:
from datetime import timedelta, datetime

In [61]:
# parsing the dd/mm/yyyy strings and keeping only the date part
df1['first_screening_date'] = pd.to_datetime(df1['first_screening_date'], format='%d/%m/%Y').dt.date

In [62]:
df1['daily_takings'] = df1['daily_takings'].astype(float)
df1['daily_attendance'] = df1['daily_attendance'].astype(int)

---

## Data manipulation

I decided to create new columns to get information I may need later on during the analysis.

In [63]:
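# calculating the number of days each movie has been screening, counted up to the day the notebook is run
df1['screening_days'] = datetime.today().date() - df1['first_screening_date']
# str(timedelta) looks like '30 days, 0:00:00'; dropping the last 9 characters keeps '30 days'
df1['screening_days'] = df1['screening_days'].map(lambda x: str(x)[:-9])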
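
Because the cell above counts up to the day the notebook is run, the result changes on every run and comes out as text. A sketch of a reproducible variant, assuming the count should instead run up to the day each row's figures refer to; the `films` frame and the `days_since_release` column name are hypothetical:

```python
import pandas as pd

# Toy rows: release date and the day the figures refer to (dates taken from the data above).
films = pd.DataFrame({
    "title": ["C'E' Ancora Domani", "Thanksgiving"],
    "first_screening_date": pd.to_datetime(["2023-10-26", "2023-11-16"]),
    "date": pd.to_datetime(["2023-11-16", "2023-11-16"]),
})

# Subtracting two datetime64 columns gives a timedelta; .dt.days turns it into an integer.
films["days_since_release"] = (films["date"] - films["first_screening_date"]).dt.days

print(films["days_since_release"].tolist())  # [21, 0]
```

An integer column can be sorted, averaged, or plotted directly, which the `'30 days'` strings cannot.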

In [64]:
# creating a column to show the date the 'daily_takings' and 'daily_attendance' refer to
df1['date'] = df1['date_pulled'] - timedelta(days=1)
df1['date'] = pd.to_datetime(df1['date'])

In [65]:
# creating a column showing the day of the week, which will be useful for further analysis
df1['day_of_week'] = df1['date'].dt.day_name()

In [66]:
# average ticket price: daily takings divided by daily attendance
df1['avg_ticket_price'] = df1['daily_takings'] / df1['daily_attendance']

In [67]:
pd.set_option('display.max_rows', 8)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df1

| | daily_rank | title | first_screening_date | nation | distribution | daily_takings | daily_attendance | date_pulled | screening_days | date | day_of_week | avg_ticket_price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | C'E' Ancora Domani | 2023-10-26 | ITA | Vision Distribution | 449991.06 | 66700 | 2023-11-17 | 30 days | 2023-11-16 | Thursday | 6.75 |
| 1 | 2 | Hunger Games - La Ballata Dell'Usignolo E Del ... | 2023-11-15 | USA | Medusa Film S.P.A. | 238466.31 | 32469 | 2023-11-17 | 10 days | 2023-11-16 | Thursday | 7.34 |
| 2 | 3 | The Marvels | 2023-11-08 | USA | Walt Disney S.M.P. Italia | 60914.35 | 8628 | 2023-11-17 | 17 days | 2023-11-16 | Thursday | 7.06 |
| 3 | 4 | Thanksgiving | 2023-11-16 | USA | Eagle Pictures S.P.A. | 41232.57 | 5709 | 2023-11-17 | 9 days | 2023-11-16 | Thursday | 7.22 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 94 | 7 | The Old Oak | 2023-11-16 | GBR | Lucky Red Distrib. | 34340.83 | 5408 | 2023-11-25 | 9 days | 2023-11-24 | Friday | 6.35 |
| 95 | 8 | Comandante | 2023-10-31 | ITA | 01 Distribution | 19473.82 | 2972 | 2023-11-25 | 25 days | 2023-11-24 | Friday | 6.55 |
| 96 | 9 | Trolls 3 - Tutti Insieme (Trolls Band Together) | 2023-11-09 | USA | Universal S.R.L. | 19122.86 | 3119 | 2023-11-25 | 16 days | 2023-11-24 | Friday | 6.13 |
| 97 | 10 | La Chimera | 2023-11-23 | ITA | 01 Distribution | 16140.23 | 2427 | 2023-11-25 | 2 days | 2023-11-24 | Friday | 6.65 |


---

## Data Exploration

In [68]:
import seaborn as sns
import matplotlib.pyplot as plt

In [69]:
pd.set_option('display.max_rows', 8)

In [70]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90 entries, 0 to 97
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   daily_rank            90 non-null     object        
 1   title                 90 non-null     object        
 2   first_screening_date  90 non-null     object        
 3   nation                90 non-null     object        
 4   distribution          90 non-null     object        
 5   daily_takings         90 non-null     float64       
 6   daily_attendance      90 non-null     int32         
 7   date_pulled           90 non-null     object        
 8   screening_days        90 non-null     object        
 9   date                  90 non-null     datetime64[ns]
 10  day_of_week           90 non-null     object        
 11  avg_ticket_price      90 non-null     float64       
dtypes: datetime64[ns](1), float64(2), int32(1), object(8)
memory usage: 8.8+ KB


In [71]:
df1.describe()

| | daily_takings | daily_attendance | date | avg_ticket_price |
| --- | --- | --- | --- | --- |
| count | 90.0 | 90.0 | 90 | 90.0 |
| mean | 163342.28 | 23227.34 | 2023-11-20 00:00:00 | 6.8 |
| min | 8904.68 | 1458.0 | 2023-11-16 00:00:00 | 4.22 |
| 25% | 19851.64 | 3640.25 | 2023-11-18 00:00:00 | 6.23 |
| 50% | 43573.77 | 6567.0 | 2023-11-20 00:00:00 | 6.74 |
| 75% | 162611.63 | 22648.25 | 2023-11-22 00:00:00 | 7.34 |
| max | 1544231.0 | 211764.0 | 2023-11-24 00:00:00 | 10.94 |
| std | 268945.18 | 38032.76 | | 1.11 |


In [72]:
df1.isnull().sum()

daily_rank              0
title                   0
first_screening_date    0
nation                  0
                       ..
screening_days          0
date                    0
day_of_week             0
avg_ticket_price        0
Length: 12, dtype: int64

In [73]:
df1.nunique()

daily_rank              10
title                   16
first_screening_date    11
nation                   5
                        ..
screening_days          11
date                     9
day_of_week              7
avg_ticket_price        90
Length: 12, dtype: int64

In [74]:
df1.corr(numeric_only=True)

| | daily_takings | daily_attendance | avg_ticket_price |
| --- | --- | --- | --- |
| daily_takings | 1.0 | 0.99 | 0.18 |
| daily_attendance | 0.99 | 1.0 | 0.13 |
| avg_ticket_price | 0.18 | 0.13 | 1.0 |


In [75]:
df1.groupby('title').mean(numeric_only=True).sort_values(by = 'daily_attendance', ascending= False)

| title | daily_takings | daily_attendance | avg_ticket_price |
| --- | --- | --- | --- |
| C'E' Ancora Domani | 752609.26 | 111728.11 | 6.62 |
| Napoleon | 478751.41 | 64918.50 | 7.37 |
| Hunger Games - La Ballata Dell'Usignolo E Del Serpente | 331998.34 | 43894.67 | 7.39 |
| The Marvels | 89322.75 | 12277.11 | 6.97 |
| ... | ... | ... | ... |
| Io Capitano | 15849.42 | 3666.00 | 4.34 |
| Dream Scenario - Hai Mai Sognato Quest'Uomo? | 24143.02 | 3622.86 | 6.47 |
| Killers Of The Flower Moon | 23930.22 | 3229.33 | 7.20 |
| La Chimera | 12522.45 | 1942.50 | 6.38 |


In [76]:
df2 = df1.groupby('title').mean(numeric_only=True).sort_values(by = 'daily_attendance', ascending= False)
df2

| title | daily_takings | daily_attendance | avg_ticket_price |
| --- | --- | --- | --- |
| C'E' Ancora Domani | 752609.26 | 111728.11 | 6.62 |
| Napoleon | 478751.41 | 64918.50 | 7.37 |
| Hunger Games - La Ballata Dell'Usignolo E Del Serpente | 331998.34 | 43894.67 | 7.39 |
| The Marvels | 89322.75 | 12277.11 | 6.97 |
| ... | ... | ... | ... |
| Io Capitano | 15849.42 | 3666.00 | 4.34 |
| Dream Scenario - Hai Mai Sognato Quest'Uomo? | 24143.02 | 3622.86 | 6.47 |
| Killers Of The Flower Moon | 23930.22 | 3229.33 | 7.20 |
| La Chimera | 12522.45 | 1942.50 | 6.38 |
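
The per-title mean gives every day the same weight, so the averaged `avg_ticket_price` is a mean of daily ratios, not total takings over total attendance. As a possible complement, the sketch below uses named aggregation to get totals, the number of days observed, and an attendance-weighted price. The `daily` frame, its film names and figures, and the `days_observed`, `total_takings` and `total_attendance` column names are all invented for illustration:

```python
import pandas as pd

# Made-up daily figures for two hypothetical films.
daily = pd.DataFrame({
    "title": ["Film A", "Film A", "Film B"],
    "daily_takings": [100.0, 330.0, 50.0],
    "daily_attendance": [10, 30, 10],
})

# One row per title: days in the data, total takings, total attendance.
summary = daily.groupby("title").agg(
    days_observed=("daily_takings", "size"),
    total_takings=("daily_takings", "sum"),
    total_attendance=("daily_attendance", "sum"),
)

# Attendance-weighted ticket price: total takings divided by total attendance.
summary["avg_ticket_price"] = summary["total_takings"] / summary["total_attendance"]

print(summary)
```

For `Film A` the weighted price is 430 / 40 = 10.75, while the plain mean of its two daily prices (10.0 and 11.0) would be 10.5.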
