In [1]:
import pandas as pd

# Specify path to Excel file
file_path = '/Users/drod/de_projects/project3/clean_netflix.xlsx'

# Load the Excel file into a dictionary of DataFrames, one for each sheet
all_sheets_dict = pd.read_excel(file_path, sheet_name=None)

<h3 align="center">Displaying Data</h3>

In [2]:
for sheet_name, df in all_sheets_dict.items():
    print(f"Original Sheet name: {sheet_name}")
    print(df.head(), "\n")  # Display the first few rows
    print("Null values count:")
    print(df.isnull().sum(), "\n")  # Display the count of null values
    print("Original Data types:")
    print(df.dtypes, "\n")  # Display the data types


Original Sheet name: Netflix annual revenue 
   Year  Revenue ($bn)
0  2011            3.1
1  2012            3.5
2  2013            4.3
3  2014            5.4
4  2015            6.7 

Null values count:
Year             0
Revenue ($bn)    0
dtype: int64 

Original Data types:
Year               int64
Revenue ($bn)    float64
dtype: object 

Original Sheet name: Netflix annual revenue by regio
   Year  US & Canada  EMEA  Latin America  Asia-Pacific
0  2018         8.28  3.95           2.22          0.94
1  2019        10.05  5.54           2.78          1.46
2  2020        11.45  7.77           3.13          2.37
3  2021        12.97  9.69           3.57          3.26
4  2022        14.08  9.74           4.06          3.57 

Null values count:
Year             0
US & Canada      0
EMEA             0
Latin America    0
Asia-Pacific     0
dtype: int64 

Original Data types:
Year               int64
US & Canada      float64
EMEA             float64
Latin America    float64
Asia-Pacific   

<h3 align="center">Thought Process</h3>

Upon initial examination of the dataset, several areas for data cleaning and standardization were identified:

- **Incomplete and Misformatted Titles**:
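  - The table titles appeared incomplete and misformatted, likely because of the data scraping process or a lack of post-processing. For example, the sheet name 'Netflix annual revenue by regio' is cut off at exactly 31 characters, the maximum length Excel allows for a sheet name.
  - Titles containing extraneous numbers and letters that conveyed no meaningful information were cleaned for clarity.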

- **Data Type Correction for 'Year' and 'Date'**:
  - Initially, the 'Year' and 'Date' fields were stored as integers (`int`), so visualizations applied a thousands separator to them (e.g., '2018' displayed as '2,018').
  - To fix this, these fields were converted to strings. They now display correctly in visualizations and remain usable for year-based grouping and analysis.

- **Column Name Standardization**:
  - Spaces in column names can cause compatibility issues with some data analysis tools.
  - For readability and consistency with common data processing conventions, spaces in column names were replaced with underscores (`_`).

- **Data Integrity Check**:
  - A comprehensive check for null values was conducted across all tables to ensure data completeness.
  - The check found no missing data: every column reports a null count of zero (e.g., 'Year' and 'Revenue ($bn)'), so the dataset is complete for analysis.

These steps were taken to enhance the dataset's usability, ensuring that it is clean, consistent, and well-prepared for in-depth analysis and interactive visualization.
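
The cleaning steps above can be illustrated on a small, made-up DataFrame. This is a minimal sketch rather than the notebook's actual cell: the sample values and the helper name `clean_sheet` are assumptions introduced here for illustration.

```python
import pandas as pd

# Toy data (illustrative values only, not loaded from the workbook)
sample = pd.DataFrame({"Year": [2018, 2019], "Latin America": [2.22, 2.78]})

def clean_sheet(df):
    """Return a cleaned copy: underscores in column names, 'Year'/'Date' as strings."""
    out = df.copy()
    out.columns = [str(col).replace(" ", "_") for col in out.columns]
    for col in ("Year", "Date"):
        if col in out.columns:
            out[col] = out[col].astype(str)
    return out

cleaned = clean_sheet(sample)

# An integer year picks up a thousands separator when formatted as a number,
# which is the kind of display problem described above
print(f"{2018:,}")               # 2,018
print(cleaned["Year"].tolist())  # ['2018', '2019']

# Integrity check: total number of missing values across the sheet
print(int(cleaned.isnull().sum().sum()))  # 0
```

Returning a copy leaves the original frame untouched, which makes it easy to compare the data before and after cleaning.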



<h3 align="center">Correcting Data Types and Replacing Spaces in Column Names</h3>

In [8]:
for sheet_name, df in all_sheets_dict.items():
    # Replace spaces with underscores in column names
    df.columns = [col.replace(' ', '_') for col in df.columns]

    # Convert 'Year' and 'Date' columns to strings, if they exist
    for col in ['Year', 'Date']:
        if col in df.columns:
            df[col] = df[col].astype(str)

    # Save the modified DataFrame back to the dictionary
    all_sheets_dict[sheet_name] = df
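
The loop above changes column names only. Sheet names such as 'Netflix annual revenue ' keep their trailing space. One possible way to tidy them is sketched below; the helper name `tidy_sheet_names` and the toy dictionary are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def tidy_sheet_names(sheets):
    """Return a new dict whose keys have leading/trailing whitespace removed."""
    tidy = {}
    for name, df in sheets.items():
        new_name = name.strip()
        # Guard against two sheets collapsing onto the same name
        if new_name in tidy:
            raise ValueError(f"Duplicate sheet name after stripping: {new_name!r}")
        tidy[new_name] = df
    return tidy

# Toy stand-in for all_sheets_dict (illustrative values only)
toy_sheets = {"Netflix annual revenue ": pd.DataFrame({"Year": ["2011"]})}
tidied = tidy_sheet_names(toy_sheets)
print(list(tidied))  # ['Netflix annual revenue']
```

Building a new dictionary avoids renaming keys while iterating over the original one.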


<h3 align="center">Saving Changes</h3>

In [9]:
# Create a new Excel file with the modified DataFrames
output_file_path = '/Users/drod/de_projects/project3/netflix_modified.xlsx'
with pd.ExcelWriter(output_file_path) as writer:
    for sheet_name, df in all_sheets_dict.items():
        df.to_excel(writer, sheet_name=sheet_name, index=False)
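
Excel limits sheet names to 31 characters and does not allow the characters `[ ] : * ? / \` in them, so a quick check before writing can flag names that would cause trouble. The following is a minimal sketch; the helper name `problem_sheet_names` and the example names are hypothetical.

```python
EXCEL_MAX_SHEET_NAME = 31                  # Excel's sheet-name length limit
EXCEL_FORBIDDEN_CHARS = set('[]:*?/\\')    # characters Excel does not allow in sheet names

def problem_sheet_names(names):
    """Return the names that are empty, too long, or contain forbidden characters."""
    return [
        name for name in names
        if not name
        or len(name) > EXCEL_MAX_SHEET_NAME
        or any(ch in EXCEL_FORBIDDEN_CHARS for ch in name)
    ]

# Made-up names: one valid, one with a colon, one that is 32 characters long
names = ["Netflix annual revenue", "Revenue: by region", "x" * 32]
flagged = problem_sheet_names(names)
print(flagged)
```

In the notebook, the same check could be applied to the dictionary keys, for example `problem_sheet_names(all_sheets_dict)`, before opening the `ExcelWriter`.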


<h3 align="center">Final Results</h3>

In [10]:
# Specify the path to the modified Excel file
mod_path = '/Users/drod/de_projects/project3/netflix_modified.xlsx'

# Load the Excel file into a dictionary of DataFrames, one for each sheet
all_mod_sheets_dict = pd.read_excel(mod_path, sheet_name=None)

# Displaying data from each sheet
for sheet_name, df in all_mod_sheets_dict.items():
    print(f"Sheet name: {sheet_name}")
    print(df.head(), "\n")  # Display the first few rows of the DataFrame
    print("Null values count:")
    print(df.isnull().sum(), "\n")  # Display the count of null values in each column
    print("Data types:")
    print(df.dtypes, "\n")  # Display the data types of each column

Sheet name: Netflix annual revenue 
         Year  Revenue_($bn)
0  2011-01-01            3.1
1  2012-01-01            3.5
2  2013-01-01            4.3
3  2014-01-01            5.4
4  2015-01-01            6.7 

Null values count:
Year             0
Revenue_($bn)    0
dtype: int64 

Data types:
Year              object
Revenue_($bn)    float64
dtype: object 

Sheet name: Netflix annual revenue by regio
         Year  US_&_Canada  EMEA  Latin_America  Asia-Pacific
0  2018-01-01         8.28  3.95           2.22          0.94
1  2019-01-01        10.05  5.54           2.78          1.46
2  2020-01-01        11.45  7.77           3.13          2.37
3  2021-01-01        12.97  9.69           3.57          3.26
4  2022-01-01        14.08  9.74           4.06          3.57 

Null values count:
Year             0
US_&_Canada      0
EMEA             0
Latin_America    0
Asia-Pacific     0
dtype: int64 

Data types:
Year              object
US_&_Canada      float64
EMEA             float64
Lati