# **Project Name**    -



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Vikash Tiwari
##### **Team Member 1 -**
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**



*   There were 119390 rows & 32 columns in the actual dataset



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. Explore and analyze the data to discover important factors that govern the bookings.

#### **Define Your Business Objective?**

Analyse the Resort and City Hotel booking data to gain insights into factors such as the best time to book a hotel, the optimal length of stay to get the best rates, the periods when hotels can expect maximum and minimum bookings, and the factors that lead to booking cancellations.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]

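As a rough illustration of the "UBM" rule above, the following sketch draws one univariate, one bivariate, and one multivariate chart side by side. The column names (`views`, `likes`, `category`) and all values are invented for the example and are not taken from the project dataset. In a notebook with `%matplotlib inline`, the figure should render at the end of the cell.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: column names and values are assumptions made for illustration only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "views": rng.integers(100, 1000, size=50),
    "likes": rng.integers(10, 100, size=50),
    "category": rng.choice(["talk", "short"], size=50),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# U - Univariate: distribution of a single numerical column
axes[0].hist(toy["views"], bins=10)
axes[0].set_title("Univariate: views")

# B - Bivariate: numerical vs numerical
axes[1].scatter(toy["views"], toy["likes"])
axes[1].set_title("Bivariate: views vs likes")

# M - Multivariate: two numerical columns split by a categorical one
for name, group in toy.groupby("category"):
    axes[2].scatter(group["views"], group["likes"], label=name)
axes[2].legend()
axes[2].set_title("Multivariate: views vs likes by category")

fig.tight_layout()
```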




In [None]:
#HANDLING MISSING DATA
#STANDARDIZING DATA FORMATS
#FILTER UNWANTED OUTLIERS
#HANDLING DUPLICATES

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import missingno as msno
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
%matplotlib inline

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')
#Loading YouTube dataset
data= '/content/drive/MyDrive/ROBIN SHARMA/Final_CL_Robin_Sharma_YouTube_Data.csv'
Youtube_data= pd.read_csv(data)

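Since the guidelines ask for exception handling, the loading step could be wrapped in a small helper along these lines. This is only a sketch: `load_csv_safely` is a hypothetical helper name, and the path used in the example call is a placeholder rather than the real Drive path.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def load_csv_safely(path: str) -> Optional[pd.DataFrame]:
    """Hypothetical helper: return a DataFrame, or None if the file cannot be loaded."""
    csv_path = Path(path)
    # Check for the file first so a missing path gives a clear message
    if not csv_path.is_file():
        print(f"File not found: {csv_path}")
        return None
    try:
        return pd.read_csv(csv_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as err:
        # Empty or malformed CSV files end up here
        print(f"Could not parse {csv_path}: {err}")
        return None


# Placeholder path (an assumption for the example); the call returns None instead of raising
result = load_csv_safely("/hypothetical/path/does_not_exist.csv")
```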
### Dataset First View

In [None]:
# Youtube Dataset First Look
Youtube_data

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows, columns= Youtube_data.shape
print(f'There are {rows} rows & {columns} columns in this dataset')

### Dataset Information

In [None]:
# Dataset Info
Youtube_data.info()


**<h1>Changing Column Names</h1>**

In [None]:
#changing 'Unnamed: 0' column name to 'SR. No'
Youtube_data.rename(columns={'Unnamed: 0':'SR. No'},inplace=True)

In [None]:
#changing 'Unnamed: 3' column name to 'Published_date'
Youtube_data.rename(columns={'Unnamed: 3':'Published_date'},inplace=True)

In [None]:
#Checking whether the column names have been changed successfully
Youtube_data.info()

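The two renames could also be done in a single call with one mapping, as sketched below on a toy frame. Only the two `Unnamed` column names and their new names come from this notebook; the `Title` column and the values are invented. Note that `rename` silently skips keys that are not present, so the sketch also checks for missing columns; passing `errors="raise"` to `rename` is another way to catch that.

```python
import pandas as pd

# Toy frame: 'Title' and all values are assumptions made for illustration
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Title": ["a", "b"],
    "Unnamed: 3": ["01-01-2023", "02-01-2023"],
})

# One mapping holds every old name -> new name pair
new_names = {"Unnamed: 0": "SR. No", "Unnamed: 3": "Published_date"}

# Report any expected column that is absent before renaming
missing = [old for old in new_names if old not in toy.columns]
print("Columns missing from the frame:", missing)

renamed = toy.rename(columns=new_names)
print(list(renamed.columns))
```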
**<h1>1. Handling Duplicate Values</h1>**

In [None]:
# Dataset Duplicate Value Count
duplicate = Youtube_data.duplicated()
#duplicated() returns a boolean Series that is True for every row of Youtube_data that repeats an earlier row

In [None]:
duplicate

In [None]:
# Filter Youtube_data with the boolean Series 'duplicate' and store the result in duplicate_rows, so that only the duplicate rows are displayed
duplicate_rows= Youtube_data[duplicate]


In [None]:
duplicate_rows
#there are no duplicate rows in this dataframe

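For a dataset that does contain duplicates, the same check can be followed by `drop_duplicates()`. The sketch below uses a small invented frame (the `title` and `views` columns are assumptions) to show the count and the removal together.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are invented for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "views": [10, 20, 10, 30],
})

# duplicated() marks every repeat of an earlier row as True
dup_mask = toy.duplicated()
print("Duplicate rows found:", int(dup_mask.sum()))

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```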
**<h1>2. Handling Missing Values/Null Values</h1>**

In [None]:
# Missing Values/Null Values Count
missing_values= Youtube_data.isnull().sum().sort_values(ascending=False)
missing_values
#This dataframe doesn't have any missing values

In [None]:
#creating bar plot
plt.figure(figsize=(10,6))
#creating a color palette
palette = sns.color_palette("husl", len(missing_values))
bars = plt.bar(missing_values.index, missing_values, color=palette)
plt.title('Missing Values')
plt.xlabel('Columns')
# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
#Annotate each bar with its missing value count
for bar in bars:
  height=bar.get_height()
  plt.text(
      bar.get_x() + bar.get_width() / 2,
      height,
      f'{int(height)}',
      ha='center',
      va='bottom'
  )
  #plt.text() is a matplotlib function that adds text to the plot at the specified coordinates
plt.ylabel('No of Missing Values')
plt.show()

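Alongside the raw counts, the share of missing values per column is often easier to compare. The sketch below builds a small summary table on invented data; the column names are used only as examples and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data with some gaps (values are assumptions made for illustration)
toy = pd.DataFrame({
    "video_length": [12.5, np.nan, np.nan, 3.0],
    "tags_used": ["a", None, "b", "c"],
    "views": [1, 2, 3, 4],
})

# isnull().sum() counts nulls per column; isnull().mean() gives the fraction of nulls
summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(summary)
```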
## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Youtube_data.columns

In [None]:
# Dataset Describe
# The describe() method in pandas returns summary statistics for the numeric columns
Youtube_data.describe()

In [None]:
# Check Unique Values for each variable.
for elements in Youtube_data.columns:
  print('No of Unique values in',elements,'column is',Youtube_data[elements].nunique())

##Here 'elements' represents each column name

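The same per-column counts can be obtained without a loop, since `DataFrame.nunique()` returns one value per column. A minimal sketch on invented data:

```python
import pandas as pd

# Toy data (column names and values are assumptions made for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "A"],
    "views": [10, 20, 10],
    "likes": [1, 2, 3],
})

# nunique() counts the distinct non-null values in every column at once
unique_counts = toy.nunique().sort_values(ascending=False)
print(unique_counts)
```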
## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#Missing values
Youtube_data.isnull().sum().sort_values(ascending=False)
#Thus, there are no null values

In [None]:
Youtube_data.columns

**<h1>3) Standardizing Data Formats</h1>**

In [None]:
#Checking Data types of all columns
Youtube_data.info()

In [None]:
# 'Published_date' column is of object data type & has to be changed to datetime format
Youtube_data["Published_date"] = pd.to_datetime(Youtube_data["Published_date"],dayfirst=True)

In [None]:
Youtube_data["Published_date"]

In [None]:
#converting 'Published_time' from object datatype to datetime
Youtube_data['Published_time'] = pd.to_datetime(Youtube_data['Published_time'], format='%I:%M:%S %p')
#the above step gives us the time with a default date of '1900-01-01'.


In [None]:
Youtube_data['Published_time']

In [None]:
#use the .dt.time accessor to drop the default date and keep only the time
Youtube_data['Published_time'] = Youtube_data['Published_time'].dt.time

In [None]:
Youtube_data['Published_time']

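The sketch below walks through the same two steps on three invented 12-hour time strings. It also pulls out the hour with `.dt.hour` before the date is dropped, which can be handy for later grouping, because `.dt.time` leaves plain Python `time` objects in an object column.

```python
import pandas as pd

# Invented sample strings in the '%I:%M:%S %p' (12-hour) format
times = pd.Series(["09:15:00 AM", "01:30:45 PM", "12:00:00 AM"])

# Parsing a time-only string yields datetimes with the default date 1900-01-01
parsed = pd.to_datetime(times, format="%I:%M:%S %p")

# Keep the hour as a number while the column is still datetime
hours = parsed.dt.hour

# .dt.time drops the default date and keeps datetime.time objects
time_only = parsed.dt.time

print(parsed)
print(time_only)
```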
In [None]:
def parse_dates(date_str):
    for fmt in ('%d-%b-%y', '%d-%b-%Y', '%d %b %Y'):
        try:
            return pd.to_datetime(date_str, format=fmt, dayfirst=True)
        except ValueError:
            continue
    return pd.NaT  # return NaT if none of the formats match

# Apply the custom date parser function to the 'post_date' column
Linkedin_data["parsed_post_date"] = Linkedin_data["post_date"].apply(parse_dates)

# Identify rows with faulty 'post_date' data
faulty_dates = Linkedin_data[Linkedin_data["parsed_post_date"].isna()]

# Display rows with faulty 'post_date' data
print("Rows with faulty 'post_date' data:\n", faulty_dates)

In [None]:
#Checking Data types of all columns
Linkedin_data.info()

In [None]:
#converting post_date from string data type to datetime
try:
    Linkedin_data["post_date"] = pd.to_datetime(Linkedin_data["post_date"],dayfirst=True)
except ValueError as err:
    #this strict conversion raised an error because some entries could not be parsed as dates
    print("Conversion failed:", err)
#errors='coerce' is a parameter of the pd.to_datetime() function.
#When errors='coerce' is specified, pandas sets every value it cannot convert to NaT (Not a Time),
#the special pandas value that represents a missing or undefined datetime.
#Instead of halting the conversion with an error, pandas marks the problematic entries as NaT.
#The conversion with errors='coerce' is done in the next step.

In [None]:
#converting post_date from string data type to datetime
Linkedin_data["post_date"] = pd.to_datetime(Linkedin_data["post_date"], errors='coerce')
Linkedin_data.info()

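To see what `errors='coerce'` does in isolation, the sketch below converts four invented strings, two of which are not valid dates. The explicit `format` is an assumption made for the example; the invalid entries come back as `NaT` and can be listed afterwards for inspection.

```python
import pandas as pd

# Invented sample values: the last two cannot be parsed as day-month-year dates
raw = pd.Series(["15-01-2023", "31-12-2022", "not a date", "45-13-2023"])

# Unparseable entries become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%d-%m-%Y", errors="coerce")

# Recover the original strings that failed, so they can be reviewed
bad_rows = raw[parsed.isna()]

print(parsed)
print("Entries that could not be converted:", list(bad_rows))
```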
In [None]:
#Converting post_time to time datatype
import re

def clean_time_string(time_str):
    # Remove leading and trailing spaces
    time_str = time_str.strip()
    # Remove unwanted characters (keep only digits and colons)
    time_str = re.sub(r'[^0-9:]', '', time_str)
    return time_str

# Apply the cleaning function to the 'post_time' column
Linkedin_data['cleaned_time'] = Linkedin_data['post_time'].apply(clean_time_string)
print("\nCleaned Time Strings:")
print(Linkedin_data)

# Function to convert cleaned time strings to datetime.time
def convert_to_time(time_str):
    try:
        dt = datetime.strptime(time_str, "%H:%M:%S")  # Parse the time string
        return dt.time()  # Extract the time part
    except ValueError:
        return None  # Handle invalid formats

# Apply the conversion function to the cleaned 'cleaned_time' column
Linkedin_data['cl_post_time'] = Linkedin_data['cleaned_time'].apply(convert_to_time)
print("\nConverted Time Objects:")
print(Linkedin_data)

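One thing to keep in mind with the cleaning step above is that removing every character except digits and colons would also remove an AM/PM marker, if the strings carry one (the raw format of `post_time` is not shown here, so that is an assumption). An alternative sketch is to try several formats in turn and to return `None` for anything that is not a string. `parse_time_flexible` is a hypothetical helper name, and the list of formats is a guess at what the data might contain.

```python
from datetime import datetime, time
from typing import Optional


def parse_time_flexible(value) -> Optional[time]:
    """Hypothetical helper: try several time formats, return None if none match."""
    # Missing values (NaN, None) are not strings and cannot be parsed
    if not isinstance(value, str):
        return None
    text = value.strip().upper()
    # Assumed candidate formats: 12-hour with AM/PM first, then 24-hour
    for fmt in ("%I:%M:%S %p", "%I:%M %p", "%H:%M:%S", "%H:%M"):
        try:
            return datetime.strptime(text, fmt).time()
        except ValueError:
            continue
    return None


# Invented sample inputs
samples = [" 01:30:45 PM ", "13:30:45", "9:05 am", "abc", None]
results = [parse_time_flexible(s) for s in samples]
print(results)
```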
In [None]:
Linkedin_data.info()

In [None]:
Linkedin_data['cl_post_time'] = pd.to_datetime(Linkedin_data['cl_post_time'], format='%H:%M:%S', errors='coerce').dt.time
Linkedin_data.info()

In [None]:
print(Linkedin_data[Linkedin_data['cl_post_time'].isnull()])  # Check rows where conversion failed
#The output shows that all conversions succeeded (no rows have a null cl_post_time)
#This column is fully cleaned, so we use it as is


In [None]:
#drop the helper columns parsed_post_date, post_time and cleaned_time
columns_to_drop = ['parsed_post_date','post_time','cleaned_time']
Linkedin_data.drop(columns=columns_to_drop,inplace=True)


In [None]:
Linkedin_data.info()

In [None]:
#Converting cl_post_time from object datatype to datetime datatype
Linkedin_data['cl_post_time'] = pd.to_datetime(Linkedin_data['cl_post_time'], format='%H:%M:%S')


In [None]:
Linkedin_data.info()

<h1>1. There are 189 rows & 16 columns in this dataset</h1>

<h1>2. There was no duplicate data in this dataset</h1>

<h1>3. This dataframe had 106 null values in the video_length column, 53 nulls in the hashtags_used column, and 14 nulls in the tags_used column</h1>

<h1>4. Replaced null values of the video_length column with 0</h1>

<h1>5. Replaced null values of the hashtags_used column with 'none'</h1>
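
The two replacements listed above can be expressed with a single `fillna` call that takes one fill value per column. The sketch below runs on a small invented frame. Only the column names `video_length` and `hashtags_used` and the fill values `0` and `'none'` come from the summary; the data itself is made up.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (values are assumptions made for illustration)
toy = pd.DataFrame({
    "video_length": [12.5, np.nan, 3.0],
    "hashtags_used": ["#growth", None, "#habits"],
})

null_counts_before = toy.isnull().sum()

# One fill value per column: 0 for the numeric column, 'none' for the text column
fill_values = {"video_length": 0, "hashtags_used": "none"}
filled = toy.fillna(value=fill_values)

null_counts_after = filled.isnull().sum()
print(null_counts_before)
print(null_counts_after)
```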