# **EDA_Facebook_utilization_Analysis**
# **By Amit Kharche**
**Follow me** on [Linkedin](https://www.linkedin.com/in/amit-kharche) and [Medium](https://medium.com/@amitkharche14) for more insights on **Data Science** and **AI**

---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Upgrading Libraries**](#Section32)<br>
  3.3 [**Importing Libraries**](#Section33)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Summarization**](#Section9)<br>
  9.1 [**Conclusion**](#Section91)<br>
  9.2 [**Actionable Insights**](#Section92)<br>

---

# 📊 Exploratory Data Analysis on Facebook Metrics

Social media platforms generate vast amounts of user interaction data daily.  
Understanding how users engage with posts, ads, and content is essential for businesses to improve **reach** and **engagement**.

This notebook focuses on performing **Exploratory Data Analysis (EDA)** on Facebook data to:

- 🔍 Uncover patterns and correlations in post performance  
- 👍 Analyze user engagement through likes, shares, comments  
- 📈 Identify what content strategies yield better outcomes  

> 🎯 **Goal**: Provide actionable insights for **social media strategists**, **digital marketers**, and **business analysts** to make data-driven decisions.


---
<a name = Section2></a>
# **2. Problem Statement**
---

- This section gives a general introduction to a problem that most companies face.
- **Example Problem Statement:**

  - Technology has developed rapidly over the past few years.

  - Social applications often fail to stay up to date, resulting in significant losses for the company.

  - Facebook has grown very popular over the years, particularly since 2005.

  - People from all over the world still use it as a medium to share their thoughts and feelings with others.

  - As the application grows in popularity, staying up to date becomes essential, and the company is working hard to do so...
  
<p align="center"><img width="35%" src="https://chi2016.acm.org/wp/wp-content/uploads/2016/02/Facebook-06-2015-Blue.png"></p>

- Derive a scenario related to the problem statement and begin the exploration.

- **Example Scenario:**
  - Facebook, Inc. is an American social media conglomerate, founded in February 2004 by Mark Zuckerberg.

  - People from all age groups are connected with each other through Facebook.

  - However, they use it in different ways, for example in how they send friend requests, likes, and comments.

  - Let's say the company wants to study and analyze these differences and identify the patterns in them.

  - The results will help the company use these patterns in its next development iteration.

  - These hidden patterns might end up improving the application and the user experience.

  - To tackle this problem, they hired a team of data scientists. Consider yourself one of them...

---
<a id = Section3></a>
# **3. Installing & Importing Libraries**
---

- This section covers installing and importing the libraries required for the analysis.

### **Installing Libraries**

In [1]:
!pip install -q datascience                                         # Package that is required by pandas profiling
!pip install -q pandas-profiling                                    # Library to generate basic statistics about data

### **Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- Do not re-run the cells under Installing Libraries and Upgrading Libraries after restarting the runtime.

In [2]:
!pip install -q --upgrade datascience                               # Package that is required by pandas profiling
!pip install -q --upgrade pandas-profiling                          # Library to generate basic statistics about data

In [4]:
!pip install ydata_profiling

Collecting ydata_profiling
  Downloading ydata_profiling-4.16.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting visions<0.8.2,>=0.7.5 (from visions[type_image_path]<0.8.2,>=0.7.5->ydata_profiling)
  Downloading visions-0.8.1-py3-none-any.whl.metadata (11 kB)
Collecting multimethod<2,>=1.4 (from ydata_profiling)
  Downloading multimethod-1.12-py3-none-any.whl.metadata (9.6 kB)
Collecting imagehash==4.3.1 (from ydata_profiling)
  Downloading ImageHash-4.3.1-py2.py3-none-any.whl.metadata (8.0 kB)
Collecting dacite>=1.8 (from ydata_profiling)
  Downloading dacite-1.9.2-py3-none-any.whl.metadata (17 kB)
Collecting puremagic (from visions<0.8.2,>=0.7.5->visions[type_image_path]<0.8.2,>=0.7.5->ydata_profiling)
  Downloading puremagic-1.29-py3-none-any.whl.metadata (5.8 kB)
Downloading ydata_profiling-4.16.1-py2.py3-none-any.whl (400 kB)

### **Importing Libraries**

- You can get a head start with the basic libraries imported in the cell below.

- If you want to import some additional libraries, feel free to do so.


In [5]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from ydata_profiling import ProfileReport                           # To perform data profiling
pd.set_option('display.max_columns', None)                          # Showing all columns instead of truncating wide frames
pd.set_option('display.max_colwidth', None)                         # Showing full cell contents for better clarity
pd.set_option('display.max_rows', None)                             # Showing all rows instead of truncating long frames
pd.set_option('mode.chained_assignment', None)                      # Silencing the SettingWithCopyWarning on chained assignments
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # For numerical python operations
#-------------------------------------------------------------------------------------------------------------------------------
import plotly.graph_objs as go                                      # For interactive graphs
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppressing all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- This section covers acquiring the data and obtaining some descriptive information from it.

- You could either scrape the data and then continue, or load it from a direct link (generally preferred).

- Here you will work with a direct link so that you can get started right away.

- Before going further, you should have a good idea of the features in the data set:

|Id|Feature|Description|
|:--|:--|:--|
|01| userid                 | A numeric value uniquely identifying the user.|
|02| age                    | Age of the user in years.|
|03| dob_day                | Day part of the user's date of birth.|
|04| dob_year               | Year part of the user's date of birth.|
|05| dob_month              | Month part of the user's date of birth.|
|06| gender                 | Gender of the user.|
|07| tenure                 | Number of days since the user has been on FB.|
|08| friend_count           | Number of friends the user has.|
|09| friendships_initiated  | Number of friendships initiated by the user.|
|10| likes                  | Total number of posts liked by the user.|
|11| likes_received         | Total number of likes received by the user's posts.|
|12| mobile_likes           | Number of posts liked by the user through the mobile app.|
|13| mobile_likes_received  | Number of likes received by the user through the mobile app.|
|14| www_likes              | Number of posts liked by the user through the web.|
|15| www_likes_received     | Number of likes received by the user through the web.|


In [6]:
facebook_df = pd.read_csv(filepath_or_buffer = 'https://raw.githubusercontent.com/amitkharche/exploratory_data_analysis_projects_amit_kharche/refs/heads/main/04.EDA_facebook_utilization_amit_kharche/facebook_data.csv')
print('facebook_df Shape:', facebook_df.shape)
facebook_df.head()

facebook_df Shape: (99003, 15)


| |userid|age|dob_day|dob_year|dob_month|gender|tenure|friend_count|friendships_initiated|likes|likes_received|mobile_likes|mobile_likes_received|www_likes|www_likes_received|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|2094382|14|19|1999|11|male|266.0|0|0|0|0|0|0|0|0|
|1|1192601|14|2|1999|11|female|6.0|0|0|0|0|0|0|0|0|
|2|2083884|14|16|1999|11|male|13.0|0|0|0|0|0|0|0|0|
|3|1203168|14|25|1999|12|female|93.0|0|0|0|0|0|0|0|0|
|4|1733186|14|4|1999|12|male|82.0|0|0|0|0|0|0|0|0|


### **Data Description**

To gain insights from the data, we must look at each aspect of it carefully. Having seen the first few rows, we now review summary statistics for every column.

In [7]:
facebook_df.describe(include='all')

| |userid|age|dob_day|dob_year|dob_month|gender|tenure|friend_count|friendships_initiated|likes|likes_received|mobile_likes|mobile_likes_received|www_likes|www_likes_received|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|count|99003.0|99003.0|99003.0|99003.0|99003.0|98828|99001.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|
|unique| | | | | |2| | | | | | | | | |
|top| | | | | |male| | | | | | | | | |
|freq| | | | | |58574| | | | | | | | | |
|mean|1597045.0|37.280224|14.530408|1975.719776|6.283365| |537.887375|196.350787|107.452471|156.078785|142.689363|106.1163|84.120491|49.962425|58.568831|
|std|344059.2|22.589748|9.015606|22.589748|3.529672| |457.649874|387.304229|188.786951|572.280681|1387.919613|445.252985|839.889444|285.560152|601.416348|
|min|1000008.0|13.0|1.0|1900.0|1.0| |0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|1298806.0|20.0|7.0|1963.0|3.0| |226.0|31.0|17.0|1.0|1.0|0.0|0.0|0.0|0.0|
|50%|1596148.0|28.0|14.0|1985.0|6.0| |412.0|82.0|46.0|11.0|8.0|4.0|4.0|0.0|2.0|
|75%|1895744.0|50.0|22.0|1993.0|9.0| |675.0|206.0|117.0|81.0|59.0|46.0|33.0|7.0|20.0|


### **Data Information**

In [8]:
facebook_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99003 entries, 0 to 99002
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   userid                 99003 non-null  int64  
 1   age                    99003 non-null  int64  
 2   dob_day                99003 non-null  int64  
 3   dob_year               99003 non-null  int64  
 4   dob_month              99003 non-null  int64  
 5   gender                 98828 non-null  object 
 6   tenure                 99001 non-null  float64
 7   friend_count           99003 non-null  int64  
 8   friendships_initiated  99003 non-null  int64  
 9   likes                  99003 non-null  int64  
 10  likes_received         99003 non-null  int64  
 11  mobile_likes           99003 non-null  int64  
 12  mobile_likes_received  99003 non-null  int64  
 13  www_likes              99003 non-null  int64  
 14  www_likes_received     99003 non-null  int64  
dtypes:

- The ```info``` function gives us the following insights into the `facebook_df` dataframe:

  - There are a total of **99003 samples (rows)** and **15 columns** in the dataframe.

  - There are **13 columns** with an **integer** (`int64`) datatype.

  - There is **one float column** (`tenure`) and **one object (categorical) column** (`gender`).

  - There are **missing** values in `gender` (175 rows) and `tenure` (2 rows).
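
The missing-value observation above can be quantified per column. The following is a minimal sketch, assuming a small hypothetical frame (`sample_df`) that stands in for `facebook_df`; its values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for facebook_df (illustrative values only)
sample_df = pd.DataFrame({
    "age": [14, 21, 35, 50],
    "gender": ["male", np.nan, "female", "male"],
    "tenure": [266.0, 6.0, np.nan, 93.0],
})

# Count and percentage of missing values per column
missing_count = sample_df.isna().sum()
missing_pct = (missing_count / len(sample_df) * 100).round(2)

missing_summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})

# Keep only the columns that actually contain gaps
missing_summary = missing_summary[missing_summary["missing"] > 0]
print(missing_summary)
```

On the real data, the same `isna().sum()` call can be applied to `facebook_df` directly.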

## 🧾 Data Profiling with Pandas Profiling

Using **Pandas Profiling** (now distributed as `ydata-profiling`, the package imported above), we can automatically generate an interactive **HTML report** that provides a comprehensive overview of the dataset.

This report includes:
- 📌 Summary statistics for each column (e.g., data types, counts, missing values)
- 🔄 Correlation analysis between numerical features
- 📊 Visual distributions and histograms for all variables
- 🧪 A preview sample of the dataset

> ✅ It helps in quickly understanding the structure, quality, and relationships within the data through **rich visualizations** and **detailed profiling**.


In [10]:
profile = ProfileReport(df=facebook_df)
profile.to_file(output_file='Pre Profiling Report.html')
from google.colab import files
files.download("Pre Profiling Report.html")
print('Accomplished!')

Accomplished!


---
<a name = Section6></a>
# **6. Data Pre-Processing**
---

- This section covers manipulating the raw data to prepare it for further processing and analysis.

- To turn unstructured data into structured data, you need to verify and restore its integrity by:
  - Handling missing data,

  - Handling redundant data,

  - Handling inconsistent data,

  - Handling outliers,

  - Handling typos

In [11]:
facebook_df.drop('userid',axis =1,inplace=True)

In [12]:
facebook_df['gender'].mode()

| |gender|
|:--|:--|
|0|male|


In [14]:
facebook_df['gender'] = facebook_df['gender'].replace(np.nan, 'male')

In [15]:
facebook_df['gender'].unique()

array(['male', 'female'], dtype=object)

In [16]:
facebook_df['tenure'].median()

412.0

In [18]:
facebook_df['tenure'] = facebook_df['tenure'].replace(np.nan,412.0)
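
The two imputation cells above hard-code `'male'` and `412.0`. A sketch of an alternative, assuming the same strategy (mode for `gender`, median for `tenure`), computes the fill values from the data so they stay correct if the dataset changes. `sample_df` is a hypothetical stand-in for `facebook_df`, with made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for facebook_df (illustrative values only)
sample_df = pd.DataFrame({
    "gender": ["male", "female", np.nan, "male"],
    "tenure": [100.0, np.nan, 300.0, 500.0],
})

# Derive the fill values from the data instead of typing them in
gender_mode = sample_df["gender"].mode().iloc[0]   # most frequent category
tenure_median = sample_df["tenure"].median()       # robust to extreme tenures

# fillna only touches the missing entries
sample_df["gender"] = sample_df["gender"].fillna(gender_mode)
sample_df["tenure"] = sample_df["tenure"].fillna(tenure_median)

print(sample_df)
```

The median is used for `tenure` because the column is right-skewed in the summary statistics (mean above median), so the mean would be pulled upward by long-tenured users.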

In [20]:
facebook_df.isnull().sum().sort_values(ascending = False)

| |0|
|:--|:--|
|age|0|
|date_of_birth|0|
|dob_day|0|
|dob_year|0|
|dob_month|0|
|gender|0|
|tenure|0|
|friend_count|0|
|friendships_initiated|0|
|likes|0|


- We need to create a **date_of_birth** column from these variables: __dob_year__, __dob_month__, and __dob_day__

In [19]:
facebook_df.insert(1,"date_of_birth",pd.to_datetime(facebook_df.dob_year*10000+facebook_df.dob_month*100+facebook_df.dob_day,format='%Y%m%d'))
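
The integer arithmetic above works when every row holds a valid calendar date. As a hedged alternative, `pd.to_datetime` can also assemble dates from a frame whose columns are named `year`, `month`, and `day`; with `errors="coerce"`, an impossible date becomes `NaT` instead of raising. The frame below is hypothetical and only illustrates the call.

```python
import pandas as pd

# Hypothetical stand-in for the three dob_* columns (illustrative values only)
sample_df = pd.DataFrame({
    "dob_year": [1999, 1985, 1990],
    "dob_month": [11, 1, 2],
    "dob_day": [19, 20, 30],   # 30 February does not exist
})

# to_datetime expects the component columns to be called year/month/day
parts = sample_df[["dob_year", "dob_month", "dob_day"]].rename(
    columns={"dob_year": "year", "dob_month": "month", "dob_day": "day"}
)

# Invalid combinations are turned into NaT rather than stopping the cell
sample_df["date_of_birth"] = pd.to_datetime(parts, errors="coerce")

print(sample_df)
```

Counting `NaT` values afterwards with `sample_df["date_of_birth"].isna().sum()` shows how many rows carried an invalid date.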

In [22]:
facebook_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99003 entries, 0 to 99002
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   age                    99003 non-null  int64         
 1   date_of_birth          99003 non-null  datetime64[ns]
 2   dob_day                99003 non-null  int64         
 3   dob_year               99003 non-null  int64         
 4   dob_month              99003 non-null  int64         
 5   gender                 99003 non-null  object        
 6   tenure                 99003 non-null  float64       
 7   friend_count           99003 non-null  int64         
 8   friendships_initiated  99003 non-null  int64         
 9   likes                  99003 non-null  int64         
 10  likes_received         99003 non-null  int64         
 11  mobile_likes           99003 non-null  int64         
 12  mobile_likes_received  99003 non-null  int64         
 13  w

In [23]:
facebook_df.describe(include = 'all')

| |age|date_of_birth|dob_day|dob_year|dob_month|gender|tenure|friend_count|friendships_initiated|likes|likes_received|mobile_likes|mobile_likes_received|www_likes|www_likes_received|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|count|99003.0|99003|99003.0|99003.0|99003.0|99003|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|99003.0|
|unique| | | | | |2| | | | | | | | | |
|top| | | | | |male| | | | | | | | | |
|freq| | | | | |58749| | | | | | | | | |
|mean|37.280224|1976-03-12 16:14:12.628708256|14.530408|1975.719776|6.283365| |537.884832|196.350787|107.452471|156.078785|142.689363|106.1163|84.120491|49.962425|58.568831|
|min|13.0|1900-01-01 00:00:00|1.0|1900.0|1.0| |0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|20.0|1963-08-14 12:00:00|7.0|1963.0|3.0| |226.0|31.0|17.0|1.0|1.0|0.0|0.0|0.0|0.0|
|50%|28.0|1985-01-20 00:00:00|14.0|1985.0|6.0| |412.0|82.0|46.0|11.0|8.0|4.0|4.0|0.0|2.0|
|75%|50.0|1993-01-01 00:00:00|22.0|1993.0|9.0| |675.0|206.0|117.0|81.0|59.0|46.0|33.0|7.0|20.0|
|max|113.0|2000-10-27 00:00:00|31.0|2000.0|12.0| |3139.0|4923.0|4144.0|25111.0|261197.0|25111.0|138561.0|14865.0|129953.0|


---
<a name = Section7></a>
# **7. Data Post-Profiling**
---

- This section generates a profiling report of the data after the manipulation steps.

- You may observe some changes compared with the first report, so review them and note the right observations.

In [24]:
profile = ProfileReport(df=facebook_df)
profile.to_file(output_file='Post Profiling Report.html')
from google.colab import files
files.download("Post Profiling Report.html")
print('Accomplished!')

Accomplished!


## ✅ Post-Processing Summary & Profiling Insights

After completing the data preprocessing steps, the dataset is now **clean and free of missing values**.  
We have also engineered a new feature: **`date_of_birth`**.

The updated **profiling report** generated after preprocessing will offer **more accurate and meaningful insights** than the initial one.

You can compare the two profiling reports:
- 📄 `Pre Profiling Report.html`
- 📄 `Post Profiling Report.html`

### 🔍 Key Observations:

- 🧼 **Total Missing (%)**: `0.0%` — No missing values remain  
- 📊 **Number of Variables**: `15`  
- 🆕 **New Feature**: `date_of_birth` has been successfully added  
- 🔁 **Duplicates**: `8 rows` (<0.1%) are duplicates and should be removed to avoid skewed insights


---
### Post Processing

In [25]:
facebook_df.drop_duplicates(inplace=True)

In [26]:
facebook_df.describe(include='all')

| |age|date_of_birth|dob_day|dob_year|dob_month|gender|tenure|friend_count|friendships_initiated|likes|likes_received|mobile_likes|mobile_likes_received|www_likes|www_likes_received|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|count|98995.0|98995|98995.0|98995.0|98995.0|98995|98995.0|98995.0|98995.0|98995.0|98995.0|98995.0|98995.0|98995.0|98995.0|
|unique| | | | | |2| | | | | | | | | |
|top| | | | | |male| | | | | | | | | |
|freq| | | | | |58741| | | | | | | | | |
|mean|37.281146|1976-03-12 08:29:41.035405824|14.531502|1975.718854|6.283792| |537.916986|196.366625|107.461124|156.091399|142.700894|106.124875|84.127289|49.966463|58.573564|
|min|13.0|1900-01-01 00:00:00|1.0|1900.0|1.0| |0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|
|25%|20.0|1963-08-14 00:00:00|7.0|1963.0|3.0| |226.0|31.0|17.0|1.0|1.0|0.0|0.0|0.0|0.0|
|50%|28.0|1985-01-19 00:00:00|14.0|1985.0|6.0| |412.0|82.0|46.0|11.0|8.0|4.0|4.0|0.0|2.0|
|75%|50.0|1993-01-01 00:00:00|22.0|1993.0|9.0| |675.0|206.0|117.0|81.0|59.0|46.0|33.0|7.0|20.0|
|max|113.0|2000-10-27 00:00:00|31.0|2000.0|12.0| |3139.0|4923.0|4144.0|25111.0|261197.0|25111.0|138561.0|14865.0|129953.0|


**8 duplicate rows** have been removed, and the data set now has 98995 rows and 15 columns.
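
Before dropping duplicates, it can help to count them and confirm the row count afterwards. The sketch below does this on a hypothetical frame (`sample_df`, made-up values). One caveat worth keeping in mind: because `userid` was dropped earlier in this notebook, rows flagged as duplicates may belong to different users who happen to share identical attributes.

```python
import pandas as pd

# Hypothetical stand-in for facebook_df after userid was dropped
sample_df = pd.DataFrame({
    "age": [14, 14, 21, 21],
    "gender": ["male", "male", "female", "male"],
    "friend_count": [0, 0, 5, 5],
})

n_before = len(sample_df)

# duplicated() marks every repeat of an earlier, fully identical row
n_duplicates = int(sample_df.duplicated().sum())

# drop_duplicates keeps the first occurrence; reset_index closes index gaps
deduped_df = sample_df.drop_duplicates().reset_index(drop=True)

print(f"rows before: {n_before}, duplicates: {n_duplicates}, rows after: {len(deduped_df)}")
```

Resetting the index is optional; it only matters if later cells rely on a contiguous `RangeIndex`.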

---
<a name = Section8></a>
# **8. Exploratory Data Analysis**
---

- This section is about asking the right questions and answering them through analysis of the data.

- There is no limit to how deep you can go, but make sure you stay on track.

In [None]:
# Insert your code here...
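
As one possible starting point for this section (a sketch, not the author's analysis), the questions "does friend count differ by gender?" and "do likes differ by age group?" can be answered with a group-by. The frame, the age bins, and the labels below are all hypothetical choices made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for facebook_df (illustrative values only)
sample_df = pd.DataFrame({
    "age": [14, 19, 25, 33, 47, 62],
    "gender": ["male", "female", "female", "male", "female", "male"],
    "friend_count": [10, 250, 120, 80, 60, 30],
    "likes": [0, 40, 15, 5, 20, 1],
})

# Assumed age bands; pd.cut intervals are right-inclusive, e.g. (12, 19]
age_bins = [12, 19, 29, 49, 120]
age_labels = ["13-19", "20-29", "30-49", "50+"]
sample_df["age_group"] = pd.cut(sample_df["age"], bins=age_bins, labels=age_labels)

# Median is less sensitive than the mean to a few very active users
by_gender = sample_df.groupby("gender")["friend_count"].median()

# observed=True limits the result to age groups present in the data
by_age = sample_df.groupby("age_group", observed=True)["likes"].mean()

print(by_gender)
print(by_age)
```

The same pattern extends to the other engagement columns, such as `likes_received` or `mobile_likes`.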

---
<a name = Section9></a>
# **9. Summarization**
---

<a name = Section91></a>
### **9.1 Conclusion**

- In this part you need to provide a conclusion about your overall analysis.

- Write down some short points that you have observed so far.

<a name = Section92></a>
### **9.2 Actionable Insights**

- This is a crucial part, where you present your actionable insights.
- Give suggestions about what should and should not be applied.
- Keep these suggestions short and to the point; ultimately, they are a catalyst for the business.