# Predicting Total Interactions of a Facebook Page
## Phase 1: Data Preparation & Visualisation

#### Group Number: 
45
#### Names & Student IDs of Group Members:
Athul Varghese Thampan, S3958556

Mohamed Bilal Naeem, S3967700


## Table of Contents
* [Introduction](#itr) 
  + [Dataset Source](#Dataset-Source)
  + [Dataset Details](#Dataset-Details)
  + [Dataset Variables](#Dataset-Variables)
  + [Response Variable](#Response-Variable)
* [Goals and Objectives](#Goals-and-Objectives)
* [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
* [Data Exploration and Visualisation](#Data-Exploration-and-Visualisation)
* [Summary and Conclusions](#Summary-and-Conclusions)
* [References](#References)



## Introduction <a id='itr'></a>

### Dataset Source

The Facebook metrics dataset used in this study was sourced from Dr. David Akman's GitHub datasets repository. It covers posts published during 2014 on the Facebook page of a renowned cosmetics brand and records a range of Facebook metrics for each post.





### Dataset Details

The dataset is about the different Facebook metrics of a renowned cosmetics brand and contains features such as page total likes, type of content, category, post month, post weekday, post hour, paid, lifetime post total reach, lifetime post total impressions, lifetime engaged users, lifetime post consumers, lifetime post consumptions, lifetime post impressions by people who have liked your page, lifetime post reach by people who like your page, lifetime people who have liked your page and engaged with your post, comment, like, share and total interactions.

These features should be sufficient for treating the prediction of a post's total interactions as a regression problem.

This dataset has a total of 19 features and 500 observations.

**Dataset Retrieval**

- Read the dataset used in this project from the GitHub repository.
- Display 10 random rows.

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

pd.set_option('display.max_columns', None) 

###
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
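# plt.style.use("seaborn") fails on recent matplotlib releases,
# so apply seaborn's default theme directly instead
sns.set_theme()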
###

In [None]:
# url of the dataset from github
df_url = 'https://raw.githubusercontent.com/akmand/datasets/main/fb_metrics.csv'
url_content = requests.get(df_url, verify=False).content
melb_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))

In [None]:
melb_df.sample(10, random_state=999)

### Dataset Variables

The variables in this dataset are displayed in the table below along with their data type, units and description.

In [None]:
from tabulate import tabulate

table = [['Name','Data Type','Units','Description'],
         ['page_total_likes','Numeric','NA','Total number of likes for the page'],
         ['type','Nominal categorical','NA','Type of content'],
         ['category','Nominal categorical','NA','Category of the post'],
         ['post_month','Date','NA','Month that the post is uploaded'],
         ['post_weekday','Date','NA','Day of the week the post is uploaded'],
         ['post_hour','Date','NA','Time of day the post is uploaded'],
         ['paid','Binary categorical','NA','Whether paid or unpaid for content'],
         ['lifetime_post_total_reach','Numeric','NA','Lifetime total reach of the post to users'],
         ['lifetime_post_total_impressions','Numeric','NA','Lifetime total impressions made by the post'],
         ['lifetime_engaged_users','Numeric','NA','Lifetime engagement by users'],
         ['lifetime_post_consumers','Numeric','NA','Lifetime post consumers'],
         ['lifetime_post_consumptions','Numeric','NA','Lifetime post consumptions'],
         ['lifetime_post_impressions_by_people_who_have_liked_your_page','Numeric','NA','Lifetime post impressions by people who liked the page'],
         ['lifetime_post_reach_by_people_who_like_your_page','Numeric','NA','Lifetime post reach by the people who like the page'],
         ['lifetime_people_who_have_liked_your_page_and_engaged_with_your_post','Numeric','NA','Lifetime people who have liked the page and engaged with post'],
         ['comment','Numeric','NA','Number of comments on post'],
         ['like','Numeric','NA','Number of likes on post'],
         ['share','Numeric','NA','Number of shares of post'],
         ['total_interactions','Numeric','NA','Number of total interactions on post']]

print(tabulate(table, headers='firstrow', tablefmt='grid'))

### Response Variable

For this project, the target feature in this dataset will be `total_interactions`, the number of total interactions on a post. That is, the total interactions of a post will be predicted based on the explanatory/descriptive variables.
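
The variable table lists `total_interactions` alongside `comment`, `like` and `share`. A minimal sketch of a sanity check follows, under the assumption (not confirmed in this report) that the total is simply the sum of those three counts. The rows are made-up values, not taken from the dataset.

```python
import pandas as pd

# Toy rows (made-up values) using the column names from the variable table.
posts = pd.DataFrame({
    "comment": [4, 0, 12],
    "like": [79, 130, 1572],
    "share": [17, 29, 147],
    "total_interactions": [100, 159, 1731],
})

# Assumption: the target is the sum of comments, likes and shares.
component_sum = posts[["comment", "like", "share"]].sum(axis=1)
mismatches = posts[component_sum != posts["total_interactions"]]

print(f"Rows where the sum differs from total_interactions: {len(mismatches)}")
```

If the same check held on the real data, `comment`, `like` and `share` would together determine the target exactly, which would be worth keeping in mind when choosing predictors.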

## Goals and Objectives

Melbourne has a very active housing market, as demand for housing throughout the city is usually well above supply. For this reason, a model that can accurately predict a house's selling price in Melbourne would have many real-world uses. For instance, real estate agents can provide better service to their customers using this predictive model. Likewise, banks lending money to home buyers can better estimate the financial aspects of a home loan. Perhaps more importantly, as potential home buyers we can better judge whether we are overpaying or getting a good deal, provided that our predictive model is reliable.

Thus, the main objective of this project is two-fold: (1) predict the price of a house sold in Melbourne based on publicly available features of the house, and (2) identify which features seem to be the best predictors of the house sale price. A secondary objective is to perform some exploratory data analysis, using basic descriptive statistics and data visualisation plots, to gain insight into the patterns and relationships in the data after some data cleaning and preprocessing. This exploratory work is the subject of this Phase 1 report.

At this point, we make the important assumption that rows in our dataset are not correlated. That is, we assume that house price observations are independent of one another in this dataset. Of course, this is not a very realistic assumption. However, it allows us to set aside the time series aspects of the underlying dynamics of house prices and to use rather classical predictive models such as multiple linear regression.
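
To illustrate what treating rows as independent buys us, here is a minimal sketch of a multiple linear regression fitted by ordinary least squares with NumPy. The data are synthetic and the coefficient values are assumptions chosen for the illustration; this is not the model fitted in this project.

```python
import numpy as np

rng = np.random.default_rng(999)

# Synthetic, independent observations: two made-up predictors and a noisy target.
n_obs = 200
X = rng.normal(size=(n_obs, 2))
true_coefs = np.array([3.0, -1.5])
y = 10.0 + X @ true_coefs + rng.normal(scale=0.1, size=n_obs)

# Add an intercept column and solve the least-squares problem.
X_design = np.column_stack([np.ones(n_obs), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("Estimated [intercept, b1, b2]:", np.round(beta_hat, 2))
```

Because each row contributes independently to the fit, no ordering or lag structure has to be modelled, which is exactly what the assumption above sidesteps.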

## Data Cleaning and Preprocessing

In this section, we describe the data cleaning and preprocessing steps undertaken for this project.

### Data Cleaning Steps

*   Drop irrelevant features in our dataset
*   Check and rename/ modify some column names
*   Check for missing values
*   Remove all the rows with missing values 
*   Random sampling of the dataset for 5000 rows

Let's first display all the columns in our dataset.

In [None]:
melb_df.columns

`Postcode` is very similar to the `Suburb` feature, so it is considered redundant for the model and removed. `Bedroom2` is the number of bedrooms taken from another source and duplicates the `Rooms` feature, so it is also considered redundant and removed.

In [None]:
#drop irrelevant/repeated columns/features
melb_df = melb_df.drop(columns=["Postcode", "Bedroom2"]) 

As this is not a time series project, the `Date` feature is transformed into `year_sold` and `month_sold` features for the model. After the transformation, the original `Date` feature is removed.

In [None]:
# Convert Date feature from string to Pandas' datetime dtype
melb_df['Date'] = pd.to_datetime(melb_df['Date'], dayfirst=True)
###
# Create Year and Month features
melb_df['year_sold'] = melb_df['Date'].dt.year
melb_df['month_sold'] = melb_df['Date'].dt.month
###
# Remove Date as we will not need it anymore
melb_df = melb_df.drop(columns=["Date"]) 

Some of the columns are not labelled properly, which will be problematic when modelling. We will make all column names lower case and replace any spaces with underscores for consistency. We will also remove any white spaces before & after column names.

In [None]:
# strip white space around the column names, make them lower case,
# and replace any remaining spaces with underscores
melb_df.columns = (melb_df.columns.str.strip().str.lower()
                   .str.replace(' ', '_', regex=False))

columns_mapping = {
    'sellerg': 'real_estate',
    'landsize': 'land_size',
    'buildingarea': 'building_area',
    'yearbuilt': 'year_built',
    'councilarea': 'council',
    'regionname': 'region',
    'propertycount': 'property_count'    
}

# rename columns
melb_df = melb_df.rename(columns = columns_mapping)
melb_df.sample(5, random_state=999)

Next we check the data types and observe that they match the intended data types, thus no change is needed here. 

In [None]:
# Check for data types
print(f"Shape of the dataset = {melb_df.shape} \n")
print(f"Data types are below where 'object' indicates a string type: ")
print(melb_df.dtypes)

The unique values for all columns with categorical data types are displayed to check for any white spaces and other data quality issues. It turns out that the data is already clean, and no further data cleaning steps are necessary here.

In [None]:
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of categorical features</b>'))
melb_df.describe(include='object').T

In [None]:
# To see all unique values for categorical data types
categoricalColumns = melb_df.columns[melb_df.dtypes==object].tolist()

for col in categoricalColumns:
    print('Unique values for ' + col)
    print(melb_df[col].unique())
    print('')
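
Scanning unique values by eye can miss labels that differ only in spacing or case. As a minimal sketch on made-up values (the `region` labels below are illustrative, not taken from the data), comparing the number of levels before and after normalisation makes such issues explicit:

```python
import pandas as pd

# Made-up categorical column containing stray whitespace and mixed case.
toy = pd.DataFrame({"region": ["Northern ", "northern", " Southern", "Southern"]})

raw_levels = toy["region"].nunique()
clean = toy["region"].str.strip().str.title()
clean_levels = clean.nunique()

# Fewer levels after normalising means some labels differed only in formatting.
print(f"Levels before: {raw_levels}, after: {clean_levels}")
```

If the two counts are equal for every categorical column, that supports the conclusion that no further cleaning of the labels is needed.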

Summary statistics are generated for all the numerical features. There do not seem to be any outliers in the data.

In [None]:
from IPython.display import display, HTML
display(HTML('<b>Table 2: Summary of numerical features</b>'))
melb_df.describe(include=['int64','float64']).T
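
Summary statistics alone can hide extreme values, so a rule-based check is a useful complement. The sketch below applies the common 1.5 × IQR rule to made-up values; `iqr_outlier_mask` is a hypothetical helper written for this illustration and is not part of the analysis above.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up values with one obviously extreme entry.
toy_prices = pd.Series([400, 450, 500, 520, 550, 600, 5000])
mask = iqr_outlier_mask(toy_prices)

print(toy_prices[mask].tolist())
```

The same helper could be applied column by column to the numerical features, bearing in mind that for strongly skewed variables the rule tends to flag many legitimate large values.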

Missing values are checked by displaying the number of missing values in every column. We observe that `car`, `building_area`, `year_built` and `council` features have missing values. We decide to drop these observations for simplicity.

In [None]:
# Count missing values in each column
print(f"\nNumber of missing values for each column/ feature:")
print(melb_df.isnull().sum())

In [None]:
# Drop all rows with missing values/NaN
melb_df = melb_df.dropna()
melb_df.shape
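
Before dropping rows it helps to know how much data will be lost. A minimal sketch on a made-up frame (the column names mirror those mentioned above, the values are invented) reports the share of missing values per column and the share of rows removed by `dropna()`:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns.
toy = pd.DataFrame({
    "car": [1.0, np.nan, 2.0, 1.0],
    "building_area": [120.0, 95.0, np.nan, 150.0],
    "price": [700, 650, 900, 1100],
})

missing_share = toy.isnull().mean()       # fraction missing per column
complete = toy.dropna()
rows_lost = 1 - len(complete) / len(toy)  # fraction of rows removed

print(missing_share)
print(f"Share of rows dropped: {rows_lost:.0%}")
```

When the share of dropped rows is large, imputation may be worth considering instead of deletion.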

### Random Sampling

As the data has more than 5000 rows, random sampling is done to get only 5000 rows out of the remaining 6196 rows for ease of computation. At the end, we display 5 random rows from our cleaned data.

In [None]:
melb_df = melb_df.sample(n=5000, random_state=999)
# print the shape explicitly: only the last expression in a cell is displayed
print(melb_df.shape)
melb_df.sample(5, random_state=999)
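
Fixing `random_state` is what makes the sample reproducible. A minimal sketch on a made-up frame shows that two calls with the same seed return the same rows, and that rows are drawn without replacement by default:

```python
import pandas as pd

toy = pd.DataFrame({"row_id": range(100)})

# The same seed yields the same rows; sampling is without replacement by default.
first = toy.sample(n=10, random_state=999)
second = toy.sample(n=10, random_state=999)

print(first.index.equals(second.index))
print(first["row_id"].is_unique)
```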

## Data Exploration and Visualisation

Our dataset is now considered to be clean, and we are ready to start visualising and exploring each of the features.

### Univariate Visualisation



#### Bar Chart of Region Name

We count the number of houses sold in each region and display the regions in descending order of count. As we can see in Figure 1, Southern Metropolitan has the highest number of houses sold compared to the other regions in Melbourne.

In [None]:
plt.figure(figsize = (20,8))
fig = sns.countplot(x = 'region', data = melb_df, palette = 'magma', 
                    order = melb_df['region'].value_counts().index)
fig = plt.title('Figure 1: Region Count for Melbourne Housing', fontsize = 15)
plt.show()

#### Bar Chart of Method

From Figure 2, we can see that "Property Sold" ("S") is the most common method of selling a house in the dataset.

In [None]:
plt.figure(figsize = (15,8))
fig = sns.countplot(x = 'method', data = melb_df, palette = None, 
                    order = melb_df['method'].value_counts().index)
fig = plt.title('Figure 2: Method Count in Melbourne Housing', fontsize = 15)
plt.show()

#### Boxplot & Histogram of Price

We can see in Figures 3A and 3B that the distribution of price is clearly right-skewed and has a huge range, which indicates the price variable will probably need a log transformation in the second phase of the project.

In [None]:
# Boxplot of Price
plt.figure(figsize = (15,8))
sns.boxplot(x=melb_df['price']).set_title('Figure 3A: Box Plot of Price', fontsize = 15)
plt.show();

In [None]:
# Histogram of Price (histplot replaces the deprecated distplot)
plt.figure(figsize = (15,8))
sns.histplot(x=melb_df['price'], kde=True, bins=40).set_title('Figure 3B: Histogram of Price', fontsize = 15)
plt.show();
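
To show why a log transformation is a natural candidate for a right-skewed variable, here is a minimal sketch on synthetic log-normal values. The distribution parameters are assumptions chosen for the illustration, not estimates from the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(999)

# Synthetic right-skewed values drawn from a log-normal distribution.
toy_price = pd.Series(rng.lognormal(mean=13.5, sigma=0.5, size=2000))
log_price = np.log(toy_price)

print(f"Skewness before: {toy_price.skew():.2f}")
print(f"Skewness after log: {log_price.skew():.2f}")
```

A skewness close to zero after the transformation suggests the logged variable is roughly symmetric, which tends to suit linear models better.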

### Two-Variable Visualisation

#### Scatterplot of price and distance from CBD

Figure 4 shows a negative association between distance from the CBD and price: as the distance increases, the price tends to be lower.

In [None]:
plt.figure(figsize = (15,8))
plt.scatter(melb_df['distance'], melb_df['price'], alpha = 0.3)
plt.title('Figure 4: Scatterplot of Price and Distance from CBD', fontsize = 15)
plt.xlabel('Distance from CBD')
plt.ylabel('Price')
plt.show();
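
A scatterplot shows the direction of a relationship, and a correlation coefficient puts a number on it. The sketch below uses synthetic data in which price falls with distance; the slope and noise level are assumptions made for the illustration. The Spearman coefficient is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(999)

# Synthetic data in which price falls with distance, plus noise.
distance = rng.uniform(1, 40, size=500)
price = 1_500_000 - 20_000 * distance + rng.normal(scale=200_000, size=500)
toy = pd.DataFrame({"distance": distance, "price": price})

pearson_r = toy["distance"].corr(toy["price"])
# Spearman's rho equals the Pearson correlation of the ranked values.
spearman_rho = toy["distance"].rank().corr(toy["price"].rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```

Spearman's coefficient is less sensitive to the extreme prices visible in the plots, so reporting both can be informative.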

#### Boxplot of Price by House Type

From Figure 5, we can see that houses of type H (house, cottage, villa, semi, terrace) have a higher price overall than the other types, while type U (unit or duplex) has a lower price than the other types.


In [None]:
plt.figure(figsize = (15,8))
# pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.boxplot(x='type', y='price', data=melb_df)
plt.title('Figure 5: Boxplot of Price by House Type', fontsize = 15)
plt.show();

#### Boxplot of Price by Number of Rooms

Figure 6 shows that the price tends to increase as the number of rooms increases.

In [None]:
plt.figure(figsize = (15,8))
# pass x and y as keywords: recent seaborn versions no longer accept them positionally
sns.boxplot(x='rooms', y='price', data=melb_df)
plt.title('Figure 6: Boxplot of Price by Number of Rooms', fontsize = 15)
plt.show();
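
The trend read off the boxplots can be backed by a grouped summary. A minimal sketch on made-up rows (prices in arbitrary units) computes the count and median price for each number of rooms:

```python
import pandas as pd

# Made-up rows; prices are in arbitrary units.
toy = pd.DataFrame({
    "rooms": [2, 2, 3, 3, 3, 4, 4],
    "price": [500, 540, 700, 760, 820, 1100, 1300],
})

summary = toy.groupby("rooms")["price"].agg(["count", "median"])
print(summary)
```

Including the count matters because room categories with very few sales can produce unstable medians.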

### Three-Variable Visualisation

#### Boxplot of Price broken down by Method and House Type

Figure 7 shows that the relationship between house type and price is broadly similar across the different selling methods. There is no notable difference.

In [None]:
plt.figure(figsize = (15,8))
# pass x, y and hue as keywords: recent seaborn versions require this
sns.boxplot(x='method', y='price', hue='type', data=melb_df)
plt.title('Figure 7: Boxplot of Price broken down by Method and House Type', fontsize = 15)
plt.show();

#### Scatterplot of Price by Distance from Central Business District (CBD) and House Type

Figure 8 shows that type U properties (unit, duplex) tend to be closer to the CBD and to have lower prices, based on the clustering of the red points in the lower left corner of the plot.

In [None]:
plt.figure(figsize = (15,8))
# pass x, y and hue as keywords: recent seaborn versions require this
sns.scatterplot(x='distance', y='price', hue='type', data=melb_df)
plt.title('Figure 8: Scatterplot of Price by Distance coloured by House Type', fontsize = 15);
plt.legend(loc = 'upper right')
plt.show();

#### Barplot of Price by Month and Year Sold

Overall, we observe that, month by month, the prices of houses sold were higher in 2017 than in 2016, but there is no clear pattern as to which months have higher prices.

In [None]:
plt.figure(figsize = (15,8))
sns.barplot(x ='month_sold', y ='price', hue = 'year_sold', data = melb_df)
plt.title('Figure 9: Barplot of Melbourne Housing Prices by Month in 2016 and 2017', fontsize = 15)
plt.xlabel('Month', fontsize = 10)
plt.ylabel('House Prices', fontsize = 10)
plt.show();
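
The month-by-year comparison in the barplot can also be tabulated. The sketch below uses made-up sales records and builds a pivot table of mean price by month and year, plus the year-on-year difference; `diff_2017_vs_2016` is a name introduced for this illustration.

```python
import pandas as pd

# Made-up sales records.
toy = pd.DataFrame({
    "year_sold": [2016, 2016, 2017, 2017, 2016, 2017],
    "month_sold": [5, 5, 5, 5, 6, 6],
    "price": [800, 1000, 950, 1050, 700, 900],
})

# Mean price for each month (rows) and year (columns).
mean_price = toy.pivot_table(index="month_sold", columns="year_sold",
                             values="price", aggfunc="mean")
mean_price["diff_2017_vs_2016"] = mean_price[2017] - mean_price[2016]

print(mean_price)
```

A column of consistently positive differences would support the claim that prices were higher in 2017 in each month.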

## Summary and Conclusions

Accurate prediction of house prices has many practical applications for all stakeholders, including buyers, sellers, banks, and real-estate agencies. Without a doubt, a model that can reliably predict a house's actual sale price would be indispensable for the healthy functioning of the housing market. Our goal in this project is to investigate whether we can come up with a reliable model for predicting house prices using the Melbourne House Prices dataset.

In Phase 1 of this project, we undertook the tasks of data cleaning & preprocessing and data visualisation. First, we decided to drop two variables which we considered not useful for predictive modelling, namely the `Postcode` and `Bedroom2` variables. We also checked the data for any missing values and outliers, and we decided to remove any rows containing such data quality issues. There was not much cleaning needed, as the data seemed to be relatively clean to begin with. Furthermore, we sampled the data to get only 5000 random rows to save time when running models in the second phase of this project.

We generated several visualisations in order to explore the data. Among the explanatory variables, we see that the Southern Metropolitan region has the highest count of houses sold in Melbourne. Between 2016 and 2017, house prices in Melbourne differed only slightly, with prices in 2017 tending to be higher than in 2016 for the same month. Furthermore, units and duplexes tend to be closer to the CBD and to have lower prices than houses and townhouses. Prices also appear to vary more widely in popular regions and in areas closer to the CBD, although this has not yet been examined formally.

We also observed that some of the numerical features are skewed, including the target feature, price. Scaling or transformation of the data will be needed to address this issue. We have seen that house type, distance from the CBD, and number of rooms are among the features most clearly associated with house price. However, further analysis needs to be done in order to identify any further relationships.

## References

- Becker, D. (n.d.). Melbourne Housing Snapshot (Kaggle). Retrieved September 21, 2021 from https://www.kaggle.com/dansbecker/melbourne-housing-snapshot

***