In [1]:
import pandas as pd

# Options
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 15)

# Data Cleaning & Modeling

## Information
**Data:**
- 7 data sets - each data set contains different columns and values
- A data model - this shows the relationships between all of the data sets, as well as any links that you can use to merge tables.

**Steps:**
- Requirements gathering
- Data cleaning
- Data modeling

## Gathering the data needed

**Problem to solve for Social Buzz:**
1. What are the top 5 popular categories based on the score? (required)

For the remaining questions, please refer to the "Data Vis & Story" notebook:

1. How many unique categories are there?
2. What is the total number of reactions for the most popular category? Additionally, how many reactions are there for the combined top 5 and top 10 popular categories?
3. Which month had the highest number of posts?
4. Which day of the week had the most posts?

**Definitions of different data types:**
- String - Sequence of characters, digits, or symbols—always treated as text
- UUID - Universally Unique Identifiers
- Array - List with a number of elements in a specific order—typically of the same type
- Integer - Numeric data type for numbers without fractions
- Timestamp - Number of seconds that have elapsed since midnight (00:00:00 UTC), 1st January 1970 (Unix time)
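
As a small illustration of the Timestamp definition, the sketch below converts a date-time string in the same format as the Reactions `Datetime` column into Unix seconds and back. It assumes the value is in UTC, which the data set itself does not state.

```python
import pandas as pd

# Assumption: the value is in UTC; the source data does not say which time zone it uses.
ts = pd.Timestamp("2020-10-31 18:22:55", tz="UTC")

# Seconds elapsed since 1970-01-01 00:00:00 UTC
unix_seconds = int(ts.timestamp())

# Converting the integer back recovers the original timestamp
round_trip = pd.Timestamp(unix_seconds, unit="s", tz="UTC")
print(unix_seconds, round_trip)
```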

**Data that I need:**

Content (Connected to User by UserID)
- ID: Unique ID of the content that was uploaded (automatically generated)
- User ID: Unique ID of a user that exists in the User table
- Type: A string detailing the type of content that was uploaded
- Category: A string detailing the category that this content is relevant to
- URL: Link to the location where this content is stored

Reaction (Connected to Content by ContentID)
- Content ID: Unique ID of a piece of content that was uploaded
- User ID: Unique ID of a user that exists in the User table who reacted to this piece of content
- Type: A string detailing the type of reaction this user gave
- Datetime: The date and time of this reaction

ReactionTypes (Connected to Reaction by Type)
- Type: A string detailing the type of reaction this user gave
- Sentiment: A string detailing whether this type of reaction is considered as positive, negative or neutral
- Score: A number calculated by Social Buzz that quantifies how “popular” each reaction is. A reaction type with a higher score should be considered a more popular reaction.


### Explanation why we need the data above

The brief clearly states that the client wants to see “An analysis of their content categories showing the top 5 categories with the largest popularity”.
- Popularity is quantified by the “Score” given to each reaction type.
- We therefore need data showing the content ID, category, content type, reaction type, and reaction score.
- To measure popularity, we sum the scores of all reactions per content category and look for the categories with the largest totals.
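
The reasoning above can be sketched on three miniature tables. The IDs and rows are made up for illustration; only the column layout mirrors the real data, and the three scores are the ones listed for `heart`, `hate` and `want` in ReactionTypes.

```python
import pandas as pd

# Made-up miniature tables; only the column layout mirrors the real data
content = pd.DataFrame({
    "content_id": ["c1", "c2", "c3"],
    "category": ["animals", "science", "animals"],
})
reactions = pd.DataFrame({
    "content_id": ["c1", "c1", "c2", "c3"],
    "reaction_type": ["heart", "hate", "want", "heart"],
})
reaction_types = pd.DataFrame({
    "reaction_type": ["heart", "hate", "want"],
    "score": [60, 5, 70],
})

# Attach a category to every reaction, then a score to every reaction
merged = (
    reactions
    .merge(content, on="content_id")
    .merge(reaction_types, on="reaction_type")
)

# Popularity of a category = sum of the scores of all its reactions
popularity = (
    merged.groupby("category")["score"]
    .sum()
    .sort_values(ascending=False)
)
print(popularity)
```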

## Data Wrangling

In [2]:
df_content = pd.read_csv('Content.csv').drop('Unnamed: 0', axis=1)
df_reactions = pd.read_csv('Reactions.csv').drop('Unnamed: 0', axis=1)
df_react_type = pd.read_csv('ReactionTypes.csv').drop('Unnamed: 0', axis=1)

df_content.shape, df_reactions.shape, df_react_type.shape

((1000, 5), (25553, 4), (16, 3))

In [3]:
df_content.sample(3)

Unnamed: 0,Content ID,User ID,Type,Category,URL
815,990f1598-4dbb-403c-a5b2-48de2c30b1ca,c5d04879-87bd-4950-9e63-334382ef53af,audio,food,
245,45752c15-a54c-4b0d-8fe3-f39c40f6c8d9,1932a904-86ba-4438-bb52-b7e6516a4019,photo,Education,
765,964a54b7-7dc7-40b0-bd94-9898b670f9cd,15e325b1-c221-4bf8-8010-18d76b03646e,photo,science,


In [4]:
df_reactions.sample(3)

Unnamed: 0,Content ID,User ID,Type,Datetime
21739,b3bcbff0-d160-4fbc-b3c6-8113ceb38056,,,2020-10-31 18:22:55
20017,6912f8e6-86b7-44f6-8aa8-c21c2ed0b53c,c43c2351-9591-4122-acdd-b521723d7292,cherish,2021-01-22 09:34:38
21839,2ceee25a-8461-4409-ab41-7a72d97d722d,fe224147-e893-4178-b46e-b12f22bd7ed1,dislike,2020-08-02 01:45:59


In [5]:
df_react_type.sample(3)

Unnamed: 0,Type,Sentiment,Score
15,worried,negative,12
3,hate,negative,5
0,heart,positive,60


### Cleaning the data

Clean the data by:

- removing rows that have values which are missing.
- changing the data type of some values within a column.
- removing columns which are not relevant to this task.

Before we start, let's standardize things by lowercasing the column names and the values.

In [6]:
# Lowercase the column names and replace spaces with underscores for easier access
df_content.columns = df_content.columns.str.lower().str.replace(' ', '_')
df_reactions.columns = df_reactions.columns.str.lower().str.replace(' ', '_')
df_react_type.columns = df_react_type.columns.str.lower()
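
Since the same renaming is repeated for each dataframe, it could be wrapped in a small helper. This is only a sketch: `normalize_columns` is a hypothetical name, and the extra `str.strip()` is an assumption that stray whitespace around headers should also be removed.

```python
import pandas as pd

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose column names are lowercase with spaces replaced by underscores."""
    out = frame.copy()
    # strip() is an added assumption; lower() and replace() mirror the cell above
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out

# Empty demo frame that only carries the original header names
demo = pd.DataFrame(columns=["Content ID", "User ID", "Type"])
print(list(normalize_columns(demo).columns))
```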

**Content**

category column

In [7]:
df_content.category.unique()

array(['Studying', 'healthy eating', 'technology', 'food', 'cooking',
       'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
       'fitness', 'education', 'studying', 'veganism', 'Animals',
       'animals', 'culture', '"culture"', 'Fitness', '"studying"',
       'Veganism', '"animals"', 'Travel', '"soccer"', 'Education',
       '"dogs"', 'Technology', 'Soccer', '"tennis"', 'Culture', '"food"',
       'Food', '"technology"', 'Healthy Eating', '"cooking"', 'Science',
       '"public speaking"', '"veganism"', 'Public Speaking', '"science"'],
      dtype=object)

In [8]:
# Lowercase all categories and remove the double quotes
df_content['category'] = df_content['category'].str.lower().str.replace('"', '')

In [9]:
df_content.category.unique()

array(['studying', 'healthy eating', 'technology', 'food', 'cooking',
       'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
       'fitness', 'education', 'veganism', 'animals', 'culture'],
      dtype=object)
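
The effect of this cleaning step can be checked on a handful of values. The sample below is made up, and the `str.strip()` call is an extra precaution against surrounding whitespace that the real column may not need.

```python
import pandas as pd

# Made-up sample with the same kinds of inconsistencies as the category column
raw = pd.Series(["Studying", '"culture"', " food ", "Healthy Eating", "studying"])

cleaned = (
    raw.str.replace('"', "", regex=False)  # remove literal double quotes
    .str.strip()                           # assumption: also trim whitespace
    .str.lower()                           # unify the casing
)

print(raw.nunique(), "->", cleaned.nunique())
```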

type column (renamed to category_type below)

In [10]:
df_content.type.unique()

array(['photo', 'video', 'GIF', 'audio'], dtype=object)

It is better to lowercase these values so that `GIF` is consistent with the other types and easier to filter on later, since most people are used to typing in lowercase.

In [11]:
df_content['type'] = df_content['type'].str.lower()
df_content.type.unique()

array(['photo', 'video', 'gif', 'audio'], dtype=object)

**Reactions**

reaction_type column

In [12]:
for x in df_reactions.type.unique():
    print(x)

nan
disgust
dislike
scared
interested
peeking
cherish
hate
love
indifferent
super love
intrigued
worried
like
heart
want
adore


Everything looks good here apart from the `nan` entry, which is handled in the null check further down. Let's check the last one, the react type dataframe.

**React Type**

In [13]:
for x in df_react_type.type.unique():
    print(x)

heart
want
disgust
hate
interested
indifferent
love
super love
cherish
adore
like
dislike
intrigued
peeking
scared
worried


Lastly, let's check whether the reaction types in the "React Type" and "Reactions" dataframes are the same.

In [14]:
are_same = set(df_reactions.type.unique()) == set(df_react_type.type.unique())
are_same

False

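The comparison returns `False` only because `df_reactions.type` still contains `nan`; the 16 non-null reaction types listed above are identical in both dataframes. We can therefore say that every category, content type and reaction type is valid, and move on to checking for null values and duplicates.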
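
A minimal sketch of why a missing value makes two otherwise identical sets compare as different, and how dropping it first gives a cleaner check. The two Series are made-up stand-ins for the `type` columns.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df_reactions.type and df_react_type.type
reactions_type = pd.Series([np.nan, "like", "hate"])
lookup_type = pd.Series(["hate", "like"])

# NaN counts as an extra element on one side, so the sets differ
with_nan = set(reactions_type.unique()) == set(lookup_type.unique())

# Dropping NaN first compares only the real reaction types
without_nan = set(reactions_type.dropna().unique()) == set(lookup_type.unique())

# Reaction types used in reactions that have no entry in the lookup table
unknown = set(reactions_type.dropna()) - set(lookup_type)

print(with_nan, without_nan, unknown)
```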

#### Checking for null values

In [15]:
df_content.isnull().sum()

content_id      0
user_id         0
type            0
category        0
url           199
dtype: int64

Null values in URL.

In [16]:
df_reactions.isnull().sum()

content_id       0
user_id       3019
type           980
datetime         0
dtype: int64

Null values in User ID and Type.

In [17]:
df_react_type.isnull().sum()

type         0
sentiment    0
score        0
dtype: int64

No null values in the react type dataframe.

Since we don't need user_id or url for this analysis, we will ignore their null values. We will only drop the rows where type is null, because type is needed for the analysis.

In [18]:
# Remove rows with a null type from reactions
df_reactions.dropna(subset=['type'], axis=0, inplace=True)

In [19]:
df_content.isnull().sum()

content_id      0
user_id         0
type            0
category        0
url           199
dtype: int64

In [20]:
df_reactions.isnull().sum()

content_id       0
user_id       2039
type             0
datetime         0
dtype: int64
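
The same selective null handling, sketched on a made-up three-row frame: rows are dropped only when `type` is missing, while missing `user_id` values are left alone.

```python
import numpy as np
import pandas as pd

# Made-up rows with nulls in both user_id and type
toy = pd.DataFrame({
    "content_id": ["c1", "c2", "c3"],
    "user_id": [np.nan, "u2", np.nan],
    "type": ["like", np.nan, "hate"],
})

# Drop a row only when `type` is missing; nulls in user_id are tolerated
cleaned = toy.dropna(subset=["type"])
print(cleaned.isnull().sum())
```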

#### Removing columns which are not relevant to this task

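- Content: we don't need user id and url.
- Reactions: we don't need user id. Datetime is not needed for the top 5 ranking, but it is kept because the month and day-of-week questions depend on it.
- React type: sentiment is not needed for the ranking either, but it is kept in the final data.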

In [21]:
df_content.drop(columns=['user_id', 'url'], inplace=True)
df_reactions.drop(columns=['user_id'], inplace=True)

In [22]:
df_content.sample()

Unnamed: 0,content_id,type,category
772,85260ec2-7b35-4266-8071-580a8b4341ca,photo,tennis


In [23]:
df_reactions.sample()

Unnamed: 0,content_id,type,datetime
13913,01aff5ec-2aa8-412e-99ec-526f0f9a6d5e,cherish,2020-07-14 16:38:33


In [24]:
df_react_type.sample()

Unnamed: 0,type,sentiment,score
0,heart,positive,60


#### Changing data types

**Content**

In [25]:
df_content.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   content_id  1000 non-null   object
 1   type        1000 non-null   object
 2   category    1000 non-null   object
dtypes: object(3)
memory usage: 23.6+ KB


In [26]:
df_content.type.unique()

array(['photo', 'video', 'gif', 'audio'], dtype=object)

In [27]:
df_content.category.unique()

array(['studying', 'healthy eating', 'technology', 'food', 'cooking',
       'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
       'fitness', 'education', 'veganism', 'animals', 'culture'],
      dtype=object)

**Reactions**

In [28]:
df_reactions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24573 entries, 1 to 25552
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   content_id  24573 non-null  object
 1   type        24573 non-null  object
 2   datetime    24573 non-null  object
dtypes: object(3)
memory usage: 767.9+ KB


In [29]:
df_reactions.type.unique()

array(['disgust', 'dislike', 'scared', 'interested', 'peeking', 'cherish',
       'hate', 'love', 'indifferent', 'super love', 'intrigued',
       'worried', 'like', 'heart', 'want', 'adore'], dtype=object)

**React Type**

In [30]:
df_react_type.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   type       16 non-null     object
 1   sentiment  16 non-null     object
 2   score      16 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 512.0+ bytes


In [31]:
df_react_type.type.unique()

array(['heart', 'want', 'disgust', 'hate', 'interested', 'indifferent',
       'love', 'super love', 'cherish', 'adore', 'like', 'dislike',
       'intrigued', 'peeking', 'scared', 'worried'], dtype=object)

In [32]:
df_react_type.score.unique()

array([60, 70,  0,  5, 30, 20, 65, 75, 72, 50, 10, 45, 35, 15, 12],
      dtype=int64)

Based on this, we will just change type/category to categorical. I will also rename the type columns so that they aren't confused with each other after the merge.

In [33]:
# Content
df_content['category'] = df_content['category'].astype('category')
df_content['type'] = df_content['type'].astype('category')
df_content.rename(columns={'type': 'category_type'}, inplace=True)

# Reactions
df_reactions['type'] = df_reactions['type'].astype('category')
df_reactions.rename(columns={'type': 'reaction_type'}, inplace=True)

# React type
df_react_type['type'] = df_react_type['type'].astype('category')
df_react_type.rename(columns={'type': 'reaction_type'}, inplace=True)
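
One practical reason for the categorical conversion is memory: a column with few distinct values stores each label once plus small integer codes. A rough sketch on a made-up column (the exact byte counts depend on the pandas version, so none are quoted here):

```python
import pandas as pd

# Made-up column: four distinct content types repeated many times
as_object = pd.Series(["photo", "video", "gif", "audio"] * 1000)
as_category = as_object.astype("category")

# deep=True also counts the memory used by the strings themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
print(list(as_category.cat.categories))
```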

### Merging the dataframe

We will use an inner join so that only rows with a match in both tables are kept, which means the merge itself introduces no nulls.
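
To make the choice of join concrete, the sketch below compares an inner and a left join on made-up tables. The `indicator` and `validate` arguments are optional extras of `DataFrame.merge` that are not used in this notebook; they are shown only as a way to inspect and guard a merge.

```python
import pandas as pd

# Made-up tables: content c2 has no reactions, reaction c9 has no content
content = pd.DataFrame({"content_id": ["c1", "c2"], "category": ["food", "dogs"]})
reactions = pd.DataFrame({
    "content_id": ["c1", "c1", "c9"],
    "reaction_type": ["like", "hate", "like"],
})

# Inner join: only content_id values present in both tables survive.
# validate raises an error if content_id is not unique on the left side.
inner = content.merge(reactions, how="inner", on="content_id", validate="one_to_many")

# Left join: unmatched content rows are kept and filled with NaN
left = content.merge(reactions, how="left", on="content_id", indicator=True)

print(inner.isnull().sum().sum(), left["reaction_type"].isnull().sum())
```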

In [34]:
df_content.sample()

Unnamed: 0,content_id,category_type,category
841,ec69ad7c-c57b-4d10-ad0f-930ff10bf687,video,science


In [35]:
df_reactions.sample()

Unnamed: 0,content_id,reaction_type,datetime
6541,7fe36d94-8867-4719-9c62-60b077cf5973,interested,2020-10-14 13:42:50


In [36]:
df_react_type.sample()

Unnamed: 0,reaction_type,sentiment,score
1,want,positive,70


In [37]:
df_content.sample()

Unnamed: 0,content_id,category_type,category
718,ef1ed168-7433-4233-ade8-95070d70a382,audio,travel


In [38]:
df = df_content.merge(df_reactions, how='inner', on='content_id').merge(df_react_type, how='inner', on='reaction_type')

In [39]:
df.isnull().sum()

content_id       0
category_type    0
category         0
reaction_type    0
datetime         0
sentiment        0
score            0
dtype: int64

Now that we have the final dataframe, let's answer the question.

## EDA

**Question**: top 5 categories with the largest popularity

In [40]:
df_copy = df.copy()
df_copy.sample(3)

Unnamed: 0,content_id,category_type,category,reaction_type,datetime,sentiment,score
18330,b2a20047-ac6f-4caa-90b1-fe735433e362,photo,culture,worried,2020-10-28 06:30:44,negative,12
2864,b9ff65d6-b79a-4f9b-8745-10088ddf9934,video,animals,dislike,2020-08-19 07:49:55,negative,10
12434,026973ef-4b73-4901-9160-bc9e04516057,audio,culture,indifferent,2021-05-03 17:58:20,neutral,20


In [41]:
df_top_5_cat = (
    df_copy
    .groupby('category')
    .agg({'score': 'sum'})
    .reset_index()
    .nlargest(5, 'score')
)
df_top_5_cat

Unnamed: 0,category,score
0,animals,74965
9,science,71168
7,healthy eating,69339
12,technology,68738
6,food,66676
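
The same group, sum and rank pattern can be checked by hand on a tiny made-up frame. This variant selects the `score` column before summing, which is an equivalent alternative to `.agg({'score': 'sum'})`, not the cell above.

```python
import pandas as pd

# Made-up scores; the totals are animals 70, science 65, food 35, travel 0
toy = pd.DataFrame({
    "category": ["animals", "science", "food", "animals", "food", "travel"],
    "score": [60, 65, 5, 10, 30, 0],
})

top_2 = (
    toy.groupby("category")["score"]
    .sum()            # total score per category
    .nlargest(2)      # keep the two largest totals
    .reset_index()    # turn the category index back into a column
)
print(top_2)
```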


They are asking for one Excel file: sheet 1 is the cleaned, merged data and sheet 2 is the top 5 categories.

Before making the final file, let's check the data types.

In [42]:
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24573 entries, 0 to 24572
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   content_id     24573 non-null  object  
 1   category_type  24573 non-null  category
 2   category       24573 non-null  category
 3   reaction_type  24573 non-null  category
 4   datetime       24573 non-null  object  
 5   sentiment      24573 non-null  object  
 6   score          24573 non-null  int64   
dtypes: category(3), int64(1), object(3)
memory usage: 1.0+ MB
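
The output above shows `datetime` still stored as `object`. That is fine for the score ranking, but the month and day-of-week questions would need a real datetime type. A possible conversion, sketched on three made-up rows that reuse date strings in the same format (the `format` string is an assumption based on the sample rows shown earlier):

```python
import pandas as pd

# Made-up frame; the strings follow the format seen in the Reactions samples
toy = pd.DataFrame({
    "datetime": ["2020-10-31 18:22:55", "2021-01-22 09:34:38", "2020-08-02 01:45:59"],
})

# Parse the strings into datetime64 values
toy["datetime"] = pd.to_datetime(toy["datetime"], format="%Y-%m-%d %H:%M:%S")

# Derive the fields the follow-up questions ask about
toy["month"] = toy["datetime"].dt.month_name()
toy["day_of_week"] = toy["datetime"].dt.day_name()
print(toy)
```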


In [43]:
df_top_5_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 6
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   category  5 non-null      category
 1   score     5 non-null      int64   
dtypes: category(1), int64(1)
memory usage: 769.0 bytes


Everything looks good, so we'll save the result as xlsx, as requested. We will use ExcelWriter to make the final file.

In [44]:
# Write each DataFrame to a separate sheet; the context manager saves and closes the file
with pd.ExcelWriter('data_modeling_final.xlsx', engine='xlsxwriter') as writer:
    df_copy.to_excel(writer, sheet_name='Cleaned Data', index=False)
    df_top_5_cat.to_excel(writer, sheet_name='Top 5 Category by Score', index=False)