### EDA and Data Cleaning

Let's begin by importing the necessary libraries and examining our data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
content_df = pd.read_csv('data/Content.csv', index_col = 0)
content_df.head()

Unnamed: 0,Content ID,User ID,Type,Category,URL
0,97522e57-d9ab-4bd6-97bf-c24d952602d2,8d3cd87d-8a31-4935-9a4f-b319bfe05f31,photo,Studying,https://socialbuzz.cdn.com/content/storage/975...
1,9f737e0a-3cdd-4d29-9d24-753f4e3be810,beb1f34e-7870-46d6-9fc7-2e12eb83ce43,photo,healthy eating,https://socialbuzz.cdn.com/content/storage/9f7...
2,230c4e4d-70c3-461d-b42c-ec09396efb3f,a5c65404-5894-4b87-82f2-d787cbee86b4,photo,healthy eating,https://socialbuzz.cdn.com/content/storage/230...
3,356fff80-da4d-4785-9f43-bc1261031dc6,9fb4ce88-fac1-406c-8544-1a899cee7aaf,photo,technology,https://socialbuzz.cdn.com/content/storage/356...
4,01ab84dd-6364-4236-abbb-3f237db77180,e206e31b-5f85-4964-b6ea-d7ee5324def1,video,food,https://socialbuzz.cdn.com/content/storage/01a...


In [3]:
content_df.describe()

Unnamed: 0,Content ID,User ID,Type,Category,URL
count,1000,1000,1000,1000,801
unique,1000,446,4,41,801
top,97522e57-d9ab-4bd6-97bf-c24d952602d2,72d2587e-8fae-4626-a73d-352e6465ba0f,photo,technology,https://socialbuzz.cdn.com/content/storage/975...
freq,1,8,261,71,1


In [4]:
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Content ID  1000 non-null   object
 1   User ID     1000 non-null   object
 2   Type        1000 non-null   object
 3   Category    1000 non-null   object
 4   URL         801 non-null    object
dtypes: object(5)
memory usage: 46.9+ KB


Let's start by dropping rows with null values.

In [5]:
content_df.dropna(inplace = True)
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 801 entries, 0 to 999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Content ID  801 non-null    object
 1   User ID     801 non-null    object
 2   Type        801 non-null    object
 3   Category    801 non-null    object
 4   URL         801 non-null    object
dtypes: object(5)
memory usage: 37.5+ KB
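Before dropping rows, it can help to count the nulls in each column so it is clear which column accounts for the lost rows (here, the row count fell from 1000 to 801). The sketch below is illustrative only: `toy_df` and its values are made up and stand in for `content_df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for content_df; the values are invented.
toy_df = pd.DataFrame({
    "Content ID": ["a", "b", "c"],
    "Category": ["food", "travel", "science"],
    "URL": ["https://example.com/a", None, "https://example.com/c"],
})

# Count missing values per column to see which columns drive the row loss.
missing_per_column = toy_df.isna().sum()

# dropna() with default arguments removes every row that has at least one null.
cleaned_df = toy_df.dropna()
rows_removed = len(toy_df) - len(cleaned_df)

print(missing_per_column)
print("rows removed:", rows_removed)
```

Running the same `isna().sum()` call on the real table before `dropna` would show whether the nulls sit in a column that is about to be discarded anyway.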


We won't need the User ID or URL columns for our analysis, so let's drop those. Then let's check the values in the Category column to make sure the labels are consistent.

In [6]:
content_df.drop(columns=['User ID', 'URL'], inplace=True)
content_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 801 entries, 0 to 999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Content ID  801 non-null    object
 1   Type        801 non-null    object
 2   Category    801 non-null    object
dtypes: object(3)
memory usage: 25.0+ KB


In [7]:
content_df['Category'].value_counts()

travel             60
science            55
fitness            54
animals            54
culture            54
technology         54
tennis             53
cooking            51
healthy eating     50
dogs               49
education          48
soccer             47
food               47
veganism           43
studying           41
public speaking    40
Studying            1
Name: Category, dtype: int64

Looks good except for the one entry that has "Studying" capitalized. Let's fix that.

In [9]:
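# Limit the replacement to the Category column so other columns are left untouched.
content_df['Category'] = content_df['Category'].replace("Studying", "studying")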
content_df['Category'].value_counts()

travel             60
science            55
technology         54
animals            54
fitness            54
culture            54
tennis             53
cooking            51
healthy eating     50
dogs               49
education          48
food               47
soccer             47
veganism           43
studying           42
public speaking    40
Name: Category, dtype: int64
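Replacing one known value works for a single stray label. If more inconsistencies were expected, a more general option would be to normalize case and surrounding whitespace for the whole column in one pass. This is a minimal sketch on an invented Series, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical labels with mixed case and stray whitespace; not from the real data.
categories = pd.Series(["Studying", "studying", " travel", "Travel "], name="Category")

# Strip leading/trailing whitespace, then lowercase every label.
normalized = categories.str.strip().str.lower()

# Variants of the same label now collapse into a single category.
counts = normalized.value_counts()
print(counts)
```

The trade-off is that blanket lowercasing would also alter labels whose capitalization is meaningful, so it is worth checking `value_counts()` before and after.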

Great - now let's look at our second table, Reactions.

In [10]:
reactions_df = pd.read_csv('data/Reactions.csv', index_col = 0)
reactions_df.head()

Unnamed: 0,Content ID,User ID,Type,Datetime
0,97522e57-d9ab-4bd6-97bf-c24d952602d2,,,2021-04-22 15:17:15
1,97522e57-d9ab-4bd6-97bf-c24d952602d2,5d454588-283d-459d-915d-c48a2cb4c27f,disgust,2020-11-07 09:43:50
2,97522e57-d9ab-4bd6-97bf-c24d952602d2,92b87fa5-f271-43e0-af66-84fac21052e6,dislike,2021-06-17 12:22:51
3,97522e57-d9ab-4bd6-97bf-c24d952602d2,163daa38-8b77-48c9-9af6-37a6c1447ac2,scared,2021-04-18 05:13:58
4,97522e57-d9ab-4bd6-97bf-c24d952602d2,34e8add9-0206-47fd-a501-037b994650a2,disgust,2021-01-06 19:13:01


In [11]:
reactions_df.describe()

Unnamed: 0,Content ID,User ID,Type,Datetime
count,25553,22534,24573,25553
unique,980,500,16,25542
top,4b2d0fff-3b4f-43ca-a7df-c430479cb9ba,c76c3393-88e2-47b0-ac37-dc4f2053f5a5,heart,2020-10-29 20:51:08
freq,49,65,1622,2


In [12]:
reactions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25553 entries, 0 to 25552
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Content ID  25553 non-null  object
 1   User ID     22534 non-null  object
 2   Type        24573 non-null  object
 3   Datetime    25553 non-null  object
dtypes: object(4)
memory usage: 998.2+ KB


Again, we can drop the User ID column. Once it is gone, the only remaining nulls are in the Type column (24573 of 25553 rows are non-null), so we'll drop those rows and convert the "Datetime" column to a datetime data type.

In [13]:
reactions_df.drop(columns=['User ID'], inplace=True)
reactions_df['Datetime'] = pd.to_datetime(reactions_df['Datetime'])
reactions_df.dropna(inplace = True)
reactions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24573 entries, 1 to 25552
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   Content ID  24573 non-null  object        
 1   Type        24573 non-null  object        
 2   Datetime    24573 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 767.9+ KB
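The conversion above succeeded because every Datetime value parsed cleanly. If a file might contain malformed timestamps, one hedged option is `errors="coerce"`, which turns unparseable strings into `NaT` so they can be counted and then dropped along with other missing values. The input Series below is invented for illustration.

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp, one malformed string, one null.
raw = pd.Series(["2021-04-22 15:17:15", "not a date", None])

# errors="coerce" converts anything that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the input but failed to parse.
n_unparseable = int(parsed.isna().sum() - raw.isna().sum())
print(parsed.dtype, n_unparseable)
```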


That looks good. Finally, let's look at the Reaction Types Table.

In [14]:
reactiontypes_df = pd.read_csv('data/ReactionTypes.csv', index_col = 0)
reactiontypes_df.head()

Unnamed: 0,Type,Sentiment,Score
0,heart,positive,60
1,want,positive,70
2,disgust,negative,0
3,hate,negative,5
4,interested,positive,30


In [15]:
reactiontypes_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Type       16 non-null     object
 1   Sentiment  16 non-null     object
 2   Score      16 non-null     int64 
dtypes: int64(1), object(2)
memory usage: 512.0+ bytes


It looks like our data is clean, based on the client's specifications (remove rows with missing values, drop unnecessary columns, and ensure data types match the column values).
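The three specifications can also be checked programmatically rather than by reading the `info()` output. The helper below, `find_cleaning_problems`, is a hypothetical name introduced for this sketch, and the toy frame only mimics the shape of `reactions_df`.

```python
import pandas as pd

def find_cleaning_problems(df, expected_columns, datetime_columns=()):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Unnecessary columns should already be dropped.
    if list(df.columns) != list(expected_columns):
        problems.append("unexpected columns")
    # No row may contain a missing value.
    if df.isna().any().any():
        problems.append("missing values present")
    # Columns holding timestamps should have a datetime dtype.
    for col in datetime_columns:
        if col in df.columns and not pd.api.types.is_datetime64_any_dtype(df[col]):
            problems.append(f"{col} is not a datetime column")
    return problems

# Hypothetical toy frame shaped like the cleaned reactions table.
toy_reactions = pd.DataFrame({
    "Content ID": ["a", "b"],
    "Type": ["heart", "want"],
    "Datetime": pd.to_datetime(["2021-04-22 15:17:15", "2020-11-07 09:43:50"]),
})

expected = ["Content ID", "Type", "Datetime"]
problems = find_cleaning_problems(toy_reactions, expected, datetime_columns=["Datetime"])
print(problems)
```

Returning a list of problems instead of raising on the first failure makes it easy to see every violated specification at once.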