# Lab | Data Cleaning

## Introduction

A commonly repeated claim is that 80% of a data scientist's work is data cleaning. We have no idea whether this number is accurate, but data scientists do spend a lot of time and effort collecting, cleaning, and preparing data for analysis, because datasets are usually messy and complex. Being able to refine and restructure a dataset into a usable state is an essential skill before moving on to the analysis stage.

In this exercise, you will practice the data cleaning techniques we discussed in the lesson and learn new ones by looking up documentation and references. You will work on your own, but remember that the teaching staff is available whenever you run into problems.


## Resources

[Data Cleaning with Numpy and Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/#python-data-cleaning-recap-and-resources)

[Data Preparation](https://www.kdnuggets.com/2017/06/7-steps-mastering-data-preparation-python.html)

# Import libraries

In [1]:
# Your code here

import numpy as np
import pandas as pd

# Read the users dataset.

Check which separator the `users.csv` file uses.

In [2]:
# Your code here

pd.read_csv("./assets/w06_02_users.csv", nrows=2)

Unnamed: 0,Id#Reputation#CreationDate#DisplayName#LastAccessDate#WebsiteUrl#Location#AboutMe#Views#UpVotes#DownVotes#AccountId#Age#ProfileImageUrl
"-1#1#2010-07-19 06:55:26#Community#2010-07-19 06:55:26#http://meta.stackexchange.com/#on the server farm#""<p>Hi",I'm not really a person.</p>
<p>I'm a background process that helps keep this site clean!</p>,


In [3]:
df_users = pd.read_csv("./assets/w06_02_users.csv", sep="#")
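
One way to check the separator without inspecting the file by eye is to count candidate delimiters in the header line. The sketch below is only an illustration: it uses a small in-memory sample instead of the real file, and the `candidates` list and the `likely_sep` name are assumptions made for this example.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the first lines of the users file (illustrative data).
raw = "Id#Reputation#Views\n-1#1#0\n2#101#25\n"

# Count how often each candidate separator appears in the header line.
header = raw.splitlines()[0]
candidates = [",", ";", "#", "\t"]
counts = {sep: header.count(sep) for sep in candidates}
likely_sep = max(counts, key=counts.get)

# Parse with the separator that occurs most often in the header.
sample = pd.read_csv(io.StringIO(raw), sep=likely_sep)
print(likely_sep, sample.shape)
```

This heuristic can be fooled by headers that contain commas or semicolons inside quoted names, so it is worth confirming the result with `.head()`.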

## Check its shape

See the number of rows and columns you're dealing with.

In [4]:
# Your code here

df_users.shape

(40503, 14)

## Use `.head()` to see some rows of your dataframe.

In [5]:
# Your code here

df_users.head()

Unnamed: 0,Id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,
2,3,101,2010-07-19 15:34:50,Jarrod Dixon,2014-08-08 06:42:58,http://stackoverflow.com,"New York, NY","<p><a href=""http://blog.stackoverflow.com/2009...",22,19,0,3,35.0,
3,4,101,2010-07-19 19:03:27,Emmett,2014-01-02 09:31:02,http://minesweeperonline.com,"San Francisco, CA",<p>currently at a startup in SF</p>\n\n<p>form...,11,0,0,1998,28.0,http://i.stack.imgur.com/d1oHX.jpg
4,5,6792,2010-07-19 19:03:57,Shane,2014-08-13 00:23:47,http://www.statalgo.com,"New York, NY",<p>Quantitative researcher focusing on statist...,1145,662,5,54503,35.0,


## Get the data info. 

Which columns have a large number of missing values? How much memory does this dataframe occupy?

Expected output:
````
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB
````

In [6]:
# Your code here

df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40503 entries, 0 to 40502
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               40503 non-null  int64  
 1   Reputation       40503 non-null  int64  
 2   CreationDate     40503 non-null  object 
 3   DisplayName      40497 non-null  object 
 4   LastAccessDate   40503 non-null  object 
 5   WebsiteUrl       8158 non-null   object 
 6   Location         11731 non-null  object 
 7   AboutMe          9424 non-null   object 
 8   Views            40503 non-null  int64  
 9   UpVotes          40503 non-null  int64  
 10  DownVotes        40503 non-null  int64  
 11  AccountId        40503 non-null  int64  
 12  Age              8352 non-null   float64
 13  ProfileImageUrl  16540 non-null  object 
dtypes: float64(1), int64(6), object(7)
memory usage: 4.3+ MB
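
The `+` in `4.3+ MB` means the figure is a lower bound: the contents of `object` columns are not fully measured unless a deep inspection is requested. The sketch below, on a made-up frame whose column names merely mirror the lab, shows one way to rank columns by their share of missing values and to compare shallow and deep memory usage.

```python
import pandas as pd

# Illustrative frame; the column names mirror the lab but the values are made up.
df = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "Location": ["Lisbon", None, None, "Porto"],
        "Age": [37.0, None, None, None],
    }
)

# Fraction of missing values per column, largest first.
missing_share = df.isnull().mean().sort_values(ascending=False)

# Shallow vs. deep memory usage in bytes; deep also measures the strings themselves.
shallow_bytes = df.memory_usage().sum()
deep_bytes = df.memory_usage(deep=True).sum()
print(missing_share)
print(shallow_bytes, deep_bytes)
```

On the real data, `df_users.info(memory_usage="deep")` reports the deep figure directly.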


## Rename Id column to user_id.

Remember to store your results back in the dataframe.

In [7]:
# Your code here

df_users = df_users.rename(columns={"Id": "user_id"})
df_users.head(2)

Unnamed: 0,user_id,Reputation,CreationDate,DisplayName,LastAccessDate,WebsiteUrl,Location,AboutMe,Views,UpVotes,DownVotes,AccountId,Age,ProfileImageUrl
0,-1,1,2010-07-19 06:55:26,Community,2010-07-19 06:55:26,http://meta.stackexchange.com/,on the server farm,"<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...",0,5007,1920,-1,,
1,2,101,2010-07-19 14:01:36,Geoff Dalgas,2013-11-12 22:07:23,http://stackoverflow.com,"Corvallis, OR",<p>Developer on the StackOverflow team. Find ...,25,3,0,2,37.0,


# Import the `posts_file.csv` dataset

In [8]:
# Your code here

df_posts = pd.read_csv("./assets/w06_02_posts_file.csv", sep=",")

## Repeat the steps above to get a feel for your data (head, info, shape)

In [9]:
# shape

df_posts.shape

(8299, 21)

In [10]:
# head

df_posts.head()

Unnamed: 0,post_id,PostTypeId,AcceptedAnswerId,CreaionDate,Score,ViewCount,Body,user_id,LasActivityDate,Title,...,AnswerCount,CommentCount,FavoriteCount,LastEditorUserId,LastEditDate,CommunityOwnedDate,ParentId,ClosedDate,OwnerDisplayName,LastEditorDisplayName
0,67711,2,,2013-08-19 02:31:14,3,,"<p>At least in OLS, flipping the direction ($x...",805.0,2013-08-19 10:17:50,,...,,0,,805.0,2013-08-19 10:17:50,,67709.0,,,
1,92493,1,,2014-04-04 05:35:59,2,18.0,<p>I have used a psychometric survey of 10 ite...,43085.0,2014-04-04 05:35:59,Multiple Regression - Extreme F-statistic and ...,...,0.0,1,,,,,,,,
2,86981,2,,2014-02-18 08:39:04,1,,<p>I think that this is due to familywise erro...,38450.0,2014-02-18 09:04:57,,...,,1,,805.0,2014-02-18 09:04:57,,86889.0,,,
3,38717,2,,2012-10-05 09:49:08,2,,"<p>There's a <a href=""http://www.ncbi.nlm.nih....",4598.0,2014-02-15 11:26:53,,...,,3,,4598.0,2014-02-15 11:26:53,,38541.0,,,
4,113919,2,,2014-09-01 01:41:05,3,,"<p>For that data, the estimated regression equ...",805.0,2014-09-01 20:09:53,,...,,1,,805.0,2014-09-01 20:09:53,,113871.0,,,


In [11]:
# info

df_posts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8299 entries, 0 to 8298
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   post_id                8299 non-null   int64  
 1   PostTypeId             8299 non-null   int64  
 2   AcceptedAnswerId       1344 non-null   float64
 3   CreaionDate            8299 non-null   object 
 4   Score                  8299 non-null   int64  
 5   ViewCount              3966 non-null   float64
 6   Body                   8284 non-null   object 
 7   user_id                8197 non-null   float64
 8   LasActivityDate        8299 non-null   object 
 9   Title                  3966 non-null   object 
 10  Tags                   3966 non-null   object 
 11  AnswerCount            3966 non-null   float64
 12  CommentCount           8299 non-null   int64  
 13  FavoriteCount          1217 non-null   float64
 14  LastEditorUserId       4071 non-null   float64
 15  Last

## Rename Id column to post_id and OwnerUserId to user_id.

Again, remember to check that your results are correctly stored inside the dataframe.

In [12]:
# Your code here

df_posts = df_posts.rename(columns={"Id": "post_id", "OwnerUserId": "user_id"})
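
The `info()` output above already lists `post_id` and `user_id`, so in this copy of the data the rename may have nothing to change. By default, `rename` silently skips labels that do not exist, which makes that easy to miss. The sketch below uses a made-up one-row frame to show the default behaviour next to `errors="raise"`.

```python
import pandas as pd

# Illustrative frame that already uses the target column names.
posts = pd.DataFrame({"post_id": [67711], "user_id": [805.0]})

# Default behaviour: labels that are not present are skipped without any error.
same = posts.rename(columns={"Id": "post_id", "OwnerUserId": "user_id"})

# With errors="raise", a missing label raises a KeyError instead.
try:
    posts.rename(columns={"Id": "post_id"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(list(same.columns), raised)
```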

## Define new dataframes for users and posts with the following selected columns:
**users columns**: user_id, Reputation, Views, UpVotes, DownVotes  
**posts columns**: post_id, Score, user_id, ViewCount, CommentCount, Body

In [13]:
# Your code here

users = df_users.loc[:, ["user_id", "Reputation", "Views", "UpVotes", "DownVotes"]]

users.head(2)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0


In [14]:
posts = df_posts.loc[
    :, ["post_id", "Score", "user_id", "ViewCount", "CommentCount", "Body"]
]

posts.head(2)

Unnamed: 0,post_id,Score,user_id,ViewCount,CommentCount,Body
0,67711,3,805.0,,0,"<p>At least in OLS, flipping the direction ($x..."
1,92493,2,43085.0,18.0,1,<p>I have used a psychometric survey of 10 ite...


## Merge the new dataframes you have created
- Create a dataframe called `posts_from_users` by merging users and posts.
- You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.
- Think carefully about which key(s) to merge on.

In [15]:
# Your code here

posts_from_users = pd.merge(users, posts, on="user_id", how="inner")
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,
...,...,...,...,...,...,...,...,...,...,...
8517,55628,1,0,0,0,115148,0,21.0,2,<p>I am currently doing research on social med...
8518,55633,4,1,0,0,115162,0,25.0,1,<p>I am new to using R. I am trying to figur...
8519,55637,26,4,0,0,115170,1,,0,"<p>When you say class, I hope you mean 'output..."
8520,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
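
To make the merge's assumptions explicit, `pd.merge` accepts a `validate` argument. The sketch below uses small made-up frames and assumes each user should appear once in `users` and possibly many times in `posts`. That relationship is an assumption for illustration, not something the lab states.

```python
import pandas as pd

# Illustrative data: user 3 has no posts, and user 4 is missing from users.
users = pd.DataFrame({"user_id": [1, 2, 3], "Reputation": [101, 6792, 1]})
posts = pd.DataFrame({"post_id": [10, 11, 12, 13], "user_id": [1, 1, 2, 4]})

# Inner merge keeps only user_id values present in both frames.
# validate="one_to_many" raises a MergeError if user_id repeats in `users`.
merged = pd.merge(users, posts, on="user_id", how="inner", validate="one_to_many")
print(merged)
```

An inner merge drops users without posts and posts without a matching user, so the row count of the result can differ from both inputs.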


## Check the number of duplicated rows.

Remember that you can sum a boolean mask to count how many times `True` appears in it. This works because Python treats `True` as `1` and `False` as `0`.

In [16]:
# Your code here

posts_from_users.duplicated().sum()

351
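
As a small illustration of counting with a boolean mask, the sketch below builds a made-up three-row frame in which one row repeats.

```python
import pandas as pd

# Illustrative frame: the second row repeats the first.
df = pd.DataFrame({"user_id": [760, 760, 805], "post_id": [1289, 1289, 67711]})

# duplicated() marks every repeat of an earlier row as True.
mask = df.duplicated()

# Summing a boolean Series counts the True entries.
n_duplicates = int(mask.sum())
print(mask.tolist(), n_duplicates)
```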

## Find those duplicate values and try to understand what happened.

Hints:   
- You can pass `keep=False` to the `.duplicated()` method to flag every occurrence of a duplicated row, not just the repeats.
- You can sort the values `by=['user_id', 'post_id']` to see them in order.  


In [17]:
# Your code here

posts_from_users[posts_from_users.duplicated(keep=False)].sort_values(
    by=["user_id", "post_id"]
)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
735,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
739,760,168,13,13,0,1289,7,1139.0,8,<p>I am having difficulties to select the righ...
734,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
736,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
738,760,168,13,13,0,8625,6,1799.0,3,<p>I was fiddling with PCA and LDA methods and...
...,...,...,...,...,...,...,...,...,...,...
8475,54711,4,18,0,0,114527,0,45.0,5,<p>From Shapiro-Wilk's test I see that the res...
8477,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
8478,54741,16,1,0,0,113334,3,122.0,9,<p>I am confused on what I have read about the...
8486,54911,1,1,0,0,113691,0,36.0,11,<p>I extract data related to a movie by sentim...
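
The difference between the default `keep="first"` and `keep=False` is easiest to see on a tiny example. The frame below is made up for illustration.

```python
import pandas as pd

# Illustrative frame: rows 0 and 2 are identical.
df = pd.DataFrame(
    {"user_id": [760, 805, 760, 760], "post_id": [8625, 67711, 8625, 1289]}
)

# Default keep="first": only the later copies are flagged.
repeats_only = df[df.duplicated()]

# keep=False flags every member of a duplicated group, including the first one.
all_copies = df[df.duplicated(keep=False)].sort_values(by=["user_id", "post_id"])
print(len(repeats_only), len(all_copies))
```

This is why the listing with `keep=False` has more rows than the count returned by `.duplicated().sum()`.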


## Should you drop them?
If you think it is reasonable to drop the duplicates, drop them. Then think about how you would prevent them at the source: what went wrong in the first place?  
*Hint: There's a pandas method to drop duplicates. If you wanted to do it by hand, you could select the indexes of the duplicated rows and `.drop()` them.*

In [18]:
# Your code here

posts_from_users = posts_from_users.drop_duplicates()
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,
...,...,...,...,...,...,...,...,...,...,...
8517,55628,1,0,0,0,115148,0,21.0,2,<p>I am currently doing research on social med...
8518,55633,4,1,0,0,115162,0,25.0,1,<p>I am new to using R. I am trying to figur...
8519,55637,26,4,0,0,115170,1,,0,"<p>When you say class, I hope you mean 'output..."
8520,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
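
If the duplicates come from repeated rows in one of the source tables (an assumption this lab does not confirm), they can be removed either before or after the merge. The sketch below uses made-up frames to compare cleaning the source first with dropping the flagged index labels by hand.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [760, 805], "Reputation": [168, 4000]})
# Illustrative posts table in which one post was recorded twice.
posts = pd.DataFrame({"post_id": [1289, 1289, 67711], "user_id": [760, 760, 805]})

# Option 1: clean the source table, then merge.
clean_first = pd.merge(users, posts.drop_duplicates(), on="user_id", how="inner")

# Option 2: merge first, then drop the flagged index labels by hand.
merged = pd.merge(users, posts, on="user_id", how="inner")
duplicate_labels = merged[merged.duplicated()].index
by_hand = merged.drop(duplicate_labels)
print(len(merged), len(clean_first), len(by_hand))
```

Both routes give the same rows here. Cleaning the source first has the advantage of showing where the repetition originated.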


## How many missing values do you have in your merged dataframe? In which columns?

In [19]:
# Your code here

posts_from_users.isnull().sum().sort_values(ascending=False)

ViewCount       4277
Body              15
user_id            0
Reputation         0
Views              0
UpVotes            0
DownVotes          0
post_id            0
Score              0
CommentCount       0
dtype: int64

## Select only the rows in which there are at least some missing values.

In [20]:
# Your code here

posts_from_users.isnull().value_counts()

user_id  Reputation  Views  UpVotes  DownVotes  post_id  Score  ViewCount  CommentCount  Body 
False    False       False  False    False      False    False  True       False         False    4262
                                                                False      False         False    3894
                                                                True       False         True       15
dtype: int64

In [21]:
posts_from_users["ViewCount"].isnull().value_counts()

True     4277
False    3894
Name: ViewCount, dtype: int64

In [22]:
posts_from_users["Body"].isnull().value_counts()

False    8156
True       15
Name: Body, dtype: int64

In [23]:
missing_values = posts_from_users.isnull().any(axis=1)
posts_from_users[missing_values]

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,-1,1,0,5007,1920,10131,0,,0,
1,-1,1,0,5007,1920,16366,0,,0,
2,-1,1,0,5007,1920,40689,0,,0,
3,-1,1,0,5007,1920,28333,0,,0,
4,-1,1,0,5007,1920,32803,0,,0,
...,...,...,...,...,...,...,...,...,...,...
8504,55365,321,23,1,1,114888,1,,0,<p>You need to have some indication of the unc...
8507,55435,94,1,2,0,114837,-1,,4,"<p>Yes, it is. Identifiability means that if ..."
8511,55484,38,0,3,0,114869,1,,0,<p>I am battling similar problems at the momen...
8515,55599,31,2,0,0,115233,0,,0,<p>Before computing the variance-covariance ma...


## You will need to do something about the missing values. Will you drop them or fill them?

Pay attention: values can be missing for different reasons. Look at the `user_id` of some of the affected rows and at the body of the message. For which ones are you sure what the value should be, and which ones can you only infer? Don't rush; take a good look at your data.

In [24]:
# Your code here

posts_from_users = posts_from_users.dropna()
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
21,5,6792,1145,662,5,2213,15,3182.0,2,"<p>What is the difference between a <a href=""h..."
23,5,6792,1145,662,5,2077,21,9878.0,0,"<p>Besides taking differences, what are other ..."
26,5,6792,1145,662,5,2167,11,709.0,3,"<p>The <a href=""http://en.wikipedia.org/wiki/K..."
30,8,6764,1089,604,25,168,17,1022.0,1,<p>For univariate kernel density estimators (K...
39,18,128,8,16,0,3,54,3613.0,4,<p>What are some valuable Statistical Analysis...
...,...,...,...,...,...,...,...,...,...,...
8516,55608,1,0,0,0,115114,0,10.0,0,"<p>I have a dataset with 3 variables (X,Y and ..."
8517,55628,1,0,0,0,115148,0,21.0,2,<p>I am currently doing research on social med...
8518,55633,4,1,0,0,115162,0,25.0,1,<p>I am new to using R. I am trying to figur...
8520,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."
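
A blanket `dropna()` removes every row with at least one gap, which here takes the frame from 8171 rows down to 3894. A more selective treatment is sketched below on made-up data. Filling `ViewCount` with `0` is an assumption for illustration only: in the posts `info()` output `ViewCount` has the same non-null count as `Title`, which hints, but does not prove, that it is only recorded for one type of post.

```python
import pandas as pd

# Illustrative frame with gaps in two different columns.
df = pd.DataFrame(
    {
        "post_id": [1, 2, 3, 4],
        "ViewCount": [18.0, None, None, 45.0],
        "Body": ["<p>a</p>", "<p>b</p>", None, "<p>d</p>"],
    }
)

# Drop only the rows whose Body is missing; they carry no text to analyse.
with_body = df.dropna(subset=["Body"])

# Fill the remaining ViewCount gaps with 0 (an assumption, see the note above).
filled = with_body.fillna({"ViewCount": 0})

# For comparison: a blanket dropna() removes any row with at least one gap.
dropped_all = df.dropna()
print(len(with_body), len(filled), len(dropped_all))
```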


## Reset the index

In [25]:
# Your code here

posts_from_users.reset_index(drop=True, inplace=True)

In [26]:
posts_from_users

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
0,5,6792,1145,662,5,2213,15,3182.0,2,"<p>What is the difference between a <a href=""h..."
1,5,6792,1145,662,5,2077,21,9878.0,0,"<p>Besides taking differences, what are other ..."
2,5,6792,1145,662,5,2167,11,709.0,3,"<p>The <a href=""http://en.wikipedia.org/wiki/K..."
3,8,6764,1089,604,25,168,17,1022.0,1,<p>For univariate kernel density estimators (K...
4,18,128,8,16,0,3,54,3613.0,4,<p>What are some valuable Statistical Analysis...
...,...,...,...,...,...,...,...,...,...,...
3889,55608,1,0,0,0,115114,0,10.0,0,"<p>I have a dataset with 3 variables (X,Y and ..."
3890,55628,1,0,0,0,115148,0,21.0,2,<p>I am currently doing research on social med...
3891,55633,4,1,0,0,115162,0,25.0,1,<p>I am new to using R. I am trying to figur...
3892,55734,1,0,0,0,115352,0,16.0,0,"<p>For example, I was looking at <a href=""http..."


## Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [27]:
# Your code here

posts_from_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3894 entries, 0 to 3893
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   user_id       3894 non-null   int64  
 1   Reputation    3894 non-null   int64  
 2   Views         3894 non-null   int64  
 3   UpVotes       3894 non-null   int64  
 4   DownVotes     3894 non-null   int64  
 5   post_id       3894 non-null   int64  
 6   Score         3894 non-null   int64  
 7   ViewCount     3894 non-null   float64
 8   CommentCount  3894 non-null   int64  
 9   Body          3894 non-null   object 
dtypes: float64(1), int64(8), object(1)
memory usage: 304.3+ KB


In [28]:
posts_from_users["user_id"] = posts_from_users["user_id"].astype("category")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  posts_from_users["user_id"] = posts_from_users["user_id"].astype("category")


In [29]:
posts_from_users["ViewCount"] = posts_from_users["ViewCount"].astype("int64")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  posts_from_users["ViewCount"] = posts_from_users["ViewCount"].astype("int64")
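
The warning above typically appears when pandas considers a frame to be derived from another one and a column is then assigned into it. One common way to sidestep it is to take an explicit copy after filtering and to convert the types in a single `astype` call that returns a new frame. The sketch below is a minimal illustration on made-up data, and the name `clean` is hypothetical.

```python
import pandas as pd

# Illustrative frame with one missing ViewCount.
df = pd.DataFrame(
    {"user_id": [5, 5, 8, 18], "ViewCount": [3182.0, None, 1022.0, 3613.0]}
)

# An explicit copy makes the cleaned frame independent of the original.
clean = df.dropna().reset_index(drop=True).copy()

# astype with a mapping converts several columns at once and returns a new frame.
clean = clean.astype({"user_id": "category", "ViewCount": "int64"})
print(clean.dtypes)
```

The cast to `int64` only works once the missing values are gone, which is why it comes after `dropna()`.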


In [30]:
posts_from_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3894 entries, 0 to 3893
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   user_id       3894 non-null   category
 1   Reputation    3894 non-null   int64   
 2   Views         3894 non-null   int64   
 3   UpVotes       3894 non-null   int64   
 4   DownVotes     3894 non-null   int64   
 5   post_id       3894 non-null   int64   
 6   Score         3894 non-null   int64   
 7   ViewCount     3894 non-null   int64   
 8   CommentCount  3894 non-null   int64   
 9   Body          3894 non-null   object  
dtypes: category(1), int64(8), object(1)
memory usage: 370.1+ KB


# Bonus (filtering) 
What is the average number of comments for users who are above the average reputation?  
*Hint: Calculate the average user Reputation and store it in a variable called `avg_reputation`. Then use that variable to filter the dataset and generate the results for the case in which `Reputation > avg_reputation`.*

In [31]:
# Your code here

avg_reputation = posts_from_users["Reputation"].mean()

print("")
print(f"\033[1;43m User's average reputation: {round(avg_reputation, 2)} ")
print("")


User's average reputation: 518.24



In [32]:
posts_from_users[posts_from_users.Reputation > avg_reputation].sort_values(
    by="Reputation"
)

Unnamed: 0,user_id,Reputation,Views,UpVotes,DownVotes,post_id,Score,ViewCount,CommentCount,Body
156,862,531,90,2,0,2092,9,28723,0,<p>The waiting times for poisson distribution ...
431,3301,533,44,12,0,13658,1,1745,1,"<p>New to R, and am trying to do text classifi..."
2363,25944,537,75,33,0,64031,1,85,6,<p>Has anyone any idea how one could distingui...
674,5561,539,58,39,0,22472,13,1329,0,"<p><em>(ignore the R code if needed, as my mai..."
1343,11634,545,45,100,2,29446,2,1402,3,<p>I am trying to make predictions using a ran...
...,...,...,...,...,...,...,...,...,...,...
133,686,44152,7357,2156,82,64409,5,162,5,"<p>In <a href=""http://stats.stackexchange.com/..."
134,686,44152,7357,2156,82,22406,8,1689,0,<p>I am looking for a program (in R or SAS or ...
132,686,44152,7357,2156,82,58073,6,151,5,"<p>In Andrew Gelman's book ""Red State, Blue St..."
131,686,44152,7357,2156,82,97693,1,16,0,<p>In a fairly complex survival analysis case ...


In [33]:
avg_df = posts_from_users[posts_from_users.Reputation > avg_reputation]
avg_comments = avg_df.CommentCount.mean()

print("")
print(
    f"\033[1;46m Average number of comments for users above the average reputation: {round(avg_comments, 2)} "
)
print("")


Average number of comments for users above the average reputation: 2.26



In [34]:
# Average comment count per user, keeping only users above the average reputation

above_avg_reputation = (
    posts_from_users[["user_id", "CommentCount", "Reputation"]]
    .groupby(by="user_id", as_index=False)
    .mean()
)
above_avg_reputation[above_avg_reputation["Reputation"] > avg_reputation]

Unnamed: 0,user_id,CommentCount,Reputation
0,5,1.666667,6792.0
1,8,1.000000,6764.0
3,22,3.000000,591.0
4,25,6.000000,4968.0
5,30,6.500000,2185.0
...,...,...,...
2300,36545,0.500000,600.0
2337,37188,13.000000,1161.0
2690,44269,5.000000,4767.0
2697,44451,0.000000,1460.0
